Articles by: John Davi

John runs everything product for Diffbot. Drop him a line at john at diffbot if you have questions.

New API Features: Authentication and Content POSTing

One of our most common feature requests: can Diffbot APIs access content behind a login or firewall? Until recently, the answer was mostly “no.”

But we’ve recently added new features to all of our APIs, both Automatic and Custom, that should allow much broader access to content that isn’t publicly available:

Continue reading

Diffbot’s New Product API Teaches Robots to Shop Online

Diffbot’s human wranglers are proud today to announce the release of our newest product: an API for… products!

The Product API can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you’d expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.

Continue reading

Announcing Crawlbot: Smart Site Spidering and Extraction

Today we’re happy to announce the public availability of Crawlbot, our computer-vision-powered site crawler and extractor.

If you want structured data from an entire site, Crawlbot will fully spider a domain and hand off the right pages to Diffbot APIs. The result? A queryable index of the entire site’s data, or a complete download of the site’s structured data in easy-to-read — for a robot — JSON.

Continue reading

New Feature: Correct and *Concatenate* Multi-Page Articles

Our Article API automatically joins multiple-page articles into a single “text” or “html” field.

On some sites, though, our algorithm is unable to concatenate pages, typically because of non-standard pagination design. Furthermore, any site with an overridden “text” field (via a Custom API rule) will no longer automatically concatenate multiple pages.


We’re happy to introduce an oft-requested fix for this. From now on, if you create a ‘nextPage’ rule in our Custom API Toolkit (developer login required), we will automatically follow the specified link (and any subsequent links, up to ten pages) and concatenate everything into a single result. Moreover, you’ll only be charged for a single API call.
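To make the follow-and-concatenate behavior concrete, here is a minimal Python sketch of the idea. It is an illustration only, not Diffbot's implementation: the `concatenate_pages` function, the `fetch_page` callback, and the fake site data are all hypothetical names invented for this example, and it assumes the ten-page cap counts the first page.

```python
from typing import Callable, Optional, Tuple

# Hypothetical page shape: (text of this page, URL of the next page or None).
Page = Tuple[str, Optional[str]]

MAX_PAGES = 10  # assumption: the ten-page cap includes the first page


def concatenate_pages(fetch_page: Callable[[str], Page], start_url: str,
                      max_pages: int = MAX_PAGES) -> str:
    """Follow next-page links from start_url and join the page texts."""
    parts = []
    seen = set()  # guards against pagination loops
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(parts) < max_pages:
        seen.add(url)
        text, url = fetch_page(url)  # url now points at the next page, if any
        parts.append(text)
    return "\n".join(parts)


# Stand-in for real fetching: a three-page article held in memory.
fake_site = {
    "/story?page=1": ("Part one.", "/story?page=2"),
    "/story?page=2": ("Part two.", "/story?page=3"),
    "/story?page=3": ("Part three.", None),
}

print(concatenate_pages(fake_site.__getitem__, "/story?page=1"))
```

In this sketch the caller gets one combined string however many pages were followed, which mirrors the single-result, single-call behavior described above.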

For more information check out our overview in Diffbot Support, or have a go in our Custom API Toolkit.

Diffbot’s HackerNews Trend Analyzer

Like any good developer service, we’re fans of Hacker News. Making the vaunted Frontpage is a, well, vaunt-worthy accomplishment (we’ve been there once), so we thought we’d use our APIs to analyze and identify any trends in what content makes the Frontpage.

The result is Diffbot’s HackerNews Trend Analyzer. Feel free to click that link and play around, or read more here for details on how we did it.

Continue reading

New Feature: Custom Timeouts

The slowest part of any Diffbot API request is the call-response to third-party content. Depending on the third-party server’s responsiveness and location, it could be anywhere from a third of a second to tens of seconds before we receive content to process. (Diffbot internal rendering and processing, by comparison, averages just over 100 milliseconds.)

Continue reading