Articles by: John Davi

John runs everything product for Diffbot. Drop him a line at john at diffbot if you have questions.

Video: Crawling Basics and Advanced Techniques for Web Site Data Extraction

Just for the visual and auditory learners (and those of you who prefer your web crawling with the dulcet tones of yours truly), here are a couple of Crawlbot tutorials to help you get up and running:

Crawlbot Basics

A quick overview of Crawlbot using the Analyze API to automatically identify and extract products from an e-commerce site.

Advanced Usage

This tutorial discusses some of the methods for narrowing your crawl within a site, and setting up a repeat or recurring crawl.

Various Ways to Control Your Crawlbot Crawls for Web Data

In 2013 we welcomed Matt Wells, founder of Gigablast (and henceforth known as our grand search poobah) aboard to head up our burgeoning crawl and search infrastructure. Since then we’ve released Crawlbot 2.0, our Bulk Service/Bulk API, and our Search API — and are hard at work on more exciting stuff.

Crawlbot 2.0 included a number of ways to control which parts of a site are spidered, both to improve performance and, in some cases, to make sure only specific data is returned. Here’s a quick overview of the various ways to control Crawlbot.

Article API: Returning Clean and Consistent HTML

We’ve long offered HTML as a response element in our Article API (as an alternative to our plain-text text field). This is useful for maintaining inline images, text formatting, external links, etc.

Until recently, the HTML we returned was a direct copy of the underlying source, warts and all (and if you work with web markup, you’ll know it tilts heavily toward the “warts” side). Now, as many of our long-waiting customers have started to see, our html field returns normalized markup according to our new HTML Specification.
