In 2013 we welcomed Matt Wells, founder of Gigablast (and henceforth known as our grand search poobah), aboard to head up our burgeoning crawl and search infrastructure. Since then we’ve released Crawlbot 2.0, our Bulk Service/Bulk API, and our Search API — and are hard at work on more exciting stuff.
Crawlbot 2.0 included a number of ways to control which parts of a site are spidered — both to improve crawl performance and to ensure that only the data you want is returned. Here’s a quick overview of the various ways to control Crawlbot.
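As a rough sketch of what those controls look like in practice, the snippet below assembles the query string for a restricted crawl job. The parameter names (urlCrawlPattern, maxToCrawl) and the v2 endpoint are assumptions modeled on Crawlbot documentation of that era — check the current API reference before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint for Crawlbot 2.0-era jobs; verify against the docs.
CRAWL_ENDPOINT = "http://api.diffbot.com/v2/crawl"

def build_crawl_request(token, name, seed, **controls):
    """Assemble the URL for creating a crawl limited by control params."""
    params = {"token": token, "name": name, "seeds": seed}
    params.update(controls)  # e.g. urlCrawlPattern, maxToCrawl (assumed names)
    return CRAWL_ENDPOINT + "?" + urlencode(params)

url = build_crawl_request(
    "YOUR_TOKEN",
    "shoes-crawl",
    "http://example.com",
    urlCrawlPattern="/products/",  # only spider matching product pages
    maxToCrawl=10000,              # cap total pages spidered
)
print(url)
```

The same pattern extends to any other control knob: it all reduces to extra key/value pairs on the job-creation request.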
A common use for Diffbot APIs: build an index of structured content for easy and precise searching. This post walks through the simplest way to do that using our Bulk Processing Service and Search API.
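Once a bulk job has populated an index, querying it comes down to filtering a JSON response. The helper below works against a canned response; the shape (an "objects" list with "title" and "type" fields) mirrors other Diffbot APIs but is an assumption here — consult the Search API docs for the actual fields.

```python
import json

# Canned stand-in for a Search API response (field names assumed).
sample_response = json.dumps({
    "objects": [
        {"title": "Widget teardown", "type": "article"},
        {"title": "Widget pricing", "type": "product"},
    ]
})

def titles_of_type(raw, wanted):
    """Pull the titles of one object type out of a search response."""
    return [o["title"] for o in json.loads(raw)["objects"]
            if o["type"] == wanted]

print(titles_of_type(sample_response, "article"))  # ['Widget teardown']
```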
One of the more common uses of Crawlbot and our article extraction API: monitoring news sites to identify the latest articles, and then extracting clean article text (and all other data) automatically. In this post we’ll discuss the most straightforward way to do that.
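The core of a monitoring setup like this is a dedupe step: on each crawl round, compare the extracted article URLs against ones you have already seen and act only on the new ones. A minimal sketch (the url/title field names are illustrative):

```python
def new_articles(extracted, seen):
    """Return articles not yet seen; update the seen set in place."""
    fresh = [a for a in extracted if a["url"] not in seen]
    seen.update(a["url"] for a in fresh)
    return fresh

seen = set()
round1 = [{"url": "http://news.example.com/a", "title": "A"}]
round2 = [{"url": "http://news.example.com/a", "title": "A"},
          {"url": "http://news.example.com/b", "title": "B"}]

first = new_articles(round1, seen)    # everything is new on round one
second = new_articles(round2, seen)   # only the unseen article remains
print([a["title"] for a in second])   # ['B']
```

In a real deployment the seen set would live in a database or on disk so it survives between crawl rounds.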
Miles Grimshaw of Thrive Capital recently used Crawlbot and our Product API to analyze product availability and extract pricing data from a number of online fashion marketplaces — to help determine the scale, margins, customer profile and trends of each site, and to inform their investment decision-making.
We’ve long offered HTML as a response element in our Article API (as an alternative to our plain-text text field). This is useful for maintaining inline images, text formatting, external links, etc.
Until recently, the HTML we returned was a direct copy of the underlying source, warts and all — which, if you work with web markup, you’ll know tilts heavily toward the “warts” side. Now, as many of our long-waiting customers have started to see, our html field returns normalized markup according to our new HTML Specification.
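A common consumption pattern is “prefer the normalized html, fall back to plain text.” The helper below shows that against a canned response; the html and text field names come from the post itself, while the overall response wrapper is an assumption.

```python
import json

# Canned stand-in for an Article API response.
sample = json.dumps({
    "title": "Example story",
    "text": "Plain-text body.",
    "html": "<p>Plain-text body.</p>",
})

def article_body(raw):
    """Return normalized HTML when present, else the plain text."""
    obj = json.loads(raw)
    return obj.get("html") or obj.get("text", "")

print(article_body(sample))  # <p>Plain-text body.</p>
```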
We just released Diffbot API clients in 36 different programming languages, ranging from general-purpose languages (Ruby/Python/Java), to systems languages (Go/C), to scripting languages (Bash), and even embedded (x86-64 anyone?). View them here: http://github.com/diffbot.
We added a couple of frequently requested features to Crawlbot this week: the ability to pass in Diffbot API parameters to tailor the output of your crawl extractions, and the option to download a comma-separated values (CSV) file of product crawl data.
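Once downloaded, the CSV slots straight into standard tooling. Here is a small sketch using Python's csv module on an in-memory sample; the column names (title, offerPrice) are illustrative, not the exact export schema.

```python
import csv
import io

# Stand-in for a downloaded product-crawl CSV (column names assumed).
sample_csv = "title,offerPrice\nBlue Sneaker,59.99\nRed Sneaker,74.50\n"

def cheapest(raw):
    """Return the lowest-priced product row from a crawl export."""
    rows = csv.DictReader(io.StringIO(raw))
    return min(rows, key=lambda r: float(r["offerPrice"]))

print(cheapest(sample_csv)["title"])  # Blue Sneaker
```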