One of the more common uses of Crawlbot and our article extraction API: monitoring news sites to identify the latest articles, and then extracting clean article text (and all other data) automatically. In this post we’ll discuss the most straightforward way to do that.
We’ve long offered HTML as a response element in our Article API (as an alternative to our plain-text text field). This is useful for maintaining inline images, text formatting, external links, etc. Until recently, the HTML we returned was a direct copy of the underlying source, warts and all — which, if you work with […]
One of our most common feature requests: can Diffbot APIs access content behind a login or firewall? Until recently, the answer was mostly “no.” But we’ve now added new features to all of our APIs, both Automatic and Custom, that should allow much broader access to non-publicly available content:
We just released Diffbot API clients in 36 different programming languages, ranging from general-purpose languages (Ruby/Python/Java) to systems languages (Go/C) to scripting languages (Bash), and even embedded (x86-64, anyone?). View them here: http://github.com/diffbot.
We added a couple of frequently requested features to Crawlbot this week: the ability to pass in Diffbot API parameters to tailor the output of your crawl extractions; and the option to download a comma-separated-values (CSV) file of product crawl data.
Diffbot’s human wranglers are proud today to announce the release of our newest product: an API for… products! The Product API can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you’d expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other […]
We noticed recently that a common use for our Custom API Toolkit was augmenting Diffbot’s Automatic APIs with custom fields that return <META> tag data from a page’s markup: meta descriptions, OpenGraph and Twitter Card tags, Schema.org microdata, etc. We figured we’d save you the trouble of hand-curating rules, so we added the <META> parameter across all of our […]
Today we’re happy to announce the public availability of Crawlbot, our computer-vision-powered site crawler and extractor. If you want structured data from an entire site, Crawlbot will fully spider a domain and hand off the right pages to Diffbot APIs. The result? A queryable index of the entire site’s data, or a complete download of the […]
Previously, I wrote about how Amazon EC2 Spot Instances + Auto Scaling are an ideal combo for machine learning loads. In this post, I’ll provide code snippets needed to set up a workable autoscaling spot-bidding system, and point out the caveats along the way. I’ll show you how to set up an auto-scaling group with […]