Article API: Returning Clean and Consistent HTML

Posted on June 22, 2014

We’ve long offered HTML as a response element in our Article API (as an alternative to our plain-text text field). This is useful for maintaining inline images, text formatting, external links, etc.

Until recently, the HTML we returned was a direct copy of the underlying source, warts and all (and if you work with web markup, you’ll know it tilts heavily toward the “warts” side). Now, as many of our patiently waiting customers have started to see, our html field returns normalized markup according to our new HTML Specification.

Continue reading

Diffbot’s New Product API Teaches Robots to Shop Online

Posted on July 31, 2013

Diffbot’s human wranglers are proud today to announce the release of our newest product: an API for… products!

The Product API can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you’d expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.
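As a rough illustration of consuming those fields, here is a minimal sketch that maps a product response onto a typed record. The key names (`price`, `savings`, `shipping`, `description`, `sku`, `images`) and the `Product` and `parse_product` names are assumptions made for this example, not the documented response schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Product:
    """Hypothetical container for the product fields listed above."""
    price: Optional[str] = None
    savings: Optional[str] = None
    shipping: Optional[str] = None
    description: Optional[str] = None
    sku: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)


def parse_product(payload: Dict[str, Any]) -> Product:
    """Copy known keys out of a decoded JSON object.

    The key names are illustrative assumptions. Missing keys become
    None (or an empty list) rather than raising an error.
    """
    return Product(
        price=payload.get("price"),
        savings=payload.get("savings"),
        shipping=payload.get("shipping"),
        description=payload.get("description"),
        sku=payload.get("sku"),
        image_urls=list(payload.get("images", [])),
    )


if __name__ == "__main__":
    sample = {
        "price": "$19.99",
        "sku": "ABC-1",
        "images": ["http://example.com/a.jpg"],
    }
    print(parse_product(sample))
```

Defaulting absent fields to None keeps the caller from having to guess which data a given product page actually exposes.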

Continue reading

Diffbot APIs Are Getting Very META

Posted on July 14, 2013

We noticed recently that a common use for our Custom API Toolkit was augmenting Diffbot’s Automatic APIs with custom fields that return <META> tag data from a page’s markup: meta descriptions, OpenGraph and Twitter Card tags, Schema.org microdata, etc.

We figured we’d save you the trouble of hand-curating rules, so we added the <META> parameter across all of our APIs.

Continue reading
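To make the kinds of <META> tag data mentioned above concrete, here is a minimal sketch that collects `name`/`property` and `content` pairs from a page using Python's standard `html.parser` module. It is an illustration only, not how Diffbot extracts this data; `MetaTagCollector` and `extract_meta` are names invented for the example, and Schema.org microdata (which lives in `itemprop` attributes on other elements) is not covered.

```python
from html.parser import HTMLParser
from typing import Dict


class MetaTagCollector(HTMLParser):
    """Collects <meta> tags keyed by their name= or property= attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Standard tags use name=, OpenGraph tags use property=.
        key = attr_map.get("name") or attr_map.get("property")
        content = attr_map.get("content")
        if key and content is not None:
            self.meta[key] = content


def extract_meta(html: str) -> Dict[str, str]:
    """Return a dict of meta tag names to their content values."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.meta


if __name__ == "__main__":
    page = (
        '<head><meta name="description" content="A test page">'
        '<meta property="og:title" content="Hello" /></head>'
    )
    print(extract_meta(page))
```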

Announcing Crawlbot: Smart Site Spidering and Extraction

Posted on July 2, 2013

Today we’re happy to announce the public availability of Crawlbot, our computer-vision-powered site crawler and extractor.

If you want structured data from an entire site, Crawlbot will fully spider a domain and hand off the right pages to Diffbot APIs. The result? A queryable index of the entire site’s data, or a complete download of the site’s structured data in easy-to-read — for a robot — JSON.

Continue reading

Setting up a Machine Learning Farm in the Cloud with Spot Instances + Auto Scaling

Posted on June 25, 2013

Artist’s rendition of The Grid. May or may not be what Amazon’s servers actually look like.

Previously, I wrote about how Amazon EC2 Spot Instances + Auto Scaling are an ideal combo for machine learning loads.

In this post, I’ll provide code snippets needed to set up a workable autoscaling spot-bidding system, and point out the caveats along the way. I’ll show you how to set up an auto-scaling group with a simple CPU monitoring rule, create a spot-instance bidding policy, and attach that rule to the bidding policy.
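As a rough preview of those three steps, the sketch below uses the boto3 SDK, which is a substitution chosen for this example and not necessarily the tooling used in the full post. It builds the request parameters as plain dictionaries so they can be inspected without touching AWS. Every concrete value (group name, AMI ID, instance type, availability zone, bid price, thresholds) is a placeholder assumption.

```python
from typing import Any, Dict


def build_requests(group: str, ami_id: str, bid_usd: str,
                   instance_type: str = "c5.large") -> Dict[str, Dict[str, Any]]:
    """Build keyword arguments for the four AWS calls. Touches no network."""
    launch_config = f"{group}-spot-lc"
    return {
        # Spot bidding: SpotPrice is the maximum hourly bid, passed as a string.
        "launch_configuration": {
            "LaunchConfigurationName": launch_config,
            "ImageId": ami_id,
            "InstanceType": instance_type,
            "SpotPrice": bid_usd,
        },
        "auto_scaling_group": {
            "AutoScalingGroupName": group,
            "LaunchConfigurationName": launch_config,
            "MinSize": 1,
            "MaxSize": 10,
            "AvailabilityZones": ["us-east-1a"],  # placeholder zone
        },
        # Scaling rule: add one instance each time the alarm fires.
        "scaling_policy": {
            "AutoScalingGroupName": group,
            "PolicyName": f"{group}-scale-up",
            "AdjustmentType": "ChangeInCapacity",
            "ScalingAdjustment": 1,
            "Cooldown": 300,
        },
        # CPU monitoring: average CPU above 80% for two 5-minute periods.
        "cpu_alarm": {
            "AlarmName": f"{group}-cpu-high",
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group}],
            "Statistic": "Average",
            "Period": 300,
            "EvaluationPeriods": 2,
            "Threshold": 80.0,
            "ComparisonOperator": "GreaterThanThreshold",
        },
    }


def apply_requests(requests: Dict[str, Dict[str, Any]],
                   region: str = "us-east-1") -> str:
    """Send the requests to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so build_requests works without it

    autoscaling = boto3.client("autoscaling", region_name=region)
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    autoscaling.create_launch_configuration(**requests["launch_configuration"])
    autoscaling.create_auto_scaling_group(**requests["auto_scaling_group"])
    policy = autoscaling.put_scaling_policy(**requests["scaling_policy"])
    # Attach the CPU rule to the scaling policy via the alarm action.
    cloudwatch.put_metric_alarm(AlarmActions=[policy["PolicyARN"]],
                                **requests["cpu_alarm"])
    return policy["PolicyARN"]
```

Calling `build_requests("ml-farm", "ami-00000000", "0.10")` returns the four parameter sets for review; `apply_requests` is only needed once the values look right.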

But first, let’s talk about how to frame the machine learning problem as a distributed system.

Continue reading

Machine Learning in the Cloud

Posted on June 24, 2013

Installing 100G RAM in Diffbot Server

Machine Learning Loads are Different than Web Loads

One of the lessons I learned early is that scaling a machine learning system is a different undertaking from scaling a database or optimizing the experience of concurrent users, so most of the scalability advice on the web doesn’t apply. The scarce resources in machine learning systems aren’t the I/O devices but the compute devices: CPU and GPU.

Continue reading