We added a couple of frequently requested Crawlbot features this week: webhook notifications and much smarter content de-duplication.
We added a couple of frequently requested features to Crawlbot this week: the ability to pass Diffbot API parameters to tailor the output of your crawl extractions, and the option to download a comma-separated values (CSV) file of product crawl data.
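As a rough sketch, requesting a crawl's data as CSV might look like the snippet below. The endpoint path and parameter names here are assumptions for illustration, so check the Crawlbot documentation for the exact download URL for your account:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real path may differ by API version.
BASE = "http://api.diffbot.com/v3/crawl/data"

def crawl_download_url(token, crawl_name, fmt="csv"):
    """Build a download URL for a crawl's extracted data."""
    query = urlencode({"token": token, "name": crawl_name, "format": fmt})
    return f"{BASE}?{query}"

print(crawl_download_url("YOUR_TOKEN", "product-crawl"))
```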
Diffbot’s human wranglers are proud today to announce the release of our newest product: an API for… products!
The Product API extracts clean, structured data from any e-commerce product page. It automatically returns all the product data you’d expect: price, discount/savings amount, shipping cost, product description, any relevant product images, and SKU and/or other product IDs.
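To give a feel for working with that output, here's a minimal sketch that pulls the fields listed above out of a response-shaped object. The field names (`offerPrice`, `saveAmount`, `media`, and so on) are assumptions for illustration; consult the Product API docs for the real schema:

```python
import json

# Illustrative response shaped like the fields described above; the
# exact field names in a real Product API response may differ.
sample = json.loads("""
{
  "type": "product",
  "title": "Widget Deluxe",
  "offerPrice": "$24.99",
  "regularPrice": "$34.99",
  "saveAmount": "$10.00",
  "shippingAmount": "$4.99",
  "description": "A deluxe widget.",
  "sku": "WD-1000",
  "media": [{"type": "image", "link": "http://example.com/widget.jpg"}]
}
""")

def summarize(product):
    """Pick out the commonly used product fields."""
    images = [m["link"] for m in product.get("media", []) if m.get("type") == "image"]
    return {
        "title": product.get("title"),
        "price": product.get("offerPrice"),
        "savings": product.get("saveAmount"),
        "sku": product.get("sku"),
        "images": images,
    }

print(summarize(sample))
```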
We noticed recently that a common use for our Custom API Toolkit was augmenting Diffbot’s Automatic APIs with custom fields to return markup <META> tag data: meta descriptions, OpenGraph and Twitter Card tags, Schema.org microdata, etc.
We figured we’d save you the trouble of hand-curating rules, so we added the <META> parameter across all of our APIs.
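A request that asks for meta-tag data might be built like this. The `fields=meta` parameter name is an assumption here, so double-check the exact spelling in the API docs:

```python
from urllib.parse import urlencode

# Hypothetical parameter name ("fields=meta"); consult the docs for the
# exact field syntax on your API version.
def article_url(token, page_url, fields=("meta",)):
    """Build an Article API request URL asking for extra fields."""
    query = urlencode({
        "token": token,
        "url": page_url,
        "fields": ",".join(fields),
    })
    return f"http://api.diffbot.com/v3/article?{query}"

print(article_url("YOUR_TOKEN", "http://example.com/post"))
```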
Today we’re happy to announce the public availability of Crawlbot, our computer-vision-powered site crawler and extractor.
If you want structured data from an entire site, Crawlbot will fully spider a domain and hand off the right pages to Diffbot APIs. The result? A queryable index of the entire site’s data, or a complete download of the site’s structured data in easy-to-read (for a robot, anyway) JSON.
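Once you have that JSON in hand, a tiny bit of code makes the site's data queryable by extracted type. The record layout below is an assumption for illustration, not the exact Crawlbot download format:

```python
from collections import defaultdict

# Illustrative records: one extracted object per processed page
# (field names here are assumptions, not the exact download schema).
records = [
    {"type": "article", "title": "Post one", "url": "http://example.com/1"},
    {"type": "article", "title": "Post two", "url": "http://example.com/2"},
    {"type": "product", "title": "Widget", "url": "http://example.com/w"},
]

def index_by_type(objects):
    """Group extracted objects so the whole site's data is queryable by type."""
    index = defaultdict(list)
    for obj in objects:
        index[obj.get("type", "unknown")].append(obj)
    return index

idx = index_by_type(records)
print({t: len(objs) for t, objs in idx.items()})
```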
Previously, I wrote about how Amazon EC2 Spot Instances + Auto Scaling are an ideal combo for machine learning loads.
In this post, I’ll provide code snippets needed to set up a workable autoscaling spot-bidding system, and point out the caveats along the way. I’ll show you how to set up an auto-scaling group with a simple CPU monitoring rule, create a spot-instance bidding policy, and attach that rule to the bidding policy.
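Before diving in, here's a condensed sketch of those three pieces as boto3-style parameter sets: a launch configuration that bids for spot capacity, an auto-scaling group, and a CPU alarm wired to a scaling policy. The AMI ID, names, bid price, and thresholds are all placeholders you'd replace with your own values:

```python
# Pass each dict to the matching boto3 call, e.g.:
#   import boto3
#   asg = boto3.client("autoscaling")
#   asg.create_launch_configuration(**launch_config)
#   asg.create_auto_scaling_group(**group)
#   cloudwatch = boto3.client("cloudwatch")

# 1. Launch configuration with a spot bid (SpotPrice is the max bid).
launch_config = {
    "LaunchConfigurationName": "ml-spot-workers",
    "ImageId": "ami-12345678",          # placeholder worker AMI
    "InstanceType": "c4.2xlarge",
    "SpotPrice": "0.15",                # max bid, USD per instance-hour
}

# 2. Auto-scaling group that launches from that configuration.
group = {
    "AutoScalingGroupName": "ml-spot-group",
    "LaunchConfigurationName": "ml-spot-workers",
    "MinSize": 0,
    "MaxSize": 20,
    "AvailabilityZones": ["us-east-1a", "us-east-1b"],
}

# 3a. Scaling policy: asg.put_scaling_policy(**scale_out) returns a PolicyARN.
scale_out = {
    "AutoScalingGroupName": "ml-spot-group",
    "PolicyName": "scale-out-on-cpu",
    "AdjustmentType": "ChangeInCapacity",
    "ScalingAdjustment": 2,
}

# 3b. CPU monitoring rule: cloudwatch.put_metric_alarm(**cpu_alarm) attaches
# the rule to the policy via AlarmActions (fill in the returned PolicyARN).
cpu_alarm = {
    "AlarmName": "ml-cpu-high",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,
    "EvaluationPeriods": 2,
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["<PolicyARN from put_scaling_policy>"],
}
```

The details and caveats behind each of these settings follow below.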
But first, let’s talk about how to frame the machine learning problem as a distributed system.
Machine Learning Loads Are Different from Web Loads
One of the lessons I learned early is that scaling a machine learning system is a different undertaking from scaling a database or optimizing the experience of concurrent users, so most of the scalability advice on the web doesn’t apply. This is because the scarce resources in machine learning systems aren’t the I/O devices but the compute devices: the CPU and GPU.
Our Article API automatically joins multiple-page articles into a single “text” or “html” field.
On some sites, though, our algorithm is unable to concatenate pages for various reasons (typically a non-standard pagination design). Furthermore, any site with an overridden “text” field (via a Custom API rule) will no longer automatically concatenate multiple pages.
We’re happy to introduce an oft-requested fix for this. From now on, if you create a ‘nextPage’ rule in our Custom API Toolkit (developer login required), we will automatically follow the specified link, and any subsequent links up to ten pages, and concatenate everything into a single result. Moreover, you’ll be charged for only a single API call.
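The follow-and-concatenate behavior can be simulated locally like this. The page store below is a stand-in for fetching each page and applying a hypothetical nextPage selector, not Diffbot's actual implementation:

```python
# Stand-in for fetched pages: each maps to its text plus the link a
# hypothetical 'nextPage' rule would select (None = last page).
pages = {
    "/story?p=1": {"text": "Part one.", "next": "/story?p=2"},
    "/story?p=2": {"text": "Part two.", "next": "/story?p=3"},
    "/story?p=3": {"text": "Part three.", "next": None},
}

MAX_PAGES = 10  # Diffbot follows up to ten pages per article

def concatenate(start, store, limit=MAX_PAGES):
    """Follow next-page links from `start`, joining text up to `limit` pages."""
    parts, url = [], start
    while url is not None and len(parts) < limit:
        page = store[url]
        parts.append(page["text"])
        url = page["next"]
    return " ".join(parts)

print(concatenate("/story?p=1", pages))
# -> Part one. Part two. Part three.
```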
Like any good developer service, we’re fans of Hacker News. Making the vaunted Frontpage is a, well, vaunt-worthy accomplishment (we’ve been there once), so we thought we’d use our APIs to analyze and identify any trends in what content makes the Frontpage.
The result is Diffbot’s HackerNews Trend Analyzer. Feel free to click that link and play around, or read more here for details on how we did it.
See additional details and registration at http://semantichack.eventbrite.com, and more inside this post.