Diffbot’s New Product API Teaches Robots to Shop Online

Diffbot’s human wranglers are proud today to announce the release of our newest product: an API for… products! The Product API can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you’d expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other […]


Diffbot APIs Are Getting Very META

We noticed recently that a common use for our Custom API Toolkit was augmenting Diffbot’s Automatic APIs with custom fields that return <META> tag data from page markup: meta descriptions, OpenGraph and Twitter Card tags, Schema.org microdata, etc. We figured we’d save you the trouble of hand-curating rules, so we added the <META> parameter across all of our […]


Machine Learning in the Cloud

Machine Learning Loads are Different than Web Loads

One of the lessons I learned early is that scaling a machine learning system is a different undertaking than scaling a database or optimizing the experiences of concurrent users. Thus most of the scalability advice on the web doesn’t apply. This is because the scarce resources in machine […]


New Feature: Correct and *Concatenate* Multi-Page Articles

Our Article API automatically joins multiple-page articles into a single “text” or “html” field. On some sites, though, our algorithm is unable to concatenate the pages for various reasons (typically a non-standard pagination design). Furthermore, any site with an overridden “text” field (via a Custom API rule) will no longer automatically concatenate multiple pages. We’re happy to […]


World of Web Data

The Semantic Web is a dream that many are attempting to make into reality through the use of machine-readable metadata. Web developers worldwide would use this metadata to make the search for content easy for users who wish to extract web data. In a perfect world this would have already happened, but alas, developers today […]
