Converting text documents into knowledge graphs with the Diffbot Natural Language API

Most of the world’s knowledge is encoded in natural language (e.g., news articles, books, emails, academic papers). It is estimated that 80 percent of business-relevant information originates in unstructured form, primarily text. However, the ambiguous nature of human communication makes it difficult for software engineers and data scientists to leverage this information in their applications.

After years of research, we are proud to announce the Diffbot Natural Language API, a new product to help businesses convert their text documents into knowledge graphs. Knowledge graphs represent information about real-world entities (e.g., people, organizations, products, articles) via their relationships with other entities (e.g., founded by, educated at, was mentioned in). This is the same production-grade technology that we use to build the world’s largest knowledge graph from the web, and we are making it available to all.
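To make the idea of a knowledge graph concrete, here is a minimal sketch in Python that stores entities and relationships as subject-relation-object triples. It does not call the Natural Language API or mirror its output format. The `Triple` type, the helper functions, and the sample facts are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One edge in the graph: subject --relation--> obj (hypothetical type)."""
    subject: str
    relation: str
    obj: str


# Made-up facts for illustration only; these are NOT results from the API.
triples = [
    Triple("Acme Corp", "founded by", "Jane Doe"),
    Triple("Jane Doe", "educated at", "Example University"),
    Triple("Acme Corp", "was mentioned in", "Article 42"),
]


def build_graph(edges):
    """Index triples by subject so outgoing relationships are easy to look up."""
    graph = defaultdict(list)
    for edge in edges:
        graph[edge.subject].append((edge.relation, edge.obj))
    return dict(graph)


def neighbors(graph, entity, relation):
    """Return every entity linked to `entity` by `relation`."""
    return [obj for rel, obj in graph.get(entity, []) if rel == relation]


graph = build_graph(triples)
print(neighbors(graph, "Acme Corp", "founded by"))
```

Indexing by subject keeps lookups simple. A production graph would typically also index by object and attach types and provenance to each entity.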


Analyzing Consumer Marketplaces Using Crawlbot and the Product API

Miles Grimshaw of Thrive Capital recently used Crawlbot and our Product API to analyze product availability and extract pricing data from a number of online fashion marketplaces. The goal was to help determine the scale, margins, customer profile, and trends of each site, and to inform the firm’s investment decisions.

Miles writes about his experience and analysis on his blog. Nice Diffbotting, Miles!

Announcing Semantic Hack (June 1, 2013)

What could you build if the entire web were your database? Could you do it in a day?

We’re glad to be working with the fine folks at to host the inaugural Semantic Hack at the Semantic Technology & Business Conference in San Francisco on June 1, 2013.

See additional details and registration at, and more inside this post.


Diffbot Leads in Text Extraction Shootout

In a recent benchmark, Diffbot placed first overall among text extraction APIs on an academic evaluation set and one sampled from Google News.

Tomaz Kovacic, a university student in artificial intelligence, recently conducted a comprehensive benchmark of text extraction methods as part of his thesis. The study covers commercial vendors as well as open-source APIs for text extraction. He did an excellent job designing the study, measuring precision, recall, and F1 and carrying out a careful analysis of error cases.

Image credit: Tomaz Kovacic
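For readers unfamiliar with the metrics named above, the following Python sketch computes precision, recall, and F1 for one extracted text against a reference text. It assumes a simple bag-of-words comparison on whitespace-separated tokens. The study's actual matching procedure is not described here, so treat this as an illustration of the metrics and not a reproduction of the benchmark. The function name `precision_recall_f1` and the sample strings are hypothetical.

```python
from collections import Counter


def precision_recall_f1(extracted: str, gold: str):
    """Score extracted text against gold text using token multiset overlap.

    Assumption: tokens are lowercase whitespace-separated words. A real
    benchmark may tokenize and align text differently.
    """
    extracted_tokens = Counter(extracted.lower().split())
    gold_tokens = Counter(gold.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((extracted_tokens & gold_tokens).values())
    n_extracted = sum(extracted_tokens.values())
    n_gold = sum(gold_tokens.values())

    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the extractor kept two navigation words ("home menu")
# and missed the last word of the article body ("jumps").
gold_text = "the quick brown fox jumps"
extracted_text = "home menu the quick brown fox"

p, r, f1 = precision_recall_f1(extracted_text, gold_text)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Precision drops when an extractor keeps boilerplate such as navigation text, and recall drops when it cuts off part of the article. F1 is the harmonic mean of the two.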

The CleanEval dataset, developed at the Association for Computational Linguistics conference, is a widely used evaluation in academia, and the Google News article dataset was sampled from the 5000+ news sources that Google aggregates.

Diffbot’s method relies on training a core set of visual features (such as geometric, stylistic, and rendering properties) to recognize different types of documents. In this case, we trained Diffbot on a set of news article pages to recognize specific parts of those pages. In addition to article text, Diffbot’s article API returns the content author, date, location, article images, article videos, favicon, and even topics (support for English and other languages coming soon). Beyond article pages, Diffbot’s core features have also been trained to extract information from other types of pages, such as front pages.

This result is a promising sign that generalized vision-based machine learning techniques can perform as well as, if not better than, approaches engineered for specific tasks.

Learn more details about the study.