Diffbot’s Approach to Knowledge Graph

Google introduced to the general public the term Knowledge Graph (“Things not Strings”) when they added the information boxes that you see to the right-hand side of many searches. However, the benefits of storing information indexed around the entity and its properties and relationships are well-known to computer scientists and have been one of the central approaches to designing information systems.

When computer scientist Tim-Berners Lee originally designed the Web, he proposed a system that modeled information as uniquely identified entities (the URI) and their relationships. He described it this way in his 1999 book Weaving the Web:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.

You can trace this way of modeling data even further back to the era of symbolic artificial intelligence (Good old fashioned AI”) and the Relational Model of data first described by Edgar Codd in 1970, the theory that forms the basis of relational database systems, the workhorse of information storage in the enterprise.

From “A Relational Model of Data for Large Shared Data Banks”, E.F. Codd, 1970

What is striking is that these ideas of representing information as a set of entities and their relations are not new, but are so very old. It seems as if there is something very natural and human about representing the world in this way. So, the problem we are working on at Diffbot isn’t a new or hypothetical problem that we defined, but rather one of the age-old problems of computer science, and one that is found within every organization that tries to represent the information of the organization in a way that is useful and scalable. Rather, the work we are doing at Diffbot is in creating a better solution to this age-old problem, in the context of this new world that has increasingly large amounts of complex and heterogeneous data.

The well-known general knowledge graphs (i.e. those that are not verticalized knowledge graphs), can be grouped into certain categories: the search engine company maintained KGs: Google, Bing, and Yahoo knowledge graph, community-maintained knowledge graphs: like Wikidata, and academic knowledge graphs, like Wordnet and ConceptNet.

The Diffbot Knowledge Graph approach differs in three main ways: it is an automatically constructed knowledge graph (not based on human labor), it is sourced from crawling the entire public web and all its languages, and it is available for use.

The first point is that all other knowledge graphs involve a heavy amount of human curation – involving direct data entry of the facts about each entity, selecting what entities to include, and the categorization of those entities. At Google, the Knowledge Graph is actually a data format for structured data that is standardized across various product teams (shopping, movies, recipes, events, sports) and hundreds of employees and even more contractors both enter and curate the categories of this data, combining these separate product domains together into a seamless experience. The Yahoo and Bing knowledge graphs operate in the similar way.

A large portion of the information these consumer search knowledge graphs contain is imported directly from Wikipedia, another crowd-sourced community of humans that both enter and curate the categories of knowledge. Wikipedia’s sister project, Wikidata, has humans directly crowd-editing a knowledge graph. (You could argue that the entire web is also a community of humans editing knowledge. However–the entire web doesn’t operate as a singular community, with shared standards, and a common namespace for entities and their concepts–otherwise, we’d have the Semantic Web today).

Academic knowledge graphs such as ConceptNet, WordNet, and earlier, CyC, are also manually constructed by crowd-sourced humans, although to a larger degree informed by linguistics, and often by people employed under the same organization, rather than volunteers on the Internet.

Diffbot’s approach to acquiring knowledge is different. Diffbot’s knowledge graph is built by a fully autonomous system. We create machine learning algorithms that can classify each page on the web as an entity and then extract the facts about that entity from each of those pages, then use machine learning to link and fuse the facts from various pages to form a coherent knowledge graph. We build a new knowledge graph from this fully automatic pipeline every 4-5 days without human supervision.

The second differentiator is that Diffbot’s knowledge graph is sourced from crawling the entire web. Other knowledge graphs may have humans citing pages on the web, but the set of cited pages is a drop in the ocean compared to all pages on the web. Even the Google’s regular search engine is not an index of the whole web–rather it is a separate index for each language that appears on the web . If you speak an uncommon language, you are not searching a very big fraction of the web. However, when we analyze each page on the web, our multi-lingual NLP is able to classify and extract the page, building a unified Knowledge Graph for the whole web across all the languages. The other two companies besides Diffbot that crawl the whole web (Google and Bing in the US) index all of the text on the page for their search rankings but do not extract entities and relationships from every page. The consequence of our approach is that our knowledge graph is much larger and it autonomously grows by 100M new entities each month and the rate is accelerating as new pages are added to the web and we expand the hardware in our datacenter.

The combination of automatically extracted and web-scale crawling means that our knowledge graph is much more comprehensive than other knowledge graphs. While you may notice in google search a knowledge graph panel will activate when you search for Taylor Swift, Donald Trump, or Tiger Woods (entities that have a Wikipedia page), a panel is likely not going to appear if you try searches for your co-workers, colleagues, customers, suppliers, family members, and friends. The former category are the popular celebrities that have the most optimized queries on a consumer search engine and the latter category are actually the entities that surround you on a day-to-day basis. We would argue that having a knowledge graph that has coverage of those real-life entities–the latter category–makes it much more useful to building applications that get real work done. After all, you’re not trying to sell your product to Taylor Swift, recruit Donald Trump, or book a meeting with Tiger Woods–those just aren’t entities that most people encounter and interact with on a daily basis.

Lastly, access. The major search engines do not give any meaningful access to their knowledge graphs, much to the frustration of academic researchers trying to improve information retrieval and AI systems. This is because the major search engines see their knowledge graphs as competitive features that aid the experiences of their ad-supported consumer products, and do not want others to use the data to build competitive systems that might threaten their business. In fact, Google ironically restricts crawling of themselves, and the trend over time has been to remove functionality from their APIs. Academics have created their own knowledge graphs for research use, but they are toy KGs that are 10-100MBs in size and released only a few times per year. They make it possible to do some limited research, but are too small and out-of-date to support most real-world applications.

In contrast, the Diffbot knowledge graph is available and open for business. Our business model is providing Knowledge-as-a-Service, and so we are fully aligned with our customers’ success. Our customers fund the development of improvements to the quality of our knowledge graph and that quality improves the efficiency of their knowledge workflows. We also provide free access to our KG to the academic research community, clearing away one of the main bottlenecks to academic research progress in this area. Researchers and PhD students should not feel compelled to join an industrial AI lab to access their data and hardware resources, in order to make progress in the field of knowledge graphs and automatic information extraction. They should be able to fruitfully research these topics in their academic institutions. We benefit the most from any advancements to to the field, since we are running the largest implementation of automatic information extraction at web-scale.

We argue that a fully autonomous knowledge graph is the only way to build intelligent systems that successfully handle the world we live in: one that is large, complex, and changing.

Up Next