A knowledge graph is an extension of a graph data structure that stores data as interrelated (contextually linked) entities and supports the automated inference of new knowledge. Knowledge graphs can be domain-specific or general, and sourced from proprietary or public data.
Domain-specific knowledge graphs are often compiled with domain experts and the help of machine learning algorithms. For knowledge graphs to scale, data gathering, parsing, and entity resolution typically must be automatable. Knowledge graphs of useful scale are often primarily or entirely AI-created (as is the case with Diffbot’s Knowledge Graph, which is sourced from public web data).
What is a Graph?
To understand knowledge graphs, it’s helpful to understand the underlying data structure known as a “graph.” A graph organizes data around nodes and edges. You can think of nodes as individual entities: a person, an organization, an article, or pretty much any type of noun. Edges show how nodes are connected. For example, you could say “Ringo Starr <a node> was a member of <an edge> The Beatles <another node>.”
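The Ringo Starr example above can be sketched as a tiny graph in code. This is a minimal illustration of nodes and labeled edges, not any particular knowledge graph’s actual schema:

```python
# A minimal graph: nodes are entities, edges are labeled relationships.
# The names and relationship labels are illustrative only.
nodes = {"Ringo Starr", "The Beatles"}
edges = [("Ringo Starr", "member of", "The Beatles")]

def relationships(node):
    """Return all (relationship, target) pairs for a given subject node."""
    return [(rel, target) for subj, rel, target in edges if subj == node]

print(relationships("Ringo Starr"))  # [('member of', 'The Beatles')]
```

Each edge here is a (subject, relationship, object) triple, the same shape used by many real-world graph representations.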
One way to put graphs in context is to compare them with their cousin, the relational database. Relational databases are structured to preserve the coherency of individual entries, and each entry can be structured to include data about relationships.
In a graph, it’s the relationships themselves that are “first-class residents.” By maintaining a data structure based on the relationships between entities, computations over complex relationships can be performed much more efficiently. Additionally, graphs can more easily describe what is meaningful to humans about objects in the world: namely, how entities relate to one another, and the patterns in those interrelations. Graphs are also typically more flexible than relational databases, which allows them to grow as they encounter new forms of information.
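One way to see the “relationships as first-class residents” idea is an adjacency list: each node maps directly to its outgoing edges, so following a relationship is a dictionary lookup rather than a table join. A sketch, with illustrative entities:

```python
from collections import defaultdict

# Adjacency list: each node maps to its outgoing (relation, target) pairs,
# so traversing a relationship is a direct lookup, not a join.
graph = defaultdict(list)

def add_edge(subject, relation, obj):
    graph[subject].append((relation, obj))

add_edge("Ringo Starr", "member of", "The Beatles")
add_edge("The Beatles", "formed in", "Liverpool")

def two_hop(start):
    """All paths of length two from a start node: (rel1, mid, rel2, end)."""
    return [(r1, mid, r2, end)
            for r1, mid in graph[start]
            for r2, end in graph[mid]]

print(two_hop("Ringo Starr"))
```

A multi-hop query like this in a relational database would typically require one join per hop; in a graph store, each hop is just a pointer or key lookup.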
What is the Difference Between a Knowledge Graph and a Graph?
As with all data structures, graphs are inherently ‘dumb.’ In-and-of themselves, they’re simply a pattern for organizing data. Their usefulness is a function of the following:
- how (and how much) data is sourced to them
- the quality of the data
- the freshness of the data
- data origins and provenance (see data provenance)
- the ability to expand entity ontologies (see ontology)
- the ability to properly separate or merge entities
- the ability to infer new knowledge (see knowledge graph reasoner)
- visualization tools
- tools for search
- tools for data discovery
As opposed to a plain graph, a Knowledge Graph is the culmination of all of the above: robust data sourcing, entity resolution, validation, ontology creation and editing, and data exploration and discovery tools. Knowledge Graphs also facilitate the creation of new knowledge through automated reasoning or inference.
While knowledge graphs are often stored in an underlying graph structure, they are much more than that, just as a data pipeline is more than the data that flows through it.
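The inference point above can be made concrete with a toy forward-chaining rule that derives a fact never stored explicitly. The rule and facts here are hypothetical illustrations, not how any particular knowledge graph reasoner actually works:

```python
# Toy inference rule: if X is a member of B, and B plays genre G,
# then infer that X plays genre G. Hypothetical rule for illustration only.
facts = {
    ("Ringo Starr", "member of", "The Beatles"),
    ("The Beatles", "plays genre", "Rock"),
}

def infer(facts):
    """Apply the member-of/genre rule once, returning newly derived facts."""
    derived = set()
    for s, p, o in facts:
        if p == "member of":
            for s2, p2, o2 in facts:
                if s2 == o and p2 == "plays genre":
                    derived.add((s, "plays genre", o2))
    return derived

print(infer(facts))  # {('Ringo Starr', 'plays genre', 'Rock')}
```

Real reasoners apply many such rules repeatedly until no new facts emerge, but the core idea is the same: new knowledge falls out of existing relationships.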
History of Knowledge Graphs
From the beginning, knowledge graphs have been semantic. In the mid-to-late 1980s, the universities of Twente and Groningen began joint work on a project named “Knowledge Graphs,” which involved building a set of semantic networks in the form of a graph. Nodes were limited to a relatively small number of edge (relationship) types. But decades before the emergence of any form of semantic web, the groundwork was being laid for semantic querying.
Cyc, another long-term research project, began in 1984. Its effort to build a knowledge base for AI led to a collection of 21M fields. This was pre-internet, so data sources included unstructured data of many forms. The aim of this collection was to gather all the facts needed to infer “common sense knowledge” (see common sense knowledge graphs). It was thought this could help AI handle novel situations better and be less “brittle.”
In the ensuing years, particularly the late 1990s and early 2000s, the acceleration of content and time spent online shifted the paradigm for knowledge graph creation. Wikipedia was born and, by virtue of its crowdsourced nature, allowed for an exponentially lower cost per fact. In 2006, Metaweb began working on Freebase. As with many subsequent general knowledge graphs, Freebase was created in part by leveraging the structure and facts within Wikipedia to gain a head start. Building on wiki-style data, Freebase was able to amass over 1.9B fields, a massive improvement over Cyc. While Metaweb was bought and eventually shut down by Google, a great deal of its “shared database of the world’s knowledge” was transferred to Wikidata.
As with many computing disciplines, the conceptual groundwork for a solution had been laid long before processing power and the surrounding ecosystem enabled a leap forward. For knowledge graphs, that leap came through the application of contemporary natural language processing, computer vision, web data extraction, and storage. In short, AIs began to read and filter the web.
This is where Diffbot steps into the history of knowledge graphs. The foundation for the world’s largest knowledge graph came from a set of products, including our AI-enabled Automatic Extraction APIs, which can pull structured data from any page type, and our web crawler (Crawlbot).
In tandem, these tools enable the crawling and processing of virtually the entire public web. Combined with cutting-edge natural language processing, entity resolution, and the ability to gauge the validity of facts across an ever-growing collection of sites, Diffbot was able to announce the creation of the world’s largest Knowledge Graph in 2018.
Since 2018, Diffbot’s Knowledge Graph has amassed over 10 billion entities holding over 1 trillion facts. The average entity comprises over 20 facts sourced from public web data, and each fact is sourced from an average of six weighted locations around the web.
It should be noted that many enterprise knowledge graphs aren’t sourced from public web data, and many are composed of manually curated (not AI-enabled) facts. While important, these Knowledge Graphs are not for public use and pale in size next to the handful of public data-sourced, AI-enabled knowledge graphs.
What Type of Entities Are In a Knowledge Graph?
The entities within a knowledge graph depend on the input data and the domain of interest, though they typically tend to be concrete nouns (organizations, people, locations, and so forth).
Within the Diffbot Knowledge Graph we have a growing set of entities that include the following:
- Intangibles (Skills, Educational Majors, Roles, Employment)
- Videos
Each entity type contains facts that are pertinent to that type of entity in the real world. For example, an organization could have a funding round field; an article isn’t likely to need such a field. Conversely, an article may have a publication date, a field not necessary for an organization. The fields available to each entity type are determined by an ontology.
Knowledge Graph Ontology
Ontologies are sets of axioms that determine the properties a given entity type can contain. As mentioned in the last section, an article entity may not need a funding round field, and a physical location doesn’t need employees, but a corporation entity may include both.
Ontologies play an important role in growing knowledge graphs, as the facts pertinent to an entity type may change over time, or new data sources may surface. Within Diffbot’s Knowledge Graph, a set of non-standard fields about entities is also collected. We call these “non-canonical facts.” While not presently matching mainstream fact types in importance or accuracy, non-canonical facts can help a Knowledge Graph grow over time as they become more verified, data becomes more available, or they gain importance.
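One way to picture an ontology in code is a mapping from entity type to its permitted fields, consulted before a fact is admitted into the graph. The types and fields below are illustrative stand-ins, not Diffbot’s actual ontology:

```python
# Illustrative ontology: each entity type lists the fields it may contain.
# These type and field names are examples only.
ontology = {
    "Organization": {"name", "funding_round", "employees"},
    "Article": {"name", "publication_date", "author"},
}

def validate(entity_type, field):
    """Check whether a field is permitted for a given entity type."""
    return field in ontology.get(entity_type, set())

print(validate("Article", "publication_date"))  # True
print(validate("Article", "funding_round"))     # False
```

Growing the graph to cover a new kind of fact then amounts to extending the ontology, which is why ontology editing tools matter as much as the data itself.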
For a detailed look at Diffbot’s Knowledge Graph ontology, check out our docs.
See more in our guide what is a knowledge graph ontology.
Interested in trying out our Knowledge Graph? Check out our 14-day free trial (no credit card needed).
Natural Language Processing and Knowledge Graphs
Natural language processing (NLP) is at the heart of AI-derived Knowledge Graphs, as Knowledge Graphs of a certain scale are far beyond what a team of humans could compile. Diffbot’s NL API product lets you leverage some of the underlying technology behind Diffbot’s Knowledge Graph to create knowledge graphs from your own corpora.
It’s important to note that knowledge graphs come in all shapes and sizes and vary greatly in subject matter. Our NL API can help you establish nodes, edges, and facts about entities from a natural language corpus of your choice. For more info on creating your own knowledge graph, check out our NL API Demo.
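As a rough illustration of the kind of output an NLP extraction step produces (this is a naive pattern matcher, not Diffbot’s NL API, which uses far more sophisticated parsing), here is a sketch that pulls (subject, relationship, object) triples from a tiny corpus:

```python
import re

# Naive triple extraction: match "<subject> was a member of <object>." patterns.
# Real NLP extraction handles arbitrary phrasing; this only shows the output shape.
PATTERN = re.compile(r"(.+?) was a member of (.+?)\.")

def extract_triples(text):
    return [(subj.strip(), "member of", obj.strip())
            for subj, obj in PATTERN.findall(text)]

corpus = ("Ringo Starr was a member of The Beatles. "
          "John Lennon was a member of The Beatles.")
print(extract_triples(corpus))
```

Feeding such triples into a graph store is, in miniature, how a corpus becomes the nodes and edges of a knowledge graph.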