Articles by: Mike Tung

CEO at Diffbot

Generating Company Recommendations using Large Language Models and Knowledge Graphs

Recommendation systems have become an essential component of modern technology, used in various applications such as e-commerce, search engines, and chat assistants. In this blog post, we’ll explore how to generate high-quality company recommendations using large language models like OpenAI’s GPT-4 and knowledge graphs. We’ll dive deep into the technical details, discussing challenges and potential solutions, and provide Python code snippets to help you get started with your own implementation.

Large Language Models for Company Recommendations

Large language models such as GPT-4 have shown remarkable capabilities in understanding and generating human-like text. By leveraging their contextual understanding and knowledge of various domains, we can use them to generate company recommendations based on user input.

To start, let’s create a function that takes a user’s query and returns a list of recommended companies using the GPT-4 model. A typical call to the GPT-4 API might look like this.

import openai

# Assumes OPENAI_API_KEY is defined elsewhere (e.g. loaded from the environment)
openai.api_key = OPENAI_API_KEY

def generate_recommendations(prompt, model="gpt-4", max_tokens=100, n=1):
  completion = openai.ChatCompletion.create(
      model=model,
      max_tokens=max_tokens,
      n=n,
      messages=[{'role': 'user', 'content': prompt}])
  # The model returns one recommendation per line
  return completion.choices[0].message.content.split('\n')

LangChain has become a popular tool for building LLM applications by chaining together various calls to LLM APIs and providing prompt management. They recently raised a $10M seed round from Benchmark and are purportedly currently raising $20M from Sequoia.

Let’s try to find companies similar to LangChain by using the prompt “Recommend me five companies similar to LangChain”:

None of these recommendations are similar to LangChain

However, you will immediately run into one of the main problems of using LLMs for recommendations: they are limited to the knowledge that was available at the time the LLM was trained (September 2021, according to GPT-4 itself). Since the LangChain project was only started in late 2022, GPT-4 isn’t aware of it and produces recommendations based purely on the name “LangChain”, returning companies that deal with either translation services or language learning.

Freshness of information isn’t the only problem. Even if text describing the company was available at the time of training, if the company is a long-tail company (which most of the day-to-day entities you deal with likely are), a good representation of the company may not have been retained in the weights of the LLM. This is because of the nature of LLMs to “compress” the information in the original large corpus into a relatively smaller model, leading to invalid “hallucinated” answers in the generation when it is asked a rare question.

Knowledge Graphs for Enhanced Recommendations

Knowledge graphs are structured databases that store entities, their attributes, and the relationships between them. By using knowledge graphs, we can retrieve up-to-date, relevant information to enhance our recommendations. Unlike an LLM, which is a lossy, compressed representation of information stored in in-memory weights, a Knowledge Graph, like most databases, is an exact representation of information, retrieved from much larger disk-based storage.

There are a variety of languages used to query Knowledge Graphs, e.g. GraphQL, SPARQL, and SQL, but what they have in common is that they involve specifying your query precisely, with typed statements and predicates from a defined schema vocabulary. For example, to find companies “like LangChain”, we have to ask ourselves what “like LangChain” means. One reasonable interpretation is that we are looking for NLP startups that have also recently raised a Series A. This is what that question looks like, expressed in the Diffbot Query Language:

type:Company industries:"natural language processing" investments.series:"Series A"

This is what the results look like against the Diffbot Knowledge Graph.

Diffbot Query Language Search

Structured knowledge bases give you back exactly what you asked for–i.e. there is a deterministic execution against a precise intent and you know exactly why each result is there. There is no room for hallucinations and no information is corrupted in the retrieval process.
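To make this concrete, here is a minimal sketch of issuing the DQL query above over HTTP. The endpoint path and parameter names are assumptions modeled on Diffbot’s KG API and should be verified against the current documentation; the query string itself is just the DQL from above.

```python
import urllib.parse

# Build the DQL string for "NLP companies that raised a Series A"
def build_dql(entity_type, industry, series):
    return (f'type:{entity_type} industries:"{industry}" '
            f'investments.series:"{series}"')

query = build_dql("Company", "natural language processing", "Series A")

# The query travels as a URL parameter to the KG search endpoint
# (endpoint path and parameter names are assumptions, not verified)
params = urllib.parse.urlencode({
    "token": "YOUR_DIFFBOT_TOKEN",  # placeholder credential
    "type": "query",
    "query": query,
})
request_url = "https://kg.diffbot.com/kg/v3/dql?" + params
```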

However, the obvious downside is that the user had to do work to formulate a precise question. They also need to know the schema of the knowledge graph and what entity types and properties are available in order to “translate” their question into the structured query language.

Combining Large Language Models and Knowledge Graphs

What we’d really like is a system that’s like a smart human assistant that has access to a high-quality database of information. You could express what you want in your own language, without prior knowledge of the database schema or needing technical skill in querying.

Here’s what a recommendation function that combines an LLM with a KG might look like:

import re
import requests

def kg_enhance(company, url):
  # Look up a company in the Diffbot Knowledge Graph via the Enhance API
  res = requests.get('https://kg.diffbot.com/kg/v3/enhance', params={
      'token': DIFFBOT_TOKEN,
      'type': 'Organization',
      'name': company,
      'url': url
  })
  data = res.json().get('data', [])
  return data[0]['entity'] if data else None

def kg_enhanced_recommendations(company, url):
  prompt = f"Recommend me five companies that are similar to {company}. For each answer, include the name, (url), and up to 5 word summary."
  # Look up the input company in the knowledge graph
  entity = kg_enhance(company, url)
  if entity:
    prompt = f"About {company}:\n {entity['description']}\n\n" + prompt
  # Ask the LLM using the enriched prompt
  answers = generate_recommendations(prompt)
  # Validate answers against the KG: parse "1. Name (url) summary" lines
  pattern = r'(\d+)?\.?\s?([\w\s]+)\s\(([^)]+)\)'
  return [line for line in answers
          if (matches := re.match(pattern, line))
          and kg_enhance(matches.group(2).strip(), matches.group(3))]

In this code snippet, we first take our input company (LangChain) and look it up in the Knowledge Graph using the Enhance API to get additional information about it. In Diffbot’s Knowledge Graph, we can see that LangChain is described as a “large language model application development library”, is based in Singapore, and recently raised $10M. We can use this information to augment the prompt, so that the language model knows what the company actually does and can recommend other companies that do similar things. Here is what the retrieval-augmented prompt and GPT-4’s output look like.

The recommendations are now all NLP API companies

Much better! The recommendations are no longer language learning services, but actual startup companies that provide NLP developer APIs you can use in your own applications. This solves the problem of missing information at training time.

To solve the problem of hallucination, we can turn to the Knowledge Graph again and look up each recommended company by name and URL to find its entity in the Diffbot Knowledge Graph. This code just checks that these lookups return non-null company entities, but you could also add any manner of validation on the returned entities (e.g. industries="natural language processing", or nbEmployees<100 to enforce company size).
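Beyond the non-null check, a stricter validation pass might look like the sketch below. The field names `industries` and `nbEmployees` are drawn from Diffbot’s Organization schema, but treat them (and the thresholds) as illustrative assumptions.

```python
# A sketch of stricter validation on KG entities returned by a lookup.
# Field names ('industries', 'nbEmployees') are assumptions for illustration.
def is_valid_recommendation(entity, industry="natural language processing",
                            max_employees=100):
    if entity is None:
        return False  # the company couldn't be grounded in the KG at all
    industries = [i.lower() for i in entity.get('industries', [])]
    if industry not in industries:
        return False  # wrong industry
    nb = entity.get('nbEmployees')
    # Reject companies that are too large, or that are missing size data
    return nb is not None and nb < max_employees
```

This filter could replace the bare `kg_enhance(...)` truthiness check in the list comprehension, so each recommendation must match the intended industry and size profile.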

In this blog post, we demonstrated how to generate company recommendations using large language models like GPT-4 and large knowledge graphs such as the Diffbot Knowledge Graph. By combining these two powerful technologies, we can create a more accurate and reliable recommendation system than using either technology alone. While this example focused on company recommendations, the same approach can be applied to other domains and use cases, such as movie or book recommendations.

Grounded Natural Language Generation with Knowledge Graphs

Automation Bias is a well-studied phenomenon in social psychology that says humans have a tendency to overly trust the decisions of automated systems. This effect was on full display during last month’s Bing Chat launch keynote and Google Bard launch. As many sharp-eyed bloggers pointed out, both demos were riddled with examples of the LLM producing inaccurate and “hallucinated” facts. Everyone, including Bing in their keynote, recognizes that AI makes mistakes, but what was surprising here is that AI was so good at getting past all of the human reviewers–all of the engineers, product managers, marketers, PR folks, and execs that must have reviewed these examples for a high profile launch event that Satya Nadella described as the “third wave of the web”.

Factual accuracy is not part of the objective function that Large Language Models are designed and optimized on. Rather, they are trained on their ability to “blur” the web into a more abstract representation, which forms the basis of their remarkable skill in all manner of creative and artistic generation. That an LLM is able to generate any true statements at all is merely the coincidence of that statement appearing enough times in the training set that it becomes retained, not by explicit design.

It just so happens that there is a data structure that has been specifically designed to store facts at web-scale losslessly and with provenance: Knowledge Graphs. By guiding the generation of language with the “rails” of a Knowledge Graph, we can create hybrid systems that are provably accurate with trusted data provenance, while leveraging the LLM’s strong abilities in transforming text.

To see this, let’s go through some of the examples demoed in the Bing Chat keynote and see how they could have been improved with access to a Knowledge Graph.

Let’s Shop for TVs using Bing

In this example, Yusuf Mehdi, Chief Consumer Marketing Officer for Microsoft, is demoing the new Bing’s ability to refine search results interactively through dialogue. He asks Bing Chat the query “top selling 65-inch TVs”.

Looks like a pretty decent response. While it’s difficult to fact-check Bing’s claim that these are the “top selling 65-inch TVs in 2023” without access to private sales data, these ten results are at least TVs (more specifically, TV product lines/series) that have 65” models. For the sake of discussion, let’s give Bing the benefit of the doubt and assume these are correct.

Next, Yusuf asks Bing “which of these is best for gaming?”

Bing Chat result for "which of these is best for gaming"

Seems pretty great, right? From a quick visual scan, you can see the token “game” sprinkled liberally throughout the response. But look more closely: these responses are not at all a subset of “these” 65-inch TVs from the previous response! In fact, out of these 10 recommendations, only 3 are from the previous list; 7 have been swapped out:

This is where the demo of query refinement starts to break down. As you add more conditions to your query, the likelihood that someone has created a piece of content on the web that answers the specific question at the intersection of those conditions becomes exceedingly small. That is, the probability of a blog post existing for the “best selling 65-inch TVs” is greater than the probability that someone has written a blog post on the “best selling 65-inch TVs for gaming”.

Notice that now, Bing is starting to return individual product model numbers instead of product lines or series. The LLM is definitely picking up on the input token “gaming” and blending in product numbers it has seen in gaming related content, but two of the recommendations, the “LG 65SM8600PUA” and “LG OLED65C8P”, have been discontinued and are no longer sold on the web (though you can find the 65SM8600PUA used on Amazon according to the Diffbot Knowledge Graph).

Are these the “best for gaming”? That’s subjective and hard to evaluate, but these results certainly cannot be the “best-selling in 2023” if they are not even available for sale, except on a used Amazon listing. The steelman argument could be that these were the best selling TVs back in 2021 when the LLM was trained, but the response says that these are in 2023.

Next, Yusuf asks “which one of these is the cheapest?”

Again, these results look great in the context of a demo if you don’t look too carefully at what they are saying.

If you are actually trying to use Bing to do product research though, and have specific reasons for wanting the TV to be 65-inches (maybe that’s how much space you have to work with on your living room wall, or that is something you have already decided on based on previous reviews), then they aren’t so great. We can see here that one of the TVs recommended is 55 inches. And again Bing loses the context from the previous questions, with only 1 out of the 5 recommendations being a subset of the prior results, seemingly there by coincidence.

Let’s Learn about Japanese Poets

I think this example was put into the keynote to show how Bing Chat can be used to do more exploratory discovery, as a contrast to goal-oriented task sessions, which to be fair is a pretty great use case for LLMs.

Yusuf asks Bing Chat simply “top japanese poets”

Again, this looks like a great result if you’re looking at the results with the mindset of a demo and not with the mindset of an actual learner trying to learn about Japanese poets. It certainly seems to come back faster and cleaner than the alternative of searching for an article on the web about famous Japanese poets written by an expert and dealing with the potentially jarring visual layout of visiting a new website. But is the underlying information correct?

Well, it is mostly correct. But the question asks for top Japanese poets, and Gackt (whose songs I admit to having listened to in my teenage years), being a famous Japanese rock star, is definitely no poet. Yes, he could have written a poem backstage, after a jam session, during a moment of reflection, but there is no charitable common-sense interpretation of this answer under which Gackt is one of the top 10 Japanese poets. Another nit with this response is that Eriko Kishida has no Wikipedia page (though she has IMDB credits for writing the lyrics of many children’s TV shows), yet Bing claims “according to Wikipedia”. By citing Wikipedia, Bing confers a greater sense of authority on these results than they actually deserve.

A General Recipe for Grounded Language Generation using Knowledge Graphs

So, how could both of these examples have been improved by using a Knowledge Graph?

First of all, it’s important to recognize that not all use cases of LLMs require or desire accuracy. Many of the most impressive examples of LLM and multi-modal LLM outputs are examples where creativity and “hallucination” create a lot of artistic value and present the user with novel blends of styles that they would not have thought of. We love stories about unicorns and pictures of avocado chairs.

But for use cases where accuracy matters, whether this is because of an intersection of hard constraints, a need for auditability / review, or for augmenting professional services knowledge work, Knowledge Graphs can be used as a “rail” to guide the generation of the natural language output.

For these applications, a common design pattern is emerging for grounded language generation:

The first step is using the structured data of the Knowledge Graph to enrich the natural language of the query with structured data. In the example of the TV shopping and Japanese poets, it is recognizing that the question aligns with an entity type that is available in the Knowledge Graph. In the TV example it is “type:Product” and in the poets example it is “type:Person”. Knowing the requested entity type of the question is very useful, allowing you to type-check the responses later, and even present result UIs that are optimized for that type. There is an extensive line of academic research in this vein and Diffbot provides tools for structuring natural language.

The Enrichment step is also where you might pull in any additional information about entities that are mentioned in the natural language question. For example, if the question is “What are alternatives to Slack?”, you’d want to pull in information from the Knowledge Graph about Slack, structured attributes as well as text descriptions so that the LLM has more text to work with. This is especially important if the question contains entities that are not so popular on the web, or only known about privately.

Using the now augmented input, formulate a prompt for the LLM that provides this context and specifies the expected type of entity that is desired (e.g. products, companies, people, poets). Prompt engineering is becoming more-and-more of its own skillset, so we won’t cover that in detail here, but will show some examples of Knowledge Graph prompts in a follow up post.
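As a sketch of what such a prompt-formulation step might look like (the wording, field names, and helper below are illustrative assumptions, not a prescribed recipe):

```python
def build_grounded_prompt(question, expected_type, entities):
    """Formulate an LLM prompt that carries KG-derived context and the
    expected entity type of the answer. `entities` maps a mentioned
    entity name to a short description pulled from the knowledge graph."""
    context_lines = [f"- {name}: {desc}" for name, desc in entities.items()]
    context = "Known facts from the knowledge graph:\n" + "\n".join(context_lines)
    instruction = (f"Answer with a list of entities of type {expected_type} only. "
                   f"For each, include a name and a URL.")
    return f"{context}\n\n{question}\n\n{instruction}"

# Hypothetical usage for the "alternatives to Slack" example from above
prompt = build_grounded_prompt(
    "What are alternatives to Slack?",
    "Organization",
    {"Slack": "a workplace messaging platform owned by Salesforce"},
)
```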

As we’ve seen in the above two examples, even with the desired type provided to the LLM, it can still generate inaccurate outputs. This is where the next step of KG-based verification comes in. You can look up each of the LLM-generated results in a Knowledge Graph and discard results that do not match on the desired type or required attributes. This would have allowed you to discard the 55″ TV from the product result (specs), discontinued and non-existent products, and the Japanese rock star (employment) from the list of poets. If you have confidence in the completeness of your Knowledge Graph, you can discard any LLM-generated responses that don’t appear in the knowledge graph or only use candidates from the knowledge graph during the enrichment stage in the prompt.
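A minimal sketch of this verification step, using a toy in-memory lookup in place of a real KG query (the `type` field and the type labels are assumptions for illustration only):

```python
def verify_results(results, expected_type, kg_lookup):
    """Discard LLM-generated results that can't be grounded in the KG.
    kg_lookup(name) returns an entity dict or None; the 'type' field
    name is an assumption for illustration."""
    verified = []
    for name in results:
        entity = kg_lookup(name)
        # Keep only results that exist in the KG and match the desired type
        if entity and entity.get('type') == expected_type:
            verified.append((name, entity))
    return verified

# Toy KG standing in for a real lookup: Gackt exists but is not a poet,
# and a hallucinated name is absent entirely
toy_kg = {"Matsuo Basho": {"type": "Poet"}, "Gackt": {"type": "MusicArtist"}}
kept = verify_results(["Matsuo Basho", "Gackt", "Nonexistent Poet"], "Poet", toy_kg.get)
```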

Another way the Knowledge Graph can help is in the presentation of the results itself. A Knowledge Graph can provide provenance for each recommendation in the response, separately. For some applications, you can even limit the facts returned to only those present in the Knowledge Graph, so that you have 100% provenance for everything shown to the user. In this mode, you can think of the LLM as an intelligent “SELECT” clause on the Knowledge Graph that adaptively picks which columns to present in the user experience based on the query. For many use cases, such as comparison product shopping, a tabular presentation of the products showing the title, image, price, specs, etc. is a lot more user friendly than text, which is why it is the design of most shopping sites. Next-gen LLM/KG hybrid systems should take the lessons learned from UX design and use the new capabilities of LLMs to create adaptive inputs and outputs.

The future of trustworthy AI systems lies in the synergy between LLMs and Knowledge Graphs and knowing when it’s appropriate to use both. By combining the creative power of LLMs and the factual index of Knowledge Graphs, we can build systems that not only inspire creativity, but can be relied on for knowledge workflows and mission-critical applications.

The Top Hacker News Writers (2022)

Hacker News is a crowd-sourced aggregator of the top content on the web that “good hackers find interesting”. It’s easy enough to see who are the top curators on HN, but who are the writers that are most successful at getting to the front page of Hacker News?

We used Diffbot’s Article Extraction API to analyze the 10,950 stories that made it to the Hacker News frontpage in the last 12 months, extracting the author and topics of each article. Sorting by the most prolific individual authors, here are the Top 20 Authors of HN frontpage content in the last 12 months:

Author (frontpage appearances in the last 12 months), recent frontpage articles, and topics:

1. Brian Krebs (26)
   Recent: A Closer Look at the LAPSUS$ Data Extortion Group; Scary Fraud Ensues When ID Theft & Usury Collide; NY Man Pleads Guilty in $20 Million SIM Swap Theft
   Topics: Microsoft, World Wide Web, computer security
2. Jonathan Corbet (24)
   Recent: A way out for a.out; Toward a better list iterator for the kernel; Moving the kernel to modern C
   Topics: kernel, Unix
3. Ken Shirriff (23)
   Recent: Silicon die teardown: a look inside an early 555 timer chip; Yamaha DX7 chip reverse-engineering, part V: the output circuitry; Inside the Apple-1’s unusual MOS clock driver chip
   Topics: Yamaha DX7, read-only memory, engineering
4. Julia Evans (17)
   Recent: Implementing a toy version of TLS 1.3; Celebrate tiny learning milestones; Some tiny personal programs I’ve written
   Topics: Domain Name System, debugging, Rust
5. Dan Luu (17)
   Recent: Why is it so hard to buy things that work well?; Cocktail party ideas; The container throttling problem
   Topics: Google, CPU, Steve Yegge
6. Derek Lowe (14)
   Recent: Deliberately Optimizing for Harm; These Are Real Compounds; An ALS Protein, Revealed
   Topics: Genentech, CRISPR, AlphaFold
7. Jennifer Ouellette (13)
   Recent: An asteroid killed dinosaurs in spring—which might explain why mammals survived; Study: 1960 ramjet design for interstellar travel—a sci-fi staple—is unfeasible; Tiny tardigrades walk like insects 500,000 times their size
   Topics: Italy, Luis Walter Alvarez, Ig Nobel Prize
8. Catalin Cimpanu (12)
   Recent: GitLab servers are being exploited in DDoS attacks in excess of 1 Tbps; DDoS attacks hit multiple email providers; Malware found preinstalled in classic push-button phones sold in Russia
   Topics: Google, computer security, Android, Russia
9. Jeff Geerling (11)
   Recent: Check your driver! Faster Linux 2.5G Networking with Realtek RTL8125B; Turing Pi 2: 4 Raspberry Pi nodes on a mini ITX board; SpaceX’s Starlink Review – Four months in
   Topics: Raspberry Pi, Starlink, SpaceX
10. Michal Necasek (11)
    Recent: Unidentified PC DOS 1.1 Boot Sector Junk Identified; The Secret History of ATAPI; Looking for High Sierra
    Topics: Microsoft, IBM PC DOS, MS-DOS
11. Paul Graham (10)
    Recent: Putting Ideas into Words; Is There Such a Thing as Good Taste?; A Project of One’s Own
    Topics: knowledge, PayPal, Michael Lind
12. Simon Willison (10)
    Recent: How I build a feature; git-history: a tool for analyzing scraped data collected using Git and SQLite; Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool
    Topics: SQLite, JSON, Python
13. Ned Utzig (10)
    Recent: Holy Nonads! A Nine-Bit Computer!; The Further Text Adventures of Scott Adams; A Talk With Computer Gaming Pioneer Walter Bright About Empire
    Topics: IBM, Walter Bright, Sun Microsystems
14. Davide Castelvecchi (9)
    Recent: Earth-like planet spotted orbiting Sun’s closest star; DeepMind’s AI helps untangle the mathematics of knots; Astrophysicists unveil glut of gravitational-wave detections
    Topics: mathematics, Roger Penrose, theoretical physics
15. Dan Goodin (9)
    Recent: Cybercriminals who breached Nvidia issue one of the most unusual demands ever; iOS zero-day let SolarWinds hackers compromise fully updated iPhones; This is not a drill: VMware vuln with 9.8 severity rating is under attack
    Topics: Microsoft, iOS, graphics card
16. Dr. Ian Cutress (9)
    Recent: From There to Here, and Beyond; Did IBM Just Preview The Future of Caches?; An AnandTech Interview with Jim Anderson, CEO of Lattice Semiconductor
    Topics: Intel, CPU cache, Ryzen, Advanced Micro Devices
17. Jake Edge (9)
    Recent: Restricting SSH agent keys; Moving Google toward the mainline; Cooperative package management for Python
    Topics: Python, Secure Shell
18. Jean-Luc Aufranc (9)
    Recent: Android 13 virtualization lets Pixel 6 run Windows 11, Linux distributions; Add 10GbE to your system with an M.2 2280 module; StarFive Dubhe 64-bit RISC-V core to be found in 12nm, 2 GHz processors
    Topics: RISC-V, SiFive, ARM Cortex-A75
19. Bret Devereaux (9)
    Recent: Collections: How the Weak Can Win – A Primer on Protracted War; Collections: Rome: Decline and Fall? Part II: Institutions; Collections: Fortification, Part V: The Age of Industrial Firepower
    Topics: Rome, Ancient Rome, Decline and Fall, War
20. Howard Oakley (9)
    Recent: Explainer: Whatever happened to QuickTime?; How good is Monterey’s Visual Look Up?; How Secure Boot works on M1 series Macs
    Topics: Apple Inc., macOS, M1

Top Authors of HN Frontpage content from 2021-03-27 to 2022-03-27

It’s good to see that after all these years, Hacker News has stayed true to its core hacker audience: operating systems, hardware, and security dominate the topics of the top writers.

You can find the full colab notebook for generating these results.
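The heart of the analysis is a simple aggregation. Hypothetically, given per-article records extracted by the Article API (the `author` field name here is an assumption for illustration), the ranking reduces to a counter:

```python
from collections import Counter

def top_authors(articles, k=20):
    """Rank authors by number of frontpage appearances.
    Each article is a dict with an 'author' field (field name is an
    assumption standing in for the real extraction API output)."""
    counts = Counter(a['author'] for a in articles if a.get('author'))
    return counts.most_common(k)

# Toy records; articles with no extracted author are skipped
articles = [{'author': 'Brian Krebs'}, {'author': 'Ken Shirriff'},
            {'author': 'Brian Krebs'}, {'author': None}]
# top_authors(articles, k=2) → [('Brian Krebs', 2), ('Ken Shirriff', 1)]
```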

A Less-biased Way to Discern Media Bias Using Knowledge Graph Enhanced AI

As it becomes increasingly difficult to separate what is real from what is virtual, it becomes increasingly important for us to have tools that measure the biases in the information that we consume everyday.  Bias has always existed, but as we spend more of our conscious hours online, media — rather than direct experience — is what overwhelmingly shapes our worldviews.  Various journalistic organizations and NGOs have studied media bias, producing charts like the following.

Source: Poynter Institute: Should you trust media bias charts?

Most of these methodologies rely on surveying panels of humans, who are themselves biased.

The leading producer of media political bias charts that score the degree to which media outlets lean politically to the left vs. right notes about their methodology:

Keep in mind that this ratings system currently uses humans with subjective biases to rate things that are created by other humans with subjective biases and place them on an objective scale.

Ad Fontes Media

How do we avoid our own biases (or the biases of a panel of humans) when studying bias?  It is well known by now that AI systems (read: statistical models learned from data) trained on human-supplied labels reflect the biases of those human judgements encoded in the data.  How do we avoid asking humans to judge the biases of the articles?

Answer: by building a system that (a) defines the target output with an objective statement and (b) combines independent AI components that are trained on tasks that are orthogonal to the bias scoring task. Here’s what a system we built at Diffbot to score political bias of media outlets looks like:

We can define, via the input parameters, the desired output of the system as the sentiment towards the Republican Party (Diffbot entity ID: EQux7TYFDMgO6n_OByeSXzg) minus the sentiment towards the Democratic Party (Diffbot entity ID: EsAK1CigZMFeqk72s5EidGQ).  These entities refer to the Republican and Democratic political parties in the United States.  The beauty of this objective definition of the system’s output is that you can modify it by varying the inputs to produce bias scores along any other political spectrum (e.g. Libertarian-Authoritarian, or the multi-party variations in your local country), and the system can produce new scores along that spectrum without another bias-prone re-surveying of humans.

The two AI components of the system are a (a) named entity recognizer, and a (b) sentiment analyzer.

The named entity recognizer is trained to find subjects and objects in English text and link them to Uniform Resource Identifiers (URIs) in the Diffbot Knowledge Graph.  The entity recognizer knows nothing of the political bias task and isn’t trained on examples of political/non-political text. What the model learns is the syntax of English, which positions in a sentence constitute a subject or object, and which entity a span of text refers to.  The Republican Party and Democratic Party are just two unremarkable entities out of billions of possible entities in the Diffbot Knowledge Graph that the NER system could link to.

The sentiment analyzer is a model that is trained to determine whether a piece of text is positive or negative, but it also knows nothing about political bias nor has it seen anything in its training set specific to political entities. This model is merely learning how we in general express negativity or positivity.  For example,  “I like puppies!” is a sentence that indicates the author has positive sentiment towards puppies. “I’m bearish on crypto” is a sentence that indicates the author has negative sentiment towards cryptocurrencies.

By combining these two independent systems, none of which has seen the political bias task or has training data that was gathered for that purpose, we can build a system that calculates the bias in text along a spectrum defined by any two entities.  We ran an experiment by querying the Diffbot Knowledge Graph for content from the mainstream media outlets and ran the bias detector on the 17,468,963 resulting articles to produce the Diffbot Media Bias Chart, below. 
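A minimal sketch of the scoring step, under the assumption that the combined NER-plus-sentiment pipeline yields, per article, a map from entity ID to a sentiment value in [-1, 1] (the data shape and function are illustrative, not the production system):

```python
def outlet_bias(article_sentiments, right_id, left_id):
    """Average (sentiment toward right_id minus sentiment toward left_id)
    over all of an outlet's articles that mention either entity.
    article_sentiments: list of dicts mapping entity ID -> sentiment in [-1, 1]."""
    diffs = []
    for sentiments in article_sentiments:
        if right_id in sentiments or left_id in sentiments:
            # Entities not mentioned in an article contribute neutral sentiment
            diffs.append(sentiments.get(right_id, 0.0) - sentiments.get(left_id, 0.0))
    return sum(diffs) / len(diffs) if diffs else 0.0

# Toy example using the two entity IDs from the text
REP = "EQux7TYFDMgO6n_OByeSXzg"
DEM = "EsAK1CigZMFeqk72s5EidGQ"
score = outlet_bias([{REP: -0.2, DEM: 0.1}, {REP: 0.0}], REP, DEM)
```

A negative score leans against the first entity relative to the second; swapping in other entity IDs redefines the spectrum, as described above.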

There are some interesting insights:

  • There’s an overall negativity bias to news. There’s truth to the old adage that the frontpage of the newspaper reports on the worst things that have happened around the world that day. The news reports on heinous crimes, pandemics, disasters, and corruption. This overall negativity bias dominates any left-right political bias. However, there is also clearly a per-outlet bias that ranges from heavily critical to a subdued, slight negativity.
  • There is often a characterization of political bias among news outlet rivals that compete for your media attention and advertising dollars, e.g. the CNN/Fox News rivalry, but both are actually rather centrist relative to the other outlets.  The data does not support a bi-modal distribution of political bias–that is, one cluster on the left and another cluster on the right, but rather something that looks more like a normal distribution–a large centrist cluster, with few outlets at the extremes.  This may have to do with the fact that the business model of media ultimately competes for large audiences.  

Of course, there is no perfectly unbiased methodology for calculating a political bias score, but we hope that this approach spurs more research into developing new methods for how AI can help detect human biases.  We showed that two AI components that solve orthogonal problems–named entity recognition and sentiment analysis–can be composed into a single system whose goal isn’t to replicate human judgement, but to do better.

You can download the full dataset for the above experiment here and reproduce your own bias chart along any sentiment spectrum by using the Diffbot Natural Language API.





Diffbot Partners with Avast to Improve Consumer Online Privacy

Excited to make public our collaboration with Avast Software, now the world’s largest Antivirus security company, which is using Diffbot, the world’s largest Knowledge Graph, to improve the online privacy of consumers around the world. The average internet user visits 94 web pages each day, and each site includes various trackers and lengthy legal terms that are impossible for the average person to fully read and understand the implications of. We’re using AI to improve online privacy–by using machines to read all of the privacy policies on the entire web and making every company’s privacy posture transparent.
Working with the Avast team has also been a great example of corporate-startup collaboration, oft sought-after by corporate innovation groups, but rarely achieved. It’s been a pleasure to observe a team of ML engineers from different companies coming together to solve a common problem of societal importance, and shipping code. 
In addition to integrating this into Avast products, we plan to publish our privacy insights in a series of blog posts and hope to make available the underlying datasets for academic and industry privacy research groups.

Full details:

Diffbot-Powered Academic Research in 2020

At Diffbot, our goal is to build the most accurate, comprehensive, and fresh Knowledge Graph of the public web, and Diffbot researchers advance the state-of-the-art in information extraction and natural language processing techniques.

Outside of our own research, we're proud to enable others to do new kinds of research on some of the most important topics of our time: analyzing the spread of online news, misinformation, privacy advice, emerging entities, and Knowledge Graph representations.

As an academic researcher, one of the limiting factors in your work is often access to high-quality, accurate training data for your particular problem. This is where tapping into an external Knowledge Graph API can greatly accelerate the bootstrapping of your own ML dataset.
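As an illustrative sketch only: the endpoint and query syntax below are modeled on Diffbot's DQL-style knowledge-graph search API, but the exact parameter names are assumptions here–check the current API documentation before relying on them.

```python
# Illustrative only: the endpoint and parameter names are assumptions
# modeled on a DQL-style knowledge-graph search API.
from urllib.parse import urlencode

def build_kg_query(entity_type, field, value, token="YOUR_TOKEN", size=50):
    """Build a hypothetical KG search URL for seeding an ML dataset."""
    params = {
        "type": "query",
        "token": token,
        "query": f'type:{entity_type} {field}:"{value}"',
        "size": size,
    }
    return "https://kg.diffbot.com/kg/v3/dql?" + urlencode(params)

url = build_kg_query("Organization", "industries", "Machine Learning")
```

Fetching that URL (for example with `requests.get`) would return candidate entities that you can then label and filter into a training set, rather than collecting every example by hand.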

Here is a sampling of some of the academic research conducted by others in 2020 that uses Diffbot:


From Knowledge Graphs to Knowledge Workflows

2020 was undeniably the “Year of the Knowledge Graph.”

2020 was the year that Gartner put Knowledge Graphs at the peak of its hype cycle.

It was the year where 10% of the papers published at EMNLP referenced “knowledge” in their titles.

It was the year over 1000 engineers, enterprise users, and academics came together to talk about Knowledge Graphs at the 2nd Knowledge Graph Conference.

There are good reasons for this grass-roots momentum: it isn't any one company pushing the trend (ahem, I'm looking at you, Cognitive Computing), but rather a broad coalition of academics, industry vertical practitioners, and enterprise users who deal with building intelligent information systems.

Knowledge graphs represent the best of what we hope the "next step" of AI looks like: intelligent systems that aren't black boxes, but are explainable, that are grounded in the same real-world entities as us humans, and that can exchange knowledge with us using precise common vocabularies. It's no coincidence that in the breakthrough year of the deep learning revolution (2012), Google introduced the Google Knowledge Graph as a way to provide interpretability to its otherwise opaque search ranking algorithms.

The Risk Of Hype: Touted Benefits Don’t Materialize


Extracting Product Variant Data with Diffbot API

Diffbot API allows you to automatically gather ecommerce information such as images, descriptions, brands, prices, and specs from product pages, but what about when product pages contain multiple variants of the product, offered at different prices?

A product variant is a variation of a base product, such as multiple sizes, colors, or styles, each of which may have its own pricing and availability. For many kinds of products–ranging from apparel to home goods to car parts–these product variants are crucial to understand. For example, you wouldn't want kid-sized shoes sent to you for adult-sized feet. Product variants also give you clues as to which variations of a product are available from the merchant, and which might be sold out.

Diffbot’s APIs might not always be able to extract variants automatically using AI, but thankfully Diffbot includes a powerful Custom API that allows you to both correct and augment what is extracted.

Let's take a look at this product page – in this example, a bedding sheet set from Macy's – that has product variants. If we pass this URL to Diffbot API, Diffbot automatically extracts the base product's title, text, price, SKU, and images, as well as the thread count and fabric. However, it does not extract the variants.

In this example, the sheets come in multiple sizes (from Twin to California King) and in colors ranging from a classic white to Pomegranate (which, unsurprisingly, has plenty in stock). As humans, we can easily see that the add-to-bag price depends on the size, not the color.

Let’s make our AI see this too.

To do this we can use an X-eval rule, essentially a JavaScript function containing our own custom scraping logic that augments what Diffbot already extracts. An X-eval can be specified when creating a custom rule using the Custom API.

function () {
  start();
  var variants = [];
  /* get the available sizes */
  var sizes = $('').filter((i, e) => {
    return !$(e).hasClass('unavailable');
  });
  for (var i = 0; i < sizes.length; i++) {
    /* re-query the sizes each iteration, since clicking re-renders the DOM */
    var sizes = $('').filter((i, e) => {
      return !$(e).hasClass('unavailable');
    });
    var sizeEl = sizes[i];
    /* get colors. click the size first */
    $(sizeEl).click();
    var colors = $('li.color-swatch').filter((i, e) => {
      return !$(e).hasClass('unavailable');
    });
    if (colors.length > 0) {
      var price = $('div.price').text().match(/([0-9.]+)/)[1];
      for (var j = 0; j < colors.length; j++) {
        var colorEl = colors[j];
        variants.push({
          'size': sizeEl.textContent.trim(),
          'color': $(colorEl).find('.color-swatch-div').attr('aria-label'),
          'offerPrice': price
        });
      }
    }
  }
  save("variants", variants);
  end();
}

All X-eval functions start with a start(); invocation and end with end(); to signal that the function is complete (important when callbacks execute after the function returns).

We proceed by enumerating the list of available sizes using jQuery, which is supported in X-eval functions. We then click on the DOM element corresponding to each size, and use another jQuery selector to select the list of available colors. Finally, we use a third jQuery selector to grab the offer price, and save each (size, color, price) combination to a variants array.

The last step is calling save() on variants, which saves the variants array as a property of the product JSON that is returned by Diffbot. Our final extracted product now has these variants captured.
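The resulting shape, with made-up values, looks roughly like this–only the variant field names mirror what the rule saves; everything else is illustrative:

```python
# Illustrative shape of the extracted product JSON after the X-eval
# rule runs; all values here are made up.
product = {
    "title": "550 Thread Count Sheet Set",
    "offerPrice": "$79.99",
    "variants": [
        {"size": "Twin", "color": "White", "offerPrice": "79.99"},
        {"size": "California King", "color": "Pomegranate", "offerPrice": "109.99"},
    ],
}
for variant in product["variants"]:
    print(variant["size"], variant["color"], variant["offerPrice"])
```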

The Economics of Building Knowledge Bases

During the summers of my high school years in suburban Georgia, my friend and I would fill the time by randomly walking into local establishments asking for odd jobs. It was a great way as a student to meet people from all walks of life and learn about different industries. We interviewed to be warehouse forklift operators, car salesmen, baristas, wait staff, and lab technicians.

One of the jobs that left an impression on me was working for AT&T (BellSouth) in their fulfillment center, doing data entry and taking technical support calls. It was an ideal high school job. We were getting paid $9 per hour to play with computers, talk on the phone with people dialing in from all across the country (mostly those having problems with their fax machines and Caller ID devices), and interact with adults in the office.

In the data entry department, our task would be to take in large pallets of postal mail, open each envelope, determine which program or promotion they were submitting to, enter in the information on the form into the internal CRM, and then move on to the next bin.

This setup looked something like this:

Given that each form contained about 6 fields, and each field had about 10 words, typing at 60 words per minute meant that it took on average a minute to key in each form. At $9 / hour, this translates to $0.025 per field entered into the CRM. This is a lower bound on the true cost, as it doesn't include the customer's cost of filling out the form, the cost of mailing the letter to the fulfillment center, or the overhead of the organization itself, which together would increase this estimate a few times over.
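That back-of-the-envelope arithmetic is easy to check:

```python
# Reproduce the cost-per-field estimate from the data-entry job above.
fields_per_form = 6
words_per_field = 10
typing_speed_wpm = 60    # words per minute
hourly_wage = 9.00       # dollars per hour

minutes_per_form = fields_per_form * words_per_field / typing_speed_wpm
cost_per_form = hourly_wage / 60 * minutes_per_form
cost_per_field = cost_per_form / fields_per_form

print(f"${cost_per_field:.3f} per field")  # → $0.025 per field
```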

What limits the speed, and therefore the cost, of data acquisition? Notice that in the above diagram, the main bottleneck–and the majority of the time spent–is the back-and-forth feedback loop between reading and typing. This internal feedback loop is tied to the human brain's ability to process symbols on the page, chunk them into bits of meaning, and plan a sequence of motor actions in the fingers that result in keystrokes.

As far as knowledge work goes, this setup is quite minimalist, as I was only entering information from a single source (the paper form); most knowledge work involves combining information from multiple sources, and sometimes synthesizing and reconciling competing pieces of information to produce an output. However, note that the largest bottleneck of any knowledge acquisition job is not actually the speed or words per minute that I can type. Even with access to a perfect high-bandwidth human-machine interface via a neural lace directly wiring the motor and somatosensory cortex of my brain to the computer, the main bottleneck would still be the speed at which I could read and understand the words on the page (language processing is associated with regions of the brain such as Broca's and Wernicke's areas).

Manual data collection like the setup of my summer job is by far the most prevalent form of building digital knowledge bases, and has persisted from the beginning of digital computers until the present day. In fact, one of the original motivations for creating computer companies was to enable this very task. The founder of the original computer company, IBM, was motivated in part by his work compiling the 1880 US census, one of the first databases.

While we can scale up the knowledge acquisition effort (i.e., build larger knowledge bases) by hiring larger teams of people to work in parallel, this would simply be an aggregation of labor, not a net gain in productivity. The unit economics (i.e., the cost per field) wouldn't change; we'd simply be paying more for a larger team of humans, and the cost per field would in fact go up a bit due to the overhead of coordinating them. For many decades, thanks to the growth of the modern corporation, this is how we got larger and larger knowledge bases, including Cyc, one of the early efforts to build a knowledge base for AI, which contained 21M fields. Most knowledge bases today are constructed by an organization of people trained to do the task. However, something was brewing in the mid-90s that would change this cost structure forever.

That step-function change was the Internet. A growing global network of inter-connected computers meant a large increase in the addressable labor pool (millions, and later billions, of people), and access to global economies with lower wages. The biggest change, though, was that a lot of people spent their "free" time on the Internet. This allowed sites like Wikipedia to flourish, which can be viewed as a knowledge base built by a global community of contributors. This dramatically lowers the effective cost of each record, as most contributors don't view building the knowledge base as their primary means of employment, but as a volunteering activity or hobby. Building a knowledge resource like Wikipedia would have been prohibitively expensive for a single organization to execute pre-Internet.

A startup called MetaWeb leveraged crowdsourcing to build a knowledge base called Freebase. By importing much of Wikipedia and providing a wiki-style web-based editor, they grew the knowledge base to 1.9B fields. This represented a 100X improvement in the cost of acquiring each field. Freebase was shut down after MetaWeb was acquired by Google; its Wikipedia origins are why many of the knowledge graph panels that Google returns are based on Wikipedia pages.

Crowdsourcing has become an effective technique for maintaining large publicly-accessible knowledge bases. For example, IMDB, Foursquare, Yelp, and the Google Knowledge panels all take advantage of Internet users to curate, complete, and find errors in those knowledge bases. While crowdsourcing has been great in enabling the creation of these very useful datasets and tools, it has its limitations as well. The key limitation is that it is only possible to crowdsource the construction of a database in certain areas of knowledge where there is a sufficient level of mass-market popularity to form an online community of users, typically 100k or more. This is why, as a general rule, we tend to see crowd-sourced knowledge bases in the domains of celebrities (Wikipedia pages), movies (IMDB), restaurants (Yelp), and other entertainment activities but not scientific and business activities (e.g. drug interactions, vendor databases, financial market data, business intelligence, legal records). This is because, unlike leisure, work requires specialized knowledge, and there are not online communities of 100k specialists in each area.

So what technology will enable the next 100X breakthrough in knowledge acquisition?

Naturally, to go beyond the limitations of groups of humans, we will have to turn to artificial intelligence for acquiring knowledge. This field is called automated knowledge base construction, and it is the focus at Diffbot. At Diffbot, we have developed a commercial system that combines multiple areas of research–visual extraction of webpages, natural language processing, computer vision, and knowledge fusion–to build an autonomous system that can construct a production-level knowledge base. Because the fields in the knowledge base are gathered not by humans but by an AI system synthesizing multiple documents, the domains of knowledge are not limited to what is popular, and it becomes economically feasible to acquire the kind of knowledge that is useful for business applications.

Here is a summary of the unit economics of various methods of building knowledge bases. Much credit goes to Heiko Paulheim for his analysis framework in "How much is a Triple?" (ISWC '18), which I have updated with my own estimates and calculations.

The above framework makes some simplifying assumptions. For example, it treats the economic task of building a knowledge base as building a static resource, of a fixed size. However, we all know that the real value of a knowledge base is in how accurately it reflects the real world, which is always changing. Just as we perform a census once every 10 years, the calculations above don’t take into account the cost of refreshing and maintaining the data, as an ongoing knowledge service that is expressed per unit time. Business applications require data that is updated with a frequency of weeks, days, and even seconds. This is an area where the AI factor is even more pronounced. More on this later…
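Using the figures from this post–manual entry at roughly $0.025 per field, crowdsourcing at roughly 100X cheaper, and automated extraction targeting another ~100X beyond that–the relative unit economics can be tabulated. These are rough, illustrative estimates drawn from the narrative above, not Paulheim's exact numbers:

```python
# Rough, illustrative cost-per-field estimates from this post's narrative.
MANUAL_COST = 0.025  # dollars per field, from the data-entry estimate

methods = {
    "manual data entry": MANUAL_COST,
    "crowdsourcing (Freebase-style)": MANUAL_COST / 100,
    "automated extraction": MANUAL_COST / 100 / 100,
}
for method, cost in methods.items():
    print(f"{method:32s} ${cost:.7f} per field")
```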

Diffbot’s Approach to Knowledge Graph

Google introduced the term Knowledge Graph to the general public ("Things, not Strings") when it added the information boxes you see to the right-hand side of many searches. However, the benefits of storing information indexed around an entity and its properties and relationships are well known to computer scientists, and this has been one of the central approaches to designing information systems.

When computer scientist Tim Berners-Lee originally designed the Web, he proposed a system that modeled information as uniquely identified entities (the URI) and their relationships. He described it this way in his 1999 book Weaving the Web:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.

You can trace this way of modeling data even further back, to the era of symbolic artificial intelligence ("Good Old-Fashioned AI") and the Relational Model of data first described by Edgar Codd in 1970–the theory that forms the basis of relational database systems, the workhorse of information storage in the enterprise.

From “A Relational Model of Data for Large Shared Data Banks”, E.F. Codd, 1970

What is striking is that these ideas of representing information as a set of entities and their relations are not new, but very old. It seems there is something natural and human about representing the world this way. So, the problem we are working on at Diffbot isn't a new or hypothetical problem that we defined ourselves, but one of the age-old problems of computer science–one found within every organization that tries to represent its information in a way that is useful and scalable. The work we are doing at Diffbot is in creating a better solution to this age-old problem, in the context of a new world with increasingly large amounts of complex and heterogeneous data.

The well-known general knowledge graphs (i.e., those that are not verticalized) can be grouped into a few categories: search-engine-maintained KGs (the Google, Bing, and Yahoo knowledge graphs), community-maintained knowledge graphs like Wikidata, and academic knowledge graphs like WordNet and ConceptNet.

The Diffbot Knowledge Graph approach differs in three main ways: it is an automatically constructed knowledge graph (not based on human labor), it is sourced from crawling the entire public web and all its languages, and it is available for use.

The first point is that all other knowledge graphs involve a heavy amount of human curation–direct data entry of the facts about each entity, selection of which entities to include, and categorization of those entities. At Google, the Knowledge Graph is actually a data format for structured data standardized across various product teams (shopping, movies, recipes, events, sports); hundreds of employees and even more contractors both enter and curate the categories of this data, combining these separate product domains into a seamless experience. The Yahoo and Bing knowledge graphs operate in a similar way.

A large portion of the information these consumer search knowledge graphs contain is imported directly from Wikipedia, another crowd-sourced community of humans that both enters and curates categories of knowledge. Wikipedia's sister project, Wikidata, has humans directly crowd-editing a knowledge graph. (You could argue that the entire web is also a community of humans editing knowledge. However, the entire web doesn't operate as a singular community with shared standards and a common namespace for entities and their concepts–otherwise, we'd have the Semantic Web today.)

Academic knowledge graphs such as ConceptNet, WordNet, and, earlier, Cyc are also manually constructed, although to a larger degree informed by linguistics, and often by people employed by the same organization rather than volunteers on the Internet.

Diffbot's approach to acquiring knowledge is different. Diffbot's knowledge graph is built by a fully autonomous system. We create machine learning algorithms that classify each page on the web as an entity, extract the facts about that entity from each of those pages, and then use machine learning to link and fuse the facts from various pages into a coherent knowledge graph. We build a new knowledge graph from this fully automatic pipeline every 4-5 days, without human supervision.
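In outline–with every helper here a simplified, hypothetical placeholder rather than Diffbot's actual internals–the pipeline has this shape: classify, extract, then link and fuse.

```python
# Toy sketch of an automatic knowledge-graph construction pipeline.
# All helpers are simplified placeholders, not Diffbot code.

def classify_page(page):
    """Decide which entity type a page is about (stub heuristic)."""
    return "Product" if "price" in page else "Article"

def extract_facts(page):
    """Pull (subject, predicate, object) triples from a page (stub)."""
    return [(page["url"], "type", classify_page(page))]

def fuse(facts):
    """Link and deduplicate facts from many pages into one graph."""
    return sorted(set(facts))

pages = [{"url": "a.example", "price": "$5"}, {"url": "b.example"}]
graph = fuse(fact for page in pages for fact in extract_facts(page))
print(graph)
```

Running the whole pipeline on a fresh crawl, rather than editing a stored graph by hand, is what lets the knowledge graph be rebuilt end-to-end on a fixed cadence.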

The second differentiator is that Diffbot's knowledge graph is sourced from crawling the entire web. Other knowledge graphs may have humans citing pages on the web, but the set of cited pages is a drop in the ocean compared to all pages on the web. Even Google's regular search engine is not one index of the whole web–rather, it is a separate index for each language that appears on the web. If you speak an uncommon language, you are not searching a very big fraction of the web. When we analyze each page on the web, however, our multi-lingual NLP is able to classify and extract the page, building a unified Knowledge Graph for the whole web across all languages. The other two companies besides Diffbot that crawl the whole web (Google and Bing in the US) index all of the text on each page for their search rankings but do not extract entities and relationships from every page. The consequence of our approach is that our knowledge graph is much larger, and it autonomously grows by 100M new entities each month–a rate that is accelerating as new pages are added to the web and we expand the hardware in our datacenter.

The combination of automatic extraction and web-scale crawling means that our knowledge graph is much more comprehensive than other knowledge graphs. While you may notice in Google search that a knowledge graph panel activates when you search for Taylor Swift, Donald Trump, or Tiger Woods (entities that have a Wikipedia page), a panel is likely not going to appear for searches on your co-workers, colleagues, customers, suppliers, family members, and friends. The former are the popular celebrities with the most-optimized queries on a consumer search engine; the latter are the entities that actually surround you on a day-to-day basis. We would argue that a knowledge graph covering those real-life entities–the latter category–is much more useful for building applications that get real work done. After all, you're not trying to sell your product to Taylor Swift, recruit Donald Trump, or book a meeting with Tiger Woods–those just aren't the entities most people encounter and interact with on a daily basis.

Lastly, access. The major search engines do not give any meaningful access to their knowledge graphs, much to the frustration of academic researchers trying to improve information retrieval and AI systems. This is because the major search engines see their knowledge graphs as competitive features that aid the experiences of their ad-supported consumer products, and do not want others to use the data to build competing systems that might threaten their business. In fact, Google ironically restricts crawling of itself, and the trend over time has been to remove functionality from its APIs. Academics have created their own knowledge graphs for research use, but these are toy KGs, 10-100MB in size and released only a few times per year. They make some limited research possible, but are too small and out-of-date to support most real-world applications.

In contrast, the Diffbot knowledge graph is available and open for business. Our business model is providing Knowledge-as-a-Service, so we are fully aligned with our customers' success. Our customers fund the development of improvements to the quality of our knowledge graph, and that quality improves the efficiency of their knowledge workflows. We also provide free access to our KG to the academic research community, clearing away one of the main bottlenecks to academic research progress in this area. Researchers and PhD students should not feel compelled to join an industrial AI lab to access its data and hardware resources in order to make progress in the field of knowledge graphs and automatic information extraction. They should be able to fruitfully research these topics in their academic institutions. We benefit the most from any advancement to the field, since we are running the largest implementation of automatic information extraction at web scale.

We argue that a fully autonomous knowledge graph is the only way to build intelligent systems that successfully handle the world we live in: one that is large, complex, and changing.