Welcome Chun Han Hsiao – Senior Software Engineer

Hi there, my name is Chun Han and I’m a new Senior Software Engineer at Diffbot. I started programming as a senior in high school and have enjoyed it a lot. I then started studying Computer Science at National Central University in Taiwan.

While I worked as a Software Engineer for several companies, I enjoyed contributing to open source projects and being part of the community. I contributed to many projects, including Netty, Mitmproxy, ModelMapper, and Trino. I enjoyed learning from those experiences and working with different people. Afterwards, I started my own project, Nitmproxy (or Netty-in-the-middle proxy). It began as a personal project, and I never expected anyone other than me to use it. I’m surprised, and really appreciate, that it is now used by Diffbot. I’m glad that my work solves real problems and is used by other people.

I’m excited about my new journey with Diffbot. I feel there are so many things I can do here, from working on different projects, to improving my skills, to growing with the company.

The Top Hacker News Writers (2022)

Hacker News is a crowd-sourced aggregator of the top content on the web that “good hackers find interesting”. It’s easy enough to see who are the top curators on HN, but who are the writers that are most successful at getting to the front page of Hacker News?

We used Diffbot’s Article Extraction API to analyze the 10,950 stories that made it to the Hacker News frontpage in the last 12 months, extracting the author and topics of each article. Sorting by the most prolific individual authors, here are the Top 20 Authors of HN frontpage content in the last 12 months:

Author | Frontpage Appearances (last 12mo) | Recent Frontpage Articles | Topics
Brian Krebs | 26 | A Closer Look at the LAPSUS$ Data Extortion Group; Scary Fraud Ensues When ID Theft & Usury Collide; NY Man Pleads Guilty in $20 Million SIM Swap Theft | Microsoft, World Wide Web, computer security
Jonathan Corbet | 24 | A way out for a.out; Toward a better list iterator for the kernel; Moving the kernel to modern C | LWN.net, kernel, Unix
Ken Shirriff | 23 | Silicon die teardown: a look inside an early 555 timer chip; Yamaha DX7 chip reverse-engineering, part V: the output circuitry; Inside the Apple-1’s unusual MOS clock driver chip | Yamaha DX7, read-only memory, engineering
Julia Evans | 17 | Implementing a toy version of TLS 1.3; Celebrate tiny learning milestones; Some tiny personal programs I’ve written | Domain Name System, debugging, Rust
Dan Luu | 17 | Why is it so hard to buy things that work well?; Cocktail party ideas; The container throttling problem | Google, CPU, Steve Yegge
Derek Lowe | 14 | Deliberately Optimizing for Harm; These Are Real Compounds; An ALS Protein, Revealed | Genentech, CRISPR, AlphaFold
Jennifer Ouellette | 13 | An asteroid killed dinosaurs in spring—which might explain why mammals survived; Study: 1960 ramjet design for interstellar travel—a sci-fi staple—is unfeasible; Tiny tardigrades walk like insects 500,000 times their size | Italy, Luis, Walter Alvarez, Ig Nobel Prize
Catalin Cimpanu | 12 | GitLab servers are being exploited in DDoS attacks in excess of 1 Tbps; DDoS attacks hit multiple email providers; Malware found preinstalled in classic push-button phones sold in Russia | Google, computer security, Android, Russia
Jeff Geerling | 11 | Check your driver! Faster Linux 2.5G Networking with Realtek RTL8125B; Turing Pi 2: 4 Raspberry Pi nodes on a mini ITX board; SpaceX’s Starlink Review – Four months in | Raspberry Pi, Starlink, SpaceX
Michal Necasek | 11 | Unidentified PC DOS 1.1 Boot Sector Junk Identified; The Secret History of ATAPI; Looking for High Sierra | Microsoft, IBM PC DOS, MS-DOS
Paul Graham | 10 | Putting Ideas into Words; Is There Such a Thing as Good Taste?; A Project of One’s Own | knowledge, PayPal, Michael Lind
Simon Willison | 10 | How I build a feature; git-history: a tool for analyzing scraped data collected using Git and SQLite; Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool | SQLite, JSON, Python
Ned Utzig | 10 | Holy Nonads! A Nine-Bit Computer!; The Further Text Adventures of Scott Adams; A Talk With Computer Gaming Pioneer Walter Bright About Empire | IBM, Walter Bright, Sun Microsystems
Davide Castelvecchi | 9 | Earth-like planet spotted orbiting Sun’s closest star; DeepMind’s AI helps untangle the mathematics of knots; Astrophysicists unveil glut of gravitational-wave detections | mathematics, Roger Penrose, theoretical physics
Dan Goodin | 9 | Cybercriminals who breached Nvidia issue one of the most unusual demands ever; iOS zero-day let SolarWinds hackers compromise fully updated iPhones; This is not a drill: VMware vuln with 9.8 severity rating is under attack | Microsoft, iOS, graphics card
Dr. Ian Cutress | 9 | From There to Here, and Beyond; Did IBM Just Preview The Future of Caches?; An AnandTech Interview with Jim Anderson, CEO of Lattice Semiconductor | Intel, CPU cache, Ryzen, Advanced Micro Devices
Jake Edge | 9 | Restricting SSH agent keys; Moving Google toward the mainline; Cooperative package management for Python | LWN.net, Python, Secure Shell
Jean-Luc Aufranc | 9 | Android 13 virtualization lets Pixel 6 run Windows 11, Linux distributions; Add 10GbE to your system with an M.2 2280 module; StarFive Dubhe 64-bit RISC-V core to be found in 12nm, 2 GHz processors | RISC-V, SiFive, ARM Cortex-A75
Bret Devereaux | 9 | Collections: How the Weak Can Win – A Primer on Protracted War; Collections: Rome: Decline and Fall? Part II: Institutions; Collections: Fortification, Part V: The Age of Industrial Firepower | Rome, Ancient Rome, Decline and Fall, War
Howard Oakley | 9 | Explainer: Whatever happened to QuickTime?; How good is Monterey’s Visual Look Up?; How Secure Boot works on M1 series Macs | Apple Inc., macOS, M1
Top Authors of HN Frontpage content from 2021-03-27 to 2022-03-27

It’s good to see that after all these years, Hacker News has stayed true to the core hacker audience: operating systems, hardware, and security dominate the topics of the top writers.

You can find the full colab notebook for generating these results.
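The aggregation behind a ranking like this is a simple group-and-count. Here is a minimal sketch, assuming each extracted article has already been reduced to a dict with `author` and `title` keys. Those field names, the `top_authors` helper, and the sample records are illustrative placeholders, not the notebook's actual code or data.

```python
from collections import Counter, defaultdict


def top_authors(articles, n=20):
    """Return the n most frequent authors as (author, count, titles) tuples."""
    counts = Counter()
    titles = defaultdict(list)
    for article in articles:
        author = (article.get("author") or "").strip()
        if not author:
            continue  # skip articles where no author was extracted
        counts[author] += 1
        titles[author].append(article.get("title", ""))
    return [(author, count, titles[author]) for author, count in counts.most_common(n)]


# Placeholder records standing in for real extraction results.
sample = [
    {"author": "Author A", "title": "First post"},
    {"author": "Author B", "title": "Second post"},
    {"author": "Author A", "title": "Third post"},
    {"author": None, "title": "Unsigned post"},
]
ranking = top_authors(sample)
```

Articles without an extracted author are dropped rather than grouped under an empty name, so they can't show up as a spurious "top author".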

Benchmarking: Diffbot Knowledge Graph Versus Google Knowledge Graph

Knowledge graphs play a role in many of our favorite products. They provide information and context that serves up recommendations and additional information just where we need it.

They’re how Alexa and Google search can provide information on entities related to a request. They’re how Netflix builds a profile of the genres, plots, and actors you like.

Many knowledge graphs that reach consumers are primarily built on internal data stores. But a growing number also augment their breadth and timeliness by sourcing information from the public internet.

Three North American entities have claims to crawling the whole web in order to structure information into knowledge graphs. These organizations are Google, Bing, and Diffbot. 

All three provide some level of knowledge graph access to end consumers. Of these three, though, Diffbot is the only commercial knowledge graph provider that allows data teams to integrate and download the entirety of their data. This makes Diffbot’s Knowledge Graph a great starting point for machine learning projects, deeper market intelligence exercises, or web-wide news monitoring projects.

With that said, many product leaders and data teams are not looking for the widest coverage or the largest sets of ingestible data, per se. Rather, these teams are discerning which knowledge graph has the coverage they need.

For example, do you need rapidly updated information about large entities that are easy to track? Do you need suitable coverage of extremely long tail organizations? And what types of data do you need? Basic organizational data? Articles about specific entities? Product data? Discussions or events? 

In this guide we’ll work through a comparison of data coverage between Diffbot and Google knowledge graphs, both of which are available through knowledge graph search APIs. 

Note: Before we jump in, it’s worth noting that the Google knowledge graph API is not recommended for production use. Rather, it’s more of a demo of their internal technology and data.

Check out our comparison of data returned from Google and Diffbot KG search APIs here

Which Knowledge Graph is Larger? Google vs. Diffbot

Historically, knowledge graphs used in academic settings have been too small for viable commercial use. But once knowledge graphs grew substantially past this size, the absolute size of a knowledge graph often wasn’t a proxy for the usefulness of a knowledge graph. 

To show you what we mean, both the Diffbot and Google knowledge graphs hold roughly the same number of entities: ~5B (Google, 2020), and ~5.9B (Diffbot, 2022). But knowledge graphs are built around “things” (items in the world), and ~5B doesn’t begin to account for all of those.

So what “things” are included? Industry insights? Global news coverage? Can this data tell you whether you’d like a movie or should buy a product?

All of these are viable uses for knowledge graphs, and the answers are dependent on the following non-scale related features:

  • What type (topics and fact types) of data is included
  • How up-to-date data is
  • The number of valuable fields per entity
  • How accurate data is
  • How easy it is to extract the data you need
  • How easy it is to fit this data into your workflows
  • And business process-related aspects like pricing, uptime, data provenance, and so forth

To dive into the differences between Diffbot and Google knowledge graphs on the above points, we’ll need to provide some background information about how these knowledge graphs are constructed. Following this, we’ll jump into an up-to-date benchmarking of the coverage of specific entities within each knowledge graph. 

How Google Crawls the Web

Historically, Google has crawled the web to surface what it deems to be the most useful pieces of content around search keywords. Sites deemed more useful or “important” get crawled more frequently. And the top sites tend to present highly in many search terms related to their offerings. 

While Google applies robust natural language processing to pages in order to provide their search service, many surfaced “facts” are not integrated with their knowledge graph as seen in knowledge panels or their KG search API. Take for example the knowledge panel result for Diffbot.

The entry to the right typifies an organization “knowledge panel” within Google search

The area to the right in the screenshot above is the Google knowledge graph-derived knowledge panel. Facts included in these panels are typical of linked data, wherein the organization entity of Diffbot is attached to other knowledge graph entities including locations, people, and other organizations. 

Search results (content) are used to expand knowledge panel offerings

Furthermore, the result is enhanced by additional content. If we click through to competitors, Google can serve up and highlight a portion of content that claims to be about competitors. But even though Webhose, Thinknum, Scrapinghub and others listed all have their own knowledge graph entries, this data isn’t linked. The NLP by which Google is parsing content to categorize and serve up this article on competitors is not integrated into the Google knowledge graph. Clicking through to the headline about competitors does not lead to knowledge panel-related data. But rather takes you to an article that is the top ranking result of the search “Diffbot competitors.” 

Recommendations (“People also search”) are facilitated by knowledge graph linkages

Let’s take another example, wherein we look at a publicly traded company. Searching “Microsoft Revenue” returns the Microsoft knowledge panel as well as the most recent publicly listed revenue number. A great number of fields displayed here are from the Google knowledge graph API. But clicking through to the disclaimer below financial data takes us to Google Finance, a separate service from Google’s knowledge graph. Linked data is present in the “people also search for” section. And each of these organizations does have their own knowledge panel result. But at the end of the day, clicking through any of these simply routes users to a suggested search. 

Related products are held in the knowledge graph, but data is provided by Google Shopping

Dropping to the bottom of the knowledge panel for Microsoft, yet again we see the appearance of linked data. In this case products that are related to Microsoft. But clicking through to each simply returns search results (albeit aggregated values related to price, availability, and reviews). We can verify that this product data is not in fact part of Google knowledge graph by searching for a “Microsoft Xbox One Wireless Controller” using the knowledge graph search API.

There is no “XBOX One Wireless Controller” (from the prior image) in the Google Knowledge Graph

Above is the result for a Google knowledge graph API search for a particular model of XBOX controller (an XBOX One controller) that is served up within the knowledge panel results. What is returned is a general category of “XBOX controller” sourced from Wikipedia, with entity types of “thing” and “productModel.” The closest entity to what was served within the knowledge panel is actually a somewhat generic category of products. The “XBOX One Controller” from the knowledge panel actually isn’t from the Google knowledge graph.
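A check like this can also be scripted. The sketch below builds a request URL for the Google Knowledge Graph Search API and reduces a decoded JSON response to name and type pairs. The endpoint, the `query`, `key`, and `limit` parameters, and the `itemListElement` / `result` / `@type` response keys reflect our understanding of that API and should be treated as assumptions to verify against Google's reference. The helper names are ours.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Google Knowledge Graph Search API.
KG_SEARCH_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_search_url(query, api_key, limit=5):
    """Build the GET URL for an entity search (no request is sent here)."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_SEARCH_ENDPOINT + "?" + urlencode(params)


def summarize_results(payload):
    """Return (name, types) pairs from a decoded search response."""
    summary = []
    for element in payload.get("itemListElement", []):
        result = element.get("result", {})
        types = result.get("@type", [])
        if isinstance(types, str):
            types = [types]  # normalize a single type to a list
        summary.append((result.get("name"), types))
    return summary


url = build_search_url("Microsoft Xbox One Wireless Controller", "YOUR_API_KEY")
```

If the specific product were in the graph, you would expect one of the returned names to match it rather than only a broader category.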

All this hints at the fact that Google pads out the appearance of their knowledge graph in its most prevalent form (knowledge panel results) while not actually ingesting and linking many of these additional data structures.

Sure, Google crawls the entire web to return search results. But what does a large portion of this crawling have to do with their knowledge graph? 

This distinction between Google’s search-related crawls and their knowledge graph data likely begins with Freebase. Freebase was rolled into an early version of Google’s knowledge graph after the company was acquired. Freebase largely crowdsourced knowledge, allowing users to manually tag, relate, update, and create their own knowledge bases. While this enabled some scale (2.4B facts as of 2014), little automation was factored into fact accumulation. While Freebase compiled one of the largest commercially-aimed knowledge bases, they did so manually.

Freebase’s data pipeline didn’t really have anything to do with Google’s automated knowledge accumulation that powers their search engine. 

You don’t have to take our word for it. See Google’s own description:

“Facts in the Knowledge Graph come from a variety of sources that compile factual information. In addition to public sources, we license data to provide information such as sports scores, stock prices, and weather forecasts. We also receive factual information directly from content owners in various ways, including from those who suggest changes to knowledge panels they’ve claimed.”

Or put another way: “It’s a bit amusing that I’ve been invited to speak at a conference on automated knowledge base construction because both in the world I work and my background I don’t know anything about the automated side of this. The world I work in is far from automated. We have automated processes and things like that. But in terms of knowledge base construction, the world I work in is really one of a watchmaker. A precision scientist.” – Jamie Taylor, Decade-long leader at Freebase (now Google Knowledge Graph)

While Google’s knowledge graph is certainly a massive knowledge base, the inclusion of core constituents that are manually sourced including “claimed” (human sourced) knowledge panels and Freebase point to a knowledge graph primarily based on human inputs. 

How Diffbot Crawls the Web

For comparison, Diffbot’s web crawling was always set up as a way to extract, structure, validate, and link data across the web in an automated way (see “The Economics of Building Knowledge Bases”). Our original product line of AI-enabled automatic extraction APIs was meant to extract valuable facts and information from a variety of page types without even seeing their format in advance. Over time, crawling infrastructure, along with the ability to link and apply automated inference and understanding on top of these page crawls, enabled our Knowledge Graph.

How does this work? 

Early research showed us that a large majority of the internet was composed of 9 separate “types” of pages. Think of these pages like articles, discussions, profiles, product pages, event pages, lists, and so forth.

Across languages and sites, the “types” of information that humans tend to find valuable on these page types persists. For example, whether you’re on Amazon or Walmart’s websites, some of the valuable data types on a product page include reviews, price, availability, a picture, and product specifications. These commonalities allow Diffbot to automatically extract information humans care about in a standardized format even if the actual layout of pages is different. All of this with no human input. 

Once facts, underlying text, images, and metadata are extracted, powerful natural language processing tech can transform these inputs into entities and relationships (constructing a graph). 

Because they provide information and context, graph databases are one of the most well suited data sources for machine learning. We leverage ML over our graph to incorporate new fact types such as similarity scores, enhanced organizational descriptors, and estimated revenue of private organizations. 

The range of automated inputs allows Diffbot’s Knowledge Graph to cover a huge range of commercially interesting data types. As of the time of this article’s writing, our data coverage included the following, all linked and with an average of 31 facts per entity:

  • 243MM organization entities
  • 773MM person entities
  • 1,879MM image entities
  • 1,621MM article entities
  • 880MM post entities
  • 128MM discussion entities
  • 141MM product entities
  • 89MM video entities
  • 20MM job entities
  • 42MM event entities
  • 0.56MM FAQ entities
  • 73MM miscellaneous entities
  • 10MM place entities
  • 49MM creativeWork entities
  • .17MM intangible entities

Total: 5,953MM entities

We’ll jump into additional comparisons of data within Diffbot and Google knowledge graphs in the next section. But hopefully you can begin to see the fundamental differences between a Knowledge Graph built for automated fact accumulation from the start (Diffbot) and one built with manual processes (Google). 

Benchmarking Google And Diffbot’s Knowledge Graphs

While there are substantial coverage differences between entity types in Google and Diffbot knowledge graphs, organization entities are well represented in both. Organization entities are also of broad commercial interest, with uses ranging from market intelligence, to supply chain risk analysis, to sales prospecting. 

For our study of Google and Diffbot knowledge graph organizational coverage, we looked at a representative number of randomized head entity and long tail organizations. Head entity organizations in this case are publicly traded companies randomly chosen from the Russell 2000 index. Long tail entities include a random sampling of Series A and earlier startups with fewer than 50 employees.

For both head entity and long tail organizations, we sought out external records of truth on a range of fields including:

  • CEO
  • Headquarter location
  • Number of employees
  • Revenue
  • And homepage URL

Example “ground truth” publications included SEC financial filings, Crunchbase, and LinkedIn.
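The exact scoring used in the study isn't spelled out here. One simple way to turn a comparison like this into numbers is per-field coverage: of the organizations where the ground truth has a value for a field, what share also has that field in the knowledge graph? The sketch below assumes both sides have been normalized into dicts keyed by organization name. The field keys and the `field_coverage` helper are hypothetical.

```python
# Hypothetical field keys corresponding to the benchmark fields listed above.
FIELDS = ["ceo", "headquarters", "employee_count", "revenue", "homepage"]


def field_coverage(ground_truth, kg_records, fields=FIELDS):
    """Per field, the share of ground-truth values the knowledge graph also has.

    Organizations whose ground truth lacks a field (for example a pre-revenue
    company with no revenue figure) are left out of that field's denominator.
    A field with no ground truth at all is reported as None.
    """
    coverage = {}
    for field in fields:
        expected = [
            org for org, truth in ground_truth.items()
            if truth.get(field) is not None
        ]
        if not expected:
            coverage[field] = None
            continue
        present = sum(
            1 for org in expected
            if kg_records.get(org, {}).get(field) is not None
        )
        coverage[field] = present / len(expected)
    return coverage
```

This measures presence only. Checking accuracy would additionally require comparing the returned value with the ground-truth value for each field.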

The results of our analysis show strong coverage of head entities across both knowledge graph providers. In this instance, lack of coverage centered around missing “ground truth” revenue fields in the case of several publicly traded companies who are yet to generate revenue. 

Among startups, a substantial spread emerged. For many uses, SMB/MMKT data is particularly hard to come by at scale, and Diffbot’s coverage includes hundreds of millions of “long tail” entities.

An organization entity within Diffbot’s visual interface for Knowledge Graph search

While we chose fields present for organizations in both knowledge graphs, Diffbot’s Knowledge Graph also provides a wider range of additional fields. Above is a screenshot from our Knowledge Graph search visual interface. But additional fields attached to most organization entities within the Knowledge Graph include:

  • Noteworthy Employees
  • News Coverage
  • Industries
  • Locations
  • Subsidiaries
  • Funding Rounds
  • Descriptions
  • Revenue (or estimated revenue)
  • Similar organizations 
  • Technologies used
  • Among many other fields

The same entity as it’s presented in Google’s search interface

In the Google knowledge graph derived knowledge panel, the only three fields not presented by an external API (Google Finance) are a brief description, the URL of the organization, and the logo. 

While we’ve presented but a handful of samples within this article, Diffbot routinely benchmarks wide ranges of our knowledge graph against competitors and can confidently say we have the world’s most accurate and up-to-date large-scale knowledge graph.

Interested in exploring Diffbot Knowledge Graph data for yourself? Grab a free trial or reach out to our sales team for a custom demo.

Calculating Average Employee Tenure And Attrition With Diffbot’s Knowledge Graph

Data on the talent distribution at organizations is available across the public web. Github, Crunchbase, personal blogs, press releases, and LinkedIn profiles (among others) can lead to insights into hiring, firing, and skill sets.

Historically, tracking tenure or attrition data across large organizations required a ton of manual fact accumulation or commissioning a market intelligence report.

Today, this information can be read by web-reading bots. Diffbot is one of three North American organizations with a claim to crawling the entire web. And our bots extract relevant facts about organizations, people, skills, and more. These facts are then incorporated into the world’s largest commercial Knowledge Graph (try it out for two weeks free today).

In this guide we’ll look at how you can gain tenure and attrition data for organizations in the Knowledge Graph. As some organizations can be quite large, we’ll talk through topics like monitoring the number of calls you’re making to conserve search credits, as well as how you can segment through portions of an organization (e.g. ‘tenure for engineers’ or ‘tenure for management’).


To follow along, you’ll need:

  • A trial or paid account for Diffbot’s Knowledge Graph
  • For average tenure, knowledge of Python or willingness to follow along with our step-by-step instructions and template script
  • For attrition, willingness to follow along in our visual Knowledge Graph search interface with step-by-step instructions
  • The name of an organization you’re interested in tracking tenure or attrition for

Tracking Average Tenure At An Organization In Diffbot’s Knowledge Graph

We’ve set up a Google Colaboratory notebook that you can copy to begin your investigation. Why do we need Google Colab and a script? Because some particularly large organizations can have tens or hundreds of thousands of employees (person entities in our Knowledge Graph). We’ll need to wrangle the start and (potential) end dates of their employments to calculate tenure. It’s simply easier to wrangle that much data with our Knowledge Graph API and a short script.

If you’re unfamiliar with Google Colab or Jupyter Notebooks, you run individual blocks of code by pressing the play button to the left of each block. You’ll need to start by running the first block of code (above) which imports all dependencies needed for the project.

Next you can see that we have two additional blocks of code. They both make API calls to our Knowledge Graph API but return slightly different data. The first returns the average tenure of all employees (person entities) past a certain date at a specific organization. The second returns tenure for a specific job function within an organization.

To begin, you’ll need to locate your token. This will grant you API access to the Knowledge Graph. Your API token can be viewed by clicking the “API Token” button in the top right hand corner of the Diffbot Dashboard.

Copy your full token from the top line of the page that loads and paste it between the quotation marks on the two lines within the Google Colab notebook that start with TOKEN=.

Next we can choose the organization we want to track as well as the date we want to start our inquiry. In other words, if the company has a long history, do you want to see average tenure after a specific date? Note that you’ll need to keep the date field in single quotes inside of double quotes (as it is originally presented). Additionally, the date format used is YYYY-MM-DD.

Notice that our variable entities_to_return is set to one. So as to be mindful of Knowledge Graph API credit usage, we’ll use our initial query to only return full data on one entity (a single person). Once you click the “play” button to run the code, you should see some output at the bottom of this block of code. If you tried Microsoft for the dates I’ve entered, you should see the following.

{'version': 1, 'hits': 90419, 'results': 1, 'kgversion': '235',...

What we’re looking for here is the “hits” number. This is the total number of entities matching our query. So in the case of this example, there are 90,419 person entities who have worked at Microsoft since the first day of 2017. For very large organizations, loading this much data can take some time (and consume many credits), so you’ll need to decide whether you want to shift the timeframe you’re looking at or whether the number of credits is justified. For your trial run, you can also just try a smaller organization to conserve credits.

Once you have a timeframe and organization you think will lead to an interesting insight, take the value after 'hits': and use it to replace 1 in the entities_to_return variable.
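The "probe first, then fetch" pattern described above can be sketched as follows. The endpoint and the `type`, `token`, `query`, and `size` parameter names are our assumptions about the Knowledge Graph DQL API rather than something taken from the notebook, and the helper names are hypothetical. The sketch only builds the URL and reads `hits` from a decoded response; it sends no request.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names for the Knowledge Graph DQL API.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_dql_url(token, query, entities_to_return=1):
    """Build a query URL; entities_to_return=1 keeps the probe call cheap."""
    params = {
        "type": "query",
        "token": token,
        "query": query,
        "size": entities_to_return,
    }
    return DQL_ENDPOINT + "?" + urlencode(params)


def matching_entities(response):
    """Total number of entities matching the query, from the 'hits' field."""
    return int(response["hits"])


def within_budget(response, max_entities):
    """True if fetching every matching entity stays within a chosen limit."""
    return matching_entities(response) <= max_entities


# The truncated response shown above, reduced to the fields that were printed.
probe = {"version": 1, "hits": 90419, "results": 1, "kgversion": "235"}
```

With a limit of, say, 10,000 entities, `within_budget(probe, 10000)` is false for this query, which is the signal to narrow the timeframe or pick a smaller organization before raising `entities_to_return`.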

Next you’ll want to comment out the line that says print(response). This will avoid a memory error from attempting to print the entire output of queries for large organizations. To comment out a line, simply add # in front of it.

Next, click run. A query returning data on thousands of employees may take some time, but most organizations should be quite quick.

If you’ve followed all the steps above, your results should populate the bar below the block of code you just executed!
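For readers curious about the calculation itself, here is a minimal sketch of averaging tenure over a list of employment records. It is not the notebook's code. It assumes each record is a dict with ISO-formatted `from` and optional `to` date strings (the API's actual date representation may differ), treats a missing end date as "still employed", and counts only employments that were active on or after the chosen start date.

```python
from datetime import date


def tenure_years(start, end=None, today=None):
    """Length of one employment in years; open-ended jobs run to `today`."""
    today = today or date.today()
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end) if end else today
    return (end_date - start_date).days / 365.25


def average_tenure(employments, since, today=None):
    """Average tenure of employments still active on or after `since`."""
    cutoff = date.fromisoformat(since)
    spans = []
    for job in employments:
        end = job.get("to")
        if end and date.fromisoformat(end) < cutoff:
            continue  # ended before the period of interest
        if not job.get("from"):
            continue  # no start date, so tenure cannot be computed
        spans.append(tenure_years(job["from"], end, today))
    return sum(spans) / len(spans) if spans else None
```

Passing an explicit `today` makes results reproducible, which helps when comparing runs made on different days.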

To obtain tenure by category of employment, skip to the next block of code.

Our process here is the same as the above with one addition: you’ll want to replace the employment category. You can gain a view of all of our employment categories within our Knowledge Graph search dashboard.

  1. Select person entity
  2. Select filter by employment then categories
  3. Browse a list of job functions

Once you’ve inputted an organization, a date, and a category of employment, click run.

Like our previous example, we’ll evaluate the number of ‘hits’ (person entities showing up in results). If you’re satisfied with the number to evaluate, comment out the print statement detailed in the past example and place the ‘hits’ number as the value for the entities_to_return variable. Then run the code to see the average tenure for workers in a specific work function.

You’re done! Want to utilize the same script to calculate average tenure for segments of employees other than these? Familiarize yourself with Diffbot Query Language and craft a person entity query of your own. Place this value inside of the line of code starting with query =.

Calculating Attrition At An Organization In Diffbot’s Knowledge Graph

The point of the script in the last example was largely just to work with large numbers of dates for the start and end of person entity employments. In this example, we simply want absolute numbers for headcount and employees who have left. These are numbers we can find directly within the visual search interface for the Knowledge Graph.

Because attrition is measured across a time period, you may want to look for how many employees an organization had at the start of a given period. Organization entities within the Knowledge Graph have a field noting their present headcount. But for a specific date in the past we’ll be looking at the employment fields attached to person entities.

Let’s say you want to see attrition for all employees at Netflix since 2015. You can copy the following query to find those who were employed there at the start of 2016.

type:Person employments.{employer.name:"Netflix" from<"2016-01-01" or(to>"2016-01-01", not(has:to))}

The curly braces in this example are an example of a nested query (learn more here). In this case we’re saying return all person entities who both have an employer named Netflix and were employees there from before the first day of 2016.

The final “or” statement expresses that we want results for people who worked at Netflix at least into the start of 2016, including individuals who don’t have an employment “to” value (e.g. a last day of work). This last portion excludes individuals who worked before 2016 but also left before 2016.

The results include 3,324 employees at Netflix (as of 2016-01-01). For this investigation this can be our baseline to see the percentage of attrition.

To see what the makeup of the org was at this point, feel free to add facet:employments.categories.name to the end of the query. This results in a breakdown of the employment category of Netflix at this point in time.

Employment categories of employees at Netflix as of 2016-01-01

Next we simply alter our query slightly to see who has left. This time we want to see employees who worked at Netflix as of the first day of 2016, but later left. We can do this simply by removing not(has:to) and replacing it with has:to. This is specifying that we want individuals who have a “to” (ending) date to their employment.

This query would look like the following:

type:Person employments.{employer.name:"Netflix" from<"2016-01-01" to>"2016-01-01" has:to}

1,289 of the original cohort have left since 2016. Or an attrition rate of ~39%.
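The attrition figure is the number of people who left divided by the baseline cohort. A small sketch using the two counts reported above (the `attrition_rate` helper is ours):

```python
def attrition_rate(baseline_headcount, departed):
    """Share of the baseline cohort that has since left."""
    if baseline_headcount <= 0:
        raise ValueError("baseline_headcount must be positive")
    return departed / baseline_headcount


# 3,324 employees as of 2016-01-01, of whom 1,289 have left.
rate = attrition_rate(3324, 1289)
print(f"{rate:.1%}")  # 38.8%, i.e. the ~39% quoted above
```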

By adding the same facet query to the end, we can see which roles within this cohort have had the most (or least) attrition.

Perhaps interestingly, attrition rates largely follow the general distribution of talent in our original cohort. In short, there isn’t a major branch of the business with disproportionately high attrition.

You can perform queries on attrition within particular roles by removing the portion of the query about categories and replacing this with employments.employer.title:"Title of Job".

Additionally of note is that above we’re working through the attrition of a particular hiring cohort(s) (pre-2016 hires). Obtaining a raw look at attrition over a time period is a simpler query.

In the case of Netflix, they’ve performed the bulk of their hiring since 2016. So total attrition numbers may be more informative than looking at a 2016 baseline.

The query format for obtaining a list of all individuals who have left an employer since a specific date is as follows:

type:Person employments.{employer.name:"Netflix" to>"2016-01-01" has:to}

This query results in 7,555 person entities returned. And what we’re looking at here are individuals employed at any point after 2016 for Netflix who have left.
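To run the same three queries for another employer or date, the query strings can be templated. The sketch below simply reproduces the queries shown in this guide with the employer name and date as parameters. The function names are ours, and the strings are meant to be pasted into the search interface or passed to the API as the query.

```python
def baseline_query(employer, as_of):
    """People employed at `employer` on the `as_of` date (YYYY-MM-DD)."""
    return (
        f'type:Person employments.{{employer.name:"{employer}" '
        f'from<"{as_of}" or(to>"{as_of}", not(has:to))}}'
    )


def cohort_departures_query(employer, as_of):
    """Members of the `as_of` cohort who have since left."""
    return (
        f'type:Person employments.{{employer.name:"{employer}" '
        f'from<"{as_of}" to>"{as_of}" has:to}}'
    )


def all_departures_query(employer, since):
    """Everyone who has left `employer` after `since`, whenever they joined."""
    return (
        f'type:Person employments.{{employer.name:"{employer}" '
        f'to>"{since}" has:to}}'
    )


def with_category_facet(query):
    """Append the employment-category facet used in this guide."""
    return query + " facet:employments.categories.name"
```

For example, `with_category_facet(all_departures_query("Netflix", "2016-01-01"))` yields the query behind the job function breakdown of departures.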

The same facet query used above for this query shows us turnover is largely among performers and entertainment roles, followed by management and design.

Job function counts of employees who have left Netflix since 2016

So there we have it! The ability to calculate attrition and tenure for individuals working at any of the hundreds of millions of organizations within the Knowledge Graph. For hiring data, note that you can invert from and to dates to see new additions to organizations.

Looking for more examples of market intelligence, competitive intelligence, and firmographic Knowledge Graph queries? Be sure to check out our guide to market intelligence search queries!

17 Uses of Natural Language Processing (NLP) In Business Settings

The Library of Alexandria was the pinnacle of the ancient world’s recorded knowledge. It’s estimated that it contained the scroll equivalent of 100,000 books. This was the culmination of thousands of years of knowledge that made it into the records of the time. Today, the Library of Congress holds much the same distinction, with over 170M items in its collection.

While impressive, those 170M items digitized could fit onto a shelf in your basement. Roughly ten 12-terabyte hard drives could contain the entirety.

For comparison, the average data center of today (there are 7.2M of them at last count) takes up an average of 100,000 square feet, with nearly every foot filled with storage.

With this much data, there’s no army of librarians in the whole world who could organize them…

Natural language processing refers to technologies and techniques that take unorganized data and provide meaning and structure at scale. Imagine taking a stack of documents on your desk, making them searchable, sortable, prioritizing them, or generating summaries for each. These are the sort of tasks natural language processing supports in business and research settings.

At Diffbot, we see a wide range of use cases using our benchmark-topping Natural Language API. We’ll work through some of these use cases as well as others supported by other technologies below.

Sentiment Analysis

These days, it seems as if nearly everyone online has an opinion (and is willing to share it widely). The velocity of social media, support ticket, and review data is astounding, and many teams have sought solutions to automate the understanding of these exchanges.

Sentiment analysis is one of the most widespread uses of natural language processing. This process involves determining how “positive” or “negative” a given text is. Common uses for sentiment analysis are wide ranging and include:

  • Buyer risk
  • Supplier risk
  • Market intelligence
  • Product intelligence (reviews)
  • Social media monitoring
  • Underwriting
  • Support ticket routing
  • Investment intelligence

While no natural language processing task is foolproof, studies show that analysts tend to agree with top-tier sentiment analysis services close to 85% of the time.

One categorical difference between sentiment analysis providers is that some provide a sentiment score for entire documents, while some providers can give you the sentiment of individual entities within the text. A second important factor about entity-level sentiment involves knowing how central an entity is to understanding the text. This measure is commonly called the “salience” of an entity.
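To make the difference concrete, here is a small Python sketch of how entity-level sentiment and salience might be used together. The record shape, the score ranges, and the sample values are all hypothetical; they are not the response format of any particular provider.

```python
from dataclasses import dataclass


@dataclass
class EntityMention:
    name: str
    sentiment: float  # assumed range: -1.0 (negative) to 1.0 (positive)
    salience: float   # assumed range: 0.0 (peripheral) to 1.0 (central)


def central_entities(entities, min_salience=0.5):
    """Keep entities central to the text, most salient first."""
    kept = [e for e in entities if e.salience >= min_salience]
    return sorted(kept, key=lambda e: e.salience, reverse=True)


# Made-up output for a single product review.
review = [
    EntityMention("Acme Router", sentiment=-0.8, salience=0.9),
    EntityMention("customer support", sentiment=0.6, salience=0.6),
    EntityMention("Tuesday", sentiment=0.0, salience=0.1),
]

if __name__ == "__main__":
    for e in central_entities(review):
        print(f"{e.name}: sentiment={e.sentiment:+.1f}, salience={e.salience:.1f}")
```

A single document-level score would blur the negative view of the product with the positive view of support. Filtering by salience also drops incidental mentions that say little about what the text is about.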

Text Classification

Text classification can refer to a process internal to natural language processing tools in which text is grouped into related words and prepared for further analysis. Additionally, text (topic) classification can refer to user-facing output with broader business use.

The uses of text (topic) classification include ticket or call routing, news mention tracking, and providing contextuality to other natural language processing outputs. Text classification can function as an “operator” of sorts, routing requests to the person best suited to solve the issue.

Studies have shown that the average support worker can only handle around 20 support tickets a day. Text classification can dramatically decrease the time before tickets reach the right support team member as well as provide this team member with context to solve an issue quickly. Salesforce has noted that 69% of high-performing support teams are considering the use of AI for ticket routing.

Additionally, you can think of text classification as one “building block” for understanding what is going on in bulk unstructured text. Text classification processes may also trigger additional natural language processing through identifying languages or topics that should be analyzed in a particular way.

Chatbots & Virtual Assistants

Loved by some, despised by others, chatbots form a viable way to direct informational conversations towards self service or human team members.

While historical chatbots have relied on makers plotting out ‘decision trees’ (e.g. a flow chart pattern where a specific input yields a specific choice), natural language processing allows chatbot users several distinct benefits:

  • The ability to input a nuanced request
  • The ability to type a request in informal writing
  • More intelligent judgment on when to hand off a call to an agent

As the quality of chatbot interactions has improved with advances in natural language processing, consumers have grown accustomed to dealing with them. The number of consumers willing to deal with chatbots doubled between 2018 and 2019. And more recently it has been reported that close to 70% of consumers prefer to deal with chatbots for answers to simple inquiries.

Text Extraction (Mining)

Text extraction is a crucial functionality in many natural language processing applications. This functionality involves pulling out key pieces of information from unstructured text. Key pieces of information could be entities (e.g. companies, people, email addresses, products), relationships, specifications, references to laws or any other mention of interest. A second function of text extraction can be to clean and standardize data. The same entity can be referenced in many different ways within a text, as pronouns, in shorthand, as grammatically possessive, and so forth.

Text extraction is often a “building block” for many other more advanced natural language processing tasks.

Text extraction plays a critical role in Diffbot’s AI-enabled web scraping products, allowing us to determine which pieces of information are most important on a wide variety of pages without human input as well as pull relevant facts into the world’s largest Knowledge Graph.

Machine Translation

Few organizations of size don’t interface with global suppliers, customers, regulators, or the public at large. “Human in the loop” global news tracking is often costly and reliant on recruiting individuals who can read all of the languages that could provide actionable intelligence for your organization.

Machine translation allows these processes to occur at scale, and refers to the natural language processing task of converting natural text in one language to another. This relies on understanding the context, being able to determine entities and relationships, as well as understanding the overall sentiment of a document.

While some natural language processing products center their offerings around machine translation, others simply standardize their output to a single language. Diffbot’s Natural Language API can take input in English, Chinese, French, German, Spanish, Russian, Japanese, Dutch, Polish, Norwegian, Danish or Swedish and standardize output into English.

Text Summarization

Text summarization is one of a handful of “generative” natural language processing tasks. Reliant on text extraction, classification, and sentiment analysis, text summarization takes a set of input text and summarizes it. Perhaps the most commonly utilized example of text summarization occurs when search results highlight a particular sentence within a document to answer a query.

Two main approaches are used for text summarization. The extraction approach finds one or more sentences within a text that it believes coherently summarize the main points of the document. The abstraction approach actually rewrites the input text, removing points it believes are less important and rephrasing to reduce length.
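As a rough illustration of the extraction approach, the Python sketch below scores each sentence by the average document-wide frequency of its words and returns the top-scoring sentences in their original order. This is a deliberately naive heuristic with a tiny hand-picked stopword list, not how any production summarizer (ours included) works.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
             "is", "was", "it", "from", "for", "with"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Very rough sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        return sum(freq[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    doc = ("Diffbot builds a knowledge graph from the web. "
           "The knowledge graph links companies, people, and articles. "
           "Lunch was good.")
    print(extractive_summary(doc))
```

An abstraction approach would instead generate new sentences, which typically requires a trained language model rather than a scoring heuristic like this one.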

The primary benefit of text summarization is saving time for end users. In cases like question answering in support or search, consumers utilize text summarization daily. Technical, medical, and legal settings also utilize text summarization to give a quick high-level view of the main points of a document.

Market Intelligence

Check out a media monitoring dashboard that combines Diffbot’s web scraping, Knowledge Graph, and natural language processing products above!

The range of data sources on consumers, suppliers, distributors, and competitors makes market intelligence incredibly ripe for disruption via natural language processing. Web data is a primary source for a wide range of inputs on market conditions, and the ability to provide meaning while absolving individuals from the need to read all underlying documents is a game changer.

Applied with web crawling, natural language processing can provide information on key market happenings such as mergers and acquisitions, key hires, funding rounds, new office openings, and changes in headcount. Other common market intelligence uses include sentiment analysis of reviews, summarization of financial, legal, or regulatory documents, among other uses.

Intent Classification

Intent classification is one of the most revenue-centered and actionable applications of natural language processing. In intent classification the input is direct communications from a prospect or customer. Using machine learning, intent classification tools can rate how “ready to buy” a given individual is during an interaction. This can prompt sales and marketing outreach, special offers, cross-selling, up-selling, and help with lead scoring.

Additionally, intent classification can help to route inquiries aimed at support or general queries like those related to billing. The ability to infer intentions and needs without even needing to prompt discussion members to answer specific questions enables a faster and more frictionless experience for service providers and customers.

Urgency Detection

Urgency detection is related to intent classification, but with less focus on where a text indicates a writer is within a buying process. Urgency detection has been successfully used in cases such as law enforcement, humanitarian crises, and health care hotlines to “flag up” text that indicates a certain urgency threshold.

Because urgency detection is just one method — among others — in which communications can be routed or filtered, low or no supervision machine learning can often be used to prepare these functions. In instances in which an organization does not have the resources to field all requests, urgency detection can help them to prioritize the most urgent.

Speech Recognition

In today’s world of smart homes and mobile connectivity, speech recognition opens up the door to natural language processing away from written text. By focusing on high fidelity speech-to-text functionality, the range of documents that can be fed to natural language processing programs expands dramatically.

In 2020, an estimated 30% of all searches held a voice component. Applying the natural language processing techniques detailed in the other points in this guide is a huge opportunity for organizations providing speech-related capabilities.

Search Autocorrect and Autocomplete

Search autocorrect and autocomplete may be the area where most individuals encounter natural language processing most readily. In recent years, search on many ecommerce and knowledge base sites has been entirely rethought. The ability to quickly identify intent and pair it with an appropriate response can lead to better user experience, higher conversion rates, and more end data about what users want.

While 96% of major ecommerce sites employ autocorrect and/or autocomplete, major benchmarks find that close to 30% of these sites have severe usability issues. For some of the largest traffic volume sites on the web, this is a major opportunity to employ quality predictive search using cutting-edge natural language processing.

Social Media Monitoring

Of all media sources online, social can be the most overwhelming in velocity, range of tone and conversation type. Global organizations may need to field or monitor requests in many languages, on many platforms. Additionally, social media can provide useful inputs into external issues that may affect your organization, from geopolitical strife, to changing consumer opinion, to competitor intelligence.

On the customer service and sales fronts, 79% of consumers expect brands to respond within a day on social media requests. Recent studies have shown that across industries only 29% of brands regularly hit this mark. Additionally, the cost of finding new customers is 7x that of keeping existing customers, leading to increased need for intent monitoring and natural language processing of social media requests.

Web Data Extraction

Rule-based web data extraction simply doesn’t scale past a certain point. Unless you know the structure of a web page in advance (many of which are changing constantly), rules specified for which information is relevant to extract will break. This is where natural language processing comes into play.

Organizations like Diffbot apply natural language processing for web data extraction. By training natural language processing models around what information is likely useful by page type (e.g. product page, profile page, article page, discussion page, etc.), we can extract web data without pre-specified rules. This leads to resiliency in web crawling as well as enables us to expand the number of pages we can extract data from. This ability to crawl across many page types and continuously extract facts is what powers our Knowledge Graph. Interested in web data extraction? Be sure to check out our automatic extraction APIs or pre-extracted firmographic, demographic, and article data within our Knowledge Graph.

Machine Learning

See how ProQuo AI utilizes our web sourced Knowledge Graph to speed up predictive analytics

While machine learning is often an input to natural language processing tools, the output of natural language processing tools can also jumpstart machine learning projects. Using automatically structured data from the web can help you skip time-consuming and expensive annotation tasks.

We routinely see our Natural Language API as well as Knowledge Graph data — both enabled with natural language processing technology — utilized to jump start machine learning exercises. There are few training data sets as large as public web data. And the range of public web data types and topics makes it a great starting point for many, many machine learning journeys.

Threat Detection

See how FactMata uses Diffbot Knowledge Graph data to detect fake news and threats online

For platforms or other text data sources with high velocity, natural language processing has proven to be a good first line of defense for flagging hate speech, threatening speech, or false claims. The ability to monitor social networks and other locations at scale allows for the identification of networks of “bad actors” and a systemic protection from malicious actors online.

We’ve partnered with multiple organizations to help combat fake news with our natural language processing API, site crawlers, and Knowledge Graph data. Whether as a source for live structured web data or as training data for future threat detection tools, the web is the largest source of written harmful or threatening communications. This makes it the best location for training effective natural language processing tools used by non-profits, governmental bodies, media sites looking to police their own content, and other uses.

Fraud Detection

Natural language processing plays multiple roles in fraud prevention efforts. The ability to structure product pages is utilized by large ecommerce sites to seek out duplicate and fraudulent product offerings. Secondly, structured data on organizations and key members of these organizations can help to detect patterns in illicit activity.

Knowledge graphs — one possible output of natural language processing — are particularly well suited for fraud detection because of their ability to link distinct data types. Just as human research-enabled fraud investigations “piece together” information from varying sources and on various entities, Knowledge Graphs allow for machine accumulation of similar information.

Native Advertising

For advertising embedded in other content, tracking what context provides the best setting for ad placement allows for systems to generate better and better ad placement. Using web scraping paired with natural language processing, information like the sentiment of articles, mentions of key entities as well as which entities are most central to the text can lead to better ad placement.

Many brands suffer from underperforming advertising spending as well as brand safety issues (placement in unsuitable locations), problems that natural language processing helps to address at scale.

A Less-biased Way to Discern Media Bias Using Knowledge Graph Enhanced AI

As it becomes increasingly difficult to separate what is real from what is virtual, it becomes increasingly important for us to have tools that measure the biases in the information that we consume everyday.  Bias has always existed, but as we spend more of our conscious hours online, media — rather than direct experience — is what overwhelmingly shapes our worldviews.  Various journalistic organizations and NGOs have studied media bias, producing charts like the following.

Source: Poynter Institute: Should you trust media bias charts?

Most of these methodologies rely on surveying panels of humans, which we know are incredibly biased. The methodologies of both producers of these annual media bias studies can be summarized as follows:

The leading producer of media political bias charts that score the degree to which media outlets lean politically to the left vs. right notes about their methodology:

Keep in mind that this ratings system currently uses humans with subjective biases to rate things that are created by other humans with subjective biases and place them on an objective scale.

Ad Fontes Media

How do we avoid our own biases (or the biases of a panel of humans) when studying bias?  It is well known by now that AI systems (read: statistical models learned from data) trained on human-supplied labels reflect the biases of those human judgements encoded in the data.  How do we avoid asking humans to judge the biases of the articles?

Answer: by building a system that (a) defines the target output with an objective statement and (b) combines independent AI components that are trained on tasks that are orthogonal to the bias scoring task. Here’s what a system we built at Diffbot to score political bias of media outlets looks like:

We can define, via the input parameters, the desired output of the system as the sentiment towards the Republican Party (Diffbot entity ID: EQux7TYFDMgO6n_OByeSXzg) minus the sentiment towards the Democratic Party (Diffbot entity ID: EsAK1CigZMFeqk72s5EidGQ). These entities refer to the Republican and Democratic political parties in the United States. The beauty of this objective definition of system output is that you can modify the definition by varying the inputs to produce bias scores along any other political bias spectrum (e.g. Libertarian-Authoritarian, or the multi-party variations in your local country), and the system can produce new scores along that spectrum without another bias-prone re-surveying of humans.

The two AI components of the system are (a) a named entity recognizer and (b) a sentiment analyzer.

The named entity recognizer is trained to find subjects and objects in English and link them to Uniform Resource Identifiers (URIs) in the Diffbot Knowledge Graph. The entity recognizer knows nothing of the political bias task and isn’t trained on examples of political/non-political text. What that model learns is the syntax of English, which positions in a sentence constitute a subject or object, and which entity a span of text refers to. The Republican Party and Democratic Party are just two unremarkable entities out of billions of possible entities in the Diffbot Knowledge Graph that the NER system could link to.

The sentiment analyzer is a model that is trained to determine whether a piece of text is positive or negative, but it also knows nothing about political bias nor has it seen anything in its training set specific to political entities. This model is merely learning how we in general express negativity or positivity.  For example,  “I like puppies!” is a sentence that indicates the author has positive sentiment towards puppies. “I’m bearish on crypto” is a sentence that indicates the author has negative sentiment towards cryptocurrencies.

By combining these two independent systems, neither of which has seen the political bias task or has training data that was gathered for that purpose, we can build a system that calculates the bias in text along a spectrum defined by any two entities. We ran an experiment by querying the Diffbot Knowledge Graph for content from the mainstream media outlets and ran the bias detector on the 17,468,963 resulting articles to produce the Diffbot Media Bias Chart, below.
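The scoring rule itself is simple enough to sketch in a few lines of Python. The two entity IDs are the ones given above; everything else is an assumption made for illustration. In particular, the shape of each mention record is hypothetical, and averaging sentiment across mentions (with 0.0 for an entity that never appears) is just one plausible way to aggregate, not necessarily what our pipeline does.

```python
from statistics import mean

REPUBLICAN_PARTY = "EQux7TYFDMgO6n_OByeSXzg"
DEMOCRATIC_PARTY = "EsAK1CigZMFeqk72s5EidGQ"


def mean_sentiment(mentions, entity_id):
    """Average sentiment over all mentions linked to entity_id (0.0 if none)."""
    scores = [m["sentiment"] for m in mentions if m["entity_id"] == entity_id]
    return mean(scores) if scores else 0.0


def bias_score(mentions, entity_a=REPUBLICAN_PARTY, entity_b=DEMOCRATIC_PARTY):
    """Sentiment towards entity_a minus sentiment towards entity_b."""
    return mean_sentiment(mentions, entity_a) - mean_sentiment(mentions, entity_b)


# Made-up output of the entity recognizer plus sentiment analyzer for one outlet.
sample_mentions = [
    {"entity_id": REPUBLICAN_PARTY, "sentiment": -0.5},
    {"entity_id": REPUBLICAN_PARTY, "sentiment": -0.25},
    {"entity_id": DEMOCRATIC_PARTY, "sentiment": -0.125},
]

if __name__ == "__main__":
    print(bias_score(sample_mentions))
```

Because the two poles are plain parameters, scoring along a different spectrum only means passing a different pair of entity IDs.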

There are some interesting insights:

  • There’s an overall negativity bias to news. There’s truth to the old adage that the front page of the newspaper reports on the worst things that’ve happened around the world that day. The news reports on heinous crimes, pandemics, disaster, and corruption. This overall negativity bias dominates any left-right political bias. However, there is also clearly a per-outlet bias that ranges from heavily critical (reason.com, realclearpolitics.com) to a subdued slight negativity (npr.org, huffpost.com).
  • There is often a characterization of political bias among news outlet rivals that compete for your media attention and advertising dollars, e.g. the CNN/Fox News rivalry, but both are actually rather centrist relative to the other outlets.  The data does not support a bi-modal distribution of political bias–that is, one cluster on the left and another cluster on the right, but rather something that looks more like a normal distribution–a large centrist cluster, with few outlets at the extremes.  This may have to do with the fact that the business model of media ultimately competes for large audiences.  

Of course, there is no perfectly unbiased methodology for calculating a political bias score, but we hope that this approach spurs more research into developing new methods for how AI can help detect human biases. We showed that two AI components that solve orthogonal problems (named entity recognition and sentiment analysis) can be composed to build a single system whose goal isn’t to replicate human judgement, but to do it better.

You can download the full dataset for the above experiment here and reproduce your own bias chart along any sentiment spectrum by using the Diffbot Natural Language API.


[1] https://www.poynter.org/fact-checking/media-literacy/2021/should-you-trust-media-bias-charts/

[2] https://adfontesmedia.com/how-ad-fontes-ranks-news-sources/

[3] https://www.allsides.com/media-bias/media-bias-rating-methods

Diffbot Partners with Avast to Improve Consumer Online Privacy

Excited to make public our collaboration with Avast Software, now the world’s largest Antivirus security company, which is using Diffbot, the world’s largest Knowledge Graph, to improve the online privacy of consumers around the world. The average internet user visits 94 web pages each day, and each site includes various trackers and lengthy legal terms that are impossible for the average person to fully read and understand the implications of. We’re using AI to improve online privacy–by using machines to read all of the privacy policies on the entire web and making every company’s privacy posture transparent.
Working with the Avast team has also been a great example of corporate-startup collaboration, oft sought-after by corporate innovation groups, but rarely achieved. It’s been a pleasure to observe a team of ML engineers from different companies coming together to solve a common problem of societal importance, and shipping code. 
In addition to integrating this into Avast products, we plan to publish our privacy insights in a series of blog posts and hope to make available the underlying datasets for academic and industry privacy research groups.

Full details: https://blog.avast.com/avast-and-diffbot-collaboration-avast

No News Is Good News – Monitoring Average Sentiment By News Network With Diffbot’s Knowledge Graph

Ever have the feeling that news used to be more objective? That news organizations — now media empires — have moved into the realm of entertainment? Or that a cluster of news “across the aisle” from your beliefs is completely outrageous?

Many have these feelings, and coverage is rampant on bias and even straight up “fake” facts in news reporting.

With this in mind, we wanted to see if these hunches are valid. Has news gotten more negative over time? Is it a portion of the political spectrum driving this change? Or is it simply that bad things happen in the world and later get reported on?

To jump into this inquiry we utilized Diffbot’s Knowledge Graph. Diffbot is one of the few North American organizations to crawl the entire web. We apply AI-enabled web scrapers to pages that are publicly available to extract entities — think people, places, or things — and facts — think job titles, topics, and funding rounds.

We started our inquiry with some external coverage on bias in journalism provided by AllSides Media Bias Ratings.

Continue reading

Generating B2B Sales Leads With Diffbot’s Knowledge Graph

Generation of leads is the single largest challenge for up to 85% of B2B marketers.

Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.

So what’s the deal? 🤔

Firmographic, demographic, and technographic data (components of quality market segmentation) are spread across the web. And even once they’re pulled into our workflows they’re often siloed, still only semi-structured, or otherwise disconnected. Data brokers provide data that gets stale more quickly than quality curated web sources.

But the fact persists, all the lead generation data you typically need is spread across the public web.

You just need someone (or something 🤖) to find, read, and structure this data.

Continue reading

Towards A Public Web Infused Dashboard For Market Intel, News Monitoring, and Lead Gen [Whitepaper]

It took Google knowledge panels one month and twenty days to update following the appointment of a new CEO at Citi, an F100 company. In Diffbot’s Knowledge Graph, a new fact was logged within the week, with zero human intervention and sourced from the public web.

The CEO change at Citi was announced in September 2020, highlighting the reliance on manual updates to underlying Wiki entities.

In many studies data teams report spending 25-30% of their time cleaning, labelling, and gathering data sets [1]. While the number 80% is at times bandied about, an exact percentage will depend on the team and is to some degree moot. What we know for sure is that data teams and knowledge workers generally spend a noteworthy amount of their time procuring data points that are available on the public web.

The issue at play here is that the public web is our largest, and overall most reliable, source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”

Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without the full extent of underlying data (or skirting TOS), they just aren’t meant to be part of a data pipeline [2].

With that said, there’s still a world of valuable data on the public web.

At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially-available Knowledge Graph. For more custom needs, we harness our automatic extraction APIs pointed at specific domains, or our natural language processing API in tandem with the KG.

In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data from a selection of sites of interest, entire web crawls, or in tandem with additional natural language processing to build impactful and insightful dashboards par excellence.

Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.

Continue reading