What can you do with billions of web-sourced facts on hundreds of millions of organizations? Beyond analyzing the facts themselves, you (or a machine of your choice) can learn a lot. Historically, our Knowledge Graph has had one of the largest collections of publicly-disclosed organization revenue. Recently, we’ve applied machine learning processes across many org fields to estimate revenue for private organizations as well.
Every big tech investment is big news. If your firm raises a funding round with prestigious investors or is acquired, you can bet you’ll spread the news far and wide.
But where can you go for this information en masse? Even covering a handful of big investors over a handful of years yields a list of thousands of invested-in firms. And a list of firms by itself isn’t that useful. Sure, some big names pop out. But how do you see what “plays” big tech is making?
That’s where our web-reading bots come in. By working through billions of web pages using NLP and machine vision, Diffbot’s Knowledge Graph is the largest public-web sourced database of organizations, articles, people, products, and events. For each entity — organizations, articles, people, and so on — facts are vetted and accumulated to create a filterable, searchable database of “things.” So when we wanted to check out which industries big tech has invested in over the last decade, we knew right where to turn. No analyst middlepersons, just public web data structured into a market intel-rich format.
Big Tech Investment By Industry 2010-2021
Yes, even the biggest leaders in market intelligence. Even us.
Some focus solely on startups. Some only on venture-backed companies. But you probably wouldn’t even know. Because most won’t (or can’t) tell you what their data is biased towards! 🤭
“We have over 10M companies in our database!” is a meaningless statement if you can’t tell whether the data is a representative sample of Indian restaurants in the world, or perhaps more realistically, what they just happened to scrape.
Unless we’re talking at least 200M+ unique organizations strong, you’re looking at a biased dataset. And that’s still a conservative minimum.
This is common knowledge for data buyers, who make up for the lack of a known bias by evaluating datasets for known, easily verifiable data, like the Fortune 1000.
Given enough evaluation feedback cycles, most organization data brokers end up biased towards the Fortune 1000.
If your target is enterprise B2B, you’re in luck. You can find that data anywhere. Just check your spam folder.
And if your market intelligence application needs the closest thing to a truly representative sample of global organizations, it might seem impossible.
For data brokers, it just doesn’t make any sense to boil the ocean. It’s cheaper and easier to focus data entry resources on a few markets and whatever coverage gap feedback they get from lost deals.
Even if they did manage to compile all the companies on Earth, they would have to do it over and over again to keep their records fresh.
It’s an absurd and impractical human labor cost to maintain. So no one employs hundreds of people just to enter org data. Not even us.
We employ machines instead, which crawl millions of publicly accessible websites, interpret raw text into data autonomously, and structure each detail into facts on every organization known to the public web.
Which, as it turns out, is our known bias.
Generation of leads is the single largest challenge for up to 85% of B2B marketers.
Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.
So what’s the deal? 🤔
Firmographic, demographic, and technographic data (the components of quality market segmentation) are spread across the web. And even once they’re pulled into our workflows, they’re often siloed, still only semi-structured, or otherwise disconnected. Data brokers provide data that gets stale more quickly than quality curated web sources.
But the fact persists, all the lead generation data you typically need is spread across the public web.
You just need someone (or something 🤖) to find, read, and structure this data.
It took Google knowledge panels one month and twenty days to update following the appointment of a new CEO at Citi, an F100 company. In Diffbot’s Knowledge Graph, a new fact was logged within the week, with zero human intervention, sourced from the public web.
The CEO change at Citi was announced in September 2020, highlighting how heavily knowledge panels rely on manual updates to the underlying Wiki entities.
In many studies, data teams report spending 25–30% of their time cleaning, labeling, and gathering data sets. While the number 80% is at times bandied about, the exact percentage depends on the team and is to some degree moot. What we know for sure is that data teams, and knowledge workers generally, spend a noteworthy amount of their time procuring data points that are available on the public web.
The issue at play here is that the public web is our largest and, overall, most reliable source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”
Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without access to the full extent of the underlying data (or without skirting TOS), they just aren’t meant to be part of a data pipeline.
With that said, there’s still a world of valuable data on the public web.
At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially-available Knowledge Graph. For more custom needs, we harness our automatic extraction APIs pointed at specific domains, or our natural language processing API in tandem with the KG.
In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data from a selection of sites of interest, entire web crawls, or in tandem with additional natural language processing to build impactful and insightful dashboards par excellence.
Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.
“We started with about 1,000 companies in the Employbl database, mostly in the Bay Area. Now with Diffbot we can expand to other cities and add thousands of additional companies.”
– Connor Leech, CEO @ Employbl
Fixing tech starts with hiring. And fixing hiring is an information problem. That’s what Connor Leech, cofounder and CEO at Employbl discovered when creating a new talent marketplace meant to connect tech employees with the information-rich hiring marketplace they deserve.
Tech job seekers rely on a range of metrics to gauge the opportunity and stability of a potential employer.
While information like funding rounds, founders, team size, industry, and investors is often public, it can be hard to grab the myriad fields candidates value in an up-to-date format from around the web.
These difficulties are amplified by the fact that many tech startups are “long tail” entities that also change regularly.
Hindsight is 20/20. And as we usher in a new president in what has been one of the most tumultuous years in American history, we can begin to see clarity about the forces that moved throughout our jobs, our lives, and our collective imagination.
Another way to put this is that over time we tend to have more context.
When our AI reads articles it pulls out quotes, and when it can it attributes a speaker to these quotes. As our crawlers traverse the entirety of the public web, sources of quotes are validated and over time some quotes circulate more than others.
When we perform a facet search, this lets us show something like a retweet count for the entire web. It answers questions like: whose voices are being heard? And which speakers are the most widely cited on a given topic?
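As a rough illustration of the aggregation behind that “retweet count,” here’s a minimal Python sketch that tallies how many articles circulate each attributed quote. The articles and quotes here are invented for the example; in practice this computation runs as a facet search over the Knowledge Graph, not client-side.

```python
from collections import Counter

# Invented stand-ins for article entities; each carries the (speaker, quote)
# pairs our NLP attributed while reading the page.
articles = [
    {"quotes": [("Jane Doe", "We must act now."), ("John Roe", "Markets are calm.")]},
    {"quotes": [("Jane Doe", "We must act now.")]},
    {"quotes": [("Jane Doe", "We must act now."), ("John Roe", "Markets are calm.")]},
]

# Tally how many distinct articles circulated each attributed quote --
# a rough "retweet count" for the open web.
circulation = Counter(pair for article in articles for pair in set(article["quotes"]))
top_quote, count = circulation.most_common(1)[0]
```

The `Counter` keys are (speaker, quote) pairs, so the same sentence said by two different people counts as two different quotes, mirroring the speaker attribution described above.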
To commemorate the end of an era, let’s take a look at a few of the most circulated statements of the last 365 days.
What were the 10 most circulated quotes across the web by President Joe Biden in the last 365 days?
2020 was undeniably the “Year of the Knowledge Graph.”
2020 was the year that Gartner put Knowledge Graphs at the peak of its hype cycle.
It was the year where 10% of the papers published at EMNLP referenced “knowledge” in their titles.
It was the year over 1000 engineers, enterprise users, and academics came together to talk about Knowledge Graphs at the 2nd Knowledge Graph Conference.
There are good reasons for this grass-roots trend. It isn’t any one company pushing it (ahem, I’m looking at you, Cognitive Computing), but rather a broad coalition of academics, industry vertical practitioners, and enterprise users who deal with building intelligent information systems.
Knowledge graphs represent what we hope the “next step” of AI looks like: intelligent systems that aren’t black boxes, but are explainable, that are grounded in the same real-world entities as us humans, and that are able to exchange knowledge with us through precise common vocabularies. It’s no coincidence that in the same year the deep learning revolution took off (2012), Google introduced the Google Knowledge Graph as a way to provide interpretability to its otherwise opaque search ranking algorithms.
The Risk Of Hype: Touted Benefits Don’t Materialize
The public web is chock full of indicators with implications for stock prices, commodities prices, supply chain issues, or the general perceived value of an entity. But how do you reliably get these market indicators?
You can search online… and slog through the most popular pages that all your competitors have also looked at. Or you can read a commentator’s take. And likely stay one step removed from the actual information you should be dealing in.
Or you could deal directly with all of the articles on the web. Each annotated with helpful fields you can filter through like sentiment scores, AI-generated topic tags, what country the article was published in, and many others. That’s where Diffbot’s Knowledge Graph (KG) comes in.
The news index of Diffbot’s KG is 50x the size of Google News’ index. And each article entity in the KG is populated with a rich set of fields you can use to actually search the entire web (not just the portion of the web that paid to get in front of you).
In this guide we’ll work through how to set up a global news monitoring query in the KG. And then schedule this query to repeat and email you when new articles surface.
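As a preview, a monitoring query of this kind is just a string of DQL clauses, so it can be composed programmatically. The sketch below is illustrative only: the attribute names (`tags.label`, `publisherCountry`, `sentiment`) follow the pattern of Diffbot’s article ontology, but you should verify exact field names against the ontology reference before using them.

```python
def news_monitor_query(topic, country, min_sentiment):
    """Compose a DQL article query string.

    The attribute names below (tags.label, publisherCountry, sentiment)
    are illustrative -- verify them against Diffbot's article ontology.
    """
    return (
        f'type:Article tags.label:"{topic}" '
        f'publisherCountry:"{country}" '
        f'sentiment>={min_sentiment}'
    )

# e.g. positive-sentiment semiconductor coverage published in the U.S.
query = news_monitor_query("Semiconductors", "United States", 0.2)
```

The same composed string can then be saved and scheduled in the KG UI, which is what the rest of the guide walks through.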
Organizations are one of our most popular standard entities in the Diffbot Knowledge Graph, for good reason. Behind 200M+ company data profiles is an architecture that enables incredibly precise search and summarization, allowing anyone to estimate the size of a market and forecast business opportunity in any niche.
- A Diffbot account – Sign up for a free trial or login to follow along.
- Some familiarity with DQL (Diffbot Query Language) will make for easier skimming.
Step 1 – Find Companies Like X
In a perfect world, every market and industry on the planet is neatly organized into well defined categories. In practice, this gets close, but not close enough, especially for niche markets.
What we’ll need instead is a combination of traits, including industry classifiers, keywords, and other characteristics that define companies in a market.
This is much easier to define by starting with companies we know that fit the bill. Think of it as searching for “companies like X”.
As an example, let’s start by finding companies like Bauducco, producer of this lovely Panettone cake. This is a market we’re hoping to sell, say, a commercial cake-baking oven into.
The closest definition of a market I might imagine for them is something like “packaged foods”. We could Google this term and get some really generic hits for “food and beverage companies”, or we can do better.
We’ll start by looking this company up in Diffbot’s Knowledge Graph with a simple name query:

type:Organization name:"Bauducco"
Next, click through the most relevant result to a company profile.
Now let’s gather everything on this page that describes a company like Bauducco.
Under the company summary, the closest descriptor to their signature Panettone is “cakes”. Note that.
Under industries, they might be involved in agriculture to some degree, but we’re not really looking for other companies that are involved in agriculture. “Food and Drink Companies” will do!
Now that we have these traits, let’s construct a search query with DQL:
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake")
Nearly 48,000 results! That’s a huge list of potential customers. Like the original Google search, it’s a bit too generic to work with. Unlike results from Google, though, we can segment this down as much as we’d like with just a few more parameters.
💡 Pro Tip: To see a full list of available traits to construct your query with, go to enhance.diffbot.com/ontology.
Step 2 – Remove Irrelevant Traits
What I’m first noticing is that there are a lot of international brands on this list. I’m interested in selling to companies like Bauducco in the U.S., so let’s trim this list to just companies with a presence in the United States.
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake") locations.country.name:"United States"
Note that there are two “location” attributes. A singular and a plural version. The plural version (“locations”) will match all known locations of a company. The singular version (“location”) will only match the known headquarters of a company.
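A toy sketch of the difference, with invented records standing in for KG organization entities:

```python
# Invented records mirroring the singular "location" (headquarters) and
# plural "locations" (all known offices) attributes described above.
orgs = [
    {"name": "AcmeFoods",   "location": "United States", "locations": ["United States"]},
    {"name": "GlobalCakes", "location": "Brazil",        "locations": ["Brazil", "United States"]},
    {"name": "EuroBake",    "location": "France",        "locations": ["France"]},
]

# location:"United States" -- matches headquarters only
hq_in_us = [o["name"] for o in orgs if o["location"] == "United States"]

# locations.country.name:"United States" -- matches any known office
present_in_us = [o["name"] for o in orgs if "United States" in o["locations"]]
```

GlobalCakes is headquartered in Brazil but has a U.S. office, so it matches the plural filter and not the singular one; that’s exactly the distinction between the two attributes.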
Down to 8800 results. Much better. We’re not really interested in ice cream companies in this market either (after all, we’re selling a baking oven), so we’ll use the not() operator to filter ice cream companies out.
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake") not(description:"ice cream") locations.country.name:"United States"
Let’s also say our oven is really only practical for large operations of at least 100 employees. We’ll add a minimum employee threshold to our query.
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake") not(description:"ice cream") locations.country.name:"United States" nbEmployeesMin>=100
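Since DQL clauses are joined by spaces (and implicitly ANDed), the refinements above can be scaffolded programmatically. A minimal sketch — just string composition, not a Diffbot client:

```python
def build_query(base, filters):
    """Join a base DQL clause with additional filter clauses.

    Space-separated DQL clauses are implicitly ANDed together.
    """
    return " ".join([base] + filters)

base = 'type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake")'
filters = [
    'not(description:"ice cream")',             # exclude ice cream companies
    'locations.country.name:"United States"',   # any U.S. presence
    'nbEmployeesMin>=100',                      # large operations only
]
query = build_query(base, filters)
```

This reproduces the full query above, and makes it easy to regenerate variants (different countries, different employee thresholds) for other segments.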
262 results. Now we’re really getting somewhere. Let’s stop here to calculate our total addressable market.
Step 3 – Calculate Total Addressable Market
To calculate TAM, we simply multiply the number of potential customers by the annual contract value of each customer.
TAM = Number of Potential Customers x Annual Contract Value
At a $1M average contract value with 262 potential customers, our TAM is approximately $262M.
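The arithmetic as a one-liner, for anyone scripting this across many segments:

```python
def total_addressable_market(num_customers, annual_contract_value):
    # TAM = potential customers x annual contract value
    return num_customers * annual_contract_value

tam = total_addressable_market(262, 1_000_000)  # $262M
```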
This is just a starting point, of course; we’ll still want to assess existing competition, pricing sensitivity, and how much of the existing market would be willing to switch for our unique value proposition. We’ll leave that for another day.
Try replicating these steps for a market of your choosing. The ability to filter and summarize practically any field in the ontology provides limitless potential for market and competitive intelligence.
Need some inspiration? Here are some additional examples: