The Top 50 Most Underrated Startups as Told by AI

While Diffbot’s Knowledge Graph has historically offered revenue values for publicly held companies, we recently computed an estimated revenue value for 99.7% of the 250M+ organizations in the KG.

What does this mean?

Most organizations are privately held, and thus have no public revenue reporting requirement. At Diffbot, we’ve used our unrivaled long-tail organization coverage to create a machine learning-enabled estimated revenue field. This field looks at the myriad fact types we’ve extracted and structured from the public web and infers a revenue figure from a range of signals.

Estimated revenue is just that… a machine learning-enabled estimate. But with a training set the size of our Knowledge Graph, we’ve found that a great majority of our revenue values are actually quite accurate.

How can I use estimated revenue?

Revenue, even if estimated, is a huge marker for determining size and valuation. In its absence, it’s hard to effectively segment organizations. We see this field used in market intelligence, finance, and investing use cases. And it’s as simple as filtering organizations using the revenue.value field.
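As a rough illustration of that kind of filter, the sketch below segments a few organization records by a nested revenue.value field. The records, the revenue figures, and the helper names (get_path, filter_by_min_revenue) are hypothetical and invented for illustration. Only the revenue.value field name comes from the text above, and this is plain Python over local data rather than a call to the Knowledge Graph itself.

```python
from typing import Any

# Hypothetical sample records; names and figures are invented for illustration.
SAMPLE_ORGS = [
    {"name": "Acme Gaskets", "revenue": {"value": 12_000_000, "currency": "USD"}},
    {"name": "Tiny Nonprofit", "revenue": {"value": 250_000, "currency": "USD"}},
    {"name": "No Revenue Data Inc"},
]


def get_path(record: dict, path: str) -> Any:
    """Walk a dotted path such as 'revenue.value'; return None if any key is missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def filter_by_min_revenue(orgs: list, min_value: float) -> list:
    """Keep organizations whose revenue.value is present and at least min_value."""
    kept = []
    for org in orgs:
        value = get_path(org, "revenue.value")
        if value is not None and value >= min_value:
            kept.append(org)
    return kept


large_orgs = filter_by_min_revenue(SAMPLE_ORGS, 1_000_000)
print([org["name"] for org in large_orgs])
```

Records with no revenue value are skipped rather than treated as zero, which keeps missing data from being mistaken for a tiny company.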

Where Does Diffbot Get Its Data?

Diffbot is one of only a handful of organizations to crawl the entire web. We apply NLP and machine vision to crawled web pages to find entities and facts about them. These entities are consolidated in the world’s largest Knowledge Graph along with data provenance, linkages between entities, and additional computed fields (like sentiment, or estimated revenue). In this ranking we looked at organization entities. But organization entities are just the “tip of the iceberg” for Diffbot data, which comprises articles, products, people, events, and many other entity types.

The Top Coding Bootcamps For Founders According To The Knowledge Graph

Last week we took a look at the top universities for female founders. In our results, we noted that our web-reading AI associates tech bootcamp attendance with education, and a large cluster of founders attended specific universities in conjunction with bootcamps.

New to the Knowledge Graph? Diffbot’s Knowledge Graph is constructed by crawling a vast majority of the web and structuring data on pages using NLP and machine vision. The end result is one of the world’s largest databases of organizations, people, articles, products and more, all linked and with data provenance.

To return results from the Knowledge Graph, you submit queries that filter which entities are returned. In this case we queried the Knowledge Graph to return individuals who:

  1. Attended an educational institution with the name of a top bootcamp
  2. Have held a job title including “CEO,” “chief executive officer,” or “founder”

We then returned a facet (summary) view of how many of these individuals attended each bootcamp.
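The two filters and the facet view above can be sketched in a few lines of Python. This is a minimal sketch over invented local data, not the actual Knowledge Graph query: the bootcamp names, the person records, and the helper names (is_founder, bootcamp_facet) are all hypothetical.

```python
from collections import Counter

# Hypothetical inputs: bootcamp names and person records are invented for illustration.
BOOTCAMPS = {"Example Code Academy", "Sample Dev Bootcamp"}
FOUNDER_TERMS = ("ceo", "chief executive officer", "founder")

PEOPLE = [
    {"name": "A", "schools": ["Example Code Academy", "State University"], "titles": ["Co-Founder"]},
    {"name": "B", "schools": ["Sample Dev Bootcamp"], "titles": ["Software Engineer"]},
    {"name": "C", "schools": ["Example Code Academy"], "titles": ["CEO", "Engineer"]},
    {"name": "D", "schools": ["State University"], "titles": ["Founder"]},
]


def is_founder(person: dict) -> bool:
    """True if any job title contains one of the founder terms (simple substring match)."""
    return any(
        term in title.lower()
        for title in person["titles"]
        for term in FOUNDER_TERMS
    )


def bootcamp_facet(people: list) -> Counter:
    """Count founders per bootcamp; each person counts once per bootcamp attended."""
    counts = Counter()
    for person in people:
        if not is_founder(person):
            continue
        for school in set(person["schools"]) & BOOTCAMPS:
            counts[school] += 1
    return counts


facet = bootcamp_facet(PEOPLE)
print(facet.most_common())
```

The substring match is deliberately loose so that titles like "Co-Founder" are included. A stricter match would undercount, while this one can overcount on titles that merely contain a founder term.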


The Best Schools For Female Founders According To The Knowledge Graph

Upon seeing Crunchbase’s annual ranking of the best schools for graduating entrepreneurs, we wanted to see how our Knowledge Graph results stack up.

The Diffbot Knowledge Graph is sourced from crawling a majority of the web and extracting entities and facts using NLP and machine vision.

Two prominent entity types are person and organization entities. When the two are paired, powerful observations sourced from across the web become possible. In this exploration we returned all person entities within the Knowledge Graph who are currently founders and who are female. We filtered to make sure each founder’s organization had at least some publicly disclosed funding, and then we took a look at a summary view of which schools these founders had attended. You can check out the Knowledge Graph query here with a free trial.

While the top schools for female founders were consistent with Crunchbase’s coverage, you may wonder why the numbers vary so dramatically. Crunchbase’s ranking this year looked at 2019-2020 graduates, and Crunchbase’s data is centered around tech and startup firmographics. While Diffbot’s Knowledge Graph certainly has firmographic details on tech-centered companies, our database of organizations is much wider ranging (250M+ orgs at last count). This means our list includes founders of all sorts of endeavors: non-profits, artistic organizations, medical organizations, and tech companies, to name a few.


Monitoring Large Food Retailer Investments With The Knowledge Graph

A few weeks ago we published a view into Big Tech investments by industry. In this post we’ll take a similar look at the largest food retailers.

Panning out a bit, there are over 250M organizations within the Knowledge Graph. To obtain this list of large food retailers we first narrowed our search to food retailers with more than 1,000 employees. This query surfaces more than 7,000 fact-rich entities.

From there we simply sorted the results by number of employees to surface the largest food retailers, including Walmart, Target, Tesco, Kroger, Carrefour, and Safeway.
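The narrow-then-sort step can be sketched as follows. This is an illustration over invented local records, not the query we ran: the organization names, employee counts, field names, and the largest_in_industry helper are all hypothetical.

```python
# Hypothetical records; names, field names, and employee counts are invented for illustration.
ORGS = [
    {"name": "Retailer A", "industries": ["Food Retailers"], "employees": 500_000},
    {"name": "Corner Shop", "industries": ["Food Retailers"], "employees": 12},
    {"name": "Retailer B", "industries": ["Food Retailers", "Retail"], "employees": 80_000},
    {"name": "Software Co", "industries": ["Software"], "employees": 5_000},
]


def largest_in_industry(orgs: list, industry: str, min_employees: int) -> list:
    """Keep orgs in the industry with more than min_employees, largest first."""
    matches = [
        org for org in orgs
        if industry in org["industries"] and org["employees"] > min_employees
    ]
    return sorted(matches, key=lambda org: org["employees"], reverse=True)


largest = largest_in_industry(ORGS, "Food Retailers", 1_000)
print([org["name"] for org in largest])
```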

With this list in mind, we looked for organizations that had received investment from one of these retailers. Bounded by calendar years, we then returned a summary view that looked at which industries the invested-in companies represented. If you have a subscription or free trial, feel free to check out the resulting query.
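For a sense of what that summary view computes, here is a minimal sketch that groups investments by calendar year and counts the industries of the invested-in companies. Every record, name, and field here is hypothetical and invented for illustration, as is the industries_by_year helper.

```python
from collections import Counter, defaultdict

# Hypothetical investor list and investment records, invented for illustration.
INVESTORS = {"Retailer A", "Retailer B"}

INVESTMENTS = [
    {"investor": "Retailer A", "company": "Delivery Startup", "industry": "Logistics", "year": 2019},
    {"investor": "Retailer B", "company": "Farm Tech Co", "industry": "Agriculture", "year": 2019},
    {"investor": "Retailer A", "company": "Route Planner", "industry": "Logistics", "year": 2020},
    {"investor": "Other Fund", "company": "Snack Maker", "industry": "Food Production", "year": 2020},
    {"investor": "Retailer B", "company": "Old Deal", "industry": "Logistics", "year": 2009},
]


def industries_by_year(investments: list, investors: set, start_year: int, end_year: int) -> dict:
    """Count invested-in industries per year for the given investors, years inclusive."""
    summary = defaultdict(Counter)
    for inv in investments:
        if inv["investor"] in investors and start_year <= inv["year"] <= end_year:
            summary[inv["year"]][inv["industry"]] += 1
    # Convert to plain dicts, ordered by year, for easy printing or export.
    return {year: dict(counts) for year, counts in sorted(summary.items())}


summary = industries_by_year(INVESTMENTS, INVESTORS, 2010, 2021)
print(summary)
```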

Startup Revenue By County With Diffbot’s Knowledge Graph

What can you do with billions of web-sourced facts on hundreds of millions of organizations? Beyond analyzing the facts themselves, you (or a machine of your choice) can learn a lot. Historically, our Knowledge Graph has had one of the largest collections of publicly-disclosed organization revenue. Recently, we’ve applied machine learning processes across many org fields to estimate revenue for private organizations as well.


Using the Knowledge Graph to Segment Big Tech Investments By Industry

Every big tech investment is big news. If your firm raises a funding round with prestigious investors or is acquired, you better bet you’ll spread the news far and wide.

But where can you go for this information en masse? Even covering a handful of big investors over a handful of years can lead to a list of thousands of invested-in firms. And a list of firms by itself isn’t that useful. Sure, some big names pop out. But how do you see what “plays” big tech is making?

That’s where our web-reading bots come in. By working through billions of web pages using NLP and machine vision, Diffbot’s Knowledge Graph is the largest public-web sourced database of organizations, articles, people, products, and events. For each entity — organization, articles, people, etc. — facts are vetted and accumulated to create a filterable, searchable database of “things.” So when we wanted to check out which industries big tech has invested in over the last decade, we knew right where to turn. No analyst middlepersons, just public web data structured into a market intel-rich format.

Big Tech Investment By Industry 2010-2021

Distribution of industries of organizations invested in by Facebook, Alphabet, Amazon, Microsoft, Apple, and Netflix from 2010 to July 2021. Firmographic data sourced from Diffbot’s Knowledge Graph.

Every Company That Sells Organization Data is Biased

Yes, even the biggest leaders in market intelligence. Even us.

Some focus solely on startups. Some only on venture-backed companies. But you probably wouldn’t even know. Because most won’t (or can’t) tell you what their data is biased towards! 🤭

“We have over 10M companies in our database!” is a meaningless statement if you can’t tell whether the data is a representative sample of Indian restaurants in the world, or perhaps more realistically, what they just happened to scrape.

Unless we’re talking at least 200M+ unique organizations strong, you’re looking at a biased dataset. And that’s still a conservative minimum.

This is common knowledge for data buyers, who make up for the lack of a known bias by evaluating datasets for known, easily verifiable data, like the Fortune 1000.

Given enough evaluation feedback cycles, most organization data brokers end up biased towards the Fortune 1000.

If your target is enterprise b2b, you’re in luck. You can find that data anywhere. Just check your spam folder.

If it’s anything even remotely more niche, like rubber gasket manufacturers or global non-profits focused on relieving poverty, you’re probably scraping this data yourself off a conference site.

And if your market intelligence application needs the closest thing to a truly representative sample of global organizations, it might seem impossible.

For data brokers, it just doesn’t make any sense to boil the ocean. It’s cheaper and easier to focus data entry resources on a few markets and whatever coverage gap feedback they get from lost deals.

Even if they did manage to compile all the companies on Earth, they would have to do it over and over again to keep their records fresh.

It’s an absurd and impractical human labor cost to maintain. So no one employs hundreds of people just to enter org data. Not even us.

We employ machines instead, which crawl millions of publicly accessible websites, interpret raw text into data autonomously, and structure each detail into facts on every organization known to the public web.

Which, as it turns out, is our known bias.

Generating B2B Sales Leads With Diffbot’s Knowledge Graph

Generation of leads is the single largest challenge for up to 85% of B2B marketers.

Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.

So what’s the deal? 🤔

Firmographic, demographic, and technographic data (components of quality market segmentation) are spread across the web. And even once they’re pulled into our workflows they’re often siloed, still only semi-structured, or otherwise disconnected. Data brokers provide data that gets stale more quickly than quality curated web sources.

But the fact remains: all the lead generation data you typically need is spread across the public web.

You just need someone (or something 🤖) to find, read, and structure this data.


Towards A Public Web Infused Dashboard For Market Intel, News Monitoring, and Lead Gen [Whitepaper]

It took Google knowledge panels one month and twenty days to update following the appointment of a new CEO at Citi, an F100 company. In Diffbot’s Knowledge Graph, a new fact was logged within the week, with zero human intervention and sourced from the public web.

The CEO change at Citi was announced in September 2020. The lag highlights the knowledge panels’ reliance on manual updates to underlying Wiki entities.

In many studies data teams report spending 25-30% of their time cleaning, labelling, and gathering data sets [1]. While the number 80% is at times bandied about, an exact percentage will depend on the team and is to some degree moot. What we know for sure is that data teams and knowledge workers generally spend a noteworthy amount of their time procuring data points that are available on the public web.

The issue at play here is that the public web is our largest and, overall, most reliable source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”

Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without the full extent of underlying data (or skirting TOS), they just aren’t meant to be part of a data pipeline [2].

With that said, there’s still a world of valuable data on the public web.

At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially-available Knowledge Graph. For more custom needs, we harness our automatic extraction APIs pointed at specific domains, or our natural language processing API in tandem with the KG.

In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data from a selection of sites of interest, entire web crawls, or in tandem with additional natural language processing to build impactful and insightful dashboards par excellence.

Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.


Is Christian Bale a Christian? Is Mitt Romney a glove?

Download This Dataset of 12,118 Yahoo Answers for $1

With only 2 weeks left till May 4th (be with you), the internet is bursting with excitement over all the work that needs to be done before Yahoo Answers finally 404s.

From scheduling a 2nd COVID vaccine to your annual panic attack at missing the tax filing deadline (you probably didn’t, it was extended to May 17 in the U.S.), there is nothing short of a lengthy agenda for everyone ahead of the shutdown of this iconic website.
