The Top 50 Most Underrated Startups as Told by AI

While Diffbot’s Knowledge Graph has historically offered revenue values for publicly-held companies, we recently computed an estimated revenue value for 99.7% of the 250M+ organizations in the KG.

What does this mean?

Most organizations are privately-held, and thus have no public revenue reporting requirement. Diffbot has utilized our unrivaled long-tail organization coverage to create a machine learning-enabled estimated revenue field. This field looks at the myriad fact types we’ve extracted and structured from the public web and infers revenue from a range of signals.

Estimated revenue is just that… a machine learning-enabled estimate. But with a training set the size of our Knowledge Graph, we’ve found that a great majority of our revenue values are actually quite accurate.

How can I use estimated revenue?

Revenue — even if estimated — is a huge marker for determining size and valuation. In its absence it’s hard to effectively segment organizations. We see this field used in market intelligence, finance, and investing use cases. And it’s as simple as filtering organizations using the revenue.value field.
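
As a rough idea of what that filter can look like in practice, here is a minimal Python sketch that queries the Knowledge Graph for organizations above a revenue threshold. The endpoint, parameters, and query syntax below are assumptions to double-check against the current DQL documentation; the point is simply the shape of the filter.

```python
import requests

# Hypothetical sketch: query the Diffbot Knowledge Graph for organizations with
# revenue (reported or estimated) above $10M. The endpoint, parameters, and
# field names are assumptions to verify against Diffbot's DQL documentation.
DIFFBOT_TOKEN = "YOUR_DIFFBOT_TOKEN"

response = requests.get(
    "https://kg.diffbot.com/kg/v3/dql",
    params={
        "token": DIFFBOT_TOKEN,
        "type": "query",
        "query": "type:Organization revenue.value>10000000",
        "size": 25,
    },
)
for result in response.json().get("data", []):
    entity = result.get("entity", {})
    print(entity.get("name"), entity.get("revenue", {}).get("value"))
```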

Where Does Diffbot Get Its Data?

Diffbot is one of only a handful of organizations to crawl the entire web. We apply NLP and machine vision to crawled web pages to find entities and facts about them. These entities are consolidated in the world’s largest Knowledge Graph along with data provenance, linkages between entities, and additional computed fields (like sentiment, or estimated revenue). In this ranking we looked at organization entities. But organization entities are just the “tip of the iceberg” for Diffbot data, which comprises articles, products, people, events, and many other entity types.

The Top Coding Bootcamps For Founders According To The Knowledge Graph

Last week we took a look at the top universities for female founders. In our results, we noted that our web-reading AI associates tech bootcamp attendance with education, and a large cluster of founders attended specific universities in conjunction with bootcamps.

New to the Knowledge Graph? Diffbot’s Knowledge Graph is constructed by crawling a vast majority of the web and structuring data on pages using NLP and machine vision. The end result is one of the world’s largest databases of organizations, people, articles, products and more, all linked and with data provenance.

To return results from the Knowledge Graph, you submit queries that filter which entities are returned. In this case we queried the Knowledge Graph for individuals who:

  1. Attended an educational institution with the name of a top bootcamp
  2. Have held a job title including “CEO,” “chief executive officer,” or “founder”

We then returned a facet (summary) view of how many of these individuals attended each bootcamp.
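
As a rough sketch of what that query can look like, here is a hypothetical DQL-style version built up in Python. The bootcamp names are placeholders, and the field names, boolean syntax, and facet syntax are assumptions to verify against the Knowledge Graph documentation; the point is the shape of the two filters plus the facet.

```python
# Hypothetical DQL-style query: people who attended a given bootcamp and have
# held a founder/CEO title, faceted by institution. Field names, boolean
# syntax, and facet syntax are assumptions; bootcamp names are placeholders.
bootcamp_names = ["App Academy", "Hack Reactor", "General Assembly"]

bootcamp_filter = " OR ".join(
    f'educations.institution.name:"{name}"' for name in bootcamp_names
)
query = (
    f"type:Person ({bootcamp_filter}) "
    '(employments.title:"CEO" OR employments.title:"founder") '
    "facet:educations.institution.name"
)
print(query)
```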

The Best Schools For Female Founders According To The Knowledge Graph

Upon seeing Crunchbase’s annual ranking of the best schools for graduating entrepreneurs, we wanted to see how our Knowledge Graph results stack up.

The Diffbot Knowledge Graph is sourced from crawling a majority of the web and extracting entities and facts using NLP and machine vision.

Two prominent entity types are person and organization entities. When paired together, they enable powerful observations sourced from across the web. In this exploration we returned all person entities within the Knowledge Graph who are currently founders and who are female. We filtered to make sure each organization had at least some publicly disclosed funding, and then we took a look at a summary view of which schools these founders had attended. You can check out the Knowledge Graph query here with a free trial.
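
A hypothetical sketch of that query, following the same pattern as the bootcamp example above (field names are again assumptions to verify against the ontology docs), might look roughly like this:

```python
# Hypothetical DQL-style sketch: female founders, faceted by school attended.
# Field names (gender, employments.title, educations.institution.name) are
# assumptions; the publicly disclosed funding filter is omitted here because
# its exact field name should be checked against the ontology documentation.
query = (
    "type:Person "
    'gender:"female" '
    'employments.title:"founder" '
    "facet:educations.institution.name"
)
print(query)  # submit the same way as the revenue sketch earlier
```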

While the top schools for female founders were consistent with Crunchbase’s coverage, you may wonder why the numbers vary so dramatically. Crunchbase’s ranking this year looked at 2019-2020 graduates, and Crunchbase’s data is centered around tech and startup firmographics. While Diffbot’s Knowledge Graph certainly has firmographic details on tech-centered companies, our database of organizations is much wider ranging (250M+ orgs at last count). This means our list includes founders of all sorts of endeavors: non-profits, artistic organizations, medical organizations, and tech companies, to name a few.

Monitoring Large Food Retailer Investments With The Knowledge Graph

A few weeks ago we published a view into Big Tech investments by industry. In this post we’ll take a similar look at the largest food retailers.

Panning out a bit, there are over 250M organizations within the Knowledge Graph. To obtain this list of large food retailers we first narrowed our search to food retailers with more than 1,000 employees. This query surfaces more than 7,000 fact-rich entities.

From there we simply sorted the results by number of employees to surface the largest food retailers, including Walmart, Target, Tesco, Kroger, Carrefour, and Safeway.

With this list in mind, we looked for organizations that had received investment from one of these retailers. Bounded by calendar year, we then returned a summary view of which industries the invested-in companies represented. If you have a subscription or free trial, feel free to check out the resulting query.
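
Pieced together, the workflow might look roughly like the following DQL-style sketch. The industry label, the nbEmployees field, and especially the investor-relationship field are assumptions to verify against the Knowledge Graph ontology; the sketch is meant only to show the two-step shape of the analysis.

```python
# Step 1 (hypothetical): food retailers with more than 1,000 employees,
# sorted by employee count to surface the largest ones. The industry label
# and the nbEmployees field name are assumptions to verify.
retailers_query = (
    'type:Organization industries:"Food Retailers" '
    "nbEmployees>1000 "
    "sortBy:nbEmployees"
)

# Step 2 (hypothetical): organizations invested in by one of those retailers,
# summarized by industry. The investor-relationship field name in particular
# is an assumption to check against the Knowledge Graph ontology.
investments_query = (
    'type:Organization investments.investors.name:"Walmart" '
    "facet:industries"
)
print(retailers_query)
print(investments_query)
```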

Startup Revenue By County With Diffbot’s Knowledge Graph

What can you do with billions of web-sourced facts on hundreds of millions of organizations? Beyond analyzing the facts themselves, you (or a machine of your choice) can learn a lot. Historically, our Knowledge Graph has had one of the largest collections of publicly-disclosed organization revenue. Recently, we’ve applied machine learning processes across many org fields to estimate revenue for private organizations as well.

Using the Knowledge Graph to Segment Big Tech Investments By Industry

Every big tech investment is big news. If your firm raises a funding round with prestigious investors or is acquired, you better bet you’ll spread the news far and wide.

But where can you go for this information en masse? Even covering a handful of big investors over a handful of years can lead to a list of thousands of invested-in firms. And a list of firms by itself isn’t that useful. Sure, some big names pop out. But how do you see what “plays” big tech is making?

That’s where our web-reading bots come in. Working through billions of web pages with NLP and machine vision, they build Diffbot’s Knowledge Graph, the largest public-web-sourced database of organizations, articles, people, products, and events. For each entity — organizations, articles, people, etc. — facts are vetted and accumulated to create a filterable, searchable database of “things.” So when we wanted to check out which industries big tech has invested in over the last decade, we knew right where to turn. No analyst middlepersons, just public web data structured into a market intel-rich format.

Big Tech Investment By Industry 2010-2021

Distribution of industries of organizations invested in by Facebook, Alphabet, Amazon, Microsoft, Apple, and Netflix from 2010 to July 2021. Firmographic data sourced from Diffbot’s Knowledge Graph.
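
If you prefer to build this kind of summary yourself from raw query results rather than a facet view, the aggregation is simple. Below is an illustrative Python sketch; the record shape is an assumption standing in for whatever your Knowledge Graph export actually looks like.

```python
from collections import Counter

# Illustrative only: tally industries across organizations returned by an
# investments query. The record shape here is an assumption, not the exact
# Knowledge Graph response format.
invested_orgs = [
    {"name": "ExampleCo", "industries": ["Software Companies", "Fintech"]},
    {"name": "SampleWorks", "industries": ["Health Care Companies"]},
    {"name": "DemoLabs", "industries": ["Software Companies"]},
]

industry_counts = Counter(
    industry for org in invested_orgs for industry in org.get("industries", [])
)
for industry, count in industry_counts.most_common():
    print(f"{industry}: {count}")
```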

Generating B2B Sales Leads With Diffbot’s Knowledge Graph

Generation of leads is the single largest challenge for up to 85% of B2B marketers.

Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.

So what’s the deal? 🤔

Firmographic, demographic, and technographic data (the components of quality market segmentation) are spread across the web. And even once they’re pulled into our workflows, they’re often siloed, still only semi-structured, or otherwise disconnected. Data brokers provide data that goes stale more quickly than quality curated web sources.

But the fact persists: all the lead generation data you typically need is spread across the public web.

You just need someone (or something 🤖) to find, read, and structure this data.

Towards A Public Web Infused Dashboard For Market Intel, News Monitoring, and Lead Gen [Whitepaper]

It took Google knowledge panels one month and twenty days to update following the appointment of a new CEO at Citi, a Fortune 100 company. In Diffbot’s Knowledge Graph, the new fact was logged within the week, with zero human intervention, sourced from the public web.

The CEO change at Citi was announced in September 2020; the lag highlights knowledge panels’ reliance on manual updates to the underlying Wiki entities.

In many studies data teams report spending 25-30% of their time cleaning, labelling, and gathering data sets [1]. While the number 80% is at times bandied about, an exact percentage will depend on the team and is to some degree moot. What we know for sure is that data teams and knowledge workers generally spend a noteworthy amount of their time procuring data points that are available on the public web.

The issue at play here is that the public web is our largest and, overall, most reliable source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”

Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without the full extent of underlying data (or skirting TOS), they just aren’t meant to be part of a data pipeline [2].

With that said, there’s still a world of valuable data on the public web.

At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially-available Knowledge Graph. For more custom needs, we harness our automatic extraction APIs pointed at specific domains, or our natural language processing API in tandem with the KG.

In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data from a selection of sites of interest, entire web crawls, or in tandem with additional natural language processing to build impactful and insightful dashboards par excellence.

Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.

4 Ways Technical Leaders Are Structuring Text To Drive Data Transformations [Whitepaper]

Natural and unstructured language is how humans largely communicate. For this reason, it’s often the format of organizations’ most detailed and meaningful feedback and market intelligence. 

Natural language was historically impractical to parse at scale, but natural language processing has now hit mainstream adoption. The global NLP market is expected to grow 20% annually through 2026.

As a benchmark-topping natural language processing API provider, Diffbot is in a unique position to survey cutting-edge NLP uses. In this paper, we’ll work through the state of open source, cloud-based, and custom NLP solutions in 2021, and lay out four ways in which technical leaders are structuring text to drive data transformations. 

In particular, we’ll take a look at:

  • How researchers are using the NL API to create a knowledge graph for an entire country
  • How the largest native ad network in finance uses NLP to monitor topics of discussion and serve up relevant ads
  • The use of custom properties for fraud detection in natural language documents at scale
  • How the ability to train recognition of 1M custom named entities in roughly a day helps create better data

The 6 Biggest Difficulties With Data Cleaning (With Workarounds)

Data is the new soil.

David McCandless

If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time-consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.

At its simplest, data cleaning revolves around two opposing needs:

  • The need to amend data points that will skew the quality of your results
  • The need to retain as much of your useful data as you can

These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.

Perhaps the most important outcome of a data cleaning job is that the results are standardized so that analytics and BI tools can easily access any value, present the data in dashboards, or otherwise manipulate the data.
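
As a toy illustration of that balance, here is a short pandas sketch (column names and rules are invented for the example). It standardizes a revenue column so downstream tools can read it, amends values that would otherwise skew results, and drops only the rows that are truly unusable.

```python
import pandas as pd

# Toy example: standardize a revenue column while retaining as much data as possible.
df = pd.DataFrame({"org": ["A", "B", "C"], "revenue": ["$1,000", "2500", None]})

df["revenue"] = (
    df["revenue"]
    .str.replace(r"[$,]", "", regex=True)    # normalize formatting
    .pipe(pd.to_numeric, errors="coerce")    # unparseable values become NaN
)
df = df.dropna(subset=["revenue"])           # drop only rows with no usable value
print(df)
```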
