Articles by: Merrill Cook

Using Diffbot’s Knowledge Graph For Fundraising

The primary Knowledge Graph use cases we see center around market intelligence, ecommerce, news monitoring, and machine learning. With that said, similar datasets and analysis techniques can yield a different set of organizations and individuals: investors.

The bedrock of investigating investments and potential investors within the Knowledge Graph is the investments field attached to organization entities. This field has a few components, all of which can yield useful data for market intelligence, investing, or funding searches. In particular, the following sub-fields can be useful:

  • Investment amount
  • Investment currency
  • Investment date
  • Names and DiffbotURIs for investing orgs
  • “Importance” of investing orgs
  • What series of funding rounds were raised
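To make these sub-fields concrete, here is a minimal Python sketch of how one exported investment record might be modeled. The class and attribute names are illustrative assumptions, not the Knowledge Graph's actual schema, so check an exported record before relying on them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Investor:
    # Hypothetical shape: a display name plus the entity's DiffbotURI.
    name: str
    diffbot_uri: str
    importance: Optional[float] = None  # "importance" score, if present


@dataclass
class Investment:
    # Hypothetical attribute names mirroring the sub-fields listed above.
    amount: Optional[float]
    currency: Optional[str]
    date: Optional[str]    # ISO date string, e.g. "2021-03-15"
    series: Optional[str]  # e.g. "Seed", "Series A"
    investors: List[Investor] = field(default_factory=list)


def total_raised(investments: List[Investment], currency: str = "USD") -> float:
    """Sum known amounts in one currency, skipping undisclosed rounds."""
    return sum(
        inv.amount
        for inv in investments
        if inv.amount is not None and inv.currency == currency
    )
```

Keeping amount and currency optional reflects that many rounds are reported without a disclosed value.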

There are three basic motions that can yield insights for fundraising.

  1. Look at the specifics of investments in orgs similar to your own (e.g. ‘who invests in battery tech companies who are expanding in Asia?’)
  2. If orgs similar to your own don’t have many investments, look for orgs your org could be similar to in the future. Who invested in these orgs?
  3. Once you have a set of investing organizations, can you discern actionable intel? Who might you reach out to? What do these organizations write about? What are their focus areas? How would you pitch them?

Investors today operate globally, and to answer the above questions on this scale you’ll need a tool that can aggregate relationships between global organizations as well as monitor news from around the world (and potentially in many languages). Our Knowledge Graph is well suited to both tasks.

Who Invests In Companies Like Mine?

To show how the Knowledge Graph might be used in fundraising scenarios, let’s start with a hypothetical scenario. You’re an alternative energy company based in Arizona, and you want to expand throughout the region.

First off, let’s get a list of regional alternative energy companies. If you aren’t concerned with the specific state or nation, you can utilize the near parameter to look within a specific radius.

type:Organization industries:"Renewable Energy Companies" near[500mi](name:'Phoenix')

This query returns over 2,400 renewable energy companies within 500 miles of Phoenix, Arizona. This is likely too many companies to manually look through. So you’ll likely want to perform some facet searches to get a summary view of what is in this dataset.
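If you would rather run this query from a script than from the dashboard, the request might be assembled as in the sketch below. The endpoint URL and the parameter names are assumptions to verify against Diffbot's API documentation, and YOUR_TOKEN is a placeholder. The helper names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against Diffbot's current API documentation.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def near_query(industry: str, city: str, radius_miles: int) -> str:
    """Build the radius query shown above for any industry and city."""
    return (
        f'type:Organization industries:"{industry}" '
        f'near[{radius_miles}mi](name:"{city}")'
    )


def request_url(query: str, token: str, size: int = 25) -> str:
    """Return a GET URL for the query (parameter names are assumptions)."""
    params = {"type": "query", "token": token, "query": query, "size": size}
    return f"{DQL_ENDPOINT}?{urlencode(params)}"


query = near_query("Renewable Energy Companies", "Phoenix", 500)
print(query)
print(request_url(query, token="YOUR_TOKEN"))
```

Building the URL separately from sending it makes it easy to inspect the encoded query before spending any API credits.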

Adding facet:nbEmployeesMax provides a summary view of the number of employees of these organizations. It looks like this specific set of organizations primarily falls into three sizes: 100-500 FTEs, 10-20 FTEs, or 50-100 FTEs. While these clusters could be explained by the type of renewable energy product each company makes (e.g. software vs. large physical installations), they also align with common headcounts associated with particular funding rounds. A company with 10-20 FTEs may be at the bootstrap, seed, or angel round stage. One with 50-100 FTEs may have raised a series A funding round, while one with 100-500 FTEs may be multiple funding rounds in.

In this hypothetical you have 20 employees, and need funding to expand your operations and grow into one of the slightly larger renewable energy companies. So let’s dig into the 50-100 FTE cluster.

type:Organization industries:"Renewable Energy Companies" near[500mi](name:"Phoenix") nbEmployeesMax>=50 nbEmployeesMax<=100

The above query yields 236 organizations, a decent sample size from which to investigate past funding trends.
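The headcount bounds are the part of this query you are most likely to adjust, so a small helper can generate them. This is a sketch only: range_filter is a hypothetical name, and the >= and <= syntax is simply copied from the query above.

```python
def range_filter(field, minimum=None, maximum=None):
    """Return 'field>=min field<=max', omitting whichever bound is None."""
    parts = []
    if minimum is not None:
        parts.append(f"{field}>={minimum}")
    if maximum is not None:
        parts.append(f"{field}<={maximum}")
    return " ".join(parts)


base = (
    'type:Organization industries:"Renewable Energy Companies" '
    'near[500mi](name:"Phoenix")'
)
print(f"{base} {range_filter('nbEmployeesMax', 50, 100)}")
```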

From here we can look at a summary of the organizations that invested in these organizations by adding an investor facet to the end of the query. For this group of companies, only three investors have invested in multiple renewable energy companies in our list, and 9 investing organizations are present in total. If you need a larger list for outreach, you could try altering or removing the nbEmployeesMax filters. (Removing nbEmployeesMax returns >25 results, with 9 organizations that have invested in more than one of this set of renewable energy companies.)
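If you export the results instead of faceting in the dashboard, spotting investors who back more than one company is a small counting exercise. The sketch below uses made-up placeholder names rather than real Knowledge Graph results, and the input shape (company name mapped to a list of investor names) is an assumption about how you might arrange an export.

```python
from collections import Counter

# Placeholder data: company -> investor names. These are made-up labels,
# not results from the Knowledge Graph.
investors_by_company = {
    "Company A": ["Investor X", "Investor Y"],
    "Company B": ["Investor X"],
    "Company C": ["Investor Y", "Investor Z"],
    "Company D": ["Investor X", "Investor X"],  # two rounds, same investor
}


def repeat_investors(mapping, min_companies=2):
    """Return investors backing at least `min_companies` distinct companies."""
    counts = Counter()
    for investors in mapping.values():
        # set() so several rounds in one company are counted once
        counts.update(set(investors))
    return {name: n for name, n in counts.items() if n >= min_companies}


print(repeat_investors(investors_by_company))
```

Counting distinct companies rather than rounds keeps a single heavily funded company from making an investor look like a serial backer of the sector.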

This list of 9 investors could be your jumping off point for the third stage of this inquiry below. Or you could continue investigating to explore other angles for generating a list of potential investors.

What Similar Orgs Receive Investments?

Jumping to the second angle of inquiry we outlined in the intro, we can begin to look at the characteristics of organizations who gain investment in this industry. But first, let’s gain some insight into what types of investments have been attained by our similar organizations.

type:Organization industries:"Renewable Energy Companies" near[500mi](name:"Phoenix") facet:investments.series

The above query returns sizable groupings for “series unknown,” “post IPO equity,” “seed,” “debt financing,” and “grant.” In our hypothetical our organization isn’t close to an IPO and is perhaps beyond the seed funding stage. So let’s exclude organizations at these stages. One way to do this is to check what funding stage an organization is currently in. Organizations in series B have already gone through series A, which means they would show up in searches for both series A and series B funding round recipients. By using the isCurrent field we can look at organizations currently in a given stage of funding.

type:Organization industries:'Renewable Energy Companies' near[500mi](name:"Phoenix") investments.{series:or('Series Unknown','Debt Financing','Grant','Equity Crowdfunding','Series A') isCurrent:true}

The above query returns 16 companies, a nice middle ground for some aggregation of values with the potential to deep dive into each.
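The nested investments filter is easy to mistype by hand, so it can help to generate it from a list of series names. The function names below are hypothetical, and the output simply mirrors the syntax of the query above.

```python
def or_clause(values):
    """Format values as an or('a','b',...) expression."""
    quoted = ",".join(f"'{v}'" for v in values)
    return f"or({quoted})"


def current_series_filter(series):
    """Build the investments.{series:... isCurrent:true} clause used above."""
    # Doubled braces emit literal { and } inside an f-string.
    return f"investments.{{series:{or_clause(series)} isCurrent:true}}"


stages = ["Series Unknown", "Debt Financing", "Grant",
          "Equity Crowdfunding", "Series A"]
print(current_series_filter(stages))
```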

By looking at results on our map view, we can see two clusters of activity. As in many industries, investment is higher in specific locales. In this case, Henderson/Las Vegas, Nevada and Phoenix, Arizona.

Two useful fields for summarizing a group of organizations are descriptors and industries.

Those respective queries can be seen below:

type:Organization industries:"Renewable Energy Companies" near[500mi](name:"Phoenix") investments.{series:or('Series Unknown','Debt Financing','Grant','Equity Crowdfunding','Series A') isCurrent:true} facet:industries


type:Organization industries:"Renewable Energy Companies" near[500mi](name:"Phoenix") investments.{series:or('Series Unknown','Debt Financing','Grant','Equity Crowdfunding','Series A') isCurrent:true} facet:descriptors

Within the industries facet query, we predictably see that these organizations are both “energy” and “renewable energy companies.” We can also see that solar power in particular, as well as manufacturing, tends to be most commonly invested in.

Within descriptors we can jump to specifics that are more granular than an entire industry. In this case perhaps our hypothetical organization is already involved in building or energy storage (or is considering an expansion into these areas). The descriptors facet can validate that similar organizations have received investment, and surface an even more targeted list of organizations to deep dive into.

In order to shorten this list of organizations to only those who are described as working in energy storage and building, we could add a descriptors filter to our query.

type:Organization industries:"Renewable Energy Companies" near[500mi](name:"Phoenix") investments.{series:or('Series Unknown','Debt Financing','Grant','Equity Crowdfunding','Series A') isCurrent:true} descriptors:or('Energy Storage','Building')

The above query surfaces 8 organizations that are beyond seed funding, not yet close to an IPO, provide energy storage and building services within the renewable energy industry, and are regional to Phoenix, Arizona. With a targeted list this size we can begin to look at each and every investor manually.

Investigating A Targeted List Of Investors

Now that we have a targeted list of organizations we can grab a list of all their investors. One route to quickly generate the list of investors is to simply add facet:investments.investors.diffbotURI to the end of the query. Another route is to export the investor fields into CSV.

The fields we may find of interest include investments_amount and investments_investor_diffbotUri. It is also worth referencing the size and summary of the invested-in organizations to verify they are similar enough to your own firmographics.

DiffbotURIs are unique identifiers for entities in the Knowledge Graph. In the event entities have similar or identical names, DiffbotURIs are a more precise way to reference the actual organization of interest and disambiguate.

Once you have this list of DiffbotURIs, you can string them together into an “or” statement for organization, article, and person entity analysis. In our case there are 18 investors, 11 of which are unique. If you were looking for a serial investor in this space, the repeats are promising: you can dig into which of these organizations have invested in more than one company on our 8 company target list.

We can start by simply returning the list of investors with the following query:

type:organization diffbotUri:or('','','','','','','','','','','')
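Deduplicating the exported investor list and stringing it into an “or” statement can be scripted. In this sketch the URI values are placeholders (the real ones are not reproduced here) and the function names are hypothetical.

```python
def unique_in_order(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def investors_query(uris):
    """Build a diffbotUri:or(...) organization query from a list of URIs."""
    quoted = ",".join(f"'{u}'" for u in unique_in_order(uris))
    return f"type:Organization diffbotUri:or({quoted})"


# Placeholder identifiers standing in for exported DiffbotURIs.
exported = ["URI_A", "URI_B", "URI_A", "URI_C", "URI_B"]
print(len(exported), "rows,", len(unique_in_order(exported)), "unique")
print(investors_query(exported))
```

Keeping the original order means the first investors in your export stay first in the query, which makes the result easier to compare against the spreadsheet.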

A quick view of the entities mapped shows that few of these organizations are regional, meaning you may not need to limit your investor search by region.

A second search we can perform is to look at all organizations who have been invested in by these 11 investors to surface their broader interests. We can then facet through location and industry.

type:Organization investments.investors.diffbotUri: or('','','','','','','','','','','') facet:industries

The largest industry clusters for investments from these organizations include software, energy, manufacturing, renewable energy, solar, and computer hardware.

By clicking through any one of these facet results, you can see a list of companies invested in within that specific industry. For example, clicking through solar energy companies yields over 200 companies invested in by this cohort. This can be used to provide another view of the types of observations surfaced in the first and second sections of this guide.

A second facet query around the location of invested-in organizations can be useful to start focusing on which investors tend to invest within the region. We can filter by organizations in states located in the Southwest and then facet by investor to get a view of which of these investors invests the most in Texas and Arizona. While the below query is quite lengthy, the basics are simple: pass in the DiffbotURIs of specific investors, then bound the facet at the end of the query (the DiffbotURIs inside of the square brackets) so it only returns results about the same set of investors.

type:Organization investments.investors.diffbotUri: or('','','','','','','','','','','')"Texas","Arizona") facet['','','','','','','','','','','']:investments.investors.diffbotUri

This final view shows a clear winner: a DiffbotURI we identified in an earlier section as an investor within our targeted list of renewable energy companies, and which this view shows has invested in 70 companies in Texas and Arizona.

This DiffbotURI resolves to the New York State Energy Research and Development Authority, a public benefit corporation that may be a great candidate to look into for potential investment.

Armed with a single DiffbotURI (or a handful) we can look for news coverage of these entities, key individuals to reach out to, and more.

DiffbotURIs can show up as topical tags mentioned in articles. Tags are natural language processing-generated topics found in articles within our article index. They are available in content of every language and are presented in English.

The following query looks at articles we’ve identified as mentioning the New York State Energy Research and Development Authority. At present over 260 results are returned.

type:Article tags.uri:""

Using an ‘or’ statement similarly to prior queries we’ve worked through, we could also return a larger newsfeed of all of the investors we’re interested in. An alternative route to expanding your list of organizations is to utilize our similarTo query. Our machine learning-computed similarity scores are present for every unique pairing of Knowledge Graph organizations. The syntax for expanding your list of interesting orgs for news monitoring via similarTo would look like the following.

type:Organization similarTo(id:"EZgkYMhjPPHeIdxJRti6IYA")

The above returns 25 organizations most similar to our investor of interest.

Jumping back to useful article queries that start from a list of organizations, the sentiment field can be a powerful way to quickly surface actionable data. By adding sentiment>0 date<365d to our article query above we can see positive news about an entity over the last year. This can be used to quickly assess where industry successes and expansions are occurring.
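As a sketch, the positive-news variant of the article query can be produced by a one-line helper. The function name and the placeholder URI are assumptions, and the filter text is copied from the paragraph above.

```python
def positive_news_query(tag_uri, days=365):
    """Article query for positive-sentiment coverage in the last `days` days."""
    return f'type:Article tags.uri:"{tag_uri}" sentiment>0 date<{days}d'


# INVESTOR_DIFFBOT_URI is a placeholder for the investor's real DiffbotURI.
print(positive_news_query("INVESTOR_DIFFBOT_URI"))
```

Shortening the days argument gives a quick way to compare recent coverage against the full year.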

Finally, we can use the name(s) of our investor of interest to search through person entities connected to this entity. In this case, this could involve looking at hiring trends (e.g. an entity is expanding in the southwest, or with analysts related to a specific technology). It can also be used to discern the proper contacts in a use case like we’re describing in this guide. In our case, some of the useful fields we may wish to look at include:

  • Skills
  • Seniority
  • Role
  • New Hires
  • New Locations
  • Details Related To Personalization of Outreach
  • Among Others

While fundraising isn’t one of the most common uses for the Knowledge Graph we see, many organizations that understand the basic strengths of Knowledge Graph data do go on to use our data for a variety of uses. On one level, most tasks that require manually gathering information from the web for further analysis can be completed at a much larger scale within the Knowledge Graph.

If you enjoyed this guide and are looking for additional guides on market intelligence or news monitoring uses of the Knowledge Graph, grab a two-week free trial and check out our Knowledge Graph Getting Started Guide.

Dear Diffy, Find Me A Coworking Space

Disclaimer: this article is about a very mundane consumer search. With this said, how knowledge work and fact accumulation are often performed has wide-reaching implications for knowledge workflows.

The other day I was searching for coworking spaces.

As in many domains of knowledge, data coverage online was largely human curated. Lists with some undisclosed methodology provided the writer’s favorite coworking spots by city.

Sure, any major search engine will return a list of results plotted on a map. But I’m sure we’ve all run into the following.

  1. Load map…
  2. Pan slightly to surface more results…
  3. Zoom slightly to surface more results…
  4. Pan the opposite direction to try and find a result that had caught our eye…
  5. Try to recall the name that caught our eye in a new search…

Five steps to seek further data points on a single search result. Devoid of context, data provenance, and the ability to analyze at scale.

Sure, consumer search works in many, many cases. So do phone books.

If you’re a power user, a data hoarder, or a productivity buff, you can likely see the appeal of a search that actually returns comprehensive data. If you’re building an intelligent application or performing market intelligence, using search that won’t let you explore the underlying data is just a waste of time.

So after this predictable foray in which I ignored the advice of several articles, scrolled around a map, and got sidetracked once or twice, I decided to resort to a different sort of search: Diffbot’s Knowledge Graph.


  • The title of our article may not make much sense if you haven’t been acquainted with Diffy, Diffbot’s web-reading bot
  • You see the promise of external web data for many applications… if it were structured (or at least felt disappointment at consumer search engines keeping you from public web data)

Opening the Knowledge Graph, it took all of 20 seconds to return data on over 4,000 coworking spaces. And sure, unless you’re selling a service to coworking spaces, you may wonder why anyone would need all this data as a personal consumer…

4000+ coworking space entities in ~20s

Maybe it’s simple curiosity. Maybe it’s the principle of it all; the fact that all of this information is publicly available online, but not in a structured format. Maybe this is just an analogy for non-consumer searches that also can’t be performed on major search engines. Any way you take it, search of the present is flawed for many uses, and it’s still our primary collective data source.

So what does search in the Knowledge Graph look like?

Well it starts with entities.

Knowledge graphs are built around entities (think people, places, or things) and relationships between entities. The types of relationships that can occur between entities, and the types of facts attached to entities are prescribed by a schema. One of the major “selling points” for knowledge graphs is that they have flexible schemas. That is — more so than other types of databases — they can adapt to what types of facts matter out in the world.

The Importance of Structured Web Data

At their core knowledge graphs (the category of graphs) can be built from any underlying data set. In the case of Diffbot’s Knowledge Graph, it’s the world’s largest structured feed of web data. Diffbot is one of only a handful of organizations to crawl the web. And using machine vision and natural language processing we’re able to pull out mentions of entities as well as infer facts and relationships.

Why is this important?

The web is largely made up of unstructured or semi-structured data. This means you can’t easily filter, sort, or manipulate this data at scale. While the internet is our largest collective source of knowledge, it’s not organized for modern knowledge work.

Diffbot’s products center around organizing the world’s information, whether through our AI-enabled web scrapers, our Knowledge Graph, or our Natural Language API. The ability to source the information from the web in a structured way provides the bedrock for machine learning initiatives, market intelligence, news monitoring, as well as the monitoring of large ecommerce datasets.

The State of Coworking Spaces As Told By AI

So what can you learn from a coworking space dataset that’s much more explorable than consumer search?

It turns out a lot.

While each individual data point is available online, the data isn’t aggregated anywhere else in quite as explorable a format.

In our case we can start with a simple facet query. Faceted search provides a summary view of the value of one fact type attached to a set of entities. So with this sort of query we can quickly discover what locations have the most coworking spaces.
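Conceptually, a facet is a group-and-count over one field. The toy Python sketch below shows the idea on made-up placeholder entities. It illustrates what a facet computes, not how the Knowledge Graph implements it.

```python
from collections import Counter

# Placeholder entities with made-up names and cities, purely to show
# what a facet computes; these are not Knowledge Graph results.
spaces = [
    {"name": "Space 1", "city": "Austin"},
    {"name": "Space 2", "city": "Denver"},
    {"name": "Space 3", "city": "Austin"},
    {"name": "Space 4", "city": None},  # value missing for this entity
]


def facet(entities, field):
    """Count entities per value of `field`, most common first."""
    counts = Counter(e[field] for e in entities if e.get(field) is not None)
    return counts.most_common()


print(facet(spaces, "city"))
```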

By simply adding a facet we can turn over 4,000 unique results into an observation. While data found about these coworking spaces across the web would be in many different formats (and in many languages), knowledge graphs help to consolidate similar entities around standard fields.

An additional strength of knowledge graphs is that data points can be consolidated from many different sources with data provenance and then built off of. Using natural language processing and machine learning, fields can be computed or inferred from many underlying data sources. Our original query looked at organization entities with “coworking spaces” as part of their description. But an AI-generated field of “descriptors” allows for additional granularity. Let’s look at a facet view of the most common services offered by coworking spaces.

Depending on your experience with a range of coworking spaces, descriptors such as “expat,” “civil & social organization,” or “self improvement” may be novel. By amalgamating tens of thousands of online mentions, articles, and entries into this subset of org entities, the Knowledge Graph dramatically cuts down on time of fact accumulation.

One final area in which consumer search is severely lacking (or just impractical) is that of market research. Industry-specific events such as funding rounds, openings of new offices, key executive hires or departures, or clues as to private organization revenue can be hard to pinpoint across the web. Softer signals like sentiment around topics or velocity of news coverage can also be informative.

Diffbot’s article index is roughly 50x the size of Google News. Unlike traditional content channels, you aren’t presented with content that’s gamed the system or paid to get your attention. Additionally, where consumer search engines are siloed by language or location, Diffbot’s article index is pan-lingual. With articles augmented by additional filterable fields, the underlying articles can become unique observations on sentiment, key happenings, and more. All underlying article data is returned as well, so you can dig in once you’ve found an interesting angle.

For a deeper dive into creating custom news feeds around organizations and events be sure to check out our Knowledge Graph news monitoring test drive.


Maybe you don’t buy the segue from what really is a consumer search (“coworking spaces near me”) to the copious coworking data available in the Knowledge Graph. But the fact of the matter is that a great deal of knowledge work still relies on human fact accumulation. Without automated ways to structure unstructured data, there’s a definite floor to the cost per fact.

Knowledge graphs provide a bedrock for knowledge workflows reengineered from the ground up. In particular:

  • Knowledge graphs mirror what we care about “in the world” (entities and relationships)
  • Knowledge graphs provide flexible schemas allowing for fact types attached to entities to change over time (as the world changes)
  • Automated knowledge graphs provide one of the only feasible ways to structure market intel and news monitoring data that can be spread across the web
  • Knowledge graphs that don’t expose their underlying data aren’t suitable for use in intelligent applications or machine learning use cases
  • Knowledge graphs that provide additionally computed fields (sentiment, tags, inferences on revenue or events) provide additional value for market intelligence and news monitoring

No News Is Good News – Monitoring Average Sentiment By News Network With Diffbot’s Knowledge Graph

Ever have the feeling that news used to be more objective? That news organizations — now media empires — have moved into the realm of entertainment? Or that a cluster of news “across the aisle” from your beliefs is completely outrageous?

Many have these feelings, and coverage is rampant on bias and even straight up “fake” facts in news reporting.

With this in mind, we wanted to see if these hunches are valid. Has news gotten more negative over time? Is it a portion of the political spectrum driving this change? Or is it simply that bad things happen in the world and later get reported on?

To jump into this inquiry we utilized Diffbot’s Knowledge Graph. Diffbot is one of the few North American organizations to crawl the entire web. We apply AI-enabled web scrapers to pages that are publicly available to extract entities — think people, places, or things — and facts — think job titles, topics, and funding rounds.

We started our inquiry with some external coverage on bias in journalism provided by AllSides Media Bias Ratings.


The Top 50 Most Underrated Startups as Told by AI

While Diffbot’s Knowledge Graph has historically offered revenue values for publicly-held companies, we recently computed an estimated revenue value for 99.7% of the 250M+ organizations in the KG.

What does this mean?

Most organizations are privately-held, and thus have no public revenue reporting requirement. Diffbot has utilized our unrivaled long-tail organization coverage to create a machine learning-enabled estimated revenue field. This field looks at the myriad fact types we’ve extracted and structured from the public web and infers a revenue from a range of signals.

Estimated revenue is just that… a machine learning-enabled estimate. But with a training set the size of our Knowledge Graph, we’ve found that a great majority of our revenue values are actually quite accurate.

How can I use estimated revenue?

Revenue, even if estimated, is a huge marker for determining size and valuation. In its absence it’s hard to effectively segment organizations. We see this field used in market intelligence, finance, and investing use cases. And it’s as simple as filtering organizations using the revenue.value field.

Where Does Diffbot Get Its Data?

Diffbot is one of only a handful of organizations to crawl the entire web. We apply NLP and machine vision to crawled web pages to find entities and facts about them. These entities are consolidated in the world’s largest Knowledge Graph along with data provenance, linkages between entities, and additional computed fields (like sentiment, or estimated revenue). In this ranking we looked at organization entities. But organization entities are just the “tip of the iceberg” for Diffbot data, which comprises articles, products, people, events, and many other entity types.


The Top Coding Bootcamps For Founders According To The Knowledge Graph

Last week we took a look at the top universities for female founders. In our results, we noted that our web-reading AI associates tech bootcamp attendance with education, and a large cluster of founders attended specific universities in conjunction with bootcamps.

New to the Knowledge Graph? Diffbot’s Knowledge Graph is constructed by crawling a vast majority of the web and structuring data on pages using NLP and machine vision. The end result is one of the world’s largest databases of organizations, people, articles, products and more, all linked and with data provenance.

To return results from the Knowledge Graph, you submit queries which filter which entities to return. In this case we queried the Knowledge Graph to return individuals who:

  1. Attended an educational institution with the name of a top bootcamp
  2. Have held a job title including “CEO,” “chief executive officer,” or “founder”

We then returned a facet (summary) view of how many of these individuals attended each bootcamp.


The Best Schools For Female Founders According To The Knowledge Graph

Upon seeing Crunchbase’s annual ranking of the best schools for graduating entrepreneurs, we wanted to see how our Knowledge Graph results stack up.

The Diffbot Knowledge Graph is sourced from crawling a majority of the web and extracting entities and facts using NLP and machine vision.

Two prominent entity types are person and organization entities. When paired together powerful observations sourced from across the web are possible. In this exploration we returned all person entities within the Knowledge Graph who are currently founders and who are female. We filtered to make sure each organization had at least some publicly disclosed funding, and then we took a look at a summary view of which schools these founders had attended. You can check out the Knowledge Graph query here with a free trial.

While the top schools for female founders were consistent with Crunchbase’s coverage, you may wonder why the numbers vary so dramatically. Crunchbase’s ranking this year looked at 2019-2020 graduates, and Crunchbase’s data is centered around tech and startup firmographics. While Diffbot’s Knowledge Graph certainly has firmographic details on tech-centered companies, our database of organizations is much wider ranging (250M+ orgs at last count). This means our list includes founders of all sorts of endeavors: non-profits, artistic organizations, medical organizations, and tech companies to name a few.


Monitoring Large Food Retailer Investments With The Knowledge Graph

A few weeks ago we published a view into Big Tech investments by industry. In this post we’ll take a similar look at the largest food retailers.

Panning out a bit, there are over 250M organizations within the Knowledge Graph. To obtain this list of large food retailers we first narrowed our search to food retailers with more than 1,000 employees. This query surfaces more than 7,000 fact-rich entities.

From there we simply sorted the results by number of employees to gain the largest food retailers including Walmart, Target, Tesco, Kroger, Carrefour, and Safeway.

With this list in mind, we looked for a list of organizations who had been invested in by one of these organizations. Bounded by calendar years, we then returned a summary view that looked at which industries the invested-in companies represented. If you have a subscription or free trial feel free to check out the resulting query.

Startup Revenue By County With Diffbot’s Knowledge Graph

What can you do with billions of web-sourced facts on hundreds of millions of organizations? Beyond analyzing the facts themselves, you (or a machine of your choice) can learn a lot. Historically, our Knowledge Graph has had one of the largest collections of publicly-disclosed organization revenue. Recently, we’ve applied machine learning processes across many org fields to estimate revenue for private organizations as well.


Using the Knowledge Graph to Segment Big Tech Investments By Industry

Every big tech investment is big news. If your firm raises a funding round with prestigious investors or is acquired, you better bet you’ll spread the news far and wide.

But where can you go for this information en masse? Even covering a handful of big investors over a handful of years can lead to a list of thousands of invested-in firms. And a list of firms by itself isn’t that useful. Sure, some big names pop out. But how do you see what “plays” big tech is making?

That’s where our web-reading bots come in. By working through billions of web pages using NLP and machine vision, Diffbot’s Knowledge Graph is the largest public-web sourced database of organizations, articles, people, products, and events. For each entity — organization, articles, people, etc. — facts are vetted and accumulated to create a filterable, searchable database of “things.” So when we wanted to check out which industries big tech has invested in over the last decade, we knew right where to turn. No analyst middlepersons, just public web data structured into a market intel-rich format.

Big Tech Investment By Industry 2010-2021

Distribution of industries of organizations invested in by Facebook, Alphabet, Amazon, Microsoft, Apple, and Netflix from 2010 to July 2021. Firmographic data sourced from Diffbot’s Knowledge Graph.

Generating B2B Sales Leads With Diffbot’s Knowledge Graph

Generation of leads is the single largest challenge for up to 85% of B2B marketers.

Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.

So what’s the deal? 🤔

Firmographic, demographic, technographic (components of quality market segmentation) data are spread across the web. And even once they’re pulled into our workflows they’re often siloed, still only semi-structured, or otherwise disconnected. Data brokers provide data that gets stale more quickly than quality curated web sources.

But the fact persists, all the lead generation data you typically need is spread across the public web.

You just need someone (or something 🤖) to find, read, and structure this data.
