Dear Diffy, Find Me A Coworking Space

Disclaimer: this article is about a very mundane consumer search. That said, how knowledge work and fact accumulation are typically performed has wide-reaching implications for knowledge workflows.

The other day I was searching for coworking spaces.

As in many domains of knowledge, data coverage online was largely human curated. Lists with some undisclosed methodology provided the writer’s favorite coworking spots by city.

Sure, any major search engine will return a list of results plotted on a map. But I’m sure we’ve all run into the following.

  1. Load map…
  2. Pan slightly to surface more results…
  3. Zoom slightly to surface more results…
  4. Pan the opposite direction to try and find a result that had caught our eye…
  5. Try to recall the name that caught our eye in a new search…

Five steps to seek further data points on a single search result. Devoid of context, data provenance, and the ability to analyze at scale.

Sure, consumer search works in many, many cases. So do phone books.

If you’re a power user, a data hoarder, or a productivity buff, you can likely see the appeal of a search that actually returns comprehensive data. If you’re building an intelligent application or performing market intelligence, using search that won’t let you explore the underlying data is just a waste of time.

So after this predictable foray in which I ignored the advice of several articles, scrolled around a map, and got sidetracked once or twice, I decided to resort to a different sort of search: Diffbot’s Knowledge Graph.


  • The title of our article may not make much sense if you haven’t been acquainted with Diffy, Diffbot’s web-reading bot
  • You see the promise of external web data for many applications… if it were structured (or at least felt disappointment at consumer search engines keeping you from public web data)

Opening the Knowledge Graph, it took all of 20 seconds to return data on over 4,000 coworking spaces. And sure, unless you’re selling a service to coworking spaces, you may wonder why anyone would need all this data as a personal consumer…

4000+ coworking space entities in ~20s

Maybe it’s simple curiosity. Maybe it’s the principle of it all; the fact that all of this information is publicly available online, but not in a structured format. Maybe this is just an analogy for non-consumer searches that also can’t be performed on major search engines. Any way you take it, search of the present is flawed for many uses, and it’s still our primary collective data source.

So what does search in the Knowledge Graph look like?

Well it starts with entities.

Knowledge graphs are built around entities (think people, places, or things) and relationships between entities. The types of relationships that can occur between entities, and the types of facts attached to entities are prescribed by a schema. One of the major “selling points” for knowledge graphs is that they have flexible schemas. That is — more so than other types of databases — they can adapt to what types of facts matter out in the world.
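To make the entity-and-relationship idea concrete, here is a minimal sketch in plain Python. It is not Diffbot’s data model: the class names, the "locatedIn" relation, and the sample records are all hypothetical, chosen only to show how facts hang off entities and how a new fact type can be attached without reworking a fixed table layout.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A person, place, or thing, with an open-ended bag of facts."""
    id: str
    type: str
    # Flexible schema (illustrative): any fact type can be attached later.
    facts: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relationship:
    """A directed, named link between two entity ids."""
    source: str
    relation: str
    target: str


# Hypothetical sample data, not real Knowledge Graph records.
space = Entity("e1", "Organization", {"name": "Example Cowork"})
city = Entity("e2", "Place", {"name": "Example City"})
edges = [Relationship("e1", "locatedIn", "e2")]

# A new fact type shows up in the world; attach it without a migration.
space.facts["descriptors"] = ["coworking", "event space"]


def neighbors(entity_id, relation, relationships):
    """Return ids of entities reachable from entity_id via the named relation."""
    return [
        r.target
        for r in relationships
        if r.source == entity_id and r.relation == relation
    ]


print(neighbors("e1", "locatedIn", edges))
```

The point of the sketch is the shape of the data, not the storage engine: entities carry facts, relationships connect entities, and neither requires every record to share the same columns.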

The Importance of Structured Web Data

At their core knowledge graphs (the category of graphs) can be built from any underlying data set. In the case of Diffbot’s Knowledge Graph, it’s the world’s largest structured feed of web data. Diffbot is one of only a handful of organizations to crawl the web. And using machine vision and natural language processing we’re able to pull out mentions of entities as well as infer facts and relationships.

Why is this important?

The web is largely made up of unstructured or semi-structured data. This means you can’t easily filter, sort, or manipulate this data at scale. While the internet is our largest collective source of knowledge, it’s not organized for modern knowledge work.

Diffbot’s products center around organizing the world’s information, whether through our AI-enabled web scrapers, our Knowledge Graph, or our Natural Language API. The ability to source the information from the web in a structured way provides the bedrock for machine learning initiatives, market intelligence, news monitoring, as well as the monitoring of large ecommerce datasets.

The State of Coworking Spaces As Told By AI

So what can you learn from a coworking space dataset that’s much more explorable than consumer search?

It turns out a lot.

While each individual data point is available online, the data isn’t aggregated anywhere else in quite as explorable a format.

In our case we can start with a simple facet query. Faceted search provides a summary view of the value of one fact type attached to a set of entities. So with this sort of query we can quickly discover what locations have the most coworking spaces.

By simply adding a facet to our query we can turn over 4,000 unique results into an observation. While data found about these coworking spaces across the web would be in many different formats (and in many languages), knowledge graphs help to consolidate similar entities around standard fields.
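If you’d like a feel for what a facet does, the sketch below reproduces the idea locally in Python. It is an illustration of faceting in general, not Diffbot’s query syntax: the facet function, the "city" and "descriptors" field names, and the sample records are assumptions made up for the example.

```python
from collections import Counter


def facet(entities, field_name):
    """Summarize one fact type: count each value of field_name across entities.

    Entities missing the field are skipped. List-valued fields (for example
    tags) contribute one count per listed value.
    """
    counts = Counter()
    for entity in entities:
        value = entity.get(field_name)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    # Highest counts first.
    return counts.most_common()


# Hypothetical records standing in for coworking space entities.
spaces = [
    {"name": "Space A", "city": "Berlin", "descriptors": ["coworking", "events"]},
    {"name": "Space B", "city": "London", "descriptors": ["coworking"]},
    {"name": "Space C", "city": "Berlin"},
    {"name": "Space D"},
]

top_cities = facet(spaces, "city")
top_descriptors = facet(spaces, "descriptors")
print(top_cities)
print(top_descriptors)
```

The same function works on single-valued fields (a city) and multi-valued ones (a list of descriptors), which is why one summary view can answer both "where are they?" and "what do they offer?".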

An additional strength of knowledge graphs is that data points can be consolidated from many different sources with data provenance and then built off of. Using natural language processing and machine learning, fields can be computed or inferred from many underlying data sources. Our original query looked at organization entities with “coworking spaces” as part of their description. But an AI-generated field of “descriptors” allows for additional granularity. Let’s look at a facet view of the most common services offered by coworking spaces.

Depending on your experience with a range of coworking spaces, descriptors such as “expat,” “civil & social organization,” or “self improvement” may be novel. By amalgamating tens of thousands of online mentions, articles, and entries into this subset of org entities, the Knowledge Graph dramatically cuts down on the time spent accumulating facts.

One final area in which consumer search is severely lacking (or just impractical) is that of market research. Industry-specific events such as funding rounds, openings of new offices, key executive hires or departures, or clues as to private organization revenue can be hard to pinpoint across the web. Softer signals like sentiment around topics or velocity of news coverage can also be informative.

Diffbot’s article index is roughly 50x the size of Google News. Unlike traditional content channels, you aren’t presented with content that’s gamed the system or paid to get your attention. Additionally, where consumer search engines are siloed by language or location, Diffbot’s article index is pan-lingual. With articles augmented by additional filterable fields, underlying articles can become unique observations on sentiment, key happenings, and more. All underlying article data is returned as well, so you can mine in once you’ve found an interesting angle.

For a deeper dive into creating custom news feeds around organizations and events be sure to check out our Knowledge Graph news monitoring test drive.


Maybe you don’t buy the segue from what really is a consumer search (“coworking spaces near me”) to the copious coworking data available in the Knowledge Graph. But the fact of the matter is that a great deal of knowledge work still relies on human fact accumulation. Without automated ways to structure unstructured data, there’s a definite floor to the cost per fact.

Knowledge graphs provide a bedrock for knowledge workflows reengineered from the ground up. In particular:

  • Knowledge graphs mirror what we care about “in the world” (entities and relationships)
  • Knowledge graphs provide flexible schemas allowing for fact types attached to entities to change over time (as the world changes)
  • Automated knowledge graphs provide one of the only feasible ways to structure market intel and news monitoring data that can be spread across the web
  • Knowledge graphs that don’t expose their underlying data aren’t suitable for use in intelligent applications or machine learning use cases
  • Knowledge graphs that provide additionally computed fields (sentiment, tags, inferences on revenue or events) provide additional value for market intelligence and news monitoring

The Top 50 Most Underrated Startups as Told by AI

While Diffbot’s Knowledge Graph has historically offered revenue values for publicly-held companies, we recently computed an estimated revenue value for 99.7% of the 250M+ organizations in the KG.

What does this mean?

Most organizations are privately-held, and thus have no public revenue reporting requirement. Diffbot has utilized our unrivaled long-tail organization coverage to create a machine learning-enabled estimated revenue field. This field looks at the myriad fact types we’ve extracted and structured from the public web and infers a revenue from a range of signals.

Estimated revenue is just that… a machine learning-enabled estimate. But with a training set the size of our Knowledge Graph, we’ve found that a great majority of our revenue values are actually quite accurate.

How can I use estimated revenue?

Revenue, even if estimated, is a huge marker for determining size and valuation. In its absence it’s hard to effectively segment organizations. We see this field used in market intelligence, finance, and investing use cases. And it’s as simple as filtering organizations using the revenue.value field.
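As a rough illustration of segmenting by revenue, here is a small Python sketch that filters already-downloaded organization records on a nested revenue.value field. Only the field name comes from the text above. The helper functions, the record layout, and the sample numbers are assumptions for the example, and this is not the Knowledge Graph’s own query syntax.

```python
def get_path(record, dotted_path):
    """Follow a dotted path such as 'revenue.value' through nested dicts.

    Returns None when any step along the path is missing.
    """
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def filter_by_range(records, dotted_path, minimum=None, maximum=None):
    """Keep records whose value at dotted_path lies within [minimum, maximum].

    Records without the field are left out rather than treated as zero.
    """
    selected = []
    for record in records:
        value = get_path(record, dotted_path)
        if value is None:
            continue
        if minimum is not None and value < minimum:
            continue
        if maximum is not None and value > maximum:
            continue
        selected.append(record)
    return selected


# Hypothetical organization records; the numbers are made up.
orgs = [
    {"name": "Org A", "revenue": {"value": 2_500_000}},
    {"name": "Org B", "revenue": {"value": 40_000_000}},
    {"name": "Org C"},  # no revenue fact available
]

mid_sized = filter_by_range(
    orgs, "revenue.value", minimum=1_000_000, maximum=10_000_000
)
print([org["name"] for org in mid_sized])
```

Skipping records with no revenue fact, rather than counting them as zero, keeps a missing estimate from being mistaken for a tiny company.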

Where Does Diffbot Get Its Data?

Diffbot is one of only a handful of organizations to crawl the entire web. We apply NLP and machine vision to crawled web pages to find entities and facts about them. These entities are consolidated in the world’s largest Knowledge Graph along with data provenance, linkages between entities, and additional computed fields (like sentiment, or estimated revenue). In this ranking we looked at organization entities. But organization entities are just the “tip of the iceberg” for Diffbot data, which comprises articles, products, people, events, and many other entity types.


Generating B2B Sales Leads With Diffbot’s Knowledge Graph

Generation of leads is the single largest challenge for up to 85% of B2B marketers.

Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.

So what’s the deal? 🤔

Firmographic, demographic, and technographic data (components of quality market segmentation) are spread across the web. And even once they’re pulled into our workflows they’re often siloed, still only semi-structured, or otherwise disconnected. Data brokers provide data that gets stale more quickly than quality curated web sources.

But the fact persists: all the lead generation data you typically need is spread across the public web.

You just need someone (or something 🤖) to find, read, and structure this data.


Towards A Public Web Infused Dashboard For Market Intel, News Monitoring, and Lead Gen [Whitepaper]

It took Google knowledge panels one month and twenty days to update following the appointment of a new CEO at Citi, an F100 company. In Diffbot’s Knowledge Graph, a new fact was logged within the week, with zero human intervention and sourced from the public web.

The CEO change at Citi was announced in September 2020, highlighting the reliance on manual updates to underlying Wiki entities.

In many studies data teams report spending 25-30% of their time cleaning, labelling, and gathering data sets [1]. While the number 80% is at times bandied about, an exact percentage will depend on the team and is to some degree moot. What we know for sure is that data teams and knowledge workers generally spend a noteworthy amount of their time procuring data points that are available on the public web.

The issue at play here is that the public web is our largest, and overall most reliable, source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”

Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without the full extent of underlying data (or skirting TOS), they just aren’t meant to be part of a data pipeline [2].

With that said, there’s still a world of valuable data on the public web.

At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially-available Knowledge Graph. For more custom needs, we harness our automatic extraction APIs pointed at specific domains, or our natural language processing API in tandem with the KG.

In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data from a selection of sites of interest, entire web crawls, or in tandem with additional natural language processing to build impactful and insightful dashboards par excellence.

Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.


Download This Dataset of 12,118 Yahoo Answers for $1

With only 2 weeks left till May 4th (be with you), the internet is bursting with excitement over all the work that needs to be done before Yahoo Answers finally 404s.

From scheduling a 2nd COVID vaccine to your annual panic attack at missing the tax filing deadline (you probably didn’t, it was extended to May 17 in the U.S.), there is nothing short of a lengthy agenda for everyone ahead of the shutdown of this iconic website.


The 6 Biggest Difficulties With Data Cleaning (With Work Arounds)

Data is the new soil.

David McCandless

If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.

At its simplest, data cleaning revolves around two opposing needs:

  • The need to amend data points that will skew the quality of your results
  • The need to retain as much of your useful data as you can

These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.

Perhaps the most important result from a data cleaning job is that results be standardized in a way that analytics and BI tools can easily access any value, present data in dashboards, or otherwise make the data manipulatable.
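The tension between amending and retaining can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended pipeline: the "price" field, the cleaning rules, and the sample rows are hypothetical. The idea is to standardize whatever can be recovered and to set aside only the rows that can’t, keeping them visible rather than silently deleting them.

```python
def clean_price(raw):
    """Return the price as a float, or None when it cannot be recovered."""
    if raw is None:
        return None
    # Strip whitespace, currency symbols, and thousands separators.
    text = str(raw).strip().replace("$", "").replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


def clean_rows(rows):
    """Split rows into standardized keepers and unusable leftovers."""
    kept, dropped = [], []
    for row in rows:
        price = clean_price(row.get("price"))
        if price is None:
            # Retained separately so the loss can be reviewed.
            dropped.append(row)
        else:
            kept.append({**row, "price": price})
    return kept, dropped


# Hypothetical raw rows with inconsistent formatting.
rows = [
    {"item": "desk", "price": "$1,200"},
    {"item": "chair", "price": " 85.50 "},
    {"item": "lamp", "price": "n/a"},
    {"item": "rug", "price": None},
]

kept, dropped = clean_rows(rows)
print(kept)
print(len(dropped), "rows set aside for review")
```

Returning the dropped rows alongside the cleaned ones makes the trade-off measurable: you can see exactly how much data a given cleaning rule costs you.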


The 25 Most Covid-Safe Restaurants in San Francisco (According to NLP)

A few weeks ago, we ran reviews for a Michelin-reviewed restaurant through our Natural Language API. It was able to tell us what people liked or disliked about the restaurant, and even rank dishes by sentiment. In our analysis, we also noticed something curious. When our NL API pulled out the entity “Covid-19,” it wasn’t always paired with a negative sentiment.

When we mined back in to where these positive mentions of Covid-19 occurred in the reviews, we saw that our NL API appeared to be picking up on language in which restaurant reviewers felt a restaurant had handled Covid-19 well. In other words, when Covid-19 was determined to be part of a positive statement, it was because guests felt relatively safe. Or that the restaurant had come up with novel solutions for dealing with Covid-19.

With this in mind, we set about starting another, larger analysis.


What We Found Analyzing 300 Yelp Reviews of a Michelin Reviewed Restaurant with Natural Language Processing

Reviews are a veritable gold mine of data. They’re one of the few times when unsolicited customers lay out the best and the worst parts of using a product or service. And the relative richness of natural language can quickly point product or service providers in a nuanced direction more definitively than quantitative metrics like time on site, bounce rate, or sales numbers.

The flip side of this linguistic richness is that reviews are largely unstructured data. Beyond that, many reviews are written somewhat informally, making the task of decoding their meaning at scale even harder.

Restaurant reviews are known as being some of the richest of all reviews. They tend to document the entire experience: social interactions, location, décor, service, price, and food.


From Knowledge Graphs to Knowledge Workflows

2020 was undeniably the “Year of the Knowledge Graph.”

2020 was the year that Gartner put Knowledge Graphs at the peak of its hype cycle.

It was the year where 10% of the papers published at EMNLP referenced “knowledge” in their titles.

It was the year over 1000 engineers, enterprise users, and academics came together to talk about Knowledge Graphs at the 2nd Knowledge Graph Conference.

There are good reasons for this grass-roots trend, as it isn’t any one company that is pushing this trend (ahem, I’m looking at you, Cognitive Computing), but rather a broad coalition of academics, industry vertical practitioners, and enterprise users that generally deal with building intelligent information systems.

Knowledge graphs represent the best of what we hope the “next step” of AI looks like: intelligent systems that aren’t black boxes, but are explainable, that are grounded in the same real-world entities as us humans, and are able to exchange knowledge with us with precise common vocabularies. It’s no coincidence that in the same year that marked the peak of the deep learning revolution (2012), Google introduced the Google Knowledge Graph as a way to provide interpretability to its otherwise opaque search ranking algorithms.

The Risk Of Hype: Touted Benefits Don’t Materialize


Robotic Process Automation Extraction Is A Time Saver. But It’s Not Built For the Future

Enough individuals have heard the siren song of Robotic Process Automation to build several $1B companies. Even if you don’t know the “household names” in the space, something about the buzzword abbreviated as “RPA” leaves the impression that you need it. That it boosts productivity. That it enables “smart” processes. 

RPA saves millions of work hours, for sure. But how solid is the foundation for processes built using RPA tech? 


First off, RPA operates by literally moving pixels across the screen. Repetitive tasks are automated by saving “steps” with which someone would manipulate applications with their mouse, and then enacting these steps without human oversight. There are plenty of examples for situations in which this is handy. You need to move entries from a spreadsheet to a CRM. You need to move entries from a CRM to a CDP. You need to cut and paste thousands or millions of times between two windows in a browser. 

These are legitimate issues within back end business workflows. And RPA remedies these issues. But what happens when your software is updated? Or you need to connect two new programs? Or your ecosystem of tools changes completely? Or you just want to use your data differently? 

This hints at the first issue with the foundation on which RPA is built. RPA can’t operate in environments it hasn’t seen (and received extensive documentation about).
