Articles by: Merrill Cook

The 25 Most Covid-Safe Restaurants in San Francisco (According to NLP)

A few weeks ago, we ran reviews for a Michelin-reviewed restaurant through our Natural Language API. It was able to tell us what people liked or disliked about the restaurant, and even rank dishes by sentiment. In our analysis, we also noticed something curious. When our NL API pulled out the entity “Covid-19,” it wasn’t always paired with a negative sentiment.

When we mined back in to where these positive mentions of Covid-19 occurred in the reviews, we saw that our NL API appeared to be picking up on language in which restaurant reviewers felt a restaurant had handled Covid-19 well. In other words, when Covid-19 was determined to be part of a positive statement, it was because guests felt relatively safe. Or that the restaurant had come up with novel solutions for dealing with Covid-19.

With this in mind, we set to starting up another, larger analysis.

Continue reading

How Employbl Saved 250 Hours Building Their Career-Matching Database

We started with about 1,000 companies in the Employbl database, mostly in the Bay Area. Now with Diffbot we can expand to other cities and add thousands of additional companies. 

Connor Leech – CEO @Employbl

Fixing tech starts with hiring. And fixing hiring is an information problem. That’s what Connor Leech, cofounder and CEO at Employbl discovered when creating a new talent marketplace meant to connect tech employees with the information-rich hiring marketplace they deserve.

Tech job seekers rely on a range of metrics to gauge the opportunity and stability of a potential employer.

While information like funding rounds, founders, team size, industry, and investors are often public, it can be hard to grab the myriad fields candidates value in a up-to-date format from around the web.

These difficulties are amplified by the fact that many tech startups are often “long tail” entities that also regularly change.

Continue reading

What We Found Analyzing 300 Yelp Reviews of a Michelin Reviewed Restaurant with Natural Language Processing

Reviews are a veritable gold mine of data. They’re one of the few times when unsolicited customers lay out the best and the worst parts of using a product or service. And the relative richness of natural language can quickly point product or service providers in a nuanced direction more definitively than quantitative metrics like time on site, bounce rate, or sales numbers.

The flip side of this linguistic richness is that reviews are largely unstructured data. Beyond that, many reviews are written somewhat informally, making the task of decoding their meaning at scale even harder.

Restaurant reviews are known as being some of the richest of all reviews. They tend to document the entire experience: social interactions, location, décor, service, price, and food.

Continue reading

Context Matters, Tracking Quote Spread Across The Web In A Historic Year

Hindsight is 20/20. And as we usher in a new president in what has been one of the most tumultuous years in American history, we can begin to see clarity about the forces that moved throughout our jobs, our lives, and our collective imagination.

Another way to put this is that over time we tend to have more context.

Within Diffbot’s Knowledge Graph, one unique lens through which we can leverage the context of semantic data is by looking at the speakers of quotes.

When our AI reads articles it pulls out quotes, and when it can it attributes a speaker to these quotes. As our crawlers traverse the entirety of the public web, sources of quotes are validated and over time some quotes circulate more than others.

When performing a facet search, this lets us essentially show something like a retweet count for the entire web. This answers questions like whose voices are being heard? And what speakers are the most widely cited in a given topic?

To commemorate the end of an era, let’s take a look at a few of the most circulated statements of the last 365 days.

What were the 10 most circulated quotes across the web by President Joe Biden in the last 365 days?

Continue reading

Robotic Process Automation Extraction Is A Time Saver. But it’s Not Built For the Future

Enough individuals have heard the siren song of Robotic Process Automation to build several $1B companies. Even if you don’t know the “household names” in the space, something about the buzzword abbreviated as “RPA” leaves the impression that you need it. That it boosts productivity. That it enables “smart” processes. 

RPA saves millions of work hours, for sure. But how solid is the foundation for processes built using RPA tech? 

Related Reads: 

 

First off, RPA operates by literally moving pixels across the screen. Repetitive tasks are automated by saving “steps” with which someone would manipulate applications with their mouse, and then enacting these steps without human oversight. There are plenty of examples for situations in which this is handy. You need to move entries from a spreadsheet to a CRM. You need to move entries from a CRM to a CDP. You need to cut and paste thousands or millions of times between two windows in a browser. 

These are legitimate issues within back end business workflows. And RPA remedies these issues. But what happens when your software is updated? Or you need to connect two new programs? Or your ecosystem of tools changes completely? Or you just want to use your data differently? 

This shows the hint of the first issue with the foundation on which RPA is built. RPA can’t operate in environments in which it hasn’t seen (and received extensive documentation about). 

Continue reading

How to Track Market Indicators Using News Monitoring Scheduling

The public web is chock full of indicators with implications for stock prices, commodities prices, supply chain issues, or the general perceived value of an entity. But how do you reliably get these market indicators?

You can search online… and slog through the most popular pages that all your competitors have also looked at. Or you can read a commentator’s take. And likely stay one step removed from the actual information you should be dealing in.

Or you could deal directly with all of the articles on the web. Each annotated with helpful fields you can filter through like sentiment scores, AI-generated topic tags, what country the article was published in, and many others. That’s where Diffbot’s Knowledge Graph (KG) comes in.

The news index of Diffbot’s KG is 50x the size of Google News’ index. And each article entity in the KG is populated with a rich set of fields you can use to actually search the entire web (not just the portion of the web who paid to get in front of you).

In this guide we’ll work through how to set up a global news monitoring query in the KG. And then schedule this query to repeat and email you when new articles surface.
Continue reading

Most “Autoscrapers” Are Still Rule-Based Web Scraping Tools

And why it matters for scaling your public web data sources

As with most forms of tech these days, web scrapers have recently seen a surge of claims that they’re somehow based on AI or machine learning tech. While this suggests that an AI will detect exactly what you want extracted from a page, most scrapers are still rule-based (there are some exceptions, such as Diffbot’s Automatic Extraction APIs).

Why does this matter?

Historically rule-based extraction has been the norm. In rule-based extraction, you specify a set of rules for what you want pulled from a page. This is often an HTML element, CSS selector, or a regex pattern. Maybe you want the third bulleted item beneath every paragraph in a text, or all headers, or all links on a page; rule-based extraction can help with that.

Continue reading

The Ultimate Guide To Data Analysis


Data analysis comes at the tail end of the data lifecycle. Directly after or simultaneously performed with data integration (in which data from different sources are pulled into a unified view). Data analysis involves cleaning, modelling, inspecting and visualizing data.

The ultimate goal of data analysis is to provide useful data-driven insights for guiding organizational decisions. And without data analysis, you might as well not even collect data in the first place. Data analysis is the process of turning data into information, insight, or hopefully knowledge of a given domain.
Continue reading

Stories By DQL: Tracking the Sentiment of a City


The story: sentiment of news mentions of Gaza fluctuate by as much as 2000% a week. 90% of news mentions about Minneapolis have had negative sentiment through the first week in June 2020 (they’re typically about 50% negative). Positive sentiment news mentions about New York City have steadily increased week by week through the pandemic.

Locations are important. They help form our identities. They bring us together or apart. Governance organizations, journalists, and scholars routinely need to track how one location perceives another. From threat detection to product launches, news monitoring in Diffbot’s Knowledge Graph makes it easy to take a truly global news feed and dissect how entities being talked about.

In this story by DQL discover ways to query millions of articles that feature location data (towns, cities, regions, nations).

How we got there: One of the most valuable aspects of Diffbot’s Knowledge Graph is the ability to utilize the relationships between different entity types. You can look for news mentions (article entities) related to people, products, brands, and more. You can look for what skills (skill or people entities) are held by which companies. You can look for discussions on specific products.
Continue reading

Stories By DQL: George Floyd, Police, and Donald Trump

We will get justice. We will get it. We will not let this door close.

– Philonise Floyd, Brother of George Floyd

News coverage this week centered on George Floyd, police, and Donald Trump. COVID-19 related news continue to dominate globally.
That’s the macro story from all Knowledge Graph article published in the last week. But Knowledge Graph article entities provide users with many ways to traverse and dissect breaking news. By facet searching for the most common phrases in articles tagged “George Floyd” you see a nuanced view of the voices being heard.

In this story hopefully you can begin to see the power of global news mentions that can be sliced and diced on so many levels. Wondering how to gain these insights for yourself? Below we’ll work through how to perform these queries in detail.


How we got there: Diffbot’s Knowledge Graph holds hundreds of millions of article entities at any given moment. These articles are of truly global origins, and are parsed by our cutting-edge machine vision and natural language processing systems to take unstructured article data and transform it into structured, query-able entities.

Continue reading