What We Found Analyzing 300 Yelp Reviews of a Michelin Reviewed Restaurant with Natural Language Processing

Reviews are a veritable gold mine of data. They’re one of the few places where customers, unprompted, lay out the best and the worst parts of using a product or service. And the richness of natural language can point product or service providers in a nuanced direction more quickly and definitively than quantitative metrics like time on site, bounce rate, or sales numbers.

The flip side of this linguistic richness is that reviews are largely unstructured data. Beyond that, many reviews are written somewhat informally, making the task of decoding their meaning at scale even harder.

Restaurant reviews are known for being some of the richest of all reviews. They tend to document the entire experience: social interactions, location, décor, service, price, and food.


Context Matters, Tracking Quote Spread Across The Web In A Historic Year

Hindsight is 20/20. And as we usher in a new president in what has been one of the most tumultuous years in American history, we can begin to see clarity about the forces that moved throughout our jobs, our lives, and our collective imagination.

Another way to put this is that over time we tend to have more context.

Within Diffbot’s Knowledge Graph, one unique lens through which we can leverage the context of semantic data is by looking at the speakers of quotes.

When our AI reads articles it pulls out quotes, and when it can it attributes a speaker to these quotes. As our crawlers traverse the entirety of the public web, sources of quotes are validated and over time some quotes circulate more than others.

When performing a facet search, this lets us show something like a retweet count for the entire web. It answers questions like: whose voices are being heard? And which speakers are the most widely cited on a given topic?

To commemorate the end of an era, let’s take a look at a few of the most circulated statements of the last 365 days.

What were the 10 most circulated quotes across the web by President Joe Biden in the last 365 days?


From Knowledge Graphs to Knowledge Workflows

2020 Was The “Year of the Knowledge Graph”

2020 was undeniably the “Year of the Knowledge Graph.”

2020 was the year that Gartner put Knowledge Graphs at the peak of its hype cycle.

It was the year where 10% of the papers published at EMNLP referenced “knowledge” in their titles.

It was the year over 1000 engineers, enterprise users, and academics came together to talk about Knowledge Graphs at the 2nd Knowledge Graph Conference.

There are good reasons for this grass-roots trend: it isn’t any one company pushing it (ahem, I’m looking at you, Cognitive Computing), but rather a broad coalition of academics, industry vertical practitioners, and enterprise users who generally deal with building intelligent information systems.

Knowledge graphs represent the best of what we hope the “next step” of AI looks like: intelligent systems that aren’t black boxes, but are explainable, that are grounded in the same real-world entities as us humans, and that are able to exchange knowledge with us using precise common vocabularies. It’s no coincidence that in the same year that marked the peak of the deep learning revolution (2012), Google introduced the Google Knowledge Graph as a way to provide interpretability to its otherwise opaque search ranking algorithms.

The Risk Of Hype: Touted Benefits Don’t Materialize


Extracting Product Variant Data with DiffbotAPI

Diffbot API allows you to automatically gather ecommerce information such as images, description, brand, prices and specs from product pages. But what about product pages that contain multiple variants of the product, offered at different prices?

Product variants are variations of a base product, such as multiple sizes, colors, or styles, that may have their own pricing and availability. For many kinds of products, ranging from apparel to home goods to car parts, these variants are crucial to understand. For example, you wouldn’t want to get kid-sized shoes sent to you for adult-sized feet. Product variants also give you clues as to which variations of a product are available from the merchant, and which might be sold out.

Diffbot’s APIs might not always be able to extract variants automatically using AI, but thankfully Diffbot includes a powerful Custom API that allows you to both correct and augment what is extracted.

Let’s take a look at this product page (in this example, a bedding sheet set from Macy’s) that has product variants. If we pass this URL to Diffbot API, Diffbot automatically extracts the base product’s title, text, price, SKU, and images, as well as the thread count and fabric. However, it does not extract the variants.

In this example, the sheets come in multiple sizes (from Twin to California King) and in colors ranging from a classic white to Pomegranate (which unsurprisingly has plenty in stock). As humans, we can easily see that the add-to-bag price depends on the size, and not the color.

Let’s make our AI see this too.

To do this we can use an X-eval rule, essentially a JavaScript function with our own custom scraping logic that augments what Diffbot already extracts. An X-eval can be specified when creating a custom rule using the Custom API.

function () {
  start();
  var variants = [];

  /* Return the swatches matching `selector` that are not marked unavailable. */
  var available = function (selector) {
    return $(selector).filter(function (i, e) {
      return !$(e).hasClass('unavailable');
    });
  };

  var sizeCount = available('li.swatch-itm').length;
  for (var i = 0; i < sizeCount; i++) {
    /* Re-query the sizes on every pass, in case a click re-rendered the swatches. */
    var sizeEl = available('li.swatch-itm')[i];
    if (!sizeEl) { continue; }
    sizeEl.click();

    /* Get the colors for this size and click the first one. */
    var colors = available('li.color-swatch');
    if (colors.length > 0) {
      colors[0].click();
    }

    /* Read the price once per size; guard against a missing price element. */
    var priceMatch = $('div.price').text().match(/([0-9.]+)/);
    var price = priceMatch ? priceMatch[1] : null;

    for (var j = 0; j < colors.length; j++) {
      variants.push({
        'size': sizeEl.textContent.trim(),
        'color': $(colors[j]).find('.color-swatch-div').attr('aria-label'),
        'offerPrice': price
      });
    }
  }
  save("variants", variants);
  end();
}

All X-eval functions start with a start(); invocation and end with end(); to signal that the function is complete (important when there are callbacks that execute after function return).

We proceed by enumerating the list of available sizes using jQuery, which is supported in X-eval functions. We then click on the DOM element corresponding to each size, and use another jQuery selector to select the list of available colors. Finally, we use a third jQuery selector to select the offer price, and save each combination of (size, color, price) to a variants array.

The last step is calling save() on variants, which saves the variants array as a property of the product JSON that is returned by Diffbot. Our final extracted product now has these variants captured.
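As a quick sanity check on the extracted data, here is a minimal sketch that confirms the observation that price depends on size and not color. The sample `variants` array and the helper names `pricesBySize` and `priceDependsOnlyOnSize` are hypothetical, made up for illustration; only the field names (size, color, offerPrice) come from the X-eval above.

```javascript
// Hypothetical sample shaped like the `variants` array saved by the X-eval.
// The sizes, colors, and prices are invented for illustration only.
const variants = [
  { size: 'Twin', color: 'White', offerPrice: '49.99' },
  { size: 'Twin', color: 'Pomegranate', offerPrice: '49.99' },
  { size: 'Queen', color: 'White', offerPrice: '69.99' },
  { size: 'Queen', color: 'Pomegranate', offerPrice: '69.99' },
];

// Collect the set of distinct prices seen for each size.
function pricesBySize(items) {
  const bySize = {};
  for (const v of items) {
    if (!bySize[v.size]) bySize[v.size] = new Set();
    bySize[v.size].add(Number(v.offerPrice));
  }
  return bySize;
}

// True when every size maps to exactly one price,
// i.e. color has no effect on price in this data.
function priceDependsOnlyOnSize(items) {
  return Object.values(pricesBySize(items)).every((prices) => prices.size === 1);
}

console.log(priceDependsOnlyOnSize(variants));
```

If this check ever came back false for a real page, it would be a hint that the X-eval should click each color and re-read the price rather than reading it once per size.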


Robotic Process Automation Extraction Is A Time Saver. But it’s Not Built For the Future

Enough individuals have heard the siren song of Robotic Process Automation to build several $1B companies. Even if you don’t know the “household names” in the space, something about the buzzword abbreviated as “RPA” leaves the impression that you need it. That it boosts productivity. That it enables “smart” processes. 

RPA saves millions of work hours, for sure. But how solid is the foundation for processes built using RPA tech? 


First off, RPA operates by literally moving pixels across the screen. Repetitive tasks are automated by saving “steps” with which someone would manipulate applications with their mouse, and then enacting these steps without human oversight. There are plenty of examples for situations in which this is handy. You need to move entries from a spreadsheet to a CRM. You need to move entries from a CRM to a CDP. You need to cut and paste thousands or millions of times between two windows in a browser. 

These are legitimate issues within back end business workflows. And RPA remedies these issues. But what happens when your software is updated? Or you need to connect two new programs? Or your ecosystem of tools changes completely? Or you just want to use your data differently? 

This hints at the first issue with the foundation on which RPA is built. RPA can’t operate in environments it hasn’t seen (and received extensive documentation about).


How to Track Market Indicators Using News Monitoring Scheduling

The public web is chock full of indicators with implications for stock prices, commodities prices, supply chain issues, or the general perceived value of an entity. But how do you reliably get these market indicators?

You can search online… and slog through the most popular pages that all your competitors have also looked at. Or you can read a commentator’s take. And likely stay one step removed from the actual information you should be dealing in.

Or you could deal directly with all of the articles on the web. Each annotated with helpful fields you can filter through like sentiment scores, AI-generated topic tags, what country the article was published in, and many others. That’s where Diffbot’s Knowledge Graph (KG) comes in.

The news index of Diffbot’s KG is 50x the size of Google News’ index. And each article entity in the KG is populated with a rich set of fields you can use to actually search the entire web (not just the portion of the web that paid to get in front of you).

In this guide we’ll work through how to set up a global news monitoring query in the KG. And then schedule this query to repeat and email you when new articles surface.

How to Estimate the Size of a Market with the Diffbot Knowledge Graph

Organizations are one of our most popular standard entities in the Diffbot Knowledge Graph, for good reason. Behind 200M+ company data profiles is an architecture that enables incredibly precise search and summarization, allowing anyone to estimate the size of a market and forecast business opportunity in any niche.

 

Pre-Requisites

 

Step 1 – Find Companies Like X

In a perfect world, every market and industry on the planet is neatly organized into well defined categories. In practice, this gets close, but not close enough, especially for niche markets.

What we’ll need instead is a combination of traits, including industry classifiers, keywords, and other characteristics that define companies in a market.

This is much easier to define by starting with companies we know that fit the bill. Think of it as searching for “companies like X”.

 
Box of Panettone cake

 

As an example, let’s start with finding companies like Bauducco, producer of this lovely Panettone cake. This is a market we’re hoping to sell, say, a commercial cake-baking oven to.

The closest definition of a market I might imagine for them is something like “packaged foods”. We could google this term and get some really generic hits for “food and beverage companies”, or we can do better.

We’ll start by looking this company up on Diffbot’s Knowledge Graph with a query like this

 
type:Organization homepageUri:"bauducco.com"
 
Next, click through the most relevant result to a company profile.

Now let’s gather everything on this page that describes a company like Bauducco.

 
Diffbot company profile page for Bauducco

 

Under the company summary, the closest descriptor to their signature Panettone is “cakes”. Note that.

Under industries, they might be involved in agriculture to some degree, but we’re not really looking for other companies that are involved in agriculture. “Food and Drink Companies” will do!

That’s it.

Now that we have these traits, let’s construct a search query with DQL:

 
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake")

Diffbot search results - 47,000 companies like Bauducco

 

Nearly 48,000 results! That’s a huge list of potential customers. Like the original google search, it’s a bit too generic to work with. Unlike results from Google though, we can segment this down as much as we’d like with just a few more parameters.

💡 Pro Tip: To see a full list of available traits to construct your query with, go to enhance.diffbot.com/ontology.

 

Step 2 – Remove Irrelevant Traits

What I’m first noticing is that there are a lot of international brands on this list. I’m interested in selling to companies like Bauducco in the U.S., so let’s trim this list to just companies with a presence in the United States.

 
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake") locations.country.name:"United States"

Diffbot search results - companies like Bauducco in the U.S.
 

Note that there are two “location” attributes. A singular and a plural version. The plural version (“locations”) will match all known locations of a company. The singular version (“location”) will only match the known headquarters of a company.

Down to 8800 results. Much better. We’re not really interested in ice cream companies in this market either (after all, we’re selling a baking oven), so we’ll use the not() operator to filter ice cream companies out.

 
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake") not(description:"ice cream") locations.country.name:"United States"
 

Let’s also say our oven is really only practical for large operations of at least 100 employees. We’ll add a minimum employee threshold to our query.

 
type:Organization industries:"Food and Drink Companies" description:or("cakes", "cake") not(description:"ice cream") locations.country.name:"United States" nbEmployeesMin>=100


 

262 results. Now we’re really getting somewhere. Let’s stop here to calculate our total addressable market.
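If you’d rather assemble a query like this programmatically, here is a minimal sketch that builds the final DQL string one filter at a time. The `buildDql` helper is hypothetical (it is not part of any Diffbot library); the clauses themselves are the ones used in the steps above.

```javascript
// Hypothetical helper: join DQL clauses with spaces, skipping any empty ones.
function buildDql(clauses) {
  return clauses.filter(Boolean).join(' ');
}

// Each entry mirrors one refinement from the walkthrough.
const query = buildDql([
  'type:Organization',                       // entity type
  'industries:"Food and Drink Companies"',   // industry classifier
  'description:or("cakes", "cake")',         // keyword match
  'not(description:"ice cream")',            // exclude ice cream companies
  'locations.country.name:"United States"',  // any known U.S. location
  'nbEmployeesMin>=100',                     // minimum employee threshold
]);

console.log(query);
```

Keeping each filter as its own array entry makes it easy to comment one out and see how the result count changes. If the string is ever sent as a URL parameter, it would need to be passed through `encodeURIComponent` first.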

 

Step 3 – Calculate Total Addressable Market

To calculate TAM, we simply multiply the number of potential customers by the annual contract value of each customer.

TAM = Number of Potential Customers x Annual Contract Value

At a $1M average contract value with 262 potential customers, our TAM is approximately $262M.
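The same arithmetic as a small sketch, in case you want to plug in your own numbers. The function name `totalAddressableMarket` is just an illustrative choice; the inputs are the figures from this example.

```javascript
// TAM = number of potential customers x annual contract value.
function totalAddressableMarket(potentialCustomers, annualContractValue) {
  return potentialCustomers * annualContractValue;
}

// 262 potential customers at a $1M average annual contract value.
const tam = totalAddressableMarket(262, 1000000);

console.log(tam); // 262000000, i.e. roughly $262M
```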

This is just a starting point, of course. We’ll want to assess existing competition and pricing sensitivity, as well as how much of the existing market would be willing to switch for our unique value proposition. We’ll leave that for another day.

 

Takeaways

Try replicating these steps for a market of your choosing. The ability to filter and summarize practically any field in the ontology provides limitless potential for market and competitive intelligence.

Need some inspiration? Here’re some additional examples:


Most “Autoscrapers” Are Still Rule-Based Web Scraping Tools

And why it matters for scaling your public web data sources

As with most forms of tech these days, web scrapers have recently seen a surge of claims that they’re somehow based on AI or machine learning tech. While this suggests that an AI will detect exactly what you want extracted from a page, most scrapers are still rule-based (there are some exceptions, such as Diffbot’s Automatic Extraction APIs).

Why does this matter?

Historically, rule-based extraction has been the norm. In rule-based extraction, you specify a set of rules for what you want pulled from a page. A rule is often an HTML element, a CSS selector, or a regex pattern. Maybe you want the third bulleted item beneath every paragraph in a text, or all headers, or all links on a page; rule-based extraction can help with that.
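To make this concrete, here is a minimal sketch of two regex rules, one for all links and one for all headers. The sample HTML and the function names `extractLinks` and `extractHeaders` are made up for illustration, and the patterns assume simple, well-formed markup; they are not how any particular scraping tool works.

```javascript
// Made-up sample page for illustration.
const html = `
<h1>Store</h1>
<p>Read our <a href="/about">about page</a> or <a href="https://example.com/contact">contact us</a>.</p>
<h2>Products</h2>
`;

// Rule 1: capture the href value of every <a> tag.
function extractLinks(page) {
  const rule = /<a\s[^>]*href="([^"]*)"/g;
  return Array.from(page.matchAll(rule), (m) => m[1]);
}

// Rule 2: capture the text of every <h1> through <h6> header.
function extractHeaders(page) {
  const rule = /<h([1-6])[^>]*>(.*?)<\/h\1>/g;
  return Array.from(page.matchAll(rule), (m) => m[2].trim());
}

console.log(extractLinks(html));
console.log(extractHeaders(html));
```

Note how much each rule assumes about the markup: double-quoted attributes, headers on a single line, and so on. A rule written for one page layout can silently return nothing when the layout changes.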


How We Increased Our Lead Contact Rate by 46% with Diffbot Enhance

Hi! This is Jerome from Diffbot. You might’ve seen us around before. We’re known for our automatic extraction APIs, and our knowledge graph of the public web. Today, I’d like to introduce you to Diffbot Enhance, lead enrichment anywhere you need it.

Lead enrichment doesn’t get enough credit

When I first saw it in action, it looked like a gimmick - just fields populated in a CRM sold with shockingly pricey annual contracts up-sold alongside Salesforce.

Like keeping your personal address book up to date. Helpful? Sure. Necessary? Not really.

Sales always insists it’s helpful though. I didn’t get it.

Fast forward a few years: we noticed one day that 62% of our inbound leads never made it to a demo call. 62%! These are people who chose to ignore the self-start trial option, fill out a six-field form, pass a captcha, and click a button that literally says request a demo.

Screenshot of sign up modal on Diffbot's homepage


The Ultimate Guide To Data Analysis


Data analysis comes at the tail end of the data lifecycle. It is performed directly after, or simultaneously with, data integration (in which data from different sources are pulled into a unified view). Data analysis involves cleaning, modelling, inspecting and visualizing data.

The ultimate goal of data analysis is to provide useful data-driven insights for guiding organizational decisions. And without data analysis, you might as well not even collect data in the first place. Data analysis is the process of turning data into information, insight, or hopefully knowledge of a given domain.