4 Ways Technical Leaders Are Structuring Text To Drive Data Transformations [Whitepaper]

Natural and unstructured language is how humans largely communicate. For this reason, it’s often the format of organizations’ most detailed and meaningful feedback and market intelligence. 

Natural language has historically been impractical to parse at scale, but natural language processing has now hit mainstream adoption. The global NLP market is expected to grow 20% annually through 2026. Analysts suggest that 

As a benchmark-topping natural language processing API provider, Diffbot is in a unique position to survey cutting-edge NLP uses. In this paper, we’ll work through the state of open source, cloud-based, and custom NLP solutions in 2021, and lay out four ways in which technical leaders are structuring text to drive data transformations. 

In particular, we’ll take a look at:

  • How researchers are using the NL API to create a knowledge graph for an entire country
  • How the largest native ad network in finance uses NLP to monitor topics of discussion and serve up relevant ads
  • The use of custom properties for fraud detection in natural language documents at scale
  • How the ability to train recognition of 1M custom named entities in roughly a day helps create better data

(more…)

Diffbot-Powered Academic Research in 2020

At Diffbot, our goal is to build the most accurate, comprehensive, and fresh Knowledge Graph of the public web, and Diffbot researchers advance the state-of-the-art in information extraction and natural language processing techniques.

Outside of our own research, we’re proud to enable others to do new kinds of research on some of the most important topics of our time, such as analyzing the spread of online news, misinformation, privacy advice, emerging entities, and Knowledge Graph representations.

As an academic researcher, one of the limiting factors in your work is often access to high-quality, accurate training data to study your particular problem. This is where tapping into an external Knowledge Graph API can help you greatly accelerate the bootstrapping of your own ML dataset.

Here is a sampling of some of the academic research conducted by others in 2020 that uses Diffbot:

(more…)

The 6 Biggest Difficulties With Data Cleaning (With Work Arounds)

Data is the new soil.

David McCandless

If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time-consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.

At its simplest, data cleaning revolves around two opposing needs:

  • The need to amend data points that will skew the quality of your results
  • The need to retain as much of your useful data as you can

These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.

Perhaps the most important outcome of a data cleaning job is that the data is standardized, so that analytics and BI tools can easily access any value, present data in dashboards, or otherwise manipulate the data.
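
To make the tension between the two needs concrete, here is a minimal sketch using made-up records. The helper names (`isValid`, `dropInvalid`, `fillWithMedian`) and the choice of a median fill are assumptions for illustration only, not a prescribed cleaning method: removing bad points protects quality but loses rows, while amending them keeps every row at the cost of introducing estimates.

```javascript
// Made-up records; null marks a missing or unusable value.
const records = [
  { id: 1, price: 10 },
  { id: 2, price: null },
  { id: 3, price: 30 },
  { id: 4, price: 20 },
];

// A record is usable here only if its price is a finite number.
const isValid = (r) => typeof r.price === 'number' && Number.isFinite(r.price);

// Strategy 1: remove unusable data points (fewer rows survive).
function dropInvalid(rows) {
  return rows.filter(isValid);
}

// Strategy 2: amend unusable data points with the median of the valid ones
// (every row survives, but the filled-in values are estimates).
// Assumes at least one valid row exists.
function fillWithMedian(rows) {
  const sorted = rows.filter(isValid).map((r) => r.price).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return rows.map((r) => (isValid(r) ? r : { ...r, price: median }));
}

console.log(dropInvalid(records).length);      // rows kept after removal
console.log(fillWithMedian(records)[1].price); // value filled in for record 2
```

Which strategy is appropriate depends on how much data you can afford to lose and how much an estimated value would skew your results.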

(more…)

These Are The Hardest Page Types To Scrape — With Workarounds For Each

Phrases like “the web is held together by [insert ad hoc, totally precarious binding agent]” have been around for a while for a reason.

While the services we rely on tend to sport hugely impressive availability, all things considered, that doesn’t negate the fact that the web at large is a tangled mess of semi-structured or unstructured data and site-by-site nuances.

Put this together with the fact that the web is by far our largest source of valuable external data, and you have a task as high-reward as it is error-prone. That task is web scraping.

As one of three western entities to crawl and structure a vast majority of the web, we’ve learned a thing or two about where web crawling can go wrong, and we’ve incorporated many solutions into our rule-less Automatic Extraction APIs and Crawlbot.

In this guide we round up some of the most common challenges for teams or individuals trying to harvest data from the public web. And we provide a workaround for each. Want to see what rule-less extraction looks like for your site of interest? Check out our extraction test drive!

(more…)

The 25 Most Covid-Safe Restaurants in San Francisco (According to NLP)

A few weeks ago, we ran reviews for a Michelin-reviewed restaurant through our Natural Language API. It was able to tell us what people liked or disliked about the restaurant, and even rank dishes by sentiment. In our analysis, we also noticed something curious. When our NL API pulled out the entity “Covid-19,” it wasn’t always paired with a negative sentiment.

When we dug back into where these positive mentions of Covid-19 occurred in the reviews, we saw that our NL API appeared to be picking up on language in which reviewers felt a restaurant had handled Covid-19 well. In other words, when Covid-19 was determined to be part of a positive statement, it was because guests felt relatively safe, or because the restaurant had come up with novel solutions for dealing with Covid-19.

With this in mind, we set about starting another, larger analysis.
(more…)

How Employbl Saved 250 Hours Building Their Career-Matching Database

We started with about 1,000 companies in the Employbl database, mostly in the Bay Area. Now with Diffbot we can expand to other cities and add thousands of additional companies. 

Connor Leech – CEO @Employbl

Fixing tech starts with hiring. And fixing hiring is an information problem. That’s what Connor Leech, cofounder and CEO at Employbl, discovered when creating a new talent marketplace meant to connect tech employees with the information-rich hiring marketplace they deserve.

Tech job seekers rely on a range of metrics to gauge the opportunity and stability of a potential employer.

While information like funding rounds, founders, team size, industry, and investors is often public, it can be hard to grab the myriad fields candidates value in an up-to-date format from around the web.

These difficulties are amplified by the fact that many tech startups are “long tail” entities that also change regularly.

(more…)

What We Found Analyzing 300 Yelp Reviews of a Michelin Reviewed Restaurant with Natural Language Processing

Reviews are a veritable gold mine of data. They’re one of the few times when unsolicited customers lay out the best and the worst parts of using a product or service. And the relative richness of natural language can quickly point product or service providers in a nuanced direction more definitively than quantitative metrics like time on site, bounce rate, or sales numbers.

The flip side of this linguistic richness is that reviews are largely unstructured data. Beyond that, many reviews are written somewhat informally, making the task of decoding their meaning at scale even harder.

Restaurant reviews are known as being some of the richest of all reviews. They tend to document the entire experience: social interactions, location, décor, service, price, and food.

(more…)

Context Matters, Tracking Quote Spread Across The Web In A Historic Year

Hindsight is 20/20. And as we usher in a new president in what has been one of the most tumultuous years in American history, we can begin to see clarity about the forces that moved throughout our jobs, our lives, and our collective imagination.

Another way to put this is that over time we tend to have more context.

Within Diffbot’s Knowledge Graph, one unique lens through which we can leverage the context of semantic data is by looking at the speakers of quotes.

When our AI reads articles it pulls out quotes, and when it can it attributes a speaker to these quotes. As our crawlers traverse the entirety of the public web, sources of quotes are validated and over time some quotes circulate more than others.

When performing a facet search, this lets us essentially show something like a retweet count for the entire web. This answers questions like: Whose voices are being heard? And which speakers are the most widely cited on a given topic?
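
The counting idea behind a facet can be sketched in a few lines. This is an illustration over made-up records, not Diffbot's query syntax or implementation; the `mentions` data, its field names, and the `facetCounts` helper are all hypothetical.

```javascript
// Made-up data: each entry stands for one article that cites a quote.
const mentions = [
  { speaker: 'Speaker A', quote: 'Quote one' },
  { speaker: 'Speaker B', quote: 'Quote two' },
  { speaker: 'Speaker A', quote: 'Quote one' },
  { speaker: 'Speaker A', quote: 'Quote three' },
  { speaker: 'Speaker B', quote: 'Quote two' },
];

// Count how many mentions share each value of `field`,
// then sort from most to least cited.
function facetCounts(items, field) {
  const counts = new Map();
  for (const item of items) {
    counts.set(item[field], (counts.get(item[field]) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(facetCounts(mentions, 'speaker')); // most cited speakers first
console.log(facetCounts(mentions, 'quote'));   // most circulated quotes first
```

Faceting on the speaker field answers "who is cited most," while faceting on the quote field gives the "retweet count" style ranking of individual statements.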

To commemorate the end of an era, let’s take a look at a few of the most circulated statements of the last 365 days.

What were the 10 most circulated quotes across the web by President Joe Biden in the last 365 days?

(more…)

From Knowledge Graphs to Knowledge Workflows

2020 Was The “Year of the Knowledge Graph”

2020 was undeniably the “Year of the Knowledge Graph.”

2020 was the year that Gartner put Knowledge Graphs at the peak of its hype cycle.

It was the year when 10% of the papers published at EMNLP referenced “knowledge” in their titles.

It was the year over 1000 engineers, enterprise users, and academics came together to talk about Knowledge Graphs at the 2nd Knowledge Graph Conference.

There are good reasons for this grassroots trend. It isn’t any one company that is pushing it (ahem, I’m looking at you, Cognitive Computing), but rather a broad coalition of academics, industry vertical practitioners, and enterprise users who generally deal with building intelligent information systems.

Knowledge graphs represent the best of what we hope the “next step” of AI looks like: intelligent systems that aren’t black boxes, but are explainable, that are grounded in the same real-world entities as us humans, and that are able to exchange knowledge with us using precise common vocabularies. It’s no coincidence that in the same year that marked the peak of the deep learning revolution (2012), Google introduced the Google Knowledge Graph as a way to provide interpretability to its otherwise opaque search ranking algorithms.

The Risk Of Hype: Touted Benefits Don’t Materialize

(more…)

Extracting Product Variant Data with Diffbot API

Diffbot API allows you to automatically gather ecommerce information such as images, description, brand, prices, and specs from product pages. But what about when a product page contains multiple variants of the product, offered at different prices?

A product variant is a variation of a base product, such as one of multiple sizes, colors, or styles, that may have its own pricing and availability. For many kinds of products, ranging from apparel to home goods to car parts, these product variants are crucial to understand. For example, you wouldn’t want to get kid-sized shoes sent to you for adult-sized feet. Product variants also give you clues as to which variations of a product are available from the merchant, and which might be sold out.

Diffbot’s APIs might not always be able to extract variants automatically using AI, but thankfully Diffbot includes a powerful Custom API that allows you to both correct and augment what is extracted.

Let’s take a look at this product page, in this example a bedding sheet set from Macy’s, that has product variants. If we pass this URL to Diffbot API, Diffbot automatically extracts the base product’s title, text, price, SKU, and images, as well as the thread count and fabric. However, it does not extract the variants.

In this example, the sheets come in multiple sizes (from Twin to California King) and in colors ranging from a classic white to Pomegranate (which unsurprisingly has plenty in stock). As humans, we can easily see that the add-to-bag price depends on the size, and not the color.

Let’s make our AI see this too.

To do this we can use an X-eval rule, essentially a JavaScript function with our own custom scraping logic, to augment what Diffbot already extracts. An X-eval can be specified when creating a custom rule using the Custom API.

function () {
  start();
  var variants = [];

  /* helper: elements matching a selector that are not marked unavailable */
  var available = function (selector) {
    return $(selector).filter(function (i, e) {
      return !$(e).hasClass('unavailable');
    });
  };

  /* get sizes */
  var sizes = available('li.swatch-itm');
  for (var i = 0; i < sizes.length; i++) {
    /* re-query on each pass, in case the previous click re-rendered the list */
    sizes = available('li.swatch-itm');
    var sizeEl = sizes[i];
    if (!sizeEl) { break; }
    sizeEl.click();
    /* get colors. click first */
    var colors = available('li.color-swatch');
    if (colors.length > 0) {
      colors[0].click();
    }
    /* read the price shown for this size; null if no number is found */
    var priceMatch = $('div.price').text().match(/([0-9.]+)/);
    var price = priceMatch ? priceMatch[1] : null;
    /* record one variant per available color at this size's price */
    for (var j = 0; j < colors.length; j++) {
      var colorEl = colors[j];
      variants.push({
        'size': sizeEl.textContent.trim(),
        'color': $(colorEl).find('.color-swatch-div').attr('aria-label'),
        'offerPrice': price
      });
    }
  }
  save("variants", variants);
  end();
}

All X-eval functions start with a start(); invocation and end with end(); to signal that the function is complete (important when there are callbacks that execute after function return).

We proceed by enumerating the list of available sizes using jQuery, which is supported in X-eval functions. We then click on the DOM element corresponding to each size, and use another jQuery selector to select the list of available colors. Finally, we use a third jQuery selector to select the offer price, and save each combination of (size, color, price) to a variants array.

The last step is calling save() on variants, which saves the variants array as a property of the product JSON that is returned by Diffbot. Our final extracted product now has these variants captured.
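
As a rough sketch of how the saved property might be consumed downstream, the snippet below groups variants by size so each size maps to its price and its available colors. The `product` object is hypothetical sample data: only the field names (`variants`, `size`, `color`, `offerPrice`) come from the X-eval function above, the values are invented, and `groupBySize` is our own helper name rather than part of Diffbot's API.

```javascript
// Hypothetical sample shaped like the saved "variants" property.
// Field names match the X-eval function; the values are made up.
const product = {
  variants: [
    { size: 'Twin', color: 'White', offerPrice: '49.99' },
    { size: 'Twin', color: 'Pomegranate', offerPrice: '49.99' },
    { size: 'California King', color: 'White', offerPrice: '89.99' },
  ],
};

// Build a Map of size -> { offerPrice, colors } from the flat variants list.
// The price is taken from the first variant seen for each size, which matches
// how the X-eval function records one price per size.
function groupBySize(variants) {
  const bySize = new Map();
  for (const { size, color, offerPrice } of variants) {
    if (!bySize.has(size)) {
      bySize.set(size, { offerPrice: Number(offerPrice), colors: [] });
    }
    bySize.get(size).colors.push(color);
  }
  return bySize;
}

const grouped = groupBySize(product.variants);
console.log(grouped.get('Twin'));
console.log(grouped.get('California King'));
```

A Map is used instead of a plain object so that unusual size labels cannot collide with built-in object properties.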
