No News Is Good News – Monitoring Average Sentiment By News Network With Diffbot’s Knowledge Graph

Ever have the feeling that news used to be more objective? That news organizations — now media empires — have moved into the realm of entertainment? Or that a cluster of news “across the aisle” from your beliefs is completely outrageous?

Many have these feelings, and coverage is rampant on bias and even straight up “fake” facts in news reporting.

With this in mind, we wanted to see if these hunches are valid. Has news gotten more negative over time? Is it a portion of the political spectrum driving this change? Or is it simply that bad things happen in the world and later get reported on?

To jump into this inquiry we utilized Diffbot’s Knowledge Graph. Diffbot is one of the few North American organizations to crawl the entire web. We apply AI-enabled web scrapers to pages that are publicly available to extract entities — think people, places, or things — and facts — think job titles, topics, and funding rounds.

We started our inquiry with some external coverage on bias in journalism provided by AllSides Media Bias Ratings.

Continue reading

Generating B2B Sales Leads With Diffbot’s Knowledge Graph

Generation of leads is the single largest challenge for up to 85% of B2B marketers.

Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.

So what’s the deal? 🤔

Firmographic, demographic, technographic (components of quality market segmentation) data are spread across the web. And even once they’re pulled into our workflows they’re often siloed, still only semi-structured, or otherwise disconnected. Data brokers provide data that gets stale more quickly than quality curated web sources.

But the fact persists, all the lead generation data you typically need is spread across the public web.

You just needs someone (or something 🤖) to find, read, and structure this data.

Continue reading

Towards A Public Web Infused Dashboard For Market Intel, News Monitoring, and Lead Gen [Whitepaper]

It took Google knowledge panels one month and twenty days to update following the inception of a new CEO at Citi, a F100 company. In Diffbot’s Knowledge Graph, a new fact was logged within the week, with zero human intervention and sourced from the public web.

The CEO change at Citi was announced in September 2020, highlighting the reliance on manual updates to underlying Wiki entities.

In many studies data teams report spending 25-30% of their time cleaning, labelling, and gathering data sets [1]. While the number 80% is at times bandied about, an exact percentage will depend on the team and is to some degree moot. What we know for sure is that data teams and knowledge workers generally spend a noteworthy amount of their time procuring data points that are available on the public web.

The issues at play here are that the public web is our largest — and overall — most reliable source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”

Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without the full extent of underlying data (or skirting TOS), they just aren’t meant to be part of a data pipeline [2].

With that said, there’s still a world of valuable data on the public web.

At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially-available Knowledge Graph. For more custom needs, we harness our automatic extraction APIs pointed at specific domains, or our natural language processing API in tandem with the KG.

In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data from a selection of sites of interest, entire web crawls, or in tandem with additional natural language processing to build impactful and insightful dashboards par excellence.

Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.

Continue reading

4 Ways Technical Leaders Are Structuring Text To Drive Data Transformations [Whitepaper]

Natural and unstructured language is how humans largely communicate. For this reason, it’s often the format of organizations’ most detailed and meaningful feedback and market intelligence. 

Historically impractical to parse at scale, natural language processing has hit mainstream adoption. The global NLP market is expected to grow 20% annually through 2026.  Analysts suggest that 

As a benchmark-topping natural language processing API provider, Diffbot is in a unique position to survey cutting-edge NLP uses. In this paper, we’ll work through the state of open source, cloud-based, and custom NLP solutions in 2021, and lay out four ways in which technical leaders are structuring text to drive data transformations. 

In particular, we’ll take a look at:

  • How researchers are using the NL API to create a knowledge graph for entire country
  • How the largest native ad network in finance uses NLP to monitor topics of discussion and serve up relavent ads
  • The use of custom properties for fraud detection in natural language documents at scale
  • How the ability to train recognition of 1M custom named entities in roughly a day helps create better data

Continue reading

Diffbot-Powered Academic Research in 2020

At Diffbot, our goal is to build the most accurate, comprehensive, and fresh Knowledge Graph of the public web, and Diffbot researchers advance the state-of-the-art in information extraction and natural language processing techniques.

Outside of our own research, we’re proud to enable others to do new kinds of research in some of the most important topics of our times: like analyzing the spread of online news, misinformation, privacy advice, emerging entities, and Knowledge Graph representations.

As an academic researcher, one of the limiting factors in your work is often access to high-quality accurate training data to study your particular problem. This is where tapping into an external Knowledge Graph API can help you greatly accelerate the boostrapping of your own ML dataset.

Here is a sampling of some of the academic research conducted by others in 2020 that uses Diffbot:

Continue reading

These Are The Hardest Page Types To Scrape — With Workarounds For Each

Phrases like “the web is held together by [insert ad hoc, totally precarious binding agent]” have been around for a while for a reason.

While the services we rely on tend to sport hugely impressive availability considering, that still doesn’t negate the fact that the macro web is a tangled mess of semi or unstructured data, and site-by-site nuances.

Put this together with the fact that the web is by far our largest source of valuable external data, and you have a task as high reward as it is error prone. That task is web scraping.

As one of three western entities to crawl and structure a vast majority of the web, we’ve learned a thing or two about where web crawling can wrong. And incorporated many solutions into our rule-less Automatic Extraction APIs and Crawlbot.

In this guide we round up some of the most common challenges for teams or individuals trying to harvest data from the public web. And we provide a workaround for each. Want to see what rule-less extraction looks like for your site of interest? Check out our extraction test drive!

Continue reading

What We Found Analyzing 300 Yelp Reviews of a Michelin Reviewed Restaurant with Natural Language Processing

Reviews are a veritable gold mine of data. They’re one of the few times when unsolicited customers lay out the best and the worst parts of using a product or service. And the relative richness of natural language can quickly point product or service providers in a nuanced direction more definitively than quantitative metrics like time on site, bounce rate, or sales numbers.

The flip side of this linguistic richness is that reviews are largely unstructured data. Beyond that, many reviews are written somewhat informally, making the task of decoding their meaning at scale even harder.

Restaurant reviews are known as being some of the richest of all reviews. They tend to document the entire experience: social interactions, location, décor, service, price, and food.

Continue reading

Extracting Product Variant Data with DiffbotAPI

Diffbot API allows you to automatically gather ecommerce information such as images, description, brand, prices and specs from product pages, but what about when product pages contain mutiple variants of the product, being offered at different prices?

A product variant is when there are variations of a base product, such as mulitiple sizes, colors, or styles that may have their own pricing and availability. For many kinds of products–ranging from apparel, to home goods, to car parts, these product variants are crucial to understand. For example, you wouldn’t want to get kid-sized shoes sent to you for adult-sized feet. Product variants also give you clues as to which variations of a product are available from the merchant, and which might be sold-out.

Diffbot’s APIs might not always be able to extract variants automatically using AI, but thankfully Diffbot includes a powerful Custom API that allows you to both correct and augment what is extracted.

Let’s take a look at this product page – in this example a bedding sheets set from Macys – that has product variants. If we pass this URL to Diffbot API, Diffbot automatically extracts the base product’s title, text, price, sku, images, as well as the thread count and fabric. However, it does not extract the variants.

In this example, the sheets come in multiple sizes (from Twin to California King) and come in colors ranging from a classic white to Pomegrante (which unsurprisingly has plenty in stock). We can easily see as a human that the add-to-bag price depends on the size, and not the color.

Let’s make our AI see this too.

To do this we can use an X-Eval rule, essentially a Javascript function with our own custom scraping logic to augment what Diffbot already extracts. An X-eval can be specified when creating a custom rule using the Custom API.

function () {
  start();
  var variants = [];
  
  /* get sizes*/
  var sizes = $('li.swatch-itm').filter((i,e) => {
    return !$(e).hasClass('unavailable');
  });
  for (var i = 0; i < sizes.length; i++){
    var sizes = $('li.swatch-itm').filter((i,e) => {
        return !$(e).hasClass('unavailable');
    });
    var sizeEl = sizes[i];
    sizeEl.click();
    /* get colors. click first */
    var colors = $('li.color-swatch').filter((i,e) => {
      return !$(e).hasClass('unavailable');
    });
    if (colors.length > 0) {
      colors[0].click();
    }
    var price = $('div.price').text().match(/([0-9.]+)/)[1]; 
    for(var j = 0; j < colors.length; j++) {
      var colorEl = colors[j];
      variants.push({
      'size': sizeEl.textContent.trim(),
      'color': $(colorEl).find('.color-swatch-div').attr('aria-label'),
      'offerPrice': price
      }); 
    }
  }
  save ("variants", variants);
  end();
}

All X-eval functions start with a start(); invocation and end with end(); to signal that the function is complete (important when there are callbacks that execute after function return).

We proceed by enumerating the list of available sizes using Jquery, which is supported in X-eval functions. We then click on the DOM element corresponding to each size, and then use another Jquery selector to select the list of available colors. Finally, we use a third Jquery selector to select the offer price, and save this combination of (size, color, price) to a variants array.

The last step is calling save() on variants, which saves the variants array as a property of the product JSON that is returned by Diffbot. Our final extracted product now has these variants captured.

Robotic Process Automation Extraction Is A Time Saver. But it’s Not Built For the Future

Enough individuals have heard the siren song of Robotic Process Automation to build several $1B companies. Even if you don’t know the “household names” in the space, something about the buzzword abbreviated as “RPA” leaves the impression that you need it. That it boosts productivity. That it enables “smart” processes. 

RPA saves millions of work hours, for sure. But how solid is the foundation for processes built using RPA tech? 

Related Reads: 

 

First off, RPA operates by literally moving pixels across the screen. Repetitive tasks are automated by saving “steps” with which someone would manipulate applications with their mouse, and then enacting these steps without human oversight. There are plenty of examples for situations in which this is handy. You need to move entries from a spreadsheet to a CRM. You need to move entries from a CRM to a CDP. You need to cut and paste thousands or millions of times between two windows in a browser. 

These are legitimate issues within back end business workflows. And RPA remedies these issues. But what happens when your software is updated? Or you need to connect two new programs? Or your ecosystem of tools changes completely? Or you just want to use your data differently? 

This shows the hint of the first issue with the foundation on which RPA is built. RPA can’t operate in environments in which it hasn’t seen (and received extensive documentation about). 

Continue reading

Most “Autoscrapers” Are Still Rule-Based Web Scraping Tools

And why it matters for scaling your public web data sources

As with most forms of tech these days, web scrapers have recently seen a surge of claims that they’re somehow based on AI or machine learning tech. While this suggests that an AI will detect exactly what you want extracted from a page, most scrapers are still rule-based (there are some exceptions, such as Diffbot’s Automatic Extraction APIs).

Why does this matter?

Historically rule-based extraction has been the norm. In rule-based extraction, you specify a set of rules for what you want pulled from a page. This is often an HTML element, CSS selector, or a regex pattern. Maybe you want the third bulleted item beneath every paragraph in a text, or all headers, or all links on a page; rule-based extraction can help with that.

Continue reading