Every Company That Sells Organization Data Is Biased

Yes, even the biggest leaders in market intelligence. Even us.

Some focus solely on startups. Some only on venture-backed companies. But you probably wouldn’t even know. Because most won’t (or can’t) tell you what their data is biased towards! 🤭

“We have over 10M companies in our database!” is a meaningless statement if you can’t tell whether the data is a representative sample of Indian restaurants in the world, or perhaps more realistically, what they just happened to scrape.

Unless we’re talking a dataset at least 200M unique organizations strong, you’re looking at a biased sample. And that’s still a conservative minimum.

This is common knowledge among data buyers, who make up for the lack of a known bias by evaluating datasets against known, easily verifiable data, like the Fortune 1000.

Given enough evaluation feedback cycles, most organization data brokers end up biased towards the Fortune 1000.

If your target is enterprise B2B, you’re in luck. You can find that data anywhere. Just check your spam folder.

If it’s anything even remotely more niche, like rubber gasket manufacturers or global non-profits focused on relieving poverty, you’re probably scraping this data yourself off a conference site.

And if your market intelligence application needs the closest thing to a truly representative sample of global organizations, it might seem impossible.

For data brokers, it just doesn’t make any sense to boil the ocean. It’s cheaper and easier to focus data entry resources on a few markets and whatever coverage gap feedback they get from lost deals.

Even if they did manage to compile all the companies on Earth, they would have to do it over and over again to keep their records fresh.

It’s an absurd and impractical human labor cost to maintain. So no one employs hundreds of people just to enter org data. Not even us.

We employ machines instead, which crawl millions of publicly accessible websites, interpret raw text into data autonomously, and structure each detail into facts on every organization known to the public web.

Which, as it turns out, is our known bias.
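
To make that concrete, here’s a minimal sketch of what querying machine-built org data can look like in Python. It assumes a Diffbot token in the environment, follows the public DQL endpoint as documented at the time of writing, and uses a purely illustrative industry filter:

```python
# Minimal sketch: query the Knowledge Graph for organizations in a niche
# market. Assumes DIFFBOT_TOKEN is set; the industry value is illustrative.
import os
import requests

resp = requests.get(
    "https://kg.diffbot.com/kg/v3/dql",
    params={
        "token": os.environ["DIFFBOT_TOKEN"],
        "type": "query",
        "query": 'type:Organization industries:"Rubber Gasket Manufacturing"',
        "size": 10,
    },
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("data", []):
    entity = hit.get("entity", {})
    print(entity.get("name"), "-", entity.get("homepageUri"))
```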

Read More

Generating B2B Sales Leads With Diffbot’s Knowledge Graph

Lead generation is the single biggest challenge for up to 85% of B2B marketers.

Simultaneously, marketing and sales dashboards are filled with ever more data. There are more ways to get in front of a potential lead than ever before. And nearly every org of interest has a digital footprint.

So what’s the deal? 🤔

Firmographic, demographic, and technographic data (the components of quality market segmentation) are spread across the web. Even once they’re pulled into our workflows, they’re often siloed, still only semi-structured, or otherwise disconnected. And data brokers provide data that goes stale more quickly than quality curated web sources.

But the fact persists: all the lead generation data you typically need is spread across the public web.

You just need someone (or something 🤖) to find, read, and structure this data.
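
Here’s a minimal sketch of what that “something” can look like in practice, assuming a Diffbot token and the v3 Analyze endpoint as publicly documented; the target URL is a placeholder:

```python
# Minimal sketch: let an extraction API find, read, and structure a page.
# Assumes DIFFBOT_TOKEN is set; the URL below is a placeholder.
import os
import requests

resp = requests.get(
    "https://api.diffbot.com/v3/analyze",
    params={
        "token": os.environ["DIFFBOT_TOKEN"],
        "url": "https://example.com/about",  # placeholder page to structure
    },
    timeout=60,
)
resp.raise_for_status()
for obj in resp.json().get("objects", []):
    print(obj.get("type"), "-", obj.get("title"))
```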


Read More

Towards A Public Web Infused Dashboard For Market Intel, News Monitoring, and Lead Gen [Whitepaper]

It took Google’s knowledge panels one month and twenty days to update following the appointment of a new CEO at Citi, a Fortune 100 company. In Diffbot’s Knowledge Graph, the new fact was logged within the week, with zero human intervention, sourced from the public web.

The CEO change at Citi was announced in September 2020; the lag highlights knowledge panels’ reliance on manual updates to the underlying Wiki entities.

In many studies, data teams report spending 25-30% of their time cleaning, labeling, and gathering data sets [1]. While the figure of 80% is at times bandied about, the exact percentage depends on the team and is to some degree moot. What we know for sure is that data teams and knowledge workers generally spend a noteworthy amount of their time procuring data points that are available on the public web.

The issue at play here is that the public web is our largest — and, overall, our most reliable — source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”

Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without access to the full underlying data (or without skirting TOS), they just aren’t meant to be part of a data pipeline [2].

With that said, there’s still a world of valuable data on the public web.

At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially available Knowledge Graph. For more custom needs, we point our automatic extraction APIs at specific domains, or use our natural language processing API in tandem with the KG.

In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data (from selected sites of interest, from entire web crawls, or in tandem with additional natural language processing) to build impactful and insightful dashboards par excellence.

Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.
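
As a minimal sketch of that idea, the snippet below pulls recent articles tagged with an organization out of the Knowledge Graph and into a pandas DataFrame, the kind of table a notebook, spreadsheet, or BI tool can chart. It assumes a Diffbot token; the DQL query follows the public docs at the time of writing, and the tag label is illustrative:

```python
# Minimal notebook-style sketch: recent news mentions of an org as a table.
# Assumes DIFFBOT_TOKEN is set; the tag label "Citigroup" is illustrative.
import os
import pandas as pd
import requests

resp = requests.get(
    "https://kg.diffbot.com/kg/v3/dql",
    params={
        "token": os.environ["DIFFBOT_TOKEN"],
        "type": "query",
        "query": 'type:Article tags.label:"Citigroup" sortBy:date',
        "size": 25,
    },
    timeout=30,
)
resp.raise_for_status()
rows = [
    {
        "date": (hit.get("entity", {}).get("date") or {}).get("str"),
        "site": hit.get("entity", {}).get("siteName"),
        "title": hit.get("entity", {}).get("title"),
    }
    for hit in resp.json().get("data", [])
]
print(pd.DataFrame(rows))
```

Whatever charting library or BI tool sits on top, a table like this is the dashboard’s raw material.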


Read More

Is Christian Bale a Christian? Is Mitt Romney a glove?

Download This Dataset of 12,118 Yahoo Answers for $1

With only 2 weeks left till May 4th (be with you), the internet is bursting with excitement over all the work that needs to be done before Yahoo Answers finally 404s.

From scheduling a 2nd COVID vaccine to your annual panic attack over missing the tax filing deadline (you probably didn’t miss it; it was extended to May 17 in the U.S.), everyone has a lengthy agenda ahead of the shutdown of this iconic website.


Read More

4 Ways Technical Leaders Are Structuring Text To Drive Data Transformations [Whitepaper]

Natural, unstructured language is largely how humans communicate. For this reason, it’s often the format of organizations’ most detailed and meaningful feedback and market intelligence.

Natural language was historically impractical to parse at scale, but natural language processing has now hit mainstream adoption. The global NLP market is expected to grow 20% annually through 2026.

As a benchmark-topping natural language processing API provider, Diffbot is in a unique position to survey cutting-edge NLP uses. In this paper, we’ll work through the state of open source, cloud-based, and custom NLP solutions in 2021, and lay out four ways in which technical leaders are structuring text to drive data transformations. 

In particular, we’ll take a look at:

  • How researchers are using the NL API to create a knowledge graph for an entire country
  • How the largest native ad network in finance uses NLP to monitor topics of discussion and serve up relevant ads
  • The use of custom properties for fraud detection in natural language documents at scale
  • How the ability to train recognition of 1M custom named entities in roughly a day helps create better data
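
As a small taste of what these integrations look like, here’s a minimal sketch of a Natural Language API call that structures raw text into entities with sentiment. It assumes a Diffbot token; the endpoint, payload, and response field names follow the NL API docs at the time of writing and should be verified against current documentation:

```python
# Minimal sketch: structure raw text into entities with sentiment scores.
# Assumes DIFFBOT_TOKEN is set; payload and field names are per the NL API
# docs at the time of writing and may need adjusting.
import os
import requests

text = "Citi appointed a new CEO, and investors reacted positively."

resp = requests.post(
    "https://nl.diffbot.com/v1/",
    params={"token": os.environ["DIFFBOT_TOKEN"], "fields": "entities,sentiment"},
    json={"content": text, "lang": "en"},
    timeout=30,
)
resp.raise_for_status()
doc = resp.json()
print("document sentiment:", doc.get("sentiment"))
for ent in doc.get("entities", []):
    print(ent.get("name"), ent.get("sentiment"))
```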


Read More

Diffbot-Powered Academic Research in 2020

At Diffbot, our goal is to build the most accurate, comprehensive, and fresh Knowledge Graph of the public web, and Diffbot researchers advance the state of the art in information extraction and natural language processing techniques.

Outside of our own research, we’re proud to enable others to do new kinds of research into some of the most important topics of our times, like analyzing the spread of online news, misinformation, privacy advice, emerging entities, and Knowledge Graph representations.

As an academic researcher, one of the limiting factors in your work is often access to high-quality, accurate training data for your particular problem. This is where tapping into an external Knowledge Graph API can greatly accelerate the bootstrapping of your own ML dataset.
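
As one illustration of that bootstrapping step, the sketch below resolves a handful of seed names against the Knowledge Graph’s Enhance endpoint and keeps the returned records as a starter dataset. It assumes a Diffbot token; the endpoint and parameters follow the public docs at the time of writing, and the seed names are illustrative:

```python
# Minimal sketch: bootstrap a labeled dataset from seed entity names.
# Assumes DIFFBOT_TOKEN is set; seed names are illustrative.
import os
import requests

seeds = ["Citigroup", "Diffbot", "Employbl"]
records = []
for name in seeds:
    resp = requests.get(
        "https://kg.diffbot.com/kg/v3/enhance",
        params={
            "token": os.environ["DIFFBOT_TOKEN"],
            "type": "Organization",
            "name": name,
        },
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json().get("data", []):
        records.append(hit.get("entity", {}))

print(len(records), "entity records to seed the dataset")
```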

Here is a sampling of some of the academic research conducted by others in 2020 that uses Diffbot:


Read More

The 6 Biggest Difficulties With Data Cleaning (With Workarounds)

Data is the new soil.

David McCandless

If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time-consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.

At its simplest, data cleaning revolves around two opposing needs:

  • The need to amend data points that will skew the quality of your results
  • The need to retain as much of your useful data as you can

These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.

Perhaps the most important outcome of a data cleaning job is data standardized in a way that lets analytics and BI tools easily access any value, present it in dashboards, or otherwise manipulate it.
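
That tension shows up in even the smallest cleaning script. Below is a minimal pandas sketch (with illustrative column names and values) of the trade-off: dropping rows discards signal along with skew, while imputing retains rows at the cost of amended values:

```python
# Minimal sketch of the two opposing needs in data cleaning.
# Column names and values are illustrative.
import pandas as pd

df = pd.DataFrame(
    {
        "company": ["Acme", "Globex", None, "Initech"],
        "revenue": [1.2, None, 3.4, -1.0],  # -1.0 is a corrupted sentinel
    }
)

# Need 1: amend values that would skew results (treat the -1.0 sentinel
# as missing rather than letting it drag the average down).
df["revenue"] = df["revenue"].where(df["revenue"] >= 0)

# Need 2: retain as much useful data as possible. Dropping discards every
# row with a missing field; imputing keeps rows but invents values.
dropped = df.dropna()
imputed = df.assign(revenue=df["revenue"].fillna(df["revenue"].median()))

print(len(dropped), "rows survive dropping;", len(imputed), "survive imputing")
```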


Read More

These Are The Hardest Page Types To Scrape — With Workarounds For Each

Phrases like “the web is held together by [insert ad hoc, totally precarious binding agent]” have been around for a while for a reason.

While the services we rely on tend to sport hugely impressive availability, all things considered, that doesn’t negate the fact that the macro web is a tangled mess of semi-structured or unstructured data and site-by-site nuances.

Put this together with the fact that the web is by far our largest source of valuable external data, and you have a task as high-reward as it is error-prone. That task is web scraping.

As one of three western entities to crawl and structure a vast majority of the web, we’ve learned a thing or two about where web crawling can go wrong, and we’ve incorporated many of those lessons into our rule-less Automatic Extraction APIs and Crawlbot.

In this guide we round up some of the most common challenges for teams or individuals trying to harvest data from the public web. And we provide a workaround for each. Want to see what rule-less extraction looks like for your site of interest? Check out our extraction test drive!
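
For a taste of the rule-less approach, here’s a minimal sketch of kicking off a Crawlbot job that points the Analyze API at a seed site. It assumes a Diffbot token; parameter names follow the Crawlbot docs at the time of writing, and the seed URL is a placeholder:

```python
# Minimal sketch: start a rule-less crawl that structures every page it
# finds. Assumes DIFFBOT_TOKEN is set; the seed URL is a placeholder.
import os
import requests

resp = requests.post(
    "https://api.diffbot.com/v3/crawl",
    params={
        "token": os.environ["DIFFBOT_TOKEN"],
        "name": "demo-crawl",
        "seeds": "https://example.com",  # placeholder seed site
        "apiUrl": "https://api.diffbot.com/v3/analyze",  # rule-less extraction
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```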


Read More

The 25 Most Covid-Safe Restaurants in San Francisco (According to NLP)

A few weeks ago, we ran reviews for a Michelin-reviewed restaurant through our Natural Language API. It was able to tell us what people liked or disliked about the restaurant, and even rank dishes by sentiment. In our analysis, we also noticed something curious. When our NL API pulled out the entity “Covid-19,” it wasn’t always paired with a negative sentiment.

When we dug back into where these positive mentions of Covid-19 occurred in the reviews, we saw that our NL API appeared to be picking up on language in which reviewers felt a restaurant had handled Covid-19 well. In other words, when Covid-19 was determined to be part of a positive statement, it was because guests felt relatively safe, or because the restaurant had come up with novel solutions for dealing with Covid-19.

With this in mind, we set about a second, larger analysis.
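
The core of that analysis can be sketched in a few lines: run each review through the NL API and keep the sentiment attached to the “Covid-19” entity rather than the document as a whole. This assumes a Diffbot token; the endpoint and field names follow the NL API docs at the time of writing, and the review text is made up:

```python
# Minimal sketch: entity-level (not document-level) sentiment for Covid-19
# mentions in a review. Assumes DIFFBOT_TOKEN is set; review text is made up.
import os
import requests

review = "Great tasting menu, and their Covid-19 precautions made us feel safe."

resp = requests.post(
    "https://nl.diffbot.com/v1/",
    params={"token": os.environ["DIFFBOT_TOKEN"], "fields": "entities,sentiment"},
    json={"content": review, "lang": "en"},
    timeout=30,
)
resp.raise_for_status()
for ent in resp.json().get("entities", []):
    if ent.get("name", "").lower() == "covid-19":
        print("Covid-19 entity sentiment:", ent.get("sentiment"))
```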

Read More

How Employbl Saved 250 Hours Building Their Career-Matching Database

We started with about 1,000 companies in the Employbl database, mostly in the Bay Area. Now with Diffbot we can expand to other cities and add thousands of additional companies. 

Connor Leech – CEO @Employbl

Fixing tech starts with hiring. And fixing hiring is an information problem. That’s what Connor Leech, cofounder and CEO at Employbl, discovered when creating a new talent marketplace meant to connect tech employees with the information-rich hiring marketplace they deserve.

Tech job seekers rely on a range of metrics to gauge the opportunity and stability of a potential employer.

While information like funding rounds, founders, team size, industry, and investors is often public, it can be hard to grab the myriad fields candidates value in an up-to-date format from around the web.

These difficulties are amplified by the fact that many tech startups are “long tail” entities that also change regularly.
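
A minimal sketch of that expansion step: one DQL query swaps the Bay Area for any other city. It assumes a Diffbot token; the field paths and filters are illustrative and should be checked against Diffbot’s organization schema:

```python
# Minimal sketch: find candidate companies in a new city for the database.
# Assumes DIFFBOT_TOKEN is set; city and employee-count filters illustrative.
import os
import requests

resp = requests.get(
    "https://kg.diffbot.com/kg/v3/dql",
    params={
        "token": os.environ["DIFFBOT_TOKEN"],
        "type": "query",
        "query": 'type:Organization locations.city.name:"Austin" nbEmployees>=50',
        "size": 20,
    },
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("data", []):
    org = hit.get("entity", {})
    print(org.get("name"), "-", org.get("nbEmployees"), "employees")
```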


Read More