Comparison of Web Data Providers: Alexa vs. Ahrefs vs. Diffbot

Skip Ahead

Web-Wide Versus Niche Web Data Extraction
Analysis of Ahrefs' Database of the Web
Analysis of Alexa's Database of the Web
Analysis of Diffbot's Database of the Web

Many cornerstone providers of martech bill themselves out as “databases of the web.” In a sense, any marketing analytics or news monitoring platform that can provide data on long tail queries has a solid basis for such a claim. There are countless applications for many of these web databases. But what many new users or those early in their buying process aren’t exposed to is the fact that web-wide crawlers can crawl the exact same pages and pull out extensively different data.

Use cases for three of the largest commercially-available “databases of the web”

In this guide we’ll look at three of the largest purveyors of web data sourced from the entire public web: Alexa, Ahrefs, and Diffbot. We’ll aim to point out how these three providers have both legitimate claims to web-wide breadth within their databases, and provide categorically different data and knowledge products.

Skip to:

Link Data or Linked Data?
Web-Wide Versus Niche Web Crawlers
Analysis of Ahrefs’ Database of the Web
Analysis of Alexa’s Database of the Web
Analysis of Diffbot’s Database of the Web

Link Data or Linked Data?

One of the first major differences between data provided from Ahrefs, Alexa, and Diffbot’s web crawlers can be hard to suss out from marketing materials.

This is largely due to the fact that it hinges on the difference of two letters.

Alexa and Ahrefs are enduringly-popular providers of backlink, keyword and SEO-related data. Their products provide access to some of the largest databases of literal links between pages on the web. Coupled with data on which pages rank for search queries as well as traffic estimates, both Alexa and Ahrefs are able to extrapolate the relative SEO strengths and popularities of sites across the web.

Examples of the types of data provided by web crawlers Ahrefs and Alexa

Ahrefs and Alexa crawl the web for link data that is then processed to provide SEO intelligence for marketers. The central component of these offerings is data on links and data extrapolated from links.

Data you may expect to gain from Ahrefs and Alexa web-wide crawls include:

The number and quality of backlinks on a domain
Which keywords a domain ranks for in search queries
Estimates on traffic and popularity of a site
The Domain Rank or Alexa Rank (depending on provider)
A suite of tools that leverage link data for lead generation and competitive analysis

Diffbot also crawls roughly the entire public web. And also stores entries for virtually every piece of publicly published content. But there are some key differences between what Diffbot data looks like at the end of the day. This is due to facts like:

Diffbot parses crawled content from across the web differently than both Ahrefs and Alexa
The characteristics and structure of Diffbot data entities are greatly different than Ahrefs and Alexa’s database entries

An illustration of the differences between data on links and semantic data pulled into linked entities

A fundamental difference here involves data “entities” versus “entries.”

You see, the results of Diffbot’s web-wide crawling (known as Diffbot’s Knowledge Graph™, or KG) are composed of a variety of types of entities. Unlike Alexa and Ahrefs, who provide entries on various link, traffic, and keyword metrics for websites (one entry type), Diffbot’s KG takes data on websites and processes it to provide facts on various entities found in the world.

The current list of entity types within Diffbot’s KG includes:

AdministrativeArea
Article
Corporation
Degree Entity
Discussion
EducationMajorEntity
EducationalInstitution
EmploymentCategory
Event
Image
Intangible
Landmark
LocalBusiness
Miscellaneous
Organization
Person
Place
Post
Product
Role
Skill
Or Video

One of the key features of these KG entities is that different types of facts are important for different types of entities. For example, a person entity may have a fact related to their current employer. Meanwhile, a corporation entity may have facts related to their subsidiaries and brands.

This flexibility, inherent to knowledge graphs, allows data returned by Diffbot’s web crawlers to meaningfully represent many types of “things.” With entity types ranging from products, to people, to discussions, this leads to many, many use cases.

In the case of Diffbot’s KG, this means that facts attached to entities are included because they are judged to be a pertinent part of that entity type. They have “meaning” because:

By way of another example, we know that seed funding round data has some meaning within the context of an organization’s history. It’s unlikely that a seed funding round has meaning for an article entity. (Now that would be a valuable piece of content!)

Knowledge graphs are such a natural fit for representing semantic data because they’re centered around the relationships between entities. Where traditional databases (even relational databases) are built to preserve the integrity of entries in the database, KG’s are built to preserve the relationships between entities.

These types of relationships are explorable at a web-wide scale within Diffbot’s Knowledge Graph.

Semantic data helps preserve real-world relationships in the form of entities in the KG

In summary, KG entity types are populated with semantic data that shows the relationships between entities in the world. Modelling facts in this way at a web-wide scale is extremely valuable for many use cases.

To return to the key differences between Alexa, Ahrefs, and Diffbot data, let’s summarize the most important points:

How web crawler data is parsed by Alexa, Ahrefs, and Diffbot:

Alexa and Ahrefs => arrange web crawler data on the features and composition of web sites
Diffbot => arranges web crawler data on what content from these websites actually means

The structure and characteristics of data from Alexa, Ahrefs, and Diffbot:

Alexa and Ahrefs => data on the number, quality and type of link data for one type of entry (web pages)
Diffbot => data parsed into facts placed within many contextually-linked entity types (organizations, people, products, web pages, and many more)

Just because a group of crawlers all crawl the entire web doesn’t mean the resulting data is remotely similar.

Web-Wide Versus Niche Web Data Extraction

At most recent count, the following web crawling stats relate to the three “databases of the web” in question:

Ahrefs: 5 billion web pages crawled a day. 16 trillion known links.
Alexa: Provides detailed data on top 30 million sites by popularity.
Diffbot: Extracts data on over 98% of the public web. Over 10 billion entities and 2 trillion facts.

Even the most minor websites are likely to get crawled by these three web crawlers.

By comparison, many niche web crawlers only crawl a specific subset of sites.

Examples of niche crawlers include:

Crawlers for ecommerce data from Amazon
Crawlers for contact data pointed at specific target sites
Crawlers for mentions on social media
And many more

Additionally, some crawlers may crawl a wider set of sites, but only look for specific data on those sites.

Examples of this form of niche crawlers include:

Old-school outreach “bots”
Crawlers that monitor discussions and reviews
Search engine index crawlers
Marketing data-related crawlers

In this latter sense, these crawlers may provide some of the “widest” crawls available. But they don’t meaningfully extract or interact with all aspects of the page that’s being crawled.

Both types of niche crawlers above are typically ordered around well-defined use cases for a small set of verticals or user roles. This isn’t a bad thing. As in many fields, both niche and more general providers do play valuable parts in a given ecosystem.

It’s also worth noting that when compared to other martech crawlers, Ahrefs and Alexa are anything but niche. They provide some of the widest and most up-to-date data on a wide range of marketing-related metrics.

With all that said, the range of data and entity types extracted from sites crawled by Diffbot is much wider than those extracted by Ahrefs and Alexa. They may be some of the largest web crawlers online, but their focus areas are niche compared to Diffbot.

This difference can be seen by comparing the primary use cases for Alexa, Ahrefs, and Diffbot.

Nearly all of Ahrefs and Alexas use cases can be found within Diffbot’s market intelligence and news monitoring use cases. This isn’t to say the resulting data of any one provider is better. Rather that the structure and nature of what Diffbot’s web crawler extracts has many more applications.

Competitive Analysis
Keyword Research
Backlink Research
Content Research
Rank Tracking
Web Monitoring

Primary Alexa Use Cases

Keyword Research
Competitive Website Analysis
SEO Analysis
Checking Backlinks
Target Audience Analysis

Primary Diffbot Use Cases

Market Intelligence
Ecommerce
News Monitoring
Machine Learning
Web Scraping

Analysis of Ahrefs’ Database of the Web

Overview

Ahrefs provides the world’s largest index of live backlinks. Their crawlers continuously work their way through over 54 billion web pages and report changes to backlinks between pages. Their index of backlinks is updated roughly every 15 minutes with backlinks from newly (re-)crawled pages.

Ahrefs’ index of backlinks, coupled with keyword rankings for pages on the web allows Ahrefs to extrapolate to many forms of marketing information detailed below.

What type of data is available?

Backlink data
“Keyword difficulty” (how challenging it is to rank a page for a keyword)
What keywords a given page or site ranks for
Relative rankings on the strength of a domain or page
Competing page analysis
Content gaps
Data-driven keyword suggestions
Site scan data related to SEO
Estimated traffic

Features

Site Explorer
Keyword Explorer
Site Audit
Rank Tracker
Content Explorer
Link Intersect
Batch Analysis
Browser Plugins

Use Cases

Competitive Intelligence
Web/News Monitoring
Keyword-Related Marketing Research
SEO Analysis
Lead Generation

Who It’s For

Marketers
Publishers
Growth Teams
SEO Professionals
PPC Professionals
Public Relations Professionals
Content Marketers
Webmasters

Pricing

$99-$999/Month
$179/Month Standard Plan

Analysis of Alexa’s Database of the Web

Overview

Alexa provides information on the relative popularity of domains through the sampling of traffic patterns provided by individuals who utilize the Alexa plugin. Popularity is extrapolated from the number of unique visitors who visit a domain as well as the overall traffic to a domain. Additionally, Alexa crawls a large portion of the entire web to extract backlink data.

This link data, in conjunction with data on which content is ranking for which keywords on search engines is used to provide keyword and PPC recommendations and tracking. Perhaps the most unique offering of Alexa is one of the largest publicly available indexes of traffic patterns online. These are offered in the form of the Alexa traffic rank for pages as well as an insights panel related to audience analysis.

What type of data is available?

Web traffic data
Audience demographics
Backlink data
“Keyword difficulty” (how challenging it is to rank a page for a keyword)
What keywords a given page or site ranks for
Relative rankings on the strength of a domain or page
Competing page analysis
Content gaps
Share of “voice” in a given niche

Features

Site audits
Organic keyword research
PPC keyword research
Web competitor analysis and research
Keyword analysis
Traffic analysis
Audience analysis
Share of voice analysis
Site comparison
Browser plugin

Use Cases

Marketers
Publishers
Growth Teams
SEO Professionals
PPC Professionals
Public Relations Professionals
Content Marketers
Webmasters

Pricing

$79-$299/Month
$149/Month Standard “Advanced” Plan

Analysis of Diffbot’s Database of the Web

Overview

Diffbot sources Knowledge Graph™ data through web-wide crawls that are then parsed by a cutting-edge machine vision and natural language processing AI system. Data is organized into entity types that are interlinked and populated by facts. Data within Diffbot entities is semantic, meaning that data is encoded next to the “meaning” or context of that data. At a general level, Diffbot takes unstructured data from billions of sites across the web and provides this data structure within the world’s largest knowledge graph.

Diffbot also offers customizable data extraction solutions including Crawlbot as well as custom Extraction APIs.

What type of data is available?

Organizational (firmographic) data
Person data
Product data
News mention data
Other entity data

Features

Access to structured entity data extracted from pages across the web (Knowledge Graph)
Data enrichment (Enhance)
Bulk crawler
Custom Extraction APIs
Excel, Google Sheets, Tableau integrations

Use Cases