Comparison of Web Data Providers: Alexa vs. Ahrefs vs. Diffbot

Use cases for three of the largest commercially-available “databases of the web”

Many cornerstone providers of martech bill themselves out as “databases of the web.” In a sense, any marketing analytics or news monitoring platform that can provide data on long tail queries has a solid basis for such a claim. There are countless applications for many of these web databases. But what many new users or those early in their buying process aren’t exposed to is the fact that web-wide crawlers can crawl the exact same pages and pull out extensively different data.

 

In this guide we’ll look at three of the largest purveyors of web data sourced from the entire public web: Alexa, Ahrefs, and Diffbot. We’ll aim to point out how these three providers have both legitimate claims to web-wide breadth within their databases, and provide categorically different data and knowledge products.

 

Skip to:

Link Data or Linked Data?

One of the first major differences between data provided from Ahrefs, Alexa, and Diffbot’s web crawlers can be hard to suss out from marketing materials.

This is largely due to the fact that it hinges on the difference of two letters.

Alexa and Ahrefs are enduringly-popular providers of backlink, keyword and SEO-related data. Their products provide access to some of the largest databases of literal links between pages on the web. Coupled with data on which pages rank for search queries as well as traffic estimates, both Alexa and Ahrefs are able to extrapolate the relative SEO strengths and popularities of sites across the web.

Examples of the types of data provided by web crawlers Ahrefs and Alexa

Ahrefs and Alexa crawl the web for link data that is then processed to provide SEO intelligence for marketers. The central component of these offerings is data on links and data extrapolated from links.

Data you may expect to gain from Ahrefs and Alexa web-wide crawls include: 

  • The number and quality of backlinks on a domain
  • Which keywords a domain ranks for in search queries
  • Estimates on traffic and popularity of a site
  • The Domain Rank or Alexa Rank (depending on provider)
  • A suite of tools that leverage link data for lead generation and competitive analysis

Diffbot also crawls roughly the entire public web. And also stores entries for virtually every piece of publicly published content. But there are some key differences between what Diffbot data looks like at the end of the day. This is due to facts like: 

  1. Diffbot parses crawled content from across the web differently than both Ahrefs and Alexa
  2. The characteristics and structure of Diffbot data entities are greatly different than Ahrefs and Alexa’s database entries
An illustration of the differences between data on links and semantic data pulled into linked entities

A fundamental difference here involves data “entities” versus “entries.”  

You see, the results of Diffbot’s web-wide crawling (known as Diffbot’s Knowledge Graph™, or KG) are composed of a variety of types of entities. Unlike Alexa and Ahrefs, who provide entries on various link, traffic, and keyword metrics for websites (one entry type), Diffbot’s KG takes data on websites and processes it to provide facts on various entities found in the world. 

The current list of entity types within Diffbot’s KG includes: 

  • AdministrativeArea
  • Article
  • Corporation
  • Degree Entity
  • Discussion
  • EducationMajorEntity
  • EducationalInstitution
  • EmploymentCategory
  • Event
  • Image
  • Intangible
  • Landmark
  • LocalBusiness
  • Miscellaneous
  • Organization
  • Person
  • Place
  • Post
  • Product
  • Role
  • Skill
  • Or Video

One of the key features of these KG entities is that different types of facts are important for different types of entities. For example, a person entity may have a fact related to their current employer. Meanwhile, a corporation entity may have facts related to their subsidiaries and brands.  

This flexibility, inherent to knowledge graphs, allows data returned by Diffbot’s web crawlers to meaningfully represent many types of “things.” With entity types ranging from products, to people, to discussions, this leads to many, many use cases. 

A foundational difference between a knowledge graph and another “database of the web” is that knowledge graphs are comprised of entities that mirror the structure of what they’re representing. Additionally, a hierarchy — called an ontology — is implied between many entity types. For example, an award could be given to a university, but a university can’t be “given” to an award. 

The rules and possibilities for this type of data structure can be summed up by the term “semantic data.” 

Semantic data is data that’s encoded alongside the “meaning” of this data. 

In the case of Diffbot’s KG, this means that facts attached to entities are included because they are judged to be a pertinent part of that entity type. They have “meaning” because:

  • They somehow mirror the structure of an entity out in the world. 
  • They preserve useful relationships between entity types.

By way of another example, we know that seed funding round data has some meaning within the context of an organization’s history. It’s unlikely that a seed funding round has meaning for an article entity. (Now that would be a valuable piece of content!) 

Knowledge graphs are such a natural fit for representing semantic data because they’re centered around the relationships between entities. Where traditional databases (even relational databases) are built to preserve the integrity of entries in the database, KG’s are built to preserve the relationships between entities. 

This conceptual shift is exceedingly valuable when trying to represent objects from the world (as most all databases are trying to do). As most of the “meaning” we derive from observation comes in the form of the relationships between different entity types. 

These types of relationships are explorable at a web-wide scale within Diffbot’s Knowledge Graph

Semantic data helps preserve real-world relationships in the form of entities in the KG

In summary, KG entity types are populated with semantic data that shows the relationships between entities in the world. Modelling facts in this way at a web-wide scale is extremely valuable for many use cases.

To return to the key differences between Alexa, Ahrefs, and Diffbot data, let’s summarize the most important points:

How web crawler data is parsed by Alexa, Ahrefs, and Diffbot: 

  • Alexa and Ahrefs => arrange web crawler data on the features and composition of web sites
  • Diffbot => arranges web crawler data on what content from these websites actually means

The structure and characteristics of data from Alexa, Ahrefs, and Diffbot: 

  • Alexa and Ahrefs => data on the number, quality and type of link data for one type of entry (web pages) 
  • Diffbot => data parsed into facts placed within many contextually-linked entity types (organizations, people, products, web pages, and many more)

Just because a group of crawlers all crawl the entire web doesn’t mean the resulting data is remotely similar. 

 

Web-Wide Versus Niche Web Data Extraction

At most recent count, the following web crawling stats relate to the three “databases of the web” in question:

  • Ahrefs: 5 billion web pages crawled a day. 16 trillion known links.
  • Alexa: Provides detailed data on top 30 million sites by popularity.
  • Diffbot: Extracts data on over 98% of the public web. Over 10 billion entities and 2 trillion facts.

Even the most minor websites are likely to get crawled by these three web crawlers. 

By comparison, many niche web crawlers only crawl a specific subset of sites. 

Examples of niche crawlers include: 

  • Crawlers for ecommerce data from Amazon
  • Crawlers for contact data pointed at specific target sites
  • Crawlers for mentions on social media
  • And many more

Additionally, some crawlers may crawl a wider set of sites, but only look for specific data on those sites. 

Examples of this form of niche crawlers include: 

  • Old-school outreach “bots” 
  • Crawlers that monitor discussions and reviews
  • Search engine index crawlers
  • Marketing data-related crawlers

In this latter sense, these crawlers may provide some of the “widest” crawls available. But they don’t meaningfully extract or interact with all aspects of the page that’s being crawled. 

Both types of niche crawlers above are typically ordered around well-defined use cases for a small set of verticals or user roles. This isn’t a bad thing. As in many fields, both niche and more general providers do play valuable parts in a given ecosystem. 

It’s also worth noting that when compared to other martech crawlers, Ahrefs and Alexa are anything but niche. They provide some of the widest and most up-to-date data on a wide range of marketing-related metrics. 

With all that said, the range of data and entity types extracted from sites crawled by Diffbot is much wider than those extracted by Ahrefs and Alexa. They may be some of the largest web crawlers online, but their focus areas are niche compared to Diffbot. 

 

This difference can be seen by comparing the primary use cases for Alexa, Ahrefs, and Diffbot. 

 

Use cases for three of the largest commercially-available “databases of the web”

Nearly all of Ahrefs and Alexas use cases can be found within Diffbot’s market intelligence and news monitoring use cases. This isn’t to say the resulting data of any one provider is better. Rather that the structure and nature of what Diffbot’s web crawler extracts has many more applications. 

Primary Ahrefs Use Cases

  • Competitive Analysis
  • Keyword Research
  • Backlink Research
  • Content Research
  • Rank Tracking
  • Web Monitoring

Primary Alexa Use Cases

  • Keyword Research
  • Competitive Website Analysis
  • SEO Analysis
  • Checking Backlinks
  • Target Audience Analysis

Primary Diffbot Use Cases

  • Market Intelligence
  • Ecommerce
  • News Monitoring
  • Machine Learning
  • Web Scraping

 

 

Analysis of Ahrefs’ Database of the Web

Overview

Ahrefs provides the world’s largest index of live backlinks. Their crawlers continuously work their way through over 54 billion web pages and report changes to backlinks between pages. Their index of backlinks is updated roughly every 15 minutes with backlinks from newly (re-)crawled pages. 

Ahrefs’ index of backlinks, coupled with keyword rankings for pages on the web allows Ahrefs to extrapolate to many forms of marketing information detailed below.

What type of data is available?

  • Backlink data
  • “Keyword difficulty” (how challenging it is to rank a page for a keyword)
  • What keywords a given page or site ranks for
  • Relative rankings on the strength of a domain or page
  • Competing page analysis
  • Content gaps
  • Data-driven keyword suggestions
  • Site scan data related to SEO
  • Estimated traffic

Features

  • Site Explorer
  • Keyword Explorer
  • Site Audit
  • Rank Tracker
  • Content Explorer
  • Link Intersect
  • Batch Analysis
  • Browser Plugins

Use Cases

  • Competitive Intelligence
  • Web/News Monitoring
  • Keyword-Related Marketing Research
  • SEO Analysis
  • Lead Generation

Who It’s For

  • Marketers
  • Publishers
  • Growth Teams
  • SEO Professionals
  • PPC Professionals
  • Public Relations Professionals
  • Content Marketers
  • Webmasters

Pricing

  • $99-$999/Month
  • $179/Month Standard Plan
 

Analysis of Alexa’s Database of the Web

Overview

Alexa provides information on the relative popularity of domains through the sampling of traffic patterns provided by individuals who utilize the Alexa plugin. Popularity is extrapolated from the number of unique visitors who visit a domain as well as the overall traffic to a domain. Additionally, Alexa crawls a large portion of the entire web to extract backlink data. 

This link data, in conjunction with data on which content is ranking for which keywords on search engines is used to provide keyword and PPC recommendations and tracking. Perhaps the most unique offering of Alexa is one of the largest publicly available indexes of traffic patterns online. These are offered in the form of the Alexa traffic rank for pages as well as an insights panel related to audience analysis.

What type of data is available?

  • Web traffic data
  • Audience demographics
  • Backlink data
  • “Keyword difficulty” (how challenging it is to rank a page for a keyword)
  • What keywords a given page or site ranks for
  • Relative rankings on the strength of a domain or page
  • Competing page analysis
  • Content gaps
  • Share of “voice” in a given niche

Features

  • Site audits
  • Organic keyword research
  • PPC keyword research
  • Web competitor analysis and research
  • Keyword analysis
  • Traffic analysis
  • Audience analysis
  • Share of voice analysis
  • Site comparison
  • Browser plugin

Use Cases

  • Marketers
  • Publishers
  • Growth Teams
  • SEO Professionals
  • PPC Professionals
  • Public Relations Professionals
  • Content Marketers
  • Webmasters

Pricing

  • $79-$299/Month
  • $149/Month Standard “Advanced” Plan
 

Analysis of Diffbot’s Database of the Web

Overview

Diffbot sources Knowledge Graph™ data through web-wide crawls that are then parsed by a cutting-edge machine vision and natural language processing AI system. Data is organized into entity types that are interlinked and populated by facts. Data within Diffbot entities is semantic, meaning that data is encoded next to the “meaning” or context of that data. At a general level, Diffbot takes unstructured data from billions of sites across the web and provides this data structure within the world’s largest knowledge graph. 

Diffbot also offers customizable data extraction solutions including Crawlbot as well as custom Extraction APIs

What type of data is available?

  • Organizational (firmographic) data
  • Person data
  • Product data
  • News mention data
  • Other entity data

Features

  • Access to structured entity data extracted from pages across the web (Knowledge Graph)
  • Data enrichment (Enhance)
  • Bulk crawler
  • Custom Extraction APIs
  • Excel, Google Sheets, Tableau integrations

Use Cases

  • Market Intelligence
  • News Monitoring
  • Ecommerce
  • Machine Learning
  • Web Scraping

Pricing

  • $299-$899/Month
  • $299/Month Standard “Startup” Plan