
Data provenance

Data provenance (also referred to as “data lineage”) is metadata paired with records that details the origin of the data and the confidence in its validity. Data provenance is important for tracking down errors in data and attributing them to their sources. It is also useful in reporting and auditing for business processes.

The primary metadata points related to data provenance include:

  • Data origin
  • How data is transformed or augmented
  • Where data moves over time
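
As a rough illustration, the three metadata points above could be carried alongside a record as follows. This is a minimal sketch, not a description of any particular product: the ProvenanceRecord class, its field names, and the example URL and system names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata attached to a single data record."""
    origin: str                                           # where the data was first obtained
    transformations: list = field(default_factory=list)   # how it was transformed or augmented
    locations: list = field(default_factory=list)         # where it has moved over time

    def add_transformation(self, description: str) -> None:
        # Store a UTC timestamp with each change so the history can be replayed.
        self.transformations.append((datetime.now(timezone.utc), description))

    def add_move(self, system: str) -> None:
        # Record each system the data lands in, in order.
        self.locations.append((datetime.now(timezone.utc), system))


# Illustrative values only.
record = ProvenanceRecord(origin="https://example.com/press-release")
record.add_transformation("normalized company name")
record.add_move("staging-db")
record.add_move("analytics-warehouse")
```

Keeping timestamps next to each entry is one way to make the history usable for debugging and auditing, since it shows the order in which changes and moves occurred.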

The primary uses of data provenance include: 

  • Validation of data
  • Debugging of faulty data
  • Regeneration of lost or corrupted data
  • Analysis of data dependencies
  • Auditing and compliance

How important data provenance is to an organization, and how fully it is implemented, is typically influenced by:

  • An enterprise’s data management strategy
  • Regulations and legal requirements
  • Reporting requirements
  • Confidence requirements for critical segments of organizational data
  • Data impact analysis

Unstructured data and data provenance

An estimated 80-90% of organizational data is unstructured. Additionally, the web is by and large unstructured. Combined, this means that most organizations deal with unstructured data in at least some portion of nearly all of their data-centered activities. Sources of unstructured data of use to organizations are also growing much faster (both as a share of utilized data and in total) than structured and curated data silos. Big data in particular is dominated by unstructured sources, with an estimated 90% of big data residing in this form.

We should note that just because data is unstructured doesn’t mean it’s entirely chaotic. Rather, data is considered unstructured primarily because it (a) doesn’t reside in a database, or (b) doesn’t neatly fit into a database structure. 

Two of the main issues organizations face when dealing with unstructured data are:

  • Being able to prepare the data and work out exactly what it is saying
  • Being able to trace the validity or origin of given data points (data provenance)

Diffbot’s Automatic Extraction APIs, Knowledge Graph, and Enhance are all potential solutions for these issues with unstructured data.

Data Provenance and Diffbot

In a sense, examples of data provenance are “facts about facts.” In Diffbot’s Knowledge Graph™, confidence scores are calculated for every fact as sources for data are compared and integrated into records.

What is a confidence score in Diffbot’s Knowledge Graph?

Each fact has a confidence score that represents our level of belief that the fact is correct. A confidence score of one means that we’re absolutely certain the fact is correct, while a confidence score of zero means we’re absolutely certain the fact is incorrect. We currently discard facts whose confidence score is below 0.5.
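
The cutoff described above can be pictured as a simple filter. The sketch below is an assumption-laden illustration rather than Diffbot’s implementation: the Fact class, the keep_confident function, and the sample statements and scores are invented for the example. Only the 0.5 threshold comes from the text.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.5  # cutoff stated in the text above


@dataclass(frozen=True)
class Fact:
    """Hypothetical fact with a confidence score between 0 and 1."""
    statement: str
    confidence: float  # 0.0 = certainly incorrect, 1.0 = certainly correct


def keep_confident(facts, threshold=CONFIDENCE_THRESHOLD):
    """Return the facts whose confidence is at or above the threshold."""
    return [fact for fact in facts if fact.confidence >= threshold]


# Sample scores are made up for illustration.
facts = [
    Fact("Albert Einstein won the Nobel Prize", 0.98),
    Fact("Albert Einstein was born in 1979", 0.03),
    Fact("Albert Einstein lived in Princeton", 0.50),
]
kept = keep_confident(facts)
```

Because facts are discarded only when their score is below 0.5, a fact scoring exactly 0.5 is retained in this sketch.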
 
What is a “fact” within Diffbot’s Knowledge Graph? 
 
Within Diffbot’s Knowledge Graph, facts are sourced data points about a given entity. Because entities within the Knowledge Graph are contextually linked, a fact may also pertain to multiple entities. Take the statement “Albert Einstein won the Nobel Prize.” This fact could be included within the Albert Einstein entity as one of his “awards won.” Simultaneously, the Nobel Prize entity could include Albert Einstein within a “recipients of” field.
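
One way to picture a single fact appearing on two linked entities is to store it once under each entity, using a field and its inverse. The following is a minimal sketch under stated assumptions: the entities mapping and the add_fact helper are hypothetical, and the field names are taken from the example above rather than from the Knowledge Graph’s actual schema.

```python
from collections import defaultdict

# Hypothetical store: entity name -> field name -> list of linked entity names.
entities = defaultdict(lambda: defaultdict(list))


def add_fact(subject, field_name, obj, inverse_field):
    """Record one fact on both entities it pertains to."""
    entities[subject][field_name].append(obj)
    entities[obj][inverse_field].append(subject)


add_fact("Albert Einstein", "awards won", "Nobel Prize", "recipients of")
```

After the call, the same statement can be reached from either the Albert Einstein entry or the Nobel Prize entry.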
 
On average, each entity within the Knowledge Graph has 22-25 facts, and each fact has an average of three origins or sources. Together, these validated facts of traceable origin make up the single largest collection of web data that provides for data provenance.

See also transparency and explainability in AI