The 6 Biggest Difficulties With Data Cleaning (With Work Arounds)

Skip Ahead

Cleaning Duplicate Data
Cleaning Empty Field Data
Cleaning Data With Formatting Errors
Cleaning Incorrect Data
Cleaning Overmerged Data
Augmenting Data

Data is the new soil.
David Mccandless

If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.

At it’s simplest, data cleaning revolves around two opposing needs:

The need to amend data points that will skew the quality of your results
The need to retain as much of your useful data as you can

These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.

Perhaps the most important result from a data cleaning job is that results be standardized in a way that analytics and BI tools can easily access any value, present data in dashboards, or otherwise make the data manipulatable.

Cleaning Duplicate Data

Duplicate data can be a signal of a number of bad data gathering practices or technical limitations. Most notably, duplicate data often shows that you have had difficulty merging data from different sources or the same source as it has changed over time.

Some common issues that duplicate data can lead to include:

Inability to keep up with support tickets
Inaccurate financial reporting
Lack of a single customer view

When performing analysis, duplicate data can lead to faulty results by overrepresenting a given data point. Potentially worse, duplicate entries with different data points can throw doubt on the provenance of the entire data set.

So what should you do about duplicate data? Basically you have a few techniques, and a few choices.

Techniques

Filter by unique values
Merging duplicate entries
Standardizing entries with external data

Choices

What to do when duplicate entries have fields that disagree?
If duplicates are not merged, which entry is the entry of record?
Should you rebase your entries around standardized external data?

As the leading provider of structured public web data through extraction, our Knowledge Graph, or Natural Language API, we see use cases regularly where clients choose to rebase their data store based on a new set of entities.

All of the billions of entities within the Knowledge Graph hold unique identifiers as well as provenance for all data points, making them a great starting point for standardizing your data set for analysis.

Cleaning Empty Field Data

Empty fields within some data set entries can lead to misleading analyses, bad user experience, as well as misappraisals of your data coverage.

To wrangle with empty field data, you’ve got to look at all of your options for data sourcing. Additionally, consider what would occur to your data set if you drop all entries featuring empty fields.

Data enrichment services like Diffbot’s Enhance can quickly fill out firmographic and person data with a range of useful fields.

Techniques

Replace empty fields with comparable value (ie “0” for numerical comparison or “N/A” for categorical comparison)
Remove entries that surpass a threshold of empty fields if no enrichment or augmentation method is available for this data
Enrich existing records to increase field coverage

Choices

Determining what type of analysis you want to do on an empty field can inform what type of data should fill it.
Balancing the imperatives of removing largely empty entries and retaining dataset size.
Determining the best data enrichment service (provenance, coverage, freshness, how comprehensive it is)

Cleaning Data With Formatting Errors

Formatting errors can occur when fields are populated from sources in other formats, manual entry errors, changing data taxonomies, and corrupted data.

Assuming errors are truly in formatting or syntax alone, and not data that is actually incorrect and out of date, there are a range of processing techniques that can help to remedy the situation.

Techniques

Use pattern matching technologies (like regular expressions) to select and change syntax
Review conversions between character types or document formats
Remove formatting characters like newline and return carriages
Pull new “bedrock” data from a knowledge as a service provider and salvage erroneous data as you can

Choices

Cost benefit analysis of fixing or dropping syntax error ridden fields
What taxonomy fits your analysis needs the best
How will you track and communicate proper syntax and changes to syntax
What process minimizes syntax errors or corrupted data within the data gathering stage

Cleaning Incorrect Data

Incorrect data is one of the most nefarious difficulties in data cleaning. When assessing whether your data is incorrect, you should also consider your definitions. Is data incorrect when it’s slightly out of date? How much so? And is there a percentage of incorrect data that’s acceptable for your purposes?

If you find you have an unacceptably high rate of incorrect data, you may want to first look at your data sources. Are they out of date? For data available on the public web, is there a way to re-scrape web pages that have been updated? For data available by APIs, should you refresh your data more regularly?

For data stores of a certain size, it doesn’t make sense to try and manually correct incorrect data. If you can’t “refresh” the process by which you retained your information in the first place, you’ll likely want to seek out data enrichment or knowledge as a service providers who can take your correct identification fields and replace outdated or incorrect values.

Diffbot’s Enhance product provides best-in-class coverage and pricing for tasks like CRM enrichment or for enrichment of firmographic data.

Techniques

For public web data, re-scrape the location you originally obtained the data from if it has been updated
For API-enabled data, refresh your data more often
Utilize data enrichment services for firmographic and demographic data corrections and updates

Choices

Is your data out of date? If so, how much?
What percentage incorrect data are you ok with?
Are there public web sources that are more correct or up-to-date that you can scrape?
Are there APIs you can re-query to refresh or correct your data?
Are there data enrichment services for your data type of interest?

Cleaning Overmerged Data

Overmerged data occurs when entries that should be separate have been combined. Imagine a database filled with person data. Let’s say each name in the database is unique except for the most common. And there are 15 John Smith entries. But the data is all in the same entry. Depending on the extent of overmerging, this will lead to vastly incorrect data.

In the event of overmerged data, you’ll need to try and trace at what stage of data gathering or transformation the overmerging occured and correct your process in the future. For this solid data provenance is key.

If you have no way of telling where overmerging occured, you have a few options. Delete overmerged entities as they’re uncovered, set up processes for tracking provenance in the future, or rebase your data set around a properly merged data set.

Techniques

Establish proper data provenance for tracking data gathering and transforms
Remove overmerged entities as they surface
Repeat data gathering without processing steps that lead to overmerging
Rebase your data set around properly merged entities with unique identifiers

Choices

Determine if overmerging is prevalent enough to warrant restarting your data gathering stage
Determine if you should rebase your data set around a properly merged data set

Augmenting Data

Augmenting data is not a data cleaning issue in and of itself. But the process of bringing in new or external data sources can bring up many of the above issues if not pursued correctly.

Techniques

Determine what format each data field should be in to support your analysis or data needs
Establish and follow robust data provenance protocols whenever transforming or augmenting your data set
Determine the repeatability of data augmentation methods and stability of sources of data

Choices

What is the data augmentation source?
How stable is the data source?
Should internal or external be used as the basis for data augmentation?

Skip Ahead

Cleaning Duplicate Data

Cleaning Empty Field Data

Cleaning Data With Formatting Errors

Cleaning Incorrect Data

Cleaning Overmerged Data

Augmenting Data

Up Next