Data is the new soil.David Mccandless
If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.
At it’s simplest, data cleaning revolves around two opposing needs:
- The need to amend data points that will skew the quality of your results
- The need to retain as much of your useful data as you can
These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.
Perhaps the most important result from a data cleaning job is that results be standardized in a way that analytics and BI tools can easily access any value, present data in dashboards, or otherwise make the data manipulatable.
Cleaning Duplicate Data
Duplicate data can be a signal of a number of bad data gathering practices or technical limitations. Most notably, duplicate data often shows that you have had difficulty merging data from different sources or the same source as it has changed over time.
Some common issues that duplicate data can lead to include:
- Inability to keep up with support tickets
- Inaccurate financial reporting
- Lack of a single customer view
When performing analysis, duplicate data can lead to faulty results by overrepresenting a given data point. Potentially worse, duplicate entries with different data points can throw doubt on the provenance of the entire data set.
So what should you do about duplicate data? Basically you have a few techniques, and a few choices.
- Filter by unique values
- Merging duplicate entries
- Standardizing entries with external data
- What to do when duplicate entries have fields that disagree?
- If duplicates are not merged, which entry is the entry of record?
- Should you rebase your entries around standardized external data?
As the leading provider of structured public web data through extraction, our Knowledge Graph, or Natural Language API, we see use cases regularly where clients choose to rebase their data store based on a new set of entities.
All of the billions of entities within the Knowledge Graph hold unique identifiers as well as provenance for all data points, making them a great starting point for standardizing your data set for analysis.
Cleaning Empty Field Data
Empty fields within some data set entries can lead to misleading analyses, bad user experience, as well as misappraisals of your data coverage.
To wrangle with empty field data, you’ve got to look at all of your options for data sourcing. Additionally, consider what would occur to your data set if you drop all entries featuring empty fields.
Data enrichment services like Diffbot’s Enhance can quickly fill out firmographic and person data with a range of useful fields.
- Replace empty fields with comparable value (ie “0” for numerical comparison or “N/A” for categorical comparison)
- Remove entries that surpass a threshold of empty fields if no enrichment or augmentation method is available for this data
- Enrich existing records to increase field coverage
- Determining what type of analysis you want to do on an empty field can inform what type of data should fill it.
- Balancing the imperatives of removing largely empty entries and retaining dataset size.
- Determining the best data enrichment service (provenance, coverage, freshness, how comprehensive it is)
Cleaning Data With Formatting Errors
Formatting errors can occur when fields are populated from sources in other formats, manual entry errors, changing data taxonomies, and corrupted data.
Assuming errors are truly in formatting or syntax alone, and not data that is actually incorrect and out of date, there are a range of processing techniques that can help to remedy the situation.
- Use pattern matching technologies (like regular expressions) to select and change syntax
- Review conversions between character types or document formats
- Remove formatting characters like newline and return carriages
- Pull new “bedrock” data from a knowledge as a service provider and salvage erroneous data as you can
- Cost benefit analysis of fixing or dropping syntax error ridden fields
- What taxonomy fits your analysis needs the best
- How will you track and communicate proper syntax and changes to syntax
- What process minimizes syntax errors or corrupted data within the data gathering stage
Cleaning Incorrect Data
Incorrect data is one of the most nefarious difficulties in data cleaning. When assessing whether your data is incorrect, you should also consider your definitions. Is data incorrect when it’s slightly out of date? How much so? And is there a percentage of incorrect data that’s acceptable for your purposes?
If you find you have an unacceptably high rate of incorrect data, you may want to first look at your data sources. Are they out of date? For data available on the public web, is there a way to re-scrape web pages that have been updated? For data available by APIs, should you refresh your data more regularly?
For data stores of a certain size, it doesn’t make sense to try and manually correct incorrect data. If you can’t “refresh” the process by which you retained your information in the first place, you’ll likely want to seek out data enrichment or knowledge as a service providers who can take your correct identification fields and replace outdated or incorrect values.
Diffbot’s Enhance product provides best-in-class coverage and pricing for tasks like CRM enrichment or for enrichment of firmographic data.
- For public web data, re-scrape the location you originally obtained the data from if it has been updated
- For API-enabled data, refresh your data more often
- Utilize data enrichment services for firmographic and demographic data corrections and updates
- Is your data out of date? If so, how much?
- What percentage incorrect data are you ok with?
- Are there public web sources that are more correct or up-to-date that you can scrape?
- Are there APIs you can re-query to refresh or correct your data?
- Are there data enrichment services for your data type of interest?
Cleaning Overmerged Data
Overmerged data occurs when entries that should be separate have been combined. Imagine a database filled with person data. Let’s say each name in the database is unique except for the most common. And there are 15 John Smith entries. But the data is all in the same entry. Depending on the extent of overmerging, this will lead to vastly incorrect data.
In the event of overmerged data, you’ll need to try and trace at what stage of data gathering or transformation the overmerging occured and correct your process in the future. For this solid data provenance is key.
If you have no way of telling where overmerging occured, you have a few options. Delete overmerged entities as they’re uncovered, set up processes for tracking provenance in the future, or rebase your data set around a properly merged data set.
- Establish proper data provenance for tracking data gathering and transforms
- Remove overmerged entities as they surface
- Repeat data gathering without processing steps that lead to overmerging
- Rebase your data set around properly merged entities with unique identifiers
- Determine if overmerging is prevalent enough to warrant restarting your data gathering stage
- Determine if you should rebase your data set around a properly merged data set
Augmenting data is not a data cleaning issue in and of itself. But the process of bringing in new or external data sources can bring up many of the above issues if not pursued correctly.
- Determine what format each data field should be in to support your analysis or data needs
- Establish and follow robust data provenance protocols whenever transforming or augmenting your data set
- Determine the repeatability of data augmentation methods and stability of sources of data
- What is the data augmentation source?
- How stable is the data source?
- Should internal or external be used as the basis for data augmentation?