Data is the new soil.David Mccandless
If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.
At it’s simplest, data cleaning revolves around two opposing needs:
- The need to amend data points that will skew the quality of your results
- The need to retain as much of your useful data as you can
These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.
Perhaps the most important result from a data cleaning job is that results be standardized in a way that analytics and BI tools can easily access any value, present data in dashboards, or otherwise make the data manipulatable.(more…)