Getting and Data Cleaning – Data Science
Data cleaning (sometimes also referred to as data cleansing or data scrubbing) is the act of detecting and either removing or correcting corrupt or inaccurate records from a record set, table, or database. Used mainly in cleansing databases, the process applies identifying incomplete, incorrect, inaccurate, irrelevant, etc. items of data and then replacing, modifying, or deleting this “dirty” information.
The next step after data cleaning is data reduction. This includes defining and extracting attributes, decreasing the dimensions of data, representing the problems to be solved, summarizing the data, and selecting portions of the data for analysis.
Generally, in order to be classified as “high-quality,” data needs to pass a firm and exacting set of criteria. Those include:
Accuracy: an aggregated value over the criteria of integrity, consistency, and density
Integrity: an aggregated value over the criteria of completeness and validity
Completeness: achieved by correcting data containing the anomalies
Validity: approximated by the amount of data satisfying integrity constraints
Consistency: concerns contradictions and syntactical anomalies
Uniformity: directly related to irregularities and in compliance with the set “unit of measure”
Density: the quotient of missing values in the data and the number of total values ought to be known
Sharon Machlis of ComputerWorld puts data cleaning into perspective: “Before you can analyze and visualize data, it often needs to be ‘cleaned.’ What does that mean? Perhaps some entries list ‘New York City’ while others say ‘New York, NY’ and you need to standardize them before you can see patterns. There might be some records with misspellings or numerical data-entry errors.” Such procedures as this constitute data cleaning.
Note: The need to analyze time-series or other forms of streaming (velocity) data poses unique data cleaning challenges. Examples of this class of data include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like temperatures, pressures or concentrations. Nevertheless, fundamental principles of data cleaning, apply just as much to these data sets as to any others.
Bruce Ratner, Ph.D. – of the highly-regarded DM Stat-1 Consulting – identifies ten fundamentals of data cleaning:
- Check frequencies of continuous and categorical variables for unreasonable distributions.
- Check frequencies of continuous and categorical variables for detection of unexpected values. For continuous variables, look into data “clumps” and “gaps.”
- Check for improbable values (e.g., a boy named Sue), and impossible values (e.g., age is 120 years young, and x/0).
- Check the type for numeric variables: Decimal, integer, and date.
- Check the meanings of misinformative values, e.g., “NA”, the blank ” “, the number “0”, the letter “O”, the dash “-“, and the dot “. “.
- Check for out-of-range data: Values “far out” from the “fences” of the data.
- Check for outliers: Values “outside” the fences of the data.
- Check for missing values, and the meanings of their coded values, e.g., the varied string of “9s”, the number “0”, the letter “O”, the dash “-“, and the dot “. “.
- Check the logic of data, e.g., response rates cannot be 110%, and weigh contradictory values, along with conflict resolution rules, e.g., duplicate records of BR’s DOB: 12/22/56 and 12/22/65.
- Last but not least, check for the typos.
Ratner: “After the ten basic and analysts-specific checks are done, data cleaning is not completed until the noise in the data is eliminated. Noise is the idiosyncrasies of the data: The particulars, the “nooks and crannies” that are not part of the sought-after essence (e.g., predominant pattern) of the data with regard to the objective of the analysis/model. Ergo, the data particulars is lonely, not-really-belonging-to pieces of information that happen to be both in the population from which the data was drawn and in the data itself (what an example of a double-chance occurrence!) Paradoxically, as the analyst includes more and more of the prickly particulars in the analysis/model, the analysis/model becomes better and better, yet the analysis/model validation becomes worse and worse. Noise must be eliminated from the data.”