3.3 Assessing Data
This is the second step of data wrangling, and this lesson explains it in more detail. There are two kinds of unclean data:
- Quality issues (dirty data): missing, duplicated, or incorrect data;
- Lack of tidiness (messy data): structural issues.
There are two ways to assess:
- Visual: Plotting a simple graphic or visualizing the table (rows and columns);
- Programmatic: Using code to summarize the data frame, for example with .info(), .describe(), or aggregates such as mean, sum, max, and min.
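As a minimal sketch of a programmatic assessment, the snippet below runs the summaries named above with pandas. The data frame and its column names (`name`, `height_cm`) are invented for illustration and do not come from the lesson.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
df = pd.DataFrame({
    "name": ["Ana", "Bob", "Bob", "Cid"],
    "height_cm": [170.0, 182.0, 182.0, None],
})

df.info()                # prints column dtypes and non-null counts
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns)
print(summary)

# Individual aggregates; missing values are skipped by default.
mean_height = df["height_cm"].mean()
total_height = df["height_cm"].sum()
max_height = df["height_cm"].max()
min_height = df["height_cm"].min()
print(mean_height, total_height, max_height, min_height)
```

Comparing the non-null count reported by `.info()` with the number of rows is often the quickest way to spot a column with missing values.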
3.3.0.1 Dirty Data
Dirty data has content issues and is also known as low-quality data.
- Inaccurate data: typos, corrupted values, and duplicated records.
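A possible way to look for duplicated rows and typos with pandas is sketched below. The table, the `id` and `city` columns, and the misspelling are hypothetical, and rare values are only candidates for typos, not proof of one.

```python
import pandas as pd

# Hypothetical toy data with one repeated row and one misspelled city.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["New York", "Denver", "Denver", "Nwe York"],
})

# Rows that are identical to an earlier row.
n_duplicates = int(df.duplicated().sum())
print("duplicated rows:", n_duplicates)

# Frequency of each distinct value; rare spellings deserve a closer look.
counts = df["city"].value_counts()
print(counts)
```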
3.3.0.2 Messy Data
Messy data has structural issues and is also known as untidy data. In tidy data:
- Each observation is a row;
- Each variable (feature) is a column.
These rules are based on Hadley Wickham's principles of tidy data.
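To illustrate the two rules above, here is a small sketch that reshapes a messy table into a tidy one with pandas `melt`. The table is hypothetical: it assumes the years were stored as column headers, so one variable (the year) is spread across several columns.

```python
import pandas as pd

# Hypothetical messy table: the "2019" and "2020" headers are values of a
# variable (year), not variables themselves.
messy = pd.DataFrame({
    "city": ["New York", "Denver"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# Reshape so that each row is one observation (city, year) and each
# column is one variable.
tidy = messy.melt(id_vars="city", var_name="year", value_name="value")
print(tidy)
```

After the reshape, every row holds a single city-year observation, which matches the "one observation per row, one variable per column" layout.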
3.3.1 Assessment Process
In both cases (visual or programmatic), the assessment can be divided into two main steps:
- Detect;
- Document.
3.3.2 Data Quality Dimensions
Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:
- Completeness: Missing values;
- Validity: Invalid values (such as a negative height or weight, or a zip code with only 4 digits);
- Accuracy: Wrong data that is still valid (such as a typo in a height value);
- Consistency: Data without a standard notation (New York and NY, Colorado and CO: the same information in different notations).
In decreasing order of severity, these dimensions are: Completeness, Validity, Accuracy, and Consistency.
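The four dimensions can be checked roughly as in the sketch below. Everything in it is an assumption made for illustration: the table, the column names, the 5-digit zip code rule, the plausibility range for heights, and the abbreviation mapping. Accuracy in particular usually cannot be confirmed by code alone, so the range check only flags values worth verifying by hand.

```python
import pandas as pd

# Hypothetical toy data containing one issue of each kind.
df = pd.DataFrame({
    "state": ["New York", "NY", "Colorado", "CO"],
    "zip_code": ["10001", "1000", "80202", None],
    "height_cm": [170.0, -5.0, 17.0, 165.0],
})

# Completeness: missing values per column.
missing = df.isna().sum()

# Validity: negative heights, and zip codes that are not exactly 5 digits.
invalid_height = df["height_cm"] < 0
invalid_zip = df["zip_code"].notna() & ~df["zip_code"].str.fullmatch(r"\d{5}", na=False)

# Accuracy (heuristic only): positive but implausibly small heights,
# using an assumed threshold of 50 cm.
suspicious_height = df["height_cm"].between(0, 50)

# Consistency: map abbreviations to one standard notation (assumed mapping).
state_map = {"NY": "New York", "CO": "Colorado"}
standardized = df["state"].replace(state_map)

print(missing)
print(df[invalid_height | invalid_zip | suspicious_height])
print(standardized.unique())
```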
A work by AH Uyekita
anderson.uyekita[at]gmail.com