3.3 Assessing Data

This is the second step of data wrangling, and this lesson aims to explain it in some detail. There are two kinds of unclean data:

  • Quality issues: Dirty data;
    • Missing, duplicated, or incorrect data;
  • Lack of tidiness: Also known as messy data.
    • Structural issues.

There are two ways to assess:

  • Visual: Plotting a simple graphic or visualizing the table (rows and columns);
  • Programmatic: Using code to summarize the data frame with .info(), .describe(), average, summation, max, min, etc.
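
As an illustration of the programmatic approach, the sketch below assumes the data is held in a pandas data frame. The table and its column names (name, height_cm) are invented for this example and are not part of the lesson's data set.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "name": ["Ana", "Bob", "Bob", "Cid"],
    "height_cm": [165.0, 180.0, 180.0, None],
})

# .info() reports column dtypes and non-null counts.
# It prints to stdout by default, so capture it in a buffer to keep the text.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# .describe() summarizes numeric columns (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)

# Individual aggregates; missing values are skipped by default.
mean_height = df["height_cm"].mean()
total_height = df["height_cm"].sum()
max_height = df["height_cm"].max()
min_height = df["height_cm"].min()
```

Comparing the non-null count of a column with the number of rows is often the quickest way to spot missing values.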

3.3.0.1 Dirty Data

Dirty data is related to content issues, also known as low-quality data.

  • Inaccurate data: Typos, corrupted, and duplicated data.

3.3.0.2 Messy Data

Messy data is related to structural issues, also known as untidy data. A data set is messy when it breaks the tidy data principles of Hadley Wickham:

  • Each observation is a row;
  • Each variable/feature is a column.
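
A minimal sketch of a structural issue, assuming pandas: in the hypothetical table below, the years are stored as column headers, so one variable (year) is spread over several columns. The table and the names city, year, and value are made up for this example.

```python
import pandas as pd

# Hypothetical messy table: the values of the "year" variable are column headers.
messy = pd.DataFrame({
    "city": ["New York", "Denver"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# melt() moves the year columns into rows:
# one row per (city, year) observation, one column per variable.
tidy = messy.melt(id_vars="city", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)
print(tidy)
```

The reshaping itself belongs to the cleaning step. During assessment it is enough to notice that the column headers are values and to write that down.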

3.3.1 Assessment Process

In both cases (visual or programmatic), the assessment can be divided into two main steps:

  • Detect;
  • Document.
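
The two steps could be sketched as follows, assuming pandas. The sample table and the wording of the notes are hypothetical; in practice the notes might just as well live in a notebook cell or a text file.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "name": ["Ana", "Bob", "Bob", "Cid"],
    "height_cm": [165.0, 180.0, 180.0, None],
})

issues = []  # Document: a plain list of notes to act on during cleaning

# Detect: count missing values in each column.
missing = df.isna().sum()
for column, count in missing.items():
    if count > 0:
        issues.append(f"Quality: column '{column}' has {count} missing value(s)")

# Detect: count rows that repeat an earlier row exactly.
n_duplicates = int(df.duplicated().sum())
if n_duplicates > 0:
    issues.append(f"Quality: {n_duplicates} duplicated row(s)")

for note in issues:
    print(note)
```

Nothing is fixed at this point. The list only records what was found so that the cleaning step has a checklist to work from.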

3.3.2 Data Quality Dimensions

Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:

  • Completeness: Missing values;
  • Validity: Invalid values (like a negative height or weight, a zip code with only 4 digits, etc.);
  • Accuracy: Wrong data that is still valid (like a typo in a height value);
  • Consistency: Data without a standard notation (New York and NY, Colorado and CO: the same information in different notations).

In decreasing order of severity, these dimensions are: Completeness, Validity, Accuracy, and Consistency.
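
A rough sketch of one check per dimension, assuming pandas. The records, the column names, and the 250 cm plausibility ceiling are all assumptions made for this example. Accuracy problems in particular usually need domain knowledge, so the check below only flags candidates for manual review.

```python
import pandas as pd

# Hypothetical records, invented for illustration only.
df = pd.DataFrame({
    "state": ["New York", "NY", "Colorado", "CO"],
    "zip_code": ["10001", "1000", "80202", "80203"],
    "height_cm": [170.0, -5.0, None, 1750.0],
})

# Completeness: missing values.
n_missing = int(df["height_cm"].isna().sum())

# Validity: values that break a rule (negative height, zip code without 5 digits).
invalid_height = df["height_cm"] < 0
invalid_zip = ~df["zip_code"].str.match(r"\d{5}$")

# Accuracy: values that pass the rules but look wrong.
# The 250 cm ceiling is an assumed threshold, not a rule from the lesson.
suspicious_height = df["height_cm"] > 250

# Consistency: the same column mixes two-letter abbreviations and full names.
uses_abbreviation = df["state"].str.len() == 2
inconsistent_states = bool(uses_abbreviation.any() and not uses_abbreviation.all())

print(n_missing, int(invalid_height.sum()), int(invalid_zip.sum()),
      int(suspicious_height.sum()), inconsistent_states)
```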

 

A work by AH Uyekita

anderson.uyekita[at]gmail.com