3.3 Assessing Data

This is the second step of data wrangling, and the aim of this lesson is to explain it in some detail. There are two kinds of unclean data, dirty data and messy data, both defined below.

There are two ways to assess data:

Visual: Plotting a simple graphic or looking through the table itself (rows and columns);
Programmatic: Using code to summarize the data frame, for example with .info(), .describe(), or aggregates such as the mean, sum, max, and min.
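As a minimal sketch of the programmatic approach, the snippet below summarizes a small table with .info(), .describe(), and a few aggregates. The table itself (column names and values) is invented for illustration and is not data from the lesson.

```python
import pandas as pd

# Hypothetical sample data; names and values are assumptions for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Bob", "Cid", "Dee"],
    "height_cm": [170.0, 165.0, None, 180.0],
    "weight_kg": [60.0, 72.5, 80.0, 90.0],
})

# .info() prints each column's dtype and non-null count,
# which quickly reveals columns with missing values
df.info()

# .describe() returns count, mean, std, min, quartiles, and max
# for every numeric column
summary = df.describe()
print(summary)

# Individual aggregates; missing values are skipped by default
mean_weight = df["weight_kg"].mean()
total_weight = df["weight_kg"].sum()
max_height = df["height_cm"].max()
min_height = df["height_cm"].min()
print(mean_weight, total_weight, max_height, min_height)
```

In this sketch the non-null count for height_cm is lower than the number of rows, which is the kind of signal a programmatic assessment is meant to surface.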

Dirty data is related to content issues, also known as low-quality data.

Messy data is related to structural issues, also known as untidy data.
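To make the distinction concrete, here is a small sketch with two invented tables: one with a content problem and one with a structural problem. The column names, the values, and the "year-like header" check are assumptions chosen for illustration, not a general-purpose detector.

```python
import pandas as pd

# Content issue: the structure is fine, but one value is wrong.
dirty = pd.DataFrame({
    "patient": ["Ana", "Bob"],
    "weight_kg": [60.0, -72.5],  # a negative weight is a content problem
})

# Structural issue: the year is a variable, but it is spread across
# column headers instead of being stored in its own column.
messy = pd.DataFrame({
    "patient": ["Ana", "Bob"],
    "2019": [60.0, 72.5],
    "2020": [61.0, 71.0],
})

# Content check: count values that cannot be real weights
n_negative = int((dirty["weight_kg"] < 0).sum())

# Structure check: column names that look like values (here, years)
year_like_columns = [c for c in messy.columns if c.isdigit()]

print(n_negative)
print(year_like_columns)
```

The first table needs its values corrected, while the second needs to be reshaped, even though every value in it may be correct.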

In both cases (visual or programmatic), the assessment can be divided into two main steps:

Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:

Completeness: Missing values;
Validity: Invalid values (like a negative height or weight, a zip code with only 4 digits, etc.);
Accuracy: Wrong data that is still valid (like a typo in a height value);
Consistency: Data without a standard notation (New York and NY, Colorado and CO: the same information in different notations).
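The four dimensions above can each be probed with a short check. The sketch below uses an invented table; the column names, the 5-digit zip code rule, and the 48-inch review threshold are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical records; every name and value is invented for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Bob", "Cid", "Dee"],
    "state": ["New York", "NY", "Colorado", "CO"],
    "zip_code": ["10001", "1000", "80014", "80202"],
    "height_in": [65.0, None, 27.0, -70.0],
})

# Completeness: count missing values per column
missing_per_column = df.isnull().sum()

# Validity: negative heights, and zip codes that do not have 5 characters
negative_height = df["height_in"] < 0
bad_zip = df["zip_code"].str.len() != 5

# Accuracy: a valid but wrong value cannot be proven wrong by code alone,
# so unusually small heights are only flagged for manual review
# (the 48-inch threshold is an assumption).
suspicious_height = df["height_in"].between(0, 48)

# Consistency: listing the distinct values exposes mixed notations
state_counts = df["state"].value_counts()

print(missing_per_column)
print(df[negative_height | bad_zip | suspicious_height])
print(state_counts)
```

Note that the accuracy check only narrows down candidates: a height of 27 inches is a valid number, and deciding that it is a typo still requires a human or a trusted reference.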

In decreasing order of severity, these dimensions are: Completeness, Validity, Accuracy, and Consistency.

anderson.uyekita[at]gmail.com