3.1 Introduction to Data Wrangling

Data wrangling has roughly three steps:

  • Gathering;
  • Assessing, and;
  • Cleaning.

This is an iterative process: you move back and forth among these three steps.

Data wrangling is about gathering the right pieces of data, assessing your data’s quality and structure, then modifying your data to make it clean. However, the assessments you make and convert into cleaning operations won’t make your analysis, visualization, or model better. The goal is just to make them possible, i.e., functional.

EDA (exploratory data analysis) is about exploring your data to later augment it and maximize the potential of your analyses, visualizations, and models. When exploring, simple visualizations are often used to summarize your data’s main characteristics. From there you can create new and more descriptive features from existing data, also known as feature engineering, or detect and remove outliers so your model’s fit is better.

ETL: You may also have heard of the extract-transform-load process, also known as ETL. ETL differs from data wrangling in three main ways:

  • The users are different;
  • The data is different, and;
  • The use cases are different.

The article “Data Wrangling Versus ETL: What’s the Difference?” by Wei Zhang explains these three differences well.

All text extracted from the class notes.

3.1.1 Gathering

Gathering is the first step of data wrangling. It is also known as collecting or acquiring. The Armenian Online Job Post dataset has 19,000 job postings from 2004 to 2015.

Best Practice: Downloading Files Programmatically

The reasons are:

  • Scalability: Automation saves time and prevents errors;
  • Reproducibility: A key point of any research. Anyone can reproduce your work and check it.
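
As a minimal sketch of downloading a file programmatically, the snippet below uses only the Python standard library. The function name `download_file` and the idea of keeping downloads in a dedicated folder are assumptions made for illustration; they do not come from the class notes.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def download_file(url, dest_dir):
    """Download `url` into `dest_dir` and return the local file path.

    Hypothetical helper for illustration only.
    """
    # Create the destination folder if it does not exist yet.
    os.makedirs(dest_dir, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    filename = os.path.basename(urlparse(url).path)
    dest_path = os.path.join(dest_dir, filename)
    # Stream the response to disk instead of loading it all in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

Because the download is a function call in a script, rerunning the script repeats the gathering step exactly, which is what makes the work reproducible.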

3.1.2 Assessing

Assessing is divided into two main aspects:

  • Quality of the dataset
  • Tidiness of the dataset

3.1.2.1 Quality

A low-quality dataset is a dirty dataset: the problem lies in the content of the data.

Common issues:

  • Missing values
  • Non-standard units (km, meters, inches, etc. all mixed)
  • Inaccurate data, invalid data, inconsistent data, etc.

One dataset may be high enough quality for one application but not for another.
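
The common issues above can be detected with a few pandas calls. This is a minimal sketch: the small table, its column names, and the rule that a distance cannot be negative are all hypothetical and chosen only to illustrate each kind of issue.

```python
import pandas as pd

# Hypothetical sample with typical quality problems (not the class dataset).
df = pd.DataFrame({
    "distance": [1.2, None, 350.0, -5.0],
    "unit": ["km", "km", "m", "km"],
})

# Missing values: count them per column.
missing_per_column = df.isna().sum()

# Non-standard units: list every distinct unit found in the column.
units_found = sorted(df["unit"].dropna().unique())

# Invalid data: here we assume a distance can never be negative.
invalid_rows = df[df["distance"] < 0]

print(missing_per_column)
print(units_found)
print(invalid_rows)
```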

3.1.2.2 Tidiness

Untidy data, or messy data, has problems with the structure of the dataset. In tidy data:

  • Each observation forms a row, and;
  • Each variable/feature forms a column.

This is Hadley Wickham's definition of tidy data.
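
To make the row/column rule concrete, here is a small sketch that reshapes an untidy table with pandas. The table is hypothetical: it stores one variable (the year) in the column headers, so `melt` is used to move it into its own column.

```python
import pandas as pd

# Hypothetical untidy table: the years are column headers, not values.
untidy = pd.DataFrame({
    "city": ["A", "B"],
    "2014": [10, 20],
    "2015": [11, 21],
})

# Tidy version: one row per (city, year) observation,
# one column per variable (city, year, postings).
tidy = untidy.melt(id_vars="city", var_name="year", value_name="postings")

print(tidy)
```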

3.1.3 Assessing the data

There are two ways to assess the data.

  • Visual, and;
  • Programmatic.

3.1.3.1 Visual Assessment

Visual assessment uses regular tools such as charts, Excel, and tables. It means a human is looking at the data and assessing it.

3.1.3.2 Programmatic Assessment

Using automation to evaluate a dataset is scalable and allows you to handle very large quantities of data.

Examples of “Programmatic Assessment”: analysing the data using .info(), .head(), .describe(), plotting graphics (.plot()), etc.

Bear in mind that in this step we do not use “verbs” to describe any errors or problems, because the “verbs” are the actions of the next step (cleaning).
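
A minimal sketch of a programmatic assessment with pandas follows. The inline CSV is a hypothetical stand-in for a real dataset, and the variable names are chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a real dataset.
csv_text = """title,company,year
Accountant,Acme,2004
Engineer,,2010
Accountant,Acme,2004
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows of the table
df.info()             # column types and non-null counts (prints directly)
print(df.describe())  # summary statistics of the numeric columns

# Observations recorded as counts, without deciding yet how to fix them.
n_missing_company = int(df["company"].isna().sum())
n_duplicates = int(df.duplicated().sum())
print(n_missing_company, n_duplicates)
```

The results are written down as observations without verbs, for example "company has missing values" or "there are duplicated rows". Deciding what to do about them belongs to the cleaning step.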

3.1.4 Cleaning

Improving the quality of a dataset, or cleaning it, does not mean changing what the data says (because that could be data fraud).

Cleaning means correcting or removing data, for example:

  • Correcting or removing inaccurate, wrong, or irrelevant data.
  • Replacing or filling missing (NULL or NA) values.
  • Combining/merging datasets.
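
The three kinds of operation above can be sketched with pandas as follows. Both tables and all column names are hypothetical. Filling missing salaries with the median is only one possible choice, shown here as an assumption; whether it is appropriate depends on the analysis.

```python
import pandas as pd

# Hypothetical tables for illustration only.
jobs = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Accountant", "Engineer", "Intern"],
    "salary": [500.0, None, 200.0],
    "notes": ["x", "y", "z"],  # assumed irrelevant to the analysis
})
companies = pd.DataFrame({
    "job_id": [1, 2, 3],
    "company": ["Acme", "Beta", "Acme"],
})

clean = jobs.copy()                    # keep the original data untouched
clean = clean.drop(columns=["notes"])  # remove irrelevant data
# Fill the missing salary with the median of the known salaries.
clean["salary"] = clean["salary"].fillna(clean["salary"].median())
# Combine the two datasets on their shared key.
clean = clean.merge(companies, on="job_id", how="left")

print(clean)
```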

Improving tidiness means transforming the dataset so that:

  • each observation = row
  • each variable = column

There are two ways to clean the data: manually and programmatically.

3.1.4.1 Manually

Manual cleaning is to be avoided: it does not scale and is hard to reproduce.

3.1.4.2 Programmatic

There are three steps:

  1. Define
  2. Code
  3. Test

Defining means defining a data cleaning plan in writing, where we turn our assessments into defined cleaning tasks. This plan will also serve as an instruction list so others (or us in the future) can look at our work and reproduce it.

Coding means translating these definitions to code and executing that code.

Testing means testing our dataset, often using code, to make sure our cleaning operations worked.
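
The Define, Code, Test cycle can be sketched on a tiny example. The data, the column names, and the chosen cleaning task (standardizing units to meters) are hypothetical and serve only to show the shape of the workflow.

```python
import pandas as pd

# Hypothetical data with mixed units, for illustration only.
df = pd.DataFrame({
    "distance": [1.5, 350.0, 2.0],
    "unit": ["km", "m", "km"],
})

# Define: convert every distance to meters and set the unit to "m".

# Code: work on a copy so the original data stays available.
clean = df.copy()
is_km = clean["unit"] == "km"
clean.loc[is_km, "distance"] = clean.loc[is_km, "distance"] * 1000
clean["unit"] = "m"

# Test: check that the cleaning operation did what was defined.
assert (clean["unit"] == "m").all()
assert clean["distance"].tolist() == [1500.0, 350.0, 2000.0]
```

Writing the definition as a comment next to the code keeps the plan, the implementation, and the check together, so someone reading the script later can follow and reproduce each cleaning task.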

Text from the class notes.

 

A work by AH Uyekita

anderson.uyekita[at]gmail.com