Chapter 1 eq_clean_data
This function can be used in two ways:
- When you pass the path of a file to load, and;
# Loading the 'signif.txt' file.
eq_clean_data(file_name = system.file("extdata", "signif.txt", package = "msdr"))
- When you pipe in a dataset that is already loaded.
# Pipe.
readr::read_delim("signif.txt",
delim = "\t") %>% eq_clean_data()
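The two calling styles above can be illustrated with a minimal sketch. The helper `load_or_pass` and its arguments `data` and `file_name` are hypothetical names chosen for illustration; this is not the msdr implementation, only one way a function could accept either a piped data frame or a file path.

```r
# Hypothetical helper (not part of msdr): accept either a data frame
# that was piped in, or the path to a tab-delimited file.
load_or_pass <- function(data = NULL, file_name = NULL) {
  if (!is.null(data)) {
    # A dataset was supplied directly (for example through a pipe).
    stopifnot(is.data.frame(data))
    return(data)
  }
  if (is.null(file_name)) {
    stop("Supply either `data` or `file_name`.")
  }
  # No dataset supplied: read the tab-delimited file from disk.
  utils::read.delim(file_name, sep = "\t", stringsAsFactors = FALSE)
}

# Toy file, made up for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c("I_D\tYEAR", "1\t2000", "2\t2001"), tmp)

from_file <- load_or_pass(file_name = tmp)
from_data <- load_or_pass(utils::read.delim(tmp, sep = "\t"))
```

Both calls return a data frame with the same two rows, which is the behaviour the two usage styles rely on.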
1.1 Loading the data
This function also loads the Earthquake database from NOAA.
# Path to the raw data.
raw_data_path <- system.file("extdata", "signif.txt", package = "msdr")
# Loading the dataset of Earthquake.
df <- readr::read_delim(file = raw_data_path,
delim = '\t',
col_names = TRUE,
progress = FALSE,
col_types = readr::cols())
# Printing the first 6 rows (the default of head()).
head(df) %>%
dplyr::select(I_D, YEAR, LOCATION_NAME, EQ_PRIMARY, TOTAL_DEATHS) %>%
knitr::kable()
| I_D | YEAR | LOCATION_NAME | EQ_PRIMARY | TOTAL_DEATHS |
|---|---|---|---|---|
| 1 | -2150 | JORDAN: BAB-A-DARAA,AL-KARAK | 7.3 | NA |
| 3 | -2000 | TURKMENISTAN: W | 7.1 | 1 |
| 2 | -2000 | SYRIA: UGARIT | NA | NA |
| 5877 | -1610 | GREECE: THERA ISLAND (SANTORINI) | NA | NA |
| 8 | -1566 | ISRAEL: ARIHA (JERICHO) | NA | NA |
| 11 | -1450 | ITALY: LACUS CIMINI | NA | NA |
As you can see, there are several observations with NA values.
1.2 Creating new features
The eq_clean_data function creates the DATE variable by combining the YEAR, MONTH, and DAY columns, using the lubridate package.
# Creating a new feature.
df <- df %>%
dplyr::mutate(DATE = lubridate::ymd(paste(YEAR, # YEAR column
MONTH, # MONTH column
DAY, # DAY column
sep = "/"))) # YYYY/MM/DD
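To see how this date construction behaves when MONTH or DAY is missing, here is a small sketch on toy rows. The data frame `toy` is made up for illustration and is not taken from the NOAA file.

```r
library(lubridate)

# Toy rows (invented): one complete date, one with missing MONTH and DAY.
toy <- data.frame(YEAR = c(2011, 1900), MONTH = c(3, NA), DAY = c(11, NA))

# paste() turns NA into the string "NA", so the second row becomes
# "1900/NA/NA", which ymd() cannot parse and returns as NA.
# suppressWarnings() hides the "failed to parse" warning.
toy$DATE <- suppressWarnings(
  ymd(paste(toy$YEAR, toy$MONTH, toy$DAY, sep = "/"))
)
```

The first row parses to a valid Date, while the second becomes NA. Rows like the second one are the "observations with no Date" that the cleaning step removes.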
1.3 Conversion Process
I have converted the class of some features:
- TOTAL_DEATHS to numeric;
- EQ_PRIMARY to numeric;
- All NA values of TOTAL_DEATHS to zeros.
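The conversions listed above could be sketched as follows. This is an illustration on invented toy data, assuming the columns arrive as character vectors; the actual code inside eq_clean_data may differ.

```r
library(dplyr)

# Toy data (invented), with the two columns stored as character.
toy <- data.frame(
  TOTAL_DEATHS = c("1", NA, "120"),
  EQ_PRIMARY = c("7.1", "7.3", NA),
  stringsAsFactors = FALSE
)

converted <- toy %>%
  mutate(
    # Convert both columns to numeric.
    TOTAL_DEATHS = as.numeric(TOTAL_DEATHS),
    EQ_PRIMARY = as.numeric(EQ_PRIMARY),
    # Replace missing death counts with zero; EQ_PRIMARY keeps its NA.
    TOTAL_DEATHS = ifelse(is.na(TOTAL_DEATHS), 0, TOTAL_DEATHS)
  )
```

Note that only TOTAL_DEATHS has its NA values replaced; a missing magnitude in EQ_PRIMARY stays NA.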
1.4 Cleaning Process
I have removed:
- All observations flagged as Tsunami, and;
- All observations with no Date.
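The two removal rules above might look like the sketch below. The column name FLAG_TSUNAMI and its "Tsu" marker are assumptions about how the raw file flags tsunami events, and the toy rows are invented; the actual filter inside eq_clean_data may be written differently.

```r
library(dplyr)

# Toy data (invented). FLAG_TSUNAMI and the "Tsu" marker are assumed
# names for the tsunami flag, used here for illustration only.
toy <- data.frame(
  I_D = 1:4,
  FLAG_TSUNAMI = c(NA, "Tsu", NA, NA),
  DATE = as.Date(c("2001-01-15", "2002-02-20", NA, "2003-03-25")),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  filter(
    is.na(FLAG_TSUNAMI), # drop observations flagged as tsunami
    !is.na(DATE)         # drop observations with no date
  )
```

Here rows 2 (tsunami flag) and 3 (no date) are dropped, leaving the observations with I_D 1 and 4.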
1.5 Example 1
How to load a txt file.
# Load the package
library(msdr)
# Define as file_name the txt file.
df <- eq_clean_data(file_name = raw_data_path)
# Dimensions of the loaded dataframe.
dim(df)
#> [1] 2840 49
1.6 Example 2
Piping a dataset to eq_clean_data.
# Load the package
library(msdr)
# Piping a read_delim with eq_clean_data.
readr::read_delim(raw_data_path,
delim = "\t") %>%
eq_clean_data() -> df
# Dimensions of the loaded dataframe.
dim(df)
#> [1] 2840 49