4.14 Regression Lesson 14 | ND111 - Data Science II

4.14 Regression `Lesson 14`

Bear in mind, regression is a subject of the Supervised branch of Machine Learning. Figure 1 shows a simple big picture.

Figure 1 - Machine Learning Branches.

Some hot points in these two branches (there are other branches, but these two are the most know).

Supervised: A machine learning technique where we are attempting to predict a label based on inputs.
- Predict fraudulent transactions
- Predict chance of default on a loan
- Predict home prices
Unsupervised: A machine learning technique where we are attempting to group together unlabeled data based on similar characteristics.
- Customer segmentation
- Group document that cover similar topics

The Linear and Logistic Regression fall into the Supervised Machine Learning branch.

4.14.1 Introduction to Linear Regression

Response variable or dependent (y): The variable you are interested in predicting, and;
Explanatory variable or independent (x): The variable used to predicted the response.

A common way to visualize the relationship between two variables in linear regression is using a scatterplot. You will see more on this in the concepts ahead. — Udacity notebook

Figure 2 shows an example of a scatter plot.

Figure 2 - Hours studying vs Test grades.

4.14.1.1 Scatter Plot

Scatter plots are a common visual for comparing two quantitative variables. A common summary statistic that relates to a scatter plot is the correlation coefficient commonly denoted by r.

Though there are a few different ways to measure correlation between two variables, the most common way is with Pearson’s correlation coefficient. Pearson’s correlation coefficient provides the:

Strength

Direction

of a linear relationship. Spearman’s Correlation Coefficient does not measure linear relationships specifically, and it might be more appropriate for certain cases of associating two variables. — Udacity notebook

Figure 3 shows an example of strong positive relationship.

Figure 3 - Strong and Positive Relantionship.

When x increase y also increase, and the points is very close from each other. Figure 4 shows the opposite of positive direction.

Figure 4 - Moderate and Negative Relantionship.

When x increase y decrease, and the points is a bit sparse. Figure 5 shows an example of scatter plot with weak strength.

Figure 5 - Weak and Negative (??) Relantionship.

Both, strength and direction is capture by the correlation (r).

Correlation
- Varies from -1 to +1;
- Values close to -1 and +1 are very strong;
- The signal (positive or negative) means the direction.

Figure 6 - Example of Correlation.

4.14.2 Correlation Coefficients

This is highly field-dependent measure and these values are a general rule of thumbs.

Have in mind, in social sciences is very difficult to find a strong correlation (probably because human are very complex and hard to understand).

Strong: \(0.7 \leq |r| \leq 1.0\)
Moderate: \(0.3 \leq |r| \leq .7\)
Weak: \(0.0 \leq |r| \leq 0.3\)

Sometimes a plot could help a lot, Figure 7 shows an example of two graphs with same correlation coefficients.

Figure 7 - Two Graphics with same Correlation Coefficients.

This problem presented in Figure 6 is part of the Anscombe’s Quartet images.

4.14.3 Coefficients

A Linear Regression is a way to estimate the values of some coefficients:

Intercept: The expected value of the response when the explanatory variable is 0 (zero);
- \(b_0:\) statistic value (sample)
- \(\beta_0:\) parameter value (population)
Slope: The expected change in the response for each 1 unit increase in the explanatory variable.
- \(b_1:\) statistic value (sample)
- \(\beta_1:\) parameter value (population)

Based on the Intercept and Slope, the Linear Regression equation is presented in equation (1).

\[ \hat y = b_0 + b_1x \]

Where: * \(\hat y:\) is the predicted value of the response from the line. * \(y:\) is an actual response value for a data point in our dataset (not a prediction from our line). * \(b_0:\) is the intercept. * \(b_1:\) is the slope. * \(x_1:\) is the explanatory variable.

Figure 8 ilustrate this equation in a picture.

Figure 8 - Linear Model and Equation.

4.14.4 Least-squares

A work by AH Uyekita

anderson.uyekita[at]gmail.com