4.14 Regression Lesson 14

Bear in mind that regression belongs to the Supervised branch of Machine Learning. Figure 1 shows a simple big-picture view.

Figure 1 - Machine Learning Branches.


Some key points about these two branches (there are other branches, but these two are the best known):

  • Supervised: A machine learning technique where we are attempting to predict a label based on inputs.
    • Predict fraudulent transactions
    • Predict chance of default on a loan
    • Predict home prices
  • Unsupervised: A machine learning technique where we are attempting to group together unlabeled data based on similar characteristics.
    • Customer segmentation
    • Group documents that cover similar topics

Linear and Logistic Regression fall into the Supervised branch of Machine Learning. To make the supervised/unsupervised contrast concrete, see the sketch below.
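A minimal sketch of the difference (scikit-learn with made-up data; the estimator choices here are only illustrative, not part of the lesson):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))  # input features

# Supervised: a label y exists, and the model learns to predict it from X.
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, size=100)
supervised = LinearRegression().fit(X, y)   # trained on (X, y) pairs
print(supervised.predict(X[:3]))            # predicted labels

# Unsupervised: no label; the model only groups similar rows of X.
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)
print(unsupervised.labels_[:3])             # cluster assignments
```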

4.14.1 Introduction to Linear Regression

  • Response variable or dependent (y): The variable you are interested in predicting;
  • Explanatory variable or independent (x): The variable used to predict the response.

A common way to visualize the relationship between two variables in linear regression is using a scatterplot. You will see more on this in the concepts ahead. — Udacity notebook

Figure 2 shows an example of a scatter plot.

Figure 2 - Hours studying vs Test grades.


4.14.1.1 Scatter Plot

Scatter plots are a common visual for comparing two quantitative variables. A common summary statistic that relates to a scatter plot is the correlation coefficient, commonly denoted by r.
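As a quick sketch of how such a scatter plot can be produced (matplotlib, with made-up hours/grades data in the spirit of Figure 2):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=50)                  # hours studying (x)
grades = 55 + 4 * hours + rng.normal(0, 5, size=50)  # test grades (y), made up

plt.scatter(hours, grades)
plt.xlabel("Hours studying")
plt.ylabel("Test grade")
plt.title("Hours studying vs Test grades")
plt.show()
```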

Though there are a few different ways to measure correlation between two variables, the most common way is with Pearson’s correlation coefficient. Pearson’s correlation coefficient provides the:

  1. Strength
  2. Direction

of a linear relationship. Spearman’s Correlation Coefficient does not measure linear relationships specifically, and it might be more appropriate for certain cases of associating two variables. — Udacity notebook
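A minimal sketch of computing both coefficients (scipy.stats, with made-up data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(0, 2, size=100)   # roughly linear relationship

r_pearson, _ = stats.pearsonr(x, y)      # strength and direction of a *linear* relationship
rho_spearman, _ = stats.spearmanr(x, y)  # rank-based, captures any monotonic relationship
print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {rho_spearman:.2f}")
```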

Figure 3 shows an example of a strong, positive relationship.

Figure 3 - Strong and Positive Relationship.


When x increases, y also increases, and the points are very close to each other. Figure 4 shows the opposite, a negative direction.

Figure 4 - Moderate and Negative Relationship.


When x increases, y decreases, and the points are a bit more spread out. Figure 5 shows an example of a scatter plot with weak strength.

Figure 5 - Weak and Negative (??) Relationship.


Both strength and direction are captured by the correlation coefficient (r).

  • Correlation
    • Varies from -1 to +1;
    • Values close to -1 and +1 are very strong;
    • The sign (positive or negative) indicates the direction.

Figure 6 - Example of Correlation.


4.14.2 Correlation Coefficients

This is a highly field-dependent measure, and these values are a general rule of thumb.

Keep in mind that in the social sciences it is very difficult to find a strong correlation (probably because humans are very complex and hard to understand).

  • Strong: \(0.7 \leq |r| \leq 1.0\)
  • Moderate: \(0.3 \leq |r| < 0.7\)
  • Weak: \(0.0 \leq |r| < 0.3\)
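These cutoffs are straightforward to encode; a small helper (hypothetical, just mirroring the rule of thumb above):

```python
def correlation_strength(r: float) -> str:
    """Classify |r| following the rule of thumb above."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"

print(correlation_strength(-0.85))  # strong (the sign only tells the direction)
print(correlation_strength(0.45))   # moderate
```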

Sometimes a plot can help a lot; Figure 7 shows an example of two graphs with the same correlation coefficient.

Figure 7 - Two Graphs with the Same Correlation Coefficient.


This problem, presented in Figure 7, is part of Anscombe's Quartet.
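Anscombe's Quartet ships as an example dataset with seaborn, so the effect is easy to reproduce (a sketch, assuming seaborn is installed):

```python
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset (I-IV), x, y

# The four datasets look very different when plotted, yet share almost the same r.
for name, group in df.groupby("dataset"):
    r = group["x"].corr(group["y"])  # pandas Pearson correlation
    print(f"dataset {name}: r = {r:.3f}")
```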

4.14.3 Coefficients

A Linear Regression is a way to estimate the values of two coefficients:

  • Intercept: The expected value of the response when the explanatory variable is 0 (zero);
    • \(b_0:\) statistic value (sample)
    • \(\beta_0:\) parameter value (population)
  • Slope: The expected change in the response for each 1 unit increase in the explanatory variable.
    • \(b_1:\) statistic value (sample)
    • \(\beta_1:\) parameter value (population)

Based on the Intercept and Slope, the Linear Regression equation is presented in equation (1).

\[ \hat y = b_0 + b_1x \tag{1} \]

Where:

  • \(\hat y:\) is the predicted value of the response from the line;
  • \(y:\) is an actual response value for a data point in our dataset (not a prediction from our line);
  • \(b_0:\) is the intercept;
  • \(b_1:\) is the slope;
  • \(x:\) is the explanatory variable.

Figure 8 illustrates this equation.

Figure 8 - Linear Model and Equation.
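As a minimal sketch of estimating these coefficients from data (statsmodels OLS on made-up data; the variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)            # explanatory variable
y = 5 + 3 * x + rng.normal(0, 2, size=100)  # response (true intercept 5, slope 3)

X = sm.add_constant(x)        # adds the column of 1s that gives the intercept
results = sm.OLS(y, X).fit()  # ordinary least squares fit
b0, b1 = results.params       # estimated intercept and slope
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```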


4.14.4 Least-squares
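Least-squares is the standard criterion for fitting this line: it chooses \(b_0\) and \(b_1\) so as to minimize the sum of squared vertical distances between each actual value \(y\) and its prediction \(\hat y\), that is, it minimizes \(\sum_i (y_i - \hat y_i)^2\). A minimal sketch of the resulting closed-form estimates (plain NumPy, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 5 + 3 * x + rng.normal(0, 2, size=100)  # made-up data (true b0=5, b1=3)

# Closed-form least-squares estimates for simple linear regression:
#   b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # should land near 5 and 3
```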

 

A work by AH Uyekita

anderson.uyekita[at]gmail.com