4.15 Multilinear Regression Lesson 15

Multiple Linear Regression is a generalization of the simple Linear Model that allows us to use several explanatory variables, as shown in equation (1).

\[ \hat y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 \tag{1} \]

Figure 1 illustrates the variables that could be used to predict the house’s price.

Figure 1 - House Price Dataset.


Keep in mind that the table above contains both quantitative and categorical variables.

  • Quantitative:
    • Price, Area, Bedrooms, and Bathrooms;
  • Categorical:
    • Neighborhood and Style.

All these variables will be used to predict the house’s price; to do so, a little linear algebra is necessary, as shown in Figure 2.

Figure 2 - Matrix notation.


Where:

  • \(\bold X:\) Matrix of inputs;
  • \(\bold y:\) Vector of responses that we want to predict.

After several steps of linear algebra, equation (2) summarizes the \(\beta\) calculation.

\[ \beta = (\bold X' \bold X)^{-1} \bold X' \bold y \tag{2}\]

Keep in mind that the coefficients calculated in \(\beta\) are the same as those shown in the coef column of Figure 3.
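As a minimal sketch (file and column names assumed from the lesson, and only the quantitative columns used for brevity), equation (2) can be computed directly with NumPy and compared with statsmodels’ OLS:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# File and column names are taken from the lesson; only the quantitative
# explanatory variables are used here to keep the sketch short.
df = pd.read_csv('house_prices.csv')
X = sm.add_constant(df[['area', 'bedrooms', 'bathrooms']])  # adds the b_0 column of 1s
y = df['price']

# Equation (2): beta = (X'X)^{-1} X'y
Xm, ym = X.to_numpy(dtype=float), y.to_numpy(dtype=float)
beta = np.linalg.inv(Xm.T @ Xm) @ Xm.T @ ym
print(dict(zip(X.columns, beta)))

# The same numbers appear in the "coef" column of the statsmodels OLS summary.
results = sm.OLS(y, X).fit()
print(results.summary())
```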

Interpretation

Figure 3 presents the OLS output for a generic dataset (house_prices.csv).

Figure 3 - Coefficients, Standard Errors, etc.


In this video, the coefficients were all positive. Therefore, we can interpret each coefficient as the predicted increase in the response for every one unit increase in the explanatory variable, holding all other variables in the model constant.

This principle is the so-called ceteris paribus.

Ceteris paribus or caeteris paribus is a Latin phrase meaning “other things equal”. English translations of the phrase include “all other things being equal” or “other things held constant” or “all else unchanged”. A prediction or a statement about a causal, empirical, or logical relation between two states of affairs is ceteris paribus if it is acknowledged that the prediction, although usually accurate in expected conditions, can fail or the relation can be abolished by intervening factors. — Wikipedia

4.15.0.1 Statistical Significance

As in the simple Linear Model, it is important to analyse the p-value of each variable. Figure 4 shows the p-values.

Figure 4 - Statistical Significance.


As you can see, most of the variables do not have low p-values, which suggests they are not statistically significant in explaining the response variable. In this example, only area rejects \(H_0\); the other variables fail to reject \(H_0\).

Significant bivariate relationships are not always significant in Multiple Linear Regression.
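As a short continuation of the sketch above (reusing the fitted results object), the p-values can be read straight from statsmodels and compared against a significance level such as 0.05:

```python
# p-value for each coefficient; a small value (e.g. < 0.05) rejects H0
# that the coefficient is zero, holding the other variables constant.
print(results.pvalues)
print(results.pvalues < 0.05)
```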

4.15.1 Dummy Variable

This is a method to include categorical variables in Multiple Linear Regression. We use 0 and 1 to encode the dummy variables, creating a new column for each category of the variable. Figure 5 shows an example.


Figure 5 - Converting Categorical Variable to Dummy Variable.


Remember, the C column is not necessary because it is the complement of the other two columns (A and B). The dropped column (in this case the C column) is the so-called baseline. What you need to understand about the baseline is:

The coefficients you obtain from the output of your multiple linear regression models are then an indication of how the encoded levels compare to the baseline level (the dropped level). — Udacity Notebook

The mathematical reason to drop one column is to guarantee that \(\bold X\) has full column rank, so that \(\bold X' \bold X\) is invertible.
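A small sketch of this point, using a toy neighborhood column with illustrative levels A, B, and C: keeping every dummy column alongside the intercept makes \(\bold X\) rank-deficient, while dropping one level (the baseline) fixes it.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy categorical column; the levels A, B, C are just illustrative.
toy = pd.DataFrame({'neighborhood': ['A', 'B', 'C', 'A', 'B', 'C']})

# Keeping every dummy column plus the intercept makes the columns of X
# linearly dependent (A + B + C equals the column of 1s).
X_full = sm.add_constant(pd.get_dummies(toy['neighborhood'], dtype=float))
print(X_full.shape[1], np.linalg.matrix_rank(X_full.to_numpy()))  # 4 columns, rank 3

# Dropping one level (the baseline) restores full column rank, so X'X is invertible.
X_drop = sm.add_constant(pd.get_dummies(toy['neighborhood'], drop_first=True, dtype=float))
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop.to_numpy()))  # 3 columns, rank 3
```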

Interpretation

The baseline is used to recover the actual coefficient of each dummy variable. Figure 6 shows an example of the output of a Multiple Linear Regression.


Figure 6 - Example of coefficients from dummy variables.


The intercept is the value of the baseline (in other words, it is the coefficient of A). To recover the absolute value of each of the other levels, add its coefficient from the OLS output (which is relative to the baseline) to the intercept.

  • Coefficient A: 5.411e+05
  • Coefficient B: 5.411e+05 + 5.295e+05 = 1070600.0
  • Coefficient C: 5.411e+05 - 332.36 = 540767.6406

Coef C < Coef A < Coef B

Each of the coefficients provided by the OLS is a comparison with the baseline.
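A tiny sketch of this bookkeeping, using the values quoted above (and assuming, consistently with the arithmetic, that the coefficient reported for C is -332.36):

```python
# Intercept = predicted value for the baseline level (A).
intercept = 5.411e+05

# Coefficients relative to the baseline, as read from the OLS output.
relative = {'B': 5.295e+05, 'C': -332.36}

# Absolute value for each level = intercept + relative coefficient.
absolute = {'A': intercept, **{k: intercept + v for k, v in relative.items()}}
print(absolute)   # {'A': 541100.0, 'B': 1070600.0, 'C': 540767.64}
```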

A more complex example is presented in Figure 7.


Figure 7 - Dummy variables from two categorical variables.


There are two baseline levels: A and ranch.

  • Coefficient A: -383300.0
  • Coefficient B: -383300.0 + 5.229e5 = 139600.0
  • Coefficient C: -383300.0 - 7168.63 = -390500.0
  • ranch: -383300.0
  • lodge: -383300.0 + 1.685e5 = -214800.0
  • victorian: -383300.0 + 7.056e4 = -312740.0

4.15.2 Problems in Multiple Linear Regression

Which problems matter depends on what you are using the Multiple Linear Regression model for.

  • What is your model for?
    • Understanding how x and y are related;
    • Best predicting the response variable;
    • Identifying which variables are really useful in predicting the response.

Depending on the objectives of the model, several problems may arise; below are some common ones.

  • A linear relationship does not exist
  • Correlated errors
  • Non-constant variance
  • Outliers
  • Multicollinearity

Multicollinearity

Generally, we assume the explanatory variables are all uncorrelated with one another, while we want these explanatory variables to be correlated with the response variable.

An example of possible multicollinearity:

We expect a larger number of bedrooms and bathrooms as the house area increases.

There are two ways to identify multicollinearity:

Scatterplot

Figure 8 shows the scatter plot of area, bedrooms, and bathrooms.


Figure 8 - Scatterplot of area, bedrooms, and bathrooms.


All three variables are strongly and positively correlated with one another.
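A minimal sketch of producing such a scatterplot matrix (and the pairwise correlations) with pandas, assuming the same file and column names as before:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('house_prices.csv')
cols = ['area', 'bedrooms', 'bathrooms']

# Pairwise correlations: values close to 1 point to possible multicollinearity.
print(df[cols].corr())

# Scatterplot matrix, similar in spirit to Figure 8.
pd.plotting.scatter_matrix(df[cols], figsize=(8, 8))
plt.show()
```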

OK! Let’s calculate the Multiple Linear Regression using area, bedrooms, and bathrooms as explanatory variables to predict the price. Figure 9 shows the output of the OLS method.


Figure 9 - Coefficients of a Multiple Linear Regression based on area, bedrooms, and bathrooms to predict price.


Remember, we expected positive coefficients, but the bedrooms coefficient is negative. This is one of the side effects of multicollinearity: it can flip the sign of a coefficient and produce a counterintuitive result.

Calculating the VIF

The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is calculated by taking the ratio of the variance of all of a given model’s betas divided by the variance of a single beta if it were fit alone. — Ernest Tavares Website

In other words, for each explanatory variable we calculate the R-squared obtained when that variable is regressed on the other explanatory variables, and then calculate the VIF according to equation (3).

\[VIF_i = \frac{1}{1-R^2_i} \tag{3}\]

Where:

  • \(i:\) intercept, area, bedrooms, and bathrooms.

This is my original equation:

\[ price = intercept + area*b_1 + bedrooms * b_2 + bathrooms * b_3 \]

Where:

  • Explanatory variables: intercept, area, bedrooms, and bathrooms.
  • Response variable: price

I need to calculate the \(R^2\) for each of these equations:

\[ intercept = area*b_1 + bedrooms * b_2 + bathrooms * b_3 \\ bedrooms = intercept + area*b_1 + bathrooms * b_2 \\ bathrooms = intercept + area*b_1 + bedrooms * b_2 \\ area = intercept + bathrooms*b_1 + bedrooms * b_2 \]
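A rough sketch of these auxiliary regressions, assuming the same design matrix as before (intercept plus the three quantitative columns); the loop below mirrors the idea behind Table 1 and Table 2:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('house_prices.csv')
X = sm.add_constant(df[['area', 'bedrooms', 'bathrooms']])

# Regress each column of X on the remaining columns, then apply equation (3).
for col in X.columns:
    r_squared = sm.OLS(X[col], X.drop(columns=col)).fit().rsquared
    print(col, r_squared, 1 / (1 - r_squared))
```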

The result of the \(R^2\) is presented in Table 1.

Table 1 - R-Squared used to Calculate the VIF.
Response Variable \(R^2\)
intercept 0.8635203885901309
area 0.8167890866692036
bedrooms 0.952048682065964
bathrooms 0.9473873926707678

Table 2 shows the VIFs for each value of \(R^2\).

Table 2 - VIFs.
Response Variable VIF
intercept 7.3271017529266529
area 5.458190136274525
bedrooms 20.854484153608585
bathrooms 19.006851223744377

When VIFs are greater than 10, this suggests that multicollinearity is certainly a problem in your model. Some experts even suggest VIFs of greater than 5 can be problematic. In most cases, not just one VIF is high, rather many VIFs are high, as these are measures of how related variables are with one another. — Udacity notebook.

In this case I will remove the bathrooms variable, and update my equation:

\[ price = intercept + area * b_1 + bedrooms * b_2 \tag{4}\]

Following the same process, I will calculate the new \(R^2\) values and then the VIFs.

Table 3 - Updated R-Squared and VIF values after removing bathrooms.
Response Variable \(R^2\) VIF
intercept 0.8350894909730524 6.063894932472694
area 0.8129232492956318 5.345399662089865
bedrooms 0.8129232492956316 5.3453996620898625

There is another way to calculate these VIFs, using variance_inflation_factor from statsmodels.stats.outliers_influence. This method is much simpler to apply, but it hides what happens behind the scenes.
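A short sketch of that approach, with the same assumed columns as before:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('house_prices.csv')
X = sm.add_constant(df[['area', 'bedrooms', 'bathrooms']])

# One VIF per column of the design matrix (the constant included).
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)
```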

4.15.3 Higher Order Terms

Adding higher-order terms is a way to introduce non-linearity into the model that explains the response variable.

  • Higher Order Terms:
    • Interactions: \(x_1x_2\)
    • Quadratics: \(x^2\)
    • Cubics: \(x^3\)
    • Other higher-order terms: \(x^n\)

Adding these terms can improve the model’s fit, but it makes the relationships involving these higher-order terms harder to understand and explain.

For instance, \(y_1\) does not have higher-order terms and \(y_2\) has an interaction. Both equations are quite similar, except for the term \(cx_1x_2\).

\[y_1 = ax_1 + bx_2 + d \\ y_2 = ax_1 + bx_2 + cx_1x_2 + d\]

The partial derivative of each equation with respect to \(x_1\) returns the slope.

\[\frac{\partial y_1}{\partial x_1} = a \\ \frac{\partial y_2}{\partial x_1} = a + c \cdot x_2\]

In \(\partial y_1/\partial x_1\) the slope is constant (\(a\)), whereas in \(\partial y_2/\partial x_1\) the slope varies with \(x_2\) through the \(c \cdot x_2\) term.
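A minimal sketch of adding such terms before fitting; the new column names (area_squared, area_x_bedrooms) are purely illustrative:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('house_prices.csv')

# Higher-order terms are just new columns computed from the existing ones.
df['area_squared'] = df['area'] ** 2                   # quadratic term
df['area_x_bedrooms'] = df['area'] * df['bedrooms']    # interaction term

X = sm.add_constant(df[['area', 'bedrooms', 'area_squared', 'area_x_bedrooms']])
results = sm.OLS(df['price'], X).fit()
print(results.summary())
```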

4.15.4 New Method

In this lesson, the pandas method .get_dummies() is introduced.

4.15.4.1 .get_dummies()

This method returns a DataFrame with a new column for each category of the categorical variable.

pd.get_dummies(df['categorical_variable'])

Remember, this method does not drop the baseline column by default.
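A small sketch, assuming a lowercase neighborhood column in house_prices.csv; drop_first=True (or dropping one column manually) removes the baseline level:

```python
import pandas as pd

df = pd.read_csv('house_prices.csv')

# One 0/1 column per category; no baseline is dropped by default.
dummies = pd.get_dummies(df['neighborhood'])
print(dummies.head())

# drop_first=True removes one level, which then acts as the baseline.
print(pd.get_dummies(df['neighborhood'], drop_first=True).head())
```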

 

A work by AH Uyekita

anderson.uyekita[at]gmail.com