## Introduction to Multiple Linear Regression

Linear regression models are used to analyze the relationship between one or more independent variables (IVs) and a dependent variable (DV), a.k.a. the predicted variable. If only one predictor variable (IV) is used, the model is called a simple linear regression model. Such a model, however, will have little explanatory power, because in the real world there is no single variable that can fully explain, or predict, an outcome. Most commonly, the model will have multiple IVs, which makes it a multiple linear regression model. Let's take a look at the equation for multiple linear regression:

`Y = Β_0 + Β_1X_1 + Β_2X_2 + … + Β_nX_n + e_i`

Where:

- Y = outcome variable (dependent variable)
- Β_0 = constant coefficient (Y-intercept)
- Β_1X_1 = the coefficient and known constant for variable 1, respectively
- Β_2X_2 = the coefficient and known constant for variable 2, respectively
- Β_nX_n = the coefficient and known constant for variable n, respectively
- e_i = error

For every variable added to the regression model, the model gains a Β_nX_n term, where "n" denotes the number of the IV. Okay, that's enough of a primer for now. Don't worry if this doesn't make much sense to you now. The model will be filled in later with numbers from the example data, which will make the picture more cohesive.

Multiple linear regression has assumptions just like all other parametric approaches. It is often not possible to test the assumptions before running the model. Therefore, model diagnostics should be run after fitting the model to test the following assumptions:

- Data must not have multicollinearity
- The residual errors should be approximately normally distributed
- Homoscedasticity
- Independence of errors

Often, when using real-world data, there will be violations. If the residuals violate the assumptions of homoscedasticity and/or normality, then you can try transforming the data or using a robust regression model (discussed below). If no transformations or corrections are made to the data/model, then one will have difficulty generalizing the model to the population, i.e. the findings should be limited to the sample used.

##### Data used in this section

Data in this section is from Kaggle.com, from the user Miri Choi. A link to the original data set can be found here. This data set contains information on insurance premium charges, age, sex, BMI, number of children, smoker status, and region. The question being asked is: what are the most influential predictors of insurance premium charges? This makes the DV the insurance premium charge, and the IVs are the other variables in the data.

Let’s import the required libraries, load the data set, and take a look at the variables!

```python
# These are for exploring the data
import pandas as pd
import researchpy as rp

# These are for running the model and conducting model diagnostics
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy import stats
from statsmodels.compat import lzip

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/insurance.csv")
df.head()
```

age | sex | bmi | children | smoker | region | charges
---|---|---|---|---|---|---
19 | female | 27.900 | 0 | yes | southwest | 16884.92400
18 | male | 33.770 | 1 | no | southeast | 1725.55230
28 | male | 33.000 | 3 | no | southeast | 4449.46200
33 | male | 22.705 | 0 | no | northwest | 21984.47061
32 | male | 28.880 | 0 | no | northwest | 3866.85520

Now let’s get some more information about the variables. Researchpy has nice options to explore continuous and categorical data. Official documentation can be found here.

```python
# Let's get more information on the continuous variables
rp.summary_cont(df[['charges', 'age', 'children']])
```

Variable | N | Mean | SD | SE | 95% Conf. | Interval
---|---|---|---|---|---|---
charges | 1338.0 | 13270.422265 | 12110.011237 | 331.067454 | 12620.954034 | 13919.890496
age | 1338.0 | 39.207025 | 14.049960 | 0.384102 | 38.453516 | 39.960534
children | 1338.0 | 1.094918 | 1.205493 | 0.032956 | 1.030266 | 1.159569

There is a lot of variability in the variable charges, and a decent amount in age; something to keep an eye on. Next, let's get some descriptives on the categorical variables.

```python
rp.summary_cat(df[['sex', 'smoker', 'region']])
```

Variable | Outcome | Count | Percent
---|---|---|---
sex | male | 676 | 50.52
 | female | 662 | 49.48
smoker | no | 1064 | 79.52
 | yes | 274 | 20.48
region | southeast | 364 | 27.20
 | northwest | 325 | 24.29
 | southwest | 325 | 24.29
 | northeast | 324 | 24.22

The sample is a near even split of males and females, the majority are non-smokers, and there is a very close breakdown by region. I could run the regression model using the variables as is; however, I want to run a correlation matrix first to look for multicollinearity between variables. In order to do this, I have to create dummy variables from the categorical data. I will show two methods of how to do this.

The first is to use the .replace() method that is built-in to Pandas DataFrame.

```python
df['sex'].replace({'female': 1, 'male': 0}, inplace=True)
df['smoker'].replace({'no': 0, 'yes': 1}, inplace=True)
```

This method uses a Python dictionary to match the current variable label and replace it with a numeric representation. Females and smokers will be indicated by a 1, while males and non-smokers will be indicated by a 0. Since the region variable has multiple categories within it, an easy way to create multiple dummy variables is to use Pandas's built-in .get_dummies() method.

```python
df = pd.get_dummies(df)
df.head()
```

age | sex | bmi | children | smoker | charges | region_northeast | region_northwest | region_southeast | region_southwest
---|---|---|---|---|---|---|---|---|---
19 | 1 | 27.900 | 0 | 1 | 16884.92400 | 0 | 0 | 0 | 1
18 | 0 | 33.770 | 1 | 0 | 1725.55230 | 0 | 0 | 1 | 0
28 | 0 | 33.000 | 3 | 0 | 4449.46200 | 0 | 0 | 1 | 0
33 | 0 | 22.705 | 0 | 0 | 21984.47061 | 0 | 1 | 0 | 0
32 | 0 | 28.880 | 0 | 0 | 3866.85520 | 0 | 1 | 0 | 0

## Multiple Linear Regression Example

Now that the data is ready to go, I will fit a model using statsmodels' formula method. Then diagnostics will be run on the model. When using dummy variables, it is important to leave out one dummy variable per original categorical variable. For example, since region has 4 possible categories, I will include 3 of those 4 in the model, and the 4th category will be captured in the model's intercept.
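As an aside, pandas can also handle this at dummy-creation time: `get_dummies()` accepts a `drop_first=` argument that drops the first category of each variable automatically, so the dummy-variable trap is avoided up front. A minimal sketch on a toy frame (the toy data is made up for illustration; this tutorial instead keeps all dummies and omits one per variable in the formula):

```python
import pandas as pd

# Toy frame standing in for the region column (illustration only)
toy = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# drop_first=True drops the alphabetically-first category (northeast),
# which then serves as the reference level captured by the intercept
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
# → ['region_northwest', 'region_southeast', 'region_southwest']
```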

```python
model = smf.ols("charges ~ age + bmi + sex + smoker + children + region_northwest + region_southeast + region_southwest",
                data=df).fit()
model.summary()
```

Dep. Variable: | charges | R-squared: | 0.751
---|---|---|---
Model: | OLS | Adj. R-squared: | 0.749
Method: | Least Squares | F-statistic: | 500.8
Date: | Wed, 21 Nov 2018 | Prob (F-statistic): | 0.00
Time: | 13:33:36 | Log-Likelihood: | -13548.
No. Observations: | 1338 | AIC: | 2.711e+04
Df Residuals: | 1329 | BIC: | 2.716e+04
Df Model: | 8 | |
Covariance Type: | nonrobust | |

 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | -1.207e+04 | 999.649 | -12.074 | 0.000 | -1.4e+04 | -1.01e+04
age | 256.8564 | 11.899 | 21.587 | 0.000 | 233.514 | 280.199
bmi | 339.1935 | 28.599 | 11.860 | 0.000 | 283.088 | 395.298
sex | 131.3144 | 332.945 | 0.394 | 0.693 | -521.842 | 784.470
smoker | 2.385e+04 | 413.153 | 57.723 | 0.000 | 2.3e+04 | 2.47e+04
children | 475.5005 | 137.804 | 3.451 | 0.001 | 205.163 | 745.838
region_northwest | -352.9639 | 476.276 | -0.741 | 0.459 | -1287.298 | 581.370
region_southeast | -1035.0220 | 478.692 | -2.162 | 0.031 | -1974.097 | -95.947
region_southwest | -960.0510 | 477.933 | -2.009 | 0.045 | -1897.636 | -22.466

Omnibus: | 300.366 | Durbin-Watson: | 2.088
---|---|---|---
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 718.887
Skew: | 1.211 | Prob(JB): | 7.86e-157
Kurtosis: | 5.651 | Cond. No. | 315.

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model explains a significant amount of variance, as indicated by the F-statistic and its corresponding p-value, F(8, 1329) = 500.8, p < 0.01. Now to run diagnostics on the model. Statsmodels already provides some model diagnostics: the omnibus test (a test of the normality of the residuals based on their skew and kurtosis), the Durbin-Watson test (measuring autocorrelation between residuals separated by a time lag), the Jarque-Bera test (another test of normality), and the Condition Number (a measure of multicollinearity).

### Diagnosing Multicollinearity

Now let’s check for multicollinearity. When checking, it doesn’t really matter if the correlations are significant or not since that is not the purpose of running a correlation analysis in this context.

```python
df.corr()
```

 | age | bmi | children | charges | sex | smoker | region_northeast | region_northwest | region_southeast | region_southwest
---|---|---|---|---|---|---|---|---|---|---
age | 1.000000 | 0.109272 | 0.042469 | 0.299008 | 0.020856 | -0.025019 | 0.002475 | -0.000407 | -0.011642 | 0.010016
bmi | 0.109272 | 1.000000 | 0.012759 | 0.198341 | -0.046371 | 0.003750 | -0.138156 | -0.135996 | 0.270025 | -0.006205
children | 0.042469 | 0.012759 | 1.000000 | 0.067998 | -0.017163 | 0.007673 | -0.022808 | 0.024806 | -0.023066 | 0.021914
charges | 0.299008 | 0.198341 | 0.067998 | 1.000000 | -0.057292 | 0.787251 | 0.006349 | -0.039905 | 0.073982 | -0.043210
sex | 0.020856 | -0.046371 | -0.017163 | -0.057292 | 1.000000 | -0.076185 | 0.002425 | 0.011156 | -0.017117 | 0.004184
smoker | -0.025019 | 0.003750 | 0.007673 | 0.787251 | -0.076185 | 1.000000 | 0.002811 | -0.036945 | 0.068498 | -0.036945
region_northeast | 0.002475 | -0.138156 | -0.022808 | 0.006349 | 0.002425 | 0.002811 | 1.000000 | -0.320177 | -0.345561 | -0.320177
region_northwest | -0.000407 | -0.135996 | 0.024806 | -0.039905 | 0.011156 | -0.036945 | -0.320177 | 1.000000 | -0.346265 | -0.320829
region_southeast | -0.011642 | 0.270025 | -0.023066 | 0.073982 | -0.017117 | 0.068498 | -0.345561 | -0.346265 | 1.000000 | -0.346265
region_southwest | 0.010016 | -0.006205 | 0.021914 | -0.043210 | 0.004184 | -0.036945 | -0.320177 | -0.320829 | -0.346265 | 1.000000

There are no strong correlations between the IVs meaning there is no need to worry about multicollinearity.

### Diagnosing Normality

For regressions, the test of normality applies to the model's residuals, not the variables themselves. This can be checked visually by plotting the residuals as a histogram and/or using a probability plot. I will show how to visually inspect the residuals using a probability plot. If you are unfamiliar with how to read a probability plot, the residuals, represented as dots (in blue), should fall on the red line.

```python
import matplotlib.pyplot as plt  # needed for the plot below

stats.probplot(model.resid, plot=plt)
plt.title("Model1 Residuals Probability Plot")
```

This plot indicates that the model's residuals are not normally distributed. Something to note is that visual representations of data are always subjective, but they are worth including because they show characteristics of the data that formal statistical testing cannot.
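The histogram view mentioned above can be produced in a similar way. A sketch, using synthetic right-skewed residuals so the snippet stands alone; with the fitted model you would substitute `model.resid`:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

# Synthetic skewed "residuals" for illustration; with the real model,
# use resid = model.resid instead
resid = np.random.default_rng(1).gamma(2.0, 2.0, 500) - 4.0

plt.hist(resid, bins=30, edgecolor="black")
plt.title("Model Residuals Histogram")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.savefig("residuals_hist.png")
```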

Common formal tests include the chi-square test, Kolmogorov-Smirnov test, and the Lilliefors test. I will demonstrate how to test for normality using the Kolmogorov-Smirnov test. Official documentation can be found here.

```python
# Note: 'norm' defaults to the standard normal distribution, so the
# residuals are standardized before being compared against it
stats.kstest((model.resid - model.resid.mean()) / model.resid.std(), 'norm')
```

The test is significant which indicates that the model’s residuals are not normally distributed.

### Diagnosing Homoscedasticity

The assumption of homoscedasticity is a vital assumption for linear regression. If this assumption is violated, then the standard errors will be biased. The standard errors are used to conduct significance tests, and calculate the confidence intervals.

This can be tested using a few different statistical tests, including the Brown-Forsythe test, Levene's test, the Breusch-Pagan test, and the Cook-Weisberg test. In this example, I will demonstrate how to conduct the Breusch-Pagan test, followed by how to conduct Levene's test. The test for homoscedasticity of variance needs to be conducted for each level of the categorical variables.

```python
name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
test = sms.het_breuschpagan(model.resid, model.model.exog)
lzip(name, test)
```

```
[('Lagrange multiplier statistic', ...),
 ('p-value', 5.8721533542513195e-22),
 ('f-value', 16.628612027375389),
 ('f p-value', 1.1456058246340301e-23)]
```

Now for how to conduct Levene's test. Since I know the residuals are not normally distributed, I will conduct Levene's test using the median instead of the mean. Official documentation can be found here.

```python
sex_variance_results = stats.levene(df['charges'][df['sex'] == 0],
                                    df['charges'][df['sex'] == 1],
                                    center='median')
smoker_variance_results = stats.levene(df['charges'][df['smoker'] == 0],
                                       df['charges'][df['smoker'] == 1],
                                       center='median')
region_variance_results = stats.levene(df['charges'][df['region_northeast'] == 1],
                                       df['charges'][df['region_northwest'] == 1],
                                       df['charges'][df['region_southeast'] == 1],
                                       df['charges'][df['region_southwest'] == 1],
                                       center='median')

print(f"Sex Variance: {sex_variance_results}", "\n",
      f"Smoker Variance: {smoker_variance_results}", "\n",
      f"Region Variance: {region_variance_results}", "\n")
```

```
Smoker Variance: LeveneResult(statistic=332.6135162726081, pvalue=1.5593284881803726e-66)
Region Variance: LeveneResult(statistic=5.559966758410606, pvalue=0.0008610590250786703)
```

Both of the tests are significant, meaning the data violates the assumption of homoscedasticity, i.e. heteroscedasticity is present in the data. What to do? One can either transform the variables to improve the model, or use a robust regression method that accounts for the heteroscedasticity.
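As a sketch of the transformation route, a right-skewed DV such as charges can be log-transformed directly inside a statsmodels formula. The example below uses a small synthetic frame so it stands alone; with the insurance data, the formula would be `np.log(charges) ~ ...`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data whose outcome is log-linear in x (illustration only)
rng = np.random.default_rng(0)
d = pd.DataFrame({"x": rng.normal(size=300)})
d["y"] = np.exp(1.0 + 0.5 * d["x"] + rng.normal(scale=0.3, size=300))

# Patsy formulas can apply the transformation in place
log_model = smf.ols("np.log(y) ~ x", data=d).fit()
print(log_model.params.round(2))
```

Interpretation changes with the transformation: coefficients now describe approximate percentage changes in the DV rather than unit changes.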

In order to account for the heteroscedasticity in the data, one has to select a heteroscedasticity-consistent covariance matrix (HCCM) and pass it in the "cov_type=" argument of the *.fit()* method. What is an HCCM? Here is a nice read if you are interested in more on this. There are a few HCCMs to choose from:

- HC0, not good on sample sizes ≤ 250
- HC1, not good on sample sizes ≤ 250
- HC2, good on sample sizes ≤ 250
- HC3, which outperforms HC0, HC1, and HC2 when sample size ≤ 250
- There is little difference in performance when the sample is ≥ 500

For the current model, using a robust regression technique will work. This will be demonstrated in the re-running of the model.

### Model Re-run

The re-run of the model will use a heteroscedasticity-consistent covariance matrix so the model is protected from the violation of homogeneity of variance; the model, however, does not correct the violation of normality.

```python
model3 = smf.ols("charges ~ age + bmi + sex + smoker + children + region_northwest + region_southeast + region_southwest",
                 data=df).fit(cov_type='HC3')
model3.summary()
```

Dep. Variable: | charges | R-squared: | 0.751
---|---|---|---
Model: | OLS | Adj. R-squared: | 0.749
Method: | Least Squares | F-statistic: | 298.4
Date: | Wed, 21 Nov 2018 | Prob (F-statistic): | 2.25e-290
Time: | 15:54:47 | Log-Likelihood: | -13548.
No. Observations: | 1338 | AIC: | 2.711e+04
Df Residuals: | 1329 | BIC: | 2.716e+04
Df Model: | 8 | |
Covariance Type: | HC3 | |

 | coef | std err | z | P>\|z\| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | -1.207e+04 | 1062.898 | -11.356 | 0.000 | -1.42e+04 | -9986.611
age | 256.8564 | 11.961 | 21.474 | 0.000 | 233.412 | 280.300
bmi | 339.1935 | 31.879 | 10.640 | 0.000 | 276.711 | 401.676
sex | 131.3144 | 334.971 | 0.392 | 0.695 | -525.217 | 787.846
smoker | 2.385e+04 | 578.079 | 41.255 | 0.000 | 2.27e+04 | 2.5e+04
children | 475.5005 | 131.009 | 3.630 | 0.000 | 218.727 | 732.274
region_northwest | -352.9639 | 486.616 | -0.725 | 0.468 | -1306.714 | 600.786
region_southeast | -1035.0220 | 503.426 | -2.056 | 0.040 | -2021.718 | -48.326
region_southwest | -960.0510 | 463.014 | -2.073 | 0.038 | -1867.541 | -52.561

Omnibus: | 300.366 | Durbin-Watson: | 2.088
---|---|---|---
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 718.887
Skew: | 1.211 | Prob(JB): | 7.86e-157
Kurtosis: | 5.651 | Cond. No. | 315.

Warnings:

[1] Standard Errors are heteroscedasticity robust (HC3)

## Linear Regression Interpretation

First, one has to look to see if the model is significant. This information is found in "Prob (F-statistic)"; the current model is significant. Now one can look at the effect of each IV on the DV (the IV coefficients) and see whether each is a significant predictor (P>|z|). As usual, one wants a p-value < 0.05. All the IVs are significant predictors of insurance premium charges except for sex, indicating that sex doesn't matter in this context.

The coefficient (coef) can be interpreted as the effect of a one-unit change in the IV on the DV. Meaning, for every 1-unit increase in the IV, the DV will increase or decrease by the coefficient amount. In the example above, for every year's increase in age, there will be a $256.86 increase in the insurance premium charge. Let's review the linear regression equation to see how this data works as a predictive/explanatory model. It is predictive when used to predict future outcomes, and explanatory when used to explain the influence of each IV.
`Y = Β_0 + Β_1X_1 + Β_2X_2 + … + Β_nX_n + e_i`

Where:

- Y = outcome variable (dependent variable)
- Β_0 = constant coefficient (Y-intercept)
- Β_1X_1 = the coefficient and known constant for variable 1, respectively
- Β_2X_2 = the coefficient and known constant for variable 2, respectively
- Β_nX_n = the coefficient and known constant for variable n, respectively
- e_i = error

Using the current model, one can write the formula as:

`Charges = -12,070 + 256.86(age) + 339.19(bmi) + 131.31(sex_female) + 23,850(smoker_yes) + 475.50(number of children) - 352.96(region_northwest) - 1,035.02(region_southeast) - 960.05(region_southwest)`
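Plugging hypothetical values into this equation shows how a prediction is produced. The person below (a 30-year-old non-smoking female with a BMI of 25, 1 child, living in the southwest) is made up for illustration; the coefficients are copied from the HC3 summary above:

```python
# Coefficients copied from the HC3 model summary above
coef = {"intercept": -12070, "age": 256.86, "bmi": 339.19, "sex": 131.31,
        "smoker": 23850, "children": 475.50, "region_northwest": -352.96,
        "region_southeast": -1035.02, "region_southwest": -960.05}

def predict_charge(age, bmi, sex, smoker, children,
                   region_northwest=0, region_southeast=0, region_southwest=0):
    """Evaluate the fitted regression equation for one person."""
    return (coef["intercept"] + coef["age"] * age + coef["bmi"] * bmi
            + coef["sex"] * sex + coef["smoker"] * smoker
            + coef["children"] * children
            + coef["region_northwest"] * region_northwest
            + coef["region_southeast"] * region_southeast
            + coef["region_southwest"] * region_southwest)

# 30-year-old female, BMI 25, non-smoker, 1 child, southwest region
print(round(predict_charge(30, 25, 1, 0, 1, region_southwest=1), 2))
# → 3762.31
```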

Now to write up the model’s performance.

#### Model write up

Multiple regression analysis was used to test whether age, BMI, sex (female), smoking status, number of children, and region significantly predicted the cost of insurance premiums. The results of the regression indicated the eight predictors explained 75.1% of the variance (R^{2}= 0.75, F(8, 1329)= 298.4, p< 0.01). The predicted insurance premium charge is equal to -12,070 + 256.86(age) + 339.19(bmi) + 131.31(sex) + 23,850(smoker) + 475.50(number of children) - 352.96(region_northwest) - 1,035.02(region_southeast) - 960.05(region_southwest), where sex is coded as 1= female, 0= male; smoker is coded as 1= smoker, 0= non-smoker; and region_[location] is coded as 1= in region, 0= not in region. All of the independent variables used in the model were significant predictors of insurance premium charge, except for sex.

**References:**

Kutner, M., Nachtsheim, C., Neter, J., and Li, W. (2004). *Applied linear statistical models (5th ed.)*. New York: McGraw-Hill.

##### Comments

**Vishal:** For checking multicollinearity, you are using pairwise correlation, right? I have read somewhere that pairwise correlation doesn't always work, as it fails to detect correlation involving several independent variables. For example, say there are three independent variables x_1, x_2, and x_3. Although these variables don't share any linear relationship among themselves, there can be a relationship between x_1 and the sum of x_2 and x_3 (x_2 + x_3).

Hello Vishal,

Thank you for your note! The default method that Pandas uses is pairwise correlation, so that is correct. If you could find the reference for that it would be a good read! If I’m reading the context correctly, I see those being two separate analyses.

Using a model with the three separate independent variables (IV) x_1, x_2, and x_3 the regression model would look like:

y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + error

whereas summing the IVs x_2 and x_3 we would be reducing the number of IVs in the overall model making the model look like:

y = b_0 + b_1x_1 + b_5x_5 + error; where x_5 = x_2 + x_3

I would also question why one would add variables x_2 and x_3 together. If looking for an effect of varying levels of each of those variables, the relation would be better represented as an interaction effect, which is multiplicative in nature; the original variables would still need to be included in the model, making the model look like:

y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4(x_2*x_3) + error
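In statsmodels' formula syntax, that interaction model can be written compactly, since `x2*x3` in a patsy formula expands to both main effects plus the interaction term. A sketch on synthetic data (the data and effect sizes are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a genuine x2-by-x3 interaction (illustration only)
rng = np.random.default_rng(0)
d = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
d["y"] = 1 + d["x1"] + d["x2"] + d["x3"] + 0.5 * d["x2"] * d["x3"] + rng.normal(size=200)

# In patsy, x2*x3 expands to x2 + x3 + x2:x3
m = smf.ols("y ~ x1 + x2*x3", data=d).fit()
print(m.params.index.tolist())
# → ['Intercept', 'x1', 'x2', 'x3', 'x2:x3']
```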

Best,

Corey