
## Introduction to Multiple Linear Regression

Linear regression models are used to analyze the relationship between one or more independent variables (IVs) and a dependent variable (DV), a.k.a. the predicted variable. If only one predictor variable (IV) is used, the model is called a simple linear regression model. Such a model, however, tends not to be robust and will have little to no explanatory power, because in the real world there is no single variable that can fully explain, or predict, an outcome. Most commonly, the model will include multiple IVs, which makes it a multiple linear regression model.

Assumptions for multiple linear regression:

- The DV has to be measured on a continuous level.
- There is a linear relationship between the DV and each of the IVs.
  - There needs to be a theoretical reason to expect that as one increases, the other increases or decreases.
- The data must not have multicollinearity.
  - This is a large concern when conducting linear regressions. One way to check for multicollinearity is to run a correlation matrix on the data or to check the variance inflation factors (VIFs). If there are strong correlations between the IVs, or a high VIF, one could drop one of the variables, or conduct a Shapley value regression or ridge regression, which take the highly correlated variables into account.
- The residual errors should be approximately normally distributed.
  - One can test this by using the Kolmogorov-Smirnov test, Jarque–Bera test, or the Shapiro-Wilk test for normality on the model’s residuals, and/or by looking at a Q-Q plot of the model’s residuals.
  - One thing to keep in mind is that as the sample size increases, the likelihood of a significant test result, and thus of nominally violating this assumption, increases.
- Homoscedasticity should be present.
  - We can check this by plotting the residuals against the fitted values and checking for a cone shape, or by using Levene’s test for homogeneity of variance, the Breusch-Pagan test, and/or the NCV test.
- Independence of errors.
  - This implies a non-repeated-measures design, i.e. each participant contributes one data point for each variable in the model.

Oftentimes when using real-world data, there will be violations. If the residuals violate the assumptions of homoscedasticity and/or normality, then you can try transforming the data or using a robust regression model (discussed below). If no transformations or corrections are made to the data/model, then one will have difficulties generalizing the model to the population, i.e. the findings should be limited to the sample used.

##### Data used in this section

Data in this section is from Kaggle.com, uploaded by the user Miri Choi. Link to the original data set can be found here. This data set contains information on insurance premium charges, age, sex, BMI, number of children, smoker status, and region. The question being asked is: what are the most influential predictors of insurance premium charges? This makes the DV the insurance premium charge, and the IVs the other variables in the data.

Let’s import Pandas as pd, load the data set, and take a look at the variables!

```python
import pandas as pd

df = pd.read_csv("insurance.csv")
df.describe()
```

|  | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |

As you can see, by default the *.describe()* method ignored the categorical variables. In order to see these variables, one has to call the method on them separately. Pandas did this because the data set contains a mix of continuous and categorical variables, and the information provided by the *.describe()* method is different for continuous and categorical data.

```python
df[['sex', 'smoker']].describe()
```

|  | sex | smoker |
|---|---|---|
| count | 1338 | 1338 |
| unique | 2 | 2 |
| top | male | no |
| freq | 676 | 1064 |

The “freq” row is the count of the most frequently occurring category, i.e. “male” occurred 676 times and “no” occurred 1,064 times in the data set.
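To see concretely how “top” and “freq” relate to the category counts, here is a toy sketch (the tiny Series below is made up for illustration, not the insurance data); “freq” always matches the count of the most common category from *.value_counts()*:

```python
import pandas as pd

# Toy categorical column (hypothetical data, not the insurance set)
s = pd.Series(["male", "female", "male"], name="sex")

desc = s.describe()  # for object dtype: count, unique, top, freq
print(desc["top"], desc["freq"])  # -> male 2

# 'freq' is the count of the most common category
assert desc["freq"] == s.value_counts().iloc[0]
```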

## Multiple Linear Regression Example

Before a multiple linear regression analysis can be conducted, one needs to import the required libraries, recode the categorical data to binary form, and check the assumptions!

```python
import numpy as np  # used later for the log transformation
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy import stats
from statsmodels.compat import lzip
import matplotlib.pyplot as plt
```

Now to recode the categorical data to binary form so that it can be included in the model. This means coding one category as 1 and the other as 0. If there are multiple categories in the variable, then one needs to create a dummy variable for each category. Always code the category outcome of interest as 1.

This is easy work using the “pd.get_dummies()” method. It automatically creates dummy variables and recodes them to binary form of 1 and 0. A 1 indicates the membership in that category and a 0 indicates non-membership.

```python
df = pd.get_dummies(df)

# Dropping the dummy variables that are not needed
df.drop(['sex_male', 'smoker_no'], axis=1, inplace=True)
df.describe()
```

|  | age | bmi | children | charges | sex_female | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 | 0.494768 | 0.204783 | 0.242152 | 0.242900 | 0.272048 | 0.242900 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 | 0.500160 | 0.403694 | 0.428546 | 0.428995 | 0.445181 | 0.428995 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

#### Assumption of Multicollinearity

Now let’s check for multicollinearity. When checking, it doesn’t really matter if the correlations are significant or not since that is not the purpose of running a correlation analysis in this context.

```python
df.corr()
```

|  | age | bmi | children | charges | sex_female | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | 0.109272 | 0.042469 | 0.299008 | 0.020856 | -0.025019 | 0.002475 | -0.000407 | -0.011642 | 0.010016 |
| bmi | 0.109272 | 1.000000 | 0.012759 | 0.198341 | -0.046371 | 0.003750 | -0.138156 | -0.135996 | 0.270025 | -0.006205 |
| children | 0.042469 | 0.012759 | 1.000000 | 0.067998 | -0.017163 | 0.007673 | -0.022808 | 0.024806 | -0.023066 | 0.021914 |
| charges | 0.299008 | 0.198341 | 0.067998 | 1.000000 | -0.057292 | 0.787251 | 0.006349 | -0.039905 | 0.073982 | -0.043210 |
| sex_female | 0.020856 | -0.046371 | -0.017163 | -0.057292 | 1.000000 | -0.076185 | 0.002425 | 0.011156 | -0.017117 | 0.004184 |
| smoker_yes | -0.025019 | 0.003750 | 0.007673 | 0.787251 | -0.076185 | 1.000000 | 0.002811 | -0.036945 | 0.068498 | -0.036945 |
| region_northeast | 0.002475 | -0.138156 | -0.022808 | 0.006349 | 0.002425 | 0.002811 | 1.000000 | -0.320177 | -0.345561 | -0.320177 |
| region_northwest | -0.000407 | -0.135996 | 0.024806 | -0.039905 | 0.011156 | -0.036945 | -0.320177 | 1.000000 | -0.346265 | -0.320829 |
| region_southeast | -0.011642 | 0.270025 | -0.023066 | 0.073982 | -0.017117 | 0.068498 | -0.345561 | -0.346265 | 1.000000 | -0.346265 |
| region_southwest | 0.010016 | -0.006205 | 0.021914 | -0.043210 | 0.004184 | -0.036945 | -0.320177 | -0.320829 | -0.346265 | 1.000000 |

There are no strong correlations between the IVs meaning there is no need to worry about multicollinearity.

#### Assumption of Normality of the Residuals

It is often misunderstood what is actually being tested for normality. For linear regression, it is the residuals, not the variables themselves. This can be tested with the Kolmogorov-Smirnov test, Jarque–Bera test, or the Shapiro-Wilk test for normality on the model’s residuals. For this example, the Jarque-Bera test will be used, which is the *sms.jarque_bera()* method.

Again, this test is to be conducted on the model’s residuals. The model is developed in its appropriate section (below) and is referenced here.

```python
name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
test = sms.jarque_bera(model.resid)
lzip(name, test)
```

```
[('Jarque-Bera', …),
 ('Chi^2 two-tail prob.', 7.8634686618215363e-157),
 ('Skew', 1.211211005167187),
 ('Kurtosis', 5.650794298374027)]
```

The test is significant, meaning the data violates the assumption of normality of the residuals. Given the large sample in this data (N = 1,338), that is likely to occur: the larger the sample size, the more likely the data will violate this assumption. There are a couple more steps to take in order to decide what action is required: explore the distribution shape of the DV and the IVs, and their errors, to see if the distribution fits a better model, and/or transform the data in order to make the variables and the residuals normal.

Let’s start with looking at the distribution of the DV (charges). Before looking at the graph, think about the DV in the model. How is it expected to be distributed? In the case of insurance charges billed to the individual, it would be reasonable to expect that most of the billing costs are low, concentrating the distribution at the lower (left) end with a long right tail, i.e. a positive skew. Which is exactly what occurs in the data. Let’s take a look!
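The original post shows a histogram of the charges here. A sketch that reproduces the plot is below; it uses synthetic right-skewed data as a stand-in for the real column (on the actual data one would simply call `df['charges'].plot(kind='hist')`):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Synthetic stand-in for df['charges']: lognormal data is right-skewed,
# much like billing/cost variables
rng = np.random.default_rng(1)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1338))

ax = charges.plot(kind="hist", bins=30, title="Insurance Charges Distribution")
ax.set_xlabel("charges")
plt.savefig("charges_hist.png")

print(charges.skew() > 0)  # True: long right tail => positive skew
```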

The data follows more of a skewed normal distribution. A common way to correct this would be to take the log transformation of the DV and use it in the model. The caveat to this is that interpretation of the results can get tricky. Another regression model, such as quantile regression or a nonparametric linear regression, may also be useful given the distribution.

Let’s take a look at how taking the log transformation of the DV changes the distribution of the variable. Keep in mind, we truly need to test the distribution of the residuals. This is to provide a visual of what occurs.

```python
df['charges_log'] = np.log(df['charges'])
df['charges_log'].plot(kind='hist', title='Log of Insurance Charges Distribution')
```

You can see how it transforms the data into more of a normal distribution which would likely change the residuals as well. One should test this for each variable’s residuals to find the culprit and decide what to do. The best way to graphically check for normality is to use a Q-Q plot. A Q-Q plot of the model’s residuals is below. The data points (blue) should be on the red line. If not, then it indicates non-normality.

```python
# Running plot & giving it a title
stats.probplot(model.resid, dist="norm", plot=plt)
plt.title("Model1 Residuals Q-Q Plot")

# Saving plot as a png
plt.savefig("Model1_Resid_qqplot.png")
```

Given the sample size of the data and the Central Limit Theorem, we are protected from violating this assumption and our findings will still be valid. For a brief touch on this if interested, see here. Although the results are protected, depending on the field of the study, you may have to make adjustments to keep with your field’s standard of practice.

#### Assumption of Homoscedasticity

The assumption of homoscedasticity is a vital assumption for linear regression. If this assumption is violated, then the standard errors will be biased. The standard errors are used to conduct significance tests, and calculate the confidence intervals.

This can be tested using a residual vs. fitted values plot, looking at a scatter plot (if a cone shape is present then heteroscedasticity is present), or by using a statistical test such as the Breusch-Pagan test, the Cook-Weisberg test, or White’s general test. In this example, the Breusch-Pagan test will be used.

```python
name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
test = sms.het_breuschpagan(model.resid, model.model.exog)
lzip(name, test)
```

```
[('Lagrange multiplier statistic', …),
 ('p-value', 5.8721533542513195e-22),
 ('f-value', 16.628612027375389),
 ('f p-value', 1.1456058246340301e-23)]
```

The test is significant, meaning the data violates the assumption of homoscedasticity, i.e. heteroscedasticity is present in the data. What to do? One can either transform the variables to improve the model, or use a robust regression method that accounts for the heteroscedasticity.

In order to account for the heteroscedasticity in the data, one has to select a heteroscedasticity-consistent covariance matrix (HCCM) and pass it in the “cov_type=” argument as part of the *.fit()* method. What is an HCCM? Here is a nice read if interested more on this. There are a few HCCMs to choose from:

- HC0, not good on sample size ≤ 250
- HC1, not good on sample size ≤ 250
- HC2, good on sample size ≤ 250
- HC3, which outperforms HC0, HC1, and HC2 when sample size ≤ 250
- There is little difference in performance between them when the sample size is ≥ 500

For the current model, using a robust regression technique will work. This will be demonstrated in the running of the model.

### Linear Regression Model

Now that all the assumptions have been checked, it is time to run the model. The first model to be run will be the original model used to test the assumptions. Note that when using dummy variables, for each original categorical variable one needs to drop one dummy variable, i.e. keep k−1 of the k dummies.

```python
model = smf.ols("charges ~ age + bmi + sex_female + smoker_yes + children"
                " + region_northwest + region_southeast + region_southwest",
                data=df).fit()
model.summary()
```

Remember, the model above is not taking into account the presence of heteroscedasticity. This is noted in “Covariance Type: nonrobust”. Let’s see the results of the model that accounts for this. In order to do this, pass the desired HCCM into the “cov_type=” argument within the *.fit()* method.

```python
model3 = smf.ols("charges ~ age + bmi + sex_female + smoker_yes + children"
                 " + region_northwest + region_southeast + region_southwest",
                 data=df).fit(cov_type='HC3')
model3.summary()
```

There are differences in the models and their results. Accounting for the heteroscedasticity has altered the F-statistic, the standard errors, etc. The second model, model3, is the better model.
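To see concretely what changes, note that the coefficients themselves are identical between the two fits; only the covariance matrix, and therefore the standard errors and test statistics, differ. A self-contained sketch on synthetic heteroscedastic data (the same side-by-side comparison could be done with `model` and `model3` above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data whose error variance grows with x (heteroscedasticity)
rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1 + x)  # noise scale increases with x

d = pd.DataFrame({"x": x, "y": y})
nonrobust = smf.ols("y ~ x", data=d).fit()
robust = smf.ols("y ~ x", data=d).fit(cov_type="HC3")

print(pd.DataFrame({
    "coef": nonrobust.params,       # identical in both fits
    "se_nonrobust": nonrobust.bse,  # biased under heteroscedasticity
    "se_HC3": robust.bse,           # heteroscedasticity-consistent
}))
```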

As you can see, the statistical tests used to test the assumptions are provided at the bottom of the model’s summary. You don’t have to test for the assumptions of independent errors (Durbin-Watson), normality of residuals (Jarque-Bera), or the presence of homoscedasticity (Breusch-Pagan) before running the model, because they are provided with the model. I wanted you to see them beforehand with a bit more explanation of each assumption. One does, however, have to check for the presence of multicollinearity before running the model.

## Linear Regression Interpretation

First, one has to look to see if the model is significant. This information is found in “Prob (F-statistic)=”; the current model is significant. Now one can look at the effects of each IV on the DV (the IV coefficients) and see whether each is a significant predictor (P>|z|). As usual, one wants a p-value < 0.05. All the IVs are significant predictors of insurance premium charges except for sex_female, indicating that sex doesn't matter in this context.
The coefficient (coef) can be interpreted as the effect of a unit change in terms of the DV. Meaning, for every 1-unit increase in the IV, the DV will increase or decrease by the coefficient amount. In the example above, for every year increase in age, there will be a $256.86 increase in the insurance premium charge. Let's review the linear regression equation to see how this works as a predictive/explanatory model. It is predictive when used to predict future outcomes, and explanatory when used to explain the influence of each IV.
`y = b_0 + m_1*X_1 + m_2*X_2 + … + m_n*X_n`

Where:

- **b_0** is the intercept of the model
- **m_n** is the coefficient (m) for the respective variable (X_n)

Using the current model, one can write the formula as:

`Charges = -12,070 + 256.86(age) + 339.19(bmi) + 131.31(sex_female) + 23,850(smoker_yes) + 474.50(number of children) - 352.96(region_northwest) - 1,035.02(region_southeast) - 960.05(region_southwest)`
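As a quick arithmetic check, the equation can be evaluated for a hypothetical individual (the person below is made up purely for illustration; the coefficients are the rounded ones from the summary above):

```python
# Coefficients from the fitted model, rounded as in the text
coefs = {
    "age": 256.86, "bmi": 339.19, "sex_female": 131.31, "smoker_yes": 23850,
    "children": 474.50, "region_northwest": -352.96,
    "region_southeast": -1035.02, "region_southwest": -960.05,
}
intercept = -12070

# Hypothetical person: 40-year-old non-smoking woman, BMI 30,
# two children, living in the southeast
person = {
    "age": 40, "bmi": 30, "sex_female": 1, "smoker_yes": 0,
    "children": 2, "region_northwest": 0, "region_southeast": 1,
    "region_southwest": 0,
}

charge = intercept + sum(coefs[k] * person[k] for k in coefs)
print(round(charge, 2))  # -> 8425.39
```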

Now to write up the model’s performance.

#### Model write up

Multiple regression analysis was used to test if age, BMI, sex (female), smoking status, number of children, and region significantly predicted the cost of insurance premiums. The results of the regression indicated the eight predictors explained 75.1% of the variance (R^{2}= 0.75, F(8,1329)= 298.4, p< 0.01). The predicted insurance premium charge is equal to -12,070 + 256.86(age) + 339.19(bmi) + 131.31(sex_female) + 23,850(smoker_yes) + 474.50(number of children) - 352.96(region_northwest) - 1,035.02(region_southeast) - 960.05(region_southwest), where sex_female is coded as 1= female, 0= male; smoker_yes is coded as 1= smoker, 0= non-smoker; region_[location] is coded as 1= in region, 0= not in region. All of the independent variables used in the model were significant predictors of insurance premium charge, except for sex.

For checking multicollinearity, you are using pairwise correlation, right? I have read somewhere that pairwise correlation doesn’t always work, as it fails to detect correlation involving several independent variables. For example, take three independent variables x_1, x_2, and x_3. Although these variables don’t share any linear relationship among themselves, there can be a relationship between x_1 and the sum of x_2 and x_3 (x_2 + x_3).

Hello Vishal,

Thank you for your note! The default method that Pandas uses is pairwise correlation, so that is correct. If you could find the reference for that, it would be a good read! If I’m reading the context correctly, I see those as two separate analyses.

Using a model with the three separate independent variables (IV) x_1, x_2, and x_3 the regression model would look like:

y = b_0 + x_1 + x_2 + x_3 + error

whereas by summing the IVs x_2 and x_3 we would reduce the number of IVs in the overall model, making the model look like:

y = b_0 + x_1 + x_5 + error; where x_5 = x_2 + x_3

I would also question why add variables x_2 and x_3 together? If looking for an effect of varying levels of each of those variables, the relation would better be represented as an interaction effect and would be multiplicative in nature, and the original variables would still also need to be included in the model making the model look like:

y = b_0 + x_2*x_3 + x_2 + x_3 + x_1 + error

Best,

Corey