Introduction to Multiple Linear Regression

Linear regression models are used to analyze the relationship between one or more independent variables (IVs) and a dependent variable (DV), a.k.a. the predicted variable. If only one predictor variable (IV) is used in the model, it is called a simple linear regression model. Such a model will not be robust and will have little explanatory power, because in the real world no single variable can fully explain, or predict, an outcome. Most commonly, the model will have multiple IVs, which makes it a multiple linear regression model. Let's take a look at the equation for multiple linear regression:

Y = β0 + β1X1 + β2X2 + … + βnXn + εi

Where:

  • Y = outcome variable (dependent variable)
  • β0 = constant (Y-intercept)
  • β1 and X1 = the coefficient and the observed value for variable 1, respectively
  • β2 and X2 = the coefficient and the observed value for variable 2, respectively
  • βn and Xn = the coefficient and the observed value for variable n, respectively
  • εi = error term (residual)

For every IV in the regression model, a βnXn term is added, where “n” denotes the number of the IV. Okay, that’s enough of a primer for now. Don’t worry if this doesn’t make much sense to you yet; the model will be filled in later with numbers from the example data, which will make the picture more cohesive.

Multiple linear regression has assumptions, just like all other parametric approaches. It is often not possible to test the assumptions before running the model. Therefore, model diagnostics should be run after fitting the model to test the following assumptions:

  • No multicollinearity among the predictors
  • The residual errors should be approximately normally distributed
  • Homoscedasticity (constant variance) of the residuals
  • Independence of errors

Often, when using real world data, there will be violations. If the residuals violate the assumptions of homoscedasticity and/or normality, then you can try transforming the data or fitting the model with heteroscedasticity-robust standard errors (discussed below). If no transformations or corrections are made to the data/model, then one will have difficulty generalizing the model to the population, i.e. the findings should be limited to the sample used.

Data used in this section

The data in this section is from Kaggle.com, from the user Miri Choi. A link to the original data set can be found here. This data set contains information on insurance premium charges, age, sex, BMI, number of children, smoker status, and region. The question being asked is: what are the most influential predictors of insurance premium charges? This makes the DV the insurance premium charge, and the IVs are the other variables in the data.

Let’s import the required libraries, load the data set, and take a look at the variables!

# These are for exploring the data
import pandas as pd
import researchpy as rp

# These are for running the model and conducting model diagnostics
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy import stats
from statsmodels.compat import lzip

# This is for plotting the model diagnostics (used for the probability plot below)
import matplotlib.pyplot as plt

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/insurance.csv")
    
df.head()

 

age sex bmi children smoker region charges
19 female 27.900 0 yes southwest 16884.92400
18 male 33.770 1 no southeast 1725.55230
28 male 33.000 3 no southeast 4449.46200
33 male 22.705 0 no northwest 21984.47061
32 male 28.880 0 no northwest 3866.85520


Now let’s get some more information about the variables. Researchpy has nice options to explore continuous and categorical data. Official documentation can be found here.

# Let's get more information on the continuous variables

rp.summary_cont(df[['charges','age', 'children']])


 

Variable N Mean SD SE 95% Conf. Interval
charges 1338.0 13270.422265 12110.011237 331.067454 12620.954034 13919.890496
age 1338.0 39.207025 14.049960 0.384102 38.453516 39.960534
children 1338.0 1.094918 1.205493 0.032956 1.030266 1.159569


There is a lot of variability in the variable charges, and a decent amount in age. Something to keep an eye on. Next, let’s get some descriptives on the categorical variables.

rp.summary_cat(df[['sex', 'smoker', 'region']])


Variable Outcome Count Percent
sex male 676 50.52
female 662 49.48
smoker no 1064 79.52
yes 274 20.48
region southeast 364 27.20
northwest 325 24.29
southwest 325 24.29
northeast 324 24.22

The sample is a near even split of males and females, the majority are non-smokers, and there is a very close breakdown by region. I could run the regression model using the variables as is; however, I want to run a correlation matrix to look for multicollinearity between variables. In order to do this, I have to create dummy variables from the categorical data. I will show two methods of how to do this.

The first is to use the .replace() method that is built into Pandas DataFrames.

df['sex'].replace({'female' : 1, 'male' : 0}, inplace= True)
df['smoker'].replace({'no': 0, 'yes': 1}, inplace= True)


This method uses a dictionary to match the current variable labels and replace them with a numeric representation. Females and smokers are indicated by a 1, while males and non-smokers are indicated by a 0. Since the region variable has multiple categories within it, an easy way to create multiple dummy variables is to use Pandas’s built-in .get_dummies() method.

df = pd.get_dummies(df)
df.head()


age sex bmi children smoker charges region_northeast region_northwest region_southeast region_southwest
19 1 27.900 0 1 16884.92400 0 0 0 1
18 0 33.770 1 0 1725.55230 0 0 1 0
28 0 33.000 3 0 4449.46200 0 0 1 0
33 0 22.705 0 0 21984.47061 0 1 0 0
32 0 28.880 0 0 3866.85520 0 1 0 0
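As a side note, pd.get_dummies() also accepts a drop_first= argument that drops the first level of each categorical variable automatically, which avoids having to leave a dummy column out by hand when writing the model formula. Here is a minimal sketch of that alternative applied to a fresh copy of the raw data; it is not the approach used here, since I want all four region columns available for the correlation matrix below.

# Hypothetical alternative: let get_dummies() drop one level per categorical
# variable (female, no, and northeast become the baseline categories)
raw = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/insurance.csv")
pd.get_dummies(raw, drop_first=True).head()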

Multiple Linear Regression Example

Now that the data is ready to go, I will fit a model using statsmodels’ formula method. Then diagnostics will be run on the model. When using dummy variables, it is important to leave one of the dummy variables out for each original categorical variable; otherwise the design matrix is perfectly collinear (the dummy variable trap). For example, since region had 4 possible categories, I will include 3 of those 4 in the model and the 4th category will still be captured in the model’s intercept.

model = smf.ols("charges ~ age + bmi + sex + smoker + children + region_northwest + region_southeast + region_southwest", data= df).fit()
    
model.summary()



 

OLS Regression Results
Dep. Variable: charges R-squared: 0.751
Model: OLS Adj. R-squared: 0.749
Method: Least Squares F-statistic: 500.8
Date: Wed, 21 Nov 2018 Prob (F-statistic): 0.00
Time: 13:33:36 Log-Likelihood: -13548.
No. Observations: 1338 AIC: 2.711e+04
Df Residuals: 1329 BIC: 2.716e+04
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1.207e+04 999.649 -12.074 0.000 -1.4e+04 -1.01e+04
age 256.8564 11.899 21.587 0.000 233.514 280.199
bmi 339.1935 28.599 11.860 0.000 283.088 395.298
sex 131.3144 332.945 0.394 0.693 -521.842 784.470
smoker 2.385e+04 413.153 57.723 0.000 2.3e+04 2.47e+04
children 475.5005 137.804 3.451 0.001 205.163 745.838
region_northwest -352.9639 476.276 -0.741 0.459 -1287.298 581.370
region_southeast -1035.0220 478.692 -2.162 0.031 -1974.097 -95.947
region_southwest -960.0510 477.933 -2.009 0.045 -1897.636 -22.466
Omnibus: 300.366 Durbin-Watson: 2.088
Prob(Omnibus): 0.000 Jarque-Bera (JB): 718.887
Skew: 1.211 Prob(JB): 7.86e-157
Kurtosis: 5.651 Cond. No. 315.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model explains a significant amount of variance, which is indicated by the F-statistic and its corresponding p-value, F(8, 1329)= 500.8, p< 0.01. Now to run diagnostics on the model. Statsmodels already provides some model diagnostics: the Omnibus test (a test of the normality of the residuals), the Durbin-Watson test (a test for autocorrelation in the residuals), the Jarque-Bera test (another test of normality), and the Condition Number (a measure of multicollinearity).
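If one wants a couple of these diagnostics as plain numbers rather than reading them off the summary table, here is a minimal sketch; I believe durbin_watson and jarque_bera are re-exported by statsmodels.stats.api (imported above as sms), otherwise they live in statsmodels.stats.stattools.

# Durbin-Watson statistic for the residuals (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", sms.durbin_watson(model.resid))

# Jarque-Bera test of normality; returns the statistic, p-value, skew, and kurtosis
jb_stat, jb_p, jb_skew, jb_kurtosis = sms.jarque_bera(model.resid)
print("Jarque-Bera:", jb_stat, "p-value:", jb_p)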

Diagnosing Multicollinearity

Now let’s check for multicollinearity. When checking, it doesn’t really matter whether the correlations are significant or not; what matters in this context is the strength of the correlations between the IVs.

df.corr()

 

age bmi children charges sex smoker region_northeast region_northwest region_southeast region_southwest
age 1.000000 0.109272 0.042469 0.299008 0.020856 -0.025019 0.002475 -0.000407 -0.011642 0.010016
bmi 0.109272 1.000000 0.012759 0.198341 -0.046371 0.003750 -0.138156 -0.135996 0.270025 -0.006205
children 0.042469 0.012759 1.000000 0.067998 -0.017163 0.007673 -0.022808 0.024806 -0.023066 0.021914
charges 0.299008 0.198341 0.067998 1.000000 -0.057292 0.787251 0.006349 -0.039905 0.073982 -0.043210
sex 0.020856 -0.046371 -0.017163 -0.057292 1.000000 -0.076185 0.002425 0.011156 -0.017117 0.004184
smoker -0.025019 0.003750 0.007673 0.787251 -0.076185 1.000000 0.002811 -0.036945 0.068498 -0.036945
region_northeast 0.002475 -0.138156 -0.022808 0.006349 0.002425 0.002811 1.000000 -0.320177 -0.345561 -0.320177
region_northwest -0.000407 -0.135996 0.024806 -0.039905 0.011156 -0.036945 -0.320177 1.000000 -0.346265 -0.320829
region_southeast -0.011642 0.270025 -0.023066 0.073982 -0.017117 0.068498 -0.345561 -0.346265 1.000000 -0.346265
region_southwest 0.010016 -0.006205 0.021914 -0.043210 0.004184 -0.036945 -0.320177 -0.320829 -0.346265 1.000000

 

There are no strong correlations between the IVs, meaning there is no need to worry about multicollinearity.
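A pairwise correlation matrix can miss collinearity that involves a combination of several IVs, so an additional (optional) check is to compute variance inflation factors (VIFs). Here is a minimal sketch using statsmodels’ variance_inflation_factor on the design matrix of the model fitted above; a common rule of thumb is that VIFs above roughly 5-10 signal problematic multicollinearity.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per term in the fitted design matrix (the Intercept entry can be ignored)
vifs = {name: variance_inflation_factor(model.model.exog, i)
        for i, name in enumerate(model.model.exog_names)}
print(vifs)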

Diagnosing Normality

For regression, the test of normality applies to the model’s residuals and not the variables themselves. This can be checked visually by plotting the residuals as a histogram and/or using a probability plot. I will show how to visually inspect the residuals using a probability plot. If you are unfamiliar with how to read a probability plot: the residuals, represented as dots (in blue), should fall along the red line.

stats.probplot(model.resid, plot= plt)
plt.title("Model1 Residuals Probability Plot")


[Figure: Model1 residuals probability plot]

This plot indicates that the model’s residuals are not normally distributed. Something to note is that visual representations of data are always somewhat subjective, but they are worth including because they show characteristics of the data that formal statistical tests cannot.
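The histogram mentioned above as the other visual check can be produced with matplotlib as well. A minimal sketch (the bin count of 50 is an arbitrary choice):

# Histogram of the model's residuals; a roughly symmetric, bell-shaped
# histogram would support the normality assumption
plt.hist(model.resid, bins= 50)
plt.title("Model1 Residuals Histogram")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()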

Common formal tests include the chi-square test, Kolmogorov-Smirnov test, and the Lilliefors test. I will demonstrate how to test for normality using the Kolmogorov-Smirnov test. Official documentation can be found here.

stats.kstest(model.resid, 'norm')
KstestResult(statistic=0.6263077931301486, pvalue=0.0)


The test is significant, which indicates that the model’s residuals are not normally distributed.
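The Lilliefors test mentioned above can be run in a similar fashion; unlike the plain Kolmogorov-Smirnov test against 'norm', it estimates the mean and variance from the data. A minimal sketch, assuming statsmodels’ lilliefors function from statsmodels.stats.diagnostic is available:

from statsmodels.stats.diagnostic import lilliefors

# Lilliefors (Kolmogorov-Smirnov with estimated parameters) test of normality
# on the model's residuals; a small p-value indicates non-normal residuals
lf_stat, lf_p = lilliefors(model.resid)
print("Lilliefors statistic:", lf_stat, "p-value:", lf_p)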

Diagnosing Homoscedasticity

The assumption of homoscedasticity (constant variance of the residuals) is vital for linear regression. If this assumption is violated, then the standard errors will be biased. The standard errors are used to conduct significance tests and to calculate confidence intervals.

This can be tested using a few different statistical tests; these include the Brown-Forsythe test, Levene’s test, the Breusch-Pagan test, and the Cook-Weisberg test. In this example, I will demonstrate how to conduct the Breusch-Pagan test, followed by Levene’s test. The test for homogeneity of variance (Levene’s test) needs to be conducted across the levels of each categorical variable.

name = ['Lagrange multiplier statistic', 'p-value', 
        'f-value', 'f p-value']
test = sms.het_breuschpagan(model.resid, model.model.exog)
lzip(name, test)
[('Lagrange multiplier statistic', 121.74360137568986),
('p-value', 5.8721533542513195e-22),
('f-value', 16.628612027375389),
('f p-value', 1.1456058246340301e-23)]

 
Now for how to conduct Levene’s test. Since I know the residuals are not normally distributed, I will conduct Levene’s test using the median instead of the mean. Official documentation can be found here.

sex_variance_results = stats.levene(df['charges'][df['sex'] == 0], 
                                    df['charges'][df['sex'] == 1], center= 'median')

smoker_variance_results = stats.levene(df['charges'][df['smoker'] == 0],
                                       df['charges'][df['smoker'] == 1], center= 'median')

region_variance_results = stats.levene(df['charges'][df['region_northeast'] == 1],
                                       df['charges'][df['region_northwest'] == 1],
                                       df['charges'][df['region_southeast'] == 1],
                                       df['charges'][df['region_southwest'] == 1], center= 'median')

print(f"Sex Variance: {sex_variance_results}", "\n",
      f"Smoker Variance: {smoker_variance_results}", "\n",
      f"Region Variance: {region_variance_results}", "\n")
Sex Variance: LeveneResult(statistic=9.90925122305512, pvalue=0.0016808765833903443)
Smoker Variance: LeveneResult(statistic=332.6135162726081, pvalue=1.5593284881803726e-66)
Region Variance: LeveneResult(statistic=5.559966758410606, pvalue=0.0008610590250786703)


Both the Breusch-Pagan test and Levene’s tests are significant, meaning the data violates the assumption of homoscedasticity, i.e. heteroscedasticity is present in the data. What to do? Either one can transform the variables to improve the model, or refit the model with standard errors that are robust to the heteroscedasticity.
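To illustrate the transformation route (it is not pursued further here; the robust approach below is used instead), here is a minimal sketch of refitting the model with a log-transformed DV. The formula interface lets you call np.log() directly on a variable, assuming numpy is imported as np; keep in mind the coefficients would then be on the log scale, and log_model is just an illustrative name.

import numpy as np

# Refit with a log-transformed outcome; this often reduces the right skew and
# heteroscedasticity that are common in cost data
log_model = smf.ols("np.log(charges) ~ age + bmi + sex + smoker + children + region_northwest + region_southeast + region_southwest", data= df).fit()
log_model.summary()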

In order to account for the heteroscedasticity in the data, one has to select a heteroscedasticity consistent covariance matrix (HCCM) and pass it to the “cov_type=” argument of the .fit() method. What is an HCCM? Here is a nice read if you are interested in learning more about this. There are a few HCCMs to choose from:

  • HC0, not good with sample sizes ≤ 250
  • HC1, not good with sample sizes ≤ 250
  • HC2, good with sample sizes ≤ 250
  • HC3, which outperforms HC0, HC1, and HC2 when the sample size is ≤ 250
  • There is little difference in performance between them when the sample size is ≥ 500

For the current model, using heteroscedasticity-robust standard errors will work. This will be demonstrated in the re-run of the model.

Model Re-run

The re-run of the model will use a heteroscedasticity consistent covariance matrix so that the model is protected from the violation of homogeneity of variance; the model, however, does not correct the violation of normality.

model3 = smf.ols("charges ~ age + bmi + sex + smoker + children + region_northwest + region_southeast + region_southwest", data= df).fit(cov_type='HC3')
    
model3.summary()


OLS Regression Results
Dep. Variable: charges R-squared: 0.751
Model: OLS Adj. R-squared: 0.749
Method: Least Squares F-statistic: 298.4
Date: Wed, 21 Nov 2018 Prob (F-statistic): 2.25e-290
Time: 15:54:47 Log-Likelihood: -13548.
No. Observations: 1338 AIC: 2.711e+04
Df Residuals: 1329 BIC: 2.716e+04
Df Model: 8
Covariance Type: HC3
coef std err z P>|z| [0.025 0.975]
Intercept -1.207e+04 1062.898 -11.356 0.000 -1.42e+04 -9986.611
age 256.8564 11.961 21.474 0.000 233.412 280.300
bmi 339.1935 31.879 10.640 0.000 276.711 401.676
sex 131.3144 334.971 0.392 0.695 -525.217 787.846
smoker 2.385e+04 578.079 41.255 0.000 2.27e+04 2.5e+04
children 475.5005 131.009 3.630 0.000 218.727 732.274
region_northwest -352.9639 486.616 -0.725 0.468 -1306.714 600.786
region_southeast -1035.0220 503.426 -2.056 0.040 -2021.718 -48.326
region_southwest -960.0510 463.014 -2.073 0.038 -1867.541 -52.561
Omnibus: 300.366 Durbin-Watson: 2.088
Prob(Omnibus): 0.000 Jarque-Bera (JB): 718.887
Skew: 1.211 Prob(JB): 7.86e-157
Kurtosis: 5.651 Cond. No. 315.

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

Linear Regression Interpretation

First one has to check whether the model as a whole is significant. This information is found in “Prob (F-statistic)”; the current model is significant. Now one can look at the effect of each IV on the DV (the IV coefficients) and see whether it is a significant predictor or not (P>|z|). As usual, one wants a p-value < 0.05. All the IVs are significant predictors of insurance premium charges except for sex, indicating that sex doesn't matter in this context. The coefficient (coef) can be interpreted as the effect of a one unit change in the IV on the DV. Meaning, for every 1 unit increase in the IV, the DV will increase or decrease by the coefficient amount. In the example above, for every year increase in age, there will be a $256.86 increase in the insurance premium charge.

Let's review the linear regression equation to see how this works as a predictive/explanatory model. It is predictive when used to predict future outcomes, and explanatory when used to explain the influence of each IV.

Y = β0 + β1X1 + β2X2 + … + βnXn + εi

Where:

  • Y = outcome variable (dependent variable)
  • β0 = constant (Y-intercept)
  • β1 and X1 = the coefficient and the observed value for variable 1, respectively
  • β2 and X2 = the coefficient and the observed value for variable 2, respectively
  • βn and Xn = the coefficient and the observed value for variable n, respectively
  • εi = error term (residual)

Using the current model, one can write the formula as:
Charges = -12,070 + 256.86(age) + 339.19(bmi) + 131.31(sex_female) + 23,850(smoker_yes) + 475.50(number of children) - 352.96(region_northwest) - 1,035.02(region_southeast) - 960.05(region_southwest)
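To see the model used predictively, here is a minimal sketch that passes a hypothetical new observation (made-up values, chosen purely for illustration) to the fitted model's .predict() method:

# Hypothetical new policy holder: a 35 year old female non-smoker with a BMI of
# 30, two children, living in the southwest region
new_obs = pd.DataFrame({'age': [35], 'bmi': [30.0], 'sex': [1], 'smoker': [0],
                        'children': [2], 'region_northwest': [0],
                        'region_southeast': [0], 'region_southwest': [1]})

model3.predict(new_obs)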

Now to write up the model’s performance.

Model write up

Multiple regression analysis was used to test whether age, BMI, sex (female), smoking status, number of children, and region significantly predicted the cost of insurance premiums. The results of the regression indicated the eight predictors explained 75.1% of the variance (R2= 0.75, F(8, 1329)= 298.4, p< 0.01). The predicted insurance premium charge is equal to -12,070 + 256.86(age) + 339.19(bmi) + 131.31(sex) + 23,850(smoker) + 475.50(number of children) - 352.96(region_northwest) - 1,035.02(region_southeast) - 960.05(region_southwest), where sex is coded as 1= female, 0= male; smoker is coded as 1= smoker, 0= non-smoker; and region_[location] is coded as 1= in region, 0= not in region. All of the independent variables used in the model were significant predictors of insurance premium charge, except for sex.


2 comments

  1. For checking multicollinearity, you are using pairwise correlation, right? I have read somewhere that pairwise correlation doesn’t always work, as it can fail to detect a correlation between several independent variables. For example, suppose there are three independent variables x_1, x_2, and x_3. Although these variables don’t share any linear relationship among themselves, there can still be a relationship between x_1 and the sum of x_2 and x_3 (x_2 + x_3).

    1. Hello Vishal,

      Thank you for your note! The default method that Pandas uses is pairwise correlation, so that is correct. If you could find the reference for that, it would be a good read! If I’m reading the context correctly, I see those as two separate analyses.

      Using a model with the three separate independent variables (IV) x_1, x_2, and x_3 the regression model would look like:

      y = b_0 + x_1 + x_2 + x_3 + error

      whereas summing the IVs x_2 and x_3 we would be reducing the number of IVs in the overall model making the model look like:

      y = b_0 + x_1 + x_5 + error; where x_5 = x_2 + x_3

      I would also question why add variables x_2 and x_3 together? If looking for an effect of varying levels of each of those variables, the relation would better be represented as an interaction effect and would be multiplicative in nature, and the original variables would still also need to be included in the model making the model look like:

      y = b_0 + x_2*x_3 + x_2 + x_3 + x_1 + error

      Best,

      Corey
