Introduction to Logistic Regression

Logistic regression models are used to analyze the relationship between a dependent variable (DV) and independent variable(s) (IV) when the DV is dichotomous. The DV is the outcome variable, a.k.a. the predicted variable, and the IV(s) are the variables that are believed to have an influence on the outcome, a.k.a. predictor variables. If the model contains 1 IV, then it is a simple logistic regression model, and if the model contains 2+ IVs, then it is a multiple logistic regression model.
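For reference, logistic regression models the log odds of the outcome as a linear combination of the predictors. The standard form of the model is:

    log(p / (1 - p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk

where p is the probability of the outcome occurring and the b's are the coefficients estimated by the model.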

Assumptions for logistic regression models:

  • The DV is categorical (binary)
    • If the outcome has more than 2 categories, a multinomial logistic regression should be used
  • Independence of observations
    • Cannot be a repeated measures design, e.g. collecting outcomes from the same participants at two different time points
  • The continuous independent variables are linearly related to the log odds of the outcome
  • Absence of multicollinearity
  • Lack of outliers

Data Used in This Example

The data used in this example is the data set used in UCLA’s Logistic Regression for Stata example. The question being asked is: how do GRE score, GPA, and prestige of the undergraduate institution affect admission into graduate school? The DV is admission status (binary), and the IVs are GRE score, GPA, and undergraduate prestige.

Let’s import pandas and numpy, load the data set, and take a look at the variables! The data can be loaded either directly from UCLA with the code below, or from our GitHub; both sources are shown.

import pandas as pd
import numpy as np

## Loading data directly from UCLA
df = pd.read_stata("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta")

## Loading data from our GitHub
df2 = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/admission.csv")

## Taking a look at the data
df.describe()

 

             admit         gre         gpa        rank
count   400.000000  400.000000  400.000000  400.000000
mean      0.317500  587.700012    3.389901    2.485000
std       0.466087  115.516663    0.380567    0.944462
min       0.000000  220.000000    2.260000    1.000000
25%       0.000000  520.000000    3.130000    2.000000
50%       0.000000  580.000000    3.395000    2.000000
75%       1.000000  660.000000    3.670000    3.000000
max       1.000000  800.000000    4.000000    4.000000

 

Looking at the descriptive statistics, about 32% of the applicants were admitted to a graduate program, the average GRE score is 587 with a large standard deviation, the average GPA is 3.39, and the average undergraduate school prestige rank is 2.49.

One thing to keep in mind: the variable “rank” is truly categorical and will need to be converted to this data type to accurately be included in the model. This will be expressed in the formula when it is run. For assumption checking purposes, this variable will also need to be dummy coded, which is simple with the pd.get_dummies() function. It creates a new indicator variable for each category of the original variable, coded 1 for membership in that category and 0 for non-membership. The original “rank” column is kept as well, since it will be passed to the model formula later.

## Converting the variable to a categorical data type (since that is what it is)
df['rank'] = df['rank'].astype('category')

## Creating dummy variables for assumption checking, while keeping the original
## "rank" column so it can be used with C(rank) in the model formula later
df = df.join(pd.get_dummies(df['rank'], prefix= 'rank'))

 

Multiple Logistic Regression Example

First, the assumptions of the model need to be checked; the first two assumptions (a binary DV and independence of observations) are met. Let’s import the library needed to run the logistic regression, the graphing libraries, and then check the rest of the assumptions!

# Needed to run the logistic regression
import statsmodels.formula.api as smf

# For plotting/checking assumptions
import seaborn as sns
import matplotlib.pyplot as plt

 

Assumption of Continuous IVs being Linearly Related to the Log Odds

Logistic regression does not require the continuous IV(s) to be linearly related to the DV. It does require the continuous IV(s) to be linearly related to the log odds of the DV though. A way to test this is to plot the IV(s) in question and look for an S-shaped curve. Sometimes the S-shape will not be obvious; the plot should have a flat or flat-ish top and bottom with an increasing or decreasing middle.

This can be done with the following code using the Seaborn statistical plotting library for Python.

gre = sns.regplot(x= 'gre', y= 'admit', data= df, logistic= True).set_title("GRE Log Odds Linear Plot")
gre.figure.savefig("gre log lin.png")

## Start a new figure so the GPA plot is not drawn over the GRE plot
plt.figure()
gpa = sns.regplot(x= 'gpa', y= 'admit', data= df, logistic= True).set_title("GPA Log Odds Linear Plot")
gpa.figure.savefig("gpa log lin.png")

 

It may be hard to see, but the data does show somewhat of a curve resembling the required S-shape. If an S-shaped curve is not present (sometimes a U-shape appears instead), how to handle that variable needs to be considered.

Assumption of Absence of Multicollinearity

An easy way to test this is to use a correlation matrix and look for any highly correlated variables, and/or to look for high Variance Inflation Factor (VIF) scores. If there are variables that are highly correlated, or that have a high VIF, a corrective action would be to drop one of them, since they are measuring the same or similar things. The correlation matrix is produced below, and a VIF check is sketched after it.

## Correlation matrix of the numeric columns (the categorical "rank" column is excluded)
df.corr(numeric_only= True)

 

              admit       gre       gpa  rank_1.0  rank_2.0  rank_3.0  rank_4.0
admit      1.000000  0.184434  0.178212  0.203651  0.067109 -0.121800 -0.133356
gre        0.184434  1.000000  0.384266  0.088622  0.056202 -0.073200 -0.068235
gpa        0.178212  0.384266  1.000000  0.070550 -0.057867  0.074490 -0.084428
rank_1.0   0.203651  0.088622  0.070550  1.000000 -0.330334 -0.279354 -0.190274
rank_2.0   0.067109  0.056202 -0.057867 -0.330334  1.000000 -0.512837 -0.349304
rank_3.0  -0.121800 -0.073200  0.074490 -0.279354 -0.512837  1.000000 -0.295397
rank_4.0  -0.133356 -0.068235 -0.084428 -0.190274 -0.349304 -0.295397  1.000000

 

The only independent variables that have a moderate correlation with each other are rank_2.0 and rank_3.0. Given that these variables are dummy codes from the original “rank” variable, some negative correlation between them is expected, and there is no concern about multicollinearity.
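
For a more formal check than eyeballing the correlation matrix, VIF scores can be computed with statsmodels. Below is a minimal sketch restricted to the continuous predictors; the design-matrix setup here is illustrative and not part of the original example.

## Computing VIF scores for the continuous IVs
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

## Design matrix: continuous predictors plus an intercept column
X = sm.add_constant(df[['gre', 'gpa']])

## VIF for each column of the design matrix
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index= X.columns)
print(vif)

As a rough rule of thumb, VIF values above about 5 to 10 are taken as a sign of problematic multicollinearity; given the modest correlation between gre and gpa (0.384), the VIFs here should come out close to 1.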

Assumption of Lack of Outliers

The assumption of lack of outliers is an easy one to check. One can get a feel for this with the descriptive statistics provided by the .describe() method, but the easiest way to check for outliers is to use a box plot.

Due to the drastic difference between the scale used to measure GRE and the scales of GPA and rank, two separate box plot charts will be produced.

## Start a new figure for the box plots; rank is cast back to numeric for plotting
plt.figure()
gpa_rank_box = sns.boxplot(data= df[['gpa', 'rank']].astype(float)).set_title("GPA and Rank Box Plot")
gpa_rank_box.figure.savefig("GPA and Rank Box Plot.png")

plt.figure()
gre_box = sns.boxplot(y= 'gre', data= df).set_title("GRE Box Plot")
gre_box.figure.savefig("GRE Box Plot.png")

 

There appear to be 2 values that could be considered outliers for the GRE variable, and 1 value for the GPA variable. In both cases, the values in question are not far from the rest of the values for their respective variables, so they can be kept and used in the analysis.
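
As a quick numeric complement to the box plots, the conventional 1.5 x IQR whisker rule can be applied directly. The sketch below flags gre values outside those bounds; the cutoff is just the standard box plot convention, not something from the original example.

## Flagging gre values outside the 1.5*IQR box plot whiskers
q1, q3 = df['gre'].quantile([0.25, 0.75])
iqr = q3 - q1
gre_outliers = df[(df['gre'] < q1 - 1.5 * iqr) | (df['gre'] > q3 + 1.5 * iqr)]
print(gre_outliers[['gre', 'gpa', 'admit']])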

Logistic Regression Model

The assumptions have been checked, and the data is good to run.

From here, it’s a straightforward matter of plugging the desired model into the formula. For the “rank” variable, one can either use the dummy variables created for the multicollinearity check or use the “C(variable_of_interest)” notation. If using the dummy variables, be sure not to include one of the groups, to avoid the dummy variable trap. The group dropped is then considered the reference group for the other dummy variables that came from the same original variable. Most commonly the highest ranked group is dropped, or sometimes the lowest ranked group; it all depends on the hypothesis. Using the “C(variable_of_interest)” notation does this automatically, dropping the first category (here Rank 1) as the reference; a sketch of the dummy-variable alternative is shown below.
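
For illustration, the dummy-variable version of the same model could be written as sketched below, dropping rank_1.0 so that Rank 1 serves as the reference group. Patsy’s Q() is used to quote the generated column names, which contain a dot; this is an alternative specification, not the model fit in the rest of this example.

## Equivalent model specified with the dummy variables (rank 1 is the reference group)
model_dummies = smf.logit(
    formula="admit ~ gre + gpa + Q('rank_2.0') + Q('rank_3.0') + Q('rank_4.0')",
    data= df).fit()
model_dummies.summary()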

To recap, the analysis is looking into the effects that GRE score, GPA, and undergraduate university prestige have on admission into the program. Using the formula api, the general structure is as follows: smf.logit(formula="DV ~ IV1 + IV2 + IVn", data= your_data_frame).

## Fitting the multiple logistic regression model; C(rank) treats rank as categorical
model = smf.logit(formula= "admit ~ gre + gpa + C(rank)", data= df).fit()
model.summary()

 

[Model summary output: logistic regression coefficients, standard errors, z-scores, p-values, and 95% confidence intervals for gre, gpa, and the rank terms]

Interpreting Logistic Regression

The model summary shows the coefficients, standard errors, the associated z-scores, and the 95% confidence intervals. We interpret the results as follows: the overall model is significant, as indicated by an LLR p-value < 0.05 (7.578e-08), which allows us to look at the rest of the results. All the IVs have a significant effect on the log odds of being admitted, as indicated by their p-values < 0.05.
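
These quantities can also be pulled from the fitted results object directly; the attributes below are standard statsmodels results attributes.

## Overall model significance and the individual coefficient p-values
print(model.llr_pvalue)   ## likelihood-ratio test p-value for the model as a whole
print(model.pvalues)      ## p-values for the individual coefficients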

Numeric Variables

Interpreting continuous variables is not very different from interpreting them in a linear regression model. For every one unit increase in gre score, the log odds of admission increase by 0.0023; for every one unit increase in gpa, the log odds of admission increase by 0.8040.

Categorical Variables

The categorical variables have a different interpretation. Since Rank 1 was dropped from the analysis, it is the comparison group and plays an important role in interpreting the other categories. For example, if an applicant attended a Rank 2 University compared to a Rank 1 University, the log odds of admission decrease by 0.6754; if an applicant attended a Rank 3 University compared to a Rank 1 University, the log odds of admission decrease by 1.3402.

Taking Logistic Regression a Step Further

Interpreting the log odds is not very straightforward when thinking about their effects. An easier way to interpret the findings is by converting the coefficients of the logistic regression model into odds ratios. This is done by exponentiating the coefficient values, i.e. OR = exp(coefficient).

# Getting the odds ratios, p-values, and 95% CI
model_odds = pd.DataFrame({'OR': np.exp(model.params)})
model_odds['p-value'] = model.pvalues
model_odds[['2.5%', '97.5%']] = np.exp(model.conf_int())
model_odds

 

                      OR   p-value      2.5%     97.5%
Intercept       0.018500  0.000465  0.001981  0.172783
C(rank)[T.2.0]  0.508931  0.032829  0.273692  0.946358
C(rank)[T.3.0]  0.261792  0.000104  0.133055  0.515089
C(rank)[T.4.0]  0.211938  0.000205  0.093443  0.480692
gre             1.002267  0.038465  1.000120  1.004418
gpa             2.234545  0.015388  1.166122  4.281877

 

Now the interpretation is easier: the odds ratios describe the multiplicative effect of each IV on the odds of admission, and the confidence intervals have been converted to the odds ratio scale as well.

Numeric Variables

For every one unit increase in gpa, the odds of being admitted increases by a factor of 2.235; for every one unit increase in gre score, the odds of being admitted increases by a factor of 1.002.
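
Because a single GRE point is a tiny step on a 220 to 800 scale, it can help to rescale the effect. For example, the odds ratio associated with a 100-point increase in GRE score can be computed from the fitted coefficient as sketched below.

## Odds ratio for a 100-point increase in gre score
print(np.exp(100 * model.params['gre']))   ## roughly 1.25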

Categorical Variables

The results are still interpreted in comparison to the group that was dropped (Rank 1). Applicants from a Rank 2 University have 0.509 times the odds of admission compared to applicants from a Rank 1 University; applicants from a Rank 3 University have 0.262 times the odds of admission compared to a Rank 1 University, etc.

An even easier way to say the above would be: the odds of admission for applicants from a Rank 2 University are about half those of applicants from a Rank 1 University, and the odds for applicants from a Rank 3 University are about a quarter of those from a Rank 1 University.

When interpreting odds ratios, any value greater than 1 indicates an increase in the odds, i.e. an increase in the likelihood of the outcome occurring for that group, and any value less than 1 indicates a decrease in the odds, i.e. a decrease in the likelihood.

2 comments

  1. Hi there, great post, very helpful, I just have one question. In the tutorial, it is stated that “Logistic regression does not require the continuous IV(s) to be linearly related to the DV. It does require the continuous IV(s) to be linearly related to the log odds of the DV though. A way to test this is to plot the IV(s) in question and look for an S-shaped curve.”

    Is there any way to evaluate this assumption with more rigour other than visually inspecting the data?

    1. Hey Greg,

      There is a method that is used which is called the Box-Tidwell test that can test for this. Unfortunately, at this time, this method has not been implemented in Python.

      Best,

      Corey
