What’s in this section:
Introduction to Logistic Regression
Logistic regression models are used to analyze the relationship between a dependent variable (DV) and independent variable(s) (IV) when the DV is dichotomous. The DV is the outcome variable, a.k.a. the predicted variable, and the IV(s) are the variables that are believed to have an influence on the outcome, a.k.a. predictor variables. If the model contains 1 IV, then it is a simple logistic regression model, and if the model contains 2+ IVs, then it is a multiple logistic regression model.
Assumptions for logistic regression models:
- The DV is categorical (binary)
  - If there are more than two outcome categories, a multinomial logistic regression should be used
- Independence of observations
  - The design cannot be repeated measures, i.e. outcomes cannot be collected from the same subjects at multiple time points
- Independent variables are linearly related to the log odds
- Absence of multicollinearity
- Lack of outliers
Data Used in this example
Data used in this example is the data set from UCLA’s Logistic Regression for Stata example. The question being asked is: how do GRE score, GPA, and the prestige of the undergraduate institution affect admission into graduate school? The DV is admission status (binary), and the IVs are GRE score, GPA, and undergraduate prestige.
Let’s import pandas as pd, load the data set, and take a look at the variables! The data can be loaded either with the code below, or from our GitHub. Loading the data from both sources is shown below.
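A sketch of loading the data is below. The UCLA URL is an assumption based on their Stata example; a small hand-entered sample with the same columns is included so the snippet runs without a network connection.

```python
import pandas as pd

# URL for UCLA's graduate admissions data (an assumption based on
# their Logistic Regression for Stata example; requires network access)
url = "https://stats.idre.ucla.edu/stat/data/binary.csv"

# Small illustrative sample with the same columns, used here so the
# example runs offline; to load the full data set instead:
# df = pd.read_csv(url)
df = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1],
    "gre":   [380, 660, 800, 640, 520, 760],
    "gpa":   [3.61, 3.67, 4.00, 3.19, 2.93, 3.00],
    "rank":  [3, 3, 1, 4, 4, 2],
})

# Descriptive statistics for every numeric column
print(df.describe())
```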
Looking at the descriptive statistics, 31% of the students get admitted to a graduate program, the average GRE score is 587 with a large standard deviation, the average GPA is 3.39, and the average undergraduate school prestige is 2.49.
One thing to keep in mind: the variable “rank” is truly categorical and will need to be converted to this data type to be accurately included in the model. This will be expressed in the formula when it is run. For assumption checking purposes, this variable will need to be dummy coded. This is simple with the pd.get_dummies() method. It automatically creates a new variable for each category of the original variable, coded so that 1 is membership in that category and 0 is non-membership.
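A minimal sketch of the dummy coding, using a few illustrative rank values (in the real data, rank runs from 1, most prestigious, to 4, least):

```python
import pandas as pd

# Illustrative "rank" values (an assumption; use the admissions data frame)
df = pd.DataFrame({"rank": [1, 2, 3, 4, 2, 3]})

# One new 0/1 column per category: 1 = membership, 0 = non-membership
dummies = pd.get_dummies(df["rank"], prefix="rank")
df = df.join(dummies)
print(df.head())
```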
Multiple Logistic Regression Example
First, the assumptions of the model need to be checked. The first two assumptions, a binary DV and independence of observations, are met. Let’s import the library needed to run the logistic regression, a graphing library, and then check the rest of the assumptions!
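The imports might look like this (statsmodels for the regression and seaborn for the graphs, which is one reasonable choice for this workflow):

```python
import statsmodels.formula.api as smf  # formula-style logistic regression
import seaborn as sns                  # statistical plotting
import matplotlib.pyplot as plt        # rendering the seaborn plots
```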
Assumption of Continuous IVs being Linearly Related to the Log Odds
Logistic regression does not require the continuous IV(s) to be linearly related to the DV, but it does require the continuous IV(s) to be linearly related to the log odds of the outcome. A way to test this is to plot the IV(s) in question and look for an S-shaped curve. Sometimes the S-shape will not be obvious; the plot should have a flat or flat-ish top and bottom with an increasing or decreasing middle.
This can be done with the following code using the Seaborn statistical plotting library for Python.
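A sketch of such a plot is below. Synthetic stand-in data is generated here so the example is self-contained; in practice the admissions data frame loaded earlier would be passed in.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (an assumption; substitute the real data frame)
rng = np.random.default_rng(0)
gre = rng.normal(580, 100, 200)
p = 1 / (1 + np.exp(-0.01 * (gre - 580)))
admit = rng.binomial(1, p)
df = pd.DataFrame({"gre": gre, "admit": admit})

# logistic=True fits a logistic curve through the 0/1 outcomes,
# making any S-shape easier to see
ax = sns.regplot(x="gre", y="admit", data=df, logistic=True, ci=None)
ax.set_xlabel("GRE score")
ax.set_ylabel("Probability of admission")
```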
It may be hard to see, but the data does have somewhat of a curve resembling the required S-shape. If the curve is not S-shaped (sometimes a U-shape appears instead), how to handle that data needs to be considered.
Assumption of Absence of Multicollinearity
An easy way to test this is to use a correlation matrix and look for highly correlated variables, and/or to look for high Variance Inflation Factor (VIF) scores. If any variables are highly correlated, or have a high VIF, a corrective action would be to drop one of them, since they are measuring the same or similar things.
The only independent variables with a moderate correlation with each other are rank_2.0 and rank_3.0. Given that these variables are dummy codes from the original “rank” variable, there is no concern about multicollinearity.
Assumption of Lack of Outliers
The assumption of lack of outliers is an easy one to check. One can get a feel of this with the descriptive statistics provided by the .describe() method. The easiest way to check for outliers is to use a box plot.
Due to the drastic difference between the scales used to measure GRE and GPA/rank, two separate box plot charts will be produced.
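One way to draw the two charts side by side, sketched on synthetic stand-in data (an assumption; use the admissions data frame in practice):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption)
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "gre": rng.normal(580, 100, 200),
    "gpa": rng.normal(3.4, 0.4, 200).clip(0, 4),
    "rank": rng.integers(1, 5, 200),
})

# GRE sits on a roughly 200-800 scale, so it gets its own chart,
# while GPA and rank share a second chart on their smaller scale
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
df[["gre"]].boxplot(ax=ax1)
df[["gpa", "rank"]].boxplot(ax=ax2)
```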
There look to be two values that could be considered outliers for the GRE variable, and one value for the GPA variable. In both cases, the values in question are not far from the rest of the values in their respective variables, so they can be kept and used in the analysis.
Logistic Regression Model
The assumptions have been checked, and the data is good to run.
From here, it’s a straightforward matter of plugging the desired model into the formula. For the “rank” variable, one can either use the dummy variables created to check multicollinearity or use “C(variable_of_interest)”. If using the dummy variables, be sure not to include one of the groups, to avoid the Dummy Variable Trap. The group dropped is then the reference group for the other dummy variables that came from the same original variable. Most commonly the highest ranked group is dropped, or sometimes the lowest ranked group; it all depends on the hypothesis. The “C(variable_of_interest)” method does this automatically.
To recap, the analysis is looking into the effects GRE score, GPA, and undergraduate university prestige have on admission into the program. Using the api formula method, the general structure is as follows:
smf.logit(formula="DV ~ IV1 + IV2 + IVn", data=your_data_frame)
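Put together, fitting the model might look like the sketch below. The data here is synthetic, generated from a known relationship so the fit converges (an assumption; the real admissions data frame would be used in practice).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a built-in relationship (an assumption)
rng = np.random.default_rng(3)
n = 400
gre = rng.normal(580, 100, n)
gpa = rng.normal(3.4, 0.4, n)
rank = rng.integers(1, 5, n)
p = 1 / (1 + np.exp(-(-4 + 0.002 * gre + 0.8 * gpa - 0.6 * rank)))
df = pd.DataFrame({"admit": rng.binomial(1, p), "gre": gre,
                   "gpa": gpa, "rank": rank})

# C(rank) treats rank as categorical; the first level (rank 1)
# automatically becomes the reference group
model = smf.logit("admit ~ gre + gpa + C(rank)", data=df)
result = model.fit(disp=0)
print(result.summary())
```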
Interpreting Logistic Regression
The model summary shows the coefficients, standard errors, the associated z-scores, and the 95% confidence intervals. The results are interpreted as follows: the overall model is significant, as indicated by an LLR p-value < 0.05 (7.578e-08), which allows us to look at the rest of the results. All the IVs have a significant effect on the log odds of being admitted, as indicated by the p-values of their z-scores being < 0.05.
Interpreting continuous variables is not very different from interpreting them in a linear regression model. For every one unit increase in gre score, the log odds of admission increase by 0.0023; for every one unit increase in gpa, the log odds of admission increase by 0.8040.
The categorical variables have a different interpretation. Since Rank 1 was dropped from the analysis, it is the comparison group and plays an important role in interpreting the other categories. For example, if an applicant attended a Rank 2 University compared to a Rank 1 University, the log odds of admission decrease by 0.6754; if an applicant attended a Rank 3 University compared to a Rank 1 University, the log odds of admission decrease by 1.3402.
Taking Logistic Regression a Step Further
Interpreting the log odds is not very straightforward when thinking about their effects. An easier way to interpret the findings is to convert the coefficients of the logistic regression model into odds ratios, which is done by exponentiating the coefficient values.
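The conversion can be sketched as below; the model is refit on synthetic stand-in data to keep the snippet self-contained (an assumption; exponentiate the fitted admissions model in practice).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption; use the real fitted model)
rng = np.random.default_rng(3)
n = 400
gre = rng.normal(580, 100, n)
gpa = rng.normal(3.4, 0.4, n)
rank = rng.integers(1, 5, n)
p = 1 / (1 + np.exp(-(-4 + 0.002 * gre + 0.8 * gpa - 0.6 * rank)))
df = pd.DataFrame({"admit": rng.binomial(1, p), "gre": gre,
                   "gpa": gpa, "rank": rank})
result = smf.logit("admit ~ gre + gpa + C(rank)", data=df).fit(disp=0)

# Exponentiating the coefficients converts log odds into odds ratios;
# the same transform applies to the confidence interval bounds
ci = np.exp(result.conf_int())
ci.columns = ["2.5%", "97.5%"]
odds = pd.concat([np.exp(result.params).rename("odds_ratio"), ci], axis=1)
print(odds)
```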
Now the interpretation is easier. Converting the logistic coefficients into odds ratios makes it easier to interpret the effects on the DV. The confidence intervals have been converted to odds as well.
For every one unit increase in gpa, the odds of being admitted increases by a factor of 2.235; for every one unit increase in gre score, the odds of being admitted increases by a factor of 1.002.
The results are still interpreted in comparison to the group that was dropped. Applicants from a Rank 2 University are 0.509 times as likely to be admitted as applicants from a Rank 1 University; applicants from a Rank 3 University are 0.262 times as likely to be admitted as applicants from a Rank 1 University, etc.
An even easier way to say the above would be, applicants from a Rank 2 University are about half as likely to be admitted compared to applicants from a Rank 1 University, and applicants from a Rank 3 University are about a quarter as likely to be admitted compared to applicants from a Rank 1 University.
When interpreting odds ratios, any value greater than 1 indicates an increase in the odds, i.e. an increase in the likelihood of that group being in the outcome category, and any value less than 1 indicates a decrease in the odds, i.e. a decrease in the likelihood.