Chi-square test of independence

The Chi-square test of independence tests whether there is a significant relationship between two categorical variables. The test compares the observed frequencies to the frequencies that would be expected if the variables were independent. The data is usually displayed in a cross-tabulation format, with each row representing a category of one variable and each column representing a category of the other variable. The Chi-square test of independence is an omnibus test, meaning it tests the data as a whole. As a result, if the Chi-square table is larger than 2×2, one will not be able to tell which levels (categories) of the variables are responsible for the relationship without post hoc testing. If this doesn’t make much sense right now, don’t worry. Further explanation will be provided when we start working with the data.
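To make “observed versus expected” concrete, here is a minimal sketch of how the statistic is built, using a hypothetical 2×2 table of counts (illustrative only; later on this page scipy does all of this for us):

import numpy as np

observed = np.array([[20, 30],
                     [40, 10]])    # hypothetical counts

# Expected count for each cell under independence:
# E = (row total * column total) / grand total
row_totals = observed.sum(axis= 1, keepdims= True)
col_totals = observed.sum(axis= 0, keepdims= True)
expected = row_totals * col_totals / observed.sum()

# The Chi-square statistic sums the squared deviations from expectation
chi2 = ((observed - expected) ** 2 / expected).sum()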

The H0 (Null Hypothesis): There is no relationship between variable 1 and variable 2.

The H1 (Alternative Hypothesis): There is a relationship between variable 1 and variable 2.

If the p-value is below the chosen significance level, you can reject the null hypothesis and claim that the findings support the alternative hypothesis.

Chi-square Assumptions

The following assumptions need to be met in order for the results of the Chi-square test to be trusted.

  • When testing the data, the cells should be frequencies or counts of cases and not percentages. It is okay to convert to percentages after testing the data
  • The levels (categories) of the variables being tested are mutually exclusive
  • Each participant contributes to only one cell within the Chi-square table
  • The groups being tested must be independent
  • The expected frequency of each cell should be 5 or greater

If all of these assumptions are met, then Chi-square is the correct test to use.

This page will go over how to conduct a Chi-square test of independence using Python, how to interpret the results, and will provide a custom function developed by Python for Data Science, LLC for you to use! It cleans up the output, can calculate row/column percentages, and can export the results to a CSV file.

First we need to import the required libraries and data!

import pandas as pd
import researchpy as rp
from scipy import stats

df = pd.read_csv("mental-heath-in-tech.csv")

 

Data used for this Example

The data used in this example is from Kaggle.com, provided by Open Sourcing Mental Illness, LTD. The data set is from the 2016 OSMI Mental Health in Tech Survey, which aims to measure attitudes towards mental health in the tech workplace and examine the frequency of mental health disorders among tech workers. The link to the Kaggle source of the data set is here.

For this example, we will test if there is an association between willingness to discuss a mental health issue with a direct supervisor and currently having a mental health disorder. In the data set, these are the variables “Would you have been willing to discuss a mental health issue with your direct supervisor(s)?” and “Do you currently have a mental health disorder?”, respectively. Let’s take a look at the data!

rp.summary_cat(df[['Do you currently have a mental health disorder?', 'Would you have been willing to discuss a mental health issue with your direct supervisor(s)?' ]])


Variable                                          Outcome                                Count  Percent
Do you currently have a mental health disorder?   Yes                                      575    40.13
                                                  No                                       531    37.06
                                                  Maybe                                    327    22.82
Would you have been willing to discuss a menta…   Some of my previous employers            654    51.74
                                                  No, at none of my previous employers     416    32.91
                                                  I don’t know                             101     7.99
                                                  Yes, at all of my previous employers      93     7.36

For the variable “Do you currently have a mental health disorder?”, we are going to drop the responses of “Maybe” since we are only interested in whether people know they do or do not have a mental health disorder. In order to do this, we will use a function to recode the data. In addition, the variables will be renamed to shorten them.

def drop_maybe(response):
    # Keep "Yes"/"No" responses; recode everything else (e.g., "Maybe") as missing
    if response.lower() == 'yes' or response.lower() == 'no':
        return response
    else:
        return None

df['current_mental_disorder'] = df['Do you currently have a mental health disorder?'].apply(drop_maybe)
df['willing_discuss_mh_supervisor'] = df['Would you have been willing to discuss a mental health issue with your direct supervisor(s)?']

rp.summary_cat(df[['current_mental_disorder', 'willing_discuss_mh_supervisor']])

 

Variable                        Outcome                                Count  Percent
current_mental_disorder         Yes                                      575    51.99
                                No                                       531    48.01
willing_discuss_mh_supervisor   Some of my previous employers            654    51.74
                                No, at none of my previous employers     416    32.91
                                I don’t know                             101     7.99
                                Yes, at all of my previous employers      93     7.36
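As an aside, the same recode can be done with pandas built-ins instead of a custom function; a minimal sketch (an alternative approach, not part of the original walkthrough):

# Keep "Yes"/"No" responses and blank out everything else in one line
col = 'Do you currently have a mental health disorder?'
df['current_mental_disorder'] = df[col].where(df[col].isin(['Yes', 'No']))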

Our data is set, so let’s take a look at the crosstab frequencies of the two groups.

pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])

 

current_mental_disorder                 No  Yes
willing_discuss_mh_supervisor
I don’t know                            51   29
No, at none of my previous employers   119  194
Some of my previous employers          237  267
Yes, at all of my previous employers    51   24
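If you also want the row and column totals alongside the counts, pd.crosstab accepts a margins argument:

# margins= True appends an "All" row and column containing the totals
pd.crosstab(df['willing_discuss_mh_supervisor'],
            df['current_mental_disorder'],
            margins= True)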

Chi-square test using scipy.stats.chi2_contingency

You should have already imported scipy.stats as stats; if you haven’t yet, do so now. The chi2_contingency() method conducts the Chi-square test on a contingency table (crosstab). The full documentation on this method can be found here on the official site. With that, first we need to assign our crosstab to a variable so we can pass it to the method.

crosstab = pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])
crosstab

 

Now we can simply pass the crosstab variable to the chi2_contingency() method to conduct a Chi-square test of independence. The output isn’t neatly formatted, but all the required information is there.

While we check the results of the chi2 test, we also need to check that the expected cell frequencies are greater than or equal to 5; this is one of the assumptions (as mentioned above) for the chi2 test. If a cell has an expected frequency less than 5, then Fisher’s Exact test should be used to overcome this problem. Interpretation of the results is the same. This information is also provided in the output.
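Should you ever hit that situation with a 2×2 table, scipy provides Fisher’s Exact test as well; a minimal sketch with hypothetical small counts:

# fisher_exact accepts a 2x2 table and returns the odds ratio and p-value
oddsratio, p = stats.fisher_exact([[8, 2],
                                   [1, 5]])
print(p)

With that noted, back to our Chi-square test.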

stats.chi2_contingency(crosstab)

 

(32.408194625396376,
 4.2928597930482389e-07,
 3,
 array([[  37.69547325,   42.30452675],
        [ 147.48353909,  165.51646091],
        [ 237.48148148,  266.51851852],
        [  35.33950617,   39.66049383]]))

 

The first value (32.408) is the Chi-square statistic, followed by the p-value (4.29e-07), then the degrees of freedom (3), and lastly the expected frequencies as an array. Since all of the expected frequencies are greater than 5, the chi2 test results can be trusted. We can reject the null hypothesis as the p-value is less than 0.05. Thus, the results indicate that there is a relationship between willingness to discuss a mental health issue with a direct supervisor and currently having a mental health disorder within the tech/IT workplace.
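If you prefer named values over the raw tuple, the result unpacks cleanly, which also turns the expected-frequency check into a one-liner:

chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"chi2 = {chi2:.4f}, p-value = {p:.3g}, degrees of freedom = {dof}")

# Programmatic version of the assumption check
print("All expected frequencies >= 5:", (expected >= 5).all())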

Although our Chi-square test was significant, our table is 4×2, i.e. larger than 2×2, so we can’t yet state exactly where the relationship is since the Chi-square test is an omnibus test. We have to conduct post hoc tests to find where the relationship is between the different levels (categories) of each variable. This example will use the Bonferroni-adjusted p-value method, which will be covered in the section after next.

Chi-square test using Researchpy

Researchpy has a nice crosstab method that can do more than just produce cross-tabulation tables and conduct the Chi-square test of independence. The link for the full documentation is here. Let’s conduct the same analysis and see the cross-tabulation table in terms of column percent. This will allow us to compare the percentages of those with a mental health disorder against those without a mental health disorder. The output comes as a tuple, but for cleanliness, I will store the cross-tabulation table as one object and the results as another object.

table, results = rp.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'], prop= 'col', test= 'chi-square')
    
table


                                      current_mental_disorder
                                          No     Yes     All
willing_discuss_mh_supervisor
I don’t know                           11.14    5.64    8.23
No, at none of my previous employers   25.98   37.74   32.20
Some of my previous employers          51.75   51.95   51.85
Yes, at all of my previous employers   11.14    4.67    7.72
All                                   100.00  100.00  100.00

results


Chi-square test results
Pearson Chi-square ( 3.0) = 32.4082
p-value = 0.0000
Cramer’s V = 0.1826

In the results table, researchpy’s crosstab method also returned Cramer’s V, which is an effect size measure. This tells how strong the relationship between the two variables is. Let’s interpret the results.

There is a statistically significant relationship between having a current mental health disorder and the willingness to discuss mental health with a supervisor, χ2(3) = 32.4082, p < 0.0001. The strength of that relationship is small, Cramer's V = 0.1826. As mentioned earlier, post hoc tests need to be conducted in order to see where the difference between the willingness to discuss mental health with a supervisor and having a current mental health disorder is, since the willingness-to-discuss variable has 4 groups. Also as mentioned earlier, this example will use the Bonferroni-adjusted p-value method, which will be covered in the next section.
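If you want to verify the effect size by hand, Cramer’s V is derived directly from the Chi-square statistic; a minimal sketch using this example’s crosstab:

import numpy as np

# V = sqrt(chi2 / (n * (k - 1))), where k is the smaller table dimension
chi2 = 32.4082
n = crosstab.to_numpy().sum()            # 972 responses with both answers present
k = min(crosstab.shape)                  # 2
print(np.sqrt(chi2 / (n * (k - 1))))     # ~0.1826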

Chi-square post hoc testing

Now that we know our Chi-square test of independence is significant, we want to test where the relationship is between the levels of the variables. In order to do this, we need to conduct multiple 2×2 Chi-square tests using the Bonferroni-adjusted p-value.

Some of you may ask why. By comparing multiple levels (categories) against each other, the chance of a false positive compounds with each test. Meaning, our first test at the 0.05 level carries a 5% chance of a false positive; after a second test the overall chance is roughly 10%, and so forth, growing by about 5% with each subsequent test. If we were to conduct all 6 possible pairwise comparisons, the familywise error rate would be as high as 30%, which is not acceptable on any level.

To avoid this, the Bonferroni adjustment divides the significance level by the number of planned pairwise comparisons being conducted. The formula is α/N, where “α” = the original significance level (here 0.05) and “N” = the number of planned pairwise comparisons.
In our example, if we were planning on conducting all possible pairwise comparisons, the formula would be 0.05/6 = 0.008. Meaning, a post hoc 2×2 Chi-square test would have to have a p-value less than 0.008 to be significant. However, we are not interested in the “I don’t know” category of the “willing_discuss_mh_supervisor” variable, making the formula 0.05/3, which equals 0.017. So for our planned pairwise comparisons to be significant, the p-value must be less than 0.017.
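As a quick sanity check on those numbers, a minimal sketch:

alpha = 0.05

# Familywise error rate if all 6 pairwise tests were run unadjusted
# (the additive ~30% figure above is the conservative Bonferroni bound)
print(1 - (1 - alpha) ** 6)    # ~0.265

# Bonferroni-adjusted threshold for our 3 planned comparisons
print(alpha / 3)               # ~0.0167, i.e. 0.017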

To conduct multiple 2×2 Chi-square tests, one needs to regroup the variables for each test to where it is one category against the rest. For us, it will be:

  • No, at none of my previous employers vs. the rest
  • Some of my previous employers vs. the rest
  • Yes, at all of my previous employers vs. the rest

Python makes this task easy! The pd.get_dummies() method creates dummy variables, where each new variable represents one category of the original variable and is equal to “1” if the participant belongs to that category and “0” if they do not. We will assign the dummy variables to a new Python data frame.

# One 0/1 indicator column per response category
dummies = pd.get_dummies(df['willing_discuss_mh_supervisor'])

# We are not interested in the "I don't know" category
dummies.drop(["I don't know"], axis= 1, inplace= True)
dummies.head()

 

   No, at none of my previous employers  Some of my previous employers  Yes, at all of my previous employers
0                                     0                              1                                     0
1                                     0                              1                                     0
2                                     0                              0                                     0
3                                     0                              1                                     0
4                                     0                              1                                     0

Now that we have our dummy variables set, we can conduct our planned post hoc comparisons. This will be easy using a for loop. There is going to be a bit of extra code in the for loop to clean up the output.

for series in dummies:
    nl = "\n"

    # 2x2 crosstab of the current dummy variable against mental disorder status
    crosstab = pd.crosstab(dummies[series], df['current_mental_disorder'])
    print(crosstab, nl)

    # Run the 2x2 Chi-square test and print a clean summary
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")

 

current_mental_disorder                No  Yes
No, at none of my previous employers
0                                     411  380
1                                     119  194

Chi2 value= 16.90623905539159
p-value= 3.927228826835633e-05
Degrees of freedom= 1


current_mental_disorder        No  Yes
Some of my previous employers
0                             294  308
1                             236  266

Chi2 value= 0.29589978434689185
p-value= 0.5864643795737425
Degrees of freedom= 1


current_mental_disorder               No  Yes
Yes, at all of my previous employers
0                                    479  550
1                                     51   24

Chi2 value= 12.040740742132103
p-value= 0.0005205028333059755
Degrees of freedom= 1

 

Using the Bonferroni-adjusted p-value of 0.017, 2 of the 3 planned pairwise comparisons are significant: there is a significant relationship between current_mental_disorder and “No, at none of my previous employers”, and between current_mental_disorder and “Yes, at all of my previous employers”. Now we can compare the cells within the Chi-square table.

  • Looking at current_mental_disorder & No, at none of my previous employers, it can be stated that a higher proportion of individuals with a current mental illness reported they would not have been willing to discuss a mental health issue with their direct supervisor.
  • Looking at current_mental_disorder & Yes, at all of my previous employers, it can be stated that a lower proportion of those with a current mental illness reported they would have been willing to discuss a mental health issue with their direct supervisor.

10 comments

  1. Thanks. This is very useful. Do you think it would be possible to do something that does column proportion comparisons like what SPSS has available when you run a cross tabulation?

    1. Hey there Jose,

      First, thanks for your comment and yes you can! There is a bit of work that needs to be done using Python to get there, but it’s not bad. There are two ways that I can think of to do it: 1) from scratch, and 2) using the “proportions_ztest” method, which is part of the statsmodels.stats.proportion library.

      I’m going to show the quicker way, which is to use the proportions_ztest method (below).

      ## General code structure below (pseudo code) ##

      import numpy as np
      from statsmodels.stats.proportion import proportions_ztest

      count = np.array([group1_success, group2_success])
      nobs = np.array([group1_N, group2_N])
      proportions_ztest(count, nobs)

      ## End pseudo code ##

      Note: group1_success and group2_success refer to the number of “yes”, or “completes”, or however one is measuring a success for that group, and group1_N and group2_N refer to the total number of observations for that group.

      Using the page’s data to demonstrate further, let’s conduct a z-test on the proportions of those with a mental health disorder between those that said “no, at none of my previous employers” and “yes, at all of my previous employers”. Our hypothesis is that there will be a significant difference in the proportion of individuals with a current mental health disorder that would not tell any of their previous employers compared to those that state they would tell all of their previous employers. Since a direction was not stated, this makes it a two-tailed test; if there is a hypothesis of which proportion would be greater, one can pass that through the proportions_ztest(alternative= ) argument. The default is a two-tailed test, so if nothing is passed, the method tests the data as two-tailed. The method’s documentation can be found at http://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html if more information is desired.

      A quick digression: verbally speaking this would be “row proportions” since in the example above these groups are displayed as rows, but it doesn’t matter since we are testing the proportions. If we wanted to make it “column proportions” we could just switch how the data is displayed. Either way, we’re still comparing proportions and that is what matters.

      Now for the test!

      ## Code example ##

      # Referencing the pseudo code above:
      # group1_success = those with a current mental health disorder who would tell all employers (24)
      # group2_success = those with a current mental health disorder who would tell no employers (194)
      # group1_N = total number of respondents who would tell all employers (24 + 51 = 75)
      # group2_N = total number of respondents who would tell no employers (194 + 119 = 313)

      count = np.array([24, 194])
      nobs = np.array([75, 313])
      proportions_ztest(count, nobs)

      # Results: (-4.7001272637134131, 2.5999940974104787e-06)

      # If you want to pretty up the output a bit
      z, p = proportions_ztest(count, nobs)
      print(f"z-statistic: {z:.3f}, p-value: {p:.3f}")

      # Results: z-statistic: -4.700, p-value: 0.000

      ## End code example ##

      Please note that it matters where you plug the numbers in for “count” and “nobs”. They need to be entered so they align vertically.

      There is a significant difference in the proportion of those with a mental health disorder between those that would tell all of their employers and those that would tell none of their employers. One would have to do this for each planned post hoc comparison test, and don’t forget to adjust the p-value for the multiple comparisons!

      Best,

      Corey from Python for Data Science.

      p.s. I’m definitely going to be making a page on z-test in the future. Thanks for the idea!

  2. I am getting “module not found” when using import researchpy as rp.
    I am using the Anaconda distribution and Jupyter notebook. Could you please tell me how to solve the issue? Even on the internet I did not find anything about the researchpy module.

    1. Hey Raj,

      It sounds like you don’t have the package installed. Assuming you have Windows, click your Windows button in the bottom left corner of your screen, find the Anaconda folder and click it, then select “Anaconda Prompt”. Once that launches, type “pip install researchpy”.

      This will download the package and you will be able to use it.

      1. I am also using Anaconda (though I’m using the Spyder interface with Python 3.6) and I tried downloading researchpy from the Anaconda prompt. I typed in ‘pip install researchpy’ in the Anaconda prompt and I get the following error message: “Could not find a version that satisfies the requirement researchpy (from versions: ) No matching distribution found for researchpy”. Do you have any suggestions on how to reconcile this?

      2. Hey there,

        Is your Anaconda environment updated? Try this in the Anaconda prompt first:

        conda update conda
        conda update conda-build

        Then try using the ‘pip install researchpy’ afterwards

  3. I tried updating my Anaconda environment as suggested (it updated), but I get the same error when trying to install ‘researchpy’. No worries though. I’m mostly interested in the percentages in the cross table, but I think I can find a workaround.
    Thanks for an excellent post. This was very well done, easy to follow, well explained, and overall top quality.

    1. Thanks, I’ll build an Anaconda distribution soon; it’s been on the list. I’ll update here when it’s good to go.

      – Corey

    2. Hey there, a conda distribution has been built and is now available through traditional conda installation. Run the following code from the Anaconda prompt:

      conda install -c researchpy researchpy

      It’s been tested and works – please let me know if you have any issues.

      Thanks and best regards,

      Corey
      Python for Data Science

      1. Thanks, Corey! I followed your guidance and got it working just great with Spyder in the Anaconda environment.
