Introduction to ANOVA (One-Way)

The analysis of variance (ANOVA) can be thought of as an extension of the t-test. The independent t-test is used to compare the means of a condition between 2 groups; ANOVA is used when one wants to compare the means of a condition between 2+ groups. ANOVA is an omnibus test, meaning it tests the data as a whole. Put another way, ANOVA tests whether there is a difference in means somewhere in the model (whether there was an overall effect), but it does not tell one where the difference is if there is one. To find out where the difference is between the groups, one has to conduct post-hoc tests. These are also covered in this section.

Although it can be thought of as an extension of the t-test in terms of when to use it, mathematically speaking it is more of a regression model and is considered a general linear model (GLM). The general regression equation is as follows:

outcomeᵢ = (model) + errorᵢ

Replacing the terms of the general regression equation with placeholder groups for context makes the equation look like this:

outcomeᵢ = b₀ + b₁Group₁ + b₂Group₂ + errorᵢ

Where:

  • b₀ is the model's intercept (a.k.a. the constant term),
  • b₁Group₁ is the coefficient (b₁) multiplied by the respective group value (Group₁), and likewise for b₂Group₂, and
  • errorᵢ is the error present in the model

That's enough of a primer for now; the model will be updated once we actually have some data to work with. The hypotheses tested by an ANOVA are as follows:

  • H0: No difference between means, i.e. x̄₁ = x̄₂ = x̄₃
  • Ha: A difference between means exists somewhere, i.e. x̄₁ ≠ x̄₂ ≠ x̄₃, or x̄₁ = x̄₂ ≠ x̄₃, or x̄₁ ≠ x̄₂ = x̄₃

ANOVA Assumptions

There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate and trustworthy. It's important to note that the assumptions apply to the residuals and not the variables themselves. The ANOVA assumptions are the same as for linear regression and are:

  • Normality
    • The caveat to this is that if group sizes are equal, the F-statistic is robust to violations of normality
  • Homogeneity of variance
    • Same caveat as above: if group sizes are equal, the F-statistic is robust to this violation
  • Independent observations

If possible, it is best to have groups of the same size so corrections to the data do not need to be made. With real-world data, however, that is often not the case, and one will have to make corrections. If these assumptions are not met and one does not want to transform the data, an alternative test that could be used is the Kruskal-Wallis H-test or Welch's ANOVA; a sketch of the former follows.
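As a minimal sketch, the Kruskal-Wallis H-test is available in scipy.stats via stats.kruskal(), using the same grouped-selection pattern as the ANOVA examples later in this post (the df DataFrame is the one loaded in the next section):

stats.kruskal(df['libido'][df['dose'] == 'placebo'],
              df['libido'][df['dose'] == 'low'],
              df['libido'][df['dose'] == 'high'])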

Data used in this example

The data used in this example is fictional and can be found on our GitHub. It is made-up data measuring the effects of different doses of a clinical drug, Difficile, on libido. It contains 2 columns of interest, "dose" and "libido". Dose contains the dosing information, "placebo", "low", and "high", and libido is a measure of libido on a 7-point Likert scale, with 7 being the highest and 1 being the lowest. Now let's import the libraries needed for the analysis, load the data, and take a look! While we're at it, let's drop the 'person' column since we don't need it.

import pandas as pd
import scipy.stats as stats
import researchpy as rp
import statsmodels.api as sm
from statsmodels.formula.api import ols

import matplotlib.pyplot as plt

# Loading data
df = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/difficile.csv")
df.drop('person', axis=1, inplace=True)

# Recoding the dose values from numeric to string
df['dose'] = df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'})

# Getting summary statistics
rp.summary_cont(df['libido'])


Variable    N     Mean      SD       SE        95% Conf. Interval
libido      15.0  3.466667  1.76743  0.456349  2.487896   4.445437

That's good for seeing the data as a whole, but we are really interested in the data by dose group.

rp.summary_cont(df['libido'].groupby(df['dose']))


          N    Mean   SD        SE        95% Conf. Interval
dose
high      5    5.0    1.581139  0.707107  3.450484   6.549516
low       5    3.2    1.303840  0.583095  1.922236   4.477764
placebo   5    2.2    1.303840  0.583095  0.922236   3.477764

ANOVA Example

There are a few ways this can be done with Python. One is with the stats.f_oneway() method, which is a part of the scipy.stats library, and the other is using statsmodels.

ANOVA with scipy.stats

If using scipy.stats, the method needed is stats.f_oneway(). The general applied method looks like this:

stats.f_oneway(data_group1, data_group2, data_group3, ..., data_groupN)

stats.f_oneway(df['libido'][df['dose'] == 'high'],
               df['libido'][df['dose'] == 'low'],
               df['libido'][df['dose'] == 'placebo'])
F_onewayResult(statistic=5.1186440677966099, pvalue=0.024694289538222603)


The F-statistic = 5.119 and the p-value = 0.025, which indicates that there is an overall significant effect of medication on libido. However, we don't know where the difference between dosings/groups is yet; that is covered in the post-hoc section. A thing to note is that if you are doing this for academic research purposes, this method is missing some of the information that is required for publication. For example, one would need the degrees of freedom, would have to calculate the sum of squares, and would have to conduct post-hoc tests by hand. It's not difficult to do in Python (see the sketch below), but there is a much easier way. Next is how to conduct an ANOVA using the regression formula; after all, it is a general linear model (GLM).
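For instance, a minimal sketch of getting the degrees of freedom by hand, using standard pandas calls on the df loaded above:

k = df['dose'].nunique()    # number of groups (3)
n = len(df)                 # total number of observations (15)
df_between = k - 1          # degrees of freedom between groups = 2
df_within = n - k           # degrees of freedom within groups = 12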

ANOVA with statsmodels

Using statsmodels, we get a bit more information and enter the model as a regression formula. The general input using this method looks like this:

model_name = ols('outcome_variable ~ group1 + group2 + groupN', data=your_data).fit()

If you dummy code the groups yourself, you have to leave 1 of the groups out of the formula. That group's data still gets captured in the model's intercept and serves as the base (control) group. If you use the following method of entering the formula, Python takes care of this for you.

model_name = ols('outcome_variable ~ C(group_variable)', data=your_data).fit()

results = ols('libido ~ C(dose)', data=df).fit()
results.summary()

(The regression table produced by results.summary() is shown here.)

This method provides more information and is overall more useful. As mentioned earlier, the intercept group is the high dose group, since the high dose group's data was not included in the model's formula. Its data is still captured because rows from this group have values of 0 in both of the other dummy-coded groups.
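As an illustration (not part of the model-fitting code, just a peek at the coding scheme), pandas' get_dummies shows what this treatment coding looks like; drop_first=True drops 'high', the alphabetically first level, which is exactly the group captured by the intercept:

pd.get_dummies(df['dose'], prefix='dose', drop_first=True).head()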

Something to note: at the bottom of the table there are a few tests that were conducted to check the model's assumptions. These will be discussed later, along with how to call these diagnostics without printing out the model in the regression format.

Let's interpret the table. Overall, the model is significant, F(2, 12) = 5.12, p = 0.0247. This tells us that there is a significant difference in the group means. The coefficients (coef in the table) are the differences in means between the base (high dose) group and the respective group listed: the intercept is the mean for the high dose group, the placebo group's coefficient = 2.2 − 5.0 = −2.8, and the low dose group's coefficient = 3.2 − 5.0 = −1.8. Looking at the p-values (P>|t| in the table), we can see the difference between the high dose group and the placebo group is significant, p = 0.008, but the difference between the low dose group and the high dose group is not, p = 0.065. There is no comparison between the low dose group and the placebo group. I wanted to show you this so you can see where these numbers come from. Coming from the ANOVA framework, the information we are really after in this table is the F-statistic and its corresponding p-value, which tell us whether we explained a significant amount of the overall variance. To test between groups, we need to do some post-hoc testing in which we can compare all groups against each other. We are still missing some useful information with this method; we need an ANOVA table.
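Before building it, note that the headline numbers above can also be pulled straight off the fitted model; a minimal sketch using standard statsmodels results attributes:

results.params                        # intercept = high dose mean; coefficients = group differences
results.fvalue, results.f_pvalue      # overall F-statistic and its p-value
results.df_model, results.df_resid    # model and residual degrees of freedom (2.0, 12.0)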

aov_table = sm.stats.anova_lm(results, typ=2)
aov_table


          sum_sq     df    F         PR(>F)
dose      20.133333  2.0   5.118644  0.024694
Residual  23.600000  12.0

Let's break down this ANOVA table. The dose row is the between-groups effect, which is the overall experimental effect. The sum of squares for the model (SSM; the value 20.133 in the table) is how much variance is explained by our model. The current model explains a significant amount of variance, F(2, 12) = 5.12, p < 0.05. The residual row is the unsystematic variation in the data (SSR; also called the unexplained variance; the value 23.600 in the table). In this case, the unsystematic variation represents the natural individual differences in libido and the naturally different reactions to the drug, Difficile.

Calculating Model Effect Size

Something that is useful to report is the effect size. The effect size tells us how much of an impact the experiment will have in the real world. There are a few different effect sizes one can use: eta squared (η²) and omega squared (ω²). Omega squared is considered a better measure of effect size than eta squared because it is unbiased in its calculation.

Something to note: within the ANOVA framework, R² is called eta squared; they are the same thing. R² is a measure of how much variance is explained by the model and is calculated by taking the explained variance (SSM) and dividing it by the total variance (SST; also called the total sum of squares), where the total variance (SST) equals the sum of squares for the model (SSM) plus the sum of squares for the residual (SSR). This makes the equation for R² and eta squared:

η² = R² = SSM/SST
η² = R² = 20.133/43.733 = 0.460

That means the current model accounts for 46.0% of the variance in libido. As just mentioned, within the ANOVA framework R² is also called eta squared, and it can be interpreted as the amount of explained variance as well as an effect size measure.
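As a quick check of that arithmetic against the ANOVA table computed above:

print(aov_table['sum_sq']['dose'] / aov_table['sum_sq'].sum())    # ~0.460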

Another thing we need to calculate is the mean squares. The mean squares are desired because they eliminate the bias present in the SSM and SSR, and they are also used to calculate the F-statistic and omega squared. SSM and SSR are biased because they are influenced by the number of values summed to calculate them. To calculate the mean squares, one divides each sum of squares (SSM and SSR) by its respective degrees of freedom.

MSM = SSM/dfM = 20.133/2 = 10.067
MSR = SSR/dfR = 23.600/12 = 1.967

MSM is the average amount of variance explained by the current model; MSR is the average amount of variance left unexplained by the current model. The ratio of MSM to MSR is the F-statistic. We don't need to compute it since we already have it, but it's nice to understand where the numbers come from!

F = MSM/MSR = 10.067/1.967 = 5.118

The following function calculates the effect sizes mentioned, as well as the mean squares, and updates the table!

def anova_table(aov):
    """Adds mean squares, eta squared, and omega squared to a statsmodels ANOVA table."""
    aov['mean_sq'] = aov['sum_sq'] / aov['df']

    # Eta squared = SSM / SST; the residual row is excluded from the numerator
    aov['eta_sq'] = aov['sum_sq'][:-1] / aov['sum_sq'].sum()

    # Omega squared = (SSM - dfM * MSR) / (SST + MSR)
    ms_resid = aov['mean_sq'].iloc[-1]
    aov['omega_sq'] = (aov['sum_sq'][:-1] - aov['df'][:-1] * ms_resid) / (aov['sum_sq'].sum() + ms_resid)

    cols = ['sum_sq', 'df', 'mean_sq', 'F', 'PR(>F)', 'eta_sq', 'omega_sq']
    return aov[cols]

anova_table(aov_table)


          sum_sq     df    mean_sq    F         PR(>F)    eta_sq    omega_sq
dose      20.133333  2.0   10.066667  5.118644  0.024694  0.460366  0.354486
Residual  23.600000  12.0  1.966667

Assumption Checks/Model Diagnostics

As mentioned earlier, when working with linear regression and ANOVA models, the assumptions pertain to the residuals and not the variables themselves. Using statsmodels, we can use the diagnostics that are already provided. The default output is not pretty, so oftentimes I prefer to print the model summary as a regression table and read the diagnostics there, rather than use the following code, as they are more readable in the regression table.

results.diagn
{'jb': 1.1080275776425244,
 'jbpv': 0.5746386969445545,
 'skew': 0.19458085550133977,
 'kurtosis': 1.7266590060327494,
 'omni': 2.5173586607759626,
 'omnipv': 0.2840288872319992,
 'condno': 3.7320508075688776,
 'mineigval': 1.3397459621556134}

These are the same diagnostics from the bottom of the regression table from before. The Durbin-Watson test detects the presence of autocorrelation (it is not provided when calling the diagnostics this way), Jarque-Bera (jb; jbpv is its p-value) tests the assumption of normality, Omnibus (omni; omnipv is its p-value) is another test of normality, and the Condition Number (condno) assesses multicollinearity. Condition Number values over 20 are indicative of multicollinearity.

If a test for homogeneity of variance (such as Levene's test, shown below) were to be significant, an option for handling it would be to use a heteroscedasticity-corrected coefficient covariance matrix in the .anova_lm() method. This corrects the calculations to account for the heteroscedasticity present. More information on the method can be found on its official documentation page.
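A minimal sketch, using the robust argument of anova_lm (it accepts heteroscedasticity-consistent estimators such as 'hc3'):

# ANOVA table with a heteroscedasticity-corrected (HC3) covariance matrix
aov_table_hc3 = sm.stats.anova_lm(results, typ=2, robust='hc3')
aov_table_hc3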

Other ways to check assumptions

Here are some other ways to test the assumptions of the ANOVA model. I tend to use these methods when conducting an ANOVA.

Assumption: Homogeneity of Variance

One can use Levene's test to check for equal variances between groups. It is a part of the scipy.stats library; the official documentation describes it in detail. The reason I prefer this method is that the homogeneity of variance assumption should be checked across each level of the categorical variable, and Levene's test compares the groups directly, whereas the diagnostic output provided by statsmodels only looks at the residuals as a whole.

stats.levene(df['libido'][df['dose'] == 'placebo'],
             df['libido'][df['dose'] == 'low'],
             df['libido'][df['dose'] == 'high'])
LeveneResult(statistic=0.11764705882352934, pvalue=0.8900225182757423)

Levene's test for homogeneity of variance is not significant, which indicates that the groups have equal variances.

Assumption: Normality

The assumption of normality is tested on the residuals as a whole, which is how the diagnostic information provided by statsmodels tests them. One could use the Jarque-Bera test provided, or the Shapiro-Wilk test, among others. I will demonstrate how to test for normality using the Shapiro-Wilk method; the official documentation covers the details. The output is not labelled, but the numbers are the test statistic followed by the p-value.

stats.shapiro(results.resid)
(0.9166916012763977, 0.17146942019462585)

The result of the Shapiro-Wilk test is not statistically significant, which indicates that the residuals are normally distributed.
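As a visual complement, here is a minimal sketch of a Q-Q plot of the residuals using statsmodels' qqplot and the matplotlib import from earlier; points hugging the reference line support normality:

# Q-Q plot of the model residuals as a visual normality check
sm.qqplot(results.resid, line='s')
plt.title('Q-Q plot of model residuals')
plt.show()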

Post-hoc Testing

The overall model was significant; now to test which groups differ. Deciding which groups to compare should be theory driven. There are a few different techniques that can be used, each with a different way of controlling the familywise error rate. 3 common methods are:

  • Fisher's Least Significant Difference (LSD): Take the groups you want to compare and conduct multiple t-tests. This method requires that the ANOVA model be significant. It is easy, but receives pushback since it doesn't control the familywise error rate; the argument in its favor is that since the overall model was significant, one is protected from inflating the familywise error rate.
  • Bonferroni correction: Take the alpha the ANOVA was tested at, 0.05, and divide it by the number of planned comparisons; in this case, 0.05/3 = 0.0167. A post-hoc test would have to have a p-value < 0.0167 to be considered significant. To test the groups, conduct multiple t-tests, but evaluate them against the corrected alpha. This method is quick, but often considered too conservative.
  • Tukey's HSD: This method also controls the familywise error rate, using a different approach than Bonferroni, and is also considered conservative.

There are many other techniques out there that can be used for post-hoc testing, each with different guidelines for when they should be used; you are encouraged to learn about them!

Tukey’s HSD Post-hoc comparison

from statsmodels.stats.multicomp import MultiComparison

mc = MultiComparison(df['libido'], df['dose'])
mc_results = mc.tukeyhsd()
print(mc_results)

(The Tukey HSD results table printed by print(mc_results) is shown here.)

The Tukey HSD post-hoc comparison test controls for type I error and maintains the familywise error rate at 0.05 (FWER = 0.05 at the top of the table). The group1 and group2 columns are the groups being compared, the meandiff column is the difference in means of the two groups, calculated as group2 − group1, the lower/upper columns are the lower/upper boundaries of the 95% confidence interval, and the reject column states whether or not the null hypothesis should be rejected. Unfortunately, this method currently does not provide the t-statistic, so the treatment effect size cannot be calculated from its output.

Bonferroni Correction Post-hoc Comparison

First the corrected alpha level needs to be calculated. This can be done using the formula:

corrected alpha = alpha/# of comparisons = 0.05/3 = 0.0167

Now the t-tests that are conducted have to have a p-value less than 0.0167 in order to be considered significant.

stats.ttest_ind(df['libido'][df['dose'] == 'high'], df['libido'][df['dose'] == 'low'])
Ttest_indResult(statistic=1.963961012123931, pvalue=0.08513507177899203)


stats.ttest_ind(df['libido'][df['dose'] == 'low'], df['libido'][df['dose'] == 'placebo'])
Ttest_indResult(statistic=1.2126781251816647, pvalue=0.25984504521378449)


stats.ttest_ind(df['libido'][df['dose'] == 'high'], df['libido'][df['dose'] == 'placebo'])
Ttest_indResult(statistic=3.0550504633038926, pvalue=0.015700141250047695)
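As a side note, statsmodels can apply the same correction to the raw p-values for you; a minimal sketch using its multipletests function with the three p-values from above:

from statsmodels.stats.multitest import multipletests

# Bonferroni-adjusted p-values for the three pairwise t-tests above
pvals = [0.08513507177899203, 0.25984504521378449, 0.015700141250047695]
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
print(reject)            # [False False  True] -> only high vs. placebo is significant
print(pvals_corrected)   # raw p-values multiplied by the number of comparisons (capped at 1)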



Using the Bonferroni correction, only the difference between the high dose and placebo groups is significantly different. We can calculate the high dose's effect size! To calculate the effect size for the treatment dosing, we also need to calculate the degrees of freedom, since they are not provided. The following equations can be used:

dof = #_observations_group1 + #_observations_group2 - #_of_groups
dof = 5 + 5 - 2 = 8
effect size r = sqrt(t² / (t² + dof))
effect size r = sqrt(3.055² / (3.055² + 8)) = 0.73
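A quick check of this arithmetic (the t-statistic is the high vs. placebo value from above):

t_stat = 3.0550504633038926              # high vs. placebo t-statistic
dof = 5 + 5 - 2                          # degrees of freedom
r = (t_stat**2 / (t_stat**2 + dof))**0.5
print(round(r, 2))                       # 0.73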

The high dose has a large effect size.

ANOVA Results Interpretation

While interpreting the ANOVA results, the Bonferroni post-hoc analysis results will be used.

There was a significant effect of Difficile on the level of libido, F(2, 12) = 5.12, p < 0.05, ω² = 0.35. Planned post-hoc testing, using the Bonferroni correction α = 0.0167, revealed that the high dose of Difficile significantly increased libido compared to the placebo, t(8) = 3.06, p < 0.0167, r = 0.73. There were no other statistically significant differences between groups.
