What is in this section:
Introduction to ANOVA (OneWay)
The analysis of variance (ANOVA) can be thought of as an extension to the ttest. The independent ttest is used to compare the means of a condition between 2 groups. ANOVA is used when one wants to compare the means of a condition between 2+ groups. ANOVA is an omnibus test, meaning it tests the data as a whole. Another way to say that is this, ANOVA tests if there is a difference in the mean somewhere in the model (testing if there was an overall effect), but it does not tell one where the difference is if the there is one. To find out where the difference is between the groups, one has to conduct posthoc tests. This is also covered in this section.
Although it can be thought of as an extension of the ttest, in terms of when to use it, mathematically speaking, it’s more of a regression model and is considered a generalized linear model (GLM). The general regression equation is as follow:
outcome_{i} = (model) + error_{i}
Replacing the general regression equation with fake groups to show context would make the equation look like this:
outcome_{i} = b_{0} + b_{1}Group_{1} + b_{2}Group_{1} + error_{i}
Where:
 b_{0} is the model’s intercept (a.k.a. the constant term),
 b_{1}Group_{1} is the the coefficient (b_{1}) and the respective group value (Group_{1}), and
 error_{i} is the error present in the model
That’s enough of a primer for now, the model will be updated as when we actually get some data to work with. The testing hypothesis of an ANOVA is as follows:
 H_{0}: No difference between means, i.e. ͞x_{1} = ͞x_{2} = ͞x_{3}
 H_{a}: Difference between means exist somewhere, i.e. ͞x_{1} ≠ ͞x_{2} ≠ ͞x_{3}, or ͞x_{1} = ͞x_{2} ≠ ͞x_{3}, or ͞x_{1} ≠ ͞x_{2} = ͞x_{3}
ANOVA Assumptions
There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate and trust worthy. It’s important to note the the assumptions apply to the residuals and not the variables themselves. The ANOVA assumptions are the same as for linear regression and are:
 Normality
 Caveat to this is, if group sizes are equal, the Fstatistic is robust to violations of normality
 Homogeneity of variance
 Same caveat as above, if group sizes are equal, the Fstatistic is robust to this violation
 Independent observations
If possible, it is best to have groups the same size so corrections to the data do not need to be made. However, with real world data, that is often not the case and one will have to make corrections to the data. If these assumptions are not met, and one does not want to transform the data, an alternative test that could be used is the KruskalWallis Htest or Welch’s ANOVA.
Data used in this example
Data used in this example is fictional and can be found on our GitHub. It is made up data that is measuring the effects of different doses of a clinical drug, Difficile, on libido. It contains 2 columns of interest, “dose” and “libido”. Dose contains information on the dosing, “placebo”, “low”, and “high”, and libido is a measure of lowhigh libido on a 7 point Likert scale with 7 being the highest and 1 being the lowest. Now let’s import the libraries needed for the analysis, load the data and take a look! While were at it, let’s drop the ‘person’ column since we don’t need it.
import pandas as pd import scipy.stats as stats import researchpy as rp import statsmodels.api as sm from statsmodels.formula.api import ols import matplotlib.pyplot as plt # Loading data df = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Datasets/master/difficile.csv") df.drop('person', axis= 1, inplace= True) # Recoding value from numeric to string df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'}, inplace= True) # Gettin summary statistics rp.summary_cont(df['libido'])
Variable  N  Mean  SD  SE  95% Conf.  Interval 

libido  15.0  3.466667  1.76743  0.456349  2.487896  4.445437 
That’s good to see the data as a whole, but we are really interested in the data by dosing.
rp.summary_cont(df['libido'].groupby(df['dose']))
N  Mean  SD  SE  95% Conf.  Interval  

dose  
high  5  5.0  1.581139  0.707107  3.450484  6.549516 
low  5  3.2  1.303840  0.583095  1.922236  4.477764 
placebo  5  2.2  1.303840  0.583095  0.922236  3.477764 
ANOVA Example
There are a few ways this can be done with Python. One is with the stats.f_oneway() method which is apart of the scipy.stats library, and the other is using statsmodels.
ANOVA with scipy.stats
If using scipy.stats, the method needed is stats.f_oneway(). The general applied method looks like this:
stats.f_oneway(data_group1, data_group2, data_group3, data_groupN)
stats.f_oneway(df['libido'][df['dose'] == 'high'], df['libido'][df['dose'] == 'low'], df['libido'][df['dose'] == 'placebo'])
The Fstatistic= 5.119 and the pvalue= 0.025 which is indicating that there is an overall significant effect of medication on libido. However, we don’t know where the difference between dosing/groups is yet. This is in the posthoc section. A thing to note, is that if you are doing this for academic research purposes, this method is missing some of the information that is required for publication. For example, one would need the degrees of freedom, have to calculate the sum of squares, and conduct posthoc tests by hand. It’s not difficult to do in Python, but there is a much easier way. Next is how to conduct an ANOVA using the regression formula; since after all, it is a generalized linear model (GLM).
ANOVA with statsmodels
Using statsmodels, we get a bit more information and enter the model as a regression formula. The general input using this method looks like this:
model_name = ols('outcome_variable ~ group1 + group2 + groupN', data=your_data).fit()
If you dummy code the groups, you have to not include 1 of the groups in the formula. This group’s data will still get captured in the model’s intercept and is the base (control) group. If you use the following method of entering the formula Python takes care of this for you.
model_name = ols('outcome_variable ~ C(group_variable)', data=your_data).fit()
results = ols('libido ~ C(dose)', data=df).fit() results.summary()
This method provides more information and is overall more useful. Like mentioned earlier, the intercept group is the high dose group since the high dose group’s data was not included in the model’s formula. Their data is still captured because this group has values of 0 in both of the other groups.
Something to note, at the bottom of the table there are a few tests that were conducted to test the models’s assumptions. This will be discussed later and shown how to call these diagnostics without printing out the model in the regression format.
Let’s interpret the table. Overall the model is significiant, F(2,12)= 5.12, p = 0.0247. This tells us that there is a significant difference in the group means. The coefficients (coef in the table), are the difference in mean between the control group and the respective group listed. The intercept is the mean for the high dose group, placebo group’s coefficient = 2.2 – 5.0 = 2.8, and low dose coefficient = 3.2 – 5.0 = 1.8. Looking at the pvalues now (P>t in the table), we can see the difference between the high dose group and placebo group is significant, p = 0.008, but the difference between the low dose group and high dose group is not, p = 0.065. There is no comparison between the low dose group and the placebo group. I wanted to show you this to see where these numbers come from. Coming from the ANOVA framework, the information we are really after in this table it the Fstatistic and it’s corresponding pvalue. This tells us if we explained a significant amount of the overall variance. To test between groups, we need to do some posthoc testing where we can compare all groups against each other. We are still missing some useful information with this method, we need an ANOVA table.
aov_table = sm.stats.anova_lm(results, typ=2) aov_table
sum_sq  df  F  PR(>F)  

dose  20.133333  2.0  5.118644  0.024694 
Residual  23.6  12.0 
Let’s break down this ANOVA table. The dose row is the between groups effect which is the overall experimental effect. The sum of squares for the model (SS_{M}; value 20.133 in the table) is how much variance is explained by our model. The current model explains a significant amount of variance, F(2,12)= 5.12, p < 0.05. The residual row is the unsystematic variation in the data (SS_{R}; also called the unexplained variance; value 23.600 in the table). In this case, the unsystematic variation represents the natural individual differences in libido and natural different reactions to the drug, Difficile.
Calculating Model Effect Size
Something that is useful is the effect size. The effect size tells us how much of an impact the experiment will have in the real world. There are a few different effect sizes one can use: eta squared (η^{2}), and omega squared (ω^{2}). Omega squared is considered a better measure of effect size than eta squared because it is unbiased in it’s calculation.
Something to note, for some reason R^{2} is called eta squared within the ANOVA framework. They are the same thing. R^{2} is a measure of how much variance is explained by the model and is calculated by taking the explained variance (SS_{M}) and dividing it by the total variance (SS_{T}; also called total sum of squares). With the total variance (SS_{T}) equaling the sum of squares for the model (SS_{M}) plus the sum of square for the residual (SS_{R}). Thus making the equation for R^{2} and eta squared:
R^{2} and eta squared = SS_{M}/SS_{T}
R^{2} and eta squared = 20.133/43.733 = 0.460
That means the current model accounts for 46.0% of the variance in contributing to libido. Like just mentioned, within the ANOVA framework, R^{2} is also called eta squared, and can be interpreted as the amount of explained variance, as well as an effect size measure.
Another thing we need to calculate is the mean squares. The mean squares is desired because it eliminates the bias present in the SS_{M} and SS_{R}, and it is also used to calculate the Fstatistic and omega squared. SS_{M} and SS_{R} are biased because they are influenced by the number of values summed to calculated them. To calculate the mean squares, one divides the sum of squares (SS_{M} and SS_{R}) by the degrees of freedom respectively.
MS_{M}= SS_{M}/df_{M} = 20.135/2 = 10.067
MS_{R}= SS_{R}/df_{R} = 23.60/12 = 1.967
MS_{M} is the average amount of variance explained by the current model, MS_{R} is the average amount of variance unexplained by the current model. The ratio of MS_{M} to MS_{R} is used to calculate the Fstatistic. We don’t need to do this since we already have it, but it’s nice to understand where the numbers come from!
MS_{M}/MS_{R} = 10.067/1.967 = 5.118
The following function calculates the effect sizes mentioned, as well as the mean squares and updates the table!
def anova_table(aov): aov['mean_sq'] = aov[:]['sum_sq']/aov[:]['df'] aov['eta_sq'] = aov[:1]['sum_sq']/sum(aov['sum_sq']) aov['omega_sq'] = (aov[:1]['sum_sq'](aov[:1]['df']*aov['mean_sq'][1]))/(sum(aov['sum_sq'])+aov['mean_sq'][1]) cols = ['sum_sq', 'df', 'mean_sq', 'F', 'PR(>F)', 'eta_sq', 'omega_sq'] aov = aov[cols] return aov anova_table(aov_table)
sum_sq  df  mean_sq  F  PR(>F)  eta_sq  omega_sq  

dose  20.133333  2.0  10.066667  5.118644  0.024694  0.460366  0.354486 
Residual  23.6  12.0  1.966667 
Assumption Checks/Model Diagnostics
As mentioned earlier, when working with linear regression and ANOVA models, the assumptions pertain to the residuals and not the variables themselves. Using Statsmodels, we can use the diagnostics that is already provided. The default output is not pretty, so often times I like to print the model summary as a regression table and look there than use the following code as it’s more readable in the regression table.
results.diagn
‘jbpv’: 0.5746386969445545,
‘skew’: 0.19458085550133977,
‘kurtosis’: 1.7266590060327494,
‘omni’: 2.5173586607759626,
‘omnipv’: 0.2840288872319992,
‘condno’: 3.7320508075688776,
‘mineigval’: 1.3397459621556134}
These are the same diagnostics from the bottom of the regression table from before. The DurbanWatson tests is to detect the presence of autocorrelation (not provided when calling diagnostics this way), JarqueBera (jb; jbpv is pvalue) tests the assumption of normality, Omnibus (omni; omnipv is pvalue) tests the assumption of homogeneity of variance, and the Condition Number (condno) assess multicollinearity. Condition Number values over 20 are indicative of multicollinearity.
If the omnibus test were to be significant, an option on how to handle it would be to use a heteroscedasticity corrected coefficient covariance matrix in the .anova_lm() method. This corrects the calculations to account for the heteroscedasticity present. More information on the method can be found on it’s official documentation page.
Other ways to check assumptions
Here are some other ways to test the assumptions of the ANOVA model. I tend to use these methods when conducting an ANOVA.
Assumption: Homogeneity of Variance
One can use the Levene’s test to test for equal variances between groups. This is apart of the scipy.stats library. Official documentation can be found here. The reason I prefer using these methods is that the homogeneity of variance assumption should be checked for each level of the categorical variable. The diagnostic output provided by statsmodels appears to only test it as the whole.
stats.levene(df['libido'][df['dose'] == 'placebo'], df['libido'][df['dose'] == 'low'], df['libido'][df['dose'] == 'high'])
Levene’s test for homogeneity of variance is not significant which indicates that the groups have equal variances.
Assumption: Normality
The assumption of normality is tested on the residuals as a whole which is how the diagnostic information provided by statsmodels tests the residuals. One could use the JarqueBera test provided, or one could use Shapiro or others. I will demonstrate how to test for normality using the Shapiro method. The output is not labelled, but the numbers are the test statistic value followed by the pvalue. The official documentation can be found here.
stats.shapiro(results.resid)
The results from the ShapiroWilk test is not statistically significant which indicates that the residuals are normally distributed.
Posthoc Testing
The overall model was significant, now to test which groups differ. Deciding which groups to compare should be theory driven. There are a few different techniques that can be used. Each of these techniques have different ways of controlling for familywise error rate. 3 common methods are:

 Fisher’s Least Significant Difference (LSD): Take the groups you want to compare and conduct multiple ttests. This method requires that the ANOVA model be significant. This method is easy, but receives push back since it doesn’t account for familywise error rate. The argument is that since the overall model was significant, one is protected from increasing the familywise error rate.

 Bonferroni correction: Take the alpha the ANOVA was tested at, 0.05, then divide it by the number of planned comparisons. In this case, 0.05/3 = 0.0167. A posthoc test would have to have an alpha level < 0.0167 to be considered significant. To test the groups, conduct multiple ttests, but set the alpha value to the corrected value. This method is quick, but often considered too conservative.
 Tukey’s HSD: Method also controls for familywise error rate with a different method than Bonferroni, and is also considered conservative.
There are many other techniques out there that can be used for posthoc testing each with different guidelines for when they should be used, you are encouraged to learn about them!
Tukey’s HSD Posthoc comparison
from statsmodels.stats.multicomp import pairwise_tukeyhsd from statsmodels.stats.multicomp import MultiComparison mc = MultiComparison(df['libido'], df['dose']) mc_results = mc.tukeyhsd() print(mc_results)
The Tukey HSD posthoc comparison test controls for type I error and maintains the familywise error rate at 0.05 (FWER= 0.05 top of the table). The group1 and group2 columns are the groups being compared, the meandiff column is the difference in means of the two groups being calculated as group2 – group1, the lower/upper columns are the lower/upper boundaries of the 95% confidence interval, and the reject column states whether or not the null hypothesis should be rejected. Unfortunately, this method currently does not provide the tstatistic so treatment effect size cannot be calculated.
Bonferroni Correction Posthoc Comparison
First the corrected pvalue needs to be calculated. This can be done using the formula:
pvalue/# of comparisons = 0.05/3 = 0.01667
Now the ttests that are conducted have to have a pvalue less than 0.01667 in order to be considered significant.
stats.ttest_ind(df['libido'][df['dose'] == 'high'], df['libido'][df['dose'] == 'low'])
stats.ttest_ind(df['libido'][df['dose'] == 'low'], df['libido'][df['dose'] == 'placebo'])
stats.ttest_ind(df['libido'][df['dose'] == 'high'], df['libido'][df['dose'] == 'placebo'])
Using the Bonferroni correction, only the difference between the high dose and placebo groups are significantly different. We can calculate the high dosing’s effect size! To calculate the effect size for the treatment dosing we also need to calculate the degrees of freedom since it’s not provided. The following equations can be used:
dof = #_observations_group1 + #_observations_group2  #_of_groups
dof = 5 + 5  2 = 8
effect size r = square root of (t^{2}/t^{2} + dof)
effect size r = sqrt(1.213**2/(1.213**2 + 8)) = 0.39
The high dose has a medium effect size.
ANOVA Results Interpretation
While interpreting the ANOVA results, the Bonferroni posthoc analysis results will be used.
There was a significant effect of Difficile on the level of libido, F(2,12)= 5.12, p < 0.05, ω^{2} = 0.35. Planned posthoc testing, using the Bonferroni correction α= 0.0167, revealed that high dose of Difficile significantly increased libido compared to the placebo, t(8)=3.06, p < 0.0167, r= 0.39. There were no other statistically significant differences between groups.
Thanks for this. It was a very useful and concise primer! Any chance you have similar material for the multivariate case?
Nvmnd. Found it!
Thanks for this rich implementation. It is very helpfull to cover the main requirement when applying an anova analysis. I just have a question about the t(8) statistic in the Bonferroni Correction Posthoc Comparison between ‘high’ and ‘placebo’ group. Is it 3.0550504633038926 value or the 1.21 that you use.
Hey Pierre,
Thanks for the note and the catch! You are correct and I will fix this. The correct tvalue is 3.06. When writing I accidentally wrote out the tvalue from the low and placebo comparison instead of the tvalue from the high and placebo comparison.
Thanks again!
Corey