What is it?

The independent t-test is also called the two-sample t-test, Student’s t-test, or unpaired t-test. It’s a univariate test for a significant difference between the means of two unrelated groups.

The hypothesis being tested is:

  • Null hypothesis (H0): μ1 = μ2, which translates to the mean of group 1 is equal to the mean of group 2
  • Alternative hypothesis (HA): μ1 ≠ μ2, which translates to the mean of group 1 is not equal to the mean of group 2

If the p-value is less than the chosen significance level, most commonly 0.05, one can reject the null hypothesis.
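As a minimal sketch of this decision rule, the snippet below compares a p-value against a significance level. The two groups are made-up random data (not the data used later in this post), and the names group_1, group_2, and alpha are only illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two unrelated groups drawn from normal distributions
rng = np.random.default_rng(0)
group_1 = rng.normal(loc=3.4, scale=0.4, size=50)
group_2 = rng.normal(loc=2.8, scale=0.3, size=50)

alpha = 0.05  # significance level, chosen before running the test

# Two-sided independent t-test (equal variances assumed by default)
result = stats.ttest_ind(group_1, group_2)
reject_null = result.pvalue < alpha

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3g}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Failing to reject the null hypothesis is not the same as showing the two means are equal; it only means the data do not provide enough evidence of a difference.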

Independent t-test Assumptions

Like every test, this inferential statistical test has assumptions. The assumptions that the data must meet in order for the test results to be valid are:

  • The independent variable (IV) is categorical with at least two levels (groups)
  • The dependent variable (DV) is continuous, measured on an interval or ratio scale
  • The DV is approximately normally distributed within each of the two groups
  • The variances of the two groups are equal (homogeneity of variance)
    • This can be tested using statistical tests including Levene’s test, F-test, and Bartlett’s test.

If any of these assumptions are violated then another test should be used.
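As a rough sketch of what "another test" might look like, the snippet below runs two commonly used alternatives on made-up data with unequal variances. Welch’s t-test drops the equal-variance assumption. The Mann-Whitney U test is my own suggestion for when normality is doubtful and is not covered elsewhere in this post. The group names and values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical groups with clearly unequal variances
group_1 = rng.normal(loc=5.0, scale=0.5, size=40)
group_2 = rng.normal(loc=5.5, scale=2.0, size=40)

# Welch's t-test: same function as the independent t-test,
# but equal_var=False removes the equal-variance assumption
welch = stats.ttest_ind(group_1, group_2, equal_var=False)

# Mann-Whitney U test: rank-based, does not assume normality
mwu = stats.mannwhitneyu(group_1, group_2, alternative="two-sided")

print(f"Welch: t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
print(f"Mann-Whitney U: U = {mwu.statistic:.1f}, p = {mwu.pvalue:.3g}")
```

Which alternative is appropriate depends on which assumption is violated, so treat this as a starting point rather than a recipe.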

Data used in this example

The data used in this example is from Kaggle.com and was posted by the user Web IR. The link to the data set is here. The data set contains the sepal and petal length and width of various floral species. We will be testing to see if there is a significant difference in sepal width (the variable “sepal_width”) between the species Iris-setosa and Iris-versicolor (the variable “species”).

Let’s import pandas as pd, the data, and then take a look at what we will be working with!

import pandas as pd

# Load the iris data set
df = pd.read_csv("Iris_Data.csv")

# Summary statistics of sepal width for each species
df.groupby("species")["sepal_width"].describe()

 

species          count   mean       std  min    25%  50%    75%  max
Iris-setosa       50.0  3.418  0.381024  2.3  3.125  3.4  3.675  4.4
Iris-versicolor   50.0  2.770  0.313798  2.0  2.525  2.8  3.000  3.4
Iris-virginica    50.0  2.974  0.322497  2.2  2.800  3.0  3.175  3.8

To make the code in the next steps a bit cleaner to read, I will create two data frames that are subsets of the original data, each containing the data for only one flower species.

setosa = df[(df['species'] == 'Iris-setosa')]
versicolor = df[(df['species'] == 'Iris-versicolor')]

 

Independent t-test Example

One needs to import scipy.stats for the following steps! The library matplotlib.pyplot is also imported so that the graphs can be exported.

from scipy import stats
import matplotlib.pyplot as plt

 

Before the t-test can be conducted, one needs to test the assumptions. The first to test is homogeneity of variances, for which I will use Levene’s test. The method to conduct this test is stats.levene().

stats.levene(setosa['sepal_width'], versicolor['sepal_width'])
LeveneResult(statistic=0.66354593329432332, pvalue=0.41728596812962038)

 

The test is not significant, so there is no evidence against homogeneity of variances and we can proceed. If the test were significant, a viable alternative would be to conduct Welch’s t-test. Next is the assumption of normality. This can be checked visually with a histogram and/or a q-q plot, and statistically with the Shapiro-Wilk test, which is the stats.shapiro() method. First, I will check it visually.

setosa['sepal_width'].plot(kind="hist", title="Setosa Sepal Width")
plt.xlabel("Width (units)")  # the plotted variable is sepal width
plt.savefig('Setosa_sepal_width')

 


versicolor['sepal_width'].plot(kind="hist", title="Versicolor Sepal Width", color="green")
plt.xlabel("Width (units)")  # the plotted variable is sepal width
plt.savefig('Versicolor_sepal_width')

 


From the looks of the histograms, each variable appears to be fairly normally distributed. Let’s see how the data look on a q-q plot, which makes it easier to get a sense of normality. If you are unfamiliar with reading a q-q plot, the data points should fall on the red line. Points that lie far from the line indicate deviations from normality.

stats.probplot(setosa['sepal_width'], dist="norm", plot= plt)
plt.title("Setosa Sepal Width Q-Q Plot")
plt.savefig("Setosa_qqplot.png")

 


stats.probplot(versicolor['sepal_width'], dist="norm", plot= plt)
plt.title("Versicolor Sepal Width Q-Q Plot")
plt.savefig("versicolor_qqplot.png")
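As an optional aside, the numbers behind a q-q plot can also be inspected directly. When stats.probplot() is called without the plot argument, it returns the plotted quantiles together with the least-squares line and its correlation coefficient r. The sketch below uses made-up normal data rather than the iris measurements, and the variable names are only illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=3.0, scale=0.35, size=50)  # hypothetical data

# Without plot=..., probplot only returns the values behind the q-q plot:
# (theoretical quantiles, ordered sample values), (slope, intercept, r)
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)

print(f"q-q line: slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

An r close to 1 means the points lie close to the line. This is only a rough summary and does not replace looking at the plot.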

 


There is some deviation from normality in the Setosa q-q plot, but it does not appear to be a large violation. In all, the data look approximately normal. To be sure, we can test it statistically using the Shapiro-Wilk test for normality, which is the stats.shapiro() method. Unfortunately the output is not labeled. The first value is the W test statistic and the second value is the p-value.

stats.shapiro(setosa['sepal_width'])
(0.968691885471344, 0.20465604960918427)

 

stats.shapiro(versicolor['sepal_width'])
(0.9741330742835999, 0.33798879384994507)
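Since the two values come back in a fixed order (W first, then the p-value), one way to keep them straight is to unpack them into named variables. This is a small sketch on made-up data; w_stat and p_value are just names I chose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=3.0, scale=0.35, size=50)  # hypothetical data

# Unpack the result: first the W test statistic, then the p-value
w_stat, p_value = stats.shapiro(sample)

print(f"Shapiro-Wilk: W = {w_stat:.3f}, p = {p_value:.3f}")
```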

 

Neither of the tests for normality was significant, so there is no evidence that either variable violates the assumption of normality. We can continue as planned. To conduct the independent t-test, one needs to use the stats.ttest_ind() method.

stats.ttest_ind(setosa['sepal_width'], versicolor['sepal_width'])
Ttest_indResult(statistic=9.2827725555581111, pvalue=4.3622390160102143e-15)

 

The Independent t-test results are significant! Therefore, one can reject the null hypothesis in support of the alternative.

Another component one needs in order to report the findings is the degrees of freedom (df). This can be calculated by adding the two group Ns and subtracting 2. In our case, df = (50 + 50) - 2 = 98.
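As a cross-check, the t statistic and degrees of freedom can be reproduced from the summary statistics alone. The sketch below feeds the count, mean, and standard deviation from the describe() output into stats.ttest_ind_from_stats(). The variable names are my own, and because the inputs are rounded, the result may differ slightly in the last decimals.

```python
from scipy import stats

# Summary statistics taken from the describe() output (count, mean, std)
n_setosa, mean_setosa, sd_setosa = 50, 3.418, 0.381024
n_versicolor, mean_versicolor, sd_versicolor = 50, 2.770, 0.313798

# Degrees of freedom for the equal-variance independent t-test
dof = n_setosa + n_versicolor - 2

# Independent t-test computed from summary statistics only
t_stat, p_value = stats.ttest_ind_from_stats(
    mean_setosa, sd_setosa, n_setosa,
    mean_versicolor, sd_versicolor, n_versicolor,
    equal_var=True,
)

print(f"t({dof}) = {t_stat:.3f}, p = {p_value:.3g}")
```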

Interpretation of Results

The purpose of the current study was to test if there is a significant difference in the sepal width between the floral species Iris-setosa and Iris-versicolor. Iris-setosa’s average sepal width (M = 3.418, SD = 0.381) is wider and has greater variation than Iris-versicolor’s (M = 2.770, SD = 0.314). Levene’s test for homogeneity of variances indicated equality of variance (F = 0.664, p = 0.417); therefore an independent t-test was used. Results indicate that there is a significant difference in the sepal width between Iris-setosa and Iris-versicolor (t(98) = 9.283, p < 0.001).
