What is a correlation?
A correlation is a statistical test of association between variables that is measured on a -1 to 1 scale. The closer the correlation value is to -1 or 1, the stronger the association; the closer to 0, the weaker the association. It measures how change in one variable is associated with change in another variable.
There are a few common tests to measure the level of correlation: Pearson, Spearman, and Kendall. Each has its own assumptions about the data that need to be met in order for the test to accurately measure the level of correlation. These are discussed further in the post. Each type of correlation test is testing the following hypotheses.
H₀ (null hypothesis): There is no relationship between variable 1 and variable 2
H₁ (alternative hypothesis): There is a relationship between variable 1 and variable 2
If the obtained p-value is less than the alpha level it is being tested at, then one can state that there is a significant relationship between the variables. The alpha level can vary by field, but most fields use an alpha level of 0.05. For our purposes, we will use an alpha level of 0.05.
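To make this decision rule concrete, here is a minimal sketch in Python, assuming a hypothetical p-value of 0.03 from some correlation test:

alpha = 0.05    # significance level used throughout this post
p_value = 0.03  # hypothetical p-value from a correlation test

if p_value < alpha:
    print("Reject H0: there is a significant relationship")
else:
    print("Fail to reject H0: no significant relationship detected")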
There are a few types of correlation:

 Positive correlation: as one variable increases so does the other

 Negative (inverse) correlation: as one variable increases the other variable decreases

 No correlation: there is no association between the changes in the two variables
The strength of the correlation matters. The closer the absolute value of r is to 1, the stronger the correlation.
r value      Strength
0.0 – 0.2    Weak correlation
0.3 – 0.6    Moderate correlation
0.7 – 1.0    Strong correlation
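As an illustration, a small hypothetical helper that labels a coefficient using the bands above. Note that the table leaves small gaps between bands (e.g. 0.25); the cutoffs below are one reasonable reading, not the only one:

def correlation_strength(r):
    # Rough label for a correlation coefficient, following the table above
    r = abs(r)
    if r >= 0.7:
        return "Strong correlation"
    elif r >= 0.3:
        return "Moderate correlation"
    return "Weak correlation"

print(correlation_strength(-0.85))  # Strong correlation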
Pearson Correlation Assumptions
The Pearson correlation test is a parametric test that makes assumptions about the data. In order for the results of a Pearson correlation test to be valid, the data must meet these assumptions. If any of the assumptions are violated, then another test should be used. The assumptions are:
 Each variable should be either ratio or interval measurements
 Related pairs of data; each participant should have data for each variable being compared
 Absence of outliers (one way to check for these is sketched after this list)
 A linear relationship between the two variables is present
  When plotted, the data points form a line and do not curve
 There is homoscedasticity of the data
  When plotted, the data points do not form a cone-like shape
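Checking for outliers is not shown later in this post, so here is a minimal sketch of one common approach, the 1.5 × IQR rule. This is only an illustration of one option, not the only way to screen for outliers:

import pandas as pd

def iqr_outliers(series):
    # Flag values more than 1.5 * IQR outside the middle 50% of the data
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]

# Example usage once the data is loaded later in this post:
# print(iqr_outliers(df['carat']))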
Spearman Rank Correlation Assumptions
The Spearman rank correlation is a nonparametric test that does not make assumptions about the distribution of the data. The two assumptions for the Spearman rank correlation test are:
 Each variable should be either ordinal, ratio, or interval measurements
 There is a monotonic relationship between the variables being tested
 A monotonic relationship exists when, as one variable increases, the other either consistently increases or consistently decreases.
Kendall’s Tau-b Correlation Assumptions
The Kendall’s Tau-b correlation is a nonparametric test that does not make assumptions about the distribution of the data. The two assumptions for the Kendall’s Tau-b correlation test are:
 Each variable should be either ordinal, ratio, or interval measurements
 There should be a monotonic relationship between the variables being tested
  Kendall’s Tau-b tests whether there is a monotonic relationship, so this assumption is not a strict one for this test
An easy way to check for a linear relationship, homoscedasticity of the data, and/or a monotonic relationship is to plot the variables.
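For example, pandas provides a scatter_matrix helper that plots every pairwise combination at once. A quick sketch, assuming the DataFrame df that is loaded in the next section:

import pandas as pd

# Plot every pairwise scatter plot (with histograms on the diagonal)
pd.plotting.scatter_matrix(df[['carat', 'price', 'depth']], figsize=(8, 8))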
Data used for this Example
The data used in this example comes from Kaggle.com, uploaded by the user shivamagrawal. The data set contains information on attributes of diamonds. A link to the Kaggle source of the data set is here.
For this example, we will test if there is a significant correlation between the carat and price of diamonds. These are variables “carat” and “price”, respectively, in the data set. We will also be using the variable “depth” later in this post. Let’s take a look at these variables.
import pandas as pd

df = pd.read_csv("diamonds.csv")
df[["carat", "price", "depth"]].describe()
        carat       price         depth
count   53,940      53,940        53,940
mean    0.797940    3932.799722   61.749405
std     0.474011    3989.439738   1.432621
min     0.200000    326.000000    43.000000
25%     0.400000    950.000000    61.000000
50%     0.700000    2401.000000   61.800000
75%     1.040000    5324.250000   62.500000
max     5.010000    18823.000000  79.000000
Checking the Assumptions
Let’s plot these variables to check the assumptions of a linear relationship and homoscedasticity. We can use Pandas’s built-in scatter plot method to complete this task.
df.plot.scatter("carat", "price")
Based on the scatter plot, it appears that we are violating Pearson’s correlation assumption of homoscedasticity. For a more rigorous check, we can test the variances using Levene’s test.
import scipy.stats as stats

stats.levene(df['carat'], df['price'])
Levene’s test for equal variances is significant, meaning we violate the assumption of homoscedasticity. Given that, the appropriate correlation test to use would be the Spearman Rank correlation test. For demonstration purposes, we will also conduct the Pearson correlation test.
The Pandas Correlation Method
To conduct the correlation test itself, we can use the .corr() method. This method conducts the correlation test between the variables and excludes missing values. It can conduct the correlation test using the Pearson (the default), Kendall, or Spearman method. Official documentation is here.
df['carat'].corr(df['price'])
0.92 
df['carat'].corr(df['price'], method='spearman')
0.96 
You can see that the two different methods of calculating the level of correlation between the variables produce different results. Both methods show a strong level of correlation between the carat of the diamond and the price of the diamond. Unfortunately, the built-in Pandas method does not provide a way to obtain the p-value for the correlation test. But fear not! There are other methods available to use.
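As an aside, .corr() can also be called on an entire DataFrame to produce a full correlation matrix in one step, which is handy when comparing more than two variables:

# Pearson correlation matrix (the default)
df[['carat', 'price', 'depth']].corr()

# Spearman rank correlation matrix
df[['carat', 'price', 'depth']].corr(method='spearman')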
Pearson Correlation Method using scipy.stats.pearsonr()
Scipy Stats is a great library for Python with many statistical tests available. Here is the full list of statistical tests available in this library. We will be using the pearsonr() method. This library was imported above as stats for Levene’s test; the import is repeated here for completeness.
import scipy.stats as stats
Now let’s conduct the correlation test using the pearsonr() method for demonstration purposes and obtain both the correlation value and p-value. Unfortunately the output is not labeled, but it is printed in the order of (r value, p-value).
stats.pearsonr(df['carat'], df['price'])
(0.92, 0.0) 
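Since the output tuple is unlabeled, one option is to unpack it into named variables; a minimal sketch:

r_value, p_value = stats.pearsonr(df['carat'], df['price'])
print(f"r = {r_value:.2f}, p-value = {p_value:.4f}")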
We see that the correlation is strong and significant between the size of the carat and the price of the diamond. Remember, this is not the correct correlation test to use for our data as it violated the assumption of homoscedasticity. We should conduct the correlation test using the Spearman Rank method or Kendall’s Tau-b.
Spearman Rank Correlation Method using scipy.stats.spearmanr()
Now let’s conduct the correlation test using the spearmanr() method and obtain both the correlation value and p-value. Given our data, this is a viable alternative correlation method to use.
stats.spearmanr(df['carat'], df['price'])
(correlation=0.96, pvalue=0.0)
We see that the correlation is strong and significant between the size of the carat and the price of the diamond.
Kendall’s Tau-b Correlation Method using scipy.stats.kendalltau()
Now let’s conduct the correlation test using the kendalltau() method and obtain both the correlation value and p-value. Given our data, this is a viable alternative correlation method to use. It is used less often than the Spearman Rank correlation, but this method handles ties in the data better.
stats.kendalltau(df['carat'], df['price'])
(correlation=0.83, pvalue=0.0)
Using this method, we see that there is still a strong correlation between the size of the carat and the price of the diamond.
Correlation on Multiple Variables
There will be times when you want to find the correlation values and p-values for multiple variables. It would be tiresome to repeat all the steps we just completed for every pair of variables of interest. Fear not! Below are custom functions that will save you time. You will have to decide if you want to use listwise deletion or pairwise deletion.
Listwise deletion
Listwise deletion keeps only the cases that are complete for all variables being tested at that time. Meaning, if testing 3 variables that each have a different number of completed cases (N), the largest possible N is the smallest N of those 3. In practice it uses only the cases with complete observations for all 3 variables, so the N could be smaller still.
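To make this concrete, here is a small sketch with a hypothetical DataFrame containing missing values:

import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, np.nan, 4],
                    "b": [1, np.nan, 3, 4],
                    "c": [1, 2, 3, 4]})

# Listwise deletion keeps only rows complete across all three columns
print(len(toy.dropna(how="any")))  # 2 complete cases remain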
This function conducts listwise deletion, finds the Pearson, Spearman Rank, or Kendall’s Tau-b correlation values, and the significance of those values. It also has the ability to export the results to a csv file. The correlation values and p-values are each printed in their own table. You need to import pandas and scipy.stats for this function to work.
import pandas as pd
import scipy.stats as stats

def lwise_corr_pvalues(df, to_csv=False, file_name=None, method=None):
    # Listwise deletion: keep only the rows that are complete
    # across every numeric variable passed in
    df = df.dropna(how='any')._get_numeric_data()

    # Empty DataFrames with the variables as both rows and columns
    dfcols = pd.DataFrame(columns=df.columns)
    rvalues = dfcols.transpose().join(dfcols, how='outer')
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    length = str(len(df))

    # Pick the correlation test; Pearson is the default
    if method is None:
        test = stats.pearsonr
        test_name = "Pearson"
    elif method == "spearman":
        test = stats.spearmanr
        test_name = "Spearman Rank"
    elif method == "kendall":
        test = stats.kendalltau
        test_name = "Kendall's Tau-b"

    # Fill in the correlation (r) and p-value matrices
    for r in df.columns:
        for c in df.columns:
            rvalues.loc[c, r] = round(test(df[r], df[c])[0], 4)
            pvalues.loc[c, r] = format(test(df[r], df[c])[1], '.4f')

    print("Correlation test conducted using listwise deletion", "\n",
          "Total observations used: ", length, "\n", "\n",
          f"{test_name} Correlation Values", "\n", rvalues, "\n",
          "Significant Levels", "\n", pvalues)

    if to_csv:
        # Append a header, the r values, and the p-values to the csv file
        with open(file_name, 'a') as file:
            file.write("Correlation test conducted using listwise deletion" + "\n" +
                       f"{test_name} correlation values" + "\n")
            file.write("Total observations used: " + length + "\n")
        rvalues.to_csv(file_name, header=True, mode='a')
        with open(file_name, 'a') as file:
            file.write("pvalues" + "\n")
        pvalues.to_csv(file_name, header=True, mode='a')
lwise_corr_pvalues(df[['carat', 'price', 'depth']])
Correlation test conducted using listwise deletion
Total observations used: 53,940

Pearson Correlation Values
         carat    price    depth
carat   1.0000   0.9216   0.0282
price   0.9216   1.0000  -0.0106
depth   0.0282  -0.0106   1.0000

Significant Levels
         carat    price    depth
carat   0.0000   0.0000   0.0000
price   0.0000   0.0000   0.0134
depth   0.0000   0.0134   0.0000
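The method and to_csv arguments work as described above; for instance (the file name here is just an example):

# Spearman Rank version of the same tables
lwise_corr_pvalues(df[['carat', 'price', 'depth']], method='spearman')

# Write the results out to a csv file as well
lwise_corr_pvalues(df[['carat', 'price', 'depth']], to_csv=True,
                   file_name='listwise_corrs.csv')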
Pairwise deletion
Using pairwise deletion when calculating correlations between multiple variables will only remove the missing cases for the 2 variables being compared at that time. Meaning, each correlation value could be based on a different N.
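Continuing the hypothetical toy DataFrame from the listwise example, pairwise deletion keeps more cases for pairs that happen to be complete:

# The a & c pair has 3 complete cases, even though only
# 2 rows are complete across all three columns
print(len(toy[["a", "c"]].dropna(how="any")))  # 3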
If you want to run correlations using the pairwise deletion method, then use this custom function! It removes missing values only for the 2 variables being compared at that time and reports the Pearson, Spearman Rank, or Kendall’s Tau-b correlation value, the p-value, and the number of observations used in the calculation. It also has the ability to export the results to a csv file. You have to import pandas, scipy.stats, and itertools for this to work.
import pandas as pd
import scipy.stats as stats
import itertools

def pwise_corr_pvalues(df, to_csv=False, file_name=None, method=None):
    correlations = {}
    pvalues = {}
    length = {}
    columns = df.columns.tolist()

    # Pick the correlation test; Pearson is the default
    if method is None:
        test = stats.pearsonr
        test_name = "Pearson"
    elif method == "spearman":
        test = stats.spearmanr
        test_name = "Spearman Rank"
    elif method == "kendall":
        test = stats.kendalltau
        test_name = "Kendall's Tau-b"

    # Pairwise deletion: drop missing values only for the two
    # variables being compared at that time
    for col1, col2 in itertools.combinations(columns, 2):
        sub = df[[col1, col2]].dropna(how="any")
        pair = col1 + " & " + col2
        correlations[pair] = format(test(sub.loc[:, col1], sub.loc[:, col2])[0], '.4f')
        pvalues[pair] = format(test(sub.loc[:, col1], sub.loc[:, col2])[1], '.4f')
        length[pair] = len(sub)

    # Assemble the r values, p-values, and Ns into one results table
    corrs = pd.DataFrame.from_dict(correlations, orient="index")
    corrs.columns = ["r value"]
    pvals = pd.DataFrame.from_dict(pvalues, orient="index")
    pvals.columns = ["pvalue"]
    l = pd.DataFrame.from_dict(length, orient="index")
    l.columns = ["N"]
    results = corrs.join([pvals, l])

    if not to_csv:
        print(f"{test_name} correlation", "\n", results)
    else:
        print(results)
        results.to_csv(file_name, header=True, mode='w', float_format='%.4f')
pwise_corr_pvalues(df[['carat', 'price', 'depth']])
                 r value   pvalue        N
carat & price    0.9216    0.0000   53,940
carat & depth    0.0282    0.0000   53,940
price & depth   -0.0106    0.0134   53,940
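As with the listwise function, other methods and csv export work the same way; for example (the file name here is just an example):

# Kendall's Tau-b version of the same table
pwise_corr_pvalues(df[['carat', 'price', 'depth']], method='kendall')

# Write the results out to a csv file as well
pwise_corr_pvalues(df[['carat', 'price', 'depth']], to_csv=True,
                   file_name='pairwise_corrs.csv')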