Descriptive statistics summarize the data and fall into two groups: measures of central tendency (mean, median, and mode) and measures of variability (standard deviation, minimum/maximum values, range, kurtosis, and skewness).

Measures of central tendency

Mean is the average value of the data.
Median is the middle value of the data once it is sorted.
Mode is the value that occurs most often in the data.
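
To make these three definitions concrete, here is a minimal sketch on a tiny made-up Series. The numbers are hypothetical and do not come from the data set used later in this post.

```python
import pandas as pd

# Hypothetical toy data, not taken from responses.csv
values = pd.Series([1, 2, 2, 3, 7])

mean_value = values.mean()      # (1 + 2 + 2 + 3 + 7) / 5 = 3.0
median_value = values.median()  # middle of the sorted values = 2.0
mode_values = values.mode()     # most frequent value(s), returned as a Series

print(mean_value, median_value, mode_values.tolist())
```

Note that .mode() returns a Series rather than a single number, because more than one value can be tied for most frequent.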

Measures of variability

Standard deviation tells how much deviation is present in the data, i.e. how spread out the numbers are from the mean value.
Minimum value is the smallest number in the data.
Maximum value is the largest number in the data.
Range is the largest number minus the smallest number.
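Kurtosis is a measure of tailedness, i.e. how heavy the tails of the distribution are compared to a normal distribution.
Skew is a measure of asymmetry, i.e. whether the data stretch out further on one side of the mean than the other.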
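
The measures above can be sketched on a small made-up Series. The data are hypothetical; the one thing Pandas does not offer as a single method is the range, so I compute it as the maximum minus the minimum.

```python
import pandas as pd

# Hypothetical toy data, not taken from responses.csv
scores = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

std_value = scores.std()             # sample standard deviation (ddof=1 by default)
min_value = scores.min()             # smallest number
max_value = scores.max()             # largest number
value_range = max_value - min_value  # range = largest - smallest
skew_value = scores.skew()           # positive here: the longer tail is on the right
kurt_value = scores.kurtosis()       # excess kurtosis: a normal distribution scores 0

print(std_value, min_value, max_value, value_range, skew_value, kurt_value)
```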

We can get most of the descriptive statistics using the .describe() method from Pandas. The .describe() method returns the count, mean, standard deviation (std), minimum/maximum values, and the 25th, 50th, and 75th percentile values.

There is also ResearchPy, which provides nice descriptive statistic options for continuous and categorical data types. For continuous data, it provides the number of non-missing observations, mean, standard deviation (std), standard error, and the 95% confidence interval. For categorical data, it provides the count and percent of the total for each category. I will demonstrate each of these below.

Remember, first you have to import the libraries Pandas and ResearchPy!

import pandas as pd
import researchpy as rp


and our data set!

resp_df = pd.read_csv("responses.csv")


Now to demonstrate each method on some specific variables of interest. First up, the .describe() method.

resp_df[['Number of siblings', 'Chemistry', 'Mathematics', 'Biology']].describe()


       Number of siblings  Chemistry  Mathematics  Biology
count              1004.0     1000.0       1007.0   1004.0
mean                 1.30       2.17         2.33     2.67
std                  1.01       1.38         1.35     1.38
min                   0.0        1.0          1.0      1.0
25%                   1.0        1.0          1.0      2.0
50%                   1.0        2.0          2.0      2.0
75%                   2.0        3.0          3.0      4.0
max                  10.0        5.0          5.0      5.0


Now to get summary statistics on this continuous data using the summary_cont() method from ResearchPy.

rp.summary_cont(resp_df[['Number of siblings', 'Chemistry', 'Mathematics', 'Biology']])


Variable                 N      Mean        SD        SE  95% Conf.  Interval
Number of siblings  1004.0  1.297809  1.013348  0.031981   1.235051  1.360566
Chemistry           1000.0  2.165000  1.378287  0.043585   2.079471  2.250529
Mathematics         1007.0  2.334657  1.352496  0.042621   2.251022  2.418293
Biology             1004.0  2.665339  1.384127  0.043683   2.579619  2.751059
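
To see where the SE and confidence interval columns come from, here is a rough check using the Number of siblings row above. The standard error is the standard deviation divided by the square root of N. For the interval I am assuming a normal approximation (a z value of about 1.96), which is my simplification and not necessarily what ResearchPy does internally, so the bounds only agree with the table to about three decimal places.

```python
import math
from statistics import NormalDist

# Values copied from the "Number of siblings" row of the summary_cont() output
n = 1004
mean = 1.297809
sd = 1.013348

se = sd / math.sqrt(n)           # standard error of the mean

# Assumption: normal approximation for the 95% interval
z = NormalDist().inv_cdf(0.975)  # about 1.96
ci_low = mean - z * se
ci_high = mean + z * se

print(round(se, 6), round(ci_low, 4), round(ci_high, 4))
```

With roughly 1,000 observations, the normal approximation and a t-based interval are nearly identical, which is why the result lands so close to the reported bounds.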

Now to get descriptives on categorical data.

rp.summary_cat(resp_df['Education'])


Variable   Outcome                           Count  Percent
Education  secondary school                    621    61.55
           college/bachelor degree             212    21.01
           masters degree                       81     8.03
           primary school                       80     7.93
           currently a primary school pupil     10     0.99
           doctorate degree                      5     0.50
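
If you only have Pandas available, a similar count and percent table can be sketched with .value_counts(). The Series below is hypothetical toy data, and in this sketch the percent is taken over the non-missing values only.

```python
import pandas as pd

# Hypothetical toy data, not taken from responses.csv
education = pd.Series([
    "secondary school",
    "secondary school",
    "college/bachelor degree",
    "masters degree",
    None,  # a missing answer; value_counts() skips it by default
])

counts = education.value_counts()
percents = (education.value_counts(normalize=True) * 100).round(2)

# Combine both into one table, aligned on the category labels
summary = pd.DataFrame({"Count": counts, "Percent": percents})
print(summary)
```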

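To get the other measures, we have to call the .mode(), .median(), .kurtosis(), and .skew() methods separately. Note that .kurtosis() in Pandas reports excess kurtosis, so a normal distribution scores 0 and negative values indicate lighter tails than normal.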

resp_df[['Number of siblings', 'Chemistry', 'Mathematics', 'Biology']].mode()


   Number of siblings  Chemistry  Mathematics  Biology
0                 1.0        1.0          1.0      2.0

resp_df[['Number of siblings', 'Chemistry', 'Mathematics', 'Biology']].median()


Number of siblings 1.0
Chemistry 2.0
Mathematics 2.0
Biology 2.0
dtype: float64

resp_df[['Number of siblings', 'Chemistry', 'Mathematics', 'Biology']].kurtosis()


Number of siblings 7.44
Chemistry -0.41
Mathematics -0.86
Biology -1.07
dtype: float64

resp_df[['Number of siblings', 'Chemistry', 'Mathematics', 'Biology']].skew()


Number of siblings 1.77
Chemistry 0.96
Mathematics 0.61
Biology 0.41
dtype: float64
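
As an alternative to calling each method one at a time, several of these statistics can be collected into a single table with .agg(). This is a minimal sketch on a hypothetical toy DataFrame; the column names a and b are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from responses.csv
df = pd.DataFrame({
    "a": [1, 2, 2, 3, 7],
    "b": [5, 5, 6, 8, 9],
})

# One row per statistic, one column per variable
summary = df.agg(["mean", "median", "std", "skew", "kurt"])
print(summary)
```

I left the mode out on purpose: .mode() can return more than one value per column, so it does not fit neatly into a one-row-per-statistic table.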
