You will need Pandas and NumPy to conduct data analysis with Python. If you do not have them installed, or want to learn more about them, see each library's documentation site.
This page will go over the common steps of a data analysis. To follow along, you can download the data set, which came from kaggle.com. On a side note, Kaggle is an excellent source of free data sets to practice on, and it hosts competitions to help hone your skills.
First, you need to import the Pandas library so it is accessible. To shorten what you have to type when calling a Pandas method, we import Pandas as pd; this way we only need to type pd.method_you_want(). Importing a library under another name is like assigning it a nickname: Python knows that when you call the library by its nickname, you are really referencing the library.
import pandas as pd
Common Data Analyst Steps
- Set your working directory
- Load data
- Preview data
- Sub-setting (if desired)
- Run descriptive analyses / basic graphs
- Run statistical tests
1. Setting Working Directory
A working directory is where you store the files for the project you are working on. Essentially, it’s a folder on your hard drive that contains all the documents that are needed for your project.
An easy way to make sure you are always in the right folder, or working directory, is to use the os library in your Python script. Import it, then call os.chdir("your_file_path") to change the current working directory. The best placement for this is at the top of your script; that way Python will look in this location for everything the project needs.
There are 2 useful methods for working with working directories: os.getcwd() and os.chdir(). os.getcwd() outputs the current working directory's location, also called a file path, and os.chdir() changes the current working directory's location.
On Windows, you sometimes need to change each single "\" in the path to a double "\\" (or use a raw string) for Python to interpret the path correctly, because the backslash is Python's escape character.
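Here is a short sketch of both methods. Because changing into a real project folder depends on your machine, this example uses a temporary folder as a stand-in for your project path:

```python
import os
import tempfile

start = os.getcwd()            # file path of the current working directory
print(start)

# A temporary folder stands in for your real project path,
# e.g. os.chdir("C:\\Users\\you\\projects\\cars") on Windows.
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)              # change the working directory
    print(os.getcwd())         # now points inside the temporary folder
    os.chdir(start)            # change back before the folder is removed
```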
2. Load Data
Python supports importing data from multiple sources, including, but not limited to: csv, excel, SAS, Stata, json, html, and reading data directly from your clipboard. Each of these methods accepts various parameters to control how the data is loaded. A parameter is a value passed inside the parentheses of a method call.
With our data set, we need to use pd.read_csv("your_file_name.csv") and assign the result to a variable. Remember, in this example we imported pandas as pd; therefore, when we want to call a method from the Pandas library, we only need to type pd instead of pandas. A common variable name for a Pandas DataFrame is "df"; however, you can name it whatever you'd like, as long as it follows the naming rules. A good convention is to name the variable something that reminds the programmer what it contains.
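A minimal sketch of the loading step. The file name below is an assumption, so to keep the snippet runnable anywhere, a small inline sample (with made-up values) stands in for the real Kaggle file:

```python
import io
import pandas as pd

# In practice you would read the downloaded file, e.g.:
# df = pd.read_csv("Automobile_data.csv")   # hypothetical file name
# Here a tiny inline CSV stands in for it so the example runs on its own.
csv_text = """make,body-style,horsepower,city-mpg,highway-mpg
toyota,sedan,70,31,38
nissan,sedan,69,31,37
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (number of rows, number of columns)
```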
3. Preview / Explore Data
You must know the data set you are working with. If you haven't already explored the data set before now, this is the time. Using the method your_dataframe_name.head(), we can see the first 5 entries of each column. If you want to see more or fewer than 5 entries, pass the number to the method, like this:
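A quick sketch of .head() with and without an argument, using a small stand-in DataFrame (the values are made up, not the real car data):

```python
import pandas as pd

# Stand-in data so the snippet is self-contained
df = pd.DataFrame({"make": ["toyota"] * 12,
                   "highway-mpg": range(30, 42)})

print(df.head())     # first 5 rows (the default)
print(df.head(10))   # first 10 rows
print(df.head(3))    # first 3 rows
```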
The code will output a data frame with the first 10 observations and all the columns in the data set. The data frame below is a subset of all the columns; these are the columns that we are interested in and will be using for the rest of the example.
It’s also good to know the number of rows, known as observations, in the data set. The following code will tell us this information. The number of observations is your sample size and is important because it is a factor in what type of statistical analyses you can conduct.
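Either len() or the .shape attribute reports the row count; a sketch on stand-in data:

```python
import pandas as pd

# Stand-in data; the real data set has 205 observations
df = pd.DataFrame({"make": ["toyota", "nissan", "toyota"]})

print(len(df))     # number of observations (rows)
print(df.shape)    # (rows, columns)
```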
There are a total of 205 observations in this data set, and we know that there are at least 2 car makes in this data set. Let’s find out how many different car makes there are in the data set, and the frequency of each. We are only going to be interested in the top 2 car makes.
To do this, we will introduce a new method: your_dataframe_name['series'].value_counts(), where ['series'] is the variable, a.k.a. the column, that you are interested in. By default, this method outputs the counts in descending order (largest to smallest).
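A sketch of .value_counts() on a hypothetical sample of the make column (counts here are made up, not the real frequencies):

```python
import pandas as pd

# Hypothetical sample of the 'make' column
df = pd.DataFrame({"make": ["toyota", "toyota", "nissan",
                            "toyota", "nissan", "mazda"]})

counts = df["make"].value_counts()   # frequency of each make, descending
print(counts)
```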
4. Sub-setting (if desired)
Since we are only interested in the 5 columns from above and the top 2 car makes, we can create another data frame that is a subset of the full data set, containing only those columns and those makes. This is done by using a list when sub-setting. If you need a refresher on lists, or on indexing a data frame, see our earlier sections on those topics.
Sub-setting a data frame into a smaller data frame makes computations lighter for the computer and can save you some typing. I'm going to do this process in 2 steps so that it's easier to follow along.
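The two steps can be sketched like this. The column names and values are stand-ins (the real data set has different columns of interest), but the pattern — a list to select columns, then .isin() to filter rows — is the same:

```python
import pandas as pd

# Stand-in data with a hypothetical extra column to drop
df = pd.DataFrame({
    "make": ["toyota", "nissan", "mazda", "toyota"],
    "horsepower": [70, 69, 68, 62],
    "highway-mpg": [38, 37, 32, 39],
    "price": [7898, 7099, 5195, 6338],
})

# Step 1: keep only the columns of interest (a list used as the indexer)
cols = ["make", "horsepower", "highway-mpg"]
df_sub = df[cols]

# Step 2: keep only the top 2 makes
df_sub = df_sub[df_sub["make"].isin(["toyota", "nissan"])]
print(df_sub)
```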
5. Summary Statistics/ Basic Graphs
Now that we have the subset of the data set that we want to work with, it is good practice to get summary statistics on the variables (columns) you are interested in. This can also serve as a quality check: if you know a response for a variable is supposed to be between 1 and 5, any response outside that range is an error, and you will have to decide how to handle it.
Pandas makes this easy with the your_data_frame_name.describe() method. This will provide the count (number of observations), the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th, and 75th percentile values.
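A sketch of .describe() on stand-in highway-mpg values (not the real data):

```python
import pandas as pd

df = pd.DataFrame({"highway-mpg": [38, 37, 32, 39, 30]})  # stand-in values

summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```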
It’s nice to see the overall descriptive statistics of the data set you are working with. Now let’s see the descriptive statistics based on make.
To do this, you will need to use the your_data_frame.groupby('series') method and chain the .describe() method onto it; the output shows the summary statistics separately for each make.
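A sketch of grouped summary statistics, again on made-up values:

```python
import pandas as pd

# Stand-in data: a few hypothetical highway-mpg values per make
df = pd.DataFrame({
    "make": ["toyota", "toyota", "nissan", "nissan"],
    "highway-mpg": [38, 34, 37, 31],
})

# One row of summary statistics per make
by_make = df.groupby("make")["highway-mpg"].describe()
print(by_make)
```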
Although the difference is very small, we are interested in testing whether the difference in highway-mpg between the 2 car makes is statistically significant. To test this, we conduct a t-test if the variances are equal, or a Welch's t-test if they are not.
A library that you may become very familiar with is SciPy. It is a great library with multiple statistical tests. As with any library in Python, it needs to be imported first in order to be used. We are only going to use the stats module of the library, so only that will be imported.
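The import looks like this (assuming SciPy is installed):

```python
# Import only the stats module from the SciPy library
from scipy import stats
```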
Before we conduct any analyses, the code will be easier to read if we have 2 separate data frames, each containing only 1 make.
Note: It is possible to run Levene's test and the t-test without creating 2 separate data frames; the code is just a bit harder to read. Since this is an introductory page, easier-to-read code is the goal.
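The split can be sketched with a boolean filter per make (stand-in values again):

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["toyota", "nissan", "toyota", "nissan"],
    "highway-mpg": [38, 37, 34, 31],
})

# One data frame per make, for readability
toyota_df = df[df["make"] == "toyota"]
nissan_df = df[df["make"] == "nissan"]
```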
First, we need to test for equality of variances using Levene's test.
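A sketch of Levene's test with scipy.stats, using made-up highway-mpg samples for each make (the real data would give different numbers):

```python
from scipy import stats

# Hypothetical highway-mpg samples, one list per make
toyota_mpg = [38, 34, 30, 37, 32, 36]
nissan_mpg = [37, 31, 45, 41, 34, 33]

# Levene's test for equality of variances
stat, p = stats.levene(toyota_mpg, nissan_mpg)
print(p)   # p > 0.05 -> no evidence the variances differ
```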
Since Levene's test is not significant, we can assume equal variances and conduct a t-test for a significant difference in highway-mpg between the 2 makes.
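A sketch of the independent-samples t-test, on the same made-up samples. The equal_var flag is what switches between a Student's t-test and a Welch's t-test:

```python
from scipy import stats

# Same hypothetical samples as above
toyota_mpg = [38, 34, 30, 37, 32, 36]
nissan_mpg = [37, 31, 45, 41, 34, 33]

# equal_var=True -> Student's t-test; equal_var=False -> Welch's t-test
t_stat, p = stats.ttest_ind(toyota_mpg, nissan_mpg, equal_var=True)
print(p)   # compare against your significance level, e.g. 0.05
```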
As one might have expected, the observed difference in highway-mpg between the 2 makes is not statistically significant.