Sub-setting makes working with large data frames easier. Commonly, a researcher wants to use a smaller portion of the data set, or they want to have subsets of the data set by a certain categorical variable classifier (car make, disease state, group type, etc.) stored in different Python objects that contain only the data pertaining to that categorical classifier. Ways of sub-setting all data that pertain to a categorical variable classifier can be done with the following methods below combined with using an assignment statement. Examples will be provided.

 

Selection Method Description
DataFrame[‘column_label_desired‘] Selects the column(s) desired and all of the rows.
Pass a list if more than one column is desired
DataFrame.loc[rows_desired, ‘column_label_desired‘] Selects the desired number of rows, and column label(s) desired.
Pass a list if more than one column is desired
DataFrame.iloc[rows_desired, column_position_desired] Selects the desired rows, and column position desired.
Pass a list of column positions if they are not next to each other,
otherwise use a slice to select the range of the positions.

 

The difference between using .loc[] and .iloc[] is that .loc[] selects the columns by column label (column name), whereas .iloc[] selects the columns by position. The column position starts at 0, just like the row indexes.

For both of these methods, if you want all rows for the desired columns you pass a colon, “:”, in the rows_desired parameter.

Data used for Examples

The data set used on this page was downloaded from Kaggle.com from the user Miroslav Sabo. To download, go to our GitHub page (https://github.com/Opensourcefordatascience/Data-sets), or get it from Kaggle (https://www.kaggle.com/miroslavsabo/young-people-survey).

Note: The file from our GitHub page is modified from the original .csv file. In our version, a “Participant Number” column has been added. This column is arbitrarily assigned.

Data Preview

The data has been assigned to a Python object named resp_df.

resp_df.head()

 

Participant
Number
Music Slow songs
or fast songs
Dance Folk Country Classical
music
Musical Pop Rock
0 1 5.0 3.0 2.0 1.0 2.0 2.0 1.0 5.0 5.0
1 2 4.0 4.0 2.0 1.0 1.0 1.0 2.0 3.0 5.0
2 3 5.0 5.0 2.0 2.0 3.0 4.0 5.0 3.0 5.0
3 4 5.0 3.0 2.0 1.0 1.0 1.0 1.0 2.0 2.0
4 5 5.0 3.0 4.0 3.0 2.0 4.0 3.0 5.0 3.0

Sub-setting by Column(s)

DataFrame[‘column_label_desired‘]

Selecting a column or columns this way will by default include data from every row as well. Most often this is what is desired. Selecting data this way is often used for sub-setting a data set.

sub_folk = resp_df['Folk']

sub_folk.head()

 

Folk
0 1.0
1 1.0
2 2.0
3 1.0
4 3.0

 

If you want to select more than 1 column, pass a list.

sub_folk_country = resp_df[['Folk', 'Country']]

sub_folk_country.head()

 

Folk Country
0 1.0 2.0
1 1.0 1.0
2 2.0 3.0
3 1.0 1.0
4 3.0 2.0

DataFrame.loc[rows_desired, ‘column_label_desired‘]

Sub-setting a data frame using the .loc[] method is not much different than the previous method shown. The main advantage of using this method is that you can identify specific rows that are of interest- you still identify the column(s) of interest using the column label name(s).

Identifying the rows of interested is done using the index, and can be identified using a slice range or by passing a list of specific row indexes.

sub_music = resp_df.loc[[0,2,3], 'Music']

sub_music

 

Music
0 5.0
2 5.0
3 5.0

 

You can also pass a list of column labels (column names) if you want more than 1 column.

sub_music_rock = resp_df.loc[2:7, ['Music', 'Rock']]

sub_music_rock

 

Music Rock
2 5.0 5.0
3 5.0 2.0
4 5.0 3.0
5 5.0 5.0
6 5.0 3.0
7 5.0 5.0

DataFrame.iloc[rows_desired, column_position_desired]

The .iloc[] method has the same capabilities as the .loc[] method. The difference between the two is that with the .iloc[] method you identify the columns of interest using their position instead of label name. The column position index starts at 0, just like the row index.

The advantage of using the .iloc[] method over the .loc[] method is if there are multiple columns of interest next to each other and you know the position of them, you can pass numbers instead of typing out multiple label names. Essentially, it saves time typing.

sub_music = resp_df.iloc[[0,2,3], 1]

sub_music

 

Music
0 5.0
2 5.0
3 5.0

 

sub_music_rock = resp_df.iloc[2:7, [1, 9]]

sub_music_rock

 

Music Rock
2 5.0 5.0
3 5.0 2.0
4 5.0 3.0
5 5.0 5.0
6 5.0 3.0

Sub-setting by Column Value

It is often desired to subset data by a column’s values. The general code structure is below. You can sub-set based on strings, or numerical values. If the column of interest contains numerical values, you can use any of the conditional operators to filter the observations.

# Column of interest contains string values
New_Python_Object = data_frame[(data_frame["Column_of_Interest"] == "value")]

# Column of interest contains numeric values
New_Python_Object = data_frame[(data_frame["Column_of_Interest"] >= 10)]

 

In our data set, there is a column labeled “Village – town” and contains categorical data with the values of either “city” or “village”. We will use this column in our example.

resp_df["Village - town"].value_counts()
city 707
village 299

 

To subset the data that only includes the observations that have value of “city”, we enter the following code.

sub_city = resp_df[(resp_df["Village - town"] == "city")]            
          
sub_city["Village - town"].value_counts()
city 707

 

Sub-setting by Multiple Column Values

It is possible to sub-set based on multiple column values. The general code structure looks likes this.

New_Python_Object = data_frame[(data_frame["Column_of_Interest"] == "value") &
                               (data_frame["Column_of_Interest"] >= 10)]

 

For every new column value condition you want to filter on, it has to be entered in it’s own parenthesis statement and have an ampersand (&) between each parenthesis. This is saying you want values that are condition 1, and condition 2, and condition 3, etc.

Using our data set, let’s see the code for filtering on individuals that live in a city and scored pop music as a 3 or higher.

sub = resp_df[(resp_df["Village - town"] == "city") & 
              (resp_df['Pop'] >= 3)]

 

Now our sub-set data frame contains all observations that have a value of city, and scored pop music as a 3 or greater.