Recoding variables is sometimes necessary when you want to create new variable groups or convert a categorical variable to numeric, or vice versa. To do this, you write a function and apply it to the variable. If you are rusty on functions, it is worth reviewing them first.

Data used for Examples

The data set used on this page was uploaded to Kaggle.com by the user Miroslav Sabo. To download it, go to our GitHub page (https://github.com/Opensourcefordatascience/Data-sets), or get it from Kaggle (https://www.kaggle.com/miroslavsabo/young-people-survey).

Note: The file from our GitHub page is modified from the original .csv file. In our version, a “Participant Number” column has been added. This column is arbitrarily assigned.

import pandas as pd
resp_df = pd.read_csv("responses.csv")

 

Recoding using Functions

In our data set, there is a variable named “Village - town” that contains the categorical values “city” and “village”.

resp_df['Village - town'].value_counts()

 

Village – town
city 707
village 299

However, we want the values to reflect the variable name, i.e. have data values of “village” and “town”. To do this, we will create a function and then apply it to that variable to recode all values that are “city” to “town”. Here, we will introduce the .apply() method, which calls the function once for every value in the column.
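Before using it on the survey data, it may help to see .apply() on its own. The following is a minimal sketch on a small made-up Series; the names toy and shout are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from responses.csv
toy = pd.Series(["city", "village", "city"])

def shout(value):
    # .apply() passes each element (a plain string here), not the whole Series
    return value.upper()

# The result is a new Series; toy itself is left untouched
result = toy.apply(shout)

print(result.tolist())  # ['CITY', 'VILLAGE', 'CITY']
print(toy.tolist())     # ['city', 'village', 'city']
```

Because the original Series is left as it was, the result has to be stored somewhere to be kept.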

.apply() returns a new Series rather than changing the original, so to make the recoding stick we must assign the result back to the same column. If that’s unclear, here it is in action.

def village(value):
    # .apply() passes one cell value at a time, not the whole Series
    if value == 'city':
        return 'town'
    else:
        return value

resp_df['Village - town'] = resp_df['Village - town'].apply(village)
        
resp_df['Village - town'].value_counts()

 

Village – town
town 707
village 299

As you can see, all the values are now either “town” or “village”.
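For a simple one-to-one swap like this, a function is not strictly required. As a minimal sketch of an alternative, pandas' Series.replace() can do the same recode. The small Series below is made-up stand-in data rather than the real column, and the names place and recoded are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for resp_df['Village - town']
place = pd.Series(["city", "village", "city"])

# Series.replace() swaps the listed values and leaves everything else as-is
recoded = place.replace({"city": "town"})

print(recoded.tolist())  # ['town', 'village', 'town']
```

Series.map() with the same dictionary would turn every value not listed in it into NaN, so .replace() is the closer match to the function above. A function with .apply() remains the more flexible choice when the recoding rule is more than a lookup.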

Creating a New Variable with Recoding

You can easily create a new variable that contains recoded values of another variable. To do this, you follow all the steps above with one exception. Instead of assigning the recoded variable to itself, you assign it to a new variable.

In this example, we will take the age variable that contains numeric values and create a new variable that contains categorical values that will represent age groupings.

def age_groups(age):
    # Missing ages (NaN) fail every comparison, so they return None
    if age < 20:
        return "15-19 yrs"
    elif age < 25:
        return "20-24 yrs"
    elif age >= 25:
        return "25-30 yrs"

resp_df['Age Group'] = resp_df['Age'].apply(age_groups)

resp_df['Age Group'].value_counts(sort=False)

 

Age Group
15-19 yrs 426
20-24 yrs 480
25-30 yrs 97
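Binning a numeric variable into ranges can also be done without writing a function. The following is a sketch using pandas' pd.cut() on made-up ages. It assumes the ages fall between 15 and 30, as the group labels suggest, and the names ages and age_group are hypothetical.

```python
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([15, 19, 20, 24, 25, 30, None])

# right=False makes each bin include its left edge and exclude its right edge:
# [15, 20), [20, 25), [25, 31)
age_group = pd.cut(
    ages,
    bins=[15, 20, 25, 31],
    labels=["15-19 yrs", "20-24 yrs", "25-30 yrs"],
    right=False,
)

print(age_group.tolist())
# ['15-19 yrs', '15-19 yrs', '20-24 yrs', '20-24 yrs', '25-30 yrs', '25-30 yrs', nan]
```

One difference from the function above: with these bin edges, any age below 15 or above 30 would come back as missing, whereas age_groups() places such values in the lowest or highest group.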
