What this section covers:

  • How to merge and update an existing Pandas data frame

This builds off of the Join and Merge Pandas Data Frame page. This page shows how to update an existing data frame with new values.

First, let’s import pandas as pd!

import pandas as pd

 

Data used in this example

Now let’s create a small, simple data set (data frame) for demonstration purposes.

org_df = pd.DataFrame({'ID': ['11', '12', '13', '14'],
                       'NAME': ['Kiara', 'Rajah', 'Simba', np.nan],
                       'TEST 1': [66, 99, 75, 23]})

new_data_df = pd.DataFrame({'ID': ['11', '14', '15'],
                       'NAME': ['Kiara', 'Wilbert', 'Momma'],
                       'TEST 1': [75, np.nan, 23]})

 

Now let’s take a look at each dataframe, first the original and then the new dataframe.

org_df

 

ID NAME TEST 1
11 Kiara 66
12 Rajah 99
13 Simba 75
14 NaN 23

 

new_data_df

 

ID NAME TEST 1
11 Kiara 75
14 Wilbert NaN
15 Momma 23

 

In the original dataframe there are 4 observations with one of those missing a value in the ‘NAME’ column. In the new dataframe (the one that is going to be merged with the original), there are 3 observations. Of those, ID 11 has a new test score, ID 14 updates the missing value in the ‘NAME’ column, and there is a new observation to be added. Let’s get to work!

Merge and Updating an Existing Dataframe

If you need a refresher on the options available for the pd.merge() method, take a look at Join and Merge Pandas Data Frame page or the official documentation page. For this example, I want all observations that are in both dataframes (how= ‘outer’), to merge on the ID column (on= ‘ID’), change the merging suffixes from ‘_x’ and ‘_y’ to ‘_m’ and ‘_n’ (suffixes=(‘_m’, ‘_n’)), and to have and indicator column to know if the column was in both dataframes (indicator= True).

combined_df = pd.merge(org_df, new_data_df, how = 'outer', on= 'ID', 
                       suffixes= ('_m', '_n'), indicator= True)

combined_df

 

ID NAME_m TEST 1_m NAME_n TEST 1_n _merge
11 Kiara 66 Kiara 75 both
12 Rajah 99 NaN NaN left_only
13 Simba 75 NaN NaN left_only
14 NaN 23 Wilbert NaN both
15 NaN NaN Momma 23 right_only

 

The ‘_m’ suffix indicates the column came from the original dataframe (org_df), while the ‘_n’ indicates the column came from the new data frame (new_data_df). The next step is to consolidate the the columns into one. The result that is desired is to:

  • keep the original data if there is nothing in the new dataframe to update it with, and
  • if there is a value in the new dataframe, replace the data in the original dataframe with this new data,

We can achieve this with the following code. Below is the general code structure with dummy values followed by the code that works for this example.

combined_df['column_of_interest'] = combined_df['column_of_interest_m'].where(combined_df['column_of_interest_n'].isnull(), combined_df['column_of_interest_n'])

 

What this code is saying is, create a new column in the dataframe, using the column of interest values from the original dataframe if there are no values in the new data frame (the one being merged), otherwise if there are values in the new data frame then use those. Now let’s see it in action.

combined_df['NAME'] = combined_df['NAME_m'].where(combined_df['NAME_n'].isnull(), combined_df['NAME_n'])

# ALWAYS VERIFY YOUR WORK
combined_df[['ID', 'NAME', 'NAME_m', 'NAME_n']]

 

ID NAME NAME_m NAME_n
11 Kiara Kiara Kiara
12 Rajah Rajah NaN
13 Simba Simba NaN
14 Wilbert NaN Wilbert
15 Momma NaN Momma

 

It worked beautifully! Now for the next column, the test scores.

combined_df['TEST 1'] = combined_df['TEST 1_m'].where(combined_df['TEST 1_n'].isnull(), combined_df['TEST 1_n'])

# REMEMBER TO CHECK YOUR WORK
combined_df[['ID', 'TEST 1', 'TEST 1_m', 'TEST 1_n']]

 

ID TEST 1 TEST 1_m TEST 1_n
11 75 66 75
12 99 99 NaN
13 75 75 NaN
14 23 23 NaN
15 23 NaN 23

 

Now to clean up the dataframe and remove the unnecessary columns.

combined_df.drop(['NAME_m', 'NAME_n', 'TEST 1_m', 'TEST 1_n', '_merge'], axis= 1, inplace= True)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.