What this section covers:
- How to merge and update an existing Pandas data frame
This builds off of the Join and Merge Pandas Data Frame page. This page shows how to update an existing data frame with new values.
First, let’s import pandas as pd!
Data used in this example
Now let’s create a small, simple data set (data frame) for demonstration purposes.
Now let’s take a look at each dataframe, first the original and then the new dataframe.
In the original dataframe there are 4 observations with one of those missing a value in the ‘NAME’ column. In the new dataframe (the one that is going to be merged with the original), there are 3 observations. Of those, ID 11 has a new test score, ID 14 updates the missing value in the ‘NAME’ column, and there is a new observation to be added. Let’s get to work!
Merge and Updating an Existing Dataframe
If you need a refresher on the options available for the pd.merge() method, take a look at Join and Merge Pandas Data Frame page or the official documentation page. For this example, I want all observations that are in both dataframes (how= ‘outer’), to merge on the ID column (on= ‘ID’), change the merging suffixes from ‘_x’ and ‘_y’ to ‘_m’ and ‘_n’ (suffixes=(‘_m’, ‘_n’)), and to have and indicator column to know if the column was in both dataframes (indicator= True).
|ID||NAME_m||TEST 1_m||NAME_n||TEST 1_n||_merge|
The ‘_m’ suffix indicates the column came from the original dataframe (org_df), while the ‘_n’ indicates the column came from the new data frame (new_data_df). The next step is to consolidate the the columns into one. The result that is desired is to:
- keep the original data if there is nothing in the new dataframe to update it with, and
- if there is a value in the new dataframe, replace the data in the original dataframe with this new data,
We can achieve this with the following code. Below is the general code structure with dummy values followed by the code that works for this example.
What this code is saying is, create a new column in the dataframe, using the column of interest values from the original dataframe if there are no values in the new data frame (the one being merged), otherwise if there are values in the new data frame then use those. Now let’s see it in action.
It worked beautifully! Now for the next column, the test scores.
|ID||TEST 1||TEST 1_m||TEST 1_n|
Now to clean up the dataframe and remove the unnecessary columns.