There are a few methods you can use to combine data frames in Python. These methods are .concat(), and .merge(). Both of them are apart of the Pandas library. So let’s import Pandas!

import pandas as pd

 

pandas.concat() method

The pandas.concat() method combines two data frames by stacking them on top of each other. If one of the data frames does not contain a variable column or variable rows, observations in that data frame will be filled with NaN values.

pandas.merge() Method

The pandas.merge() method joins two data frames by a “key” variable that contains unique values.

With pandas.merge(), you can only combine 2 data frames at a time. If you have more than 2 data frames to merge, you will have to use this method multiple times. Let’s go over pandas.merge() and some of the available arguments to pass. Here is the general structure and the recommended bare minimum arguments to pass. This method has more argument to pass if desired.

pandas.merge(left_data_frame, right_data_frame, on= , how= )

 

  • left is one of the data frames
  • right is the other data frame
  • on is the variable, a.k.a the column, you want to merge on. This variable has to have the same name in both data frames.
    • If the data frames has different column names for the merge variables you can use left_on and right_on.
      • left_on is the variable name in the left data frame to be merged on
      • right_on is the variable name in the left data frame to be merged on
  • how is where you pass the options of merging. These include:
    • “inner”, where only the observations with matching values based on the “on” argument that is passed are kept.
    • “left”, where all observations will be kept from the data frame in the left argument regardless if there is matching values with the data frame in the right argument. Observations that do not have a matching value based on the on argument in the “right” data frame will be discarded.
    • “right”, where all observations will be kept from the data frame in the right argument regardless if there is matching values with the data frame in the left argument. Observations that do not have a matching value based on the on argument in the “left” data frame will be discarded.
    • “outer”, all observations will be kept from both data frames.
Data used for Examples

Normally, we would use a real data set for our examples. However, for this section we will create a few data sets so it will be easier to demonstrate what is occurring.

Creating 4 data frames

dataframe1 = pd.DataFrame({'ID': ['0011','0013','0014','0016','0017'], 
                           'First Name': ['Joseph', 'Mike', 'Jordan', 'Steven', 'Susan']})
dataframe2 = pd.DataFrame({'ID': ['0010','0011','0013','0014','0017'], 
                           'Last Name': ['Gordan', 'Johnson', 'Might' , 'Jackson', 'Shack']})
dataframe3 = pd.DataFrame({'ID': ['0020','0022','0025'],
                           'First Name': ['Adam', 'Jackie', 'Sue']})
dataframe4 = pd.DataFrame({'Key': ['0020','0022','0025'],
                           'Scores': [95, 90, 80]})

 

dataframe1

 

First Name ID
0 Joseph 0011
1 Mike 0013
2 Jordan 0014
3 Steven 0016
4 Susan 0017
dataframe2

 

ID Last Name
0 0010 Gordan
1 0011 Johnson
2 0013 Might
3 0014 Jackson
4 0017 Shack
dataframe3

 

First Name ID
0 Adam 0020
1 Jackie 0022
2 Sue 0025
dataframe4

 

Key Scores
0 0020 95
1 0022 90
2 0025 80

Concatenate Examples

With pandas.concat(), you pass the data frames that you want to join as a list. You can pass as many as you need to join. Remember, this method joins the data frames by rows (stacking them on top of each other) by default. To join them by columns (put the data frames side to side), you pass the “axis= 1” argument.

Concatenation Example 1. Joining data frames by rows (stacking one on top of another)

If you were to join data frames by rows with an uneven number of columns, i.e. one data frame has more columns than the other, the data frame without the columns will have “NaN” values filled in where data is missing. This will be shown in example 2.

new_concat_ROWS_dataframe = pd.concat([dataframe1, dataframe3])

new_concat_ROWS_dataframe

 

First Name ID
0 Joseph 0011
1 Mike 0013
2 Jordan 0014
3 Steven 0016
4 Susan 0017
0 Adam 0020
1 Jackie 0022
2 Sue 0025

Look at the index, notice how Python kept the original index from both data frames and just stacked them on top of each other as well. If you wish for a new index starting at 0, pass the “ignore_index” argument as “true”.

new_concat_ROWS_dataframe = pd.concat([dataframe1, dataframe3], ignore_index= "true")

new_concat_ROWS_dataframe

 

First Name ID
0 Joseph 0011
1 Mike 0013
2 Jordan 0014
3 Steven 0016
4 Susan 0017
5 Adam 0020
6 Jackie 0022
7 Sue 0025
Concatenation Example 2. Joining data frames by columns (stacking them side by side)

Passing the “axis= 1” argument will join the data frames by columns, placing the data frames next to each other. If one data frame has more rows than the other, the data frame that has less rows will be filled with “NaN” values where the extra rows will occur.

new_concat_COL_dataframe = pd.concat([dataframe1, dataframe3], axis=1)

new_concat_COL_dataframe

 

First Name ID First Name ID
0 Joseph 0011 Adam 0020
1 Mike 0013 Jackie 0022
2 Jordan 0014 Sue 0025
3 Steven 0016 NaN Nan
4 Susan 0017 NaN Nan

In this situation, it’s clear that stacking the data frames on top of each other (joining by rows) is what we want. The second example would make for a horrible data frame.

Merge Examples

Merge Example 1. Keep matching observations only

Passing the “how = ‘inner'” argument will keep observations that have a match on the merge variable in both data frames.

new_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "inner")

new_merged_dataframe

 

First Name ID Last Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Susan 0017 Shack
Merge Example 2. Keep all observations

Passing the “how = ‘outer'” argument will keep all observations from both data frames.

new_OUTER_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "outer")

new_OUTER_merged_dataframe

 

First Name ID Last Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Steven 0016 NaN
4 Susan 0017 Shack
5 Nan 0010 Gordan
Merge Example 3. Keeping match observations and all observations from the left data frame

Passing the “how= ‘left'” argument will keep all observations in the data frame that is being passed in the left argument regardless if there is a matching value in the data frame that is being passed in the right argument. Rows without a matching merge variable from the right data frame will be discarded.

new_LEFT_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "left")

new_LEFT_merged_dataframe

 

First Name ID Last Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Steven 0016 NaN
4 Susan 0017 Shack
Merge Example 4. Keeping match observations and all observations from the right data frame

Passing the “how= ‘right'” argument will keep all observations in the data frame that is being passed in the right argument regardless if there is a matching value in the data frame that is being passed in the left argument. Rows without a matching merge variable from the left data frame will be discarded.

new_RIGHT_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "right")

new_RIGHT_merged_dataframe
First Name ID Last Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Susan 0017 Shack
4 NaN 0010 Gordan
Merge Example 5. Merging on different columns with unique values

If the two data frames each contain the unique identifier, but are stored under different columns, you can merge using the left_on and right_on arguments. If going this route, you have to pass both arguments. The general structure is left_on = column_name_with_unique_identifier, right_on = column_name_with_unique_identifier.

Note: Merging data frames this way will keep both columns stated in the left_on and right_on arguments.

merged_dataframe = pd.merge(dataframe3, dataframe4, left_on= "ID", right_on= "Key", how= "inner")

merged_dataframe

 

First Name ID Key Scores
0 Adam 0020 0020 95
1 Jackie 0022 0022 90
2 Sue 0025 0025 80

There is more useful information and advanced concatenation/ merging methods on the official documentation page.