76.6 Merging, Joining, and Concatenating DataFrames
Right, so you’ve got your data, but it’s living in two separate DataFrames. Of course it is. Welcome to the real world, where your dataset is never in the single, tidy CSV file they showed you in the tutorial. You’ll spend about 70% of your data-wrangling time combining these disparate pieces, and Pandas gives you three main tools for the job: concat, merge, and join. They are not the same. Using the wrong one is like trying to screw in a lightbulb with a hammer—you might get a result, but it will be terrifying and probably wrong.
Let’s get our hands dirty with some sample data. We’ll make a couple of ridiculously small DataFrames to illustrate the point. Real data is bigger, but the principles are identical.
import pandas as pd
# Some employee info
df_employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'department_id': [101, 102, 101, 999] # Diana's dept is... unknown.
})
# Some department info
df_departments = pd.DataFrame({
'department_id': [101, 102, 103],
'department_name': ['Engineering', 'Marketing', 'Sales']
})
Concatenation: Stacking Things Vertically or Horizontally
Think of pd.concat as the brute force method. It’s for when you have multiple DataFrames with the same columns and you just want to stack them on top of each other (axis=0), or multiple DataFrames with the same index and you want to stack them side-by-side (axis=1). It does not do any matching based on keys; it just aligns by index.
# Let's say we get a second batch of employees
df_employees_new = pd.DataFrame({
'employee_id': [5, 6],
'name': ['Evan', 'Fiona'],
'department_id': [103, 102]
})
# Stack them vertically
df_all_employees = pd.concat([df_employees, df_employees_new], ignore_index=True)
print(df_all_employees)
The ignore_index=True is a classic “gotcha.” Without it, the indices from the original DataFrames (0,1,2,3 and 0,1) are preserved, leaving you with duplicate index values. This will break your .loc calls later. You almost always want to reset the index.
Merging: The SQL-Style Join (This is What You Want 90% of the Time)
This is the workhorse. pd.merge() is for combining DataFrames based on the values in one or more key columns. It’s a database-style join, and it’s brilliantly powerful. The most important argument is how, which defines the type of join.
Let’s join our employees to their department names.
# A simple inner join on 'department_id'
df_inner = pd.merge(df_employees, df_departments, on='department_id')
print("Inner Join:\n", df_inner)
Notice anything? Diana is missing. That’s because an inner join only includes rows where the key is present in both DataFrames. Her department_id of 999 doesn’t exist in df_departments. This is the most common “pitfall”—silently losing data because you didn’t think about your join type.
# Let's find Diana. Use a left join.
df_left = pd.merge(df_employees, df_departments, on='department_id', how='left')
print("Left Join:\n", df_left)
There she is! Her department_name is filled with NaN, which is Pandas’ way of saying, “You clearly know something I don’t, pal.” This is honest data. The how='left' join keeps everyone from the left DataFrame (df_employees) and brings in the matching info from the right where it exists.
What if your key column has a different name in each DataFrame? You use left_on and right_on instead of just on. It’s verbose but necessary.
# Suppose the department column was named 'dept_id' in the employees table
df_employees_renamed = df_employees.rename(columns={'department_id': 'dept_id'})
df_custom_merge = pd.merge(df_employees_renamed, df_departments, left_on='dept_id', right_on='department_id')
Joining: A Convenient Shortcut for Index-Based Merging
The .join() method is essentially a convenience wrapper for pd.merge() that defaults to joining on the index of the right DataFrame. It’s useful when your key is already the index.
# Set the index of the departments DF to be the department_id
df_departments_indexed = df_departments.set_index('department_id')
# Now we can use .join. We specify how to join and what column from the left DF to use as the key.
df_via_join = df_employees.join(df_departments_indexed, on='department_id', how='left')
print("Using .join():\n", df_via_join)
This gives you the same result as the left merge above. It’s cleaner in this specific scenario. Use merge when joining on columns, use join when your key is already the index of the right DataFrame.
The Dreaded Suffixes and Duplicate Columns
Here’s a classic headache. What if both DataFrames have a column with the same name that isn’t your key? Pandas slaps a suffix on them (_x and _y) to avoid ambiguity. You can control this with the suffixes parameter, and you absolutely should.
# Let's add a 'location' column to both DFs to create a conflict
df_employees['location'] = 'Seattle'
df_departments['location'] = 'HQ'
df_suffix = pd.merge(df_employees, df_departments, on='department_id', how='left', suffixes=('_emp', '_dept'))
print("Suffixes in action:\n", df_suffix)
Without the suffixes parameter, you’d get location_x and location_y, which is meaningless. Naming them _emp and _dept instantly tells you what each column refers to. This is a small thing that makes your code and your data exponentially more readable.
The golden rule? Always be explicit. Don’t rely on defaults. Specify how, specify on (or left_on/right_on), and specify suffixes. It saves you from a world of confusing, silent errors that are a nightmare to debug in a large dataset. Now go combine something. Carefully.