new_data <- merge(x_data, y_data, by = "key")Lecture 4: Merging Data
In applied data analysis, information is rarely stored in a single dataset. Instead, we often need to combine multiple data sources using shared identifiers. In this lecture, we learn how to merge datasets in R using the base R function merge().
The goal is not just to learn syntax, but to understand:
What it means to join datasets conceptually
How different types of joins affect your sample
How to diagnose common merging mistakes
When Do We Need to Merge Data?
You need to merge data whenever:
One dataset contains outcomes and another contains covariates
Data are split across files (e.g., demographics and economic variables)
Information is recorded at different times or by different sources
Example:
Dataset A: county-level citizenship shares
Dataset B: county-level unemployment rates
To analyze the relationship between these variables, they must live in the same data frame.
The Key Idea: Keys
A key is a variable (or set of variables) that uniquely identifies observations.
Common keys include:
County FIPS codes
State–county pairs
Customer IDs
Firm IDs and years
Before merging, always ask:
“What uniquely identifies an observation in each dataset?”
The basic syntax is:
This will merge the x dataset and the y dataset by their common key.
A Simple Example
Suppose we have two county-level datasets.
citizenship <- data.frame(
county_fips = c(53063, 53067, 53082),
noncitizen_share = c(0.06, 0.12, 0.04)
)
unemployment <- data.frame(
county_fips = c(53063, 53067, 53075),
unemployment_rate = c(4.1, 5.3, 3.8)
)Both datasets share the same key: county_fips.
Inner Joins (The Default)
An inner join keeps only observations that appear in both datasets.
merged_inner <- merge(citizenship, unemployment, by = "county_fips")What this does:
Matches rows where
county_fipsis the sameDrops counties missing from either dataset
This is the default behavior of merge().
Interpretation:
“Keep only counties for which we have complete information.”
Initially, both datasets have 3 observations. After the join, only 2 observations are left because there are only 2 observations in both datasets.
Outer Joins
Sometimes we want to keep observations even if information is missing in one dataset.
Left Join
Keeps all observations from the first (x) dataset.
merged_left <- merge(
citizenship,
unemployment,
by = "county_fips",
all.x = TRUE
)Counties missing unemployment data will have NA values.
This merge results in a dataset with 3 observations - all of the observations in the x dataset.
Left Join
Keeps all observations from the first (x) dataset.
merged_left <- merge(
citizenship,
unemployment,
by = "county_fips",
all.y = TRUE
)Full Join
Keeps all observations from both datasets.
merged_full <- merge(
citizenship,
unemployment,
by = "county_fips",
all = TRUE
)Interpretation:
“Keep everything, even if some variables are missing.”
Merging When Key Names Differ
Often, the same identifier has different names in different datasets.
Example:
citizenship <- data.frame(
fips = c(53063, 53067, 53082),
noncitizen_share = c(0.06, 0.12, 0.04)
)
unemployment <- data.frame(
county_fips = c(53063, 53067, 53075),
unemployment_rate = c(4.1, 5.3, 3.8)
)We now specify the key variables separately:
merged <- merge(
citizenship,
unemployment,
by.x = "fips",
by.y = "county_fips"
)Merging on Multiple Variables
Sometimes a single variable is not enough to uniquely identify observations.
Example: state–county combinations.
df1 <- data.frame(
state = c("WA", "WA", "OR"),
county = c("Spokane", "King", "Multnomah"),
noncitizen_share = c(0.06, 0.18, 0.14)
)
df2 <- data.frame(
state = c("WA", "WA", "OR"),
county = c("Spokane", "King", "Multnomah"),
median_income = c(62000, 85000, 78000)
)We merge using both variables as keys:
merged <- merge(
df1,
df2,
by = c("state", "county")
)Interpretation:
“Match rows only when both state and county are the same.”
Check Your Merge
After merging, always check:
nrow(df1)[1] 3
nrow(df2)[1] 3
nrow(merged)[1] 3
head(merged) state county noncitizen_share median_income
1 OR Multnomah 0.14 78000
2 WA King 0.18 85000
3 WA Spokane 0.06 62000
summary(merged) state county noncitizen_share median_income
Length:3 Length:3 Min. :0.0600 Min. :62000
Class :character Class :character 1st Qu.:0.1000 1st Qu.:70000
Mode :character Mode :character Median :0.1400 Median :78000
Mean :0.1267 Mean :75000
3rd Qu.:0.1600 3rd Qu.:81500
Max. :0.1800 Max. :85000
Questions to ask:
Did I lose rows unexpectedly?
Do the merged variables look reasonable?
Are there new
NAvalues?
Common Merging Mistakes
Using the wrong key
Forgetting that keys must uniquely identify observations
Accidentally many-to-many merges
Not checking row counts after merging
Merging errors are one of the most common sources of silent data problems.
Exercise
Merge the county citizenship data with the county unemployment data.
Identify the key for each dataset.
Perform an inner join.
Perform a full join.
For each merge, explain which observations are kept and why.