Lecture 4: Merging Data

In applied data analysis, information is rarely stored in a single dataset. Instead, we often need to combine multiple data sources using shared identifiers. In this lecture, we learn how to merge datasets in R using the base R function merge().

The goal is not just to learn syntax, but to understand:

What it means to join datasets conceptually
How different types of joins affect your sample
How to diagnose common merging mistakes

When Do We Need to Merge Data?

You need to merge data whenever:

One dataset contains outcomes and another contains covariates
Data are split across files (e.g., demographics and economic variables)
Information is recorded at different times or by different sources

Example:

Dataset A: county-level citizenship shares
Dataset B: county-level unemployment rates

To analyze the relationship between these variables, they must live in the same data frame.

The Key Idea: Keys

A key is a variable (or set of variables) that uniquely identifies observations.

Common keys include:

County FIPS codes
State–county pairs
Customer IDs
Firm IDs and years

Before merging, always ask:

“What uniquely identifies an observation in each dataset?”

The basic syntax is:

new_data <- merge(x_data, y_data, by = "key")

This will merge the x dataset and the y dataset by their common key.

A Simple Example

Suppose we have two county-level datasets.

citizenship <- data.frame(
  county_fips = c(53063, 53067, 53082),
  noncitizen_share = c(0.06, 0.12, 0.04)
)

unemployment <- data.frame(
  county_fips = c(53063, 53067, 53075),
  unemployment_rate = c(4.1, 5.3, 3.8)
)

Both datasets share the same key: county_fips.

Inner Joins (The Default)

An inner join keeps only observations that appear in both datasets.

merged_inner <- merge(citizenship, unemployment, by = "county_fips")

What this does:

Matches rows where county_fips is the same
Drops counties missing from either dataset

This is the default behavior of merge().

Interpretation:

“Keep only counties for which we have complete information.”

Initially, both datasets have 3 observations. After the join, only 2 observations are left because there are only 2 observations in both datasets.

Outer Joins

Sometimes we want to keep observations even if information is missing in one dataset.

Left Join

Keeps all observations from the first (x) dataset.

merged_left <- merge(
  citizenship,
  unemployment,
  by = "county_fips",
  all.x = TRUE
)

Counties missing unemployment data will have NA values.

This merge results in a dataset with 3 observations - all of the observations in the x dataset.

Left Join

Keeps all observations from the first (x) dataset.

merged_left <- merge(   
  citizenship,   
  unemployment,   
  by = "county_fips",  
  all.y = TRUE 
)

Full Join

Keeps all observations from both datasets.

merged_full <- merge(
  citizenship,
  unemployment,
  by = "county_fips",
  all = TRUE
)

Interpretation:

“Keep everything, even if some variables are missing.”

Merging When Key Names Differ

Often, the same identifier has different names in different datasets.

Example:

citizenship <- data.frame(
  fips = c(53063, 53067, 53082),
  noncitizen_share = c(0.06, 0.12, 0.04)
)

unemployment <- data.frame(
  county_fips = c(53063, 53067, 53075),
  unemployment_rate = c(4.1, 5.3, 3.8)
)

We now specify the key variables separately:

merged <- merge(
  citizenship,
  unemployment,
  by.x = "fips",
  by.y = "county_fips"
)

Merging on Multiple Variables

Sometimes a single variable is not enough to uniquely identify observations.

Example: state–county combinations.

df1 <- data.frame(
  state = c("WA", "WA", "OR"),
  county = c("Spokane", "King", "Multnomah"),
  noncitizen_share = c(0.06, 0.18, 0.14)
)


df2 <- data.frame(
  state = c("WA", "WA", "OR"),
  county = c("Spokane", "King", "Multnomah"),
  median_income = c(62000, 85000, 78000)
)

We merge using both variables as keys:

merged <- merge(
  df1, 
  df2,
  by = c("state", "county")
)

Interpretation:

“Match rows only when both state and county are the same.”

Check Your Merge

After merging, always check:

nrow(df1)

[1] 3

nrow(df2)

[1] 3

nrow(merged)

[1] 3

head(merged)

  state    county noncitizen_share median_income
1    OR Multnomah             0.14         78000
2    WA      King             0.18         85000
3    WA   Spokane             0.06         62000

summary(merged)

    state              county          noncitizen_share median_income  
 Length:3           Length:3           Min.   :0.0600   Min.   :62000  
 Class :character   Class :character   1st Qu.:0.1000   1st Qu.:70000  
 Mode  :character   Mode  :character   Median :0.1400   Median :78000  
                                       Mean   :0.1267   Mean   :75000  
                                       3rd Qu.:0.1600   3rd Qu.:81500  
                                       Max.   :0.1800   Max.   :85000

Questions to ask:

Did I lose rows unexpectedly?
Do the merged variables look reasonable?
Are there new NA values?

Common Merging Mistakes

Using the wrong key
Forgetting that keys must uniquely identify observations
Accidentally many-to-many merges
Not checking row counts after merging

Merging errors are one of the most common sources of silent data problems.

Exercise

Merge the county citizenship data with the county unemployment data.
Identify the key for each dataset.
Perform an inner join.
Perform a full join.

For each merge, explain which observations are kept and why.