Using Full Name and Maiden Name Strings (and Birthdays) to Match Individuals Across Time
====================================================================================================
In this article, we’ll explore the challenges of matching individuals across time using full-name and maiden-name strings, along with birthdays. We’ll walk through the code from a Stack Overflow question that creates a time-independent ID for each unique individual.
Introduction
Matching individuals across time is a common problem in fields such as data science, sociology, and epidemiology. When dealing with longitudinal data, it’s essential to identify unique individuals over multiple observations. In this article, we’ll focus on using full-name and maiden-name strings, along with birthdays, to create a time-independent ID for each individual.
Background
In the Stack Overflow question, the user has 20 consecutive individual-level cross-sectional data sets that need to be linked together. No time-stable ID number is available, but fields for first, last, and maiden names, as well as year of birth, are present. The goal is to create a time-independent ID for each unique individual.
Building Cumulatively: Assigning IDs
The basic approach presented in the code snippet is to build cumulatively by assigning IDs in the first year, then looking for matches in the second year, and so on. This process involves slowly expanding the matching criteria to minimize mismatches.
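As a rough illustration of the cumulative idea, here is a minimal sketch on invented data using data.table. The table obs, the column person_id, and the years 2000 and 2001 are assumptions made for this example and do not come from the question. Only exact matches on first name, last name, and birth year are carried forward.

```r
library(data.table)

# Toy data: two yearly cross-sections with no stable ID (all values invented)
obs <- data.table(
  year       = c(2000L, 2000L, 2001L, 2001L),
  first_name = c("ann", "bob", "ann", "cat"),
  last_name  = c("lee", "ray", "lee", "fox"),
  birth_year = c(1970L, 1965L, 1970L, 1980L)
)
key_cols <- c("first_name", "last_name", "birth_year")

# First year: everyone gets a fresh ID
obs[, person_id := NA_integer_]
obs[year == 2000L, person_id := seq_len(.N)]

# Second year: carry an ID forward where the key matches exactly.
# Assumes the key is unique among the first-year rows.
known <- obs[year == 2000L, c(key_cols, "person_id"), with = FALSE]
obs[year == 2001L,
    person_id := known[.SD, on = key_cols, x.person_id],
    .SDcols = key_cols]
```

In this toy data the 2001 row for ann lee keeps ID 1, while cat fox has no earlier match and stays NA until new IDs are handed out.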
Step 1: Get ID Function
    get_id <- function(yr, key_from, key_to = key_from,
                       mdis, msch, mard, init, mexp, step) {
      # Want to exclude anyone who is matched
      existing_ids <- full_data[.(yr), unique(na.omit(teacher_id))]
      # Get the most recent prior observation of all
      #   unmatched teachers, excluding those teachers
      #   who cannot be uniquely identified by the
      #   current key setting
      unmatched <-
        full_data[.(1996:(yr - 1))
                  ][!teacher_id %in% existing_ids,
                    .SD[.N], by = teacher_id,
                    .SDcols = c(key_from, "teacher_id")
                    ][, if (.N == 1L) .SD, keyby = key_from
                      ][, (flags) := list(mdis, msch, mard, init, mexp, step)]
      # Merge, reset keys
      setkey(setkeyv(
        full_data, key_to)[year == yr & is.na(teacher_id),
                           (update_cols) := unmatched[.SD, update_cols, with = FALSE]],
        year)
      full_data[.(yr), (update_cols) := lapply(.SD, function(x) na.omit(x)[1]),
                by = id, .SDcols = update_cols]
    }
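In this function, yr is the year being matched. key_from names the columns used to identify previously seen individuals (their most recent observation between 1996 and yr-1), and key_to names the columns used to look them up in the current year; it defaults to key_from. The remaining arguments (mdis, msch, mard, init, mexp, step) are stored in the columns named by flags, apparently to record the criteria under which each match was made. The function does not return a list of matches. It updates full_data by reference, filling in update_cols for current-year rows that have no teacher_id yet. Note that full_data, flags, and update_cols are defined outside the function.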
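The uniqueness filter inside get_id is easy to miss, so here is a small sketch of that one idiom in isolation. The table prior and its values are invented for illustration. The point is that a key shared by several individuals is dropped entirely rather than matched arbitrarily.

```r
library(data.table)

# Invented prior-year records; two individuals share name and birth year
prior <- data.table(
  teacher_id = 1:3,
  first_name = c("ann", "ann", "bob"),
  last_name  = c("lee", "lee", "ray"),
  birth_year = c(1970L, 1970L, 1965L)
)
key_cols <- c("first_name", "last_name", "birth_year")

# Keep only key combinations that point to exactly one individual;
# groups with more than one row return NULL and are dropped
unambiguous <- prior[, if (.N == 1L) .SD, keyby = key_cols]
print(unambiguous)
```

Only bob ray survives the filter here. The two ann lee records stay unmatched under this key and would need a stricter or different key to be told apart.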
Step 2: Assigning New IDs
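    # Largest ID assigned so far, taken over all years so that new IDs
    #   cannot collide with individuals who are absent from year yy
    current_max <- full_data[, max(teacher_id, na.rm = TRUE)]
    new_ids <-
      setkey(full_data[year == yy & is.na(teacher_id), .(id = unique(id))
                       ][, add_id := .I + current_max], id)
    setkey(setkey(full_data, id)[year == yy & is.na(teacher_id),
                                 teacher_id := new_ids[.SD, add_id]], year)
This step finds the largest teacher_id assigned so far, then gives each still-unmatched individual in year yy (each distinct id with a missing teacher_id) a new ID numbered upward from that maximum. The maximum is taken over all years rather than over year yy alone, because an individual with a larger ID may be absent from year yy and their ID would otherwise be reused.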
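The same numbering scheme can be tried on a toy table. This is a sketch under assumptions: the table current and its values are invented, and an update join stands in for the keyed lookup used in the snippet.

```r
library(data.table)

# Invented current-year rows; id is a within-year record identifier
current <- data.table(
  id         = c(10L, 10L, 11L, 12L),
  teacher_id = c(NA, NA, 7L, NA)
)
current_max <- current[, max(teacher_id, na.rm = TRUE)]

# One new ID per distinct unmatched id, numbered upward from current_max
new_ids <- current[is.na(teacher_id), .(id = unique(id))][, add_id := .I + current_max]

# Update join: copy the new ID onto every row that shares the id
current[new_ids, on = "id", teacher_id := i.add_id]
```

Both rows with id 10 receive the same new teacher_id, which is why the new IDs are built from unique(id) rather than from row numbers of the full table.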
Step 3: Expanding Matching Criteria
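    for (step in seq(1, 12)) {
      # Call get_id once per step for year yy; step labels the pass
      get_id(yy, c("first_name_clean", "last_name_clean", "birth_year"),
             mdis = TRUE, msch = TRUE, mard = FALSE, init = FALSE, mexp = FALSE,
             step = step)
    }
This loop calls get_id twelve times for the current year yy (not once per year); step only labels the pass. As written, every pass uses the same key and the same flags, so nothing is actually loosened. To loosen the criteria, the key columns and flags must change from one pass to the next, for example by dropping birth_year or by matching a current last name against an earlier maiden name.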
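One way to make the criteria genuinely loosen is to loop over a list of keys ordered from strictest to loosest. The following is a minimal sketch on invented data, not the question's actual sequence of rules. The names prior, current, and key_sets are hypothetical, and the sketch does not check whether several current rows claim the same candidate.

```r
library(data.table)

# Invented data: prior individuals with IDs, and current-year rows to match
prior <- data.table(
  teacher_id = 1:2,
  first_name = c("ann", "bob"),
  last_name  = c("lee", "ray"),
  birth_year = c(1970L, 1965L)
)
current <- data.table(
  first_name = c("ann", "bob", "cat"),
  last_name  = c("lee", "roy", "fox"),  # "roy": a changed or mistyped surname
  birth_year = c(1970L, 1965L, 1980L),
  teacher_id = NA_integer_,
  step       = NA_integer_
)

# Hypothetical key sets, ordered from strictest to loosest
key_sets <- list(
  c("first_name", "last_name", "birth_year"),
  c("first_name", "birth_year")
)

for (s in seq_along(key_sets)) {
  k <- key_sets[[s]]
  # Candidates: prior individuals not yet matched and unique on this key
  cand <- prior[!teacher_id %in% current$teacher_id,
                if (.N == 1L) .SD, by = k]
  # Look up every current row on this key; NA where no candidate exists
  found <- cand[current, on = k, x.teacher_id]
  fill  <- is.na(current$teacher_id) & !is.na(found)
  if (any(fill)) {
    current[fill, `:=`(teacher_id = found[fill], step = s)]
  }
}
```

Recording the step that produced each match makes it possible to audit the looser matches separately later, since those are the ones most likely to be false positives.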
Conclusion
Matching individuals across time using full-name and maiden-name strings, along with birthdays, is a challenging task. The code presented in this article demonstrates a basic cumulative approach: assign IDs in the first year, then look for matches in each subsequent year. This approach is not foolproof and may require further refinement.
In practice, you may need to consider additional factors such as data quality issues, missing values, and the potential for false positives or false negatives when matching individuals across time.
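Many of these data quality issues show up in the name strings themselves. The code above relies on columns such as first_name_clean and last_name_clean, but how they were built is not shown. The helper below is a hypothetical sketch of one plausible normalization (lowercase, letters and spaces only) using base R; clean_name is an invented name, and the rule discards accented letters, which may or may not be acceptable for a given data set.

```r
# Hypothetical cleaning helper; not taken from the original question
clean_name <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z ]", "", x)   # drop punctuation, digits, non-ASCII letters
  x <- gsub(" +", " ", x)       # collapse repeated spaces
  trimws(x)
}

cleaned <- clean_name(c(" O'Neil ", "SMITH-JONES", "de  la Cruz"))
print(cleaned)
```

Aggressive cleaning reduces false negatives caused by typos and formatting, at the cost of merging names that were genuinely different, so it is worth checking how many distinct raw names collapse to each cleaned value.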
Last modified on 2023-12-22