Using Full Name and Maiden Name Strings (and Birthdays) to Match Individuals Across Time

====================================================================================================

In this article, we’ll explore the challenges of matching individuals across time using full names and maiden names, along with year of birth. We’ll walk through the code from a Stack Overflow question that creates a time-independent ID for each unique individual.

Introduction

Matching individuals across time is a common problem in fields such as data science, sociology, and epidemiology. When working with longitudinal data, you need to recognize the same individual across multiple observations. In this article, we’ll focus on using full names and maiden names, along with year of birth, to create a time-independent ID for each individual.

Background

In the Stack Overflow question provided, the user has 20 consecutive individual-level cross-sectional data sets that need to be linked together. Unfortunately, there’s no time-stable ID number available, but fields for first, last, and maiden names, as well as year of birth, are present. The goal is to create a time-independent ID for each unique individual.

Building Cumulatively: Assigning IDs

The basic approach in the code is to build cumulatively: assign IDs in the first year, look for matches in the second year, and so on. Within each year, the matching criteria are loosened gradually, so that the strictest matches are made first and mismatches are kept to a minimum.
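To make the cumulative idea concrete, here is a minimal sketch on a toy two-year data set. The table `toy`, its column names, and `person_id` are hypothetical and are not taken from the original question; the sketch also uses an `on=` join rather than the keyed joins in the original code.

```r
library(data.table)

# Hypothetical toy data: two yearly cross-sections with no stable ID
toy <- data.table(
  year       = c(2000L, 2000L, 2001L, 2001L),
  first_name = c("ann", "bob", "ann", "cat"),
  last_name  = c("lee", "ray", "lee", "fox"),
  birth_year = c(1970L, 1965L, 1970L, 1980L)
)
key_cols <- c("first_name", "last_name", "birth_year")

# First year: every record gets a fresh ID
toy[year == 2000L, person_id := seq_len(.N)]

# Second year: look up each record's key among the first-year records
prior <- toy[year == 2000L, c(key_cols, "person_id"), with = FALSE]
toy[year == 2001L,
    person_id := prior[.SD, on = key_cols, x.person_id],
    .SDcols = key_cols]
```

Records in the second year that find no match keep a missing `person_id`, which is what later steps (looser keys, then new IDs) would work on.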

Step 1: Get ID Function

# Relies on objects defined elsewhere in the original script:
#   full_data (a data.table keyed by year), and flags and
#   update_cols (character vectors of column names)
get_id<-function(yr,key_from,key_to=key_from,
                 mdis,msch,mard,init,mexp,step){
  
  #Want to exclude anyone who is matched
  existing_ids<-full_data[.(yr),unique(na.omit(teacher_id))]
  
  #Get the most recent prior observation of all
  #  unmatched teachers, excluding those teachers
  #  who cannot be uniquely identified by the
  #  current key setting
  unmatched<-
    full_data[.(1996:(yr-1))
              ][!teacher_id %in% existing_ids,
                .SD[.N],by=teacher_id,
                .SDcols=c(key_from,"teacher_id")
                ][,if (.N==1L) .SD,keyby=key_from
                  ][,(flags):=list(mdis,msch,mard,init,mexp,step)]
  
  #Merge, reset keys
  #  (full_data is re-keyed by key_to so that current-year rows
  #  without an ID can be joined against unmatched)
  setkey(setkeyv(
    full_data,key_to)[year==yr&is.na(teacher_id),
                      (update_cols):=unmatched[.SD,update_cols,with=FALSE]],
    year)
  #Copy the first non-missing value of each updated column
  #  to every row sharing the same id in the current year
  full_data[.(yr),(update_cols):=lapply(.SD,function(x)na.omit(x)[1]),
            by=id,.SDcols=update_cols]
}

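In this function, yr is the year currently being matched. key_from names the columns that identify individuals in the earlier years, and key_to names the columns they are compared with in year yr; it defaults to key_from. The remaining arguments (mdis, msch, mard, init, mexp, and step) are written to the columns listed in flags, so each match carries a record of the settings that produced it. The function does not return a list of matches: it updates full_data in place through data.table’s := operator.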
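Two idioms inside get_id are easy to miss: `.SD[.N]` keeps the most recent row per person, and `if (.N==1L) .SD` drops any key that more than one person shares. The sketch below isolates them and then joins a current-year maiden name against a prior-year last name. All tables and values are hypothetical, and the `on=` mapping is an assumed stand-in for the positional keyed join used in the original function.

```r
library(data.table)

# Hypothetical prior-year records that already carry an ID
prior <- data.table(
  teacher_id = c(1L, 1L, 2L, 3L),
  year       = c(1996L, 1997L, 1996L, 1997L),
  first_name = c("ann", "ann", "bob", "bob"),
  last_name  = c("lee", "lee", "ray", "ray")
)
# Hypothetical current-year records: Ann now lists "lee" as a maiden name
current <- data.table(
  first_name  = c("ann", "bob"),
  last_name   = c("kim", "ray"),
  maiden_name = c("lee", NA)
)

key_from <- c("first_name", "last_name")    # columns in the prior years
key_to   <- c("first_name", "maiden_name")  # columns in the current year

# Most recent observation per ID (rows are already sorted by year)
latest <- prior[, .SD[.N], by = teacher_id]

# Keep only keys that identify exactly one person ("bob ray" is ambiguous)
unambiguous <- latest[, if (.N == 1L) .SD, keyby = key_from]

# Match current-year key_to columns against prior-year key_from columns
current[, teacher_id := unambiguous[.SD, on = setNames(key_to, key_from),
                                    x.teacher_id],
        .SDcols = key_to]
```

Here Ann is linked to her earlier record through her maiden name, while Bob stays unmatched because two different prior IDs share his name.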

Step 2: Assigning New IDs

current_max<-full_data[.(yy),max(teacher_id,na.rm=TRUE)]
new_ids<-
  setkey(full_data[year==yy&is.na(teacher_id),.(id=unique(id))
                   ][,add_id:=.I+current_max],id)
setkey(setkey(full_data,id)[year==yy&is.na(teacher_id),
                            teacher_id:=new_ids[.SD,add_id]],year)

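This step runs after matching for year yy is finished. It finds the largest teacher_id already assigned in that year, then gives every individual who is still unmatched a new ID by counting up from that maximum, one ID per distinct value of id.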
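The same numbering logic can be sketched on a small table. The table `current` and its values are hypothetical, and `id` is assumed here to be a within-year record identifier; the update join with `on=` is a simpler spelling than the keyed version above, not the original author's code.

```r
library(data.table)

# Hypothetical current-year records; NA marks people no earlier year matched
current <- data.table(
  id         = c("a", "b", "b", "c"),
  teacher_id = c(7L, NA, NA, NA)
)

# Largest ID already in use
current_max <- current[, max(teacher_id, na.rm = TRUE)]

# One new ID per distinct unmatched person, counting up from current_max
new_ids <- current[is.na(teacher_id), .(id = unique(id))
                   ][, add_id := .I + current_max]

# Write the new IDs back with an update join
current[new_ids, on = "id", teacher_id := i.add_id]
```

Both rows for `id` "b" receive the same new ID, so repeated records of one person within a year stay together.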

Step 3: Expanding Matching Criteria

for (step in seq(1, 12)){
  
  # Run matches with progressively looser criteria
  get_id(yy,c("first_name_clean","last_name_clean","birth_year"),
         mdis=T,msch=T,mard=F,init=F,mexp=F,step=step)
}

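This loop calls get_id twelve times for the current year yy, passing the step counter so that each match records the step that produced it. Note that step counts matching passes, not years. As shown, every iteration uses the same key (cleaned first name, cleaned last name, and birth year) and the same flag values; to loosen the criteria progressively, as the comment intends, each step would need its own key and flag settings.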
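One way to make each pass genuinely looser is to iterate over a list of keys, strictest first. The following is a sketch under assumptions: the tables, the `key_sets` list, and the column names are all hypothetical, and only two passes are shown instead of twelve.

```r
library(data.table)

# Hypothetical people identified in earlier years
prior <- data.table(
  person_id  = 1:2,
  first_name = c("ann", "bob"),
  last_name  = c("lee", "ray"),
  birth_year = c(1970L, 1965L)
)
# Hypothetical current-year records; Bob's birth year is missing
current <- data.table(
  first_name = c("ann", "bob", "cat"),
  last_name  = c("lee", "ray", "fox"),
  birth_year = c(1970L, NA, 1980L),
  person_id  = NA_integer_,
  step       = NA_integer_
)

# Strictest key first; each later key drops a field
key_sets <- list(
  c("first_name", "last_name", "birth_year"),
  c("first_name", "last_name")
)

for (s in seq_along(key_sets)) {
  k <- key_sets[[s]]
  # Candidates: prior people not matched yet whose key is unique
  cand <- prior[!person_id %in% current$person_id
                ][, if (.N == 1L) .SD, by = k]
  # Fill only the records that are still unmatched
  current[is.na(person_id),
          person_id := cand[.SD, on = k, x.person_id],
          .SDcols = k]
  # Record which pass produced each new match
  current[!is.na(person_id) & is.na(step), step := s]
}
```

Recording the step alongside the ID makes it possible to audit or discard matches that came from the loosest keys.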

Conclusion

Matching individuals across time using full names and maiden names, along with year of birth, is a challenging task. The code presented in this article demonstrates a basic approach: build cumulatively by assigning IDs in the first year and then looking for matches in each subsequent year. This approach is not foolproof, and it will usually need refinement for a particular data set.

In practice, you may need to consider additional factors such as data quality issues, missing values, and the potential for false positives or false negatives when matching individuals across time.

Last modified on 2023-12-22