Extracting City Names from Large Text Data with R: A Comparison of Regular Expressions and Geocoding APIs


In this article, we will explore two different approaches to extract city names from large text data. The first approach uses regular expressions and string manipulation techniques in R, while the second approach utilizes a geocoding API.

Approach 1: Using Regular Expressions and String Manipulation Techniques

The original question presented a long character string of facility addresses, each containing a city name, separated by pipes (|). The goal was to extract all the city names from this string. The provided code snippet achieves this using the following steps:

Step 1: Replace Pipes with Commas

library(stringr)

# Replace every pipe delimiter with a comma and a space
test2 <- str_replace_all(test, "[|]", ", ")

This step replaces every occurrence of a pipe (|) with a comma and a space, so the whole string uses one consistent delimiter before further cleaning.

Step 2: Remove Punctuation from Data

# Strip all punctuation (including the commas just added) and newlines
test3 <- gsub("[[:punct:]\n]", "", test2)

This step removes all punctuation marks and newline characters from the string, including the commas introduced in Step 1, leaving only space-separated words.

Step 3: Split Data at Word Boundaries

# Split the cleaned string at spaces; strsplit() returns a list of word vectors
test4 <- strsplit(test3, " ")

This step splits the cleaned string at spaces into individual words. Note that strsplit() returns a list, so test4 is a list holding one character vector of words; some of those words are city names, which the next steps identify.

Step 4: Load City Data from Package Maps

library(maps)

# Load the world.cities lookup table shipped with the maps package
data(world.cities)

This step loads the world.cities data frame from the maps package, which contains information about cities worldwide.

Step 5: Match on Cities in World.cities

# Keep only the words that appear in world.cities$name
citetest <- lapply(test4, function(x) x[which(x %in% world.cities$name)])

This step uses the lapply() function to apply a function to each element of the test4 list. For each vector of words, the function keeps only the words that also appear in the world.cities$name column, i.e. the single-word city names.
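Putting the five steps together on a short sample makes the behaviour concrete. This is a minimal sketch: the sample string is shortened from the original question, and the exact matches depend on the contents of world.cities.

library(stringr)
library(maps)

test <- "Ucsd Medical Center, San Diego, California, USA|Massachusetts General Hospital, Boston, Massachusetts, USA"

test2 <- str_replace_all(test, "[|]", ", ")
test3 <- gsub("[[:punct:]\n]", "", test2)
test4 <- strsplit(test3, " ")
data(world.cities)
citetest <- lapply(test4, function(x) x[which(x %in% world.cities$name)])

# Single-word names such as "Boston" are found, but "San Diego" can never be
# matched as a unit, because "San" and "Diego" are checked as separate tokens.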

However, this approach has limitations, particularly when dealing with two-word city names (e.g., “New York”). To address this issue, we need to preprocess the data further.

Preprocessing Data for Two-Word City Names

One possible solution is to use the city list itself as a dictionary: instead of splitting the text into single words, search for each full city name (including multi-word names such as "New York") directly in the raw text, and only then look up its coordinates.

Alternatively, we could use a tokenizer or named-entity-recognition model that keeps multi-word place names together as single tokens (e.g., treating "New York" as one unit rather than the two words "New" and "York").

For the sake of simplicity, let's stick with the dictionary-style approach using string matching; a minimal sketch follows.
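The sketch below matches every name in world.cities directly against each pipe-separated address, which picks up multi-word names as well. It assumes test holds the original pipe-delimited string; note that fixed substring matching can still produce spurious hits (e.g., "York" inside "New York"), which may need post-filtering.

library(stringr)
library(maps)
data(world.cities)

# One address per element
segments <- unlist(strsplit(test, "[|]"))

# For each address, keep every world.cities name that occurs verbatim in it;
# str_detect() recycles the single address across the vector of name patterns
found <- lapply(segments, function(s) {
    world.cities$name[str_detect(s, fixed(world.cities$name))]
})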

Approach 2: Using Geocoding API

The second approach involves passing each address to a geocoding API, which returns the city name along with its coordinates. The provided code snippet uses the ggmap package to perform this task.

Step 1: Load Libraries and Data

library(tidyverse)

# data_frame() is deprecated in current tidyverse releases; tibble() is the replacement
places <- tibble(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology")

Step 2: Separate Rows

places <- places %>% separate_rows(string, sep = '\\|')

This step splits the pipe-delimited string so that each address gets its own row.
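On the sample above this yields one row per facility; the expected shape (sketched, with rows truncated) is:

places
#> # A tibble: 17 x 1
#>    string
#>    <chr>
#>  1 Ucsd Medical Center, San Diego, California, USA
#>  2 Yale Cancer Center, New Haven, Connecticut, USA
#>  3 Massachusetts General Hospital., Boston, Massachusetts, USA
#>  ...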

Step 3: Geocode Data

places <- places %>% 
    mutate(geodata = map(string, ~{
        Sys.sleep(1)                        # crude rate limiting between requests
        ggmap::geocode(.x, output = 'all')  # keep the full response as a nested list
    }))

This step uses the ggmap package to geocode each address, storing the full API response as a nested list in the geodata column; the city names still have to be extracted from that structure.
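Note that current ggmap releases require a registered Google Maps API key before geocode() will run; the key below is a placeholder.

# Run once per session; replace the placeholder with your own key
ggmap::register_google(key = "YOUR_API_KEY")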

Step 4: Extract City Names

places <- places %>% 
    mutate(# pull the address_components block out of the first geocoding result
           address_components = map(geodata, list('results', 1, 'address_components')),
           # reshape each list of components into a tibble, one row per component
           address_components = map(address_components,
                                    ~as_tibble(transpose(.x)) %>%
                                        unnest(c(long_name, short_name))),
           # name each long_name by its type, preferring 'locality' and falling
           # back to 'administrative_area_level_1'
           city = map_chr(address_components, ~{
               l <- set_names(.x$long_name, .x$types)
               coalesce(l['locality'], l['administrative_area_level_1'])
           }))

This step pulls the address_components block out of each API response, reshapes it into a tibble, and picks the component typed locality as the city, falling back to administrative_area_level_1 when no locality is present.
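If the purrr pipeline feels opaque, the same selection can be written as a plain helper applied to one raw response. This is a minimal sketch assuming resp is a single element of the geodata column, i.e. one output = 'all' response:

get_city <- function(resp) {
    comps <- resp$results[[1]]$address_components
    long_names <- vapply(comps, function(cmp) cmp$long_name, character(1))
    types      <- lapply(comps, function(cmp) unlist(cmp$types))
    is_city  <- vapply(types, function(t) 'locality' %in% t, logical(1))
    is_admin <- vapply(types, function(t) 'administrative_area_level_1' %in% t, logical(1))
    if (any(is_city)) long_names[is_city][1]
    else if (any(is_admin)) long_names[is_admin][1]
    else NA_character_
}

# e.g. places$city <- purrr::map_chr(places$geodata, get_city)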

Results

The final result shows the city names extracted by both approaches. However, the two do not always agree, particularly for the Korean addresses.

One source of disagreement is granularity: the regular-expression approach matches individual words against world.cities, while the geocoding approach interprets each full address. Word-level matches can therefore fail to line up with world.cities entries at all for multi-word names, and the geocoder may return a differently spelled or differently scoped name for the same place.

Moreover, the geocoding API can return unexpected values when a facility name dominates the address. For example, geocoding "Seoul National University Hospital" can surface the institution rather than the city "Seoul", which is why the code above extracts the locality component explicitly instead of taking the top-level result.

Despite these limitations, both approaches can be useful depending on the specific requirements of the project. The regular-expression approach is lightweight and handles large datasets efficiently, while the geocoding approach is generally more accurate but requires internet connectivity and an API key, and may incur usage costs.

In conclusion, extracting city names from large text data involves a combination of string manipulation techniques, geocoding APIs, and data preprocessing steps. By understanding the strengths and limitations of each approach, we can choose the most suitable method for our specific use case.


Last modified on 2024-02-05