Extracting Country Names from a Dataframe Column using Python and Pandas

As data scientists and analysts, we often encounter datasets that contain geographic information. One common challenge is extracting country names from columns that contain location data. In this article, we will explore ways to achieve this task using Python and the popular Pandas library.

Introduction to Pandas and Data Manipulation

Pandas is a powerful library for data manipulation and analysis in Python. It provides data structures such as Series (1-dimensional labeled array) and DataFrames (2-dimensional labeled data structure with columns of potentially different types). We will use Pandas to extract country names from the job_location column of our dataframe.
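
As a quick illustration of the two structures, here is a minimal sketch; the variable names s and frame are chosen only for this example.

```python
import pandas as pd

# A Series is a single labeled column of values
s = pd.Series(['pakistan', 'uae'], name='job_location')

# A DataFrame is a table whose columns are Series sharing one index
frame = s.to_frame()

# Selecting one column of a DataFrame gives back a Series
column = frame['job_location']
print(type(column).__name__)
print(frame.shape)
```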

Creating a Sample DataFrame

To illustrate the concepts, let’s create a sample dataframe that contains location information.

import pandas as pd

# Create a list of employees' locations
locations = [
    'birmingham, england, united kingdom',
    'new jersey, united states',
    'gilgit-baltistan, pakistan',
    'uae',
    'united states',
    'pakistan',
    '31-c2, gulberg 3, lahore, pakistan'
]

# Create a dataframe with the locations
df = pd.DataFrame({
    'job_location': locations
})

print(df)

Output:

                          job_location
0  birmingham, england, united kingdom
1            new jersey, united states
2           gilgit-baltistan, pakistan
3                                  uae
4                        united states
5                             pakistan
6   31-c2, gulberg 3, lahore, pakistan

Using Regular Expressions to Extract Country Names

One approach to extracting country names is by using regular expressions (regex). We can create a regex pattern that matches country names and use the str.extract method to apply it to our dataframe.

First, let’s define our list of countries:

# Define a list of countries
countries = ['united kingdom', 'united states', 'pakistan', 'uae']

Next, we’ll create a regex pattern that matches any of the country names in our list. We’ll use the | character to specify alternatives.

# Create a regex pattern that matches country names
reg = '(%s)' % '|'.join(countries)
print(reg)  # Output: (united kingdom|united states|pakistan|uae)
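
The pattern above is fine for plain names, but a name containing regex metacharacters, or a short code such as uae that happens to occur inside a longer word, could give surprising matches. The following is a slightly more defensive sketch, assuming the same countries list; the helper name build_country_pattern is hypothetical and not part of Pandas.

```python
import re

countries = ['united kingdom', 'united states', 'pakistan', 'uae']

def build_country_pattern(names):
    """Build an alternation that matches any of the names as a whole word."""
    # Longest names first, so a longer name wins over a shorter one that is its prefix
    ordered = sorted(names, key=len, reverse=True)
    # re.escape neutralises characters such as '.' or '(' inside a name
    alternatives = '|'.join(re.escape(name) for name in ordered)
    # \b word boundaries stop 'uae' from matching inside a longer word
    return r'\b(%s)\b' % alternatives

pattern = build_country_pattern(countries)
print(pattern)
```

The resulting string can be passed to str.extract in the same way as reg.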

Now, we can apply this regex pattern to our dataframe using the str.extract method:

# Apply the regex pattern; expand=False returns a Series rather than a one-column DataFrame
df['country'] = df['job_location'].str.extract(reg, expand=False)

print(df)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

As you can see, every row now has its country name. Note that str.extract returns the first match in each string, and any row with no match receives NaN.
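
One case worth checking is a location that mentions more than one name from the list. The sketch below uses hypothetical data in which georgia is both a US state and a country, and compares str.extract with str.findall followed by .str[-1], which keeps the last match instead of the first.

```python
import pandas as pd

# Hypothetical data: 'georgia' can be a state or a country, 'remote' matches nothing
sample = pd.Series([
    'atlanta, georgia, united states',
    'tbilisi, georgia',
    'remote',
])
pattern = r'(georgia|united states)'

# str.extract keeps the first match in each string
first = sample.str.extract(pattern, expand=False)

# str.findall collects every match; .str[-1] keeps the last one
last = sample.str.findall(pattern).str[-1]

print(first.tolist())
print(last.tolist())
```

For the first row, the first match is georgia while the last match is united states, so the second variant is the safer choice when the country tends to appear at the end. The row with no match is NaN in both results.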

Alternative Approach: Splitting on Comma and Keeping the Last Field

However, if the location data is always formatted with the country at the end, it is probably easier to split on the comma and keep the last field. We start with the str.split method, which turns each location into a list of fields:

# Split the job_location column on comma
df['country'] = df['job_location'].str.split(',')

print(df)

Output:

                          job_location                                  country
0  birmingham, england, united kingdom  [birmingham,  england,  united kingdom]
1            new jersey, united states             [new jersey,  united states]
2           gilgit-baltistan, pakistan            [gilgit-baltistan,  pakistan]
3                                  uae                                    [uae]
4                        united states                          [united states]
5                             pakistan                               [pakistan]
6   31-c2, gulberg 3, lahore, pakistan  [31-c2,  gulberg 3,  lahore,  pakistan]

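Each entry is now a list, so we can take its last element with the .str[-1] accessor. Because we split on ',' alone, every field after the first keeps a leading space, which str.strip removes:

# Keep the last element of each list and strip surrounding whitespace
df['country'] = df['country'].str[-1].str.strip()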

print(df)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

As you can see, splitting on the comma and keeping the last field gives the same country names as the regex approach, without needing a list of countries. It only works, however, when the country really is the last field.
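
The split and the selection of the last field can also be written as one chained expression. This is a minimal sketch on a shortened copy of the sample data; it uses str.rsplit with n=1, which splits only once from the right, so the list never holds more than two items.

```python
import pandas as pd

df = pd.DataFrame({'job_location': [
    'birmingham, england, united kingdom',
    'uae',
    '31-c2, gulberg 3, lahore, pakistan',
]})

# rsplit with n=1 splits once from the right, separating only the final field
df['country'] = (
    df['job_location']
    .str.rsplit(',', n=1)
    .str[-1]       # last item of each list
    .str.strip()   # drop the space that followed the comma
)

print(df['country'].tolist())  # ['united kingdom', 'uae', 'pakistan']
```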

Conclusion

In this article, we’ve explored ways to extract country names from a dataframe column using Python and Pandas. We used regular expressions to achieve this, but also showed an alternative approach by splitting on comma and keeping the last field. Whether you use regex or the alternative approach, extracting country names is a common task in data analysis that can be achieved with ease using Pandas.
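
The two approaches can also be combined. The sketch below is one possible arrangement rather than a definitive recipe: it takes the last comma-separated field, keeps it only when it appears in the countries list, and otherwise falls back to a regex search over the whole string. The sample rows and the intermediate names last_field, from_split and from_regex are hypothetical.

```python
import pandas as pd

countries = ['united kingdom', 'united states', 'pakistan', 'uae']

# Hypothetical rows: the second does not end with a bare country name
df = pd.DataFrame({'job_location': [
    'new jersey, united states',
    'pakistan (lahore office)',
    'somewhere unknown',
]})

# Step 1: take the last comma-separated field
last_field = df['job_location'].str.rsplit(',', n=1).str[-1].str.strip()

# Step 2: keep it only when it is a known country, otherwise leave it missing
from_split = last_field.where(last_field.isin(countries))

# Step 3: search the whole string with the regex as a fallback
reg = '(%s)' % '|'.join(countries)
from_regex = df['job_location'].str.extract(reg, expand=False)

# Prefer the split result, fill the gaps from the regex result
df['country'] = from_split.fillna(from_regex)
print(df['country'].tolist())
```

Rows that neither approach can resolve stay as NaN, which makes them easy to find and review by hand.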

Last modified on 2025-01-14