Introduction to Vectorized Operations in Pandas
=====================================================
In this article, we’ll explore the use of vectorized operations in Pandas DataFrames. Specifically, we’ll discuss how to add a new column to a DataFrame by finding the closest location from two separate DataFrames.
Background on GPS Coordinates and Distance Calculations
GPS coordinates are used extensively in various applications such as navigation, mapping, and location-based services. The distance between two points on the surface of the Earth can be calculated using the Haversine formula, which is based on spherical trigonometry.
Creating Sample DataFrames
To illustrate the problem, let’s create some sample DataFrames:
# imports:
from shapely.geometry import Point
import pandas as pd
from geopy import distance
# Create sample df1 (GPS points)
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points
# Create sample df2 (GPS locations)
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)
Iterative Approach using Two Loops
As mentioned in the question, one approach to find the closest location is to use two loops:
lst = [] #create empty list to populate new df column
for index , row in df_gps.iterrows(): # iterate over first dataframe rows
point = row['points'] # pull out GPS point
closest_distance = 999999 # create container for distance
closest_location = None #create container for closest location
for index1 , row1 in df_locations.iterrows(): # iterate over second dataframe
name = row1['location'] # assign name of location
point2 = row1['gps'] # assign coordinates of location
distances = distance.distance((point.x , point.y) , (point2.x , point2.y)).miles # calculate distance
if distances < closest_distance: # check to see if distance is closer
closest_distance = distances # if distance is closer assign it
closest_location = name # if distance is closer assign name
lst.append(closest_location) # append closest city
df_gps['closest_city'] = lst # add new column with closest cities
Vectorized Approach using KDTree
However, this approach can be slow for large datasets. A faster approach is to use the KDTree data structure from Scipy.
# Extract lat/lon from your dataframes
points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
distances, indices = KDTree(cities).query(points)
df_gps['closest_city'] = df_locations.iloc[indices]['location'].values
df_gps['distance'] = distances
How KDTree Works
A KDTree is a data structure used for efficient nearest neighbor searches. It works by recursively partitioning the data into smaller sub-spaces, allowing for faster search times.
Here’s a step-by-step breakdown of how KDTree works:
- Build: The data is inserted into the tree, and each node represents a subset of the data.
- Query: When searching for nearest neighbors, the query starts at the root node and recursively traverses down the tree until it reaches a leaf node that contains a single point or a small set of points.
Vectorized Operations in Pandas
Vectorized operations in Pandas allow you to perform operations on entire columns or rows simultaneously, rather than iterating over individual elements.
Here are some key concepts:
- Apply: The
apply
function applies a user-defined function to each element of a Series (1-dimensional labeled array). - Lambda functions: Lambda functions provide an concise way to define small anonymous functions.
- Vectorized operations: Pandas supports vectorized operations such as arithmetic, comparison, and aggregation.
Example Use Cases
Here are some example use cases for the vectorized approach using KDTree:
- Finding the closest city to a set of GPS points
# Create sample df1 (GPS points)
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points
# Create sample df2 (GPS locations)
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)
# Find closest city to each GPS point
distances, indices = KDTree(df_locations['gps']).query(df_gps['points'])
closest_cities = df_locations.iloc[indices]['location'].values
# Add new column to df_gps with distances and closest cities
df_gps['distance'] = distances
df_gps['closest_city'] = closest_cities
Performance Comparison
To demonstrate the performance benefits of using KDTree, let’s create a larger dataset and measure the execution time for both approaches:
# Create larger sample df1 (GPS points)
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)] * 1000
df_gps = pd.DataFrame()
df_gps['points'] = gps_points
# Create larger sample df2 (GPS locations)
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]} * 200
df_locations = pd.DataFrame(locations)
# Measure execution time for iterative approach
import timeit
def find_closest_city_iterative(df_gps, df_locations):
lst = []
for index, row in df_gps.iterrows():
point = row['points']
closest_distance = 999999
closest_location = None
for index1, row1 in df_locations.iterrows():
name = row1['location']
point2 = row1['gps']
distances = distance.distance((point.x, point.y), (point2.x, point2.y)).miles
if distances < closest_distance:
closest_distance = distances
closest_location = name
lst.append(closest_location)
df_gps['closest_city'] = lst
start_time = timeit.default_timer()
find_closest_city_iterative(df_gps, df_locations)
end_time = timeit.default_timer()
print("Iterative approach took", end_time - start_time, "seconds")
# Measure execution time for vectorized approach using KDTree
def find_closest_city_kdtree(df_gps, df_locations):
points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
distances, indices = KDTree(cities).query(points)
closest_cities = df_locations.iloc[indices]['location'].values
df_gps['distance'] = distances
df_gps['closest_city'] = closest_cities
start_time = timeit.default_timer()
find_closest_city_kdtree(df_gps, df_locations)
end_time = timeit.default_timer()
print("KDTree approach took", end_time - start_time, "seconds")
Note: The actual execution times may vary depending on the system and dataset size.
Last modified on 2024-05-14