Optimizing GPS Location-Based Services with Vectorized Operations in Pandas Using KDTree

Introduction to Vectorized Operations in Pandas

=====================================================

In this article, we’ll explore the use of vectorized operations in Pandas DataFrames. Specifically, we’ll discuss how to add a new column to a DataFrame by finding the closest location from two separate DataFrames.

Background on GPS Coordinates and Distance Calculations


GPS coordinates are used extensively in various applications such as navigation, mapping, and location-based services. The distance between two points on the surface of the Earth can be calculated using the Haversine formula, which is based on spherical trigonometry.

Creating Sample DataFrames


To illustrate the problem, let’s create some sample DataFrames:

# imports:
from shapely.geometry import Point
import pandas as pd
from geopy import distance

# Create sample df1 (GPS points)
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points

# Create sample df2 (GPS locations)
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
        'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)

Iterative Approach using Two Loops


As mentioned in the question, one approach to find the closest location is to use two loops:

lst = [] #create empty list to populate new df column
for index , row in df_gps.iterrows(): # iterate over first dataframe rows
    point = row['points'] # pull out GPS point
    closest_distance = 999999 # create container for distance
    closest_location = None #create container for closest location
    for index1 , row1 in df_locations.iterrows(): # iterate over second dataframe
        name = row1['location'] # assign name of location
        point2 = row1['gps'] # assign coordinates of location
        distances = distance.distance((point.x , point.y) , (point2.x , point2.y)).miles # calculate distance
        if distances < closest_distance: # check to see if distance is closer
            closest_distance = distances # if distance is closer assign it
            closest_location = name # if distance is closer assign name
    lst.append(closest_location) # append closest city
df_gps['closest_city'] = lst # add new column with closest cities

Vectorized Approach using KDTree


However, this approach can be slow for large datasets. A faster approach is to use the KDTree data structure from Scipy.

# Extract lat/lon from your dataframes
points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)

distances, indices = KDTree(cities).query(points)

df_gps['closest_city'] = df_locations.iloc[indices]['location'].values
df_gps['distance'] = distances

How KDTree Works


A KDTree is a data structure used for efficient nearest neighbor searches. It works by recursively partitioning the data into smaller sub-spaces, allowing for faster search times.

Here’s a step-by-step breakdown of how KDTree works:

  1. Build: The data is inserted into the tree, and each node represents a subset of the data.
  2. Query: When searching for nearest neighbors, the query starts at the root node and recursively traverses down the tree until it reaches a leaf node that contains a single point or a small set of points.

Vectorized Operations in Pandas


Vectorized operations in Pandas allow you to perform operations on entire columns or rows simultaneously, rather than iterating over individual elements.

Here are some key concepts:

  • Apply: The apply function applies a user-defined function to each element of a Series (1-dimensional labeled array).
  • Lambda functions: Lambda functions provide an concise way to define small anonymous functions.
  • Vectorized operations: Pandas supports vectorized operations such as arithmetic, comparison, and aggregation.

Example Use Cases


Here are some example use cases for the vectorized approach using KDTree:

  • Finding the closest city to a set of GPS points
# Create sample df1 (GPS points)
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points

# Create sample df2 (GPS locations)
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
        'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)

# Find closest city to each GPS point
distances, indices = KDTree(df_locations['gps']).query(df_gps['points'])
closest_cities = df_locations.iloc[indices]['location'].values

# Add new column to df_gps with distances and closest cities
df_gps['distance'] = distances
df_gps['closest_city'] = closest_cities

Performance Comparison


To demonstrate the performance benefits of using KDTree, let’s create a larger dataset and measure the execution time for both approaches:

# Create larger sample df1 (GPS points)
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)] * 1000
df_gps = pd.DataFrame()
df_gps['points'] = gps_points

# Create larger sample df2 (GPS locations)
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
        'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]} * 200
df_locations = pd.DataFrame(locations)

# Measure execution time for iterative approach
import timeit

def find_closest_city_iterative(df_gps, df_locations):
    lst = []
    for index, row in df_gps.iterrows():
        point = row['points']
        closest_distance = 999999
        closest_location = None
        for index1, row1 in df_locations.iterrows():
            name = row1['location']
            point2 = row1['gps']
            distances = distance.distance((point.x, point.y), (point2.x, point2.y)).miles
            if distances < closest_distance:
                closest_distance = distances
                closest_location = name
        lst.append(closest_location)
    df_gps['closest_city'] = lst

start_time = timeit.default_timer()
find_closest_city_iterative(df_gps, df_locations)
end_time = timeit.default_timer()
print("Iterative approach took", end_time - start_time, "seconds")

# Measure execution time for vectorized approach using KDTree
def find_closest_city_kdtree(df_gps, df_locations):
    points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
    cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
    distances, indices = KDTree(cities).query(points)
    closest_cities = df_locations.iloc[indices]['location'].values
    df_gps['distance'] = distances
    df_gps['closest_city'] = closest_cities

start_time = timeit.default_timer()
find_closest_city_kdtree(df_gps, df_locations)
end_time = timeit.default_timer()
print("KDTree approach took", end_time - start_time, "seconds")

Note: The actual execution times may vary depending on the system and dataset size.


Last modified on 2024-05-14