Removing Spatial Outliers from Latitude and Longitude Data

Removing Spatial Outliers (lat and long coordinates) in R

Removing spatial outliers from a set of latitude and longitude coordinates is a common task in fields such as geography, urban planning, and environmental science. In this article, we will explore how to remove spatial outliers from a list of data frames, each holding a different number of coordinate rows.

Introduction

Spatial outliers are points that are far away from the mean location of similar points. In the context of latitude and longitude coordinates, these points may represent errors in measurement or recording, or they might be real locations that do not belong to the dataset. The goal is to identify and remove these spatial outliers while retaining the remaining data.

Background

Before we dive into the solution, it’s essential to understand some basic concepts related to distance calculations between two points on a sphere (such as Earth) and how they relate to geodetic distances.

  • Geodetic Distance: The geodetic (great-circle) distance is the shortest distance between two points along the surface of a sphere. It is computed with spherical trigonometry, for example the haversine formula, from the latitude and longitude of both points.
  • Euclidean Distance: Euclidean distance is the standard straight-line metric on flat surfaces, calculated as the square root of the sum of squared differences in coordinates. Applied to raw degrees, it ignores the fact that a degree of longitude shrinks toward the poles.
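To make the difference between the two metrics concrete, here is a minimal sketch that measures one degree of longitude at the equator and at 60° N. The helper names `haversine_km` and `euclid_deg` are hypothetical and exist only for this illustration; the Earth radius is the same 6378.145 km used later in the article.

```r
# Haversine great-circle distance in km (inputs in decimal degrees).
# `haversine_km` is a hypothetical helper name used only in this sketch.
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6378.145) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * atan2(sqrt(a), sqrt(1 - a))
}

# Euclidean distance on raw degrees (treats the Earth as flat).
euclid_deg <- function(lon1, lat1, lon2, lat2) {
  sqrt((lon2 - lon1)^2 + (lat2 - lat1)^2)
}

# One degree of longitude at the equator and at 60 degrees north
eq_km   <- haversine_km(0, 0, 1, 0)
n60_km  <- haversine_km(0, 60, 1, 60)
eq_deg  <- euclid_deg(0, 0, 1, 0)
n60_deg <- euclid_deg(0, 60, 1, 60)

print(c(equator_km = eq_km, lat60_km = n60_km))
print(c(equator_deg = eq_deg, lat60_deg = n60_deg))
```

The Euclidean result is 1 degree in both cases, while the great-circle distance at 60° N is roughly half of the equatorial one. A fixed threshold in degrees would therefore mean different ground distances at different latitudes.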

Removing Spatial Outliers

The problem presented involves several steps:

  1. Calculate the mean center (latitude and longitude) for each item in the list.
  2. Combine all these centers into one data frame.
  3. Find the distance between each coordinate pair in the original data frame and its corresponding mean center.
  4. Remove any points from the original data frame that have distances greater than a specified threshold.

A common mistake is to compute the Euclidean (flat-surface) distance on raw degrees instead of the geodetic distance, which is more accurate for spatial relationships on a sphere like Earth. For a single data frame df with lon and lat columns, the four-argument earth.dist helper defined in Step 2 below gives the geodetic distance, in kilometers, from every point to the mean center:

df$dist <- earth.dist(df$lon, df$lat, mean(df$lon), mean(df$lat))

If earth.dist is not available, the same haversine calculation can be written inline:

df$dist <- (function(lat, lon) {
  rad <- pi / 180

  # Point coordinates in radians
  a1 <- lat * rad
  a2 <- lon * rad

  # Mean center in radians
  b1 <- mean(df$lat) * rad
  b2 <- mean(df$lon) * rad

  dlon <- b2 - a2
  dlat <- b1 - a1

  a <- sin(dlat / 2)^2 + cos(a1) * cos(b1) * sin(dlon / 2)^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))

  R <- 6378.145 # Earth radius in km
  R * c
})(df$lat, df$lon)

df <- df[df$dist <= 0.1, ] # Keep points within 0.1 km (100 m) of the mean center

Summary

So far, we have seen how to remove spatial outliers from a single data frame of coordinates. The approach involves calculating the geodetic distance between each coordinate pair and its corresponding mean center, then dropping the points beyond a threshold. We also discussed a common pitfall in spatial outlier removal: using the wrong distance calculation. The remaining sections walk through the same steps for a list of data frames.

Step 1: Computing the Mean Center of Each Data Frame

The first step is to compute the mean longitude and latitude of each data frame and combine them into a single list of center points, one c(lon, lat) pair per data frame in the original list.

lonMean <- lapply(dfList, function(x) mean(x$lon))
latMean <- lapply(dfList, function(x) mean(x$lat))

# One c(lon, lat) vector per data frame
lonLat <- mapply(c, lonMean, latMean, SIMPLIFY = FALSE)
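If the centers are easier to inspect as a table, they can also be gathered into one data frame, as mentioned in the list of steps earlier. The following is a sketch under assumptions: the toy `dfList` and the name `centers` are hypothetical, and each data frame is assumed to have `lon` and `lat` columns.

```r
# Hypothetical toy input: two data frames with lon/lat columns
dfList <- list(
  data.frame(lon = 1:4, lat = 2:5),
  data.frame(lon = 6:9, lat = 10:13)
)

# One row per data frame, holding its mean center.
# vapply() returns a plain numeric vector and fails loudly on unexpected output.
centers <- data.frame(
  lon = vapply(dfList, function(x) mean(x$lon), numeric(1)),
  lat = vapply(dfList, function(x) mean(x$lat), numeric(1))
)

print(centers)
```

Row i of `centers` then corresponds to the i-th data frame in `dfList`.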

Step 2: Defining the Distance Calculation Function

Next, we define a function to calculate the geodetic distance between two points. This function takes the longitude and latitude of both points as inputs.

earth.dist <- function(long1, lat1, long2, lat2) {
  rad <- pi / 180

  # Convert both points from decimal degrees to radians
  a1 <- lat1 * rad
  a2 <- long1 * rad

  b1 <- lat2 * rad
  b2 <- long2 * rad

  dlon <- b2 - a2
  dlat <- b1 - a1

  # Haversine formula
  a <- sin(dlat / 2)^2 + cos(a1) * cos(b1) * sin(dlon / 2)^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))

  R <- 6378.145 # Earth radius in km
  d <- R * c

  return(d) # Distance in km
}

Step 3: Calculating Distances Between Coordinates and Their Mean Centers

We use the mapply function to pair each data frame with its mean center and apply the distance function to every row. Because earth.dist is vectorized, one call per data frame is enough.

dfList <- mapply(
  function(df, center) {
    # center is the c(lon, lat) pair for this data frame
    df$dist <- earth.dist(df$lon, df$lat, center[1], center[2])
    df
  },
  dfList, lonLat,
  SIMPLIFY = FALSE
)

Step 4: Removing Points with Distances Greater Than the Threshold

Finally, we filter out points whose distances are greater than a specified threshold (in this case, 0.1 km, that is, 100 m).

dfList <- lapply(dfList, function(df) df[df$dist <= 0.1, ]) # Keep points within 100 m

This final step ensures that only points within a certain distance from their mean center remain in the dataset.
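As a small worked example of the distance and filtering steps, the sketch below uses hypothetical toy points on the equator, where 0.001 degrees of longitude is about 111 m. The helper `dist_km` is a compact restatement of the haversine calculation so that the block runs on its own; the names `toy`, `kept`, and `removed` are illustrative only.

```r
# Compact haversine helper (km), restated so this sketch is self-contained
dist_km <- function(lon1, lat1, lon2, lat2, R = 6378.145) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * atan2(sqrt(a), sqrt(1 - a))
}

# Hypothetical data: three points within about 22 m of each other,
# plus one point roughly 330 m east of the first
toy <- data.frame(lon = c(0, 0.0001, 0.0002, 0.003), lat = 0)

# Distance from each point to the mean center, in km
toy$dist <- dist_km(toy$lon, toy$lat, mean(toy$lon), mean(toy$lat))

kept    <- toy[toy$dist <= 0.1, ] # within 100 m of the mean center
removed <- toy[toy$dist > 0.1, ]  # flagged as outliers

print(kept)
print(removed)
```

Note that the distant point pulls the mean center toward itself, so the three clustered points end up tens of meters from the center rather than next to it. The threshold has to leave room for that shift.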

Combining Code into a Single Function

For better organization and reusability, we can combine all these steps into a single function:

remove_spatial_outliers <- function(dfList, threshold = 0.1) {
  # Calculate mean for longs and lats
  lonMean <- lapply(dfList, function(x) mean(x$lon))
  latMean <- lapply(dfList, function(x) mean(x$lat))

  # Combine into one list of c(lon, lat) center points
  lonLat <- mapply(c, lonMean, latMean, SIMPLIFY = FALSE)

  # Define distance calculation function (haversine, result in km)
  earth.dist <- function(long1, lat1, long2, lat2) {
    rad <- pi / 180
    a1 <- lat1 * rad
    a2 <- long1 * rad

    b1 <- lat2 * rad
    b2 <- long2 * rad

    dlon <- b2 - a2
    dlat <- b1 - a1

    a <- sin(dlat / 2)^2 + cos(a1) * cos(b1) * sin(dlon / 2)^2
    c <- 2 * atan2(sqrt(a), sqrt(1 - a))

    R <- 6378.145 # Earth radius in km
    R * c
  }

  # For each data frame: distance to its own mean center, then filter
  mapply(
    function(df, center) {
      df$dist <- earth.dist(df$lon, df$lat, center[1], center[2])
      # Keep points within the threshold (km)
      df[df$dist <= threshold, ]
    },
    dfList, lonLat,
    SIMPLIFY = FALSE
  )
}

This function encapsulates all the steps for removing spatial outliers and returns a list of filtered data frames, one per input data frame. Each data frame must have lon and lat columns. It can be used as follows:

dfList <- list(
  data.frame(lon = 1:4, lat = 2:5),
  data.frame(lon = 6:9, lat = 10:13)
)

# These toy points are about a degree apart, so a 100 km threshold
# is used here instead of the default 0.1 km
cleanedList <- remove_spatial_outliers(dfList, threshold = 100)

Last modified on 2024-12-19