Random Sampling Between Two Dataframes While Avoiding Address Duplication

Random but Not Repeating Sampling Between Two Dataframes

In this article, we discuss how to assign each row of one dataframe a randomly sampled address from a second dataframe, while ensuring that no address is repeated until all unique addresses from the second dataframe are used up.

Introduction

The problem at hand involves two dataframes. The first dataframe contains unique identifiers along with their corresponding cities. The second dataframe contains addresses along with the respective cities. We want to assign a random address for each unique identifier in the first dataframe, ensuring that the same address is not repeated until all unique addresses from the second dataframe are used up.

Problem Statement

Given two dataframes:

DF1:

UNIQUE_ID      CITY
k5WjB6MQa5Cru  Skopje
k4Yq5QqXwoL4e  Skopje
S9jGzT5qMZLyF  Skopje
mhSHSuxic58Sf  Skopje
MU7eys8NKXQog  Skopje

DF2:

ADDRESS                         CITY
РАТКО МИТРОВИЌ 5 БР.29-ДРАЧЕВО  Skopje
УЛ. МЕТОДИЈА ПАТЧЕВ БР.17А      Skopje
УЛ ДРАЧЕВСКА 123                Skopje
УЛ.ДОМАЗЕТОВСКА БР. 24          Skopje
ДРАЧЕВО УЛ. ЈАНКО МИШИЌ БР. 3   Skopje

We want to assign a random address for each unique identifier in DF1, ensuring that:

  • The same address is not repeated until all unique addresses from DF2 are used up.
  • The assigned address comes from the same city as the identifier.
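To experiment without CSV files, the two example tables can be built inline. This is only a convenience sketch: the values are the ones listed above, and the column names UNIQUE_ID, ADDRESS, and CITY are assumed from the table headers.

```r
# Build the example tables directly in R (values copied from the listings above)
df1 <- data.frame(
  UNIQUE_ID = c("k5WjB6MQa5Cru", "k4Yq5QqXwoL4e", "S9jGzT5qMZLyF",
                "mhSHSuxic58Sf", "MU7eys8NKXQog"),
  CITY = "Skopje",            # a single value is recycled to every row
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  ADDRESS = c("РАТКО МИТРОВИЌ 5 БР.29-ДРАЧЕВО",
              "УЛ. МЕТОДИЈА ПАТЧЕВ БР.17А",
              "УЛ ДРАЧЕВСКА 123",
              "УЛ.ДОМАЗЕТОВСКА БР. 24",
              "ДРАЧЕВО УЛ. ЈАНКО МИШИЌ БР. 3"),
  CITY = "Skopje",
  stringsAsFactors = FALSE
)

str(df1)
str(df2)
```

With real data, a city may have more identifiers than addresses, which is exactly the case where the "no repeat until exhausted" rule matters.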

Solution

To solve this problem, we will use dplyr together with base R’s sampling functions. Here’s how you can do it:

  1. Load the necessary library: only dplyr is required.
  2. Create the required dataframes and assign them to variables.
  3. Use the mutate function from dplyr, grouped by city, to create a new column in DF1 for the addresses.
  4. Within each city, draw addresses from DF2 in shuffled order without replacement, and start a new shuffled pass only once every unique address has been used.
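The last step rests on one primitive: a random permutation of the pool is a sample without replacement that uses every element exactly once. The sketch below illustrates this on a hypothetical three-element pool (the names pool, one_pass, and two_passes are for illustration only). It indexes with sample.int rather than calling sample(pool) directly, which sidesteps base R's special handling of a single numeric argument.

```r
# Hypothetical three-address pool, used only for illustration
pool <- c("A", "B", "C")

# One pass: a random permutation of the pool, so every address appears once
one_pass <- pool[sample.int(length(pool))]

# Two passes back to back: each address appears exactly twice, and the
# second pass starts only after the first one has used the whole pool
two_passes <- c(pool[sample.int(length(pool))],
                pool[sample.int(length(pool))])

print(one_pass)
print(table(two_passes))
```

The same address can still appear twice in a row at the boundary between two passes, which the problem statement allows because the pool was exhausted in between.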

Here’s how you can implement this:

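# Load the only package this example needs
library(dplyr)

# Create dataframes and assign them to variables
df1 <- read.csv("df1.csv")  # Replace with your actual dataframe (UNIQUE_ID, CITY)
df2 <- read.csv("df2.csv")  # Replace with your actual dataframe (ADDRESS, CITY)

# Draw n addresses from a pool. Each pass is a fresh shuffle of the whole
# pool, so no address repeats until every unique address has been used.
draw_addresses <- function(pool, n) {
    pool <- unique(as.character(pool))
    if (length(pool) == 0) return(rep(NA_character_, n))  # city missing from df2
    passes <- ceiling(n / length(pool))
    picks <- unlist(lapply(seq_len(passes), function(i) pool[sample.int(length(pool))]))
    picks[seq_len(n)]
}

# Assign addresses city by city, using only the df2 rows for the same city
df1_mated <- df1 %>%
    group_by(CITY) %>%
    mutate(address = draw_addresses(df2$ADDRESS[df2$CITY == CITY[1]], n())) %>%
    ungroup()

# Print the resulting dataframe
print(df1_mated)

Explanation

  • We first load dplyr, the only package the code uses.
  • Then we create dataframes df1 and df2 from our csv files.
  • The helper draw_addresses shuffles the unique addresses of one city. If more addresses are needed than the pool holds, it appends further freshly shuffled passes and trims the result to the requested length. Because every pass contains each address exactly once, no address can repeat before all of them have been used.
  • Finally, we group df1 by city and use mutate to add a new column ‘address’, calling the helper once per city with only that city's addresses from df2. Identifiers whose city has no addresses in df2 receive NA.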
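One way to gain confidence in an assignment is to check the rule directly. The sketch below is an assumption-laden helper, not part of the solution itself: is_balanced is a hypothetical name, and it takes the assigned addresses for a single city in row order plus that city's address pool. It walks the assignments and verifies that the usage counts never differ by more than one, which is equivalent to "no address repeats until all have been used".

```r
# Hypothetical checker: TRUE if, at every point in the sequence, no address
# has been used two more times than any other address in the pool
is_balanced <- function(assigned, pool) {
  pool <- unique(as.character(pool))
  counts <- setNames(integer(length(pool)), pool)  # usage count per address
  for (a in as.character(assigned)) {
    if (!a %in% pool) return(FALSE)                # address not from this pool
    counts[[a]] <- counts[[a]] + 1L
    if (max(counts) - min(counts) > 1L) return(FALSE)
  }
  TRUE
}

# Illustrative pool and two candidate assignments
pool <- c("A", "B", "C")
ok  <- is_balanced(c("B", "A", "C", "C", "A"), pool)  # repeats start only after a full pass
bad <- is_balanced(c("B", "A", "B"), pool)            # "B" repeats before "C" is used
print(c(ok = ok, bad = bad))
```

To apply it to a multi-city result, the check would be run once per city, for example by splitting the assigned column by CITY.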

Conclusion

In this article, we discussed how to assign each row of one dataframe a randomly sampled address from another dataframe without repeating an address until all unique addresses are used up. We provided a step-by-step solution using R’s dplyr and base R sampling. Drawing addresses city by city, without replacement within each pass, ensures that all unique addresses from df2 are used before any address is repeated in df1.


Last modified on 2025-02-09