Random but Not Repeating Sampling Between Two Dataframes

In this article, we discuss how to assign randomly sampled addresses from one dataframe to the rows of another, so that no address is repeated until every unique address has been used.

Introduction

The problem at hand involves two dataframes. The first dataframe contains unique identifiers along with their corresponding cities. The second dataframe contains addresses along with the respective cities. We want to assign a random address for each unique identifier in the first dataframe, ensuring that the same address is not repeated until all unique addresses from the second dataframe are used up.

Problem Statement

Given two dataframes:

DF1:

UNIQUE_ID	CITY
k5WjB6MQa5Cru	Skopje
k4Yq5QqXwoL4e	Skopje
S9jGzT5qMZLyF	Skopje
mhSHSuxic58Sf	Skopje
MU7eys8NKXQog	Skopje

DF2:

ADDRESS	CITY
РАТКО МИТРОВИЌ 5 БР.29-ДРАЧЕВО	Skopje
УЛ. МЕТОДИЈА ПАТЧЕВ БР.17А	Skopje
УЛ ДРАЧЕВСКА 123	Skopje
УЛ.ДОМАЗЕТОВСКА БР. 24	Skopje
ДРАЧЕВО УЛ. ЈАНКО МИШИЌ БР. 3	Skopje
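
For readers who want to experiment without CSV files, the two sample tables above can be typed in directly. This is only a convenience sketch: the values are copied from the tables, and the variable names df1 and df2 are chosen to match the code later in the article.

```r
# Sample data copied from the DF1 and DF2 tables above
df1 <- data.frame(
    UNIQUE_ID = c("k5WjB6MQa5Cru", "k4Yq5QqXwoL4e", "S9jGzT5qMZLyF",
                  "mhSHSuxic58Sf", "MU7eys8NKXQog"),
    CITY = "Skopje",  # recycled for every row
    stringsAsFactors = FALSE
)

df2 <- data.frame(
    ADDRESS = c("РАТКО МИТРОВИЌ 5 БР.29-ДРАЧЕВО",
                "УЛ. МЕТОДИЈА ПАТЧЕВ БР.17А",
                "УЛ ДРАЧЕВСКА 123",
                "УЛ.ДОМАЗЕТОВСКА БР. 24",
                "ДРАЧЕВО УЛ. ЈАНКО МИШИЌ БР. 3"),
    CITY = "Skopje",
    stringsAsFactors = FALSE
)
```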

We want to assign a random address for each unique identifier in DF1, ensuring that:

The same address is not repeated until all unique addresses from DF2 are used up.
The assigned address is pulled for the respective city.
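
The two requirements can also be written down as a small check, which is handy for testing any candidate solution. The sketch below is an illustration, not part of the original solution: the function name meets_requirements and the toy data are made up, and the assigned column is assumed to be called address. It inspects only the final usage counts per city, not the order in which addresses were handed out.

```r
# Hypothetical checker for the two requirements.
# `assigned` needs columns CITY and address; `pool` needs columns CITY and ADDRESS.
meets_requirements <- function(assigned, pool) {
    for (city in unique(assigned$CITY)) {
        city_pool <- unique(pool$ADDRESS[pool$CITY == city])
        used <- assigned$address[assigned$CITY == city]
        # Requirement 2: every assigned address belongs to the same city
        if (!all(used %in% city_pool)) return(FALSE)
        # Requirement 1: usage counts differ by at most one, so no address
        # was reused while another address of that city was still unused
        counts <- table(factor(used, levels = city_pool))
        if (max(counts) - min(counts) > 1) return(FALSE)
    }
    TRUE
}

# Made-up toy data: city "X" has two addresses
pool <- data.frame(ADDRESS = c("a1", "a2"), CITY = "X", stringsAsFactors = FALSE)
good <- data.frame(CITY = "X", address = c("a1", "a2", "a1"), stringsAsFactors = FALSE)
bad  <- data.frame(CITY = "X", address = c("a1", "a1", "a1"), stringsAsFactors = FALSE)

meets_requirements(good, pool)  # TRUE: a2 was used before a1 came up again
meets_requirements(bad, pool)   # FALSE: a1 repeated while a2 was never used
```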

Solution

To solve this problem, we will use dplyr together with base R's sample function. Here’s how you can do it:

Load necessary libraries: dplyr is the only package this solution relies on.
Create the required dataframes and assign them to variables.
Use mutate function from dplyr to create a new column in DF1 for the addresses.
Within each city, shuffle the addresses from DF2 with sample and hand them out in order, reshuffling only once every address of that city has been used.

Here’s how you can implement this:

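# Load necessary libraries (only dplyr is needed)
library(dplyr)

# Create dataframes and assign them to variables
df1 <- read.csv("df1.csv")  # Replace with your actual dataframe
df2 <- read.csv("df2.csv")  # Replace with your actual dataframe

# Draw n addresses from pool in rounds: each round is a fresh shuffle of the
# whole pool, so no address repeats until all of them have been used
draw_addresses <- function(pool, n) {
    pool <- unique(as.character(pool))
    if (length(pool) == 0) return(rep(NA_character_, n))  # city missing from df2
    rounds <- ceiling(n / length(pool))
    unlist(lapply(seq_len(rounds), function(i) sample(pool)))[seq_len(n)]
}

# Use mutate function from dplyr to create a new column in df1 for the addresses,
# drawing each city's addresses only from the matching rows of df2
df1_mated <- df1 %>%
    group_by(CITY) %>%
    mutate(address = draw_addresses(df2$ADDRESS[df2$CITY == CITY[1]], n())) %>%
    ungroup()

# Print the resulting dataframe
print(df1_mated)

Explanation

We first load dplyr, the only library the code needs.
Then we create dataframes df1 and df2 from our csv files.
Next, we define a helper, draw_addresses, that shuffles a city's unique addresses with sample and repeats the shuffle as many times as needed to cover n identifiers. Because every round contains each address exactly once, no address is reused before all of them have been handed out. If a city has no addresses in df2, the helper returns NA.
Finally, we group df1 by CITY and use mutate function to add a new column ‘address’. For each group, only the addresses from df2 with the same city are passed to the helper, and n() supplies the number of identifiers in that city. Note that sample(nrow(df2), min(nrow(df1), nrow(df2))) would not be enough here: it ignores the city and fails as soon as df1 has more rows than df2.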
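
To see what happens when a city has more identifiers than addresses, the same idea can be sketched in base R on a toy example. Everything here is made up for illustration: the data, the city names "A" and "B", and the helper name assign_addresses. It is a sketch of the round-by-round shuffling approach, not a definitive implementation.

```r
# Hypothetical toy data: city "A" has 5 IDs but only 2 addresses
ids <- data.frame(UNIQUE_ID = paste0("id", 1:7),
                  CITY = c("A", "A", "A", "A", "A", "B", "B"),
                  stringsAsFactors = FALSE)
addrs <- data.frame(ADDRESS = c("a1", "a2", "b1", "b2", "b3"),
                    CITY = c("A", "A", "B", "B", "B"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: per city, hand out shuffled addresses round by round
assign_addresses <- function(ids, addrs) {
    ids$address <- NA_character_
    for (city in unique(ids$CITY)) {
        pool <- unique(addrs$ADDRESS[addrs$CITY == city])
        rows <- which(ids$CITY == city)
        if (length(pool) == 0) next  # no addresses for this city: leave NA
        rounds <- ceiling(length(rows) / length(pool))
        # Each round is an independent shuffle of the full pool;
        # indexing with sample.int avoids surprises with length-one pools
        drawn <- unlist(lapply(seq_len(rounds),
                               function(i) pool[sample.int(length(pool))]))
        ids$address[rows] <- drawn[seq_along(rows)]
    }
    ids
}

result <- assign_addresses(ids, addrs)
print(result)
```

In city "A", the first two identifiers receive a1 and a2 in random order, the next two receive them again in a freshly shuffled order, and the fifth starts a third round. City "B" has enough addresses, so its two identifiers simply get two distinct ones.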

Conclusion

In this article, we discussed how to assign random addresses from one dataframe to the identifiers in another without repeating an address until all unique addresses have been used up. We provided a step-by-step solution using R’s dplyr package and the base sample function. The goal throughout is that every unique address of a city in df2 is used once before any address is assigned a second time in df1.

Last modified on 2025-02-09