Counting the Number of Occurrences of Current Pair of Two IDs within a Specific Past Time Length in R
In this article, we will explore how to count the number of occurrences of each pair of two IDs within a specific past time length using R. We’ll cover both method 1 (using ddply) and method 2 (using data.table). Additionally, we’ll discuss how to modify method 2 to obtain the same result as method 1.
Purpose
The goal is to count the number of times two workers work together in a room within the past six months. The dataset contains information about the rooms, agents, partners, and dates they worked together.
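To make the column names used in the rest of the article concrete, here is a minimal sketch of what such a dataset might look like. The column names follow the code in this article; the values are invented for illustration and are not taken from the real data.

```r
# Hypothetical toy data: column names follow the article, values are made up.
df <- data.frame(
  oid     = 1:6,                                   # row / order id
  o4.in   = as.Date(c("2015-01-10", "2015-03-05", "2015-09-20",
                      "2015-02-01", "2015-02-15", "2015-04-01")),
  o3.room = c("A", "A", "A", "B", "B", "A"),       # room
  aid     = c(1, 1, 1, 2, 2, 3),                   # agent id
  pid     = c(2, 2, 2, 3, 3, 4),                   # partner id
  cases   = 1                                      # one occurrence per row
)
str(df)
```

Each row is one occasion on which agent aid and partner pid worked in room o3.room on date o4.in.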
Current Progress
We have written two methods to calculate this:
Method 1: Using ddply
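library(plyr)       # ddply; load before dplyr so dplyr's verbs take precedence
library(dplyr)
library(lubridate)  # months()
t1 <-
  df %>%
  ddply(c('aid', 'pid', 'o3.room'), function(i){
    i %>%
      arrange(aid, pid, o3.room, o4.in) %>%
      # compares each date with itself shifted back six months
      filter(o4.in > o4.in - months(6)) %>%
      # running count of earlier rows in the group
      mutate(j1.room = cumsum(cases)-1)
  }, .progress = 'text') %>%
  select(oid, o4.in, o3.room, aid, pid, j1.room) %>%
  arrange(o3.room, aid, pid, o4.in)
This method groups the data by agent, partner, and room, sorts each group by date, and takes the cumulative sum of cases minus one, so j1.room is the number of earlier rows for the same pair and room. Note that the filter step does not restrict the data to the past six months: it compares each date with itself shifted back six months, which is true for every row where the shifted date is valid. The count is therefore a running total over all earlier dates rather than a six-month window.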
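The filter in the code above compares each date with itself, so it cannot limit the count to a six-month window. The following is a minimal sketch of one way to count only the earlier rows that fall within the six months before each row. The helper count_prior_6m and the toy data are hypothetical and not part of the original code; rows on the same day are not counted as earlier.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data (values invented for illustration).
df <- data.frame(
  oid     = 1:4,
  o4.in   = as.Date(c("2015-01-10", "2015-03-05", "2015-09-20", "2015-10-01")),
  o3.room = "A",
  aid     = 1,
  pid     = 2
)

# Hypothetical helper: for each date in one group, count the strictly earlier
# dates that are no more than six months before it.
count_prior_6m <- function(dates) {
  sapply(seq_along(dates), function(k) {
    sum(dates < dates[k] & dates >= dates[k] %m-% months(6))
  })
}

windowed <- df %>%
  group_by(aid, pid, o3.room) %>%
  arrange(o4.in, .by_group = TRUE) %>%
  mutate(j1.room = count_prior_6m(o4.in)) %>%
  ungroup()

windowed$j1.room
```

The helper compares every date in a group with every other date, which is simple but quadratic in the group size; for large groups a join-based approach would likely scale better.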
Method 2: Using data.table
library(data.table)
library(lubridate)
ks <- c('aid', 'pid', 'o3.room')
# key on the grouping columns; the filter compares each date with itself
# shifted back six months, so it keeps every row
DT <- data.table(df, key=ks)[
  o4.in > o4.in %m-% months(6)][,
  j1.room:=cumsum(cases)-1, by=ks][,
  .(oid, o4.in, o3.room, aid, pid, j1.room)]
setorder(DT, o3.room, aid, pid, o4.in)[]
# check if you get the same result:
identical(DT, as.data.table(t1))
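This method uses data.table to manipulate the data efficiently. It keys the table on agent, partner, and room, applies the same date comparison as method 1 (here with %m-%, which rolls back to the last valid day of the month instead of returning NA), and then calculates the cumulative sum of cases within each group. Because the key does not include o4.in, rows within a group keep their original order rather than being sorted by date when the cumulative sum is taken.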
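The two methods subtract six months in slightly different ways: method 1 uses - months(6) and method 2 uses %m-% months(6). A small sketch of how the two lubridate operations differ on a month-end date (the date itself is just an example chosen for illustration):

```r
library(lubridate)

d <- as.Date("2015-08-31")  # example date: 31 February does not exist

minus_plain <- d - months(6)     # NA, because the shifted date is invalid
minus_roll  <- d %m-% months(6)  # rolls back to the last day of February

keep_plain <- d > minus_plain    # NA, so dplyr::filter() would drop the row
keep_roll  <- d > minus_roll     # TRUE, so the row is kept
```

If the data contains such dates, the two filters can keep different rows, which is one possible source of differences between the results.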
Comparison
To compare the results of both methods, we create a new dataset (t_compare) that combines the results of both methods:
# t2 is the method 2 result as a plain data frame
t2 <- as.data.frame(DT)
t_compare <-
  t1 %>%
  select(-o4.in) %>%
  rename(j1.room1 = j1.room) %>%
  left_join(
    t2 %>% rename(j1.room2 = j1.room),
    by = c('o3.room', 'aid', 'pid', 'oid')
  ) %>%
  arrange(o3.room, aid, pid, o4.in) %>%
  # flag rows where the two counts differ
  mutate(j3.room = ifelse(j1.room1 != j1.room2, 'non-match', '-')) %>%
  mutate(j2.room = ifelse(j1.room1 != j1.room2, '0', '1'))
This dataset holds the counts from both methods side by side (j1.room1 and j1.room2). The columns j3.room and j2.room flag the rows where the two counts differ, which makes mismatches easy to filter or visualize.
Modifying Method 2
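To obtain the same result as method 1, we need to modify method 2. The main difference is row order: method 1 sorts each group by o4.in before calling cumsum, while method 2 only keys the table on aid, pid, and o3.room, so the running count follows the original row order within each group.
One way to fix this is to sort by date within each group first, and keep the by argument in data.table to specify which columns to group on:
DT <- as.data.table(df)
# sort by date within each group, as arrange() does in method 1
setorderv(DT, c(ks, 'o4.in'))
DT <- DT[o4.in > o4.in %m-% months(6)]
Note that data.table evaluates the filter in i on the whole table before grouping with by; it does not group first.
We also need to calculate the cumulative sum of cases within each group. We can do this using the cumsum function together with by:
DT[, j1.room := cumsum(cases) - 1, by = ks]
DT <- DT[, .(oid, o4.in, o3.room, aid, pid, j1.room)]
setorder(DT, o3.room, aid, pid, o4.in)
With the rows in date order, the counts should match those of method 1. One remaining difference is that method 1 subtracts with o4.in - months(6), which returns NA (and so drops the row) when the shifted date does not exist, whereas %m-% never does; use the same expression in both methods if such dates occur.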
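Because cumsum runs over rows in their current order, the order within each group determines which row receives which count. A minimal sketch with invented data, comparing the count computed in original row order with the count computed after sorting by date:

```r
library(data.table)

# Hypothetical toy data: the rows are deliberately not in date order.
DT <- data.table(
  oid     = 1:3,
  o4.in   = as.Date(c("2015-03-05", "2015-01-10", "2015-09-20")),
  o3.room = "A",
  aid     = 1,
  pid     = 2,
  cases   = 1
)
ks <- c("aid", "pid", "o3.room")

# Running count in original row order.
unsorted <- copy(DT)[, j1.room := cumsum(cases) - 1, by = ks]

# Running count after sorting by date within each group.
sorted <- copy(DT)
setorderv(sorted, c(ks, "o4.in"))
sorted[, j1.room := cumsum(cases) - 1, by = ks]

unsorted[oid == 1, j1.room]  # the March row counted as the first occurrence
sorted[oid == 1, j1.room]    # the March row counted after the January row
```

The same row gets a different count in the two tables, which is the kind of discrepancy that appears when one method sorts by date and the other does not.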
Conclusion
In conclusion, counting the number of occurrences of each pair of two IDs within a specific past time length in R can be approached using both ddply and data.table. While ddply may feel more intuitive for this task, data.table is typically faster on larger datasets. With method 2 adjusted as described above, both methods give the same result.
Additional Examples
Using dplyr
We can also achieve this using the dplyr package alone:
library(dplyr)
t3 <- df %>%
  group_by(aid, pid, o3.room) %>%
  # sort by date within each group so the running count follows time
  arrange(o4.in, .by_group = TRUE) %>%
  mutate(j1.room = cumsum(cases) - 1) %>%
  ungroup() %>%
  select(oid, o4.in, o3.room, aid, pid, j1.room) %>%
  arrange(o3.room, aid, pid, o4.in)
This code groups by agent, partner, and room, sorts each group by date, and computes the same running count as method 1 without ddply.
Using purrr
We can also use the purrr package to apply the per-group calculation to each piece of the data:
library(dplyr)
library(purrr)
t4 <- df %>%
  # one data frame per (aid, pid, o3.room) combination
  group_split(aid, pid, o3.room) %>%
  map(function(x) {
    x %>%
      arrange(o4.in) %>%
      mutate(j1.room = cumsum(cases) - 1)
  }) %>%
  bind_rows() %>%
  select(oid, o4.in, o3.room, aid, pid, j1.room) %>%
  arrange(o3.room, aid, pid, o4.in)
This code splits the dataset by group and maps a function over the pieces to compute the same running count as method 1.
Conclusion
In this article, we explored how to count the number of occurrences of each pair of two IDs within a specific past time length in R using both ddply and data.table. We also discussed how to modify method 2 to obtain the same result as method 1. Additionally, we provided alternative examples using dplyr and purrr.
Last modified on 2023-11-02