Counting Co-Occurrences of Two IDs within a Specific Past Time Length in R

Counting the Number of Occurrences of Current Pair of Two IDs within a Specific Past Time Length in R

In this article, we will explore how to count the number of occurrences of each pair of two IDs within a specific past time length using R. We’ll cover both method 1 (using ddply) and method 2 (using data.table). Additionally, we’ll discuss how to modify method 2 to obtain the same result as method 1.

Purpose

The goal is to count the number of times two workers work together in a room within the past six months. The dataset contains information about the rooms, agents, partners, and dates they worked together.
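
To make the code below easier to follow, here is a small made-up dataset that uses the same column names. The values are invented, and the meaning given to each column (oid as a row ID, o4.in as the date two workers shared a room, cases as a constant 1 per row) is an assumption inferred from how the article's code uses them.

```r
# Hypothetical toy data: column names follow the article's code,
# values and column meanings are assumptions for illustration only.
df <- data.frame(
  oid     = 1:6,                                   # row / order ID (assumed)
  o4.in   = as.Date(c("2023-01-10", "2023-03-05", "2023-09-20",
                      "2023-02-01", "2023-02-15", "2022-06-01")),
  o3.room = c("A", "A", "A", "B", "B", "A"),       # room
  aid     = c(1, 1, 1, 2, 2, 1),                   # agent ID
  pid     = c(7, 7, 7, 8, 8, 7),                   # partner ID
  cases   = 1,                                     # one row = one shared shift
  stringsAsFactors = FALSE
)

str(df)
```

Note that the rows are deliberately not in date order, which matters for any method based on a cumulative sum.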

Current Progress

We have written two methods to calculate this:

Method 1: Using ddply

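library(plyr)       # ddply(); load before dplyr so dplyr's verbs are not masked
library(dplyr)
library(lubridate)  # months()

t1 <- 
  df %>% 
  ddply(c('aid', 'pid', 'o3.room'), function(i){
    i %>% 
      # sort each agent/partner/room group chronologically
      arrange(aid, pid, o3.room, o4.in) %>% 
      # NOTE: compares each row with itself, so it keeps every row whose
      # date arithmetic is not NA; it is not a real six-month window
      filter(o4.in > o4.in - months(6)) %>% 
      # running count of earlier rows in the group
      mutate(j1.room = cumsum(cases)-1)
  }, .progress = 'text') %>% 
  select(oid, o4.in, o3.room, aid, pid, j1.room) %>% 
  arrange(o3.room, aid, pid, o4.in)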

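This method groups the data by agent, partner, and room, sorts each group by date, and then takes the cumulative sum of cases minus one, which gives the number of earlier rows for the same pair in the same room. Note that the filter step compares each row's date with that same date minus six months, so the condition is true for every row and nothing older than six months is actually removed. The result is a running count over the whole history, not over a six-month window.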
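
The short sketch below shows why a filter of the form o4.in > o4.in - months(6) cannot act as a time window, and also highlights a difference between - and %m-% in lubridate. The dates are made up for illustration.

```r
library(lubridate)

# Two illustrative dates; the second falls on the 31st of a month
d <- as.Date(c("2023-03-15", "2023-08-31"))

# A date is always later than itself minus six months, so this is TRUE
# whenever the subtraction succeeds
d > d - months(6)

# Plain subtraction yields NA when the target day does not exist
# (there is no 31 February); dplyr::filter() drops rows whose condition is NA
minus_plain <- d - months(6)
minus_plain

# %m-% rolls back to the last valid day of the month instead of returning NA
minus_rolled <- d %m-% months(6)
minus_rolled
```

Because method 1 uses - and method 2 uses %m-%, the two can end up with different row counts when the data contains month-end dates.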

Method 2: Using data.table

library(data.table)
library(lubridate)

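ks <- c('aid', 'pid', 'o3.room')
# keying sorts by aid, pid, o3.room only; rows keep their original
# order within each group, so cumsum() is not guaranteed to run by date
DT <- data.table(df, key=ks)[
  # NOTE: always TRUE for non-missing dates, so no rows are removed
  o4.in > o4.in %m-% months(6)][,
  j1.room:=cumsum(cases)-1, by=ks][,
    .(oid, o4.in, o3.room, aid, pid, j1.room)]
setorder(DT, o3.room, aid, pid, o4.in)[]

# check if you get the same result; identical() also compares attributes,
# so all.equal() without attributes is the more forgiving test:
identical(DT, as.data.table(t1))
all.equal(DT, as.data.table(t1), check.attributes = FALSE)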

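This method uses data.table to do the same work more efficiently. As in method 1, the filter compares each date with itself minus six months, so it keeps every row with a non-missing date. The cumulative sum is then computed per group in the order the rows happen to be stored, and the table is sorted by date only afterwards.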
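
Neither filter above restricts the count to the past six months. A minimal sketch of a true trailing-window count in data.table follows; it is not the article's method, and the toy data and the win.start column name are assumptions made for illustration. For each row it counts the earlier rows of the same pair and room that fall inside that row's own six-month look-back window.

```r
library(data.table)
library(lubridate)

# Hypothetical toy data: one pair (aid 1, pid 7) in room "A"
DT <- data.table(
  oid     = 1:4,
  o4.in   = as.Date(c("2022-06-01", "2023-01-10", "2023-03-05", "2023-09-20")),
  o3.room = "A", aid = 1, pid = 7
)

ks <- c("aid", "pid", "o3.room")
setorderv(DT, c(ks, "o4.in"))

# Start of each row's look-back window (win.start is a made-up helper column)
DT[, win.start := o4.in %m-% months(6)]

# For every row, count strictly earlier rows of the same group in its window
DT[, j1.room := vapply(
  seq_len(.N),
  function(k) sum(o4.in >= win.start[k] & o4.in < o4.in[k]),
  integer(1)
), by = ks]

DT[, .(oid, o4.in, j1.room)]
```

Only the 2023-03-05 row has a predecessor within six months (2023-01-10), so it is the only row with a count of 1. The inner loop is quadratic in group size, which is acceptable for small groups but would need a join-based approach for large ones.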

Comparison

To compare the results of both methods, we create a new dataset (t_compare) that combines the results of both methods:

t2 <- as.data.frame(DT)  # method 2 result as a plain data frame

t_compare <- 
  t1 %>% 
  select(-o4.in) %>% 
  rename(j1.room1 = j1.room) %>% 
  left_join(
    t2 %>% rename(j1.room2 = j1.room),
    by = c('o3.room', 'aid', 'pid', 'oid')
  ) %>% 
  arrange(o3.room, aid, pid, o4.in) %>% 
  # flag rows where the two methods disagree
  mutate(j3.room = ifelse(j1.room1 != j1.room2, 'non-match', '-'),
         j2.room = ifelse(j1.room1 != j1.room2, '0', '1'))

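In this dataset, j1.room1 and j1.room2 hold the counts from method 1 and method 2. The j3.room column is 'non-match' where the two counts differ and '-' where they agree, and j2.room encodes the same comparison as '0' (differ) or '1' (agree). Filtering on either flag shows exactly which rows the two methods disagree on.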
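
As a quick way to summarise such a comparison, the sketch below joins two small made-up result tables and counts the rows where the counts differ. The tables r1 and r2 are hypothetical stand-ins for the two methods' outputs.

```r
library(dplyr)

# Hypothetical outputs of two methods for the same four rows
r1 <- data.frame(oid = 1:4, j1.room = c(0, 1, 2, 0))
r2 <- data.frame(oid = 1:4, j1.room = c(0, 2, 1, 0))

# The suffixes turn the clashing column into j1.room1 and j1.room2
cmp <- inner_join(r1, r2, by = "oid", suffix = c("1", "2"))

# Number of rows where the two methods disagree
n_mismatch <- sum(cmp$j1.room1 != cmp$j1.room2)
n_mismatch

# The disagreeing rows themselves
cmp[cmp$j1.room1 != cmp$j1.room2, ]
```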

Modifying Method 2

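To obtain the same result as method 1, we need to modify method 2. The most likely source of disagreement is row order: method 1 sorts each group by o4.in before taking the cumulative sum, while method 2 keys the table on aid, pid, and o3.room only, so cumsum runs over the rows in whatever order they had in df.

One way to fix this is to sort by the grouping columns and the date before computing anything:

DT <- data.table(df)[o4.in > o4.in %m-% months(6)]
setorderv(DT, c(ks, 'o4.in'))

This puts each agent, partner, and room group in chronological order. The filter is kept only to mirror method 1; it does not remove any rows with a valid date.

We also need to calculate the cumulative sum of cases. We can do this using the cumsum function together with the by argument, which restarts the count for every group:

DT <- DT[, j1.room := cumsum(cases) - 1, by = ks][,
  .(oid, o4.in, o3.room, aid, pid, j1.room)]
setorder(DT, o3.room, aid, pid, o4.in)

With these changes, method 2 should give the same counts as method 1. If o4.in contains month-end dates, also use the same operator in both methods, since - months(6) can return NA where %m-% does not.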
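
One thing worth checking when the two results disagree is row order, because a cumulative sum depends on it. The sketch below uses made-up data that is not stored in date order and compares a cumulative sum taken before sorting with one taken after sorting.

```r
library(data.table)

# Hypothetical toy data, deliberately not in date order
DT <- data.table(
  oid   = 1:3,
  o4.in = as.Date(c("2023-03-05", "2023-01-10", "2023-02-01")),
  aid = 1, pid = 7, o3.room = "A", cases = 1
)
ks <- c("aid", "pid", "o3.room")

# Cumulative sum in stored row order, sorted by date only afterwards
unsorted <- copy(DT)[, j1.room := cumsum(cases) - 1, by = ks]
setorder(unsorted, o4.in)

# Sort by group and date first, then take the cumulative sum
sorted <- copy(DT)
setorderv(sorted, c(ks, "o4.in"))
sorted[, j1.room := cumsum(cases) - 1, by = ks]

unsorted$j1.room  # counts no longer follow the dates
sorted$j1.room    # counts increase with the dates
```

In the first variant the earliest row ends up with a count of 1 and the latest with 0, which is the kind of mismatch the comparison table would flag.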

Conclusion

In conclusion, counting the number of occurrences of each pair of two IDs in R can be done with both ddply and data.table. Some readers may find the ddply pipeline easier to read, while data.table is generally faster on larger datasets. The two methods agree once they sort and group the rows the same way before taking the cumulative sum. Keep in mind that the filter used in both methods does not by itself restrict the count to the past six months.

Additional Examples

Using dplyr

We can also achieve this using the dplyr package:

library(dplyr)

t3 <- df %>% 
  # sort chronologically within each room/agent/partner group
  arrange(o3.room, aid, pid, o4.in) %>% 
  group_by(o3.room, aid, pid) %>% 
  # number of earlier rows for the same pair in the same room
  mutate(j1.room = cumsum(cases) - 1) %>% 
  ungroup() %>% 
  select(oid, o4.in, o3.room, aid, pid, j1.room)

This code uses grouped mutate rather than summarise, because the count is needed for every row and not once per group. It should reproduce the counts from method 1.

Using purrr

We can also use the purrr package:

library(dplyr)
library(purrr)

t4 <- df %>% 
  # one data frame per room/agent/partner combination
  group_split(o3.room, aid, pid) %>% 
  map(function(x) {
    x %>% 
      arrange(o4.in) %>% 
      mutate(j1.room = cumsum(cases) - 1)
  }) %>% 
  bind_rows() %>% 
  select(oid, o4.in, o3.room, aid, pid, j1.room) %>% 
  arrange(o3.room, aid, pid, o4.in)

This code splits the data into one piece per room, agent, and partner, uses purrr's map to compute the running count within each piece, and binds the pieces back together. It should give the same counts as method 1.

Conclusion

In this article, we explored how to count the number of occurrences of each pair of two IDs within a specific past time length in R using both ddply and data.table. We also discussed how to modify method 2 to obtain the same result as method 1. Additionally, we provided alternative examples using dplyr and purrr.


Last modified on 2023-11-02