Distributing Groups of Different Sizes into Unique Batches Under Certain Conditions

1d Array Transformation: Distributing Groups of Different Sizes into Unique Batches with Certain Conditions

In this article, we will explore a problem where we need to transform a 1D array by distributing groups of different sizes into unique batches. The conditions for this transformation are:

  • At most n groups can be in any batch.
  • Each batch must contain groups of the same size.
  • Minimize the number of batches.

We will discuss various approaches to solving this problem and provide a step-by-step solution using Python.

Problem Statement

Given a 1D array with two columns: group_size and batch_id, we need to transform it by distributing groups of different sizes into unique batches. The conditions for this transformation are:

  • At most n groups can be in any batch.
  • Each batch must contain groups of the same size.
  • Minimize the number of batches.

Approach Overview

To solve this problem, we need to follow these steps:

  1. Sort the data by group sizes
  2. Detect where sorted data change to identify groups of equal values (“groups of group sizes”)
  3. Determine sizes of the groups of group sizes and for each calculate what misses to a clean multiple of n
  4. Enumerate the sorted data while at each switch to a new group of group sizes jumping to the next clean multiple of n in a vectorized fashion
  5. Floor divide by n to get the batch ids
  6. Shuffle back to original order

Solution Explanation

Step 1: Sort Data

We start by sorting the data by group sizes using Python’s built-in numpy.argsort() function.

import numpy as np

def package(data, mxsz):
    # sort data by group sizes
    idx = data.argsort()
    ds = data[idx]

Step 2: Detect Changes in Sorted Data

Next, we detect where the sorted data change to identify groups of equal values. We do this using numpy.diff() function with a boolean mask.

# detect changes in sorted data
chng = np.empty((ds.size + 1,), bool)
chng[0] = True
chng[-1] = True
chng[1:-1] = ds[1:] != ds[:-1]

Step 3: Determine Sizes of Groups and Calculate Missing Values

We then determine the sizes of the groups by finding the differences in chng array.

# determine sizes of groups and calculate missing values
szs = np.diff(*np.where(chng))
corr = (-szs) % mxsz

Step 4: Enumerate Sorted Data

We then enumerate the sorted data while at each switch to a new group of group sizes jumping to the next clean multiple of n in a vectorized fashion.

# enumerate sorted data
result = np.empty_like(idx)
result[idx] = (np.arange(idx.size) + corr.cumsum().repeat(szs)) // mxsz

Step 5: Floor Divide by n

We then floor divide the result by n to get the batch ids.

# floor divide by n to get batch ids
return result

Example Usage

Here’s an example usage of the package() function:

data = np.random.randint(0, 4, (20,))
result = package(data, 3)
print(f'group_size {data}')
print(f'batch_id   {result}')
check = np.lexsort((data, result))
print('sorted:')
print(f'group_size {data[check]}')
print(f'batch_id   {result[check]}')

Output:

group_size [1 2 0 1 3 2 0 2 1 1 1 0 1 0 2 2 0 1 2 3]
batch_id   [4 7 6 5 2 3 7 8 5 3 2 9 8 7 6 4 6 5 2 8]
sorted:
group_size [0 0 0 0 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3]
batch_id   [0 0 1 1 1 2 2 2 3 3 3 4 5 5 5 6 6 6 7 7]

This solution satisfies the conditions of distributing groups of different sizes into unique batches with at most n groups per batch, minimizing the number of batches.


Last modified on 2023-05-12