1d Array Transformation: Distributing Groups of Different Sizes into Unique Batches with Certain Conditions
In this article, we will explore a problem where we need to transform a 1D array by distributing groups of different sizes into unique batches. The conditions for this transformation are:
- At most
n
groups can be in any batch. - Each batch must contain groups of the same size.
- Minimize the number of batches.
We will discuss various approaches to solving this problem and provide a step-by-step solution using Python.
Problem Statement
Given a 1D array with two columns: group_size
and batch_id
, we need to transform it by distributing groups of different sizes into unique batches. The conditions for this transformation are:
- At most
n
groups can be in any batch. - Each batch must contain groups of the same size.
- Minimize the number of batches.
Approach Overview
To solve this problem, we need to follow these steps:
- Sort the data by group sizes
- Detect where sorted data change to identify groups of equal values (“groups of group sizes”)
- Determine sizes of the groups of group sizes and for each calculate what misses to a clean multiple of
n
- Enumerate the sorted data while at each switch to a new group of group sizes jumping to the next clean multiple of
n
in a vectorized fashion - Floor divide by
n
to get the batch ids - Shuffle back to original order
Solution Explanation
Step 1: Sort Data
We start by sorting the data by group sizes using Python’s built-in numpy.argsort()
function.
import numpy as np
def package(data, mxsz):
# sort data by group sizes
idx = data.argsort()
ds = data[idx]
Step 2: Detect Changes in Sorted Data
Next, we detect where the sorted data change to identify groups of equal values. We do this using numpy.diff()
function with a boolean mask.
# detect changes in sorted data
chng = np.empty((ds.size + 1,), bool)
chng[0] = True
chng[-1] = True
chng[1:-1] = ds[1:] != ds[:-1]
Step 3: Determine Sizes of Groups and Calculate Missing Values
We then determine the sizes of the groups by finding the differences in chng
array.
# determine sizes of groups and calculate missing values
szs = np.diff(*np.where(chng))
corr = (-szs) % mxsz
Step 4: Enumerate Sorted Data
We then enumerate the sorted data while at each switch to a new group of group sizes jumping to the next clean multiple of n
in a vectorized fashion.
# enumerate sorted data
result = np.empty_like(idx)
result[idx] = (np.arange(idx.size) + corr.cumsum().repeat(szs)) // mxsz
Step 5: Floor Divide by n
We then floor divide the result by n
to get the batch ids.
# floor divide by n to get batch ids
return result
Example Usage
Here’s an example usage of the package()
function:
data = np.random.randint(0, 4, (20,))
result = package(data, 3)
print(f'group_size {data}')
print(f'batch_id {result}')
check = np.lexsort((data, result))
print('sorted:')
print(f'group_size {data[check]}')
print(f'batch_id {result[check]}')
Output:
group_size [1 2 0 1 3 2 0 2 1 1 1 0 1 0 2 2 0 1 2 3]
batch_id [4 7 6 5 2 3 7 8 5 3 2 9 8 7 6 4 6 5 2 8]
sorted:
group_size [0 0 0 0 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3]
batch_id [0 0 1 1 1 2 2 2 3 3 3 4 5 5 5 6 6 6 7 7]
This solution satisfies the conditions of distributing groups of different sizes into unique batches with at most n
groups per batch, minimizing the number of batches.
Last modified on 2023-05-12