The GROUP BY problem on GPUs
GROUP BY is the most common analytical SQL operation. Every dashboard pivot, every summary report, every breakdown-by-category query uses it. The operation is conceptually simple: partition rows into groups by key, apply an aggregate function (SUM, COUNT, AVG, MIN, MAX) to each group, and return one row per group.
On a CPU, this is a hash map problem. Iterate over the rows. For each row, hash the group key, look up the bucket, and update the accumulator. The hash map handles arbitrary cardinality: 10 groups or 10 million groups. The memory is heap-allocated. The hash table resizes dynamically.
On a GPU, none of this works.
GPUs cannot allocate heap memory at runtime. There is no malloc in WGSL. There are no dynamic hash tables. There is no resizable data structure. Every buffer is fixed-size, allocated before the shader dispatches.
The GPU GROUP BY must use a fixed-size accumulator array. Each group gets a slot in the array. Each thread looks up its row's group slot and updates the accumulator atomically. The array size must be known at dispatch time. It must be large enough to hold every group. And the atomic update on each slot is the performance bottleneck.
How GPU GROUP BY works
Low-cardinality path: shared memory accumulators
For group counts that fit in workgroup shared memory (up to 1,024 groups for a single SUM accumulator at 4 bytes per group, i.e. 4 KB of the 16 KB workgroup storage budget, leaving headroom for multi-accumulator aggregates such as SUM+COUNT), each workgroup maintains a local accumulator array:
// Bindings assumed by this shader (layout illustrative)
@group(0) @binding(0) var<storage, read> group_col: array<u32>;
@group(0) @binding(1) var<storage, read> value_col: array<u32>;
@group(0) @binding(2) var<storage, read_write> global_sums: array<atomic<u32>>;
@group(0) @binding(3) var<uniform> row_count: u32;
@group(0) @binding(4) var<uniform> group_count: u32;

var<workgroup> local_sums: array<atomic<u32>, 1024>;

@compute @workgroup_size(256)
fn group_by_sum(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>,
    @builtin(global_invocation_id) gid: vec3<u32>
) {
    // Zero the shared accumulators cooperatively (256 threads, 1,024 slots)
    for (var i = lid.x; i < 1024u; i += 256u) {
        atomicStore(&local_sums[i], 0u);
    }
    workgroupBarrier();

    // Each thread accumulates its row into the workgroup-local array
    if (gid.x < row_count) {
        let group_key = group_col[gid.x];
        let value = value_col[gid.x];
        atomicAdd(&local_sums[group_key], value);
    }
    workgroupBarrier();

    // Thread 0 flushes the local accumulators to the global array
    if (lid.x == 0u) {
        for (var g: u32 = 0u; g < group_count; g++) {
            let local_val = atomicLoad(&local_sums[g]);
            if (local_val > 0u) {
                atomicAdd(&global_sums[g], local_val);
            }
        }
    }
}
The shared memory path has two advantages. First, shared memory atomics are fast: 20 to 50 cycles versus 200 to 400 cycles for global memory. Second, contention is bounded: at most 256 threads (the workgroup size) contend on the local accumulators, not 3,072 threads across the entire GPU.
For 500,000 rows with 50 groups across 1,954 workgroups (500,000 / 256, rounded up): each workgroup accumulates ~256 rows into 50 local counters. The global reduction phase performs 50 * 1,954 = 97,700 global atomics, spread across 50 addresses: approximately 1,954 atomics per address. At the L2 controller's throughput of 1 atomic per cycle per address, the global reduction takes ~1,954 cycles (approximately 1 microsecond at 2 GHz). Negligible.
Total time for low-cardinality GROUP BY on 500,000 rows: 0.9 ms on a discrete GPU.
High-cardinality path: global memory accumulators
When the group count exceeds shared memory capacity, the accumulators must live in global memory:
// group_col, value_col, global_sums and row_count are the same
// storage/uniform bindings used by the shared memory shader
@compute @workgroup_size(256)
fn group_by_sum_global(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x >= row_count) { return; }
    let group_key = group_col[gid.x];
    let value = value_col[gid.x];
    atomicAdd(&global_sums[group_key], value);
}
Simpler shader. Worse performance. Every thread writes directly to global memory. No local reduction. No workgroup-level aggregation.
For 500,000 rows with 50,000 groups: every row produces one global atomic. 500,000 global atomics total. Average contention per address: 500,000 / 50,000 = 10 atomics per group. With 3,072 threads in flight spread across 50,000 addresses, the expected contention is 3,072 / 50,000 ≈ 0.06 concurrent threads per address. Low per-address contention.
This sounds manageable. But each of the 500,000 global atomics still costs 200 to 400 cycles (the round trip to L2), regardless of contention. At an average of 300 cycles and a 2 GHz clock: 500,000 * 300 / 2,000,000,000 = 0.075 seconds = 75 ms. The atomic latency, not the contention, dominates.
Compare: 500,000 rows with 50 groups using the shared memory path takes 0.9 ms. The same row count with 50,000 groups using the global path takes 75 ms. An 83x slowdown from group cardinality alone.
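The two estimates can be reproduced in a few lines. This is a sketch of the section's own simplifying assumptions (2 GHz clock, ~300-cycle global atomics treated as fully serialized, 1 atomic per cycle per L2 address), not a general GPU cost model:

```typescript
// Reproducing this section's arithmetic under its stated assumptions.
const CLOCK_HZ = 2e9; // 2 GHz

// Global path: 500,000 global atomics at ~300 cycles each, serialized.
const globalAtomicMs = ((500_000 * 300) / CLOCK_HZ) * 1000; // ~75 ms

// Shared path, global reduction phase: ~1,954 workgroups each flush 50
// counters. Per address the L2 retires 1 atomic per cycle, and the 50
// addresses drain in parallel, so the phase takes ~1,954 cycles.
const workgroups = Math.ceil(500_000 / 256); // 1,954
const reductionUs = (workgroups / CLOCK_HZ) * 1e6; // just under 1 microsecond
```

The 83x gap falls directly out of these two numbers: 75 ms of serialized global atomic latency on one side, a sub-microsecond reduction phase on the other.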
The cardinality cliff
The transition from shared memory to global memory is not gradual. It is a cliff:
| Groups | Accumulator location | Time (500K rows, discrete GPU) |
|---|---|---|
| 50 | Shared memory (200 bytes) | 0.9 ms |
| 200 | Shared memory (800 bytes) | 1.1 ms |
| 500 | Shared memory (2 KB) | 1.4 ms |
| 1,024 | Shared memory (4 KB, limit) | 1.8 ms |
| 1,025 | Global memory (forced) | 18.4 ms |
| 2,000 | Global memory | 24.1 ms |
| 10,000 | Global memory | 48.3 ms |
| 50,000 | Global memory | 75.2 ms |
At 1,024 groups: 1.8 ms. At 1,025 groups: 18.4 ms. A 10x jump from a single additional group. The cliff exists because the shared-to-global transition changes the atomic latency from 20 to 50 cycles to 200 to 400 cycles, and eliminates the workgroup-local reduction that batched atomics.
An 8-thread CPU Web Worker pool running the same aggregation with thread-local hash maps: 6.1 ms regardless of group count. At 1,025+ groups, the CPU is 3x to 12x faster than the GPU.
The dispatch engine must predict which side of the cliff the query lands on before allocating any GPU resources.
The estimation problem
The group count for a single dictionary-encoded column is known exactly: it is the dictionary size. GROUP BY status on a column with 6 unique values has exactly 6 groups.
The hard case is composite group keys: GROUP BY region, quarter, product_category. The group count is the number of distinct combinations of (region, quarter, product_category) in the data. This is not the product of individual cardinalities. If there are 20 regions, 8 quarters, and 150 product categories, the theoretical maximum is 20 * 8 * 150 = 24,000. But many combinations may not exist in the data. The actual group count could be 800 or 18,000. You cannot know without scanning the full dataset.
A full scan to count distinct composite keys takes O(n) time with a hash set. For 500,000 rows, that is 3 to 8 ms on a single CPU thread. This is comparable to the aggregation itself. You are spending as much time deciding where to run the operation as the operation takes.
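For reference, the full-scan baseline the estimator has to beat is a one-pass hash set count. A minimal sketch:

```typescript
// Exact distinct count of a composite key: the O(n) full-scan baseline
// that costs 3 to 8 ms at 500,000 rows on a single thread.
function exactDistinctCount(columns: Uint32Array[]): number {
  const seen = new Set<string>();
  const n = columns[0].length;
  for (let i = 0; i < n; i++) {
    let key = '';
    for (const col of columns) {
      key += col[i] + ':';
    }
    seen.add(key);
  }
  return seen.size;
}
```

Exact, but every row pays for key construction and a hash set probe, which is why the cost is comparable to the aggregation itself.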
We need an estimator that is faster than a full scan and accurate enough to make the correct dispatch decision.
The Chao1 species richness estimator
The Chao1 estimator comes from ecology. Anne Chao developed it in 1984 to estimate the total number of species in a habitat from a sample that may not have observed every species. The mathematical structure maps directly to our problem: estimate the total number of distinct group keys in a dataset from a sample that may not have observed every key.
The formula
Chao1 = d + f1 * (f1 - 1) / (2 * (f2 + 1))
Where:
- d = the number of distinct values observed in the sample
- f1 = the number of values observed exactly once (singletons)
- f2 = the number of values observed exactly twice (doubletons)
The intuition: if a sample contains many singletons (values seen only once), there are likely many more unseen values in the full population. The ratio of singletons to doubletons quantifies this: a high f1/f2 ratio indicates a long tail of rare values. The correction term f1 * (f1 - 1) / (2 * (f2 + 1)) estimates the count of unseen species. The (f2 + 1) denominator provides a bias correction and handles the edge case of zero doubletons without requiring a special division-by-zero guard.
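As a concrete check of the formula (the figures match the first row of the validation table later in this section):

```typescript
// Chao1 = d + f1 * (f1 - 1) / (2 * (f2 + 1))
function chao1(d: number, f1: number, f2: number): number {
  return Math.ceil(d + (f1 * (f1 - 1)) / (2 * (f2 + 1)));
}

// 148 observed keys, 12 singletons, 18 doubletons:
// 148 + 132/38 = 151.47, rounded up.
chao1(148, 12, 18); // → 152

// Zero doubletons: the (f2 + 1) denominator keeps the division finite.
chao1(100, 10, 0); // → 145
```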
Why Chao1 and not simpler estimators
Sample distinct count (d) alone is a lower bound. It always underestimates because the sample misses rare groups. For a 1% sample of 500,000 rows with 5,000 true groups, the sample of 5,000 rows might observe only 2,800 distinct keys. Using 2,800 as the estimate would place the query well below the shared memory threshold, dispatching to the GPU. The actual 5,000 groups require global memory. The query takes 75 ms instead of 6.1 ms on CPU.
Scaled distinct count (d * N/n) overestimates when the sample is large relative to the population. It assumes uniform distribution of groups, which is almost never true for real-world data (city-level GROUP BY follows a power law: a few cities dominate, most appear rarely).
Chao1 provides a lower bound on the true distinct count that accounts for unobserved values. It is robust to skewed distributions because the singleton/doubleton ratio captures the tail behaviour. It requires no distributional assumptions. It works on any data.
Implementation
function estimateGroupCardinality(
  column: Uint32Array,
  sampleSize: number
): number {
  // Step 1: Draw a systematic sample (every step-th row)
  const step = Math.max(1, Math.floor(column.length / sampleSize));
  const frequencies = new Map<number, number>();
  for (let i = 0; i < column.length; i += step) {
    const key = column[i];
    frequencies.set(key, (frequencies.get(key) || 0) + 1);
  }

  // Step 2: Count singletons and doubletons
  const d = frequencies.size; // Observed distinct values
  let f1 = 0; // Values observed exactly once
  let f2 = 0; // Values observed exactly twice
  for (const count of frequencies.values()) {
    if (count === 1) f1++;
    if (count === 2) f2++;
  }

  // Step 3: Apply bias-corrected Chao1
  // The (f2 + 1) denominator handles f2 = 0 without a special guard
  const chao1 = d + (f1 * (f1 - 1)) / (2 * (f2 + 1));
  return Math.ceil(chao1);
}
For a 1% sample (5,000 rows from 500,000), the function iterates 5,000 elements, builds a frequency map, and computes the estimate. Total time: 0.2 to 0.4 ms. This is 10x to 20x faster than a full scan.
Accuracy on real-world distributions
We validated the Chao1 estimator against 12 enterprise datasets (CRM, financial, logistics) with known group counts; four representative results:
| Dataset | True groups | Sample size | d (observed) | f1 | f2 | Chao1 estimate | Error |
|---|---|---|---|---|---|---|---|
| Region x Quarter | 156 | 5,000 | 148 | 12 | 18 | 152 | -2.6% |
| Country x Status | 842 | 5,000 | 621 | 198 | 87 | 843 | +0.1% |
| City x Product | 8,412 | 5,000 | 2,814 | 1,206 | 412 | 4,574 | -45.6% |
| User x Month | 47,200 | 5,000 | 3,892 | 2,341 | 608 | 8,390 | -82.2% |
Two observations:
For cardinalities below 1,000 (the critical decision region): Chao1 is highly accurate. Median error under 3%. This is where the dispatch decision matters most (shared memory vs. global memory boundary is at ~1,024).
For cardinalities above 10,000: Chao1 underestimates significantly because a 1% sample cannot observe enough of the tail. But this does not matter for the dispatch decision. The estimate of 4,574 for a true 8,412 is above any shared memory threshold. The query routes to CPU regardless of whether the true count is 4,574 or 8,412. The direction of the decision is correct even when the magnitude is wrong.
For the extreme case (47,200 true groups, estimated 8,390): the estimate is 82% low, but 8,390 is vastly above the shared memory limit of 1,024. The query routes to CPU. The correct decision.
Increasing sample size for borderline cases
When the initial Chao1 estimate falls within 20% of the shared memory threshold (between ~820 and ~1,230 groups for a 1,024-group limit), the engine runs a second pass with a 5% sample (25,000 rows). The larger sample reduces the estimation error to under 5% for cardinalities below 2,000, resolving most borderline cases.
The two-pass strategy costs at most 0.4 ms + 1.5 ms = 1.9 ms. Still faster than a full scan (3 to 8 ms) and much faster than a wrong dispatch decision (18+ ms GPU penalty at 1,025 groups).
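The two-pass logic can be sketched as follows. This assumes a sampler like `estimateGroupCardinality` above, wrapped in a hypothetical `estimateAt` callback that runs a Chao1 estimate at a given sample fraction:

```typescript
// Hedged sketch: re-estimate with a larger sample when the first
// estimate lands within 20% of the shared memory threshold.
function resolveCardinality(
  estimateAt: (sampleFraction: number) => number,
  sharedMemoryLimit: number = 1024
): number {
  const first = estimateAt(0.01); // 1% sample, 0.2 to 0.4 ms
  const lower = sharedMemoryLimit * 0.8; // ~820
  const upper = sharedMemoryLimit * 1.2; // ~1,230
  if (first >= lower && first <= upper) {
    return estimateAt(0.05); // borderline: 5% sample, ~1.5 ms
  }
  return first;
}
```

Estimates far from the threshold return immediately; only borderline cases pay for the second pass.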
Composite key estimation
For composite GROUP BY keys (GROUP BY region, quarter, product_category), the engine cannot simply multiply individual column cardinalities. A column with 20 regions and a column with 8 quarters do not necessarily produce 160 combinations. They might produce 80 (if half the region-quarter pairs have no data).
The engine handles composite keys by computing a synthetic composite column at sample time:
function estimateCompositeCardinality(
  columns: Uint32Array[],
  sampleSize: number
): number {
  const step = Math.max(1, Math.floor(columns[0].length / sampleSize));
  const frequencies = new Map<string, number>();
  for (let i = 0; i < columns[0].length; i += step) {
    // Composite key: concatenate column values
    let key = '';
    for (const col of columns) {
      key += col[i].toString(36) + ':';
    }
    frequencies.set(key, (frequencies.get(key) || 0) + 1);
  }

  const d = frequencies.size;
  let f1 = 0, f2 = 0;
  for (const count of frequencies.values()) {
    if (count === 1) f1++;
    if (count === 2) f2++;
  }
  const chao1 = d + (f1 * (f1 - 1)) / (2 * (f2 + 1));
  return Math.ceil(chao1);
}
The composite key is a string concatenation of the encoded column values. This is CPU-side only (the GPU never sees these strings). The concatenation-based key is slower than a pure integer key but runs on a small sample (5,000 rows). Total cost: 0.3 to 0.6 ms.
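When every key column is dictionary-encoded with a known, small dictionary, a plausible alternative (a hypothetical optimization sketch, not necessarily what the engine ships) is to pack the codes into one integer and skip string allocation entirely:

```typescript
// Hypothetical: pack dictionary codes into a single u32 key.
// E.g. 20 regions x 8 quarters x 150 categories needs 5 + 3 + 8 = 16 bits.
// Valid only while the bit widths sum to 32 or less and each code
// fits within its declared width.
function packKey(codes: number[], bits: number[]): number {
  let key = 0;
  for (let i = 0; i < codes.length; i++) {
    key = (key << bits[i]) | codes[i];
  }
  return key >>> 0; // keep unsigned
}
```

Usage: `packKey([region, quarter], [5, 3])` produces a distinct integer per (region, quarter) pair, which a `Map<number, number>` hashes much faster than a concatenated string.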
Integration with the dispatch scoring function
The Chao1 estimate feeds into the 6-factor dispatch scoring function as the operator-specific SQL metric (F2) for group-by operators. The penalty computation:
function computeGroupCardinalityPenalty(
  estimatedGroups: number,
  sharedMemoryLimit: number // typically 1,024 for SUM, 512 for SUM+COUNT
): number {
  if (estimatedGroups <= sharedMemoryLimit) {
    // Shared memory path viable. Mild linear penalty for accumulator init cost.
    return 1 + (estimatedGroups / sharedMemoryLimit) * 0.3;
  }
  if (estimatedGroups <= sharedMemoryLimit * 4) {
    // 1,024 to 4,096 groups. Global memory path. Severe penalty.
    const overflow = estimatedGroups / sharedMemoryLimit;
    return overflow * 3.5; // Score divisor of 3.5x to 14x
  }
  // Above 4,096 groups. Drives the score non-positive (CPU or Worker tier).
  return (estimatedGroups / sharedMemoryLimit) * 8;
}
The group cardinality penalty feeds into the dispatch scoring model alongside the other five factors (F1 row count vs threshold, F3 GPU class adjustment, F4 vendor tuning, F5 GPU buffer retention bonus, F6 hardware-adaptive buffer threshold).
For 500,000 rows with Chao1 estimate of 50 groups: penalty = 1.015. Score barely affected. GPU dispatch.
For 500,000 rows with Chao1 estimate of 800 groups: penalty = 1.23. Score divided by 1.23. Still positive on most hardware. GPU dispatch with shared memory.
For 500,000 rows with Chao1 estimate of 2,000 groups: penalty = 2,000/1,024 * 3.5 = 6.8. Score divided by 6.8. Non-positive on most hardware. CPU or Worker dispatch depending on row count.
For 500,000 rows with Chao1 estimate of 10,000 groups: penalty = 10,000/1,024 * 8 = 78. Score divided by 78. Non-positive on all hardware. CPU or Worker dispatch.
The penalty curve is intentionally steep above the shared memory limit. The cardinality cliff (1.8 ms at 1,024 groups to 18.4 ms at 1,025) demands an aggressive routing change, not a gentle score reduction.
What the CPU tier does differently
When the aggregation routes to the Web Worker tier, each worker builds a thread-local hash map:
// Worker: local hash map aggregation
function aggregateChunk(data, groupCol, valueCol, start, end) {
const localMap = new Map();
for (let i = start; i < end; i++) {
const key = groupCol[i];
const val = valueCol[i];
const existing = localMap.get(key);
if (existing !== undefined) {
localMap.set(key, existing + val);
} else {
localMap.set(key, val);
}
}
return localMap;
}
After all workers complete, the main thread merges K hash maps:
function mergeMaps(maps: Map<number, number>[]): Map<number, number> {
  const merged = new Map<number, number>();
  for (const map of maps) {
    for (const [key, value] of map) {
      merged.set(key, (merged.get(key) || 0) + value);
    }
  }
  return merged;
}
Zero atomic operations. Each worker writes to its own private Map. No shared state during the aggregation phase. The merge phase iterates K small maps (K = worker count) on the main thread. For 8 workers with 50,000 groups each: 400,000 map entries to merge. On the main thread: 3 to 5 ms.
Arbitrary cardinality. JavaScript's Map resizes dynamically. 50 groups or 5 million groups: the hash map handles it. The performance scales linearly with row count and group count, with no cliff.
Consistent performance. CPU GROUP BY on 500,000 rows: 6.1 ms for 50 groups, 7.3 ms for 5,000 groups, 9.8 ms for 50,000 groups. The increase is gentle (larger maps, more merge work) not catastrophic.
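These three properties fall directly out of the chunk-and-merge structure. A self-contained sketch of the full flow, run inline here (in production each chunk is processed by its own Web Worker):

```typescript
// Sketch: split the row range into K chunks, aggregate each into a
// private map, then merge. Mirrors aggregateChunk/mergeMaps above.
function groupBySum(
  groupCol: Uint32Array,
  valueCol: Float64Array,
  workers: number
): Map<number, number> {
  const n = groupCol.length;
  const chunkSize = Math.ceil(n / workers);
  const partials: Map<number, number>[] = [];
  for (let w = 0; w < workers; w++) {
    const start = w * chunkSize;
    const end = Math.min(n, start + chunkSize);
    const local = new Map<number, number>();
    for (let i = start; i < end; i++) {
      local.set(groupCol[i], (local.get(groupCol[i]) ?? 0) + valueCol[i]);
    }
    partials.push(local);
  }
  // Merge phase: K small maps, no shared state during aggregation.
  const merged = new Map<number, number>();
  for (const m of partials) {
    for (const [k, v] of m) merged.set(k, (merged.get(k) ?? 0) + v);
  }
  return merged;
}
```

No atomics, no fixed-size accumulators, no cardinality cliff: the maps grow to whatever the data contains.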
The full decision path
A query arrives: GROUP BY region, quarter WITH SUM(revenue).
1. Check if single-column group key. Region is dictionary-encoded with 20 unique values. Quarter has 8. Composite key.
2. Estimate composite cardinality via Chao1. 1% sample (5,000 rows). Observed: 142 distinct pairs. f1 = 8, f2 = 14. Chao1 = 142 + 8 * (8 - 1) / (2 * (14 + 1)) = 142 + 56/30 = 143.87, rounded up to 144. Estimated: 144 groups.
3. Compare to shared memory limit. 144 < 1,024. Shared memory path is viable.
4. Compute group cardinality penalty. 1 + (144/1024) * 0.3 = 1.042. Negligible.
5. Compute full 6-factor dispatch score. F1 row count: 500,000 (above the 50K discrete threshold, high). F2 group cardinality penalty: 1.042. F3 GPU class: discrete (favourable). F4 vendor tuning: applied. F5 GPU buffer retention: applicable if the preceding operator also ran on the GPU. F6 buffer threshold: within limits. Score: 2.1. Above 1.0.
6. Float32 ordering-preservation safety check. SUM(revenue) with a maximum accumulated total of ~£50M, which exceeds the Float32 safe integer threshold (integers are exact in Float32 only up to 2^24, about 16.7M). The safety check blocks GPU for the SUM accumulator. Route to CPU Float64.
The Chao1 estimate (144 groups) was below the GPU threshold, but the Float32 ordering-preservation safety check independently blocked GPU dispatch for a different reason. Multiple safety systems evaluate independently. The strictest constraint wins.
If the revenue column were replaced with a quantity column (max accumulation ~800,000, within Float32 safe range), the precision check would pass, and the GPU would handle the aggregation in 0.9 ms with shared memory accumulators. The Chao1 estimate enabled that path by confirming the group count fits in shared memory.
Where this matters
Every analytical dashboard runs GROUP BY. Every pivot table. Every breakdown chart. Every summary report. The group cardinality determines whether that query takes 0.9 ms (GPU shared memory), 75 ms (GPU global memory), or 6.1 ms (CPU hash map).
Without cardinality estimation, the engine has two choices: always use the GPU (catastrophic for high-cardinality queries) or always use the CPU (leaving 6x performance on the table for low-cardinality queries). Chao1 enables the engine to make the right choice per-query, per-dataset, spending 0.3 ms on estimation to save 10 to 70 ms on execution.
This statistical estimation is one component of the adaptive dispatch architecture that powers our enterprise AI automation infrastructure. The engine does not guess. It samples. It estimates. It scores. Then it dispatches to the tier that handles the query without hitting a cardinality cliff. Every GROUP BY, on every dataset, on every device.