The GROUP BY problem on GPUs
GROUP BY is the most common analytical SQL operation. Every dashboard pivot, every summary report, every breakdown-by-category query uses it. The operation is conceptually simple: partition rows into groups by key, apply an aggregate function (SUM, COUNT, AVG, MIN, MAX) to each group, and return one row per group.
On a CPU, this is a hash map problem. Iterate over the rows. For each row, hash the group key, look up the bucket, and update the accumulator. The hash map handles arbitrary cardinality: 10 groups or 10 million groups. The memory is heap-allocated. The hash table resizes dynamically.
On a GPU, none of this works.
GPUs cannot allocate heap memory at runtime. There is no malloc in WGSL. There are no dynamic hash tables. There is no resizable data structure. Every buffer is fixed-size, allocated before the shader dispatches.
The GPU GROUP BY must use a fixed-size accumulator array. Each group gets a slot in the array. Each thread looks up its row's group slot and updates the accumulator atomically. The array size must be known at dispatch time. It must be large enough to hold every group. And the atomic update on each slot is the performance bottleneck.
How GPU GROUP BY works
Low-cardinality path: shared memory accumulators
For group counts that fit in workgroup shared memory (up to 1,024 groups for a single SUM accumulator at 4 bytes per group, i.e. 4 KB of the 16 KB workgroup storage budget, leaving headroom for multi-accumulator aggregates such as SUM+COUNT), each workgroup maintains a local accumulator array:
// Bindings assumed by this shader (layout illustrative)
@group(0) @binding(0) var<storage, read> group_col: array<u32>;
@group(0) @binding(1) var<storage, read> value_col: array<u32>;
@group(0) @binding(2) var<storage, read_write> global_sums: array<atomic<u32>>;
@group(0) @binding(3) var<uniform> row_count: u32;
@group(0) @binding(4) var<uniform> group_count: u32;

var<workgroup> local_sums: array<atomic<u32>, 1024>;

@compute @workgroup_size(256)
fn group_by_sum(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>,
    @builtin(global_invocation_id) gid: vec3<u32>
) {
    // Zero the shared accumulators cooperatively (256 threads, 1,024 slots)
    for (var i = lid.x; i < 1024u; i += 256u) {
        atomicStore(&local_sums[i], 0u);
    }
    workgroupBarrier();

    // Each thread accumulates its row into the workgroup-local array
    if (gid.x < row_count) {
        let group_key = group_col[gid.x];
        let value = value_col[gid.x];
        atomicAdd(&local_sums[group_key], value);
    }
    workgroupBarrier();

    // Thread 0 flushes the local accumulators to the global array
    if (lid.x == 0u) {
        for (var g: u32 = 0u; g < group_count; g++) {
            let local_val = atomicLoad(&local_sums[g]);
            if (local_val > 0u) {
                atomicAdd(&global_sums[g], local_val);
            }
        }
    }
}
The shared memory path has two advantages. First, shared memory atomics are fast: 20 to 50 cycles versus 200 to 400 cycles for global memory. Second, contention is bounded: at most 256 threads (the workgroup size) contend on the local accumulators, not 3,072 threads across the entire GPU.
For 500,000 rows with 50 groups across 1,954 workgroups (500,000 / 256, rounded up): each workgroup accumulates ~256 rows into 50 local counters. The global reduction phase performs 50 * 1,954 = 97,700 global atomics, spread across 50 addresses: approximately 1,954 atomics per address. At the L2 controller's throughput of 1 atomic per cycle per address, the global reduction takes ~1,954 cycles (approximately 1 microsecond at 2 GHz). Negligible.
Total time for low-cardinality GROUP BY on 500,000 rows: 0.9 ms on a discrete GPU.
High-cardinality path: global memory accumulators
When the group count exceeds shared memory capacity, the accumulators must live in global memory:
// group_col, value_col, global_sums and row_count are the same
// storage/uniform bindings used by the shared memory shader
@compute @workgroup_size(256)
fn group_by_sum_global(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x >= row_count) { return; }
    let group_key = group_col[gid.x];
    let value = value_col[gid.x];
    atomicAdd(&global_sums[group_key], value);
}
Simpler shader. Worse performance. Every thread writes directly to global memory. No local reduction. No workgroup-level aggregation.
For 500,000 rows with 50,000 groups: every row produces one global atomic. 500,000 global atomics total. Average contention per address: 500,000 / 50,000 = 10 atomics per group. With 3,072 threads in flight spread across 50,000 addresses, the expected contention is 3,072 / 50,000 ≈ 0.06 concurrent threads per address. Low per-address contention.
This sounds manageable. But each of the 500,000 global atomics still costs 200 to 400 cycles (the round trip to L2), regardless of contention. At an average of 300 cycles and a 2 GHz clock: 500,000 * 300 / 2,000,000,000 = 0.075 seconds = 75 ms. The atomic latency, not the contention, dominates.
Compare: 500,000 rows with 50 groups using the shared memory path takes 0.9 ms. The same row count with 50,000 groups using the global path takes 75 ms. An 83x slowdown from group cardinality alone.
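The two estimates can be reproduced in a few lines. This is a sketch of the section's own simplifying assumptions (2 GHz clock, ~300-cycle global atomics treated as fully serialized, 1 atomic per cycle per L2 address), not a general GPU cost model:

```typescript
// Reproducing this section's arithmetic under its stated assumptions.
const CLOCK_HZ = 2e9; // 2 GHz

// Global path: 500,000 global atomics at ~300 cycles each, serialized.
const globalAtomicMs = ((500_000 * 300) / CLOCK_HZ) * 1000; // ~75 ms

// Shared path, global reduction phase: ~1,954 workgroups each flush 50
// counters. Per address the L2 retires 1 atomic per cycle, and the 50
// addresses drain in parallel, so the phase takes ~1,954 cycles.
const workgroups = Math.ceil(500_000 / 256); // 1,954
const reductionUs = (workgroups / CLOCK_HZ) * 1e6; // just under 1 microsecond
```

The 83x gap falls directly out of these two numbers: 75 ms of serialized global atomic latency on one side, a sub-microsecond reduction phase on the other.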
The cardinality cliff
The transition from shared memory to global memory is not gradual. It is a cliff:
| Groups | Accumulator location | Time (500K rows, discrete GPU) |
|---|---|---|
| 50 | Shared memory (200 bytes) | 0.9 ms |
| 200 | Shared memory (800 bytes) | 1.1 ms |
| 500 | Shared memory (2 KB) | 1.4 ms |
| 1,024 | Shared memory (4 KB, limit) | 1.8 ms |
| 1,025 | Global memory (forced) | 18.4 ms |
| 2,000 | Global memory | 24.1 ms |
| 10,000 | Global memory | 48.3 ms |
| 50,000 | Global memory | 75.2 ms |
At 1,024 groups: 1.8 ms. At 1,025 groups: 18.4 ms. A 10x jump from a single additional group. The cliff exists because the shared-to-global transition changes the atomic latency from 20 to 50 cycles to 200 to 400 cycles, and eliminates the workgroup-local reduction that batched atomics.
An 8-thread CPU Web Worker pool running the same aggregation with thread-local hash maps: 6.1 ms regardless of group count. At 1,025+ groups, the CPU is 3x to 12x faster than the GPU.
The dispatch engine must predict which side of the cliff the query lands on before allocating any GPU resources.
The estimation problem
The group count for a single dictionary-encoded column is known exactly: it is the dictionary size. GROUP BY status on a column with 6 unique values has exactly 6 groups.
The hard case is composite group keys: GROUP BY region, quarter, product_category. The group count is the number of distinct combinations of (region, quarter, product_category) in the data. This is not the product of individual cardinalities. If there are 20 regions, 8 quarters, and 150 product categories, the theoretical maximum is 20 * 8 * 150 = 24,000. But many combinations may not exist in the data. The actual group count could be 800 or 18,000. You cannot know without scanning the full dataset.
A full scan to count distinct composite keys takes O(n) time with a hash set. For 500,000 rows, that is 3 to 8 ms on a single CPU thread. This is comparable to the aggregation itself. You are spending as much time deciding where to run the operation as the operation takes.
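For reference, the full-scan baseline the estimator has to beat is a one-pass hash set count. A minimal sketch:

```typescript
// Exact distinct count of a composite key: the O(n) full-scan baseline
// that costs 3 to 8 ms at 500,000 rows on a single thread.
function exactDistinctCount(columns: Uint32Array[]): number {
  const seen = new Set<string>();
  const n = columns[0].length;
  for (let i = 0; i < n; i++) {
    let key = '';
    for (const col of columns) {
      key += col[i] + ':';
    }
    seen.add(key);
  }
  return seen.size;
}
```

Exact, but every row pays for key construction and a hash set probe, which is why the cost is comparable to the aggregation itself.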
We need an estimator that is faster than a full scan and accurate enough to make the correct dispatch decision.
The Chao1 species richness estimator
The Chao1 estimator comes from ecology. Anne Chao developed it in 1984 to estimate the total number of species in a habitat from a sample that may not have observed every species. The mathematical structure maps directly to our problem: estimate the total number of distinct group keys in a dataset from a sample that may not have observed every key.
The formula
Chao1 = d + f1 * (f1 - 1) / (2 * (f2 + 1))
Where:
- d = the number of distinct values observed in the sample
- f1 = the number of values observed exactly once (singletons)
- f2 = the number of values observed exactly twice (doubletons)
The intuition: if a sample contains many singletons (values seen only once), there are likely many more unseen values in the full population. The ratio of singletons to doubletons quantifies this: a high f1/f2 ratio indicates a long tail of rare values. The correction term f1 * (f1 - 1) / (2 * (f2 + 1)) estimates the count of unseen species. The (f2 + 1) denominator provides a bias correction and handles the edge case of zero doubletons without requiring a special division-by-zero guard.
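As a concrete check of the formula (the figures match the first row of the validation table later in this section):

```typescript
// Chao1 = d + f1 * (f1 - 1) / (2 * (f2 + 1))
function chao1(d: number, f1: number, f2: number): number {
  return Math.ceil(d + (f1 * (f1 - 1)) / (2 * (f2 + 1)));
}

// 148 observed keys, 12 singletons, 18 doubletons:
// 148 + 132/38 = 151.47, rounded up.
chao1(148, 12, 18); // → 152

// Zero doubletons: the (f2 + 1) denominator keeps the division finite.
chao1(100, 10, 0); // → 145
```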
Why Chao1 and not simpler estimators
Sample distinct count (d) alone is a lower bound. It always underestimates because the sample misses rare groups. For a 1% sample of 500,000 rows with 5,000 true groups, the sample of 5,000 rows might observe only 2,800 distinct keys. Using 2,800 as the estimate would place the query well below the shared memory threshold, dispatching to the GPU. The actual 5,000 groups require global memory. The query takes 75 ms instead of 6.1 ms on CPU.
Scaled distinct count (d * N/n) overestimates when the sample is large relative to the population. It assumes uniform distribution of groups, which is almost never true for real-world data (city-level GROUP BY follows a power law: a few cities dominate, most appear rarely).
Chao1 provides a lower bound on the true distinct count that accounts for unobserved values. It is robust to skewed distributions because the singleton/doubleton ratio captures the tail behaviour. It requires no distributional assumptions. It works on any data.
Implementation
function estimateGroupCardinality(
  column: Uint32Array,
  sampleSize: number
): number {
  // Step 1: Draw a systematic sample (every step-th row)
  const step = Math.max(1, Math.floor(column.length / sampleSize));
  const frequencies = new Map<number, number>();
  for (let i = 0; i < column.length; i += step) {
    const key = column[i];
    frequencies.set(key, (frequencies.get(key) || 0) + 1);
  }

  // Step 2: Count singletons and doubletons
  const d = frequencies.size; // Observed distinct values
  let f1 = 0; // Values observed exactly once
  let f2 = 0; // Values observed exactly twice
  for (const count of frequencies.values()) {
    if (count === 1) f1++;
    if (count === 2) f2++;
  }

  // Step 3: Apply bias-corrected Chao1
  // The (f2 + 1) denominator handles f2 = 0 without a special guard
  const chao1 = d + (f1 * (f1 - 1)) / (2 * (f2 + 1));
  return Math.ceil(chao1);
}
For a 1% sample (5,000 rows from 500,000), the function iterates 5,000 elements, builds a frequency map, and computes the estimate. Total time: 0.2 to 0.4 ms. This is 10x to 20x faster than a full scan.
Accuracy on real-world distributions
We validated the Chao1 estimator against 12 enterprise datasets (CRM, financial, logistics) with known group counts; four representative results:
| Dataset | True groups | Sample size | d (observed) | f1 | f2 | Chao1 estimate | Error |
|---|---|---|---|---|---|---|---|
| Region x Quarter | 156 | 5,000 | 148 | 12 | 18 | 152 | -2.6% |
| Country x Status | 842 | 5,000 | 621 | 198 | 87 | 843 | +0.1% |
| City x Product | 8,412 | 5,000 | 2,814 | 1,206 | 412 | 4,574 | -45.6% |
| User x Month | 47,200 | 5,000 | 3,892 | 2,341 | 608 | 8,390 | -82.2% |
Two observations:
For cardinalities below 1,000 (the critical decision region): Chao1 is highly accurate. Median error under 3%. This is where the dispatch decision matters most (shared memory vs. global memory boundary is at ~1,024).
For cardinalities above 10,000: Chao1 underestimates significantly because a 1% sample cannot observe enough of the tail. But this does not matter for the dispatch decision. The estimate of 4,574 for a true 8,412 is above any shared memory threshold. The query routes to CPU regardless of whether the true count is 4,574 or 8,412. The direction of the decision is correct even when the magnitude is wrong.
For the extreme case (47,200 true groups, estimated 8,390): the estimate is 82% low, but 8,390 is vastly above the shared memory limit of 1,024. The query routes to CPU. The correct decision.
Increasing sample size for borderline cases
When the initial Chao1 estimate falls within 20% of the shared memory threshold (between ~820 and ~1,230 groups for a 1,024-group limit), the engine runs a second pass with a 5% sample (25,000 rows). The larger sample reduces the estimation error to under 5% for cardinalities below 2,000, resolving most borderline cases.
The two-pass strategy costs at most 0.4 ms + 1.5 ms = 1.9 ms. Still faster than a full scan (3 to 8 ms) and much faster than a wrong dispatch decision (18+ ms GPU penalty at 1,025 groups).
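The two-pass logic can be sketched as follows. This assumes a sampler like `estimateGroupCardinality` above, wrapped in a hypothetical `estimateAt` callback that runs a Chao1 estimate at a given sample fraction:

```typescript
// Hedged sketch: re-estimate with a larger sample when the first
// estimate lands within 20% of the shared memory threshold.
function resolveCardinality(
  estimateAt: (sampleFraction: number) => number,
  sharedMemoryLimit: number = 1024
): number {
  const first = estimateAt(0.01); // 1% sample, 0.2 to 0.4 ms
  const lower = sharedMemoryLimit * 0.8; // ~820
  const upper = sharedMemoryLimit * 1.2; // ~1,230
  if (first >= lower && first <= upper) {
    return estimateAt(0.05); // borderline: 5% sample, ~1.5 ms
  }
  return first;
}
```

Estimates far from the threshold return immediately; only borderline cases pay for the second pass.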
Composite key estimation
For composite GROUP BY keys (GROUP BY region, quarter, product_category), the engine cannot simply multiply individual column cardinalities. A column with 20 regions and a column with 8 quarters do not necessarily produce 160 combinations. They might produce 80 (if half the region-quarter pairs have no data).
The engine handles composite keys by computing a synthetic composite column at sample time:
function estimateCompositeCardinality(
  columns: Uint32Array[],
  sampleSize: number
): number {
  const step = Math.max(1, Math.floor(columns[0].length / sampleSize));
  const frequencies = new Map<string, number>();
  for (let i = 0; i < columns[0].length; i += step) {
    // Composite key: concatenate column values
    let key = '';
    for (const col of columns) {
      key += col[i].toString(36) + ':';
    }
    frequencies.set(key, (frequencies.get(key) || 0) + 1);
  }

  const d = frequencies.size;
  let f1 = 0, f2 = 0;
  for (const count of frequencies.values()) {
    if (count === 1) f1++;
    if (count === 2) f2++;
  }
  const chao1 = d + (f1 * (f1 - 1)) / (2 * (f2 + 1));
  return Math.ceil(chao1);
}
The composite key is a string concatenation of the encoded column values. This is CPU-side only (the GPU never sees these strings). The concatenation-based key is slower than a pure integer key but runs on a small sample (5,000 rows). Total cost: 0.3 to 0.6 ms.
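When every key column is dictionary-encoded with a known, small dictionary, a plausible alternative (a hypothetical optimization sketch, not necessarily what the engine ships) is to pack the codes into one integer and skip string allocation entirely:

```typescript
// Hypothetical: pack dictionary codes into a single u32 key.
// E.g. 20 regions x 8 quarters x 150 categories needs 5 + 3 + 8 = 16 bits.
// Valid only while the bit widths sum to 32 or less and each code
// fits within its declared width.
function packKey(codes: number[], bits: number[]): number {
  let key = 0;
  for (let i = 0; i < codes.length; i++) {
    key = (key << bits[i]) | codes[i];
  }
  return key >>> 0; // keep unsigned
}
```

Usage: `packKey([region, quarter], [5, 3])` produces a distinct integer per (region, quarter) pair, which a `Map<number, number>` hashes much faster than a concatenated string.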
Integration with the dispatch scoring function
The Chao1 estimate feeds into the 6-factor dispatch scoring function as the operator-specific SQL metric (F2) for group-by operators. The penalty computation:
function computeGroupCardinalityPenalty(
  estimatedGroups: number,
  sharedMemoryLimit: number // typically 1,024 for SUM, 512 for SUM+COUNT
): number {
  if (estimatedGroups <= sharedMemoryLimit) {
    // Shared memory path viable. Mild linear penalty for accumulator init cost.
    return 1 + (estimatedGroups / sharedMemoryLimit) * 0.3;
  }
  if (estimatedGroups <= sharedMemoryLimit * 4) {
    // 1,024 to 4,096 groups. Global memory path. Severe penalty.
    const overflow = estimatedGroups / sharedMemoryLimit;
    return overflow * 3.5; // Score divisor of 3.5x to 14x
  }
  // Above 4,096 groups. Drives the score non-positive (CPU or Worker tier).
  return (estimatedGroups / sharedMemoryLimit) * 8;
}
The group cardinality penalty feeds into the dispatch scoring model alongside the other five factors (F1 row count vs threshold, F3 GPU class adjustment, F4 vendor tuning, F5 GPU buffer retention bonus, F6 hardware-adaptive buffer threshold).
For 500,000 rows with Chao1 estimate of 50 groups: penalty = 1.015. Score barely affected. GPU dispatch.
For 500,000 rows with Chao1 estimate of 800 groups: penalty = 1.23. Score divided by 1.23. Still positive on most hardware. GPU dispatch with shared memory.
For 500,000 rows with Chao1 estimate of 2,000 groups: penalty = 2,000/1,024 * 3.5 = 6.8. Score divided by 6.8. Non-positive on most hardware. CPU or Worker dispatch depending on row count.
For 500,000 rows with Chao1 estimate of 10,000 groups: penalty = 10,000/1,024 * 8 = 78. Score divided by 78. Non-positive on all hardware. CPU or Worker dispatch.
The penalty curve is intentionally steep above the shared memory limit. The cardinality cliff (1.8 ms at 1,024 groups to 18.4 ms at 1,025) demands an aggressive routing change, not a gentle score reduction.
What the CPU tier does differently
When the aggregation routes to the Web Worker tier, each worker builds a thread-local hash map:
// Worker: local hash map aggregation
function aggregateChunk(data, groupCol, valueCol, start, end) {
const localMap = new Map();
for (let i = start; i < end; i++) {
const key = groupCol[i];
const val = valueCol[i];
const existing = localMap.get(key);
if (existing !== undefined) {
localMap.set(key, existing + val);
} else {
localMap.set(key, val);
}
}
return localMap;
}
After all workers complete, the main thread merges K hash maps:
function mergeMaps(maps: Map<number, number>[]): Map<number, number> {
  const merged = new Map<number, number>();
  for (const map of maps) {
    for (const [key, value] of map) {
      merged.set(key, (merged.get(key) || 0) + value);
    }
  }
  return merged;
}
Zero atomic operations. Each worker writes to its own private Map. No shared state during the aggregation phase. The merge phase iterates K small maps (K = worker count) on the main thread. For 8 workers with 50,000 groups each: 400,000 map entries to merge. On the main thread: 3 to 5 ms.
Arbitrary cardinality. JavaScript's Map resizes dynamically. 50 groups or 5 million groups: the hash map handles it. The performance scales linearly with row count and group count, with no cliff.
Consistent performance. CPU GROUP BY on 500,000 rows: 6.1 ms for 50 groups, 7.3 ms for 5,000 groups, 9.8 ms for 50,000 groups. The increase is gentle (larger maps, more merge work) not catastrophic.
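These three properties fall directly out of the chunk-and-merge structure. A self-contained sketch of the full flow, run inline here (in production each chunk is processed by its own Web Worker):

```typescript
// Sketch: split the row range into K chunks, aggregate each into a
// private map, then merge. Mirrors aggregateChunk/mergeMaps above.
function groupBySum(
  groupCol: Uint32Array,
  valueCol: Float64Array,
  workers: number
): Map<number, number> {
  const n = groupCol.length;
  const chunkSize = Math.ceil(n / workers);
  const partials: Map<number, number>[] = [];
  for (let w = 0; w < workers; w++) {
    const start = w * chunkSize;
    const end = Math.min(n, start + chunkSize);
    const local = new Map<number, number>();
    for (let i = start; i < end; i++) {
      local.set(groupCol[i], (local.get(groupCol[i]) ?? 0) + valueCol[i]);
    }
    partials.push(local);
  }
  // Merge phase: K small maps, no shared state during aggregation.
  const merged = new Map<number, number>();
  for (const m of partials) {
    for (const [k, v] of m) merged.set(k, (merged.get(k) ?? 0) + v);
  }
  return merged;
}
```

No atomics, no fixed-size accumulators, no cardinality cliff: the maps grow to whatever the data contains.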
The full decision path
A query arrives: GROUP BY region, quarter WITH SUM(revenue).
1. Check if single-column group key. Region is dictionary-encoded with 20 unique values. Quarter has 8. Composite key.
2. Estimate composite cardinality via Chao1. 1% sample (5,000 rows). Observed: 142 distinct pairs. f1 = 8, f2 = 14. Chao1 = 142 + 8 * (8 - 1) / (2 * (14 + 1)) = 142 + 56/30 = 143.87, rounded up to 144. Estimated: 144 groups.
3. Compare to shared memory limit. 144 < 1,024. Shared memory path is viable.
4. Compute group cardinality penalty. 1 + (144/1024) * 0.3 = 1.042. Negligible.
5. Compute full 6-factor dispatch score. F1 row count: 500,000 (above the 50K discrete threshold, high). F2 group cardinality penalty: 1.042. F3 GPU class: discrete (favourable). F4 vendor tuning: applied. F5 GPU buffer retention: applicable if the preceding operator also ran on the GPU. F6 buffer threshold: within limits. Score: 2.1. Above 1.0.
6. Float32 ordering-preservation safety check. SUM(revenue) with a maximum accumulated total of ~£50M, which exceeds the Float32 safe integer threshold (integers are exact in Float32 only up to 2^24, about 16.7M). The safety check blocks GPU for the SUM accumulator. Route to CPU Float64.
The Chao1 estimate (144 groups) was below the GPU threshold, but the Float32 ordering-preservation safety check independently blocked GPU dispatch for a different reason. Multiple safety systems evaluate independently. The strictest constraint wins.
If the revenue column were replaced with a quantity column (max accumulation ~800,000, within Float32 safe range), the precision check would pass, and the GPU would handle the aggregation in 0.9 ms with shared memory accumulators. The Chao1 estimate enabled that path by confirming the group count fits in shared memory.
Where this matters
Every analytical dashboard runs GROUP BY. Every pivot table. Every breakdown chart. Every summary report. The group cardinality determines whether that query takes 0.9 ms (GPU shared memory), 75 ms (GPU global memory), or 6.1 ms (CPU hash map).
Without cardinality estimation, the engine has two choices: always use the GPU (catastrophic for high-cardinality queries) or always use the CPU (leaving 6x performance on the table for low-cardinality queries). Chao1 enables the engine to make the right choice per-query, per-dataset, spending 0.3 ms on estimation to save 10 to 70 ms on execution.
This statistical estimation is one component of the adaptive dispatch architecture that powers our enterprise AI automation infrastructure. The engine does not guess. It samples. It estimates. It scores. Then it dispatches to the tier that handles the query without hitting a cardinality cliff. Every GROUP BY, on every dataset, on every device.