What atomic operations do
An atomic operation reads a memory address, modifies the value, and writes it back as a single indivisible unit. No other thread can observe the intermediate state. No other thread can interleave a write between the read and the write-back.
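The same read-modify-write guarantee exists on the CPU side of the browser: JavaScript's Atomics API applies an operation to a SharedArrayBuffer as one indivisible unit. A minimal TypeScript illustration (single-threaded here, but the guarantee holds across workers sharing the buffer):

```typescript
// A 4-byte shared buffer holding one 32-bit counter.
const buf = new SharedArrayBuffer(4);
const counter = new Int32Array(buf);

// Atomics.add returns the OLD value and applies the increment atomically;
// no other thread can interleave between the read and the write-back.
const old = Atomics.add(counter, 0, 1);
```

Atomics.add returning the previous value is the same contract as WGSL's atomicAdd, which is what makes the slot-claiming pattern discussed later possible.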
In WebGPU compute shaders, atomicAdd, atomicSub, atomicMax, atomicMin, atomicAnd, atomicOr, atomicXor, atomicExchange, and atomicCompareExchangeWeak provide these guarantees on atomic<u32> and atomic<i32> storage variables.
The canonical use case: counting results. A filter shader evaluates a predicate on every element and needs to report how many elements passed. Each thread that finds a match increments a shared counter:
```wgsl
// data and threshold bindings added for completeness; indices illustrative.
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> match_count: atomic<u32>;
@group(0) @binding(2) var<uniform> threshold: f32;

@compute @workgroup_size(256)
fn filter(@builtin(global_invocation_id) id: vec3<u32>) {
    let idx = id.x;
    if (data[idx] > threshold) {
        atomicAdd(&match_count, 1u);
    }
}
```
This shader is correct. It is also a performance trap.
Why atomic contention destroys GPU throughput
A GPU with 3,072 cores can have 3,072 threads executing the same shader simultaneously. If the predicate is selective (few matches), few threads hit the atomicAdd per cycle. No problem. If the predicate is broad (many matches), hundreds or thousands of threads hit atomicAdd on the same memory address in the same cycle.
The hardware cannot execute these atomics in parallel. By definition, an atomic operation is serial at the target address. The memory controller must process each atomicAdd one at a time, or in small batches depending on the architecture.
The serialization cascade
Here is what happens on current GPU hardware when 512 threads attempt atomicAdd on the same address within one dispatch cycle:
Step 1: Warp-level serialization. Within a 32-thread warp, threads that execute the same atomic to the same address are serialized. If 20 threads in a warp hit the atomicAdd, the warp stalls for 20 cycles while each thread completes its atomic. The other 12 threads (those that did not match the predicate) are already done and sit idle.
Step 2: L2 cache controller queuing. The atomic operation targets a specific L2 cache line. The GPU's memory controller can process a limited number of concurrent atomics to the same cache line: typically 1 per cycle on NVIDIA hardware, with some pipelining that allows 32 to 64 outstanding requests. When multiple warps submit atomics to the same address simultaneously, the excess requests queue at the memory controller.
Step 3: Memory bus saturation. Each queued atomic occupies a slot in the memory request pipeline. While these slots are held by atomic retries, other warps that need to read or write global memory for unrelated work are starved. The entire pipeline backs up behind the contention point.
The result is non-linear degradation. Doubling the number of contending threads does not double the latency. It can quadruple it, because the queue depth increases and the memory bus starvation compounds.
Measured contention costs
We benchmarked a simple counting shader (increment a single atomic<u32> counter for each matching element) on a discrete GPU with 3,072 cores:
| Match rate | Matches per dispatch | Time | Throughput vs. no atomics |
|---|---|---|---|
| 0.1% (1 in 1,000) | ~1,000 | 0.8 ms | 95% of peak |
| 1% (10 in 1,000) | ~10,000 | 1.1 ms | 72% of peak |
| 5% (50 in 1,000) | ~50,000 | 3.4 ms | 23% of peak |
| 10% (100 in 1,000) | ~100,000 | 8.7 ms | 9% of peak |
| 25% (250 in 1,000) | ~250,000 | 28.1 ms | 2.8% of peak |
| 50% (500 in 1,000) | ~500,000 | 63.4 ms | 1.2% of peak |
Dataset: 1 million elements. The no-atomics baseline (predicate evaluation without counter increment) completes in 0.76 ms.
At 10% match rate, the GPU is operating at 9% of its theoretical throughput. At 50%, it is at 1.2%. The shader is spending nearly all its time waiting for the memory controller to process atomic requests. The 3,072 cores are functionally idle.
An 8-thread Web Worker pool running the same filter in JavaScript takes 4.8 ms regardless of match rate. The CPU does not have an atomic contention problem: each worker increments a plain local counter with a simple count++, the main thread merges the 8 counters after the loop, and 8 threads with private counters produce no contention at all.
At 10% match rate, the GPU takes 8.7 ms. The CPU takes 4.8 ms. The GPU is 1.8x slower. At 25%, the GPU is 5.9x slower. The performance inversion grows with match density.
Beyond counting: where atomic contention appears
The counting example is the simplest case. Atomic contention appears in every GPU algorithm that writes to a shared output structure with unpredictable write targets.
Histogram construction
A histogram with 256 bins and 1 million elements has, on average, 3,906 elements per bin. Each bin is an atomic<u32> counter. Threads writing to the same bin contend. With 256 bins, the contention per bin is 1/256th of the total, which is manageable. But reduce the bin count (e.g., a 16-bin histogram for a low-cardinality categorical column), and contention per bin rises to 62,500 elements. The GPU stalls.
This is why our two-phase text search uses 128-bin character frequency histograms in workgroup shared memory rather than global memory. Shared memory atomics are faster (20 to 50 cycles versus 200 to 400 cycles for global memory), and the contention is limited to 256 threads per workgroup rather than 3,072 across the entire GPU.
Scatter writes in compaction
A parallel filter that compacts matching elements into a dense output array needs each thread to claim an output slot. The standard approach: atomicAdd on a global write pointer, then write the element to the returned index.
```wgsl
if (data[idx] > threshold) {
    let slot = atomicAdd(&write_head, 1u);
    output[slot] = data[idx];
}
```
Every matching thread contends on write_head. At 50% selectivity, half of all threads contend on a single address every cycle. This is why our compaction uses a prefix-sum approach instead: compute a bitmask, prefix-sum the bitmask to derive write positions, then scatter without atomics. The prefix sum is conflict-free and fully parallel.
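As a sketch of the atomic-free alternative, here is the mask / prefix-sum / scatter structure in sequential TypeScript. The GPU version runs the prefix sum in parallel; `compact` and its shape are illustrative, not the engine's API:

```typescript
// Phase 1: evaluate the predicate into a 0/1 mask.
// Phase 2: exclusive prefix-sum the mask to derive each match's output slot.
// Phase 3: scatter; every write target is unique, so no atomics are needed.
function compact(data: number[], threshold: number): number[] {
  const mask = data.map(v => (v > threshold ? 1 : 0));

  // Exclusive prefix sum: slots[i] = number of matches before index i.
  const slots = new Array<number>(mask.length);
  let running = 0;
  for (let i = 0; i < mask.length; i++) {
    slots[i] = running;
    running += mask[i];
  }

  const out = new Array<number>(running);
  for (let i = 0; i < data.length; i++) {
    if (mask[i] === 1) out[slots[i]] = data[i];
  }
  return out;
}
```

Because each matching element's slot is derived from the prefix sum rather than claimed at runtime, two threads can never target the same output index.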
Aggregation with high group cardinality
A GROUP BY aggregation where each group has its own accumulator (atomicAdd(&group_totals[group_id], value)) distributes contention across group_count addresses. With 1,000 groups and 1 million rows, each group receives ~1,000 atomic writes. With 10 groups, each receives ~100,000. The contention per group scales inversely with group count.
Our query engine uses the Chao1 group cardinality estimator precisely for this reason. Low group cardinality means high per-group contention. The scoring function penalizes GPU dispatch as estimated group count drops.
Our output density profiler
The contention examples share a common structure: the severity of atomic contention is determined by the output density of the operation. Output density is the ratio of elements that produce an atomic write to total elements processed.
For a filter: output density equals the selectivity (fraction of elements that pass the predicate).
For a histogram: output density is 1.0 (every element increments some bin), but contention per bin depends on bin count. Effective contention density is 1 / bin_count.
For a GROUP BY aggregation: output density is 1.0, with effective contention density of 1 / group_count.
For a text search: output density is the fraction of documents that contain the query pattern.
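These definitions collapse into one hypothetical helper; the type and names are illustrative, not the engine's actual interface:

```typescript
// Effective contention density: the fraction of processed elements that
// contend per target address, per the definitions above.
type Op =
  | { kind: "filter"; selectivity: number }
  | { kind: "histogram"; binCount: number }
  | { kind: "groupBy"; groupCount: number }
  | { kind: "textSearch"; matchFraction: number };

function contentionDensity(op: Op): number {
  switch (op.kind) {
    case "filter":     return op.selectivity;   // all matches hit one counter
    case "histogram":  return 1 / op.binCount;  // writes spread across bins
    case "groupBy":    return 1 / op.groupCount; // writes spread across groups
    case "textSearch": return op.matchFraction;
  }
}
```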
Our engine's output density profiler estimates this ratio before GPU dispatch. The estimation uses the same lightweight statistics maintained during schema ingestion: column histograms, dictionary cardinality, min/max values, and null counts.
Estimation for filter predicates
For a range predicate (column > X), the profiler interpolates from the column's 64-bucket histogram to estimate selectivity. For an equality predicate on a dictionary-encoded column, selectivity is 1 / dictionary_size. For compound predicates (AND, OR), selectivities are combined with independence assumptions (multiply for AND, add minus product for OR).
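A minimal sketch of the range-predicate estimation, assuming equal-width histogram buckets over [min, max); function names and shapes are illustrative:

```typescript
// Estimate the selectivity of (column > x) from a bucketed row-count
// histogram, linearly interpolating within the bucket that contains x.
function estimateGreaterThan(
  buckets: number[], min: number, max: number, x: number
): number {
  const total = buckets.reduce((a, b) => a + b, 0);
  if (x <= min) return 1;
  if (x >= max) return 0;
  const width = (max - min) / buckets.length;
  const i = Math.floor((x - min) / width);
  // Rows in buckets strictly above i, plus a linear share of bucket i.
  const above = buckets.slice(i + 1).reduce((a, b) => a + b, 0);
  const frac = 1 - ((x - min) - i * width) / width;
  return (above + buckets[i] * frac) / total;
}

// Independence assumptions for compound predicates:
const andSel = (a: number, b: number) => a * b;         // multiply for AND
const orSel = (a: number, b: number) => a + b - a * b;  // add minus product for OR
```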
Estimation for aggregation
Effective contention density is estimated as 1 / estimated_group_count, using the Chao1 estimator for composite group-by keys or exact dictionary size for single-column group-by.
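For reference, Chao1 in its bias-corrected form estimates richness as S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of keys seen exactly once and exactly twice in a sample. A self-contained sketch (the engine's sampling strategy and key encoding are not shown):

```typescript
// Chao1 group-cardinality estimate from a sample of group keys.
function chao1(sampleKeys: string[]): number {
  const counts = new Map<string, number>();
  for (const k of sampleKeys) counts.set(k, (counts.get(k) ?? 0) + 1);

  let f1 = 0; // singletons: keys observed exactly once
  let f2 = 0; // doubletons: keys observed exactly twice
  for (const c of counts.values()) {
    if (c === 1) f1++;
    else if (c === 2) f2++;
  }
  // Bias-corrected form avoids division by zero when f2 = 0.
  return counts.size + (f1 * (f1 - 1)) / (2 * (f2 + 1));
}
```

Many singletons in the sample push the estimate up (more unseen groups, hence lower per-group contention); few singletons mean the observed count is close to the true count.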
Estimation for text search
The character frequency histogram pre-filter in Phase 1 produces a candidate bitmask. The popcount of this bitmask gives the exact Phase 2 output density before Phase 2 begins. If the popcount indicates high density (many candidates survived the pre-filter), the engine can re-route Phase 2 to the CPU.
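Computing that density from the candidate bitmask is a straightforward population count. A TypeScript sketch with illustrative names:

```typescript
// Standard bit-twiddling population count for a 32-bit word.
function popcount32(x: number): number {
  x = x - ((x >>> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
  x = (x + (x >>> 4)) & 0x0f0f0f0f;
  return (x * 0x01010101) >>> 24;
}

// Exact Phase 2 output density: surviving candidates / total documents.
function phase2Density(bitmask: Uint32Array, docCount: number): number {
  let candidates = 0;
  for (const word of bitmask) candidates += popcount32(word);
  return candidates / docCount;
}
```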
The categorical penalty threshold
When estimated output density exceeds 100 matches per 1,000 input elements (10%), our engine assigns a categorical penalty of IEEE 754 negative infinity to the GPU dispatch score. This is a mathematically absolute value: no combination of positive scoring factors can produce a non-negative result. The workload is routed unconditionally to the Web Worker tier.
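The arithmetic is what makes the penalty absolute: in IEEE 754, adding negative infinity to any finite value yields negative infinity. A minimal sketch (the real 6-factor inputs are not shown; the shape is illustrative):

```typescript
// Sum the positive scoring factors, then apply the categorical penalty.
// Any finite sum plus -Infinity is -Infinity: no factor can rescue it.
function gpuDispatchScore(factors: number[], outputDensity: number): number {
  const penalty = outputDensity > 0.10 ? Number.NEGATIVE_INFINITY : 0;
  return factors.reduce((a, b) => a + b, 0) + penalty;
}
```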
Why 10%
The threshold derives from the GPU's memory controller capacity, not from arbitrary tuning.
A discrete GPU with 3,072 cores dispatches 3,072 threads per cycle. At 10% output density, approximately 307 threads attempt an atomic write per cycle. The L2 cache controller on current-generation hardware (NVIDIA Ampere/Ada, AMD RDNA 3, Intel Arc) can sustain 32 to 64 concurrent atomic operations to distinct cache lines without queuing. For atomics to the same cache line, the throughput drops to 1 to 4 per cycle depending on the memory controller's coalescing capabilities.
At 307 contending threads per cycle targeting a small number of addresses (a single counter, or a handful of histogram bins), the queue depth exceeds the controller's capacity by 5x to 10x. The excess requests retry, each retry consuming a memory pipeline slot, and the cascade described earlier begins.
Below 10%, the contention is manageable. At 5% output density (~154 threads per cycle), the memory controller queues briefly but clears within a few cycles. Throughput drops to 23% of peak, which is still faster than the CPU alternative for large datasets. The dispatch score accounts for this degradation through the standard 6-factor scoring function without needing a categorical override.
Above 10%, the degradation becomes non-linear and unpredictable. The GPU's throughput drops below CPU levels, and the gap widens rapidly with density. A continuous scoring function cannot model this accurately because the relationship between density and latency is not linear, not quadratic, and varies by GPU microarchitecture. The only safe response is categorical exclusion.
This is the same Categorical GPU Inhibition Scoring principle applied to branch divergence: when the penalty is non-linear and hardware-dependent, a continuous penalty underestimates the risk on some configurations. Negative infinity eliminates the risk on all configurations.
What the CPU tier does differently
The Web Worker tier avoids atomic contention entirely through a different algorithmic structure.
For counting: each worker maintains a thread-local counter. After all workers complete, the main thread sums 8 counters. Zero contention. Zero atomic operations.
For compaction: each worker compacts matches into a thread-local output section of the SharedArrayBuffer. The main thread concatenates the sections. Each worker writes to a disjoint memory region. Zero contention.
For aggregation: each worker builds a thread-local hash map of group accumulators. The main thread merges 8 hash maps. Each worker reads shared input but writes to private output. Zero contention.
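The counting variant of this pattern can be sketched in sequential TypeScript, with the workers simulated as partitions of the input. Real workers would run these loops concurrently over a SharedArrayBuffer; names are illustrative:

```typescript
// Each "worker" owns a private counter for its partition: zero contention.
// The main thread merges the per-worker counters at the end.
function parallelCount(
  data: number[], threshold: number, workers: number
): number {
  const locals = new Array<number>(workers).fill(0);
  const chunk = Math.ceil(data.length / workers);
  for (let w = 0; w < workers; w++) {
    // Worker w scans only its own chunk and writes only its own counter.
    const end = Math.min((w + 1) * chunk, data.length);
    for (let i = w * chunk; i < end; i++) {
      if (data[i] > threshold) locals[w]++;
    }
  }
  // Merge step: one pass over 8 numbers, no synchronization needed.
  return locals.reduce((a, b) => a + b, 0);
}
```

The compaction and aggregation variants follow the same shape: disjoint private output per worker, one cheap merge at the end.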
This is the structural advantage of the Web Worker parallel tier. Eight threads with private output state have fundamentally different contention characteristics than 3,072 threads with shared output state. The algorithm changes, not just the execution target.
Workgroup shared memory atomics: the partial mitigation
Before the categorical threshold kicks in, our engine uses workgroup shared memory atomics as a partial contention mitigation for GPU dispatch.
The strategy: each workgroup of 256 threads maintains local accumulators in shared memory (var<workgroup>). Threads within the workgroup atomically increment shared memory counters, which have 20 to 50 cycle latency versus 200 to 400 cycles for global memory. After the workgroup completes, thread 0 writes the workgroup's local result to the global accumulator via a single atomicAdd.
This reduces global atomic contention by a factor of 256 (one global atomic per workgroup instead of one per matching thread). For 1 million elements in 256-thread workgroups, global atomic operations drop from 100,000 at a 10% match rate to at most ~3,906, one per workgroup. That is well within the memory controller's capacity.
```wgsl
// Bindings added for completeness; indices illustrative.
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> global_count: atomic<u32>;
@group(0) @binding(2) var<uniform> threshold: f32;
@group(0) @binding(3) var<uniform> element_count: u32;

var<workgroup> local_count: atomic<u32>;

@compute @workgroup_size(256)
fn filter_with_local_reduce(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>,
    @builtin(global_invocation_id) gid: vec3<u32>
) {
    // Initialize the shared counter once per workgroup
    if (lid.x == 0u) {
        atomicStore(&local_count, 0u);
    }
    workgroupBarrier();

    // Each thread evaluates its element
    if (gid.x < element_count && data[gid.x] > threshold) {
        atomicAdd(&local_count, 1u);
    }
    workgroupBarrier();

    // Thread 0 flushes the workgroup's total to global memory
    if (lid.x == 0u) {
        let count = atomicLoad(&local_count);
        if (count > 0u) {
            atomicAdd(&global_count, count);
        }
    }
}
```
This technique keeps the GPU viable up to the 10% density threshold. Beyond that, even shared memory atomics within a 256-thread workgroup produce measurable contention (25+ threads contending per cycle within the workgroup), and the global merge via thread 0 is not the bottleneck. The per-thread shared memory contention is.
The full decision tree
Putting it together, here is how our engine routes an operation that involves atomic writes:
- Estimate output density from column statistics.
- If density > 10%: categorical penalty (negative infinity). Route to the Web Worker tier, where workers use thread-local accumulators.
- If density is 10% or below: check whether the operation can use the shared-memory-reduce pattern (workgroup-local atomics with a thread-0 global flush).
  - If yes: apply standard 6-factor scoring with a contention adjustment factor proportional to density. GPU dispatch is permitted if the adjusted score exceeds 1.0.
  - If no (the operation requires per-thread global atomics that cannot be restructured): apply a continuous penalty scaled by density. GPU dispatch is permitted only if the dataset is large enough for the GPU's compute advantage to overcome the contention cost.
- Independently of the above, if the operation involves precision-sensitive accumulation, the Float32 Safety Guard may override the dispatch decision, routing to the CPU for numerical correctness regardless of contention analysis.
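The tree can be sketched as a single routing function. All names, the flag-style inputs, and the exact adjustment formula are illustrative, not the engine's implementation:

```typescript
type Route = "gpu" | "worker";

function routeAtomicOp(opts: {
  density: number;             // estimated output density, 0..1
  canSharedMemReduce: boolean; // workgroup-local atomics + thread-0 flush?
  sixFactorScore: number;      // standard scoring result, pre-adjustment
  datasetLargeEnough: boolean; // compute advantage outweighs contention cost?
  precisionSensitive: boolean; // Float32 Safety Guard input
}): Route {
  // Independent override: numerical correctness trumps contention analysis.
  if (opts.precisionSensitive) return "worker";
  // Categorical penalty: -Infinity score, unconditional CPU routing.
  if (opts.density > 0.10) return "worker";
  if (opts.canSharedMemReduce) {
    // Contention adjustment proportional to density (formula illustrative).
    const adjusted = opts.sixFactorScore * (1 - opts.density);
    return adjusted > 1.0 ? "gpu" : "worker";
  }
  // Per-thread global atomics: continuous penalty; needs a large dataset.
  return opts.datasetLargeEnough ? "gpu" : "worker";
}
```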
Multiple safety systems can independently block GPU dispatch. Branch divergence, precision loss, and atomic contention are evaluated separately. Any one of them can force CPU routing. The GPU path runs only when all three analyses confirm it is safe, correct, and faster.
Why this matters
Most browser-based GPU compute implementations ship with a single code path: "if the dataset is big, use the GPU." They discover atomic contention in production when a user's query returns a high match rate and the dashboard freezes for 3 seconds instead of responding in 10 ms.
Our engine discovers it before the first GPU instruction executes. The profiler estimates density. The categorical threshold blocks the dispatch. The Web Worker tier handles the workload in 4.8 ms. The user never sees the failure that was prevented.
This is the engineering standard behind our enterprise AI automation infrastructure. We do not ship GPU code paths that degrade under load. We profile the expected contention pattern, enforce hard boundaries where degradation is non-linear, and route to the tier that handles the workload without performance cliffs. The GPU is fast when it is uncontended. We make sure it stays that way.