Ayoob AI

WebGPU Atomic Contention: When to Stop Using the GPU

WebGPU · Atomics · GPU Architecture · Synchronization · Performance

What GPU atomics actually do at the hardware level

An atomic operation guarantees that a read-modify-write sequence on a memory address completes as an indivisible unit. No other thread can observe or interfere with the intermediate state between the read and the write.

On a CPU, atomics are implemented via cache coherence protocols (MESI, MOESI). The core acquires exclusive ownership of the cache line, performs the operation, and releases it. With 8 to 16 cores, contention is manageable. The coherence protocol scales to dozens of participants.

On a GPU, the architecture is fundamentally different. Three thousand cores share a memory subsystem designed for throughput, not coherence. There is no per-core cache coherence protocol for atomic operations. Instead, atomics are resolved at a centralized point: the L2 cache controller (for global memory) or the workgroup shared memory unit (for workgroup-scoped memory).

This centralization is the source of every atomic performance problem on GPUs.

The WGSL atomic read-modify-write operations

WGSL provides eight atomic read-modify-write operations on atomic<u32> and atomic<i32> types:

Operation      WGSL function              Behaviour
Add            atomicAdd(&x, val)         x = x + val. Returns old value.
Subtract       atomicSub(&x, val)         x = x - val. Returns old value.
Maximum        atomicMax(&x, val)         x = max(x, val). Returns old value.
Minimum        atomicMin(&x, val)         x = min(x, val). Returns old value.
Bitwise AND    atomicAnd(&x, val)         x = x & val. Returns old value.
Bitwise OR     atomicOr(&x, val)          x = x | val. Returns old value.
Bitwise XOR    atomicXor(&x, val)         x = x ^ val. Returns old value.
Exchange       atomicExchange(&x, val)    x = val. Returns old value.

Plus one conditional operation:

Operation          WGSL function                                  Behaviour
Compare-exchange   atomicCompareExchangeWeak(&x, expected, val)   If x == expected, set x = val. Returns {old_value, exchanged}.

Every one of these operations shares the same serialization constraint. The difference is not in how the hardware serializes them (it serializes them all identically), but in how application code uses them and the contention patterns that result.

The memory hierarchy and where atomics resolve

Understanding contention requires understanding where in the memory hierarchy the atomic operation physically executes.

Global memory atomics

When an atomic targets a var<storage, read_write> variable, it resolves at the L2 cache controller. The path:

Thread -> Warp scheduler -> L1 cache (miss, not cached for atomics)
       -> L2 cache controller (serialization point)
       -> DRAM (if L2 misses)

The L2 controller processes atomic requests one at a time per cache line. On NVIDIA Ampere/Ada architecture, the L2 can sustain 1 atomic operation per cycle per memory partition. A GPU with 12 memory partitions can process 12 concurrent atomics per cycle, but only if they target different cache lines. Atomics to the same cache line (same 128-byte aligned address) serialize to 1 per cycle.

Latency for a single uncontested global atomic: 200 to 400 cycles (the round-trip to L2 and back). For a contested atomic with N threads queued: approximately N * 1 cycle at the controller, plus the 200-cycle round-trip. For 512 threads contending on one address: 200 + 512 = 712 cycles minimum. In practice, the queue management overhead and memory bus contention push this higher.
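The back-of-the-envelope model above can be written down directly. This is a minimal sketch using the round-trip and drain-rate figures from this section; the function name and the constant are ours, not a vendor specification.

```typescript
// Assumed figures from the discussion above: ~200-cycle L2 round-trip, and
// contested atomics to one cache line draining at ~1 per cycle.
const L2_ROUND_TRIP_CYCLES = 200;

/** Lower-bound latency (cycles) for `threads` atomics queued on one address. */
function contestedAtomicCycles(
  threads: number,
  roundTrip: number = L2_ROUND_TRIP_CYCLES,
): number {
  // Base round-trip plus one cycle per queued request at the L2 controller.
  return roundTrip + threads;
}

contestedAtomicCycles(8);   // → 208: barely above the uncontested baseline
contestedAtomicCycles(512); // → 712: the queue dominates the round-trip
```

In practice this is a floor, not an estimate: queue management and memory bus contention push real latency above it, as noted above.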

Shared memory atomics

When an atomic targets a var<workgroup> variable, it resolves at the workgroup's shared memory unit. Shared memory is on-chip SRAM, physically adjacent to the compute cores. Latency for an uncontested shared memory atomic: 20 to 50 cycles.

Contention in shared memory is limited to the workgroup size (typically 256 threads). At 256 threads, worst-case contention produces 256 serialized operations at 1 per cycle: 256 + 20 = 276 cycles. Painful but bounded.

This is why our two-phase text search builds character frequency histograms in shared memory rather than global memory. The contention ceiling is 256 threads per workgroup, not 3,072 threads across the entire GPU. And shared memory atomic latency (20 to 50 cycles) is 5x to 10x lower than global memory atomic latency (200 to 400 cycles).

The contention multiplier

The critical insight: contention cost is not linear with thread count. It is super-linear.

At low contention (8 threads per cycle targeting the same address), the L2 controller queues and drains the requests in 8 cycles. The overhead is small relative to the 200-cycle base latency. Total: approximately 208 cycles. Barely noticeable.

At moderate contention (64 threads), the queue drains in 64 cycles. Total: approximately 264 cycles. The atomic is 1.3x slower than uncontested. Still manageable.

At high contention (512 threads), the queue drains in 512 cycles. Total: approximately 712 cycles. The atomic is 3.5x slower than uncontested. But the real problem is not the individual atomic latency. It is the pipeline stall.

While 512 threads are queued at the L2 controller waiting for their atomic to complete, those threads are not executing any other instructions. Their warps are stalled. The warp scheduler cannot issue other instructions from those warps because the atomic has not returned. If those 512 threads represent a significant fraction of the active warps on the GPU, the entire compute pipeline stalls.

On a GPU with 96 active warps (3,072 threads / 32 per warp), 512 stalled threads means 16 stalled warps. That is 17% of the pipeline. The remaining 83% continues, but the stalled warps create bubbles in the scheduler that reduce overall throughput.

At 1,024 contending threads (33% of the pipeline stalled), throughput drops to 40% to 50% of peak. At 2,048 threads (67% stalled), throughput drops to 10% to 20%. The relationship is non-linear because the scheduler's ability to hide latency with other work degrades as more warps are stalled.

The compare-and-swap amplification problem

atomicCompareExchangeWeak is structurally worse than the other atomics under contention. The unconditional operations always succeed: atomicAdd adds its value and returns the old value, regardless of how many other threads are contending. The thread may wait in the queue, but it will not need to retry after completion.

Compare-and-swap (CAS) can fail. If the value at the target address changed between when the thread read the expected value and when the CAS executes, the exchange does not happen. The thread must reload the current value, recompute, and retry.

// CAS-based append to a result buffer
fn appendResult(value: u32) {
  var old = atomicLoad(&write_head);
  loop {
    let result = atomicCompareExchangeWeak(&write_head, old, old + 1u);
    if (result.exchanged) {
      output[old] = value;
      break;
    }
    old = result.old_value;
    // Failed. Another thread incremented write_head. Retry.
  }
}

Under contention, multiple threads load the same value of write_head, attempt the CAS simultaneously, and only one succeeds. The rest fail and retry. The retry re-enters the queue at the L2 controller. The queue depth is no longer equal to the number of contending threads. It is the number of contending threads multiplied by the average retry count.

With 64 threads contending on a single CAS target, 1 succeeds and 63 retry. On the retry round, 1 succeeds and 62 retry. The total number of CAS operations is:

64 + 63 + 62 + ... + 1 = 64 * 65 / 2 = 2,080 CAS operations

For 64 threads producing 64 successful writes. The amplification factor is 2,080 / 64 = 32.5x. The L2 controller processes 2,080 atomic operations instead of 64.

For 512 threads, the amplification factor is 512 * 513 / 2 / 512 = 256.5x. The controller processes 131,328 atomic operations. At 1 per cycle, that is 131,328 cycles. Over 65 microseconds at 2 GHz. For a single batch of CAS operations.
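The arithmetic generalizes: with n threads, round k leaves n - k survivors retrying, so total operations form the triangular sum n + (n-1) + ... + 1. A small sketch (function names ours):

```typescript
/** Worst-case total CAS operations for n threads contending on one address. */
function totalCasOps(threads: number): number {
  // n + (n-1) + ... + 1 = n * (n+1) / 2
  return (threads * (threads + 1)) / 2;
}

/** How many CAS operations the controller processes per successful write. */
function amplificationFactor(threads: number): number {
  return totalCasOps(threads) / threads; // simplifies to (n + 1) / 2
}

totalCasOps(64);          // → 2080 CAS ops for 64 successful writes
amplificationFactor(64);  // → 32.5x
totalCasOps(512);         // → 131328 CAS ops
```

The (n + 1) / 2 closed form makes the quadratic scaling explicit: doubling the contending threads roughly doubles the amplification factor, and quadruples the total work.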

This is why CAS-based algorithms (lock-free queues, concurrent hash maps, atomic counters with custom logic) are catastrophically slow on GPUs under contention. The algorithm that is "lock-free" on a CPU becomes effectively locked on a GPU, with a lock whose hold time scales quadratically with thread count.

Where atomic contention appears in real workloads

Atomic contention is not an exotic edge case. It appears in the most common GPU compute patterns.

Pattern 1: Global counters

Any shader that counts matching elements using a single atomicAdd counter:

if (predicate(data[idx])) {
  atomicAdd(&match_count, 1u);
}

Contention scales linearly with match rate. At 1% selectivity on 1 million elements: ~10,000 atomics. At 50% selectivity: ~500,000 atomics. The counter becomes the bottleneck.

Pattern 2: Output compaction

Writing matching elements to a dense output array requires claiming an output slot via atomicAdd on a write pointer:

if (predicate(data[idx])) {
  let slot = atomicAdd(&write_head, 1u);
  output[slot] = data[idx];
}

Same contention profile as global counters, with the additional cost of a dependent global memory write to output[slot].

Pattern 3: Histogram accumulation

Building a histogram with K bins:

let bin = computeBin(data[idx]);
atomicAdd(&histogram[bin], 1u);

Contention per bin = total_atomics / K. With 256 bins and 1 million elements: ~3,906 atomics per bin. Manageable. With 4 bins and 1 million elements: ~250,000 atomics per bin. Catastrophic.
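The per-bin arithmetic is worth encoding, since it is the quantity that flips a histogram from GPU-friendly to catastrophic. A sketch assuming uniformly distributed bin indices (function name ours):

```typescript
/** Expected atomics per histogram bin, assuming uniform bin distribution. */
function atomicsPerBin(elements: number, bins: number): number {
  return elements / bins;
}

atomicsPerBin(1_000_000, 256); // ≈ 3906 per bin: manageable
atomicsPerBin(1_000_000, 4);   // → 250000 per bin: catastrophic
```

Real data is rarely uniform; a skewed distribution concentrates atomics on hot bins, so the uniform estimate is a best case.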

Pattern 4: Reduction (min, max, sum)

A parallel reduction that writes per-workgroup partial results to a global accumulator:

if (local_id == 0u) {
  atomicAdd(&global_sum, workgroup_partial_sum);
}

Contention = number of workgroups. For 1 million elements / 256 per workgroup = 3,906 workgroups. 3,906 atomics to a single address. At 1 per cycle: 3,906 cycles. Approximately 2 microseconds. Low absolute cost, but it adds up across multiple reduction targets (sum, count, min, max each needing their own atomic).

Pattern 5: Bitmask writes

Setting bits in a result bitmask via atomicOr:

if (predicate(data[idx])) {
  atomicOr(&bitmask[idx / 32u], 1u << (idx % 32u));
}

Contention per bitmask word = number of matching elements within each 32-element group. With uniformly distributed matches, each word receives match_rate * 32 atomics. At 50% match rate: 16 atomics per word. Spread across many words (total_elements / 32), the per-address contention is low. Bitmask atomics are GPU-friendly for most workloads.

This is why our two-phase search uses atomicOr on a bitmask in Phase 1 (low per-address contention, spread across many words) rather than atomicAdd on a counter (all contention on one address).

Our Categorical GPU Inhibition Scoring system

Standard dispatch scoring treats atomic contention as a continuous variable: more contention means a lower score, which eventually routes to CPU. This fails because the contention-to-latency relationship is non-linear. A scoring function calibrated on low-contention measurements underestimates the penalty at high contention.

Our system uses a categorical boundary. Below the boundary, continuous scoring applies. Above it, the penalty is negative infinity. No continuous score can override it.

How contention density is estimated

Before GPU dispatch, the engine estimates the atomic contention density: the number of threads that will execute an atomic operation per GPU clock cycle.

For a filter with a counting atomic:

contentionDensity = (elementCount * estimatedSelectivity) / (elementCount / threadsPerCycle)
                  = estimatedSelectivity * threadsPerCycle

On a 3,072-core GPU with 1 thread per core per cycle: contentionDensity = selectivity * 3072. At 10% selectivity: 307 threads per cycle attempting atomicAdd on the same address.

For a histogram with K bins:

contentionDensity = (selectivity * threadsPerCycle) / K

At 100% selectivity (every element writes to some bin) with 256 bins: 3072 / 256 = 12 threads per bin per cycle. Low. With 4 bins: 3072 / 4 = 768 threads per bin per cycle. High.

The selectivity estimate comes from column statistics (histograms, min/max, dictionary cardinality) maintained during data ingestion. For text search, the selectivity is estimated from the Phase 1 histogram pre-filter popcount.
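The two density formulas above reduce to a few lines. This is a sketch with the structure taken from the formulas in this section; the function and constant names are ours.

```typescript
// Assumed hardware figure from the example above: 3,072 cores, 1 thread/core/cycle.
const THREADS_PER_CYCLE = 3072;

/** Threads per cycle hitting a single counter, for a counting atomic. */
function counterContentionDensity(
  selectivity: number,
  threadsPerCycle: number = THREADS_PER_CYCLE,
): number {
  // (elementCount * selectivity) / (elementCount / threadsPerCycle)
  // simplifies to selectivity * threadsPerCycle.
  return selectivity * threadsPerCycle;
}

/** Threads per cycle per bin, for a K-bin histogram. */
function histogramContentionDensity(
  selectivity: number,
  bins: number,
  threadsPerCycle: number = THREADS_PER_CYCLE,
): number {
  return (selectivity * threadsPerCycle) / bins;
}

counterContentionDensity(0.1);       // ≈ 307.2 threads/cycle on one address
histogramContentionDensity(1, 256);  // → 12 threads/cycle per bin
histogramContentionDensity(1, 4);    // → 768 threads/cycle per bin
```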

The 10% categorical threshold

When estimated contention density exceeds 10% of the GPU's thread capacity per cycle (approximately 307 threads on a 3,072-core GPU), the system assigns IEEE 754 negative infinity, a mathematically absolute value that no combination of positive factors can override.

The threshold derives from the L2 cache controller's atomic throughput. The controller sustains 32 to 64 concurrent atomic operations per cycle across all memory partitions. At 307 contending threads, the request-to-capacity ratio is 307 / 64 = 4.8x. The queue overflows. Warps stall. Throughput collapses to below CPU levels.

Below 10% (~150 threads per cycle at 5% density), the queue depth stays within 2x to 3x of the controller's capacity. Stalls are brief. The GPU still outperforms the CPU on large datasets. The continuous scoring function handles this range accurately, penalizing the GPU score proportionally to the estimated density.

Above 10%, the non-linear cascade makes the GPU slower than the CPU. The categorical penalty prevents dispatch.

What happens on the CPU tier

The Web Worker parallel tier handles the rerouted workload with a structurally different algorithm.

Counting: Each of 8 workers maintains a local count variable (a regular JavaScript number, not an atomic). After all workers complete, the main thread sums 8 values. Total atomic operations: 0.

Compaction: Each worker writes matches to a private section of the output SharedArrayBuffer. No two workers write to the same region. The main thread concatenates by adjusting offsets. Total atomic operations: 0.

Histogram: Each worker builds a local 256-entry array. The main thread sums 8 arrays element-wise (2,048 additions). Total atomic operations: 0.

Reduction: Each worker computes a local partial sum/min/max. The main thread reduces 8 values. Total atomic operations: 0.

The CPU tier eliminates contention by design, not by mitigation. Eight threads with private state do not contend. The algorithm is different. The data structure is different. Only the result is identical.
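The worker-private pattern can be sketched in a few lines. This models the structure sequentially; in the real tier each slice runs in its own Web Worker over a SharedArrayBuffer, and the names here are ours.

```typescript
/**
 * Count matching elements with zero atomic operations: each "worker" scans a
 * private slice with a plain counter, and the merge is a handful of additions.
 */
function parallelCount(
  data: Uint32Array,
  workers: number,
  predicate: (v: number) => boolean,
): number {
  const localCounts: number[] = new Array(workers).fill(0);
  const chunk = Math.ceil(data.length / workers);
  for (let w = 0; w < workers; w++) {
    // Private slice, private counter: no shared mutable state, no contention.
    const end = Math.min((w + 1) * chunk, data.length);
    for (let i = w * chunk; i < end; i++) {
      if (predicate(data[i])) localCounts[w]++;
    }
  }
  // "Main thread" merge: one addition per worker.
  return localCounts.reduce((a, b) => a + b, 0);
}
```

Compaction and histograms follow the same shape: private output regions or private bin arrays per worker, merged once at the end.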

Architectural patterns that avoid atomic contention on the GPU

Not every GPU workload needs atomics. When we can restructure the algorithm to avoid them, the GPU path remains viable even at high output density.

Pattern: Prefix sum for compaction

Instead of atomicAdd on a write pointer, compute a parallel prefix sum on the predicate bitmask:

  1. Each thread evaluates the predicate and writes 0 or 1 to a predicate array.
  2. A parallel Blelloch scan computes the exclusive prefix sum of the predicate array.
  3. Each thread with a 1 reads its prefix sum value as its output index and writes the element.

Zero atomics. Fully parallel. The prefix sum is O(n) work and O(log n) depth. This is the compaction strategy used throughout our query engine.
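The three steps can be modeled sequentially to show the index assignment. On the GPU the scan is a parallel Blelloch scan; the logic is the same, and the function name here is ours.

```typescript
/** Compact matching elements into a dense array via an exclusive prefix sum. */
function compactByPrefixSum(
  data: number[],
  predicate: (v: number) => boolean,
): number[] {
  if (data.length === 0) return [];
  // Step 1: each element produces a 0/1 predicate flag.
  const flags = data.map((v) => (predicate(v) ? 1 : 0));
  // Step 2: exclusive prefix sum of the flags (parallel scan on the GPU).
  const offsets: number[] = new Array(flags.length).fill(0);
  for (let i = 1; i < flags.length; i++) offsets[i] = offsets[i - 1] + flags[i - 1];
  // Step 3: each flagged element writes to its scanned offset. No atomics:
  // every output index is claimed exactly once, by construction.
  const total = offsets[flags.length - 1] + flags[flags.length - 1];
  const out: number[] = new Array(total);
  data.forEach((v, i) => {
    if (flags[i] === 1) out[offsets[i]] = v;
  });
  return out;
}

compactByPrefixSum([5, 2, 8, 1, 9], (v) => v > 4); // → [5, 8, 9]
```

The key property: the exclusive scan assigns each matching element a unique output slot deterministically, so no write pointer needs to be contended for.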

Pattern: Workgroup-local reduction with thread-0 flush

Instead of every thread writing to a global accumulator, each workgroup reduces locally in shared memory, and thread 0 writes the workgroup result to global memory:

var<workgroup> local_sum: atomic<u32>;

// ... each thread atomicAdds to local_sum ...

workgroupBarrier();
if (local_id == 0u) {
  atomicAdd(&global_sum, atomicLoad(&local_sum));
}

This reduces global atomics from N (one per thread) to N/256 (one per workgroup). For 1 million elements: 3,906 global atomics instead of up to 1,000,000. The shared memory atomics within each workgroup contend among at most 256 threads, which is manageable.

Pattern: Tile-based histograms

Instead of a single global histogram with K bins, each workgroup builds a local K-bin histogram in shared memory, then writes K values to a workgroup-indexed section of a global histogram array. A final reduction dispatch sums the per-workgroup histograms.

Zero global atomics during the main pass. The final reduction processes (workgroup_count * K) values, which is small enough for a single workgroup to handle.

The decision framework

The full decision tree for atomic contention:

  1. Can the algorithm be restructured to avoid atomics? (prefix sum for compaction, tile-based reduction, bitwise-OR bitmask) If yes, restructure. GPU dispatch proceeds normally.

  2. If atomics are unavoidable, estimate contention density from column statistics and operation characteristics.

  3. Density below 5%: Standard 6-factor scoring. GPU dispatch likely optimal for large datasets.

  4. Density 5% to 10%: Standard scoring with a contention penalty. GPU may or may not win depending on dataset size and hardware.

  5. Density above 10%: Categorical penalty. Negative infinity. Route to Web Workers unconditionally.

  6. CAS-based algorithms: Categorical penalty regardless of density. The quadratic amplification from retry loops makes CAS contention unpredictable and universally worse than the unconditional RMW atomics.

  7. Check for independent safety overrides: Branch divergence and precision constraints are evaluated separately. Any one of the three safety systems can independently block GPU dispatch.
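The routing tiers above reduce to a small function. This is an illustrative sketch: the thresholds and the negative-infinity categorical penalty come from this article, while the names, the Route type, and the shape of the continuous penalty are ours.

```typescript
type Route = "gpu" | "gpu-with-penalty" | "cpu-workers";

/** Contention component of the GPU score. -Infinity is categorical: no
 *  combination of positive factors can outweigh it. */
function gpuContentionPenalty(densityFraction: number, usesCas: boolean): number {
  if (usesCas || densityFraction > 0.10) return Number.NEGATIVE_INFINITY;
  if (densityFraction > 0.05) return -densityFraction * 100; // illustrative continuous penalty
  return 0;
}

function route(densityFraction: number, usesCas: boolean): Route {
  const penalty = gpuContentionPenalty(densityFraction, usesCas);
  if (penalty === Number.NEGATIVE_INFINITY) return "cpu-workers";
  return penalty < 0 ? "gpu-with-penalty" : "gpu";
}

route(0.02, false); // → "gpu"
route(0.08, false); // → "gpu-with-penalty"
route(0.25, false); // → "cpu-workers"
route(0.02, true);  // → "cpu-workers": CAS is categorical regardless of density
```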

Why this matters for your architecture

If you are building browser-based compute, you will encounter atomic contention the moment you move beyond trivial shaders. Every filter needs a result counter. Every aggregation needs an accumulator. Every compaction needs an output index.

The difference between a system that works and a system that collapses under load is whether it detects contention before it happens. A dashboard that runs a 2% selectivity filter at 2 ms on the GPU will run a 25% selectivity filter on the same dataset at 45 ms on the GPU. The user changed one slider. The latency jumped 22x. That is not a performance regression. It is an architectural failure to account for the hardware's serialization constraints.

Our adaptive dispatch engine prevents this. The selectivity estimate changes when the user moves the slider. The contention density recalculates. If the new density exceeds 10%, the operation routes to CPU workers in 4.8 ms instead of stalling the GPU at 45 ms. The user sees consistent 2 to 5 ms response times regardless of the filter parameters.

This is how we build enterprise AI automation infrastructure. Not by assuming the GPU is always faster. By measuring when it is, enforcing boundaries where it is not, and routing every operation to the tier that handles it without performance cliffs. Atomic contention is the cliff. Our engine stops you before the edge.

Want to discuss how this applies to your business?

Book a Discovery Call