Ayoob AI

WebGPU Atomic Contention: When to Stop Using the GPU

WebGPU · Atomics · GPU Architecture · Synchronization · Performance

What GPU atomics actually do at the hardware level

An atomic operation guarantees that a read-modify-write sequence on a memory address completes as an indivisible unit. No other thread can observe or interfere with the intermediate state between the read and the write.

On a CPU, atomics are implemented via cache coherence protocols (MESI, MOESI). The core acquires exclusive ownership of the cache line, performs the operation, and releases it. With 8 to 16 cores, contention is manageable. The coherence protocol scales to dozens of participants.

On a GPU, the architecture is fundamentally different. Three thousand cores share a memory subsystem designed for throughput, not coherence. There is no per-core cache coherence protocol for atomic operations. Instead, atomics are resolved at a centralized point: the L2 cache controller (for global memory) or the workgroup shared memory unit (for workgroup-scoped memory).

This centralization is the source of every atomic performance problem on GPUs.

The WGSL atomic read-modify-write operations

WGSL provides eight atomic read-modify-write operations on atomic<u32> and atomic<i32> types:

Operation      WGSL function              Behaviour
Add            atomicAdd(&x, val)         x = x + val. Returns old value.
Subtract       atomicSub(&x, val)         x = x - val. Returns old value.
Maximum        atomicMax(&x, val)         x = max(x, val). Returns old value.
Minimum        atomicMin(&x, val)         x = min(x, val). Returns old value.
Bitwise AND    atomicAnd(&x, val)         x = x & val. Returns old value.
Bitwise OR     atomicOr(&x, val)          x = x | val. Returns old value.
Bitwise XOR    atomicXor(&x, val)         x = x ^ val. Returns old value.
Exchange       atomicExchange(&x, val)    x = val. Returns old value.

Plus one conditional operation:

Operation          WGSL function                                  Behaviour
Compare-exchange   atomicCompareExchangeWeak(&x, expected, val)   If x == expected, set x = val. Returns {old_value, exchanged}.

Every one of these operations shares the same serialization constraint. The difference is not in how the hardware serializes them (it serializes them all identically), but in how application code uses them and the contention patterns that result.

The memory hierarchy and where atomics resolve

Understanding contention requires understanding where in the memory hierarchy the atomic operation physically executes.

Global memory atomics

When an atomic targets a var<storage, read_write> variable, it resolves at the L2 cache controller. The path:

Thread -> Warp scheduler -> L1 cache (miss, not cached for atomics)
       -> L2 cache controller (serialization point)
       -> DRAM (if L2 misses)

The L2 controller processes atomic requests one at a time per cache line. On NVIDIA Ampere/Ada architecture, the L2 can sustain 1 atomic operation per cycle per memory partition. A GPU with 12 memory partitions can process 12 concurrent atomics per cycle, but only if they target different cache lines. Atomics to the same cache line (same 128-byte aligned address) serialize to 1 per cycle.

Latency for a single uncontested global atomic: 200 to 400 cycles (the round-trip to L2 and back). For a contested atomic with N threads queued: approximately N * 1 cycle at the controller, plus the 200-cycle round-trip. For 512 threads contending on one address: 200 + 512 = 712 cycles minimum. In practice, the queue management overhead and memory bus contention push this higher.
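The back-of-the-envelope model above can be written down directly. This is a minimal sketch using the round-trip and drain-rate figures from this section; the function name and the constant are ours, not a vendor specification.

```typescript
// Assumed figures from the discussion above: ~200-cycle L2 round-trip, and
// contested atomics to one cache line draining at ~1 per cycle.
const L2_ROUND_TRIP_CYCLES = 200;

/** Lower-bound latency (cycles) for `threads` atomics queued on one address. */
function contestedAtomicCycles(
  threads: number,
  roundTrip: number = L2_ROUND_TRIP_CYCLES,
): number {
  // Base round-trip plus one cycle per queued request at the L2 controller.
  return roundTrip + threads;
}

contestedAtomicCycles(8);   // → 208: barely above the uncontested baseline
contestedAtomicCycles(512); // → 712: the queue dominates the round-trip
```

In practice this is a floor, not an estimate: queue management and memory bus contention push real latency above it, as noted above.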

Shared memory atomics

When an atomic targets a var<workgroup> variable, it resolves at the workgroup's shared memory unit. Shared memory is on-chip SRAM, physically adjacent to the compute cores. Latency for an uncontested shared memory atomic: 20 to 50 cycles.

Contention in shared memory is limited to the workgroup size (typically 256 threads). At 256 threads, worst-case contention produces 256 serialized operations at 1 per cycle: 256 + 20 = 276 cycles. Painful but bounded.

This is why our two-phase text search builds character frequency histograms in shared memory rather than global memory. The contention ceiling is 256 threads per workgroup, not 3,072 threads across the entire GPU. And shared memory atomic latency (20 to 50 cycles) is 5x to 10x lower than global memory atomic latency (200 to 400 cycles).

The contention multiplier

The critical insight: contention cost is not linear with thread count. It is super-linear.

At low contention (8 threads per cycle targeting the same address), the L2 controller queues and drains the requests in 8 cycles. The overhead is small relative to the 200-cycle base latency. Total: approximately 208 cycles. Barely noticeable.

At moderate contention (64 threads), the queue drains in 64 cycles. Total: approximately 264 cycles. The atomic is 1.3x slower than uncontested. Still manageable.

At high contention (512 threads), the queue drains in 512 cycles. Total: approximately 712 cycles. The atomic is 3.5x slower than uncontested. But the real problem is not the individual atomic latency. It is the pipeline stall.

While 512 threads are queued at the L2 controller waiting for their atomic to complete, those threads are not executing any other instructions. Their warps are stalled. The warp scheduler cannot issue other instructions from those warps because the atomic has not returned. If those 512 threads represent a significant fraction of the active warps on the GPU, the entire compute pipeline stalls.

On a GPU with 96 active warps (3,072 threads / 32 per warp), 512 stalled threads means 16 stalled warps. That is 17% of the pipeline. The remaining 83% continues, but the stalled warps create bubbles in the scheduler that reduce overall throughput.

At 1,024 contending threads (33% of the pipeline stalled), throughput drops to 40% to 50% of peak. At 2,048 threads (67% stalled), throughput drops to 10% to 20%. The relationship is non-linear because the scheduler's ability to hide latency with other work degrades as more warps are stalled.

The compare-and-swap amplification problem

atomicCompareExchangeWeak is structurally worse than the other atomics under contention. The unconditional operations always succeed: atomicAdd adds its value and returns the old value, regardless of how many other threads are contending. The thread may wait in the queue, but it will not need to retry after completion.

Compare-and-swap (CAS) can fail. If the value at the target address changed between when the thread read the expected value and when the CAS executes, the exchange does not happen. The thread must reload the current value, recompute, and retry.

// CAS-based append to a result buffer
fn appendResult(value: u32) {
  var old = atomicLoad(&write_head);
  loop {
    let result = atomicCompareExchangeWeak(&write_head, old, old + 1u);
    if (result.exchanged) {
      output[old] = value;
      break;
    }
    old = result.old_value;
    // Failed. Another thread incremented write_head. Retry.
  }
}

Under contention, multiple threads load the same value of write_head, attempt the CAS simultaneously, and only one succeeds. The rest fail and retry. The retry re-enters the queue at the L2 controller. The queue depth is no longer equal to the number of contending threads. It is the number of contending threads multiplied by the average retry count.

With 64 threads contending on a single CAS target, 1 succeeds and 63 retry. On the retry round, 1 succeeds and 62 retry. The total number of CAS operations is:

64 + 63 + 62 + ... + 1 = 64 * 65 / 2 = 2,080 CAS operations

For 64 threads producing 64 successful writes. The amplification factor is 2,080 / 64 = 32.5x. The L2 controller processes 2,080 atomic operations instead of 64.

For 512 threads, the amplification factor is 512 * 513 / 2 / 512 = 256.5x. The controller processes 131,328 atomic operations. At 1 per cycle, that is 131,328 cycles. Over 65 microseconds at 2 GHz. For a single batch of CAS operations.
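The arithmetic generalizes: with n threads, round k leaves n - k survivors retrying, so total operations form the triangular sum n + (n-1) + ... + 1. A small sketch (function names ours):

```typescript
/** Worst-case total CAS operations for n threads contending on one address. */
function totalCasOps(threads: number): number {
  // n + (n-1) + ... + 1 = n * (n+1) / 2
  return (threads * (threads + 1)) / 2;
}

/** How many CAS operations the controller processes per successful write. */
function amplificationFactor(threads: number): number {
  return totalCasOps(threads) / threads; // simplifies to (n + 1) / 2
}

totalCasOps(64);          // → 2080 CAS ops for 64 successful writes
amplificationFactor(64);  // → 32.5x
totalCasOps(512);         // → 131328 CAS ops
```

The (n + 1) / 2 closed form makes the quadratic scaling explicit: doubling the contending threads roughly doubles the amplification factor, and quadruples the total work.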

This is why CAS-based algorithms (lock-free queues, concurrent hash maps, atomic counters with custom logic) are catastrophically slow on GPUs under contention. The algorithm that is "lock-free" on a CPU becomes effectively locked on a GPU, with a lock whose hold time scales quadratically with thread count.

Where atomic contention appears in real workloads

Atomic contention is not an exotic edge case. It appears in the most common GPU compute patterns.

Pattern 1: Global counters

Any shader that counts matching elements using a single atomicAdd counter:

if (predicate(data[idx])) {
  atomicAdd(&match_count, 1u);
}

Contention scales linearly with match rate. At 1% selectivity on 1 million elements: ~10,000 atomics. At 50% selectivity: ~500,000 atomics. The counter becomes the bottleneck.

Pattern 2: Output compaction

Writing matching elements to a dense output array requires claiming an output slot via atomicAdd on a write pointer:

if (predicate(data[idx])) {
  let slot = atomicAdd(&write_head, 1u);
  output[slot] = data[idx];
}

Same contention profile as global counters, with the additional cost of a dependent global memory write to output[slot].

Pattern 3: Histogram accumulation

Building a histogram with K bins:

let bin = computeBin(data[idx]);
atomicAdd(&histogram[bin], 1u);

Contention per bin = total_atomics / K. With 256 bins and 1 million elements: ~3,906 atomics per bin. Manageable. With 4 bins and 1 million elements: ~250,000 atomics per bin. Catastrophic.
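The per-bin arithmetic is worth encoding, since it is the quantity that flips a histogram from GPU-friendly to catastrophic. A sketch assuming uniformly distributed bin indices (function name ours):

```typescript
/** Expected atomics per histogram bin, assuming uniform bin distribution. */
function atomicsPerBin(elements: number, bins: number): number {
  return elements / bins;
}

atomicsPerBin(1_000_000, 256); // ≈ 3906 per bin: manageable
atomicsPerBin(1_000_000, 4);   // → 250000 per bin: catastrophic
```

Real data is rarely uniform; a skewed distribution concentrates atomics on hot bins, so the uniform estimate is a best case.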

Pattern 4: Reduction (min, max, sum)

A parallel reduction that writes per-workgroup partial results to a global accumulator:

if (local_id == 0u) {
  atomicAdd(&global_sum, workgroup_partial_sum);
}

Contention = number of workgroups. For 1 million elements / 256 per workgroup = 3,906 workgroups. 3,906 atomics to a single address. At 1 per cycle: 3,906 cycles. Approximately 2 microseconds. Low absolute cost, but it adds up across multiple reduction targets (sum, count, min, max each needing their own atomic).

Pattern 5: Bitmask writes

Setting bits in a result bitmask via atomicOr:

if (predicate(data[idx])) {
  atomicOr(&bitmask[idx / 32u], 1u << (idx % 32u));
}

Contention per bitmask word = number of matching elements within each 32-element group. With uniformly distributed matches, each word receives match_rate * 32 atomics. At 50% match rate: 16 atomics per word. Spread across many words (total_elements / 32), the per-address contention is low. Bitmask atomics are GPU-friendly for most workloads.

This is why our two-phase search uses atomicOr on a bitmask in Phase 1 (low per-address contention, spread across many words) rather than atomicAdd on a counter (all contention on one address).

Our Categorical GPU Inhibition Scoring system

Standard dispatch scoring treats atomic contention as a continuous variable: more contention means a lower score, which eventually routes to CPU. This fails because the contention-to-latency relationship is non-linear. A scoring function calibrated on low-contention measurements underestimates the penalty at high contention.

Our system uses a categorical boundary. Below the boundary, continuous scoring applies. Above it, the penalty is negative infinity. No continuous score can override it.

How contention density is estimated

Before GPU dispatch, the engine estimates the atomic contention density: the number of threads that will execute an atomic operation per GPU clock cycle.

For a filter with a counting atomic:

contentionDensity = (elementCount * estimatedSelectivity) / (elementCount / threadsPerCycle)
                  = estimatedSelectivity * threadsPerCycle

On a 3,072-core GPU with 1 thread per core per cycle: contentionDensity = selectivity * 3072. At 10% selectivity: 307 threads per cycle attempting atomicAdd on the same address.

For a histogram with K bins:

contentionDensity = (selectivity * threadsPerCycle) / K

At 100% selectivity (every element writes to some bin) with 256 bins: 3072 / 256 = 12 threads per bin per cycle. Low. With 4 bins: 3072 / 4 = 768 threads per bin per cycle. High.

The selectivity estimate comes from column statistics (histograms, min/max, dictionary cardinality) maintained during data ingestion. For text search, the selectivity is estimated from the Phase 1 histogram pre-filter popcount.
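The two density formulas above reduce to a few lines. This is a sketch with the structure taken from the formulas in this section; the function and constant names are ours.

```typescript
// Assumed hardware figure from the example above: 3,072 cores, 1 thread/core/cycle.
const THREADS_PER_CYCLE = 3072;

/** Threads per cycle hitting a single counter, for a counting atomic. */
function counterContentionDensity(
  selectivity: number,
  threadsPerCycle: number = THREADS_PER_CYCLE,
): number {
  // (elementCount * selectivity) / (elementCount / threadsPerCycle)
  // simplifies to selectivity * threadsPerCycle.
  return selectivity * threadsPerCycle;
}

/** Threads per cycle per bin, for a K-bin histogram. */
function histogramContentionDensity(
  selectivity: number,
  bins: number,
  threadsPerCycle: number = THREADS_PER_CYCLE,
): number {
  return (selectivity * threadsPerCycle) / bins;
}

counterContentionDensity(0.1);       // ≈ 307.2 threads/cycle on one address
histogramContentionDensity(1, 256);  // → 12 threads/cycle per bin
histogramContentionDensity(1, 4);    // → 768 threads/cycle per bin
```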

The 10% categorical threshold

When estimated contention density exceeds 10% of the GPU's thread capacity per cycle (approximately 307 threads on a 3,072-core GPU), the system assigns IEEE 754 negative infinity, a mathematically absolute value that no combination of positive factors can override.

The threshold derives from the L2 cache controller's atomic throughput. The controller sustains 32 to 64 concurrent atomic operations per cycle across all memory partitions. At 307 contending threads, the request-to-capacity ratio is 307 / 64 = 4.8x. The queue overflows. Warps stall. Throughput collapses to below CPU levels.

Below 10% (~150 threads per cycle at 5% density), the queue depth stays within 2x to 3x of the controller's capacity. Stalls are brief. The GPU still outperforms the CPU on large datasets. The continuous scoring function handles this range accurately, penalizing the GPU score proportionally to the estimated density.

Above 10%, the non-linear cascade makes the GPU slower than the CPU. The categorical penalty prevents dispatch.

What happens on the CPU tier

The Web Worker parallel tier handles the rerouted workload with a structurally different algorithm.

Counting: Each of 8 workers maintains a local count variable (a regular JavaScript number, not an atomic). After all workers complete, the main thread sums 8 values. Total atomic operations: 0.

Compaction: Each worker writes matches to a private section of the output SharedArrayBuffer. No two workers write to the same region. The main thread concatenates by adjusting offsets. Total atomic operations: 0.

Histogram: Each worker builds a local 256-entry array. The main thread sums 8 arrays element-wise (2,048 additions). Total atomic operations: 0.

Reduction: Each worker computes a local partial sum/min/max. The main thread reduces 8 values. Total atomic operations: 0.

The CPU tier eliminates contention by design, not by mitigation. Eight threads with private state do not contend. The algorithm is different. The data structure is different. Only the result is identical.
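The worker-private pattern can be sketched in a few lines. This models the structure sequentially; in the real tier each slice runs in its own Web Worker over a SharedArrayBuffer, and the names here are ours.

```typescript
/**
 * Count matching elements with zero atomic operations: each "worker" scans a
 * private slice with a plain counter, and the merge is a handful of additions.
 */
function parallelCount(
  data: Uint32Array,
  workers: number,
  predicate: (v: number) => boolean,
): number {
  const localCounts: number[] = new Array(workers).fill(0);
  const chunk = Math.ceil(data.length / workers);
  for (let w = 0; w < workers; w++) {
    // Private slice, private counter: no shared mutable state, no contention.
    const end = Math.min((w + 1) * chunk, data.length);
    for (let i = w * chunk; i < end; i++) {
      if (predicate(data[i])) localCounts[w]++;
    }
  }
  // "Main thread" merge: one addition per worker.
  return localCounts.reduce((a, b) => a + b, 0);
}
```

Compaction and histograms follow the same shape: private output regions or private bin arrays per worker, merged once at the end.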

Architectural patterns that avoid atomic contention on the GPU

Not every GPU workload needs atomics. When we can restructure the algorithm to avoid them, the GPU path remains viable even at high output density.

Pattern: Prefix sum for compaction

Instead of atomicAdd on a write pointer, compute a parallel prefix sum on the predicate bitmask:

  1. Each thread evaluates the predicate and writes 0 or 1 to a predicate array.
  2. A parallel Blelloch scan computes the exclusive prefix sum of the predicate array.
  3. Each thread with a 1 reads its prefix sum value as its output index and writes the element.

Zero atomics. Fully parallel. The prefix sum is O(n) work and O(log n) depth. This is the compaction strategy used throughout our query engine.
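The three steps can be modeled sequentially to show the index assignment. On the GPU the scan is a parallel Blelloch scan; the logic is the same, and the function name here is ours.

```typescript
/** Compact matching elements into a dense array via an exclusive prefix sum. */
function compactByPrefixSum(
  data: number[],
  predicate: (v: number) => boolean,
): number[] {
  if (data.length === 0) return [];
  // Step 1: each element produces a 0/1 predicate flag.
  const flags = data.map((v) => (predicate(v) ? 1 : 0));
  // Step 2: exclusive prefix sum of the flags (parallel scan on the GPU).
  const offsets: number[] = new Array(flags.length).fill(0);
  for (let i = 1; i < flags.length; i++) offsets[i] = offsets[i - 1] + flags[i - 1];
  // Step 3: each flagged element writes to its scanned offset. No atomics:
  // every output index is claimed exactly once, by construction.
  const total = offsets[flags.length - 1] + flags[flags.length - 1];
  const out: number[] = new Array(total);
  data.forEach((v, i) => {
    if (flags[i] === 1) out[offsets[i]] = v;
  });
  return out;
}

compactByPrefixSum([5, 2, 8, 1, 9], (v) => v > 4); // → [5, 8, 9]
```

The key property: the exclusive scan assigns each matching element a unique output slot deterministically, so no write pointer needs to be contended for.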

Pattern: Workgroup-local reduction with thread-0 flush

Instead of every thread writing to a global accumulator, each workgroup reduces locally in shared memory, and thread 0 writes the workgroup result to global memory:

var<workgroup> local_sum: atomic<u32>;

// ... each thread atomicAdds to local_sum ...

workgroupBarrier();
if (local_id == 0u) {
  atomicAdd(&global_sum, atomicLoad(&local_sum));
}

This reduces global atomics from N (one per thread) to N/256 (one per workgroup). For 1 million elements: 3,906 global atomics instead of up to 1,000,000. The shared memory atomics within each workgroup contend among at most 256 threads, which is manageable.

Pattern: Tile-based histograms

Instead of a single global histogram with K bins, each workgroup builds a local K-bin histogram in shared memory, then writes K values to a workgroup-indexed section of a global histogram array. A final reduction dispatch sums the per-workgroup histograms.

Zero global atomics during the main pass. The final reduction processes (workgroup_count * K) values, which is small enough for a single workgroup to handle.

The decision framework

The full decision tree for atomic contention:

  1. Can the algorithm be restructured to avoid atomics? (prefix sum for compaction, tile-based reduction, bitwise-OR bitmask) If yes, restructure. GPU dispatch proceeds normally.

  2. If atomics are unavoidable, estimate contention density from column statistics and operation characteristics.

  3. Density below 5%: Standard 6-factor scoring. GPU dispatch likely optimal for large datasets.

  4. Density 5% to 10%: Standard scoring with a contention penalty. GPU may or may not win depending on dataset size and hardware.

  5. Density above 10%: Categorical penalty. Negative infinity. Route to Web Workers unconditionally.

  6. CAS-based algorithms: Categorical penalty regardless of density. The quadratic amplification from retry loops makes CAS contention unpredictable and universally worse than the unconditional RMW atomics.

  7. Check for independent safety overrides: Branch divergence and precision constraints are evaluated separately. Any one of the three safety systems can independently block GPU dispatch.
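The routing tiers above reduce to a small function. This is an illustrative sketch: the thresholds and the negative-infinity categorical penalty come from this article, while the names, the Route type, and the shape of the continuous penalty are ours.

```typescript
type Route = "gpu" | "gpu-with-penalty" | "cpu-workers";

/** Contention component of the GPU score. -Infinity is categorical: no
 *  combination of positive factors can outweigh it. */
function gpuContentionPenalty(densityFraction: number, usesCas: boolean): number {
  if (usesCas || densityFraction > 0.10) return Number.NEGATIVE_INFINITY;
  if (densityFraction > 0.05) return -densityFraction * 100; // illustrative continuous penalty
  return 0;
}

function route(densityFraction: number, usesCas: boolean): Route {
  const penalty = gpuContentionPenalty(densityFraction, usesCas);
  if (penalty === Number.NEGATIVE_INFINITY) return "cpu-workers";
  return penalty < 0 ? "gpu-with-penalty" : "gpu";
}

route(0.02, false); // → "gpu"
route(0.08, false); // → "gpu-with-penalty"
route(0.25, false); // → "cpu-workers"
route(0.02, true);  // → "cpu-workers": CAS is categorical regardless of density
```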

Why this matters for your architecture

If you are building browser-based compute, you will encounter atomic contention the moment you move beyond trivial shaders. Every filter needs a result counter. Every aggregation needs an accumulator. Every compaction needs an output index.

The difference between a system that works and a system that collapses under load is whether it detects contention before it happens. A dashboard that runs a 2% selectivity filter at 2 ms on the GPU will run a 25% selectivity filter on the same dataset at 45 ms on the GPU. The user changed one slider. The latency jumped 22x. That is not a performance regression. It is an architectural failure to account for the hardware's serialization constraints.

Our adaptive dispatch engine prevents this. The selectivity estimate changes when the user moves the slider. The contention density recalculates. If the new density exceeds 10%, the operation routes to CPU workers in 4.8 ms instead of stalling the GPU at 45 ms. The user sees consistent 2 to 5 ms response times regardless of the filter parameters.

This is how we build enterprise AI automation infrastructure. Not by assuming the GPU is always faster. By measuring when it is, enforcing boundaries where it is not, and routing every operation to the tier that handles it without performance cliffs. Atomic contention is the cliff. Our engine stops you before the edge.

Want to discuss how this applies to your business?

Book a Discovery Call