What atomic operations do
An atomic operation reads a memory address, modifies the value, and writes it back as a single indivisible unit. No other thread can observe the intermediate state. No other thread can interleave a write between the read and the write-back.
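The same read-modify-write guarantee exists on the CPU side of the browser: JavaScript's Atomics API applies an operation to a SharedArrayBuffer as one indivisible unit. A minimal TypeScript illustration (single-threaded here, but the guarantee holds across workers sharing the buffer):

```typescript
// A 4-byte shared buffer holding one 32-bit counter.
const buf = new SharedArrayBuffer(4);
const counter = new Int32Array(buf);

// Atomics.add returns the OLD value and applies the increment atomically;
// no other thread can interleave between the read and the write-back.
const old = Atomics.add(counter, 0, 1);
```

Atomics.add returning the previous value is the same contract as WGSL's atomicAdd, which is what makes the slot-claiming pattern discussed later possible.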
In WebGPU compute shaders, atomicAdd, atomicSub, atomicMax, atomicMin, atomicAnd, atomicOr, atomicXor, atomicExchange, and atomicCompareExchangeWeak provide these guarantees on atomic<u32> and atomic<i32> storage variables.
The canonical use case: counting results. A filter shader evaluates a predicate on every element and needs to report how many elements passed. Each thread that finds a match increments a shared counter:
```wgsl
// data and threshold bindings added for completeness; indices illustrative.
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> match_count: atomic<u32>;
@group(0) @binding(2) var<uniform> threshold: f32;

@compute @workgroup_size(256)
fn filter(@builtin(global_invocation_id) id: vec3<u32>) {
    let idx = id.x;
    if (data[idx] > threshold) {
        atomicAdd(&match_count, 1u);
    }
}
```
This shader is correct. It is also a performance trap.
Why atomic contention destroys GPU throughput
A GPU with 3,072 cores can have 3,072 threads executing the same shader simultaneously. If the predicate is selective (few matches), few threads hit the atomicAdd per cycle. No problem. If the predicate is broad (many matches), hundreds or thousands of threads hit atomicAdd on the same memory address in the same cycle.
The hardware cannot execute these atomics in parallel. By definition, an atomic operation is serial at the target address. The memory controller must process each atomicAdd one at a time, or in small batches depending on the architecture.
The serialization cascade
Here is what happens on current GPU hardware when 512 threads attempt atomicAdd on the same address within one dispatch cycle:
Step 1: Warp-level serialization. Within a 32-thread warp, threads that execute the same atomic to the same address are serialized. If 20 threads in a warp hit the atomicAdd, the warp stalls for 20 cycles while each thread completes its atomic. The other 12 threads (those that did not match the predicate) are already done and sit idle.
Step 2: L2 cache controller queuing. The atomic operation targets a specific L2 cache line. The GPU's memory controller can process a limited number of concurrent atomics to the same cache line: typically 1 per cycle on NVIDIA hardware, with some pipelining that allows 32 to 64 outstanding requests. When multiple warps submit atomics to the same address simultaneously, the excess requests queue at the memory controller.
Step 3: Memory bus saturation. Each queued atomic occupies a slot in the memory request pipeline. While these slots are held by atomic retries, other warps that need to read or write global memory for unrelated work are starved. The entire pipeline backs up behind the contention point.
The result is non-linear degradation. Doubling the number of contending threads does not double the latency. It can quadruple it, because the queue depth increases and the memory bus starvation compounds.
Measured contention costs
We benchmarked a simple counting shader (increment a single atomic<u32> counter for each matching element) on a discrete GPU with 3,072 cores:
| Match rate | Matches per dispatch | Time | Throughput vs. no atomics |
|---|---|---|---|
| 0.1% (1 in 1,000) | ~1,000 | 0.8 ms | 95% of peak |
| 1% (10 in 1,000) | ~10,000 | 1.1 ms | 72% of peak |
| 5% (50 in 1,000) | ~50,000 | 3.4 ms | 23% of peak |
| 10% (100 in 1,000) | ~100,000 | 8.7 ms | 9% of peak |
| 25% (250 in 1,000) | ~250,000 | 28.1 ms | 2.8% of peak |
| 50% (500 in 1,000) | ~500,000 | 63.4 ms | 1.2% of peak |
Dataset: 1 million elements. The no-atomics baseline (predicate evaluation without counter increment) completes in 0.76 ms.
At 10% match rate, the GPU is operating at 9% of its theoretical throughput. At 50%, it is at 1.2%. The shader is spending nearly all its time waiting for the memory controller to process atomic requests. The 3,072 cores are functionally idle.
An 8-thread Web Worker pool running the same filter in JavaScript takes 4.8 ms regardless of match rate. The CPU does not have an atomic contention problem: each worker increments a plain local counter with a simple count++, the main thread merges the 8 counters after the loop, and 8 threads with private counters produce no contention at all.
At 10% match rate, the GPU takes 8.7 ms. The CPU takes 4.8 ms. The GPU is 1.8x slower. At 25%, the GPU is 5.9x slower. The performance inversion grows with match density.
Beyond counting: where atomic contention appears
The counting example is the simplest case. Atomic contention appears in every GPU algorithm that writes to a shared output structure with unpredictable write targets.
Histogram construction
A histogram with 256 bins and 1 million elements has, on average, 3,906 elements per bin. Each bin is an atomic<u32> counter. Threads writing to the same bin contend. With 256 bins, the contention per bin is 1/256th of the total, which is manageable. But reduce the bin count (e.g., a 16-bin histogram for a low-cardinality categorical column), and contention per bin rises to 62,500 elements. The GPU stalls.
This is why our two-phase text search uses 128-bin character frequency histograms in workgroup shared memory rather than global memory. Shared memory atomics are faster (20 to 50 cycles versus 200 to 400 cycles for global memory), and the contention is limited to 256 threads per workgroup rather than 3,072 across the entire GPU.
Scatter writes in compaction
A parallel filter that compacts matching elements into a dense output array needs each thread to claim an output slot. The standard approach: atomicAdd on a global write pointer, then write the element to the returned index.
```wgsl
if (data[idx] > threshold) {
    let slot = atomicAdd(&write_head, 1u);
    output[slot] = data[idx];
}
```
Every matching thread contends on write_head. At 50% selectivity, half of all threads contend on a single address every cycle. This is why our compaction uses a prefix-sum approach instead: compute a bitmask, prefix-sum the bitmask to derive write positions, then scatter without atomics. The prefix sum is conflict-free and fully parallel.
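As a sketch of the atomic-free alternative, here is the mask / prefix-sum / scatter structure in sequential TypeScript. The GPU version runs the prefix sum in parallel; `compact` and its shape are illustrative, not the engine's API:

```typescript
// Phase 1: evaluate the predicate into a 0/1 mask.
// Phase 2: exclusive prefix-sum the mask to derive each match's output slot.
// Phase 3: scatter; every write target is unique, so no atomics are needed.
function compact(data: number[], threshold: number): number[] {
  const mask = data.map(v => (v > threshold ? 1 : 0));

  // Exclusive prefix sum: slots[i] = number of matches before index i.
  const slots = new Array<number>(mask.length);
  let running = 0;
  for (let i = 0; i < mask.length; i++) {
    slots[i] = running;
    running += mask[i];
  }

  const out = new Array<number>(running);
  for (let i = 0; i < data.length; i++) {
    if (mask[i] === 1) out[slots[i]] = data[i];
  }
  return out;
}
```

Because each matching element's slot is derived from the prefix sum rather than claimed at runtime, two threads can never target the same output index.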
Aggregation with high group cardinality
A GROUP BY aggregation where each group has its own accumulator (atomicAdd(&group_totals[group_id], value)) distributes contention across group_count addresses. With 1,000 groups and 1 million rows, each group receives ~1,000 atomic writes. With 10 groups, each receives ~100,000. The contention per group scales inversely with group count.
Our query engine uses the Chao1 group cardinality estimator precisely for this reason. Low group cardinality means high per-group contention. The scoring function penalizes GPU dispatch as estimated group count drops.
Our output density profiler
The contention examples share a common structure: the severity of atomic contention is determined by the output density of the operation. Output density is the ratio of elements that produce an atomic write to total elements processed.
For a filter: output density equals the selectivity (fraction of elements that pass the predicate).
For a histogram: output density is 1.0 (every element increments some bin), but contention per bin depends on bin count. Effective contention density is 1 / bin_count.
For a GROUP BY aggregation: output density is 1.0, with effective contention density of 1 / group_count.
For a text search: output density is the fraction of documents that contain the query pattern.
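These definitions collapse into one hypothetical helper; the type and names are illustrative, not the engine's actual interface:

```typescript
// Effective contention density: the fraction of processed elements that
// contend per target address, per the definitions above.
type Op =
  | { kind: "filter"; selectivity: number }
  | { kind: "histogram"; binCount: number }
  | { kind: "groupBy"; groupCount: number }
  | { kind: "textSearch"; matchFraction: number };

function contentionDensity(op: Op): number {
  switch (op.kind) {
    case "filter":     return op.selectivity;   // all matches hit one counter
    case "histogram":  return 1 / op.binCount;  // writes spread across bins
    case "groupBy":    return 1 / op.groupCount; // writes spread across groups
    case "textSearch": return op.matchFraction;
  }
}
```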
Our engine's output density profiler estimates this ratio before GPU dispatch. The estimation uses the same lightweight statistics maintained during schema ingestion: column histograms, dictionary cardinality, min/max values, and null counts.
Estimation for filter predicates
For a range predicate (column > X), the profiler interpolates from the column's 64-bucket histogram to estimate selectivity. For an equality predicate on a dictionary-encoded column, selectivity is 1 / dictionary_size. For compound predicates (AND, OR), selectivities are combined with independence assumptions (multiply for AND, add minus product for OR).
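A minimal sketch of the range-predicate estimation, assuming equal-width histogram buckets over [min, max); function names and shapes are illustrative:

```typescript
// Estimate the selectivity of (column > x) from a bucketed row-count
// histogram, linearly interpolating within the bucket that contains x.
function estimateGreaterThan(
  buckets: number[], min: number, max: number, x: number
): number {
  const total = buckets.reduce((a, b) => a + b, 0);
  if (x <= min) return 1;
  if (x >= max) return 0;
  const width = (max - min) / buckets.length;
  const i = Math.floor((x - min) / width);
  // Rows in buckets strictly above i, plus a linear share of bucket i.
  const above = buckets.slice(i + 1).reduce((a, b) => a + b, 0);
  const frac = 1 - ((x - min) - i * width) / width;
  return (above + buckets[i] * frac) / total;
}

// Independence assumptions for compound predicates:
const andSel = (a: number, b: number) => a * b;         // multiply for AND
const orSel = (a: number, b: number) => a + b - a * b;  // add minus product for OR
```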
Estimation for aggregation
Effective contention density is estimated as 1 / estimated_group_count, using the Chao1 estimator for composite group-by keys or exact dictionary size for single-column group-by.
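For reference, Chao1 in its bias-corrected form estimates richness as S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of keys seen exactly once and exactly twice in a sample. A self-contained sketch (the engine's sampling strategy and key encoding are not shown):

```typescript
// Chao1 group-cardinality estimate from a sample of group keys.
function chao1(sampleKeys: string[]): number {
  const counts = new Map<string, number>();
  for (const k of sampleKeys) counts.set(k, (counts.get(k) ?? 0) + 1);

  let f1 = 0; // singletons: keys observed exactly once
  let f2 = 0; // doubletons: keys observed exactly twice
  for (const c of counts.values()) {
    if (c === 1) f1++;
    else if (c === 2) f2++;
  }
  // Bias-corrected form avoids division by zero when f2 = 0.
  return counts.size + (f1 * (f1 - 1)) / (2 * (f2 + 1));
}
```

Many singletons in the sample push the estimate up (more unseen groups, hence lower per-group contention); few singletons mean the observed count is close to the true count.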
Estimation for text search
The character frequency histogram pre-filter in Phase 1 produces a candidate bitmask. The popcount of this bitmask gives the exact Phase 2 output density before Phase 2 begins. If the popcount indicates high density (many candidates survived the pre-filter), the engine can re-route Phase 2 to the CPU.
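Computing that density from the candidate bitmask is a straightforward population count. A TypeScript sketch with illustrative names:

```typescript
// Standard bit-twiddling population count for a 32-bit word.
function popcount32(x: number): number {
  x = x - ((x >>> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
  x = (x + (x >>> 4)) & 0x0f0f0f0f;
  return (x * 0x01010101) >>> 24;
}

// Exact Phase 2 output density: surviving candidates / total documents.
function phase2Density(bitmask: Uint32Array, docCount: number): number {
  let candidates = 0;
  for (const word of bitmask) candidates += popcount32(word);
  return candidates / docCount;
}
```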
The categorical penalty threshold
When estimated output density exceeds 100 matches per 1,000 input elements (10%), our engine assigns a categorical penalty of IEEE 754 negative infinity to the GPU dispatch score. This is a mathematically absolute value: no combination of positive scoring factors can produce a non-negative result. The workload is routed unconditionally to the Web Worker tier.
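The arithmetic is what makes the penalty absolute: in IEEE 754, adding negative infinity to any finite value yields negative infinity. A minimal sketch (the real 6-factor inputs are not shown; the shape is illustrative):

```typescript
// Sum the positive scoring factors, then apply the categorical penalty.
// Any finite sum plus -Infinity is -Infinity: no factor can rescue it.
function gpuDispatchScore(factors: number[], outputDensity: number): number {
  const penalty = outputDensity > 0.10 ? Number.NEGATIVE_INFINITY : 0;
  return factors.reduce((a, b) => a + b, 0) + penalty;
}
```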
Why 10%
The threshold derives from the GPU's memory controller capacity, not from arbitrary tuning.
A discrete GPU with 3,072 cores dispatches 3,072 threads per cycle. At 10% output density, approximately 307 threads attempt an atomic write per cycle. The L2 cache controller on current-generation hardware (NVIDIA Ampere/Ada, AMD RDNA 3, Intel Arc) can sustain 32 to 64 concurrent atomic operations to distinct cache lines without queuing. For atomics to the same cache line, the throughput drops to 1 to 4 per cycle depending on the memory controller's coalescing capabilities.
At 307 contending threads per cycle targeting a small number of addresses (a single counter, or a handful of histogram bins), the queue depth exceeds the controller's capacity by 5x to 10x. The excess requests retry, each retry consuming a memory pipeline slot, and the cascade described earlier begins.
Below 10%, the contention is manageable. At 5% output density (~154 threads per cycle), the memory controller queues briefly but clears within a few cycles. Throughput drops to 23% of peak, which is still faster than the CPU alternative for large datasets. The dispatch score accounts for this degradation through the standard 6-factor scoring function without needing a categorical override.
Above 10%, the degradation becomes non-linear and unpredictable. The GPU's throughput drops below CPU levels, and the gap widens rapidly with density. A continuous scoring function cannot model this accurately because the relationship between density and latency is not linear, not quadratic, and varies by GPU microarchitecture. The only safe response is categorical exclusion.
This is the same Categorical GPU Inhibition Scoring principle applied to branch divergence: when the penalty is non-linear and hardware-dependent, a continuous penalty underestimates the risk on some configurations. Negative infinity eliminates the risk on all configurations.
What the CPU tier does differently
The Web Worker tier avoids atomic contention entirely through a different algorithmic structure.
For counting: each worker maintains a thread-local counter. After all workers complete, the main thread sums 8 counters. Zero contention. Zero atomic operations.
For compaction: each worker compacts matches into a thread-local output section of the SharedArrayBuffer. The main thread concatenates the sections. Each worker writes to a disjoint memory region. Zero contention.
For aggregation: each worker builds a thread-local hash map of group accumulators. The main thread merges 8 hash maps. Each worker reads shared input but writes to private output. Zero contention.
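The counting variant of this pattern can be sketched in sequential TypeScript, with the workers simulated as partitions of the input. Real workers would run these loops concurrently over a SharedArrayBuffer; names are illustrative:

```typescript
// Each "worker" owns a private counter for its partition: zero contention.
// The main thread merges the per-worker counters at the end.
function parallelCount(
  data: number[], threshold: number, workers: number
): number {
  const locals = new Array<number>(workers).fill(0);
  const chunk = Math.ceil(data.length / workers);
  for (let w = 0; w < workers; w++) {
    // Worker w scans only its own chunk and writes only its own counter.
    const end = Math.min((w + 1) * chunk, data.length);
    for (let i = w * chunk; i < end; i++) {
      if (data[i] > threshold) locals[w]++;
    }
  }
  // Merge step: one pass over 8 numbers, no synchronization needed.
  return locals.reduce((a, b) => a + b, 0);
}
```

The compaction and aggregation variants follow the same shape: disjoint private output per worker, one cheap merge at the end.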
This is the structural advantage of the Web Worker parallel tier. Eight threads with private output state have fundamentally different contention characteristics than 3,072 threads with shared output state. The algorithm changes, not just the execution target.
Workgroup shared memory atomics: the partial mitigation
Before the categorical threshold kicks in, our engine uses workgroup shared memory atomics as a partial contention mitigation for GPU dispatch.
The strategy: each workgroup of 256 threads maintains local accumulators in shared memory (var<workgroup>). Threads within the workgroup atomically increment shared memory counters, which have 20 to 50 cycle latency versus 200 to 400 cycles for global memory. After the workgroup completes, thread 0 writes the workgroup's local result to the global accumulator via a single atomicAdd.
This reduces global atomic contention by a factor of 256 (one global atomic per workgroup instead of one per matching thread). For 1 million elements in 256-thread workgroups, global atomic operations drop from 100,000 at a 10% match rate to at most ~3,906, one per workgroup. That is well within the memory controller's capacity.
```wgsl
// Bindings added for completeness; indices illustrative.
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> global_count: atomic<u32>;
@group(0) @binding(2) var<uniform> threshold: f32;
@group(0) @binding(3) var<uniform> element_count: u32;

var<workgroup> local_count: atomic<u32>;

@compute @workgroup_size(256)
fn filter_with_local_reduce(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>,
    @builtin(global_invocation_id) gid: vec3<u32>
) {
    // Initialize the shared counter once per workgroup
    if (lid.x == 0u) {
        atomicStore(&local_count, 0u);
    }
    workgroupBarrier();

    // Each thread evaluates its element
    if (gid.x < element_count && data[gid.x] > threshold) {
        atomicAdd(&local_count, 1u);
    }
    workgroupBarrier();

    // Thread 0 flushes the workgroup's total to global memory
    if (lid.x == 0u) {
        let count = atomicLoad(&local_count);
        if (count > 0u) {
            atomicAdd(&global_count, count);
        }
    }
}
```
This technique keeps the GPU viable up to the 10% density threshold. Beyond that, even shared memory atomics within a 256-thread workgroup produce measurable contention (25+ threads contending per cycle within the workgroup), and the global merge via thread 0 is not the bottleneck. The per-thread shared memory contention is.
The full decision tree
Putting it together, here is how our engine routes an operation that involves atomic writes:
- Estimate output density from column statistics.
- If density > 10%: categorical penalty (negative infinity). Route to the Web Worker tier, where workers use thread-local accumulators.
- If density is 10% or below: check whether the operation can use the shared-memory-reduce pattern (workgroup-local atomics with a thread-0 global flush).
  - If yes: apply standard 6-factor scoring with a contention adjustment factor proportional to density. GPU dispatch is permitted if the adjusted score exceeds 1.0.
  - If no (the operation requires per-thread global atomics that cannot be restructured): apply a continuous penalty scaled by density. GPU dispatch is permitted only if the dataset is large enough for the GPU's compute advantage to overcome the contention cost.
- Independently of the above, if the operation involves precision-sensitive accumulation, the Float32 Safety Guard may override the dispatch decision, routing to the CPU for numerical correctness regardless of contention analysis.
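The tree can be sketched as a single routing function. All names, the flag-style inputs, and the exact adjustment formula are illustrative, not the engine's implementation:

```typescript
type Route = "gpu" | "worker";

function routeAtomicOp(opts: {
  density: number;             // estimated output density, 0..1
  canSharedMemReduce: boolean; // workgroup-local atomics + thread-0 flush?
  sixFactorScore: number;      // standard scoring result, pre-adjustment
  datasetLargeEnough: boolean; // compute advantage outweighs contention cost?
  precisionSensitive: boolean; // Float32 Safety Guard input
}): Route {
  // Independent override: numerical correctness trumps contention analysis.
  if (opts.precisionSensitive) return "worker";
  // Categorical penalty: -Infinity score, unconditional CPU routing.
  if (opts.density > 0.10) return "worker";
  if (opts.canSharedMemReduce) {
    // Contention adjustment proportional to density (formula illustrative).
    const adjusted = opts.sixFactorScore * (1 - opts.density);
    return adjusted > 1.0 ? "gpu" : "worker";
  }
  // Per-thread global atomics: continuous penalty; needs a large dataset.
  return opts.datasetLargeEnough ? "gpu" : "worker";
}
```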
Multiple safety systems can independently block GPU dispatch. Branch divergence, precision loss, and atomic contention are evaluated separately. Any one of them can force CPU routing. The GPU path runs only when all three analyses confirm it is safe, correct, and faster.
Why this matters
Most browser-based GPU compute implementations ship with a single code path: "if the dataset is big, use the GPU." They discover atomic contention in production when a user's query returns a high match rate and the dashboard freezes for 3 seconds instead of responding in 10 ms.
Our engine discovers it before the first GPU instruction executes. The profiler estimates density. The categorical threshold blocks the dispatch. The Web Worker tier handles the workload in 4.8 ms. The user never sees the failure that was prevented.
This is the engineering standard behind our enterprise AI automation infrastructure. We do not ship GPU code paths that degrade under load. We profile the expected contention pattern, enforce hard boundaries where degradation is non-linear, and route to the tier that handles the workload without performance cliffs. The GPU is fast when it is uncontended. We make sure it stays that way.