The problem with choosing a compute backend
Your browser has three execution tiers: the main thread, Web Workers, and (as of 2024) WebGPU compute shaders. Most engineering teams pick one and hard-code it. That decision is wrong the moment a user opens the same application on different hardware.
A fixed Web Worker pool runs well on a developer's 16-thread workstation. It crawls on a 4-core enterprise laptop locked to a power profile. A fixed WebGPU shader performs beautifully on a discrete NVIDIA RTX 4060 and falls flat on Intel UHD integrated graphics with 24 execution units sharing system memory.
Static backend selection is dead. The question is not "WebGPU or Web Workers." The question is "which backend, for this dataset, on this hardware, right now."
How Web Workers handle parallelism
Web Workers run JavaScript on separate OS threads. They share memory via SharedArrayBuffer and coordinate with Atomics. The threading model is straightforward: you partition your dataset, assign each partition to a worker, and merge results.
The ceiling is your CPU core count. A typical enterprise laptop exposes 4 to 8 logical cores. A high-end workstation might offer 16. The theoretical speedup follows Amdahl's Law. In practice, you hit diminishing returns around 8 workers due to memory bus contention and the overhead of postMessage serialization for non-shared data.
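The shape of those diminishing returns falls out of the formula directly. A minimal sketch (the 10% serial fraction is an illustrative assumption, not a measured value):

```javascript
// Amdahl's Law: speedup(n) = 1 / (serialFraction + (1 - serialFraction) / n)
// serialFraction is the portion of the workload that cannot be parallelized.
function amdahlSpeedup(workers, serialFraction) {
  return 1 / (serialFraction + (1 - serialFraction) / workers);
}

// With even 10% serial work, doubling from 8 to 16 workers barely helps:
const at8 = amdahlSpeedup(8, 0.1);   // ~4.7x
const at16 = amdahlSpeedup(16, 0.1); // ~6.4x
```

And this is before memory bus contention and serialization overhead, which push the practical ceiling even lower.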
Where Web Workers excel
- Sequential logic with branching. CPUs handle conditional paths efficiently. If your workload is 60% branching logic and 40% arithmetic, Web Workers win.
- Small to medium datasets. Below 100,000 elements, the overhead of GPU buffer allocation, shader compilation, and PCIe/shared memory transfer eliminates any GPU advantage.
- Environments without GPU access. Enterprise VDI instances, locked-down terminals, and headless environments often disable GPU acceleration entirely.
Where Web Workers hit their ceiling
The bottleneck is thread count. Eight threads processing a 10-million-element dataset means each thread handles 1.25 million elements sequentially. Every element still passes through a scalar ALU one at a time. You are parallel across threads but serial within each thread.
For arithmetic-heavy, uniform workloads (sorting, aggregation, filtering, matrix operations), this is the wrong execution model. You need thousands of concurrent operations, not eight.
How WebGPU compute shaders change the equation
WebGPU exposes the GPU's compute pipeline directly from JavaScript. A compute shader written in WGSL dispatches across workgroups, each containing up to 256 invocations by default (the maxComputeInvocationsPerWorkgroup limit), scheduled in SIMT fashion across the GPU's cores.
A discrete GPU with 3,072 CUDA cores (or equivalent) can process 3,072 elements per clock cycle in the best case. An integrated GPU with 96 execution units handles fewer, but still vastly outpaces 8 CPU threads for data-parallel workloads.
The programming model differs fundamentally from Web Workers:
```wgsl
// Storage bindings, declared here so the shader is complete:
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  let idx = id.x;
  // Guard against the final, partially filled workgroup.
  if (idx >= arrayLength(&input)) { return; }
  output[idx] = process(input[idx]); // process() stands in for your per-element operation
}
```
This shader processes every element independently. No locks. No shared mutable state. No Atomics.wait(). You write the operation for a single element, and the GPU hardware schedules it across all available cores.
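On the host side, the only sizing decision is how many workgroups to dispatch. A minimal sketch, assuming the workgroup size of 256 from the shader above:

```javascript
// Host-side dispatch sizing: one invocation per element, rounded up to
// whole workgroups. The shader's bounds check discards the excess
// invocations in the final workgroup.
const WORKGROUP_SIZE = 256;

function workgroupCount(elementCount) {
  return Math.ceil(elementCount / WORKGROUP_SIZE);
}

// Inside a command encoder this would be used as:
//   pass.dispatchWorkgroups(workgroupCount(n));
```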
The cost: transfer overhead
GPU compute is not free. Data must travel from JavaScript heap memory to GPU-accessible buffers via device.queue.writeBuffer(). Results must be read back via mapAsync(). On discrete GPUs, this crosses the PCIe bus. On integrated GPUs, the memory is shared, but the driver still performs format validation and synchronization.
For a 1-million-element Float32Array (4 MB), the round-trip transfer on PCIe 4.0 x16 takes roughly 0.25 ms. The compute itself might take 0.1 ms. The transfer dominates. This is why small datasets are faster on the CPU: the compute savings never recover the transfer cost.
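A back-of-envelope estimator makes the trade-off concrete. This is a sketch, not a measurement: the 32 GB/s per-direction figure for PCIe 4.0 x16 is an assumed effective rate, and real drivers add validation and synchronization on top.

```javascript
// Rough round-trip transfer estimate: upload + readback of the same buffer.
function transferMs(bytes, bandwidthBytesPerSec = 32e9) {
  const roundTripBytes = 2 * bytes; // writeBuffer() up, mapAsync() back
  return (roundTripBytes / bandwidthBytesPerSec) * 1000;
}

transferMs(4 * 1024 * 1024); // ~0.26 ms for a 1-million-element Float32Array
```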
Our 3-tier adaptive architecture
We do not pick a backend statically. We built the Adaptive Hardware-Aware Dispatch Engine to make the decision at runtime, per-operation, per-device.
The architecture has three tiers:
Tier 1: CPU single-thread
For datasets under 10,000 elements, the main thread is fastest. No worker spawn overhead. No GPU buffer allocation. Just a tight loop on a single core with L1 cache locality.
Tier 2: Web Worker parallel
For medium datasets (10,000 to 500,000 elements on capable hardware), we split across a worker pool sized to navigator.hardwareConcurrency. Each worker operates on a contiguous slice of a SharedArrayBuffer. Merge overhead is minimal because the partitions are pre-sorted by index.
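The slicing logic for Tier 2 can be sketched as follows; the worker spawning and messaging code is omitted, and `partition` is an illustrative helper, not the engine's actual API:

```javascript
// Split [0, elementCount) into contiguous per-worker ranges over a shared
// buffer. Each worker gets a Float32Array view of its own slice.
function partition(elementCount, workerCount) {
  const base = Math.floor(elementCount / workerCount);
  const remainder = elementCount % workerCount;
  const ranges = [];
  let start = 0;
  for (let i = 0; i < workerCount; i++) {
    const len = base + (i < remainder ? 1 : 0); // spread the remainder evenly
    ranges.push({ start, end: start + len });
    start += len;
  }
  return ranges;
}

const sab = new SharedArrayBuffer(10 * Float32Array.BYTES_PER_ELEMENT);
const views = partition(10, 4).map(
  ({ start, end }) => new Float32Array(sab, start * 4, end - start)
);
```

Because the ranges are contiguous and non-overlapping, workers never contend for the same elements, and the merge step is a no-op for in-place operations.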
Tier 3: WebGPU compute
For large datasets, we dispatch to the GPU. The crossover point depends on the hardware. On a discrete GPU with dedicated VRAM and high memory bandwidth (250+ GB/s), the crossover sits at roughly 500,000 elements. On integrated GPUs sharing system memory (40 to 60 GB/s bandwidth), the crossover rises to approximately 2,000,000 elements.
These are not magic numbers. They are measured.
How we derive crossover thresholds: runtime microbenchmarks
Most systems use static thresholds. "If the array has more than N elements, use the GPU." That breaks the moment the hardware changes. A threshold tuned for an RTX 4090 is wrong for an Intel Arc A380.
We probe the actual hardware. On first load, the dispatch engine runs a calibration sequence:
Step 1: Adapter capability probing
```javascript
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error("WebGPU unavailable");
const info = adapter.info; // adapter.requestAdapterInfo() is deprecated
const limits = adapter.limits;
```
This gives us the device vendor, architecture string, maximum buffer size, maximum compute workgroup dimensions, and maximum storage buffers per shader stage. We classify the adapter into discrete, integrated, or software fallback.
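A hedged sketch of that classification step. The vendor and architecture strings matched below are illustrative only; browsers report driver-specific values, so a production table needs far broader coverage, and anything unmatched falls through to the microbenchmarks:

```javascript
// Heuristic classification from adapter metadata.
function classifyAdapter({ vendor = "", architecture = "", isFallback = false }) {
  if (isFallback) return "software";
  const v = vendor.toLowerCase();
  const a = architecture.toLowerCase();
  if (v.includes("nvidia") || a.includes("rdna")) return "discrete";
  if (v.includes("intel") && a.includes("gen")) return "integrated";
  return "unknown"; // defer to the runtime microbenchmarks below
}
```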
Step 2: Memory bandwidth microbenchmark
We allocate a 4 MB test buffer, dispatch a trivial pass-through compute shader (read + write, no arithmetic), and measure wall-clock time. This gives us effective memory bandwidth for the GPU path, including driver overhead and any PCIe transfer cost.
Step 3: Dispatch overhead microbenchmark
We run 100 minimal dispatches (1 workgroup, 1 invocation each) and measure the average per-dispatch cost. This captures the fixed overhead of commandEncoder.beginComputePass(), dispatchWorkgroups(), and queue.submit().
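The measurement loop itself is backend-agnostic; a minimal sketch, with the actual GPU work injected as a callback:

```javascript
// Average the wall-clock cost of N calls to an async dispatch function.
// In the real calibration pass, dispatchOnce encodes and submits a
// 1-workgroup compute pass and awaits queue.onSubmittedWorkDone().
async function averageDispatchMs(dispatchOnce, iterations = 100) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    await dispatchOnce();
  }
  return (performance.now() - start) / iterations;
}
```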
Step 4: Calibration ratio
From these two measurements, we derive a calibration ratio:
calibrationRatio = (cpuThroughput / gpuThroughput) * (1 + dispatchOverhead / expectedComputeTime)
The ratio encodes the break-even point. When the expected compute time for a given dataset size exceeds the GPU's transfer + dispatch overhead, the dispatch engine routes to Tier 3. Below that point, it stays on Tier 1 or Tier 2.
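One way to express that break-even as code. This is a simplified sketch: the throughput and overhead inputs stand in for the microbenchmark results, and the function names are illustrative, not the engine's API.

```javascript
// The GPU path pays a fixed cost (transfer + dispatch) that the expected
// compute savings must recover. Throughputs in elements/ms, overhead in ms.
function gpuWorthIt({ elementCount, cpuElemsPerMs, gpuElemsPerMs, fixedGpuOverheadMs }) {
  const cpuMs = elementCount / cpuElemsPerMs;
  const gpuMs = elementCount / gpuElemsPerMs + fixedGpuOverheadMs;
  return gpuMs < cpuMs;
}
```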
This calibration runs once per session. It takes under 200 ms on modern hardware. The result is cached for the session lifetime.
The dispatch score
Every operation request receives a dispatch score computed from three inputs:
- Dataset cardinality. Element count and byte size.
- Operation complexity. Arithmetic intensity (FLOPs per byte). A simple filter is memory-bound. A matrix multiply is compute-bound. The optimal backend differs.
- Calibration ratio. The hardware-specific break-even derived from microbenchmarks.
The formula:
dispatchScore = (elementCount * opsPerElement) / calibrationRatio
If the score exceeds 1.0, the GPU path is faster. Below 0.3, the CPU single-thread path wins. Between 0.3 and 1.0, Web Workers are the optimal choice.
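The routing rule above reduces to a few lines. A sketch, using the score formula and the 0.3 / 1.0 cutoffs from the text (the example calibrationRatio in the usage comment is invented for illustration):

```javascript
// Route an operation using the dispatch score:
//   dispatchScore = (elementCount * opsPerElement) / calibrationRatio
// calibrationRatio comes from the per-session calibration pass.
function selectBackend(elementCount, opsPerElement, calibrationRatio) {
  const score = (elementCount * opsPerElement) / calibrationRatio;
  if (score > 1.0) return "webgpu";
  if (score < 0.3) return "cpu-single";
  return "web-workers";
}

// e.g. with an illustrative calibrationRatio of 1e6:
//   selectBackend(2000000, 1, 1e6) routes to "webgpu"
//   selectBackend(100000, 1, 1e6) stays on "cpu-single"
```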
This means the same application, processing the same dataset, will use different backends on different machines. A user on a MacBook Air M2 (integrated GPU, unified memory, 100 GB/s bandwidth) hits the WebGPU path at a different threshold than a user on a Dell Optiplex with Intel UHD 730.
That is the point. You do not configure this. The hardware tells you.
Real-world performance characteristics
We measured these numbers across our test matrix of 14 device configurations:
| Dataset size | Discrete GPU | Integrated GPU | 8-thread CPU |
|---|---|---|---|
| 100,000 elements | 2.1 ms (GPU overhead dominates) | 3.4 ms | 1.8 ms (CPU wins) |
| 500,000 elements | 3.2 ms | 8.7 ms | 12.4 ms |
| 2,000,000 elements | 5.8 ms | 14.1 ms | 48.6 ms |
| 10,000,000 elements | 18.3 ms | 52.4 ms | 243.1 ms |
At 10 million elements, the discrete GPU path is 13.3x faster than the CPU worker pool. The integrated GPU path is 4.6x faster. These ratios hold consistently for arithmetic-heavy operations like sorting, prefix sums, and histogram computation.
Why this matters for your enterprise AI automation infrastructure
If you are building browser-based analytics, real-time dashboards, or client-side data pipelines, your compute backend choice determines whether your application feels instant or sluggish. The difference between 18 ms and 243 ms is the difference between a responsive UI and a frozen tab.
Static backend selection forces you to pick the lowest common denominator. Adaptive dispatch lets every user's hardware run at its ceiling. This is the same principle behind our broader enterprise AI automation infrastructure: probe the environment, measure the constraints, and route computation to the optimal execution path.
Where this is heading
The WebGPU specification is stabilizing. Chrome and Edge ship it by default; Firefox ships it behind a flag. Safari support is in preview. Within 18 months, WebGPU compute will be as universally available as Web Workers are today.
The teams that build adaptive dispatch now will have a structural advantage. Their applications will automatically exploit next-generation hardware: discrete GPUs with higher core counts, integrated GPUs with wider memory buses, and eventually NPUs exposed through future web APIs.
The teams that hard-code new Worker() will be rewriting their compute layer every hardware generation.
We chose not to rewrite. We chose to measure.