The hardcoded threshold problem
Every WebGPU tutorial includes a line like this:
if (data.length > 100000) {
useGPU(data);
} else {
useCPU(data);
}
This is wrong. Not approximately wrong. Fundamentally wrong. The number 100,000 encodes an assumption about the hardware that is true for exactly one device: whichever device the developer benchmarked on.
On a workstation with an NVIDIA RTX 4090 (16,384 CUDA cores, 1 TB/s memory bandwidth), the GPU crossover for a simple element-wise operation is approximately 50,000 elements. Below that, the dispatch overhead and buffer transfer cost exceed the compute savings.
On a MacBook Air M2 (10-core GPU, 100 GB/s unified memory bandwidth), the crossover is approximately 400,000 elements. The GPU has fewer cores and shares memory bandwidth with the CPU. The break-even point is 8x higher.
On a budget Android tablet with a Mali-G57 (3 shader cores, 12 GB/s memory bandwidth), the crossover is approximately 3,000,000 elements. The GPU is so slow relative to the CPU that only massive datasets justify the dispatch overhead.
With a hardcoded threshold of 100,000, the Android tablet dispatches to the GPU at sizes where its GPU is slower than its CPU. The user waits longer. The developer never knows, because they tested on their workstation.
Why the variance is so large
Three hardware-dependent variables determine the GPU crossover point. Each varies by 10x to 100x across consumer devices.
Variable 1: Dispatch overhead
Every GPU operation has a fixed overhead before any computation begins. The overhead includes:
- device.createCommandEncoder(): allocate the command buffer structure.
- encoder.beginComputePass(): set up the compute pipeline state.
- pass.setPipeline() and pass.setBindGroup(): bind the shader and its data.
- pass.dispatchWorkgroups(): record the dispatch command.
- encoder.finish() and device.queue.submit(): finalize and submit to the GPU driver.
- Driver-level work: validate the command buffer, translate to hardware-specific commands, submit to the GPU's command queue, wait for the GPU to start execution.
On a discrete NVIDIA GPU with a mature Windows driver, this overhead is 0.02 to 0.05 ms. The driver is heavily optimized for low-latency submission. The PCIe link is dedicated. The GPU's command processor is fast.
On an Apple M-series GPU with unified memory, the overhead is 0.05 to 0.15 ms. No PCIe traversal, but the Metal-to-WebGPU translation layer adds cost. The GPU and CPU share the same memory controller, introducing contention.
On an Intel integrated GPU (UHD 770, Iris Xe), the overhead is 0.1 to 0.3 ms. The integrated GPU shares the system memory bus. Driver overhead is higher because the integrated GPU's command processor is simpler.
On a mobile GPU (Qualcomm Adreno 730, ARM Mali-G715), the overhead is 0.3 to 1.1 ms. Mobile drivers prioritize power efficiency over latency. The GPU's clock speed is lower. The command submission path is longer.
For an operation that takes 0.5 ms of compute, the dispatch overhead is:
- NVIDIA discrete: 4% of total time (negligible)
- Apple M-series: 20% of total time (significant)
- Intel integrated: 40% of total time (dominant for small datasets)
- Mobile: 100%+ of total time (the overhead alone exceeds the compute)
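These shares can be reproduced with a one-line helper. A minimal sketch using the illustrative figures above (not live measurements):

```typescript
// Dispatch overhead expressed as a fraction of the kernel's compute time.
// A value above 1.0 means the fixed overhead alone exceeds the useful work.
function overheadVsCompute(dispatchMs: number, computeMs: number): number {
  return dispatchMs / computeMs;
}

// Illustrative figures from the text, for a 0.5 ms kernel:
overheadVsCompute(0.02, 0.5); // NVIDIA discrete: 0.04 (4%)
overheadVsCompute(0.10, 0.5); // Apple M-series: 0.2 (20%)
overheadVsCompute(0.20, 0.5); // Intel integrated: 0.4 (40%)
overheadVsCompute(1.10, 0.5); // mobile: 2.2 (overhead alone exceeds compute)
```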
Variable 2: Memory bandwidth
Data must travel to the GPU before processing begins. On discrete GPUs, this crosses the PCIe bus. On integrated and mobile GPUs, the data stays in system memory but the driver still performs cache management.
| Hardware | Effective GPU memory bandwidth | 4 MB transfer time |
|---|---|---|
| RTX 4060 (PCIe 4.0 x16, GDDR6) | 272 GB/s VRAM, ~20 GB/s PCIe upload | 0.20 ms |
| Apple M2 (unified memory) | 100 GB/s shared | 0.04 ms |
| Intel Iris Xe (shared DDR5) | 50 GB/s shared | 0.08 ms |
| Intel UHD 770 (shared DDR4) | 35 GB/s shared | 0.11 ms |
| Qualcomm Adreno 730 (LPDDR5) | 44 GB/s shared | 0.09 ms |
| ARM Mali-G57 (LPDDR4X) | 12 GB/s shared | 0.33 ms |
The Apple M2 has the lowest transfer cost due to unified memory (no bus crossing). The RTX 4060 has the highest VRAM bandwidth but pays a PCIe transfer penalty. The Mali-G57 has more than 8x higher transfer cost than the M2 for the same data.
For an operation on 1 million Float32 elements (4 MB), the transfer time ranges from 0.04 ms to 0.33 ms. The ratio between fastest and slowest: 8.3x.
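The transfer column follows directly from size over bandwidth. A minimal sketch, using decimal units as the table does:

```typescript
// Time in ms to move `bytes` across a link with `gbPerSec` of effective
// bandwidth. Decimal units (1 GB = 1e9 bytes), matching the table above.
function transferMs(bytes: number, gbPerSec: number): number {
  return (bytes / (gbPerSec * 1e9)) * 1000;
}

const fourMB = 4e6;
transferMs(fourMB, 100); // Apple M2: 0.04 ms
transferMs(fourMB, 12);  // Mali-G57: ~0.33 ms
```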
Variable 3: Compute throughput
The GPU's raw processing speed determines how quickly the data is processed once it arrives.
| Hardware | Compute cores | Peak GFLOPS (FP32) | 1M element-wise multiply time |
|---|---|---|---|
| RTX 4060 | 3,072 CUDA cores | 15,110 | 0.08 ms |
| Apple M2 GPU | 10 cores (1,280 ALUs) | 3,600 | 0.14 ms |
| Intel Iris Xe | 96 EUs (768 ALUs) | 2,460 | 0.21 ms |
| Intel UHD 770 | 32 EUs (256 ALUs) | 819 | 0.63 ms |
| Qualcomm Adreno 730 | 1,024 ALUs | 1,700 | 0.30 ms |
| ARM Mali-G57 | 3 cores (48 ALUs) | 100 | 5.1 ms |
The RTX 4060 is 64x faster than the Mali-G57 in raw compute. For the same element-wise operation on 1 million elements, the GPU compute time ranges from 0.08 ms to 5.1 ms.
The combined effect
Total GPU time = dispatch overhead + transfer time + compute time.
For 1 million Float32 elements through a simple multiply:
| Hardware | Dispatch | Transfer | Compute | Total GPU | CPU time (4 cores) | GPU faster? |
|---|---|---|---|---|---|---|
| RTX 4060 | 0.03 ms | 0.20 ms | 0.08 ms | 0.31 ms | 2.1 ms | Yes (6.8x) |
| Apple M2 | 0.10 ms | 0.04 ms | 0.14 ms | 0.28 ms | 1.8 ms | Yes (6.4x) |
| Intel Iris Xe | 0.20 ms | 0.08 ms | 0.21 ms | 0.49 ms | 2.4 ms | Yes (4.9x) |
| Intel UHD 770 | 0.25 ms | 0.11 ms | 0.63 ms | 0.99 ms | 3.1 ms | Yes (3.1x) |
| Adreno 730 | 0.50 ms | 0.09 ms | 0.30 ms | 0.89 ms | 2.8 ms | Yes (3.1x) |
| Mali-G57 | 1.10 ms | 0.33 ms | 5.10 ms | 6.53 ms | 4.2 ms | No (0.64x) |
At 1 million elements, five of six devices benefit from GPU dispatch. The Mali-G57 does not. A hardcoded threshold of 100,000 would dispatch to the GPU on all six devices, making the Mali-G57 user 1.6x slower than the CPU path.
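The rows above can be reproduced from the three components. A sketch using the Mali-G57's illustrative figures (real values come from calibration, not constants):

```typescript
interface DeviceCosts {
  dispatchMs: number; // fixed submission overhead
  transferMs: number; // time to move the data to the GPU
  computeMs: number;  // shader execution time
}

// Total GPU time = dispatch overhead + transfer time + compute time.
function gpuTotalMs(c: DeviceCosts): number {
  return c.dispatchMs + c.transferMs + c.computeMs;
}

// Speedup over the CPU path; values below 1.0 mean the GPU loses.
function gpuSpeedup(c: DeviceCosts, cpuMs: number): number {
  return cpuMs / gpuTotalMs(c);
}

// Illustrative 1M-element figures from the table:
const mali: DeviceCosts = { dispatchMs: 1.1, transferMs: 0.33, computeMs: 5.1 };
gpuSpeedup(mali, 4.2); // ~0.64: the GPU path is slower on this device
```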
At 100,000 elements, the picture changes:
| Hardware | Total GPU | CPU time (4 cores) | GPU faster? |
|---|---|---|---|
| RTX 4060 | 0.25 ms | 0.21 ms | No (0.84x, GPU slightly slower) |
| Apple M2 | 0.15 ms | 0.18 ms | Barely (1.2x faster) |
| Intel UHD 770 | 0.42 ms | 0.31 ms | No (0.74x) |
| Mali-G57 | 1.86 ms | 0.42 ms | No (0.23x) |
At 100,000 elements, even the RTX 4060 loses to the CPU: the dispatch overhead and transfer cost dominate the tiny compute. The hardcoded threshold is wrong for the majority of devices.
Our self-calibrating threshold function
We do not guess the crossover point. We measure it. On first load, the adaptive dispatch engine runs three microbenchmarks that characterize the specific hardware. Total calibration time: under 200 ms.
Microbenchmark 1: Adapter capability probing
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU is not available on this device');
const info = adapter.info; // GPUAdapterInfo (requestAdapterInfo() is deprecated)
const limits = adapter.limits;
const capabilities = {
vendor: info.vendor,
architecture: info.architecture,
maxBufferSize: limits.maxStorageBufferBindingSize,
maxWorkgroupSize: limits.maxComputeWorkgroupSizeX,
maxWorkgroupsPerDimension: limits.maxComputeWorkgroupsPerDimension,
maxBindGroups: limits.maxBindGroups,
maxStorageBuffersPerStage: limits.maxStorageBuffersPerShaderStage,
};
This is not a benchmark. It is a capability query. It takes under 1 ms. The results tell the engine:
- maxStorageBufferBindingSize: The largest buffer the GPU can bind. On desktop GPUs: 1 to 4 GB. On mobile GPUs: 128 MB to 512 MB. This sets the upper bound on dataset size for GPU dispatch. A 2 GB dataset cannot dispatch to a GPU with a 256 MB buffer limit.
- maxComputeWorkgroupSizeX: The maximum threads per workgroup. Desktop: 256 to 1,024. Mobile: 64 to 256. This determines the tile size for local operations like bitonic sort and histogram construction.
- Vendor and architecture strings: Used for coarse hardware classification (discrete, integrated, mobile) before the microbenchmarks provide precise measurements.
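A coarse classification pass over those strings might look like the following sketch. The vendor substrings and class names here are illustrative assumptions, not the engine's actual tables; adapters report inconsistent strings across browsers, so this is only a first guess that the microbenchmarks then refine:

```typescript
type DeviceClass = 'discrete' | 'integrated' | 'mobile' | 'unknown';

// Hypothetical coarse classifier from GPUAdapterInfo strings. The
// microbenchmarks below provide the precise numbers; this only seeds them.
function classifyAdapter(vendor: string, architecture: string): DeviceClass {
  const v = vendor.toLowerCase();
  const a = architecture.toLowerCase();
  if (v.includes('nvidia') || v.includes('amd')) return 'discrete';
  if (v.includes('qualcomm') || v.includes('arm') ||
      a.includes('adreno') || a.includes('mali')) return 'mobile';
  if (v.includes('intel') || v.includes('apple')) return 'integrated';
  return 'unknown';
}
```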
Microbenchmark 2: Memory bandwidth
The engine allocates a 4 MB test buffer, dispatches a trivial pass-through compute shader (read one element, write it to an output buffer, no arithmetic), and measures wall-clock time.
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(256)
fn bandwidth_test(@builtin(global_invocation_id) id: vec3<u32>) {
output[id.x] = input[id.x];
}
This shader does the minimum work per element: one global memory read, one global memory write. The execution time is dominated by memory bandwidth, not compute. The engine runs this 5 times and takes the median.
From the result:
const BYTES_MOVED = 4 * 1024 * 1024 * 2; // 4 MB read + 4 MB written
const effectiveBandwidth = BYTES_MOVED / medianTime; // bytes per ms
The effective bandwidth captures everything: PCIe transfer cost on discrete GPUs, cache flushing on integrated GPUs, driver overhead, and any memory controller contention. It is the actual bandwidth the GPU achieves for this operation on this device at this moment.
Typical results:
| Hardware | Measured effective bandwidth |
|---|---|
| RTX 4060 | 18.2 GB/s (PCIe-limited for upload) |
| Apple M2 GPU | 72.4 GB/s (unified memory) |
| Intel Iris Xe | 38.1 GB/s |
| Intel UHD 770 | 24.6 GB/s |
| Adreno 730 | 31.2 GB/s |
| Mali-G57 | 8.4 GB/s |
Microbenchmark 3: Dispatch overhead
The engine runs 100 minimal dispatches: a single workgroup, a single invocation, a shader that writes one value. Each dispatch goes through the full submission path: create encoder, begin pass, set pipeline, dispatch, end pass, finish, submit.
const times: number[] = [];
for (let i = 0; i < 100; i++) {
const start = performance.now();
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(minimalPipeline);
pass.setBindGroup(0, minimalBindGroup);
pass.dispatchWorkgroups(1);
pass.end();
device.queue.submit([encoder.finish()]);
await device.queue.onSubmittedWorkDone();
times.push(performance.now() - start);
}
const dispatchOverhead = median(times);
The onSubmittedWorkDone() fence ensures the GPU has completed the dispatch before the next iteration. The median of 100 samples eliminates outliers from OS scheduling jitter.
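The median helper used above is a few lines. A minimal sketch:

```typescript
// Median of a sample set; robust to the occasional slow outlier from
// OS scheduling jitter, unlike the mean.
function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```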
This measurement captures the fixed cost of every GPU dispatch, independent of data size. It is the floor below which no GPU operation can execute, no matter how fast the shader is.
Deriving the calibration ratio
The three measurements combine into a calibration ratio that normalizes GPU performance against CPU performance for the specific device:
function deriveCalibrationRatio(
bandwidth: number, // bytes/ms
dispatchOverhead: number, // ms
cpuThroughput: number // elements/ms (measured from a CPU reference benchmark)
): number {
// GPU throughput for a memory-bound operation
const gpuThroughputPerMs = bandwidth / 4; // 4 bytes per Float32 element
// Break-even dataset size: where GPU compute time = CPU compute time
// GPU total = dispatchOverhead + (elements / gpuThroughputPerMs)
// CPU total = elements / cpuThroughput
// Break-even: dispatchOverhead + (N / gpuThroughput) = N / cpuThroughput
// N = dispatchOverhead * (gpuThroughput * cpuThroughput) / (gpuThroughput - cpuThroughput)
const ratio = cpuThroughput / gpuThroughputPerMs;
// ratio < 1: GPU throughput exceeds CPU. Lower thresholds.
// ratio > 1: CPU throughput exceeds GPU. Higher thresholds (or CPU-only).
// ratio = 1: GPU and CPU are equal. Thresholds are at break-even.
return ratio;
}
The ratio is a single number that characterizes how this device's GPU compares to its CPU for data-parallel work. A lower ratio means the GPU is relatively stronger. A higher ratio means the CPU is relatively stronger.
Typical ratios:
| Hardware | Calibration ratio | Interpretation |
|---|---|---|
| RTX 4060 (discrete) | 0.12 | GPU is 8.3x faster per element. Low thresholds. |
| Apple M2 (integrated, fast) | 0.28 | GPU is 3.6x faster. Moderate thresholds. |
| Intel Iris Xe (integrated) | 0.41 | GPU is 2.4x faster. Higher thresholds. |
| Intel UHD 770 (integrated) | 0.63 | GPU is 1.6x faster. High thresholds. |
| Adreno 730 (mobile) | 0.52 | GPU is 1.9x faster. High thresholds. |
| Mali-G57 (mobile, weak) | 1.45 | CPU is faster for most operations. Very high thresholds or CPU-only. |
When the ratio exceeds 1.0, the CPU is faster than the GPU per element. The GPU can still win on very large datasets (where the throughput advantage overcomes the fixed overhead), but the crossover point is in the millions of elements.
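The break-even size referenced in the function's comments can be computed directly. A sketch under the same memory-bound model; the example throughputs are illustrative, not measured:

```typescript
// Break-even dataset size N where GPU total time equals CPU total time:
//   dispatchOverhead + N / gpuThroughput = N / cpuThroughput
// Solving for N:
//   N = dispatchOverhead * gpuThroughput * cpuThroughput / (gpuThroughput - cpuThroughput)
// Only meaningful when GPU per-element throughput exceeds the CPU's;
// otherwise the GPU never recovers the fixed overhead.
function breakEvenElements(
  dispatchOverheadMs: number,
  gpuThroughput: number, // elements/ms
  cpuThroughput: number  // elements/ms
): number {
  if (gpuThroughput <= cpuThroughput) return Infinity;
  return (dispatchOverheadMs * gpuThroughput * cpuThroughput) /
         (gpuThroughput - cpuThroughput);
}

// Illustrative: 0.1 ms overhead, GPU 2x the CPU's per-element throughput.
breakEvenElements(0.1, 1e6, 5e5); // 100,000 elements
breakEvenElements(0.1, 100, 200); // Infinity: CPU is faster per element
```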
Scaling base thresholds by operation type
Different operations have different compute-to-memory ratios. An element-wise filter (1 comparison per element, memory-bound) breaks even at a different dataset size than a matrix multiply (O(n^3) arithmetic per O(n^2) data, compute-bound).
The engine maintains base thresholds for operation categories:
| Operation category | Base threshold (reference hardware) | Arithmetic intensity |
|---|---|---|
| Element-wise (filter, map) | 500,000 elements | Low (memory-bound) |
| Reduction (sum, min, max) | 200,000 elements | Low to medium |
| Radix sort | 300,000 elements | Medium (4 passes) |
| Histogram | 100,000 elements | Medium |
| Group-by aggregation | 150,000 elements | Medium to high |
| Matrix multiply | 128 x 128 matrix | High (compute-bound) |
| Convolution / windowed ops | 50,000 elements | High |
These base thresholds are calibrated for a "reference" device (discrete GPU, calibration ratio = 0.15). The actual threshold for any device is scaled by the calibration ratio:
function computeThreshold(baseThreshold: number, calibrationRatio: number): number {
return Math.round(baseThreshold * (calibrationRatio / REFERENCE_RATIO));
}
For an element-wise filter (base threshold 500,000):
- RTX 4060 (ratio 0.12): threshold = 500,000 * (0.12 / 0.15) = 400,000
- Apple M2 (ratio 0.28): threshold = 500,000 * (0.28 / 0.15) = 933,000
- Intel UHD 770 (ratio 0.63): threshold = 500,000 * (0.63 / 0.15) = 2,100,000
- Mali-G57 (ratio 1.45): threshold = 500,000 * (1.45 / 0.15) = 4,833,000
The M2 user needs nearly 1 million elements before the GPU helps for element-wise work. The UHD 770 user needs over 2 million. The Mali-G57 user needs nearly 5 million. A hardcoded 100,000-element threshold would dispatch to the GPU on all four devices, degrading performance on three of them.
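These thresholds fall out of computeThreshold once the reference constant is made explicit. A self-contained restatement (REFERENCE_RATIO = 0.15, per the text above; values rounded in the prose):

```typescript
const REFERENCE_RATIO = 0.15; // calibration ratio of the reference discrete GPU

function computeThreshold(baseThreshold: number, calibrationRatio: number): number {
  return Math.round(baseThreshold * (calibrationRatio / REFERENCE_RATIO));
}

// Element-wise filter, base threshold 500,000:
computeThreshold(500_000, 0.12); // RTX 4060: 400,000
computeThreshold(500_000, 0.28); // Apple M2: 933,333 (prose rounds to 933,000)
computeThreshold(500_000, 1.45); // Mali-G57: 4,833,333
```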
For a matrix multiply (base threshold 128 x 128):
- RTX 4060: threshold = 128 * (0.12 / 0.15) = 102 -> 102 x 102 matrix
- Apple M2: threshold = 128 * (0.28 / 0.15) = 239 -> 239 x 239 matrix
- Intel UHD 770: threshold = 128 * (0.63 / 0.15) = 538 -> 538 x 538 matrix
Matrix multiply is compute-bound, so the GPU's advantage activates at smaller dataset sizes. But even here, the crossover varies 5x across hardware.
When the calibration itself fails
The calibration assumes the hardware's performance characteristics are stable. Three scenarios invalidate this.
Thermal throttling
A laptop that started on AC power (full GPU clock) switches to battery (throttled GPU clock). The calibration ratio was derived at full speed. The actual performance is now 40% to 60% of calibrated.
Our engine does not re-calibrate continuously (the 200 ms cost would be disruptive). Instead, the dispatch scoring function includes a safety margin: the calibration ratio is multiplied by 1.15 before threshold calculation. This means the thresholds are 15% conservative. If the GPU is slightly slower than calibrated, the engine still makes a correct (if slightly suboptimal) decision.
For severe throttling (GPU clock drops by 50%+), the device loss handler may trigger (the GPU watchdog kills throttled compute shaders that exceed timeout). The engine falls back to CPU and re-probes on the next invocation, at which point the new calibration reflects the throttled state.
Background GPU contention
Another tab or application using the GPU reduces available resources. The calibration measured uncontested performance. The actual performance is lower.
The safety margin partially handles this. For severe contention, the engine detects elevated dispatch times (the operation took 3x longer than the calibration predicted) and temporarily biases toward CPU dispatch for subsequent operations. This is a runtime adaptation, not a re-calibration.
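A minimal version of that runtime adaptation might look like this sketch. The 3x trigger, the doubling back-off, and the decay rate are illustrative assumptions, not the engine's exact constants:

```typescript
// Hypothetical runtime bias: if a GPU operation runs much slower than the
// calibration predicted, temporarily raise the effective threshold so
// subsequent operations prefer the CPU/worker path. No re-calibration.
class ContentionBias {
  private biasFactor = 1.0;

  // Call after each GPU operation with predicted vs observed time.
  observe(predictedMs: number, actualMs: number): void {
    if (actualMs > predictedMs * 3) {
      this.biasFactor = Math.min(this.biasFactor * 2, 8); // back off, capped
    } else {
      this.biasFactor = Math.max(this.biasFactor * 0.9, 1.0); // decay to neutral
    }
  }

  // Effective threshold = calibrated threshold scaled by the current bias.
  apply(threshold: number): number {
    return Math.round(threshold * this.biasFactor);
  }
}
```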
Driver updates
A driver update can change GPU performance characteristics (sometimes dramatically, for better or worse). The calibration is cached for the session lifetime. On the next session (page reload), the engine re-calibrates and derives a fresh ratio.
The dispatch decision in practice
Putting it together, here is what happens when the application calls engine.dispatch('filter_gt', data):
- Read the dataset size. 500,000 Float32 elements.
- Look up the operation category. Element-wise filter.
- Compute the device-specific threshold. Base 500,000 * (calibrationRatio / 0.15).
- Compare. If dataset size >= threshold, dispatch to GPU. If dataset size >= 10,000 (worker threshold), dispatch to Web Workers. Otherwise, main thread.
- Check safety overrides. Branch divergence: none for element-wise filter. Precision: check accumulation bound. Atomic contention: check output density.
- Dispatch.
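The decision sequence can be sketched as a pure function. WORKER_THRESHOLD, the category table, and the no-op safety check below are simplified stand-ins for the engine's internals, not its actual code:

```typescript
type Tier = 'gpu' | 'worker' | 'main';

const WORKER_THRESHOLD = 10_000; // from step 4 above
const REFERENCE_RATIO = 0.15;

// Base thresholds per operation category (subset of the table above).
const BASE_THRESHOLDS: Record<string, number> = {
  filter_gt: 500_000, // element-wise
  sum: 200_000,       // reduction
};

// Hypothetical safety-override hook; the real engine checks branch
// divergence, precision bounds, and atomic contention here.
function safetyOverride(op: string, size: number): Tier | null {
  return null; // no override for the element-wise case sketched here
}

function chooseTier(op: string, size: number, calibrationRatio: number): Tier {
  const base = BASE_THRESHOLDS[op];
  const gpuThreshold = Math.round(base * (calibrationRatio / REFERENCE_RATIO));
  const override = safetyOverride(op, size);
  if (override !== null) return override;
  if (size >= gpuThreshold) return 'gpu';
  if (size >= WORKER_THRESHOLD) return 'worker';
  return 'main';
}

chooseTier('filter_gt', 500_000, 0.12); // RTX 4060-like ratio: 'gpu'
chooseTier('filter_gt', 500_000, 0.63); // UHD 770-like ratio: 'worker'
chooseTier('filter_gt', 5_000, 0.63);   // tiny dataset: 'main'
```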
The decision takes under 0.001 ms. It runs on every operation. The application code is unaware of it:
// Same code on every device. Different dispatch decisions.
const result = await engine.dispatch('filter_gt', { data, threshold: 1000 });
On the RTX 4060 (threshold ~400,000): dataset is 500,000. GPU dispatch. Result in 1.1 ms.
On the Intel UHD 770 (threshold ~2,100,000): dataset is 500,000. Below GPU threshold. Web Worker dispatch. Result in 4.8 ms.
On the Mali-G57 (threshold ~4,833,000): dataset is 500,000. Below GPU threshold. Web Worker dispatch. Result in 6.2 ms.
All three produce the correct result. The RTX 4060 user gets GPU speed. The UHD 770 user gets Web Worker speed (faster than GPU on their hardware). The Mali-G57 user gets Web Worker speed (dramatically faster than GPU on their hardware). No user is penalized by a threshold calibrated for someone else's device.
Comparison with static threshold approaches
| Approach | RTX 4060 | Apple M2 | Intel UHD 770 | Mali-G57 |
|---|---|---|---|---|
| Hardcoded 100K (naive) | GPU: 0.25 ms | GPU: 0.15 ms | GPU: 0.42 ms (slower than CPU) | GPU: 1.86 ms (4.4x slower than CPU) |
| Hardcoded 500K (conservative) | GPU: 1.1 ms | CPU: 4.8 ms (missed GPU opportunity) | CPU: 4.8 ms (correct) | CPU: 6.2 ms (correct) |
| Our calibrated threshold | GPU: 1.1 ms | GPU: 1.8 ms (correctly dispatched) | CPU: 4.8 ms (correctly avoided) | CPU: 6.2 ms (correctly avoided) |
The naive threshold is optimal for zero devices. The conservative threshold is safe but misses the GPU on devices where it would help (the M2). Our calibrated threshold makes the correct decision for every device because it measured every device.
Why this matters for enterprise deployments
Enterprise hardware is heterogeneous. The developer's workstation has a discrete GPU. The sales team's laptops have Intel UHD integrated graphics. The warehouse tablets run Qualcomm Adreno. The reception kiosk runs an ARM Chromebook. The VDI sessions have no GPU at all.
One application serves all of them. Hardcoded thresholds optimize for one and penalize the rest. Self-calibrating thresholds optimize for each.
The 200 ms calibration runs once per session. It is invisible to the user (it runs during page load, overlapping with other initialization). The benefit lasts the entire session: every operation dispatches to the correct tier for that specific device.
This is the foundation of our enterprise AI automation infrastructure. We do not assume hardware. We measure it. We do not hardcode thresholds. We derive them. The engine adapts to the device, not the other way around.