The metric that determines whether the GPU helps
Not all computation benefits equally from the GPU. The difference between a 50x speedup and a 2x slowdown comes down to one number: arithmetic intensity.
Arithmetic intensity is the ratio of floating-point operations (FLOPs) performed to bytes transferred from memory:
arithmeticIntensity = FLOPs / bytes
It measures how much useful work the processor does for each byte it reads. High intensity means the processor spends most of its time computing. Low intensity means it spends most of its time waiting for data.
GPUs have massive compute throughput (10,000+ GFLOPS on a discrete GPU) but moderate memory bandwidth (200 to 400 GB/s). The ratio of peak throughput to peak bandwidth defines the hardware's operational intensity ceiling:
operationalIntensity = peakGFLOPS / peakBandwidth(GB/s)
For an NVIDIA RTX 4060: 15,110 GFLOPS / 272 GB/s = 55.5 FLOPs/byte.
This means: if your algorithm performs fewer than 55.5 FLOPs per byte of data it reads, the GPU is waiting for memory. It cannot compute fast enough to keep its ALUs busy. The memory bus is the bottleneck. You are memory-bound.
If your algorithm performs more than 55.5 FLOPs per byte, the ALUs are the bottleneck. Memory delivers data faster than the cores can process it. You are compute-bound. This is where the GPU's thousands of cores provide maximum value.
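Both ratios are trivial to compute. A small TypeScript sketch (device figures are the RTX 4060 numbers used throughout; function names are ours):

```typescript
// Arithmetic intensity of a workload vs. a device's operational intensity.
interface Device { peakGFLOPS: number; bandwidthGBs: number; }
const rtx4060: Device = { peakGFLOPS: 15110, bandwidthGBs: 272 };

// FLOPs performed per byte moved to or from memory
function arithmeticIntensity(flops: number, bytes: number): number {
  return flops / bytes;
}

// The hardware ceiling: FLOPs the device can issue per byte it can deliver
function operationalIntensity(dev: Device): number {
  return dev.peakGFLOPS / dev.bandwidthGBs;
}

// Below the ceiling: memory-bound. Above it: compute-bound.
function isComputeBound(intensity: number, dev: Device): boolean {
  return intensity > operationalIntensity(dev);
}
```

An element-wise FMA at 0.25 FLOPs/byte comes out memory-bound; a large GEMM at 170.7 FLOPs/byte comes out compute-bound.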
Element-wise operations: memory-bound
An element-wise operation (add, multiply, compare) performs one operation per element. Each element is a Float32 (4 bytes). The intensity:
FLOPs: 1 per element
Bytes: 4 per element (read) + 4 per element (write) = 8
Intensity: 1 / 8 = 0.125 FLOPs/byte
For a fused multiply-add (a * x + b):
FLOPs: 2 per element
Bytes: 8 (same read + write)
Intensity: 2 / 8 = 0.25 FLOPs/byte
0.25 FLOPs/byte versus the RTX 4060's operational intensity of 55.5 FLOPs/byte. The algorithm uses 0.45% of the GPU's compute capacity. The other 99.55% sits idle, waiting for the memory controller to deliver the next byte.
This is why element-wise operations on small datasets are faster on the CPU. The CPU's memory bandwidth (40 to 60 GB/s for DDR5) is lower than the GPU's, but the CPU's compute-to-bandwidth ratio is also lower (1 to 2 FLOPs/byte for a single core). The CPU is better matched to memory-bound work. It wastes less of its compute capacity.
Element-wise operations only justify GPU dispatch when the dataset is large enough for the GPU's raw memory bandwidth advantage (272 GB/s vs. 50 GB/s) to overcome the fixed dispatch overhead and transfer cost. On a discrete GPU, that crossover is approximately 1,000,000 elements (4 MB). Below that, the CPU wins.
Matrix multiplication: compute-bound
GEMM (General Matrix Multiply) computes C = A × B, where A is M×K, B is K×N, and C is M×N.
For the general (non-square) case, the arithmetic intensity is:
FLOPs: 2 * M * N * K
Bytes: (M*K + K*N + M*N) * 4 (read A, read B, write C, all Float32)
Intensity: 2MNK / ((MK + KN + MN) * 4) FLOPs/byte
For square matrices (M = K = N = n), this simplifies to:
FLOPs: 2 * n^3 (n^3 multiplications + n^3 additions)
Bytes: 3 * n^2 * 4 = 12 * n^2 (read A, read B, write C, each n^2 Float32 elements)
Intensity: 2n^3 / (12n^2) = n/6 FLOPs/byte
The intensity scales linearly with matrix dimension. At n=128: 128/6 = 21.3 FLOPs/byte. At n=256: 42.7 FLOPs/byte. At n=512: 85.3 FLOPs/byte. At n=1024: 170.7 FLOPs/byte.
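The general and square-case formulas can be verified with a few lines (a sketch; function names are ours):

```typescript
// Arithmetic intensity of C = A×B with A M×K, B K×N, C M×N, all Float32.
function gemmIntensity(m: number, n: number, k: number): number {
  const flops = 2 * m * n * k;               // one multiply + one add per output term
  const bytes = (m * k + k * n + m * n) * 4; // read A, read B, write C
  return flops / bytes;
}

// Square case reduces to 2n^3 / (12n^2) = n/6
function squareGemmIntensity(n: number): number {
  return gemmIntensity(n, n, n);
}
```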
By n=256, the arithmetic intensity (42.7) approaches the RTX 4060's operational ceiling (55.5). The GPU's ALUs are nearly fully utilized. The memory bus delivers data fast enough to keep them busy.
This is why GEMM is the canonical GPU workload. The O(n^3) computation on O(n^2) data creates an intensity that grows with the problem size. Larger matrices are not just "more work." They are more efficient work. Each byte of data produces more compute operations, keeping the GPU's ALUs busier.
Comparing intensity at the dispatch boundary
| Operation | Data size | FLOPs | Bytes | Intensity | GPU utilization (RTX 4060) |
|---|---|---|---|---|---|
| Element-wise multiply (1M elements) | 4 MB | 1,000,000 | 8,000,000 | 0.125 | 0.23% |
| Element-wise FMA (1M elements) | 4 MB | 2,000,000 | 8,000,000 | 0.25 | 0.45% |
| Reduction SUM (1M elements) | 4 MB | 1,000,000 | 4,000,000 | 0.25 | 0.45% |
| Radix sort pass (1M elements) | 4 MB | ~3,000,000 | 8,000,000 | 0.375 | 0.68% |
| GEMM 128×128 | 192 KB | 4,194,304 | 196,608 | 21.3 | 38.4% |
| GEMM 256×256 | 768 KB | 33,554,432 | 786,432 | 42.7 | 76.9% |
| GEMM 512×512 | 3 MB | 268,435,456 | 3,145,728 | 85.3 | 100% (compute-bound) |
| GEMM 1024×1024 | 12 MB | 2,147,483,648 | 12,582,912 | 170.7 | 100% (fully saturated) |
At 128×128 (192 KB of data across three matrices, 4 million FLOPs), GEMM achieves 38% GPU utilization. At the same data volume, an element-wise operation achieves 0.23%. GEMM is 167x more efficient per byte.
This is why our dispatch engine sends a 128×128 matrix multiply to the GPU but keeps a 16,000-element (64 KB) element-wise operation on the CPU. Same data size. Completely different intensity. Completely different dispatch decision.
How GEMM executes on WebGPU
The naive kernel
The textbook GEMM: each thread computes one element of C by iterating over the shared dimension k.
```wgsl
// Bindings and dimensions added for completeness; the override constants
// M, N, K are set at pipeline creation.
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;
override M: u32;
override N: u32;
override K: u32;

@compute @workgroup_size(16, 16)
fn gemm_naive(@builtin(global_invocation_id) gid: vec3<u32>) {
    let row = gid.y;
    let col = gid.x;
    if (row >= M || col >= N) { return; }
    var sum: f32 = 0.0;
    for (var i: u32 = 0u; i < K; i++) {
        sum += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
}
```
Each thread reads one full row of A and one full column of B from global memory. For n=512, each thread performs 512 multiply-adds but reads 1,024 Float32 values (4,096 bytes) from global memory. That is 1,024 FLOPs / 4,096 bytes = 0.25 FLOPs/byte per thread.
Wait. The total intensity is n/6 = 85.3 FLOPs/byte. But per thread, it is 0.25. The discrepancy: every thread reads the same data. Thread (0,0) reads row 0 of A. Thread (0,1) also reads row 0 of A. Across all n threads in that row, row 0 of A is read n times. The global memory bandwidth is consumed n times for data that could be cached.
The naive kernel achieves 3% to 8% of the GPU's peak GFLOPS. The arithmetic intensity exists in the algorithm but is not realized in the implementation because the data reuse pattern is not exploited.
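Either kernel is launched from the host with a ceil-division grid so every output element is covered. A minimal sizing helper (TypeScript; the helper names are ours, mirroring the `@workgroup_size(16, 16)` kernels where `gid.x` indexes columns and `gid.y` indexes rows):

```typescript
// Workgroup counts needed to cover an M×N output with 16×16 workgroups.
function dispatchGrid(m: number, n: number, tile = 16): [number, number] {
  const ceilDiv = (a: number, b: number) => Math.floor((a + b - 1) / b);
  return [ceilDiv(n, tile), ceilDiv(m, tile)]; // [x, y] workgroup counts
}
// e.g. pass.dispatchWorkgroups(...dispatchGrid(M, N)) on a GPUComputePassEncoder
```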
The tiled kernel: exploiting shared memory
The tiled GEMM loads blocks of A and B into workgroup shared memory, where all threads in the workgroup can access them at register-like speed. Each element is loaded once from global memory and reused by every thread that needs it.
```wgsl
// Bindings and dimensions added for completeness; the override constants
// M, N, K are set at pipeline creation.
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;
override M: u32;
override N: u32;
override K: u32;

const TILE: u32 = 16u;
var<workgroup> tileA: array<array<f32, 16>, 16>;
var<workgroup> tileB: array<array<f32, 16>, 16>;

@compute @workgroup_size(16, 16)
fn gemm_tiled(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>
) {
    let row = wid.y * TILE + lid.y;
    let col = wid.x * TILE + lid.x;
    var sum: f32 = 0.0;
    let numTiles = (K + TILE - 1u) / TILE;

    for (var t: u32 = 0u; t < numTiles; t++) {
        // Cooperatively load one tile of A and one tile of B into shared memory
        let aCol = t * TILE + lid.x;
        let bRow = t * TILE + lid.y;
        if (row < M && aCol < K) {
            tileA[lid.y][lid.x] = A[row * K + aCol];
        } else {
            tileA[lid.y][lid.x] = 0.0;
        }
        if (bRow < K && col < N) {
            tileB[lid.y][lid.x] = B[bRow * N + col];
        } else {
            tileB[lid.y][lid.x] = 0.0;
        }
        workgroupBarrier();

        // Each thread computes its partial sum using the shared tile
        for (var i: u32 = 0u; i < TILE; i++) {
            sum += tileA[lid.y][i] * tileB[i][lid.x];
        }
        workgroupBarrier();
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}
```
Each workgroup is 16×16 threads (256 threads). Each tile is 16×16 elements (1 KB for Float32). Two tiles (A and B) occupy 2 KB of shared memory. Well within the 16 KB workgroup shared memory limit.
The data reuse: each element of tileA is used by 16 threads (every thread in the tile's column). Each element of tileB is used by 16 threads (every row). The reuse factor is 16x. Each global memory load serves 16 multiply-add operations instead of 1.
For n=512 with 16×16 tiles: the algorithm processes 512/16 = 32 tiles along the k dimension. Each tile pair requires loading 16×16 = 256 elements of A and 256 of B from global memory (2,048 bytes total). Each tile produces 16×16×16 = 4,096 multiply-adds (8,192 FLOPs). Effective intensity: 8,192 / 2,048 = 4.0 FLOPs/byte per tile load.
That is 16x better than the naive kernel (0.25 FLOPs/byte per thread). With 16×16 tiles the realized intensity is still well short of the theoretical n/6 = 85.3 FLOPs/byte; larger tiles and per-thread register blocking raise the reuse factor further and close most of the remaining gap. Measured GPU utilization: 75% to 90% of peak GFLOPS on a discrete GPU.
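The per-tile figure generalizes: a TILE×TILE tile pair costs 2·TILE² Float32 loads from global memory and yields TILE³ multiply-adds, so the effective intensity per global load is TILE/4 FLOPs/byte. This is why larger tiles help when shared memory permits. A sketch:

```typescript
// Effective FLOPs per global-memory byte for one TILE×TILE tile pair.
function tiledIntensityPerLoad(tile: number): number {
  const flops = 2 * tile ** 3;     // TILE^3 multiply-adds = 2·TILE^3 FLOPs
  const bytes = 2 * tile ** 2 * 4; // one tile of A + one tile of B, Float32
  return flops / bytes;            // simplifies to TILE / 4
}
```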
Shared memory access patterns
Within the tile, the 16 threads in a row read consecutive elements of tileA from shared memory. This is a broadcast pattern: all threads in the row need element tileA[lid.y][i] for the same i. Shared memory handles broadcasts in a single cycle (bank broadcast). No bank conflicts. Full throughput.
The tileB access pattern has each thread reading tileB[i][lid.x]. Threads in the same row read different columns of tileB. On GPUs with 32-bank shared memory (NVIDIA, AMD), the 16 threads access 16 different banks (assuming 4-byte elements and consecutive layout). No bank conflicts. Full throughput.
The tiling transforms GEMM from a global-memory-bound disaster into a shared-memory-fed compute engine that runs at near-peak ALU utilization.
How intensity drives dispatch decisions
Our 7-factor scoring function includes arithmetic intensity as one of its dispatch factors. The intensity determines how the dispatch threshold scales with data size.
Memory-bound operations (intensity < 1 FLOP/byte)
For element-wise, filter, and reduction operations, the GPU's advantage comes from memory bandwidth, not compute throughput. The GPU must read and write every byte. The question is whether the GPU's memory bandwidth (272 GB/s discrete, 50 to 100 GB/s integrated) outweighs the transfer overhead and dispatch cost.
At 0.25 FLOPs/byte, the GPU processes data at its memory bandwidth ceiling. The effective speedup over CPU is the ratio of GPU bandwidth to CPU bandwidth: roughly 5x on discrete (272 / 50), 2x on integrated (50 / 25). After accounting for dispatch overhead (0.03 to 1.1 ms) and transfer cost (0.2 ms per 4 MB on PCIe), the break-even is:
breakEvenBytes = (dispatchOverhead + transferTime) / (1/cpuBandwidth - 1/gpuBandwidth)
The result is in bytes; dividing by the 8 bytes moved per element (4 read + 4 written) gives the break-even element count.
For a discrete GPU: approximately 800,000 to 1,200,000 elements. Our calibrated threshold reflects this per-device.
For an integrated GPU: approximately 2,000,000 to 4,000,000 elements. The bandwidth ratio is smaller, so the dataset must be larger to amortize the overhead.
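In consistent units the break-even comes out in bytes, then divides by the 8 bytes moved per element. A sketch, with the overhead value an assumption drawn from the ranges quoted above:

```typescript
// Break-even element count for a memory-bound element-wise op.
// Approximation: dispatch + transfer overhead is treated as a fixed cost,
// though transfer time actually grows with the data size.
function breakEvenElements(fixedOverheadMs: number, cpuGBs: number,
                           gpuGBs: number, bytesPerElement = 8): number {
  const perByteSavingsSec = 1 / (cpuGBs * 1e9) - 1 / (gpuGBs * 1e9);
  const breakEvenBytes = (fixedOverheadMs * 1e-3) / perByteSavingsSec;
  return breakEvenBytes / bytesPerElement;
}

// Discrete GPU, ~0.13 ms combined overhead (assumed): roughly one million elements
const discrete = breakEvenElements(0.13, 50, 272);
```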
Compute-bound operations (intensity > 10 FLOPs/byte)
For GEMM, convolution, and polynomial evaluation, the GPU's advantage comes from raw compute throughput. The data transfer is a small fraction of the total time because the GPU spends most of its time computing, not transferring.
At 21.3 FLOPs/byte (128×128 GEMM), the total data is 192 KB. Transfer time on PCIe 4.0: 0.01 ms. Dispatch overhead: 0.03 ms. Compute time on GPU: 0.26 ms. Total GPU: 0.30 ms.
CPU compute for 128×128 GEMM: 2.1 ms (single-threaded) or 0.35 ms (8-thread BLAS). The GPU wins even against a multi-threaded CPU at just 128×128.
The break-even for compute-bound operations is determined by the compute ratio (GPU GFLOPS / CPU GFLOPS), not the bandwidth ratio. For a discrete GPU with 15,110 GFLOPS versus a CPU with 200 GFLOPS (4 cores at 50 GFLOPS each): 75x compute advantage. The fixed overhead (0.04 ms) is amortized at:
breakEvenFLOPs ≈ dispatchOverhead * effectiveGPUThroughput = 0.04 ms * ~16 GFLOPS ≈ 640,000 FLOPs
The effective throughput matters here, not the peak: at this size the GPU sustains roughly 16 GFLOPS (4,194,304 FLOPs in 0.26 ms), far below its 15,110 GFLOPS peak. A 128×128 GEMM performs 4,194,304 FLOPs, about 7x above break-even. GPU dispatch is decisively correct.
The scoring formula integration
The arithmetic intensity factor enters the dispatch score as one of seven factors in the scoring formula. The seven factors are: size-vs-threshold, sparsity penalty, precision risk, GPU class bonus, vendor tuning, fusion bonus, and arithmetic intensity. Together they produce a composite score that determines whether an operation routes to CPU, Web Workers, or GPU.
For a GEMM at 128×128 (cardinality = 16,384 output elements, intensity = 21.3): the high arithmetic intensity drives a strong GPU score. Even with a moderate hardware calibration ratio (integrated GPU), the score exceeds 1.0. GPU dispatch.
For an element-wise multiply at 100,000 elements (cardinality = 100,000, intensity = 0.125): the low intensity produces a weak score. On an integrated GPU with calibration ratio 0.41, the score falls below 0.3. CPU dispatch.
Despite both being modest in data volume, these operations produce opposite dispatch decisions because the intensity differs by 170x. The scoring function captures this automatically.
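The production formula is not reproduced here, but the shape of the decision can be illustrated with a toy score. Everything below is hypothetical: two of the seven factors, made-up weights, purely to show how the intensity gap alone can flip the routing:

```typescript
// Illustrative only: calibration ratio times a weighted sum of a size factor
// and an intensity factor (intensity relative to the device's operational
// intensity, capped at 1). Not the production 7-factor formula.
function toyDispatchScore(intensity: number, opIntensity: number,
                          sizeFactor: number, calibrationRatio: number): number {
  const intensityFactor = Math.min(intensity / opIntensity, 1);
  return calibrationRatio * (0.5 * sizeFactor + 2.0 * intensityFactor);
}

const gemmScore = toyDispatchScore(21.3, 55.5, 1.0, 0.41);  // 128×128 GEMM
const mulScore  = toyDispatchScore(0.125, 55.5, 1.0, 0.41); // element-wise multiply
// Same size factor, same device: the 170x intensity gap alone flips the decision
```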
Performance: GEMM on WebGPU vs. CPU
Benchmarked on square matrices of Float32 values:
| Dimension | Data size | FLOPs | GPU tiled (discrete) | GPU tiled (integrated) | CPU 8-thread | GPU vs. CPU (discrete) |
|---|---|---|---|---|---|---|
| 64×64 | 48 KB | 524K | 0.18 ms | 0.32 ms | 0.08 ms | 0.44x (CPU wins) |
| 128×128 | 192 KB | 4.2M | 0.30 ms | 0.51 ms | 0.35 ms | 1.17x |
| 256×256 | 768 KB | 33.6M | 0.52 ms | 1.24 ms | 2.4 ms | 4.6x |
| 512×512 | 3 MB | 268M | 1.8 ms | 5.1 ms | 18.2 ms | 10.1x |
| 1024×1024 | 12 MB | 2.1B | 8.4 ms | 28.3 ms | 142 ms | 16.9x |
| 2048×2048 | 48 MB | 17.2B | 52 ms | 198 ms | 1,140 ms | 21.9x |
At 64×64, the GPU is slower than the CPU. The dispatch overhead (0.03 ms) and transfer cost (0.01 ms) consume 22% of the total GPU time, and the 524K FLOPs are too few to exploit the GPU's parallelism. The calibrated threshold correctly routes 64×64 to the CPU on discrete hardware.
At 128×128, the GPU barely wins on discrete hardware (1.17x). On integrated hardware, the CPU wins. The threshold is at the boundary. The calibration ratio determines the decision per-device.
At 256×256 and above, the GPU dominates. The intensity is high enough that the GPU's compute advantage overwhelms all overhead. The speedup grows with dimension because the intensity grows linearly: 42.7 FLOPs/byte at 256, 85.3 at 512, 170.7 at 1024.
At 2048×2048 on a discrete GPU: 21.9x faster than 8-thread CPU. 52 ms versus 1.14 seconds. For real-time applications (neural network inference, PCA on live data, covariance matrix updates), this is the difference between interactive and batch.
Where GEMM appears in enterprise applications
Matrix multiplication is not an academic exercise. It is the core operation in multiple enterprise workloads.
Neural network inference. Every dense layer in a neural network is a matrix multiply: output = weights × input + bias. A small model with 3 dense layers of dimension 256 requires 3 GEMM operations per inference. At 0.52 ms per GEMM on the GPU: 1.56 ms total for the dense layers. On CPU: 7.2 ms. The GPU makes per-request inference viable at sub-10 ms latency.
Principal component analysis. PCA involves computing a covariance matrix (GEMM), then eigenvalue decomposition. The covariance matrix of 500 features from 100,000 observations: a 500×500 GEMM followed by eigensolving. The GEMM takes 1.2 ms on the GPU. On CPU: 12 ms. For interactive data exploration where PCA recomputes on every filter change, this is the difference between responsive and laggy.
Recommendation scoring. Collaborative filtering computes user-item scores as a matrix product of user embeddings × item embeddings. For 1,000 users × 10,000 items with embedding dimension 128: the GEMM is (1000×128) × (128×10000). The GPU handles this in approximately 3 ms. On CPU: approximately 40 ms.
Portfolio optimization. Mean-variance optimization requires covariance matrix computation and quadratic program solving, both GEMM-heavy. The precision analyser evaluates whether Float32 is safe. If the covariance matrix is well-conditioned (κ < 1,000), the GEMM runs on the GPU. If ill-conditioned, the Safety Guard routes to CPU Float64. The dispatch decision accounts for both performance and correctness.
The roofline model for WebGPU
The roofline model visualizes the relationship between arithmetic intensity and achievable performance. The "roof" is defined by two lines: the memory bandwidth ceiling (a diagonal line for memory-bound operations) and the compute throughput ceiling (a horizontal line for compute-bound operations). The intersection is the hardware's operational intensity.
For the RTX 4060:
Memory bandwidth ceiling: 272 GB/s = 272 GFLOPS at 1 FLOP/byte
Compute ceiling: 15,110 GFLOPS
Operational intensity: 15,110 / 272 = 55.5 FLOPs/byte
Operations below 55.5 FLOPs/byte are under the diagonal (memory-bound). Their performance scales with bandwidth, not compute. Moving from 0.125 to 0.25 FLOPs/byte doubles throughput (more work per byte), but the absolute throughput is still a fraction of peak.
Operations above 55.5 FLOPs/byte sit under the horizontal ceiling (compute-bound). Their performance is capped by the ALUs, and a well-tuned kernel approaches peak regardless of further intensity increases. A GEMM at 85 FLOPs/byte (512×512) has the same ceiling as a GEMM at 170 FLOPs/byte (1024×1024). Both are compute-saturated.
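The roofline reduces to a single expression: attainable throughput is the lesser of the compute ceiling and intensity times bandwidth. A sketch using the RTX 4060 numbers:

```typescript
// Roofline: attainable GFLOPS for a given arithmetic intensity (FLOPs/byte).
function attainableGFLOPS(intensity: number, peakGFLOPS: number,
                          bandwidthGBs: number): number {
  return Math.min(peakGFLOPS, intensity * bandwidthGBs);
}

const fma  = attainableGFLOPS(0.25, 15110, 272);  // memory-bound: 68 GFLOPS
const gemm = attainableGFLOPS(170.7, 15110, 272); // compute-bound: 15,110 GFLOPS
```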
Our dispatch scoring function encodes this roofline implicitly. The arithmetic intensity factor in the score determines whether the GPU's compute advantage or bandwidth advantage (or neither) applies. The calibration ratio adjusts the roofline for each device's actual bandwidth and compute throughput.
The principle
Arithmetic intensity is the single most important metric for GPU dispatch decisions. It determines whether the GPU is compute-bound (most of its cores are productive) or memory-bound (most of its cores are idle, waiting for data).
GEMM has intensity that grows with the problem size. Element-wise operations have fixed, low intensity regardless of size. This is why a 128×128 GEMM (192 KB of data) justifies GPU dispatch but a 192 KB element-wise operation does not. The data volume is identical. The compute density is 170x different.
Our adaptive dispatch engine captures this through the 7-factor scoring function. Intensity is not the only factor (size-vs-threshold, sparsity penalty, precision risk, GPU class bonus, vendor tuning, and fusion bonus all matter), but it is the factor that produces the largest dispatch threshold differences between operation types.
This is the analytical foundation of our enterprise AI automation infrastructure. We do not dispatch to the GPU because the GPU is fast. We dispatch to the GPU when the operation's arithmetic intensity is high enough to utilize the GPU's compute capacity. When it is not, we use the CPU, which is better matched to memory-bound work. The intensity metric makes this decision quantitative, not heuristic.