The metric that determines whether the GPU helps
Not all computation benefits equally from the GPU. The difference between a 50x speedup and a 2x slowdown comes down to one number: arithmetic intensity.
Arithmetic intensity is the ratio of floating-point operations (FLOPs) performed to bytes transferred from memory:
arithmeticIntensity = FLOPs / bytes
It measures how much useful work the processor does for each byte it reads. High intensity means the processor spends most of its time computing. Low intensity means it spends most of its time waiting for data.
GPUs have massive compute throughput (10,000+ GFLOPS on a discrete GPU) but moderate memory bandwidth (200 to 400 GB/s). The ratio of peak throughput to peak bandwidth defines the hardware's operational intensity ceiling:
operationalIntensity = peakGFLOPS / peakBandwidth(GB/s)
For an NVIDIA RTX 4060: 15,110 GFLOPS / 272 GB/s = 55.5 FLOPs/byte.
This means: if your algorithm performs fewer than 55.5 FLOPs per byte of data it reads, the GPU is waiting for memory. It cannot compute fast enough to keep its ALUs busy. The memory bus is the bottleneck. You are memory-bound.
If your algorithm performs more than 55.5 FLOPs per byte, the ALUs are the bottleneck. Memory delivers data faster than the cores can process it. You are compute-bound. This is where the GPU's thousands of cores provide maximum value.
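Both ratios are trivial to compute. A small TypeScript sketch (device figures are the RTX 4060 numbers used throughout; function names are ours):

```typescript
// Arithmetic intensity of a workload vs. a device's operational intensity.
interface Device { peakGFLOPS: number; bandwidthGBs: number; }
const rtx4060: Device = { peakGFLOPS: 15110, bandwidthGBs: 272 };

// FLOPs performed per byte moved to or from memory
function arithmeticIntensity(flops: number, bytes: number): number {
  return flops / bytes;
}

// The hardware ceiling: FLOPs the device can issue per byte it can deliver
function operationalIntensity(dev: Device): number {
  return dev.peakGFLOPS / dev.bandwidthGBs;
}

// Below the ceiling: memory-bound. Above it: compute-bound.
function isComputeBound(intensity: number, dev: Device): boolean {
  return intensity > operationalIntensity(dev);
}
```

An element-wise FMA at 0.25 FLOPs/byte comes out memory-bound; a large GEMM at 170.7 FLOPs/byte comes out compute-bound.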
Element-wise operations: memory-bound
An element-wise operation (add, multiply, compare) performs one operation per element. Each element is a Float32 (4 bytes). The intensity:
FLOPs: 1 per element
Bytes: 4 per element (read) + 4 per element (write) = 8
Intensity: 1 / 8 = 0.125 FLOPs/byte
For a fused multiply-add (a * x + b):
FLOPs: 2 per element
Bytes: 8 (same read + write)
Intensity: 2 / 8 = 0.25 FLOPs/byte
0.25 FLOPs/byte versus the RTX 4060's operational intensity of 55.5 FLOPs/byte. The algorithm uses 0.45% of the GPU's compute capacity. The other 99.55% sits idle, waiting for the memory controller to deliver the next byte.
This is why element-wise operations on small datasets are faster on the CPU. The CPU's memory bandwidth (40 to 60 GB/s for DDR5) is lower than the GPU's, but the CPU's compute-to-bandwidth ratio is also lower (1 to 2 FLOPs/byte for a single core). The CPU is better matched to memory-bound work. It wastes less of its compute capacity.
Element-wise operations only justify GPU dispatch when the dataset is large enough for the GPU's raw memory bandwidth advantage (272 GB/s vs. 50 GB/s) to overcome the fixed dispatch overhead and transfer cost. On a discrete GPU, that crossover is approximately 1,000,000 elements (4 MB). Below that, the CPU wins.
Matrix multiplication: compute-bound
GEMM (General Matrix Multiply) computes C = A × B, where A is M×K, B is K×N, and C is M×N.
For the general (non-square) case, the arithmetic intensity is:
FLOPs: 2 * M * N * K
Bytes: (M*K + K*N + M*N) * 4 (read A, read B, write C, all Float32)
Intensity: 2MNK / ((MK + KN + MN) * 4) FLOPs/byte
For square matrices (M = K = N = n), this simplifies to:
FLOPs: 2 * n^3 (n^3 multiplications + n^3 additions)
Bytes: 3 * n^2 * 4 = 12 * n^2 (read A, read B, write C, each n^2 Float32 elements)
Intensity: 2n^3 / (12n^2) = n/6 FLOPs/byte
The intensity scales linearly with matrix dimension. At n=128: 128/6 = 21.3 FLOPs/byte. At n=256: 42.7 FLOPs/byte. At n=512: 85.3 FLOPs/byte. At n=1024: 170.7 FLOPs/byte.
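The general and square-case formulas can be verified with a few lines (a sketch; function names are ours):

```typescript
// Arithmetic intensity of C = A×B with A M×K, B K×N, C M×N, all Float32.
function gemmIntensity(m: number, n: number, k: number): number {
  const flops = 2 * m * n * k;               // one multiply + one add per output term
  const bytes = (m * k + k * n + m * n) * 4; // read A, read B, write C
  return flops / bytes;
}

// Square case reduces to 2n^3 / (12n^2) = n/6
function squareGemmIntensity(n: number): number {
  return gemmIntensity(n, n, n);
}
```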
By n=256, the arithmetic intensity (42.7) approaches the RTX 4060's operational ceiling (55.5). The GPU's ALUs are nearly fully utilized. The memory bus delivers data fast enough to keep them busy.
This is why GEMM is the canonical GPU workload. The O(n^3) computation on O(n^2) data creates an intensity that grows with the problem size. Larger matrices are not just "more work." They are more efficient work. Each byte of data produces more compute operations, keeping the GPU's ALUs busier.
Comparing intensity at the dispatch boundary
| Operation | Data size | FLOPs | Bytes | Intensity | GPU utilization (RTX 4060) |
|---|---|---|---|---|---|
| Element-wise multiply (1M elements) | 4 MB | 1,000,000 | 8,000,000 | 0.125 | 0.23% |
| Element-wise FMA (1M elements) | 4 MB | 2,000,000 | 8,000,000 | 0.25 | 0.45% |
| Reduction SUM (1M elements) | 4 MB | 1,000,000 | 4,000,000 | 0.25 | 0.45% |
| Radix sort pass (1M elements) | 4 MB | ~3,000,000 | 8,000,000 | 0.375 | 0.68% |
| GEMM 128×128 | 192 KB | 4,194,304 | 196,608 | 21.3 | 38.4% |
| GEMM 256×256 | 768 KB | 33,554,432 | 786,432 | 42.7 | 76.9% |
| GEMM 512×512 | 3 MB | 268,435,456 | 3,145,728 | 85.3 | 100% (compute-bound) |
| GEMM 1024×1024 | 12 MB | 2,147,483,648 | 12,582,912 | 170.7 | 100% (fully saturated) |
At 128×128 (192 KB of data across three matrices, 4 million FLOPs), GEMM achieves 38% GPU utilization. At the same data volume, an element-wise operation achieves 0.23%. GEMM is 167x more efficient per byte.
This is why our dispatch engine sends a 128×128 matrix multiply to the GPU but keeps a 16,000-element (64 KB) element-wise operation on the CPU. Same data size. Completely different intensity. Completely different dispatch decision.
How GEMM executes on WebGPU
The naive kernel
The textbook GEMM: each thread computes one element of C by iterating over the shared dimension k.
```wgsl
// Bindings and dimensions added for completeness; the override constants
// M, N, K are set at pipeline creation.
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;
override M: u32;
override N: u32;
override K: u32;

@compute @workgroup_size(16, 16)
fn gemm_naive(@builtin(global_invocation_id) gid: vec3<u32>) {
    let row = gid.y;
    let col = gid.x;
    if (row >= M || col >= N) { return; }
    var sum: f32 = 0.0;
    for (var i: u32 = 0u; i < K; i++) {
        sum += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
}
```
Each thread reads one full row of A and one full column of B from global memory. For n=512, each thread performs 512 multiply-adds but reads 1,024 Float32 values (4,096 bytes) from global memory. That is 1,024 FLOPs / 4,096 bytes = 0.25 FLOPs/byte per thread.
Wait. The total intensity is n/6 = 85.3 FLOPs/byte. But per thread, it is 0.25. The discrepancy: every thread reads the same data. Thread (0,0) reads row 0 of A. Thread (0,1) also reads row 0 of A. Across all n threads in that row, row 0 of A is read n times. The global memory bandwidth is consumed n times for data that could be cached.
The naive kernel achieves 3% to 8% of the GPU's peak GFLOPS. The arithmetic intensity exists in the algorithm but is not realized in the implementation because the data reuse pattern is not exploited.
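Either kernel is launched from the host with a ceil-division grid so every output element is covered. A minimal sizing helper (TypeScript; the helper names are ours, mirroring the `@workgroup_size(16, 16)` kernels where `gid.x` indexes columns and `gid.y` indexes rows):

```typescript
// Workgroup counts needed to cover an M×N output with 16×16 workgroups.
function dispatchGrid(m: number, n: number, tile = 16): [number, number] {
  const ceilDiv = (a: number, b: number) => Math.floor((a + b - 1) / b);
  return [ceilDiv(n, tile), ceilDiv(m, tile)]; // [x, y] workgroup counts
}
// e.g. pass.dispatchWorkgroups(...dispatchGrid(M, N)) on a GPUComputePassEncoder
```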
The tiled kernel: exploiting shared memory
The tiled GEMM loads blocks of A and B into workgroup shared memory, where all threads in the workgroup can access them at register-like speed. Each element is loaded once from global memory and reused by every thread that needs it.
```wgsl
// Bindings and dimensions added for completeness; the override constants
// M, N, K are set at pipeline creation.
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;
override M: u32;
override N: u32;
override K: u32;

const TILE: u32 = 16u;
var<workgroup> tileA: array<array<f32, 16>, 16>;
var<workgroup> tileB: array<array<f32, 16>, 16>;

@compute @workgroup_size(16, 16)
fn gemm_tiled(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>
) {
    let row = wid.y * TILE + lid.y;
    let col = wid.x * TILE + lid.x;
    var sum: f32 = 0.0;
    let numTiles = (K + TILE - 1u) / TILE;

    for (var t: u32 = 0u; t < numTiles; t++) {
        // Cooperatively load one tile of A and one tile of B into shared memory
        let aCol = t * TILE + lid.x;
        let bRow = t * TILE + lid.y;
        if (row < M && aCol < K) {
            tileA[lid.y][lid.x] = A[row * K + aCol];
        } else {
            tileA[lid.y][lid.x] = 0.0;
        }
        if (bRow < K && col < N) {
            tileB[lid.y][lid.x] = B[bRow * N + col];
        } else {
            tileB[lid.y][lid.x] = 0.0;
        }
        workgroupBarrier();

        // Each thread computes its partial sum using the shared tile
        for (var i: u32 = 0u; i < TILE; i++) {
            sum += tileA[lid.y][i] * tileB[i][lid.x];
        }
        workgroupBarrier();
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}
```
Each workgroup is 16×16 threads (256 threads). Each tile is 16×16 elements (1 KB for Float32). Two tiles (A and B) occupy 2 KB of shared memory. Well within the 16 KB workgroup shared memory limit.
The data reuse: each element of tileA is used by 16 threads (every thread in the tile's column). Each element of tileB is used by 16 threads (every row). The reuse factor is 16x. Each global memory load serves 16 multiply-add operations instead of 1.
For n=512 with 16×16 tiles: the algorithm processes 512/16 = 32 tiles along the k dimension. Each tile pair requires loading 16×16 = 256 elements of A and 256 of B from global memory (2,048 bytes total). Each tile produces 16×16×16 = 4,096 multiply-adds (8,192 FLOPs). Effective intensity: 8,192 / 2,048 = 4.0 FLOPs/byte per tile load.
That is 16x better than the naive kernel (0.25 FLOPs/byte per thread). With 16×16 tiles the realized intensity is still well short of the theoretical n/6 = 85.3 FLOPs/byte; larger tiles and per-thread register blocking raise the reuse factor further and close most of the remaining gap. Measured GPU utilization: 75% to 90% of peak GFLOPS on a discrete GPU.
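The per-tile figure generalizes: a TILE×TILE tile pair costs 2·TILE² Float32 loads from global memory and yields TILE³ multiply-adds, so the effective intensity per global load is TILE/4 FLOPs/byte. This is why larger tiles help when shared memory permits. A sketch:

```typescript
// Effective FLOPs per global-memory byte for one TILE×TILE tile pair.
function tiledIntensityPerLoad(tile: number): number {
  const flops = 2 * tile ** 3;     // TILE^3 multiply-adds = 2·TILE^3 FLOPs
  const bytes = 2 * tile ** 2 * 4; // one tile of A + one tile of B, Float32
  return flops / bytes;            // simplifies to TILE / 4
}
```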
Shared memory access patterns
Within the tile, the 16 threads in a row read consecutive elements of tileA from shared memory. This is a broadcast pattern: all threads in the row need element tileA[lid.y][i] for the same i. Shared memory handles broadcasts in a single cycle (bank broadcast). No bank conflicts. Full throughput.
The tileB access pattern has each thread reading tileB[i][lid.x]. Threads in the same row read different columns of tileB. On GPUs with 32-bank shared memory (NVIDIA, AMD), the 16 threads access 16 different banks (assuming 4-byte elements and consecutive layout). No bank conflicts. Full throughput.
The tiling transforms GEMM from a global-memory-bound disaster into a shared-memory-fed compute engine that runs at near-peak ALU utilization.
How intensity drives dispatch decisions
Our 7-factor scoring function includes arithmetic intensity as one of its dispatch factors. The intensity determines how the dispatch threshold scales with data size.
Memory-bound operations (intensity < 1 FLOP/byte)
For element-wise, filter, and reduction operations, the GPU's advantage comes from memory bandwidth, not compute throughput. The GPU must read and write every byte. The question is whether the GPU's memory bandwidth (272 GB/s discrete, 50 to 100 GB/s integrated) outweighs the transfer overhead and dispatch cost.
At 0.25 FLOPs/byte, the GPU processes data at its memory bandwidth ceiling. The effective speedup over CPU is the ratio of GPU bandwidth to CPU bandwidth: roughly 5x on discrete (272 / 50), 2x on integrated (50 / 25). After accounting for dispatch overhead (0.03 to 1.1 ms) and transfer cost (0.2 ms per 4 MB on PCIe), the break-even is:
breakEvenBytes = (dispatchOverhead + transferTime) / (1/cpuBandwidth - 1/gpuBandwidth)
The result is in bytes; dividing by the 8 bytes moved per element (4 read + 4 written) gives the break-even element count.
For a discrete GPU: approximately 800,000 to 1,200,000 elements. Our calibrated threshold reflects this per-device.
For an integrated GPU: approximately 2,000,000 to 4,000,000 elements. The bandwidth ratio is smaller, so the dataset must be larger to amortize the overhead.
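In consistent units the break-even comes out in bytes, then divides by the 8 bytes moved per element. A sketch, with the overhead value an assumption drawn from the ranges quoted above:

```typescript
// Break-even element count for a memory-bound element-wise op.
// Approximation: dispatch + transfer overhead is treated as a fixed cost,
// though transfer time actually grows with the data size.
function breakEvenElements(fixedOverheadMs: number, cpuGBs: number,
                           gpuGBs: number, bytesPerElement = 8): number {
  const perByteSavingsSec = 1 / (cpuGBs * 1e9) - 1 / (gpuGBs * 1e9);
  const breakEvenBytes = (fixedOverheadMs * 1e-3) / perByteSavingsSec;
  return breakEvenBytes / bytesPerElement;
}

// Discrete GPU, ~0.13 ms combined overhead (assumed): roughly one million elements
const discrete = breakEvenElements(0.13, 50, 272);
```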
Compute-bound operations (intensity > 10 FLOPs/byte)
For GEMM, convolution, and polynomial evaluation, the GPU's advantage comes from raw compute throughput. The data transfer is a small fraction of the total time because the GPU spends most of its time computing, not transferring.
At 21.3 FLOPs/byte (128×128 GEMM), the total data is 192 KB. Transfer time on PCIe 4.0: 0.01 ms. Dispatch overhead: 0.03 ms. Compute time on GPU: 0.26 ms. Total GPU: 0.30 ms.
CPU compute for 128×128 GEMM: 2.1 ms (single-threaded) or 0.35 ms (8-thread BLAS). The GPU wins even against a multi-threaded CPU at just 128×128.
The break-even for compute-bound operations is determined by the compute ratio (GPU GFLOPS / CPU GFLOPS), not the bandwidth ratio. For a discrete GPU with 15,110 GFLOPS versus a CPU with 200 GFLOPS (4 cores at 50 GFLOPS each): 75x compute advantage. The fixed overhead (0.04 ms) is amortized at:
breakEvenFLOPs ≈ dispatchOverhead * effectiveGPUThroughput = 0.04 ms * ~16 GFLOPS ≈ 640,000 FLOPs
The effective throughput matters here, not the peak: at this size the GPU sustains roughly 16 GFLOPS (4,194,304 FLOPs in 0.26 ms), far below its 15,110 GFLOPS peak. A 128×128 GEMM performs 4,194,304 FLOPs, about 7x above break-even. GPU dispatch is decisively correct.
The scoring formula integration
The arithmetic intensity factor enters the dispatch score as one of seven factors in the scoring formula. The seven factors are: size-vs-threshold, sparsity penalty, precision risk, GPU class bonus, vendor tuning, fusion bonus, and arithmetic intensity. Together they produce a composite score that determines whether an operation routes to CPU, Web Workers, or GPU.
For a GEMM at 128×128 (cardinality = 16,384 output elements, intensity = 21.3): the high arithmetic intensity drives a strong GPU score. Even with a moderate hardware calibration ratio (integrated GPU), the score exceeds 1.0. GPU dispatch.
For an element-wise multiply at 100,000 elements (cardinality = 100,000, intensity = 0.125): the low intensity produces a weak score. On an integrated GPU with calibration ratio 0.41, the score falls below 0.3. CPU dispatch.
Despite both being modest in data volume, these operations produce opposite dispatch decisions because the intensity differs by 170x. The scoring function captures this automatically.
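The production formula is not reproduced here, but the shape of the decision can be illustrated with a toy score. Everything below is hypothetical: two of the seven factors, made-up weights, purely to show how the intensity gap alone can flip the routing:

```typescript
// Illustrative only: calibration ratio times a weighted sum of a size factor
// and an intensity factor (intensity relative to the device's operational
// intensity, capped at 1). Not the production 7-factor formula.
function toyDispatchScore(intensity: number, opIntensity: number,
                          sizeFactor: number, calibrationRatio: number): number {
  const intensityFactor = Math.min(intensity / opIntensity, 1);
  return calibrationRatio * (0.5 * sizeFactor + 2.0 * intensityFactor);
}

const gemmScore = toyDispatchScore(21.3, 55.5, 1.0, 0.41);  // 128×128 GEMM
const mulScore  = toyDispatchScore(0.125, 55.5, 1.0, 0.41); // element-wise multiply
// Same size factor, same device: the 170x intensity gap alone flips the decision
```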
Performance: GEMM on WebGPU vs. CPU
Benchmarked on square matrices of Float32 values:
| Dimension | Data size | FLOPs | GPU tiled (discrete) | GPU tiled (integrated) | CPU 8-thread | GPU vs. CPU (discrete) |
|---|---|---|---|---|---|---|
| 64×64 | 48 KB | 524K | 0.18 ms | 0.32 ms | 0.08 ms | 0.44x (CPU wins) |
| 128×128 | 192 KB | 4.2M | 0.30 ms | 0.51 ms | 0.35 ms | 1.17x |
| 256×256 | 768 KB | 33.6M | 0.52 ms | 1.24 ms | 2.4 ms | 4.6x |
| 512×512 | 3 MB | 268M | 1.8 ms | 5.1 ms | 18.2 ms | 10.1x |
| 1024×1024 | 12 MB | 2.1B | 8.4 ms | 28.3 ms | 142 ms | 16.9x |
| 2048×2048 | 48 MB | 17.2B | 52 ms | 198 ms | 1,140 ms | 21.9x |
At 64×64, the GPU is slower than the CPU. The dispatch overhead (0.03 ms) and transfer cost (0.01 ms) consume 22% of the total GPU time, and the 524K FLOPs are too few to exploit the GPU's parallelism. The calibrated threshold correctly routes 64×64 to the CPU on discrete hardware.
At 128×128, the GPU barely wins on discrete hardware (1.17x). On integrated hardware, the CPU wins. The threshold is at the boundary. The calibration ratio determines the decision per-device.
At 256×256 and above, the GPU dominates. The intensity is high enough that the GPU's compute advantage overwhelms all overhead. The speedup grows with dimension because the intensity grows linearly: 42.7 FLOPs/byte at 256, 85.3 at 512, 170.7 at 1024.
At 2048×2048 on a discrete GPU: 21.9x faster than 8-thread CPU. 52 ms versus 1.14 seconds. For real-time applications (neural network inference, PCA on live data, covariance matrix updates), this is the difference between interactive and batch.
Where GEMM appears in enterprise applications
Matrix multiplication is not an academic exercise. It is the core operation in multiple enterprise workloads.
Neural network inference. Every dense layer in a neural network is a matrix multiply: output = weights × input + bias. A small model with 3 dense layers of dimension 256 requires 3 GEMM operations per inference. At 0.52 ms per GEMM on the GPU: 1.56 ms total for the dense layers. On CPU: 7.2 ms. The GPU makes per-request inference viable at sub-10 ms latency.
Principal component analysis. PCA involves computing a covariance matrix (GEMM), then eigenvalue decomposition. The covariance matrix of 500 features from 100,000 observations: a 500×500 GEMM followed by eigensolving. The GEMM takes 1.2 ms on the GPU. On CPU: 12 ms. For interactive data exploration where PCA recomputes on every filter change, this is the difference between responsive and laggy.
Recommendation scoring. Collaborative filtering computes user-item scores as a matrix product of user embeddings × item embeddings. For 1,000 users × 10,000 items with embedding dimension 128: the GEMM is (1000×128) × (128×10000). The GPU handles this in approximately 3 ms. On CPU: approximately 40 ms.
Portfolio optimization. Mean-variance optimization requires covariance matrix computation and quadratic program solving, both GEMM-heavy. The precision analyser evaluates whether Float32 is safe. If the covariance matrix is well-conditioned (κ < 1,000), the GEMM runs on the GPU. If ill-conditioned, the Safety Guard routes to CPU Float64. The dispatch decision accounts for both performance and correctness.
The roofline model for WebGPU
The roofline model visualizes the relationship between arithmetic intensity and achievable performance. The "roof" is defined by two lines: the memory bandwidth ceiling (a diagonal line for memory-bound operations) and the compute throughput ceiling (a horizontal line for compute-bound operations). The intersection is the hardware's operational intensity.
For the RTX 4060:
Memory bandwidth ceiling: 272 GB/s = 272 GFLOPS at 1 FLOP/byte
Compute ceiling: 15,110 GFLOPS
Operational intensity: 15,110 / 272 = 55.5 FLOPs/byte
Operations below 55.5 FLOPs/byte are under the diagonal (memory-bound). Their performance scales with bandwidth, not compute. Moving from 0.125 to 0.25 FLOPs/byte doubles throughput (more work per byte), but the absolute throughput is still a fraction of peak.
Operations above 55.5 FLOPs/byte sit under the horizontal ceiling (compute-bound). Their performance is capped by the ALUs, and a well-tuned kernel approaches peak regardless of further intensity increases. A GEMM at 85 FLOPs/byte (512×512) has the same ceiling as a GEMM at 170 FLOPs/byte (1024×1024). Both are compute-saturated.
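The roofline reduces to a single expression: attainable throughput is the lesser of the compute ceiling and intensity times bandwidth. A sketch using the RTX 4060 numbers:

```typescript
// Roofline: attainable GFLOPS for a given arithmetic intensity (FLOPs/byte).
function attainableGFLOPS(intensity: number, peakGFLOPS: number,
                          bandwidthGBs: number): number {
  return Math.min(peakGFLOPS, intensity * bandwidthGBs);
}

const fma  = attainableGFLOPS(0.25, 15110, 272);  // memory-bound: 68 GFLOPS
const gemm = attainableGFLOPS(170.7, 15110, 272); // compute-bound: 15,110 GFLOPS
```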
Our dispatch scoring function encodes this roofline implicitly. The arithmetic intensity factor in the score determines whether the GPU's compute advantage or bandwidth advantage (or neither) applies. The calibration ratio adjusts the roofline for each device's actual bandwidth and compute throughput.
The principle
Arithmetic intensity is the single most important metric for GPU dispatch decisions. It determines whether the GPU is compute-bound (most of its cores are productive) or memory-bound (most of its cores are idle, waiting for data).
GEMM has intensity that grows with the problem size. Element-wise operations have fixed, low intensity regardless of size. This is why a 128×128 GEMM (192 KB of data) justifies GPU dispatch but a 192 KB element-wise operation does not. The data volume is identical. The compute density is 170x different.
Our adaptive dispatch engine captures this through the 7-factor scoring function. Intensity is not the only factor (size-vs-threshold, sparsity penalty, precision risk, GPU class bonus, vendor tuning, and fusion bonus all matter), but it is the factor that produces the largest dispatch threshold differences between operation types.
This is the analytical foundation of our enterprise AI automation infrastructure. We do not dispatch to the GPU because the GPU is fast. We dispatch to the GPU when the operation's arithmetic intensity is high enough to utilize the GPU's compute capacity. When it is not, we use the CPU, which is better matched to memory-bound work. The intensity metric makes this decision quantitative, not heuristic.