Ayoob AI

Eliminating PCIe Bus Bottlenecks in Enterprise AI Compliance Tools

Tags: WebGPU · Pipeline Fusion · Compliance · Performance · Enterprise

The transfer problem nobody measures

GPU compute is fast. PCIe transfers are not.

A WebGPU compute shader processing 500,000 records through a filter operation takes 1.1 ms on a discrete GPU. Writing those 500,000 records to the GPU via device.queue.writeBuffer() takes 0.25 ms. Reading the results back via mapAsync() takes another 0.25 ms.

For a single operation, the transfer overhead is 31% of total time (0.5 ms transfer, 1.1 ms compute). Tolerable. The GPU is still faster than the CPU alternative.

Now chain six operations. A compliance pipeline that filters records, classifies risk levels, groups by category, aggregates exposure values, sorts by severity, and extracts the top violations. If each operation independently uploads its input and reads back its output, you pay 12 transfers: 6 uploads and 6 readbacks.

At 0.25 ms each, that is 3.0 ms of pure data movement. The compute across all six operations totals 4.8 ms. The transfer overhead is 38% of the pipeline's wall-clock time. You are spending more than a third of your time copying bytes across a bus instead of processing them.
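The overhead arithmetic can be reproduced in a few lines of JavaScript (these are the stylized per-transfer and per-operation figures from the example above, not measurements):

```javascript
// Stylized figures from the example above, not measurements.
const TRANSFER_MS = 0.25; // one direction, per transfer

// Single operation: one upload + one readback around 1.1 ms of compute.
const singleShare = (2 * TRANSFER_MS) / (2 * TRANSFER_MS + 1.1);

// Six chained operations, each independently uploading and reading back.
const chainedTransferMs = 12 * TRANSFER_MS; // 6 uploads + 6 readbacks
const chainedShare = chainedTransferMs / (chainedTransferMs + 4.8);

console.log(singleShare.toFixed(2), chainedShare.toFixed(2)); // "0.31 0.38"
```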

On integrated GPUs (Intel UHD, Apple Silicon) where CPU and GPU share system memory, the raw bandwidth penalty is lower. But the driver still performs format validation, cache flushing, and synchronization on every transfer. Measured overhead: 0.08 to 0.15 ms per transfer even with shared memory. For 12 transfers, that is still 1.0 to 1.8 ms of overhead.

This is the bottleneck that standard GPU compute libraries ignore. They optimize the shader. They ignore the bus.

How standard GPU pipelines waste bandwidth

A typical WebGPU compute pipeline for a single operation looks like this:

// Step 1: Upload input to GPU
const inputBuffer = device.createBuffer({
  size: dataSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(inputBuffer, 0, inputData);

// Step 2: Create output buffer, plus a staging buffer for readback
// (MAP_READ cannot be combined with STORAGE, so the storage buffer
// cannot be mapped directly; it must be copied to a mappable buffer)
const outputBuffer = device.createBuffer({
  size: outputSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
const stagingBuffer = device.createBuffer({
  size: outputSize,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});

// Step 3: Dispatch compute shader, then copy the output into staging
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(workgroupCount);
pass.end();
encoder.copyBufferToBuffer(outputBuffer, 0, stagingBuffer, 0, outputSize);
device.queue.submit([encoder.finish()]);

// Step 4: Read results back to CPU
await stagingBuffer.mapAsync(GPUMapMode.READ);
const result = new Float32Array(stagingBuffer.getMappedRange().slice(0));
stagingBuffer.unmap();

Four steps. Two of them (steps 1 and 4) are transfers. For a single operation, this is the correct pattern.

When you chain two operations, the naive approach repeats the entire pattern:

// Operation 1: Filter
writeBuffer(filterInput);         // Transfer 1: CPU -> GPU
dispatch(filterShader);
const filtered = readBuffer();    // Transfer 2: GPU -> CPU

// Operation 2: Aggregate
writeBuffer(filtered);            // Transfer 3: CPU -> GPU  (same data!)
dispatch(aggregateShader);
const result = readBuffer();      // Transfer 4: GPU -> CPU

Transfer 3 uploads the exact data that Transfer 2 just downloaded. The data crossed the bus twice for no reason. It was on the GPU. You pulled it to the CPU. Then you pushed it back to the GPU.

For N chained operations, this pattern produces 2N transfers. The intermediate results bounce between CPU and GPU memory at every step.

Our Pipeline Fusion Engine

The Pipeline Fusion Engine eliminates intermediate transfers. When consecutive operators in a query pipeline are both routed to the GPU by the 7-factor scoring function, the engine keeps the intermediate result in a GPU storage buffer. The output buffer of operation K becomes the input buffer of operation K+1. No mapAsync. No writeBuffer. No bus traversal.

How fusion works

The engine analyzes the query execution plan (a directed acyclic graph of operators) and identifies fusible segments: maximal sequences of consecutive operators that are all GPU-routed.
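As a sketch, segment identification over a linear plan looks like this (the DAG case adds branch handling, which is omitted here, and the `{ name, route }` operator shape is illustrative, not the engine's actual data model):

```javascript
// Find maximal runs of consecutive GPU-routed operators in a linear plan.
// Each run becomes one fused segment with a single upload and readback.
function findFusibleSegments(ops) {
  const segments = [];
  let current = [];
  for (const op of ops) {
    if (op.route === "gpu") {
      current.push(op.name);
    } else if (current.length > 0) {
      segments.push(current); // a CPU operator ends the segment
      current = [];
    }
  }
  if (current.length > 0) segments.push(current);
  return segments;
}
```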

For a 6-operation pipeline where operators 1 through 4 are GPU-routed and operators 5 and 6 are CPU-routed:

Without fusion (standard approach):

CPU -> [upload] -> GPU op1 -> [readback] -> CPU
CPU -> [upload] -> GPU op2 -> [readback] -> CPU
CPU -> [upload] -> GPU op3 -> [readback] -> CPU
CPU -> [upload] -> GPU op4 -> [readback] -> CPU
CPU op5
CPU op6

Transfers: 8 (4 uploads + 4 readbacks)

With fusion:

CPU -> [upload] -> GPU op1 -> GPU op2 -> GPU op3 -> GPU op4 -> [readback] -> CPU
CPU op5
CPU op6

Transfers: 2 (1 upload + 1 readback)

As defined in our patent, the total data movement count drops from 2N to N+1 for a pipeline of N GPU operations: one initial upload, N-1 internal GPU buffer operations (ping-pong swaps that stay on the GPU), and one final readback. The critical metric is PCIe bus transfers, which drop from 2N to just 2 (one upload, one readback). For the 4-operation GPU segment above: 8 PCIe transfers become 2.
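The counting argument can be written down directly (a sketch of the formulas above, with n the number of GPU-routed operations in the segment):

```javascript
// Transfer and data-movement counts for a fused segment of n GPU operations.
function movementCounts(n) {
  return {
    unfusedPcie: 2 * n,                // one upload + one readback per op
    fusedPcie: n > 0 ? 2 : 0,          // one upload, one readback, total
    fusedMovements: n > 0 ? n + 1 : 0, // upload + (n-1) ping-pong swaps + readback
  };
}
```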

Buffer management

The engine maintains a pool of GPU storage buffers sized to the maximum intermediate result. When a fused segment begins, the engine allocates (or reuses from the pool) two buffers: a read buffer and a write buffer. Each operation reads from one and writes to the other. After each dispatch, the buffers swap roles (ping-pong pattern).

let readBuffer = bufferPool.acquire(maxSize);
let writeBuffer = bufferPool.acquire(maxSize);

// Initial upload to readBuffer
device.queue.writeBuffer(readBuffer, 0, inputData);

for (const op of fusedSegment) {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(op.pipeline);
  pass.setBindGroup(0, createBindGroup(readBuffer, writeBuffer));
  pass.dispatchWorkgroups(op.workgroups);
  pass.end();
  device.queue.submit([encoder.finish()]);

  // Swap: output becomes next input
  [readBuffer, writeBuffer] = [writeBuffer, readBuffer];
}

// Final readback: after the last swap, readBuffer holds the last output.
// Storage buffers cannot carry MAP_READ, so copy once into a mappable
// staging buffer before mapping.
const staging = device.createBuffer({
  size: maxSize,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
const copyEncoder = device.createCommandEncoder();
copyEncoder.copyBufferToBuffer(readBuffer, 0, staging, 0, maxSize);
device.queue.submit([copyEncoder.finish()]);
await staging.mapAsync(GPUMapMode.READ);

The ping-pong swap is a pointer exchange, not a data copy. The GPU never moves bytes between the two buffers. It reads from one address range and writes to another, then the roles reverse.

Handling variable output sizes

Not every operation produces the same number of output elements as its input. A filter with 10% selectivity reduces 500,000 rows to 50,000. The output buffer for the filter is 10% the size of the input.

The engine handles this by writing an element count to a small metadata buffer alongside the data output. The subsequent operation reads the metadata to determine how many workgroups to dispatch. The storage buffers are allocated at maximum size (the input cardinality), but only the occupied portion is processed.

This wastes some GPU memory (the buffer is larger than needed after a selective filter), but avoids the alternative: reading the count back to the CPU to determine the next dispatch size, which would require a mapAsync that breaks the fusion chain.
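The sizing arithmetic itself is plain ceiling division. A sketch (WORKGROUP_SIZE is an assumed shader constant; in WebGPU, a GPU-written count like this can also feed dispatchWorkgroupsIndirect so the value never has to round-trip to the CPU):

```javascript
const WORKGROUP_SIZE = 256; // assumed threads per workgroup in the shader

// Enough workgroups to cover only the occupied portion of the
// maximum-size buffer, given the element count from the metadata buffer.
function workgroupsFor(elementCount) {
  return Math.ceil(elementCount / WORKGROUP_SIZE);
}
```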

The transfer cost model

To quantify the impact of fusion, you need to understand the transfer costs on real hardware.

PCIe discrete GPUs

On PCIe 4.0 x16 (the most common discrete GPU interface in enterprise hardware as of 2026), theoretical bandwidth is 31.5 GB/s in each direction. Effective bandwidth for WebGPU buffer writes, after driver overhead and IOMMU translation, is 20 to 25 GB/s.

Buffer size           | Upload time | Readback time | Round-trip
400 KB (100K Float32) | 0.02 ms     | 0.03 ms       | 0.05 ms
2 MB (500K Float32)   | 0.09 ms     | 0.12 ms       | 0.21 ms
4 MB (1M Float32)     | 0.18 ms     | 0.25 ms       | 0.43 ms
20 MB (5M Float32)    | 0.85 ms     | 1.10 ms       | 1.95 ms
40 MB (10M Float32)   | 1.70 ms     | 2.20 ms       | 3.90 ms

Readback is slower than upload because mapAsync() includes a GPU fence synchronization: the CPU must wait for all GPU commands to complete before the buffer can be mapped. Upload via writeBuffer() is fire-and-forget from the CPU's perspective (the driver queues it).

Integrated GPUs (shared memory)

On integrated GPUs (Intel UHD, AMD Radeon Graphics, Apple M-series), CPU and GPU share the same physical memory. There is no PCIe bus. But the WebGPU API still requires explicit buffer creation and data writes. The driver performs cache coherence operations (flushing CPU caches, invalidating GPU caches) on every transfer.

Buffer size | Upload time | Readback time | Round-trip
2 MB        | 0.04 ms     | 0.06 ms       | 0.10 ms
4 MB        | 0.07 ms     | 0.10 ms       | 0.17 ms
20 MB       | 0.30 ms     | 0.45 ms       | 0.75 ms

Lower than PCIe, but not zero. And the overhead is per-transfer, not per-byte. Even a 1 KB transfer incurs 0.02 to 0.04 ms of driver overhead. For a 12-transfer unfused pipeline, the fixed overhead alone is 0.24 to 0.48 ms.
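This implies a simple per-transfer cost model: a fixed driver overhead plus a bandwidth term. A sketch, with constants that are illustrative midpoints read off the tables above rather than measured values:

```javascript
// Illustrative cost model: fixed per-transfer driver overhead plus a
// bandwidth term. Constants approximate the PCIe 4.0 x16 tables above.
const FIXED_MS = 0.03;       // validation, cache maintenance, sync
const EFFECTIVE_GBPS = 22.5; // effective bandwidth after driver overhead

function transferMs(bytes) {
  const bytesPerMs = (EFFECTIVE_GBPS * 1e9) / 1000;
  return FIXED_MS + bytes / bytesPerMs;
}

// An unfused n-op pipeline pays 2n transfers; a fused segment pays 2.
function pipelineTransferMs(bytes, n, fused) {
  return (fused ? 2 : 2 * n) * transferMs(bytes);
}
```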

Fusion in practice: a compliance pipeline

Consider a Data Protection Impact Assessment (DPIA) generation pipeline. A financial institution must evaluate 500,000 customer records against GDPR data processing rules, classify risk levels, aggregate exposure by department and data category, and produce a ranked report.

The pipeline:

Step | Operation | Description
1    | Filter    | Records with PII in scope (has_pii = true AND processing_basis != 'contract')
2    | Classify  | Assign risk level based on data sensitivity score thresholds
3    | Filter    | High-risk and medium-risk records only
4    | GroupBy   | Group by department x data_category
5    | Aggregate | COUNT records, SUM exposure_value per group
6    | Sort      | By total exposure DESC

Scoring and routing

The 7-factor scoring function evaluates each operator:

Step                         | Input rows  | Score | Routed to
1. Filter (PII scope)        | 500,000     | 1.8   | GPU
2. Classify (risk level)     | ~175,000    | 1.4   | GPU
3. Filter (high + medium)    | ~175,000    | 1.3   | GPU
4. GroupBy (dept x category) | ~105,000    | 1.1   | GPU
5. Aggregate (COUNT, SUM)    | ~105,000    | 0.9   | GPU (borderline, pushed over by fusion bonus)
6. Sort                      | ~240 groups | 0.01  | CPU main thread

Steps 1 through 5 form a fusible segment. Step 6 runs on the CPU because sorting 240 rows is trivial.

Without fusion

Transfer 1:  CPU -> GPU    (500K records, 4 MB)        0.18 ms
Compute 1:   Filter                                     1.1 ms
Transfer 2:  GPU -> CPU    (175K records, 1.4 MB)       0.10 ms
Transfer 3:  CPU -> GPU    (175K records, 1.4 MB)       0.06 ms
Compute 2:   Classify                                   0.8 ms
Transfer 4:  GPU -> CPU    (175K records, 1.4 MB)       0.10 ms
Transfer 5:  CPU -> GPU    (175K records, 1.4 MB)       0.06 ms
Compute 3:   Filter                                     0.6 ms
Transfer 6:  GPU -> CPU    (105K records, 840 KB)       0.08 ms
Transfer 7:  CPU -> GPU    (105K records, 840 KB)       0.04 ms
Compute 4:   GroupBy                                    1.2 ms
Transfer 8:  GPU -> CPU    (105K groups, 840 KB)        0.08 ms
Transfer 9:  CPU -> GPU    (105K groups, 840 KB)        0.04 ms
Compute 5:   Aggregate                                  0.5 ms
Transfer 10: GPU -> CPU    (240 groups, ~2 KB)          0.03 ms
Compute 6:   Sort (CPU)                                 0.1 ms

Total transfers: 10 x avg 0.077 ms = 0.77 ms
Total compute:   4.3 ms
Total:           5.07 ms

With fusion

Transfer 1:  CPU -> GPU    (500K records, 4 MB)         0.18 ms
Compute 1:   Filter                                      1.1 ms
  [buffer stays on GPU]
Compute 2:   Classify                                    0.8 ms
  [buffer stays on GPU]
Compute 3:   Filter                                      0.6 ms
  [buffer stays on GPU]
Compute 4:   GroupBy                                     1.2 ms
  [buffer stays on GPU]
Compute 5:   Aggregate                                   0.5 ms
Transfer 2:  GPU -> CPU    (240 groups, ~2 KB)           0.03 ms
Compute 6:   Sort (CPU)                                  0.1 ms

Total transfers: 2 x avg 0.105 ms = 0.21 ms
Total compute:   4.3 ms
Total:           4.51 ms

Transfer overhead drops from 0.77 ms to 0.21 ms. The pipeline is 11% faster in absolute terms. On larger datasets or longer pipelines, the savings compound.

For a 10-operation all-GPU pipeline on a 10-million-record dataset (40 MB per transfer): unfused transfers total 20 x 1.95 ms = 39.0 ms. Fused: 2 x 1.95 ms = 3.90 ms. Transfer time drops by 90%. On that pipeline, fusion saves 35.1 ms, which may exceed the total compute time.

The fusion bonus (Factor F6)

Fusion does more than eliminate transfers. It changes the dispatch decisions for downstream operators.

The 7-factor scoring function includes the transfer cost in its calculation. When an operator must upload data to the GPU, the upload time is factored into the score as overhead that the GPU's compute advantage must overcome. For borderline operations (score near 1.0), the upload cost can tip the decision from GPU to CPU.

When the pipeline fusion engine detects that the previous operator's output is already resident on the GPU, it removes the upload cost from the downstream operator's score calculation. This produces a fusion bonus: the downstream operator's score increases because it inherits GPU residency from the prior operation.

In the compliance pipeline example, Step 5 (Aggregate) scored 0.9 without the fusion bonus. That would route it to Web Workers. But because Step 4's output is already on the GPU (via the fused segment), the upload cost is zero. The adjusted score rises above 1.0, and the operator stays on the GPU. This extends the fused segment by one operation, which eliminates one additional transfer pair.

The fusion bonus cascades through iterative re-scoring. The engine re-evaluates dispatch scores up to 5 iterations, and each operator that stays on the GPU extends the fused segment, which provides residency for the next operator, which may also benefit from the bonus. A pipeline that would fragment into alternating GPU and CPU operators without the bonus can fuse into a single continuous GPU segment with it.
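A single forward pass of that logic can be sketched as follows (the real engine iterates up to 5 times over the full 7-factor score; baseScore, uploadCost, and cpuOnly are illustrative fields, with baseScore assumed to already include the upload penalty):

```javascript
// Route operators left to right. When the previous operator stayed on the
// GPU, the upload penalty is waived (added back), which can tip borderline
// operators over the 1.0 threshold. cpuOnly models hard safety constraints.
function routeWithFusionBonus(ops) {
  const routes = [];
  let prevOnGpu = false;
  for (const op of ops) {
    const score = op.baseScore + (prevOnGpu ? op.uploadCost : 0);
    const onGpu = !op.cpuOnly && score >= 1.0;
    routes.push(onGpu ? "gpu" : "cpu");
    prevOnGpu = onGpu;
  }
  return routes;
}
```

Applied to the compliance pipeline's scores, the Aggregate step (0.9 base) inherits residency from GroupBy and stays on the GPU, while Sort remains on the CPU.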

When the bonus does not apply

The fusion bonus only applies when the upstream operator is GPU-routed. If a CPU operator breaks the chain (because its branch divergence classification, precision constraints, or atomic contention score forces CPU dispatch), the downstream operator must upload fresh data. The bonus resets.

The engine does not override safety constraints to preserve fusion. If an operator must run on the CPU for correctness reasons (precision-sensitive accumulation, categorical divergence), it runs on the CPU. The fused segment ends. A new segment may begin at the next GPU-routed operator.

Correctness overrides performance. Always.

Integration with device loss recovery

GPU buffers are volatile. If the GPU device is lost mid-pipeline, every buffer in the fused segment is destroyed. The intermediate results between operations 2 and 5 in the compliance example would be gone.

The engine handles this because the original input data never leaves CPU memory. The writeBuffer() call in Step 1 copied the data to the GPU. The CPU's SharedArrayBuffer still holds the original. On device loss, the engine re-dispatches the entire pipeline to the Web Worker tier, starting from the original input.

The re-dispatch is not resumable (it does not continue from the point of failure, because all intermediate state was on the GPU). It restarts the pipeline from Step 1 on the CPU. For the compliance pipeline, the CPU path takes 22 ms instead of 4.5 ms. Slower, but correct. The caller receives results without error handling, retries, or awareness that the GPU was ever involved.
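The recovery contract seen by the caller can be sketched as a plain fallback wrapper (runOnGpu and runOnCpu stand in for the two execution tiers; the engine's actual device-loss detection is more involved):

```javascript
// One promise in, one result out. If the GPU tier fails (e.g. device loss),
// re-run the whole pipeline on the CPU tier from the original input, which
// never left CPU memory. The caller never observes the failure.
async function executePipeline(input, runOnGpu, runOnCpu) {
  try {
    return await runOnGpu(input);
  } catch (err) {
    // All intermediate GPU buffers are gone; restart from step 1 on the CPU.
    return await runOnCpu(input);
  }
}
```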

Where fusion matters most

The transfer savings from fusion scale with three variables:

Pipeline depth. More chained operations means more eliminated transfers. A 3-operation pipeline saves 4 transfers. A 10-operation pipeline saves 18. The absolute savings grow linearly with depth.

Dataset size. Larger buffers mean more bytes per transfer. Fusing a pipeline on a 1 MB dataset saves microseconds. Fusing on a 40 MB dataset saves tens of milliseconds. The percentage improvement is constant, but the absolute time saved scales with data volume.

Transfer overhead. Discrete GPUs on PCIe have higher per-transfer costs than integrated GPUs on shared memory. Fusion matters more on discrete hardware. On integrated GPUs, the savings are smaller in absolute terms but still meaningful because the driver's per-transfer synchronization overhead is a fixed cost regardless of data size.

For enterprise compliance tools processing hundreds of thousands of records through multi-step validation, classification, and aggregation pipelines, all three variables are high. Deep pipelines, large datasets, discrete GPU hardware on analyst workstations. This is where fusion converts a theoretical GPU advantage into a practical one by ensuring the bus does not erase the compute savings.

This is the systems-level engineering behind our enterprise AI automation infrastructure. The shader is not the bottleneck. The bus is. We built the engine that eliminates the bus from the critical path, so every GPU cycle you pay for produces useful work instead of waiting for data to arrive.

Want to discuss how this applies to your business?

Book a Discovery Call