The gap between prediction and reality
Our Precision Sufficiency Analyser evaluates every operation before GPU dispatch. It estimates condition numbers for linear algebra, checks accumulation bounds against the Float32 safe integer threshold (16,777,216), and calculates expected relative error against Float32 machine epsilon.
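Both thresholds are easy to see directly with `Math.fround`, which rounds a Float64 value to the nearest Float32. This is a quick illustration of the limits the analyser checks against, not the analyser's actual code:

```typescript
// Math.fround rounds a Float64 to the nearest representable Float32 value.
// Above 2^24 = 16,777,216, Float32 can no longer represent every integer:
const threshold = 16_777_216;              // 2^24, the safe integer threshold
const above = Math.fround(threshold + 1);  // 16,777,217 is not representable
console.log(above === threshold);          // true: it rounds back down to 2^24

// Float32 machine epsilon is 2^-23 ≈ 1.19e-7. An increment of half an ulp
// or less added to 1 is lost entirely (round-to-nearest-even):
console.log(Math.fround(1 + 2 ** -24) === 1); // true: increment vanishes
console.log(Math.fround(1 + 2 ** -23) > 1);   // true: one full ulp survives
```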
For HIGH-sensitivity operations (matrix solve, eigenvalue decomposition), the analyser blocks GPU dispatch when the predicted error exceeds tolerance. For LOW-sensitivity operations (filters, sorts, comparisons, elementwise transforms), the analyser confirms that Float32 and Float64 produce identical results via range checking. Both categories have clear, deterministic outcomes.
The problem is operations where the pre-dispatch analysis predicts the error is "probably safe" but cannot guarantee it. This includes accumulation-heavy operations (partial sums, windowed aggregates) that pass the pre-dispatch range check but may accumulate data-dependent rounding errors.
Why pre-dispatch analysis has limits
The analyser works with statistical summaries of the data: min, max, histogram, null count. It estimates worst-case accumulation from these summaries. But the actual rounding error depends on the specific values and their ordering, not just their statistical distribution.
Consider a partial sum of 100,000 values averaging £150 with a standard deviation of £2,000. The analyser estimates max accumulation at £15,000,000. This is below the Float32 safe integer threshold (£16,777,216). The pre-dispatch check passes. GPU dispatch proceeds.
But the actual data contains a sequence of 500 values near £10,000 followed by 500 values near -£10,000 (a volatility cluster in financial time series). The partial sums in this region oscillate between £5,000,000 and -£5,000,000. Each oscillation accumulates Float32 rounding error. After 1,000 oscillations, the accumulated error can reach £0.50 to £2.00, which is invisible in the max accumulation estimate but real in the computed result.
The analyser cannot detect this without scanning the full dataset in order, which is as expensive as computing the sum itself. The pre-dispatch check is a necessary safety gate, but it is not sufficient for all data distributions.
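The scale-dependence of the rounding is easy to demonstrate with `Math.fround` (a toy illustration, not the engine's code): at a running-sum magnitude of £5,000,000, the Float32 ulp is £0.50, so the pennies in each added value are rounded away on every addition.

```typescript
// At magnitude ~5e6 (between 2^22 and 2^23), float32 spacing (ulp) is 0.5,
// so fractional pennies in each addend are lost on every single addition.
const runningSum = 5_000_000;                    // exactly representable in float32
const addend = Math.fround(10_000.13);           // ≈ 10000.1298828125 as float32
const f32Sum = Math.fround(runningSum + addend); // rounds to the nearest 0.5
const f64Sum = runningSum + addend;              // Float64 keeps the fraction
console.log(f32Sum);          // 5010000 — the £0.13 has vanished
console.log(f64Sum - f32Sum); // 0.1298828125 of error from one addition
```

Repeat this over a thousand oscillating additions and the per-step losses no longer cancel, which is exactly the drift the pre-dispatch summary cannot see.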
Post-dispatch spot-check verification
For operations that pass the pre-dispatch check but involve accumulation-sensitive computation, we add a second verification layer that runs after the GPU produces its result.
The mechanism is simple: re-compute a small sample of the output on the CPU in Float64 and compare.
The verification protocol
Step 1: GPU dispatch completes. The operation produces an output buffer of N elements in Float32.
Step 2: Select verification points. The engine selects 16 indices uniformly spaced across the output array:
```typescript
function selectVerificationIndices(outputLength: number): number[] {
  const indices: number[] = [];
  const step = Math.max(1, Math.floor(outputLength / 16));
  for (let i = 0; i < 16; i++) {
    const idx = Math.min(i * step, outputLength - 1);
    indices.push(idx);
  }
  return indices;
}
```
Uniform spacing ensures coverage across the full output range. If the error is systematic (affecting a region of the output), at least one of the 16 samples will fall in the affected region. If the error is localized to a single element, the probability of detection is low (16/N), but localized single-element errors are rare in Float32 arithmetic (they typically indicate a bug, not a precision issue).
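For a 500,000-element output, the selection works out as follows (the same arithmetic as the function above, inlined for illustration):

```typescript
// Inline version of the index-selection arithmetic for N = 500,000.
const N = 500_000;
const step = Math.max(1, Math.floor(N / 16)); // 31,250
const indices = Array.from({ length: 16 }, (_, i) => Math.min(i * step, N - 1));
console.log(step);                    // 31250
console.log(indices[0], indices[15]); // 0 468750
```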
Step 3: Re-compute each sampled element on the CPU in Float64.
The engine maintains a Float64 reference implementation of every GPU-dispatched operation. For a partial sum at index K, the CPU computes the sum of the first K+1 elements using Float64 arithmetic:
```typescript
function verifyPartialSum(
  input: Float64Array,     // Original input data (always in CPU memory)
  gpuOutput: Float32Array, // GPU result read back
  indices: number[],
  tolerance: number        // 1e-4 for medium sensitivity, 1e-6 for high sensitivity
): VerificationResult {
  for (const idx of indices) {
    // CPU Float64 reference computation
    let cpuSum = 0;
    for (let i = 0; i <= idx; i++) {
      cpuSum += input[i]; // Float64 addition, no precision loss
    }
    const gpuValue = gpuOutput[idx];
    const relativeError = Math.abs(cpuSum - gpuValue) / Math.abs(cpuSum || 1);
    if (relativeError > tolerance) {
      return { passed: false, failedIndex: idx, relativeError, cpuValue: cpuSum, gpuValue };
    }
  }
  return { passed: true };
}
```
The Float64 computation uses the original input data, which is always available in the CPU's SharedArrayBuffer (the GPU received a copy, not the original).
Step 4: Decision.
If all 16 samples pass (relative error below the tier-specific tolerance: 10^-4 for medium-sensitivity operations, 10^-6 for high-sensitivity operations): the GPU result is returned to the caller. Verification cost: 0.08 ms.
If any sample fails: the GPU result is discarded. The engine re-executes the entire operation on the CPU using Float64. The caller receives the CPU result. Total cost: 0.08 ms (wasted verification) + GPU time (wasted compute) + CPU execution time (5 to 15 ms for a 500,000-element operation).
Why 16 samples
The sample count balances detection probability against verification cost.
Detection probability. If the Float32 error affects a contiguous region of the output (the common pattern for accumulation drift), the probability of at least one sample falling in the affected region is:
P(detect) = 1 - (1 - affectedFraction)^16
| Fraction of output affected | Detection probability (16 samples) |
|---|---|
| 50% | 99.998% |
| 20% | 97.2% |
| 10% | 81.5% |
| 5% | 56.0% |
| 1% | 14.8% |
For errors affecting 5%+ of the output, 16 samples detect the problem with at least 56% probability on a single query. Over 5 repeated queries (common on a dashboard), the cumulative detection probability rises to approximately 98.3% (1 - 0.44^5).
For errors affecting under 1% of the output, detection is unlikely. But a Float32 error affecting under 1% of elements has limited impact on the aggregate result. The error is present but practically insignificant for most analytical use cases.
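The table values and the cumulative figure over repeated queries follow directly from the formula:

```typescript
// P(detect) = 1 - (1 - affectedFraction)^samples
function detectionProbability(affectedFraction: number, samples = 16): number {
  return 1 - Math.pow(1 - affectedFraction, samples);
}

// Cumulative detection over repeated independent queries of the same data.
function cumulativeDetection(affectedFraction: number, queries: number): number {
  const missOnce = 1 - detectionProbability(affectedFraction);
  return 1 - Math.pow(missOnce, queries);
}

console.log(detectionProbability(0.10));   // ≈ 0.8147
console.log(detectionProbability(0.05));   // ≈ 0.5599
console.log(cumulativeDetection(0.05, 5)); // ≈ 0.9835
```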
Verification cost. Each verification point requires re-computing one output element. For a partial sum at index K, this is K additions. On average (for uniformly spaced indices), K ≈ N/2. The total CPU work for 16 verifications is approximately 16 * N/2 = 8N additions in Float64.
For N = 500,000: 4,000,000 Float64 additions. On a single CPU core at 1 billion additions per second: 4 ms. But this is a worst case (partial sums require computing from the start). For element-wise operations, each verification is O(1): re-compute the single element's formula. 16 elements at O(1) each: under 0.001 ms.
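As a sketch of the O(1) per-sample case, a hypothetical scale-by-rate element-wise operation can be re-checked like this (the formula and all names here are illustrative, not the engine's API):

```typescript
// Sketch: O(1) re-check of a hypothetical element-wise scale operation.
function verifyElementwiseScale(
  input: Float64Array,
  gpuOutput: Float32Array,
  indices: number[],
  rate: number,
  tolerance: number
): boolean {
  for (const idx of indices) {
    const expected = input[idx] * rate; // Float64 reference, O(1) per sample
    const relErr = Math.abs(expected - gpuOutput[idx]) / Math.abs(expected || 1);
    if (relErr > tolerance) return false;
  }
  return true;
}
```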
The engine selects the verification strategy based on the operation type:
| Operation type | Per-sample CPU cost | 16-sample total |
|---|---|---|
| Element-wise (map, filter) | O(1) | 0.001 ms |
| Windowed aggregate (moving avg) | O(window_size) | 0.01 ms |
| Running sum (prefix sum) | O(N/2) average | 2 to 4 ms |
| Group-by accumulation | O(N/group_count) | 0.1 to 0.5 ms |
For operations where verification is expensive (running sums), the engine verifies only the last element (the final total) plus 3 intermediate points at the quartile indices. Computed independently, those four checks cost roughly N/4 + N/2 + 3N/4 + N = 2.5N Float64 additions; if the checks share a single sequential running sum, one pass of approximately N additions covers all of them.
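As a sketch, the final-total-plus-intermediate-points check can share a single Float64 running sum, verifying every checkpoint in one sequential pass (illustrative code, not the engine's implementation):

```typescript
// Sketch: verify several prefix-sum checkpoints in ONE sequential Float64 pass.
// Cost is ~N additions total, instead of re-summing from index 0 per checkpoint.
function verifyPrefixCheckpoints(
  input: Float64Array,
  gpuOutput: Float32Array,
  checkpoints: number[], // sorted ascending; e.g. 3 quartile points plus N - 1
  tolerance: number
): boolean {
  let sum = 0;
  let next = 0;
  for (let i = 0; i < input.length && next < checkpoints.length; i++) {
    sum += input[i]; // Float64 running sum, shared by all checkpoints
    if (i === checkpoints[next]) {
      const relErr = Math.abs(sum - gpuOutput[i]) / Math.abs(sum || 1);
      if (relErr > tolerance) return false;
      next++;
    }
  }
  return true;
}
```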
Why tiered tolerances
The patent defines two spot-check tolerance tiers based on operation sensitivity:
Medium sensitivity (10^-4). The tolerance 10^-4 (0.01%) applies to accumulation-sensitive operations (GEMM, GEMV, Conv2D, and accumulation-heavy computations) that pass the pre-dispatch check. It is:
- Tighter than Float32 machine epsilon (1.19 x 10^-7) accumulated over N operations. For N = 500,000, the worst-case accumulated error is N * ε ≈ 0.06 (6%). The tolerance of 0.01% catches errors that are 600x smaller than the theoretical worst case. Most real errors are far below the worst case, so 10^-4 catches the practical failures without triggering on benign rounding noise.
- Looser than Float64 precision (2.22 x 10^-16). We are not demanding bit-exact agreement between Float32 and Float64. We are demanding that the Float32 result is close enough to be usable. For a £15,000,000 sum, 10^-4 relative error means the Float32 result must be within £1,500 of the Float64 reference. For most analytical dashboards, this is acceptable.
High sensitivity (10^-6). The tolerance 10^-6 (0.0001%) applies to linear algebra operations that pass the pre-dispatch condition number check (well-conditioned matrices where the analyser permits GPU dispatch). These operations amplify input error by the condition number, so the post-dispatch tolerance is 100x tighter than the medium-sensitivity default. A well-conditioned matrix solve that passes pre-dispatch but produces output error above 10^-6 is caught and re-executed on CPU.
Configurable per caller. Financial reporting may require 10^-9 (£0.015 tolerance on £15M). Approximate analytics may accept 10^-2 (£150,000 tolerance). The per-tier defaults (10^-4 for medium, 10^-6 for high) can be overridden per operation.
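A minimal sketch of the per-caller override, with hypothetical names (the tier defaults come from the text; the function and its signature are illustrative):

```typescript
// Tier defaults from the spot-check protocol: 1e-4 medium, 1e-6 high.
const TIER_DEFAULTS = { medium: 1e-4, high: 1e-6 } as const;

// A caller-supplied tolerance (e.g. 1e-9 for financial reporting,
// 1e-2 for approximate analytics) overrides the tier default.
function resolveTolerance(
  tier: keyof typeof TIER_DEFAULTS,
  callerOverride?: number
): number {
  return callerOverride ?? TIER_DEFAULTS[tier];
}
```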
The three-layer precision architecture
Post-dispatch verification is the third of three precision layers:
Layer 1: Pre-dispatch categorical blocking
The Precision Sufficiency Analyser evaluates before GPU dispatch. HIGH-sensitivity operations (matrix solve with high condition number) are blocked with a negative infinity penalty via the Float32 Safety Guard. Solve operations have a base threshold of infinity in the patent, meaning they never route to the GPU. Operations where the range check or accumulation overflow check exceeds tolerance are also blocked.
This layer catches the cases where Float32 is provably insufficient based on the operation's mathematical properties and the data's statistical summary.
Cost: 0.01 to 5 ms (depending on whether condition number estimation is needed). Coverage: all operations classified as HIGH sensitivity (solve) and those failing MEDIUM or LOW precision checks.
Layer 2: Pre-dispatch continuous scoring
For operations that pass the categorical check, the analyser computes a precision risk score (Factor F3 in the 7-factor scoring function) that feeds into the dispatch scoring function. Operations with higher estimated precision risk receive lower GPU scores, biasing toward CPU dispatch without categorically blocking.
This layer catches the cases where Float32 is probably insufficient but not provably so. The scoring function may still route to the GPU if other factors (large dataset, high arithmetic intensity, fast hardware) outweigh the precision penalty.
Cost: 0.01 ms (part of standard scoring). Coverage: all operations that pass the categorical pre-dispatch check.
Layer 3: Post-dispatch spot-check verification
For operations that pass both pre-dispatch layers and execute on the GPU, the spot-check verifies the actual output. This layer catches the cases that pre-dispatch analysis missed: data-dependent error accumulation on specific distributions, floating-point catastrophic cancellation on near-equal values, and edge cases in the GPU's rounding behaviour.
Cost: 0.001 to 4 ms (depending on operation type). Coverage: configurable. Enabled by default for financial and healthcare workloads.
When verification is skipped
Not every GPU operation needs post-dispatch verification.
LOW-sensitivity operations (filters, sorts, comparisons, elementwise transforms): The pre-dispatch analyser already confirms via range checking that Float32 comparisons produce the same boolean results as Float64. There is nothing to verify. The output is a bitmask or permutation index, not a numeric value.
Operations that passed the pre-dispatch check with very low risk scores: If the precision risk score is below 0.01 (accumulation well within safe integer range, no condition number concerns), the probability of meaningful Float32 error is negligible. Verification is skipped to avoid the 0.08 ms overhead.
Operations where the caller explicitly opts out: Some callers set tolerance to 10^-1 or higher (visualization-only use cases). At this tolerance, Float32 is always sufficient for representable values. Verification is unnecessary.
The engine's default behaviour: verify accumulation-sensitive operations on financial and healthcare datasets. Skip verification on LOW-sensitivity operations (elementwise, unary, comparisons) and approximate analytics. The caller can override in either direction.
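The skip rules above can be collected into a single predicate. The 0.01 risk cutoff and the 10^-1 opt-out tolerance come from the text; the signature and names are illustrative:

```typescript
type Sensitivity = 'LOW' | 'MEDIUM' | 'HIGH';

// Sketch of the verification skip logic described above.
function shouldVerify(
  sensitivity: Sensitivity,
  precisionRiskScore: number,
  callerTolerance: number
): boolean {
  if (sensitivity === 'LOW') return false;     // bitmask/permutation output: nothing to check
  if (precisionRiskScore < 0.01) return false; // negligible Float32 risk, skip 0.08 ms overhead
  if (callerTolerance >= 1e-1) return false;   // caller opted out (visualization-only)
  return true;
}
```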
What happens on verification failure
When a spot-check fails, the engine executes a transparent recovery:
```typescript
async function dispatchWithVerification(
  operation: Operation,
  data: TypedArray,
  tolerance: number
): Promise<TypedArray> {
  // Pre-dispatch check (Layer 1 and 2)
  const score = computeDispatchScore(operation, data);
  if (score <= 0) {
    return cpuDispatch(operation, data); // Blocked by analyser
  }
  if (score > 1.0) {
    // GPU dispatch
    const gpuResult = await gpuDispatch(operation, data);
    // Post-dispatch verification (Layer 3)
    if (shouldVerify(operation, score)) {
      const verification = spotCheck(data, gpuResult, operation, tolerance);
      if (!verification.passed) {
        telemetry.emit('verification_failed', {
          operation: operation.name,
          failedIndex: verification.failedIndex,
          relativeError: verification.relativeError,
          gpuValue: verification.gpuValue,
          cpuValue: verification.cpuValue,
          tolerance,
        });
        // Discard GPU result, re-execute on CPU
        return cpuDispatch(operation, data);
      }
    }
    return gpuResult;
  }
  // Worker or main-thread dispatch
  return workerDispatch(operation, data);
}
```
The caller calls dispatch() and receives a result. If the GPU was used and verification passed: the GPU result. If the GPU was used and verification failed: the CPU result. The caller does not know which path executed. The interface is identical.
Telemetry on failure
Every verification failure produces a structured log:
```json
{
  "event": "verification_failed",
  "timestamp": "2026-04-14T11:42:18.493Z",
  "operation": "running_sum",
  "column": "daily_pnl",
  "outputLength": 252,
  "failedIndex": 188,
  "relativeError": 3.72e-4,
  "gpuValue": 14218344,
  "cpuValue": 14223632.47,
  "tolerance": 1e-4,
  "action": "CPU_REEXECUTION",
  "gpuTimeWasted": 0.8,
  "cpuReexecutionTime": 6.2
}
```
This log answers: What operation failed? Which output element? By how much? What was the GPU's answer versus the CPU's? How much time was wasted?
Over time, the telemetry reveals patterns. If a specific column or data distribution consistently triggers verification failures, the engineering team can adjust the pre-dispatch analyser to catch that pattern earlier (moving it from Layer 3 to Layer 1 or 2).
Adaptive threshold tightening
If verification fails more than 3 times in a session for the same operation type, the engine tightens the pre-dispatch scoring for that operation. The precision penalty in Layer 2 is increased by 2x, making the GPU score lower and biasing toward CPU dispatch.
This prevents a dataset that systematically triggers Float32 errors from repeatedly wasting GPU compute and verification time. After 3 failures, the engine learns (for this session) that this data is not GPU-safe for this operation.
```typescript
function adjustPrecisionPenalty(
  operation: string,
  currentPenalty: number,
  failureCount: number
): number {
  if (failureCount >= 3) {
    return currentPenalty * 2.0; // Double the precision penalty
  }
  return currentPenalty;
}
```
The adjustment is session-scoped. It does not persist across page reloads (the data or query may change). It is a runtime adaptation, not a permanent configuration change.
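The session-scoped failure counter that feeds the penalty adjustment might look like this (a sketch under assumed names; a plain in-memory `Map` is enough precisely because the state must not outlive the session):

```typescript
// Session-scoped failure tracking feeding the penalty adjustment (sketch).
const failureCounts = new Map<string, number>();

function recordFailureAndAdjust(operation: string, currentPenalty: number): number {
  const n = (failureCounts.get(operation) ?? 0) + 1;
  failureCounts.set(operation, n);
  // Same rule as adjustPrecisionPenalty: double the penalty from the 3rd failure.
  return n >= 3 ? currentPenalty * 2.0 : currentPenalty;
}

// First two failures leave the penalty unchanged; the third doubles it.
console.log(recordFailureAndAdjust('running_sum', 0.3)); // 0.3
console.log(recordFailureAndAdjust('running_sum', 0.3)); // 0.3
console.log(recordFailureAndAdjust('running_sum', 0.3)); // 0.6
```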
The cost of being right
The verification adds overhead in two scenarios:
Scenario 1: Verification passes (the common case). The GPU result is correct. The verification cost is 0.001 to 4 ms depending on operation type. For element-wise operations: negligible. For running sums: up to 4 ms, which is 40% to 80% of the GPU compute time. Significant but tolerable for financial workloads where correctness is non-negotiable.
Scenario 2: Verification fails (rare). The GPU result is discarded. The total cost is GPU time (wasted) + verification time (wasted) + CPU re-execution time. For a 500,000-element running sum: 0.8 ms (GPU) + 2 ms (verification) + 6.2 ms (CPU) = 9 ms total. The CPU-only path would have taken 6.2 ms. The overhead from the failed GPU attempt is 2.8 ms (45%).
For a dashboard that runs 50 queries per session with 1 to 2 verification failures: the total overhead from verification is approximately 50 * 0.08 ms (passing checks) + 2 * 2.8 ms (failed checks) = 9.6 ms across the entire session. Under 10 ms for mathematical certainty across all queries.
Where this applies
Banking reconciliation. Daily position reconciliation computes running totals, net exposures, and P&L curves. A £5,000 discrepancy between two systems triggers an investigation that costs more in analyst time than the compute saved by GPU acceleration. The spot-check ensures the GPU result matches Float64 reference values before it enters the reconciliation pipeline.
Insurance actuarial calculations. Loss reserves, premium sufficiency tests, and claims projections involve multi-step accumulations where rounding errors compound. A 0.01% error on a £500M reserve is £50,000. The verification catches this before the number reaches the actuarial report.
Healthcare clinical scoring. Patient risk scores computed from weighted sums of lab values, vitals, and history. A Float32 error that shifts a patient from "moderate risk" to "high risk" (or vice versa) triggers inappropriate treatment escalation or missed intervention. The spot-check verifies the scoring output against Float64 reference.
In all three cases, the cost of a wrong number exceeds the cost of verification by orders of magnitude. The 0.08 ms overhead per query is not a performance concern. It is an insurance premium that costs nothing relative to what it protects.
The engineering principle
Pre-dispatch analysis predicts. Post-dispatch verification confirms. Neither alone is sufficient.
Pre-dispatch analysis catches the cases where Float32 is provably wrong (HIGH sensitivity operations like solve) and probably wrong (operations with elevated precision risk scores). It cannot catch the cases where Float32 is unexpectedly wrong due to data-dependent error accumulation.
Post-dispatch verification catches those cases. It is the safety net beneath the safety gate.
Together, the three layers form a defence-in-depth architecture for numerical precision. Layer 1 blocks the obvious failures. Layer 2 biases against the probable failures. Layer 3 catches the actual failures that the first two layers missed.
This is the precision guarantee behind our enterprise AI automation infrastructure. We use the GPU for speed. We verify on the CPU for correctness. When the two disagree, correctness wins. The user gets the right answer. Always. The GPU made it faster. The verification made it certain.