The gap between prediction and reality
Our Precision Sufficiency Analyser evaluates every operation before GPU dispatch. It estimates condition numbers for linear algebra, checks accumulation bounds against the Float32 safe integer threshold (16,777,216), and calculates expected relative error against Float32 machine epsilon.
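Both thresholds are easy to see directly with `Math.fround`, which rounds a Float64 value to the nearest Float32. This is a quick illustration of the limits the analyser checks against, not the analyser's actual code:

```typescript
// Math.fround rounds a Float64 to the nearest representable Float32 value.
// Above 2^24 = 16,777,216, Float32 can no longer represent every integer:
const threshold = 16_777_216;              // 2^24, the safe integer threshold
const above = Math.fround(threshold + 1);  // 16,777,217 is not representable
console.log(above === threshold);          // true: it rounds back down to 2^24

// Float32 machine epsilon is 2^-23 ≈ 1.19e-7. An increment of half an ulp
// or less added to 1 is lost entirely (round-to-nearest-even):
console.log(Math.fround(1 + 2 ** -24) === 1); // true: increment vanishes
console.log(Math.fround(1 + 2 ** -23) > 1);   // true: one full ulp survives
```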
For HIGH-sensitivity operations (matrix solve, eigenvalue decomposition), the analyser blocks GPU dispatch when the predicted error exceeds tolerance. For LOW-sensitivity operations (filters, sorts, comparisons, elementwise transforms), the analyser confirms that Float32 and Float64 produce identical results via range checking. Both categories have clear, deterministic outcomes.
The problem is operations where the pre-dispatch analysis predicts the error is "probably safe" but cannot guarantee it. This includes accumulation-heavy operations (partial sums, windowed aggregates) that pass the pre-dispatch range check but may accumulate data-dependent rounding errors.
Why pre-dispatch analysis has limits
The analyser works with statistical summaries of the data: min, max, histogram, null count. It estimates worst-case accumulation from these summaries. But the actual rounding error depends on the specific values and their ordering, not just their statistical distribution.
Consider a partial sum of 100,000 values averaging £150 with a standard deviation of £2,000. The analyser estimates max accumulation at £15,000,000. This is below the Float32 safe integer threshold (£16,777,216). The pre-dispatch check passes. GPU dispatch proceeds.
But the actual data contains a sequence of 500 values near £10,000 followed by 500 values near -£10,000 (a volatility cluster in financial time series). The partial sums in this region oscillate between £5,000,000 and -£5,000,000. Each oscillation accumulates Float32 rounding error. After 1,000 oscillations, the accumulated error can reach £0.50 to £2.00, which is invisible in the max accumulation estimate but real in the computed result.
The analyser cannot detect this without scanning the full dataset in order, which is as expensive as computing the sum itself. The pre-dispatch check is a necessary safety gate, but it is not sufficient for all data distributions.
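The scale-dependence of the rounding is easy to demonstrate with `Math.fround` (a toy illustration, not the engine's code): at a running-sum magnitude of £5,000,000, the Float32 ulp is £0.50, so the pennies in each added value are rounded away on every addition.

```typescript
// At magnitude ~5e6 (between 2^22 and 2^23), float32 spacing (ulp) is 0.5,
// so fractional pennies in each addend are lost on every single addition.
const runningSum = 5_000_000;                    // exactly representable in float32
const addend = Math.fround(10_000.13);           // ≈ 10000.1298828125 as float32
const f32Sum = Math.fround(runningSum + addend); // rounds to the nearest 0.5
const f64Sum = runningSum + addend;              // Float64 keeps the fraction
console.log(f32Sum);          // 5010000 — the £0.13 has vanished
console.log(f64Sum - f32Sum); // 0.1298828125 of error from one addition
```

Repeat this over a thousand oscillating additions and the per-step losses no longer cancel, which is exactly the drift the pre-dispatch summary cannot see.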
Post-dispatch spot-check verification
For operations that pass the pre-dispatch check but involve accumulation-sensitive computation, we add a second verification layer that runs after the GPU produces its result.
The mechanism is simple: re-compute a small sample of the output on the CPU in Float64 and compare.
The verification protocol
Step 1: GPU dispatch completes. The operation produces an output buffer of N elements in Float32.
Step 2: Select verification points. The engine selects 16 indices uniformly spaced across the output array:
```typescript
function selectVerificationIndices(outputLength: number): number[] {
  const indices: number[] = [];
  const step = Math.max(1, Math.floor(outputLength / 16));
  for (let i = 0; i < 16; i++) {
    const idx = Math.min(i * step, outputLength - 1);
    indices.push(idx);
  }
  return indices;
}
```
Uniform spacing ensures coverage across the full output range. If the error is systematic (affecting a region of the output), at least one of the 16 samples will fall in the affected region. If the error is localized to a single element, the probability of detection is low (16/N), but localized single-element errors are rare in Float32 arithmetic (they typically indicate a bug, not a precision issue).
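For a 500,000-element output, the selection works out as follows (the same arithmetic as the function above, inlined for illustration):

```typescript
// Inline version of the index-selection arithmetic for N = 500,000.
const N = 500_000;
const step = Math.max(1, Math.floor(N / 16)); // 31,250
const indices = Array.from({ length: 16 }, (_, i) => Math.min(i * step, N - 1));
console.log(step);                    // 31250
console.log(indices[0], indices[15]); // 0 468750
```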
Step 3: Re-compute each sampled element on the CPU in Float64.
The engine maintains a Float64 reference implementation of every GPU-dispatched operation. For a partial sum at index K, the CPU computes the sum of the first K+1 elements using Float64 arithmetic:
```typescript
function verifyPartialSum(
  input: Float64Array,     // Original input data (always in CPU memory)
  gpuOutput: Float32Array, // GPU result read back
  indices: number[],
  tolerance: number        // 1e-4 for medium sensitivity, 1e-6 for high sensitivity
): VerificationResult {
  for (const idx of indices) {
    // CPU Float64 reference computation
    let cpuSum = 0;
    for (let i = 0; i <= idx; i++) {
      cpuSum += input[i]; // Float64 addition, no precision loss
    }
    const gpuValue = gpuOutput[idx];
    const relativeError = Math.abs(cpuSum - gpuValue) / Math.abs(cpuSum || 1);
    if (relativeError > tolerance) {
      return { passed: false, failedIndex: idx, relativeError, cpuValue: cpuSum, gpuValue };
    }
  }
  return { passed: true };
}
```
The Float64 computation uses the original input data, which is always available in the CPU's SharedArrayBuffer (the GPU received a copy, not the original).
Step 4: Decision.
If all 16 samples pass (relative error below the tier-specific tolerance: 10^-4 for medium-sensitivity operations, 10^-6 for high-sensitivity operations): the GPU result is returned to the caller. Verification cost: 0.08 ms.
If any sample fails: the GPU result is discarded. The engine re-executes the entire operation on the CPU using Float64. The caller receives the CPU result. Total cost: 0.08 ms (wasted verification) + GPU time (wasted compute) + CPU execution time (5 to 15 ms for a 500,000-element operation).
Why 16 samples
The sample count balances detection probability against verification cost.
Detection probability. If the Float32 error affects a contiguous region of the output (the common pattern for accumulation drift), the probability of at least one sample falling in the affected region is:
P(detect) = 1 - (1 - affectedFraction)^16
| Fraction of output affected | Detection probability (16 samples) |
|---|---|
| 50% | 99.998% |
| 20% | 97.2% |
| 10% | 81.5% |
| 5% | 56.0% |
| 1% | 14.8% |
For errors affecting 5%+ of the output, 16 samples detect the problem with at least 56% probability on a single query. Over 5 repeated queries (common on a dashboard), the cumulative detection probability rises to approximately 98.3% (1 - 0.44^5).
For errors affecting under 1% of the output, detection is unlikely. But a Float32 error affecting under 1% of elements has limited impact on the aggregate result. The error is present but practically insignificant for most analytical use cases.
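The table values and the cumulative figure over repeated queries follow directly from the formula:

```typescript
// P(detect) = 1 - (1 - affectedFraction)^samples
function detectionProbability(affectedFraction: number, samples = 16): number {
  return 1 - Math.pow(1 - affectedFraction, samples);
}

// Cumulative detection over repeated independent queries of the same data.
function cumulativeDetection(affectedFraction: number, queries: number): number {
  const missOnce = 1 - detectionProbability(affectedFraction);
  return 1 - Math.pow(missOnce, queries);
}

console.log(detectionProbability(0.10));   // ≈ 0.8147
console.log(detectionProbability(0.05));   // ≈ 0.5599
console.log(cumulativeDetection(0.05, 5)); // ≈ 0.9835
```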
Verification cost. Each verification point requires re-computing one output element. For a partial sum at index K, this is K additions. On average (for uniformly spaced indices), K ≈ N/2. The total CPU work for 16 verifications is approximately 16 * N/2 = 8N additions in Float64.
For N = 500,000: 4,000,000 Float64 additions. On a single CPU core at 1 billion additions per second: 4 ms. But this is a worst case (partial sums require computing from the start). For element-wise operations, each verification is O(1): re-compute the single element's formula. 16 elements at O(1) each: under 0.001 ms.
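As a sketch of the O(1) per-sample case, a hypothetical scale-by-rate element-wise operation can be re-checked like this (the formula and all names here are illustrative, not the engine's API):

```typescript
// Sketch: O(1) re-check of a hypothetical element-wise scale operation.
function verifyElementwiseScale(
  input: Float64Array,
  gpuOutput: Float32Array,
  indices: number[],
  rate: number,
  tolerance: number
): boolean {
  for (const idx of indices) {
    const expected = input[idx] * rate; // Float64 reference, O(1) per sample
    const relErr = Math.abs(expected - gpuOutput[idx]) / Math.abs(expected || 1);
    if (relErr > tolerance) return false;
  }
  return true;
}
```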
The engine selects the verification strategy based on the operation type:
| Operation type | Per-sample CPU cost | 16-sample total |
|---|---|---|
| Element-wise (map, filter) | O(1) | 0.001 ms |
| Windowed aggregate (moving avg) | O(window_size) | 0.01 ms |
| Running sum (prefix sum) | O(N/2) average | 2 to 4 ms |
| Group-by accumulation | O(N/group_count) | 0.1 to 0.5 ms |
For operations where verification is expensive (running sums), the engine verifies only the last element (the final total) plus 3 intermediate points at the quartile indices. Computed independently, those four checks cost roughly N/4 + N/2 + 3N/4 + N = 2.5N Float64 additions; if the checks share a single sequential running sum, one pass of approximately N additions covers all of them.
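As a sketch, the final-total-plus-intermediate-points check can share a single Float64 running sum, verifying every checkpoint in one sequential pass (illustrative code, not the engine's implementation):

```typescript
// Sketch: verify several prefix-sum checkpoints in ONE sequential Float64 pass.
// Cost is ~N additions total, instead of re-summing from index 0 per checkpoint.
function verifyPrefixCheckpoints(
  input: Float64Array,
  gpuOutput: Float32Array,
  checkpoints: number[], // sorted ascending; e.g. 3 quartile points plus N - 1
  tolerance: number
): boolean {
  let sum = 0;
  let next = 0;
  for (let i = 0; i < input.length && next < checkpoints.length; i++) {
    sum += input[i]; // Float64 running sum, shared by all checkpoints
    if (i === checkpoints[next]) {
      const relErr = Math.abs(sum - gpuOutput[i]) / Math.abs(sum || 1);
      if (relErr > tolerance) return false;
      next++;
    }
  }
  return true;
}
```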
Why tiered tolerances
The patent defines two spot-check tolerance tiers based on operation sensitivity:
Medium sensitivity (10^-4). The tolerance 10^-4 (0.01%) applies to accumulation-sensitive operations (GEMM, GEMV, Conv2D, and accumulation-heavy computations) that pass the pre-dispatch check. It is:
- Tighter than Float32 machine epsilon (1.19 x 10^-7) accumulated over N operations. For N = 500,000, the worst-case accumulated error is N * ε ≈ 0.06 (6%). The tolerance of 0.01% catches errors that are 600x smaller than the theoretical worst case. Most real errors are far below the worst case, so 10^-4 catches the practical failures without triggering on benign rounding noise.
- Looser than Float64 precision (2.22 x 10^-16). We are not demanding bit-exact agreement between Float32 and Float64. We are demanding that the Float32 result is close enough to be usable. For a £15,000,000 sum, 10^-4 relative error means the Float32 result must be within £1,500 of the Float64 reference. For most analytical dashboards, this is acceptable.
High sensitivity (10^-6). The tolerance 10^-6 (0.0001%) applies to linear algebra operations that pass the pre-dispatch condition number check (well-conditioned matrices where the analyser permits GPU dispatch). These operations amplify input error by the condition number, so the post-dispatch tolerance is 100x tighter than the medium-sensitivity default. A well-conditioned matrix solve that passes pre-dispatch but produces output error above 10^-6 is caught and re-executed on CPU.
Configurable per caller. Financial reporting may require 10^-9 (£0.015 tolerance on £15M). Approximate analytics may accept 10^-2 (£150,000 tolerance). The per-tier defaults (10^-4 for medium, 10^-6 for high) can be overridden per operation.
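A minimal sketch of the per-caller override, with hypothetical names (the tier defaults come from the text; the function and its signature are illustrative):

```typescript
// Tier defaults from the spot-check protocol: 1e-4 medium, 1e-6 high.
const TIER_DEFAULTS = { medium: 1e-4, high: 1e-6 } as const;

// A caller-supplied tolerance (e.g. 1e-9 for financial reporting,
// 1e-2 for approximate analytics) overrides the tier default.
function resolveTolerance(
  tier: keyof typeof TIER_DEFAULTS,
  callerOverride?: number
): number {
  return callerOverride ?? TIER_DEFAULTS[tier];
}
```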
The three-layer precision architecture
Post-dispatch verification is the third of three precision layers:
Layer 1: Pre-dispatch categorical blocking
The Precision Sufficiency Analyser evaluates before GPU dispatch. HIGH-sensitivity operations (matrix solve with high condition number) are blocked with a negative infinity penalty via the Float32 Safety Guard. Solve operations have a base threshold of infinity in the patent, meaning they never route to the GPU. Operations where the range check or accumulation overflow check exceeds tolerance are also blocked.
This layer catches the cases where Float32 is provably insufficient based on the operation's mathematical properties and the data's statistical summary.
Cost: 0.01 to 5 ms (depending on whether condition number estimation is needed). Coverage: all operations classified as HIGH sensitivity (solve) and those failing MEDIUM or LOW precision checks.
Layer 2: Pre-dispatch continuous scoring
For operations that pass the categorical check, the analyser computes a precision risk score (Factor F3 in the 7-factor scoring function) that feeds into the dispatch scoring function. Operations with higher estimated precision risk receive lower GPU scores, biasing toward CPU dispatch without categorically blocking.
This layer catches the cases where Float32 is probably insufficient but not provably so. The scoring function may still route to the GPU if other factors (large dataset, high arithmetic intensity, fast hardware) outweigh the precision penalty.
Cost: 0.01 ms (part of standard scoring). Coverage: all operations that pass the categorical pre-dispatch check.
Layer 3: Post-dispatch spot-check verification
For operations that pass both pre-dispatch layers and execute on the GPU, the spot-check verifies the actual output. This layer catches the cases that pre-dispatch analysis missed: data-dependent error accumulation on specific distributions, floating-point catastrophic cancellation on near-equal values, and edge cases in the GPU's rounding behaviour.
Cost: 0.001 to 4 ms (depending on operation type). Coverage: configurable. Enabled by default for financial and healthcare workloads.
When verification is skipped
Not every GPU operation needs post-dispatch verification.
LOW-sensitivity operations (filters, sorts, comparisons, elementwise transforms): The pre-dispatch analyser already confirms via range checking that Float32 comparisons produce the same boolean results as Float64. There is nothing to verify. The output is a bitmask or permutation index, not a numeric value.
Operations that passed the pre-dispatch check with very low risk scores: If the precision risk score is below 0.01 (accumulation well within safe integer range, no condition number concerns), the probability of meaningful Float32 error is negligible. Verification is skipped to avoid the 0.08 ms overhead.
Operations where the caller explicitly opts out: Some callers set tolerance to 10^-1 or higher (visualization-only use cases). At this tolerance, Float32 is always sufficient for representable values. Verification is unnecessary.
The engine's default behaviour: verify accumulation-sensitive operations on financial and healthcare datasets. Skip verification on LOW-sensitivity operations (elementwise, unary, comparisons) and approximate analytics. The caller can override in either direction.
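The skip rules above can be collected into a single predicate. The 0.01 risk cutoff and the 10^-1 opt-out tolerance come from the text; the signature and names are illustrative:

```typescript
type Sensitivity = 'LOW' | 'MEDIUM' | 'HIGH';

// Sketch of the verification skip logic described above.
function shouldVerify(
  sensitivity: Sensitivity,
  precisionRiskScore: number,
  callerTolerance: number
): boolean {
  if (sensitivity === 'LOW') return false;     // bitmask/permutation output: nothing to check
  if (precisionRiskScore < 0.01) return false; // negligible Float32 risk, skip 0.08 ms overhead
  if (callerTolerance >= 1e-1) return false;   // caller opted out (visualization-only)
  return true;
}
```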
What happens on verification failure
When a spot-check fails, the engine executes a transparent recovery:
```typescript
async function dispatchWithVerification(
  operation: Operation,
  data: TypedArray,
  tolerance: number
): Promise<TypedArray> {
  // Pre-dispatch check (Layer 1 and 2)
  const score = computeDispatchScore(operation, data);
  if (score <= 0) {
    return cpuDispatch(operation, data); // Blocked by analyser
  }
  if (score > 1.0) {
    // GPU dispatch
    const gpuResult = await gpuDispatch(operation, data);
    // Post-dispatch verification (Layer 3)
    if (shouldVerify(operation, score)) {
      const verification = spotCheck(data, gpuResult, operation, tolerance);
      if (!verification.passed) {
        telemetry.emit('verification_failed', {
          operation: operation.name,
          failedIndex: verification.failedIndex,
          relativeError: verification.relativeError,
          gpuValue: verification.gpuValue,
          cpuValue: verification.cpuValue,
          tolerance,
        });
        // Discard GPU result, re-execute on CPU
        return cpuDispatch(operation, data);
      }
    }
    return gpuResult;
  }
  // Worker or main-thread dispatch
  return workerDispatch(operation, data);
}
```
The caller calls dispatch() and receives a result. If the GPU was used and verification passed: the GPU result. If the GPU was used and verification failed: the CPU result. The caller does not know which path executed. The interface is identical.
Telemetry on failure
Every verification failure produces a structured log:
```json
{
  "event": "verification_failed",
  "timestamp": "2026-04-14T11:42:18.493Z",
  "operation": "running_sum",
  "column": "daily_pnl",
  "outputLength": 252,
  "failedIndex": 188,
  "relativeError": 3.72e-4,
  "gpuValue": 14218344,
  "cpuValue": 14223632.47,
  "tolerance": 1e-4,
  "action": "CPU_REEXECUTION",
  "gpuTimeWasted": 0.8,
  "cpuReexecutionTime": 6.2
}
```
This log answers: What operation failed? Which output element? By how much? What was the GPU's answer versus the CPU's? How much time was wasted?
Over time, the telemetry reveals patterns. If a specific column or data distribution consistently triggers verification failures, the engineering team can adjust the pre-dispatch analyser to catch that pattern earlier (moving it from Layer 3 to Layer 1 or 2).
Adaptive threshold tightening
If verification fails more than 3 times in a session for the same operation type, the engine tightens the pre-dispatch scoring for that operation. The precision penalty in Layer 2 is increased by 2x, making the GPU score lower and biasing toward CPU dispatch.
This prevents a dataset that systematically triggers Float32 errors from repeatedly wasting GPU compute and verification time. After 3 failures, the engine learns (for this session) that this data is not GPU-safe for this operation.
```typescript
function adjustPrecisionPenalty(
  operation: string,
  currentPenalty: number,
  failureCount: number
): number {
  if (failureCount >= 3) {
    return currentPenalty * 2.0; // Double the precision penalty
  }
  return currentPenalty;
}
```
The adjustment is session-scoped. It does not persist across page reloads (the data or query may change). It is a runtime adaptation, not a permanent configuration change.
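The session-scoped failure counter that feeds the penalty adjustment might look like this (a sketch under assumed names; a plain in-memory `Map` is enough precisely because the state must not outlive the session):

```typescript
// Session-scoped failure tracking feeding the penalty adjustment (sketch).
const failureCounts = new Map<string, number>();

function recordFailureAndAdjust(operation: string, currentPenalty: number): number {
  const n = (failureCounts.get(operation) ?? 0) + 1;
  failureCounts.set(operation, n);
  // Same rule as adjustPrecisionPenalty: double the penalty from the 3rd failure.
  return n >= 3 ? currentPenalty * 2.0 : currentPenalty;
}

// First two failures leave the penalty unchanged; the third doubles it.
console.log(recordFailureAndAdjust('running_sum', 0.3)); // 0.3
console.log(recordFailureAndAdjust('running_sum', 0.3)); // 0.3
console.log(recordFailureAndAdjust('running_sum', 0.3)); // 0.6
```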
The cost of being right
The verification adds overhead in two scenarios:
Scenario 1: Verification passes (the common case). The GPU result is correct. The verification cost is 0.001 to 4 ms depending on operation type. For element-wise operations: negligible. For running sums: up to 4 ms, which is 40% to 80% of the GPU compute time. Significant but tolerable for financial workloads where correctness is non-negotiable.
Scenario 2: Verification fails (rare). The GPU result is discarded. The total cost is GPU time (wasted) + verification time (wasted) + CPU re-execution time. For a 500,000-element running sum: 0.8 ms (GPU) + 2 ms (verification) + 6.2 ms (CPU) = 9 ms total. The CPU-only path would have taken 6.2 ms. The overhead from the failed GPU attempt is 2.8 ms (45%).
For a dashboard that runs 50 queries per session with 1 to 2 verification failures: the total overhead from verification is approximately 50 * 0.08 ms (passing checks) + 2 * 2.8 ms (failed checks) = 9.6 ms across the entire session. Under 10 ms for mathematical certainty across all queries.
Where this applies
Banking reconciliation. Daily position reconciliation computes running totals, net exposures, and P&L curves. A £5,000 discrepancy between two systems triggers an investigation that costs more in analyst time than the compute saved by GPU acceleration. The spot-check ensures the GPU result matches Float64 reference values before it enters the reconciliation pipeline.
Insurance actuarial calculations. Loss reserves, premium sufficiency tests, and claims projections involve multi-step accumulations where rounding errors compound. A 0.01% error on a £500M reserve is £50,000. The verification catches this before the number reaches the actuarial report.
Healthcare clinical scoring. Patient risk scores computed from weighted sums of lab values, vitals, and history. A Float32 error that shifts a patient from "moderate risk" to "high risk" (or vice versa) triggers inappropriate treatment escalation or missed intervention. The spot-check verifies the scoring output against Float64 reference.
In all three cases, the cost of a wrong number exceeds the cost of verification by orders of magnitude. The 0.08 ms overhead per query is not a performance concern. It is an insurance premium that costs nothing relative to what it protects.
The engineering principle
Pre-dispatch analysis predicts. Post-dispatch verification confirms. Neither alone is sufficient.
Pre-dispatch analysis catches the cases where Float32 is provably wrong (HIGH sensitivity operations like solve) and probably wrong (operations with elevated precision risk scores). It cannot catch the cases where Float32 is unexpectedly wrong due to data-dependent error accumulation.
Post-dispatch verification catches those cases. It is the safety net beneath the safety gate.
Together, the three layers form a defence-in-depth architecture for numerical precision. Layer 1 blocks the obvious failures. Layer 2 biases against the probable failures. Layer 3 catches the actual failures that the first two layers missed.
This is the precision guarantee behind our enterprise AI automation infrastructure. We use the GPU for speed. We verify on the CPU for correctness. When the two disagree, correctness wins. The user gets the right answer. Always. The GPU made it faster. The verification made it certain.