The precision problem nobody warns you about
WebGPU compute shaders operate on 32-bit floating-point numbers. JavaScript uses 64-bit. When you write a JavaScript number into a WebGPU storage buffer, the runtime narrows it from Float64 to Float32. This is not optional. WGSL has no f64 type in the base specification. Every value loses precision.
For scientific visualization, game physics, or image processing, this is fine. The lost precision falls below the threshold of perceptual or functional relevance. Nobody notices that a pixel coordinate shifted by 0.000001.
For financial data, the lost precision is a compliance violation.
A Float64 value of 25000000.50 (twenty-five million pounds and fifty pence) narrowed to Float32 becomes 25000000.0. The fifty pence is gone. No error thrown. No NaN produced. No warning logged. The number looks valid. It is wrong.
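The narrowing is easy to reproduce without a GPU. `Math.fround` applies the same Float64-to-Float32 rounding that happens when a value is written into a `Float32Array`, the view type that feeds an f32 storage buffer via `device.queue.writeBuffer`:

```javascript
// £25,000,000.50 survives in Float64 but not in Float32.
const pounds = 25000000.50;

// Math.fround rounds a Float64 to the nearest representable Float32.
const narrowed = Math.fround(pounds);
console.log(narrowed); // 25000000 — the fifty pence is gone

// The same narrowing happens implicitly on a typed-array write,
// which is how values reach a WebGPU storage buffer.
const upload = new Float32Array([pounds]);
console.log(upload[0] === narrowed); // true — identical silent loss
```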
This is silent numerical degradation. It is the most dangerous class of bug in GPU-accelerated financial systems because it produces plausible but incorrect results that pass every type check and every schema validation.
How Float32 precision actually works
IEEE 754 single-precision (Float32) allocates 23 bits to the significand (plus one implicit leading bit, giving 24 bits of precision). This means Float32 can represent integers exactly up to 2^24: 16,777,216.
Above that threshold, consecutive representable Float32 values are spaced more than 1 apart:
| Value range | Spacing between consecutive Float32 values |
|---|---|
| 0 to 16,777,216 | 1 or less (exact integer representation) |
| 16,777,216 to 33,554,432 | 2 |
| 33,554,432 to 67,108,864 | 4 |
| 67,108,864 to 134,217,728 | 8 |
| 134,217,728 to 268,435,456 | 16 |
A Float32 cannot represent 16,777,217. The nearest representable values are 16,777,216 and 16,777,218. Math.fround(16777217) returns 16777216. One pound sterling vanished.
For financial values in the tens of millions (routine for enterprise portfolios, payroll aggregations, quarterly revenue), Float32 cannot represent individual pounds, let alone pence. For values in the hundreds of millions, the gap between representable values is 8 to 16. You are not rounding to the nearest penny. You are rounding to the nearest £8.
Float64, by contrast, has a 52-bit significand (53 bits with the implicit leading bit). It represents integers exactly up to 2^53: 9,007,199,254,740,992. For all practical financial values stored as pence (or cents), Float64 is exact. This is why JavaScript's single number type works for most financial computations despite not having a native decimal type.
The moment you move that data to the GPU, you lose 29 bits of significand precision. Silently.
Where the damage occurs in practice
The narrowing from Float64 to Float32 causes errors at three levels, each progressively harder to detect.
Level 1: Individual value corruption
A single value above 16,777,216 loses information on GPU upload. This is the simplest case and the easiest to detect (if you check). A portfolio position of £30,000,000.75 becomes £30,000,000.00. The 75p is not rounded. It does not exist in Float32 representation at that magnitude.
For a dataset of 100,000 financial records, the number of affected values depends on the magnitude distribution. If 10% of values exceed 16,777,216, you have 10,000 silently corrupted records. Every downstream computation on those records propagates the error.
Level 2: Accumulation drift
Even when individual values are within Float32's exact range, summing them can exceed it. Consider summing 100,000 values that average £500. The true sum is £50,000,000. Float32 cannot represent integers above 16,777,216 exactly. As the running total crosses that threshold during accumulation, each subsequent addition loses precision.
Worse, Float32 addition is not associative in practice. (a + b) + c and a + (b + c) can produce different results when the operands differ by orders of magnitude. GPU parallel reductions sum values in an arbitrary tree order determined by the hardware scheduler. The same dataset summed on two different GPUs, or even on the same GPU with different workgroup sizes, can produce different totals.
For a CPU sequential sum in Float64, this is not a concern: the 53-bit significand provides exact integer arithmetic up to 2^53, about £90 trillion when amounts are held in pence. For a GPU parallel sum in Float32, a £50 million total can be off by hundreds of pounds depending on the reduction tree.
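The drift is reproducible on the CPU by forcing every intermediate addition through Float32 rounding with `Math.fround`. This sketch sums 100,000 identical £500.25 values — a value Float32 represents exactly, so all of the error comes from the accumulation, not the inputs:

```javascript
// Simulate a Float32 sequential sum: round after every addition.
function sumFloat32(values) {
  let total = 0;
  for (const v of values) total = Math.fround(total + v);
  return total;
}

const values = new Array(100000).fill(500.25);
const exact = 100000 * 500.25; // 50,025,000 — exact in Float64
const drifted = sumFloat32(values);

// Once the running total crosses 2^24, each +500.25 rounds toward a
// coarser grid, and the deficit compounds for the rest of the scan.
console.log(drifted < exact);          // true — the Float32 total falls short
console.log(exact - drifted > 1000);   // true — the shortfall is thousands of pounds
```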
Level 3: Algorithmic amplification
Some operations amplify input errors. Linear system solving (matrix inversion, least-squares regression, portfolio optimization) is the critical case. A system with condition number κ amplifies relative input error by a factor of κ.
Float32 machine epsilon is approximately 1.19 x 10^-7 (2^-23). For a well-conditioned system (κ = 10), the expected relative output error is κ * ε = 1.19 x 10^-6. For a financial value of £10,000,000, that is an error of approximately £12. Manageable, possibly.
For a moderately ill-conditioned system (κ = 10,000, common in correlation matrices from financial time series), the expected relative error is κ * ε = 1.19 x 10^-3. On a £10,000,000 value, that is an error of £11,900. On a £100,000,000 portfolio optimization, the error is £119,000.
You are not rounding. You are computing a wrong answer that looks right.
Our Precision Sufficiency Analyser
We built the Precision Sufficiency Analyser as a pre-dispatch gate in our adaptive compute engine. Before any operation reaches the GPU, the analyser evaluates whether Float32 arithmetic can produce results within the caller's specified tolerance. If it cannot, the Float32 Safety Guard forces CPU dispatch with full Float64 precision.
The analyser classifies operations into three precision sufficiency tiers, as defined in our patent filing, and applies a different analysis to each.
HIGH precision sensitivity: Linear system solving
Operations that solve linear systems, invert matrices, compute eigenvalues, or perform least-squares fitting. These amplify numerical error by the condition number of the input matrix. The patent specifies a base threshold of infinity for Solve operations, meaning they always route to CPU.
The analyser estimates the condition number without computing the full SVD (which would be as expensive as the operation itself). It uses a 1-norm condition number estimator based on Hager's algorithm, which requires O(n^2) work for an n x n matrix. This is a small fraction of the O(n^3) cost of the actual solve.
The precision risk score for high-sensitivity operations:
```
expectedRelativeError = conditionNumber * Float32_EPSILON
precisionRiskScore = expectedRelativeError / userTolerance
```
Where Float32_EPSILON is 1.1920929 x 10^-7 and userTolerance is the caller's acceptable relative error (default: 1 x 10^-9 for financial workloads, configurable per operation).
If precisionRiskScore > 1.0, the expected error exceeds tolerance. The Safety Guard blocks GPU dispatch.
Example: A portfolio optimization with a 200 x 200 covariance matrix. The analyser estimates κ = 8,500. Expected relative error: 8,500 * 1.19 x 10^-7 = 1.01 x 10^-3. User tolerance: 1 x 10^-9. Risk score: 1.01 x 10^6. GPU dispatch is blocked. The optimization runs on CPU with Float64, where the expected relative error is 8,500 * 2.22 x 10^-16 = 1.89 x 10^-12, well within tolerance.
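The HIGH-tier gate can be sketched as follows. The function name and the returned `action` strings are illustrative, not the engine's actual API; the constants and the worked numbers come from the text above:

```javascript
const FLOAT32_EPSILON = 2 ** -23; // ≈ 1.1920929e-7

// Hypothetical gate for HIGH-sensitivity (linear solve) operations.
function highSensitivityRisk(conditionNumber, userTolerance) {
  const expectedRelativeError = conditionNumber * FLOAT32_EPSILON;
  const precisionRiskScore = expectedRelativeError / userTolerance;
  return {
    precisionRiskScore,
    action: precisionRiskScore > 1.0 ? "CPU_DISPATCH_FORCED" : "GPU_PERMITTED",
  };
}

// The worked example: estimated κ = 8,500, financial default tolerance 1e-9.
const result = highSensitivityRisk(8500, 1e-9);
console.log(result.action);             // "CPU_DISPATCH_FORCED"
console.log(result.precisionRiskScore); // ≈ 1.01e6
```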
MEDIUM precision sensitivity: GEMM, GEMV, and Conv2D operations
Operations such as matrix multiplication (GEMM), matrix-vector multiplication (GEMV), and 2D convolution (Conv2D) are classified as MEDIUM sensitivity. These involve large accumulations in their inner loops (e.g., dot products) where overflow of partial sums can degrade precision. The analyser checks accumulation overflow bounds for these operation types.
Reduction operations (SUM, running totals) are classified as LOW sensitivity in the patent alongside elementwise, unary, and FFT operations. However, the range check applied to reductions still catches cases where accumulated values exceed Float32's representable range. The following example illustrates how accumulation bounds are evaluated for operations that produce running totals.
The analyser computes two values:
Maximum accumulation bound. The worst-case running total during the operation. For a SUM over positive values, this is the total sum. For a running average, it is the maximum partial sum before division.
Safe integer threshold. 2^24 = 16,777,216 for Float32. Values at or below this threshold are represented exactly.
The precision risk score for medium-sensitivity operations:
```
maxAccumulation = estimateMaxAccumulation(dataset, operation)
precisionRiskScore = maxAccumulation / FLOAT32_SAFE_INTEGER
```
If the score exceeds 1.0, the accumulation will cross the safe integer boundary during execution. Float32 arithmetic will introduce rounding errors in the final result.
Example: Summing a revenue column with 500,000 entries averaging £400. Estimated max accumulation: £200,000,000. Safe integer threshold: 16,777,216. Risk score: 11.9. GPU dispatch is blocked. The sum runs on CPU with Float64.
Example: Summing a quantity column with 500,000 entries averaging 3.2 units. Estimated max accumulation: 1,600,000. Risk score: 0.095. GPU dispatch is permitted. The sum will stay within Float32's exact range throughout the entire accumulation.
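The two worked examples, sketched with a hypothetical `accumulationRisk` helper. This uses the simplest estimator described above (total sum of positive values as the worst-case running total); the engine's real estimator is not shown here:

```javascript
const FLOAT32_SAFE_INTEGER = 2 ** 24; // 16,777,216

// Hypothetical range check: for a SUM over positive values, the
// worst-case running total is the total sum itself (count × mean).
function accumulationRisk(rowCount, meanValue) {
  const maxAccumulation = rowCount * meanValue;
  return maxAccumulation / FLOAT32_SAFE_INTEGER;
}

console.log(accumulationRisk(500000, 400).toFixed(1)); // "11.9" — blocked
console.log(accumulationRisk(500000, 3.2).toFixed(3)); // "0.095" — permitted
```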
LOW precision sensitivity: Elementwise, unary, reduce, and FFT operations
Operations that filter, sort by rank, classify values into bins, or perform elementwise and unary transforms. These produce boolean or ordinal results, or outputs where precision requirements are satisfied by a range check. The relative error in the input affects comparison results only if two values are so close that Float32 cannot distinguish them.
The analyser estimates the minimum gap between adjacent values in the sort order. If the gap exceeds the Float32 ULP (Unit in the Last Place) at that magnitude, the comparison results will be identical in Float32 and Float64. The risk score is:
```
minGap = estimateMinimumGap(dataset)
ulpAtMagnitude = computeULP(estimateMaxMagnitude(dataset))
precisionRiskScore = ulpAtMagnitude / minGap
```
For most filtering and sorting workloads on financial data, the minimum gap (e.g., 1 penny = 0.01) vastly exceeds the Float32 ULP at the relevant magnitude. The risk score is near zero, and GPU dispatch is safe.
This means the same dataset can have its SUM blocked from GPU dispatch (LOW sensitivity, but accumulation exceeds the safe integer threshold via range check) while its SORT runs on the GPU (LOW sensitivity, comparison gaps are safe). The per-operator routing in our query engine makes this seamless: each operator in a pipeline is evaluated independently.
The Float32 Safety Guard
The Safety Guard is not advisory. It is a hard gate.
When the precision risk score exceeds 1.0 for any sensitivity tier, the Safety Guard overrides the dispatch score. The mechanism is identical to our categorical GPU inhibition for branch divergence: the precision penalty is injected before the hardware calibration ratio is applied. For high-sensitivity operations above tolerance, the penalty is negative infinity. No dataset size, no hardware capability, no performance advantage can override it.
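The override can be sketched as a penalty injected into the dispatch score before the calibration ratio is applied. The function shape and score arithmetic here are illustrative, not the engine's internals:

```javascript
// Hypothetical dispatch scoring: a positive score selects GPU.
function dispatchBackend(baseGpuScore, precisionRiskScore, calibrationRatio) {
  // Hard gate: above tolerance the penalty is -Infinity, so no
  // calibration or performance term can recover GPU dispatch.
  const penalty = precisionRiskScore > 1.0 ? -Infinity : 0;
  const score = (baseGpuScore + penalty) * calibrationRatio;
  return score > 0 ? "GPU" : "CPU";
}

console.log(dispatchBackend(950, 1.01e6, 1.4)); // "CPU" — guard wins
console.log(dispatchBackend(950, 0.095, 1.4));  // "GPU" — safe fast path
```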
The Safety Guard logs every intervention:
```js
{
  operation: "SUM",
  column: "revenue",
  riskScore: 11.9,
  maxAccumulation: 200000000,
  threshold: 16777216,
  action: "CPU_DISPATCH_FORCED",
  reason: "ACCUMULATION_EXCEEDS_FLOAT32_SAFE_INTEGER"
}
```
This log entry is available to the caller. For regulated industries, it provides an audit trail proving that the system evaluated precision risk and took corrective action. You are not explaining to a regulator why your numbers are wrong. You are showing them the automated safeguard that prevented them from being wrong.
What this looks like on a real dashboard
Consider an enterprise finance dashboard with four linked panels:
Panel 1: Transaction table. 500,000 rows. Filter by date range, region, counterparty. Filterable and sortable.
Panel 2: Revenue by region. Bar chart. GROUP BY region, SUM(revenue).
Panel 3: Portfolio risk heatmap. 50 x 50 covariance matrix visualization. Requires eigenvalue decomposition for principal component overlay.
Panel 4: Running P&L. Line chart. Cumulative sum of daily profit/loss over 252 trading days.
The Precision Sufficiency Analyser evaluates each panel's operations:
| Panel | Operation | Sensitivity | Risk score | Backend |
|---|---|---|---|---|
| 1 | Filter (date range) | LOW | 0.001 | GPU |
| 1 | Sort (amount DESC) | LOW | 0.003 | GPU |
| 2 | GroupBy + SUM(revenue) | LOW (range check) | 14.2 | CPU (Float64) |
| 3 | Eigenvalue decomposition | HIGH | 2.3 x 10^5 | CPU (Float64) |
| 4 | Cumulative SUM(pnl) | LOW (range check) | 8.7 | CPU (Float64) |
Panel 1's filter and sort run on the GPU in 3 ms total. Panel 2's aggregation runs on CPU workers in 12 ms (Float64 sum, exact to the penny). Panel 3's eigenvalue decomposition runs on the CPU in 45 ms (Float64, condition number handled). Panel 4's cumulative sum runs on CPU in 8 ms.
Total dashboard refresh: under 70 ms. Every number is correct to the precision your compliance team requires. The GPU accelerated the operations it could handle safely. The CPU handled the rest. No manual configuration. No per-panel backend selection.
The alternative: what happens without precision analysis
We have seen three failure patterns in production systems that dispatch financial data to Float32 without analysis.
Pattern 1: The vanishing basis point. A fund management dashboard sums daily returns across 10,000 positions. Individual returns are small (0.01% to 0.5%), but weighted into currency terms and accumulated over a quarter, the running totals cross Float32's safe range. The reported quarterly return drifts by 3 to 5 basis points from the Float64 reference. For a £500 million fund, 5 basis points is £250,000 in misreported performance.
Pattern 2: The phantom correlation. A risk system computes correlation matrices from daily price series. Float32 rounding introduces noise in the 6th decimal place. For stable, low-correlation pairs, this noise is larger than the true correlation. The optimizer sees phantom diversification benefits that do not exist, underestimating portfolio risk.
Pattern 3: The non-reproducible reconciliation. A settlement system runs the same aggregation on two different machines. Different GPUs use different reduction tree orders. Float32's non-associative addition produces different totals. The reconciliation fails with a £12 discrepancy that no one can explain, because the code is identical and the data is identical. Only the hardware differs.
All three patterns share the same root cause: the system assumed Float32 was sufficient without measuring whether it was.
Why we do not use emulated Float64 on the GPU
WGSL does not natively support f64. Some implementations emulate double-precision using pairs of Float32 values (double-single arithmetic). Each Float64 operation becomes 4 to 6 Float32 operations. Throughput drops to 15% to 25% of native Float32 speed.
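Double-single arithmetic carries the rounding error of each Float32 operation in a second Float32. A minimal sketch of the error-free transformation such emulations build on (Knuth's TwoSum, simulated here with `Math.fround`; this is not our engine's code):

```javascript
// TwoSum in simulated Float32: s + err exactly equals a + b.
function twoSumF32(a, b) {
  const s = Math.fround(a + b);
  const bVirtual = Math.fround(s - a);
  const aVirtual = Math.fround(s - bVirtual);
  const err = Math.fround(
    Math.fround(a - aVirtual) + Math.fround(b - bVirtual)
  );
  return [s, err];
}

// 2^24 + 0.5 is not representable in Float32; TwoSum recovers the lost half.
const [sum, err] = twoSumF32(16777216, 0.5);
console.log(sum, err); // 16777216 0.5 — the "vanished" bits, preserved
// Each emulated addition costs several Float32 operations, which is
// where the 4x to 6x instruction overhead comes from.
```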
We evaluated this approach and rejected it for three reasons.
First, the performance loss eliminates the GPU advantage. If a Float64 CPU sum takes 12 ms and an emulated Float64 GPU sum takes 18 ms (due to 4x instruction overhead plus transfer cost), there is no reason to use the GPU.
Second, emulated Float64 does not guarantee identical rounding to hardware Float64. The intermediate rounding behaviour of double-single arithmetic differs from IEEE 754 double-precision in edge cases. For auditable financial systems, "almost the same precision" is not sufficient. The result must match the CPU reference exactly.
Third, the complexity is unjustified. The CPU handles Float64 natively at full throughput. Using it for precision-sensitive operations is not a fallback. It is the correct engineering choice. The GPU handles Float32 operations where precision analysis confirms safety. Each backend does what it does best.
Where this fits in the larger system
Precision analysis is one dimension of dispatch routing alongside hardware capability probing, branch divergence detection, and operator-level adaptive scoring. Together, these systems ensure that GPU dispatch is only used when it is faster, safe, and correct.
This is the principle behind our enterprise AI automation infrastructure. Speed without correctness is a liability. We do not optimize first and verify later. We verify first, and optimize within the boundaries that verification defines.
For finance, those boundaries are set by the data, the operation, and the regulatory tolerance. Our system measures all three before a single GPU instruction executes. If the measurement says CPU, the answer is CPU. No override. No exception.