Ayoob AI

IEEE 754 Bit-Transforms for High-Speed Float Processing in JavaScript


The type system mismatch at the centre of browser compute

JavaScript has one number type: IEEE 754 double-precision (Float64). Every numeric value in your application, from pixel coordinates to financial figures, occupies 64 bits with a 52-bit significand, an 11-bit exponent, and a sign bit.

WebGPU storage buffers and compute shaders operate on 32-bit types. f32 in WGSL gives you a 23-bit significand, an 8-bit exponent, and a sign bit. When you write a JavaScript Float64Array into a WebGPU buffer, the runtime narrows every value from 64 bits to 32 bits.

That narrowing is lossy. The Float64 values 1.0000001 and 1.00000012 are distinct in 64-bit representation. In 32-bit, both round to the same bit pattern. For sorting, this means two elements that were strictly ordered in your source data may become equal after GPU transfer. Your sort is now unstable in a way that has nothing to do with the algorithm.

This is not an edge case. It is the default behaviour of every WebGPU compute pipeline that accepts JavaScript numbers.

Why Array.prototype.sort() is the wrong starting point

Before addressing the Float64-to-Float32 problem, you need to understand why the native sort is unsuitable for high-performance numerical work in the first place.

Array.prototype.sort() has three structural problems.

Problem 1: Lexicographic default. Without a comparator, JavaScript sorts by string conversion. [10, 9, 1, 2].sort() produces [1, 10, 2, 9]. Every JavaScript developer learns this. Fewer realize the deeper cost: even with a correct (a, b) => a - b comparator, every comparison is a function call through the engine's callback machinery.

Problem 2: Comparison overhead. V8's TimSort implementation performs approximately n * log2(n) comparisons. For 1 million elements, that is roughly 20 million comparator invocations. Each invocation crosses the native-to-JS boundary, pushes a stack frame, executes the subtraction, and returns. The comparison function itself is trivial. The call overhead is not.

Problem 3: Single-threaded, single-algorithm. Array.prototype.sort() runs on the main thread using one algorithm (TimSort in V8, merge sort in SpiderMonkey). It cannot dispatch to Web Workers. It cannot dispatch to the GPU. It cannot adapt to the data distribution or the hardware. It is a fixed-strategy, fixed-backend sort.

For a 10-million-element numerical dataset, these constraints translate to 300 ms or more of main-thread blocking. In an enterprise dashboard rendering real-time data, that is a frozen UI.

Our Adaptive Multi-Tier Sorting System

We built a sorting engine that addresses all three problems. The system has four layers.

Layer 1: Precision analysis. Before any sort begins, the engine scans the input dataset and quantifies the maximum precision loss that Float32 narrowing would introduce. If the loss exceeds a configurable tolerance (default: 0 ULP for exact sorts, user-configurable for approximate workloads), the GPU tier is excluded from dispatch candidates.

Layer 2: Bit-transform. For datasets that pass precision analysis, every float is converted to a sort-order-preserving unsigned integer. This is the core of the system and the subject of this article.

Layer 3: Sort execution. On the CPU and Web Worker tiers, the transformed unsigned integers are sorted using a 4-pass LSD radix-256 sort (one pass per byte of a 32-bit integer). Each pass is a stable counting sort. The entire operation is O(n) with a small constant factor. On the GPU tier, a two-phase bitonic sort + rank merge operates on the transformed unsigned integers.

Layer 4: Adaptive dispatch. A seven-factor scoring model selects the optimal backend (CPU single-thread, Web Workers, or WebGPU compute) based on dataset size, data type, hardware classification, and self-calibrating crossover thresholds derived from runtime microbenchmarks.

The bit-transform in Layer 2 is what makes the rest possible. Without it, you cannot radix sort floats or use integer comparisons in GPU kernels. With it, you have reduced a comparison-based O(n log n) problem to one that can be solved with non-comparison distribution sorts on the CPU and data-oblivious sorting networks on the GPU.
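In simplified form, the Layer 4 dispatch reduces to threshold checks. The sketch below uses only two of the seven factors (dataset size and GPU class); the GPU crossovers match the measured figures reported later in this article, while the 100,000-element worker threshold is illustrative, not a measured value:

```typescript
type Backend = "cpu" | "workers" | "gpu";
type GpuClass = "none" | "integrated" | "discrete";

// Hypothetical two-factor dispatcher; the production model scores seven
// factors and self-calibrates its thresholds from runtime microbenchmarks.
function chooseBackend(n: number, gpu: GpuClass): Backend {
  // Measured GPU crossovers: ~400k elements (discrete), ~1.2M (integrated)
  if (gpu === "discrete" && n >= 400_000) return "gpu";
  if (gpu === "integrated" && n >= 1_200_000) return "gpu";
  // Illustrative worker crossover: below this, coordination overhead dominates
  if (n >= 100_000) return "workers";
  return "cpu";
}
```

For example, `chooseBackend(1_000_000, "discrete")` returns `"gpu"`, while the same dataset on an integrated GPU stays on the worker tier.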

The IEEE 754 float-to-unsigned-integer transform

IEEE 754 single-precision floats encode a number in 32 bits:

[S][EEEEEEEE][MMMMMMMMMMMMMMMMMMMMMMM]
 1    8 bits          23 bits

S is the sign bit. E is the biased exponent. M is the significand (mantissa). The value is (-1)^S * 2^(E-127) * 1.M for normal numbers.

The problem for sorting: if you reinterpret these 32 bits as an unsigned integer, positive floats sort correctly. The exponent field dominates the high bits, and larger exponents mean larger values. Within the same exponent, the significand provides the fine ordering. Positive floats are already in unsigned integer sort order.

Negative floats are not. The sign bit is the most significant bit. A negative float reinterpreted as an unsigned integer appears larger than any positive float. Worse, negative floats sort in reverse order among themselves: -1.0 has a larger unsigned representation than -2.0, but -1.0 is the greater value.

The two-rule transform

The standard float-to-integer transform (Herf, 2001) fixes both issues with two bitwise operations:

Rule 1: Positive floats (sign bit = 0). Flip the sign bit by ORing with 0x80000000.

// Before: 0_EEEEEEEE_MMMMMMMMMMMMMMMMMMMMMMM
// After:  1_EEEEEEEE_MMMMMMMMMMMMMMMMMMMMMMM
transformed = bits | 0x80000000;

This shifts all positive floats above the 2^31 boundary in unsigned integer space. The relative order among positive floats is preserved because only the sign bit changed, and it changed identically for all of them.

Rule 2: Negative floats (sign bit = 1). Flip all 32 bits by bitwise NOT.

// Before: 1_EEEEEEEE_MMMMMMMMMMMMMMMMMMMMMMM
// After:  0_eeeeeeee_mmmmmmmmmmmmmmmmmmmmmmm  (every bit inverted)
transformed = ~bits;

This accomplishes two things simultaneously. It clears the sign bit, placing all negative floats below the 2^31 boundary (below all positive floats). And it reverses the order among negative floats, because flipping all bits in a region that was sorted descending produces ascending order.
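Combined, the two rules reduce to a single branch. A minimal sketch (the scratch buffer and function name are illustrative):

```typescript
// Aliased views over one 4-byte buffer for zero-copy bit reinterpretation
const scratch = new ArrayBuffer(4);
const scratchF32 = new Float32Array(scratch);
const scratchU32 = new Uint32Array(scratch);

function floatToSortableUint32(value: number): number {
  scratchF32[0] = value;          // narrow to Float32, read raw bits as Uint32
  const bits = scratchU32[0];
  return bits & 0x80000000
    ? ~bits >>> 0                 // Rule 2: negative -> flip all 32 bits
    : (bits | 0x80000000) >>> 0;  // Rule 1: positive -> set the sign bit
}
```

Applied to -0.0 and +0.0, this yields 0x7FFFFFFF and 0x80000000 respectively: even the two zeros get distinct, correctly ordered keys.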

Why this preserves total order

After the transform, the unsigned integer space is partitioned:

Float value               Unsigned integer
-Infinity                 0x007FFFFF (smallest)
-MAX_FLOAT                0x00800000
... negative floats ...   ascending
-MIN_SUBNORMAL            0x7FFFFFFE
-0.0                      0x7FFFFFFF
+0.0                      0x80000000
+MIN_SUBNORMAL            0x80000001
... positive floats ...   ascending
+MAX_FLOAT                0xFF7FFFFF
+Infinity                 0xFF800000 (largest)

Every IEEE 754 float, including negative zero, subnormals, and infinities, maps to a unique unsigned integer that preserves the exact total order defined by the IEEE 754 totalOrder predicate. No information is lost. The transform is perfectly reversible by applying the inverse: if the high bit is set, XOR with 0x80000000; if the high bit is clear, bitwise NOT.

NaN handling

IEEE 754 NaN values occupy the exponent region 0xFF with a non-zero significand. After the transform, positive-signed NaN bit patterns map to unsigned integers above 0xFF800000 (above +Infinity), and negative-signed NaN patterns map below 0x007FFFFF (below -Infinity). Our system strips NaN values during the precision analysis pass (Layer 1) before the transform runs. If your dataset contains NaN, the engine detects it, removes NaN elements, sorts the valid subset, and reinserts NaN at a configurable position (start, end, or error). NaN never reaches the GPU.

Radix-256 sort on transformed integers

With floats converted to sort-order-preserving unsigned integers, we can apply radix sort. Our implementation uses base 256 (one byte per pass, four passes for 32-bit integers).

Single pass structure

Each pass processes one byte position (byte 0 through byte 3, least significant first):

  1. Histogram. Count the frequency of each byte value (0 to 255) across all elements. This produces a 256-entry histogram.
  2. Prefix sum. Compute the exclusive prefix sum of the histogram. Entry i now holds the starting index for all elements with byte value i.
  3. Scatter. Write each element to its destination index based on the prefix sum, then increment the counter.

Each pass is stable: elements with the same byte value retain their relative order from the previous pass. After four stable passes (LSB to MSB), the array is fully sorted.

Why radix-256

Radix-2 (1 bit per pass) requires 32 passes. Radix-16 requires 8. Radix-256 requires 4. Radix-65536 requires 2 but needs a 65,536-entry histogram, which blows out L1 cache on most hardware and causes cache thrashing that negates the pass reduction.

Radix-256 is the sweet spot: 4 passes with a 256-entry histogram (1 KB) that fits comfortably in L1 cache on every modern CPU and in workgroup shared memory on every GPU.

CPU parallelization via Web Workers

On the CPU tier, the radix-256 sort distributes across a pre-warmed Web Worker pool using SharedArrayBuffer for zero-copy data sharing and Atomics-based signalling. Each worker processes a partition of the input array, building local histograms that are merged via a k-way merge into the final sorted output. For medium-sized datasets (200 to 500,000 integer elements), this tier delivers strong performance without GPU transfer overhead.
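The histogram merge can be sketched as follows (a simplified view of the merge step; the Atomics-based signalling and SharedArrayBuffer plumbing are omitted):

```typescript
// For each byte value b and worker w, compute the global start index for
// worker w's elements with byte value b. Iterating b-major, w-minor keeps
// the pass stable: worker 0's elements land before worker 1's for equal bytes.
function mergeHistograms(locals: Uint32Array[]): Uint32Array[] {
  const offsets = locals.map(() => new Uint32Array(256));
  let running = 0;
  for (let b = 0; b < 256; b++) {
    for (let w = 0; w < locals.length; w++) {
      offsets[w][b] = running;
      running += locals[w][b];
    }
  }
  return offsets;
}
```

Each worker then scatters its own partition using its private offset table, with no write contention between workers.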

GPU execution tier

On the GPU, the system does not use radix sort. Instead, it employs a two-phase algorithm described in detail in our article on Array.prototype.sort() performance limits: a local bitonic sort processing 256-element tiles entirely in workgroup shared memory, followed by a global parallel binary-search rank merge that computes each element's final position. This architecture maps naturally to WebGPU's SIMT execution model. For 1 million elements on a discrete GPU, the sort completes in under 10 ms. The equivalent Array.prototype.sort() with a comparator takes over 70 ms on the same machine.

The Float64 precision boundary

The bit-transform operates on 32-bit representations. JavaScript numbers are 64-bit. The precision analysis in Layer 1 determines whether the narrowing from Float64 to Float32 is acceptable for a given dataset.

The analysis works by computing the maximum ULP (Unit in the Last Place) error across the dataset:

// Helper: size of one ULP at f32 in single precision, i.e. the gap to the
// next representable Float32. (Definition not shown in the original; this is
// one standard formulation.)
function ulpSize(f32: number): number {
  const abs = Math.abs(f32);
  if (abs < 2 ** -126) return 2 ** -149; // subnormal range: fixed spacing
  const exp = Math.floor(Math.log2(abs));
  return 2 ** (exp - 23);
}

function maxFloat32Error(data: Float64Array): number {
  let maxUlp = 0;
  for (let i = 0; i < data.length; i++) {
    const f64 = data[i];
    const f32 = Math.fround(f64);
    const ulpError = Math.abs(f64 - f32) / ulpSize(f32);
    if (ulpError > maxUlp) maxUlp = ulpError;
  }
  return maxUlp;
}

Math.fround() performs the exact same narrowing that WebGPU buffer writes perform. By comparing the original Float64 value to its Float32 projection, we quantify the worst-case precision loss.

For integer-valued data (counts, IDs, timestamps as epoch seconds), Float32 represents values up to 2^24 (16,777,216) exactly. Beyond that, precision degrades. For fractional data (prices, coordinates, sensor readings), the precision boundary depends on the magnitude and the number of significant digits.
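The 2^24 boundary is easy to confirm, with Math.fround again standing in for the buffer write:

```typescript
// 2^24 = 16,777,216 is the last point where Float32 represents every
// consecutive integer. Above it, the ULP grows to 2, so odd integers
// round to their even neighbours (ties-to-even).
console.log(Math.fround(16_777_216)); // 16777216: exact
console.log(Math.fround(16_777_217)); // 16777216: one integer lost
```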

If the maximum ULP error exceeds the configured tolerance, the engine skips the GPU tier entirely and sorts using a Float64-native comparison sort on the CPU. No silent precision loss. No incorrect orderings.

Performance characteristics

We benchmarked the Adaptive Multi-Tier Sorting System across 14 device configurations. All numbers are wall-clock time for sorting uniformly distributed random Float32 values:

Elements     GPU bitonic-merge   GPU bitonic-merge     CPU radix-256   Array.prototype.sort()
             (discrete)          (integrated)          (8 workers)
100,000      1.9 ms              2.8 ms                1.4 ms          6.2 ms
500,000      3.4 ms              6.1 ms                5.8 ms          34.1 ms
1,000,000    4.1 ms              9.3 ms                11.2 ms         71.8 ms
5,000,000    11.7 ms             31.4 ms               52.6 ms         389.2 ms
10,000,000   19.8 ms             58.2 ms               103.1 ms        812.4 ms

At 10 million elements, the GPU bitonic-merge path is 41x faster than Array.prototype.sort(). The CPU radix-256 path (also O(n), running across Web Workers) is 7.9x faster. Across the full patent test suite (184 tests across 14 device configurations), the system achieves 5.2x to 45.5x speedup depending on data size and hardware. Even without GPU access, the bit-transform plus radix sort is a massive improvement over comparison-based sorting.

The crossover between CPU radix-256 and GPU bitonic-merge occurs at approximately 400,000 elements on discrete GPUs and 1,200,000 elements on integrated GPUs. Below those thresholds, CPU radix is faster due to zero transfer overhead.

Inverse transform and result extraction

After sorting, the unsigned integers must be converted back to floats. The inverse transform mirrors the forward transform:

// Shared 4-byte buffer with aliased views for zero-copy reinterpretation
const tempBuffer = new ArrayBuffer(4);
const tempUint32 = new Uint32Array(tempBuffer);
const tempFloat32 = new Float32Array(tempBuffer);

function inverseTransform(bits: number): number {
  if (bits & 0x80000000) {
    // Was a positive float: clear the sign bit
    bits = bits ^ 0x80000000;
  } else {
    // Was a negative float: flip all bits
    bits = ~bits;
  }
  // Reinterpret the bit pattern as Float32
  tempUint32[0] = bits >>> 0;
  return tempFloat32[0];
}

Using a shared ArrayBuffer with both Uint32Array and Float32Array views (type-punning through DataView or aliased typed arrays), the reinterpretation is zero-cost. No allocation. No parsing. One memory read through a different view.

On the GPU path, the inverse transform runs as a final compute shader dispatch, writing directly to the output Float32Array buffer that the CPU reads back via mapAsync(). The entire pipeline (forward transform, bitonic sort, rank merge, inverse transform) stays on the GPU with no intermediate CPU round-trips.

Where this sits in our broader architecture

The bit-transform and radix sort are not standalone features. They are the sorting tier of a larger adaptive compute system that includes branch divergence detection to prevent GPU dispatch of incompatible workloads, and hardware-aware calibration to select the optimal backend per-device.

Sorting is one of the most common operations in data processing pipelines: ranking, deduplication, merge joins, percentile calculations, and histogram construction all depend on sorted order. Making the sort itself 5.2x to 45.5x faster propagates through every downstream operation.

This is the engineering approach behind our enterprise AI automation infrastructure. We do not accept framework defaults. We examine the bit-level representation of your data, the architectural constraints of the target hardware, and the theoretical lower bound of the algorithm. Then we build the system that closes the gap between what the hardware can do and what existing tools actually achieve.

Array.prototype.sort() is a 20-million-callback, single-threaded, O(n log n) operation that ignores your data type and your hardware. We replaced it with a zero-callback, multi-tier, O(n) operation that adapts to both. The difference is not incremental. It is structural.

Want to discuss how this applies to your business?

Book a Discovery Call