Ayoob AI

IEEE 754 Bit-Transforms for High-Speed Float Processing in JavaScript


The type system mismatch at the centre of browser compute

JavaScript has one number type: IEEE 754 double-precision (Float64). Every numeric value in your application, from pixel coordinates to financial figures, occupies 64 bits with a 52-bit significand, an 11-bit exponent, and a sign bit.

WebGPU storage buffers and compute shaders operate on 32-bit types. f32 in WGSL gives you a 23-bit significand, an 8-bit exponent, and a sign bit. When you write a JavaScript Float64Array into a WebGPU buffer, the runtime narrows every value from 64 bits to 32 bits.

That narrowing is lossy. The Float64 values 1.0000001 and 1.00000012 are distinct in 64-bit representation. In 32-bit, both round to the same bit pattern. For sorting, this means two elements that were strictly ordered in your source data may become equal after GPU transfer. Your sort is now unstable in a way that has nothing to do with the algorithm.

This is not an edge case. It is the default behaviour of every WebGPU compute pipeline that accepts JavaScript numbers.

Why Array.prototype.sort() is the wrong starting point

Before addressing the Float64-to-Float32 problem, you need to understand why the native sort is unsuitable for high-performance numerical work in the first place.

Array.prototype.sort() has three structural problems.

Problem 1: Lexicographic default. Without a comparator, JavaScript sorts by string conversion. [10, 9, 1, 2].sort() produces [1, 10, 2, 9]. Every JavaScript developer learns this. Fewer realize the deeper cost: even with a correct (a, b) => a - b comparator, every comparison is a function call through the engine's callback machinery.

Problem 2: Comparison overhead. V8's TimSort implementation performs approximately n * log2(n) comparisons. For 1 million elements, that is roughly 20 million comparator invocations. Each invocation crosses the native-to-JS boundary, pushes a stack frame, executes the subtraction, and returns. The comparison function itself is trivial. The call overhead is not.

Problem 3: Single-threaded, single-algorithm. Array.prototype.sort() runs on the main thread using one algorithm (TimSort in V8, merge sort in SpiderMonkey). It cannot dispatch to Web Workers. It cannot dispatch to the GPU. It cannot adapt to the data distribution or the hardware. It is a fixed-strategy, fixed-backend sort.

For a 10-million-element numerical dataset, these constraints translate to 300 ms or more of main-thread blocking. In an enterprise dashboard rendering real-time data, that is a frozen UI.

Our Adaptive Multi-Tier Sorting System

We built a sorting engine that addresses all three problems. The system has four layers.

Layer 1: Precision analysis. Before any sort begins, the engine scans the input dataset and quantifies the maximum precision loss that Float32 narrowing would introduce. If the loss exceeds a configurable tolerance (default: 0 ULP for exact sorts, user-configurable for approximate workloads), the GPU tier is excluded from dispatch candidates.

Layer 2: Bit-transform. For datasets that pass precision analysis, every float is converted to a sort-order-preserving unsigned integer. This is the core of the system and the subject of this article.

Layer 3: Sort execution. On the CPU and Web Worker tiers, the transformed unsigned integers are sorted using a 4-pass LSD radix-256 sort (one pass per byte of a 32-bit integer). Each pass is a stable counting sort. The entire operation is O(n) with a small constant factor. On the GPU tier, a two-phase bitonic sort + rank merge operates on the transformed unsigned integers.

Layer 4: Adaptive dispatch. A seven-factor scoring model selects the optimal backend (CPU single-thread, Web Workers, or WebGPU compute) based on dataset size, data type, hardware classification, and self-calibrating crossover thresholds derived from runtime microbenchmarks.

The bit-transform in Layer 2 is what makes the rest possible. Without it, you cannot radix sort floats or use integer comparisons in GPU kernels. With it, you have reduced a comparison-based O(n log n) problem to one that can be solved with non-comparison distribution sorts on the CPU and data-oblivious sorting networks on the GPU.
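In simplified form, the Layer 4 dispatch reduces to threshold checks. The sketch below uses only two of the seven factors (dataset size and GPU class); the GPU crossovers match the measured figures reported later in this article, while the 100,000-element worker threshold is illustrative, not a measured value:

```typescript
type Backend = "cpu" | "workers" | "gpu";
type GpuClass = "none" | "integrated" | "discrete";

// Hypothetical two-factor dispatcher; the production model scores seven
// factors and self-calibrates its thresholds from runtime microbenchmarks.
function chooseBackend(n: number, gpu: GpuClass): Backend {
  // Measured GPU crossovers: ~400k elements (discrete), ~1.2M (integrated)
  if (gpu === "discrete" && n >= 400_000) return "gpu";
  if (gpu === "integrated" && n >= 1_200_000) return "gpu";
  // Illustrative worker crossover: below this, coordination overhead dominates
  if (n >= 100_000) return "workers";
  return "cpu";
}
```

For example, `chooseBackend(1_000_000, "discrete")` returns `"gpu"`, while the same dataset on an integrated GPU stays on the worker tier.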

The IEEE 754 float-to-unsigned-integer transform

IEEE 754 single-precision floats encode a number in 32 bits:

[S][EEEEEEEE][MMMMMMMMMMMMMMMMMMMMMMM]
 1    8 bits          23 bits

S is the sign bit. E is the biased exponent. M is the significand (mantissa). The value is (-1)^S * 2^(E-127) * 1.M for normal numbers.

The problem for sorting: if you reinterpret these 32 bits as an unsigned integer, positive floats sort correctly. The exponent field dominates the high bits, and larger exponents mean larger values. Within the same exponent, the significand provides the fine ordering. Positive floats are already in unsigned integer sort order.

Negative floats are not. The sign bit is the most significant bit. A negative float reinterpreted as an unsigned integer appears larger than any positive float. Worse, negative floats sort in reverse order among themselves: -1.0 has a larger unsigned representation than -2.0, but -1.0 is the greater value.

The two-rule transform

The standard float-to-integer transform (Herf, 2001) fixes both issues with two bitwise operations:

Rule 1: Positive floats (sign bit = 0). Flip the sign bit by ORing with 0x80000000.

// Before: 0_EEEEEEEE_MMMMMMMMMMMMMMMMMMMMMMM
// After:  1_EEEEEEEE_MMMMMMMMMMMMMMMMMMMMMMM
transformed = bits | 0x80000000;

This shifts all positive floats above the 2^31 boundary in unsigned integer space. The relative order among positive floats is preserved because only the sign bit changed, and it changed identically for all of them.

Rule 2: Negative floats (sign bit = 1). Flip all 32 bits by bitwise NOT.

// Before: 1_EEEEEEEE_MMMMMMMMMMMMMMMMMMMMMMM
// After:  0_eeeeeeee_mmmmmmmmmmmmmmmmmmmmmmm  (every bit inverted)
transformed = ~bits;

This accomplishes two things simultaneously. It clears the sign bit, placing all negative floats below the 2^31 boundary (below all positive floats). And it reverses the order among negative floats, because flipping all bits in a region that was sorted descending produces ascending order.
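Combined, the two rules reduce to a single branch. A minimal sketch (the scratch buffer and function name are illustrative):

```typescript
// Aliased views over one 4-byte buffer for zero-copy bit reinterpretation
const scratch = new ArrayBuffer(4);
const scratchF32 = new Float32Array(scratch);
const scratchU32 = new Uint32Array(scratch);

function floatToSortableUint32(value: number): number {
  scratchF32[0] = value;          // narrow to Float32, read raw bits as Uint32
  const bits = scratchU32[0];
  return bits & 0x80000000
    ? ~bits >>> 0                 // Rule 2: negative -> flip all 32 bits
    : (bits | 0x80000000) >>> 0;  // Rule 1: positive -> set the sign bit
}
```

Applied to -0.0 and +0.0, this yields 0x7FFFFFFF and 0x80000000 respectively: even the two zeros get distinct, correctly ordered keys.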

Why this preserves total order

After the transform, the unsigned integer space is partitioned:

Float value               Unsigned integer
-Infinity                 0x007FFFFF (smallest)
-MAX_FLOAT                0x00800000
... negative floats ...   ascending
-MIN_SUBNORMAL            0x7FFFFFFE
-0.0                      0x7FFFFFFF
+0.0                      0x80000000
+MIN_SUBNORMAL            0x80000001
... positive floats ...   ascending
+MAX_FLOAT                0xFF7FFFFF
+Infinity                 0xFF800000 (largest)

Every IEEE 754 float, including negative zero, subnormals, and infinities, maps to a unique unsigned integer that preserves the exact total order defined by the IEEE 754 totalOrder predicate. No information is lost. The transform is perfectly reversible by applying the inverse: if the high bit is set, XOR with 0x80000000; if the high bit is clear, bitwise NOT.

NaN handling

IEEE 754 NaN values occupy the exponent region 0xFF with a non-zero significand. After the transform, positive-signed NaN bit patterns map to unsigned integers above 0xFF800000 (above +Infinity), and negative-signed NaN patterns map below 0x007FFFFF (below -Infinity). Our system strips NaN values during the precision analysis pass (Layer 1) before the transform runs. If your dataset contains NaN, the engine detects it, removes NaN elements, sorts the valid subset, and reinserts NaN at a configurable position (start, end, or error). NaN never reaches the GPU.

Radix-256 sort on transformed integers

With floats converted to sort-order-preserving unsigned integers, we can apply radix sort. Our implementation uses base 256 (one byte per pass, four passes for 32-bit integers).

Single pass structure

Each pass processes one byte position (byte 0 through byte 3, least significant first):

  1. Histogram. Count the frequency of each byte value (0 to 255) across all elements. This produces a 256-entry histogram.
  2. Prefix sum. Compute the exclusive prefix sum of the histogram. Entry i now holds the starting index for all elements with byte value i.
  3. Scatter. Write each element to its destination index based on the prefix sum, then increment the counter.

Each pass is stable: elements with the same byte value retain their relative order from the previous pass. After four stable passes (LSB to MSB), the array is fully sorted.

Why radix-256

Radix-2 (1 bit per pass) requires 32 passes. Radix-16 requires 8. Radix-256 requires 4. Radix-65536 requires 2 but needs a 65,536-entry histogram, which blows out L1 cache on most hardware and causes cache thrashing that negates the pass reduction.

Radix-256 is the sweet spot: 4 passes with a 256-entry histogram (1 KB) that fits comfortably in L1 cache on every modern CPU and in workgroup shared memory on every GPU.

CPU parallelization via Web Workers

On the CPU tier, the radix-256 sort distributes across a pre-warmed Web Worker pool using SharedArrayBuffer for zero-copy data sharing and Atomics-based signalling. Each worker processes a partition of the input array, building local histograms that are merged via a k-way merge into the final sorted output. For medium-sized datasets (200 to 500,000 integer elements), this tier delivers strong performance without GPU transfer overhead.
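The histogram merge can be sketched as follows (a simplified view of the merge step; the Atomics-based signalling and SharedArrayBuffer plumbing are omitted):

```typescript
// For each byte value b and worker w, compute the global start index for
// worker w's elements with byte value b. Iterating b-major, w-minor keeps
// the pass stable: worker 0's elements land before worker 1's for equal bytes.
function mergeHistograms(locals: Uint32Array[]): Uint32Array[] {
  const offsets = locals.map(() => new Uint32Array(256));
  let running = 0;
  for (let b = 0; b < 256; b++) {
    for (let w = 0; w < locals.length; w++) {
      offsets[w][b] = running;
      running += locals[w][b];
    }
  }
  return offsets;
}
```

Each worker then scatters its own partition using its private offset table, with no write contention between workers.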

GPU execution tier

On the GPU, the system does not use radix sort. Instead, it employs a two-phase algorithm described in detail in our article on Array.prototype.sort() performance limits: a local bitonic sort processing 256-element tiles entirely in workgroup shared memory, followed by a global parallel binary-search rank merge that computes each element's final position. This architecture maps naturally to WebGPU's SIMT execution model. For 1 million elements on a discrete GPU, the sort completes in under 10 ms. The equivalent Array.prototype.sort() with a comparator takes over 70 ms on the same machine.

The Float64 precision boundary

The bit-transform operates on 32-bit representations. JavaScript numbers are 64-bit. The precision analysis in Layer 1 determines whether the narrowing from Float64 to Float32 is acceptable for a given dataset.

The analysis works by computing the maximum ULP (Unit in the Last Place) error across the dataset:

// Helper: size of one ULP at f32 in single precision, i.e. the gap to the
// next representable Float32. (Definition not shown in the original; this is
// one standard formulation.)
function ulpSize(f32: number): number {
  const abs = Math.abs(f32);
  if (abs < 2 ** -126) return 2 ** -149; // subnormal range: fixed spacing
  const exp = Math.floor(Math.log2(abs));
  return 2 ** (exp - 23);
}

function maxFloat32Error(data: Float64Array): number {
  let maxUlp = 0;
  for (let i = 0; i < data.length; i++) {
    const f64 = data[i];
    const f32 = Math.fround(f64);
    const ulpError = Math.abs(f64 - f32) / ulpSize(f32);
    if (ulpError > maxUlp) maxUlp = ulpError;
  }
  return maxUlp;
}

Math.fround() performs the exact same narrowing that WebGPU buffer writes perform. By comparing the original Float64 value to its Float32 projection, we quantify the worst-case precision loss.

For integer-valued data (counts, IDs, timestamps as epoch seconds), Float32 represents values up to 2^24 (16,777,216) exactly. Beyond that, precision degrades. For fractional data (prices, coordinates, sensor readings), the precision boundary depends on the magnitude and the number of significant digits.
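The 2^24 boundary is easy to confirm, with Math.fround again standing in for the buffer write:

```typescript
// 2^24 = 16,777,216 is the last point where Float32 represents every
// consecutive integer. Above it, the ULP grows to 2, so odd integers
// round to their even neighbours (ties-to-even).
console.log(Math.fround(16_777_216)); // 16777216: exact
console.log(Math.fround(16_777_217)); // 16777216: one integer lost
```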

If the maximum ULP error exceeds the configured tolerance, the engine skips the GPU tier entirely and sorts using a Float64-native comparison sort on the CPU. No silent precision loss. No incorrect orderings.

Performance characteristics

We benchmarked the Adaptive Multi-Tier Sorting System across 14 device configurations. All numbers are wall-clock time for sorting uniformly distributed random Float32 values:

Elements     GPU bitonic-merge   GPU bitonic-merge     CPU radix-256   Array.prototype.sort()
             (discrete)          (integrated)          (8 workers)
100,000      1.9 ms              2.8 ms                1.4 ms          6.2 ms
500,000      3.4 ms              6.1 ms                5.8 ms          34.1 ms
1,000,000    4.1 ms              9.3 ms                11.2 ms         71.8 ms
5,000,000    11.7 ms             31.4 ms               52.6 ms         389.2 ms
10,000,000   19.8 ms             58.2 ms               103.1 ms        812.4 ms

At 10 million elements, the GPU bitonic-merge path is 41x faster than Array.prototype.sort(). The CPU radix-256 path (also O(n), running across Web Workers) is 7.9x faster. Across the full patent test suite (184 tests across 14 device configurations), the system achieves 5.2x to 45.5x speedup depending on data size and hardware. Even without GPU access, the bit-transform plus radix sort is a massive improvement over comparison-based sorting.

The crossover between CPU radix-256 and GPU bitonic-merge occurs at approximately 400,000 elements on discrete GPUs and 1,200,000 elements on integrated GPUs. Below those thresholds, CPU radix is faster due to zero transfer overhead.

Inverse transform and result extraction

After sorting, the unsigned integers must be converted back to floats. The inverse transform mirrors the forward transform:

// Shared 4-byte buffer with aliased views for zero-copy reinterpretation
const tempBuffer = new ArrayBuffer(4);
const tempUint32 = new Uint32Array(tempBuffer);
const tempFloat32 = new Float32Array(tempBuffer);

function inverseTransform(bits: number): number {
  if (bits & 0x80000000) {
    // Was a positive float: clear the sign bit
    bits = bits ^ 0x80000000;
  } else {
    // Was a negative float: flip all bits
    bits = ~bits;
  }
  // Reinterpret the bit pattern as Float32
  tempUint32[0] = bits >>> 0;
  return tempFloat32[0];
}

Using a shared ArrayBuffer with both Uint32Array and Float32Array views (type-punning through DataView or aliased typed arrays), the reinterpretation is zero-cost. No allocation. No parsing. One memory read through a different view.

On the GPU path, the inverse transform runs as a final compute shader dispatch, writing directly to the output Float32Array buffer that the CPU reads back via mapAsync(). The entire pipeline (forward transform, bitonic sort, rank merge, inverse transform) stays on the GPU with no intermediate CPU round-trips.

Where this sits in our broader architecture

The bit-transform and radix sort are not standalone features. They are the sorting tier of a larger adaptive compute system that includes branch divergence detection to prevent GPU dispatch of incompatible workloads, and hardware-aware calibration to select the optimal backend per-device.

Sorting is one of the most common operations in data processing pipelines: ranking, deduplication, merge joins, percentile calculations, and histogram construction all depend on sorted order. Making the sort itself 5.2x to 45.5x faster propagates through every downstream operation.

This is the engineering approach behind our enterprise AI automation infrastructure. We do not accept framework defaults. We examine the bit-level representation of your data, the architectural constraints of the target hardware, and the theoretical lower bound of the algorithm. Then we build the system that closes the gap between what the hardware can do and what existing tools actually achieve.

Array.prototype.sort() is a 20-million-callback, single-threaded, O(n log n) operation that ignores your data type and your hardware. We replaced it with a zero-callback, multi-tier, O(n) operation that adapts to both. The difference is not incremental. It is structural.

Want to discuss how this applies to your business?

Book a Discovery Call