The problem with Array.prototype.sort()
JavaScript's built-in sort has three fundamental issues that make it unsuitable for performance-critical numerical workloads.
First, it converts to strings by default. Call [10, 9, 1, 2].sort() and you get [1, 10, 2, 9]. The spec requires lexicographic comparison unless you pass a comparator. Every JavaScript developer learns this the hard way, but the deeper issue is that even with a correct (a, b) => a - b comparator, you are still paying the overhead of a callback invocation per comparison. For a million-element array, that is roughly 20 million function calls through the comparator.
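Both points are easy to demonstrate (a minimal illustration, runnable in Node or a browser console):

```javascript
// Default sort coerces elements to strings: lexicographic order, not numeric.
const data = [10, 9, 1, 2];

const lexicographic = data.slice().sort();
// A numeric comparator fixes the order, but pays one callback per comparison.
const numeric = data.slice().sort((a, b) => a - b);

console.log(lexicographic); // [1, 10, 2, 9]
console.log(numeric);       // [1, 2, 9, 10]
```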
Second, it uses a single algorithm. V8 uses TimSort. SpiderMonkey uses a merge sort variant. Both are comparison-based, O(n log n) algorithms optimised for partially sorted data. They are good general-purpose choices, but they are structurally incapable of exploiting the binary representation of IEEE 754 floating-point numbers. A radix sort on floats runs in O(n) time. For large numerical datasets, the comparison-based approach is leaving performance on the table by definition.
Third, there is zero hardware adaptation. Array.prototype.sort() runs on the main thread, on a single CPU core. It has no mechanism to detect whether the device has idle Web Workers available, whether a GPU with compute shader support is present, or whether the dataset is large enough to justify the overhead of dispatching to parallel hardware. It always does the same thing regardless of context.
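For contrast, the kind of capability detection a dispatcher could run looks roughly like this (a sketch using the standard browser APIs; the function name is ours, and it degrades to "no workers, no GPU" outside a browser):

```javascript
// Sketch: detect parallel hardware available to a sorting dispatcher.
// Outside a browser (e.g. plain Node), this reports no Workers and no GPU.
async function detectCapabilities() {
  const hasWorkers = typeof Worker !== 'undefined';
  const cores =
    (typeof navigator !== 'undefined' && navigator.hardwareConcurrency) || 1;
  let hasGPU = false;
  if (typeof navigator !== 'undefined' && 'gpu' in navigator) {
    // WebGPU: a null adapter means no usable GPU.
    hasGPU = (await navigator.gpu.requestAdapter()) !== null;
  }
  return { hasWorkers, cores, hasGPU };
}
```

Array.prototype.sort() consults none of this; an adaptive engine has to.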
These are not bugs. They are design constraints of a 25-year-old API that was never intended for data-intensive computation. But as browsers gain access to serious parallel hardware via WebGPU, the gap between what sort() does and what the platform is capable of has become difficult to ignore.
The three-tier adaptive architecture
We built a sorting engine with three execution tiers:
Tier 1: CPU single-thread. For small datasets (under ~50,000 elements), the overhead of transferring data to a worker or GPU exceeds the computation time. Tier 1 runs an optimised in-place radix sort directly on the main thread. No allocation, no transfer, no overhead.
Tier 2: Web Worker parallel. For medium datasets (~50,000 to ~500,000 elements), the engine spawns a pool of Web Workers and distributes chunks of the array across them. Each worker performs a local radix sort on its chunk, and the results are merged back on the main thread using a k-way merge. The parallelism gain outweighs the structured clone transfer cost at this range.
Tier 3: WebGPU compute shaders. For large datasets (over ~500,000 elements), the engine dispatches the entire sort to the GPU via WebGPU compute shaders. The sort is implemented as a parallel radix sort: each pass processes one byte of the transformed key, using workgroup shared memory for local histogram accumulation and a parallel prefix sum for global offset calculation. The GPU's thousands of concurrent threads make this the fastest tier for large n, despite the CPU-to-GPU transfer overhead.
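The merge step in Tier 2 can be sketched as a k-way merge over the workers' sorted chunks (a minimal linear-scan version for illustration; for a large number of chunks a binary heap over the chunk heads would be the usual choice):

```javascript
// k-way merge of pre-sorted chunks into one sorted output.
// Linear scan over chunk heads: simple, O(total * k).
function kWayMerge(chunks) {
  const heads = chunks.map(() => 0);
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const out = new Float64Array(total);
  for (let i = 0; i < total; i++) {
    let best = -1;
    for (let k = 0; k < chunks.length; k++) {
      if (
        heads[k] < chunks[k].length &&
        (best === -1 || chunks[k][heads[k]] < chunks[best][heads[best]])
      ) {
        best = k;
      }
    }
    out[i] = chunks[best][heads[best]++];
  }
  return out;
}
```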
The tier selection is not hardcoded. It is determined by a scoring function that evaluates dataset size, element count, available hardware (does the device support WebGPU? Are workers available?), and empirically calibrated crossover points. The thresholds were derived from benchmarking across a range of devices, from low-end Chromebooks to M-series MacBooks to discrete NVIDIA GPUs.
Why not always use the GPU?
Because GPU dispatch has fixed overhead. Uploading data to a GPU storage buffer, dispatching compute passes, and reading results back involves latency that is independent of dataset size. For 1,000 elements, this overhead dominates. You would be waiting 2ms for a transfer to complete a sort that takes 0.05ms on CPU.
Our scoring function includes a mechanism we call categorical GPU inhibition: if dataset characteristics predict pathological GPU behaviour (specifically, SIMD branch divergence from highly non-uniform data distributions, or atomic contention from histogram collisions), the GPU score is set to negative infinity. The CPU or Worker tier is selected regardless of dataset size. This prevents the common mistake of blindly offloading everything to the GPU and getting worse performance than a single-threaded baseline.
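The shape of such a scoring function, including the negative-infinity inhibition, can be sketched as follows. The thresholds and the divergence heuristic here are illustrative stand-ins, not the engine's calibrated values:

```javascript
// Sketch of tier scoring with categorical GPU inhibition.
// All constants are illustrative; real thresholds come from calibration.
function selectTier(n, { hasGPU, workerCount, divergenceRisk }) {
  const scores = {
    cpu: 1 / (1 + n / 50_000), // dominates for small n: zero dispatch overhead
    workers: workerCount > 1 ? n / 500_000 : -Infinity,
    gpu: hasGPU ? n / 250_000 : -Infinity,
  };
  // Categorical inhibition: pathological GPU behaviour predicted,
  // so the GPU is out regardless of dataset size.
  if (divergenceRisk) scores.gpu = -Infinity;
  return Object.entries(scores).reduce((a, b) => (b[1] > a[1] ? b : a))[0];
}
```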
The IEEE 754 float-to-integer transform
Radix sort operates on unsigned integers. It examines keys one byte at a time, from least significant to most significant, bucketing elements by each byte's value. This works naturally on Uint32 data. It does not work directly on floating-point data because the IEEE 754 bit layout does not correspond to numerical sort order for all values.
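On Uint32 data the algorithm looks like this (a straightforward LSD radix sort; the engine's in-place variant is presumably more tuned, but the structure is the same):

```javascript
// LSD radix sort on Uint32 keys: four passes, one byte per pass.
// Each pass is a stable counting sort -- no comparisons anywhere.
function radixSortUint32(keys) {
  let src = Uint32Array.from(keys);
  let dst = new Uint32Array(src.length);
  for (let shift = 0; shift < 32; shift += 8) {
    const count = new Uint32Array(256);
    for (let i = 0; i < src.length; i++) count[(src[i] >>> shift) & 0xff]++;
    // Exclusive prefix sum turns counts into bucket start offsets.
    let sum = 0;
    for (let b = 0; b < 256; b++) {
      const c = count[b];
      count[b] = sum;
      sum += c;
    }
    for (let i = 0; i < src.length; i++) {
      dst[count[(src[i] >>> shift) & 0xff]++] = src[i];
    }
    [src, dst] = [dst, src]; // ping-pong buffers between passes
  }
  return src; // four passes (even), so src holds the sorted result
}
```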
Specifically:
- Positive floats, when interpreted as unsigned integers, are in the correct sort order. The sign bit is 0, the exponent occupies the high bits, and the mantissa occupies the low bits. Larger floats have larger unsigned integer representations.
- Negative floats are in reverse order. The sign bit is 1 (making the unsigned integer large), but a more negative float has a larger magnitude, meaning a larger unsigned integer representation.
For example, -1.0 sorts after -2.0 when interpreted as unsigned, which is wrong.
The transform is:
if sign bit is 0: flip the sign bit (0 -> 1)
if sign bit is 1: flip all bits
After this transform, all floats (positive, negative, zero, subnormals, infinity) are in correct unsigned integer sort order. The inverse is just as cheap, but it branches on the transformed sign bit the other way around: if the transformed sign bit is 1, the original value was non-negative, so flip only the sign bit back; if it is 0, the original was negative, so flip all bits. Apply the inverse after sorting and you recover the original float bit patterns.
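In JavaScript the transform and its inverse can be written with a shared buffer to reinterpret bits (shown here for Float32/Uint32; the same idea applies to Float64 with BigUint64Array):

```javascript
// Reinterpret float bits as integer bits via a shared ArrayBuffer.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

function floatToKey(x) {
  f32[0] = x;
  const bits = u32[0];
  // sign bit 0 (non-negative): flip sign bit; sign bit 1 (negative): flip all
  return (bits & 0x80000000) === 0 ? (bits ^ 0x80000000) >>> 0 : ~bits >>> 0;
}

function keyToFloat(key) {
  // Inverse branches on the transformed sign bit: 1 means the original
  // was non-negative (undo the sign flip), 0 means negative (undo ~).
  u32[0] = (key & 0x80000000) !== 0 ? (key ^ 0x80000000) >>> 0 : ~key >>> 0;
  return f32[0];
}
```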
The Float64 to Float32 constraint
WebGPU compute shaders operate on 32-bit data types. JavaScript numbers are Float64. This means we need to handle the precision reduction carefully.
Our approach: we do not sort the Float64 values directly on the GPU. Instead, we:
- Convert each Float64 to a sort key by applying the bit transform to produce a Uint32. For values within Float32 range, this is lossless. For values outside Float32 range, we use a rank-preserving quantisation that maps the Float64 value to the nearest Float32 representable value while preserving relative ordering.
- Sort the Uint32 keys on the GPU using radix sort.
- Apply the resulting permutation to the original Float64 array on the CPU.
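The pipeline can be sketched end-to-end on the CPU. Here Math.fround stands in for the Float64-to-Float32 step, and an index sort by key stands in for the GPU radix sort; only the structure (keys, permutation, scatter) mirrors the real engine:

```javascript
// Key/permutation hybrid sort, CPU sketch.
function hybridSort(values /* array of Float64 numbers */) {
  // 1. Build 32-bit sort keys (fround models the Float32 quantisation).
  const keys = values.map((v) => {
    const f = new Float32Array([Math.fround(v)]);
    const bits = new Uint32Array(f.buffer)[0];
    return (bits & 0x80000000) === 0 ? (bits ^ 0x80000000) >>> 0 : ~bits >>> 0;
  });
  // 2. Sort indices by key -- this is the part the GPU radix sort would do.
  const perm = keys.map((_, i) => i).sort((a, b) => keys[a] - keys[b]);
  // 3. Apply the permutation to the original Float64 values: linear scatter.
  return perm.map((i) => values[i]);
}
```

Step 3 is why the full-precision data never leaves the CPU: the GPU only ever sees the 32-bit keys.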
This hybrid approach means the GPU never touches the full-precision data. It sorts 32-bit keys, which is where the GPU is fastest, and the CPU applies the permutation, which is a linear-time scatter operation. The total work is O(n) for the GPU sort plus O(n) for the CPU permutation, which is still O(n) overall.
For datasets where all values fit within Float32 range (which covers the vast majority of real-world numerical data), the transform is exact. For extreme values, the quantisation introduces at most 1 ULP (unit in the last place) of key error, which in practice affects only the relative ordering of values that differ by less than Float32 epsilon. In our benchmarks across financial, scientific, and sensor datasets, this has never produced an incorrect sort.
Benchmarks
We benchmarked the three tiers against Array.prototype.sort() with a correct numerical comparator across dataset sizes from 1,000 to 10,000,000 elements. All benchmarks ran in Chrome 125 on three hardware profiles:
M2 MacBook Air (integrated GPU, 8-core CPU):
| Elements | Array.sort() | Tier 1 (CPU) | Tier 2 (Workers) | Tier 3 (GPU) |
|---|---|---|---|---|
| 10,000 | 1.2ms | 0.4ms | 2.1ms (overhead) | 3.8ms (overhead) |
| 100,000 | 14ms | 4.1ms | 3.2ms | 8.1ms (overhead) |
| 500,000 | 78ms | 21ms | 14ms | 18ms |
| 1,000,000 | 162ms | 43ms | 28ms | 24ms |
| 5,000,000 | 890ms | 220ms | 135ms | 152ms |
| 10,000,000 | 1,840ms | 450ms | 270ms | 195ms |
Desktop (RTX 4070, 12-core CPU):
| Elements | Array.sort() | Tier 1 (CPU) | Tier 2 (Workers) | Tier 3 (GPU) |
|---|---|---|---|---|
| 500,000 | 72ms | 19ms | 11ms | 8.2ms |
| 1,000,000 | 155ms | 40ms | 22ms | 11ms |
| 10,000,000 | 1,750ms | 430ms | 210ms | 121ms |
The crossover point where GPU becomes faster than Workers varies by hardware:
- Integrated GPU (M2): ~800,000 elements
- Discrete GPU (RTX 4070): ~250,000 elements
- Low-end (Intel UHD 620): ~2,000,000 elements (and GPU tier is only 1.12x faster at 10M)
The adaptive scoring function's calibrated thresholds track these crossover points. When GPU advantage is marginal (under 1.1x), the scoring function penalises GPU selection to avoid the risk of regression on slightly different data distributions.
Key findings
- Tier 1 (CPU radix sort) is 3x to 4x faster than Array.prototype.sort() across all sizes. The comparison-free radix sort eliminates callback overhead entirely.
- Tier 2 (Workers) peaks at 5x to 8x advantage for medium datasets, where parallelism helps but GPU transfer overhead does not yet pay off.
- Tier 3 (GPU) delivers 1.12x to 1.45x advantage over Tier 2 for datasets over 500,000 elements on discrete GPUs, scaling to 1.7x at 10M elements.
- The adaptive dispatcher never selected a tier slower than the best fixed choice. In our full benchmark suite (47 hardware/dataset combinations), the adaptive selection matched or beat every fixed-tier strategy.
The Float32 constraint is also the GPU's advantage
A counterintuitive finding: the 32-bit constraint that forces us into the key-permutation hybrid is actually a performance advantage. Sorting 32-bit keys means half the memory bandwidth of sorting 64-bit values. GPU compute is almost always memory-bandwidth-bound, not compute-bound. By reducing key width, we effectively double the GPU's sorting throughput relative to what a native Float64 radix sort would achieve.
This is why the GPU tier becomes competitive at smaller dataset sizes than you might expect. The 4-byte key fits cleanly into workgroup shared memory, allowing the entire local histogram and prefix sum to execute without spilling to global memory. On a discrete GPU with 48KB of shared memory per workgroup and 256 threads, each radix pass processes the full histogram locally.
Open source
The full implementation is open source. The repository includes:
- The three-tier adaptive sorting engine
- The IEEE 754 float-to-integer transform
- The WebGPU compute shader (WGSL) for parallel radix sort
- The scoring function with empirical calibration data
- A benchmark suite covering 47 hardware/dataset combinations
We built this because we needed it for our own systems, specifically for sorting and filtering large datasets in our adaptive query engine. But the problem it solves is general enough that it should not be locked behind a proprietary codebase.
If you are working on data-intensive browser applications, dashboards with client-side sorting of large datasets, or any use case where Array.prototype.sort() is your bottleneck, this might be useful.
For more on our broader research in GPU-accelerated computing, adaptive algorithms, and the patent portfolio behind this work, see our innovations page.