The problem with Array.prototype.sort()
JavaScript's built-in sort has three fundamental issues that make it unsuitable for performance-critical numerical workloads.
First, it converts to strings by default. Call [10, 9, 1, 2].sort() and you get [1, 10, 2, 9]. The spec requires lexicographic comparison unless you pass a comparator. Every JavaScript developer learns this the hard way, but the deeper issue is that even with a correct (a, b) => a - b comparator, you are still paying the overhead of a callback invocation per comparison. For a million-element array, that is roughly 20 million function calls through the comparator.
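Both points are easy to demonstrate (a minimal illustration, runnable in Node or a browser console):

```javascript
// Default sort coerces elements to strings: lexicographic order, not numeric.
const data = [10, 9, 1, 2];

const lexicographic = data.slice().sort();
// A numeric comparator fixes the order, but pays one callback per comparison.
const numeric = data.slice().sort((a, b) => a - b);

console.log(lexicographic); // [1, 10, 2, 9]
console.log(numeric);       // [1, 2, 9, 10]
```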
Second, it uses a single algorithm. V8 uses TimSort. SpiderMonkey uses a merge sort variant. Both are comparison-based, O(n log n) algorithms optimised for partially sorted data. They are good general-purpose choices, but they are structurally incapable of exploiting the binary representation of IEEE 754 floating-point numbers. A radix sort on floats runs in O(n) time. For large numerical datasets, the comparison-based approach is leaving performance on the table by definition.
Third, there is zero hardware adaptation. Array.prototype.sort() runs on the main thread, on a single CPU core. It has no mechanism to detect whether the device has idle Web Workers available, whether a GPU with compute shader support is present, or whether the dataset is large enough to justify the overhead of dispatching to parallel hardware. It always does the same thing regardless of context.
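For contrast, the kind of capability detection a dispatcher could run looks roughly like this (a sketch using the standard browser APIs; the function name is ours, and it degrades to "no workers, no GPU" outside a browser):

```javascript
// Sketch: detect parallel hardware available to a sorting dispatcher.
// Outside a browser (e.g. plain Node), this reports no Workers and no GPU.
async function detectCapabilities() {
  const hasWorkers = typeof Worker !== 'undefined';
  const cores =
    (typeof navigator !== 'undefined' && navigator.hardwareConcurrency) || 1;
  let hasGPU = false;
  if (typeof navigator !== 'undefined' && 'gpu' in navigator) {
    // WebGPU: a null adapter means no usable GPU.
    hasGPU = (await navigator.gpu.requestAdapter()) !== null;
  }
  return { hasWorkers, cores, hasGPU };
}
```

Array.prototype.sort() consults none of this; an adaptive engine has to.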
These are not bugs. They are design constraints of a 25-year-old API that was never intended for data-intensive computation. But as browsers gain access to serious parallel hardware via WebGPU, the gap between what sort() does and what the platform is capable of has become difficult to ignore.
The three-tier adaptive architecture
We built a sorting engine with three execution tiers:
Tier 1: CPU single-thread. For small datasets (under ~50,000 elements), the overhead of transferring data to a worker or GPU exceeds the computation time. Tier 1 runs an optimised in-place radix sort directly on the main thread. No allocation, no transfer, no overhead.
Tier 2: Web Worker parallel. For medium datasets (~50,000 to ~500,000 elements), the engine spawns a pool of Web Workers and distributes chunks of the array across them. Each worker performs a local radix sort on its chunk, and the results are merged back on the main thread using a k-way merge. The parallelism gain outweighs the structured clone transfer cost at this range.
Tier 3: WebGPU compute shaders. For large datasets (over ~500,000 elements), the engine dispatches the entire sort to the GPU via WebGPU compute shaders. The sort is implemented as a parallel radix sort: each pass processes one byte of the transformed key, using workgroup shared memory for local histogram accumulation and a parallel prefix sum for global offset calculation. The GPU's thousands of concurrent threads make this the fastest tier for large n, despite the CPU-to-GPU transfer overhead.
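The merge step in Tier 2 can be sketched as a k-way merge over the workers' sorted chunks (a minimal linear-scan version for illustration; for a large number of chunks a binary heap over the chunk heads would be the usual choice):

```javascript
// k-way merge of pre-sorted chunks into one sorted output.
// Linear scan over chunk heads: simple, O(total * k).
function kWayMerge(chunks) {
  const heads = chunks.map(() => 0);
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const out = new Float64Array(total);
  for (let i = 0; i < total; i++) {
    let best = -1;
    for (let k = 0; k < chunks.length; k++) {
      if (
        heads[k] < chunks[k].length &&
        (best === -1 || chunks[k][heads[k]] < chunks[best][heads[best]])
      ) {
        best = k;
      }
    }
    out[i] = chunks[best][heads[best]++];
  }
  return out;
}
```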
The tier selection is not hardcoded. It is determined by a scoring function that evaluates dataset size, element count, available hardware (does the device support WebGPU? Are workers available?), and empirically calibrated crossover points. The thresholds were derived from benchmarking across a range of devices, from low-end Chromebooks to M-series MacBooks to discrete NVIDIA GPUs.
Why not always use the GPU?
Because GPU dispatch has fixed overhead. Uploading data to a GPU storage buffer, dispatching compute passes, and reading results back involves latency that is independent of dataset size. For 1,000 elements, this overhead dominates. You would be waiting 2ms for a transfer to complete a sort that takes 0.05ms on CPU.
Our scoring function includes a mechanism we call categorical GPU inhibition: if dataset characteristics predict pathological GPU behaviour (specifically, SIMD branch divergence from highly non-uniform data distributions, or atomic contention from histogram collisions), the GPU score is set to negative infinity. The CPU or Worker tier is selected regardless of dataset size. This prevents the common mistake of blindly offloading everything to the GPU and getting worse performance than a single-threaded baseline.
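The shape of such a scoring function, including the negative-infinity inhibition, can be sketched as follows. The thresholds and the divergence heuristic here are illustrative stand-ins, not the engine's calibrated values:

```javascript
// Sketch of tier scoring with categorical GPU inhibition.
// All constants are illustrative; real thresholds come from calibration.
function selectTier(n, { hasGPU, workerCount, divergenceRisk }) {
  const scores = {
    cpu: 1 / (1 + n / 50_000), // dominates for small n: zero dispatch overhead
    workers: workerCount > 1 ? n / 500_000 : -Infinity,
    gpu: hasGPU ? n / 250_000 : -Infinity,
  };
  // Categorical inhibition: pathological GPU behaviour predicted,
  // so the GPU is out regardless of dataset size.
  if (divergenceRisk) scores.gpu = -Infinity;
  return Object.entries(scores).reduce((a, b) => (b[1] > a[1] ? b : a))[0];
}
```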
The IEEE 754 float-to-integer transform
Radix sort operates on unsigned integers. It examines keys one byte at a time, from least significant to most significant, bucketing elements by each byte's value. This works naturally on Uint32 data. It does not work directly on floating-point data because the IEEE 754 bit layout does not correspond to numerical sort order for all values.
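On Uint32 data the algorithm looks like this (a straightforward LSD radix sort; the engine's in-place variant is presumably more tuned, but the structure is the same):

```javascript
// LSD radix sort on Uint32 keys: four passes, one byte per pass.
// Each pass is a stable counting sort -- no comparisons anywhere.
function radixSortUint32(keys) {
  let src = Uint32Array.from(keys);
  let dst = new Uint32Array(src.length);
  for (let shift = 0; shift < 32; shift += 8) {
    const count = new Uint32Array(256);
    for (let i = 0; i < src.length; i++) count[(src[i] >>> shift) & 0xff]++;
    // Exclusive prefix sum turns counts into bucket start offsets.
    let sum = 0;
    for (let b = 0; b < 256; b++) {
      const c = count[b];
      count[b] = sum;
      sum += c;
    }
    for (let i = 0; i < src.length; i++) {
      dst[count[(src[i] >>> shift) & 0xff]++] = src[i];
    }
    [src, dst] = [dst, src]; // ping-pong buffers between passes
  }
  return src; // four passes (even), so src holds the sorted result
}
```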
Specifically:
- Positive floats, when interpreted as unsigned integers, are in the correct sort order. The sign bit is 0, the exponent occupies the high bits, and the mantissa occupies the low bits. Larger floats have larger unsigned integer representations.
- Negative floats are in reverse order. The sign bit is 1 (making the unsigned integer large), but a more negative float has a larger magnitude, meaning a larger unsigned integer representation.
For example, -1.0 sorts after -2.0 when interpreted as unsigned, which is wrong.
The transform is:
if sign bit is 0: flip the sign bit (0 -> 1)
if sign bit is 1: flip all bits
After this transform, all floats (positive, negative, zero, subnormals, infinity) are in correct unsigned integer sort order. The inverse is just as cheap, but it branches on the transformed sign bit the other way around: if the transformed sign bit is 1, the original value was non-negative, so flip only the sign bit back; if it is 0, the original was negative, so flip all bits. Apply the inverse after sorting and you recover the original float bit patterns.
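In JavaScript the transform and its inverse can be written with a shared buffer to reinterpret bits (shown here for Float32/Uint32; the same idea applies to Float64 with BigUint64Array):

```javascript
// Reinterpret float bits as integer bits via a shared ArrayBuffer.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

function floatToKey(x) {
  f32[0] = x;
  const bits = u32[0];
  // sign bit 0 (non-negative): flip sign bit; sign bit 1 (negative): flip all
  return (bits & 0x80000000) === 0 ? (bits ^ 0x80000000) >>> 0 : ~bits >>> 0;
}

function keyToFloat(key) {
  // Inverse branches on the transformed sign bit: 1 means the original
  // was non-negative (undo the sign flip), 0 means negative (undo ~).
  u32[0] = (key & 0x80000000) !== 0 ? (key ^ 0x80000000) >>> 0 : ~key >>> 0;
  return f32[0];
}
```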
The Float64 to Float32 constraint
WebGPU compute shaders operate on 32-bit data types. JavaScript numbers are Float64. This means we need to handle the precision reduction carefully.
Our approach: we do not sort the Float64 values directly on the GPU. Instead, we:
- Convert each Float64 to a sort key by applying the bit transform to produce a Uint32. For values within Float32 range, this is lossless. For values outside Float32 range, we use a rank-preserving quantisation that maps the Float64 value to the nearest Float32 representable value while preserving relative ordering.
- Sort the Uint32 keys on the GPU using radix sort.
- Apply the resulting permutation to the original Float64 array on the CPU.
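The pipeline can be sketched end-to-end on the CPU. Here Math.fround stands in for the Float64-to-Float32 step, and an index sort by key stands in for the GPU radix sort; only the structure (keys, permutation, scatter) mirrors the real engine:

```javascript
// Key/permutation hybrid sort, CPU sketch.
function hybridSort(values /* array of Float64 numbers */) {
  // 1. Build 32-bit sort keys (fround models the Float32 quantisation).
  const keys = values.map((v) => {
    const f = new Float32Array([Math.fround(v)]);
    const bits = new Uint32Array(f.buffer)[0];
    return (bits & 0x80000000) === 0 ? (bits ^ 0x80000000) >>> 0 : ~bits >>> 0;
  });
  // 2. Sort indices by key -- this is the part the GPU radix sort would do.
  const perm = keys.map((_, i) => i).sort((a, b) => keys[a] - keys[b]);
  // 3. Apply the permutation to the original Float64 values: linear scatter.
  return perm.map((i) => values[i]);
}
```

Step 3 is why the full-precision data never leaves the CPU: the GPU only ever sees the 32-bit keys.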
This hybrid approach means the GPU never touches the full-precision data. It sorts 32-bit keys, which is where the GPU is fastest, and the CPU applies the permutation, which is a linear-time scatter operation. The total work is O(n) for the GPU sort plus O(n) for the CPU permutation, which is still O(n) overall.
For datasets where all values fit within Float32 range (which covers the vast majority of real-world numerical data), the transform is exact. For extreme values, the quantisation introduces at most 1 ULP (unit in the last place) of key error, which in practice affects only the relative ordering of values that differ by less than Float32 epsilon. In our benchmarks across financial, scientific, and sensor datasets, this has never produced an incorrect sort.
Benchmarks
We benchmarked the three tiers against Array.prototype.sort() with a correct numerical comparator across dataset sizes from 1,000 to 10,000,000 elements. All benchmarks ran in Chrome 125 on three hardware profiles:
M2 MacBook Air (integrated GPU, 8-core CPU):
| Elements | Array.sort() | Tier 1 (CPU) | Tier 2 (Workers) | Tier 3 (GPU) |
|---|---|---|---|---|
| 10,000 | 1.2ms | 0.4ms | 2.1ms (overhead) | 3.8ms (overhead) |
| 100,000 | 14ms | 4.1ms | 3.2ms | 8.1ms (overhead) |
| 500,000 | 78ms | 21ms | 14ms | 18ms |
| 1,000,000 | 162ms | 43ms | 28ms | 24ms |
| 5,000,000 | 890ms | 220ms | 135ms | 152ms |
| 10,000,000 | 1,840ms | 450ms | 270ms | 195ms |
Desktop (RTX 4070, 12-core CPU):
| Elements | Array.sort() | Tier 1 (CPU) | Tier 2 (Workers) | Tier 3 (GPU) |
|---|---|---|---|---|
| 500,000 | 72ms | 19ms | 11ms | 8.2ms |
| 1,000,000 | 155ms | 40ms | 22ms | 11ms |
| 10,000,000 | 1,750ms | 430ms | 210ms | 121ms |
The crossover point where GPU becomes faster than Workers varies by hardware:
- Integrated GPU (M2): ~800,000 elements
- Discrete GPU (RTX 4070): ~250,000 elements
- Low-end (Intel UHD 620): ~2,000,000 elements (and GPU tier is only 1.12x faster at 10M)
The adaptive scoring function's calibrated thresholds track these crossover points. When GPU advantage is marginal (under 1.1x), the scoring function penalises GPU selection to avoid the risk of regression on slightly different data distributions.
Key findings
- Tier 1 (CPU radix sort) is 3x to 4x faster than Array.prototype.sort() across all sizes. The comparison-free radix sort eliminates callback overhead entirely.
- Tier 2 (Workers) peaks at 5x to 8x advantage for medium datasets, where parallelism helps but GPU transfer overhead does not yet pay off.
- Tier 3 (GPU) delivers 1.12x to 1.45x advantage over Tier 2 for datasets over 500,000 elements on discrete GPUs, scaling to 1.7x at 10M elements.
- The adaptive dispatcher never selected a tier slower than the best fixed choice. In our full benchmark suite (47 hardware/dataset combinations), the adaptive selection matched or beat every fixed-tier strategy.
The Float32 constraint is also the GPU's advantage
A counterintuitive finding: the 32-bit constraint that forces us into the key-permutation hybrid is actually a performance advantage. Sorting 32-bit keys means half the memory bandwidth of sorting 64-bit values. GPU compute is almost always memory-bandwidth-bound, not compute-bound. By reducing key width, we effectively double the GPU's sorting throughput relative to what a native Float64 radix sort would achieve.
This is why the GPU tier becomes competitive at smaller dataset sizes than you might expect. The 4-byte key fits cleanly into workgroup shared memory, allowing the entire local histogram and prefix sum to execute without spilling to global memory. On a discrete GPU with 48KB of shared memory per workgroup and 256 threads, each radix pass processes the full histogram locally.
Open source
The full implementation is open source. The repository includes:
- The three-tier adaptive sorting engine
- The IEEE 754 float-to-integer transform
- The WebGPU compute shader (WGSL) for parallel radix sort
- The scoring function with empirical calibration data
- A benchmark suite covering 47 hardware/dataset combinations
We built this because we needed it for our own systems, specifically for sorting and filtering large datasets in our adaptive query engine. But the problem it solves is general enough that it should not be locked behind a proprietary codebase.
If you are working on data-intensive browser applications, dashboards with client-side sorting of large datasets, or any use case where Array.prototype.sort() is your bottleneck, this might be useful.
For more on our broader research in GPU-accelerated computing, adaptive algorithms, and the patent portfolio behind this work, see our innovations page.