What happens when you call .sort()
Every JavaScript developer has called Array.prototype.sort(). Few have profiled what it actually does to a CPU at scale.
V8 (Chrome, Edge, Node.js) implements Array.prototype.sort() using TimSort, a hybrid of merge sort and insertion sort. SpiderMonkey (Firefox) uses a merge sort variant. Both are comparison-based, O(n log n) algorithms. Both are stable. Both are reasonable choices for general-purpose sorting.
They are not reasonable choices for sorting millions of numeric values in a latency-sensitive application.
The comparator callback problem
For numeric sorting, you must provide a comparator:
data.sort((a, b) => a - b);
This comparator is a JavaScript function. V8 calls it for every comparison the algorithm makes. TimSort performs approximately n * log2(n) comparisons in the average case and up to 2n * log2(n) in the worst case.
For 1 million elements: ~20 million comparator calls. For 10 million elements: ~230 million.
Each call is not a simple subtraction. The engine must:
- Push a stack frame for the comparator function.
- Marshal arguments. The two elements are passed as JavaScript values, requiring type tagging and potential boxing.
- Execute the subtraction. This is the only useful work.
- Return the result. The return value must be unboxed from a JavaScript number to a native comparison result (negative, zero, positive).
- Pop the stack frame.
V8's optimizing compiler (TurboFan) mitigates some of this overhead by inlining simple comparators. But inlining is not guaranteed. If the call site sees more than one comparator function, if the array contains mixed types, or if the function is too complex for the inliner's heuristics, every call pays the full overhead.
On a modern laptop CPU, 230 million full comparator calls take 600 to 900 ms. Even with aggressive inlining, you are looking at 200 to 400 ms for 10 million elements. That is a frozen UI.
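You can observe the call volume directly. A minimal sketch (the counter itself adds overhead, so treat the timing as indicative, not a benchmark):

```javascript
// Count comparator invocations for a numeric sort. The counter itself
// slows the sort, so the timing is indicative only.
const n = 100_000;
const arr = Array.from({ length: n }, () => Math.random());
let calls = 0;
const t0 = performance.now();
arr.sort((a, b) => { calls += 1; return a - b; });
const elapsed = performance.now() - t0;
console.log(`${calls} comparator calls in ${elapsed.toFixed(1)} ms`);
// Expect on the order of n * log2(n) ≈ 1.7 million calls.
```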
The string coercion trap
Without a comparator, Array.prototype.sort() converts every element to a string and sorts lexicographically. The specification is explicit (ECMA-262, Section 23.1.3.30.2): when comparefn is undefined, both elements are converted with ToString and the resulting strings are compared.
[10, 9, 1, 2].sort();
// Result: [1, 10, 2, 9]
// "10" < "2" because "1" < "2" in Unicode code point order
Every developer knows this. Fewer know the cost. ToString() on a float allocates a string, invokes the number-to-string conversion algorithm (which handles special cases like -0, Infinity, NaN, scientific notation thresholds), and produces a heap-allocated string object. For 10 million numeric elements, that is 10 million string allocations, 10 million number-to-string conversions, and 10 million string comparisons (byte-by-byte, variable-length).
We benchmarked this on V8 (Chrome 124):
| Elements | .sort() (no comparator, string coercion) | .sort((a,b) => a-b) (numeric comparator) |
|---|---|---|
| 100,000 | 48 ms | 12 ms |
| 1,000,000 | 680 ms | 142 ms |
| 10,000,000 | 9,400 ms | 1,820 ms |
At 10 million elements, string coercion is 5.2x slower than a numeric comparator. Both are unacceptable for interactive use. The comparator version freezes the main thread for nearly 2 seconds.
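One escape hatch worth noting: TypedArray.prototype.sort has no string-coercion default. Called without a comparator, it sorts numerically ascending:

```javascript
// Array.prototype.sort coerces to strings by default;
// TypedArray.prototype.sort compares numerically by default.
console.log([10, 9, 1, 2].sort());                               // [1, 10, 2, 9]
console.log(Array.from(new Float64Array([10, 9, 1, 2]).sort())); // [1, 2, 9, 10]
```

This removes the coercion cost but not the single-threaded sort itself, which is why typed arrays are only the starting point for the system described below.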
TimSort's structural overhead
Beyond the callback cost, TimSort carries algorithmic overhead that penalizes large numeric sorts.
Run detection. TimSort scans for naturally occurring sorted runs (ascending or descending). This is excellent for partially sorted data (log timestamps, sequential IDs) but adds a linear scan overhead for random data with no natural runs. On uniformly random Float64 data, nearly every element starts a new run of length 1 or 2. The run detection pass is wasted work.
Galloping merge. TimSort's merge phase uses a galloping (exponential search) strategy to skip large blocks of already-ordered elements. This is a win when merging runs of very different sizes. On random data with small initial runs, the galloping heuristic repeatedly fails and falls back to linear comparison. The branch prediction misses from the galloping/linear switching add measurable overhead.
Memory allocation. TimSort requires O(n/2) temporary storage for the merge phase. For 10 million Float64 elements, that is 40 MB of temporary buffer. The allocation itself is fast, but the memory pressure triggers GC (garbage collection) on constrained devices. We have profiled sort operations on enterprise laptops where V8's GC adds 50 to 150 ms to the sort time due to the temporary buffer.
None of these are bugs. TimSort is a well-engineered general-purpose sort. The problem is that "general-purpose" means "optimized for no specific case." For numerical data where you know the type at sort time, you can do fundamentally better.
Our bypass: the Adaptive Multi-Tier Sorting System
We built a sorting system that eliminates every source of overhead described above. No comparator callbacks. No string coercion. No serial chain of O(n log n) comparisons. No temporary merge buffers in the JavaScript heap.
The system has been detailed in our article on IEEE 754 bit-transforms. Here we focus on the GPU execution phase, specifically the two-stage sort kernel and its near-linear wall-clock scaling on WebGPU hardware.
Stage overview
The GPU sort proceeds in two stages:
Stage 1: Local bitonic sort. Each workgroup sorts a 256-element tile entirely within shared memory. No global memory access after the initial load and before the final store.
Stage 2: Global rank merge. The sorted tiles are merged into a fully sorted output using a parallel binary-search rank computation. In each merge round, every element finds its new position with a single O(log n) binary search.
Both stages operate on unsigned integers produced by the IEEE 754 float-to-unsigned-integer transform (Herf, 2001). The transform converts Float32 bit patterns into sort-order-preserving u32 values, enabling integer comparisons instead of floating-point comparisons throughout the sort.
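That article covers the transform in detail; as a rough JavaScript sketch of the idea (our kernels implement the equivalent in WGSL, and the exact bit manipulation there may differ):

```javascript
// Map a Float32 bit pattern to a u32 whose unsigned order matches numeric
// order (after Herf, 2001): negatives get all bits flipped, non-negatives
// get only the sign bit flipped.
const scratch = new DataView(new ArrayBuffer(4));
function floatToSortableU32(f) {
  scratch.setFloat32(0, f);
  const bits = scratch.getUint32(0);
  return bits & 0x80000000 ? ~bits >>> 0 : (bits | 0x80000000) >>> 0;
}
```

After this mapping, plain unsigned integer comparisons order -Infinity < -1.5 < -0 < 0 < 1.5 < Infinity correctly; the inverse transform restores the floats after the sort.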
Stage 1: Local bitonic sort in shared memory
A bitonic sort is a sorting network: a fixed sequence of compare-and-swap operations that sorts any input regardless of data distribution. The network is oblivious (the comparison pattern does not depend on the data), which makes it ideal for SIMT execution. Every thread performs the same operation at each step. Zero branch divergence.
Why bitonic sort for the local phase
For small arrays (256 elements), bitonic sort performs more operations than optimal comparison sorts: O(n * log^2(n)) versus O(n * log(n)). On a CPU, this extra log factor makes bitonic sort slower. On a GPU, it is effectively free: the operations of each step execute in parallel.
A bitonic sort on 256 elements requires log2(256) * (log2(256) + 1) / 2 = 36 parallel steps. Each step performs 128 compare-and-swap operations (256 elements / 2) simultaneously. With a workgroup size of 256, each thread handles one side of one compare-and-swap, and all 128 swaps of a step execute in lockstep across the workgroup.
36 steps at GPU clock speeds (1 to 2 GHz) complete in under 100 nanoseconds per tile. The data never leaves shared memory between steps.
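The same network, written out sequentially in JavaScript as a reference sketch (the inner loop over i is what the 256 GPU threads execute simultaneously at each step):

```javascript
// Sequential reference of the bitonic sorting network. The inner i-loop
// is what the 256 GPU threads execute in parallel at each step.
function bitonicSortInPlace(a) {
  const n = a.length; // must be a power of two
  let steps = 0;
  for (let k = 2; k <= n; k <<= 1) {
    for (let j = k >> 1; j > 0; j >>= 1) {
      steps += 1; // one fully parallel step on the GPU
      for (let i = 0; i < n; i++) {
        const ixj = i ^ j;
        if (ixj > i) {
          const ascending = (i & k) === 0;
          if (ascending === (a[i] > a[ixj])) {
            const tmp = a[i]; a[i] = a[ixj]; a[ixj] = tmp;
          }
        }
      }
    }
  }
  return steps;
}
```

For a 256-element array this reports 36 steps and leaves the array fully ascending, matching the step count derived above.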
Shared memory layout
Each workgroup allocates 256 * 4 = 1,024 bytes of shared memory for the tile (256 u32 values). On a GPU with 16 KB shared memory per workgroup, this leaves 15 KB free for other uses. The data is loaded from global memory into shared memory via coalesced reads (each thread reads one element at a consecutive address), sorted in place via 36 rounds of parallel compare-and-swap, then written back to global memory via coalesced writes.
The WGSL kernel:
// Assumed storage binding for the input/output buffer.
@group(0) @binding(0) var<storage, read_write> data: array<u32>;

var<workgroup> tile: array<u32, 256>;

@compute @workgroup_size(256)
fn bitonic_local(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>
) {
    let global_idx = wid.x * 256u + lid.x;

    // Coalesced load into shared memory. Out-of-range lanes in the last
    // partial tile load the u32 maximum so padding sorts to the end.
    if (global_idx < arrayLength(&data)) {
        tile[lid.x] = data[global_idx];
    } else {
        tile[lid.x] = 0xFFFFFFFFu;
    }
    workgroupBarrier();

    // Bitonic sort network: 36 steps for 256 elements.
    for (var k: u32 = 2u; k <= 256u; k <<= 1u) {
        for (var j: u32 = k >> 1u; j > 0u; j >>= 1u) {
            let ixj = lid.x ^ j;
            if (ixj > lid.x) {
                let ascending = ((lid.x & k) == 0u);
                let left = tile[lid.x];
                let right = tile[ixj];
                if (ascending == (left > right)) {
                    tile[lid.x] = right;
                    tile[ixj] = left;
                }
            }
            workgroupBarrier();
        }
    }

    // Coalesced store back to global memory.
    if (global_idx < arrayLength(&data)) {
        data[global_idx] = tile[lid.x];
    }
}
For 10 million elements, this dispatches 39,063 workgroups (10,000,000 / 256, rounded up). On a discrete GPU with 3,072 cores, multiple workgroups execute concurrently. The entire local sort phase completes in under 2 ms.
Why 256 elements per tile
Workgroup shared memory is limited (16 to 48 KB depending on the GPU). A 256-element tile of u32 values uses 1 KB, which is well within limits and leaves room for the bitonic network's in-place operations. Increasing to 512 elements per tile doubles memory usage and increases the bitonic network depth from 36 to 45 steps. The marginal benefit (fewer tiles to merge in Stage 2) does not offset the reduced occupancy from higher shared memory consumption.
256 is also the maximum workgroup size on many mobile and integrated GPUs. Using it as the tile size ensures the workgroup size equals the tile size: one thread per element, no idle threads.
Stage 2: Global rank merge via parallel binary search
After Stage 1, the data consists of ceil(n/256) sorted tiles. These tiles must be merged into a single sorted output.
A sequential merge (like TimSort's merge phase) is O(n * log(n/256)) and inherently serial: each merge step depends on the previous. On a GPU, serial merges waste hardware.
Our approach: compute each element's rank (final sorted position) independently and in parallel. If element e is at position i in tile T, its final position in the merged output is:
finalPosition = rankWithinOwnTile + rankInAllOtherTiles
rankWithinOwnTile is simply i (the element is already sorted within its tile). rankInAllOtherTiles is the count of elements across all other tiles that are less than e.
Because each tile is sorted, counting elements less than e in a sorted tile is a binary search: find the insertion point of e and read the index. Binary search on a 256-element sorted array takes at most 8 comparisons (log2(256)).
Pairwise merge, not global merge
Computing rank against all tiles simultaneously would require each element to binary-search every tile: O(n/256) binary searches per element, which is O(n * n/256 * log(256)) total. Quadratic. Unacceptable.
Instead, we merge tiles pairwise in log2(n/256) rounds, doubling the merge width each round. In round 1, tiles 0 and 1 merge, tiles 2 and 3 merge, and so on. In round 2, the merged pairs merge: (0,1) with (2,3), and so on. Each round halves the number of sorted runs.
In each round, every element binary-searches only its merge partner (the adjacent sorted run) to find its rank. The partner run doubles in length each round, so the per-round cost grows from log2(256) = 8 comparisons in the first round to roughly log2(n/2) in the last, over ceil(log2(n/256)) rounds.
Total work per element: O(log^2(n)). For 10 million elements with 256-element tiles, that is 16 rounds with search costs of 8, 9, ..., 23 comparisons: roughly 250 comparisons per element. All elements compute their ranks in parallel.
The stability problem
Bitonic sort within each tile is stable (we enforce stability by breaking ties using the original index). But the global merge must also be stable: elements with equal keys must retain their relative order from the input.
Standard binary search does not guarantee this. If two equal elements exist in partnered tiles, a symmetric binary search may place them in either order.
Asymmetric binary search for stability
We solve this with an asymmetric search condition. When merging tiles A (left) and B (right):
- Elements from tile A (the left partner) use strict less-than (<) when binary-searching in tile B. This finds the first position in B where values are greater than or equal to the element.
- Elements from tile B (the right partner) use less-than-or-equal (<=) when binary-searching in tile A. This finds the first position in A where values are strictly greater than the element.
The effect: for two elements with equal keys where one comes from A and one from B, the A element always ranks before the B element. Since A precedes B in the original input order, this preserves stability.
// Assumed storage binding holding the concatenated sorted runs. (Base WGSL
// does not allow storage-address-space pointers as function parameters, so
// the searches read the bound buffer directly via offset and length.)
@group(0) @binding(1) var<storage, read> runs: array<u32>;

// Rank for LEFT-partner elements: count of values < target in
// runs[offset .. offset + len).
fn binary_search_left(offset: u32, len: u32, target: u32) -> u32 {
    var lo = 0u;
    var hi = len;
    while (lo < hi) {
        let mid = (lo + hi) >> 1u;
        if (runs[offset + mid] < target) { // strict less-than
            lo = mid + 1u;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Rank for RIGHT-partner elements: count of values <= target in
// runs[offset .. offset + len).
fn binary_search_right(offset: u32, len: u32, target: u32) -> u32 {
    var lo = 0u;
    var hi = len;
    while (lo < hi) {
        let mid = (lo + hi) >> 1u;
        if (runs[offset + mid] <= target) { // less-than-or-equal
            lo = mid + 1u;
        } else {
            hi = mid;
        }
    }
    return lo;
}
No auxiliary tie-breaking storage. No secondary key. No post-processing pass. The asymmetry in the comparison operator is sufficient to guarantee stable output.
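Sequentially, the whole rule fits in a few lines of JavaScript (a reference sketch; the GPU version computes each element's rank on its own thread and scatters by rank):

```javascript
// Stable pairwise rank merge, sequential sketch of the asymmetric rule.
function lowerBound(run, target) { // count of elements < target
  let lo = 0, hi = run.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (run[mid] < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}
function upperBound(run, target) { // count of elements <= target
  let lo = 0, hi = run.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (run[mid] <= target) lo = mid + 1; else hi = mid;
  }
  return lo;
}
function rankMerge(a, b) { // run a precedes run b in the original input
  const out = new Array(a.length + b.length);
  a.forEach((v, i) => { out[i + lowerBound(b, v)] = v; }); // left: strict <
  b.forEach((v, i) => { out[i + upperBound(a, v)] = v; }); // right: <=
  return out;
}
```

rankMerge([1, 3, 3, 5], [2, 3, 4]) produces [1, 2, 3, 3, 3, 4, 5], with the left run's equal keys ranked ahead of the right run's and no two elements ever claiming the same output slot.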
End-to-end performance
Full pipeline: IEEE 754 bit-transform, Stage 1 local bitonic, Stage 2 global rank merge, inverse transform. Benchmarked on uniformly random Float32 data:
| Elements | GPU (discrete) | GPU (integrated) | CPU radix-256 (8 workers) | V8 TimSort (comparator) |
|---|---|---|---|---|
| 100,000 | 2.3 ms | 3.8 ms | 1.4 ms | 12 ms |
| 500,000 | 5.4 ms | 9.6 ms | 5.9 ms | 68 ms |
| 1,000,000 | 9.2 ms | 16.8 ms | 11.4 ms | 142 ms |
| 5,000,000 | 28.4 ms | 54.2 ms | 54.1 ms | 810 ms |
| 10,000,000 | 40.0 ms | 84.3 ms | 108.2 ms | 1,820 ms |
At 10 million elements, the GPU sort is 45.5x faster than Array.prototype.sort() with a numeric comparator. The CPU radix-256 fallback (for devices without GPU access) is 16.8x faster.
On discrete hardware, the GPU sort becomes faster than our radix-256 CPU path at dataset sizes above approximately 500,000 elements, and above 1,000,000 on integrated hardware. The two algorithms serve different tiers of our adaptive dispatch system: radix-256 on CPU workers for medium datasets or when precision constraints block GPU dispatch, bitonic-merge on GPU for large datasets where Float32 precision is sufficient.
Where this shows up in your SaaS
If your application sorts data client-side, you are paying the TimSort tax on every interaction. Common patterns:
Table column sorting. User clicks a column header. The table sorts 500,000 rows by that column. TimSort: 68 ms. GPU bitonic-merge: 5.4 ms. The difference between a perceptible delay and instant feedback.
Top-N extraction. Dashboard shows the top 100 accounts by revenue from a 2-million-row dataset. A full sort followed by a slice. TimSort: 290 ms. GPU sort + limit: 14 ms plus 0.01 ms for the slice.
Percentile calculation. P50, P95, P99 latency from 5 million trace spans. Requires a sorted dataset. TimSort: 810 ms. GPU sort: 28.4 ms.
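Once the data is sorted, the percentile read itself is an O(1) index. A sketch using the nearest-rank convention (other conventions interpolate between neighboring ranks):

```javascript
// Nearest-rank percentile on an already-sorted ascending array.
function percentileSorted(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}
```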
Merge joins. Joining two sorted datasets for cross-filtering between linked charts. The sort dominates the join cost. Faster sort means faster joins.
In every case, the sort is not the feature. It is infrastructure that the feature depends on. Making it up to 45.5x faster does not change what the user sees. It changes whether the user waits.
The engineering principle
Array.prototype.sort() is a single algorithm, single-threaded, comparison-based, with a mandatory JavaScript callback per comparison. It cannot use typed arrays natively. It cannot use the GPU. It cannot adapt to data type, dataset size, or hardware capabilities.
We replaced it with a system that selects between multiple sort algorithms across three execution tiers (GPU compute, Web Worker parallel, CPU single-thread) based on a seven-factor dispatch scoring model that incorporates dataset size, data type, precision requirements, hardware classification, and runtime calibration. The CPU tier alone includes sorting networks for tiny arrays, insertion sort, counting sort, LSD radix-256, IEEE 754 float radix sort, and adaptive merge sort. The GPU tier uses a two-phase bitonic-merge sort. The system operates on raw bit patterns through typed arrays. No callbacks. No string coercion. No GC pressure from temporary buffers.
This is representative of how we approach enterprise AI automation infrastructure broadly. Framework defaults are designed for the general case. The general case is not your case. When you profile the bottleneck, understand the hardware, and build the right abstraction, the performance gap between "default" and "engineered" is not 2x. It is 5.2x to 45.5x, depending on data size and hardware.