Ayoob AI

Zero-Copy Parallel Processing with SharedArrayBuffer in JavaScript

15 min read

Web Workers · SharedArrayBuffer · JavaScript · Performance · Parallel Computing

The gap between main thread and GPU

Our adaptive dispatch engine routes computation across three tiers: CPU main thread, Web Worker parallel, and WebGPU compute. The GPU tier handles datasets above 500,000 elements on discrete hardware. The main thread handles datasets below 10,000 elements.

Between those boundaries sits a range where neither tier is optimal.

A 200,000-element sort on the main thread takes 28 ms (V8 TimSort with a numeric comparator). That is nearly two animation frames. The UI stutters visibly.

The same sort dispatched to the GPU takes 2.4 ms of compute but 3.1 ms of buffer transfer (write to GPU, read back results). Total: 5.5 ms. The transfer overhead is 56% of the total time. For a dataset this size, you are spending more time moving data than processing it.

The Web Worker parallel tier fills this gap. It processes data where it already lives (in JavaScript heap memory), avoids the GPU transfer penalty, and spreads the work across multiple CPU cores. For the 10,000 to 500,000 element range, it is the fastest option on most hardware.

But Web Workers have their own overhead problems. Solving them is what makes the parallel tier production-ready rather than a naive postMessage fan-out.

The postMessage serialization problem

The standard way to send data to a Web Worker is postMessage():

worker.postMessage({ data: myFloat32Array });

By default, this uses the structured clone algorithm to copy the data. For a 200,000-element Float32Array (800 KB), the copy takes 0.3 to 0.5 ms. Send it to 8 workers, and you have copied 6.4 MB in 2.4 to 4.0 ms. You have not started computing yet.

Transferable objects (postMessage(data, [data.buffer])) avoid the copy by moving ownership of the ArrayBuffer to the worker. But transfer is exclusive: the source loses access. You cannot transfer the same buffer to multiple workers. For parallel processing where every worker needs read access to the full dataset, transfer does not work.
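The exclusivity is visible in the detach semantics. structuredClone uses the same transfer machinery as postMessage, so the behavior can be demonstrated without spinning up a worker (a single-thread sketch):

```javascript
// Transferring an ArrayBuffer moves ownership and detaches the source.
// structuredClone shares postMessage's transfer semantics, so it makes
// a convenient single-thread demonstration.
const source = new ArrayBuffer(800_000);
const moved = structuredClone(source, { transfer: [source] });

console.log(moved.byteLength);  // 800000 — the receiver owns the bytes
console.log(source.byteLength); // 0 — the source is detached
```

Once detached, every view over the source buffer is dead. That is why a buffer can be transferred to exactly one worker.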

You could partition the data into 8 slices and transfer each slice to a different worker. But then the merge phase requires transferring 8 results back to the main thread, and any operation that needs cross-partition visibility (a global sort, a full-dataset aggregation with shared group keys) requires additional coordination messages.

Every postMessage call, even with transfers, incurs event loop overhead: the message enters the worker's message queue, waits for the current microtask to complete, and fires the onmessage handler. Measured latency per postMessage round-trip: 0.1 to 0.3 ms. For an operation that takes 3 ms of compute, 8 coordination messages add 0.8 to 2.4 ms of communication overhead. The overhead approaches the compute time.

SharedArrayBuffer: zero-copy shared memory

SharedArrayBuffer is an ArrayBuffer that multiple threads can access simultaneously. Unlike a regular ArrayBuffer, it is not copied or transferred. Every worker that receives a reference to a SharedArrayBuffer reads and writes the same physical memory.

// Main thread
const shared = new SharedArrayBuffer(800_000); // 200,000 float32s = 800 KB
const view = new Float32Array(shared);
// Fill view with data...

// Send to all workers. No copy. No transfer. Zero overhead.
for (const worker of workers) {
  worker.postMessage({ buffer: shared, offset: 0, length: 200_000 });
}

The postMessage call sends only the reference (a few bytes). The data stays in place. Eight workers, one copy of the data, zero serialization cost.
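The aliasing is easy to verify: every typed-array view created over the same SharedArrayBuffer reads the same bytes. A single-thread sketch (across workers the behavior is identical):

```javascript
const shared = new SharedArrayBuffer(800_000);

// Two independent views over the same buffer — e.g. one on the main
// thread and one inside a worker — alias the same physical memory.
const mainView = new Float32Array(shared);
const workerView = new Float32Array(shared);

mainView[0] = 42;
console.log(workerView[0]); // 42 — no copy ever happened
```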

This is the foundation of our parallel tier. Every operation on the worker pool uses SharedArrayBuffer for input data, output data, and coordination state.

The security context requirement

SharedArrayBuffer requires the page to be served with specific HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

These headers were mandated after the Spectre vulnerability disclosure to prevent cross-origin timing attacks. Without them, the page is not cross-origin isolated and browsers do not expose SharedArrayBuffer at all: the global is undefined, and referencing it throws a ReferenceError.

Our engine checks for SharedArrayBuffer availability at initialization. If it is unavailable (missing headers, legacy browser, or restricted environment), the parallel tier falls back to postMessage with transferable objects and single-partition-per-worker dispatch. Performance degrades by 20% to 40%, but correctness is unaffected.
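A feature check along these lines (a sketch, not our exact implementation) covers both the missing-header case and legacy browsers. crossOriginIsolated is the standard browser signal that both COOP and COEP are in effect:

```javascript
// Returns true when SharedArrayBuffer is usable in this context.
// In browsers, crossOriginIsolated reflects the COOP/COEP headers;
// non-browser runtimes may not define crossOriginIsolated at all.
function sharedMemoryAvailable() {
  if (typeof SharedArrayBuffer === 'undefined') return false;
  if (typeof crossOriginIsolated !== 'undefined' && !crossOriginIsolated) return false;
  return true;
}
```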

Lock-free synchronization with Atomics

Shared memory without synchronization is a data race. Two workers reading and writing the same memory location without synchronization can observe stale or torn values on some architectures. JavaScript's Atomics API provides the synchronization primitives.

Atomics.wait() and Atomics.notify()

These are the core coordination mechanism. Atomics.wait() parks a thread until another thread writes a specific value to a shared memory location and calls Atomics.notify().

// Worker: park until notified
const control = new Int32Array(sharedControlBuffer);
Atomics.wait(control, WORKER_SLOT, 0);
// Execution resumes here when main thread calls Atomics.notify()

// Main thread: wake the worker
Atomics.store(control, WORKER_SLOT, 1);
Atomics.notify(control, WORKER_SLOT, 1);

Atomics.wait() is a true OS-level thread park, not a busy-wait spin loop. The worker consumes zero CPU cycles while parked. The wake latency (time from Atomics.notify() to the worker resuming execution) is 0.01 to 0.05 ms on modern hardware. Compare this to postMessage wake latency of 0.1 to 0.3 ms.

Why not use postMessage for coordination

postMessage is asynchronous and event-loop-bound. The message sits in the worker's event queue until the current task completes. Atomics.notify() wakes the thread immediately at the OS scheduler level, bypassing the JavaScript event loop. For latency-sensitive operations where every 0.1 ms matters, this difference compounds across multiple synchronization points.

A parallel sort with 8 workers requires at minimum 2 synchronization points per worker (start of work, end of work). With postMessage, that is 16 async messages: 1.6 to 4.8 ms of coordination overhead. With Atomics, that is 16 atomic wake/wait operations: 0.16 to 0.8 ms. On a 200,000-element sort that takes 4 ms of compute, the coordination model determines whether total latency stays near 4.2 ms or climbs to between 5.6 and 8.8 ms.

The pre-warmed worker pool

Creating a Web Worker has a fixed cost. The browser must spawn an OS thread, initialize a JavaScript context, parse and compile the worker script, and execute any setup code. Measured cost: 5 to 15 ms per worker on first creation. On subsequent creations (after the script is cached), 2 to 5 ms.

For an 8-worker pool, cold creation costs 40 to 120 ms. This is acceptable as a one-time startup cost. It is unacceptable as a per-operation cost.

Our engine creates the worker pool at initialization time. Each worker runs a startup sequence that allocates its local state, then parks on Atomics.wait():

// Worker script (simplified). sharedControlBuffer and workerIndex
// arrive in a one-time init postMessage before this loop starts.
const control = new Int32Array(sharedControlBuffer);
const WORKER_ID = self.workerIndex;

// Initialization complete. Park until work arrives.
while (true) {
  Atomics.wait(control, WORKER_ID, 0); // Parked. Zero CPU usage.

  // Woken up. Read task descriptor from shared memory.
  const task = readTaskDescriptor(WORKER_ID);

  // Execute the task.
  executeTask(task);

  // Signal completion.
  Atomics.store(control, WORKER_ID, 0);
  Atomics.notify(control, COMPLETION_SLOT, 1);
}

The pool is sized to navigator.hardwareConcurrency (typically 4 to 16 on modern machines). Workers are created once and reused for every operation for the lifetime of the page. The amortized creation cost per operation is effectively zero.
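Pool sizing can be sketched as follows (the clamp bounds here are illustrative, not our tuned values):

```javascript
// Size the pool from hardwareConcurrency, clamped to a sane range.
// navigator is browser-scoped; the fallback covers other runtimes.
function poolSize(min = 2, max = 16) {
  const cores = (typeof navigator !== 'undefined' && navigator.hardwareConcurrency) || 4;
  return Math.min(Math.max(cores, min), max);
}
```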

Wake-to-first-instruction latency (from Atomics.notify() to the worker executing its first task instruction) is under 0.05 ms. This means the parallel tier is viable for datasets as small as 10,000 elements, where the compute time is 0.5 to 1 ms. A 0.05 ms wake cost is 5% to 10% overhead. A 5 to 15 ms worker creation cost would be 500% to 1,500%.

Per-chunk adaptive algorithm selection

This is where our parallel tier diverges from a naive "split the array, sort each piece, merge results" approach.

When the main thread dispatches a sort to the worker pool, it partitions the SharedArrayBuffer into contiguous chunks (one per worker). Each worker receives its chunk boundaries (offset and length into the shared buffer). The workers then sort their chunks independently.
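Computing the chunk boundaries is straightforward; a sketch (the helper name is ours, not from the engine):

```javascript
// Split n elements into workerCount contiguous chunks; the remainder
// is spread one element at a time across the leading chunks.
function partition(n, workerCount) {
  const base = Math.floor(n / workerCount);
  let extra = n % workerCount;
  const chunks = [];
  let offset = 0;
  for (let w = 0; w < workerCount; w++) {
    const length = base + (extra > 0 ? 1 : 0);
    if (extra > 0) extra--;
    chunks.push({ offset, length });
    offset += length;
  }
  return chunks;
}
```

Each `{ offset, length }` pair is what a worker receives as its chunk boundaries into the shared buffer.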

The key: each worker inspects its chunk and selects the optimal sort algorithm for that specific data.

Why per-chunk adaptation matters

Real-world datasets are not uniformly distributed. A sales table might have a revenue column where 90% of values fall between 0 and 10,000 (small transactions) and 10% fall between 100,000 and 10,000,000 (enterprise deals). When this column is partitioned across 8 workers, some chunks will contain mostly small values with a tight range. Others will contain a mix of small and large values with a wide range.

A single fixed algorithm optimized for the average case is suboptimal for both extremes. Per-chunk selection lets each worker use the best algorithm for its specific data.

Algorithm selection logic

Each worker runs a lightweight analysis pass on its chunk before sorting:

function selectAlgorithm(data, offset, length) {
  if (length <= 64) {
    return 'insertion'; // 64 elements or fewer: insertion sort is fastest
  }

  // Scan for min, max, and integer-ness
  let min = Infinity, max = -Infinity;
  let allInteger = true;
  for (let i = offset; i < offset + length; i++) {
    const v = data[i];
    if (v < min) min = v;
    if (v > max) max = v;
    if (allInteger && (v | 0) !== v) allInteger = false; // |0 truncates to int32, so non-integers and magnitudes >= 2^31 fall through to radix
  }

  const range = max - min;

  if (allInteger && range < 65536) {
    return 'counting'; // Bounded integers: O(n + k) counting sort
  }

  return 'radix256'; // General case: LSD radix-256 with bit-transform
}

Insertion sort (length 64 or fewer). For very small chunks (which occur at partition boundaries or when the main dataset is small), insertion sort's low constant factor and cache-friendliness beat any asymptotically faster algorithm. No allocation. No auxiliary buffers. Pure in-place comparisons.
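The small-chunk path is the textbook in-place insertion sort over a sub-range (a sketch):

```javascript
// In-place insertion sort over data[offset .. offset + length - 1].
// No allocation, no auxiliary buffer — the small-chunk fast path.
function insertionSort(data, offset, length) {
  for (let i = offset + 1; i < offset + length; i++) {
    const v = data[i];
    let j = i - 1;
    while (j >= offset && data[j] > v) {
      data[j + 1] = data[j];
      j--;
    }
    data[j + 1] = v;
  }
}
```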

Counting sort (bounded integers, range under 65,536). If every value in the chunk is an integer and the range from minimum to maximum fits within 65,536, counting sort runs in O(n + k) time where k is the range. A 65,536-entry count array occupies 256 KB. The sort requires one pass to count, one pass to compute prefix sums, and one pass to scatter. No comparisons. No callbacks. For data like status codes (0 to 99), age values (0 to 120), or quantity fields (0 to 10,000), counting sort is 3x to 5x faster than radix-256 because it completes in 3 passes versus 4, with a smaller histogram at each pass.
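For plain numeric keys the scatter pass can be simplified to regenerating values from the histogram. A sketch (the three-pass prefix-sum form described above is what you need when payload rows must move with their keys):

```javascript
// Counting sort for integer chunks whose range is under 65,536.
// `min` comes from the analysis pass that selected this algorithm.
// In production the 256 KB counts array would be allocated once and reused.
function countingSort(data, offset, length, min) {
  const counts = new Uint32Array(65536);
  for (let i = offset; i < offset + length; i++) {
    counts[data[i] - min]++;
  }
  let write = offset;
  for (let v = 0; v < counts.length; v++) {
    for (let c = counts[v]; c > 0; c--) data[write++] = v + min;
  }
}
```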

LSD radix-256 (general case). For floating-point data or integers with wide ranges, the worker applies the IEEE 754 bit-transform and runs a 4-pass least-significant-digit radix sort with base 256. This is O(n) regardless of data distribution, handles the full Float32 range, and requires only a 256-entry histogram (1 KB) per pass.
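The bit-transform maps each float's bit pattern to an unsigned integer whose order matches numeric order, so plain byte-wise radix passes sort correctly. A sketch (NaN handling omitted):

```javascript
// Map Float32 bit patterns to order-preserving Uint32 keys:
// negative floats get all bits flipped, positives get the sign bit flipped.
function floatToSortableUint(f32) {
  const bits = new Uint32Array(f32.buffer, f32.byteOffset, f32.length);
  const keys = new Uint32Array(f32.length);
  for (let i = 0; i < bits.length; i++) {
    const b = bits[i];
    keys[i] = (b & 0x80000000 ? ~b : b ^ 0x80000000) >>> 0;
  }
  return keys;
}
```

After sorting the keys with four radix-256 passes, the inverse transform recovers the original floats.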

Measured impact of per-chunk adaptation

We benchmarked on a realistic mixed dataset: 500,000 elements where 70% are bounded integers (0 to 1,000) and 30% are wide-range floats. Eight-worker pool:

Strategy                                 Time
Fixed TimSort per chunk                  18.4 ms
Fixed radix-256 per chunk                7.2 ms
Per-chunk adaptive (counting + radix)    4.9 ms

The adaptive strategy is 1.47x faster than fixed radix-256 because the 70% of chunks dominated by bounded integers use counting sort (3 passes, smaller histogram) instead of radix-256 (4 passes). On datasets with higher integer concentration, the advantage grows to 2.1x.

K-way merge on the main thread

After all workers complete (signaled via Atomics.notify() on a shared completion counter), the main thread holds K sorted chunks in the SharedArrayBuffer. These must be merged into a single sorted output.

We use a k-way merge with a min-heap of size K (equal to the worker count, typically 8). The heap holds the current minimum element from each chunk along with its chunk index and position within the chunk.

function kWayMerge(shared, chunks, output) {
  const heap = new MinHeap(chunks.length);

  // Initialize: insert the first element from each chunk
  for (let i = 0; i < chunks.length; i++) {
    heap.insert(shared[chunks[i].offset], i, chunks[i].offset);
  }

  let writePos = 0;
  while (heap.size > 0) {
    const { value, chunkIdx, readPos } = heap.extractMin();
    output[writePos++] = value;

    const chunk = chunks[chunkIdx];
    const nextPos = readPos + 1;
    if (nextPos < chunk.offset + chunk.length) {
      heap.insert(shared[nextPos], chunkIdx, nextPos);
    }
  }
}

For K = 8, each extractMin and insert operation touches at most 3 levels of the heap (log2(8) = 3). For n = 500,000 elements, the merge performs 500,000 extract-insert cycles with 3 comparisons each: 1.5 million comparisons total. On a single core, this completes in 3 to 5 ms.
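The MinHeap above is a plain binary heap keyed on value; a minimal sketch of the shape kWayMerge needs:

```javascript
// Minimal binary min-heap over { value, chunkIdx, readPos } entries,
// ordered by value. The capacity hint is unused in this sketch.
class MinHeap {
  constructor(capacity) {
    this.items = [];
  }
  get size() {
    return this.items.length;
  }
  insert(value, chunkIdx, readPos) {
    const a = this.items;
    a.push({ value, chunkIdx, readPos });
    let i = a.length - 1;
    while (i > 0) {
      const p = (i - 1) >> 1; // parent index
      if (a[p].value <= a[i].value) break;
      [a[p], a[i]] = [a[i], a[p]];
      i = p;
    }
  }
  extractMin() {
    const a = this.items;
    const top = a[0];
    const last = a.pop();
    if (a.length > 0) {
      a[0] = last;
      let i = 0;
      for (;;) {
        const l = 2 * i + 1; // left child
        const r = l + 1;     // right child
        let m = i;
        if (l < a.length && a[l].value < a[m].value) m = l;
        if (r < a.length && a[r].value < a[m].value) m = r;
        if (m === i) break;
        [a[m], a[i]] = [a[i], a[m]];
        i = m;
      }
    }
    return top;
  }
}
```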

The merge runs on the main thread because it is inherently sequential (each extraction depends on the previous heap state) and because it is fast enough to complete within a single frame budget. Parallelizing the merge would add synchronization overhead that exceeds the compute saved for K of 16 or fewer.

Writing to the output buffer

The merge writes to a separate region of the SharedArrayBuffer (or a dedicated output buffer). It does not write back into the input region, because the workers' sorted chunks are being read simultaneously. Double-buffering avoids read-write conflicts without locks.

The total memory footprint is 2x the input data size: one region for the sorted chunks (in-place, zero-copy from the workers), one region for the merged output. For a 500,000-element Float32 dataset, that is 4 MB total. Negligible on any modern device.

End-to-end parallel tier performance

Full pipeline: pre-warmed pool wake, chunk dispatch, per-chunk adaptive sort, k-way merge. Benchmarked on 8-core hardware with mixed Float32 data:

Elements   Parallel tier (8 workers)   Main thread (V8 TimSort)   Speedup
10,000     0.6 ms                      1.1 ms                     1.8x
50,000     1.8 ms                      5.8 ms                     3.2x
100,000    3.1 ms                      12.2 ms                    3.9x
200,000    5.2 ms                      25.6 ms                    4.9x
500,000    11.8 ms                     68.1 ms                    5.8x

The speedup increases with dataset size because the fixed coordination overhead (wake + merge) is amortized over more compute. At 500,000 elements, the coordination overhead is under 8% of total time. At 10,000 elements, it is roughly 30%, but the absolute time (0.6 ms) is still well within a single animation frame.

Comparison with the GPU tier

For context, the same 500,000-element sort on a discrete GPU takes 3.6 ms (versus 11.8 ms on the worker pool). The GPU wins by 3.3x. But on an integrated GPU, the GPU tier takes 6.4 ms, and the advantage narrows to 1.8x. On a device with no GPU access (enterprise VDI, locked-down terminal), the worker pool is the only option faster than the main thread.

The adaptive dispatch engine selects between these tiers based on the hardware calibration ratio. It does not assume GPU availability. The worker pool is not a fallback. It is a full-featured tier that handles the medium-scale workloads where GPU transfer overhead erodes the GPU's compute advantage.

Where this applies beyond sorting

The pre-warmed pool with SharedArrayBuffer and Atomics coordination is a general-purpose parallel execution substrate. We use it for:

Parallel filter. Each worker evaluates the predicate on its chunk and writes matching indices to a thread-local section of a shared output buffer. The main thread concatenates the sections. For 500,000 rows with a selective predicate, this is 1.2 ms versus 4.8 ms single-threaded.

Parallel aggregation. Each worker computes partial aggregates (sum, count, min, max) for its chunk. The main thread reduces the K partial results. For 500,000 rows with GROUP BY on a low-cardinality column, this is 2.1 ms versus 7.3 ms single-threaded.

Parallel histogram. Each worker builds a local histogram over its chunk. The main thread sums the K histograms bin by bin. For 500,000 values with 256 bins, this is 0.8 ms versus 3.2 ms single-threaded.
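The main-thread reduction step for the histogram case can be sketched as follows (layout assumption, not from the engine: worker w writes its bins into sharedBins starting at w * binCount):

```javascript
// Sum K per-worker histograms, laid out back to back in one shared
// Uint32Array, into a single combined histogram.
function reduceHistograms(sharedBins, workerCount, binCount) {
  const total = new Uint32Array(binCount);
  for (let w = 0; w < workerCount; w++) {
    const base = w * binCount;
    for (let b = 0; b < binCount; b++) total[b] += sharedBins[base + b];
  }
  return total;
}
```

The aggregation case is the same pattern with a handful of scalars per worker instead of 256 bins.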

All three operations integrate with our query engine as the Tier 2 backend. The dispatch scoring function routes each operator to the worker pool when the dataset is in the medium range and the GPU score is below 1.0.

The production readiness details

Two details separate a demo-quality worker pool from one that runs in enterprise production.

Graceful degradation on low-core hardware. Not every enterprise machine has 8 cores. A 2-core laptop with hyperthreading reports hardwareConcurrency = 4. With only 4 workers, the parallel speedup is lower (2.5x to 3.5x instead of 4.5x to 5.8x). The engine adjusts: fewer workers means larger chunks per worker, which changes the per-chunk algorithm selection (larger chunks are more likely to have wide value ranges, shifting from counting sort to radix-256). The system adapts without configuration.

Device loss recovery integration. When the GPU device is lost mid-operation, the engine re-dispatches to the worker pool. The workers are already warm. The SharedArrayBuffer already contains the input data (the GPU received a copy, not the original). Re-dispatch latency is under 0.1 ms. The user experiences a one-time slowdown from GPU speed to worker speed, then normal GPU operation resumes after re-probing.

This is what enterprise AI automation infrastructure looks like at the compute layer. Not a single execution path with error handling bolted on. Three execution tiers, each engineered for its optimal range, with automatic selection and transparent failover between them. The worker pool is the middle tier. It is also the safety net.

Want to discuss how this applies to your business?

Book a Discovery Call