Ayoob AI

Zero-Copy Parallel Processing with SharedArrayBuffer in JavaScript

15 min read

Web Workers · SharedArrayBuffer · JavaScript · Performance · Parallel Computing

The gap between main thread and GPU

Our adaptive dispatch engine routes computation across three tiers: CPU main thread, Web Worker parallel, and WebGPU compute. The GPU tier handles datasets above 500,000 elements on discrete hardware. The main thread handles datasets below 10,000 elements.

Between those boundaries sits a range where neither tier is optimal.

A 200,000-element sort on the main thread takes 28 ms (V8 TimSort with a numeric comparator). That is nearly two animation frames. The UI stutters visibly.

The same sort dispatched to the GPU takes 2.4 ms of compute but 3.1 ms of buffer transfer (write to GPU, read back results). Total: 5.5 ms. The transfer overhead is 56% of the total time. For a dataset this size, you are spending more time moving data than processing it.

The Web Worker parallel tier fills this gap. It processes data where it already lives (in JavaScript heap memory), avoids the GPU transfer penalty, and spreads the work across multiple CPU cores. For the 10,000 to 500,000 element range, it is the fastest option on most hardware.

But Web Workers have their own overhead problems. Solving them is what makes the parallel tier production-ready rather than a naive postMessage fan-out.

The postMessage serialization problem

The standard way to send data to a Web Worker is postMessage():

worker.postMessage({ data: myFloat32Array });

By default, this uses the structured clone algorithm to copy the data. For a 200,000-element Float32Array (800 KB), the copy takes 0.3 to 0.5 ms. Send it to 8 workers, and you have copied 6.4 MB in 2.4 to 4.0 ms. You have not started computing yet.

Transferable objects (postMessage(data, [data.buffer])) avoid the copy by moving ownership of the ArrayBuffer to the worker. But transfer is exclusive: the source loses access. You cannot transfer the same buffer to multiple workers. For parallel processing where every worker needs read access to the full dataset, transfer does not work.
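The exclusivity is visible in the detach semantics. structuredClone uses the same transfer machinery as postMessage, so the behavior can be demonstrated without spinning up a worker (a single-thread sketch):

```javascript
// Transferring an ArrayBuffer moves ownership and detaches the source.
// structuredClone shares postMessage's transfer semantics, so it makes
// a convenient single-thread demonstration.
const source = new ArrayBuffer(800_000);
const moved = structuredClone(source, { transfer: [source] });

console.log(moved.byteLength);  // 800000 — the receiver owns the bytes
console.log(source.byteLength); // 0 — the source is detached
```

Once detached, every view over the source buffer is dead. That is why a buffer can be transferred to exactly one worker.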

You could partition the data into 8 slices and transfer each slice to a different worker. But then the merge phase requires transferring 8 results back to the main thread, and any operation that needs cross-partition visibility (a global sort, a full-dataset aggregation with shared group keys) requires additional coordination messages.

Every postMessage call, even with transfers, incurs event loop overhead: the message enters the worker's message queue, waits for the current microtask to complete, and fires the onmessage handler. Measured latency per postMessage round-trip: 0.1 to 0.3 ms. For an operation that takes 3 ms of compute, 8 coordination messages add 0.8 to 2.4 ms of communication overhead. The overhead approaches the compute time.

SharedArrayBuffer: zero-copy shared memory

SharedArrayBuffer is an ArrayBuffer that multiple threads can access simultaneously. Unlike a regular ArrayBuffer, it is not copied or transferred. Every worker that receives a reference to a SharedArrayBuffer reads and writes the same physical memory.

// Main thread
const shared = new SharedArrayBuffer(800_000); // 200,000 float32s = 800 KB
const view = new Float32Array(shared);
// Fill view with data...

// Send to all workers. No copy. No transfer. Zero overhead.
for (const worker of workers) {
  worker.postMessage({ buffer: shared, offset: 0, length: 200_000 });
}

The postMessage call sends only the reference (a few bytes). The data stays in place. Eight workers, one copy of the data, zero serialization cost.
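The aliasing is easy to verify: every typed-array view created over the same SharedArrayBuffer reads the same bytes. A single-thread sketch (across workers the behavior is identical):

```javascript
const shared = new SharedArrayBuffer(800_000);

// Two independent views over the same buffer — e.g. one on the main
// thread and one inside a worker — alias the same physical memory.
const mainView = new Float32Array(shared);
const workerView = new Float32Array(shared);

mainView[0] = 42;
console.log(workerView[0]); // 42 — no copy ever happened
```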

This is the foundation of our parallel tier. Every operation on the worker pool uses SharedArrayBuffer for input data, output data, and coordination state.

The security context requirement

SharedArrayBuffer requires the page to be served with specific HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

These headers were mandated after the Spectre vulnerability disclosure to prevent cross-origin timing attacks. Without them, the page is not cross-origin isolated and browsers do not expose SharedArrayBuffer at all: the global is undefined, and referencing it throws a ReferenceError.

Our engine checks for SharedArrayBuffer availability at initialization. If it is unavailable (missing headers, legacy browser, or restricted environment), the parallel tier falls back to postMessage with transferable objects and single-partition-per-worker dispatch. Performance degrades by 20% to 40%, but correctness is unaffected.
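A feature check along these lines (a sketch, not our exact implementation) covers both the missing-header case and legacy browsers. crossOriginIsolated is the standard browser signal that both COOP and COEP are in effect:

```javascript
// Returns true when SharedArrayBuffer is usable in this context.
// In browsers, crossOriginIsolated reflects the COOP/COEP headers;
// non-browser runtimes may not define crossOriginIsolated at all.
function sharedMemoryAvailable() {
  if (typeof SharedArrayBuffer === 'undefined') return false;
  if (typeof crossOriginIsolated !== 'undefined' && !crossOriginIsolated) return false;
  return true;
}
```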

Lock-free synchronization with Atomics

Shared memory without synchronization is a data race. Two workers reading and writing the same memory location without synchronization can observe stale or torn values on some architectures. JavaScript's Atomics API provides the synchronization primitives.

Atomics.wait() and Atomics.notify()

These are the core coordination mechanism. Atomics.wait() parks a thread until another thread writes a specific value to a shared memory location and calls Atomics.notify().

// Worker: park until notified
const control = new Int32Array(sharedControlBuffer);
Atomics.wait(control, WORKER_SLOT, 0);
// Execution resumes here when main thread calls Atomics.notify()

// Main thread: wake the worker
Atomics.store(control, WORKER_SLOT, 1);
Atomics.notify(control, WORKER_SLOT, 1);

Atomics.wait() is a true OS-level thread park, not a busy-wait spin loop. The worker consumes zero CPU cycles while parked. The wake latency (time from Atomics.notify() to the worker resuming execution) is 0.01 to 0.05 ms on modern hardware. Compare this to postMessage wake latency of 0.1 to 0.3 ms.

Why not use postMessage for coordination

postMessage is asynchronous and event-loop-bound. The message sits in the worker's event queue until the current task completes. Atomics.notify() wakes the thread immediately at the OS scheduler level, bypassing the JavaScript event loop. For latency-sensitive operations where every 0.1 ms matters, this difference compounds across multiple synchronization points.

A parallel sort with 8 workers requires at minimum 2 synchronization points per worker (start of work, end of work). With postMessage, that is 16 async messages: 1.6 to 4.8 ms of coordination overhead. With Atomics, that is 16 atomic wake/wait operations: 0.16 to 0.8 ms. On a 200,000-element sort that takes 4 ms of compute, the coordination model determines whether total latency stays near 4.2 ms or climbs to between 5.6 and 8.8 ms.

The pre-warmed worker pool

Creating a Web Worker has a fixed cost. The browser must spawn an OS thread, initialize a JavaScript context, parse and compile the worker script, and execute any setup code. Measured cost: 5 to 15 ms per worker on first creation. On subsequent creations (after the script is cached), 2 to 5 ms.

For an 8-worker pool, cold creation costs 40 to 120 ms. This is acceptable as a one-time startup cost. It is unacceptable as a per-operation cost.

Our engine creates the worker pool at initialization time. Each worker runs a startup sequence that allocates its local state, then parks on Atomics.wait():

// Worker script (simplified). sharedControlBuffer and workerIndex
// arrive in a one-time init postMessage before this loop starts.
const control = new Int32Array(sharedControlBuffer);
const WORKER_ID = self.workerIndex;

// Initialization complete. Park until work arrives.
while (true) {
  Atomics.wait(control, WORKER_ID, 0); // Parked. Zero CPU usage.

  // Woken up. Read task descriptor from shared memory.
  const task = readTaskDescriptor(WORKER_ID);

  // Execute the task.
  executeTask(task);

  // Signal completion.
  Atomics.store(control, WORKER_ID, 0);
  Atomics.notify(control, COMPLETION_SLOT, 1);
}

The pool is sized to navigator.hardwareConcurrency (typically 4 to 16 on modern machines). Workers are created once and reused for every operation for the lifetime of the page. The amortized creation cost per operation is effectively zero.
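Pool sizing can be sketched as follows (the clamp bounds here are illustrative, not our tuned values):

```javascript
// Size the pool from hardwareConcurrency, clamped to a sane range.
// navigator is browser-scoped; the fallback covers other runtimes.
function poolSize(min = 2, max = 16) {
  const cores = (typeof navigator !== 'undefined' && navigator.hardwareConcurrency) || 4;
  return Math.min(Math.max(cores, min), max);
}
```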

Wake-to-first-instruction latency (from Atomics.notify() to the worker executing its first task instruction) is under 0.05 ms. This means the parallel tier is viable for datasets as small as 10,000 elements, where the compute time is 0.5 to 1 ms. A 0.05 ms wake cost is 5% to 10% overhead. A 5 to 15 ms worker creation cost would be 500% to 1,500%.

Per-chunk adaptive algorithm selection

This is where our parallel tier diverges from a naive "split the array, sort each piece, merge results" approach.

When the main thread dispatches a sort to the worker pool, it partitions the SharedArrayBuffer into contiguous chunks (one per worker). Each worker receives its chunk boundaries (offset and length into the shared buffer). The workers then sort their chunks independently.
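Computing the chunk boundaries is straightforward; a sketch (the helper name is ours, not from the engine):

```javascript
// Split n elements into workerCount contiguous chunks; the remainder
// is spread one element at a time across the leading chunks.
function partition(n, workerCount) {
  const base = Math.floor(n / workerCount);
  let extra = n % workerCount;
  const chunks = [];
  let offset = 0;
  for (let w = 0; w < workerCount; w++) {
    const length = base + (extra > 0 ? 1 : 0);
    if (extra > 0) extra--;
    chunks.push({ offset, length });
    offset += length;
  }
  return chunks;
}
```

Each `{ offset, length }` pair is what a worker receives as its chunk boundaries into the shared buffer.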

The key: each worker inspects its chunk and selects the optimal sort algorithm for that specific data.

Why per-chunk adaptation matters

Real-world datasets are not uniformly distributed. A sales table might have a revenue column where 90% of values fall between 0 and 10,000 (small transactions) and 10% fall between 100,000 and 10,000,000 (enterprise deals). When this column is partitioned across 8 workers, some chunks will contain mostly small values with a tight range. Others will contain a mix of small and large values with a wide range.

A single fixed algorithm optimized for the average case is suboptimal for both extremes. Per-chunk selection lets each worker use the best algorithm for its specific data.

Algorithm selection logic

Each worker runs a lightweight analysis pass on its chunk before sorting:

function selectAlgorithm(data, offset, length) {
  if (length <= 64) {
    return 'insertion'; // 64 elements or fewer: insertion sort is fastest
  }

  // Scan for min, max, and integer-ness
  let min = Infinity, max = -Infinity;
  let allInteger = true;
  for (let i = offset; i < offset + length; i++) {
    const v = data[i];
    if (v < min) min = v;
    if (v > max) max = v;
    if (allInteger && (v | 0) !== v) allInteger = false; // |0 truncates to int32, so non-integers and magnitudes >= 2^31 fall through to radix
  }

  const range = max - min;

  if (allInteger && range < 65536) {
    return 'counting'; // Bounded integers: O(n + k) counting sort
  }

  return 'radix256'; // General case: LSD radix-256 with bit-transform
}

Insertion sort (length 64 or fewer). For very small chunks (which occur at partition boundaries or when the main dataset is small), insertion sort's low constant factor and cache-friendliness beat any asymptotically faster algorithm. No allocation. No auxiliary buffers. Pure in-place comparisons.
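The small-chunk path is the textbook in-place insertion sort over a sub-range (a sketch):

```javascript
// In-place insertion sort over data[offset .. offset + length - 1].
// No allocation, no auxiliary buffer — the small-chunk fast path.
function insertionSort(data, offset, length) {
  for (let i = offset + 1; i < offset + length; i++) {
    const v = data[i];
    let j = i - 1;
    while (j >= offset && data[j] > v) {
      data[j + 1] = data[j];
      j--;
    }
    data[j + 1] = v;
  }
}
```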

Counting sort (bounded integers, range under 65,536). If every value in the chunk is an integer and the range from minimum to maximum fits within 65,536, counting sort runs in O(n + k) time where k is the range. A 65,536-entry count array occupies 256 KB. The sort requires one pass to count, one pass to compute prefix sums, and one pass to scatter. No comparisons. No callbacks. For data like status codes (0 to 99), age values (0 to 120), or quantity fields (0 to 10,000), counting sort is 3x to 5x faster than radix-256 because it completes in 3 passes versus 4, with a smaller histogram at each pass.
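For plain numeric keys the scatter pass can be simplified to regenerating values from the histogram. A sketch (the three-pass prefix-sum form described above is what you need when payload rows must move with their keys):

```javascript
// Counting sort for integer chunks whose range is under 65,536.
// `min` comes from the analysis pass that selected this algorithm.
// In production the 256 KB counts array would be allocated once and reused.
function countingSort(data, offset, length, min) {
  const counts = new Uint32Array(65536);
  for (let i = offset; i < offset + length; i++) {
    counts[data[i] - min]++;
  }
  let write = offset;
  for (let v = 0; v < counts.length; v++) {
    for (let c = counts[v]; c > 0; c--) data[write++] = v + min;
  }
}
```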

LSD radix-256 (general case). For floating-point data or integers with wide ranges, the worker applies the IEEE 754 bit-transform and runs a 4-pass least-significant-digit radix sort with base 256. This is O(n) regardless of data distribution, handles the full Float32 range, and requires only a 256-entry histogram (1 KB) per pass.
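The bit-transform maps each float's bit pattern to an unsigned integer whose order matches numeric order, so plain byte-wise radix passes sort correctly. A sketch (NaN handling omitted):

```javascript
// Map Float32 bit patterns to order-preserving Uint32 keys:
// negative floats get all bits flipped, positives get the sign bit flipped.
function floatToSortableUint(f32) {
  const bits = new Uint32Array(f32.buffer, f32.byteOffset, f32.length);
  const keys = new Uint32Array(f32.length);
  for (let i = 0; i < bits.length; i++) {
    const b = bits[i];
    keys[i] = (b & 0x80000000 ? ~b : b ^ 0x80000000) >>> 0;
  }
  return keys;
}
```

After sorting the keys with four radix-256 passes, the inverse transform recovers the original floats.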

Measured impact of per-chunk adaptation

We benchmarked on a realistic mixed dataset: 500,000 elements where 70% are bounded integers (0 to 1,000) and 30% are wide-range floats. Eight-worker pool:

Strategy                                 Time
Fixed TimSort per chunk                  18.4 ms
Fixed radix-256 per chunk                7.2 ms
Per-chunk adaptive (counting + radix)    4.9 ms

The adaptive strategy is 1.47x faster than fixed radix-256 because the 70% of chunks dominated by bounded integers use counting sort (3 passes, smaller histogram) instead of radix-256 (4 passes). On datasets with higher integer concentration, the advantage grows to 2.1x.

K-way merge on the main thread

After all workers complete (signaled via Atomics.notify() on a shared completion counter), the main thread holds K sorted chunks in the SharedArrayBuffer. These must be merged into a single sorted output.

We use a k-way merge with a min-heap of size K (equal to the worker count, typically 8). The heap holds the current minimum element from each chunk along with its chunk index and position within the chunk.

function kWayMerge(shared, chunks, output) {
  const heap = new MinHeap(chunks.length);

  // Initialize: insert the first element from each chunk
  for (let i = 0; i < chunks.length; i++) {
    heap.insert(shared[chunks[i].offset], i, chunks[i].offset);
  }

  let writePos = 0;
  while (heap.size > 0) {
    const { value, chunkIdx, readPos } = heap.extractMin();
    output[writePos++] = value;

    const chunk = chunks[chunkIdx];
    const nextPos = readPos + 1;
    if (nextPos < chunk.offset + chunk.length) {
      heap.insert(shared[nextPos], chunkIdx, nextPos);
    }
  }
}

For K = 8, each extractMin and insert operation touches at most 3 levels of the heap (log2(8) = 3). For n = 500,000 elements, the merge performs 500,000 extract-insert cycles with 3 comparisons each: 1.5 million comparisons total. On a single core, this completes in 3 to 5 ms.
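The MinHeap above is a plain binary heap keyed on value; a minimal sketch of the shape kWayMerge needs:

```javascript
// Minimal binary min-heap over { value, chunkIdx, readPos } entries,
// ordered by value. The capacity hint is unused in this sketch.
class MinHeap {
  constructor(capacity) {
    this.items = [];
  }
  get size() {
    return this.items.length;
  }
  insert(value, chunkIdx, readPos) {
    const a = this.items;
    a.push({ value, chunkIdx, readPos });
    let i = a.length - 1;
    while (i > 0) {
      const p = (i - 1) >> 1; // parent index
      if (a[p].value <= a[i].value) break;
      [a[p], a[i]] = [a[i], a[p]];
      i = p;
    }
  }
  extractMin() {
    const a = this.items;
    const top = a[0];
    const last = a.pop();
    if (a.length > 0) {
      a[0] = last;
      let i = 0;
      for (;;) {
        const l = 2 * i + 1; // left child
        const r = l + 1;     // right child
        let m = i;
        if (l < a.length && a[l].value < a[m].value) m = l;
        if (r < a.length && a[r].value < a[m].value) m = r;
        if (m === i) break;
        [a[m], a[i]] = [a[i], a[m]];
        i = m;
      }
    }
    return top;
  }
}
```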

The merge runs on the main thread because it is inherently sequential (each extraction depends on the previous heap state) and because it is fast enough to complete within a single frame budget. Parallelizing the merge would add synchronization overhead that exceeds the compute saved for K of 16 or fewer.

Writing to the output buffer

The merge writes to a separate region of the SharedArrayBuffer (or a dedicated output buffer). It does not write back into the input region, because the workers' sorted chunks are being read simultaneously. Double-buffering avoids read-write conflicts without locks.

The total memory footprint is 2x the input data size: one region for the sorted chunks (in-place, zero-copy from the workers), one region for the merged output. For a 500,000-element Float32 dataset, that is 4 MB total. Negligible on any modern device.

End-to-end parallel tier performance

Full pipeline: pre-warmed pool wake, chunk dispatch, per-chunk adaptive sort, k-way merge. Benchmarked on 8-core hardware with mixed Float32 data:

Elements   Parallel tier (8 workers)   Main thread (V8 TimSort)   Speedup
10,000     0.6 ms                      1.1 ms                     1.8x
50,000     1.8 ms                      5.8 ms                     3.2x
100,000    3.1 ms                      12.2 ms                    3.9x
200,000    5.2 ms                      25.6 ms                    4.9x
500,000    11.8 ms                     68.1 ms                    5.8x

The speedup increases with dataset size because the fixed coordination overhead (wake + merge) is amortized over more compute. At 500,000 elements, the coordination overhead is under 8% of total time. At 10,000 elements, it is roughly 30%, but the absolute time (0.6 ms) is still well within a single animation frame.

Comparison with the GPU tier

For context, the same 500,000-element sort on a discrete GPU takes 3.6 ms (versus 11.8 ms on the worker pool). The GPU wins by 3.3x. But on an integrated GPU, the GPU tier takes 6.4 ms, and the advantage narrows to 1.8x. On a device with no GPU access (enterprise VDI, locked-down terminal), the worker pool is the only option faster than the main thread.

The adaptive dispatch engine selects between these tiers based on the hardware calibration ratio. It does not assume GPU availability. The worker pool is not a fallback. It is a full-featured tier that handles the medium-scale workloads where GPU transfer overhead erodes the GPU's compute advantage.

Where this applies beyond sorting

The pre-warmed pool with SharedArrayBuffer and Atomics coordination is a general-purpose parallel execution substrate. We use it for:

Parallel filter. Each worker evaluates the predicate on its chunk and writes matching indices to a thread-local section of a shared output buffer. The main thread concatenates the sections. For 500,000 rows with a selective predicate, this is 1.2 ms versus 4.8 ms single-threaded.

Parallel aggregation. Each worker computes partial aggregates (sum, count, min, max) for its chunk. The main thread reduces the K partial results. For 500,000 rows with GROUP BY on a low-cardinality column, this is 2.1 ms versus 7.3 ms single-threaded.

Parallel histogram. Each worker builds a local histogram over its chunk. The main thread sums the K histograms bin by bin. For 500,000 values with 256 bins, this is 0.8 ms versus 3.2 ms single-threaded.
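The main-thread reduction step for the histogram case can be sketched as follows (layout assumption, not from the engine: worker w writes its bins into sharedBins starting at w * binCount):

```javascript
// Sum K per-worker histograms, laid out back to back in one shared
// Uint32Array, into a single combined histogram.
function reduceHistograms(sharedBins, workerCount, binCount) {
  const total = new Uint32Array(binCount);
  for (let w = 0; w < workerCount; w++) {
    const base = w * binCount;
    for (let b = 0; b < binCount; b++) total[b] += sharedBins[base + b];
  }
  return total;
}
```

The aggregation case is the same pattern with a handful of scalars per worker instead of 256 bins.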

All three operations integrate with our query engine as the Tier 2 backend. The dispatch scoring function routes each operator to the worker pool when the dataset is in the medium range and the GPU score is below 1.0.

The production readiness details

Two details separate a demo-quality worker pool from one that runs in enterprise production.

Graceful degradation on low-core hardware. Not every enterprise machine has 8 cores. A 2-core laptop with hyperthreading reports hardwareConcurrency = 4. With only 4 workers, the parallel speedup is lower (2.5x to 3.5x instead of 4.5x to 5.8x). The engine adjusts: fewer workers means larger chunks per worker, which changes the per-chunk algorithm selection (larger chunks are more likely to have wide value ranges, shifting from counting sort to radix-256). The system adapts without configuration.

Device loss recovery integration. When the GPU device is lost mid-operation, the engine re-dispatches to the worker pool. The workers are already warm. The SharedArrayBuffer already contains the input data (the GPU received a copy, not the original). Re-dispatch latency is under 0.1 ms. The user experiences a one-time slowdown from GPU speed to worker speed, then normal GPU operation resumes after re-probing.

This is what enterprise AI automation infrastructure looks like at the compute layer. Not a single execution path with error handling bolted on. Three execution tiers, each engineered for its optimal range, with automatic selection and transparent failover between them. The worker pool is the middle tier. It is also the safety net.

Want to discuss how this applies to your business?

Book a Discovery Call