Ayoob AI

The Ayoob AI Architecture: Merging CPU, Workers, and WebGPU

15 min read · Ayoob AI

Architecture · WebGPU · Web Workers · Heterogeneous Compute · Enterprise

One engine, three tiers, zero configuration

This article is the architectural capstone of a 29-article series on browser-based heterogeneous compute. Each previous article examined one component in depth. This article shows how they connect.

The system we built solves a single problem: given an operation and a dataset, execute it on the fastest available hardware while guaranteeing correctness, fault tolerance, and mathematical precision. The application developer calls one function. The engine handles everything else.

const result = await engine.dispatch(operation, data);

Behind that single call, the engine runs four stages in sequence. The total decision time is under 0.1 ms. The result is a dispatch to one of three compute tiers, with automatic fallback if the chosen tier fails.

Stage 1: Workload Characterisation

Before the engine considers hardware, it characterises the workload. Three analyses run in parallel.

Control flow analysis

The engine inspects the operation's control flow topology and classifies it into one of three categories:

Uniform. Every element follows the same instruction path. Examples: element-wise arithmetic, radix sort scatter/gather, parallel histogram construction. GPU-safe. No SIMD branch divergence.

Bounded. The operation branches on data, but the number of distinct paths is small and predictable. Examples: clamping, enum-based classification, dictionary-encoded filter with IN clause. GPU-viable with predication cost (10% to 30% throughput reduction).

Categorical. Every element may follow a unique execution path. Examples: NFA regex traversal, Levenshtein distance, trie lookup, UTF-8 case-insensitive search. Categorical GPU Inhibition: penalty of negative infinity. GPU dispatch blocked unconditionally.
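The three categories and their GPU penalty can be sketched as a small classifier. This is an illustrative sketch, not the engine's actual API: the `OperationTraits` shape and function names are assumptions.

```typescript
type ControlFlowClass = 'uniform' | 'bounded' | 'categorical';

// Hypothetical summary of an operation's control flow topology.
interface OperationTraits {
  branchesOnData: boolean; // does any instruction path depend on element values?
  distinctPaths: number;   // statically bounded path count; Infinity if unbounded
}

function classifyControlFlow(t: OperationTraits): ControlFlowClass {
  if (!t.branchesOnData) return 'uniform';                 // e.g. element-wise arithmetic
  if (Number.isFinite(t.distinctPaths)) return 'bounded';  // e.g. clamping, IN-list filter
  return 'categorical';                                    // e.g. NFA regex, trie lookup
}

// A categorical classification maps to the -Infinity GPU penalty.
const gpuPenalty = (c: ControlFlowClass): number =>
  c === 'categorical' ? -Infinity : 0;
```

Because the penalty is negative infinity rather than a large finite number, no combination of favourable factors in later stages can outweigh it.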

Output density profiling

For operations that produce atomic writes (counters, histogram bins, compaction pointers), the engine estimates the fraction of threads that will issue an atomic operation per cycle.

The estimation uses column statistics: histogram-based selectivity for filters, Chao1 cardinality estimation for group-by, and Phase 1 popcount for text search.

If estimated density exceeds 10% of the GPU's thread capacity per cycle, atomic contention would collapse throughput non-linearly. The engine assigns a categorical penalty of negative infinity. GPU dispatch blocked.
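The guard itself reduces to a threshold check on the estimated density. The 10% ceiling is taken from the text above; the function name is an assumption for illustration.

```typescript
// Block GPU dispatch when estimated atomic density exceeds 10% of the
// GPU's thread capacity per cycle, past which contention collapses
// throughput non-linearly.
function atomicContentionPenalty(estimatedDensity: number): number {
  const DENSITY_CEILING = 0.10;
  return estimatedDensity > DENSITY_CEILING ? -Infinity : 0;
}
```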

Encoding detection

For text operations, the engine samples the corpus for multi-byte UTF-8 sequences. If multi-byte content is detected and the search is case-insensitive, the combination of variable-width encoding and Unicode case folding produces categorical divergence. GPU dispatch blocked.

For string columns in structured queries, the engine verifies dictionary encoding is in place. String predicates are resolved to integer comparisons at compile time. The GPU never processes a string.
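The multi-byte probe can be sketched as a sampled scan for non-ASCII bytes: any byte at or above 0x80 is part of a multi-byte UTF-8 sequence (lead or continuation byte). The sampling stride and function name are illustrative assumptions, not the engine's implementation.

```typescript
// Sample the corpus for multi-byte UTF-8 content. Continuation bytes are
// also >= 0x80, so a sample landing mid-sequence still detects it.
function hasMultiByteUtf8(corpus: Uint8Array, sampleEvery = 64): boolean {
  for (let i = 0; i < corpus.length; i += sampleEvery) {
    if (corpus[i] >= 0x80) return true;
  }
  return false;
}
```

If this returns true and the search is case-insensitive, the engine routes to the CPU tier; a pure-ASCII corpus keeps the GPU path open.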

Characterisation output

The three analyses produce a workload profile:

interface WorkloadProfile {
  controlFlow: 'uniform' | 'bounded' | 'categorical';
  outputDensity: number;              // 0.0 to 1.0
  encodingClass: 'ascii' | 'utf8_multibyte';
  categoricalInhibition: boolean;     // true if any analysis triggers -Infinity
  arithmeticIntensity: number;        // FLOPs per byte
}

If categoricalInhibition is true, the engine skips Stages 2 and 3 entirely. The operation routes to the CPU tier (Workers or main thread depending on dataset size). No GPU resources are allocated. No further analysis is needed.

Stage 2: Precision Sufficiency Analysis

For operations that survive Stage 1, the Precision Sufficiency Analyser evaluates whether Float32 arithmetic can produce results within the caller's tolerance.

Three sensitivity tiers

High sensitivity (linear algebra). Matrix solve, eigenvalue decomposition, least-squares. The analyser estimates the condition number κ via Hager's O(n^2) algorithm. Expected relative error: κ * 1.19 x 10^-7 (Float32 machine epsilon). If this exceeds the caller's tolerance (default 10^-9 for financial workloads), the Float32 Safety Guard assigns negative infinity. GPU blocked.
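The high-sensitivity check reduces to the error bound stated above: expected relative error is approximately κ · ε, compared against the caller's tolerance. A minimal sketch, with names assumed:

```typescript
// Float32 machine epsilon, as used in the κ · ε error bound above.
const FLOAT32_EPSILON = 1.19e-7;

// Returns true when Float32 arithmetic is expected to stay within the
// caller's tolerance for a solve with the given condition number κ.
function float32SafeForSolve(conditionNumber: number, tolerance = 1e-9): boolean {
  return conditionNumber * FLOAT32_EPSILON <= tolerance;
}
```

Note that under the default financial tolerance of 10^-9, even a perfectly conditioned system (κ = 1) fails the check, which is why financial linear algebra routes to CPU Float64 essentially unconditionally.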

Medium sensitivity (accumulation). SUM, AVG, running totals, windowed aggregations. The analyser estimates maximum intermediate accumulation and compares against the Float32 safe integer threshold (16,777,216). If the accumulation exceeds this boundary, GPU blocked for the numeric output. Additionally, operations that pass the pre-dispatch check receive post-dispatch spot-check verification: 16 sampled elements are re-computed in Float64 on the CPU. If any sample's relative error exceeds 10^-4, the GPU result is discarded and the operation re-executes on the CPU.
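The post-dispatch spot check can be sketched as follows. The even-stride sampling strategy and function signature are assumptions for illustration; the 16-sample count and 10^-4 error bound come from the text above.

```typescript
// Re-compute sampled elements in Float64 on the CPU and reject the GPU
// result if any sample's relative error exceeds the bound.
function spotCheck(
  gpuResult: Float32Array,
  recomputeF64: (i: number) => number, // CPU Float64 reference for element i
  samples = 16,
  maxRelError = 1e-4,
): boolean {
  const stride = Math.max(1, Math.floor(gpuResult.length / samples));
  for (let i = 0; i < gpuResult.length; i += stride) {
    const ref = recomputeF64(i);
    const err = ref === 0
      ? Math.abs(gpuResult[i])
      : Math.abs((gpuResult[i] - ref) / ref);
    if (err > maxRelError) return false; // discard GPU result, re-run on CPU
  }
  return true;
}
```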

Low sensitivity (comparison). Filters, sorts, classifications. The analyser estimates the minimum gap between adjacent values and compares against the Float32 ULP at the relevant magnitude. If comparisons are unaffected by rounding, GPU dispatch is safe. The output is boolean or ordinal, not numeric.

Analysis output

interface PrecisionProfile {
  sensitivityTier: 'high' | 'medium' | 'low';
  riskScore: number;
  precisionInhibition: boolean;      // true if risk exceeds tolerance
  requiresPostDispatchVerification: boolean;
}

If precisionInhibition is true, the operation routes to CPU Float64. If requiresPostDispatchVerification is true, the GPU result will be spot-checked after execution.

Stage 3: Dispatch Scoring

For operations that survive Stages 1 and 2, a multi-factor scoring function computes the final dispatch score. The exact factors vary by domain: structured queries use a 6-factor formula with SQL-specific metrics, while sort operations use a 7-factor dispatch that includes sort-specific inputs. The following factors illustrate the query scoring path, which is the most general example.

The factors (structured query example)

Factor 1: Dataset cardinality. The number of elements entering the operator. Larger datasets favour GPU dispatch (more parallel work to amortize fixed overhead).

Factor 2: Predicate selectivity. For filters, the estimated fraction of rows that pass the predicate. Low selectivity means most GPU threads produce no output. Affects compaction efficiency and downstream operator cardinality.

Factor 3: Group cardinality. For GROUP BY operators, the estimated number of distinct groups via Chao1. Low cardinality (under 1,024) enables shared memory accumulators. High cardinality forces global memory atomics with severe contention.

Factor 4: Arithmetic intensity. The ratio of FLOPs to memory bytes. Compute-bound operations (GEMM: n/6 FLOPs/byte) justify GPU dispatch at small data sizes. Memory-bound operations (element-wise: 0.25 FLOPs/byte) require large datasets.

Factor 5: Memory access pattern. Sequential (coalesced GPU reads, full bandwidth) versus random (uncoalesced, 10% to 25% bandwidth). Filters and sorts are sequential. Hash joins and index lookups are random.

Factor 6: Hardware calibration ratio. The device-specific break-even between CPU and GPU, derived from runtime microbenchmarks at session start: memory bandwidth probe, dispatch overhead measurement, adapter capability query. Normalizes scoring across hardware.

The formula

operatorScore = (cardinality * arithmeticIntensity * accessPatternWeight)
              / (selectivityPenalty * groupCardinalityPenalty * calibrationRatio)

Score > 1.0: GPU dispatch. The GPU's compute or bandwidth advantage outweighs all overhead.

Score 0.3 to 1.0: Web Worker dispatch. The GPU's advantage is marginal or negative, but the dataset is large enough to benefit from multi-threaded CPU execution.

Score < 0.3: CPU main thread dispatch. The dataset is small enough that single-threaded execution is fastest (no worker wake overhead, no GPU dispatch overhead).
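The formula and thresholds above can be written down directly. The factor values here are pre-derived scalars; how the real engine derives each penalty from column statistics is domain-specific and not modelled in this sketch.

```typescript
type Tier = 'gpu' | 'workers' | 'main';

interface ScoreInputs {
  cardinality: number;
  arithmeticIntensity: number;  // FLOPs per byte
  accessPatternWeight: number;  // 1.0 sequential, lower for random access
  selectivityPenalty: number;
  groupCardinalityPenalty: number;
  calibrationRatio: number;     // device-specific CPU/GPU break-even
}

// The structured-query scoring formula from the text, verbatim.
function operatorScore(f: ScoreInputs): number {
  return (f.cardinality * f.arithmeticIntensity * f.accessPatternWeight)
       / (f.selectivityPenalty * f.groupCardinalityPenalty * f.calibrationRatio);
}

// Score-to-tier mapping: > 1.0 GPU, 0.3 to 1.0 Workers, < 0.3 main thread.
function routeTier(score: number): Tier {
  if (score > 1.0) return 'gpu';
  if (score >= 0.3) return 'workers';
  return 'main';
}
```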

Pipeline fusion bonus

If the preceding operator in a fused pipeline was GPU-dispatched, the current operator's data is already resident on the GPU. The upload cost is zero. The scoring function adds a retention bonus that increases the score, pulling borderline operators onto the GPU path. This extends the fused segment, eliminating PCIe transfers between consecutive GPU operators.

Stage 4: Tier Routing and Execution

The score maps to one of three compute tiers.

Tier 1: CPU main thread

When used: Score < 0.3. Datasets under ~10,000 elements. Trivial post-aggregation sorts on small result sets. Also used for precision-sensitive operations that require Float64.

How it works: The operation executes synchronously on the calling thread. No worker spawn. No message passing. No buffer allocation. The data is already in JavaScript heap memory. The result is returned directly.

Performance: Sub-0.5 ms for typical small-dataset operations. Zero overhead. L1 cache locality for tight loops.

When it is the only option: VDI environments with no GPU and hardwareConcurrency = 1. The engine degrades gracefully to single-threaded execution.

Tier 2: SharedArrayBuffer Web Worker pool

When used: Score 0.3 to 1.0. Datasets between ~10,000 and ~500,000 elements (thresholds vary by hardware calibration). Also used as fallback when GPU is unavailable or device loss occurs.

How it works: A pre-warmed pool of threads (sized to navigator.hardwareConcurrency, typically 4 to 16) communicates via SharedArrayBuffer for zero-copy data sharing. Workers are parked on Atomics.wait() and wake in under 0.05 ms on Atomics.notify().

Each worker receives a contiguous chunk of the SharedArrayBuffer. For text search, the boundary overlap protocol extends each chunk by (patternLength - 1) bytes to prevent missed matches at partition boundaries.
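The boundary overlap protocol can be sketched as a partitioning function: each chunk is extended past its boundary by (patternLength - 1) bytes, so a match straddling a partition boundary is still found by exactly one worker. The `Chunk` shape and function name are illustrative.

```typescript
interface Chunk { start: number; end: number; } // [start, end) byte range

function partitionWithOverlap(
  totalBytes: number,
  workers: number,
  patternLength: number,
): Chunk[] {
  const base = Math.ceil(totalBytes / workers);
  const chunks: Chunk[] = [];
  for (let w = 0; w < workers; w++) {
    const start = w * base;
    if (start >= totalBytes) break;
    // Extend past the boundary so patterns spanning it are not missed.
    const end = Math.min(totalBytes, start + base + patternLength - 1);
    chunks.push({ start, end });
  }
  return chunks;
}
```

A match starting in the last (patternLength - 1) bytes of a chunk is reported by that chunk's worker; matches starting at or after the boundary belong to the next worker, so no match is counted twice.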

Each worker independently selects its algorithm based on chunk characteristics: counting sort for bounded integers (range < 65,536), LSD radix-256 for wide-range numerics, insertion sort for chunks under 64 elements.
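A sketch of that per-chunk selection. The thresholds (64 elements, range under 65,536) are taken from the text; the selector function itself is an illustrative assumption.

```typescript
type SortAlgo = 'insertion' | 'counting' | 'radix256';

function selectSortAlgo(chunk: Int32Array): SortAlgo {
  if (chunk.length < 64) return 'insertion'; // tiny chunk: O(n^2) wins on constants
  let min = chunk[0], max = chunk[0];
  for (const v of chunk) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  if (max - min < 65_536) return 'counting'; // bounded integer range
  return 'radix256';                         // wide-range numerics: LSD radix-256
}
```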

After all workers complete (signaled via Atomics.notify()), the main thread performs a k-way merge (for sorts) or result concatenation (for filters and aggregations).

Performance: 3x to 6x speedup over main thread for medium datasets. Consistent across hardware (CPU core counts vary less than GPU capabilities).

Tier 3: WebGPU compute pipeline

When used: Score > 1.0. Datasets above ~500,000 elements on discrete GPUs, ~2,000,000 on integrated GPUs (thresholds set by calibration). Compute-bound operations (GEMM) dispatch at smaller sizes due to high arithmetic intensity.

How it works: Data is uploaded to the GPU via a size-bucketed buffer pool (eliminating repeated allocation overhead). Operations execute as compute shader dispatches. Consecutive GPU-routed operators are pipeline-fused: intermediate results stay in GPU storage buffers, reducing transfers from 2N to N+1. Results are read back via mapAsync().

For text search, the two-phase pipeline runs a character frequency histogram pre-filter in 16 KB shared memory (Phase 1), eliminating up to 97% of candidates before byte-level matching (Phase 2). For streaming data, the searched-frontier mechanism ensures only new data is processed.

For structured queries, dictionary-encoded string columns are processed as integer arrays. WHERE clauses compile to u32 comparisons. GROUP BY uses Chao1-estimated shared memory accumulators for low-cardinality groups.

For sorting, the IEEE 754 bit-transform converts floats to sort-order-preserving unsigned integers, enabling O(n) radix-256 sort or local bitonic sort with global rank merge.

Performance: 10x to 75x speedup over Array.prototype.sort(). 2x to 20x over Web Workers. Sub-5 ms for 500,000-element operations on discrete hardware.

The cascading fallback

The three tiers are not independent options. They are a cascade. If the primary tier fails, the engine falls through to the next without application-level intervention.

GPU to Workers

If the GPU device is lost (driver crash, watchdog timeout, eGPU disconnection, power management, background tab throttling), the engine:

  1. Invalidates all cached state (pipeline cache, buffer pool, bind groups) within a single microtask. Time: under 0.1 ms.
  2. Re-dispatches pending operations to the Web Worker tier. The input data is intact in the SharedArrayBuffer (the GPU received a copy). Time: 0.1 to 0.5 ms for re-dispatch.
  3. Schedules hardware re-probe for the next invocation. The engine calls navigator.gpu.requestAdapter(), compares adapter info for hardware changes, re-runs calibration microbenchmarks, and resumes GPU dispatch with updated thresholds. Time: under 200 ms on next invocation.
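The recovery sequence hangs off the WebGPU device-loss promise. A minimal sketch of steps 1 and 2, assuming illustrative `caches` and `router` stand-ins (the real engine's internal objects are not shown in this series excerpt):

```typescript
interface GpuCaches { clear(): void; }                    // pipelines, buffers, bind groups
interface Router { forceTier(tier: 'workers'): void; }    // re-dispatch target

// GPUDevice.lost resolves (never rejects) when the device is lost.
function installDeviceLossHandler(
  device: { lost: Promise<{ reason: string }> },
  caches: GpuCaches,
  router: Router,
): Promise<string> {
  return device.lost.then((info) => {
    caches.clear();              // step 1: invalidate all cached GPU state
    router.forceTier('workers'); // step 2: pending ops fall back to the Worker pool
    return info.reason;          // 'destroyed' means intentional teardown
  });
}
```

Step 3 (the hardware re-probe via navigator.gpu.requestAdapter() and re-calibration) runs lazily on the next invocation rather than inside this handler, so recovery never blocks the in-flight fallback.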

The caller's promise resolves with correct results. Latency increases (GPU speed to Worker speed), but execution never fails.

Workers to main thread

If SharedArrayBuffer is unavailable (missing COOP/COEP headers, legacy browser), the Worker pool falls back to postMessage with transferable objects, accepting a 20% to 40% performance penalty from the loss of zero-copy sharing. If navigator.hardwareConcurrency === 1 (single-core device), workers offer no parallelism, and all computation runs synchronously on the main thread.

GPU to main thread (direct)

If both GPU and Workers are unavailable (no WebGPU adapter, no SharedArrayBuffer, single-core CPU), the engine runs everything on the main thread. This is the lowest-performance path but guarantees that the engine functions on every browser, on every device, with no external dependencies.

The safety systems

Three independent systems can block GPU dispatch. Each evaluates separately. Any one can override the dispatch score.

Safety system                  | What it detects                                              | Penalty   | Articles
Branch divergence classifier   | Per-element conditional branching (NFA, DP, trie)            | -Infinity | #2, #21
Atomic contention profiler     | Output density > 10% causing non-linear throughput collapse  | -Infinity | #10, #17
Precision Sufficiency Analyser | Float32 error exceeding caller tolerance                     | -Infinity | #6, #14, #29

The GPU path runs only when all three systems confirm: no categorical divergence, no contention cliff, no precision risk. This layered architecture means the engine never dispatches a workload that is divergent, contended, or imprecise.

The resource management layer

Two systems manage GPU resources to prevent memory leaks and allocation overhead.

Buffer pool. Size-bucketed (power-of-two) pool of GPU storage buffers. Checkout/return protocol eliminates per-query allocation cost (0.35 ms per buffer reduced to 0.01 ms). Leak detection with configurable timeout. Force-destroy unreturned buffers. Pool budget set to 25% of maxStorageBufferBindingSize.

Memory limit checking. Before any GPU allocation, the engine verifies the dataset fits within maxStorageBufferBindingSize (128 MB to 4 GB depending on hardware). Oversized datasets route to CPU unconditionally. No allocation attempt. No out-of-memory risk.
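The pool's size-bucketing and the limit check compose into one lookup: reject anything over the device limit, otherwise round the request up to the next power-of-two bucket. The 256-byte floor and the function name are assumptions for illustration; the 25% pool-budget policy is not modelled here.

```typescript
// Returns the power-of-two bucket for a request, or null when the dataset
// exceeds the device's maxStorageBufferBindingSize (route to CPU instead).
function bucketSize(
  requestBytes: number,
  maxStorageBufferBindingSize: number,
): number | null {
  if (requestBytes > maxStorageBufferBindingSize) return null;
  let size = 256; // assumed minimum bucket granularity
  while (size < requestBytes) size *= 2;
  return size;
}
```

Because all checked-out buffers come from a fixed set of bucket sizes, a returned buffer is immediately reusable by the next query of similar size, which is what collapses the 0.35 ms per-buffer allocation cost to 0.01 ms.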

The full pipeline in one diagram

Query enters
    |
    v
[Stage 1: Workload Characterisation]
    |-- Control flow analysis -> categorical? --> CPU (Workers or main thread)
    |-- Output density profiling -> >10%? -----> CPU (Workers or main thread)
    |-- Encoding detection -> UTF-8 + case-insensitive? -> CPU Workers
    |
    v (passed Stage 1)
[Stage 2: Precision Sufficiency Analysis]
    |-- High sensitivity (linear algebra) -> κ * ε > tolerance? --> CPU Float64
    |-- Medium sensitivity (accumulation) -> exceeds 16,777,216? --> CPU Float64
    |-- Low sensitivity (comparison) -> gap < ULP? --> flag, but usually passes
    |
    v (passed Stage 2)
[Stage 3: Dispatch Scoring]
    |-- Multi-factor formula (domain-specific): cardinality, selectivity,
    |   group cardinality, arithmetic intensity, access pattern, calibration
    |-- Pipeline fusion retention bonus (if preceding op was GPU)
    |
    +--> Score > 1.0  --> [Tier 3: WebGPU Compute]
    |                         |-- Buffer pool allocation
    |                         |-- Pipeline-fused dispatch
    |                         |-- Post-dispatch verification (if medium sensitivity)
    |                         |-- On device loss: cascade to Tier 2
    |
    +--> Score 0.3-1.0 --> [Tier 2: Web Worker Pool]
    |                         |-- SharedArrayBuffer zero-copy
    |                         |-- Atomics.wait/notify coordination
    |                         |-- Per-chunk adaptive algorithm
    |                         |-- Boundary overlap for text search
    |
    +--> Score < 0.3  --> [Tier 1: CPU Main Thread]
                              |-- Synchronous execution
                              |-- Float64 precision
                              |-- Zero overhead

Performance across the full hardware spectrum

The same application, the same query, on five different devices:

Device                | Hardware      | Tier selected | 500K-row filter time | 500K-row sort time
Developer workstation | RTX 4060      | GPU           | 1.1 ms               | 3.1 ms
MacBook Air           | M2 integrated | GPU           | 1.8 ms               | 4.5 ms
Enterprise laptop     | Intel Iris Xe | Workers       | 4.8 ms               | 11.8 ms
Corporate tablet      | Adreno 730    | Workers       | 5.2 ms               | 13.4 ms
VDI terminal          | No GPU        | Workers       | 6.1 ms               | 14.2 ms

No device crashes. No device runs the wrong tier. No device is penalized by a threshold calibrated for different hardware. The engine measured each device at session start and routed accordingly.

The developer on the RTX 4060 gets 1.1 ms filter times. The employee on the VDI terminal gets 6.1 ms. Both are within a single animation frame at 60 fps. Both see a responsive dashboard. The 5.5x performance gap is invisible to the user because both are below the perception threshold.

What ties it all together

This architecture did not emerge from a single design decision. It emerged from 29 specific engineering problems, each solved individually, then composed into a unified system:

  • Article #1: The adaptive dispatch engine and hardware calibration.
  • Article #2: SIMD divergence detection and categorical inhibition.
  • Article #3: IEEE 754 bit-transform for O(n) float sorting.
  • Article #4: The multi-factor scoring function and per-operator routing.
  • Article #5: Two-phase GPU text search with histogram pre-filter.
  • Article #6: Float32 precision risks in financial data.
  • Article #7: GPU device loss detection and recovery.
  • Article #8: Bitonic sort with asymmetric binary search merge.
  • Article #9: SharedArrayBuffer zero-copy parallel processing.
  • Article #10: Atomic contention mitigation and categorical threshold.
  • Article #11: Gaming anti-cheat with real-time pattern detection.
  • Article #12: Sub-200 ms hospitality CRM with face-scan recognition.
  • Article #13: Pipeline fusion eliminating PCIe transfer overhead.
  • Article #14: Float32 Safety Guard with condition number analysis.
  • Article #15: Searched-frontier tracking for streaming corpora.
  • Article #16: On-device cost analysis versus cloud APIs.
  • Article #17: Deep-dive on GPU synchronization primitives.
  • Article #18: Radix-256 sort bypassing Array.prototype.sort().
  • Article #19: Fault-tolerant AI workflows with device loss recovery.
  • Article #21: UTF-8 variable-width encoding detection and routing.
  • Article #22: Worker boundary overlap for zero missed matches.
  • Article #23: Dictionary encoding for GPU SQL string filtering.
  • Article #25: Self-calibrating dispatch thresholds.
  • Article #26: Chao1 estimator for GROUP BY cardinality prediction.
  • Article #27: GPU memory limits and buffer pool management.
  • Article #28: Arithmetic intensity and GEMM dispatch thresholds.
  • Article #29: Post-dispatch Float32 verification.

Each article solves one constraint. Together, they form an enterprise AI automation infrastructure that runs correct, fast, and fault-tolerant computation on any browser, on any hardware, without configuration.

The GPU makes it fast. The Workers make it parallel. The CPU makes it safe. The engine makes it automatic.

That is the architecture. No guessing. No hardcoded thresholds. No crossed fingers. Just measurement, scoring, dispatch, and verification. On every operation. On every device. Every time.

Want to discuss how this applies to your business?

Book a Discovery Call