Ayoob AI

WebGPU

27 articles on WebGPU from Ayoob AI, the full-code AI automation agency based in Newcastle upon Tyne.

The Ayoob AI Architecture: Merging CPU, Workers, and WebGPU

A complete architectural overview of our heterogeneous dispatch engine. Every operation flows through workload characterization, precision analysis, and dispatch scoring before routing to the optimal tier: CPU main thread, SharedArrayBuffer Web Workers, or WebGPU compute. Cascading fallback guarantees execution continuity.

15 min read·2026-04-15

Trust but Verify: Validating GPU Float32 Math on the CPU

Our post-dispatch spot-check verification selects 16 elements from the GPU's Float32 output, re-computes them in Float64 on the CPU, and compares. If relative error exceeds the tier-specific tolerance (10^-4 for medium sensitivity, 10^-6 for high sensitivity), the engine discards the GPU result and re-executes on CPU. Speed first, correctness guaranteed.

15 min read·2026-04-15
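The verification loop described above can be sketched in a few lines. This is an illustrative model, not the engine's actual API: the function name, the `referenceFn` callback, and the strided sampling scheme are assumptions; only the 16-sample count, Float64 recomputation, and the 10^-4 / 10^-6 tolerances come from the article summary.

```javascript
// Sketch of post-dispatch spot-check verification (illustrative names).
// A Float64 CPU reference recomputes a sample of the GPU's Float32 output
// and compares relative error against a tier-specific tolerance.
function spotCheckGpuResult(input, gpuOutput, referenceFn, {
  samples = 16,
  tolerance = 1e-4,   // 1e-6 for high-sensitivity tiers
} = {}) {
  const n = gpuOutput.length;
  const step = Math.max(1, Math.floor(n / samples));
  for (let i = 0; i < n; i += step) {
    const expected = referenceFn(input, i); // recomputed in Float64 on the CPU
    const got = gpuOutput[i];               // Float32 result from the GPU
    const denom = Math.max(Math.abs(expected), Number.MIN_VALUE);
    if (Math.abs(got - expected) / denom > tolerance) {
      return false; // discard the GPU result and re-execute on CPU
    }
  }
  return true;
}
```

A passing spot-check costs 16 scalar recomputations regardless of output size, which is why it can run after every GPU dispatch.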

Arithmetic Intensity: Why Matrix Multiplication Loves WebGPU

Why matrix multiplication is the one operation your browser's GPU was built for, and how Newcastle AI teams use it to replace six-figure cloud bills.

17 min read·2026-04-15

Why Hardcoded GPU Dispatch Thresholds Fail in the Browser

Hardcoded GPU thresholds break across devices. Self-calibrating dispatch makes AI software fast on every laptop, engineered for UK SMB workloads.

17 min read·2026-04-14

Managing WebGPU Memory Limits for Enterprise Datasets

Browser GPUs share memory with rendering and enforce strict allocation limits via maxStorageBufferBindingSize. Our engine queries these limits at runtime, routes oversized datasets to CPU unconditionally, and uses a size-bucketed buffer pool to eliminate repeated allocation overhead and prevent memory leaks.

15 min read·2026-04-14
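The size-bucketed pool idea can be modelled without a real `GPUDevice`. Everything here is a hypothetical sketch: the class, the power-of-two bucketing, and the 256-byte floor are illustrative choices, with the allocator injected so the structure is visible without WebGPU.

```javascript
// Illustrative size-bucketed buffer pool. Requested byte sizes round up to
// the next power of two, so a freed buffer can be reused by any later
// request that lands in the same bucket — no repeated allocation.
class BufferPool {
  constructor(createBuffer) {
    this.createBuffer = createBuffer; // e.g. size => device.createBuffer({ size, usage })
    this.free = new Map();            // bucketSize -> idle buffers
  }
  static bucket(byteSize) {
    let b = 256;                      // assumed minimum bucket
    while (b < byteSize) b *= 2;
    return b;
  }
  acquire(byteSize) {
    const bucket = BufferPool.bucket(byteSize);
    const idle = this.free.get(bucket);
    if (idle && idle.length > 0) return idle.pop(); // reuse from pool
    return this.createBuffer(bucket);               // allocate a full bucket
  }
  release(buffer) {
    const bucket = BufferPool.bucket(buffer.size);
    if (!this.free.has(bucket)) this.free.set(bucket, []);
    this.free.get(bucket).push(buffer);             // return to pool, never destroy
  }
}
```

Rounding up wastes at most half a bucket of memory per buffer in exchange for a near-perfect reuse rate across varying dataset sizes.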

Predicting GPU Hash Map Collisions with the Chao1 Estimator

GPU databases crash when GROUP BY cardinality is guessed wrong. The Chao1 estimator predicts it, used in our Newcastle-built analytics engine.

17 min read·2026-04-14
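The Chao1 estimator itself is short enough to show. This sketch uses the standard bias-corrected form, S_obs + f1(f1−1)/(2(f2+1)), applied to a sample of group keys; how the article's engine samples keys and sizes its hash table from the estimate is not shown here.

```javascript
// Chao1 richness estimator (bias-corrected form): predicts total distinct-key
// cardinality from the number of keys seen exactly once (f1) and exactly
// twice (f2) in a sample.
function chao1(sampleKeys) {
  const counts = new Map();
  for (const k of sampleKeys) counts.set(k, (counts.get(k) ?? 0) + 1);
  let f1 = 0, f2 = 0;
  for (const c of counts.values()) {
    if (c === 1) f1++;
    else if (c === 2) f2++;
  }
  const sObs = counts.size; // distinct keys actually observed
  return sObs + (f1 * (f1 - 1)) / (2 * (f2 + 1));
}
```

Intuition: many singletons in the sample mean many unseen keys remain, so the estimate grows well past the observed count; no singletons means the sample has likely seen everything.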

Executing SQL WHERE Clauses on the GPU with Dictionary Encoding

Filtering 10M customer records on a GPU in the browser in under 200 ms. The technique powering our Newcastle AI data-query engines.

17 min read·2026-04-14
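The core trick can be sketched on the CPU side. Function names and the reference filter are illustrative; the point is that after encoding, an equality predicate on strings becomes a single 32-bit integer comparison per row, which is exactly the branch-free form a compute shader evaluates in parallel.

```javascript
// Dictionary encoding sketch: each distinct string gets an integer code at
// load time, and the column is stored as a Uint32Array of codes.
function dictionaryEncode(column) {
  const dict = new Map();
  const codes = new Uint32Array(column.length);
  column.forEach((v, i) => {
    if (!dict.has(v)) dict.set(v, dict.size);
    codes[i] = dict.get(v);
  });
  return { dict, codes };
}

// WHERE city = 'Newcastle' becomes one integer compare per row. (CPU
// reference implementation of the predicate the GPU would evaluate.)
function whereEquals({ dict, codes }, value) {
  const code = dict.get(value);
  if (code === undefined) return []; // value never occurs in the column
  const rows = [];
  for (let i = 0; i < codes.length; i++) {
    if (codes[i] === code) rows.push(i);
  }
  return rows;
}
```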

The Variable-Width Problem: Why UTF-8 Breaks WebGPU Text Search

GPUs break on variable-width text (apostrophes, emojis, names). Our UTF-8-safe engine is why Newcastle law and finance firms trust our AI search.

15 min read·2026-04-13

Bypassing Array.prototype.sort() with IEEE 754 Bit-Transforms

V8's TimSort coerces numbers to strings and cannot use parallel hardware. Our Adaptive Multi-Tier Sorting System transforms IEEE 754 floats to sort-order-preserving unsigned integers using two bitwise operations, enabling radix-256 sort on CPU workers and a two-phase GPU bitonic-merge sort with 1.45x speedup over Web Workers at 5M+ elements on discrete GPUs.

14 min read·2026-04-12
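The two-operation transform mentioned above is the well-known Herf (2001) construction; this sketch shows it in scalar JavaScript. Negative floats (sign bit set) get every bit flipped, positives get only the sign bit flipped, and the resulting unsigned integers compare in the same order as the floats.

```javascript
// IEEE 754 bit-transform: reinterpret a Float32's bits as a Uint32, then
// flip bits so unsigned integer order matches numeric float order.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer); // shared buffer = bit reinterpretation

function floatToSortableU32(x) {
  f32[0] = x;
  const bits = u32[0];
  const mask = (-(bits >>> 31) | 0x80000000) >>> 0; // all-ones if negative, else sign bit
  return (bits ^ mask) >>> 0;                        // the two bitwise operations
}
```

Once keys are plain unsigned integers, a radix-256 sort (or a GPU bitonic sort) never needs a comparator callback at all.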

Why We Built the First Non-Comparison Float Sort in JavaScript (And Open Sourced It)

Array.prototype.sort() is broken for numerical data. We built a three-tier adaptive sorting engine that dispatches between CPU, Web Workers, and WebGPU compute shaders based on dataset characteristics. Here is why, and how.

10 min read·2026-04-12

Building Fault-Tolerant AI Workflows: Handling WebGPU Device Loss

Browser GPUs crash, drivers reset, and hardware context vanishes without warning. Our cascading fallback architecture registers on the GPUDevice.lost promise, invalidates all cached state, re-dispatches to CPU workers within the same microtask, and re-probes hardware on the next invocation.

13 min read·2026-04-12

WebGPU Atomic Contention: When to Stop Using the GPU

Sometimes the GPU is slower than the CPU. Knowing when is the real engineering: the decision logic behind our Newcastle AI builds.

16 min read·2026-04-11

Why On-Device WebGPU Architecture is Cheaper Than Cloud LLM APIs

Routing every sort, filter, and aggregation to a cloud server costs $0.12 to $0.85 per 1,000 queries at scale. Our adaptive dispatch engine profiles local hardware via navigator.gpu.requestAdapter() and routes computation to the client GPU, eliminating server compute costs for data transformation entirely.

12 min read·2026-04-11

Preventing Silent Numerical Degradation in GPU-Accelerated Finance AI

GPU-accelerated finance AI silently loses precision below the pound. Our Float32 safety guard catches it before it hits your ledger, engineered in Newcastle.

17 min read·2026-04-10

Eliminating PCIe Bus Bottlenecks in Enterprise AI Compliance Tools

Most compliance AI wastes 80% of its time shuffling data between CPU and GPU. We eliminated that. Built for UK regulated industries.

15 min read·2026-04-10

Real-Time Threat Detection with GPU-Accelerated Streaming Corpora

Live log streams grow continuously. Our searched-frontier tracking mechanism extends the corpus buffer without re-encoding existing documents and dispatches the GPU only against unsearched data beyond the frontier offset. Atomic contention detection prevents non-linear slowdowns when match density spikes.

14 min read·2026-04-10
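The frontier idea reduces to a single offset. This is a minimal CPU model with illustrative names; the real engine dispatches the scan beyond the frontier to the GPU and keeps existing documents encoded in place.

```javascript
// Searched-frontier sketch: the corpus only grows, and each search covers
// [frontier, length) — already-searched documents are never re-scanned.
class StreamingCorpus {
  constructor() {
    this.docs = [];
    this.frontier = 0; // index of the first unsearched document
  }
  append(batch) {
    this.docs.push(...batch); // extend without touching existing entries
  }
  searchNew(pattern) {
    const hits = [];
    for (let i = this.frontier; i < this.docs.length; i++) {
      if (this.docs[i].includes(pattern)) hits.push(i);
    }
    this.frontier = this.docs.length; // advance past everything searched
    return hits;
  }
}
```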

Eliminating Bot Networks: Two-Phase GPU Pattern Matching for Gaming Anti-Cheat

Standard regex cannot run on GPUs due to SIMD branch divergence. Our two-phase pattern matching engine uses character frequency histograms in 16 KB shared memory to eliminate 97% of candidates before byte-level matching, enabling sub-second fraud detection across millions of chat messages.

15 min read·2026-04-09

Sub-200ms Hospitality CRMs: Moving SQL Relational Operators to WebGPU

Server-side CRM queries add 150 to 400 ms per interaction. Our Adaptive WebGPU Data Query Engine runs relational operators on in-memory columnar data at the client, using dictionary encoding for GPU string processing and a 6-factor scoring function for per-operator dispatch.

15 min read·2026-04-09

Mitigating Atomic Contention in Parallel Browser Environments

When thousands of GPU threads compete for the same atomic memory address, throughput collapses non-linearly. Our engine profiles expected output density and assigns a categorical penalty of negative infinity when contention exceeds safe thresholds, routing to CPU before the GPU stalls.

13 min read·2026-04-08
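The categorical-penalty mechanism can be shown in miniature. The weights and the 0.15 threshold below are invented for illustration; what matters is that a negative-infinity term cannot be outvoted by any finite combination of other factors.

```javascript
// Sketch of a dispatch score with a categorical veto: predicted atomic
// contention above a threshold assigns -Infinity, unconditionally routing
// the operation to CPU before the GPU stalls.
function gpuDispatchScore({ sizeScore, intensityScore, expectedMatchDensity }) {
  const CONTENTION_THRESHOLD = 0.15; // illustrative: fraction of threads hitting one atomic
  if (expectedMatchDensity > CONTENTION_THRESHOLD) {
    return -Infinity;                // categorical penalty — no factor can outweigh it
  }
  return 0.6 * sizeScore + 0.4 * intensityScore; // ordinary weighted factors
}
```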

The Hidden Compute Costs of Array.prototype.sort() in Enterprise SaaS

V8's TimSort performs 20 million comparator callbacks per million elements, each crossing the native-to-JS boundary. Our adaptive sorting system bypasses this entirely with IEEE 754 bit-transforms and a two-phase GPU sort: local bitonic in shared memory, global rank merge via parallel binary search.

14 min read·2026-04-07

Engineering Resilient Compute Pipelines: Handling WebGPU Device Loss

Browser GPUs crash, drivers update, and hardware context vanishes without warning. Our engine detects device loss via the GPUDevice.lost promise, invalidates all cached state, and transparently re-dispatches to CPU within the same operation.

14 min read·2026-04-07

Why Reduced-Precision GPU Arithmetic is Dangerous for Enterprise Finance

WebGPU forces Float64 financial data into Float32, silently corrupting values above 16,777,216. Our Precision Sufficiency Analyser estimates condition numbers and accumulation bounds to prevent GPU dispatch when precision loss exceeds tolerance.

14 min read·2026-04-06
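The corruption boundary above is checkable in one line per value: `Math.fround` performs exactly the Float64-to-Float32 rounding a GPU buffer would. This minimal round-trip guard is a sketch of the idea only; the article's analyser additionally bounds accumulated error via condition-number estimates.

```javascript
// Float32 cannot represent every integer above 2^24 = 16,777,216, and most
// decimal fractions (0.1, 0.01, …) never round-trip at all. Refuse GPU
// dispatch if any value would change when truncated to Float32.
function isFloat32Safe(values) {
  for (const v of values) {
    if (Math.fround(v) !== v) return false; // truncation would silently corrupt this value
  }
  return true;
}
```

Note that even 0.1 fails the round-trip, which is why sub-pound financial amounts are a Float32 hazard, not just large balances.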

The Two-Phase GPU Text Search Algorithm for Massive Log Files

Brute-force pattern matching on 1 million log entries takes 800 ms on CPU. Our two-phase GPU algorithm uses a character frequency histogram pre-filter in 16 KB shared memory to eliminate up to 97% of candidates before byte-level matching begins.

13 min read·2026-04-06
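The two phases can be modelled on the CPU. This is an illustrative scalar version of the idea: a line can only contain the pattern if its character-frequency histogram dominates the pattern's, so the cheap histogram test safely discards candidates with no false negatives before the exact byte-level match runs.

```javascript
// Phase 1: per-byte frequency histogram (the GPU version lives in 16 KB of
// workgroup shared memory). ASCII assumed for this sketch.
function histogram(s) {
  const h = new Uint32Array(256);
  for (let i = 0; i < s.length; i++) h[s.charCodeAt(i) & 0xff]++;
  return h;
}

function twoPhaseSearch(lines, pattern) {
  const ph = histogram(pattern);
  const hits = [];
  for (let i = 0; i < lines.length; i++) {
    const lh = histogram(lines[i]);
    let possible = true;
    for (let c = 0; c < 256; c++) {
      if (ph[c] > lh[c]) { possible = false; break; } // line lacks a needed character
    }
    if (possible && lines[i].includes(pattern)) hits.push(i); // phase 2: exact match
  }
  return hits;
}
```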

IEEE 754 Bit-Transforms for High-Speed Float Processing in JavaScript

JavaScript uses Float64. WebGPU requires Float32. The IEEE 754 bit-transform (Herf 2001) converts floats to sort-order-preserving unsigned integers. Our contribution is the Float32 safety guard that inhibits GPU dispatch when Float64-to-Float32 truncation would alter sort order, plus the adaptive multi-tier dispatch system.

12 min read·2026-04-05

GPU-Accelerated Relational Queries: Moving the Database to the Browser

Server round-trips add 50 to 300 ms per dashboard interaction. Our Adaptive WebGPU Data Query Engine compiles structured queries into execution plans where each operator is routed to one of three execution tiers (CPU main thread, Web Worker thread pool, or WebGPU compute pipeline) based on a 6-factor dispatch scoring function.

14 min read·2026-04-05

Handling SIMD Branch Divergence in Browser-Based Compute Shaders

GPU wavefronts serialize when threads diverge. We built a categorical inhibition system that detects divergence-prone workloads at dispatch time and unconditionally routes them to the CPU tier.

11 min read·2026-04-04

Why WebGPU is Replacing Web Workers for Enterprise Data Processing

When to replace Web Workers with WebGPU for enterprise data processing. Runtime calibration tells you when. Built by a Newcastle AI team.

9 min read·2026-04-04

Want to discuss WebGPU for your business?

Book a Discovery Call