Ayoob AI

Performance

15 articles on Performance from Ayoob AI, the full-code AI automation agency based in Newcastle upon Tyne.

Arithmetic Intensity: Why Matrix Multiplication Loves WebGPU

Why matrix multiplication is the one operation your browser's GPU was built for, and how Newcastle AI teams use it to replace six-figure cloud bills.

17 min read·2026-04-15

Why Hardcoded GPU Dispatch Thresholds Fail in the Browser

Hardcoded GPU thresholds break across devices. Self-calibrating dispatch makes AI software fast on every laptop, engineered for UK SMB workloads.

17 min read·2026-04-14

Managing WebGPU Memory Limits for Enterprise Datasets

Browser GPUs share memory with rendering and enforce strict allocation limits via maxStorageBufferBindingSize. Our engine queries these limits at runtime, routes oversized datasets to CPU unconditionally, and uses a size-bucketed buffer pool to eliminate repeated allocation overhead and prevent memory leaks.

15 min read·2026-04-14
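The size-bucketed pool the teaser mentions can be sketched in a few lines. This is our illustration of the general technique, not the engine's actual API: requested sizes are rounded up to the nearest power of two, so a freed buffer can satisfy any later request that lands in the same bucket, avoiding repeated allocation.

```javascript
// Illustrative size-bucketed buffer pool (class and method names are ours).
// Rounding sizes up to powers of two means a small number of buckets covers
// all request sizes, and freed buffers are reusable rather than leaked.
class BufferPool {
  constructor() {
    this.buckets = new Map(); // bucket size -> array of free ArrayBuffers
  }

  // Round a requested byte length up to the next power of two.
  static bucketFor(byteLength) {
    let size = 1;
    while (size < byteLength) size *= 2;
    return size;
  }

  acquire(byteLength) {
    const size = BufferPool.bucketFor(byteLength);
    const free = this.buckets.get(size);
    if (free && free.length > 0) return free.pop(); // reuse: no allocation
    return new ArrayBuffer(size);                   // cold path: allocate
  }

  release(buffer) {
    const free = this.buckets.get(buffer.byteLength) ?? [];
    free.push(buffer);
    this.buckets.set(buffer.byteLength, free);
  }
}
```

In a WebGPU setting the same bucketing logic would apply to `GPUBuffer` objects rather than `ArrayBuffer`s, with `device.limits.maxStorageBufferBindingSize` as the hard ceiling checked before any allocation.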

Preventing Missed Matches in Parallel Web Worker Text Search

Parallel text search silently misses matches that straddle the boundary between chunks. Here is how we fixed it, with the technique now used in every Newcastle document-processing engagement we ship.

16 min read·2026-04-13
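The standard fix for boundary misses can be sketched in plain JavaScript. This is an illustration under our own names, searching serially for clarity; in the real engine each chunk would go to a separate Web Worker. The key idea: adjacent chunks overlap by `pattern.length - 1` characters, so any match that straddles a boundary is fully contained in at least one chunk.

```javascript
// Split `text` into chunks that overlap by pattern.length - 1 characters,
// so a match crossing a chunk boundary still lands whole in one chunk.
// A Set deduplicates matches that appear in two overlapping chunks.
function findAllMatches(text, pattern, chunkSize) {
  const overlap = pattern.length - 1;
  const matches = new Set();
  for (let start = 0; start < text.length; start += chunkSize) {
    const end = Math.min(start + chunkSize + overlap, text.length);
    const chunk = text.slice(start, end);
    let i = chunk.indexOf(pattern);
    while (i !== -1) {
      matches.add(start + i); // translate back to a global offset
      i = chunk.indexOf(pattern, i + 1);
    }
  }
  return [...matches].sort((a, b) => a - b);
}
```

Without the overlap, a pattern beginning in one chunk and ending in the next is invisible to both workers, which is exactly the silent-miss failure mode the article describes.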

Why We Built the First Non-Comparison Float Sort in JavaScript (And Open Sourced It)

Array.prototype.sort() is broken for numerical data. We built a three-tier adaptive sorting engine that dispatches between CPU, Web Workers, and WebGPU compute shaders based on dataset characteristics. Here is why, and how.

10 min read·2026-04-12

WebGPU Atomic Contention: When to Stop Using the GPU

Sometimes the GPU is slower than the CPU. Knowing when is the real engineering: the decision logic behind our Newcastle AI builds.

16 min read·2026-04-11

Eliminating PCIe Bus Bottlenecks in Enterprise AI Compliance Tools

Most compliance AI wastes 80% of its time shuffling data between CPU and GPU. We eliminated that. Built for UK regulated industries.

15 min read·2026-04-10

Sub-200ms Hospitality CRMs: Moving SQL Relational Operators to WebGPU

Server-side CRM queries add 150 to 400 ms per interaction. Our Adaptive WebGPU Data Query Engine runs relational operators on in-memory columnar data at the client, using dictionary encoding for GPU string processing and a 6-factor scoring function for per-operator dispatch.

15 min read·2026-04-09
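Dictionary encoding, as mentioned in the teaser, is worth a concrete sketch. This is our illustration of the general technique rather than the engine's exact API: each distinct string in a column gets an integer code, producing a `Uint32Array` a compute shader can filter, group, or join on, since string comparisons become integer comparisons on the GPU side.

```javascript
// Dictionary-encode a string column into GPU-friendly integer codes.
// `codes` is the column the shader sees; `dictionary[code]` recovers the
// original string after GPU processing.
function dictionaryEncode(column) {
  const dict = new Map(); // string -> code, assigned in first-seen order
  const codes = new Uint32Array(column.length);
  column.forEach((value, i) => {
    if (!dict.has(value)) dict.set(value, dict.size);
    codes[i] = dict.get(value);
  });
  return { codes, dictionary: [...dict.keys()] };
}
```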

Zero-Copy Parallel Processing with SharedArrayBuffer in JavaScript

Zero-copy parallel AI compute in a browser, using every CPU core. How Newcastle SMBs get server-grade performance on standard laptops.

15 min read·2026-04-08
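The zero-copy idea reduces to this: two typed-array views over one `SharedArrayBuffer` see each other's writes with no serialization or structured clone. A minimal sketch, with the caveat that in a real pipeline the second view would live inside a Web Worker, and the page must be served with COOP/COEP headers for `SharedArrayBuffer` to be available at all:

```javascript
// Two views over the same SharedArrayBuffer: a write through one is visible
// through the other with zero copying. Atomics makes cross-thread reads and
// writes well-defined; here both views are on one thread for illustration.
const sab = new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT);
const producer = new Int32Array(sab); // would live on the main thread
const consumer = new Int32Array(sab); // would live in a Web Worker

Atomics.store(producer, 0, 42);         // write through one view...
const seen = Atomics.load(consumer, 0); // ...observed through the other
```

Contrast this with `postMessage` on a plain `ArrayBuffer`, which either copies the data or transfers (and detaches) it; with a `SharedArrayBuffer`, every core works on the same bytes.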

Mitigating Atomic Contention in Parallel Browser Environments

When thousands of GPU threads compete for the same atomic memory address, throughput collapses non-linearly. Our engine profiles expected output density and assigns a categorical penalty of negative infinity when contention exceeds safe thresholds, routing to CPU before the GPU stalls.

13 min read·2026-04-08
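The "categorical penalty of negative infinity" pattern can be sketched as a scoring function. The factor names and threshold below are our illustrative assumptions, not the engine's actual values: the point is that `-Infinity` is absorbing, so no accumulation of positive GPU-favouring factors can outweigh a contention veto.

```javascript
// Hypothetical dispatch scorer: estimate how many threads will hit the same
// atomic counter (elements x expected output density) and veto the GPU tier
// with -Infinity when the estimate crosses a safety threshold.
function gpuScore({ elementCount, expectedOutputDensity }) {
  const CONTENTION_THRESHOLD = 100_000; // illustrative cutoff, not tuned
  const expectedAtomicHits = elementCount * expectedOutputDensity;
  if (expectedAtomicHits > CONTENTION_THRESHOLD) return -Infinity;
  // Otherwise the score grows with dataset size (toy heuristic).
  return Math.log2(elementCount);
}

function chooseTier(workload) {
  return gpuScore(workload) > 0 ? "gpu" : "cpu";
}
```

A dense filter (say, half the elements match, all incrementing one atomic output counter) gets vetoed; a sparse needle-in-haystack search on the same data stays on the GPU.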

The Hidden Compute Costs of Array.prototype.sort() in Enterprise SaaS

V8's TimSort performs 20 million comparator callbacks per million elements, each crossing the native-to-JS boundary. Our adaptive sorting system bypasses this entirely with IEEE 754 bit-transforms and a two-phase GPU sort: local bitonic in shared memory, global rank merge via parallel binary search.

14 min read·2026-04-07
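The IEEE 754 bit-transform named in the teaser is a well-known trick, shown here as a CPU sketch (the engine's actual kernels would be WGSL): reinterpret each float's bits as an unsigned integer, flip all bits of negatives, and set the sign bit of non-negatives. Unsigned integer order then matches float order, so the values can be radix-sorted with no comparator callbacks at all.

```javascript
// Map a 32-bit float to an unsigned integer key whose ordering matches
// float ordering (standard monotone bit-transform; NaN handling omitted).
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer); // same bytes, integer view

function floatToSortableKey(x) {
  f32[0] = x;
  const bits = u32[0];
  // Negative floats: flip every bit. Non-negative: set the sign bit.
  // >>> 0 keeps the result an unsigned 32-bit integer in JavaScript.
  return ((bits & 0x80000000) ? ~bits : (bits | 0x80000000)) >>> 0;
}
```

Because the keys are plain unsigned integers, an LSD radix sort (or a GPU bitonic/rank-merge sort) can order them without ever crossing the native-to-JS boundary that makes comparator-based `Array.prototype.sort()` expensive.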

The Two-Phase GPU Text Search Algorithm for Massive Log Files

Brute-force pattern matching on 1 million log entries takes 800 ms on CPU. Our two-phase GPU algorithm uses a character frequency histogram pre-filter in 16 KB shared memory to eliminate up to 97% of candidates before byte-level matching begins.

13 min read·2026-04-06
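The two-phase structure can be modelled on the CPU to show why the pre-filter pays off. This is our serial illustration, not the WGSL implementation (which would keep the histogram in 16 KB of workgroup shared memory): phase 1 rejects any line whose character histogram cannot contain the pattern, and phase 2 does exact matching only on the survivors.

```javascript
// Two-phase search: a cheap per-line character histogram test (phase 1)
// eliminates most candidates before exact byte-level matching (phase 2).
function searchLogs(lines, pattern) {
  // Histogram of the pattern's character codes (low byte only, as a sketch).
  const need = new Uint32Array(256);
  for (const ch of pattern) need[ch.charCodeAt(0) & 0xff]++;

  const results = [];
  for (const line of lines) {
    // Phase 1: the line must contain at least as many of each character
    // as the pattern does, or it cannot possibly match.
    const have = new Uint32Array(256);
    for (let i = 0; i < line.length; i++) have[line.charCodeAt(i) & 0xff]++;
    let plausible = true;
    for (let c = 0; c < 256 && plausible; c++) {
      if (have[c] < need[c]) plausible = false;
    }
    // Phase 2: exact match, reached only by lines that pass the filter.
    if (plausible && line.includes(pattern)) results.push(line);
  }
  return results;
}
```

The histogram test is a necessary condition, never a sufficient one, so it can only skip work, not change results; the claimed win is that for typical log data the cheap phase rejects the overwhelming majority of lines.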

GPU-Accelerated Relational Queries: Moving the Database to the Browser

Server round-trips add 50 to 300 ms per dashboard interaction. Our Adaptive WebGPU Data Query Engine compiles structured queries into execution plans where each operator is routed to one of three execution tiers (CPU main thread, Web Worker thread pool, or WebGPU compute pipeline) based on a 6-factor dispatch scoring function.

14 min read·2026-04-05

Handling SIMD Branch Divergence in Browser-Based Compute Shaders

GPU wavefronts serialize when threads diverge. We built a categorical inhibition system that detects divergence-prone workloads at dispatch time and unconditionally routes them to the CPU tier.

11 min read·2026-04-04

Why WebGPU is Replacing Web Workers for Enterprise Data Processing

When to replace Web Workers with WebGPU for enterprise data processing. The calibration tells you when. Built by a Newcastle AI team.

9 min read·2026-04-04

Want to discuss performance for your business?

Book a Discovery Call