The latency you are ignoring
Every time a user adjusts a filter on your analytics dashboard, a request leaves the browser. It crosses the network to your API server. The server parses the query, authenticates the session, hits the database, serializes the result, and sends it back. The browser deserializes, reconciles state, and re-renders.
Best case on a well-optimized stack: 50 ms. Typical enterprise environment with VPN overhead, multi-region routing, and a cold connection pool: 150 to 300 ms. Per interaction. Every slider drag, every checkbox toggle, every date range change.
Your users are not waiting for compute. They are waiting for packets.
The data is small enough to live in the browser. A 500,000-row dataset with 12 columns of mixed types occupies roughly 25 to 40 MB in columnar format. That fits comfortably in a single ArrayBuffer. The question is whether the browser can query it fast enough to replace the server round-trip.
It can. But not with JavaScript alone.
Why JavaScript query engines hit a ceiling
Client-side query libraries exist. They parse SQL or a query DSL, iterate over arrays, and return results. For 10,000 rows, they are fast enough. For 500,000 rows, they are not.
A filter operation on 500,000 rows with a numeric predicate (WHERE revenue > 50000) requires 500,000 comparisons. In JavaScript, each comparison involves a property access on an object (or typed array index), a numeric comparison, and a conditional branch to build the result set. V8 optimizes this aggressively, but you are still bound by single-threaded scalar execution. On a modern laptop, this takes 8 to 15 ms.
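The scalar path can be made concrete with a minimal sketch. This is illustrative code, not any library's implementation; the function name and data are invented for the example:

```javascript
// Schematic of the single-threaded scalar filter a JS engine executes:
// one load, one comparison, one branch per row, on one thread.
function filterRevenue(revenue /* Float64Array */, threshold) {
  const matches = [];
  for (let i = 0; i < revenue.length; i++) {
    if (revenue[i] > threshold) matches.push(i); // build the result set
  }
  return matches;
}

const revenue = new Float64Array([120000, 30000, 75000, 10000]);
const result = filterRevenue(revenue, 50000);
// result -> [0, 2]
```

However well V8 optimizes this loop, every row still flows through it sequentially.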
A group-by aggregation (GROUP BY region, quarter) adds a hash lookup per row, accumulator updates, and a final reduction. For 500,000 rows with 200 groups, JavaScript needs 25 to 40 ms.
Chain three operators (filter, group-by, sort) and you are at 50 to 80 ms. That is competitive with a fast server round-trip but not faster. Add a second concurrent query (cross-filtering between linked charts) and the main thread blocks for 100 to 160 ms. The UI stutters.
Web Workers help. Offloading to a worker pool eliminates main-thread blocking, and parallel execution reduces wall-clock time. But the ceiling is the CPU's core count: eight threads processing 500,000 rows still means 62,500 rows per thread, sequentially. For arithmetic-heavy aggregations, you need more parallelism than the CPU provides.
Our Adaptive WebGPU Data Query Engine
We built a query engine that runs entirely in the browser, compiles structured queries into execution plans, and routes each operator to the optimal compute backend. Not the entire query. Each operator independently.
The architecture is built on five principal modules: a hardware capability detector, columnar data storage with dictionary encoding, a per-operator workload characterizer, a dispatch scoring model, and a pipeline executor.
Columnar data storage with dictionary encoding
When a dataset is loaded, the engine analyzes each column's type and cardinality.
Numeric columns (integers, floats, timestamps) are stored as typed arrays (Float32Array, Int32Array, BigInt64Array) in columnar layout. One contiguous buffer per column. No object overhead. No property lookups.
String columns are dictionary-encoded. The engine builds a sorted dictionary of unique values and replaces each string with its integer index:
// Raw data
const regions = ["EMEA", "APAC", "EMEA", "NAM", "APAC", "EMEA"];
// Dictionary
const dict = ["APAC", "EMEA", "NAM"]; // sorted, unique
// Encoded column
const encoded = new Uint32Array([1, 0, 1, 2, 0, 1]);
This is critical for GPU execution. WebGPU compute shaders cannot process variable-length strings. They operate on fixed-width numeric types. Dictionary encoding converts every string column to a Uint32Array of indices that the GPU can compare, group, and sort using integer arithmetic.
The dictionary itself stays on the CPU. Only the integer indices are uploaded to GPU buffers. For a column with 500,000 rows and 200 unique values, the dictionary is a 200-entry string array (a few KB). The encoded column is a 2 MB Uint32Array. The memory saving over storing 500,000 JavaScript strings is typically 4x to 8x.
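A minimal encoder matching the example above can be sketched as follows; the function name is hypothetical, but the output reproduces the dictionary and encoded column shown earlier:

```javascript
// Build a sorted dictionary of unique values, then replace each
// string with its integer index in that dictionary.
function dictionaryEncode(values) {
  const dict = [...new Set(values)].sort();
  const index = new Map(dict.map((v, i) => [v, i]));
  const encoded = Uint32Array.from(values, (v) => index.get(v));
  return { dict, encoded };
}

const { dict, encoded } = dictionaryEncode(
  ["EMEA", "APAC", "EMEA", "NAM", "APAC", "EMEA"]
);
// dict    -> ["APAC", "EMEA", "NAM"]
// encoded -> Uint32Array [1, 0, 1, 2, 0, 1]
```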
Query compilation
The engine accepts structured queries as typed objects:
const query = {
  filter: { column: "revenue", op: ">", value: 50000 },
  groupBy: ["region", "quarter"],
  aggregate: { column: "revenue", fn: "sum" },
  orderBy: { column: "sum_revenue", direction: "desc" },
  limit: 100
};
The compiler translates this into a directed acyclic graph of operators: Filter, GroupBy, Aggregate, Sort, Limit. Each operator is a self-contained unit with typed input and output schemas. The compiler resolves column references to buffer offsets and dictionary lookups at compile time, not execution time.
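Compile-time resolution can be illustrated with a filter on a dictionary-encoded column. The helper name is hypothetical; the point is that the dictionary lookup happens once, at compile time, so the runtime loop compares only integers:

```javascript
// Compile a string-equality filter against a dictionary-encoded column.
// The dictionary lookup is resolved ONCE here, not once per row.
function compileEqFilter(column /* Uint32Array */, dict, value) {
  const idx = dict.indexOf(value); // compile-time resolution
  return function run() {
    const out = [];
    for (let i = 0; i < column.length; i++) {
      if (column[i] === idx) out.push(i); // pure u32 comparison
    }
    return out;
  };
}

const dict = ["APAC", "EMEA", "NAM"];
const region = new Uint32Array([1, 0, 1, 2, 0, 1]);
const run = compileEqFilter(region, dict, "EMEA");
// run() -> [0, 2, 5]
```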
Per-operator workload characterization and dispatch scoring
This is where our engine diverges from every other client-side query library. We do not run the entire query on one backend. The per-operator workload characterizer analyzes each operator, and the dispatch scoring model evaluates it against a 6-factor scoring function to route it independently to one of three execution tiers: CPU main thread, Web Worker thread pool, or WebGPU compute pipeline.
Pipeline executor
The pipeline executor handles plan generation, multi-pass re-scoring (up to 3 iterations), segment identification, and execution with transparent CPU fallback on GPU failure. Operators execute in dependency order, passing intermediate results between them as typed array buffers. When consecutive operators are both routed to the GPU, a GPUResidentDataset keeps the intermediate buffer in GPU memory with no CPU round-trip. This GPU buffer retention is what the dispatch scoring model captures in Factor 5 (GPU buffer retention bonus), re-scoring downstream operators to account for data already resident on the GPU. The mechanism is distinct from the general pipeline fusion principle in our sorting engine: here the retention bonus is SQL-specific, feeding back into the dispatch scoring model to create cascading GPU segment formation.
The 6-factor dispatch scoring function
Each operator receives a dispatch score computed from six factors defined in our patent filing.
Factor 1 (F1): Row count vs threshold
The number of rows the operator will process, evaluated against hardware-specific thresholds. For the first operator in the pipeline, this is the full dataset size. For downstream operators, it is the estimated output cardinality of the preceding operator. F1 carries a weight of 4.0, making it the dominant factor. The discrete GPU threshold is 50,000 rows; the integrated GPU threshold is 100,000 rows. Below the threshold, the fixed overhead of GPU buffer allocation and shader dispatch dominates. Above it, the GPU's throughput advantage compensates.
Factor 2 (F2): Operator-specific SQL metric
F2 captures the workload characteristic most relevant to the specific SQL operator being scored. For filter operators, this is predicate selectivity: the fraction of rows that pass the predicate. The engine estimates selectivity from lightweight statistics maintained during schema ingestion: min, max, histogram (64 buckets), and null count per column.
For group-by operators, F2 is the Chao1 group cardinality estimate. We estimate group cardinality using the Chao1 species richness estimator from ecology statistics. During schema ingestion, the engine samples a fixed-size random subset of each column and counts observed unique values and singleton frequencies:
Chao1 = d + f1 * (f1 - 1) / (2 * (f2 + 1))
Where d is the number of observed unique values in the sample, f1 is the number of values appearing exactly once (singletons), and f2 is the number of values appearing exactly twice (doubletons). The (f2 + 1) denominator provides a bias correction that handles the edge case of zero doubletons without requiring a special guard. The Chao1 estimator provides a lower bound on total distinct values, accounting for unseen categories in the full dataset. For a dictionary-encoded column, the exact cardinality is already known (it is the dictionary size). Chao1 is used for composite group-by keys (GROUP BY region, quarter) where the cross-product cardinality is unknown without a full scan.
For join operators, F2 is the join key overlap ratio: the fraction of keys in the smaller relation that have matches in the larger relation.
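The Chao1 computation itself is small. The sketch below implements the formula exactly as given; the function name and sample data are for illustration only:

```javascript
// Chao1 species richness estimate from a random sample of a column.
function chao1(sample) {
  const counts = new Map();
  for (const v of sample) counts.set(v, (counts.get(v) ?? 0) + 1);
  const d = counts.size; // observed unique values
  let f1 = 0, f2 = 0;
  for (const c of counts.values()) {
    if (c === 1) f1++;      // singletons
    else if (c === 2) f2++; // doubletons
  }
  // (f2 + 1) denominator: bias correction, no zero-doubleton guard needed
  return d + (f1 * (f1 - 1)) / (2 * (f2 + 1));
}

// 3 observed uniques, 2 singletons ("b", "c"), 0 doubletons:
chao1(["a", "a", "a", "b", "c"]); // -> 3 + (2 * 1) / 2 = 4
```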
Factor 3 (F3): GPU class adjustment
Adjusts the score based on the detected GPU class (discrete, integrated, or absent). A discrete GPU with dedicated VRAM receives a favorable adjustment. An integrated GPU sharing system memory receives a penalty that reflects its lower memory bandwidth and reduced parallelism.
Factor 4 (F4): Vendor tuning
Hardware vendor-specific tuning coefficients. Different GPU vendors (and different generations within a vendor) have different atomic throughput, shared memory size, and dispatch overhead characteristics. F4 captures these differences so that the same query on the same dataset produces different routing decisions on hardware from different vendors.
Factor 5 (F5): GPU buffer retention bonus
When a preceding operator has already produced its output in a GPU buffer (via the GPUResidentDataset class), the next operator receives a bonus for keeping execution on the GPU. This avoids the cost of reading results back to the CPU and re-uploading them. The pipeline executor performs multi-pass re-scoring (up to 3 iterations) to propagate buffer retention bonuses through the operator pipeline. This factor is specific to the query engine's SQL pipeline: the GPU buffer retention bonus feeds back into the dispatch scoring model to create cascading GPU segment formation across multi-operator plans.
Factor 6 (F6): Hardware-adaptive buffer threshold
The hardware-specific buffer size threshold derived from the hardware capability detector's runtime probing. This factor accounts for the device's actual GPU buffer limits and memory bandwidth, normalizing the scoring function across hardware classes.
Score computation and tier routing
The six factors combine into a dispatch score. If the score is positive, the operator dispatches to the WebGPU compute pipeline. If the score is non-positive and the row count falls within a defined medium range (between 10,000 and 500,000 rows in our preferred configuration), the operator dispatches to the Web Worker thread pool. Otherwise, the operator executes on the CPU main thread. If the operator's control flow analysis triggers a categorical divergence classification, the score is overridden to negative infinity regardless of the other factors (this uses the same categorical inhibition principle covered by our GPU Inhibition patent).
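The routing decision reduces to a small piece of control flow. This is a simplified sketch: the function and constant names are invented, the score is taken as already computed, and only the thresholds stated above are carried over:

```javascript
// Tier routing for a scored operator. The 6-factor score computation
// itself is not shown; `score` is assumed to be its output.
const MEDIUM_MIN = 10_000;
const MEDIUM_MAX = 500_000;

function routeOperator(score, rowCount, categoricalDivergence) {
  // Categorical divergence inhibits GPU dispatch unconditionally.
  if (categoricalDivergence) score = -Infinity;
  if (score > 0) return "gpu";
  if (rowCount >= MEDIUM_MIN && rowCount <= MEDIUM_MAX) return "workers";
  return "main-thread";
}

routeOperator(1.4, 500_000, false);  // -> "gpu"
routeOperator(-0.2, 175_000, false); // -> "workers"
routeOperator(-0.5, 80, false);      // -> "main-thread"
routeOperator(2.0, 600_001, true);   // -> "main-thread"
```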
How operators execute on the GPU
Parallel filter
The filter uses three of the engine's nine WGSL shaders. The filter predicate evaluation shader evaluates the predicate for every row in parallel, writing a 1 or 0 to a bitmask buffer. The three-level prefix sum shader computes compacted output indices for each surviving row. The filter scatter shader writes matching rows to a dense output buffer.
Three dispatches. For 500,000 rows on a discrete GPU: 1.2 ms total. The equivalent CPU filter with Web Workers: 4.8 ms. If the GPU encounters a failure at any stage, the pipeline executor transparently falls back to CPU execution.
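The semantics of the three passes can be expressed as a single-threaded CPU reference. This sketch is illustrative (names invented); on the GPU each pass runs over all rows in parallel rather than in a loop:

```javascript
// CPU reference of the three GPU passes:
//   1. predicate evaluation -> bitmask
//   2. exclusive prefix sum -> compacted output offsets
//   3. scatter              -> dense output buffer
function filterCompact(column, predicate) {
  const n = column.length;

  const mask = new Uint32Array(n); // pass 1
  for (let i = 0; i < n; i++) mask[i] = predicate(column[i]) ? 1 : 0;

  const offsets = new Uint32Array(n); // pass 2
  let total = 0;
  for (let i = 0; i < n; i++) { offsets[i] = total; total += mask[i]; }

  const out = new Float32Array(total); // pass 3
  for (let i = 0; i < n; i++) {
    if (mask[i]) out[offsets[i]] = column[i];
  }
  return out;
}

filterCompact(new Float32Array([70000, 20000, 90000]), (v) => v > 50000);
// -> Float32Array [70000, 90000]
```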
Parallel group-by aggregation
The group-by path uses the group key hash shader and the segmented reduction with fixup shader. For low group cardinality (under 1,024 groups), each workgroup maintains a local accumulator array in shared memory. Workgroups process tiles of the input, incrementing local accumulators by dictionary-encoded group key. The segmented reduction with fixup shader merges workgroup-local accumulators into the global result.
Shared memory accumulators avoid atomic contention on global memory. For 500,000 rows grouped into 50 regions with a sum aggregation: 0.9 ms on GPU versus 6.1 ms on 8-thread CPU.
For high group cardinality (above 1,024), the scoring function routes to the CPU tier. The CPU uses a hash map with open addressing, which handles arbitrary group counts without the shared memory size constraints of GPU workgroups.
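For the low-cardinality path, the accumulator logic is the same on CPU and GPU: the dictionary index is the array offset. A minimal sketch (hypothetical names, single-threaded for clarity):

```javascript
// Dense-accumulator group-by sum over a dictionary-encoded key column.
// No hashing: the encoded key indexes the accumulator array directly.
function groupBySum(keys /* Uint32Array */, values, numGroups) {
  const sums = new Float64Array(numGroups);
  for (let i = 0; i < keys.length; i++) {
    sums[keys[i]] += values[i]; // direct indexed accumulate
  }
  return sums;
}

// Keys encoded against dict ["APAC", "EMEA", "NAM"]:
const keys = new Uint32Array([1, 0, 1, 2, 0, 1]);
const revenue = new Float64Array([10, 20, 30, 40, 50, 60]);
groupBySum(keys, revenue, 3); // -> Float64Array [70, 100, 40]
```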
Parallel sort (radix-256)
Sorting reuses our IEEE 754 bit-transform and radix-256 pipeline, with the addition of a composite sort key extraction shader for multi-column ORDER BY clauses. To sort by a numeric column, the engine applies the bit-transform, executes four radix passes, and produces an index permutation array, which is then applied to all output columns. To sort by a dictionary-encoded string column, the integer indices are sorted directly: no bit-transform is needed, because dictionary indices are already non-negative integers in the lexicographic order of the dictionary. The engine also performs a Float32 ordering-preservation safety check for any Float64 columns involved in sorting, ensuring that radix-sorting the Float32 values preserves the correct ordering of the original Float64 data.
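The final step, applying the permutation to output columns, is a simple gather. A sketch with invented names, assuming `perm[j]` gives the source row that lands at output position `j`:

```javascript
// Apply a sort's index permutation to one output column (a gather).
// Works for any typed array via the instance's own constructor.
function applyPermutation(column, perm) {
  const out = new column.constructor(perm.length);
  for (let j = 0; j < perm.length; j++) out[j] = column[perm[j]];
  return out;
}

const perm = new Uint32Array([2, 0, 1]); // e.g. produced by a radix sort
applyPermutation(new Float32Array([5, 9, 7]), perm);
// -> Float32Array [7, 5, 9]
```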
Mixed-tier query execution in practice
Consider a real dashboard query on a 500,000-row sales dataset:
SELECT region, quarter, SUM(revenue), AVG(deal_size)
FROM sales
WHERE close_date > '2025-01-01' AND status = 'won'
GROUP BY region, quarter
ORDER BY SUM(revenue) DESC
LIMIT 20
The engine compiles this into four operators and scores each:
| Operator | Cardinality in | Selectivity / Groups | Score | Routed to |
|---|---|---|---|---|
| Filter (date + status) | 500,000 | ~35% selectivity | 1.4 | GPU |
| GroupBy (region x quarter) | ~175,000 | Chao1: ~80 groups | 1.8 | GPU |
| Sort (sum_revenue DESC) | 80 | n/a | 0.01 | CPU (main thread) |
| Limit (20) | 80 | n/a | 0.001 | CPU (main thread) |
The filter and group-by run on the GPU. The GPUResidentDataset keeps the intermediate buffer in GPU memory between them, and the multi-pass re-scoring (F5 GPU buffer retention bonus) ensures the group-by operator benefits from the data already being GPU-resident. The 80-row aggregated result is read back to the CPU, where a trivial sort and limit execute in under 0.1 ms.
Total query time: 2.8 ms. No network round-trip. No server load. The user drags a date slider and the chart updates in under 3 ms.
The same query on a device with only integrated GPU and 4 CPU cores:
| Operator | Score | Routed to |
|---|---|---|
| Filter | 0.7 | Web Workers (4 threads) |
| GroupBy | 0.6 | Web Workers (4 threads) |
| Sort | 0.01 | CPU (main thread) |
| Limit | 0.001 | CPU (main thread) |
Total: 11.4 ms. Still under one frame at 60 fps. Still no server round-trip.
Why dictionary encoding is non-negotiable
String data dominates enterprise datasets. Region names, product categories, status codes, department labels. Without dictionary encoding, every string comparison in a filter or group-by requires byte-by-byte matching of variable-length values. On the CPU, V8's string internalization helps. On the GPU, there is no string type at all.
Dictionary encoding solves three problems simultaneously:
Problem 1: GPU compatibility. Variable-length strings cannot exist in a WebGPU storage buffer. Fixed-width u32 indices can. A WHERE region = 'EMEA' filter becomes WHERE region_idx == 1, which is a single 32-bit integer comparison per row.
Problem 2: Memory efficiency. A 500,000-row column of region names averaging 8 characters each occupies 4 MB as raw strings (plus object overhead in JavaScript). Dictionary-encoded, it is a 2 MB Uint32Array plus a few hundred bytes for the dictionary. The GPU buffer upload is half the size.
Problem 3: Group-by performance. Grouping by string equality requires hashing each string. Grouping by integer index requires no hashing. The dictionary index is the group key. For a column with 50 unique values, the group key is a number between 0 and 49, which maps directly to an array offset in the accumulator. Zero hash collisions. Zero rehashing.
Offline-first and hybrid architectures
The engine is designed for datasets that can be loaded once and queried many times. The initial data load (downloading 25 to 40 MB of columnar data) takes 1 to 3 seconds on a typical enterprise connection. Every subsequent query runs locally in under 15 ms.
This enables architectures where the server pushes a data snapshot to the client on load, and the client handles all filtering, aggregation, and sorting locally. The server is contacted only for writes, data refreshes, or queries that span more data than the client holds.
For dashboards with 5 to 10 linked charts, each generating a query on every user interaction, the difference is dramatic. Server-round-trip architecture: 500 ms to 3 seconds per interaction (serialized queries, network latency, connection pool contention). Client-side engine: 5 to 30 ms per interaction, all charts updating within a single animation frame.
This is the architectural direction behind our enterprise AI automation infrastructure. Move compute to where the data lives. Eliminate the round-trips that your users feel but your backend metrics fail to capture. Probe the hardware, route each operation to its optimal tier, and let the browser do what it is increasingly capable of doing: real computation, not just rendering.