The latency you are ignoring
Every time a user adjusts a filter on your analytics dashboard, a request leaves the browser. It crosses the network to your API server. The server parses the query, authenticates the session, hits the database, serializes the result, and sends it back. The browser deserializes, reconciles state, and re-renders.
Best case on a well-optimized stack: 50 ms. Typical enterprise environment with VPN overhead, multi-region routing, and a cold connection pool: 150 to 300 ms. Per interaction. Every slider drag, every checkbox toggle, every date range change.
Your users are not waiting for compute. They are waiting for packets.
The data is small enough to live in the browser. A 500,000-row dataset with 12 columns of mixed types occupies roughly 25 to 40 MB in columnar format. That fits comfortably in a single ArrayBuffer. The question is whether the browser can query it fast enough to replace the server round-trip.
It can. But not with JavaScript alone.
Why JavaScript query engines hit a ceiling
Client-side query libraries exist. They parse SQL or a query DSL, iterate over arrays, and return results. For 10,000 rows, they are fast enough. For 500,000 rows, they are not.
A filter operation on 500,000 rows with a numeric predicate (WHERE revenue > 50000) requires 500,000 comparisons. In JavaScript, each comparison involves a property access on an object (or typed array index), a numeric comparison, and a conditional branch to build the result set. V8 optimizes this aggressively, but you are still bound by single-threaded scalar execution. On a modern laptop, this takes 8 to 15 ms.
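The scalar path can be made concrete with a minimal sketch. This is illustrative code, not any library's implementation; the function name and data are invented for the example:

```javascript
// Schematic of the single-threaded scalar filter a JS engine executes:
// one load, one comparison, one branch per row, on one thread.
function filterRevenue(revenue /* Float64Array */, threshold) {
  const matches = [];
  for (let i = 0; i < revenue.length; i++) {
    if (revenue[i] > threshold) matches.push(i); // build the result set
  }
  return matches;
}

const revenue = new Float64Array([120000, 30000, 75000, 10000]);
const result = filterRevenue(revenue, 50000);
// result -> [0, 2]
```

However well V8 optimizes this loop, every row still flows through it sequentially.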
A group-by aggregation (GROUP BY region, quarter) adds a hash lookup per row, accumulator updates, and a final reduction. For 500,000 rows with 200 groups, JavaScript needs 25 to 40 ms.
Chain three operators (filter, group-by, sort) and you are at 50 to 80 ms. That is competitive with a fast server round-trip but not faster. Add a second concurrent query (cross-filtering between linked charts) and the main thread blocks for 100 to 160 ms. The UI stutters.
Web Workers help. Offloading to a worker pool eliminates main-thread blocking, and parallel execution reduces wall-clock time. But the ceiling is the CPU's core count: eight threads processing 500,000 rows still means 62,500 rows per thread, sequentially. For arithmetic-heavy aggregations, you need more parallelism than the CPU provides.
Our Adaptive WebGPU Data Query Engine
We built a query engine that runs entirely in the browser, compiles structured queries into execution plans, and routes each operator to the optimal compute backend. Not the entire query. Each operator independently.
The architecture is built on five principal modules: a hardware capability detector, columnar data storage with dictionary encoding, a per-operator workload characterizer, a dispatch scoring model, and a pipeline executor.
Columnar data storage with dictionary encoding
When a dataset is loaded, the engine analyzes each column's type and cardinality.
Numeric columns (integers, floats, timestamps) are stored as typed arrays (Float32Array, Int32Array, BigInt64Array) in columnar layout. One contiguous buffer per column. No object overhead. No property lookups.
String columns are dictionary-encoded. The engine builds a sorted dictionary of unique values and replaces each string with its integer index:
// Raw data
const regions = ["EMEA", "APAC", "EMEA", "NAM", "APAC", "EMEA"];
// Dictionary
const dict = ["APAC", "EMEA", "NAM"]; // sorted, unique
// Encoded column
const encoded = new Uint32Array([1, 0, 1, 2, 0, 1]);
This is critical for GPU execution. WebGPU compute shaders cannot process variable-length strings. They operate on fixed-width numeric types. Dictionary encoding converts every string column to a Uint32Array of indices that the GPU can compare, group, and sort using integer arithmetic.
The dictionary itself stays on the CPU. Only the integer indices are uploaded to GPU buffers. For a column with 500,000 rows and 200 unique values, the dictionary is a 200-entry string array (a few KB). The encoded column is a 2 MB Uint32Array. The memory saving over storing 500,000 JavaScript strings is typically 4x to 8x.
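A minimal encoder matching the example above can be sketched as follows; the function name is hypothetical, but the output reproduces the dictionary and encoded column shown earlier:

```javascript
// Build a sorted dictionary of unique values, then replace each
// string with its integer index in that dictionary.
function dictionaryEncode(values) {
  const dict = [...new Set(values)].sort();
  const index = new Map(dict.map((v, i) => [v, i]));
  const encoded = Uint32Array.from(values, (v) => index.get(v));
  return { dict, encoded };
}

const { dict, encoded } = dictionaryEncode(
  ["EMEA", "APAC", "EMEA", "NAM", "APAC", "EMEA"]
);
// dict    -> ["APAC", "EMEA", "NAM"]
// encoded -> Uint32Array [1, 0, 1, 2, 0, 1]
```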
Query compilation
The engine accepts structured queries as typed objects:
const query = {
  filter: { column: "revenue", op: ">", value: 50000 },
  groupBy: ["region", "quarter"],
  aggregate: { column: "revenue", fn: "sum" },
  orderBy: { column: "sum_revenue", direction: "desc" },
  limit: 100
};
The compiler translates this into a directed acyclic graph of operators: Filter, GroupBy, Aggregate, Sort, Limit. Each operator is a self-contained unit with typed input and output schemas. The compiler resolves column references to buffer offsets and dictionary lookups at compile time, not execution time.
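Compile-time resolution can be illustrated with a filter on a dictionary-encoded column. The helper name is hypothetical; the point is that the dictionary lookup happens once, at compile time, so the runtime loop compares only integers:

```javascript
// Compile a string-equality filter against a dictionary-encoded column.
// The dictionary lookup is resolved ONCE here, not once per row.
function compileEqFilter(column /* Uint32Array */, dict, value) {
  const idx = dict.indexOf(value); // compile-time resolution
  return function run() {
    const out = [];
    for (let i = 0; i < column.length; i++) {
      if (column[i] === idx) out.push(i); // pure u32 comparison
    }
    return out;
  };
}

const dict = ["APAC", "EMEA", "NAM"];
const region = new Uint32Array([1, 0, 1, 2, 0, 1]);
const run = compileEqFilter(region, dict, "EMEA");
// run() -> [0, 2, 5]
```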
Per-operator workload characterization and dispatch scoring
This is where our engine diverges from every other client-side query library. We do not run the entire query on one backend. The per-operator workload characterizer analyzes each operator, and the dispatch scoring model evaluates it against a 6-factor scoring function to route it independently to one of three execution tiers: CPU main thread, Web Worker thread pool, or WebGPU compute pipeline.
Pipeline executor
The pipeline executor handles plan generation, multi-pass re-scoring (up to 3 iterations), segment identification, and execution with transparent CPU fallback on GPU failure. Operators execute in dependency order, passing intermediate results between them as typed array buffers. When consecutive operators are both routed to the GPU, a GPUResidentDataset keeps the intermediate buffer in GPU memory with no CPU round-trip. This GPU buffer retention is what the dispatch scoring model captures in Factor 5 (GPU buffer retention bonus), re-scoring downstream operators to account for data already resident on the GPU. The mechanism is distinct from the general pipeline fusion principle in our sorting engine: here the retention bonus is SQL-specific, feeding back into the dispatch scoring model to create cascading GPU segment formation.
The 6-factor dispatch scoring function
Each operator receives a dispatch score computed from six factors defined in our patent filing.
Factor 1 (F1): Row count vs threshold
The number of rows the operator will process, evaluated against hardware-specific thresholds. For the first operator in the pipeline, this is the full dataset size. For downstream operators, it is the estimated output cardinality of the preceding operator. F1 carries a weight of 4.0, making it the dominant factor. The discrete GPU threshold is 50,000 rows; the integrated GPU threshold is 100,000 rows. Below the threshold, the fixed overhead of GPU buffer allocation and shader dispatch dominates. Above it, the GPU's throughput advantage compensates.
Factor 2 (F2): Operator-specific SQL metric
F2 captures the workload characteristic most relevant to the specific SQL operator being scored. For filter operators, this is predicate selectivity: the fraction of rows that pass the predicate. The engine estimates selectivity from lightweight statistics maintained during schema ingestion: min, max, histogram (64 buckets), and null count per column.
For group-by operators, F2 is the Chao1 group cardinality estimate. We estimate group cardinality using the Chao1 species richness estimator from ecology statistics. During schema ingestion, the engine samples a fixed-size random subset of each column and counts observed unique values and singleton frequencies:
Chao1 = d + f1 * (f1 - 1) / (2 * (f2 + 1))
Where d is the number of observed unique values in the sample, f1 is the number of values appearing exactly once (singletons), and f2 is the number of values appearing exactly twice (doubletons). The (f2 + 1) denominator provides a bias correction that handles the edge case of zero doubletons without requiring a special guard. The Chao1 estimator provides a lower bound on total distinct values, accounting for unseen categories in the full dataset. For a dictionary-encoded column, the exact cardinality is already known (it is the dictionary size). Chao1 is used for composite group-by keys (GROUP BY region, quarter) where the cross-product cardinality is unknown without a full scan.
For join operators, F2 is the join key overlap ratio: the fraction of keys in the smaller relation that have matches in the larger relation.
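The Chao1 computation itself is small. The sketch below implements the formula exactly as given; the function name and sample data are for illustration only:

```javascript
// Chao1 species richness estimate from a random sample of a column.
function chao1(sample) {
  const counts = new Map();
  for (const v of sample) counts.set(v, (counts.get(v) ?? 0) + 1);
  const d = counts.size; // observed unique values
  let f1 = 0, f2 = 0;
  for (const c of counts.values()) {
    if (c === 1) f1++;      // singletons
    else if (c === 2) f2++; // doubletons
  }
  // (f2 + 1) denominator: bias correction, no zero-doubleton guard needed
  return d + (f1 * (f1 - 1)) / (2 * (f2 + 1));
}

// 3 observed uniques, 2 singletons ("b", "c"), 0 doubletons:
chao1(["a", "a", "a", "b", "c"]); // -> 3 + (2 * 1) / 2 = 4
```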
Factor 3 (F3): GPU class adjustment
Adjusts the score based on the detected GPU class (discrete, integrated, or absent). A discrete GPU with dedicated VRAM receives a favorable adjustment. An integrated GPU sharing system memory receives a penalty that reflects its lower memory bandwidth and reduced parallelism.
Factor 4 (F4): Vendor tuning
Hardware vendor-specific tuning coefficients. Different GPU vendors (and different generations within a vendor) have different atomic throughput, shared memory size, and dispatch overhead characteristics. F4 captures these differences so that the same query on the same dataset produces different routing decisions on hardware from different vendors.
Factor 5 (F5): GPU buffer retention bonus
When a preceding operator has already produced its output in a GPU buffer (via the GPUResidentDataset class), the next operator receives a bonus for keeping execution on the GPU. This avoids the cost of reading results back to the CPU and re-uploading them. The pipeline executor performs multi-pass re-scoring (up to 3 iterations) to propagate buffer retention bonuses through the operator pipeline. This factor is specific to the query engine's SQL pipeline: the GPU buffer retention bonus feeds back into the dispatch scoring model to create cascading GPU segment formation across multi-operator plans.
Factor 6 (F6): Hardware-adaptive buffer threshold
The hardware-specific buffer size threshold derived from the hardware capability detector's runtime probing. This factor accounts for the device's actual GPU buffer limits and memory bandwidth, normalizing the scoring function across hardware classes.
Score computation and tier routing
The six factors combine into a dispatch score. If the score is positive, the operator dispatches to the WebGPU compute pipeline. If the score is non-positive and the row count falls within a defined medium range (between 10,000 and 500,000 rows in our preferred configuration), the operator dispatches to the Web Worker thread pool. Otherwise, the operator executes on the CPU main thread. If the operator's control flow analysis triggers a categorical divergence classification, the score is overridden to negative infinity regardless of the other factors (this uses the same categorical inhibition principle covered by our GPU Inhibition patent).
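The routing decision reduces to a small piece of control flow. This is a simplified sketch: the function and constant names are invented, the score is taken as already computed, and only the thresholds stated above are carried over:

```javascript
// Tier routing for a scored operator. The 6-factor score computation
// itself is not shown; `score` is assumed to be its output.
const MEDIUM_MIN = 10_000;
const MEDIUM_MAX = 500_000;

function routeOperator(score, rowCount, categoricalDivergence) {
  // Categorical divergence inhibits GPU dispatch unconditionally.
  if (categoricalDivergence) score = -Infinity;
  if (score > 0) return "gpu";
  if (rowCount >= MEDIUM_MIN && rowCount <= MEDIUM_MAX) return "workers";
  return "main-thread";
}

routeOperator(1.4, 500_000, false);  // -> "gpu"
routeOperator(-0.2, 175_000, false); // -> "workers"
routeOperator(-0.5, 80, false);      // -> "main-thread"
routeOperator(2.0, 600_001, true);   // -> "main-thread"
```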
How operators execute on the GPU
Parallel filter
The filter uses three of the engine's nine WGSL shaders. The filter predicate evaluation shader evaluates the predicate for every row in parallel, writing a 1 or 0 to a bitmask buffer. The three-level prefix sum shader computes compacted output indices for each surviving row. The filter scatter shader writes matching rows to a dense output buffer.
Three dispatches. For 500,000 rows on a discrete GPU: 1.2 ms total. The equivalent CPU filter with Web Workers: 4.8 ms. If the GPU encounters a failure at any stage, the pipeline executor transparently falls back to CPU execution.
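The semantics of the three passes can be expressed as a single-threaded CPU reference. This sketch is illustrative (names invented); on the GPU each pass runs over all rows in parallel rather than in a loop:

```javascript
// CPU reference of the three GPU passes:
//   1. predicate evaluation -> bitmask
//   2. exclusive prefix sum -> compacted output offsets
//   3. scatter              -> dense output buffer
function filterCompact(column, predicate) {
  const n = column.length;

  const mask = new Uint32Array(n); // pass 1
  for (let i = 0; i < n; i++) mask[i] = predicate(column[i]) ? 1 : 0;

  const offsets = new Uint32Array(n); // pass 2
  let total = 0;
  for (let i = 0; i < n; i++) { offsets[i] = total; total += mask[i]; }

  const out = new Float32Array(total); // pass 3
  for (let i = 0; i < n; i++) {
    if (mask[i]) out[offsets[i]] = column[i];
  }
  return out;
}

filterCompact(new Float32Array([70000, 20000, 90000]), (v) => v > 50000);
// -> Float32Array [70000, 90000]
```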
Parallel group-by aggregation
The group-by path uses the group key hash shader and the segmented reduction with fixup shader. For low group cardinality (under 1,024 groups), each workgroup maintains a local accumulator array in shared memory. Workgroups process tiles of the input, incrementing local accumulators by dictionary-encoded group key. The segmented reduction with fixup shader merges workgroup-local accumulators into the global result.
Shared memory accumulators avoid atomic contention on global memory. For 500,000 rows grouped into 50 regions with a sum aggregation: 0.9 ms on GPU versus 6.1 ms on 8-thread CPU.
For high group cardinality (above 1,024), the scoring function routes to the CPU tier. The CPU uses a hash map with open addressing, which handles arbitrary group counts without the shared memory size constraints of GPU workgroups.
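For the low-cardinality path, the accumulator logic is the same on CPU and GPU: the dictionary index is the array offset. A minimal sketch (hypothetical names, single-threaded for clarity):

```javascript
// Dense-accumulator group-by sum over a dictionary-encoded key column.
// No hashing: the encoded key indexes the accumulator array directly.
function groupBySum(keys /* Uint32Array */, values, numGroups) {
  const sums = new Float64Array(numGroups);
  for (let i = 0; i < keys.length; i++) {
    sums[keys[i]] += values[i]; // direct indexed accumulate
  }
  return sums;
}

// Keys encoded against dict ["APAC", "EMEA", "NAM"]:
const keys = new Uint32Array([1, 0, 1, 2, 0, 1]);
const revenue = new Float64Array([10, 20, 30, 40, 50, 60]);
groupBySum(keys, revenue, 3); // -> Float64Array [70, 100, 40]
```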
Parallel sort (radix-256)
Sorting reuses our IEEE 754 bit-transform and radix-256 pipeline, with the addition of a composite sort key extraction shader for multi-column ORDER BY clauses. To sort by a numeric column, the engine applies the bit-transform, executes four radix passes, and produces an index permutation array, which is then applied to all output columns. To sort by a dictionary-encoded string column, the integer indices are sorted directly: no bit-transform is needed, because dictionary indices are already non-negative integers in the lexicographic order of the dictionary. The engine also performs a Float32 ordering-preservation safety check for any Float64 columns involved in sorting, ensuring that radix-sorting the Float32 values preserves the correct ordering of the original Float64 data.
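The final step, applying the permutation to output columns, is a simple gather. A sketch with invented names, assuming `perm[j]` gives the source row that lands at output position `j`:

```javascript
// Apply a sort's index permutation to one output column (a gather).
// Works for any typed array via the instance's own constructor.
function applyPermutation(column, perm) {
  const out = new column.constructor(perm.length);
  for (let j = 0; j < perm.length; j++) out[j] = column[perm[j]];
  return out;
}

const perm = new Uint32Array([2, 0, 1]); // e.g. produced by a radix sort
applyPermutation(new Float32Array([5, 9, 7]), perm);
// -> Float32Array [7, 5, 9]
```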
Mixed-tier query execution in practice
Consider a real dashboard query on a 500,000-row sales dataset:
SELECT region, quarter, SUM(revenue), AVG(deal_size)
FROM sales
WHERE close_date > '2025-01-01' AND status = 'won'
GROUP BY region, quarter
ORDER BY SUM(revenue) DESC
LIMIT 20
The engine compiles this into four operators and scores each:
| Operator | Cardinality in | Selectivity / Groups | Score | Routed to |
|---|---|---|---|---|
| Filter (date + status) | 500,000 | ~35% selectivity | 1.4 | GPU |
| GroupBy (region x quarter) | ~175,000 | Chao1: ~80 groups | 1.8 | GPU |
| Sort (sum_revenue DESC) | 80 | n/a | 0.01 | CPU (main thread) |
| Limit (20) | 80 | n/a | 0.001 | CPU (main thread) |
The filter and group-by run on the GPU. The GPUResidentDataset keeps the intermediate buffer in GPU memory between them, and the multi-pass re-scoring (F5 GPU buffer retention bonus) ensures the group-by operator benefits from the data already being GPU-resident. The 80-row aggregated result is read back to the CPU, where a trivial sort and limit execute in under 0.1 ms.
Total query time: 2.8 ms. No network round-trip. No server load. The user drags a date slider and the chart updates in under 3 ms.
The same query on a device with only integrated GPU and 4 CPU cores:
| Operator | Score | Routed to |
|---|---|---|
| Filter | 0.7 | Web Workers (4 threads) |
| GroupBy | 0.6 | Web Workers (4 threads) |
| Sort | 0.01 | CPU (main thread) |
| Limit | 0.001 | CPU (main thread) |
Total: 11.4 ms. Still under one frame at 60 fps. Still no server round-trip.
Why dictionary encoding is non-negotiable
String data dominates enterprise datasets. Region names, product categories, status codes, department labels. Without dictionary encoding, every string comparison in a filter or group-by requires byte-by-byte matching of variable-length values. On the CPU, V8's string internalization helps. On the GPU, there is no string type at all.
Dictionary encoding solves three problems simultaneously:
Problem 1: GPU compatibility. Variable-length strings cannot exist in a WebGPU storage buffer. Fixed-width u32 indices can. A WHERE region = 'EMEA' filter becomes WHERE region_idx == 1, which is a single 32-bit integer comparison per row.
Problem 2: Memory efficiency. A 500,000-row column of region names averaging 8 characters each occupies 4 MB as raw strings (plus object overhead in JavaScript). Dictionary-encoded, it is a 2 MB Uint32Array plus a few hundred bytes for the dictionary. The GPU buffer upload is half the size.
Problem 3: Group-by performance. Grouping by string equality requires hashing each string. Grouping by integer index requires no hashing. The dictionary index is the group key. For a column with 50 unique values, the group key is a number between 0 and 49, which maps directly to an array offset in the accumulator. Zero hash collisions. Zero rehashing.
Offline-first and hybrid architectures
The engine is designed for datasets that can be loaded once and queried many times. The initial data load (downloading 25 to 40 MB of columnar data) takes 1 to 3 seconds on a typical enterprise connection. Every subsequent query runs locally in under 15 ms.
This enables architectures where the server pushes a data snapshot to the client on load, and the client handles all filtering, aggregation, and sorting locally. The server is contacted only for writes, data refreshes, or queries that span more data than the client holds.
For dashboards with 5 to 10 linked charts, each generating a query on every user interaction, the difference is dramatic. Server-round-trip architecture: 500 ms to 3 seconds per interaction (serialized queries, network latency, connection pool contention). Client-side engine: 5 to 30 ms per interaction, all charts updating within a single animation frame.
This is the architectural direction behind our enterprise AI automation infrastructure. Move compute to where the data lives. Eliminate the round-trips that your users feel but your backend metrics fail to capture. Probe the hardware, route each operation to its optimal tier, and let the browser do what it is increasingly capable of doing: real computation, not just rendering.