The cost nobody puts on the architecture diagram
Every server-side query has a cost. Not just the database query. The full chain: the API gateway that receives the request, the application server that parses and validates it, the compute instance that executes the sort or aggregation, the memory that holds the intermediate result, the network egress that sends the response back to the client.
For a single query, the cost is fractions of a cent. For a SaaS platform with thousands of concurrent users, each interacting with a dashboard 50 to 100 times per session, the cost compounds into a line item that grows linearly with usage and never stops.
This is the cost model that cloud infrastructure vendors prefer. Every user interaction generates server compute. More users means more compute. More features means more queries per interaction. The bill scales with success.
We build systems where the bill does not.
The server cost of a dashboard query
Consider a standard analytics dashboard. A user opens it, filters by date range, sorts by revenue, groups by region, adjusts a slider, changes the grouping, sorts again. Ten interactions in 30 seconds. Each interaction triggers 2 to 5 backend queries (one per chart panel that updates).
For a single user session: 20 to 50 queries. For 10,000 daily active users: 200,000 to 500,000 queries per day.
Compute cost per query
A server-side sort of 500,000 records takes 15 to 40 ms of CPU time depending on the query complexity. On an AWS c7g.xlarge instance ($0.1224/hour, 4 vCPUs), that is:
CPU-seconds per query: 0.025 (25 ms average)
Queries per CPU-second: 40
Queries per vCPU-hour: 144,000
Cost per 1,000 queries (compute only): $0.1224 / 144 = $0.00085
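The arithmetic above fits in a few lines (a sketch that mirrors the article's conservative choice of pricing the whole instance against a single vCPU's throughput):

```javascript
// Back-of-the-envelope compute cost per query, using the figures above.
const instanceCostPerHour = 0.1224; // AWS c7g.xlarge, on-demand
const cpuSecondsPerQuery = 0.025;   // 25 ms average server-side sort

const queriesPerCpuSecond = 1 / cpuSecondsPerQuery;                // 40
const queriesPerVcpuHour = Math.round(queriesPerCpuSecond * 3600); // 144,000
const costPer1kQueries = instanceCostPerHour / (queriesPerVcpuHour / 1000); // ≈ $0.00085
```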
That looks cheap. But compute is not the only cost.
The full cost stack
| Cost component | Per 1,000 queries | Yearly (500K queries/day) |
|---|---|---|
| Compute (c7g.xlarge) | $0.00085 | $155 |
| Application Load Balancer | $0.008 | $1,460 |
| NAT Gateway data processing | $0.035 | $6,388 |
| Data transfer out (avg 50 KB/response) | $0.045 | $8,213 |
| RDS/database query cost | $0.015 | $2,738 |
| CloudWatch logging | $0.005 | $913 |
| API Gateway (if used) | $0.035 | $6,388 |
| Total | $0.144 | $26,255 |
The compute is 0.6% of the total cost. The infrastructure around the compute (load balancers, NAT gateways, data transfer, logging) is 99.4%. This is the hidden cost of server-side architecture: the tax on every request that has nothing to do with the computation itself.
For larger deployments (50,000 DAU, 2.5 million queries/day), the yearly cost scales to $131,000. For 100,000 DAU: $262,000. Linear scaling. No economy of scale on per-query infrastructure costs.
The OpenAI API cost comparison
Some platforms route data transformations through LLM APIs. "Use GPT-4 to filter and summarize this dataset." This is financially catastrophic at scale.
A GPT-4o API call processing 500,000 tokens of tabular data (a modest 50,000-row dataset serialized as CSV) costs approximately $2.50 per query at current input pricing ($5.00/1M input tokens). For 500,000 queries per day: $1,250,000 per day. $456 million per year.
Even GPT-4o-mini at $0.15/1M input tokens costs $0.075 per query, or $37,500 per day for the same workload. $13.7 million per year. For sorting and filtering. Operations that a GPU compute shader handles in 3 ms for $0.
LLM APIs are the correct tool for natural language understanding, generation, and reasoning. They are the wrong tool for data transformation. Using them for sorting, filtering, and aggregation is paying $5.00/1M tokens for an operation that costs zero when executed locally.
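The per-query arithmetic is worth spelling out (a sketch; `llmCostPerQuery` is an illustrative name, and output tokens are ignored for simplicity):

```javascript
// Cost of routing one data-transformation "query" through an LLM API.
// Price is quoted per 1M input tokens, as in the text above.
function llmCostPerQuery(inputTokens, pricePerMillionTokens) {
  return (inputTokens / 1_000_000) * pricePerMillionTokens;
}

const tokens = 500_000; // ~50,000-row dataset serialized as CSV
const gpt4o = llmCostPerQuery(tokens, 5.0);  // $2.50 per query
const mini = llmCostPerQuery(tokens, 0.15);  // $0.075 per query
```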
What on-device dispatch eliminates
Our adaptive dispatch engine routes data transformation operations to the client's own hardware. The server is not involved. The cost components that scale with query volume disappear:
| Cost component | Server-side | On-device |
|---|---|---|
| Compute per query | $0.00085 | $0 (client hardware) |
| Load balancer | $0.008/1K queries | $0 |
| NAT gateway | $0.035/1K queries | $0 |
| Data transfer out | $0.045/1K queries | $0 |
| Database query | $0.015/1K queries | $0 (data cached locally) |
| Logging per query | $0.005/1K queries | $0 (client-side telemetry) |
| API gateway | $0.035/1K queries | $0 |
| Total per 1K queries | $0.144 | $0 |
The server still serves the initial data load. A 500,000-row dataset in columnar format is 20 to 40 MB. Served via CloudFront CDN, the per-user data transfer cost is $0.0017 to $0.0034 (at $0.085/GB). For 10,000 DAU: $17 to $34 per day. $6,200 to $12,400 per year. That is the total server cost: one data load per session, cached on the client, zero query costs thereafter.
Compare: $6,200 to $12,400 per year (on-device) versus $26,255 per year (server-side queries) for the same 10,000-user deployment. Savings: $14,000 to $20,000 per year. For larger deployments, the gap widens because query costs scale linearly while CDN costs scale sub-linearly (CDN cache hit rates improve with user density).
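Under the assumptions above, the whole comparison reduces to a few lines (a sketch; the constants come from the tables, and `serverYearly` / `onDeviceYearly` are illustrative names, using a 30 MB midpoint payload):

```javascript
// Yearly cost of the 10,000-DAU example under both architectures.
const queriesPerDay = 500_000;
const serverCostPer1k = 0.144;   // full server-side stack, from the table above
const serverYearly = (queriesPerDay / 1000) * serverCostPer1k * 365; // ≈ $26,280

const dau = 10_000;
const datasetGB = 0.03;          // ~30 MB columnar payload, one load per session
const cdnRatePerGB = 0.085;      // CloudFront egress
const onDeviceYearly = dau * datasetGB * cdnRatePerGB * 365;         // ≈ $9,308
```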
How the dispatch engine profiles hardware
The cost savings depend on the client's hardware being capable of running the computation. Not every device has a GPU. Not every GPU is fast enough to beat a server round-trip. The dispatch engine profiles each client's hardware at session start and sets thresholds accordingly.
Step 1: Adapter detection
```javascript
const gpu = navigator.gpu;
if (!gpu) {
  // No WebGPU support. Fall back to Web Workers only.
  setDispatchMode('cpu-only');
  return;
}
const adapter = await gpu.requestAdapter();
if (!adapter) {
  // WebGPU supported but no adapter available (e.g., VDI, software renderer).
  setDispatchMode('cpu-only');
  return;
}
const info = adapter.info; // GPUAdapterInfo (older spec drafts used requestAdapterInfo())
```
The requestAdapter() call returns the best available GPU adapter, or null if none is available. This handles the full range of enterprise hardware: discrete GPU workstations, integrated GPU laptops, VDI sessions with no GPU access, and locked-down terminals.
Step 2: Hardware classification
The adapter info provides vendor, architecture, and device strings. Combined with adapter.limits (maximum buffer size, maximum workgroup dimensions), the engine classifies the hardware:
Discrete GPU. Dedicated VRAM, high memory bandwidth (250+ GB/s), thousands of compute cores. Identified by vendor strings (NVIDIA, AMD discrete) or by the presence of dedicated video memory in the adapter limits.
Integrated GPU. Shared system memory, moderate bandwidth (40 to 100 GB/s), fewer compute units. Intel UHD/Iris, AMD Radeon Graphics, Apple M-series GPU. Identified by vendor strings or by memory characteristics.
CPU-only. No GPU adapter, or adapter is a software fallback (WARP on Windows, SwiftShader). All computation routes to Web Workers.
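Put together, the classification can be sketched as a pure function over the adapter info (the vendor matching below is illustrative, not our exact heuristics; the `isFallbackAdapter` flag is per the WebGPU spec):

```javascript
// Classify hardware from GPUAdapterInfo-style fields.
function classifyHardware(info) {
  if (!info) return 'cpu-only';                  // no adapter at all
  if (info.isFallbackAdapter) return 'cpu-only'; // software renderer (WARP, SwiftShader)
  const vendor = (info.vendor || '').toLowerCase();
  if (vendor.includes('nvidia')) return 'discrete';
  // AMD APUs report "Radeon Graphics"; discrete parts carry a model name.
  if (vendor.includes('amd') && !/radeon graphics/i.test(info.device || '')) {
    return 'discrete';
  }
  return 'integrated'; // Intel UHD/Iris, Apple M-series, AMD APUs, unknown vendors
}

classifyHardware({ vendor: 'nvidia', device: 'RTX 4070' }); // 'discrete'
classifyHardware({ vendor: 'intel', device: 'Iris Xe' });   // 'integrated'
classifyHardware(null);                                      // 'cpu-only'
```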
Step 3: Dynamic crossover thresholds
The engine runs calibration microbenchmarks (memory bandwidth probe and dispatch overhead measurement) and derives crossover thresholds:
| Hardware class | GPU crossover threshold | Worker crossover threshold |
|---|---|---|
| Discrete GPU | 500,000 elements | 10,000 elements |
| Integrated GPU | 2,000,000 elements | 10,000 elements |
| CPU-only | n/a (no GPU) | 10,000 elements |
Below the worker threshold (10,000 elements), the main thread handles the operation. Between the worker threshold and the GPU threshold, Web Workers with SharedArrayBuffer handle it. Above the GPU threshold, the GPU handles it.
These are not static numbers. They are derived from the calibration ratio, which accounts for the specific hardware's memory bandwidth, dispatch overhead, and CPU single-thread performance. A fast discrete GPU might lower the threshold to 300,000. A slow integrated GPU might raise it to 3,000,000. The engine measures. It does not guess.
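With calibrated thresholds in hand, tier selection is two comparisons (a sketch; `selectTier` is an illustrative name, defaults from the table above):

```javascript
// Route an operation to a tier based on dataset size and calibrated thresholds.
// thresholds.gpu is null on CPU-only hardware (no GPU tier available).
function selectTier(elementCount, thresholds) {
  const { worker = 10_000, gpu = 500_000 } = thresholds;
  if (gpu !== null && elementCount >= gpu) return 'gpu';
  if (elementCount >= worker) return 'worker'; // Web Workers + SharedArrayBuffer
  return 'main-thread'; // too small to be worth any dispatch overhead
}

const t = { worker: 10_000, gpu: 500_000 }; // discrete-GPU defaults
selectTier(5_000, t);    // 'main-thread'
selectTier(100_000, t);  // 'worker'
selectTier(750_000, t);  // 'gpu'
selectTier(750_000, { worker: 10_000, gpu: null }); // 'worker' (CPU-only hardware)
```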
Step 4: Continuous operation
After calibration (under 200 ms at session start), the engine runs silently. Every dispatch() call checks the dataset size against the thresholds and routes to the optimal tier. The application code does not branch on hardware capabilities. It calls dispatch() and gets results.
If the GPU device is lost (driver crash, external GPU disconnected, power management reclamation), the engine falls back to Web Workers transparently. The operation is slower but correct. No server involvement. No cost increase.
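The fallback can be sketched as a generic wrapper (illustrative; in the real engine the trigger is the GPUDevice.lost promise rather than a thrown error):

```javascript
// Run on the GPU path; transparently re-run on Web Workers if the device is lost.
async function runWithFallback(gpuPath, workerPath) {
  try {
    return await gpuPath();
  } catch (err) {
    // Device lost: driver crash, eGPU unplugged, power management reclamation.
    // Slower, but correct. No server involvement, no cost increase.
    return await workerPath();
  }
}

// Usage with stand-in implementations simulating a device loss:
runWithFallback(
  async () => { throw new Error('device lost'); },
  async () => 'sorted-on-workers'
).then((result) => console.log(result)); // logs "sorted-on-workers"
```

A caller would wrap each dispatch, e.g. `runWithFallback(() => gpuSort(data), () => workerSort(data))`.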
Cost model for a real deployment
Consider a mid-size enterprise deployment: a hospitality chain with 20 properties, 500 total dashboard users, processing guest databases of 200,000 to 500,000 records per property.
Server-side architecture (current state)
Each dashboard interaction queries a centralized PostgreSQL database via REST API.
Users: 500
Sessions per day: 500 (1 per user)
Queries per session: 50 (average)
Total queries per day: 25,000
Average response size: 80 KB
Monthly server costs:
| Component | Monthly cost |
|---|---|
| RDS db.r6g.xlarge (multi-AZ) | $876 |
| EC2 c7g.xlarge (2 instances, load balanced) | $179 |
| Application Load Balancer | $22 |
| NAT Gateway + data processing | $45 |
| Data transfer (25K queries x 80 KB) | $18 |
| CloudWatch + logging | $12 |
| API Gateway | $9 |
| Total monthly | $1,161 |
| Total yearly | $13,932 |
On-device architecture (our implementation)
Each user loads their property's guest database (20 to 40 MB) once per session via CDN. All queries run locally.
Users: 500
Data loads per day: 500
Average data size: 30 MB
Queries per session: 50
Server-side queries: 0
Monthly server costs:
| Component | Monthly cost |
|---|---|
| S3 storage (600 MB total, all properties) | $0.01 |
| CloudFront CDN (500 loads x 30 MB = 15 GB/day) | $39 |
| Lambda for data refresh (nightly sync) | $3 |
| CloudWatch (minimal, no per-query logging) | $2 |
| Total monthly | $44 |
| Total yearly | $528 |
Annual savings: $13,404. The on-device architecture costs 3.8% of the server-side architecture.
The savings scale with usage. If the hospitality chain grows to 50 properties with 1,500 users:
- Server-side: scales to approximately $38,000/year (more database capacity, more compute instances).
- On-device: scales to approximately $1,200/year (more CDN transfer, same S3 storage).
The on-device cost curve is nearly flat because the marginal cost of an additional user is one CDN data load per session. The server-side cost curve is linear because every query consumes compute, bandwidth, and database capacity.
What stays on the server
On-device dispatch does not eliminate the server. It eliminates server involvement in read-path data transformations. The server still handles:
Data writes. When a user updates a guest profile, the write goes to the server database. The local cache is updated optimistically and reconciled on the next sync.
Authentication and authorization. Session tokens, role-based access control, and data scoping are server-side. The client never receives data it is not authorized to see.
Data synchronization. The client's local dataset is refreshed periodically (nightly batch, or real-time via WebSocket for high-frequency data). The server prepares and serves the columnar dataset.
Queries that exceed local data. Cross-property analytics, chain-wide reporting, and historical queries beyond the local cache window fall back to server-side queries. These are infrequent (typically less than 5% of total queries) and can be handled by a smaller, cheaper server fleet than the one required to serve 100% of queries.
AI inference. Language model inference (summarization, classification, entity extraction) runs on the server or via API. These are the operations where cloud compute is justified: high arithmetic intensity, model weights too large for client-side deployment, and per-query cost that reflects genuine computational work.
The principle: use the server for what requires the server. Use the client for what the client can handle. Sorting 500,000 rows does not require a server. It requires an IEEE 754 bit-transform and 4 ms of GPU time.
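That bit-transform is compact enough to show (a JS sketch of the idea; the production path runs the equivalent inside a WGSL compute shader feeding a radix sort):

```javascript
// IEEE 754 bit-transform that makes f32 keys sortable as unsigned integers:
// flip all bits of negatives, flip only the sign bit of non-negatives.
// After the transform, unsigned integer order equals float order.
function f32ToSortableU32(f) {
  const buf = new ArrayBuffer(4);
  new Float32Array(buf)[0] = f;
  const bits = new Uint32Array(buf)[0];
  return (bits & 0x80000000) ? (~bits >>> 0) : ((bits ^ 0x80000000) >>> 0);
}

const keys = [3.5, -1.25, 0, -7.0, 2.0];
const sorted = [...keys].sort((a, b) => f32ToSortableU32(a) - f32ToSortableU32(b));
// sorted: [-7, -1.25, 0, 2, 3.5]
```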
The latency dividend
Cost reduction is the primary argument. Latency reduction is the bonus.
A server-side query at p50: 120 ms. At p95: 280 ms. At p99 (database connection pool exhaustion, GC pause on the app server, network congestion): 800 ms.
An on-device query at p50: 3.2 ms. At p95: 4.5 ms. At p99: 8.1 ms.
The p99 improvement is 99x. But more importantly, the variance collapses. Server-side p99/p50 ratio: 6.7x. On-device p99/p50 ratio: 2.5x. The tail latency that makes dashboards feel unreliable disappears.
Users do not perceive average latency. They perceive worst-case latency. A dashboard that usually responds in 120 ms but occasionally freezes for 800 ms feels broken. A dashboard that consistently responds in 3 to 8 ms feels instant. The consistency is the product quality improvement. The cost savings are the business case.
The architecture decision
Every query you route to a server is a cost you pay forever. Every query you route to the client is a cost you pay once (the engineering investment to build the client-side engine) and never again.
Server-side query architectures made sense when browsers were rendering engines with no compute capability. WebGPU changes that equation. The user's device has a GPU with thousands of cores sitting idle while your server farm processes their sort request.
Our adaptive dispatch engine makes the client GPU productive. The precision analyser ensures correctness. The pipeline fusion engine eliminates transfer overhead. The device loss handler ensures reliability. Together, they provide the same guarantees as a server-side query engine (correct results, fault tolerance, observability) without the server.
This is the cost structure behind our enterprise AI automation infrastructure. We do not optimize your server costs. We eliminate them for the 95% of queries that never needed a server in the first place. The remaining 5% run on a server fleet sized for 5% of the load, at 5% of the cost. The savings are not incremental. They are structural.