The cost nobody puts on the architecture diagram
Every server-side query has a cost. Not just the database query. The full chain: the API gateway that receives the request, the application server that parses and validates it, the compute instance that executes the sort or aggregation, the memory that holds the intermediate result, the network egress that sends the response back to the client.
For a single query, the cost is fractions of a cent. For a SaaS platform with thousands of concurrent users, each interacting with a dashboard 50 to 100 times per session, the cost compounds into a line item that grows linearly with usage and never stops.
This is the cost model that cloud infrastructure vendors prefer. Every user interaction generates server compute. More users means more compute. More features means more queries per interaction. The bill scales with success.
We build systems where the bill does not.
The server cost of a dashboard query
Consider a standard analytics dashboard. A user opens it, filters by date range, sorts by revenue, groups by region, adjusts a slider, changes the grouping, sorts again. Ten interactions in 30 seconds. Each interaction triggers 2 to 5 backend queries (one per chart panel that updates).
For a single user session: 20 to 50 queries. For 10,000 daily active users: 200,000 to 500,000 queries per day.
Compute cost per query
A server-side sort of 500,000 records takes 15 to 40 ms of CPU time depending on the query complexity. On an AWS c7g.xlarge instance ($0.1224/hour, 4 vCPUs), that is:
CPU-seconds per query: 0.025 (25 ms average)
Queries per CPU-second: 40
Queries per vCPU-hour: 144,000
Cost per 1,000 queries (compute only): $0.1224 / 144 = $0.00085
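The arithmetic above fits in a few lines (a sketch that mirrors the article's conservative choice of pricing the whole instance against a single vCPU's throughput):

```javascript
// Back-of-the-envelope compute cost per query, using the figures above.
const instanceCostPerHour = 0.1224; // AWS c7g.xlarge, on-demand
const cpuSecondsPerQuery = 0.025;   // 25 ms average server-side sort

const queriesPerCpuSecond = 1 / cpuSecondsPerQuery;                // 40
const queriesPerVcpuHour = Math.round(queriesPerCpuSecond * 3600); // 144,000
const costPer1kQueries = instanceCostPerHour / (queriesPerVcpuHour / 1000); // ≈ $0.00085
```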
That looks cheap. But compute is not the only cost.
The full cost stack
| Cost component | Per 1,000 queries | Yearly (500K queries/day) |
|---|---|---|
| Compute (c7g.xlarge) | $0.00085 | $155 |
| Application Load Balancer | $0.008 | $1,460 |
| NAT Gateway data processing | $0.035 | $6,388 |
| Data transfer out (avg 50 KB/response) | $0.045 | $8,213 |
| RDS/database query cost | $0.015 | $2,738 |
| CloudWatch logging | $0.005 | $913 |
| API Gateway (if used) | $0.035 | $6,388 |
| Total | $0.144 | $26,255 |
The compute is 0.6% of the total cost. The infrastructure around the compute (load balancers, NAT gateways, data transfer, logging) is 99.4%. This is the hidden cost of server-side architecture: the tax on every request that has nothing to do with the computation itself.
For larger deployments (50,000 DAU, 2.5 million queries/day), the yearly cost scales to $131,000. For 100,000 DAU: $262,000. Linear scaling. No economy of scale on per-query infrastructure costs.
The OpenAI API cost comparison
Some platforms route data transformations through LLM APIs. "Use GPT-4 to filter and summarize this dataset." This is financially catastrophic at scale.
A GPT-4o API call processing 500,000 tokens of tabular data (a modest 50,000-row dataset serialized as CSV) costs approximately $2.50 per query at current input pricing ($5.00/1M input tokens). For 500,000 queries per day: $1,250,000 per day. $456 million per year.
Even GPT-4o-mini at $0.15/1M input tokens costs $0.075 per query, or $37,500 per day for the same workload. $13.7 million per year. For sorting and filtering. Operations that a GPU compute shader handles in 3 ms for $0.
LLM APIs are the correct tool for natural language understanding, generation, and reasoning. They are the wrong tool for data transformation. Using them for sorting, filtering, and aggregation is paying $5.00/1M tokens for an operation that costs zero when executed locally.
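The per-query arithmetic is worth spelling out (a sketch; `llmCostPerQuery` is an illustrative name, and output tokens are ignored for simplicity):

```javascript
// Cost of routing one data-transformation "query" through an LLM API.
// Price is quoted per 1M input tokens, as in the text above.
function llmCostPerQuery(inputTokens, pricePerMillionTokens) {
  return (inputTokens / 1_000_000) * pricePerMillionTokens;
}

const tokens = 500_000; // ~50,000-row dataset serialized as CSV
const gpt4o = llmCostPerQuery(tokens, 5.0);  // $2.50 per query
const mini = llmCostPerQuery(tokens, 0.15);  // $0.075 per query
```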
What on-device dispatch eliminates
Our adaptive dispatch engine routes data transformation operations to the client's own hardware. The server is not involved. The cost components that scale with query volume disappear:
| Cost component | Server-side | On-device |
|---|---|---|
| Compute per query | $0.00085 | $0 (client hardware) |
| Load balancer | $0.008/1K queries | $0 |
| NAT gateway | $0.035/1K queries | $0 |
| Data transfer out | $0.045/1K queries | $0 |
| Database query | $0.015/1K queries | $0 (data cached locally) |
| Logging per query | $0.005/1K queries | $0 (client-side telemetry) |
| API gateway | $0.035/1K queries | $0 |
| Total per 1K queries | $0.144 | $0 |
The server still serves the initial data load. A 500,000-row dataset in columnar format is 20 to 40 MB. Served via CloudFront CDN, the per-user data transfer cost is $0.0017 to $0.0034 (at $0.085/GB). For 10,000 DAU: $17 to $34 per day. $6,200 to $12,400 per year. That is the total server cost: one data load per session, cached on the client, zero query costs thereafter.
Compare: $6,200 to $12,400 per year (on-device) versus $26,255 per year (server-side queries) for the same 10,000-user deployment. Savings: $14,000 to $20,000 per year. For larger deployments, the gap widens because query costs scale linearly while CDN costs scale sub-linearly (CDN cache hit rates improve with user density).
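Under the assumptions above, the whole comparison reduces to a few lines (a sketch; the constants come from the tables, and `serverYearly` / `onDeviceYearly` are illustrative names, using a 30 MB midpoint payload):

```javascript
// Yearly cost of the 10,000-DAU example under both architectures.
const queriesPerDay = 500_000;
const serverCostPer1k = 0.144;   // full server-side stack, from the table above
const serverYearly = (queriesPerDay / 1000) * serverCostPer1k * 365; // ≈ $26,280

const dau = 10_000;
const datasetGB = 0.03;          // ~30 MB columnar payload, one load per session
const cdnRatePerGB = 0.085;      // CloudFront egress
const onDeviceYearly = dau * datasetGB * cdnRatePerGB * 365;         // ≈ $9,308
```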
How the dispatch engine profiles hardware
The cost savings depend on the client's hardware being capable of running the computation. Not every device has a GPU. Not every GPU is fast enough to beat a server round-trip. The dispatch engine profiles each client's hardware at session start and sets thresholds accordingly.
Step 1: Adapter detection
```javascript
const gpu = navigator.gpu;
if (!gpu) {
  // No WebGPU support. Fall back to Web Workers only.
  setDispatchMode('cpu-only');
  return;
}
const adapter = await gpu.requestAdapter();
if (!adapter) {
  // WebGPU supported but no adapter available (e.g., VDI, software renderer).
  setDispatchMode('cpu-only');
  return;
}
const info = adapter.info; // GPUAdapterInfo (older spec drafts used requestAdapterInfo())
```
The requestAdapter() call returns the best available GPU adapter, or null if none is available. This handles the full range of enterprise hardware: discrete GPU workstations, integrated GPU laptops, VDI sessions with no GPU access, and locked-down terminals.
Step 2: Hardware classification
The adapter info provides vendor, architecture, and device strings. Combined with adapter.limits (maximum buffer size, maximum workgroup dimensions), the engine classifies the hardware:
Discrete GPU. Dedicated VRAM, high memory bandwidth (250+ GB/s), thousands of compute cores. Identified by vendor strings (NVIDIA, AMD discrete) or by the presence of dedicated video memory in the adapter limits.
Integrated GPU. Shared system memory, moderate bandwidth (40 to 100 GB/s), fewer compute units. Intel UHD/Iris, AMD Radeon Graphics, Apple M-series GPU. Identified by vendor strings or by memory characteristics.
CPU-only. No GPU adapter, or adapter is a software fallback (WARP on Windows, SwiftShader). All computation routes to Web Workers.
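Put together, the classification can be sketched as a pure function over the adapter info (the vendor matching below is illustrative, not our exact heuristics; the `isFallbackAdapter` flag is per the WebGPU spec):

```javascript
// Classify hardware from GPUAdapterInfo-style fields.
function classifyHardware(info) {
  if (!info) return 'cpu-only';                  // no adapter at all
  if (info.isFallbackAdapter) return 'cpu-only'; // software renderer (WARP, SwiftShader)
  const vendor = (info.vendor || '').toLowerCase();
  if (vendor.includes('nvidia')) return 'discrete';
  // AMD APUs report "Radeon Graphics"; discrete parts carry a model name.
  if (vendor.includes('amd') && !/radeon graphics/i.test(info.device || '')) {
    return 'discrete';
  }
  return 'integrated'; // Intel UHD/Iris, Apple M-series, AMD APUs, unknown vendors
}

classifyHardware({ vendor: 'nvidia', device: 'RTX 4070' }); // 'discrete'
classifyHardware({ vendor: 'intel', device: 'Iris Xe' });   // 'integrated'
classifyHardware(null);                                      // 'cpu-only'
```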
Step 3: Dynamic crossover thresholds
The engine runs calibration microbenchmarks (memory bandwidth probe and dispatch overhead measurement) and derives crossover thresholds:
| Hardware class | GPU crossover threshold | Worker crossover threshold |
|---|---|---|
| Discrete GPU | 500,000 elements | 10,000 elements |
| Integrated GPU | 2,000,000 elements | 10,000 elements |
| CPU-only | n/a (no GPU) | 10,000 elements |
Below the worker threshold (10,000 elements), the main thread handles the operation. Between the worker threshold and the GPU threshold, Web Workers with SharedArrayBuffer handle it. Above the GPU threshold, the GPU handles it.
These are not static numbers. They are derived from the calibration ratio, which accounts for the specific hardware's memory bandwidth, dispatch overhead, and CPU single-thread performance. A fast discrete GPU might lower the threshold to 300,000. A slow integrated GPU might raise it to 3,000,000. The engine measures. It does not guess.
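With calibrated thresholds in hand, tier selection is two comparisons (a sketch; `selectTier` is an illustrative name, defaults from the table above):

```javascript
// Route an operation to a tier based on dataset size and calibrated thresholds.
// thresholds.gpu is null on CPU-only hardware (no GPU tier available).
function selectTier(elementCount, thresholds) {
  const { worker = 10_000, gpu = 500_000 } = thresholds;
  if (gpu !== null && elementCount >= gpu) return 'gpu';
  if (elementCount >= worker) return 'worker'; // Web Workers + SharedArrayBuffer
  return 'main-thread'; // too small to be worth any dispatch overhead
}

const t = { worker: 10_000, gpu: 500_000 }; // discrete-GPU defaults
selectTier(5_000, t);    // 'main-thread'
selectTier(100_000, t);  // 'worker'
selectTier(750_000, t);  // 'gpu'
selectTier(750_000, { worker: 10_000, gpu: null }); // 'worker' (CPU-only hardware)
```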
Step 4: Continuous operation
After calibration (under 200 ms at session start), the engine runs silently. Every dispatch() call checks the dataset size against the thresholds and routes to the optimal tier. The application code does not branch on hardware capabilities. It calls dispatch() and gets results.
If the GPU device is lost (driver crash, external GPU disconnected, power management reclamation), the engine falls back to Web Workers transparently. The operation is slower but correct. No server involvement. No cost increase.
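The fallback can be sketched as a generic wrapper (illustrative; in the real engine the trigger is the GPUDevice.lost promise rather than a thrown error):

```javascript
// Run on the GPU path; transparently re-run on Web Workers if the device is lost.
async function runWithFallback(gpuPath, workerPath) {
  try {
    return await gpuPath();
  } catch (err) {
    // Device lost: driver crash, eGPU unplugged, power management reclamation.
    // Slower, but correct. No server involvement, no cost increase.
    return await workerPath();
  }
}

// Usage with stand-in implementations simulating a device loss:
runWithFallback(
  async () => { throw new Error('device lost'); },
  async () => 'sorted-on-workers'
).then((result) => console.log(result)); // logs "sorted-on-workers"
```

A caller would wrap each dispatch, e.g. `runWithFallback(() => gpuSort(data), () => workerSort(data))`.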
Cost model for a real deployment
Consider a mid-size enterprise deployment: a hospitality chain with 20 properties, 500 total dashboard users, processing guest databases of 200,000 to 500,000 records per property.
Server-side architecture (current state)
Each dashboard interaction queries a centralized PostgreSQL database via REST API.
Users: 500
Sessions per day: 500 (1 per user)
Queries per session: 50 (average)
Total queries per day: 25,000
Average response size: 80 KB
Monthly server costs:
| Component | Monthly cost |
|---|---|
| RDS db.r6g.xlarge (multi-AZ) | $876 |
| EC2 c7g.xlarge (2 instances, load balanced) | $179 |
| Application Load Balancer | $22 |
| NAT Gateway + data processing | $45 |
| Data transfer (25K queries x 80 KB) | $18 |
| CloudWatch + logging | $12 |
| API Gateway | $9 |
| Total monthly | $1,161 |
| Total yearly | $13,932 |
On-device architecture (our implementation)
Each user loads their property's guest database (20 to 40 MB) once per session via CDN. All queries run locally.
Users: 500
Data loads per day: 500
Average data size: 30 MB
Queries per session: 50
Server-side queries: 0
Monthly server costs:
| Component | Monthly cost |
|---|---|
| S3 storage (600 MB total, all properties) | $0.01 |
| CloudFront CDN (500 loads x 30 MB = 15 GB/day) | $39 |
| Lambda for data refresh (nightly sync) | $3 |
| CloudWatch (minimal, no per-query logging) | $2 |
| Total monthly | $44 |
| Total yearly | $528 |
Annual savings: $13,404. The on-device architecture costs 3.8% of the server-side architecture.
The savings scale with usage. If the hospitality chain grows to 50 properties with 1,500 users:
- Server-side: scales to approximately $38,000/year (more database capacity, more compute instances).
- On-device: scales to approximately $1,200/year (more CDN transfer, same S3 storage).
The on-device cost curve is nearly flat because the marginal cost of an additional user is one CDN data load per session. The server-side cost curve is linear because every query consumes compute, bandwidth, and database capacity.
What stays on the server
On-device dispatch does not eliminate the server. It eliminates server involvement in read-path data transformations. The server still handles:
Data writes. When a user updates a guest profile, the write goes to the server database. The local cache is updated optimistically and reconciled on the next sync.
Authentication and authorization. Session tokens, role-based access control, and data scoping are server-side. The client never receives data it is not authorized to see.
Data synchronization. The client's local dataset is refreshed periodically (nightly batch, or real-time via WebSocket for high-frequency data). The server prepares and serves the columnar dataset.
Queries that exceed local data. Cross-property analytics, chain-wide reporting, and historical queries beyond the local cache window fall back to server-side queries. These are infrequent (typically less than 5% of total queries) and can be handled by a smaller, cheaper server fleet than the one required to serve 100% of queries.
AI inference. Language model inference (summarization, classification, entity extraction) runs on the server or via API. These are the operations where cloud compute is justified: high arithmetic intensity, model weights too large for client-side deployment, and per-query cost that reflects genuine computational work.
The principle: use the server for what requires the server. Use the client for what the client can handle. Sorting 500,000 rows does not require a server. It requires an IEEE 754 bit-transform and 4 ms of GPU time.
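That bit-transform is compact enough to show (a JS sketch of the idea; the production path runs the equivalent inside a WGSL compute shader feeding a radix sort):

```javascript
// IEEE 754 bit-transform that makes f32 keys sortable as unsigned integers:
// flip all bits of negatives, flip only the sign bit of non-negatives.
// After the transform, unsigned integer order equals float order.
function f32ToSortableU32(f) {
  const buf = new ArrayBuffer(4);
  new Float32Array(buf)[0] = f;
  const bits = new Uint32Array(buf)[0];
  return (bits & 0x80000000) ? (~bits >>> 0) : ((bits ^ 0x80000000) >>> 0);
}

const keys = [3.5, -1.25, 0, -7.0, 2.0];
const sorted = [...keys].sort((a, b) => f32ToSortableU32(a) - f32ToSortableU32(b));
// sorted: [-7, -1.25, 0, 2, 3.5]
```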
The latency dividend
Cost reduction is the primary argument. Latency reduction is the bonus.
A server-side query at p50: 120 ms. At p95: 280 ms. At p99 (database connection pool exhaustion, GC pause on the app server, network congestion): 800 ms.
An on-device query at p50: 3.2 ms. At p95: 4.5 ms. At p99: 8.1 ms.
The p99 improvement is 99x. But more importantly, the variance collapses. Server-side p99/p50 ratio: 6.7x. On-device p99/p50 ratio: 2.5x. The tail latency that makes dashboards feel unreliable disappears.
Users do not perceive average latency. They perceive worst-case latency. A dashboard that usually responds in 120 ms but occasionally freezes for 800 ms feels broken. A dashboard that consistently responds in 3 to 8 ms feels instant. The consistency is the product quality improvement. The cost savings are the business case.
The architecture decision
Every query you route to a server is a cost you pay forever. Every query you route to the client is a cost you pay once (the engineering investment to build the client-side engine) and never again.
Server-side query architectures made sense when browsers were rendering engines with no compute capability. WebGPU changes that equation. The user's device has a GPU with thousands of cores sitting idle while your server farm processes their sort request.
Our adaptive dispatch engine makes the client GPU productive. The precision analyser ensures correctness. The pipeline fusion engine eliminates transfer overhead. The device loss handler ensures reliability. Together, they provide the same guarantees as a server-side query engine (correct results, fault tolerance, observability) without the server.
This is the cost structure behind our enterprise AI automation infrastructure. We do not optimize your server costs. We eliminate them for the 95% of queries that never needed a server in the first place. The remaining 5% run on a server fleet sized for 5% of the load, at 5% of the cost. The savings are not incremental. They are structural.