The reliability gap in browser GPU compute
Server-side GPU compute runs in controlled environments. The CUDA context lives for the lifetime of the process. Driver updates are scheduled. If the hardware fails, Kubernetes reschedules the workload onto another node. The application code never handles a GPU failure because the orchestration layer handles it.
Browser-side GPU compute has no orchestration layer. The GPU device can be destroyed at any moment by forces entirely outside your application's control. Your code must handle this, or your application crashes.
Most browser-based GPU implementations do not handle it. They assume the GPU is available for the lifetime of the page. When it disappears, the user sees a blank screen, a frozen tab, or an unhandled promise rejection in the console.
For an enterprise deployment where 500 users rely on a dashboard every working day, "the GPU sometimes crashes the tab" is not a bug report. It is a business continuity failure.
What destroys a GPU device
Seven scenarios cause GPU device loss in production enterprise environments. Every one of them is outside your control.
Driver crash and recovery
On Windows, the Timeout Detection and Recovery (TDR) mechanism resets the GPU after a hang. The default timeout is 2 seconds. Any GPU operation that exceeds this (including operations from other tabs or other applications) triggers a driver reset. Every GPU context on the system is destroyed. Your application's WebGPU device is collateral damage.
On macOS, IOKit reclaims GPU resources under memory pressure or when the WindowServer detects an unresponsive GPU. On Linux, the DRM fence timeout triggers a GPU reset.
These are not rare events. On enterprise fleets with mixed driver versions and background software (antivirus real-time scanning, VPN clients, endpoint management agents), TDR events occur on 2% to 5% of machines per week.
Chrome GPU watchdog
Chrome enforces its own GPU timeout independent of the OS driver. Compute shaders that exceed the browser's threshold (typically 2 seconds, configurable via enterprise policy) are killed. The GPU process is terminated and restarted. Every tab using that GPU process loses its device.
A poorly optimized WebGL game in another tab can trigger the watchdog, killing your compute pipeline in a tab the user is actively working in.
External GPU disconnection
Thunderbolt eGPUs are standard in enterprise creative and engineering workflows. Users dock and undock throughout the day. If your application initialized WebGPU on the external GPU and the user pulls the cable, the device is lost instantly. The system falls back to the integrated GPU, which has different performance characteristics, different memory bandwidth, and potentially different driver capabilities.
Power management transitions
When a laptop switches from AC power to battery, Windows Hybrid Graphics or macOS automatic graphics switching may power down the discrete GPU. GPU contexts on the discrete GPU are destroyed. The user does not receive a warning. Your application receives GPUDevice.lost.
System sleep and resume
After a sleep/wake cycle, GPU state may or may not survive. The behaviour varies by OS, driver, and hardware generation. On many enterprise laptop configurations (particularly Intel + NVIDIA Optimus setups), the GPU context is destroyed during sleep and not restored.
Background tab throttling
Chrome aggressively manages resources for background tabs. After 5 minutes of inactivity, a background tab may have its GPU access revoked. When the user returns to the tab, the device is gone.
VDI and remote desktop
Virtual Desktop Infrastructure (Citrix, VMware Horizon, Amazon WorkSpaces) presents a virtualized GPU to the browser. The virtual GPU can be reclaimed, migrated, or reset by the hypervisor at any time. VDI environments are among the least stable for GPU persistence.
What GPUDevice.lost provides
The WebGPU specification gives you exactly one detection mechanism:
const device = await adapter.requestDevice();
device.lost.then((info: GPUDeviceLostInfo) => {
// info.reason: "destroyed" (explicit) or "unknown" (everything else)
// info.message: human-readable description (browser-dependent)
});
Three properties define this API:
It fires once. The promise resolves a single time. There is no event listener you can re-register. After resolution, the device object is permanently dead.
It is asynchronous. The promise resolves in a microtask, not synchronously at the point of failure. If your code is encoding a command buffer when the device dies, the encoding calls do not throw. They produce silently invalid state. queue.submit() may or may not throw, depending on timing and browser implementation.
The reason is opaque. "unknown" covers every scenario from driver crash to cable disconnection to sleep/wake. Your code cannot distinguish between them. The recovery path must handle all of them.
What becomes invalid on device loss
Every GPU object created from the lost device is permanently unusable:
| Object type | Count in a typical pipeline | State after loss |
|---|---|---|
| GPUBuffer (data, intermediate, output) | 10 to 50 | Invalid. VRAM contents gone. |
| GPUComputePipeline (compiled shaders) | 5 to 15 | Invalid. Compiled code gone. |
| GPUBindGroup (buffer/texture bindings) | 5 to 15 | Invalid. Binding references gone. |
| GPUShaderModule (WGSL source compiled) | 5 to 15 | Invalid. Compilation result gone. |
| GPUCommandEncoder (in-flight commands) | 0 to 2 | Invalid. Encoded commands discarded. |
| GPUQuerySet (timing, pipeline stats) | 0 to 4 | Invalid. Measurement data gone. |
A compute engine that caches pipelines (to avoid 5 to 20 ms shader compilation per operation) and pools buffers (to avoid per-operation allocation overhead) holds dozens of these objects. All of them become invalid simultaneously.
If the engine does not detect this and attempts to use a dead pipeline or buffer, the result is a GPUValidationError on the next queue.submit(), a silently failed dispatch (no results written), or a browser-level crash of the GPU process.
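One defensive pattern against accidentally reusing dead objects, sketched here as an illustration rather than as any engine's actual implementation, is a device generation counter: every cached object records the device generation it was created under, and lookups refuse to return entries from a dead generation.

```typescript
// Sketch: tag cached objects with a device generation so entries
// created under a lost device can never be returned. The string
// "pipeline" stands in for a real GPUComputePipeline.
let deviceGeneration = 0;

interface CacheEntry {
  generation: number;
  pipeline: string;
}

const pipelineCache = new Map<string, CacheEntry>();

function cachePipeline(key: string, pipeline: string): void {
  pipelineCache.set(key, { generation: deviceGeneration, pipeline });
}

function getPipeline(key: string): string | null {
  const entry = pipelineCache.get(key);
  // Reject hits from an earlier device generation: those objects are dead.
  if (!entry || entry.generation !== deviceGeneration) return null;
  return entry.pipeline;
}

function onDeviceLost(): void {
  deviceGeneration += 1; // invalidates every existing entry at once
  pipelineCache.clear(); // and releases the references for GC
}
```

Even if a clear() call is missed somewhere, the generation check guarantees a stale object is treated as a cache miss rather than submitted to the GPU.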
Our cascading fallback architecture
We treat device loss as a normal operational event. Not an exception. Not an edge case. A state transition that the engine handles automatically, with no application-level error handling required.
The fallback has three cascading stages.
Stage 1: Immediate state invalidation (< 0.1 ms)
The GPUDevice.lost callback fires. Within the same microtask:
device.lost.then((info) => {
// 1. Invalidate pipeline cache
pipelineCache.clear();
// All Map entries referencing GPUComputePipeline objects are removed.
// The pipelines are dead. Holding references would prevent GC
// and risk accidental reuse.
// 2. Drain buffer pool
bufferPool.invalidateAll();
// Every GPUBuffer in the pool (idle and in-flight) is marked dead.
// The pool's free list resets to empty.
// No buffer.destroy() calls - the device is already gone.
// 3. Drop bind group cache
bindGroupCache.clear();
// Bind groups reference specific buffers and pipelines.
// All invalid. All cleared.
// 4. Set device state
deviceAvailable = false;
currentDevice = null;
currentAdapter = null;
// 5. Schedule re-probe (executes on next dispatch, not now)
if (info.reason !== 'destroyed') {
reProbeScheduled = true;
}
// 6. Emit telemetry
telemetry.emit('device_lost', {
timestamp: Date.now(),
reason: info.reason,
pendingOps: pendingOperations.size,
adapterInfo: lastAdapterInfo,
});
});
The entire invalidation touches only in-memory JavaScript data structures. No async operations. No GPU calls (the device is dead). No network requests. Completion time: under 0.1 ms.
Stage 2: In-flight operation re-dispatch (0.1 to 0.5 ms)
If a compute operation was in progress when device loss occurred, the caller is holding a promise that must resolve. The engine maintains a pending operation queue. Each entry contains:
- The operation descriptor (what to compute)
- The input data (still in JavaScript heap memory, as typed arrays in the SharedArrayBuffer)
- The result promise's resolver
On device loss, the engine iterates the pending queue:
for (const op of pendingOperations) {
// Re-dispatch to CPU tier
if (navigator.hardwareConcurrency > 1) {
workerDispatch(op.descriptor, op.inputData).then(op.resolve);
} else {
cpuDispatch(op.descriptor, op.inputData).then(op.resolve);
}
}
pendingOperations.clear();
The input data is intact. It was never moved to the GPU. device.queue.writeBuffer() copies data to the GPU. The original SharedArrayBuffer retains the source. There is no data loss. There is no checkpoint to restore. The CPU tier recomputes from the original input.
The caller's promise resolves with correct results. The only observable difference is latency: the GPU path for a 500,000-element sort takes 3 ms. The Web Worker fallback takes 12 ms. The user might notice a single slower interaction. They will not notice an error.
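The re-dispatch mechanics can be sketched as a self-contained model. The names (PendingOp, cpuDispatch) and the sort stand-in are illustrative, not the engine's real API; the point is that each pending entry carries its input and its resolver, so draining the queue recomputes and resolves without the caller seeing a failure:

```typescript
// Minimal sketch of the Stage 2 pending-operation queue.
type PendingOp = {
  descriptor: string;
  inputData: number[];
  resolve: (result: number[]) => void;
};

const pendingOperations = new Set<PendingOp>();

// CPU-tier stand-in: here, "sort" just sorts a copy of the input.
function cpuDispatch(descriptor: string, input: number[]): number[] {
  if (descriptor === "sort") return [...input].sort((a, b) => a - b);
  throw new Error(`unknown op: ${descriptor}`);
}

// On device loss: drain the queue, recompute on the CPU from the
// intact input, and resolve each caller's promise with real results.
function redispatchAll(): void {
  for (const op of pendingOperations) {
    op.resolve(cpuDispatch(op.descriptor, op.inputData));
  }
  pendingOperations.clear();
}
```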
Stage 3: Hardware re-probe on next invocation (< 200 ms)
The re-probe does not happen immediately after device loss. There is no point probing hardware when no operation needs the GPU. The probe executes on the next dispatch() call where deviceAvailable is false and reProbeScheduled is true.
Step 1: Request a new adapter.
const adapter = await navigator.gpu?.requestAdapter();
if (!adapter) {
// No GPU available (eGPU disconnected, VDI reclaimed, etc.)
// Remain in CPU-only mode. No error.
reProbeScheduled = false;
return;
}
If requestAdapter() returns null, no GPU exists on the system. The engine stays in CPU-only mode indefinitely (or until the user docks an eGPU and the next probe detects it).
Step 2: Compare adapter info.
const newInfo = adapter.info; // GPUAdapterInfo is exposed directly on the adapter
const hardwareChanged = (
newInfo.vendor !== lastAdapterInfo.vendor ||
newInfo.architecture !== lastAdapterInfo.architecture ||
newInfo.device !== lastAdapterInfo.device
);
If the vendor, architecture, or device string has changed, the hardware is different. An eGPU was disconnected and the system fell back to integrated graphics. Or a driver update changed the device identifier. The engine flags hardwareChanged = true to force recalibration.
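Extracted as a pure helper, the comparison looks like this. AdapterInfo mirrors the three GPUAdapterInfo fields the text compares; the helper itself is an illustrative sketch:

```typescript
// Sketch of the Step 2 adapter-identity comparison.
interface AdapterInfo {
  vendor: string;
  architecture: string;
  device: string;
}

// True when any identifying field differs: different physical GPU,
// a fallback from discrete to integrated, or a driver update that
// changed the device string.
function hardwareChanged(prev: AdapterInfo, next: AdapterInfo): boolean {
  return (
    next.vendor !== prev.vendor ||
    next.architecture !== prev.architecture ||
    next.device !== prev.device
  );
}
```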
Step 3: Request a new device.
const device = await adapter.requestDevice({
requiredLimits: { maxBufferSize: targetBufferSize },
});
// Register the lost callback immediately
device.lost.then(handleDeviceLoss);
A fresh GPUDevice.lost callback is registered on the new device before any operations are dispatched. If this device also fails, the cascade repeats.
Step 4: Re-run calibration.
The engine runs the same memory bandwidth and dispatch overhead microbenchmarks used during initial startup. The calibration takes under 200 ms. The new calibration ratio replaces the old one.
If the hardware changed (discrete to integrated GPU after undocking), the new ratio reflects the weaker hardware. The crossover thresholds adjust: operations that previously dispatched to the GPU at 500,000 elements may now require 2,000,000 elements to justify GPU dispatch on the integrated GPU. The system adapts automatically.
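The threshold adjustment can be illustrated with a deliberately simple linear model. The constants, the reference-ratio framing, and the proportional scaling are assumptions for the sketch, not the engine's real calibration math:

```typescript
// Sketch: scale the GPU/CPU crossover threshold by the calibration
// ratio measured during re-probe. A higher ratio means the GPU's
// advantage kicks in later, so the element-count threshold grows
// proportionally.
function crossoverThreshold(
  baselineElements: number, // threshold measured on reference hardware
  referenceRatio: number,   // calibration ratio on reference hardware
  measuredRatio: number     // ratio from the latest re-probe
): number {
  return Math.round(baselineElements * (measuredRatio / referenceRatio));
}
```

Under this model, a 500,000-element threshold on the reference GPU becomes a 2,000,000-element threshold on hardware that calibrates four times weaker, matching the example in the text.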
Step 5: Resume GPU dispatch.
deviceAvailable flips to true. The pipeline cache and buffer pool are empty but functional. The first few operations pay a one-time pipeline compilation cost (5 to 20 ms per unique shader). Subsequent operations hit the warm cache. Within 3 to 5 operations, performance returns to pre-loss levels.
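The warm-up behaviour comes from straightforward memoization. In this sketch, compile stands in for createComputePipeline and the compile counter makes the one-time cost visible; none of these names are the engine's real API:

```typescript
// Sketch: a pipeline cache that pays the compile cost once per unique
// shader source, then serves warm hits.
function makePipelineCache(compile: (src: string) => string) {
  const cache = new Map<string, string>();
  let compileCount = 0;
  return {
    get(src: string): string {
      let pipeline = cache.get(src);
      if (pipeline === undefined) {
        pipeline = compile(src); // the 5 to 20 ms cost, paid once
        compileCount += 1;
        cache.set(src, pipeline);
      }
      return pipeline;
    },
    get compiles(): number {
      return compileCount;
    },
  };
}
```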
Race conditions and their handling
GPU device loss introduces timing-sensitive edge cases. Our engine handles three.
Race 1: Loss during command encoding
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(count);
pass.end();
device.queue.submit([encoder.finish()]); // May throw or silently fail
If the device dies between createCommandEncoder() and queue.submit(), the encoding calls do not throw (they operate on local state), but submit() fails. Our engine wraps the encode-submit sequence in a try/catch. On failure, the operation enters the Stage 2 re-dispatch path.
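The guard reduces to a small wrapper. The function names here (encodeAndSubmit, redispatchToCpu) are illustrative stand-ins for the engine's GPU path and its Stage 2 fallback:

```typescript
// Sketch of the Race 1 encode-submit guard: any throw from the GPU
// path routes the operation into the CPU re-dispatch path instead of
// surfacing an error to the caller.
function guardedSubmit<T>(
  encodeAndSubmit: () => T,
  redispatchToCpu: () => T
): T {
  try {
    return encodeAndSubmit();
  } catch {
    // Device died mid-encode or at submit. The input is still on the
    // CPU side, so recompute there.
    return redispatchToCpu();
  }
}
```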
Race 2: Loss during buffer readback
await resultBuffer.mapAsync(GPUMapMode.READ); // Rejects if the device is lost mid-map
If the device dies while mapAsync is pending, the promise rejects. Our engine catches the rejection and re-dispatches to the CPU tier. The input data is still in the SharedArrayBuffer. No work is lost.
Race 3: Loss during re-probe
The adapter can become invalid between requestAdapter() and requestDevice(). This happens if the GPU is removed during the re-probe itself (rare, but possible with hot-swappable eGPUs). The entire re-probe sequence is wrapped in a try/catch. On failure, the engine remains in CPU-only mode and schedules another re-probe for the next invocation.
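The failure handling can be modelled as a small state transition. The probe callback stands in for the whole requestAdapter/requestDevice sequence; this is a sketch of the retry semantics described above, not the engine's code:

```typescript
// Sketch of the Race 3 re-probe guard: a probe failure leaves the
// engine in CPU-only mode with another probe scheduled for the next
// dispatch, while a success resumes GPU availability.
interface EngineState {
  deviceAvailable: boolean;
  reProbeScheduled: boolean;
}

function runReProbe(state: EngineState, probe: () => void): void {
  try {
    probe(); // stand-in for requestAdapter() + requestDevice()
    state.deviceAvailable = true;
    state.reProbeScheduled = false;
  } catch {
    // GPU vanished during the probe itself: stay on the CPU tier and
    // try again on the next invocation.
    state.deviceAvailable = false;
    state.reProbeScheduled = true;
  }
}
```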
What the application developer writes
From the application's perspective:
const sorted = await engine.dispatch('radix_sort', inputArray);
const filtered = await engine.dispatch('filter_gt', { data: sorted, threshold: 1000 });
const grouped = await engine.dispatch('group_sum', { data: filtered, groupBy: 'region' });
No GPU-specific error handling. No device availability check. No fallback branching. The engine returns correct results regardless of whether the GPU is available, was lost mid-operation, or never existed.
The application code is identical for a workstation with an RTX 4090, a laptop with Intel UHD, and a VDI terminal with no GPU. The performance differs. The API does not.
Observability for operations teams
Silent recovery is the correct user experience. For SRE and infrastructure teams, visibility into device loss events is critical for fleet health monitoring.
The engine emits structured telemetry at every state transition:
[
{
"event": "device_lost",
"timestamp": "2026-04-14T14:23:41.182Z",
"reason": "unknown",
"adapter": "NVIDIA GeForce RTX 3060",
"pendingOperations": 1,
"sessionUptime": 7241000
},
{
"event": "fallback_dispatched",
"timestamp": "2026-04-14T14:23:41.183Z",
"operation": "radix_sort",
"originalTier": "gpu",
"fallbackTier": "web_workers",
"elementCount": 500000,
"estimatedLatencyIncrease": "9ms"
},
{
"event": "reprobe_completed",
"timestamp": "2026-04-14T14:23:48.412Z",
"adapterChanged": true,
"previousAdapter": "NVIDIA GeForce RTX 3060",
"newAdapter": "Intel UHD Graphics 770",
"newCalibrationRatio": 4.2,
"recalibrationTime": 187
}
]
These events answer the questions your operations team will ask:
- "How often are users hitting device loss?" Count device_lost events per day, segmented by adapter vendor.
- "Which GPU drivers are most unstable?" Correlate device_lost frequency with adapter info strings.
- "What is the user impact?" The estimatedLatencyIncrease on fallback_dispatched quantifies the degradation per event.
- "Are we recovering correctly?" Every device_lost should be followed by either a reprobe_completed event or a sustained CPU-only mode. Missing recovery events indicate a bug.
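The recovery invariant lends itself to an automated check over the telemetry stream. This sketch flags device_lost events with no subsequent reprobe_completed; sustained CPU-only mode is a legitimate end state, so a nonzero count is a signal to investigate, not automatically a bug:

```typescript
// Sketch: count device_lost events that were never followed by a
// reprobe_completed in the ordered telemetry stream.
interface TelemetryEvent {
  event: string;
}

function unrecoveredLossCount(events: TelemetryEvent[]): number {
  let open = 0;
  for (const e of events) {
    if (e.event === "device_lost") open += 1;
    else if (e.event === "reprobe_completed" && open > 0) open -= 1;
  }
  return open;
}
```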
Server-side comparison
| Concern | Server-side (Kubernetes + CUDA) | Browser-side (our engine) |
|---|---|---|
| Failure detection | Health checks (seconds) | GPUDevice.lost promise (microtask) |
| Recovery mechanism | Pod restart + rescheduling | In-process re-dispatch + re-probe |
| Data persistence | Checkpoint to S3/disk | Input in SharedArrayBuffer (never left CPU) |
| Recovery time | 10 to 60 seconds | Under 200 ms |
| Hardware change detection | Node labels, device plugins | Adapter info comparison |
| Application code changes | Retry logic, circuit breakers | None |
| Blast radius | Pod-level (other pods unaffected) | Tab-level (other tabs unaffected) |
The browser model is simpler in one critical respect: input data never leaves JavaScript heap memory. The GPU receives a copy. When the GPU dies, the copy is gone, but the original is intact. There is no checkpointing problem.
Why this matters for your business
A GPU-accelerated dashboard that crashes when the driver resets is worse than a dashboard that never used the GPU. The GPU acceleration buys you 3 ms query times. The crash costs you a support ticket, a lost workflow, and an employee who stops trusting the tool.
Our engine delivers the 3 ms query times without the crash risk. The GPU is an optimization layer. Losing it degrades performance (3 ms becomes 12 ms). It does not degrade correctness, availability, or user experience beyond a single slower interaction.
This is the reliability standard behind our enterprise AI automation infrastructure. We do not build systems that work when everything is perfect. We build systems that work when the GPU crashes, the driver resets, the laptop undocks, and the user never knows. Because in an enterprise fleet of 500 machines running 8 hours a day, 5 days a week, "the GPU sometimes crashes" is not a theoretical risk. It is a weekly event. Our engine handles it before your users notice it.