Browser GPU memory is not your memory
In native CUDA or Vulkan, you allocate GPU memory from a known pool. You query the device for total VRAM (8 GB, 12 GB, 24 GB), track your allocations, and manage the lifecycle explicitly. The GPU memory is dedicated. The operating system does not use it for other purposes.
In the browser, the GPU memory is shared. The browser's compositor uses it for rendering every tab's content. CSS animations, video playback, WebGL canvases, and image decoding all consume GPU memory. Your WebGPU compute buffers compete with all of these for the same pool.
The browser does not tell you how much GPU memory is available. There is no navigator.gpu.getAvailableMemory() API. You can query the device's limits, but those limits describe the maximum per-buffer size, not the total available memory.
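Reading those per-buffer limits is straightforward. A minimal sketch follows; `GPULike` is a local structural stand-in for `navigator.gpu` so the snippet is self-contained outside a browser, where you would pass `navigator.gpu` directly:

```typescript
// Hedged sketch: read the per-buffer limits once at startup. The limit names
// (maxStorageBufferBindingSize, maxBufferSize) are per the WebGPU spec;
// GPULike is our stand-in so the snippet runs without browser globals.
interface GPULike {
  requestAdapter(): Promise<{ limits: Record<string, number> } | null>;
}

async function readComputeLimits(gpu: GPULike | undefined) {
  const adapter = gpu ? await gpu.requestAdapter() : null;
  if (!adapter) throw new Error("WebGPU not available");
  const { maxStorageBufferBindingSize, maxBufferSize } = adapter.limits;
  return { maxStorageBufferBindingSize, maxBufferSize };
}
```

These two numbers are the only memory facts the platform hands you; everything else must be inferred.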
If you allocate too much, the browser may:
- Return an out-of-memory error from device.createBuffer().
- Silently evict other GPU resources (textures, render targets), degrading rendering performance across all tabs.
- Trigger a GPU device loss as the driver fails to satisfy the allocation.
- On mobile browsers, crash the tab entirely with no recovery.
Enterprise applications cannot tolerate any of these outcomes.
The limits you must respect
WebGPU exposes memory constraints through adapter.limits. The critical fields for compute:
maxStorageBufferBindingSize
The maximum size of a single storage buffer binding. This is the hard ceiling on how much data a single compute shader can access in one buffer.
| Hardware class | Typical maxStorageBufferBindingSize |
|---|---|
| Discrete GPU (NVIDIA, AMD) | 2 GB to 4 GB |
| Apple M-series | 1 GB to 2 GB |
| Intel integrated (Iris Xe, UHD) | 256 MB to 1 GB |
| Qualcomm Adreno (mobile) | 256 MB to 512 MB |
| ARM Mali (mobile) | 128 MB to 256 MB |
| Software fallback (WARP, SwiftShader) | 256 MB |
A dataset of 50 million Float32 elements occupies 200 MB. This fits on every device in the table. A dataset of 500 million Float32 elements occupies 2 GB. This exceeds the limit on mobile GPUs, most integrated GPUs, and even some Apple M-series configurations.
You cannot discover this by testing on your development machine.
maxBufferSize
The maximum total size of a single GPUBuffer, regardless of how it is bound. Often equal to maxStorageBufferBindingSize but can differ. Some implementations allow larger buffers that are only partially bound.
maxComputeWorkgroupStorageSize
The maximum shared memory per workgroup. Typically 16 KB to 48 KB. This limits the size of workgroup-local accumulators, histograms, and bitonic sort tiles.
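As a toy example of respecting this limit (the 256-bin histogram and 4-byte counter size are illustrative, not figures from the engine):

```typescript
// Illustrative check: does a workgroup-local accumulator fit in shared memory?
// A 256-bin histogram of 4-byte (u32) counters needs 1 KB, well under a
// typical 16 KB maxComputeWorkgroupStorageSize.
function fitsInWorkgroupStorage(
  elementCount: number,
  bytesPerElement: number,
  maxComputeWorkgroupStorageSize: number,
): boolean {
  return elementCount * bytesPerElement <= maxComputeWorkgroupStorageSize;
}
```

For instance, `fitsInWorkgroupStorage(256, 4, 16 * 1024)` passes, while an 8,192-bin histogram of u32 counters (32 KB) would not fit in a 16 KB budget.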
No total VRAM query
The WebGPU specification deliberately does not expose total or available GPU memory. This is a privacy decision: VRAM size is a fingerprinting vector that could identify specific hardware models. The engine must infer memory availability from the per-buffer limits and the success or failure of allocation attempts.
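One way to make such an allocation attempt safely is to wrap it in an out-of-memory error scope. `pushErrorScope`/`popErrorScope` are real WebGPU APIs; `DeviceLike` below is a local structural stand-in for `GPUDevice` so the sketch is self-contained:

```typescript
// Hedged sketch: probe whether an allocation of `size` bytes succeeds by
// wrapping createBuffer in an "out-of-memory" error scope. The probe buffer
// is destroyed immediately; only the success/failure signal is kept.
interface DeviceLike {
  pushErrorScope(filter: "out-of-memory"): void;
  popErrorScope(): Promise<unknown | null>;
  createBuffer(desc: { size: number; usage: number }): { destroy(): void };
}

async function probeAllocation(
  device: DeviceLike,
  size: number,
  usage: number,
): Promise<boolean> {
  device.pushErrorScope("out-of-memory");
  const buffer = device.createBuffer({ size, usage });
  const error = await device.popErrorScope(); // null means the allocation succeeded
  if (error !== null) return false;
  buffer.destroy(); // probe only; release the memory right away
  return true;
}
```

A probe like this is itself an allocation, so it should be used sparingly; the engine prefers the static limit checks described next.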
Our runtime limit checking
Before any GPU buffer allocation, the engine checks whether the required buffer size fits within the device's reported limits:
function canAllocateGPUBuffer(
device: GPUDevice,
requiredBytes: number,
limits: GPUSupportedLimits
): boolean {
if (requiredBytes > limits.maxStorageBufferBindingSize) {
return false; // Single buffer exceeds binding limit
}
if (requiredBytes > limits.maxBufferSize) {
return false; // Buffer exceeds device maximum
}
return true;
}
This check runs before the 6-factor scoring function. If the dataset cannot physically fit in a GPU buffer, the scoring function is never invoked. The operation routes to the Web Worker tier unconditionally.
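The guard composes with the scorer roughly like this; `scoreAndRoute` is a hypothetical stand-in for the 6-factor scoring function, passed in as a parameter so the sketch stays self-contained:

```typescript
type Tier = "gpu" | "worker";

// Sketch: the feasibility check short-circuits routing. Only when the buffer
// physically fits do we consult the (stubbed) 6-factor scorer at all.
function routeOperation(
  requiredBytes: number,
  limits: { maxStorageBufferBindingSize: number; maxBufferSize: number },
  scoreAndRoute: (bytes: number) => Tier, // hypothetical scorer stub
): Tier {
  const fits =
    requiredBytes <= limits.maxStorageBufferBindingSize &&
    requiredBytes <= limits.maxBufferSize;
  return fits ? scoreAndRoute(requiredBytes) : "worker";
}
```

On a 256 MB-limit integrated GPU, a 1 GB request never reaches the scorer; it goes straight to the Web Worker tier.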
Multi-buffer operations
Some operations require multiple buffers: an input buffer, an output buffer, and possibly intermediate buffers for pipeline-fused operations. The engine calculates total GPU memory required for the operation and checks against a conservative estimate of available memory:
function canFitOperation(operation: OperationPlan, limits: GPUSupportedLimits): boolean {
let totalRequired = 0;
for (const buffer of operation.requiredBuffers) {
if (buffer.size > limits.maxStorageBufferBindingSize) {
return false; // Any single buffer exceeds limit
}
totalRequired += buffer.size;
}
// Conservative heuristic: estimate total GPU memory as roughly 4x the max
// binding size and assume we can use at most 50% of it, i.e. 2x the binding
// size in total. No API tells us actual availability.
const conservativeLimit = limits.maxStorageBufferBindingSize * 2;
return totalRequired <= conservativeLimit;
}
The 50% heuristic is deliberately conservative. We would rather route to CPU unnecessarily (6 ms instead of 3 ms) than trigger an out-of-memory device loss (tab crash, user loses work). The heuristic has been validated across 14 device configurations. On no device did it produce a false positive (approving an allocation that then fails). On 2 devices (mobile, very constrained), it produced false negatives (routing to CPU when the GPU could have handled the operation). A small performance sacrifice for guaranteed stability.
The allocation latency problem
Every GPU buffer allocation has a fixed cost. device.createBuffer() allocates the memory on the GPU. device.queue.writeBuffer() transfers data from the CPU. Both take time.
| Buffer size | createBuffer time | writeBuffer time | Total allocation |
|---|---|---|---|
| 100 KB | 0.08 ms | 0.01 ms | 0.09 ms |
| 1 MB | 0.12 ms | 0.05 ms | 0.17 ms |
| 4 MB | 0.15 ms | 0.20 ms | 0.35 ms |
| 20 MB | 0.20 ms | 0.85 ms | 1.05 ms |
| 40 MB | 0.25 ms | 1.70 ms | 1.95 ms |
createBuffer() overhead is nearly constant regardless of size (it allocates virtual address space, not physical memory). writeBuffer() scales linearly with data size (PCIe bandwidth-limited).
For a single query, 0.35 ms allocation on a 4 MB dataset is negligible against 1.1 ms of compute. But dashboards do not run single queries.
The dashboard allocation cascade
A user opens a dashboard with 5 chart panels. Each panel runs a query on load. The user adjusts a filter. All 5 panels re-query. They change the grouping. 5 more queries. In 30 seconds, the dashboard has executed 15 queries.
If each query allocates fresh GPU buffers:
15 queries * 2 buffers each (input + output) * 0.35 ms per buffer = 10.5 ms
10.5 ms spent on buffer allocation. The aggregate compute time for 15 queries might be 15 ms. You are spending 41% of your total GPU time on allocation, not computation.
Worse, each abandoned buffer must be garbage collected. WebGPU buffers are not automatically freed when they go out of scope in JavaScript. The GPUBuffer object is garbage-collected by V8, but the underlying GPU memory is only released when buffer.destroy() is called or the device is lost. If you create 30 buffers per session and never destroy them, you accumulate GPU memory that is invisible to JavaScript's memory profiler but very visible to the GPU driver.
Our size-bucketed buffer pool
The pool eliminates both problems: allocation latency and memory leaks.
Pool architecture
The pool maintains a set of pre-allocated GPU buffers organized by size bucket. Each bucket holds buffers of a specific power-of-two size:
interface BufferPool {
buckets: Map<number, PoolBucket>; // Key: buffer size in bytes
device: GPUDevice;
maxPoolSize: number; // Total bytes across all buckets
currentPoolSize: number;
checkedOut: Map<GPUBuffer, CheckoutRecord>;
}
interface PoolBucket {
size: number;
available: GPUBuffer[]; // Buffers ready for reuse
totalAllocated: number; // Count of all buffers in this bucket
}
interface CheckoutRecord {
buffer: GPUBuffer;
bucket: number;
checkedOutAt: number; // Timestamp
operation: string; // Which operation checked this out
}
Size buckets are powers of two: 64 KB, 128 KB, 256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB. A request for a 3 MB buffer receives a 4 MB buffer from the 4 MB bucket. The 1 MB of wasted space (25% overhead in this case) is the tradeoff for amortized O(1) allocation.
Average waste across all bucket sizes: 25% (the expected overhead of power-of-two bucketing on uniformly distributed request sizes). In practice, query engines produce predictable buffer sizes (row counts tend to be consistent within a session), so the actual waste is lower.
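The bucket selection relies on a next-power-of-two rounding (the `acquire()` function below calls `nextPowerOfTwo()`). A minimal version, with a 64 KB floor matching the smallest bucket, could look like:

```typescript
const MIN_BUCKET_BYTES = 64 * 1024; // smallest bucket in the list above

// Round a requested size up to the nearest power-of-two bucket, never below
// the 64 KB floor. Doubling from the floor guarantees every result is one of
// the pool's bucket sizes.
function nextPowerOfTwo(requestedBytes: number): number {
  let size = MIN_BUCKET_BYTES;
  while (size < requestedBytes) size *= 2;
  return size;
}
```

A 3 MB request rounds up to the 4 MB bucket; anything at or below 64 KB maps to the 64 KB bucket.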
Checkout and return
function acquire(pool: BufferPool, requestedSize: number, usage: GPUBufferUsageFlags): GPUBuffer {
const bucketSize = nextPowerOfTwo(requestedSize);
const bucket = pool.buckets.get(bucketSize);
if (bucket && bucket.available.length > 0) {
// Reuse an existing buffer
const buffer = bucket.available.pop()!;
pool.checkedOut.set(buffer, {
buffer,
bucket: bucketSize,
checkedOutAt: performance.now(),
operation: getCurrentOperation(),
});
return buffer;
}
// No available buffer in this bucket. Allocate a new one. If the pool is
// full, evict the largest idle buffers until there is room. evictLargest()
// leaves currentPoolSize unchanged when nothing is idle, so the loop
// terminates (permitting a temporary overshoot in that edge case).
let previousSize = -1;
while (pool.currentPoolSize + bucketSize > pool.maxPoolSize && pool.currentPoolSize !== previousSize) {
  previousSize = pool.currentPoolSize;
  evictLargest(pool);
}
const buffer = pool.device.createBuffer({
size: bucketSize,
usage: usage | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
});
pool.currentPoolSize += bucketSize;
if (!bucket) {
pool.buckets.set(bucketSize, { size: bucketSize, available: [], totalAllocated: 1 });
} else {
bucket.totalAllocated++;
}
pool.checkedOut.set(buffer, {
buffer,
bucket: bucketSize,
checkedOutAt: performance.now(),
operation: getCurrentOperation(),
});
return buffer;
}
function release(pool: BufferPool, buffer: GPUBuffer): void {
const record = pool.checkedOut.get(buffer);
if (!record) return;
pool.checkedOut.delete(buffer);
const bucket = pool.buckets.get(record.bucket);
if (bucket) {
bucket.available.push(buffer);
}
}
Acquiring a pooled buffer: pop from the available stack. O(1). 0.01 ms. No GPU allocation call. No driver interaction.
Releasing a buffer: push back to the available stack. O(1). 0.01 ms. No buffer.destroy() call. The buffer stays allocated on the GPU, ready for the next query.
The writeBuffer() cost still applies (you must write fresh data to the reused buffer), but the createBuffer() cost is eliminated for all but the first use.
Pool warm-up
On the first few queries of a session, the pool is cold (no pre-allocated buffers). Each query triggers a createBuffer() call. By the third or fourth query, the pool has buffers in the commonly-used sizes, and subsequent queries reuse them.
For dashboards where every query processes the same dataset (same row count, same column types), the pool reaches steady state after the first interaction. All subsequent interactions have zero allocation overhead.
Leak prevention
GPU buffer leaks are insidious because they are invisible to JavaScript's standard debugging tools.
The leak problem
A GPUBuffer is a JavaScript object backed by GPU memory. V8's garbage collector can collect the JavaScript object when it goes out of scope. But the underlying GPU memory is not freed until buffer.destroy() is called. The GC and the GPU memory manager are independent systems.
If you create a buffer, use it for a query, and let it go out of scope without calling destroy(), the JavaScript object eventually gets collected. But "eventually" may be seconds or minutes. Until then, the GPU memory is held. On a mobile device with 256 MB of GPU-accessible memory, 10 leaked 20 MB buffers consume 200 MB. The next allocation fails. The device is lost.
Chrome's DevTools show JavaScript heap usage but not GPU memory usage. The performance.memory API reports JS heap size. There is no performance.gpuMemory. The leak is invisible.
Pool-based leak prevention
Our pool tracks every buffer checkout:
function detectLeaks(pool: BufferPool, timeoutMs: number): LeakedBuffer[] {
const now = performance.now();
const leaked: LeakedBuffer[] = [];
for (const [buffer, record] of pool.checkedOut) {
if (now - record.checkedOutAt > timeoutMs) {
leaked.push({
buffer,
operation: record.operation,
checkedOutDuration: now - record.checkedOutAt,
bucketSize: record.bucket,
});
}
}
return leaked;
}
The engine runs leak detection periodically (every 30 seconds by default). Any buffer checked out for longer than the timeout (default: 10 seconds, configurable) is considered leaked.
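Wiring the sweep to a timer is simple; a sketch (`startLeakSweeper` is a hypothetical helper, and the 30-second default matches the interval above):

```typescript
// Sketch: run a leak sweep on a fixed interval and hand back a stop function
// for teardown (e.g., on device loss or page unload).
function startLeakSweeper(sweep: () => void, intervalMs = 30_000): () => void {
  const id = setInterval(sweep, intervalMs);
  return () => clearInterval(id);
}
```

The caller passes a closure that runs detection and reclamation together; holding on to the returned stop function ensures the timer does not outlive the pool.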
Leaked buffers are force-returned to the pool:
function reclaimLeaked(pool: BufferPool, timeoutMs: number): void {
const leaked = detectLeaks(pool, timeoutMs);
for (const entry of leaked) {
pool.checkedOut.delete(entry.buffer);
const bucket = pool.buckets.get(entry.bucketSize);
if (bucket) {
bucket.available.push(entry.buffer);
}
telemetry.emit('buffer_leak_reclaimed', {
operation: entry.operation,
duration: entry.checkedOutDuration,
size: entry.bucketSize,
});
}
}
The telemetry event logs which operation leaked the buffer and how long it was held. This is a development-time signal: a buffer leak indicates a code path that failed to call release(). The telemetry identifies the offending operation so the bug can be fixed.
In production, the reclamation ensures that leaked buffers re-enter the pool instead of accumulating. GPU memory usage is bounded by the pool's maxPoolSize, regardless of how many operations forget to release.
Pool teardown on device loss
When the GPU device is lost, every buffer in the pool (both available and checked out) is invalid. The pool's invalidation handler clears all data structures:
function invalidatePool(pool: BufferPool): void {
// All buffers are dead. Do not call buffer.destroy() - the device is gone.
pool.buckets.clear();
pool.checkedOut.clear();
pool.currentPoolSize = 0;
}
No attempt is made to destroy the dead buffers. The device is gone, and the GPU memory they held has already been reclaimed by the driver. Calling destroy() on a buffer from a lost device is at best a no-op.
After the device is re-probed, a fresh pool is created for the new device. The pool starts cold and warms up over the first few queries, exactly as on initial page load.
Eviction strategy
When the pool reaches its maximum size and a new buffer is needed, the pool must evict an existing idle buffer to make room.
function evictLargest(pool: BufferPool): void {
let largestBucket: PoolBucket | null = null;
let largestSize = 0;
for (const bucket of pool.buckets.values()) {
if (bucket.available.length > 0 && bucket.size > largestSize) {
largestSize = bucket.size;
largestBucket = bucket;
}
}
if (largestBucket) {
const buffer = largestBucket.available.pop()!;
buffer.destroy(); // Release GPU memory
pool.currentPoolSize -= largestBucket.size;
largestBucket.totalAllocated--;
}
}
The eviction targets the largest idle buffer. This frees the most GPU memory per eviction. For a pool with idle buffers in the 1 MB, 4 MB, and 32 MB buckets, evicting the 32 MB buffer frees 32x more memory than evicting the 1 MB buffer.
buffer.destroy() is the only place in the engine where GPU memory is explicitly freed (outside of device loss). Every other buffer lifecycle is managed through the pool's checkout/return protocol.
Memory budget configuration
The pool's maximum size is configurable and defaults to a conservative fraction of the device's buffer limit:
function computePoolBudget(limits: GPUSupportedLimits): number {
const maxBinding = limits.maxStorageBufferBindingSize;
// Use at most 25% of the max binding size as pool budget.
// This leaves 75% for the browser's rendering, other tabs, and headroom.
const budget = Math.min(maxBinding * 0.25, 512 * 1024 * 1024); // Cap at 512 MB
return budget;
}
| Hardware class | maxStorageBufferBindingSize | Pool budget (25%) |
|---|---|---|
| Discrete GPU (4 GB) | 4,294,967,296 | 512 MB (capped) |
| Apple M2 (2 GB) | 2,147,483,648 | 512 MB |
| Intel Iris Xe (1 GB) | 1,073,741,824 | 256 MB |
| Intel UHD (256 MB) | 268,435,456 | 64 MB |
| ARM Mali (128 MB) | 134,217,728 | 32 MB |
On the Mali device, the pool holds at most 32 MB of GPU buffers. That is 8 buffers of 4 MB each, or 2 buffers of 16 MB. Sufficient for the medium-sized datasets that actually dispatch to the GPU on this hardware (the calibration thresholds route large datasets to CPU on weak GPUs).
The budget is set once at initialization and does not change during the session. The engine does not attempt to detect available GPU memory at runtime (no API exists for this). The conservative 25% budget ensures the pool never approaches the device's true limit, leaving ample headroom for browser rendering and other GPU consumers.
End-to-end: a dashboard session
A user opens a 5-panel analytics dashboard processing 500,000 rows (4 MB per column, 5 columns = 20 MB dataset).
First interaction (cold pool):
- 5 queries. Each needs 2 buffers (input 4 MB, output 4 MB). The pool allocates 10 buffers from the 4 MB bucket.
- Allocation cost: 10 * 0.35 ms = 3.5 ms.
- Compute cost: 5 * 1.1 ms = 5.5 ms.
- Total: 9.0 ms.
Second interaction (warm pool):
- 5 queries. Same buffer sizes. The pool has 10 idle 4 MB buffers.
- Allocation cost: 10 * 0.01 ms = 0.1 ms.
- Compute cost: 5 * 1.1 ms = 5.5 ms.
- Total: 5.6 ms.
Third through twentieth interaction (warm pool):
- Same as second. 5.6 ms each. Zero allocation overhead.
Over a 20-interaction session:
- Without pool: 20 * 9.0 ms = 180 ms total. 70 ms (39%) spent on allocation.
- With pool: 9.0 + 19 * 5.6 = 115.4 ms total. 5.4 ms (4.7%) spent on allocation.
The pool saves 64.6 ms across the session. And it prevents 190 GPU buffers from ever being created and leaked: without the pool, 20 interactions * 10 buffers = 200 undestroyed allocations; with it, the same 10 buffers are reused throughout.
Why this matters for enterprise
Enterprise applications run all day. A dashboard left open for 8 hours with periodic interactions can execute thousands of queries. Without a buffer pool, each query leaks GPU memory. After a few hundred queries, the GPU is out of memory. The tab crashes. The user reopens it. The cycle repeats.
Our pool bounds GPU memory usage at the configured budget, regardless of query count. The 500th query uses the same buffers as the 5th. No accumulation. No degradation. No crash.
This is the resource management layer of our enterprise AI automation infrastructure. The dispatch engine decides what to compute. The pipeline fusion engine decides how to chain operations. The buffer pool decides where the data lives. Together, they ensure the GPU is used effectively, efficiently, and safely. No allocation waste. No memory leaks. No tab crashes. The browser stays stable because we manage the memory the browser cannot manage for us.