Ayoob AI

Executing SQL WHERE Clauses on the GPU with Dictionary Encoding

16 min read · Ayoob AI
WebGPU · SQL · Dictionary Encoding · Query Engine · Database

The string problem in GPU compute

SQL operates on strings constantly. WHERE country = 'Germany'. WHERE status IN ('active', 'pending', 'review'). WHERE department != 'HR'. GROUP BY region. ORDER BY name. Every enterprise dataset has string columns, and most analytical queries filter or group on them.

WebGPU compute shaders cannot process strings. This is not a missing library or an incomplete implementation. It is a structural limitation of the execution model.

WGSL, the shading language for WebGPU, provides these data types: bool, i32, u32, f16, f32, vec2/3/4, mat2x2/3x3/4x4, fixed-size arrays, and structs of these types. There is no string. No char. No varchar. No dynamically sized array. No heap allocation.

A GPU storage buffer is a flat, fixed-size block of memory. Every element must have a compile-time-known size. A u32 is 4 bytes. Always. A string is 1 to N bytes. The "N" is different for every row. The GPU cannot iterate over variable-length data without knowing where each element starts, and it cannot determine where element K starts without scanning elements 0 through K-1.

This is the same fundamental constraint that makes UTF-8 text search GPU-hostile. Variable-length data and SIMT execution are structurally incompatible.

Approaches that do not work

Before explaining our solution, here is why the obvious alternatives fail.

Fixed-width string buffers

Pad every string to a maximum length (e.g., 64 bytes). Store the column as a flat buffer of 64-byte records.

Problems:

  • A 500,000-row column consumes 32 MB at 64 bytes per row, regardless of actual string lengths. A column of 2-letter country codes wastes 62 bytes per row (97% waste).
  • String comparison requires byte-by-byte matching: up to 64 iterations per comparison, with early-exit branches that cause SIMT divergence. Different strings match (or mismatch) at different byte positions, so the warp serializes.
  • Padding bytes must be handled (null terminator or length prefix). The comparison logic must account for both, adding conditional branches.

Offset-based variable-length storage

Store strings end-to-end in a byte buffer with a separate offset array marking where each string starts (the same layout used for document storage in our text search engine).

Problems:

  • Comparison still requires byte-by-byte matching with variable iteration counts.
  • Memory access is uncoalesced: adjacent threads read from non-adjacent memory locations because strings have different lengths.
  • The offset indirection adds a dependent memory read before the comparison can begin (read offset, compute address, read string bytes).

Hash-based comparison

Hash each string to a 32-bit integer. Compare hashes instead of strings.

Problems:

  • Hash collisions. CRC32, FNV-1a, and MurmurHash3 all produce collisions on real-world string data. Two different strings may hash to the same value. A filter that compares hashes instead of strings will produce false positives.
  • Hash computation itself requires iterating over variable-length strings, which brings back the divergence problem during the hashing phase.
  • Hash comparison cannot support ordering (ORDER BY, range predicates). Hashes do not preserve lexicographic order.

All three approaches fight the hardware. Dictionary encoding works with it.

Dictionary encoding: the solution

Dictionary encoding converts a string column into two structures:

  1. Dictionary: A sorted array of unique string values.
  2. Encoded column: A Uint32Array where each element is the index of that row's string value in the dictionary.
// Raw string column (500,000 rows)
const regions = ["EMEA", "APAC", "NAM", "EMEA", "LATAM", "APAC", ...];

// Dictionary (sorted, unique)
const dictionary = ["APAC", "EMEA", "LATAM", "NAM"];
//                    0       1       2        3

// Encoded column
const encodedRegion = new Uint32Array([1, 0, 3, 1, 2, 0, ...]);
// "EMEA"=1, "APAC"=0, "NAM"=3, "EMEA"=1, "LATAM"=2, "APAC"=0

The dictionary stays on the CPU. Only the Uint32Array goes to the GPU. The GPU never sees a string. It sees integers.

Why this works on GPUs

Every element in the encoded column is exactly 4 bytes (u32). Fixed width. Compile-time-known size. The GPU can compute element K's address trivially: base + K * 4. Adjacent threads read adjacent 4-byte values. Memory access is perfectly coalesced.

A string equality filter (WHERE region = 'EMEA') resolves to an integer equality filter (WHERE encoded_region == 1). One comparison instruction per thread. One clock cycle. No byte-by-byte matching. No variable-length iteration. No divergence. Every thread in the warp executes the same instruction on identically-sized data.

Building the dictionary

The engine builds the dictionary during data ingestion, before any queries execute:

function buildDictionary(column: string[]): { dictionary: string[]; encoded: Uint32Array } {
  // Step 1: Extract unique values
  const uniqueSet = new Set(column);
  const dictionary = Array.from(uniqueSet).sort();  // Sorted for binary search

  // Step 2: Build reverse lookup
  const indexMap = new Map<string, number>();
  for (let i = 0; i < dictionary.length; i++) {
    indexMap.set(dictionary[i], i);
  }

  // Step 3: Encode column
  const encoded = new Uint32Array(column.length);
  for (let i = 0; i < column.length; i++) {
    encoded[i] = indexMap.get(column[i])!;
  }

  return { dictionary, encoded };
}

For a 500,000-row column with 200 unique values, the dictionary build takes 15 to 25 ms (one pass to extract uniques, one sort, one pass to encode). This is a one-time cost at data load, amortized over all subsequent queries.
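A quick round-trip on the example column from earlier, with the ingestion logic condensed inline so the snippet runs standalone:

```typescript
// Condensed ingestion: unique -> sort -> encode (same logic as buildDictionary).
const column = ["EMEA", "APAC", "NAM", "EMEA", "LATAM", "APAC"];
const dictionary = Array.from(new Set(column)).sort();
const indexMap = new Map<string, number>();
dictionary.forEach((s, i) => indexMap.set(s, i));
const encoded = Uint32Array.from(column, (s) => indexMap.get(s)!);

console.log(dictionary);             // ["APAC", "EMEA", "LATAM", "NAM"]
console.log(Array.from(encoded));    // [1, 0, 3, 1, 2, 0]

// Decoding a row back to a string is a plain array lookup:
console.log(dictionary[encoded[0]]); // "EMEA"
```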

Memory savings

| Column | Rows | Unique values | Raw JS strings (avg) | As JS objects (with overhead) | Dictionary encoded |
| --- | --- | --- | --- | --- | --- |
| country | 500,000 | 195 | 5.2 MB | 24 MB | 2 MB + 4 KB dict |
| status | 500,000 | 6 | 3.5 MB | 24 MB | 2 MB + 96 bytes dict |
| department | 500,000 | 45 | 4.8 MB | 24 MB | 2 MB + 1 KB dict |
| product_category | 500,000 | 1,200 | 8.1 MB | 24 MB | 2 MB + 24 KB dict |
| email | 500,000 | 485,000 | 14 MB | 24 MB | 2 MB + 11 MB dict |

For low-cardinality columns (status, country, department), dictionary encoding saves 90% to 99% of memory. For high-cardinality columns (email, where nearly every value is unique), the dictionary is almost as large as the raw data, and the encoded column adds 2 MB. The total (13 MB) is still less than the JS object representation (24 MB) due to eliminated per-object overhead.

The encoded column is always exactly rowCount * 4 bytes. This predictability is valuable for GPU buffer allocation: the engine knows the exact buffer size before uploading.
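The arithmetic behind those sizes is easy to sketch: the encoded column is fixed by row count, while the dictionary grows with cardinality (the dictionary estimate below counts 2 bytes per UTF-16 code unit and ignores per-string engine overhead, which varies):

```typescript
// Encoded column: always 4 bytes per row, independent of string content.
function encodedBytes(rowCount: number): number {
  return rowCount * 4;
}

// Dictionary: roughly the sum of string lengths at 2 bytes per code unit,
// ignoring per-string engine overhead.
function dictionaryBytes(dictionary: string[]): number {
  return dictionary.reduce((sum, s) => sum + s.length * 2, 0);
}

console.log(encodedBytes(500_000)); // 2,000,000 bytes = 2 MB, for every column
```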

Compiling SQL predicates to integer operations

At query time, the engine compiles string predicates into integer predicates by looking up pattern values in the dictionary.

Equality: WHERE column = 'value'

function compileEquality(dictionary: string[], value: string): number | null {
  const idx = dictionary.indexOf(value);  // Linear scan is fine for small dictionaries; binary search (the dictionary is sorted) for large ones
  return idx >= 0 ? idx : null;  // null means value not in dictionary -> 0 results
}

If the value exists in the dictionary, the filter becomes encoded[i] == idx. If it does not exist, the query short-circuits: no row can match. The filter returns an empty result set without dispatching to the GPU at all.

The GPU shader:

@compute @workgroup_size(256)
fn filter_eq(
  @builtin(global_invocation_id) gid: vec3<u32>
) {
  let idx = gid.x;
  if (idx >= row_count) { return; }

  // `match` is a reserved word in WGSL, hence `is_match`.
  let is_match = (encoded_col[idx] == target_value);
  // 32 consecutive threads share each mask word, so the OR must be atomic
  // (result_mask is bound as array<atomic<u32>>).
  atomicOr(&result_mask[idx / 32u], select(0u, 1u << (idx % 32u), is_match));
}

One comparison. One atomic OR into a bitmask. Zero divergence (the select builtin avoids an if branch; the atomic is needed only because 32 threads target the same mask word). Every thread executes the same short instruction sequence regardless of the data.

IN clause: WHERE column IN ('a', 'b', 'c')

Each value in the IN list is looked up in the dictionary. Missing values are discarded (they cannot match any row). The remaining indices form a set.

For small IN lists (under 16 values), the engine inlines the comparisons:

// Compiled from: WHERE status IN ('active', 'pending', 'review')
// Dictionary lookup: 'active'=0, 'pending'=4, 'review'=5

let val = encoded_col[idx];
let is_match = (val == 0u) || (val == 4u) || (val == 5u);

Three comparisons, two ORs. No loop. No hash table. No dynamic dispatch. The warp executes uniformly.

For large IN lists (16+ values), the engine uploads the target indices as a small storage buffer and uses a bitset lookup:

// target_bitset: a u32 array where bit I is set if index I is in the IN list
let val = encoded_col[idx];
let word = target_bitset[val / 32u];
let bit = 1u << (val % 32u);
let is_match = (word & bit) != 0u;

One buffer read, one shift, one AND. Constant time regardless of IN list size. The bitset for a dictionary with 1,000 entries is 1,000 bits, padded to 32 u32 words (128 bytes). Negligible.
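The CPU-side compile step that produces target_bitset can be sketched as follows (buildInBitset is a name assumed for illustration, not the engine's actual API):

```typescript
// Build a bitset over dictionary indices for the IN-list values.
// Values missing from the dictionary are skipped: they can match no row.
function buildInBitset(dictionary: string[], values: string[]): Uint32Array {
  const bitset = new Uint32Array(Math.ceil(dictionary.length / 32));
  for (const value of values) {
    const idx = dictionary.indexOf(value);
    if (idx >= 0) {
      bitset[idx >>> 5] |= 1 << (idx & 31);
    }
  }
  return bitset;
}

// Dictionary from the example: 'active'=0, 'pending'=4, 'review'=5
const statusDict = ["active", "closed", "deleted", "new", "pending", "review"];
const bits = buildInBitset(statusDict, ["active", "pending", "review", "archived"]);
console.log(bits[0].toString(2)); // "110001" -> bits 0, 4, 5 set; "archived" dropped
```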

Inequality: WHERE column != 'value'

Identical to equality, with the comparison inverted. encoded[i] != idx. One instruction. Zero divergence.

LIKE with prefix: WHERE column LIKE 'Ger%'

A prefix match on a sorted dictionary is a binary search for the range of dictionary entries that start with the prefix:

function compilePrefixLike(dictionary: string[], prefix: string): [number, number] | null {
  let lo = lowerBound(dictionary, prefix);
  let hi = upperBound(dictionary, prefix + '\uFFFF');  // '\uFFFF' sorts after any real character, bounding all strings with this prefix
  if (lo >= hi) return null;  // No matches in dictionary
  return [lo, hi];  // Range of matching dictionary indices
}
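lowerBound and upperBound here are ordinary binary searches over the sorted dictionary — a sketch, assuming the dictionary was sorted with JavaScript's default string ordering:

```typescript
// First index whose value is >= key.
function lowerBound(sorted: string[], key: string): number {
  let lo = 0, hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// First index whose value is > key.
function upperBound(sorted: string[], key: string): number {
  let lo = 0, hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] <= key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

const countryDict = ["France", "Georgia", "Germany", "Ghana", "Greece"];
// LIKE 'Ge%' covers Georgia and Germany: index range [1, 3)
console.log(lowerBound(countryDict, "Ge"), upperBound(countryDict, "Ge\uFFFF")); // 1 3
```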

The GPU shader becomes a range check:

let val = encoded_col[idx];
let is_match = (val >= range_lo) && (val < range_hi);

Two comparisons. The dictionary is sorted, so lexicographic prefix matches map to contiguous index ranges. No string processing on the GPU.

NOT LIKE, suffix matches, and infix patterns

LIKE '%germany%' (infix) or LIKE '%land' (suffix) cannot be expressed as a contiguous range in a sorted dictionary. The engine falls back to scanning the dictionary on the CPU and collecting all matching indices into a set, then compiles the filter as an IN clause over those indices.

function compileInfixLike(dictionary: string[], pattern: string): number[] {
  const infix = pattern.replace(/%/g, '');  // Extract the literal (simple %...% patterns only; '_' wildcards are not handled here)
  const matches: number[] = [];
  for (let i = 0; i < dictionary.length; i++) {
    if (dictionary[i].includes(infix)) {
      matches.push(i);
    }
  }
  return matches;
}

The dictionary scan is CPU-side and touches only the unique values (typically 10 to 10,000 entries). For a 1,000-entry dictionary, the scan takes under 0.1 ms. The resulting index set compiles to a bitset lookup on the GPU. The per-row GPU filter is still a constant-time bitset check, not a string search.

Compound predicates

Real queries combine multiple predicates with AND, OR, and NOT.

WHERE region = 'EMEA'
  AND status IN ('active', 'pending')
  AND department != 'HR'

Each predicate compiles independently. The compiled GPU shader combines them:

let region_match = (region_col[idx] == 1u);       // EMEA
let status_match = (status_col[idx] == 0u) || (status_col[idx] == 4u);  // active, pending
let dept_match = (dept_col[idx] != 12u);           // not HR

let is_match = region_match && status_match && dept_match;

WGSL defines && and || as short-circuiting, but for side-effect-free operands the compiler is free to evaluate all three predicates unconditionally, and on a GPU that is actually desirable: evaluating every predicate uniformly avoids divergence. A thread that could stop after region_match = false saves nothing, because the other 31 threads in its warp may still need all three predicates.

The entire compound filter executes in 3 to 5 clock cycles per thread. For 500,000 rows on a discrete GPU: 0.4 ms total.

GROUP BY on dictionary-encoded columns

Dictionary encoding benefits grouping as much as filtering.

A GROUP BY region on a raw string column requires hashing each string, looking up the hash in a hash table, and accumulating values per group. Hashing variable-length strings is sequential and divergent. Hash table probing is random-access and cache-hostile.

With dictionary encoding, the group key is already an integer. The integer IS the group index. No hashing. No table probing. The accumulator array is indexed directly by the dictionary index:

// GROUP BY region, SUM(revenue)
// WGSL's atomicAdd only supports integers, so a float sum is emulated
// with a compare-exchange loop on the bit pattern of the accumulator.
let group_idx = region_col[idx];
var old = atomicLoad(&group_sums[group_idx]);
loop {
  let new_bits = bitcast<u32>(bitcast<f32>(old) + revenue_col[idx]);
  let r = atomicCompareExchangeWeak(&group_sums[group_idx], old, new_bits);
  if (r.exchanged) { break; }
  old = r.old_value;
}

One buffer read for the group key, one for the value, one compare-exchange loop on the accumulator (usually a single iteration under low contention). The accumulator array has one entry per unique value (e.g., 195 entries for a country column), so contention is spread across 195 addresses. For 500,000 rows, that is approximately 2,564 atomic writes per address. Manageable, and further reducible with a workgroup-local reduction in shared memory followed by a single global flush per workgroup.
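For comparison, the CPU fallback for the same grouped sum is a direct indexed accumulation over the encoded column — a sketch, not the engine's actual worker code:

```typescript
// GROUP BY <encoded key>, SUM(<value>) on the CPU: the dictionary index
// addresses the accumulator directly, so no hash table is needed.
function groupBySum(
  keys: Uint32Array,      // dictionary-encoded group column
  values: Float64Array,   // measure column, same length
  dictSize: number,       // number of unique keys
): Float64Array {
  const sums = new Float64Array(dictSize);
  for (let i = 0; i < keys.length; i++) {
    sums[keys[i]] += values[i];
  }
  return sums;
}

// Regions encoded against ["APAC", "EMEA", "LATAM", "NAM"]
const groupKeys = new Uint32Array([1, 0, 3, 1, 2, 0]);
const revenue = new Float64Array([10, 20, 30, 40, 50, 60]);
console.log(Array.from(groupBySum(groupKeys, revenue, 4))); // [80, 50, 50, 30]
```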

If the Chao1 estimator predicts high group cardinality (composite GROUP BY with many distinct combinations), the 6-factor dispatch scoring function penalizes GPU dispatch via the operator-specific SQL metric (F2) and routes to CPU Web Workers, where each worker builds a thread-local hash map.

ORDER BY on dictionary-encoded columns

Sorting by a dictionary-encoded string column does not require string comparison. The dictionary is sorted lexicographically at build time. Dictionary index 0 is the alphabetically first string. Index 1 is the second. The integer indices are already in lexicographic order.

ORDER BY region ASC is simply a sort on the integer indices. Our radix-256 sort handles this natively: the indices are unsigned integers, no IEEE 754 bit-transform needed (they are already non-negative). A 500,000-element Uint32Array sort on a discrete GPU: 3.1 ms.

For ORDER BY region DESC, the radix sort produces ascending order; a single post-sort pass reverses the permutation index. Or the engine subtracts each index from (dictionarySize - 1) before sorting, converting descending to ascending. One subtraction per element, fully parallel.

Multi-column dictionary encoding

Enterprise datasets have many string columns. Our engine encodes each one independently:

interface ColumnarTable {
  // Numeric columns: stored as typed arrays directly
  revenue: Float64Array;
  quantity: Int32Array;
  timestamp: Float64Array;

  // String columns: dictionary-encoded
  region: { dictionary: string[]; encoded: Uint32Array };
  status: { dictionary: string[]; encoded: Uint32Array };
  department: { dictionary: string[]; encoded: Uint32Array };
  product: { dictionary: string[]; encoded: Uint32Array };
  country: { dictionary: string[]; encoded: Uint32Array };
}

Each column has its own dictionary. A compound query touching multiple string columns resolves each predicate against its column's dictionary independently. The GPU receives multiple Uint32Array buffers (one per string column in the query) and evaluates all predicates as integer operations.

The total GPU memory for a 500,000-row table with 5 string columns: 5 x 2 MB = 10 MB for the encoded columns. The dictionaries (totalling a few hundred KB) stay on the CPU. The numeric columns add their own typed arrays. A complete 12-column table (5 string, 4 float, 3 integer) occupies approximately 30 MB on the GPU. Well within the 1 to 4 GB buffer limit of modern GPUs.

Query compilation pipeline

The full path from SQL-like query to GPU dispatch:

Step 1: Parse. The query object (or SQL string, if using a parser) is decomposed into a tree of operators: Filter, GroupBy, Aggregate, Sort, Limit.

Step 2: Dictionary resolve. String literals in predicates are looked up in the corresponding column's dictionary. Missing values short-circuit to empty results. Resolved indices are embedded as constants in the shader source.

Step 3: Shader generation. The engine generates a WGSL compute shader specific to this query. Predicate comparisons, column buffer bindings, and output bitmask writes are baked into the shader at compile time. No runtime interpretation. No generic "evaluate predicate" branching.
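Step 3 amounts to string templating: resolved dictionary indices from step 2 are baked into the WGSL source as literals. A simplified illustration (generateFilterExpr and its inputs are hypothetical, not the engine's actual generator):

```typescript
// Emit a WGSL boolean expression for a conjunction of equality predicates.
// Each predicate carries a column buffer name and a resolved dictionary index.
function generateFilterExpr(predicates: { col: string; idx: number }[]): string {
  return predicates
    .map((p) => `(${p.col}[row] == ${p.idx}u)`)
    .join(" && ");
}

const expr = generateFilterExpr([
  { col: "region_col", idx: 1 },  // region = 'EMEA'
  { col: "dept_col", idx: 12 },   // hypothetical second predicate
]);
console.log(expr); // (region_col[row] == 1u) && (dept_col[row] == 12u)
```

Because the constants are in the source text, identical queries produce identical WGSL, which is what makes the source-hash pipeline cache in step 4 effective.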

Step 4: Pipeline creation. The generated WGSL is compiled into a GPUComputePipeline. If an identical query has been compiled before, the cached pipeline is reused (the cache keys on the WGSL source hash).

Step 5: Dispatch scoring. The 6-factor dispatch scoring function evaluates the operator. F1 row count (500,000) exceeds the discrete GPU threshold of 50,000 rows. F2 (predicate selectivity) is computed from column statistics. F3 GPU class and F4 vendor tuning are applied. F5 GPU buffer retention bonus applies if a preceding operator is already GPU-resident. F6 hardware-adaptive buffer threshold confirms the buffer fits. The combined score is positive. GPU dispatch.

Step 6: Execute and readback. The shader dispatches. The output bitmask (62.5 KB for 500,000 rows) is read back to the CPU. The CPU applies the bitmask to project the requested columns (including resolving dictionary indices back to human-readable strings for display).

Total time for a compound WHERE on 500,000 rows: 0.4 ms GPU compute + 0.1 ms bitmask readback = 0.5 ms. The equivalent JavaScript Array.prototype.filter() with string comparisons: 18 ms.

36x faster. And the GPU did not process a single string.

The display-time decode

After the GPU filter produces a bitmask of matching rows, the application needs to display the results as human-readable strings, not integer indices.

The dictionary decode is a trivial CPU-side lookup:

function decodeColumn(dictionary: string[], encoded: Uint32Array, mask: Uint32Array): string[] {
  const results: string[] = [];
  for (let i = 0; i < encoded.length; i++) {
    const word = mask[Math.floor(i / 32)];
    const bit = 1 << (i % 32);
    if (word & bit) {
      results.push(dictionary[encoded[i]]);
    }
  }
  return results;
}

For a filter that produces 5,000 results from 500,000 rows, the decode iterates 500,000 bitmask checks (fast: bitwise operations on contiguous Uint32Array) and performs 5,000 dictionary lookups (array index access, no string comparison). Total: under 2 ms.

The decode is always a CPU operation. The GPU handles the expensive part (evaluating predicates on every row). The CPU handles the cheap part (resolving a small result set to display strings). The division of work matches each tier's strengths.

Where this fits in the adaptive stack

Dictionary encoding is the storage layer that enables every string operation in our Adaptive WebGPU Data Query Engine. Without it, string columns would be GPU-inaccessible, and every query touching a string column would route to the CPU regardless of dataset size.

With it, string predicates are integer predicates. Integer predicates are GPU-native. The 6-factor dispatch scoring function evaluates them like any other numeric filter. The pipeline executor chains them with numeric operators without tier transitions, using the GPUResidentDataset class and the GPU buffer retention bonus (F5) to keep intermediate results in GPU buffers across pipeline stages. The Float32 ordering-preservation safety check does not intervene (integer comparisons are exact in u32).

The encoding transforms the problem from "how do we make GPUs handle strings" to "how do we make strings not be strings." The answer: represent them as what the GPU already processes perfectly. Fixed-width integers.

This is the storage design behind our enterprise AI automation infrastructure for analytics, CRM, and compliance dashboards. Every string column your users filter on, group by, or sort by is a Uint32Array on the GPU. Every predicate is an integer comparison. Every query compiles to a shader that contains zero string operations. The strings exist only in the dictionary on the CPU and on the screen in front of your user. Everywhere in between, they are integers.

Want to discuss how this applies to your business?

Book a Discovery Call