Ayoob AI

Executing SQL WHERE Clauses on the GPU with Dictionary Encoding

16 min read · Ayoob AI
WebGPU · SQL · Dictionary Encoding · Query Engine · Database

The string problem in GPU compute

SQL operates on strings constantly. WHERE country = 'Germany'. WHERE status IN ('active', 'pending', 'review'). WHERE department != 'HR'. GROUP BY region. ORDER BY name. Every enterprise dataset has string columns, and most analytical queries filter or group on them.

WebGPU compute shaders cannot process strings. This is not a missing library or an incomplete implementation. It is a structural limitation of the execution model.

WGSL, the shading language for WebGPU, provides these data types: bool, i32, u32, f16, f32, vec2/3/4, mat2x2/3x3/4x4, fixed-size arrays, and structs of these types. There is no string. No char. No varchar. No dynamically sized array. No heap allocation.

A GPU storage buffer is a flat, fixed-size block of memory. Every element must have a compile-time-known size. A u32 is 4 bytes. Always. A string is 1 to N bytes. The "N" is different for every row. The GPU cannot iterate over variable-length data without knowing where each element starts, and it cannot determine where element K starts without scanning elements 0 through K-1.

This is the same fundamental constraint that makes UTF-8 text search GPU-hostile. Variable-length data and SIMT execution are structurally incompatible.

Approaches that do not work

Before explaining our solution, here is why the obvious alternatives fail.

Fixed-width string buffers

Pad every string to a maximum length (e.g., 64 bytes). Store the column as a flat buffer of 64-byte records.

Problems:

  • A 500,000-row column consumes 32 MB at 64 bytes per row, regardless of actual string lengths. A column of 2-letter country codes wastes 62 bytes per row (97% waste).
  • String comparison requires byte-by-byte matching: up to 64 iterations per comparison, with early-exit branches that cause SIMT divergence. Different strings match (or mismatch) at different byte positions, so the warp serializes.
  • Padding bytes must be handled (null terminator or length prefix). The comparison logic must account for both, adding conditional branches.

Offset-based variable-length storage

Store strings end-to-end in a byte buffer with a separate offset array marking where each string starts (the same layout used for document storage in our text search engine).

Problems:

  • Comparison still requires byte-by-byte matching with variable iteration counts.
  • Memory access is uncoalesced: adjacent threads read from non-adjacent memory locations because strings have different lengths.
  • The offset indirection adds a dependent memory read before the comparison can begin (read offset, compute address, read string bytes).

Hash-based comparison

Hash each string to a 32-bit integer. Compare hashes instead of strings.

Problems:

  • Hash collisions. CRC32, FNV-1a, and MurmurHash3 all produce collisions on real-world string data. Two different strings may hash to the same value. A filter that compares hashes instead of strings will produce false positives.
  • Hash computation itself requires iterating over variable-length strings, which brings back the divergence problem during the hashing phase.
  • Hash comparison cannot support ordering (ORDER BY, range predicates). Hashes do not preserve lexicographic order.

All three approaches fight the hardware. Dictionary encoding works with it.

Dictionary encoding: the solution

Dictionary encoding converts a string column into two structures:

  1. Dictionary: A sorted array of unique string values.
  2. Encoded column: A Uint32Array where each element is the index of that row's string value in the dictionary.
// Raw string column (500,000 rows)
const regions = ["EMEA", "APAC", "NAM", "EMEA", "LATAM", "APAC", ...];

// Dictionary (sorted, unique)
const dictionary = ["APAC", "EMEA", "LATAM", "NAM"];
//                    0       1       2        3

// Encoded column
const encodedRegion = new Uint32Array([1, 0, 3, 1, 2, 0, ...]);
// "EMEA"=1, "APAC"=0, "NAM"=3, "EMEA"=1, "LATAM"=2, "APAC"=0

The dictionary stays on the CPU. Only the Uint32Array goes to the GPU. The GPU never sees a string. It sees integers.

Why this works on GPUs

Every element in the encoded column is exactly 4 bytes (u32). Fixed width. Compile-time-known size. The GPU can compute element K's address trivially: base + K * 4. Adjacent threads read adjacent 4-byte values. Memory access is perfectly coalesced.

A string equality filter (WHERE region = 'EMEA') resolves to an integer equality filter (WHERE encoded_region == 1). One comparison instruction per thread. One clock cycle. No byte-by-byte matching. No variable-length iteration. No divergence. Every thread in the warp executes the same instruction on identically-sized data.

Building the dictionary

The engine builds the dictionary during data ingestion, before any queries execute:

function buildDictionary(column: string[]): { dictionary: string[]; encoded: Uint32Array } {
  // Step 1: Extract unique values
  const uniqueSet = new Set(column);
  const dictionary = Array.from(uniqueSet).sort();  // Sorted for binary search

  // Step 2: Build reverse lookup
  const indexMap = new Map<string, number>();
  for (let i = 0; i < dictionary.length; i++) {
    indexMap.set(dictionary[i], i);
  }

  // Step 3: Encode column
  const encoded = new Uint32Array(column.length);
  for (let i = 0; i < column.length; i++) {
    encoded[i] = indexMap.get(column[i])!;
  }

  return { dictionary, encoded };
}

For a 500,000-row column with 200 unique values, the dictionary build takes 15 to 25 ms (one pass to extract uniques, one sort, one pass to encode). This is a one-time cost at data load, amortized over all subsequent queries.
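A quick round-trip on the example column from earlier, with the ingestion logic condensed inline so the snippet runs standalone:

```typescript
// Condensed ingestion: unique -> sort -> encode (same logic as buildDictionary).
const column = ["EMEA", "APAC", "NAM", "EMEA", "LATAM", "APAC"];
const dictionary = Array.from(new Set(column)).sort();
const indexMap = new Map<string, number>();
dictionary.forEach((s, i) => indexMap.set(s, i));
const encoded = Uint32Array.from(column, (s) => indexMap.get(s)!);

console.log(dictionary);             // ["APAC", "EMEA", "LATAM", "NAM"]
console.log(Array.from(encoded));    // [1, 0, 3, 1, 2, 0]

// Decoding a row back to a string is a plain array lookup:
console.log(dictionary[encoded[0]]); // "EMEA"
```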

Memory savings

| Column | Rows | Unique values | Raw JS strings (avg) | As JS objects (with overhead) | Dictionary encoded |
| --- | --- | --- | --- | --- | --- |
| country | 500,000 | 195 | 5.2 MB | 24 MB | 2 MB + 4 KB dict |
| status | 500,000 | 6 | 3.5 MB | 24 MB | 2 MB + 96 bytes dict |
| department | 500,000 | 45 | 4.8 MB | 24 MB | 2 MB + 1 KB dict |
| product_category | 500,000 | 1,200 | 8.1 MB | 24 MB | 2 MB + 24 KB dict |
| email | 500,000 | 485,000 | 14 MB | 24 MB | 2 MB + 11 MB dict |

For low-cardinality columns (status, country, department), dictionary encoding saves 90% to 99% of memory. For high-cardinality columns (email, where nearly every value is unique), the dictionary is almost as large as the raw data, and the encoded column adds 2 MB. The total (13 MB) is still less than the JS object representation (24 MB) due to eliminated per-object overhead.

The encoded column is always exactly rowCount * 4 bytes. This predictability is valuable for GPU buffer allocation: the engine knows the exact buffer size before uploading.
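The arithmetic behind those sizes is easy to sketch: the encoded column is fixed by row count, while the dictionary grows with cardinality (the dictionary estimate below counts 2 bytes per UTF-16 code unit and ignores per-string engine overhead, which varies):

```typescript
// Encoded column: always 4 bytes per row, independent of string content.
function encodedBytes(rowCount: number): number {
  return rowCount * 4;
}

// Dictionary: roughly the sum of string lengths at 2 bytes per code unit,
// ignoring per-string engine overhead.
function dictionaryBytes(dictionary: string[]): number {
  return dictionary.reduce((sum, s) => sum + s.length * 2, 0);
}

console.log(encodedBytes(500_000)); // 2,000,000 bytes = 2 MB, for every column
```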

Compiling SQL predicates to integer operations

At query time, the engine compiles string predicates into integer predicates by looking up pattern values in the dictionary.

Equality: WHERE column = 'value'

function compileEquality(dictionary: string[], value: string): number | null {
  const idx = dictionary.indexOf(value);  // Linear scan is fine for small dictionaries; binary search (the dictionary is sorted) for large ones
  return idx >= 0 ? idx : null;  // null means value not in dictionary -> 0 results
}

If the value exists in the dictionary, the filter becomes encoded[i] == idx. If it does not exist, the query short-circuits: no row can match. The filter returns an empty result set without dispatching to the GPU at all.

The GPU shader:

@compute @workgroup_size(256)
fn filter_eq(
  @builtin(global_invocation_id) gid: vec3<u32>
) {
  let idx = gid.x;
  if (idx >= row_count) { return; }

  // `match` is a reserved word in WGSL, hence `is_match`.
  let is_match = (encoded_col[idx] == target_value);
  // 32 consecutive threads share each mask word, so the OR must be atomic
  // (result_mask is bound as array<atomic<u32>>).
  atomicOr(&result_mask[idx / 32u], select(0u, 1u << (idx % 32u), is_match));
}

One comparison. One atomic OR into a bitmask. Zero divergence (the select builtin avoids an if branch; the atomic is needed only because 32 threads target the same mask word). Every thread executes the same short instruction sequence regardless of the data.

IN clause: WHERE column IN ('a', 'b', 'c')

Each value in the IN list is looked up in the dictionary. Missing values are discarded (they cannot match any row). The remaining indices form a set.

For small IN lists (under 16 values), the engine inlines the comparisons:

// Compiled from: WHERE status IN ('active', 'pending', 'review')
// Dictionary lookup: 'active'=0, 'pending'=4, 'review'=5

let val = encoded_col[idx];
let is_match = (val == 0u) || (val == 4u) || (val == 5u);

Three comparisons, two ORs. No loop. No hash table. No dynamic dispatch. The warp executes uniformly.

For large IN lists (16+ values), the engine uploads the target indices as a small storage buffer and uses a bitset lookup:

// target_bitset: a u32 array where bit I is set if index I is in the IN list
let val = encoded_col[idx];
let word = target_bitset[val / 32u];
let bit = 1u << (val % 32u);
let is_match = (word & bit) != 0u;

One buffer read, one shift, one AND. Constant time regardless of IN list size. The bitset for a dictionary with 1,000 entries is 1,000 bits, padded to 32 u32 words (128 bytes). Negligible.
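The CPU-side compile step that produces target_bitset can be sketched as follows (buildInBitset is a name assumed for illustration, not the engine's actual API):

```typescript
// Build a bitset over dictionary indices for the IN-list values.
// Values missing from the dictionary are skipped: they can match no row.
function buildInBitset(dictionary: string[], values: string[]): Uint32Array {
  const bitset = new Uint32Array(Math.ceil(dictionary.length / 32));
  for (const value of values) {
    const idx = dictionary.indexOf(value);
    if (idx >= 0) {
      bitset[idx >>> 5] |= 1 << (idx & 31);
    }
  }
  return bitset;
}

// Dictionary from the example: 'active'=0, 'pending'=4, 'review'=5
const statusDict = ["active", "closed", "deleted", "new", "pending", "review"];
const bits = buildInBitset(statusDict, ["active", "pending", "review", "archived"]);
console.log(bits[0].toString(2)); // "110001" -> bits 0, 4, 5 set; "archived" dropped
```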

Inequality: WHERE column != 'value'

Identical to equality, with the comparison inverted. encoded[i] != idx. One instruction. Zero divergence.

LIKE with prefix: WHERE column LIKE 'Ger%'

A prefix match on a sorted dictionary is a binary search for the range of dictionary entries that start with the prefix:

function compilePrefixLike(dictionary: string[], prefix: string): [number, number] | null {
  let lo = lowerBound(dictionary, prefix);
  let hi = upperBound(dictionary, prefix + '\uFFFF');  // '\uFFFF' sorts after any real character, bounding all strings with this prefix
  if (lo >= hi) return null;  // No matches in dictionary
  return [lo, hi];  // Range of matching dictionary indices
}
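lowerBound and upperBound here are ordinary binary searches over the sorted dictionary — a sketch, assuming the dictionary was sorted with JavaScript's default string ordering:

```typescript
// First index whose value is >= key.
function lowerBound(sorted: string[], key: string): number {
  let lo = 0, hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// First index whose value is > key.
function upperBound(sorted: string[], key: string): number {
  let lo = 0, hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] <= key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

const countryDict = ["France", "Georgia", "Germany", "Ghana", "Greece"];
// LIKE 'Ge%' covers Georgia and Germany: index range [1, 3)
console.log(lowerBound(countryDict, "Ge"), upperBound(countryDict, "Ge\uFFFF")); // 1 3
```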

The GPU shader becomes a range check:

let val = encoded_col[idx];
let is_match = (val >= range_lo) && (val < range_hi);

Two comparisons. The dictionary is sorted, so lexicographic prefix matches map to contiguous index ranges. No string processing on the GPU.

NOT LIKE, suffix matches, and infix patterns

LIKE '%germany%' (infix) or LIKE '%land' (suffix) cannot be expressed as a contiguous range in a sorted dictionary. The engine falls back to scanning the dictionary on the CPU and collecting all matching indices into a set, then compiles the filter as an IN clause over those indices.

function compileInfixLike(dictionary: string[], pattern: string): number[] {
  const infix = pattern.replace(/%/g, '');  // Extract the literal (simple %...% patterns only; '_' wildcards are not handled here)
  const matches: number[] = [];
  for (let i = 0; i < dictionary.length; i++) {
    if (dictionary[i].includes(infix)) {
      matches.push(i);
    }
  }
  return matches;
}

The dictionary scan is CPU-side and touches only the unique values (typically 10 to 10,000 entries). For a 1,000-entry dictionary, the scan takes under 0.1 ms. The resulting index set compiles to a bitset lookup on the GPU. The per-row GPU filter is still a constant-time bitset check, not a string search.

Compound predicates

Real queries combine multiple predicates with AND, OR, and NOT.

WHERE region = 'EMEA'
  AND status IN ('active', 'pending')
  AND department != 'HR'

Each predicate compiles independently. The compiled GPU shader combines them:

let region_match = (region_col[idx] == 1u);       // EMEA
let status_match = (status_col[idx] == 0u) || (status_col[idx] == 4u);  // active, pending
let dept_match = (dept_col[idx] != 12u);           // not HR

let is_match = region_match && status_match && dept_match;

WGSL defines && and || as short-circuiting, but for side-effect-free operands the compiler is free to evaluate all three predicates unconditionally, and on a GPU that is actually desirable: evaluating every predicate uniformly avoids divergence. A thread that could stop after region_match = false saves nothing, because the other 31 threads in its warp may still need all three predicates.

The entire compound filter executes in 3 to 5 clock cycles per thread. For 500,000 rows on a discrete GPU: 0.4 ms total.

GROUP BY on dictionary-encoded columns

Dictionary encoding benefits grouping as much as filtering.

A GROUP BY region on a raw string column requires hashing each string, looking up the hash in a hash table, and accumulating values per group. Hashing variable-length strings is sequential and divergent. Hash table probing is random-access and cache-hostile.

With dictionary encoding, the group key is already an integer. The integer IS the group index. No hashing. No table probing. The accumulator array is indexed directly by the dictionary index:

// GROUP BY region, SUM(revenue)
// WGSL's atomicAdd only supports integers, so a float sum is emulated
// with a compare-exchange loop on the bit pattern of the accumulator.
let group_idx = region_col[idx];
var old = atomicLoad(&group_sums[group_idx]);
loop {
  let new_bits = bitcast<u32>(bitcast<f32>(old) + revenue_col[idx]);
  let r = atomicCompareExchangeWeak(&group_sums[group_idx], old, new_bits);
  if (r.exchanged) { break; }
  old = r.old_value;
}

One buffer read for the group key, one for the value, one compare-exchange loop on the accumulator (usually a single iteration under low contention). The accumulator array has one entry per unique value (e.g., 195 entries for a country column), so contention is spread across 195 addresses. For 500,000 rows, that is approximately 2,564 atomic writes per address. Manageable, and further reducible with a workgroup-local reduction in shared memory followed by a single global flush per workgroup.
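For comparison, the CPU fallback for the same grouped sum is a direct indexed accumulation over the encoded column — a sketch, not the engine's actual worker code:

```typescript
// GROUP BY <encoded key>, SUM(<value>) on the CPU: the dictionary index
// addresses the accumulator directly, so no hash table is needed.
function groupBySum(
  keys: Uint32Array,      // dictionary-encoded group column
  values: Float64Array,   // measure column, same length
  dictSize: number,       // number of unique keys
): Float64Array {
  const sums = new Float64Array(dictSize);
  for (let i = 0; i < keys.length; i++) {
    sums[keys[i]] += values[i];
  }
  return sums;
}

// Regions encoded against ["APAC", "EMEA", "LATAM", "NAM"]
const groupKeys = new Uint32Array([1, 0, 3, 1, 2, 0]);
const revenue = new Float64Array([10, 20, 30, 40, 50, 60]);
console.log(Array.from(groupBySum(groupKeys, revenue, 4))); // [80, 50, 50, 30]
```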

If the Chao1 estimator predicts high group cardinality (composite GROUP BY with many distinct combinations), the 6-factor dispatch scoring function penalizes GPU dispatch via the operator-specific SQL metric (F2) and routes to CPU Web Workers, where each worker builds a thread-local hash map.

ORDER BY on dictionary-encoded columns

Sorting by a dictionary-encoded string column does not require string comparison. The dictionary is sorted lexicographically at build time. Dictionary index 0 is the alphabetically first string. Index 1 is the second. The integer indices are already in lexicographic order.

ORDER BY region ASC is simply a sort on the integer indices. Our radix-256 sort handles this natively: the indices are unsigned integers, no IEEE 754 bit-transform needed (they are already non-negative). A 500,000-element Uint32Array sort on a discrete GPU: 3.1 ms.

For ORDER BY region DESC, the radix sort produces ascending order; a single post-sort pass reverses the permutation index. Or the engine subtracts each index from (dictionarySize - 1) before sorting, converting descending to ascending. One subtraction per element, fully parallel.

Multi-column dictionary encoding

Enterprise datasets have many string columns. Our engine encodes each one independently:

interface ColumnarTable {
  // Numeric columns: stored as typed arrays directly
  revenue: Float64Array;
  quantity: Int32Array;
  timestamp: Float64Array;

  // String columns: dictionary-encoded
  region: { dictionary: string[]; encoded: Uint32Array };
  status: { dictionary: string[]; encoded: Uint32Array };
  department: { dictionary: string[]; encoded: Uint32Array };
  product: { dictionary: string[]; encoded: Uint32Array };
  country: { dictionary: string[]; encoded: Uint32Array };
}

Each column has its own dictionary. A compound query touching multiple string columns resolves each predicate against its column's dictionary independently. The GPU receives multiple Uint32Array buffers (one per string column in the query) and evaluates all predicates as integer operations.

The total GPU memory for a 500,000-row table with 5 string columns: 5 x 2 MB = 10 MB for the encoded columns. The dictionaries (totalling a few hundred KB) stay on the CPU. The numeric columns add their own typed arrays. A complete 12-column table (5 string, 4 float, 3 integer) occupies approximately 30 MB on the GPU. Well within the 1 to 4 GB buffer limit of modern GPUs.

Query compilation pipeline

The full path from SQL-like query to GPU dispatch:

Step 1: Parse. The query object (or SQL string, if using a parser) is decomposed into a tree of operators: Filter, GroupBy, Aggregate, Sort, Limit.

Step 2: Dictionary resolve. String literals in predicates are looked up in the corresponding column's dictionary. Missing values short-circuit to empty results. Resolved indices are embedded as constants in the shader source.

Step 3: Shader generation. The engine generates a WGSL compute shader specific to this query. Predicate comparisons, column buffer bindings, and output bitmask writes are baked into the shader at compile time. No runtime interpretation. No generic "evaluate predicate" branching.
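Step 3 amounts to string templating: resolved dictionary indices from step 2 are baked into the WGSL source as literals. A simplified illustration (generateFilterExpr and its inputs are hypothetical, not the engine's actual generator):

```typescript
// Emit a WGSL boolean expression for a conjunction of equality predicates.
// Each predicate carries a column buffer name and a resolved dictionary index.
function generateFilterExpr(predicates: { col: string; idx: number }[]): string {
  return predicates
    .map((p) => `(${p.col}[row] == ${p.idx}u)`)
    .join(" && ");
}

const expr = generateFilterExpr([
  { col: "region_col", idx: 1 },  // region = 'EMEA'
  { col: "dept_col", idx: 12 },   // hypothetical second predicate
]);
console.log(expr); // (region_col[row] == 1u) && (dept_col[row] == 12u)
```

Because the constants are in the source text, identical queries produce identical WGSL, which is what makes the source-hash pipeline cache in step 4 effective.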

Step 4: Pipeline creation. The generated WGSL is compiled into a GPUComputePipeline. If an identical query has been compiled before, the cached pipeline is reused (the cache keys on the WGSL source hash).

Step 5: Dispatch scoring. The 6-factor dispatch scoring function evaluates the operator. F1 row count (500,000) exceeds the discrete GPU threshold of 50,000 rows. F2 (predicate selectivity) is computed from column statistics. F3 GPU class and F4 vendor tuning are applied. F5 GPU buffer retention bonus applies if a preceding operator is already GPU-resident. F6 hardware-adaptive buffer threshold confirms the buffer fits. The combined score is positive. GPU dispatch.

Step 6: Execute and readback. The shader dispatches. The output bitmask (62.5 KB for 500,000 rows) is read back to the CPU. The CPU applies the bitmask to project the requested columns (including resolving dictionary indices back to human-readable strings for display).

Total time for a compound WHERE on 500,000 rows: 0.4 ms GPU compute + 0.1 ms bitmask readback = 0.5 ms. The equivalent JavaScript Array.prototype.filter() with string comparisons: 18 ms.

36x faster. And the GPU did not process a single string.

The display-time decode

After the GPU filter produces a bitmask of matching rows, the application needs to display the results as human-readable strings, not integer indices.

The dictionary decode is a trivial CPU-side lookup:

function decodeColumn(dictionary: string[], encoded: Uint32Array, mask: Uint32Array): string[] {
  const results: string[] = [];
  for (let i = 0; i < encoded.length; i++) {
    const word = mask[Math.floor(i / 32)];
    const bit = 1 << (i % 32);
    if (word & bit) {
      results.push(dictionary[encoded[i]]);
    }
  }
  return results;
}

For a filter that produces 5,000 results from 500,000 rows, the decode iterates 500,000 bitmask checks (fast: bitwise operations on contiguous Uint32Array) and performs 5,000 dictionary lookups (array index access, no string comparison). Total: under 2 ms.

The decode is always a CPU operation. The GPU handles the expensive part (evaluating predicates on every row). The CPU handles the cheap part (resolving a small result set to display strings). The division of work matches each tier's strengths.

Where this fits in the adaptive stack

Dictionary encoding is the storage layer that enables every string operation in our Adaptive WebGPU Data Query Engine. Without it, string columns would be GPU-inaccessible, and every query touching a string column would route to the CPU regardless of dataset size.

With it, string predicates are integer predicates. Integer predicates are GPU-native. The 6-factor dispatch scoring function evaluates them like any other numeric filter. The pipeline executor chains them with numeric operators without tier transitions, using the GPUResidentDataset class and the GPU buffer retention bonus (F5) to keep intermediate results in GPU buffers across pipeline stages. The Float32 ordering-preservation safety check does not intervene (integer comparisons are exact in u32).

The encoding transforms the problem from "how do we make GPUs handle strings" to "how do we make strings not be strings." The answer: represent them as what the GPU already processes perfectly. Fixed-width integers.

This is the storage design behind our enterprise AI automation infrastructure for analytics, CRM, and compliance dashboards. Every string column your users filter on, group by, or sort by is a Uint32Array on the GPU. Every predicate is an integer comparison. Every query compiles to a shader that contains zero string operations. The strings exist only in the dictionary on the CPU and on the screen in front of your user. Everywhere in between, they are integers.

Want to discuss how this applies to your business?

Book a Discovery Call