Ayoob AI

The Variable-Width Problem: Why UTF-8 Breaks WebGPU Text Search


The assumption that breaks GPU text processing

GPU compute shaders achieve throughput by assigning one thread to one data element. For numerical arrays, this is natural: element N is at byte offset N * sizeof(element). Thread N reads from a known address. Every thread reads the same number of bytes. Memory accesses are coalesced. The warp executes in lockstep.

Text data violates this assumption.

In a Float32Array, element 1,000 is at byte offset 4,000. Always. In a UTF-8 encoded string, character 1,000 could start at byte offset 1,000 (if every preceding character is ASCII), at byte offset 3,000 (if every preceding character is a 3-byte CJK character), or anywhere in between, up to 4,000 if every preceding character is a 4-byte emoji. You cannot compute the byte offset of character N without scanning every byte that precedes it.

This is not an implementation detail. It is a mathematical property of variable-width encoding. No amount of GPU optimization can bypass it.

How UTF-8 encoding works

UTF-8 encodes Unicode code points in 1 to 4 bytes:

Code point range      Byte count   Byte pattern                          Example
U+0000 to U+007F      1 byte       0xxxxxxx                              'A' = 0x41
U+0080 to U+07FF      2 bytes      110xxxxx 10xxxxxx                     'é' = 0xC3 0xA9
U+0800 to U+FFFF      3 bytes      1110xxxx 10xxxxxx 10xxxxxx            '中' = 0xE4 0xB8 0xAD
U+10000 to U+10FFFF   4 bytes      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   '😀' = 0xF0 0x9F 0x98 0x80

The leading byte determines the sequence length: 0xxxxxxx = 1 byte, 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes. Bytes starting with 10xxxxxx are continuation bytes that belong to a preceding leading byte.
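Those leading-byte rules are exactly what makes character indexing sequential. A minimal sketch (assuming well-formed UTF-8; charByteOffset is an illustrative helper, not part of our engine): the only way to locate character N is to classify every leading byte before it.

```typescript
// Byte offset of character n in a UTF-8 buffer. There is no closed-form
// answer: every preceding character's width must be read off its leading byte.
function charByteOffset(bytes: Uint8Array, n: number): number {
  let offset = 0;
  for (let seen = 0; seen < n; seen++) {
    const b = bytes[offset];
    if (b < 0x80) offset += 1;      // 0xxxxxxx: 1-byte (ASCII)
    else if (b < 0xE0) offset += 2; // 110xxxxx: 2-byte sequence
    else if (b < 0xF0) offset += 3; // 1110xxxx: 3-byte sequence
    else offset += 4;               // 11110xxx: 4-byte sequence
  }
  return offset;
}

const bytes = new TextEncoder().encode('Héllo 世界');
charByteOffset(bytes, 6); // 7: '世' starts at byte 7, not byte 6
```

The loop is inherently serial: iteration seen + 1 cannot start until iteration seen has resolved its byte width.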

For pure ASCII text (English, digits, common punctuation), every character is 1 byte. UTF-8 is compatible with ASCII. The one-thread-per-byte model works because one byte equals one character.

For multilingual text, the one-thread-per-byte model breaks.

The three GPU problems with variable-width encoding

Problem 1: Thread-to-character alignment

On a GPU, you dispatch N threads to process N elements. For text search, you want each thread to evaluate one character position. But you cannot assign thread N to character N without first knowing where character N starts.

Consider a string mixing ASCII and multi-byte characters:

Bytes:  [H] [é    ] [l] [l] [o] [ ] [世     ] [界     ]
Offset:  0   1  2    3   4   5   6   7  8  9  10 11 12
Chars:   0   1       2   3   4   5   6         7

Thread 2 should process character 2 ('l' at byte 3). But thread 2 does not know that byte 2 is a continuation byte of 'é', not a character start. It would need to inspect byte 2 (0xA9, which starts with 10xxxxxx, marking it as a continuation byte), scan backward to find the leading byte, determine it is part of a 2-byte sequence, and then scan forward to find the next character start at byte 3.

This backward scan is sequential. It depends on the data. Different threads scan different distances. The warp diverges.
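That scan, sketched on the CPU (a toy illustration of the per-thread work, not the actual shader):

```typescript
// Back up from an arbitrary byte index past continuation bytes (10xxxxxx)
// to the start of the enclosing character. The iteration count is
// data-dependent: 0 for ASCII, up to 3 inside a 4-byte sequence.
function prevCharStart(bytes: Uint8Array, i: number): number {
  while (i > 0 && (bytes[i] & 0xC0) === 0x80) i--;
  return i;
}

const bytes = new TextEncoder().encode('Héllo 世界');
prevCharStart(bytes, 2); // 1: byte 2 (0xA9) continues the 'é' that starts at byte 1
prevCharStart(bytes, 9); // 7: byte 9 is the last byte of '世'
```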

Worse: position reporting. Suppose thread 2 treats byte 2 as a character start and begins matching the search pattern there. For exact matching the comparison fails immediately, because a valid UTF-8 pattern never begins with a continuation byte like 0xA9. The deeper problem is that even a correctly aligned match at byte 3 cannot be reported as the character index the user expects (character 2) without counting character starts all the way from byte 0. And once any per-byte transformation such as case folding is applied, the guarantee that misaligned comparisons fail cleanly disappears, and the comparison really is matching against bytes that are not characters.

Problem 2: Warp divergence from variable byte counts

Even if you solve the alignment problem (by pre-computing a character offset table on the CPU), the threads within a warp will process characters of different byte widths.

Thread 0 processes 'H' (1 byte). Thread 1 processes 'é' (2 bytes). Thread 6 processes '世' (3 bytes). Each thread reads a different number of bytes, follows a different decoding path (1-byte vs. 2-byte vs. 3-byte), and takes a different number of cycles.

In a 32-thread warp processing a multilingual string, you may have:

  • 15 threads processing 1-byte ASCII characters
  • 8 threads processing 2-byte Latin Extended characters
  • 7 threads processing 3-byte CJK characters
  • 2 threads processing 4-byte emoji

The warp must serialize across 4 execution paths. The SIMD branch divergence penalty is immediate: the hardware executes each taken path in turn with the other threads masked off, so the warp pays the summed cost of all four decode paths, regardless of how many threads actually need 4 bytes.

Problem 3: Uncoalesced memory access

Coalesced memory access on a GPU requires adjacent threads to read adjacent memory addresses. When thread 0 reads 1 byte and thread 1 reads 2 bytes, thread 2's starting address depends on threads 0 and 1's byte widths. The memory access pattern is irregular. The GPU's memory controller cannot batch the reads into single bus transactions. Effective bandwidth drops to 10% to 25% of peak.

Why our ASCII-path two-phase search works

Our two-phase GPU text search operates on raw bytes, not characters. The Phase 1 histogram counts byte values (0 to 127 for the ASCII range). Phase 2 performs byte-by-byte sliding window matching.
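A CPU sketch of that byte-level model (each outer iteration corresponds to the window comparison one GPU thread performs; a naive simplification, not the production shader):

```typescript
// One-thread-per-byte model: position i compares the pattern against
// corpus[i..]. On the GPU, every i is an independent thread.
function byteSearch(corpus: Uint8Array, pattern: Uint8Array): number[] {
  const hits: number[] = [];
  for (let i = 0; i + pattern.length <= corpus.length; i++) {
    let match = true;
    for (let j = 0; j < pattern.length; j++) {
      if (corpus[i + j] !== pattern[j]) {
        match = false;
        break;
      }
    }
    if (match) hits.push(i);
  }
  return hits;
}

const enc = (s: string) => new TextEncoder().encode(s);
byteSearch(enc('hello world'), enc('o')); // [4, 7]
```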

For pure ASCII corpora (English-language log files, source code, structured data), this is correct and fast. Every character is 1 byte. Every byte is a valid character. The one-thread-per-byte model is one-thread-per-character. No alignment problem. No divergence from variable-width decoding.

This is why the two-phase pipeline achieves 12 ms on 1 million log entries. Log data is overwhelmingly ASCII. The UTF-8 variable-width problem does not arise.

But the moment the corpus contains multi-byte UTF-8 sequences, the byte-level model loses two properties it silently relies on:

False candidates from the byte histogram. Phase 1 counts individual byte values, not characters. A 3-byte CJK character like '中' (0xE4 0xB8 0xAD) contains the byte 0xAD, which is also the second byte of 'í' (0xC3 0xAD). A document containing '中' but no 'í' still passes the single-byte pre-filter for a pattern containing 'í'. (Exact whole-pattern matching, by contrast, is protected by UTF-8's self-synchronization: a valid pattern never begins with a continuation byte, so a misaligned full match is impossible. The histogram has no such protection.)

Wrong results from per-byte case folding. The ASCII fold only touches bytes in the A-to-Z range, so a case-insensitive byte search for 'müller' silently fails to match 'MÜLLER': 'ü' (0xC3 0xBC) and 'Ü' (0xC3 0x9C) differ in a continuation byte the fold never sees. Applying the OR to every byte instead does not fix it: it conflates unrelated characters ('ß', 0xC3 0x9F, becomes byte-identical to 'ÿ', 0xC3 0xBF) while still missing case pairs whose UTF-8 bytes do not differ by exactly 0x20 (Greek 'Ρ'/'ρ').

Both failures are silent. The first inflates the candidate set and wastes Phase 2 work. The second returns matches that do not exist in the actual text, or misses matches that do.

Case-insensitive search compounds everything

Case-insensitive string matching on ASCII is a solved problem. Each byte is either in the range 0x41 to 0x5A (uppercase A to Z) or 0x61 to 0x7A (lowercase a to z). The case fold is a single bitwise OR with 0x20. One instruction. Zero branching. GPU-friendly.
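The fold in code (a sketch; the range check keeps digits and punctuation untouched, and on a GPU it compiles to a predicated select rather than a real branch):

```typescript
// Fold one ASCII byte to lowercase: bytes 0x41-0x5A ('A'-'Z') get the
// 0x20 bit set, everything else passes through unchanged.
function asciiFoldByte(b: number): number {
  return b >= 0x41 && b <= 0x5A ? b | 0x20 : b;
}

asciiFoldByte(0x41); // 0x61: 'A' -> 'a'
asciiFoldByte(0x31); // 0x31: '1' unchanged
```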

Unicode case folding is a different problem entirely.

One-to-many mappings

The German 'ß' (U+00DF) uppercases to 'SS' and case-folds to 'ss'. One character becomes two; the folded string is longer than the original. A search pattern containing "STRASSE" must match "Straße". The pattern is 7 characters. The match target is 6 characters. The lengths differ.

On a GPU, this means the matching window size is data-dependent. Thread N processing position N in the corpus does not know how many bytes to compare because the number of bytes depends on whether any characters in the window fold to multi-character sequences. The thread must scan each character, check a lookup table for fold mappings, accumulate the folded length, and compare. This is sequential, data-dependent, and different for every thread.
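The asymmetry is visible directly in JavaScript. Note that toLowerCase() alone never performs the one-to-many fold; an upper-then-lower pass does:

```typescript
const street = 'Straße';
street.toUpperCase(); // 'STRASSE': 6 characters become 7
street.toLowerCase().includes('strasse'); // false: 'ß' survives toLowerCase()
street.toUpperCase().toLowerCase().includes('strasse'); // true: full fold
```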

Locale-dependent folding

Turkish has a dotted and a dotless 'I'. In the Turkish locale, 'I' folds to 'ı' (U+0131) and 'İ' (U+0130) folds to 'i'. In every other locale, 'I' folds to 'i'. The correct case fold depends on a locale parameter that may differ per document or per user.

On a GPU, this means the fold function has a conditional branch on locale. If the corpus contains mixed-locale content (a common scenario in multilingual CRM systems, international log aggregation, or chat platforms), different threads may follow different locale-dependent fold paths. The warp diverges.
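On the CPU, the locale is an explicit parameter to the fold; this is precisely the data-dependent branch the warp cannot take for free:

```typescript
'I'.toLowerCase();           // 'i': default (root-locale) lowering
'I'.toLocaleLowerCase('tr'); // 'ı' (U+0131): Turkish dotless i
'İ'.toLocaleLowerCase('tr'); // 'i': Turkish dotted İ lowers to plain i
```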

Context-dependent folding

The Greek capital sigma 'Σ' lowercases to 'σ' (U+03C3) in word-medial position and to 'ς' (U+03C2) in word-final position. (Unicode's simple case fold maps every sigma to 'σ'; the positional rule comes from the SpecialCasing table used for lowercasing.) The correct form depends on the character's position within a word, which requires examining the surrounding characters. Each thread must inspect its neighbours. The inspection result is data-dependent. The warp diverges.
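JavaScript's toLowerCase() applies that final-sigma context rule, which makes the position dependence easy to observe:

```typescript
const lower = 'ΣΟΦΟΣ'.toLowerCase(); // the two sigmas lower differently
lower.charCodeAt(0).toString(16); // '3c3': non-final sigma σ
lower.charCodeAt(4).toString(16); // '3c2': word-final sigma ς
```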

The Unicode CaseFolding.txt table

The official Unicode case folding table (CaseFolding.txt) defines 1,400+ folding rules. A general-purpose case-insensitive search must consult this table for every character in the corpus and every character in the pattern. The table lookup is a data-dependent branch (the folding rule depends on the character value). With 1,400 possible outcomes, the number of unique execution paths in a 32-thread warp is bounded only by the number of unique characters in that warp's 32 positions.

This is not bounded divergence. This is categorical divergence. The GPU cannot execute this efficiently regardless of the dataset size.

Our detection and routing mechanism

Our engine detects the UTF-8 + case-insensitive combination at operation registration time, before any GPU resources are allocated.

Step 1: Encoding detection

When a corpus is loaded, the engine classifies its encoding into one of three categories: ASCII, Latin-1, or Unicode. The classification scans a sample of documents for byte values that indicate the encoding tier:

function detectEncoding(corpus: Uint8Array, sampleSize: number): 'ascii' | 'latin1' | 'unicode' {
  const step = Math.max(1, Math.floor(corpus.length / sampleSize));
  let hasHighByte = false;
  let hasMultiByte = false;
  for (let i = 0; i < corpus.length; i += step) {
    const b = corpus[i];
    if (b > 0x7F) {
      hasHighByte = true;
      // A UTF-8 multi-byte sequence is a leading byte (11xxxxxx) followed by
      // a continuation byte (10xxxxxx). A lone high byte without that pair
      // is Latin-1, not UTF-8.
      if ((b & 0xC0) === 0xC0 && i + 1 < corpus.length && (corpus[i + 1] & 0xC0) === 0x80) {
        hasMultiByte = true;
        break;
      }
    }
  }
  if (hasMultiByte) return 'unicode';
  if (hasHighByte) return 'latin1';
  return 'ascii';
}

ASCII corpora contain only bytes in the 0x00 to 0x7F range. Latin-1 corpora contain high bytes (0x80 to 0xFF) but only as single-byte characters. Unicode corpora contain multi-byte UTF-8 sequences (leading bytes 0xC0 and above with continuation bytes). The sample size is configurable (default: 10,000 bytes). For a 200 MB corpus, the scan takes under 0.01 ms.

The result is cached per corpus. Subsequent searches against the same corpus reuse the encoding classification.

Step 2: Search flag analysis

The engine inspects the search operation's flags in combination with the corpus encoding classification:

  • Case-sensitive search on ASCII or Latin-1 corpus: GPU-eligible. Byte-level matching is correct and fast.
  • Case-insensitive search on ASCII corpus: GPU-eligible. ASCII case folding is a single bitwise OR, zero divergence.
  • Case-sensitive search on Unicode corpus: GPU-eligible. The Phase 1 histogram pre-filter still works (it operates on byte frequencies, not character semantics). Phase 2 byte matching operates at the byte level.
  • Case-insensitive search on Unicode corpus: GPU-inhibited. Categorical penalty. This is inhibition factor #5 (Unicode case-insensitive) in the GPU inhibition scoring engine.

Step 3: Categorical inhibition

When the corpus is classified as Unicode and the search is case-insensitive, the engine assigns a penalty of negative infinity to the GPU dispatch score. This is one of six categorical inhibition factors in the GPU inhibition scoring engine, each of which acts as a hard cutoff. The mechanism is identical to branch divergence inhibition and high match density inhibition.

function computeSearchDispatchScore(corpus: CorpusMeta, search: SearchOp): number {
  // Categorical inhibition factor #5: Unicode case-insensitive
  if (corpus.encoding === 'unicode' && search.caseInsensitive) {
    return -Infinity;  // Hard cutoff, cannot be overridden by continuous factors
  }

  // Standard continuous scoring for eligible combinations
  return computeContinuousScore(corpus, search);
}

The negative infinity penalty short-circuits the entire dispatch pipeline. No GPU buffers are allocated. No shaders are compiled. The search dispatches directly to the Web Worker tier.

How the CPU tier handles UTF-8 case-insensitive search

The Web Worker tier uses JavaScript's native string handling, which in the major engines is backed by ICU (International Components for Unicode). String.prototype.includes(), String.prototype.toLowerCase(), String.prototype.toLocaleLowerCase(), and RegExp with the i and u flags handle Unicode text correctly, including:

  • Multi-byte character boundaries
  • Context-dependent lowercasing (Greek final sigma)
  • Locale-dependent folding via toLocaleLowerCase('tr') (Turkish dotless I)
  • Surrogate pair handling for characters above U+FFFF

The one gap is one-to-many folds: toLowerCase() leaves 'ß' unchanged, so matching "STRASSE" against "Straße" needs a full case fold, for example an upper-then-lower pass (s.toUpperCase().toLowerCase()).

// Worker: case-insensitive UTF-8 search (simple lowercase matching;
// one-to-many folds like 'ß' would need an extra toUpperCase() pass)
function searchCaseInsensitive(docs: string[], pattern: string): number[] {
  const lowerPattern = pattern.toLowerCase();
  const matches: number[] = [];
  for (let i = 0; i < docs.length; i++) {
    if (docs[i].toLowerCase().includes(lowerPattern)) {
      matches.push(i);
    }
  }
  return matches;
}

This is not optimal. toLowerCase() allocates a new string for every document. For 150,000 documents averaging 80 characters, that is 150,000 string allocations and 12 MB of temporary strings. But it is correct. And on the CPU, correctness does not require fighting the hardware.

With 8 Web Workers processing 150,000 documents: approximately 18 ms for case-insensitive UTF-8 search. Slower than the GPU's 1.8 ms for ASCII case-sensitive search, but correct for every character in every language in the Unicode standard.

Pre-computed lowercase corpus

For corpora that are searched repeatedly (live log streams, CRM databases), the engine can pre-compute a lowercase copy of the corpus on first load. The toLowerCase() cost is paid once. Subsequent searches compare the pattern (lowered once) against the pre-lowered corpus, eliminating per-search allocation.
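A sketch of that pre-lowering pass (the PreloweredCorpus shape and function names are illustrative, not the engine's actual types; the worker split is elided):

```typescript
interface PreloweredCorpus {
  docs: string[];    // original documents, kept for result display
  lowered: string[]; // toLowerCase() applied once at load time
}

function preLower(docs: string[]): PreloweredCorpus {
  return { docs, lowered: docs.map((d) => d.toLowerCase()) };
}

// Per-search cost: one fold of the (short) pattern, zero per-document allocation.
function searchPreLowered(corpus: PreloweredCorpus, pattern: string): number[] {
  const p = pattern.toLowerCase();
  const hits: number[] = [];
  for (let i = 0; i < corpus.lowered.length; i++) {
    if (corpus.lowered[i].includes(p)) hits.push(i);
  }
  return hits;
}
```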

Pre-computation cost for 150,000 documents: approximately 25 ms on the main thread (or 4 ms across 8 workers). Amortized over hundreds of searches per session, the per-search cost drops to effectively zero.

The hybrid path: ASCII-fast, UTF-8-correct

For corpora that are predominantly ASCII with occasional multi-byte characters (common in English-language systems that handle international names, addresses, or product descriptions), the engine offers a hybrid path:

  1. Phase 1 runs on the GPU. The character frequency histogram operates on bytes. It does not need to understand character boundaries. A document that lacks the bytes present in the search pattern is correctly rejected regardless of encoding.

  2. Phase 2 candidate evaluation splits. Candidates that contain only ASCII bytes (detected by a fast byte scan: all bytes < 0x80) are matched on the GPU using the standard byte-level pipeline. Candidates that contain multi-byte sequences are matched on the CPU using the Web Worker tier.
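The per-candidate routing test in step 2 is a single linear scan (a sketch):

```typescript
// True when every byte is below 0x80: the document is pure ASCII and
// eligible for the GPU byte-level pipeline; otherwise it routes to CPU.
function isAsciiOnly(bytes: Uint8Array): boolean {
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] > 0x7F) return false;
  }
  return true;
}

isAsciiOnly(new TextEncoder().encode('error 404')); // true
isAsciiOnly(new TextEncoder().encode('Müller'));    // false
```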

For a corpus where 85% of documents are pure ASCII:

  • Phase 1 eliminates 95% of all documents (GPU, 0.8 ms)
  • Phase 2 GPU path: 85% of 5% candidates = 4.25% of total (GPU, 0.4 ms)
  • Phase 2 CPU path: 15% of 5% candidates = 0.75% of total (Workers, 0.3 ms)
  • Total: 1.5 ms

This hybrid approach gets within 85% of pure GPU speed while maintaining full UTF-8 correctness. It works because Phase 1's byte-level histogram is encoding-agnostic, and the ASCII/multi-byte split in Phase 2 routes each candidate to the correct execution tier.

Where this matters in enterprise systems

Multilingual CRM search

A global hospitality chain stores guest names and preferences in the guest's native script. A search for "müller" (German umlaut) must match "MÜLLER" case-insensitively. A search for "田中" (Japanese) must not produce false matches against byte subsequences of other CJK characters.

Our engine detects the multi-byte corpus, applies the categorical inhibition for case-insensitive searches, and routes to CPU. The CRM query pipeline handles the numeric filtering and aggregation on the GPU (dictionary-encoded integer operations, unaffected by encoding). Only the text search routes to CPU. The dashboard remains fast.

International log aggregation

A multinational enterprise aggregates logs from systems deployed across Japan, Germany, Brazil, and India. Log messages contain variable-width characters in Japanese, German umlauts, Portuguese accents, and Hindi Devanagari. A SIEM search for error patterns must handle all of these correctly.

The streaming search pipeline detects the multi-byte content in the corpus sample. Pattern searches that are case-sensitive and use ASCII-only patterns (IP addresses, error codes, hex strings) still run on the GPU via the two-phase pipeline. Pattern searches that are case-insensitive or use multi-byte patterns route to CPU workers.

Chat moderation

A gaming platform's chat moderation system processes messages from a global player base. Players communicate in dozens of languages. Toxic content detection patterns include transliterated slurs, Unicode homoglyph substitutions (using Cyrillic 'а' in place of Latin 'a'), and mixed-script evasion techniques.

Detecting these patterns requires Unicode-aware normalization (NFKC) and case folding, both of which are categorically GPU-hostile. The engine routes these searches to CPU. Simpler ASCII-based pattern searches (coded commands, IP addresses, base64 fragments) continue to run on the GPU.

The engineering principle

The GPU is not a universal accelerator. It is a conditional accelerator. The conditions include data-parallel structure, uniform control flow, fixed-width elements, and coalesced memory access. UTF-8 text with case-insensitive search violates all four.

The correct response is not to force the GPU to handle it (producing incorrect results or catastrophic performance). The correct response is to detect the incompatibility, route to the hardware that handles it correctly, and use the GPU for the parts of the pipeline where it actually helps.

This is the architectural discipline behind our enterprise AI automation infrastructure. We do not hide encoding complexity behind a GPU marketing claim. We profile the workload, classify the encoding, evaluate the search flags, and route to the tier that produces correct results at the best achievable speed. For ASCII corpora, that is the GPU at 1.8 ms. For multilingual case-insensitive search, that is CPU workers at 18 ms. Both are correct. Both are fast for what they are. The engine chooses automatically.

Want to discuss how this applies to your business?

Book a Discovery Call