The problem with searching text on GPUs
GPUs excel at arithmetic on fixed-width numeric data. Text is the opposite: variable-length, byte-oriented, and riddled with data-dependent branching. A naive attempt to run string.includes() as a compute shader produces the exact SIMD branch divergence pattern that makes GPU execution slower than CPU.
Every thread in a 32-wide warp processes a different document. Each document has a different length, different content, different character distribution. The matching loop branches at every byte comparison. Warp utilization collapses to single digits.
This is why most browser-based search engines do not even attempt GPU execution. They default to CPU with Web Workers, accept the 200 to 800 ms latency on large corpora, and move on.
We did not accept it. We found a way to make text search GPU-compatible by restructuring the problem into two phases: one that is perfectly parallel and one that is narrowly targeted.
Where browser-based text search matters
Before the algorithm, the context. Three industries push browser-based log search to its limits.
Cybersecurity. SIEM dashboards ingest thousands of log entries per second. Analysts search for IOCs (indicators of compromise): IP addresses, domain patterns, specific error signatures, base64-encoded payloads. The search must be interactive. Waiting 800 ms per query while triaging an active incident is not acceptable.
Gaming anti-cheat and moderation. Chat log analysis for toxicity detection, coordinated manipulation, and real-time abuse patterns. A single game session can produce 50,000 to 200,000 chat messages. Moderators need sub-second search across the full corpus to identify patterns and trace conversations.
Observability platforms. Application logs, infrastructure metrics, trace spans. Engineers debugging a production issue search for stack traces, request IDs, and error patterns across millions of log lines. Every second of search latency extends the mean time to resolution.
In all three cases, the corpus fits in browser memory (50 MB to 500 MB), the queries are interactive and frequent, and server round-trips add unacceptable latency. The constraint is not storage. It is search throughput.
Why brute-force GPU matching fails
The straightforward approach: upload all documents to a GPU storage buffer, dispatch one thread per document, have each thread scan its document for the query pattern.
```wgsl
@compute @workgroup_size(256)
fn search(@builtin(global_invocation_id) id: vec3<u32>) {
    let doc_idx = id.x;
    let doc_start = offsets[doc_idx];
    let doc_len = offsets[doc_idx + 1u] - doc_start;
    // Guard against u32 underflow when the document is shorter than the pattern
    if (doc_len < pattern_len) {
        return;
    }
    for (var i: u32 = 0u; i <= doc_len - pattern_len; i++) {
        // `match` is a reserved word in WGSL, so use `matched`
        var matched = true;
        for (var j: u32 = 0u; j < pattern_len; j++) {
            if (corpus[doc_start + i + j] != pattern[j]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            results[doc_idx] = 1u;
            return;
        }
    }
}
```
This shader has three problems.
Problem 1: Divergent inner loop. The break on mismatch causes threads in the same warp to exit the inner loop at different iterations. Some threads match on the first character, some on the tenth, some scan the entire document. The warp serializes.
Problem 2: Variable-length documents. The outer loop bound (doc_len - pattern_len) differs per thread. Short documents finish early and sit idle while long documents continue. Warp utilization is determined by the longest document in each group of 32.
Problem 3: Global memory bandwidth. Every byte comparison reads from global (VRAM) memory. A 200 MB corpus with a 10-character query pattern requires up to 2 billion byte reads in the worst case. Global memory bandwidth on a mid-range discrete GPU is 250 to 400 GB/s, but random access patterns (different threads reading different document regions) prevent coalesced reads. Effective bandwidth drops to 50 to 100 GB/s. The search becomes memory-bound with low compute utilization.
On our benchmark (1 million log entries, average 200 bytes each, 200 MB total corpus, 8-character query pattern), brute-force GPU search takes 145 ms. Single-threaded CPU String.prototype.includes() takes 820 ms. Eight-thread Web Workers take 110 ms. The GPU barely beats the CPU workers, and the divergence penalty erases most of the theoretical throughput advantage.
Our two-phase approach
The insight: most documents in a corpus do not contain the query pattern. For a typical search against log data, 95% to 99% of entries are irrelevant. If you can eliminate those entries cheaply before performing byte-level matching, the expensive phase operates on a tiny subset.
Our Adaptive WebGPU Pattern Matching Engine splits search into two dispatches.
Phase 1: Character frequency histogram pre-filter
Phase 1 does not search for the pattern. It asks a simpler question: "Does this document contain the right characters in the right quantities to possibly contain the pattern?"
If the query pattern is "ERROR_503", any matching document must contain at least one E, three R characters (E-R-R-O-R), one O, one _, one 5, one 0, and one 3. A document that contains zero R characters cannot match. A document that contains only two R characters cannot match (the pattern requires three). Phase 1 rejects these documents without reading a single byte in sequence.
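The required character counts are cheap to derive on the CPU before dispatch. A minimal sketch (the function name and return shape are illustrative, not the engine's actual API):

```typescript
// Build the per-character requirement table for a query pattern.
// Returns parallel arrays of ASCII codes and minimum counts, the data
// a shader would consume as query_chars / query_counts uniforms.
function buildQueryHistogram(pattern: string): { chars: number[]; counts: number[] } {
  const counts = new Map<number, number>();
  for (const ch of pattern) {
    const code = ch.charCodeAt(0) & 127; // fold into the 128-bin ASCII range
    counts.set(code, (counts.get(code) ?? 0) + 1);
  }
  return { chars: [...counts.keys()], counts: [...counts.values()] };
}

const q = buildQueryHistogram("ERROR_503");
// "R" (code 82) must appear at least 3 times; 7 distinct characters total
```

This runs once per query, so its cost is negligible next to either GPU dispatch.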
How the histogram is built
Each workgroup processes one document. The workgroup has 256 threads that stride across the document's byte positions in parallel.
The workgroup allocates a 128-bin histogram in shared memory. Each bin corresponds to an ASCII code point (0 to 127). Shared memory on modern GPUs provides 16 KB to 48 KB per workgroup at register-like access speeds (20 to 50 cycles latency versus 200 to 400 cycles for global memory).
A 128-bin histogram of u32 counters occupies 512 bytes. This leaves ample room in shared memory for both the histogram and the query pattern (up to 64 bytes).
The histogram construction is parallel and uniform:
```wgsl
// Simplified: one workgroup per document, 256 threads stride across byte positions
var<workgroup> shared_histogram: array<atomic<u32>, 128>;

@compute @workgroup_size(256)
fn build_histogram(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>
) {
    let doc_idx = wid.x;
    let doc_start = offsets[doc_idx];
    let doc_len = offsets[doc_idx + 1u] - doc_start;

    // Initialize shared memory histogram bins to 0
    // (threads 0..127 clear one bin each; the rest skip the loop)
    for (var b = lid.x; b < 128u; b += 256u) {
        atomicStore(&shared_histogram[b], 0u);
    }
    workgroupBarrier();

    // 256 threads stride across the document's bytes
    for (var offset = lid.x; offset < doc_len; offset += 256u) {
        let byte_val = corpus[doc_start + offset];
        let bin = byte_val & 127u;
        atomicAdd(&shared_histogram[bin], 1u);
    }
    workgroupBarrier();

    // Thread 0 evaluates the histogram against the query
    if (lid.x == 0u) {
        var pass = true;
        for (var c: u32 = 0u; c < query_char_count; c++) {
            let required_char = query_chars[c];
            let required_count = query_counts[c];
            // Elements of an atomic array must be read with atomicLoad
            let actual_count = atomicLoad(&shared_histogram[required_char]);
            if (actual_count < required_count) {
                pass = false;
                break;
            }
        }
        if (pass) {
            // Mark as candidate in global bitmask
            atomicOr(&candidate_bitmask[doc_idx / 32u], 1u << (doc_idx % 32u));
        }
    }
}
```
Why Thread 0 evaluates
The histogram evaluation is a small, sequential operation: iterate over the query's unique characters (typically 5 to 15) and compare counts. Dispatching this across multiple threads would require a reduction step with synchronization overhead that exceeds the compute saved. Thread 0 handles it in tens of nanoseconds. The other 255 threads in the workgroup are idle for this step, but the step is so short (under 100 cycles) that the cost is negligible.
Pre-filter effectiveness
The elimination rate depends on the query's character specificity and the corpus's character distribution. On our log corpus benchmarks:
| Query type | Example | Documents eliminated |
|---|---|---|
| Error code with digits | ERROR_503 | 97.2% |
| IP address pattern | 192.168.1 | 98.4% |
| Mixed-case identifier | NullPointerException | 96.8% |
| Short common word | error | 71.3% |
| Single character | x | 12.1% |
Longer, more specific queries with uncommon character combinations achieve higher elimination rates. Short, common queries benefit less. The engine's adaptive dispatch scoring accounts for estimated elimination rate when deciding whether to use the two-phase GPU path or fall back to CPU workers.
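How might the engine estimate elimination before dispatching? One plausible sketch, assuming the corpus's per-character document frequencies are tracked at load time (the real scoring model is more involved; this function and its fallback constant are hypothetical):

```typescript
// Rough pre-filter elimination estimate: a document can survive Phase 1
// only if it contains every distinct character of the pattern, so treat
// character presence as independent and multiply document frequencies.
function estimateEliminationRate(
  pattern: string,
  charDocFrequency: Map<string, number> // char -> fraction of docs containing it
): number {
  let survival = 1;
  for (const ch of new Set(pattern)) {
    survival *= charDocFrequency.get(ch) ?? 0.5; // unknown chars: assume 50%
  }
  return 1 - survival;
}
```

The independence assumption overestimates elimination for correlated characters (digits tend to co-occur in log lines), but even a rough estimate is enough to steer the dispatch decision.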
Phase 2: Targeted byte-level matching
Phase 2 dispatches a second compute shader that processes only the candidate documents identified by the Phase 1 bitmask.
The dispatch assigns one thread per byte position across the candidate documents. Each thread first checks the candidate bitmask for its document to confirm the document survived Phase 1. If the bitmask bit is not set, the thread exits immediately. If it is set, the thread performs byte-by-byte pattern matching starting at its assigned byte position. The pattern (up to 64 bytes) is loaded into workgroup shared memory so that every thread in the workgroup reads pattern bytes from fast shared memory rather than global memory. Each thread uses a binary search on the document offset array to determine which document its byte position belongs to.
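The offset lookup each thread performs maps a global byte position back to its document index. A CPU reference of that binary search (the shader version is the same loop over the offsets buffer):

```typescript
// Map a global byte position to the document containing it.
// offsets[i] is the start of document i, with a trailing sentinel entry
// equal to the total corpus length, so offsets has numDocs + 1 entries.
function docIndexForByte(offsets: Uint32Array, bytePos: number): number {
  let lo = 0;
  let hi = offsets.length - 2; // last valid document index
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (offsets[mid] <= bytePos) lo = mid; // document mid starts at or before bytePos
    else hi = mid - 1;
  }
  return lo;
}
```

For 1M documents this is at most 20 iterations per thread, all against the same small offset buffer, which caches well.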
Phase 2 still has the divergence characteristics of byte-level matching. But the critical difference: the candidate bitmask eliminates 70% to 97% of documents before any byte comparison begins.
On a corpus where Phase 1 eliminates 97% of documents, Phase 2 performs byte-level matching on only 3% of the original data. The total work is reduced by an order of magnitude or more.
The byte-level matching is straightforward: compare each byte of the pattern against the corpus starting at the thread's position. If all bytes match, the thread writes a result. This one-thread-per-byte-position model avoids the variable-length outer loop of the brute-force approach. Every thread performs the same inner loop of up to 64 comparisons. Control flow is more uniform across threads, which reduces SIMD divergence compared to the one-thread-per-document alternative.
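The per-position check is the same on CPU and GPU; a TypeScript reference of what each Phase 2 thread computes at its assigned byte position:

```typescript
// CPU reference for the Phase 2 per-position check: does `pattern` occur
// in `corpus` starting exactly at `pos`? Each GPU thread runs this same
// fixed-bound comparison (up to 64 bytes) at its own position.
function matchesAt(corpus: Uint8Array, pos: number, pattern: Uint8Array): boolean {
  if (pos + pattern.length > corpus.length) return false;
  for (let j = 0; j < pattern.length; j++) {
    if (corpus[pos + j] !== pattern[j]) return false;
  }
  return true;
}
```

Because every thread's loop has the same upper bound, the only divergence left is early exit on mismatch, which is far cheaper than the per-document variable-length outer loop.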
Memory architecture
The corpus is stored in a flat Uint8Array buffer on the GPU. Document boundaries are defined by an offset array (Uint32Array) where offsets[i] is the byte position where document i begins.
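Building that layout on the CPU before upload is straightforward. A sketch (function name illustrative), including the trailing sentinel offset that the shaders rely on for `offsets[doc_idx + 1]`:

```typescript
// Flatten documents into one byte buffer plus an offsets array.
// offsets[i] = byte position where document i begins; a final sentinel
// entry holds the total corpus length so doc i spans
// [offsets[i], offsets[i + 1]).
function flattenCorpus(docs: string[]): { corpus: Uint8Array; offsets: Uint32Array } {
  const enc = new TextEncoder();
  const encoded = docs.map((d) => enc.encode(d));
  const total = encoded.reduce((n, e) => n + e.length, 0);
  const corpus = new Uint8Array(total);
  const offsets = new Uint32Array(docs.length + 1);
  let pos = 0;
  encoded.forEach((e, i) => {
    offsets[i] = pos;
    corpus.set(e, pos);
    pos += e.length;
  });
  offsets[docs.length] = pos; // sentinel: total corpus length
  return { corpus, offsets };
}
```

Both arrays are then written into GPU storage buffers once, at dataset load.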
Total GPU memory usage for a 200 MB corpus:
| Buffer | Size | Purpose |
|---|---|---|
| Corpus data | 200 MB | Raw bytes |
| Offset array | 4 MB | 1M x 4-byte offsets |
| Candidate bitmask | 125 KB | 1M bits (1,000,000 / 8 = 125,000 bytes) |
| Candidate index (compacted) | 120 KB (variable) | ~30K x 4-byte indices |
| Result buffer | 125 KB | 1M bits |
| Query parameters | < 1 KB | Character set + counts |
Total: approximately 205 MB. Well within the 1 to 4 GB buffer size limit of modern GPUs (maxBufferSize from adapter.limits).
The corpus buffer is uploaded once on dataset load. Subsequent searches reuse the same buffer and only update the query parameters (under 1 KB per search). Buffer upload is a one-time cost.
End-to-end performance
Benchmarked on 1 million log entries, average 200 bytes per entry, 200 MB total corpus:
| Method | Query: ERROR_503 | Query: 192.168.1 | Query: error |
|---|---|---|---|
| CPU single-thread (includes()) | 820 ms | 790 ms | 680 ms |
| CPU 8-thread Web Workers | 110 ms | 105 ms | 92 ms |
| GPU brute-force (no pre-filter) | 145 ms | 140 ms | 128 ms |
| GPU two-phase (our engine) | 12 ms | 8 ms | 41 ms |
For the specific query ERROR_503, Phase 1 eliminates 97.2% of documents in 3.1 ms. Phase 2 matches the remaining 28,000 candidates in 8.9 ms. Total: 12 ms. That is 68x faster than single-threaded CPU and 9x faster than the Web Worker pool.
For the less selective query error, Phase 1 eliminates only 71.3% of documents. Phase 2 processes 287,000 candidates. Total: 41 ms. Still 20x faster than single-threaded CPU and 2.7x faster than workers.
The two-phase approach always outperforms brute-force GPU search because Phase 1's shared-memory histogram is an order of magnitude cheaper than Phase 2's global-memory byte comparison. Even when elimination is modest, the pre-filter pays for itself.
When the GPU path is not used
Not every search benefits from GPU dispatch. The GPU inhibition scoring engine evaluates each search operation against six categorical inhibition factors and eight continuous scoring factors before dispatch.
The six categorical inhibition factors assign a negative infinity penalty, unconditionally routing the search to CPU. These include: wildcard branch divergence, regex branch divergence, fuzzy branch divergence, patterns exceeding shared memory (longer than 64 characters), Unicode case-insensitive search, and high match density (exceeding 100 matches per 1,000 input elements). Each is a hard cutoff that cannot be overridden by the continuous scoring factors.
When no categorical factor is triggered, the eight continuous scoring factors determine whether GPU dispatch is worthwhile. Small corpora below the crossover threshold (default: 500K bytes on discrete GPUs, 2M bytes on integrated GPUs) score below the dispatch threshold and route to CPU. The self-calibrating crossover thresholds adjust based on measured performance but never modify the categorical inhibition factors.
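The two-stage decision can be sketched as a scoring function where categorical factors are hard vetoes and continuous factors only apply when no veto fires. The factor names, the profile shape, and the log-ratio scoring here are hypothetical illustrations, not the engine's actual model:

```typescript
// Illustrative dispatch scoring: categorical inhibition factors return
// -Infinity (unconditional CPU routing); otherwise a continuous score
// decides. The real engine weighs eight continuous factors; this sketch
// uses only corpus size relative to the crossover threshold.
interface SearchProfile {
  isRegex: boolean;
  hasWildcards: boolean;
  isFuzzy: boolean;
  patternBytes: number;
  unicodeCaseInsensitive: boolean;
  estMatchesPer1k: number;
  corpusBytes: number;
  crossoverBytes: number; // e.g. 500_000 discrete, 2_000_000 integrated
}

function gpuDispatchScore(p: SearchProfile): number {
  // Categorical inhibition: any one of these is a hard veto.
  if (p.isRegex || p.hasWildcards || p.isFuzzy) return -Infinity;
  if (p.patternBytes > 64) return -Infinity;
  if (p.unicodeCaseInsensitive) return -Infinity;
  if (p.estMatchesPer1k > 100) return -Infinity;
  // Continuous scoring: positive when the corpus is above the crossover.
  return Math.log2(p.corpusBytes / p.crossoverBytes);
}

// score > 0 → GPU two-phase path; otherwise → CPU worker pool
```

The veto structure matters: because -Infinity dominates any finite sum, no combination of favorable continuous factors can override a categorical inhibition.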
The dispatch decision is automatic. You submit a query. The engine returns results from whichever backend was fastest for that query, on that hardware, against that corpus.
Integrating with columnar query pipelines
Text search rarely operates in isolation. A typical workflow: filter log entries by time range, then search the filtered subset for a pattern, then aggregate by source.
Our pattern matching engine integrates with the adaptive query engine. The time-range filter runs as a parallel GPU filter operator, producing a candidate bitmask. That bitmask is intersected with the text search bitmask using a single bitwise AND dispatch. The intersection result feeds into the aggregation operator.
The entire pipeline (time filter, text search, aggregation) executes as a sequence of GPU dispatches with no intermediate CPU round-trips. Intermediate buffers stay in GPU memory. The CPU sees only the final aggregated result: 50 rows of grouped statistics, read back via a single mapAsync() call.
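The bitmask intersection at the heart of that pipeline is a single bitwise AND over 32-bit words. A CPU reference (on the GPU this is one dispatch with one thread per word):

```typescript
// Intersect two candidate bitmasks (e.g. time-range filter result and
// text search result). Each u32 word packs 32 document flags, so a
// 1M-document corpus needs only 31,250 AND operations.
function intersectBitmasks(a: Uint32Array, b: Uint32Array): Uint32Array {
  const out = new Uint32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] & b[i];
  }
  return out;
}
```

Keeping the operands as packed bitmasks, rather than materializing index lists between operators, is what lets the intermediate results stay in GPU memory.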
For a SIEM dashboard where an analyst filters to the last 15 minutes (50,000 entries from 1 million), then searches for an IP pattern, then groups by source host: total execution is 4 to 8 ms. The same operation with a server round-trip: 200 to 400 ms.
The broader architecture
This two-phase text search is one operator in a larger system. The character frequency histogram is a statistical pre-filter: a cheap, parallel, GPU-friendly test that narrows the candidate set before an expensive, divergent, GPU-hostile operation. The same structural pattern applies beyond text search. Any workload with a cheap approximate test and an expensive exact test can benefit from this two-phase decomposition.
The engineering principle: do not fight the hardware. GPUs are bad at divergent byte-level comparisons. They are excellent at parallel histogram construction in shared memory. Structure your algorithm so the GPU does what it is good at, and minimize the time it spends doing what it is bad at.
This is how we approach enterprise AI automation infrastructure at every level. Understand the hardware constraint. Design the algorithm around it. Measure the result. If the measurement contradicts the design, change the design. We do not ship code that assumes performance. We ship code that proves it.