The problem with searching text on GPUs
GPUs excel at arithmetic on fixed-width numeric data. Text is the opposite: variable-length, byte-oriented, and riddled with data-dependent branching. A naive attempt to run string.includes() as a compute shader produces the exact SIMD branch divergence pattern that makes GPU execution slower than CPU.
Every thread in a 32-wide warp processes a different document. Each document has a different length, different content, different character distribution. The matching loop branches at every byte comparison. Warp utilization collapses to single digits.
This is why most browser-based search engines do not even attempt GPU execution. They default to CPU with Web Workers, accept the 200 to 800 ms latency on large corpora, and move on.
We did not accept it. We found a way to make text search GPU-compatible by restructuring the problem into two phases: one that is perfectly parallel and one that is narrowly targeted.
Where browser-based text search matters
Before the algorithm, the context. Three industries push browser-based log search to its limits.
Cybersecurity. SIEM dashboards ingest thousands of log entries per second. Analysts search for IOCs (indicators of compromise): IP addresses, domain patterns, specific error signatures, base64-encoded payloads. The search must be interactive. Waiting 800 ms per query while triaging an active incident is not acceptable.
Gaming anti-cheat and moderation. Chat log analysis for toxicity detection, coordinated manipulation, and real-time abuse patterns. A single game session can produce 50,000 to 200,000 chat messages. Moderators need sub-second search across the full corpus to identify patterns and trace conversations.
Observability platforms. Application logs, infrastructure metrics, trace spans. Engineers debugging a production issue search for stack traces, request IDs, and error patterns across millions of log lines. Every second of search latency extends the mean time to resolution.
In all three cases, the corpus fits in browser memory (50 MB to 500 MB), the queries are interactive and frequent, and server round-trips add unacceptable latency. The constraint is not storage. It is search throughput.
Why brute-force GPU matching fails
The straightforward approach: upload all documents to a GPU storage buffer, dispatch one thread per document, have each thread scan its document for the query pattern.
```wgsl
@compute @workgroup_size(256)
fn search(@builtin(global_invocation_id) id: vec3<u32>) {
    let doc_idx = id.x;
    let doc_start = offsets[doc_idx];
    let doc_len = offsets[doc_idx + 1u] - doc_start;
    // Guard against u32 underflow when the document is shorter than the pattern
    if (doc_len < pattern_len) {
        return;
    }
    for (var i: u32 = 0u; i <= doc_len - pattern_len; i++) {
        // `match` is a reserved word in WGSL, so use `matched`
        var matched = true;
        for (var j: u32 = 0u; j < pattern_len; j++) {
            if (corpus[doc_start + i + j] != pattern[j]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            results[doc_idx] = 1u;
            return;
        }
    }
}
```
This shader has three problems.
Problem 1: Divergent inner loop. The break on mismatch causes threads in the same warp to exit the inner loop at different iterations. Some threads match on the first character, some on the tenth, some scan the entire document. The warp serializes.
Problem 2: Variable-length documents. The outer loop bound (doc_len - pattern_len) differs per thread. Short documents finish early and sit idle while long documents continue. Warp utilization is determined by the longest document in each group of 32.
Problem 3: Global memory bandwidth. Every byte comparison reads from global (VRAM) memory. A 200 MB corpus with a 10-character query pattern requires up to 2 billion byte reads in the worst case. Global memory bandwidth on a mid-range discrete GPU is 250 to 400 GB/s, but random access patterns (different threads reading different document regions) prevent coalesced reads. Effective bandwidth drops to 50 to 100 GB/s. The search becomes memory-bound with low compute utilization.
On our benchmark (1 million log entries, average 200 bytes each, 200 MB total corpus, 8-character query pattern), brute-force GPU search takes 145 ms. Single-threaded CPU String.prototype.includes() takes 820 ms. Eight-thread Web Workers take 110 ms. The GPU barely beats the CPU workers, and the divergence penalty erases most of the theoretical throughput advantage.
Our two-phase approach
The insight: most documents in a corpus do not contain the query pattern. For a typical search against log data, 95% to 99% of entries are irrelevant. If you can eliminate those entries cheaply before performing byte-level matching, the expensive phase operates on a tiny subset.
Our Adaptive WebGPU Pattern Matching Engine splits search into two dispatches.
Phase 1: Character frequency histogram pre-filter
Phase 1 does not search for the pattern. It asks a simpler question: "Does this document contain the right characters in the right quantities to possibly contain the pattern?"
If the query pattern is "ERROR_503", any matching document must contain at least one E, three R characters (E-R-R-O-R), one O, one _, one 5, one 0, and one 3. A document that contains zero R characters cannot match. A document that contains only two R characters cannot match (the pattern requires three). Phase 1 rejects these documents without reading a single byte in sequence.
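The required character counts are cheap to derive on the CPU before dispatch. A minimal sketch (the function name and return shape are illustrative, not the engine's actual API):

```typescript
// Build the per-character requirement table for a query pattern.
// Returns parallel arrays of ASCII codes and minimum counts, the data
// a shader would consume as query_chars / query_counts uniforms.
function buildQueryHistogram(pattern: string): { chars: number[]; counts: number[] } {
  const counts = new Map<number, number>();
  for (const ch of pattern) {
    const code = ch.charCodeAt(0) & 127; // fold into the 128-bin ASCII range
    counts.set(code, (counts.get(code) ?? 0) + 1);
  }
  return { chars: [...counts.keys()], counts: [...counts.values()] };
}

const q = buildQueryHistogram("ERROR_503");
// "R" (code 82) must appear at least 3 times; 7 distinct characters total
```

This runs once per query, so its cost is negligible next to either GPU dispatch.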
How the histogram is built
Each workgroup processes one document. The workgroup has 256 threads that stride across the document's byte positions in parallel.
The workgroup allocates a 128-bin histogram in shared memory. Each bin corresponds to an ASCII code point (0 to 127). Shared memory on modern GPUs provides 16 KB to 48 KB per workgroup at register-like access speeds (20 to 50 cycles latency versus 200 to 400 cycles for global memory).
A 128-bin histogram of u32 counters occupies 512 bytes. This leaves ample room in shared memory for both the histogram and the query pattern (up to 64 bytes).
The histogram construction is parallel and uniform:
```wgsl
// Simplified: one workgroup per document, 256 threads stride across byte positions
var<workgroup> shared_histogram: array<atomic<u32>, 128>;

@compute @workgroup_size(256)
fn build_histogram(
    @builtin(local_invocation_id) lid: vec3<u32>,
    @builtin(workgroup_id) wid: vec3<u32>
) {
    let doc_idx = wid.x;
    let doc_start = offsets[doc_idx];
    let doc_len = offsets[doc_idx + 1u] - doc_start;

    // Initialize shared memory histogram bins to 0
    // (threads 0..127 clear one bin each; the rest skip the loop)
    for (var b = lid.x; b < 128u; b += 256u) {
        atomicStore(&shared_histogram[b], 0u);
    }
    workgroupBarrier();

    // 256 threads stride across the document's bytes
    for (var offset = lid.x; offset < doc_len; offset += 256u) {
        let byte_val = corpus[doc_start + offset];
        let bin = byte_val & 127u;
        atomicAdd(&shared_histogram[bin], 1u);
    }
    workgroupBarrier();

    // Thread 0 evaluates the histogram against the query
    if (lid.x == 0u) {
        var pass = true;
        for (var c: u32 = 0u; c < query_char_count; c++) {
            let required_char = query_chars[c];
            let required_count = query_counts[c];
            // Elements of an atomic array must be read with atomicLoad
            let actual_count = atomicLoad(&shared_histogram[required_char]);
            if (actual_count < required_count) {
                pass = false;
                break;
            }
        }
        if (pass) {
            // Mark as candidate in global bitmask
            atomicOr(&candidate_bitmask[doc_idx / 32u], 1u << (doc_idx % 32u));
        }
    }
}
```
Why Thread 0 evaluates
The histogram evaluation is a small, sequential operation: iterate over the query's unique characters (typically 5 to 15) and compare counts. Dispatching this across multiple threads would require a reduction step with synchronization overhead that exceeds the compute saved. Thread 0 handles it in tens of nanoseconds. The other 255 threads in the workgroup are idle for this step, but the step is so short (under 100 cycles) that the cost is negligible.
Pre-filter effectiveness
The elimination rate depends on the query's character specificity and the corpus's character distribution. On our log corpus benchmarks:
| Query type | Example | Documents eliminated |
|---|---|---|
| Error code with digits | ERROR_503 | 97.2% |
| IP address pattern | 192.168.1 | 98.4% |
| Mixed-case identifier | NullPointerException | 96.8% |
| Short common word | error | 71.3% |
| Single character | x | 12.1% |
Longer, more specific queries with uncommon character combinations achieve higher elimination rates. Short, common queries benefit less. The engine's adaptive dispatch scoring accounts for estimated elimination rate when deciding whether to use the two-phase GPU path or fall back to CPU workers.
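How might the engine estimate elimination before dispatching? One plausible sketch, assuming the corpus's per-character document frequencies are tracked at load time (the real scoring model is more involved; this function and its fallback constant are hypothetical):

```typescript
// Rough pre-filter elimination estimate: a document can survive Phase 1
// only if it contains every distinct character of the pattern, so treat
// character presence as independent and multiply document frequencies.
function estimateEliminationRate(
  pattern: string,
  charDocFrequency: Map<string, number> // char -> fraction of docs containing it
): number {
  let survival = 1;
  for (const ch of new Set(pattern)) {
    survival *= charDocFrequency.get(ch) ?? 0.5; // unknown chars: assume 50%
  }
  return 1 - survival;
}
```

The independence assumption overestimates elimination for correlated characters (digits tend to co-occur in log lines), but even a rough estimate is enough to steer the dispatch decision.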
Phase 2: Targeted byte-level matching
Phase 2 dispatches a second compute shader that processes only the candidate documents identified by the Phase 1 bitmask.
The dispatch assigns one thread per byte position across the candidate documents. Each thread first checks the candidate bitmask for its document to confirm the document survived Phase 1. If the bitmask bit is not set, the thread exits immediately. If it is set, the thread performs byte-by-byte pattern matching starting at its assigned byte position. The pattern (up to 64 bytes) is loaded into workgroup shared memory so that every thread in the workgroup reads pattern bytes from fast shared memory rather than global memory. Each thread uses a binary search on the document offset array to determine which document its byte position belongs to.
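The offset lookup each thread performs maps a global byte position back to its document index. A CPU reference of that binary search (the shader version is the same loop over the offsets buffer):

```typescript
// Map a global byte position to the document containing it.
// offsets[i] is the start of document i, with a trailing sentinel entry
// equal to the total corpus length, so offsets has numDocs + 1 entries.
function docIndexForByte(offsets: Uint32Array, bytePos: number): number {
  let lo = 0;
  let hi = offsets.length - 2; // last valid document index
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (offsets[mid] <= bytePos) lo = mid; // document mid starts at or before bytePos
    else hi = mid - 1;
  }
  return lo;
}
```

For 1M documents this is at most 20 iterations per thread, all against the same small offset buffer, which caches well.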
Phase 2 still has the divergence characteristics of byte-level matching. But the critical difference: the candidate bitmask eliminates 70% to 97% of documents before any byte comparison begins.
On a corpus where Phase 1 eliminates 97% of documents, Phase 2 performs byte-level matching on only 3% of the original data. The total work is reduced by an order of magnitude or more.
The byte-level matching is straightforward: compare each byte of the pattern against the corpus starting at the thread's position. If all bytes match, the thread writes a result. This one-thread-per-byte-position model avoids the variable-length outer loop of the brute-force approach. Every thread performs the same inner loop of up to 64 comparisons. Control flow is more uniform across threads, which reduces SIMD divergence compared to the one-thread-per-document alternative.
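The per-position check is the same on CPU and GPU; a TypeScript reference of what each Phase 2 thread computes at its assigned byte position:

```typescript
// CPU reference for the Phase 2 per-position check: does `pattern` occur
// in `corpus` starting exactly at `pos`? Each GPU thread runs this same
// fixed-bound comparison (up to 64 bytes) at its own position.
function matchesAt(corpus: Uint8Array, pos: number, pattern: Uint8Array): boolean {
  if (pos + pattern.length > corpus.length) return false;
  for (let j = 0; j < pattern.length; j++) {
    if (corpus[pos + j] !== pattern[j]) return false;
  }
  return true;
}
```

Because every thread's loop has the same upper bound, the only divergence left is early exit on mismatch, which is far cheaper than the per-document variable-length outer loop.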
Memory architecture
The corpus is stored in a flat Uint8Array buffer on the GPU. Document boundaries are defined by an offset array (Uint32Array) where offsets[i] is the byte position where document i begins.
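Building that layout on the CPU before upload is straightforward. A sketch (function name illustrative), including the trailing sentinel offset that the shaders rely on for `offsets[doc_idx + 1]`:

```typescript
// Flatten documents into one byte buffer plus an offsets array.
// offsets[i] = byte position where document i begins; a final sentinel
// entry holds the total corpus length so doc i spans
// [offsets[i], offsets[i + 1]).
function flattenCorpus(docs: string[]): { corpus: Uint8Array; offsets: Uint32Array } {
  const enc = new TextEncoder();
  const encoded = docs.map((d) => enc.encode(d));
  const total = encoded.reduce((n, e) => n + e.length, 0);
  const corpus = new Uint8Array(total);
  const offsets = new Uint32Array(docs.length + 1);
  let pos = 0;
  encoded.forEach((e, i) => {
    offsets[i] = pos;
    corpus.set(e, pos);
    pos += e.length;
  });
  offsets[docs.length] = pos; // sentinel: total corpus length
  return { corpus, offsets };
}
```

Both arrays are then written into GPU storage buffers once, at dataset load.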
Total GPU memory usage for a 200 MB corpus:
| Buffer | Size | Purpose |
|---|---|---|
| Corpus data | 200 MB | Raw bytes |
| Offset array | 4 MB | 1M x 4-byte offsets |
| Candidate bitmask | 125 KB | 1M bits (1,000,000 / 8 = 125,000 bytes) |
| Candidate index (compacted) | 120 KB (variable) | ~30K x 4-byte indices |
| Result buffer | 125 KB | 1M bits |
| Query parameters | < 1 KB | Character set + counts |
Total: approximately 205 MB. Well within the 1 to 4 GB buffer size limit of modern GPUs (maxBufferSize from adapter.limits).
The corpus buffer is uploaded once on dataset load. Subsequent searches reuse the same buffer and only update the query parameters (under 1 KB per search). Buffer upload is a one-time cost.
End-to-end performance
Benchmarked on 1 million log entries, average 200 bytes per entry, 200 MB total corpus:
| Method | Query: ERROR_503 | Query: 192.168.1 | Query: error |
|---|---|---|---|
| CPU single-thread (includes()) | 820 ms | 790 ms | 680 ms |
| CPU 8-thread Web Workers | 110 ms | 105 ms | 92 ms |
| GPU brute-force (no pre-filter) | 145 ms | 140 ms | 128 ms |
| GPU two-phase (our engine) | 12 ms | 8 ms | 41 ms |
For the specific query ERROR_503, Phase 1 eliminates 97.2% of documents in 3.1 ms. Phase 2 matches the remaining 28,000 candidates in 8.9 ms. Total: 12 ms. That is 68x faster than single-threaded CPU and 9x faster than the Web Worker pool.
For the less selective query error, Phase 1 eliminates only 71.3% of documents. Phase 2 processes 287,000 candidates. Total: 41 ms. Still 20x faster than single-threaded CPU and 2.7x faster than workers.
The two-phase approach always outperforms brute-force GPU search because Phase 1's shared-memory histogram is an order of magnitude cheaper than Phase 2's global-memory byte comparison. Even when elimination is modest, the pre-filter pays for itself.
When the GPU path is not used
Not every search benefits from GPU dispatch. The GPU inhibition scoring engine evaluates each search operation against six categorical inhibition factors and eight continuous scoring factors before dispatch.
The six categorical inhibition factors assign a negative infinity penalty, unconditionally routing the search to CPU. These include: wildcard branch divergence, regex branch divergence, fuzzy branch divergence, patterns exceeding shared memory (longer than 64 characters), Unicode case-insensitive search, and high match density (exceeding 100 matches per 1,000 input elements). Each is a hard cutoff that cannot be overridden by the continuous scoring factors.
When no categorical factor is triggered, the eight continuous scoring factors determine whether GPU dispatch is worthwhile. Small corpora below the crossover threshold (default: 500K bytes on discrete GPUs, 2M bytes on integrated GPUs) score below the dispatch threshold and route to CPU. The self-calibrating crossover thresholds adjust based on measured performance but never modify the categorical inhibition factors.
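The two-stage decision can be sketched as a scoring function where categorical factors are hard vetoes and continuous factors only apply when no veto fires. The factor names, the profile shape, and the log-ratio scoring here are hypothetical illustrations, not the engine's actual model:

```typescript
// Illustrative dispatch scoring: categorical inhibition factors return
// -Infinity (unconditional CPU routing); otherwise a continuous score
// decides. The real engine weighs eight continuous factors; this sketch
// uses only corpus size relative to the crossover threshold.
interface SearchProfile {
  isRegex: boolean;
  hasWildcards: boolean;
  isFuzzy: boolean;
  patternBytes: number;
  unicodeCaseInsensitive: boolean;
  estMatchesPer1k: number;
  corpusBytes: number;
  crossoverBytes: number; // e.g. 500_000 discrete, 2_000_000 integrated
}

function gpuDispatchScore(p: SearchProfile): number {
  // Categorical inhibition: any one of these is a hard veto.
  if (p.isRegex || p.hasWildcards || p.isFuzzy) return -Infinity;
  if (p.patternBytes > 64) return -Infinity;
  if (p.unicodeCaseInsensitive) return -Infinity;
  if (p.estMatchesPer1k > 100) return -Infinity;
  // Continuous scoring: positive when the corpus is above the crossover.
  return Math.log2(p.corpusBytes / p.crossoverBytes);
}

// score > 0 → GPU two-phase path; otherwise → CPU worker pool
```

The veto structure matters: because -Infinity dominates any finite sum, no combination of favorable continuous factors can override a categorical inhibition.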
The dispatch decision is automatic. You submit a query. The engine returns results from whichever backend was fastest for that query, on that hardware, against that corpus.
Integrating with columnar query pipelines
Text search rarely operates in isolation. A typical workflow: filter log entries by time range, then search the filtered subset for a pattern, then aggregate by source.
Our pattern matching engine integrates with the adaptive query engine. The time-range filter runs as a parallel GPU filter operator, producing a candidate bitmask. That bitmask is intersected with the text search bitmask using a single bitwise AND dispatch. The intersection result feeds into the aggregation operator.
The entire pipeline (time filter, text search, aggregation) executes as a sequence of GPU dispatches with no intermediate CPU round-trips. Intermediate buffers stay in GPU memory. The CPU sees only the final aggregated result: 50 rows of grouped statistics, read back via a single mapAsync() call.
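The bitmask intersection at the heart of that pipeline is a single bitwise AND over 32-bit words. A CPU reference (on the GPU this is one dispatch with one thread per word):

```typescript
// Intersect two candidate bitmasks (e.g. time-range filter result and
// text search result). Each u32 word packs 32 document flags, so a
// 1M-document corpus needs only 31,250 AND operations.
function intersectBitmasks(a: Uint32Array, b: Uint32Array): Uint32Array {
  const out = new Uint32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] & b[i];
  }
  return out;
}
```

Keeping the operands as packed bitmasks, rather than materializing index lists between operators, is what lets the intermediate results stay in GPU memory.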
For a SIEM dashboard where an analyst filters to the last 15 minutes (50,000 entries from 1 million), then searches for an IP pattern, then groups by source host: total execution is 4 to 8 ms. The same operation with a server round-trip: 200 to 400 ms.
The broader architecture
This two-phase text search is one operator in a larger system. The character frequency histogram is a statistical pre-filter: a cheap, parallel, GPU-friendly test that narrows the candidate set before an expensive, divergent, GPU-hostile operation. The same structural pattern applies beyond text search. Any workload with a cheap approximate test and an expensive exact test can benefit from this two-phase decomposition.
The engineering principle: do not fight the hardware. GPUs are bad at divergent byte-level comparisons. They are excellent at parallel histogram construction in shared memory. Structure your algorithm so the GPU does what it is good at, and minimize the time it spends doing what it is bad at.
This is how we approach enterprise AI automation infrastructure at every level. Understand the hardware constraint. Design the algorithm around it. Measure the result. If the measurement contradicts the design, change the design. We do not ship code that assumes performance. We ship code that proves it.