Ayoob AI

Real-Time Threat Detection with GPU-Accelerated Streaming Corpora

WebGPU · Streaming · Security · Observability · Text Search

The problem with searching live data

Static corpora are searched once. You load the dataset, run your queries, and read the results. The data does not change between searches.

Live data streams do not stop. A SIEM platform ingests 2,000 to 10,000 log entries per second from firewalls, application servers, DNS resolvers, and endpoint agents. A logistics tracking system receives GPS pings, status updates, and exception alerts from thousands of vehicles every second. An observability platform collects spans, metrics, and logs from every service in your infrastructure continuously.

Every second, the corpus grows. Every detection cycle, you need to search it.

The naive approach: search the entire corpus on every cycle. For a corpus that has accumulated 3 million entries (240 MB) over the last 10 minutes, a full search takes 12 ms per pattern on a discrete GPU using our two-phase pipeline. With 20 detection patterns, that is 240 ms per cycle. Acceptable for 30-second detection intervals.

But the corpus keeps growing. After an hour: 18 million entries, 1.4 GB. Full search: 72 ms per pattern, 1,440 ms for 20 patterns. After a shift: 50+ million entries. The corpus exceeds GPU buffer limits. The search time exceeds your detection window. The system falls over.
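To make the growth concrete, here is a toy cost model (a sketch: the ~20 GB/s effective scan rate is back-calculated from the 240 MB / 12 ms figure above, and fixed dispatch overhead is ignored):

```typescript
// Toy cost model: full re-scan time grows with corpus age; incremental scan
// time does not. SCAN_GBPS is an illustrative assumption, not a measurement.
const SCAN_GBPS = 20;
const BYTES_PER_MS = SCAN_GBPS * 1e6; // bytes scanned per millisecond

function fullScanMs(corpusBytes: number): number {
  return corpusBytes / BYTES_PER_MS; // grows linearly with corpus size
}

function incrementalScanMs(entriesPerSec: number, entryBytes: number, intervalSec: number): number {
  // Only the bytes appended since the last cycle are scanned
  return (entriesPerSec * entryBytes * intervalSec) / BYTES_PER_MS;
}

console.log(fullScanMs(240e6));               // 12  (10-minute corpus)
console.log(fullScanMs(1.44e9));              // 72  (1-hour corpus)
console.log(incrementalScanMs(5000, 80, 30)); // 0.6, regardless of corpus age
```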

Re-searching data you have already searched is the problem. You searched the first 3 million entries 30 seconds ago. You found your results. Those entries have not changed. Searching them again produces the same results and wastes 95% of the GPU's work.

Searched-frontier tracking

Our engine maintains a searched frontier: a byte offset into the corpus buffer that marks the boundary between data that has been searched and data that has not.

Corpus byte buffer:
[==========searched===========|----unsearched----]
                               ^
                          frontier offset

On each detection cycle, the GPU searches only the bytes beyond the frontier. After the search completes, the frontier advances to the end of the buffer. New data appended after the frontier remains unsearched until the next cycle.

How the frontier works

The corpus is stored as a flat Uint8Array in a SharedArrayBuffer, with a parallel offset array (Uint32Array) that records the byte position where each document begins. Both structures are append-only during a session.

interface StreamingCorpus {
  bytes: Uint8Array;          // Raw document bytes, contiguous
  offsets: Uint32Array;       // Document start positions
  byteLength: number;         // Current end of written data
  documentCount: number;      // Current document count
  maxByteLength: number;      // Fixed capacity of the pre-allocated buffer
  searchedByteOffset: number; // Frontier: bytes searched up to here
  searchedDocIndex: number;   // Frontier: documents searched up to here
}

When new log entries arrive, they are appended to the byte buffer and the offset array extends:

function appendDocuments(corpus: StreamingCorpus, newDocs: Uint8Array[]) {
  for (let i = 0; i < newDocs.length; i++) {
    const doc = newDocs[i];
    corpus.bytes.set(doc, corpus.byteLength);
    corpus.offsets[corpus.documentCount] = corpus.byteLength;
    corpus.byteLength += doc.length;
    corpus.documentCount++;
  }
  corpus.offsets[corpus.documentCount] = corpus.byteLength; // End sentinel: bounds the final document
  // Frontier does NOT advance. New data is unsearched.
}

The key property: existing data is never re-encoded, re-indexed, or moved. The append operation writes new bytes to the end of the buffer and extends the offset array. Documents that were already in the buffer retain their exact byte positions. Their histograms from previous Phase 1 passes remain valid.
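This stability property is easy to see in a minimal, runnable sketch (plain typed arrays stand in for the SharedArrayBuffer-backed structures; makeCorpus and its capacity parameters are illustrative, not the engine's exact code):

```typescript
// Minimal stand-in for the streaming corpus: append-only bytes plus offsets.
interface StreamingCorpus {
  bytes: Uint8Array;
  offsets: Uint32Array;
  byteLength: number;
  documentCount: number;
  searchedByteOffset: number;
  searchedDocIndex: number;
}

function makeCorpus(capacityBytes: number, maxDocs: number): StreamingCorpus {
  return {
    bytes: new Uint8Array(capacityBytes),
    offsets: new Uint32Array(maxDocs + 1), // +1 leaves room for the end sentinel
    byteLength: 0,
    documentCount: 0,
    searchedByteOffset: 0,
    searchedDocIndex: 0,
  };
}

function appendDocuments(corpus: StreamingCorpus, newDocs: Uint8Array[]): void {
  for (const doc of newDocs) {
    corpus.bytes.set(doc, corpus.byteLength);
    corpus.offsets[corpus.documentCount] = corpus.byteLength;
    corpus.byteLength += doc.length;
    corpus.documentCount++;
  }
  corpus.offsets[corpus.documentCount] = corpus.byteLength; // end sentinel
  // Frontier fields untouched: new data stays unsearched until the next cycle.
}

const enc = new TextEncoder();
const corpus = makeCorpus(1024, 16);
appendDocuments(corpus, [enc.encode("GET /index"), enc.encode("DENIED 10.0.0.7")]);
const firstDocOffset = corpus.offsets[0];

appendDocuments(corpus, [enc.encode("ERROR disk full")]);
// Earlier documents keep their exact byte positions after further appends.
console.log(corpus.offsets[0] === firstDocOffset); // true
```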

Searching only new data

On each detection cycle, the engine uploads only the unsearched portion of the corpus to the GPU:

function searchNewData(corpus: StreamingCorpus, pattern: SearchPattern) {
  const newByteStart = corpus.searchedByteOffset;
  const newByteEnd = corpus.byteLength;
  const newDocStart = corpus.searchedDocIndex;
  const newDocEnd = corpus.documentCount;

  if (newDocEnd === newDocStart) return []; // No new data

  // Upload only the new bytes to GPU (both subarrays are views, not copies)
  const newBytes = corpus.bytes.subarray(newByteStart, newByteEnd);
  const newOffsets = corpus.offsets.subarray(newDocStart, newDocEnd + 1); // +1 includes the end offset of the last document

  // Run two-phase search on the new segment only
  const results = twoPhaseSearch(newBytes, newOffsets, pattern);

  // Advance frontier
  corpus.searchedByteOffset = newByteEnd;
  corpus.searchedDocIndex = newDocEnd;

  return results;
}

The subarray() call creates a view into the existing SharedArrayBuffer. No copy. The GPU receives only the new bytes via device.queue.writeBuffer(). The transfer size is proportional to the new data since the last cycle, not the total corpus size.
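The zero-copy property is plain TypedArray behavior, easy to verify directly:

```typescript
// subarray() returns a view over the same backing buffer: writes through the
// parent are visible through the view, and no bytes are copied.
const bytes = new Uint8Array(16);
const view = bytes.subarray(8, 12);

bytes[8] = 42;
console.log(view[0]);                      // 42
console.log(view.buffer === bytes.buffer); // true: shared backing store
console.log(view.byteOffset);              // 8
```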

Quantifying the savings

For a stream ingesting 5,000 log entries per second with an average entry size of 80 bytes:

| Detection interval | New entries per cycle | New bytes | Full corpus (10 min) | Search reduction |
|---|---|---|---|---|
| 1 second | 5,000 | 400 KB | 240 MB | 99.8% |
| 5 seconds | 25,000 | 2 MB | 240 MB | 99.2% |
| 30 seconds | 150,000 | 12 MB | 240 MB | 95.0% |
| 60 seconds | 300,000 | 24 MB | 240 MB | 90.0% |

At a 30-second detection interval, the GPU searches 12 MB instead of 240 MB. The search time drops from 12 ms (full corpus) to 1.8 ms (new data only) per pattern. For 20 patterns, total sweep time is 36 ms instead of 240 ms.

Critically, the sweep time is constant. It does not grow with corpus age. Whether the full corpus is 240 MB or 2.4 GB, the per-cycle search processes only the data appended since the last frontier advance. The engine can run indefinitely without degradation.

Buffer management for unbounded streams

A stream that runs for hours or days will exceed any fixed buffer size. The engine handles this with a ring buffer strategy.

Pre-allocated ring buffer

At initialization, the engine allocates a fixed-size SharedArrayBuffer for the corpus. The size is configurable (default: 512 MB, sufficient for approximately 6 million 80-byte log entries). The byte buffer and offset array are both backed by this allocation.

When the write position approaches the buffer's end, the engine wraps:

function appendWithWrap(corpus: StreamingCorpus, doc: Uint8Array) {
  if (corpus.byteLength + doc.length > corpus.maxByteLength) {
    // Evict oldest documents to make room
    const evictTarget = corpus.byteLength + doc.length - corpus.maxByteLength + EVICTION_MARGIN;
    evictOldest(corpus, evictTarget);
  }
  // Append as normal
  corpus.bytes.set(doc, corpus.byteLength);
  corpus.offsets[corpus.documentCount] = corpus.byteLength;
  corpus.byteLength += doc.length;
  corpus.documentCount++;
}

Eviction removes the oldest documents from the logical start of the buffer. The evicted region is not zeroed (unnecessary cost). The offset array's base index advances. Searches never reference evicted documents because they were already past the frontier (searched in a previous cycle).

Compaction

After many append-and-evict cycles, the live data may be fragmented (a contiguous block in the middle of the buffer with dead space at the start). Periodically (or when fragmentation exceeds a threshold), the engine compacts: it copies the live data to the start of the buffer, rewrites the offset array, and resets the frontier offset.

Compaction is a CPU operation on the SharedArrayBuffer. It runs between detection cycles (never during a GPU search). For a 200 MB live region, compaction takes 5 to 15 ms (a memcpy of the live bytes plus offset recalculation). Infrequent enough (once every few minutes at most) to have negligible impact on detection latency.
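Compaction can be sketched as an in-place copyWithin plus an offset rebase (a simplified model; CompactableCorpus is an illustrative stand-in for the engine's structures, and concurrency guards are omitted since compaction runs between cycles):

```typescript
// Compaction: slide the live byte range to the front of the buffer, then
// rebase every document offset and the frontier by the same shift.
interface CompactableCorpus {
  bytes: Uint8Array;
  offsets: Uint32Array;       // absolute start positions of live docs + end sentinel
  liveDocCount: number;
  searchedByteOffset: number; // frontier, absolute
}

function compact(corpus: CompactableCorpus): void {
  const liveStart = corpus.offsets[0];
  const liveEnd = corpus.offsets[corpus.liveDocCount];
  // In-place memmove of the live region to offset 0
  corpus.bytes.copyWithin(0, liveStart, liveEnd);
  // Rebase offsets (including the end sentinel) and the frontier
  for (let i = 0; i <= corpus.liveDocCount; i++) corpus.offsets[i] -= liveStart;
  corpus.searchedByteOffset -= liveStart;
}

// Two live docs at bytes 4..6 and 7..9 of a 10-byte buffer, frontier at 10.
const c: CompactableCorpus = {
  bytes: new Uint8Array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
  offsets: new Uint32Array([4, 7, 10]),
  liveDocCount: 2,
  searchedByteOffset: 10,
};
compact(c);
console.log(Array.from(c.offsets)); // [0, 3, 6]
console.log(c.searchedByteOffset);  // 6
```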

GPU buffer synchronization

The GPU does not read from the SharedArrayBuffer directly. WebGPU requires data in GPU-accessible buffers created via device.createBuffer(). The engine must synchronize the new data from the CPU-side SharedArrayBuffer to the GPU-side storage buffer on each detection cycle.

Incremental GPU buffer upload

The engine maintains a GPU-side corpus buffer that mirrors the CPU-side SharedArrayBuffer. On each cycle, only the new bytes (from the previous frontier to the current write position) are uploaded:

device.queue.writeBuffer(
  gpuCorpusBuffer,
  corpus.searchedByteOffset,  // Offset into GPU buffer
  corpus.bytes,
  corpus.searchedByteOffset,  // Offset into CPU buffer
  newByteCount                // Only the new bytes
);

The writeBuffer call with an offset writes to a specific position in the existing GPU buffer without touching the rest. The previously uploaded data remains in GPU memory undisturbed. Transfer cost is proportional to the new data only. One practical wrinkle: writeBuffer requires the destination offset and write size to be multiples of 4 bytes, so the frontier offset must either stay 4-byte aligned or the write must be widened slightly to satisfy the alignment.

For 12 MB of new data on PCIe 4.0 x16: upload time is approximately 0.5 ms. This is the only transfer per cycle. The pipeline fusion principle applies: both Phase 1 and Phase 2 of the two-phase search read from the same GPU buffer with no intermediate readback.

GPU buffer resizing

When the CPU-side buffer grows beyond the current GPU buffer allocation, the engine creates a new, larger GPU buffer and copies the existing contents. This is a GPU-to-GPU copy (fast, no PCIe traversal on discrete GPUs) followed by deallocation of the old buffer.

The engine over-allocates by 2x to amortize resizing cost. A GPU buffer that starts at 32 MB grows to 64 MB, 128 MB, 256 MB, and so on. For a 512 MB CPU corpus, the GPU buffer resizes at most 4 times during the entire session. Total resize overhead: under 10 ms cumulative.
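The doubling policy itself is a few lines (this sketch covers only the size computation; the actual resize also performs the GPU-to-GPU copy via copyBufferToBuffer on a command encoder, which is omitted here, and the function name is illustrative):

```typescript
// Double the GPU buffer size until the required corpus fits.
// Doubling amortizes resize cost: a 512 MB corpus starting from a 32 MB
// buffer triggers at most 4 resizes over the whole session.
function nextGpuBufferSize(currentBytes: number, requiredBytes: number): number {
  let size = currentBytes;
  while (size < requiredBytes) size *= 2;
  return size;
}

// A 32 MB buffer grows to 256 MB to hold a 200 MB corpus.
console.log(nextGpuBufferSize(32 << 20, 200 << 20) >> 20); // 256
```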

Match density monitoring and contention fallback

Not every detection cycle produces sparse results. A threat event (active attack, infrastructure failure, configuration error flooding logs) can cause match density to spike. When a pattern like ERROR or DENIED suddenly matches 15% of incoming log entries instead of the usual 0.5%, the GPU's atomic contention characteristics change.

The contention problem with dense matches

The two-phase search pipeline uses atomicOr to set bits in the candidate bitmask during Phase 1, and atomicAdd to compact candidate indices during Phase 2. At low match density (under 5%), contention on these atomics is negligible. At high density (above 10%), hundreds of threads per cycle contend on bitmask words and the compaction counter. Throughput collapses non-linearly.

Density detection

The GPU inhibition scoring engine evaluates estimated match density before GPU dispatch. If the estimate exceeds 100 matches per 1,000 input elements (10%), the search routes to the Web Worker tier instead of the GPU.
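One plausible way to estimate density cheaply is to sample a prefix of the new documents on the CPU before choosing a dispatch path (a sketch under that assumption; the engine's actual estimator is not specified here, but the 10% cutoff mirrors the threshold above):

```typescript
// Estimate match density from a small CPU-side sample of the new documents.
// sampleSize is an illustrative default.
function estimateDensity(docs: string[], pattern: string, sampleSize = 1000): number {
  const n = Math.min(sampleSize, docs.length);
  let matches = 0;
  for (let i = 0; i < n; i++) {
    if (docs[i].includes(pattern)) matches++;
  }
  return n === 0 ? 0 : matches / n;
}

// Categorical cutoff: more than 100 matches per 1,000 elements inhibits the GPU.
function shouldInhibitGpu(density: number): boolean {
  return density > 0.1;
}

const sample = ["ERROR a", "ok", "ERROR b", "ok"];
console.log(shouldInhibitGpu(estimateDensity(sample, "ERROR"))); // true (density 0.5)
```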

On the Web Worker tier, each worker processes a chunk of documents from the SharedArrayBuffer using native String.prototype.includes(). Eight workers with thread-local result arrays produce zero atomic contention. For 150,000 documents at 10% match density, the CPU search completes in approximately 2.1 ms. This avoids the non-linear performance collapse that atomic contention would cause on the GPU at high density.
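A single-threaded sketch of the Worker-tier strategy: each chunk collects matches into its own local array, so merging happens exactly once at the end with no shared counter (in production each chunk runs in its own Web Worker reading from the SharedArrayBuffer; the function name and chunking are illustrative):

```typescript
// Chunked CPU search with per-chunk result arrays: no shared mutable state,
// hence no contention. In the real engine each chunk is a separate Worker.
function chunkedSearch(docs: string[], pattern: string, chunks: number): number[] {
  const perChunk: number[][] = Array.from({ length: chunks }, () => []);
  const chunkSize = Math.ceil(docs.length / chunks);
  for (let c = 0; c < chunks; c++) {
    const local = perChunk[c]; // thread-local result array in the real engine
    const start = c * chunkSize;
    const end = Math.min(start + chunkSize, docs.length);
    for (let i = start; i < end; i++) {
      if (docs[i].includes(pattern)) local.push(i);
    }
  }
  return perChunk.flat(); // single merge at the end, no atomics
}

const logs = ["auth ok", "DENIED 10.0.0.7", "auth ok", "DENIED 10.0.0.9"];
console.log(chunkedSearch(logs, "DENIED", 2)); // [1, 3]
```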

The penalty mechanism

High match density (exceeding 100 matches per 1,000 input elements) is one of the six categorical inhibition factors in the GPU inhibition scoring engine. When triggered, the engine assigns a penalty of negative infinity to the GPU dispatch score, routing the search to the Web Worker tier. This is the same categorical mechanism used for branch divergence inhibition and other hard cutoffs. The categorical factors are hard boundaries that cannot be overridden by the continuous scoring factors or the self-calibrating crossover thresholds.

Threat detection pipeline architecture

Putting it together, here is the full pipeline for a SIEM platform running 30-second detection cycles:

Cycle start

  1. Ingest. New log entries from the last 30 seconds are appended to the SharedArrayBuffer corpus. 150,000 new entries, 12 MB.
  2. GPU buffer sync. The 12 MB of new bytes are uploaded to the GPU corpus buffer at the frontier offset. Transfer: 0.5 ms.

Pattern sweep

  1. For each of 20 detection patterns:
    • The GPU inhibition scoring engine evaluates the pattern against the six categorical inhibition factors and eight continuous scoring factors.
    • If no categorical inhibition is triggered and the continuous score exceeds the crossover threshold: the two-phase GPU pipeline executes (Phase 1 histogram pre-filter, Phase 2 byte matching) against the 150,000 new documents.
    • If categorical inhibition is triggered (e.g., estimated match density exceeds 100 per 1,000): the search routes to Web Workers.
    • Results (matching document indices) are appended to the cycle's result set.

Average per-pattern timing:

| Dispatch path | Time |
|---|---|
| GPU two-phase (low density) | 1.8 ms |
| Web Workers (high density fallback) | 3.0 ms |

For 20 patterns (18 low density on GPU, 2 high density on Workers): (18 x 1.8) + (2 x 3.0) = 38.4 ms.

Correlation

  1. Cross-pattern correlation. Matching document indices from all 20 patterns are intersected and time-windowed. Documents that match multiple threat indicators within a time window are flagged as high-priority alerts. This is a CPU operation on the small result set (typically hundreds of matches from 150,000 documents). Time: under 1 ms.
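The correlation step reduces to counting, per document, how many distinct patterns matched it (a simplified sketch: time-window bucketing is omitted, and minPatterns is an illustrative parameter for the "multiple threat indicators" threshold):

```typescript
// Flag documents matched by at least minPatterns distinct detection patterns.
// Input: one array of matching document indices per pattern.
function correlate(matchesPerPattern: number[][], minPatterns: number): number[] {
  const count = new Map<number, number>();
  for (const docIdxs of matchesPerPattern) {
    // De-duplicate within a pattern so repeats don't inflate the count
    for (const d of new Set(docIdxs)) {
      count.set(d, (count.get(d) ?? 0) + 1);
    }
  }
  return Array.from(count.entries())
    .filter(([, n]) => n >= minPatterns)
    .map(([d]) => d)
    .sort((a, b) => a - b);
}

// Doc 7 matches both patterns; docs 3 and 9 match only one.
console.log(correlate([[3, 7], [7, 9]], 2)); // [7]
```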

Frontier advance

  1. Advance frontier. The searched-byte and searched-document offsets advance to the current write position. The next cycle will search only data appended after this point.

Total cycle time

| Step | Time |
|---|---|
| GPU buffer sync | 0.5 ms |
| 20-pattern sweep | 38.4 ms |
| Cross-pattern correlation | 0.8 ms |
| Frontier advance | 0.01 ms |
| Total | 39.7 ms |

Under 40 ms per 30-second cycle. The detection system uses 0.13% of the available wall-clock time. It could run at 1-second intervals (searching 5,000 new entries per cycle) and still complete in under 5 ms, leaving 99.5% of CPU and GPU time for the application.

Scaling characteristics

The pipeline's per-cycle cost depends on the ingestion rate and detection interval, not the corpus age.

| Ingestion rate | Detection interval | New entries per cycle | Per-cycle search time (20 patterns) |
|---|---|---|---|
| 1,000/sec | 30 sec | 30,000 | 8 ms |
| 5,000/sec | 30 sec | 150,000 | 39 ms |
| 10,000/sec | 30 sec | 300,000 | 74 ms |
| 5,000/sec | 5 sec | 25,000 | 7 ms |
| 5,000/sec | 1 sec | 5,000 | 2 ms |

At 10,000 entries per second (the upper end for a single-site SIEM deployment), the 30-second cycle completes in 74 ms. Well within budget. At 1-second intervals for near-real-time detection, the same stream requires only 2 ms per cycle.

The corpus can accumulate to any size. The buffer ring evicts old data when the allocation limit is reached. Searches never touch evicted data (it was already searched). The only memory constraint is the SharedArrayBuffer allocation size, which is configurable up to the browser's per-origin memory limit (typically 2 to 4 GB).

Application: logistics exception tracking

The same architecture applies to logistics operations. A fleet management platform tracks 5,000 vehicles, each reporting GPS position, speed, fuel level, cargo temperature, and exception codes every 10 seconds. That is 500 events per second, each 120 bytes. Over a 12-hour shift: 21.6 million events, 2.6 GB.

The operations team needs to search for exception patterns in real-time: temperature deviations, route deviations, extended idle periods, repeated error codes, geo-fence breaches. Each pattern is a text search or structured filter across the event stream.

With searched-frontier tracking, each 30-second cycle searches only the 15,000 new events (1.8 MB). The 20-pattern sweep completes in under 10 ms. The operations dashboard updates in real-time. No server round-trip. No batch job. No 15-minute refresh cycle.

The corpus ring buffer evicts events older than the retention window (configurable: 2 hours, 4 hours, full shift). Historical queries beyond the retention window fall back to the server-side archive. Live queries against the current window run entirely in the browser at GPU speed.

The engineering principle

Streaming data is an append-only problem. The searched-frontier mechanism reduces it to a bounded problem: regardless of how long the stream has been running, each search cycle processes a fixed window of new data.

The two-phase GPU pipeline makes each search fast. The frontier makes each search small. The contention monitor prevents dense result spikes from crashing the GPU. The device loss handler ensures that if the GPU fails, detection continues on the CPU without interruption.

This is what production-grade enterprise AI automation looks like for streaming workloads. Not a system that works for 10 minutes and then slows down as the corpus grows. A system that runs at constant speed for hours, days, or weeks, because it never re-searches data it has already seen.

Want to discuss how this applies to your business?

Book a Discovery Call