Ayoob AI

Real-Time Threat Detection with GPU-Accelerated Streaming Corpora

WebGPU · Streaming · Security · Observability · Text Search

The problem with searching live data

Static corpora are searched once. You load the dataset, run your queries, and read the results. The data does not change between searches.

Live data streams do not stop. A SIEM platform ingests 2,000 to 10,000 log entries per second from firewalls, application servers, DNS resolvers, and endpoint agents. A logistics tracking system receives GPS pings, status updates, and exception alerts from thousands of vehicles every second. An observability platform collects spans, metrics, and logs from every service in your infrastructure continuously.

Every second, the corpus grows. Every detection cycle, you need to search it.

The naive approach: search the entire corpus on every cycle. For a corpus that has accumulated 3 million entries (240 MB) over the last 10 minutes, a full search takes 12 ms per pattern on a discrete GPU using our two-phase pipeline. With 20 detection patterns, that is 240 ms per cycle. Acceptable for 30-second detection intervals.

But the corpus keeps growing. After an hour: 18 million entries, 1.4 GB. Full search: 72 ms per pattern, 1,440 ms for 20 patterns. After a shift: 50+ million entries. The corpus exceeds GPU buffer limits. The search time exceeds your detection window. The system falls over.
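To make the growth concrete, here is a toy cost model (a sketch: the ~20 GB/s effective scan rate is back-calculated from the 240 MB / 12 ms figure above, and fixed dispatch overhead is ignored):

```typescript
// Toy cost model: full re-scan time grows with corpus age; incremental scan
// time does not. SCAN_GBPS is an illustrative assumption, not a measurement.
const SCAN_GBPS = 20;
const BYTES_PER_MS = SCAN_GBPS * 1e6; // bytes scanned per millisecond

function fullScanMs(corpusBytes: number): number {
  return corpusBytes / BYTES_PER_MS; // grows linearly with corpus size
}

function incrementalScanMs(entriesPerSec: number, entryBytes: number, intervalSec: number): number {
  // Only the bytes appended since the last cycle are scanned
  return (entriesPerSec * entryBytes * intervalSec) / BYTES_PER_MS;
}

console.log(fullScanMs(240e6));               // 12  (10-minute corpus)
console.log(fullScanMs(1.44e9));              // 72  (1-hour corpus)
console.log(incrementalScanMs(5000, 80, 30)); // 0.6, regardless of corpus age
```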

Re-searching data you have already searched is the problem. You searched the first 3 million entries 30 seconds ago. You found your results. Those entries have not changed. Searching them again produces the same results and wastes 95% of the GPU's work.

Searched-frontier tracking

Our engine maintains a searched frontier: a byte offset into the corpus buffer that marks the boundary between data that has been searched and data that has not.

Corpus byte buffer:
[==========searched===========|----unsearched----]
                               ^
                          frontier offset

On each detection cycle, the GPU searches only the bytes beyond the frontier. After the search completes, the frontier advances to the end of the buffer. New data appended after the frontier remains unsearched until the next cycle.

How the frontier works

The corpus is stored as a flat Uint8Array in a SharedArrayBuffer, with a parallel offset array (Uint32Array) that records the byte position where each document begins. Both structures are append-only during a session.

interface StreamingCorpus {
  bytes: Uint8Array;          // Raw document bytes, contiguous
  offsets: Uint32Array;       // Document start positions
  byteLength: number;         // Current end of written data
  documentCount: number;      // Current document count
  maxByteLength: number;      // Fixed capacity of the pre-allocated buffer
  searchedByteOffset: number; // Frontier: bytes searched up to here
  searchedDocIndex: number;   // Frontier: documents searched up to here
}

When new log entries arrive, they are appended to the byte buffer and the offset array extends:

function appendDocuments(corpus: StreamingCorpus, newDocs: Uint8Array[]) {
  for (let i = 0; i < newDocs.length; i++) {
    const doc = newDocs[i];
    corpus.bytes.set(doc, corpus.byteLength);
    corpus.offsets[corpus.documentCount] = corpus.byteLength;
    corpus.byteLength += doc.length;
    corpus.documentCount++;
  }
  corpus.offsets[corpus.documentCount] = corpus.byteLength; // End sentinel: bounds the final document
  // Frontier does NOT advance. New data is unsearched.
}

The key property: existing data is never re-encoded, re-indexed, or moved. The append operation writes new bytes to the end of the buffer and extends the offset array. Documents that were already in the buffer retain their exact byte positions. Their histograms from previous Phase 1 passes remain valid.
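This stability property is easy to see in a minimal, runnable sketch (plain typed arrays stand in for the SharedArrayBuffer-backed structures; makeCorpus and its capacity parameters are illustrative, not the engine's exact code):

```typescript
// Minimal stand-in for the streaming corpus: append-only bytes plus offsets.
interface StreamingCorpus {
  bytes: Uint8Array;
  offsets: Uint32Array;
  byteLength: number;
  documentCount: number;
  searchedByteOffset: number;
  searchedDocIndex: number;
}

function makeCorpus(capacityBytes: number, maxDocs: number): StreamingCorpus {
  return {
    bytes: new Uint8Array(capacityBytes),
    offsets: new Uint32Array(maxDocs + 1), // +1 leaves room for the end sentinel
    byteLength: 0,
    documentCount: 0,
    searchedByteOffset: 0,
    searchedDocIndex: 0,
  };
}

function appendDocuments(corpus: StreamingCorpus, newDocs: Uint8Array[]): void {
  for (const doc of newDocs) {
    corpus.bytes.set(doc, corpus.byteLength);
    corpus.offsets[corpus.documentCount] = corpus.byteLength;
    corpus.byteLength += doc.length;
    corpus.documentCount++;
  }
  corpus.offsets[corpus.documentCount] = corpus.byteLength; // end sentinel
  // Frontier fields untouched: new data stays unsearched until the next cycle.
}

const enc = new TextEncoder();
const corpus = makeCorpus(1024, 16);
appendDocuments(corpus, [enc.encode("GET /index"), enc.encode("DENIED 10.0.0.7")]);
const firstDocOffset = corpus.offsets[0];

appendDocuments(corpus, [enc.encode("ERROR disk full")]);
// Earlier documents keep their exact byte positions after further appends.
console.log(corpus.offsets[0] === firstDocOffset); // true
```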

Searching only new data

On each detection cycle, the engine uploads only the unsearched portion of the corpus to the GPU:

function searchNewData(corpus: StreamingCorpus, pattern: SearchPattern) {
  const newByteStart = corpus.searchedByteOffset;
  const newByteEnd = corpus.byteLength;
  const newDocStart = corpus.searchedDocIndex;
  const newDocEnd = corpus.documentCount;

  if (newDocEnd === newDocStart) return []; // No new data

  // Upload only the new bytes to GPU (both subarrays are views, not copies)
  const newBytes = corpus.bytes.subarray(newByteStart, newByteEnd);
  const newOffsets = corpus.offsets.subarray(newDocStart, newDocEnd + 1); // +1 includes the end offset of the last document

  // Run two-phase search on the new segment only
  const results = twoPhaseSearch(newBytes, newOffsets, pattern);

  // Advance frontier
  corpus.searchedByteOffset = newByteEnd;
  corpus.searchedDocIndex = newDocEnd;

  return results;
}

The subarray() call creates a view into the existing SharedArrayBuffer. No copy. The GPU receives only the new bytes via device.queue.writeBuffer(). The transfer size is proportional to the new data since the last cycle, not the total corpus size.
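The zero-copy property is plain TypedArray behavior, easy to verify directly:

```typescript
// subarray() returns a view over the same backing buffer: writes through the
// parent are visible through the view, and no bytes are copied.
const bytes = new Uint8Array(16);
const view = bytes.subarray(8, 12);

bytes[8] = 42;
console.log(view[0]);                      // 42
console.log(view.buffer === bytes.buffer); // true: shared backing store
console.log(view.byteOffset);              // 8
```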

Quantifying the savings

For a stream ingesting 5,000 log entries per second with an average entry size of 80 bytes:

| Detection interval | New entries per cycle | New bytes | Full corpus (10 min) | Search reduction |
|---|---|---|---|---|
| 1 second | 5,000 | 400 KB | 240 MB | 99.8% |
| 5 seconds | 25,000 | 2 MB | 240 MB | 99.2% |
| 30 seconds | 150,000 | 12 MB | 240 MB | 95.0% |
| 60 seconds | 300,000 | 24 MB | 240 MB | 90.0% |

At a 30-second detection interval, the GPU searches 12 MB instead of 240 MB. The search time drops from 12 ms (full corpus) to 1.8 ms (new data only) per pattern. For 20 patterns, total sweep time is 36 ms instead of 240 ms.

Critically, the sweep time is constant. It does not grow with corpus age. Whether the full corpus is 240 MB or 2.4 GB, the per-cycle search processes only the data appended since the last frontier advance. The engine can run indefinitely without degradation.

Buffer management for unbounded streams

A stream that runs for hours or days will exceed any fixed buffer size. The engine handles this with a ring buffer strategy.

Pre-allocated ring buffer

At initialization, the engine allocates a fixed-size SharedArrayBuffer for the corpus. The size is configurable (default: 512 MB, sufficient for approximately 6 million 80-byte log entries). The byte buffer and offset array are both backed by this allocation.

When the write position approaches the buffer's end, the engine wraps:

function appendWithWrap(corpus: StreamingCorpus, doc: Uint8Array) {
  if (corpus.byteLength + doc.length > corpus.maxByteLength) {
    // Evict oldest documents to make room
    const evictTarget = corpus.byteLength + doc.length - corpus.maxByteLength + EVICTION_MARGIN;
    evictOldest(corpus, evictTarget);
  }
  // Append as normal
  corpus.bytes.set(doc, corpus.byteLength);
  corpus.offsets[corpus.documentCount] = corpus.byteLength;
  corpus.byteLength += doc.length;
  corpus.documentCount++;
}

Eviction removes the oldest documents from the logical start of the buffer. The evicted region is not zeroed (unnecessary cost). The offset array's base index advances. Searches never reference evicted documents because they were already past the frontier (searched in a previous cycle).

Compaction

After many append-and-evict cycles, the live data may be fragmented (a contiguous block in the middle of the buffer with dead space at the start). Periodically (or when fragmentation exceeds a threshold), the engine compacts: it copies the live data to the start of the buffer, rewrites the offset array, and resets the frontier offset.

Compaction is a CPU operation on the SharedArrayBuffer. It runs between detection cycles (never during a GPU search). For a 200 MB live region, compaction takes 5 to 15 ms (a memcpy of the live bytes plus offset recalculation). Infrequent enough (once every few minutes at most) to have negligible impact on detection latency.
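Compaction can be sketched as an in-place copyWithin plus an offset rebase (a simplified model; CompactableCorpus is an illustrative stand-in for the engine's structures, and concurrency guards are omitted since compaction runs between cycles):

```typescript
// Compaction: slide the live byte range to the front of the buffer, then
// rebase every document offset and the frontier by the same shift.
interface CompactableCorpus {
  bytes: Uint8Array;
  offsets: Uint32Array;       // absolute start positions of live docs + end sentinel
  liveDocCount: number;
  searchedByteOffset: number; // frontier, absolute
}

function compact(corpus: CompactableCorpus): void {
  const liveStart = corpus.offsets[0];
  const liveEnd = corpus.offsets[corpus.liveDocCount];
  // In-place memmove of the live region to offset 0
  corpus.bytes.copyWithin(0, liveStart, liveEnd);
  // Rebase offsets (including the end sentinel) and the frontier
  for (let i = 0; i <= corpus.liveDocCount; i++) corpus.offsets[i] -= liveStart;
  corpus.searchedByteOffset -= liveStart;
}

// Two live docs at bytes 4..6 and 7..9 of a 10-byte buffer, frontier at 10.
const c: CompactableCorpus = {
  bytes: new Uint8Array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
  offsets: new Uint32Array([4, 7, 10]),
  liveDocCount: 2,
  searchedByteOffset: 10,
};
compact(c);
console.log(Array.from(c.offsets)); // [0, 3, 6]
console.log(c.searchedByteOffset);  // 6
```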

GPU buffer synchronization

The GPU does not read from the SharedArrayBuffer directly. WebGPU requires data in GPU-accessible buffers created via device.createBuffer(). The engine must synchronize the new data from the CPU-side SharedArrayBuffer to the GPU-side storage buffer on each detection cycle.

Incremental GPU buffer upload

The engine maintains a GPU-side corpus buffer that mirrors the CPU-side SharedArrayBuffer. On each cycle, only the new bytes (from the previous frontier to the current write position) are uploaded:

device.queue.writeBuffer(
  gpuCorpusBuffer,
  corpus.searchedByteOffset,  // Offset into GPU buffer
  corpus.bytes,
  corpus.searchedByteOffset,  // Offset into CPU buffer
  newByteCount                // Only the new bytes
);

The writeBuffer call with an offset writes to a specific position in the existing GPU buffer without touching the rest. The previously uploaded data remains in GPU memory undisturbed. Transfer cost is proportional to the new data only. One practical wrinkle: writeBuffer requires the destination offset and write size to be multiples of 4 bytes, so the frontier offset must either stay 4-byte aligned or the write must be widened slightly to satisfy the alignment.

For 12 MB of new data on PCIe 4.0 x16: upload time is approximately 0.5 ms. This is the only transfer per cycle. The pipeline fusion principle applies: both Phase 1 and Phase 2 of the two-phase search read from the same GPU buffer with no intermediate readback.

GPU buffer resizing

When the CPU-side buffer grows beyond the current GPU buffer allocation, the engine creates a new, larger GPU buffer and copies the existing contents. This is a GPU-to-GPU copy (fast, no PCIe traversal on discrete GPUs) followed by deallocation of the old buffer.

The engine over-allocates by 2x to amortize resizing cost. A GPU buffer that starts at 32 MB grows to 64 MB, 128 MB, 256 MB, and so on. For a 512 MB CPU corpus, the GPU buffer resizes at most 4 times during the entire session. Total resize overhead: under 10 ms cumulative.
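The doubling policy itself is a few lines (this sketch covers only the size computation; the actual resize also performs the GPU-to-GPU copy via copyBufferToBuffer on a command encoder, which is omitted here, and the function name is illustrative):

```typescript
// Double the GPU buffer size until the required corpus fits.
// Doubling amortizes resize cost: a 512 MB corpus starting from a 32 MB
// buffer triggers at most 4 resizes over the whole session.
function nextGpuBufferSize(currentBytes: number, requiredBytes: number): number {
  let size = currentBytes;
  while (size < requiredBytes) size *= 2;
  return size;
}

// A 32 MB buffer grows to 256 MB to hold a 200 MB corpus.
console.log(nextGpuBufferSize(32 << 20, 200 << 20) >> 20); // 256
```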

Match density monitoring and contention fallback

Not every detection cycle produces sparse results. A threat event (active attack, infrastructure failure, configuration error flooding logs) can cause match density to spike. When a pattern like ERROR or DENIED suddenly matches 15% of incoming log entries instead of the usual 0.5%, the GPU's atomic contention characteristics change.

The contention problem with dense matches

The two-phase search pipeline uses atomicOr to set bits in the candidate bitmask during Phase 1, and atomicAdd to compact candidate indices during Phase 2. At low match density (under 5%), contention on these atomics is negligible. At high density (above 10%), hundreds of threads per cycle contend on bitmask words and the compaction counter. Throughput collapses non-linearly.

Density detection

The GPU inhibition scoring engine evaluates estimated match density before GPU dispatch. If the estimate exceeds 100 matches per 1,000 input elements (10%), the search routes to the Web Worker tier instead of the GPU.
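One plausible way to estimate density cheaply is to sample a prefix of the new documents on the CPU before choosing a dispatch path (a sketch under that assumption; the engine's actual estimator is not specified here, but the 10% cutoff mirrors the threshold above):

```typescript
// Estimate match density from a small CPU-side sample of the new documents.
// sampleSize is an illustrative default.
function estimateDensity(docs: string[], pattern: string, sampleSize = 1000): number {
  const n = Math.min(sampleSize, docs.length);
  let matches = 0;
  for (let i = 0; i < n; i++) {
    if (docs[i].includes(pattern)) matches++;
  }
  return n === 0 ? 0 : matches / n;
}

// Categorical cutoff: more than 100 matches per 1,000 elements inhibits the GPU.
function shouldInhibitGpu(density: number): boolean {
  return density > 0.1;
}

const sample = ["ERROR a", "ok", "ERROR b", "ok"];
console.log(shouldInhibitGpu(estimateDensity(sample, "ERROR"))); // true (density 0.5)
```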

On the Web Worker tier, each worker processes a chunk of documents from the SharedArrayBuffer using native String.prototype.includes(). Eight workers with thread-local result arrays produce zero atomic contention. For 150,000 documents at 10% match density, the CPU search completes in approximately 2.1 ms. This avoids the non-linear performance collapse that atomic contention would cause on the GPU at high density.
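A single-threaded sketch of the Worker-tier strategy: each chunk collects matches into its own local array, so merging happens exactly once at the end with no shared counter (in production each chunk runs in its own Web Worker reading from the SharedArrayBuffer; the function name and chunking are illustrative):

```typescript
// Chunked CPU search with per-chunk result arrays: no shared mutable state,
// hence no contention. In the real engine each chunk is a separate Worker.
function chunkedSearch(docs: string[], pattern: string, chunks: number): number[] {
  const perChunk: number[][] = Array.from({ length: chunks }, () => []);
  const chunkSize = Math.ceil(docs.length / chunks);
  for (let c = 0; c < chunks; c++) {
    const local = perChunk[c]; // thread-local result array in the real engine
    const start = c * chunkSize;
    const end = Math.min(start + chunkSize, docs.length);
    for (let i = start; i < end; i++) {
      if (docs[i].includes(pattern)) local.push(i);
    }
  }
  return perChunk.flat(); // single merge at the end, no atomics
}

const logs = ["auth ok", "DENIED 10.0.0.7", "auth ok", "DENIED 10.0.0.9"];
console.log(chunkedSearch(logs, "DENIED", 2)); // [1, 3]
```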

The penalty mechanism

High match density (exceeding 100 matches per 1,000 input elements) is one of the six categorical inhibition factors in the GPU inhibition scoring engine. When triggered, the engine assigns a penalty of negative infinity to the GPU dispatch score, routing the search to the Web Worker tier. This is the same categorical mechanism used for branch divergence inhibition and other hard cutoffs. The categorical factors are hard boundaries that cannot be overridden by the continuous scoring factors or the self-calibrating crossover thresholds.

Threat detection pipeline architecture

Putting it together, here is the full pipeline for a SIEM platform running 30-second detection cycles:

Cycle start

  1. Ingest. New log entries from the last 30 seconds are appended to the SharedArrayBuffer corpus. 150,000 new entries, 12 MB.
  2. GPU buffer sync. The 12 MB of new bytes are uploaded to the GPU corpus buffer at the frontier offset. Transfer: 0.5 ms.

Pattern sweep

  1. For each of 20 detection patterns:
    • The GPU inhibition scoring engine evaluates the pattern against the six categorical inhibition factors and eight continuous scoring factors.
    • If no categorical inhibition is triggered and the continuous score exceeds the crossover threshold: the two-phase GPU pipeline executes (Phase 1 histogram pre-filter, Phase 2 byte matching) against the 150,000 new documents.
    • If categorical inhibition is triggered (e.g., estimated match density exceeds 100 per 1,000): the search routes to Web Workers.
    • Results (matching document indices) are appended to the cycle's result set.

Average per-pattern timing:

| Dispatch path | Time |
|---|---|
| GPU two-phase (low density) | 1.8 ms |
| Web Workers (high density fallback) | 3.0 ms |

For 20 patterns (18 low density on GPU, 2 high density on Workers): (18 x 1.8) + (2 x 3.0) = 38.4 ms.

Correlation

  1. Cross-pattern correlation. Matching document indices from all 20 patterns are intersected and time-windowed. Documents that match multiple threat indicators within a time window are flagged as high-priority alerts. This is a CPU operation on the small result set (typically hundreds of matches from 150,000 documents). Time: under 1 ms.
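The correlation step reduces to counting, per document, how many distinct patterns matched it (a simplified sketch: time-window bucketing is omitted, and minPatterns is an illustrative parameter for the "multiple threat indicators" threshold):

```typescript
// Flag documents matched by at least minPatterns distinct detection patterns.
// Input: one array of matching document indices per pattern.
function correlate(matchesPerPattern: number[][], minPatterns: number): number[] {
  const count = new Map<number, number>();
  for (const docIdxs of matchesPerPattern) {
    // De-duplicate within a pattern so repeats don't inflate the count
    for (const d of new Set(docIdxs)) {
      count.set(d, (count.get(d) ?? 0) + 1);
    }
  }
  return Array.from(count.entries())
    .filter(([, n]) => n >= minPatterns)
    .map(([d]) => d)
    .sort((a, b) => a - b);
}

// Doc 7 matches both patterns; docs 3 and 9 match only one.
console.log(correlate([[3, 7], [7, 9]], 2)); // [7]
```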

Frontier advance

  1. Advance frontier. The searched-byte and searched-document offsets advance to the current write position. The next cycle will search only data appended after this point.

Total cycle time

| Step | Time |
|---|---|
| GPU buffer sync | 0.5 ms |
| 20-pattern sweep | 38.4 ms |
| Cross-pattern correlation | 0.8 ms |
| Frontier advance | 0.01 ms |
| Total | 39.7 ms |

Under 40 ms per 30-second cycle. The detection system uses 0.13% of the available wall-clock time. It could run at 1-second intervals (searching 5,000 new entries per cycle) and still complete in under 5 ms, leaving 99.5% of CPU and GPU time for the application.

Scaling characteristics

The pipeline's per-cycle cost depends on the ingestion rate and detection interval, not the corpus age.

| Ingestion rate | Detection interval | New entries per cycle | Per-cycle search time (20 patterns) |
|---|---|---|---|
| 1,000/sec | 30 sec | 30,000 | 8 ms |
| 5,000/sec | 30 sec | 150,000 | 39 ms |
| 10,000/sec | 30 sec | 300,000 | 74 ms |
| 5,000/sec | 5 sec | 25,000 | 7 ms |
| 5,000/sec | 1 sec | 5,000 | 2 ms |

At 10,000 entries per second (the upper end for a single-site SIEM deployment), the 30-second cycle completes in 74 ms. Well within budget. At 1-second intervals for near-real-time detection, the same stream requires only 2 ms per cycle.

The corpus can accumulate to any size. The buffer ring evicts old data when the allocation limit is reached. Searches never touch evicted data (it was already searched). The only memory constraint is the SharedArrayBuffer allocation size, which is configurable up to the browser's per-origin memory limit (typically 2 to 4 GB).

Application: logistics exception tracking

The same architecture applies to logistics operations. A fleet management platform tracks 5,000 vehicles, each reporting GPS position, speed, fuel level, cargo temperature, and exception codes every 10 seconds. That is 500 events per second, each 120 bytes. Over a 12-hour shift: 21.6 million events, 2.6 GB.

The operations team needs to search for exception patterns in real-time: temperature deviations, route deviations, extended idle periods, repeated error codes, geo-fence breaches. Each pattern is a text search or structured filter across the event stream.

With searched-frontier tracking, each 30-second cycle searches only the 15,000 new events (1.8 MB). The 20-pattern sweep completes in under 10 ms. The operations dashboard updates in real-time. No server round-trip. No batch job. No 15-minute refresh cycle.

The corpus ring buffer evicts events older than the retention window (configurable: 2 hours, 4 hours, full shift). Historical queries beyond the retention window fall back to the server-side archive. Live queries against the current window run entirely in the browser at GPU speed.

The engineering principle

Streaming data is an append-only problem. The searched-frontier mechanism reduces it to a bounded problem: regardless of how long the stream has been running, each search cycle processes a fixed window of new data.

The two-phase GPU pipeline makes each search fast. The frontier makes each search small. The contention monitor prevents dense result spikes from crashing the GPU. The device loss handler ensures that if the GPU fails, detection continues on the CPU without interruption.

This is what production-grade enterprise AI automation looks like for streaming workloads. Not a system that works for 10 minutes and then slows down as the corpus grows. A system that runs at constant speed for hours, days, or weeks, because it never re-searches data it has already seen.

Want to discuss how this applies to your business?

Book a Discovery Call