The latency budget for face-scan recognition
A guest walks into a hotel lobby. A camera captures their face. A vision model extracts an embedding vector. The CRM must resolve that vector against the guest database, retrieve the profile, determine VIP status, pull stay history, check loyalty tier, and surface personalized preferences to the front desk screen.
The guest is walking. They reach the desk in 3 to 5 seconds. The recognition result must appear before they arrive. The total latency budget from camera frame to on-screen profile: under 2 seconds.
The vision model (embedding extraction, nearest-neighbour lookup) consumes 800 ms to 1.2 seconds depending on hardware. That leaves 800 ms to 1.2 seconds for everything else: CRM query, profile assembly, UI render.
A standard SaaS CRM handles this with a server round-trip. The client sends the matched guest ID to the API. The server queries PostgreSQL or a similar RDBMS. The result traverses the network back to the client.
Best case: 150 ms. Enterprise environment with VPN, multi-region database, connection pooling overhead, and JSON serialization: 250 to 400 ms. Add a second query for stay history and a third for loyalty details, and you are at 450 ms to 1.2 seconds of network-bound latency. The budget is consumed. There is no room for UI animation, graceful loading states, or fallback retries.
The query itself is fast. PostgreSQL resolves a primary key lookup in under 1 ms. The server-side application logic takes 2 to 5 ms. The remaining 145 to 395 ms is network and serialization overhead. You are not waiting for compute. You are waiting for packets.
Moving the database to the browser
The guest database for a single hotel property is not large. A 500-room hotel with 10 years of guest history holds 200,000 to 500,000 unique guest profiles. Each profile has 15 to 25 fields: name, email, phone, loyalty tier, VIP status, preferences (room type, pillow type, dietary requirements, minibar preferences), lifetime spend, visit count, last stay date, notes.
In columnar format with dictionary encoding, this dataset occupies 20 to 40 MB. That loads in 1 to 3 seconds on a standard enterprise connection. After the initial load, every query runs locally. No network. No server. No serialization.
The question is whether the browser can query 500,000 profiles fast enough to fit within the latency budget.
With our Adaptive WebGPU Data Query Engine: yes. By a wide margin.
Why columnar storage matters
Standard CRM data models are row-oriented. Each guest record is a JavaScript object with named properties. Querying means iterating over an array of objects, accessing a property on each one.
// Row-oriented: 500,000 objects
const vips = guests.filter(g => g.vipStatus === 'VIP' && g.lastStay > cutoffDate);
This is slow for three reasons. First, each property access goes through V8's hidden-class lookup and a pointer dereference. Second, the objects are scattered across heap memory with poor cache locality. Third, the filter callback is invoked 500,000 times through the engine's function call machinery.
Columnar storage inverts the layout. Each field becomes a contiguous typed array:
// Columnar: one array per field
const vipStatusCol = new Uint32Array(500_000); // dictionary-encoded
const lastStayCol = new Float64Array(500_000); // epoch timestamps
const nameCol = new Uint32Array(500_000); // dictionary-encoded
const lifetimeSpendCol = new Float64Array(500_000);
A filter scan on vipStatusCol reads a contiguous block of memory. The CPU's prefetcher detects the sequential access pattern and loads cache lines ahead of the read pointer. For a 500,000-element Uint32Array (2 MB), the entire column fits in L3 cache after a single scan. Subsequent queries on the same column hit cache.
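The CPU-side scan above can be sketched as follows. This is a minimal, hypothetical helper (not the engine's actual API); the column names and dictionary codes are illustrative:

```javascript
// Hypothetical sketch of a columnar filter scan. Column layouts follow the
// typed arrays above; names and dictionary codes are illustrative.
function filterScan(vipStatusCol, lastStayCol, vipCode, cutoff) {
  const matches = [];
  // Sequential reads over two contiguous typed arrays: the CPU prefetcher
  // streams cache lines ahead of the loop.
  for (let i = 0; i < vipStatusCol.length; i++) {
    if (vipStatusCol[i] === vipCode && lastStayCol[i] > cutoff) {
      matches.push(i);
    }
  }
  return matches;
}

// Tiny illustrative columns (3 = 'VIP' in the sorted dictionary):
const vipCol = Uint32Array.from([3, 2, 0, 3]);
const stayCol = Float64Array.from([1.7e12, 1.6e12, 1.7e12, 1.5e12]); // epoch ms
filterScan(vipCol, stayCol, 3, 1.65e12); // → [0]
```

The same loop shape maps directly onto a GPU thread per row: each thread evaluates the predicate for one index.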
On the GPU, columnar storage enables coalesced reads. Adjacent threads in a workgroup read adjacent memory addresses, which the memory controller batches into a single bus transaction. Row-oriented data forces each thread to chase pointers to different heap locations. The reads are uncoalesced, wasting 75% to 90% of memory bandwidth.
Dictionary encoding for GPU string processing
Half the fields in a guest profile are strings: name, email, loyalty tier, VIP status, room preferences, dietary requirements. WebGPU compute shaders cannot process variable-length strings. WGSL has no string type. No strcmp. No Unicode handling.
Dictionary encoding solves this. During data ingestion, the engine builds a sorted dictionary of unique values for each string column and replaces every string with its integer index:
// VIP status column: 4 unique values
const vipDict = ["Bronze", "Gold", "Standard", "VIP"]; // sorted
// 500,000 rows encoded as integer indices
const vipStatusCol = new Uint32Array([
3, 2, 0, 3, 1, 2, 3, 0, ... // 3=VIP, 2=Standard, 0=Bronze, 1=Gold
]);
A filter WHERE vipStatus = 'VIP' becomes WHERE vipStatusCol[i] == 3. A single u32 comparison per row. The GPU evaluates this with one instruction per thread. No string allocation. No byte-by-byte matching. No variable-length handling.
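The ingestion-time encoding step can be sketched like this, assuming hypothetical helper names rather than the engine's actual API:

```javascript
// Hypothetical sketch of dictionary encoding at ingestion: build a sorted
// dictionary of unique values, then replace each string with its index.
function dictionaryEncode(strings) {
  const dict = [...new Set(strings)].sort();
  const index = new Map(dict.map((v, i) => [v, i]));
  const codes = Uint32Array.from(strings, s => index.get(s));
  return { dict, codes };
}

const { dict, codes } = dictionaryEncode(["VIP", "Standard", "Bronze", "VIP", "Gold"]);
// dict  → ["Bronze", "Gold", "Standard", "VIP"]
// codes → Uint32Array [3, 2, 0, 3, 1]

// WHERE vipStatus = 'VIP' compiles to a single integer comparison per row:
const target = dict.indexOf("VIP"); // 3
```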
Dictionary encoding statistics
For typical hospitality CRM data:
| Column | Unique values | Dictionary size | Encoded column (500K rows) |
|---|---|---|---|
| VIP status | 4 | 64 bytes | 2 MB (Uint32Array) |
| Loyalty tier | 6 | 96 bytes | 2 MB |
| Room preference | 12 | 240 bytes | 2 MB |
| Dietary requirements | 18 | 360 bytes | 2 MB |
| Country | 195 | ~4 KB | 2 MB |
| City | ~8,000 | ~160 KB | 2 MB |
Every string column, regardless of the original string lengths, encodes to a 2 MB Uint32Array for 500,000 rows. The total dictionary overhead for all categorical columns is under 200 KB. The GPU receives only the integer arrays. The dictionaries stay on the CPU for result display (mapping indices back to human-readable strings after the query completes).
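The CPU-side decode step after a query completes is trivial. A hypothetical sketch:

```javascript
// Hypothetical sketch of result decoding: query results come back as integer
// codes; the CPU-resident dictionary maps them to display strings.
function decodeColumn(codes, dict) {
  return Array.from(codes, c => dict[c]);
}

const vipDict = ["Bronze", "Gold", "Standard", "VIP"];
decodeColumn(Uint32Array.from([3, 0, 1]), vipDict); // → ["VIP", "Bronze", "Gold"]
```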
Compound string predicates
Complex filters combine multiple dictionary-encoded columns:
WHERE vipStatus = 'VIP' AND loyaltyTier IN ('Gold', 'Platinum') AND country = 'UAE'
Each predicate resolves to integer comparisons at query compilation time. The compiler looks up 'VIP' in the vipStatus dictionary (index 3), 'Gold' and 'Platinum' in the loyalty dictionary (indices 1 and 4), and 'UAE' in the country dictionary (index 178). The GPU shader evaluates:
let match = (vip_col[idx] == 3u)
&& (loyalty_col[idx] == 1u || loyalty_col[idx] == 4u)
&& (country_col[idx] == 178u);
Three integer comparisons and two logical ORs. No string operations anywhere in the hot path.
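The compile-time resolution step can be sketched as follows. The helper and the 6-value loyalty dictionary are hypothetical, chosen so that 'Gold' and 'Platinum' land at indices 1 and 4 as in the example above:

```javascript
// Hypothetical sketch of predicate compilation: string literals resolve to
// dictionary indices once, at query compile time, before the scan begins.
function compileInPredicate(dict, literals) {
  const codes = new Set(literals.map(l => dict.indexOf(l)));
  return code => codes.has(code);
}

// Illustrative 6-value loyalty dictionary ('Gold' = 1, 'Platinum' = 4):
const loyaltyDict = ["Bronze", "Gold", "Member", "None", "Platinum", "Silver"];
const matchLoyalty = compileInPredicate(loyaltyDict, ["Gold", "Platinum"]);
matchLoyalty(1); // → true  ('Gold')
matchLoyalty(5); // → false ('Silver')
```

In the GPU path the resolved indices are baked into the shader (or passed as uniforms), so the hot loop never touches the dictionary at all.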
The 6-factor scoring function
Not every operator in a CRM query belongs on the GPU. Our engine evaluates each operator independently using a 6-factor scoring function and routes it to the optimal tier.
Factor 1 (F1): Row count vs threshold
The number of rows entering the operator, evaluated against hardware-specific thresholds. F1 carries a weight of 4.0. For a discrete GPU, the threshold is 50,000 rows; for an integrated GPU, 100,000 rows. For the first operator in the pipeline, this is the full guest count (500,000). For downstream operators after a selective filter, it may be 5,000 or fewer.
A filter on 500,000 rows scores high for GPU dispatch. A sort on 50 filtered results scores low. The GPU's overhead (buffer allocation, shader dispatch) is not justified for tiny result sets.
Factor 2 (F2): Operator-specific SQL metric
F2 captures the workload characteristic specific to the SQL operator type. For filter operators, this is predicate selectivity: the fraction of rows that pass the filter. Estimated from per-column statistics maintained at ingestion: min, max, null count, and a 64-bucket histogram. For vipStatus = 'VIP' on a hotel with 8% VIP guests, estimated selectivity is 8%.
For GROUP BY operators, F2 is the Chao1 group cardinality estimate. Low group count (GROUP BY vipStatus: 4 groups) is GPU-friendly. High group count (GROUP BY guestId: 500,000 groups) is GPU-hostile. For dictionary-encoded columns, exact group cardinality is the dictionary size. For composite keys (GROUP BY country, loyaltyTier), the engine uses the Chao1 species richness estimator on a sampled cross-product.
For join operators, F2 is the join key overlap ratio: the fraction of keys in the smaller relation that have matches in the larger relation.
Factor 3 (F3): GPU class adjustment
Adjusts the score based on the detected GPU class. A front desk terminal with a discrete GPU receives a favourable adjustment. A tablet with an integrated GPU receives a penalty reflecting its lower memory bandwidth. The hardware capability detector determines the GPU class at initialisation.
Factor 4 (F4): Vendor tuning
Hardware vendor-specific tuning coefficients that account for differences in atomic throughput, shared memory size, and dispatch overhead across GPU vendors and generations.
Factor 5 (F5): GPU buffer retention bonus
When a preceding operator has already produced its output in a GPU buffer (via the GPUResidentDataset class), the next operator receives a bonus for keeping execution on the GPU. The pipeline executor performs multi-pass re-scoring (up to 3 iterations) to propagate buffer retention bonuses through the operator pipeline. The GPU buffer retention bonus feeds back into the dispatch scoring model to create cascading GPU segment formation across multi-operator plans.
Factor 6 (F6): Hardware-adaptive buffer threshold
The hardware-specific buffer size threshold derived from the hardware capability detector's runtime probing. This accounts for the device's actual GPU buffer limits and memory bandwidth, normalising the scoring function across hardware classes. The same query produces different routing decisions on a front desk workstation versus a concierge tablet.
Score computation and tier routing
The six factors combine into a dispatch score that routes each operator to one of three execution tiers. If the score is positive, the operator dispatches to the WebGPU compute pipeline. If the score is non-positive and the row count falls within a defined medium range (between 10,000 and 500,000 rows in our preferred configuration), the operator dispatches to the Web Worker thread pool. Otherwise, the operator executes on the CPU main thread. If branch divergence or Float32 ordering-preservation safety checks trigger categorical penalties, the score is overridden to negative infinity regardless of the other factors (using the same categorical inhibition principle covered by our GPU Inhibition patent). The pipeline executor provides transparent CPU fallback on GPU failure.
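The routing logic above can be sketched as follows. Only the F1 weight (4.0) and the worker-pool row range come from the text; the structure, factor scaling, and tier names are illustrative stand-ins:

```javascript
// Hypothetical sketch of score combination and three-tier routing. Factors
// f2..f6 are assumed to arrive pre-scaled by their own weights.
function routeOperator({ f1, f2, f3, f4, f5, f6, rowCount, categoricalPenalty }) {
  // Categorical inhibition: safety checks (branch divergence, Float32
  // ordering preservation) override every other factor.
  if (categoricalPenalty) return { score: -Infinity, tier: "cpu-main" };
  const score = 4.0 * f1 + f2 + f3 + f4 + f5 + f6;
  if (score > 0) return { score, tier: "gpu" };
  if (rowCount >= 10_000 && rowCount <= 500_000) {
    return { score, tier: "worker-pool" };
  }
  return { score, tier: "cpu-main" };
}

routeOperator({ f1: 0.5, f2: 0.25, f3: 0.25, f4: 0, f5: 0, f6: 0.5,
                rowCount: 500_000, categoricalPenalty: false });
// → { score: 3, tier: "gpu" }
```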
A real CRM query pipeline
A guest is identified by the face-scan system. The CRM receives the matched guest ID and must assemble a full profile. The query:
SELECT g.name, g.vipStatus, g.loyaltyTier, g.lifetimeSpend,
g.roomPreference, g.dietaryRequirements,
COUNT(s.stayId) as totalStays,
MAX(s.checkoutDate) as lastStay,
SUM(s.totalSpend) as recentSpend
FROM guests g
LEFT JOIN stays s ON g.guestId = s.guestId AND s.checkoutDate > '2024-01-01'
WHERE g.guestId = 12847
GROUP BY g.guestId
In a traditional CRM, this is a server round-trip. In our engine, the data is already in the browser. The query compiles to four operators:
| Operator | Input rows | Score | Routed to | Time |
|---|---|---|---|---|
| Filter (guestId = 12847) | 500,000 | 1.6 | GPU | 0.4 ms |
| Join (stays on guestId, date filter) | 1 guest x ~45 stays | 0.02 | CPU main thread | 0.01 ms |
| Aggregate (COUNT, MAX, SUM) | 45 rows | 0.001 | CPU main thread | < 0.01 ms |
| Projection (select columns) | 1 row | 0.001 | CPU main thread | < 0.01 ms |
Total query time: 0.4 ms. The filter runs on the GPU because it scans all 500,000 rows (high input row count). Every downstream operator runs on the main thread because the result set is tiny.
But single-guest lookup is the simple case. The powerful scenario is aggregate analytics.
Aggregate dashboard queries
The hotel operations manager opens a dashboard. They want to see VIP distribution by country for guests who stayed in the last 12 months. The query:
SELECT country, vipStatus, COUNT(*) as guestCount, AVG(lifetimeSpend) as avgSpend
FROM guests
WHERE lastStay > '2024-12-29'
GROUP BY country, vipStatus
ORDER BY guestCount DESC
| Operator | Input rows | Selectivity / Groups | Score | Routed to | Time |
|---|---|---|---|---|---|
| Filter (lastStay > cutoff) | 500,000 | ~40% selectivity | 1.8 | GPU | 1.1 ms |
| GroupBy (country x vipStatus) | ~200,000 | Chao1: ~780 groups | 1.4 | GPU | 1.9 ms |
| Sort (guestCount DESC) | 780 | n/a | 0.01 | CPU main thread | 0.1 ms |
Total: 3.1 ms. The filter and group-by both run on the GPU. The GPUResidentDataset keeps the intermediate buffer in GPU memory between them (no CPU round-trip), with the GPU buffer retention bonus (F5) ensuring the group-by operator's score reflects the data already being GPU-resident. The 780-row grouped result is read back to the CPU for a trivial sort.
The operations manager adjusts the date range. The query re-executes. 3.1 ms later, the chart updates. No loading spinner. No skeleton screen. No "Refreshing data..." toast.
On a server-round-trip architecture, the same interaction takes 150 to 400 ms. The manager notices. They wait. They click less. They explore less. The dashboard that was built to surface insights becomes a tool that punishes curiosity with latency.
The face-scan latency budget revisited
With the query engine running locally, here is the full pipeline from camera frame to on-screen profile:
| Stage | Time |
|---|---|
| Frame capture and preprocessing | 30 ms |
| Face embedding extraction (vision model) | 400 ms |
| Nearest-neighbour lookup (embedding index) | 50 ms |
| CRM profile query (our engine, local) | 0.4 ms |
| UI render (React, single component update) | 8 ms |
| Total | ~489 ms |
Under 500 ms total, and under 60 ms for everything after the embedding extraction. The guest is still 4 steps from the desk.
Replace the local CRM query with a server round-trip:
| Stage | Time |
|---|---|
| Frame capture and preprocessing | 30 ms |
| Face embedding extraction | 400 ms |
| Nearest-neighbour lookup | 50 ms |
| CRM API call (network + query + response) | 250 ms |
| UI render | 8 ms |
| Total | ~738 ms |
Still under 1 second in the best case. But add VPN overhead (common in hotel chains), database connection pool exhaustion during peak check-in hours, and a second query for loyalty details, and you are at 1.2 to 1.8 seconds. The budget is tight. There is no margin for retry on network failure.
With the local engine, the CRM query is 0.4 ms. There is room for three redundant queries, a full stay history lookup, and a loyalty calculation before you reach 10 ms. The network was the bottleneck. We removed the network.
Multi-property data architecture
Hotel chains operate across dozens or hundreds of properties. A single-property dataset (500,000 profiles) fits in the browser comfortably. A chain-wide dataset (5 million to 50 million profiles) does not.
Our architecture handles this with a tiered data strategy:
Local tier (browser). The current property's guest database. Full columnar dataset, dictionary-encoded, cached in browser memory. All queries run locally. This covers 95% of front desk interactions (guests who have stayed at this property before).
On-demand tier (server). For guests not found in the local dataset (first-time visitors to this property who have stayed at other properties in the chain), the engine falls back to a server query. The result is cached locally for the duration of the session.
Sync tier (background). Overnight, the local dataset is refreshed with updated chain-wide data for guests likely to visit (based on reservations, loyalty programme activity, and seasonal patterns). This pre-populates the local cache with profiles that will be needed during the next day's operations.
The engine abstracts the tier boundary. The application submits a query to the engine. If the data is local, the query runs in 0.4 ms. If the data requires a server fetch, the engine handles the round-trip transparently. The application code does not branch on data location.
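The tier-transparent query path can be sketched like this. The `lookup` method, the session cache, and the `/api/guests` endpoint are illustrative assumptions, not the engine's actual interface:

```javascript
// Hypothetical sketch of the tiered lookup: local columnar query first,
// then an on-demand server fetch whose result is cached for the session.
const sessionCache = new Map();

async function getGuestProfile(guestId, localDataset) {
  const local = localDataset.lookup(guestId); // sub-millisecond local query
  if (local) return local;
  if (sessionCache.has(guestId)) return sessionCache.get(guestId);
  // On-demand tier: a first-time visitor known elsewhere in the chain.
  const res = await fetch(`/api/guests/${guestId}`);
  const profile = await res.json();
  sessionCache.set(guestId, profile); // cached for the session
  return profile;
}
```

The calling code is identical whether the profile was resolved in 0.4 ms locally or 250 ms over the network; only the latency differs.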
Why this is faster than any SaaS CRM
SaaS CRMs are architecturally constrained by their deployment model. The data lives in a multi-tenant database in a data centre. Every query crosses the network. Every interaction pays the latency tax.
Even "fast" SaaS CRMs with edge caching and CDN-proxied APIs cannot eliminate the fundamental round-trip. A cached API response is still a network request. HTTP/2 multiplexing reduces connection overhead but not propagation delay. GraphQL reduces payload size but not latency.
Our engine eliminates the round-trip for the 95% of queries that can be served from local data. The remaining 5% fall back to a server call. The average query latency across all interactions: under 5 ms. The p99 (server fallback for unknown guests): 250 ms.
No SaaS CRM built on a server-query architecture can match sub-5 ms average query latency. The physics of network propagation prevents it. Moving the data to the client and querying it on the GPU is not an optimization of the existing model. It is a different model.
This is the architecture behind our enterprise AI automation infrastructure applied to hospitality. Probe the hardware. Load the data locally. Query it at hardware speed. Reserve the network for what the network is actually needed for: synchronization and data that does not fit locally. The result is a CRM that responds before the guest reaches the desk.