AI Inference
The process of running a trained AI model on input data to produce an output, distinguished from training (which produces the model) and fine-tuning (which adapts it).
How it works
Inference is what production AI actually does at runtime. Every prompt sent to a language model, every embedding computed, every classification produced is an inference call. Inference cost and latency dominate the economics of production AI: a system handling 100,000 inferences per day at 0.5 seconds each is a different operational beast from one handling 10 inferences per day at 30 seconds each. Optimisation levers include model quantisation (running at 4-bit or 8-bit precision rather than 16-bit), batching, key-value caching, and choice of inference framework (vLLM, TensorRT-LLM, llama.cpp, and others). For on-premise UK deployments, inference engineering is where Ayoob AI's GPU compute infrastructure expertise and patent portfolio matter most.
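For scale, the first workload keeps a GPU busy for roughly 14 hours a day (100,000 × 0.5 s ≈ 50,000 s), while the second needs about five minutes in total. Below is a minimal sketch of batched, quantised inference, assuming vLLM's offline generation API; the model path, quantisation method, prompts, and sampling settings are illustrative assumptions, not a recommended configuration.

```python
# Minimal sketch: batched, quantised inference via vLLM's offline API.
# The model path is a placeholder and should point to an AWQ-quantised
# checkpoint actually available on disk.
from vllm import LLM, SamplingParams

prompts = [
    "Summarise the following contract clause: ...",
    "Classify this support ticket as billing, technical, or other: ...",
]
params = SamplingParams(temperature=0.2, max_tokens=128)

# quantization="awq" loads 4-bit AWQ weights instead of 16-bit ones,
# shrinking the memory footprint; vLLM batches the prompts and manages
# the key-value cache across decoding steps.
llm = LLM(model="/models/llama-3.1-8b-awq", quantization="awq")

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

Batching the prompts into a single call lets the engine share each weight read across requests, which is where most of the throughput gain at high request volumes comes from.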
Related terms
Model Serving
The infrastructure layer that exposes a trained AI model as a network endpoint, handling request batching, load balancing, autoscaling, observability, and version management.
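As an illustration of that layer at its simplest, here is a sketch of a single-model HTTP endpoint, assuming FastAPI; the /v1/generate route, request schema, and stubbed generate() function are hypothetical, and a real serving stack wraps batching, load balancing, autoscaling, and metrics around this handler.

```python
# Minimal sketch of a serving endpoint, assuming FastAPI; the route,
# request schema, and stubbed generate() are illustrative only.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
MODEL_VERSION = "demo-0.1"  # version management: tag every response


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


class GenerateResponse(BaseModel):
    text: str
    model_version: str


def generate(prompt: str, max_tokens: int) -> str:
    # Placeholder for the real inference call (vLLM, TensorRT-LLM, llama.cpp, ...).
    return prompt[:max_tokens]


@app.post("/v1/generate", response_model=GenerateResponse)
def serve(req: GenerateRequest) -> GenerateResponse:
    # In production this handler sits behind a load balancer and autoscaler,
    # with request batching and observability wrapped around it.
    return GenerateResponse(
        text=generate(req.prompt, req.max_tokens),
        model_version=MODEL_VERSION,
    )
```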
Large Language Model (LLM)
A neural network trained on large text corpora to predict the next token given context, used for text generation, summarisation, classification, and reasoning tasks across enterprise software.
Private AI
AI deployed on infrastructure the client controls (on-premise, in the client's cloud tenancy, or air-gapped), with no third-party LLM provider in the data path and no inference-time data export.
Want to see this technology in action?
Book a Discovery Call