Model Serving
The infrastructure layer that exposes a trained AI model as a network endpoint, handling request batching, load balancing, autoscaling, observability, and version management.
How it works
Model serving is to AI what application servers are to web apps: the production substrate. Common serving frameworks include vLLM (high-throughput LLM inference using PagedAttention), TensorRT-LLM (NVIDIA-optimised inference), NVIDIA Triton Inference Server, BentoML, and managed platforms such as Amazon SageMaker, Google Vertex AI, and Azure ML. Production model serving handles concurrent requests, batches them for GPU efficiency, surfaces metrics for observability, supports rolling model upgrades, and integrates with the firm's existing authentication and authorisation. For UK enterprise deployment, model serving infrastructure is part of the system Ayoob AI ships and operates inside the client's tenancy alongside the application code.
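To make this concrete, here is a minimal sketch of how an application might call a model served behind an OpenAI-compatible endpoint such as the one vLLM exposes. The model name, port, and API key are illustrative placeholders, not part of any specific deployment; in a client tenancy the endpoint would sit behind the firm's own authentication.

```python
# Minimal sketch: querying a self-hosted serving endpoint (assumptions labelled below).
# Assumes a vLLM OpenAI-compatible server has already been started, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# The model name, host, and API key here are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # private endpoint; no third-party provider in the data path
    api_key="not-needed-for-local",       # replaced by the firm's own auth in production
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise the key risks in this supplier contract."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Because the serving layer speaks a standard API, the application code stays the same when the underlying model is upgraded or replicas are scaled out; the serving framework handles request batching and scheduling behind the endpoint.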
Related terms
AI Inference
The process of running a trained AI model on input data to produce an output, distinguished from training (which produces the model) and fine-tuning (which adapts it).
Private AI
AI deployed on infrastructure the client controls (on-premise, in the client's cloud tenancy, or air-gapped), with no third-party LLM provider in the data path and no inference-time data export.
On-Premise AI
AI deployed on hardware the client owns and operates inside their own data centre or office facility, with no dependency on external cloud or model providers for inference.
Want to see this technology in action?
Book a Discovery Call