How to Deploy vLLM in Production
- Pin a specific vLLM Docker image tag and configure GPU passthrough with the NVIDIA Container Toolkit.
- Set --max-model-len and --gpu-memory-utilization to balance KV-cache capacity against OOM headroom.
- Enable PagedAttention optimizations including prefix caching and chunked prefill for production workloads.
- Deploy Kubernetes manifests with startup/readiness/liveness probes, topology spread, and GPU resource limits.
- Configure the OpenAI-compatible API with secure environment-based authentication and guided decoding.
- Autoscale replicas using KEDA triggered by per-replica queue depth from Prometheus metrics.
- Monitor TTFT p99, KV-cache utilization, and request queue depth via Prometheus and Grafana dashboards.
- Load-test with benchmark_serving.py under realistic traffic patterns before routing production requests.
Deploying large language models in production demands a careful balance of latency, throughput, and cost. vLLM has emerged as a widely adopted open-source inference engine for production LLM serving, but the gap between running a quick demo and operating a reliable, observable, autoscaling service remains wide. This guide bridges that gap with concrete configurations, deployment manifests, and monitoring setups reflecting vLLM production defaults as of mid-2026.
Compared to alternatives like Hugging Face TGI, NVIDIA TensorRT-LLM, and SGLang, vLLM occupies a distinct position: it offers broad model compatibility, an OpenAI-compatible API out of the box, and a permissive Apache 2.0 license. TensorRT-LLM delivers higher peak throughput on NVIDIA hardware but requires model-specific compilation and tighter vendor lock-in. SGLang shares architectural similarities with vLLM (including PagedAttention-derived memory management) but centers its design on structured generation and frontend-runtime co-design. TGI remains viable for simpler deployments, though its continuous batcher lacks chunked prefill and disaggregated decode, and its model coverage, including multi-modal models, is narrower than vLLM's.
This guide assumes familiarity with Docker, Kubernetes, GPU infrastructure (NVIDIA Container Toolkit, device plugins), and core LLM concepts such as tokenization, KV caches, and quantization.
Prerequisites: All examples in this guide require NVIDIA driver ≥ 525, CUDA ≥ 12.1, and NVIDIA Container Toolkit ≥ 1.14. Docker Engine ≥ 23.0 (with Compose V2) is required for Docker examples. Kubernetes examples require Kubernetes ≥ 1.27 with the NVIDIA GPU Operator, KEDA (Kubernetes Event-driven Autoscaling) v2.x, and cert-manager with a configured letsencrypt-prod ClusterIssuer.
Table of Contents
- vLLM Architecture Essentials for Production Engineers
- Docker Deployment for vLLM
- Kubernetes Deployment for vLLM at Scale
- OpenAI-Compatible API Configuration
- Performance Optimization for Production Workloads
- Monitoring and Observability
- Security and Reliability in Production
- Production Readiness Checklist
vLLM Architecture Essentials for Production Engineers
PagedAttention and Memory Management
PagedAttention is the foundational innovation that separates vLLM from naive inference servers. Traditional KV-cache allocation reserves contiguous GPU memory blocks per sequence, leading to severe internal fragmentation when request lengths vary. PagedAttention borrows virtual memory concepts from operating systems. It splits the KV cache into fixed-size blocks (pages) and maps logical cache positions to non-contiguous physical blocks, reducing fragmentation and letting vLLM pack more concurrent sequences into the same GPU memory.
The practical impact is direct: by reclaiming wasted memory, PagedAttention increases the achievable batch size for a given GPU, which in turn improves throughput. For production workloads with heterogeneous prompt lengths on an 80 GB H100 running a 7B FP16 model, this can be the difference between serving 30 concurrent requests and serving 100+ (exact numbers depend on sequence length distribution and model size).
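The arithmetic behind such estimates can be sketched in a few lines. The architecture numbers below are assumptions based on Llama-3.1-8B's published config (32 layers, 8 grouped-query KV heads, head dimension 128), and the 40 GiB KV-cache budget is hypothetical; substitute your model's config.json values.

```python
# Back-of-envelope KV-cache sizing for an FP16 8B-class model.
num_layers = 32
num_kv_heads = 8      # grouped-query attention
head_dim = 128
dtype_bytes = 2       # FP16

# Key + value, for every layer and KV head, per token:
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")   # 128 KiB

# With a hypothetical 40 GiB KV-cache budget (80 GB H100, minus
# weights, CUDA context, and activations):
kv_budget_bytes = 40 * 1024**3
tokens = kv_budget_bytes // bytes_per_token
print(f"{tokens:,} cacheable tokens")                 # 327,680
print(f"{tokens // 4096} concurrent 4K sequences")    # 80
```

Without paging, fragmentation wastes a large share of that budget, which is how the same hardware ends up serving far fewer concurrent requests.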
Automatic prefix caching (APC) extends this further. When multiple requests share a common system prompt or few-shot prefix, vLLM caches and reuses the KV blocks for that shared prefix rather than recomputing attention for each request. For workloads where every request includes an identical system prompt (common in chatbot and agent deployments), APC reduces time-to-first-token by skipping redundant prefill computation. The magnitude depends on how large the shared prefix is relative to the total prompt; measure with your own workload to quantify the gain.
V1 Engine Architecture (2025-2026)
The V1 engine, which became the default during vLLM's 2025 release series (v0.8.x; verify against the release notes for your version), introduced several structural changes relevant to production operators.
The primary problem the V1 engine addresses is scheduling overhead at high concurrency. In the prior architecture, the scheduler copied intermediate tensors between GPU and CPU during scheduling decisions. V1 pins host memory and uses direct DMA transfers (zero-copy), eliminating those redundant copies during token sampling and output processing.
Disaggregated prefill and decode separates the two phases of autoregressive inference into distinct scheduling domains. Prefill (processing the full input prompt) is compute-bound and benefits from large batch sizes. Decode (generating tokens one at a time) is memory-bandwidth-bound and latency-sensitive. The V1 engine schedules these phases independently, preventing long prefill jobs from blocking decode steps of in-flight requests. This blocking is the primary cause of latency spikes in production serving.
Chunked prefill complements disaggregation by breaking long prompts into smaller chunks that interleave with decode batches. This ensures that a single long-context request (say, 32K tokens) does not monopolize the GPU for hundreds of milliseconds while other requests wait.
Supported Models and Quantization
vLLM supports several architecture families: Llama 3 and 3.1, Mistral and Mixtral, Qwen 2 and 2.5, DeepSeek-V2 and V3, and multi-modal models including LLaVA, Qwen-VL, and InternVL. Compatibility is keyed to the Hugging Face Transformers architecture identifier: any model whose config.json architecture string appears in vLLM's supported list will load without custom code.
For production quantization, the options span several trade-off curves. AWQ (Activation-aware Weight Quantization) provides 4-bit weight quantization with good accuracy retention; Hugging Face hosts many pre-quantized AWQ checkpoints. GPTQ offers similar bit-widths but uses a different calibration approach; AWQ tends to deliver 5-15% higher throughput in vLLM due to more optimized kernel implementations (verify for your specific model and vLLM version). FP8 quantization on Hopper (H100) GPUs offers near-lossless quality at 8-bit precision with approximately 2x throughput improvement over FP16 (results vary by model size and batch size). Ada Lovelace GPUs (L4, RTX 4090) have hardware FP8 support, but vLLM's FP8 kernel optimizations are primarily validated on Hopper; verify kernel support for your vLLM version before relying on FP8 with Ada Lovelace. GGUF support exists but is primarily useful for CPU-offload scenarios and is not recommended for GPU-first production deployments.
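As a rough memory comparison across these options, the weight-only footprint follows directly from parameter count and bit width. This is a sketch that ignores KV cache, activations, and the few-percent overhead of quantization scales and zero-points:

```python
# Approximate weight-only footprint of an 8B-parameter model by precision.
params = 8e9
for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit AWQ/GPTQ", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name:>14}: ~{gib:.1f} GiB")
# FP16 ~14.9 GiB, FP8 ~7.5 GiB, 4-bit ~3.7 GiB
```

The gap between FP16 and 4-bit is what lets an 8B AWQ checkpoint leave far more KV-cache room on a single 24 GB or 40 GB GPU.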
Docker Deployment for vLLM
Single-GPU Docker Setup
The official vLLM project publishes Docker images to Docker Hub and GitHub Container Registry. The recommended approach is to pin to a specific release tag rather than using latest, since vLLM's API surface and default engine behavior can change between minor versions.
Important: Verify the latest release tag at https://github.com/vllm-project/vllm/releases and on Docker Hub before substituting. All examples in this guide use vllm/vllm-openai:<your-release-tag> as a placeholder. Replace <your-release-tag> with a verified tag (e.g., v0.8.3 if it exists at the time of your deployment).
The critical Docker flags for GPU inference are --gpus for GPU passthrough, --shm-size for shared memory (required for inter-process communication, primarily NCCL for tensor parallelism), and --ipc=host as an alternative that grants full host IPC namespace access.
Security note: For production, store tokens in a .env file with chmod 600 and pass via --env-file .env, or use Docker secrets. Never pass credentials directly in the docker run command, as they will be visible in docker inspect, process listings, and shell history. Prefer --env-file .env exclusively and have vLLM read the API key from its environment rather than passing --api-key on the command line.
Note on --quantization awq: This flag requires a pre-quantized AWQ checkpoint. The base meta-llama/Llama-3.1-8B-Instruct model is not AWQ-quantized. Use an AWQ-quantized variant such as hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4, or remove --quantization awq if using the base model.
docker run -d \
--name vllm-server \
--gpus '"device=0"' \
--shm-size=4g \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env-file .env \
vllm/vllm-openai:<your-release-tag> \
--model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--served-model-name llama-3.1-8b \
--max-model-len 8192 \
--quantization awq \
--dtype auto \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--port 8000
Your .env file should contain:
HUGGING_FACE_HUB_TOKEN=<your-hf-token>
VLLM_API_KEY=<your-api-key>
Note: The VLLM_API_KEY environment variable is read by vLLM automatically to enable API key authentication. Verify that your vLLM version supports this by running vllm serve --help and checking for environment variable documentation. This approach avoids exposing the key in docker inspect or process listings.
The --max-model-len flag caps the maximum sequence length and directly controls how much KV-cache memory is reserved. Setting this lower than the model's maximum context window frees GPU memory for larger batch sizes. The --gpu-memory-utilization 0.90 allocates 90% of GPU memory to vLLM, leaving headroom for CUDA context and fragmentation. The volume mount for the Hugging Face cache avoids re-downloading multi-gigabyte model weights on container restarts.
Multi-GPU Docker Setup with Tensor Parallelism
For models that exceed single-GPU memory or require higher throughput, tensor parallelism shards the model across multiple GPUs within a single node. The --tensor-parallel-size argument must match the number of GPUs allocated.
docker run -d \
--name vllm-server-tp4 \
--gpus '"device=0,1,2,3"' \
--shm-size=16g \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env-file .env \
-e NCCL_DEBUG=WARN \
vllm/vllm-openai:<your-release-tag> \
--model meta-llama/Llama-3.1-70B-Instruct \
--served-model-name llama-3.1-70b \
--tensor-parallel-size 4 \
--max-model-len 16384 \
--dtype auto \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--port 8000
The --ipc=host flag is particularly important for multi-GPU configurations, as NCCL (NVIDIA's collective communication library) uses shared memory for intra-node GPU-to-GPU communication. Insufficient shared memory causes cryptic NCCL errors at runtime.
Note on NCCL P2P transfers: NCCL_P2P_DISABLE defaults to 0 (P2P enabled). This is the default behavior and does not need to be explicitly set. If NVLink P2P transfers are unexpectedly disabled (e.g., in certain hypervisor environments), set NCCL_P2P_DISABLE=1 to explicitly fall back to PCIe, or investigate driver-level P2P enablement. On multi-tenant hosts, be aware that --gpu-memory-utilization 0.90 combined with --ipc=host may risk OOM if other processes share the GPU.
Production Docker Compose Configuration
A Docker Compose stack for production wraps vLLM with a reverse proxy for rate limiting, health checks for orchestrator integration, and persistent model caching.
Note: The nginx service requires an nginx.conf configuration file and TLS certificates in the ./certs directory. See the companion repository or vLLM documentation for sample configurations.
services:
  vllm:
    image: vllm/vllm-openai:<your-release-tag>
    container_name: vllm-server
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 64g
    shm_size: "16g"
    ipc: host
    volumes:
      # Persistent model cache prevents re-downloads on restart
      - model-cache:/root/.cache/huggingface
    env_file:
      - .env
    command: >
      --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
      --served-model-name llama-3.1-8b
      --max-model-len 8192
      --quantization awq
      --dtype auto
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
      --enable-metrics
      --port 8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s # Models take time to load
    ports:
      - "8000:8000"

  nginx:
    image: nginx:1.27-alpine
    container_name: vllm-proxy
    depends_on:
      vllm:
        condition: service_healthy
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    restart: unless-stopped

volumes:
  model-cache:
    driver: local
Note: The VLLM_API_KEY environment variable is passed to the container via the .env file. vLLM reads this variable from the environment to enable API key authentication, so --api-key does not need to appear in the command line. Verify this behavior with your vLLM version.
The start_period of 120 seconds on the health check is deliberate: loading a quantized 8B model from disk typically takes 30-90 seconds depending on storage speed, and marking the container unhealthy during startup triggers unnecessary restarts. The Docker Compose GPU reservation syntax (deploy.resources.reservations.devices) requires the NVIDIA Container Toolkit and Docker Compose V2.
Kubernetes Deployment for vLLM at Scale
Kubernetes Prerequisites and GPU Operators
Deploying GPU workloads on Kubernetes requires the NVIDIA GPU Operator, which installs the device plugin, driver containers, and container runtime components. KEDA (Kubernetes Event-driven Autoscaling) v2.x must be installed for autoscaling. cert-manager with a configured letsencrypt-prod ClusterIssuer is required for TLS. Nodes must be labeled (e.g., nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3) and the nvidia.com/gpu resource must appear in node capacity. A dedicated namespace with appropriate RBAC prevents accidental resource contention with non-GPU workloads. A ReadWriteMany-capable StorageClass (e.g., NFS, CephFS, or a cloud RWX StorageClass) is required for multi-replica deployments sharing a model cache.
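Before applying any manifests, it is worth confirming these prerequisites from the command line. The commands below are illustrative checks against your own cluster; the gpu-operator namespace name assumes a default Helm installation of the GPU Operator.

```shell
# Nodes advertise GPUs as allocatable resources?
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'

# GPU Operator components healthy? (namespace per default Helm install)
kubectl get pods -n gpu-operator

# cert-manager issuer and KEDA CRDs present?
kubectl get clusterissuer letsencrypt-prod
kubectl get crd scaledobjects.keda.sh
```

If any of these fail, fix the platform layer first; vLLM pods will otherwise sit in Pending or crash-loop with opaque driver errors.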
Creating Required Secrets
Before applying the Deployment manifests, create the required Kubernetes secrets:
kubectl create namespace llm-serving # if not already created
kubectl create secret generic hf-secret \
--from-literal=token=$HF_TOKEN \
-n llm-serving
kubectl create secret generic vllm-secret \
--from-literal=api-key=$VLLM_API_KEY \
-n llm-serving
# Verify both secrets exist
kubectl get secret hf-secret vllm-secret -n llm-serving
Deployment Manifests
Note on API key handling: The --api-key flag is intentionally omitted from the container args below. Passing secrets via args exposes them in kubectl describe pod output, /proc/<pid>/cmdline, and cluster audit logs. Instead, the VLLM_API_KEY environment variable is injected from a Kubernetes Secret and read by vLLM automatically. Verify that your vLLM version supports reading VLLM_API_KEY from the environment by checking vllm serve --help.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: llm-serving
  labels:
    app: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      runtimeClassName: nvidia # Requires NVIDIA RuntimeClass
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: vllm
      containers:
        - name: vllm
          image: vllm/vllm-openai:<your-release-tag>
          args:
            - "--model"
            - "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
            - "--served-model-name"
            - "llama-3.1-8b"
            - "--max-model-len"
            - "8192"
            - "--quantization"
            - "awq"
            - "--dtype"
            - "auto"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--enable-prefix-caching"
            - "--enable-metrics"
            # --api-key intentionally omitted from args.
            # vLLM reads VLLM_API_KEY from environment automatically.
            # Passing via args exposes the value in kubectl describe output.
            - "--port"
            - "8000"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-secret
                  key: api-key
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            # Allows up to 400s (40 * 10s) for model load before
            # liveness/readiness probes begin. Adjust for model size.
            failureThreshold: 40
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            # initialDelaySeconds not needed when startupProbe is present;
            # readiness only starts after startupProbe succeeds.
            periodSeconds: 15
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
            failureThreshold: 5
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
Note on PVC access mode: ReadWriteMany is required because replicas: 2 with topologySpreadConstraints spreads pods across different nodes, and ReadWriteOnce only allows mounting by a single node at a time. This requires a StorageClass that supports RWX (e.g., NFS, CephFS, or a cloud-provider RWX StorageClass). Alternatively, use a StatefulSet with per-pod PVCs or an init-container model-downloader pattern.
Note on storageClassName: The value fast-ssd is an example; replace with a StorageClass that exists in your cluster and supports ReadWriteMany.
The topologySpreadConstraints ensure replicas land on different nodes, preventing a single node failure from taking down all inference capacity. The PVC uses a fast SSD storage class because model loading time is directly bottlenecked by storage throughput.
Service and Ingress Configuration
Prerequisite: The Ingress configuration below requires cert-manager installed with a configured letsencrypt-prod ClusterIssuer. See cert-manager.io/docs for setup instructions.
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: llm-serving
spec:
  type: ClusterIP
  selector:
    app: vllm
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm-api.example.com
      secretName: vllm-tls
  rules:
    - host: llm-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
The proxy-read-timeout of 300 seconds accommodates long-running generation requests, particularly for high max_tokens values. Disabling proxy-buffering is essential for streaming responses; with buffering enabled, Server-Sent Events are held until the entire response completes, defeating the purpose of streaming. Session affinity is not needed for vLLM since requests are stateless, but if client-side streaming reconnection logic depends on hitting the same pod, add nginx.ingress.kubernetes.io/affinity: "cookie".
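To verify end to end that streaming survives the proxy, a quick manual check helps. The host and key below are placeholders from the manifests above; curl's -N flag disables curl's own output buffering so SSE chunks print as they arrive.

```shell
curl -N https://llm-api.example.com/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Count to five."}],
    "stream": true,
    "max_tokens": 32
  }'
# Healthy streaming: "data: {...}" lines arrive incrementally.
# If the whole body lands in one burst, proxy buffering is still on.
```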
Horizontal Scaling with KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300 # 5 minutes — GPU pods are expensive to churn
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: vllm_requests_waiting_per_replica
        query: |
          sum(vllm:num_requests_waiting{namespace="llm-serving"})
          / scalar(
            count(up{job="vllm", namespace="llm-serving"} == 1)
            or vector(1)
          )
        threshold: "5"
        activationThreshold: "2"
Pending requests per replica is a better scaling signal than raw GPU utilization, which can be high during normal operation without indicating capacity exhaustion. The 300-second cooldown prevents rapid scale-down that would waste the time spent loading model weights into a new pod's GPU memory. GPU pods typically take 1-5+ minutes to become ready after scheduling, depending on model size and storage speed. The scalar(... or vector(1)) construct in the query prevents division by zero when all pods are down, ensuring KEDA can still trigger scale-up from zero.
Multi-Node Distributed Inference
For models that exceed single-node GPU memory (e.g., Llama 3.1 405B requires 8+ GPUs), pipeline parallelism across nodes is necessary. vLLM uses Ray for distributed orchestration and NCCL for inter-GPU communication. The network requirements are stringent: NCCL performs poorly over standard TCP/IP, and production multi-node deployments should use InfiniBand or RoCE (RDMA over Converged Ethernet). As a rule of thumb, plan for at least 100 Gbps of inter-node bandwidth for models above ~70B parameters; actual requirements scale with model size and communication volume. In most cases, scaling replicas of a smaller model (or a quantized variant) across independent nodes yields better aggregate throughput than spanning a single model across nodes, because of the communication overhead.
OpenAI-Compatible API Configuration
Endpoint Configuration and Model Aliases
vLLM exposes /v1/completions, /v1/chat/completions, and /v1/embeddings endpoints that match the OpenAI API specification; other OpenAI-specific endpoints (e.g., Assistants API, fine-tuning) are not supported. The --served-model-name flag allows operators to present an alias (e.g., llama-production) so that existing application code targeting OpenAI can be redirected without modification. The --api-key flag (or VLLM_API_KEY environment variable) enables bearer token authentication on all endpoints.
Request Parameters for Production
Production deployments should set sensible defaults and guardrails for max_tokens, temperature, and top_p. Guided decoding enables structured JSON output by constraining the token generation to match a JSON schema, which is critical for tool-calling and agentic workflows.
from openai import OpenAI
import os
import json
import openai

client = OpenAI(
    base_url="https://llm-api.example.com/v1",
    api_key=os.environ["VLLM_API_KEY"],
)

# Streaming chat completion with structured JSON output
try:
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[
            {"role": "system", "content": "Extract entities as JSON."},
            {"role": "user", "content": "Apple announced the M4 chip in Cupertino."},
        ],
        max_tokens=512,
        temperature=0.1,
        top_p=0.95,
        stream=True,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "entities",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "type": {"type": "string"},
                                },
                                "required": ["name", "type"],
                            },
                        }
                    },
                    "required": ["entities"],
                },
            },
        },
    )
    response_parts = []
    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            response_parts.append(chunk.choices[0].delta.content)
    full_response = "".join(response_parts)
    parsed = json.loads(full_response)
    print(parsed)
except openai.AuthenticationError as e:
    # Non-retryable: bad API key
    raise RuntimeError(f"Authentication failed — check VLLM_API_KEY: {e}") from e
except openai.RateLimitError as e:
    # Retryable: implement exponential backoff in production
    print(f"Rate limited: {e}")
    raise
except openai.APIConnectionError as e:
    print(f"Connection error (retryable): {e}")
    raise
except json.JSONDecodeError as e:
    print(f"Failed to parse model response as JSON. Raw: {full_response!r}. Error: {e}")
    raise
except openai.APIError as e:
    print(f"API error: {e}")
    raise
Performance Optimization for Production Workloads
GPU Memory Utilization Tuning
The --gpu-memory-utilization parameter (range 0.0-1.0) controls what fraction of GPU memory vLLM pre-allocates for the KV cache and model weights. Values between 0.85 and 0.95 are typical. Higher values increase concurrent request capacity but leave less headroom for CUDA memory allocation spikes, which can cause OOM errors under bursty traffic. Setting --max-model-len lower than the model's maximum reduces per-sequence KV-cache reservation and allows more concurrent sequences. The --enforce-eager flag disables CUDA graph capture, trading throughput for lower GPU memory usage: the throughput penalty is typically 5-15% on models above ~30B parameters and can be larger on models below ~13B parameters or under high-concurrency workloads, where kernel-launch overhead dominates. This is useful when running close to memory limits.
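To see where a 0.90 utilization setting leaves memory on a single 80 GB GPU, a back-of-envelope split helps. The numbers are illustrative; the ~15 GiB weight figure assumes an FP16 8B-class model.

```python
# Where --gpu-memory-utilization 0.90 leaves memory on one 80 GB GPU.
total_gib = 80e9 / 1024**3        # "80 GB" (decimal) is ~74.5 GiB
claimed = 0.90 * total_gib        # pre-allocated by vLLM
weights_gib = 15.0                # FP16 8B weights, approximate
kv_and_activations = claimed - weights_gib
headroom = total_gib - claimed    # CUDA context, fragmentation, spikes

print(f"claimed by vLLM:  {claimed:.1f} GiB")            # 67.1
print(f"KV + activations: {kv_and_activations:.1f} GiB") # 52.1
print(f"OOM headroom:     {headroom:.1f} GiB")           # 7.5
```

Pushing utilization from 0.90 to 0.95 converts roughly half of that headroom into KV cache, which is exactly the trade-off to load-test before adopting.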
Batching and Scheduling Configuration
vLLM's continuous batching dynamically adds new requests to in-flight batches without waiting for the current batch to complete. The --max-num-seqs parameter caps the maximum number of concurrent sequences (default: 256). Reducing this value limits memory pressure but caps throughput. The --max-num-batched-tokens parameter controls the total token budget per batch iteration; lower values reduce per-iteration latency but decrease throughput. For latency-sensitive deployments, enabling chunked prefill with smaller chunk sizes ensures that decode iterations are not blocked by large prefill operations.
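A latency-oriented configuration might combine these knobs as follows. The values are illustrative starting points to tune against your own load tests, and flag names should be confirmed with vllm serve --help for your version.

```shell
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 4096 \
  --enable-chunked-prefill \
  --max-model-len 8192
```

Lowering --max-num-batched-tokens tightens per-iteration latency at the cost of prefill throughput; watch TTFT p99 while sweeping it.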
Quantization in Practice
If near-lossless quality matters most, FP8 quantization on Hopper GPUs provides the best quality-performance tradeoff, delivering approximately 2x throughput improvement over FP16 on H100 (results vary by model size and batch size). Ada Lovelace GPUs have hardware FP8 support, but verify kernel support for your vLLM version. If memory is the binding constraint, AWQ at 4-bit fits larger models into fewer GPUs, and it tends to outperform GPTQ in vLLM due to more optimized CUDA kernel implementations. For decode-bound workloads where latency dominates, speculative decoding can help: a smaller draft model proposes tokens that the main model verifies in parallel, yielding 1.3-2x speedups when the draft model achieves an acceptance rate ≥ 0.7. Poorly matched draft models may yield no speedup or regression. Configure speculative decoding with --speculative-model and --num-speculative-tokens.
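A sketch of the speculative decoding flags named above. The draft-model pairing is illustrative rather than a validated recommendation, and newer vLLM versions may expose this via --speculative-config instead; check vllm serve --help.

```shell
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5
```

Measure the acceptance rate reported in vLLM's logs before keeping this enabled; a low rate means the draft model is wasting verification work.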
Note: The benchmark tool below requires cloning the vLLM repository. It is a standalone script, not an installable Python module.
# Clone the vLLM repository and run from its root directory
git clone https://github.com/vllm-project/vllm.git && cd vllm
# Start a vLLM server separately (e.g., via Docker), then benchmark against it:
python benchmarks/benchmark_serving.py \
--backend vllm \
--endpoint /v1/completions \
--model llama-3.1-8b \
--dataset-name sharegpt \
--num-prompts 500 \
--request-rate 10 \
--base-url http://localhost:8000
# Sample output interpretation (numbers are illustrative only;
# actual results depend on hardware, model, and traffic pattern):
# Throughput: 842.3 tokens/s (total output throughput across all requests)
# Mean TTFT: 48.2ms | P99 TTFT: 142.7ms
# Mean E2E latency: 1.24s | P99 E2E: 3.87s
# Compare across --quantization awq vs. none, different --max-num-seqs values
Monitoring and Observability
Prometheus Metrics Endpoint
vLLM exposes a /metrics endpoint with Prometheus-formatted metrics. In vLLM ≥ 0.4.x, /metrics is exposed by default. The --enable-metrics flag is a no-op in versions where metrics are on by default; include it explicitly if your vLLM version requires it. The critical production metrics are: vllm:num_requests_running (active request count), vllm:num_requests_waiting (queue depth, the primary indicator of capacity exhaustion), vllm:gpu_cache_usage_perc (KV-cache saturation), vllm:time_to_first_token_seconds (histogram of TTFT), and vllm:e2e_request_latency_seconds (histogram of end-to-end latency).
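A quick way to confirm which of these metrics (and which value ranges) your deployed version actually exposes:

```shell
# Assumes a vLLM server listening on localhost:8000
curl -s http://localhost:8000/metrics | \
  grep -E '^vllm:(num_requests|gpu_cache|time_to_first_token|e2e_request)'
```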
Grafana Dashboard Setup
# Prometheus scrape configuration for vLLM
scrape_configs:
  - job_name: vllm
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - llm-serving
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
    scrape_interval: 15s
Note on gauge thresholds: The thresholds below assume vllm:gpu_cache_usage_perc reports values in the 0-100 range (percentage). If your vLLM version reports this metric in the 0.0-1.0 range (fraction), change the threshold values to 0.90 and 0.95 respectively. Verify with curl http://<pod>:8000/metrics | grep gpu_cache_usage_perc.
{
  "dashboard": {
    "title": "vLLM Production Monitoring",
    "panels": [
      {
        "title": "Request Throughput (req/s)",
        "type": "timeseries",
        "targets": [{"expr": "rate(vllm:e2e_request_latency_seconds_count[1m])"}]
      },
      {
        "title": "TTFT Latency (p50/p95/p99)",
        "type": "timeseries",
        "targets": [
          {"expr": "histogram_quantile(0.50, rate(vllm:time_to_first_token_seconds_bucket[5m]))", "legendFormat": "p50"},
          {"expr": "histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))", "legendFormat": "p95"},
          {"expr": "histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))", "legendFormat": "p99"}
        ]
      },
      {
        "title": "GPU KV Cache Utilization (%)",
        "type": "gauge",
        "targets": [{"expr": "vllm:gpu_cache_usage_perc"}],
        "thresholds": [{"value": 90, "color": "orange"}, {"value": 95, "color": "red"}]
      },
      {
        "title": "Request Queue Depth",
        "type": "timeseries",
        "targets": [{"expr": "vllm:num_requests_waiting"}]
      },
      {
        "title": "E2E Latency (p50/p95/p99)",
        "type": "timeseries",
        "targets": [
          {"expr": "histogram_quantile(0.50, rate(vllm:e2e_request_latency_seconds_bucket[5m]))", "legendFormat": "p50"},
          {"expr": "histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))", "legendFormat": "p95"},
          {"expr": "histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))", "legendFormat": "p99"}
        ]
      }
    ]
  }
}
Alert rules should fire when vllm:gpu_cache_usage_perc exceeds 95 for more than 2 minutes (indicating capacity exhaustion and imminent request rejections), when TTFT p99 exceeds the SLA threshold, or when the error rate spikes above baseline.
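The first of those alerts can be sketched as a PrometheusRule. This assumes the Prometheus Operator CRDs are installed; per the earlier note on metric ranges, change 95 to 0.95 if your vLLM version reports the metric as a fraction.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: llm-serving
spec:
  groups:
    - name: vllm
      rules:
        - alert: VLLMKVCacheSaturated
          expr: vllm:gpu_cache_usage_perc{namespace="llm-serving"} > 95
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "KV cache above 95% for 2m; capacity exhaustion imminent"
```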
Logging and Tracing
vLLM outputs structured logs to stdout by default, which integrates cleanly with Kubernetes log collection via Fluentd, Fluent Bit, or Grafana Loki. For request-level tracing, passing a custom X-Request-ID header through the reverse proxy and correlating it in application logs enables end-to-end tracing across the client, proxy, and inference server. The vLLM server logs include request IDs and timing breakdowns that can be indexed for debugging latency anomalies.
Security and Reliability in Production
API Authentication and Rate Limiting
The --api-key flag (or VLLM_API_KEY environment variable) provides bearer token authentication. For multi-tenant environments, a reverse proxy (Nginx, Envoy, or a cloud API gateway) should handle per-client rate limiting, since vLLM does not natively support per-user quotas. In Kubernetes, NetworkPolicy resources should restrict ingress to the vLLM pods to only the reverse proxy or service mesh sidecar, preventing direct access from other workloads.
High Availability Patterns
Running multiple vLLM replicas behind a Kubernetes Service provides basic high availability. The /health endpoint returns HTTP 200 when the model is loaded and ready to serve. Startup behavior and error-state response codes vary by vLLM version; verify against your deployed version's documentation. This enables the load balancer to route around unhealthy pods. Kubernetes handles graceful shutdown through terminationGracePeriodSeconds; setting this to 60-120 seconds allows in-flight requests to complete before pod termination. The PVC-backed model cache reduces cold start time from minutes (network download) to seconds (local disk load).
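In the Deployment's pod spec, the graceful-shutdown window is a one-line addition (90 seconds is an illustrative value in the recommended range):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90  # let in-flight generations finish
```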
Production Readiness Checklist
This checklist summarizes the guidance from preceding sections. Use it as a pre-launch gate; each item links back to a detailed section above.
- Pin the vLLM image tag to a specific, verified release version; never use latest in production. Verify the tag exists at Docker Hub or GitHub releases.
- Set --max-model-len to the actual maximum sequence length the application needs, not the model's theoretical maximum.
- Configure --gpu-memory-utilization between 0.85 and 0.95 and load-test to find the sweet spot for the specific hardware and model combination.
- Enable Prometheus metrics and deploy dashboards tracking TTFT p99, queue depth, and KV-cache utilization.
- Configure alerts for KV-cache saturation above 95% and TTFT p99 exceeding SLA thresholds.
- Implement rate limiting at the reverse proxy layer with appropriate per-client quotas.
- Load test with realistic traffic patterns using benchmarks/benchmark_serving.py (from the cloned vLLM repo) before directing production traffic.
- Store credentials securely using .env files with restricted permissions, Docker secrets, or Kubernetes Secrets; never pass tokens as plain CLI arguments or in container args.
For the complete set of deployment manifests, Docker Compose files, and Grafana dashboards referenced in this guide, the vLLM project maintains official documentation at docs.vllm.ai and the source repository at github.com/vllm-project/vllm.

