vLLM Production Deployment: Complete 2026 Guide

How to Deploy vLLM in Production

  1. Pin a specific vLLM Docker image tag and configure GPU passthrough with the NVIDIA Container Toolkit.
  2. Set --max-model-len and --gpu-memory-utilization to balance KV-cache capacity against OOM headroom.
  3. Enable PagedAttention optimizations including prefix caching and chunked prefill for production workloads.
  4. Deploy Kubernetes manifests with startup/readiness/liveness probes, topology spread, and GPU resource limits.
  5. Configure the OpenAI-compatible API with secure environment-based authentication and guided decoding.
  6. Autoscale replicas using KEDA triggered by per-replica queue depth from Prometheus metrics.
  7. Monitor TTFT p99, KV-cache utilization, and request queue depth via Prometheus and Grafana dashboards.
  8. Load-test with benchmark_serving.py under realistic traffic patterns before routing production requests.

Deploying large language models in production demands a careful balance of latency, throughput, and cost. vLLM has emerged as a widely adopted open-source inference engine for production LLM serving, but the gap between running a quick demo and operating a reliable, observable, autoscaling service remains wide. This guide bridges that gap with concrete configurations, deployment manifests, and monitoring setups reflecting vLLM production defaults as of mid-2026.

Compared to alternatives like Hugging Face TGI, NVIDIA TensorRT-LLM, and SGLang, vLLM occupies a distinct position: it offers broad model compatibility, an OpenAI-compatible API out of the box, and a permissive Apache 2.0 license. TensorRT-LLM delivers higher peak throughput on NVIDIA hardware but requires model-specific compilation and accepts tighter vendor lock-in. SGLang shares architectural similarities with vLLM (including PagedAttention-derived memory management) but centers its design on structured generation and aggressive prefix reuse. TGI remains viable for simpler deployments, though it lacks some of vLLM's scheduling features, such as chunked prefill and disaggregated prefill/decode, and supports a narrower range of multi-modal models.

This guide assumes familiarity with Docker, Kubernetes, GPU infrastructure (NVIDIA Container Toolkit, device plugins), and core LLM concepts such as tokenization, KV caches, and quantization.

Prerequisites: All examples in this guide require NVIDIA driver ≥ 525, CUDA ≥ 12.1, and NVIDIA Container Toolkit ≥ 1.14. Docker Engine ≥ 23.0 (with Compose V2) is required for Docker examples. Kubernetes examples require Kubernetes ≥ 1.27 with the NVIDIA GPU Operator, KEDA (Kubernetes Event-driven Autoscaling) v2.x, and cert-manager with a configured letsencrypt-prod ClusterIssuer.


vLLM Architecture Essentials for Production Engineers

PagedAttention and Memory Management

PagedAttention is the foundational innovation that separates vLLM from naive inference servers. Traditional KV-cache allocation reserves contiguous GPU memory blocks per sequence, leading to severe internal fragmentation when request lengths vary. PagedAttention borrows virtual memory concepts from operating systems. It splits the KV cache into fixed-size blocks (pages) and maps logical cache positions to non-contiguous physical blocks, reducing fragmentation and letting vLLM pack more concurrent sequences into the same GPU memory.

The practical impact is direct: by reclaiming wasted memory, PagedAttention increases the achievable batch size for a given GPU, which in turn improves throughput. For production workloads with heterogeneous prompt lengths on an 80 GB H100 running a 7B FP16 model, this can be the difference between serving 30 concurrent requests and serving 100+ (exact numbers depend on sequence length distribution and model size).
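The arithmetic behind that claim can be sketched directly. The shapes below assume a Llama-2-7B-like model (32 layers, 32 KV heads, head dimension 128, FP16 cache) and are illustrative only; real capacity depends on your model and is reported by vLLM at startup.

```python
# Back-of-envelope KV-cache capacity: contiguous vs paged allocation.
# Assumed Llama-2-7B-like shape; numbers are illustrative, not measured.

def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # 2x for the K and V tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

GPU_MEM = 80e9           # H100 80 GB
WEIGHTS = 14e9           # ~7B params in FP16
cache_bytes = GPU_MEM * 0.90 - WEIGHTS   # after --gpu-memory-utilization 0.90
cache_tokens = int(cache_bytes // kv_bytes_per_token())

MAX_LEN, AVG_LEN = 4096, 1024
contiguous_seqs = cache_tokens // MAX_LEN  # each sequence reserves max length
paged_seqs = cache_tokens // AVG_LEN       # blocks allocated as tokens arrive

print(kv_bytes_per_token())          # 524288 bytes (~0.5 MB per token)
print(contiguous_seqs, paged_seqs)   # roughly 27 vs 108 concurrent sequences
```

The contiguous allocator pays for the worst-case sequence length up front; the paged allocator pays only for tokens actually generated, which is where the roughly 4x concurrency gap comes from when average lengths are a quarter of the maximum.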

Automatic prefix caching (APC) extends this further. When multiple requests share a common system prompt or few-shot prefix, vLLM caches and reuses the KV blocks for that shared prefix rather than recomputing attention for each request. For workloads where every request includes an identical system prompt (common in chatbot and agent deployments), APC reduces time-to-first-token by skipping redundant prefill computation. The magnitude depends on how large the shared prefix is relative to the total prompt; measure with your own workload to quantify the gain.
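A minimal simulation shows the mechanism. The sketch below keys each KV block on a hash chain of its token content, so two requests sharing a system prompt map to the same physical blocks; the real vLLM implementation differs in detail, and the block size and hashing scheme here are assumptions.

```python
# Toy model of automatic prefix caching: KV blocks keyed by content hash
# chained with the preceding block's key, so shared prefixes dedupe.

BLOCK = 16  # tokens per KV block (assumed)

def block_keys(tokens):
    keys, prev = [], None
    # Only full blocks are cacheable; the partial tail block is excluded.
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        keys.append(prev)
    return keys

cache = {}

def prefill(tokens):
    hits = 0
    for key in block_keys(tokens):
        if key in cache:
            hits += 1              # KV block reused, prefill skipped
        else:
            cache[key] = object()  # stand-in for a physical GPU block
    return hits

system = list(range(64))           # shared 64-token system prompt
req_a = system + [1000, 1001, 1002]
req_b = system + [2000, 2001, 2002]

hits_a = prefill(req_a)   # cold cache: 0 hits
hits_b = prefill(req_b)   # the 4 shared system-prompt blocks are reused
print(hits_a, hits_b)
```

The second request skips prefill for every full block of the shared prefix, which is exactly why APC's benefit scales with the ratio of shared prefix to total prompt length.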

V1 Engine Architecture (2025-2026)

The V1 engine, which became the default in vLLM's 2025 releases (enabled by default from the v0.8 series onward; verify against the release notes for your version), introduced several structural changes relevant to production operators.

The primary problem the V1 engine addresses is scheduling overhead at high concurrency. In the prior architecture, the scheduler copied intermediate tensors between GPU and CPU during scheduling decisions. V1 pins host memory and uses direct DMA transfers (zero-copy), eliminating those redundant copies during token sampling and output processing.

Disaggregated prefill and decode separates the two phases of autoregressive inference into distinct scheduling domains. Prefill (processing the full input prompt) is compute-bound and benefits from large batch sizes. Decode (generating tokens one at a time) is memory-bandwidth-bound and latency-sensitive. The V1 engine schedules these phases independently, preventing long prefill jobs from blocking decode steps of in-flight requests; such blocking is a common cause of latency spikes in production serving.

Chunked prefill complements disaggregation by breaking long prompts into smaller chunks that interleave with decode batches. This ensures that a single long-context request (say, 32K tokens) does not monopolize the GPU for hundreds of milliseconds while other requests wait.
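A toy scheduler makes the interleaving concrete. The token budget and request counts below are assumptions for illustration; this is not vLLM's actual scheduling logic.

```python
# Toy chunked-prefill scheduler: each iteration has a token budget
# (cf. --max-num-batched-tokens); in-flight decodes are scheduled first,
# and a long prefill consumes only the leftover budget per step instead
# of monopolizing the GPU.

BUDGET = 512                 # tokens processed per engine step (assumed)
decodes = 64                 # in-flight requests, 1 decode token each
prefill_remaining = 32_000   # a 32K-token prompt arriving mid-stream

steps = 0
while prefill_remaining > 0:
    spare = BUDGET - decodes                    # decodes are never starved
    prefill_remaining -= min(spare, prefill_remaining)
    steps += 1

print(steps)  # the 32K prefill spreads across many steps, while every
              # decode stream still advances one token per step
```

Without chunking, that prompt would occupy entire iterations back-to-back and every decode stream would stall until it finished; with chunking, decode latency stays flat at the cost of a slower prefill completion.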

Supported Models and Quantization

vLLM supports several architecture families: Llama 3 and 3.1, Mistral and Mixtral, Qwen 2 and 2.5, DeepSeek-V2 and V3, and multi-modal models including LLaVA, Qwen-VL, and InternVL. Model compatibility is keyed to the Hugging Face Transformers architecture identifier: any checkpoint whose config.json architecture string appears in vLLM's supported list will load without custom code.

For production quantization, the options span several trade-off curves. AWQ (Activation-aware Weight Quantization) provides 4-bit weight quantization with good accuracy retention; Hugging Face hosts many pre-quantized AWQ checkpoints. GPTQ offers similar bit-widths but uses a different calibration approach; AWQ tends to deliver 5-15% higher throughput in vLLM due to more optimized kernel implementations (verify for your specific model and vLLM version). FP8 quantization on Hopper (H100) GPUs offers near-lossless quality at 8-bit precision with approximately 2x throughput improvement over FP16 (results vary by model size and batch size). Ada Lovelace GPUs (L4, RTX 4090) have hardware FP8 support, but vLLM's FP8 kernel optimizations are primarily validated on Hopper; verify kernel support for your vLLM version before relying on FP8 with Ada Lovelace. GGUF support exists but is primarily useful for CPU-offload scenarios and is not recommended for GPU-first production deployments.

Docker Deployment for vLLM

Single-GPU Docker Setup

The official vLLM project publishes Docker images to Docker Hub and GitHub Container Registry. The recommended approach is to pin to a specific release tag rather than using latest, since vLLM's API surface and default engine behavior can change between minor versions.

Important: Verify the latest release tag at https://github.com/vllm-project/vllm/releases and on Docker Hub before substituting. All examples in this guide use vllm/vllm-openai:<your-release-tag> as a placeholder. Replace <your-release-tag> with a verified tag (e.g., v0.8.3 if it exists at the time of your deployment).

The critical Docker flags for GPU inference are --gpus for GPU passthrough, --shm-size for shared memory (required for inter-process communication, primarily NCCL for tensor parallelism), and --ipc=host as an alternative to --shm-size; note that --ipc=host grants the container the host's full IPC namespace.

Security note: For production, store tokens in a .env file with chmod 600 and pass via --env-file .env, or use Docker secrets. Never pass credentials directly in the docker run command, as they will be visible in docker inspect, process listings, and shell history. Prefer --env-file .env exclusively and have vLLM read the API key from its environment rather than passing --api-key on the command line.

Note on --quantization awq: This flag requires a pre-quantized AWQ checkpoint. The base meta-llama/Llama-3.1-8B-Instruct model is not AWQ-quantized. Use an AWQ-quantized variant such as hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4, or remove --quantization awq if using the base model.

docker run -d \
  --name vllm-server \
  --gpus '"device=0"' \
  --shm-size=4g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env-file .env \
  vllm/vllm-openai:<your-release-tag> \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --served-model-name llama-3.1-8b \
  --max-model-len 8192 \
  --quantization awq \
  --dtype auto \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000

Your .env file should contain:

HUGGING_FACE_HUB_TOKEN=<your-hf-token>
VLLM_API_KEY=<your-api-key>

Note: The VLLM_API_KEY environment variable is read by vLLM automatically to enable API key authentication. Verify that your vLLM version supports this by running vllm serve --help and checking for environment variable documentation. This approach avoids exposing the key in docker inspect or process listings.
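Creating the file with restrictive permissions can be done in two commands; the placeholder values below must be replaced with your actual tokens.

```shell
# Create the .env file shown above with owner-only permissions.
umask 077   # new files default to 600 for the rest of this shell
printf 'HUGGING_FACE_HUB_TOKEN=%s\nVLLM_API_KEY=%s\n' \
  "<your-hf-token>" "<your-api-key>" > .env
chmod 600 .env        # belt-and-braces if the file already existed
stat -c '%a' .env     # should print 600
```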

The --max-model-len flag caps the maximum sequence length and directly controls how much KV-cache memory is reserved. Setting this lower than the model's maximum context window frees GPU memory for larger batch sizes. The --gpu-memory-utilization 0.90 allocates 90% of GPU memory to vLLM, leaving headroom for CUDA context and fragmentation. The volume mount for the Hugging Face cache avoids re-downloading multi-gigabyte model weights on container restarts.

Multi-GPU Docker Setup with Tensor Parallelism

For models that exceed single-GPU memory or require higher throughput, tensor parallelism shards the model across multiple GPUs within a single node. The --tensor-parallel-size argument must match the number of GPUs allocated.

docker run -d \
  --name vllm-server-tp4 \
  --gpus '"device=0,1,2,3"' \
  --shm-size=16g \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env-file .env \
  -e NCCL_DEBUG=WARN \
  vllm/vllm-openai:<your-release-tag> \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --served-model-name llama-3.1-70b \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --dtype auto \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000

The --ipc=host flag is particularly important for multi-GPU configurations, as NCCL (NVIDIA's collective communication library) uses shared memory for intra-node GPU-to-GPU communication. Insufficient shared memory causes cryptic NCCL errors at runtime.

Note on NCCL P2P transfers: NCCL_P2P_DISABLE defaults to 0 (P2P enabled), so it does not need to be set explicitly. If P2P transfers misbehave or hang (as can happen in certain hypervisor environments), set NCCL_P2P_DISABLE=1 to force a PCIe fallback, or investigate driver-level P2P enablement. On multi-tenant hosts, be aware that --gpu-memory-utilization 0.90 combined with --ipc=host may risk OOM if other processes share the GPU.

Production Docker Compose Configuration

A Docker Compose stack for production wraps vLLM with a reverse proxy for rate limiting, health checks for orchestrator integration, and persistent model caching.

Note: The nginx service requires an nginx.conf configuration file and TLS certificates in the ./certs directory. See the companion repository or vLLM documentation for sample configurations.

services:
  vllm:
    image: vllm/vllm-openai:<your-release-tag>
    container_name: vllm-server
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 64g
    shm_size: "16g"
    ipc: host
    volumes:
      # Persistent model cache prevents re-downloads on restart
      - model-cache:/root/.cache/huggingface
    env_file:
      - .env
    command: >
      --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
      --served-model-name llama-3.1-8b
      --max-model-len 8192
      --quantization awq
      --dtype auto
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
      --enable-metrics
      --port 8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # Models take time to load
    ports:
      - "8000:8000"

  nginx:
    image: nginx:1.27-alpine
    container_name: vllm-proxy
    depends_on:
      vllm:
        condition: service_healthy
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    restart: unless-stopped

volumes:
  model-cache:
    driver: local

Note: The VLLM_API_KEY environment variable is passed to the container via the .env file. vLLM reads this variable from the environment to enable API key authentication, so --api-key does not need to appear in the command line. Verify this behavior with your vLLM version.

The start_period of 120 seconds on the health check is deliberate: loading a quantized 8B model from disk typically takes 30-90 seconds depending on storage speed, and marking the container unhealthy during startup triggers unnecessary restarts. The Docker Compose GPU reservation syntax (deploy.resources.reservations.devices) requires the NVIDIA Container Toolkit and Docker Compose V2.

Kubernetes Deployment for vLLM at Scale

Kubernetes Prerequisites and GPU Operators

Deploying GPU workloads on Kubernetes requires:
  • The NVIDIA GPU Operator, which installs the device plugin, driver containers, and container runtime components.
  • KEDA (Kubernetes Event-driven Autoscaling) v2.x for autoscaling.
  • cert-manager with a configured letsencrypt-prod ClusterIssuer for TLS.
  • GPU nodes labeled appropriately (e.g., nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3), with the nvidia.com/gpu resource appearing in node capacity.
  • A dedicated namespace with appropriate RBAC, preventing accidental resource contention with non-GPU workloads.
  • A ReadWriteMany-capable StorageClass (e.g., NFS, CephFS, or a cloud RWX StorageClass) for multi-replica deployments sharing a model cache.

Creating Required Secrets

Before applying the Deployment manifests, create the required Kubernetes secrets:

kubectl create namespace llm-serving  # if not already created

kubectl create secret generic hf-secret \
  --from-literal=token=$HF_TOKEN \
  -n llm-serving

kubectl create secret generic vllm-secret \
  --from-literal=api-key=$VLLM_API_KEY \
  -n llm-serving

# Verify both secrets exist
kubectl get secret hf-secret vllm-secret -n llm-serving

Deployment Manifests

Note on API key handling: The --api-key flag is intentionally omitted from the container args below. Passing secrets via args exposes them in kubectl describe pod output, /proc/<pid>/cmdline, and cluster audit logs. Instead, the VLLM_API_KEY environment variable is injected from a Kubernetes Secret and read by vLLM automatically. Verify that your vLLM version supports reading VLLM_API_KEY from the environment by checking vllm serve --help.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: llm-serving
  labels:
    app: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      runtimeClassName: nvidia  # Requires NVIDIA RuntimeClass
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: vllm
      containers:
        - name: vllm
          image: vllm/vllm-openai:<your-release-tag>
          args:
            - "--model"
            - "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
            - "--served-model-name"
            - "llama-3.1-8b"
            - "--max-model-len"
            - "8192"
            - "--quantization"
            - "awq"
            - "--dtype"
            - "auto"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--enable-prefix-caching"
            - "--enable-metrics"
            # --api-key intentionally omitted from args.
            # vLLM reads VLLM_API_KEY from environment automatically.
            # Passing via args exposes the value in kubectl describe output.
            - "--port"
            - "8000"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-secret
                  key: api-key
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            # Allows up to 400s (40 * 10s) for model load before
            # liveness/readiness probes begin. Adjust for model size.
            failureThreshold: 40
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            # initialDelaySeconds not needed when startupProbe is present;
            # readiness only starts after startupProbe succeeds.
            periodSeconds: 15
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
            failureThreshold: 5
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi

Note on PVC access mode: ReadWriteMany is required because replicas: 2 with topologySpreadConstraints spreads pods across different nodes, and ReadWriteOnce only allows mounting by a single node at a time. This requires a StorageClass that supports RWX (e.g., NFS, CephFS, or a cloud-provider RWX StorageClass). Alternatively, use a StatefulSet with per-pod PVCs or an init-container model-downloader pattern.

Note on storageClassName: The value fast-ssd is an example; replace with a StorageClass that exists in your cluster and supports ReadWriteMany.

The topologySpreadConstraints ensure replicas land on different nodes, preventing a single node failure from taking down all inference capacity. The PVC uses a fast SSD storage class because model loading time is directly bottlenecked by storage throughput.

Service and Ingress Configuration

Prerequisite: The Ingress configuration below requires cert-manager installed with a configured letsencrypt-prod ClusterIssuer. See cert-manager.io/docs for setup instructions.

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: llm-serving
spec:
  type: ClusterIP
  selector:
    app: vllm
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm-api.example.com
      secretName: vllm-tls
  rules:
    - host: llm-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80

The proxy-read-timeout of 300 seconds accommodates long-running generation requests, particularly for high max_tokens values. Disabling proxy-buffering is essential for streaming responses; with buffering enabled, Server-Sent Events are held until the entire response completes, defeating the purpose of streaming. Session affinity is not needed for vLLM since requests are stateless, but if client-side streaming reconnection logic depends on hitting the same pod, add nginx.ingress.kubernetes.io/affinity: "cookie".

Horizontal Scaling with KEDA

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300  # 5 minutes — GPU pods are expensive to churn
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: vllm_requests_waiting_per_replica
        query: |
          sum(vllm:num_requests_waiting{namespace="llm-serving"})
          / scalar(
              count(up{job="vllm", namespace="llm-serving"} == 1)
              or vector(1)
            )
        threshold: "5"
        activationThreshold: "2"

Pending requests per replica is a better scaling signal than raw GPU utilization, which can be high during normal operation without indicating capacity exhaustion. The 300-second cooldown prevents rapid scale-down that would waste the time spent loading model weights into a new pod's GPU memory. GPU pods typically take 1-5+ minutes to become ready after scheduling, depending on model size and storage speed. The scalar(... or vector(1)) construct in the query prevents division by zero when all pods are down, ensuring KEDA can still trigger scale-up from zero.
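The scaling expression reduces to simple arithmetic, shown below with the same zero-replica guard the PromQL applies; the request counts are made-up illustrative values.

```python
# Per-replica queue depth with a division-by-zero guard, mirroring the
# scalar(count(up == 1) or vector(1)) construct in the KEDA query.

def queue_depth_per_replica(waiting_per_pod, live_replicas):
    total_waiting = sum(waiting_per_pod)
    # When no replica reports 'up', fall back to a divisor of 1 so the
    # signal stays finite and scale-from-zero can still trigger.
    return total_waiting / max(live_replicas, 1)

THRESHOLD = 5.0
depth = queue_depth_per_replica([7, 4, 6], live_replicas=3)
print(depth, depth > THRESHOLD)   # ~5.67 > 5, so KEDA scales out

# All replicas down: guard keeps the division defined.
print(queue_depth_per_replica([0], live_replicas=0))
```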

Multi-Node Distributed Inference

For models that exceed single-node GPU memory (e.g., Llama 3.1 405B requires 8+ GPUs), pipeline parallelism across nodes is necessary. vLLM uses Ray for distributed orchestration and NCCL for inter-GPU communication. The network requirements are stringent: NCCL performs poorly over standard TCP/IP, and production multi-node deployments should use InfiniBand or RoCE (RDMA over Converged Ethernet). As a rule of thumb, plan for at least 100 Gbps inter-node bandwidth based on NCCL's requirements for models above ~70B parameters; actual requirements scale with model size and communication volume. In most cases, scaling replicas of a smaller model (or quantized variant) across independent nodes provides better aggregate throughput than running a single model across multiple nodes due to the communication overhead.

OpenAI-Compatible API Configuration

Endpoint Configuration and Model Aliases

vLLM exposes /v1/completions, /v1/chat/completions, and /v1/embeddings endpoints that match the OpenAI API specification; other OpenAI endpoints (e.g., the Assistants API and fine-tuning) are not supported. The --served-model-name flag allows operators to present an alias (e.g., llama-production) so that existing application code targeting OpenAI can be redirected without modification. The --api-key flag (or VLLM_API_KEY environment variable) enables bearer token authentication on all endpoints.

Request Parameters for Production

Production deployments should set sensible defaults and guardrails for max_tokens, temperature, and top_p. Guided decoding enables structured JSON output by constraining the token generation to match a JSON schema, which is critical for tool-calling and agentic workflows.

import json
import os

import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-api.example.com/v1",
    api_key=os.environ["VLLM_API_KEY"],
)

# Streaming chat completion with structured JSON output
try:
    response = client.chat.completions.create(
        model="llama-3.1-8b",
        messages=[
            {"role": "system", "content": "Extract entities as JSON."},
            {"role": "user", "content": "Apple announced the M4 chip in Cupertino."},
        ],
        max_tokens=512,
        temperature=0.1,
        top_p=0.95,
        stream=True,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "entities",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "type": {"type": "string"},
                                },
                                "required": ["name", "type"],
                            },
                        }
                    },
                    "required": ["entities"],
                },
            },
        },
    )

    response_parts = []
    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            response_parts.append(chunk.choices[0].delta.content)

    full_response = "".join(response_parts)
    parsed = json.loads(full_response)
    print(parsed)

except openai.AuthenticationError as e:
    # Non-retryable: bad API key
    raise RuntimeError(f"Authentication failed — check VLLM_API_KEY: {e}") from e
except openai.RateLimitError as e:
    # Retryable: implement exponential backoff in production
    print(f"Rate limited: {e}")
    raise
except openai.APIConnectionError as e:
    print(f"Connection error (retryable): {e}")
    raise
except json.JSONDecodeError as e:
    print(f"Failed to parse model response as JSON. Raw: {full_response!r}. Error: {e}")
    raise
except openai.APIError as e:
    print(f"API error: {e}")
    raise

Performance Optimization for Production Workloads

GPU Memory Utilization Tuning

The --gpu-memory-utilization parameter (range 0.0-1.0) controls what fraction of GPU memory vLLM pre-allocates for the KV cache and model weights. Values between 0.85 and 0.95 are typical. Higher values increase concurrent request capacity but leave less headroom for CUDA memory allocation spikes, which can cause OOM errors under bursty traffic. Setting --max-model-len lower than the model's maximum reduces per-sequence KV-cache reservation and allows more concurrent sequences. The --enforce-eager flag disables CUDA graph capture, trading throughput for lower GPU memory usage (typically 5-15% on models above ~30B parameters; the savings may be larger on models below ~13B parameters or under high-concurrency workloads). This is useful when running close to memory limits.
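The trade-off can be quantified with rough arithmetic. The shapes below assume a Llama-3.1-8B-like model (32 layers, 8 KV heads via GQA, head dimension 128, FP16 cache); real capacity varies and is logged by vLLM at startup.

```python
# Rough effect of --gpu-memory-utilization on KV-cache token capacity.
# Assumed Llama-3.1-8B-like shape; illustrative arithmetic only.

KV_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # K+V, layers, kv_heads, head_dim, fp16
GPU, WEIGHTS = 80e9, 16e9             # H100 80 GB, ~8B params in FP16

caps = {}
for util in (0.85, 0.90, 0.95):
    # Memory left for KV cache after the pre-allocation fraction and weights
    caps[util] = int((GPU * util - WEIGHTS) / KV_PER_TOKEN)
    print(util, caps[util], "cacheable tokens")
```

Each 0.05 step of utilization buys roughly 30K additional cacheable tokens here, which is why operators push the value up, and why the remaining headroom for allocation spikes shrinks at the same rate.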

Batching and Scheduling Configuration

vLLM's continuous batching dynamically adds new requests to in-flight batches without waiting for the current batch to complete. The --max-num-seqs parameter caps the maximum number of concurrent sequences (default: 256). Reducing this value limits memory pressure but caps throughput. The --max-num-batched-tokens parameter controls the total token budget per batch iteration; lower values reduce per-iteration latency but decrease throughput. For latency-sensitive deployments, enabling chunked prefill with smaller chunk sizes ensures that decode iterations are not blocked by large prefill operations.
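Put together, a latency-leaning invocation might look like the following; the specific values are starting points to benchmark against your own traffic, not recommendations.

```shell
# Latency-leaning scheduler settings (benchmark before adopting):
#   --max-num-seqs            caps concurrent sequences, bounding memory pressure
#   --max-num-batched-tokens  smaller per-step token budget, lower per-iteration latency
#   --enable-chunked-prefill  interleaves prefill chunks with decode steps
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.90
```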

Quantization in Practice

If near-lossless quality matters most, FP8 quantization on Hopper GPUs provides the best quality-performance tradeoff, delivering approximately 2x throughput improvement over FP16 on H100 (results vary by model size and batch size). Ada Lovelace GPUs have hardware FP8 support, but verify kernel support for your vLLM version. If memory is the binding constraint, AWQ at 4-bit fits larger models into fewer GPUs, and it tends to outperform GPTQ in vLLM due to more optimized CUDA kernel implementations. For decode-bound workloads where latency dominates, speculative decoding can help: a smaller draft model proposes tokens that the main model verifies in parallel, yielding 1.3-2x speedups when the draft model achieves an acceptance rate ≥ 0.7. Poorly matched draft models may yield no speedup or regression. Configure speculative decoding with --speculative-model and --num-speculative-tokens.
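A speculative decoding invocation using those flags might look like this; the draft/target pairing is an assumption for illustration, and you should measure the acceptance rate on your own workload before adopting it.

```shell
# Speculative decoding sketch: a small draft model proposes 5 tokens per
# step for the 8B target to verify in parallel. Pairing is illustrative.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --gpu-memory-utilization 0.90
```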

Note: The benchmark tool below requires cloning the vLLM repository. It is a standalone script, not an installable Python module.

# Clone the vLLM repository and run from its root directory
git clone https://github.com/vllm-project/vllm.git && cd vllm

# Start a vLLM server separately (e.g., via Docker), then benchmark against it:
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --endpoint /v1/completions \
  --model llama-3.1-8b \
  --dataset-name sharegpt \
  --num-prompts 500 \
  --request-rate 10 \
  --base-url http://localhost:8000

# Sample output interpretation (numbers are illustrative only;
# actual results depend on hardware, model, and traffic pattern):
# Throughput: 842.3 tokens/s (total output throughput across all requests)
# Mean TTFT: 48.2ms | P99 TTFT: 142.7ms
# Mean E2E latency: 1.24s | P99 E2E: 3.87s
# Compare across --quantization awq vs. none, different --max-num-seqs values

Monitoring and Observability

Prometheus Metrics Endpoint

vLLM exposes a /metrics endpoint with Prometheus-formatted metrics. In vLLM ≥ 0.4.x, /metrics is exposed by default. The --enable-metrics flag is a no-op in versions where metrics are on by default; include it explicitly if your vLLM version requires it. The critical production metrics are: vllm:num_requests_running (active request count), vllm:num_requests_waiting (queue depth, the primary indicator of capacity exhaustion), vllm:gpu_cache_usage_perc (KV-cache saturation), vllm:time_to_first_token_seconds (histogram of TTFT), and vllm:e2e_request_latency_seconds (histogram of end-to-end latency).
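A quick way to sanity-check these gauges is to parse the exposition text directly. The payload below is hard-coded sample data so the parsing logic is clear; in practice, fetch it from http://<pod>:8000/metrics, and note that metric names and value ranges vary by vLLM version.

```python
# Parse scalar gauges out of Prometheus text exposition format.
# SAMPLE is illustrative data, not real output.

SAMPLE = """\
vllm:num_requests_running 3.0
vllm:num_requests_waiting 12.0
vllm:gpu_cache_usage_perc 0.87
"""

def parse_gauges(text):
    gauges = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):   # skip HELP/TYPE comments
            name, value = line.rsplit(" ", 1)
            gauges[name] = float(value)
    return gauges

m = parse_gauges(SAMPLE)
# Queue depth is the primary capacity signal; alert when it stays nonzero.
print(m["vllm:num_requests_waiting"], m["vllm:gpu_cache_usage_perc"])
```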

Grafana Dashboard Setup

# Prometheus scrape configuration for vLLM
scrape_configs:
  - job_name: vllm
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - llm-serving
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
    scrape_interval: 15s

Note on gauge thresholds: The thresholds below assume vllm:gpu_cache_usage_perc reports values in the 0-100 range (percentage). If your vLLM version reports this metric in the 0.0-1.0 range (fraction), change the threshold values to 0.90 and 0.95 respectively. Verify with curl http://<pod>:8000/metrics | grep gpu_cache_usage_perc.
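If dashboards or scripts must tolerate both reporting conventions, a small normalization shim avoids maintaining two sets of thresholds. This is a heuristic sketch, not part of vLLM: it assumes readings at or below 1.0 are fractions, which misclassifies a genuinely sub-1% reading, so verify against your actual deployment before using it in alerting.

```python
def cache_usage_percent(raw: float) -> float:
    """Normalize a vllm:gpu_cache_usage_perc reading to a 0-100 scale.

    Heuristic assumption: values <= 1.0 are fractions (0.0-1.0 range);
    anything larger is already a percentage. A true sub-1% percentage
    reading would be misclassified, so validate per deployment.
    """
    return raw * 100.0 if raw <= 1.0 else raw

print(cache_usage_percent(0.5))   # 50.0 (fraction-style reading)
print(cache_usage_percent(95.0))  # 95.0 (already a percentage)
```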

{
  "dashboard": {
    "title": "vLLM Production Monitoring",
    "panels": [
      {
        "title": "Request Throughput (req/s)",
        "type": "timeseries",
        "targets": [{"expr": "rate(vllm:e2e_request_latency_seconds_count[1m])"}]
      },
      {
        "title": "TTFT Latency (p50/p95/p99)",
        "type": "timeseries",
        "targets": [
          {"expr": "histogram_quantile(0.50, rate(vllm:time_to_first_token_seconds_bucket[5m]))", "legendFormat": "p50"},
          {"expr": "histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))", "legendFormat": "p95"},
          {"expr": "histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))", "legendFormat": "p99"}
        ]
      },
      {
        "title": "GPU KV Cache Utilization (%)",
        "type": "gauge",
        "targets": [{"expr": "vllm:gpu_cache_usage_perc"}],
        "thresholds": [{"value": 90, "color": "orange"}, {"value": 95, "color": "red"}]
      },
      {
        "title": "Request Queue Depth",
        "type": "timeseries",
        "targets": [{"expr": "vllm:num_requests_waiting"}]
      },
      {
        "title": "E2E Latency (p50/p95/p99)",
        "type": "timeseries",
        "targets": [
          {"expr": "histogram_quantile(0.50, rate(vllm:e2e_request_latency_seconds_bucket[5m]))", "legendFormat": "p50"},
          {"expr": "histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))", "legendFormat": "p95"},
          {"expr": "histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))", "legendFormat": "p99"}
        ]
      }
    ]
  }
}

Alert rules should fire when vllm:gpu_cache_usage_perc exceeds 95 for more than 2 minutes (indicating capacity exhaustion and imminent request rejections), when TTFT p99 exceeds the SLA threshold, or when the error rate spikes above baseline.
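Those alert conditions translate into Prometheus alerting rules along these lines. The rule names, severities, and the 0.5s TTFT threshold are illustrative placeholders; substitute your actual SLA values, and adjust the cache expression if your vLLM version reports a 0.0-1.0 fraction (see the gauge note above).

```yaml
# Sketch of Prometheus alerting rules; names and thresholds are
# illustrative (assumes gpu_cache_usage_perc reports 0-100).
groups:
  - name: vllm-alerts
    rules:
      - alert: VllmKvCacheSaturated
        expr: vllm:gpu_cache_usage_perc > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "KV cache above 95% for 2m; request rejections imminent"
      - alert: VllmTtftP99High
        expr: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TTFT p99 above SLA threshold (0.5s here is a placeholder)"
```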

Logging and Tracing

vLLM outputs structured logs to stdout by default, which integrates cleanly with Kubernetes log collection via Fluentd, Fluent Bit, or Grafana Loki. For request-level tracing, passing a custom X-Request-ID header through the reverse proxy and correlating it in application logs enables end-to-end tracing across the client, proxy, and inference server. The vLLM server logs include request IDs and timing breakdowns that can be indexed for debugging latency anomalies.
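As a sketch of the proxy side of that pattern, an Nginx front end can forward a client-supplied X-Request-ID or mint one itself. This assumes Nginx 1.11+ (which provides the built-in $request_id variable) and a hypothetical vllm_backend upstream; adapt names to your setup.

```nginx
# Forward the client's X-Request-ID if present; otherwise generate one.
# (map blocks live in the http context; "vllm_backend" is hypothetical.)
map $http_x_request_id $req_id {
    default $http_x_request_id;
    ""      $request_id;        # built-in 32-hex-char unique ID
}

server {
    listen 80;
    location /v1/ {
        proxy_set_header X-Request-ID $req_id;  # correlate in vLLM logs
        proxy_pass http://vllm_backend;
    }
}
```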

Security and Reliability in Production

API Authentication and Rate Limiting

The --api-key flag (or VLLM_API_KEY environment variable) provides bearer token authentication. For multi-tenant environments, a reverse proxy (Nginx, Envoy, or a cloud API gateway) should handle per-client rate limiting, since vLLM does not natively support per-user quotas. In Kubernetes, NetworkPolicy resources should restrict ingress to the vLLM pods to only the reverse proxy or service mesh sidecar, preventing direct access from other workloads.
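A minimal rate-limiting sketch at the Nginx layer might look like the following. The key choice (the Authorization header), rate, burst, and upstream name are all illustrative assumptions; production setups often key on a parsed tenant ID instead, and API gateways offer richer quota semantics.

```nginx
# Per-client rate limiting keyed on the bearer token; all values are
# illustrative and "vllm_backend" is a hypothetical upstream name.
limit_req_zone $http_authorization zone=per_client:10m rate=5r/s;

server {
    location /v1/ {
        limit_req zone=per_client burst=20 nodelay;
        limit_req_status 429;   # signal throttling with HTTP 429
        proxy_pass http://vllm_backend;
    }
}
```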

High Availability Patterns

Running multiple vLLM replicas behind a Kubernetes Service provides basic high availability. The /health endpoint returns HTTP 200 once the model is loaded and ready to serve, which lets the load balancer route around unhealthy pods; startup behavior and error-state response codes vary by vLLM version, so verify against your deployed version's documentation. Kubernetes handles graceful shutdown through terminationGracePeriodSeconds; setting this to 60-120 seconds allows in-flight requests to complete before pod termination. The PVC-backed model cache reduces cold start time from minutes (network download) to seconds (local disk load).
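The probe wiring for that /health endpoint can be sketched as follows. The port and timing values are illustrative assumptions: large models can take many minutes to load, so the startupProbe budget in particular must be tuned to your measured load time rather than copied verbatim.

```yaml
# Sketch of container probes against vLLM's /health endpoint; the port
# and all timing values are illustrative and must be tuned per model.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60      # allows up to ~10 minutes for model load
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
```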

Production Readiness Checklist

This checklist summarizes the guidance from the preceding sections. Use it as a pre-launch gate before routing production traffic.

  1. Pin the vLLM image tag to a specific, verified release version; never use latest in production. Verify the tag exists at Docker Hub or GitHub releases.
  2. Set --max-model-len to the actual maximum sequence length the application needs, not the model's theoretical maximum.
  3. Configure --gpu-memory-utilization between 0.85 and 0.95 and load-test to find the sweet spot for the specific hardware and model combination.
  4. Enable Prometheus metrics and deploy dashboards tracking TTFT p99, queue depth, and KV-cache utilization.
  5. Configure alerts for KV-cache saturation above 95% and TTFT p99 exceeding SLA thresholds.
  6. Implement rate limiting at the reverse proxy layer with appropriate per-client quotas.
  7. Load test with realistic traffic patterns using benchmarks/benchmark_serving.py (from the cloned vLLM repo) before directing production traffic.
  8. Store credentials securely using .env files with restricted permissions, Docker secrets, or Kubernetes Secrets; never pass tokens as plain CLI arguments or in container args.

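For item 8, the Kubernetes-native approach is to inject the key from a Secret rather than baking it into container args. The Secret and key names below are hypothetical placeholders for a container spec fragment.

```yaml
# Sketch: supplying the API key from a Kubernetes Secret instead of
# CLI arguments ("vllm-api-key" / "api-key" are hypothetical names).
env:
  - name: VLLM_API_KEY
    valueFrom:
      secretKeyRef:
        name: vllm-api-key
        key: api-key
```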
For the complete set of deployment manifests, Docker Compose files, and Grafana dashboards referenced in this guide, the vLLM project maintains official documentation at docs.vllm.ai and the source repository at github.com/vllm-project/vllm.

SitePoint Team

Sharing our passion for building incredible internet things.
