How to Run Multiple LLM Models on One GPU
- Audit your total GPU VRAM and system RAM available for inference.
- Calculate each model's VRAM footprint using the formula: parameters × bytes_per_parameter + KV cache + overhead.
- Choose a quantization level (e.g., Q4_K_M) that fits your quality and memory constraints.
- Select a strategy: concurrent loading for always-on pairs, hot-swapping for sequential access, or Docker isolation for production.
- Configure per-model GPU layer counts and context lengths to partition VRAM deliberately.
- Monitor real-time VRAM usage with nvidia-smi or nvtop to catch silent CPU fallback and fragmentation.
- Tune keep-alive TTLs and KV cache slots to match actual request patterns and concurrency needs.
Table of Contents
- Why Running Multiple Local Models Is Hard
- Understanding LLM Memory Anatomy
- Strategy 1: Concurrent Loading with VRAM Budgeting
- Strategy 2: Hot-Swapping Models on Demand
- Strategy 3: Containerized Model Isolation with Docker
- Advanced Techniques: Squeezing More from Limited VRAM
- Monitoring and Debugging Memory Issues
- Putting It All Together: Decision Framework
- Key Takeaways
Prerequisites — Tested Versions
The examples in this article were tested with: Ollama ≥ 0.3.14, llama.cpp server build b3400+, Python 3.10+, CUDA 12.2+ (driver 535+), Docker 26+ with Compose v2.20+, NVIDIA Container Toolkit 1.14+, on Ubuntu 22.04 LTS. GPU-accelerated examples assume an NVIDIA GPU with at least 8 GB VRAM. Load time estimates assume NVMe Gen4 SSD storage unless otherwise noted. CPU offload examples assume 32 GB or more of system RAM.
Why Running Multiple Local Models Is Hard
Running multiple local models simultaneously appeals to any developer building pipelines that combine specialized LLMs for coding, retrieval-augmented generation, chat, and embeddings. A coding assistant handles code completion, a small embedding model powers semantic search, and a larger chat model handles user-facing interactions. The obstacle is simple: GPU VRAM is finite and shared. Loading a second model without accounting for the first leads to out-of-memory crashes, silent CPU fallback, or throughput that collapses from ~40 t/s fully on GPU to ~3 t/s when the framework quietly offloads to system RAM. Memory management for local LLMs requires deliberate planning, not guesswork.
This tutorial walks through a systematic approach to memory planning, allocation, and orchestration across three strategies: concurrent loading with VRAM budgeting, hot-swapping models on demand, and containerized model isolation with Docker. It draws on the latest capabilities in Ollama, llama.cpp, and the NVIDIA Container Toolkit. Prerequisites include familiarity with LLM inference concepts, basic GPU architecture, Python, Docker, and command-line tooling.
Understanding LLM Memory Anatomy
What Actually Consumes VRAM
Model weights dominate VRAM consumption. A 7-billion-parameter model in 16-bit floating point occupies roughly 14 GB just for weight storage. At 4096 context length with a batch size of 1, the KV cache for a 7B model typically consumes 0.5 to 1.0 GB, but this grows linearly with concurrent requests and longer contexts under fixed-slot allocation (as in llama.cpp --parallel). Paged attention systems such as vLLM allocate KV cache dynamically and may not grow linearly.
Activation memory covers intermediate tensors during the forward pass. Runtime overhead includes CUDA context allocation (typically 300 to 800 MB depending on driver version and framework; verify against your specific CUDA and driver versions), cuBLAS workspace buffers, and memory fragmentation from repeated allocations. Two processes sharing a GPU each pay this overhead independently.
Calculating VRAM Requirements per Model
The core formula is:
VRAM ≈ (parameters × bytes_per_parameter) + KV_cache + overhead
Where bytes_per_parameter depends on quantization: FP16 uses 2 bytes, Q8_0 roughly 1.1 bytes, Q6_K about 0.83 bytes, Q5_K_M about 0.73 bytes, Q4_K_M about 0.64 bytes, and Q3_K_S about 0.53 bytes. Estimate KV cache size as 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element, where the leading 2 accounts for the separate K and V tensors and bytes_per_element is 2 for the FP16 KV cache that most backends use by default. Some backends support Q8 or Q4 KV cache; check your framework's documentation.
Practical estimates for Q4_K_M quantization: a 7B model lands at approximately 4.4 GB for weights alone (roughly 5.0 to 5.5 GB loaded with KV cache and overhead at 4096 context), a 13B model at approximately 8.2 GB (9.0 to 9.5 GB loaded), and a 70B model at approximately 41 to 42 GB (exceeding 24 GB single-GPU VRAM without partial offload).
The following Python script computes these estimates programmatically:
QUANT_BYTES = {
    "FP16": 2.0,
    "Q8_0": 1.1,
    "Q6_K": 0.83,
    "Q5_K_M": 0.73,
    "Q4_K_M": 0.64,
    "Q3_K_S": 0.53,
    "IQ2_XXS": 0.29,
}

def estimate_vram(
    params_billion: float,
    quant: str,
    context_length: int = 4096,
    batch_size: int = 1,
    num_layers: int | None = None,
    num_kv_heads: int | None = None,
    head_dim: int = 128,
    overhead_gb: float = 0.6,
):
    """Estimate VRAM usage for a quantized LLM.

    num_layers and num_kv_heads vary by architecture. Pass explicit values
    from the model card for accuracy.
    """
    if quant not in QUANT_BYTES:
        raise ValueError(
            f"Unknown quantization '{quant}'. "
            f"Valid options: {list(QUANT_BYTES.keys())}"
        )
    if num_layers is None:
        raise ValueError(
            "num_layers is required. Pass the value from your model card. "
            "Heuristic inference has been removed due to inaccuracy across architectures."
        )
    if num_kv_heads is None:
        raise ValueError(
            "num_kv_heads is required. Pass the value from your model card."
        )
    params = params_billion * 1e9
    bpp = QUANT_BYTES[quant]
    weight_gb = (params * bpp) / (1024**3)
    kv_factor = 2   # K and V tensors
    fp16_bytes = 2  # bytes per element for FP16 KV cache
    kv_bytes = (
        kv_factor * num_layers * num_kv_heads * head_dim
        * context_length * batch_size * fp16_bytes
    )
    kv_gb = kv_bytes / (1024**3)
    total = weight_gb + kv_gb + overhead_gb
    print(f"{'Component':<25} {'GB':>8}")
    print("-" * 35)
    print(f"{'Model weights':<25} {weight_gb:>8.2f}")
    print(f"{'KV cache':<25} {kv_gb:>8.2f}")
    print(f"{'Overhead (CUDA/runtime)':<25} {overhead_gb:>8.2f}")
    print("-" * 35)
    print(f"{'TOTAL ESTIMATED VRAM':<25} {total:>8.2f}")
    return total

# Example: 7B Q4_K_M at 4096 context (Llama 3.1 8B architecture)
print("=== 7B Q4_K_M ===")
estimate_vram(7, "Q4_K_M", context_length=4096,
              num_layers=32, num_kv_heads=8, head_dim=128)
print("\n=== 13B Q4_K_M ===")
estimate_vram(13, "Q4_K_M", context_length=4096,
              num_layers=40, num_kv_heads=40, head_dim=128)
VRAM vs. System RAM: When CPU Offloading Helps (and Hurts)
Three execution modes exist: full GPU inference (all layers in VRAM), partial offload (some layers on GPU, the rest on CPU/system RAM), and full CPU inference. Partial offload computes the layers that exceed VRAM on the CPU while keeping the rest on GPU, but every generated token requires transferring activations across the PCIe bus at each offloaded boundary. On PCIe 4.0 x16, bandwidth peaks at roughly 32 GB/s per direction (bidirectional peak is ~64 GB/s, but LLM layer offload traffic is effectively unidirectional per token), with 20 to 25 GB/s typical in practice, introducing measurable latency per offloaded layer. A 70B model with half its layers offloaded may generate 3 to 5 tokens per second on a 24 GB GPU, compared to 15+ tokens per second fully resident on a 48 GB GPU.
Full CPU inference on DDR5-4800 dual-channel systems can manage 1 to 3 tokens per second for a 7B Q4_K_M model. Apple Silicon unified memory avoids the PCIe bottleneck entirely, making partial offload patterns less penalizing on M-series chips. A 34B Q4_K_M model at 4096 context on an M4 Max with 128 GB unified memory achieves 10+ tokens per second thanks to the ~400 GB/s memory bandwidth.
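These throughput cliffs follow from a memory-bandwidth roofline: each generated token reads every weight once, so tokens per second is bounded by bandwidth divided by bytes read. The sketch below computes that bound; the bandwidth defaults (~1000 GB/s for an RTX 4090-class GPU, ~75 GB/s for dual-channel DDR5-4800) are illustrative assumptions, not measurements.

```python
def roofline_tps(weight_gb: float, gpu_frac: float,
                 gpu_bw_gbs: float = 1000.0, cpu_bw_gbs: float = 75.0) -> float:
    """Upper-bound tokens/sec for a bandwidth-bound decode step.

    weight_gb: total quantized weight size; gpu_frac: fraction of layers on GPU.
    Each token reads every weight once, from GPU VRAM or system RAM depending
    on where the layer lives, so per-token time is the sum of the two reads.
    """
    gpu_time = (weight_gb * gpu_frac) / gpu_bw_gbs
    cpu_time = (weight_gb * (1.0 - gpu_frac)) / cpu_bw_gbs
    return 1.0 / (gpu_time + cpu_time)

# 7B Q4_K_M (~4.4 GB) fully resident: bound is in the hundreds of t/s;
# measured ~48 t/s, since compute and KV-cache reads also cost time.
print(f"7B full GPU bound:      {roofline_tps(4.4, 1.0):.0f} t/s")
# 70B Q4_K_M (~42 GB) half offloaded: the slow CPU-side read dominates,
# collapsing the bound to the low single digits.
print(f"70B half-offload bound: {roofline_tps(42.0, 0.5):.1f} t/s")
```

The bound is loose in absolute terms, but the ratio between configurations tracks the observed behavior: moving half of a 70B model to system RAM costs roughly an order of magnitude of throughput.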
VRAM Consumption Breakdown by Model Size and Quantization (24 GB GPU)
| Model | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S |
|---|---|---|---|---|---|
| 7B | ~7.7 GB / ~35 t/s | ~5.8 GB / ~42 t/s | ~5.1 GB / ~45 t/s | ~4.4 GB / ~48 t/s | ~3.7 GB / ~44 t/s |
| 13B | ~14.3 GB / ~20 t/s | ~10.8 GB / ~25 t/s | ~9.5 GB / ~27 t/s | ~8.2 GB / ~29 t/s | ~6.9 GB / ~26 t/s |
| 34B | ~37.4 GB / CPU* | ~28.2 GB / CPU* | ~24.8 GB / partial | ~21.8 GB / ~12 t/s | ~18.0 GB / ~11 t/s |
| 70B | ~77.0 GB / CPU* | ~58.1 GB / CPU* | ~51.1 GB / CPU* | ~44.8 GB / CPU* | ~37.1 GB / CPU* |
*Requires partial or full CPU offload on 24 GB. Tokens/sec estimates assume single-request inference on an RTX 4090 or comparable 24 GB GPU. Weight-only estimates; add 0.5 to 2.0 GB for KV cache and overhead depending on context length.
Strategy 1: Concurrent Loading with VRAM Budgeting
When to Use Concurrent Loading
Two or more models must serve requests with sub-second switching. A typical scenario: a 7B coding assistant handles inline completions while a small embedding model (under 1 GB) processes document chunks for RAG indexing. Both need to be resident in VRAM. The constraint is hard: every model must physically fit in VRAM at the same time, including KV cache for their expected concurrent request load.
Configuring Ollama for Multiple Resident Models
Ollama exposes two environment variables for concurrent serving. OLLAMA_MAX_LOADED_MODELS controls how many models remain in VRAM simultaneously (default is 1; verify against your Ollama version's release notes, as behavior changed between 0.1.x and 0.3.x). OLLAMA_NUM_PARALLEL sets the number of concurrent requests per model (default is 1 in most Ollama versions; check ollama --help or release notes for your version), which directly affects KV cache allocation. Increasing OLLAMA_NUM_PARALLEL multiplies KV cache memory by the parallelism factor.
Set the num_gpu parameter in a Modelfile to control per-model VRAM allocation. This parameter specifies how many layers to offload to GPU, letting you manually partition VRAM between models by giving each a calculated number of GPU layers.
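A minimal Modelfile sketch for such a partition might look like the following; the layer count and context length here are illustrative values to be replaced with numbers from your own VRAM budget:

```
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_gpu 24
PARAMETER num_ctx 2048
```

Build and run it with `ollama create coder-budgeted -f Modelfile` and `ollama run coder-budgeted` (the name `coder-budgeted` is arbitrary).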
# Set environment variables before launching Ollama
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_KEEP_ALIVE="10m" # Overrides default of 5m; tune to match your request inter-arrival time.
# Start the Ollama server
ollama serve &
# Pull models
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull nomic-embed-text
# Trigger loading both models (first request loads each)
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b-instruct-q4_K_M", "prompt": "warmup", "stream": false}'
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "input": "warmup"}'
# Verify both loaded with: ollama ps
# nomic-embed-text: ~0.5-0.8 GB VRAM (weights ~274 MB + CUDA context; verify with nvidia-smi)
# llama3.1:8b Q4_K_M: ~5.5 GB VRAM (at default context 2048 and parallel 1)
# Total: ~6.0-6.3 GB — fits comfortably on 24 GB with room for larger context
Configuring llama.cpp Multi-Model Serving
Running multiple llama-server instances on different ports provides finer control. Each instance gets an explicit --n-gpu-layers value calculated from the VRAM budget.
# Replace ./llama-server with the full path to your llama.cpp server binary.
# Built from source: ./build/bin/llama-server
# Installed via package: llama-server
# Verify: which llama-server || ls ./build/bin/llama-server
set -e
# Instance 1: 13B Q4_K_M model — allocate ~9.5 GB VRAM (all 40 layers, weights ~8.2 GB + KV + overhead at 4096 context)
./llama-server \
--model models/llama-13b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 40 \
--ctx-size 4096 \
--parallel 1 &
PID1=$!
# Instance 2: 7B Q4_K_M model — allocate partial VRAM (24 of 32 layers on GPU)
# Remaining 8 layers offloaded to CPU to stay within 24 GB total
./llama-server \
--model models/llama-7b-q4_k_m.gguf \
--port 8081 \
--n-gpu-layers 24 \
--ctx-size 2048 \
--parallel 1 &
PID2=$!
sleep 2
kill -0 $PID1 || echo "Server 1 (13B) failed to start"
kill -0 $PID2 || echo "Server 2 (7B) failed to start"
# Combined estimate: ~9.5 GB + ~6.5 GB (partial 7B) + ~1.2 GB overhead = ~17.2 GB on a 24 GB GPU
# Monitor with: watch -n 1 nvidia-smi
Pitfalls: Fragmentation and OOM Under Load
KV cache growth under concurrent requests is the most common source of unexpected OOM. A model loaded at --parallel 4 allocates four KV cache slots at initialization. If context length is 8192 and each slot consumes 512 MB, that is 2 GB of KV cache alone for one model. Monitoring tools like nvidia-smi dmon (sampled at 1-second intervals) and nvtop (real-time per-process VRAM tracking) reveal actual consumption under load. Repeated load/unload cycles can cause CUDA memory fragmentation; no reliable threshold exists for when this becomes a problem, so monitor free contiguous VRAM and restart the server proactively when allocations start failing despite sufficient total free memory.
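That slot arithmetic is worth checking before launch. A small helper reusing the KV-cache formula from earlier — the architecture numbers in the example are Llama 3.1 8B values (32 layers, 8 KV heads via GQA, head dim 128); substitute your model card's:

```python
def kv_slot_gb(num_layers: int, num_kv_heads: int, head_dim: int,
               ctx: int, bytes_per_element: int = 2) -> float:
    """FP16 KV cache for ONE slot: 2 (K and V) x layers x kv_heads x head_dim x ctx."""
    return 2 * num_layers * num_kv_heads * head_dim * ctx * bytes_per_element / 1024**3

def reserved_kv_gb(parallel: int, **arch) -> float:
    """Fixed-slot backends (llama.cpp --parallel) allocate every slot at startup,
    whether or not requests ever arrive: reservation = slots x per-slot size."""
    return parallel * kv_slot_gb(**arch)

llama31_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"per slot @ 8192 ctx: {kv_slot_gb(ctx=8192, **llama31_8b):.2f} GB")  # 1.00 GB
print(f"--parallel 4 total:  {reserved_kv_gb(4, ctx=8192, **llama31_8b):.2f} GB")  # 4.00 GB
```

Note that GQA makes this model cheaper per slot than the 512 MB example above; older architectures without GQA (num_kv_heads equal to the full head count) reserve several times more.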
Strategy 2: Hot-Swapping Models on Demand
When to Use Hot-Swapping
You need many models but never at the same time. A developer running five specialized models (code, chat, summarization, translation, embeddings) with 2 to 10 seconds of acceptable latency between switches can give each model full VRAM access during its active window. This maximizes per-model throughput at the cost of swap latency.
Ollama's Built-In Model Scheduling
Ollama handles model eviction automatically through its keep-alive mechanism. The OLLAMA_KEEP_ALIVE environment variable (default 5m) controls how long an idle model remains in VRAM. Setting it to 0 unloads immediately after each request. Setting it to 30s keeps the model warm for half a minute. When a new model is requested and VRAM is insufficient, Ollama evicts the least recently used resident model. Tuning this TTL to match request patterns reduces unnecessary reloads.
Building a Python Hot-Swap Orchestrator
For workflows requiring explicit control over model lifecycle, a lightweight orchestrator using the Ollama HTTP API provides LRU-based swap management:
import requests
import time
import logging
import os
import threading
from collections import OrderedDict

log = logging.getLogger(__name__)

OLLAMA_BASE = os.environ.get("OLLAMA_BASE", "http://localhost:11434")
CONNECT_TIMEOUT = 5   # seconds to establish connection
READ_TIMEOUT = 120    # seconds to wait for response body

class HotSwapManager:
    def __init__(self, vram_budget_gb: float = 22.0):
        self.vram_budget = vram_budget_gb
        self.loaded: OrderedDict[str, float] = OrderedDict()  # model -> vram_gb
        self._lock = threading.Lock()
        self.model_vram = {
            "llama3.1:8b-instruct-q4_K_M": 5.5,
            "codellama:7b-instruct-q4_K_M": 5.0,
            "nomic-embed-text": 0.7,
            "mistral:7b-instruct-q4_K_M": 5.2,
            "llama3.1:70b-instruct-q4_K_M": 42.0,
        }
        self._sync_state()

    def _sync_state(self):
        """Reconcile in-memory state against Ollama's actual loaded models."""
        try:
            resp = requests.get(
                f"{OLLAMA_BASE}/api/ps",
                timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
            )
            resp.raise_for_status()
            for entry in resp.json().get("models", []):
                name = entry.get("name", "")
                # Use reported VRAM if available, else fall back to our dict
                vram = entry.get("size_vram", 0) / (1024 ** 3)
                self.loaded[name] = vram or self.model_vram.get(name, 6.0)
            log.info("Synced state: %s", list(self.loaded.keys()))
        except requests.RequestException as e:
            log.warning("Could not sync Ollama state: %s. Proceeding with empty state.", e)

    def _used_vram(self) -> float:
        return sum(self.loaded.values())

    def _unload_model(self, model: str):
        if model not in self.loaded:
            log.debug("Skipping unload of '%s': not tracked as loaded.", model)
            return
        log.info("Unloading %s...", model)
        try:
            resp = requests.post(
                f"{OLLAMA_BASE}/api/generate",
                json={"model": model, "keep_alive": 0},
                timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
            )
            resp.raise_for_status()
            self.loaded.pop(model, None)
        except requests.RequestException as e:
            log.error("Failed to unload '%s': %s. State may be inconsistent.", model, e)
            raise

    def _ensure_capacity(self, required_gb: float):
        # Called under self._lock
        while self._used_vram() + required_gb > self.vram_budget and self.loaded:
            evict_model = next(iter(self.loaded))  # LRU eviction
            self._unload_model(evict_model)

    def generate(self, model: str, prompt: str) -> str:
        if model not in self.model_vram:
            raise ValueError(
                f"Unknown model '{model}'. Add it to model_vram dict with its VRAM estimate."
            )
        needed = self.model_vram[model]
        if needed > self.vram_budget:
            raise ValueError(
                f"'{model}' requires {needed} GB, exceeds budget {self.vram_budget} GB"
            )
        with self._lock:
            if model in self.loaded:
                self.loaded.move_to_end(model)  # Mark as recently used
            else:
                self._ensure_capacity(needed)
                log.info("Loading %s (~%.1f GB)...", model, needed)
                self.loaded[model] = needed
        start = time.time()
        resp = requests.post(
            f"{OLLAMA_BASE}/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
        )
        resp.raise_for_status()
        elapsed = time.time() - start
        log.info("Response from %s in %.1fs", model, elapsed)
        return resp.json().get("response", "")

# Usage
manager = HotSwapManager(vram_budget_gb=22.0)
print(manager.generate("llama3.1:8b-instruct-q4_K_M", "Explain quicksort."))
print(manager.generate("codellama:7b-instruct-q4_K_M", "Write a Python fibonacci function."))
print(manager.generate("mistral:7b-instruct-q4_K_M", "Summarize the TCP handshake."))
Measuring Swap Latency
Cold load times (model not in system page cache) for common GGUF models from local storage: a 7B Q4_K_M loads in roughly 1 to 3 seconds from NVMe Gen4 SSD or 8 to 12 seconds from SATA SSD, a 13B Q4_K_M in 4 to 7 seconds (NVMe), and a 70B Q4_K_M requiring partial offload in 15 to 30 seconds (NVMe; significantly longer on SATA). The OS page cache retains the model file after recent use, cutting reload times roughly in half when the model file fits in available system RAM. Streaming the first token while the full context is still processing reduces perceived latency. Async preloading, triggered by predicting the next model from request patterns, can overlap model loading with the user's current interaction.
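To measure these numbers on your own setup, time a cold request (which includes the load) against an immediately following warm one; the difference approximates swap latency. A stdlib-only sketch against the Ollama HTTP API — the endpoint and model names match those used earlier in this article, and OLLAMA_BASE is an assumption to adjust for your server:

```python
import json
import time
import urllib.request

OLLAMA_BASE = "http://localhost:11434"  # adjust for your server

def timed_generate(model: str, prompt: str = "warmup") -> tuple[dict, float]:
    """POST one non-streaming /api/generate request; return (response_json, seconds)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start

def swap_overhead(cold_s: float, warm_s: float) -> float:
    """Approximate model load cost: cold-request time minus warm-request time."""
    return max(cold_s - warm_s, 0.0)

# Usage (requires a running Ollama server):
#   _, cold = timed_generate("llama3.1:8b-instruct-q4_K_M")  # triggers load
#   _, warm = timed_generate("llama3.1:8b-instruct-q4_K_M")  # already resident
#   print(f"swap overhead ~ {swap_overhead(cold, warm):.1f}s")
```

Run the pair twice, once after `ollama stop` or a server restart (cold, page cache empty) and once after a recent load, to see the page-cache effect described above.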
Strategy Comparison: Concurrent vs. Hot-Swap vs. Containerized
| Factor | Concurrent | Hot-Swap | Containerized |
|---|---|---|---|
| Max models (24 GB) | 2-4 small | Unlimited (sequential) | 2-3 with GPU sharing |
| Switch latency | <100 ms* | 2-10 seconds | 2-10 seconds + container overhead |
| VRAM efficiency | Low (partitioned) | High (full VRAM per model) | Medium (overhead per container) |
| Complexity | Low | Medium | High |
| Best use case | Always-on pairs | Many models, tolerant latency | Production, multi-tenant |
*Assumes models already loaded in VRAM; latency is routing overhead only, measured at the API layer.
The table captures the broad trade-offs, but one thing it cannot show: concurrent loading penalizes large models disproportionately because VRAM partitioning forces aggressive quantization or partial offload on every resident model, not just the largest one.
Strategy 3: Containerized Model Isolation with Docker
When to Use Container Isolation
Production deployments, multi-tenant serving, and any scenario where you need hard resource limits and crash isolation. If one model's inference process segfaults or leaks memory, other containers remain unaffected. Each container can run different framework versions, different model formats, and scale independently.
Docker + NVIDIA Container Toolkit Setup
The NVIDIA Container Toolkit must be installed for GPU passthrough. Verify with nvidia-smi inside a container: docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi. The --gpus flag controls GPU visibility. Use --gpus all to expose all GPUs to the container, or --gpus '"device=0"' to restrict to a specific GPU. Resource limits in Docker Compose pin memory limits.
Docker Compose for Multi-Model Deployments
Note: Create an nginx.conf file before running docker compose up; the proxy service bind-mounts it and will fail if it does not exist. See the llama.cpp server wiki for a reference upstream proxy configuration.
# Docker Compose Specification (Compose v2.20+, Docker 26+)
services:
  chat-model:
    image: ollama/ollama:0.3.14
    container_name: ollama-chat
    ports:
      - "11434:11434"
    volumes:
      - ollama-chat-data:/root/.ollama
    mem_limit: 16g  # Enforces limit in standalone Docker Compose (non-Swarm).
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=10m
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16g  # Only honored in Docker Swarm mode; mem_limit above enforces in standalone.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
    # "$$" escapes "$" so Compose does not interpolate it on the host.
    entrypoint: >
      sh -c "ollama serve &
      SERVER_PID=$$! &&
      until curl -sf http://localhost:11434/api/tags >/dev/null; do sleep 1; done &&
      ollama pull llama3.1:8b-instruct-q4_K_M &&
      wait $$SERVER_PID"
    # For production, use a dedicated entrypoint script with exec and signal trapping (trap).
  embedding-model:
    image: ghcr.io/ggerganov/llama.cpp:server-b3400
    container_name: llama-cpp-embed
    ports:
      - "8081:8080"
    volumes:
      - ./models:/models:ro
    mem_limit: 4g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 4g
    command: >
      --model /models/nomic-embed-text-v1.5-Q4_K_M.gguf
      --port 8080
      --n-gpu-layers 99
      --ctx-size 2048
      --embedding
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  proxy:
    image: nginx:alpine
    container_name: model-proxy
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      chat-model:
        condition: service_healthy
      embedding-model:
        condition: service_healthy

volumes:
  ollama-chat-data:
GPU Sharing Strategies in Containers
Multi-Instance GPU (MIG) on NVIDIA A100, H100, and A30 hardware partitions a single GPU into isolated instances with dedicated memory and compute. A single A100 80 GB can split into up to seven 10 GB instances (A100 40 GB supports 7 × 5 GB instances), each functioning as an independent GPU. This provides the strongest isolation but requires supported hardware and driver configuration. Instance profiles vary by GPU SKU; consult nvidia-smi mig -lgip for available profiles on your hardware.
Time-slicing via the NVIDIA device plugin for Kubernetes (or manual configuration in Docker) lets multiple containers share a GPU by interleaving kernel execution. There is no memory isolation: a container can still allocate VRAM beyond its "fair share" and cause OOM for neighbors. This works for bursty workloads where containers rarely need the GPU simultaneously.
Multi-Process Service (MPS) enables concurrent kernel execution from multiple processes on a single GPU, improving utilization when individual processes underuse compute resources. MPS works well for serving many small models that each use a fraction of the GPU's compute capacity. It works poorly when models are large enough to saturate the GPU individually.
Advanced Techniques: Squeezing More from Limited VRAM
Aggressive Quantization for Multi-Model Fits
Dropping below Q4_K_M opens space for additional models. Q3_K_S cuts a 7B model to approximately 3.7 GB (versus 4.4 GB for Q4_K_M), saving roughly 500 to 700 MB of loaded VRAM. IQ2_XXS pushes a 7B model below 2.1 GB for weights only (loaded with KV cache and overhead, expect ~2.3 to 2.5 GB), but expect a measurable quality drop on reasoning-heavy tasks. Benchmark this on your target task: published comparisons show 5 to 15 percentage point drops on benchmarks like MMLU and HumanEval relative to Q4_K_M. For embedding models and classification tasks, aggressive quantization often has negligible impact on retrieval recall or classification accuracy, making sub-Q4 quantization practical for multi-model deployments where not every model needs maximum quality.
Context Length Reduction and Dynamic Allocation
Reducing --ctx-size directly reduces KV cache memory. An embedding model processing 512-token chunks needs only --ctx-size 512, saving over 75% of KV cache memory compared to 4096. A chat model might need 4096 or 8192 for conversational context, while a code completion model working on single functions can operate at 2048. Tune each model's context length to its task; this often frees enough VRAM for an additional small model.
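The saving is easy to verify, since every term in the KV-cache formula scales linearly with context length. In the check below, the architecture defaults are illustrative small-model values, not from any specific model card — the ratio is independent of them anyway:

```python
def kv_gb(ctx: int, num_layers: int = 12, num_kv_heads: int = 12,
          head_dim: int = 64) -> float:
    """FP16 KV cache in GB for one slot at the given context length."""
    return 2 * num_layers * num_kv_heads * head_dim * ctx * 2 / 1024**3

# 512 is 1/8 of 4096, so the KV cache shrinks by 7/8 regardless of architecture.
saving = 1 - kv_gb(512) / kv_gb(4096)
print(f"KV cache saving at 512 vs 4096 context: {saving:.1%}")  # 87.5%
```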
Unified Memory and Partial CPU Offload Patterns
Strategic layer splitting keeps the most memory-bandwidth-sensitive layers on the GPU. In transformer architectures each layer contains both attention and FFN sublayers, and llama.cpp offloads whole layers (0 through N) to the GPU; it provides no mechanism to GPU-offload attention sublayers independently of FFN sublayers within the same layer. FFN-heavy computation in CPU-offloaded layers tolerates the bandwidth penalty with less proportional throughput loss than attention-heavy operations.
# CodeLlama 34B Q4_K_M model: 48 layers total
# (Layer counts vary by model architecture — verify with your model's metadata
# or by running: ./llama-server --model <path> --n-gpu-layers 0 and inspecting output)
# Keep 20 layers on GPU (~12 GB VRAM), offload 28 to CPU
./llama-server \
--model models/codellama-34b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 20 \
--ctx-size 4096 \
--parallel 1 \
--threads 8 # CPU threads for offloaded layers
# Expected throughput impact:
# Full GPU (all 48 layers): ~12 t/s on 48 GB GPU
# 20 GPU + 28 CPU layers: ~4-6 t/s on 24 GB GPU
# Full CPU: ~1-2 t/s
# The 20-layer split reclaims ~12 GB for other models while
# maintaining 3-5x the throughput of full CPU inference
On Apple Silicon, the unified memory architecture eliminates the PCIe transfer penalty. An M4 Max with 128 GB unified memory (verify your specific SKU configuration) can run a 34B Q4_K_M model at 4096 context with all layers "offloaded" to CPU at near-GPU bandwidth, achieving 10+ tokens per second thanks to the ~400 GB/s memory bandwidth.
Monitoring and Debugging Memory Issues
Real-Time VRAM Monitoring
nvidia-smi dmon -d 1 provides per-second sampling of GPU utilization, memory usage, temperature, and power draw. nvtop gives a top-like interactive view with per-process VRAM breakdown, showing exactly which model instance consumes what. gpustat --watch offers a minimal, scriptable alternative. For automated alerts, a simple loop checking nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits against a threshold (e.g., alert at 90% of total VRAM) can trigger notifications before OOM occurs.
Diagnosing Common Failures
Throughput drops to single-digit t/s with no error message. This is silent CPU fallback: a framework detects insufficient VRAM and quietly offloads layers to system RAM without notification. Monitor token generation speed per request to catch it.
OOM errors during KV cache expansion, rather than at model load, catch many practitioners off guard. A model loads successfully, but the first long-context request triggers an allocation failure. This happens when --parallel or context length was set without accounting for KV cache growth at full utilization.
CUDA memory fragmentation accumulates after repeated load/unload cycles. Total free VRAM may show sufficient space, but allocations fail because no contiguous block is large enough. Restarting the inference server process is the most reliable fix; some frameworks support explicit cache clearing between model loads.
Putting It All Together: Decision Framework
The choice of strategy follows from three variables: total VRAM, number of models, and latency tolerance.
| GPU VRAM | 2 Models | 3-5 Models | 5+ Models |
|---|---|---|---|
| 8 GB | Hot-swap only | Hot-swap only | Hot-swap only |
| 16 GB | Concurrent (if small) | Hot-swap | Hot-swap |
| 24 GB | Concurrent | Concurrent small + hot-swap large | Hot-swap + Docker isolation |
| 48 GB+ | Concurrent | Concurrent | Concurrent + Docker for production |
Start with concurrent loading for your always-on models and add hot-swap for the rest. Docker isolation adds value when stability guarantees, reproducibility, or multi-tenant access control matter more than minimizing overhead.
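For provisioning scripts that need to pick a default automatically, the table encodes directly into a function. The thresholds mirror the table above and should be treated as starting points, not hard rules:

```python
def choose_strategy(vram_gb: float, n_models: int, production: bool = False) -> str:
    """Map (VRAM, model count) to the decision table's recommendation."""
    if vram_gb >= 48:
        return "concurrent + docker" if production else "concurrent"
    if vram_gb >= 24:
        if n_models <= 2:
            return "concurrent"
        if n_models <= 5:
            return "concurrent small + hot-swap large"
        return "hot-swap + docker isolation"
    if vram_gb >= 16:
        return "concurrent (if small)" if n_models <= 2 else "hot-swap"
    return "hot-swap"  # 8 GB class: hot-swap only

print(choose_strategy(24, 4))  # concurrent small + hot-swap large
print(choose_strategy(8, 3))   # hot-swap
```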
Key Takeaways
Calculate before loading: the VRAM formula and the estimation script above prevent the most common deployment failures. Always pass explicit num_layers and num_kv_heads values from the model card rather than relying on heuristics. Match your strategy to actual usage patterns. Ollama's built-in keep-alive scheduling handles hot-swapping with zero custom code for many workflows, so start there before building custom orchestration. Graduate to Docker isolation with GPU resource pinning when moving toward production serving. The Ollama documentation, llama.cpp server wiki, and NVIDIA's MIG and MPS guides provide the reference material for deeper configuration.

