How to Run Multiple LLM Models on One GPU
- Audit your total GPU VRAM and system RAM available for inference.
- Calculate each model's VRAM footprint using the formula: parameters × bytes_per_parameter + KV cache + overhead.
- Choose a quantization level (e.g., Q4_K_M) that fits your quality and memory constraints.
- Select a strategy: concurrent loading for always-on pairs, hot-swapping for sequential access, or Docker isolation for production.
- Configure per-model GPU layer counts and context lengths to partition VRAM deliberately.
- Monitor real-time VRAM usage with nvidia-smi or nvtop to catch silent CPU fallback and fragmentation.
- Tune keep-alive TTLs and KV cache slots to match actual request patterns and concurrency needs.
Table of Contents
- Why Running Multiple Local Models Is Hard
- Understanding LLM Memory Anatomy
- Strategy 1: Concurrent Loading with VRAM Budgeting
- Strategy 2: Hot-Swapping Models on Demand
- Strategy 3: Containerized Model Isolation with Docker
- Advanced Techniques: Squeezing More from Limited VRAM
- Monitoring and Debugging Memory Issues
- Putting It All Together: Decision Framework
- Key Takeaways
Prerequisites — Tested Versions
The examples in this article were tested with: Ollama ≥ 0.3.14, llama.cpp server build b3400+, Python 3.10+, CUDA 12.2+ (driver 535+), Docker 26+ with Compose v2.20+, NVIDIA Container Toolkit 1.14+, on Ubuntu 22.04 LTS. GPU-accelerated examples assume an NVIDIA GPU with at least 8 GB VRAM. Load time estimates assume NVMe Gen4 SSD storage unless otherwise noted. CPU offload examples assume 32 GB or more of system RAM.
Why Running Multiple Local Models Is Hard
Running multiple local models simultaneously appeals to any developer building pipelines that combine specialized LLMs for coding, retrieval-augmented generation, chat, and embeddings. A coding assistant handles code completion, a small embedding model powers semantic search, and a larger chat model handles user-facing interactions. The obstacle is simple: GPU VRAM is finite and shared. Loading a second model without accounting for the first leads to out-of-memory crashes, silent CPU fallback, or throughput that collapses from ~40 t/s fully on GPU to ~3 t/s when the framework quietly offloads to system RAM. Memory management for local LLMs requires deliberate planning, not guesswork.
This tutorial walks through a systematic approach to memory planning, allocation, and orchestration across three strategies: concurrent loading with VRAM budgeting, hot-swapping models on demand, and containerized model isolation with Docker. It draws on the latest capabilities in Ollama, llama.cpp, and the NVIDIA Container Toolkit. Prerequisites include familiarity with LLM inference concepts, basic GPU architecture, Python, Docker, and command-line tooling.
Understanding LLM Memory Anatomy
What Actually Consumes VRAM
Model weights dominate VRAM consumption. A 7-billion-parameter model in 16-bit floating point occupies roughly 14 GB just for weight storage. At 4096 context length with a batch size of 1, the KV cache for a 7B model typically consumes 0.5 to 1.0 GB, but this grows linearly with concurrent requests and longer contexts under fixed-slot allocation (as in llama.cpp --parallel). Paged attention systems such as vLLM allocate KV cache dynamically and may not grow linearly.
Activation memory covers intermediate tensors during the forward pass. Runtime overhead includes CUDA context allocation (typically 300 to 800 MB depending on driver version and framework; verify against your specific CUDA and driver versions), cuBLAS workspace buffers, and memory fragmentation from repeated allocations. Two processes sharing a GPU each pay this overhead independently.
Calculating VRAM Requirements per Model
The core formula is:
VRAM ≈ (parameters × bytes_per_parameter) + KV_cache + overhead
Where bytes_per_parameter depends on quantization: FP16 uses 2 bytes, Q8_0 roughly 1.1 bytes, Q6_K about 0.83 bytes, Q5_K_M about 0.73 bytes, Q4_K_M about 0.64 bytes, and Q3_K_S about 0.53 bytes. Estimate KV cache size as 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element, where the leading 2 accounts for the separate K and V tensors and bytes_per_element is 2 for the FP16 KV cache that most backends use by default. Some backends support Q8 or Q4 KV cache; check your framework's documentation.
Practical estimates for Q4_K_M quantization: a 7B model lands at approximately 4.4 GB for weights alone (roughly 5.0 to 5.5 GB loaded with KV cache and overhead at 4096 context), a 13B model at approximately 8.2 GB (9.0 to 9.5 GB loaded), and a 70B model at approximately 41 to 42 GB (exceeding 24 GB single-GPU VRAM without partial offload).
The following Python script computes these estimates programmatically:
QUANT_BYTES = {
    "FP16": 2.0,
    "Q8_0": 1.1,
    "Q6_K": 0.83,
    "Q5_K_M": 0.73,
    "Q4_K_M": 0.64,
    "Q3_K_S": 0.53,
    "IQ2_XXS": 0.29,
}

def estimate_vram(
    params_billion: float,
    quant: str,
    context_length: int = 4096,
    batch_size: int = 1,
    num_layers: int | None = None,
    num_kv_heads: int | None = None,
    head_dim: int = 128,
    overhead_gb: float = 0.6,
):
    """Estimate VRAM usage for a quantized LLM.

    num_layers and num_kv_heads vary by architecture. Pass explicit values
    from the model card for accuracy.
    """
    if quant not in QUANT_BYTES:
        raise ValueError(
            f"Unknown quantization '{quant}'. "
            f"Valid options: {list(QUANT_BYTES.keys())}"
        )
    if num_layers is None:
        raise ValueError(
            "num_layers is required. Pass the value from your model card. "
            "Heuristic inference has been removed due to inaccuracy across architectures."
        )
    if num_kv_heads is None:
        raise ValueError(
            "num_kv_heads is required. Pass the value from your model card."
        )
    params = params_billion * 1e9
    bpp = QUANT_BYTES[quant]
    weight_gb = (params * bpp) / (1024**3)
    kv_factor = 2   # K and V tensors
    fp16_bytes = 2  # bytes per element for FP16 KV cache
    kv_bytes = (
        kv_factor * num_layers * num_kv_heads * head_dim
        * context_length * batch_size * fp16_bytes
    )
    kv_gb = kv_bytes / (1024**3)
    total = weight_gb + kv_gb + overhead_gb
    print(f"{'Component':<25} {'GB':>8}")
    print("-" * 35)
    print(f"{'Model weights':<25} {weight_gb:>8.2f}")
    print(f"{'KV cache':<25} {kv_gb:>8.2f}")
    print(f"{'Overhead (CUDA/runtime)':<25} {overhead_gb:>8.2f}")
    print("-" * 35)
    print(f"{'TOTAL ESTIMATED VRAM':<25} {total:>8.2f}")
    return total

# Example: 7B Q4_K_M at 4096 context (Llama 3.1 8B architecture)
print("=== 7B Q4_K_M ===")
estimate_vram(7, "Q4_K_M", context_length=4096,
              num_layers=32, num_kv_heads=8, head_dim=128)
print("\n=== 13B Q4_K_M ===")
estimate_vram(13, "Q4_K_M", context_length=4096,
              num_layers=40, num_kv_heads=40, head_dim=128)
VRAM vs. System RAM: When CPU Offloading Helps (and Hurts)
Three execution modes exist: full GPU inference (all layers in VRAM), partial offload (some layers on GPU, the rest on CPU/system RAM), and full CPU inference. Partial offload computes the layers that exceed VRAM on the CPU while keeping the rest on GPU, but every generated token requires transferring activations across the PCIe bus at each offloaded boundary. On PCIe 4.0 x16, bandwidth peaks at roughly 32 GB/s per direction (bidirectional peak is ~64 GB/s, but LLM layer offload traffic is effectively unidirectional per token), with 20 to 25 GB/s typical in practice, introducing measurable latency per offloaded layer. A 70B model with half its layers offloaded may generate 3 to 5 tokens per second on a 24 GB GPU, compared to 15+ tokens per second fully resident on a 48 GB GPU.
Full CPU inference on DDR5-4800 dual-channel systems can manage 1 to 3 tokens per second for a 7B Q4_K_M model. Apple Silicon unified memory avoids the PCIe bottleneck entirely, making partial offload patterns less penalizing on M-series chips. A 34B Q4_K_M model at 4096 context on an M4 Max with 128 GB unified memory achieves 10+ tokens per second thanks to the ~400 GB/s memory bandwidth.
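These throughput cliffs follow from a memory-bandwidth roofline: each generated token reads every weight once, so tokens per second is bounded by bandwidth divided by bytes read. The sketch below computes that bound; the bandwidth defaults (~1000 GB/s for an RTX 4090-class GPU, ~75 GB/s for dual-channel DDR5-4800) are illustrative assumptions, not measurements.

```python
def roofline_tps(weight_gb: float, gpu_frac: float,
                 gpu_bw_gbs: float = 1000.0, cpu_bw_gbs: float = 75.0) -> float:
    """Upper-bound tokens/sec for a bandwidth-bound decode step.

    weight_gb: total quantized weight size; gpu_frac: fraction of layers on GPU.
    Each token reads every weight once, from GPU VRAM or system RAM depending
    on where the layer lives, so per-token time is the sum of the two reads.
    """
    gpu_time = (weight_gb * gpu_frac) / gpu_bw_gbs
    cpu_time = (weight_gb * (1.0 - gpu_frac)) / cpu_bw_gbs
    return 1.0 / (gpu_time + cpu_time)

# 7B Q4_K_M (~4.4 GB) fully resident: bound is in the hundreds of t/s;
# measured ~48 t/s, since compute and KV-cache reads also cost time.
print(f"7B full GPU bound:      {roofline_tps(4.4, 1.0):.0f} t/s")
# 70B Q4_K_M (~42 GB) half offloaded: the slow CPU-side read dominates,
# collapsing the bound to the low single digits.
print(f"70B half-offload bound: {roofline_tps(42.0, 0.5):.1f} t/s")
```

The bound is loose in absolute terms, but the ratio between configurations tracks the observed behavior: moving half of a 70B model to system RAM costs roughly an order of magnitude of throughput.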
VRAM Consumption Breakdown by Model Size and Quantization (24 GB GPU)
| Model | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S |
|---|---|---|---|---|---|
| 7B | ~7.7 GB / ~35 t/s | ~5.8 GB / ~42 t/s | ~5.1 GB / ~45 t/s | ~4.4 GB / ~48 t/s | ~3.7 GB / ~44 t/s |
| 13B | ~14.3 GB / ~20 t/s | ~10.8 GB / ~25 t/s | ~9.5 GB / ~27 t/s | ~8.2 GB / ~29 t/s | ~6.9 GB / ~26 t/s |
| 34B | ~37.4 GB / CPU* | ~28.2 GB / CPU* | ~24.8 GB / partial | ~21.8 GB / ~12 t/s | ~18.0 GB / ~11 t/s |
| 70B | ~77.0 GB / CPU* | ~58.1 GB / CPU* | ~51.1 GB / CPU* | ~44.8 GB / CPU* | ~37.1 GB / CPU* |
*Requires partial or full CPU offload on 24 GB. Tokens/sec estimates assume single-request inference on an RTX 4090 or comparable 24 GB GPU. Weight-only estimates; add 0.5 to 2.0 GB for KV cache and overhead depending on context length.
Strategy 1: Concurrent Loading with VRAM Budgeting
When to Use Concurrent Loading
Two or more models must serve requests with sub-second switching. A typical scenario: a 7B coding assistant handles inline completions while a small embedding model (under 1 GB) processes document chunks for RAG indexing. Both need to be resident in VRAM. The constraint is hard: every model must physically fit in VRAM at the same time, including KV cache for their expected concurrent request load.
Configuring Ollama for Multiple Resident Models
Ollama exposes two environment variables for concurrent serving. OLLAMA_MAX_LOADED_MODELS controls how many models remain in VRAM simultaneously (default is 1; verify against your Ollama version's release notes, as behavior changed between 0.1.x and 0.3.x). OLLAMA_NUM_PARALLEL sets the number of concurrent requests per model (default is 1 in most Ollama versions; check ollama --help or release notes for your version), which directly affects KV cache allocation. Increasing OLLAMA_NUM_PARALLEL multiplies KV cache memory by the parallelism factor.
Set the num_gpu parameter in a Modelfile to control per-model VRAM allocation. This parameter specifies how many layers to offload to GPU, letting you manually partition VRAM between models by giving each a calculated number of GPU layers.
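A minimal Modelfile sketch for such a partition might look like the following; the layer count and context length here are illustrative values to be replaced with numbers from your own VRAM budget:

```
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_gpu 24
PARAMETER num_ctx 2048
```

Build and run it with `ollama create coder-budgeted -f Modelfile` and `ollama run coder-budgeted` (the name `coder-budgeted` is arbitrary).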
# Set environment variables before launching Ollama
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_KEEP_ALIVE="10m" # Overrides default of 5m; tune to match your request inter-arrival time.
# Start the Ollama server
ollama serve &
# Pull models
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull nomic-embed-text
# Trigger loading both models (first request loads each)
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b-instruct-q4_K_M", "prompt": "warmup", "stream": false}'
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "input": "warmup"}'
# Verify both loaded with: ollama ps
# nomic-embed-text: ~0.5-0.8 GB VRAM (weights ~274 MB + CUDA context; verify with nvidia-smi)
# llama3.1:8b Q4_K_M: ~5.5 GB VRAM (at default context 2048 and parallel 1)
# Total: ~6.0-6.3 GB — fits comfortably on 24 GB with room for larger context
Configuring llama.cpp Multi-Model Serving
Running multiple llama-server instances on different ports provides finer control. Each instance gets an explicit --n-gpu-layers value calculated from the VRAM budget.
# Replace ./llama-server with the full path to your llama.cpp server binary.
# Built from source: ./build/bin/llama-server
# Installed via package: llama-server
# Verify: which llama-server || ls ./build/bin/llama-server
set -e
# Instance 1: 13B Q4_K_M model — allocate ~9.5 GB VRAM (all 40 layers, weights ~8.2 GB + KV + overhead at 4096 context)
./llama-server \
--model models/llama-13b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 40 \
--ctx-size 4096 \
--parallel 1 &
PID1=$!
# Instance 2: 7B Q4_K_M model — allocate partial VRAM (24 of 32 layers on GPU)
# Remaining 8 layers offloaded to CPU to stay within 24 GB total
./llama-server \
--model models/llama-7b-q4_k_m.gguf \
--port 8081 \
--n-gpu-layers 24 \
--ctx-size 2048 \
--parallel 1 &
PID2=$!
sleep 2
kill -0 $PID1 || echo "Server 1 (13B) failed to start"
kill -0 $PID2 || echo "Server 2 (7B) failed to start"
# Combined estimate: ~9.5 GB + ~6.5 GB (partial 7B) + ~1.2 GB overhead = ~17.2 GB on a 24 GB GPU
# Monitor with: watch -n 1 nvidia-smi
Pitfalls: Fragmentation and OOM Under Load
KV cache growth under concurrent requests is the most common source of unexpected OOM. A model loaded at --parallel 4 allocates four KV cache slots at initialization. If context length is 8192 and each slot consumes 512 MB, that is 2 GB of KV cache alone for one model. Monitoring tools like nvidia-smi dmon (sampled at 1-second intervals) and nvtop (real-time per-process VRAM tracking) reveal actual consumption under load. Repeated load/unload cycles can cause CUDA memory fragmentation; no reliable threshold exists for when this becomes a problem, so monitor free contiguous VRAM and restart the server proactively when allocations start failing despite sufficient total free memory.
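That slot arithmetic is worth checking before launch. A small helper reusing the KV-cache formula from earlier — the architecture numbers in the example are Llama 3.1 8B values (32 layers, 8 KV heads via GQA, head dim 128); substitute your model card's:

```python
def kv_slot_gb(num_layers: int, num_kv_heads: int, head_dim: int,
               ctx: int, bytes_per_element: int = 2) -> float:
    """FP16 KV cache for ONE slot: 2 (K and V) x layers x kv_heads x head_dim x ctx."""
    return 2 * num_layers * num_kv_heads * head_dim * ctx * bytes_per_element / 1024**3

def reserved_kv_gb(parallel: int, **arch) -> float:
    """Fixed-slot backends (llama.cpp --parallel) allocate every slot at startup,
    whether or not requests ever arrive: reservation = slots x per-slot size."""
    return parallel * kv_slot_gb(**arch)

llama31_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"per slot @ 8192 ctx: {kv_slot_gb(ctx=8192, **llama31_8b):.2f} GB")  # 1.00 GB
print(f"--parallel 4 total:  {reserved_kv_gb(4, ctx=8192, **llama31_8b):.2f} GB")  # 4.00 GB
```

Note that GQA makes this model cheaper per slot than the 512 MB example above; older architectures without GQA (num_kv_heads equal to the full head count) reserve several times more.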
Strategy 2: Hot-Swapping Models on Demand
When to Use Hot-Swapping
You need many models but never at the same time. A developer running five specialized models (code, chat, summarization, translation, embeddings) with 2 to 10 seconds of acceptable latency between switches can give each model full VRAM access during its active window. This maximizes per-model throughput at the cost of swap latency.
Ollama's Built-In Model Scheduling
Ollama handles model eviction automatically through its keep-alive mechanism. The OLLAMA_KEEP_ALIVE environment variable (default 5m) controls how long an idle model remains in VRAM. Setting it to 0 unloads immediately after each request. Setting it to 30s keeps the model warm for half a minute. When a new model is requested and VRAM is insufficient, Ollama evicts the least recently used resident model. Tuning this TTL to match request patterns reduces unnecessary reloads.
Building a Python Hot-Swap Orchestrator
For workflows requiring explicit control over model lifecycle, a lightweight orchestrator using the Ollama HTTP API provides LRU-based swap management:
import requests
import time
import logging
import os
import threading
from collections import OrderedDict

log = logging.getLogger(__name__)

OLLAMA_BASE = os.environ.get("OLLAMA_BASE", "http://localhost:11434")
CONNECT_TIMEOUT = 5   # seconds to establish connection
READ_TIMEOUT = 120    # seconds to wait for response body

class HotSwapManager:
    def __init__(self, vram_budget_gb: float = 22.0):
        self.vram_budget = vram_budget_gb
        self.loaded: OrderedDict[str, float] = OrderedDict()  # model -> vram_gb
        self._lock = threading.Lock()
        self.model_vram = {
            "llama3.1:8b-instruct-q4_K_M": 5.5,
            "codellama:7b-instruct-q4_K_M": 5.0,
            "nomic-embed-text": 0.7,
            "mistral:7b-instruct-q4_K_M": 5.2,
            "llama3.1:70b-instruct-q4_K_M": 42.0,
        }
        self._sync_state()

    def _sync_state(self):
        """Reconcile in-memory state against Ollama's actual loaded models."""
        try:
            resp = requests.get(
                f"{OLLAMA_BASE}/api/ps",
                timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
            )
            resp.raise_for_status()
            for entry in resp.json().get("models", []):
                name = entry.get("name", "")
                # Use reported VRAM if available, else fall back to our dict
                vram = entry.get("size_vram", 0) / (1024 ** 3)
                self.loaded[name] = vram or self.model_vram.get(name, 6.0)
            log.info("Synced state: %s", list(self.loaded.keys()))
        except requests.RequestException as e:
            log.warning("Could not sync Ollama state: %s. Proceeding with empty state.", e)

    def _used_vram(self) -> float:
        return sum(self.loaded.values())

    def _unload_model(self, model: str):
        if model not in self.loaded:
            log.debug("Skipping unload of '%s': not tracked as loaded.", model)
            return
        log.info("Unloading %s...", model)
        try:
            resp = requests.post(
                f"{OLLAMA_BASE}/api/generate",
                json={"model": model, "keep_alive": 0},
                timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
            )
            resp.raise_for_status()
            self.loaded.pop(model, None)
        except requests.RequestException as e:
            log.error("Failed to unload '%s': %s. State may be inconsistent.", model, e)
            raise

    def _ensure_capacity(self, required_gb: float):
        # Called under self._lock
        while self._used_vram() + required_gb > self.vram_budget and self.loaded:
            evict_model = next(iter(self.loaded))  # LRU eviction
            self._unload_model(evict_model)

    def generate(self, model: str, prompt: str) -> str:
        if model not in self.model_vram:
            raise ValueError(
                f"Unknown model '{model}'. Add it to model_vram dict with its VRAM estimate."
            )
        needed = self.model_vram[model]
        if needed > self.vram_budget:
            raise ValueError(
                f"'{model}' requires {needed} GB, exceeds budget {self.vram_budget} GB"
            )
        with self._lock:
            if model in self.loaded:
                self.loaded.move_to_end(model)  # Mark as recently used
            else:
                self._ensure_capacity(needed)
                log.info("Loading %s (~%.1f GB)...", model, needed)
                self.loaded[model] = needed
        start = time.time()
        resp = requests.post(
            f"{OLLAMA_BASE}/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
        )
        resp.raise_for_status()
        elapsed = time.time() - start
        log.info("Response from %s in %.1fs", model, elapsed)
        return resp.json().get("response", "")

# Usage
manager = HotSwapManager(vram_budget_gb=22.0)
print(manager.generate("llama3.1:8b-instruct-q4_K_M", "Explain quicksort."))
print(manager.generate("codellama:7b-instruct-q4_K_M", "Write a Python fibonacci function."))
print(manager.generate("mistral:7b-instruct-q4_K_M", "Summarize the TCP handshake."))
Measuring Swap Latency
Cold load times (model not in system page cache) for common GGUF models from local storage: a 7B Q4_K_M loads in roughly 1 to 3 seconds from NVMe Gen4 SSD or 8 to 12 seconds from SATA SSD, a 13B Q4_K_M in 4 to 7 seconds (NVMe), and a 70B Q4_K_M requiring partial offload in 15 to 30 seconds (NVMe; significantly longer on SATA). The OS page cache retains the model file after recent use, cutting reload times roughly in half when the model file fits in available system RAM. Streaming the first token while the full context is still processing reduces perceived latency. Async preloading, triggered by predicting the next model from request patterns, can overlap model loading with the user's current interaction.
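To measure these numbers on your own setup, time a cold request (which includes the load) against an immediately following warm one; the difference approximates swap latency. A stdlib-only sketch against the Ollama HTTP API — the endpoint and model names match those used earlier in this article, and OLLAMA_BASE is an assumption to adjust for your server:

```python
import json
import time
import urllib.request

OLLAMA_BASE = "http://localhost:11434"  # adjust for your server

def timed_generate(model: str, prompt: str = "warmup") -> tuple[dict, float]:
    """POST one non-streaming /api/generate request; return (response_json, seconds)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start

def swap_overhead(cold_s: float, warm_s: float) -> float:
    """Approximate model load cost: cold-request time minus warm-request time."""
    return max(cold_s - warm_s, 0.0)

# Usage (requires a running Ollama server):
#   _, cold = timed_generate("llama3.1:8b-instruct-q4_K_M")  # triggers load
#   _, warm = timed_generate("llama3.1:8b-instruct-q4_K_M")  # already resident
#   print(f"swap overhead ~ {swap_overhead(cold, warm):.1f}s")
```

Run the pair twice, once after `ollama stop` or a server restart (cold, page cache empty) and once after a recent load, to see the page-cache effect described above.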
Strategy Comparison: Concurrent vs. Hot-Swap vs. Containerized
| Factor | Concurrent | Hot-Swap | Containerized |
|---|---|---|---|
| Max models (24 GB) | 2-4 small | Unlimited (sequential) | 2-3 with GPU sharing |
| Switch latency | <100 ms* | 2-10 seconds | 2-10 seconds + container overhead |
| VRAM efficiency | Low (partitioned) | High (full VRAM per model) | Medium (overhead per container) |
| Complexity | Low | Medium | High |
| Best use case | Always-on pairs | Many models, tolerant latency | Production, multi-tenant |
*Assumes models already loaded in VRAM; latency is routing overhead only, measured at the API layer.
The table captures the broad trade-offs, but one thing it cannot show: concurrent loading penalizes large models disproportionately because VRAM partitioning forces aggressive quantization or partial offload on every resident model, not just the largest one.
Strategy 3: Containerized Model Isolation with Docker
When to Use Container Isolation
Production deployments, multi-tenant serving, and any scenario where you need hard resource limits and crash isolation. If one model's inference process segfaults or leaks memory, other containers remain unaffected. Each container can run different framework versions, different model formats, and scale independently.
Docker + NVIDIA Container Toolkit Setup
The NVIDIA Container Toolkit must be installed for GPU passthrough. Verify with nvidia-smi inside a container: docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi. The --gpus flag controls GPU visibility. Use --gpus all to expose all GPUs to the container, or --gpus '"device=0"' to restrict to a specific GPU. Resource limits in Docker Compose pin memory limits.
Docker Compose for Multi-Model Deployments
Note: Create an nginx.conf file before running docker compose up; the proxy service bind-mounts it and will fail if it does not exist. See the llama.cpp server wiki for a reference upstream proxy configuration.
# Docker Compose Specification (Compose v2.20+, Docker 26+)
services:
  chat-model:
    image: ollama/ollama:0.3.14
    container_name: ollama-chat
    ports:
      - "11434:11434"
    volumes:
      - ollama-chat-data:/root/.ollama
    mem_limit: 16g  # Enforces limit in standalone Docker Compose (non-Swarm).
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=10m
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16g  # Only honored in Docker Swarm mode; mem_limit above enforces in standalone.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
    # "$$" escapes "$" so Compose does not interpolate it on the host.
    entrypoint: >
      sh -c "ollama serve &
      SERVER_PID=$$! &&
      until curl -sf http://localhost:11434/api/tags >/dev/null; do sleep 1; done &&
      ollama pull llama3.1:8b-instruct-q4_K_M &&
      wait $$SERVER_PID"
    # For production, use a dedicated entrypoint script with exec and signal trapping (trap).
  embedding-model:
    image: ghcr.io/ggerganov/llama.cpp:server-b3400
    container_name: llama-cpp-embed
    ports:
      - "8081:8080"
    volumes:
      - ./models:/models:ro
    mem_limit: 4g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 4g
    command: >
      --model /models/nomic-embed-text-v1.5-Q4_K_M.gguf
      --port 8080
      --n-gpu-layers 99
      --ctx-size 2048
      --embedding
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  proxy:
    image: nginx:alpine
    container_name: model-proxy
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      chat-model:
        condition: service_healthy
      embedding-model:
        condition: service_healthy

volumes:
  ollama-chat-data:
GPU Sharing Strategies in Containers
Multi-Instance GPU (MIG) on NVIDIA A100, H100, and A30 hardware partitions a single GPU into isolated instances with dedicated memory and compute. A single A100 80 GB can split into up to seven 10 GB instances (A100 40 GB supports 7 × 5 GB instances), each functioning as an independent GPU. This provides the strongest isolation but requires supported hardware and driver configuration. Instance profiles vary by GPU SKU; consult nvidia-smi mig -lgip for available profiles on your hardware.
Time-slicing via the NVIDIA device plugin for Kubernetes (or manual configuration in Docker) lets multiple containers share a GPU by interleaving kernel execution. There is no memory isolation: a container can still allocate VRAM beyond its "fair share" and cause OOM for neighbors. This works for bursty workloads where containers rarely need the GPU simultaneously.
Multi-Process Service (MPS) enables concurrent kernel execution from multiple processes on a single GPU, improving utilization when individual processes underuse compute resources. MPS works well for serving many small models that each use a fraction of the GPU's compute capacity. It works poorly when models are large enough to saturate the GPU individually.
Advanced Techniques: Squeezing More from Limited VRAM
Aggressive Quantization for Multi-Model Fits
Dropping below Q4_K_M opens space for additional models. Q3_K_S cuts a 7B model to approximately 3.7 GB (versus 4.4 GB for Q4_K_M), saving roughly 500 to 700 MB of loaded VRAM. IQ2_XXS pushes a 7B model below 2.1 GB for weights only (loaded with KV cache and overhead, expect ~2.3 to 2.5 GB), but expect a measurable quality drop on reasoning-heavy tasks. Benchmark this on your target task: published comparisons show 5 to 15 percentage point drops on benchmarks like MMLU and HumanEval relative to Q4_K_M. For embedding models and classification tasks, aggressive quantization often has negligible impact on retrieval recall or classification accuracy, making sub-Q4 quantization practical for multi-model deployments where not every model needs maximum quality.
Context Length Reduction and Dynamic Allocation
Reducing --ctx-size directly reduces KV cache memory. An embedding model processing 512-token chunks needs only --ctx-size 512, saving over 75% of KV cache memory compared to 4096. A chat model might need 4096 or 8192 for conversational context, while a code completion model working on single functions can operate at 2048. Tune each model's context length to its task; this often frees enough VRAM for an additional small model.
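The saving is easy to verify, since every term in the KV-cache formula scales linearly with context length. In the check below, the architecture defaults are illustrative small-model values, not from any specific model card — the ratio is independent of them anyway:

```python
def kv_gb(ctx: int, num_layers: int = 12, num_kv_heads: int = 12,
          head_dim: int = 64) -> float:
    """FP16 KV cache in GB for one slot at the given context length."""
    return 2 * num_layers * num_kv_heads * head_dim * ctx * 2 / 1024**3

# 512 is 1/8 of 4096, so the KV cache shrinks by 7/8 regardless of architecture.
saving = 1 - kv_gb(512) / kv_gb(4096)
print(f"KV cache saving at 512 vs 4096 context: {saving:.1%}")  # 87.5%
```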
Unified Memory and Partial CPU Offload Patterns
Strategic layer splitting keeps the most memory-bandwidth-sensitive layers on the GPU. In transformer architectures each layer contains both attention and FFN sublayers, and llama.cpp offloads whole layers (0 through N) to the GPU; it provides no mechanism to GPU-offload attention sublayers independently of FFN sublayers within the same layer. FFN-heavy computation in CPU-offloaded layers tolerates the bandwidth penalty with less proportional throughput loss than attention-heavy operations.
# CodeLlama 34B Q4_K_M model: 48 layers total
# (Layer counts vary by model architecture — verify with your model's metadata
# or by running: ./llama-server --model <path> --n-gpu-layers 0 and inspecting output)
# Keep 20 layers on GPU (~12 GB VRAM), offload 28 to CPU
./llama-server \
--model models/codellama-34b-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 20 \
--ctx-size 4096 \
--parallel 1 \
--threads 8 # CPU threads for offloaded layers
# Expected throughput impact:
# Full GPU (all 48 layers): ~12 t/s on 48 GB GPU
# 20 GPU + 28 CPU layers: ~4-6 t/s on 24 GB GPU
# Full CPU: ~1-2 t/s
# The 20-layer split reclaims ~12 GB for other models while
# maintaining 3-5x the throughput of full CPU inference
On Apple Silicon, the unified memory architecture eliminates the PCIe transfer penalty. An M4 Max with 128 GB unified memory (verify your specific SKU configuration) can run a 34B Q4_K_M model at 4096 context with all layers "offloaded" to CPU at near-GPU bandwidth, achieving 10+ tokens per second thanks to the ~400 GB/s memory bandwidth.
Monitoring and Debugging Memory Issues
Real-Time VRAM Monitoring
nvidia-smi dmon -d 1 provides per-second sampling of GPU utilization, memory usage, temperature, and power draw. nvtop gives a top-like interactive view with per-process VRAM breakdown, showing exactly which model instance consumes what. gpustat --watch offers a minimal, scriptable alternative. For automated alerts, a simple loop checking nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits against a threshold (e.g., alert at 90% of total VRAM) can trigger notifications before OOM occurs.
Diagnosing Common Failures
Throughput drops to single-digit t/s with no error message. This is silent CPU fallback: a framework detects insufficient VRAM and quietly offloads layers to system RAM without notification. Monitor token generation speed per request to catch it.
OOM errors during KV cache expansion, rather than at model load, catch many practitioners off guard. A model loads successfully, but the first long-context request triggers an allocation failure. This happens when --parallel or context length was set without accounting for KV cache growth at full utilization.
CUDA memory fragmentation accumulates after repeated load/unload cycles. Total free VRAM may show sufficient space, but allocations fail because no contiguous block is large enough. Restarting the inference server process is the most reliable fix; some frameworks support explicit cache clearing between model loads.
Putting It All Together: Decision Framework
The choice of strategy follows from three variables: total VRAM, number of models, and latency tolerance.
| GPU VRAM | 2 Models | 3-5 Models | 5+ Models |
|---|---|---|---|
| 8 GB | Hot-swap only | Hot-swap only | Hot-swap only |
| 16 GB | Concurrent (if small) | Hot-swap | Hot-swap |
| 24 GB | Concurrent | Concurrent small + hot-swap large | Hot-swap + Docker isolation |
| 48 GB+ | Concurrent | Concurrent | Concurrent + Docker for production |
Start with concurrent loading for your always-on models and add hot-swap for the rest. Docker isolation adds value when stability guarantees, reproducibility, or multi-tenant access control matter more than minimizing overhead.
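For provisioning scripts that need to pick a default automatically, the table encodes directly into a function. The thresholds mirror the table above and should be treated as starting points, not hard rules:

```python
def choose_strategy(vram_gb: float, n_models: int, production: bool = False) -> str:
    """Map (VRAM, model count) to the decision table's recommendation."""
    if vram_gb >= 48:
        return "concurrent + docker" if production else "concurrent"
    if vram_gb >= 24:
        if n_models <= 2:
            return "concurrent"
        if n_models <= 5:
            return "concurrent small + hot-swap large"
        return "hot-swap + docker isolation"
    if vram_gb >= 16:
        return "concurrent (if small)" if n_models <= 2 else "hot-swap"
    return "hot-swap"  # 8 GB class: hot-swap only

print(choose_strategy(24, 4))  # concurrent small + hot-swap large
print(choose_strategy(8, 3))   # hot-swap
```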
Key Takeaways
Calculate before loading: the VRAM formula and the estimation script above prevent the most common deployment failures. Always pass explicit num_layers and num_kv_heads values from the model card rather than relying on heuristics. Match your strategy to actual usage patterns. Ollama's built-in keep-alive scheduling handles hot-swapping with zero custom code for many workflows, so start there before building custom orchestration. Graduate to Docker isolation with GPU resource pinning when moving toward production serving. The Ollama documentation, llama.cpp server wiki, and NVIDIA's MIG and MPS guides provide the reference material for deeper configuration.

