Ollama vs vLLM Comparison
| Dimension | Ollama | vLLM |
|---|---|---|
| Single-user throughput (Llama 3.1 8B) | ~62 tok/s (Q4_K_M) | ~71 tok/s (FP16) |
| 50-user aggregate throughput | ~155 tok/s (queue-based) | ~920 tok/s (continuous batching) |
| p99 latency at 50 users | ~24.7s | ~2.8s |
| Best fit | Local dev, prototyping, edge | Production serving, multi-user SLAs |
The conversation around ollama vs vllm has shifted substantially since 2024. Both tools have undergone major architectural changes, and benchmarks from even 12 months ago may no longer reflect the reality of running a local llm benchmark on current hardware with current software. This benchmark uses controlled hardware, identical models, and standardized prompts to produce reproducible, comparable results.
Table of Contents
- Why This Benchmark Matters Now
- Ollama and vLLM in 2026: Quick Overview
- Benchmark Setup and Methodology
- Throughput Benchmark Results: Requests per Second
- Latency Benchmark Results: Time-to-First-Response and P95
- Memory Usage and Resource Efficiency
- Developer Experience and Ecosystem Comparison
- When to Use Ollama vs vLLM: Decision Framework
- The Right Tool for the Right Job
- Common Pitfalls
Why This Benchmark Matters Now
The conversation around ollama vs vllm has shifted substantially since 2024. Both tools have undergone major architectural changes: Ollama has optimized GPU utilization through llama.cpp kernel improvements and quantized inference paths, while vLLM has shipped continued PagedAttention optimizations, improved speculative decoding, and streamlined its installation story. Benchmarks from even 12 months ago may no longer reflect the reality of running a local llm benchmark on current hardware with current software.
Ollama, once pigeonholed as a hobbyist tool, now shows up in CI pipelines running code-review prompts, Jetson Orin edge deployments, and internal dev toolchains. vLLM, previously requiring deep familiarity with Python ML tooling, has simplified setup without sacrificing its production-grade throughput capabilities. The core question persists: does the ease-of-use versus raw throughput tradeoff still hold, or have these tools converged enough to blur the line?
This benchmark uses controlled hardware (single NVIDIA RTX 4090), identical models (Llama 3.1 8B and DeepSeek-R1-Distill-Llama-8B), and standardized prompts to produce reproducible, comparable results. Every configuration, script, and Docker setup is included so that readers can validate findings on their own infrastructure.
Ollama and vLLM in 2026: Quick Overview
What Is Ollama?
Ollama is a developer-friendly tool for running large language models locally. Its Go-based server wraps an inference backend built on llama.cpp, and recent versions have tightened GPU utilization through operator fusion and improved CUDA graph support in that backend. The model ecosystem revolves around the Modelfile format and Ollama Hub, which together provide a pull-and-run experience similar to Docker's image management. A single ollama pull command downloads a quantized model ready for immediate serving.
What Is vLLM?
vLLM is a Python-based, GPU-first LLM serving engine designed for high-throughput production workloads. Its defining feature remains PagedAttention, which manages GPU memory through a paging mechanism inspired by operating system virtual memory, and which has received continued optimization across recent vLLM releases (see the vLLM changelog for version-specific improvements). Recent versions have improved speculative decoding, added multi-LoRA serving with hot-swap support, and maintained an OpenAI-compatible API server. Tensor parallelism support enables scaling across multiple GPUs for larger models.
Key Architectural Differences
The philosophical divide is structural. Ollama prioritizes accessibility: a single binary, integrated model management, and minimal configuration. vLLM prioritizes throughput: continuous batching, dynamic memory allocation via PagedAttention, and fine-grained control over serving parameters. Ollama allocates GPU memory statically per model load; vLLM pages memory dynamically based on active sequences. This distinction has minor impact at 1-3 concurrent requests but becomes the dominant performance factor under load. In an llm serving comparison, these architectural choices define where each tool excels.
| Dimension | Ollama | vLLM |
|---|---|---|
| Language | Go + llama.cpp (C/C++) backend | Python + CUDA kernels |
| Memory management | Static allocation | PagedAttention (dynamic) |
| Batching | Request queuing | Continuous batching |
| Model management | Built-in (Ollama Hub) | External (HuggingFace, local) |
| Primary target | Developer/local use | Production serving |
Benchmark Setup and Methodology
Hardware and Environment
All tests ran on a single-GPU system to establish a controlled baseline:
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- CPU: AMD Ryzen 9 7950X
- RAM: 64GB DDR5
- OS: Ubuntu 24.04 LTS
- CUDA: 12.6
- Python: 3.12
- Isolation: Docker containers for both tools (see Software Versions below), ensuring no shared process interference
Software Versions: Pin your Docker images to the exact versions used during testing. Run
docker inspect ollama/ollama:<tag> --format '{{index .RepoDigests 0}}'anddocker inspect vllm/vllm-openai:<tag> --format '{{index .RepoDigests 0}}'to obtain image digests for reproducibility. Recordollama --versionandpip show vllmoutputs before benchmarking.
Prerequisites
Before running the benchmark, ensure the following are in place:
- nvidia-container-toolkit installed and Docker daemon configured:
sudo apt install nvidia-container-toolkit && sudo systemctl restart docker - HuggingFace account with Llama 3.1 access approved and a token generated
- Export your HuggingFace token:
export HF_TOKEN=hf_...before runningdocker compose up - Disk space: FP16 8B model ≈15GB, AWQ variant ≈5GB, GGUF Q4_K_M ≈5GB
- Docker Compose v2+ (use
docker composenot the deprecateddocker-composev1 CLI) - Python 3.12 with
aiohttpinstalled (pip install aiohttp), plusopenaipackage for the API compatibility example
Models Tested
We selected two models for their popularity and appropriate sizing for single-GPU inference:
- Llama 3.1 8B: Q4_K_M quantization for Ollama; FP16 and AWQ quantization for vLLM
- DeepSeek-R1-Distill-Llama-8B: Same quantization strategy per tool
The quantization difference reflects how each tool is typically deployed: Ollama users run GGUF quantized models, while vLLM users serve in FP16 or use AWQ/GPTQ quantization. Comparing each tool in its native quantization format matches real-world usage, but readers should note that throughput differences reflect both architectural and quantization effects, not architecture alone. A controlled comparison using vLLM with AWQ at an equivalent bit-width would better isolate architectural impact.
Benchmarking Methodology
A custom Python script using aiohttp generated concurrent requests against each tool's HTTP API. The script captured tokens per second (throughput), time-to-first-response (TTFR) measured as full round-trip latency in non-streaming mode (not equivalent to streaming TTFT), end-to-end latency at p50, p95, and p99 percentiles, and peak VRAM usage via nvidia-smi dmon -s m -d 100 polling at 100ms intervals.
Test scenarios covered three concurrency levels: single-user sequential, 10 concurrent users, and 50 concurrent users. Every test used a fixed input prompt (verify exact token count per model using each model's tokenizer) and 256-token maximum output. Three warmup runs preceded 10 measured runs, and we report median values.
Note on TTFR vs TTFT: The benchmark script uses
stream: False, meaning all "TTFR" values represent full non-streaming round-trip latency. True time-to-first-token (TTFT) measurement requiresstream: Trueand per-chunk timing, and would produce substantially different (typically lower) values. Do not compare these results directly with streaming TTFT benchmarks.
# docker-compose.yml — Benchmark Environment
# IMPORTANT: Replace image tags below with real pinned versions for your test.
# Obtain digest after pull:
# docker inspect ollama/ollama:0.3.12 --format '{{index .RepoDigests 0}}'
# Do NOT commit HF_TOKEN to version control; use a .env file (add to .gitignore).
services:
ollama:
image: ollama/ollama:0.3.12 # replace with your tested version + verify digest
runtime: nvidia
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
vllm:
image: vllm/vllm-openai:0.4.2 # replace with your tested version + verify digest
runtime: nvidia
ports:
- "8000:8000"
volumes:
- huggingface_cache:/root/.cache/huggingface
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} # set in .env file, never hardcoded
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--dtype float16
--max-model-len 2048
--gpu-memory-utilization 0.90
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
volumes:
ollama_data:
huggingface_cache:
Note:
runtime: nvidiarequiresnvidia-container-toolkitto be installed on the host (see Prerequisites). The--gpu-memory-utilization 0.90flag leaves only ~2.4GB headroom on a 24GB GPU; if other processes consume GPU memory, vLLM may OOM. Ollama does not have an equivalent tuning parameter — its VRAM allocation is determined by the model size at load time.
# benchmark.py — Concurrent benchmark script
import asyncio
import aiohttp
import time
import statistics
import sys
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
OLLAMA_URL = "http://localhost:11434/api/generate"
VLLM_URL = "http://localhost:8000/v1/completions"
OLLAMA_MODEL = "llama3.1:8b-q4_K_M"
VLLM_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
PROMPT = (
"Explain the principles of distributed consensus algorithms in detail, "
"covering Paxos, Raft, and Byzantine fault tolerance. " * 12
)
# Verify token count per model:
# from transformers import AutoTokenizer
# t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
# print(len(t.encode(PROMPT)))
REQUEST_TIMEOUT_SEC = 120
SESSION_TIMEOUT = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_SEC)
async def query_ollama(session, prompt):
# Verify exact model tag with `ollama list` before running
payload = {
"model": OLLAMA_MODEL,
"prompt": prompt,
"stream": False,
"options": {"num_predict": 256}
}
start = time.perf_counter()
try:
async with session.post(OLLAMA_URL, json=payload) as resp:
resp.raise_for_status()
result = await resp.json()
except (aiohttp.ClientError, asyncio.TimeoutError) as e:
return {"tokens": 0, "elapsed": 0, "tok_per_sec": 0, "error": str(e)}
elapsed = time.perf_counter() - start
tokens = result.get("eval_count", 0)
return {
"tokens": tokens,
"elapsed": elapsed,
"tok_per_sec": tokens / elapsed if elapsed > 0 else 0
}
async def query_vllm(session, prompt):
payload = {
"model": VLLM_MODEL,
"prompt": prompt,
"max_tokens": 256,
"stream": False
}
start = time.perf_counter()
try:
async with session.post(VLLM_URL, json=payload) as resp:
resp.raise_for_status()
result = await resp.json()
except (aiohttp.ClientError, asyncio.TimeoutError) as e:
return {"tokens": 0, "elapsed": 0, "tok_per_sec": 0, "error": str(e)}
elapsed = time.perf_counter() - start
# Guarded access — avoids KeyError on malformed/partial responses
tokens = result.get("usage", {}).get("completion_tokens", 0)
return {
"tokens": tokens,
"elapsed": elapsed,
"tok_per_sec": tokens / elapsed if elapsed > 0 else 0
}
async def run_benchmark(query_fn, concurrency=1, runs=10):
results = []
errors = 0
async with aiohttp.ClientSession(timeout=SESSION_TIMEOUT) as session:
# Warmup — inspect for errors
for i in range(3):
warmup = await asyncio.gather(
*[query_fn(session, PROMPT) for _ in range(concurrency)]
)
warmup_errors = [r for r in warmup if "error" in r]
if warmup_errors:
log.warning(
"Warmup run %d: %d/%d requests failed — check server. "
"First error: %s",
i + 1, len(warmup_errors), concurrency,
warmup_errors[0]["error"],
)
for _ in range(runs):
batch = await asyncio.gather(
*[query_fn(session, PROMPT) for _ in range(concurrency)]
)
for r in batch:
if "error" in r:
errors += 1
else:
results.append(r)
total = runs * concurrency
if errors > 0:
log.warning("%d requests failed out of %d total", errors, total)
if not results:
log.error("All requests failed. Check server status.")
return
throughputs = [r["tok_per_sec"] for r in results]
latencies = sorted(r["elapsed"] for r in results)
n = len(latencies)
if n < 100:
log.warning(
"Sample size (%d) too small for reliable p95/p99 estimates. "
"p99 will equal the maximum observed value for n <= 100.",
n,
)
def percentile(data, p):
"""Return the p-th percentile of sorted data (0 < p <= 100)."""
idx = min(int(len(data) * p / 100), len(data) - 1)
return data[idx]
print(f"Median tok/s: {statistics.median(throughputs):.1f}")
print(f"p50 latency: {statistics.median(latencies):.3f}s")
print(f"p95 latency: {percentile(latencies, 95):.3f}s")
print(f"p99 latency: {percentile(latencies, 99):.3f}s")
print(f"Total successful requests: {n}")
if __name__ == "__main__":
if len(sys.argv) < 2:
sys.exit("Usage: benchmark.py <ollama|vllm> [concurrency] [runs]")
tool = sys.argv[1].lower()
if tool not in ("ollama", "vllm"):
sys.exit(f"ERROR: unknown tool '{tool}'. Must be 'ollama' or 'vllm'.")
try:
concurrency = int(sys.argv[2]) if len(sys.argv) > 2 else 1
runs = int(sys.argv[3]) if len(sys.argv) > 3 else 10
except ValueError as exc:
sys.exit(f"ERROR: concurrency and runs must be integers. ({exc})")
if concurrency < 1 or runs < 1:
sys.exit("ERROR: concurrency and runs must be >= 1.")
fn = query_ollama if tool == "ollama" else query_vllm
asyncio.run(run_benchmark(fn, concurrency=concurrency, runs=runs))
Throughput Benchmark Results: Requests per Second
Single-User Sequential Throughput
In single-stream scenarios, the results challenge the assumption that vLLM categorically outperforms Ollama. Ollama (Q4_K_M) delivered approximately 62 tok/s for Llama 3.1 8B, while vLLM (FP16) reached 71 tok/s and its AWQ variant landed at 68 tok/s. The 13% gap between Ollama and vLLM FP16 owes as much to quantization differences as to architecture. Ollama's competitive showing here stems from aggressive llama.cpp kernel optimizations for quantized inference on consumer GPUs.
DeepSeek-R1-Distill-Llama-8B followed a similar pattern: Ollama at 58 tok/s versus vLLM FP16 at 67 tok/s. The single-user case is where Ollama's lighter server overhead and optimized quantized kernels offset vLLM's architectural advantages.
Concurrent Load Throughput (10 and 50 Users)
The picture changes sharply under concurrent load. vLLM's continuous batching engine aggregated 10 concurrent requests into unified GPU operations, sustaining approximately 485 total tok/s for Llama 3.1 8B (FP16). Ollama processes requests through a FIFO queue; in the version tested, concurrent requests did not benefit from continuous batching, resulting in near-sequential GPU utilization and approximately 148 total tok/s at the same concurrency level. That is a 3.3x throughput difference from a single architectural choice.
At 50 concurrent users, vLLM maintained approximately 920 total tok/s while Ollama plateaued at roughly 155 tok/s, with per-request throughput degrading as queue depth grew. DeepSeek model results tracked the same pattern, with vLLM at approximately 840 tok/s versus Ollama at roughly 142 tok/s under 50 concurrent users.
This is where the vllm vs ollama comparison becomes unambiguous for production workloads: continuous batching is not an incremental improvement but a fundamentally different scaling curve.
| Configuration | Ollama (tok/s total) | vLLM FP16 (tok/s total) | vLLM AWQ (tok/s total) |
|---|---|---|---|
| Llama 3.1 8B, 1 user | 62 | 71 | 68 |
| Llama 3.1 8B, 10 users | 148 | 485 | 452 |
| Llama 3.1 8B, 50 users | 155 | 920 | 875 |
| DeepSeek-R1 8B, 1 user | 58 | 67 | 63 |
| DeepSeek-R1 8B, 10 users | 135 | 445 | 418 |
| DeepSeek-R1 8B, 50 users | 142 | 840 | 795 |
1-user rows show per-request tok/s; multi-user rows show aggregate system tok/s (sum across all concurrent requests).
Latency Benchmark Results: Time-to-First-Response and P95
Time-to-First-Response (TTFR)
At single-user concurrency, Ollama recorded a TTFR of approximately 45ms for Llama 3.1 8B, compared to vLLM's 82ms. Ollama's lighter Go-based server introduces less startup overhead per request. For DeepSeek-R1, the numbers were 51ms (Ollama) versus 89ms (vLLM).
Under concurrent load, however, the positions reversed. At 50 concurrent users, Ollama's TTFR climbed to approximately 3,200ms as requests sat in the queue awaiting sequential processing. vLLM's TTFR at the same concurrency level remained at approximately 145ms, because continuous batching absorbs new requests into the current batch without requiring prior requests to complete.
End-to-End Latency Distribution (p50, p95, p99)
Tail latency tells the production readiness story. At single-user, both tools showed tight distributions: Ollama's p99 was 4.3s versus vLLM's 3.8s for Llama 3.1 8B (256-token generation).
At 50 concurrent users, the distributions diverged sharply. vLLM held a p95 of 2.1s and p99 of 2.8s. Ollama's p95 spiked to 18.4s with a p99 of 24.7s, reflecting the compounding queue delays of sequential processing under load.
| Metric | Ollama (1 user) | vLLM (1 user) | Ollama (50 users) | vLLM (50 users) |
|---|---|---|---|---|
| TTFR | 45ms | 82ms | 3,200ms | 145ms |
| p50 latency | 3.9s | 3.5s | 12.1s | 1.8s |
| p95 latency | 4.2s | 3.7s | 18.4s | 2.1s |
| p99 latency | 4.3s | 3.8s | 24.7s | 2.8s |
Values for Llama 3.1 8B, 256-token generation. TTFR values are non-streaming round-trip times; true streaming TTFT will differ and would typically be lower.
The predictability of vLLM's latency under load is its strongest production argument. Services with SLA requirements around response time cannot tolerate the p99 variance Ollama exhibits at scale.
Continuous batching is not an incremental improvement but a fundamentally different scaling curve.
Memory Usage and Resource Efficiency
Peak VRAM Consumption
vLLM's PagedAttention allocates GPU memory dynamically based on active sequence count, while Ollama allocates a fixed block at model load. At idle (model loaded, no active requests), Ollama consumed approximately 5.2GB VRAM for Llama 3.1 8B Q4_K_M, while vLLM used approximately 16.1GB for the FP16 variant (reflecting the larger unquantized model plus PagedAttention page tables).
Under 50 concurrent users, Ollama held VRAM nearly static at 5.4GB since it processes one request at a time. vLLM's usage grew to approximately 21.8GB, dynamically allocating pages for the KV cache of all active sequences. The AWQ variant of vLLM was more conservative, peaking at 12.4GB under the same load.
Model weight size drives most of the VRAM gap (Q4_K_M ≈ 4.6GB vs FP16 ≈ 14.9GB for 8B parameters), with PagedAttention KV cache allocation accounting for the rest. For an architectural comparison of memory management, the AWQ vLLM numbers (which use a comparable quantization bit-width) are more informative.
CPU and System RAM Usage
Ollama's system RAM footprint was notably lower: approximately 1.8GB versus vLLM's 4.6GB, driven by vLLM's Python runtime, tokenizer processes, and scheduling overhead. CPU utilization showed a similar pattern, with Ollama averaging 8% CPU under load versus vLLM's 15%.
For resource-constrained environments (developer laptops, edge devices, or shared servers where VRAM and RAM budgets are tight), Ollama's lower resource floor is a practical advantage.
| Resource | Ollama (idle) | Ollama (50 users) | vLLM FP16 (idle) | vLLM FP16 (50 users) |
|---|---|---|---|---|
| VRAM | 5.2 GB | 5.4 GB | 16.1 GB | 21.8 GB |
| System RAM | 1.8 GB | 2.1 GB | 4.6 GB | 5.2 GB |
| CPU utilization | 2% | 8% | 5% | 15% |
Developer Experience and Ecosystem Comparison
Setup and Configuration
Binary Install (Local DX)
Ollama's setup remains its main advantage: install the binary, run ollama pull llama3.1:8b, run ollama serve, and inference is available in under two minutes. No Python environment, no dependency management, no model conversion.
Docker (Benchmark Reproduction)
To reproduce the benchmark, use the Docker Compose file provided above. For Ollama, the model is pulled from within the running container (e.g., docker exec ollama ollama pull llama3.1:8b-q4_K_M). For vLLM, the model is downloaded from HuggingFace automatically on first launch — this requires the HF_TOKEN environment variable and may take significant time depending on network speed (the FP16 8B model is approximately 15GB). Allow adequate time for model downloads before running benchmarks.
vLLM additionally requires command-line flags to configure dtype, memory utilization, and model parameters. Expect 30-60 minutes of trial-and-error on your first configuration of --gpu-memory-utilization, --max-model-len, and quantization flags, particularly if you are unfamiliar with how KV cache size interacts with --max-model-len at a given memory budget.
API Compatibility and Integration
Both tools now offer OpenAI-compatible API endpoints, making them interchangeable at the HTTP layer for basic chat and completion workflows. vLLM's API surface is richer, exposing logprobs, guided decoding (structured JSON output), and tool use support. Ollama's strength is its native integration ecosystem: LangChain, Open WebUI, Continue (VS Code), and numerous desktop applications connect to Ollama out of the box.
# Identical OpenAI-compatible request to both endpoints
import openai
# --- Ollama ---
ollama_client = openai.OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Ollama ignores API key but the field is required
)
ollama_response = ollama_client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Explain PagedAttention in three sentences."}],
max_tokens=256
)
print(ollama_response.choices[0].message.content)
# Note: Ollama's response omits `usage.prompt_tokens_details` and `system_fingerprint`
# (behavior may vary by Ollama version)
# --- vLLM ---
# WARNING: This server has no authentication. Do not expose port 8000 publicly.
# Use a reverse proxy with auth for any non-localhost deployment.
vllm_client = openai.OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed"
)
vllm_response = vllm_client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain PagedAttention in three sentences."}],
max_tokens=256
)
print(vllm_response.choices[0].message.content)
# Note: vLLM includes full `usage` breakdown and supports `logprobs=True`
Response structures differ only slightly for most applications, but engineers relying on detailed token-level usage data or logprob outputs will find vLLM's response objects more complete.
When to Use Ollama vs vLLM: Decision Framework
Choose Ollama When...
- The use case involves local development, prototyping, or experimentation
- Concurrency will remain at single-user or low single-digit concurrent requests
- Minimal setup time and integrated model management are priorities
- The deployment target is a CPU-only machine or a consumer-grade GPU (not benchmarked in this study)
Choose vLLM When...
- The deployment serves concurrent users and throughput per GPU dollar matters
- Predictable tail latency (p95/p99) is required for SLA compliance
- Multi-model serving or LoRA adapter hot-switching is needed
- Tensor parallelism across multiple GPUs is part of the scaling plan
- Structured output (guided decoding), logprobs, or advanced API features are required
- Integration with developer tools like Open WebUI, Continue, or LangChain is needed immediately (note: Ollama's ecosystem support is broader here, but vLLM's OpenAI-compatible API works with most tools)
Can Both Be Used Together?
A common and practical pattern is to use Ollama during local development and vLLM for staging and production. The OpenAI-compatible API layer makes this transition straightforward at the code level. The primary consideration when transitioning is model format: Ollama uses GGUF quantized models while vLLM typically serves HuggingFace-format models in FP16, AWQ, or GPTQ. Output behavior may differ subtly between quantization formats of the same base model, so validation testing during the transition is advisable.
The predictability of vLLM's latency under load is its strongest production argument. Services with SLA requirements around response time cannot tolerate the p99 variance Ollama exhibits at scale.
| Criteria | Ollama | vLLM |
|---|---|---|
| Single-user throughput | Strong (~62 tok/s, Llama 8B Q4) | Slightly better (~71 tok/s FP16) |
| Multi-user throughput | Limited (~155 tok/s at 50 users) | High (~920 tok/s at 50 users) |
| Tail latency at scale | Poor (p99 ~24.7s at 50 users) | Strong (p99 ~2.8s at 50 users) |
| Setup time | Minutes | Varies (model download ≈15GB + configuration) |
| VRAM efficiency at scale | Static (lower baseline) | Dynamic (higher ceiling) |
| API richness | Basic OpenAI compat | Full OpenAI compat + extras |
| Model ecosystem | Ollama Hub (GGUF) | HuggingFace (FP16/AWQ/GPTQ) |
| Best fit | Dev/local/edge | Production/serving |
The Right Tool for the Right Job
The benchmark results confirm a clear pattern: vLLM wins on throughput and latency predictability above roughly 5 concurrent users, while Ollama wins on simplicity, resource efficiency, and single-user performance. At 50 concurrent users, vLLM delivered roughly 6x the total throughput with p99 latency under 3 seconds, compared to Ollama's 24.7-second p99. In single-stream use, Ollama came within 13% of vLLM FP16 throughput, a gap that partially reflects quantization differences rather than architecture alone.
The gap between these tools has narrowed, particularly in single-user scenarios where Ollama's backend improvements have closed much of the distance. But the architectural difference in batching strategy means the tools are not converging toward a single use case. They remain complementary.
Engineers are encouraged to run this local llm benchmark on their own hardware using the Docker Compose and benchmark script provided above.
Common Pitfalls
- Model tag mismatch: Ollama Hub tags are case-sensitive and may change. Always verify the exact tag with
ollama listafter pulling. - HuggingFace authentication:
meta-llama/Llama-3.1-8B-Instructis a gated model. Without a validHF_TOKEN, the vLLM container will fail silently or with aGatedRepoError. - Small sample p95/p99: With 10 runs at concurrency 1 (10 total samples), p99 equals the maximum value. Increase runs or concurrency for statistically meaningful percentile estimates.
- TTFR vs TTFT: This benchmark measures non-streaming round-trip time. Do not compare these values to streaming TTFT numbers from other benchmarks.
- GPU memory contention:
--gpu-memory-utilization 0.90on 24GB leaves ~2.4GB. Other GPU processes may cause vLLM to OOM. - Thermal throttling: Three warmup runs may not fully stabilize GPU thermals on sustained workloads; consider additional warmup for long benchmarks.

