Engineering teams pursuing local AI capabilities face an expensive default assumption: every developer needs their own GPU. The math rarely supports it. A shared GPU server running vLLM with continuous batching and an OpenAI-compatible API can serve an entire team from a single workstation, keeping latency low, data private, and hardware budgets sane.
This architecture replaces per-developer GPU allocation with a Docker-based deployment that handles queue management, fair-use scheduling, and concurrent inference requests across the LAN.
Table of Contents
- Why One GPU Can Serve Your Whole Team
- Architecture Overview
- Setting Up the GPU Server
- Managing Multi-User Access and Fair Queuing
- Connecting Developer Workstations
- Performance Tuning and Scaling
- Security and Data Privacy Considerations
- Deployment Sequence
Why One GPU Can Serve Your Whole Team
The Economics of Per-Developer GPU Allocation
The naive approach to local AI for a five-person team means purchasing five NVIDIA RTX 4090 cards at roughly $2,500 each (estimated, mid-2024; verify current market rates), totaling $12,500 in GPU hardware alone, not counting the workstations to house them. A single NVIDIA A6000 with 48GB of VRAM costs approximately $4,500, and even a dual RTX 4090 build in a shared workstation lands around $7,000 all-in. The savings compound further when factoring in power consumption, cooling, and maintenance across five individual machines versus one centralized server.
API-based alternatives carry their own costs. A team of five developers hitting a cloud LLM API for code completions, chat, and documentation generation can expect roughly $500/month on GPT-3.5 Turbo at moderate volume (around 1M tokens/day) or up to $2,000/month on GPT-4 Turbo at similar usage. Over twelve months, that is $6,000 to $24,000, with no hardware asset remaining and all proprietary code transiting external servers.
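The figures above reduce to simple break-even arithmetic. The sketch below uses this article's illustrative prices plus an assumed $50/month for power and cooling; substitute your own numbers.

```python
# Back-of-envelope break-even for a shared GPU server versus cloud APIs,
# using the illustrative figures from this article (verify current prices).
def breakeven_months(hardware_cost: float, monthly_api_cost: float,
                     monthly_power_cost: float = 50.0) -> float:
    """Months until the one-time hardware spend beats recurring API fees."""
    monthly_saving = monthly_api_cost - monthly_power_cost
    if monthly_saving <= 0:
        return float("inf")  # the server never pays for itself
    return hardware_cost / monthly_saving

# A6000 server ($4,500) vs. moderate GPT-3.5 Turbo usage ($500/month)
print(round(breakeven_months(4500, 500), 1))   # 10.0 months
# vs. heavier GPT-4 Turbo usage ($2,000/month)
print(round(breakeven_months(4500, 2000), 1))  # 2.3 months
```

Even under the conservative API estimate, the hardware pays for itself within a year, and the GPU remains a resellable asset afterward.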
Why Most Developer GPU Time Goes Unused
Developer inference workloads are fundamentally bursty. A developer writes code for several minutes, triggers a completion or chat request, waits a few seconds for the response, then returns to writing. In the author's testing on a single-developer RTX 4090 setup running Mistral 7B with Continue.dev, nvidia-smi dmon -s u -d 5 showed GPU compute utilization averaging 5 to 15 percent over a two-hour coding session. Run the same command during your team's typical workflow to measure your own baseline.
vLLM's continuous batching mechanism exploits this pattern directly. Unlike naive batching, which waits for a fixed batch to fill before processing, continuous batching dynamically adds incoming requests to an in-progress batch. The GPU stays saturated only when multiple developers happen to request inference simultaneously, and even then, each request begins processing immediately rather than waiting for others to complete.
Architecture Overview
System Design and Component Map
Developer laptops connect over LAN or VPN to a Docker host running on the GPU workstation. Inside the Docker host, a vLLM server loads model weights from NVMe storage and exposes an OpenAI-compatible API. A reverse proxy layer (Caddy or Nginx) sits in front, handling TLS termination and basic authentication. A lightweight FastAPI middleware provides per-user rate limiting, priority tagging, and request logging for usage auditing. An optional monitoring sidecar exposes GPU metrics for dashboard consumption.
Each component addresses a specific failure mode. The reverse proxy enforces transport security and blocks unauthenticated access. The FastAPI middleware layer adds fair-use controls that vLLM itself does not provide natively. vLLM handles the core inference workload with PagedAttention memory management and continuous batching. NVMe storage ensures model weights load quickly at server startup.
Why vLLM over Ollama, llama.cpp server, or Hugging Face TGI? It comes down to production-grade concurrency handling. Ollama targets single-user local use and lacks vLLM's production concurrency features. llama.cpp's continuous batching (--cont-batching) is less mature and achieves lower concurrent throughput than vLLM's PagedAttention scheduler. TGI provides similar capabilities but vLLM's PagedAttention implementation delivers more efficient memory utilization for concurrent requests, and its OpenAI-compatible API endpoint means zero client-side tooling changes for developers already using OpenAI SDKs.
Hardware Requirements and Recommendations
The minimum viable specification is an NVIDIA RTX 3090 with 24GB VRAM, which can run 7B to 13B parameter models at up to 8 concurrent sequences with 4096 context length at 0.90 GPU memory utilization. For teams wanting to run 34B or quantized 70B models, an RTX 4090 (24GB) or A6000 (48GB) is the recommended baseline. The A6000's additional VRAM headroom allows larger context windows and bigger models without quantization trade-offs.
Storage matters more than most guides suggest. Model weights for a 70B parameter model in 4-bit quantization still consume roughly 35 to 40GB on disk. NVMe storage ensures server cold starts complete in seconds rather than minutes. System RAM should be at least 64GB to handle model loading and the operating system overhead without contention. Network requirements start at Gigabit Ethernet for LAN-connected teams. Inference responses are text, so bandwidth is rarely the bottleneck; latency is. For remote or VPN-connected developers, a WireGuard tunnel over a reasonably fast internet connection works, though first-token latency will increase by the round-trip time.
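The disk and VRAM figures above follow from a simple rule of thumb: bytes per weight times parameter count. The sketch below computes the weight footprint only; real VRAM usage adds KV cache, activations, and framework overhead on top.

```python
# Rough sizing for model weights on disk and in VRAM. Coarse rule of thumb,
# not an exact figure; actual usage adds KV cache and runtime overhead.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(weight_gb(7, 16), 1))   # 14.0 GB: 7B model at fp16/bf16
print(round(weight_gb(70, 4), 1))   # 35.0 GB: 70B model at 4-bit
```

The 70B/4-bit result matches the 35 to 40GB on-disk figure quoted above; quantization formats add a few percent of metadata beyond the raw weights.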
Setting Up the GPU Server
Base System and NVIDIA Driver Configuration
Use Ubuntu Server 22.04 LTS or 24.04 LTS for NVIDIA GPU workloads; both have the broadest driver and container toolkit support. Install the NVIDIA driver from the official repository rather than the Ubuntu default packages to ensure version compatibility with CUDA and the container toolkit.
# Add NVIDIA package repository and install drivers
sudo apt-get update
sudo apt-get install -y linux-headers-$(uname -r)
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
# Check which driver version is recommended for your GPU
ubuntu-drivers devices
# Install the recommended driver version (substitute the version shown above)
sudo apt-get install -y nvidia-driver-<recommended-version>
# Reboot to load the new driver
sudo reboot
# After reboot, verify the driver is loaded
nvidia-smi
The nvidia-smi output should show the GPU model, driver version, and CUDA version. Note that the CUDA version shown by nvidia-smi reflects the maximum CUDA version the driver supports, not an installed CUDA toolkit. If the GPU does not appear, check that Secure Boot is disabled in BIOS, as it frequently blocks unsigned kernel modules. Confirm CUDA compatibility with the vLLM image tag you plan to use by checking the vLLM release notes for the minimum required CUDA version.
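When nvidia-smi fails after a reboot, these two diagnostics (assuming a standard Ubuntu install with mokutil available) usually identify the cause quickly:

```shell
# Check Secure Boot state; unsigned NVIDIA modules are blocked when enabled
mokutil --sb-state
# Confirm the nvidia kernel module is actually loaded
lsmod | grep '^nvidia'
```

If Secure Boot is enabled, either disable it in firmware settings or enroll a Machine Owner Key to sign the module.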
Docker and NVIDIA Container Toolkit
Docker Engine and the NVIDIA Container Toolkit are required to pass GPU access into containers. The toolkit installs a runtime hook that makes NVIDIA devices available inside Docker containers without manual device mapping.
# Install Docker Engine
# Option 1: convenience script (verify checksum before executing, or use Option 2)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Option 2: follow manual APT installation steps at
# https://docs.docker.com/engine/install/ubuntu/
sudo usermod -aG docker $USER
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU is accessible inside containers
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
The final command should display the same nvidia-smi output seen on the host. If it fails, confirm that the NVIDIA Container Toolkit runtime configuration was applied by checking /etc/docker/daemon.json for the nvidia runtime entry. Note that these instructions use docker compose (v2 plugin, included with Docker Engine ≥24.0), not the legacy docker-compose v1 binary.
Deploying vLLM with Docker Compose
The Docker Compose file below deploys vLLM with GPU passthrough, a persistent volume for cached model weights (avoiding re-downloads on restart), and tuned serving parameters. Model selection depends on team needs: DeepSeek-Coder-V2-Lite-Instruct works well for code-focused teams, Mistral-7B-Instruct-v0.3 provides a strong general-purpose option, and CodeLlama-34b-Instruct fits teams with the VRAM budget for a larger coding model.
services:
vllm:
image: vllm/vllm-openai:v0.4.3 # pin to a specific release; check github.com/vllm-project/vllm/releases
container_name: vllm-server
restart: unless-stopped
ports:
- "8000:8000"
volumes:
- model-cache:/root/.cache/huggingface
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command: >
--model mistralai/Mistral-7B-Instruct-v0.3
--max-model-len 8192
--gpu-memory-utilization 0.90
--max-num-seqs 16
--dtype auto
--host 0.0.0.0
--port 8000
volumes:
model-cache:
driver: local
The key vLLM flags warrant explanation:
# --max-model-len 8192 : Maximum context length per request. Lower values
# free VRAM for more concurrent sequences.
# --gpu-memory-utilization 0.90 : Fraction of GPU memory vLLM may use. Leave
# headroom (0.85-0.92) to avoid OOM errors.
# On 24GB GPUs with quantized models and
# --max-num-seqs 16, consider lowering to 0.85
# to provide additional KV cache headroom.
# --max-num-seqs 16 : Maximum number of sequences processed concurrently.
# This is a primary concurrency control lever;
# --max-num-batched-tokens also limits throughput.
# --dtype auto : Selects bfloat16 on Ampere (sm_80+) and newer GPUs;
# falls back to float16 on older architectures.
If you need to deploy a model that requires --trust-remote-code (e.g., DeepSeek), you can add that flag to the command. Security warning: --trust-remote-code executes Python code from the model repository. Only enable this flag for models from sources your organization explicitly trusts. Mistral and most standard architectures do not require it.
Set HF_TOKEN in a .env file alongside the Compose file for gated model access (the token needs at least read scope for gated model repositories). Ensure .env is listed in .gitignore before committing this repository to version control. Launch with docker compose up -d and verify with curl http://localhost:8000/v1/models.
Adding a Reverse Proxy and Basic Auth
Even on a private LAN, a reverse proxy adds TLS encryption and access control. Caddy handles automatic HTTPS certificate provisioning, including for internal hostnames if configured with a local CA. The following Caddyfile proxies requests to vLLM with basic authentication:
gpu-server.internal:443 {
tls internal
# basicauth /v1/* {
# # Caddy requires at least one valid user entry to start.
# # Generate a password hash by running:
# # caddy hash-password --plaintext 'your-password'
# # Then uncomment this block and add the user line:
# # dev_team <paste-hash-output-here>
# }
reverse_proxy localhost:8080 {
header_up X-Forwarded-For {remote_host}
header_up X-Real-IP {remote_host}
}
log {
output file /var/log/caddy/access.log
format json
}
}
Note: The basicauth block above is commented out because Caddy requires at least one valid user entry to start. Before deploying to production, generate a real hash with caddy hash-password --plaintext 'your-password', uncomment the block, and add your user entry. Validate the configuration with caddy validate --config Caddyfile before starting Caddy.
Deploy Caddy as a second service in the Docker Compose file or install it directly on the host. The tls internal directive causes Caddy's local CA to issue a certificate for the hostname. Distribute the CA root certificate (found in Caddy's data directory under pki/authorities/local/root.crt) to developer machines for trusted HTTPS, not the server certificate itself. Alternatively, use tls with an ACME provider for publicly resolvable internal domains.
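Distribution of the root certificate can be scripted. The sketch below assumes an Ubuntu developer machine and a systemd host install of Caddy; the data directory path varies by install method (Docker images use /data/caddy), so adjust to your layout. macOS clients use the Keychain (security add-trusted-cert) instead.

```shell
# Fetch Caddy's local CA root from the server (path varies by install method)
scp gpu-server:/var/lib/caddy/.local/share/caddy/pki/authorities/local/root.crt .
# Install it into the Ubuntu system trust store (.crt extension is required)
sudo cp root.crt /usr/local/share/ca-certificates/team-gpu-root.crt
sudo update-ca-certificates
```

After this, curl and most HTTP libraries on the developer machine will accept the server's certificate without --insecure flags.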
Managing Multi-User Access and Fair Queuing
How vLLM Handles Concurrent Requests
vLLM's continuous batching processes requests as they arrive without forcing them to wait for a batch boundary. When a new request hits the server while existing sequences are generating, vLLM inserts it into the current processing iteration. vLLM frees memory slots (via PagedAttention) when a sequence finishes generating, making room for new sequences without restarting the batch.
The --max-num-seqs flag directly controls how many sequences can be in flight simultaneously. The --max-num-batched-tokens flag further limits the total number of tokens processed per iteration, which also affects throughput. Setting --max-num-seqs too high on a memory-constrained GPU causes KV cache pressure, increasing time-to-first-token for all users. Setting it too low leaves GPU compute idle during concurrent bursts. For a team of five on an RTX 4090, --max-num-seqs 8 to 16 provides a reasonable balance.
Under contention, when more requests arrive than --max-num-seqs allows, vLLM queues them internally. Each queued request waits for an in-flight sequence to finish before vLLM begins processing it. This is where the FastAPI middleware layer adds value: it can enforce per-user fairness before requests even reach vLLM.
Building a Lightweight Request Queue and Rate Limiter
The following FastAPI application acts as a proxy between developers and the vLLM server, adding per-user rate limiting via a token bucket algorithm, priority logging via a custom header (enforcement requires implementing a queue worker; see note in code), and request logging for usage auditing.
Prerequisites: Create a requirements.txt with the following dependencies before deploying:
fastapi>=0.110.0
httpx>=0.27.0
uvicorn>=0.29.0
cachetools>=5.3.0
import os
import time
import asyncio
import hashlib
import logging
from collections import Counter, deque
from contextlib import asynccontextmanager
from cachetools import LRUCache
from fastapi import FastAPI, Request, HTTPException, Header
from fastapi.responses import StreamingResponse
import httpx
# ── Configuration ──────────────────────────────────────────────────────────────
VLLM_BACKEND = os.environ.get("VLLM_BACKEND", "http://localhost:8000")
RATE_LIMIT = int(os.environ.get("RATE_LIMIT", "20")) # requests per minute per user
BUCKET_REFILL_SECONDS = 60 # seconds to fully refill
BACKEND_TIMEOUT = float(os.environ.get("BACKEND_TIMEOUT", "120"))
_BUCKET_CACHE_MAX = 2000 # max concurrent tracked users
# ── Logging ────────────────────────────────────────────────────────────────────
logger = logging.getLogger("queue-proxy")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
# ── Rate limiter state ─────────────────────────────────────────────────────────
user_buckets: LRUCache = LRUCache(maxsize=_BUCKET_CACHE_MAX)
user_locks: LRUCache = LRUCache(maxsize=_BUCKET_CACHE_MAX)  # bounded, like the buckets
_locks_lock = asyncio.Lock()
# Capped request log to prevent unbounded memory growth
request_log: deque[dict] = deque(maxlen=10000)
# NOTE: Priority is logged but not enforced. To enforce priority scheduling,
# implement a worker that dequeues from an asyncio.PriorityQueue before
# forwarding to the backend. As written, the proxy forwards requests immediately
# in arrival order.
# ── HTTP client lifecycle ──────────────────────────────────────────────────────
_http_client: httpx.AsyncClient | None = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global _http_client
_http_client = httpx.AsyncClient(
timeout=BACKEND_TIMEOUT,
limits=httpx.Limits(max_connections=64, max_keepalive_connections=32),
)
yield
await _http_client.aclose()
app = FastAPI(lifespan=lifespan)
def _get_client() -> httpx.AsyncClient:
if _http_client is None:
raise RuntimeError("HTTP client not initialised")
return _http_client
# ── Rate limiting ──────────────────────────────────────────────────────────────
async def _get_user_lock(user: str) -> asyncio.Lock:
async with _locks_lock:
if user not in user_locks:
user_locks[user] = asyncio.Lock()
return user_locks[user]
async def check_rate_limit(user: str) -> bool:
lock = await _get_user_lock(user)
async with lock:
now = time.time()
if user not in user_buckets:
user_buckets[user] = {"tokens": float(RATE_LIMIT), "last": now}
bucket = user_buckets[user]
elapsed = now - bucket["last"]
bucket["tokens"] = min(
float(RATE_LIMIT),
bucket["tokens"] + elapsed * (RATE_LIMIT / BUCKET_REFILL_SECONDS),
)
bucket["last"] = now
if bucket["tokens"] >= 1.0:
bucket["tokens"] -= 1.0
return True
return False
# ── Headers to forward ─────────────────────────────────────────────────────────
FORWARDED_HEADERS = {"content-type", "accept", "authorization"}
# ── Proxy route ────────────────────────────────────────────────────────────────
@app.api_route("/v1/{path:path}", methods=["GET", "POST", "DELETE", "PUT"])
async def proxy(
request: Request,
path: str,
x_api_key: str = Header(...),
x_priority: str = Header(default="normal"),
):
user = x_api_key # TODO(security): resolve key to opaque user identity
if not await check_rate_limit(user):
raise HTTPException(status_code=429, detail="Rate limit exceeded")
body = await request.body() if request.method in ("POST", "PUT") else None
forward_headers = {
k: v
for k, v in request.headers.items()
if k.lower() in FORWARDED_HEADERS
}
user_token = hashlib.sha256(user.encode()).hexdigest()[:16]
logger.info("user=%s priority=%s path=/v1/%s", user_token, x_priority, path)
request_log.append({
"user": user_token,
"path": path,
"time": time.time(),
"priority": x_priority,
})
    client = _get_client()
    # Send with stream=True and close the upstream response via a background
    # task. A plain `async with client.stream(...)` would close the connection
    # when this function returns, before StreamingResponse consumes the body.
    from starlette.background import BackgroundTask  # ships with FastAPI
    upstream = client.build_request(
        request.method,
        f"{VLLM_BACKEND}/v1/{path}",
        content=body,
        headers=forward_headers,
    )
    r = await client.send(upstream, stream=True)
    return StreamingResponse(
        r.aiter_bytes(),
        status_code=r.status_code,
        media_type=r.headers.get("content-type", "application/json"),
        background=BackgroundTask(r.aclose),
    )
@app.get("/stats")
async def stats():
log_list = list(request_log)
return {
"total_requests": len(log_list),
"per_user": dict(Counter(r["user"] for r in log_list)),
}
Deploy this as a second Docker container or a systemd service on the host, listening on port 8080. Point the reverse proxy at port 8080 instead of 8000 directly. The X-Priority header allows IDE integrations to flag code completions as high priority while batch documentation jobs run at normal priority. Note that the current implementation logs priority but does not enforce ordering.
Monitoring GPU Utilization and Server Health
Tracking GPU memory consumption, compute utilization, and request throughput prevents silent degradation. The following sidecar script exposes GPU statistics as a JSON endpoint suitable for polling by a custom dashboard. For native Prometheus scraping, deploy NVIDIA DCGM Exporter or configure prometheus-json-exporter with a field-mapping file.
import subprocess
from fastapi import FastAPI
NVIDIA_SMI = "/usr/bin/nvidia-smi"
app = FastAPI()
def _safe_int(value: str) -> int | None:
try:
return int(value)
except ValueError:
return None
@app.get("/gpu-stats")
def gpu_stats():
result = subprocess.run(
[NVIDIA_SMI,
"--query-gpu=utilization.gpu,utilization.memory,"
"memory.used,memory.total,temperature.gpu",
"--format=csv,noheader,nounits"],
capture_output=True,
text=True,
timeout=10,
)
if result.returncode != 0:
return {"error": "nvidia-smi failed", "detail": result.stderr.strip()}
gpus = []
    for i, line in enumerate(result.stdout.strip().split("\n")):
vals = [v.strip() for v in line.split(",")]
if len(vals) < 5:
continue
gpus.append({
"gpu_id": i,
"gpu_util_pct": _safe_int(vals[0]),
"mem_util_pct": _safe_int(vals[1]),
"mem_used_mb": _safe_int(vals[2]),
"mem_total_mb": _safe_int(vals[3]),
"temp_c": _safe_int(vals[4]),
})
return {"gpus": gpus}
Key metrics to watch: GPU memory used versus total (approaching 100% signals OOM risk), GPU compute utilization (sustained >90% means the team is outgrowing the hardware), and request queue depth from the FastAPI proxy's /stats endpoint. For full observability, the NVIDIA DCGM Exporter provides Prometheus-native GPU metrics and integrates directly with Grafana dashboards.
Connecting Developer Workstations
Using the OpenAI-Compatible API
vLLM's OpenAI-compatible endpoint means any tool or library that speaks the OpenAI chat completions protocol works without modification. Developers simply point the base URL at the shared server instead of api.openai.com.
from openai import OpenAI
client = OpenAI(
base_url="https://gpu-server.internal/v1",
api_key="your-team-api-key"
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to merge two sorted lists."}
],
max_tokens=512,
temperature=0.2
)
print(response.choices[0].message.content)
The model parameter must match the model name passed to vLLM's --model flag exactly. The API key value becomes the user identifier in the rate-limiting proxy. Note that the reverse proxy uses HTTP Basic Auth (-u username:password with curl) while the FastAPI proxy uses the X-Api-Key header. These are two separate authentication layers.
IDE Integration: VS Code and JetBrains
Continue.dev, the open-source AI coding assistant for VS Code and JetBrains, supports custom OpenAI-compatible endpoints natively. Add the following to the Continue configuration file (.continue/config.json in the project or user directory). Verify key names against your installed Continue version's schema at continue.dev/docs.
{
"models": [
{
"title": "Team GPU - Mistral 7B",
"provider": "openai",
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"apiBase": "https://gpu-server.internal/v1",
"apiKey": "your-team-api-key"
}
],
"tabAutocompleteModel": {
"title": "Team GPU - Autocomplete",
"provider": "openai",
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"apiBase": "https://gpu-server.internal/v1",
"apiKey": "your-team-api-key"
}
}
For JetBrains IDEs with AI Assistant or third-party plugins that accept custom OpenAI providers, the same apiBase and apiKey pattern applies. Any tool that reads the OPENAI_BASE_URL environment variable can be configured system-wide by setting export OPENAI_BASE_URL=https://gpu-server.internal/v1 in the developer's shell profile (openai Python SDK ≥1.0.0; for older versions use OPENAI_API_BASE).
CLI and Script Access for Batch Jobs
For terminal usage and scripted batch processing, curl and shell loops provide direct access without additional dependencies. Note: the reverse proxy requires HTTP Basic Auth (-u username:password), and the FastAPI proxy behind it uses the X-Api-Key header. Include both when requests traverse both layers.
# Single request from the terminal
curl -s https://gpu-server.internal/v1/chat/completions \
-u dev_team:your-password \
-H "Content-Type: application/json" \
-H "x-api-key: your-team-api-key" \
-H "x-priority: normal" \
-d '{"model":"mistralai/Mistral-7B-Instruct-v0.3","messages":[{"role":"user","content":"Explain Python dataclasses in 3 sentences."}],"max_tokens":256}' | jq '.choices[0].message.content'
# Batch docstring generation across Python files
# Requires: jq installed on the client machine
find ./src -name "*.py" | while IFS= read -r f; do
content=$(cat "$f")
payload=$(jq -n \
--arg model "mistralai/Mistral-7B-Instruct-v0.3" \
--arg content "Generate docstrings for: ${content}" \
'{model: $model,
messages: [{role: "user", content: $content}],
max_tokens: 512}')
curl -s --max-time 180 \
https://gpu-server.internal/v1/chat/completions \
-u dev_team:your-password \
-H "Content-Type: application/json" \
-H "x-api-key: your-team-api-key" \
-H "x-priority: normal" \
-d "$payload" \
> "${f}.docs"
done
Batch jobs should use the x-priority: normal header to avoid starving interactive code completions from IDE users. Note that jq is used to safely construct JSON payloads containing file content, which handles escaping of special characters in source files.
Performance Tuning and Scaling
Optimizing for Team Size
For teams of two to five developers, a single RTX 4090 running a 7B to 13B parameter model with --max-num-seqs 8 handles concurrent requests without queue buildup. Code completions return within one to two seconds under typical load.
Teams of five to fifteen developers should step up to an A6000 48GB or a dual-GPU configuration, running a 34B parameter model with AWQ or GPTQ 4-bit quantization. Set --max-num-seqs between 16 and 32. Quantization reduces model quality by a small margin; measure the difference on your own evaluation set (e.g., HumanEval or MBPP pass rates for coding tasks) before committing to a quantization method, since the impact varies by model and task. The payoff: 4-bit weights occupy roughly a quarter of their fp16 footprint, so a given VRAM budget fits a substantially larger model.
When queue depth from your /stats endpoint consistently shows requests waiting, or when p95 first-token latency exceeds your team's SLA, you have outgrown the current hardware. Adding a second GPU to the same machine (using vLLM's tensor parallelism via --tensor-parallel-size 2) scales throughput more cost-effectively than a second server, up to the point where PCIe bandwidth between GPUs becomes a bottleneck. NVLink-connected GPUs avoid this, but outside of data center hardware, consumer GPUs communicate over PCIe only. A second server becomes the better choice when the team outgrows what two GPUs can handle or when geographic distribution demands lower latency to different offices.
Latency Expectations and SLA Setting
Latency varies significantly with concurrency, quantization, and context length. The following numbers were observed on an RTX 4090 running Mistral-7B-Instruct-v0.3 with AWQ 4-bit quantization, 4096 context length, and --max-num-seqs 16:
- 1 concurrent user: ~60 to 80 tokens/sec
- 5 concurrent users: ~12 to 18 tokens/sec per user
- 10 concurrent users: ~6 to 10 tokens/sec per user, with first-token latency climbing to 1 to 3 seconds
Longer context lengths and full-precision (float16/bfloat16) models will produce lower throughput. Shorter contexts and aggressive quantization will produce higher throughput.
Reasonable team SLA targets for this architecture: code completions under two seconds to first token (achievable at up to ~8 concurrent users on the configuration above), chat responses under five seconds to first token. If monitoring shows consistent SLA violations, increase --max-num-seqs (if VRAM allows), switch to a smaller or more heavily quantized model, or add GPU capacity.
Security and Data Privacy Considerations
The primary security advantage of this architecture is data locality. No source code, prompts, or completions leave the local network, provided all developer connections traverse the LAN or an encrypted VPN tunnel.
Rotate API keys on a fixed schedule (e.g., every 90 days) and revoke them immediately when a team member departs. The FastAPI middleware's request logging provides an audit trail of who queried what and when, supporting compliance requirements without external data processors. Check each model's license before deployment: Mistral models use Apache 2.0, and DeepSeek models have their own license terms. Verify license compatibility with the organization's use case for any model you choose.
Deployment Sequence
A single GPU, Docker, vLLM with continuous batching, and a thin FastAPI proxy layer serve an entire development team's local AI needs. Start with the hardware and model that fits the current team size, deploy the monitoring stack to track real usage patterns, and scale GPU capacity or model selection based on observed metrics rather than assumptions.

