How to Set Up AI Code Review With a Local LLM
- Install Ollama and pull the Qwen2.5-Coder 7B model to your local machine.
- Create a Flask application with a /webhook endpoint that receives GitHub PR events.
- Verify incoming webhook signatures using HMAC-SHA256 to prevent unauthorized requests.
- Fetch the pull request diff via the GitHub API and parse it into per-file hunks.
- Construct a review prompt embedding the diff and instructing the model to return structured JSON findings.
- Send the prompt to Ollama's local REST API and parse the JSON response for bugs, security issues, and style violations.
- Post the review comments back to the pull request using GitHub's Reviews API as a single batched review.
- Deploy with a WSGI server and background job queue to handle inference times beyond GitHub's 10-second webhook timeout.
This article walks through building a fully automated, self-hosted code review pipeline using a local LLM — eliminating per-seat costs and keeping your code off third-party servers.
Table of Contents
- Why Self-Host Your AI Code Review?
- Architecture Overview
- Setting Up Ollama and Qwen2.5-Coder
- Building the Webhook Server
- Integrating the Local LLM for Code Review
- Posting Review Comments to GitHub
- Testing and Running Locally
- Adapting for GitLab
- Tips for Production Use
- What Comes Next
Why Self-Host Your AI Code Review?
Cloud-based AI code review tools like GitHub Copilot, CodeRabbit, and Codium have reshaped how teams catch bugs before merge. But they come with strings attached: per-seat pricing that scales with headcount, transmission of proprietary source code to third-party servers (depending on subscription tier and configuration), and vendor lock-in that makes switching painful. For organizations handling sensitive codebases, regulated data, or simply wanting full control, a self-hosted code review AI offers a practical alternative.
Running a local LLM for code review means complete data sovereignty, provided Ollama is running on local infrastructure and not configured to point at a remote endpoint. No diffs leave the network. You pay nothing beyond buying and running the hardware. Review prompts can be customized per project, tuned for specific coding standards, and iterated on without waiting for a vendor's product roadmap. The system works offline, behind air-gapped networks, and in environments where compliance teams would never approve sending code to an external API.
This article walks through building a Flask webhook server that listens for GitHub pull requests, analyzes diffs with Qwen2.5-Coder running locally through Ollama, and posts structured review comments back to the PR. The result is a fully automated, self-hosted code review pipeline that eliminates per-seat and per-API-call costs.
Prerequisites
This tutorial assumes Python 3.10 or later, Ollama installed on the development machine, and a GitHub account with a test repository where you have admin access (to configure webhooks). A minimum of 16 GB RAM is recommended for running the 7B parameter model comfortably, though 8 GB works with quantized variants. An ngrok account with an authenticated CLI is required for local development (free tier works, but you must run ngrok config add-authtoken <your_token> before use). Readers should be familiar with Flask, REST APIs, and standard Git workflows including pull request mechanics.
Architecture Overview
The pipeline follows a linear flow: a GitHub pull request event triggers a webhook delivery to a Flask server. The server fetches the PR diff via the GitHub API, constructs a review prompt, sends it to Qwen2.5-Coder running locally in Ollama, parses the structured response, and posts review comments back to the PR through GitHub's Reviews API.
Five components make up the system. A webhook receiver validates and routes incoming GitHub events. A diff parser extracts file-level changes into a structured format. A prompt builder frames the diff for code review. An LLM inference layer communicates with Ollama's local API. Finally, a comment poster maps model output to GitHub's review format.
Flask handles HTTP with minimal boilerplate. Ollama provides zero-configuration local model serving with a simple REST API. In testing, Qwen2.5-Coder 7B returned valid JSON review output more reliably than CodeLlama 7B, and at 7 billion parameters it still catches injection, null-reference, and style issues while remaining practical to run on developer workstations or modest server hardware.
How the Review Pipeline Works
When a developer opens or updates a pull request, GitHub sends a POST request to the configured webhook URL. The Flask server verifies the HMAC-SHA256 signature, extracts the repository and PR metadata, and fetches the unified diff. The diff is parsed into per-file hunks, filtered to exclude non-reviewable files, and formatted into a prompt that instructs Qwen2.5-Coder to identify bugs, security vulnerabilities, performance issues, and style violations. The model returns JSON-structured feedback, which the server maps to GitHub's review comment format and posts as a single batched review.
Important: This tutorial keeps the implementation synchronous for clarity. GitHub's webhook timeout is 10 seconds, and LLM inference routinely exceeds that. In production, you must offload LLM inference to a background job queue (such as RQ or Celery) to prevent GitHub from marking deliveries as failed and potentially deactivating your webhook endpoint. This is not optional for real-world use.
Setting Up Ollama and Qwen2.5-Coder
Installing Ollama and Pulling the Model
Ollama installs with a single command on macOS and Linux. After installation, pulling the 7B parameter Qwen2.5-Coder model and verifying it responds to code review prompts is straightforward. The 7B model is approximately 4.7 GB; download time depends on connection speed and may take several minutes on a typical broadband connection.
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull the Qwen2.5-Coder 7B model
ollama pull qwen2.5-coder:7b
# Verify the model works with a quick code review prompt
ollama run qwen2.5-coder:7b "Review this Python code for bugs: def divide(a, b): return a/b"
The model should respond with observations about missing zero-division handling, confirming it is loaded and capable of code review reasoning.
Choosing the Right Model Size
Qwen2.5-Coder ships in several parameter sizes. The 1.5B variant returns results in single-digit seconds on most hardware but produces shallower analysis — in the author's experience, it often misses subtle bugs. The 7B model hits a practical sweet spot: in testing, it catches security issues, logic errors, and style problems, returning results in under two minutes for a typical 200-line diff on an M1 MacBook with 16 GB RAM. The 32B variant delivers the highest quality reviews but demands significantly more RAM and GPU memory, making it impractical for most developer workstations.
For machines with limited memory, quantized versions (Q4_K_M or Q8_0) reduce the footprint while still catching the same bug categories in the author's testing. Other viable models include CodeLlama and DeepSeek-Coder, but as of this writing, Qwen2.5-Coder 7B returned valid structured JSON and identified injection, null-reference, and logic bugs more consistently than the alternatives tested.
Building the Webhook Server
Project Setup and Dependencies
The project requires a minimal set of dependencies. Create a project directory, set up a virtual environment, and install the requirements:
ai-code-review/
├── app.py
├── requirements.txt
└── .env
# requirements.txt
# Pin to versions without known CVEs. Regenerate with:
# pip-compile --generate-hashes requirements.in
Flask==3.0.3
Werkzeug==3.0.3
requests==2.32.3
python-dotenv==1.0.1
# Production:
# gunicorn>=22.0.0
Create a .env file with the required environment variables. Do not commit this file to version control.
# .env
GITHUB_WEBHOOK_SECRET=your_webhook_secret_here
GITHUB_TOKEN=ghp_your_token_here
OLLAMA_URL=http://localhost:11434
The hashlib and hmac modules ship with Python's standard library and handle webhook signature verification. No additional cryptographic dependencies are needed.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Implementing the Webhook Endpoint
The core Flask application receives GitHub webhook deliveries, verifies their authenticity, and routes pull request events to the review pipeline. Signature verification using HMAC-SHA256 is not optional: without it, anyone who discovers the webhook URL can trigger arbitrary processing.
import hmac
import hashlib
import json
import logging
import os
import re
import requests
from flask import Flask, request, jsonify, abort
from dotenv import load_dotenv
load_dotenv()
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
)
app = Flask(__name__)
GITHUB_WEBHOOK_SECRET = os.getenv("GITHUB_WEBHOOK_SECRET")
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "qwen2.5-coder:7b")
# Fail fast if required environment variables are missing
if not GITHUB_TOKEN:
raise ValueError("GITHUB_TOKEN environment variable is required")
if not GITHUB_WEBHOOK_SECRET:
raise ValueError("GITHUB_WEBHOOK_SECRET environment variable is required")
def verify_signature(payload_body: bytes, signature_header: str | None) -> bool:
"""Verify GitHub webhook HMAC-SHA256 signature."""
if not signature_header:
return False
# GITHUB_WEBHOOK_SECRET is guaranteed non-None by startup check above.
hash_object = hmac.new(
GITHUB_WEBHOOK_SECRET.encode("utf-8"),
msg=payload_body,
digestmod=hashlib.sha256,
)
expected_signature = "sha256=" + hash_object.hexdigest()
return hmac.compare_digest(expected_signature, signature_header)
@app.route("/webhook", methods=["POST"])
def webhook():
signature = request.headers.get("X-Hub-Signature-256")
if not verify_signature(request.data, signature):
abort(403, "Invalid signature")
event = request.headers.get("X-GitHub-Event")
if event != "pull_request":
return jsonify({"status": "ignored", "reason": "not a PR event"}), 200
payload = request.json
action = payload.get("action")
if action not in ("opened", "synchronize"):
return jsonify({"status": "ignored", "reason": f"action: {action}"}), 200
# WARNING: Synchronous call. GitHub times out at 10s. Use a background worker in production.
handle_pull_request(payload)
return jsonify({"status": "processing"}), 200
if __name__ == "__main__":
app.run(host="0.0.0.0", port=5000, debug=False) # Never use debug=True with a public tunnel
The endpoint filters for opened and synchronize actions only, meaning reviews trigger on new PRs and when new commits are pushed to existing PRs. All other event types and actions are acknowledged but ignored.
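To exercise the endpoint without waiting for GitHub, you can forge a valid signature locally. A minimal sketch, assuming the same placeholder secret used in the .env file above:

```python
import hashlib
import hmac

# Hypothetical secret and body for illustration -- match the secret to
# whatever GITHUB_WEBHOOK_SECRET is set to locally.
secret = b"your_webhook_secret_here"
body = b'{"action": "opened"}'

signature = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
print(signature)  # value for the X-Hub-Signature-256 header on a test request
```

Sending that body with the printed header (plus X-GitHub-Event: pull_request) to the local /webhook endpoint should pass verification and reach the action filter.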
Fetching the Pull Request Diff
GitHub's API returns unified diffs when the appropriate Accept header is set. The diff parser extracts per-file changes into a structured format suitable for prompt construction.
def fetch_pr_diff(owner, repo, pr_number):
"""Fetch the unified diff for a pull request from GitHub."""
url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}"
headers = {
"Authorization": f"token {GITHUB_TOKEN}",
"Accept": "application/vnd.github.v3.diff",
}
response = requests.get(url, headers=headers, timeout=30, allow_redirects=False)
response.raise_for_status()
return response.text
def parse_diff(raw_diff):
"""Parse unified diff into a list of file-level change dicts."""
files = []
current_file = None
skip_extensions = (".lock", "-lock.json", ".png", ".jpg", ".svg", ".min.js", ".map")
    for line in raw_diff.split("\n"):
if line.startswith("diff --git"):
current_file = None # Reset on every new diff header.
parts = line.split(" b/", 1)
if len(parts) < 2:
continue # Malformed diff line; skip safely.
filepath = parts[1].strip()
if not filepath or filepath.endswith(skip_extensions):
continue
current_file = {"path": filepath, "hunks": ""}
files.append(current_file)
elif current_file is not None:
            current_file["hunks"] += line + "\n"
return files
The parser skips lock files (including package-lock.json via the -lock.json suffix), images, source maps, and minified JavaScript — files that produce noise rather than actionable review feedback.
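The skip check works because str.endswith accepts a tuple of suffixes, so a single call covers lock files and binary formats alike:

```python
skip_extensions = (".lock", "-lock.json", ".png", ".jpg", ".svg", ".min.js", ".map")

# endswith with a tuple tests every suffix in one call.
print("package-lock.json".endswith(skip_extensions))  # True  (matches "-lock.json")
print("bundle.min.js".endswith(skip_extensions))      # True
print("app.py".endswith(skip_extensions))             # False
```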
The GITHUB_TOKEN should be a personal access token or fine-grained token. For classic PATs, use public_repo for public repos or repo for private repos. For fine-grained PATs (recommended), grant contents: read and pull_requests: write on the target repository.
Integrating the Local LLM for Code Review
Crafting the Review Prompt
The prompt determines review quality. The system prompt establishes the model's role and output format. The user prompt embeds the actual diff context. Requesting JSON-structured output with specific fields (file, line, severity, comment) makes the response programmatically parseable.
SYSTEM_PROMPT = """You are an expert code reviewer. Analyze the provided code diff and identify:
- Bugs and logic errors
- Security vulnerabilities (injection, auth issues, data exposure)
- Performance problems
- Readability and style concerns
Return your findings as a JSON array. Each item must have:
- "file": the filename
- "line": the approximate line number in the diff
- "severity": one of "critical", "warning", or "suggestion"
- "comment": a concise explanation of the issue and how to fix it
If you find no issues, return an empty array: []
Only return valid JSON. No markdown, no explanation outside the JSON."""
def build_review_prompt(files):
"""Build the user prompt from parsed diff files."""
diff_text = ""
for f in files:
path = f.get("path") or "(unknown)"
        diff_text += f"### File: {path}\n{f['hunks']}\n"
# Truncate to stay within context window limits.
# Qwen2.5-Coder 7B supports up to 32,768 tokens. At roughly 3-4 characters
# per token, 12,000 characters is a conservative limit that leaves room for
# the system prompt and generated response.
max_chars = 12000
if len(diff_text) > max_chars:
# Truncate at a newline boundary to avoid splitting mid-line.
truncated = diff_text[:max_chars]
        last_newline = truncated.rfind("\n")
if last_newline > 0:
truncated = truncated[:last_newline]
        diff_text = truncated + "\n... [truncated]"
    return f"Review the following code changes:\n{diff_text}"
For larger diffs, a chunking strategy that processes files individually and aggregates results would produce better coverage.
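One possible shape for that strategy — a hypothetical helper, not part of the article's pipeline — batches parsed files under the same 12,000-character budget used by the prompt builder:

```python
def chunk_files(files, max_chars=12000):
    """Group parsed diff files into batches whose combined hunk text
    stays under max_chars, so each batch fits in a single prompt."""
    batches, current, size = [], [], 0
    for f in files:
        n = len(f["hunks"])
        # Start a new batch when adding this file would blow the budget.
        if current and size + n > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(f)
        size += n
    if current:
        batches.append(current)
    return batches
```

Each batch would then go through the review function separately, with the per-batch findings concatenated into one posted review.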
Calling Ollama's API
Ollama exposes a REST API at POST /api/generate for text generation. (Ollama also provides /api/chat, which takes role-based system/user messages, plus a separate OpenAI-compatible endpoint at /v1/chat/completions; the chat format may produce better instruction-following for some models.) Setting temperature low (0.1 to 0.3) reduces creative variation and produces more deterministic, focused review output. Since LLMs do not always return perfectly valid JSON, a regex fallback attempts to extract the JSON array from the response if standard parsing fails.
def get_llm_review(files):
"""Send diff to Ollama's Qwen2.5-Coder and parse structured review."""
prompt = build_review_prompt(files)
payload = {
"model": OLLAMA_MODEL,
"prompt": prompt,
"system": SYSTEM_PROMPT,
"stream": False,
"options": {"temperature": 0.2, "num_predict": 2048},
}
try:
response = requests.post(
f"{OLLAMA_URL}/api/generate",
json=payload,
timeout=120,
)
response.raise_for_status()
result = response.json().get("response", "")
# Attempt JSON parsing
try:
comments = json.loads(result)
except json.JSONDecodeError:
# Fallback: non-greedy match for first complete JSON array.
match = re.search(r"\[.*?\]", result, re.DOTALL)
if match:
try:
comments = json.loads(match.group())
except json.JSONDecodeError:
logging.warning("LLM response contained no parseable JSON array")
comments = []
else:
comments = []
return [c for c in comments if isinstance(c, dict) and "comment" in c]
except (requests.RequestException, json.JSONDecodeError) as e:
logging.exception("LLM review failed: %s", e)
return []
The 120-second timeout accommodates the 7B model running on CPU-only hardware, where inference is slow for longer diffs. On a GPU-accelerated system, you can cut this to 30-60 seconds. The num_predict setting of 2,048 tokens limits response length; for PRs with many findings across multiple files, consider increasing this to 4,096 to avoid truncated JSON responses that would fail parsing and silently return zero comments. The final list comprehension filters out any malformed entries the model may have produced.
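For the /api/chat alternative mentioned earlier, the same review request can be sketched as follows. This is an illustration of the request body shape, with stand-in prompt strings; verify field names against the Ollama version in use:

```python
# Stand-ins for the full SYSTEM_PROMPT and build_review_prompt output above.
SYSTEM_PROMPT = "You are an expert code reviewer."
user_prompt = "Review the following code changes: ..."

# Request body for POST {OLLAMA_URL}/api/chat
chat_payload = {
    "model": "qwen2.5-coder:7b",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ],
    "stream": False,
    "options": {"temperature": 0.2, "num_predict": 2048},
}
```

Note that the chat endpoint returns the generated text under message.content rather than the top-level response field, so the parsing code would need a one-line change.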
Posting Review Comments to GitHub
Using the GitHub Pull Request Review API
GitHub's Reviews API allows posting multiple comments as a single review, which produces a cleaner experience than individual comment posts. The review submits as COMMENT (informational) or REQUEST_CHANGES (blocking) depending on whether any critical severity issues were found.
Important: GitHub's position field in review comments must be a 1-based offset within the diff hunk, not an absolute file line number. Passing a file line number will cause GitHub to return HTTP 422 errors. This implementation uses the line and side fields available with the application/vnd.github+json Accept header as a more reliable alternative. Note that the LLM returns approximate line numbers from the diff, so exact positioning requires a diff-hunk offset mapping step that this implementation does not include. Some comments may still fail to attach to specific lines; the review body will always post successfully.
def post_review(owner, repo, pr_number, comments, head_sha):
"""Post a batched code review to the GitHub PR."""
# Determine review event based on severity
has_critical = any(c.get("severity") == "critical" for c in comments)
event = "REQUEST_CHANGES" if has_critical else "COMMENT"
# Map comments to GitHub's review comment format.
# Uses the "line" + "side" fields instead of "position" to avoid
# needing a diff-hunk offset mapping. Some comments may receive
# 422 errors from GitHub if the line number doesn't exist in the diff.
review_comments = []
for c in comments:
try:
line_number = max(1, int(c.get("line", 1)))
except (ValueError, TypeError):
line_number = 1
review_comment = {
"path": c.get("file", ""),
"line": line_number,
"side": "RIGHT",
"body": f"**[{c.get('severity', 'suggestion').upper()}]** {c['comment']}",
}
review_comments.append(review_comment)
body_text = (
"No issues found."
if not review_comments
else f"Automated AI Code Review — {len(review_comments)} finding"
f"{'s' if len(review_comments) != 1 else ''}"
)
review_body = {
"commit_id": head_sha,
"body": body_text,
"event": "COMMENT" if not review_comments else event,
"comments": review_comments,
}
url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/reviews"
headers = {
"Authorization": f"token {GITHUB_TOKEN}",
"Accept": "application/vnd.github+json",
}
response = requests.post(
url, json=review_body, headers=headers,
timeout=30, allow_redirects=False,
)
response.raise_for_status()
logging.info("Review posted: HTTP %d for PR #%d", response.status_code, pr_number)
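The diff-hunk mapping step this implementation omits can be sketched as a hypothetical helper that walks hunk headers to recover new-file line numbers for added lines. Those numbers could then be used to validate or snap the model's approximate line values before posting:

```python
import re

# Unified-diff hunk header: @@ -old_start,old_count +new_start,new_count @@
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def new_file_lines(hunk_text):
    """Map each added line in a unified-diff fragment to its new-file line number."""
    mapping = []
    new_line = None  # stays None until the first @@ header is seen
    for raw in hunk_text.splitlines():
        m = HUNK_RE.match(raw)
        if m:
            new_line = int(m.group(1))
            continue
        if new_line is None:
            continue  # skips ---/+++/index lines before the first hunk
        if raw.startswith("+"):
            mapping.append((new_line, raw[1:]))
            new_line += 1
        elif raw.startswith("-"):
            pass  # removed lines do not advance the new-file counter
        else:
            new_line += 1  # context line
    return mapping
```

A comment whose "line" value does not appear in this mapping is a candidate for posting in the review body instead of inline, avoiding 422 errors.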
Putting It All Together
The orchestration function chains every component: extracting metadata from the webhook payload, fetching the diff, running LLM analysis, and posting results.
def handle_pull_request(payload):
"""Orchestrate the full review pipeline from webhook to posted review."""
try:
pr = payload["pull_request"]
repo = payload["repository"]
owner = repo["owner"]["login"]
repo_name = repo["name"]
pr_number = pr["number"]
head_sha = pr["head"]["sha"]
except (KeyError, TypeError) as exc:
logging.error("Malformed webhook payload: missing key %s", exc)
return
logging.info("Reviewing PR #%d in %s/%s", pr_number, owner, repo_name)
try:
raw_diff = fetch_pr_diff(owner, repo_name, pr_number)
except requests.RequestException as exc:
logging.error("Failed to fetch diff for PR #%d: %s", pr_number, exc)
return
files = parse_diff(raw_diff)
if not files:
logging.info("No reviewable files in PR #%d diff", pr_number)
return
    comments = get_llm_review(files)
    try:
        post_review(owner, repo_name, pr_number, comments, head_sha)
    except requests.RequestException as exc:
        logging.error("Failed to post review for PR #%d: %s", pr_number, exc)
        return
    logging.info("Posted %d review comment(s) for PR #%d", len(comments), pr_number)
This function is called directly from the webhook endpoint. In production, dispatch it to a background worker to avoid blocking the HTTP response — GitHub marks webhook deliveries as failed after 10 seconds and may deactivate the endpoint after repeated timeouts.
Testing and Running Locally
Exposing Your Local Server with ngrok
GitHub needs a publicly reachable URL to deliver webhooks. Running the Flask server locally and tunneling through ngrok bridges that gap during development. The free ngrok tier requires account signup and authentication before use.
# Authenticate ngrok (required once after installation)
ngrok config add-authtoken <your_token>
# Terminal 1: Start the Flask server
python app.py
# Terminal 2: Start ngrok tunnel
ngrok http 5000
Configure the webhook in the GitHub repository settings with these values:
- Payload URL: The ngrok HTTPS URL followed by /webhook (e.g., https://abc123.ngrok-free.app/webhook)
- Content type: application/json
- Secret: The same value set in the .env file as GITHUB_WEBHOOK_SECRET
- Events: Select "Pull requests" only
Note that ngrok free-tier URLs are ephemeral — you will need to update the webhook URL in GitHub each time you restart ngrok.
Triggering a Test Review
Create a branch with intentional issues: an SQL query built with string concatenation, an unused variable, a function missing error handling. Open a pull request against the main branch and watch the Flask server logs. Within a minute or two (depending on hardware), the review should appear on the PR as a set of inline comments.
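A hypothetical file for the test branch might look like this, with each flaw mapping to a category the system prompt asks for:

```python
# example_bugs.py -- intentionally flawed code for the test PR
import sqlite3

def get_user(conn, username):
    unused = 42  # unused variable: style finding
    # SQL built by string concatenation: injection finding
    query = "SELECT * FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchone()  # no error handling: robustness finding
```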
Common debugging targets include signature mismatch errors (usually a whitespace issue in the secret), model timeouts on CPU-only systems (increase the timeout or switch to a smaller quantized model), and diff parsing failures on binary files that slipped through the filter. Check GitHub's webhook delivery logs (Settings → Webhooks → Recent Deliveries) to verify delivery status; expect "failed" status on deliveries that take longer than 10 seconds due to the synchronous implementation.
Adapting for GitLab
Key Differences for GitLab Webhooks
GitLab's webhook system uses a simpler authentication model. Instead of HMAC-SHA256 signature verification, GitLab sends an X-Gitlab-Token header containing the raw secret token.
To adapt this pipeline for GitLab:
- Replace HMAC signature verification with a constant-time comparison: use hmac.compare_digest(request.headers.get('X-Gitlab-Token', ''), GITLAB_TOKEN) to prevent timing attacks. Do not use plain == comparison for secret tokens.
- Parse the Merge Request event payload, which uses different field names (object_attributes.iid for the MR number, project.path_with_namespace for the repo identifier).
- Post review comments through GitLab's Discussions API (POST /api/v4/projects/:id/merge_requests/:iid/discussions), as GitLab does not have an equivalent batched review concept.
- Fetch diffs from GitLab's Merge Request Changes API (GET /api/v4/projects/:id/merge_requests/:iid/changes) instead of the GitHub diff endpoint.
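The first step above can be sketched as a small helper (hypothetical names; the expected token would come from the environment like the other secrets in this article):

```python
import hmac

def verify_gitlab_token(header_value, expected_token):
    """Constant-time check of the X-Gitlab-Token header value."""
    # hmac.compare_digest avoids the timing side channel of plain ==.
    return hmac.compare_digest(header_value or "", expected_token)
```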
Tips for Production Use
Running the Flask development server is fine for testing, but production deployments should use a proper WSGI server behind a systemd service or Docker container. Gunicorn with two to four workers handles concurrent webhook deliveries without dropping requests. Add gunicorn>=22.0.0 to your requirements.txt for production use.
Adding a job queue with Redis and RQ (or Celery) decouples webhook receipt from LLM inference. This is essential for large PRs where inference takes minutes, well past GitHub's 10-second webhook timeout expectation.
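A hedged sketch of that decoupling with RQ — it assumes Redis on localhost, a worker started with rq worker reviews, and enqueue_review is a hypothetical name, not part of the code above:

```python
def enqueue_review(payload):
    """Queue the review job instead of running it in the request handler."""
    # Imports are local so this sketch loads even without redis/rq installed.
    from redis import Redis
    from rq import Queue

    queue = Queue("reviews", connection=Redis())
    # Reference the function by dotted path so the worker imports it itself.
    queue.enqueue("app.handle_pull_request", payload, job_timeout=600)
```

The webhook handler would then call enqueue_review(payload) and return immediately, keeping the response well under GitHub's 10-second limit.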
Keep Ollama running as a persistent service rather than starting it per-request. Model loading is the most expensive operation; once loaded, subsequent inferences start immediately. On Linux, a systemd unit for Ollama ensures it survives reboots.
Adding project-specific coding standards to the system prompt, such as "this project uses SQLAlchemy ORM exclusively; flag any raw SQL queries," dramatically improves review relevance.
Rate limiting prevents abuse on busy repositories. Consider using Flask-Limiter or nginx-level rate limiting on the /webhook endpoint. Ignoring PRs authored by bots (Dependabot, Renovate) avoids wasting inference cycles on automated dependency bumps.
Monitoring review quality over time by logging false-positive rates and developer feedback on comments helps refine prompts iteratively.
What Comes Next
This pipeline delivers a fully self-hosted, privacy-preserving automated code review system running entirely on local hardware. Every component, from webhook verification to LLM inference to GitHub comment posting, runs without calling any external AI service.
Experiment with model sizes to find the right speed and quality balance for your team's hardware. Customize prompts per repository. Layer in async processing for production scale. The complete source code from this article consolidates into a single app.py file you can share as a GitHub Gist or repository for team adoption.

