
How to Build a Local AI Coding Assistant with VS Code, Ollama, and Continue

  1. Install Ollama on your platform (macOS, Windows, or Linux) and start the background service.
  2. Pull the Qwen2.5-Coder 7B model for chat and the 1.5B model for fast autocomplete.
  3. Verify Ollama is running by checking the local API at http://localhost:11434.
  4. Install the Continue extension from the VS Code Extensions marketplace.
  5. Configure Continue's config.json with both models pointing to Ollama's local endpoint.
  6. Test the chat panel with a code-generation prompt and confirm inline tab autocomplete works.
  7. Tune debounce delay, context length, and model size to match your hardware for optimal performance.

Why Go Local with Your AI Coding Assistant?

Every keystroke sent to a cloud AI coding assistant travels through external servers, infrastructure outside the developer's control. GitHub Copilot and similar services charge $10 to $19 per month ($120–$228 per year) while quietly ingesting proprietary codebases into pipelines governed by someone else's terms of service. Vendor lock-in compounds the issue: when the provider changes pricing, alters model behavior, or suffers an outage, developers absorb the impact with no recourse.

A local AI coding assistant eliminates all three problems. Code never leaves the machine. There is no subscription. The setup works offline, on an airplane, in an air-gapped environment, or anywhere else a laptop goes.

This article walks through building a fully local Copilot alternative using VS Code, Ollama, the Continue extension, and Qwen2.5-Coder. The stack provides both chat-based code generation and real-time tab autocomplete, all running on consumer hardware. It targets beginners and takes roughly 30 minutes from a fresh start to a working setup.

Included along the way are ready-to-use configuration files, performance benchmarks sourced from community testing, and a cost comparison table to support an informed decision about whether this approach fits a given workflow.

This guide was tested with Continue v0.9.x and Ollama v0.3.x. If you are using newer versions, refer to the respective documentation for any configuration changes.

What You'll Need: Prerequisites and System Requirements

Hardware Requirements

The minimum viable setup requires 8GB of RAM and any modern CPU from the last five years, whether Intel, AMD, or Apple Silicon. This will run smaller models (1.5B parameters) adequately for basic autocomplete.

For a recommended experience with the 7B parameter model, 16GB or more of RAM and a discrete GPU with at least 6GB of VRAM will produce noticeably faster inference. NVIDIA GPUs from the RTX 3060 onward handle this well.

Apple Silicon deserves a specific mention. The M-series chips use unified memory architecture, meaning system RAM is shared directly with the GPU cores. An M1 with 16GB of unified memory performs competitively with dedicated GPU setups on the PC side (see the benchmark table below for specific comparisons), making MacBooks particularly well-suited for local LLM inference.

Software Prerequisites

The only software prerequisites are VS Code (latest stable release) and basic comfort with a terminal. No prior experience with AI, machine learning, or model deployment is required. Everything installs through standard package managers and extension marketplaces.

Step 1: Install and Configure Ollama

Installing Ollama on Your Platform

Ollama is a lightweight runtime that manages downloading, configuring, and serving large language models locally. It exposes a local API that other tools, including Continue, can connect to.

macOS (using Homebrew or direct download):

brew install ollama

Alternatively, download the macOS installer from ollama.com/download.

Windows:

Download the installer from ollama.com/download/windows. Run the .exe and follow the prompts. Ollama runs as a background service after installation.

Linux:

Security note: Do not pipe remote scripts directly to your shell. Instead, download first, verify the checksum, then execute:

# Download the install script
curl --max-time 60 -fsSL https://ollama.com/install.sh -o install.sh

# Verify checksum against value published at https://ollama.com/download/linux
# Replace <EXPECTED_SHA256> with the value from the official release page
echo "<EXPECTED_SHA256>  install.sh" | sha256sum --check

# Only execute after checksum passes
sh install.sh

Linux GPU Setup: For NVIDIA GPUs, install CUDA 12.x drivers before proceeding (nvidia-smi should return a driver version). For AMD GPUs, ROCm 5.x or later is required. Without these drivers, Ollama will silently fall back to CPU-only inference, and you will not see the GPU-accelerated performance listed in the benchmarks below. Verify GPU detection after starting Ollama by checking log output for references to GPU layers.

After installation on any platform, verify it worked:

ollama --version

This should print the installed version number (v0.3.x or later recommended). Then start the Ollama service (on macOS and Linux it may auto-start; on Windows it runs as a system service):

ollama serve

macOS GUI installer note: If you installed Ollama via the .app download rather than Homebrew, do not run ollama serve manually — the app manages the server process automatically. Quit and reopen the Ollama menu bar app instead if you need to restart the service. Running ollama serve while the app is active causes a port conflict.

Leave this terminal window running (for CLI installs), or confirm the service is active in the background.

Pulling the Qwen2.5-Coder Model

Qwen2.5-Coder, developed by Alibaba's Qwen team, scores well on HumanEval relative to its size, placing at the top of 7B-class open-weight coding models at the time of writing. It handles code generation, explanation, refactoring, and completion across dozens of programming languages.

Qwen2.5-Coder ships in several sizes. Ollama downloads Q4_K_M quantization by default, which balances quality against memory use. Choosing the right size depends on available hardware:

  • 1.5B: Lightweight and fast. The Q4 variant uses roughly 1GB of RAM/VRAM. Suitable for autocomplete on machines with 8GB RAM. Lower quality for complex reasoning.
  • 7B (start here): The recommended balance point. Strong code generation quality. CPU-only: the Q4 variant occupies approximately 4.5GB of system RAM; 16GB total ensures OS headroom. GPU: 6GB VRAM is the minimum; 8GB recommended for stable operation.
  • 14B: Best quality output, but requires 24GB+ RAM or a GPU with 12GB+ VRAM (Q4 quantization assumed).
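
The sizing guidance above can be condensed into a small heuristic. This is an illustrative sketch, not an official Ollama utility; the thresholds simply restate the RAM/VRAM figures from the list above, and the function name is hypothetical.

```typescript
// Illustrative heuristic: map available memory to a Qwen2.5-Coder tag.
// Thresholds restate the hardware guidance above; adjust for your machine.
function pickModel(systemRamGb: number, gpuVramGb: number): string {
  // 12GB+ VRAM (or 24GB+ RAM) can host the 14B model.
  if (gpuVramGb >= 12 || systemRamGb >= 24) return "qwen2.5-coder:14b";
  // 6GB+ VRAM or 16GB+ RAM comfortably fits the recommended 7B model.
  if (gpuVramGb >= 6 || systemRamGb >= 16) return "qwen2.5-coder:7b";
  // Everything else: the lightweight 1.5B model.
  return "qwen2.5-coder:1.5b";
}

console.log(pickModel(16, 0)); // a 16GB CPU-only machine gets the 7B model
```

The heuristic picks the largest model that fits; in practice you might prefer the 7B even when the 14B fits, trading quality for speed.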

Pull the recommended 7B model:

ollama pull qwen2.5-coder:7b

This download is approximately 4.7GB (Q4_K_M quantization). Once complete, run a quick smoke test to confirm the model works:

ollama run qwen2.5-coder:7b "Write a Python function to reverse a string"

The model should respond with a working Python function within a few seconds.

Verifying Ollama Is Running

Continue communicates with Ollama through its local REST API. Confirm the API is accessible by opening a browser or running:

curl --max-time 5 http://localhost:11434

A response of Ollama is running confirms everything is operational.

Quick troubleshooting: If the port is not responding, check whether another process is using port 11434. On Linux, ss -tlnp | grep 11434 is widely available; on macOS, use lsof -i :11434; on Windows, use netstat -ano | findstr 11434. If the service is not starting, try restarting it with ollama serve in a new terminal window. macOS GUI app users: do not run ollama serve manually — quit and reopen the Ollama app from the menu bar instead.
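
The same port probe can be scripted from Node. This is a hypothetical helper, not part of Ollama or Continue; it mirrors the manual ss/lsof/netstat checks described above.

```typescript
import { createConnection } from "node:net";

// Hypothetical helper: check whether anything is listening on a TCP port.
// Mirrors the manual ss / lsof / netstat checks described above.
function isPortOpen(port: number, host = "127.0.0.1", timeoutMs = 1000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = createConnection({ port, host });
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => { socket.destroy(); resolve(true); }); // something answered
    socket.once("timeout", () => { socket.destroy(); resolve(false); }); // no response in time
    socket.once("error", () => resolve(false)); // connection refused, etc.
  });
}

isPortOpen(11434).then((open) => {
  console.log(open ? "Ollama port is open" : "Nothing listening on 11434");
});
```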

Step 2: Install the Continue Extension in VS Code

Finding and Installing Continue

Open VS Code and navigate to the Extensions sidebar (Ctrl+Shift+X or Cmd+Shift+X). Search for "Continue" and install the extension published by Continue.dev. It is free and open-source under the Apache 2.0 license.

This guide was tested with Continue v0.9.x. Verify your installed version in the VS Code Extensions sidebar. If the configuration format has changed in a newer release, refer to continue.dev/docs for the current schema.

Continue functions as a model-agnostic AI code assistant. Unlike extensions tied to a single provider, Continue connects to any LLM backend: local models through Ollama, cloud APIs, or self-hosted inference servers. This flexibility is precisely why it suits a local-first setup. Its active development cadence, with frequent releases and community contributions, also means that support for new models and features arrives quickly.

Initial Extension Tour

After installation, a Continue icon appears in the VS Code sidebar. Click it to open the Continue panel, which contains two primary interfaces:

  • Chat interface: A conversational panel for asking questions, generating code, and requesting explanations. This appears in the sidebar.
  • Tab autocomplete: Inline completions that appear as ghost text while typing, similar to GitHub Copilot's inline suggestions.

At this point, Continue is installed but not yet configured to use Ollama. The chat panel may show a default configuration prompt or an error. Configuration is the next step.

Step 3: Connect Continue to Ollama

Pulling Both Models

Before writing the configuration, pull both models that the config will reference. This prevents "model not found" errors when Continue first loads.

The 7B model should already be downloaded from Step 1. Now pull the 1.5B model used for autocomplete:

ollama pull qwen2.5-coder:1.5b

Verify both models are available:

ollama list

You should see rows for both qwen2.5-coder:7b and qwen2.5-coder:1.5b.

Editing the Continue Configuration File

Continue stores its settings in a config.json file. The file location varies by operating system:

  • macOS: ~/.continue/config.json
  • Linux: ~/.continue/config.json
  • Windows: %USERPROFILE%\.continue\config.json

Open this file in VS Code or any text editor. Replace its contents with the following complete configuration:

{
  "models": [
    {
      "title": "Qwen2.5-Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434",
      "contextLength": 8192,
      "maxPromptTokens": 2048
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b",
    "apiBase": "http://localhost:11434",
    "contextLength": 4096,
    "maxPromptTokens": 1024
  },
  "tabAutocompleteOptions": {
    "debounceDelay": 300,
    "multilineCompletions": "always"
  },
  "allowAnonymousTelemetry": false
}

Configuration notes:

apiBase: "http://localhost:11434" explicitly sets the Ollama API endpoint. This is required for non-default Ollama installations and recommended for all setups to avoid silent connection failures.

contextLength: 8192 targets broad hardware compatibility with a conservative default. Qwen2.5-Coder 7B supports up to 128K context natively. If you work with large files and have sufficient RAM, values up to 32768 are reasonable — increase gradually and monitor performance.

Each model object includes maxPromptTokens to control the token budget sent per request: 2048 for the chat model and 1024 for the autocomplete model, balancing context quality against latency.

allowAnonymousTelemetry: false disables the usage telemetry that Continue otherwise sends to Continue.dev. Recommended for privacy-sensitive environments.

This configuration uses a dual-model strategy, which is key to a responsive experience. The chat model (qwen2.5-coder:7b) handles conversational interactions where a few seconds of latency is acceptable. The autocomplete model (qwen2.5-coder:1.5b) handles real-time inline suggestions where speed is critical.

Tip: Validate the JSON syntax before saving to avoid silent configuration failures:

python3 -m json.tool ~/.continue/config.json

If this prints reformatted JSON with no error, the syntax is valid.
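
Beyond raw syntax, it can help to sanity-check that the keys Continue expects are actually present. The sketch below is a hypothetical validator, not part of Continue; the key names mirror the configuration shown earlier.

```typescript
// Hypothetical sanity check for a Continue config.json (not an official tool).
// The key names mirror the configuration shown above.
type ModelEntry = { provider?: string; model?: string; apiBase?: string };
type ContinueConfig = { models?: ModelEntry[]; tabAutocompleteModel?: ModelEntry };

function findConfigProblems(raw: string): string[] {
  const problems: string[] = [];
  let cfg: ContinueConfig;
  try {
    cfg = JSON.parse(raw); // a parse failure is the most common silent breakage
  } catch (e) {
    return [`invalid JSON: ${(e as Error).message}`];
  }
  if (!Array.isArray(cfg.models) || cfg.models.length === 0) {
    problems.push("'models' must be a non-empty array");
  }
  const entries = [...(cfg.models ?? []), cfg.tabAutocompleteModel]
    .filter((m): m is ModelEntry => Boolean(m));
  for (const m of entries) {
    if (m.provider !== "ollama") problems.push(`unexpected provider: ${m.provider}`);
    if (!m.model) problems.push("model entry is missing 'model'");
  }
  return problems;
}
```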

Configuring Tab Autocomplete

The tabAutocompleteOptions section controls how inline completions behave:

  • debounceDelay: Set to 300 milliseconds, meaning Continue waits 300ms after the last keystroke before requesting a completion. Lower values (150ms) feel more responsive but generate more inference requests, increasing CPU/GPU load. Higher values (500ms) reduce load but feel sluggish.
  • multilineCompletions: Set to "always" to allow multi-line suggestions for function bodies, loops, and blocks. Set to "never" if single-line completions are preferred.

The maxPromptTokens field on the tabAutocompleteModel object limits the context sent to the model for each autocomplete request. 1024 tokens is a practical default; larger values provide more context at the cost of latency.

To disable tab autocomplete while keeping chat functional, remove the tabAutocompleteModel key from the configuration or set "disable": true inside tabAutocompleteOptions.
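
To build intuition for what debounceDelay does, here is a plain debounce implementation of the same idea: the completion request fires only after typing pauses. This is an illustrative sketch of the concept, not Continue's internal implementation.

```typescript
// Illustrative debounce: the wrapped function runs only after `delayMs`
// of silence, mirroring how Continue delays autocomplete requests.
function debounce<T extends unknown[]>(fn: (...args: T) => void, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    if (timer !== undefined) clearTimeout(timer); // each keystroke resets the clock
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Pretend each call is a keystroke; only one completion request fires.
let requests = 0;
const requestCompletion = debounce(() => { requests += 1; }, 300);
requestCompletion();
requestCompletion();
requestCompletion(); // still 0 requests until 300ms of quiet
```

Raising `delayMs` here is exactly the trade described above: fewer inference requests, but suggestions arrive later.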

Testing the Connection

Open the Continue chat panel in the VS Code sidebar. Type a test prompt:

Write a JavaScript function that debounces another function with a configurable delay.

The response should stream from the Qwen2.5-Coder 7B model within a few seconds.

Next, open any code file (or create a new .js or .py file) and start typing a function signature. Ghost text suggestions should appear after the debounce delay. Press Tab to accept a suggestion.

Common errors and fixes:

"Model not found" means the model name in config.json does not match what you pulled. Run ollama list to check installed models, including the tag such as :7b or :1.5b.

"Connection refused" means Ollama is not running. Run ollama serve (CLI installs) or open the Ollama app (macOS GUI installs). Also verify that apiBase in your config.json matches the address Ollama is listening on (default: http://localhost:11434).

If the extension is not responding at all, a JSON syntax error is likely preventing Continue from loading the configuration. Validate the file with python3 -m json.tool ~/.continue/config.json.

Step 4: Using Your Local AI Assistant Day-to-Day

Chat-Based Code Generation

Open the Continue chat panel and start a multi-turn conversation about code. Type questions, paste code snippets, and request modifications iteratively. A typical workflow might proceed as: ask the model to explain a function, then request a refactored version, then ask it to generate unit tests for the refactored code.

Typing @filename references a specific file in the workspace, allowing prompts like "Refactor @utils.ts to use async/await." Typing @codebase enables indexed search across your workspace (requires indexing to complete first). Type @ in the chat input to see all available context providers in your installed Continue version, as availability varies by release.

For example, given a function in the editor:

async function fetchUserData(userId) {
  const res = await fetch(`/api/users/${encodeURIComponent(userId)}`);
  if (!res.ok) {
    throw new Error(`HTTP error ${res.status}: ${res.statusText}`);
  }
  return res.json();
}

Typing in the Continue chat panel: "Add proper error handling and TypeScript types to this function" will produce a refactored version with try/catch blocks, typed interfaces, and explicit error propagation. The quality is comparable to what cloud models produce for single-file, focused tasks like this.
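
For reference, here is a hedged sketch of the kind of refactor such a prompt might yield. The `UserData` interface and `HttpError` class names are illustrative, not the model's actual output:

```typescript
// Illustrative sketch of a refactor the model might produce.
// UserData and HttpError are hypothetical names, not a real API.
interface UserData {
  id: string;
  name?: string;
  [key: string]: unknown;
}

class HttpError extends Error {
  constructor(public status: number, statusText: string) {
    super(`HTTP error ${status}: ${statusText}`);
    this.name = "HttpError";
  }
}

async function fetchUserData(userId: string): Promise<UserData> {
  const res = await fetch(`/api/users/${encodeURIComponent(userId)}`);
  if (!res.ok) {
    // Throw a typed error so callers can inspect the status code.
    throw new HttpError(res.status, res.statusText);
  }
  return (await res.json()) as UserData;
}
```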

Inline Code Completion (Tab Autocomplete)

As you type, the 1.5B autocomplete model generates ghost text suggestions in real time. These appear as dimmed text ahead of the cursor. Press Tab to accept the full suggestion, or continue typing to refine what the model offers.

Tips for getting better completions: use descriptive variable and function names, write a brief comment above a function describing its purpose before writing the signature, and keep files focused. The model uses surrounding context to predict the next tokens, so meaningful code structure directly improves suggestion quality.

Editing Code with Inline Commands

Press Cmd+I (macOS) or Ctrl+I (Windows/Linux) to open Continue's inline editing interface. This allows highlighting a block of code and providing a natural language instruction to modify it. For example, selecting a fetch call and typing "Add error handling with retry logic" will produce an edited version of the selected code inline, with a diff view to accept or reject the changes.

This interaction model is faster than the chat panel for targeted edits because it operates directly in the editor context without switching focus.

Performance Benchmarks and Cost Comparison

Response Speed Benchmarks

Inference speed varies significantly across hardware. The following table summarizes tokens-per-second (tok/s) performance for the Qwen2.5-Coder 7B model based on community benchmarks (Q4_K_M quantization assumed, reported late 2024; results vary by OS, Ollama version, and background load). Run ollama run qwen2.5-coder:7b and observe the reported eval rate in the terminal output to measure your own hardware:

| Hardware | RAM/VRAM | Chat (tok/s) | Autocomplete Latency |
| --- | --- | --- | --- |
| Apple M1 | 8GB unified | ~15 tok/s | 400–600ms |
| Apple M2 | 16GB unified | ~25 tok/s | 200–350ms |
| Apple M3 Pro | 36GB unified | ~40 tok/s | 100–200ms |
| Intel/AMD (no GPU) | 16GB RAM | ~5–8 tok/s | 800–1200ms |
| NVIDIA RTX 3060 | 12GB VRAM | ~30 tok/s | 150–300ms |
| NVIDIA RTX 4070 | 12GB VRAM | ~45 tok/s | 80–150ms |

Note on the M1 8GB row: Running the 7B model on 8GB unified memory forces significant memory pressure. Performance degrades further under multitasking. The 1.5B model is the practical recommendation for 8GB systems; the 7B figure is included to show that it will run, but not comfortably.

On Apple M2 and above, or with an RTX 3060 or better, autocomplete latency drops below 350ms, which falls under the perceptible-lag threshold for most typists on single-line suggestions. CPU-only inference on Intel/AMD is noticeably slower, with lag that disrupts typing flow. Dropping to the 1.5B model for autocomplete (as configured above) substantially mitigates this on lower-end hardware.
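
These throughput numbers translate directly into perceived latency. A rough back-of-envelope model, ignoring prompt-processing time (which adds further overhead in practice):

```typescript
// Rough latency model: time to generate a suggestion of `tokens` length
// at a given generation speed, plus the configured debounce delay.
// Ignores prompt-processing time, so real latency is somewhat higher.
function suggestionLatencyMs(tokens: number, tokensPerSec: number, debounceMs: number): number {
  return debounceMs + (tokens / tokensPerSec) * 1000;
}

// A 10-token single-line completion at 40 tok/s with a 300ms debounce:
console.log(suggestionLatencyMs(10, 40, 300)); // 550
```

This is why the small 1.5B autocomplete model matters: doubling generation speed cuts the per-suggestion wait roughly in half on the same hardware.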

Quality Comparison vs Cloud Alternatives

For single-function completions, boilerplate generation, simple refactors, and code explanation, Qwen2.5-Coder 7B produces output close to GitHub Copilot's. In the author's testing, it handled Python and TypeScript single-file tasks (writing utility functions, adding type annotations, generating docstrings) with few meaningful differences from Copilot. JavaScript, Go, and Rust completions also worked well for straightforward function-level tasks. Quality dropped noticeably on multi-step reasoning: generating a full Express middleware chain with error handling, for example, required more manual correction than Copilot would.

Cloud models still outperform local ones at complex multi-file reasoning, very large context windows (100K+ tokens), and tasks requiring up-to-date knowledge of recently released APIs or frameworks. A 7B parameter model running locally will not match GPT-4-class models on architectural planning or cross-repository analysis.

Cost Comparison Table

| Service | Monthly Cost | 1-Year Cost | 3-Year Cost |
| --- | --- | --- | --- |
| GitHub Copilot Individual | $10/mo | $120 | $360 |
| GitHub Copilot Business | $19/mo | $228 | $684 |
| Cursor Pro | $20/mo | $240 | $720 |
| Codeium (Free Tier)* | $0/mo | $0 | $0 |
| Local Ollama + Continue | $0/mo | $0 | $0 |

*Codeium Free Tier is cloud-based and transmits code to Codeium's servers. Listed for cost reference only; it does not provide the privacy guarantees of the local setup described in this article.

Electricity costs for local inference are modest. Running a 7B model during active coding hours (approximately 6 to 8 hours per day) adds about $2 per month on a laptop assuming 50–100W average system draw at $0.12–0.15/kWh. GPU-accelerated desktops drawing 150–200W under load add $5–$9 per month. Over three years, the local setup saves $357 to $717 compared to paid alternatives (before electricity). For teams, the savings multiply per seat.
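
The electricity figures above come from straightforward arithmetic; a sketch of the calculation, using the article's example values for wattage, hours, and rate:

```typescript
// Monthly electricity cost for local inference:
// watts -> kWh per day -> kWh per month -> dollars.
function monthlyCostUsd(avgWatts: number, hoursPerDay: number, usdPerKwh: number): number {
  const kwhPerMonth = (avgWatts / 1000) * hoursPerDay * 30;
  return kwhPerMonth * usdPerKwh;
}

// Laptop: ~75W average draw, 7 hours/day, $0.135/kWh
console.log(monthlyCostUsd(75, 7, 0.135).toFixed(2)); // "2.13"
```

Plugging in 150–200W and $0.12–0.15/kWh reproduces the $5–$9/month range quoted for GPU desktops.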

The honest recommendation: the local stack works as a primary tool for most development tasks. For occasional complex tasks that exceed local model capabilities, a hybrid approach using a pay-as-you-go cloud API alongside the local setup gives you the best trade-off between cost, privacy, and output quality.

Troubleshooting Common Issues

  • If Ollama is not responding, run ollama serve in a terminal (CLI installs only — macOS GUI app users should quit and reopen the app). Check that port 11434 is not blocked by a firewall or occupied by another process.
  • Slow completions: Switch autocomplete to the 1.5B model. Close memory-heavy applications. Check available RAM with Activity Monitor or Task Manager.
  • Continue won't connect? Validate config.json syntax with python3 -m json.tool ~/.continue/config.json. Confirm the model name matches ollama list output exactly, including the tag (e.g., :7b). Verify that apiBase is set correctly in your config.
  • High memory usage: Ollama downloads Q4_K_M quantization by default for most models, which provides good memory efficiency. To verify your model's quantization, run ollama show qwen2.5-coder:7b and look for quantization details (e.g., Q4_K_M) in the output. Q4 quantization typically reduces memory footprint by roughly 50% compared to FP16. To explore other quantization variants, check available tags at ollama.com/library/qwen2.5-coder.
  • If the model hallucinates or produces low-quality output, provide more context using @ file references. Write specific, constrained prompts rather than open-ended requests.

Your Private AI Coding Setup Is Ready

You now have a fully private, subscription-free, offline-capable AI coding assistant. Code stays on the machine, there is no recurring cost, and the setup works without an internet connection.

For developers looking to expand this foundation, natural next steps include experimenting with alternative coding models such as CodeLlama or DeepSeek-Coder-V2 through Ollama, exploring Continue's custom slash commands for repetitive workflows, and adding context providers that index project documentation for more informed model responses.

Local models keep getting better at a rapid clip: 7B models in late 2024 match what required 30B+ parameters in early 2023. The infrastructure you built in this guide, Ollama plus Continue, will run whatever comes next.

SitePoint Team

Sharing our passion for building incredible internet things.
