How to Set Up Local RAG for Private Document AI
- Install Ollama and pull a local LLM such as Mistral or Phi-3 to handle inference without cloud APIs.
- Create a Python virtual environment and install LangChain, ChromaDB, Sentence Transformers, and document loaders.
- Load your documents (PDF, Markdown, DOCX) into LangChain using directory or file-specific loaders.
- Chunk documents into overlapping segments using RecursiveCharacterTextSplitter for effective retrieval.
- Generate vector embeddings for each chunk locally with the all-MiniLM-L6-v2 Sentence Transformers model.
- Store embeddings in a persistent ChromaDB collection so the index survives restarts without re-embedding.
- Build a RetrievalQA chain that retrieves relevant chunks and passes them as grounded context to the local LLM.
- Query your documents from the command line and verify answers cite specific source passages.
Table of Contents
- Why Go Local with RAG?
- Architecture Overview and Component Selection
- Prerequisites and Environment Setup
- Ingesting and Chunking Your Documents
- Generating Embeddings and Storing Vectors Locally
- Ingestion Script
- Querying Your Documents: The RAG Chain
- Testing, Tuning, and Troubleshooting
- Security and Privacy Considerations
- Where to Go Next
Why Go Local with RAG?
Retrieval-Augmented Generation, or RAG, is a technique that enhances large language model responses by grounding them in external documents retrieved at query time, rather than relying solely on the model's training data. A local RAG setup enables private document AI that keeps every byte of data on the machine where it belongs. For teams dealing with proprietary codebases, legal contracts, medical records, or internal policy documents, sending that material to a third-party API introduces real risks: data leakage through cloud provider logging, regulatory non-compliance with frameworks like GDPR or HIPAA, and vendor lock-in that ties critical infrastructure to a single provider's pricing and availability decisions.
This article walks through building a fully local document question-answering system. The stack consists of ChromaDB for vector storage, Ollama for local LLM inference, LangChain for orchestration, and Sentence Transformers for embeddings. No API keys are required. No data leaves the machine. The entire pipeline runs on consumer hardware. This guide assumes intermediate Python knowledge. A GPU speeds up inference but is not required; CPU-only execution works, albeit more slowly.
Architecture Overview and Component Selection
How RAG Works: A Quick Refresher
The RAG pipeline follows a fixed sequence: ingest documents, chunk them into manageable segments, generate vector embeddings for each chunk, store those vectors in a database. At query time, embed the user's question, retrieve the most semantically similar chunks, and pass those chunks as context so the language model generates an answer. That is the entire loop.
The retrieval step is what distinguishes RAG from pure generation. Instead of asking a model to answer from memory (where it will often hallucinate), RAG constrains the model to work with specific, retrieved passages. This grounds responses in actual document content and significantly reduces fabricated answers.
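Stripped of every library, that loop fits in a few lines. The sketch below is deliberately tiny: word-count vectors and cosine similarity stand in for a real embedding model, and "generation" is just prompt assembly, but the retrieve-then-ground shape is the same one the full stack implements.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a word-count vector (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Ingest": embed each chunk once and keep the vectors.
chunks = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes five business days within the EU.",
    "Support is available by email on weekdays.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Query time: embed the question, return the k most similar chunks."""
    qv = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# The retrieved chunk becomes the grounded context handed to the LLM.
context = retrieve("What is the refund policy?")[0]
prompt = f"Context: {context}\nQuestion: What is the refund policy?"
```

Everything that follows in this article replaces these toy parts with real components: Sentence Transformers for embed, ChromaDB for the index, and Ollama for the generation step.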
Why These Tools? Component-by-Component Rationale
Ollama provides local LLM serving through a single binary. It supports models like Llama 3, Mistral, and Phi-3, handles model downloading and quantization, and exposes a local API compatible with LangChain. No Docker containers needed; Ollama installs as a native binary or system service.
Most local embedding solutions require managing model files, dependencies, and GPU allocation manually. Sentence Transformers eliminates that overhead. The all-MiniLM-L6-v2 model produces 384-dimensional vectors, runs fast on CPU, and scores in the top quartile on the MTEB retrieval benchmark, making it a strong default for RAG pipelines.
Running a dedicated vector database means managing another service, port, and failure mode. ChromaDB avoids this by running in embedded mode with no separate server process. It persists to disk using an SQLite-backed storage engine (in ChromaDB ≥0.4) and supports similarity search out of the box. For single-user or small-team scenarios, it eliminates that operational overhead entirely.
LangChain ties these components together as an orchestration layer, providing standardized interfaces for document loading, text splitting, embedding generation, vector store interaction, and chain assembly.
| Dimension | Local Stack (Ollama + ChromaDB) | Cloud Stack (OpenAI + Pinecone) |
|---|---|---|
| Cost | Hardware only; no per-token fees | Pay-per-token API + hosted DB fees |
| Privacy | Data never leaves the machine | Data transits to and is processed by third parties |
| Latency | 5-30s per query on CPU, 1-5s with GPU; no network round-trip | Network-dependent; 200ms-2s per call |
| Setup Complexity | Install Ollama + Python packages | API key management, account setup, network config |
| Scalability | Limited by local compute | Scales with spend |
| Offline Capable | Fully offline after model download | Requires persistent internet |
Prerequisites and Environment Setup
Hardware and Software Requirements
Minimum specifications: 8 GB RAM and a modern x86 or ARM CPU. Recommended: 16 GB RAM with a GPU offering 6 GB or more of VRAM (NVIDIA GPUs with CUDA support provide the best compatibility). Python 3.10 or later is required.
System-level dependencies for unstructured: The unstructured document processing library requires system packages that pip cannot install. Install them before proceeding:
- On Ubuntu/Debian: sudo apt-get install libmagic1 poppler-utils tesseract-ocr
- On macOS: brew install libmagic poppler tesseract
Note: This guide covers macOS and Linux. Windows users should download the Ollama installer from ollama.com instead of using the curl command below.
Installing Ollama and Pulling a Model
# Install Ollama (macOS/Linux)
# Security note: review the script before executing. For additional assurance:
# curl -fsSL https://ollama.com/install.sh -o install.sh
# sha256sum install.sh # verify checksum against ollama.com published value
# sh install.sh
curl -fsSL https://ollama.com/install.sh | sh
# Pull the Mistral model (4.1 GB download)
ollama pull mistral
# Verify the model works with a non-interactive API call
curl -s http://localhost:11434/api/generate \
-d '{"model":"mistral","prompt":"What is RAG?","stream":false}' \
| python3 -m json.tool
If the API call returns a JSON response containing a non-empty "response" field, the local LLM is functional. On systems with limited RAM, phi3 (a smaller model) can substitute for mistral—pull it with ollama pull phi3.
Creating the Python Project and Installing Dependencies
python3 -m venv .venv && source .venv/bin/activate
pip install langchain==0.2.16 langchain-community==0.2.16 langchain-huggingface==0.0.3 langchain-ollama==0.1.3 langchain-text-splitters==0.2.4 chromadb==0.5.0 sentence-transformers==3.0.0 pypdf==4.0.0 unstructured==0.14.10
This command installs the orchestration framework, ChromaDB vector store, the Sentence Transformers embedding library, LangChain integrations for Hugging Face and Ollama, the text splitters package, and document loaders for PDF and other formats. Version pins are included because LangChain's API surface changes frequently across minor releases; unpinned installs are a common source of reproducibility failures.
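The same pins can also live in a requirements.txt (shown below with the versions from this guide) so collaborators reproduce the environment with a single pip install -r requirements.txt:

```text
# requirements.txt — pinned versions from this guide
langchain==0.2.16
langchain-community==0.2.16
langchain-huggingface==0.0.3
langchain-ollama==0.1.3
langchain-text-splitters==0.2.4
chromadb==0.5.0
sentence-transformers==3.0.0
pypdf==4.0.0
unstructured==0.14.10
```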
Ingesting and Chunking Your Documents
Loading Documents with LangChain Loaders
LangChain provides loaders for PDF, Markdown, plain text, and DOCX formats. The DirectoryLoader can process an entire folder of files. When loader_cls=PyPDFLoader is specified, all matched files are processed by PyPDFLoader. Omit loader_cls to enable automatic format detection, or instantiate separate loaders per format for mixed-format directories.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
# Load all PDFs from a local directory
loader = DirectoryLoader(
"./documents",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
print(f"Loaded {len(documents)} pages from PDF files")
Create a ./documents directory (mkdir -p ./documents) and place PDF files inside. The loader traverses subdirectories via the glob pattern and returns a list of Document objects, each containing page text and metadata (source file path, page number).
Chunking Strategies That Affect Answer Quality
Raw document pages are too large for effective retrieval in most cases. Chunking splits them into smaller segments that can be independently embedded and retrieved. The RecursiveCharacterTextSplitter is LangChain's most versatile splitter: it tries to split on paragraph boundaries first, then sentences, then characters, preserving semantic coherence.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} pages")
if chunks:
    print(f"\nSample chunk:\n{chunks[0].page_content[:300]}...")
else:
    print("\nNo chunks produced — check that documents contain extractable text.")
A chunk_size of 1000 characters with a chunk_overlap of 200 provides a reasonable starting point. LangChain measures chunk_size in characters, not tokens; a 1000-character chunk is roughly 200-300 tokens depending on content. The overlap ensures that information spanning a chunk boundary is captured in at least one chunk. Smaller chunks (500 characters) improve retrieval precision but lose context. Larger chunks (1500+) preserve more context but dilute relevance scoring. These parameters directly affect answer quality and warrant tuning for specific document types.
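To see why overlap matters, here is a deliberately simplified fixed-width chunker (not what RecursiveCharacterTextSplitter actually does; the real splitter prefers paragraph and sentence boundaries first). It is enough to show that a fact straddling a chunk boundary survives intact only when overlap is nonzero:

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Simplified fixed-width chunking with overlap (the real splitter
    additionally respects paragraph and sentence boundaries)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Place a short fact right across the 1000-character boundary.
text = "A" * 990 + "DEADLINE: MARCH 1" + "B" * 900

# Without overlap the fact is split between two chunks...
no_overlap = chunk(text, size=1000, overlap=0)
# ...with 200 characters of overlap, at least one chunk holds it whole.
with_overlap = chunk(text, size=1000, overlap=200)
```

With overlap=0, no chunk contains the full "DEADLINE: MARCH 1" string; with overlap=200, the second chunk (starting at character 800) does.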
Generating Embeddings and Storing Vectors Locally
Creating Embeddings with Sentence Transformers
The all-MiniLM-L6-v2 model from Sentence Transformers fits the local RAG use case well: it is ~80 MB on disk, runs on CPU without noticeable lag for single queries, and ranks in the top quartile of MTEB retrieval benchmarks. The model downloads automatically on first use; subsequent runs use the cached version. It produces 384-dimensional vectors optimized for semantic similarity tasks, which aligns directly with the retrieval needs of a RAG pipeline. LangChain wraps this model through the HuggingFaceEmbeddings class, providing a consistent interface regardless of the underlying embedding provider.
Persisting Vectors in ChromaDB
import chromadb
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "local_documents"
BATCH_SIZE = 100
def create_vector_store(chunks, persist_directory=PERSIST_DIR):
"""Create embeddings and store them in a persistent ChromaDB collection."""
embeddings = HuggingFaceEmbeddings(
model_name="all-MiniLM-L6-v2",
model_kwargs={"device": "cpu"},
encode_kwargs={"normalize_embeddings": True}
)
# Delete existing collection to prevent duplicate vectors on re-ingest
_client = chromadb.PersistentClient(path=persist_directory)
existing = [c.name for c in _client.list_collections()]
if COLLECTION_NAME in existing:
print(
f"[warn] Collection '{COLLECTION_NAME}' already exists — "
"deleting before re-ingest to prevent duplicates."
)
_client.delete_collection(COLLECTION_NAME)
# Ingest in batches to avoid OOM on large corpora
vector_store = None
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i : i + BATCH_SIZE]
if vector_store is None:
vector_store = Chroma.from_documents(
documents=batch,
embedding=embeddings,
persist_directory=persist_directory,
collection_name=COLLECTION_NAME,
)
else:
vector_store.add_documents(batch)
print(f" Ingested batch {i // BATCH_SIZE + 1} "
f"({min(i + BATCH_SIZE, len(chunks))}/{len(chunks)} chunks)")
print(f"Stored {len(chunks)} vectors in {persist_directory}")
return vector_store
The persist_directory parameter tells ChromaDB to write its index and metadata to disk. With ChromaDB ≥0.4, persistence is automatic—no explicit persist() call is needed. This means the vector store survives process restarts, and re-embedding the entire corpus is only necessary when documents change. The duplicate-prevention step deletes any existing collection before re-ingesting, ensuring that running the ingest script multiple times does not silently duplicate all vectors. Batched ingestion prevents out-of-memory errors when processing large corpora. For incremental updates, detecting new or modified files (by tracking file hashes or modification timestamps) and embedding only those chunks avoids redundant work, though a full implementation of change detection is beyond the scope of this tutorial.
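The hash-tracking idea itself fits in a few lines. A minimal sketch follows; the manifest filename and JSON layout are arbitrary choices for illustration, not part of ChromaDB or LangChain:

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("./chroma_db/ingest_manifest.json")  # arbitrary location

def file_sha256(path: Path) -> str:
    """Content hash used to detect modified documents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(doc_dir: Path, manifest_path: Path = MANIFEST) -> list[Path]:
    """Return documents that are new or modified since the last ingest."""
    seen = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    return [p for p in sorted(doc_dir.glob("**/*.pdf"))
            if seen.get(str(p)) != file_sha256(p)]

def record_ingest(paths: list[Path], manifest_path: Path = MANIFEST) -> None:
    """Update the manifest after the listed files have been embedded."""
    seen = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    seen.update({str(p): file_sha256(p) for p in paths})
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps(seen, indent=2))
```

Wiring this into ingest.py (embed only changed_files, then record_ingest) is left as an exercise; the delete-and-rebuild approach above remains the simpler default.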
Setting normalize_embeddings to True ensures that dot product search yields results equivalent to cosine similarity, matching the metric used in most retrieval benchmarks.
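A quick pure-Python check makes that equivalence concrete (toy 2-D vectors here, not real embeddings): once both vectors are scaled to unit length, their plain dot product equals the cosine similarity of the originals.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length, as normalize_embeddings=True does."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
# After normalization, the dot product of the two vectors equals the
# cosine similarity of the unnormalized originals.
assert math.isclose(dot(normalize(a), normalize(b)), cosine(a, b))
```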
Ingestion Script
Before querying, you must ingest your documents into the vector store. Save the following as ingest.py:
import sys
import chromadb
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "local_documents"
BATCH_SIZE = 100
def main() -> int:
# Load all PDFs from a local directory
try:
loader = DirectoryLoader(
"./documents",
glob="**/*.pdf",
loader_cls=PyPDFLoader,
)
documents = loader.load()
except Exception as exc:
print(f"[error] Failed to load documents: {exc}", file=sys.stderr)
return 1
if not documents:
print(
"[error] No documents loaded. Check ./documents contains PDF files.",
file=sys.stderr,
)
return 1
print(f"Loaded {len(documents)} pages from PDF files")
# Chunk the documents
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
        separators=["\n\n", "\n", " ", ""],
)
chunks = text_splitter.split_documents(documents)
if not chunks:
print("[error] Text splitting produced zero chunks.", file=sys.stderr)
return 1
print(f"Created {len(chunks)} chunks from {len(documents)} pages")
# Create embeddings and store in ChromaDB
embeddings = HuggingFaceEmbeddings(
model_name="all-MiniLM-L6-v2",
model_kwargs={"device": "cpu"},
encode_kwargs={"normalize_embeddings": True},
)
# Delete existing collection to prevent duplicate vectors on re-ingest
try:
_client = chromadb.PersistentClient(path=PERSIST_DIR)
existing = [c.name for c in _client.list_collections()]
if COLLECTION_NAME in existing:
print(
f"[warn] Collection '{COLLECTION_NAME}' already exists — "
"deleting before re-ingest to prevent duplicates."
)
_client.delete_collection(COLLECTION_NAME)
except Exception as exc:
print(f"[error] Failed to manage ChromaDB collection: {exc}", file=sys.stderr)
return 1
# Ingest in batches to avoid OOM on large corpora
try:
vector_store = None
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i : i + BATCH_SIZE]
if vector_store is None:
vector_store = Chroma.from_documents(
documents=batch,
embedding=embeddings,
persist_directory=PERSIST_DIR,
collection_name=COLLECTION_NAME,
)
else:
vector_store.add_documents(batch)
print(
f" Ingested batch {i // BATCH_SIZE + 1} "
f"({min(i + BATCH_SIZE, len(chunks))}/{len(chunks)} chunks)"
)
except Exception as exc:
print(f"[error] Failed to write vector store: {exc}", file=sys.stderr)
return 1
print(f"Stored {len(chunks)} vectors in {PERSIST_DIR}")
return 0
if __name__ == "__main__":
sys.exit(main())
Run the ingestion step first:
python ingest.py
Confirm that ./chroma_db exists and is non-empty before proceeding to queries.
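A small pre-flight check can guard the query script against an empty or missing index. This sketch assumes ChromaDB ≥0.4's on-disk layout, where a chroma.sqlite3 file sits inside the persist directory; the filename is an implementation detail and may change in future releases:

```python
from pathlib import Path

def index_looks_ready(persist_dir: str = "./chroma_db") -> bool:
    """Cheap sanity check that ingestion wrote an index to disk.
    Assumes ChromaDB >=0.4 keeps its metadata in a chroma.sqlite3
    file inside the persist directory."""
    p = Path(persist_dir)
    return p.is_dir() and (p / "chroma.sqlite3").is_file()

if not index_looks_ready():
    print("[warn] ./chroma_db missing or incomplete. Run ingest.py first.")
```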
Querying Your Documents: The RAG Chain
Connecting Ollama as the Local LLM
Important: Ollama must be running before executing the query script. Start it with ollama serve in a separate terminal if it is not already running as a system service.
from langchain_ollama import ChatOllama
llm = ChatOllama(
model="mistral",
temperature=0.1,
num_ctx=4096
)
The temperature of 0.1 keeps responses factual and deterministic, appropriate for document Q&A where creativity is undesirable. The num_ctx parameter sets the context window size in tokens. A value of 4096 provides enough room for retrieved chunks plus the question and system prompt. Increasing it to 8192 allows more context at the cost of higher memory usage and slower inference. Do not set num_ctx above the model's maximum context length (8192 for Mistral 7B).
Building the Retrieval QA Chain
This is the complete query script that loads a persisted ChromaDB store, configures retrieval, builds a grounded prompt, and answers questions from the command line. Save this as query_docs.py and run it after running ingest.py.
Note: RetrievalQA is a legacy chain in LangChain. It works with the pinned versions specified in this guide, but LangChain is migrating to LCEL (LangChain Expression Language) based chains. Consult the LangChain documentation for LCEL equivalents if you are starting a new production project.
import sys
import re
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_ollama import ChatOllama
from langchain.chains import RetrievalQA
from langchain.prompts import (
ChatPromptTemplate,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)
PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "local_documents"
MAX_QUESTION_LEN = 500
_INJECTION_PATTERN = re.compile(
r"(ignore\s+(previous|all|above)\s+instructions|system\s*prompt|you\s+are\s+now)",
re.IGNORECASE,
)
def sanitize_question(raw: str) -> str:
"""Validate and sanitize user input to mitigate prompt injection."""
q = raw.strip()
if not q:
return "What are the key topics in these documents?"
if len(q) > MAX_QUESTION_LEN:
raise ValueError(
f"Question exceeds maximum length of {MAX_QUESTION_LEN} characters."
)
if _INJECTION_PATTERN.search(q):
raise ValueError("Question contains disallowed content.")
return q
# Load persisted vector store
embeddings = HuggingFaceEmbeddings(
model_name="all-MiniLM-L6-v2",
model_kwargs={"device": "cpu"},
encode_kwargs={"normalize_embeddings": True},
)
vector_store = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
)
# Configure retriever — fetch top 4 most similar chunks
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": 4},
)
# Chat prompt template grounding the model in retrieved context.
# ChatOllama is a chat model, so a ChatPromptTemplate with system/human
# messages ensures the prompt is formatted correctly for the chat API.
system_template = (
    "Use the following context to answer the question. "
    "If the answer is not contained in the context, say "
    "\"I don't have enough information to answer this based on the "
    "provided documents.\"\n\n"
    "Context:\n{context}"
)
chat_prompt = ChatPromptTemplate.from_messages(
[
SystemMessagePromptTemplate.from_template(system_template),
        HumanMessagePromptTemplate.from_template("Question: {question}\nAnswer:"),
]
)
# Initialize local LLM
llm = ChatOllama(model="mistral", temperature=0.1, num_ctx=4096)
# Assemble the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": chat_prompt},
return_source_documents=True,
)
# Accept question from command line with input validation
raw_input = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else ""
try:
question = sanitize_question(raw_input)
except ValueError as e:
print(f"[error] {e}", file=sys.stderr)
sys.exit(1)
try:
result = qa_chain.invoke({"query": question})
except Exception as exc:
print(f"[error] Chain invocation failed: {exc}", file=sys.stderr)
sys.exit(1)
answer = result.get("result", "[No answer returned by chain]")
source_docs = result.get("source_documents", [])
print(f"\nQuestion: {question}")
print(f"\nAnswer: {answer}")
print("\nSources:")
for doc in source_docs:
source = doc.metadata.get("source", "Unknown")
page = doc.metadata.get("page", "?")
    snippet = doc.page_content[:80].replace("\n", " ")
print(f" - {source} (page {page}): {snippet}...")
Run it with: python query_docs.py "What does the policy say about remote work?"
The chain_type="stuff" approach concatenates all retrieved chunks into a single context block. This works well when the total retrieved text fits within the context window. At k=4 and chunk_size=1000 characters, retrieved context can approach or exceed 4096 tokens; use num_ctx=8192 or reduce k if the model truncates responses. For larger retrieval sets, map_reduce or refine chain types process chunks sequentially, though they require multiple LLM calls and increase latency.
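A back-of-the-envelope check helps here. The helper below uses the rough 4-characters-per-token heuristic for English text (an approximation, not the model's actual tokenizer) to flag configurations where the retrieved context plus prompt overhead will not fit in num_ctx:

```python
def context_budget_ok(k: int, chunk_size: int, num_ctx: int,
                      chars_per_token: float = 4.0,
                      reserved_tokens: int = 512) -> bool:
    """Estimate whether k retrieved chunks fit in the context window.
    chars_per_token ~4 is a heuristic for English text, not the model's
    tokenizer; reserved_tokens leaves room for the question, the system
    prompt, and the generated answer."""
    estimated = (k * chunk_size) / chars_per_token + reserved_tokens
    return estimated <= num_ctx

# k=4 chunks of 1000 chars is ~1000 tokens of context: fits in 4096.
# k=8 chunks of 1500 chars overflows a 2048-token window.
```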
Setting return_source_documents=True provides transparency by showing which document chunks informed the answer.
Understanding the Prompt Template
The prompt template contains two variables: {context} and {question}. LangChain populates the {context} variable with the concatenated text of the top-k retrieved chunks. LangChain inserts the user's query into {question}. The instruction to say "I don't have enough information" when the context lacks the answer is critical. Without it, the model falls back on its training data and generates plausible but unsourced answers, defeating the purpose of grounded retrieval.
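Conceptually, the "stuff" chain does nothing more than join the retrieved chunks and fill those two slots. A simplified illustration (the real chain additionally wraps this in chat-message structure):

```python
# Simplified view of what chain_type="stuff" does with the two template
# variables; chunk texts here are invented for illustration.
template = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

retrieved = [
    "Remote work requires manager approval.",
    "Approved remote days are capped at three per week.",
]
# "stuff" = concatenate every retrieved chunk into one context block.
prompt = template.format(
    context="\n\n".join(retrieved),
    question="What does the policy say about remote work?",
)
```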
Testing, Tuning, and Troubleshooting
Running Your First Queries
# Test queries to validate the pipeline (run each separately)
# python query_docs.py "Summarize the main findings of the report"
# python query_docs.py "What are the specific deadlines mentioned?"
# python query_docs.py "Who is responsible for budget approval?"
# Expected output pattern:
# Question: What are the specific deadlines mentioned?
# Answer: According to the documents, the following deadlines are specified:
# 1. [Specific deadline pulled from your documents]
# 2. [Another deadline from context]
# Sources:
# - ./documents/report.pdf (page 3): The project timeline requires...
Answers should directly reference content from the ingested documents. If the model generates generic responses without document-specific details, the retrieval step is likely returning irrelevant chunks.
Tuning Retrieval Quality
A k of 4 is a reasonable default for the number of retrieved chunks. Increasing to 6 or 8 provides more context but risks including less relevant chunks that dilute answer quality. Reducing to 2 sharpens focus but may miss relevant information spread across multiple chunks.
For documents with dense, granular information like technical specifications, try smaller chunks (500 characters) with higher overlap (150-200). Narrative documents where context spans multiple paragraphs work better with larger chunks (1500 characters). After changing chunking parameters, you must re-embed and re-index the entire corpus.
The BAAI/bge-small-en-v1.5 model is an alternative that outperforms all-MiniLM-L6-v2 on several MTEB retrieval tasks by 1-3 percentage points. It is not a drop-in replacement, however: embeddings from different models are not comparable even when their dimensions happen to match, so switching requires deleting and rebuilding the ChromaDB collection, and bge models score best when queries are prefixed with their recommended retrieval instruction.
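When experimenting with alternative embedding models, a fail-fast guard avoids silently querying a stale index. The sketch below is a generic pattern, not a ChromaDB API; the dimensions in the docstring (384 for all-MiniLM-L6-v2, 768 for bge-base-en-v1.5) are those models' published output sizes:

```python
def check_dim(query_vector: list[float], index_dim: int) -> None:
    """Fail fast when a query embedding does not match the dimension the
    collection was built with (e.g. 384 for all-MiniLM-L6-v2 vs 768 for
    bge-base-en-v1.5). Matching dimensions alone do not make two models'
    vectors comparable, so rebuild after any model switch."""
    if len(query_vector) != index_dim:
        raise ValueError(
            f"query embedding is {len(query_vector)}-dimensional but the "
            f"index was built with {index_dim} dimensions; rebuild the "
            "ChromaDB collection after switching embedding models"
        )
```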
Ollama makes it easy to test different LLMs. Run ollama pull llama3, then change the model parameter to llama3 in the ChatOllama constructor. phi3 (pull with ollama pull phi3; 3.8B parameters) is a lighter alternative for constrained hardware. mistral provides a good balance of quality and speed for an on-prem RAG implementation.
Common Issues and Fixes
Ollama connection refused? The Ollama service is not running. Start it with ollama serve in a separate terminal, or verify it is running with ollama list. On Linux with systemd, check systemctl status ollama.
- Out-of-memory errors: Reduce num_ctx from 4096 to 2048, or switch to a smaller quantized model (e.g., mistral uses Q4 quantization by default; phi3 requires less memory). Close other memory-intensive applications.
- Poor or generic answers: The most common causes are chunk sizes that are too large (diluting relevance), a k value that is too high (introducing noise), or a prompt template that does not explicitly instruct the model to use only the provided context. Verify retrieved chunks by inspecting the source documents in the output to confirm they contain relevant text.
- Slow inference on CPU: Expected behavior without a GPU. Mistral 7B on CPU generates around 5 tokens per second on an 8-core x86 CPU with 16 GB RAM. phi3 (3.8B parameters) roughly doubles that throughput.
Security and Privacy Considerations
The core guarantee of this local RAG setup is that zero data leaves the machine during operation. To verify this, run a network audit using sudo ss -tlnp on Linux (root required to display process names; omit sudo for port listing only) or netstat -an on macOS to confirm that only localhost connections are active during queries. Ollama binds to 127.0.0.1:11434 by default and does not open external ports.
Restrict file-system permissions for the ChromaDB persist directory (./chroma_db) to the user running the pipeline. On Linux, chmod -R 700 ./chroma_db prevents other users from reading the vector store, which contains embedded representations of potentially sensitive documents. Note that chmod 700 on the directory alone does not restrict permissions on files created inside it; using -R applies the permission recursively to existing contents, and setting umask 077 before running the ingest script ensures newly created files are also restricted.
Model provenance matters. Downloaded LLMs carry their own license terms. Mistral uses the Apache 2.0 license, permitting commercial use. Llama 3 uses the Llama 3 Community License Agreement, which includes specific usage restrictions for applications exceeding 700 million monthly active users—review the current terms at Meta's official Llama page before deploying.
For maximum security, the entire stack can run in an air-gapped environment. Download Ollama, the model weights, and Python packages on a connected machine, then transfer them via physical media and install offline. Verify the integrity of transferred files using checksums (e.g., sha256sum) before installation. After initial setup, no network connectivity is required.
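Checksum verification needs nothing beyond the standard library. A small sketch that streams the file, so multi-gigabyte model weights never need to fit in memory:

```python
import hashlib

def sha256_of(path: str, chunk_bytes: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_bytes):
            h.update(block)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Compare against the checksum recorded on the connected machine."""
    return sha256_of(path) == expected_hex.strip().lower()
```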
Where to Go Next
The single highest-value next step is adding a browser-based interface with Gradio or Streamlit, which makes this system accessible to non-technical team members without requiring command-line interaction.
Beyond that, ChromaDB supports hybrid search combining keyword matching with semantic similarity, which improves retrieval for queries containing specific terms or identifiers. Wrapping the pipeline in a FastAPI application enables multi-user access while keeping all processing local. For higher retrieval accuracy, cross-encoder re-ranking models (such as cross-encoder/ms-marco-MiniLM-L-6-v2) can score and reorder retrieved chunks before they reach the LLM, though they add latency per query. The LangChain documentation and Ollama model library provide further guidance on extending these patterns.

