Building Production-Ready RAG Systems with Free LLMs: From Zero to Analysis-Ready in 6 Steps
Introduction
When I started exploring Retrieval-Augmented Generation (RAG) systems for incident analysis, I realized that jumping straight into paid APIs like Anthropic’s Claude or OpenAI’s GPT-4 wasn’t practical for learning and experimentation. Instead, I wanted to build something completely local, free to run, and powerful enough to handle real production scenarios.
This article documents my journey building a fully functional RAG system that analyzes production incidents by learning from past issues — without spending a dime on API calls. Everything runs on a laptop using open-source tools.
What You’ll Build
By the end of this guide, you’ll understand how to build a working RAG system that:
- ✅ Learns from past incident data (your knowledge base)
- ✅ Performs semantic search on incident history (finds similar past issues)
- ✅ Analyzes new incidents using an open-source LLM (Llama 2)
- ✅ Suggests root causes and resolutions based on historical patterns
- ✅ Runs completely locally (no API keys, no cloud services)
- ✅ Produces analysis in 8–17 seconds per incident
Real-World Use Case
Imagine you have a production incident:
New Issue:
- Memory usage: 89% (baseline: 45%)
- GC pause time: 2.3 seconds (SLA: 200ms)
- Cache lookups: 4x slower than normal
- Error: OutOfMemoryError starting to appear
Your RAG system will:
- Search through historical incidents
- Find that on January 15th, you had a similar issue (85% match)
- Retrieve the resolution that worked then (LRU cache eviction policy)
- Analyze the current incident with that context
- Provide a confidence-based recommendation
That’s the power of RAG with local LLMs. And you build it yourself, completely free.
Why Build Locally First?
Learning
You understand the entire system without the abstraction of managed services. No black boxes — just pure RAG architecture from first principles.
Cost
Zero runtime costs. The only cost is electricity to run your computer. After initial setup (~15GB disk), there are no monthly bills or per-query fees. Compare this to Claude API ($0.004/query) or GPT-4 ($0.03/query).
Privacy
Your incident data never leaves your machine. No sending sensitive production issues to external APIs. Your infrastructure remains under your control.
Flexibility
You control every component. Want to change the LLM? Swap Ollama models in seconds. Want to modify how embeddings work? Edit the code. Want to add custom context building? You own the entire pipeline.
Foundation for Production
Once you validate your approach locally, scaling to paid APIs (Claude, GPT-4) requires changing just a few lines of code. You’ve built the right abstraction layer from day one.
Architecture: The Incident Analysis RAG System
The system has three core components working together:
1. Knowledge Base (Weaviate Vector Database)
- Stores your historical incident data
- Enables semantic search (finding similar incidents)
- Runs locally in Docker
2. Embedding Model (sentence-transformers)
- Converts incident descriptions to vectors (mathematical representations)
- Enables semantic similarity matching
- Runs locally on your CPU/GPU
3. Analysis Engine (Ollama + Llama 2)
- Runs the open-source Llama 2 model locally
- Analyzes new incidents with historical context
- Generates root cause analysis and recommendations
When you submit a new incident, the system:
- Converts symptoms to vectors
- Searches vector database for similar incidents (semantic search)
- Retrieves the historical data: root causes, resolutions, time to fix
- Sends current incident + historical context to Llama 2
- LLM generates analysis with confidence scores
The entire process happens on your machine. No API calls. No rate limits. No bills.
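To make the “convert to vectors, compare by similarity” idea concrete, here is a minimal sketch using sentence-transformers and scikit-learn from the dependency list. It is not the project’s rag_pipeline.py code, and the model name (all-MiniLM-L6-v2) is an assumption for illustration.

```python
# Minimal sketch of the "symptoms -> vector -> similarity" idea.
# The model name is an assumption; the article only says "sentence-transformers".
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

new_incident = "OutOfMemoryError, GC pause 2.3s, cache lookups 4x slower"
past_incidents = [
    "Memory leak in cache service, heap exhaustion, long GC pauses",
    "Database connection pool exhaustion, timeouts on writes",
    "CPU spike from unoptimized loop in batch job",
]

# Encode everything into dense vectors (one per description).
vectors = model.encode([new_incident] + past_incidents)

# Cosine similarity between the new incident and each past incident.
scores = cosine_similarity([vectors[0]], vectors[1:])[0]
for text, score in sorted(zip(past_incidents, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {text}")
```

The highest-scoring description is the most semantically similar past incident; the vector database automates exactly this comparison at scale.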
System Requirements
Before starting, verify you have:
✓ macOS, Linux, or Windows (with WSL2)
✓ Python 3.11 or higher
✓ 8GB RAM (16GB recommended for faster analysis)
✓ 15GB free disk space (for models)
✓ Docker Desktop installed
✓ Modern CPU (Intel i5+, Apple Silicon, or AMD Ryzen)
Time to complete: ~1 hour for initial setup (mostly downloading models)
Step 1: Install Ollama (Your Local LLM Runtime)
Ollama lets you run open-source LLMs locally without any cloud service.
Installation is straightforward:
- macOS: brew install ollama
- Linux: curl -fsSL https://ollama.ai/install.sh | sh
- Windows: Download from https://ollama.ai/download/windows
After installation, verify it works and download the Llama 2 model (~4GB, one-time download).
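The download itself is a single command (ollama pull llama2). Once it finishes, a quick sanity check like the sketch below, against Ollama’s local HTTP API on port 11434, confirms the server is up and the model responds. It is a hedged verification step, not part of the project code.

```python
# Quick sanity check against Ollama's local HTTP API (default port 11434).
# Assumes you've already run `ollama pull llama2` so the model is available.
import requests

# 1. Is the Ollama server up, and is llama2 in the local model list?
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
print([m["name"] for m in tags.get("models", [])])

# 2. Can it generate a completion?
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Say hello in five words.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```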
Step 2: Set Up the Project & Dependencies
Create your project directory and Python environment. Install dependencies (requests, weaviate-client, sentence-transformers, scikit-learn, etc.).
The full requirements.txt and setup instructions are in the GitLab repository.
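As a quick smoke test of the environment (the pinned versions live in the repository’s requirements.txt), something like this confirms the key packages import cleanly:

```python
# Quick environment check: confirm the key packages named in this article import cleanly.
# The exact pinned versions live in the repository's requirements.txt.
import importlib

for pkg in ["requests", "weaviate", "sentence_transformers", "sklearn"]:
    module = importlib.import_module(pkg)
    print(f"{pkg:22s} {getattr(module, '__version__', 'ok')}")
```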
Step 3: Set Up the Weaviate Vector Database
Weaviate stores your historical incident data and enables fast semantic search.
Create a simple docker-compose.yml file to run Weaviate. The configuration is minimal—just specify the port and Weaviate image.
Start Weaviate with Docker:
docker-compose up -d
sleep 45
curl http://localhost:8080/v1/.well-known/ready
That’s it. Weaviate is now running and ready to store incidents.
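You can run the same readiness check from Python. The sketch below assumes the v3-style weaviate-client API; the v4 client connects via weaviate.connect_to_local() instead.

```python
# Readiness check from Python, assuming the v3-style weaviate-client API
# (the v4 client uses weaviate.connect_to_local() instead).
import weaviate

client = weaviate.Client("http://localhost:8080")
print("Weaviate ready:", client.is_ready())  # True once the container is up
```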
Step 4: Build the RAG Pipeline
This is the core of the system. The complete implementation is available in the GitLab repository.
Key components of the implementation:
IncidentAnalyzerRAG class handles:
- Connecting to Weaviate (vector database)
- Loading sentence-transformers (embedding model)
- Connecting to Ollama (LLM)
- Managing historical incident data
Main methods:
- setup_schema() – Creates the incident storage structure
- load_sample_incidents() – Loads built-in historical incidents into the database
- add_incident_to_history() – Adds new incidents to the knowledge base
- search_similar_incidents() – Finds similar past incidents using semantic search
- analyze_incident() – Uses Ollama to analyze with historical context
- full_analysis() – Orchestrates the entire pipeline
How it works:
The RAG pipeline comes with sample historical incidents built in (memory leaks, connection pool exhaustion, CPU spikes). The system is self-contained in a single rag_pipeline.py file with a __main__ block that demonstrates the complete workflow.
When you run it:
- The system initializes (connects to Weaviate and Ollama)
- Creates the incident storage schema
- Loads sample incidents (INC-001, INC-002, INC-003)
- Analyzes a test incident for demonstration
- Displays analysis with confidence scores and recommendations
For the complete, production-ready implementation, see the rag_pipeline.py file.
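To show the shape of the two central methods, here is a condensed, hypothetical sketch of search_similar_incidents() and analyze_incident(). The class name “Incident”, the property names, the embedding model, and the prompt wording are assumptions for illustration, and it uses the v3 weaviate-client API; the real schema and prompts are defined in rag_pipeline.py.

```python
# Condensed sketch of the retrieval + analysis steps. Class name "Incident",
# property names, embedding model, and prompt wording are assumptions for
# illustration; the actual implementation lives in rag_pipeline.py.
import requests
import weaviate
from sentence_transformers import SentenceTransformer

client = weaviate.Client("http://localhost:8080")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def search_similar_incidents(symptoms: str, limit: int = 2) -> list[dict]:
    """Embed the symptoms and run a nearVector search over stored incidents."""
    vector = embedder.encode(symptoms).tolist()
    result = (
        client.query.get("Incident", ["title", "rootCause", "resolution", "timeToFix"])
        .with_near_vector({"vector": vector})
        .with_limit(limit)
        .do()
    )
    return result["data"]["Get"]["Incident"]

def analyze_incident(symptoms: str) -> str:
    """Build a context-augmented prompt and send it to Llama 2 via Ollama."""
    similar = search_similar_incidents(symptoms)
    context = "\n".join(
        f"- {inc['title']}: root cause {inc['rootCause']}; "
        f"fixed by {inc['resolution']} in {inc['timeToFix']} minutes"
        for inc in similar
    )
    prompt = (
        f"Current incident: {symptoms}\n\n"
        f"Similar past incidents:\n{context}\n\n"
        "Suggest the most likely root cause, immediate actions, a permanent fix, "
        "and a time estimate. Include a confidence level."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

print(analyze_incident("OutOfMemoryError, GC pause 2.3s, cache lookups 4x slower"))
```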
Step 5: Load Historical Incidents
The RAG system comes with sample historical incidents built in:
- INC-001: Memory leak in cache service
- INC-002: Database connection pool exhaustion
- INC-003: CPU spike from unoptimized loop
These are loaded automatically when you run the system via the load_sample_incidents() method in the RAG pipeline.
To use your own incident history, you can:
- Call add_incident_to_history() for each incident programmatically
- Batch load incidents from your incident management system
- Import from CSV or JSON files
Each incident should include:
- Incident ID: Unique identifier (INC-001, INC-002, etc.)
- Title: Short description
- Root Cause: What actually caused it
- Resolution: How it was fixed
- Time to Fix: How long it took (in minutes)
- Symptoms: Observable symptoms (OutOfMemoryError, high latency, etc.)
All of this is handled in the main RAG pipeline code. Check the rag_pipeline.py for the complete implementation.
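As an illustration of what loading your own history could look like: the field names below mirror the list above, but the constructor and the add_incident_to_history() signature are assumptions, so check rag_pipeline.py for the actual interface.

```python
# Hypothetical sketch of loading your own incidents. The field names mirror the
# list above; the constructor and add_incident_to_history() signature are assumed.
import json

from rag_pipeline import IncidentAnalyzerRAG  # from the repository

rag = IncidentAnalyzerRAG()

incident = {
    "incident_id": "INC-004",
    "title": "Latency spike after deploy",
    "root_cause": "N+1 query introduced in ORM mapping",
    "resolution": "Added eager loading and a covering index",
    "time_to_fix": 60,  # minutes
    "symptoms": "p99 latency 4x baseline, DB CPU at 90%, slow query log growing",
}
rag.add_incident_to_history(**incident)

# Batch import from a JSON export of your incident management system:
with open("incidents.json") as f:
    for record in json.load(f):
        rag.add_incident_to_history(**record)
```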
Step 6: Run Your First Analysis
Start the two services in different terminals:
Terminal 1: Start Ollama
ollama serve
Terminal 2: Start Weaviate and run RAG
docker-compose up -d && sleep 45
python src/rag_pipeline.py
The script will:
- Initialize the RAG system
- Set up the incident schema
- Load sample incidents
- Analyze a test incident
- Display the results
Expected Output
======================================================================
LOCAL RAG SYSTEM - NO API KEYS NEEDED
======================================================================
🚀 Starting Local RAG (No API Keys!)
✓ Connected to Weaviate
✓ Embedding model loaded
✓ Ollama connected
🎉 Local RAG System Ready (NO API KEYS!)
Cleaned up old data
✓ Schema created
✓ Loaded 3 incidents
======================================================================
TESTING LOCAL RAG WITH LOCAL LLM
======================================================================
📌 Query: Memory error OutOfMemory heap space GC pause
📚 Similar Incidents Found: 2
- Memory leak in cache service
- CPU spike from unoptimized loop
🤖 Analyzing incident with Ollama (Llama 2)...
💡 Local LLM Analysis:
======================================================================
Based on the current incident and similar past incidents, I recommend:
1. ROOT CAUSE (95% confidence):
Cache memory leak due to missing eviction policy. Similar to INC-001.
2. IMMEDIATE ACTIONS:
- Restart cache service (5 minutes)
- Monitor memory and GC behavior
3. PERMANENT FIX:
Reimplement LRU cache eviction policy (45 minutes total)
4. TIME ESTIMATE: 45 minutes based on historical data
======================================================================
How It Works: The RAG Difference
Without RAG, you’d ask Llama 2:
“We have OutOfMemoryError, GC pause 2.3s, slow cache. What’s the issue?”
Llama 2 would give a generic answer based on its training data.
With RAG, you ask:
“We have OutOfMemoryError, GC pause 2.3s, slow cache. Here’s what happened on January 15th with 85% similarity: Memory leak from missing cache eviction. Here’s how we fixed it…”
Llama 2 now becomes an expert analyst with your specific context. It can confidently say:
- Root cause: 95% confidence (because it matches historical pattern)
- Fix: Reimplemented LRU cache eviction
- Time estimate: 45 minutes (based on historical data)
This is the magic of RAG: combining your knowledge base with LLM reasoning.
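In code, that difference is nothing more than prompt construction. Here is a hedged sketch: same local endpoint, same model, with the January 15th context hard-coded purely for illustration.

```python
# The only difference between "without RAG" and "with RAG" is the prompt you send.
# Same local endpoint, same model; the retrieved history is just prepended as context.
import requests

def ask_llama2(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

question = "We have OutOfMemoryError, GC pause 2.3s, slow cache. What's the issue?"

generic_answer = ask_llama2(question)  # no context: generic troubleshooting advice

context = (
    "On January 15th we saw an 85% similar incident: memory leak from a missing "
    "cache eviction policy, fixed by adding LRU eviction (45 minutes)."
)
grounded_answer = ask_llama2(f"{question}\n\nRelevant history:\n{context}")
```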
Performance Characteristics
Here’s what to expect:
- Analysis latency: 8–17 seconds
- Vector search time: 100–500 ms
- LLM inference: 5–12 seconds
- Memory usage: 4–6 GB
- Storage needed: 15 GB
- Cost per analysis: $0.00
- Monthly cost: $0.00
Why is it fast?
- Llama 2 is small (7B parameters) but efficient
- Semantic search in vector database is optimized
- Everything runs locally (no network latency)
Why might the first query be slower?
- The first query initializes the models (~3–5 seconds extra)
- Subsequent queries reuse the cached models and skip that cost
Scaling to Production: When to Upgrade
Once you’ve validated your incident analysis approach locally, you can upgrade to paid models for potentially better accuracy.
Claude API
- Accuracy: Excellent (trained on broader data)
- Cost: ~$0.004 per analysis (using Claude 3.5 Sonnet)
- Integration: Simple code change (3–4 lines)
GPT-4
- Accuracy: State-of-the-art
- Cost: ~$0.03 per analysis
- Integration: Simple code change (3–4 lines)
The beautiful part: Your RAG pipeline stays exactly the same. You’re only swapping the LLM component. The embeddings, vector search, and context building all remain unchanged.
Start with free Llama 2. When you need better accuracy, upgrade to Claude. Your architecture supports it from day one.
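One way to keep that swap down to a few lines is to isolate the LLM call behind a single function, as in the sketch below. Only the Ollama path reflects what this article builds; the Claude path is a hedged example based on the public Anthropic Python SDK, and the model identifier is just an example.

```python
# Sketch of isolating the LLM call so swapping backends stays a few-line change.
# Only the Ollama path matches this article; the Claude path is an assumed example
# based on the public Anthropic Python SDK.
import requests

def generate(prompt: str, backend: str = "ollama") -> str:
    if backend == "ollama":
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama2", "prompt": prompt, "stream": False},
            timeout=300,
        )
        return resp.json()["response"]
    if backend == "claude":
        from anthropic import Anthropic  # pip install anthropic; needs ANTHROPIC_API_KEY

        client = Anthropic()
        msg = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # example model id
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    raise ValueError(f"unknown backend: {backend}")
```

Everything upstream of generate() — embeddings, vector search, context building — never changes.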
Key Takeaways
- RAG amplifies LLM capabilities: Free Llama 2 becomes an expert analyst when combined with your incident history
- Local-first saves money: Zero API costs while you’re learning and validating
- Easy to upgrade: When you need better quality, swap Ollama for Claude with minimal code changes
- Data privacy: Sensitive incidents never leave your infrastructure
- Fast iteration: Modify prompts, adjust incident data, and experiment instantly
Conclusion
You now understand how to build a production-grade incident analysis system that:
- Learns from your historical incident data
- Performs semantic search to find similar past issues
- Analyzes new incidents using open-source Llama 2
- Requires zero API keys and costs nothing to run
- Is ready to scale to paid models when needed
The best part? You understand every component. This is genuine RAG architecture — not an abstraction layer hiding the complexity.
When you’re ready to upgrade to Claude or GPT-4, you’re not learning a new system or switching platforms. You’re simply swapping one component while everything else stays the same.
Start with free. Learn thoroughly. Validate your approach. Scale when proven. That’s the future of building with AI.
Resources
- GitLab Repository: rag-incident-analyzer
- Quick Start: README with setup instructions
- Core Implementation: rag_pipeline.py
External Resources:
- Ollama Documentation
- Weaviate Vector Database
- Sentence Transformers
- RAG Architecture (LangChain)
- Llama 2 Model