Building Production-Ready RAG Systems with Free LLMs: From Zero to Analysis-Ready in 6 Steps
Introduction
When I started exploring Retrieval-Augmented Generation (RAG) systems for incident analysis, I realized that jumping straight into paid APIs like Anthropic’s Claude or OpenAI’s GPT-4 wasn’t practical for learning and experimentation. Instead, I wanted to build something completely local, free to run, and powerful enough to handle real production scenarios.
This article documents my journey building a fully functional RAG system that analyzes production incidents by learning from past issues — without spending a dime on API calls. Everything runs on a laptop using open-source tools.
What You’ll Build
By the end of this guide, you’ll understand how to build a working RAG system that:
- ✅ Learns from past incident data (your knowledge base)
- ✅ Performs semantic search on incident history (finds similar past issues)
- ✅ Analyzes new incidents using an open-source LLM (Llama 2)
- ✅ Suggests root causes and resolutions based on historical patterns
- ✅ Runs completely locally (no API keys, no cloud services)
- ✅ Produces analysis in 8–17 seconds per incident
Real-World Use Case
Imagine you have a production incident:
New Issue:
- Memory usage: 89% (baseline: 45%)
- GC pause time: 2.3 seconds (SLA: 200ms)
- Cache lookups: 4x slower than normal
- Error: OutOfMemoryError starting to appear
Your RAG system will:
- Search through historical incidents
- Find that on January 15th, you had a similar issue (85% match)
- Retrieve the resolution that worked then (LRU cache eviction policy)
- Analyze the current incident with that context
- Provide a confidence-based recommendation
That’s the power of RAG with local LLMs. And you build it yourself, completely free.
Why Build Locally First?
Learning
You understand the entire system without the abstraction of managed services. No black boxes — just pure RAG architecture from first principles.
Cost
Zero runtime costs. The only cost is electricity to run your computer. After initial setup (~15GB disk), there are no monthly bills or per-query fees. Compare this to Claude API ($0.004/query) or GPT-4 ($0.03/query).
Privacy
Your incident data never leaves your machine. No sending sensitive production issues to external APIs. Your infrastructure remains under your control.
Flexibility
You control every component. Want to change the LLM? Swap Ollama models in seconds. Want to modify how embeddings work? Edit the code. Want to add custom context building? You own the entire pipeline.
Foundation for Production
Once you validate your approach locally, scaling to paid APIs (Claude, GPT-4) requires changing just a few lines of code. You’ve built the right abstraction layer from day one.
Architecture: The Incident Analysis RAG System
The system has three core components working together:
1. Knowledge Base (Weaviate Vector Database)
- Stores your historical incident data
- Enables semantic search (finding similar incidents)
- Runs locally in Docker
2. Embedding Model (sentence-transformers)
- Converts incident descriptions to vectors (mathematical representations)
- Enables semantic similarity matching
- Runs locally on your CPU/GPU
3. Analysis Engine (Ollama + Llama 2)
- Runs the open-source Llama 2 model locally
- Analyzes new incidents with historical context
- Generates root cause analysis and recommendations
When you submit a new incident, the system:
- Converts symptoms to vectors
- Searches vector database for similar incidents (semantic search)
- Retrieves the historical data: root causes, resolutions, time to fix
- Sends current incident + historical context to Llama 2
- LLM generates analysis with confidence scores
The entire process happens on your machine. No API calls. No rate limits. No bills.
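To make the “convert to vectors, compare by similarity” idea concrete, here is a minimal sketch using sentence-transformers and scikit-learn from the dependency list. It is not the project’s rag_pipeline.py code, and the model name (all-MiniLM-L6-v2) is an assumption for illustration.

```python
# Minimal sketch of the "symptoms -> vector -> similarity" idea.
# The model name is an assumption; the article only says "sentence-transformers".
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

new_incident = "OutOfMemoryError, GC pause 2.3s, cache lookups 4x slower"
past_incidents = [
    "Memory leak in cache service, heap exhaustion, long GC pauses",
    "Database connection pool exhaustion, timeouts on writes",
    "CPU spike from unoptimized loop in batch job",
]

# Encode everything into dense vectors (one per description).
vectors = model.encode([new_incident] + past_incidents)

# Cosine similarity between the new incident and each past incident.
scores = cosine_similarity([vectors[0]], vectors[1:])[0]
for text, score in sorted(zip(past_incidents, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {text}")
```

The highest-scoring description is the most semantically similar past incident; the vector database automates exactly this comparison at scale.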
System Requirements
Before starting, verify you have:
✓ macOS, Linux, or Windows (with WSL2)
✓ Python 3.11 or higher
✓ 8GB RAM (16GB recommended for faster analysis)
✓ 15GB free disk space (for models)
✓ Docker Desktop installed
✓ Modern CPU (Intel i5+, Apple Silicon, or AMD Ryzen)
Time to complete: ~1 hour for initial setup (mostly downloading models)
Step 1: Install Ollama (Your Local LLM Runtime)
Ollama lets you run open-source LLMs locally without any cloud service.
Installation is straightforward:
- macOS: brew install ollama
- Linux: curl -fsSL https://ollama.ai/install.sh | sh
- Windows: Download from https://ollama.ai/download/windows
After installation, verify it works and download the Llama 2 model (~4GB, one-time download).
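The download itself is a single command (ollama pull llama2). Once it finishes, a quick sanity check like the sketch below, against Ollama’s local HTTP API on port 11434, confirms the server is up and the model responds. It is a hedged verification step, not part of the project code.

```python
# Quick sanity check against Ollama's local HTTP API (default port 11434).
# Assumes you've already run `ollama pull llama2` so the model is available.
import requests

# 1. Is the Ollama server up, and is llama2 in the local model list?
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
print([m["name"] for m in tags.get("models", [])])

# 2. Can it generate a completion?
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Say hello in five words.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```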
Step 2: Set Up the Project & Dependencies
Create your project directory and Python environment. Install dependencies (requests, weaviate-client, sentence-transformers, scikit-learn, etc.).
The full requirements.txt and setup instructions are in the GitLab repository.
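As a quick smoke test of the environment (the pinned versions live in the repository’s requirements.txt), something like this confirms the key packages import cleanly:

```python
# Quick environment check: confirm the key packages named in this article import cleanly.
# The exact pinned versions live in the repository's requirements.txt.
import importlib

for pkg in ["requests", "weaviate", "sentence_transformers", "sklearn"]:
    module = importlib.import_module(pkg)
    print(f"{pkg:22s} {getattr(module, '__version__', 'ok')}")
```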
Step 3: Set Up the Weaviate Vector Database
Weaviate stores your historical incident data and enables fast semantic search.
Create a simple docker-compose.yml file to run Weaviate. The configuration is minimal—just specify the port and Weaviate image.
Start Weaviate with Docker:
docker-compose up -d
sleep 45
curl http://localhost:8080/v1/.well-known/ready
That’s it. Weaviate is now running and ready to store incidents.
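You can run the same readiness check from Python. The sketch below assumes the v3-style weaviate-client API; the v4 client connects via weaviate.connect_to_local() instead.

```python
# Readiness check from Python, assuming the v3-style weaviate-client API
# (the v4 client uses weaviate.connect_to_local() instead).
import weaviate

client = weaviate.Client("http://localhost:8080")
print("Weaviate ready:", client.is_ready())  # True once the container is up
```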
Step 4: Build the RAG Pipeline
This is the core of the system. The complete implementation is available in the GitLab repository.
Key components of the implementation:
IncidentAnalyzerRAG class handles:
- Connecting to Weaviate (vector database)
- Loading sentence-transformers (embedding model)
- Connecting to Ollama (LLM)
- Managing historical incident data
Main methods:
- setup_schema() – Creates the incident storage structure
- load_sample_incidents() – Loads built-in historical incidents into the database
- add_incident_to_history() – Adds new incidents to the knowledge base
- search_similar_incidents() – Finds similar past incidents using semantic search
- analyze_incident() – Uses Ollama to analyze with historical context
- full_analysis() – Orchestrates the entire pipeline
How it works:
The RAG pipeline comes with sample historical incidents built in (memory leaks, connection pool exhaustion, CPU spikes). The system is self-contained in a single rag_pipeline.py file with a __main__ block that demonstrates the complete workflow.
When you run it:
- The system initializes (connects to Weaviate and Ollama)
- Creates the incident storage schema
- Loads sample incidents (INC-001, INC-002, INC-003)
- Analyzes a test incident for demonstration
- Displays analysis with confidence scores and recommendations
For the complete, production-ready implementation, see the rag_pipeline.py file.
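To show the shape of the two central methods, here is a condensed, hypothetical sketch of search_similar_incidents() and analyze_incident(). The class name “Incident”, the property names, the embedding model, and the prompt wording are assumptions for illustration, and it uses the v3 weaviate-client API; the real schema and prompts are defined in rag_pipeline.py.

```python
# Condensed sketch of the retrieval + analysis steps. Class name "Incident",
# property names, embedding model, and prompt wording are assumptions for
# illustration; the actual implementation lives in rag_pipeline.py.
import requests
import weaviate
from sentence_transformers import SentenceTransformer

client = weaviate.Client("http://localhost:8080")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def search_similar_incidents(symptoms: str, limit: int = 2) -> list[dict]:
    """Embed the symptoms and run a nearVector search over stored incidents."""
    vector = embedder.encode(symptoms).tolist()
    result = (
        client.query.get("Incident", ["title", "rootCause", "resolution", "timeToFix"])
        .with_near_vector({"vector": vector})
        .with_limit(limit)
        .do()
    )
    return result["data"]["Get"]["Incident"]

def analyze_incident(symptoms: str) -> str:
    """Build a context-augmented prompt and send it to Llama 2 via Ollama."""
    similar = search_similar_incidents(symptoms)
    context = "\n".join(
        f"- {inc['title']}: root cause {inc['rootCause']}; "
        f"fixed by {inc['resolution']} in {inc['timeToFix']} minutes"
        for inc in similar
    )
    prompt = (
        f"Current incident: {symptoms}\n\n"
        f"Similar past incidents:\n{context}\n\n"
        "Suggest the most likely root cause, immediate actions, a permanent fix, "
        "and a time estimate. Include a confidence level."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

print(analyze_incident("OutOfMemoryError, GC pause 2.3s, cache lookups 4x slower"))
```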
Step 5: Load Historical Incidents
The RAG system comes with sample historical incidents built in:
- INC-001: Memory leak in cache service
- INC-002: Database connection pool exhaustion
- INC-003: CPU spike from unoptimized loop
These are loaded automatically when you run the system via the load_sample_incidents() method in the RAG pipeline.
To use your own incident history, you can:
- Call add_incident_to_history() for each incident programmatically
- Batch load incidents from your incident management system
- Import from CSV or JSON files
Each incident should include:
- Incident ID: Unique identifier (INC-001, INC-002, etc.)
- Title: Short description
- Root Cause: What actually caused it
- Resolution: How it was fixed
- Time to Fix: How long it took (in minutes)
- Symptoms: Observable symptoms (OutOfMemoryError, high latency, etc.)
All of this is handled in the main RAG pipeline code. Check the rag_pipeline.py for the complete implementation.
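As an illustration of what loading your own history could look like: the field names below mirror the list above, but the constructor and the add_incident_to_history() signature are assumptions, so check rag_pipeline.py for the actual interface.

```python
# Hypothetical sketch of loading your own incidents. The field names mirror the
# list above; the constructor and add_incident_to_history() signature are assumed.
import json

from rag_pipeline import IncidentAnalyzerRAG  # from the repository

rag = IncidentAnalyzerRAG()

incident = {
    "incident_id": "INC-004",
    "title": "Latency spike after deploy",
    "root_cause": "N+1 query introduced in ORM mapping",
    "resolution": "Added eager loading and a covering index",
    "time_to_fix": 60,  # minutes
    "symptoms": "p99 latency 4x baseline, DB CPU at 90%, slow query log growing",
}
rag.add_incident_to_history(**incident)

# Batch import from a JSON export of your incident management system:
with open("incidents.json") as f:
    for record in json.load(f):
        rag.add_incident_to_history(**record)
```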
Step 6: Run Your First Analysis
Start the two services in different terminals:
Terminal 1: Start Ollama
ollama serve
Terminal 2: Start Weaviate and run RAG
docker-compose up -d && sleep 45
python src/rag_pipeline.py
The script will:
- Initialize the RAG system
- Set up the incident schema
- Load sample incidents
- Analyze a test incident
- Display the results
Expected Output
======================================================================
LOCAL RAG SYSTEM - NO API KEYS NEEDED
======================================================================
🚀 Starting Local RAG (No API Keys!)
✓ Connected to Weaviate
✓ Embedding model loaded
✓ Ollama connected
🎉 Local RAG System Ready (NO API KEYS!)
Cleaned up old data
✓ Schema created
✓ Loaded 3 incidents
======================================================================
TESTING LOCAL RAG WITH LOCAL LLM
======================================================================
📌 Query: Memory error OutOfMemory heap space GC pause
📚 Similar Incidents Found: 2
- Memory leak in cache service
- CPU spike from unoptimized loop
🤖 Analyzing incident with Ollama (Llama 2)...
💡 Local LLM Analysis:
======================================================================
Based on the current incident and similar past incidents, I recommend:
1. ROOT CAUSE (95% confidence):
Cache memory leak due to missing eviction policy. Similar to INC-001.
2. IMMEDIATE ACTIONS:
- Restart cache service (5 minutes)
- Monitor memory and GC behavior
3. PERMANENT FIX:
Reimplement LRU cache eviction policy (45 minutes total)
4. TIME ESTIMATE: 45 minutes based on historical data
======================================================================
How It Works: The RAG Difference
Without RAG, you’d ask Llama 2:
“We have OutOfMemoryError, GC pause 2.3s, slow cache. What’s the issue?”
Llama 2 would give a generic answer based on its training data.
With RAG, you ask:
“We have OutOfMemoryError, GC pause 2.3s, slow cache. Here’s what happened on January 15th with 85% similarity: Memory leak from missing cache eviction. Here’s how we fixed it…”
Llama 2 now becomes an expert analyst with your specific context. It can confidently say:
- Root cause: 95% confidence (because it matches historical pattern)
- Fix: Reimplemented LRU cache eviction
- Time estimate: 45 minutes (based on historical data)
This is the magic of RAG: combining your knowledge base with LLM reasoning.
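In code, that difference is nothing more than prompt construction. Here is a hedged sketch: same local endpoint, same model, with the January 15th context hard-coded purely for illustration.

```python
# The only difference between "without RAG" and "with RAG" is the prompt you send.
# Same local endpoint, same model; the retrieved history is just prepended as context.
import requests

def ask_llama2(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

question = "We have OutOfMemoryError, GC pause 2.3s, slow cache. What's the issue?"

generic_answer = ask_llama2(question)  # no context: generic troubleshooting advice

context = (
    "On January 15th we saw an 85% similar incident: memory leak from a missing "
    "cache eviction policy, fixed by adding LRU eviction (45 minutes)."
)
grounded_answer = ask_llama2(f"{question}\n\nRelevant history:\n{context}")
```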
Performance Characteristics
Here’s what to expect:
- Analysis latency: 8–17 seconds
- Vector search time: 100–500 ms
- LLM inference: 5–12 seconds
- Memory usage: 4–6 GB
- Storage needed: 15 GB
- Cost per analysis: $0.00
- Monthly cost: $0.00
Why is it fast?
- Llama 2 is small (7B parameters) but efficient
- Semantic search in vector database is optimized
- Everything runs locally (no network latency)
Why might the first query be slower?
- The first query initializes the models (~3–5 seconds extra)
- Subsequent queries reuse the cached models and skip that cost
Scaling to Production: When to Upgrade
Once you’ve validated your incident analysis approach locally, you can upgrade to paid models for potentially better accuracy.
Claude API
- Accuracy: Excellent (trained on broader data)
- Cost: ~$0.004 per analysis (using Claude 3.5 Sonnet)
- Integration: Simple code change (3–4 lines)
GPT-4
- Accuracy: State-of-the-art
- Cost: ~$0.03 per analysis
- Integration: Simple code change (3–4 lines)
The beautiful part: Your RAG pipeline stays exactly the same. You’re only swapping the LLM component. The embeddings, vector search, and context building all remain unchanged.
Start with free Llama 2. When you need better accuracy, upgrade to Claude. Your architecture supports it from day one.
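One way to keep that swap down to a few lines is to isolate the LLM call behind a single function, as in the sketch below. Only the Ollama path reflects what this article builds; the Claude path is a hedged example based on the public Anthropic Python SDK, and the model identifier is just an example.

```python
# Sketch of isolating the LLM call so swapping backends stays a few-line change.
# Only the Ollama path matches this article; the Claude path is an assumed example
# based on the public Anthropic Python SDK.
import requests

def generate(prompt: str, backend: str = "ollama") -> str:
    if backend == "ollama":
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama2", "prompt": prompt, "stream": False},
            timeout=300,
        )
        return resp.json()["response"]
    if backend == "claude":
        from anthropic import Anthropic  # pip install anthropic; needs ANTHROPIC_API_KEY

        client = Anthropic()
        msg = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # example model id
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    raise ValueError(f"unknown backend: {backend}")
```

Everything upstream of generate() — embeddings, vector search, context building — never changes.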
Key Takeaways
- RAG amplifies LLM capabilities: Free Llama 2 becomes an expert analyst when combined with your incident history
- Local-first saves money: Zero API costs while you’re learning and validating
- Easy to upgrade: When you need better quality, swap Ollama for Claude with minimal code changes
- Data privacy: Sensitive incidents never leave your infrastructure
- Fast iteration: Modify prompts, adjust incident data, and experiment instantly
Conclusion
You now understand how to build a production-grade incident analysis system that:
- Learns from your historical incident data
- Performs semantic search to find similar past issues
- Analyzes new incidents using open-source Llama 2
- Requires zero API keys and costs nothing to run
- Is ready to scale to paid models when needed
The best part? You understand every component. This is genuine RAG architecture — not an abstraction layer hiding the complexity.
When you’re ready to upgrade to Claude or GPT-4, you’re not learning a new system or switching platforms. You’re simply swapping one component while everything else stays the same.
Start with free. Learn thoroughly. Validate your approach. Scale when proven. That’s the future of building with AI.
Resources
- GitLab Repository: rag-incident-analyzer
- Quick Start: README with setup instructions
- Core Implementation: rag_pipeline.py
External Resources:
- Ollama Documentation
- Weaviate Vector Database
- Sentence Transformers
- RAG Architecture (LangChain)
- Llama 2 Model