Building a Zero-Hallucination RAG Agent: Custom LangChain vs Pre-Built Tools
The Problem
I needed a RAG agent to answer questions about my 27 portfolio projects. I started with Flowise, a popular no-code RAG platform, but it consistently hallucinated fake projects:
- Invented “Project Alpha”, “Project Beta” (Greek alphabet naming)
- Made up technologies I never used (Azure, cryptocurrency bots, VoIP applications)
- Created plausible-sounding but completely false project descriptions
Even with temperature 0.1, explicit system prompts, and correct chunk retrieval, Flowise’s Conversational Retrieval QA Chain prioritized conversational fluency over factual accuracy. The LLM would “fill gaps” with creative writing rather than admitting “I don’t know.”
For a portfolio where accuracy equals credibility, this was unacceptable.
Solution Architecture
I built a custom LangChain implementation with hallucination prevention at the architectural level, not just the prompt level.
System Design
```
User Query
    ↓
Query Router (detect metadata vs semantic queries)
    ↓
    ├─ Metadata Query Path (NO LLM)
    │   └─ Direct SQLite query → Return 27 titles
    │
    └─ Semantic Query Path (LLM with strict grounding)
        ↓
    Chroma Vector Store (153 pre-embedded chunks)
        ↓
    Retrieve top-k chunks (MMR for diversity)
        ↓
    Filter by relevance threshold (>0.5)
        ↓
    Format context with full metadata
        ↓
    GPT-4o-mini (temp=0, strict grounding prompt)
        ↓
    Validate response (citations present?)
        ↓
    Return answer with sources
```
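The router at the top of this pipeline can be as simple as keyword heuristics. Here is a minimal sketch; the patterns and function name are illustrative, not the project's actual rules:

```python
# Hypothetical routing heuristics: metadata-style questions skip the
# LLM entirely and go straight to the store.
METADATA_PATTERNS = ("list all", "how many projects", "all project titles")

def route_query(query: str) -> str:
    q = query.lower()
    if any(pattern in q for pattern in METADATA_PATTERNS):
        return "metadata"   # direct store lookup, no LLM involved
    return "semantic"       # retrieval + grounded generation
```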
Key Technical Decisions
1. Metadata-First Architecture
Instead of asking the LLM “what are all the project titles?”, I store titles as metadata during ingestion and query them directly:
```python
from typing import List

def list_all_projects(self) -> List[str]:
    """Pure metadata query - impossible to hallucinate"""
    # k exceeds the 153-chunk total, so every stored chunk comes back;
    # titles are deduplicated from the attached metadata.
    all_docs = self.vectorstore.similarity_search("", k=500)
    titles = {doc.metadata.get('title') for doc in all_docs
              if doc.metadata.get('title')}
    return sorted(titles)
```
Because the LLM never touches this path, factual lookups have a 0% hallucination rate by construction.
2. YAML Frontmatter Extraction
Each portfolio markdown file has structured metadata:
```yaml
---
layout: post
title: "Building Production-Ready Fraud Detection"
date: 2025-09-28
---
```
The ingestion pipeline extracts this before chunking, attaching it to every chunk from that document:
```python
import re
import yaml

def parse_frontmatter(self, content: str):
    """Split YAML frontmatter from the markdown body."""
    match = re.match(r'^---\s*\n(.*?)\n---\s*\n(.*)$', content, re.DOTALL)
    if match:
        metadata = yaml.safe_load(match.group(1))
        # Convert datetime objects to strings for Chroma, which only
        # accepts primitive metadata types
        for key, value in metadata.items():
            if hasattr(value, 'isoformat'):
                metadata[key] = value.isoformat()
        return metadata, match.group(2)
    return {}, content  # no frontmatter: empty metadata, untouched body
```
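The attachment step might look like the following minimal sketch, assuming the splitter configured in the chunking section below; `Document` is LangChain's standard container, and `chunk_document` is a hypothetical helper name:

```python
from langchain.schema import Document

def chunk_document(self, content: str, filename: str):
    """Attach the file's frontmatter metadata to every chunk."""
    metadata, body = self.parse_frontmatter(content)
    metadata['source'] = filename
    return [
        Document(page_content=chunk, metadata=dict(metadata))
        for chunk in self.splitter.split_text(body)
    ]
```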
3. Strict Grounding System Prompt
The system prompt enforces factual constraints:
```python
system_prompt = """You are a factual assistant for Paulo Cavallo's portfolio.

STRICT RULES:
1. Answer ONLY using the provided context chunks
2. If context doesn't contain the answer, say: "I don't have sufficient information"
3. ALWAYS cite source filenames
4. DO NOT add information from your training data
5. DO NOT make up project names or details
6. DO NOT make subjective judgments without objective metrics

It's better to say "I don't know" than to hallucinate."""
```
Temperature is set to 0 for deterministic output.
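A sketch of how the prompt and temperature setting wire together under LangChain 0.1.x, assuming the langchain-openai integration package; the project's actual chain construction may differ:

```python
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# LCEL pipe: the formatted prompt flows straight into the model
chain = prompt | llm
```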
4. Response Validation
Before returning an answer, the system checks:
```python
has_citation = any(marker in answer for marker in ['Source:', '.md'])
has_disclaimer = any(phrase in answer.lower()
                     for phrase in ["i don't have", "insufficient", "don't know"])

if not has_citation and not has_disclaimer:
    log_warning("Response lacks citations")
```
Technical Stack
- LangChain 0.1.20: full control over the RAG pipeline
- Chroma 0.4.15: local vector database, persistent
- OpenAI text-embedding-3-small: $0.0006 for all 153 chunks
- OpenAI GPT-4o-mini: $0.00015 per 1K tokens
- Gradio 4.44: web UI
- Python 3.10: core implementation
Chunking Strategy
- Chunk size: 3000 characters (preserves narrative context)
- Overlap: 500 characters (prevents information loss at boundaries)
- Splitter: RecursiveCharacterTextSplitter with semantic separators
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=500,
    separators=["\n\n", "\n", ". ", " ", ""]  # tried in priority order
)
```
For 27 documents (~150K characters total), this creates 153 chunks (avg 5.7 per document).
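Those statistics are easy to verify at ingestion time; a quick check, assuming `chunks` is the list of LangChain documents produced by the splitter, each carrying a `source` in its metadata:

```python
from collections import Counter

per_doc = Counter(chunk.metadata["source"] for chunk in chunks)
avg = len(chunks) / len(per_doc)
print(f"{len(chunks)} chunks across {len(per_doc)} docs, {avg:.1f} avg")
```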
Retrieval Configuration
- Strategy: Maximal Marginal Relevance (MMR)
- Top-k: 10 chunks retrieved per query
- Lambda: 0.5 (balances relevance vs diversity)
- Threshold: 0.5 minimum relevance score
MMR ensures results span multiple projects rather than returning 10 chunks from a single highly-relevant project.
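A minimal sketch of this configuration with LangChain's Chroma wrapper; `fetch_k` (the candidate pool MMR re-ranks for diversity) is my assumption, and since MMR search doesn't return scores, the threshold is enforced through a separate scored lookup:

```python
# MMR retrieval: k final chunks drawn from a larger candidate pool
docs = vectorstore.max_marginal_relevance_search(
    query, k=10, fetch_k=30, lambda_mult=0.5
)

# Apply the >0.5 relevance cutoff; only chunks above it survive
scored = vectorstore.similarity_search_with_relevance_scores(query, k=10)
relevant = {doc.page_content for doc, score in scored if score > 0.5}
docs = [d for d in docs if d.page_content in relevant]
```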
Deployment: Auto-Rebuild Pattern
Hugging Face Spaces deployment presented a challenge: Chroma version mismatches between local (0.4.22) and HF environment caused sqlite3.OperationalError: no such column errors.
Solution: Auto-rebuild on startup if database is incompatible:
```python
import os
import shutil

needs_rebuild = False

if not os.path.exists(chroma_path):
    needs_rebuild = True
else:
    try:
        vectorstore = load_vectorstore_simple()
    except Exception as e:
        # Schema mismatch between Chroma versions surfaces here
        print(f"Vectorstore incompatible: {e}")
        shutil.rmtree(chroma_path)
        needs_rebuild = True

if needs_rebuild:
    _, chunks = ingest_main()  # Re-fetch source documents from GitHub
    vectorstore = create_vectorstore_simple(chunks)
```
First startup takes ~60 seconds and costs $0.0006. Subsequent startups load the cached database instantly.
Test Results: 0% Hallucination Rate
I tested with queries designed to expose hallucination:
| Query | Flowise Result | Custom Agent Result | Status |
|---|---|---|---|
| “What projects use Azure?” | Invented 3 fake Azure projects | “I don’t have sufficient information” | ✅ PASS |
| “List all projects” | 28 titles (1 hallucinated) | 27 accurate titles | ✅ PASS |
| “What’s most complex?” | Subjective claim without evidence | “I don’t have sufficient information” | ✅ PASS |
| “Tell me about fraud detection” | Generic ML description | 2 specific projects with citations | ✅ PASS |
| “What projects use AWS?” | Mixed real + fake projects | 4 real AWS projects with details | ✅ PASS |
Hallucination rate: 0%
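These checks are straightforward to automate. A hypothetical regression test along the lines of the table above, assuming an `agent.ask()` interface (the method names are illustrative):

```python
def test_absent_technology_triggers_disclaimer():
    # Azure appears in no portfolio project, so the grounded agent
    # must refuse rather than invent one.
    answer = agent.ask("What projects use Azure?")
    assert "don't have sufficient information" in answer.lower()

def test_list_all_projects_is_exact():
    # The metadata path must return exactly 27 unique titles.
    titles = agent.list_all_projects()
    assert len(titles) == 27 and len(set(titles)) == 27
```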
Cost Analysis
One-time setup:
- Embedding 153 chunks: $0.0006
- Initial testing: $0.01
- Total: ~$0.01

Monthly usage (100 queries):
- Retrieval: free (local Chroma)
- LLM generation: 100 × ~1K tokens × $0.00015/1K = $0.015
- Total: ~$0.02/month

Scaling:
- 1,000 queries/month: ~$0.15
- 10,000 queries/month: ~$1.50
Key Lessons
- Architectural grounding > Prompt engineering
No amount of prompt engineering could fix Flowise’s conversational chains. The solution required architectural changes: separating metadata queries from semantic queries, validating responses, and defaulting to “I don’t know.”
- Pre-built tools optimize for the wrong metrics
Flowise optimizes for conversational engagement (“always give an answer”). For portfolio credibility, accuracy matters more than helpfulness. Custom implementations let you choose your optimization target.
- Metadata is architectural truth
By storing project titles as structured metadata and querying them directly, you eliminate an entire class of hallucination. The LLM never gets a chance to invent project names.
- “I don’t know” is a feature
Honest uncertainty builds more trust than plausible-sounding fabrications. The system refuses to answer subjective questions (“most complex project?”) without objective metrics in the context.
- Version compatibility matters in production
The Chroma database schema changed between versions, causing deployment failures. The auto-rebuild pattern solves this: detect incompatibility, rebuild once, then cache for future startups.

Future Enhancements
- Multi-query retrieval: generate 3 query variations to improve recall (see the sketch after this list)
- Conversation memory: Maintain context across turns while preserving grounding
- Project comparison: Dedicated function to compare two projects side-by-side
- Advanced analytics: Track which projects get queried most, query success rates
- Relevance feedback: Let users flag incorrect answers to improve retrieval
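For the first of these, LangChain already ships a building block. A sketch of how multi-query retrieval could bolt onto the existing setup, assuming the `llm` and `retriever`-style objects configured earlier:

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Asks the LLM to rephrase the user's question several ways, runs all
# variants against the vector store, and merges the unique results.
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_type="mmr",
                                       search_kwargs={"k": 10}),
    llm=llm,
)
docs = multi_retriever.get_relevant_documents("fraud detection work")
```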
Conclusion
Custom RAG implementations require more upfront effort than no-code tools, but they’re essential when accuracy is critical. By building hallucination prevention into the architecture—not just the prompts—you create systems that prioritize trustworthiness over conversational polish.
The cost difference is negligible (~$0.02/month). The trust difference is everything.