Building a Zero-Hallucination RAG Agent: Custom LangChain vs Pre-Built Tools
The Problem
I needed a RAG agent to answer questions about my 27 portfolio projects. I started with Flowise, a popular no-code RAG platform, but it consistently hallucinated fake projects:
- Invented “Project Alpha”, “Project Beta” (Greek alphabet naming)
- Made up technologies I never used (Azure, cryptocurrency bots, VoIP applications)
- Created plausible-sounding but completely false project descriptions
Even with temperature 0.1, explicit system prompts, and correct chunk retrieval, Flowise’s Conversational Retrieval QA Chain prioritized conversational fluency over factual accuracy. The LLM would “fill gaps” with creative writing rather than admitting “I don’t know.”
For a portfolio where accuracy equals credibility, this was unacceptable.
Solution Architecture
I built a custom LangChain implementation with hallucination prevention at the architectural level, not just the prompt level.
System Design
```
User Query
    ↓
Query Router (detect metadata vs semantic queries)
    ↓
    ├─ Metadata Query Path (NO LLM)
    │   └─ Direct SQLite query → Return 27 titles
    │
    └─ Semantic Query Path (LLM with strict grounding)
        ↓
    Chroma Vector Store (153 pre-embedded chunks)
        ↓
    Retrieve top-k chunks (MMR for diversity)
        ↓
    Filter by relevance threshold (>0.5)
        ↓
    Format context with full metadata
        ↓
    GPT-4o-mini (temp=0, strict grounding prompt)
        ↓
    Validate response (citations present?)
        ↓
    Return answer with sources
```
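The router at the top of this pipeline can be as simple as keyword heuristics. Here is a minimal sketch; the patterns and function name are illustrative, not the project's actual rules:

```python
# Hypothetical routing heuristics: metadata-style questions skip the
# LLM entirely and go straight to the store.
METADATA_PATTERNS = ("list all", "how many projects", "all project titles")

def route_query(query: str) -> str:
    q = query.lower()
    if any(pattern in q for pattern in METADATA_PATTERNS):
        return "metadata"   # direct store lookup, no LLM involved
    return "semantic"       # retrieval + grounded generation
```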
Key Technical Decisions
1. Metadata-First Architecture
Instead of asking the LLM “what are all the project titles?”, I store titles as metadata during ingestion and query them directly:
```python
from typing import List

def list_all_projects(self) -> List[str]:
    """Pure metadata query - impossible to hallucinate"""
    # k exceeds the 153-chunk total, so every stored chunk comes back;
    # titles are deduplicated from the attached metadata.
    all_docs = self.vectorstore.similarity_search("", k=500)
    titles = {doc.metadata.get('title') for doc in all_docs
              if doc.metadata.get('title')}
    return sorted(titles)
```
Because the LLM never touches this path, factual lookups have a 0% hallucination rate by construction.
2. YAML Frontmatter Extraction
Each portfolio markdown file has structured metadata:
```yaml
---
layout: post
title: "Building Production-Ready Fraud Detection"
date: 2025-09-28
---
```
The ingestion pipeline extracts this before chunking, attaching it to every chunk from that document:
```python
import re
import yaml

def parse_frontmatter(self, content: str):
    """Split YAML frontmatter from the markdown body."""
    match = re.match(r'^---\s*\n(.*?)\n---\s*\n(.*)$', content, re.DOTALL)
    if match:
        metadata = yaml.safe_load(match.group(1))
        # Convert datetime objects to strings for Chroma, which only
        # accepts primitive metadata types
        for key, value in metadata.items():
            if hasattr(value, 'isoformat'):
                metadata[key] = value.isoformat()
        return metadata, match.group(2)
    return {}, content  # no frontmatter: empty metadata, untouched body
```
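The attachment step might look like the following minimal sketch, assuming the splitter configured in the chunking section below; `Document` is LangChain's standard container, and `chunk_document` is a hypothetical helper name:

```python
from langchain.schema import Document

def chunk_document(self, content: str, filename: str):
    """Attach the file's frontmatter metadata to every chunk."""
    metadata, body = self.parse_frontmatter(content)
    metadata['source'] = filename
    return [
        Document(page_content=chunk, metadata=dict(metadata))
        for chunk in self.splitter.split_text(body)
    ]
```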
3. Strict Grounding System Prompt
The system prompt enforces factual constraints:
```python
system_prompt = """You are a factual assistant for Paulo Cavallo's portfolio.

STRICT RULES:
1. Answer ONLY using the provided context chunks
2. If context doesn't contain the answer, say: "I don't have sufficient information"
3. ALWAYS cite source filenames
4. DO NOT add information from your training data
5. DO NOT make up project names or details
6. DO NOT make subjective judgments without objective metrics

It's better to say "I don't know" than to hallucinate."""
```
Temperature is set to 0 for deterministic output.
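A sketch of how the prompt and temperature setting wire together under LangChain 0.1.x, assuming the langchain-openai integration package; the project's actual chain construction may differ:

```python
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# LCEL pipe: the formatted prompt flows straight into the model
chain = prompt | llm
```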
4. Response Validation
Before returning an answer, the system checks:
```python
has_citation = any(marker in answer for marker in ['Source:', '.md'])
has_disclaimer = any(phrase in answer.lower()
                     for phrase in ["i don't have", "insufficient", "don't know"])

if not has_citation and not has_disclaimer:
    log_warning("Response lacks citations")
```
Technical Stack
- LangChain 0.1.20: full control over the RAG pipeline
- Chroma 0.4.15: local vector database, persistent
- OpenAI text-embedding-3-small: $0.0006 for all 153 chunks
- OpenAI GPT-4o-mini: $0.00015 per 1K tokens
- Gradio 4.44: web UI
- Python 3.10: core implementation
Chunking Strategy
- Chunk size: 3000 characters (preserves narrative context)
- Overlap: 500 characters (prevents information loss at boundaries)
- Splitter: RecursiveCharacterTextSplitter with semantic separators
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=500,
    separators=["\n\n", "\n", ". ", " ", ""]  # tried in priority order
)
```
For 27 documents (~150K characters total), this creates 153 chunks (avg 5.7 per document).
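Those statistics are easy to verify at ingestion time; a quick check, assuming `chunks` is the list of LangChain documents produced by the splitter, each carrying a `source` in its metadata:

```python
from collections import Counter

per_doc = Counter(chunk.metadata["source"] for chunk in chunks)
avg = len(chunks) / len(per_doc)
print(f"{len(chunks)} chunks across {len(per_doc)} docs, {avg:.1f} avg")
```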
Retrieval Configuration
- Strategy: Maximal Marginal Relevance (MMR)
- Top-k: 10 chunks retrieved per query
- Lambda: 0.5 (balances relevance vs diversity)
- Threshold: 0.5 minimum relevance score
MMR ensures results span multiple projects rather than returning 10 chunks from a single highly-relevant project.
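A minimal sketch of this configuration with LangChain's Chroma wrapper; `fetch_k` (the candidate pool MMR re-ranks for diversity) is my assumption, and since MMR search doesn't return scores, the threshold is enforced through a separate scored lookup:

```python
# MMR retrieval: k final chunks drawn from a larger candidate pool
docs = vectorstore.max_marginal_relevance_search(
    query, k=10, fetch_k=30, lambda_mult=0.5
)

# Apply the >0.5 relevance cutoff; only chunks above it survive
scored = vectorstore.similarity_search_with_relevance_scores(query, k=10)
relevant = {doc.page_content for doc, score in scored if score > 0.5}
docs = [d for d in docs if d.page_content in relevant]
```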
Deployment: Auto-Rebuild Pattern
Hugging Face Spaces deployment presented a challenge: Chroma version mismatches between local (0.4.22) and HF environment caused sqlite3.OperationalError: no such column errors.
Solution: Auto-rebuild on startup if database is incompatible:
```python
import os
import shutil

needs_rebuild = False

if not os.path.exists(chroma_path):
    needs_rebuild = True
else:
    try:
        vectorstore = load_vectorstore_simple()
    except Exception as e:
        # Schema mismatch between Chroma versions surfaces here
        print(f"Vectorstore incompatible: {e}")
        shutil.rmtree(chroma_path)
        needs_rebuild = True

if needs_rebuild:
    _, chunks = ingest_main()  # Re-fetch source documents from GitHub
    vectorstore = create_vectorstore_simple(chunks)
```
First startup takes ~60 seconds and costs $0.0006. Subsequent startups load the cached database instantly.
Test Results: 0% Hallucination Rate
I tested with queries designed to expose hallucination:
| Query | Flowise Result | Custom Agent Result | Status |
|---|---|---|---|
| “What projects use Azure?” | Invented 3 fake Azure projects | “I don’t have sufficient information” | ✅ PASS |
| “List all projects” | 28 titles (1 hallucinated) | 27 accurate titles | ✅ PASS |
| “What’s most complex?” | Subjective claim without evidence | “I don’t have sufficient information” | ✅ PASS |
| “Tell me about fraud detection” | Generic ML description | 2 specific projects with citations | ✅ PASS |
| “What projects use AWS?” | Mixed real + fake projects | 4 real AWS projects with details | ✅ PASS |
Hallucination rate: 0%
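These checks are straightforward to automate. A hypothetical regression test along the lines of the table above, assuming an `agent.ask()` interface (the method names are illustrative):

```python
def test_absent_technology_triggers_disclaimer():
    # Azure appears in no portfolio project, so the grounded agent
    # must refuse rather than invent one.
    answer = agent.ask("What projects use Azure?")
    assert "don't have sufficient information" in answer.lower()

def test_list_all_projects_is_exact():
    # The metadata path must return exactly 27 unique titles.
    titles = agent.list_all_projects()
    assert len(titles) == 27 and len(set(titles)) == 27
```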
Cost Analysis
One-time setup:
- Embedding 153 chunks: $0.0006
- Initial testing: $0.01
- Total: ~$0.01

Monthly usage (100 queries):
- Retrieval: free (local Chroma)
- LLM generation: 100 × ~1K tokens × $0.00015/1K = $0.015
- Total: ~$0.02/month

Scaling:
- 1,000 queries/month: ~$0.15
- 10,000 queries/month: ~$1.50
Key Lessons
- Architectural grounding > Prompt engineering
No amount of prompt engineering could fix Flowise’s conversational chains. The solution required architectural changes: separating metadata queries from semantic queries, validating responses, and defaulting to “I don’t know.”
- Pre-built tools optimize for the wrong metrics
Flowise optimizes for conversational engagement (“always give an answer”). For portfolio credibility, accuracy matters more than helpfulness. Custom implementations let you choose your optimization target.
- Metadata is architectural truth
By storing project titles as structured metadata and querying them directly, you eliminate an entire class of hallucination. The LLM never gets a chance to invent project names.
- “I don’t know” is a feature
Honest uncertainty builds more trust than plausible-sounding fabrications. The system refuses to answer subjective questions (“most complex project?”) without objective metrics in the context.
- Version compatibility matters in production
The Chroma database schema changed between versions, causing deployment failures. The auto-rebuild pattern solves this: detect incompatibility, rebuild once, then cache for future startups.

Future Enhancements
- Multi-query retrieval: generate 3 query variations to improve recall (see the sketch after this list)
- Conversation memory: Maintain context across turns while preserving grounding
- Project comparison: Dedicated function to compare two projects side-by-side
- Advanced analytics: Track which projects get queried most, query success rates
- Relevance feedback: Let users flag incorrect answers to improve retrieval
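For the first of these, LangChain already ships a building block. A sketch of how multi-query retrieval could bolt onto the existing setup, assuming the `llm` and `retriever`-style objects configured earlier:

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Asks the LLM to rephrase the user's question several ways, runs all
# variants against the vector store, and merges the unique results.
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_type="mmr",
                                       search_kwargs={"k": 10}),
    llm=llm,
)
docs = multi_retriever.get_relevant_documents("fraud detection work")
```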
Conclusion
Custom RAG implementations require more upfront effort than no-code tools, but they’re essential when accuracy is critical. By building hallucination prevention into the architecture—not just the prompts—you create systems that prioritize trustworthiness over conversational polish.
The cost difference is negligible (~$0.02/month). The trust difference is everything.