EvalOps: Production-Grade LLM Evaluation and Observability Platform
EvalOps is a systematic evaluation framework for LLM applications that addresses the fundamental problem of non-deterministic outputs. Traditional software testing fails when “correct” answers can be phrased a thousand different ways. EvalOps provides semantic similarity matching using BERT embeddings, statistical drift detection, and A/B comparison with effect size calculation. The platform includes a full observability stack with LangSmith integration, structured logging, and a Streamlit dashboard for exploring results. Deployed on AWS via Docker, the system demonstrates production-ready MLOps practices with 285 tests passing and a live demo processing 24 evaluation runs across 470 test cases.
Live Demo
Dashboard: (EC2 instance was stopped to save costs)

Explore evaluation runs, drift detection, and A/B comparison across Q&A, classification, and summarization tasks.
Docker Hub: pmcavallo/evalops
```bash
docker pull pmcavallo/evalops:latest
docker run -p 8501:8501 pmcavallo/evalops:latest
```
The Problem
Traditional software testing doesn’t work for LLMs:
Non-Deterministic Outputs:
- Ask “What is the capital of France?” ten times, get ten slightly different phrasings
- “Paris”, “The capital is Paris”, “Paris is the capital of France” are all correct
- Simple string matching fails catastrophically
Scale Problem:
- Manual review of LLM outputs doesn’t scale beyond a few dozen cases
- Production systems generate thousands of outputs daily
- Quality degradation happens gradually and invisibly
The Drift Problem:
- Model updates, prompt changes, and API version bumps cause subtle quality shifts
- Without systematic measurement, you discover degradation when users complain
- By then, you’ve lost trust and potentially revenue
A/B Testing Complexity:
- “Is prompt A better than prompt B?” seems simple
- But statistical significance, effect size, and sample size all matter
- Most teams eyeball results or use inadequate testing
The Core Issue: LLM evaluation requires understanding meaning, not matching strings. It requires statistical rigor, not gut feelings. And it requires continuous monitoring, not one-time testing.
The Solution
EvalOps provides systematic evaluation with semantic understanding, statistical rigor, and production observability.
Why Not Just Use String Matching?
| Approach | Problem |
|---|---|
| Exact Match | “Paris” ≠ “The capital is Paris” (both correct) |
| Contains | “Paris is lovely” matches “Paris” (false positive) |
| Fuzzy Match | “Paris” vs “Pairs” scores high (typo, not semantic) |
| Semantic Similarity | ✅ “Paris” ≈ “The capital is Paris” (meaning matches) |
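To make those failure modes concrete, here is a quick illustration using only the Python standard library (difflib stands in for a generic fuzzy matcher; none of this is EvalOps code):
```python
from difflib import SequenceMatcher

print("Paris" == "The capital is Paris")                # False: exact match rejects a correct answer
print("Paris" in "Paris is lovely")                     # True: containment accepts a wrong one
print(SequenceMatcher(None, "Paris", "Pairs").ratio())  # ~0.8: fuzzy matching rewards a typo
```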
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ EVALOPS ARCHITECTURE │
│ │
│ ┌─────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Dataset │─────>│ EvalRunner │─────>│ Results │ │
│ │ (JSON/List) │ │ │ │ (SQLite/Postgres)│ │
│ └─────────────────┘ └──────────────────┘ └──────────────────┘ │
│ │ │ │ │
│ │ v │ │
│ │ ┌──────────────────┐ │ │
│ │ │ Metrics │ │ │
│ │ │ - SemanticSim │ │ │
│ │ │ - Accuracy │ │ │
│ │ │ - Latency │ │ │
│ │ │ - LLMJudge │ │ │
│ │ └──────────────────┘ │ │
│ │ │ │ │
│ v v v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ OBSERVABILITY LAYER │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ LangSmith │ │ Structured │ │ Metrics │ │ │
│ │ │ Tracing │ │ Logging │ │ Collector │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ COMPARISON ENGINE │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ A/B Testing │ │ Drift │ │ Regression │ │ │
│ │ │ (stats) │ │ Detection │ │ Testing │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ v │
│ ┌──────────────────┐ │
│ │ Streamlit │ │
│ │ Dashboard │ │
│ │ (AWS EC2) │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Features
Semantic Similarity with BERT Embeddings
The core insight: meaning matters, not strings. EvalOps uses sentence-transformers to compute semantic similarity:
```python
from evalops.core.metrics import SemanticSimilarity

metric = SemanticSimilarity(threshold=0.7)

# These are semantically equivalent
result = metric.evaluate(
    actual="Paris is the capital of France",
    expected="Paris"
)
# result.score = 0.82, result.passed = True

# These are semantically different
result = metric.evaluate(
    actual="London is a beautiful city",
    expected="Paris"
)
# result.score = 0.31, result.passed = False
```
How It Works:
- Convert both strings to 384-dimensional vectors using all-MiniLM-L6-v2
- Compute cosine similarity between the vectors
- Compare against a configurable threshold
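For reference, the same computation sketched directly with sentence-transformers, the library EvalOps builds on (the metric's internal code may differ):
```python
# Minimal sketch of the semantic similarity computation described above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional sentence embeddings

def semantic_score(actual: str, expected: str) -> float:
    emb = model.encode([actual, expected], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()   # cosine similarity between the two vectors

score = semantic_score("Paris is the capital of France", "Paris")
passed = score >= 0.7                            # configurable threshold
```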
Why BERT Over Alternatives:
| Method | Pros | Cons |
|---|---|---|
| TF-IDF | Fast, simple | No semantic understanding |
| Word2Vec | Captures some meaning | Word-level, not sentence-level |
| BERT Embeddings | True semantic understanding | Slightly slower (still <100ms) |
| LLM-as-Judge | Most nuanced | Expensive, slow, non-deterministic |
Drift Detection
Catch quality degradation before users do:
```python
from evalops.comparison import DriftDetector

detector = DriftDetector(
    baseline_run_id="prod_v1",
    alert_threshold=0.05  # Alert if pass rate drops 5%
)

result = detector.check(current_run)
if result.has_drift:
    print(f"⚠️ Drift detected: {result.baseline_pass_rate:.1%} → {result.current_pass_rate:.1%}")
    print(f"   Degraded cases: {result.degraded_case_ids}")
```
Drift Detection Algorithm:
```text
For each case in current_run:
  1. Find matching case in baseline (by input hash)
  2. Compare pass/fail status
  3. Track: improved, degraded, unchanged

Compute:
  - Pass rate delta
  - Statistical significance (chi-squared test)
  - Affected case breakdown

Alert if:
  - Pass rate dropped > threshold AND
  - Change is statistically significant (p < 0.05)
```
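A minimal sketch of that algorithm, assuming each run can be reduced to a mapping from input hash to pass/fail and using SciPy for the chi-squared test (the production DriftDetector is more involved):
```python
# Minimal drift check sketch; not the real DriftDetector internals.
from scipy.stats import chi2_contingency

def check_drift(baseline: dict[str, bool], current: dict[str, bool],
                alert_threshold: float = 0.05) -> dict:
    shared = baseline.keys() & current.keys()          # match cases by input hash
    degraded = [k for k in shared if baseline[k] and not current[k]]
    improved = [k for k in shared if not baseline[k] and current[k]]

    base_rate = sum(baseline[k] for k in shared) / len(shared)
    curr_rate = sum(current[k] for k in shared) / len(shared)

    # Chi-squared test on the 2x2 pass/fail contingency table
    table = [
        [sum(baseline[k] for k in shared), sum(not baseline[k] for k in shared)],
        [sum(current[k] for k in shared), sum(not current[k] for k in shared)],
    ]
    _, p_value, _, _ = chi2_contingency(table)

    has_drift = (base_rate - curr_rate) > alert_threshold and p_value < 0.05
    return {"has_drift": has_drift, "baseline_pass_rate": base_rate,
            "current_pass_rate": curr_rate, "degraded": degraded,
            "improved": improved, "p_value": p_value}
```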
A/B Comparison with Statistical Rigor
Not just “A is better than B” but “A is better than B with 95% confidence and medium effect size”:
```python
from evalops.comparison import ABComparison

comparison = ABComparison()
result = comparison.compare(
    baseline_run=run_a,
    candidate_run=run_b
)

print(f"Baseline pass rate: {result.baseline_pass_rate:.1%}")
print(f"Candidate pass rate: {result.candidate_pass_rate:.1%}")
print(f"Improvement: {result.improvement:.1%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Effect size (Cohen's h): {result.effect_size:.2f}")
print(f"Significant: {result.is_significant}")
```
Statistical Tests Used:
| Metric | Test | Why |
|---|---|---|
| Pass rate difference | Chi-squared | Binary outcome (pass/fail) |
| Effect size | Cohen’s h | Standardized measure for proportions |
| Confidence interval | Wilson score | Better than normal approximation for proportions |
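For reference, a small standard-library sketch of two of these quantities, Cohen's h and the Wilson interval, as they are usually defined (illustrative, not the exact EvalOps implementation):
```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: standardized effect size for a difference in proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; better behaved than the normal
    approximation at small n or extreme proportions."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(cohens_h(0.80, 0.70))     # ≈ 0.23: a small effect by Cohen's conventions
print(wilson_interval(80, 100)) # 95% interval around an observed 80% pass rate
```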
Observability Stack
Production-ready logging and tracing:
```python
from evalops.observability import LangSmithTracer, EvalLogger

# Distributed tracing
tracer = LangSmithTracer(project_name="my-evals")

# Structured logging
logger = EvalLogger(service_name="eval-service")

runner = EvalRunner(tracer=tracer, logger=logger)
result = await runner.evaluate(dataset, target_fn, metrics)

# Every evaluation is:
# - Traced in LangSmith with full context
# - Logged in structured JSON format
# - Stored with metadata for later analysis
```
Metrics
| Metric | Description | Use Case |
|---|---|---|
| Accuracy | Exact or fuzzy string match | Simple factual Q&A |
| SemanticSimilarity | BERT embedding cosine similarity | Open-ended responses |
| Latency | Response time threshold | Performance SLAs |
| ContainsKeywords | Required keywords present | Compliance checking |
| LLMJudge | LLM-as-judge evaluation | Complex quality assessment |
Custom Metrics
```python
from evalops.core.metrics import Metric, MetricResult

class ToxicityCheck(Metric):
    """Check if response contains toxic content."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.classifier = load_toxicity_model()

    def evaluate(self, actual: str, expected: str, **kwargs) -> MetricResult:
        score = self.classifier.predict(actual)
        return MetricResult(
            name="toxicity",
            score=1 - score,  # Invert so higher is better
            passed=score < self.threshold,
            details={"toxicity_score": score}
        )
```
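A custom metric then plugs into the runner exactly like the built-ins. A brief usage sketch, assuming runner, dataset, and my_llm are defined as in the other examples in this post:
```python
# Hypothetical wiring: mix a built-in metric with the custom ToxicityCheck.
result = await runner.evaluate(
    dataset=dataset,
    target=my_llm,
    metrics=[SemanticSimilarity(threshold=0.7), ToxicityCheck(threshold=0.3)],
)
print(result.pass_rate)
```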
Dashboard
The Streamlit dashboard provides visual exploration of evaluation results:
Overview Page:
- Total runs, cases, average pass rate
- Pass rate trends over time
- Recent runs with quick status
Run Explorer:
- Filter by date, tags, pass rate threshold
- Sort by various metrics
- Drill down into individual runs
Run Detail:
- Case-by-case breakdown
- Pass/fail distribution
- Latency statistics
- Full input/output/expected for each case
A/B Comparison:
- Side-by-side run comparison
- Statistical significance indicators
- Effect size visualization
- Case-level diff view
Drift Monitor:
- Baseline vs current comparison
- Trend visualization
- Alert configuration
- Degraded case identification
Deployment
Docker
Build locally:
```bash
docker build -t evalops .
docker run -p 8501:8501 evalops
```
Or use the pre-built image:
```bash
docker pull pmcavallo/evalops:latest
docker run -d -p 8501:8501 --name evalops pmcavallo/evalops:latest
```
Dockerfile Highlights:
```dockerfile
# CPU-optimized PyTorch for smaller image
RUN pip install --no-cache-dir \
    torch --index-url https://download.pytorch.org/whl/cpu

# Health check for container orchestration
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1
```
AWS Architecture
| Service | Purpose | Cost |
|---|---|---|
| EC2 (t3.micro) | Hosts Streamlit dashboard via Docker | Free tier |
| DynamoDB | Stores evaluation runs, cases, baselines | Free tier |
| IAM | Least-privilege access for deployment | Free |
| Security Groups | Ports 22 (SSH), 8501 (Streamlit) | Free |
Deployment Steps:
```bash
# 1. SSH into EC2
ssh -i evalops-key.pem ec2-user@<public-ip>

# 2. Install Docker
sudo dnf install docker -y
sudo systemctl start docker

# 3. Pull and run
docker pull pmcavallo/evalops:latest
docker run -d -p 8501:8501 --name evalops pmcavallo/evalops:latest
```
The Docker Story:
The first deployment attempt installed Python manually on EC2 and ran into:
- Python version mismatch (3.7 vs 3.11 required)
- Disk space issues (PyTorch is 900MB)
- Dependency chain failures (sentence-transformers → torch → …)
Docker solved all of this with one command. Build once locally, run anywhere identically.
Results
Test Coverage
| Category | Tests | Status |
|---|---|---|
| Core (Dataset, Runner) | 45 | ✅ Passing |
| Metrics | 38 | ✅ Passing |
| Comparison | 52 | ✅ Passing |
| Storage | 41 | ✅ Passing |
| API | 35 | ✅ Passing |
| CLI | 28 | ✅ Passing |
| Observability | 31 | ✅ Passing |
| Dashboard | 15 | ✅ Passing |
| Total | 285 | ✅ All Passing |
Demo Data
| Dataset | Cases | Description |
|---|---|---|
| Q&A | 20 | Factual question-answering |
| Classification | 15 | Sentiment/category classification |
| Summarization | 15 | Document summarization |

| Metric | Value |
|---|---|
| Total Runs | 24 |
| Total Cases | 470 |
| Average Pass Rate | 79.5% |
| Simulated Drift Events | 3 |
Tech Stack
┌─────────────────────────────────────────────────────────────┐
│ PRESENTATION LAYER │
│ Streamlit (Dashboard) │ FastAPI (REST API) │ Typer (CLI) │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ CORE LAYER │
│ sentence-transformers │ SQLAlchemy 2.0 │ Pydantic │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ OBSERVABILITY LAYER │
│ LangSmith │ structlog │ MetricsCollector │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER │
│ Docker │ AWS EC2 │ DynamoDB │ SQLite/PostgreSQL │
└─────────────────────────────────────────────────────────────┘
| Component | Technology | Purpose |
|---|---|---|
| Embeddings | sentence-transformers | BERT-based semantic similarity |
| API | FastAPI | REST endpoints |
| CLI | Typer + Rich | Command-line interface |
| Dashboard | Streamlit + Plotly | Visualization |
| Database | SQLAlchemy 2.0 | ORM with SQLite/PostgreSQL |
| Logging | structlog | Structured JSON logging |
| Tracing | LangSmith | Distributed tracing |
| Container | Docker | Reproducible deployment |
| Cloud | AWS (EC2, DynamoDB) | Production hosting |
Project Structure
evalops/
├── src/evalops/
│ ├── core/ # Dataset, Runner, Metrics
│ │ ├── dataset.py # EvalCase, EvalDataset
│ │ ├── runner.py # EvalRunner, EvalResult
│ │ ├── metrics.py # Accuracy, SemanticSimilarity, etc.
│ │ └── judge.py # LLMJudge, RubricJudge
│ ├── comparison/ # Statistical comparison
│ │ ├── ab_testing.py # ABComparison
│ │ ├── drift.py # DriftDetector
│ │ └── regression.py # RegressionTester
│ ├── storage/ # Persistence
│ │ ├── models.py # SQLAlchemy models
│ │ └── repository.py # EvalRepository
│ ├── observability/ # Logging & tracing
│ │ ├── langsmith.py # LangSmithTracer
│ │ └── logging.py # EvalLogger
│ ├── api/ # REST API
│ │ └── app.py # FastAPI application
│ ├── cli/ # Command-line
│ │ └── main.py # Typer commands
│ └── dashboard/ # Visualization
│ └── app.py # Streamlit application
├── tests/ # 285 unit tests
├── demo/ # Demo datasets and mock targets
├── Dockerfile # Container definition
├── docker-compose.yml # Multi-container setup
└── pyproject.toml # Package configuration
Usage
Basic Evaluation
```python
from evalops import EvalDataset, EvalRunner, SemanticSimilarity

# Load test cases
dataset = EvalDataset.from_json("test_cases.json")

# Your LLM function
async def my_llm(input_text: str) -> str:
    response = await client.chat(input_text)
    return response.content

# Run evaluation
runner = EvalRunner()
result = await runner.evaluate(
    dataset=dataset,
    target=my_llm,
    metrics=[SemanticSimilarity(threshold=0.7)]
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Failed cases: {[c.case_id for c in result.failed_cases]}")
```
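The test_cases.json format is not shown in this post; a plausible minimal shape, with field names that are illustrative rather than the exact EvalDataset schema, would be:
```python
# Illustrative only: writes a tiny dataset file with assumed field names.
import json

cases = [
    {"input": "What is the capital of France?", "expected": "Paris", "tags": ["qa"]},
    {"input": "Sentiment of: 'I loved this phone'", "expected": "positive", "tags": ["classification"]},
]

with open("test_cases.json", "w") as f:
    json.dump(cases, f, indent=2)
```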
CLI Usage
```bash
# Run evaluation
evalops run --dataset qa_cases.json --target my_module:llm_function

# Compare two runs
evalops compare --baseline RUN_A --candidate RUN_B

# Check for drift
evalops drift --baseline prod_v1 --dataset qa_cases

# Launch dashboard
evalops-dashboard
```
API Usage
```bash
# Start server
uvicorn evalops.api.app:app --reload

# Run evaluation via API
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"dataset_path": "qa_cases.json", "metrics": ["semantic_similarity"]}'

# Get run results
curl http://localhost:8000/runs/RUN_ID
```
Lessons Learned
Semantic Similarity is Not Perfect
BERT embeddings capture meaning well but have edge cases:
- Negations can score high (“Paris is great” vs “Paris is not great”)
- Domain-specific terminology may not embed well
- Very short responses lose context
Solution: Combine semantic similarity with other metrics. Use LLMJudge for nuanced cases.
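For instance, an embedding check can be paired with a keyword gate so that a required term is enforced literally even when the embedding score looks fine (constructor arguments here are illustrative, not confirmed against the EvalOps API):
```python
# Sketch: combine metrics so each covers the other's blind spots.
# runner, dataset, and my_llm are from the Basic Evaluation example above;
# the ContainsKeywords arguments are assumed.
metrics = [
    SemanticSimilarity(threshold=0.7),      # meaning-level check
    ContainsKeywords(keywords=["refund"]),  # literal requirement embeddings may blur
]
result = await runner.evaluate(dataset=dataset, target=my_llm, metrics=metrics)
```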
Statistical Rigor Matters
Early versions just compared pass rates. Problems:
- Small sample sizes gave misleading results
- No confidence intervals
- Effect size ignored (10% improvement on 1000 cases vs 10 cases)
Solution: Proper statistical tests with p-values, effect sizes, and confidence intervals.
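To make the sample-size point concrete, a quick sketch (SciPy assumed): the same ten-point pass-rate gap is statistical noise at 10 cases and strong evidence at 1,000.
```python
# Same observed gap (70% vs 80% pass rate), very different conclusions.
from scipy.stats import chi2_contingency

def gap_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    table = [[pass_a, n_a - pass_a], [pass_b, n_b - pass_b]]
    return chi2_contingency(table)[1]  # p-value of the chi-squared test

print(gap_p_value(7, 10, 8, 10))          # ~1.0 with Yates correction: no evidence
print(gap_p_value(700, 1000, 800, 1000))  # far below 0.05: a real difference
```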
Docker Saves Hours
Manual EC2 deployment attempt:
- 45 minutes debugging Python versions
- 30 minutes on disk space issues
- 20 minutes on dependency chains
Docker deployment:
- 5 minutes to pull and run
Lesson: Containerize early, not as an afterthought.
Test Everything
285 tests sounds like a lot. It’s not. Each test caught real bugs:
- Edge cases in metric calculations
- Database transaction issues
- API response format inconsistencies
Lesson: Comprehensive testing isn’t overhead, it’s insurance.
Future Improvements
- Real-Time Monitoring: WebSocket-based live evaluation streaming
- Multi-Model Comparison: Compare GPT-4 vs Claude vs Llama on same dataset
- Cost Tracking: Token usage and API cost per evaluation
- Scheduled Runs: Cron-based automated evaluation pipelines
- Slack/PagerDuty Integration: Alert on drift detection
- PostgreSQL Migration: Production database for team collaboration
License
MIT License - see LICENSE for details.
Links
- Live Demo: http://44.213.248.8:8501 (EC2 instance currently stopped to save costs; see the Docker instructions above)
- GitHub: github.com/pmcavallo/evalops
- Docker Hub: hub.docker.com/r/pmcavallo/evalops
“Traditional testing asks ‘did I get the exact right string?’ LLM evaluation asks ‘did I get a response that means the right thing?’ EvalOps bridges that gap with semantic understanding, statistical rigor, and production observability. Because in production, you don’t just need to know if your LLM works - you need to know the moment it stops working.”
Written on December 31, 2024
