EvalOps: Production-Grade LLM Evaluation and Observability Platform

EvalOps is a systematic evaluation framework for LLM applications that addresses the fundamental problem of non-deterministic outputs. Traditional software testing fails when “correct” answers can be phrased a thousand different ways. EvalOps provides semantic similarity matching using BERT embeddings, statistical drift detection, and A/B comparison with effect size calculation. The platform includes a full observability stack with LangSmith integration, structured logging, and a Streamlit dashboard for exploring results. Deployed on AWS via Docker, the system demonstrates production-ready MLOps practices with 285 tests passing and a live demo processing 24 evaluation runs across 470 test cases.


Live Demo

Dashboard: currently offline (the EC2 instance was stopped to save costs)


Explore evaluation runs, drift detection, and A/B comparison across Q&A, classification, and summarization tasks.

Docker Hub: pmcavallo/evalops

docker pull pmcavallo/evalops:latest
docker run -p 8501:8501 pmcavallo/evalops:latest

The Problem

Traditional software testing doesn’t work for LLMs:

Non-Deterministic Outputs:

  • Ask “What is the capital of France?” ten times, get ten slightly different phrasings
  • “Paris”, “The capital is Paris”, “Paris is the capital of France” are all correct
  • Simple string matching fails catastrophically

Scale Problem:

  • Manual review of LLM outputs doesn’t scale beyond a few dozen cases
  • Production systems generate thousands of outputs daily
  • Quality degradation happens gradually and invisibly

The Drift Problem:

  • Model updates, prompt changes, and API version bumps cause subtle quality shifts
  • Without systematic measurement, you discover degradation when users complain
  • By then, you’ve lost trust and potentially revenue

A/B Testing Complexity:

  • “Is prompt A better than prompt B?” seems simple
  • But statistical significance, effect size, and sample size all matter
  • Most teams eyeball results or use inadequate testing

The Core Issue: LLM evaluation requires understanding meaning, not matching strings. It requires statistical rigor, not gut feelings. And it requires continuous monitoring, not one-time testing.


The Solution

EvalOps provides systematic evaluation with semantic understanding, statistical rigor, and production observability.

Why Not Just Use String Matching?

Approach              Problem
Exact Match           “Paris” ≠ “The capital is Paris” (both correct)
Contains              “Paris is lovely” matches “Paris” (false positive)
Fuzzy Match           “Paris” vs “Pairs” scores high (typo, not semantic)
Semantic Similarity   ✅ “Paris” ≈ “The capital is Paris” (meaning matches)
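
To make the table concrete, here is a minimal, self-contained comparison of the first three approaches in plain Python; the example strings and scores are illustrative, not EvalOps defaults.

from difflib import SequenceMatcher

expected = "Paris"
actual = "The capital is Paris"

exact_match = actual == expected                          # False, even though the answer is correct
contains = expected in "Paris is lovely"                  # True: a false positive
fuzzy = SequenceMatcher(None, "Paris", "Pairs").ratio()   # ~0.8 for a typo with a different meaning
# Semantic similarity (see Key Features below) instead compares meaning, not characters.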

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           EVALOPS ARCHITECTURE                              │
│                                                                             │
│  ┌─────────────────┐      ┌──────────────────┐      ┌──────────────────┐    │
│  │ Dataset         │─────>│ EvalRunner       │─────>│ Results          │    │
│  │ (JSON/List)     │      │                  │      │ (SQLite/Postgres)│    │
│  └─────────────────┘      └──────────────────┘      └──────────────────┘    │
│         │                         │                          │              │
│         │                         v                          │              │
│         │                 ┌──────────────────┐               │              │
│         │                 │ Metrics          │               │              │
│         │                 │ - SemanticSim    │               │              │
│         │                 │ - Accuracy       │               │              │
│         │                 │ - Latency        │               │              │
│         │                 │ - LLMJudge       │               │              │
│         │                 └──────────────────┘               │              │
│         │                         │                          │              │
│         v                         v                          v              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        OBSERVABILITY LAYER                          │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │    │
│  │  │ LangSmith   │  │ Structured  │  │ Metrics     │                  │    │
│  │  │ Tracing     │  │ Logging     │  │ Collector   │                  │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                        │
│                                    v                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        COMPARISON ENGINE                            │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │    │
│  │  │ A/B Testing │  │ Drift       │  │ Regression  │                  │    │
│  │  │ (stats)     │  │ Detection   │  │ Testing     │                  │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                        │
│                                    v                                        │
│                          ┌──────────────────┐                               │
│                          │ Streamlit        │                               │
│                          │ Dashboard        │                               │
│                          │ (AWS EC2)        │                               │
│                          └──────────────────┘                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Features

Semantic Similarity with BERT Embeddings

The core insight: meaning matters, not strings. EvalOps uses sentence-transformers to compute semantic similarity:

from evalops.core.metrics import SemanticSimilarity

metric = SemanticSimilarity(threshold=0.7)

# These are semantically equivalent
result = metric.evaluate(
    actual="Paris is the capital of France",
    expected="Paris"
)
# result.score = 0.82, result.passed = True

# These are semantically different
result = metric.evaluate(
    actual="London is a beautiful city",
    expected="Paris"
)
# result.score = 0.31, result.passed = False

How It Works:

  1. Convert both strings to 384-dimensional vectors using all-MiniLM-L6-v2
  2. Compute cosine similarity between vectors
  3. Compare against a configurable threshold (see the sketch below)
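
A minimal sketch of those three steps using the sentence-transformers library directly; the helper below is illustrative, not the exact EvalOps internals:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional sentence embeddings

def semantic_match(actual: str, expected: str, threshold: float = 0.7):
    # 1. Encode both strings into vectors
    vectors = model.encode([actual, expected], convert_to_tensor=True)
    # 2. Compute cosine similarity between the two vectors
    score = util.cos_sim(vectors[0], vectors[1]).item()
    # 3. Compare against the threshold
    return score, score >= threshold

score, passed = semantic_match("Paris is the capital of France", "Paris")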

Why BERT Over Alternatives:

Method              Pros                           Cons
TF-IDF              Fast, simple                   No semantic understanding
Word2Vec            Captures some meaning          Word-level, not sentence-level
BERT Embeddings     True semantic understanding    Slightly slower (still <100ms)
LLM-as-Judge        Most nuanced                   Expensive, slow, non-deterministic

Drift Detection

Catch quality degradation before users do:

from evalops.comparison import DriftDetector

detector = DriftDetector(
    baseline_run_id="prod_v1",
    alert_threshold=0.05  # Alert if pass rate drops 5%
)

result = detector.check(current_run)

if result.has_drift:
    print(f"⚠️ Drift detected: {result.baseline_pass_rate:.1%} → {result.current_pass_rate:.1%}")
    print(f"   Degraded cases: {result.degraded_case_ids}")

Drift Detection Algorithm:

For each case in current_run:
    1. Find matching case in baseline (by input hash)
    2. Compare pass/fail status
    3. Track: improved, degraded, unchanged
    
Compute:
    - Pass rate delta
    - Statistical significance (chi-squared test)
    - Affected case breakdown

Alert if:
    - Pass rate dropped > threshold AND
    - Change is statistically significant (p < 0.05)
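
A hedged sketch of the alerting rule above using scipy's chi-squared test on a 2×2 pass/fail table; the function and parameter names here are illustrative rather than EvalOps internals:

from scipy.stats import chi2_contingency

def check_drift(baseline_pass, baseline_fail, current_pass, current_fail,
                alert_threshold=0.05, alpha=0.05):
    baseline_rate = baseline_pass / (baseline_pass + baseline_fail)
    current_rate = current_pass / (current_pass + current_fail)
    delta = current_rate - baseline_rate
    # 2x2 contingency table: rows = baseline/current, columns = pass/fail counts
    _, p_value, _, _ = chi2_contingency(
        [[baseline_pass, baseline_fail], [current_pass, current_fail]]
    )
    # Alert only if the drop exceeds the threshold AND is statistically significant
    return delta < -alert_threshold and p_value < alpha, delta, p_value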

A/B Comparison with Statistical Rigor

Not just “A is better than B” but “A is better than B with 95% confidence and medium effect size”:

from evalops.comparison import ABComparison

comparison = ABComparison()
result = comparison.compare(
    baseline_run=run_a,
    candidate_run=run_b
)

print(f"Baseline pass rate: {result.baseline_pass_rate:.1%}")
print(f"Candidate pass rate: {result.candidate_pass_rate:.1%}")
print(f"Improvement: {result.improvement:.1%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Effect size (Cohen's h): {result.effect_size:.2f}")
print(f"Significant: {result.is_significant}")

Statistical Tests Used:

Metric                 Test           Why
Pass rate difference   Chi-squared    Binary outcome (pass/fail)
Effect size            Cohen’s h      Standardized measure for proportions
Confidence interval    Wilson score   Better than normal approximation for proportions
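
For reference, Cohen's h is just the difference of arcsine-transformed pass rates; a standalone sketch, not EvalOps' exact code:

import math

def cohens_h(p1: float, p2: float) -> float:
    # h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)); |h| ≈ 0.2 small, 0.5 medium, 0.8 large
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

print(round(cohens_h(0.82, 0.74), 2))  # ~0.2, a small-to-medium effect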

Observability Stack

Production-ready logging and tracing:

from evalops import EvalRunner
from evalops.observability import LangSmithTracer, EvalLogger

# Distributed tracing
tracer = LangSmithTracer(project_name="my-evals")

# Structured logging
logger = EvalLogger(service_name="eval-service")

runner = EvalRunner(tracer=tracer, logger=logger)
result = await runner.evaluate(dataset, target_fn, metrics)

# Every evaluation is:
# - Traced in LangSmith with full context
# - Logged in structured JSON format
# - Stored with metadata for later analysis

Metrics

Metric               Description                         Use Case
Accuracy             Exact or fuzzy string match         Simple factual Q&A
SemanticSimilarity   BERT embedding cosine similarity    Open-ended responses
Latency              Response time threshold             Performance SLAs
ContainsKeywords     Required keywords present           Compliance checking
LLMJudge             LLM-as-judge evaluation             Complex quality assessment

Custom Metrics

from evalops.core.metrics import Metric, MetricResult

class ToxicityCheck(Metric):
    """Check if response contains toxic content."""
    
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.classifier = load_toxicity_model()
    
    def evaluate(self, actual: str, expected: str, **kwargs) -> MetricResult:
        score = self.classifier.predict(actual)
        return MetricResult(
            name="toxicity",
            score=1 - score,  # Invert so higher is better
            passed=score < self.threshold,
            details={"toxicity_score": score}
        )

Dashboard

The Streamlit dashboard provides visual exploration of evaluation results:

Overview Page:

  • Total runs, cases, average pass rate
  • Pass rate trends over time
  • Recent runs with quick status

Run Explorer:

  • Filter by date, tags, pass rate threshold
  • Sort by various metrics
  • Drill down into individual runs

Run Detail:

  • Case-by-case breakdown
  • Pass/fail distribution
  • Latency statistics
  • Full input/output/expected for each case

A/B Comparison:

  • Side-by-side run comparison
  • Statistical significance indicators
  • Effect size visualization
  • Case-level diff view

Drift Monitor:

  • Baseline vs current comparison
  • Trend visualization
  • Alert configuration
  • Degraded case identification

Deployment

Docker

Build locally:

docker build -t evalops .
docker run -p 8501:8501 evalops

Or use the pre-built image:

docker pull pmcavallo/evalops:latest
docker run -d -p 8501:8501 --name evalops pmcavallo/evalops:latest

Dockerfile Highlights:

# CPU-optimized PyTorch for smaller image
RUN pip install --no-cache-dir \
    torch --index-url https://download.pytorch.org/whl/cpu

# Health check for container orchestration
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1

AWS Architecture

Service           Purpose                                    Cost
EC2 (t3.micro)    Hosts Streamlit dashboard via Docker       Free tier
DynamoDB          Stores evaluation runs, cases, baselines   Free tier
IAM               Least-privilege access for deployment      Free
Security Groups   Ports 22 (SSH), 8501 (Streamlit)           Free

Deployment Steps:

# 1. SSH into EC2
ssh -i evalops-key.pem ec2-user@<public-ip>

# 2. Install Docker
sudo dnf install docker -y
sudo systemctl start docker

# 3. Pull and run
docker pull pmcavallo/evalops:latest
docker run -d -p 8501:8501 --name evalops pmcavallo/evalops:latest

The Docker Story:

The initial deployment attempt used a manual Python installation on EC2 and ran into:

  • Python version mismatch (3.7 vs 3.11 required)
  • Disk space issues (PyTorch is 900MB)
  • Dependency chain failures (sentence-transformers → torch → …)

Docker solved all of this with one command. Build once locally, run anywhere identically.


Results

Test Coverage

Category                 Tests   Status
Core (Dataset, Runner)   45      ✅ Passing
Metrics                  38      ✅ Passing
Comparison               52      ✅ Passing
Storage                  41      ✅ Passing
API                      35      ✅ Passing
CLI                      28      ✅ Passing
Observability            31      ✅ Passing
Dashboard                15      ✅ Passing
Total                    285     All Passing

Demo Data

Dataset          Cases   Description
Q&A              20      Factual question-answering
Classification   15      Sentiment/category classification
Summarization    15      Document summarization

Metric                   Value
Total Runs               24
Total Cases              470
Average Pass Rate        79.5%
Simulated Drift Events   3

Tech Stack

┌─────────────────────────────────────────────────────────────┐
│                      PRESENTATION LAYER                      │
│  Streamlit (Dashboard) │ FastAPI (REST API) │ Typer (CLI)   │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                       CORE LAYER                             │
│  sentence-transformers │ SQLAlchemy 2.0 │ Pydantic          │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY LAYER                       │
│  LangSmith │ structlog │ MetricsCollector                   │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    INFRASTRUCTURE LAYER                      │
│  Docker │ AWS EC2 │ DynamoDB │ SQLite/PostgreSQL            │
└─────────────────────────────────────────────────────────────┘

Component    Technology              Purpose
Embeddings   sentence-transformers   BERT-based semantic similarity
API          FastAPI                 REST endpoints
CLI          Typer + Rich            Command-line interface
Dashboard    Streamlit + Plotly      Visualization
Database     SQLAlchemy 2.0          ORM with SQLite/PostgreSQL
Logging      structlog               Structured JSON logging
Tracing      LangSmith               Distributed tracing
Container    Docker                  Reproducible deployment
Cloud        AWS (EC2, DynamoDB)     Production hosting

Project Structure

evalops/
├── src/evalops/
│   ├── core/               # Dataset, Runner, Metrics
│   │   ├── dataset.py      # EvalCase, EvalDataset
│   │   ├── runner.py       # EvalRunner, EvalResult
│   │   ├── metrics.py      # Accuracy, SemanticSimilarity, etc.
│   │   └── judge.py        # LLMJudge, RubricJudge
│   ├── comparison/         # Statistical comparison
│   │   ├── ab_testing.py   # ABComparison
│   │   ├── drift.py        # DriftDetector
│   │   └── regression.py   # RegressionTester
│   ├── storage/            # Persistence
│   │   ├── models.py       # SQLAlchemy models
│   │   └── repository.py   # EvalRepository
│   ├── observability/      # Logging & tracing
│   │   ├── langsmith.py    # LangSmithTracer
│   │   └── logging.py      # EvalLogger
│   ├── api/                # REST API
│   │   └── app.py          # FastAPI application
│   ├── cli/                # Command-line
│   │   └── main.py         # Typer commands
│   └── dashboard/          # Visualization
│       └── app.py          # Streamlit application
├── tests/                  # 285 unit tests
├── demo/                   # Demo datasets and mock targets
├── Dockerfile              # Container definition
├── docker-compose.yml      # Multi-container setup
└── pyproject.toml          # Package configuration

Usage

Basic Evaluation

from evalops import EvalDataset, EvalRunner, SemanticSimilarity

# Load test cases
dataset = EvalDataset.from_json("test_cases.json")

# Your LLM function
async def my_llm(input_text: str) -> str:
    response = await client.chat(input_text)
    return response.content

# Run evaluation
runner = EvalRunner()
result = await runner.evaluate(
    dataset=dataset,
    target=my_llm,
    metrics=[SemanticSimilarity(threshold=0.7)]
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Failed cases: {[c.case_id for c in result.failed_cases]}")

CLI Usage

# Run evaluation
evalops run --dataset qa_cases.json --target my_module:llm_function

# Compare two runs
evalops compare --baseline RUN_A --candidate RUN_B

# Check for drift
evalops drift --baseline prod_v1 --dataset qa_cases

# Launch dashboard
evalops-dashboard

API Usage

# Start server
uvicorn evalops.api.app:app --reload

# Run evaluation via API
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"dataset_path": "qa_cases.json", "metrics": ["semantic_similarity"]}'

# Get run results
curl http://localhost:8000/runs/RUN_ID

Lessons Learned

Semantic Similarity is Not Perfect

BERT embeddings capture meaning well but have edge cases:

  • Negations can score high (“Paris is great” vs “Paris is not great”)
  • Domain-specific terminology may not embed well
  • Very short responses lose context

Solution: Combine semantic similarity with other metrics. Use LLMJudge for nuanced cases.
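
A quick way to probe the negation edge case yourself; exact scores vary by model version, but negated pairs often land above a 0.7 threshold:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["Paris is great", "Paris is not great"], convert_to_tensor=True)
print(util.cos_sim(a, b).item())  # often high despite the opposite meaning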

Statistical Rigor Matters

Early versions just compared pass rates. Problems:

  • Small sample sizes gave misleading results
  • No confidence intervals were reported
  • Effect size was ignored (a 10% improvement across 1,000 cases is far stronger evidence than across 10 cases)

Solution: Proper statistical tests with p-values, effect sizes, and confidence intervals.
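
As an illustration of why intervals matter, the Wilson score interval can be computed directly; this is a standalone sketch, not EvalOps' implementation:

import math

def wilson_interval(passes: int, n: int, z: float = 1.96):
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# 8/10 passing looks like "80%", but the 95% interval is roughly (0.49, 0.94)
print(wilson_interval(8, 10))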

Docker Saves Hours

Manual EC2 deployment attempt:

  • 45 minutes debugging Python versions
  • 30 minutes on disk space issues
  • 20 minutes on dependency chains

Docker deployment:

  • 5 minutes to pull and run

Lesson: Containerize early, not as an afterthought.

Test Everything

285 tests sounds like a lot. It’s not. Each test caught real bugs:

  • Edge cases in metric calculations
  • Database transaction issues
  • API response format inconsistencies

Lesson: Comprehensive testing isn’t overhead; it’s insurance.


Future Improvements

  1. Real-Time Monitoring: WebSocket-based live evaluation streaming
  2. Multi-Model Comparison: Compare GPT-4 vs Claude vs Llama on same dataset
  3. Cost Tracking: Token usage and API cost per evaluation
  4. Scheduled Runs: Cron-based automated evaluation pipelines
  5. Slack/PagerDuty Integration: Alert on drift detection
  6. PostgreSQL Migration: Production database for team collaboration

License

MIT License - see LICENSE for details.



“Traditional testing asks ‘did I get the exact right string?’ LLM evaluation asks ‘did I get a response that means the right thing?’ EvalOps bridges that gap with semantic understanding, statistical rigor, and production observability. Because in production, you don’t just need to know if your LLM works - you need to know the moment it stops working.”

Written on December 31, 2024

Written on December 23, 2025