EvalOps: Production-Grade LLM Evaluation and Observability Platform

EvalOps is a systematic evaluation framework for LLM applications that addresses the fundamental problem of non-deterministic outputs. Traditional software testing fails when “correct” answers can be phrased a thousand different ways. EvalOps provides semantic similarity matching using BERT embeddings, statistical drift detection, and A/B comparison with effect size calculation. The platform includes a full observability stack with LangSmith integration, structured logging, and a Streamlit dashboard for exploring results. Deployed on AWS via Docker, the system demonstrates production-ready MLOps practices with 285 tests passing and a live demo processing 24 evaluation runs across 470 test cases.


Live Demo

Dashboard: currently offline (the EC2 instance was stopped to save costs)


Explore evaluation runs, drift detection, and A/B comparison across Q&A, classification, and summarization tasks.

Docker Hub: pmcavallo/evalops

docker pull pmcavallo/evalops:latest
docker run -p 8501:8501 pmcavallo/evalops:latest

The Problem

Traditional software testing doesn’t work for LLMs:

Non-Deterministic Outputs:

  • Ask “What is the capital of France?” ten times, get ten slightly different phrasings
  • “Paris”, “The capital is Paris”, “Paris is the capital of France” are all correct
  • Simple string matching fails catastrophically

Scale Problem:

  • Manual review of LLM outputs doesn’t scale beyond a few dozen cases
  • Production systems generate thousands of outputs daily
  • Quality degradation happens gradually and invisibly

The Drift Problem:

  • Model updates, prompt changes, and API version bumps cause subtle quality shifts
  • Without systematic measurement, you discover degradation when users complain
  • By then, you’ve lost trust and potentially revenue

A/B Testing Complexity:

  • “Is prompt A better than prompt B?” seems simple
  • But statistical significance, effect size, and sample size all matter
  • Most teams eyeball results or use inadequate testing

The Core Issue: LLM evaluation requires understanding meaning, not matching strings. It requires statistical rigor, not gut feelings. And it requires continuous monitoring, not one-time testing.


The Solution

EvalOps provides systematic evaluation with semantic understanding, statistical rigor, and production observability.

Why Not Just Use String Matching?

Approach              Problem
Exact Match           “Paris” ≠ “The capital is Paris” (both correct)
Contains              “Paris is lovely” matches “Paris” (false positive)
Fuzzy Match           “Paris” vs “Pairs” scores high (typo, not semantic)
Semantic Similarity   ✅ “Paris” ≈ “The capital is Paris” (meaning matches)
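
To make the table concrete, here is a minimal, self-contained comparison of the first three approaches in plain Python; the example strings and scores are illustrative, not EvalOps defaults.

from difflib import SequenceMatcher

expected = "Paris"
actual = "The capital is Paris"

exact_match = actual == expected                          # False, even though the answer is correct
contains = expected in "Paris is lovely"                  # True: a false positive
fuzzy = SequenceMatcher(None, "Paris", "Pairs").ratio()   # ~0.8 for a typo with a different meaning
# Semantic similarity (see Key Features below) instead compares meaning, not characters.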

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           EVALOPS ARCHITECTURE                              │
│                                                                             │
│  ┌─────────────────┐      ┌──────────────────┐      ┌──────────────────┐    │
│  │ Dataset         │─────>│ EvalRunner       │─────>│ Results          │    │
│  │ (JSON/List)     │      │                  │      │ (SQLite/Postgres)│    │
│  └─────────────────┘      └──────────────────┘      └──────────────────┘    │
│         │                         │                          │              │
│         │                         v                          │              │
│         │                 ┌──────────────────┐               │              │
│         │                 │ Metrics          │               │              │
│         │                 │ - SemanticSim    │               │              │
│         │                 │ - Accuracy       │               │              │
│         │                 │ - Latency        │               │              │
│         │                 │ - LLMJudge       │               │              │
│         │                 └──────────────────┘               │              │
│         │                         │                          │              │
│         v                         v                          v              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        OBSERVABILITY LAYER                          │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │    │
│  │  │ LangSmith   │  │ Structured  │  │ Metrics     │                  │    │
│  │  │ Tracing     │  │ Logging     │  │ Collector   │                  │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                        │
│                                    v                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        COMPARISON ENGINE                            │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │    │
│  │  │ A/B Testing │  │ Drift       │  │ Regression  │                  │    │
│  │  │ (stats)     │  │ Detection   │  │ Testing     │                  │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                        │
│                                    v                                        │
│                          ┌──────────────────┐                               │
│                          │ Streamlit        │                               │
│                          │ Dashboard        │                               │
│                          │ (AWS EC2)        │                               │
│                          └──────────────────┘                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Features

Semantic Similarity with BERT Embeddings

The core insight: meaning matters, not strings. EvalOps uses sentence-transformers to compute semantic similarity:

from evalops.core.metrics import SemanticSimilarity

metric = SemanticSimilarity(threshold=0.7)

# These are semantically equivalent
result = metric.evaluate(
    actual="Paris is the capital of France",
    expected="Paris"
)
# result.score = 0.82, result.passed = True

# These are semantically different
result = metric.evaluate(
    actual="London is a beautiful city",
    expected="Paris"
)
# result.score = 0.31, result.passed = False

How It Works:

  1. Convert both strings to 384-dimensional vectors using all-MiniLM-L6-v2
  2. Compute cosine similarity between vectors
  3. Compare against a configurable threshold (see the sketch below)
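
A minimal sketch of those three steps using the sentence-transformers library directly; the helper below is illustrative, not the exact EvalOps internals:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional sentence embeddings

def semantic_match(actual: str, expected: str, threshold: float = 0.7):
    # 1. Encode both strings into vectors
    vectors = model.encode([actual, expected], convert_to_tensor=True)
    # 2. Compute cosine similarity between the two vectors
    score = util.cos_sim(vectors[0], vectors[1]).item()
    # 3. Compare against the threshold
    return score, score >= threshold

score, passed = semantic_match("Paris is the capital of France", "Paris")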

Why BERT Over Alternatives:

Method              Pros                           Cons
TF-IDF              Fast, simple                   No semantic understanding
Word2Vec            Captures some meaning          Word-level, not sentence-level
BERT Embeddings     True semantic understanding    Slightly slower (still <100ms)
LLM-as-Judge        Most nuanced                   Expensive, slow, non-deterministic

Drift Detection

Catch quality degradation before users do:

from evalops.comparison import DriftDetector

detector = DriftDetector(
    baseline_run_id="prod_v1",
    alert_threshold=0.05  # Alert if pass rate drops 5%
)

result = detector.check(current_run)

if result.has_drift:
    print(f"⚠️ Drift detected: {result.baseline_pass_rate:.1%} → {result.current_pass_rate:.1%}")
    print(f"   Degraded cases: {result.degraded_case_ids}")

Drift Detection Algorithm:

For each case in current_run:
    1. Find matching case in baseline (by input hash)
    2. Compare pass/fail status
    3. Track: improved, degraded, unchanged
    
Compute:
    - Pass rate delta
    - Statistical significance (chi-squared test)
    - Affected case breakdown

Alert if:
    - Pass rate dropped > threshold AND
    - Change is statistically significant (p < 0.05)
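
A hedged sketch of the alerting rule above using scipy's chi-squared test on a 2×2 pass/fail table; the function and parameter names here are illustrative rather than EvalOps internals:

from scipy.stats import chi2_contingency

def check_drift(baseline_pass, baseline_fail, current_pass, current_fail,
                alert_threshold=0.05, alpha=0.05):
    baseline_rate = baseline_pass / (baseline_pass + baseline_fail)
    current_rate = current_pass / (current_pass + current_fail)
    delta = current_rate - baseline_rate
    # 2x2 contingency table: rows = baseline/current, columns = pass/fail counts
    _, p_value, _, _ = chi2_contingency(
        [[baseline_pass, baseline_fail], [current_pass, current_fail]]
    )
    # Alert only if the drop exceeds the threshold AND is statistically significant
    return delta < -alert_threshold and p_value < alpha, delta, p_value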

A/B Comparison with Statistical Rigor

Not just “A is better than B” but “A is better than B with 95% confidence and medium effect size”:

from evalops.comparison import ABComparison

comparison = ABComparison()
result = comparison.compare(
    baseline_run=run_a,
    candidate_run=run_b
)

print(f"Baseline pass rate: {result.baseline_pass_rate:.1%}")
print(f"Candidate pass rate: {result.candidate_pass_rate:.1%}")
print(f"Improvement: {result.improvement:.1%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Effect size (Cohen's h): {result.effect_size:.2f}")
print(f"Significant: {result.is_significant}")

Statistical Tests Used:

Metric                 Test           Why
Pass rate difference   Chi-squared    Binary outcome (pass/fail)
Effect size            Cohen’s h      Standardized measure for proportions
Confidence interval    Wilson score   Better than normal approximation for proportions
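
For reference, Cohen's h is just the difference of arcsine-transformed pass rates; a standalone sketch, not EvalOps' exact code:

import math

def cohens_h(p1: float, p2: float) -> float:
    # h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)); |h| ≈ 0.2 small, 0.5 medium, 0.8 large
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

print(round(cohens_h(0.82, 0.74), 2))  # ~0.2, a small-to-medium effect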

Observability Stack

Production-ready logging and tracing:

from evalops import EvalRunner
from evalops.observability import LangSmithTracer, EvalLogger

# Distributed tracing
tracer = LangSmithTracer(project_name="my-evals")

# Structured logging
logger = EvalLogger(service_name="eval-service")

runner = EvalRunner(tracer=tracer, logger=logger)
result = await runner.evaluate(dataset, target_fn, metrics)

# Every evaluation is:
# - Traced in LangSmith with full context
# - Logged in structured JSON format
# - Stored with metadata for later analysis

Metrics

Metric               Description                         Use Case
Accuracy             Exact or fuzzy string match         Simple factual Q&A
SemanticSimilarity   BERT embedding cosine similarity    Open-ended responses
Latency              Response time threshold             Performance SLAs
ContainsKeywords     Required keywords present           Compliance checking
LLMJudge             LLM-as-judge evaluation             Complex quality assessment

Custom Metrics

from evalops.core.metrics import Metric, MetricResult

class ToxicityCheck(Metric):
    """Check if response contains toxic content."""
    
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.classifier = load_toxicity_model()
    
    def evaluate(self, actual: str, expected: str, **kwargs) -> MetricResult:
        score = self.classifier.predict(actual)
        return MetricResult(
            name="toxicity",
            score=1 - score,  # Invert so higher is better
            passed=score < self.threshold,
            details={"toxicity_score": score}
        )

Dashboard

The Streamlit dashboard provides visual exploration of evaluation results:

Overview Page:

  • Total runs, cases, average pass rate
  • Pass rate trends over time
  • Recent runs with quick status

Run Explorer:

  • Filter by date, tags, pass rate threshold
  • Sort by various metrics
  • Drill down into individual runs

Run Detail:

  • Case-by-case breakdown
  • Pass/fail distribution
  • Latency statistics
  • Full input/output/expected for each case

A/B Comparison:

  • Side-by-side run comparison
  • Statistical significance indicators
  • Effect size visualization
  • Case-level diff view

Drift Monitor:

  • Baseline vs current comparison
  • Trend visualization
  • Alert configuration
  • Degraded case identification

Deployment

Docker

Build locally:

docker build -t evalops .
docker run -p 8501:8501 evalops

Or use the pre-built image:

docker pull pmcavallo/evalops:latest
docker run -d -p 8501:8501 --name evalops pmcavallo/evalops:latest

Dockerfile Highlights:

# CPU-optimized PyTorch for smaller image
RUN pip install --no-cache-dir \
    torch --index-url https://download.pytorch.org/whl/cpu

# Health check for container orchestration
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1

AWS Architecture

Service           Purpose                                    Cost
EC2 (t3.micro)    Hosts Streamlit dashboard via Docker       Free tier
DynamoDB          Stores evaluation runs, cases, baselines   Free tier
IAM               Least-privilege access for deployment      Free
Security Groups   Ports 22 (SSH), 8501 (Streamlit)           Free

Deployment Steps:

# 1. SSH into EC2
ssh -i evalops-key.pem ec2-user@<public-ip>

# 2. Install Docker
sudo dnf install docker -y
sudo systemctl start docker

# 3. Pull and run
docker pull pmcavallo/evalops:latest
docker run -d -p 8501:8501 --name evalops pmcavallo/evalops:latest

The Docker Story:

The initial deployment attempt used a manual Python installation on EC2 and ran into:

  • Python version mismatch (3.7 vs 3.11 required)
  • Disk space issues (PyTorch is 900MB)
  • Dependency chain failures (sentence-transformers → torch → …)

Docker solved all of this with one command. Build once locally, run anywhere identically.


Results

Test Coverage

Category                 Tests   Status
Core (Dataset, Runner)   45      ✅ Passing
Metrics                  38      ✅ Passing
Comparison               52      ✅ Passing
Storage                  41      ✅ Passing
API                      35      ✅ Passing
CLI                      28      ✅ Passing
Observability            31      ✅ Passing
Dashboard                15      ✅ Passing
Total                    285     All Passing

Demo Data

Dataset          Cases   Description
Q&A              20      Factual question-answering
Classification   15      Sentiment/category classification
Summarization    15      Document summarization

Metric                   Value
Total Runs               24
Total Cases              470
Average Pass Rate        79.5%
Simulated Drift Events   3

Tech Stack

┌─────────────────────────────────────────────────────────────┐
│                      PRESENTATION LAYER                      │
│  Streamlit (Dashboard) │ FastAPI (REST API) │ Typer (CLI)   │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                       CORE LAYER                             │
│  sentence-transformers │ SQLAlchemy 2.0 │ Pydantic          │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY LAYER                       │
│  LangSmith │ structlog │ MetricsCollector                   │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    INFRASTRUCTURE LAYER                      │
│  Docker │ AWS EC2 │ DynamoDB │ SQLite/PostgreSQL            │
└─────────────────────────────────────────────────────────────┘

Component    Technology              Purpose
Embeddings   sentence-transformers   BERT-based semantic similarity
API          FastAPI                 REST endpoints
CLI          Typer + Rich            Command-line interface
Dashboard    Streamlit + Plotly      Visualization
Database     SQLAlchemy 2.0          ORM with SQLite/PostgreSQL
Logging      structlog               Structured JSON logging
Tracing      LangSmith               Distributed tracing
Container    Docker                  Reproducible deployment
Cloud        AWS (EC2, DynamoDB)     Production hosting

Project Structure

evalops/
├── src/evalops/
│   ├── core/               # Dataset, Runner, Metrics
│   │   ├── dataset.py      # EvalCase, EvalDataset
│   │   ├── runner.py       # EvalRunner, EvalResult
│   │   ├── metrics.py      # Accuracy, SemanticSimilarity, etc.
│   │   └── judge.py        # LLMJudge, RubricJudge
│   ├── comparison/         # Statistical comparison
│   │   ├── ab_testing.py   # ABComparison
│   │   ├── drift.py        # DriftDetector
│   │   └── regression.py   # RegressionTester
│   ├── storage/            # Persistence
│   │   ├── models.py       # SQLAlchemy models
│   │   └── repository.py   # EvalRepository
│   ├── observability/      # Logging & tracing
│   │   ├── langsmith.py    # LangSmithTracer
│   │   └── logging.py      # EvalLogger
│   ├── api/                # REST API
│   │   └── app.py          # FastAPI application
│   ├── cli/                # Command-line
│   │   └── main.py         # Typer commands
│   └── dashboard/          # Visualization
│       └── app.py          # Streamlit application
├── tests/                  # 285 unit tests
├── demo/                   # Demo datasets and mock targets
├── Dockerfile              # Container definition
├── docker-compose.yml      # Multi-container setup
└── pyproject.toml          # Package configuration

Usage

Basic Evaluation

from evalops import EvalDataset, EvalRunner, SemanticSimilarity

# Load test cases
dataset = EvalDataset.from_json("test_cases.json")

# Your LLM function
async def my_llm(input_text: str) -> str:
    response = await client.chat(input_text)
    return response.content

# Run evaluation
runner = EvalRunner()
result = await runner.evaluate(
    dataset=dataset,
    target=my_llm,
    metrics=[SemanticSimilarity(threshold=0.7)]
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Failed cases: {[c.case_id for c in result.failed_cases]}")

CLI Usage

# Run evaluation
evalops run --dataset qa_cases.json --target my_module:llm_function

# Compare two runs
evalops compare --baseline RUN_A --candidate RUN_B

# Check for drift
evalops drift --baseline prod_v1 --dataset qa_cases

# Launch dashboard
evalops-dashboard

API Usage

# Start server
uvicorn evalops.api.app:app --reload

# Run evaluation via API
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"dataset_path": "qa_cases.json", "metrics": ["semantic_similarity"]}'

# Get run results
curl http://localhost:8000/runs/RUN_ID

Lessons Learned

Semantic Similarity is Not Perfect

BERT embeddings capture meaning well but have edge cases:

  • Negations can score high (“Paris is great” vs “Paris is not great”)
  • Domain-specific terminology may not embed well
  • Very short responses lose context

Solution: Combine semantic similarity with other metrics. Use LLMJudge for nuanced cases.
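
A quick way to probe the negation edge case yourself; exact scores vary by model version, but negated pairs often land above a 0.7 threshold:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["Paris is great", "Paris is not great"], convert_to_tensor=True)
print(util.cos_sim(a, b).item())  # often high despite the opposite meaning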

Statistical Rigor Matters

Early versions just compared pass rates. Problems:

  • Small sample sizes gave misleading results
  • No confidence intervals were reported
  • Effect size was ignored (a 10% improvement across 1,000 cases is far stronger evidence than across 10 cases)

Solution: Proper statistical tests with p-values, effect sizes, and confidence intervals.
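
As an illustration of why intervals matter, the Wilson score interval can be computed directly; this is a standalone sketch, not EvalOps' implementation:

import math

def wilson_interval(passes: int, n: int, z: float = 1.96):
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# 8/10 passing looks like "80%", but the 95% interval is roughly (0.49, 0.94)
print(wilson_interval(8, 10))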

Docker Saves Hours

Manual EC2 deployment attempt:

  • 45 minutes debugging Python versions
  • 30 minutes on disk space issues
  • 20 minutes on dependency chains

Docker deployment:

  • 5 minutes to pull and run

Lesson: Containerize early, not as an afterthought.

Test Everything

285 tests sounds like a lot. It’s not. Each test caught real bugs:

  • Edge cases in metric calculations
  • Database transaction issues
  • API response format inconsistencies

Lesson: Comprehensive testing isn’t overhead; it’s insurance.


Future Improvements

  1. Real-Time Monitoring: WebSocket-based live evaluation streaming
  2. Multi-Model Comparison: Compare GPT-4 vs Claude vs Llama on same dataset
  3. Cost Tracking: Token usage and API cost per evaluation
  4. Scheduled Runs: Cron-based automated evaluation pipelines
  5. Slack/PagerDuty Integration: Alert on drift detection
  6. PostgreSQL Migration: Production database for team collaboration

License

MIT License - see LICENSE for details.



“Traditional testing asks ‘did I get the exact right string?’ LLM evaluation asks ‘did I get a response that means the right thing?’ EvalOps bridges that gap with semantic understanding, statistical rigor, and production observability. Because in production, you don’t just need to know if your LLM works - you need to know the moment it stops working.”

Written on December 31, 2024

Written on December 23, 2025