MROps: AI-Assisted Model Risk Validation

Model Risk Oversight teams at broker-dealers, banks, and investment advisers are managing inventories that have fundamentally changed. Five years ago, a model inventory was logistic regressions, time series forecasts, and vendor scorecards. The validation methodology was mature. The cadence was annual. The workload was predictable.

That inventory now includes LLM-powered advisory systems, agentic compliance monitors, RAG pipelines over regulatory corpora, and multi-agent orchestration frameworks that make autonomous decisions in client-facing workflows. FINRA’s 2026 Annual Regulatory Oversight Report drew a sharp line between these two worlds: AI systems that generate content and AI systems that act autonomously. The supervisory obligations under Rule 3110 apply to both, but the validation methodology for the second category barely exists.

Meanwhile, the intake process hasn’t changed. A model submission arrives. An analyst spends hours reading documentation, classifying the system, checking for SR 11-7 completeness, writing an intake memo, and routing it for validation. Multiply that by hundreds of models and the math breaks: your validators spend more time on process than on judgment.

MROps solves both problems simultaneously. It automates the intake workflow (classification, gap analysis, documentation) so validators focus on effective challenge, not paperwork. And it extends the classification framework to handle agentic AI systems with a novel authorization boundary assessment that no standard framework currently provides.


What It Does

MROps is a LangGraph multi-agent pipeline that processes model submissions and produces professional intake documentation. A submission enters as structured JSON describing a model or AI system. Three agents process it sequentially:

Agent 1: Classifier

The classifier determines three things about every submission:

System type. Is this a traditional statistical model (logistic regression, GBDT, time series), a deterministic tool (rule engine, threshold-based system), or an agentic AI system (LLM-based, tool-calling, autonomous reasoning)? This distinction matters because each type requires different validation approaches, different monitoring cadences, and different regulatory frameworks.

Risk tier. Tier 1 (critical, requires immediate attention), Tier 2 (material, standard validation cycle), or Tier 3 (low-risk, streamlined review). Tiering follows SR 11-7 principles: complexity, uncertainty, breadth of use, and potential impact. For agentic systems, authorization boundary gaps automatically escalate the tier.

Regulatory triggers. Which frameworks apply? SR 11-7 is universal for regulated institutions. FINRA Rule 3110 applies to supervisory requirements for client-facing systems. Reg BI applies when the system touches investment recommendations. FINRA 4370 applies when the system is critical to business continuity.
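The trigger logic above amounts to a small mapping from submission attributes to frameworks. A minimal sketch of that mapping, with hypothetical flag names (not MROps' actual schema):

```python
def regulatory_triggers(
    *,
    regulated_institution: bool,
    client_facing: bool,
    makes_recommendations: bool,
    business_continuity_critical: bool,
) -> list[str]:
    """Map submission attributes to applicable regulatory frameworks.

    Illustrative only: the flag names are assumptions; the rules follow
    the trigger descriptions above.
    """
    triggers = []
    if regulated_institution:
        triggers.append("SR 11-7")     # universal for regulated institutions
    if client_facing:
        triggers.append("FINRA 3110")  # supervisory requirements
    if makes_recommendations:
        triggers.append("Reg BI")      # touches investment recommendations
    if business_continuity_critical:
        triggers.append("FINRA 4370")  # critical to business continuity
    return triggers
```

A client-facing advisory agent at a broker-dealer would pick up SR 11-7, FINRA 3110, and Reg BI from this mapping, matching the WealthGuide example later in this document.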

For agentic systems, the classifier also performs an authorization boundary assessment across 8 dimensions. This is the novel contribution. No standard framework currently defines what “authorization” means for an AI agent. MROps proposes a two-layer model:

Layer 1: Tool Permissions (what can the agent do?)

  1. Identity verification — does the agent authenticate before calling tools?
  2. Scope constraints — are tool permissions bounded and documented?
  3. Action logging — are all tool calls logged to an immutable audit trail?
  4. Human approval gates — do high-impact actions require human confirmation?
  5. Data policy enforcement — are data access rules enforced at tool level, not just in prompts?

Layer 2: Context Trust (what can influence what the agent thinks?)

  1. Context write permissions — who or what can inject content into the agent’s context window?
  2. Context authority levels — does the agent distinguish trusted input (firm data, validated sources) from untrusted input (user queries, external documents)?
  3. Trust degradation rules — when untrusted content enters the context, does the system reduce agent autonomy or escalate to human review?

The first layer draws on emerging agent security frameworks (tool-level permissions, scope constraints, audit trails). The second layer addresses a problem that most governance frameworks haven’t reached yet: the agent’s reasoning can be influenced by content it retrieves, and not all content is equally trustworthy. A RAG pipeline over regulatory documents does not sit at the same trust level as a client’s free-text query. If the agent can’t distinguish them, its authorization boundary is incomplete.
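The two layers can be represented as typed schemas. A minimal Pydantic v2 sketch, with field names derived from the eight dimensions above (the names and the `present_counts` helper are assumptions, not MROps' actual schema):

```python
from enum import Enum
from pydantic import BaseModel

class DimensionStatus(str, Enum):
    PRESENT = "present"
    INSUFFICIENT = "insufficient"
    MISSING = "missing"

class ToolPermissions(BaseModel):
    """Layer 1: what can the agent do?"""
    identity_verification: DimensionStatus
    scope_constraints: DimensionStatus
    action_logging: DimensionStatus
    human_approval_gates: DimensionStatus
    data_policy_enforcement: DimensionStatus

class ContextTrust(BaseModel):
    """Layer 2: what can influence what the agent thinks?"""
    context_write_permissions: DimensionStatus
    context_authority_levels: DimensionStatus
    trust_degradation_rules: DimensionStatus

class AuthorizationBoundary(BaseModel):
    tool_permissions: ToolPermissions
    context_trust: ContextTrust

    def present_counts(self) -> tuple[int, int]:
        """Dimensions marked present per layer, e.g. (1, 0) -> '1/5, 0/3'."""
        layer1 = sum(
            v == DimensionStatus.PRESENT
            for v in self.tool_permissions.model_dump().values()
        )
        layer2 = sum(
            v == DimensionStatus.PRESENT
            for v in self.context_trust.model_dump().values()
        )
        return layer1, layer2
```

Typing the assessment this way lets the demo submissions report scores like "1/5 tool permissions present, 0/3 context trust present" mechanically rather than by narrative judgment.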

Auto-escalation rules: If identity verification or scope constraints are missing, the system is automatically Tier 1 regardless of its use case. If context authority levels are absent on a client-facing or compliance-critical system, Tier 1. These rules encode the principle that an unbounded agent cannot be validated until its boundaries are defined.
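The escalation rules are simple enough to express directly. A sketch, assuming a plain dict of dimension statuses (key names are illustrative, not MROps' actual schema):

```python
def apply_auto_escalation(
    submitted_tier: int,
    boundary: dict[str, str],
    client_facing: bool = False,
    compliance_critical: bool = False,
) -> int:
    """Illustrative sketch of the auto-escalation rules described above.

    `boundary` maps dimension names to "present" / "insufficient" / "missing".
    Tier 1 is the highest-risk tier, so escalation lowers the number.
    """
    tier = submitted_tier
    # Rule 1: identity verification or scope constraints not fully present
    # -> Tier 1 regardless of use case.
    if (boundary.get("identity_verification") != "present"
            or boundary.get("scope_constraints") != "present"):
        tier = 1
    # Rule 2: context authority levels absent on a client-facing or
    # compliance-critical system -> Tier 1.
    if (boundary.get("context_authority_levels") != "present"
            and (client_facing or compliance_critical)):
        tier = 1
    return tier
```

Run against the WealthGuide example later in this document (insufficient identity verification, missing scope constraints, client-facing), this logic escalates a submitted Tier 2 to Tier 1.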

Agent 2: Gap Analyst

The gap analyst checks the submission against 14-16 SR 11-7 requirements (the count varies because agentic systems get two additional authorization categories). For each requirement, it assesses:

  • Status: Present, missing, or insufficient
  • Severity: Critical, high, medium, or low (for gaps only)
  • Finding: What was found or what’s missing, with specific references to the submission
  • Recommendation: Concrete remediation action

The requirements map directly to SR 11-7’s three pillars:

Development documentation: Purpose and use, methodology, data description, assumptions, testing and performance, and limitations. These are SR 11-7 Section IV requirements. The gap analyst checks not just whether these sections exist, but whether they contain sufficient detail for an independent validator to assess the model.

Validation evidence: Conceptual soundness review, ongoing monitoring plan, and outcomes analysis (including backtesting where applicable). These are SR 11-7 Section V requirements. For agentic systems, “outcomes analysis” means something different than for traditional models: you can’t backtest a probabilistic system the same way you backtest a PD model. The gap analyst checks for trace-based validation approaches.

Governance artifacts: Model ownership, approval history, inventory registration, risk assessment, and change management. These are SR 11-7 Section VI requirements. For agentic systems, the gap analyst also checks whether authorization boundaries (both tool permissions and context trust) are documented.

The output is a structured gap report with severity ratings, a readiness assessment (ready, conditional, not ready), and prioritized next steps.
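The per-requirement fields and the report structure described above might be typed roughly as follows. A Pydantic v2 sketch; the class and field names are assumptions:

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel

class Status(str, Enum):
    PRESENT = "present"
    MISSING = "missing"
    INSUFFICIENT = "insufficient"

class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class RequirementFinding(BaseModel):
    requirement: str                    # e.g. "Outcomes analysis (SR 11-7 Section V)"
    status: Status
    severity: Optional[Severity] = None  # populated only for gaps
    finding: str                         # what was found or what's missing
    recommendation: str                  # concrete remediation action

class GapReport(BaseModel):
    findings: list[RequirementFinding]
    readiness: str                       # "ready" | "conditional" | "not ready"
    next_steps: list[str]

    def summary(self) -> dict[str, int]:
        """Counts by status, e.g. {'present': 5, 'missing': 4, 'insufficient': 7}."""
        counts = {s.value: 0 for s in Status}
        for f in self.findings:
            counts[f.status.value] += 1
        return counts
```

The summary counts are what surface in the demo submissions' gap lines (e.g. "5 present, 4 missing, 7 insufficient").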

Agent 3: Memo Writer

The memo writer combines the classification and gap analysis into a formatted intake memo that reads like a professional MRM document. The memo includes:

  1. Executive summary with recommendation (approve, conditional, return to submitter)
  2. Submission overview (model name, submitter, description, use case, technology, deployment)
  3. Classification assessment with rationale
  4. Authorization boundary assessment (for agentic systems)
  5. SR 11-7 gap analysis with summary table and narrative
  6. Recommendation with required actions
  7. Appendix with detailed gap matrix

The memo is designed to be the artifact a validator opens on day one. Everything they need to understand what the system is, how risky it is, what’s documented, and what’s missing is in a single document.


Why This Matters

The Scale Problem

A model risk team managing 200 models spends roughly 8 hours per model on intake overhead: reading documentation, classifying the system, checking completeness, writing the memo, routing for validation. That’s 1,600 analyst hours per year on process. MROps automates approximately 70% of that overhead, returning roughly 1,120 hours to the team for the work that requires human judgment: findings, effective challenge, and communication with model owners.

But the real value isn’t cost savings. It’s capacity. Every business unit deploying a new AI system adds to the validation queue. Hiring qualified MRM professionals with AI expertise is slow. MROps scales without headcount. When the inventory grows from 200 to 300, the intake overhead doesn’t grow with it.

The Methodology Problem

SR 11-7 was written in 2011 for logistic regressions and discounted cash flow models. Its principles (effective challenge, independence, documentation rigor) are durable. Its assumptions are not.

SR 11-7 assumes models produce quantitative estimates. Agentic systems produce actions. SR 11-7 assumes deterministic, replicable outputs. LLMs are probabilistic. SR 11-7 assumes point-in-time validation on an annual cadence. Agentic systems change continuously as tools, permissions, prompts, and knowledge bases update. SR 11-7 assumes a human reviews output before decisions are made. Agentic systems act autonomously.

MROps doesn’t replace SR 11-7. It extends the validation methodology to cover a class of systems that SR 11-7 was never designed for, while preserving the governance principles that make SR 11-7 effective.

The Examination Problem

FINRA’s 2026 Annual Regulatory Oversight Report (December 9, 2025) dedicated a standalone section to agentic AI for the first time. The report names seven specific risks for AI agents and states that firms should consider whether “the autonomous nature of AI agents presents the firm with novel regulatory, supervisory or operational considerations.” When a FINRA examiner asks to see supervisory controls for agentic AI systems, the intake memo that MROps generates is examination-ready evidence: classification, risk tier, regulatory triggers, authorization boundary assessment, and documented gap analysis with remediation tracking.

The U.S. Treasury published its Financial Services AI Risk Management Framework on February 19, 2026, with 230 AI-specific control objectives that adapt the NIST AI RMF for financial services. These controls map to the same categories MROps assesses: governance, risk identification, monitoring, and accountability. MROps positions the validation team to demonstrate compliance with emerging federal guidance, not just existing supervisory expectations.


Demo: Four Synthetic Submissions

MROps ships with four synthetic model submissions that demonstrate the full spectrum of systems a model risk team encounters. All data is synthetic. No real financial data, institutions, or individuals are represented.

1. Retail PD Scorecard v3.2

A logistic regression probability of default model for retail unsecured lending. Everything is documented. This is what “ready” looks like.

  • Classification: Traditional Model, Tier 2, Credit Risk
  • Regulatory triggers: SR 11-7
  • Gap analysis: 14/14 requirements present, zero gaps
  • Readiness: Ready
  • Recommendation: Approve for validation

This submission demonstrates baseline behavior. A well-documented traditional model sails through intake in minutes instead of hours. The validator opens the memo and knows immediately: proceed to validation, pay attention to the Grade E underprediction trend, everything else is clean.

2. WealthGuide AI Advisor

An LLM-powered portfolio recommendation agent serving 8,000+ retail clients daily. Missing authorization boundaries. Incident history of inappropriate concentration recommendations.

  • Classification: Agentic System, Tier 1 (escalated from Tier 2), Advisory
  • Regulatory triggers: SR 11-7, FINRA 3110, Reg BI
  • Authorization: 1/5 tool permissions present, 0/3 context trust present
  • Gap analysis: 5 present, 4 missing, 7 insufficient (5 critical gaps)
  • Readiness: Not Ready
  • Recommendation: Return to submitter with 8 required actions

This is the demo’s centerpiece. The classifier correctly identifies it as an agentic system, triggers Reg BI (investment recommendations to retail clients) and FINRA 3110 (supervisory requirements), and auto-escalates to Tier 1 because identity verification is insufficient and scope constraints are missing on a client-facing system. The gap analysis catches that the system has no outcomes analysis, no independent risk assessment, and no context trust documentation. The memo reads like a document an examiner would expect to see.

3. TxnWatch Rule Engine v2.0

A deterministic rule-based transaction surveillance system for AML compliance. 47 rules, no ML components. Well-documented but with a tier mismatch.

  • Classification: Tool, Tier 2 (escalated from submitted Tier 3), Compliance
  • Regulatory triggers: SR 11-7
  • Gap analysis: 12 present, 0 missing, 2 insufficient
  • Readiness: Conditional
  • Recommendation: Conditional approval pending governance updates

This submission demonstrates the classifier’s ability to escalate risk tiers based on regulatory sensitivity. The system was submitted as Tier 3, but the classifier recognizes that a BSA/AML compliance system with OCC examination history and an 85:1 false-positive ratio that creates material analyst burden warrants Tier 2. The gaps are governance-only: update the risk assessment and inventory registration to reflect the correct tier.

4. ComplianceBot Intelligent Monitor

An agentic compliance monitoring system with strong tool permissions but missing context trust. Tool permissions are fully documented. Context authority levels and trust degradation rules are absent.

  • Classification: Agentic System, Tier 1 (auto-escalation), Compliance
  • Regulatory triggers: SR 11-7, FINRA 3110, FINRA 4370
  • Authorization: 5/5 tool permissions present, 1/3 context trust present
  • Gap analysis: 11 present, 2 missing, 3 insufficient (2 critical)
  • Readiness: Not Ready
  • Recommendation: Return to submitter

This submission demonstrates the two-layer authorization model. The development team did strong work on tool permissions: OAuth2 service accounts, read-only access, immutable audit trails, multi-level approval gates, data classification controls. But they didn’t address context trust. The system retrieves regulatory documents via RAG and processes them through an LLM, but it doesn’t distinguish between trusted regulatory content and potentially adversarial content. It has no mechanism to reduce autonomy when untrusted content enters the context. The auto-escalation rule fires because context authority levels are missing on a compliance-critical system.

This is the finding that most governance frameworks miss. The tools are locked down. The context is wide open.


Architecture

                    ┌─────────────────────────┐
                    │    Model Submission      │
                    │    (Structured JSON)      │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │   Agent 1: Classifier    │
                    │                          │
                    │  • System type           │
                    │  • Risk tier             │
                    │  • Model category        │
                    │  • Regulatory triggers   │
                    │  • Auth boundary (8 dim) │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │  Agent 2: Gap Analyst    │
                    │                          │
                    │  • 14-16 SR 11-7 checks  │
                    │  • Severity ratings      │
                    │  • Readiness assessment  │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │  Agent 3: Memo Writer    │
                    │                          │
                    │  • Professional MRM memo │
                    │  • Recommendation        │
                    │  • Required actions      │
                    │  • Detailed gap matrix   │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │    Intake Memo (.md)     │
                    │    + Model Inventory     │
                    └─────────────────────────┘

Pipeline: LangGraph with typed state, conditional edges, and error handling at each node. If classification fails, the pipeline stops. If gap analysis fails, the memo still generates with available data.

LLM: Claude Sonnet 4.6 via langchain-anthropic. Temperature 0.0 for deterministic classification. Structured JSON output schemas enforced via system prompts.

Data models: Pydantic v2 with strict validation. Every agent input and output is typed. The submission schema includes authorization boundary documentation, development documentation, validation evidence, and governance artifacts.

Prompts: Dedicated prompt files with SR 11-7 requirements embedded. The gap analyst prompt contains the full set of requirements from SR 11-7 Sections IV, V, and VI, adapted for traditional models, tools, and agentic systems.


Tech Stack

  • Orchestration: LangGraph (multi-agent pipeline with typed state)
  • LLM: Claude Sonnet 4.6 (classification, gap analysis, memo generation)
  • Data validation: Pydantic v2 (strict input/output schemas)
  • API: FastAPI (backend endpoints, Phase 2)
  • Demo UI: Streamlit (interactive submission viewer)
  • Database: SQLite (model inventory; PostgreSQL in production)
  • Testing: pytest (27 tests covering all agents and the pipeline)

Roadmap

Phase 1 (complete): Classifier, gap analyst, memo writer, 4 synthetic submissions, Streamlit demo, 27 tests passing.

Phase 2: FastAPI backend, SQLite model inventory persistence, submission history tracking, batch processing for inventory-wide analysis.

Phase 3: Continuous monitoring integration (drift detection via EvalOps patterns), automated revalidation triggers, LangSmith/LangFuse observability for agent traces.

Phase 4: Validation plan generator (pre-populated by system type and risk tier), multi-model dependency mapping for aggregate risk assessment.


Synthetic Data Disclaimer

All model submissions, company names, individuals, and data in this project are entirely synthetic. No real financial institutions, client data, or regulatory examination results are represented. Sample submissions are designed to demonstrate classification and gap analysis behavior across a range of system types and documentation quality levels.


Related Projects

  • AutoDoc AI — 4-agent documentation system with 47/47 source fidelity. Demonstrates multi-agent RAG architecture patterns used in MROps memo generation.
  • EvalOps — LLM evaluation framework with 285 tests and drift detection. Provides monitoring patterns for Phase 3 continuous validation.