ProductionProof: From Demo to Production in 2 Minutes

ProductionProof generates production readiness documentation for AI/ML projects using Claude Sonnet 4.5 with carefully engineered prompts. Input a project description, get three professional documents: Architecture Decision Records (ADRs), Risk Assessment, and Test Coverage Matrix. The tool addresses the demo-to-production gap that Gartner identifies: 91% of AI pilots show low impact, but 25% of organizations are stuck between pilot and production because they lack the artifacts that prove readiness. ProductionProof generates those artifacts in minutes, not weeks.

Read More

EvalOps: Production-Grade LLM Evaluation and Observability Platform

EvalOps is a systematic evaluation framework for LLM applications that addresses the fundamental problem of non-deterministic outputs. Traditional software testing fails when “correct” answers can be phrased a thousand different ways. EvalOps provides semantic similarity matching using BERT embeddings, statistical drift detection, and A/B comparison with effect size calculation. The platform includes a full observability stack with LangSmith integration, structured logging, and a Streamlit dashboard for exploring results. Deployed on AWS via Docker, the system demonstrates production-ready MLOps practices with 285 tests passing and a live demo processing 24 evaluation runs across 470 test cases.
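The semantic-similarity matching idea can be sketched in a few lines. This is an illustrative toy, not EvalOps code: it assumes embeddings have already been produced upstream by a BERT-family encoder, and the function names and the 0.85 threshold are hypothetical.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantically_matches(candidate_vec, reference_vec, threshold=0.85):
    # Pass the test case when the model answer's embedding is close enough
    # to the reference answer's embedding, regardless of exact phrasing.
    return cosine_similarity(candidate_vec, reference_vec) >= threshold
```

The point of the design is that two differently worded but equivalent answers land near each other in embedding space, so exact-match assertions are replaced by a distance check.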

Read More

CreditNLP: Fine-Tuned LLM for Startup Default Risk Prediction

CreditNLP is a fine-tuned language model that identifies default risk signals in startup loan applications where traditional quantitative data is sparse. Using LoRA (Low-Rank Adaptation) on Mistral-7B, the model learns to detect implicit risk patterns in application narratives - the same linguistic signals that experienced credit underwriters recognize intuitively but cannot codify into rules. The fine-tuned model achieves 93.9% accuracy on parseable outputs compared to 60% for few-shot prompting, demonstrating that domain expertise can be encoded directly into model weights through targeted training on labeled examples. Important Disclaimer: The project uses simulated data. There is no affiliation and no personal information in the simulated data used in this project.
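As a rough illustration of the LoRA mechanism itself (not the project's training code), the effective weight at inference is the frozen base matrix plus a scaled low-rank product: W + (alpha/r) * B @ A. A plain-Python sketch, with hypothetical names and a deliberately tiny example:

```python
def matmul(A, B):
    # Plain-Python matrix multiply for the small illustration below.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_effective_weight(W, A, B, alpha, r):
    # LoRA keeps the base weight W frozen and learns only the low-rank
    # factors A (r x d_in) and B (d_out x r); the update B @ A is scaled
    # by alpha / r and added to W at inference time.
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
```

Because only A and B are trained, the number of trainable parameters is a small fraction of the full 7B weight matrix, which is what makes fine-tuning Mistral-7B tractable on modest hardware.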

Read More

MCP Banking Workflows: AI-Powered Model Risk Management Automation

MCP Banking Workflows is a production-ready Model Context Protocol (MCP) server that automates model documentation validation, dependency analysis, and regulatory compliance checking for banking credit risk models. The system addresses a critical bottleneck in Model Risk Management: documentation drift - where model code evolves through versions while PowerPoint presentations, Excel data dictionaries, and Word white papers fall out of sync. Using 10 specialized tools accessible via Claude or any MCP-compatible LLM, the server enables AI assistants to validate cross-file consistency, analyze change impact across a 16-model dependency graph, check SR 11-7 compliance, and generate analyst onboarding briefs - transforming hours of manual cross-referencing into seconds of automated validation. Important Disclaimer: the case studies and all data for this project are simulated. There is no affiliation, and no personal information appears in any data for this project.
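One of the simpler checks described above, cross-file version consistency, can be sketched as follows. This is a hypothetical illustration of the drift-detection idea, not the server's actual tool implementation; artifact names are made up.

```python
def check_version_consistency(artifacts):
    # Each artifact (code, data dictionary, white paper, deck) records which
    # model version it documents; anything out of sync with the code drifted.
    code_version = artifacts["model_code"]["version"]
    drifted = [name for name, meta in artifacts.items()
               if meta["version"] != code_version]
    return {"in_sync": not drifted, "drifted_artifacts": drifted}
```

Exposed as an MCP tool, a check like this lets the LLM answer "is the white paper still in sync with v3.2 of the model?" without a human opening each file.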

Read More

ChurnGuard: End-to-End MLOps Pipeline for Customer Churn Prediction

ChurnGuard is a production-ready machine learning system that predicts customer churn for telecom companies. Beyond the model itself, this project demonstrates the complete MLOps lifecycle: experiment tracking, containerization, multi-service orchestration, and cloud deployment. The architecture mirrors how ML systems are built at telecom companies, where the ability to deploy and maintain models in production is as critical as model accuracy.
Read More

AutoDoc AI v2: Memory-Enhanced Multi-Agentic Documentation System

AutoDoc AI v2 transforms the original multi-agent system into a truly multi-agentic architecture where agents learn from past generations, adapt to different insurance portfolios, and make autonomous routing decisions. The upgrade introduces three critical capabilities: a 3-tier memory system that enables cross-session learning and user personalization, LangGraph state machine orchestration with explicit conditional routing and revision cycles, and dynamic portfolio detection that automatically configures agents for Personal Auto, Homeowners, Workers’ Compensation, or Commercial Auto documentation. These additions solve the fundamental limitation of v1: every document followed the same fixed pipeline regardless of portfolio complexity or historical patterns. Now, a Workers’ Comp model automatically triggers strict compliance checking with 4 revision cycles, while the system learns that “tail development documentation” fails 73% of Workers’ Comp reviews and proactively flags it before generation begins.

Read More

AutoDoc AI: Multi-Agent RAG System for Regulatory Documentation Automation

AutoDoc AI is a production-ready multi-agent orchestration system that transforms PowerPoint presentations into comprehensive, audit-ready model documentation. The system addresses a critical bottleneck in model risk management: senior analysts spending 40-60 hours per model on documentation that must comply with the Model Audit Rule, multiple Actuarial Standards of Practice (ASOPs), and audit requirements. Using specialized AI agents with a custom orchestration (and a LangGraph version), AutoDoc AI retrieves context from past documentation through RAG (retrieval-augmented generation), validates regulatory compliance in real time, and generates 30-50 page White Papers that meet stringent audit standards. This architecture solves the fundamental challenge of AI in regulated industries: combining the speed and consistency of automation with the accuracy and accountability required for regulatory oversight, making it applicable beyond insurance to any domain where documentation quality directly impacts regulatory compliance, audit outcomes, and business risk.

Read More

IncidentIQ: AI-Powered Edge Case Resolution (LightGBM + LangGraph Multi-Agent)

IncidentIQ is a production-ready hybrid incident response system that combines gradient boosting with AI agents to solve the edge case problem in DevOps and IT operations. Traditional ML models excel at classifying standard incidents but fail catastrophically on edge cases like misleading symptoms that point to the wrong root cause, false positives during expected high-traffic events, or novel patterns from feature deployments. IncidentIQ uses a fast binary classifier (incident vs. normal) to handle 80% of cases in milliseconds, then routes ambiguous situations to a multi-agent AI system that investigates root causes, applies business context, and proposes specific remediation actions with full reasoning chains. The system demonstrates value through five edge cases: preventing $47K in unnecessary Black Friday scaling when the model falsely predicted an incident, catching a gradual memory leak 2 hours before failure that the model missed, discovering network degradation was the real cause when the model incorrectly blamed the database, identifying specific feature flag interactions affecting only 2% of users when the model had low confidence, and detecting early-stage cascade failures across services when individual metrics appeared normal. Built with production-grade governance including hard rules, human review triggers, and comprehensive audit trails, the system prevents unnecessary remediations, eliminates false positive alerts, and converts ambiguous incidents into actionable insights. This architecture demonstrates that modern ML operations require intelligent orchestration of models, agents, and human oversight, not just better algorithms, and the same hybrid pattern applies to any domain where rigid automation meets complex edge cases like credit decisioning, fraud detection, claims processing, or trading anomaly detection.
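The routing split described above can be sketched as a simple confidence gate: confident predictions resolve in the fast path, and only the grey zone escalates to agents. The thresholds and function names here are hypothetical illustrations, not the project's actual values.

```python
def route_incident(model_probability, low=0.2, high=0.8):
    # The fast binary classifier resolves confident cases in milliseconds;
    # ambiguous probabilities escalate to the multi-agent investigation.
    if model_probability >= high:
        return "auto_incident"
    if model_probability <= low:
        return "auto_normal"
    return "escalate_to_agents"
```

The design choice is that the expensive agent pipeline only ever sees the minority of cases where the classifier is genuinely unsure, keeping median latency at classifier speed.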

Read More

CreditIQ: AI-Native Credit Decisioning Platform

CreditIQ is a production-ready hybrid credit decisioning system that combines gradient boosting with AI agents to solve the edge case problem in lending. While traditional ML models excel at standard approvals and denials, they struggle with borderline cases, thin-file applicants with strong alternative data, contradictory signals like good credit but high DTI, or near-miss denials that deserve conditional approval. CreditIQ routes 80% of applications through a fast LightGBM model (<10ms) and sends the remaining 20% of edge cases to an AI reasoning agent that evaluates nuanced factors, proposes modified terms, and generates FCRA-compliant explanations. Built with full governance guardrails (hard rules agents cannot override, human review for high-stakes cases, comprehensive audit trails), the system delivers a 147x ROI by preventing defaults, converting denials to conditional approvals, and providing regulatory-grade explainability, all while demonstrating that modern ML operations require strategic orchestration of models, agents, and human oversight, not just better algorithms. To be very clear, this is a personal project done with synthetic data I generated specifically for this experiment.
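The "hard rules agents cannot override" guardrail can be illustrated with a minimal eligibility check that runs before any agent reasoning. The specific rules, thresholds, and field names below are hypothetical, not the system's actual policy.

```python
HARD_RULES = [
    # (rule name, predicate over an application dict). Agents may propose
    # modified terms, but a hard-rule failure is a denial they cannot reverse.
    ("min_age", lambda app: app["age"] >= 18),
    ("max_dti", lambda app: app["dti"] <= 0.60),
]

def apply_hard_rules(app):
    # Evaluate every rule so the audit trail records all failures, not just
    # the first one encountered.
    failed = [name for name, rule in HARD_RULES if not rule(app)]
    return {"eligible": not failed, "failed_rules": failed}
```

Running deterministic rules outside the agent's control is what makes the system's decisions defensible to regulators: the AI reasons only within a space the rules have already bounded.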
Read More

Building a Zero-Hallucination RAG Agent: Custom LangChain vs Pre-Built Tools

This project tackles a common pain point in retrieval-augmented generation: hallucination. Off-the-shelf tools like Flowise invented fake projects and technologies when applied to my portfolio. To fix this, I built a custom LangChain pipeline with hallucination prevention at the architecture level. It separates metadata vs. semantic queries, enforces strict grounding rules, validates responses for citations, and defaults to “I don’t know” when context is missing. The result: a 0% hallucination rate, lightweight costs (~$0.02/month at 100 queries), and full control over retrieval/validation.
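The "default to I don't know" behavior can be sketched as a thin wrapper around retrieval and generation. This is an illustrative toy with hypothetical names, not the pipeline's actual code; the citation check here is deliberately crude (substring match on chunk ids).

```python
def answer_with_grounding(question, retrieved_chunks, generate):
    # Refusing when retrieval returns nothing is the last line of defense
    # against hallucination: no context, no answer.
    if not retrieved_chunks:
        return "I don't know."
    answer = generate(question, retrieved_chunks)
    # Require the answer to cite at least one retrieved source id;
    # uncited answers are rejected rather than passed through.
    if not any(chunk["id"] in answer for chunk in retrieved_chunks):
        return "I don't know."
    return answer
```

The architectural point is that grounding is enforced by code around the model, not by hoping the prompt alone keeps the model honest.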
Read More

Prompt Engineering Lab: From Zero-Shot to Production-Ready Systems

This project set out to move beyond “clever phrasing” and show how prompt engineering evolves into system design. Starting with zero-shot classification, I gradually layered in engineered prompts, schema enforcement, validation, tool integration, and retrieval augmentation (RAG). Along the way, I benchmarked where the baseline fell short, built confusion matrices to visualize misclassifications, and demonstrated how grounding prompts in external context unlocks accuracy and reliability. The result is not just a demonstration of model capability, but a framework for building production-ready AI systems that save cost, reduce error, and scale safely.
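Schema enforcement, one of the layers mentioned above, can be as simple as parsing the model's output as JSON and rejecting anything off-schema instead of letting it flow downstream. A minimal sketch; the field names are hypothetical, not the lab's actual schema.

```python
import json

REQUIRED_KEYS = {"label": str, "confidence": float}

def parse_classification(raw_output):
    # Parse the model's raw text as JSON and enforce a minimal schema;
    # return None for anything malformed so callers can retry or flag it.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    for key, typ in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), typ):
            return None
    return data
```

Validation like this is what turns a clever prompt into a system component: downstream code only ever sees outputs that already satisfy the contract.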

Read More

Building Production-Ready Fraud Detection: A Complete ML Pipeline Journey

This project demonstrates a comprehensive end-to-end machine learning pipeline for fraud detection, built entirely using Claude Code and showcasing advanced prompt engineering techniques. What started as a simple fraud detection system evolved into a sophisticated demonstration of how to navigate real ML challenges, overcome baseline model limitations, and deploy production-ready solutions using modern DevOps practices.

Read More

AI-in-the-Cloud Knowledge Shootout (Perplexity and NotebookLM)

This project pits Perplexity against NotebookLM as a continuation of the Cross-Cloud Shootout series. It set out to test whether AI copilots could act as cloud knowledge orchestrators, producing reliable guidance on architecture, cost, and governance. Instead of benchmarking AWS and GCP directly, the experiment compared how each tool answered the same six cloud prompts. NotebookLM was tied to a curated corpus of AWS/GCP docs and my Cross-Cloud Shootout write-ups; Perplexity searched the open web in real time. The shootout revealed two complementary roles: NotebookLM excels at structured, policy-level synthesis, while Perplexity delivers concise, actionable answers.

Read More

RiskBench AI Coding Shootout (Claude Code, Cursor, GitHub Copilot)

This project set out to pit three leading AI coding assistants (GitHub Copilot, Claude Code, and Cursor) against each other in a controlled “shootout,” with each tool tasked to build out the same end-to-end machine learning pipeline. Across four sprints, the tools generated synthetic datasets, trained and tuned XGBoost models, explored data quality and feature engineering, and ultimately deployed a serving API with SHAP-based interpretability. By holding the repo, prompts, and acceptance tests constant, the project revealed not just raw coding differences, but how each tool shapes data quality, model credibility, and the path to a production-ready ML system.

Read More

Cross-Cloud AutoML Shootout: Lessons from AWS, GCP, and BigQuery

When I kicked off the Cross-Cloud AutoML Shootout, the idea was simple: put AWS and GCP side by side, train on the same dataset, and see which delivered the better model with less friction. What started as a straightforward benchmark quickly turned into something bigger, a case study in how different cloud philosophies shape the experience of doing machine learning. Just like in banking, where model development often collides with regulatory guardrails, this project revealed how quotas, hidden constraints, and pricing structures can be as important as the algorithms themselves.
Read More

SignalGraph 5G - Anomaly Detection & Forecasts (PySpark + Postgres/Teradata + Prophet)

SignalGraph 5G is a demo system that ingests synthetic 4G/5G KPI data, processes it through a Spark-based lakehouse pipeline, and exposes an analyst-friendly UI in Streamlit. The project was designed for anomaly detection, large-scale data engineering, data warehouse/lakehouse integration, and applied ML/forecasting in the network domain. It is deployed as a live Streamlit web app on Render, connected to a Neon Postgres warehouse.

Read More

NetworkIQ - Incident Risk Monitor (Render, Google Cloud, AWS)

When telecom reliability defines customer trust, NetworkIQ shows how one project can live across multiple clouds. NetworkIQ predicts congestion and visualizes incidents on Render, GCP Cloud Run, and AWS, completing the One Project, Three Clouds vision. Built with PySpark preprocessing, XGBoost prediction, and Streamlit dashboards, NetworkIQ demonstrates that portability, scalability, and explainability can be baked into a single AI-first system, no matter the platform.

Read More

AI-Augmented BNPL Risk Dashboard with Intelligent Override System (Scikit-learn/XGBoost, Streamlit, Render)

In the fast-growing Buy Now Pay Later market, consumers face hidden risks from fragmented credit visibility and rapid lending decisions that can spiral into unmanageable debt. This project tackles the problem by providing a Streamlit-based dashboard with real-time monitoring, anomaly detection, policy simulations, and an intelligent override system that allows immediate intervention when risk thresholds are breached. The result is a tool that balances speed with safety, giving risk teams clear insights, actionable controls, and the confidence to manage BNPL risk responsibly in a space where regulation has not yet caught up.

Read More

Credit Risk Model Deployment & Monitoring (AWS + PySpark + CatBoost)

In a world where credit decisions must stay reliable, explainable, and scalable, this project addresses the challenge of deploying a credit risk model in a cloud-native, data-intensive environment. It builds a synthetic telecom-inspired credit dataset, then uses PySpark for scalable preprocessing, CatBoost for powerful categorical modeling, Amazon S3 for seamless cloud storage, and SHAP for insight into feature impact. The result is a scalable, explainable pipeline that supports segment-level business insights and aligns with real-world credit risk workflows, giving teams confidence in automation and clarity in decision-making for postpaid lending.

Read More

Telecom Churn Modeling & Retention Strategy

Customer churn erodes revenue and undermines growth in competitive telecom markets, and preventing it requires early and reliable signals. This project delivers a complete churn modeling pipeline that combines Python, Pandas, scikit-learn, and XGBoost to predict at-risk customers, SHAP for clear interpretability, and CLTV simulations to quantify revenue exposure. It also incorporates model monitoring through Population Stability Index and customer segmentation to guide retention strategies. The outcome is a system that not only predicts churn but also explains it, monitors its stability, and translates insights into actionable business decisions.
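The Population Stability Index used for monitoring has a standard closed form: the sum over score bins of (actual share - expected share) * ln(actual share / expected share). A minimal sketch over pre-binned distributions; the binning itself is assumed to happen upstream, and the epsilon floor is a common guard against empty bins.

```python
import math

def psi(expected_props, actual_props, eps=1e-6):
    # Population Stability Index over aligned score-bin proportions.
    # Rule of thumb: < 0.1 stable, 0.1-0.25 some shift, > 0.25 major shift.
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total
```

In a monitoring loop, the expected proportions come from the development sample and the actual proportions from each new scoring window, so drift shows up as a rising PSI long before accuracy metrics degrade.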

Read More

Customer Segmentation Using Statistical Clustering

Telecom companies must understand who their customers are to tailor marketing and retention strategies effectively. This project simulates a diverse base of 5,000 postpaid customers, then applies preprocessing and K-Means clustering to reveal four distinct personas. Visualization through PCA aids interpretation, while segment profiling by usage patterns, tenure, churn risk, and payment behavior, drives targeted strategic actions. The result is an operationally intuitive segmentation model that supports personalization, retention, and plan design using a realistic, scalable methodology.

Read More

Telecom Engagement Monitoring using Fractional Logistic Regression

This project implements a fractional logistic regression monitoring pipeline for tracking customer engagement in a telecom environment. It simulates realistic development and monitoring datasets to evaluate how well the model generalizes over time using key metrics such as RMSE, MAE, PSI, and calibration curves.

Read More

Customer Churn Prediction App (Deployed on Render)

Customer churn risk can quietly erode business value, so this project builds a real-time prediction engine designed to surface risk before it materializes. It constructs and preprocesses realistic telecom-style churn data using ColumnTransformer, trains a RandomForestClassifier, and packages both the model and preprocessing steps using joblib. The user interacts via an intuitive Streamlit interface that signals churn likelihood in real time. Hosted serverlessly on Render, the app bridges data science with operational readiness.

Read More

Credit Bureau Sandbox: Governance Gate & Dashboard Hook (AWS + Tableau)

This project demonstrates hands-on work with bureau-style sandbox data, a credit reporting dashboard hook, and an AWS-based governance gate for model productionization. The repo showcases these in a lightweight, auditable way, without committing secrets or spinning up heavy compute. It provides concrete artifacts, one-liners to reproduce behavior, and clear pointers to files.

Read More

Fraud Detection with XGBoost and scikit-learn

This project demonstrates a full machine learning workflow for detecting fraudulent transactions using simulated data, with XGBoost, SMOTE for class imbalance, RandomizedSearchCV for hyperparameter tuning, and threshold optimization to improve performance.
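Threshold optimization, mentioned above, typically means sweeping candidate cutoffs on held-out scores and keeping the one that maximizes the chosen metric, since a fraud model's default 0.5 cutoff rarely suits a heavily imbalanced class. A minimal F1-based sketch; the data and function name are hypothetical.

```python
def best_f1_threshold(scores, labels, candidates=None):
    # Sweep thresholds over the observed scores and return the cutoff
    # that maximizes F1 on this (held-out) sample.
    if candidates is None:
        candidates = sorted(set(scores))
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The same sweep works for any metric (cost-weighted recall, precision at fixed alert volume) by swapping the scoring line.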

Read More

Lending Club Credit Risk: AWS ML Showcase (Governance + Cost Control, under $25)

This project demonstrates a budget-conscious, console-first ML pipeline on AWS: data profiling with AWS Glue DataBrew, feature curation and storage in Amazon S3, training and packaging XGBoost in Amazon SageMaker with Managed Spot Training, registering the model in the SageMaker Model Registry, and offline scoring/metrics suitable for a batch decisioning use case. Guardrails include Budgets, Anomaly Detection, and deletion/stop procedures to keep spend near $0/hr when idle.

Read More

Airline Flight Delay Prediction with Python

This project aims to predict whether a flight will be significantly delayed (15+ minutes) using flight metadata, weather, and carrier information. Understanding delay drivers is essential for airlines and airports to improve operations and passenger experience.

Read More

Analyzing A/B Test Impact on Marketplace Conversions with Uplift Modeling

This project simulates and analyzes an A/B pricing test in a marketplace context. Using Python, I simulate customer behavior, estimate the causal impact of a price change on conversion rates, and apply uplift modeling to identify heterogeneous treatment effects across cities. The project demonstrates key skills in experimental design, causal inference, uplift modeling, and data visualization.
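The core uplift quantity, conversion rate under treatment minus conversion rate under control within each segment, can be computed directly from assignment and outcome data. A toy illustration of that two-group calculation, not the project's modeling code; row format and names are hypothetical.

```python
def uplift_by_segment(rows):
    # rows: (segment, treated_flag, converted_flag) tuples.
    # Uplift per segment = P(convert | treated) - P(convert | control).
    stats = {}
    for seg, treated, converted in rows:
        s = stats.setdefault(seg, {"t": [0, 0], "c": [0, 0]})
        arm = s["t"] if treated else s["c"]
        arm[0] += converted  # conversions in this arm
        arm[1] += 1          # customers in this arm
    def rate(pair):
        return pair[0] / pair[1] if pair[1] else 0.0
    return {seg: rate(s["t"]) - rate(s["c"]) for seg, s in stats.items()}
```

Segments with strongly positive uplift are where the price change actually causes conversions; uplift models generalize this difference-of-rates idea to individual-level predictions.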

Read More

Telecom Customer Churn Prediction with Python

This project focuses on predicting customer churn in the telecommunications industry. Customer churn occurs when a user stops using a company’s services. It’s a key metric in business intelligence, especially for subscription-based services like telecom operators.

Read More

Webscraping with R

This example scrapes web data and cleans it using R’s rvest and the Tidyverse. Here we scrape Wikipedia’s table of countries by external debt.

Read More

Twitter Data

This example scrapes Twitter data, visualizes it, and reviews some descriptive statistics.

Read More