RiskBench AI Coding Shootout (Claude Code, Cursor, GitHub Copilot)
This project set out to pit three leading AI coding assistants (GitHub Copilot, Claude Code, and Cursor) against each other in a controlled “shootout,” with each tool tasked to build out the same end-to-end machine learning pipeline. Across four sprints, the tools generated synthetic datasets, trained and tuned XGBoost models, explored data quality and feature engineering, and ultimately deployed a serving API with SHAP-based interpretability. By holding the repo, prompts, and acceptance tests constant, the project revealed not just raw coding differences, but how each tool shapes data quality, model credibility, and the path to a production-ready ML system.
Suggestions to Stay Comparable
- Use same prompts for each tool → apples-to-apples.
- Use full file prompts when possible → conserve requests.
- Always run the same tests/static checks → uniform evaluation.
- Keep logs/screenshots/prompt history → traceability.
RiskBench Shootout Tools
This shootout now includes three tools used on the same repo with the same prompts and tests:
- GitHub Copilot (inline IDE)
- Cursor (inline IDE)
- Claude Code (terminal/agent)
Evaluation remains identical across tools (tests, metrics, artifacts). Claude Code may edit files and run commands, but success is judged by the same local pytest/static checks.
Splitting the Project in Chunks/Sprints
I decided to split the project into four sprints to make sure I kept rich notes between stages and confirmed each tool was doing everything as planned before the final product.
- Sprint 1: Set up the repo structure, dependencies, and a working data generator with acceptance tests.
- Sprint 2: Train a baseline XGBoost model across the three tools.
- Sprint 3: Expose the trained model via an API and add lightweight interpretability.
- Sprint 4: Build a production-style serving layer with FastAPI/Uvicorn that delivers predictions and SHAP-based explanations through stable endpoints.
1. Sprint 1
The purpose of Sprint 1 was to test how three AI-assisted coding tools, GitHub Copilot, Claude Code, and Cursor, perform when tasked with generating the initial RiskBench package structure, producing synthetic training data, and running the provided tests.
This sprint serves as a baseline to evaluate:
- Setup smoothness: How easily each tool got running.
- Data generation: Ability to produce a valid `train.parquet`.
- Test execution: Reliability and completeness of unit tests.
- Failure modes: Nature of the issues when something went wrong.
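To make the acceptance criteria above concrete, here is a minimal pytest sketch of the kind of check involved; the `data/train.parquet` path, the `target` column name, and the positive-rate bounds are assumptions, not the exact tests each tool was given.

```python
# Hypothetical acceptance test in the spirit of Sprint 1; the path, the
# `target` column name, and the positive-rate bounds are assumptions.
import pandas as pd


def test_train_parquet_structure_and_rate():
    df = pd.read_parquet("data/train.parquet")

    # Structural checks: non-empty frame with a binary target column.
    assert len(df) > 0
    assert "target" in df.columns
    assert set(df["target"].unique()) <= {0, 1}

    # Positive rate should land in a plausible band for a credit-risk dataset.
    positive_rate = df["target"].mean()
    assert 0.05 <= positive_rate <= 0.25
```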
2. What Worked
- All three tools successfully generated the `riskbench` package scaffolding (pyproject, CLI, modules, tests).
- Data generation worked for Copilot and Claude: they produced a `train.parquet` file with the expected structure and size.
3. What Didn’t Work
- Claude: Several tests failed despite data generation succeeding. Failures were mainly due to assertion mismatches (`positive_rate` range) and an unsupported aggregation (`mean`) on categorical dtypes.
- Cursor: Data generation crashed due to Pandas attempting to cast `MonthEnd` objects into `int64`. As a result, no `train.parquet` was created in `C:\riskbench\data`. To proceed fairly, we copied the Copilot-generated dataset into Cursor's data folder. This preserves Cursor's failure as evidence while allowing Sprint 2 comparability.
- Copilot: Needed more manual setup than the others (virtual environment, ensuring the pyproject install worked), but once configured, all tests passed.
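For context on the Cursor failure, the sketch below reproduces a MonthEnd-style cast problem and one plausible fix; it is illustrative only and assumes the buggy generator stored raw `pd.offsets.MonthEnd` objects rather than resolved dates.

```python
# Illustrative only: reproduce a MonthEnd-style failure and one plausible fix.
# Assumes the buggy generator stored pd.offsets.MonthEnd objects directly.
import pandas as pd

df = pd.DataFrame({
    "account_id": range(5),
    # Buggy column: DateOffset objects (object dtype), which parquet writers
    # cannot serialize and may be coerced toward int64 during conversion.
    "statement_month": [pd.offsets.MonthEnd(i + 1) for i in range(5)],
})

# Fix: resolve each offset against an anchor date so the column becomes
# datetime64[ns], which parquet handles natively.
anchor = pd.Timestamp("2024-01-01")
df["statement_month"] = [anchor + offset for offset in df["statement_month"]]

df.to_parquet("data/train.parquet")  # illustrative output path
```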
4. Tool Comparisons
- GitHub Copilot
  - Strengths: Clean execution of tests, reliable pipeline once setup was fixed.
  - Weaknesses: Setup friction (manual environment work required).
  - Fewer interactions needed compared to the other two tools.
  - Observation: This highlighted one main issue with Copilot, discussed later: the prompts include a strict guardrail to read and write files only within each tool's designated directory, and Copilot was the only tool that violated it.
- Claude Code
  - Strengths: Very smooth initial structure generation, good CLI handling.
  - Weaknesses: Logical/semantic issues in test validation (e.g., rate range, aggregation errors).
  - More interactions/approvals needed than Copilot, on par with Cursor.
- Cursor
  - Strengths: Fast initial code generation, minimal friction to produce structure.
  - Weaknesses: Data generation failed (`MonthEnd` casting bug), leaving the dataset empty; required a manual dataset copy from Copilot to continue.
  - Required as many interactions/confirmations as Claude.
5. Sprint 1 Takeaways
- Having three tools side by side reveals not only coding ability but also error patterns:
  - Copilot leans on clean, conventional outputs but needs more manual direction.
  - Claude excels at scaffolding but struggles with edge-case correctness.
  - Cursor generates aggressively but is brittle in handling types.
- For the shootout, failures are not setbacks but valuable evidence of how each tool handles real-world complexity.
- Cursor’s dataset failure is especially important: it highlights that baseline functionality (writing to disk) cannot be taken for granted, even when scaffolding looks correct.
Sprint 2 Summary: Baseline XGBoost Training
Overview
Sprint 2 focused on training a baseline XGBoost model across the three tools.
Initial results revealed severe data leakage (perfect scores), which was then corrected by excluding `leakage_col` and `timestamp`.
The fixed runs highlighted divergences in performance across tools.
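A minimal sketch of the leakage fix follows; apart from `leakage_col` and `timestamp`, the `target` column name and the split parameters are assumptions about the pipeline rather than any tool's exact code.

```python
# Sketch of the leakage fix: drop leaky and non-feature columns before the
# split. Column names other than leakage_col/timestamp are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("data/train.parquet")

non_features = ["leakage_col", "timestamp", "target"]
X = df.drop(columns=[c for c in non_features if c in df.columns])
y = df["target"]

# Stratify so the minority positive class keeps the same share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```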
Results by Tool (Fixed Implementations)
Claude (Control)
- Train AUC: 0.9510
- Test AUC: 0.8220
- Train Accuracy: 0.9123
- Test Accuracy: 0.8427
- KS Statistic: 0.4885
Claude delivered a credible baseline model, with strong class separation and expected generalization.
This became the control reference for comparison.
Copilot
- Train AUC: 0.9375
- Test AUC: 0.4929
- Train Accuracy: 0.9108
- Test Accuracy: 0.9087
- KS Statistic: 0.0288
Copilot’s implementation produced random-like test performance:
- High accuracy is misleading due to class imbalance (majority class prediction).
- Very low KS and AUC indicate poor discriminatory power.
- Indicates either feature weakness or collapse of predictive signal after leakage removal.
Cursor
- Train AUC: 0.8051
- Test AUC: 0.5002
- Train Accuracy: 0.9087
- Test Accuracy: 0.9087
- KS Statistic: 0.0207
Cursor’s implementation mirrored Copilot’s:
- Essentially random performance (Test AUC ≈ 0.50).
- KS near zero confirms no class separation.
- Accuracy inflated by majority-class prediction.
- Cursor flagged feature engineering needs and dataset limitations.
Divergence Analysis
- Claude retained signal → Test AUC ≈ 0.82, credible baseline.
- Copilot and Cursor collapsed to random performance → Test AUC ≈ 0.49–0.50, KS ≈ 0.02–0.03.
- Both Copilot and Cursor flagged issues:
- Class imbalance (≈9% positive rate).
- Weak synthetic features after leakage exclusion.
- Need for stronger feature engineering to recover predictive signal.
Decision
✅ Claude’s model is adopted as the baseline control going forward.
- Provides stable, credible results.
- Establishes a common benchmark for Sprint 3 and beyond.
⚠️ Copilot and Cursor’s degraded models are preserved as evidence.
- Document important divergences.
- Highlight orchestration challenges and tool differences.
- Serve as a reminder of why orchestration (not blind execution) matters.
Key Learnings
- Leakage prevention works, but exposes dataset weaknesses.
- Class imbalance skews accuracy; AUC and KS are more reliable.
- Claude’s implementation showed the dataset still carries predictive signal.
- Copilot and Cursor revealed the true difficulty of the task without engineered features.
- Future work must emphasize feature engineering, regularization, and imbalance handling.
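To make the metric choice concrete, here is a small sketch of how AUC and the KS statistic can be computed from held-out scores; `y_test` and `y_prob` are assumed to come from a fitted model on a split like the one sketched earlier.

```python
# Sketch of the evaluation metrics discussed above. Assumes y_test (0/1 labels)
# and y_prob (predicted probabilities of the positive class) already exist.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score


def ks_statistic(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Max distance between the score distributions of positives and negatives."""
    return ks_2samp(y_prob[y_true == 1], y_prob[y_true == 0]).statistic


# Example usage (uncomment once y_test / y_prob are available):
# auc = roc_auc_score(y_test, y_prob)
# ks = ks_statistic(np.asarray(y_test), np.asarray(y_prob))
# print(f"AUC={auc:.4f}  KS={ks:.4f}")
```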
Sprint 3 Re-run on Unified Dataset
Context
Following Sprint 3.5, we unified all tools onto Claude's dataset (`train.parquet`) to ensure fair comparability.
Goal: re-run Sprint 3 (baseline + tuned modeling) for Copilot and Cursor using Claude’s data.
Claude (Control)
- Already trained and tuned on its own dataset (Claude’s data).
- Baseline AUC ≈ 0.822, Tuned AUC ≈ 0.840, KS ≈ 0.52.
- These results serve as the control for unified comparisons.
Copilot (Re-run with Claude Data)
- Dataset: Claude's unified `train.parquet`
- Baseline metrics:
  - ROC AUC: 0.8188
  - KS: 0.4824
  - Accuracy: 0.8392
  - Confusion matrix: [[11450, 640], [1772, 1138]]
- Tuned metrics:
  - ROC AUC: 0.8387
  - KS: 0.5163
  - Accuracy: 0.8487
  - Confusion matrix: [[11645, 445], [1825, 1085]]
- Best hyperparameters:
  - `colsample_bytree=0.8`, `learning_rate=0.1`, `max_depth=3`, `min_child_weight=3`, `n_estimators=200`, `reg_alpha=0.1`, `reg_lambda=1`, `subsample=0.8`
Interpretation: Copilot’s tuned results (AUC 0.839, KS 0.516) are nearly identical to Claude’s (AUC 0.840, KS 0.520).
This confirms the original divergence was due to dataset quality, not modeling code.
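A hedged sketch of the tuning step is below; the grid is an assumption built around the best parameters reported above, not the exact search either tool ran, and the `scale_pos_weight` value simply reflects Claude's ~19% positive rate.

```python
# Hedged tuning sketch. The grid is an assumption built around the reported
# best parameters; scale_pos_weight ~4.15 reflects a ~19% positive rate.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
    "min_child_weight": [1, 3],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "reg_alpha": [0, 0.1],
    "reg_lambda": [1],
}

search = GridSearchCV(
    XGBClassifier(eval_metric="auc", scale_pos_weight=4.15, random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```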
Cursor (Re-run with Claude Data)
- Attempted re-run blocked by Cursor usage limits (free plan quota reached).
- Current status: postponed until quota reset or Pro upgrade.
- Evidence retained:
- Original Sprint 3 results on Copilot’s dataset (AUC ≈ 0.50, KS ≈ 0).
- Sprint 3.5 diagnostics confirmed Cursor and Copilot shared the same weaker dataset.
- Decision: Copilot’s re-run serves as proxy evidence for dataset effect. Cursor re-run marked pending.
Decision
- Continue Sprint 4 (serving + interpretability) with Claude and Copilot (Claude’s dataset).
- Document Cursor’s blocked status transparently in RiskBench log.
- Preserve original datasets (`train_original.parquet`) for reproducibility.
Sprint 3.5 – Data Understanding & Feature Engineering
Objective
Diagnose why Claude produced credible models (AUC ~0.82) while Copilot and Cursor flatlined near random performance.
Focus: dataset quality, feature signal, class imbalance, and feature engineering opportunities.
Results by Tool
Claude
- Positive rate: 19.4% (scale_pos_weight ≈ 4.15)
- Top univariate feature AUC: `credit_grade` = 0.6934
- Other strong predictors: `credit_score` (0.6546), `debt_to_income` (0.6141)
- Signal: Clear hierarchy of predictive features, meaningful KS values
- Correlations: Moderate, interpretable
- Feature Engineering Suggestions: 12 (interactions, ratios, binning, temporal features)
- Narrative: Dataset has strong predictive signal; imbalance manageable.
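For reproducibility, the sketch below shows one way to compute the per-feature univariate AUROC and the `scale_pos_weight` ratio reported in these diagnostics; the crude ordinal encoding of categoricals and the `target` column name are assumptions, not necessarily what each tool did.

```python
# One way to reproduce the Sprint 3.5 diagnostics: per-feature univariate
# AUROC and the scale_pos_weight ratio. Encoding choices are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_parquet("data/train.parquet")
y = df["target"]

# scale_pos_weight = negatives / positives (~4.15 at a 19.4% positive rate).
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

univariate_auc = {}
for col in df.columns.drop("target"):
    x = df[col]
    if not pd.api.types.is_numeric_dtype(x):
        # Crude ordinal encoding just to rank categorical features.
        x = pd.Series(pd.factorize(x)[0], index=x.index)
    x = x.fillna(x.median())  # roc_auc_score cannot handle NaNs
    auc = roc_auc_score(y, x)
    univariate_auc[col] = max(auc, 1 - auc)  # orientation-free ranking

print(f"scale_pos_weight = {scale_pos_weight:.2f}")
print(pd.Series(univariate_auc).sort_values(ascending=False).head(5))
```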
Cursor
- Positive rate: 9.13% (scale_pos_weight ≈ 9.96)
- Top univariate feature AUC: `cat_1` = 0.5095 (barely above random)
- Signal: Near-random across all features; maximum KS = 0.0048
- Correlations: Extremely weak (<0.01)
- Missingness: num_3 (33%), num_5 (37%)
- Feature Engineering Suggestions: 12 (target encoding, temporal aggregations, ratios, missing flags)
- Narrative: Dataset weak; almost no inherent predictive signal; heavy reliance on FE.
Copilot
- Positive rate: 9.13% (scale_pos_weight ≈ 9.96)
- Top univariate feature AUC: `cat_1` = 0.5095
- Signal: Near-random; features ~0.50 AUROC, very low KS
- Correlations: Extremely weak (|r| < 0.01)
- Feature Engineering Suggestions: 10 (winsorization, binning, encodings, ratios, clustering)
- Narrative: Same dataset as Cursor; weak raw signal; emphasizes engineered features to compensate.
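The sketch below illustrates two of the suggested transformations (missingness flags and smoothed target encoding) on the weaker dataset; the `num_3`, `num_5`, and `cat_1` names come from the diagnostics above, while the `target` name and smoothing constant are assumptions, and in practice the encoding should be fit on the training split only.

```python
# Illustrative feature engineering on the weaker dataset: missingness flags and
# smoothed target (mean) encoding. Fit encodings on the training split only.
import pandas as pd

df = pd.read_parquet("data/train.parquet")

# Missingness flags for the heavily missing columns flagged in the diagnostics.
for col in ["num_3", "num_5"]:
    if col in df.columns:
        df[f"{col}_missing"] = df[col].isna().astype(int)

# Smoothed target encoding for a weak categorical (cat_1).
prior = df["target"].mean()
stats = df.groupby("cat_1")["target"].agg(["mean", "count"])
smoothing = 20  # assumed smoothing strength
df["cat_1_te"] = df["cat_1"].map(
    (stats["mean"] * stats["count"] + prior * smoothing) / (stats["count"] + smoothing)
)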
Strategic Takeaways
- Claude’s dataset (Sprint 1) is fundamentally richer: more positives, stronger predictors.
- Copilot & Cursor datasets (shared) are weaker: ~9% positives, near-random features.
- Divergence is due to dataset generation quality, not model training differences.
- Feature engineering is critical for Copilot/Cursor, while Claude’s models already benefit from strong raw signal.
Execution Friction & Developer Effort
- Copilot: Last to start but first to finish. Requires minimal clicks/approvals → fastest workflow.
- Claude: Produces the richest insights but requires frequent approvals, slower overall.
- Cursor: Similar interaction overhead as Claude but without comparable payoff.
Next Steps
- Define a common feature engineering spec (using best ideas from Claude, Cursor, Copilot).
- Re-run models with engineered features to normalize dataset advantage.
- Proceed to Sprint 4: serving (FastAPI endpoints) + interpretability (SHAP).
Project Note: Dataset vs EDA/Feature Engineering
Context
Sprint 3.5 revealed a fundamental reason for divergence between tools:
- Claude’s models performed well (AUC ~0.82 baseline, ~0.84 tuned).
- Copilot and Cursor collapsed to random-like predictions (AUC ~0.50).
Key Findings
- Dataset Generation (Sprint 1)
  - Claude generated a synthetic dataset with:
    - ~19% positive rate (vs. ~9% for Copilot/Cursor)
    - Stronger feature–target relationships
    - Realistic correlations between predictors
  - Copilot's dataset was weaker, and Cursor inherited Copilot's directly.
- EDA and Feature Engineering
  - Claude often performs implicit EDA/feature preparation steps (e.g., checking distributions, encoding, correlation checks) even before being explicitly prompted.
  - Copilot and Cursor tend to follow the literal prompt only, without enriching the workflow.
- Evidence from Sprint 3.5
  - Claude surfaced a clear signal hierarchy (`credit_grade`, `credit_score`, `debt_to_income`).
  - Copilot/Cursor showed features barely above random (best AUROC ~0.51), very weak correlations, and severe imbalance (~10:1).
  - Their feature engineering suggestions were generic "manufacture signal" strategies, while Claude proposed targeted, high-value transformations.
Interpretation
Claude’s advantage comes from both factors combined:
- It generated a better synthetic dataset in Sprint 1.
- It also tends to perform deeper implicit analysis (EDA/FE) without being told.
By contrast, Copilot and Cursor stuck to literal execution and were constrained by weaker data.
Decision Implication
To ensure fairness and isolate tool performance in Sprints 4+, I will unify all tools onto Claude's dataset (`train.parquet`).
- This ensures the comparison measures coding/orchestration ability, not dataset luck.
- Sprint 1 divergence will remain documented as evidence of tool differences in data generation.
Sprint 4: Serving & Explainability Report
Environment: Windows / PowerShell (`py` launcher), FastAPI + Uvicorn
What went wrong (root causes)
Copilot
- Wrote files outside its directory, breaking provenance and making it unclear which artifacts were in use.
- Placeholder artifacts (e.g., dummy pickles) caused `pickle`/`joblib` load failures.
- Result: repeated crashes when trying to start an API with Copilot's outputs.
ChatGPT (assistant guidance missteps)
- Proposed patches that weren’t always single‑line PowerShell edits (against my rule).
- Introduced a `/meta` endpoint that referenced undefined globals or nonexistent model attributes, producing 500s.
- A regex edit left a dangling block, causing a `SyntaxError` at import.
Serialization under Uvicorn
- Some preprocessors had been serialized under `uvicorn.__main__`, so the class path didn't resolve at runtime.
How Claude fixed it
- Restored a clean `serving.py`:
  - Single, robust `/meta` using explicit `MODEL_NAME="xgboost"`, `MODEL_VERSION="S3_tuned"`, and `FEAT_NAMES` (14 features).
  - Safe extraction of the SHAP base value (optional).
  - Removed duplicate/partial endpoints and stray code.
- Ensured preprocessor loading uses `joblib.load(path)` and is resilient to module aliasing.
- Validated the app via Swagger and curl/irm POSTs; all endpoints returned 200.
Net effect: API consistently boots, serves predictions and SHAP explanations, and exposes stable metadata and schema endpoints for screenshots.
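To make the fixed serving layer concrete, here is a hedged sketch of a `serving.py` in the shape described above; the artifact-loading shim, the placeholder feature names, and the `riskbench.preprocessing` module name are assumptions, not Claude's actual code.

```python
# Hedged sketch of the serving layer described above. Paths, feature names,
# and the riskbench.preprocessing module name are assumptions.
import sys

import joblib
from fastapi import FastAPI

MODEL_NAME = "xgboost"
MODEL_VERSION = "S3_tuned"
FEAT_NAMES = [f"feature_{i}" for i in range(14)]  # placeholders for the 14 real names


def load_artifact(path: str):
    """joblib.load that tolerates objects pickled under uvicorn.__main__.

    If a preprocessor class was serialized while running under Uvicorn, its
    recorded module path may not resolve; aliasing it to the module that
    actually defines the class (assumed here) lets unpickling succeed.
    """
    try:
        return joblib.load(path)
    except (AttributeError, ModuleNotFoundError):
        import riskbench.preprocessing as preprocessing  # assumed module
        sys.modules["uvicorn.__main__"] = preprocessing
        return joblib.load(path)


app = FastAPI(title="RiskBench serving API")


@app.get("/meta")
def meta():
    # Stable metadata: explicit constants, no reliance on model attributes
    # that a given artifact might lack.
    return {
        "model_name": MODEL_NAME,
        "model_version": MODEL_VERSION,
        "n_features": len(FEAT_NAMES),
        "features": FEAT_NAMES,
    }
```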
Interpretation of test results
This screenshot shows the Swagger UI for the /predict endpoint of the RiskBench serving API.
Endpoint:
- POST /predict is the API route that takes in a feature set (a single applicant’s data) and returns a risk score.
- It can optionally return SHAP explanations, which are feature-level contributions explaining why the model gave that score.
Parameters:
- The parameter explain is a boolean query flag.
- If set to true, the response will include reason codes: the top features, their values, and SHAP contributions (positive pushes score up, negative pushes score down).
- If set to false (or omitted), you just get the raw risk probability.
Request Body:
- You feed the endpoint a JSON object with all required features.
- Score = 0.254 → with the default threshold 0.5, this is a negative class decision.
- SHAP contributions (log-odds):
  - Positive → pushes risk/probability up; negative → pushes it down.
  - `credit_grade = "A"` contributed the largest increase (+2.37).
  - `credit_utilization = 0`, `credit_score = 700`, and `recent_inquiries = 0` decreased the score (risk-reducing).
- Business mapping: if the positive class is “high risk,” negatives are protective; if it’s “approval,” flip the interpretation. Either way, the reasons clearly justify a ~0.25 score.
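As a concrete illustration of the call above, here is a hedged client-side sketch using Python `requests`; the payload shows only a few of the 14 features, and the response keys (`score`, `reasons`) are assumptions about the schema rather than the documented contract.

```python
# Hedged client example for POST /predict?explain=true. The payload is partial
# and the response keys ("score", "reasons") are assumed, not guaranteed.
import requests

payload = {
    "credit_grade": "A",
    "credit_score": 700,
    "credit_utilization": 0,
    "recent_inquiries": 0,
    # ...remaining features omitted for brevity
}

resp = requests.post(
    "http://127.0.0.1:8000/predict",
    params={"explain": "true"},
    json=payload,
    timeout=10,
)
resp.raise_for_status()
body = resp.json()

print(f"risk score: {body['score']:.3f}")  # e.g. ~0.254 in the run discussed above
for reason in body.get("reasons", []):     # SHAP reason codes when explain=true
    print(reason)
```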
Final Takeaway
Claude Code consistently outperformed the others, generating stronger data, producing richer feature engineering, and delivering a fully functional serving API; Copilot proved fastest with minimal interaction but weaker at deeper analysis; and Cursor was limited by quota and dataset weaknesses. This shows that tool differences aren't just about speed: they shape data quality, model credibility, and whether you end up with a real, production-ready ML system.
Overall, I came away impressed with what these coding agents can already do: in just a few sprints, they scaffolded a repo, generated data, trained and tuned models, and stood up a serving layer with interpretability. The experience highlighted both their speed and their blind spots, showing that while automation can push projects forward quickly, the real value comes from guiding, refining, and connecting those outputs into something robust and strategically useful.