NetworkIQ - Incident Risk Monitor (Render, Google Cloud, AWS)
When telecom reliability defines customer trust, NetworkIQ shows how one project can live across multiple clouds. NetworkIQ predicts congestion and visualizes incidents on Render, GCP Cloud Run, and AWS, completing the One Project, Three Clouds vision. Built with PySpark preprocessing, XGBoost prediction, and Streamlit dashboards, NetworkIQ demonstrates that portability, scalability, and explainability can be baked into a single AI-first system, no matter the platform.
Live App: GCP Deployment | Render Deployment | AWS Deployment
🔹 Project Overview
NetworkIQ is a telecom-aligned MVP that transforms network telemetry into faster incident detection (MTTD↓), better customer experience (NPS proxy↑), and leaner cost per GB.
It demonstrates how AI-first system design can turn raw performance data into actionable insights for network operators. The dataset used in NetworkIQ is synthetic and was generated to mimic realistic telecom KPIs such as signal strength (RSRP, RSRQ, SINR), throughput, latency, jitter, and drop rates. Values were modeled on publicly available ranges from industry specifications and research datasets, with synthetic generation used to create plausible correlations (e.g., poor SINR leading to lower throughput and higher drop rates).
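To illustrate that generation approach, here is a hedged sketch of how correlated synthetic KPIs can be drawn. The distributions, scale factors, and link functions below are made-up illustrations, not the project's actual generator; the point is that poor SINR should imply lower throughput and a higher drop rate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Draw SINR (dB) first, then derive the dependent KPIs from it.
sinr_db = rng.normal(loc=12, scale=6, size=n).clip(-5, 30)

# Shannon-style proxy: throughput grows with log2(1 + linear SINR), plus noise.
sinr_lin = 10 ** (sinr_db / 10)
throughput_mbps = (20 * np.log2(1 + sinr_lin) + rng.normal(0, 5, n)).clip(min=0.1)

# Drop rate rises as SINR falls (logistic link), clipped to [0, 1].
drop_rate = 1 / (1 + np.exp(0.4 * (sinr_db - 3)))
drop_rate = (drop_rate + rng.normal(0, 0.02, n)).clip(0, 1)

# Sanity check: the intended correlations hold in the generated sample.
print(np.corrcoef(sinr_db, throughput_mbps)[0, 1] > 0)  # SINR up -> throughput up
print(np.corrcoef(sinr_db, drop_rate)[0, 1] < 0)        # SINR up -> drops down
```
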
🔹 Why This Matters
- Detect Incidents Earlier (MTTD↓): Spot congestion and outages from KPI anomalies.
- Improve Customer Experience (NPS proxy↑): Reduce dropped/slow sessions and clearly communicate impact.
- Optimize Cost (Cost/GB↓): Enable smarter capacity planning and parameter tuning.
🔹 Tech Stack & Architecture
Architecture:
CSV → PySpark (ETL) → Parquet → XGBoost (prediction) → Streamlit Dashboard → Multi-cloud Deployment (Render + GCP + AWS roadmap).
- Data Pipeline: PySpark CSV → Parquet ingestion
- Modeling: Logistic Regression, Random Forest, XGBoost (selected as best-performing)
- Dashboard: Streamlit app deployed multi-cloud
- Visualization: Interactive map overlays predictions with intuitive cell-site visuals
- CI/CD: GitHub Actions workflow for GCP Cloud Run deployment
- Secrets: Managed securely via Google Secret Manager
🔹 EDA & Key KPIs
NetworkIQ ingests standard radio access network KPIs tracked by telcos:
- Throughput (Mbps): Data delivery performance.
- Latency (ms): Network responsiveness.
- Packet Loss (%): Connection stability.
- Dropped Session Rate: Customer experience proxy.
EDA Example (PySpark Snippet):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("NetworkIQ").getOrCreate()
df = spark.read.csv("data/raw/sample_cells.csv", header=True, inferSchema=True)
# Derive Mbps (assuming `duration` is in seconds): bytes * 8 bits / 1e6.
df = df.withColumnRenamed("cell_id", "cell").withColumn(
    "throughput_mbps", (F.col("bytes") * 8) / (F.col("duration") * 1e6)
)
df.show(5)
🔹 Model Development
Model Performance

| Model | AUC | KS | Notes |
|---|---|---|---|
| Logistic Regression | 0.74 | 0.28 | Interpretable but weaker baseline |
| Random Forest | 0.81 | 0.36 | Higher complexity, moderate gains |
| XGBoost | 0.86 | 0.42 | Best performance, robust & scalable |
📈 XGBoost outperformed the alternatives, identifying up to 20% more high-risk cells at the same alert rate.
Training Example (Python):

```python
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Train the gradient-boosted classifier on the engineered KPI features.
xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1, random_state=42)
xgb.fit(X_train, y_train)

# Score held-out cells and evaluate ranking quality with AUC.
preds = xgb.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, preds)
print("AUC:", auc)
```
🔹 AI Interpretation (Executive Briefings)
NetworkIQ includes an AI Interpretation module powered by LLMs (Gemini API). With a single click, it generates:
- Executive Summaries โ highlights network-wide issues and trends.
- Actionable Recommendations โ suggests where intervention should be prioritized.
- Per-Cell Explanations โ natural language explanations for why each site is at risk.
This ensures both technical and non-technical stakeholders can understand the model outputs without digging into raw metrics.
How it works:
I integrated the Gemini API into the Streamlit app through a lightweight wrapper. When a user clicks Generate AI Briefing, the app sends a structured prompt containing the filtered network metrics (latency, throughput, drop rate, predicted risk) for the selected cells. Gemini then returns a natural-language summary, which I format into an executive briefing, action recommendations, or per-cell explanations depending on the userโs choice.
To keep the demo efficient and cost-aware, the feature is set to manual execution rather than auto-refresh. In a production system, this workflow would run automatically as filters change, ensuring stakeholders always have up-to-date AI-generated insights.
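A minimal sketch of such a wrapper, with illustrative function and field names (the app's actual code is not shown here). The LLM client is injected as a callable, so the prompt-building logic can be exercised with a stub instead of a live Gemini call:

```python
from typing import Callable

def build_briefing_prompt(cells: list[dict], mode: str = "executive") -> str:
    """Build a structured prompt from the filtered network metrics."""
    header = {
        "executive": "Summarize network-wide issues and trends for executives.",
        "actions": "Recommend where intervention should be prioritized.",
        "per_cell": "Explain in plain language why each cell is at risk.",
    }[mode]
    lines = [
        f"- {c['cell']}: latency={c['latency_ms']}ms, "
        f"throughput={c['throughput_mbps']}Mbps, "
        f"drop_rate={c['drop_rate']:.1%}, risk={c['risk']:.2f}"
        for c in cells
    ]
    return header + "\nFiltered cells:\n" + "\n".join(lines)

def generate_briefing(cells: list[dict], llm: Callable[[str], str]) -> str:
    # `llm` is whatever client gets injected; in the demo it runs only on click.
    return llm(build_briefing_prompt(cells, mode="executive"))

# Example with a stub client standing in for the Gemini wrapper:
cells = [{"cell": "C-101", "latency_ms": 85, "throughput_mbps": 12.4,
          "drop_rate": 0.06, "risk": 0.81}]
print(generate_briefing(cells, llm=lambda prompt: "[stub briefing]"))
```
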
🔹 Interactive Risk Map
The Predicted Risk Map overlays model outputs onto cell-site locations:
- Circle size = risk magnitude (larger means higher predicted probability).
- Color = risk level (amber → red as risk increases).
- Hover tooltips display cell ID and predicted probability.
This visualization makes it easy to spot geographic clusters of risk and prioritize field resources.
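The encoding rules above can be sketched as small pure functions. The base radius, scale factor, and color endpoints are my assumptions for illustration, not the app's exact values:

```python
def risk_radius(prob: float, base: float = 50.0, scale: float = 400.0) -> float:
    """Marker radius in meters: larger circle = higher predicted probability."""
    return base + scale * prob

def risk_color(prob: float) -> tuple[int, int, int]:
    """Interpolate amber (255, 191, 0) -> red (255, 0, 0) as risk rises."""
    g = int(191 * (1 - prob))
    return (255, g, 0)

print(risk_radius(0.9))  # 410.0
print(risk_color(0.0))   # (255, 191, 0) amber
print(risk_color(1.0))   # (255, 0, 0) red
```
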
🔹 Deployment & Validation
✅ Multi-Cloud Deployment: Live dashboards on GCP and Render
✅ Predictive Engine: XGBoost congestion prediction integrated into dashboard
✅ AI Integration: Translates predictions into natural-language insights for non-technical users
✅ Industry Validation: Demoed to a telecom professional, who highlighted predictive accuracy, intuitive mapping, and accessibility
CI/CD Workflow (Excerpt):

```yaml
name: Deploy to Cloud Run
on:
  workflow_dispatch:
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Authenticate
        uses: google-github-actions/auth@v1
        with:
          credentials_json: $
      - name: Deploy
        run: gcloud run deploy networkiq --source . --region us-central1 --allow-unauthenticated
```
🔹 Responsible AI & Monitoring
- Explainability: SHAP values for feature contribution.
- Stability: PSI to monitor population drift.
- Transparency: Model card skeleton included in /docs.
- Ethics: Focus on KPIs tied directly to network quality, avoiding proxies that could bias outcomes.
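A minimal PSI check along these lines (the quantile bin scheme and the 0.1 "no drift" threshold are common conventions, not necessarily the project's monitoring code): compare a feature's distribution at scoring time against its training baseline.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    # Bin edges come from the baseline (training) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Small epsilon avoids log(0) on empty bins.
    e_frac, a_frac = e_frac + 1e-6, a_frac + 1e-6
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
stable = rng.normal(0, 1, 10_000)      # same population
shifted = rng.normal(0.5, 1, 10_000)   # drifted population

print(psi(baseline, stable) < 0.1)   # below the usual "no drift" threshold
print(psi(baseline, shifted) > 0.1)  # drift gets flagged
```
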
🔹 Roadmap
- AWS App Runner deployment to complete the "One Project, Three Clouds" strategy.
- Feedback Loops for adaptive retraining.
- Advanced Forecasting (ARIMA/Prophet baselines).
- Expanded KPI Coverage for richer incident monitoring.
🔹 Lessons Learned
- Spark Version Mismatch: Fixed by aligning PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the Python 3.10 environment.
- API Latency: Optimized Streamlit queries to reduce dashboard lag.
- Secrets Management: Ensured API keys were never exposed in code or logs.
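The interpreter alignment fix above can be sketched as a couple of environment exports (the interpreter name is illustrative; it could equally be an absolute path to a virtualenv):

```shell
# Point both the Spark driver and the workers at the same Python 3.10 binary
# so serialized tasks run under a matching interpreter version.
export PYSPARK_PYTHON=python3.10
export PYSPARK_DRIVER_PYTHON=python3.10
echo "driver=$PYSPARK_DRIVER_PYTHON workers=$PYSPARK_PYTHON"
```
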
How I Deployed NetworkIQ on AWS (Free Tier)
Goal: make the app easy to reach on the public internet while keeping cost ≈ $0 on a new AWS account.
What I deployed
- The Streamlit app is packaged in a Docker image (see Dockerfile).
- I launched one free-tier EC2 instance (Amazon Linux 2023, t2.micro).
- I ran the container so the site is reachable on port 80 (HTTP).
Simple architecture
- EC2 (t2.micro) → tiny virtual machine in AWS.
- Docker → runs the app the same way everywhere.
- Streamlit → serves the UI.
- Security group → allows web traffic in.
Live Demo (AWS)
AWS (EC2) | Live Demo |
The EC2 instance serves the Dockerized Streamlit app on HTTP 80 (host 80 → container 8080).
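The launch behind that port mapping might look like the following sketch (the install step, image name, and restart policy are my assumptions, not the project's exact commands):

```shell
# On the Amazon Linux 2023 instance: install and start Docker.
sudo dnf install -y docker && sudo systemctl start docker

# Build the image from the repo's Dockerfile, then map host 80 -> container 8080.
docker build -t networkiq .
docker run -d --restart unless-stopped -p 80:8080 networkiq
```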
Note: The demo may be offline outside demo hours and will be retired at the end of my AWS Free Tier window.
🔹 License
MIT © 2025 Paulo Cavallo.