SignalGraph 5G - Anomaly Detection & Forecasts (PySpark + Postgres/Teradata + Prophet)

SignalGraph 5G is a demo system that ingests synthetic 4G/5G KPI data, processes it through a Spark-based lakehouse pipeline, and exposes an analyst-friendly UI in Streamlit. The project demonstrates anomaly detection, large-scale data engineering, data warehouse/lakehouse integration, and applied ML/forecasting in the network domain. It is deployed as a live Streamlit web app on Render, connected to a Neon Postgres warehouse.

Live App: Render Deployment

This demo uses a minimal Silver slice committed to the repo and a Neon Postgres view to mirror the warehouse mart. Expect a short cold start on the Free plan.


Architecture at a Glance

  • Data pipeline: Spark jobs implementing Bronze → Silver → Gold partitions (Parquet, hive-style).
  • Anomaly detection: Initial rule-based seed (latency_ms > 60 OR prb_util_pct > 85), extended with a supervised model (XGBoost) and Prophet forecasts.
  • Storage & Lakehouse: Hive partitions for scale-out processing; DuckDB/Postgres mirrors for BI/ops integration.
  • UI: Streamlit analyst view with partition filters, anomaly tables, and alerts.
  • Deployment: Render Web Service (Free).
  • Forecasting: Prophet-based forecasts on latency and PRB utilization.
  • Planned extensions: Graph analytics with Neo4j (cell neighbors, centrality), warehouse DDL for Teradata/Postgres, SHAP/feature attribution.

Key Factors in the Data

The dataset used in SignalGraph is synthetic, generated to reflect realistic 4G/5G KPIs and to embed plausible correlations (e.g., high PRB utilization driving up latency and packet loss); a sketch of that idea follows the field list below.

  • latency_ms – average user-plane latency; higher values mean slower responsiveness.
  • prb_util_pct – PRB utilization percentage; proxy for network load and congestion.
  • thrpt_mbps – downlink throughput; what the end user experiences as speed.
  • sinr_db – signal-to-interference-plus-noise ratio; higher values mean clearer signal.
  • rsrp_dbm – reference signal received power; less negative is stronger signal.
  • pkt_loss_pct / drop_rate_pct – packet or call drop percentage; reliability indicator.
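
A minimal sketch of how such correlations can be induced; the coefficients, noise levels, and column subset here are illustrative, not the project's actual generator:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Load is the driver: PRB utilization in percent.
prb_util_pct = np.clip(rng.beta(2.0, 3.0, n) * 100, 0, 100)

# Latency creeps up with load and spikes near saturation.
latency_ms = (20
              + 0.3 * prb_util_pct
              + 40 * (prb_util_pct / 100) ** 4
              + rng.normal(0, 3, n))

# Packet loss kicks in mostly above ~70% utilization.
pkt_loss_pct = np.clip(
    0.1 + 0.02 * np.maximum(prb_util_pct - 70, 0) + rng.normal(0, 0.05, n), 0, None)

kpis = pd.DataFrame({"prb_util_pct": prb_util_pct,
                     "latency_ms": latency_ms,
                     "pkt_loss_pct": pkt_loss_pct})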

App Screenshots

SignalGraph Analyst UI — Anomaly Monitoring
Shows filtered Silver data with anomaly flags and region/date controls.


SignalGraph Alerts & Model Risk Scores
Highlights top cells by anomaly rate and risk scores, with placeholder logic for alert explanations.


Forecast Graph — PRB Utilization
Prophet-based logistic growth model applied to hourly PRB utilization (bounded 0–100). Forecast includes mean (orange) and 80% interval (blue band).

What is PRB Utilization?

In 4G/5G networks, Physical Resource Blocks (PRBs) are the smallest unit of spectrum allocation that the base station scheduler can assign to users. Think of PRBs as the “lanes” on a highway:

  • Low PRB utilization = free lanes → smooth traffic, high throughput, low latency.
  • High PRB utilization (≥80–85%) = congestion → packets queue up, throughput drops, latency and jitter spike, and call drops may increase.

PRB utilization is therefore a direct capacity and congestion indicator, bridging RF (radio access) conditions with IP-level user experience. For scale: a 20 MHz LTE carrier has 100 PRBs, so 85% utilization means roughly 85 of them are occupied in an average scheduling interval.


Forecast Graph: PRB Utilization

The forecast graph shows PRB utilization over time at the cell level.

  • Blue line (Observed): historical hourly PRB usage, smoothed for noise.
  • Orange line (Forecast): Prophet’s logistic-growth prediction, bounded to 0–100% to reflect real physical limits.
  • Shaded band: 80% confidence interval, widening further out in time to capture growing uncertainty.

Interpretation:

  • Sustained upward trends approaching 80–90% signal impending congestion, guiding proactive actions (e.g., cell splitting, carrier aggregation, or load balancing).
  • Downward stabilization near 50–60% suggests healthy utilization, with enough headroom for bursts.
  • The widening confidence band reflects realistic modeling: while we cannot predict exact usage, we can bound the risk of overload.

Feature Update — Explanations, Interpretations & Polishing

Features I added to the app:

Alerts & Thresholds

  • Interactive Threshold Slider: Users can override the model’s default operating threshold (0.09) to surface more or fewer alerts.
    • Lowering the threshold increases recall (more cells flagged) but also raises false positives.
    • Default comes from metrics.json and is tuned for F1 balance; a minimal sketch of the control follows.
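
Assuming metrics.json stores the tuned default under an operating_threshold key (the key name is an assumption):

import json
import streamlit as st

# Tuned default from offline evaluation; key name assumed.
with open("metrics.json") as f:
    default_thr = float(json.load(f).get("operating_threshold", 0.09))

thr = st.slider(
    "Alert threshold (risk score)",
    min_value=0.0, max_value=1.0, value=default_thr, step=0.01,
    help="Lower it for recall (more alerts), raise it for precision.",
)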


Cell Drill-In — Neighbors & Centrality

  • Metrics:
    • Neighbor degree counts direct peers (higher → bigger local blast radius).
    • PageRank weights peers by their influence (higher → wider impact).
    • Betweenness shows if a cell is a bridge on shortest paths (higher → issues can fragment clusters).


  • Interpretation Block: Centrality explains network influence. These metrics help triage whether an issue is isolated or may propagate via topology.
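
Neo4j is the target store for this graph; as a stand-in, the same three metrics can be sketched with networkx over a neighbor edge list (the file path and column names are assumptions):

import networkx as nx
import pandas as pd

# Assumed edge list of cell adjacencies with columns cell_id, neighbor_id.
edges = pd.read_parquet("data/gold/cell_neighbors.parquet")
G = nx.from_pandas_edgelist(edges, "cell_id", "neighbor_id")

centrality = pd.DataFrame({
    "degree": dict(G.degree()),                   # direct peer count
    "pagerank": nx.pagerank(G),                   # influence-weighted reach
    "betweenness": nx.betweenness_centrality(G),  # bridging on shortest paths
}).rename_axis("cell_id").reset_index()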

Model Metrics

  • AUC-ROC (0.745): The model separates risky from normal cells fairly well (roughly a 75% chance that a randomly picked risky cell scores higher than a normal one).
  • AUC-PR (0.241): Above the random baseline, which for precision-recall equals the positive rate, so this is meaningful lift in imbalanced data.
  • Operating Threshold (0.09): Balances recall and precision; configurable in-app.
  • F1@thr (0.437): Trade-off between catching risks vs. tolerating false alarms.
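
For reference, a sketch of how such numbers are computed with scikit-learn; the arrays below are toy stand-ins, whereas the project evaluates on its held-out Gold split:

import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Toy stand-ins for next-hour labels and model risk scores.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.02, 0.11, 0.63, 0.05, 0.40, 0.07, 0.30, 0.09])

print(roc_auc_score(y_true, y_score))            # AUC-ROC
print(average_precision_score(y_true, y_score))  # AUC-PR
print(f1_score(y_true, y_score >= 0.09))         # F1 at the operating threshold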

PRB Utilization Forecast

  • How to Read:
    • Blue line = median forecast.
    • Shaded band = uncertainty.
    • Threshold (85%) = capacity pressure.


  • Per-Cell Interpretation: Each forecast panel automatically reports the chance of breaching the 85% PRB threshold within the next 12 hours.
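
One way to derive such a breach probability, assuming fc is a Prophet forecast frame with yhat / yhat_lower / yhat_upper columns, treating the 80% interval as roughly normal and hours as independent (both simplifications):

import numpy as np
from scipy.stats import norm

CAP = 85.0          # PRB pressure threshold (%)
nxt = fc.tail(12)   # next 12 forecast hours

# Back out a per-hour sigma from the 80% interval (z at the 90th percentile ~ 1.2816).
sigma = (nxt["yhat_upper"] - nxt["yhat_lower"]) / (2 * 1.2816)
p_hour = 1 - norm.cdf((CAP - nxt["yhat"]) / sigma)

# Chance of at least one breach anywhere in the window.
p_breach = 1 - np.prod(1 - p_hour)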

Network SLO Summary

  • Capacity SLO (99.6%): The network sits comfortably clear of PRB capacity pressure.
  • Latency SLO (93.7%): Most traffic meets the 60ms p95 target, though some cells drift.
  • Reliability SLO (28.2%): Packet drops are frequent — weakest dimension.

  • Note: In full deployment, SLOs recompute dynamically per region and date window, so numbers change with the user’s filters.
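
That recomputation reduces to a filter-then-mean over the hourly Gold table; in this sketch the path, column names, and SLO definitions are assumptions, and region / dates stand in for the user's filter selections:

import pandas as pd

gold = pd.read_parquet("data/gold/cell_hourly.parquet")
win = gold[(gold["region"] == region) & (gold["ts_hour"].dt.date.isin(dates))]

capacity_slo = (win["prb_util_pct_avg"] < 85).mean() * 100       # % of cell-hours below PRB pressure
latency_slo = (win["latency_ms_p95"] <= 60).mean() * 100         # % meeting the 60 ms p95 target
reliability_slo = (win["pkt_loss_pct_avg"] <= 1.0).mean() * 100  # % under an assumed 1% drop budget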

Warehouse View (Postgres)

  • Purpose: Exposes the raw last_hour_risk table served by the Postgres mirror of the warehouse mart.
  • Why: Lets ops verify the same records that models and dashboards consume — ensuring transparency and auditability.
  • User Action: Can filter, sort, or export rows for validation.
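
A minimal sketch of that mirror read, assuming the Neon DSN sits in Streamlit secrets under a DATABASE_URL key:

import pandas as pd
import sqlalchemy as sa
import streamlit as st

engine = sa.create_engine(st.secrets["DATABASE_URL"])  # Neon Postgres DSN (assumed key)
risk = pd.read_sql("SELECT * FROM last_hour_risk", engine)

st.dataframe(risk)  # built-in sort/filter; export can be wired up via st.download_button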

✅ These additions make the app not just a scoreboard, but an interpretable triage tool — linking model predictions, forecasts, graph centrality, and warehouse verification into one workflow.


Key Features

1. Spark Bronze → Silver ETL

I start with synthetic KPI data (cell_kpi_minute.parquet) and run a Spark job to enforce schema, add partitions, and flag anomalies:

from pyspark.sql import functions as F

# Derive partition columns; default missing regions to "unknown"
df2 = (df
    .withColumn("date", F.to_date("ts"))
    .withColumn("hour", F.hour("ts"))
    .withColumn("region", F.coalesce(F.col("region"), F.lit("unknown"))))

# Drop rows missing core RF readings, then apply the rule-based anomaly seed
df2 = (df2
    .filter(F.col("rsrp_dbm").isNotNull() & F.col("sinr_db").isNotNull())
    .withColumn("latency_poor", F.col("latency_ms") > 60)
    .withColumn("prb_high", F.col("prb_util_pct") > 85)
    .withColumn("anomaly_flag", (F.col("latency_poor") | F.col("prb_high")).cast("int")))

This produces hive-partitioned Silver tables by date and region.
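
The write step is not shown above; it would look roughly like this (output path illustrative):

(df2.write
    .mode("overwrite")
    .partitionBy("date", "region")   # yields date=.../region=.../ directories
    .parquet("data/silver/cell_kpi_minute"))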


2. Silver → Gold Aggregates

The Gold layer summarizes KPIs hourly per cell and generates a next-hour anomaly label for supervised training:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window for the next-hour label: each cell's hourly rows ordered in time
w = Window.partitionBy("cell_id").orderBy("ts_hour")

agg = df.groupBy("cell_id", "region", "ts_hour").agg(
    F.count("*").alias("n_samples"),
    F.avg("latency_ms").alias("latency_ms_avg"),
    F.expr("percentile_approx(latency_ms, 0.95, 100)").alias("latency_ms_p95"),
    F.avg("prb_util_pct").alias("prb_util_pct_avg"),
    F.max("anomaly_flag").alias("anomaly_any"),  # hour had any anomaly (assumed label source)
    # ... remaining KPI aggregates elided
).withColumn("y_next_anomaly", F.lead("anomaly_any").over(w).cast("int"))

3. Analyst UI (Streamlit)

Streamlit powers an interactive analyst view. It filters regions/dates, shows anomaly tables, and surfaces top cells by anomaly rate:

import streamlit as st

st.subheader("Top cells by anomaly rate")
top = (
    df.groupby("cell_id")[["anomaly_flag", "latency_ms", "prb_util_pct"]]
      .mean(numeric_only=True)                     # per-cell anomaly rate and KPI means
      .sort_values("anomaly_flag", ascending=False)
      .reset_index()
      .head(10)
)
st.dataframe(top)

Alerts are also generated when model risk scores exceed thresholds:

# Keep rows whose model risk score clears the operating threshold
alerts = scores[scores["risk_score"] >= thr].copy()
alerts["reason"] = alerts.apply(reason_from_row, axis=1)  # human-readable explanation per alert

4. Forecasting (Prophet & Visualization)

SignalGraph includes time-series forecasting for key KPIs like latency and PRB utilization. Forecasts include prediction intervals and network engineering guardrails:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(obs["ds"], obs["y"], label="Observed p95 latency (ms)")
ax.plot(fc["ds"], fc["yhat"], label="Forecast (yhat)")
ax.fill_between(fc["ds"], fc["yhat_lower"], fc["yhat_upper"], alpha=0.2)
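
The bounded PRB forecast comes from Prophet's logistic growth mode; a minimal sketch of the fit, assuming prb_hourly holds one cell's hourly series in Prophet's ds/y convention:

from prophet import Prophet

hist = prb_hourly.copy()   # columns: ds (timestamp), y (PRB utilization %)
hist["cap"] = 100.0        # physical upper bound
hist["floor"] = 0.0        # physical lower bound

m = Prophet(growth="logistic", interval_width=0.80)
m.fit(hist)

future = m.make_future_dataframe(periods=12, freq="H")
future["cap"] = 100.0
future["floor"] = 0.0
fc = m.predict(future)     # yhat, yhat_lower, yhat_upper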

Why This Matters

  • App utility: anomaly detection, forecasting, large-dataset handling, Spark/Hadoop-style pipelines, Teradata/warehouse mirroring, SLO tracking, SHAP explainability, and 4G/5G KPI domain knowledge.
  • Production discipline: schema contracts, timestamp precision guardrails, partitioning strategies, model triage thresholds, and monitoring artifacts.
  • Scalable & extensible: Designed to drop into Dataproc/EMR clusters and extend into graph/network analysis with Neo4j centrality and influence metrics.

Next Steps

  • Deploy the Streamlit UI as a live web app on Render so users can interact with SignalGraph directly (done; see the live app link above).
  • Mirror DuckDB marts into Postgres/Teradata with clean DDL.
  • Prototype a lightweight agent layer:
    • Monitoring Agent: track ETL freshness and anomalies in real time.
    • Forecasting Agent: run Prophet in parallel and compare with observed KPIs.
    • Orchestrator Agent: combine monitoring + forecasting into a single dashboard summary.

Tech Stack

  • Languages & Libraries: Python 3.10, PySpark 3.5.1, pandas, scikit-learn, XGBoost, Prophet, matplotlib, DuckDB, SHAP, Altair.
  • Frameworks: Streamlit UI, Spark ETL, PyArrow.
  • Data Stores: Hive-partitioned Parquet, DuckDB, Postgres/Teradata schema (warehouse view).
  • Graph & Network Analysis: Neo4j integration, centrality metrics (degree, PageRank, betweenness), neighbor drill-in.
  • Explainability & Monitoring: SHAP local/global feature attribution, threshold tuning with triage slider, SLO summaries (capacity, latency, reliability).
  • Domain: 4G/5G KPIs (RSRP, RSRQ, SINR, PRB utilization, latency, jitter, packet loss).
Written on August 29, 2025