From Dashboards to Dialogue: How CodeACT & MCP Power Autonomous Root Cause Analysis
- Senthilkumar Bala
- Apr 24
- 4 min read
Updated: Apr 29
In today’s highly distributed systems, the challenge in observability isn’t the absence of data—it’s the overwhelming flood of metrics, logs, traces, alerts, and dashboards scattered across silos. Data Analysts, Engineers and SREs are forced to manually correlate signals across tools like Prometheus, Grafana, DataDog, and CloudWatch—often under intense time pressure. The result? Slow root cause analysis, alert fatigue, and missed patterns.
The next evolution in observability isn’t more dashboards—it’s conversational intelligence. Imagine asking, “Why did latency spike in us-west during deployment?” and getting a precise, data-backed answer within seconds.
That’s the vision behind a GenAI agent powered by CodeACT and the Model Context Protocol (MCP)—a self-reasoning system that turns natural language queries into deep, autonomous root cause analysis (RCA). In this blog, we explore how these two components shift observability from visual hunting to cognitive querying.
The Problem with Dashboards
Dashboards are useful for monitoring—but they’re fundamentally reactive. They expect the user to:
Know what metrics to look at,
Understand which systems are involved,
And manually correlate across noisy signals from different tools.
In Network Operations Centers (NOCs) for ISPs or SRE teams managing complex distributed systems, this becomes overwhelming. One incident—like a latency spike or packet drop—can trigger alerts from:
Prometheus (metrics),
Grafana dashboards,
Logs from CloudWatch or DataDog,
Status pages and ticketing tools.
Each dashboard offers a piece of the puzzle, but analysts must mentally stitch these fragments into a cohesive diagnosis. The cognitive overload not only slows down time-to-resolution, it often leads to false leads and finger-pointing.
Enter CodeACT and MCP: Enabling Conversational RCA
What is CodeACT?
CodeACT stands for Code Adaptive Compute-efficient Tuning, a framework designed for code-first large language models (LLMs) that autonomously generate, debug, and execute code tailored to analytics tasks. Think of it as a specialized AI agent that not only understands what you're asking but can act on it using code.
CodeACT’s Core Capabilities:
Code Generation: Produces SQL, Python, and bash scripts based on the analytical query and context.
Self-Debugging: If the code fails, it automatically identifies the issue and rewrites the logic.
Tool Invocation: Executes diagnostics across environments (logs, metrics, APIs).
Execution Loop: Runs code, evaluates results, and iterates if needed, like a data scientist in a loop (see the sketch below).
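To make the loop concrete, here is a minimal, self-contained sketch of a generate, execute, evaluate, retry cycle. The LLM call and the executor are stubbed placeholders, not part of any published CodeACT API; a production agent would call a code-generating model and run the code in an isolated sandbox.

```python
# Illustrative sketch of a CodeACT-style generate -> execute -> evaluate loop.
# llm_generate_code() and run_sandboxed() are placeholders, not a real API.
from dataclasses import dataclass

@dataclass
class RunResult:
    ok: bool
    output: str = ""
    error: str = ""

def llm_generate_code(task: str, feedback: str) -> str:
    # Placeholder for the code-first LLM; returns a trivial script here.
    return "result = 'latency stats computed'"

def run_sandboxed(code: str) -> RunResult:
    # Placeholder executor; a real agent would use an isolated sandbox, not exec().
    try:
        scope: dict = {}
        exec(code, scope)  # illustration only
        return RunResult(ok=True, output=str(scope.get("result", "")))
    except Exception as exc:
        return RunResult(ok=False, error=repr(exc))

def solve(task: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        code = llm_generate_code(task, feedback)   # 1. generate
        result = run_sandboxed(code)               # 2. execute
        if result.ok:                              # 3. evaluate
            return result.output
        # 4. self-debug: feed the error back so the model can rewrite the logic
        feedback = f"Previous attempt failed:\n{result.error}"
    raise RuntimeError("Could not produce working analysis code")

print(solve("Summarise latency for us-west"))
```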
Augmented with a Deep Reasoning Engine:
Anomaly Detection: Flags outliers and trend breaks.
Impact Assessment: Links anomalies to KPIs.
Pattern Recognition: Leverages historical RCA patterns.
Hypothesis Formulation: Suggests root causes and validates them.
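For instance, the anomaly-detection step above could be as simple as flagging points whose rolling z-score breaks a threshold. This is an illustrative sketch with synthetic data, not the engine's actual method.

```python
# Illustrative anomaly detection: flag latency samples whose rolling z-score
# exceeds a threshold. Metric values and window sizes are assumptions.
import numpy as np
import pandas as pd

def flag_latency_anomalies(series: pd.Series, window: int = 30, z_thresh: float = 3.0) -> pd.Series:
    """Return a boolean mask marking points far outside the rolling baseline."""
    rolling_mean = series.rolling(window, min_periods=window).mean()
    rolling_std = series.rolling(window, min_periods=window).std()
    z_scores = (series - rolling_mean) / rolling_std
    return z_scores.abs() > z_thresh

# Example: synthetic latency series with a sustained spike injected after 2 PM
idx = pd.date_range("2025-04-24 12:00", periods=240, freq="min")
latency = pd.Series(np.random.normal(40, 3, len(idx)), index=idx)
latency.loc["2025-04-24 14:00":] += 120   # simulated 120 ms regression
print(flag_latency_anomalies(latency).sum(), "anomalous samples flagged")
```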
What is MCP (Model Context Protocol)?
Traditional AI fails when it lacks context. MCP solves this by acting as the memory and sensory layer of the GenAI system.
MCP is the connective tissue that bridges the GenAI agent with live observability data, enabling it to perform real-time reasoning and action.
MCP Has Two Key Components:
🔹 MCP Client Module
Resides in the Task Orchestration Layer
Responsible for translating AI-generated tasks (e.g., SQL queries, API calls, telemetry pulls) into actionable requests
Acts as the execution gateway—sending instructions to and receiving results from the MCP Server
🔹 MCP Server Endpoint
Interfaces with a wide array of observability and telemetry systems
Connects to real-time sources like CSV files, Postgres, Google Analytics, DataDog, CloudWatch, Prometheus, Grafana, StatusNow, and more
These systems act as the “eyes and ears” of the GenAI agent—feeding it with:
Metadata, Key Performance Indicators (KPIs)
Logs, metrics, and traces
Contextual inputs and outputs from external systems
Together, the MCP client and server allow the agent to operate in a live system-aware loop, making root cause analysis not just fast, but contextually accurate.
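As a rough illustration of that loop, here is a minimal client-side sketch using the MCP Python SDK. The server script name (`telemetry_mcp_server.py`), the `query_prometheus` tool, and its arguments are assumptions about what a telemetry-facing MCP server might expose; a real server defines its own tools.

```python
# Minimal MCP client sketch. Tool name and server script are illustrative assumptions.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch an MCP server process that fronts the telemetry sources.
    server = StdioServerParameters(command="python", args=["telemetry_mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()           # discover what the server exposes
            print([t.name for t in tools.tools])
            result = await session.call_tool(
                "query_prometheus",                       # hypothetical tool
                arguments={
                    "query": 'avg(latency_ms{region="us-west"})',
                    "start": "2025-04-24T14:00:00Z",
                    "end": "2025-04-24T16:00:00Z",
                },
            )
            print(result.content)

asyncio.run(main())
```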
A Day in the Life of Dialogue-Based RCA: Network Latency Edition
Let’s walk through a real-world Network Operations scenario, where the GenAI agent powered by CodeACT and MCP helps identify the root cause of a latency spike.
You Ask:
Why did latency increase in the us-west region between 2 PM and 4 PM?
What Happens Behind the Scenes:
Query Parsing (Agent + LLM)
Timeframe? ✓
Metric of interest: latency? ✓
Affected segment: us-west region? ✓
Dependency graph for network infrastructure? ✓
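The parsed question might be represented as a structured object like the one below; the field names are illustrative, not a fixed schema.

```python
# Hypothetical structured parse of the natural-language question.
parsed_query = {
    "metric": "latency",
    "region": "us-west",
    "time_range": {"start": "2025-04-24T14:00:00", "end": "2025-04-24T16:00:00"},
    "related_signals": ["packet_drop", "bandwidth_saturation", "change_events"],
    "dependency_scope": "network_edge",   # derived from the infrastructure graph
}
```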
Code Generation (LLM)
Generates SQL queries / Pandas code to:
Extract the latency time series
Cross-reference packet drop and bandwidth saturation metrics
Correlate with upgrades / downtime in the same time period
Detect any configuration change pushed at 1:50 PM
The generated code is then sent to the MCP Server for execution (a sketch of such code follows below)
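The generated code itself might look something like the following sketch. The connection string, table names (`latency_samples`, `change_events`), and column names are assumptions made for illustration.

```python
# Sketch of the kind of analysis code the agent might generate.
# Table and column names are illustrative assumptions.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://telemetry_ro@db/observability")  # assumed DSN

latency = pd.read_sql(
    """
    SELECT ts, avg(latency_ms) AS latency_ms
    FROM latency_samples
    WHERE region = 'us-west'
      AND ts BETWEEN '2025-04-24 13:00' AND '2025-04-24 16:30'
    GROUP BY ts ORDER BY ts
    """,
    engine, parse_dates=["ts"], index_col="ts",
)

changes = pd.read_sql(
    "SELECT ts, summary FROM change_events WHERE region = 'us-west' "
    "AND ts BETWEEN '2025-04-24 13:00' AND '2025-04-24 16:30'",
    engine, parse_dates=["ts"],
)

# Compare mean latency in the 30 samples before and after each change event
for _, change in changes.iterrows():
    before = latency.loc[:change.ts, "latency_ms"].tail(30).mean()
    after = latency.loc[change.ts:, "latency_ms"].head(30).mean()
    print(f"{change.ts} {change.summary}: {before:.0f} ms -> {after:.0f} ms")
```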
Code Execution and Data Fetching (MCP Server)
The MCP Server executes the code to pull results from multiple sources (a sample Prometheus pull is sketched after this list):
Prometheus: latency and packet loss metrics
CloudWatch: EC2 instance health, network throughput
DataDog: anomaly alerts and logs
StatusNow: recent change events or incidents
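For example, the Prometheus pull could go through its standard range-query HTTP API. The Prometheus URL and metric name below are assumptions; only the `/api/v1/query_range` endpoint and its parameters are standard.

```python
# Fetch a latency time series from Prometheus via its range-query API.
# The endpoint host and metric name are illustrative assumptions.
import requests

PROM_URL = "http://prometheus.internal:9090"   # assumed endpoint

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": 'histogram_quantile(0.95, sum(rate(request_latency_seconds_bucket{region="us-west"}[5m])) by (le))',
        "start": "2025-04-24T13:00:00Z",
        "end": "2025-04-24T16:30:00Z",
        "step": "60s",
    },
    timeout=30,
)
resp.raise_for_status()
series = resp.json()["data"]["result"]   # list of {metric, values: [[ts, value], ...]}
print(f"fetched {len(series)} series")
```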
Reasoning (LLM)
Anomaly detected in latency curves shortly after a router firmware patch
Increase in TCP retransmissions in the us-west edge cluster
Hypothesis: Firmware upgrade caused brief routing instability, elevating latency
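One simple way the agent could validate that hypothesis is to check whether the anomaly onset falls shortly after the recorded change event. A minimal sketch, assuming the timestamps were produced by the earlier detection and data-fetching steps:

```python
# Sketch: does the latency anomaly begin within a short window after the change?
# Both timestamps are assumed outputs of earlier steps.
import pandas as pd

change_ts = pd.Timestamp("2025-04-24 13:50")       # firmware patch event
anomaly_onset = pd.Timestamp("2025-04-24 14:02")   # first flagged latency sample

lag = anomaly_onset - change_ts
if pd.Timedelta(0) <= lag <= pd.Timedelta(minutes=30):
    print(f"Hypothesis supported: anomaly began {lag} after the firmware change")
else:
    print("Timing does not support the firmware-change hypothesis")
```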
Final Output
A router firmware upgrade in the us-west region at 1:50 PM triggered route instability and increased TCP retransmissions, leading to a sustained 120ms latency increase between 2 PM and 4 PM
Why This Changes the Game
| Traditional Analysis | GenAI-Powered RCA |
| --- | --- |
| Click-heavy dashboard exploration | Ask a question in natural language |
| Analyst-dependent expertise | AI-agent-driven expertise |
| Static reports | Dynamic, real-time insights |
| Siloed tool usage | Unified orchestration across telemetry |
Final Thoughts: From Insight to Action, Faster
CodeACT and MCP together enable the future of enterprise analysis—where insights aren’t discovered, they’re delivered.
Analysts can focus on strategy, not syntax. Engineers can focus on fixes, not forensics.
The age of sifting through dashboards is over. The age of conversational RCA has begun.