Debugging Kubernetes with AI: How MCP Gives LLMs Real Cluster Context

You’re debugging at 2 AM. A deployment won’t roll out. Pods are stuck in CrashLoopBackOff. You open Copilot Chat and paste the error — the AI gives you a reasonable list of possibilities. OOM? Bad probe? Missing secret? Image pull error?

All plausible. None specific. Because the AI can’t see your cluster.

This is the fundamental gap: LLMs have absorbed thousands of Kubernetes docs, Stack Overflow answers, and runbooks. They genuinely understand the theory. But theory without data produces checklists, not answers.

Ask any LLM “why is my pod CrashLoopBackOff?” and you’ll get something like:

  1. Check container logs (kubectl logs)
  2. Check events (kubectl describe pod)
  3. Check resource limits (OOMKilled?)
  4. Check probes (liveness failing?)
  5. Check image pull (ImagePullBackOff?)
  6. Check volume mounts (missing ConfigMap/Secret?)
  7. Check init containers (stuck init?)

This is correct and useless in equal measure. You already know what to check. The bottleneck is running the right 5 commands out of hundreds, cross-referencing the output, and synthesizing a root cause. That’s the part that takes 30 minutes at 2 AM.

MCP (Model Context Protocol) is an open standard that lets AI assistants call external tools. Instead of reasoning in a vacuum, the AI sends structured requests to a server that returns real data.
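
Under the hood, an MCP tool call is a JSON-RPC 2.0 message. A minimal sketch of what a client sends — the tool name and arguments here are illustrative, not KubeGlass's actual schema:

```python
import json

# A hypothetical MCP "tools/call" request (JSON-RPC 2.0), as an AI client
# might send it to a Kubernetes MCP server. The tool name and arguments
# are illustrative placeholders, not KubeGlass's real schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_pod_events",
        "arguments": {"namespace": "production", "pod": "checkout-service-7d4b9"},
    },
}

print(json.dumps(request, indent=2))
```

The server responds with structured data — the event timeline, the logs, the resource spec — which the model folds into its next reasoning step.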

For Kubernetes, this means the AI can:

  1. Request the pod’s event timeline → see the actual OOMKilled event
  2. Request the last 50 log lines → see the actual stack trace
  3. Request resource definitions → see the actual memory limit set to 128Mi
  4. Request live metrics → see memory peaking at 142Mi

Same AI, same knowledge — but now with your cluster’s data. The output shifts from “here are 7 things to check” to “your pod is OOMKilled because it uses 142Mi but the limit is 128Mi.”
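
The arithmetic behind that diagnosis is trivial once the data is in hand. A sketch, with quantity parsing simplified to the binary suffixes Kubernetes uses most:

```python
# Compare a pod's memory usage against its limit, using Kubernetes-style
# binary quantities. Simplified sketch: handles only Ki/Mi/Gi suffixes
# and plain byte counts, not the full Kubernetes quantity grammar.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def parse_quantity(q: str) -> int:
    for suffix, factor in UNITS.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # plain bytes

limit = parse_quantity("128Mi")
usage = parse_quantity("142Mi")
assert usage > limit  # the condition that produces an OOMKill
print(f"over limit by {(usage - limit) // 1024**2}Mi")  # → over limit by 14Mi
```

The hard part was never the comparison — it was gathering the two numbers from the right places.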

KubeGlass includes an MCP server with 95 tools. It uses your existing kubeconfig — the same clusters accessible via kubectl are available to the AI.

Add to .vscode/settings.json:

{
  "mcp": {
    "servers": {
      "kubeglass": {
        "command": "kubeglass",
        "args": ["mcp"]
      }
    }
  }
}

That’s it. No API keys, no cloud service, no sidecar agents. The MCP server runs locally alongside the dashboard.

Here’s what actually happens when you ask a question with MCP connected.

You type:

The checkout-service in production has pods stuck in CrashLoopBackOff. Why?

What the AI does (you see tool calls in the chat):

  1. Calls get_pods with namespace=production, field selector matching checkout-service
  2. Sees status: CrashLoopBackOff, last termination reason: OOMKilled
  3. Calls get_pod_resources → memory limit 128Mi, request 64Mi
  4. Calls get_pod_logs with --previous → sees fatal error: runtime: out of memory
  5. Synthesizes: limit too low for actual usage
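
Stubbed in code, that chain looks something like the following — the tool functions here stand in for real MCP responses, and their names and shapes are illustrative:

```python
# Stubbed tool results standing in for real MCP responses. Function names
# and result shapes are hypothetical, for illustration only.
def get_pods(namespace, selector):
    return [{"name": "checkout-service-7d4b9", "status": "CrashLoopBackOff",
             "last_termination_reason": "OOMKilled"}]

def get_pod_resources(pod):
    return {"memory_limit": "128Mi", "memory_request": "64Mi"}

def get_pod_logs(pod, previous=True):
    return "fatal error: runtime: out of memory"

# The synthesis step: each call narrows the hypothesis space until one
# root cause remains.
pod = get_pods("production", "app=checkout-service")[0]
diagnosis = None
if pod["last_termination_reason"] == "OOMKilled":
    limit = get_pod_resources(pod["name"])["memory_limit"]
    logs = get_pod_logs(pod["name"], previous=True)
    diagnosis = f"OOMKilled: memory limit {limit} too low ({logs})"
print(diagnosis)
```

Each step is conditional on the previous result — exactly the branching a human debugger does, minus the typing.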

AI responds:

The checkout-service pods are being OOMKilled. The container has a 128Mi memory limit, but the application is exceeding it — the previous container’s last log line shows fatal error: runtime: out of memory. Increase the memory limit to at least 256Mi. If memory usage has grown recently, also check for a leak by examining the allocation profile.

That’s a complete diagnosis with a specific recommendation, not a generic checklist. The AI did in 3 seconds what would have taken you several minutes of kubectl commands.

The OOMKilled example is simple — you might have caught it yourself in 2 minutes. The real wins come from problems that span multiple resource types.

“Why can’t the frontend pod reach the API service?”

The AI checks: Service selectors → Pod labels → NetworkPolicies → endpoint health → DNS resolution. A human would need 5+ commands. The AI chains them automatically and finds the NetworkPolicy blocking egress on port 8080.
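
The first link in that chain is mechanical: a Service routes to a pod only when its selector is a subset of the pod's labels. A sketch:

```python
# Does a Service's selector match a pod's labels? In Kubernetes, a
# selector matches when every selector key/value pair appears in the
# pod's labels (extra labels on the pod are fine).
def selector_matches(selector: dict, labels: dict) -> bool:
    return all(labels.get(k) == v for k, v in selector.items())

service_selector = {"app": "api", "tier": "backend"}
pod_labels = {"app": "api", "tier": "backend", "version": "v2"}
assert selector_matches(service_selector, pod_labels)

# A typo on either side breaks endpoint registration silently:
assert not selector_matches({"app": "api-svc"}, pod_labels)
```

A mismatch here means the Service has no endpoints at all — a different failure than a NetworkPolicy block, which is why the AI checks both.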

“The CI pipeline’s deploy step is failing with forbidden errors.”

The AI checks: ServiceAccount → RoleBindings → ClusterRoleBindings → specific verb permissions. Finds that the service account has get and list but not patch on Deployments in the target namespace.
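
The verb check at the end of that chain reduces to scanning the rules bound to the service account. A simplified sketch, mirroring the shape of Kubernetes Role rules:

```python
# Sketch of an RBAC verb check: does any rule grant this verb on this
# resource? Simplified — real Kubernetes rules also scope by apiGroups
# and resourceNames.
def allowed(rules, verb, resource):
    return any(
        (verb in r["verbs"] or "*" in r["verbs"])
        and (resource in r["resources"] or "*" in r["resources"])
        for r in rules
    )

# The failing CI service account from the example: get/list but no patch.
rules = [{"verbs": ["get", "list"], "resources": ["deployments"]}]
assert allowed(rules, "get", "deployments")
assert not allowed(rules, "patch", "deployments")  # → the forbidden error
```

The diagnosis writes itself once the bound rules are enumerated; the work is walking ServiceAccount → bindings → rules across two binding types.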

“Is staging actually matching what’s in our Helm values?”

The AI compares live resources against the last Helm release’s manifest, finds 3 ConfigMaps that were manually edited and no longer match the chart values.
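
Drift detection is, at its core, a key-by-key diff between the rendered manifest and the live object. A minimal sketch over ConfigMap data:

```python
# Sketch of configuration drift detection: compare the data a Helm
# release rendered against what is live in the cluster. Keys that differ,
# or exist on only one side, are drift.
def drift(rendered: dict, live: dict) -> dict:
    keys = rendered.keys() | live.keys()
    return {k: (rendered.get(k), live.get(k))
            for k in keys if rendered.get(k) != live.get(k)}

rendered = {"LOG_LEVEL": "info", "TIMEOUT": "30s"}
live = {"LOG_LEVEL": "debug", "TIMEOUT": "30s"}  # manually edited in-cluster
print(drift(rendered, live))  # → {'LOG_LEVEL': ('info', 'debug')}
```

Scaled across every resource in a release, this is how a handful of manually edited ConfigMaps surfaces in seconds.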

“What’s running in the platform namespace and how healthy is it?”

New team members get a complete picture without knowing which kubectl commands to run: deployments, their replica status, recent events, resource utilization, and any warnings.

The 95 tools span the full operational surface:

Category        What it covers
Workloads       Pods, Deployments, StatefulSets, DaemonSets, Jobs, CronJobs
Networking      Services, Ingresses, NetworkPolicies, Endpoints
Storage         PVCs, PVs, StorageClasses
Configuration   ConfigMaps, Secrets (metadata only), resource quotas
RBAC            Roles, bindings, who-can queries, access matrix
Helm            Releases, history, values, manifests
GitOps          Argo CD apps, Flux sources, sync status
Cluster         Nodes, namespaces, events, API resources, CRDs
Cost            Namespace cost breakdown, resource efficiency
Health          Security policies, probes, resource constraints

All MCP tools are read-only. The AI can inspect, diagnose, and recommend — but it cannot kubectl apply, delete pods, or modify your cluster. Mutations require human confirmation through the KubeGlass UI or kubectl.

This is intentional. AI-assisted debugging should reduce investigation time. AI-driven mutations on production clusters are a different risk category entirely.

# Install
brew install kubeglass/tap/kubeglass
# Start the dashboard (also serves MCP)
kubeglass

Configure your MCP client as shown above, then ask it something about your cluster. The first question most people ask: “Give me a health summary of this cluster.”

From there, the conversation naturally leads to whatever needs attention.