Debugging Kubernetes with AI: How MCP Gives LLMs Real Cluster Context
You’re debugging at 2 AM. A deployment won’t roll out. Pods are stuck in CrashLoopBackOff. You open Copilot Chat and paste the error — the AI gives you a reasonable list of possibilities. OOM? Bad probe? Missing secret? Image pull error?
All plausible. None specific. Because the AI can’t see your cluster.
This is the fundamental gap: LLMs have absorbed thousands of Kubernetes docs, Stack Overflow answers, and runbooks. They genuinely understand the theory. But theory without data produces checklists, not answers.
The problem with checklists
Ask any LLM “why is my pod in CrashLoopBackOff?” and you’ll get something like:
- Check container logs (`kubectl logs`)
- Check events (`kubectl describe pod`)
- Check resource limits (OOMKilled?)
- Check probes (liveness failing?)
- Check image pull (ImagePullBackOff?)
- Check volume mounts (missing ConfigMap/Secret?)
- Check init containers (stuck init?)
This is correct and useless in equal measure. You already know what to check. The bottleneck is running the right 5 commands out of hundreds, cross-referencing the output, and synthesizing a root cause. That’s the part that takes 30 minutes at 2 AM.
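In practice, that loop is a series of kubectl invocations run one at a time while you correlate the output. A sketch, with an illustrative pod name:

```sh
# The manual correlation loop; pod name is illustrative
kubectl -n production describe pod checkout-service-7d4b9c-x2k8f      # events, last state
kubectl -n production logs checkout-service-7d4b9c-x2k8f --previous --tail=50
kubectl -n production get events --sort-by=.lastTimestamp | tail -20
kubectl -n production top pod checkout-service-7d4b9c-x2k8f           # requires metrics-server
```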
What if the AI could see?
MCP (Model Context Protocol) is an open standard that lets AI assistants call external tools. Instead of reasoning in a vacuum, the AI sends structured requests to a server that returns real data.
For Kubernetes, this means the AI can:
- Request the pod’s event timeline → see the actual OOMKilled event
- Request the last 50 log lines → see the actual stack trace
- Request resource definitions → see the actual memory limit set to 128Mi
- Request live metrics → see memory peaking at 142Mi
Same AI, same knowledge — but now with your cluster’s data. The output shifts from “here are 7 things to check” to “your pod is OOMKilled because it uses 142Mi but the limit is 128Mi.”
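Under the hood these requests are plain JSON-RPC. Here’s a sketch of a single tool call on the wire: the `tools/call` method and response shape are standard MCP, the `get_pod_logs` tool appears in the session below, but the argument names are illustrative assumptions about KubeGlass’s schema:

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "get_pod_logs",
    "arguments": { "namespace": "production", "pod": "checkout-service-7d4b9c-x2k8f", "tailLines": 50 }
  }
}
```

The server answers with real data, which lands directly in the model’s context:

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "result": {
    "content": [{ "type": "text", "text": "fatal error: runtime: out of memory\n..." }]
  }
}
```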
Setting it up
KubeGlass includes an MCP server with 95 tools. It uses your existing kubeconfig — the same clusters accessible via kubectl are available to the AI.
Add to `.vscode/settings.json`:

```json
{
  "mcp": {
    "servers": {
      "kubeglass": {
        "command": "kubeglass",
        "args": ["mcp"]
      }
    }
  }
}
```

Add to your Claude Desktop MCP configuration:

```json
{
  "mcpServers": {
    "kubeglass": {
      "command": "kubeglass",
      "args": ["mcp"]
    }
  }
}
```

That’s it. No API keys, no cloud service, no sidecar agents. The MCP server runs locally alongside the dashboard.
A real debugging session
Here’s what actually happens when you ask a question with MCP connected.
You type:
The checkout-service in production has pods stuck in CrashLoopBackOff. Why?
What the AI does (you see tool calls in the chat):
- Calls `get_pods` with namespace=production, field selector matching checkout-service
- Sees status: CrashLoopBackOff, last termination reason: OOMKilled
- Calls `get_pod_resources` → memory limit 128Mi, request 64Mi
- Calls `get_pod_logs` with `--previous` → `fatal error: runtime: out of memory`
- Synthesizes: limit too low for actual usage
AI responds:
> The checkout-service pods are being OOMKilled. The container has a 128Mi memory limit, but the application is exceeding it — the previous container’s last log line shows `fatal error: runtime: out of memory`. Increase the memory limit to at least 256Mi. If memory usage has grown recently, also check for a leak by examining the allocation profile.
That’s a complete diagnosis with a specific recommendation, not a generic checklist. The AI did in 3 seconds what would have taken you several minutes of kubectl commands.
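If you accept the recommendation, the fix itself is a one-liner. A sketch with `kubectl set resources`; the deployment name is taken from the scenario, and in a GitOps setup you’d change the manifest or Helm values instead of patching live:

```sh
kubectl -n production set resources deployment/checkout-service \
  --limits=memory=256Mi --requests=memory=128Mi
```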
Where this actually matters
The OOMKilled example is simple — you might have caught it yourself in 2 minutes. The real wins come from problems that span multiple resource types.
Cross-resource diagnosis
“Why can’t the frontend pod reach the API service?”
The AI checks: Service selectors → Pod labels → NetworkPolicies → endpoint health → DNS resolution. A human would need 5+ commands. The AI chains them automatically and finds the NetworkPolicy blocking egress on port 8080.
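For comparison, roughly the manual version of that chain, with illustrative service and pod names:

```sh
# Do the Service's selector and the Pods' labels actually match?
kubectl -n production get svc api -o jsonpath='{.spec.selector}'
kubectl -n production get pods --show-labels | grep api

# Are there any endpoints behind the Service at all?
kubectl -n production get endpoints api

# Which NetworkPolicies apply, and do they allow egress on 8080?
kubectl -n production get networkpolicy -o yaml

# Does DNS resolve from inside the pod? (assumes nslookup exists in the image)
kubectl -n production exec deploy/frontend -- nslookup api
```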
Permission debugging
“The CI pipeline’s deploy step is failing with forbidden errors.”
The AI checks: ServiceAccount → RoleBindings → ClusterRoleBindings → specific verb permissions. Finds that the service account has `get` and `list` but not `patch` on Deployments in the target namespace.
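The manual equivalent leans on `kubectl auth can-i` with impersonation; the service account and namespace names here are illustrative:

```sh
# Can the CI service account patch Deployments in the target namespace?
kubectl auth can-i patch deployments -n apps \
  --as=system:serviceaccount:ci:deployer          # "no" confirms the gap
kubectl auth can-i list deployments -n apps \
  --as=system:serviceaccount:ci:deployer          # "yes"

# Which bindings mention that service account?
kubectl get rolebindings,clusterrolebindings -A -o wide | grep deployer
```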
Drift investigation
“Is staging actually matching what’s in our Helm values?”
The AI compares live resources against the last Helm release’s manifest, finds 3 ConfigMaps that were manually edited and no longer match the chart values.
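A minimal manual version of the same check, assuming a release named `myapp`: `helm get manifest` returns what Helm last applied, and `kubectl diff` compares that against the live cluster:

```sh
helm -n staging get manifest myapp | kubectl diff -f -
```

Any output is drift: the diff shows what applying the rendered manifest would change back.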
Onboarding
“What’s running in the platform namespace and how healthy is it?”
New team members get a complete picture without knowing which kubectl commands to run: deployments, their replica status, recent events, resource utilization, and any warnings.
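For reference, the kubectl version of that first look; each command answers one slice of the question:

```sh
kubectl -n platform get deployments,statefulsets,daemonsets
kubectl -n platform get events --field-selector type=Warning --sort-by=.lastTimestamp
kubectl -n platform top pods --sort-by=memory     # requires metrics-server
```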
What’s covered
The 95 tools span the full operational surface:
| Category | What it covers |
|---|---|
| Workloads | Pods, Deployments, StatefulSets, DaemonSets, Jobs, CronJobs |
| Networking | Services, Ingresses, NetworkPolicies, Endpoints |
| Storage | PVCs, PVs, StorageClasses |
| Configuration | ConfigMaps, Secrets (metadata only), resource quotas |
| RBAC | Roles, bindings, who-can queries, access matrix |
| Helm | Releases, history, values, manifests |
| GitOps | Argo CD apps, Flux sources, sync status |
| Cluster | Nodes, namespaces, events, API resources, CRDs |
| Cost | Namespace cost breakdown, resource efficiency |
| Health | Security policies, probes, resource constraints |
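If you want the exact list rather than categories, any MCP client can ask for it with the protocol’s standard `tools/list` method:

```json
{ "jsonrpc": "2.0", "id": 2, "method": "tools/list" }
```

The response enumerates every tool with its name, description, and a JSON Schema for its arguments.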
What it doesn’t do
All MCP tools are read-only. The AI can inspect, diagnose, and recommend — but it cannot `kubectl apply`, delete pods, or modify your cluster. Mutations require human confirmation through the KubeGlass UI or kubectl.
This is intentional. AI-assisted debugging should reduce investigation time. AI-driven mutations on production clusters are a different risk category entirely.
Try it
Section titled “Try it”# Installbrew install kubeglass/tap/kubeglass
# Start the dashboard (also serves MCP)kubeglassConfigure your MCP client as shown above, then ask it something about your cluster. The first question most people ask: “Give me a health summary of this cluster.”
From there, the conversation naturally leads to whatever needs attention.