Debugging Kubernetes with AI: How MCP Gives LLMs Real Cluster Context
You’re debugging at 2 AM. A deployment won’t roll out. Pods are stuck in CrashLoopBackOff. You open Copilot Chat and paste the error — the AI gives you a reasonable list of possibilities. OOM? Bad probe? Missing secret? Image pull error?
All plausible. None specific. Because the AI can’t see your cluster.
This is the fundamental gap: LLMs have absorbed thousands of Kubernetes docs, Stack Overflow answers, and runbooks. They genuinely understand the theory. But theory without data produces checklists, not answers.
The problem with checklists
Ask any LLM “why is my pod in CrashLoopBackOff?” and you’ll get something like:
- Check container logs (`kubectl logs`)
- Check events (`kubectl describe pod`)
- Check resource limits (OOMKilled?)
- Check probes (liveness failing?)
- Check image pull (ImagePullBackOff?)
- Check volume mounts (missing ConfigMap/Secret?)
- Check init containers (stuck init?)
This is correct and useless in equal measure. You already know what to check. The bottleneck is running the right 5 commands out of hundreds, cross-referencing the output, and synthesizing a root cause. That’s the part that takes 30 minutes at 2 AM.
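In practice, that loop is a series of kubectl invocations run one at a time while you correlate the output. A sketch, with an illustrative pod name:

```sh
# The manual correlation loop; pod name is illustrative
kubectl -n production describe pod checkout-service-7d4b9c-x2k8f      # events, last state
kubectl -n production logs checkout-service-7d4b9c-x2k8f --previous --tail=50
kubectl -n production get events --sort-by=.lastTimestamp | tail -20
kubectl -n production top pod checkout-service-7d4b9c-x2k8f           # requires metrics-server
```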
What if the AI could see?
MCP (Model Context Protocol) is an open standard that lets AI assistants call external tools. Instead of reasoning in a vacuum, the AI sends structured requests to a server that returns real data.
For Kubernetes, this means the AI can:
- Request the pod’s event timeline → see the actual OOMKilled event
- Request the last 50 log lines → see the actual stack trace
- Request resource definitions → see the actual memory limit set to 128Mi
- Request live metrics → see memory peaking at 142Mi
Same AI, same knowledge — but now with your cluster’s data. The output shifts from “here are 7 things to check” to “your pod is OOMKilled because it uses 142Mi but the limit is 128Mi.”
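Under the hood these requests are plain JSON-RPC. Here’s a sketch of a single tool call on the wire: the `tools/call` method and response shape are standard MCP, the `get_pod_logs` tool appears in the session below, but the argument names are illustrative assumptions about KubeGlass’s schema:

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "get_pod_logs",
    "arguments": { "namespace": "production", "pod": "checkout-service-7d4b9c-x2k8f", "tailLines": 50 }
  }
}
```

The server answers with real data, which lands directly in the model’s context:

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "result": {
    "content": [{ "type": "text", "text": "fatal error: runtime: out of memory\n..." }]
  }
}
```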
Setting it up
KubeGlass includes an MCP server with 95 tools. It uses your existing kubeconfig — the same clusters accessible via kubectl are available to the AI.
Add to `.vscode/settings.json`:

```json
{
  "mcp": {
    "servers": {
      "kubeglass": {
        "command": "kubeglass",
        "args": ["mcp"]
      }
    }
  }
}
```

Add to your Claude Desktop MCP configuration:

```json
{
  "mcpServers": {
    "kubeglass": {
      "command": "kubeglass",
      "args": ["mcp"]
    }
  }
}
```

That’s it. No API keys, no cloud service, no sidecar agents. The MCP server runs locally alongside the dashboard.
A real debugging session
Here’s what actually happens when you ask a question with MCP connected.
You type:
The checkout-service in production has pods stuck in CrashLoopBackOff. Why?
What the AI does (you see tool calls in the chat):
- Calls `get_pods` with namespace=production, field selector matching checkout-service
- Sees status: CrashLoopBackOff, last termination reason: OOMKilled
- Calls `get_pod_resources` → memory limit 128Mi, request 64Mi
- Calls `get_pod_logs` with `--previous` → `fatal error: runtime: out of memory`
- Synthesizes: limit too low for actual usage
AI responds:
> The checkout-service pods are being OOMKilled. The container has a 128Mi memory limit, but the application is exceeding it — the previous container’s last log line shows `fatal error: runtime: out of memory`. Increase the memory limit to at least 256Mi. If memory usage has grown recently, also check for a leak by examining the allocation profile.
That’s a complete diagnosis with a specific recommendation, not a generic checklist. The AI did in 3 seconds what would have taken you several minutes of kubectl commands.
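If you accept the recommendation, the fix itself is a one-liner. A sketch with `kubectl set resources`; the deployment name is taken from the scenario, and in a GitOps setup you’d change the manifest or Helm values instead of patching live:

```sh
kubectl -n production set resources deployment/checkout-service \
  --limits=memory=256Mi --requests=memory=128Mi
```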
Where this actually matters
The OOMKilled example is simple — you might have caught it yourself in 2 minutes. The real wins come from problems that span multiple resource types.
Cross-resource diagnosis
“Why can’t the frontend pod reach the API service?”
The AI checks: Service selectors → Pod labels → NetworkPolicies → endpoint health → DNS resolution. A human would need 5+ commands. The AI chains them automatically and finds the NetworkPolicy blocking egress on port 8080.
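For comparison, roughly the manual version of that chain, with illustrative service and pod names:

```sh
# Do the Service's selector and the Pods' labels actually match?
kubectl -n production get svc api -o jsonpath='{.spec.selector}'
kubectl -n production get pods --show-labels | grep api

# Are there any endpoints behind the Service at all?
kubectl -n production get endpoints api

# Which NetworkPolicies apply, and do they allow egress on 8080?
kubectl -n production get networkpolicy -o yaml

# Does DNS resolve from inside the pod? (assumes nslookup exists in the image)
kubectl -n production exec deploy/frontend -- nslookup api
```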
Permission debugging
“The CI pipeline’s deploy step is failing with forbidden errors.”
The AI checks: ServiceAccount → RoleBindings → ClusterRoleBindings → specific verb permissions. Finds that the service account has `get` and `list` but not `patch` on Deployments in the target namespace.
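The manual equivalent leans on `kubectl auth can-i` with impersonation; the service account and namespace names here are illustrative:

```sh
# Can the CI service account patch Deployments in the target namespace?
kubectl auth can-i patch deployments -n apps \
  --as=system:serviceaccount:ci:deployer          # "no" confirms the gap
kubectl auth can-i list deployments -n apps \
  --as=system:serviceaccount:ci:deployer          # "yes"

# Which bindings mention that service account?
kubectl get rolebindings,clusterrolebindings -A -o wide | grep deployer
```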
Drift investigation
“Is staging actually matching what’s in our Helm values?”
The AI compares live resources against the last Helm release’s manifest, finds 3 ConfigMaps that were manually edited and no longer match the chart values.
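A minimal manual version of the same check, assuming a release named `myapp`: `helm get manifest` returns what Helm last applied, and `kubectl diff` compares that against the live cluster:

```sh
helm -n staging get manifest myapp | kubectl diff -f -
```

Any output is drift: the diff shows what applying the rendered manifest would change back.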
Onboarding
“What’s running in the platform namespace and how healthy is it?”
New team members get a complete picture without knowing which kubectl commands to run: deployments, their replica status, recent events, resource utilization, and any warnings.
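For reference, the kubectl version of that first look; each command answers one slice of the question:

```sh
kubectl -n platform get deployments,statefulsets,daemonsets
kubectl -n platform get events --field-selector type=Warning --sort-by=.lastTimestamp
kubectl -n platform top pods --sort-by=memory     # requires metrics-server
```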
What’s covered
The 95 tools span the full operational surface:
| Category | What it covers |
|---|---|
| Workloads | Pods, Deployments, StatefulSets, DaemonSets, Jobs, CronJobs |
| Networking | Services, Ingresses, NetworkPolicies, Endpoints |
| Storage | PVCs, PVs, StorageClasses |
| Configuration | ConfigMaps, Secrets (metadata only), resource quotas |
| RBAC | Roles, bindings, who-can queries, access matrix |
| Helm | Releases, history, values, manifests |
| GitOps | Argo CD apps, Flux sources, sync status |
| Cluster | Nodes, namespaces, events, API resources, CRDs |
| Cost | Namespace cost breakdown, resource efficiency |
| Health | Security policies, probes, resource constraints |
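If you want the exact list rather than categories, any MCP client can ask for it with the protocol’s standard `tools/list` method:

```json
{ "jsonrpc": "2.0", "id": 2, "method": "tools/list" }
```

The response enumerates every tool with its name, description, and a JSON Schema for its arguments.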
What it doesn’t do
All MCP tools are read-only. The AI can inspect, diagnose, and recommend — but it cannot `kubectl apply`, delete pods, or modify your cluster. Mutations require human confirmation through the KubeGlass UI or kubectl.
This is intentional. AI-assisted debugging should reduce investigation time. AI-driven mutations on production clusters are a different risk category entirely.
Try it
Section titled “Try it”# Installbrew install kubeglass/tap/kubeglass
# Start the dashboard (also serves MCP)kubeglassConfigure your MCP client as shown above, then ask it something about your cluster. The first question most people ask: “Give me a health summary of this cluster.”
From there, the conversation naturally leads to whatever needs attention.