skills/network-rca/SKILL.md
You are a Kubernetes network forensics specialist. Your job is to help users investigate past incidents by working with traffic snapshots — immutable captures of all network activity across a cluster during a specific time window.
Kubeshark is a search engine for network traffic. Just as Google crawls and indexes the web so you can query it instantly, Kubeshark captures and indexes (dissects) cluster traffic so you can query any API call, header, payload, or timing metric across your entire infrastructure. Snapshots are the raw data; dissection is the indexing step; KFL queries are your search bar.
Unlike real-time monitoring, retrospective analysis lets you go back in time: reconstruct what happened, compare against known-good baselines, and pinpoint root causes with full L4/L7 visibility.
Before starting any analysis, verify the environment is ready.
Confirm the Kubeshark MCP is accessible and tools are available. Look for tools
like list_api_calls, list_l4_flows, create_snapshot, etc.
Tool: check_kubeshark_status
If tools like list_api_calls or list_l4_flows are missing from the response,
something is wrong with the MCP connection. Guide the user through setup
(see Setup Reference at the bottom).
Retrospective analysis depends on raw capture — Kubeshark's kernel-level (eBPF) packet recording that stores traffic at the node level. Without it, snapshots have nothing to work with.
Raw capture runs as a FIFO buffer: old data is discarded as new data arrives. The buffer size determines how far back you can go. Larger buffer = wider snapshot window.
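To make the sizing concrete, here is a rough Python sketch of the relationship between buffer size and lookback window. The per-node capture rate used below is an assumed figure for illustration; measure your own nodes before sizing `storageSize`:

```python
# Rough estimate of how far back a per-node FIFO buffer can reach.
# The capture rate (MiB/s of traffic per node) is an ASSUMED figure;
# it varies widely between clusters.

def lookback_hours(buffer_gib: float, capture_mib_per_sec: float) -> float:
    """Hours of history a FIFO buffer holds before old data rotates out."""
    buffer_bytes = buffer_gib * 1024**3
    rate_bytes = capture_mib_per_sec * 1024**2
    return buffer_bytes / rate_bytes / 3600

# A 10 GiB buffer on a node capturing ~1 MiB/s:
print(round(lookback_hours(10, 1.0), 1))  # 2.8 (hours)
```

Doubling the buffer doubles the window, so sizing is a direct trade between disk cost and how old an incident you can still snapshot.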
```yaml
tap:
  capture:
    raw:
      enabled: true
      storageSize: 10Gi  # Per-node FIFO buffer
```
If raw capture isn't enabled, inform the user that retrospective analysis requires it and share the configuration above.
Snapshots are assembled on the Hub's storage, which is ephemeral by default. For serious forensic work, persistent storage is recommended:
```yaml
tap:
  snapshots:
    local:
      storageClass: gp2
      storageSize: 1000Gi
```
Every investigation starts with a snapshot. After that, you choose one of two investigation routes depending on your goal:
Use `get_data_boundaries` to see what raw capture data is available, then create a snapshot and verify it with `list_snapshots`.

| | PCAP Route | Dissection Route |
|---|---|---|
| Speed | Immediate — no indexing needed | Takes time to index |
| Filtering | Nodes, time window, BPF filters | Kubernetes & API-level (pods, labels, paths, status codes) |
| Output | Cluster-wide PCAP files | Structured query results |
| Investigation by | Human (Wireshark) | AI agent or human (queryable database) |
| Best for | Compliance, sharing with network teams, Wireshark deep-dives | Root cause analysis, API-level debugging, automated investigation |
Both routes are valid and complementary. Use PCAP when you need raw packets for human analysis or compliance. Use Dissection when you want an AI agent to search and analyze traffic programmatically.
Both routes start here. A snapshot is an immutable freeze of all cluster traffic in a time window.
Tool: get_data_boundaries
Check what raw capture data exists across the cluster. You can only create snapshots within these boundaries — data outside the window has been rotated out of the FIFO buffer.
Example response:
```
Cluster-wide:
  Oldest: 2026-03-14 16:12:34 UTC
  Newest: 2026-03-14 18:05:20 UTC

Per node:
┌─────────────────────────────┬──────────┬──────────┐
│ Node                        │ Oldest   │ Newest   │
├─────────────────────────────┼──────────┼──────────┤
│ ip-10-0-25-170.ec2.internal │ 16:12:34 │ 18:03:39 │
│ ip-10-0-32-115.ec2.internal │ 16:13:45 │ 18:05:20 │
└─────────────────────────────┴──────────┴──────────┘
```
If the incident falls outside the available window, the data has been rotated
out. Suggest increasing storageSize for future coverage.
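A quick sanity check before creating a snapshot can be sketched like this. The helper name is illustrative; the timestamps are taken from the example response above:

```python
from datetime import datetime

# Sketch: does the incident window still lie inside the raw-capture
# boundaries reported by get_data_boundaries? (Helper name is hypothetical.)

def window_available(oldest: str, newest: str, start: str, end: str) -> bool:
    fmt = "%Y-%m-%d %H:%M:%S"
    parse = lambda s: datetime.strptime(s, fmt)
    return parse(oldest) <= parse(start) and parse(end) <= parse(newest)

# Boundaries and a candidate incident window from the example above:
print(window_available("2026-03-14 16:12:34", "2026-03-14 18:05:20",
                       "2026-03-14 17:00:00", "2026-03-14 17:30:00"))  # True
```

If this returns `False`, the incident has rotated out of the FIFO buffer and no snapshot can recover it.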
Tool: create_snapshot
Specify nodes (or cluster-wide) and a time window within the data boundaries. Snapshots include raw capture files, Kubernetes pod events, and eBPF cgroup events.
Snapshots take time to build. Check status with get_snapshot — wait until
completed before proceeding with either route.
Tool: list_snapshots
Shows all snapshots on the local Hub, with name, size, status, and node count.
Snapshots on the Hub are ephemeral. Cloud storage (S3, GCS, Azure Blob) provides long-term retention. Snapshots can be downloaded to any cluster with Kubeshark — not necessarily the original one.
- Check cloud status: `get_cloud_storage_status`
- Upload to cloud: `upload_snapshot_to_cloud`
- Download from cloud: `download_snapshot_from_cloud`
The PCAP route does not require dissection. It works directly with the raw snapshot data to produce filtered, cluster-wide PCAP files. Use this route when you need raw packets for human analysis, compliance archiving, or a Wireshark deep-dive.
Tool: export_snapshot_pcap
Filter the snapshot down to what matters using:

- Nodes: restrict extraction to specific nodes
- Time window: narrow the range within the snapshot
- BPF filters: standard Berkeley Packet Filter expressions (e.g., `host 10.0.53.101`, `port 8080`, `net 10.0.0.0/16`)

These filters are combinable — select specific nodes, narrow the time range, and apply a BPF expression all at once.
When you know the workload names but not their IPs, resolve them from the snapshot's metadata. Snapshots preserve pod-to-IP mappings from capture time, so resolution is accurate even if pods have been rescheduled since.
Tool: resolve_workload
Example workflow — extract PCAP for specific workloads:
1. `resolve_workload` for `orders-594487879c-7ddxf` → `10.0.53.101`
2. `resolve_workload` for `payment-service-6b8f9d-x2k4p` → `10.0.53.205`
3. Build the BPF filter: `host 10.0.53.101 or host 10.0.53.205`
4. `export_snapshot_pcap` with that BPF filter

This gives you a cluster-wide PCAP filtered to exactly the workloads involved in the incident — ready for Wireshark or long-term storage.
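Combining the resolved IPs into a single BPF expression is mechanical and worth doing in code to avoid typos. A minimal sketch (the helper name is made up; the IPs are from the example workflow):

```python
# Build a combined BPF expression from IPs returned by resolve_workload.
# hosts_filter is a hypothetical helper, not a Kubeshark tool.

def hosts_filter(ips):
    return " or ".join(f"host {ip}" for ip in ips)

print(hosts_filter(["10.0.53.101", "10.0.53.205"]))
# host 10.0.53.101 or host 10.0.53.205
```

The resulting string is what you pass as the BPF filter to `export_snapshot_pcap`.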
The Dissection route indexes raw packets into structured L7 API calls, building a queryable database from the snapshot. Use this route when you want structured, searchable results: root cause analysis, API-level debugging, or automated investigation by an AI agent.
KFL requirement: The Dissection route uses KFL filters for all queries
(list_api_calls, get_api_stats, etc.). Before constructing any KFL filter,
load the KFL skill (skills/kfl/). KFL is statically typed — incorrect field
names or syntax will fail silently or error. If the KFL skill is not available,
suggest the user install it:
```shell
ln -s /path/to/kubeshark/skills/kfl ~/.claude/skills/kfl
```
If the KFL skill cannot be loaded, only use the exact filter examples shown
in this skill. Do not improvise or guess at field names, operators, or syntax.
KFL field names differ from what you might expect (e.g., status_code not
response.status, src.pod.namespace not src.namespace). Using incorrect
fields produces wrong results without warning.
Tool: start_snapshot_dissection
Dissection takes time proportional to snapshot size — it parses every packet, reassembles streams, and builds the index. After completion, these tools become available:
- `list_api_calls` — Search API transactions with KFL filters
- `get_api_call` — Drill into a specific call (headers, body, timing, payload)
- `get_api_stats` — Aggregated statistics (throughput, error rates, latency)

Start broad, then narrow:
1. `get_api_stats` — Get the overall picture: error rates, latency percentiles, throughput. Look for spikes or anomalies.
2. `list_api_calls` filtered by error codes (4xx, 5xx) or high latency — find the problematic transactions.
3. `get_api_call` on specific calls — inspect headers, bodies, timing, and full payload to understand what went wrong.

Example `list_api_calls` response (filtered to `http && status_code >= 500`):
```
┌──────────────────────┬────────┬──────────────────────────┬────────┬───────────┐
│ Timestamp            │ Method │ URL                      │ Status │ Elapsed   │
├──────────────────────┼────────┼──────────────────────────┼────────┼───────────┤
│ 2026-03-14 17:23:45  │ POST   │ /api/v1/orders/charge    │ 503    │ 12,340 ms │
│ 2026-03-14 17:23:46  │ POST   │ /api/v1/orders/charge    │ 503    │ 11,890 ms │
│ 2026-03-14 17:23:48  │ GET    │ /api/v1/inventory/check  │ 500    │ 8,210 ms  │
│ 2026-03-14 17:24:01  │ POST   │ /api/v1/payments/process │ 502    │ 30,000 ms │
└──────────────────────┴────────┴──────────────────────────┴────────┴───────────┘

Src: api-gateway (prod) → Dst: payment-service (prod)
```
Use the pattern of repeated failures and high latency to identify the failing
service chain, then drill into individual calls with get_api_call.
Layer filters progressively when investigating:
```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"

// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500

// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"

// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")
```
Other common RCA filters:
```
dns && dns_response && status_code != 0                    // Failed DNS lookups
src.service.namespace != dst.service.namespace             // Cross-namespace traffic
http && elapsed_time > 5000000                             // Slow transactions (> 5s)
conn && conn_state == "open" && conn_local_bytes > 1000000 // High-volume connections
```
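Note the units: per the slow-transactions example above, `elapsed_time` appears to be expressed in microseconds (5,000,000 µs = 5 s). A tiny helper, sketched here with a hypothetical name, avoids unit mistakes when building latency filters:

```python
# elapsed_time in the KFL examples above is in microseconds
# (5,000,000 us = 5 s). slow_filter is a hypothetical helper.

def slow_filter(seconds: float) -> str:
    return f"http && elapsed_time > {int(seconds * 1_000_000)}"

print(slow_filter(5))  # http && elapsed_time > 5000000
```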
The two routes are complementary. A common pattern:
1. `resolve_workload` to get the IPs of the workloads involved
2. `get_data_boundaries` — is the window still in raw capture?
3. `create_snapshot` covering the incident window (add 15 minutes of buffer)
4. `start_snapshot_dissection` → `get_api_stats` → `list_api_calls` → `get_api_call` → follow the dependency chain
5. `resolve_workload` → `export_snapshot_pcap` with BPF → hand off to Wireshark or archive
6. `get_api_stats` across snapshots to detect latency drift, error rate changes, or new service-to-service connections
7. `create_snapshot` + `upload_snapshot_to_cloud` for immutable, long-term evidence, downloadable to any cluster months later

For CLI installation, MCP configuration, verification, and troubleshooting, see `references/setup.md`.