Scenario

A pod in your production cluster is stuck in CrashLoopBackOff after a deployment. Let’s use Siclaw to diagnose it.

Start Siclaw

npx siclaw

Describe the Problem

? What would you like to investigate?
> Pod payment-service is in CrashLoopBackOff in namespace prod after deploying v2.3
Be specific. Include the pod name, namespace, and what changed (e.g., “after deploying v2.3”). More context = better hypotheses.

Phase 1: Context Gathering

Siclaw automatically runs diagnostic commands to understand the situation:
── Phase 1: Context Gathering ─────────────────────────────
  kubectl get pods -n prod | grep payment-service
  kubectl describe pod payment-service-xxx -n prod
  kubectl logs payment-service-xxx -n prod --previous
  kubectl get events -n prod --sort-by='.lastTimestamp'
All commands are read-only — nothing is modified.
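If you want to spot-check the same evidence yourself, the container's last termination reason is often the fastest signal. A minimal sketch using standard kubectl jsonpath queries (the pod name is a placeholder, as in the transcript above):

```shell
# Read-only: why the container last died (e.g. OOMKilled, Error)
kubectl get pod payment-service-xxx -n prod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Read-only: restart count, to confirm the crash loop
kubectl get pod payment-service-xxx -n prod \
  -o jsonpath='{.status.containerStatuses[0].restartCount}'
```

These mirror what `kubectl describe pod` reports, but in a form that is easy to script.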

Phase 2: Hypothesis Generation

Based on the evidence, Siclaw generates ranked hypotheses:
── Phase 2: Hypothesis Generation ──────────────────────────
  H1  OOMKilled — memory limit too low for v2.3      confidence: 78%
  H2  Config mount failure — missing configmap key    confidence: 45%
  H3  Liveness probe mismatch — endpoint changed      confidence: 32%
If Investigation Memory has data, Siclaw will also check past incidents for similar patterns and adjust hypothesis confidence accordingly.

Phase 3: Parallel Validation

Up to 3 sub-agents validate the hypotheses in parallel, one per hypothesis:
── Phase 3: Parallel Validation (3 sub-agents) ─────────────
  Agent-1 validating H1 · Agent-2 validating H2 · Agent-3 validating H3
Each sub-agent runs targeted commands to confirm or refute its hypothesis. They don’t share information — this prevents confirmation bias.

Phase 4: Conclusion

Siclaw synthesizes all evidence into a structured report:
── Phase 4: Conclusion ──────────────────────────────────────
  Root Cause: OOMKilled — memory limit 256Mi insufficient for v2.3
  Confidence: 92% · Evidence: 4 signals · Duration: 47s

  Causal chain:
    1. v2.3 deployment added new caching layer
    2. Memory usage increased from ~180Mi to ~310Mi
    3. Pod exceeded 256Mi memory limit
    4. Kernel OOMKilled the process → container restart → CrashLoopBackOff

  Remediation:
    - Increase memory limit to 512Mi: kubectl set resources deploy/payment-service -n prod --limits=memory=512Mi
    - Consider adding memory requests to match expected usage
The full report is saved to ~/.siclaw/reports/deep-search-{timestamp}.md.
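If you apply the suggested remediation, it is worth setting a matching request alongside the limit and then watching the rollout. A hedged sketch against the same deployment (the 512Mi limit comes from the report above; the 384Mi request is an illustrative value, sized to the observed ~310Mi usage):

```shell
# Raise the memory limit and set a request near expected usage
kubectl set resources deploy/payment-service -n prod \
  --requests=memory=384Mi --limits=memory=512Mi

# Wait for the new pods to come up healthy
kubectl rollout status deploy/payment-service -n prod
```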

Deep Investigation Mode

For complex issues, you can explicitly trigger a deep investigation:
> /deep "Intermittent 5xx errors on the API gateway, happening every 30 minutes"
This uses the full budget (up to 60 tool calls, 5 minutes) for a thorough investigation.

More Examples

Here are other common scenarios Siclaw handles well:

OOMKilled Pods

> Pods in namespace ml-training keep getting OOMKilled, happening more since yesterday
Siclaw will check memory limits vs actual usage, recent deployment changes, and correlate with node memory pressure.
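You can verify the same signals manually. A sketch of the kinds of read-only checks involved (`kubectl top` requires metrics-server to be installed in the cluster):

```shell
# Current memory usage per pod (needs metrics-server)
kubectl top pods -n ml-training

# Configured memory limits, for comparison against usage
kubectl get pods -n ml-training \
  -o custom-columns='POD:.metadata.name,MEM_LIMIT:.spec.containers[0].resources.limits.memory'

# Recent events, which include OOM kills and evictions
kubectl get events -n ml-training --sort-by='.lastTimestamp'
```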

Node NotReady

> Node worker-07 went NotReady 20 minutes ago, pods are being evicted
Siclaw will inspect node conditions, kubelet logs, kernel messages (dmesg), and network connectivity to the API server.
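The manual equivalents of those checks look roughly like this, assuming a kubelet managed by systemd (the last command runs on the node itself, e.g. over SSH):

```shell
# Node conditions: Ready, MemoryPressure, DiskPressure, PIDPressure
kubectl describe node worker-07

# Machine-readable view of the same conditions
kubectl get node worker-07 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Kubelet logs on the node (run via SSH; assumes systemd)
journalctl -u kubelet --since "30 minutes ago"
```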

Intermittent Network Issues

> /deep "Service mesh intermittent 503 errors between order-service and inventory-service"
Using /deep triggers a full investigation with maximum budget — useful for complex cross-service issues.

What’s Next?

  • Deep Investigation — budget controls, sub-agent architecture
  • Skills — create custom diagnostic playbooks
  • Memory — how investigation history improves future diagnoses