shannon/.claude/commands/debug.md at main

mirror of https://github.com/KeygraphHQ/shannon.git synced 2026-02-12 17:22:50 +00:00


Arjun Malleswaran 78a0a61208 Feat/temporal (#46)

* refactor: modularize claude-executor and extract shared utilities

- Extract message handling into src/ai/message-handlers.ts with pure functions
- Extract output formatting into src/ai/output-formatters.ts
- Extract progress management into src/ai/progress-manager.ts
- Add audit-logger.ts with Null Object pattern for optional logging
- Add shared utilities: formatting.ts, file-io.ts, functional.ts
- Consolidate getPromptNameForAgent into src/types/agents.ts

* feat: add Claude Code custom commands for debug and review

* feat: add Temporal integration foundation (phase 1-2)

- Add Temporal SDK dependencies (@temporalio/client, worker, workflow, activity)
- Add shared types for pipeline state, metrics, and progress queries
- Add classifyErrorForTemporal() for retry behavior classification
- Add docker-compose for Temporal server with SQLite persistence

* feat: add Temporal activities for agent execution (phase 3)

- Add activities.ts with heartbeat loop, git checkpoint/rollback, and error classification
- Export runClaudePrompt, validateAgentOutput, ClaudePromptResult for Temporal use
- Track attempt number via Temporal Context for accurate audit logging
- Rollback git workspace before retry to ensure clean state

* feat: add Temporal workflow for 5-phase pipeline orchestration (phase 4)

* feat: add Temporal worker, client, and query tools (phase 5)

- Add worker.ts with workflow bundling and graceful shutdown
- Add client.ts CLI to start pipelines with progress polling
- Add query.ts CLI to inspect running workflow state
- Fix buffer overflow by truncating error messages and stack traces
- Skip git operations gracefully on non-git repositories
- Add kill.sh/start.sh dev scripts and Dockerfile.worker

* feat: fix Docker worker container setup

- Install uv instead of deprecated uvx package
- Add mcp-server and configs directories to container
- Mount target repo dynamically via TARGET_REPO env variable

* fix: add report assembly step to Temporal workflow

- Add assembleReportActivity to concatenate exploitation evidence files before report agent runs
- Call assembleFinalReport in workflow Phase 5 before runReportAgent
- Ensure deliverables directory exists before writing final report
- Simplify pipeline-testing report prompt to just prepend header

* refactor: consolidate Docker setup to root docker-compose.yml

* feat: improve Temporal client UX and env handling

- Change default to fire-and-forget (--wait flag to opt-in)
- Add splash screen and improve console output formatting
- Add .env to gitignore, remove from dockerignore for container access
- Add Taskfile for common development commands

* refactor: simplify session ID handling and improve Taskfile options

- Include hostname in workflow ID for better audit log organization
- Extract sanitizeHostname utility to audit/utils.ts for reuse
- Remove unused generateSessionLogPath and buildLogFilePath functions
- Simplify Taskfile with CONFIG/OUTPUT/CLEAN named parameters

* chore: add .env.example and simplify .gitignore

* docs: update README and CLAUDE.md for Temporal workflow usage

- Replace Docker CLI instructions with Task-based commands
- Add monitoring/stopping sections and workflow examples
- Document Temporal orchestration layer and troubleshooting
- Simplify file structure to key files overview

* refactor: replace Taskfile with bash CLI script

- Add shannon bash script with start/logs/query/stop/help commands
- Remove Taskfile.yml dependency (no longer requires Task installation)
- Update README.md and CLAUDE.md to use ./shannon commands
- Update client.ts output to show ./shannon commands

* docs: fix deliverable filename in README

* refactor: remove direct CLI and .shannon-store.json in favor of Temporal

- Delete src/shannon.ts direct CLI entry point (Temporal is now the only mode)
- Remove .shannon-store.json session lock (Temporal handles workflow deduplication)
- Remove broken scripts/export-metrics.js (imported non-existent function)
- Update package.json to remove main, start script, and bin entry
- Clean up CLAUDE.md and debug.md to remove obsolete references

* chore: remove licensing comments from prompt files to prevent leaking into actual prompts

* fix: resolve parallel workflow race conditions and retry logic bugs

- Fix save_deliverable race condition using closure pattern instead of global variable
- Fix error classification order so OutputValidationError matches before generic validation
- Fix ApplicationFailure re-classification bug by checking instanceof before re-throwing
- Add per-error-type retry limits (3 for output validation, 50 for billing)
- Add fast retry intervals for pipeline testing mode (10s vs 5min)
- Increase worker concurrent activities to 25 for parallel workflows

* refactor: pipeline vuln→exploit workflow for parallel execution

- Replace sync barrier between vuln/exploit phases with independent pipelines
- Each vuln type runs: vuln agent → queue check → conditional exploit
- Add checkExploitationQueue activity to skip exploits when no vulns found
- Use Promise.allSettled for graceful failure handling across pipelines
- Add PipelineSummary type for aggregated cost/duration/turns metrics

* fix: re-throw retryable errors in checkExploitationQueue

* fix: detect and retry on Claude Code spending cap errors

- Add spending cap pattern detection in detectApiError() with retryable error
- Add matching patterns to classifyErrorForTemporal() for proper Temporal retry
- Add defense-in-depth safeguard in runClaudePrompt() for $0 cost / low turn detection
- Add final sanity check in activities before declaring success

* fix: increase heartbeat timeout to prevent false worker-dead detection

Original 30s timeout was from POC spec assuming <5min activities. With
hour-long activities and multiple concurrent workflows sharing one worker,
resource contention causes event loop stalls exceeding 30s, triggering
false heartbeat timeouts. Increased to 10min (prod) and 5min (testing).

* fix: temporal db init

* fix: persist home dir

* feat: add per-workflow unified logging with ./shannon logs ID=<workflow-id>

- Add WorkflowLogger class for human-readable, per-workflow log files
- Create workflow.log in audit-logs/{workflowId}/ with phase, agent, tool, and LLM events
- Update ./shannon logs to require ID param and tail specific workflow log
- Add phase transition logging at workflow boundaries
- Include workflow completion summary with agent breakdown (duration, cost)
- Mount audit-logs volume in docker-compose for host access

---------

Co-authored-by: ezl-keygraph <ezhil@keygraph.io>

2026-01-15 10:36:11 -08:00

5.1 KiB


description: Systematically debug errors using context analysis and structured recovery

You are debugging an issue. Follow this structured approach to avoid spinning in circles.

Step 1: Capture Error Context

Read the full error message and stack trace
Identify the layer where the error originated:
- CLI/Args - Input validation, path resolution
- Config Parsing - YAML parsing, JSON Schema validation
- Session Management - Mutex, session.json, lock files
- Audit System - Logging, metrics tracking, atomic writes
- Claude SDK - Agent execution, MCP servers, turn handling
- Git Operations - Checkpoints, rollback, commit
- Tool Execution - nmap, subfinder, whatweb
- Validation - Deliverable checks, queue validation

Step 2: Check Relevant Logs

Session audit logs:

# Find most recent session
ls -lt audit-logs/ | head -5

# Check session metrics and errors
jq '.errors, .agentMetrics' audit-logs/<session>/session.json

# Check agent execution logs
ls -lt audit-logs/<session>/agents/
cat audit-logs/<session>/agents/<latest>.log
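
The same check can be scripted when the log needs to be inspected from code rather than with `jq`. The following is a minimal sketch, assuming `session.json` carries a top-level `errors` array and an `agentMetrics` object keyed by agent name. Those shapes and the `summarizeSession` helper are assumptions for illustration, not Shannon's actual schema or API.

```typescript
// Hypothetical helper: the field shapes below are assumptions for illustration.
interface SessionSummary {
  validJson: boolean;
  errorCount: number;
  agents: string[];
}

function summarizeSession(jsonText: string): SessionSummary {
  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonText);
  } catch {
    // A partial write during a crash typically surfaces here as a parse error.
    return { validJson: false, errorCount: 0, agents: [] };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { validJson: true, errorCount: 0, agents: [] };
  }
  const record = parsed as Record<string, unknown>;
  const errors = Array.isArray(record.errors) ? record.errors : [];
  const metrics = record.agentMetrics;
  const agents =
    typeof metrics === "object" && metrics !== null && !Array.isArray(metrics)
      ? Object.keys(metrics)
      : [];
  return { validJson: true, errorCount: errors.length, agents };
}
```

Passing the file contents as a string keeps the function free of filesystem access, so it can be exercised on a copy of a suspect `session.json` without touching the live session.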

Step 3: Trace the Call Path

For Shannon, trace through these layers:

Temporal Client → src/temporal/client.ts - Workflow initiation
Workflow → src/temporal/workflows.ts - Pipeline orchestration
Activities → src/temporal/activities.ts - Agent execution with heartbeats
Config → src/config-parser.ts - YAML loading, schema validation
Session → src/session-manager.ts - Agent definitions, execution order
Audit → src/audit/audit-session.ts - Logging facade, metrics tracking
Executor → src/ai/claude-executor.ts - SDK calls, MCP setup, retry logic
Validation → src/queue-validation.ts - Deliverable checks
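
When a stack trace is available, the layer list above can be applied mechanically. This is a rough sketch, assuming stack frames contain the file paths listed above, possibly compiled to `.js` under a different root directory. The `layerFromStack` helper and the lookup table are hypothetical and not part of Shannon.

```typescript
// File stems come from the call-path list above; the lookup itself is a hypothetical helper.
const LAYER_BY_FILE: ReadonlyArray<readonly [string, string]> = [
  ["temporal/client", "Temporal Client"],
  ["temporal/workflows", "Workflow"],
  ["temporal/activities", "Activities"],
  ["config-parser", "Config"],
  ["session-manager", "Session"],
  ["audit/audit-session", "Audit"],
  ["ai/claude-executor", "Executor"],
  ["queue-validation", "Validation"],
];

// Returns the layer of the first (innermost) stack frame that matches a known file.
function layerFromStack(stack: string): string | undefined {
  for (const rawLine of stack.split("\n")) {
    // Normalize Windows separators so one pattern covers both path styles.
    const line = rawLine.replace(/\\/g, "/");
    for (const [stem, layer] of LAYER_BY_FILE) {
      // Match on the extension-less stem so .ts and compiled .js frames both hit.
      if (line.indexOf("/" + stem + ".") !== -1) {
        return layer;
      }
    }
  }
  return undefined;
}
```

The innermost matching frame is usually where the error originated. The outer frames show which layers it propagated through on the way up.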

Step 4: Identify Root Cause

Common Shannon-specific issues:

| Symptom | Likely Cause | Fix |
|---|---|---|
| Agent hangs indefinitely | MCP server crashed, Playwright timeout | Check Playwright logs in `/tmp/playwright-*` |
| "Validation failed: Missing deliverable" | Agent didn't create expected file | Check `deliverables/` dir, review prompt |
| Git checkpoint fails | Uncommitted changes, git lock | Run `git status`, remove `.git/index.lock` |
| "Session limit reached" | Claude API billing limit | Not retryable - check API usage |
| Parallel agents all fail | Shared resource contention | Check mutex usage, stagger startup timing |
| Cost/timing not tracked | Metrics not reloaded before update | Add `metricsTracker.reload()` before updates |
| session.json corrupted | Partial write during crash | Delete and restart, or restore from backup |
| YAML config rejected | Invalid schema or unsafe content | Run through AJV validator manually |
| Prompt variable not replaced | Missing `{{VARIABLE}}` in context | Check `prompt-manager.ts` interpolation |
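
For the "Prompt variable not replaced" symptom, a quick scan of the rendered prompt shows which placeholders survived interpolation. This is a minimal sketch, assuming placeholders use the `{{VARIABLE}}` form with upper-case names. The `findUnreplacedVariables` helper is hypothetical and does not reflect how `prompt-manager.ts` is implemented.

```typescript
// Hypothetical check; assumes placeholders look like {{UPPER_SNAKE_CASE}}.
function findUnreplacedVariables(renderedPrompt: string): string[] {
  const pattern = /\{\{\s*([A-Z0-9_]+)\s*\}\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(renderedPrompt)) !== null) {
    const name = match[1];
    // Report each leftover variable once, even if it appears several times.
    if (names.indexOf(name) === -1) {
      names.push(name);
    }
  }
  return names.sort();
}
```

An empty result means every placeholder was substituted. Any returned name is a variable that was missing from the interpolation context.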

MCP Server Issues:

# Install the Playwright Chromium browser if it is missing
npx playwright install chromium

# Check MCP server startup (look for connection errors)
grep -i "mcp\|playwright" audit-logs/<session>/agents/*.log

Git State Issues:

# Check for uncommitted changes
git status

# Check for git locks
ls -la .git/*.lock

# View recent git operations from Shannon
git reflog | head -10

Step 5: Apply Fix with Retry Limit

CRITICAL: Track consecutive failed attempts
After 3 consecutive failures on the same issue, STOP and:
- Summarize what was tried
- Explain what's blocking progress
- Ask the user for guidance or additional context
After a successful fix, reset the failure counter
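
The retry rule above amounts to a small counter. The sketch below shows one way to model it. The `FailureTracker` class is hypothetical and exists only to make the stop condition explicit.

```typescript
// Hypothetical tracker for the "stop after 3 consecutive failures" rule.
class FailureTracker {
  private consecutiveFailures = 0;
  private attempts: string[] = [];
  private readonly limit: number;

  constructor(limit: number = 3) {
    this.limit = limit;
  }

  // Record a failed attempt; returns true when it is time to stop and ask the user.
  recordFailure(whatWasTried: string): boolean {
    this.consecutiveFailures += 1;
    this.attempts.push(whatWasTried);
    return this.consecutiveFailures >= this.limit;
  }

  // A successful fix resets the counter and the attempt history.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.attempts = [];
  }

  // Summary of what was tried, for the hand-off message to the user.
  summary(): string {
    return this.attempts.map((attempt, index) => `${index + 1}. ${attempt}`).join("\n");
  }
}
```

Keeping the attempt history next to the counter means the summary of what was tried is already assembled when the limit is reached.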

Step 6: Validate the Fix

For code changes:

# Compile TypeScript
npx tsc --noEmit

# Quick validation run
shannon <URL> <REPO> --pipeline-testing

For audit/session issues:

Verify session.json is valid JSON after fix
Check that atomic writes complete without errors
Confirm mutex release in finally blocks
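
The last check, releasing the mutex in a `finally` block, has the following shape. This is a deliberately simplified, synchronous sketch. `SimpleLock` and `withLock` are hypothetical names, and Shannon's real mutex is presumably asynchronous and may look quite different.

```typescript
// Hypothetical minimal lock, only to illustrate the try/finally shape.
class SimpleLock {
  private held = false;

  tryAcquire(): boolean {
    if (this.held) {
      return false;
    }
    this.held = true;
    return true;
  }

  release(): void {
    this.held = false;
  }

  isHeld(): boolean {
    return this.held;
  }
}

function withLock<T>(lock: SimpleLock, work: () => T): T {
  if (!lock.tryAcquire()) {
    throw new Error("lock is already held");
  }
  try {
    return work();
  } finally {
    // Runs on both success and failure, so a thrown error cannot leak the lock.
    lock.release();
  }
}
```

If the release instead sits after the protected call, any exception skips it and every later writer blocks. That failure mode can look like the "Parallel agents all fail" symptom.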

For agent issues:

Verify deliverable files are created in correct location
Check that validation functions return expected results
Confirm retry logic triggers on appropriate errors

Anti-Patterns to Avoid

Don't delete session.json without checking if session is active
Don't modify git state while an agent is running
Don't retry billing/quota errors (they're not retryable)
Don't ignore PentestError type - it indicates the error category
Don't make random changes hoping something works
Don't fix symptoms without understanding root cause
Don't bypass mutex protection for "quick fixes"

Quick Reference: Error Types

| PentestError Type | Meaning | Retryable? |
|---|---|---|
| `config` | Configuration file issues | No |
| `network` | Connection/timeout issues | Yes |
| `tool` | External tool (nmap, etc.) failed | Yes |
| `prompt` | Claude SDK/API issues | Sometimes |
| `filesystem` | File read/write errors | Sometimes |
| `validation` | Deliverable validation failed | Yes (via retry) |
| `billing` | API quota/billing limit | No |
| `unknown` | Unexpected error | Depends |
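
The table above can be encoded as a lookup so the retry decision is explicit while debugging. This is a sketch only. The `PentestErrorType` union, `RETRY_POLICY` map, and `shouldAutoRetry` function are hypothetical names that mirror the table, not Shannon's actual classification code.

```typescript
// Hypothetical encoding of the table above; names are illustrative, not Shannon's API.
type PentestErrorType =
  | "config"
  | "network"
  | "tool"
  | "prompt"
  | "filesystem"
  | "validation"
  | "billing"
  | "unknown";

type RetryPolicy = "yes" | "no" | "sometimes" | "depends";

const RETRY_POLICY: Record<PentestErrorType, RetryPolicy> = {
  config: "no",
  network: "yes",
  tool: "yes",
  prompt: "sometimes",
  filesystem: "sometimes",
  validation: "yes",
  billing: "no",
  unknown: "depends",
};

// Only retry automatically when the table says "yes" outright;
// "sometimes" and "depends" need a look at the specific error first.
function shouldAutoRetry(type: PentestErrorType): boolean {
  return RETRY_POLICY[type] === "yes";
}
```

Treating "sometimes" and "depends" as a stop keeps the decision conservative: those cases are inspected by hand before another attempt is made.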

Now analyze the error and begin debugging systematically.
