27 Commits

Author SHA1 Message Date
ezl-keygraph 5596411bd3 fix: render agent deliverables before the success commit so resume preserves them (#377) 2026-06-23 14:25:17 +05:30
ezl-keygraph 6a86b6c4c3 fix(cli): pin npx command hints to beta tag 2026-06-17 18:30:02 +05:30
ezl-keygraph fb14a0170a ci: bump the beta release line to 2.0.0 (#356) 2026-06-17 18:09:27 +05:30
ezl-keygraph cf396fb9c7 feat(worker): enforce bounded bash timeouts via pi extension 2026-06-16 12:48:32 +05:30
ezl-keygraph f97afb482e refactor(worker): unify provider precedence between preflight and executor 2026-06-15 23:06:48 +05:30
ezl-keygraph c2bceba95c docs(worker): update stale sdk comments 2026-06-15 22:50:44 +05:30
ezl-keygraph 7c20384991 docs: remove vertex references from llms context 2026-06-15 22:48:59 +05:30
ezl-keygraph 0bc004a583 build: drop @anthropic-ai/claude-code from worker image 2026-06-15 22:42:12 +05:30
ezl-keygraph d3beea504a refactor(cli): remove CLAUDE_CODE_MAX_OUTPUT_TOKENS config 2026-06-15 22:40:50 +05:30
ezl-keygraph f46243a35a feat(worker): load playwright-cli skill via pi resource loader 2026-06-15 22:37:36 +05:30
ezl-keygraph 09e11b3ad9 fix(worker): restore minLength/minItems on pre-recon and exploit collector schemas 2026-06-15 21:06:29 +05:30
ezl-keygraph e16dcba13f refactor(prompts): drop collector server names from deliverable instructions 2026-06-15 20:21:22 +05:30
ezl-keygraph 5547afa73f refactor(prompts): drop stale MCP terminology for collector tools 2026-06-15 20:18:53 +05:30
ezl-keygraph 667e6ac4b0 refactor(prompts): use pi tool names (task, todo_write, read, bash, glob) 2026-06-15 20:03:26 +05:30
ezl-keygraph d18e928a6a feat(worker): add glob custom tool and route code_path globs to it 2026-06-15 20:03:26 +05:30
ezl-keygraph 58d0defea7 feat(worker): give task sub-agent write+bash, align tool descriptions 2026-06-15 19:54:20 +05:30
ezl-keygraph 9e845159b3 fix(worker): restore minLength/minItems on vuln-collector schemas 2026-06-15 18:42:53 +05:30
ezl-keygraph 0fd2f6bbe4 fix(worker): gate adaptive thinking to Opus models, drop CLAUDE_THINKING_LEVEL 2026-06-15 18:11:46 +05:30
ezl-keygraph 575465a741 feat(worker): pi-event-driven output formatting 2026-06-15 16:16:46 +05:30
ezl-keygraph 263b18e98a refactor(worker): rename claude-executor to pi-executor 2026-06-15 16:05:31 +05:30
ezl-keygraph 56241625a4 fix(worker): count sub-agent cost and surface compaction failures 2026-06-15 15:59:55 +05:30
ezl-keygraph 79fb49c159 feat(prompts): instruct agents to call submit_exploitation_queue and submit_auth_result 2026-06-15 15:49:02 +05:30
ezl-keygraph c275b27a6c fix(worker): route Bedrock and custom-base-URL providers from env 2026-06-15 15:36:14 +05:30
ezl-keygraph a9e966026c feat: remove Google Vertex AI provider support 2026-06-15 12:49:40 +05:30
ezl-keygraph 1908156525 feat(worker): migrate agent runtime from Claude Agent SDK to pi harness 2026-06-15 12:05:32 +05:30
ezl-keygraph 3d1a3c75f8 feat(ai): support Claude Fable 5 (upgrade Claude Agent SDK to 0.3.173) (#354) 2026-06-12 14:50:27 +05:30
ezl-keygraph ac6db3b52e feat(ai): upgrade to Opus 4.8 and Claude Agent SDK 0.3.163 (#353) 2026-06-12 02:03:26 +05:30
68 changed files with 4255 additions and 3564 deletions
+4 -24
@@ -1,10 +1,7 @@
# Shannon Environment Configuration
# Copy this file to .env and fill in your credentials
# Recommended output token configuration for larger tool outputs
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
# Adaptive thinking is enabled automatically on Opus 4.6/4.7. Set to false to disable.
# Adaptive thinking is enabled automatically on Opus 4.6/4.7/4.8. Set to false to disable.
# CLAUDE_ADAPTIVE_THINKING=false
# Shannon forwards your machine's /etc/hosts entries into the worker container. Set to false to disable.
@@ -29,10 +26,10 @@ ANTHROPIC_API_KEY=your-api-key-here
# Model Tier Overrides (Anthropic API / OAuth / Custom Base URL / Bedrock)
# =============================================================================
# Override which model is used for each tier. Defaults are used if not set.
# Optional for direct Anthropic and custom base URL modes. Required for Bedrock/Vertex.
# Optional for direct Anthropic and custom base URL modes. Required for Bedrock.
# ANTHROPIC_SMALL_MODEL=... # Small tier (default: claude-haiku-4-5-20251001)
# ANTHROPIC_MEDIUM_MODEL=... # Medium tier (default: claude-sonnet-4-6)
# ANTHROPIC_LARGE_MODEL=... # Large tier (default: claude-opus-4-7)
# ANTHROPIC_LARGE_MODEL=... # Large tier (default: claude-opus-4-8)
# =============================================================================
# OPTION 3: AWS Bedrock
@@ -42,25 +39,8 @@ ANTHROPIC_API_KEY=your-api-key-here
# Example Bedrock model IDs for us-east-1:
# ANTHROPIC_SMALL_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
# ANTHROPIC_MEDIUM_MODEL=us.anthropic.claude-sonnet-4-6
# ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-7
# ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-8
# CLAUDE_CODE_USE_BEDROCK=1
# AWS_REGION=us-east-1
# AWS_BEARER_TOKEN_BEDROCK=your-bearer-token
# =============================================================================
# OPTION 4: Google Vertex AI
# =============================================================================
# https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-partner-models
# Requires a GCP service account with roles/aiplatform.user.
# Download the SA key JSON from GCP Console (IAM > Service Accounts > Keys).
# Requires the model tier overrides above to be set with Vertex AI model IDs.
# Example Vertex AI model IDs:
# ANTHROPIC_SMALL_MODEL=claude-haiku-4-5@20251001
# ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
# ANTHROPIC_LARGE_MODEL=claude-opus-4-7
# CLAUDE_CODE_USE_VERTEX=1
# CLOUD_ML_REGION=us-east5
# ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id
# GOOGLE_APPLICATION_CREDENTIALS=./credentials/google-sa-key.json
+7 -5
@@ -30,15 +30,17 @@ jobs:
run: |
set -euo pipefail
BASE="2.0.0"
LATEST=$(npm view "@keygraph/shannon" dist-tags.beta 2>/dev/null || echo "")
if [[ -z "$LATEST" ]]; then
echo "version=1.0.0-beta.1" >> "$GITHUB_OUTPUT"
else
# Extract N from 1.0.0-beta.N and increment
if [[ "$LATEST" == "$BASE-beta."* ]]; then
# Same base version — increment the beta counter (e.g. 2.0.0-beta.2 -> 2.0.0-beta.3)
N=$(echo "$LATEST" | grep -oE 'beta\.([0-9]+)' | grep -oE '[0-9]+')
NEXT=$((N + 1))
echo "version=1.0.0-beta.$NEXT" >> "$GITHUB_OUTPUT"
echo "version=$BASE-beta.$NEXT" >> "$GITHUB_OUTPUT"
else
# No prior beta, or a different base (e.g. last beta was 1.0.0-beta.N) — start over.
echo "version=$BASE-beta.1" >> "$GITHUB_OUTPUT"
fi
- name: Print version
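Review note: the new version-selection branch in this step can be restated as a small pure function, which makes the three cases (same base, different base, no prior beta) easier to check. This is only a sketch of the shell logic above; `nextBetaVersion` is a hypothetical helper name and does not exist in the repository.

```typescript
// Sketch of the beta-version selection performed by the workflow step.
// `nextBetaVersion` is a hypothetical name used for illustration only.
function nextBetaVersion(base: string, latest: string): string {
  const prefix = `${base}-beta.`;
  if (latest.startsWith(prefix)) {
    // Same base version: increment the numeric beta counter.
    const match = /beta\.([0-9]+)/.exec(latest);
    if (match) {
      return `${prefix}${Number(match[1]) + 1}`;
    }
  }
  // No prior beta, or the last beta used a different base: start over at 1.
  return `${prefix}1`;
}

console.log(nextBetaVersion("2.0.0", "2.0.0-beta.2")); // 2.0.0-beta.3
console.log(nextBetaVersion("2.0.0", "1.0.0-beta.7")); // 2.0.0-beta.1
console.log(nextBetaVersion("2.0.0", ""));             // 2.0.0-beta.1
```

One difference from the shell version: a tag that matches the prefix but carries no digits falls back to `beta.1` here, whereas the script would likely abort under `set -euo pipefail` when `grep` finds no match.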
+2 -2
@@ -4,7 +4,7 @@ on:
workflow_dispatch:
inputs:
version:
description: "Beta version to roll back to (example: 1.0.0-beta.2)"
description: "Beta version to roll back to (example: 2.0.0-beta.2)"
required: true
type: string
@@ -31,7 +31,7 @@ jobs:
VERSION="${RAW_VERSION#v}"
if ! [[ "$VERSION" =~ ^[0-9]+\.[0-9]+\.[0-9]+-beta\.[0-9]+$ ]]; then
echo "Version must be in format X.Y.Z-beta.N (e.g. 1.0.0-beta.2)"
echo "Version must be in format X.Y.Z-beta.N (e.g. 2.0.0-beta.2)"
exit 1
fi
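Review note: the rollback input check strips one leading `v` and then requires the `X.Y.Z-beta.N` shape. The same rule is sketched below in TypeScript, reusing the regular expression from the workflow. `normalizeBetaVersion` is a hypothetical name chosen for this illustration.

```typescript
// Same pattern as the workflow's bash `=~` test.
const BETA_VERSION = /^[0-9]+\.[0-9]+\.[0-9]+-beta\.[0-9]+$/;

// Hypothetical helper mirroring `VERSION="${RAW_VERSION#v}"` plus the format check.
function normalizeBetaVersion(raw: string): string {
  // `${RAW_VERSION#v}` removes a single leading "v" if present.
  const version = raw.startsWith("v") ? raw.slice(1) : raw;
  if (!BETA_VERSION.test(version)) {
    throw new Error("Version must be in format X.Y.Z-beta.N (e.g. 2.0.0-beta.2)");
  }
  return version;
}

console.log(normalizeBetaVersion("v2.0.0-beta.2")); // 2.0.0-beta.2
```

Stable versions such as `2.0.0` and tags with a non-numeric counter are rejected, which matches the intent of a beta-only rollback.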
+5 -5
@@ -122,7 +122,7 @@ Infra (Temporal) runs via `docker-compose.yml`. Workers are ephemeral `docker ru
- `apps/worker/src/paths.ts` — Centralized path constants (`PROMPTS_DIR`, `CONFIGS_DIR`, `WORKSPACES_DIR`)
- `apps/worker/src/session-manager.ts` — Agent definitions (`AGENTS` record). Agent types in `apps/worker/src/types/agents.ts`
- `apps/worker/src/config-parser.ts` — YAML config parsing with JSON Schema validation
- `apps/worker/src/ai/claude-executor.ts` — Claude Agent SDK integration with retry logic
- `apps/worker/src/ai/pi-executor.ts` — pi harness integration (retry disabled; Temporal owns retry)
- `apps/worker/src/services/` — Business logic layer (Temporal-agnostic). Activities delegate here. Key: `agent-execution.ts`, `error-handling.ts`, `container.ts`
- `apps/worker/src/types/` — Consolidated types: `Result<T,E>`, `ErrorCode`, `AgentName`, `ActivityLogger`, etc.
- `apps/worker/src/utils/` — Shared utilities (file I/O, formatting, concurrency)
@@ -145,9 +145,9 @@ Durable workflow orchestration with crash recovery, queryable progress, intellig
5. **Reporting** (`report`) — Executive-level security report
### Supporting Systems
- **Configuration** — YAML configs in `apps/worker/configs/` with JSON Schema validation (`config-schema.json`). Supports auth settings (MFA/TOTP), URL/code rule scoping (`rules.avoid`/`rules.focus`), run-scope steering (`vuln_classes`, `exploit`), free-form `rules_of_engagement`, and post-hoc `report` filters (`min_severity`, `min_confidence`, `guidance`). `code_path` avoid rules are written into `~/.claude/settings.json` `permissions.deny` (`Read`/`Edit`) once per workflow by `apps/worker/src/temporal/activities.ts:syncCodePathDenyRules` so the SDK enforces them at the tool layer even in `bypassPermissions` mode. `vuln_classes`/`exploit` scope is locked into `session.json` on first run; resumes with a different scope fail fast (`persistOrValidateRunScope`). Credential resolution — local mode: env vars → `./.env`; npx mode: env vars → `~/.shannon/config.toml` (via `shn setup`)
- **Configuration** — YAML configs in `apps/worker/configs/` with JSON Schema validation (`config-schema.json`). Supports auth settings (MFA/TOTP), URL/code rule scoping (`rules.avoid`/`rules.focus`), run-scope steering (`vuln_classes`, `exploit`), free-form `rules_of_engagement`, and post-hoc `report` filters (`min_severity`, `min_confidence`, `guidance`). `code_path` avoid rules are enforced via the `@gotgenes/pi-permission-system` extension: `apps/worker/src/temporal/activities.ts:syncCodePathDenyRules` writes a global `path` deny config once per workflow (`apps/worker/src/ai/settings-writer.ts:writeCodePathPermissionConfig`), and the executor loads the extension when that config is present (`apps/worker/src/ai/pi-executor.ts`), so denies fire across every tool and child `task` session. `vuln_classes`/`exploit` scope is locked into `session.json` on first run; resumes with a different scope fail fast (`persistOrValidateRunScope`). Credential resolution — local mode: env vars → `./.env`; npx mode: env vars → `~/.shannon/config.toml` (via `shn setup`)
- **Prompts** — Per-phase templates in `apps/worker/prompts/` with variable substitution (`{{TARGET_URL}}`, `{{CONFIG_CONTEXT}}`). Shared partials in `apps/worker/prompts/shared/` via `apps/worker/src/services/prompt-manager.ts`, including `_code-path-rules.txt` (focus/avoid `[FILE]`/`[GLOB]` routing) and `_rules-of-engagement.txt` (free-text engagement rules). When `exploit: false`, `apps/worker/src/services/findings-renderer.ts` deterministically converts each `*_exploitation_queue.json` into a `*_findings.md` for report assembly — no LLM in the loop
- **SDK Integration** — Uses `@anthropic-ai/claude-agent-sdk` with `maxTurns: 10_000` and `bypassPermissions` mode. Adaptive thinking is enabled by default on Opus 4.6/4.7 (`supportsAdaptiveThinking` in `apps/worker/src/ai/models.ts`); disable per-scan via `CLAUDE_ADAPTIVE_THINKING=false` (env) or `core.adaptive_thinking = false` (npx TOML). Browser automation via `playwright-cli` with session isolation (`-s=<session>`). TOTP generation via `generate-totp` CLI tool. Login flow template at `apps/worker/prompts/shared/login-instructions.txt` supports form, SSO, API, and basic auth. On authenticated whitebox scans, the `validate-authentication` preflight performs the single real login and saves the browser session to `auth-state.json` in the per-session audit directory (path from `authStateFile()` in `apps/worker/src/audit/utils.ts`, derived from `generateAuditPath()`). The validation activity (`apps/worker/src/services/validate-authentication.ts`) removes any stale file from a prior run before the agent runs and verifies the file parses and contains cookies or storage before the preflight is marked complete; `logWorkflowComplete` deletes it when the workflow ends so authenticated cookies don't sit on disk between scans. Agent prompts opt in to session reuse by `@include(shared/_shared-session.txt)` before their `<login_instructions>` block — the partial restores the session and falls through to the full login flow if verification fails. `vuln-auth`/`exploit-auth` omit the include and own their own login
- **Agent Harness (pi)** — Uses the **pi harness** (`@earendil-works/pi-coding-agent`, requires Node ≥ 22.19) via `apps/worker/src/ai/pi-executor.ts` (`runPiPrompt` → `createAgentSession`, retry disabled so Temporal owns retry). Models resolve through pi-ai in `apps/worker/src/ai/models.ts` (Anthropic / Bedrock / custom base URL via `ModelRegistry`+`AuthStorage`). pi ships no JSON-schema output or `Task`/`TodoWrite` built-ins, so structured queues are captured via a `submit_exploitation_queue` custom tool (`apps/worker/src/ai/queue-schemas.ts`), and `task` (read-only child sessions) + `todo_write` are provided as custom tools (`apps/worker/src/ai/tools.ts`); the per-phase MCP collectors are pi custom tools (TypeBox `defineTool` in `apps/worker/src/mcp-server/`). Adaptive thinking (pi's `medium` level) is enabled only on Opus 4.6/4.7/4.8 (`supportsAdaptiveThinking`); every other model runs with thinking `off`. Disable per-scan via `CLAUDE_ADAPTIVE_THINKING=false` (`off`) / `core.adaptive_thinking = false` (npx TOML). Browser automation via `playwright-cli` with session isolation (`-s=<session>`). TOTP generation via `generate-totp` CLI tool. Login flow template at `apps/worker/prompts/shared/login-instructions.txt` supports form, SSO, API, and basic auth. On authenticated whitebox scans, the `validate-authentication` preflight performs the single real login and saves the browser session to `auth-state.json` in the per-session audit directory (path from `authStateFile()` in `apps/worker/src/audit/utils.ts`, derived from `generateAuditPath()`). The validation activity (`apps/worker/src/services/validate-authentication.ts`) removes any stale file from a prior run before the agent runs and verifies the file parses and contains cookies or storage before the preflight is marked complete; `logWorkflowComplete` deletes it when the workflow ends so authenticated cookies don't sit on disk between scans. Agent prompts opt in to session reuse by `@include(shared/_shared-session.txt)` before their `<login_instructions>` block — the partial restores the session and falls through to the full login flow if verification fails. `vuln-auth`/`exploit-auth` omit the include and own their own login
- **Audit System** — Crash-safe append-only logging in `workspaces/{hostname}_{sessionId}/`. Tracks session metrics, per-agent logs, prompts, and deliverables. WorkflowLogger (`apps/worker/src/audit/workflow-logger.ts`) provides unified human-readable per-workflow logs, backed by LogStream (`apps/worker/src/audit/log-stream.ts`) shared stream primitive
- **Deliverables** — Saved to `deliverables/` in the target repo via the `save-deliverable` CLI script (`apps/worker/src/scripts/save-deliverable.ts`)
- **Workspaces & Resume** — Named workspaces via `-w <name>` or auto-named from URL+timestamp. Resume detects completed agents via `session.json`. `loadResumeState()` in `apps/worker/src/temporal/activities.ts` validates deliverable existence, restores git checkpoints, and cleans up incomplete deliverables. Workspace listing via `apps/worker/src/temporal/workspaces.ts`
@@ -168,7 +168,7 @@ Durable workflow orchestration with crash recovery, queryable progress, intellig
### Key Design Patterns
- **Configuration-Driven** — YAML configs with JSON Schema validation
- **Progressive Analysis** — Each phase builds on previous results
- **SDK-First** — Claude Agent SDK handles autonomous analysis
- **Harness-First** — the pi harness (`@earendil-works/pi-coding-agent`) handles autonomous analysis
- **Modular Error Handling** — `ErrorCode` enum, `Result<T,E>` for explicit error propagation, automatic retry (3 attempts per agent)
- **Services Boundary** — Activities are thin Temporal wrappers; `apps/worker/src/services/` owns business logic, accepts `ActivityLogger`, returns `Result<T,E>`. No Temporal imports in services
- **DI Container** — Per-workflow in `apps/worker/src/services/container.ts`. `AuditSession` excluded (parallel safety)
@@ -228,7 +228,7 @@ Comments must be **timeless** — no references to this conversation, refactorin
**Entry Points:** `apps/worker/src/temporal/workflows.ts`, `apps/worker/src/temporal/activities.ts`, `apps/worker/src/temporal/worker.ts`
**Core Logic:** `apps/worker/src/session-manager.ts`, `apps/worker/src/ai/claude-executor.ts`, `apps/worker/src/ai/settings-writer.ts` (writes `code_path` deny rules to `~/.claude/settings.json`), `apps/worker/src/config-parser.ts`, `apps/worker/src/services/` (incl. `preflight.ts`, `findings-renderer.ts`, `reporting.ts`), `apps/worker/src/audit/`
**Core Logic:** `apps/worker/src/session-manager.ts`, `apps/worker/src/ai/pi-executor.ts`, `apps/worker/src/ai/settings-writer.ts` (writes `code_path` deny rules to the `@gotgenes/pi-permission-system` global config), `apps/worker/src/config-parser.ts`, `apps/worker/src/services/` (incl. `preflight.ts`, `findings-renderer.ts`, `reporting.ts`), `apps/worker/src/audit/`
**Config:** `docker-compose.yml`, `apps/cli/infra/compose.yml`, `apps/worker/configs/`, `apps/worker/prompts/`, `tsconfig.base.json` (shared compiler options), `turbo.json`, `biome.json`
+1 -1
@@ -91,7 +91,7 @@ COPY --from=builder /app/node_modules /app/node_modules
COPY --from=builder /app/apps/worker /app/apps/worker
COPY --from=builder /app/apps/cli/package.json /app/apps/cli/package.json
RUN npm install -g --ignore-scripts @anthropic-ai/claude-code@2.1.84 @playwright/cli@0.1.1
RUN npm install -g --ignore-scripts @playwright/cli@0.1.1
RUN mkdir -p /tmp/.claude/skills && \
playwright-cli install --skills && \
cp -r .claude/skills/playwright-cli /tmp/.claude/skills/ && \
+2 -2
@@ -78,7 +78,7 @@ Sample Shannon Lite penetration test reports from intentionally vulnerable appli
- **Docker** - required for the worker container.
- **Node.js 18+** - required for the recommended `npx` workflow.
- **AI provider credentials** - Anthropic is recommended; AWS Bedrock, Google Vertex AI, and compatible proxy setups are documented separately.
- **AI provider credentials** - Anthropic is recommended; AWS Bedrock and compatible proxy setups are documented separately.
### Run Shannon Lite
@@ -194,7 +194,7 @@ Use these guides for operational detail:
| --- | --- |
| [Source build and CLI commands](docs/development.md) | Cloning, building, common commands, output paths, and local development. |
| [Configuration](docs/configuration.md) | Authenticated testing, login flows, rules of engagement, report filters, and rate-limit settings. |
| [AI providers](docs/ai-providers.md) | Anthropic, AWS Bedrock, Google Vertex AI, and custom Anthropic-compatible endpoints. |
| [AI providers](docs/ai-providers.md) | Anthropic, AWS Bedrock, and custom Anthropic-compatible endpoints. |
| [Platforms and networking](docs/platforms.md) | Windows/WSL2, Linux, macOS, Docker networking, local apps, and custom hostnames. |
| [Workspaces and resuming](docs/workspaces.md) | Naming workspaces, resuming interrupted scans, and workspace storage. |
| [Safety and limitations](docs/safety.md) | Authorized-use requirements, non-production guidance, mutative effects, cost, and model caveats. |
+10 -83
@@ -5,7 +5,6 @@
* then persists everything to ~/.shannon/config.toml with 0o600 permissions.
*/
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';
import * as p from '@clack/prompts';
@@ -13,7 +12,7 @@ import { type ShannonConfig, saveConfig } from '../config/writer.js';
const SHANNON_HOME = path.join(os.homedir(), '.shannon');
type Provider = 'anthropic' | 'custom_base_url' | 'bedrock' | 'vertex';
type Provider = 'anthropic' | 'custom_base_url' | 'bedrock';
export async function setup(): Promise<void> {
p.intro('Shannon Setup');
@@ -25,7 +24,6 @@ export async function setup(): Promise<void> {
{ value: 'anthropic' as const, label: 'Claude Direct', hint: 'recommended' },
{ value: 'custom_base_url' as const, label: 'Custom Base URL', hint: 'proxies, gateways' },
{ value: 'bedrock' as const, label: 'Claude via AWS Bedrock' },
{ value: 'vertex' as const, label: 'Claude via Google Vertex AI' },
],
});
if (p.isCancel(provider)) return cancelAndExit();
@@ -40,7 +38,7 @@ export async function setup(): Promise<void> {
const configPath = path.join(SHANNON_HOME, 'config.toml');
p.log.success(`Configuration saved to ${configPath}`);
p.outro('Run `npx @keygraph/shannon start` to begin a scan.');
p.outro('Run `npx @keygraph/shannon@beta start` to begin a scan.');
}
async function setupProvider(provider: Provider): Promise<ShannonConfig> {
@@ -51,8 +49,6 @@ async function setupProvider(provider: Provider): Promise<ShannonConfig> {
return setupCustomBaseUrl();
case 'bedrock':
return setupBedrock();
case 'vertex':
return setupVertex();
}
}
@@ -83,7 +79,7 @@ async function setupAnthropic(): Promise<ShannonConfig> {
'Do you want to change the default models?\n' +
' Small - claude-haiku-4-5-20251001\n' +
' Medium - claude-sonnet-4-6\n' +
' Large - claude-opus-4-7',
' Large - claude-opus-4-8',
initialValue: false,
});
if (p.isCancel(customizeModels)) return cancelAndExit();
@@ -105,7 +101,7 @@ async function setupAnthropic(): Promise<ShannonConfig> {
const large = await p.text({
message: 'Large model ID',
initialValue: 'claude-opus-4-7',
initialValue: 'claude-opus-4-8',
validate: required('Large model ID is required'),
});
if (p.isCancel(large)) return cancelAndExit();
@@ -143,7 +139,7 @@ async function setupCustomBaseUrl(): Promise<ShannonConfig> {
'Do you want to change the default models?\n' +
' Small - claude-haiku-4-5-20251001\n' +
' Medium - claude-sonnet-4-6\n' +
' Large - claude-opus-4-7',
' Large - claude-opus-4-8',
initialValue: false,
});
if (p.isCancel(customizeModels)) return cancelAndExit();
@@ -165,7 +161,7 @@ async function setupCustomBaseUrl(): Promise<ShannonConfig> {
const large = await p.text({
message: 'Large model ID',
initialValue: 'claude-opus-4-7',
initialValue: 'claude-opus-4-8',
validate: required('Large model ID is required'),
});
if (p.isCancel(large)) return cancelAndExit();
@@ -202,7 +198,7 @@ async function setupBedrock(): Promise<ShannonConfig> {
const large = await p.text({
message: 'Large model ID',
placeholder: 'us.anthropic.claude-opus-4-7',
placeholder: 'us.anthropic.claude-opus-4-8',
validate: required('Large model ID is required'),
});
if (p.isCancel(large)) return cancelAndExit();
@@ -213,84 +209,15 @@ async function setupBedrock(): Promise<ShannonConfig> {
};
}
async function setupVertex(): Promise<ShannonConfig> {
// 1. Collect region and project ID
const region = await p.text({
message: 'Google Cloud region',
placeholder: 'us-east5',
validate: required('Region is required'),
});
if (p.isCancel(region)) return cancelAndExit();
const projectId = await p.text({
message: 'GCP Project ID',
validate: required('Project ID is required'),
});
if (p.isCancel(projectId)) return cancelAndExit();
// 2. File picker for service account key
p.log.info('Select the path to your GCP Service Account JSON key file.');
const keySourcePath = await p.path({
message: 'Service Account JSON key file',
validate: (value) => {
if (!value) return 'Path is required';
if (!fs.existsSync(value)) return 'File not found';
if (!value.endsWith('.json')) return 'Must be a .json file';
return undefined;
},
});
if (p.isCancel(keySourcePath)) return cancelAndExit();
// 3. Copy key to ~/.shannon/ and lock permissions
const destPath = path.join(SHANNON_HOME, 'google-sa-key.json');
fs.mkdirSync(SHANNON_HOME, { recursive: true });
fs.copyFileSync(keySourcePath, destPath);
fs.chmodSync(destPath, 0o600);
p.log.success(`Key copied to ${destPath} (permissions: 0600)`);
// 4. Model tiers
const models = await p.group({
small: () =>
p.text({
message: 'Small model ID',
placeholder: 'claude-haiku-4-5@20251001',
validate: required('Small model ID is required'),
}),
medium: () =>
p.text({
message: 'Medium model ID',
placeholder: 'claude-sonnet-4-6',
validate: required('Medium model ID is required'),
}),
large: () =>
p.text({
message: 'Large model ID',
placeholder: 'claude-opus-4-7',
validate: required('Large model ID is required'),
}),
});
if (p.isCancel(models)) return cancelAndExit();
return {
vertex: {
use: true,
region,
project_id: projectId,
key_path: destPath,
},
models: { small: models.small, medium: models.medium, large: models.large },
};
}
// === Helpers ===
async function maybePromptAdaptiveThinking(config: ShannonConfig): Promise<void> {
const m = config.models;
const hasOpus47 = !m || [m.small, m.medium, m.large].some((v) => v && /opus-4-[67]/.test(v));
if (!hasOpus47) return;
const hasAdaptiveModel = !m || [m.small, m.medium, m.large].some((v) => v && /opus-4-[678]/.test(v));
if (!hasAdaptiveModel) return;
const enable = await p.confirm({
message: 'Enable adaptive thinking on Opus 4.6/4.7? Claude decides when and how deeply to reason.',
message: 'Enable adaptive thinking on Opus 4.6/4.7/4.8? Claude decides when and how deeply to reason.',
initialValue: true,
});
if (p.isCancel(enable)) return cancelAndExit();
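Review note: the gate that decides whether to show the adaptive-thinking prompt can be isolated as a predicate. The sketch below reuses the `/opus-4-[678]/` pattern from the change; `hasAdaptiveModel` as a standalone function and the `ModelTiers` type are assumptions made for illustration, not code from this PR.

```typescript
// Assumed shape of the optional model overrides (mirrors ShannonConfig.models).
interface ModelTiers {
  small?: string;
  medium?: string;
  large?: string;
}

// Returns true when the prompt should be shown.
function hasAdaptiveModel(models?: ModelTiers): boolean {
  // No overrides means the defaults apply, and the default large tier is an Opus model.
  if (!models) return true;
  return [models.small, models.medium, models.large].some(
    (id) => id !== undefined && /opus-4-[678]/.test(id),
  );
}

console.log(hasAdaptiveModel());                                          // true
console.log(hasAdaptiveModel({ large: "us.anthropic.claude-opus-4-8" })); // true
console.log(hasAdaptiveModel({ large: "claude-sonnet-4-6" }));            // false
```

Because the pattern is an unanchored substring match, Bedrock-style IDs with a `us.anthropic.` prefix are covered without extra handling.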
+18 -10
@@ -10,7 +10,7 @@ import fs from 'node:fs';
import path from 'node:path';
import { ensureImage, ensureInfra, randomSuffix, spawnWorker } from '../docker.js';
import { buildEnvFlags, loadEnv, validateCredentials } from '../env.js';
import { getCredentialsPath, getWorkspacesDir, initHome } from '../home.js';
import { getWorkspacesDir, initHome } from '../home.js';
import { isLocal } from '../mode.js';
import { resolveConfig, resolveRepo } from '../paths.js';
import { displaySplash } from '../splash.js';
@@ -78,13 +78,6 @@ export async function start(args: StartArgs): Promise<void> {
}
fs.mkdirSync(path.join(repo.hostPath, '.playwright'), { recursive: true });
const credentialsPath = getCredentialsPath();
const hasCredentials = fs.existsSync(credentialsPath);
if (hasCredentials) {
process.env.GOOGLE_APPLICATION_CREDENTIALS = '/app/credentials/google-sa-key.json';
}
// 10. Resolve output directory
const outputDir = args.output ? path.resolve(args.output) : undefined;
if (outputDir) {
@@ -107,7 +100,6 @@ export async function start(args: StartArgs): Promise<void> {
containerName,
envFlags: buildEnvFlags(),
...(config && { config }),
...(hasCredentials && { credentials: credentialsPath }),
...(promptsDir && { promptsDir }),
...(outputDir && { outputDir }),
workspace,
@@ -223,7 +215,7 @@ function printInfo(
repoPath: string,
workspacesDir: string,
): void {
const logsCmd = isLocal() ? `./shannon logs ${workspace}` : `npx @keygraph/shannon logs ${workspace}`;
const logsCmd = isLocal() ? `./shannon logs ${workspace}` : `npx @keygraph/shannon@beta logs ${workspace}`;
const reportsPath = path.join(workspacesDir, workspace);
console.log(` Target: ${args.url}`);
@@ -235,6 +227,22 @@ function printInfo(
if (args.pipelineTesting) {
console.log(' Mode: Pipeline Testing');
}
// Surface Fable usage: its safety classifiers route cybersecurity tasks to
// Opus 4.8, so those phases run on Opus 4.8 regardless of the tier setting.
const fableTiers = (
[
['small', process.env.ANTHROPIC_SMALL_MODEL],
['medium', process.env.ANTHROPIC_MEDIUM_MODEL],
['large', process.env.ANTHROPIC_LARGE_MODEL],
] as const
).filter(([, model]) => model && /fable/i.test(model));
if (fableTiers.length > 0) {
const tierList = fableTiers.map(([tier, model]) => `${tier} (${model})`).join(', ');
console.log(` Note: ${tierList} set to a Fable model. Fable's safety classifiers`);
console.log(' route cybersecurity tasks to Opus 4.8, so those phases run on Opus 4.8.');
}
console.log('');
console.log(' Monitor:');
if (workflowId) {
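Review note: the Fable notice filters the three tier variables by a case-insensitive `fable` match. A testable restatement is sketched below. It takes the environment as a parameter instead of reading `process.env`; `findFableTiers` and `formatFableTiers` are hypothetical names, and the model ID `claude-fable-5` in the usage line is a placeholder, not a confirmed identifier.

```typescript
type Tier = "small" | "medium" | "large";

// Hypothetical helper: returns the tiers whose configured model looks like a Fable model.
function findFableTiers(env: Record<string, string | undefined>): Array<[Tier, string]> {
  const candidates: Array<[Tier, string | undefined]> = [
    ["small", env["ANTHROPIC_SMALL_MODEL"]],
    ["medium", env["ANTHROPIC_MEDIUM_MODEL"]],
    ["large", env["ANTHROPIC_LARGE_MODEL"]],
  ];
  const matches: Array<[Tier, string]> = [];
  for (const [tier, model] of candidates) {
    // Unset tiers are skipped; the match ignores case, as in the diff.
    if (model && /fable/i.test(model)) matches.push([tier, model]);
  }
  return matches;
}

// Builds the same "tier (model)" list used in the printed note.
function formatFableTiers(tiers: Array<[Tier, string]>): string {
  return tiers.map(([tier, model]) => `${tier} (${model})`).join(", ");
}

// Placeholder model ID, for illustration only.
console.log(formatFableTiers(findFableTiers({ ANTHROPIC_LARGE_MODEL: "claude-fable-5" })));
```

Passing the environment in explicitly keeps the check free of global state, which is the only intended change from the inline version.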
+1 -1
@@ -33,5 +33,5 @@ export async function uninstall(): Promise<void> {
fs.rmSync(SHANNON_HOME, { recursive: true, force: true });
p.log.success('All Shannon data has been removed.');
p.outro('Shannon has been uninstalled. Run `npx @keygraph/shannon setup` to start fresh.');
p.outro('Shannon has been uninstalled. Run `npx @keygraph/shannon@beta setup` to start fresh.');
}
+3 -20
@@ -24,7 +24,6 @@ interface ConfigMapping {
/** Maps every supported env var to its TOML path (section.key) and expected type. */
const CONFIG_MAP: readonly ConfigMapping[] = [
// Core
{ env: 'CLAUDE_CODE_MAX_OUTPUT_TOKENS', toml: 'core.max_tokens', type: 'number' },
{ env: 'CLAUDE_ADAPTIVE_THINKING', toml: 'core.adaptive_thinking', type: 'boolean', boolFormat: 'literal' },
// Anthropic
@@ -36,12 +35,6 @@ const CONFIG_MAP: readonly ConfigMapping[] = [
{ env: 'AWS_REGION', toml: 'bedrock.region', type: 'string' },
{ env: 'AWS_BEARER_TOKEN_BEDROCK', toml: 'bedrock.token', type: 'string' },
// Vertex
{ env: 'CLAUDE_CODE_USE_VERTEX', toml: 'vertex.use', type: 'boolean' },
{ env: 'CLOUD_ML_REGION', toml: 'vertex.region', type: 'string' },
{ env: 'ANTHROPIC_VERTEX_PROJECT_ID', toml: 'vertex.project_id', type: 'string' },
{ env: 'GOOGLE_APPLICATION_CREDENTIALS', toml: 'vertex.key_path', type: 'string' },
// Custom Base URL
{ env: 'ANTHROPIC_BASE_URL', toml: 'custom_base_url.base_url', type: 'string' },
{ env: 'ANTHROPIC_AUTH_TOKEN', toml: 'custom_base_url.auth_token', type: 'string' },
@@ -99,7 +92,7 @@ function loadTOML(): TOMLConfig | null {
} catch (err) {
const message = err instanceof Error ? err.message : String(err);
console.error(`\nFailed to parse ${configPath}: ${message}`);
console.error(`\nRun 'npx @keygraph/shannon setup' to reconfigure.\n`);
console.error(`\nRun 'npx @keygraph/shannon@beta setup' to reconfigure.\n`);
process.exit(1);
}
}
@@ -154,20 +147,10 @@ function validateProviderFields(config: TOMLConfig, provider: string, errors: st
validateModelTiers(config, 'bedrock', errors);
break;
}
case 'vertex': {
const required = ['use', 'region', 'project_id', 'key_path'];
const missing = required.filter((k) => !keys.includes(k));
if (missing.length > 0) {
errors.push(`[vertex] missing required keys: ${missing.join(', ')}`);
}
validateModelTiers(config, 'vertex', errors);
break;
}
}
}
/** Bedrock and Vertex require a [models] section with all three tiers. */
/** Bedrock requires a [models] section with all three tiers. */
function validateModelTiers(config: TOMLConfig, provider: string, errors: string[]): void {
const models = config.models as Record<string, unknown> | undefined;
if (!models || typeof models !== 'object') {
@@ -227,7 +210,7 @@ function validateConfig(config: TOMLConfig): string[] {
}
// 4. Only one provider section allowed (ignore empty sections)
const PROVIDER_SECTIONS = ['anthropic', 'custom_base_url', 'bedrock', 'vertex'] as const;
const PROVIDER_SECTIONS = ['anthropic', 'custom_base_url', 'bedrock'] as const;
const present = PROVIDER_SECTIONS.filter((s) => {
const section = config[s];
return section && typeof section === 'object' && Object.keys(section).length > 0;
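Review note: with `vertex` removed, the single-provider rule covers three sections. The sketch below shows how the non-empty-section filter from this hunk can feed a uniqueness check. The `TOMLConfig` alias, the `providerErrors` function, and the error wording are assumptions for illustration; the hunk does not show what the real code does with `present`.

```typescript
const PROVIDER_SECTIONS = ["anthropic", "custom_base_url", "bedrock"] as const;

// Assumed loose shape of a parsed TOML document.
type TOMLConfig = Record<string, unknown>;

// Lists provider sections that exist and contain at least one key.
function presentProviders(config: TOMLConfig): string[] {
  return PROVIDER_SECTIONS.filter((name) => {
    const section = config[name];
    return typeof section === "object" && section !== null && Object.keys(section).length > 0;
  });
}

// Hypothetical follow-up check: more than one populated provider is an error.
function providerErrors(config: TOMLConfig): string[] {
  const present = presentProviders(config);
  return present.length > 1
    ? [`Only one provider section is allowed, found: ${present.join(", ")}`]
    : [];
}

console.log(providerErrors({ anthropic: { api_key: "x" }, bedrock: {} }).length); // 0
```

Empty sections are ignored, so a leftover `[bedrock]` header with no keys does not conflict with a populated `[anthropic]` section.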
+1 -2
@@ -8,11 +8,10 @@ import { getConfigFile } from '../home.js';
// === Types ===
export interface ShannonConfig {
core?: { max_tokens?: number; adaptive_thinking?: boolean };
core?: { adaptive_thinking?: boolean };
anthropic?: { api_key?: string; oauth_token?: string };
custom_base_url?: { base_url?: string; auth_token?: string };
bedrock?: { use?: boolean; region?: string; token?: string };
vertex?: { use?: boolean; region?: string; project_id?: string; key_path?: string };
models?: { small?: string; medium?: string; large?: string };
}
-6
@@ -236,7 +236,6 @@ export interface WorkerOptions {
containerName: string;
envFlags: string[];
config?: { hostPath: string; containerPath: string };
credentials?: string;
promptsDir?: string;
outputDir?: string;
workspace: string;
@@ -291,11 +290,6 @@ export function spawnWorker(opts: WorkerOptions): ChildProcess {
args.push('-v', `${opts.outputDir}:/app/output`);
}
// Mount credentials file to fixed container path
if (opts.credentials) {
args.push('-v', `${opts.credentials}:/app/credentials/google-sa-key.json:ro`);
}
// Environment
args.push(...opts.envFlags);
+2 -31
@@ -18,14 +18,9 @@ const FORWARD_VARS = [
'CLAUDE_CODE_USE_BEDROCK',
'AWS_REGION',
'AWS_BEARER_TOKEN_BEDROCK',
'CLAUDE_CODE_USE_VERTEX',
'CLOUD_ML_REGION',
'ANTHROPIC_VERTEX_PROJECT_ID',
'GOOGLE_APPLICATION_CREDENTIALS',
'ANTHROPIC_SMALL_MODEL',
'ANTHROPIC_MEDIUM_MODEL',
'ANTHROPIC_LARGE_MODEL',
'CLAUDE_CODE_MAX_OUTPUT_TOKENS',
'CLAUDE_ADAPTIVE_THINKING',
] as const;
@@ -62,7 +57,7 @@ export function buildEnvFlags(): string[] {
interface CredentialValidation {
valid: boolean;
error?: string;
mode: 'api-key' | 'oauth' | 'custom-base-url' | 'bedrock' | 'vertex';
mode: 'api-key' | 'oauth' | 'custom-base-url' | 'bedrock';
}
/** Check if a custom Anthropic-compatible base URL is configured. */
@@ -77,7 +72,6 @@ function detectProviders(): string[] {
if (process.env.CLAUDE_CODE_OAUTH_TOKEN) providers.push('Anthropic OAuth');
if (isCustomBaseUrlConfigured()) providers.push('Custom Base URL');
if (process.env.CLAUDE_CODE_USE_BEDROCK === '1') providers.push('AWS Bedrock');
if (process.env.CLAUDE_CODE_USE_VERTEX === '1') providers.push('Google Vertex');
return providers;
}
@@ -120,34 +114,11 @@ export function validateCredentials(): CredentialValidation {
}
return { valid: true, mode: 'bedrock' };
}
if (process.env.CLAUDE_CODE_USE_VERTEX === '1') {
const missing: string[] = [];
if (!process.env.CLOUD_ML_REGION) missing.push('CLOUD_ML_REGION');
if (!process.env.ANTHROPIC_VERTEX_PROJECT_ID) missing.push('ANTHROPIC_VERTEX_PROJECT_ID');
if (!process.env.ANTHROPIC_SMALL_MODEL) missing.push('ANTHROPIC_SMALL_MODEL');
if (!process.env.ANTHROPIC_MEDIUM_MODEL) missing.push('ANTHROPIC_MEDIUM_MODEL');
if (!process.env.ANTHROPIC_LARGE_MODEL) missing.push('ANTHROPIC_LARGE_MODEL');
if (missing.length > 0) {
return {
valid: false,
mode: 'vertex',
error: `Vertex AI mode requires: ${missing.join(', ')}`,
};
}
if (!process.env.GOOGLE_APPLICATION_CREDENTIALS) {
return {
valid: false,
mode: 'vertex',
error: 'Vertex AI mode requires GOOGLE_APPLICATION_CREDENTIALS',
};
}
return { valid: true, mode: 'vertex' };
}
const hint =
getMode() === 'local'
? `No credentials found. Set ANTHROPIC_API_KEY in .env or export it.`
: `Authentication not configured. Export variables or run 'npx @keygraph/shannon setup'.`;
: `Authentication not configured. Export variables or run 'npx @keygraph/shannon@beta setup'.`;
return {
valid: false,
mode: 'api-key',
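Reviewer sketch of the "collect missing variables, then report" pattern that the removed Vertex branch used, applied to Bedrock. The body of the real Bedrock branch is not shown in this diff, so the `BEDROCK_REQUIRED` list (built from names in `FORWARD_VARS`), the `missingEnv` and `checkBedrock` helpers, and the error wording are assumptions:

```typescript
// Hypothetical helper: names of required variables that are unset or empty.
function missingEnv(required: readonly string[], env: Record<string, string | undefined>): string[] {
  return required.filter((name) => !env[name]);
}

// Assumed requirement list; the actual Bedrock requirements are not visible in this diff.
const BEDROCK_REQUIRED = [
  'AWS_REGION',
  'AWS_BEARER_TOKEN_BEDROCK',
  'ANTHROPIC_SMALL_MODEL',
  'ANTHROPIC_MEDIUM_MODEL',
  'ANTHROPIC_LARGE_MODEL',
] as const;

function checkBedrock(env: Record<string, string | undefined>): { valid: boolean; mode: 'bedrock'; error?: string } {
  const missing = missingEnv(BEDROCK_REQUIRED, env);
  if (missing.length > 0) {
    return { valid: false, mode: 'bedrock', error: `Bedrock mode requires: ${missing.join(', ')}` };
  }
  return { valid: true, mode: 'bedrock' };
}
```

Passing the environment in as an argument, instead of reading `process.env` directly, keeps the check testable without mutating global state.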
+2 -20
@@ -1,7 +1,7 @@
/**
* Shannon state directory management.
*
* Local mode (cloned repo): uses ./workspaces/, ./credentials/
* Local mode (cloned repo): uses ./workspaces/
* NPX mode: uses ~/.shannon/workspaces/, ~/.shannon/
*/
@@ -20,32 +20,14 @@ export function getWorkspacesDir(): string {
return getMode() === 'local' ? path.resolve('workspaces') : path.join(SHANNON_HOME, 'workspaces');
}
/**
* Resolve the Vertex credentials file path.
*
* Checks GOOGLE_APPLICATION_CREDENTIALS env var first (may be set by TOML resolver),
* then falls back to mode-appropriate default location.
*/
export function getCredentialsPath(): string {
const envPath = process.env.GOOGLE_APPLICATION_CREDENTIALS;
if (envPath && fs.existsSync(envPath)) return path.resolve(envPath);
if (getMode() === 'local') {
return path.resolve('credentials', 'google-sa-key.json');
}
return path.join(SHANNON_HOME, 'google-sa-key.json');
}
/**
* Initialize state directories.
* Local mode: creates ./workspaces/ and ./credentials/
* Local mode: creates ./workspaces/
* NPX mode: creates ~/.shannon/workspaces/
*/
export function initHome(): void {
if (getMode() === 'local') {
fs.mkdirSync(path.resolve('workspaces'), { recursive: true });
fs.mkdirSync(path.resolve('credentials'), { recursive: true });
} else {
fs.mkdirSync(path.join(SHANNON_HOME, 'workspaces'), { recursive: true });
}
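Reviewer sketch of the directory resolution that remains once the credentials path is gone. This is a pure variant of `getWorkspacesDir` in which the mode, working directory and home directory are passed in. The `resolveWorkspacesDir` name and the `'npx'` mode string are assumptions, since the diff only shows the comparison against `'local'`:

```typescript
import * as path from 'node:path';

// Assumed mode names; only 'local' appears in the diff.
type Mode = 'local' | 'npx';

// Hypothetical pure variant of getWorkspacesDir.
function resolveWorkspacesDir(mode: Mode, cwd: string, shannonHome: string): string {
  return mode === 'local' ? path.resolve(cwd, 'workspaces') : path.join(shannonHome, 'workspaces');
}
```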
+4 -4
@@ -56,7 +56,7 @@ function getVersion(): string {
function showHelp(): void {
const mode = getMode();
const prefix = mode === 'local' ? './shannon' : 'npx @keygraph/shannon';
const prefix = mode === 'local' ? './shannon' : 'npx @keygraph/shannon@beta';
console.log(`
Shannon - AI Penetration Testing Framework
@@ -173,14 +173,14 @@ function parseStartArgs(argv: string[]): ParsedStartArgs {
break;
default:
console.error(`Unknown option: ${arg}`);
console.error(`Run "${getMode() === 'local' ? './shannon' : 'npx @keygraph/shannon'} help" for usage`);
console.error(`Run "${getMode() === 'local' ? './shannon' : 'npx @keygraph/shannon@beta'} help" for usage`);
process.exit(1);
}
}
if (!url || !repo) {
console.error('ERROR: --url and --repo are required');
console.error(`Usage: ${getMode() === 'local' ? './shannon' : 'npx @keygraph/shannon'} start -u <url> -r <path>`);
console.error(`Usage: ${getMode() === 'local' ? './shannon' : 'npx @keygraph/shannon@beta'} start -u <url> -r <path>`);
process.exit(1);
}
@@ -215,7 +215,7 @@ switch (command) {
const workspaceId = args[1];
if (!workspaceId) {
console.error('ERROR: Workspace ID is required');
console.error(`Usage: ${getMode() === 'local' ? './shannon' : 'npx @keygraph/shannon'} logs <workspace>`);
console.error(`Usage: ${getMode() === 'local' ? './shannon' : 'npx @keygraph/shannon@beta'} logs <workspace>`);
process.exit(1);
}
logs(workspaceId);
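Reviewer sketch: the same `local` versus `npx` ternary is edited in four places in this file, so a single helper would make the next tag bump a one-line change. The `cliPrefix` and `usage` names and the `'npx'` mode string are hypothetical. The diff itself keeps the inline ternaries.

```typescript
// Assumed mode names; only 'local' appears in the diff.
type Mode = 'local' | 'npx';

// Hypothetical helper that centralizes the command prefix.
function cliPrefix(mode: Mode): string {
  return mode === 'local' ? './shannon' : 'npx @keygraph/shannon@beta';
}

// Hypothetical helper for the repeated "Usage: ..." hints.
function usage(mode: Mode, rest: string): string {
  return `Usage: ${cliPrefix(mode)} ${rest}`;
}
```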
+5 -1
@@ -19,7 +19,10 @@
"clean": "rm -rf dist"
},
"dependencies": {
"@anthropic-ai/claude-agent-sdk": "catalog:",
"@earendil-works/pi-agent-core": "^0.79.1",
"@earendil-works/pi-ai": "^0.79.1",
"@earendil-works/pi-coding-agent": "^0.79.1",
"@gotgenes/pi-permission-system": "^10.9.0",
"@temporalio/activity": "^1.11.0",
"@temporalio/client": "^1.11.0",
"@temporalio/worker": "^1.11.0",
@@ -28,6 +31,7 @@
"ajv-formats": "^2.1.1",
"dotenv": "^16.4.5",
"js-yaml": "^4.1.0",
"typebox": "1.1.38",
"zod": "^4.3.6",
"zx": "^8.0.0"
},
+20 -20
@@ -116,7 +116,7 @@ Before beginning exploitation, read these strategic intelligence files in order:
2. `.shannon/deliverables/recon_deliverable.md` - Complete API inventory, user roles, and data flow maps.
3. `.shannon/deliverables/auth_analysis_deliverable.md` - Strategic context from the Auth analysis specialist, including notes on session mechanisms, password policies, and flawed logic paths.
- You will manage your work using the **TodoWrite tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
- You will manage your work using the **`todo_write` tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
</starting_context>
<system_architecture>
@@ -145,18 +145,18 @@ You are the **Identity Compromise Specialist** - proving tangible impact of brok
<cli_tools>
- **Browser Automation (playwright-cli skill):** Essential for interacting with multi-step authentication flows, injecting stolen session cookies, and verifying account takeover in a real browser context. Invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Task Agent:** Mandatory coder-executor for brute force batches, credential stuffing, token replay automation, and any scripted workflow.
- **TodoWrite tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **Read tool:** To read false positives from your tracking file at `workspace/auth_false_positives.md`.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent:** Mandatory coder-executor for brute force batches, credential stuffing, token replay automation, and any scripted workflow.
- **`todo_write` tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **`read` tool:** To read false positives from your tracking file at `workspace/auth_false_positives.md`.
**CRITICAL TASK AGENT WORKFLOW:**
- Delegate every custom script or multi-step automation to the Task Agent; do not hand-write scripts in Bash.
- Keep requests ≤15 lines and use the template below so the Task Agent has clear inputs and success criteria.
- Delegate every custom script or multi-step automation to the `task` agent; do not hand-write scripts in Bash.
- Keep requests ≤15 lines and use the template below so the `task` agent has clear inputs and success criteria.
</cli_tools>
## Task Agent Scripting Requirements
Use the Task Agent to execute all bulk or repetitive auth attacks. Manual commands are for quick verification only.
## `task` agent Scripting Requirements
Use the `task` agent to execute all bulk or repetitive auth attacks. Manual commands are for quick verification only.
**MANDATORY TEMPLATE:**
"
@@ -172,7 +172,7 @@ Inputs:
Success Criteria: Print each attempt with status code and key response text so I can evaluate impact.
"
**DO NOT:** exceed 56 inputs per run, ask for advanced parsing, or bypass the Task Agent for scripting.
**DO NOT:** exceed 56 inputs per run, ask for advanced parsing, or bypass the `task` agent for scripting.
<methodology_and_domain_expertise>
@@ -183,11 +183,11 @@ Your primary goal is to achieve a definitive, evidence-backed conclusion for **e
- **Complete the Workflow:** For each vulnerability, you must follow the full Exploitation Workflow from Confirmation to Impact Demonstration.
## **The Task-Based Attack Plan**
Your entire exploitation process will be managed through the TodoWrite tool. You MUST adhere to this process.
Your entire exploitation process will be managed through the `todo_write` tool. You MUST adhere to this process.
**1. Initialization:**
- Read `.shannon/deliverables/auth_exploitation_queue.json`.
- Use the TodoWrite tool to create tasks for each vulnerability in the queue.
- Use the `todo_write` tool to create tasks for each vulnerability in the queue.
- Each task should include the vulnerability ID and the hypothesis to test.
Example tasks:
- "AUTH-VULN-01: Exploit Session Fixation on /login (Hypothesis: Session ID not rotated)"
@@ -195,10 +195,10 @@ Your entire exploitation process will be managed through the TodoWrite tool. You
**2. The Execution Loop:**
- **A. Get Next Task:** Review your todo list and identify the next pending task to work on.
- **B. Plan & Decompose:** Mark the task as in_progress using TodoWrite. Read the `suggested_exploit_technique` from the vulnerability details. This is your primary directive. Plan the specific commands and tools needed to execute this attack pattern. If the attack is complex (e.g., a multi-step logic flaw), use TodoWrite to add smaller tasks.
- **B. Plan & Decompose:** Mark the task as in_progress using `todo_write`. Read the `suggested_exploit_technique` from the vulnerability details. This is your primary directive. Plan the specific commands and tools needed to execute this attack pattern. If the attack is complex (e.g., a multi-step logic flaw), use `todo_write` to add smaller tasks.
- **C. Execute:** Run the command for the current task.
- **D. Update & Document:** Based on the result, update the TodoWrite status to completed if successful. Document the outcome in your memory for the final report.
- **E. Expand the Plan (Branching):** If successful and more steps are needed, use TodoWrite to add the next logical tasks required by the OWASP workflow.
- **D. Update & Document:** Based on the result, update the `todo_write` status to completed if successful. Document the outcome in your memory for the final report.
- **E. Expand the Plan (Branching):** If successful and more steps are needed, use `todo_write` to add the next logical tasks required by the OWASP workflow.
## **OWASP Authentication Exploitation Workflow**
For each root vulnerability, follow this systematic, two-stage attack flow.
@@ -289,7 +289,7 @@ Prioritize these based on the intelligence from your queue and vulnerability ana
**2. Credential-Based Attacks**
- **Credential Stuffing with cURL:**
`curl -X POST -H "Content-Type: application/json" -d '{"username":"admin", "password":"CrackedPassword123"}' https://target.com/api/login`
- **Brute Force (Example with Task Agent):**
- **Brute Force (Example with `task` agent):**
"Task: Write a python script to POST to /api/login with username 'admin' and passwords from this list: ['admin', 'password', '123456']. Print the status code for each."
**3. Logic Flaws**
@@ -304,8 +304,8 @@ Prioritize these based on the intelligence from your queue and vulnerability ana
</attack_patterns>
</methodology_and_domain_expertise>
<mcp_tools>
You emit your exploitation evidence through a single MCP tool — `add_exploit` from the `exploit-collector` server. The host renderer assembles `.shannon/deliverables/auth_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
<deliverable_tools>
You emit your exploitation evidence through a single tool — `add_exploit`. The host renderer assembles `.shannon/deliverables/auth_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
**When to emit.** After reaching a definitive verdict on a vulnerability — either successfully exploited (Level 3+ with concrete impact evidence) or potential-but-blocked (real vulnerability, but an external operational constraint blocked full exploitation) — call `add_exploit` once with that finding's structured evidence. Call once per queue vulnerability; do not batch. Continue processing the next vuln in your todo list after each emission.
@@ -316,7 +316,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
**Idempotency.** Duplicate `vulnerability_id` calls are rejected with `DuplicateError`. Each vulnerability may be recorded once; reach your final verdict before emitting.
**Required-call intent.** Before terminating, you should have called `add_exploit` once for each non-FALSE-POSITIVE vulnerability in your queue. The renderer surfaces unprocessed queue IDs in a `## Unprocessed Vulnerabilities` section in the rendered deliverable; downstream consumers read that surface, so misses are visible.
</mcp_tools>
</deliverable_tools>
<conclusion_trigger>
### Evidence Completeness Verification
@@ -328,7 +328,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
- All technical components specified without ambiguity
COMPLETION REQUIREMENTS (ALL must be satisfied):
1. Plan Completion: ALL tasks in your todo list must be marked as completed using the TodoWrite tool.
1. Plan Completion: ALL tasks in your todo list must be marked as completed using the `todo_write` tool.
2. Evidence Emission: Every non-FALSE-POSITIVE vulnerability in `.shannon/deliverables/auth_exploitation_queue.json` has been recorded with one `add_exploit` call carrying the appropriate `status` ("exploited" or "blocked") and required fields. FALSE POSITIVE findings are recorded only in `workspace/auth_false_positives.md`, not via `add_exploit`.
CRITICAL WARNING: Announcing completion before every item in .shannon/deliverables/auth_exploitation_queue.json has been pursued to a final, evidence-backed conclusion (either successfully exploited, recorded as potential-but-blocked, or verified false positive) will be considered a mission failure.
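Reviewer sketch of the contract the prompt describes for `add_exploit`: one record per `vulnerability_id`, duplicates rejected, and unrecorded queue IDs surfaced afterwards. This is an in-memory illustration only. The `ExploitCollector` class, its method names, and the record shape beyond `vulnerability_id` and `status` are assumptions, not the worker's implementation:

```typescript
// Hypothetical error type named after the DuplicateError mentioned in the prompt.
class DuplicateError extends Error {}

type ExploitStatus = 'exploited' | 'blocked';

interface ExploitRecord {
  vulnerability_id: string;
  status: ExploitStatus;
}

// Hypothetical in-memory collector.
class ExploitCollector {
  private readonly records = new Map<string, ExploitRecord>();

  // Each vulnerability may be recorded once; a second call for the same ID throws.
  addExploit(record: ExploitRecord): void {
    if (this.records.has(record.vulnerability_id)) {
      throw new DuplicateError(`already recorded: ${record.vulnerability_id}`);
    }
    this.records.set(record.vulnerability_id, record);
  }

  // Queue IDs with no recorded verdict, in queue order.
  unprocessed(queueIds: readonly string[]): string[] {
    return queueIds.filter((id) => !this.records.has(id));
  }
}
```

In this sketch, false positives would also show up in `unprocessed`, because the prompt records them only in the false-positives file and not through `add_exploit`. How the real renderer treats them is not visible in this diff.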
+19 -19
@@ -105,7 +105,7 @@ Before beginning exploitation, read these strategic intelligence files in order:
2. `.shannon/deliverables/recon_deliverable.md` - Complete API inventory, user roles, and permission models.
3. `.shannon/deliverables/authz_analysis_deliverable.md` - Strategic context from the Authz analysis specialist, including notes on access control patterns, role hierarchies, and flawed logic paths.
- You will manage your work using the **TodoWrite tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
- You will manage your work using the **`todo_write` tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
</starting_context>
<system_architecture>
@@ -134,18 +134,18 @@ You are the **Privilege Escalation Specialist** - proving tangible impact of bro
<cli_tools>
- **Browser Automation (playwright-cli skill):** Essential for interacting with complex authorization flows, testing role-based access controls in browser contexts, and verifying privilege escalation through UI elements. Invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Task Agent:** Mandatory coder-executor for IDOR sweeps, role escalation loops, and workflow bypass automation.
- **TodoWrite tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **Read tool:** To read false positives from your tracking file at `workspace/authz_false_positives.md`.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent:** Mandatory coder-executor for IDOR sweeps, role escalation loops, and workflow bypass automation.
- **`todo_write` tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **`read` tool:** To read false positives from your tracking file at `workspace/authz_false_positives.md`.
**CRITICAL TASK AGENT WORKFLOW:**
- Delegate every multi-user iteration, role toggle test, or workflow automation script to the Task Agent—never handcraft these scripts yourself.
- Keep requests ≤15 lines and adhere to the template below so the Task Agent can act deterministically.
- Delegate every multi-user iteration, role toggle test, or workflow automation script to the `task` agent—never handcraft these scripts yourself.
- Keep requests ≤15 lines and adhere to the template below so the `task` agent can act deterministically.
</cli_tools>
## Task Agent Scripting Requirements
All repeated authorization tests must run through the Task Agent.
## `task` agent Scripting Requirements
All repeated authorization tests must run through the `task` agent.
**MANDATORY TEMPLATE:**
"
@@ -161,7 +161,7 @@ Inputs:
Success Criteria: Execute one request per identity, logging status code and key response text so I can confirm access levels.
"
**DO NOT:** exceed 5 identities per run, ask for complex diffing, or bypass the Task Agent for scripting.
**DO NOT:** exceed 5 identities per run, ask for complex diffing, or bypass the `task` agent for scripting.
<methodology_and_domain_expertise>
@@ -172,11 +172,11 @@ Your primary goal is to achieve a definitive, evidence-backed conclusion for **e
- **Complete the Workflow:** For each vulnerability, you must follow the full Exploitation Workflow from Confirmation to Impact Demonstration.
## **The Task-Based Attack Plan**
Your entire exploitation process will be managed through the TodoWrite tool. You MUST adhere to this process.
Your entire exploitation process will be managed through the `todo_write` tool. You MUST adhere to this process.
**1. Initialization:**
- Read `.shannon/deliverables/authz_exploitation_queue.json`.
- Use the TodoWrite tool to create tasks for each vulnerability in the queue.
- Use the `todo_write` tool to create tasks for each vulnerability in the queue.
- Each task should include the vulnerability ID, type, and the hypothesis to test.
Example tasks:
- "AUTHZ-VULN-01 (Horizontal): Exploit ownership bypass on /api/user/{id} (Hypothesis: Access to other users' data)"
@@ -185,10 +185,10 @@ Your entire exploitation process will be managed through the TodoWrite tool. You
**2. The Execution Loop:**
- **A. Get Next Task:** Review your todo list and identify the next pending task to work on.
- **B. Plan & Decompose:** Mark the task as in_progress using TodoWrite. Read the vulnerability type (`Horizontal`, `Vertical`, or `Context_Workflow`) and the `minimal_witness` from the vulnerability details. This is your primary directive. Plan the specific commands and tools needed to execute this attack pattern. If the attack is complex (e.g., a multi-step privilege escalation), use TodoWrite to add smaller tasks.
- **B. Plan & Decompose:** Mark the task as in_progress using `todo_write`. Read the vulnerability type (`Horizontal`, `Vertical`, or `Context_Workflow`) and the `minimal_witness` from the vulnerability details. This is your primary directive. Plan the specific commands and tools needed to execute this attack pattern. If the attack is complex (e.g., a multi-step privilege escalation), use `todo_write` to add smaller tasks.
- **C. Execute:** Run the command for the current task.
- **D. Update & Document:** Based on the result, update the TodoWrite status to completed if successful. Document the outcome in your memory for the final report.
- **E. Expand the Plan (Branching):** If successful and more steps are needed, use TodoWrite to add the next logical tasks required by the OWASP workflow.
- **D. Update & Document:** Based on the result, update the `todo_write` status to completed if successful. Document the outcome in your memory for the final report.
- **E. Expand the Plan (Branching):** If successful and more steps are needed, use `todo_write` to add the next logical tasks required by the OWASP workflow.
## **OWASP Authorization Exploitation Workflow**
For each root vulnerability, follow this systematic, two-stage attack flow.
@@ -312,8 +312,8 @@ Remember: The most effective attacks often come from understanding the specific
</attack_patterns>
</methodology_and_domain_expertise>
<mcp_tools>
You emit your exploitation evidence through a single MCP tool — `add_exploit` from the `exploit-collector` server. The host renderer assembles `.shannon/deliverables/authz_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
<deliverable_tools>
You emit your exploitation evidence through a single tool — `add_exploit`. The host renderer assembles `.shannon/deliverables/authz_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
**When to emit.** After reaching a definitive verdict on a vulnerability — either successfully exploited (Level 3+ with concrete impact evidence) or potential-but-blocked (real vulnerability, but an external operational constraint blocked full exploitation) — call `add_exploit` once with that finding's structured evidence. Call once per queue vulnerability; do not batch. Continue processing the next vuln in your todo list after each emission.
@@ -324,7 +324,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
**Idempotency.** Duplicate `vulnerability_id` calls are rejected with `DuplicateError`. Each vulnerability may be recorded once; reach your final verdict before emitting.
**Required-call intent.** Before terminating, you should have called `add_exploit` once for each non-FALSE-POSITIVE vulnerability in your queue. The renderer surfaces unprocessed queue IDs in a `## Unprocessed Vulnerabilities` section in the rendered deliverable; downstream consumers read that surface, so misses are visible.
</mcp_tools>
</deliverable_tools>
<conclusion_trigger>
### Evidence Completeness Verification
@@ -336,7 +336,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
- All technical components specified without ambiguity
COMPLETION REQUIREMENTS (ALL must be satisfied):
1. Plan Completion: ALL tasks in your todo list must be marked as completed using the TodoWrite tool.
1. Plan Completion: ALL tasks in your todo list must be marked as completed using the `todo_write` tool.
2. Evidence Emission: Every non-FALSE-POSITIVE vulnerability in `.shannon/deliverables/authz_exploitation_queue.json` has been recorded with one `add_exploit` call carrying the appropriate `status` ("exploited" or "blocked") and required fields. FALSE POSITIVE findings are recorded only in `workspace/authz_false_positives.md`, not via `add_exploit`.
CRITICAL WARNING: Announcing completion before every item in .shannon/deliverables/authz_exploitation_queue.json has been pursued to a final, evidence-backed conclusion (either successfully exploited, recorded as potential-but-blocked, or verified false positive) will be considered a mission failure.
+19 -19
@@ -87,7 +87,7 @@ Before beginning exploitation, read these strategic intelligence files in order:
- Leverage vulnerability details and context from pre-analysis rather than blind testing.
- Reference provided vulnerable code snippets to avoid blind guessing of query column counts, command syntax, or database/OS types.
- You will manage your work using the **TodoWrite tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
- You will manage your work using the **`todo_write` tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
</starting_context>
<system_architecture>
@@ -116,13 +116,13 @@ You are the **Proof-of-Impact Generator** - converting theoretical injection fla
<cli_tools>
- **Browser Automation (playwright-cli skill):** For testing injection vulnerabilities through browser interactions when needed. Invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Task Agent:** Mandatory coder-executor for any custom scripting beyond single ad-hoc commands.
- **TodoWrite tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **Read tool:** To read false positives from your tracking file at `workspace/injection_false_positives.md`.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent:** Mandatory coder-executor for any custom scripting beyond single ad-hoc commands.
- **`todo_write` tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **`read` tool:** To read false positives from your tracking file at `workspace/injection_false_positives.md`.
**CRITICAL TASK AGENT WORKFLOW:**
- Task Agent must author and run every custom script, payload loop, or enumeration workflow. Do not craft standalone scripts in Bash or other tools.
- `task` agent must author and run every custom script, payload loop, or enumeration workflow. Do not craft standalone scripts in Bash or other tools.
- Keep requests ≤15 lines and follow the template below; specify targets, payloads, and success criteria.
</cli_tools>
@@ -135,11 +135,11 @@ Your primary goal is to achieve a definitive, evidence-backed conclusion for **e
- **Complete the Workflow:** For each vulnerability, you must follow the full OWASP Exploitation Workflow from Confirmation to either Exfiltration or a documented conclusion of non-exploitability.
## **The Task-Based Attack Plan**
Your entire exploitation process will be managed through the TodoWrite tool. You MUST adhere to this process.
Your entire exploitation process will be managed through the `todo_write` tool. You MUST adhere to this process.
**1. Initialization:**
- Read the `.shannon/deliverables/injection_exploitation_queue.json` file.
- Use the TodoWrite tool to create tasks for each vulnerability in the queue.
- Use the `todo_write` tool to create tasks for each vulnerability in the queue.
- Each task should include the vulnerability ID and the hypothesis to test.
Example tasks:
- "SQLI-VULN-01: Exploit endpoint /api/search?q= (Hypothesis: Basic UNION injection)"
@@ -150,16 +150,16 @@ You will repeatedly perform the following loop until all tasks are completed:
- **A. Get Next Task:** Review your todo list and identify the next pending task to work on.
- **B. Plan & Decompose:** Mark the task as in_progress using TodoWrite. Decide on the concrete command or action. If the task is complex (e.g., "Enumerate tables"), use TodoWrite to add smaller, actionable tasks.
- **B. Plan & Decompose:** Mark the task as in_progress using `todo_write`. Decide on the concrete command or action. If the task is complex (e.g., "Enumerate tables"), use `todo_write` to add smaller, actionable tasks.
- **C. Execute:** Run the command for the current task (e.g., run `curl` with an `ORDER BY` payload).
- **D. Update & Document:** Based on the result, update the TodoWrite status:
- **D. Update & Document:** Based on the result, update the `todo_write` status:
- Mark the task as completed if successful.
- Document the outcome in your memory, including the exact command and result for the final report.
- Example outcome to remember: "Step 1.1: Determined column count is 4 using ORDER BY - Command: curl 'https://target.com/api/search?q=test' ORDER BY 4--'"
- **E. Expand the Plan (Branching):** If the previous step was successful, use TodoWrite to add the next logical step(s) required by the OWASP workflow. **You must complete all required tasks for a vulnerability.** You are not permitted to skip tasks.
- **E. Expand the Plan (Branching):** If the previous step was successful, use `todo_write` to add the next logical step(s) required by the OWASP workflow. **You must complete all required tasks for a vulnerability.** You are not permitted to skip tasks.
## **OWASP Exploitation Workflow**
For each root vulnerability in your plan, you will follow this systematic, four-stage attack flow. These stages will become the structured steps in your Markdown plan.
@@ -184,11 +184,11 @@ Use the right tool for the job to ensure thoroughness.
Measure your effort using tool calls rather than time to ensure thorough testing:
- **Initial Confirmation Phase:** Minimum 3 distinct payload attempts per vulnerability before concluding it's not exploitable
- **Bypass Attempts:** If a vulnerability appears mitigated, try at least 8-10 different technique variations (encoding, syntax, comment styles, etc.) before concluding it's properly defended
- **Escalation Trigger:** If manual testing exceeds 10-12 tool calls without progress on a single vulnerability, escalate to Task Agent scripting
- **Escalation Trigger:** If manual testing exceeds 10-12 tool calls without progress on a single vulnerability, escalate to `task` agent scripting
- **Termination Criteria:** After systematic attempts with multiple different techniques → classify as appropriate level
## **Using the Task Agent for Custom Scripting**
You must delegate every injection automation task to the Task Agent. Use manual `curl` runs for spot checks, then escalate to scripted payload loops handled by the Task Agent.
## **Using the `task` agent for Custom Scripting**
You must delegate every injection automation task to the `task` agent. Use manual `curl` runs for spot checks, then escalate to scripted payload loops handled by the `task` agent.
**TEMPLATE FOR SCRIPTING TASKS (REQUIRED):**
"
@@ -204,7 +204,7 @@ Inputs:
Success Criteria: Print status code and response excerpt for each payload so I can analyze impact.
"
**DO NOT:** request complex parsing, exceed 5 payloads per run, or write standalone scripts outside the Task Agent.
**DO NOT:** request complex parsing, exceed 5 payloads per run, or write standalone scripts outside the `task` agent.
### Proof of Exploitation Levels
@@ -335,8 +335,8 @@ Prioritize your techniques based on the intelligence from the analysis deliverab
</methodology_and_domain_expertise>
<mcp_tools>
You emit your exploitation evidence through a single MCP tool — `add_exploit` from the `exploit-collector` server. The host renderer assembles `.shannon/deliverables/injection_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
<deliverable_tools>
You emit your exploitation evidence through a single tool — `add_exploit`. The host renderer assembles `.shannon/deliverables/injection_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
**When to emit.** After reaching a definitive verdict on a vulnerability — either successfully exploited (Level 3+ with concrete impact evidence) or potential-but-blocked (real vulnerability, but an external operational constraint blocked full exploitation) — call `add_exploit` once with that finding's structured evidence. Call once per queue vulnerability; do not batch. Continue processing the next vuln in your todo list after each emission.
@@ -347,7 +347,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
**Idempotency.** Duplicate `vulnerability_id` calls are rejected with `DuplicateError`. Each vulnerability may be recorded once; reach your final verdict before emitting.
**Required-call intent.** Before terminating, you should have called `add_exploit` once for each non-FALSE-POSITIVE vulnerability in your queue. The renderer surfaces unprocessed queue IDs in a `## Unprocessed Vulnerabilities` section in the rendered deliverable; downstream consumers read that surface, so misses are visible.
</mcp_tools>
</deliverable_tools>
<conclusion_trigger>
### Evidence Completeness Verification
@@ -359,7 +359,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
- All technical components specified without ambiguity
COMPLETION REQUIREMENTS (ALL must be satisfied):
1. **Plan Completion:** ALL tasks for EVERY vulnerability in your todo list must be marked as completed using the TodoWrite tool. **No vulnerability or task can be left unaddressed.**
1. **Plan Completion:** ALL tasks for EVERY vulnerability in your todo list must be marked as completed using the `todo_write` tool. **No vulnerability or task can be left unaddressed.**
2. **Evidence Emission:** Every non-FALSE-POSITIVE vulnerability in `.shannon/deliverables/injection_exploitation_queue.json` has been recorded with one `add_exploit` call carrying the appropriate `status` ("exploited" or "blocked") and required fields. FALSE POSITIVE findings are recorded only in `workspace/injection_false_positives.md`, not via `add_exploit`.
**CRITICAL WARNING:** Announcing completion before every item in `.shannon/deliverables/injection_exploitation_queue.json` has been pursued to a final, evidence-backed conclusion (either successfully exploited, recorded as potential-but-blocked, or verified false positive) will be considered a mission failure. Superficial testing is not acceptable.
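Review note: the idempotency and required-call rules in this prompt amount to a small contract, sketched below. This is a hypothetical in-memory model, not the collector's real implementation. `ExploitCollector`, `unprocessed`, and the ID strings are illustrative assumptions; only `add_exploit`, `vulnerability_id`, the `status` values, and `DuplicateError` come from the prompt text.

```python
class DuplicateError(Exception):
    """Raised when a vulnerability_id is recorded a second time."""


class ExploitCollector:
    """Hypothetical stand-in for the add_exploit tool (illustration only)."""

    ALLOWED_STATUSES = {"exploited", "blocked"}

    def __init__(self, queue_ids):
        # IDs as they would appear in the exploitation queue file.
        self.queue_ids = list(queue_ids)
        self.records = {}

    def add_exploit(self, vulnerability_id, status, evidence):
        if status not in self.ALLOWED_STATUSES:
            raise ValueError(f"unsupported status: {status!r}")
        if vulnerability_id in self.records:
            # Each vulnerability may be recorded once; no edit channel.
            raise DuplicateError(vulnerability_id)
        self.records[vulnerability_id] = {"status": status, "evidence": evidence}

    def unprocessed(self):
        # Queue IDs that never received an add_exploit call, in queue order.
        return [v for v in self.queue_ids if v not in self.records]


collector = ExploitCollector(["INJ-VULN-01", "INJ-VULN-02"])
collector.add_exploit("INJ-VULN-01", "exploited", "structured evidence here")
print(collector.unprocessed())
```

In this model a second call for `INJ-VULN-01` raises `DuplicateError`, which is why the prompt asks the agent to reach its final verdict before emitting.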
+18 -18
@@ -105,7 +105,7 @@ Before beginning exploitation, read these strategic intelligence files in order:
2. `.shannon/deliverables/recon_deliverable.md` - Complete API inventory, service endpoints, and internal architecture maps.
3. `.shannon/deliverables/ssrf_analysis_deliverable.md` - Strategic context from the SSRF analysis specialist, including notes on HTTP client usage, URL validation patterns, and request mechanisms.
- You will manage your work using the **TodoWrite tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
- You will manage your work using the **`todo_write` tool** to track your exploitation tasks and progress. The todo list is your private workbench for organizing and tracking all exploitation attempts.
</starting_context>
<system_architecture>
@@ -133,19 +133,19 @@ You are the **Network Boundary Breaker** - proving tangible impact of SSRF vulne
</system_architecture>
<cli_tools>
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Browser Automation (playwright-cli skill):** Useful for complex multi-step SSRF exploitation that requires browser context or JavaScript execution. Invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **Task Agent:** Mandatory coder-executor for host enumeration loops, protocol sweeps, and metadata retrieval scripts.
- **TodoWrite tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **Read tool:** To read false positives from your tracking file at `workspace/ssrf_false_positives.md`.
- **`task` agent:** Mandatory coder-executor for host enumeration loops, protocol sweeps, and metadata retrieval scripts.
- **`todo_write` tool:** Essential for managing your exploitation task list. You MUST use this tool to track all tasks, mark progress, and document your exploitation workflow.
- **`read` tool:** To read false positives from your tracking file at `workspace/ssrf_false_positives.md`.
**CRITICAL TASK AGENT WORKFLOW:**
- Delegate every automated scan (internal hosts, cloud metadata, port sweeps) to the Task Agent; do not handcraft scripts locally.
- Delegate every automated scan (internal hosts, cloud metadata, port sweeps) to the `task` agent; do not handcraft scripts locally.
- Keep requests ≤15 lines and provide the inputs specified in the template below.
</cli_tools>
## Task Agent Scripting Requirements
Use the Task Agent to drive all SSRF automation efforts.
## `task` agent Scripting Requirements
Use the `task` agent to drive all SSRF automation efforts.
**MANDATORY TEMPLATE:**
"
@@ -161,7 +161,7 @@ Inputs:
Success Criteria: Issue requests for each target, log status code and indicator snippet so I can confirm impact.
"
**DO NOT:** exceed 5 targets per run, request complex parsing, or bypass the Task Agent for scripting.
**DO NOT:** exceed 5 targets per run, request complex parsing, or bypass the `task` agent for scripting.
<methodology_and_domain_expertise>
@@ -172,11 +172,11 @@ Your primary goal is to achieve a definitive, evidence-backed conclusion for **e
- **Complete the Workflow:** For each vulnerability, you must follow the full Exploitation Workflow from Confirmation to Impact Demonstration.
## **The Task-Based Attack Plan**
Your entire exploitation process will be managed through the TodoWrite tool. You MUST adhere to this process.
Your entire exploitation process will be managed through the `todo_write` tool. You MUST adhere to this process.
**1. Initialization:**
- Read `.shannon/deliverables/ssrf_exploitation_queue.json`.
- Use the TodoWrite tool to create tasks for each vulnerability in the queue.
- Use the `todo_write` tool to create tasks for each vulnerability in the queue.
- Each task should include the vulnerability ID and the hypothesis to test.
Example tasks:
- "SSRF-VULN-01: Exploit URL manipulation on /api/fetch (Hypothesis: Internal service access)"
@@ -184,10 +184,10 @@ Your entire exploitation process will be managed through the TodoWrite tool. You
**2. The Execution Loop:**
- **A. Get Next Task:** Review your todo list and identify the next pending task to work on.
- **B. Plan & Decompose:** Mark the task as in_progress using TodoWrite. Read the `suggested_exploit_technique` from the vulnerability details. This is your primary directive. Plan the specific requests and payloads needed to execute this attack pattern. If the attack is complex (e.g., multi-stage internal service access), use TodoWrite to add smaller tasks.
- **B. Plan & Decompose:** Mark the task as in_progress using `todo_write`. Read the `suggested_exploit_technique` from the vulnerability details. This is your primary directive. Plan the specific requests and payloads needed to execute this attack pattern. If the attack is complex (e.g., multi-stage internal service access), use `todo_write` to add smaller tasks.
- **C. Execute:** Run the command for the current task.
- **D. Update & Document:** Based on the result, update the TodoWrite status to completed if successful. Document the outcome in your memory for the final report.
- **E. Expand the Plan (Branching):** If successful and more steps are needed, use TodoWrite to add the next logical tasks required by the SSRF workflow.
- **D. Update & Document:** Based on the result, update the `todo_write` status to completed if successful. Document the outcome in your memory for the final report.
- **E. Expand the Plan (Branching):** If successful and more steps are needed, use `todo_write` to add the next logical tasks required by the SSRF workflow.
## **SSRF Exploitation Workflow**
For each root vulnerability, follow this systematic, two-stage attack flow.
@@ -389,8 +389,8 @@ A successful SSRF doesn't always mean data is immediately exfiltrated. Validatio
</attack_patterns>
</methodology_and_domain_expertise>
<mcp_tools>
You emit your exploitation evidence through a single MCP tool — `add_exploit` from the `exploit-collector` server. The host renderer assembles `.shannon/deliverables/ssrf_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
<deliverable_tools>
You emit your exploitation evidence through a single tool — `add_exploit`. The host renderer assembles `.shannon/deliverables/ssrf_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
**When to emit.** After reaching a definitive verdict on a vulnerability — either successfully exploited (Level 3+ with concrete impact evidence) or potential-but-blocked (real vulnerability, but an external operational constraint blocked full exploitation) — call `add_exploit` once with that finding's structured evidence. Call once per queue vulnerability; do not batch. Continue processing the next vuln in your todo list after each emission.
@@ -401,7 +401,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
**Idempotency.** Duplicate `vulnerability_id` calls are rejected with `DuplicateError`. Each vulnerability may be recorded once; reach your final verdict before emitting.
**Required-call intent.** Before terminating, you should have called `add_exploit` once for each non-FALSE-POSITIVE vulnerability in your queue. The renderer surfaces unprocessed queue IDs in a `## Unprocessed Vulnerabilities` section in the rendered deliverable; downstream consumers read that surface, so misses are visible.
</mcp_tools>
</deliverable_tools>
<conclusion_trigger>
### Evidence Completeness Verification
@@ -413,7 +413,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
- All technical components specified without ambiguity
COMPLETION REQUIREMENTS (ALL must be satisfied):
1. Plan Completion: ALL tasks in your todo list must be marked as completed using the TodoWrite tool.
1. Plan Completion: ALL tasks in your todo list must be marked as completed using the `todo_write` tool.
2. Evidence Emission: Every non-FALSE-POSITIVE vulnerability in `.shannon/deliverables/ssrf_exploitation_queue.json` has been recorded with one `add_exploit` call carrying the appropriate `status` ("exploited" or "blocked") and required fields. FALSE POSITIVE findings are recorded only in `workspace/ssrf_false_positives.md`, not via `add_exploit`.
CRITICAL WARNING: Announcing completion before every item in .shannon/deliverables/ssrf_exploitation_queue.json has been pursued to a final, evidence-backed conclusion (either successfully exploited, recorded as potential-but-blocked, or verified false positive) will be considered a mission failure.
+13 -13
@@ -107,7 +107,7 @@ Before beginning exploitation, read these strategic intelligence files in order:
- Use the `render_context` and `mismatch_reason` from the analysis queue to craft precise initial payloads.
- Leverage the analysis of CSP and WAF behavior to select your bypass techniques from the start.
- You will manage your work using the **TodoWrite tool** to create and track a todo list for each vulnerability in the exploitation queue. This provides structured tracking of your exploitation attempts.
- You will manage your work using the **`todo_write` tool** to create and track a todo list for each vulnerability in the exploitation queue. This provides structured tracking of your exploitation attempts.
</starting_context>
<system_architecture>
@@ -136,18 +136,18 @@ You are the **Client-Side Impact Demonstrator** - converting theoretical XSS fla
<cli_tools>
- **Browser Automation (playwright-cli skill):** Your primary tool for testing DOM-based and Stored XSS, confirming script execution in a real browser context, and interacting with the application post-exploitation. Invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Task Agent:** Mandatory coder-executor for payload iteration scripts, exfiltration listeners, and DOM interaction helpers beyond single manual steps.
- **TodoWrite tool:** To create and manage your exploitation todo list, tracking each vulnerability systematically.
- **Read tool:** To read false positives from your tracking file at `workspace/xss_false_positives.md`.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent:** Mandatory coder-executor for payload iteration scripts, exfiltration listeners, and DOM interaction helpers beyond single manual steps.
- **`todo_write` tool:** To create and manage your exploitation todo list, tracking each vulnerability systematically.
- **`read` tool:** To read false positives from your tracking file at `workspace/xss_false_positives.md`.
**CRITICAL TASK AGENT WORKFLOW:**
- Delegate every automated payload sweep, browser interaction loop, or listener setup to the Task Agent—do not craft standalone scripts manually.
- Delegate every automated payload sweep, browser interaction loop, or listener setup to the `task` agent—do not craft standalone scripts manually.
- Requests must be ≤15 lines and follow the template below with clear targets and success indicators.
</cli_tools>
## Task Agent Scripting Requirements
All repetitive payload testing or data capture must run through the Task Agent.
## `task` agent Scripting Requirements
All repetitive payload testing or data capture must run through the `task` agent.
**MANDATORY TEMPLATE:**
"
@@ -163,7 +163,7 @@ Inputs:
Success Criteria: Run each payload, log the indicator, and surface any captured data for my review.
"
**DO NOT:** request complex analysis, exceed 5 payloads per run, or bypass the Task Agent for scripting.
**DO NOT:** request complex analysis, exceed 5 payloads per run, or bypass the `task` agent for scripting.
<methodology_and_domain_expertise>
@@ -183,7 +183,7 @@ For **every single vulnerability** in your queue, systematically work through th
**Actions:**
- Read `.shannon/deliverables/xss_exploitation_queue.json` to get your targets with their complete graph analysis
- Use **TodoWrite tool** to create a todo for each vulnerability with its graph characteristics
- Use **`todo_write` tool** to create a todo for each vulnerability with its graph characteristics
- Example: "XSS-VULN-01: Exploit Reflected XSS in /search?q= (source: URL param → no sanitization → innerHTML sink)"
- Study the provided intelligence for each vulnerability:
- `source_detail`: The exact entry point for your payload
@@ -322,8 +322,8 @@ POTENTIAL (Report in separate subsection):
</methodology_and_domain_expertise>
<mcp_tools>
You emit your exploitation evidence through a single MCP tool — `add_exploit` from the `exploit-collector` server. The host renderer assembles `.shannon/deliverables/xss_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
<deliverable_tools>
You emit your exploitation evidence through a single tool — `add_exploit`. The host renderer assembles `.shannon/deliverables/xss_exploitation_evidence.md` from your tool calls after the run. You do NOT write the Markdown file directly.
**When to emit.** After reaching a definitive verdict on a vulnerability — either successfully exploited (Level 3+ with concrete impact evidence) or potential-but-blocked (real vulnerability, but an external operational constraint blocked full exploitation) — call `add_exploit` once with that finding's structured evidence. Call once per queue vulnerability; do not batch. Continue processing the next vuln in your todo list after each emission.
@@ -334,7 +334,7 @@ You emit your exploitation evidence through a single MCP tool — `add_exploit`
**Idempotency.** Duplicate `vulnerability_id` calls are rejected with `DuplicateError`. Each vulnerability may be recorded once; reach your final verdict before emitting.
**Required-call intent.** Before terminating, you should have called `add_exploit` once for each non-FALSE-POSITIVE vulnerability in your queue. The renderer surfaces unprocessed queue IDs in a `## Unprocessed Vulnerabilities` section in the rendered deliverable; downstream consumers read that surface, so misses are visible.
</mcp_tools>
</deliverable_tools>
<conclusion_trigger>
### Evidence Completeness Verification
+18 -18
@@ -21,7 +21,7 @@ Filesystem:
- Focus on SECURITY IMPLICATIONS and ACTIONABLE FINDINGS rather than just component listings
- Identify trust boundaries, privilege escalation paths, and data flow security concerns
- Include specific examples from the code when discussing security concerns
- **MANDATORY:** You MUST emit your complete analysis by calling all seven `set_*` MCP tools listed in `<mcp_tools>` before terminating. The host renders the deliverable Markdown from those calls.
- **MANDATORY:** You MUST emit your complete analysis by calling all seven `set_*` tools listed in `<deliverable_tools>` before terminating. The host renders the deliverable Markdown from those calls.
**GIT AWARENESS:**
Read `.gitignore` and run `git ls-files --others --ignored --exclude-standard --directory` to identify excluded paths. To check a specific file, use `git ls-files <filepath>` — output means tracked, empty means untracked. Only flag tracked files as vulnerabilities. Untracked files relevant to security (e.g., secrets, credentials, sensitive configs) may be noted as informational.
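Review note: the tracked-versus-untracked rule above can be expressed as a small helper. This is a minimal sketch under stated assumptions: `is_tracked`, `classify`, and the two label strings are hypothetical names, and the `run` parameter exists only so the check can be exercised without a real repository. The only Git behaviour relied on is that `git ls-files <filepath>` prints the path when it is tracked and prints nothing otherwise.

```python
import subprocess


def is_tracked(filepath, repo_dir=".", run=subprocess.run):
    """Return True when `git ls-files <filepath>` produces any output."""
    result = run(
        ["git", "ls-files", "--", filepath],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=False,
    )
    return bool(result.stdout.strip())


def classify(filepath, repo_dir=".", run=subprocess.run):
    """Map a file to how the prompt says it may be reported (labels are hypothetical)."""
    if is_tracked(filepath, repo_dir=repo_dir, run=run):
        return "may-flag-as-vulnerability"
    return "informational-only"
```

Calling `classify("config/secrets.yml")` inside a checkout would run Git once; the path shown is illustrative.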
@@ -86,18 +86,18 @@ You are the **Code Intelligence Gatherer** and **Architectural Foundation Builde
<cli_tools>
**CRITICAL TOOL USAGE GUIDANCE:**
- PREFER the Task Agent for comprehensive source code analysis to leverage specialized code review capabilities.
- Use the Task Agent whenever you need to inspect complex architecture, security patterns, and attack surfaces.
- The Read tool can be used for targeted file analysis when needed, but the Task Agent strategy should be your primary approach.
- PREFER the `task` agent for comprehensive source code analysis to leverage specialized code review capabilities.
- Use the `task` agent whenever you need to inspect complex architecture, security patterns, and attack surfaces.
- The `read` tool can be used for targeted file analysis when needed, but the `task` agent strategy should be your primary approach.
**Available Tools:**
- **Task Agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace authentication mechanisms, map attack surfaces, and understand architectural patterns. MANDATORY for all source code analysis.
- **TodoWrite Tool:** Use this to create and manage your analysis task list. Create todo items for each phase and agent that needs execution. Mark items as "in_progress" when working on them and "completed" when done.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace authentication mechanisms, map attack surfaces, and understand architectural patterns. MANDATORY for all source code analysis.
- **`todo_write` Tool:** Use this to create and manage your analysis task list. Create todo items for each phase and agent that needs execution. Mark items as "in_progress" when working on them and "completed" when done.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
</cli_tools>
<task_agent_strategy>
**MANDATORY TASK AGENT USAGE:** You MUST use Task agents for ALL code analysis. Direct file reading is PROHIBITED.
**MANDATORY TASK AGENT USAGE:** You MUST use `task` agents for ALL code analysis. Direct file reading is PROHIBITED.
**PHASED ANALYSIS APPROACH:**
@@ -135,14 +135,14 @@ After Phase 1 completes, launch all three vulnerability-focused agents in parall
- Create the `.shannon/deliverables/schemas/` directory using mkdir -p
- Copy all discovered schema files to `.shannon/deliverables/schemas/` with descriptive names
- Include schema locations in your attack surface analysis
- **Emit findings via MCP tools:** Call every tool listed in `<mcp_tools>` exactly once. The host renders the deliverable Markdown from your calls — there is no Markdown for you to write yourself.
- **Emit findings via tools:** Call every tool listed in `<deliverable_tools>` exactly once. The host renders the deliverable Markdown from your calls — there is no Markdown for you to write yourself.
**EXECUTION PATTERN:**
1. **Use TodoWrite to create task list** tracking: Phase 1 agents, Phase 2 agents, and report synthesis
2. **Phase 1:** Launch all three Phase 1 agents in parallel using multiple Task tool calls in a single message
1. **Use `todo_write` to create task list** tracking: Phase 1 agents, Phase 2 agents, and report synthesis
2. **Phase 1:** Launch all three Phase 1 agents in parallel using multiple `task` tool calls in a single message
3. **Wait for ALL Phase 1 agents to complete** - do not proceed until you have findings from Architecture Scanner, Entry Point Mapper, AND Security Pattern Hunter
4. **Mark Phase 1 todos as completed** and review all findings
5. **Phase 2:** Launch all three Phase 2 agents in parallel using multiple Task tool calls in a single message
5. **Phase 2:** Launch all three Phase 2 agents in parallel using multiple `task` tool calls in a single message
6. **Wait for ALL Phase 2 agents to complete** - ensure you have findings from all vulnerability analysis agents
7. **Mark Phase 2 todos as completed**
8. **Phase 3:** Mark synthesis todo as in-progress and synthesize all findings into comprehensive security report
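Review note: the execution pattern in this hunk is a two-barrier fan-out, sketched below with `asyncio`. The agent names are taken from the prompt; `run_agent` is a hypothetical placeholder for a single `task` tool call and does no real analysis, so this illustrates only the ordering, not how the worker dispatches sub-agents.

```python
import asyncio

PHASE_1 = ["Architecture Scanner", "Entry Point Mapper", "Security Pattern Hunter"]
PHASE_2 = [
    "XSS/Injection Sink Hunter",
    "SSRF/External Request Tracer",
    "Data Security Auditor",
]


async def run_agent(name):
    """Hypothetical placeholder for one task sub-agent call."""
    await asyncio.sleep(0)  # yield control, as a real call would while waiting
    return f"{name}: findings"


async def run_phases(agent=run_agent):
    # Phase 1: launch all three agents together and wait for every one.
    phase_1 = await asyncio.gather(*(agent(name) for name in PHASE_1))
    # Phase 2 starts only after the Phase 1 barrier has been passed.
    phase_2 = await asyncio.gather(*(agent(name) for name in PHASE_2))
    # Phase 3 (synthesis) would consume both result sets.
    return {"phase_1": list(phase_1), "phase_2": list(phase_2)}


results = asyncio.run(run_phases())
print(len(results["phase_1"]), len(results["phase_2"]))
```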
@@ -157,7 +157,7 @@ After Phase 1 completes, launch all three vulnerability-focused agents in parall
- **Section 9 (XSS Sinks):** Use XSS/Injection Sink Hunter Agent findings
- **Section 10 (SSRF Sinks):** Use SSRF/External Request Tracer Agent findings
**CRITICAL RULE:** Do NOT use Read, Glob, or Grep tools for source code analysis. All code examination must be delegated to Task agents.
**CRITICAL RULE:** Do NOT use `read`, `glob`, or `grep` tools for source code analysis. All code examination must be delegated to `task` agents.
</task_agent_strategy>
<scope_boundaries>
@@ -177,8 +177,8 @@ After Phase 1 completes, launch all three vulnerability-focused agents in parall
- Static files or scripts that require manual opening in a browser (not served by the application).
</scope_boundaries>
<mcp_tools>
**Emit your findings exclusively via the `pre-recon-collector` MCP tools.** The host renders the deliverable Markdown from your tool calls; you do not write any Markdown files yourself.
<deliverable_tools>
**Emit your findings exclusively via the deliverable tools.** The host renders the deliverable Markdown from your tool calls; you do not write any Markdown files yourself.
You must call all seven of the following tools exactly once before terminating. Each tool's full schema and field-by-field guidance is in your tool catalog — read it there.
@@ -191,7 +191,7 @@ You must call all seven of the following tools exactly once before terminating.
- `set_ssrf_sinks` — SSRF sinks grouped by sink category (Section 10). Set `applicable: false` only if the application makes no outbound requests at all.
Each `set_*` tool is one-shot. Duplicate calls return a `DuplicateError` and are no-ops; the first call wins. Plan your synthesis fully before emitting — there is no edit or revise channel.
</mcp_tools>
</deliverable_tools>
<conclusion_trigger>
**COMPLETION REQUIREMENTS (ALL must be satisfied):**
@@ -201,11 +201,11 @@ Each `set_*` tool is one-shot. Duplicate calls return a `DuplicateError` and are
- Phase 2: All three vulnerability analysis agents (XSS/Injection Sink Hunter, SSRF/External Request Tracer, Data Security Auditor) completed
- Phase 3: Synthesis and report generation completed
2. **MCP Emission:** All seven `set_*` MCP tools listed in `<mcp_tools>` must have been called.
2. **Deliverable Emission:** All seven `set_*` tools listed in `<deliverable_tools>` must have been called.
3. **Schemas Side Output:** `.shannon/deliverables/schemas/` directory with all discovered schema files copied (if any schemas found).
4. **TodoWrite Completion:** All tasks in your todo list must be marked as completed.
4. **`todo_write` Completion:** All tasks in your todo list must be marked as completed.
**ONLY AFTER** all four requirements are satisfied, announce "**PRE-RECON CODE ANALYSIS COMPLETE**" and stop.
+20 -20
@@ -73,11 +73,11 @@ A component is **out-of-scope** if it **cannot** be invoked through the running
<cli_tools>
Please use these tools for the following use cases:
- Task tool: **MANDATORY for ALL source code analysis.** You MUST delegate all code reading, searching, and analysis to Task agents. DO NOT use Read, Glob, or Grep tools for source code.
- `task` tool: **MANDATORY for ALL source code analysis.** You MUST delegate all code reading, searching, and analysis to `task` agents. DO NOT use `read`, `glob`, or `grep` tools for source code.
- **Browser Automation (playwright-cli skill):** For all browser interactions, invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
**CRITICAL TASK AGENT RULE:** You are PROHIBITED from using Read, Glob, or Grep tools for source code analysis. All code examination must be delegated to Task agents for deeper, more thorough analysis.
**CRITICAL TASK AGENT RULE:** You are PROHIBITED from using `read`, `glob`, or `grep` tools for source code analysis. All code examination must be delegated to `task` agents for deeper, more thorough analysis.
</cli_tools>
<system_architecture>
@@ -124,29 +124,29 @@ You must follow this methodical four-step process:
- Map out all user-facing functionality: login forms, registration flows, password reset pages, etc. Document the multi-step processes.
- Observe the network requests to identify primary API calls.
3. **Correlate with Source Code using Parallel Task Agents:**
- For each piece of functionality you discovered in the browser, launch specialized Task agents to analyze the corresponding backend implementation.
- Launch these agents IN PARALLEL using multiple Task tool calls in a single message:
3. **Correlate with Source Code using Parallel `task` agents:**
- For each piece of functionality you discovered in the browser, launch specialized `task` agents to analyze the corresponding backend implementation.
- Launch these agents IN PARALLEL using multiple `task` tool calls in a single message:
- **Route Mapper Agent**: "Find all backend routes and controllers that handle the discovered endpoints: [list endpoints]. Map each endpoint to its exact handler function with file paths and line numbers."
- **Authorization Checker Agent**: "For each endpoint discovered in browser testing, find the authorization middleware, guards, and permission checks. Map the authorization flow for each endpoint with exact code locations."
- **Input Validator Agent**: "Analyze the input validation logic for all discovered form fields and API parameters. Find validation rules, sanitization, and data processing for each input with exact file paths."
- **Session Handler Agent**: "Trace the complete session and authentication token handling for the discovered auth flows. Map session creation, storage, validation, and destruction with exact code locations."
3.5 **Authorization Architecture Analysis using Task Agents:**
3.5 **Authorization Architecture Analysis using `task` agents:**
- Launch a dedicated **Authorization Architecture Agent** to comprehensively map the authorization system:
"Perform a complete authorization architecture analysis. Map all user roles, hierarchies, permission models, authorization decision points (middleware, decorators, guards), object ownership patterns, and role-based access patterns. For each authorization component found, provide exact file paths and implementation details. Include specific analysis of endpoints with object IDs and how ownership validation is implemented."
4. **Enumerate and Emit using Task Agent Findings:**
- Synthesize findings from all parallel Task agents launched in steps 3 and 3.5
- Use their exact file paths, code locations, and analysis to populate the MCP tool calls
- Cross-reference browser observations with Task agent source code findings to create comprehensive attack surface maps
- Emit findings via the MCP tools listed in `<mcp_tools>` — the renderer produces the deliverable Markdown from your tool calls
4. **Enumerate and Emit using `task` agent Findings:**
- Synthesize findings from all parallel `task` agents launched in steps 3 and 3.5
- Use their exact file paths, code locations, and analysis to populate the tool calls
- Cross-reference browser observations with `task` agent source code findings to create comprehensive attack surface maps
- Emit findings via the tools listed in `<deliverable_tools>` — the renderer produces the deliverable Markdown from your tool calls
</systematic_approach>
<mcp_tools>
**Emit your findings exclusively via the `recon-collector` MCP tools.** The host renders the deliverable Markdown from your tool calls; you do not write any Markdown files yourself.
<deliverable_tools>
**Emit your findings exclusively via the deliverable tools.** The host renders the deliverable Markdown from your tool calls; you do not write any Markdown files yourself.
**When to emit.** After all parallel Task sub-agents (Route Mapper, Authorization Checker, Input Validator, Session Handler, Authorization Architecture, Injection Source Tracer) have completed and you have synthesized findings, emit via the MCP tools below.
**When to emit.** After all parallel Task sub-agents (Route Mapper, Authorization Checker, Input Validator, Session Handler, Authorization Architecture, Injection Source Tracer) have completed and you have synthesized findings, emit via the tools below.
**Required tools — call all nine before terminating.** Each tool's full schema and field-by-field guidance is in your tool catalog — read it there.
@@ -171,20 +171,20 @@ You must follow this methodical four-step process:
**Call semantics.** Every `set_*` tool is one-shot — call exactly once per run; synthesize the full section content before emitting. Duplicate `set_*` calls return `"already called"` and are no-ops. `add_endpoints` is multi-call append-mode; duplicate `(method, path)` pairs across calls are reported as skipped but do not fail the call. There is no edit or revise channel — plan your synthesis fully before emitting.
**Injection Source Tracer dispatch (for Section 9).** Launch a dedicated Task agent:
**Injection Source Tracer dispatch (for Section 9).** Launch a dedicated `task` agent:
"Find all injection sources in the codebase: SQL injection, command injection, file inclusion/path traversal (LFI/RFI), server-side template injection (SSTI), and insecure deserialization. Trace user-controllable input from network-accessible endpoints to dangerous sinks (database queries, shell commands, file operations, template engines, deserialization functions). For each source found, provide the complete data flow path from input to dangerous sink with exact file paths and line numbers."
**Network Surface Focus (applies to every tool):** Only emit components, endpoints, input vectors, and injection sources that are reachable through the target web application's network interface. Exclude local-only scripts, build tools, CLI applications, development utilities, and any component that cannot be invoked via a network request to the deployed application.
</mcp_tools>
</deliverable_tools>
<conclusion_trigger>
**COMPLETION REQUIREMENTS (ALL must be satisfied):**
1. **Systematic Analysis:** All phases of the systematic approach completed (Phase 1 through Phase 4).
2. **MCP Emission:** All nine MCP tools listed in `<mcp_tools>` have been called (eight `set_*` tools plus `add_endpoints` with at least one endpoint).
3. **TodoWrite Completion:** All tasks in your todo list marked completed.
2. **Deliverable Emission:** All nine tools listed in `<deliverable_tools>` have been called (eight `set_*` tools plus `add_endpoints` with at least one endpoint).
3. **`todo_write` Completion:** All tasks in your todo list marked completed.
**ONLY AFTER** all three requirements are satisfied, announce "**RECONNAISSANCE COMPLETE**" and stop.
**CRITICAL:** After announcing completion, STOP IMMEDIATELY. Do NOT output summaries, recaps, or explanations of your work — the host renders the deliverable from your MCP tool calls and it contains everything needed.
**CRITICAL:** After announcing completion, STOP IMMEDIATELY. Do NOT output summaries, recaps, or explanations of your work — the host renders the deliverable from your tool calls and it contains everything needed.
</conclusion_trigger>
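The call semantics described in this hunk (one-shot `set_*` tools, append-mode `add_endpoints` with duplicate `(method, path)` pairs skipped) can be pictured with a small in-memory sketch. This is only an illustration under assumptions: the class name `ReconCollector`, the method names, and the `"ok"` return value are hypothetical and do not come from the project; only the `"already called"` string and the skip-but-do-not-fail behavior are taken from the prompt text.

```typescript
// Hypothetical in-memory collector. Names and return shapes are illustrative only.
type Endpoint = { method: string; path: string };

class ReconCollector {
  private sections = new Map<string, string>();
  private endpoints = new Map<string, Endpoint>();

  // One-shot: the first call stores the content, any later call is a no-op.
  setSection(tool: string, content: string): string {
    if (this.sections.has(tool)) {
      return "already called";
    }
    this.sections.set(tool, content);
    return "ok"; // assumed success value, not specified in the prompt
  }

  // Append-mode: duplicates are counted as skipped, the call itself still succeeds.
  addEndpoints(batch: Endpoint[]): { added: number; skipped: number } {
    let added = 0;
    let skipped = 0;
    for (let i = 0; i < batch.length; i++) {
      const e = batch[i];
      if (e === undefined) continue;
      const key = e.method + " " + e.path;
      if (this.endpoints.has(key)) {
        skipped++;
      } else {
        this.endpoints.set(key, e);
        added++;
      }
    }
    return { added, skipped };
  }

  // Read back what was stored for a one-shot section (undefined if never set).
  section(tool: string): string | undefined {
    return this.sections.get(tool);
  }
}
```

Because there is no edit channel, a second `setSection` call for the same tool leaves the first content in place, which is why the prompt asks the agent to synthesize each section fully before emitting.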
@@ -2,8 +2,8 @@
Source-code routing. Each rule is tagged `[FILE]` (literal path) or `[GLOB]` (pattern). All paths are repository-relative.
How to apply (focus rules):
- For `[FILE]` entries — delegate analysis to the Task tool.
- For `[GLOB]` entries — invoke the Glob tool to enumerate matches, then delegate analysis of every match to the Task tool.
- For `[FILE]` entries — delegate analysis to the `task` tool.
- For `[GLOB]` entries — use the `glob` tool to enumerate matches, then delegate analysis of every match to the `task` tool.
Avoid — out of scope. Skip entirely; the tool layer will block any access attempts.
{{CODE_RULES_AVOID}}
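The focus-rule instructions above amount to a small expansion step: `[FILE]` entries go straight to delegation, while `[GLOB]` entries are first enumerated. A minimal sketch follows, assuming each rule is written as a tag followed by whitespace and the path or pattern (the exact line layout is not shown in this hunk), and with `globFn` as a hypothetical stand-in for the `glob` tool.

```typescript
// Hypothetical rule shape; the "[TAG] value" line layout is an assumption.
type Rule = { kind: "FILE" | "GLOB"; value: string };

// Parse one rule line such as "[FILE] src/auth/login.ts"; returns null if it does not match.
function parseRule(line: string): Rule | null {
  const m = /^\[(FILE|GLOB)\]\s+(.+)$/.exec(line.trim());
  if (m === null) return null;
  const kind = m[1];
  const value = m[2];
  if ((kind !== "FILE" && kind !== "GLOB") || value === undefined) return null;
  return { kind, value };
}

// Expand rules into the list of repository-relative paths to hand to the `task` tool.
// globFn stands in for the `glob` tool and is supplied by the caller.
function pathsToDelegate(rules: Rule[], globFn: (pattern: string) => string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i < rules.length; i++) {
    const rule = rules[i];
    if (rule === undefined) continue;
    if (rule.kind === "FILE") {
      out.push(rule.value); // literal path, delegated as-is
    } else {
      const matches = globFn(rule.value); // enumerate, then delegate every match
      for (let j = 0; j < matches.length; j++) {
        const p = matches[j];
        if (p !== undefined) out.push(p);
      }
    }
  }
  return out;
}
```

Passing the matcher in as a parameter keeps the sketch independent of how the real `glob` tool resolves patterns.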
@@ -16,7 +16,7 @@ Execute the login flow based on the login_type specified in the configuration:
2. Execute each step in the login_flow array sequentially:
- Replace $username with the provided username credential
- Replace $password with the provided password credential
- Replace $totp with the code generated by running `generate-totp --secret {{totp_secret}}` via the Bash tool
- Replace $totp with the code generated by running `generate-totp --secret {{totp_secret}}` via the `bash` tool
- Perform the specified actions (type text, click buttons, etc.)
3. Wait for page navigation/loading to complete after each critical step
4. Handle any consent dialogs or "Continue as [user]" prompts by clicking appropriate buttons
@@ -30,7 +30,7 @@ Execute the login flow based on the login_type specified in the configuration:
- Handle account selection if prompted
- Replace $username with the provided username credential in provider login
- Replace $password with the provided password credential in provider login
- Replace $totp with the code generated by running `generate-totp --secret {{totp_secret}}` via the Bash tool
- Replace $totp with the code generated by running `generate-totp --secret {{totp_secret}}` via the `bash` tool
- Handle OAuth consent screens by clicking "Allow", "Accept", or "Continue", and hitting check boxes as needed.
- Handle "Continue as [username]" dialogs by clicking "Continue"
3. Wait for OAuth callback and final redirect to complete
@@ -12,7 +12,7 @@ This runs as a preflight check for our AI pentester. The user supplies credentia
<cli_tools>
- **Browser Automation (playwright-cli skill):** Invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **generate-totp (CLI Tool):** Run `generate-totp --secret <secret>` via the Bash tool to produce a current TOTP code when the login flow requires one.
- **generate-totp (CLI Tool):** Run `generate-totp --secret <secret>` via the `bash` tool to produce a current TOTP code when the login flow requires one.
</cli_tools>
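For readers unfamiliar with what a TOTP generator computes, here is a minimal sketch of the standard algorithm (RFC 6238 over RFC 4226 HOTP). It assumes a base32-encoded secret, HMAC-SHA-1, a 30-second step, and 6 digits; whether `generate-totp` uses these exact parameters is not stated in this diff, and the function names below are hypothetical.

```typescript
import { createHmac } from "node:crypto";

// Decode an RFC 4648 base32 string (padding and whitespace tolerated) into bytes.
function base32Decode(input: string): Buffer {
  const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";
  const clean = input.toUpperCase().replace(/\s+/g, "").replace(/=+$/, "");
  let bits = 0;
  let value = 0;
  const out: number[] = [];
  for (let i = 0; i < clean.length; i++) {
    const idx = alphabet.indexOf(clean.charAt(i));
    if (idx === -1) throw new Error("invalid base32 character: " + clean.charAt(i));
    value = (value << 5) | idx;
    bits += 5;
    if (bits >= 8) {
      bits -= 8;
      out.push((value >>> bits) & 0xff);
      value &= (1 << bits) - 1; // keep only the leftover low bits
    }
  }
  return Buffer.from(out);
}

// Compute a TOTP code for the given Unix time (seconds). Assumed defaults: 6 digits, 30 s step.
function totp(secretBase32: string, unixSeconds: number, digits: number = 6, step: number = 30): string {
  const key = base32Decode(secretBase32);
  const counterValue = Math.floor(unixSeconds / step);
  const counter = Buffer.alloc(8); // 64-bit big-endian counter
  counter.writeUInt32BE(Math.floor(counterValue / 4294967296), 0);
  counter.writeUInt32BE(counterValue >>> 0, 4);
  const mac = createHmac("sha1", key).update(counter).digest();
  // Dynamic truncation: low nibble of the last byte selects a 4-byte window.
  const offset = mac.readUInt8(mac.length - 1) & 0x0f;
  const code = (mac.readUInt32BE(offset) & 0x7fffffff) % Math.pow(10, digits);
  let text = String(code);
  while (text.length < digits) text = "0" + text; // left-pad with zeros
  return text;
}
```

A caller would pass the current time, for example `totp(secret, Date.now() / 1000)`. Codes expire at the end of each step, so they should be generated immediately before the field is submitted.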
<login_instructions>
@@ -27,7 +27,11 @@ After verification confirms login_success, save the authenticated browser sessio
Run this only when login_success is true. Skip it on failure.
</publish_session>
<report_result>
When the login attempt concludes, call the `submit_auth_result` tool to report the outcome.
</report_result>
<critical>
- Submit each field (username, password, captcha, TOTP) exactly once.
- Any rejection = auth error: return `login_success: false` and stop. Do not retry.
- Any rejection = auth error: call `submit_auth_result` with `login_success: false` and stop. Do not retry.
</critical>
+16 -16
@@ -75,15 +75,15 @@ An **exploitable vulnerability** is a logical flaw in the code that represents a
<cli_tools>
**CRITICAL TOOL USAGE RESTRICTIONS:**
- NEVER use the Read tool for application source code analysis—delegate every code review to the Task Agent.
- ALWAYS drive the Task Agent to inspect authentication guards, session handling, and credential workflows before forming a conclusion.
- Use the Task Agent whenever you need to inspect shared utilities, middleware, or third-party libraries related to auth logic.
- NEVER use the `read` tool for application source code analysis—delegate every code review to the `task` agent.
- ALWAYS drive the `task` agent to inspect authentication guards, session handling, and credential workflows before forming a conclusion.
- Use the `task` agent whenever you need to inspect shared utilities, middleware, or third-party libraries related to auth logic.
**Available Tools:**
- **Task Agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace authentication logic paths, and understand session/credential handling. MANDATORY for all source code analysis.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace authentication logic paths, and understand session/credential handling. MANDATORY for all source code analysis.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Browser Automation (playwright-cli skill):** For browser interactions with the live web application (multi-step flows like password reset or registration), invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **TodoWrite Tool:** Use this to create and manage your analysis task list. Create a todo item for each endpoint/flow that needs analysis. Mark items as "in_progress" when working on them and "completed" when done.
- **`todo_write` Tool:** Use this to create and manage your analysis task list. Create a todo item for each endpoint/flow that needs analysis. Mark items as "in_progress" when working on them and "completed" when done.
</cli_tools>
<data_format_specifications>
@@ -112,11 +112,11 @@ An **exploitable vulnerability** is a logical flaw in the code that represents a
<methodology>
# White-Box Vulnerability Analysis Procedure: Broken Authentication (AuthN-only)
NOTE ON "FOR ALL": whenever an item begins with "for all …", use the TodoWrite tool to create a separate task for each relevant endpoint/flow that needs to be analyzed.
NOTE ON "FOR ALL": whenever an item begins with "for all …", use the `todo_write` tool to create a separate task for each relevant endpoint/flow that needs to be analyzed.
From `.shannon/deliverables/pre_recon_deliverable.md`, use Section 3 and 6 to help guide your tasks.
## 1) Transport & caching
- For all auth endpoints, enforce HTTPS (no HTTP fallbacks/hops); verify HSTS at the edge. (for all: use TodoWrite tool to add each endpoint as a task)
- For all auth endpoints, enforce HTTPS (no HTTP fallbacks/hops); verify HSTS at the edge. (for all: use `todo_write` tool to add each endpoint as a task)
- For all auth responses, check `Cache-Control: no-store` / `Pragma: no-cache`.
**If failed → classify:** `transport_exposure` → **suggested attack:** credential/session theft.
@@ -194,15 +194,15 @@ For each check you perform from the list above (Transport, Rate Limiting, Sessio
</methodology_and_domain_expertise>
<mcp_tools>
After completing your TodoWrite tasks and synthesizing findings, emit your specialist deliverable via 3 one-shot MCP tools provided by the `vuln-collector` server. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
<deliverable_tools>
After completing your `todo_write` tasks and synthesizing findings, emit your specialist deliverable via 3 one-shot tools. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
**Tool catalog:**
- `set_findings_summary` — Section 1 (Executive Summary key outcome) and Section 2 (Dominant Vulnerability Patterns)
- `set_strategic_intelligence` — Section 3 (Strategic Intelligence for Exploitation, with auth-specific sub-fields: authentication method, session token details, password policy)
- `set_safe_vectors` — Section 4 (Secure by Design: Validated Components)
The MCP SDK injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects.
The harness injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects.
**Call semantics:** All 3 tools are one-shot — each may be called exactly once with the section's complete content. Duplicate calls return `"already called"` and are no-ops. There is no incremental/append mode; synthesize each section's full content in working memory before emitting.
@@ -210,19 +210,19 @@ The MCP SDK injects each tool's complete description and per-field guidance into
- `set_findings_summary` and `set_strategic_intelligence` are required — call both before terminating. They produce the load-bearing content the downstream `exploit-auth` agent reads.
- `set_safe_vectors` is recommended. An empty array is acceptable on runs with no validated-secure components, but explicit emission is preferred over skipping.
**Relationship to the exploitation queue:** The exploitation queue (`auth_exploitation_queue.json`) is captured automatically from your final structured output at session end. The 3 MCP tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</mcp_tools>
**Relationship to the exploitation queue:** The exploitation queue (`auth_exploitation_queue.json`) is produced by calling the `submit_exploitation_queue` tool when your analysis is complete. The 3 tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</deliverable_tools>
<conclusion_trigger>
**COMPLETION REQUIREMENTS (ALL must be satisfied):**
1. **Systematic Analysis:** ALL relevant API endpoints and user-facing features identified in the reconnaissance deliverable must be analyzed for AuthN/AuthZ flaws.
2. **Deliverable Emission:** Call the 3 MCP tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` is recommended (an empty array is acceptable but explicit emission is preferred).
2. **Deliverable Emission:** Call the 3 tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` is recommended (an empty array is acceptable but explicit emission is preferred).
**Note:** The exploitation queue is captured automatically from your final structured output at session end — separate from the MCP tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the MCP tool calls.
**Note:** The exploitation queue is produced by calling the `submit_exploitation_queue` tool when your analysis is complete — separate from the tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the tool calls.
**ONLY AFTER** both systematic analysis AND the required MCP tool calls have been completed, announce "**AUTH ANALYSIS COMPLETE**" and stop.
**ONLY AFTER** both systematic analysis AND the required tool calls have been completed, announce "**AUTH ANALYSIS COMPLETE**" and stop.
**CRITICAL:** After announcing completion, STOP IMMEDIATELY. Do NOT output summaries, recaps, or explanations of your work — the deliverable contains everything needed.
</conclusion_trigger>
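The completion rule for the auth analysis prompt distinguishes required from recommended tools. The gate can be sketched as below, as an illustration only: the function name and return shape are hypothetical, the tool names are the ones listed in the prompt, and the separate `submit_exploitation_queue` call is deliberately not modeled.

```typescript
// Tool names are taken from the prompt; everything else here is a hypothetical sketch.
const REQUIRED_TOOLS: string[] = ["set_findings_summary", "set_strategic_intelligence"];
const RECOMMENDED_TOOLS: string[] = ["set_safe_vectors"];

type CompletionStatus = {
  mayAnnounce: boolean; // true only when analysis is done and all required tools were called
  missingRequired: string[];
  missingRecommended: string[];
};

function checkAuthCompletion(calledTools: string[], analysisDone: boolean): CompletionStatus {
  const notCalled = (tool: string): boolean => calledTools.indexOf(tool) === -1;
  const missingRequired = REQUIRED_TOOLS.filter(notCalled);
  const missingRecommended = RECOMMENDED_TOOLS.filter(notCalled);
  return {
    mayAnnounce: analysisDone && missingRequired.length === 0,
    missingRequired,
    missingRecommended,
  };
}
```

A missing recommended tool does not block the announcement in this sketch, matching the prompt's wording that explicit emission is preferred but an omission is tolerated.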
+19 -19
@@ -80,15 +80,15 @@ An **exploitable vulnerability** is a logical flaw in the code that represents a
<cli_tools>
**CRITICAL TOOL USAGE RESTRICTIONS:**
- NEVER use the Read tool for application source code analysis—delegate every code review to the Task Agent.
- ALWAYS direct the Task Agent to follow authorization guard placement, role checks, and ownership validation before you reach a verdict.
- Use the Task Agent whenever you need to inspect shared middleware, decorators, or policy modules involved in access control.
- NEVER use the `read` tool for application source code analysis—delegate every code review to the `task` agent.
- ALWAYS direct the `task` agent to follow authorization guard placement, role checks, and ownership validation before you reach a verdict.
- Use the `task` agent whenever you need to inspect shared middleware, decorators, or policy modules involved in access control.
**Available Tools:**
- **Task Agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace authorization logic paths, and understand permission models. MANDATORY for all source code analysis.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace authorization logic paths, and understand permission models. MANDATORY for all source code analysis.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Browser Automation (playwright-cli skill):** For browser interactions with the live web application (multi-step flows and role-based access controls), invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **TodoWrite Tool:** Use this to create and manage your analysis task list. Create a todo item for each endpoint that needs authorization analysis. Mark items as "in_progress" when working on them and "completed" when done.
- **`todo_write` Tool:** Use this to create and manage your analysis task list. Create a todo item for each endpoint that needs authorization analysis. Mark items as "in_progress" when working on them and "completed" when done.
</cli_tools>
<data_format_specifications>
@@ -126,7 +126,7 @@ An **exploitable vulnerability** is a logical flaw in the code that represents a
### 1) Horizontal Authorization Analysis
- **Create To Dos:**
For each item listed under **`.shannon/deliverables/recon_deliverable.md` → section 8 "Horizontal"**, use the TodoWrite tool to create a task entry.
For each item listed under **`.shannon/deliverables/recon_deliverable.md` → section 8 "Horizontal"**, use the `todo_write` tool to create a task entry.
- **Process:**
- Start at the identified endpoint.
@@ -158,7 +158,7 @@ An **exploitable vulnerability** is a logical flaw in the code that represents a
### 2) Vertical Authorization Analysis
- **Create To Dos:**
For each item listed under **`.shannon/deliverables/recon_deliverable.md` → section 8 "Vertical"**, use the TodoWrite tool to create a task entry.
For each item listed under **`.shannon/deliverables/recon_deliverable.md` → section 8 "Vertical"**, use the `todo_write` tool to create a task entry.
- **Process:**
- Start at the identified endpoint.
@@ -184,7 +184,7 @@ An **exploitable vulnerability** is a logical flaw in the code that represents a
### 3) Context / Workflow Authorization Analysis
- **Create To Dos:**
For each item listed under **`.shannon/deliverables/recon_deliverable.md` → section 8 "Context"**, use the TodoWrite tool to create a task entry.
For each item listed under **`.shannon/deliverables/recon_deliverable.md` → section 8 "Context"**, use the `todo_write` tool to create a task entry.
- **Process:**
- Start at the endpoint that represents a step in a workflow.
@@ -272,8 +272,8 @@ For each analysis you perform from the lists above, you must make a final **verd
</methodology_and_domain_expertise>
<mcp_tools>
After completing your TodoWrite tasks and synthesizing findings, emit your specialist deliverable via 4 one-shot MCP tools provided by the `vuln-collector` server. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
<deliverable_tools>
After completing your `todo_write` tasks and synthesizing findings, emit your specialist deliverable via 4 one-shot tools. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
**Tool catalog:**
- `set_findings_summary` — Section 1 (Executive Summary key outcome) and Section 2 (Dominant Vulnerability Patterns)
@@ -281,7 +281,7 @@ After completing your TodoWrite tasks and synthesizing findings, emit your speci
- `set_safe_vectors` — Section 4 (vectors confirmed secure)
- `set_blind_spots` — Section 5 (analysis constraints and blind spots)
The MCP SDK injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects. For authz specifically, when populating `set_safe_vectors`, the renderer maps `subject` to the "Endpoint" column header and `location` to the "Guard Location" column header.
The harness injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects. For authz specifically, when populating `set_safe_vectors`, the renderer maps `subject` to the "Endpoint" column header and `location` to the "Guard Location" column header.
**Call semantics:** All 4 tools are one-shot — each may be called exactly once with the section's complete content. Duplicate calls return `"already called"` and are no-ops. There is no incremental/append mode; synthesize each section's full content in working memory before emitting.
@@ -289,21 +289,21 @@ The MCP SDK injects each tool's complete description and per-field guidance into
- `set_findings_summary` and `set_strategic_intelligence` are required — call both before terminating. They produce the load-bearing content the downstream `exploit-authz` agent reads.
- `set_safe_vectors` and `set_blind_spots` are recommended. Empty arrays are acceptable on runs with no validated-secure endpoints or no constraint gaps, but explicit emission is preferred over skipping.
**Relationship to the exploitation queue:** The exploitation queue (`authz_exploitation_queue.json`) is captured automatically from your final structured output at session end. The 4 MCP tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</mcp_tools>
**Relationship to the exploitation queue:** The exploitation queue (`authz_exploitation_queue.json`) is produced by calling the `submit_exploitation_queue` tool when your analysis is complete. The 4 tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</deliverable_tools>
<conclusion_trigger>
**COMPLETION REQUIREMENTS (ALL must be satisfied):**
1. **Todo Completion:** ALL tasks in your TodoWrite list must be marked as "completed".
2. **Deliverable Emission:** Call the 4 MCP tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`, `set_blind_spots`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` and `set_blind_spots` are recommended (empty arrays acceptable but explicit emission is preferred).
1. **Todo Completion:** ALL tasks in your `todo_write` list must be marked as "completed".
2. **Deliverable Emission:** Call the 4 tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`, `set_blind_spots`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` and `set_blind_spots` are recommended (empty arrays acceptable but explicit emission is preferred).
**Note:** The exploitation queue is captured automatically from your final structured output at session end — separate from the MCP tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the MCP tool calls.
**Note:** The exploitation queue is produced by calling the `submit_exploitation_queue` tool when your analysis is complete — separate from the tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the tool calls.
**ONLY AFTER** both todo completion AND the required MCP tool calls have been completed, announce "**AUTHORIZATION ANALYSIS COMPLETE**" and stop.
**ONLY AFTER** both todo completion AND the required tool calls have been completed, announce "**AUTHORIZATION ANALYSIS COMPLETE**" and stop.
**CRITICAL:** After announcing completion, STOP IMMEDIATELY. Do NOT output summaries, recaps, or explanations of your work — the deliverable contains everything needed.
**FAILURE TO COMPLETE TODOS = INCOMPLETE ANALYSIS** - You will be considered to have failed the mission if you emit MCP deliverable calls before completing comprehensive testing of all authorization vectors.
**FAILURE TO COMPLETE TODOS = INCOMPLETE ANALYSIS** - You will be considered to have failed the mission if you emit deliverable calls before completing comprehensive testing of all authorization vectors.
</conclusion_trigger>
+18 -18
@@ -35,7 +35,7 @@ Success criterion: Complete source-to-sink traces detailing path, sanitizers, si
- **Severity Context:** A structural flaw in a backend command, whether an SQL query or a shell command, is one of the most severe classes of vulnerability in a SaaS application. It undermines the foundational trust of the system by creating the *potential* for data exposure (SQLi) or direct server compromise (Command Injection).
- **Your Role is Precise:** Your sole responsibility is to identify and precisely document these structural command flaws. You prove the *potential* for compromise; the Exploitation phase confirms the *realized* compromise. Do not cross this boundary.
- **Code is Ground Truth:** Your analysis must be rooted in the application's code. An observed behavior is only a symptom; the insecure data flow within the code is the root cause you must identify.
- **Thoroughness is Non-Negotiable:** An incomplete analysis is a failed analysis. Finding one flaw is merely the first data point. Your mission is only complete when **every potential data entry point** from the reconnaissance deliverable has been systematically analyzed and documented using the TodoWrite tool. **Do not terminate early.**
- **Thoroughness is Non-Negotiable:** An incomplete analysis is a failed analysis. Finding one flaw is merely the first data point. Your mission is only complete when **every potential data entry point** from the reconnaissance deliverable has been systematically analyzed and documented using the `todo_write` tool. **Do not terminate early.**
</critical>
<starting_context>
@@ -80,15 +80,15 @@ An **exploitable vulnerability** is a confirmed source-to-sink path where the en
<cli_tools>
**CRITICAL TOOL USAGE RESTRICTIONS:**
- NEVER use the Read tool for application source code analysis—delegate every code review to the Task Agent.
- ALWAYS direct the Task Agent to trace tainted data flow, sanitization/encoding steps, and sink construction before you reach a verdict.
- Use the Task Agent instead of Bash or Playwright when you need to inspect handlers, middleware, or shared utilities to follow an injection path.
- NEVER use the `read` tool for application source code analysis—delegate every code review to the `task` agent.
- ALWAYS direct the `task` agent to trace tainted data flow, sanitization/encoding steps, and sink construction before you reach a verdict.
- Use the `task` agent instead of Bash or Playwright when you need to inspect handlers, middleware, or shared utilities to follow an injection path.
**Available Tools:**
- **Task Agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, map query/command construction paths, and verify sanitization coverage. MANDATORY for all source code analysis.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, map query/command construction paths, and verify sanitization coverage. MANDATORY for all source code analysis.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Browser Automation (playwright-cli skill):** For browser interactions with the live web application (multi-step flows like password reset or registration), invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **TodoWrite Tool:** Use this to create and manage your analysis task list. Create a todo item for each injection source that needs analysis. Mark items as "in_progress" when working on them and "completed" when done.
- **`todo_write` Tool:** Use this to create and manage your analysis task list. Create a todo item for each injection source that needs analysis. Mark items as "in_progress" when working on them and "completed" when done.
</cli_tools>
<data_format_specifications>
@@ -125,7 +125,7 @@ An **exploitable vulnerability** is a confirmed source-to-sink path where the en
- **Goal:** Prove whether untrusted input can influence the **structure** of a backend command (SQL or Shell) or reach sensitive **slots** without the correct defense. No live exploitation in this phase.
- **1) Create a To Do for each Injection Source found in the Pre-Recon Deliverable**
- inside of .shannon/deliverables/pre_recon_deliverable.md under the section "7. Injection Sources (Command Injection and SQL Injection)" use the TodoWrite tool to create a task for each discovered Injection Source.
- inside of .shannon/deliverables/pre_recon_deliverable.md under the section "7. Injection Sources (Command Injection and SQL Injection)" use the `todo_write` tool to create a task for each discovered Injection Source.
- Note: All sources are marked as tainted until they hit a sanitization that matches the sink context. Normalizers (lowercasing, trimming, JSON parse, schema decode) do not count: the data is still **tainted**.
- **2) Trace Data Flow Paths from Source to Sink**
- For each source, your goal is to identify every unique "Data Flow Path" to a database sink. A path is a distinct route the data takes through the code.
@@ -283,8 +283,8 @@ An **exploitable vulnerability** is a confirmed source-to-sink path where the en
</methodology_and_domain_expertise>
<mcp_tools>
After completing your TodoWrite tasks and synthesizing findings, emit your specialist deliverable via 4 one-shot MCP tools provided by the `vuln-collector` server. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
<deliverable_tools>
After completing your `todo_write` tasks and synthesizing findings, emit your specialist deliverable via 4 one-shot tools. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
**Tool catalog:**
- `set_findings_summary` — Section 1 (Executive Summary key outcome) and Section 2 (Dominant Vulnerability Patterns)
@@ -292,7 +292,7 @@ After completing your TodoWrite tasks and synthesizing findings, emit your speci
- `set_safe_vectors` — Section 4 (vectors confirmed secure)
- `set_blind_spots` — Section 5 (analysis constraints and blind spots)
The MCP SDK injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects.
The harness injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects.
**Call semantics:** All 4 tools are one-shot — each may be called exactly once with the section's complete content. Duplicate calls return `"already called"` and are no-ops. There is no incremental/append mode; synthesize each section's full content in working memory before emitting.
@@ -300,21 +300,21 @@ The MCP SDK injects each tool's complete description and per-field guidance into
- `set_findings_summary` and `set_strategic_intelligence` are required — call both before terminating. They produce the load-bearing content the downstream `exploit-injection` agent reads.
- `set_safe_vectors` and `set_blind_spots` are recommended. Empty arrays are acceptable on runs with no validated-secure vectors or no constraint gaps, but explicit emission is preferred over skipping.
**Relationship to the exploitation queue:** The exploitation queue (`injection_exploitation_queue.json`) is captured automatically from your final structured output at session end. The 4 MCP tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</mcp_tools>
**Relationship to the exploitation queue:** The exploitation queue (`injection_exploitation_queue.json`) is produced by calling the `submit_exploitation_queue` tool when your analysis is complete. The 4 tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</deliverable_tools>
<conclusion_trigger>
**COMPLETION REQUIREMENTS (ALL must be satisfied):**
1. **Todo Completion:** ALL tasks in your TodoWrite list must be marked as "completed".
2. **Deliverable Emission:** Call the 4 MCP tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`, `set_blind_spots`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` and `set_blind_spots` are recommended (empty arrays acceptable but explicit emission is preferred).
1. **Todo Completion:** ALL tasks in your `todo_write` list must be marked as "completed".
2. **Deliverable Emission:** Call the 4 tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`, `set_blind_spots`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` and `set_blind_spots` are recommended (empty arrays acceptable but explicit emission is preferred).
**Note:** The exploitation queue is captured automatically from your final structured output at session end — separate from the MCP tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the MCP tool calls.
**Note:** The exploitation queue is produced by calling the `submit_exploitation_queue` tool when your analysis is complete — separate from the tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the tool calls.
**ONLY AFTER** both todo completion AND the required MCP tool calls have been completed, announce "**INJECTION ANALYSIS COMPLETE**" and stop.
**ONLY AFTER** both todo completion AND the required tool calls have been completed, announce "**INJECTION ANALYSIS COMPLETE**" and stop.
**CRITICAL:** After announcing completion, STOP IMMEDIATELY. Do NOT output summaries, recaps, or explanations of your work — the deliverable contains everything needed.
**FAILURE TO COMPLETE TODOS = INCOMPLETE ANALYSIS** - You will be considered to have failed the mission if you emit MCP deliverable calls before completing comprehensive testing of all input vectors.
**FAILURE TO COMPLETE TODOS = INCOMPLETE ANALYSIS** - You will be considered to have failed the mission if you emit deliverable calls before completing comprehensive testing of all input vectors.
</conclusion_trigger>
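The completion requirements above amount to a gate over two inputs: the todo list and the set of deliverable tools already called. The sketch below is one way to express that check. The `Todo` shape, `checkCompletion`, and its result fields are hypothetical names introduced for illustration, and the separate `submit_exploitation_queue` call is deliberately left out of the gate.

```typescript
// Hypothetical sketch of the completion gate; none of these names exist in the harness.
interface Todo {
  content: string;
  status: 'pending' | 'in_progress' | 'completed';
}

const REQUIRED_TOOLS = ['set_findings_summary', 'set_strategic_intelligence'] as const;
const RECOMMENDED_TOOLS = ['set_safe_vectors', 'set_blind_spots'] as const;

interface CompletionCheck {
  ready: boolean;
  openTodos: string[];
  missingRequired: string[];
  missingRecommended: string[];
}

function checkCompletion(todos: Todo[], calledTools: ReadonlySet<string>): CompletionCheck {
  const openTodos = todos.filter((t) => t.status !== 'completed').map((t) => t.content);
  const missingRequired = REQUIRED_TOOLS.filter((t) => !calledTools.has(t));
  const missingRecommended = RECOMMENDED_TOOLS.filter((t) => !calledTools.has(t));
  // Recommended tools are reported but never block completion.
  const ready = openTodos.length === 0 && missingRequired.length === 0;
  return { ready, openTodos, missingRequired, missingRecommended };
}

const check = checkCompletion(
  [{ content: 'Analyze login form', status: 'completed' }],
  new Set(['set_findings_summary', 'set_strategic_intelligence']),
);
```

In this example `check.ready` is true even though both recommended tools are still listed in `check.missingRecommended`, mirroring the required/recommended split in the prompt.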
+16 -16
@@ -76,15 +76,15 @@ An **exploitable vulnerability** is a data flow where user-controlled input infl
<cli_tools>
**CRITICAL TOOL USAGE RESTRICTIONS:**
- NEVER use the Read tool for application source code analysis—delegate every code review to the Task Agent.
- ALWAYS drive the Task Agent to map user-controlled input to outbound HTTP clients, validation layers, and network controls before declaring a result.
- Use the Task Agent to inspect shared utilities, proxy helpers, and request builders instead of reading files directly.
- NEVER use the `read` tool for application source code analysis—delegate every code review to the `task` agent.
- ALWAYS drive the `task` agent to map user-controlled input to outbound HTTP clients, validation layers, and network controls before declaring a result.
- Use the `task` agent to inspect shared utilities, proxy helpers, and request builders instead of reading files directly.
**Available Tools:**
- **Task Agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace data flows, and understand HTTP client usage. MANDATORY for all source code analysis.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`task` agent (Code Analysis):** Your primary tool. Use it to ask targeted questions about the source code, trace data flows, and understand HTTP client usage. MANDATORY for all source code analysis.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
- **Browser Automation (playwright-cli skill):** For browser interactions with the live web application (multi-step flows involving URL redirection or proxy functionality), invoke the `playwright-cli` skill to learn available commands. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **TodoWrite Tool:** Use this to create and manage your analysis task list. Create a todo item for each SSRF sink that needs analysis. Mark items as "in_progress" when working on them and "completed" when done.
- **`todo_write` Tool:** Use this to create and manage your analysis task list. Create a todo item for each SSRF sink that needs analysis. Mark items as "in_progress" when working on them and "completed" when done.
</cli_tools>
<data_format_specifications>
@@ -114,7 +114,7 @@ An **exploitable vulnerability** is a data flow where user-controlled input infl
<methodology>
# White-Box Vulnerability Analysis Procedure: Server-Side Request Forgery (SSRF)
NOTE ON "FOR ALL": whenever an item begins with "for all …", use the TodoWrite tool to create a separate task for each relevant endpoint/flow that needs to be analyzed.
NOTE ON "FOR ALL": whenever an item begins with "for all …", use the `todo_write` tool to create a separate task for each relevant endpoint/flow that needs to be analyzed.
From `.shannon/deliverables/pre_recon_deliverable.md`, use Section 10 (SSRF Sinks) to guide your tasks.
## 1) Identify HTTP Client Usage Patterns
@@ -169,7 +169,7 @@ From `.shannon/deliverables/pre_recon_deliverable.md`, use Section 10 (SSRF Sink
Inside `.shannon/deliverables/pre_recon_deliverable.md` under section `##10. SSRF Sinks##`.
Use the TodoWrite tool to create a task for each discovered sink (any server-side request composed even partially from user input).
Use the `todo_write` tool to create a task for each discovered sink (any server-side request composed even partially from user input).
---
@@ -243,15 +243,15 @@ For each check you perform from the list above, you must make a final **verdict*
</methodology_and_domain_expertise>
<mcp_tools>
After completing your TodoWrite tasks and synthesizing findings, emit your specialist deliverable via 3 one-shot MCP tools provided by the `vuln-collector` server. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
<deliverable_tools>
After completing your `todo_write` tasks and synthesizing findings, emit your specialist deliverable via 3 one-shot tools. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
**Tool catalog:**
- `set_findings_summary` — Section 1 (Executive Summary key outcome) and Section 2 (Dominant Vulnerability Patterns)
- `set_strategic_intelligence` — Section 3 (Strategic Intelligence for Exploitation, with SSRF-specific sub-fields: HTTP client library, request architecture, internal services)
- `set_safe_vectors` — Section 4 (Secure by Design: Validated Components)
The MCP SDK injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects.
The harness injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects.
**Call semantics:** All 3 tools are one-shot — each may be called exactly once with the section's complete content. Duplicate calls return `"already called"` and are no-ops. There is no incremental/append mode; synthesize each section's full content in working memory before emitting.
@@ -259,19 +259,19 @@ The MCP SDK injects each tool's complete description and per-field guidance into
- `set_findings_summary` and `set_strategic_intelligence` are required — call both before terminating. They produce the load-bearing content the downstream `exploit-ssrf` agent reads.
- `set_safe_vectors` is recommended. An empty array is acceptable on runs with no validated-secure components, but explicit emission is preferred over skipping.
**Relationship to the exploitation queue:** The exploitation queue (`ssrf_exploitation_queue.json`) is captured automatically from your final structured output at session end. The 3 MCP tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</mcp_tools>
**Relationship to the exploitation queue:** The exploitation queue (`ssrf_exploitation_queue.json`) is produced by calling the `submit_exploitation_queue` tool when your analysis is complete. The 3 tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</deliverable_tools>
<conclusion_trigger>
**COMPLETION REQUIREMENTS (ALL must be satisfied):**
1. **Systematic Analysis:** ALL relevant API endpoints and request-making features identified in the reconnaissance deliverable must be analyzed for SSRF vulnerabilities.
2. **Deliverable Emission:** Call the 3 MCP tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` is recommended (an empty array is acceptable but explicit emission is preferred).
2. **Deliverable Emission:** Call the 3 tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` is recommended (an empty array is acceptable but explicit emission is preferred).
**Note:** The exploitation queue is captured automatically from your final structured output at session end — separate from the MCP tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the MCP tool calls.
**Note:** The exploitation queue is produced by calling the `submit_exploitation_queue` tool when your analysis is complete — separate from the tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the tool calls.
**ONLY AFTER** both systematic analysis AND the required MCP tool calls have been completed, announce "**SSRF ANALYSIS COMPLETE**" and stop.
**ONLY AFTER** both systematic analysis AND the required tool calls have been completed, announce "**SSRF ANALYSIS COMPLETE**" and stop.
**CRITICAL:** After announcing completion, STOP IMMEDIATELY. Do NOT output summaries, recaps, or explanations of your work — the deliverable contains everything needed.
</conclusion_trigger>
+17 -17
@@ -77,17 +77,17 @@ An **exploitable vulnerability** is a confirmed source-to-sink path where the en
<cli_tools>
**CRITICAL TOOL USAGE RESTRICTIONS:**
- NEVER use the Read tool for application source code analysis - ALWAYS delegate to Task agents for examining .js, .ts, .py, .php files and application logic. You MAY use Read
- NEVER use the `read` tool for application source code analysis - ALWAYS delegate to `task` agents for examining .js, .ts, .py, .php files and application logic. You MAY use Read
tool directly for these files: `.shannon/deliverables/pre_recon_deliverable.md`, `.shannon/deliverables/recon_deliverable.md`
- Direct the Task Agent to trace render contexts, sanitization coverage, and template/component boundaries before deciding on exploitability.
- **ALWAYS delegate code analysis to Task agents**
- Direct the `task` agent to trace render contexts, sanitization coverage, and template/component boundaries before deciding on exploitability.
- **ALWAYS delegate code analysis to `task` agents**
**Available Tools:**
- **Task Agent (Code Analysis):** MANDATORY for all source code analysis and data flow tracing. Use this instead of Read tool for examining application code, models, controllers, and templates.
- **`task` agent (Code Analysis):** MANDATORY for all source code analysis and data flow tracing. Use this instead of `read` tool for examining application code, models, controllers, and templates.
- **Terminal (curl):** MANDATORY for testing HTTP-based XSS vectors and observing raw HTML responses. Use for reflected XSS testing and JSONP injection testing.
- **Browser Automation (playwright-cli skill):** MANDATORY for testing DOM-based XSS and form submission vectors. Invoke the `playwright-cli` skill to learn available commands. Use for stored XSS testing and client-side payload execution verification. Always pass `-s={{PLAYWRIGHT_SESSION}}` to every command for session isolation.
- **TodoWrite Tool:** Use this to create and manage your analysis task list. Create a todo item for each sink you need to analyze.
- **Bash tool:** Use for creating directories, copying files, and other shell commands as needed.
- **`todo_write` Tool:** Use this to create and manage your analysis task list. Create a todo item for each sink you need to analyze.
- **`bash` tool:** Use for creating directories, copying files, and other shell commands as needed.
</cli_tools>
<data_format_specifications>
@@ -124,11 +124,11 @@ Structure: The vulnerability JSON object MUST follow this exact format:
- **Goal:** Identify vulnerable data flow paths by starting at the XSS sinks received from the recon phase and tracing backward to their sanitizations and sources. This approach is optimized for finding all types of XSS, especially complex Stored XSS patterns.
- **Core Principle:** Data is assumed to be tainted until a context-appropriate output encoder (sanitization) is encountered on its path to the sink.
### **1) Create a todo item for each XSS sink using the TodoWrite tool**
Read .shannon/deliverables/pre_recon_deliverable.md section ##9. XSS Sinks and Render Contexts## and use the **TodoWrite tool** to create a todo item for each discovered sink-context pair that needs analysis.
### **1) Create a todo item for each XSS sink using the `todo_write` tool**
Read .shannon/deliverables/pre_recon_deliverable.md section ##9. XSS Sinks and Render Contexts## and use the **`todo_write` tool** to create a todo item for each discovered sink-context pair that needs analysis.
### **2) Trace Each Sink Backward (Backward Taint Analysis)**
For each pending item in your todo list (managed via TodoWrite tool), trace the origin of the data variable backward from the sink through the application logic. Your goal is to find either a valid sanitizer or an untrusted source. Mark each todo item as completed after you've fully analyzed that sink.
For each pending item in your todo list (managed via `todo_write` tool), trace the origin of the data variable backward from the sink through the application logic. Your goal is to find either a valid sanitizer or an untrusted source. Mark each todo item as completed after you've fully analyzed that sink.
- **Early Termination for Secure Paths (Efficiency Rule):**
- As you trace backward, if you encounter a sanitization/encoding function, immediately perform two checks:
@@ -205,8 +205,8 @@ This rulebook is used for the **Early Termination** check in Step 2.
</methodology_and_domain_expertise>
<mcp_tools>
After completing your TodoWrite tasks and synthesizing findings, emit your specialist deliverable via 4 one-shot MCP tools provided by the `vuln-collector` server. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
<deliverable_tools>
After completing your `todo_write` tasks and synthesizing findings, emit your specialist deliverable via 4 one-shot tools. Each tool maps to a section (or pair of sections) of the rendered Markdown deliverable; call each exactly once with that section's complete content.
**Tool catalog:**
- `set_findings_summary` — Section 1 (Executive Summary key outcome) and Section 2 (Dominant Vulnerability Patterns)
@@ -214,7 +214,7 @@ After completing your TodoWrite tasks and synthesizing findings, emit your speci
- `set_safe_vectors` — Section 4 (vectors confirmed secure)
- `set_blind_spots` — Section 5 (analysis constraints and blind spots)
The MCP SDK injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects. For XSS specifically, when populating `set_safe_vectors`, include the optional `render_context` field on each entry (HTML_BODY, HTML_ATTRIBUTE, JAVASCRIPT_STRING, URL_PARAM, or CSS_VALUE).
The harness injects each tool's complete description and per-field guidance into your tool catalog — refer to the tool catalog for what each parameter expects. For XSS specifically, when populating `set_safe_vectors`, include the optional `render_context` field on each entry (HTML_BODY, HTML_ATTRIBUTE, JAVASCRIPT_STRING, URL_PARAM, or CSS_VALUE).
**Call semantics:** All 4 tools are one-shot — each may be called exactly once with the section's complete content. Duplicate calls return `"already called"` and are no-ops. There is no incremental/append mode; synthesize each section's full content in working memory before emitting.
@@ -222,19 +222,19 @@ The MCP SDK injects each tool's complete description and per-field guidance into
- `set_findings_summary` and `set_strategic_intelligence` are required — call both before terminating. They produce the load-bearing content the downstream `exploit-xss` agent reads.
- `set_safe_vectors` and `set_blind_spots` are recommended. Empty arrays are acceptable on runs with no validated-secure vectors or no constraint gaps, but explicit emission is preferred over skipping.
**Relationship to the exploitation queue:** The exploitation queue (`xss_exploitation_queue.json`) is captured automatically from your final structured output at session end. The 4 MCP tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</mcp_tools>
**Relationship to the exploitation queue:** The exploitation queue (`xss_exploitation_queue.json`) is produced by calling the `submit_exploitation_queue` tool when your analysis is complete. The 4 tools produce the analysis deliverable Markdown; the structured-output queue is separate and follows the `exploitation_queue_format` schema documented above.
</deliverable_tools>
<conclusion_trigger>
COMPLETION REQUIREMENTS (ALL must be satisfied):
1. Systematic Analysis: ALL input vectors identified from the reconnaissance deliverable must be analyzed.
2. Deliverable Emission: Call the 4 MCP tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`, `set_blind_spots`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` and `set_blind_spots` are recommended (empty arrays acceptable but explicit emission is preferred).
2. Deliverable Emission: Call the 4 tools (`set_findings_summary`, `set_strategic_intelligence`, `set_safe_vectors`, `set_blind_spots`) exactly once each with their respective section content. `set_findings_summary` and `set_strategic_intelligence` are required; `set_safe_vectors` and `set_blind_spots` are recommended (empty arrays acceptable but explicit emission is preferred).
**Note:** The exploitation queue is captured automatically from your final structured output at session end — separate from the MCP tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the MCP tool calls.
**Note:** The exploitation queue is produced by calling the `submit_exploitation_queue` tool when your analysis is complete — separate from the tools above. The analysis deliverable Markdown is rendered by the harness after your session ends from the tool calls.
ONLY AFTER both systematic analysis AND the required MCP tool calls have been completed, announce "XSS ANALYSIS COMPLETE" and stop.
ONLY AFTER both systematic analysis AND the required tool calls have been completed, announce "XSS ANALYSIS COMPLETE" and stop.
**CRITICAL:** After announcing completion, STOP IMMEDIATELY. Do NOT output summaries, recaps, or explanations of your work — the deliverable contains everything needed.
</conclusion_trigger>
+7
@@ -14,6 +14,7 @@ export interface AuditLogger {
logToolStart(toolName: string, parameters: unknown): Promise<void>;
logToolEnd(result: unknown): Promise<void>;
logError(error: Error, duration: number, turns: number): Promise<void>;
logNote(category: string, message: string): Promise<void>;
}
class RealAuditLogger implements AuditLogger {
@@ -56,6 +57,10 @@ class RealAuditLogger implements AuditLogger {
timestamp: formatTimestamp(),
});
}
async logNote(category: string, message: string): Promise<void> {
await this.auditSession.logWorkflowNote(category, message);
}
}
/** Null Object implementation - all methods are safe no-ops */
@@ -67,6 +72,8 @@ class NullAuditLogger implements AuditLogger {
async logToolEnd(_result: unknown): Promise<void> {}
async logError(_error: Error, _duration: number, _turns: number): Promise<void> {}
async logNote(_category: string, _message: string): Promise<void> {}
}
// Returns no-op when auditSession is null
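The hunks above add `logNote` to both the real and the Null Object logger, so callers never need a null check. The following self-contained sketch shows the same pattern in reduced form. `NoteSink`, `NoteLogger`, and `createNoteLogger` are hypothetical stand-ins for the real `AuditSession`, `AuditLogger`, and `createAuditLogger`, which carry more methods than shown here.

```typescript
// Hypothetical minimal stand-ins; the real interfaces in the diff have more members.
interface NoteSink {
  logWorkflowNote(category: string, message: string): Promise<void>;
}

interface NoteLogger {
  logNote(category: string, message: string): Promise<void>;
}

class RealNoteLogger implements NoteLogger {
  private readonly session: NoteSink;

  constructor(session: NoteSink) {
    this.session = session;
  }

  async logNote(category: string, message: string): Promise<void> {
    await this.session.logWorkflowNote(category, message);
  }
}

// Null Object: same interface, every method is a safe no-op.
class NullNoteLogger implements NoteLogger {
  async logNote(_category: string, _message: string): Promise<void> {}
}

function createNoteLogger(session: NoteSink | null): NoteLogger {
  return session ? new RealNoteLogger(session) : new NullNoteLogger();
}

const recorded: string[] = [];
const fakeSession: NoteSink = {
  async logWorkflowNote(category: string, message: string): Promise<void> {
    recorded.push(`${category}: ${message}`);
  },
};

async function demo(): Promise<string[]> {
  await createNoteLogger(null).logNote('resume', 'dropped silently');
  await createNoteLogger(fakeSession).logNote('resume', 'deliverables rendered');
  return recorded;
}
```

Calling `demo()` records only the second note, since the first logger was created without a session.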
-404
@@ -1,404 +0,0 @@
// Copyright (C) 2025 Keygraph, Inc.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License version 3
// as published by the Free Software Foundation.
// Production Claude agent execution with retry, git checkpoints, and audit logging
import { type JsonSchemaOutputFormat, query } from '@anthropic-ai/claude-agent-sdk';
import { fs, path } from 'zx';
import type { AuditSession } from '../audit/index.js';
import { deliverablesDir } from '../paths.js';
import { isRetryableError, PentestError } from '../services/error-handling.js';
import { AGENT_VALIDATORS } from '../session-manager.js';
import type { ActivityLogger } from '../types/activity-logger.js';
import { isSpendingCapBehavior } from '../utils/billing-detection.js';
import { formatTimestamp } from '../utils/formatting.js';
import { Timer } from '../utils/metrics.js';
import { createAuditLogger } from './audit-logger.js';
import { dispatchMessage } from './message-handlers.js';
import { type ModelTier, resolveModel, supportsAdaptiveThinking } from './models.js';
import { detectExecutionContext, formatCompletionMessage, formatErrorOutput } from './output-formatters.js';
import { createProgressManager } from './progress-manager.js';
declare global {
var SHANNON_DISABLE_LOADER: boolean | undefined;
}
export interface ClaudePromptResult {
result?: string | null | undefined;
success: boolean;
duration: number;
turns?: number | undefined;
cost: number;
model?: string | undefined;
partialCost?: number | undefined;
apiErrorDetected?: boolean | undefined;
error?: string | undefined;
errorType?: string | undefined;
prompt?: string | undefined;
retryable?: boolean | undefined;
structuredOutput?: unknown;
}
function outputLines(lines: string[]): void {
for (const line of lines) {
console.log(line);
}
}
async function writeErrorLog(
err: Error & { code?: string; status?: number },
sourceDir: string,
fullPrompt: string,
duration: number,
): Promise<void> {
try {
const errorLog = {
timestamp: formatTimestamp(),
agent: 'claude-executor',
error: {
name: err.constructor.name,
message: err.message,
code: err.code,
status: err.status,
stack: err.stack,
},
context: {
sourceDir,
prompt: `${fullPrompt.slice(0, 200)}...`,
retryable: isRetryableError(err),
},
duration,
};
const logPath = path.join(deliverablesDir(sourceDir), 'error.log');
await fs.appendFile(logPath, `${JSON.stringify(errorLog)}\n`);
} catch {
// Best-effort error log writing - don't propagate failures
}
}
export async function validateAgentOutput(
result: ClaudePromptResult,
agentName: string | null,
sourceDir: string,
logger: ActivityLogger,
): Promise<boolean> {
logger.info(`Validating ${agentName} agent output`);
try {
// Check if agent completed successfully (text result OR structured output)
if (!result.success || (!result.result && result.structuredOutput === undefined)) {
logger.error('Validation failed: Agent execution was unsuccessful');
return false;
}
// Get validator function for this agent
const validator = agentName ? AGENT_VALIDATORS[agentName as keyof typeof AGENT_VALIDATORS] : undefined;
if (!validator) {
logger.warn(`No validator found for agent "${agentName}" - assuming success`);
logger.info('Validation passed: Unknown agent with successful result');
return true;
}
logger.info(`Using validator for agent: ${agentName}`, { sourceDir });
// Apply validation function
const validationResult = await validator(sourceDir, logger);
if (validationResult) {
logger.info('Validation passed: Required files/structure present');
} else {
logger.error('Validation failed: Missing required deliverable files');
}
return validationResult;
} catch (error) {
const errMsg = error instanceof Error ? error.message : String(error);
logger.error(`Validation failed with error: ${errMsg}`);
return false;
}
}
// Low-level SDK execution. Handles message streaming, progress, and audit logging.
// Exported for Temporal activities to call single-attempt execution.
export async function runClaudePrompt(
prompt: string,
sourceDir: string,
context: string = '',
description: string = 'Claude analysis',
_agentName: string | null = null,
auditSession: AuditSession | null = null,
logger: ActivityLogger,
modelTier: ModelTier = 'medium',
outputFormat?: JsonSchemaOutputFormat,
apiKey?: string,
deliverablesSubdir?: string,
providerConfig?: import('../types/config.js').ProviderConfig,
mcpServers?: Record<string, import('@anthropic-ai/claude-agent-sdk').McpServerConfig>,
): Promise<ClaudePromptResult> {
// 1. Initialize timing and prompt
const timer = new Timer(`agent-${description.toLowerCase().replace(/\s+/g, '-')}`);
const fullPrompt = context ? `${context}\n\n${prompt}` : prompt;
// 2. Set up progress and audit infrastructure
const execContext = detectExecutionContext(description);
const progress = createProgressManager(
{ description, useCleanOutput: execContext.useCleanOutput },
global.SHANNON_DISABLE_LOADER ?? false,
);
const auditLogger = createAuditLogger(auditSession);
logger.info(`Running Claude Code: ${description}...`);
// 3. Build env vars to pass to SDK subprocesses
const sdkEnv: Record<string, string> = {
CLAUDE_CODE_MAX_OUTPUT_TOKENS: process.env.CLAUDE_CODE_MAX_OUTPUT_TOKENS || '64000',
PLAYWRIGHT_MCP_OUTPUT_DIR: deliverablesSubdir
? path.join(sourceDir, path.dirname(deliverablesSubdir), '.playwright-cli')
: path.join(sourceDir, '.shannon', '.playwright-cli'),
// apiKey from ContainerConfig takes precedence over process.env
...(apiKey && { ANTHROPIC_API_KEY: apiKey }),
// Deliverables subdir for save-deliverable CLI tool
...(deliverablesSubdir && { SHANNON_DELIVERABLES_SUBDIR: deliverablesSubdir }),
};
// 3a. Apply structured provider config directly to sdkEnv (no process.env mutation)
if (providerConfig) {
switch (providerConfig.providerType) {
case 'bedrock':
sdkEnv.CLAUDE_CODE_USE_BEDROCK = '1';
if (providerConfig.awsRegion) sdkEnv.AWS_REGION = providerConfig.awsRegion;
if (providerConfig.awsAccessKeyId) sdkEnv.AWS_ACCESS_KEY_ID = providerConfig.awsAccessKeyId;
if (providerConfig.awsSecretAccessKey) sdkEnv.AWS_SECRET_ACCESS_KEY = providerConfig.awsSecretAccessKey;
break;
case 'vertex':
sdkEnv.CLAUDE_CODE_USE_VERTEX = '1';
if (providerConfig.gcpRegion) sdkEnv.CLOUD_ML_REGION = providerConfig.gcpRegion;
if (providerConfig.gcpProjectId) sdkEnv.ANTHROPIC_VERTEX_PROJECT_ID = providerConfig.gcpProjectId;
if (providerConfig.gcpCredentialsPath)
sdkEnv.GOOGLE_APPLICATION_CREDENTIALS = providerConfig.gcpCredentialsPath;
break;
case 'litellm_router':
if (providerConfig.baseUrl) sdkEnv.ANTHROPIC_BASE_URL = providerConfig.baseUrl;
if (providerConfig.authToken) sdkEnv.ANTHROPIC_AUTH_TOKEN = providerConfig.authToken;
break;
default:
// 'anthropic_api' or unset — apiKey already handled above
if (providerConfig.apiKey && !apiKey) sdkEnv.ANTHROPIC_API_KEY = providerConfig.apiKey;
break;
}
}
// 3b. Passthrough env vars not already set by providerConfig or apiKey
const passthroughVars = [
...(!sdkEnv.ANTHROPIC_API_KEY ? ['ANTHROPIC_API_KEY'] : []),
'CLAUDE_CODE_OAUTH_TOKEN',
...(!sdkEnv.ANTHROPIC_BASE_URL ? ['ANTHROPIC_BASE_URL'] : []),
...(!sdkEnv.ANTHROPIC_AUTH_TOKEN ? ['ANTHROPIC_AUTH_TOKEN'] : []),
...(!sdkEnv.CLAUDE_CODE_USE_BEDROCK ? ['CLAUDE_CODE_USE_BEDROCK'] : []),
...(!sdkEnv.AWS_REGION ? ['AWS_REGION'] : []),
'AWS_BEARER_TOKEN_BEDROCK',
...(!sdkEnv.CLAUDE_CODE_USE_VERTEX ? ['CLAUDE_CODE_USE_VERTEX'] : []),
...(!sdkEnv.CLOUD_ML_REGION ? ['CLOUD_ML_REGION'] : []),
...(!sdkEnv.ANTHROPIC_VERTEX_PROJECT_ID ? ['ANTHROPIC_VERTEX_PROJECT_ID'] : []),
...(!sdkEnv.GOOGLE_APPLICATION_CREDENTIALS ? ['GOOGLE_APPLICATION_CREDENTIALS'] : []),
'HOME',
'PATH',
'PLAYWRIGHT_MCP_EXECUTABLE_PATH',
];
for (const name of passthroughVars) {
const val = process.env[name];
if (val) {
sdkEnv[name] = val;
}
}
// 4. Configure SDK options
// Model override from providerConfig takes precedence over env-based resolveModel
const model = providerConfig?.modelOverrides?.[modelTier] ?? resolveModel(modelTier);
const adaptiveThinking = supportsAdaptiveThinking(model) && process.env.CLAUDE_ADAPTIVE_THINKING !== 'false';
const options = {
model,
maxTurns: 10_000,
cwd: sourceDir,
permissionMode: 'bypassPermissions' as const,
allowDangerouslySkipPermissions: true,
settingSources: ['user'] as ('user' | 'project' | 'local')[],
env: sdkEnv,
...(adaptiveThinking && { thinking: { type: 'adaptive' as const } }),
...(outputFormat && { outputFormat }),
...(mcpServers && Object.keys(mcpServers).length > 0 && { mcpServers }),
};
if (!execContext.useCleanOutput) {
logger.info(`SDK Options: maxTurns=${options.maxTurns}, cwd=${sourceDir}, permissions=BYPASS`);
}
let turnCount = 0;
let result: string | null = null;
let apiErrorDetected = false;
let totalCost = 0;
progress.start();
try {
// 6. Process the message stream
const messageLoopResult = await processMessageStream(
fullPrompt,
options,
{ execContext, description, progress, auditLogger, logger },
timer,
);
turnCount = messageLoopResult.turnCount;
result = messageLoopResult.result;
apiErrorDetected = messageLoopResult.apiErrorDetected;
totalCost = messageLoopResult.cost;
const model = messageLoopResult.model;
// === SPENDING CAP SAFEGUARD ===
// 7. Defense-in-depth: Detect spending cap that slipped through detectApiError().
// Uses consolidated billing detection from utils/billing-detection.ts
if (isSpendingCapBehavior(turnCount, totalCost, result || '')) {
throw new PentestError(
`Spending cap likely reached (turns=${turnCount}, cost=$0): ${result?.slice(0, 100)}`,
'billing',
true, // Retryable - Temporal will use 5-30 min backoff
);
}
// 8. Finalize successful result
const duration = timer.stop();
if (apiErrorDetected) {
logger.warn(`API Error detected in ${description} - will validate deliverables before failing`);
}
progress.finish(formatCompletionMessage(execContext, description, turnCount, duration));
return {
result,
success: true,
duration,
turns: turnCount,
cost: totalCost,
model,
partialCost: totalCost,
apiErrorDetected,
...(messageLoopResult.structuredOutput !== undefined && {
structuredOutput: messageLoopResult.structuredOutput,
}),
};
} catch (error) {
// 9. Handle errors — log, write error file, return failure
const duration = timer.stop();
const err = error as Error & { code?: string; status?: number };
await auditLogger.logError(err, duration, turnCount);
progress.stop();
outputLines(formatErrorOutput(err, execContext, description, duration, sourceDir, isRetryableError(err)));
await writeErrorLog(err, sourceDir, fullPrompt, duration);
return {
error: err.message,
errorType: err.constructor.name,
prompt: `${fullPrompt.slice(0, 100)}...`,
success: false,
duration,
cost: totalCost,
retryable: isRetryableError(err),
};
}
}
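The environment handling in steps 3 to 3b of the removed function follows one rule: values set explicitly (from `apiKey` or `providerConfig`) win, and an allow-list of `process.env` variables only fills the gaps. The sketch below isolates that rule. `buildChildEnv` is a hypothetical helper written for illustration and simplifies the original, which builds the allow-list conditionally rather than checking each key at merge time.

```typescript
// Hypothetical helper illustrating the precedence rule; not part of the removed file.
function buildChildEnv(
  explicit: Record<string, string>,
  inherited: Record<string, string | undefined>,
  allowList: readonly string[],
): Record<string, string> {
  const env: Record<string, string> = { ...explicit };
  for (const name of allowList) {
    const value = inherited[name];
    // Only fill gaps: an explicitly configured value is never overwritten.
    if (value && env[name] === undefined) {
      env[name] = value;
    }
  }
  return env;
}

const childEnv = buildChildEnv(
  { ANTHROPIC_API_KEY: 'from-config' },
  { ANTHROPIC_API_KEY: 'from-env', PATH: '/usr/bin', UNRELATED_SECRET: 'x' },
  ['ANTHROPIC_API_KEY', 'PATH'],
);
```

Here `childEnv` keeps the configured key, inherits `PATH`, and omits `UNRELATED_SECRET` because it is not on the allow-list. Building a fresh object this way also avoids mutating `process.env`, which the original comments call out as a goal.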
interface MessageLoopResult {
turnCount: number;
result: string | null;
apiErrorDetected: boolean;
cost: number;
model?: string | undefined;
structuredOutput?: unknown;
}
interface MessageLoopDeps {
execContext: ReturnType<typeof detectExecutionContext>;
description: string;
progress: ReturnType<typeof createProgressManager>;
auditLogger: ReturnType<typeof createAuditLogger>;
logger: ActivityLogger;
}
async function processMessageStream(
fullPrompt: string,
options: NonNullable<Parameters<typeof query>[0]['options']>,
deps: MessageLoopDeps,
timer: Timer,
): Promise<MessageLoopResult> {
const { execContext, description, progress, auditLogger, logger } = deps;
const HEARTBEAT_INTERVAL = 30000;
let turnCount = 0;
let result: string | null = null;
let apiErrorDetected = false;
let cost = 0;
let model: string | undefined;
let structuredOutput: unknown | undefined;
let lastHeartbeat = Date.now();
for await (const message of query({ prompt: fullPrompt, options })) {
// Heartbeat logging when loader is disabled
const now = Date.now();
if (global.SHANNON_DISABLE_LOADER && now - lastHeartbeat > HEARTBEAT_INTERVAL) {
logger.info(`[${Math.floor((now - timer.startTime) / 1000)}s] ${description} running... (Turn ${turnCount})`);
lastHeartbeat = now;
}
// Increment turn count for assistant messages
if (message.type === 'assistant') {
turnCount++;
}
const dispatchResult = await dispatchMessage(message as { type: string; subtype?: string }, turnCount, {
execContext,
description,
progress,
auditLogger,
logger,
});
if (dispatchResult.type === 'throw') {
throw dispatchResult.error;
}
if (dispatchResult.type === 'complete') {
result = dispatchResult.result;
cost = dispatchResult.cost;
if (dispatchResult.structuredOutput !== undefined) {
structuredOutput = dispatchResult.structuredOutput;
}
break;
}
if (dispatchResult.type === 'continue') {
if (dispatchResult.apiErrorDetected) {
apiErrorDetected = true;
}
if (dispatchResult.model) {
model = dispatchResult.model;
}
}
}
return {
turnCount,
result,
apiErrorDetected,
cost,
model,
...(structuredOutput !== undefined && { structuredOutput }),
};
}
@@ -0,0 +1,47 @@
/**
* pi extension: enforce a bounded timeout on every `bash` tool call.
*
* pi's built-in bash tool accepts an optional `timeout` (in seconds) but applies
* NO default and NO upper bound — an unbounded command (e.g. a `playwright-cli`
* browser action that never returns) hangs the agent indefinitely. This extension
* registers a `tool_call` pre-execution handler that blocks any `bash` invocation
* that omits `timeout` or sets it above the maximum, returning a message that tells
* the model how to re-run the command correctly.
*/
import type { ExtensionAPI, ToolCallEvent, ToolCallEventResult } from '@earendil-works/pi-coding-agent';
import { isToolCallEventType } from '@earendil-works/pi-coding-agent';
/** Timeout (seconds) recommended to the model when it omits one; it is never applied automatically. */
const DEFAULT_TIMEOUT_SECONDS = 120;
/** Hard upper bound (seconds) a single bash command may run. */
const MAX_TIMEOUT_SECONDS = 600;
function evaluateBashTimeout(timeout: number | undefined): ToolCallEventResult | undefined {
const hasValidTimeout = typeof timeout === 'number' && Number.isFinite(timeout) && timeout > 0;
if (!hasValidTimeout) {
return {
block: true,
reason: `Set bash 'timeout' (seconds). Default ${DEFAULT_TIMEOUT_SECONDS}s, max ${MAX_TIMEOUT_SECONDS}s.`,
};
}
if (timeout > MAX_TIMEOUT_SECONDS) {
return {
block: true,
reason: `bash 'timeout' ${timeout}s exceeds max ${MAX_TIMEOUT_SECONDS}s. Default ${DEFAULT_TIMEOUT_SECONDS}s, max ${MAX_TIMEOUT_SECONDS}s.`,
};
}
return undefined;
}
export default function bashTimeoutExtension(pi: ExtensionAPI): void {
pi.on('tool_call', (event: ToolCallEvent): ToolCallEventResult | undefined => {
if (!isToolCallEventType('bash', event)) {
return undefined;
}
return evaluateBashTimeout(event.input.timeout);
});
}
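A minimal sketch of how the timeout gate above decides, written so it runs without the pi package. `GateResult` and `gateBashTimeout` are hypothetical stand-ins for `ToolCallEventResult` and `evaluateBashTimeout`; the result shape is an assumption based only on the fields the extension sets. The constants mirror the extension's.

```typescript
// Hypothetical stand-in for pi's ToolCallEventResult (only the fields used above).
interface GateResult {
  block: true;
  reason: string;
}

const DEFAULT_TIMEOUT_SECONDS = 120;
const MAX_TIMEOUT_SECONDS = 600;

// Same decision order as the extension: reject a missing or invalid timeout first,
// then reject a timeout above the hard cap. `undefined` means "let the call through".
function gateBashTimeout(timeout: number | undefined): GateResult | undefined {
  if (typeof timeout !== 'number' || !Number.isFinite(timeout) || timeout <= 0) {
    return {
      block: true,
      reason: `Set bash 'timeout' (seconds). Default ${DEFAULT_TIMEOUT_SECONDS}s, max ${MAX_TIMEOUT_SECONDS}s.`,
    };
  }
  if (timeout > MAX_TIMEOUT_SECONDS) {
    return {
      block: true,
      reason: `bash 'timeout' ${timeout}s exceeds max ${MAX_TIMEOUT_SECONDS}s.`,
    };
  }
  return undefined;
}

// Sample inputs: omitted, zero, over the cap, and two accepted values.
for (const t of [undefined, 0, 601, 120, 600]) {
  const verdict = gateBashTimeout(t);
  console.log(String(t), verdict ? `blocked: ${verdict.reason}` : 'allowed');
}
```

Note that the boundary value 600 is allowed: only values strictly above `MAX_TIMEOUT_SECONDS` are blocked.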
-371
@@ -1,371 +0,0 @@
// Copyright (C) 2025 Keygraph, Inc.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License version 3
// as published by the Free Software Foundation.
import type { SDKAssistantMessageError } from '@anthropic-ai/claude-agent-sdk';
import { PentestError } from '../services/error-handling.js';
import type { ActivityLogger } from '../types/activity-logger.js';
import { ErrorCode } from '../types/errors.js';
import { matchesBillingTextPattern } from '../utils/billing-detection.js';
import { formatTimestamp } from '../utils/formatting.js';
import type { AuditLogger } from './audit-logger.js';
import {
filterJsonToolCalls,
formatAssistantOutput,
formatResultOutput,
formatToolResultOutput,
formatToolUseOutput,
} from './output-formatters.js';
import type { ProgressManager } from './progress-manager.js';
import type {
ApiErrorDetection,
AssistantMessage,
AssistantResult,
ContentBlock,
ExecutionContext,
ResultData,
ResultMessage,
SystemInitMessage,
ToolResultData,
ToolResultMessage,
ToolUseData,
ToolUseMessage,
} from './types.js';
// Handles both array and string content formats from SDK
function extractMessageContent(message: AssistantMessage): string {
const messageContent = message.message;
if (Array.isArray(messageContent.content)) {
return messageContent.content
.filter((c: ContentBlock) => c.type !== 'thinking' && c.type !== 'redacted_thinking')
.map((c: ContentBlock) => c.text || JSON.stringify(c))
.join('\n');
}
return String(messageContent.content);
}
// Extracts only text content (no tool_use JSON) to avoid false positives in error detection
function extractTextOnlyContent(message: AssistantMessage): string {
const messageContent = message.message;
if (Array.isArray(messageContent.content)) {
return messageContent.content
.filter((c: ContentBlock) => c.type === 'text' || c.text)
.map((c: ContentBlock) => c.text || '')
.join('\n');
}
return String(messageContent.content);
}
function detectApiError(content: string): ApiErrorDetection {
if (!content || typeof content !== 'string') {
return { detected: false };
}
const lowerContent = content.toLowerCase();
// === BILLING/SPENDING CAP ERRORS (Retryable with long backoff) ===
// When Claude Code hits its spending cap, it returns a short message like
// "Spending cap reached resets 8am" instead of throwing an error.
// These should retry with 5-30 min backoff so workflows can recover when cap resets.
if (matchesBillingTextPattern(content)) {
return {
detected: true,
shouldThrow: new PentestError(
`Billing limit reached: ${content.slice(0, 100)}`,
'billing',
true, // RETRYABLE - Temporal will use 5-30 min backoff
{},
ErrorCode.SPENDING_CAP_REACHED,
),
};
}
// === SESSION LIMIT (Non-retryable) ===
// Different from spending cap - usually means something is fundamentally wrong
if (lowerContent.includes('session limit reached')) {
return {
detected: true,
shouldThrow: new PentestError('Session limit reached', 'billing', false),
};
}
// Non-fatal API errors - detected but continue
if (lowerContent.includes('api error') || lowerContent.includes('terminated')) {
return { detected: true };
}
return { detected: false };
}
// Maps SDK structured error types to our error handling.
function handleStructuredError(errorType: SDKAssistantMessageError, content: string): ApiErrorDetection {
switch (errorType) {
case 'billing_error':
return {
detected: true,
shouldThrow: new PentestError(
`Billing error (structured): ${content.slice(0, 100)}`,
'billing',
true, // Retryable with backoff
{},
ErrorCode.INSUFFICIENT_CREDITS,
),
};
case 'rate_limit':
return {
detected: true,
shouldThrow: new PentestError(
`Rate limit hit (structured): ${content.slice(0, 100)}`,
'network',
true, // Retryable with backoff
{},
ErrorCode.API_RATE_LIMITED,
),
};
case 'authentication_failed':
return {
detected: true,
shouldThrow: new PentestError(
`Authentication failed: ${content.slice(0, 100)}`,
'config',
false, // Not retryable - needs API key fix
),
};
case 'server_error':
return {
detected: true,
shouldThrow: new PentestError(
`Server error (structured): ${content.slice(0, 100)}`,
'network',
true, // Retryable
),
};
case 'invalid_request':
return {
detected: true,
shouldThrow: new PentestError(
`Invalid request: ${content.slice(0, 100)}`,
'config',
false, // Not retryable - needs code fix
),
};
case 'max_output_tokens':
return {
detected: true,
shouldThrow: new PentestError(
`Max output tokens reached: ${content.slice(0, 100)}`,
'billing',
true, // Retryable - may succeed with different content
),
};
default:
return { detected: true };
}
}
function handleAssistantMessage(message: AssistantMessage, turnCount: number): AssistantResult {
const content = extractMessageContent(message);
const cleanedContent = filterJsonToolCalls(content);
// Prefer structured error field from SDK, fall back to text-sniffing
// Use text-only content for error detection to avoid false positives
// from tool_use JSON (e.g. security reports containing "usage limit")
let errorDetection: ApiErrorDetection;
if (message.error) {
errorDetection = handleStructuredError(message.error, content);
} else {
const textOnlyContent = extractTextOnlyContent(message);
errorDetection = detectApiError(textOnlyContent);
}
const result: AssistantResult = {
content,
cleanedContent,
apiErrorDetected: errorDetection.detected,
logData: {
turn: turnCount,
content,
timestamp: formatTimestamp(),
},
};
// Only add shouldThrow if it exists (exactOptionalPropertyTypes compliance)
if (errorDetection.shouldThrow) {
result.shouldThrow = errorDetection.shouldThrow;
}
return result;
}
// Final message of a query with cost/duration info
function handleResultMessage(message: ResultMessage): ResultData {
const result: ResultData = {
result: message.result || null,
cost: message.total_cost_usd || 0,
duration_ms: message.duration_ms || 0,
permissionDenials: message.permission_denials?.length || 0,
};
// Only add subtype if it exists (exactOptionalPropertyTypes compliance)
if (message.subtype) {
result.subtype = message.subtype;
}
// Capture stop_reason for diagnostics (helps debug early stops, budget exceeded, etc.)
if (message.stop_reason !== undefined) {
result.stop_reason = message.stop_reason;
if (message.stop_reason && message.stop_reason !== 'end_turn') {
console.log(` Stop reason: ${message.stop_reason}`);
}
}
if (message.structured_output !== undefined) {
result.structuredOutput = message.structured_output;
}
return result;
}
function handleToolUseMessage(message: ToolUseMessage): ToolUseData {
return {
toolName: message.name,
parameters: message.input || {},
timestamp: formatTimestamp(),
};
}
// Truncates long results for display (500 char limit), preserves full content for logging
function handleToolResultMessage(message: ToolResultMessage): ToolResultData {
const content = message.content;
const contentStr = typeof content === 'string' ? content : JSON.stringify(content, null, 2);
const displayContent =
contentStr.length > 500
? `${contentStr.slice(0, 500)}...\n[Result truncated - ${contentStr.length} total chars]`
: contentStr;
return {
content,
displayContent,
timestamp: formatTimestamp(),
};
}
function outputLines(lines: string[]): void {
for (const line of lines) {
console.log(line);
}
}
export type MessageDispatchAction =
| { type: 'continue'; apiErrorDetected?: boolean | undefined; model?: string | undefined }
| { type: 'complete'; result: string | null; cost: number; structuredOutput?: unknown }
| { type: 'throw'; error: Error };
export interface MessageDispatchDeps {
execContext: ExecutionContext;
description: string;
progress: ProgressManager;
auditLogger: AuditLogger;
logger: ActivityLogger;
}
// Dispatches SDK messages to appropriate handlers and formatters
export async function dispatchMessage(
message: { type: string; subtype?: string },
turnCount: number,
deps: MessageDispatchDeps,
): Promise<MessageDispatchAction> {
const { execContext, description, progress, auditLogger, logger } = deps;
switch (message.type) {
case 'assistant': {
const assistantResult = handleAssistantMessage(message as AssistantMessage, turnCount);
if (assistantResult.shouldThrow) {
return { type: 'throw', error: assistantResult.shouldThrow };
}
if (assistantResult.cleanedContent.trim()) {
progress.stop();
outputLines(formatAssistantOutput(assistantResult.cleanedContent, execContext, turnCount, description));
progress.start();
}
await auditLogger.logLlmResponse(turnCount, assistantResult.content);
if (assistantResult.apiErrorDetected) {
logger.warn('API Error detected in assistant response');
return { type: 'continue', apiErrorDetected: true };
}
return { type: 'continue' };
}
case 'system': {
if (message.subtype === 'init') {
const initMsg = message as SystemInitMessage;
if (!execContext.useCleanOutput) {
logger.info(`Model: ${initMsg.model}, Permission: ${initMsg.permissionMode}`);
}
return { type: 'continue', model: initMsg.model };
}
return { type: 'continue' };
}
case 'user':
case 'tool_progress':
case 'tool_use_summary':
case 'auth_status':
return { type: 'continue' };
case 'tool_use': {
const toolData = handleToolUseMessage(message as unknown as ToolUseMessage);
outputLines(formatToolUseOutput(toolData.toolName, toolData.parameters));
await auditLogger.logToolStart(toolData.toolName, toolData.parameters);
return { type: 'continue' };
}
case 'tool_result': {
const toolResultData = handleToolResultMessage(message as unknown as ToolResultMessage);
outputLines(formatToolResultOutput(toolResultData.displayContent));
await auditLogger.logToolEnd(toolResultData.content);
return { type: 'continue' };
}
case 'result': {
const resultData = handleResultMessage(message as ResultMessage);
outputLines(formatResultOutput(resultData, !execContext.useCleanOutput));
if (resultData.subtype === 'error_max_structured_output_retries') {
return {
type: 'throw',
error: new PentestError(
'Structured output validation failed after max retries',
'validation',
true,
{},
ErrorCode.OUTPUT_VALIDATION_FAILED,
),
};
}
return {
type: 'complete' as const,
result: resultData.result,
cost: resultData.cost,
...(resultData.structuredOutput !== undefined && { structuredOutput: resultData.structuredOutput }),
};
}
default:
logger.info(`Unhandled message type: ${message.type}`);
return { type: 'continue' };
}
}
+147 -8
@@ -5,27 +5,94 @@
// as published by the Free Software Foundation.
/**
* Model tier definitions and resolution.
* Model tier definitions and resolution for the pi harness.
*
* Three tiers mapped to capability levels:
* - "small" (Haiku — summarization, structured extraction)
* - "medium" (Sonnet — tool use, general analysis)
* - "large" (Opus — deep reasoning, complex analysis)
*
* Users override via ANTHROPIC_SMALL_MODEL / ANTHROPIC_MEDIUM_MODEL / ANTHROPIC_LARGE_MODEL,
* which works across all providers (direct, Bedrock, Vertex).
* Users override per tier via ANTHROPIC_SMALL_MODEL / ANTHROPIC_MEDIUM_MODEL /
* ANTHROPIC_LARGE_MODEL, which works across all providers (Anthropic, Bedrock,
* custom base URL).
*
* The active provider is chosen from an injected `providerConfig` (the Pro consumer)
* or, in OSS, from the env-var contract the CLI forwards (`CLAUDE_CODE_USE_BEDROCK`,
* `ANTHROPIC_BASE_URL`+`ANTHROPIC_AUTH_TOKEN`, else direct Anthropic). Resolution
* returns a pi `Model` via `ModelRegistry.find`, the `thinkingLevel`, and an
* `AuthStorage` primed with the right credential. Bedrock authenticates from the
* AWS_ env vars via pi-ai.
*/
import type { ThinkingLevel } from '@earendil-works/pi-agent-core';
import type { Api, Model } from '@earendil-works/pi-ai';
import { AuthStorage, type ModelRegistry } from '@earendil-works/pi-coding-agent';
import type { ProviderConfig } from '../types/config.js';
export type ModelTier = 'small' | 'medium' | 'large';
const DEFAULT_MODELS: Readonly<Record<ModelTier, string>> = {
small: 'claude-haiku-4-5-20251001',
medium: 'claude-sonnet-4-6',
large: 'claude-opus-4-7',
large: 'claude-opus-4-8',
};
/** Resolve a model tier to a concrete model ID. */
export function resolveModel(tier: ModelTier = 'medium'): string {
export interface EffectiveProvider {
/** pi-ai provider id: 'anthropic' or 'amazon-bedrock'. */
providerId: string;
/** Custom-base-URL override applied to the resolved anthropic model. */
baseUrl?: string;
/** Runtime credential to prime on AuthStorage for the 'anthropic' provider. */
anthropicToken?: string;
}
/**
* Determine the active provider + auth.
*
* An explicit `providerConfig` (injected by the Pro consumer) wins; otherwise we
* fall back to the OSS env-var contract the CLI forwards: `CLAUDE_CODE_USE_BEDROCK`
* → Bedrock; `ANTHROPIC_BASE_URL`+`ANTHROPIC_AUTH_TOKEN` → custom base URL; else
* direct Anthropic (`ANTHROPIC_API_KEY`, or `CLAUDE_CODE_OAUTH_TOKEN`). Bedrock
* authenticates from the AWS_ env vars via pi-ai, so it needs no anthropic token.
*/
export function resolveEffectiveProvider(apiKey?: string, providerConfig?: ProviderConfig): EffectiveProvider {
const anthropicKey = apiKey ?? providerConfig?.apiKey ?? process.env.ANTHROPIC_API_KEY;
const type = providerConfig?.providerType;
// Bedrock — explicit providerConfig or the env flag.
if (type === 'bedrock' || (!type && process.env.CLAUDE_CODE_USE_BEDROCK === '1')) {
return { providerId: 'amazon-bedrock' };
}
// Custom base URL — explicit providerConfig.
if (type === 'custom_base_url') {
const eff: EffectiveProvider = { providerId: 'anthropic' };
if (providerConfig?.baseUrl) eff.baseUrl = providerConfig.baseUrl;
const token = providerConfig?.authToken ?? anthropicKey;
if (token) eff.anthropicToken = token;
return eff;
}
// Custom base URL — OSS env contract (no providerConfig).
if (!type && process.env.ANTHROPIC_BASE_URL && process.env.ANTHROPIC_AUTH_TOKEN) {
return {
providerId: 'anthropic',
baseUrl: process.env.ANTHROPIC_BASE_URL,
anthropicToken: process.env.ANTHROPIC_AUTH_TOKEN,
};
}
// Direct Anthropic (API key, or — env only — OAuth token).
const eff: EffectiveProvider = { providerId: 'anthropic' };
const token = anthropicKey ?? (type ? undefined : process.env.CLAUDE_CODE_OAUTH_TOKEN);
if (token) eff.anthropicToken = token;
return eff;
}
/** Resolve a model tier to a concrete model ID (providerConfig override → env override → default). */
export function resolveModelId(tier: ModelTier = 'medium', providerConfig?: ProviderConfig): string {
const override = providerConfig?.modelOverrides?.[tier];
if (override) return override;
switch (tier) {
case 'small':
return process.env.ANTHROPIC_SMALL_MODEL || DEFAULT_MODELS.small;
@@ -36,7 +103,79 @@ export function resolveModel(tier: ModelTier = 'medium'): string {
}
}
/** Whether a model supports adaptive thinking. Opus 4.6 and 4.7 only. */
/** Whether a model supports adaptive thinking. Opus 4.6, 4.7, and 4.8 only. */
export function supportsAdaptiveThinking(model: string): boolean {
return /opus-4-[67]/.test(model);
return /opus-4-[678]/.test(model);
}
/**
* Resolve the thinking level for a run.
*
* Adaptive thinking is enabled only on capable models (Opus 4.6/4.7/4.8), mapped to
* pi's 'medium' level; every other model runs with thinking 'off'. The
* CLAUDE_ADAPTIVE_THINKING=false kill switch forces 'off' regardless of model.
*/
export function resolveThinkingLevel(modelId: string): ThinkingLevel {
if (process.env.CLAUDE_ADAPTIVE_THINKING === 'false') return 'off';
return supportsAdaptiveThinking(modelId) ? 'medium' : 'off';
}
export interface ModelSelection {
model: Model<Api>;
thinkingLevel: ThinkingLevel;
authStorage: AuthStorage;
modelId: string;
providerId: string;
}
/**
* Resolve the active provider (see resolveEffectiveProvider), prime an AuthStorage
* with its credential, and resolve the tier's model from a fresh ModelRegistry.
* Anthropic / custom-base-URL use a runtime anthropic key; Bedrock authenticates
* from the AWS_ env vars (bearer token primed explicitly as a belt-and-suspenders measure).
*/
export function resolveModelSelection(
registryFactory: (authStorage: AuthStorage) => ModelRegistry,
modelTier: ModelTier,
apiKey?: string,
providerConfig?: ProviderConfig,
): ModelSelection {
const eff = resolveEffectiveProvider(apiKey, providerConfig);
const modelId = resolveModelId(modelTier, providerConfig);
const authStorage = AuthStorage.inMemory();
if (eff.providerId === 'anthropic' && eff.anthropicToken) {
authStorage.setRuntimeApiKey('anthropic', eff.anthropicToken);
}
// Bedrock auth flows from the AWS_ env vars; prime the bearer token explicitly so
// it resolves via AuthStorage in addition to pi-ai's own env fallback.
if (eff.providerId === 'amazon-bedrock' && process.env.AWS_BEARER_TOKEN_BEDROCK) {
authStorage.setRuntimeApiKey('amazon-bedrock', process.env.AWS_BEARER_TOKEN_BEDROCK);
}
const registry = registryFactory(authStorage);
const found = registry.find(eff.providerId, modelId);
if (!found) {
throw new Error(`Model not found in pi registry: provider="${eff.providerId}" model="${modelId}"`);
}
// Custom base URL: override the resolved model's endpoint.
const model: Model<Api> = eff.baseUrl ? { ...found, baseUrl: eff.baseUrl } : found;
return {
model,
thinkingLevel: resolveThinkingLevel(modelId),
authStorage,
modelId,
providerId: eff.providerId,
};
}
/**
* Whether a model is in the Fable family. Fable's safety classifiers flag
* cybersecurity tasks and route them to Opus 4.8, so a security scan on Fable
* largely runs on Opus 4.8 anyway.
*/
export function isFableModel(model: string): boolean {
return /fable/i.test(model);
}
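The provider precedence and thinking-level rules in this file can be sketched as two pure functions. This is an illustration under assumptions, not the module's API: `pickProvider`, `thinkingLevelFor`, `SketchConfig`, and the returned labels are hypothetical names, and the environment is passed in as a plain object instead of being read from `process.env` so the rules can be exercised in isolation. Credentials and model lookup are left out.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical, trimmed-down view of ProviderConfig: only the field that drives precedence.
interface SketchConfig {
  providerType?: string;
}

type ProviderChoice = 'amazon-bedrock' | 'anthropic-custom-base-url' | 'anthropic-direct';

// Mirrors the order in resolveEffectiveProvider: an explicit providerType wins;
// env flags are consulted only when no providerType is set.
function pickProvider(env: Env, config?: SketchConfig): ProviderChoice {
  const type = config?.providerType;
  if (type === 'bedrock' || (!type && env.CLAUDE_CODE_USE_BEDROCK === '1')) {
    return 'amazon-bedrock';
  }
  if (type === 'custom_base_url') {
    return 'anthropic-custom-base-url';
  }
  // The env contract needs BOTH the base URL and the auth token.
  if (!type && env.ANTHROPIC_BASE_URL && env.ANTHROPIC_AUTH_TOKEN) {
    return 'anthropic-custom-base-url';
  }
  return 'anthropic-direct';
}

// Mirrors resolveThinkingLevel: the kill switch wins, then the Opus 4.6/4.7/4.8 check.
function thinkingLevelFor(modelId: string, env: Env): 'medium' | 'off' {
  if (env.CLAUDE_ADAPTIVE_THINKING === 'false') return 'off';
  return /opus-4-[678]/.test(modelId) ? 'medium' : 'off';
}

console.log(pickProvider({ CLAUDE_CODE_USE_BEDROCK: '1' }));
console.log(pickProvider({ CLAUDE_CODE_USE_BEDROCK: '1' }, { providerType: 'custom_base_url' }));
console.log(thinkingLevelFor('claude-opus-4-8', {}));
console.log(thinkingLevelFor('claude-sonnet-4-6', {}));
```

The second call shows the point of the `!type` guards: once a `providerType` is injected, the Bedrock env flag no longer affects the outcome.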
+65 -164
@@ -4,36 +4,31 @@
// it under the terms of the GNU Affero General Public License version 3
// as published by the Free Software Foundation.
/**
* Human-readable console formatting for the agent executor.
*
* Driven by the pi harness event stream: `turn_end` (assistant text) and
* `tool_execution_start` (structured tool calls). Unlike the previous harness —
* where tool calls were tool_use JSON embedded in assistant text and had to be
* parsed out — pi delivers tool name + args as discrete events, so formatting is
* a direct mapping.
*/
import { AGENTS } from '../session-manager.js';
import { extractAgentType, formatDuration } from '../utils/formatting.js';
import type { ExecutionContext, ResultData } from './types.js';
import type { ExecutionContext } from './types.js';
interface ToolCallInput {
url?: string;
element?: string;
key?: string;
fields?: unknown[];
text?: string;
action?: string;
description?: string;
command?: string;
todos?: Array<{
status: string;
content: string;
}>;
description?: string;
path?: string;
todos?: Array<{ status: string; content: string }>;
[key: string]: unknown;
}
interface ToolCall {
name: string;
input?: ToolCallInput;
}
/**
* Get agent prefix for parallel execution
*/
/** Agent prefix used to attribute output when parallel agents interleave on one stream. */
export function getAgentPrefix(description: string): string {
// Map agent names to their prefixes
const agentPrefixes: Record<string, string> = {
'injection-vuln': '[Injection]',
'xss-vuln': '[XSS]',
@@ -47,7 +42,6 @@ export function getAgentPrefix(description: string): string {
'ssrf-exploit': '[SSRF]',
};
// First try to match by agent name directly
for (const [agentName, prefix] of Object.entries(agentPrefixes)) {
const agent = AGENTS[agentName as keyof typeof AGENTS];
if (agent && description.includes(agent.displayName)) {
@@ -55,7 +49,6 @@ export function getAgentPrefix(description: string): string {
}
}
// Fallback to partial matches for backwards compatibility
if (description.includes('injection')) return '[Injection]';
if (description.includes('xss')) return '[XSS]';
if (description.includes('authz')) return '[Authz]'; // Check authz before auth
@@ -65,9 +58,7 @@ export function getAgentPrefix(description: string): string {
return '[Agent]';
}
/**
* Extract domain from URL for display
*/
/** Extract domain from URL for display. */
function extractDomain(url: string): string {
try {
const urlObj = new URL(url);
@@ -77,11 +68,8 @@ function extractDomain(url: string): string {
}
}
/**
* Format playwright-cli commands into clean progress indicators
*/
/** Format a playwright-cli command (run via the bash tool) into a clean progress indicator. */
function formatBrowserAction(command: string): string | null {
// Extract subcommand after optional session flag (e.g., "playwright-cli -s=session1 navigate https://example.com")
const match = command.match(/playwright-cli\s+(?:-s=\S+\s+)?(\S+)(?:\s+(.*))?/);
if (!match) return null;
@@ -151,26 +139,19 @@ function formatBrowserAction(command: string): string | null {
}
}
/**
* Summarize TodoWrite updates into clean progress indicators
*/
/** Summarize a todo_write update into a clean progress indicator. */
function summarizeTodoUpdate(input: ToolCallInput | undefined): string | null {
if (!input?.todos || !Array.isArray(input.todos)) {
return null;
}
const todos = input.todos;
const completed = todos.filter((t) => t.status === 'completed');
const inProgress = todos.filter((t) => t.status === 'in_progress');
// Show recently completed tasks
const recent = completed.at(-1);
const recent = todos.filter((t) => t.status === 'completed').at(-1);
if (recent) {
return `${recent.content}`;
}
// Show current in-progress task
const current = inProgress.at(0);
const current = todos.filter((t) => t.status === 'in_progress').at(0);
if (current) {
return `🔄 ${current.content}`;
}
@@ -178,69 +159,6 @@ function summarizeTodoUpdate(input: ToolCallInput | undefined): string | null {
return null;
}
/**
* Filter out JSON tool calls from content, with special handling for Task calls
*/
export function filterJsonToolCalls(content: string | null | undefined): string {
if (!content || typeof content !== 'string') {
return content || '';
}
const lines = content.split('\n');
const processedLines: string[] = [];
for (const line of lines) {
const trimmed = line.trim();
// Skip empty lines
if (trimmed === '') {
continue;
}
// Check if this is a JSON tool call
if (trimmed.startsWith('{"type":"tool_use"')) {
try {
const toolCall = JSON.parse(trimmed) as ToolCall;
// Special handling for Task tool calls
if (toolCall.name === 'Task') {
const description = toolCall.input?.description || 'analysis agent';
processedLines.push(`🚀 Launching ${description}`);
continue;
}
// Special handling for TodoWrite tool calls
if (toolCall.name === 'TodoWrite') {
const summary = summarizeTodoUpdate(toolCall.input);
if (summary) {
processedLines.push(summary);
}
continue;
}
// Special handling for browser tool calls (playwright-cli via Bash)
if (toolCall.name === 'Bash') {
const command = toolCall.input?.command || '';
if (command.includes('playwright-cli')) {
const browserAction = formatBrowserAction(command);
if (browserAction) {
processedLines.push(browserAction);
}
}
}
} catch {
// If JSON parsing fails, treat as regular text
processedLines.push(line);
}
} else {
// Keep non-JSON lines (assistant text)
processedLines.push(line);
}
}
return processedLines.join('\n');
}
export function detectExecutionContext(description: string): ExecutionContext {
const isParallelExecution = description.includes('vuln agent') || description.includes('exploit agent');
@@ -252,62 +170,69 @@ export function detectExecutionContext(description: string): ExecutionContext {
description.includes('exploit agent');
const agentType = extractAgentType(description);
const agentKey = description.toLowerCase().replace(/\s+/g, '-');
return { isParallelExecution, useCleanOutput, agentType, agentKey };
}
/** Format assistant turn text (from a pi `turn_end` event). */
export function formatAssistantOutput(
cleanedContent: string,
text: string,
context: ExecutionContext,
turnCount: number,
description: string,
): string[] {
if (!cleanedContent.trim()) {
if (!text.trim()) {
return [];
}
const lines: string[] = [];
if (context.isParallelExecution) {
// Compact output for parallel agents with prefixes
const prefix = getAgentPrefix(description);
lines.push(`${prefix} ${cleanedContent}`);
} else {
// Full turn output for sequential agents
lines.push(`\n Turn ${turnCount} (${description}):`);
lines.push(` ${cleanedContent}`);
// Compact, attributed output for interleaved parallel agents.
return [`${getAgentPrefix(description)} ${text}`];
}
return lines;
// Full turn output for sequential agents.
return [`\n Turn ${turnCount} (${description}):`, ` ${text}`];
}
export function formatResultOutput(data: ResultData, showFullResult: boolean): string[] {
const lines: string[] = [];
/**
* Format a pi `tool_execution_start` event into a clean one-line progress indicator.
*
* Maps the common tool surfaces — `task` (sub-agent delegation), `todo_write`
* (plan updates), `bash` (incl. playwright-cli browser actions), read-only file
* tools, and the structured collector/submit tools — to friendly lines. Returns
* `[]` when there's nothing worth surfacing (e.g. a todo update with no active item).
*/
export function formatToolCall(
toolName: string,
args: Record<string, unknown> | undefined,
context: ExecutionContext,
description: string,
): string[] {
const input = (args ?? {}) as ToolCallInput;
let line: string | null;
lines.push(`\n COMPLETED:`);
lines.push(` Duration: ${(data.duration_ms / 1000).toFixed(1)}s, Cost: $${data.cost.toFixed(4)}`);
if (data.subtype === 'error_max_turns') {
lines.push(` Stopped: Hit maximum turns limit`);
} else if (data.subtype === 'error_during_execution') {
lines.push(` Stopped: Execution error`);
if (toolName === 'task') {
line = `🚀 Launching ${input.description ?? 'sub-agent'}`;
} else if (toolName === 'todo_write') {
line = summarizeTodoUpdate(input);
} else if (toolName === 'bash') {
const command = typeof input.command === 'string' ? input.command : '';
line = command.includes('playwright-cli') ? formatBrowserAction(command) : `💻 ${command.slice(0, 60)}`;
} else if (toolName === 'read' || toolName === 'grep' || toolName === 'find' || toolName === 'ls') {
const path = typeof input.path === 'string' ? ` ${input.path.slice(0, 60)}` : '';
line = `📖 ${toolName}${path}`;
} else if (toolName.startsWith('set_') || toolName.startsWith('add_') || toolName.startsWith('submit_')) {
line = `📊 ${toolName.replace(/_/g, ' ')}`;
} else {
line = `🔧 ${toolName}`;
}
if (data.permissionDenials > 0) {
lines.push(` ${data.permissionDenials} permission denials`);
}
if (!line) return [];
if (showFullResult && data.result && typeof data.result === 'string') {
if (data.result.length > 1000) {
lines.push(` ${data.result.slice(0, 1000)}... [${data.result.length} total chars]`);
} else {
lines.push(` ${data.result}`);
}
if (context.isParallelExecution) {
return [`${getAgentPrefix(description)} ${line}`];
}
return lines;
return [` ${line}`];
}
export function formatErrorOutput(
@@ -321,12 +246,11 @@ export function formatErrorOutput(
const lines: string[] = [];
if (context.isParallelExecution) {
const prefix = getAgentPrefix(description);
lines.push(`${prefix} Failed (${formatDuration(duration)})`);
lines.push(`${getAgentPrefix(description)} Failed (${formatDuration(duration)})`);
} else if (context.useCleanOutput) {
lines.push(`${context.agentType} failed (${formatDuration(duration)})`);
} else {
lines.push(` Claude Code failed: ${description} (${formatDuration(duration)})`);
lines.push(` pi agent failed: ${description} (${formatDuration(duration)})`);
}
lines.push(` Error Type: ${error.constructor.name}`);
@@ -352,35 +276,12 @@ export function formatCompletionMessage(
duration: number,
): string {
if (context.isParallelExecution) {
const prefix = getAgentPrefix(description);
return `${prefix} Complete (${turnCount} turns, ${formatDuration(duration)})`;
return `${getAgentPrefix(description)} Complete (${turnCount} turns, ${formatDuration(duration)})`;
}
if (context.useCleanOutput) {
return `${context.agentType.charAt(0).toUpperCase() + context.agentType.slice(1)} complete! (${turnCount} turns, ${formatDuration(duration)})`;
}
return ` Claude Code completed: ${description} (${turnCount} turns) in ${formatDuration(duration)}`;
}
export function formatToolUseOutput(toolName: string, input: Record<string, unknown> | undefined): string[] {
const lines: string[] = [];
lines.push(`\n Using Tool: ${toolName}`);
if (input && Object.keys(input).length > 0) {
lines.push(` Input: ${JSON.stringify(input, null, 2)}`);
}
return lines;
}
export function formatToolResultOutput(displayContent: string): string[] {
const lines: string[] = [];
lines.push(` Tool Result:`);
if (displayContent) {
lines.push(` ${displayContent}`);
}
return lines;
return ` pi agent completed: ${description} (${turnCount} turns) in ${formatDuration(duration)}`;
}
+389
@@ -0,0 +1,389 @@
// Copyright (C) 2025 Keygraph, Inc.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License version 3
// as published by the Free Software Foundation.
// Production agent execution on the pi harness, with git checkpoints and audit logging.
import { createRequire } from 'node:module';
import type { AgentMessage } from '@earendil-works/pi-agent-core';
import {
type AgentSessionEvent,
createAgentSession,
DefaultResourceLoader,
getAgentDir,
ModelRegistry,
type ResourceLoader,
SessionManager,
SettingsManager,
type ToolDefinition,
} from '@earendil-works/pi-coding-agent';
import { fs, path } from 'zx';
import type { AuditSession } from '../audit/index.js';
import { BASH_TIMEOUT_EXTENSION_DIR, deliverablesDir, PLAYWRIGHT_SKILL_DIR } from '../paths.js';
import { isRetryableError, PentestError } from '../services/error-handling.js';
import { AGENT_VALIDATORS } from '../session-manager.js';
import type { ActivityLogger } from '../types/activity-logger.js';
import { ErrorCode } from '../types/errors.js';
import { isSpendingCapBehavior, matchesBillingTextPattern } from '../utils/billing-detection.js';
import { formatTimestamp } from '../utils/formatting.js';
import { Timer } from '../utils/metrics.js';
import { createAuditLogger } from './audit-logger.js';
import { type ModelTier, resolveModelSelection } from './models.js';
import {
detectExecutionContext,
formatAssistantOutput,
formatCompletionMessage,
formatErrorOutput,
formatToolCall,
} from './output-formatters.js';
import { createProgressManager } from './progress-manager.js';
import { permissionConfigPath } from './settings-writer.js';
import { createGlobTool, createTaskTool, createTodoWriteTool } from './tools.js';
declare global {
var SHANNON_DISABLE_LOADER: boolean | undefined;
}
/** Built-in pi tools enabled for every agent (custom tool names are appended). */
const BUILTIN_TOOLS = ['read', 'bash', 'edit', 'write', 'grep', 'find', 'ls'];
const requireFromHere = createRequire(import.meta.url);
let cachedExtensionDir: string | null | undefined;
/** Resolve the installed @gotgenes/pi-permission-system package dir, or null. */
function permissionExtensionDir(): string | null {
if (cachedExtensionDir !== undefined) return cachedExtensionDir;
try {
const entry = requireFromHere.resolve('@gotgenes/pi-permission-system');
cachedExtensionDir = path.dirname(path.dirname(entry));
} catch {
cachedExtensionDir = null;
}
return cachedExtensionDir;
}
async function buildResourceLoader(cwd: string, logger: ActivityLogger): Promise<ResourceLoader> {
// Always enforce bounded bash timeouts so an unbounded command cannot hang the agent.
const additionalExtensionPaths: string[] = [BASH_TIMEOUT_EXTENSION_DIR];
if (fs.existsSync(permissionConfigPath())) {
const extDir = permissionExtensionDir();
if (extDir) {
additionalExtensionPaths.push(extDir);
} else {
logger.warn(
'code_path deny config present but @gotgenes/pi-permission-system not resolvable — skipping enforcement',
);
}
}
const loader = new DefaultResourceLoader({
cwd,
agentDir: getAgentDir(),
additionalSkillPaths: [PLAYWRIGHT_SKILL_DIR],
...(additionalExtensionPaths.length > 0 && { additionalExtensionPaths }),
});
await loader.reload();
return loader;
}
export interface PiPromptResult {
result?: string | null | undefined;
success: boolean;
duration: number;
turns?: number | undefined;
cost: number;
model?: string | undefined;
partialCost?: number | undefined;
apiErrorDetected?: boolean | undefined;
error?: string | undefined;
errorType?: string | undefined;
prompt?: string | undefined;
retryable?: boolean | undefined;
structuredOutput?: unknown;
}
function outputLines(lines: string[]): void {
for (const line of lines) {
console.log(line);
}
}
async function writeErrorLog(
err: Error & { code?: string; status?: number },
sourceDir: string,
fullPrompt: string,
duration: number,
): Promise<void> {
try {
const errorLog = {
timestamp: formatTimestamp(),
agent: 'pi-executor',
error: { name: err.constructor.name, message: err.message, code: err.code, status: err.status, stack: err.stack },
context: { sourceDir, prompt: `${fullPrompt.slice(0, 200)}...`, retryable: isRetryableError(err) },
duration,
};
const logPath = path.join(deliverablesDir(sourceDir), 'error.log');
await fs.appendFile(logPath, `${JSON.stringify(errorLog)}\n`);
} catch {
// Best-effort error log writing - don't propagate failures
}
}
export async function validateAgentOutput(
result: PiPromptResult,
agentName: string | null,
sourceDir: string,
logger: ActivityLogger,
): Promise<boolean> {
logger.info(`Validating ${agentName} agent output`);
try {
if (!result.success || (!result.result && result.structuredOutput === undefined)) {
logger.error('Validation failed: Agent execution was unsuccessful');
return false;
}
const validator = agentName ? AGENT_VALIDATORS[agentName as keyof typeof AGENT_VALIDATORS] : undefined;
if (!validator) {
logger.warn(`No validator found for agent "${agentName}" - assuming success`);
return true;
}
logger.info(`Using validator for agent: ${agentName}`, { sourceDir });
const validationResult = await validator(sourceDir, logger);
if (validationResult) {
logger.info('Validation passed: Required files/structure present');
} else {
logger.error('Validation failed: Missing required deliverable files');
}
return validationResult;
} catch (error) {
const errMsg = error instanceof Error ? error.message : String(error);
logger.error(`Validation failed with error: ${errMsg}`);
return false;
}
}
/** Concatenate the text blocks of an assistant message (skips thinking + tool calls). */
function extractAssistantText(message: AgentMessage): string {
if (message.role !== 'assistant') return '';
const blocks = message.content as Array<{ type: string; text?: string }>;
return blocks
.filter((c) => c.type === 'text')
.map((c) => c.text ?? '')
.join('\n');
}
/**
* Classify error-bearing text into a PentestError, mirroring the prior provider error
* handling. Spending-cap / billing text is retryable (Temporal backs off and
* recovers when the cap resets); session limit is permanent.
*/
function classifyErrorText(content: string): PentestError | null {
if (!content) return null;
if (matchesBillingTextPattern(content)) {
return new PentestError(
`Billing limit reached: ${content.slice(0, 100)}`,
'billing',
true,
{},
ErrorCode.SPENDING_CAP_REACHED,
);
}
if (content.toLowerCase().includes('session limit reached')) {
return new PentestError('Session limit reached', 'billing', false);
}
return null;
}
// Low-level pi execution. Drives one agent session to completion with progress and
// audit logging. Exported so Temporal activities can run a single attempt directly.
export async function runPiPrompt(
prompt: string,
sourceDir: string,
context: string = '',
description: string = 'Agent analysis',
_agentName: string | null = null,
auditSession: AuditSession | null = null,
logger: ActivityLogger,
modelTier: ModelTier = 'medium',
callerTools?: ToolDefinition[],
apiKey?: string,
deliverablesSubdir?: string,
providerConfig?: import('../types/config.js').ProviderConfig,
): Promise<PiPromptResult> {
// 1. Initialize timing and prompt
const timer = new Timer(`agent-${description.toLowerCase().replace(/\s+/g, '-')}`);
const fullPrompt = context ? `${context}\n\n${prompt}` : prompt;
// 2. Set up progress and audit infrastructure
const execContext = detectExecutionContext(description);
const progress = createProgressManager(
{ description, useCleanOutput: execContext.useCleanOutput },
global.SHANNON_DISABLE_LOADER ?? false,
);
const auditLogger = createAuditLogger(auditSession);
logger.info(`Running pi agent: ${description}...`);
// 3. Expose bash-invoked CLI tooling (playwright-cli, save-deliverable) to the
// environment pi's bash tool inherits. These are constant per container, so
// setting them on process.env is parallel-safe across this workflow's agents.
process.env.PLAYWRIGHT_MCP_OUTPUT_DIR = deliverablesSubdir
? path.join(sourceDir, path.dirname(deliverablesSubdir), '.playwright-cli')
: path.join(sourceDir, '.shannon', '.playwright-cli');
if (deliverablesSubdir) process.env.SHANNON_DELIVERABLES_SUBDIR = deliverablesSubdir;
if (apiKey) process.env.ANTHROPIC_API_KEY = apiKey;
// 4. Resolve model + auth, then assemble the tool set (universal task/todo tools
// plus any caller-supplied collector/submit tools).
const selection = resolveModelSelection((auth) => ModelRegistry.create(auth), modelTier, apiKey, providerConfig);
const resourceLoader = await buildResourceLoader(sourceDir, logger);
// Accumulates cost from in-process `task` child sessions so the parent's reported
// cost includes sub-agent spend (their getSessionStats is separate from ours).
const childUsage = { cost: 0 };
const customTools: ToolDefinition[] = [
createTaskTool({
model: selection.model,
thinkingLevel: selection.thinkingLevel,
authStorage: selection.authStorage,
cwd: sourceDir,
childUsage,
resourceLoader,
}),
createTodoWriteTool(auditLogger),
createGlobTool(sourceDir),
...(callerTools ?? []),
];
// pi's `tools` allowlist gates custom tools too — list every custom name.
const tools = [...BUILTIN_TOOLS, ...customTools.map((t) => t.name)];
let turnCount = 0;
let pendingError: PentestError | null = null;
let apiErrorDetected = false;
progress.start();
try {
const { session } = await createAgentSession({
cwd: sourceDir,
model: selection.model,
thinkingLevel: selection.thinkingLevel,
tools,
customTools,
authStorage: selection.authStorage,
sessionManager: SessionManager.inMemory(),
// Temporal owns retries, so pi's built-in retry is disabled. Compaction stays
// enabled to guard against context overflow on long agent runs (the previous
// harness had no equivalent setting).
settingsManager: SettingsManager.inMemory({ retry: { enabled: false }, compaction: { enabled: true } }),
resourceLoader,
});
// 5. Map pi events to audit logging + progress + error capture.
session.subscribe((event: AgentSessionEvent) => {
switch (event.type) {
case 'turn_end': {
turnCount += 1;
const msg = event.message;
const text = extractAssistantText(msg);
if (text.trim()) {
void auditLogger.logLlmResponse(turnCount, text);
progress.stop();
outputLines(formatAssistantOutput(text, execContext, turnCount, description));
progress.start();
const billing = classifyErrorText(text);
if (billing) pendingError = billing;
}
if (msg.role === 'assistant' && msg.stopReason === 'error') {
apiErrorDetected = true;
pendingError =
pendingError ??
classifyErrorText(msg.errorMessage ?? '') ??
new PentestError(`Agent error: ${(msg.errorMessage ?? 'unknown').slice(0, 200)}`, 'unknown', true);
}
break;
}
case 'tool_execution_start': {
void auditLogger.logToolStart(event.toolName, event.args);
const toolLines = formatToolCall(
event.toolName,
event.args as Record<string, unknown>,
execContext,
description,
);
if (toolLines.length > 0) {
progress.stop();
outputLines(toolLines);
progress.start();
}
break;
}
case 'tool_execution_end':
void auditLogger.logToolEnd(event.result);
break;
case 'compaction_end':
if (!event.aborted && !event.willRetry && event.errorMessage) {
pendingError =
pendingError ??
classifyErrorText(event.errorMessage) ??
new PentestError(`Context compaction failed: ${event.errorMessage.slice(0, 200)}`, 'unknown', true);
}
break;
default:
break;
}
});
// 6. Run the agent to completion (resolves at agent_end).
await session.prompt(fullPrompt);
// 7. Read usage/cost and final text before disposing the session, matching the
// order used for child sessions in the task tool.
const stats = session.getSessionStats();
const totalCost = stats.cost + childUsage.cost;
const result = session.getLastAssistantText() ?? null;
session.dispose();
// 8. Surface any error captured during the run.
if (pendingError) throw pendingError;
// 9. Defense-in-depth: detect a spending cap that produced an empty/cheap run.
if (isSpendingCapBehavior(turnCount, totalCost, result || '')) {
throw new PentestError(
`Spending cap likely reached (turns=${turnCount}, cost=$0): ${result?.slice(0, 100)}`,
'billing',
true,
);
}
const duration = timer.stop();
progress.finish(formatCompletionMessage(execContext, description, turnCount, duration));
return {
result,
success: true,
duration,
turns: turnCount,
cost: totalCost,
model: selection.model.id,
partialCost: totalCost,
apiErrorDetected,
};
} catch (error) {
// 10. Handle errors — log, write error file, return failure
const duration = timer.stop();
const err = error as Error & { code?: string; status?: number };
await auditLogger.logError(err, duration, turnCount);
progress.stop();
outputLines(formatErrorOutput(err, execContext, description, duration, sourceDir, isRetryableError(err)));
await writeErrorLog(err, sourceDir, fullPrompt, duration);
return {
error: err.message,
errorType: err.constructor.name,
prompt: `${fullPrompt.slice(0, 100)}...`,
success: false,
duration,
cost: 0,
retryable: isRetryableError(err),
};
}
}
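The event handler in `runPiPrompt` keeps the first error it sees: an already-recorded error wins, then a classified billing or session-limit error, then a generic retryable fallback. The sketch below isolates that precedence chain. It is only an illustration: `SketchError`, `BILLING_PATTERN`, `classify`, and `recordError` are hypothetical names, and the regular expression is an assumed stand-in for `matchesBillingTextPattern`, whose real patterns are not shown in this diff.

```typescript
// Hypothetical, simplified error shape (the real code uses PentestError).
interface SketchError {
  message: string;
  kind: 'billing' | 'unknown';
  retryable: boolean;
}

// Assumption: a stand-in for matchesBillingTextPattern; the real patterns differ.
const BILLING_PATTERN = /spending cap|billing limit/i;

// Classify error-bearing text; returns null when nothing is recognised.
function classify(text: string): SketchError | null {
  if (!text) return null;
  if (BILLING_PATTERN.test(text)) {
    // Spending-cap text is retryable: the caller backs off until the cap resets.
    return { message: `Billing limit reached: ${text.slice(0, 100)}`, kind: 'billing', retryable: true };
  }
  if (text.toLowerCase().includes('session limit reached')) {
    // Session limits are treated as permanent.
    return { message: 'Session limit reached', kind: 'billing', retryable: false };
  }
  return null;
}

// First error wins: pending, then classified, then a generic retryable fallback.
function recordError(pending: SketchError | null, errorMessage: string | undefined): SketchError {
  return (
    pending ??
    classify(errorMessage ?? '') ??
    { message: `Agent error: ${(errorMessage ?? 'unknown').slice(0, 200)}`, kind: 'unknown', retryable: true }
  );
}

console.log(recordError(null, 'Session limit reached for this account'));
```

Because the chain uses `??`, a later generic API error never overwrites a billing error captured earlier in the same run.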
+127 -183
@@ -5,196 +5,114 @@
// as published by the Free Software Foundation.
/**
* Zod schema definitions for vulnerability exploitation queue structured outputs.
* TypeBox schemas + submit-tool factory for vulnerability exploitation queues.
*
* Each vuln agent returns a structured JSON response matching its schema.
* The SDK validates the output against the JSON Schema generated from these Zod definitions.
* pi has no JSON-schema output format, so each vuln agent's structured queue is
* captured via a `submit_exploitation_queue` custom tool whose parameters mirror
* the per-class schema below. The captured payload is written to
* `<class>_exploitation_queue.json` by the caller (agent-execution).
*/
import type { JsonSchemaOutputFormat } from '@anthropic-ai/claude-agent-sdk';
import { z } from 'zod';
import { defineTool, type ToolDefinition } from '@earendil-works/pi-coding-agent';
import { type Static, type TObject, Type } from 'typebox';
import type { AgentName } from '../types/agents.js';
// === Common Fields ===
const ANALYSIS_NOTES_DESCRIPTION = 'Plain context for defenders (caveats, scope, what is at risk). Not attack steps.';
function notesField(exploit: boolean) {
const f = z.string().optional();
return exploit ? f : f.describe(ANALYSIS_NOTES_DESCRIPTION);
}
const optStr = (description?: string) => Type.Optional(Type.String(description ? { description } : {}));
function makeBase(exploit: boolean) {
return z.object({
ID: z.string(),
vulnerability_type: z.string(),
externally_exploitable: z.boolean(),
confidence: z.string(),
notes: notesField(exploit),
});
}
// === Per-Vuln-Type Schemas (used for type inference; notes description is mode-agnostic for types) ===
const baseVulnerability = makeBase(true);
const InjectionVulnerability = baseVulnerability.extend({
source: z.string().optional(),
combined_sources: z.string().optional(),
path: z.string().optional(),
sink_call: z.string().optional(),
slot_type: z.string().optional(),
sanitization_observed: z.string().optional(),
concat_occurrences: z.string().optional(),
verdict: z.string().optional(),
mismatch_reason: z.string().optional(),
witness_payload: z.string().optional(),
});
const XssVulnerability = baseVulnerability.extend({
source: z.string().optional(),
source_detail: z.string().optional(),
path: z.string().optional(),
sink_function: z.string().optional(),
render_context: z.string().optional(),
encoding_observed: z.string().optional(),
verdict: z.string().optional(),
mismatch_reason: z.string().optional(),
witness_payload: z.string().optional(),
});
const AuthVulnerability = baseVulnerability.extend({
source_endpoint: z.string().optional(),
vulnerable_code_location: z.string().optional(),
missing_defense: z.string().optional(),
exploitation_hypothesis: z.string().optional(),
suggested_exploit_technique: z.string().optional(),
});
const SsrfVulnerability = baseVulnerability.extend({
source_endpoint: z.string().optional(),
vulnerable_parameter: z.string().optional(),
vulnerable_code_location: z.string().optional(),
missing_defense: z.string().optional(),
exploitation_hypothesis: z.string().optional(),
suggested_exploit_technique: z.string().optional(),
});
const AuthzVulnerability = baseVulnerability.extend({
endpoint: z.string().optional(),
vulnerable_code_location: z.string().optional(),
role_context: z.string().optional(),
guard_evidence: z.string().optional(),
side_effect: z.string().optional(),
reason: z.string().optional(),
minimal_witness: z.string().optional(),
});
// === Inferred Entry Types (consumed by renderer) ===
export type InjectionFinding = z.infer<typeof InjectionVulnerability>;
export type XssFinding = z.infer<typeof XssVulnerability>;
export type AuthFinding = z.infer<typeof AuthVulnerability>;
export type SsrfFinding = z.infer<typeof SsrfVulnerability>;
export type AuthzFinding = z.infer<typeof AuthzVulnerability>;
// === Convert to JSON Schema for SDK ===
// NOTE: The SDK's AJV validator expects draft-07. Zod defaults to draft-2020-12 which
// causes the SDK to silently skip structured output.
function toOutputFormat(zodSchema: z.ZodType): JsonSchemaOutputFormat {
return { type: 'json_schema', schema: z.toJSONSchema(zodSchema, { target: 'draft-07' }) as Record<string, unknown> };
}
// === Per-Mode Output Format Builders ===
// Two maps cached at module load; the only per-mode difference is the
// description on the `notes` field, which steers the LLM's writing.
function buildOutputFormats(exploit: boolean): Partial<Record<AgentName, JsonSchemaOutputFormat>> {
const base = makeBase(exploit);
/** Base fields shared by every queue entry. `notes` gains guidance in analysis mode. */
function baseFields(exploit: boolean) {
return {
'injection-vuln': toOutputFormat(
z.object({
vulnerabilities: z.array(
base.extend({
source: z.string().optional(),
combined_sources: z.string().optional(),
path: z.string().optional(),
sink_call: z.string().optional(),
slot_type: z.string().optional(),
sanitization_observed: z.string().optional(),
concat_occurrences: z.string().optional(),
verdict: z.string().optional(),
mismatch_reason: z.string().optional(),
witness_payload: z.string().optional(),
}),
),
}),
),
'xss-vuln': toOutputFormat(
z.object({
vulnerabilities: z.array(
base.extend({
source: z.string().optional(),
source_detail: z.string().optional(),
path: z.string().optional(),
sink_function: z.string().optional(),
render_context: z.string().optional(),
encoding_observed: z.string().optional(),
verdict: z.string().optional(),
mismatch_reason: z.string().optional(),
witness_payload: z.string().optional(),
}),
),
}),
),
'auth-vuln': toOutputFormat(
z.object({
vulnerabilities: z.array(
base.extend({
source_endpoint: z.string().optional(),
vulnerable_code_location: z.string().optional(),
missing_defense: z.string().optional(),
exploitation_hypothesis: z.string().optional(),
suggested_exploit_technique: z.string().optional(),
}),
),
}),
),
'ssrf-vuln': toOutputFormat(
z.object({
vulnerabilities: z.array(
base.extend({
source_endpoint: z.string().optional(),
vulnerable_parameter: z.string().optional(),
vulnerable_code_location: z.string().optional(),
missing_defense: z.string().optional(),
exploitation_hypothesis: z.string().optional(),
suggested_exploit_technique: z.string().optional(),
}),
),
}),
),
'authz-vuln': toOutputFormat(
z.object({
vulnerabilities: z.array(
base.extend({
endpoint: z.string().optional(),
vulnerable_code_location: z.string().optional(),
role_context: z.string().optional(),
guard_evidence: z.string().optional(),
side_effect: z.string().optional(),
reason: z.string().optional(),
minimal_witness: z.string().optional(),
}),
),
}),
),
ID: Type.String(),
vulnerability_type: Type.String(),
externally_exploitable: Type.Boolean(),
confidence: Type.String(),
notes: exploit ? optStr() : optStr(ANALYSIS_NOTES_DESCRIPTION),
};
}
const OUTPUT_FORMATS_EXPLOIT = buildOutputFormats(true);
const OUTPUT_FORMATS_ANALYSIS = buildOutputFormats(false);
const injectionFields = {
source: optStr(),
combined_sources: optStr(),
path: optStr(),
sink_call: optStr(),
slot_type: optStr(),
sanitization_observed: optStr(),
concat_occurrences: optStr(),
verdict: optStr(),
mismatch_reason: optStr(),
witness_payload: optStr(),
};
const xssFields = {
source: optStr(),
source_detail: optStr(),
path: optStr(),
sink_function: optStr(),
render_context: optStr(),
encoding_observed: optStr(),
verdict: optStr(),
mismatch_reason: optStr(),
witness_payload: optStr(),
};
const authFields = {
source_endpoint: optStr(),
vulnerable_code_location: optStr(),
missing_defense: optStr(),
exploitation_hypothesis: optStr(),
suggested_exploit_technique: optStr(),
};
const ssrfFields = {
source_endpoint: optStr(),
vulnerable_parameter: optStr(),
vulnerable_code_location: optStr(),
missing_defense: optStr(),
exploitation_hypothesis: optStr(),
suggested_exploit_technique: optStr(),
};
const authzFields = {
endpoint: optStr(),
vulnerable_code_location: optStr(),
role_context: optStr(),
guard_evidence: optStr(),
side_effect: optStr(),
reason: optStr(),
minimal_witness: optStr(),
};
const PER_TYPE_FIELDS: Partial<Record<AgentName, Record<string, ReturnType<typeof optStr>>>> = {
'injection-vuln': injectionFields,
'xss-vuln': xssFields,
'auth-vuln': authFields,
'ssrf-vuln': ssrfFields,
'authz-vuln': authzFields,
};
/** Build the `{ vulnerabilities: [...] }` queue schema for an agent + mode. */
function queueSchema(agentName: AgentName, exploit: boolean): TObject | null {
const extra = PER_TYPE_FIELDS[agentName];
if (!extra) return null;
return Type.Object({
vulnerabilities: Type.Array(Type.Object({ ...baseFields(exploit), ...extra })),
});
}
// === Inferred entry types (consumed by renderers) ===
export type InjectionFinding = Static<ReturnType<typeof injectionEntry>>;
export type XssFinding = Static<ReturnType<typeof xssEntry>>;
export type AuthFinding = Static<ReturnType<typeof authEntry>>;
export type SsrfFinding = Static<ReturnType<typeof ssrfEntry>>;
export type AuthzFinding = Static<ReturnType<typeof authzEntry>>;
const injectionEntry = () => Type.Object({ ...baseFields(true), ...injectionFields });
const xssEntry = () => Type.Object({ ...baseFields(true), ...xssFields });
const authEntry = () => Type.Object({ ...baseFields(true), ...authFields });
const ssrfEntry = () => Type.Object({ ...baseFields(true), ...ssrfFields });
const authzEntry = () => Type.Object({ ...baseFields(true), ...authzFields });
const VULN_AGENT_QUEUE_FILENAMES: Partial<Record<AgentName, string>> = {
'injection-vuln': 'injection_exploitation_queue.json',
@@ -204,12 +122,38 @@ const VULN_AGENT_QUEUE_FILENAMES: Partial<Record<AgentName, string>> = {
'authz-vuln': 'authz_exploitation_queue.json',
};
/** Returns the structured output format for a vuln agent, or undefined for non-vuln agents. */
export function getOutputFormat(agentName: AgentName, exploit = true): JsonSchemaOutputFormat | undefined {
return (exploit ? OUTPUT_FORMATS_EXPLOIT : OUTPUT_FORMATS_ANALYSIS)[agentName];
}
/** Returns the queue filename for a vuln agent, or undefined for non-vuln agents. */
export function getQueueFilename(agentName: AgentName): string | undefined {
return VULN_AGENT_QUEUE_FILENAMES[agentName];
}
export interface QueueSubmitTool {
tool: ToolDefinition;
getCaptured: () => unknown;
}
/**
* Build the `submit_exploitation_queue` tool for a vuln agent, or null for
* non-vuln agents. The agent calls it once with the full findings list; the
* captured payload is the structured queue.
*/
export function createQueueSubmitTool(agentName: AgentName, exploit: boolean): QueueSubmitTool | null {
const schema = queueSchema(agentName, exploit);
if (!schema) return null;
let captured: unknown;
const tool = defineTool({
name: 'submit_exploitation_queue',
label: 'Submit Exploitation Queue',
description:
'Submit the final structured list of analyzed vulnerabilities for this class. Call exactly once when ' +
'analysis is complete, with every finding included.',
promptSnippet: 'submit_exploitation_queue: record the final structured findings list (call once)',
parameters: schema,
execute: async (_toolCallId, params) => {
captured = params;
const count = (params as { vulnerabilities?: unknown[] }).vulnerabilities?.length ?? 0;
return { content: [{ type: 'text' as const, text: `Recorded ${count} findings.` }], details: {} };
},
});
return { tool, getCaptured: () => captured };
}
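`createQueueSubmitTool` relies on a closure: the tool's `execute` stores the parameters it receives, and the caller later reads them through `getCaptured`. The sketch below shows that capture pattern without the pi or TypeBox dependencies. `SketchTool`, `QueuePayload`, and `createCapturingTool` are hypothetical names used only for illustration, and the real `ToolDefinition` carries more fields than shown here.

```typescript
// Hypothetical minimal tool shape; pi's ToolDefinition has more fields.
interface SketchTool<P> {
  name: string;
  execute: (params: P) => Promise<string>;
}

// Trimmed-down payload for the sketch; the real entries carry many more fields.
interface QueuePayload {
  vulnerabilities: Array<{ ID: string; confidence: string }>;
}

function createCapturingTool(): {
  tool: SketchTool<QueuePayload>;
  getCaptured: () => QueuePayload | undefined;
} {
  // Closed-over slot: written by the tool, read by the caller after the run.
  let captured: QueuePayload | undefined;
  const tool: SketchTool<QueuePayload> = {
    name: 'submit_exploitation_queue',
    execute: async (params) => {
      captured = params; // a second call would overwrite the first
      return `Recorded ${params.vulnerabilities.length} findings.`;
    },
  };
  return { tool, getCaptured: () => captured };
}

const queue = createCapturingTool();
void queue.tool.execute({ vulnerabilities: [{ ID: 'INJ-1', confidence: 'high' }] });
console.log(queue.getCaptured());
```

If the agent never calls the tool, `getCaptured()` stays `undefined`, so a caller can distinguish "no submission" from an empty findings list.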
+51 -17
@@ -5,37 +5,71 @@
// as published by the Free Software Foundation.
/**
* Writes ~/.claude/settings.json with permissions.deny rules derived from
* `code_path` avoid patterns. The SDK reads this via `settingSources: ['user']`;
* deny rules fire even in `bypassPermissions` mode.
* Writes the @gotgenes/pi-permission-system global config from `code_path` avoid
* patterns. The executor loads the extension (see pi-executor) and pi enforces
* these path denies at the tool layer for every agent. Written to the global config
* dir under `agentDir` — the project-scoped path is gated behind project trust,
* which our headless runs do not grant; the global path is not.
*/
import os from 'node:os';
import { getAgentDir } from '@earendil-works/pi-coding-agent';
import { fs, path } from 'zx';
import type { DistributedConfig } from '../types/config.js';
const FILE_TOOLS = ['Read', 'Edit'] as const;
function denyEntriesFor(pattern: string): string[] {
const arg = `./${pattern.replace(/^[./]+/, '')}`;
return FILE_TOOLS.map((tool) => `${tool}(${arg})`);
/** Absolute path to the pi-permission-system global config.json. */
export function permissionConfigPath(): string {
return path.join(getAgentDir(), 'extensions', 'pi-permission-system', 'config.json');
}
export async function writeUserSettingsForCodePathAvoids(config: DistributedConfig | null): Promise<void> {
/**
* Write (or remove) the pi-permission-system config derived from `code_path`
* avoid patterns.
*
* Each avoid maps to a cross-cutting `path` deny — the strongest surface, blocking
* the path across every tool and bash command, and not overridable by a per-tool
* allow. `"*": "allow"` keeps everything else permitted so the extension does not
* fall back to its default `ask` (which would block all access headlessly). When
* there are no avoids the config is removed, so the executor skips loading the
* extension entirely.
*/
export async function writeCodePathPermissionConfig(config: DistributedConfig | null): Promise<void> {
const avoidPatterns = (config?.avoid ?? []).filter((r) => r.type === 'code_path').map((r) => r.value);
const settingsPath = path.join(os.homedir(), '.claude', 'settings.json');
const configPath = permissionConfigPath();
if (avoidPatterns.length === 0) {
await fs.remove(settingsPath);
await fs.remove(configPath);
return;
}
const settings = {
permissions: {
deny: avoidPatterns.flatMap(denyEntriesFor),
// pi's matcher (wildcard-matcher.ts) has NO `**` globstar — it splits on each `*`
// and joins with `.*`, and a single `*` already matches any chars incl. `/`. Tool
// paths are compared as absolute (path-utils resolves them against cwd), so we
// collapse `**`→`*` and add a `*/`-prefixed variant that matches the path under
// any repo prefix. (A bare pattern never matches an absolute path.)
const pathDeny: Record<string, 'allow' | 'deny'> = { '*': 'allow' };
for (const pattern of avoidPatterns) {
const clean = pattern.replace(/^[./]+/, '').replace(/\*\*/g, '*');
// Deny the contents (under any repo prefix and as written)...
pathDeny[`*/${clean}`] = 'deny';
pathDeny[clean] = 'deny';
// ...and the folder path itself, so the directory entry is denied too — the
// contents patterns (…/*) require a trailing segment and wouldn't match it.
if (clean.endsWith('/*')) {
const folder = clean.slice(0, -2);
if (folder) {
pathDeny[`*/${folder}`] = 'deny';
pathDeny[folder] = 'deny';
}
}
}
const permissionConfig = {
permission: {
'*': 'allow',
path: pathDeny,
},
};
await fs.ensureDir(path.dirname(settingsPath));
await fs.writeJson(settingsPath, settings, { spaces: 2 });
await fs.ensureDir(path.dirname(configPath));
await fs.writeJson(configPath, permissionConfig, { spaces: 2 });
}
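The pattern rewriting in `writeCodePathPermissionConfig` is easier to follow with a concrete input. The sketch below repeats the same derivation without file I/O and pairs it with a small matcher. Note the assumptions: `wildcardToRegExp` and `isDenied` are hypothetical helpers that follow the behaviour described in the comment above (split on `*`, join with `.*`), not the extension's actual code, and the "any matching deny wins" evaluation order is assumed for the sketch.

```typescript
type Decision = 'allow' | 'deny';

// Hypothetical stand-in for the extension's matcher: split on `*`, escape each
// literal piece, join with `.*`, and anchor the whole expression.
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern.split('*').map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  return new RegExp(`^${escaped.join('.*')}$`);
}

// Same derivation as writeCodePathPermissionConfig, minus the file writes.
function buildPathDeny(avoidPatterns: string[]): Record<string, Decision> {
  const pathDeny: Record<string, Decision> = { '*': 'allow' };
  for (const pattern of avoidPatterns) {
    // Strip leading "./" characters and collapse `**` into a single `*`.
    const clean = pattern.replace(/^[./]+/, '').replace(/\*\*/g, '*');
    pathDeny[`*/${clean}`] = 'deny';
    pathDeny[clean] = 'deny';
    // Also deny the folder entry itself when the pattern targets its contents.
    if (clean.endsWith('/*')) {
      const folder = clean.slice(0, -2);
      if (folder) {
        pathDeny[`*/${folder}`] = 'deny';
        pathDeny[folder] = 'deny';
      }
    }
  }
  return pathDeny;
}

// Assumed evaluation order for this sketch: any matching deny beats the catch-all allow.
function isDenied(absPath: string, rules: Record<string, Decision>): boolean {
  return Object.entries(rules).some(
    ([pattern, decision]) => decision === 'deny' && wildcardToRegExp(pattern).test(absPath),
  );
}

const rules = buildPathDeny(['./secrets/**']);
console.log(Object.keys(rules));
console.log(isDenied('/repo/secrets/key.pem', rules), isDenied('/repo/src/app.ts', rules));
```

For `./secrets/**` this yields deny keys `*/secrets/*`, `secrets/*`, `*/secrets`, and `secrets` alongside the `*` allow. One thing the sketch makes visible: the leading-character strip `/^[./]+/` also removes the dot from dotfile patterns, so `.env` becomes `env`, and under the matcher assumed here that rule would not match `/repo/.env`. Whether the real matcher behaves the same way is not confirmed by this diff.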
+205
@@ -0,0 +1,205 @@
// Copyright (C) 2025 Keygraph, Inc.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License version 3
// as published by the Free Software Foundation.
/**
* Universal custom tools registered for every agent: `task`, `todo_write`, and `glob`.
*
* These replace harness built-ins that pi does not ship. `task` delegates a focused
* sub-task to an in-process child session (the Task sub-agent replacement);
* `todo_write` is a full-state-replace planning scratchpad mirrored to the workflow
* log; `glob` is fast-glob file matching (pi has no `Glob` built-in).
*/
import type { ThinkingLevel } from '@earendil-works/pi-agent-core';
import type { Api, Model } from '@earendil-works/pi-ai';
import {
type AuthStorage,
createAgentSession,
defineTool,
type ResourceLoader,
SessionManager,
SettingsManager,
type ToolDefinition,
} from '@earendil-works/pi-coding-agent';
import { Type } from 'typebox';
import { fs, glob, path } from 'zx';
import type { AuditLogger } from './audit-logger.js';
/** Tool surface for child sessions: read/search plus `write`+`bash` to author and run scripts. */
const CHILD_TOOLS = ['read', 'grep', 'find', 'ls', 'write', 'bash'];
export interface TaskToolContext {
model: Model<Api>;
thinkingLevel: ThinkingLevel;
authStorage: AuthStorage;
cwd: string;
/** When set, child sessions inherit the code_path deny policy. */
resourceLoader?: ResourceLoader;
/**
* Mutable accumulator: each child (sub-agent) session's cost is added here so the
* parent executor can include sub-agent spend in its reported cost. Child sessions
* keep their own `getSessionStats`, separate from the parent's.
*/
childUsage?: { cost: number };
}
/**
* The `task` tool — launch a new agent to handle a multi-step task autonomously.
*
* Spawns an in-process child session, drives it to completion, and returns its
* final text. Marked `parallel` for one-turn fan-out. Children get no `task` of
* their own — delegation is one level.
*/
export function createTaskTool(ctx: TaskToolContext): ToolDefinition {
  return defineTool({
    name: 'task',
    label: 'Task',
    description:
      'Launch a new agent to handle complex, multi-step tasks autonomously. The agent runs on its own and ' +
      'its final report is returned to you as the tool result (it is not shown to the user). Each invocation ' +
      'is stateless — you cannot send follow-up messages, so give a complete, detailed instruction in a single ' +
      'prompt and specify exactly what information the agent should return. Launch multiple agents concurrently ' +
      'by issuing multiple task calls in a single message.',
    promptSnippet: 'task: launch a new agent to handle a multi-step task',
    executionMode: 'parallel',
    parameters: Type.Object({
      description: Type.Optional(Type.String({ description: 'Short (3-5 word) label for the delegated sub-task.' })),
      prompt: Type.String({ description: 'The full instruction for the sub-agent.' }),
    }),
    execute: async (_toolCallId, params) => {
      const { session: child } = await createAgentSession({
        cwd: ctx.cwd,
        model: ctx.model,
        thinkingLevel: ctx.thinkingLevel,
        tools: CHILD_TOOLS,
        authStorage: ctx.authStorage,
        sessionManager: SessionManager.inMemory(),
        settingsManager: SettingsManager.inMemory({
          retry: { enabled: false },
          compaction: { enabled: true },
        }),
        ...(ctx.resourceLoader && { resourceLoader: ctx.resourceLoader }),
      });
      try {
        await child.prompt(params.prompt);
        const text = child.getLastAssistantText() ?? '(sub-agent produced no output)';
        return { content: [{ type: 'text' as const, text }], details: {} };
      } finally {
        // Roll the child's cost up to the parent before disposing (best-effort, and
        // captured in `finally` so a failed child's partial spend still counts).
        if (ctx.childUsage) {
          try {
            ctx.childUsage.cost += child.getSessionStats().cost;
          } catch {
            // ignore — cost capture is best-effort
          }
        }
        child.dispose();
      }
    },
  });
}
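The `finally` block above is what lets a failed child still report its spend. A minimal sketch of that pattern follows, simplified to synchronous calls so the flow is easy to trace. `FakeChild`, `UsageAccumulator`, and `runAndRollUp` are hypothetical stand-ins for illustration, not part of the pi SDK or this repository.

```typescript
// Hypothetical stand-ins for the child session and the parent's usage counter.
interface UsageAccumulator {
  cost: number;
}

interface FakeChild {
  run(): string;
  cost(): number;
  dispose(): void;
}

// Run the child, then roll its cost up to the parent whether or not it threw.
function runAndRollUp(child: FakeChild, usage?: UsageAccumulator): string {
  try {
    return child.run();
  } finally {
    if (usage) {
      try {
        usage.cost += child.cost();
      } catch {
        // Cost capture is best-effort; a failure here must not mask the real error.
      }
    }
    child.dispose();
  }
}

const usage: UsageAccumulator = { cost: 0 };
let disposed = 0;
const failingChild: FakeChild = {
  run: () => {
    throw new Error('child failed');
  },
  cost: () => 0.25,
  dispose: () => {
    disposed += 1;
  },
};

let caught = false;
try {
  runAndRollUp(failingChild, usage);
} catch {
  caught = true;
}
console.log(caught, usage.cost, disposed);
```

The child's error still propagates to the caller; only the accounting and disposal are guaranteed to run.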
export interface TodoItem {
  content: string;
  status: 'pending' | 'in_progress' | 'completed';
  activeForm: string;
}

/** Render a todo list as a compact checklist for the workflow log. */
function renderTodos(todos: readonly TodoItem[]): string {
  const mark = (s: TodoItem['status']): string => (s === 'completed' ? 'x' : s === 'in_progress' ? '~' : ' ');
  return todos.map((t) => `[${mark(t.status)}] ${t.content}`).join(' ');
}
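To make the checklist format concrete, here is a small standalone sketch that copies the rendering logic shown above and runs it on sample data. The sample todos are invented for illustration.

```typescript
type TodoStatus = 'pending' | 'in_progress' | 'completed';

interface TodoItem {
  content: string;
  status: TodoStatus;
  activeForm: string;
}

// Same marker mapping as above: completed -> x, in_progress -> ~, pending -> space.
function renderTodos(todos: readonly TodoItem[]): string {
  const mark = (s: TodoStatus): string => (s === 'completed' ? 'x' : s === 'in_progress' ? '~' : ' ');
  return todos.map((t) => `[${mark(t.status)}] ${t.content}`).join(' ');
}

// Invented sample data.
const sampleTodos: TodoItem[] = [
  { content: 'Map SSRF sinks', status: 'completed', activeForm: 'Mapping SSRF sinks' },
  { content: 'Probe metadata endpoint', status: 'in_progress', activeForm: 'Probing metadata endpoint' },
  { content: 'Write findings', status: 'pending', activeForm: 'Writing findings' },
];

const rendered = renderTodos(sampleTodos);
console.log(rendered);
// [x] Map SSRF sinks [~] Probe metadata endpoint [ ] Write findings
```

Note that `activeForm` is carried in the state but does not appear in the rendered log line.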
/**
 * The `todo_write` tool — a full-state-replace planning scratchpad.
 *
 * Mirrors the TodoWrite tool: each call carries the entire list and replaces
 * stored state (no append/merge). No deliverable impact; every call is echoed to
 * the workflow log so `shannon logs` shows the agent's live plan. State is per
 * tool instance (one per agent execution).
 */
export function createTodoWriteTool(auditLogger: AuditLogger): ToolDefinition {
  let current: TodoItem[] = [];
  return defineTool({
    name: 'todo_write',
    label: 'Todo Write',
    description:
      'Use this tool to create and manage a structured task list for your current session. This helps you ' +
      'track progress and organize complex, multi-step work, and gives visibility into what you are doing. ' +
      'Pass the COMPLETE todo list on every call — it replaces the stored list entirely (no append or merge). ' +
      'Each todo has a status of pending, in_progress, or completed; keep exactly one task in_progress at a ' +
      'time and mark a task completed as soon as it is finished.',
    promptSnippet: 'todo_write: create and manage a structured task list',
    parameters: Type.Object({
      todos: Type.Array(
        Type.Object({
          content: Type.String({ description: 'Imperative task description, e.g. "Map SSRF sinks".' }),
          status: Type.Union([Type.Literal('pending'), Type.Literal('in_progress'), Type.Literal('completed')]),
          activeForm: Type.String({ description: 'Present-continuous form, e.g. "Mapping SSRF sinks".' }),
        }),
      ),
    }),
    execute: async (_toolCallId, params) => {
      current = params.todos as TodoItem[];
      const completed = current.filter((t) => t.status === 'completed').length;
      await auditLogger.logNote('todo', renderTodos(current));
      return {
        content: [{ type: 'text' as const, text: `Todos updated (${current.length} items, ${completed} completed).` }],
        details: {},
      };
    },
  });
}
/**
 * The `glob` tool — fast file pattern matching (pi ships no `Glob` built-in).
 *
 * Backed by the same fast-glob engine that classifies code_path rules as `[GLOB]`
 * (see utils/glob.ts `isGlobPattern`), so it enumerates exactly the patterns the
 * routing tags as globs — including `**` and `{a,b}`, which pi's `find` would not
 * match the same way. Returns absolute paths, most-recently-modified first.
 */
export function createGlobTool(cwd: string): ToolDefinition {
  return defineTool({
    name: 'glob',
    label: 'Glob',
    description:
      'Fast file pattern matching. Supports glob patterns like "**/*.ts" or "src/**/*.{js,ts}". Returns ' +
      'matching file paths sorted by modification time (most recent first), one per line, or "No files found".',
    promptSnippet: 'glob: find files by name pattern',
    parameters: Type.Object({
      pattern: Type.String({ description: 'The glob pattern to match files against.' }),
      path: Type.Optional(Type.String({ description: 'Directory to search in. Omit to search the repository root.' })),
    }),
    execute: async (_toolCallId, params) => {
      const searchRoot = params.path ? path.resolve(cwd, params.path) : cwd;
      const matches = await glob.globby(params.pattern, {
        cwd: searchRoot,
        absolute: true,
        dot: true,
        onlyFiles: true,
        followSymbolicLinks: false,
      });
      if (matches.length === 0) {
        return { content: [{ type: 'text' as const, text: 'No files found' }], details: {} };
      }
      // Sort by mtime (most recent first) to match the canonical Glob contract.
      const withMtime = await Promise.all(
        matches.map(async (file) => {
          try {
            return { file, mtime: (await fs.stat(file)).mtimeMs };
          } catch {
            return { file, mtime: 0 };
          }
        }),
      );
      withMtime.sort((a, b) => b.mtime - a.mtime);
      return { content: [{ type: 'text' as const, text: withMtime.map((m) => m.file).join('\n') }], details: {} };
    },
  });
}
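The ordering step in the glob tool can be looked at in isolation. The sketch below assumes the stat results have already been collected, with `mtime: 0` standing in for files whose `stat` call failed (as in the fallback above). `FileWithMtime` and `sortByMostRecent` are hypothetical names, and the paths and timestamps are invented.

```typescript
interface FileWithMtime {
  file: string;
  mtime: number; // milliseconds since epoch; 0 when stat failed
}

// Return paths ordered most-recently-modified first, without mutating the input.
function sortByMostRecent(entries: readonly FileWithMtime[]): string[] {
  return [...entries].sort((a, b) => b.mtime - a.mtime).map((e) => e.file);
}

// Invented sample data.
const entries: FileWithMtime[] = [
  { file: '/repo/src/old.ts', mtime: 1_000 },
  { file: '/repo/src/unreadable.ts', mtime: 0 },
  { file: '/repo/src/new.ts', mtime: 3_000 },
];

const ordered = sortByMostRecent(entries);
console.log(ordered.join('\n'));
// /repo/src/new.ts
// /repo/src/old.ts
// /repo/src/unreadable.ts
```

Because unreadable files get an mtime of 0, they sink to the end of the list instead of aborting the whole call.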
+1 -90
@@ -4,9 +4,7 @@
// it under the terms of the GNU Affero General Public License version 3
// as published by the Free Software Foundation.
// Type definitions for Claude executor message processing pipeline
import type { SDKAssistantMessageError } from '@anthropic-ai/claude-agent-sdk';
// Shared display/formatting types for the agent executor output layer.
export interface ExecutionContext {
isParallelExecution: boolean;
@@ -14,90 +12,3 @@ export interface ExecutionContext {
agentType: string;
agentKey: string;
}
export interface AssistantResult {
content: string;
cleanedContent: string;
apiErrorDetected: boolean;
shouldThrow?: Error;
logData: {
turn: number;
content: string;
timestamp: string;
};
}
export interface ResultData {
result: string | null;
cost: number;
duration_ms: number;
subtype?: string;
stop_reason?: string | null;
permissionDenials: number;
structuredOutput?: unknown;
}
export interface ToolUseData {
toolName: string;
parameters: Record<string, unknown>;
timestamp: string;
}
export interface ToolResultData {
content: unknown;
displayContent: string;
timestamp: string;
}
export interface ContentBlock {
type?: string;
text?: string;
thinking?: string;
data?: string;
}
export interface AssistantMessage {
type: 'assistant';
error?: SDKAssistantMessageError;
message: {
content: ContentBlock[] | string;
};
}
export interface ResultMessage {
type: 'result';
result?: string;
total_cost_usd?: number;
duration_ms?: number;
subtype?: string;
stop_reason?: string | null;
permission_denials?: unknown[];
structured_output?: unknown;
}
export interface ToolUseMessage {
type: 'tool_use';
name: string;
input?: Record<string, unknown>;
}
export interface ToolResultMessage {
type: 'tool_result';
content?: unknown;
}
export interface ApiErrorDetection {
detected: boolean;
shouldThrow?: Error;
}
export interface SystemInitMessage {
type: 'system';
subtype: 'init';
model?: string;
permissionMode?: string;
}
export interface UserMessage {
type: 'user';
}
+8
@@ -158,6 +158,14 @@ export class AuditSession {
}
}
/**
* Write a human-readable note to the unified workflow log (e.g. a model
* refusal fallback). Independent of agent event logging.
*/
async logWorkflowNote(category: string, message: string): Promise<void> {
await this.workflowLogger.logEvent(category, message);
}
/**
* End agent execution (mutex-protected)
*/
+19 -5
@@ -12,6 +12,7 @@
*/
import fs from 'node:fs/promises';
import { isFableModel, resolveModelId } from '../ai/models.js';
import { formatDuration, formatTimestamp } from '../utils/formatting.js';
import { LogStream } from './log-stream.js';
import { generateWorkflowLogPath, type SessionMetadata } from './utils.js';
@@ -77,18 +78,31 @@ export class WorkflowLogger {
* Write header to log file
*/
private async writeHeader(): Promise<void> {
const header = [
const lines = [
`================================================================================`,
`Shannon Pentest - Workflow Log`,
`================================================================================`,
`Workflow ID: ${this.workflowId ?? this.sessionMetadata.id}`,
`Target URL: ${this.sessionMetadata.webUrl}`,
`Started: ${formatTimestamp()}`,
`================================================================================`,
``,
].join('\n');
];
return this.logStream.write(header);
// Surface Fable usage: its safety classifiers route cybersecurity tasks to
// Opus 4.8, so those phases run on Opus 4.8 regardless of the tier setting.
const fableTiers = (['small', 'medium', 'large'] as const)
.map((tier) => ({ tier, model: resolveModelId(tier) }))
.filter(({ model }) => isFableModel(model));
if (fableTiers.length > 0) {
const tierList = fableTiers.map(({ tier, model }) => `${tier} (${model})`).join(', ');
lines.push(
`Note: ${tierList} set to a Fable model. Fable's safety classifiers`,
` route cybersecurity tasks to Opus 4.8, so those phases run on Opus 4.8.`,
);
}
lines.push(`================================================================================`, ``);
return this.logStream.write(lines.join('\n'));
}
/**
+194 -213
@@ -5,10 +5,10 @@
// as published by the Free Software Foundation.
/**
* Exploit Collector MCP Server (factory parameterized by vulnerability class
* and per-run valid-ID set).
* Exploit Collector tool factory (parameterized by vulnerability class and
* per-run valid-ID set).
*
* Exposes a single Zod-validated MCP tool `add_exploit`, called once per
* Exposes a single TypeBox-validated tool `add_exploit`, called once per
* processed vulnerability by the 5 exploit-* agents (injection, xss, auth,
* ssrf, authz). After the agent terminates, the host harvests
* collector.getAll() and runs exploit-renderer to produce
@@ -16,29 +16,28 @@
* output.
*
* Schema shape:
* - The SDK tool() helper consumes a ZodRawShape (flat object), not a
* top-level discriminated union. The visible shape is therefore a single
* z.object with common fields required, status as a string enum, and
* per-status fields marked optional at the SDK layer. Each field's
* `.describe()` text explains when it applies.
* - The visible parameter schema is a single Type.Object with common fields
* required, status as a string union, and per-status fields marked optional
* at the tool layer (TypeBox cannot express a top-level discriminated union
* as the flat tool parameters). Each field's `description` text explains
* when it applies.
* - True per-status field enforcement runs inside the tool handler via a
* z.discriminatedUnion('status', ...). Missing-field errors come back to
* the agent as structured Zod issues with retryable=true so it can fix
* and retry the call.
* Type.Union([exploited, blocked]) re-validation using the TypeBox `Value`
* API. Missing-field errors come back to the agent as structured issues
* with retryable=true so it can fix and retry the call.
*
* Strict queue-ID validation: vulnerability_id is refined against the per-run
* queue's known IDs at schema-build time. Hallucinated or typo'd IDs are
* rejected with a structured Zod error that includes the valid-ID list,
* letting the agent recover locally.
* Strict queue-ID validation: vulnerability_id is checked against the per-run
* queue's known IDs in the handler. Hallucinated or typo'd IDs are rejected
* with a structured error that includes the valid-ID list, letting the agent
* recover locally.
*
* Each Zod schema's field-level descriptions carry the bullet labels and
* reproducibility guidance, so the SDK injects it into the agent's tool
* catalog.
* Each field's description carries the bullet labels and reproducibility
* guidance, so the harness injects it into the agent's tool catalog.
*/
import type { McpSdkServerConfigWithInstance } from '@anthropic-ai/claude-agent-sdk';
import { createSdkMcpServer, tool } from '@anthropic-ai/claude-agent-sdk';
import { z } from 'zod';
import { defineTool, type ToolDefinition } from '@earendil-works/pi-coding-agent';
import { type Static, Type } from 'typebox';
import { Value } from 'typebox/value';
// ============================================================================
// CLASS DISCRIMINATOR
@@ -103,214 +102,181 @@ export type AddExploitInput = ExploitedExploit | BlockedExploit;
// ============================================================================
function buildSchemas(validIds: ReadonlySet<string>) {
const vulnerabilityIdField = z
.string()
.min(1)
.describe(
const vulnerabilityIdField = Type.String({
minLength: 1,
description:
'Vulnerability identifier (e.g. "INJ-VULN-03"). Must match an ID from this run\'s ' +
'{class}_exploitation_queue.json exactly — the collector rejects IDs not in the queue. ' +
`Valid IDs for this run: ${formatValidIdsPreview(validIds)}.`,
)
.refine((id: string) => validIds.has(id), {
message:
`Vulnerability ID not in this run's queue. Valid IDs: ` +
`${formatValidIdsPreview(validIds)}. ` +
'Check the queue.json for the canonical ID — likely a typo or hallucinated ID.',
});
'{class}_exploitation_queue.json exactly — the collector rejects IDs not in the queue. ' +
`Valid IDs for this run: ${formatValidIdsPreview(validIds)}.`,
});
const titleField = z
.string()
.min(1)
.describe(
const titleField = Type.String({
minLength: 1,
description:
'Descriptive vulnerability title (e.g. "SQL Injection — User Search", "IDOR — Unauthorized ' +
'Access to User Orders"). Concise; encodes the vulnerability category and where it lives.',
);
'Access to User Orders"). Concise; encodes the vulnerability category and where it lives.',
});
const vulnerableLocationField = z
.string()
.min(1)
.describe(
const vulnerableLocationField = Type.String({
minLength: 1,
description:
'Endpoint or mechanism where the vulnerability exists (e.g. "GET /api/products?id=", ' +
'"POST /login", or a code location like "controllers/userController.js:42").',
);
'"POST /login", or a code location like "controllers/userController.js:42").',
});
const overviewField = z
.string()
.min(1)
.describe(
const overviewField = Type.String({
minLength: 1,
description:
'Brief summary of the exploit itself — what the vulnerability is and how it was demonstrated ' +
'(or how it would be demonstrated, for blocked findings). 1-3 sentences.',
);
'(or how it would be demonstrated, for blocked findings). 1-3 sentences.',
});
const prerequisitesField = z
.string()
.nullable()
.optional()
.describe(
'Required setup, tools, or conditions to reproduce the exploit (e.g. authentication, ' +
const prerequisitesField = Type.Optional(
Type.Union([Type.String(), Type.Null()], {
description:
'Required setup, tools, or conditions to reproduce the exploit (e.g. authentication, ' +
'specific role, prior application state). Omit or pass null when no prerequisites apply.',
);
}),
);
const notesField = z
.string()
.nullable()
.optional()
.describe(
'Optional supplementary context — caveats, related findings, environmental observations. ' +
const notesField = Type.Optional(
Type.Union([Type.String(), Type.Null()], {
description:
'Optional supplementary context — caveats, related findings, environmental observations. ' +
'Free-form Markdown. Omit or pass null when N/A.',
);
}),
);
const statusField = z
.enum(['exploited', 'blocked'])
.describe(
const statusField = Type.Union([Type.Literal('exploited'), Type.Literal('blocked')], {
description:
'Verdict bucket. Set to "exploited" only after reaching Proof of Exploitation Level 3+ with ' +
'concrete impact evidence (extracted data, executed JavaScript, account takeover, internal ' +
'service access). Set to "blocked" only for real vulnerabilities where external factors ' +
'(NOT security defenses) prevented full exploitation. Findings where a security defense ' +
'successfully prevented exploitation after exhaustive bypass attempts are FALSE POSITIVE — ' +
'route those to your workspace tracking file, not this tool.',
);
'concrete impact evidence (extracted data, executed JavaScript, account takeover, internal ' +
'service access). Set to "blocked" only for real vulnerabilities where external factors ' +
'(NOT security defenses) prevented full exploitation. Findings where a security defense ' +
'successfully prevented exploitation after exhaustive bypass attempts are FALSE POSITIVE — ' +
'route those to your workspace tracking file, not this tool.',
});
// Per-status fields. All optional at the SDK shape layer because a single
// ZodRawShape cannot express a top-level discriminated union; the handler
// Per-status fields. All optional at the flat parameter layer because a single
// Type.Object cannot express a top-level discriminated union; the handler
// re-validates against the discriminated union below for true enforcement.
const severityField = z
.enum(SEVERITY_VALUES)
.nullable()
.optional()
.describe(
'REQUIRED when status="exploited". Severity of the demonstrated impact. Critical = Level 4 ' +
const severityField = Type.Optional(
Type.Union([...SEVERITY_VALUES.map((v) => Type.Literal(v)), Type.Null()], {
description:
'REQUIRED when status="exploited". Severity of the demonstrated impact. Critical = Level 4 ' +
'(admin credentials extracted, sensitive data dumped, system commands executed, full account ' +
'takeover). High = Level 3 (data extraction proven, authentication bypass confirmed, ' +
'internal service access). Medium/Low based on impact narrowness or read-only access. Must ' +
'reflect demonstrated impact, not theoretical potential.',
);
}),
);
const impactField = z
.string()
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="exploited". Business/security impact achieved by the exploit ' +
const impactField = Type.Optional(
Type.Union([Type.String({ minLength: 1 }), Type.Null()], {
description:
'REQUIRED when status="exploited". Business/security impact achieved by the exploit ' +
'(e.g. "Extracted full user table including bcrypt password hashes for 1,247 users", ' +
'"Achieved RCE as the application user; arbitrary shell commands executed"). Must describe ' +
'what was actually demonstrated, not what could theoretically happen.',
);
}),
);
const exploitationStepsField = z
.array(z.string().min(1))
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="exploited". Ordered, reproducible exploitation steps — one Markdown ' +
const exploitationStepsField = Type.Optional(
Type.Union([Type.Array(Type.String({ minLength: 1 }), { minItems: 1 }), Type.Null()], {
description:
'REQUIRED when status="exploited". Ordered, reproducible exploitation steps — one Markdown ' +
'blob per numbered step. Each step must include full URLs (protocol + domain + port + path ' +
'+ params), complete payloads, and copy-paste-ready commands. Use clear placeholders for ' +
'variable values like [SESSION_TOKEN], [DATABASE_NAME], [TABLE_NAME], [TARGET_USER_ID]. ' +
'Write each step as natural Markdown — interleave prose with fenced code blocks (```bash, ' +
'```http, etc.) as you would in a write-up. Steps must be detailed enough that someone ' +
'unfamiliar with the application can follow without additional research.',
);
}),
);
const proofOfImpactField = z
.string()
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="exploited". Concrete evidence of successful exploitation — extracted ' +
const proofOfImpactField = Type.Optional(
Type.Union([Type.String({ minLength: 1 }), Type.Null()], {
description:
'REQUIRED when status="exploited". Concrete evidence of successful exploitation — extracted ' +
'data, achieved actions, captured request/response pairs, log excerpts. Markdown blob; ' +
'interleave prose with fenced code blocks. Must show what the exploit demonstrably achieved, ' +
'not theoretical impact.',
);
}),
);
const confidenceField = z
.enum(CONFIDENCE_VALUES)
.nullable()
.optional()
.describe(
'REQUIRED when status="blocked". Confidence that this finding is a real vulnerability that ' +
const confidenceField = Type.Optional(
Type.Union([...CONFIDENCE_VALUES.map((v) => Type.Literal(v)), Type.Null()], {
description:
'REQUIRED when status="blocked". Confidence that this finding is a real vulnerability that ' +
'would be exploited if the external blocker were removed. High = code analysis strongly ' +
'confirms vulnerability and partial exploitation (Level 1-2) succeeded. Medium = code ' +
'analysis confirms but live evidence is partial. Low = signal-only; revisit if blocker is ' +
'removed in a future run.',
);
}),
);
const currentBlockerField = z
.string()
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="blocked". What prevents full exploitation (e.g. "Server crashes after ' +
const currentBlockerField = Type.Optional(
Type.Union([Type.String({ minLength: 1 }), Type.Null()], {
description:
'REQUIRED when status="blocked". What prevents full exploitation (e.g. "Server crashes after ' +
'5 requests, blocking enumeration", "OAuth callback requires verified third-party email ' +
'account we could not provision"). Must be an external operational constraint, not a ' +
'security defense.',
);
}),
);
const potentialImpactField = z
.string()
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="blocked". What could be achieved if the blocker were removed (e.g. ' +
const potentialImpactField = Type.Optional(
Type.Union([Type.String({ minLength: 1 }), Type.Null()], {
description:
'REQUIRED when status="blocked". What could be achieved if the blocker were removed (e.g. ' +
'"Full database read access", "Account takeover of arbitrary user via reset-token leak"). ' +
'Distinct from impact — this is the hypothetical outcome, not a demonstrated one.',
);
}),
);
const evidenceOfVulnerabilityField = z
.string()
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="blocked". Code snippets, response excerpts, or observed behavior ' +
const evidenceOfVulnerabilityField = Type.Optional(
Type.Union([Type.String({ minLength: 1 }), Type.Null()], {
description:
'REQUIRED when status="blocked". Code snippets, response excerpts, or observed behavior ' +
'proving the vulnerability is real. Markdown blob; interleave prose with fenced code blocks. ' +
'This is what convinces the reader the finding is not a false positive despite incomplete ' +
'exploitation.',
);
}),
);
const whatWeTriedField = z
.string()
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="blocked". Log of attempted exploitation techniques and why each was ' +
const whatWeTriedField = Type.Optional(
Type.Union([Type.String({ minLength: 1 }), Type.Null()], {
description:
'REQUIRED when status="blocked". Log of attempted exploitation techniques and why each was ' +
'blocked. Each attempt should document the payload, the observed result, and the inferred ' +
'blocker. Markdown blob; multiple attempts as a list or distinct paragraphs. Demonstrates ' +
'exhaustive bypass effort per the Bypass Exhaustion Protocol.',
);
}),
);
const howThisWouldBeExploitedField = z
.array(z.string().min(1))
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="blocked". Ordered hypothetical exploitation steps assuming the blocker ' +
const howThisWouldBeExploitedField = Type.Optional(
Type.Union([Type.Array(Type.String({ minLength: 1 }), { minItems: 1 }), Type.Null()], {
description:
'REQUIRED when status="blocked". Ordered hypothetical exploitation steps assuming the blocker ' +
'is removed — one Markdown blob per numbered step. Same reproducibility requirements as ' +
'exploitation_steps: full URLs, complete payloads, copy-paste-ready commands. Frame the ' +
'first step as "If [blocker] were removed: …".',
);
}),
);
const expectedImpactField = z
.string()
.min(1)
.nullable()
.optional()
.describe(
'REQUIRED when status="blocked". Specific data or access that would be compromised if ' +
const expectedImpactField = Type.Optional(
Type.Union([Type.String({ minLength: 1 }), Type.Null()], {
description:
'REQUIRED when status="blocked". Specific data or access that would be compromised if ' +
'exploitation succeeded (e.g. "Read access to all user profile data including PII; write ' +
'access to user-owned resources"). Markdown blob.',
);
}),
);
// The flat shape passed to tool(). The SDK uses this to build the agent's
// tool catalog. Per-status enforcement happens in the handler via the
// discriminated union below.
const flatShape = {
// The flat parameter schema passed to defineTool(). The harness uses this to
// build the agent's tool catalog. Per-status enforcement happens in the
// handler via the discriminated union below.
const flatShape = Type.Object({
status: statusField,
vulnerability_id: vulnerabilityIdField,
title: titleField,
@@ -329,59 +295,64 @@ function buildSchemas(validIds: ReadonlySet<string>) {
what_we_tried: whatWeTriedField,
how_this_would_be_exploited: howThisWouldBeExploitedField,
expected_impact: expectedImpactField,
};
});
// Strict per-status validation. Re-runs in the handler so missing fields
// for the chosen status return a retryable Zod error to the agent.
const ExploitedSchema = z.object({
status: z.literal('exploited'),
// for the chosen status return a retryable error to the agent.
const ExploitedSchema = Type.Object({
status: Type.Literal('exploited'),
vulnerability_id: vulnerabilityIdField,
title: titleField,
vulnerable_location: vulnerableLocationField,
overview: overviewField,
prerequisites: prerequisitesField,
severity: z.enum(SEVERITY_VALUES),
impact: z.string().min(1),
exploitation_steps: z.array(z.string().min(1)).min(1),
proof_of_impact: z.string().min(1),
severity: Type.Union(SEVERITY_VALUES.map((v) => Type.Literal(v))),
impact: Type.String({ minLength: 1 }),
exploitation_steps: Type.Array(Type.String({ minLength: 1 }), { minItems: 1 }),
proof_of_impact: Type.String({ minLength: 1 }),
notes: notesField,
});
const BlockedSchema = z.object({
status: z.literal('blocked'),
const BlockedSchema = Type.Object({
status: Type.Literal('blocked'),
vulnerability_id: vulnerabilityIdField,
title: titleField,
vulnerable_location: vulnerableLocationField,
prerequisites: prerequisitesField,
confidence: z.enum(CONFIDENCE_VALUES),
current_blocker: z.string().min(1),
potential_impact: z.string().min(1),
evidence_of_vulnerability: z.string().min(1),
what_we_tried: z.string().min(1),
how_this_would_be_exploited: z.array(z.string().min(1)).min(1),
expected_impact: z.string().min(1),
confidence: Type.Union(CONFIDENCE_VALUES.map((v) => Type.Literal(v))),
current_blocker: Type.String({ minLength: 1 }),
potential_impact: Type.String({ minLength: 1 }),
evidence_of_vulnerability: Type.String({ minLength: 1 }),
what_we_tried: Type.String({ minLength: 1 }),
how_this_would_be_exploited: Type.Array(Type.String({ minLength: 1 }), { minItems: 1 }),
expected_impact: Type.String({ minLength: 1 }),
notes: notesField,
});
const StrictSchema = z.discriminatedUnion('status', [ExploitedSchema, BlockedSchema]);
const StrictSchema = Type.Union([ExploitedSchema, BlockedSchema]);
return { flatShape, StrictSchema };
}
type FlatInput = Static<ReturnType<typeof buildSchemas>['flatShape']>;
type StrictInput = Static<ReturnType<typeof buildSchemas>['StrictSchema']>;
// ============================================================================
// RESPONSE HELPERS
// ============================================================================
interface ToolResult {
[x: string]: unknown;
content: Array<{ type: 'text'; text: string }>;
isError: boolean;
details: Record<string, unknown>;
isError?: boolean;
}
function createToolResult(response: { status: string; [key: string]: unknown }): ToolResult {
const isError = response.status === 'error';
return {
content: [{ type: 'text', text: JSON.stringify(response, null, 2) }],
isError: response.status === 'error',
content: [{ type: 'text' as const, text: JSON.stringify(response, null, 2) }],
details: {},
...(isError && { isError: true }),
};
}
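The new `createToolResult` only attaches `isError` on failures, using a conditional spread. A standalone sketch of that shape follows, with the `ToolResult` interface restated locally. The sample responses are invented, and the reason for omitting the key on success (rather than setting it to `false`) is not stated in the source.

```typescript
interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  details: Record<string, unknown>;
  isError?: boolean;
}

function createToolResult(response: { status: string; [key: string]: unknown }): ToolResult {
  const isError = response.status === 'error';
  return {
    content: [{ type: 'text' as const, text: JSON.stringify(response, null, 2) }],
    details: {},
    // Spreading `false` adds nothing, so the key is absent on success.
    ...(isError && { isError: true }),
  };
}

// Invented sample responses.
const okResult = createToolResult({ status: 'success', added: ['INJ-VULN-03'] });
const errResult = createToolResult({ status: 'error', message: 'bad id', retryable: true });

console.log('isError' in okResult, errResult.isError);
// false true
```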
@@ -393,21 +364,21 @@ function errorResult(message: string, errorType = 'ValidationError', retryable =
return createToolResult({ status: 'error', message, errorType, retryable });
}
function formatZodIssues(error: z.ZodError): string {
return error.issues
function formatValueErrors(schema: ReturnType<typeof buildSchemas>['StrictSchema'], value: unknown): string {
return [...Value.Errors(schema, value)]
.map((issue) => {
const path = issue.path.length > 0 ? issue.path.join('.') : '(root)';
const path = issue.instancePath.length > 0 ? issue.instancePath.replace(/^\//, '').replace(/\//g, '.') : '(root)';
return `- ${path}: ${issue.message}`;
})
.join('\n');
}
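The path handling in `formatValueErrors` turns a slash-separated `instancePath` into the dotted form the old Zod formatter produced. A minimal sketch of just that conversion, using a hypothetical helper name `toDottedPath`:

```typescript
// Convert a slash-separated instance path ("/a/0/b") to dotted form ("a.0.b").
// An empty path means the error is on the root value.
function toDottedPath(instancePath: string): string {
  return instancePath.length > 0 ? instancePath.replace(/^\//, '').replace(/\//g, '.') : '(root)';
}

console.log(toDottedPath('/exploitation_steps/0')); // exploitation_steps.0
console.log(toDottedPath('/severity')); // severity
console.log(toDottedPath('')); // (root)
```

This sketch, like the original, does not decode JSON Pointer escapes (`~0`, `~1`). That is likely harmless here, assuming the schema's property names contain no `/` or `~`.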
// ============================================================================
// SERVER FACTORY
// TOOL FACTORY
// ============================================================================
export interface ExploitCollectorServer {
server: McpSdkServerConfigWithInstance;
tools: ToolDefinition[];
getAll(): AddExploitInput[];
}
@@ -421,9 +392,11 @@ export function createExploitCollector(options: CreateExploitCollectorOptions):
const exploits: AddExploitInput[] = [];
const { flatShape, StrictSchema } = buildSchemas(validIds);
const addExploitTool = tool(
'add_exploit',
`Record a single processed ${vulnClass} vulnerability as structured exploitation evidence. ` +
const addExploitTool = defineTool({
name: 'add_exploit',
label: 'Add Exploit',
description:
`Record a single processed ${vulnClass} vulnerability as structured exploitation evidence. ` +
'Call this once per vulnerability in your queue.json after reaching a definitive verdict ' +
'(either successfully exploited or potential-but-blocked). The status field discriminates the ' +
"two report buckets; required sub-fields differ per status (see each field's description for " +
@@ -432,20 +405,34 @@ export function createExploitCollector(options: CreateExploitCollectorOptions):
'IDs. FALSE POSITIVE findings do NOT use this tool — they go to your workspace tracking file. ' +
'After all queue vulnerabilities have been emitted, the host renderer assembles the ' +
'deliverable Markdown from your recorded calls.',
flatShape,
async (input): Promise<ToolResult> => {
// Re-validate against the strict discriminated union for per-status enforcement.
const parsed = StrictSchema.safeParse(input);
if (!parsed.success) {
parameters: flatShape,
execute: async (_toolCallId, args): Promise<ToolResult> => {
const input = args as FlatInput;
// Strict queue-ID validation: reject hallucinated or typo'd IDs with the valid-ID list.
if (!validIds.has(input.vulnerability_id)) {
return errorResult(
`Schema validation failed for status="${(input as { status?: string }).status}". ` +
'Required-field issues:\n' +
formatZodIssues(parsed.error),
`Vulnerability ID not in this run's queue. Valid IDs: ` +
`${formatValidIdsPreview(validIds)}. ` +
'Check the queue.json for the canonical ID — likely a typo or hallucinated ID.',
'ValidationError',
true,
);
}
const typed = parsed.data as AddExploitInput;
// Re-validate against the strict discriminated union for per-status enforcement.
if (!Value.Check(StrictSchema, input)) {
return errorResult(
`Schema validation failed for status="${(input as { status?: string }).status}". ` +
'Required-field issues:\n' +
formatValueErrors(StrictSchema, input),
'ValidationError',
true,
);
}
// Strip excess properties from the flat input so only the chosen status's
// fields survive (mirrors the prior discriminated-union parse).
const typed = Value.Clean(StrictSchema, structuredClone(input)) as StrictInput as AddExploitInput;
const existing = exploits.find((e) => e.vulnerability_id === typed.vulnerability_id);
if (existing) {
return errorResult(
@@ -458,16 +445,10 @@ export function createExploitCollector(options: CreateExploitCollectorOptions):
exploits.push(typed);
return successResult({ added: [typed.vulnerability_id], recorded_status: typed.status });
},
);
const server: McpSdkServerConfigWithInstance = createSdkMcpServer({
name: 'exploit-collector',
version: '1.0.0',
tools: [addExploitTool],
});
return {
server,
tools: [addExploitTool] as ToolDefinition[],
getAll: (): AddExploitInput[] => [...exploits],
};
}
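The handler above validates in two layers: a permissive flat parameter shape for the tool catalog, then a queue-ID check and a strict per-status check at call time. The sketch below illustrates that layering in plain TypeScript without TypeBox. It is an assumption-laden simplification: the field set is trimmed to two per-status fields, and `FlatInput`, `REQUIRED_BY_STATUS`, and `validateExploit` are hypothetical names, not the collector's actual API.

```typescript
type Status = 'exploited' | 'blocked';

// Flat shape: per-status fields are optional here, as in the tool parameters.
interface FlatInput {
  status: Status;
  vulnerability_id: string;
  impact?: string | null;
  current_blocker?: string | null;
}

// Hypothetical, trimmed-down requirement table (the real schemas list more fields).
const REQUIRED_BY_STATUS: Record<Status, ReadonlyArray<'impact' | 'current_blocker'>> = {
  exploited: ['impact'],
  blocked: ['current_blocker'],
};

// Returns a list of issues; an empty list means the call is accepted.
function validateExploit(input: FlatInput, validIds: ReadonlySet<string>): string[] {
  // Layer 1: reject IDs that are not in this run's queue.
  if (!validIds.has(input.vulnerability_id)) {
    return [`vulnerability_id: not in this run's queue (valid: ${[...validIds].join(', ')})`];
  }
  // Layer 2: enforce the fields required for the chosen status.
  return REQUIRED_BY_STATUS[input.status]
    .filter((key) => {
      const value = input[key];
      return typeof value !== 'string' || value.length === 0;
    })
    .map((key) => `${key}: required when status="${input.status}"`);
}

const validIds = new Set(['INJ-VULN-01', 'INJ-VULN-03']);

const unknownId = validateExploit({ status: 'exploited', vulnerability_id: 'INJ-VULN-99' }, validIds);
const missingField = validateExploit({ status: 'blocked', vulnerability_id: 'INJ-VULN-01' }, validIds);
const accepted = validateExploit(
  { status: 'exploited', vulnerability_id: 'INJ-VULN-03', impact: 'Extracted user table' },
  validIds,
);
console.log(unknownId, missingField, accepted);
```

Returning issues as data rather than throwing mirrors the retryable-error approach in the handler: the agent gets a message it can act on and retry.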
+341 -369
@@ -5,9 +5,9 @@
// as published by the Free Software Foundation.
/**
* Pre-Recon Collector MCP Server
* Pre-Recon Collector tools
*
* Exposes seven Zod-validated MCP tools, one per section of the
* Exposes seven TypeBox-validated tools, one per section of the
* pre_recon_deliverable.md report. Every tool is one-shot (write-once;
* duplicate calls return DuplicateError). A skipped tool renders a placeholder
* rather than failing the activity. After the agent finishes, the host calls
@@ -15,386 +15,353 @@
* per-run call pattern, and runs the deterministic renderer to produce the
* deliverable Markdown.
*
* Each Zod schema's field-level descriptions carry the section guidance, so
* the SDK injects it into the agent's tool catalog.
* Each TypeBox schema's field-level descriptions carry the section guidance, so
* the harness injects it into the agent's tool catalog.
*/
import type { McpSdkServerConfigWithInstance } from '@anthropic-ai/claude-agent-sdk';
import { createSdkMcpServer, tool } from '@anthropic-ai/claude-agent-sdk';
import { z } from 'zod';
import { defineTool, type ToolDefinition } from '@earendil-works/pi-coding-agent';
import { type Static, Type } from 'typebox';
// ============================================================================
// SHARED SCHEMA
// ============================================================================
export const SinkRefSchema = z.object({
location: z
.string()
.min(1)
.describe(
export const SinkRefSchema = Type.Object({
location: Type.String({
minLength: 1,
description:
'File path with line number (e.g., "templates/render.js:34") or richer prose ' +
'(e.g., "innerHTML at templates/render.js:34", "lines 45-67"). Must contain enough ' +
'detail for a downstream agent to find the exact location.',
),
sink_function: z
.string()
.min(1)
.describe('The sink function or property name (e.g., "innerHTML", "axios.get", "eval", "document.write").'),
notes: z
.string()
.nullable()
.optional()
.describe(
'Optional context — render-context detail, attribute name, scope hints, or anything ' +
'(e.g., "innerHTML at templates/render.js:34", "lines 45-67"). Must contain enough ' +
'detail for a downstream agent to find the exact location.',
}),
sink_function: Type.String({
minLength: 1,
description: 'The sink function or property name (e.g., "innerHTML", "axios.get", "eval", "document.write").',
}),
notes: Type.Optional(
Type.Union([Type.String(), Type.Null()], {
description:
'Optional context — render-context detail, attribute name, scope hints, or anything ' +
'a downstream agent needs to act on this sink. Omit when the location and sink_function ' +
'are sufficient on their own.',
),
}),
),
});
export type SinkRef = z.infer<typeof SinkRefSchema>;
export type SinkRef = Static<typeof SinkRefSchema>;
// ============================================================================
// PER-TOOL INPUT SCHEMAS
// ============================================================================
export const ExecutiveSummaryInputSchema = z.object({
text: z
.string()
.min(1)
.describe(
export const ExecutiveSummaryInputSchema = Type.Object({
text: Type.String({
minLength: 1,
description:
"Provide a 2-3 paragraph overview of the application's security posture, highlighting " +
'the most critical attack surfaces and architectural security decisions. Becomes ' +
'Section 1 of the rendered deliverable.',
),
'the most critical attack surfaces and architectural security decisions. Becomes ' +
'Section 1 of the rendered deliverable.',
}),
});
const ArchitectureSchema = z.object({
framework_and_language: z
.string()
.min(1)
.describe('Framework and language details with their security implications.'),
architectural_pattern: z
.string()
.min(1)
.describe('Architectural pattern (monolith, microservices, hybrid) with trust boundary analysis.'),
critical_security_components: z
.string()
.min(1)
.describe('Critical security components with focus on auth, authz, and data protection.'),
const ArchitectureSchema = Type.Object({
framework_and_language: Type.String({
minLength: 1,
description: 'Framework and language details with their security implications.',
}),
architectural_pattern: Type.String({
minLength: 1,
description: 'Architectural pattern (monolith, microservices, hybrid) with trust boundary analysis.',
}),
critical_security_components: Type.String({
minLength: 1,
description: 'Critical security components with focus on auth, authz, and data protection.',
}),
});
const DataSecuritySchema = z.object({
database_security: z
.string()
.min(1)
.describe('Analyze encryption, access controls, and query safety in database interactions.'),
data_flow_security: z
.string()
.min(1)
.describe('Identify sensitive data paths and the protection mechanisms applied along them.'),
multi_tenant_isolation: z
.string()
.min(1)
.describe(
const DataSecuritySchema = Type.Object({
database_security: Type.String({
minLength: 1,
description: 'Analyze encryption, access controls, and query safety in database interactions.',
}),
data_flow_security: Type.String({
minLength: 1,
description: 'Identify sensitive data paths and the protection mechanisms applied along them.',
}),
multi_tenant_isolation: Type.String({
minLength: 1,
description:
'Assess tenant separation effectiveness. If the application is single-tenant, state that ' +
'explicitly rather than leaving the field thin.',
),
'explicitly rather than leaving the field thin.',
}),
});
const AttackSurfaceSchema = z.object({
external_entry_points: z
.string()
.min(1)
.describe('Detailed analysis of each public interface that is network-accessible.'),
internal_service_communication: z
.string()
.min(1)
.describe(
const AttackSurfaceSchema = Type.Object({
external_entry_points: Type.String({
minLength: 1,
description: 'Detailed analysis of each public interface that is network-accessible.',
}),
internal_service_communication: Type.String({
minLength: 1,
description:
'Trust relationships and security assumptions between network-reachable services. ' +
'If the application is a single service with no internal RPC fabric, state that.',
),
input_validation_patterns: z
.string()
.min(1)
.describe('How user input is handled and validated in network-accessible endpoints.'),
background_processing: z
.string()
.min(1)
.describe(
'If the application is a single service with no internal RPC fabric, state that.',
}),
input_validation_patterns: Type.String({
minLength: 1,
description: 'How user input is handled and validated in network-accessible endpoints.',
}),
background_processing: Type.String({
minLength: 1,
description:
'Async job security and privilege models for jobs triggered by network requests. ' +
'If no async/background processing exists, state that.',
),
'If no async/background processing exists, state that.',
}),
});
const InfrastructureSchema = z.object({
secrets_management: z.string().min(1).describe('How secrets are stored, rotated, and accessed.'),
configuration_security: z
.string()
.min(1)
.describe(
const InfrastructureSchema = Type.Object({
secrets_management: Type.String({ minLength: 1, description: 'How secrets are stored, rotated, and accessed.' }),
configuration_security: Type.String({
minLength: 1,
description:
'Environment separation and secret handling. Specifically search for infrastructure ' +
'configuration (e.g., Nginx, Kubernetes Ingress, CDN settings) that defines security ' +
'headers like Strict-Transport-Security (HSTS) and Cache-Control, and report what was found.',
),
external_dependencies: z.string().min(1).describe('Third-party services and their security implications.'),
monitoring_and_logging: z
.string()
.min(1)
.describe('Security event visibility — what is logged, where it goes, and who can see it.'),
'configuration (e.g., Nginx, Kubernetes Ingress, CDN settings) that defines security ' +
'headers like Strict-Transport-Security (HSTS) and Cache-Control, and report what was found.',
}),
external_dependencies: Type.String({
minLength: 1,
description: 'Third-party services and their security implications.',
}),
monitoring_and_logging: Type.String({
minLength: 1,
description: 'Security event visibility — what is logged, where it goes, and who can see it.',
}),
});
export const ApplicationIntelligenceInputSchema = z.object({
architecture: ArchitectureSchema.describe(
'Architecture & Technology Stack — driven by the Architecture Scanner sub-agent. ' +
export const ApplicationIntelligenceInputSchema = Type.Object({
architecture: Type.Object(ArchitectureSchema.properties, {
description:
'Architecture & Technology Stack — driven by the Architecture Scanner sub-agent. ' +
'Becomes Section 2 of the rendered deliverable.',
),
data_security: DataSecuritySchema.describe(
'Data Security & Storage — driven by the Data Security Auditor sub-agent. ' +
}),
data_security: Type.Object(DataSecuritySchema.properties, {
description:
'Data Security & Storage — driven by the Data Security Auditor sub-agent. ' +
'Becomes Section 4 of the rendered deliverable.',
),
attack_surface: AttackSurfaceSchema.describe(
'Attack Surface Analysis — driven by Entry Point Mapper + Architecture Scanner sub-agents. ' +
}),
attack_surface: Type.Object(AttackSurfaceSchema.properties, {
description:
'Attack Surface Analysis — driven by Entry Point Mapper + Architecture Scanner sub-agents. ' +
'Only include entry points confirmed to be in-scope (network-reachable). ' +
'Becomes Section 5 of the rendered deliverable.',
),
infrastructure: InfrastructureSchema.describe(
'Infrastructure & Operational Security. Becomes Section 6 of the rendered deliverable.',
),
}),
infrastructure: Type.Object(InfrastructureSchema.properties, {
description: 'Infrastructure & Operational Security. Becomes Section 6 of the rendered deliverable.',
}),
});
export const AuthDeepDiveInputSchema = z.object({
authentication_mechanisms: z
.string()
.min(1)
.describe(
export const AuthDeepDiveInputSchema = Type.Object({
authentication_mechanisms: Type.String({
minLength: 1,
description:
'Authentication mechanisms and their security properties. MUST include an exhaustive list of ' +
'all API endpoints used for authentication (e.g., login, logout, token refresh, password reset).',
),
session_management: z
.string()
.min(1)
.describe(
'all API endpoints used for authentication (e.g., login, logout, token refresh, password reset).',
}),
session_management: Type.String({
minLength: 1,
description:
'Session management and token security. Pinpoint the exact file and line(s) of code where ' +
'session cookie flags (HttpOnly, Secure, SameSite) are configured.',
),
authz_model: z.string().min(1).describe('Authorization model and potential bypass scenarios.'),
multi_tenancy: z
.string()
.min(1)
.describe('Multi-tenancy security implementation. If the application is single-tenant, state that explicitly.'),
sso_oauth_oidc: z
.string()
.nullable()
.describe(
'session cookie flags (HttpOnly, Secure, SameSite) are configured.',
}),
authz_model: Type.String({ minLength: 1, description: 'Authorization model and potential bypass scenarios.' }),
multi_tenancy: Type.String({
minLength: 1,
description: 'Multi-tenancy security implementation. If the application is single-tenant, state that explicitly.',
}),
sso_oauth_oidc: Type.Union([Type.String(), Type.Null()], {
description:
'SSO/OAuth/OIDC flows: identify the callback endpoints and locate the specific code that ' +
'validates the state and nonce parameters. Set null only if the application has no SSO/OAuth/OIDC ' +
'integration at all.',
),
'validates the state and nonce parameters. Set null only if the application has no SSO/OAuth/OIDC ' +
'integration at all.',
}),
});
export const CodebaseIndexingInputSchema = z.object({
text: z
.string()
.min(1)
.describe(
export const CodebaseIndexingInputSchema = Type.Object({
text: Type.String({
minLength: 1,
description:
"A detailed, multi-sentence paragraph describing the codebase's directory structure, " +
'organization, and significant tools or conventions used (e.g., build orchestration, code ' +
'generation, testing frameworks). Focus on how this structure impacts discoverability of ' +
'security-relevant components.',
),
'organization, and significant tools or conventions used (e.g., build orchestration, code ' +
'generation, testing frameworks). Focus on how this structure impacts discoverability of ' +
'security-relevant components.',
}),
});
export const CriticalFilePathsInputSchema = z.object({
configuration: z
.array(z.string().min(1))
.describe('Configuration files (e.g., config/server.yaml, Dockerfile, docker-compose.yml).'),
authentication_and_authorization: z
.array(z.string().min(1))
.describe(
export const CriticalFilePathsInputSchema = Type.Object({
configuration: Type.Array(Type.String({ minLength: 1 }), {
description: 'Configuration files (e.g., config/server.yaml, Dockerfile, docker-compose.yml).',
}),
authentication_and_authorization: Type.Array(Type.String({ minLength: 1 }), {
description:
'Auth/authz files (e.g., auth/jwt_middleware.go, internal/user/permissions.go, ' +
'config/initializers/session_store.rb, src/services/oauth_callback.js).',
),
api_and_routing: z
.array(z.string().min(1))
.describe(
'config/initializers/session_store.rb, src/services/oauth_callback.js).',
}),
api_and_routing: Type.Array(Type.String({ minLength: 1 }), {
description:
'API and routing files (e.g., cmd/api/main.go, internal/handlers/user_routes.go, ' +
'ts/graphql/schema.graphql).',
),
data_models_and_db: z
.array(z.string().min(1))
.describe(
'ts/graphql/schema.graphql).',
}),
data_models_and_db: Type.Array(Type.String({ minLength: 1 }), {
description:
'Data model and DB interaction files (e.g., db/migrations/001_initial.sql, ' +
'internal/models/user.go, internal/repository/sql_queries.go).',
),
dependency_manifests: z
.array(z.string().min(1))
.describe('Dependency manifests (e.g., go.mod, package.json, requirements.txt).'),
sensitive_data_and_secrets: z
.array(z.string().min(1))
.describe(
'internal/models/user.go, internal/repository/sql_queries.go).',
}),
dependency_manifests: Type.Array(Type.String({ minLength: 1 }), {
description: 'Dependency manifests (e.g., go.mod, package.json, requirements.txt).',
}),
sensitive_data_and_secrets: Type.Array(Type.String({ minLength: 1 }), {
description:
'Sensitive data and secrets handling (e.g., internal/utils/encryption.go, ' + 'internal/secrets/manager.go).',
),
middleware_and_input_validation: z
.array(z.string().min(1))
.describe(
}),
middleware_and_input_validation: Type.Array(Type.String({ minLength: 1 }), {
description:
'Middleware and input validation (e.g., internal/middleware/validator.go, ' +
'internal/handlers/input_parsers.go).',
),
logging_and_monitoring: z
.array(z.string().min(1))
.describe('Logging and monitoring (e.g., internal/logging/logger.go, config/monitoring.yaml).'),
infrastructure_and_deployment: z
.array(z.string().min(1))
.describe(
'internal/handlers/input_parsers.go).',
}),
logging_and_monitoring: Type.Array(Type.String({ minLength: 1 }), {
description: 'Logging and monitoring (e.g., internal/logging/logger.go, config/monitoring.yaml).',
}),
infrastructure_and_deployment: Type.Array(Type.String({ minLength: 1 }), {
description:
'Infrastructure and deployment (e.g., infra/pulumi/main.go, kubernetes/deploy.yaml, ' +
'nginx.conf, gateway-ingress.yaml).',
),
'nginx.conf, gateway-ingress.yaml).',
}),
});
export const XssSinksInputSchema = z.object({
applicable: z
.boolean()
.describe(
export const XssSinksInputSchema = Type.Object({
applicable: Type.Boolean({
description:
'False only if the application has no web frontend at all. Otherwise true, even if no ' +
'sinks were found in a given category — empty arrays mean "scanned this category, no sinks found".',
),
html_body: z
.array(SinkRefSchema)
.describe(
'sinks were found in a given category — empty arrays mean "scanned this category, no sinks found".',
}),
html_body: Type.Array(SinkRefSchema, {
description:
'HTML Body Context sinks: element.innerHTML, element.outerHTML, document.write(), ' +
'document.writeln(), element.insertAdjacentHTML(), Range.createContextualFragment(), ' +
'and jQuery sinks like add(), after(), append(), before(), html(), prepend(), replaceWith(), wrap().',
),
html_attribute: z
.array(SinkRefSchema)
.describe(
'document.writeln(), element.insertAdjacentHTML(), Range.createContextualFragment(), ' +
'and jQuery sinks like add(), after(), append(), before(), html(), prepend(), replaceWith(), wrap().',
}),
html_attribute: Type.Array(SinkRefSchema, {
description:
'HTML Attribute Context sinks: event handlers (onclick, onerror, onmouseover, onload, onfocus), ' +
'URL-based attributes (href, src, formaction, action, background, data), the style attribute, ' +
'iframe srcdoc, and general attributes (value, id, class, name, alt) when quotes are escaped.',
),
javascript: z
.array(SinkRefSchema)
.describe(
'URL-based attributes (href, src, formaction, action, background, data), the style attribute, ' +
'iframe srcdoc, and general attributes (value, id, class, name, alt) when quotes are escaped.',
}),
javascript: Type.Array(SinkRefSchema, {
description:
'JavaScript Context sinks: eval(), Function() constructor, setTimeout() / setInterval() ' +
'with string arguments, and direct writes of user data into a <script> tag.',
),
css: z
.array(SinkRefSchema)
.describe(
'with string arguments, and direct writes of user data into a <script> tag.',
}),
css: Type.Array(SinkRefSchema, {
description:
'CSS Context sinks: element.style properties (e.g., element.style.backgroundImage) and ' +
'direct writes of user data into a <style> tag.',
),
url: z
.array(SinkRefSchema)
.describe(
'direct writes of user data into a <style> tag.',
}),
url: Type.Array(SinkRefSchema, {
description:
'URL Context sinks: location / window.location, location.href, location.replace(), ' +
'location.assign(), window.open(), history.pushState(), history.replaceState(), ' +
'URL.createObjectURL(), and jQuery selector $(userInput) in older versions.',
),
'location.assign(), window.open(), history.pushState(), history.replaceState(), ' +
'URL.createObjectURL(), and jQuery selector $(userInput) in older versions.',
}),
});
export const SsrfSinksInputSchema = z.object({
applicable: z
.boolean()
.describe(
export const SsrfSinksInputSchema = Type.Object({
applicable: Type.Boolean({
description:
'False only if the application makes no outbound requests at all. Otherwise true, even if ' +
'no sinks were found in a given category — empty arrays mean "scanned this category, no sinks found".',
),
http_clients: z
.array(SinkRefSchema)
.describe(
'no sinks were found in a given category — empty arrays mean "scanned this category, no sinks found".',
}),
http_clients: Type.Array(SinkRefSchema, {
description:
'HTTP(S) clients: curl, requests (Python), axios (Node.js), fetch (JavaScript/Node.js), ' +
'net/http (Go), HttpClient (Java/.NET), urllib (Python), RestTemplate, WebClient, OkHttp, Apache HttpClient.',
),
raw_sockets: z
.array(SinkRefSchema)
.describe(
'net/http (Go), HttpClient (Java/.NET), urllib (Python), RestTemplate, WebClient, OkHttp, Apache HttpClient.',
}),
raw_sockets: Type.Array(SinkRefSchema, {
description:
'Raw sockets and connect APIs: Socket.connect, net.Dial (Go), socket.connect (Python), ' +
'TcpClient, UdpClient, NetworkStream, java.net.Socket, java.net.URL.openConnection().',
),
url_openers: z
.array(SinkRefSchema)
.describe(
'TcpClient, UdpClient, NetworkStream, java.net.Socket, java.net.URL.openConnection().',
}),
url_openers: Type.Array(SinkRefSchema, {
description:
'URL openers and file includes: file_get_contents (PHP), fopen, include_once, require_once, ' +
'new URL().openStream() (Java), urllib.urlopen (Python), fs.readFile with URLs, ' +
'import() with dynamic URLs, loadHTML / loadXML with external sources.',
),
redirect_handlers: z
.array(SinkRefSchema)
.describe(
'new URL().openStream() (Java), urllib.urlopen (Python), fs.readFile with URLs, ' +
'import() with dynamic URLs, loadHTML / loadXML with external sources.',
}),
redirect_handlers: Type.Array(SinkRefSchema, {
description:
'Redirect and "next URL" handlers: auto-follow redirects in HTTP clients, framework Location ' +
'handlers (response.redirect), URL validation in redirect chains, "Continue to" / "Return URL" parameters.',
),
headless_browsers: z
.array(SinkRefSchema)
.describe(
'handlers (response.redirect), URL validation in redirect chains, "Continue to" / "Return URL" parameters.',
}),
headless_browsers: Type.Array(SinkRefSchema, {
description:
'Headless browsers and render engines: Puppeteer (page.goto, page.setContent), ' +
'Playwright (page.goto, page.route), Selenium WebDriver navigation, html-to-pdf converters ' +
'(wkhtmltopdf, Puppeteer PDF), and SSR with external content.',
),
media_processors: z
.array(SinkRefSchema)
.describe(
'Playwright (page.goto, page.route), Selenium WebDriver navigation, html-to-pdf converters ' +
'(wkhtmltopdf, Puppeteer PDF), and SSR with external content.',
}),
media_processors: Type.Array(SinkRefSchema, {
description:
'Media processors: ImageMagick (convert, identify with URLs), GraphicsMagick, FFmpeg with ' +
'network sources, wkhtmltopdf, Ghostscript with URL inputs, image optimization services with URL parameters.',
),
link_preview: z
.array(SinkRefSchema)
.describe(
'network sources, wkhtmltopdf, Ghostscript with URL inputs, image optimization services with URL parameters.',
}),
link_preview: Type.Array(SinkRefSchema, {
description:
'Link preview and unfurlers: chat application link expanders, CMS link preview generators, ' +
'oEmbed endpoint fetchers, social media card generators, URL metadata extractors.',
),
webhook_testers: z
.array(SinkRefSchema)
.describe(
'oEmbed endpoint fetchers, social media card generators, URL metadata extractors.',
}),
webhook_testers: Type.Array(SinkRefSchema, {
description:
'Webhook testers and callback verifiers: "ping my webhook" functionality, outbound callback ' +
'verification, health check notifications, event delivery confirmations, API endpoint validation tools.',
),
sso_oidc_discovery: z
.array(SinkRefSchema)
.describe(
'verification, health check notifications, event delivery confirmations, API endpoint validation tools.',
}),
sso_oidc_discovery: Type.Array(SinkRefSchema, {
description:
'SSO/OIDC discovery and JWKS fetchers: OpenID Connect discovery endpoints, JWKS fetchers, ' +
'OAuth authorization server metadata, SAML metadata fetchers, federation metadata retrievers.',
),
importers: z
.array(SinkRefSchema)
.describe(
'OAuth authorization server metadata, SAML metadata fetchers, federation metadata retrievers.',
}),
importers: Type.Array(SinkRefSchema, {
description:
'Importers and data loaders: "import from URL" functionality, CSV/JSON/XML remote loaders, ' +
'RSS/Atom feed readers, API data synchronization, configuration file fetchers.',
),
package_installers: z
.array(SinkRefSchema)
.describe(
'RSS/Atom feed readers, API data synchronization, configuration file fetchers.',
}),
package_installers: Type.Array(SinkRefSchema, {
description:
'Package/plugin/theme installers: "install from URL" features, package managers with remote ' +
'sources, plugin/theme downloaders, update mechanisms with remote checks, dependency resolution ' +
'with external repos.',
),
monitoring_and_health: z
.array(SinkRefSchema)
.describe(
'sources, plugin/theme downloaders, update mechanisms with remote checks, dependency resolution ' +
'with external repos.',
}),
monitoring_and_health: Type.Array(SinkRefSchema, {
description:
'Monitoring and health check frameworks: URL pingers and uptime checkers, health check ' +
'endpoints, monitoring probe systems, alerting webhook senders, performance testing tools.',
),
cloud_metadata: z
.array(SinkRefSchema)
.describe(
'endpoints, monitoring probe systems, alerting webhook senders, performance testing tools.',
}),
cloud_metadata: Type.Array(SinkRefSchema, {
description:
'Cloud metadata helpers: AWS/GCP/Azure instance metadata callers, cloud service discovery ' +
'mechanisms, container orchestration API clients, infrastructure metadata fetchers, service mesh ' +
'configuration retrievers.',
),
'mechanisms, container orchestration API clients, infrastructure metadata fetchers, service mesh ' +
'configuration retrievers.',
}),
});
// ============================================================================
// EXPORTED TYPES
// ============================================================================
export type ExecutiveSummaryInput = z.infer<typeof ExecutiveSummaryInputSchema>;
export type ApplicationIntelligenceInput = z.infer<typeof ApplicationIntelligenceInputSchema>;
export type AuthDeepDiveInput = z.infer<typeof AuthDeepDiveInputSchema>;
export type CodebaseIndexingInput = z.infer<typeof CodebaseIndexingInputSchema>;
export type CriticalFilePathsInput = z.infer<typeof CriticalFilePathsInputSchema>;
export type XssSinksInput = z.infer<typeof XssSinksInputSchema>;
export type SsrfSinksInput = z.infer<typeof SsrfSinksInputSchema>;
export type ExecutiveSummaryInput = Static<typeof ExecutiveSummaryInputSchema>;
export type ApplicationIntelligenceInput = Static<typeof ApplicationIntelligenceInputSchema>;
export type AuthDeepDiveInput = Static<typeof AuthDeepDiveInputSchema>;
export type CodebaseIndexingInput = Static<typeof CodebaseIndexingInputSchema>;
export type CriticalFilePathsInput = Static<typeof CriticalFilePathsInputSchema>;
export type XssSinksInput = Static<typeof XssSinksInputSchema>;
export type SsrfSinksInput = Static<typeof SsrfSinksInputSchema>;
export interface PreReconData {
readonly executive_summary?: ExecutiveSummaryInput;
@@ -427,32 +394,27 @@ export type PreReconCallStatus = Readonly<Record<PreReconToolName, PreReconToolS
// ============================================================================
interface ToolResult {
[x: string]: unknown;
content: Array<{ type: 'text'; text: string }>;
isError: boolean;
}
function createToolResult(response: { status: string; [key: string]: unknown }): ToolResult {
return {
content: [{ type: 'text', text: JSON.stringify(response, null, 2) }],
isError: response.status === 'error',
};
details: Record<string, unknown>;
isError?: boolean;
}
function successResult(data: Record<string, unknown>): ToolResult {
return createToolResult({ status: 'success', ...data });
const response = { status: 'success', ...data };
return { content: [{ type: 'text' as const, text: JSON.stringify(response, null, 2) }], details: {} };
}
function errorResult(message: string, errorType = 'ValidationError', retryable = true): ToolResult {
return createToolResult({ status: 'error', message, errorType, retryable });
const response = { status: 'error', message, errorType, retryable };
return { content: [{ type: 'text' as const, text: JSON.stringify(response, null, 2) }], details: {}, isError: true };
}
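Because both helpers above serialize their payload into a single text block, a caller that wants the structured fields back has to parse that text. The sketch below restates the new result shape from the diff and adds a hypothetical `decodeResult` helper; neither the helper nor its return type is part of the harness API.

```typescript
// Local restatement of the result envelope shown in the diff.
interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  details: Record<string, unknown>;
  isError?: boolean;
}

function successResult(data: Record<string, unknown>): ToolResult {
  const response = { status: 'success', ...data };
  return { content: [{ type: 'text', text: JSON.stringify(response, null, 2) }], details: {} };
}

function errorResult(message: string, errorType = 'ValidationError', retryable = true): ToolResult {
  const response = { status: 'error', message, errorType, retryable };
  return { content: [{ type: 'text', text: JSON.stringify(response, null, 2) }], details: {}, isError: true };
}

// Hypothetical helper: recover the JSON payload from the first text block.
function decodeResult(result: ToolResult): { ok: boolean; payload: Record<string, unknown> } {
  const first = result.content[0];
  const payload = first ? (JSON.parse(first.text) as Record<string, unknown>) : {};
  // isError is optional, so an absent flag counts as success.
  return { ok: result.isError !== true, payload };
}
```

Since `isError` is now optional and only set on failures, the check is written as `isError !== true` and does not read the flag as a required boolean.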
// ============================================================================
// SERVER FACTORY
// TOOLS FACTORY
// ============================================================================
export interface PreReconCollectorServer {
server: McpSdkServerConfigWithInstance;
tools: ToolDefinition[];
getAll(): PreReconData;
getCallStatus(): PreReconCallStatus;
}
@@ -476,113 +438,123 @@ export function createPreReconCollectorServer(): PreReconCollectorServer {
);
}
const setExecutiveSummary = tool(
'set_executive_summary',
"Record the application's overall security posture as a short executive summary. " +
const setExecutiveSummary = defineTool({
name: 'set_executive_summary',
label: 'Set Executive Summary',
description:
"Record the application's overall security posture as a short executive summary. " +
'Call exactly once before terminating. Becomes Section 1 of the rendered deliverable. ' +
'Duplicate calls are rejected.',
ExecutiveSummaryInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: ExecutiveSummaryInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.executive_summary) return alreadyCalled('set_executive_summary');
state.executive_summary = input;
return successResult({ set: 'set_executive_summary' });
},
);
});
const setApplicationIntelligence = tool(
'set_application_intelligence',
'Record the composite application intelligence — architecture, data security, attack surface, ' +
const setApplicationIntelligence = defineTool({
name: 'set_application_intelligence',
label: 'Set Application Intelligence',
description:
'Record the composite application intelligence — architecture, data security, attack surface, ' +
'and infrastructure — in a single call. Call exactly once before terminating. ' +
'Becomes Sections 2, 4, 5, and 6 of the rendered deliverable. Duplicate calls are rejected.',
ApplicationIntelligenceInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: ApplicationIntelligenceInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.application_intelligence) return alreadyCalled('set_application_intelligence');
state.application_intelligence = input;
return successResult({ set: 'set_application_intelligence' });
},
);
});
const setAuthDeepDive = tool(
'set_auth_deep_dive',
'Record the authentication & authorization deep dive. Call exactly once before terminating. ' +
const setAuthDeepDive = defineTool({
name: 'set_auth_deep_dive',
label: 'Set Auth Deep Dive',
description:
'Record the authentication & authorization deep dive. Call exactly once before terminating. ' +
'Becomes Section 3 of the rendered deliverable. Duplicate calls are rejected.',
AuthDeepDiveInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: AuthDeepDiveInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.auth_deep_dive) return alreadyCalled('set_auth_deep_dive');
state.auth_deep_dive = input;
return successResult({ set: 'set_auth_deep_dive' });
},
);
});
const setCodebaseIndexing = tool(
'set_codebase_indexing',
'Record the overall codebase indexing narrative. Call exactly once before terminating. ' +
const setCodebaseIndexing = defineTool({
name: 'set_codebase_indexing',
label: 'Set Codebase Indexing',
description:
'Record the overall codebase indexing narrative. Call exactly once before terminating. ' +
'Becomes Section 7 of the rendered deliverable. Duplicate calls are rejected.',
CodebaseIndexingInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: CodebaseIndexingInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.codebase_indexing) return alreadyCalled('set_codebase_indexing');
state.codebase_indexing = input;
return successResult({ set: 'set_codebase_indexing' });
},
);
});
const setCriticalFilePaths = tool(
'set_critical_file_paths',
'Record the catalog of critical file paths grouped by security relevance. Call exactly once ' +
const setCriticalFilePaths = defineTool({
name: 'set_critical_file_paths',
label: 'Set Critical File Paths',
description:
'Record the catalog of critical file paths grouped by security relevance. Call exactly once ' +
'before terminating. Becomes Section 8 of the rendered deliverable. The next agent uses this ' +
'as a starting point for manual review. Duplicate calls are rejected.',
CriticalFilePathsInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: CriticalFilePathsInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.critical_file_paths) return alreadyCalled('set_critical_file_paths');
state.critical_file_paths = input;
return successResult({ set: 'set_critical_file_paths' });
},
);
});
const setXssSinks = tool(
'set_xss_sinks',
'Record discovered XSS sinks grouped by render context. Call exactly once before terminating. ' +
const setXssSinks = defineTool({
name: 'set_xss_sinks',
label: 'Set XSS Sinks',
description:
'Record discovered XSS sinks grouped by render context. Call exactly once before terminating. ' +
'If the application has no web frontend at all, set applicable=false; otherwise populate each ' +
'render-context array (empty arrays mean "scanned, no sinks of this kind"). This list drives ' +
"the vuln-xss agent's testing todos downstream. Becomes Section 9 of the rendered deliverable. " +
'Duplicate calls are rejected.',
XssSinksInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: XssSinksInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.xss_sinks) return alreadyCalled('set_xss_sinks');
state.xss_sinks = input;
return successResult({ set: 'set_xss_sinks' });
},
);
});
const setSsrfSinks = tool(
'set_ssrf_sinks',
'Record discovered SSRF sinks grouped by sink category. Call exactly once before terminating. ' +
const setSsrfSinks = defineTool({
name: 'set_ssrf_sinks',
label: 'Set SSRF Sinks',
description:
'Record discovered SSRF sinks grouped by sink category. Call exactly once before terminating. ' +
'If the application makes no outbound requests at all, set applicable=false; otherwise populate ' +
'each category array (empty arrays mean "scanned, no sinks of this kind"). This list drives ' +
"the vuln-ssrf agent's testing todos downstream. Becomes Section 10 of the rendered deliverable. " +
'Duplicate calls are rejected.',
SsrfSinksInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: SsrfSinksInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.ssrf_sinks) return alreadyCalled('set_ssrf_sinks');
state.ssrf_sinks = input;
return successResult({ set: 'set_ssrf_sinks' });
},
);
const server: McpSdkServerConfigWithInstance = createSdkMcpServer({
name: 'pre-recon-collector',
version: '1.0.0',
tools: [
setExecutiveSummary,
setApplicationIntelligence,
setAuthDeepDive,
setCodebaseIndexing,
setCriticalFilePaths,
setXssSinks,
setSsrfSinks,
],
});
const tools: ToolDefinition[] = [
setExecutiveSummary,
setApplicationIntelligence,
setAuthDeepDive,
setCodebaseIndexing,
setCriticalFilePaths,
setXssSinks,
setSsrfSinks,
];
function statusOf<K extends PreReconToolName>(key: K): PreReconToolStatus {
const flagMap: Record<PreReconToolName, unknown> = {
set_executive_summary: state.executive_summary,
@@ -597,7 +569,7 @@ export function createPreReconCollectorServer(): PreReconCollectorServer {
}
return {
server,
tools,
getAll: (): PreReconData => ({
...(state.executive_summary && { executive_summary: state.executive_summary }),
...(state.application_intelligence && { application_intelligence: state.application_intelligence }),
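The collector tools in this hunk share one shape: each tool writes its input into a closure-held `state` slot once, later calls short-circuit through `alreadyCalled`, and `getAll` spreads in only the slots that were filled. A minimal, dependency-free sketch of that pattern follows. It does not use `defineTool` or TypeBox, the handlers are synchronous for brevity, and `createSinkCollector`, `setOnce`, the `'already_called'` status string and the `'called' | 'skipped'` status values are all hypothetical names chosen for illustration, not the project's actual implementation.

```typescript
// Hypothetical sketch of the one-shot collector pattern; not the project's code.
type SinkKey = 'xss_sinks' | 'ssrf_sinks';
type ToolStatus = 'called' | 'skipped'; // assumed status values

interface SinkInput {
  applicable: boolean;
}

interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  details: Record<string, unknown>;
  isError: boolean;
}

function toResult(response: { status: string; [key: string]: unknown }): ToolResult {
  return {
    content: [{ type: 'text', text: JSON.stringify(response, null, 2) }],
    details: {},
    isError: response.status === 'error',
  };
}

function createSinkCollector() {
  // State lives in the closure, so each collector instance is isolated per run.
  const state: Partial<Record<SinkKey, SinkInput>> = {};

  function setOnce(key: SinkKey, input: SinkInput): ToolResult {
    const tool = `set_${key}`;
    // A second call leaves the first value untouched (assumed to be a non-error no-op).
    if (state[key]) return toResult({ status: 'already_called', tool });
    state[key] = input;
    return toResult({ status: 'success', set: tool });
  }

  return {
    setXssSinks: (input: SinkInput) => setOnce('xss_sinks', input),
    setSsrfSinks: (input: SinkInput) => setOnce('ssrf_sinks', input),
    // Only slots that were actually filled appear in the result.
    getAll: (): Partial<Record<SinkKey, SinkInput>> => ({
      ...(state.xss_sinks && { xss_sinks: state.xss_sinks }),
      ...(state.ssrf_sinks && { ssrf_sinks: state.ssrf_sinks }),
    }),
    getCallStatus: (): Record<SinkKey, ToolStatus> => ({
      xss_sinks: state.xss_sinks ? 'called' : 'skipped',
      ssrf_sinks: state.ssrf_sinks ? 'called' : 'skipped',
    }),
  };
}

// Usage: the duplicate call does not overwrite the first value.
const collector = createSinkCollector();
const first = collector.setXssSinks({ applicable: false });
const second = collector.setXssSinks({ applicable: true });
```

Keeping the slots optional lets a renderer tell "tool never called" apart from "tool called with empty arrays", which is the distinction the tool descriptions above rely on.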
File diff suppressed because it is too large
+236 -257
@@ -5,9 +5,9 @@
// as published by the Free Software Foundation.
/**
* Vuln Collector MCP Server (factory parameterized by vulnerability class).
* Vuln Collector tools (factory parameterized by vulnerability class).
*
* Exposes 4 one-shot, Zod-validated MCP tools per vuln agent (injection, xss,
* Exposes 4 one-shot, TypeBox-validated tools per vuln agent (injection, xss,
* auth, ssrf, authz) that feed a deterministic renderer producing
* {class}_analysis_deliverable.md:
* - set_findings_summary — §1 executive summary + §2 dominant patterns
@@ -20,14 +20,13 @@
* across classes.
*
* Skipped tools surface as renderer placeholders, not activity failures.
* getCallStatus() exposes the per-run call pattern for logging. Each Zod
* schema's field-level descriptions carry the section guidance, so the SDK
* injects it into the agent's tool catalog.
* getCallStatus() exposes the per-run call pattern for logging. Each schema's
* field-level descriptions carry the section guidance, so the agent's tool
* catalog surfaces it.
*/
import type { McpSdkServerConfigWithInstance } from '@anthropic-ai/claude-agent-sdk';
import { createSdkMcpServer, tool } from '@anthropic-ai/claude-agent-sdk';
import { type ZodRawShape, z } from 'zod';
import { defineTool, type ToolDefinition } from '@earendil-works/pi-coding-agent';
import { type Static, Type } from 'typebox';
// ============================================================================
// CLASS DISCRIMINATOR
@@ -46,286 +45,262 @@ export const BLIND_SPOTS_CLASSES: ReadonlySet<VulnClass> = new Set<VulnClass>(['
// SHARED SCHEMAS — set_findings_summary, set_safe_vectors, set_blind_spots
// ============================================================================
const PatternSchema = z.object({
name: z
.string()
.min(1)
.describe(
const PatternSchema = Type.Object({
name: Type.String({
minLength: 1,
description:
'Concise pattern name, e.g. "Weak Session Management", "Reflected XSS in Search Parameter", ' +
'"Insufficient URL Validation".',
),
description: z.string().min(1).describe('One- to two-sentence description of the pattern observed in the codebase.'),
implication: z
.string()
.min(1)
.describe('One- to two-sentence implication for exploitation — what does this pattern enable an attacker to do.'),
representative_finding_ids: z
.array(z.string().min(1))
.min(1)
.describe(
'"Insufficient URL Validation".',
}),
description: Type.String({
minLength: 1,
description: 'One- to two-sentence description of the pattern observed in the codebase.',
}),
implication: Type.String({
minLength: 1,
description: 'One- to two-sentence implication for exploitation — what does this pattern enable an attacker to do.',
}),
representative_finding_ids: Type.Array(Type.String({ minLength: 1 }), {
minItems: 1,
description:
'IDs of findings that exhibit this pattern (e.g. ["AUTH-VULN-01", "AUTH-VULN-02"]). Must match ' +
'IDs the agent has assigned in the structured-output exploitation queue.',
),
'IDs the agent has assigned in the structured-output exploitation queue.',
}),
});
export const FindingsSummaryInputSchema = z.object({
key_outcome: z
.string()
.min(1)
.describe(
export const FindingsSummaryInputSchema = Type.Object({
key_outcome: Type.String({
minLength: 1,
description:
'One to two sentences capturing the headline result of your analysis — what was found and its ' +
'severity profile (e.g. "Several high-confidence SQL injection vulnerabilities were identified; ' +
'all findings have been passed to the exploitation phase"). Becomes Section 1 of the rendered ' +
'deliverable.',
),
patterns: z
.array(PatternSchema)
.describe(
'severity profile (e.g. "Several high-confidence SQL injection vulnerabilities were identified; ' +
'all findings have been passed to the exploitation phase"). Becomes Section 1 of the rendered ' +
'deliverable.',
}),
patterns: Type.Array(PatternSchema, {
description:
'Complete list of dominant patterns observed across findings. Pass all patterns in one call. ' +
'Empty array is acceptable if no recurring patterns were observed — the deliverable will render ' +
'"No dominant patterns identified" for Section 2 in that case.',
),
'Empty array is acceptable if no recurring patterns were observed — the deliverable will render ' +
'"No dominant patterns identified" for Section 2 in that case.',
}),
});
export const SafeVectorInputSchema = z.object({
subject: z
.string()
.min(1)
.describe(
export const SafeVectorInputSchema = Type.Object({
subject: Type.String({
minLength: 1,
description:
'The specific subject of analysis. For injection/xss runs, the input parameter name (e.g. ' +
'"username", "redirect_url"). For auth/ssrf runs, the component or flow name (e.g. ' +
'"Password Hashing", "Webhook Configuration"). For authz runs, the endpoint (e.g. ' +
'"POST /api/auth/logout"). The renderer maps this to the class-appropriate column header.',
),
location: z
.string()
.min(1)
.describe(
'"username", "redirect_url"). For auth/ssrf runs, the component or flow name (e.g. ' +
'"Password Hashing", "Webhook Configuration"). For authz runs, the endpoint (e.g. ' +
'"POST /api/auth/logout"). The renderer maps this to the class-appropriate column header.',
}),
location: Type.String({
minLength: 1,
description:
'File path with line number (e.g. "controllers/authController.js:45") or endpoint URL (e.g. ' +
'"/profile"). For authz runs, this is the guard location specifically (e.g. ' +
'"middleware/auth.js:45"). The renderer maps this to the class-appropriate column header.',
),
defense_mechanism: z
.string()
.min(1)
.describe(
'"/profile"). For authz runs, this is the guard location specifically (e.g. ' +
'"middleware/auth.js:45"). The renderer maps this to the class-appropriate column header.',
}),
defense_mechanism: Type.String({
minLength: 1,
description:
'The robust defense observed (e.g. "Prepared Statement (Parameter Binding)", "HTML Entity ' +
'Encoding", "Strict URL Whitelist Validation", "bcrypt.compare for constant-time check").',
),
render_context: z
.string()
.nullable()
.optional()
.describe(
'XSS-only: the DOM render context for the validated vector — one of HTML_BODY, HTML_ATTRIBUTE, ' +
'Encoding", "Strict URL Whitelist Validation", "bcrypt.compare for constant-time check").',
}),
render_context: Type.Optional(
Type.Union([Type.String(), Type.Null()], {
description:
'XSS-only: the DOM render context for the validated vector — one of HTML_BODY, HTML_ATTRIBUTE, ' +
'JAVASCRIPT_STRING, URL_PARAM, CSS_VALUE. Omit (or pass null) for non-XSS classes; the renderer ' +
'only emits this column for the XSS deliverable.',
),
}),
),
});
export const SafeVectorsInputSchema = z.object({
vectors: z
.array(SafeVectorInputSchema)
.describe(
export const SafeVectorsInputSchema = Type.Object({
vectors: Type.Array(SafeVectorInputSchema, {
description:
'All input vectors / components / endpoints that were analyzed and confirmed to have robust, ' +
'context-appropriate defenses. Empty array is acceptable but unusual — the deliverable will ' +
'render "No vectors confirmed secure during analysis" for Section 4 in that case. Becomes ' +
'Section 4 of the rendered deliverable. The renderer sorts by (subject, location) before ' +
'rendering, so emission order does not affect output.',
),
'context-appropriate defenses. Empty array is acceptable but unusual — the deliverable will ' +
'render "No vectors confirmed secure during analysis" for Section 4 in that case. Becomes ' +
'Section 4 of the rendered deliverable. The renderer sorts by (subject, location) before ' +
'rendering, so emission order does not affect output.',
}),
});
export const BlindSpotItemSchema = z.object({
heading: z
.string()
.min(1)
.describe(
export const BlindSpotItemSchema = Type.Object({
heading: Type.String({
minLength: 1,
description:
'Short heading for the blind spot (e.g. "Untraced Asynchronous Flows", ' +
'"Limited Visibility into Stored Procedures", "Minified JavaScript Bundle").',
),
description: z
.string()
.min(1)
.describe(
'"Limited Visibility into Stored Procedures", "Minified JavaScript Bundle").',
}),
description: Type.String({
minLength: 1,
description:
'One to three sentences describing the analysis gap — what could not be traced, why, and what ' +
'the residual risk is.',
),
'the residual risk is.',
}),
});
export const BlindSpotsInputSchema = z.object({
items: z
.array(BlindSpotItemSchema)
.describe(
export const BlindSpotsInputSchema = Type.Object({
items: Type.Array(BlindSpotItemSchema, {
description:
'Analysis constraints, untraced code paths, or other coverage gaps that should be noted. ' +
'Empty array is acceptable on high-coverage runs — the deliverable will render "No analysis ' +
'constraints or blind spots identified" for Section 5 in that case. Becomes Section 5 of the ' +
'rendered deliverable.',
),
'Empty array is acceptable on high-coverage runs — the deliverable will render "No analysis ' +
'constraints or blind spots identified" for Section 5 in that case. Becomes Section 5 of the ' +
'rendered deliverable.',
}),
});
// ============================================================================
// PER-CLASS set_strategic_intelligence SCHEMAS (flat — no nesting)
// ============================================================================
const InjectionStrategicIntelSchema = z.object({
defensive_evasion_waf: z
.string()
.min(1)
.describe(
const InjectionStrategicIntelSchema = Type.Object({
defensive_evasion_waf: Type.String({
minLength: 1,
description:
'WAF behavior observed during analysis: active rules, common payloads blocked, identified ' +
'bypasses (e.g. "WAF blocks UNION SELECT but not time-based blind injection"). Write ' +
'"Not applicable — no WAF observed" if none was detected.',
),
error_based_potential: z
.string()
.min(1)
.describe(
'bypasses (e.g. "WAF blocks UNION SELECT but not time-based blind injection"). Write ' +
'"Not applicable — no WAF observed" if none was detected.',
}),
error_based_potential: Type.String({
minLength: 1,
description:
'Whether endpoints leak verbose database errors that enable error-based injection (e.g. ' +
'"/api/products returns verbose PostgreSQL error messages, prime target for error-based ' +
'exploitation"). Write "Not applicable" if no injection findings exist.',
),
confirmed_database_technology: z
.string()
.min(1)
.describe(
'"/api/products returns verbose PostgreSQL error messages, prime target for error-based ' +
'exploitation"). Write "Not applicable" if no injection findings exist.',
}),
confirmed_database_technology: Type.String({
minLength: 1,
description:
'Database engine(s) confirmed via error syntax or function calls (e.g. "PostgreSQL, confirmed ' +
'via pg_sleep() and verbose error syntax"). Drives payload selection downstream. Write ' +
'"Not applicable" if no DB sinks in scope.',
),
'via pg_sleep() and verbose error syntax"). Drives payload selection downstream. Write ' +
'"Not applicable" if no DB sinks in scope.',
}),
});
const XssStrategicIntelSchema = z.object({
csp_analysis: z
.string()
.min(1)
.describe(
const XssStrategicIntelSchema = Type.Object({
csp_analysis: Type.String({
minLength: 1,
description:
'Content Security Policy observed and its bypassability: current policy text, critical bypasses ' +
"(e.g. \"script-src 'self' https://trusted-cdn.com — the trusted CDN hosts vulnerable AngularJS, " +
'enabling client-side template injection bypass"). Write "Not applicable — no CSP header served" ' +
'if none.',
),
cookie_security: z
.string()
.min(1)
.describe(
"(e.g. \"script-src 'self' https://trusted-cdn.com — the trusted CDN hosts vulnerable AngularJS, " +
'enabling client-side template injection bypass"). Write "Not applicable — no CSP header served" ' +
'if none.',
}),
cookie_security: Type.String({
minLength: 1,
description:
'Session cookie security observations: HttpOnly, Secure, SameSite flags, and storage mechanism ' +
'(e.g. "Primary session cookie `sessionid` is missing HttpOnly; tokens are also stored in ' +
'localStorage, both accessible to JavaScript"). Drives exfiltration strategy.',
),
'(e.g. "Primary session cookie `sessionid` is missing HttpOnly; tokens are also stored in ' +
'localStorage, both accessible to JavaScript"). Drives exfiltration strategy.',
}),
});
const AuthStrategicIntelSchema = z.object({
authentication_method: z
.string()
.min(1)
.describe(
const AuthStrategicIntelSchema = Type.Object({
authentication_method: Type.String({
minLength: 1,
description:
'How users authenticate: JWT, session cookie, OAuth, SAML, etc. Include any algorithm or library ' +
'details (e.g. "JWT (RS256) with hardcoded private key in lib/insecurity.ts:23").',
),
session_token_details: z
.string()
.min(1)
.describe(
'details (e.g. "JWT (RS256) with hardcoded private key in lib/insecurity.ts:23").',
}),
session_token_details: Type.String({
minLength: 1,
description:
'Where tokens live and how they are protected: cookie name, storage mechanism (cookie vs ' +
'localStorage), cookie flags, expiration (e.g. "JWT stored in localStorage under key `token`; ' +
'cookie copy lacks HttpOnly/Secure/SameSite; 6-hour TTL with no revocation").',
),
password_policy: z
.string()
.min(1)
.describe(
'localStorage), cookie flags, expiration (e.g. "JWT stored in localStorage under key `token`; ' +
'cookie copy lacks HttpOnly/Secure/SameSite; 6-hour TTL with no revocation").',
}),
password_policy: Type.String({
minLength: 1,
description:
'Observed server-side password policy and storage: complexity rules, hashing algorithm, salt, ' +
'(e.g. "MD5 without salt via crypto.createHash; no server-side complexity policy; client-side ' +
'5-char minimum trivially bypassed").',
),
'(e.g. "MD5 without salt via crypto.createHash; no server-side complexity policy; client-side ' +
'5-char minimum trivially bypassed").',
}),
});
const SsrfStrategicIntelSchema = z.object({
http_client_library: z
.string()
.min(1)
.describe(
const SsrfStrategicIntelSchema = Type.Object({
http_client_library: Type.String({
minLength: 1,
description:
'HTTP client library/libraries used for outbound requests (e.g. "axios 1.6", "node-fetch", ' +
'"requests", "HttpClient (Spring)"). Include version where it informs known bypass techniques.',
),
request_architecture: z
.string()
.min(1)
.describe(
'"requests", "HttpClient (Spring)"). Include version where it informs known bypass techniques.',
}),
request_architecture: Type.String({
minLength: 1,
description:
'How outbound requests are constructed and routed: proxy/middleware patterns, internal routing ' +
'rules (e.g. "Webhook URLs are POSTed directly without an outbound proxy; redirects are ' +
'followed by default with no maxRedirects limit").',
),
internal_services: z
.string()
.min(1)
.describe(
'rules (e.g. "Webhook URLs are POSTed directly without an outbound proxy; redirects are ' +
'followed by default with no maxRedirects limit").',
}),
internal_services: Type.String({
minLength: 1,
description:
'Internal endpoints, services, or cloud-metadata addresses discovered during analysis that an ' +
'SSRF could reach (e.g. "169.254.169.254 (AWS IMDS), internal admin API at admin.internal:8443, ' +
'PostgreSQL on localhost:5432").',
),
'SSRF could reach (e.g. "169.254.169.254 (AWS IMDS), internal admin API at admin.internal:8443, ' +
'PostgreSQL on localhost:5432").',
}),
});
const AuthzStrategicIntelSchema = z.object({
session_management_architecture: z
.string()
.min(1)
.describe(
const AuthzStrategicIntelSchema = Type.Object({
session_management_architecture: Type.String({
minLength: 1,
description:
'Session and authentication architecture relevant to authorization decisions: where user identity ' +
'comes from, whether the user ID is trusted by downstream guards (e.g. "JWT tokens in cookies; ' +
'user ID extracted from `req.user.id` and used directly in DB queries without ownership ' +
're-validation").',
),
role_permission_model: z
.string()
.min(1)
.describe(
'comes from, whether the user ID is trusted by downstream guards (e.g. "JWT tokens in cookies; ' +
'user ID extracted from `req.user.id` and used directly in DB queries without ownership ' +
're-validation").',
}),
role_permission_model: Type.String({
minLength: 1,
description:
'Roles, capabilities, and where they live: identified roles, their privilege levels, and where ' +
'role/permission data is stored (e.g. "Three roles: user, moderator, admin. Role embedded in ' +
'JWT and database; checks inconsistent — many admin routes only check `req.user` presence").',
),
resource_access_patterns: z
.string()
.min(1)
.describe(
'role/permission data is stored (e.g. "Three roles: user, moderator, admin. Role embedded in ' +
'JWT and database; checks inconsistent — many admin routes only check `req.user` presence").',
}),
resource_access_patterns: Type.String({
minLength: 1,
description:
'How resource IDs flow through the system and ownership patterns: e.g. "Most endpoints use path ' +
'parameters for resource IDs (/api/users/{id}); IDs are passed to DB queries without ownership ' +
'validation". Critical for IDOR exploitation.',
),
workflow_implementation: z
.string()
.min(1)
.describe(
'parameters for resource IDs (/api/users/{id}); IDs are passed to DB queries without ownership ' +
'validation". Critical for IDOR exploitation.',
}),
workflow_implementation: Type.String({
minLength: 1,
description:
'Multi-step processes and state transitions: how workflow stages are tracked, whether prior-state ' +
'checks are enforced (e.g. "Multi-step processes use status fields in database; status ' +
'transitions do not verify prior state completion"). Drives context-based authz exploitation.',
),
'checks are enforced (e.g. "Multi-step processes use status fields in database; status ' +
'transitions do not verify prior state completion"). Drives context-based authz exploitation.',
}),
});
const STRATEGIC_INTEL_SCHEMAS: Record<VulnClass, z.ZodObject<ZodRawShape>> = {
const STRATEGIC_INTEL_SCHEMAS = {
injection: InjectionStrategicIntelSchema,
xss: XssStrategicIntelSchema,
auth: AuthStrategicIntelSchema,
ssrf: SsrfStrategicIntelSchema,
authz: AuthzStrategicIntelSchema,
};
} as const;
// ============================================================================
// EXPORTED TYPES
// ============================================================================
export type Pattern = z.infer<typeof PatternSchema>;
export type FindingsSummaryInput = z.infer<typeof FindingsSummaryInputSchema>;
export type SafeVectorInput = z.infer<typeof SafeVectorInputSchema>;
export type SafeVectorsInput = z.infer<typeof SafeVectorsInputSchema>;
export type BlindSpotItem = z.infer<typeof BlindSpotItemSchema>;
export type BlindSpotsInput = z.infer<typeof BlindSpotsInputSchema>;
export type Pattern = Static<typeof PatternSchema>;
export type FindingsSummaryInput = Static<typeof FindingsSummaryInputSchema>;
export type SafeVectorInput = Static<typeof SafeVectorInputSchema>;
export type SafeVectorsInput = Static<typeof SafeVectorsInputSchema>;
export type BlindSpotItem = Static<typeof BlindSpotItemSchema>;
export type BlindSpotsInput = Static<typeof BlindSpotsInputSchema>;
export type InjectionStrategicIntel = z.infer<typeof InjectionStrategicIntelSchema>;
export type XssStrategicIntel = z.infer<typeof XssStrategicIntelSchema>;
export type AuthStrategicIntel = z.infer<typeof AuthStrategicIntelSchema>;
export type SsrfStrategicIntel = z.infer<typeof SsrfStrategicIntelSchema>;
export type AuthzStrategicIntel = z.infer<typeof AuthzStrategicIntelSchema>;
export type InjectionStrategicIntel = Static<typeof InjectionStrategicIntelSchema>;
export type XssStrategicIntel = Static<typeof XssStrategicIntelSchema>;
export type AuthStrategicIntel = Static<typeof AuthStrategicIntelSchema>;
export type SsrfStrategicIntel = Static<typeof SsrfStrategicIntelSchema>;
export type AuthzStrategicIntel = Static<typeof AuthzStrategicIntelSchema>;
// Discriminated by the agent class context — the renderer reads only the
// sub-fields that apply to the active class.
@@ -363,12 +338,14 @@ export type VulnCallStatus = Readonly<Record<VulnToolName, VulnToolStatus>>;
interface ToolResult {
[x: string]: unknown;
content: Array<{ type: 'text'; text: string }>;
details: Record<string, unknown>;
isError: boolean;
}
function createToolResult(response: { status: string; [key: string]: unknown }): ToolResult {
return {
content: [{ type: 'text', text: JSON.stringify(response, null, 2) }],
content: [{ type: 'text' as const, text: JSON.stringify(response, null, 2) }],
details: {},
isError: response.status === 'error',
};
}
@@ -382,11 +359,11 @@ function errorResult(message: string, errorType = 'ValidationError', retryable =
}
// ============================================================================
// SERVER FACTORY
// COLLECTOR FACTORY
// ============================================================================
export interface VulnCollectorServer {
server: McpSdkServerConfigWithInstance;
tools: ToolDefinition[];
getAll(): VulnCollectorData;
getCallStatus(): VulnCallStatus;
}
@@ -407,68 +384,76 @@ export function createVulnCollector(vulnClass: VulnClass): VulnCollectorServer {
);
}
const setFindingsSummary = tool(
'set_findings_summary',
'Record the executive summary headline and the dominant vulnerability patterns observed across ' +
const setFindingsSummary = defineTool({
name: 'set_findings_summary',
label: 'Set Findings Summary',
description:
'Record the executive summary headline and the dominant vulnerability patterns observed across ' +
'your findings. Call exactly once before terminating. Becomes Section 1 (key outcome) and ' +
'Section 2 (patterns) of the rendered deliverable — this is the load-bearing emission for the ' +
'narrative .md and is required. Duplicate calls return "already called" and are no-ops. Empty ' +
'patterns array is acceptable (renders as "No dominant patterns identified") but key_outcome ' +
'is always required.',
FindingsSummaryInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: FindingsSummaryInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.findings_summary) return alreadyCalled('set_findings_summary');
state.findings_summary = input;
return successResult({ set: 'set_findings_summary' });
},
);
});
const intelSchema = STRATEGIC_INTEL_SCHEMAS[vulnClass];
const setStrategicIntelligence = tool(
'set_strategic_intelligence',
`Record the environmental and defensive intelligence relevant to exploiting the ${vulnClass} ` +
const setStrategicIntelligence = defineTool({
name: 'set_strategic_intelligence',
label: 'Set Strategic Intelligence',
description:
`Record the environmental and defensive intelligence relevant to exploiting the ${vulnClass} ` +
'findings. Call exactly once before terminating. Becomes Section 3 of the rendered deliverable ' +
`and is the section the downstream exploit-${vulnClass} agent reads for strategic context. ` +
'Required. Duplicate calls return "already called" and are no-ops. Write "Not applicable" as ' +
'the field value when a sub-field does not apply to this run (rather than omitting).',
intelSchema.shape,
async (input): Promise<ToolResult> => {
parameters: intelSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.strategic_intelligence) return alreadyCalled('set_strategic_intelligence');
state.strategic_intelligence = input as unknown as StrategicIntelligenceInput;
return successResult({ set: 'set_strategic_intelligence' });
},
);
});
const setSafeVectors = tool(
'set_safe_vectors',
'Record the input vectors, components, or endpoints that were analyzed and confirmed to have ' +
const setSafeVectors = defineTool({
name: 'set_safe_vectors',
label: 'Set Safe Vectors',
description:
'Record the input vectors, components, or endpoints that were analyzed and confirmed to have ' +
'robust, context-appropriate defenses. Call exactly once before terminating. Becomes Section 4 ' +
'of the rendered deliverable. Recommended (empty array is acceptable on runs where no vectors ' +
'were validated as safe, but explicit emission is preferred). The renderer sorts by ' +
'(subject, location) before rendering, so emission order does not affect output. Duplicate ' +
'calls return "already called" and are no-ops.',
SafeVectorsInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: SafeVectorsInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.safe_vectors) return alreadyCalled('set_safe_vectors');
state.safe_vectors = input;
return successResult({ set: 'set_safe_vectors', count: input.vectors.length });
},
);
});
const setBlindSpots = tool(
'set_blind_spots',
'Record analysis constraints, untraced code paths, or other coverage gaps. Call exactly once ' +
const setBlindSpots = defineTool({
name: 'set_blind_spots',
label: 'Set Blind Spots',
description:
'Record analysis constraints, untraced code paths, or other coverage gaps. Call exactly once ' +
'before terminating. Becomes Section 5 of the rendered deliverable. Recommended (empty array ' +
'is acceptable on high-coverage runs, but explicit emission is preferred — readers expect ' +
'either documented gaps or an explicit "no gaps" signal). Duplicate calls return "already ' +
'called" and are no-ops.',
BlindSpotsInputSchema.shape,
async (input): Promise<ToolResult> => {
parameters: BlindSpotsInputSchema,
execute: async (_toolCallId, input): Promise<ToolResult> => {
if (state.blind_spots) return alreadyCalled('set_blind_spots');
state.blind_spots = input;
return successResult({ set: 'set_blind_spots', count: input.items.length });
},
);
});
// set_blind_spots is withheld from classes without a Section 5 (auth, ssrf).
const tools = [
@@ -478,12 +463,6 @@ export function createVulnCollector(vulnClass: VulnClass): VulnCollectorServer {
...(BLIND_SPOTS_CLASSES.has(vulnClass) ? [setBlindSpots] : []),
];
const server: McpSdkServerConfigWithInstance = createSdkMcpServer({
name: 'vuln-collector',
version: '1.0.0',
tools,
});
function statusOf<K extends VulnToolName>(key: K): VulnToolStatus {
const flagMap: Record<VulnToolName, unknown> = {
set_findings_summary: state.findings_summary,
@@ -495,7 +474,7 @@ export function createVulnCollector(vulnClass: VulnClass): VulnCollectorServer {
}
return {
server,
tools: tools as ToolDefinition[],
getAll: (): VulnCollectorData => ({
...(state.findings_summary && { findings_summary: state.findings_summary }),
...(state.strategic_intelligence && { strategic_intelligence: state.strategic_intelligence }),
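Two class-dependent choices in this file can be sketched without the pi or TypeBox dependencies: which tools a vuln class receives, and which strategic-intelligence fields must be non-empty for that class. The field names below are copied from the schemas in the diff. The contents of `BLIND_SPOTS` are inferred from the comment that `set_blind_spots` is withheld from auth and ssrf, and `toolNamesFor`, `missingIntelFields` and the order of the tool names are hypothetical. The hand-written length check only approximates what `minLength: 1` expresses in the schema.

```typescript
// Hypothetical sketch; names other than the schema field names are assumptions.
type VulnClass = 'injection' | 'xss' | 'auth' | 'ssrf' | 'authz';

// Inferred from the source comment: auth and ssrf have no Section 5.
const BLIND_SPOTS: ReadonlySet<VulnClass> = new Set<VulnClass>(['injection', 'xss', 'authz']);

// Field names taken from the per-class strategic intelligence schemas.
const STRATEGIC_INTEL_FIELDS: Record<VulnClass, readonly string[]> = {
  injection: ['defensive_evasion_waf', 'error_based_potential', 'confirmed_database_technology'],
  xss: ['csp_analysis', 'cookie_security'],
  auth: ['authentication_method', 'session_token_details', 'password_policy'],
  ssrf: ['http_client_library', 'request_architecture', 'internal_services'],
  authz: [
    'session_management_architecture',
    'role_permission_model',
    'resource_access_patterns',
    'workflow_implementation',
  ],
};

// The tool list is built per class; set_blind_spots is appended conditionally.
function toolNamesFor(vulnClass: VulnClass): string[] {
  return [
    'set_findings_summary',
    'set_strategic_intelligence',
    'set_safe_vectors',
    ...(BLIND_SPOTS.has(vulnClass) ? ['set_blind_spots'] : []),
  ];
}

// Returns the fields that are absent or empty, mirroring a minLength: 1 constraint.
function missingIntelFields(vulnClass: VulnClass, input: Record<string, unknown>): string[] {
  return STRATEGIC_INTEL_FIELDS[vulnClass].filter((field) => {
    const value = input[field];
    return typeof value !== 'string' || value.length < 1;
  });
}

// Usage: an empty string fails the check, which is why the tool description
// asks for "Not applicable" instead of an omitted or blank field.
const authTools = toolNamesFor('auth');
const xssTools = toolNamesFor('xss');
const missing = missingIntelFields('xss', { csp_analysis: 'Not applicable', cookie_security: '' });
```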
+6
@@ -1,6 +1,7 @@
/** Centralized path constants for the worker package */
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';
/** Worker package root (apps/worker/) resolved from compiled dist/ files */
@@ -9,6 +10,11 @@ const WORKER_ROOT = path.resolve(import.meta.dirname, '..');
export const PROMPTS_DIR = path.join(WORKER_ROOT, 'prompts');
export const CONFIGS_DIR = path.join(WORKER_ROOT, 'configs');
export const PLAYWRIGHT_SKILL_DIR = path.join(os.homedir(), '.claude', 'skills', 'playwright-cli');
/** Compiled pi extension dir that enforces bounded `bash` timeouts (resolved from dist/) */
export const BASH_TIMEOUT_EXTENSION_DIR = path.join(import.meta.dirname, 'ai', 'extensions', 'bash-timeout');
/** Default deliverables subdirectory relative to repoPath */
export const DEFAULT_DELIVERABLES_SUBDIR = '.shannon/deliverables';
+34 -20
@@ -12,18 +12,19 @@
* - Load prompt template using AGENTS[agentName].promptTemplate
* - Create git checkpoint
* - Start audit logging
* - Invoke Claude SDK via runClaudePrompt
* - Invoke the pi agent via runPiPrompt
* - Spending cap check using isSpendingCapBehavior
* - Handle failure (rollback, audit)
* - Validate output using AGENTS[agentName].deliverableFilename
* - Render the deliverable to disk via the writeDeliverable hook (if provided)
* - Commit on success, log metrics
*
* No Temporal dependencies - pure domain logic.
*/
import { fs, path } from 'zx';
import { type ClaudePromptResult, runClaudePrompt, validateAgentOutput } from '../ai/claude-executor.js';
import { getOutputFormat, getQueueFilename } from '../ai/queue-schemas.js';
import { type PiPromptResult, runPiPrompt, validateAgentOutput } from '../ai/pi-executor.js';
import { createQueueSubmitTool, getQueueFilename } from '../ai/queue-schemas.js';
import type { AuditSession } from '../audit/index.js';
import { authStateFile } from '../audit/utils.js';
import { AGENTS } from '../session-manager.js';
@@ -54,12 +55,14 @@ export interface AgentExecutionInput {
apiKey?: string | undefined;
promptDir?: string | undefined;
providerConfig?: import('../types/config.js').ProviderConfig | undefined;
mcpServers?: Record<string, import('@anthropic-ai/claude-agent-sdk').McpServerConfig>;
customTools?: import('@earendil-works/pi-coding-agent').ToolDefinition[];
// Renders the deliverable to disk; invoked after validation, before the success commit.
writeDeliverable?: (deliverablesPath: string) => Promise<void>;
}
interface FailAgentOpts {
attemptNumber: number;
result: ClaudePromptResult;
result: PiPromptResult;
rollbackReason: string;
errorMessage: string;
errorCode: ErrorCode;
@@ -109,7 +112,8 @@ export class AgentExecutionService {
apiKey,
promptDir,
providerConfig,
mcpServers,
customTools,
writeDeliverable,
} = input;
// 1. Load config (pre-parsed configData → raw YAML → file path)
@@ -163,9 +167,11 @@ export class AgentExecutionService {
// 4. Start audit logging
await auditSession.startAgent(agentName, prompt, attemptNumber);
// 5. Execute agent
const outputFormat = getOutputFormat(agentName, distributedConfig?.exploit ?? true);
const result: ClaudePromptResult = await runClaudePrompt(
// 5. Execute agent. Vuln agents get a submit tool that captures the structured
// exploitation queue (pi has no JSON-schema output format).
const submitTool = createQueueSubmitTool(agentName, distributedConfig?.exploit ?? true);
const callerTools = [...(customTools ?? []), ...(submitTool ? [submitTool.tool] : [])];
const result: PiPromptResult = await runPiPrompt(
prompt,
repoPath,
'', // context
@@ -174,11 +180,10 @@ export class AgentExecutionService {
auditSession,
logger,
AGENTS[agentName].modelTier,
outputFormat,
callerTools,
apiKey,
path.relative(repoPath, deliverablesPath),
providerConfig,
mcpServers,
);
// 6. Spending cap check - defense-in-depth
@@ -212,13 +217,17 @@ export class AgentExecutionService {
});
}
// 8. Write structured output to disk (vuln agents only)
// 8. Write structured output to disk (vuln agents only) from the submit-tool capture
const queueFilename = getQueueFilename(agentName);
if (result.structuredOutput !== undefined && queueFilename) {
await fs.ensureDir(deliverablesPath);
const queuePath = path.join(deliverablesPath, queueFilename);
await fs.writeFile(queuePath, JSON.stringify(result.structuredOutput, null, 2), 'utf8');
logger.info(`Wrote structured output queue to ${queueFilename}`);
if (submitTool && queueFilename) {
const captured = submitTool.getCaptured();
if (captured !== undefined) {
result.structuredOutput = captured; // carry for the validation gate below
await fs.ensureDir(deliverablesPath);
const queuePath = path.join(deliverablesPath, queueFilename);
await fs.writeFile(queuePath, JSON.stringify(captured, null, 2), 'utf8');
logger.info(`Wrote structured output queue to ${queueFilename}`);
}
}
// 9. Validate output
@@ -236,7 +245,12 @@ export class AgentExecutionService {
});
}
// 10. Success - commit deliverables, then capture checkpoint hash
// 10. Render the deliverable to disk so the success commit below stages it
if (writeDeliverable) {
await writeDeliverable(deliverablesPath);
}
// 11. Success - commit deliverables, then capture checkpoint hash
await commitGitSuccess(deliverablesPath, agentName, logger);
const commitHash = await getGitCommitHash(deliverablesPath);
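Review note: the new ordering in steps 9 to 11 (validate, render, commit) can be sketched in isolation. This is a minimal illustration, not the service's code: `finishAgent` and `FinishOpts` are hypothetical names, and the validator and commit are injected callbacks so the ordering can be observed. Only the sequence itself is taken from the diff.

```typescript
// Hypothetical stand-in for the tail of the agent execution flow.
// Only the ordering (validate -> render -> commit) comes from the diff.
type WriteDeliverable = (deliverablesPath: string) => Promise<void>;

interface FinishOpts {
  deliverablesPath: string;
  validate: () => Promise<boolean>;
  commit: (deliverablesPath: string) => Promise<void>;
  writeDeliverable?: WriteDeliverable;
}

async function finishAgent(opts: FinishOpts): Promise<boolean> {
  // Validation gate first: a failed run renders and commits nothing.
  if (!(await opts.validate())) return false;
  // Render the deliverable so the commit below can stage it.
  if (opts.writeDeliverable) await opts.writeDeliverable(opts.deliverablesPath);
  // Commit last, so the rendered file is part of the checkpoint.
  await opts.commit(opts.deliverablesPath);
  return true;
}
```

Because the hook runs before the commit, the rendered file is part of the checkpoint that a later resume reads, which appears to be the intent of the fix in #377.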
@@ -304,10 +318,10 @@ export class AgentExecutionService {
/**
* Convert AgentEndResult to AgentMetrics for workflow state.
*/
static toMetrics(endResult: AgentEndResult, result: ClaudePromptResult): AgentMetrics {
static toMetrics(endResult: AgentEndResult, result: PiPromptResult): AgentMetrics {
return {
durationMs: endResult.duration_ms,
inputTokens: null, // Not currently exposed by SDK wrapper
inputTokens: null, // Not currently exposed by the pi executor
outputTokens: null,
costUsd: endResult.cost_usd,
numTurns: result.turns ?? null,
+2 -2
@@ -62,7 +62,7 @@ const RETRYABLE_PATTERNS = [
'internal server error',
'service unavailable',
'bad gateway',
// Claude API errors
// Provider API errors
'model unavailable',
'service temporarily unavailable',
'api error',
@@ -160,7 +160,7 @@ function classifyByErrorCode(code: ErrorCode, retryableFromError: boolean): { ty
*
* Classification priority:
* 1. If error is PentestError with ErrorCode, classify by code (reliable)
* 2. Fall through to string matching for external errors (SDK, network, etc.)
* 2. Fall through to string matching for external errors (provider, network, etc.)
*/
export function classifyErrorForTemporal(error: unknown): { type: string; retryable: boolean } {
// === CODE-BASED CLASSIFICATION (Preferred for internal errors) ===
@@ -9,7 +9,7 @@
*
* Used when exploit=false: the exploit agents didn't run, so there is no
* `*_exploitation_evidence.md` to concatenate into the report. This module
* reads each `*_exploitation_queue.json` (already SDK-validated against the
* reads each `*_exploitation_queue.json` (already validated by the submit tool against the
* schemas in ../ai/queue-schemas.ts) and writes a `*_findings.md` per class
* in the canonical body shape that report-executive.txt's cleanup expects.
*
+2 -2
@@ -11,8 +11,8 @@
* Services are pure domain logic with no Temporal dependencies.
*/
export type { ClaudePromptResult } from '../ai/claude-executor.js';
export { runClaudePrompt } from '../ai/claude-executor.js';
export type { PiPromptResult } from '../ai/pi-executor.js';
export { runPiPrompt } from '../ai/pi-executor.js';
export type { AgentExecutionInput } from './agent-execution.js';
export { AgentExecutionService } from './agent-execution.js';
export { ConfigLoaderService } from './config-loader.js';
+142 -175
@@ -15,7 +15,7 @@
* 1. Repository path exists and contains .git
* 2. Config file parses and validates (if provided)
* 3. code_path rules match real entries in the repo (filesystem only)
* 4. Credentials validate via Claude Agent SDK query (API key, OAuth, Bedrock, or Vertex AI)
* 4. Credentials validate via a minimal pi session (API key, OAuth, or Bedrock)
* 5. Target URL resolves, is not link-local (cloud metadata), and is reachable (DNS + HTTP)
*/
@@ -25,16 +25,23 @@ import fs from 'node:fs/promises';
import http from 'node:http';
import https from 'node:https';
import net, { type LookupFunction } from 'node:net';
import type { SDKAssistantMessageError } from '@anthropic-ai/claude-agent-sdk';
import { query } from '@anthropic-ai/claude-agent-sdk';
import os from 'node:os';
import {
AuthStorage,
createAgentSession,
ModelRegistry,
SessionManager,
SettingsManager,
} from '@earendil-works/pi-coding-agent';
import { glob } from 'zx';
import { resolveModel } from '../ai/models.js';
import { resolveEffectiveProvider, resolveModelId } from '../ai/models.js';
import { parseConfig } from '../config-parser.js';
import type { ActivityLogger } from '../types/activity-logger.js';
import type { Config, Rule } from '../types/config.js';
import { ErrorCode } from '../types/errors.js';
import { err, ok, type Result } from '../types/result.js';
import { isRetryableError, PentestError } from './error-handling.js';
import { err, isErr, ok, type Result } from '../types/result.js';
import { matchesBillingTextPattern } from '../utils/billing-detection.js';
import { PentestError } from './error-handling.js';
const TARGET_URL_TIMEOUT_MS = 10_000;
@@ -240,67 +247,119 @@ async function validateCodePathsExist(
// === Credential Validation ===
/** Map SDK error type to a human-readable preflight PentestError. */
function classifySdkError(sdkError: SDKAssistantMessageError, authType: string): Result<void, PentestError> {
switch (sdkError) {
case 'authentication_failed':
return err(
new PentestError(
`Invalid ${authType}. Check your credentials in .env and try again.`,
'config',
false,
{ authType, sdkError },
ErrorCode.AUTH_FAILED,
),
);
case 'billing_error':
return err(
new PentestError(
`Anthropic account has a billing issue. Add credits or check your billing dashboard.`,
'billing',
true,
{ authType, sdkError },
ErrorCode.BILLING_ERROR,
),
);
case 'rate_limit':
return err(
new PentestError(
`Anthropic rate limit or spending cap reached. Wait a few minutes and try again.`,
'billing',
true,
{ authType, sdkError },
ErrorCode.BILLING_ERROR,
),
);
case 'server_error':
return err(
new PentestError(`Anthropic API is temporarily unavailable. Try again shortly.`, 'network', true, {
authType,
sdkError,
}),
);
default:
return err(
new PentestError(
`${authType} validation failed unexpectedly. Check your credentials in .env.`,
'config',
false,
{ authType, sdkError },
ErrorCode.AUTH_FAILED,
),
);
/** Classify a provider error message (thrown or from a failed turn) into a human-readable preflight PentestError. */
function classifyCredentialError(text: string, authType: string): Result<void, PentestError> {
const lower = text.toLowerCase();
if (matchesBillingTextPattern(text)) {
return err(
new PentestError(
`Anthropic account has a billing or rate-limit issue during ${authType} validation. Add credits or wait and retry.`,
'billing',
true,
{ authType },
ErrorCode.BILLING_ERROR,
),
);
}
if (/401|403|invalid[ _-]?api[ _-]?key|unauthorized|authentication|forbidden|not allowed|x-api-key/.test(lower)) {
return err(
new PentestError(
`Invalid ${authType}. Check your credentials in .env and try again.`,
'config',
false,
{ authType },
ErrorCode.AUTH_FAILED,
),
);
}
if (/model/.test(lower) && /not found|not available|unknown/.test(lower)) {
return err(
new PentestError(
`Configured model is not available for this account. Check ANTHROPIC_*_MODEL in .env.`,
'config',
false,
{ authType },
),
);
}
if (
/network|timeout|enotfound|econnrefused|fetch failed|getaddrinfo|socket|overloaded|unavailable|50\d/.test(lower)
) {
return err(
new PentestError(`Anthropic API unreachable or temporarily unavailable. Try again shortly.`, 'network', true, {
authType,
}),
);
}
return err(
new PentestError(
`${authType} validation failed: ${text.slice(0, 150)}`,
'config',
false,
{ authType },
ErrorCode.AUTH_FAILED,
),
);
}
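Review note: the branch order in `classifyCredentialError` matters because the first match wins. The sketch below reproduces the order with the same auth, model, and network regexes and returns a plain category string. `classify`, `Category`, and `BILLING_STANDIN` are hypothetical; in particular the billing regex is only a stand-in, since the real patterns behind `matchesBillingTextPattern` are not shown in this diff.

```typescript
type Category = 'billing' | 'auth' | 'model' | 'network' | 'unknown';

// Assumption: placeholder for matchesBillingTextPattern; the real patterns are not in this diff.
const BILLING_STANDIN = /credit balance|billing|rate limit|spending cap/;

function classify(text: string): Category {
  const lower = text.toLowerCase();
  // 1. Billing and rate limits take priority over everything else.
  if (BILLING_STANDIN.test(lower)) return 'billing';
  // 2. Authentication failures (same regex as the diff).
  if (/401|403|invalid[ _-]?api[ _-]?key|unauthorized|authentication|forbidden|not allowed|x-api-key/.test(lower)) {
    return 'auth';
  }
  // 3. Model availability: both "model" and a not-found phrase must appear.
  if (/model/.test(lower) && /not found|not available|unknown/.test(lower)) return 'model';
  // 4. Network or transient provider trouble.
  if (/network|timeout|enotfound|econnrefused|fetch failed|getaddrinfo|socket|overloaded|unavailable|50\d/.test(lower)) {
    return 'network';
  }
  // 5. Fallback: anything unrecognized.
  return 'unknown';
}
```

One property worth noting: `401|403` and `50\d` are unanchored, so they also match inside longer digit runs. For example, a message containing `14031` lands in the auth branch. Whether that matters depends on what the provider's error strings look like in practice.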
/** Validate credentials via a minimal Claude Agent SDK query. */
/** Minimal pi session probe to validate credentials. An optional baseUrl overrides the endpoint. */
async function probeCredentialsWithPi(
authType: string,
token?: string,
baseUrl?: string,
): Promise<Result<void, PentestError>> {
const authStorage = AuthStorage.inMemory();
if (token) authStorage.setRuntimeApiKey('anthropic', token);
const baseModel = ModelRegistry.create(authStorage).find('anthropic', resolveModelId('small'));
if (!baseModel) {
return err(
new PentestError(
`Model not found in pi registry: ${resolveModelId('small')}`,
'config',
false,
{},
ErrorCode.AUTH_FAILED,
),
);
}
const model = baseUrl ? { ...baseModel, baseUrl } : baseModel;
let errText: string | undefined;
try {
const { session } = await createAgentSession({
cwd: os.tmpdir(),
model,
thinkingLevel: 'off',
noTools: 'all',
authStorage,
sessionManager: SessionManager.inMemory(),
settingsManager: SettingsManager.inMemory({ retry: { enabled: false }, compaction: { enabled: false } }),
});
session.subscribe((e) => {
if (e.type === 'turn_end' && e.message.role === 'assistant' && e.message.stopReason === 'error') {
errText = e.message.errorMessage ?? 'unknown provider error';
}
});
await session.prompt('hi');
session.dispose();
} catch (error) {
errText = error instanceof Error ? error.message : String(error);
}
if (errText) return classifyCredentialError(errText, authType);
return ok(undefined);
}
/** Validate credentials via a minimal pi session. */
async function validateCredentials(
logger: ActivityLogger,
apiKey?: string,
providerConfig?: import('../types/config.js').ProviderConfig,
): Promise<Result<void, PentestError>> {
// 0. If providerConfig is present, credentials are managed by the caller.
// The executor will map providerConfig directly to sdkEnv — no process.env needed.
// The executor/provider layer owns providerConfig resolution — no env preflight needed.
if (providerConfig) {
logger.info(
`Provider config present (type: ${providerConfig.providerType || 'anthropic_api'}) — skipping env-based credential validation`,
@@ -308,44 +367,19 @@ async function validateCredentials(
return ok(undefined);
}
// 0b. If apiKey provided via config, set it in env for SDK validation
// 0b. If apiKey provided via config, set it in env for pi validation
// This avoids requiring process.env.ANTHROPIC_API_KEY when key is threaded via input
if (apiKey) {
process.env.ANTHROPIC_API_KEY = apiKey;
}
// 1. Custom base URL — validate endpoint is reachable via SDK query
if (process.env.ANTHROPIC_BASE_URL && process.env.ANTHROPIC_AUTH_TOKEN) {
const baseUrl = process.env.ANTHROPIC_BASE_URL;
logger.info('Validating custom base URL');
try {
for await (const message of query({ prompt: 'hi', options: { model: resolveModel('small'), maxTurns: 1 } })) {
if (message.type === 'assistant' && message.error) {
return classifySdkError(message.error, `custom endpoint (${baseUrl})`);
}
if (message.type === 'result') {
break;
}
}
// Resolve the active provider through the same precedence the executor uses, so
// preflight validates exactly the credentials the run will use (no drift).
const eff = resolveEffectiveProvider(apiKey);
logger.info('Custom base URL OK');
return ok(undefined);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return err(
new PentestError(
`Custom base URL unreachable: ${baseUrl}${message}`,
'network',
false,
{ baseUrl },
ErrorCode.AUTH_FAILED,
),
);
}
}
// 2. Bedrock mode — validate required AWS credentials are present
if (process.env.CLAUDE_CODE_USE_BEDROCK === '1') {
// 1. Bedrock mode — validate required AWS credentials are present (pi-ai owns the
// live AWS auth, so there is no cheap session probe here)
if (eff.providerId === 'amazon-bedrock') {
const required = [
'AWS_REGION',
'AWS_BEARER_TOKEN_BEDROCK',
@@ -369,62 +403,20 @@ async function validateCredentials(
return ok(undefined);
}
// 3. Vertex AI mode — validate required GCP credentials are present
if (process.env.CLAUDE_CODE_USE_VERTEX === '1') {
const required = [
'CLOUD_ML_REGION',
'ANTHROPIC_VERTEX_PROJECT_ID',
'ANTHROPIC_SMALL_MODEL',
'ANTHROPIC_MEDIUM_MODEL',
'ANTHROPIC_LARGE_MODEL',
];
const missing = required.filter((v) => !process.env[v]);
if (missing.length > 0) {
return err(
new PentestError(
`Vertex AI mode requires the following env vars in .env: ${missing.join(', ')}`,
'config',
false,
{ missing },
ErrorCode.AUTH_FAILED,
),
);
}
// Validate service account credentials file is accessible
const credPath = process.env.GOOGLE_APPLICATION_CREDENTIALS;
if (!credPath) {
return err(
new PentestError(
'Vertex AI mode requires GOOGLE_APPLICATION_CREDENTIALS pointing to a service account key JSON file',
'config',
false,
{},
ErrorCode.AUTH_FAILED,
),
);
}
try {
await fs.access(credPath);
} catch {
return err(
new PentestError(
`Service account key file not found at: ${credPath}`,
'config',
false,
{ credPath },
ErrorCode.AUTH_FAILED,
),
);
}
logger.info('Vertex AI credentials OK');
// 2. Custom base URL — validate the endpoint via a minimal pi session
if (eff.baseUrl) {
logger.info('Validating custom base URL');
const probe = await probeCredentialsWithPi(`custom endpoint (${eff.baseUrl})`, eff.anthropicToken, eff.baseUrl);
if (isErr(probe)) return probe;
logger.info('Custom base URL OK');
return ok(undefined);
}
// 4. Check that at least one credential is present
if (!process.env.ANTHROPIC_API_KEY && !process.env.CLAUDE_CODE_OAUTH_TOKEN && !process.env.ANTHROPIC_AUTH_TOKEN) {
// 3. Direct Anthropic — require a credential, then validate via a minimal pi session
if (!eff.anthropicToken) {
return err(
new PentestError(
'No API credentials found. Set ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN in .env (or use CLAUDE_CODE_USE_BEDROCK=1 for AWS Bedrock, or CLAUDE_CODE_USE_VERTEX=1 for Google Vertex AI)',
'No API credentials found. Set ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN in .env (or use CLAUDE_CODE_USE_BEDROCK=1 for AWS Bedrock)',
'config',
false,
{},
@@ -433,38 +425,13 @@ async function validateCredentials(
);
}
// 5. Validate via SDK query
const authType = process.env.CLAUDE_CODE_OAUTH_TOKEN ? 'OAuth token' : 'API key';
logger.info(`Validating ${authType} via SDK...`);
try {
for await (const message of query({ prompt: 'hi', options: { model: resolveModel('small'), maxTurns: 1 } })) {
if (message.type === 'assistant' && message.error) {
return classifySdkError(message.error, authType);
}
if (message.type === 'result') {
break;
}
}
logger.info(`${authType} OK`);
return ok(undefined);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
const retryable = isRetryableError(error instanceof Error ? error : new Error(message));
return err(
new PentestError(
retryable
? `Failed to reach Anthropic API. Check your network connection.`
: `${authType} validation failed: ${message}`,
retryable ? 'network' : 'config',
retryable,
{ authType },
retryable ? undefined : ErrorCode.AUTH_FAILED,
),
);
}
const usingApiKey = Boolean(apiKey ?? process.env.ANTHROPIC_API_KEY);
const authType = usingApiKey ? 'API key' : 'OAuth token';
logger.info(`Validating ${authType} via pi...`);
const probe = await probeCredentialsWithPi(authType, eff.anthropicToken);
if (isErr(probe)) return probe;
logger.info(`${authType} OK`);
return ok(undefined);
}
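Review note: after this change `validateCredentials` branches on the resolved provider rather than on raw env vars. The control flow can be sketched as a pure function. `pickPreflightBranch`, `EffectiveProviderSketch`, and the branch labels are hypothetical; the field names (`providerId`, `baseUrl`, `anthropicToken`) and the branch order are taken from the diff, while the internals of `resolveEffectiveProvider` are not shown here and are not modeled.

```typescript
// Assumption: only the three fields the diff reads from the resolved provider.
interface EffectiveProviderSketch {
  providerId: string;
  baseUrl?: string;
  anthropicToken?: string;
}

type PreflightBranch =
  | 'bedrock-env-check' // env presence check only, no live probe
  | 'custom-endpoint-probe' // minimal session against the custom base URL
  | 'missing-credential' // fail fast with AUTH_FAILED
  | 'direct-probe'; // minimal session against the default endpoint

function pickPreflightBranch(eff: EffectiveProviderSketch): PreflightBranch {
  if (eff.providerId === 'amazon-bedrock') return 'bedrock-env-check';
  if (eff.baseUrl) return 'custom-endpoint-probe';
  if (!eff.anthropicToken) return 'missing-credential';
  return 'direct-probe';
}
```

As written in the diff, the custom base URL branch is checked before the missing-credential guard, so a custom endpoint is probed even when no token is resolved; the probe then reports whatever the endpoint returns.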
// === Target URL Validation ===
@@ -595,7 +562,7 @@ async function validateTargetUrl(targetUrl: string, logger: ActivityLogger): Pro
* 1. Repository path exists and contains .git
* 2. Config file parses and validates (if configPath provided)
* 3. code_path rules match at least one entry in the repo (skipped without config)
* 4. Credentials validate (API key, OAuth, Bedrock, or Vertex AI)
* 4. Credentials validate (API key, OAuth, or Bedrock)
* 5. Target URL is reachable from the container
*
* Returns on first failure.
@@ -634,7 +601,7 @@ export async function runPreflightChecks(
}
}
// 4. Credential check (cheap — 1 SDK round-trip, skipped when providerConfig present)
// 4. Credential check (cheap — 1 pi round-trip, skipped when providerConfig present)
const credResult = await validateCredentials(logger, apiKey, providerConfig);
if (!credResult.ok) {
return credResult;
@@ -13,9 +13,9 @@
*/
import { readFile, rm } from 'node:fs/promises';
import type { JsonSchemaOutputFormat } from '@anthropic-ai/claude-agent-sdk';
import { z } from 'zod';
import { runClaudePrompt } from '../ai/claude-executor.js';
import { defineTool, type ToolDefinition } from '@earendil-works/pi-coding-agent';
import { Type } from 'typebox';
import { runPiPrompt } from '../ai/pi-executor.js';
import type { AuditSession } from '../audit/index.js';
import { authStateFile } from '../audit/utils.js';
import type { ActivityLogger } from '../types/activity-logger.js';
@@ -33,26 +33,38 @@ function isAuthFailurePoint(v: unknown): v is AuthFailurePoint {
return typeof v === 'string' && (FAILURE_POINTS as readonly string[]).includes(v);
}
// NOTE: SDK's AJV validator expects draft-07; Zod defaults to draft-2020-12,
// which causes the SDK to silently skip structured output.
const AuthValidationSchema = z.object({
login_success: z.boolean(),
failure_point: z.enum(FAILURE_POINTS).optional(),
failure_detail: z
.string()
.max(250)
.optional()
.describe(
'Free-form 1-2 sentence diagnostic of what the page showed (error messages, page state) when login failed. Required when login_success is false. Mask any sensitive values.',
),
});
interface AuthValidationVerdict {
login_success: boolean;
failure_point?: AuthFailurePoint;
failure_detail?: string;
}
type AuthValidationVerdict = z.infer<typeof AuthValidationSchema>;
const VALIDATION_SCHEMA: JsonSchemaOutputFormat = {
type: 'json_schema',
schema: z.toJSONSchema(AuthValidationSchema, { target: 'draft-07' }) as Record<string, unknown>,
};
/** Submit tool capturing the login verdict (pi has no JSON-schema output format). */
function createAuthSubmitTool(): { tool: ToolDefinition; getCaptured: () => AuthValidationVerdict | undefined } {
let captured: AuthValidationVerdict | undefined;
const tool = defineTool({
name: 'submit_auth_result',
label: 'Submit Auth Result',
description: 'Report the login outcome. Call exactly once when the login attempt has concluded.',
parameters: Type.Object({
login_success: Type.Boolean(),
failure_point: Type.Optional(
Type.Union([Type.Literal('username_or_password'), Type.Literal('totp_secret'), Type.Literal('out_of_band')]),
),
failure_detail: Type.Optional(
Type.String({
description:
'Free-form 1-2 sentence diagnostic of what the page showed (error messages, page state) when login failed. Required when login_success is false. Mask any sensitive values.',
}),
),
}),
execute: async (_toolCallId, params) => {
captured = params as AuthValidationVerdict;
return { content: [{ type: 'text' as const, text: 'Auth result recorded.' }], details: {} };
},
});
return { tool, getCaptured: () => captured };
}
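Review note: both `createAuthSubmitTool` and `createQueueSubmitTool` rely on the same capture pattern, where a tool's `execute` callback writes into a variable that a getter later reads. A dependency-free sketch follows. `SketchTool`, `Verdict`, and `createCaptureTool` are hypothetical and much simpler than pi's real `ToolDefinition`; no schema validation is modeled.

```typescript
// Hypothetical minimal tool shape; the real ToolDefinition is richer.
interface SketchTool<P> {
  name: string;
  execute: (params: P) => Promise<string>;
}

interface Verdict {
  login_success: boolean;
  failure_detail?: string;
}

function createCaptureTool(): { tool: SketchTool<Verdict>; getCaptured: () => Verdict | undefined } {
  // Closed over by both the tool and the getter.
  let captured: Verdict | undefined;
  const tool: SketchTool<Verdict> = {
    name: 'submit_auth_result',
    execute: async (params) => {
      // If the tool is called more than once, the last call wins.
      captured = params;
      return 'Auth result recorded.';
    },
  };
  return { tool, getCaptured: () => captured };
}
```

The caller checks `getCaptured()` after the run; `undefined` means the tool was never called, which is why the diff guards with `verdict !== undefined` before assigning `result.structuredOutput`.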
const AGENT_NAME = 'validate-authentication';
@@ -110,7 +122,8 @@ export async function validateAuthentication(input: ValidateAuthInput): Promise<
await auditSession.startAgent(AGENT_NAME, prompt, attemptNumber);
const startTime = Date.now();
const result = await runClaudePrompt(
const submit = createAuthSubmitTool();
const result = await runPiPrompt(
prompt,
repoPath,
'',
@@ -119,11 +132,13 @@ export async function validateAuthentication(input: ValidateAuthInput): Promise<
auditSession,
logger,
'medium',
VALIDATION_SCHEMA,
[submit.tool],
apiKey,
deliverablesSubdir,
providerConfig,
);
const verdict = submit.getCaptured();
if (verdict !== undefined) result.structuredOutput = verdict;
let classification = classifyResult(result, authentication);
@@ -204,7 +219,7 @@ function countStorageEntries(parsed: unknown, key: 'cookies' | 'origins'): numbe
}
function classifyResult(
result: import('../ai/claude-executor.js').ClaudePromptResult,
result: import('../ai/pi-executor.js').PiPromptResult,
authentication: NonNullable<DistributedConfig['authentication']>,
): Result<void, PentestError> {
if (!result.success) {
+8 -9
@@ -127,12 +127,11 @@ export const AGENT_PHASE_MAP: Readonly<Record<AgentName, PhaseName>> = Object.fr
// Factory function for vulnerability queue validators.
//
// Post-MCP-migration, the analysis_deliverable.md is rendered by the activity
// wrapper after validateAgentOutput runs, so the previous "both files exist"
// check would race the renderer. The validator only checks the queue.json —
// that file is written by the SDK structured-output path in agent-execution.ts
// before this validator runs. The downstream checkExploitationQueue still
// renders the .md.
// The analysis_deliverable.md is rendered via the writeDeliverable hook, which
// AgentExecutionService runs after validateAgentOutput but before the success
// commit — so a "both files exist" check here would race the renderer. The
// validator only checks queue.json, written by the submit-tool path in
// agent-execution.ts before this validator runs.
function createVulnValidator(vulnType: VulnType): AgentValidator {
return async (sourceDir: string, logger: ActivityLogger): Promise<boolean> => {
const queueFile = path.join(sourceDir, `${vulnType}_exploitation_queue.json`);
@@ -145,9 +144,9 @@ function createVulnValidator(vulnType: VulnType): AgentValidator {
};
}
// Exploitation agents — validation lives in runExploitAgentWithCollector post-processing
// (collector harvest + renderer write). The deliverable file is written by the renderer
// after the agent succeeds, so a file-existence check here would race the renderer.
// Exploitation agents — the evidence deliverable is rendered via the writeDeliverable
// hook after the agent succeeds (before the success commit), so a file-existence check
// here would race the renderer.
//
// VulnType is kept in the import surface for createVulnValidator above; this factory
// returns a no-op validator parameterized only for symmetry with the vuln-side factory.
+74 -93
@@ -19,7 +19,7 @@ import fs from 'node:fs/promises';
import path from 'node:path';
import { ApplicationFailure, Context, heartbeat } from '@temporalio/activity';
import { writePlaywrightStealthConfig } from '../ai/playwright-config-writer.js';
import { writeUserSettingsForCodePathAvoids } from '../ai/settings-writer.js';
import { writeCodePathPermissionConfig } from '../ai/settings-writer.js';
import { AuditSession } from '../audit/index.js';
import type { ResumeAttempt } from '../audit/metrics-tracker.js';
import { authStateFile, generateSessionJsonPath, type SessionMetadata } from '../audit/utils.js';
@@ -137,7 +137,8 @@ function buildContainerConfig(input: ActivityInput): ContainerConfig {
async function runAgentActivity(
agentName: AgentName,
input: ActivityInput,
mcpServers?: Record<string, import('@anthropic-ai/claude-agent-sdk').McpServerConfig>,
customTools?: import('@earendil-works/pi-coding-agent').ToolDefinition[],
writeDeliverable?: (deliverablesPath: string) => Promise<void>,
): Promise<AgentMetrics> {
const { repoPath, configPath, pipelineTestingMode = false, workflowId, webUrl } = input;
@@ -192,7 +193,8 @@ async function runAgentActivity(
...(input.providerConfig !== undefined && { providerConfig: input.providerConfig }),
...(input.promptDir !== undefined && { promptDir: input.promptDir }),
...(input.configYAML !== undefined && { configYAML: input.configYAML }),
...(mcpServers && { mcpServers }),
...(customTools && { customTools }),
...(writeDeliverable && { writeDeliverable }),
},
auditSession,
logger,
@@ -256,28 +258,21 @@ export async function runPreReconAgent(input: ActivityInput): Promise<AgentMetri
const { renderPreRecon } = await import('../services/pre-recon-renderer.js');
const collector = createPreReconCollectorServer();
const metrics = await runAgentActivity('pre-recon', input, { 'pre-recon-collector': collector.server });
// On resume, the agent is skipped and the collector is never populated.
// The cached deliverable from the prior run is the source of truth.
if (metrics.skipped) {
return metrics;
}
const writeDeliverable = async (deliverablesPath: string): Promise<void> => {
const logger = createActivityLogger();
// Skipped tools surface as renderer placeholders, not as activity failures.
const callStatus = collector.getCallStatus();
logger.info('Pre-recon tool call status', { callStatus });
const logger = createActivityLogger();
const dir = deliverablesDir(input.repoPath, input.deliverablesSubdir);
const collected = collector.getAll();
const markdown = renderPreRecon(collected);
const mdPath = path.join(deliverablesPath, 'pre_recon_deliverable.md');
await atomicWrite(mdPath, markdown);
logger.info(`Wrote pre_recon_deliverable.md from structured data (${markdown.length} bytes)`);
};
// Skipped tools surface as renderer placeholders, not as activity failures.
const callStatus = collector.getCallStatus();
logger.info('Pre-recon tool call status', { callStatus });
const collected = collector.getAll();
const markdown = renderPreRecon(collected);
const mdPath = path.join(dir, 'pre_recon_deliverable.md');
await atomicWrite(mdPath, markdown);
logger.info(`Wrote pre_recon_deliverable.md from structured data (${markdown.length} bytes)`);
return metrics;
return runAgentActivity('pre-recon', input, collector.tools, writeDeliverable);
}
export async function runReconAgent(input: ActivityInput): Promise<AgentMetrics> {
@@ -285,28 +280,21 @@ export async function runReconAgent(input: ActivityInput): Promise<AgentMetrics>
const { renderRecon } = await import('../services/recon-renderer.js');
const collector = createReconCollectorServer();
const metrics = await runAgentActivity('recon', input, { 'recon-collector': collector.server });
// On resume, the agent is skipped and the collector is never populated.
// The cached deliverable from the prior run is the source of truth.
if (metrics.skipped) {
return metrics;
}
const writeDeliverable = async (deliverablesPath: string): Promise<void> => {
const logger = createActivityLogger();
// Skipped tools surface as renderer placeholders, not as activity failures.
const callStatus = collector.getCallStatus();
logger.info('Recon tool call status', { callStatus });
const logger = createActivityLogger();
const dir = deliverablesDir(input.repoPath, input.deliverablesSubdir);
const collected = collector.getAll();
const markdown = renderRecon(collected);
const mdPath = path.join(deliverablesPath, 'recon_deliverable.md');
await atomicWrite(mdPath, markdown);
logger.info(`Wrote recon_deliverable.md from structured data (${markdown.length} bytes)`);
};
// Skipped tools surface as renderer placeholders, not as activity failures.
const callStatus = collector.getCallStatus();
logger.info('Recon tool call status', { callStatus });
const collected = collector.getAll();
const markdown = renderRecon(collected);
const mdPath = path.join(dir, 'recon_deliverable.md');
await atomicWrite(mdPath, markdown);
logger.info(`Wrote recon_deliverable.md from structured data (${markdown.length} bytes)`);
return metrics;
return runAgentActivity('recon', input, collector.tools, writeDeliverable);
}
async function runVulnAgentWithCollector(
@@ -318,28 +306,21 @@ async function runVulnAgentWithCollector(
const { renderVulnDeliverable } = await import('../services/vuln-renderer.js');
const collector = createVulnCollector(vulnClass);
const metrics = await runAgentActivity(agentName, input, { 'vuln-collector': collector.server });
// On resume, the agent is skipped and the collector is never populated.
// The cached deliverable from the prior run is the source of truth.
if (metrics.skipped) {
return metrics;
}
const writeDeliverable = async (deliverablesPath: string): Promise<void> => {
const logger = createActivityLogger();
// Skipped tools surface as renderer placeholders, not as activity failures.
const callStatus = collector.getCallStatus();
logger.info(`${vulnClass} vuln tool call status`, { callStatus });
const logger = createActivityLogger();
const dir = deliverablesDir(input.repoPath, input.deliverablesSubdir);
const collected = collector.getAll();
const markdown = renderVulnDeliverable(vulnClass, collected);
const mdPath = path.join(deliverablesPath, `${vulnClass}_analysis_deliverable.md`);
await atomicWrite(mdPath, markdown);
logger.info(`Wrote ${vulnClass}_analysis_deliverable.md from structured data (${markdown.length} bytes)`);
};
// Skipped tools surface as renderer placeholders, not as activity failures.
const callStatus = collector.getCallStatus();
logger.info(`${vulnClass} vuln tool call status`, { callStatus });
const collected = collector.getAll();
const markdown = renderVulnDeliverable(vulnClass, collected);
const mdPath = path.join(dir, `${vulnClass}_analysis_deliverable.md`);
await atomicWrite(mdPath, markdown);
logger.info(`Wrote ${vulnClass}_analysis_deliverable.md from structured data (${markdown.length} bytes)`);
return metrics;
return runAgentActivity(agentName, input, collector.tools, writeDeliverable);
}
export async function runInjectionVulnAgent(input: ActivityInput): Promise<AgentMetrics> {
@@ -399,34 +380,29 @@ async function runExploitAgentWithCollector(
const { validIds, idToType } = await readExploitQueue(queuePath);
const collector = createExploitCollector({ vulnClass, validIds });
const metrics = await runAgentActivity(agentName, input, { 'exploit-collector': collector.server });
// On resume, the agent is skipped and the collector is never populated.
// The cached deliverable from the prior run is the source of truth.
if (metrics.skipped) {
return metrics;
}
const writeDeliverable = async (deliverablesPath: string): Promise<void> => {
const logger = createActivityLogger();
const collected = collector.getAll();
const emittedIds = new Set(collected.map((e) => e.vulnerability_id));
const missingIds = [...validIds].filter((id) => !emittedIds.has(id));
const exploitedCount = collected.filter((e) => e.status === 'exploited').length;
const blockedCount = collected.filter((e) => e.status === 'blocked').length;
const logger = createActivityLogger();
const collected = collector.getAll();
const emittedIds = new Set(collected.map((e) => e.vulnerability_id));
const missingIds = [...validIds].filter((id) => !emittedIds.has(id));
const exploitedCount = collected.filter((e) => e.status === 'exploited').length;
const blockedCount = collected.filter((e) => e.status === 'blocked').length;
logger.info(`${vulnClass} exploit tool call metrics`, {
queueSize: validIds.size,
exploited: exploitedCount,
blocked: blockedCount,
missing: missingIds.length,
});
logger.info(`${vulnClass} exploit tool call metrics`, {
queueSize: validIds.size,
exploited: exploitedCount,
blocked: blockedCount,
missing: missingIds.length,
});
const markdown = renderExploitDeliverable(vulnClass, collected, idToType);
const mdPath = path.join(deliverablesPath, `${vulnClass}_exploitation_evidence.md`);
await atomicWrite(mdPath, markdown);
logger.info(`Wrote ${vulnClass}_exploitation_evidence.md from structured data (${markdown.length} bytes)`);
};
const markdown = renderExploitDeliverable(vulnClass, collected, idToType);
const mdPath = path.join(dir, `${vulnClass}_exploitation_evidence.md`);
await atomicWrite(mdPath, markdown);
logger.info(`Wrote ${vulnClass}_exploitation_evidence.md from structured data (${markdown.length} bytes)`);
return metrics;
return runAgentActivity(agentName, input, collector.tools, writeDeliverable);
}
export async function runInjectionExploitAgent(input: ActivityInput): Promise<AgentMetrics> {
@@ -459,10 +435,10 @@ export async function runReportAgent(input: ActivityInput): Promise<AgentMetrics
* Runs cheap checks before any agent execution:
* 1. Repository path exists with .git
* 2. Config file validates (if provided)
* 3. Credential validation (API key, OAuth, Bedrock, or Vertex AI)
* 3. Credential validation (API key, OAuth, or Bedrock)
* 4. Target URL reachable from the container
*
* NOT using runAgentActivity — preflight doesn't run an agent via the SDK.
* NOT using runAgentActivity — preflight doesn't run a full analysis agent.
*/
export async function runPreflightValidation(input: ActivityInput): Promise<void> {
const startTime = Date.now();
@@ -661,12 +637,13 @@ export async function syncPlaywrightStealthConfig(input: ActivityInput): Promise
}
/**
* Sync code_path avoid rules into Claude's user-scope settings.json so the
* SDK enforces them at the tool layer for every agent in this run.
* Sync code_path avoid rules into the @gotgenes/pi-permission-system global config
* so pi enforces them at the tool layer for every agent in this run. The executor
* loads the extension when this config is present (see pi-executor).
*
* Runs once per workflow before any agent fires. Config is fixed for the
* lifetime of the workflow, so writing once avoids the parallel-agent race
* on the global ~/.claude/settings.json file.
* Runs once per workflow before any analysis agent fires. Config is fixed for the
* lifetime of the workflow, so writing once avoids a parallel-agent race on the
* global config file.
*/
export async function syncCodePathDenyRules(input: ActivityInput): Promise<void> {
const logger = createActivityLogger();
@@ -680,8 +657,12 @@ export async function syncCodePathDenyRules(input: ActivityInput): Promise<void>
const config = configResult.value;
const denyCount = (config?.avoid ?? []).filter((r) => r.type === 'code_path').length;
await writeUserSettingsForCodePathAvoids(config);
logger.info(`Synced code_path deny rules to user settings (${denyCount} entries)`);
await writeCodePathPermissionConfig(config);
logger.info(
denyCount > 0
? `Synced ${denyCount} code_path deny rule(s) to the pi-permission-system config`
: 'No code_path deny rules; pi-permission-system config cleared',
);
}
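Reviewer note: the new log line branches on whether any `code_path` rules exist. A minimal sketch of that branch in isolation, assuming reduced config shapes; `AvoidRule`, `ScanConfig`, and `describeDenyRuleSync` are hypothetical names, and the `'other_rule'` type is a placeholder rather than a rule type confirmed by the diff.

```typescript
// Hypothetical minimal shapes; the real config types are not shown in this diff.
interface AvoidRule {
  type: string;
}

interface ScanConfig {
  avoid?: AvoidRule[];
}

// Reproduces the count and message selection from the hunk without any I/O.
function describeDenyRuleSync(config: ScanConfig | null | undefined): string {
  const denyCount = (config?.avoid ?? []).filter((r) => r.type === 'code_path').length;
  return denyCount > 0
    ? `Synced ${denyCount} code_path deny rule(s) to the pi-permission-system config`
    : 'No code_path deny rules; pi-permission-system config cleared';
}

// 'other_rule' is a made-up type used only to show that non-code_path rules are ignored.
const withRules = describeDenyRuleSync({ avoid: [{ type: 'code_path' }, { type: 'other_rule' }] });
const withoutRules = describeDenyRuleSync(null);
console.log(withRules);
console.log(withoutRules);
```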
/**
+1 -1
@@ -28,7 +28,7 @@ export interface PipelineInput {
sastSarifPath?: string; // Optional path for consumer-supplied findings input
checkpointsEnabled?: boolean; // Enable checkpoint activities (default: false)
skipGitCheck?: boolean; // Skip .git directory validation in preflight (e.g. when .git is removed after clone)
providerConfig?: ProviderConfig; // LLM provider configuration (Bedrock, Vertex, etc.)
providerConfig?: ProviderConfig; // LLM provider configuration (Bedrock, custom base URL, etc.)
vulnClasses?: VulnClass[]; // omitted = all five
exploit?: boolean; // false skips the exploitation phase
}
+3 -3
@@ -92,7 +92,7 @@ const TESTING_RETRY = {
// Activity proxy with production retry configuration (default)
const acts = proxyActivities<typeof activities>({
startToCloseTimeout: '2 hours',
heartbeatTimeout: '60 minutes', // Extended for sub-agent execution (SDK blocks event loop during Task tool calls)
heartbeatTimeout: '60 minutes', // Extended for nested pi task execution
retry: PRODUCTION_RETRY,
});
@@ -135,7 +135,7 @@ const preflightActs = proxyActivities<typeof activities>({
retry: PREFLIGHT_RETRY,
});
// Credential rejection is not retryable; transient SDK errors get 3 attempts.
// Credential rejection is not retryable; transient provider errors get 3 attempts.
const AUTH_VALIDATION_RETRY = {
initialInterval: '10 seconds',
maximumInterval: '1 minute',
@@ -452,7 +452,7 @@ export async function pentestPipeline(input: PipelineInput): Promise<PipelineSta
// === Initialize Deliverables Git ===
await a.initDeliverableGit(activityInput);
// === Sync SDK deny rules ===
// === Sync code_path deny rules ===
await a.syncCodePathDenyRules(activityInput);
log.info(`Run scope: vuln_classes=[${selectedVulnClasses.join(', ')}] exploit=${exploit}`);
+4 -6
@@ -94,8 +94,9 @@ export interface DistributedConfig {
/**
* LLM provider configuration for multi-provider support.
*
* Maps to SDK environment variables at execution time. When providerType
* is omitted or 'anthropic_api', falls back to apiKey + ANTHROPIC_API_KEY.
* Resolved by the pi model/provider layer at execution time. Recognized
* providerType values: 'bedrock', 'custom_base_url', 'anthropic_api'.
* When omitted or 'anthropic_api', falls back to apiKey + ANTHROPIC_API_KEY.
*/
export interface ProviderConfig {
readonly providerType?: string;
@@ -103,9 +104,6 @@ export interface ProviderConfig {
readonly awsRegion?: string;
readonly awsAccessKeyId?: string;
readonly awsSecretAccessKey?: string;
readonly gcpRegion?: string;
readonly gcpProjectId?: string;
readonly gcpCredentialsPath?: string;
readonly baseUrl?: string;
readonly authToken?: string;
readonly modelOverrides?: Record<string, string>;
@@ -127,6 +125,6 @@ export interface ContainerConfig {
readonly apiKey?: string;
/** Prompt directory override — when set, prompt manager loads from this path */
readonly promptDir?: string;
/** LLM provider configuration — when set, executor maps to SDK env vars directly */
/** LLM provider configuration for the pi executor */
readonly providerConfig?: ProviderConfig;
}
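Reviewer note: the updated doc comment names three recognized `providerType` values and a fallback when the field is omitted. The sketch below shows one way that rule could be expressed. It is an assumption-laden illustration: `resolveProviderType` and `ProviderConfigSketch` are hypothetical, and throwing on an unrecognized value is a choice made for the example, not behavior confirmed by the diff.

```typescript
// The three values listed in the doc comment.
type KnownProvider = 'bedrock' | 'custom_base_url' | 'anthropic_api';

// Hypothetical reduced view of ProviderConfig; only providerType is modeled.
interface ProviderConfigSketch {
  readonly providerType?: string;
}

// Omitted config or omitted providerType falls back to 'anthropic_api',
// matching the documented fallback to apiKey + ANTHROPIC_API_KEY.
function resolveProviderType(config?: ProviderConfigSketch): KnownProvider {
  const providerType = config?.providerType;
  if (providerType === undefined || providerType === 'anthropic_api') {
    return 'anthropic_api';
  }
  if (providerType === 'bedrock' || providerType === 'custom_base_url') {
    return providerType;
  }
  // Assumption for this sketch: unknown values are rejected rather than ignored.
  throw new Error(`Unrecognized providerType: ${providerType}`);
}

console.log(resolveProviderType());
console.log(resolveProviderType({ providerType: 'bedrock' }));
```

Because `providerType` is typed as `string` rather than a union, a removed value such as a Vertex provider would still type-check at call sites; narrowing at a single point like this is one way to surface that at runtime.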
+6 -6
@@ -8,8 +8,8 @@
* Consolidated billing/spending cap detection utilities.
*
* Anthropic's spending cap behavior is inconsistent:
* - Sometimes a proper SDK error (billing_error)
* - Sometimes Claude responds with text about the cap
* - Sometimes a proper provider error (billing_error)
* - Sometimes the agent responds with text about the cap
* - Sometimes partial billing before cutoff
*
* This module provides defense-in-depth detection with shared pattern lists
@@ -17,8 +17,8 @@
*/
/**
* Text patterns for SDK output sniffing (what Claude says).
* Used by message-handlers.ts and the behavioral heuristic.
* Text patterns for provider/harness output sniffing (what the agent says).
* Used by the pi event stream and the behavioral heuristic.
*/
export const BILLING_TEXT_PATTERNS = [
'spending cap',
@@ -48,7 +48,7 @@ export const BILLING_API_PATTERNS = [
/**
* Checks if text matches any billing text pattern.
* Used for sniffing SDK output content for spending cap messages.
* Used for sniffing agent output content for spending cap messages.
*/
export function matchesBillingTextPattern(text: string): boolean {
const lowerText = text.toLowerCase();
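Reviewer note: the hunk cuts off after the lowercase step, so the rest of the function body is not visible. A minimal sketch of a case-insensitive substring check of this kind follows. The names `SKETCH_BILLING_TEXT_PATTERNS` and `sketchMatchesBillingText` are hypothetical, only `'spending cap'` is taken from the diff, and the second pattern is a made-up placeholder.

```typescript
// Only 'spending cap' appears in the diff; the second entry is an invented placeholder
// standing in for the patterns that the hunk elides.
const SKETCH_BILLING_TEXT_PATTERNS: readonly string[] = ['spending cap', 'placeholder billing phrase'];

// Case-insensitive match: lowercase the input once, then test each pattern as a substring.
// Patterns are assumed to be stored in lowercase.
function sketchMatchesBillingText(text: string): boolean {
  const lowerText = text.toLowerCase();
  return SKETCH_BILLING_TEXT_PATTERNS.some((pattern) => lowerText.includes(pattern));
}

console.log(sketchMatchesBillingText('You have reached your Spending Cap for this month.'));
console.log(sketchMatchesBillingText('Recon complete: 12 endpoints mapped.'));
```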
@@ -67,7 +67,7 @@ export function matchesBillingApiPattern(message: string): boolean {
/**
* Behavioral heuristic for detecting spending cap.
*
* When Claude hits a spending cap, it often returns a short message
* When the agent hits a spending cap, it often returns a short message
* with $0 cost. Legitimate agent work NEVER costs $0 with only 1-2 turns.
*
* This combines three signals:
+7 -38
@@ -1,6 +1,6 @@
# AI Providers
Shannon Lite works best with Claude models. Anthropic API keys are recommended for most users, and Shannon Lite also supports AWS Bedrock, Google Vertex AI, and custom Anthropic-compatible endpoints.
Shannon Lite works best with Claude models. Anthropic API keys are recommended for most users, and Shannon Lite also supports AWS Bedrock and custom Anthropic-compatible endpoints.
## Anthropic
@@ -20,9 +20,10 @@ Source-build mode can use a `.env` file:
```bash
ANTHROPIC_API_KEY=your-api-key
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
```
Each tier can be pointed at any Claude model via `ANTHROPIC_SMALL_MODEL` / `ANTHROPIC_MEDIUM_MODEL` / `ANTHROPIC_LARGE_MODEL` (or the setup wizard). If you set a tier to `claude-fable-5`, note that Fable's safety classifiers route cybersecurity tasks to Opus 4.8, so those phases run on Opus 4.8 regardless.
## AWS Bedrock
Run `npx @keygraph/shannon setup` and select **AWS Bedrock**. The wizard prompts for region, bearer token, and model IDs.
@@ -35,7 +36,7 @@ export AWS_REGION=us-east-1
export AWS_BEARER_TOKEN_BEDROCK=your-bearer-token
export ANTHROPIC_SMALL_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
export ANTHROPIC_MEDIUM_MODEL=us.anthropic.claude-sonnet-4-6
export ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-7
export ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-8
```
Source-build `.env` equivalent:
@@ -46,7 +47,7 @@ AWS_REGION=us-east-1
AWS_BEARER_TOKEN_BEDROCK=your-bearer-token
ANTHROPIC_SMALL_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
ANTHROPIC_MEDIUM_MODEL=us.anthropic.claude-sonnet-4-6
ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-7
ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-8
```
Shannon Lite uses three model tiers:
@@ -57,38 +58,6 @@ Shannon Lite uses three model tiers:
Set `ANTHROPIC_SMALL_MODEL`, `ANTHROPIC_MEDIUM_MODEL`, and `ANTHROPIC_LARGE_MODEL` to Bedrock model IDs available in your region.
## Google Vertex AI
Create a service account with the `roles/aiplatform.user` role in the GCP Console, then download a JSON key file.
Run `npx @keygraph/shannon setup` and select **Google Vertex AI**. The wizard prompts for region, project ID, service account key file path, and model IDs. The key file is copied to `~/.shannon/google-sa-key.json`.
Or export environment variables directly:
```bash
export CLAUDE_CODE_USE_VERTEX=1
export CLOUD_ML_REGION=us-east5
export ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-sa-key.json
export ANTHROPIC_SMALL_MODEL=claude-haiku-4-5@20251001
export ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
export ANTHROPIC_LARGE_MODEL=claude-opus-4-7
```
Source-build `.env` equivalent:
```bash
CLAUDE_CODE_USE_VERTEX=1
CLOUD_ML_REGION=us-east5
ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id
GOOGLE_APPLICATION_CREDENTIALS=./credentials/google-sa-key.json
ANTHROPIC_SMALL_MODEL=claude-haiku-4-5@20251001
ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
ANTHROPIC_LARGE_MODEL=claude-opus-4-7
```
Set `CLOUD_ML_REGION=global` for global endpoints, or use a specific region like `us-east5`. Some models may not be available on global endpoints.
## Custom Base URL
Shannon Lite supports pointing the SDK at an Anthropic-compatible endpoint with `ANTHROPIC_BASE_URL`. For proxy-based routing, use an LLM proxy such as LiteLLM configured to expose an Anthropic-compatible endpoint.
@@ -105,7 +74,7 @@ export ANTHROPIC_BASE_URL=https://your-proxy.example.com
export ANTHROPIC_AUTH_TOKEN=your-auth-token
export ANTHROPIC_SMALL_MODEL=claude-haiku-4-5-20251001
export ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
export ANTHROPIC_LARGE_MODEL=claude-opus-4-7
export ANTHROPIC_LARGE_MODEL=claude-opus-4-8
```
Source-build `.env` equivalent:
@@ -115,5 +84,5 @@ ANTHROPIC_BASE_URL=https://your-proxy.example.com
ANTHROPIC_AUTH_TOKEN=your-auth-token
ANTHROPIC_SMALL_MODEL=claude-haiku-4-5-20251001
ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
ANTHROPIC_LARGE_MODEL=claude-opus-4-7
ANTHROPIC_LARGE_MODEL=claude-opus-4-8
```
+1 -1
@@ -133,7 +133,7 @@ login_flow:
## Adaptive Thinking
Claude decides when and how deeply to reason on Opus 4.6 and 4.7. This is enabled by default whenever a tier resolves to one of these models.
Claude decides when and how deeply to reason on Opus 4.6, 4.7, and 4.8. This is enabled by default whenever a tier resolves to one of these models.
- `npx` mode: `npx @keygraph/shannon setup` prompts you during the wizard.
- Source-build mode: set `CLAUDE_ADAPTIVE_THINKING=false` in `.env` or export it in your shell.
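Reviewer note: the documented rule is "on by default for Opus 4.6, 4.7, and 4.8, unless `CLAUDE_ADAPTIVE_THINKING=false`". A minimal sketch of that gate, under assumptions: `adaptiveThinkingEnabled` and `ADAPTIVE_MODEL_MARKERS` are hypothetical names, and matching by substring on the model ID is a guess at how a tier's model could be recognized, not the worker's confirmed logic.

```typescript
// Assumption: a model id is treated as adaptive-capable when it contains one of these markers.
const ADAPTIVE_MODEL_MARKERS: readonly string[] = ['opus-4-6', 'opus-4-7', 'opus-4-8'];

// env is passed in explicitly so the sketch does not depend on a particular runtime.
function adaptiveThinkingEnabled(modelId: string, env: Record<string, string | undefined>): boolean {
  const supported = ADAPTIVE_MODEL_MARKERS.some((marker) => modelId.includes(marker));
  const optedOut = env.CLAUDE_ADAPTIVE_THINKING === 'false';
  return supported && !optedOut;
}

console.log(adaptiveThinkingEnabled('us.anthropic.claude-opus-4-8', {}));
console.log(adaptiveThinkingEnabled('claude-opus-4-8', { CLAUDE_ADAPTIVE_THINKING: 'false' }));
console.log(adaptiveThinkingEnabled('claude-sonnet-4-6', {}));
```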
-2
@@ -33,14 +33,12 @@ At minimum, your `.env` file should include one supported AI provider credential
```bash
ANTHROPIC_API_KEY=your-api-key
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
```
Environment variables can also be exported directly:
```bash
export ANTHROPIC_API_KEY="your-api-key"
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
```
## Prepare Your Repository
+8 -43
@@ -87,7 +87,7 @@ Sample Shannon Lite penetration test reports from intentionally vulnerable appli
- **Docker** - required for the worker container.
- **Node.js 18+** - required for the recommended `npx` workflow.
- **AI provider credentials** - Anthropic is recommended; AWS Bedrock, Google Vertex AI, and compatible proxy setups are documented separately.
- **AI provider credentials** - Anthropic is recommended; AWS Bedrock and compatible proxy setups are documented separately.
### Run Shannon Lite
@@ -203,7 +203,7 @@ Use these guides for operational detail:
| --- | --- |
| [Source build and CLI commands](docs/development.md) | Cloning, building, common commands, output paths, and local development. |
| [Configuration](docs/configuration.md) | Authenticated testing, login flows, rules of engagement, report filters, and rate-limit settings. |
| [AI providers](docs/ai-providers.md) | Anthropic, AWS Bedrock, Google Vertex AI, and custom Anthropic-compatible endpoints. |
| [AI providers](docs/ai-providers.md) | Anthropic, AWS Bedrock, and custom Anthropic-compatible endpoints. |
| [Platforms and networking](docs/platforms.md) | Windows/WSL2, Linux, macOS, Docker networking, local apps, and custom hostnames. |
| [Workspaces and resuming](docs/workspaces.md) | Naming workspaces, resuming interrupted scans, and workspace storage. |
| [Safety and limitations](docs/safety.md) | Authorized-use requirements, non-production guidance, mutative effects, cost, and model caveats. |
@@ -298,14 +298,12 @@ At minimum, your `.env` file should include one supported AI provider credential
```bash
ANTHROPIC_API_KEY=your-api-key
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
```
Environment variables can also be exported directly:
```bash
export ANTHROPIC_API_KEY="your-api-key"
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
```
## Prepare Your Repository
@@ -548,7 +546,7 @@ login_flow:
## Adaptive Thinking
Claude decides when and how deeply to reason on Opus 4.6 and 4.7. This is enabled by default whenever a tier resolves to one of these models.
Claude decides when and how deeply to reason on Opus 4.6, 4.7, and 4.8. This is enabled by default whenever a tier resolves to one of these models.
- `npx` mode: `npx @keygraph/shannon setup` prompts you during the wizard.
- Source-build mode: set `CLAUDE_ADAPTIVE_THINKING=false` in `.env` or export it in your shell.
@@ -571,7 +569,7 @@ pipeline:
# AI Providers
Shannon Lite works best with Claude models. Anthropic API keys are recommended for most users, and Shannon Lite also supports AWS Bedrock, Google Vertex AI, and custom Anthropic-compatible endpoints.
Shannon Lite works best with Claude models. Anthropic API keys are recommended for most users, and Shannon Lite also supports AWS Bedrock and custom Anthropic-compatible endpoints.
## Anthropic
@@ -591,7 +589,6 @@ Source-build mode can use a `.env` file:
```bash
ANTHROPIC_API_KEY=your-api-key
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
```
## AWS Bedrock
@@ -606,7 +603,7 @@ export AWS_REGION=us-east-1
export AWS_BEARER_TOKEN_BEDROCK=your-bearer-token
export ANTHROPIC_SMALL_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
export ANTHROPIC_MEDIUM_MODEL=us.anthropic.claude-sonnet-4-6
export ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-7
export ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-8
```
Source-build `.env` equivalent:
@@ -617,7 +614,7 @@ AWS_REGION=us-east-1
AWS_BEARER_TOKEN_BEDROCK=your-bearer-token
ANTHROPIC_SMALL_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
ANTHROPIC_MEDIUM_MODEL=us.anthropic.claude-sonnet-4-6
ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-7
ANTHROPIC_LARGE_MODEL=us.anthropic.claude-opus-4-8
```
Shannon Lite uses three model tiers:
@@ -628,38 +625,6 @@ Shannon Lite uses three model tiers:
Set `ANTHROPIC_SMALL_MODEL`, `ANTHROPIC_MEDIUM_MODEL`, and `ANTHROPIC_LARGE_MODEL` to Bedrock model IDs available in your region.
## Google Vertex AI
Create a service account with the `roles/aiplatform.user` role in the GCP Console, then download a JSON key file.
Run `npx @keygraph/shannon setup` and select **Google Vertex AI**. The wizard prompts for region, project ID, service account key file path, and model IDs. The key file is copied to `~/.shannon/google-sa-key.json`.
Or export environment variables directly:
```bash
export CLAUDE_CODE_USE_VERTEX=1
export CLOUD_ML_REGION=us-east5
export ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-sa-key.json
export ANTHROPIC_SMALL_MODEL=claude-haiku-4-5@20251001
export ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
export ANTHROPIC_LARGE_MODEL=claude-opus-4-7
```
Source-build `.env` equivalent:
```bash
CLAUDE_CODE_USE_VERTEX=1
CLOUD_ML_REGION=us-east5
ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id
GOOGLE_APPLICATION_CREDENTIALS=./credentials/google-sa-key.json
ANTHROPIC_SMALL_MODEL=claude-haiku-4-5@20251001
ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
ANTHROPIC_LARGE_MODEL=claude-opus-4-7
```
Set `CLOUD_ML_REGION=global` for global endpoints, or use a specific region like `us-east5`. Some models may not be available on global endpoints.
## Custom Base URL
Shannon Lite supports pointing the SDK at an Anthropic-compatible endpoint with `ANTHROPIC_BASE_URL`. For proxy-based routing, use an LLM proxy such as LiteLLM configured to expose an Anthropic-compatible endpoint.
@@ -676,7 +641,7 @@ export ANTHROPIC_BASE_URL=https://your-proxy.example.com
export ANTHROPIC_AUTH_TOKEN=your-auth-token
export ANTHROPIC_SMALL_MODEL=claude-haiku-4-5-20251001
export ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
export ANTHROPIC_LARGE_MODEL=claude-opus-4-7
export ANTHROPIC_LARGE_MODEL=claude-opus-4-8
```
Source-build `.env` equivalent:
@@ -686,7 +651,7 @@ ANTHROPIC_BASE_URL=https://your-proxy.example.com
ANTHROPIC_AUTH_TOKEN=your-auth-token
ANTHROPIC_SMALL_MODEL=claude-haiku-4-5-20251001
ANTHROPIC_MEDIUM_MODEL=claude-sonnet-4-6
ANTHROPIC_LARGE_MODEL=claude-opus-4-7
ANTHROPIC_LARGE_MODEL=claude-opus-4-8
```
---
+1 -1
@@ -13,7 +13,7 @@ Use this file as the concise entry point for AI agents and LLMs reading this rep
- [Development](docs/development.md): Source-build workflow, common CLI commands, repository paths, and output locations.
- [Configuration](docs/configuration.md): Authenticated testing, login flows, rules of engagement, report filters, credential precedence, adaptive thinking, and rate-limit settings.
- [AI Providers](docs/ai-providers.md): Anthropic, AWS Bedrock, Google Vertex AI, and custom Anthropic-compatible endpoint setup.
- [AI Providers](docs/ai-providers.md): Anthropic, AWS Bedrock, and custom Anthropic-compatible endpoint setup.
- [Platforms and Networking](docs/platforms.md): Windows/WSL2, Linux, macOS, Docker networking, local applications, and custom hostnames.
- [Workspaces and Resuming](docs/workspaces.md): Workspace storage, naming, resuming interrupted scans, and examples.
- [Safety and Limitations](docs/safety.md): Authorized-use requirements, non-production guidance, mutative effects, model caveats, scope limits, cost, and performance.
+1254 -146
File diff suppressed because it is too large
-3
@@ -1,5 +1,2 @@
packages:
- "apps/*"
catalog:
"@anthropic-ai/claude-agent-sdk": ^0.2.114