mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 11:45:20 +02:00
8ca950f6f1
* feat: token registry for multi-agent browser access

  Per-agent scoped tokens with read/write/admin/meta command categories, domain glob restrictions, rate limiting, expiry, and revocation. Setup key exchange for the /pair-agent ceremony (5-min one-time key → 24h session token). Idempotent exchange handles tunnel drops. 39 tests.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: integrate token registry + scoped auth into browse server

  Server changes for multi-agent browser access:
  - /connect endpoint: setup key exchange for /pair-agent ceremony
  - /token endpoint: root-only minting of scoped sub-tokens
  - /token/:clientId DELETE: revoke agent tokens
  - /agents endpoint: list connected agents (root-only)
  - /health: strips root token when tunnel is active (P0 security fix)
  - /command: scope/rate/domain checks via token registry before dispatch
  - Idle timer skips shutdown when tunnel is active

* feat: ngrok tunnel integration + @ngrok/ngrok dependency

  BROWSE_TUNNEL=1 env var starts an ngrok tunnel after Bun.serve(). Reads NGROK_AUTHTOKEN from env or ~/.gstack/ngrok.env. Reads NGROK_DOMAIN for dedicated domain (stable URL). Updates state file with tunnel URL. Feasibility spike confirmed: SDK works in compiled Bun binary.

* feat: tab isolation for multi-agent browser access

  Add per-tab ownership tracking to BrowserManager. Scoped agents must create their own tab via newtab before writing. Unowned tabs (pre-existing, user-opened) are root-only for writes. Read access always allowed.

* feat: tab enforcement + POST /pair endpoint + activity attribution

  Server-side tab ownership check blocks scoped agents from writing to unowned tabs. Special-case newtab records ownership for scoped tokens. POST /pair endpoint creates setup keys for the pairing ceremony. Activity events now include clientId for attribution.

* feat: pair-agent CLI command + instruction block generator

  One command to pair a remote agent: $B pair-agent. Creates a setup key via POST /pair, prints a copy-pasteable instruction block with curl commands. Smart tunnel fallback (tunnel URL > auto-start > localhost). Flags: --for HOST, --local HOST, --admin, --client NAME.

* test: tab isolation + instruction block generator tests

  14 tests covering tab ownership lifecycle (access checks, unowned tabs, transferTab) and instruction block generator (scopes, URLs, admin flag, troubleshooting section). Fix server-auth test that used fragile sliceBetween boundaries broken by new endpoints.

* chore: bump version and changelog (v0.15.9.0)

* fix: CSO security fixes — token leak, domain bypass, input validation

  1. Remove root token from /health endpoint entirely (CSO #1 CRITICAL). Origin header is spoofable. Extension reads from ~/.gstack/.auth.json.
  2. Add domain check for newtab URL (CSO #5). Previously only goto was checked, allowing domain-restricted agents to bypass via newtab.
  3. Validate scope values, rateLimit, expiresSeconds in createToken() (CSO #4). Rejects invalid scopes and negative values.

* feat: /pair-agent skill — syntactic sugar for browser sharing

  Users remember /pair-agent, not $B pair-agent. The skill walks through agent selection (OpenClaw, Hermes, Codex, Cursor, generic), local vs remote setup, tunnel configuration, and includes platform-specific notes for each agent type. Wraps the CLI command with context.

* docs: remote browser access reference for paired agents

  Full API reference, snapshot→@ref pattern, scopes, tab isolation, error codes, ngrok setup, and same-machine shortcuts. The instruction block points here for deeper reading.

* feat: improved instruction block with snapshot→@ref pattern

  The paste-into-agent instruction block now teaches the snapshot→@ref workflow (the most powerful browsing pattern), shows the server URL prominently, and uses clearer formatting. Tests updated to match.

* feat: smart ngrok detection + auto-tunnel in pair-agent

  The pair-agent command now checks ngrok's native config (not just ~/.gstack/ngrok.env) and auto-starts the tunnel when ngrok is available. The skill template walks users through ngrok install and auth if not set up, instead of just printing a dead localhost URL.

* feat: on-demand tunnel start via POST /tunnel/start

  pair-agent now auto-starts the ngrok tunnel without restarting the server. New POST /tunnel/start endpoint reads authtoken from env, ~/.gstack/ngrok.env, or ngrok's native config. CLI detects ngrok availability and calls the endpoint automatically. Zero manual steps when ngrok is installed and authed.

* fix: pair-agent skill must output the instruction block verbatim

  Added CRITICAL instruction: the agent MUST output the full instruction block so the user can copy it. Previously the agent could summarize over it, leaving the user with nothing to paste.

* fix: scoped tokens rejected on /command — auth gate ordering bug

  The blanket validateAuth() gate (root-only) sat above the /command endpoint, rejecting all scoped tokens with 401 before they reached getTokenInfo(). Moved /command above the gate so both root and scoped tokens are accepted. This was the bug Wintermute hit.

* feat: pair-agent auto-launches headed mode before pairing

  When pair-agent detects headless mode, it auto-switches to headed (visible Chromium window) so the user can watch what the remote agent does. Use --headless to skip this. Fixed compiled binary path resolution (process.execPath, not process.argv[1] which is virtual /$bunfs/ in Bun compiled binaries).

* test: comprehensive tests for auth ordering, tunnel, ngrok, headed mode

  16 new tests covering:
  - /command sits above blanket auth gate (Wintermute bug)
  - /command uses getTokenInfo not validateAuth
  - /tunnel/start requires root, checks native ngrok config, returns already_active
  - /pair creates setup keys not session tokens
  - Tab ownership checked before command dispatch
  - Activity events include clientId
  - Instruction block teaches snapshot→@ref pattern
  - pair-agent auto-headed mode, process.execPath, --headless skip
  - isNgrokAvailable checks all 3 sources (gstack env, env var, native config)
  - handlePairAgent calls /tunnel/start not server restart

* fix: chain scope bypass + /health info leak when tunneled

  1. Chain command now pre-validates ALL subcommand scopes before executing any. A read+meta token can no longer escalate to admin via chain (eval, js, cookies were dispatched without scope checks). tokenInfo flows through handleMetaCommand into the chain handler. Rejects entire chain if any subcommand fails.
  2. /health strips sensitive fields (currentUrl, agent.currentMessage, session) when tunnel is active. Only operational metadata (status, mode, uptime, tabs) exposed to the internet. Previously anyone reaching the ngrok URL could surveil browsing activity.

* docs: tout /pair-agent as headline feature in CHANGELOG + README

  Lead with what it does for the user: type /pair-agent, paste into your other agent, done. First time AI agents from different companies can coordinate through a shared browser with real security boundaries.

* docs: expand /pair-agent, /design-shotgun, /design-html in README

  Each skill gets a real narrative paragraph explaining the workflow, not just a table cell. design-shotgun: visual exploration with taste memory. design-html: production HTML with Pretext computed layout. pair-agent: cross-vendor AI agent coordination through shared browser.

* refactor: split handleCommand into handleCommandInternal + HTTP wrapper

  Chain subcommands now route through handleCommandInternal for full security enforcement (scope, domain, tab ownership, rate limiting, content wrapping). Adds recursion guard for nested chains, rate-limit exemption for chain subcommands, and activity event suppression (1 event per chain, not per sub).

* feat: add content-security.ts with datamarking, envelope, and filter hooks

  Four-layer prompt injection defense for pair-agent browser sharing:
  - Datamarking: session-scoped watermark for text exfiltration detection
  - Content envelope: trust boundary wrapping with ZWSP marker escaping
  - Content filter hooks: extensible filter pipeline with warn/block modes
  - Built-in URL blocklist: requestbin, pipedream, webhook.site, etc.

  BROWSE_CONTENT_FILTER env var controls mode: off|warn|block (default: warn)

* feat: centralize content wrapping in handleCommandInternal response path

  Single wrapping location replaces fragmented per-handler wrapping:
  - Scoped tokens: content filters + datamarking + enhanced envelope
  - Root tokens: existing basic wrapping (backward compat)
  - Chain subcommands exempt from top-level wrapping (wrapped individually)
  - Adds 'attrs' to PAGE_CONTENT_COMMANDS (ARIA value exposure defense)

* feat: hidden element stripping for scoped token text extraction

  Detects CSS-hidden elements (opacity, font-size, off-screen, same-color, clip-path) and ARIA label injection patterns. Marks elements with data-gstack-hidden, extracts text from a clean clone (no DOM mutation), then removes markers. Only active for scoped tokens on text command.

* feat: snapshot split output format for scoped tokens

  Scoped tokens get a split snapshot: trusted @refs section (for click/fill) separated from untrusted web content in an envelope. Ref names truncated to 50 chars in trusted section. Root tokens unchanged (backward compat). Resume command also uses split format for scoped tokens.

* feat: add SECURITY section to pair-agent instruction block

  Instructs remote agents to treat content inside untrusted envelopes as potentially malicious. Lists common injection phrases to watch for. Directs agents to only use @refs from the trusted INTERACTIVE ELEMENTS section, not from page content.

* test: add 4 prompt injection test fixtures

  - injection-visible.html: visible injection in product review text
  - injection-hidden.html: 7 CSS hiding techniques + ARIA injection + false positive
  - injection-social.html: social engineering in legitimate-looking content
  - injection-combined.html: all attack types + envelope escape attempt

* test: comprehensive content security tests (47 tests)

  Covers all 4 defense layers:
  - Datamarking: marker format, session consistency, text-only application
  - Content envelope: wrapping, ZWSP marker escaping, filter warnings
  - Content filter hooks: URL blocklist, custom filters, warn/block modes
  - Instruction block: SECURITY section content, ordering, generation
  - Centralized wrapping: source-level verification of integration
  - Chain security: recursion guard, rate-limit exemption, activity suppression
  - Hidden element stripping: 7 CSS techniques, ARIA injection, false positives
  - Snapshot split format: scoped vs root output, resume integration

  Also fixes: visibility:hidden detection, case-insensitive ARIA pattern matching.

* fix: pair-agent skill compliance + fix all 16 pre-existing test failures

  Root cause: pair-agent was added without completing the gen-skill-docs compliance checklist. All 16 failures traced back to this. Fixes:
  - Sync package.json version to VERSION (0.15.9.0)
  - Add "(gstack)" to pair-agent description for discoverability
  - Add pair-agent to Codex path exception (legitimately documents ~/.codex/)
  - Add CLI_COMMANDS (status, pair-agent, tunnel) to skill parser allowlist
  - Regenerate SKILL.md for all hosts (claude, codex, factory, kiro, etc.)
  - Update golden file baselines for ship skill
  - Fix relink tests: pass GSTACK_INSTALL_DIR to auto-relink calls so they use the fast mock install instead of scanning real ~/.claude/skills/gstack

* chore: bump version and changelog (v0.15.12.0)

* fix: E2E exit reason precedence + worktree prune race condition

  Two fixes for E2E test reliability:
  1. session-runner.ts: error_max_turns was misclassified as error_api because the is_error flag was checked before subtype. Now known subtypes like error_max_turns are preserved even when is_error is set. The is_error override only applies when subtype=success (API failure).
  2. worktree.ts: pruneStale() now skips worktrees < 1 hour old to avoid deleting worktrees from concurrent test runs still in progress. Previously any second test execution would kill the first's worktrees.

* fix: restore token in /health for localhost extension auth

  The CSO security fix stripped the token from /health to prevent leaking when tunneled. But the extension needs it to authenticate on localhost. Now returns token only when not tunneled (safe: localhost-only path).

* test: verify /health token is localhost-only, never served through tunnel

  Updated tests to match the restored token behavior:
  - Test 1: token assignment exists AND is inside the !tunnelActive guard
  - Test 1b: tunnel branch (else block) does not contain AUTH_TOKEN

* docs: add security rationale for token in /health on localhost

  Explains why this is an accepted risk (no escalation over file-based token access), CORS protection, and tunnel guard. Prevents future CSO scans from stripping it without providing an alternative auth path.

* fix: verify tunnel is alive before returning URL to pair-agent

  Root cause: when ngrok dies externally (pkill, crash, timeout), the server still reports tunnelActive=true with a dead URL. pair-agent prints an instruction block pointing at a dead tunnel. The remote agent gets "endpoint offline" and the user has to manually restart everything.

  Three-layer fix:
  - Server /pair endpoint: probes tunnel URL before returning it. If dead, resets tunnelActive/tunnelUrl and returns null (triggers CLI restart).
  - Server /tunnel/start: probes cached tunnel before returning already_active. If dead, falls through to restart ngrok automatically.
  - CLI pair-agent: double-checks tunnel URL from server before printing instruction block. Falls through to auto-start on failure.

  4 regression tests verify all three probe points + CLI verification.

* feat: add POST /batch endpoint for multi-command batching

  Remote agents controlling GStack Browser through a tunnel pay 2-5s of latency per HTTP round-trip. A typical "navigate and read" takes 4 sequential commands = 10-20 seconds. The /batch endpoint collapses N commands into a single HTTP round-trip, cutting a 20-tab crawl from ~60s to ~5s.

  Sequential execution through the full security pipeline (scope, domain, tab ownership, content wrapping). Rate limiting counts the batch as 1 request. Activity events emitted at batch level, not per-command. Max 50 commands per batch. Nested batches rejected.

* test: add source-level security tests for /batch endpoint

  8 tests verifying: auth gate placement, scoped token support, max command limit, nested batch rejection, rate limiting bypass, batch-level activity events, command field validation, and tabId passthrough.

* fix: correct CHANGELOG date from 2026-04-06 to 2026-04-05

* refactor: consolidate Hermes into generic HTTP option in pair-agent

  Hermes doesn't have a host-specific config — it uses the same generic curl instructions as any other agent. Removing the dedicated option simplifies the menu and eliminates a misleading distinction.

* chore: bump VERSION to 0.15.14.0, add CHANGELOG entry for batch endpoint

* chore: regenerate pair-agent/SKILL.md after main merge

  Vendoring deprecation section from main's template wasn't reflected in the generated file. Fixes check-freshness CI.

* refactor: checkTabAccess uses options object, add own-only tab policy

  Refactors checkTabAccess(tabId, clientId, isWrite) to use an options object { isWrite?, ownOnly? }. Adds tabPolicy === 'own-only' support in the server command dispatch — scoped tokens with this policy are restricted to their own tabs for all commands, not just writes.

* feat: add --domain flag to pair-agent CLI for domain restrictions

  Allows passing --domain to pair-agent to restrict the remote agent's navigation to specific domains (comma-separated).

* revert: remove batch commands CHANGELOG entry and VERSION bump

  The batch endpoint work belongs on the browser-batch-multitab branch (port-louis), not this branch. Reverting VERSION to 0.15.14.0.

* fix: adopt main's headed-mode /health token serving

  Our merge kept the old !tunnelActive guard which conflicted with main's security-audit-r2 tests that require no currentUrl/currentMessage in /health. Adopts main's approach: serve token conditionally based on headed mode or chrome-extension origin. Updates server-auth tests.

* fix: improve snapshot flags docs completeness for LLM judge

  Adds $B placeholder explanation, explicit syntax line, and detailed flag behavior (-d depth values, -s CSS selector syntax, -D unified diff format and baseline persistence, -a screenshot vs text output relationship). Fixes snapshot flags reference LLM eval scoring completeness < 4.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
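The /batch constraints described above (max 50 commands, nested batches rejected, one rate-limit charge per batch) can be mirrored client-side so a remote agent fails fast instead of burning a tunnel round-trip. This is a hypothetical helper, not code from the repository; the field names (`command`, `args`, `tabId`, `commands`) are assumptions based on the commit messages:

```typescript
// Hypothetical client-side guard for the documented /batch limits.
// The payload shape is an assumption inferred from the commit log above.
interface BatchCommand {
  command: string;
  args?: string[];
  tabId?: string;
}

function buildBatchPayload(commands: BatchCommand[]): { commands: BatchCommand[] } {
  if (commands.length === 0) throw new Error('batch must contain at least one command');
  if (commands.length > 50) throw new Error('batch exceeds the documented 50-command limit');
  for (const c of commands) {
    if (!c.command) throw new Error('each entry needs a command field');
    // The server rejects nested batches; reject them before sending.
    if (c.command === 'batch') throw new Error('nested batches are rejected');
  }
  return { commands };
}

const payload = buildBatchPayload([
  { command: 'goto', args: ['https://example.com'] },
  { command: 'snapshot' },
]);
// payload.commands.length === 2
```

The guard only encodes what the commit messages state; server-side enforcement (scope, domain, tab ownership) still applies to every subcommand.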
361 lines
12 KiB
TypeScript
/**
 * Claude CLI subprocess runner for skill E2E testing.
 *
 * Spawns `claude -p` as a completely independent process (not via Agent SDK),
 * so it works inside Claude Code sessions. Pipes prompt via stdin, streams
 * NDJSON output for real-time progress, scans for browse errors.
 */

import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
import { getProjectEvalDir } from './eval-store';

const GSTACK_DEV_DIR = path.join(os.homedir(), '.gstack-dev');
const HEARTBEAT_PATH = path.join(GSTACK_DEV_DIR, 'e2e-live.json'); // heartbeat stays global
const PROJECT_DIR = path.dirname(getProjectEvalDir()); // ~/.gstack/projects/$SLUG/

/** Sanitize test name for use as filename: strip leading slashes, replace / with - */
export function sanitizeTestName(name: string): string {
  return name.replace(/^\/+/, '').replace(/\//g, '-');
}

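A quick usage sketch of the sanitizer above. The example test names are invented for illustration, not taken from this repo's test suite:

```typescript
// Same two-step regex as sanitizeTestName: strip leading slashes,
// then flatten any remaining slashes into dashes.
const sanitize = (name: string) => name.replace(/^\/+/, '').replace(/\//g, '-');

const a = sanitize('/skills/pair-agent');        // 'skills-pair-agent'
const b = sanitize('e2e/browse/snapshot-flags'); // 'e2e-browse-snapshot-flags'
```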
/** Atomic write: write to .tmp then rename. Non-fatal on error. */
function atomicWriteSync(filePath: string, data: string): void {
  const tmp = filePath + '.tmp';
  fs.writeFileSync(tmp, data);
  fs.renameSync(tmp, filePath);
}

export interface CostEstimate {
  inputChars: number;
  outputChars: number;
  estimatedTokens: number;
  estimatedCost: number; // USD
  turnsUsed: number;
}

export interface SkillTestResult {
  toolCalls: Array<{ tool: string; input: any; output: string }>;
  browseErrors: string[];
  exitReason: string;
  duration: number;
  output: string;
  costEstimate: CostEstimate;
  transcript: any[];
  /** Which model was used for this test (added for Sonnet/Opus split diagnostics) */
  model: string;
  /** Time from spawn to first NDJSON line, in ms (added for rate-limit diagnostics) */
  firstResponseMs: number;
  /** Peak latency between consecutive tool calls, in ms */
  maxInterTurnMs: number;
}

const BROWSE_ERROR_PATTERNS = [
  /Unknown command: \w+/,
  /Unknown snapshot flag: .+/,
  /ERROR: browse binary not found/,
  /Server failed to start/,
  /no such file or directory.*browse/i,
];

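These patterns are matched against the concatenated transcript and stderr later in runSkillTest, keeping the first match per pattern and truncating it to 200 chars. A minimal self-contained sketch of that scan (the sample output text is invented):

```typescript
// Mirrors the error scan in runSkillTest: first match per pattern,
// truncated to 200 chars. The sample text below is invented.
const patterns = [/Unknown command: \w+/, /Server failed to start/];
const allText = 'note: Unknown command: snapshoot\nServer failed to start';
const errors: string[] = [];
for (const pattern of patterns) {
  const match = allText.match(pattern);
  if (match) errors.push(match[0].slice(0, 200));
}
// errors → ['Unknown command: snapshoot', 'Server failed to start']
```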
// --- Testable NDJSON parser ---

export interface ParsedNDJSON {
  transcript: any[];
  resultLine: any | null;
  turnCount: number;
  toolCallCount: number;
  toolCalls: Array<{ tool: string; input: any; output: string }>;
}

/**
 * Parse an array of NDJSON lines into structured transcript data.
 * Pure function — no I/O, no side effects. Used by both the streaming
 * reader and unit tests.
 */
export function parseNDJSON(lines: string[]): ParsedNDJSON {
  const transcript: any[] = [];
  let resultLine: any = null;
  let turnCount = 0;
  let toolCallCount = 0;
  const toolCalls: ParsedNDJSON['toolCalls'] = [];

  for (const line of lines) {
    if (!line.trim()) continue;
    try {
      const event = JSON.parse(line);
      transcript.push(event);

      // Track turns and tool calls from assistant events
      if (event.type === 'assistant') {
        turnCount++;
        const content = event.message?.content || [];
        for (const item of content) {
          if (item.type === 'tool_use') {
            toolCallCount++;
            toolCalls.push({
              tool: item.name || 'unknown',
              input: item.input || {},
              output: '',
            });
          }
        }
      }

      if (event.type === 'result') resultLine = event;
    } catch { /* skip malformed lines */ }
  }

  return { transcript, resultLine, turnCount, toolCallCount, toolCalls };
}

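A usage sketch for the parser above. The event shapes are the minimal fields parseNDJSON reads, assumed from its implementation rather than from any CLI documentation:

```typescript
// Minimal NDJSON lines shaped the way parseNDJSON expects: one assistant
// event carrying a tool_use block, then a result event, then a bad line.
const lines = [
  JSON.stringify({ type: 'assistant', message: { content: [{ type: 'tool_use', name: 'Bash', input: { command: 'ls' } }] } }),
  JSON.stringify({ type: 'result', subtype: 'success', num_turns: 1 }),
  'not json', // malformed lines are skipped, not fatal
];
// parseNDJSON(lines) would yield turnCount 1, toolCallCount 1, a resultLine
// with subtype 'success', and transcript.length 2 (the bad line is dropped).
```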
function truncate(s: string, max: number): string {
  return s.length > max ? s.slice(0, max) + '…' : s;
}

// --- Main runner ---

export async function runSkillTest(options: {
  prompt: string;
  workingDirectory: string;
  maxTurns?: number;
  allowedTools?: string[];
  timeout?: number;
  testName?: string;
  runId?: string;
  /** Model to use. Defaults to claude-sonnet-4-6 (overridable via EVALS_MODEL env). */
  model?: string;
}): Promise<SkillTestResult> {
  const {
    prompt,
    workingDirectory,
    maxTurns = 15,
    allowedTools = ['Bash', 'Read', 'Write'],
    timeout = 120_000,
    testName,
    runId,
  } = options;
  const model = options.model ?? process.env.EVALS_MODEL ?? 'claude-sonnet-4-6';

  const startTime = Date.now();
  const startedAt = new Date().toISOString();

  // Set up per-run log directory if runId is provided
  let runDir: string | null = null;
  const safeName = testName ? sanitizeTestName(testName) : null;
  if (runId) {
    try {
      runDir = path.join(PROJECT_DIR, 'e2e-runs', runId);
      fs.mkdirSync(runDir, { recursive: true });
    } catch { /* non-fatal */ }
  }

  // Spawn claude -p with streaming NDJSON output. Prompt piped via stdin to
  // avoid shell escaping issues. --verbose is required for stream-json mode.
  const args = [
    '-p',
    '--model', model,
    '--output-format', 'stream-json',
    '--verbose',
    '--dangerously-skip-permissions',
    '--max-turns', String(maxTurns),
    '--allowed-tools', ...allowedTools,
  ];

  // Write prompt to a temp file OUTSIDE workingDirectory to avoid race conditions
  // where afterAll cleanup deletes the dir before cat reads the file (especially
  // with --concurrent --retry). Using os.tmpdir() + unique suffix keeps it stable.
  const promptFile = path.join(os.tmpdir(), `.prompt-${process.pid}-${Date.now()}-${Math.random().toString(36).slice(2)}`);
  fs.writeFileSync(promptFile, prompt);

  const proc = Bun.spawn(['sh', '-c', `cat "${promptFile}" | claude ${args.map(a => `"${a}"`).join(' ')}`], {
    cwd: workingDirectory,
    stdout: 'pipe',
    stderr: 'pipe',
  });

  // Race against timeout
  let stderr = '';
  let exitReason = 'unknown';
  let timedOut = false;

  const timeoutId = setTimeout(() => {
    timedOut = true;
    proc.kill();
  }, timeout);

  // Stream NDJSON from stdout for real-time progress
  const collectedLines: string[] = [];
  let liveTurnCount = 0;
  let liveToolCount = 0;
  let firstResponseMs = 0;
  let lastToolTime = 0;
  let maxInterTurnMs = 0;
  const stderrPromise = new Response(proc.stderr).text();

  const reader = proc.stdout.getReader();
  const decoder = new TextDecoder();
  let buf = '';

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buf += decoder.decode(value, { stream: true });
      const lines = buf.split('\n');
      buf = lines.pop() || '';
      for (const line of lines) {
        if (!line.trim()) continue;
        collectedLines.push(line);

        // Real-time progress to stderr + persistent logs
        try {
          const event = JSON.parse(line);
          if (event.type === 'assistant') {
            liveTurnCount++;
            const content = event.message?.content || [];
            for (const item of content) {
              if (item.type === 'tool_use') {
                liveToolCount++;
                const now = Date.now();
                const elapsed = Math.round((now - startTime) / 1000);
                // Track timing telemetry
                if (firstResponseMs === 0) firstResponseMs = now - startTime;
                if (lastToolTime > 0) {
                  const interTurn = now - lastToolTime;
                  if (interTurn > maxInterTurnMs) maxInterTurnMs = interTurn;
                }
                lastToolTime = now;
                const progressLine = ` [${elapsed}s] turn ${liveTurnCount} tool #${liveToolCount}: ${item.name}(${truncate(JSON.stringify(item.input || {}), 80)})\n`;
                process.stderr.write(progressLine);

                // Persist progress.log
                if (runDir) {
                  try { fs.appendFileSync(path.join(runDir, 'progress.log'), progressLine); } catch { /* non-fatal */ }
                }

                // Write heartbeat (atomic)
                if (runId && testName) {
                  try {
                    const toolDesc = `${item.name}(${truncate(JSON.stringify(item.input || {}), 60)})`;
                    atomicWriteSync(HEARTBEAT_PATH, JSON.stringify({
                      runId,
                      pid: proc.pid,
                      startedAt,
                      currentTest: testName,
                      status: 'running',
                      turn: liveTurnCount,
                      toolCount: liveToolCount,
                      lastTool: toolDesc,
                      lastToolAt: new Date().toISOString(),
                      elapsedSec: elapsed,
                    }, null, 2) + '\n');
                  } catch { /* non-fatal */ }
                }
              }
            }
          }
        } catch { /* skip — parseNDJSON will handle it later */ }

        // Append raw NDJSON line to per-test transcript file
        if (runDir && safeName) {
          try { fs.appendFileSync(path.join(runDir, `${safeName}.ndjson`), line + '\n'); } catch { /* non-fatal */ }
        }
      }
    }
  } catch { /* stream read error — fall through to exit code handling */ }

  // Flush remaining buffer
  if (buf.trim()) {
    collectedLines.push(buf);
  }

  stderr = await stderrPromise;
  const exitCode = await proc.exited;
  clearTimeout(timeoutId);

  try { fs.unlinkSync(promptFile); } catch { /* non-fatal */ }

  if (timedOut) {
    exitReason = 'timeout';
  } else if (exitCode === 0) {
    exitReason = 'success';
  } else {
    exitReason = `exit_code_${exitCode}`;
  }

  const duration = Date.now() - startTime;

  // Parse all collected NDJSON lines
  const parsed = parseNDJSON(collectedLines);
  const { transcript, resultLine, toolCalls } = parsed;
  const browseErrors: string[] = [];

  // Scan transcript + stderr for browse errors
  const allText = transcript.map(e => JSON.stringify(e)).join('\n') + '\n' + stderr;
  for (const pattern of BROWSE_ERROR_PATTERNS) {
    const match = allText.match(pattern);
    if (match) {
      browseErrors.push(match[0].slice(0, 200));
    }
  }

  // Use resultLine for structured result data
  if (resultLine) {
    if (resultLine.subtype === 'success' && resultLine.is_error) {
      // claude -p can return subtype=success with is_error=true (e.g. API connection failure)
      exitReason = 'error_api';
    } else if (resultLine.subtype === 'success') {
      exitReason = 'success';
    } else if (resultLine.subtype) {
      // Preserve known subtypes like error_max_turns even if is_error is set
      exitReason = resultLine.subtype;
    }
  }

  // Save failure transcript to persistent run directory (or fallback to workingDirectory)
  if (browseErrors.length > 0 || exitReason !== 'success') {
    try {
      const failureDir = runDir || path.join(workingDirectory, '.gstack', 'test-transcripts');
      fs.mkdirSync(failureDir, { recursive: true });
      const failureName = safeName
        ? `${safeName}-failure.json`
        : `e2e-${new Date().toISOString().replace(/[:.]/g, '-')}.json`;
      fs.writeFileSync(
        path.join(failureDir, failureName),
        JSON.stringify({
          prompt: prompt.slice(0, 500),
          testName: testName || 'unknown',
          exitReason,
          browseErrors,
          duration,
          turnAtTimeout: timedOut ? liveTurnCount : undefined,
          lastToolCall: liveToolCount > 0 ? `tool #${liveToolCount}` : undefined,
          stderr: stderr.slice(0, 2000),
          result: resultLine ? { type: resultLine.type, subtype: resultLine.subtype, result: resultLine.result?.slice?.(0, 500) } : null,
        }, null, 2),
      );
    } catch { /* non-fatal */ }
  }

  // Cost from result line (exact) or estimate from chars
  const turnsUsed = resultLine?.num_turns || 0;
  const estimatedCost = resultLine?.total_cost_usd || 0;
  const inputChars = prompt.length;
  const outputChars = (resultLine?.result || '').length;
  const estimatedTokens = (resultLine?.usage?.input_tokens || 0)
    + (resultLine?.usage?.output_tokens || 0)
    + (resultLine?.usage?.cache_read_input_tokens || 0);

  const costEstimate: CostEstimate = {
    inputChars,
    outputChars,
    estimatedTokens,
    estimatedCost: Math.round(estimatedCost * 100) / 100,
    turnsUsed,
  };

  return { toolCalls, browseErrors, exitReason, duration, output: resultLine?.result || '', costEstimate, transcript, model, firstResponseMs, maxInterTurnMs };
}