From 30fe6bb11c6915e6264440b8a044ed0dcfbe06e0 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Sun, 3 May 2026 20:26:59 -0700 Subject: [PATCH] v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement (#1313) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * fix(plan-eng-review): tighten STOP gates with anti-rationalization clause Five sites in SKILL.md.tmpl uplift to the office-hours b512be71 pattern: the four review-section gates (Architecture, Code Quality, Test, Performance) plus the Step 0 complexity-check trigger. Adds tool_use reminder ("call the tool directly"), names blocked next steps explicitly, anti-rationalization clause naming the precise failure mode (loading the schema via ToolSearch and writing the recommendation as chat prose). Co-Authored-By: Claude Opus 4.7 (1M context) * feat(test/helpers): initialPlanContent + wrote_findings_before_asking + shared report-at-bottom assertion Three additions to claude-pty-runner.ts: 1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a user message containing the seeded plan before invoking the skill, with a 3s gap so the message renders before the slash command. claude has no --plan-file flag (verified via claude --help), so message-pump is the route. Lets STOP-gate regression tests force complexity findings. 2. ClassifyResult gains wrote_findings_before_asking with companion strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/ Edit to .claude/plans/* precedes any AskUserQuestion render in the session window. Default off — preserves zero-findings → write plan → plan_ready as legitimate for unseeded smokes. Six new unit tests cover before/after-AUQ ordering, permission-dialog edge case, strict-off path. 3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing assertReviewReportAtBottom(content) and gates on obs.planFile (artifact existing), so the assertion fires under both 'asked' and 'plan_ready' when a plan was actually written. Also: runPlanSkillObservation now captures obs.planFile on every classifier outcome, not just 'plan_ready'. Catches the case where the skill wrote a plan partway through then paused on a question. Co-Authored-By: Claude Opus 4.7 (1M context) * test: wire assertReportAtBottomIfPlanWritten into 4 plan-mode E2E tests + add seeded-plan STOP-gate case Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last ## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract; nothing tested it until now. Plan-eng additionally gains a third test case: STOP gate fires when seeded plan forces Step 0 findings. Combines the new initialPlanContent runner option with --disallowedTools AskUserQuestion to force the Conductor MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in {wrote_findings_before_asking, auto_decided, silent_write, exited, timeout} and that plan_ready outcomes carry a ## Decisions section. Co-Authored-By: Claude Opus 4.7 (1M context) * fix(touchfiles): delete duplicate plan-design-review-plan-mode keys Verified duplicates in test/helpers/touchfiles.ts: - E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps) AND line 243 (smaller deps); JS object literals: later wins. - E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same later-wins rule. Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings ran on every PR; design ran weekly only. Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and line 399 ('gate'). Also extend the four plan-mode-test entries to include scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}} trigger all four siblings in bun run eval:select. Co-Authored-By: Claude Opus 4.7 (1M context) * chore: bump version and changelog (v1.26.2.0) Co-Authored-By: Claude Opus 4.7 (1M context) * docs: tighten CHANGELOG voice for v1.26.2.0 Move contributor-flavored bullet (runPlanSkillObservation seeding) into For contributors. Drop branch-internal narrative (Codex review pass, plan iteration tracking) per CHANGELOG-for-users style. Co-Authored-By: Claude Opus 4.7 --------- Co-authored-by: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 50 +++++++ VERSION | 2 +- package.json | 2 +- plan-eng-review/SKILL.md | 22 ++- plan-eng-review/SKILL.md.tmpl | 22 ++- test/helpers/claude-pty-runner.ts | 139 +++++++++++++++++-- test/helpers/claude-pty-runner.unit.test.ts | 72 ++++++++++ test/helpers/touchfiles.ts | 10 +- test/skill-e2e-plan-ceo-plan-mode.test.ts | 8 +- test/skill-e2e-plan-design-plan-mode.test.ts | 7 +- test/skill-e2e-plan-devex-plan-mode.test.ts | 8 +- test/skill-e2e-plan-eng-plan-mode.test.ts | 83 ++++++++++- 12 files changed, 394 insertions(+), 31 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 4f584f46..3144aa4e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,55 @@ # Changelog +## [1.26.2.0] - 2026-05-03 + +## **`/plan-eng-review` always asks. Never silently writes findings to your plan first.** + +Plan-mode review skills now have a hard STOP gate before any AskUserQuestion. The bug +this closes: a `/plan-eng-review` session would do Step 0 scope challenge, find real +issues, write the findings into the plan file as prose, then call `ExitPlanMode` — +never invoking AskUserQuestion. The user only saw "ready to execute" with the model's +opinions already baked in. The tool to surface the question existed, the prompt +told the model to use it, and the model still routed around it. + +Five sites in `plan-eng-review/SKILL.md.tmpl` now use the office-hours `b512be71` +pattern verbatim: "the AskUserQuestion call is a tool_use, not prose — call the +tool directly," named blockers ("do not edit the plan file, do not call +ExitPlanMode"), and an anti-rationalization clause ("loading the schema via +ToolSearch and writing the recommendation as chat prose is the failure mode this +gate exists to prevent"). The four review-section gates (Architecture, Code +Quality, Test, Performance) and the Step 0 complexity-check trigger all use the +same language. + +### What you can now do + +- **Trust that any plan-* review skill that produces a plan file ends with the review report.** All four plan-mode E2E tests (`plan-eng`, `plan-ceo`, `plan-design`, `plan-devex`) now assert `## GSTACK REVIEW REPORT` is the last `## ` section of the plan file whenever one was written. The `{{PLAN_FILE_REVIEW_REPORT}}` resolver mandated this contract; nothing tested it until now. +- **Catch the "writes findings to plan as prose before asking" failure mode.** New `wrote_findings_before_asking` classifier outcome fires when a `Write`/`Edit` to `.claude/plans/*` precedes any AskUserQuestion render in the session window. Opt-in via `strictPlanWrites: true` so existing tests where zero-findings → write plan → plan_ready stays legitimate. +- **Run `plan-design-review-plan-mode` on PR CI again.** The touchfiles entry was duplicated — `plan-design-review-plan-mode` appeared at line 94 (gate, full deps) and line 243 (smaller deps). JS object literals: later wins. The effective tier was `periodic`, not `gate`. Three of four plan-mode siblings ran on every PR; design didn't. + +### Itemized changes + +#### Added + +- `runPlanSkillObservation`'s `initialPlanContent?: string` option. Pre-pumps a user message containing the seeded plan before invoking the skill, with a 3s gap so the message renders before the slash command. +- `ClassifyResult` outcome `wrote_findings_before_asking` with companion `strictPlanWrites?` opt on `classifyVisible`. Six new unit tests in `claude-pty-runner.unit.test.ts` cover before/after-AUQ ordering plus the strict-off legacy path. +- Shared test helper `assertReportAtBottomIfPlanWritten(obs)` in `claude-pty-runner.ts`. Wraps the existing `assertReviewReportAtBottom(content)` and gates on `obs.planFile` (artifact existing), so the assertion fires under `'asked'` and `'plan_ready'` both — wherever a plan file was actually written. +- New seeded-plan test case in `skill-e2e-plan-eng-plan-mode.test.ts`: `STOP gate fires when seeded plan forces Step 0 findings`. Combines `initialPlanContent` + `--disallowedTools AskUserQuestion` to force the Conductor MCP-variant path through `mcp__*__AskUserQuestion`. + +#### Changed + +- `plan-eng-review/SKILL.md.tmpl` lines 116, 139, 152, 160, 169 ported from soft "STOP." prose to the office-hours pattern. Adds tool_use reminder, names blocked next steps explicitly, anti-rationalization clause. +- `runPlanSkillObservation` now captures `obs.planFile` on every classifier outcome (was: only `'plan_ready'`). Catches the case where the skill wrote a plan partway through then paused on a question. + +#### Fixed + +- `test/helpers/touchfiles.ts` duplicate `plan-design-review-plan-mode` keys deleted (line 243 in `E2E_TOUCHFILES`, line 524 in `E2E_TIERS`). Effective tier is now `gate` again, matching the other three siblings. +- `scripts/resolvers/review.ts` added to all four plan-mode-test touchfiles entries so changes to the `{{PLAN_FILE_REVIEW_REPORT}}` resolver text trigger all four sibling tests in `bun run eval:select`. + +#### For contributors + +- 6 new classifier unit tests in `test/helpers/claude-pty-runner.unit.test.ts` (70 → 76). +- New `initialPlanContent?: string` option on `runPlanSkillObservation` for seeding a draft plan into a test run before invoking the skill. Lets STOP-gate regression tests pre-pump guaranteed-finding-triggering complexity (8+ files, custom-vs-builtin smell) so the skill has something concrete to react to. + ## [1.26.1.0] - 2026-05-03 ## **`gstack-gbrain-sync` ships host-agnostic. Curated artifacts push from Claude Code, Codex CLI, or a dev workspace — same orchestrator, same install, same result.** diff --git a/VERSION b/VERSION index 0a959e2f..75334e9d 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.26.1.0 +1.26.2.0 diff --git a/package.json b/package.json index 23130761..3fa1d216 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gstack", - "version": "1.26.1.0", + "version": "1.26.2.0", "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.", "license": "MIT", "type": "module", diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index 53da6b23..02ffc528 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -832,7 +832,11 @@ Before reviewing anything, answer these questions: - How will users download or install it (GitHub Releases, package manager, container registry)? If the plan defers distribution, flag it explicitly in the "NOT in scope" section — don't let it silently drop. -If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. +If the complexity check triggers (8+ files or 2+ new classes/services), STOP before any review-section work. Call AskUserQuestion: name what's overbuilt, propose a minimal version that achieves the core goal, ask whether to reduce or proceed as-is. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to Section 1 (Architecture review), edit the plan file with a proposed scope reduction, or call ExitPlanMode until the user responds. Naming the 80% solution in chat prose and continuing — or loading the AskUserQuestion schema via ToolSearch and then never invoking it — is the failure mode this gate exists to prevent. + +If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. @@ -891,7 +895,9 @@ Evaluate: * For each new codepath or integration point, describe one realistic production failure scenario and whether the plan accounts for it. * **Distribution architecture:** If this introduces a new artifact (binary, package, container), how does it get built, published, and updated? Is the CI/CD pipeline part of the plan or deferred? -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. ## Confidence Calibration @@ -927,7 +933,9 @@ Evaluate: * Areas that are over-engineered or under-engineered relative to my preferences. * Existing ASCII diagrams in touched files — are they still accurate after this change? -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. ### 3. Test review @@ -1109,7 +1117,9 @@ This file is consumed by `/qa` and `/qa-only` as primary test input. Include onl For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in CLAUDE.md. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against. Then use AskUserQuestion to confirm the eval scope with the user. -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. ### 4. Performance review Evaluate: @@ -1118,7 +1128,9 @@ Evaluate: * Caching opportunities. * Slow or high-complexity code paths. -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. ## Outside Voice — Independent Plan Challenge (optional, recommended) diff --git a/plan-eng-review/SKILL.md.tmpl b/plan-eng-review/SKILL.md.tmpl index 2d267837..4b094b07 100644 --- a/plan-eng-review/SKILL.md.tmpl +++ b/plan-eng-review/SKILL.md.tmpl @@ -113,7 +113,11 @@ Before reviewing anything, answer these questions: - How will users download or install it (GitHub Releases, package manager, container registry)? If the plan defers distribution, flag it explicitly in the "NOT in scope" section — don't let it silently drop. -If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. +If the complexity check triggers (8+ files or 2+ new classes/services), STOP before any review-section work. Call AskUserQuestion: name what's overbuilt, propose a minimal version that achieves the core goal, ask whether to reduce or proceed as-is. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to Section 1 (Architecture review), edit the plan file with a proposed scope reduction, or call ExitPlanMode until the user responds. Naming the 80% solution in chat prose and continuing — or loading the AskUserQuestion schema via ToolSearch and then never invoking it — is the failure mode this gate exists to prevent. + +If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. @@ -136,7 +140,9 @@ Evaluate: * For each new codepath or integration point, describe one realistic production failure scenario and whether the plan accounts for it. * **Distribution architecture:** If this introduces a new artifact (binary, package, container), how does it get built, published, and updated? Is the CI/CD pipeline part of the plan or deferred? -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. {{CONFIDENCE_CALIBRATION}} @@ -149,7 +155,9 @@ Evaluate: * Areas that are over-engineered or under-engineered relative to my preferences. * Existing ASCII diagrams in touched files — are they still accurate after this change? -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. ### 3. Test review @@ -157,7 +165,9 @@ Evaluate: For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in CLAUDE.md. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against. Then use AskUserQuestion to confirm the eval scope with the user. -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. ### 4. Performance review Evaluate: @@ -166,7 +176,9 @@ Evaluate: * Caching opportunities. * Slow or high-complexity code paths. -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Use the preamble's AskUserQuestion Format section. The AskUserQuestion call is a tool_use, not prose — call the tool directly. If no AskUserQuestion variant is callable in this session, follow the preamble's "Tool resolution" fallback: in plan mode, write `## Decisions to confirm` into the plan file and ExitPlanMode; outside plan mode, output the decision brief as prose and stop. Never silently auto-decide. + +**STOP.** Do NOT proceed to the next review section, edit the plan file with the proposed fix, or call ExitPlanMode until the user responds. An issue with an "obvious fix" is still an issue and still needs explicit user approval before it lands in the plan. Loading the AskUserQuestion schema via ToolSearch and then writing the recommendation as chat prose is the failure mode this gate exists to prevent. {{CODEX_PLAN_REVIEW}} diff --git a/test/helpers/claude-pty-runner.ts b/test/helpers/claude-pty-runner.ts index 58a17b51..796d125d 100644 --- a/test/helpers/claude-pty-runner.ts +++ b/test/helpers/claude-pty-runner.ts @@ -454,6 +454,7 @@ export function optionsSignature( */ export type ClassifyResult = | { outcome: 'silent_write'; summary: string } + | { outcome: 'wrote_findings_before_asking'; summary: string } | { outcome: 'auto_decided'; summary: string } | { outcome: 'plan_ready'; summary: string } | { outcome: 'asked'; summary: string } @@ -467,16 +468,73 @@ const SANCTIONED_WRITE_SUBSTRINGS = [ 'TODOS.md', ]; -export function classifyVisible(visible: string): ClassifyResult { +/** + * Find the position of the first AskUserQuestion-style numbered-option list + * that is NOT a permission dialog. Returns -1 if none has rendered yet. + * + * Used by the strict-plan-writes detector (D4) to distinguish legitimate + * post-AUQ plan writes from the transcript bug ("write findings to plan + * before asking"). + */ +function findFirstAuqRenderIndex(visible: string): number { + const re = /❯\s*1\./g; + let m: RegExpExecArray | null; + while ((m = re.exec(visible)) !== null) { + // 200 bytes back + TAIL_SCAN_BYTES forward gives enough context for + // isPermissionDialogVisible to recognize the typical permission UI. + const surroundStart = Math.max(0, m.index - 200); + const surroundEnd = Math.min(visible.length, m.index + TAIL_SCAN_BYTES); + const surround = visible.slice(surroundStart, surroundEnd); + if (!isPermissionDialogVisible(surround)) { + return m.index; + } + } + return -1; +} + +export function classifyVisible( + visible: string, + opts?: { + /** + * When true, treat Write/Edit to `.claude/plans/*` BEFORE any + * AskUserQuestion render as `wrote_findings_before_asking` rather than + * letting the sanctioned-write list silently approve it. Used by tests + * that seed a draft plan with guaranteed-finding-triggering complexity + * (D3-B), where a pre-AUQ plan write is the precise transcript bug. + * Default false — preserves existing behavior for unseeded smoke tests + * where zero-findings → write plan → plan_ready is legitimate. + */ + strictPlanWrites?: boolean; + }, +): ClassifyResult { // Silent-write detection: any Write/Edit tool render that targets a path // OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen // (a numbered prompt means a permission/AskUserQuestion is gating the write, // not an actual silent write). const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g; let m: RegExpExecArray | null; + const auqRenderIdx = opts?.strictPlanWrites ? findFirstAuqRenderIndex(visible) : -1; while ((m = writeRe.exec(visible)) !== null) { const target = m[1] ?? ''; + const writePos = m.index; + const isPlanWrite = target.includes('.claude/plans'); const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s)); + + // D4-B: when strictPlanWrites is on, plan writes that precede the first + // AUQ render are flagged. Legitimate end-of-workflow plan writes happen + // AFTER an AUQ has rendered (i.e., the user has been asked). The + // transcript bug is a plan write WITHOUT any AUQ render preceding it. + if (opts?.strictPlanWrites && isPlanWrite) { + if (auqRenderIdx < 0 || writePos < auqRenderIdx) { + return { + outcome: 'wrote_findings_before_asking', + summary: `Write/Edit to ${target} fired before any AskUserQuestion render`, + }; + } + // post-AUQ plan write — legitimate, fall through to other writes + continue; + } + if (!sanctioned && !isNumberedOptionListVisible(visible)) { return { outcome: 'silent_write', @@ -712,6 +770,36 @@ export function assertReviewReportAtBottom( return { ok: true }; } +/** + * Test helper: if `obs.planFile` was set, read it and assert + * `## GSTACK REVIEW REPORT` is the last `## ` section. Throws on + * violation with a diagnostic message including the plan path, + * the reason, any trailing headings, and the last 2KB of TTY output. + * + * Used by the four plan-mode E2E tests + * (skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts) to enforce + * the {{PLAN_FILE_REVIEW_REPORT}} resolver contract uniformly. Gates on + * `obs.planFile` (artifact existing), not on `obs.outcome === 'plan_ready'`, + * so it also catches the report-missing case under `'asked'` / + * `'wrote_findings_before_asking'` when a plan was already written. + */ +export function assertReportAtBottomIfPlanWritten( + obs: { planFile?: string; evidence: string }, +): void { + if (!obs.planFile) return; + const content = fs.readFileSync(obs.planFile, 'utf-8'); + const verdict = assertReviewReportAtBottom(content); + if (!verdict.ok) { + const trailing = verdict.trailingHeadings?.length + ? `\ntrailing headings: ${verdict.trailingHeadings.join(', ')}` + : ''; + throw new Error( + `GSTACK REVIEW REPORT contract violation in ${obs.planFile}: ${verdict.reason}${trailing}\n` + + `--- evidence (last 2KB) ---\n${obs.evidence}`, + ); + } +} + /** * Per-skill Step-0 boundary predicates. Each fires `true` when the answered * AUQ's fingerprint matches the LAST question of that skill's Step 0 phase. @@ -1085,6 +1173,20 @@ export async function runPlanSkillObservation(opts: { * rendered AskUserQuestion list). */ env?: Record; + /** + * Seed an initial plan that the spawned `claude` process operates on. + * STOP-gate regression tests need a plan with guaranteed-finding-triggering + * complexity (8+ files, custom-vs-builtin smell) so the skill MUST emit + * AskUserQuestion or fall back to a Decisions section. Without this, + * plan-mode creates a fresh empty plan and the skill has nothing to find + * issues with. + * + * Implementation: claude has no `--plan-file` flag (verified via + * `claude --help`). We pre-pump a user message containing the draft + * plan, wait for it to register, then invoke the skill. The skill's + * Step 0 reads the prior conversation context so it sees the draft. + */ + initialPlanContent?: string; }): Promise { const startedAt = Date.now(); const session = await launchClaudePty({ @@ -1098,6 +1200,21 @@ export async function runPlanSkillObservation(opts: { try { // Boot grace + trust-dialog auto-handle. await Bun.sleep(8000); + if (opts.initialPlanContent) { + // Pre-pump the draft as a user message so the skill's Step 0 has + // concrete content to scope-challenge. The trailing `\r` submits + // the message; embedded `\n` are preserved as line breaks within + // the message (claude-code uses Enter to send, Shift+Enter for + // newlines, but raw `\r` from a PTY just submits whatever's in + // the input buffer). + const seed = `Please review the following draft plan when I run the skill below:\n\n${opts.initialPlanContent}`; + session.send(`${seed}\r`); + // Wait for the seed message to render before sending the skill + // command. Without this gap the two messages can fuse and the + // skill name becomes part of the user prompt instead of a slash + // command. + await Bun.sleep(3000); + } const since = session.mark(); session.send(`/${opts.skillName}\r`); @@ -1123,20 +1240,24 @@ export async function runPlanSkillObservation(opts: { elapsedMs: Date.now() - startedAt, }; } - const classified = classifyVisible(visible); + const classified = classifyVisible(visible, { + strictPlanWrites: !!opts.initialPlanContent, + }); if (classified) { const obs: PlanSkillObservation = { ...classified, evidence: visible.slice(-2000), elapsedMs: Date.now() - startedAt, }; - // For plan_ready outcomes, capture the plan file path from the full - // visible buffer — tests under --disallowedTools verify the file's - // contents to distinguish legitimate fallback flow from silent-skip. - if (classified.outcome === 'plan_ready') { - const planFile = extractPlanFilePath(visible); - if (planFile) obs.planFile = planFile; - } + // Capture the plan file path on any outcome where one may have been + // written. Gating only on 'plan_ready' missed two cases: (1) the + // 'asked' outcome where the model wrote a plan partway through then + // paused on a question, and (2) 'wrote_findings_before_asking' where + // the bug is precisely that the plan was written. The + // assertReviewReportAtBottom checks downstream gate on planFile + // existing, not on the outcome. + const planFile = extractPlanFilePath(visible); + if (planFile) obs.planFile = planFile; return obs; } } diff --git a/test/helpers/claude-pty-runner.unit.test.ts b/test/helpers/claude-pty-runner.unit.test.ts index e830d730..dcb4f5ef 100644 --- a/test/helpers/claude-pty-runner.unit.test.ts +++ b/test/helpers/claude-pty-runner.unit.test.ts @@ -295,6 +295,78 @@ describe('classifyVisible (runtime path through the runner classifier)', () => { // recent-tail window would surface here. expect(TAIL_SCAN_BYTES).toBe(1500); }); + + // D4-B: strictPlanWrites detector. Catches the transcript bug where the + // model writes findings to the plan file before any AskUserQuestion fires. + test('strictPlanWrites: plan write before any AUQ → wrote_findings_before_asking', () => { + const visible = ` + ⏺ Edit(/Users/me/.claude/plans/some-plan.md) + ⎿ Updated 12 lines + `; + const result = classifyVisible(visible, { strictPlanWrites: true }); + expect(result?.outcome).toBe('wrote_findings_before_asking'); + expect(result?.summary).toContain('.claude/plans/some-plan.md'); + }); + + test('strictPlanWrites: plan write AFTER an AUQ render → not flagged', () => { + // AUQ renders first, then the model writes the plan post-answer. This is + // the legitimate end-of-workflow flow and must NOT trigger the detector. + const visible = ` + D1 — Some scope question + + ❯ 1. Option A + 2. Option B + + ⏺ Edit(/Users/me/.claude/plans/some-plan.md) + ⎿ Updated 12 lines + `; + const result = classifyVisible(visible, { strictPlanWrites: true }); + // Outcome is 'asked' (the numbered list rendered); the post-AUQ plan + // write is ignored by the detector. + expect(result?.outcome).toBe('asked'); + }); + + test('strictPlanWrites: AUQ first then plan write — write_pos > auq_pos → not flagged', () => { + // Same scenario, more explicit ordering: the regex finds the write at a + // position AFTER the numbered list. Detector lets it through. + const visible = [ + 'D1 — Choose your approach', + '', + '❯ 1. Approach A', + ' 2. Approach B', + '', + '⏺ Write(/Users/me/.claude/plans/draft.md)', + '⎿ Wrote 42 lines', + ].join('\n'); + const result = classifyVisible(visible, { strictPlanWrites: true }); + expect(result?.outcome).toBe('asked'); + }); + + test('strictPlanWrites: only a permission dialog visible → plan write still flagged', () => { + // A permission dialog ❯ 1./2. is NOT an AUQ; pre-AUQ plan writes still + // hit the detector even when a permission prompt is on screen. + const visible = ` + ⏺ Edit(/Users/me/.claude/plans/some-plan.md) + + Edit to /Users/me/.claude/plans/some-plan.md + + Do you want to proceed? + + ❯ 1. Yes + 2. No + `; + const result = classifyVisible(visible, { strictPlanWrites: true }); + expect(result?.outcome).toBe('wrote_findings_before_asking'); + }); + + test('strictPlanWrites OFF: plan write before AUQ → returns null (legacy behavior preserved)', () => { + const visible = ` + ⏺ Edit(/Users/me/.claude/plans/some-plan.md) + ⎿ Updated 12 lines + `; + // Without strictPlanWrites, the sanctioned-path list lets this through. + expect(classifyVisible(visible)).toBeNull(); + }); }); describe('parseNumberedOptions', () => { diff --git a/test/helpers/touchfiles.ts b/test/helpers/touchfiles.ts index dbe1060d..42ce4027 100644 --- a/test/helpers/touchfiles.ts +++ b/test/helpers/touchfiles.ts @@ -89,10 +89,10 @@ export const E2E_TOUCHFILES: Record = { // include question-tuning.ts and generate-ask-user-format.ts because the // AUTO_DECIDE preamble injection lives there and changes can flip the // regression test outcome between 'asked' and 'auto_decided'. - 'plan-ceo-review-plan-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'], - 'plan-eng-review-plan-mode': ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'], - 'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'], - 'plan-devex-review-plan-mode': ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'], + 'plan-ceo-review-plan-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'], + 'plan-eng-review-plan-mode': ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'], + 'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'], + 'plan-devex-review-plan-mode': ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'], 'plan-mode-no-op': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'], // v1.21+ AskUserQuestion-blocked regression tests — Conductor launches @@ -240,7 +240,6 @@ export const E2E_TOUCHFILES: Record = { 'design-consultation-existing': ['design-consultation/**', 'scripts/gen-skill-docs.ts'], 'design-consultation-research': ['design-consultation/**', 'scripts/gen-skill-docs.ts'], 'design-consultation-preview': ['design-consultation/**', 'scripts/gen-skill-docs.ts'], - 'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/gen-skill-docs.ts'], 'plan-design-review-no-ui-scope': ['plan-design-review/**', 'scripts/gen-skill-docs.ts'], 'design-review-fix': ['design-review/**', 'browse/src/**', 'scripts/gen-skill-docs.ts'], @@ -521,7 +520,6 @@ export const E2E_TIERS: Record = { 'design-consultation-existing': 'periodic', 'design-consultation-research': 'gate', 'design-consultation-preview': 'gate', - 'plan-design-review-plan-mode': 'periodic', 'plan-design-review-no-ui-scope': 'gate', 'design-review-fix': 'periodic', 'design-shotgun-path': 'gate', diff --git a/test/skill-e2e-plan-ceo-plan-mode.test.ts b/test/skill-e2e-plan-ceo-plan-mode.test.ts index 8ee1efdb..a99084e1 100644 --- a/test/skill-e2e-plan-ceo-plan-mode.test.ts +++ b/test/skill-e2e-plan-ceo-plan-mode.test.ts @@ -34,7 +34,11 @@ */ import { describe, test, expect } from 'bun:test'; -import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner'; +import { + runPlanSkillObservation, + planFileHasDecisionsSection, + assertReportAtBottomIfPlanWritten, +} from './helpers/claude-pty-runner'; const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; const describeE2E = shouldRun ? describe : describe.skip; @@ -68,6 +72,7 @@ describeE2E('plan-ceo-review plan-mode smoke (gate)', () => { `--- evidence (last 2KB visible) ---\n${obs.evidence}`, ); } + assertReportAtBottomIfPlanWritten(obs); }, 360_000); // v1.21+ regression: Conductor launches Claude Code with @@ -129,5 +134,6 @@ describeE2E('plan-ceo-review plan-mode smoke (gate)', () => { } } expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); }, 360_000); }); diff --git a/test/skill-e2e-plan-design-plan-mode.test.ts b/test/skill-e2e-plan-design-plan-mode.test.ts index 0f2bd69a..da3d591a 100644 --- a/test/skill-e2e-plan-design-plan-mode.test.ts +++ b/test/skill-e2e-plan-design-plan-mode.test.ts @@ -10,7 +10,10 @@ */ import { describe, test, expect } from 'bun:test'; -import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner'; +import { + runPlanSkillObservation, + assertReportAtBottomIfPlanWritten, +} from './helpers/claude-pty-runner'; const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; const describeE2E = shouldRun ? describe : describe.skip; @@ -32,6 +35,7 @@ describeE2E('plan-design-review plan-mode smoke (gate)', () => { ); } expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); }, 360_000); // v1.21+ regression: see skill-e2e-plan-ceo-plan-mode.test.ts for the @@ -68,5 +72,6 @@ describeE2E('plan-design-review plan-mode smoke (gate)', () => { // Other plan-mode skills require the decisions section under // --disallowedTools; design is the special case. expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); }, 360_000); }); diff --git a/test/skill-e2e-plan-devex-plan-mode.test.ts b/test/skill-e2e-plan-devex-plan-mode.test.ts index 5ecad5aa..959efce0 100644 --- a/test/skill-e2e-plan-devex-plan-mode.test.ts +++ b/test/skill-e2e-plan-devex-plan-mode.test.ts @@ -6,7 +6,11 @@ */ import { describe, test, expect } from 'bun:test'; -import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner'; +import { + runPlanSkillObservation, + planFileHasDecisionsSection, + assertReportAtBottomIfPlanWritten, +} from './helpers/claude-pty-runner'; const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; const describeE2E = shouldRun ? describe : describe.skip; @@ -28,6 +32,7 @@ describeE2E('plan-devex-review plan-mode smoke (gate)', () => { ); } expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); }, 360_000); // v1.21+ regression: see skill-e2e-plan-ceo-plan-mode.test.ts for the @@ -64,5 +69,6 @@ describeE2E('plan-devex-review plan-mode smoke (gate)', () => { } } expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); }, 360_000); }); diff --git a/test/skill-e2e-plan-eng-plan-mode.test.ts b/test/skill-e2e-plan-eng-plan-mode.test.ts index d4a635e2..3324022b 100644 --- a/test/skill-e2e-plan-eng-plan-mode.test.ts +++ b/test/skill-e2e-plan-eng-plan-mode.test.ts @@ -6,11 +6,45 @@ */ import { describe, test, expect } from 'bun:test'; -import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner'; +import { + runPlanSkillObservation, + planFileHasDecisionsSection, + assertReportAtBottomIfPlanWritten, +} from './helpers/claude-pty-runner'; const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; const describeE2E = shouldRun ? describe : describe.skip; +// SEED_PLAN_FORCING_FINDINGS: 8+ files + custom-vs-builtin smell forces the +// Step 0 complexity check to trigger. Passed via runPlanSkillObservation's +// initialPlanContent (D3-B) so the spawned `claude` actually sees it. +const SEED_PLAN_FORCING_FINDINGS = ` +# Parallelize unit tests + +## Plan +Build a custom test runner: scripts/test-parallel.ts, scripts/test-shard-impl.ts, +scripts/test-merge-results.ts, scripts/test-progress.ts, scripts/test-watch.ts, +scripts/test-coverage.ts, scripts/test-cli.ts, scripts/test-config.ts. + +Add new TestRunner class, new ShardManager class, new ResultMerger class. + +Ignore Bun's native --shard flag because we want full control. + +## Files +- scripts/test-parallel.ts (new) +- scripts/test-shard-impl.ts (new) +- scripts/test-merge-results.ts (new) +- scripts/test-progress.ts (new) +- scripts/test-watch.ts (new) +- scripts/test-coverage.ts (new) +- scripts/test-cli.ts (new) +- scripts/test-config.ts (new) +- package.json (add scripts) + +## Tests +None planned — will add later. +`; + describeE2E('plan-eng-review plan-mode smoke (gate)', () => { test('reaches a terminal outcome (asked or plan_ready) without silent writes', async () => { const obs = await runPlanSkillObservation({ @@ -28,6 +62,7 @@ describeE2E('plan-eng-review plan-mode smoke (gate)', () => { ); } expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); }, 360_000); // v1.21+ regression: see skill-e2e-plan-ceo-plan-mode.test.ts for the @@ -64,5 +99,51 @@ describeE2E('plan-eng-review plan-mode smoke (gate)', () => { } } expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); + }, 360_000); + + // D3-B / D4-B: when a plan with guaranteed-finding-triggering complexity + // is seeded, the skill MUST fire AskUserQuestion (or fall back to a + // Decisions section) before writing findings to the plan. The + // wrote_findings_before_asking outcome catches the precise transcript bug + // — model writes findings to the plan before any AUQ render. + test('STOP gate fires when seeded plan forces Step 0 findings', async () => { + const obs = await runPlanSkillObservation({ + skillName: 'plan-eng-review', + inPlanMode: true, + initialPlanContent: SEED_PLAN_FORCING_FINDINGS, + // Force the Conductor-style path: native AUQ disallowed → the model + // must use mcp__*__AskUserQuestion (outcome='asked') or fall back to + // writing Decisions ('plan_ready'). + extraArgs: ['--disallowedTools', 'AskUserQuestion'], + timeoutMs: 300_000, + }); + + if ( + obs.outcome === 'wrote_findings_before_asking' || + obs.outcome === 'auto_decided' || + obs.outcome === 'silent_write' || + obs.outcome === 'exited' || + obs.outcome === 'timeout' + ) { + throw new Error( + `STOP-gate regression: outcome=${obs.outcome}\nsummary: ${obs.summary}\n` + + `elapsed: ${obs.elapsedMs}ms\n` + + `--- evidence (last 2KB) ---\n${obs.evidence}`, + ); + } + + if (obs.outcome === 'plan_ready') { + if (!obs.planFile || !planFileHasDecisionsSection(obs.planFile)) { + throw new Error( + `STOP-gate regression: plan_ready without ## Decisions section in ` + + `${obs.planFile ?? ''} — gate skipped after ToolSearch.\n` + + `--- evidence (last 2KB) ---\n${obs.evidence}`, + ); + } + } + + expect(['asked', 'plan_ready']).toContain(obs.outcome); + assertReportAtBottomIfPlanWritten(obs); }, 360_000); });