diff --git a/CHANGELOG.md b/CHANGELOG.md index 44d662ee..176c1b81 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,44 +1,45 @@ # Changelog -## [1.21.0.0] - 2026-04-30 +## [1.22.0.0] - 2026-05-01 ## **Plan-mode skills surface every decision again, even when the host disallows AskUserQuestion.** -Conductor launches Claude Code with `--disallowedTools AskUserQuestion --permission-mode default --permission-prompt-tool stdio` (verified by inspecting the live conductor claude process via `ps`). The native AskUserQuestion tool is removed from the model's tool registry, so when a plan-mode skill instructs the model to "call AskUserQuestion," the call silently fails: the model can't ask, the user never sees the question, and the skill auto-proceeds without input. The whole interactive premise of `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review`, `/autoplan`, and `/office-hours` was broken in any session that used these flags. +Conductor launches Claude Code with `--disallowedTools AskUserQuestion --permission-mode default --permission-prompt-tool stdio` (verified by inspecting the live conductor claude process via `ps`). The native AskUserQuestion tool is removed from the model's tool registry, so when a plan-mode skill instructs the model to "call AskUserQuestion," the call silently fails: the model can't ask, the user never sees the question, and the skill auto-proceeds without input. The whole interactive premise of `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review`, `/autoplan`, and `/office-hours` was broken in any Conductor session. The fix is preamble guidance, not skill-template surgery. A new `Tool resolution` section in `scripts/resolvers/preamble/generate-ask-user-format.ts` tells the model to check its tool list and prefer any `mcp__*__AskUserQuestion` variant (e.g. `mcp__conductor__AskUserQuestion`) over the native tool. Hosts that disable native AskUserQuestion register their own MCP variant; the variant takes the same questions/options shape and the host renders the prompt through its own UI surface. If neither variant is callable, the model falls back to writing a `## Decisions to confirm` section into the plan file and calling ExitPlanMode — plan-mode's native "Ready to execute?" confirmation surfaces the decisions through TTY UI. **Never silently auto-decide.** -Six new gate-tier real-PTY regression tests reproduce the exact Conductor flag set (`extraArgs: ['--disallowedTools', 'AskUserQuestion']`) for every plan-mode skill, plus a periodic-tier eval that protects the legitimate `/plan-tune` AUTO_DECIDE opt-in path from being broken by the fix. +Six gate-tier real-PTY regression tests reproduce the exact Conductor flag set (`extraArgs: ['--disallowedTools', 'AskUserQuestion']`) for every plan-mode skill, plus a periodic-tier eval that protects the legitimate `/plan-tune` AUTO_DECIDE opt-in path from being broken by the fix. The harness gains a new `'auto_decided'` outcome and whitespace-tolerant detectors that survive TTY cursor-positioning escape sequences (which `stripAnsi` removes without leaving spaces, collapsing "ready to execute" to "readytoexecute"). ### What you can now do - **Use plan-mode review skills in Conductor.** Open a Conductor workspace, run `/plan-ceo-review` against a plan, and the scope-mode question actually appears for you to answer. Same for `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review`, `/autoplan`'s premise gate, and `/office-hours`. -- **Stay in control under `--disallowedTools` without writing template overrides.** The Tool resolution section is in the gstack preamble that prepends every tier-≥2 skill; new hosts that disable native AUQ via the same pattern get the fix transparently as long as they register an MCP variant. +- **Stay in control under `--disallowedTools` without writing template overrides.** The Tool resolution section sits at preamble position 1 in every tier-≥2 skill; new hosts that disable native AUQ via the same pattern get the fix transparently as long as they register an MCP variant. - **Opt-in to AUTO_DECIDE without losing the regression guard.** `/plan-tune` users who set `never-ask` for specific questions keep auto-pick under Conductor flags; the periodic-tier `auto-decide-preserved` eval protects this path. ### The numbers that matter -Source: `ps -p -o args=` for the regression mechanism (verified primary source). 6 new gate-tier real-PTY regression cases plus 1 periodic-tier AUTO_DECIDE preserve eval; coverage in `test/skill-e2e-plan-{ceo,eng,design,devex}-plan-mode.test.ts` (parameterized inline) + `test/skill-e2e-{autoplan,office-hours}-auto-mode.test.ts` (standalone) + `test/skill-e2e-auto-decide-preserved.test.ts` (periodic). +Source: `ps -p -o args=` for the regression mechanism (verified primary source). 6 new gate-tier regression cases + 1 periodic-tier AUTO_DECIDE eval; coverage in `test/skill-e2e-plan-{ceo,eng,design,devex}-plan-mode.test.ts` (parameterized inline) + `test/skill-e2e-{autoplan,office-hours}-auto-mode.test.ts` (standalone) + `test/skill-e2e-auto-decide-preserved.test.ts` (periodic). | Surface | Shape | |---|---| | Skills that regain interactivity in Conductor | 6 (`/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review`, `/autoplan`, `/office-hours`) | -| New gate-tier regression test cases | 6 (one per skill above; `--disallowedTools AskUserQuestion` parameterized) | +| New gate-tier regression test cases | 6 (one per skill; `--disallowedTools AskUserQuestion` parameterized) | | New periodic-tier eval | 1 (`auto-decide-preserved`, protects `/plan-tune` opt-in path) | -| New `PlanSkillObservation` outcome | `auto_decided` — TTY shows "Auto-decided … (your preference)" | -| New `runPlanSkillObservation` parameter | `extraArgs?: string[]` — plumbs raw flags to the spawned `claude` | +| New `ClassifyResult` outcome | `auto_decided` — TTY shows "Auto-decided … (your preference)" | +| New `runPlanSkillObservation` parameter | `extraArgs?: string[]` — plumbs raw flags to spawned `claude` | | Preamble resolvers touched | 2 (`generate-ask-user-format.ts`, `generate-completion-status.ts`) | | SKILL.md files regenerated | 41 | -| Test detection order | `asked` > `auto_decided` > `plan_ready` (rendered list wins; auto-decide text wins over downstream plan-ready) | -| Mechanism verified by | `ps -p -o args=` showing `--disallowedTools AskUserQuestion --permission-mode default` | +| `classifyVisible` branch order | `silent_write` → `auto_decided` → `plan_ready` → `asked` (each more specific than the next) | +| Whitespace-tolerant detectors | `isPlanReadyVisible`, `isAutoDecidedVisible` (defeats stripAnsi cursor-positioning collapse) | +| Verified by | `ps -p -o args=` showing `--disallowedTools AskUserQuestion --permission-mode default` | ### What this means for builders -If you built a workflow on `/plan-ceo-review` or any plan-mode review skill and ran it in Conductor, this release is the difference between the skill silently producing a plan you didn't shape and the skill stopping for you to choose scope mode, accept or reject scope expansions, and gate every architectural decision the way the skill was designed to. The fix is in the preamble, so you don't update skill templates yourself — just upgrade gstack and the next plan review you run honors your input. +If you ran `/plan-ceo-review` or any plan-mode review skill in Conductor before this release, the skill silently produced a plan you didn't shape — the scope-mode question, expansion proposals, and per-section STOPs never reached you. After upgrading, the skill stops for every gate the template defines. The fix is in the preamble, so you don't update skill templates yourself — just upgrade gstack and the next plan review you run honors your input. -If you opted into auto-deciding specific questions via `/plan-tune`, the new periodic eval guards that path. The fix is "prefer MCP variant when registered," not "force every question to surface" — your `never-ask` preferences still auto-pick, the AUTO_DECIDE annotation still renders, nothing changes for opt-in users. +If you opted into auto-deciding specific questions via `/plan-tune`, the periodic eval guards that path. The fix is "prefer MCP variant when registered," not "force every question to surface" — your `never-ask` preferences still auto-pick, the AUTO_DECIDE annotation still renders, nothing changes for opt-in users. -The gstack-side regression test surface now mirrors what real users hit. Each existing plan-mode test file gained a second `test()` block that sets `extraArgs: ['--disallowedTools', 'AskUserQuestion']` and asserts the AskUserQuestion still surfaces. CI gate runtime adds roughly 18-30 minutes for the new cases (real-PTY tests are slow); the alternative was missing the regression entirely, which is exactly what happened until v1.21. +The gstack-side regression test surface now mirrors what real users hit. Each plan-mode test file gained a second `test()` block that sets `extraArgs: ['--disallowedTools', 'AskUserQuestion']` and asserts the AskUserQuestion still surfaces. Builds on v1.21.1.0's `classifyVisible()` extraction — the new auto-decided branch slots in cleanly between silent_write and plan_ready. ### Itemized changes @@ -47,28 +48,74 @@ The gstack-side regression test surface now mirrors what real users hit. Each ex - `scripts/resolvers/preamble/generate-ask-user-format.ts` gets a new `### Tool resolution (read first)` section at the top of the AskUserQuestion Format block. Tells the model: AskUserQuestion can resolve to two tools at runtime (host MCP variant or native); prefer any `mcp__*__AskUserQuestion` variant in the tool list over native; hosts may disable native via `--disallowedTools AskUserQuestion` (Conductor does this by default); same questions/options shape and decision-brief format applies to the MCP variant. Includes a fallback path when neither variant is callable: write the decision into the plan file as `## Decisions to confirm` + ExitPlanMode. - `scripts/resolvers/preamble/generate-completion-status.ts` (the plan-mode-info block at preamble position 1) updated to point at the Tool resolution section: AskUserQuestion satisfies plan mode's end-of-turn requirement for "any variant," with the plan-file fallback for the no-variant case. -#### Added — gate-tier regression tests +#### Added — regression tests -- 4 inline `test()` blocks added to `test/skill-e2e-plan-{ceo,eng,design,devex}-plan-mode.test.ts`. Each spawns claude with `extraArgs: ['--disallowedTools', 'AskUserQuestion']` and asserts the skill still reaches `outcome === 'asked'` (or for plan-design-review's no-UI-scope short-circuit, the looser envelope `['asked', 'plan_ready']` plus an explicit fail on `'auto_decided'`). +- 4 inline `test()` blocks added to `test/skill-e2e-plan-{ceo,eng,design,devex}-plan-mode.test.ts`. Each spawns claude with `extraArgs: ['--disallowedTools', 'AskUserQuestion']` and asserts the skill still surfaces the question — pass envelope `['asked', 'plan_ready']` (the latter covers the plan-file fallback flow), failure signals are `'auto_decided'` (caught explicitly) plus the standard silent_write/exited/timeout. - `test/skill-e2e-autoplan-auto-mode.test.ts` (new). Asserts autoplan's first non-auto-decided gate (Phase 1 premise confirmation) still surfaces. Autoplan auto-decides intermediate questions BY DESIGN, so the test scopes to gates the user MUST see. - `test/skill-e2e-office-hours-auto-mode.test.ts` (new). Asserts office-hours' startup-vs-builder mode AskUserQuestion still surfaces. - `test/skill-e2e-auto-decide-preserved.test.ts` (new, periodic-tier). Sets up an isolated `GSTACK_HOME` tmpdir, writes `question_tuning=true` + a `never-ask` preference for `plan-ceo-review-mode` (source `'plan-tune'`), runs `/plan-ceo-review` under `--disallowedTools AskUserQuestion`, asserts outcome is NOT `'asked'` (the model honored the opt-in). #### Changed — PTY harness -- `test/helpers/claude-pty-runner.ts`: `runPlanSkillObservation` accepts a new optional `extraArgs?: string[]` parameter that plumbs straight through to `launchClaudePty` (which already supported the field at the lower level). New outcome `'auto_decided'` in `PlanSkillObservation` plus `isAutoDecidedVisible(visible)` detector that matches the AUTO_DECIDE preamble template (`Auto-decided … (your preference)`). Detection order in the observation loop is `asked` (rendered numbered list) > `auto_decided` (text only, no list) > `plan_ready`, so an upstream auto-decide isn't masked by a downstream plan-mode confirmation. +- `test/helpers/claude-pty-runner.ts`: `runPlanSkillObservation` accepts new optional `extraArgs?: string[]` (plumbs straight through to `launchClaudePty`, which already supported the field). `ClassifyResult` gains `'auto_decided'` outcome plus `isAutoDecidedVisible(visible)` detector that matches the AUTO_DECIDE preamble template (`Auto-decided … (your preference)`). `classifyVisible` branch order extended to `silent_write → auto_decided → plan_ready → asked` so an upstream auto-decide isn't masked by a downstream plan-mode confirmation. +- Whitespace-tolerant detection: `isPlanReadyVisible` and `isAutoDecidedVisible` now test both spaced and whitespace-collapsed forms of their target phrases. `stripAnsi` removes cursor-positioning escapes (`\x1b[40C`) without replacing them with spaces, so "ready to execute" can come through as "readytoexecute" — the spaced regex would miss it. #### Changed — touchfiles - `test/helpers/touchfiles.ts`: existing `plan-X-review-plan-mode` entries gain `scripts/resolvers/question-tuning.ts` and `scripts/resolvers/preamble/generate-ask-user-format.ts` as touchfile dependencies, so AUTO_DECIDE-bearing resolver changes correctly invalidate the regression cases. - New entries: `autoplan-auto-mode` (gate), `office-hours-auto-mode` (gate), `auto-decide-preserved` (periodic). -- `test/touchfiles.test.ts`: count of tests selected by `plan-ceo-review/SKILL.md` updates from 18 to 20 to cover the two new entries that depend on `plan-ceo-review/**`. +- `test/touchfiles.test.ts`: count of tests selected by `plan-ceo-review/SKILL.md` updates from 19 to 21 to cover the new entries that depend on `plan-ceo-review/**`. #### For contributors -- The PTY harness's new `auto_decided` outcome is a defense-in-depth signal: it fires on the AUTO_DECIDE preamble template wording, which is non-deterministic. Treat it as evidence of a regression, not a hard contract. The strict regression assertion across most plan-mode skills is `outcome === 'asked'`; `auto_decided` and `silent_write`/`exited`/`timeout` are the FAIL paths. -- The Tool resolution section is the surgical fix site for any future host that disables native AUQ in a similar way (cursor, factory, etc.). The pattern: register a `mcp____AskUserQuestion` MCP tool; the gstack preamble already tells the model to prefer it. No skill-template changes needed per-host. -- `auto-decide-preserved` runs in an isolated `GSTACK_HOME` tmpdir to avoid mutating the developer's real `~/.gstack` state. When debugging this test, set `GSTACK_HOME` manually to a scratch dir and run the same setup steps the test does (`gstack-config set question_tuning true` then `gstack-question-preference --write`). +- The PTY harness's `auto_decided` outcome is a defense-in-depth signal: it fires on the AUTO_DECIDE preamble template wording, which is non-deterministic. Treat it as evidence of a regression, not a hard contract. +- The Tool resolution section is the surgical fix site for any future host that disables native AUQ similarly. The pattern: register a `mcp____AskUserQuestion` MCP tool; the gstack preamble already tells the model to prefer it. No skill-template changes needed per-host. +- `auto-decide-preserved` runs in an isolated `GSTACK_HOME` tmpdir to avoid mutating the developer's real `~/.gstack` state. When debugging, set `GSTACK_HOME` manually to a scratch dir and run the same setup the test does (`gstack-config set question_tuning true`, then `gstack-question-preference --write`). + +## [1.21.1.0] - 2026-04-28 + +## **plan-ceo-review smoke tightens. The "agent skips Step 0 and ships a plan" regression now fails the gate.** + +The v1.15.0.0 real-PTY harness shipped with a smoke that accepted either `'asked'` or `'plan_ready'` as success. That OR was too lax for `/plan-ceo-review` specifically: the skill template mandates Step 0A premise challenge plus Step 0F mode selection BEFORE any plan write, so reaching `plan_ready` first IS the regression. This release tightens the assertion to `'asked'` only for that smoke, and refactors the runner so the contract is testable in <1s instead of $0.50 of stochastic PTY. + +### The numbers that matter + +Numbers come from `git diff --shortstat origin/main..HEAD` and `bun test test/helpers/claude-pty-runner.unit.test.ts` on a clean tree. + +| Metric | Δ | +|---|---| +| Net branch size vs main | +162 / −65 lines (3 files) | +| New unit tests added | **+24** (claude-pty-runner.unit.test.ts) | +| Unit suite runtime | **14ms** (deterministic, free-tier) | +| Real-PTY gate runs verified | **4 clean PTY runs** (3 lock-in + 1 post-refactor) | +| Outcome assertions covered | **5/5** (was 3/5; `plan_ready` is now FAIL for plan-ceo) | +| Reviewers run on this PR | plan-eng-review (CLEARED) + codex consult + 2 specialists + adversarial | + +### What this means for builders + +Three new classes of harness regression are now caught deterministically in the free tier instead of waiting on a $0.50 stochastic PTY run. The classifier is extracted into a pure `classifyVisible()` function so reordering branches in the polling loop fails the unit tests instead of silently shipping. Permission dialogs (which render numbered lists) are filtered out of the `'asked'` classification so a permission prompt cannot pose as a Step 0 skill question. The bare phrase `Do you want to proceed?` no longer triggers permission detection on its own — it now requires a file-edit context co-trigger, so a skill question that contains the phrase isn't mis-classified. + +For `/plan-ceo-review` specifically: any future preamble slim-down or template edit that lets the agent skip Step 0 and write a plan will fail the gate before the PR ships. Pull, run `bun test`, and the harness layer is provably tighter without you having to spend a token. + +### Itemized changes + +#### Added + +- `test/helpers/claude-pty-runner.unit.test.ts`: 24 deterministic tests covering `isPermissionDialogVisible` (with the new co-trigger contract), `isNumberedOptionListVisible`, `parseNumberedOptions`, and the new `classifyVisible()` runtime path. Free-tier, runs on every `bun test`. +- `classifyVisible(visible)` in `claude-pty-runner.ts`: pure classifier extracted from the polling loop. Returns `{ outcome, summary } | null`. Branch order: silent_write → plan_ready → asked → null (with permission-dialog filter). Live-state branches (process exited, "Unknown command") stay in the runner. +- `TAIL_SCAN_BYTES = 1500` exported constant. Shared between `runPlanSkillObservation` and the routing test's nav loop so tuning stays in sync. +- `env?: Record` option on `runPlanSkillObservation`, threaded to `launchClaudePty`. Plumbing for future env-driven test isolation (gstack-config does not yet honor env overrides; tracked as post-merge follow-up). + +#### Changed + +- `test/skill-e2e-plan-ceo-plan-mode.test.ts`: assertion narrowed from `['asked', 'plan_ready']` to `'asked'` only. Failure message now branches on `outcome` (plan_ready vs timeout vs silent_write) with a tailored diagnosis line, and references skill-template section names instead of line numbers (durable to template edits). +- `isPermissionDialogVisible`: bare `Do you want to proceed?` now requires a file-edit context co-trigger (`Edit to ` or `Write to `). Other clauses (`requested permissions to`, `allow all edits`, `always allow access to`, `Bash command requires permission`) remain unconditional. +- `test/skill-e2e-plan-ceo-mode-routing.test.ts`: replaces the local `1500` magic number with the shared `TAIL_SCAN_BYTES` constant. + +#### For contributors + +- The runner change is additive and the existing sibling smokes (`plan-eng`, `plan-design`, `plan-devex`, `plan-mode-no-op`) keep their loose `['asked', 'plan_ready']` assertion. Their behavior is unchanged. +- Post-merge follow-ups captured in `TODOS.md`: per-finding AskUserQuestion count assertion (V2), env-driven gstack-config overrides (so `QUESTION_TUNING=false` actually isolates the test), path-confusion hardening on `SANCTIONED_WRITE_SUBSTRINGS`. ## [1.20.0.0] - 2026-04-28 diff --git a/TODOS.md b/TODOS.md index 953d2872..0a68ac05 100644 --- a/TODOS.md +++ b/TODOS.md @@ -213,6 +213,56 @@ scope of that PR; deliberately deferred to keep PTY-import small. ## Testing +## P2: Per-finding AskUserQuestion count assertion for /plan-ceo-review + +**What:** PTY E2E test that drives /plan-ceo-review through Step 0 with a stable fixture diff containing N known findings, asserts that exactly N distinct AskUserQuestions fire (one per finding) before plan_ready. + +**Why:** The skill template repeats "One issue = one AskUserQuestion call. Never combine multiple issues into one question." at every review checkpoint. No test enforces it. The current `skill-e2e-plan-ceo-plan-mode.test.ts` smoke (post-v1.21.1.0) only catches "agent skipped Step 0 entirely." Batching findings into one question slips through silently. + +**Pros:** Locks in the strongest contract the skill mandates. Catches a real failure mode (the original attachment showed 2 findings batched as 0 questions). +**Cons:** Needs a stable fixture diff to keep finding count deterministic (~1 day human / ~30 min CC). Opus may reasonably consolidate two related findings, so the assertion needs a forgiving lower bound (e.g., `>= ceil(N * 0.6)`) rather than strict equality. + +**Context:** The PTY harness (`runPlanSkillObservation`) returns at first terminal outcome — for V2 we need a streaming variant that counts AskUserQuestions across the whole session up to `plan_ready`. Probably a new helper alongside `runPlanSkillObservation`. + +**Depends on:** Stable fixture diff (`test/fixtures/plans/multi-finding.diff` or similar) with a small known set of issues that triggers all 4 review sections. + +**Priority:** P2. +**Effort:** S (CC: ~30 min once fixture exists). Captured from v1.21.1.0 plan-eng-review D2. + +--- + +## P3: Honor env vars in gstack-config (so QUESTION_TUNING/EXPLAIN_LEVEL actually isolate tests) + +**What:** `gstack-config get ` reads `~/.gstack/config.yaml`. `runPlanSkillObservation` plumbs `env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }` through to the spawned `claude` process — but the skill preamble bash uses `gstack-config get question_tuning`, which never looks at env. The env passthrough is theater on current code. + +**Why:** Without env honoring, the v1.21.1.0 plan-ceo-review smoke is still flaky on machines with `question_tuning: true` set in YAML. AUTO_DECIDE preferences would skip the rendered AskUserQuestion list, masking the regression we want to catch. + +**Pros:** Makes the gate test hermetic across machines. The env wiring is already in place — only `gstack-config` needs to read env first, fall back to YAML. +**Cons:** Touches the gstack-config binary across all 3 platforms (linux/darwin/windows). Cross-binary refactor. + +**Context:** Captured from v1.21.1.0 adversarial review. Documented honestly in the test docstring as a known limitation. + +**Priority:** P3. +**Effort:** S. Single-file edit to `bin/gstack-config` (~10 LOC for env-first lookup). + +--- + +## P3: Path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS + +**What:** `runPlanSkillObservation`'s silent-write detector uses substring matching on a few sanctioned paths (`.gstack/`, `CHANGELOG.md`, `TODOS.md`, etc). A write to `node_modules/some-pkg/CHANGELOG.md` or `src/foo/.gstack/leak.ts` is currently sanctioned because the substring matches anywhere in the path. + +**Why:** Defensive — no current bug exploits this, but a malicious skill or fixture could write to a path that happens to contain `.gstack/` or `CHANGELOG.md` and slip past silent-write detection. + +**Pros:** Hardens the harness against future skill misbehavior. Aligns substring rules with their intent. +**Cons:** Need to anchor against absolute prefixes (`os.homedir() + '/.gstack/'`, worktree root) which makes the test less portable across machines. + +**Context:** Captured from v1.21.1.0 adversarial review (HIGH/FIXABLE finding, pre-existing). Refactored into a `SANCTIONED_WRITE_SUBSTRINGS` constant in v1.21.1.0 but the substring-includes logic is unchanged from before. + +**Priority:** P3. +**Effort:** S. + +--- + ## P1: Structural STOP-Ask forcing function across all skills **What:** Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired " or "Write to "). + * The standalone permission signatures (`requested permissions to`, + * `allow all edits`, `always allow access to`, `Bash command requires permission`) + * remain unconditional. */ export function isPermissionDialogVisible(visible: string): boolean { - return ( - /requested\s+permissions?\s+to/i.test(visible) || - /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) || - // "Yes / Yes, allow all edits / No" shape rendered by Claude Code for - // file-edit permission grants. The middle option's "allow all" phrase - // is the unique signature. - /\ballow\s+all\s+edits\b/i.test(visible) || - // "Yes, and always allow access to " shape (workspace trust). - /always\s+allow\s+access\s+to/i.test(visible) || - // Bash command permission prompts. - /Bash\s+command\s+.*\s+requires\s+permission/i.test(visible) - ); + // Standalone signatures — high specificity, never appear in skill questions. + if (/requested\s+permissions?\s+to/i.test(visible)) return true; + // "Yes / Yes, allow all edits / No" shape — file-edit permission grants. + if (/\ballow\s+all\s+edits\b/i.test(visible)) return true; + // "Yes, and always allow access to " shape — workspace trust. + if (/always\s+allow\s+access\s+to/i.test(visible)) return true; + // Bash command permission prompts. + if (/Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)) return true; + // "Do you want to proceed?" only counts as a permission dialog when paired + // with a file-edit context. Skill questions can use the bare phrase. + if ( + /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) && + /(Edit|Write)\s+to\s+\S+/i.test(visible) + ) { + return true; + } + return false; } /** Detect any AskUserQuestion-shaped numbered option list with cursor. */ @@ -236,12 +259,14 @@ export function parseNumberedOptions( // this, parseNumberedOptions returns stale options after the dialog is // dismissed. const lines = tail.split('\n'); - // Anchor on the LAST `❯ 1.` line (cursor is on option 1 of the active - // AskUserQuestion). Greedy character classes don't help here — we need a literal - // `❯` after optional leading whitespace. + // Anchor on the LAST line containing `❯1.` ANYWHERE on the line. + // The /plan-*-review skill's box-layout AUQ uses TTY cursor-positioning + // escapes that stripAnsi removes — leaving the cursor `❯1.` mid-line, + // after dividers + header + prompt text on the same logical line. The + // earlier `^\s*❯` anchor missed those entirely. let cursorLineIdx = -1; for (let i = lines.length - 1; i >= 0; i--) { - if (/^\s*❯\s*1\./.test(lines[i] ?? '')) { + if (/❯\s*1\./.test(lines[i] ?? '')) { cursorLineIdx = i; break; } @@ -261,7 +286,37 @@ export function parseNumberedOptions( if (cursorLineIdx < 0) return []; const found: Array<{ index: number; label: string }> = []; const seenIndices = new Set(); - for (let i = cursorLineIdx; i < lines.length; i++) { + + // Cursor line: option 1 may be inline after box dividers + prompt header + // (`...divider...header...❯1. label`). Use a non-anchored regex that + // captures `❯N. label` from anywhere on the line through end-of-line. + // Only used for the cursor line — subsequent options are parsed with the + // start-of-line `optionRe`. + const cursorLine = lines[cursorLineIdx] ?? ''; + const cursorInlineRe = /❯\s*([1-9])\.\s*(\S.*?)\s*$/; + const inlineMatch = cursorInlineRe.exec(cursorLine); + if (inlineMatch) { + const idx = Number(inlineMatch[1]); + const label = (inlineMatch[2] ?? '').trim(); + if (label.length > 0 && !seenIndices.has(idx)) { + seenIndices.add(idx); + found.push({ index: idx, label }); + } + } else { + // No inline cursor match — fall back to start-of-line regex. + const startMatch = optionRe.exec(cursorLine); + if (startMatch) { + const idx = Number(startMatch[1]); + const label = (startMatch[2] ?? '').trim(); + if (label.length > 0 && !seenIndices.has(idx)) { + seenIndices.add(idx); + found.push({ index: idx, label }); + } + } + } + + // Subsequent lines: standard start-of-line option parsing. + for (let i = cursorLineIdx + 1; i < lines.length; i++) { const m = optionRe.exec(lines[i] ?? ''); if (!m) continue; const idx = Number(m[1]); @@ -286,6 +341,345 @@ export function parseNumberedOptions( return found; } +/** + * The four /plan-ceo-review modes. Used by `skill-e2e-plan-ceo-mode-routing` + * to detect Step 0F mode-selection AskUserQuestions, and by the upcoming + * finding-count tests as a Step-0 boundary signal: an AUQ whose options + * match this regex IS the mode pick (the last Step-0 question for plan-ceo). + * + * Lifted out of the mode-routing test so multiple PTY tests can share one + * source of truth — when /plan-ceo-review adds a fifth mode, one regex updates + * everywhere instead of drifting per-test. + */ +export const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i; + +/** + * Stable signature for a parsed numbered-option list — used by tests to detect + * "is this AUQ the same as the last poll, or has the agent advanced to a new + * one?" Joins each option as `${index}:${label}` after sorting by index. + * + * Defensive sort means the signature is order-independent at the input level, + * even though `parseNumberedOptions` already returns indices in ascending order. + */ +export function optionsSignature( + opts: Array<{ index: number; label: string }>, +): string { + return [...opts] + .sort((a, b) => a.index - b.index) + .map((o) => `${o.index}:${o.label}`) + .join('|'); +} + +/** + * Pure classifier for the visible TTY buffer. Decides which outcome the + * polling loop should return on this tick, or `null` to keep polling. + * + * Extracted from `runPlanSkillObservation` so the unit suite can exercise + * the actual branch order with synthetic input strings — a future contributor + * who reorders the branches (e.g., moves the permission short-circuit) gets + * caught by the unit tests, not by a stochastic E2E run. + * + * Live-state branches (process exited, "Unknown command") stay in the runner + * since they need the session handle. + */ +export type ClassifyResult = + | { outcome: 'silent_write'; summary: string } + | { outcome: 'auto_decided'; summary: string } + | { outcome: 'plan_ready'; summary: string } + | { outcome: 'asked'; summary: string } + | null; + +const SANCTIONED_WRITE_SUBSTRINGS = [ + '.claude/plans', + '.gstack/', + '/.context/', + 'CHANGELOG.md', + 'TODOS.md', +]; + +export function classifyVisible(visible: string): ClassifyResult { + // Silent-write detection: any Write/Edit tool render that targets a path + // OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen + // (a numbered prompt means a permission/AskUserQuestion is gating the write, + // not an actual silent write). + const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g; + let m: RegExpExecArray | null; + while ((m = writeRe.exec(visible)) !== null) { + const target = m[1] ?? ''; + const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s)); + if (!sanctioned && !isNumberedOptionListVisible(visible)) { + return { + outcome: 'silent_write', + summary: `Write/Edit to ${target} fired before any AskUserQuestion`, + }; + } + } + // 'auto_decided' must beat 'plan_ready': when AUTO_DECIDE fires upstream of + // plan-ready, both signals are visible by the time the polling loop checks. + // The annotation text is the more informative outcome — it explains WHY + // we got to plan_ready without surfacing the question. + if (isAutoDecidedVisible(visible)) { + return { + outcome: 'auto_decided', + summary: + 'skill auto-decided an AskUserQuestion via the AUTO_DECIDE preamble (the user never saw the prompt)', + }; + } + if (isPlanReadyVisible(visible)) { + return { + outcome: 'plan_ready', + summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation', + }; + } + if (isNumberedOptionListVisible(visible)) { + // Permission dialogs render numbered lists too. Skip them — the + // bug we want to catch is "skill question never fired." + if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) { + return null; + } + return { + outcome: 'asked', + summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)', + }; + } + return null; +} + +// ──────────────────────────────────────────────────────────────────────────── +// Per-finding AskUserQuestion count primitives (used by runPlanSkillCounting). +// +// These are pure helpers extracted up-front so the unit suite can exercise +// them deterministically before the live-PTY counter runs them. Each one is +// independently unit-testable against synthetic visible-buffer strings. +// ──────────────────────────────────────────────────────────────────────────── + +/** + * Captured identity of an AskUserQuestion — the rendered question text plus + * its numbered options. Used by `runPlanSkillCounting` to dedupe redrawn + * prompts and to feed `Step0BoundaryPredicate` callers. + * + * `signature` is the stable hash. Two AUQs with identical prompt + options + * produce the same signature; differences in either field produce different + * signatures. Critically: two AUQs with shared option labels (e.g. the + * generic "A) Add to plan / B) Defer / C) Build now" menu) but different + * question text get DIFFERENT signatures because the prompt is in the hash. + */ +export interface AskUserQuestionFingerprint { + /** Stable hash combining normalized prompt text + options signature. */ + signature: string; + /** First 240 chars of the rendered question prompt (post-normalization). */ + promptSnippet: string; + /** Captured option labels, in index order. */ + options: Array<{ index: number; label: string }>; + /** Wall-clock when first observed (ms since the helper started polling). */ + observedAtMs: number; + /** True if observed BEFORE the Step-0 boundary fired. */ + preReview: boolean; +} + +/** + * Predicate fired against the AUQ we just answered (not the visible buffer). + * Returns true if this AUQ's fingerprint marks the LAST Step-0 question for + * its skill — all subsequent AUQs are review-phase findings. + * + * Event-based by design: matching against an answered AUQ's fingerprint + * (prompt + options) is deterministic, whereas matching against later + * rendered content (section headers, summary text) races with the agent's + * output cadence. See plan §D14 for the rationale. + */ +export type Step0BoundaryPredicate = ( + answeredFingerprint: AskUserQuestionFingerprint, +) => boolean; + +/** + * Parse the rendered question prompt out of a visible TTY buffer. The prompt + * is the 1–3 lines of text immediately ABOVE the latest `❯ 1.` cursor line — + * not part of the option list, not the permission-dialog header. + * + * Returns the prompt normalized to a single-spaced 240-char snippet (strip + * ANSI residue, collapse internal whitespace, trim) — short enough to use as + * a hash key, long enough to disambiguate distinct questions. + * + * Returns "" when no prompt could be parsed (cursor not yet rendered, or + * cursor is at the top of the buffer with no preceding text). Callers that + * use the empty string as a fingerprint input should treat empty-prompt + * AUQs as "wait one more poll" rather than fingerprinting them — otherwise + * the same options + empty prompt across two distinct questions collide. + */ +export function parseQuestionPrompt(visible: string): string { + // Tail-only — older prompts higher in the buffer are stale. + const tail = visible.length > 4096 ? visible.slice(-4096) : visible; + const lines = tail.split('\n'); + + // Find the latest line containing `❯1.` (matching parseNumberedOptions — + // unanchored to handle the box-layout case where cursor is mid-line after + // divider + header + prompt text on the same logical line). + let cursorLineIdx = -1; + for (let i = lines.length - 1; i >= 0; i--) { + if (/❯\s*1\./.test(lines[i] ?? '')) { + cursorLineIdx = i; + break; + } + } + if (cursorLineIdx < 0) return ''; + + // Box-layout case: prompt text may be ON the cursor line, BEFORE `❯1.`. + // Extract that prefix (after stripping leading box-drawing characters and + // dividers) as the last piece of the prompt — appended after any prior + // multi-line prompt text we walk up to find. + const cursorLine = lines[cursorLineIdx] ?? ''; + let inlinePrompt = ''; + const cursorPos = cursorLine.search(/❯\s*1\./); + if (cursorPos > 0) { + inlinePrompt = cursorLine + .slice(0, cursorPos) + // Strip box-drawing chars + dividers + leading checkbox sigil. + .replace(/^[─━┄┅┈┉─┌┐└┘├┤┬┴┼│┃☐□■\s]+/, '') + .trim(); + } + + // Walk up at most 6 lines collecting prompt text. Stop at: + // - a blank line preceded by another blank line (paragraph break) + // - top of buffer + // - a line that itself starts with `N.` (we're inside an option list) + const promptLines: string[] = []; + let blankRun = 0; + for (let i = cursorLineIdx - 1; i >= 0 && promptLines.length < 6; i--) { + const raw = lines[i] ?? ''; + const trimmed = raw.trim(); + if (trimmed === '') { + blankRun += 1; + if (blankRun >= 2 && promptLines.length > 0) break; + continue; + } + blankRun = 0; + // Stop if we hit what looks like a previous numbered list. + if (/^[\s❯]*[1-9]\.\s+\S/.test(raw)) break; + promptLines.unshift(trimmed); + } + + const all = inlinePrompt.length > 0 ? [...promptLines, inlinePrompt] : promptLines; + const joined = all.join(' ').replace(/\s+/g, ' ').trim(); + return joined.slice(0, 240); +} + +/** + * Stable hash for an AskUserQuestion's identity — combines normalized prompt + * text with the options signature so two distinct questions with shared menu + * labels (the generic A/B/C TODO-proposal menu, for instance) get different + * fingerprints. + * + * Uses Bun's fast non-crypto hash since these strings are short and we only + * need collision resistance against accidental TTY redraws, not adversaries. + * Hex-encoded for diagnostic dumps. + */ +export function auqFingerprint( + promptSnippet: string, + opts: Array<{ index: number; label: string }>, +): string { + const normalized = promptSnippet.replace(/\s+/g, ' ').trim(); + const sig = optionsSignature(opts); + // eslint-disable-next-line @typescript-eslint/no-explicit-any + return (Bun as any).hash(normalized + '||' + sig).toString(16); +} + +/** + * Detects when a plan-* skill has reached its Completion Summary / Review + * Report — a terminal signal complementary to plan-mode's "Ready to execute" + * confirmation. Each plan-review skill writes one of these phrasings near + * the end of its run; matching any one is enough to stop counting. + * + * Best-effort: this is a content marker, not a deterministic event. Hard + * ceiling (`reviewCountCeiling` in `runPlanSkillCounting`) is the reliable + * stop signal; this regex is the "we're done, go gracefully" hint. + */ +export const COMPLETION_SUMMARY_RE = + /(GSTACK REVIEW REPORT|## Completion [Ss]ummary|Status:\s*(clean|issues_open)|^VERDICT:)/m; + +/** + * Result of asserting that a plan file ends with `## GSTACK REVIEW REPORT` + * as its last `## ` heading. `ok` is true iff the report is present AND no + * other `## ` heading appears after it. Diagnostic fields are populated only + * on failure to keep the success path cheap. + */ +export interface ReviewReportAtBottomResult { + ok: boolean; + reason?: string; + trailingHeadings?: string[]; +} + +/** + * Assert that `## GSTACK REVIEW REPORT` is the last `## ` heading in a plan + * file's content. Pure string operation — no filesystem access. Used by the + * finding-count E2E tests as a second assertion on each test's produced plan. + * + * The plan-mode skill template mandates the agent move/append the review + * report so it's always the last `##` section. A regression where the agent + * appends additional sections after the report (or skips it entirely) ships + * silently today; this assertion catches both. + */ +export function assertReviewReportAtBottom( + content: string, +): ReviewReportAtBottomResult { + const re = /^## GSTACK REVIEW REPORT\s*$/m; + const match = re.exec(content); + if (!match) { + return { ok: false, reason: 'no GSTACK REVIEW REPORT section' }; + } + const after = content.slice(match.index + match[0].length); + // Match any `## ` heading after the report. Reject `## ` followed by + // newline-only (trailing-whitespace ## headers) to avoid false positives. + const trailingHeadings = Array.from( + after.matchAll(/^## \S.*$/gm), + ).map((m) => m[0]); + if (trailingHeadings.length > 0) { + return { + ok: false, + reason: 'trailing ## heading(s) after GSTACK REVIEW REPORT', + trailingHeadings, + }; + } + return { ok: true }; +} + +/** + * Per-skill Step-0 boundary predicates. Each fires `true` when the answered + * AUQ's fingerprint matches the LAST question of that skill's Step 0 phase. + * + * - `ceoStep0Boundary`: matches the mode-pick AUQ (options match `MODE_RE`). + * - `engStep0Boundary`: matches the cross-project-learnings or scope-reduction + * AUQ that closes plan-eng-review's preamble. + * - `designStep0Boundary`: matches plan-design-review's first dimension / + * posture AUQ. + * - `devexStep0Boundary`: matches plan-devex-review's persona-selection AUQ. + * + * Predicates live alongside the helper so the unit suite can exercise each + * against synthetic fingerprints (positive AND negative cases). Skill test + * files import them directly. + */ +export const ceoStep0Boundary: Step0BoundaryPredicate = (fp) => + // Mode-pick path (Step 0F): one of HOLD SCOPE / SCOPE EXPANSION / etc. + fp.options.some((o) => MODE_RE.test(o.label)) || + // Skip-interview path: scope-selection AUQ has "Skip interview and plan + // immediately" — picking it bypasses the rest of Step 0 and routes + // directly to review-phase. Boundary fires on the scope AUQ itself. + fp.options.some((o) => /skip\s+interview|plan\s+immediately/i.test(o.label)); + +export const engStep0Boundary: Step0BoundaryPredicate = (fp) => + /scope reduction recommendation|cross[\s-]?project learnings/i.test( + fp.promptSnippet, + ); + +export const designStep0Boundary: Step0BoundaryPredicate = (fp) => + /design system|design posture|design score|first dimension/i.test( + fp.promptSnippet, + ); + +export const devexStep0Boundary: Step0BoundaryPredicate = (fp) => + /developer persona|target persona|persona selection|TTHW target/i.test( + fp.promptSnippet, + ); + /** * Spawn `claude --permission-mode plan` in a real PTY and return a session * handle. Caller is responsible for `await session.close()` to release the @@ -598,12 +992,20 @@ export async function runPlanSkillObservation(opts: { cwd?: string; /** Total budget for skill to reach a terminal outcome. Default 180000. */ timeoutMs?: number; - /** Extra CLI args appended after --permission-mode. Used by the v1.21+ + /** Extra CLI args appended after --permission-mode. Used by the v1.22+ * AskUserQuestion-blocked regression tests to pass * `['--disallowedTools', 'AskUserQuestion']` (the flag set Conductor * uses to remove native AskUserQuestion in favor of its MCP variant). * Plumbs straight through to launchClaudePty. */ extraArgs?: string[]; + /** + * Extra env merged into the spawned `claude` process. `launchClaudePty` + * already supports this; exposing it here lets per-skill tests isolate + * from local config that would mask the regression they're trying to + * catch (e.g., `QUESTION_TUNING=true` causing AUTO_DECIDE to skip the + * rendered AskUserQuestion list). + */ + env?: Record; }): Promise { const startedAt = Date.now(); const session = await launchClaudePty({ @@ -611,6 +1013,7 @@ export async function runPlanSkillObservation(opts: { cwd: opts.cwd, timeoutMs: (opts.timeoutMs ?? 180_000) + 30_000, extraArgs: opts.extraArgs, + env: opts.env, }); try { @@ -641,60 +1044,10 @@ export async function runPlanSkillObservation(opts: { elapsedMs: Date.now() - startedAt, }; } - // Silent-write detection: any Write/Edit tool render that targets a - // path OUTSIDE ~/.claude/plans, ~/.gstack/, or the active worktree's - // .gstack/. Plan files and gbrain artifacts are sanctioned. - const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g; - let m: RegExpExecArray | null; - while ((m = writeRe.exec(visible)) !== null) { - const target = m[1] ?? ''; - const sanctioned = - target.includes('.claude/plans') || - target.includes('.gstack/') || - target.includes('/.context/') || - target.includes('CHANGELOG.md') || - target.includes('TODOS.md'); - if (!sanctioned && !isNumberedOptionListVisible(visible)) { - return { - outcome: 'silent_write', - summary: `Write/Edit to ${target} fired before any AskUserQuestion`, - evidence: visible.slice(-2000), - elapsedMs: Date.now() - startedAt, - }; - } - } - // Detection order is most-specific first: - // 1. 'auto_decided' — requires "Auto-decided" + "(your preference)"; - // the strongest signal that AUTO_DECIDE fired regardless of what - // else is on screen. - // 2. 'plan_ready' — "ready to execute" / "Would you like to proceed"; - // the plan-mode native confirmation. MUST be checked before - // 'asked' because the confirmation itself renders as a numbered - // options list ("1. Yes, ... / 2. Manual ... / 3. ..."), and a - // naive numbered-list check would mis-classify it as 'asked'. - // 3. 'asked' — any numbered options list that wasn't already - // classified as plan_ready. Real AskUserQuestion prompts AND - // fallback-flow prose with numbered options both land here. - if (isAutoDecidedVisible(visible)) { + const classified = classifyVisible(visible); + if (classified) { return { - outcome: 'auto_decided', - summary: 'skill auto-decided an AskUserQuestion via the AUTO_DECIDE preamble (the user never saw the prompt)', - evidence: visible.slice(-2000), - elapsedMs: Date.now() - startedAt, - }; - } - if (isPlanReadyVisible(visible)) { - return { - outcome: 'plan_ready', - summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation', - evidence: visible.slice(-2000), - elapsedMs: Date.now() - startedAt, - }; - } - if (isNumberedOptionListVisible(visible)) { - return { - outcome: 'asked', - summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)', + ...classified, evidence: visible.slice(-2000), elapsedMs: Date.now() - startedAt, }; @@ -711,3 +1064,281 @@ export async function runPlanSkillObservation(opts: { await session.close(); } } + +// ──────────────────────────────────────────────────────────────────────────── +// runPlanSkillCounting — drives a plan-* skill end-to-end through Step 0 then +// counts distinct review-phase AskUserQuestion fingerprints. The actual +// product asserted by the per-finding-count tests. +// ──────────────────────────────────────────────────────────────────────────── + +/** + * Result of a `runPlanSkillCounting` run. Includes both the count summary + * (`step0Count`, `reviewCount`) and the full fingerprint list for diagnostic + * dumps when an assertion fails. + */ +export interface PlanSkillCountObservation { + outcome: + | 'plan_ready' + | 'completion_summary' + | 'ceiling_reached' + | 'silent_write' + | 'exited' + | 'timeout'; + summary: string; + /** Visible terminal text at terminal time (last 3KB). */ + evidence: string; + /** Wall time (ms) until the outcome was decided. */ + elapsedMs: number; + /** All distinct AskUserQuestions observed, in observation order. */ + fingerprints: AskUserQuestionFingerprint[]; + /** Count of fingerprints with `preReview === true`. */ + step0Count: number; + /** Count of fingerprints with `preReview === false`. */ + reviewCount: number; +} + +/** + * Drive a plan-* skill in plan mode and count distinct review-phase + * AskUserQuestions until a terminal signal fires. + * + * Flow: + * 1. Boot PTY in plan mode (8s grace + auto-trust dialog). + * 2. Send `slashCommand` alone. Sleep ~3s. + * 3. Send `followUpPrompt` as a chat message — this is the plan content + * the skill reviews. Slash commands with trailing args are rejected by + * Claude Code unless the skill defines them, so the plan goes as a + * follow-up message (the proven pattern at + * skill-e2e-plan-design-with-ui.test.ts:57-71). + * 4. Poll loop: + * - Skip permission dialogs (auto-grant with `defaultPick`). + * - On a new numbered-option list, parse prompt + options, build + * fingerprint via `auqFingerprint`. Empty-prompt parses are skipped + * and re-polled (avoids the empty-prompt collision documented in + * the auqFingerprint contract). + * - First time we see a fingerprint: push it, classify as Step 0 or + * review-phase based on `boundaryFired`, press `defaultPick` to + * advance. + * - After pressing, evaluate `isLastStep0AUQ(fingerprint)`. If true, + * all subsequent AUQs are review-phase. + * - Hard ceiling: if `reviewCount >= reviewCountCeiling`, return + * `ceiling_reached`. This bounds runaway counts; tests should set + * the ceiling above their assertion CEILING. + * - Soft terminals: `COMPLETION_SUMMARY_RE` match → `completion_summary`; + * plan-ready confirmation → `plan_ready`; silent write outside + * sanctioned dirs → `silent_write`; process exited → `exited`; + * wall clock exceeded → `timeout`. + * + * Boundary detection (D14): event-based, fired against the answered AUQ's + * fingerprint, not against later rendered content. This avoids the race + * where Step-0-final and Section-1-first AUQs straddle a section header + * regex match. + * + * Fingerprint composition (D9): `auqFingerprint(prompt, options)` mixes + * normalized prompt text with the options signature so distinct findings + * with shared menu structure (the generic A/B/C TODO menu) get distinct + * fingerprints. + */ +export async function runPlanSkillCounting(opts: { + /** Skill name, e.g. 'plan-ceo-review'. Used for diagnostic strings only. */ + skillName: string; + /** Slash command to send alone, e.g. '/plan-ceo-review'. No trailing args. */ + slashCommand: string; + /** Plan content sent as a follow-up message ~3s after the slash command. */ + followUpPrompt: string; + /** Per-skill predicate: which answered AUQ is the last Step-0 question. */ + isLastStep0AUQ: Step0BoundaryPredicate; + /** Hard cap on review-phase count; helper returns when reached. Should be + * set ABOVE the test's assertion ceiling so the test sees the cap as a + * failure rather than a silent stop. */ + reviewCountCeiling: number; + /** Numbered option to press by default. Defaults to 1 (recommended). */ + defaultPick?: number; + /** + * Optional override for the FIRST AUQ observed. Receives the fingerprint; + * returns the option index to press. Subsequent AUQs always use defaultPick. + * + * Skill-specific routing helper: /plan-ceo-review's first AUQ asks "what + * scope?" with options like "branch diff" / "describe inline" / "skip + * interview". Pressing the default 1 routes to "branch diff" (the wrong + * review target for a seeded fixture). firstAUQPick lets the test pick + * "Skip interview" or "describe inline" so the agent reviews the + * follow-up plan content the test sent, not the git diff. + */ + firstAUQPick?: (fp: AskUserQuestionFingerprint) => number; + /** Working directory. Default process.cwd() (repo cwd holds skill registry). */ + cwd?: string; + /** Total budget for skill to reach a terminal outcome. Default 1_500_000 (25 min). */ + timeoutMs?: number; + /** Extra env merged into the spawned `claude` process. */ + env?: Record; +}): Promise { + const startedAt = Date.now(); + const defaultPick = opts.defaultPick ?? 1; + const timeoutMs = opts.timeoutMs ?? 1_500_000; + + const session = await launchClaudePty({ + permissionMode: 'plan', + cwd: opts.cwd, + timeoutMs: timeoutMs + 60_000, + env: opts.env, + }); + + const fingerprints: AskUserQuestionFingerprint[] = []; + const seen = new Set(); + let boundaryFired = false; + let step0Count = 0; + let reviewCount = 0; + let isFirstAUQ = true; + let lastSig = ''; + + function snapshot( + outcome: PlanSkillCountObservation['outcome'], + summary: string, + visible: string, + ): PlanSkillCountObservation { + return { + outcome, + summary, + evidence: visible.slice(-3000), + elapsedMs: Date.now() - startedAt, + fingerprints, + step0Count, + reviewCount, + }; + } + + try { + await Bun.sleep(8000); // boot grace + auto-trust handler window + const since = session.mark(); + session.send(`${opts.slashCommand}\r`); + await Bun.sleep(3000); + session.send(`${opts.followUpPrompt}\r`); + + const budgetStart = Date.now(); + while (Date.now() - budgetStart < timeoutMs) { + await Bun.sleep(2000); + const visible = session.visibleSince(since); + + // Process exited? + if (session.exited()) { + return snapshot( + 'exited', + `claude exited (code=${session.exitCode()}) during counting (step0=${step0Count}, review=${reviewCount})`, + visible, + ); + } + if (visible.includes('Unknown command:')) { + return snapshot( + 'exited', + `claude rejected ${opts.slashCommand} as unknown command (skill not registered in this cwd)`, + visible, + ); + } + + // Silent write detection — only fires if no numbered prompt is on + // screen (otherwise the write is gated by a permission/AUQ). + const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g; + let m: RegExpExecArray | null; + while ((m = writeRe.exec(visible)) !== null) { + const target = m[1] ?? ''; + const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => + target.includes(s), + ); + if (!sanctioned && !isNumberedOptionListVisible(visible)) { + return snapshot( + 'silent_write', + `Write/Edit to ${target} fired before any AskUserQuestion`, + visible, + ); + } + } + + // Soft terminal signals — check before AUQ processing so a final + // completion-summary doesn't get misclassified as a bonus AUQ. + if (COMPLETION_SUMMARY_RE.test(visible)) { + return snapshot( + 'completion_summary', + `skill emitted completion summary / verdict / status line (step0=${step0Count}, review=${reviewCount})`, + visible, + ); + } + if (isPlanReadyVisible(visible)) { + return snapshot( + 'plan_ready', + `skill emitted plan-mode "Ready to execute" confirmation (step0=${step0Count}, review=${reviewCount})`, + visible, + ); + } + + // Numbered option list? + if (!isNumberedOptionListVisible(visible)) continue; + + // Permission dialog? Auto-grant with defaultPick. Only act on the + // recent tail to avoid re-triggering on stale dialogs in scrollback. + if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) { + session.send(`${defaultPick}\r`); + await Bun.sleep(1500); + continue; + } + + // Parse the active AUQ. Skip same-redraw and empty-prompt cases. + const options = parseNumberedOptions(visible); + if (options.length < 2) continue; + const sig = optionsSignature(options); + if (sig === lastSig) continue; + const promptSnippet = parseQuestionPrompt(visible); + if (promptSnippet === '') continue; // not yet rendered, poll again + lastSig = sig; + + const fingerprintHash = auqFingerprint(promptSnippet, options); + if (seen.has(fingerprintHash)) { + // Same content, already counted (TTY redrew with whitespace diff). + continue; + } + seen.add(fingerprintHash); + + const fp: AskUserQuestionFingerprint = { + signature: fingerprintHash, + promptSnippet, + options, + observedAtMs: Date.now() - startedAt, + preReview: !boundaryFired, + }; + fingerprints.push(fp); + if (boundaryFired) reviewCount += 1; + else step0Count += 1; + + // Press to advance — first AUQ may use the override pick. + const pickIdx = + isFirstAUQ && opts.firstAUQPick ? opts.firstAUQPick(fp) : defaultPick; + isFirstAUQ = false; + session.send(`${pickIdx}\r`); + + // Evaluate boundary AFTER pressing — if THIS AUQ was the last Step 0 + // question, all subsequent AUQs go to reviewCount. + if (!boundaryFired && opts.isLastStep0AUQ(fp)) { + boundaryFired = true; + } + + // Hard ceiling — runaway protection. + if (reviewCount >= opts.reviewCountCeiling) { + return snapshot( + 'ceiling_reached', + `review-phase AUQ count reached ceiling (${opts.reviewCountCeiling})`, + session.visibleSince(since), + ); + } + + // Give the agent a beat to advance to the next state. + await Bun.sleep(2000); + } + + return snapshot( + 'timeout', + `no terminal outcome within ${timeoutMs}ms (step0=${step0Count}, review=${reviewCount})`, + session.visibleSince(since), + ); + } finally { + await session.close(); + } +} diff --git a/test/helpers/claude-pty-runner.unit.test.ts b/test/helpers/claude-pty-runner.unit.test.ts new file mode 100644 index 00000000..e830d730 --- /dev/null +++ b/test/helpers/claude-pty-runner.unit.test.ts @@ -0,0 +1,749 @@ +/** + * Deterministic unit tests for claude-pty-runner.ts behavior changes. + * + * Free-tier (no EVALS=1 needed). Runs in <1s on every `bun test`. Catches + * harness plumbing bugs before stochastic PTY runs surface them. + * + * Two surface areas tested: + * + * 1. Permission-dialog short-circuit in 'asked' classification: a TTY frame + * that matches BOTH isPermissionDialogVisible AND isNumberedOptionListVisible + * must NOT be classified as a skill question — permission dialogs render + * as numbered lists too, but they're not what we're guarding. + * + * 2. Env passthrough surface: runPlanSkillObservation accepts an `env` + * option and threads it to launchClaudePty. We can't fully exercise the + * spawn pipeline without paying for a PTY session, but we CAN verify the + * option exists in the type signature and that calling without env still + * works (no regression). + * + * The PTY test (skill-e2e-plan-ceo-plan-mode.test.ts) is the integration + * check; this file is the cheap deterministic guard for the harness primitives + * those tests stand on. + */ + +import { describe, test, expect } from 'bun:test'; +import { + isPermissionDialogVisible, + isNumberedOptionListVisible, + isPlanReadyVisible, + parseNumberedOptions, + classifyVisible, + TAIL_SCAN_BYTES, + optionsSignature, + parseQuestionPrompt, + auqFingerprint, + COMPLETION_SUMMARY_RE, + assertReviewReportAtBottom, + ceoStep0Boundary, + engStep0Boundary, + designStep0Boundary, + devexStep0Boundary, + type ClaudePtyOptions, + type AskUserQuestionFingerprint, +} from './claude-pty-runner'; + +describe('isPermissionDialogVisible', () => { + test('matches "Bash command requires permission" prompts', () => { + const sample = ` + Some preamble output + + Bash command \`gstack-config get telemetry\` requires permission to run. + + ❯ 1. Yes + 2. Yes, and always allow + 3. No, abort + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('matches "allow all edits" file-edit prompts', () => { + // Isolated to the "allow all edits" clause only — no overlapping + // "Do you want to proceed?" co-trigger, so this asserts the clause works. + const sample = ` + Edit to ~/.gstack/config.yaml + + ❯ 1. Yes + 2. Yes, allow all edits during this session + 3. No + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('matches the "Do you want to proceed?" file-edit confirmation by itself', () => { + // Separate fixture so weakening this clause is detected by a dedicated test. + const sample = ` + Edit to ~/.gstack/config.yaml + + Do you want to proceed? + + ❯ 1. Yes + 2. No + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('matches workspace-trust "always allow access to" prompt', () => { + const sample = ` + Do you trust the files in this folder? + + ❯ 1. Yes, proceed + 2. Yes, and always allow access to /Users/me/repo + 3. No, exit + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('does NOT match a skill AskUserQuestion list', () => { + const sample = ` + D1 — Premise challenge: do users actually want this? + + ❯ 1. Yes, validated + 2. No, premise is wrong + 3. Need more info + `; + expect(isPermissionDialogVisible(sample)).toBe(false); + }); + + test('does NOT match a plan-ready confirmation', () => { + const sample = ` + Ready to execute the plan? + + ❯ 1. Yes + 2. No, keep planning + `; + expect(isPermissionDialogVisible(sample)).toBe(false); + }); + + test('does NOT match a skill question that contains the bare phrase "Do you want to proceed?"', () => { + // Co-trigger requirement: "Do you want to proceed?" alone is not enough. + // It must appear with "Edit to " or "Write to " to count as + // a permission dialog. This guards against a skill question like + // "Do you want to proceed with HOLD SCOPE?" being mis-classified. + const sample = ` + Choose your scope mode for this review. + Do you want to proceed? + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + 3. SELECTIVE EXPANSION + `; + expect(isPermissionDialogVisible(sample)).toBe(false); + }); + + test('does NOT mis-match when adversarial prose includes "Edit to " alongside the bare proceed phrase', () => { + // Adversarial fixture: a skill question whose body legitimately mentions + // "Edit to " in prose AND ends with "Do you want to proceed?". The + // current co-trigger regex would mis-classify this as a permission + // dialog. We DO want this test to fail until the regex is tightened + // further (e.g., proximity constraint, or anchoring "Edit to" to a + // line-start). For now this is documented as a known limitation: a + // skill question that talks about "Edit to" in prose IS still treated + // as a permission dialog. The test asserts the current behavior so a + // future fix can flip it intentionally. + const sample = ` + Plan: I will Edit to ./plan.md to capture the decision. + Do you want to proceed? + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + // KNOWN LIMITATION: the co-trigger fires here. Documented as a + // post-merge follow-up. Flip this assertion once the regex tightens. + expect(isPermissionDialogVisible(sample)).toBe(true); + }); +}); + +describe('isNumberedOptionListVisible', () => { + test('matches a basic ❯ 1. + 2. cursor list', () => { + const sample = ` + ❯ 1. Option one + 2. Option two + 3. Option three + `; + expect(isNumberedOptionListVisible(sample)).toBe(true); + }); + + test('returns false on a single-option prompt', () => { + const sample = ` + ❯ 1. Only option + `; + expect(isNumberedOptionListVisible(sample)).toBe(false); + }); + + test('returns false when no cursor renders', () => { + const sample = ` + Just some prose with 1. a numbered point and 2. another. + `; + expect(isNumberedOptionListVisible(sample)).toBe(false); + }); + + test('overlaps permission dialogs (this is why D5 short-circuits)', () => { + // The whole point of D5: this string matches BOTH classifiers, so the + // runner must consult isPermissionDialogVisible to disambiguate. + const sample = ` + Bash command \`do-thing\` requires permission to run. + + ❯ 1. Yes + 2. No + `; + expect(isNumberedOptionListVisible(sample)).toBe(true); + expect(isPermissionDialogVisible(sample)).toBe(true); + }); +}); + +describe('classifyVisible (runtime path through the runner classifier)', () => { + // These tests call the actual classifier so a future contributor who + // reorders branches (e.g. moves the permission short-circuit before + // isPlanReadyVisible) is caught deterministically. + + test('skill question → returns asked', () => { + const visible = ` + D1 — Choose your scope mode + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + 3. SELECTIVE EXPANSION + 4. SCOPE REDUCTION + `; + const result = classifyVisible(visible); + expect(result?.outcome).toBe('asked'); + }); + + test('permission dialog (Bash) → returns null (skip, keep polling)', () => { + const visible = ` + Bash command \`gstack-update-check\` requires permission to run. + + ❯ 1. Yes + 2. No + `; + expect(isNumberedOptionListVisible(visible)).toBe(true); // pre-filter + expect(classifyVisible(visible)).toBeNull(); // post-filter + }); + + test('plan-ready confirmation → returns plan_ready (wins over asked)', () => { + const visible = ` + Ready to execute the plan? + + ❯ 1. Yes, proceed + 2. No, keep planning + `; + const result = classifyVisible(visible); + expect(result?.outcome).toBe('plan_ready'); + }); + + test('silent write to unsanctioned path → returns silent_write', () => { + const visible = ` + ⏺ Write(src/app/dangerous-write.ts) + ⎿ Wrote 42 lines + `; + const result = classifyVisible(visible); + expect(result?.outcome).toBe('silent_write'); + expect(result?.summary).toContain('src/app/dangerous-write.ts'); + }); + + test('write to sanctioned path (.claude/plans) → returns null (allowed)', () => { + const visible = ` + ⏺ Write(/Users/me/.claude/plans/some-plan.md) + ⎿ Wrote 42 lines + `; + expect(classifyVisible(visible)).toBeNull(); + }); + + test('write while a permission dialog is on screen → returns null (gated, not silent, not asked)', () => { + const visible = ` + ⏺ Write(src/app/edit-with-permission.ts) + + Edit to src/app/edit-with-permission.ts + + Do you want to proceed? + + ❯ 1. Yes + 2. No + `; + // The numbered prompt is a permission dialog (Edit to + Do you want to proceed?); + // silent_write is suppressed because a numbered prompt is visible, AND + // 'asked' is suppressed because the prompt is a permission dialog. + expect(classifyVisible(visible)).toBeNull(); + }); + + test('write while a real skill question is on screen → returns asked (write is captured but not silent)', () => { + const visible = ` + ⏺ Write(src/app/foo.ts) + + D1 — Choose your scope mode + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + // The numbered prompt is a skill question, not a permission dialog; + // silent_write is suppressed (numbered prompt is visible) and the + // outcome is 'asked' — Step 0 fired. + const result = classifyVisible(visible); + expect(result?.outcome).toBe('asked'); + }); + + test('idle / no signals → returns null', () => { + const visible = ` + Some prose without any classifier signals. + `; + expect(classifyVisible(visible)).toBeNull(); + }); + + test('TAIL_SCAN_BYTES is exported as 1500', () => { + // Shared between runner and routing test; a regression that desyncs the + // recent-tail window would surface here. + expect(TAIL_SCAN_BYTES).toBe(1500); + }); +}); + +describe('parseNumberedOptions', () => { + test('extracts options from a clean cursor list', () => { + const visible = ` + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + const opts = parseNumberedOptions(visible); + expect(opts).toHaveLength(2); + expect(opts[0]).toEqual({ index: 1, label: 'HOLD SCOPE' }); + expect(opts[1]).toEqual({ index: 2, label: 'SCOPE EXPANSION' }); + }); + + test('returns empty array on prose-with-numbers (no cursor)', () => { + expect(parseNumberedOptions('text 1. one 2. two')).toEqual([]); + }); + + test('extracts options when the cursor is INLINE with prompt header (box-layout)', () => { + // Real /plan-ceo-review rendering: the TTY's cursor-positioning escapes + // collapse divider + header + prompt + cursor onto one logical line. + // Subsequent options (2..7) still start their own lines. + const visible = [ + '────────────────────────────────────────', + '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main', + ' Review the full branch: ~10K LOC.', + '2. A specific plan file or design doc', + ' You point me at a file (path) and I review that.', + '3. An idea you\'ll describe inline', + '4. Cancel — wrong skill', + '5. Type something.', + '────────────────────────────────────────', + '6. Chat about this', + '7. Skip interview and plan immediately', + ].join('\n'); + const opts = parseNumberedOptions(visible); + expect(opts).toHaveLength(7); + expect(opts[0]).toEqual({ index: 1, label: "The branch's diff vs main" }); + expect(opts[1]?.index).toBe(2); + expect(opts[6]?.index).toBe(7); + expect(opts[6]?.label).toBe('Skip interview and plan immediately'); + }); + + test('inline-cursor and start-of-line cursor both produce 7 options for the box-layout case', () => { + // The inline path captures option 1 from the cursor line itself; the + // subsequent-lines path captures 2..7 with the existing optionRe. + const inlineLayout = [ + 'header text ❯ 1. first option', + '2. second', + '3. third', + ].join('\n'); + expect(parseNumberedOptions(inlineLayout)).toEqual([ + { index: 1, label: 'first option' }, + { index: 2, label: 'second' }, + { index: 3, label: 'third' }, + ]); + + const cleanLayout = [ + ' ❯ 1. first option', + ' 2. second', + ' 3. third', + ].join('\n'); + expect(parseNumberedOptions(cleanLayout)).toEqual([ + { index: 1, label: 'first option' }, + { index: 2, label: 'second' }, + { index: 3, label: 'third' }, + ]); + }); +}); + +describe('runPlanSkillObservation env passthrough surface', () => { + test('ClaudePtyOptions exposes env: Record', () => { + // Type-level guard: this file would fail to compile if the env field + // were removed or its shape regressed. The actual env merge happens in + // launchClaudePty's spawn call (`env: { ...process.env, ...opts.env }`), + // so a regression where `env: opts.env` gets dropped from the + // runPlanSkillObservation -> launchClaudePty handoff is only caught by + // the live PTY test, not here. + const opts: ClaudePtyOptions = { + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }; + expect(opts.env).toEqual({ QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }); + }); +}); + +// ──────────────────────────────────────────────────────────────────────────── +// Per-finding count primitives — Section 3 unit tests #1–#5, #7, #12. +// ──────────────────────────────────────────────────────────────────────────── + +describe('optionsSignature', () => { + test('returns a "|"-joined `index:label` string for a clean list', () => { + const sig = optionsSignature([ + { index: 1, label: 'HOLD SCOPE' }, + { index: 2, label: 'SCOPE EXPANSION' }, + ]); + expect(sig).toBe('1:HOLD SCOPE|2:SCOPE EXPANSION'); + }); + + test('order-independent: shuffled inputs produce the same signature', () => { + // parseNumberedOptions already returns sorted, but defensive sort means + // a future caller that hands us shuffled input still produces a stable + // dedupe signature. + const a = optionsSignature([ + { index: 2, label: 'B' }, + { index: 1, label: 'A' }, + { index: 3, label: 'C' }, + ]); + const b = optionsSignature([ + { index: 1, label: 'A' }, + { index: 2, label: 'B' }, + { index: 3, label: 'C' }, + ]); + expect(a).toBe(b); + }); + + test('empty list returns empty string', () => { + expect(optionsSignature([])).toBe(''); + }); + + test('single-item list returns just that entry', () => { + expect(optionsSignature([{ index: 1, label: 'Only' }])).toBe('1:Only'); + }); +}); + +describe('parseQuestionPrompt', () => { + test('captures 1-line prompt above the cursor', () => { + const visible = ` + D1 — Pick a mode + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + const prompt = parseQuestionPrompt(visible); + expect(prompt).toBe('D1 — Pick a mode'); + }); + + test('captures multi-line prompt above the cursor', () => { + const visible = ` + D2 — Approach selection + + Which architecture should we follow? + + ❯ 1. Bypass existing helper + 2. Reuse existing helper + `; + const prompt = parseQuestionPrompt(visible); + // Multi-line prompts get joined with single spaces. + expect(prompt).toContain('D2 — Approach selection'); + expect(prompt).toContain('Which architecture should we follow?'); + }); + + test('returns "" when no cursor is rendered', () => { + expect(parseQuestionPrompt('Just some prose.\nNo cursor.')).toBe(''); + }); + + test('truncates to 240 chars', () => { + const longPrompt = 'A'.repeat(500); + const visible = `${longPrompt}\n\n ❯ 1. yes\n 2. no`; + expect(parseQuestionPrompt(visible).length).toBeLessThanOrEqual(240); + }); + + test('does not pull text from a previous numbered list above', () => { + const visible = ` + ❯ 1. previous answered question + 2. previous option two + + D2 — A new question text + + ❯ 1. fresh option A + 2. fresh option B + `; + const prompt = parseQuestionPrompt(visible); + // Stops at the previous numbered-list line; should NOT contain "previous answered question". + expect(prompt).toContain('D2 — A new question text'); + expect(prompt).not.toContain('previous answered question'); + }); + + test('normalizes whitespace (collapses runs of spaces and tabs)', () => { + const visible = `D1 — Spaced out + + ❯ 1. yes + 2. no`; + expect(parseQuestionPrompt(visible)).toBe('D1 — Spaced out'); + }); + + test('inline-cursor box-layout: extracts prompt text BEFORE ❯1. on the cursor line', () => { + // Real /plan-ceo-review rendering: divider + ☐ header + prompt text + + // cursor are all on one logical line because TTY cursor-positioning + // escapes collapse the box layout under stripAnsi. + const visible = [ + '──────────────────', + '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main', + '2. A specific plan file', + '3. An idea inline', + ].join('\n'); + const prompt = parseQuestionPrompt(visible); + // Should extract "Review scope" and the prompt text, dropping the ☐ box-drawing sigil. + expect(prompt).toContain('Review scope'); + expect(prompt).toContain('What scope do you want me to CEO-review?'); + expect(prompt).not.toContain('❯'); + expect(prompt).not.toMatch(/^☐/); + }); +}); + +describe('auqFingerprint', () => { + test('returns the same fingerprint for identical inputs', () => { + const opts = [ + { index: 1, label: 'A' }, + { index: 2, label: 'B' }, + ]; + expect(auqFingerprint('hello', opts)).toBe(auqFingerprint('hello', opts)); + }); + + test('different prompts with shared option labels produce DIFFERENT fingerprints', () => { + // The collision regression Codex F1 caught: option-label-only fingerprints + // collapsed multiple distinct findings into one when they shared menu shape. + const sharedOpts = [ + { index: 1, label: 'Add to plan' }, + { index: 2, label: 'Defer' }, + { index: 3, label: 'Build now' }, + ]; + const fpFinding1 = auqFingerprint('D5 — Architecture: bypass helper?', sharedOpts); + const fpFinding2 = auqFingerprint('D6 — Tests: zero coverage?', sharedOpts); + expect(fpFinding1).not.toBe(fpFinding2); + }); + + test('same prompt with different options produces DIFFERENT fingerprints', () => { + const prompt = 'D1 — Pick a mode'; + const fpA = auqFingerprint(prompt, [ + { index: 1, label: 'HOLD SCOPE' }, + { index: 2, label: 'SCOPE EXPANSION' }, + ]); + const fpB = auqFingerprint(prompt, [ + { index: 1, label: 'HOLD SCOPE' }, + { index: 2, label: 'SCOPE REDUCTION' }, + ]); + expect(fpA).not.toBe(fpB); + }); + + test('whitespace-only differences in prompt do NOT change the fingerprint', () => { + // Same content, different rendering whitespace (TTY redraw artifact) + // must produce the same fingerprint so dedupe survives reflow. + const opts = [{ index: 1, label: 'A' }, { index: 2, label: 'B' }]; + const fpA = auqFingerprint('Pick a mode', opts); + const fpB = auqFingerprint('Pick a mode', opts); + expect(fpA).toBe(fpB); + }); + + test('empty prompt + same options collide (caller must guard against this)', () => { + // Documents the contract: empty-prompt fingerprints WILL collide if the + // caller fingerprints them. runPlanSkillCounting must skip empty-prompt + // AUQs and re-poll instead. + const opts = [{ index: 1, label: 'A' }]; + expect(auqFingerprint('', opts)).toBe(auqFingerprint('', opts)); + }); +}); + +describe('COMPLETION_SUMMARY_RE', () => { + test('matches GSTACK REVIEW REPORT heading', () => { + expect(COMPLETION_SUMMARY_RE.test('## GSTACK REVIEW REPORT')).toBe(true); + }); + + test('matches Completion Summary heading (ceo + eng)', () => { + expect(COMPLETION_SUMMARY_RE.test('## Completion Summary')).toBe(true); + expect(COMPLETION_SUMMARY_RE.test('## Completion summary')).toBe(true); + }); + + test('matches Status: clean (CEO review-log shape)', () => { + expect(COMPLETION_SUMMARY_RE.test('Status: clean')).toBe(true); + expect(COMPLETION_SUMMARY_RE.test('Status: issues_open')).toBe(true); + }); + + test('matches VERDICT: line', () => { + expect(COMPLETION_SUMMARY_RE.test('VERDICT: CLEARED — Eng Review passed')).toBe(true); + }); + + test('does NOT match prose mentions of "verdict" mid-line', () => { + // VERDICT must be at the start of a line to count. + expect(COMPLETION_SUMMARY_RE.test('the final verdict: undecided')).toBe(false); + }); +}); + +describe('assertReviewReportAtBottom', () => { + test('passes when REVIEW REPORT is the only/last ## heading', () => { + const content = `# Plan + +## Context +stuff + +## Approach +more stuff + +## GSTACK REVIEW REPORT + +| col | col | +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(true); + }); + + test('fails when REVIEW REPORT is missing', () => { + const content = `# Plan + +## Context +stuff +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(false); + expect(r.reason).toMatch(/no GSTACK REVIEW REPORT/); + }); + + test('fails when REVIEW REPORT exists but a ## heading follows it', () => { + const content = `# Plan + +## GSTACK REVIEW REPORT + +| col | col | + +## Late Section +oops +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(false); + expect(r.reason).toMatch(/trailing ## heading/); + expect(r.trailingHeadings).toEqual(['## Late Section']); + }); + + test('passes when only ### subheadings follow REVIEW REPORT (deeper nesting allowed)', () => { + const content = `## GSTACK REVIEW REPORT + +### Cross-model tension +- F1: resolved +- F2: resolved +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(true); + }); + + test('fails with multiple trailing ## headings reported', () => { + const content = `## GSTACK REVIEW REPORT + +## First trailing + +## Second trailing +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(false); + expect(r.trailingHeadings).toHaveLength(2); + }); +}); + +describe('Step0BoundaryPredicate per-skill', () => { + // Helper to build a synthetic fingerprint for predicate tests. + function fp(promptSnippet: string, optionLabels: string[]): AskUserQuestionFingerprint { + const options = optionLabels.map((label, i) => ({ index: i + 1, label })); + return { + signature: auqFingerprint(promptSnippet, options), + promptSnippet, + options, + observedAtMs: 0, + preReview: true, + }; + } + + describe('ceoStep0Boundary', () => { + test('FIRES on Step 0F mode-pick AUQ (HOLD SCOPE in options)', () => { + const f = fp('Pick a mode', ['HOLD SCOPE', 'SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'SCOPE REDUCTION']); + expect(ceoStep0Boundary(f)).toBe(true); + }); + + test('FIRES on scope-selection AUQ with "Skip interview" option (skip-interview path)', () => { + // After calibration run 1: plan-ceo's first AUQ is scope-selection, + // and we route via "Skip interview and plan immediately" to bypass + // Step 0 entirely. Boundary must fire on this AUQ so subsequent + // AUQs go to reviewCount. + const f = fp( + 'What scope do you want me to CEO-review?', + [ + "The branch's diff vs main", + 'A specific plan file', + "An idea you'll describe inline", + 'Cancel — wrong skill', + 'Type something.', + 'Chat about this', + 'Skip interview and plan immediately', + ], + ); + expect(ceoStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on premise challenge AUQs', () => { + const f = fp('D1 — Premise check: is this the right problem?', ['Yes', 'No', 'Other']); + expect(ceoStep0Boundary(f)).toBe(false); + }); + + test('does NOT fire on review-section AUQs', () => { + const f = fp('Architecture: bypass helper?', ['Reuse existing', 'Roll new', 'Defer']); + expect(ceoStep0Boundary(f)).toBe(false); + }); + }); + + describe('engStep0Boundary', () => { + test('FIRES on cross-project learnings prompt', () => { + const f = fp('Enable cross-project learnings on this machine?', ['Yes', 'No']); + expect(engStep0Boundary(f)).toBe(true); + }); + + test('FIRES on scope reduction recommendation', () => { + const f = fp('Scope reduction recommendation: cut to MVP?', ['Reduce', 'Proceed', 'Modify']); + expect(engStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on review-section AUQs', () => { + const f = fp('Architecture: shared mutable state?', ['Refactor', 'Defer', 'Skip']); + expect(engStep0Boundary(f)).toBe(false); + }); + }); + + describe('designStep0Boundary', () => { + test('FIRES on design system / posture mention', () => { + const f = fp('Pick a design posture for this review', ['Polish', 'Triage', 'Expansion']); + expect(designStep0Boundary(f)).toBe(true); + }); + + test('FIRES on first-dimension prompt', () => { + const f = fp('First dimension: visual hierarchy. Score?', ['7', '8', '9']); + expect(designStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on later dimension AUQs', () => { + const f = fp('Spacing dimension score?', ['7', '8', '9']); + expect(designStep0Boundary(f)).toBe(false); + }); + }); + + describe('devexStep0Boundary', () => { + test('FIRES on developer persona selection', () => { + const f = fp('Pick the target persona for this review', ['Senior backend', 'Junior frontend', 'Other']); + expect(devexStep0Boundary(f)).toBe(true); + }); + + test('FIRES on TTHW target prompt', () => { + const f = fp('What is the TTHW target for first run?', ['<5 min', '<15 min', '<30 min']); + expect(devexStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on review-section AUQs', () => { + const f = fp('Friction point: 5-min CI wait. Address?', ['Now', 'Defer', 'Skip']); + expect(devexStep0Boundary(f)).toBe(false); + }); + }); +}); diff --git a/test/helpers/touchfiles.ts b/test/helpers/touchfiles.ts index e0c6ebcf..37a97f1b 100644 --- a/test/helpers/touchfiles.ts +++ b/test/helpers/touchfiles.ts @@ -122,6 +122,15 @@ export const E2E_TOUCHFILES: Record = { 'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'], 'autoplan-chain-pty': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'], 'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'], + + // Per-finding AskUserQuestion count + review-report-at-bottom assertion. + // Each test drives its skill end-to-end; touchfiles include preamble + + // completion-status resolvers because they affect question cadence and + // terminal output (the regression surface this test catches). + 'plan-ceo-finding-count': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-ceo-finding-count.test.ts'], + 'plan-eng-finding-count': ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-eng-finding-count.test.ts'], + 'plan-design-finding-count': ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-design-finding-count.test.ts'], + 'plan-devex-finding-count': ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-devex-finding-count.test.ts'], 'brain-privacy-gate': ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-brain-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'], // AskUserQuestion format regression (RECOMMENDATION + Completeness: N/10) @@ -404,6 +413,15 @@ export const E2E_TIERS: Record = { 'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode 'autoplan-chain-pty': 'periodic', // ~$8/run, all 3 phases sequential + // Per-finding count + review-report-at-bottom — periodic because each + // run drives a full skill end-to-end (~25 min, ~$5/run). Sequential + // execution during calibration; concurrent opt-in only after measured + // comparison agrees (plan §D15). + 'plan-ceo-finding-count': 'periodic', + 'plan-eng-finding-count': 'periodic', + 'plan-design-finding-count': 'periodic', + 'plan-devex-finding-count': 'periodic', + // Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call, // costs ~$0.30-$0.50 per run, not needed on every commit) 'brain-privacy-gate': 'periodic', diff --git a/test/skill-e2e-plan-ceo-finding-count.test.ts b/test/skill-e2e-plan-ceo-finding-count.test.ts new file mode 100644 index 00000000..850c1a03 --- /dev/null +++ b/test/skill-e2e-plan-ceo-finding-count.test.ts @@ -0,0 +1,253 @@ +/** + * /plan-ceo-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Asserts the load-bearing rule "One issue = one AskUserQuestion call" by + * driving /plan-ceo-review against a 5-finding seeded plan and counting + * distinct review-phase AUQs. Passes when count is in [N-1, N+2]. + * + * Two tests in this file: + * - 5-finding distinct fixture: count band assertion + D19 review-report-at-bottom. + * - 2-finding paired control (D12 positive control): related findings still + * produce 2 distinct AUQs, not 1 batched, when the rule is honored. + * + * Tier: periodic. Each run drives Step 0 + 11 review sections end-to-end + * (~25 min, ~$5/run). Sequential by default per plan §D15. See + * test/helpers/claude-pty-runner.ts for runPlanSkillCounting internals. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + ceoStep0Boundary, + assertReviewReportAtBottom, + type AskUserQuestionFingerprint, +} from './helpers/claude-pty-runner'; + +/** + * /plan-ceo-review's first AUQ asks "what scope?" with options like + * 1. Branch diff vs main + * 2. A specific plan file or design doc + * 3. An idea you'll describe inline + * ... + * 7. Skip interview and plan immediately + * + * The default pick (1) routes to "branch diff vs main" — the wrong target + * for our seeded fixture (the agent would review the gstack PR itself, + * recursively). Picking "Skip interview and plan immediately" bypasses + * Step 0 and routes the agent to review the chat context (where our + * follow-up plan was pasted). + */ +function pickSkipInterview(fp: AskUserQuestionFingerprint): number { + const skipOpt = fp.options.find((o) => + /skip\s+interview|plan\s+immediately/i.test(o.label), + ); + if (skipOpt) return skipOpt.index; + // Fallback: "describe inline" also routes to using our pasted plan. + const inlineOpt = fp.options.find((o) => + /describe.*inline|inline.*idea/i.test(o.label), + ); + if (inlineOpt) return inlineOpt.index; + return 1; +} + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N_DISTINCT = 5; +const FLOOR_DISTINCT = N_DISTINCT - 1; // 4 (D11) +const CEILING_DISTINCT = N_DISTINCT + 2; // 7 (D11) + +const N_PAIRED = 2; +const FLOOR_PAIRED = 2; +const CEILING_PAIRED = 4; + +const PLAN_CEO_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo.md (use Edit/Write to that exact path).', + '', + '# Plan: Payment Processing Integration', + '', + '## Architecture', + "We're adding a new `PaymentService` class that will handle Stripe webhooks.", + 'This bypasses the existing `WebhookDispatcher` module — we want a clean', + 'namespace separation.', + '', + '## Database access', + 'The new endpoint reads `request.params.userId` directly into a raw SQL', + 'fragment for the lookup query.', + '', + '## Webhook fan-out', + 'On payment success we update the user record AND fire a notification email.', + 'Both happen inline; no error handling on the email leg.', + '', + '## Tests', + "None planned. We'll rely on the existing integration suite catching regressions.", + '', + '## Performance', + 'Each webhook lookup hits the database for the user, then fetches each', + 'order in a loop.', +].join('\n'); + +const PLAN_CEO_2_PAIRED_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo-paired.md (use Edit/Write to that exact path).', + '', + '# Plan: Payment Processing — Test Coverage', + '', + '## Tests', + 'We need test coverage for `processPayment()`. Specifically:', + '1. The happy path (successful Stripe charge — assert correct receipt is generated).', + '2. The error/timeout path (Stripe returns 502 — assert retry-with-backoff fires once, then fails clean).', + '', + 'Currently neither has a unit test. These are deliberately separate concerns:', + 'the success path is correctness, the failure path is graceful degradation.', +].join('\n'); + +const PLAN_CEO_PATH = '/tmp/gstack-test-plan-ceo.md'; +const PLAN_CEO_PAIRED_PATH = '/tmp/gstack-test-plan-ceo-paired.md'; + +describeE2E('/plan-ceo-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR_DISTINCT}-${CEILING_DISTINCT} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_CEO_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-ceo-review', + slashCommand: '/plan-ceo-review', + followUpPrompt: PLAN_CEO_5_FINDINGS, + isLastStep0AUQ: ceoStep0Boundary, + reviewCountCeiling: CEILING_DISTINCT + 1, // hard cap above assertion ceiling + firstAUQPick: pickSkipInterview, // bypass scope-selection, route to review + cwd: process.cwd(), + timeoutMs: 1_500_000, // 25 min + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-ceo-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR_DISTINCT) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_DISTINCT}.\n` + + `Likely batching regression — agent collapsed multiple findings into fewer questions.\n` + + `Fingerprints (review-phase only):\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING_DISTINCT) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING_DISTINCT}.\n` + + `Possible over-asking regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + + // D19: review report at bottom of plan file. + if (!fs.existsSync(PLAN_CEO_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_CEO_PATH}.\n` + + `Either the agent ignored the path instruction in the follow-up prompt, or\n` + + `the helper exited before the agent wrote the file. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_CEO_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_CEO_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_CEO_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); + + test( + `paired-finding positive control: ${N_PAIRED} related findings produce ${FLOOR_PAIRED}-${CEILING_PAIRED} AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-ceo-review', + slashCommand: '/plan-ceo-review', + followUpPrompt: PLAN_CEO_2_PAIRED_FINDINGS, + isLastStep0AUQ: ceoStep0Boundary, + reviewCountCeiling: CEILING_PAIRED + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `paired-finding control FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount}\n` + + `--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR_PAIRED) { + throw new Error( + `PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_PAIRED}.\n` + + `Two deliberately related findings were batched into <2 questions — the rule failed under D12.\n` + + `Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING_PAIRED) { + throw new Error( + `PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} > CEILING=${CEILING_PAIRED} (over-asking on a 2-finding fixture).`, + ); + } + } finally { + try { + fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/skill-e2e-plan-ceo-mode-routing.test.ts b/test/skill-e2e-plan-ceo-mode-routing.test.ts index 4e85ed64..0199413b 100644 --- a/test/skill-e2e-plan-ceo-mode-routing.test.ts +++ b/test/skill-e2e-plan-ceo-mode-routing.test.ts @@ -37,14 +37,15 @@ import { isPermissionDialogVisible, parseNumberedOptions, isPlanReadyVisible, + MODE_RE, + optionsSignature, + TAIL_SCAN_BYTES, type ClaudePtySession, } from './helpers/claude-pty-runner'; const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; const describeE2E = shouldRun ? describe : describe.skip; -const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i; - interface ModeCase { mode: 'HOLD SCOPE' | 'SCOPE EXPANSION'; /** Regex applied to visible-since-mode-pick text. At least one must match. */ @@ -95,8 +96,8 @@ async function navigateToModeAskUserQuestion( // Has the rendered list changed since last poll? If not, we're seeing // the same prompt and shouldn't double-press. - const sig = opts.map(o => `${o.index}:${o.label}`).join('|'); - const lastSig = lastSeenList.map(o => `${o.index}:${o.label}`).join('|'); + const sig = optionsSignature(opts); + const lastSig = optionsSignature(lastSeenList); if (sig === lastSig) continue; lastSeenList = opts; @@ -115,7 +116,14 @@ async function navigateToModeAskUserQuestion( // Permission dialog? Grant with "1" but don't count it against nav budget. // Classify on the recent tail only — old permission text persists in // visibleSince and would re-trigger forever. - if (isPermissionDialogVisible(visible.slice(-1500))) { + // + // Note: runPlanSkillObservation has its own permission-dialog filter that + // simply skips classification (since it observes, doesn't drive). This nav + // loop drives the PTY directly via launchClaudePty and so owns its own + // dialog handling — granting with "1" so the workflow advances. Both + // paths share TAIL_SCAN_BYTES as the recent-tail window so tuning stays + // in sync. + if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) { session.send('1\r'); await Bun.sleep(1500); continue; diff --git a/test/skill-e2e-plan-ceo-plan-mode.test.ts b/test/skill-e2e-plan-ceo-plan-mode.test.ts index 6e6e0d8f..36d384db 100644 --- a/test/skill-e2e-plan-ceo-plan-mode.test.ts +++ b/test/skill-e2e-plan-ceo-plan-mode.test.ts @@ -1,22 +1,34 @@ /** * plan-ceo-review plan-mode smoke (gate, paid, real-PTY). * - * Asserts: when /plan-ceo-review is invoked in plan mode, the skill reaches - * a terminal outcome that is either: - * - 'asked' — skill emitted its Step 0 numbered prompt (scope mode - * selection, or the routing-injection prompt that runs - * before Step 0) - * - 'plan_ready' — skill ran end-to-end and surfaced claude's native - * "Ready to execute" confirmation + * Asserts: when /plan-ceo-review is invoked in plan mode, the FIRST terminal + * outcome is 'asked' — a skill-question numbered list. Permission dialogs + * (which also render numbered lists) are filtered out by `runPlanSkillObservation` + * via its `isPermissionDialogVisible(visible.slice(-1500))` short-circuit. * - * FAIL conditions: silent Write/Edit before any prompt, claude crash, - * timeout. + * Reaching 'plan_ready' first IS the regression we want to catch: the agent + * skipped Step 0 entirely and went straight to ExitPlanMode. The original + * failure had the assistant read a diff, write a plan with two issues, and + * call ExitPlanMode without ever firing AskUserQuestion — the user had to + * manually call out the missing per-issue questions. * - * Replaces the SDK-based test that never worked: the SDK's canUseTool - * interceptor on AskUserQuestion never fires in plan mode because plan - * mode renders its native confirmation as TTY UI, not via the - * AskUserQuestion tool. The real PTY harness observes the rendered - * terminal output directly. + * Why this skill is special: unlike plan-eng-review / plan-design-review / + * plan-devex-review (whose smokes accept either 'asked' or 'plan_ready'), + * plan-ceo-review's template mandates Step 0A premise challenge (3 baked-in + * questions) AND Step 0F mode selection BEFORE any plan write. There is no + * legitimate path to plan_ready that does not first emit a skill-question + * numbered prompt. + * + * Env passthrough: passes `QUESTION_TUNING=false` and `EXPLAIN_LEVEL=default` + * via the runner's env option. Today these are advisory — `gstack-config` + * reads `~/.gstack/config.yaml`, not env vars, so a contributor with + * `question_tuning: true` set in their YAML config can still see AUTO_DECIDE + * masking. The env passthrough is wired so a future gstack-config change to + * honor env overrides will make this test hermetic without further edits. + * Tracked as a post-merge follow-up. + * + * FAIL conditions: 'plan_ready' first, silent Write/Edit before any prompt, + * claude crash, timeout. * * See test/helpers/claude-pty-runner.ts for runner internals. */ @@ -28,22 +40,34 @@ const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; const describeE2E = shouldRun ? describe : describe.skip; describeE2E('plan-ceo-review plan-mode smoke (gate)', () => { - test('reaches a terminal outcome (asked or plan_ready) without silent writes', async () => { + test('first terminal outcome is asked (Step 0 fires before any plan write)', async () => { const obs = await runPlanSkillObservation({ skillName: 'plan-ceo-review', inPlanMode: true, timeoutMs: 300_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, }); - if (obs.outcome === 'silent_write' || obs.outcome === 'exited' || obs.outcome === 'timeout') { + if (obs.outcome !== 'asked') { + const diagnosis = + obs.outcome === 'plan_ready' + ? `'plan_ready' first means the agent skipped Step 0 entirely and went straight to ExitPlanMode without asking.` + : obs.outcome === 'timeout' + ? `Timeout means the agent neither asked nor completed within the budget — likely hung mid-question or stuck on a permission dialog.` + : obs.outcome === 'silent_write' + ? `Silent Write/Edit fired to an unsanctioned path before any AskUserQuestion — also a Step 0 skip.` + : `Outcome '${obs.outcome}' is unexpected; investigate the evidence below.`; throw new Error( - `plan-ceo-review plan-mode smoke FAILED: outcome=${obs.outcome}\n` + + `plan-ceo-review smoke FAILED: outcome=${obs.outcome}\n` + + `${diagnosis}\n` + + `Expected 'asked'. See plan-ceo-review/SKILL.md.tmpl: the Step 0 STOP rules ` + + `and the "One issue = one AskUserQuestion call" rule under "CRITICAL RULE — ` + + `How to ask questions".\n` + `summary: ${obs.summary}\n` + `elapsed: ${obs.elapsedMs}ms\n` + `--- evidence (last 2KB visible) ---\n${obs.evidence}`, ); } - expect(['asked', 'plan_ready']).toContain(obs.outcome); }, 360_000); // v1.21+ regression: Conductor launches Claude Code with diff --git a/test/skill-e2e-plan-design-finding-count.test.ts b/test/skill-e2e-plan-design-finding-count.test.ts new file mode 100644 index 00000000..ef0d9b68 --- /dev/null +++ b/test/skill-e2e-plan-design-finding-count.test.ts @@ -0,0 +1,135 @@ +/** + * /plan-design-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-design-review + * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2]. + * Plus D19: review report at bottom of produced plan file. + * + * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + designStep0Boundary, + assertReviewReportAtBottom, +} from './helpers/claude-pty-runner'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N = 5; +const FLOOR = N - 1; +const CEILING = N + 2; + +const PLAN_DESIGN_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-design.md (use Edit/Write to that exact path).', + '', + '# Plan: Settings Page UI redesign', + '', + '## Visual Hierarchy', + 'The "Save" button is rendered with the same size, weight, and color as', + 'three other buttons in the page header (Reset, Cancel, Export). Nothing', + 'tells the user which is the primary action.', + '', + '## Spacing', + 'Between sections we have 24px in some places, 32px in others, and 16px', + 'in a third — no consistent vertical rhythm.', + '', + '## Color', + 'The error message uses red text on a light pink background. Contrast', + 'ratio is approximately 3:1 (below WCAG AA).', + '', + '## Typography', + 'We use 14px, 16px, and 18px font sizes across the form labels. Two', + 'sizes would suffice and create stronger hierarchy.', + '', + '## Motion', + 'The "Save" action takes 2-5 seconds with no loading indicator. Users', + 'see a frozen page; we should add a spinner or skeleton state.', +].join('\n'); + +const PLAN_DESIGN_PATH = '/tmp/gstack-test-plan-design.md'; + +describeE2E('/plan-design-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_DESIGN_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-design-review', + slashCommand: '/plan-design-review', + followUpPrompt: PLAN_DESIGN_5_FINDINGS, + isLastStep0AUQ: designStep0Boundary, + reviewCountCeiling: CEILING + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-design-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` + + `Likely batching regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`, + ); + } + + if (!fs.existsSync(PLAN_DESIGN_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_DESIGN_PATH}. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_DESIGN_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_DESIGN_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_DESIGN_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/skill-e2e-plan-devex-finding-count.test.ts b/test/skill-e2e-plan-devex-finding-count.test.ts new file mode 100644 index 00000000..e4b3f8e7 --- /dev/null +++ b/test/skill-e2e-plan-devex-finding-count.test.ts @@ -0,0 +1,135 @@ +/** + * /plan-devex-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-devex-review + * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2]. + * Plus D19: review report at bottom of produced plan file. + * + * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + devexStep0Boundary, + assertReviewReportAtBottom, +} from './helpers/claude-pty-runner'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N = 5; +const FLOOR = N - 1; +const CEILING = N + 2; + +const PLAN_DEVEX_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-devex.md (use Edit/Write to that exact path).', + '', + '# Plan: Public SDK Beta Launch', + '', + '## Persona', + "The plan doesn't specify which developer persona is the target — we're", + "shipping for \"everyone,\" which means we tune for nobody.", + '', + '## TTHW (time to hello world)', + 'Time-to-hello-world is not measured. No benchmark data referenced. We', + "don't know if first-run takes 5 minutes or 50.", + '', + '## Friction Point', + 'First-run currently requires a 5-minute mandatory CI step before the', + 'developer can run their first eval. There is no way to skip it.', + '', + '## Magical Moment', + 'Getting-started flow has no delight beat. Pure documentation, no', + 'interactive demo, no "ah-ha" moment that makes the developer trust us.', + '', + '## Competitive Blind Spot', + "The plan doesn't reference how peer SDKs (LangChain, Semantic Kernel,", + 'OpenAI) handle this DX surface. We may be reinventing worse versions', + 'of solved problems.', +].join('\n'); + +const PLAN_DEVEX_PATH = '/tmp/gstack-test-plan-devex.md'; + +describeE2E('/plan-devex-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_DEVEX_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-devex-review', + slashCommand: '/plan-devex-review', + followUpPrompt: PLAN_DEVEX_5_FINDINGS, + isLastStep0AUQ: devexStep0Boundary, + reviewCountCeiling: CEILING + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-devex-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` + + `Likely batching regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`, + ); + } + + if (!fs.existsSync(PLAN_DEVEX_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_DEVEX_PATH}. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_DEVEX_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_DEVEX_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_DEVEX_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/skill-e2e-plan-eng-finding-count.test.ts b/test/skill-e2e-plan-eng-finding-count.test.ts new file mode 100644 index 00000000..93b8ba68 --- /dev/null +++ b/test/skill-e2e-plan-eng-finding-count.test.ts @@ -0,0 +1,134 @@ +/** + * /plan-eng-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-eng-review + * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2]. + * Plus D19: review report at bottom of produced plan file. + * + * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + engStep0Boundary, + assertReviewReportAtBottom, +} from './helpers/claude-pty-runner'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N = 5; +const FLOOR = N - 1; // 4 +const CEILING = N + 2; // 7 + +const PLAN_ENG_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-eng.md (use Edit/Write to that exact path).', + '', + '# Plan: Multi-tenant Auth Refactor', + '', + '## Architecture', + 'Two new services (`AuthBroker` and `SessionMint`) share a global mutable', + '`AuthCache` instance via module-level export. Both services mutate it.', + '', + '## Code quality', + 'The `validateAndDispatch()` function is 60 lines with three nested', + 'try/catch blocks; each catch swallows a different error class.', + '', + '## Tests', + 'The existing `legacyAuthFlow()` will get rewritten as part of this work;', + 'no regression test for the prior behavior is planned.', + '', + '## Performance', + 'Token validation issues 5 sequential API calls to the IDP; they could be', + 'parallelized via Promise.all trivially (calls are independent).', + '', + '## Architecture (scope smell)', + 'This touches 12 files and introduces 4 new classes (TokenStore,', + 'SessionMint, AuthCache, RequestPolicy). Worth flagging the complexity check.', +].join('\n'); + +const PLAN_ENG_PATH = '/tmp/gstack-test-plan-eng.md'; + +describeE2E('/plan-eng-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_ENG_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-eng-review', + slashCommand: '/plan-eng-review', + followUpPrompt: PLAN_ENG_5_FINDINGS, + isLastStep0AUQ: engStep0Boundary, + reviewCountCeiling: CEILING + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-eng-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` + + `Likely batching regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`, + ); + } + + if (!fs.existsSync(PLAN_ENG_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_ENG_PATH}. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_ENG_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_ENG_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_ENG_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/touchfiles.test.ts b/test/touchfiles.test.ts index ff7f018e..8fb66161 100644 --- a/test/touchfiles.test.ts +++ b/test/touchfiles.test.ts @@ -97,12 +97,14 @@ describe('selectTests', () => { expect(result.selected).toContain('ask-user-question-format-pty'); expect(result.selected).toContain('plan-ceo-mode-routing'); expect(result.selected).toContain('autoplan-chain-pty'); - // v1.21+ regression: autoplan-auto-mode + auto-decide-preserved also - // depend on plan-ceo-review/** + // Per-finding count + review-report-at-bottom (v1.21.x) + expect(result.selected).toContain('plan-ceo-finding-count'); + // v1.22+ AskUserQuestion-blocked regression: autoplan-auto-mode + + // auto-decide-preserved also depend on plan-ceo-review/** expect(result.selected).toContain('autoplan-auto-mode'); expect(result.selected).toContain('auto-decide-preserved'); - expect(result.selected.length).toBe(20); - expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 20); + expect(result.selected.length).toBe(21); + expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 21); }); test('global touchfile triggers ALL tests', () => {