From 454423aeb3d3dafa88d5b57bfbe0ead05569d21e Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Thu, 30 Apr 2026 02:50:09 -0700 Subject: [PATCH] v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * test: extract classifyVisible() + permission-dialog filter in PTY runner Pure classifier extracted from runPlanSkillObservation's polling loop so unit tests can exercise the actual branch order with synthetic input strings. Runner gains: - env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty). gstack-config does not yet honor env overrides; plumbing is in place for a future change to make tests hermetic. - TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays in sync. - isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now requires a file-edit context co-trigger. Other clauses unchanged. Skill questions that contain the bare phrase are no longer mis-classified. - classifyVisible(visible): pure function. Branch order silent_write → plan_ready → asked → null. Permission dialogs filtered out of the 'asked' classification so a permission prompt cannot pose as a Step 0 skill question. Adds 24 unit tests covering all classifier branches, edge cases, and the co-trigger contract. * test: tighten plan-ceo-review smoke to require Step 0 fires first Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching plan_ready first means the agent skipped Step 0 entirely and went straight to ExitPlanMode — the regression we want to catch. Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex (whose smokes legitimately reach plan_ready on certain branches without asking), plan-ceo-review's template mandates Step 0A premise challenge plus Step 0F mode selection BEFORE any plan write. There is no legitimate path to plan_ready that does not first emit a skill-question numbered prompt. Failure message now branches on outcome (plan_ready vs timeout vs silent_write) with a tailored diagnosis line per case. References the skill template by section name ("Step 0 STOP rules", "One issue = one AskUserQuestion call") instead of line numbers, so it survives template edits. Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' } through the runner. Today this is advisory — gstack-config reads only ~/.gstack/config.yaml, not env vars — but the wiring is in place for a future change. Documented honestly in the docstring. Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS. * chore: capture v1.21.1.0 follow-ups in TODOS.md - P2: per-finding AskUserQuestion count assertion (V2) - P3: honor env vars in gstack-config so test isolation env actually works - P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS All three surfaced during the v1.21.1.0 plan-eng-review and adversarial review passes. Captured here so the design intent persists. * chore: bump version and changelog (v1.21.1.0) Co-Authored-By: Claude Opus 4.7 (1M context) * test: extract MODE_RE + optionsSignature into PTY runner exports Refactor prep for the upcoming per-finding AskUserQuestion count test across plan-{ceo,eng,design,devex}-review. Both new tests and the existing mode-routing test need the same mode regex and the same option-list fingerprint dedupe — pulling them into one source of truth in test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the fingerprint shape) updates everywhere instead of drifting per-test. Mechanical: no behavior change in the mode-routing test. Co-Authored-By: Claude Opus 4.7 (1M context) * test: add per-finding count primitives + unit tests Pure helpers landing ahead of runPlanSkillCounting: - parseQuestionPrompt(visible) — extract the 1-3 line prompt above the latest "❯ 1." cursor, normalize to a 240-char snippet - auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted options signature; distinct prompts with shared option labels (the generic A/B/C TODO menu) get distinct fingerprints - COMPLETION_SUMMARY_RE — terminal-signal regex matching all four plan-review skills' completion / verdict markers - assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW REPORT" is present and is the last "## " heading in a plan file - Step0BoundaryPredicate type + four per-skill predicates (ceo / eng / design / devex) — fire on the answered AUQ's fingerprint, marking the end of Step 0 deterministically (event-based, not content-based, per Codex F7) Plus 37 deterministic unit tests covering option-label collision regression, prompt extraction edge cases, predicate positive AND negative cases, and review-report-at-bottom triple-check (missing / mid-file / multiple trailing). Co-Authored-By: Claude Opus 4.7 (1M context) * test: add runPlanSkillCounting PTY helper Drives a plan-* skill end-to-end and counts distinct review-phase AskUserQuestions. Composes the primitives from the previous commit: - Boot + auto-trust handler (existing launchClaudePty) - Send slash command alone, sleep 3s, send plan content as follow-up message (proven pattern from skill-e2e-plan-design-with-ui) - Poll loop with permission-dialog auto-grant, same-redraw skip, empty-prompt re-poll - Event-based Step-0 boundary via isLastStep0AUQ predicate fired on the answered AUQ's fingerprint (Codex F7 — boundary is observed event, not later rendered content) - Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE, plan_ready, silent_write, exited, timeout Empty-prompt fingerprints are skipped per the contract documented in auqFingerprint's unit tests — fingerprinting them would re-introduce the option-label collision regression Codex F1 caught. No E2E tests yet — those land in commit 5 with the four skill fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) * test: register four finding-count tests in touchfiles + tier map Each new test depends on its skill template, the runner, and three preamble resolvers (preamble.ts, generate-ask-user-format.ts, generate-completion-status.ts) — those affect question cadence and completion rendering, which is exactly what the test asserts on. All four classified periodic. Sequential execution during calibration; opt-in to concurrent only after measured comparison agrees (plan §D15). Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests (was 18) because plan-ceo-finding-count joins the family. Co-Authored-By: Claude Opus 4.7 (1M context) * test: add four per-finding count E2E tests (plan-ceo + eng + design + devex) Each test drives its plan-* skill through Step 0 then asserts the review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5 seeded plan, plus D19: produced plan file ends with "## GSTACK REVIEW REPORT" as its last "## " heading. plan-ceo also runs a paired-finding positive control: 2 deliberately related findings should still produce 2 distinct AUQs, not 1 batched. Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic). Sequential execution by plan §D15. Each fixture is inline TypeScript content delivered as a follow-up message after the slash command, per the proven pattern at skill-e2e-plan-design-with-ui.test.ts. Calibration loop (5 runs per skill) and the manual pre-merge negative check (D7 + D12) are required before merge per plan §Verification. NOT yet run. Co-Authored-By: Claude Opus 4.7 (1M context) * test: fix parseNumberedOptions for inline-cursor box-layout AUQs Calibration run 1 timed out with step0=0 review=0 because the parser could not find the cursor in /plan-ceo-review's scope-selection AUQ. The TTY's box-layout rendering inlines divider + header + prompt + "1." onto one logical line — cursor escapes get stripped, leaving text crushed onto a single line. Cursor anchor regex changed from anchored to unanchored so it matches mid-line. Cursor-line option extraction uses a non-anchored regex; subsequent options stay with the original start-of-line parser. parseQuestionPrompt picks up the inline prompt text BEFORE the cursor on the cursor line (after stripping box-drawing chars + sigil) and appends it after any walked-up multi-line prompt above. Three new unit tests: clean-cursor still works, inline-cursor extracts all 7 options, prompt extraction strips box chars. Co-Authored-By: Claude Opus 4.7 (1M context) * test: add firstAUQPick + plan-ceo skip-interview routing Calibration run 1 surfaced a second issue beyond the parser bug: the default pick of 1 on /plan-ceo-review's scope-selection AUQ routes the agent to "branch diff vs main" — so it reviews the gstack PR itself (recursive!) instead of the seeded fixture plan we sent. Added firstAUQPick callback to runPlanSkillCounting. Override applies only to the FIRST AUQ; subsequent presses keep using defaultPick. ceoStep0Boundary now fires on either the mode-pick AUQ (existing path) or any AUQ containing "Skip interview and plan immediately" — which is the scope-selection AUQ. Picking that option bypasses Step 0 and routes straight to review-phase using the chat-paste plan as context. Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the "Skip interview" option by label. Falls back to "describe inline" if the option labels change. Two new unit tests: ceoStep0Boundary fires on the scope-selection fixture; existing mode-pick fixture still fires. Co-Authored-By: Claude Opus 4.7 (1M context) --------- Co-authored-by: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 46 ++ TODOS.md | 50 ++ VERSION | 2 +- package.json | 2 +- test/helpers/claude-pty-runner.ts | 741 +++++++++++++++-- test/helpers/claude-pty-runner.unit.test.ts | 749 ++++++++++++++++++ test/helpers/touchfiles.ts | 18 + test/skill-e2e-plan-ceo-finding-count.test.ts | 253 ++++++ test/skill-e2e-plan-ceo-mode-routing.test.ts | 18 +- test/skill-e2e-plan-ceo-plan-mode.test.ts | 60 +- ...kill-e2e-plan-design-finding-count.test.ts | 135 ++++ ...skill-e2e-plan-devex-finding-count.test.ts | 135 ++++ test/skill-e2e-plan-eng-finding-count.test.ts | 134 ++++ test/touchfiles.test.ts | 6 +- 14 files changed, 2271 insertions(+), 78 deletions(-) create mode 100644 test/helpers/claude-pty-runner.unit.test.ts create mode 100644 test/skill-e2e-plan-ceo-finding-count.test.ts create mode 100644 test/skill-e2e-plan-design-finding-count.test.ts create mode 100644 test/skill-e2e-plan-devex-finding-count.test.ts create mode 100644 test/skill-e2e-plan-eng-finding-count.test.ts diff --git a/CHANGELOG.md b/CHANGELOG.md index 40ca6cbc..2ee05133 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,51 @@ # Changelog +## [1.21.1.0] - 2026-04-28 + +## **plan-ceo-review smoke tightens. The "agent skips Step 0 and ships a plan" regression now fails the gate.** + +The v1.15.0.0 real-PTY harness shipped with a smoke that accepted either `'asked'` or `'plan_ready'` as success. That OR was too lax for `/plan-ceo-review` specifically: the skill template mandates Step 0A premise challenge plus Step 0F mode selection BEFORE any plan write, so reaching `plan_ready` first IS the regression. This release tightens the assertion to `'asked'` only for that smoke, and refactors the runner so the contract is testable in <1s instead of $0.50 of stochastic PTY. + +### The numbers that matter + +Numbers come from `git diff --shortstat origin/main..HEAD` and `bun test test/helpers/claude-pty-runner.unit.test.ts` on a clean tree. + +| Metric | Δ | +|---|---| +| Net branch size vs main | +162 / −65 lines (3 files) | +| New unit tests added | **+24** (claude-pty-runner.unit.test.ts) | +| Unit suite runtime | **14ms** (deterministic, free-tier) | +| Real-PTY gate runs verified | **4 clean PTY runs** (3 lock-in + 1 post-refactor) | +| Outcome assertions covered | **5/5** (was 3/5; `plan_ready` is now FAIL for plan-ceo) | +| Reviewers run on this PR | plan-eng-review (CLEARED) + codex consult + 2 specialists + adversarial | + +### What this means for builders + +Three new classes of harness regression are now caught deterministically in the free tier instead of waiting on a $0.50 stochastic PTY run. The classifier is extracted into a pure `classifyVisible()` function so reordering branches in the polling loop fails the unit tests instead of silently shipping. Permission dialogs (which render numbered lists) are filtered out of the `'asked'` classification so a permission prompt cannot pose as a Step 0 skill question. The bare phrase `Do you want to proceed?` no longer triggers permission detection on its own — it now requires a file-edit context co-trigger, so a skill question that contains the phrase isn't mis-classified. + +For `/plan-ceo-review` specifically: any future preamble slim-down or template edit that lets the agent skip Step 0 and write a plan will fail the gate before the PR ships. Pull, run `bun test`, and the harness layer is provably tighter without you having to spend a token. + +### Itemized changes + +#### Added + +- `test/helpers/claude-pty-runner.unit.test.ts`: 24 deterministic tests covering `isPermissionDialogVisible` (with the new co-trigger contract), `isNumberedOptionListVisible`, `parseNumberedOptions`, and the new `classifyVisible()` runtime path. Free-tier, runs on every `bun test`. +- `classifyVisible(visible)` in `claude-pty-runner.ts`: pure classifier extracted from the polling loop. Returns `{ outcome, summary } | null`. Branch order: silent_write → plan_ready → asked → null (with permission-dialog filter). Live-state branches (process exited, "Unknown command") stay in the runner. +- `TAIL_SCAN_BYTES = 1500` exported constant. Shared between `runPlanSkillObservation` and the routing test's nav loop so tuning stays in sync. +- `env?: Record` option on `runPlanSkillObservation`, threaded to `launchClaudePty`. Plumbing for future env-driven test isolation (gstack-config does not yet honor env overrides; tracked as post-merge follow-up). + +#### Changed + +- `test/skill-e2e-plan-ceo-plan-mode.test.ts`: assertion narrowed from `['asked', 'plan_ready']` to `'asked'` only. Failure message now branches on `outcome` (plan_ready vs timeout vs silent_write) with a tailored diagnosis line, and references skill-template section names instead of line numbers (durable to template edits). +- `isPermissionDialogVisible`: bare `Do you want to proceed?` now requires a file-edit context co-trigger (`Edit to ` or `Write to `). Other clauses (`requested permissions to`, `allow all edits`, `always allow access to`, `Bash command requires permission`) remain unconditional. +- `test/skill-e2e-plan-ceo-mode-routing.test.ts`: replaces the local `1500` magic number with the shared `TAIL_SCAN_BYTES` constant. + +#### For contributors + +- The runner change is additive and the existing sibling smokes (`plan-eng`, `plan-design`, `plan-devex`, `plan-mode-no-op`) keep their loose `['asked', 'plan_ready']` assertion. Their behavior is unchanged. +- Post-merge follow-ups captured in `TODOS.md`: per-finding AskUserQuestion count assertion (V2), env-driven gstack-config overrides (so `QUESTION_TUNING=false` actually isolates the test), path-confusion hardening on `SANCTIONED_WRITE_SUBSTRINGS`. + + ## [1.20.0.0] - 2026-04-28 ## **Browser-skills land. `/scrape ` first call drives the page; second call runs the codified script in 200ms.** diff --git a/TODOS.md b/TODOS.md index 953d2872..0a68ac05 100644 --- a/TODOS.md +++ b/TODOS.md @@ -213,6 +213,56 @@ scope of that PR; deliberately deferred to keep PTY-import small. ## Testing +## P2: Per-finding AskUserQuestion count assertion for /plan-ceo-review + +**What:** PTY E2E test that drives /plan-ceo-review through Step 0 with a stable fixture diff containing N known findings, asserts that exactly N distinct AskUserQuestions fire (one per finding) before plan_ready. + +**Why:** The skill template repeats "One issue = one AskUserQuestion call. Never combine multiple issues into one question." at every review checkpoint. No test enforces it. The current `skill-e2e-plan-ceo-plan-mode.test.ts` smoke (post-v1.21.1.0) only catches "agent skipped Step 0 entirely." Batching findings into one question slips through silently. + +**Pros:** Locks in the strongest contract the skill mandates. Catches a real failure mode (the original attachment showed 2 findings batched as 0 questions). +**Cons:** Needs a stable fixture diff to keep finding count deterministic (~1 day human / ~30 min CC). Opus may reasonably consolidate two related findings, so the assertion needs a forgiving lower bound (e.g., `>= ceil(N * 0.6)`) rather than strict equality. + +**Context:** The PTY harness (`runPlanSkillObservation`) returns at first terminal outcome — for V2 we need a streaming variant that counts AskUserQuestions across the whole session up to `plan_ready`. Probably a new helper alongside `runPlanSkillObservation`. + +**Depends on:** Stable fixture diff (`test/fixtures/plans/multi-finding.diff` or similar) with a small known set of issues that triggers all 4 review sections. + +**Priority:** P2. +**Effort:** S (CC: ~30 min once fixture exists). Captured from v1.21.1.0 plan-eng-review D2. + +--- + +## P3: Honor env vars in gstack-config (so QUESTION_TUNING/EXPLAIN_LEVEL actually isolate tests) + +**What:** `gstack-config get ` reads `~/.gstack/config.yaml`. `runPlanSkillObservation` plumbs `env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }` through to the spawned `claude` process — but the skill preamble bash uses `gstack-config get question_tuning`, which never looks at env. The env passthrough is theater on current code. + +**Why:** Without env honoring, the v1.21.1.0 plan-ceo-review smoke is still flaky on machines with `question_tuning: true` set in YAML. AUTO_DECIDE preferences would skip the rendered AskUserQuestion list, masking the regression we want to catch. + +**Pros:** Makes the gate test hermetic across machines. The env wiring is already in place — only `gstack-config` needs to read env first, fall back to YAML. +**Cons:** Touches the gstack-config binary across all 3 platforms (linux/darwin/windows). Cross-binary refactor. + +**Context:** Captured from v1.21.1.0 adversarial review. Documented honestly in the test docstring as a known limitation. + +**Priority:** P3. +**Effort:** S. Single-file edit to `bin/gstack-config` (~10 LOC for env-first lookup). + +--- + +## P3: Path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS + +**What:** `runPlanSkillObservation`'s silent-write detector uses substring matching on a few sanctioned paths (`.gstack/`, `CHANGELOG.md`, `TODOS.md`, etc). A write to `node_modules/some-pkg/CHANGELOG.md` or `src/foo/.gstack/leak.ts` is currently sanctioned because the substring matches anywhere in the path. + +**Why:** Defensive — no current bug exploits this, but a malicious skill or fixture could write to a path that happens to contain `.gstack/` or `CHANGELOG.md` and slip past silent-write detection. + +**Pros:** Hardens the harness against future skill misbehavior. Aligns substring rules with their intent. +**Cons:** Need to anchor against absolute prefixes (`os.homedir() + '/.gstack/'`, worktree root) which makes the test less portable across machines. + +**Context:** Captured from v1.21.1.0 adversarial review (HIGH/FIXABLE finding, pre-existing). Refactored into a `SANCTIONED_WRITE_SUBSTRINGS` constant in v1.21.1.0 but the substring-includes logic is unchanged from before. + +**Priority:** P3. +**Effort:** S. + +--- + ## P1: Structural STOP-Ask forcing function across all skills **What:** Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired " or "Write to "). + * The standalone permission signatures (`requested permissions to`, + * `allow all edits`, `always allow access to`, `Bash command requires permission`) + * remain unconditional. */ export function isPermissionDialogVisible(visible: string): boolean { - return ( - /requested\s+permissions?\s+to/i.test(visible) || - /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) || - // "Yes / Yes, allow all edits / No" shape rendered by Claude Code for - // file-edit permission grants. The middle option's "allow all" phrase - // is the unique signature. - /\ballow\s+all\s+edits\b/i.test(visible) || - // "Yes, and always allow access to " shape (workspace trust). - /always\s+allow\s+access\s+to/i.test(visible) || - // Bash command permission prompts. - /Bash\s+command\s+.*\s+requires\s+permission/i.test(visible) - ); + // Standalone signatures — high specificity, never appear in skill questions. + if (/requested\s+permissions?\s+to/i.test(visible)) return true; + // "Yes / Yes, allow all edits / No" shape — file-edit permission grants. + if (/\ballow\s+all\s+edits\b/i.test(visible)) return true; + // "Yes, and always allow access to " shape — workspace trust. + if (/always\s+allow\s+access\s+to/i.test(visible)) return true; + // Bash command permission prompts. + if (/Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)) return true; + // "Do you want to proceed?" only counts as a permission dialog when paired + // with a file-edit context. Skill questions can use the bare phrase. + if ( + /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) && + /(Edit|Write)\s+to\s+\S+/i.test(visible) + ) { + return true; + } + return false; } /** Detect any AskUserQuestion-shaped numbered option list with cursor. */ @@ -211,12 +234,14 @@ export function parseNumberedOptions( // this, parseNumberedOptions returns stale options after the dialog is // dismissed. const lines = tail.split('\n'); - // Anchor on the LAST `❯ 1.` line (cursor is on option 1 of the active - // AskUserQuestion). Greedy character classes don't help here — we need a literal - // `❯` after optional leading whitespace. + // Anchor on the LAST line containing `❯1.` ANYWHERE on the line. + // The /plan-*-review skill's box-layout AUQ uses TTY cursor-positioning + // escapes that stripAnsi removes — leaving the cursor `❯1.` mid-line, + // after dividers + header + prompt text on the same logical line. The + // earlier `^\s*❯` anchor missed those entirely. let cursorLineIdx = -1; for (let i = lines.length - 1; i >= 0; i--) { - if (/^\s*❯\s*1\./.test(lines[i] ?? '')) { + if (/❯\s*1\./.test(lines[i] ?? '')) { cursorLineIdx = i; break; } @@ -236,7 +261,37 @@ export function parseNumberedOptions( if (cursorLineIdx < 0) return []; const found: Array<{ index: number; label: string }> = []; const seenIndices = new Set(); - for (let i = cursorLineIdx; i < lines.length; i++) { + + // Cursor line: option 1 may be inline after box dividers + prompt header + // (`...divider...header...❯1. label`). Use a non-anchored regex that + // captures `❯N. label` from anywhere on the line through end-of-line. + // Only used for the cursor line — subsequent options are parsed with the + // start-of-line `optionRe`. + const cursorLine = lines[cursorLineIdx] ?? ''; + const cursorInlineRe = /❯\s*([1-9])\.\s*(\S.*?)\s*$/; + const inlineMatch = cursorInlineRe.exec(cursorLine); + if (inlineMatch) { + const idx = Number(inlineMatch[1]); + const label = (inlineMatch[2] ?? '').trim(); + if (label.length > 0 && !seenIndices.has(idx)) { + seenIndices.add(idx); + found.push({ index: idx, label }); + } + } else { + // No inline cursor match — fall back to start-of-line regex. + const startMatch = optionRe.exec(cursorLine); + if (startMatch) { + const idx = Number(startMatch[1]); + const label = (startMatch[2] ?? '').trim(); + if (label.length > 0 && !seenIndices.has(idx)) { + seenIndices.add(idx); + found.push({ index: idx, label }); + } + } + } + + // Subsequent lines: standard start-of-line option parsing. + for (let i = cursorLineIdx + 1; i < lines.length; i++) { const m = optionRe.exec(lines[i] ?? ''); if (!m) continue; const idx = Number(m[1]); @@ -261,6 +316,333 @@ export function parseNumberedOptions( return found; } +/** + * The four /plan-ceo-review modes. Used by `skill-e2e-plan-ceo-mode-routing` + * to detect Step 0F mode-selection AskUserQuestions, and by the upcoming + * finding-count tests as a Step-0 boundary signal: an AUQ whose options + * match this regex IS the mode pick (the last Step-0 question for plan-ceo). + * + * Lifted out of the mode-routing test so multiple PTY tests can share one + * source of truth — when /plan-ceo-review adds a fifth mode, one regex updates + * everywhere instead of drifting per-test. + */ +export const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i; + +/** + * Stable signature for a parsed numbered-option list — used by tests to detect + * "is this AUQ the same as the last poll, or has the agent advanced to a new + * one?" Joins each option as `${index}:${label}` after sorting by index. + * + * Defensive sort means the signature is order-independent at the input level, + * even though `parseNumberedOptions` already returns indices in ascending order. + */ +export function optionsSignature( + opts: Array<{ index: number; label: string }>, +): string { + return [...opts] + .sort((a, b) => a.index - b.index) + .map((o) => `${o.index}:${o.label}`) + .join('|'); +} + +/** + * Pure classifier for the visible TTY buffer. Decides which outcome the + * polling loop should return on this tick, or `null` to keep polling. + * + * Extracted from `runPlanSkillObservation` so the unit suite can exercise + * the actual branch order with synthetic input strings — a future contributor + * who reorders the branches (e.g., moves the permission short-circuit) gets + * caught by the unit tests, not by a stochastic E2E run. + * + * Live-state branches (process exited, "Unknown command") stay in the runner + * since they need the session handle. + */ +export type ClassifyResult = + | { outcome: 'silent_write'; summary: string } + | { outcome: 'plan_ready'; summary: string } + | { outcome: 'asked'; summary: string } + | null; + +const SANCTIONED_WRITE_SUBSTRINGS = [ + '.claude/plans', + '.gstack/', + '/.context/', + 'CHANGELOG.md', + 'TODOS.md', +]; + +export function classifyVisible(visible: string): ClassifyResult { + // Silent-write detection: any Write/Edit tool render that targets a path + // OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen + // (a numbered prompt means a permission/AskUserQuestion is gating the write, + // not an actual silent write). + const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g; + let m: RegExpExecArray | null; + while ((m = writeRe.exec(visible)) !== null) { + const target = m[1] ?? ''; + const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s)); + if (!sanctioned && !isNumberedOptionListVisible(visible)) { + return { + outcome: 'silent_write', + summary: `Write/Edit to ${target} fired before any AskUserQuestion`, + }; + } + } + if (isPlanReadyVisible(visible)) { + return { + outcome: 'plan_ready', + summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation', + }; + } + if (isNumberedOptionListVisible(visible)) { + // Permission dialogs render numbered lists too. Skip them — the + // bug we want to catch is "skill question never fired." + if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) { + return null; + } + return { + outcome: 'asked', + summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)', + }; + } + return null; +} + +// ──────────────────────────────────────────────────────────────────────────── +// Per-finding AskUserQuestion count primitives (used by runPlanSkillCounting). +// +// These are pure helpers extracted up-front so the unit suite can exercise +// them deterministically before the live-PTY counter runs them. Each one is +// independently unit-testable against synthetic visible-buffer strings. +// ──────────────────────────────────────────────────────────────────────────── + +/** + * Captured identity of an AskUserQuestion — the rendered question text plus + * its numbered options. Used by `runPlanSkillCounting` to dedupe redrawn + * prompts and to feed `Step0BoundaryPredicate` callers. + * + * `signature` is the stable hash. Two AUQs with identical prompt + options + * produce the same signature; differences in either field produce different + * signatures. Critically: two AUQs with shared option labels (e.g. the + * generic "A) Add to plan / B) Defer / C) Build now" menu) but different + * question text get DIFFERENT signatures because the prompt is in the hash. + */ +export interface AskUserQuestionFingerprint { + /** Stable hash combining normalized prompt text + options signature. */ + signature: string; + /** First 240 chars of the rendered question prompt (post-normalization). */ + promptSnippet: string; + /** Captured option labels, in index order. */ + options: Array<{ index: number; label: string }>; + /** Wall-clock when first observed (ms since the helper started polling). */ + observedAtMs: number; + /** True if observed BEFORE the Step-0 boundary fired. */ + preReview: boolean; +} + +/** + * Predicate fired against the AUQ we just answered (not the visible buffer). + * Returns true if this AUQ's fingerprint marks the LAST Step-0 question for + * its skill — all subsequent AUQs are review-phase findings. + * + * Event-based by design: matching against an answered AUQ's fingerprint + * (prompt + options) is deterministic, whereas matching against later + * rendered content (section headers, summary text) races with the agent's + * output cadence. See plan §D14 for the rationale. + */ +export type Step0BoundaryPredicate = ( + answeredFingerprint: AskUserQuestionFingerprint, +) => boolean; + +/** + * Parse the rendered question prompt out of a visible TTY buffer. The prompt + * is the 1–3 lines of text immediately ABOVE the latest `❯ 1.` cursor line — + * not part of the option list, not the permission-dialog header. + * + * Returns the prompt normalized to a single-spaced 240-char snippet (strip + * ANSI residue, collapse internal whitespace, trim) — short enough to use as + * a hash key, long enough to disambiguate distinct questions. + * + * Returns "" when no prompt could be parsed (cursor not yet rendered, or + * cursor is at the top of the buffer with no preceding text). Callers that + * use the empty string as a fingerprint input should treat empty-prompt + * AUQs as "wait one more poll" rather than fingerprinting them — otherwise + * the same options + empty prompt across two distinct questions collide. + */ +export function parseQuestionPrompt(visible: string): string { + // Tail-only — older prompts higher in the buffer are stale. + const tail = visible.length > 4096 ? visible.slice(-4096) : visible; + const lines = tail.split('\n'); + + // Find the latest line containing `❯1.` (matching parseNumberedOptions — + // unanchored to handle the box-layout case where cursor is mid-line after + // divider + header + prompt text on the same logical line). + let cursorLineIdx = -1; + for (let i = lines.length - 1; i >= 0; i--) { + if (/❯\s*1\./.test(lines[i] ?? '')) { + cursorLineIdx = i; + break; + } + } + if (cursorLineIdx < 0) return ''; + + // Box-layout case: prompt text may be ON the cursor line, BEFORE `❯1.`. + // Extract that prefix (after stripping leading box-drawing characters and + // dividers) as the last piece of the prompt — appended after any prior + // multi-line prompt text we walk up to find. + const cursorLine = lines[cursorLineIdx] ?? ''; + let inlinePrompt = ''; + const cursorPos = cursorLine.search(/❯\s*1\./); + if (cursorPos > 0) { + inlinePrompt = cursorLine + .slice(0, cursorPos) + // Strip box-drawing chars + dividers + leading checkbox sigil. + .replace(/^[─━┄┅┈┉─┌┐└┘├┤┬┴┼│┃☐□■\s]+/, '') + .trim(); + } + + // Walk up at most 6 lines collecting prompt text. Stop at: + // - a blank line preceded by another blank line (paragraph break) + // - top of buffer + // - a line that itself starts with `N.` (we're inside an option list) + const promptLines: string[] = []; + let blankRun = 0; + for (let i = cursorLineIdx - 1; i >= 0 && promptLines.length < 6; i--) { + const raw = lines[i] ?? ''; + const trimmed = raw.trim(); + if (trimmed === '') { + blankRun += 1; + if (blankRun >= 2 && promptLines.length > 0) break; + continue; + } + blankRun = 0; + // Stop if we hit what looks like a previous numbered list. + if (/^[\s❯]*[1-9]\.\s+\S/.test(raw)) break; + promptLines.unshift(trimmed); + } + + const all = inlinePrompt.length > 0 ? [...promptLines, inlinePrompt] : promptLines; + const joined = all.join(' ').replace(/\s+/g, ' ').trim(); + return joined.slice(0, 240); +} + +/** + * Stable hash for an AskUserQuestion's identity — combines normalized prompt + * text with the options signature so two distinct questions with shared menu + * labels (the generic A/B/C TODO-proposal menu, for instance) get different + * fingerprints. + * + * Uses Bun's fast non-crypto hash since these strings are short and we only + * need collision resistance against accidental TTY redraws, not adversaries. + * Hex-encoded for diagnostic dumps. + */ +export function auqFingerprint( + promptSnippet: string, + opts: Array<{ index: number; label: string }>, +): string { + const normalized = promptSnippet.replace(/\s+/g, ' ').trim(); + const sig = optionsSignature(opts); + // eslint-disable-next-line @typescript-eslint/no-explicit-any + return (Bun as any).hash(normalized + '||' + sig).toString(16); +} + +/** + * Detects when a plan-* skill has reached its Completion Summary / Review + * Report — a terminal signal complementary to plan-mode's "Ready to execute" + * confirmation. Each plan-review skill writes one of these phrasings near + * the end of its run; matching any one is enough to stop counting. + * + * Best-effort: this is a content marker, not a deterministic event. Hard + * ceiling (`reviewCountCeiling` in `runPlanSkillCounting`) is the reliable + * stop signal; this regex is the "we're done, go gracefully" hint. + */ +export const COMPLETION_SUMMARY_RE = + /(GSTACK REVIEW REPORT|## Completion [Ss]ummary|Status:\s*(clean|issues_open)|^VERDICT:)/m; + +/** + * Result of asserting that a plan file ends with `## GSTACK REVIEW REPORT` + * as its last `## ` heading. `ok` is true iff the report is present AND no + * other `## ` heading appears after it. Diagnostic fields are populated only + * on failure to keep the success path cheap. + */ +export interface ReviewReportAtBottomResult { + ok: boolean; + reason?: string; + trailingHeadings?: string[]; +} + +/** + * Assert that `## GSTACK REVIEW REPORT` is the last `## ` heading in a plan + * file's content. Pure string operation — no filesystem access. Used by the + * finding-count E2E tests as a second assertion on each test's produced plan. + * + * The plan-mode skill template mandates the agent move/append the review + * report so it's always the last `##` section. A regression where the agent + * appends additional sections after the report (or skips it entirely) ships + * silently today; this assertion catches both. + */ +export function assertReviewReportAtBottom( + content: string, +): ReviewReportAtBottomResult { + const re = /^## GSTACK REVIEW REPORT\s*$/m; + const match = re.exec(content); + if (!match) { + return { ok: false, reason: 'no GSTACK REVIEW REPORT section' }; + } + const after = content.slice(match.index + match[0].length); + // Match any `## ` heading after the report. Reject `## ` followed by + // newline-only (trailing-whitespace ## headers) to avoid false positives. + const trailingHeadings = Array.from( + after.matchAll(/^## \S.*$/gm), + ).map((m) => m[0]); + if (trailingHeadings.length > 0) { + return { + ok: false, + reason: 'trailing ## heading(s) after GSTACK REVIEW REPORT', + trailingHeadings, + }; + } + return { ok: true }; +} + +/** + * Per-skill Step-0 boundary predicates. Each fires `true` when the answered + * AUQ's fingerprint matches the LAST question of that skill's Step 0 phase. + * + * - `ceoStep0Boundary`: matches the mode-pick AUQ (options match `MODE_RE`). + * - `engStep0Boundary`: matches the cross-project-learnings or scope-reduction + * AUQ that closes plan-eng-review's preamble. + * - `designStep0Boundary`: matches plan-design-review's first dimension / + * posture AUQ. + * - `devexStep0Boundary`: matches plan-devex-review's persona-selection AUQ. + * + * Predicates live alongside the helper so the unit suite can exercise each + * against synthetic fingerprints (positive AND negative cases). Skill test + * files import them directly. + */ +export const ceoStep0Boundary: Step0BoundaryPredicate = (fp) => + // Mode-pick path (Step 0F): one of HOLD SCOPE / SCOPE EXPANSION / etc. + fp.options.some((o) => MODE_RE.test(o.label)) || + // Skip-interview path: scope-selection AUQ has "Skip interview and plan + // immediately" — picking it bypasses the rest of Step 0 and routes + // directly to review-phase. Boundary fires on the scope AUQ itself. + fp.options.some((o) => /skip\s+interview|plan\s+immediately/i.test(o.label)); + +export const engStep0Boundary: Step0BoundaryPredicate = (fp) => + /scope reduction recommendation|cross[\s-]?project learnings/i.test( + fp.promptSnippet, + ); + +export const designStep0Boundary: Step0BoundaryPredicate = (fp) => + /design system|design posture|design score|first dimension/i.test( + fp.promptSnippet, + ); + +export const devexStep0Boundary: Step0BoundaryPredicate = (fp) => + /developer persona|target persona|persona selection|TTHW target/i.test( + fp.promptSnippet, + ); + /** * Spawn `claude --permission-mode plan` in a real PTY and return a session * handle. Caller is responsible for `await session.close()` to release the @@ -566,12 +948,21 @@ export async function runPlanSkillObservation(opts: { cwd?: string; /** Total budget for skill to reach a terminal outcome. Default 180000. */ timeoutMs?: number; + /** + * Extra env merged into the spawned `claude` process. `launchClaudePty` + * already supports this; exposing it here lets per-skill tests isolate + * from local config that would mask the regression they're trying to + * catch (e.g., `QUESTION_TUNING=true` causing AUTO_DECIDE to skip the + * rendered AskUserQuestion list). + */ + env?: Record; }): Promise { const startedAt = Date.now(); const session = await launchClaudePty({ permissionMode: opts.inPlanMode === false ? null : 'plan', cwd: opts.cwd, timeoutMs: (opts.timeoutMs ?? 180_000) + 30_000, + env: opts.env, }); try { @@ -602,40 +993,10 @@ export async function runPlanSkillObservation(opts: { elapsedMs: Date.now() - startedAt, }; } - // Silent-write detection: any Write/Edit tool render that targets a - // path OUTSIDE ~/.claude/plans, ~/.gstack/, or the active worktree's - // .gstack/. Plan files and gbrain artifacts are sanctioned. - const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g; - let m: RegExpExecArray | null; - while ((m = writeRe.exec(visible)) !== null) { - const target = m[1] ?? ''; - const sanctioned = - target.includes('.claude/plans') || - target.includes('.gstack/') || - target.includes('/.context/') || - target.includes('CHANGELOG.md') || - target.includes('TODOS.md'); - if (!sanctioned && !isNumberedOptionListVisible(visible)) { - return { - outcome: 'silent_write', - summary: `Write/Edit to ${target} fired before any AskUserQuestion`, - evidence: visible.slice(-2000), - elapsedMs: Date.now() - startedAt, - }; - } - } - if (isPlanReadyVisible(visible)) { + const classified = classifyVisible(visible); + if (classified) { return { - outcome: 'plan_ready', - summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation', - evidence: visible.slice(-2000), - elapsedMs: Date.now() - startedAt, - }; - } - if (isNumberedOptionListVisible(visible)) { - return { - outcome: 'asked', - summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)', + ...classified, evidence: visible.slice(-2000), elapsedMs: Date.now() - startedAt, }; @@ -652,3 +1013,281 @@ export async function runPlanSkillObservation(opts: { await session.close(); } } + +// ──────────────────────────────────────────────────────────────────────────── +// runPlanSkillCounting — drives a plan-* skill end-to-end through Step 0 then +// counts distinct review-phase AskUserQuestion fingerprints. The actual +// product asserted by the per-finding-count tests. +// ──────────────────────────────────────────────────────────────────────────── + +/** + * Result of a `runPlanSkillCounting` run. Includes both the count summary + * (`step0Count`, `reviewCount`) and the full fingerprint list for diagnostic + * dumps when an assertion fails. + */ +export interface PlanSkillCountObservation { + outcome: + | 'plan_ready' + | 'completion_summary' + | 'ceiling_reached' + | 'silent_write' + | 'exited' + | 'timeout'; + summary: string; + /** Visible terminal text at terminal time (last 3KB). */ + evidence: string; + /** Wall time (ms) until the outcome was decided. */ + elapsedMs: number; + /** All distinct AskUserQuestions observed, in observation order. */ + fingerprints: AskUserQuestionFingerprint[]; + /** Count of fingerprints with `preReview === true`. */ + step0Count: number; + /** Count of fingerprints with `preReview === false`. */ + reviewCount: number; +} + +/** + * Drive a plan-* skill in plan mode and count distinct review-phase + * AskUserQuestions until a terminal signal fires. + * + * Flow: + * 1. Boot PTY in plan mode (8s grace + auto-trust dialog). + * 2. Send `slashCommand` alone. Sleep ~3s. + * 3. Send `followUpPrompt` as a chat message — this is the plan content + * the skill reviews. Slash commands with trailing args are rejected by + * Claude Code unless the skill defines them, so the plan goes as a + * follow-up message (the proven pattern at + * skill-e2e-plan-design-with-ui.test.ts:57-71). + * 4. Poll loop: + * - Skip permission dialogs (auto-grant with `defaultPick`). + * - On a new numbered-option list, parse prompt + options, build + * fingerprint via `auqFingerprint`. Empty-prompt parses are skipped + * and re-polled (avoids the empty-prompt collision documented in + * the auqFingerprint contract). + * - First time we see a fingerprint: push it, classify as Step 0 or + * review-phase based on `boundaryFired`, press `defaultPick` to + * advance. + * - After pressing, evaluate `isLastStep0AUQ(fingerprint)`. If true, + * all subsequent AUQs are review-phase. + * - Hard ceiling: if `reviewCount >= reviewCountCeiling`, return + * `ceiling_reached`. This bounds runaway counts; tests should set + * the ceiling above their assertion CEILING. + * - Soft terminals: `COMPLETION_SUMMARY_RE` match → `completion_summary`; + * plan-ready confirmation → `plan_ready`; silent write outside + * sanctioned dirs → `silent_write`; process exited → `exited`; + * wall clock exceeded → `timeout`. + * + * Boundary detection (D14): event-based, fired against the answered AUQ's + * fingerprint, not against later rendered content. This avoids the race + * where Step-0-final and Section-1-first AUQs straddle a section header + * regex match. + * + * Fingerprint composition (D9): `auqFingerprint(prompt, options)` mixes + * normalized prompt text with the options signature so distinct findings + * with shared menu structure (the generic A/B/C TODO menu) get distinct + * fingerprints. + */ +export async function runPlanSkillCounting(opts: { + /** Skill name, e.g. 'plan-ceo-review'. Used for diagnostic strings only. */ + skillName: string; + /** Slash command to send alone, e.g. '/plan-ceo-review'. No trailing args. */ + slashCommand: string; + /** Plan content sent as a follow-up message ~3s after the slash command. */ + followUpPrompt: string; + /** Per-skill predicate: which answered AUQ is the last Step-0 question. */ + isLastStep0AUQ: Step0BoundaryPredicate; + /** Hard cap on review-phase count; helper returns when reached. Should be + * set ABOVE the test's assertion ceiling so the test sees the cap as a + * failure rather than a silent stop. */ + reviewCountCeiling: number; + /** Numbered option to press by default. Defaults to 1 (recommended). */ + defaultPick?: number; + /** + * Optional override for the FIRST AUQ observed. Receives the fingerprint; + * returns the option index to press. Subsequent AUQs always use defaultPick. + * + * Skill-specific routing helper: /plan-ceo-review's first AUQ asks "what + * scope?" with options like "branch diff" / "describe inline" / "skip + * interview". Pressing the default 1 routes to "branch diff" (the wrong + * review target for a seeded fixture). firstAUQPick lets the test pick + * "Skip interview" or "describe inline" so the agent reviews the + * follow-up plan content the test sent, not the git diff. + */ + firstAUQPick?: (fp: AskUserQuestionFingerprint) => number; + /** Working directory. Default process.cwd() (repo cwd holds skill registry). */ + cwd?: string; + /** Total budget for skill to reach a terminal outcome. Default 1_500_000 (25 min). */ + timeoutMs?: number; + /** Extra env merged into the spawned `claude` process. */ + env?: Record; +}): Promise { + const startedAt = Date.now(); + const defaultPick = opts.defaultPick ?? 1; + const timeoutMs = opts.timeoutMs ?? 1_500_000; + + const session = await launchClaudePty({ + permissionMode: 'plan', + cwd: opts.cwd, + timeoutMs: timeoutMs + 60_000, + env: opts.env, + }); + + const fingerprints: AskUserQuestionFingerprint[] = []; + const seen = new Set(); + let boundaryFired = false; + let step0Count = 0; + let reviewCount = 0; + let isFirstAUQ = true; + let lastSig = ''; + + function snapshot( + outcome: PlanSkillCountObservation['outcome'], + summary: string, + visible: string, + ): PlanSkillCountObservation { + return { + outcome, + summary, + evidence: visible.slice(-3000), + elapsedMs: Date.now() - startedAt, + fingerprints, + step0Count, + reviewCount, + }; + } + + try { + await Bun.sleep(8000); // boot grace + auto-trust handler window + const since = session.mark(); + session.send(`${opts.slashCommand}\r`); + await Bun.sleep(3000); + session.send(`${opts.followUpPrompt}\r`); + + const budgetStart = Date.now(); + while (Date.now() - budgetStart < timeoutMs) { + await Bun.sleep(2000); + const visible = session.visibleSince(since); + + // Process exited? + if (session.exited()) { + return snapshot( + 'exited', + `claude exited (code=${session.exitCode()}) during counting (step0=${step0Count}, review=${reviewCount})`, + visible, + ); + } + if (visible.includes('Unknown command:')) { + return snapshot( + 'exited', + `claude rejected ${opts.slashCommand} as unknown command (skill not registered in this cwd)`, + visible, + ); + } + + // Silent write detection — only fires if no numbered prompt is on + // screen (otherwise the write is gated by a permission/AUQ). + const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g; + let m: RegExpExecArray | null; + while ((m = writeRe.exec(visible)) !== null) { + const target = m[1] ?? ''; + const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => + target.includes(s), + ); + if (!sanctioned && !isNumberedOptionListVisible(visible)) { + return snapshot( + 'silent_write', + `Write/Edit to ${target} fired before any AskUserQuestion`, + visible, + ); + } + } + + // Soft terminal signals — check before AUQ processing so a final + // completion-summary doesn't get misclassified as a bonus AUQ. + if (COMPLETION_SUMMARY_RE.test(visible)) { + return snapshot( + 'completion_summary', + `skill emitted completion summary / verdict / status line (step0=${step0Count}, review=${reviewCount})`, + visible, + ); + } + if (isPlanReadyVisible(visible)) { + return snapshot( + 'plan_ready', + `skill emitted plan-mode "Ready to execute" confirmation (step0=${step0Count}, review=${reviewCount})`, + visible, + ); + } + + // Numbered option list? + if (!isNumberedOptionListVisible(visible)) continue; + + // Permission dialog? Auto-grant with defaultPick. Only act on the + // recent tail to avoid re-triggering on stale dialogs in scrollback. + if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) { + session.send(`${defaultPick}\r`); + await Bun.sleep(1500); + continue; + } + + // Parse the active AUQ. Skip same-redraw and empty-prompt cases. + const options = parseNumberedOptions(visible); + if (options.length < 2) continue; + const sig = optionsSignature(options); + if (sig === lastSig) continue; + const promptSnippet = parseQuestionPrompt(visible); + if (promptSnippet === '') continue; // not yet rendered, poll again + lastSig = sig; + + const fingerprintHash = auqFingerprint(promptSnippet, options); + if (seen.has(fingerprintHash)) { + // Same content, already counted (TTY redrew with whitespace diff). + continue; + } + seen.add(fingerprintHash); + + const fp: AskUserQuestionFingerprint = { + signature: fingerprintHash, + promptSnippet, + options, + observedAtMs: Date.now() - startedAt, + preReview: !boundaryFired, + }; + fingerprints.push(fp); + if (boundaryFired) reviewCount += 1; + else step0Count += 1; + + // Press to advance — first AUQ may use the override pick. + const pickIdx = + isFirstAUQ && opts.firstAUQPick ? opts.firstAUQPick(fp) : defaultPick; + isFirstAUQ = false; + session.send(`${pickIdx}\r`); + + // Evaluate boundary AFTER pressing — if THIS AUQ was the last Step 0 + // question, all subsequent AUQs go to reviewCount. + if (!boundaryFired && opts.isLastStep0AUQ(fp)) { + boundaryFired = true; + } + + // Hard ceiling — runaway protection. + if (reviewCount >= opts.reviewCountCeiling) { + return snapshot( + 'ceiling_reached', + `review-phase AUQ count reached ceiling (${opts.reviewCountCeiling})`, + session.visibleSince(since), + ); + } + + // Give the agent a beat to advance to the next state. + await Bun.sleep(2000); + } + + return snapshot( + 'timeout', + `no terminal outcome within ${timeoutMs}ms (step0=${step0Count}, review=${reviewCount})`, + session.visibleSince(since), + ); + } finally { + await session.close(); + } +} diff --git a/test/helpers/claude-pty-runner.unit.test.ts b/test/helpers/claude-pty-runner.unit.test.ts new file mode 100644 index 00000000..e830d730 --- /dev/null +++ b/test/helpers/claude-pty-runner.unit.test.ts @@ -0,0 +1,749 @@ +/** + * Deterministic unit tests for claude-pty-runner.ts behavior changes. + * + * Free-tier (no EVALS=1 needed). Runs in <1s on every `bun test`. Catches + * harness plumbing bugs before stochastic PTY runs surface them. + * + * Two surface areas tested: + * + * 1. Permission-dialog short-circuit in 'asked' classification: a TTY frame + * that matches BOTH isPermissionDialogVisible AND isNumberedOptionListVisible + * must NOT be classified as a skill question — permission dialogs render + * as numbered lists too, but they're not what we're guarding. + * + * 2. Env passthrough surface: runPlanSkillObservation accepts an `env` + * option and threads it to launchClaudePty. We can't fully exercise the + * spawn pipeline without paying for a PTY session, but we CAN verify the + * option exists in the type signature and that calling without env still + * works (no regression). + * + * The PTY test (skill-e2e-plan-ceo-plan-mode.test.ts) is the integration + * check; this file is the cheap deterministic guard for the harness primitives + * those tests stand on. + */ + +import { describe, test, expect } from 'bun:test'; +import { + isPermissionDialogVisible, + isNumberedOptionListVisible, + isPlanReadyVisible, + parseNumberedOptions, + classifyVisible, + TAIL_SCAN_BYTES, + optionsSignature, + parseQuestionPrompt, + auqFingerprint, + COMPLETION_SUMMARY_RE, + assertReviewReportAtBottom, + ceoStep0Boundary, + engStep0Boundary, + designStep0Boundary, + devexStep0Boundary, + type ClaudePtyOptions, + type AskUserQuestionFingerprint, +} from './claude-pty-runner'; + +describe('isPermissionDialogVisible', () => { + test('matches "Bash command requires permission" prompts', () => { + const sample = ` + Some preamble output + + Bash command \`gstack-config get telemetry\` requires permission to run. + + ❯ 1. Yes + 2. Yes, and always allow + 3. No, abort + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('matches "allow all edits" file-edit prompts', () => { + // Isolated to the "allow all edits" clause only — no overlapping + // "Do you want to proceed?" co-trigger, so this asserts the clause works. + const sample = ` + Edit to ~/.gstack/config.yaml + + ❯ 1. Yes + 2. Yes, allow all edits during this session + 3. No + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('matches the "Do you want to proceed?" file-edit confirmation by itself', () => { + // Separate fixture so weakening this clause is detected by a dedicated test. + const sample = ` + Edit to ~/.gstack/config.yaml + + Do you want to proceed? + + ❯ 1. Yes + 2. No + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('matches workspace-trust "always allow access to" prompt', () => { + const sample = ` + Do you trust the files in this folder? + + ❯ 1. Yes, proceed + 2. Yes, and always allow access to /Users/me/repo + 3. No, exit + `; + expect(isPermissionDialogVisible(sample)).toBe(true); + }); + + test('does NOT match a skill AskUserQuestion list', () => { + const sample = ` + D1 — Premise challenge: do users actually want this? + + ❯ 1. Yes, validated + 2. No, premise is wrong + 3. Need more info + `; + expect(isPermissionDialogVisible(sample)).toBe(false); + }); + + test('does NOT match a plan-ready confirmation', () => { + const sample = ` + Ready to execute the plan? + + ❯ 1. Yes + 2. No, keep planning + `; + expect(isPermissionDialogVisible(sample)).toBe(false); + }); + + test('does NOT match a skill question that contains the bare phrase "Do you want to proceed?"', () => { + // Co-trigger requirement: "Do you want to proceed?" alone is not enough. + // It must appear with "Edit to " or "Write to " to count as + // a permission dialog. This guards against a skill question like + // "Do you want to proceed with HOLD SCOPE?" being mis-classified. + const sample = ` + Choose your scope mode for this review. + Do you want to proceed? + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + 3. SELECTIVE EXPANSION + `; + expect(isPermissionDialogVisible(sample)).toBe(false); + }); + + test('does NOT mis-match when adversarial prose includes "Edit to " alongside the bare proceed phrase', () => { + // Adversarial fixture: a skill question whose body legitimately mentions + // "Edit to " in prose AND ends with "Do you want to proceed?". The + // current co-trigger regex would mis-classify this as a permission + // dialog. We DO want this test to fail until the regex is tightened + // further (e.g., proximity constraint, or anchoring "Edit to" to a + // line-start). For now this is documented as a known limitation: a + // skill question that talks about "Edit to" in prose IS still treated + // as a permission dialog. The test asserts the current behavior so a + // future fix can flip it intentionally. + const sample = ` + Plan: I will Edit to ./plan.md to capture the decision. + Do you want to proceed? + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + // KNOWN LIMITATION: the co-trigger fires here. Documented as a + // post-merge follow-up. Flip this assertion once the regex tightens. + expect(isPermissionDialogVisible(sample)).toBe(true); + }); +}); + +describe('isNumberedOptionListVisible', () => { + test('matches a basic ❯ 1. + 2. cursor list', () => { + const sample = ` + ❯ 1. Option one + 2. Option two + 3. Option three + `; + expect(isNumberedOptionListVisible(sample)).toBe(true); + }); + + test('returns false on a single-option prompt', () => { + const sample = ` + ❯ 1. Only option + `; + expect(isNumberedOptionListVisible(sample)).toBe(false); + }); + + test('returns false when no cursor renders', () => { + const sample = ` + Just some prose with 1. a numbered point and 2. another. + `; + expect(isNumberedOptionListVisible(sample)).toBe(false); + }); + + test('overlaps permission dialogs (this is why D5 short-circuits)', () => { + // The whole point of D5: this string matches BOTH classifiers, so the + // runner must consult isPermissionDialogVisible to disambiguate. + const sample = ` + Bash command \`do-thing\` requires permission to run. + + ❯ 1. Yes + 2. No + `; + expect(isNumberedOptionListVisible(sample)).toBe(true); + expect(isPermissionDialogVisible(sample)).toBe(true); + }); +}); + +describe('classifyVisible (runtime path through the runner classifier)', () => { + // These tests call the actual classifier so a future contributor who + // reorders branches (e.g. moves the permission short-circuit before + // isPlanReadyVisible) is caught deterministically. + + test('skill question → returns asked', () => { + const visible = ` + D1 — Choose your scope mode + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + 3. SELECTIVE EXPANSION + 4. SCOPE REDUCTION + `; + const result = classifyVisible(visible); + expect(result?.outcome).toBe('asked'); + }); + + test('permission dialog (Bash) → returns null (skip, keep polling)', () => { + const visible = ` + Bash command \`gstack-update-check\` requires permission to run. + + ❯ 1. Yes + 2. No + `; + expect(isNumberedOptionListVisible(visible)).toBe(true); // pre-filter + expect(classifyVisible(visible)).toBeNull(); // post-filter + }); + + test('plan-ready confirmation → returns plan_ready (wins over asked)', () => { + const visible = ` + Ready to execute the plan? + + ❯ 1. Yes, proceed + 2. No, keep planning + `; + const result = classifyVisible(visible); + expect(result?.outcome).toBe('plan_ready'); + }); + + test('silent write to unsanctioned path → returns silent_write', () => { + const visible = ` + ⏺ Write(src/app/dangerous-write.ts) + ⎿ Wrote 42 lines + `; + const result = classifyVisible(visible); + expect(result?.outcome).toBe('silent_write'); + expect(result?.summary).toContain('src/app/dangerous-write.ts'); + }); + + test('write to sanctioned path (.claude/plans) → returns null (allowed)', () => { + const visible = ` + ⏺ Write(/Users/me/.claude/plans/some-plan.md) + ⎿ Wrote 42 lines + `; + expect(classifyVisible(visible)).toBeNull(); + }); + + test('write while a permission dialog is on screen → returns null (gated, not silent, not asked)', () => { + const visible = ` + ⏺ Write(src/app/edit-with-permission.ts) + + Edit to src/app/edit-with-permission.ts + + Do you want to proceed? + + ❯ 1. Yes + 2. No + `; + // The numbered prompt is a permission dialog (Edit to + Do you want to proceed?); + // silent_write is suppressed because a numbered prompt is visible, AND + // 'asked' is suppressed because the prompt is a permission dialog. + expect(classifyVisible(visible)).toBeNull(); + }); + + test('write while a real skill question is on screen → returns asked (write is captured but not silent)', () => { + const visible = ` + ⏺ Write(src/app/foo.ts) + + D1 — Choose your scope mode + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + // The numbered prompt is a skill question, not a permission dialog; + // silent_write is suppressed (numbered prompt is visible) and the + // outcome is 'asked' — Step 0 fired. + const result = classifyVisible(visible); + expect(result?.outcome).toBe('asked'); + }); + + test('idle / no signals → returns null', () => { + const visible = ` + Some prose without any classifier signals. + `; + expect(classifyVisible(visible)).toBeNull(); + }); + + test('TAIL_SCAN_BYTES is exported as 1500', () => { + // Shared between runner and routing test; a regression that desyncs the + // recent-tail window would surface here. + expect(TAIL_SCAN_BYTES).toBe(1500); + }); +}); + +describe('parseNumberedOptions', () => { + test('extracts options from a clean cursor list', () => { + const visible = ` + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + const opts = parseNumberedOptions(visible); + expect(opts).toHaveLength(2); + expect(opts[0]).toEqual({ index: 1, label: 'HOLD SCOPE' }); + expect(opts[1]).toEqual({ index: 2, label: 'SCOPE EXPANSION' }); + }); + + test('returns empty array on prose-with-numbers (no cursor)', () => { + expect(parseNumberedOptions('text 1. one 2. two')).toEqual([]); + }); + + test('extracts options when the cursor is INLINE with prompt header (box-layout)', () => { + // Real /plan-ceo-review rendering: the TTY's cursor-positioning escapes + // collapse divider + header + prompt + cursor onto one logical line. + // Subsequent options (2..7) still start their own lines. + const visible = [ + '────────────────────────────────────────', + '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main', + ' Review the full branch: ~10K LOC.', + '2. A specific plan file or design doc', + ' You point me at a file (path) and I review that.', + '3. An idea you\'ll describe inline', + '4. Cancel — wrong skill', + '5. Type something.', + '────────────────────────────────────────', + '6. Chat about this', + '7. Skip interview and plan immediately', + ].join('\n'); + const opts = parseNumberedOptions(visible); + expect(opts).toHaveLength(7); + expect(opts[0]).toEqual({ index: 1, label: "The branch's diff vs main" }); + expect(opts[1]?.index).toBe(2); + expect(opts[6]?.index).toBe(7); + expect(opts[6]?.label).toBe('Skip interview and plan immediately'); + }); + + test('inline-cursor and start-of-line cursor both produce 7 options for the box-layout case', () => { + // The inline path captures option 1 from the cursor line itself; the + // subsequent-lines path captures 2..7 with the existing optionRe. + const inlineLayout = [ + 'header text ❯ 1. first option', + '2. second', + '3. third', + ].join('\n'); + expect(parseNumberedOptions(inlineLayout)).toEqual([ + { index: 1, label: 'first option' }, + { index: 2, label: 'second' }, + { index: 3, label: 'third' }, + ]); + + const cleanLayout = [ + ' ❯ 1. first option', + ' 2. second', + ' 3. third', + ].join('\n'); + expect(parseNumberedOptions(cleanLayout)).toEqual([ + { index: 1, label: 'first option' }, + { index: 2, label: 'second' }, + { index: 3, label: 'third' }, + ]); + }); +}); + +describe('runPlanSkillObservation env passthrough surface', () => { + test('ClaudePtyOptions exposes env: Record', () => { + // Type-level guard: this file would fail to compile if the env field + // were removed or its shape regressed. The actual env merge happens in + // launchClaudePty's spawn call (`env: { ...process.env, ...opts.env }`), + // so a regression where `env: opts.env` gets dropped from the + // runPlanSkillObservation -> launchClaudePty handoff is only caught by + // the live PTY test, not here. + const opts: ClaudePtyOptions = { + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }; + expect(opts.env).toEqual({ QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }); + }); +}); + +// ──────────────────────────────────────────────────────────────────────────── +// Per-finding count primitives — Section 3 unit tests #1–#5, #7, #12. +// ──────────────────────────────────────────────────────────────────────────── + +describe('optionsSignature', () => { + test('returns a "|"-joined `index:label` string for a clean list', () => { + const sig = optionsSignature([ + { index: 1, label: 'HOLD SCOPE' }, + { index: 2, label: 'SCOPE EXPANSION' }, + ]); + expect(sig).toBe('1:HOLD SCOPE|2:SCOPE EXPANSION'); + }); + + test('order-independent: shuffled inputs produce the same signature', () => { + // parseNumberedOptions already returns sorted, but defensive sort means + // a future caller that hands us shuffled input still produces a stable + // dedupe signature. + const a = optionsSignature([ + { index: 2, label: 'B' }, + { index: 1, label: 'A' }, + { index: 3, label: 'C' }, + ]); + const b = optionsSignature([ + { index: 1, label: 'A' }, + { index: 2, label: 'B' }, + { index: 3, label: 'C' }, + ]); + expect(a).toBe(b); + }); + + test('empty list returns empty string', () => { + expect(optionsSignature([])).toBe(''); + }); + + test('single-item list returns just that entry', () => { + expect(optionsSignature([{ index: 1, label: 'Only' }])).toBe('1:Only'); + }); +}); + +describe('parseQuestionPrompt', () => { + test('captures 1-line prompt above the cursor', () => { + const visible = ` + D1 — Pick a mode + + ❯ 1. HOLD SCOPE + 2. SCOPE EXPANSION + `; + const prompt = parseQuestionPrompt(visible); + expect(prompt).toBe('D1 — Pick a mode'); + }); + + test('captures multi-line prompt above the cursor', () => { + const visible = ` + D2 — Approach selection + + Which architecture should we follow? + + ❯ 1. Bypass existing helper + 2. Reuse existing helper + `; + const prompt = parseQuestionPrompt(visible); + // Multi-line prompts get joined with single spaces. + expect(prompt).toContain('D2 — Approach selection'); + expect(prompt).toContain('Which architecture should we follow?'); + }); + + test('returns "" when no cursor is rendered', () => { + expect(parseQuestionPrompt('Just some prose.\nNo cursor.')).toBe(''); + }); + + test('truncates to 240 chars', () => { + const longPrompt = 'A'.repeat(500); + const visible = `${longPrompt}\n\n ❯ 1. yes\n 2. no`; + expect(parseQuestionPrompt(visible).length).toBeLessThanOrEqual(240); + }); + + test('does not pull text from a previous numbered list above', () => { + const visible = ` + ❯ 1. previous answered question + 2. previous option two + + D2 — A new question text + + ❯ 1. fresh option A + 2. fresh option B + `; + const prompt = parseQuestionPrompt(visible); + // Stops at the previous numbered-list line; should NOT contain "previous answered question". + expect(prompt).toContain('D2 — A new question text'); + expect(prompt).not.toContain('previous answered question'); + }); + + test('normalizes whitespace (collapses runs of spaces and tabs)', () => { + const visible = `D1 — Spaced out + + ❯ 1. yes + 2. no`; + expect(parseQuestionPrompt(visible)).toBe('D1 — Spaced out'); + }); + + test('inline-cursor box-layout: extracts prompt text BEFORE ❯1. on the cursor line', () => { + // Real /plan-ceo-review rendering: divider + ☐ header + prompt text + + // cursor are all on one logical line because TTY cursor-positioning + // escapes collapse the box layout under stripAnsi. + const visible = [ + '──────────────────', + '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main', + '2. A specific plan file', + '3. An idea inline', + ].join('\n'); + const prompt = parseQuestionPrompt(visible); + // Should extract "Review scope" and the prompt text, dropping the ☐ box-drawing sigil. + expect(prompt).toContain('Review scope'); + expect(prompt).toContain('What scope do you want me to CEO-review?'); + expect(prompt).not.toContain('❯'); + expect(prompt).not.toMatch(/^☐/); + }); +}); + +describe('auqFingerprint', () => { + test('returns the same fingerprint for identical inputs', () => { + const opts = [ + { index: 1, label: 'A' }, + { index: 2, label: 'B' }, + ]; + expect(auqFingerprint('hello', opts)).toBe(auqFingerprint('hello', opts)); + }); + + test('different prompts with shared option labels produce DIFFERENT fingerprints', () => { + // The collision regression Codex F1 caught: option-label-only fingerprints + // collapsed multiple distinct findings into one when they shared menu shape. + const sharedOpts = [ + { index: 1, label: 'Add to plan' }, + { index: 2, label: 'Defer' }, + { index: 3, label: 'Build now' }, + ]; + const fpFinding1 = auqFingerprint('D5 — Architecture: bypass helper?', sharedOpts); + const fpFinding2 = auqFingerprint('D6 — Tests: zero coverage?', sharedOpts); + expect(fpFinding1).not.toBe(fpFinding2); + }); + + test('same prompt with different options produces DIFFERENT fingerprints', () => { + const prompt = 'D1 — Pick a mode'; + const fpA = auqFingerprint(prompt, [ + { index: 1, label: 'HOLD SCOPE' }, + { index: 2, label: 'SCOPE EXPANSION' }, + ]); + const fpB = auqFingerprint(prompt, [ + { index: 1, label: 'HOLD SCOPE' }, + { index: 2, label: 'SCOPE REDUCTION' }, + ]); + expect(fpA).not.toBe(fpB); + }); + + test('whitespace-only differences in prompt do NOT change the fingerprint', () => { + // Same content, different rendering whitespace (TTY redraw artifact) + // must produce the same fingerprint so dedupe survives reflow. + const opts = [{ index: 1, label: 'A' }, { index: 2, label: 'B' }]; + const fpA = auqFingerprint('Pick a mode', opts); + const fpB = auqFingerprint('Pick a mode', opts); + expect(fpA).toBe(fpB); + }); + + test('empty prompt + same options collide (caller must guard against this)', () => { + // Documents the contract: empty-prompt fingerprints WILL collide if the + // caller fingerprints them. runPlanSkillCounting must skip empty-prompt + // AUQs and re-poll instead. + const opts = [{ index: 1, label: 'A' }]; + expect(auqFingerprint('', opts)).toBe(auqFingerprint('', opts)); + }); +}); + +describe('COMPLETION_SUMMARY_RE', () => { + test('matches GSTACK REVIEW REPORT heading', () => { + expect(COMPLETION_SUMMARY_RE.test('## GSTACK REVIEW REPORT')).toBe(true); + }); + + test('matches Completion Summary heading (ceo + eng)', () => { + expect(COMPLETION_SUMMARY_RE.test('## Completion Summary')).toBe(true); + expect(COMPLETION_SUMMARY_RE.test('## Completion summary')).toBe(true); + }); + + test('matches Status: clean (CEO review-log shape)', () => { + expect(COMPLETION_SUMMARY_RE.test('Status: clean')).toBe(true); + expect(COMPLETION_SUMMARY_RE.test('Status: issues_open')).toBe(true); + }); + + test('matches VERDICT: line', () => { + expect(COMPLETION_SUMMARY_RE.test('VERDICT: CLEARED — Eng Review passed')).toBe(true); + }); + + test('does NOT match prose mentions of "verdict" mid-line', () => { + // VERDICT must be at the start of a line to count. + expect(COMPLETION_SUMMARY_RE.test('the final verdict: undecided')).toBe(false); + }); +}); + +describe('assertReviewReportAtBottom', () => { + test('passes when REVIEW REPORT is the only/last ## heading', () => { + const content = `# Plan + +## Context +stuff + +## Approach +more stuff + +## GSTACK REVIEW REPORT + +| col | col | +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(true); + }); + + test('fails when REVIEW REPORT is missing', () => { + const content = `# Plan + +## Context +stuff +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(false); + expect(r.reason).toMatch(/no GSTACK REVIEW REPORT/); + }); + + test('fails when REVIEW REPORT exists but a ## heading follows it', () => { + const content = `# Plan + +## GSTACK REVIEW REPORT + +| col | col | + +## Late Section +oops +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(false); + expect(r.reason).toMatch(/trailing ## heading/); + expect(r.trailingHeadings).toEqual(['## Late Section']); + }); + + test('passes when only ### subheadings follow REVIEW REPORT (deeper nesting allowed)', () => { + const content = `## GSTACK REVIEW REPORT + +### Cross-model tension +- F1: resolved +- F2: resolved +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(true); + }); + + test('fails with multiple trailing ## headings reported', () => { + const content = `## GSTACK REVIEW REPORT + +## First trailing + +## Second trailing +`; + const r = assertReviewReportAtBottom(content); + expect(r.ok).toBe(false); + expect(r.trailingHeadings).toHaveLength(2); + }); +}); + +describe('Step0BoundaryPredicate per-skill', () => { + // Helper to build a synthetic fingerprint for predicate tests. + function fp(promptSnippet: string, optionLabels: string[]): AskUserQuestionFingerprint { + const options = optionLabels.map((label, i) => ({ index: i + 1, label })); + return { + signature: auqFingerprint(promptSnippet, options), + promptSnippet, + options, + observedAtMs: 0, + preReview: true, + }; + } + + describe('ceoStep0Boundary', () => { + test('FIRES on Step 0F mode-pick AUQ (HOLD SCOPE in options)', () => { + const f = fp('Pick a mode', ['HOLD SCOPE', 'SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'SCOPE REDUCTION']); + expect(ceoStep0Boundary(f)).toBe(true); + }); + + test('FIRES on scope-selection AUQ with "Skip interview" option (skip-interview path)', () => { + // After calibration run 1: plan-ceo's first AUQ is scope-selection, + // and we route via "Skip interview and plan immediately" to bypass + // Step 0 entirely. Boundary must fire on this AUQ so subsequent + // AUQs go to reviewCount. + const f = fp( + 'What scope do you want me to CEO-review?', + [ + "The branch's diff vs main", + 'A specific plan file', + "An idea you'll describe inline", + 'Cancel — wrong skill', + 'Type something.', + 'Chat about this', + 'Skip interview and plan immediately', + ], + ); + expect(ceoStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on premise challenge AUQs', () => { + const f = fp('D1 — Premise check: is this the right problem?', ['Yes', 'No', 'Other']); + expect(ceoStep0Boundary(f)).toBe(false); + }); + + test('does NOT fire on review-section AUQs', () => { + const f = fp('Architecture: bypass helper?', ['Reuse existing', 'Roll new', 'Defer']); + expect(ceoStep0Boundary(f)).toBe(false); + }); + }); + + describe('engStep0Boundary', () => { + test('FIRES on cross-project learnings prompt', () => { + const f = fp('Enable cross-project learnings on this machine?', ['Yes', 'No']); + expect(engStep0Boundary(f)).toBe(true); + }); + + test('FIRES on scope reduction recommendation', () => { + const f = fp('Scope reduction recommendation: cut to MVP?', ['Reduce', 'Proceed', 'Modify']); + expect(engStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on review-section AUQs', () => { + const f = fp('Architecture: shared mutable state?', ['Refactor', 'Defer', 'Skip']); + expect(engStep0Boundary(f)).toBe(false); + }); + }); + + describe('designStep0Boundary', () => { + test('FIRES on design system / posture mention', () => { + const f = fp('Pick a design posture for this review', ['Polish', 'Triage', 'Expansion']); + expect(designStep0Boundary(f)).toBe(true); + }); + + test('FIRES on first-dimension prompt', () => { + const f = fp('First dimension: visual hierarchy. Score?', ['7', '8', '9']); + expect(designStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on later dimension AUQs', () => { + const f = fp('Spacing dimension score?', ['7', '8', '9']); + expect(designStep0Boundary(f)).toBe(false); + }); + }); + + describe('devexStep0Boundary', () => { + test('FIRES on developer persona selection', () => { + const f = fp('Pick the target persona for this review', ['Senior backend', 'Junior frontend', 'Other']); + expect(devexStep0Boundary(f)).toBe(true); + }); + + test('FIRES on TTHW target prompt', () => { + const f = fp('What is the TTHW target for first run?', ['<5 min', '<15 min', '<30 min']); + expect(devexStep0Boundary(f)).toBe(true); + }); + + test('does NOT fire on review-section AUQs', () => { + const f = fp('Friction point: 5-min CI wait. Address?', ['Now', 'Defer', 'Skip']); + expect(devexStep0Boundary(f)).toBe(false); + }); + }); +}); diff --git a/test/helpers/touchfiles.ts b/test/helpers/touchfiles.ts index 4552b8e1..a62d9360 100644 --- a/test/helpers/touchfiles.ts +++ b/test/helpers/touchfiles.ts @@ -103,6 +103,15 @@ export const E2E_TOUCHFILES: Record = { 'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'], 'autoplan-chain-pty': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'], 'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'], + + // Per-finding AskUserQuestion count + review-report-at-bottom assertion. + // Each test drives its skill end-to-end; touchfiles include preamble + + // completion-status resolvers because they affect question cadence and + // terminal output (the regression surface this test catches). + 'plan-ceo-finding-count': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-ceo-finding-count.test.ts'], + 'plan-eng-finding-count': ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-eng-finding-count.test.ts'], + 'plan-design-finding-count': ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-design-finding-count.test.ts'], + 'plan-devex-finding-count': ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-devex-finding-count.test.ts'], 'brain-privacy-gate': ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-brain-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'], // AskUserQuestion format regression (RECOMMENDATION + Completeness: N/10) @@ -381,6 +390,15 @@ export const E2E_TIERS: Record = { 'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode 'autoplan-chain-pty': 'periodic', // ~$8/run, all 3 phases sequential + // Per-finding count + review-report-at-bottom — periodic because each + // run drives a full skill end-to-end (~25 min, ~$5/run). Sequential + // execution during calibration; concurrent opt-in only after measured + // comparison agrees (plan §D15). + 'plan-ceo-finding-count': 'periodic', + 'plan-eng-finding-count': 'periodic', + 'plan-design-finding-count': 'periodic', + 'plan-devex-finding-count': 'periodic', + // Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call, // costs ~$0.30-$0.50 per run, not needed on every commit) 'brain-privacy-gate': 'periodic', diff --git a/test/skill-e2e-plan-ceo-finding-count.test.ts b/test/skill-e2e-plan-ceo-finding-count.test.ts new file mode 100644 index 00000000..850c1a03 --- /dev/null +++ b/test/skill-e2e-plan-ceo-finding-count.test.ts @@ -0,0 +1,253 @@ +/** + * /plan-ceo-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Asserts the load-bearing rule "One issue = one AskUserQuestion call" by + * driving /plan-ceo-review against a 5-finding seeded plan and counting + * distinct review-phase AUQs. Passes when count is in [N-1, N+2]. + * + * Two tests in this file: + * - 5-finding distinct fixture: count band assertion + D19 review-report-at-bottom. + * - 2-finding paired control (D12 positive control): related findings still + * produce 2 distinct AUQs, not 1 batched, when the rule is honored. + * + * Tier: periodic. Each run drives Step 0 + 11 review sections end-to-end + * (~25 min, ~$5/run). Sequential by default per plan §D15. See + * test/helpers/claude-pty-runner.ts for runPlanSkillCounting internals. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + ceoStep0Boundary, + assertReviewReportAtBottom, + type AskUserQuestionFingerprint, +} from './helpers/claude-pty-runner'; + +/** + * /plan-ceo-review's first AUQ asks "what scope?" with options like + * 1. Branch diff vs main + * 2. A specific plan file or design doc + * 3. An idea you'll describe inline + * ... + * 7. Skip interview and plan immediately + * + * The default pick (1) routes to "branch diff vs main" — the wrong target + * for our seeded fixture (the agent would review the gstack PR itself, + * recursively). Picking "Skip interview and plan immediately" bypasses + * Step 0 and routes the agent to review the chat context (where our + * follow-up plan was pasted). + */ +function pickSkipInterview(fp: AskUserQuestionFingerprint): number { + const skipOpt = fp.options.find((o) => + /skip\s+interview|plan\s+immediately/i.test(o.label), + ); + if (skipOpt) return skipOpt.index; + // Fallback: "describe inline" also routes to using our pasted plan. + const inlineOpt = fp.options.find((o) => + /describe.*inline|inline.*idea/i.test(o.label), + ); + if (inlineOpt) return inlineOpt.index; + return 1; +} + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N_DISTINCT = 5; +const FLOOR_DISTINCT = N_DISTINCT - 1; // 4 (D11) +const CEILING_DISTINCT = N_DISTINCT + 2; // 7 (D11) + +const N_PAIRED = 2; +const FLOOR_PAIRED = 2; +const CEILING_PAIRED = 4; + +const PLAN_CEO_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo.md (use Edit/Write to that exact path).', + '', + '# Plan: Payment Processing Integration', + '', + '## Architecture', + "We're adding a new `PaymentService` class that will handle Stripe webhooks.", + 'This bypasses the existing `WebhookDispatcher` module — we want a clean', + 'namespace separation.', + '', + '## Database access', + 'The new endpoint reads `request.params.userId` directly into a raw SQL', + 'fragment for the lookup query.', + '', + '## Webhook fan-out', + 'On payment success we update the user record AND fire a notification email.', + 'Both happen inline; no error handling on the email leg.', + '', + '## Tests', + "None planned. We'll rely on the existing integration suite catching regressions.", + '', + '## Performance', + 'Each webhook lookup hits the database for the user, then fetches each', + 'order in a loop.', +].join('\n'); + +const PLAN_CEO_2_PAIRED_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo-paired.md (use Edit/Write to that exact path).', + '', + '# Plan: Payment Processing — Test Coverage', + '', + '## Tests', + 'We need test coverage for `processPayment()`. Specifically:', + '1. The happy path (successful Stripe charge — assert correct receipt is generated).', + '2. The error/timeout path (Stripe returns 502 — assert retry-with-backoff fires once, then fails clean).', + '', + 'Currently neither has a unit test. These are deliberately separate concerns:', + 'the success path is correctness, the failure path is graceful degradation.', +].join('\n'); + +const PLAN_CEO_PATH = '/tmp/gstack-test-plan-ceo.md'; +const PLAN_CEO_PAIRED_PATH = '/tmp/gstack-test-plan-ceo-paired.md'; + +describeE2E('/plan-ceo-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR_DISTINCT}-${CEILING_DISTINCT} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_CEO_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-ceo-review', + slashCommand: '/plan-ceo-review', + followUpPrompt: PLAN_CEO_5_FINDINGS, + isLastStep0AUQ: ceoStep0Boundary, + reviewCountCeiling: CEILING_DISTINCT + 1, // hard cap above assertion ceiling + firstAUQPick: pickSkipInterview, // bypass scope-selection, route to review + cwd: process.cwd(), + timeoutMs: 1_500_000, // 25 min + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-ceo-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR_DISTINCT) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_DISTINCT}.\n` + + `Likely batching regression — agent collapsed multiple findings into fewer questions.\n` + + `Fingerprints (review-phase only):\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING_DISTINCT) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING_DISTINCT}.\n` + + `Possible over-asking regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + + // D19: review report at bottom of plan file. + if (!fs.existsSync(PLAN_CEO_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_CEO_PATH}.\n` + + `Either the agent ignored the path instruction in the follow-up prompt, or\n` + + `the helper exited before the agent wrote the file. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_CEO_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_CEO_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_CEO_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); + + test( + `paired-finding positive control: ${N_PAIRED} related findings produce ${FLOOR_PAIRED}-${CEILING_PAIRED} AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-ceo-review', + slashCommand: '/plan-ceo-review', + followUpPrompt: PLAN_CEO_2_PAIRED_FINDINGS, + isLastStep0AUQ: ceoStep0Boundary, + reviewCountCeiling: CEILING_PAIRED + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `paired-finding control FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount}\n` + + `--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR_PAIRED) { + throw new Error( + `PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_PAIRED}.\n` + + `Two deliberately related findings were batched into <2 questions — the rule failed under D12.\n` + + `Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING_PAIRED) { + throw new Error( + `PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} > CEILING=${CEILING_PAIRED} (over-asking on a 2-finding fixture).`, + ); + } + } finally { + try { + fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/skill-e2e-plan-ceo-mode-routing.test.ts b/test/skill-e2e-plan-ceo-mode-routing.test.ts index 4e85ed64..0199413b 100644 --- a/test/skill-e2e-plan-ceo-mode-routing.test.ts +++ b/test/skill-e2e-plan-ceo-mode-routing.test.ts @@ -37,14 +37,15 @@ import { isPermissionDialogVisible, parseNumberedOptions, isPlanReadyVisible, + MODE_RE, + optionsSignature, + TAIL_SCAN_BYTES, type ClaudePtySession, } from './helpers/claude-pty-runner'; const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; const describeE2E = shouldRun ? describe : describe.skip; -const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i; - interface ModeCase { mode: 'HOLD SCOPE' | 'SCOPE EXPANSION'; /** Regex applied to visible-since-mode-pick text. At least one must match. */ @@ -95,8 +96,8 @@ async function navigateToModeAskUserQuestion( // Has the rendered list changed since last poll? If not, we're seeing // the same prompt and shouldn't double-press. - const sig = opts.map(o => `${o.index}:${o.label}`).join('|'); - const lastSig = lastSeenList.map(o => `${o.index}:${o.label}`).join('|'); + const sig = optionsSignature(opts); + const lastSig = optionsSignature(lastSeenList); if (sig === lastSig) continue; lastSeenList = opts; @@ -115,7 +116,14 @@ async function navigateToModeAskUserQuestion( // Permission dialog? Grant with "1" but don't count it against nav budget. // Classify on the recent tail only — old permission text persists in // visibleSince and would re-trigger forever. - if (isPermissionDialogVisible(visible.slice(-1500))) { + // + // Note: runPlanSkillObservation has its own permission-dialog filter that + // simply skips classification (since it observes, doesn't drive). This nav + // loop drives the PTY directly via launchClaudePty and so owns its own + // dialog handling — granting with "1" so the workflow advances. Both + // paths share TAIL_SCAN_BYTES as the recent-tail window so tuning stays + // in sync. + if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) { session.send('1\r'); await Bun.sleep(1500); continue; diff --git a/test/skill-e2e-plan-ceo-plan-mode.test.ts b/test/skill-e2e-plan-ceo-plan-mode.test.ts index 8bb6a95b..86409e6f 100644 --- a/test/skill-e2e-plan-ceo-plan-mode.test.ts +++ b/test/skill-e2e-plan-ceo-plan-mode.test.ts @@ -1,22 +1,34 @@ /** * plan-ceo-review plan-mode smoke (gate, paid, real-PTY). * - * Asserts: when /plan-ceo-review is invoked in plan mode, the skill reaches - * a terminal outcome that is either: - * - 'asked' — skill emitted its Step 0 numbered prompt (scope mode - * selection, or the routing-injection prompt that runs - * before Step 0) - * - 'plan_ready' — skill ran end-to-end and surfaced claude's native - * "Ready to execute" confirmation + * Asserts: when /plan-ceo-review is invoked in plan mode, the FIRST terminal + * outcome is 'asked' — a skill-question numbered list. Permission dialogs + * (which also render numbered lists) are filtered out by `runPlanSkillObservation` + * via its `isPermissionDialogVisible(visible.slice(-1500))` short-circuit. * - * FAIL conditions: silent Write/Edit before any prompt, claude crash, - * timeout. + * Reaching 'plan_ready' first IS the regression we want to catch: the agent + * skipped Step 0 entirely and went straight to ExitPlanMode. The original + * failure had the assistant read a diff, write a plan with two issues, and + * call ExitPlanMode without ever firing AskUserQuestion — the user had to + * manually call out the missing per-issue questions. * - * Replaces the SDK-based test that never worked: the SDK's canUseTool - * interceptor on AskUserQuestion never fires in plan mode because plan - * mode renders its native confirmation as TTY UI, not via the - * AskUserQuestion tool. The real PTY harness observes the rendered - * terminal output directly. + * Why this skill is special: unlike plan-eng-review / plan-design-review / + * plan-devex-review (whose smokes accept either 'asked' or 'plan_ready'), + * plan-ceo-review's template mandates Step 0A premise challenge (3 baked-in + * questions) AND Step 0F mode selection BEFORE any plan write. There is no + * legitimate path to plan_ready that does not first emit a skill-question + * numbered prompt. + * + * Env passthrough: passes `QUESTION_TUNING=false` and `EXPLAIN_LEVEL=default` + * via the runner's env option. Today these are advisory — `gstack-config` + * reads `~/.gstack/config.yaml`, not env vars, so a contributor with + * `question_tuning: true` set in their YAML config can still see AUTO_DECIDE + * masking. The env passthrough is wired so a future gstack-config change to + * honor env overrides will make this test hermetic without further edits. + * Tracked as a post-merge follow-up. + * + * FAIL conditions: 'plan_ready' first, silent Write/Edit before any prompt, + * claude crash, timeout. * * See test/helpers/claude-pty-runner.ts for runner internals. */ @@ -28,21 +40,33 @@ const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; const describeE2E = shouldRun ? describe : describe.skip; describeE2E('plan-ceo-review plan-mode smoke (gate)', () => { - test('reaches a terminal outcome (asked or plan_ready) without silent writes', async () => { + test('first terminal outcome is asked (Step 0 fires before any plan write)', async () => { const obs = await runPlanSkillObservation({ skillName: 'plan-ceo-review', inPlanMode: true, timeoutMs: 300_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, }); - if (obs.outcome === 'silent_write' || obs.outcome === 'exited' || obs.outcome === 'timeout') { + if (obs.outcome !== 'asked') { + const diagnosis = + obs.outcome === 'plan_ready' + ? `'plan_ready' first means the agent skipped Step 0 entirely and went straight to ExitPlanMode without asking.` + : obs.outcome === 'timeout' + ? `Timeout means the agent neither asked nor completed within the budget — likely hung mid-question or stuck on a permission dialog.` + : obs.outcome === 'silent_write' + ? `Silent Write/Edit fired to an unsanctioned path before any AskUserQuestion — also a Step 0 skip.` + : `Outcome '${obs.outcome}' is unexpected; investigate the evidence below.`; throw new Error( - `plan-ceo-review plan-mode smoke FAILED: outcome=${obs.outcome}\n` + + `plan-ceo-review smoke FAILED: outcome=${obs.outcome}\n` + + `${diagnosis}\n` + + `Expected 'asked'. See plan-ceo-review/SKILL.md.tmpl: the Step 0 STOP rules ` + + `and the "One issue = one AskUserQuestion call" rule under "CRITICAL RULE — ` + + `How to ask questions".\n` + `summary: ${obs.summary}\n` + `elapsed: ${obs.elapsedMs}ms\n` + `--- evidence (last 2KB visible) ---\n${obs.evidence}`, ); } - expect(['asked', 'plan_ready']).toContain(obs.outcome); }, 360_000); }); diff --git a/test/skill-e2e-plan-design-finding-count.test.ts b/test/skill-e2e-plan-design-finding-count.test.ts new file mode 100644 index 00000000..ef0d9b68 --- /dev/null +++ b/test/skill-e2e-plan-design-finding-count.test.ts @@ -0,0 +1,135 @@ +/** + * /plan-design-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-design-review + * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2]. + * Plus D19: review report at bottom of produced plan file. + * + * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + designStep0Boundary, + assertReviewReportAtBottom, +} from './helpers/claude-pty-runner'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N = 5; +const FLOOR = N - 1; +const CEILING = N + 2; + +const PLAN_DESIGN_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-design.md (use Edit/Write to that exact path).', + '', + '# Plan: Settings Page UI redesign', + '', + '## Visual Hierarchy', + 'The "Save" button is rendered with the same size, weight, and color as', + 'three other buttons in the page header (Reset, Cancel, Export). Nothing', + 'tells the user which is the primary action.', + '', + '## Spacing', + 'Between sections we have 24px in some places, 32px in others, and 16px', + 'in a third — no consistent vertical rhythm.', + '', + '## Color', + 'The error message uses red text on a light pink background. Contrast', + 'ratio is approximately 3:1 (below WCAG AA).', + '', + '## Typography', + 'We use 14px, 16px, and 18px font sizes across the form labels. Two', + 'sizes would suffice and create stronger hierarchy.', + '', + '## Motion', + 'The "Save" action takes 2-5 seconds with no loading indicator. Users', + 'see a frozen page; we should add a spinner or skeleton state.', +].join('\n'); + +const PLAN_DESIGN_PATH = '/tmp/gstack-test-plan-design.md'; + +describeE2E('/plan-design-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_DESIGN_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-design-review', + slashCommand: '/plan-design-review', + followUpPrompt: PLAN_DESIGN_5_FINDINGS, + isLastStep0AUQ: designStep0Boundary, + reviewCountCeiling: CEILING + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-design-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` + + `Likely batching regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`, + ); + } + + if (!fs.existsSync(PLAN_DESIGN_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_DESIGN_PATH}. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_DESIGN_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_DESIGN_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_DESIGN_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/skill-e2e-plan-devex-finding-count.test.ts b/test/skill-e2e-plan-devex-finding-count.test.ts new file mode 100644 index 00000000..e4b3f8e7 --- /dev/null +++ b/test/skill-e2e-plan-devex-finding-count.test.ts @@ -0,0 +1,135 @@ +/** + * /plan-devex-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-devex-review + * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2]. + * Plus D19: review report at bottom of produced plan file. + * + * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + devexStep0Boundary, + assertReviewReportAtBottom, +} from './helpers/claude-pty-runner'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N = 5; +const FLOOR = N - 1; +const CEILING = N + 2; + +const PLAN_DEVEX_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-devex.md (use Edit/Write to that exact path).', + '', + '# Plan: Public SDK Beta Launch', + '', + '## Persona', + "The plan doesn't specify which developer persona is the target — we're", + "shipping for \"everyone,\" which means we tune for nobody.", + '', + '## TTHW (time to hello world)', + 'Time-to-hello-world is not measured. No benchmark data referenced. We', + "don't know if first-run takes 5 minutes or 50.", + '', + '## Friction Point', + 'First-run currently requires a 5-minute mandatory CI step before the', + 'developer can run their first eval. There is no way to skip it.', + '', + '## Magical Moment', + 'Getting-started flow has no delight beat. Pure documentation, no', + 'interactive demo, no "ah-ha" moment that makes the developer trust us.', + '', + '## Competitive Blind Spot', + "The plan doesn't reference how peer SDKs (LangChain, Semantic Kernel,", + 'OpenAI) handle this DX surface. We may be reinventing worse versions', + 'of solved problems.', +].join('\n'); + +const PLAN_DEVEX_PATH = '/tmp/gstack-test-plan-devex.md'; + +describeE2E('/plan-devex-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_DEVEX_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-devex-review', + slashCommand: '/plan-devex-review', + followUpPrompt: PLAN_DEVEX_5_FINDINGS, + isLastStep0AUQ: devexStep0Boundary, + reviewCountCeiling: CEILING + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-devex-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` + + `Likely batching regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`, + ); + } + + if (!fs.existsSync(PLAN_DEVEX_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_DEVEX_PATH}. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_DEVEX_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_DEVEX_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_DEVEX_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/skill-e2e-plan-eng-finding-count.test.ts b/test/skill-e2e-plan-eng-finding-count.test.ts new file mode 100644 index 00000000..93b8ba68 --- /dev/null +++ b/test/skill-e2e-plan-eng-finding-count.test.ts @@ -0,0 +1,134 @@ +/** + * /plan-eng-review per-finding AskUserQuestion count (periodic, paid, real-PTY). + * + * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-eng-review + * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2]. + * Plus D19: review report at bottom of produced plan file. + * + * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15. + */ + +import { describe, test } from 'bun:test'; +import * as fs from 'node:fs'; +import { + runPlanSkillCounting, + engStep0Boundary, + assertReviewReportAtBottom, +} from './helpers/claude-pty-runner'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic'; +const describeE2E = shouldRun ? describe : describe.skip; + +const N = 5; +const FLOOR = N - 1; // 4 +const CEILING = N + 2; // 7 + +const PLAN_ENG_5_FINDINGS = [ + 'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-eng.md (use Edit/Write to that exact path).', + '', + '# Plan: Multi-tenant Auth Refactor', + '', + '## Architecture', + 'Two new services (`AuthBroker` and `SessionMint`) share a global mutable', + '`AuthCache` instance via module-level export. Both services mutate it.', + '', + '## Code quality', + 'The `validateAndDispatch()` function is 60 lines with three nested', + 'try/catch blocks; each catch swallows a different error class.', + '', + '## Tests', + 'The existing `legacyAuthFlow()` will get rewritten as part of this work;', + 'no regression test for the prior behavior is planned.', + '', + '## Performance', + 'Token validation issues 5 sequential API calls to the IDP; they could be', + 'parallelized via Promise.all trivially (calls are independent).', + '', + '## Architecture (scope smell)', + 'This touches 12 files and introduces 4 new classes (TokenStore,', + 'SessionMint, AuthCache, RequestPolicy). Worth flagging the complexity check.', +].join('\n'); + +const PLAN_ENG_PATH = '/tmp/gstack-test-plan-eng.md'; + +describeE2E('/plan-eng-review per-finding AskUserQuestion count (periodic)', () => { + test( + `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`, + async () => { + try { + fs.rmSync(PLAN_ENG_PATH, { force: true }); + } catch { + /* best-effort */ + } + + const obs = await runPlanSkillCounting({ + skillName: 'plan-eng-review', + slashCommand: '/plan-eng-review', + followUpPrompt: PLAN_ENG_5_FINDINGS, + isLastStep0AUQ: engStep0Boundary, + reviewCountCeiling: CEILING + 1, + cwd: process.cwd(), + timeoutMs: 1_500_000, + env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }, + }); + + try { + if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) { + throw new Error( + `plan-eng-review finding-count FAILED: outcome=${obs.outcome}\n` + + `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` + + `fingerprints (last 8):\n` + + obs.fingerprints + .slice(-8) + .map( + (f, i) => + ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`, + ) + .join('\n') + + `\n--- evidence (last 3KB) ---\n${obs.evidence}`, + ); + } + if (obs.reviewCount < FLOOR) { + throw new Error( + `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` + + `Likely batching regression. Review-phase fingerprints:\n` + + obs.fingerprints + .filter((f) => !f.preReview) + .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`) + .join('\n'), + ); + } + if (obs.reviewCount > CEILING) { + throw new Error( + `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`, + ); + } + + if (!fs.existsSync(PLAN_ENG_PATH)) { + throw new Error( + `D19 FAIL: agent did not produce expected plan file at ${PLAN_ENG_PATH}. ` + + `outcome=${obs.outcome} review=${obs.reviewCount}`, + ); + } + const planContent = fs.readFileSync(PLAN_ENG_PATH, 'utf-8'); + const verdict = assertReviewReportAtBottom(planContent); + if (!verdict.ok) { + throw new Error( + `D19 FAIL: plan file at ${PLAN_ENG_PATH} ${verdict.reason}\n` + + (verdict.trailingHeadings + ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n` + : '') + + `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`, + ); + } + } finally { + try { + fs.rmSync(PLAN_ENG_PATH, { force: true }); + } catch { + /* best-effort */ + } + } + }, + 1_700_000, + ); +}); diff --git a/test/touchfiles.test.ts b/test/touchfiles.test.ts index 0d9ada4b..7f6ad22b 100644 --- a/test/touchfiles.test.ts +++ b/test/touchfiles.test.ts @@ -97,8 +97,10 @@ describe('selectTests', () => { expect(result.selected).toContain('ask-user-question-format-pty'); expect(result.selected).toContain('plan-ceo-mode-routing'); expect(result.selected).toContain('autoplan-chain-pty'); - expect(result.selected.length).toBe(18); - expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 18); + // Per-finding count + review-report-at-bottom (v1.21.x) + expect(result.selected).toContain('plan-ceo-finding-count'); + expect(result.selected.length).toBe(19); + expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 19); }); test('global touchfile triggers ALL tests', () => {