v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255)

* test: extract classifyVisible() + permission-dialog filter in PTY runner

Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:

- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
  gstack-config does not yet honor env overrides; plumbing is in place for a
  future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
  number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
  in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
  requires a file-edit context co-trigger. Other clauses unchanged. Skill
  questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
  plan_ready → asked → null. Permission dialogs filtered out of the
  'asked' classification so a permission prompt cannot pose as a Step 0
  skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.
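The co-trigger contract in miniature (a sketch; the authoritative clauses live in test/helpers/claude-pty-runner.ts, and the standalone permission signatures are omitted here):

```typescript
// Sketch only: bare "Do you want to proceed?" counts as a permission dialog
// ONLY when a file-edit render ("Edit to <path>" / "Write to <path>") is
// also visible in the same buffer.
function permissionWithCoTrigger(visible: string): boolean {
  return (
    /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) &&
    /(Edit|Write)\s+to\s+\S+/i.test(visible)
  );
}

// A skill question that happens to contain the bare phrase stays a skill question:
const skillQ = 'Step 0A premise challenge. Do you want to proceed?\n❯ 1. Yes, continue';
// A real file-edit permission dialog carries both signals:
const permDialog = 'Edit to src/app.ts\n\nDo you want to proceed?\n❯ 1. Yes';
```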

* test: tighten plan-ceo-review smoke to require Step 0 fires first

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.
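In outline (outcome names from the runner; the predicates are illustrative, not the test's literal code):

```typescript
type Outcome = 'asked' | 'plan_ready' | 'silent_write' | 'timeout';

// Old contract: either outcome passed the smoke.
const passedBefore = (o: Outcome) => o === 'asked' || o === 'plan_ready';
// New contract: only a fired Step 0 question passes.
const passesNow = (o: Outcome) => o === 'asked';
```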

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.

* chore: capture v1.21.1.0 follow-ups in TODOS.md

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.

* chore: bump version and changelog (v1.21.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract MODE_RE + optionsSignature into PTY runner exports

Refactor prep for the upcoming per-finding AskUserQuestion count tests
across plan-{ceo,eng,design,devex}-review. The new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe, so both move into one source of truth in
test/helpers/claude-pty-runner.ts: a fifth mode (or a tweak to the
fingerprint shape) then updates everywhere instead of drifting per-test.

Mechanical: no behavior change in the mode-routing test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add per-finding count primitives + unit tests

Pure helpers landing ahead of runPlanSkillCounting:

  - parseQuestionPrompt(visible) — extract the 1-3 line prompt above
    the latest "❯ 1." cursor, normalize to a 240-char snippet
  - auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
    options signature; distinct prompts with shared option labels
    (the generic A/B/C TODO menu) get distinct fingerprints
  - COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
    plan-review skills' completion / verdict markers
  - assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
    REPORT" is present and is the last "## " heading in a plan file
  - Step0BoundaryPredicate type + four per-skill predicates
    (ceo / eng / design / devex) — fire on the answered AUQ's
    fingerprint, marking the end of Step 0 deterministically
    (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).
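The collision the fingerprint is designed against, in sketch form (plain string concatenation stands in for Bun.hash; the menu labels are hypothetical):

```typescript
function optionsSignature(opts: Array<{ index: number; label: string }>): string {
  return [...opts]
    .sort((a, b) => a.index - b.index)
    .map((o) => `${o.index}:${o.label}`)
    .join('|');
}

// Stand-in for Bun.hash: the normalized prompt participates in the key, so
// two AUQs sharing the generic A/B/C menu still get distinct fingerprints.
function auqKey(prompt: string, opts: Array<{ index: number; label: string }>): string {
  return prompt.replace(/\s+/g, ' ').trim() + '||' + optionsSignature(opts);
}

const genericMenu = [
  { index: 1, label: 'Add to plan' },
  { index: 2, label: 'Defer to TODOS' },
  { index: 3, label: 'Build now' },
];
const k1 = auqKey('Finding 1: stale cache key on retry', genericMenu);
const k2 = auqKey('Finding 2: missing null check', genericMenu);
```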

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add runPlanSkillCounting PTY helper

Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:

  - Boot + auto-trust handler (existing launchClaudePty)
  - Send slash command alone, sleep 3s, send plan content as follow-up
    message (proven pattern from skill-e2e-plan-design-with-ui)
  - Poll loop with permission-dialog auto-grant, same-redraw skip,
    empty-prompt re-poll
  - Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
    the answered AUQ's fingerprint (Codex F7 — boundary is observed
    event, not later rendered content)
  - Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
    plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.
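The skip contract as a sketch (the tick decision is assumed from the description above; the real loop lives in runPlanSkillCounting):

```typescript
// Decide what the poll loop does with a parsed AUQ on this tick.
const seen = new Set<string>();
function onPoll(promptSnippet: string, signature: string): 'count' | 'skip_redraw' | 'repoll' {
  // Empty prompt: fingerprinting it would collide across distinct questions
  // that share option labels, so wait one more poll instead.
  if (promptSnippet.trim() === '') return 'repoll';
  if (seen.has(signature)) return 'skip_redraw'; // same AUQ redrawn
  seen.add(signature);
  return 'count'; // new distinct AUQ
}
```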

No E2E tests yet — those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: register four finding-count tests in touchfiles + tier map

Each new test depends on its skill template, the runner, and three
preamble resolvers (preamble.ts, generate-ask-user-format.ts,
generate-completion-status.ts) — those affect question cadence and
completion rendering, which is exactly what the test asserts on.

All four classified periodic. Sequential execution during calibration;
opt-in to concurrent only after measured comparison agrees (plan §D15).

Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests
(was 18) because plan-ceo-finding-count joins the family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)

Each test drives its plan-* skill through Step 0 then asserts the
review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5
seeded plan, plus D19: produced plan file ends with
"## GSTACK REVIEW REPORT" as its last "## " heading.

plan-ceo also runs a paired-finding positive control: 2 deliberately
related findings should still produce 2 distinct AUQs, not 1 batched.
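Both assertions in sketch form (the heading literal comes from the commit text above; the band arithmetic is the [N-1, N+2] contract):

```typescript
// Count band: forgiving on consolidation (N-1) and light splitting (N+2).
function countInBand(count: number, n: number): boolean {
  return count >= n - 1 && count <= n + 2;
}

// D19: report present AND the last "## " heading in the produced plan file.
function reportIsLastHeading(content: string): boolean {
  const headings = content.split('\n').filter((l) => l.startsWith('## '));
  return headings.length > 0 && headings[headings.length - 1] === '## GSTACK REVIEW REPORT';
}
```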

Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic).
Sequential execution by plan §D15. Each fixture is inline TypeScript
content delivered as a follow-up message after the slash command, per
the proven pattern at skill-e2e-plan-design-with-ui.test.ts.

Calibration loop (5 runs per skill) and the manual pre-merge negative
check (D7 + D12) are required before merge per plan §Verification.
NOT yet run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: fix parseNumberedOptions for inline-cursor box-layout AUQs

Calibration run 1 timed out with step0=0 review=0 because the parser
could not find the cursor in /plan-ceo-review's scope-selection AUQ.
The TTY's box-layout rendering inlines divider + header + prompt +
"1." onto one logical line — cursor escapes get stripped, leaving
text crushed onto a single line.

Cursor anchor regex changed from anchored to unanchored so it matches
mid-line. Cursor-line option extraction uses a non-anchored regex;
subsequent options stay with the original start-of-line parser.

parseQuestionPrompt picks up the inline prompt text BEFORE the cursor
on the cursor line (after stripping box-drawing chars + sigil) and
appends it after any walked-up multi-line prompt above.

Three new unit tests: clean-cursor still works, inline-cursor
extracts all 7 options, prompt extraction strips box chars.
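The unanchored match in isolation (the crushed line is synthetic; the ❯ sigil follows the runner's "❯ 1." cursor convention):

```typescript
// Cursor line after stripAnsi: divider + header + prompt + option 1, one line.
const crushed =
  '──────── Review scope ─ Which diff should I review? ❯ 1. Branch diff vs main';

// Unanchored: matches mid-line, unlike the old start-of-line anchor.
const inlineCursorRe = /❯\s*([1-9])\.\s*(\S.*?)\s*$/;
const m = inlineCursorRe.exec(crushed);
```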

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add firstAUQPick + plan-ceo skip-interview routing

Calibration run 1 surfaced a second issue beyond the parser bug: the
default pick of 1 on /plan-ceo-review's scope-selection AUQ routes
the agent to "branch diff vs main" — so it reviews the gstack PR
itself (recursive!) instead of the seeded fixture plan we sent.

Added firstAUQPick callback to runPlanSkillCounting. Override applies
only to the FIRST AUQ; subsequent presses keep using defaultPick.

ceoStep0Boundary now fires on either the mode-pick AUQ (existing path)
or any AUQ containing "Skip interview and plan immediately" — which
is the scope-selection AUQ. Picking that option bypasses Step 0 and
routes straight to review-phase using the chat-paste plan as context.

Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the
"Skip interview" option by label. Falls back to "describe inline" if
the option labels change.

Two new unit tests: ceoStep0Boundary fires on the scope-selection
fixture; existing mode-pick fixture still fires.
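The label-based pick with fallback, sketched (option labels taken from the commit text; indices and sibling labels hypothetical):

```typescript
function pickSkipInterview(opts: Array<{ index: number; label: string }>): number {
  const skip = opts.find((o) => /Skip interview/i.test(o.label));
  if (skip) return skip.index;
  // Labels changed? Fall back to the inline-description route, then default.
  const inline = opts.find((o) => /describe inline/i.test(o.label));
  return inline ? inline.index : 1;
}

const scopeAUQ = [
  { index: 1, label: 'Review branch diff vs main' },
  { index: 2, label: 'Review a specific plan file' },
  { index: 3, label: 'Skip interview and plan immediately' },
];
```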

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit 454423aeb3 (parent e8893a18b1), authored by Garry Tan, 2026-04-30 02:50:09 -07:00, committed by GitHub. 14 changed files with 2271 additions and 78 deletions.
+46
@@ -1,5 +1,51 @@
# Changelog
## [1.21.1.0] - 2026-04-28
## **plan-ceo-review smoke tightens. The "agent skips Step 0 and ships a plan" regression now fails the gate.**
The v1.15.0.0 real-PTY harness shipped with a smoke that accepted either `'asked'` or `'plan_ready'` as success. That OR was too lax for `/plan-ceo-review` specifically: the skill template mandates Step 0A premise challenge plus Step 0F mode selection BEFORE any plan write, so reaching `plan_ready` first IS the regression. This release tightens the assertion to `'asked'` only for that smoke, and refactors the runner so the contract is testable in <1s instead of $0.50 of stochastic PTY.
### The numbers that matter
Numbers come from `git diff --shortstat origin/main..HEAD` and `bun test test/helpers/claude-pty-runner.unit.test.ts` on a clean tree.
| Metric | Δ |
|---|---|
| Net branch size vs main | +162 / -65 lines (3 files) |
| New unit tests added | **+24** (claude-pty-runner.unit.test.ts) |
| Unit suite runtime | **14ms** (deterministic, free-tier) |
| Real-PTY gate runs verified | **4 clean PTY runs** (3 lock-in + 1 post-refactor) |
| Outcome assertions covered | **5/5** (was 3/5; `plan_ready` is now FAIL for plan-ceo) |
| Reviewers run on this PR | plan-eng-review (CLEARED) + codex consult + 2 specialists + adversarial |
### What this means for builders
Three new classes of harness regression are now caught deterministically in the free tier instead of waiting on a $0.50 stochastic PTY run. The classifier is extracted into a pure `classifyVisible()` function so reordering branches in the polling loop fails the unit tests instead of silently shipping. Permission dialogs (which render numbered lists) are filtered out of the `'asked'` classification so a permission prompt cannot pose as a Step 0 skill question. The bare phrase `Do you want to proceed?` no longer triggers permission detection on its own — it now requires a file-edit context co-trigger, so a skill question that contains the phrase isn't mis-classified.
For `/plan-ceo-review` specifically: any future preamble slim-down or template edit that lets the agent skip Step 0 and write a plan will fail the gate before the PR ships. Pull, run `bun test`, and the harness layer is provably tighter without you having to spend a token.
### Itemized changes
#### Added
- `test/helpers/claude-pty-runner.unit.test.ts`: 24 deterministic tests covering `isPermissionDialogVisible` (with the new co-trigger contract), `isNumberedOptionListVisible`, `parseNumberedOptions`, and the new `classifyVisible()` runtime path. Free-tier, runs on every `bun test`.
- `classifyVisible(visible)` in `claude-pty-runner.ts`: pure classifier extracted from the polling loop. Returns `{ outcome, summary } | null`. Branch order: silent_write → plan_ready → asked → null (with permission-dialog filter). Live-state branches (process exited, "Unknown command") stay in the runner.
- `TAIL_SCAN_BYTES = 1500` exported constant. Shared between `runPlanSkillObservation` and the routing test's nav loop so tuning stays in sync.
- `env?: Record<string, string>` option on `runPlanSkillObservation`, threaded to `launchClaudePty`. Plumbing for future env-driven test isolation (gstack-config does not yet honor env overrides; tracked as post-merge follow-up).
#### Changed
- `test/skill-e2e-plan-ceo-plan-mode.test.ts`: assertion narrowed from `['asked', 'plan_ready']` to `'asked'` only. Failure message now branches on `outcome` (plan_ready vs timeout vs silent_write) with a tailored diagnosis line, and references skill-template section names instead of line numbers (durable to template edits).
- `isPermissionDialogVisible`: bare `Do you want to proceed?` now requires a file-edit context co-trigger (`Edit to <path>` or `Write to <path>`). Other clauses (`requested permissions to`, `allow all edits`, `always allow access to`, `Bash command requires permission`) remain unconditional.
- `test/skill-e2e-plan-ceo-mode-routing.test.ts`: replaces the local `1500` magic number with the shared `TAIL_SCAN_BYTES` constant.
#### For contributors
- The runner change is additive and the existing sibling smokes (`plan-eng`, `plan-design`, `plan-devex`, `plan-mode-no-op`) keep their loose `['asked', 'plan_ready']` assertion. Their behavior is unchanged.
- Post-merge follow-ups captured in `TODOS.md`: per-finding AskUserQuestion count assertion (V2), env-driven gstack-config overrides (so `QUESTION_TUNING=false` actually isolates the test), path-confusion hardening on `SANCTIONED_WRITE_SUBSTRINGS`.
## [1.20.0.0] - 2026-04-28
## **Browser-skills land. `/scrape <intent>` first call drives the page; second call runs the codified script in 200ms.**
+50
@@ -213,6 +213,56 @@ scope of that PR; deliberately deferred to keep PTY-import small.
## Testing
## P2: Per-finding AskUserQuestion count assertion for /plan-ceo-review
**What:** PTY E2E test that drives /plan-ceo-review through Step 0 with a stable fixture diff containing N known findings, asserts that exactly N distinct AskUserQuestions fire (one per finding) before plan_ready.
**Why:** The skill template repeats "One issue = one AskUserQuestion call. Never combine multiple issues into one question." at every review checkpoint. No test enforces it. The current `skill-e2e-plan-ceo-plan-mode.test.ts` smoke (post-v1.21.1.0) only catches "agent skipped Step 0 entirely." Batching findings into one question slips through silently.
**Pros:** Locks in the strongest contract the skill mandates. Catches a real failure mode (the original attachment showed 2 findings batched as 0 questions).
**Cons:** Needs a stable fixture diff to keep finding count deterministic (~1 day human / ~30 min CC). Opus may reasonably consolidate two related findings, so the assertion needs a forgiving lower bound (e.g., `>= ceil(N * 0.6)`) rather than strict equality.
**Context:** The PTY harness (`runPlanSkillObservation`) returns at first terminal outcome — for V2 we need a streaming variant that counts AskUserQuestions across the whole session up to `plan_ready`. Probably a new helper alongside `runPlanSkillObservation`.
**Depends on:** Stable fixture diff (`test/fixtures/plans/multi-finding.diff` or similar) with a small known set of issues that triggers all 4 review sections.
**Priority:** P2.
**Effort:** S (CC: ~30 min once fixture exists). Captured from v1.21.1.0 plan-eng-review D2.
---
## P3: Honor env vars in gstack-config (so QUESTION_TUNING/EXPLAIN_LEVEL actually isolate tests)
**What:** `gstack-config get <key>` reads `~/.gstack/config.yaml`. `runPlanSkillObservation` plumbs `env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }` through to the spawned `claude` process — but the skill preamble bash uses `gstack-config get question_tuning`, which never looks at env. The env passthrough is theater on current code.
**Why:** Without env honoring, the v1.21.1.0 plan-ceo-review smoke is still flaky on machines with `question_tuning: true` set in YAML. AUTO_DECIDE preferences would skip the rendered AskUserQuestion list, masking the regression we want to catch.
**Pros:** Makes the gate test hermetic across machines. The env wiring is already in place — only `gstack-config` needs to read env first, fall back to YAML.
**Cons:** Touches the gstack-config binary across all 3 platforms (linux/darwin/windows). Cross-binary refactor.
**Context:** Captured from v1.21.1.0 adversarial review. Documented honestly in the test docstring as a known limitation.
**Priority:** P3.
**Effort:** S. Single-file edit to `bin/gstack-config` (~10 LOC for env-first lookup).
---
## P3: Path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS
**What:** `runPlanSkillObservation`'s silent-write detector uses substring matching on a few sanctioned paths (`.gstack/`, `CHANGELOG.md`, `TODOS.md`, etc). A write to `node_modules/some-pkg/CHANGELOG.md` or `src/foo/.gstack/leak.ts` is currently sanctioned because the substring matches anywhere in the path.
**Why:** Defensive — no current bug exploits this, but a malicious skill or fixture could write to a path that happens to contain `.gstack/` or `CHANGELOG.md` and slip past silent-write detection.
**Pros:** Hardens the harness against future skill misbehavior. Aligns substring rules with their intent.
**Cons:** Need to anchor against absolute prefixes (`os.homedir() + '/.gstack/'`, worktree root) which makes the test less portable across machines.
**Context:** Captured from v1.21.1.0 adversarial review (HIGH/FIXABLE finding, pre-existing). Refactored into a `SANCTIONED_WRITE_SUBSTRINGS` constant in v1.21.1.0 but the substring-includes logic is unchanged from before.
**Priority:** P3.
**Effort:** S.
---
## P1: Structural STOP-Ask forcing function across all skills
**What:** Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired <N), typed question templates (skill hands the model pre-built AskUserQuestion payloads rather than prose instructions), or a canUseTool-based post-run audit that compares declared-gates-fired vs expected.
+1 -1
@@ -1 +1 @@
1.21.1.0
+1 -1
@@ -1,6 +1,6 @@
{
"name": "gstack",
"version": "1.21.1.0",
"description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
"license": "MIT",
"type": "module",
+690 -51
@@ -138,6 +138,15 @@ export function isPlanReadyVisible(visible: string): boolean {
return /ready to execute|Would you like to proceed/i.test(visible);
}
/**
* Recent-tail window (in bytes of stripped TTY text) used when classifying
* permission dialogs. Old permission text persists in the visibleSince buffer
* after the dialog is dismissed, so callers should pass `visible.slice(-TAIL_SCAN_BYTES)`
* to avoid re-triggering on stale scrollback. Shared between `runPlanSkillObservation`
* and `navigateToModeAskUserQuestion` in the routing test so tuning stays in sync.
*/
export const TAIL_SCAN_BYTES = 1500;
/**
* Detect a Claude Code permission dialog. These render as a numbered
* option list (so isNumberedOptionListVisible matches them) but they
@@ -145,23 +154,37 @@ export function isPlanReadyVisible(visible: string): boolean {
* whether to grant a tool/file permission. Tests that look for skill
* AskUserQuestions must explicitly skip these.
*
* The English phrases below are stable across recent Claude Code
* versions. The check is permissive on whitespace because TTY rendering
* may wrap or reflow text.
*
* Co-trigger requirement: the bare phrase "Do you want to proceed?" is
* generic enough that a skill question could legitimately use it
* ("Do you want to proceed with HOLD SCOPE?"). To avoid mis-classifying
* skill questions as permission dialogs, this phrase only counts when it
* co-occurs with a file-edit context ("Edit to <path>" or "Write to <path>").
* The standalone permission signatures (`requested permissions to`,
* `allow all edits`, `always allow access to`, `Bash command requires permission`)
* remain unconditional.
*/
export function isPermissionDialogVisible(visible: string): boolean {
// Standalone signatures — high specificity, never appear in skill questions.
if (/requested\s+permissions?\s+to/i.test(visible)) return true;
// "Yes / Yes, allow all edits / No" shape — file-edit permission grants.
if (/\ballow\s+all\s+edits\b/i.test(visible)) return true;
// "Yes, and always allow access to <dir>" shape — workspace trust.
if (/always\s+allow\s+access\s+to/i.test(visible)) return true;
// Bash command permission prompts.
if (/Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)) return true;
// "Do you want to proceed?" only counts as a permission dialog when paired
// with a file-edit context. Skill questions can use the bare phrase.
if (
/Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) &&
/(Edit|Write)\s+to\s+\S+/i.test(visible)
) {
return true;
}
return false;
}
/** Detect any AskUserQuestion-shaped numbered option list with cursor. */
@@ -211,12 +234,14 @@ export function parseNumberedOptions(
// this, parseNumberedOptions returns stale options after the dialog is
// dismissed.
const lines = tail.split('\n');
// Anchor on the LAST line containing `<spaces>❯ 1.` ANYWHERE on the line.
// The /plan-*-review skill's box-layout AUQ uses TTY cursor-positioning
// escapes that stripAnsi removes — leaving the cursor `1.` mid-line,
// after dividers + header + prompt text on the same logical line. The
// earlier `^\s*` anchor missed those entirely.
let cursorLineIdx = -1;
for (let i = lines.length - 1; i >= 0; i--) {
if (/❯\s*1\./.test(lines[i] ?? '')) {
cursorLineIdx = i;
break;
}
@@ -236,7 +261,37 @@ export function parseNumberedOptions(
if (cursorLineIdx < 0) return [];
const found: Array<{ index: number; label: string }> = [];
const seenIndices = new Set<number>();
// Cursor line: option 1 may be inline after box dividers + prompt header
// (`...divider...header...❯ 1. label`). Use a non-anchored regex that
// captures `N. label` from anywhere on the line through end-of-line.
// Only used for the cursor line — subsequent options are parsed with the
// start-of-line `optionRe`.
const cursorLine = lines[cursorLineIdx] ?? '';
const cursorInlineRe = /❯\s*([1-9])\.\s*(\S.*?)\s*$/;
const inlineMatch = cursorInlineRe.exec(cursorLine);
if (inlineMatch) {
const idx = Number(inlineMatch[1]);
const label = (inlineMatch[2] ?? '').trim();
if (label.length > 0 && !seenIndices.has(idx)) {
seenIndices.add(idx);
found.push({ index: idx, label });
}
} else {
// No inline cursor match — fall back to start-of-line regex.
const startMatch = optionRe.exec(cursorLine);
if (startMatch) {
const idx = Number(startMatch[1]);
const label = (startMatch[2] ?? '').trim();
if (label.length > 0 && !seenIndices.has(idx)) {
seenIndices.add(idx);
found.push({ index: idx, label });
}
}
}
// Subsequent lines: standard start-of-line option parsing.
for (let i = cursorLineIdx + 1; i < lines.length; i++) {
const m = optionRe.exec(lines[i] ?? '');
if (!m) continue;
const idx = Number(m[1]);
@@ -261,6 +316,333 @@ export function parseNumberedOptions(
return found;
}
/**
* The four /plan-ceo-review modes. Used by `skill-e2e-plan-ceo-mode-routing`
* to detect Step 0F mode-selection AskUserQuestions, and by the upcoming
* finding-count tests as a Step-0 boundary signal: an AUQ whose options
* match this regex IS the mode pick (the last Step-0 question for plan-ceo).
*
* Lifted out of the mode-routing test so multiple PTY tests can share one
* source of truth — when /plan-ceo-review adds a fifth mode, one regex updates
* everywhere instead of drifting per-test.
*/
export const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i;
/**
* Stable signature for a parsed numbered-option list — used by tests to detect
* "is this AUQ the same as the last poll, or has the agent advanced to a new
* one?" Joins each option as `${index}:${label}` after sorting by index.
*
* Defensive sort means the signature is order-independent at the input level,
* even though `parseNumberedOptions` already returns indices in ascending order.
*/
export function optionsSignature(
opts: Array<{ index: number; label: string }>,
): string {
return [...opts]
.sort((a, b) => a.index - b.index)
.map((o) => `${o.index}:${o.label}`)
.join('|');
}
/**
* Pure classifier for the visible TTY buffer. Decides which outcome the
* polling loop should return on this tick, or `null` to keep polling.
*
* Extracted from `runPlanSkillObservation` so the unit suite can exercise
* the actual branch order with synthetic input strings — a future contributor
* who reorders the branches (e.g., moves the permission short-circuit) gets
* caught by the unit tests, not by a stochastic E2E run.
*
* Live-state branches (process exited, "Unknown command") stay in the runner
* since they need the session handle.
*/
export type ClassifyResult =
| { outcome: 'silent_write'; summary: string }
| { outcome: 'plan_ready'; summary: string }
| { outcome: 'asked'; summary: string }
| null;
const SANCTIONED_WRITE_SUBSTRINGS = [
'.claude/plans',
'.gstack/',
'/.context/',
'CHANGELOG.md',
'TODOS.md',
];
export function classifyVisible(visible: string): ClassifyResult {
// Silent-write detection: any Write/Edit tool render that targets a path
// OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen
// (a numbered prompt means a permission/AskUserQuestion is gating the write,
// not an actual silent write).
const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
let m: RegExpExecArray | null;
while ((m = writeRe.exec(visible)) !== null) {
const target = m[1] ?? '';
const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s));
if (!sanctioned && !isNumberedOptionListVisible(visible)) {
return {
outcome: 'silent_write',
summary: `Write/Edit to ${target} fired before any AskUserQuestion`,
};
}
}
if (isPlanReadyVisible(visible)) {
return {
outcome: 'plan_ready',
summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation',
};
}
if (isNumberedOptionListVisible(visible)) {
// Permission dialogs render numbered lists too. Skip them — the
// bug we want to catch is "skill question never fired."
if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
return null;
}
return {
outcome: 'asked',
summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)',
};
}
return null;
}
// ────────────────────────────────────────────────────────────────────────────
// Per-finding AskUserQuestion count primitives (used by runPlanSkillCounting).
//
// These are pure helpers extracted up-front so the unit suite can exercise
// them deterministically before the live-PTY counter runs them. Each one is
// independently unit-testable against synthetic visible-buffer strings.
// ────────────────────────────────────────────────────────────────────────────
/**
* Captured identity of an AskUserQuestion — the rendered question text plus
* its numbered options. Used by `runPlanSkillCounting` to dedupe redrawn
* prompts and to feed `Step0BoundaryPredicate` callers.
*
* `signature` is the stable hash. Two AUQs with identical prompt + options
* produce the same signature; differences in either field produce different
* signatures. Critically: two AUQs with shared option labels (e.g. the
* generic "A) Add to plan / B) Defer / C) Build now" menu) but different
* question text get DIFFERENT signatures because the prompt is in the hash.
*/
export interface AskUserQuestionFingerprint {
/** Stable hash combining normalized prompt text + options signature. */
signature: string;
/** First 240 chars of the rendered question prompt (post-normalization). */
promptSnippet: string;
/** Captured option labels, in index order. */
options: Array<{ index: number; label: string }>;
/** Wall-clock when first observed (ms since the helper started polling). */
observedAtMs: number;
/** True if observed BEFORE the Step-0 boundary fired. */
preReview: boolean;
}
/**
* Predicate fired against the AUQ we just answered (not the visible buffer).
* Returns true if this AUQ's fingerprint marks the LAST Step-0 question for
* its skill — all subsequent AUQs are review-phase findings.
*
* Event-based by design: matching against an answered AUQ's fingerprint
* (prompt + options) is deterministic, whereas matching against later
* rendered content (section headers, summary text) races with the agent's
* output cadence. See plan §D14 for the rationale.
*/
export type Step0BoundaryPredicate = (
answeredFingerprint: AskUserQuestionFingerprint,
) => boolean;
/**
* Parse the rendered question prompt out of a visible TTY buffer. The prompt
* is the 1-3 lines of text immediately ABOVE the latest `❯ 1.` cursor line —
* not part of the option list, not the permission-dialog header.
*
* Returns the prompt normalized to a single-spaced 240-char snippet (strip
* ANSI residue, collapse internal whitespace, trim) — short enough to use as
* a hash key, long enough to disambiguate distinct questions.
*
* Returns "" when no prompt could be parsed (cursor not yet rendered, or
* cursor is at the top of the buffer with no preceding text). Callers that
* use the empty string as a fingerprint input should treat empty-prompt
* AUQs as "wait one more poll" rather than fingerprinting them — otherwise
* the same options + empty prompt across two distinct questions collide.
*/
export function parseQuestionPrompt(visible: string): string {
// Tail-only — older prompts higher in the buffer are stale.
const tail = visible.length > 4096 ? visible.slice(-4096) : visible;
const lines = tail.split('\n');
// Find the latest line containing `<spaces>❯ 1.` (matching parseNumberedOptions —
// unanchored to handle the box-layout case where cursor is mid-line after
// divider + header + prompt text on the same logical line).
let cursorLineIdx = -1;
for (let i = lines.length - 1; i >= 0; i--) {
if (/❯\s*1\./.test(lines[i] ?? '')) {
cursorLineIdx = i;
break;
}
}
if (cursorLineIdx < 0) return '';
// Box-layout case: prompt text may be ON the cursor line, BEFORE `1.`.
// Extract that prefix (after stripping leading box-drawing characters and
// dividers) as the last piece of the prompt — appended after any prior
// multi-line prompt text we walk up to find.
const cursorLine = lines[cursorLineIdx] ?? '';
let inlinePrompt = '';
const cursorPos = cursorLine.search(/\s*1\./);
if (cursorPos > 0) {
inlinePrompt = cursorLine
.slice(0, cursorPos)
// Strip box-drawing chars + dividers + leading checkbox sigil.
.replace(/^[─━┄┅┈┉─┌┐└┘├┤┬┴┼│┃☐□■\s]+/, '')
.trim();
}
// Walk up at most 6 lines collecting prompt text. Stop at:
// - a blank line preceded by another blank line (paragraph break)
// - top of buffer
// - a line that itself starts with `N.` (we're inside an option list)
const promptLines: string[] = [];
let blankRun = 0;
for (let i = cursorLineIdx - 1; i >= 0 && promptLines.length < 6; i--) {
const raw = lines[i] ?? '';
const trimmed = raw.trim();
if (trimmed === '') {
blankRun += 1;
if (blankRun >= 2 && promptLines.length > 0) break;
continue;
}
blankRun = 0;
// Stop if we hit what looks like a previous numbered list.
if (/^[\s]*[1-9]\.\s+\S/.test(raw)) break;
promptLines.unshift(trimmed);
}
const all = inlinePrompt.length > 0 ? [...promptLines, inlinePrompt] : promptLines;
const joined = all.join(' ').replace(/\s+/g, ' ').trim();
return joined.slice(0, 240);
}
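The normalization contract (single-spaced join, 240-char cap) can be illustrated standalone. A minimal sketch of only the join/normalize/truncate tail of the function above, not the cursor-walk itself; `normalizePrompt` is a hypothetical name for this excerpt:

```typescript
// Minimal sketch of the normalization tail: join collected prompt lines,
// collapse whitespace runs, trim, and cap at 240 chars (hash-key friendly).
function normalizePrompt(promptLines: string[]): string {
  const joined = promptLines.join(' ').replace(/\s+/g, ' ').trim();
  return joined.slice(0, 240);
}

// Multi-line prompt with ragged internal spacing collapses to one line.
const snippet = normalizePrompt(['D2 — Approach selection', '  Which   architecture? ']);
// Over-long prompts truncate to exactly 240 chars.
const capped = normalizePrompt(['A'.repeat(500)]);
```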
/**
* Stable hash for an AskUserQuestion's identity — combines normalized prompt
* text with the options signature so two distinct questions with shared menu
* labels (the generic A/B/C TODO-proposal menu, for instance) get different
* fingerprints.
*
* Uses Bun's fast non-crypto hash since these strings are short and we only
* need collision resistance against accidental TTY redraws, not adversaries.
* Hex-encoded for diagnostic dumps.
*/
export function auqFingerprint(
promptSnippet: string,
opts: Array<{ index: number; label: string }>,
): string {
const normalized = promptSnippet.replace(/\s+/g, ' ').trim();
const sig = optionsSignature(opts);
// eslint-disable-next-line @typescript-eslint/no-explicit-any
return (Bun as any).hash(normalized + '||' + sig).toString(16);
}
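The prompt-plus-options mixing can be illustrated without Bun. A sketch assuming a stand-in 32-bit FNV-1a hash (NOT the runner's actual `Bun.hash`) and the `index:label|…` signature shape that the `optionsSignature` unit tests below pin down; `sketchFingerprint` is a hypothetical name:

```typescript
// Stand-in 32-bit FNV-1a hash. Illustrative only; the runner uses Bun.hash.
function fnv1a(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Same mixing shape as auqFingerprint: normalized prompt || options signature.
function sketchFingerprint(
  prompt: string,
  opts: Array<{ index: number; label: string }>,
): string {
  const sig = opts.map((o) => `${o.index}:${o.label}`).join('|');
  return fnv1a(prompt.replace(/\s+/g, ' ').trim() + '||' + sig);
}

const shared = [
  { index: 1, label: 'Add to plan' },
  { index: 2, label: 'Defer' },
];
// Distinct prompts with the same menu shape get distinct fingerprints.
const a = sketchFingerprint('D5 — bypass helper?', shared);
const b = sketchFingerprint('D6 — zero coverage?', shared);
```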
/**
* Detects when a plan-* skill has reached its Completion Summary / Review
* Report — a terminal signal complementary to plan-mode's "Ready to execute"
* confirmation. Each plan-review skill writes one of these phrasings near
* the end of its run; matching any one is enough to stop counting.
*
* Best-effort: this is a content marker, not a deterministic event. Hard
* ceiling (`reviewCountCeiling` in `runPlanSkillCounting`) is the reliable
* stop signal; this regex is the "we're done, go gracefully" hint.
*/
export const COMPLETION_SUMMARY_RE =
/(GSTACK REVIEW REPORT|## Completion [Ss]ummary|Status:\s*(clean|issues_open)|^VERDICT:)/m;
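As a quick sanity check, the regex's four alternatives can be exercised against representative terminal frames (a standalone sketch; the pattern is copied verbatim from the constant above):

```typescript
// Copy of COMPLETION_SUMMARY_RE; the multiline flag matters for ^VERDICT:.
const completionRe =
  /(GSTACK REVIEW REPORT|## Completion [Ss]ummary|Status:\s*(clean|issues_open)|^VERDICT:)/m;

// One frame per alternative; each should match.
const terminalFrames = [
  'Done. ## GSTACK REVIEW REPORT follows below.',
  'here is the ## Completion summary for this run',
  'Status: issues_open',
  'some output\nVERDICT: ship it',
];
const results = terminalFrames.map((f) => completionRe.test(f));
```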
/**
* Result of asserting that a plan file ends with `## GSTACK REVIEW REPORT`
* as its last `## ` heading. `ok` is true iff the report is present AND no
* other `## ` heading appears after it. Diagnostic fields are populated only
* on failure to keep the success path cheap.
*/
export interface ReviewReportAtBottomResult {
ok: boolean;
reason?: string;
trailingHeadings?: string[];
}
/**
* Assert that `## GSTACK REVIEW REPORT` is the last `## ` heading in a plan
* file's content. Pure string operation — no filesystem access. Used by the
* finding-count E2E tests as a second assertion on each test's produced plan.
*
* The plan-mode skill template mandates the agent move/append the review
* report so it's always the last `##` section. A regression where the agent
* appends additional sections after the report (or skips it entirely) ships
* silently today; this assertion catches both.
*/
export function assertReviewReportAtBottom(
content: string,
): ReviewReportAtBottomResult {
const re = /^## GSTACK REVIEW REPORT\s*$/m;
const match = re.exec(content);
if (!match) {
return { ok: false, reason: 'no GSTACK REVIEW REPORT section' };
}
const after = content.slice(match.index + match[0].length);
// Match any `## ` heading after the report. Reject `## ` followed by
// newline-only (trailing-whitespace ## headers) to avoid false positives.
const trailingHeadings = Array.from(
after.matchAll(/^## \S.*$/gm),
).map((m) => m[0]);
if (trailingHeadings.length > 0) {
return {
ok: false,
reason: 'trailing ## heading(s) after GSTACK REVIEW REPORT',
trailingHeadings,
};
}
return { ok: true };
}
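The invariant reads cleanly as a boolean check. A runnable restatement of the helper's logic for illustration (`reportAtBottom` is a hypothetical condensed name; the real helper above also returns diagnostics):

```typescript
// True iff "## GSTACK REVIEW REPORT" exists and no "## " heading follows it.
function reportAtBottom(content: string): boolean {
  const m = /^## GSTACK REVIEW REPORT\s*$/m.exec(content);
  if (!m) return false;
  const after = content.slice(m.index + m[0].length);
  return Array.from(after.matchAll(/^## \S.*$/gm)).length === 0;
}

const good = '## Plan\nbody\n\n## GSTACK REVIEW REPORT\n\nStatus: clean\n';
const bad = '## GSTACK REVIEW REPORT\n\n## Afterthoughts\nmore\n';
```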
/**
* Per-skill Step-0 boundary predicates. Each fires `true` when the answered
* AUQ's fingerprint matches the LAST question of that skill's Step 0 phase.
*
* - `ceoStep0Boundary`: matches the mode-pick AUQ (options match `MODE_RE`).
* - `engStep0Boundary`: matches the cross-project-learnings or scope-reduction
* AUQ that closes plan-eng-review's preamble.
* - `designStep0Boundary`: matches plan-design-review's first dimension /
* posture AUQ.
* - `devexStep0Boundary`: matches plan-devex-review's persona-selection AUQ.
*
* Predicates live alongside the helper so the unit suite can exercise each
* against synthetic fingerprints (positive AND negative cases). Skill test
* files import them directly.
*/
export const ceoStep0Boundary: Step0BoundaryPredicate = (fp) =>
// Mode-pick path (Step 0F): one of HOLD SCOPE / SCOPE EXPANSION / etc.
fp.options.some((o) => MODE_RE.test(o.label)) ||
// Skip-interview path: scope-selection AUQ has "Skip interview and plan
// immediately" — picking it bypasses the rest of Step 0 and routes
// directly to review-phase. Boundary fires on the scope AUQ itself.
fp.options.some((o) => /skip\s+interview|plan\s+immediately/i.test(o.label));
export const engStep0Boundary: Step0BoundaryPredicate = (fp) =>
/scope reduction recommendation|cross[\s-]?project learnings/i.test(
fp.promptSnippet,
);
export const designStep0Boundary: Step0BoundaryPredicate = (fp) =>
/design system|design posture|design score|first dimension/i.test(
fp.promptSnippet,
);
export const devexStep0Boundary: Step0BoundaryPredicate = (fp) =>
/developer persona|target persona|persona selection|TTHW target/i.test(
fp.promptSnippet,
);
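Each predicate boils down to a regex test over the answered AUQ's fingerprint. A standalone restatement of the plan-eng case (regex copied from `engStep0Boundary` above; the trimmed `Fp` type is an assumption for this sketch):

```typescript
// Minimal fingerprint shape for the sketch; the real type carries more fields.
type Fp = { promptSnippet: string; options: Array<{ index: number; label: string }> };

// Same regex as engStep0Boundary: fires on the AUQ that closes the preamble.
const engBoundary = (fp: Fp) =>
  /scope reduction recommendation|cross[\s-]?project learnings/i.test(fp.promptSnippet);

const hit: Fp = { promptSnippet: 'Any cross-project learnings to apply?', options: [] };
const miss: Fp = { promptSnippet: 'D3 — Pick an approach', options: [] };
```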
/**
* Spawn `claude --permission-mode plan` in a real PTY and return a session
* handle. Caller is responsible for `await session.close()` to release the
@@ -566,12 +948,21 @@ export async function runPlanSkillObservation(opts: {
cwd?: string;
/** Total budget for skill to reach a terminal outcome. Default 180000. */
timeoutMs?: number;
/**
* Extra env merged into the spawned `claude` process. `launchClaudePty`
* already supports this; exposing it here lets per-skill tests isolate
* from local config that would mask the regression they're trying to
* catch (e.g., `QUESTION_TUNING=true` causing AUTO_DECIDE to skip the
* rendered AskUserQuestion list).
*/
env?: Record<string, string>;
}): Promise<PlanSkillObservation> {
const startedAt = Date.now();
const session = await launchClaudePty({
permissionMode: opts.inPlanMode === false ? null : 'plan',
cwd: opts.cwd,
timeoutMs: (opts.timeoutMs ?? 180_000) + 30_000,
env: opts.env,
});
try {
@@ -602,40 +993,10 @@ export async function runPlanSkillObservation(opts: {
elapsedMs: Date.now() - startedAt,
};
}
// Classify the visible buffer. Branch order (silent_write, then plan_ready,
// then asked, then null) lives in classifyVisible, with permission dialogs
// filtered out of the 'asked' classification so a permission prompt cannot
// pose as a Step 0 skill question.
const classified = classifyVisible(visible);
if (classified) {
return {
...classified,
evidence: visible.slice(-2000),
elapsedMs: Date.now() - startedAt,
@@ -652,3 +1013,281 @@ export async function runPlanSkillObservation(opts: {
await session.close();
}
}
// ────────────────────────────────────────────────────────────────────────────
// runPlanSkillCounting — drives a plan-* skill end-to-end through Step 0 then
// counts distinct review-phase AskUserQuestion fingerprints. The actual
// product asserted by the per-finding-count tests.
// ────────────────────────────────────────────────────────────────────────────
/**
* Result of a `runPlanSkillCounting` run. Includes both the count summary
* (`step0Count`, `reviewCount`) and the full fingerprint list for diagnostic
* dumps when an assertion fails.
*/
export interface PlanSkillCountObservation {
outcome:
| 'plan_ready'
| 'completion_summary'
| 'ceiling_reached'
| 'silent_write'
| 'exited'
| 'timeout';
summary: string;
/** Visible terminal text at terminal time (last 3KB). */
evidence: string;
/** Wall time (ms) until the outcome was decided. */
elapsedMs: number;
/** All distinct AskUserQuestions observed, in observation order. */
fingerprints: AskUserQuestionFingerprint[];
/** Count of fingerprints with `preReview === true`. */
step0Count: number;
/** Count of fingerprints with `preReview === false`. */
reviewCount: number;
}
/**
* Drive a plan-* skill in plan mode and count distinct review-phase
* AskUserQuestions until a terminal signal fires.
*
* Flow:
* 1. Boot PTY in plan mode (8s grace + auto-trust dialog).
* 2. Send `slashCommand` alone. Sleep ~3s.
* 3. Send `followUpPrompt` as a chat message — this is the plan content
* the skill reviews. Slash commands with trailing args are rejected by
* Claude Code unless the skill defines them, so the plan goes as a
* follow-up message (the proven pattern at
* skill-e2e-plan-design-with-ui.test.ts:57-71).
* 4. Poll loop:
* - Skip permission dialogs (auto-grant with `defaultPick`).
* - On a new numbered-option list, parse prompt + options, build
* fingerprint via `auqFingerprint`. Empty-prompt parses are skipped
* and re-polled (avoids the empty-prompt collision documented in
* the auqFingerprint contract).
* - First time we see a fingerprint: push it, classify as Step 0 or
* review-phase based on `boundaryFired`, press `defaultPick` to
* advance.
* - After pressing, evaluate `isLastStep0AUQ(fingerprint)`. If true,
* all subsequent AUQs are review-phase.
* - Hard ceiling: if `reviewCount >= reviewCountCeiling`, return
* `ceiling_reached`. This bounds runaway counts; tests should set
* the ceiling above their assertion CEILING.
* - Soft terminals: `COMPLETION_SUMMARY_RE` match → `completion_summary`;
* plan-ready confirmation → `plan_ready`; silent write outside
* sanctioned dirs → `silent_write`; process exited → `exited`;
* wall clock exceeded → `timeout`.
*
* Boundary detection (D14): event-based, fired against the answered AUQ's
* fingerprint, not against later rendered content. This avoids the race
* where Step-0-final and Section-1-first AUQs straddle a section header
* regex match.
*
* Fingerprint composition (D9): `auqFingerprint(prompt, options)` mixes
* normalized prompt text with the options signature so distinct findings
* with shared menu structure (the generic A/B/C TODO menu) get distinct
* fingerprints.
*/
export async function runPlanSkillCounting(opts: {
/** Skill name, e.g. 'plan-ceo-review'. Used for diagnostic strings only. */
skillName: string;
/** Slash command to send alone, e.g. '/plan-ceo-review'. No trailing args. */
slashCommand: string;
/** Plan content sent as a follow-up message ~3s after the slash command. */
followUpPrompt: string;
/** Per-skill predicate: which answered AUQ is the last Step-0 question. */
isLastStep0AUQ: Step0BoundaryPredicate;
/** Hard cap on review-phase count; helper returns when reached. Should be
* set ABOVE the test's assertion ceiling so the test sees the cap as a
* failure rather than a silent stop. */
reviewCountCeiling: number;
/** Numbered option to press by default. Defaults to 1 (recommended). */
defaultPick?: number;
/**
* Optional override for the FIRST AUQ observed. Receives the fingerprint;
* returns the option index to press. Subsequent AUQs always use defaultPick.
*
* Skill-specific routing helper: /plan-ceo-review's first AUQ asks "what
* scope?" with options like "branch diff" / "describe inline" / "skip
* interview". Pressing the default 1 routes to "branch diff" (the wrong
* review target for a seeded fixture). firstAUQPick lets the test pick
* "Skip interview" or "describe inline" so the agent reviews the
* follow-up plan content the test sent, not the git diff.
*/
firstAUQPick?: (fp: AskUserQuestionFingerprint) => number;
/** Working directory. Default process.cwd() (repo cwd holds skill registry). */
cwd?: string;
/** Total budget for skill to reach a terminal outcome. Default 1_500_000 (25 min). */
timeoutMs?: number;
/** Extra env merged into the spawned `claude` process. */
env?: Record<string, string>;
}): Promise<PlanSkillCountObservation> {
const startedAt = Date.now();
const defaultPick = opts.defaultPick ?? 1;
const timeoutMs = opts.timeoutMs ?? 1_500_000;
const session = await launchClaudePty({
permissionMode: 'plan',
cwd: opts.cwd,
timeoutMs: timeoutMs + 60_000,
env: opts.env,
});
const fingerprints: AskUserQuestionFingerprint[] = [];
const seen = new Set<string>();
let boundaryFired = false;
let step0Count = 0;
let reviewCount = 0;
let isFirstAUQ = true;
let lastSig = '';
function snapshot(
outcome: PlanSkillCountObservation['outcome'],
summary: string,
visible: string,
): PlanSkillCountObservation {
return {
outcome,
summary,
evidence: visible.slice(-3000),
elapsedMs: Date.now() - startedAt,
fingerprints,
step0Count,
reviewCount,
};
}
try {
await Bun.sleep(8000); // boot grace + auto-trust handler window
const since = session.mark();
session.send(`${opts.slashCommand}\r`);
await Bun.sleep(3000);
session.send(`${opts.followUpPrompt}\r`);
const budgetStart = Date.now();
while (Date.now() - budgetStart < timeoutMs) {
await Bun.sleep(2000);
const visible = session.visibleSince(since);
// Process exited?
if (session.exited()) {
return snapshot(
'exited',
`claude exited (code=${session.exitCode()}) during counting (step0=${step0Count}, review=${reviewCount})`,
visible,
);
}
if (visible.includes('Unknown command:')) {
return snapshot(
'exited',
`claude rejected ${opts.slashCommand} as unknown command (skill not registered in this cwd)`,
visible,
);
}
// Silent write detection — only fires if no numbered prompt is on
// screen (otherwise the write is gated by a permission/AUQ).
const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
let m: RegExpExecArray | null;
while ((m = writeRe.exec(visible)) !== null) {
const target = m[1] ?? '';
const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) =>
target.includes(s),
);
if (!sanctioned && !isNumberedOptionListVisible(visible)) {
return snapshot(
'silent_write',
`Write/Edit to ${target} fired before any AskUserQuestion`,
visible,
);
}
}
// Soft terminal signals — check before AUQ processing so a final
// completion-summary doesn't get misclassified as a bonus AUQ.
if (COMPLETION_SUMMARY_RE.test(visible)) {
return snapshot(
'completion_summary',
`skill emitted completion summary / verdict / status line (step0=${step0Count}, review=${reviewCount})`,
visible,
);
}
if (isPlanReadyVisible(visible)) {
return snapshot(
'plan_ready',
`skill emitted plan-mode "Ready to execute" confirmation (step0=${step0Count}, review=${reviewCount})`,
visible,
);
}
// Numbered option list?
if (!isNumberedOptionListVisible(visible)) continue;
// Permission dialog? Auto-grant with defaultPick. Only act on the
// recent tail to avoid re-triggering on stale dialogs in scrollback.
if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
session.send(`${defaultPick}\r`);
await Bun.sleep(1500);
continue;
}
// Parse the active AUQ. Skip same-redraw and empty-prompt cases.
const options = parseNumberedOptions(visible);
if (options.length < 2) continue;
const sig = optionsSignature(options);
if (sig === lastSig) continue;
const promptSnippet = parseQuestionPrompt(visible);
if (promptSnippet === '') continue; // not yet rendered, poll again
lastSig = sig;
const fingerprintHash = auqFingerprint(promptSnippet, options);
if (seen.has(fingerprintHash)) {
// Same content, already counted (TTY redrew with whitespace diff).
continue;
}
seen.add(fingerprintHash);
const fp: AskUserQuestionFingerprint = {
signature: fingerprintHash,
promptSnippet,
options,
observedAtMs: Date.now() - startedAt,
preReview: !boundaryFired,
};
fingerprints.push(fp);
if (boundaryFired) reviewCount += 1;
else step0Count += 1;
// Press to advance — first AUQ may use the override pick.
const pickIdx =
isFirstAUQ && opts.firstAUQPick ? opts.firstAUQPick(fp) : defaultPick;
isFirstAUQ = false;
session.send(`${pickIdx}\r`);
// Evaluate boundary AFTER pressing — if THIS AUQ was the last Step 0
// question, all subsequent AUQs go to reviewCount.
if (!boundaryFired && opts.isLastStep0AUQ(fp)) {
boundaryFired = true;
}
// Hard ceiling — runaway protection.
if (reviewCount >= opts.reviewCountCeiling) {
return snapshot(
'ceiling_reached',
`review-phase AUQ count reached ceiling (${opts.reviewCountCeiling})`,
session.visibleSince(since),
);
}
// Give the agent a beat to advance to the next state.
await Bun.sleep(2000);
}
return snapshot(
'timeout',
`no terminal outcome within ${timeoutMs}ms (step0=${step0Count}, review=${reviewCount})`,
session.visibleSince(since),
);
} finally {
await session.close();
}
}
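The Step-0 versus review-phase bucketing above reduces to a small state machine. A minimal sketch over a synthetic prompt stream (`countPhases` is a hypothetical name; it models only the bucketing and boundary ordering, not the PTY loop):

```typescript
// Each observed fingerprint is bucketed by whether the boundary already
// fired; the boundary is evaluated AFTER bucketing, so the boundary AUQ
// itself still counts as Step 0.
function countPhases(
  prompts: string[],
  isLastStep0: (p: string) => boolean,
): { step0: number; review: number } {
  let boundaryFired = false;
  let step0 = 0;
  let review = 0;
  for (const p of prompts) {
    if (boundaryFired) review += 1;
    else step0 += 1;
    if (!boundaryFired && isLastStep0(p)) boundaryFired = true;
  }
  return { step0, review };
}

const counts = countPhases(
  ['scope?', 'premise?', 'pick a mode', 'finding D1', 'finding D2'],
  (p) => /pick a mode/.test(p),
);
// → { step0: 3, review: 2 } — the mode-pick AUQ is the last Step-0 question.
```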
@@ -0,0 +1,749 @@
/**
* Deterministic unit tests for claude-pty-runner.ts behavior changes.
*
* Free-tier (no EVALS=1 needed). Runs in <1s on every `bun test`. Catches
* harness plumbing bugs before stochastic PTY runs surface them.
*
* Two surface areas tested:
*
* 1. Permission-dialog short-circuit in 'asked' classification: a TTY frame
* that matches BOTH isPermissionDialogVisible AND isNumberedOptionListVisible
* must NOT be classified as a skill question — permission dialogs render
* as numbered lists too, but they're not what we're guarding.
*
* 2. Env passthrough surface: runPlanSkillObservation accepts an `env`
* option and threads it to launchClaudePty. We can't fully exercise the
* spawn pipeline without paying for a PTY session, but we CAN verify the
* option exists in the type signature and that calling without env still
* works (no regression).
*
* The PTY test (skill-e2e-plan-ceo-plan-mode.test.ts) is the integration
* check; this file is the cheap deterministic guard for the harness primitives
* those tests stand on.
*/
import { describe, test, expect } from 'bun:test';
import {
isPermissionDialogVisible,
isNumberedOptionListVisible,
isPlanReadyVisible,
parseNumberedOptions,
classifyVisible,
TAIL_SCAN_BYTES,
optionsSignature,
parseQuestionPrompt,
auqFingerprint,
COMPLETION_SUMMARY_RE,
assertReviewReportAtBottom,
ceoStep0Boundary,
engStep0Boundary,
designStep0Boundary,
devexStep0Boundary,
type ClaudePtyOptions,
type AskUserQuestionFingerprint,
} from './claude-pty-runner';
describe('isPermissionDialogVisible', () => {
test('matches "Bash command requires permission" prompts', () => {
const sample = `
Some preamble output
Bash command \`gstack-config get telemetry\` requires permission to run.
1. Yes
2. Yes, and always allow
3. No, abort
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('matches "allow all edits" file-edit prompts', () => {
// Isolated to the "allow all edits" clause only — no overlapping
// "Do you want to proceed?" co-trigger, so this asserts the clause works.
const sample = `
Edit to ~/.gstack/config.yaml
1. Yes
2. Yes, allow all edits during this session
3. No
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('matches the "Do you want to proceed?" file-edit confirmation by itself', () => {
// Separate fixture so weakening this clause is detected by a dedicated test.
const sample = `
Edit to ~/.gstack/config.yaml
Do you want to proceed?
1. Yes
2. No
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('matches workspace-trust "always allow access to" prompt', () => {
const sample = `
Do you trust the files in this folder?
1. Yes, proceed
2. Yes, and always allow access to /Users/me/repo
3. No, exit
`;
expect(isPermissionDialogVisible(sample)).toBe(true);
});
test('does NOT match a skill AskUserQuestion list', () => {
const sample = `
D1 — Premise challenge: do users actually want this?
1. Yes, validated
2. No, premise is wrong
3. Need more info
`;
expect(isPermissionDialogVisible(sample)).toBe(false);
});
test('does NOT match a plan-ready confirmation', () => {
const sample = `
Ready to execute the plan?
1. Yes
2. No, keep planning
`;
expect(isPermissionDialogVisible(sample)).toBe(false);
});
test('does NOT match a skill question that contains the bare phrase "Do you want to proceed?"', () => {
// Co-trigger requirement: "Do you want to proceed?" alone is not enough.
// It must appear with "Edit to <path>" or "Write to <path>" to count as
// a permission dialog. This guards against a skill question like
// "Do you want to proceed with HOLD SCOPE?" being mis-classified.
const sample = `
Choose your scope mode for this review.
Do you want to proceed?
1. HOLD SCOPE
2. SCOPE EXPANSION
3. SELECTIVE EXPANSION
`;
expect(isPermissionDialogVisible(sample)).toBe(false);
});
test('does NOT mis-match when adversarial prose includes "Edit to <path>" alongside the bare proceed phrase', () => {
// Adversarial fixture: a skill question whose body legitimately mentions
// "Edit to <path>" in prose AND ends with "Do you want to proceed?". The
// current co-trigger regex would mis-classify this as a permission
// dialog. We DO want this test to fail until the regex is tightened
// further (e.g., proximity constraint, or anchoring "Edit to" to a
// line-start). For now this is documented as a known limitation: a
// skill question that talks about "Edit to" in prose IS still treated
// as a permission dialog. The test asserts the current behavior so a
// future fix can flip it intentionally.
const sample = `
Plan: I will Edit to ./plan.md to capture the decision.
Do you want to proceed?
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
// KNOWN LIMITATION: the co-trigger fires here. Documented as a
// post-merge follow-up. Flip this assertion once the regex tightens.
expect(isPermissionDialogVisible(sample)).toBe(true);
});
});
describe('isNumberedOptionListVisible', () => {
test('matches a basic 1. + 2. cursor list', () => {
const sample = `
1. Option one
2. Option two
3. Option three
`;
expect(isNumberedOptionListVisible(sample)).toBe(true);
});
test('returns false on a single-option prompt', () => {
const sample = `
1. Only option
`;
expect(isNumberedOptionListVisible(sample)).toBe(false);
});
test('returns false when no cursor renders', () => {
const sample = `
Just some prose with 1. a numbered point and 2. another.
`;
expect(isNumberedOptionListVisible(sample)).toBe(false);
});
test('overlaps permission dialogs (this is why D5 short-circuits)', () => {
// The whole point of D5: this string matches BOTH classifiers, so the
// runner must consult isPermissionDialogVisible to disambiguate.
const sample = `
Bash command \`do-thing\` requires permission to run.
1. Yes
2. No
`;
expect(isNumberedOptionListVisible(sample)).toBe(true);
expect(isPermissionDialogVisible(sample)).toBe(true);
});
});
describe('classifyVisible (runtime path through the runner classifier)', () => {
// These tests call the actual classifier so a future contributor who
// reorders branches (e.g. moves the permission short-circuit before
// isPlanReadyVisible) is caught deterministically.
test('skill question → returns asked', () => {
const visible = `
D1 — Choose your scope mode
1. HOLD SCOPE
2. SCOPE EXPANSION
3. SELECTIVE EXPANSION
4. SCOPE REDUCTION
`;
const result = classifyVisible(visible);
expect(result?.outcome).toBe('asked');
});
test('permission dialog (Bash) → returns null (skip, keep polling)', () => {
const visible = `
Bash command \`gstack-update-check\` requires permission to run.
1. Yes
2. No
`;
expect(isNumberedOptionListVisible(visible)).toBe(true); // pre-filter
expect(classifyVisible(visible)).toBeNull(); // post-filter
});
test('plan-ready confirmation → returns plan_ready (wins over asked)', () => {
const visible = `
Ready to execute the plan?
1. Yes, proceed
2. No, keep planning
`;
const result = classifyVisible(visible);
expect(result?.outcome).toBe('plan_ready');
});
test('silent write to unsanctioned path → returns silent_write', () => {
const visible = `
⏺ Write(src/app/dangerous-write.ts)
⎿ Wrote 42 lines
`;
const result = classifyVisible(visible);
expect(result?.outcome).toBe('silent_write');
expect(result?.summary).toContain('src/app/dangerous-write.ts');
});
test('write to sanctioned path (.claude/plans) → returns null (allowed)', () => {
const visible = `
⏺ Write(/Users/me/.claude/plans/some-plan.md)
⎿ Wrote 42 lines
`;
expect(classifyVisible(visible)).toBeNull();
});
test('write while a permission dialog is on screen → returns null (gated, not silent, not asked)', () => {
const visible = `
⏺ Write(src/app/edit-with-permission.ts)
Edit to src/app/edit-with-permission.ts
Do you want to proceed?
1. Yes
2. No
`;
// The numbered prompt is a permission dialog (Edit to + Do you want to proceed?);
// silent_write is suppressed because a numbered prompt is visible, AND
// 'asked' is suppressed because the prompt is a permission dialog.
expect(classifyVisible(visible)).toBeNull();
});
test('write while a real skill question is on screen → returns asked (write is captured but not silent)', () => {
const visible = `
⏺ Write(src/app/foo.ts)
D1 — Choose your scope mode
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
// The numbered prompt is a skill question, not a permission dialog;
// silent_write is suppressed (numbered prompt is visible) and the
// outcome is 'asked' — Step 0 fired.
const result = classifyVisible(visible);
expect(result?.outcome).toBe('asked');
});
test('idle / no signals → returns null', () => {
const visible = `
Some prose without any classifier signals.
`;
expect(classifyVisible(visible)).toBeNull();
});
test('TAIL_SCAN_BYTES is exported as 1500', () => {
// Shared between runner and routing test; a regression that desyncs the
// recent-tail window would surface here.
expect(TAIL_SCAN_BYTES).toBe(1500);
});
});
describe('parseNumberedOptions', () => {
test('extracts options from a clean cursor list', () => {
const visible = `
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
const opts = parseNumberedOptions(visible);
expect(opts).toHaveLength(2);
expect(opts[0]).toEqual({ index: 1, label: 'HOLD SCOPE' });
expect(opts[1]).toEqual({ index: 2, label: 'SCOPE EXPANSION' });
});
test('returns empty array on prose-with-numbers (no cursor)', () => {
expect(parseNumberedOptions('text 1. one 2. two')).toEqual([]);
});
test('extracts options when the cursor is INLINE with prompt header (box-layout)', () => {
// Real /plan-ceo-review rendering: the TTY's cursor-positioning escapes
// collapse divider + header + prompt + cursor onto one logical line.
// Subsequent options (2..7) still start their own lines.
const visible = [
'────────────────────────────────────────',
'☐ Review scope What scope do you want me to CEO-review? 1. The branch\'s diff vs main',
' Review the full branch: ~10K LOC.',
'2. A specific plan file or design doc',
' You point me at a file (path) and I review that.',
'3. An idea you\'ll describe inline',
'4. Cancel — wrong skill',
'5. Type something.',
'────────────────────────────────────────',
'6. Chat about this',
'7. Skip interview and plan immediately',
].join('\n');
const opts = parseNumberedOptions(visible);
expect(opts).toHaveLength(7);
expect(opts[0]).toEqual({ index: 1, label: "The branch's diff vs main" });
expect(opts[1]?.index).toBe(2);
expect(opts[6]?.index).toBe(7);
expect(opts[6]?.label).toBe('Skip interview and plan immediately');
});
test('inline-cursor and start-of-line cursor both produce 7 options for the box-layout case', () => {
// The inline path captures option 1 from the cursor line itself; the
// subsequent-lines path captures 2..7 with the existing optionRe.
const inlineLayout = [
'header text 1. first option',
'2. second',
'3. third',
].join('\n');
expect(parseNumberedOptions(inlineLayout)).toEqual([
{ index: 1, label: 'first option' },
{ index: 2, label: 'second' },
{ index: 3, label: 'third' },
]);
const cleanLayout = [
' 1. first option',
' 2. second',
' 3. third',
].join('\n');
expect(parseNumberedOptions(cleanLayout)).toEqual([
{ index: 1, label: 'first option' },
{ index: 2, label: 'second' },
{ index: 3, label: 'third' },
]);
});
});
describe('runPlanSkillObservation env passthrough surface', () => {
test('ClaudePtyOptions exposes env: Record<string, string>', () => {
// Type-level guard: this file would fail to compile if the env field
// were removed or its shape regressed. The actual env merge happens in
// launchClaudePty's spawn call (`env: { ...process.env, ...opts.env }`),
// so a regression where `env: opts.env` gets dropped from the
// runPlanSkillObservation -> launchClaudePty handoff is only caught by
// the live PTY test, not here.
const opts: ClaudePtyOptions = {
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
};
expect(opts.env).toEqual({ QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' });
});
});
// ────────────────────────────────────────────────────────────────────────────
// Per-finding count primitives — Section 3 unit tests #1–#5, #7, #12.
// ────────────────────────────────────────────────────────────────────────────
describe('optionsSignature', () => {
test('returns a "|"-joined `index:label` string for a clean list', () => {
const sig = optionsSignature([
{ index: 1, label: 'HOLD SCOPE' },
{ index: 2, label: 'SCOPE EXPANSION' },
]);
expect(sig).toBe('1:HOLD SCOPE|2:SCOPE EXPANSION');
});
test('order-independent: shuffled inputs produce the same signature', () => {
// parseNumberedOptions already returns sorted, but defensive sort means
// a future caller that hands us shuffled input still produces a stable
// dedupe signature.
const a = optionsSignature([
{ index: 2, label: 'B' },
{ index: 1, label: 'A' },
{ index: 3, label: 'C' },
]);
const b = optionsSignature([
{ index: 1, label: 'A' },
{ index: 2, label: 'B' },
{ index: 3, label: 'C' },
]);
expect(a).toBe(b);
});
test('empty list returns empty string', () => {
expect(optionsSignature([])).toBe('');
});
test('single-item list returns just that entry', () => {
expect(optionsSignature([{ index: 1, label: 'Only' }])).toBe('1:Only');
});
});
describe('parseQuestionPrompt', () => {
test('captures 1-line prompt above the cursor', () => {
const visible = `
D1 — Pick a mode
1. HOLD SCOPE
2. SCOPE EXPANSION
`;
const prompt = parseQuestionPrompt(visible);
expect(prompt).toBe('D1 — Pick a mode');
});
test('captures multi-line prompt above the cursor', () => {
const visible = `
D2 — Approach selection
Which architecture should we follow?
1. Bypass existing helper
2. Reuse existing helper
`;
const prompt = parseQuestionPrompt(visible);
// Multi-line prompts get joined with single spaces.
expect(prompt).toContain('D2 — Approach selection');
expect(prompt).toContain('Which architecture should we follow?');
});
test('returns "" when no cursor is rendered', () => {
expect(parseQuestionPrompt('Just some prose.\nNo cursor.')).toBe('');
});
test('truncates to 240 chars', () => {
const longPrompt = 'A'.repeat(500);
const visible = `${longPrompt}\n\n 1. yes\n 2. no`;
expect(parseQuestionPrompt(visible).length).toBeLessThanOrEqual(240);
});
test('does not pull text from a previous numbered list above', () => {
const visible = `
1. previous answered question
2. previous option two
D2 — A new question text
1. fresh option A
2. fresh option B
`;
const prompt = parseQuestionPrompt(visible);
// Stops at the previous numbered-list line; should NOT contain "previous answered question".
expect(prompt).toContain('D2 — A new question text');
expect(prompt).not.toContain('previous answered question');
});
test('normalizes whitespace (collapses runs of spaces and tabs)', () => {
const visible = `D1  —   Spaced\t out
 1. yes
 2. no`;
expect(parseQuestionPrompt(visible)).toBe('D1 — Spaced out');
});
test('inline-cursor box-layout: extracts prompt text BEFORE 1. on the cursor line', () => {
// Real /plan-ceo-review rendering: divider + ☐ header + prompt text +
// cursor are all on one logical line because TTY cursor-positioning
// escapes collapse the box layout under stripAnsi.
const visible = [
'──────────────────',
'☐ Review scope What scope do you want me to CEO-review? 1. The branch\'s diff vs main',
'2. A specific plan file',
'3. An idea inline',
].join('\n');
const prompt = parseQuestionPrompt(visible);
// Should extract "Review scope" and the prompt text, dropping the ☐ box-drawing sigil.
expect(prompt).toContain('Review scope');
expect(prompt).toContain('What scope do you want me to CEO-review?');
expect(prompt).not.toContain('☐');
expect(prompt).not.toMatch(/^☐/);
});
});
describe('auqFingerprint', () => {
test('returns the same fingerprint for identical inputs', () => {
const opts = [
{ index: 1, label: 'A' },
{ index: 2, label: 'B' },
];
expect(auqFingerprint('hello', opts)).toBe(auqFingerprint('hello', opts));
});
test('different prompts with shared option labels produce DIFFERENT fingerprints', () => {
// The collision regression Codex F1 caught: option-label-only fingerprints
// collapsed multiple distinct findings into one when they shared menu shape.
const sharedOpts = [
{ index: 1, label: 'Add to plan' },
{ index: 2, label: 'Defer' },
{ index: 3, label: 'Build now' },
];
const fpFinding1 = auqFingerprint('D5 — Architecture: bypass helper?', sharedOpts);
const fpFinding2 = auqFingerprint('D6 — Tests: zero coverage?', sharedOpts);
expect(fpFinding1).not.toBe(fpFinding2);
});
test('same prompt with different options produces DIFFERENT fingerprints', () => {
const prompt = 'D1 — Pick a mode';
const fpA = auqFingerprint(prompt, [
{ index: 1, label: 'HOLD SCOPE' },
{ index: 2, label: 'SCOPE EXPANSION' },
]);
const fpB = auqFingerprint(prompt, [
{ index: 1, label: 'HOLD SCOPE' },
{ index: 2, label: 'SCOPE REDUCTION' },
]);
expect(fpA).not.toBe(fpB);
});
test('whitespace-only differences in prompt do NOT change the fingerprint', () => {
// Same content, different rendering whitespace (TTY redraw artifact)
// must produce the same fingerprint so dedupe survives reflow.
const opts = [{ index: 1, label: 'A' }, { index: 2, label: 'B' }];
const fpA = auqFingerprint('Pick a mode', opts);
const fpB = auqFingerprint('Pick   a \t mode', opts);
expect(fpA).toBe(fpB);
});
test('empty prompt + same options collide (caller must guard against this)', () => {
// Documents the contract: empty-prompt fingerprints WILL collide if the
// caller fingerprints them. runPlanSkillCounting must skip empty-prompt
// AUQs and re-poll instead.
const opts = [{ index: 1, label: 'A' }];
expect(auqFingerprint('', opts)).toBe(auqFingerprint('', opts));
});
});
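The fingerprint contract these tests pin (prompt AND options both feed the key; whitespace runs are normalized away) can be sketched like this. A hypothetical implementation for illustration only; the real auqFingerprint in test/helpers/claude-pty-runner.ts may hash rather than concatenate:

```typescript
// Hypothetical sketch of a collision-resistant AUQ fingerprint.
// Keying on option labels alone caused the Codex F1 collision the tests
// above document; including the normalized prompt text fixes it.
interface FpOption {
  index: number;
  label: string;
}

function auqFingerprintSketch(prompt: string, options: FpOption[]): string {
  // Collapse whitespace runs so TTY redraw/reflow artifacts do not
  // change the fingerprint.
  const normalizedPrompt = prompt.replace(/\s+/g, ' ').trim();
  const optionPart = options.map((o) => `${o.index}:${o.label}`).join('|');
  // NUL separator keeps prompt and option text from bleeding into each other.
  return `${normalizedPrompt}\u0000${optionPart}`;
}
```

Note the documented caveat still holds in this sketch: an empty prompt collapses to the option part alone, so empty-prompt fingerprints collide and the caller must skip them.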
describe('COMPLETION_SUMMARY_RE', () => {
test('matches GSTACK REVIEW REPORT heading', () => {
expect(COMPLETION_SUMMARY_RE.test('## GSTACK REVIEW REPORT')).toBe(true);
});
test('matches Completion Summary heading (ceo + eng)', () => {
expect(COMPLETION_SUMMARY_RE.test('## Completion Summary')).toBe(true);
expect(COMPLETION_SUMMARY_RE.test('## Completion summary')).toBe(true);
});
test('matches Status: clean (CEO review-log shape)', () => {
expect(COMPLETION_SUMMARY_RE.test('Status: clean')).toBe(true);
expect(COMPLETION_SUMMARY_RE.test('Status: issues_open')).toBe(true);
});
test('matches VERDICT: line', () => {
expect(COMPLETION_SUMMARY_RE.test('VERDICT: CLEARED — Eng Review passed')).toBe(true);
});
test('does NOT match prose mentions of "verdict" mid-line', () => {
// VERDICT must be at the start of a line to count.
expect(COMPLETION_SUMMARY_RE.test('the final verdict: undecided')).toBe(false);
});
});
describe('assertReviewReportAtBottom', () => {
test('passes when REVIEW REPORT is the only/last ## heading', () => {
const content = `# Plan
## Context
stuff
## Approach
more stuff
## GSTACK REVIEW REPORT
| col | col |
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(true);
});
test('fails when REVIEW REPORT is missing', () => {
const content = `# Plan
## Context
stuff
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(false);
expect(r.reason).toMatch(/no GSTACK REVIEW REPORT/);
});
test('fails when REVIEW REPORT exists but a ## heading follows it', () => {
const content = `# Plan
## GSTACK REVIEW REPORT
| col | col |
## Late Section
oops
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(false);
expect(r.reason).toMatch(/trailing ## heading/);
expect(r.trailingHeadings).toEqual(['## Late Section']);
});
test('passes when only ### subheadings follow REVIEW REPORT (deeper nesting allowed)', () => {
const content = `## GSTACK REVIEW REPORT
### Cross-model tension
- F1: resolved
- F2: resolved
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(true);
});
test('fails with multiple trailing ## headings reported', () => {
const content = `## GSTACK REVIEW REPORT
## First trailing
## Second trailing
`;
const r = assertReviewReportAtBottom(content);
expect(r.ok).toBe(false);
expect(r.trailingHeadings).toHaveLength(2);
});
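The report-placement contract (REVIEW REPORT must be the last ## heading; deeper ### nesting after it is fine) can be sketched as below. Hypothetical and for illustration; the real assertReviewReportAtBottom lives in test/helpers/claude-pty-runner.ts and its exact reason strings may differ:

```typescript
// Hypothetical sketch of the D19 review-report-at-bottom check.
interface ReportVerdict {
  ok: boolean;
  reason?: string;
  trailingHeadings?: string[];
}

function assertReviewReportAtBottomSketch(content: string): ReportVerdict {
  const lines = content.split('\n');
  const reportIdx = lines.findIndex((l) => /^## GSTACK REVIEW REPORT/.test(l));
  if (reportIdx === -1) {
    return { ok: false, reason: 'no GSTACK REVIEW REPORT heading found' };
  }
  // Only exact-depth ## headings count as violations; "### ..." does not
  // match /^## / because its third character is '#', not a space.
  const trailing = lines.slice(reportIdx + 1).filter((l) => /^## /.test(l));
  if (trailing.length > 0) {
    return { ok: false, reason: 'has a trailing ## heading after the report', trailingHeadings: trailing };
  }
  return { ok: true };
}
```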
});
describe('Step0BoundaryPredicate per-skill', () => {
// Helper to build a synthetic fingerprint for predicate tests.
function fp(promptSnippet: string, optionLabels: string[]): AskUserQuestionFingerprint {
const options = optionLabels.map((label, i) => ({ index: i + 1, label }));
return {
signature: auqFingerprint(promptSnippet, options),
promptSnippet,
options,
observedAtMs: 0,
preReview: true,
};
}
describe('ceoStep0Boundary', () => {
test('FIRES on Step 0F mode-pick AUQ (HOLD SCOPE in options)', () => {
const f = fp('Pick a mode', ['HOLD SCOPE', 'SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'SCOPE REDUCTION']);
expect(ceoStep0Boundary(f)).toBe(true);
});
test('FIRES on scope-selection AUQ with "Skip interview" option (skip-interview path)', () => {
// After calibration run 1: plan-ceo's first AUQ is scope-selection,
// and we route via "Skip interview and plan immediately" to bypass
// Step 0 entirely. Boundary must fire on this AUQ so subsequent
// AUQs go to reviewCount.
const f = fp(
'What scope do you want me to CEO-review?',
[
"The branch's diff vs main",
'A specific plan file',
"An idea you'll describe inline",
'Cancel — wrong skill',
'Type something.',
'Chat about this',
'Skip interview and plan immediately',
],
);
expect(ceoStep0Boundary(f)).toBe(true);
});
test('does NOT fire on premise challenge AUQs', () => {
const f = fp('D1 — Premise check: is this the right problem?', ['Yes', 'No', 'Other']);
expect(ceoStep0Boundary(f)).toBe(false);
});
test('does NOT fire on review-section AUQs', () => {
const f = fp('Architecture: bypass helper?', ['Reuse existing', 'Roll new', 'Defer']);
expect(ceoStep0Boundary(f)).toBe(false);
});
});
describe('engStep0Boundary', () => {
test('FIRES on cross-project learnings prompt', () => {
const f = fp('Enable cross-project learnings on this machine?', ['Yes', 'No']);
expect(engStep0Boundary(f)).toBe(true);
});
test('FIRES on scope reduction recommendation', () => {
const f = fp('Scope reduction recommendation: cut to MVP?', ['Reduce', 'Proceed', 'Modify']);
expect(engStep0Boundary(f)).toBe(true);
});
test('does NOT fire on review-section AUQs', () => {
const f = fp('Architecture: shared mutable state?', ['Refactor', 'Defer', 'Skip']);
expect(engStep0Boundary(f)).toBe(false);
});
});
describe('designStep0Boundary', () => {
test('FIRES on design system / posture mention', () => {
const f = fp('Pick a design posture for this review', ['Polish', 'Triage', 'Expansion']);
expect(designStep0Boundary(f)).toBe(true);
});
test('FIRES on first-dimension prompt', () => {
const f = fp('First dimension: visual hierarchy. Score?', ['7', '8', '9']);
expect(designStep0Boundary(f)).toBe(true);
});
test('does NOT fire on later dimension AUQs', () => {
const f = fp('Spacing dimension score?', ['7', '8', '9']);
expect(designStep0Boundary(f)).toBe(false);
});
});
describe('devexStep0Boundary', () => {
test('FIRES on developer persona selection', () => {
const f = fp('Pick the target persona for this review', ['Senior backend', 'Junior frontend', 'Other']);
expect(devexStep0Boundary(f)).toBe(true);
});
test('FIRES on TTHW target prompt', () => {
const f = fp('What is the TTHW target for first run?', ['<5 min', '<15 min', '<30 min']);
expect(devexStep0Boundary(f)).toBe(true);
});
test('does NOT fire on review-section AUQs', () => {
const f = fp('Friction point: 5-min CI wait. Address?', ['Now', 'Defer', 'Skip']);
expect(devexStep0Boundary(f)).toBe(false);
});
});
});
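The boundary-predicate shape these per-skill tests exercise can be sketched as a pure function over the fingerprint's option labels. A hypothetical ceo-flavored example only, consistent with the cases above; the real predicates in test/helpers/claude-pty-runner.ts may key on the prompt snippet as well:

```typescript
// Hypothetical sketch of a Step 0 boundary predicate: returns true on the
// AUQ that marks the END of Step 0, so subsequent AUQs count as review-phase.
interface BoundaryFp {
  promptSnippet: string;
  options: { index: number; label: string }[];
}

function ceoStep0BoundarySketch(fp: BoundaryFp): boolean {
  const labels = fp.options.map((o) => o.label);
  // Step 0F mode pick: HOLD SCOPE appears among the options.
  if (labels.some((l) => /HOLD SCOPE/i.test(l))) return true;
  // Skip-interview path: scope-selection AUQ offers "Skip interview".
  if (labels.some((l) => /skip\s+interview/i.test(l))) return true;
  // Premise-challenge and review-section AUQs must NOT fire the boundary.
  return false;
}
```

Keeping the predicate pure over a fingerprint (rather than raw terminal text) is what makes the branch behavior unit-testable with synthetic input, per the extraction described in the commit message.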
@@ -103,6 +103,15 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
'autoplan-chain-pty': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'],
// Per-finding AskUserQuestion count + review-report-at-bottom assertion.
// Each test drives its skill end-to-end; touchfiles include preamble +
// completion-status resolvers because they affect question cadence and
// terminal output (the regression surface this test catches).
'plan-ceo-finding-count': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-ceo-finding-count.test.ts'],
'plan-eng-finding-count': ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-eng-finding-count.test.ts'],
'plan-design-finding-count': ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-design-finding-count.test.ts'],
'plan-devex-finding-count': ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-devex-finding-count.test.ts'],
'brain-privacy-gate': ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-brain-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'],
// AskUserQuestion format regression (RECOMMENDATION + Completeness: N/10)
@@ -381,6 +390,15 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode
'autoplan-chain-pty': 'periodic', // ~$8/run, all 3 phases sequential
// Per-finding count + review-report-at-bottom — periodic because each
// run drives a full skill end-to-end (~25 min, ~$5/run). Sequential
// execution during calibration; concurrent opt-in only after measured
// comparison agrees (plan §D15).
'plan-ceo-finding-count': 'periodic',
'plan-eng-finding-count': 'periodic',
'plan-design-finding-count': 'periodic',
'plan-devex-finding-count': 'periodic',
// Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call,
// costs ~$0.30-$0.50 per run, not needed on every commit)
'brain-privacy-gate': 'periodic',
@@ -0,0 +1,253 @@
/**
* /plan-ceo-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
*
* Asserts the load-bearing rule "One issue = one AskUserQuestion call" by
* driving /plan-ceo-review against a 5-finding seeded plan and counting
* distinct review-phase AUQs. Passes when count is in [N-1, N+2].
*
* Two tests in this file:
* - 5-finding distinct fixture: count band assertion + D19 review-report-at-bottom.
* - 2-finding paired control (D12 positive control): related findings still
* produce 2 distinct AUQs, not 1 batched, when the rule is honored.
*
* Tier: periodic. Each run drives Step 0 + 11 review sections end-to-end
* (~25 min, ~$5/run). Sequential by default per plan §D15. See
* test/helpers/claude-pty-runner.ts for runPlanSkillCounting internals.
*/
import { describe, test } from 'bun:test';
import * as fs from 'node:fs';
import {
runPlanSkillCounting,
ceoStep0Boundary,
assertReviewReportAtBottom,
type AskUserQuestionFingerprint,
} from './helpers/claude-pty-runner';
/**
* /plan-ceo-review's first AUQ asks "what scope?" with options like
* 1. Branch diff vs main
* 2. A specific plan file or design doc
* 3. An idea you'll describe inline
* ...
* 7. Skip interview and plan immediately
*
* The default pick (1) routes to "branch diff vs main" — the wrong target
* for our seeded fixture (the agent would review the gstack PR itself,
* recursively). Picking "Skip interview and plan immediately" bypasses
* Step 0 and routes the agent to review the chat context (where our
* follow-up plan was pasted).
*/
function pickSkipInterview(fp: AskUserQuestionFingerprint): number {
const skipOpt = fp.options.find((o) =>
/skip\s+interview|plan\s+immediately/i.test(o.label),
);
if (skipOpt) return skipOpt.index;
// Fallback: "describe inline" also routes to using our pasted plan.
const inlineOpt = fp.options.find((o) =>
/describe.*inline|inline.*idea/i.test(o.label),
);
if (inlineOpt) return inlineOpt.index;
return 1;
}
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;
const N_DISTINCT = 5;
const FLOOR_DISTINCT = N_DISTINCT - 1; // 4 (D11)
const CEILING_DISTINCT = N_DISTINCT + 2; // 7 (D11)
const N_PAIRED = 2;
const FLOOR_PAIRED = 2;
const CEILING_PAIRED = 4;
const PLAN_CEO_5_FINDINGS = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo.md (use Edit/Write to that exact path).',
'',
'# Plan: Payment Processing Integration',
'',
'## Architecture',
"We're adding a new `PaymentService` class that will handle Stripe webhooks.",
'This bypasses the existing `WebhookDispatcher` module — we want a clean',
'namespace separation.',
'',
'## Database access',
'The new endpoint reads `request.params.userId` directly into a raw SQL',
'fragment for the lookup query.',
'',
'## Webhook fan-out',
'On payment success we update the user record AND fire a notification email.',
'Both happen inline; no error handling on the email leg.',
'',
'## Tests',
"None planned. We'll rely on the existing integration suite catching regressions.",
'',
'## Performance',
'Each webhook lookup hits the database for the user, then fetches each',
'order in a loop.',
].join('\n');
const PLAN_CEO_2_PAIRED_FINDINGS = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo-paired.md (use Edit/Write to that exact path).',
'',
'# Plan: Payment Processing — Test Coverage',
'',
'## Tests',
'We need test coverage for `processPayment()`. Specifically:',
'1. The happy path (successful Stripe charge — assert correct receipt is generated).',
'2. The error/timeout path (Stripe returns 502 — assert retry-with-backoff fires once, then fails clean).',
'',
'Currently neither has a unit test. These are deliberately separate concerns:',
'the success path is correctness, the failure path is graceful degradation.',
].join('\n');
const PLAN_CEO_PATH = '/tmp/gstack-test-plan-ceo.md';
const PLAN_CEO_PAIRED_PATH = '/tmp/gstack-test-plan-ceo-paired.md';
describeE2E('/plan-ceo-review per-finding AskUserQuestion count (periodic)', () => {
test(
`5-finding plan emits ${FLOOR_DISTINCT}-${CEILING_DISTINCT} review-phase AskUserQuestions`,
async () => {
try {
fs.rmSync(PLAN_CEO_PATH, { force: true });
} catch {
/* best-effort */
}
const obs = await runPlanSkillCounting({
skillName: 'plan-ceo-review',
slashCommand: '/plan-ceo-review',
followUpPrompt: PLAN_CEO_5_FINDINGS,
isLastStep0AUQ: ceoStep0Boundary,
reviewCountCeiling: CEILING_DISTINCT + 1, // hard cap above assertion ceiling
firstAUQPick: pickSkipInterview, // bypass scope-selection, route to review
cwd: process.cwd(),
timeoutMs: 1_500_000, // 25 min
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
try {
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
throw new Error(
`plan-ceo-review finding-count FAILED: outcome=${obs.outcome}\n` +
`step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
`fingerprints (last 8):\n` +
obs.fingerprints
.slice(-8)
.map(
(f, i) =>
` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
)
.join('\n') +
`\n--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
if (obs.reviewCount < FLOOR_DISTINCT) {
throw new Error(
`BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_DISTINCT}.\n` +
`Likely batching regression — agent collapsed multiple findings into fewer questions.\n` +
`Fingerprints (review-phase only):\n` +
obs.fingerprints
.filter((f) => !f.preReview)
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
.join('\n'),
);
}
if (obs.reviewCount > CEILING_DISTINCT) {
throw new Error(
`BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING_DISTINCT}.\n` +
`Possible over-asking regression. Review-phase fingerprints:\n` +
obs.fingerprints
.filter((f) => !f.preReview)
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
.join('\n'),
);
}
// D19: review report at bottom of plan file.
if (!fs.existsSync(PLAN_CEO_PATH)) {
throw new Error(
`D19 FAIL: agent did not produce expected plan file at ${PLAN_CEO_PATH}.\n` +
`Either the agent ignored the path instruction in the follow-up prompt, or\n` +
`the helper exited before the agent wrote the file. ` +
`outcome=${obs.outcome} review=${obs.reviewCount}`,
);
}
const planContent = fs.readFileSync(PLAN_CEO_PATH, 'utf-8');
const verdict = assertReviewReportAtBottom(planContent);
if (!verdict.ok) {
throw new Error(
`D19 FAIL: plan file at ${PLAN_CEO_PATH} ${verdict.reason}\n` +
(verdict.trailingHeadings
? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
: '') +
`--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
);
}
} finally {
try {
fs.rmSync(PLAN_CEO_PATH, { force: true });
} catch {
/* best-effort */
}
}
},
1_700_000,
);
test(
`paired-finding positive control: ${N_PAIRED} related findings produce ${FLOOR_PAIRED}-${CEILING_PAIRED} AskUserQuestions`,
async () => {
try {
fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true });
} catch {
/* best-effort */
}
const obs = await runPlanSkillCounting({
skillName: 'plan-ceo-review',
slashCommand: '/plan-ceo-review',
followUpPrompt: PLAN_CEO_2_PAIRED_FINDINGS,
isLastStep0AUQ: ceoStep0Boundary,
reviewCountCeiling: CEILING_PAIRED + 1,
cwd: process.cwd(),
timeoutMs: 1_500_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
try {
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
throw new Error(
`paired-finding control FAILED: outcome=${obs.outcome}\n` +
`step0=${obs.step0Count} review=${obs.reviewCount}\n` +
`--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
if (obs.reviewCount < FLOOR_PAIRED) {
throw new Error(
`PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_PAIRED}.\n` +
`Two deliberately related findings were batched into <2 questions — the rule failed under D12.\n` +
`Review-phase fingerprints:\n` +
obs.fingerprints
.filter((f) => !f.preReview)
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
.join('\n'),
);
}
if (obs.reviewCount > CEILING_PAIRED) {
throw new Error(
`PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} > CEILING=${CEILING_PAIRED} (over-asking on a 2-finding fixture).`,
);
}
} finally {
try {
fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true });
} catch {
/* best-effort */
}
}
},
1_700_000,
);
});
@@ -37,14 +37,15 @@ import {
isPermissionDialogVisible,
parseNumberedOptions,
isPlanReadyVisible,
MODE_RE,
optionsSignature,
TAIL_SCAN_BYTES,
type ClaudePtySession,
} from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;
const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i;
interface ModeCase {
mode: 'HOLD SCOPE' | 'SCOPE EXPANSION';
/** Regex applied to visible-since-mode-pick text. At least one must match. */
@@ -95,8 +96,8 @@ async function navigateToModeAskUserQuestion(
// Has the rendered list changed since last poll? If not, we're seeing
// the same prompt and shouldn't double-press.
const sig = opts.map(o => `${o.index}:${o.label}`).join('|');
const lastSig = lastSeenList.map(o => `${o.index}:${o.label}`).join('|');
const sig = optionsSignature(opts);
const lastSig = optionsSignature(lastSeenList);
if (sig === lastSig) continue;
lastSeenList = opts;
@@ -115,7 +116,14 @@ async function navigateToModeAskUserQuestion(
// Permission dialog? Grant with "1" but don't count it against nav budget.
// Classify on the recent tail only — old permission text persists in
// visibleSince and would re-trigger forever.
if (isPermissionDialogVisible(visible.slice(-1500))) {
//
// Note: runPlanSkillObservation has its own permission-dialog filter that
// simply skips classification (since it observes, doesn't drive). This nav
// loop drives the PTY directly via launchClaudePty and so owns its own
// dialog handling — granting with "1" so the workflow advances. Both
// paths share TAIL_SCAN_BYTES as the recent-tail window so tuning stays
// in sync.
if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
session.send('1\r');
await Bun.sleep(1500);
continue;
@@ -1,22 +1,34 @@
/**
* plan-ceo-review plan-mode smoke (gate, paid, real-PTY).
*
* Asserts: when /plan-ceo-review is invoked in plan mode, the skill reaches
* a terminal outcome that is either:
* - 'asked' — skill emitted its Step 0 numbered prompt (scope mode
* selection, or the routing-injection prompt that runs
* before Step 0)
* - 'plan_ready' — skill ran end-to-end and surfaced claude's native
* "Ready to execute" confirmation
* Asserts: when /plan-ceo-review is invoked in plan mode, the FIRST terminal
* outcome is 'asked' — a skill-question numbered list. Permission dialogs
* (which also render numbered lists) are filtered out by `runPlanSkillObservation`
* via its `isPermissionDialogVisible(visible.slice(-1500))` short-circuit.
*
* FAIL conditions: silent Write/Edit before any prompt, claude crash,
* timeout.
* Reaching 'plan_ready' first IS the regression we want to catch: the agent
* skipped Step 0 entirely and went straight to ExitPlanMode. The original
* failure had the assistant read a diff, write a plan with two issues, and
* call ExitPlanMode without ever firing AskUserQuestion — the user had to
* manually call out the missing per-issue questions.
*
* Replaces the SDK-based test that never worked: the SDK's canUseTool
* interceptor on AskUserQuestion never fires in plan mode because plan
* mode renders its native confirmation as TTY UI, not via the
* AskUserQuestion tool. The real PTY harness observes the rendered
* terminal output directly.
* Why this skill is special: unlike plan-eng-review / plan-design-review /
* plan-devex-review (whose smokes accept either 'asked' or 'plan_ready'),
* plan-ceo-review's template mandates Step 0A premise challenge (3 baked-in
* questions) AND Step 0F mode selection BEFORE any plan write. There is no
* legitimate path to plan_ready that does not first emit a skill-question
* numbered prompt.
*
* Env passthrough: passes `QUESTION_TUNING=false` and `EXPLAIN_LEVEL=default`
* via the runner's env option. Today these are advisory — `gstack-config`
* reads `~/.gstack/config.yaml`, not env vars, so a contributor with
* `question_tuning: true` set in their YAML config can still see AUTO_DECIDE
* masking. The env passthrough is wired so a future gstack-config change to
* honor env overrides will make this test hermetic without further edits.
* Tracked as a post-merge follow-up.
*
* FAIL conditions: 'plan_ready' first, silent Write/Edit before any prompt,
* claude crash, timeout.
*
* See test/helpers/claude-pty-runner.ts for runner internals.
*/
@@ -28,21 +40,33 @@ const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
test('reaches a terminal outcome (asked or plan_ready) without silent writes', async () => {
test('first terminal outcome is asked (Step 0 fires before any plan write)', async () => {
const obs = await runPlanSkillObservation({
skillName: 'plan-ceo-review',
inPlanMode: true,
timeoutMs: 300_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
if (obs.outcome === 'silent_write' || obs.outcome === 'exited' || obs.outcome === 'timeout') {
if (obs.outcome !== 'asked') {
const diagnosis =
obs.outcome === 'plan_ready'
? `'plan_ready' first means the agent skipped Step 0 entirely and went straight to ExitPlanMode without asking.`
: obs.outcome === 'timeout'
? `Timeout means the agent neither asked nor completed within the budget — likely hung mid-question or stuck on a permission dialog.`
: obs.outcome === 'silent_write'
? `Silent Write/Edit fired to an unsanctioned path before any AskUserQuestion — also a Step 0 skip.`
: `Outcome '${obs.outcome}' is unexpected; investigate the evidence below.`;
throw new Error(
`plan-ceo-review plan-mode smoke FAILED: outcome=${obs.outcome}\n` +
`plan-ceo-review smoke FAILED: outcome=${obs.outcome}\n` +
`${diagnosis}\n` +
`Expected 'asked'. See plan-ceo-review/SKILL.md.tmpl: the Step 0 STOP rules ` +
`and the "One issue = one AskUserQuestion call" rule under "CRITICAL RULE — ` +
`How to ask questions".\n` +
`summary: ${obs.summary}\n` +
`elapsed: ${obs.elapsedMs}ms\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
expect(['asked', 'plan_ready']).toContain(obs.outcome);
}, 360_000);
});
@@ -0,0 +1,135 @@
/**
* /plan-design-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
*
* Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-design-review
* against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2].
* Plus D19: review report at bottom of produced plan file.
*
* Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15.
*/
import { describe, test } from 'bun:test';
import * as fs from 'node:fs';
import {
runPlanSkillCounting,
designStep0Boundary,
assertReviewReportAtBottom,
} from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;
const N = 5;
const FLOOR = N - 1;
const CEILING = N + 2;
const PLAN_DESIGN_5_FINDINGS = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-design.md (use Edit/Write to that exact path).',
'',
'# Plan: Settings Page UI redesign',
'',
'## Visual Hierarchy',
'The "Save" button is rendered with the same size, weight, and color as',
'three other buttons in the page header (Reset, Cancel, Export). Nothing',
'tells the user which is the primary action.',
'',
'## Spacing',
'Between sections we have 24px in some places, 32px in others, and 16px',
'in a third — no consistent vertical rhythm.',
'',
'## Color',
'The error message uses red text on a light pink background. Contrast',
'ratio is approximately 3:1 (below WCAG AA).',
'',
'## Typography',
'We use 14px, 16px, and 18px font sizes across the form labels. Two',
'sizes would suffice and create stronger hierarchy.',
'',
'## Motion',
'The "Save" action takes 2-5 seconds with no loading indicator. Users',
'see a frozen page; we should add a spinner or skeleton state.',
].join('\n');
const PLAN_DESIGN_PATH = '/tmp/gstack-test-plan-design.md';
describeE2E('/plan-design-review per-finding AskUserQuestion count (periodic)', () => {
test(
`5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`,
async () => {
try {
fs.rmSync(PLAN_DESIGN_PATH, { force: true });
} catch {
/* best-effort */
}
const obs = await runPlanSkillCounting({
skillName: 'plan-design-review',
slashCommand: '/plan-design-review',
followUpPrompt: PLAN_DESIGN_5_FINDINGS,
isLastStep0AUQ: designStep0Boundary,
reviewCountCeiling: CEILING + 1,
cwd: process.cwd(),
timeoutMs: 1_500_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
try {
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
throw new Error(
`plan-design-review finding-count FAILED: outcome=${obs.outcome}\n` +
`step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
`fingerprints (last 8):\n` +
obs.fingerprints
.slice(-8)
.map(
(f, i) =>
` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
)
.join('\n') +
`\n--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
if (obs.reviewCount < FLOOR) {
throw new Error(
`BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` +
`Likely batching regression. Review-phase fingerprints:\n` +
obs.fingerprints
.filter((f) => !f.preReview)
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
.join('\n'),
);
}
if (obs.reviewCount > CEILING) {
throw new Error(
`BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`,
);
}
if (!fs.existsSync(PLAN_DESIGN_PATH)) {
throw new Error(
`D19 FAIL: agent did not produce expected plan file at ${PLAN_DESIGN_PATH}. ` +
`outcome=${obs.outcome} review=${obs.reviewCount}`,
);
}
const planContent = fs.readFileSync(PLAN_DESIGN_PATH, 'utf-8');
const verdict = assertReviewReportAtBottom(planContent);
if (!verdict.ok) {
throw new Error(
`D19 FAIL: plan file at ${PLAN_DESIGN_PATH} ${verdict.reason}\n` +
(verdict.trailingHeadings
? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
: '') +
`--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
);
}
} finally {
try {
fs.rmSync(PLAN_DESIGN_PATH, { force: true });
} catch {
/* best-effort */
}
}
},
1_700_000,
);
});
@@ -0,0 +1,135 @@
/**
* /plan-devex-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
*
* Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-devex-review
* against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2].
* Plus D19: review report at bottom of produced plan file.
*
* Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15.
*/
import { describe, test } from 'bun:test';
import * as fs from 'node:fs';
import {
runPlanSkillCounting,
devexStep0Boundary,
assertReviewReportAtBottom,
} from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;
const N = 5;
const FLOOR = N - 1;
const CEILING = N + 2;
const PLAN_DEVEX_5_FINDINGS = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-devex.md (use Edit/Write to that exact path).',
'',
'# Plan: Public SDK Beta Launch',
'',
'## Persona',
"The plan doesn't specify which developer persona is the target — we're",
"shipping for \"everyone,\" which means we tune for nobody.",
'',
'## TTHW (time to hello world)',
'Time-to-hello-world is not measured. No benchmark data referenced. We',
"don't know if first-run takes 5 minutes or 50.",
'',
'## Friction Point',
'First-run currently requires a 5-minute mandatory CI step before the',
'developer can run their first eval. There is no way to skip it.',
'',
'## Magical Moment',
'Getting-started flow has no delight beat. Pure documentation, no',
'interactive demo, no "ah-ha" moment that makes the developer trust us.',
'',
'## Competitive Blind Spot',
"The plan doesn't reference how peer SDKs (LangChain, Semantic Kernel,",
'OpenAI) handle this DX surface. We may be reinventing worse versions',
'of solved problems.',
].join('\n');
const PLAN_DEVEX_PATH = '/tmp/gstack-test-plan-devex.md';
describeE2E('/plan-devex-review per-finding AskUserQuestion count (periodic)', () => {
test(
`5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`,
async () => {
try {
fs.rmSync(PLAN_DEVEX_PATH, { force: true });
} catch {
/* best-effort */
}
const obs = await runPlanSkillCounting({
skillName: 'plan-devex-review',
slashCommand: '/plan-devex-review',
followUpPrompt: PLAN_DEVEX_5_FINDINGS,
isLastStep0AUQ: devexStep0Boundary,
reviewCountCeiling: CEILING + 1,
cwd: process.cwd(),
timeoutMs: 1_500_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
try {
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
throw new Error(
`plan-devex-review finding-count FAILED: outcome=${obs.outcome}\n` +
`step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
`fingerprints (last 8):\n` +
obs.fingerprints
.slice(-8)
.map(
(f, i) =>
` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
)
.join('\n') +
`\n--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
if (obs.reviewCount < FLOOR) {
throw new Error(
`BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` +
`Likely batching regression. Review-phase fingerprints:\n` +
obs.fingerprints
.filter((f) => !f.preReview)
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
.join('\n'),
);
}
if (obs.reviewCount > CEILING) {
throw new Error(
`BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`,
);
}
if (!fs.existsSync(PLAN_DEVEX_PATH)) {
throw new Error(
`D19 FAIL: agent did not produce expected plan file at ${PLAN_DEVEX_PATH}. ` +
`outcome=${obs.outcome} review=${obs.reviewCount}`,
);
}
const planContent = fs.readFileSync(PLAN_DEVEX_PATH, 'utf-8');
const verdict = assertReviewReportAtBottom(planContent);
if (!verdict.ok) {
throw new Error(
`D19 FAIL: plan file at ${PLAN_DEVEX_PATH} ${verdict.reason}\n` +
(verdict.trailingHeadings
? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
: '') +
`--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
);
}
} finally {
try {
fs.rmSync(PLAN_DEVEX_PATH, { force: true });
} catch {
/* best-effort */
}
}
},
1_700_000,
);
});
@@ -0,0 +1,134 @@
/**
* /plan-eng-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
*
* Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-eng-review
* against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2].
* Plus D19: review report at bottom of produced plan file.
*
* Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15.
*/
import { describe, test } from 'bun:test';
import * as fs from 'node:fs';
import {
runPlanSkillCounting,
engStep0Boundary,
assertReviewReportAtBottom,
} from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;
const N = 5;
const FLOOR = N - 1; // 4
const CEILING = N + 2; // 7
const PLAN_ENG_5_FINDINGS = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-eng.md (use Edit/Write to that exact path).',
'',
'# Plan: Multi-tenant Auth Refactor',
'',
'## Architecture',
'Two new services (`AuthBroker` and `SessionMint`) share a global mutable',
'`AuthCache` instance via module-level export. Both services mutate it.',
'',
'## Code quality',
'The `validateAndDispatch()` function is 60 lines with three nested',
'try/catch blocks; each catch swallows a different error class.',
'',
'## Tests',
'The existing `legacyAuthFlow()` will get rewritten as part of this work;',
'no regression test for the prior behavior is planned.',
'',
'## Performance',
'Token validation issues 5 sequential API calls to the IDP; they could be',
'parallelized via Promise.all trivially (calls are independent).',
'',
'## Architecture (scope smell)',
'This touches 12 files and introduces 4 new classes (TokenStore,',
'SessionMint, AuthCache, RequestPolicy). Worth flagging for the complexity check.',
].join('\n');
const PLAN_ENG_PATH = '/tmp/gstack-test-plan-eng.md';
describeE2E('/plan-eng-review per-finding AskUserQuestion count (periodic)', () => {
test(
`5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`,
async () => {
try {
fs.rmSync(PLAN_ENG_PATH, { force: true });
} catch {
/* best-effort */
}
const obs = await runPlanSkillCounting({
skillName: 'plan-eng-review',
slashCommand: '/plan-eng-review',
followUpPrompt: PLAN_ENG_5_FINDINGS,
isLastStep0AUQ: engStep0Boundary,
reviewCountCeiling: CEILING + 1,
cwd: process.cwd(),
timeoutMs: 1_500_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
try {
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
throw new Error(
`plan-eng-review finding-count FAILED: outcome=${obs.outcome}\n` +
`step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
`fingerprints (last 8):\n` +
obs.fingerprints
.slice(-8)
.map(
(f, i) =>
` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
)
.join('\n') +
`\n--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
if (obs.reviewCount < FLOOR) {
throw new Error(
`BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` +
`Likely batching regression. Review-phase fingerprints:\n` +
obs.fingerprints
.filter((f) => !f.preReview)
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
.join('\n'),
);
}
if (obs.reviewCount > CEILING) {
throw new Error(
`BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`,
);
}
if (!fs.existsSync(PLAN_ENG_PATH)) {
throw new Error(
`D19 FAIL: agent did not produce expected plan file at ${PLAN_ENG_PATH}. ` +
`outcome=${obs.outcome} review=${obs.reviewCount}`,
);
}
const planContent = fs.readFileSync(PLAN_ENG_PATH, 'utf-8');
const verdict = assertReviewReportAtBottom(planContent);
if (!verdict.ok) {
throw new Error(
`D19 FAIL: plan file at ${PLAN_ENG_PATH} ${verdict.reason}\n` +
(verdict.trailingHeadings
? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
: '') +
`--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
);
}
} finally {
try {
fs.rmSync(PLAN_ENG_PATH, { force: true });
} catch {
/* best-effort */
}
}
},
1_700_000,
);
});
@@ -97,8 +97,10 @@ describe('selectTests', () => {
     expect(result.selected).toContain('ask-user-question-format-pty');
     expect(result.selected).toContain('plan-ceo-mode-routing');
     expect(result.selected).toContain('autoplan-chain-pty');
-    expect(result.selected.length).toBe(18);
-    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 18);
+    // Per-finding count + review-report-at-bottom (v1.21.x)
+    expect(result.selected).toContain('plan-ceo-finding-count');
+    expect(result.selected.length).toBe(19);
+    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 19);
   });
   test('global touchfile triggers ALL tests', () => {