Files
Garry Tan 454423aeb3 v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255)
* test: extract classifyVisible() + permission-dialog filter in PTY runner

Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:

- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
  gstack-config does not yet honor env overrides; plumbing is in place for a
  future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
  number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
  in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
  requires a file-edit context co-trigger. Other clauses unchanged. Skill
  questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
  plan_ready → asked → null. Permission dialogs filtered out of the
  'asked' classification so a permission prompt cannot pose as a Step 0
  skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.

* test: tighten plan-ceo-review smoke to require Step 0 fires first

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.

* chore: capture v1.21.1.0 follow-ups in TODOS.md

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.

* chore: bump version and changelog (v1.21.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract MODE_RE + optionsSignature into PTY runner exports

Refactor prep for the upcoming per-finding AskUserQuestion count test
across plan-{ceo,eng,design,devex}-review. Both new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe — pulling them into one source of truth in
test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the
fingerprint shape) updates everywhere instead of drifting per-test.

Mechanical: no behavior change in the mode-routing test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add per-finding count primitives + unit tests

Pure helpers landing ahead of runPlanSkillCounting:

  - parseQuestionPrompt(visible) — extract the 1-3 line prompt above
    the latest "❯ 1." cursor, normalize to a 240-char snippet
  - auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
    options signature; distinct prompts with shared option labels
    (the generic A/B/C TODO menu) get distinct fingerprints
  - COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
    plan-review skills' completion / verdict markers
  - assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
    REPORT" is present and is the last "## " heading in a plan file
  - Step0BoundaryPredicate type + four per-skill predicates
    (ceo / eng / design / devex) — fire on the answered AUQ's
    fingerprint, marking the end of Step 0 deterministically
    (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add runPlanSkillCounting PTY helper

Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:

  - Boot + auto-trust handler (existing launchClaudePty)
  - Send slash command alone, sleep 3s, send plan content as follow-up
    message (proven pattern from skill-e2e-plan-design-with-ui)
  - Poll loop with permission-dialog auto-grant, same-redraw skip,
    empty-prompt re-poll
  - Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
    the answered AUQ's fingerprint (Codex F7 — boundary is observed
    event, not later rendered content)
  - Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
    plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.

No E2E tests yet — those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: register four finding-count tests in touchfiles + tier map

Each new test depends on its skill template, the runner, and three
preamble resolvers (preamble.ts, generate-ask-user-format.ts,
generate-completion-status.ts) — those affect question cadence and
completion rendering, which is exactly what the test asserts on.

All four classified periodic. Sequential execution during calibration;
opt-in to concurrent only after measured comparison agrees (plan §D15).

Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests
(was 18) because plan-ceo-finding-count joins the family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)

Each test drives its plan-* skill through Step 0 then asserts the
review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5
seeded plan, plus D19: produced plan file ends with
"## GSTACK REVIEW REPORT" as its last "## " heading.

plan-ceo also runs a paired-finding positive control: 2 deliberately
related findings should still produce 2 distinct AUQs, not 1 batched.

Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic).
Sequential execution by plan §D15. Each fixture is inline TypeScript
content delivered as a follow-up message after the slash command, per
the proven pattern at skill-e2e-plan-design-with-ui.test.ts.

Calibration loop (5 runs per skill) and the manual pre-merge negative
check (D7 + D12) are required before merge per plan §Verification.
NOT yet run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: fix parseNumberedOptions for inline-cursor box-layout AUQs

Calibration run 1 timed out with step0=0 review=0 because the parser
could not find the cursor in /plan-ceo-review's scope-selection AUQ.
The TTY's box-layout rendering inlines divider + header + prompt +
"1." onto one logical line — cursor escapes get stripped, leaving
text crushed onto a single line.

Cursor anchor regex changed from anchored to unanchored so it matches
mid-line. Cursor-line option extraction uses a non-anchored regex;
subsequent options stay with the original start-of-line parser.

parseQuestionPrompt picks up the inline prompt text BEFORE the cursor
on the cursor line (after stripping box-drawing chars + sigil) and
appends it after any walked-up multi-line prompt above.

Three new unit tests: clean-cursor still works, inline-cursor
extracts all 7 options, prompt extraction strips box chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add firstAUQPick + plan-ceo skip-interview routing

Calibration run 1 surfaced a second issue beyond the parser bug: the
default pick of 1 on /plan-ceo-review's scope-selection AUQ routes
the agent to "branch diff vs main" — so it reviews the gstack PR
itself (recursive!) instead of the seeded fixture plan we sent.

Added firstAUQPick callback to runPlanSkillCounting. Override applies
only to the FIRST AUQ; subsequent presses keep using defaultPick.

ceoStep0Boundary now fires on either the mode-pick AUQ (existing path)
or any AUQ containing "Skip interview and plan immediately" — which
is the scope-selection AUQ. Picking that option bypasses Step 0 and
routes straight to review-phase using the chat-paste plan as context.

Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the
"Skip interview" option by label. Falls back to "describe inline" if
the option labels change.

Two new unit tests: ceoStep0Boundary fires on the scope-selection
fixture; existing mode-pick fixture still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:50:09 -07:00
..