mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255)
* test: extract classifyVisible() + permission-dialog filter in PTY runner
Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:
- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
gstack-config does not yet honor env overrides; plumbing is in place for a
future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
requires a file-edit context co-trigger. Other clauses unchanged. Skill
questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
plan_ready → asked → null. Permission dialogs filtered out of the
'asked' classification so a permission prompt cannot pose as a Step 0
skill question.
Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.
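To illustrate the co-trigger contract described above, here is a minimal sketch of the bare-phrase clause only. The two regexes are the ones that appear in the diff; the function name `bareProceedIsPermission` and both sample buffers are invented for illustration, and the standalone permission signatures are deliberately left out.

```typescript
// Sketch of the co-trigger clause only; this is not the full
// isPermissionDialogVisible. `bareProceedIsPermission` is a hypothetical name.
function bareProceedIsPermission(visible: string): boolean {
  const hasBarePhrase = /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible);
  const hasFileEditContext = /(Edit|Write)\s+to\s+\S+/i.test(visible);
  // The bare phrase counts only when a file-edit context is also on screen.
  return hasBarePhrase && hasFileEditContext;
}

// Invented sample buffers: a skill question and a file-edit permission prompt.
const skillQuestion = 'Scope is locked.\nDo you want to proceed?\n❯ 1. Yes\n  2. No';
const editPermission = 'Edit to src/app.ts\nDo you want to proceed?\n❯ 1. Yes\n  2. No';

console.log(bareProceedIsPermission(skillQuestion)); // false
console.log(bareProceedIsPermission(editPermission)); // true
```

Because the context regex is case-insensitive, ordinary prose such as "write to disk" in the same buffer would also satisfy it, so the co-trigger narrows false positives rather than eliminating them.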
* test: tighten plan-ceo-review smoke to require Step 0 fires first
Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.
Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.
Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.
Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.
Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.
* chore: capture v1.21.1.0 follow-ups in TODOS.md
- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS
All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.
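The P3 env follow-up above asks for an env-first lookup with a YAML fallback. A minimal sketch of that lookup order, assuming the env variable name is the upper-cased config key (`question_tuning` maps to `QUESTION_TUNING`); `resolveSetting`, the `Settings` type, and the sample values are hypothetical and say nothing about how `gstack-config` is actually implemented.

```typescript
type Settings = Record<string, string | undefined>;

// Hypothetical helper: a non-empty env value wins, the YAML value is the fallback.
// Assumption: env name is the upper-cased config key.
function resolveSetting(key: string, env: Settings, yaml: Settings): string | undefined {
  const fromEnv = env[key.toUpperCase()];
  if (fromEnv !== undefined && fromEnv !== '') return fromEnv;
  return yaml[key];
}

// Invented sample data standing in for ~/.gstack/config.yaml and the test env.
const yamlValues: Settings = { question_tuning: 'true', explain_level: 'verbose' };
const testEnv: Settings = { QUESTION_TUNING: 'false' };

console.log(resolveSetting('question_tuning', testEnv, yamlValues)); // "false"
console.log(resolveSetting('explain_level', testEnv, yamlValues)); // "verbose"
```

With this order, a test that passes `QUESTION_TUNING: 'false'` would be isolated from a developer machine whose YAML sets `question_tuning: true`.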
* chore: bump version and changelog (v1.21.1.0)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: extract MODE_RE + optionsSignature into PTY runner exports
Refactor prep for the upcoming per-finding AskUserQuestion count test
across plan-{ceo,eng,design,devex}-review. Both new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe — pulling them into one source of truth in
test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the
fingerprint shape) updates everywhere instead of drifting per-test.
Mechanical: no behavior change in the mode-routing test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add per-finding count primitives + unit tests
Pure helpers landing ahead of runPlanSkillCounting:
- parseQuestionPrompt(visible) — extract the 1-3 line prompt above
the latest "❯ 1." cursor, normalize to a 240-char snippet
- auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
options signature; distinct prompts with shared option labels
(the generic A/B/C TODO menu) get distinct fingerprints
- COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
plan-review skills' completion / verdict markers
- assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
REPORT" is present and is the last "## " heading in a plan file
- Step0BoundaryPredicate type + four per-skill predicates
(ceo / eng / design / devex) — fire on the answered AUQ's
fingerprint, marking the end of Step 0 deterministically
(event-based, not content-based, per Codex F7)
Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
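The fingerprint idea in this commit (hash the normalized prompt together with the sorted options signature) can be sketched as follows. This is an illustration under assumptions, not the runner's code: `fingerprintSketch`, `normalizePrompt`, and `fnv1a` are hypothetical names, and a small FNV-1a hash stands in for `Bun.hash` so the example runs without Bun.

```typescript
type Option = { index: number; label: string };

// Same shape as the runner's optionsSignature: sort by index, join as "index:label".
function optionsSignature(opts: Option[]): string {
  return [...opts]
    .sort((a, b) => a.index - b.index)
    .map((o) => `${o.index}:${o.label}`)
    .join('|');
}

// Collapse whitespace, trim, and cap at 240 chars (the snippet length named above).
function normalizePrompt(raw: string): string {
  return raw.replace(/\s+/g, ' ').trim().slice(0, 240);
}

// 32-bit FNV-1a, used here only as a stand-in for Bun.hash.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Hypothetical fingerprint: the prompt is part of the hash input, so two
// questions sharing one generic menu still get different fingerprints.
function fingerprintSketch(prompt: string, opts: Option[]): string {
  return fnv1a(`${normalizePrompt(prompt)}\n${optionsSignature(opts)}`);
}

const menu: Option[] = [
  { index: 1, label: 'Add to plan' },
  { index: 2, label: 'Defer' },
  { index: 3, label: 'Build now' },
];
const fpA = fingerprintSketch('Finding A: how should we handle this?', menu);
const fpB = fingerprintSketch('Finding B: how should we handle this?', menu);
console.log(fpA !== fpB); // true
```

Hashing the options alone would give both findings the same fingerprint, which is the option-label collision the commit message describes.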
* test: add runPlanSkillCounting PTY helper
Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:
- Boot + auto-trust handler (existing launchClaudePty)
- Send slash command alone, sleep 3s, send plan content as follow-up
message (proven pattern from skill-e2e-plan-design-with-ui)
- Poll loop with permission-dialog auto-grant, same-redraw skip,
empty-prompt re-poll
- Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
the answered AUQ's fingerprint (Codex F7 — boundary is observed
event, not later rendered content)
- Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
plan_ready, silent_write, exited, timeout
Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.
No E2E tests yet — those land in commit 5 with the four skill fixtures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: register four finding-count tests in touchfiles + tier map
Each new test depends on its skill template, the runner, and three
preamble resolvers (preamble.ts, generate-ask-user-format.ts,
generate-completion-status.ts) — those affect question cadence and
completion rendering, which is exactly what the test asserts on.
All four are classified periodic. Execution is sequential during calibration;
opt in to concurrent only after a measured comparison agrees (plan §D15).
Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests
(was 18) because plan-ceo-finding-count joins the family.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)
Each test drives its plan-* skill through Step 0 then asserts the
review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5
seeded plan, plus D19: produced plan file ends with
"## GSTACK REVIEW REPORT" as its last "## " heading.
plan-ceo also runs a paired-finding positive control: 2 deliberately
related findings should still produce 2 distinct AUQs, not 1 batched.
Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic).
Sequential execution by plan §D15. Each fixture is inline TypeScript
content delivered as a follow-up message after the slash command, per
the proven pattern at skill-e2e-plan-design-with-ui.test.ts.
Calibration loop (5 runs per skill) and the manual pre-merge negative
check (D7 + D12) are required before merge per plan §Verification.
NOT yet run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
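The two assertions this commit describes, the [N-1, N+2] count window and the report-at-bottom check, can be sketched as pure predicates. The names `inCountWindow` and `reviewReportIsLastHeading` are hypothetical; the real `assertReviewReportAtBottom` may differ in signature and in how it reports failure.

```typescript
// Hypothetical predicate: observed review-phase AUQ count must fall in [N-1, N+2].
function inCountWindow(observed: number, seeded: number): boolean {
  return observed >= seeded - 1 && observed <= seeded + 2;
}

// Hypothetical predicate: "## GSTACK REVIEW REPORT" must be present and must be
// the last line that starts with "## ". This sketch does not skip headings
// that appear inside fenced code blocks.
function reviewReportIsLastHeading(content: string): boolean {
  const headings = content.split('\n').filter((line) => line.startsWith('## '));
  const last = headings[headings.length - 1] ?? '';
  return last.trim() === '## GSTACK REVIEW REPORT';
}

// Invented plan files for illustration.
const goodPlan = '# Plan\n\n## Scope\n\ntext\n\n## GSTACK REVIEW REPORT\n\nverdict';
const midFilePlan = '# Plan\n\n## GSTACK REVIEW REPORT\n\nverdict\n\n## Appendix\n';

console.log(inCountWindow(4, 5), inCountWindow(8, 5)); // true false
console.log(reviewReportIsLastHeading(goodPlan)); // true
console.log(reviewReportIsLastHeading(midFilePlan)); // false
```

The asymmetric window tolerates one consolidated finding below N and up to two extra questions above it.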
* test: fix parseNumberedOptions for inline-cursor box-layout AUQs
Calibration run 1 timed out with step0=0 review=0 because the parser
could not find the cursor in /plan-ceo-review's scope-selection AUQ.
The TTY's box-layout rendering inlines divider + header + prompt +
"1." onto one logical line — cursor escapes get stripped, leaving
text crushed onto a single line.
Cursor anchor regex changed from anchored to unanchored so it matches
mid-line. Cursor-line option extraction uses a non-anchored regex;
subsequent options stay with the original start-of-line parser.
parseQuestionPrompt picks up the inline prompt text BEFORE the cursor
on the cursor line (after stripping box-drawing chars + sigil) and
appends it after any walked-up multi-line prompt above.
Three new unit tests: clean-cursor still works, inline-cursor
extracts all 7 options, prompt extraction strips box chars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add firstAUQPick + plan-ceo skip-interview routing
Calibration run 1 surfaced a second issue beyond the parser bug: the
default pick of 1 on /plan-ceo-review's scope-selection AUQ routes
the agent to "branch diff vs main" — so it reviews the gstack PR
itself (recursive!) instead of the seeded fixture plan we sent.
Added firstAUQPick callback to runPlanSkillCounting. Override applies
only to the FIRST AUQ; subsequent presses keep using defaultPick.
ceoStep0Boundary now fires on either the mode-pick AUQ (existing path)
or any AUQ containing "Skip interview and plan immediately" — which
is the scope-selection AUQ. Picking that option bypasses Step 0 and
routes straight to review-phase using the chat-paste plan as context.
Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the
"Skip interview" option by label. Falls back to "describe inline" if
the option labels change.
Two new unit tests: ceoStep0Boundary fires on the scope-selection
fixture; existing mode-pick fixture still fires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,5 +1,51 @@
# Changelog

## [1.21.1.0] - 2026-04-28

## **plan-ceo-review smoke tightens. The "agent skips Step 0 and ships a plan" regression now fails the gate.**

The v1.15.0.0 real-PTY harness shipped with a smoke that accepted either `'asked'` or `'plan_ready'` as success. That OR was too lax for `/plan-ceo-review` specifically: the skill template mandates Step 0A premise challenge plus Step 0F mode selection BEFORE any plan write, so reaching `plan_ready` first IS the regression. This release tightens the assertion to `'asked'` only for that smoke, and refactors the runner so the contract is testable in <1s instead of $0.50 of stochastic PTY.

### The numbers that matter

Numbers come from `git diff --shortstat origin/main..HEAD` and `bun test test/helpers/claude-pty-runner.unit.test.ts` on a clean tree.

| Metric | Δ |
|---|---|
| Net branch size vs main | +162 / −65 lines (3 files) |
| New unit tests added | **+24** (claude-pty-runner.unit.test.ts) |
| Unit suite runtime | **14ms** (deterministic, free-tier) |
| Real-PTY gate runs verified | **4 clean PTY runs** (3 lock-in + 1 post-refactor) |
| Outcome assertions covered | **5/5** (was 3/5; `plan_ready` is now FAIL for plan-ceo) |
| Reviewers run on this PR | plan-eng-review (CLEARED) + codex consult + 2 specialists + adversarial |

### What this means for builders

Three new classes of harness regression are now caught deterministically in the free tier instead of waiting on a $0.50 stochastic PTY run. The classifier is extracted into a pure `classifyVisible()` function so reordering branches in the polling loop fails the unit tests instead of silently shipping. Permission dialogs (which render numbered lists) are filtered out of the `'asked'` classification so a permission prompt cannot pose as a Step 0 skill question. The bare phrase `Do you want to proceed?` no longer triggers permission detection on its own — it now requires a file-edit context co-trigger, so a skill question that contains the phrase isn't mis-classified.

For `/plan-ceo-review` specifically: any future preamble slim-down or template edit that lets the agent skip Step 0 and write a plan will fail the gate before the PR ships. Pull, run `bun test`, and the harness layer is provably tighter without you having to spend a token.

### Itemized changes

#### Added

- `test/helpers/claude-pty-runner.unit.test.ts`: 24 deterministic tests covering `isPermissionDialogVisible` (with the new co-trigger contract), `isNumberedOptionListVisible`, `parseNumberedOptions`, and the new `classifyVisible()` runtime path. Free-tier, runs on every `bun test`.
- `classifyVisible(visible)` in `claude-pty-runner.ts`: pure classifier extracted from the polling loop. Returns `{ outcome, summary } | null`. Branch order: silent_write → plan_ready → asked → null (with permission-dialog filter). Live-state branches (process exited, "Unknown command") stay in the runner.
- `TAIL_SCAN_BYTES = 1500` exported constant. Shared between `runPlanSkillObservation` and the routing test's nav loop so tuning stays in sync.
- `env?: Record<string, string>` option on `runPlanSkillObservation`, threaded to `launchClaudePty`. Plumbing for future env-driven test isolation (gstack-config does not yet honor env overrides; tracked as post-merge follow-up).

#### Changed

- `test/skill-e2e-plan-ceo-plan-mode.test.ts`: assertion narrowed from `['asked', 'plan_ready']` to `'asked'` only. Failure message now branches on `outcome` (plan_ready vs timeout vs silent_write) with a tailored diagnosis line, and references skill-template section names instead of line numbers (durable to template edits).
- `isPermissionDialogVisible`: bare `Do you want to proceed?` now requires a file-edit context co-trigger (`Edit to <path>` or `Write to <path>`). Other clauses (`requested permissions to`, `allow all edits`, `always allow access to`, `Bash command requires permission`) remain unconditional.
- `test/skill-e2e-plan-ceo-mode-routing.test.ts`: replaces the local `1500` magic number with the shared `TAIL_SCAN_BYTES` constant.

#### For contributors

- The runner change is additive and the existing sibling smokes (`plan-eng`, `plan-design`, `plan-devex`, `plan-mode-no-op`) keep their loose `['asked', 'plan_ready']` assertion. Their behavior is unchanged.
- Post-merge follow-ups captured in `TODOS.md`: per-finding AskUserQuestion count assertion (V2), env-driven gstack-config overrides (so `QUESTION_TUNING=false` actually isolates the test), path-confusion hardening on `SANCTIONED_WRITE_SUBSTRINGS`.


## [1.20.0.0] - 2026-04-28

## **Browser-skills land. `/scrape <intent>` first call drives the page; second call runs the codified script in 200ms.**
@@ -213,6 +213,56 @@ scope of that PR; deliberately deferred to keep PTY-import small.

## Testing

## P2: Per-finding AskUserQuestion count assertion for /plan-ceo-review

**What:** PTY E2E test that drives /plan-ceo-review through Step 0 with a stable fixture diff containing N known findings, asserts that exactly N distinct AskUserQuestions fire (one per finding) before plan_ready.

**Why:** The skill template repeats "One issue = one AskUserQuestion call. Never combine multiple issues into one question." at every review checkpoint. No test enforces it. The current `skill-e2e-plan-ceo-plan-mode.test.ts` smoke (post-v1.21.1.0) only catches "agent skipped Step 0 entirely." Batching findings into one question slips through silently.

**Pros:** Locks in the strongest contract the skill mandates. Catches a real failure mode (the original attachment showed 2 findings batched as 0 questions).
**Cons:** Needs a stable fixture diff to keep finding count deterministic (~1 day human / ~30 min CC). Opus may reasonably consolidate two related findings, so the assertion needs a forgiving lower bound (e.g., `>= ceil(N * 0.6)`) rather than strict equality.

**Context:** The PTY harness (`runPlanSkillObservation`) returns at first terminal outcome — for V2 we need a streaming variant that counts AskUserQuestions across the whole session up to `plan_ready`. Probably a new helper alongside `runPlanSkillObservation`.

**Depends on:** Stable fixture diff (`test/fixtures/plans/multi-finding.diff` or similar) with a small known set of issues that triggers all 4 review sections.

**Priority:** P2.
**Effort:** S (CC: ~30 min once fixture exists). Captured from v1.21.1.0 plan-eng-review D2.

---

## P3: Honor env vars in gstack-config (so QUESTION_TUNING/EXPLAIN_LEVEL actually isolate tests)

**What:** `gstack-config get <key>` reads `~/.gstack/config.yaml`. `runPlanSkillObservation` plumbs `env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }` through to the spawned `claude` process — but the skill preamble bash uses `gstack-config get question_tuning`, which never looks at env. The env passthrough is theater on current code.

**Why:** Without env honoring, the v1.21.1.0 plan-ceo-review smoke is still flaky on machines with `question_tuning: true` set in YAML. AUTO_DECIDE preferences would skip the rendered AskUserQuestion list, masking the regression we want to catch.

**Pros:** Makes the gate test hermetic across machines. The env wiring is already in place — only `gstack-config` needs to read env first, fall back to YAML.
**Cons:** Touches the gstack-config binary across all 3 platforms (linux/darwin/windows). Cross-binary refactor.

**Context:** Captured from v1.21.1.0 adversarial review. Documented honestly in the test docstring as a known limitation.

**Priority:** P3.
**Effort:** S. Single-file edit to `bin/gstack-config` (~10 LOC for env-first lookup).

---

## P3: Path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

**What:** `runPlanSkillObservation`'s silent-write detector uses substring matching on a few sanctioned paths (`.gstack/`, `CHANGELOG.md`, `TODOS.md`, etc). A write to `node_modules/some-pkg/CHANGELOG.md` or `src/foo/.gstack/leak.ts` is currently sanctioned because the substring matches anywhere in the path.

**Why:** Defensive — no current bug exploits this, but a malicious skill or fixture could write to a path that happens to contain `.gstack/` or `CHANGELOG.md` and slip past silent-write detection.

**Pros:** Hardens the harness against future skill misbehavior. Aligns substring rules with their intent.
**Cons:** Need to anchor against absolute prefixes (`os.homedir() + '/.gstack/'`, worktree root) which makes the test less portable across machines.

**Context:** Captured from v1.21.1.0 adversarial review (HIGH/FIXABLE finding, pre-existing). Refactored into a `SANCTIONED_WRITE_SUBSTRINGS` constant in v1.21.1.0 but the substring-includes logic is unchanged from before.

**Priority:** P3.
**Effort:** S.

---

## P1: Structural STOP-Ask forcing function across all skills

**What:** Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired <N), typed question templates (skill hands the model pre-built AskUserQuestion payloads rather than prose instructions), or a canUseTool-based post-run audit that compares declared-gates-fired vs expected.
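The P3 path-confusion item in the TODOS.md hunk above contrasts substring matching with anchoring against absolute prefixes. A minimal sketch of the anchored variant, assuming POSIX-style absolute paths and a caller-supplied list of sanctioned roots; `isSanctionedWrite`, `normalizePosix`, and the sample roots are hypothetical and are not part of the harness.

```typescript
// Resolve "." and ".." segments in a POSIX-style absolute path (sketch only).
function normalizePosix(p: string): string {
  const out: string[] = [];
  for (const seg of p.split('/')) {
    if (seg === '' || seg === '.') continue;
    if (seg === '..') {
      out.pop();
      continue;
    }
    out.push(seg);
  }
  return '/' + out.join('/');
}

// Hypothetical anchored check: the target must equal a sanctioned root or sit
// beneath it. Relative targets are rejected because they cannot be anchored.
function isSanctionedWrite(target: string, sanctionedRoots: string[]): boolean {
  if (!target.startsWith('/')) return false;
  const t = normalizePosix(target);
  return sanctionedRoots.some((root) => {
    const r = normalizePosix(root);
    return t === r || t.startsWith(r + '/');
  });
}

// Invented roots standing in for os.homedir() + '/.gstack' and worktree files.
const roots = ['/home/dev/.gstack', '/work/repo/CHANGELOG.md', '/work/repo/TODOS.md'];

console.log(isSanctionedWrite('/home/dev/.gstack/config.yaml', roots)); // true
console.log(isSanctionedWrite('/work/repo/node_modules/some-pkg/CHANGELOG.md', roots)); // false
console.log(isSanctionedWrite('/work/repo/src/foo/.gstack/leak.ts', roots)); // false
```

Passing the roots in as a parameter keeps the check pure, so a unit test can supply fixed roots even though the real roots differ per machine.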
@@ -1,6 +1,6 @@
 {
   "name": "gstack",
-  "version": "1.20.0.0",
+  "version": "1.21.1.0",
   "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
   "license": "MIT",
   "type": "module",
@@ -138,6 +138,15 @@ export function isPlanReadyVisible(visible: string): boolean {
   return /ready to execute|Would you like to proceed/i.test(visible);
 }
 
+/**
+ * Recent-tail window (in bytes of stripped TTY text) used when classifying
+ * permission dialogs. Old permission text persists in the visibleSince buffer
+ * after the dialog is dismissed, so callers should pass `visible.slice(-TAIL_SCAN_BYTES)`
+ * to avoid re-triggering on stale scrollback. Shared between `runPlanSkillObservation`
+ * and `navigateToModeAskUserQuestion` in the routing test so tuning stays in sync.
+ */
+export const TAIL_SCAN_BYTES = 1500;
+
 /**
  * Detect a Claude Code permission dialog. These render as a numbered
  * option list (so isNumberedOptionListVisible matches them) but they
@@ -145,23 +154,37 @@ export function isPlanReadyVisible(visible: string): boolean {
  * whether to grant a tool/file permission. Tests that look for skill
  * AskUserQuestions must explicitly skip these.
  *
- * Both English phrases below are stable across recent Claude Code
+ * The English phrases below are stable across recent Claude Code
  * versions. The check is permissive on whitespace because TTY rendering
  * may wrap or reflow text.
+ *
+ * Co-trigger requirement: the bare phrase "Do you want to proceed?" is
+ * generic enough that a skill question could legitimately use it
+ * ("Do you want to proceed with HOLD SCOPE?"). To avoid mis-classifying
+ * skill questions as permission dialogs, this phrase only counts when it
+ * co-occurs with a file-edit context ("Edit to <path>" or "Write to <path>").
+ * The standalone permission signatures (`requested permissions to`,
+ * `allow all edits`, `always allow access to`, `Bash command requires permission`)
+ * remain unconditional.
  */
 export function isPermissionDialogVisible(visible: string): boolean {
-  return (
-    /requested\s+permissions?\s+to/i.test(visible) ||
-    /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) ||
-    // "Yes / Yes, allow all edits / No" shape rendered by Claude Code for
-    // file-edit permission grants. The middle option's "allow all" phrase
-    // is the unique signature.
-    /\ballow\s+all\s+edits\b/i.test(visible) ||
-    // "Yes, and always allow access to <dir>" shape (workspace trust).
-    /always\s+allow\s+access\s+to/i.test(visible) ||
-    // Bash command permission prompts.
-    /Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)
-  );
+  // Standalone signatures — high specificity, never appear in skill questions.
+  if (/requested\s+permissions?\s+to/i.test(visible)) return true;
+  // "Yes / Yes, allow all edits / No" shape — file-edit permission grants.
+  if (/\ballow\s+all\s+edits\b/i.test(visible)) return true;
+  // "Yes, and always allow access to <dir>" shape — workspace trust.
+  if (/always\s+allow\s+access\s+to/i.test(visible)) return true;
+  // Bash command permission prompts.
+  if (/Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)) return true;
+  // "Do you want to proceed?" only counts as a permission dialog when paired
+  // with a file-edit context. Skill questions can use the bare phrase.
+  if (
+    /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) &&
+    /(Edit|Write)\s+to\s+\S+/i.test(visible)
+  ) {
+    return true;
+  }
+  return false;
 }
 
 /** Detect any AskUserQuestion-shaped numbered option list with cursor. */
@@ -211,12 +234,14 @@ export function parseNumberedOptions(
   // this, parseNumberedOptions returns stale options after the dialog is
   // dismissed.
   const lines = tail.split('\n');
-  // Anchor on the LAST `❯ 1.` line (cursor is on option 1 of the active
-  // AskUserQuestion). Greedy character classes don't help here — we need a literal
-  // `❯` after optional leading whitespace.
+  // Anchor on the LAST line containing `❯<spaces>1.` ANYWHERE on the line.
+  // The /plan-*-review skill's box-layout AUQ uses TTY cursor-positioning
+  // escapes that stripAnsi removes — leaving the cursor `❯1.` mid-line,
+  // after dividers + header + prompt text on the same logical line. The
+  // earlier `^\s*❯` anchor missed those entirely.
   let cursorLineIdx = -1;
   for (let i = lines.length - 1; i >= 0; i--) {
-    if (/^\s*❯\s*1\./.test(lines[i] ?? '')) {
+    if (/❯\s*1\./.test(lines[i] ?? '')) {
       cursorLineIdx = i;
       break;
     }
@@ -236,7 +261,37 @@ export function parseNumberedOptions(
   if (cursorLineIdx < 0) return [];
   const found: Array<{ index: number; label: string }> = [];
   const seenIndices = new Set<number>();
-  for (let i = cursorLineIdx; i < lines.length; i++) {
+
+  // Cursor line: option 1 may be inline after box dividers + prompt header
+  // (`...divider...header...❯1. label`). Use a non-anchored regex that
+  // captures `❯N. label` from anywhere on the line through end-of-line.
+  // Only used for the cursor line — subsequent options are parsed with the
+  // start-of-line `optionRe`.
+  const cursorLine = lines[cursorLineIdx] ?? '';
+  const cursorInlineRe = /❯\s*([1-9])\.\s*(\S.*?)\s*$/;
+  const inlineMatch = cursorInlineRe.exec(cursorLine);
+  if (inlineMatch) {
+    const idx = Number(inlineMatch[1]);
+    const label = (inlineMatch[2] ?? '').trim();
+    if (label.length > 0 && !seenIndices.has(idx)) {
+      seenIndices.add(idx);
+      found.push({ index: idx, label });
+    }
+  } else {
+    // No inline cursor match — fall back to start-of-line regex.
+    const startMatch = optionRe.exec(cursorLine);
+    if (startMatch) {
+      const idx = Number(startMatch[1]);
+      const label = (startMatch[2] ?? '').trim();
+      if (label.length > 0 && !seenIndices.has(idx)) {
+        seenIndices.add(idx);
+        found.push({ index: idx, label });
+      }
+    }
+  }
+
+  // Subsequent lines: standard start-of-line option parsing.
+  for (let i = cursorLineIdx + 1; i < lines.length; i++) {
     const m = optionRe.exec(lines[i] ?? '');
     if (!m) continue;
     const idx = Number(m[1]);
@@ -261,6 +316,333 @@ export function parseNumberedOptions(
   return found;
 }
 
+/**
+ * The four /plan-ceo-review modes. Used by `skill-e2e-plan-ceo-mode-routing`
+ * to detect Step 0F mode-selection AskUserQuestions, and by the upcoming
+ * finding-count tests as a Step-0 boundary signal: an AUQ whose options
+ * match this regex IS the mode pick (the last Step-0 question for plan-ceo).
+ *
+ * Lifted out of the mode-routing test so multiple PTY tests can share one
+ * source of truth — when /plan-ceo-review adds a fifth mode, one regex updates
+ * everywhere instead of drifting per-test.
+ */
+export const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i;
+
+/**
+ * Stable signature for a parsed numbered-option list — used by tests to detect
+ * "is this AUQ the same as the last poll, or has the agent advanced to a new
+ * one?" Joins each option as `${index}:${label}` after sorting by index.
+ *
+ * Defensive sort means the signature is order-independent at the input level,
+ * even though `parseNumberedOptions` already returns indices in ascending order.
+ */
+export function optionsSignature(
+  opts: Array<{ index: number; label: string }>,
+): string {
+  return [...opts]
+    .sort((a, b) => a.index - b.index)
+    .map((o) => `${o.index}:${o.label}`)
+    .join('|');
+}
+
+/**
+ * Pure classifier for the visible TTY buffer. Decides which outcome the
+ * polling loop should return on this tick, or `null` to keep polling.
+ *
+ * Extracted from `runPlanSkillObservation` so the unit suite can exercise
+ * the actual branch order with synthetic input strings — a future contributor
+ * who reorders the branches (e.g., moves the permission short-circuit) gets
+ * caught by the unit tests, not by a stochastic E2E run.
+ *
+ * Live-state branches (process exited, "Unknown command") stay in the runner
+ * since they need the session handle.
+ */
+export type ClassifyResult =
+  | { outcome: 'silent_write'; summary: string }
+  | { outcome: 'plan_ready'; summary: string }
+  | { outcome: 'asked'; summary: string }
+  | null;
+
+const SANCTIONED_WRITE_SUBSTRINGS = [
+  '.claude/plans',
+  '.gstack/',
+  '/.context/',
+  'CHANGELOG.md',
+  'TODOS.md',
+];
+
+export function classifyVisible(visible: string): ClassifyResult {
+  // Silent-write detection: any Write/Edit tool render that targets a path
+  // OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen
+  // (a numbered prompt means a permission/AskUserQuestion is gating the write,
+  // not an actual silent write).
+  const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
+  let m: RegExpExecArray | null;
+  while ((m = writeRe.exec(visible)) !== null) {
+    const target = m[1] ?? '';
+    const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s));
+    if (!sanctioned && !isNumberedOptionListVisible(visible)) {
+      return {
+        outcome: 'silent_write',
+        summary: `Write/Edit to ${target} fired before any AskUserQuestion`,
+      };
+    }
+  }
+  if (isPlanReadyVisible(visible)) {
+    return {
+      outcome: 'plan_ready',
+      summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation',
+    };
+  }
+  if (isNumberedOptionListVisible(visible)) {
+    // Permission dialogs render numbered lists too. Skip them — the
+    // bug we want to catch is "skill question never fired."
+    if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
+      return null;
+    }
+    return {
+      outcome: 'asked',
+      summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)',
+    };
+  }
+  return null;
+}
|
||||
|
||||
// ────────────────────────────────────────────────────────────────────────────
|
||||
// Per-finding AskUserQuestion count primitives (used by runPlanSkillCounting).
|
||||
//
|
||||
// These are pure helpers extracted up-front so the unit suite can exercise
|
||||
// them deterministically before the live-PTY counter runs them. Each one is
|
||||
// independently unit-testable against synthetic visible-buffer strings.
|
||||
// ────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
/**
|
||||
* Captured identity of an AskUserQuestion — the rendered question text plus
|
||||
* its numbered options. Used by `runPlanSkillCounting` to dedupe redrawn
|
||||
* prompts and to feed `Step0BoundaryPredicate` callers.
|
||||
*
|
||||
* `signature` is the stable hash. Two AUQs with identical prompt + options
|
||||
* produce the same signature; differences in either field produce different
|
||||
* signatures. Critically: two AUQs with shared option labels (e.g. the
|
||||
* generic "A) Add to plan / B) Defer / C) Build now" menu) but different
|
||||
* question text get DIFFERENT signatures because the prompt is in the hash.
|
||||
*/
|
||||
export interface AskUserQuestionFingerprint {
  /** Stable hash combining normalized prompt text + options signature. */
  signature: string;
  /** First 240 chars of the rendered question prompt (post-normalization). */
  promptSnippet: string;
  /** Captured option labels, in index order. */
  options: Array<{ index: number; label: string }>;
  /** Wall-clock offset when first observed (ms since the helper was invoked, so it includes PTY boot). */
  observedAtMs: number;
  /** True if observed BEFORE the Step-0 boundary fired. */
  preReview: boolean;
}

/**
 * Predicate fired against the AUQ we just answered (not the visible buffer).
 * Returns true if this AUQ's fingerprint marks the LAST Step-0 question for
 * its skill — all subsequent AUQs are review-phase findings.
 *
 * Event-based by design: matching against an answered AUQ's fingerprint
 * (prompt + options) is deterministic, whereas matching against later
 * rendered content (section headers, summary text) races with the agent's
 * output cadence. See plan §D14 for the rationale.
 */
export type Step0BoundaryPredicate = (
  answeredFingerprint: AskUserQuestionFingerprint,
) => boolean;

/**
 * Parse the rendered question prompt out of a visible TTY buffer. The prompt
 * is the text immediately ABOVE the latest `❯ 1.` cursor line (typically 1–3
 * lines; the parser walks up at most 6), plus any text that shares the cursor
 * line in box layouts. It excludes the option list and the permission-dialog
 * header.
 *
 * Returns the prompt normalized to a single-spaced 240-char snippet (strip
 * ANSI residue, collapse internal whitespace, trim) — short enough to use as
 * a hash key, long enough to disambiguate distinct questions.
 *
 * Returns "" when no prompt could be parsed (cursor not yet rendered, or
 * cursor is at the top of the buffer with no preceding text). Callers that
 * use the empty string as a fingerprint input should treat empty-prompt
 * AUQs as "wait one more poll" rather than fingerprinting them — otherwise
 * the same options + empty prompt across two distinct questions collide.
 */
export function parseQuestionPrompt(visible: string): string {
  // Tail-only — older prompts higher in the buffer are stale.
  const tail = visible.length > 4096 ? visible.slice(-4096) : visible;
  const lines = tail.split('\n');

  // Find the latest line containing `❯<spaces>1.` (matching parseNumberedOptions —
  // unanchored to handle the box-layout case where cursor is mid-line after
  // divider + header + prompt text on the same logical line).
  let cursorLineIdx = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
    if (/❯\s*1\./.test(lines[i] ?? '')) {
      cursorLineIdx = i;
      break;
    }
  }
  if (cursorLineIdx < 0) return '';

  // Box-layout case: prompt text may be ON the cursor line, BEFORE `❯1.`.
  // Extract that prefix (after stripping leading box-drawing characters and
  // dividers) as the last piece of the prompt — appended after any prior
  // multi-line prompt text we walk up to find.
  const cursorLine = lines[cursorLineIdx] ?? '';
  let inlinePrompt = '';
  const cursorPos = cursorLine.search(/❯\s*1\./);
  if (cursorPos > 0) {
    inlinePrompt = cursorLine
      .slice(0, cursorPos)
      // Strip box-drawing chars + dividers + leading checkbox sigil.
      .replace(/^[─━┄┅┈┉─┌┐└┘├┤┬┴┼│┃☐□■\s]+/, '')
      .trim();
  }

  // Walk up at most 6 lines collecting prompt text. Stop at:
  //   - two consecutive blank lines once some prompt text has been
  //     collected (paragraph break)
  //   - top of buffer
  //   - a line that itself starts with `N.` (we're inside an option list)
  const promptLines: string[] = [];
  let blankRun = 0;
  for (let i = cursorLineIdx - 1; i >= 0 && promptLines.length < 6; i--) {
    const raw = lines[i] ?? '';
    const trimmed = raw.trim();
    if (trimmed === '') {
      blankRun += 1;
      if (blankRun >= 2 && promptLines.length > 0) break;
      continue;
    }
    blankRun = 0;
    // Stop if we hit what looks like a previous numbered list.
    if (/^[\s❯]*[1-9]\.\s+\S/.test(raw)) break;
    promptLines.unshift(trimmed);
  }

  const all = inlinePrompt.length > 0 ? [...promptLines, inlinePrompt] : promptLines;
  const joined = all.join(' ').replace(/\s+/g, ' ').trim();
  return joined.slice(0, 240);
}
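
The walk-up described above can be tried on a synthetic frame. The following is a minimal sketch, not the runner's code: `sketchParsePrompt` is a hypothetical, simplified restatement that skips the box-layout and double-blank-line handling, and the frame text is made up for illustration.

```typescript
// Hypothetical, simplified restatement of the prompt walk-up (illustration only).
function sketchParsePrompt(visible: string): string {
  const lines = visible.split('\n');

  // Locate the latest cursor line (`❯ 1.`), scanning from the bottom.
  let cursorIdx = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
    if (/❯\s*1\./.test(lines[i] ?? '')) {
      cursorIdx = i;
      break;
    }
  }
  if (cursorIdx < 0) return '';

  // Collect up to 6 non-blank lines above the cursor; stop at an older option list.
  const collected: string[] = [];
  for (let i = cursorIdx - 1; i >= 0 && collected.length < 6; i--) {
    const trimmed = (lines[i] ?? '').trim();
    if (trimmed === '') continue;
    if (/^[1-9]\.\s+\S/.test(trimmed)) break;
    collected.unshift(trimmed);
  }

  // Normalize: single spaces, trimmed, capped at 240 chars.
  return collected.join(' ').replace(/\s+/g, ' ').trim().slice(0, 240);
}

// Synthetic frame: a stale option list, then a two-line prompt, then the live list.
const frame = [
  '  1. Old option',
  '  2. Another old option',
  '',
  'Premise challenge:',
  '   do users   actually want this?',
  '',
  '❯ 1. Yes, validated',
  '  2. No, premise is wrong',
].join('\n');

const parsedPrompt = sketchParsePrompt(frame);
console.log(parsedPrompt);
```

The stale list above the prompt is what terminates the walk, so text from an earlier question does not leak into the snippet.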

/**
 * Stable hash for an AskUserQuestion's identity — combines normalized prompt
 * text with the options signature so two distinct questions with shared menu
 * labels (the generic A/B/C TODO-proposal menu, for instance) get different
 * fingerprints.
 *
 * Uses Bun's fast non-crypto hash since these strings are short and we only
 * need collision resistance against accidental TTY redraws, not adversaries.
 * Hex-encoded for diagnostic dumps.
 */
export function auqFingerprint(
  promptSnippet: string,
  opts: Array<{ index: number; label: string }>,
): string {
  const normalized = promptSnippet.replace(/\s+/g, ' ').trim();
  const sig = optionsSignature(opts);
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  return (Bun as any).hash(normalized + '||' + sig).toString(16);
}
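
To see why the prompt is mixed into the hash, here is a portable sketch of the same composition. Two things are assumptions, not the runner's code: `sketchOptionsSignature` stands in for `optionsSignature` (whose body is not shown in this diff), and a 32-bit FNV-1a hash replaces `Bun.hash` so the example runs outside Bun.

```typescript
type Option = { index: number; label: string };

// ASSUMPTION: stand-in for the runner's optionsSignature, which is not shown here.
function sketchOptionsSignature(opts: Option[]): string {
  return opts.map((o) => o.index + ':' + o.label).join('|');
}

// 32-bit FNV-1a, a portable substitute for Bun.hash (not what the runner uses).
// The shift-add form multiplies by the FNV prime 16777619 without float overflow.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash =
      (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0;
  }
  return hash.toString(16);
}

function sketchFingerprint(prompt: string, opts: Option[]): string {
  const normalized = prompt.replace(/\s+/g, ' ').trim();
  return fnv1a(normalized + '||' + sketchOptionsSignature(opts));
}

// The generic menu shared by many findings.
const menu: Option[] = [
  { index: 1, label: 'Add to plan' },
  { index: 2, label: 'Defer' },
  { index: 3, label: 'Build now' },
];

const redrawA = sketchFingerprint('Add retry   logic?', menu);
const redrawB = sketchFingerprint('  Add retry logic? ', menu); // whitespace-only redraw
const otherFinding = sketchFingerprint('Add rate limiting?', menu);

console.log(redrawA === redrawB, redrawA === otherFinding);
```

A whitespace-only redraw normalizes to the same input and therefore the same fingerprint, while a different question over the same menu hashes a different string.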

/**
 * Detects when a plan-* skill has reached its Completion Summary / Review
 * Report — a terminal signal complementary to plan-mode's "Ready to execute"
 * confirmation. Each plan-review skill writes one of these phrasings near
 * the end of its run; matching any one is enough to stop counting.
 *
 * Best-effort: this is a content marker, not a deterministic event. Hard
 * ceiling (`reviewCountCeiling` in `runPlanSkillCounting`) is the reliable
 * stop signal; this regex is the "we're done, go gracefully" hint.
 */
export const COMPLETION_SUMMARY_RE =
  /(GSTACK REVIEW REPORT|## Completion [Ss]ummary|Status:\s*(clean|issues_open)|^VERDICT:)/m;

/**
 * Result of asserting that a plan file ends with `## GSTACK REVIEW REPORT`
 * as its last `## ` heading. `ok` is true iff the report is present AND no
 * other `## ` heading appears after it. Diagnostic fields are populated only
 * on failure to keep the success path cheap.
 */
export interface ReviewReportAtBottomResult {
  ok: boolean;
  reason?: string;
  trailingHeadings?: string[];
}

/**
 * Assert that `## GSTACK REVIEW REPORT` is the last `## ` heading in a plan
 * file's content. Pure string operation — no filesystem access. Used by the
 * finding-count E2E tests as a second assertion on each test's produced plan.
 *
 * The plan-mode skill template mandates the agent move/append the review
 * report so it's always the last `##` section. A regression where the agent
 * appends additional sections after the report (or skips it entirely) ships
 * silently today; this assertion catches both.
 */
export function assertReviewReportAtBottom(
  content: string,
): ReviewReportAtBottomResult {
  const re = /^## GSTACK REVIEW REPORT\s*$/m;
  const match = re.exec(content);
  if (!match) {
    return { ok: false, reason: 'no GSTACK REVIEW REPORT section' };
  }
  const after = content.slice(match.index + match[0].length);
  // Match any `## ` heading after the report. Reject `## ` followed by
  // newline-only (trailing-whitespace ## headers) to avoid false positives.
  const trailingHeadings = Array.from(
    after.matchAll(/^## \S.*$/gm),
  ).map((m) => m[0]);
  if (trailingHeadings.length > 0) {
    return {
      ok: false,
      reason: 'trailing ## heading(s) after GSTACK REVIEW REPORT',
      trailingHeadings,
    };
  }
  return { ok: true };
}
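
The three outcomes of this check are easiest to see on small inputs. The sketch below is a condensed restatement for illustration (`sketchReportAtBottom` is a hypothetical name, and it uses `String.prototype.match` with the `g` flag in place of `matchAll`); the plan snippets are invented.

```typescript
interface SketchResult {
  ok: boolean;
  reason?: string;
  trailingHeadings?: string[];
}

// Condensed restatement of the check above, for illustration only.
function sketchReportAtBottom(content: string): SketchResult {
  const match = /^## GSTACK REVIEW REPORT\s*$/m.exec(content);
  if (!match) return { ok: false, reason: 'no GSTACK REVIEW REPORT section' };

  // Only level-2 headings with text count; `### ` subsections are ignored.
  const after = content.slice(match.index + match[0].length);
  const trailingHeadings: string[] = after.match(/^## \S.*$/gm) ?? [];
  if (trailingHeadings.length > 0) {
    return {
      ok: false,
      reason: 'trailing ## heading(s) after GSTACK REVIEW REPORT',
      trailingHeadings,
    };
  }
  return { ok: true };
}

// Report is the last `## ` section, with a `###` subsection inside it.
const good = sketchReportAtBottom(
  '# Plan\n\n## Scope\n\ntext\n\n## GSTACK REVIEW REPORT\n\n### Findings\n\nStatus: clean\n',
);
// Agent appended a section after the report.
const trailing = sketchReportAtBottom('## GSTACK REVIEW REPORT\n\nbody\n\n## Appendix\n');
// Agent skipped the report entirely.
const missing = sketchReportAtBottom('## Scope\n\ntext\n');

console.log(good, trailing, missing);
```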

/**
 * Per-skill Step-0 boundary predicates. Each fires `true` when the answered
 * AUQ's fingerprint matches the LAST question of that skill's Step 0 phase.
 *
 * - `ceoStep0Boundary`: matches the mode-pick AUQ (options match `MODE_RE`).
 * - `engStep0Boundary`: matches the cross-project-learnings or scope-reduction
 *   AUQ that closes plan-eng-review's preamble.
 * - `designStep0Boundary`: matches plan-design-review's first dimension /
 *   posture AUQ.
 * - `devexStep0Boundary`: matches plan-devex-review's persona-selection AUQ.
 *
 * Predicates live alongside the helper so the unit suite can exercise each
 * against synthetic fingerprints (positive AND negative cases). Skill test
 * files import them directly.
 */
export const ceoStep0Boundary: Step0BoundaryPredicate = (fp) =>
  // Mode-pick path (Step 0F): one of HOLD SCOPE / SCOPE EXPANSION / etc.
  fp.options.some((o) => MODE_RE.test(o.label)) ||
  // Skip-interview path: scope-selection AUQ has "Skip interview and plan
  // immediately" — picking it bypasses the rest of Step 0 and routes
  // directly to review-phase. Boundary fires on the scope AUQ itself.
  fp.options.some((o) => /skip\s+interview|plan\s+immediately/i.test(o.label));

export const engStep0Boundary: Step0BoundaryPredicate = (fp) =>
  /scope reduction recommendation|cross[\s-]?project learnings/i.test(
    fp.promptSnippet,
  );

export const designStep0Boundary: Step0BoundaryPredicate = (fp) =>
  /design system|design posture|design score|first dimension/i.test(
    fp.promptSnippet,
  );

export const devexStep0Boundary: Step0BoundaryPredicate = (fp) =>
  /developer persona|target persona|persona selection|TTHW target/i.test(
    fp.promptSnippet,
  );
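
A boundary predicate can be exercised against synthetic fingerprints, positive and negative. This is a sketch under stated assumptions: `SKETCH_MODE_RE` is a guess at what `MODE_RE` matches (its definition is not shown in this diff; the labels are taken from the test fixtures), and `makeFp` is a hypothetical helper.

```typescript
interface SketchFingerprint {
  signature: string;
  promptSnippet: string;
  options: Array<{ index: number; label: string }>;
  observedAtMs: number;
  preReview: boolean;
}
type SketchBoundary = (fp: SketchFingerprint) => boolean;

// ASSUMPTION: stand-in for MODE_RE, whose real definition is not shown here.
const SKETCH_MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/;

// Same shape as ceoStep0Boundary: mode-pick path OR skip-interview path.
const sketchCeoBoundary: SketchBoundary = (fp) =>
  fp.options.some((o) => SKETCH_MODE_RE.test(o.label)) ||
  fp.options.some((o) => /skip\s+interview|plan\s+immediately/i.test(o.label));

// Hypothetical helper: build a fingerprint from a prompt and option labels.
function makeFp(promptSnippet: string, labels: string[]): SketchFingerprint {
  return {
    signature: 'synthetic',
    promptSnippet,
    options: labels.map((label, i) => ({ index: i + 1, label })),
    observedAtMs: 0,
    preReview: true,
  };
}

const modePick = makeFp('Choose your scope mode', ['HOLD SCOPE', 'SCOPE EXPANSION']);
const premise = makeFp('Premise challenge', ['Yes, validated', 'No, premise is wrong']);
const scopeAuq = makeFp('What scope?', ['Branch diff', 'Skip interview and plan immediately']);

console.log(sketchCeoBoundary(modePick), sketchCeoBoundary(premise), sketchCeoBoundary(scopeAuq));
```

Because the predicate inspects only the answered fingerprint, it does not depend on what the agent renders afterwards.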

/**
 * Spawn `claude --permission-mode plan` in a real PTY and return a session
 * handle. Caller is responsible for `await session.close()` to release the
@@ -566,12 +948,21 @@ export async function runPlanSkillObservation(opts: {
  cwd?: string;
  /** Total budget for skill to reach a terminal outcome. Default 180000. */
  timeoutMs?: number;
  /**
   * Extra env merged into the spawned `claude` process. `launchClaudePty`
   * already supports this; exposing it here lets per-skill tests isolate
   * from local config that would mask the regression they're trying to
   * catch (e.g., `QUESTION_TUNING=true` causing AUTO_DECIDE to skip the
   * rendered AskUserQuestion list).
   */
  env?: Record<string, string>;
}): Promise<PlanSkillObservation> {
  const startedAt = Date.now();
  const session = await launchClaudePty({
    permissionMode: opts.inPlanMode === false ? null : 'plan',
    cwd: opts.cwd,
    timeoutMs: (opts.timeoutMs ?? 180_000) + 30_000,
    env: opts.env,
  });

  try {
@@ -602,40 +993,10 @@ export async function runPlanSkillObservation(opts: {
           elapsedMs: Date.now() - startedAt,
         };
       }
-      // Silent-write detection: any Write/Edit tool render that targets a
-      // path OUTSIDE ~/.claude/plans, ~/.gstack/, or the active worktree's
-      // .gstack/. Plan files and gbrain artifacts are sanctioned.
-      const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
-      let m: RegExpExecArray | null;
-      while ((m = writeRe.exec(visible)) !== null) {
-        const target = m[1] ?? '';
-        const sanctioned =
-          target.includes('.claude/plans') ||
-          target.includes('.gstack/') ||
-          target.includes('/.context/') ||
-          target.includes('CHANGELOG.md') ||
-          target.includes('TODOS.md');
-        if (!sanctioned && !isNumberedOptionListVisible(visible)) {
-          return {
-            outcome: 'silent_write',
-            summary: `Write/Edit to ${target} fired before any AskUserQuestion`,
-            evidence: visible.slice(-2000),
-            elapsedMs: Date.now() - startedAt,
-          };
-        }
-      }
-      if (isPlanReadyVisible(visible)) {
+      const classified = classifyVisible(visible);
+      if (classified) {
         return {
-          outcome: 'plan_ready',
-          summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation',
-          evidence: visible.slice(-2000),
-          elapsedMs: Date.now() - startedAt,
-        };
-      }
-      if (isNumberedOptionListVisible(visible)) {
-        return {
-          outcome: 'asked',
-          summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)',
+          ...classified,
           evidence: visible.slice(-2000),
           elapsedMs: Date.now() - startedAt,
         };
@@ -652,3 +1013,281 @@ export async function runPlanSkillObservation(opts: {
    await session.close();
  }
}

// ────────────────────────────────────────────────────────────────────────────
// runPlanSkillCounting — drives a plan-* skill end-to-end through Step 0 then
// counts distinct review-phase AskUserQuestion fingerprints. That count is
// what the per-finding-count tests assert on.
// ────────────────────────────────────────────────────────────────────────────

/**
 * Result of a `runPlanSkillCounting` run. Includes both the count summary
 * (`step0Count`, `reviewCount`) and the full fingerprint list for diagnostic
 * dumps when an assertion fails.
 */
export interface PlanSkillCountObservation {
  outcome:
    | 'plan_ready'
    | 'completion_summary'
    | 'ceiling_reached'
    | 'silent_write'
    | 'exited'
    | 'timeout';
  summary: string;
  /** Visible terminal text at terminal time (last 3KB). */
  evidence: string;
  /** Wall time (ms) until the outcome was decided. */
  elapsedMs: number;
  /** All distinct AskUserQuestions observed, in observation order. */
  fingerprints: AskUserQuestionFingerprint[];
  /** Count of fingerprints with `preReview === true`. */
  step0Count: number;
  /** Count of fingerprints with `preReview === false`. */
  reviewCount: number;
}

/**
 * Drive a plan-* skill in plan mode and count distinct review-phase
 * AskUserQuestions until a terminal signal fires.
 *
 * Flow:
 *   1. Boot PTY in plan mode (8s grace + auto-trust dialog).
 *   2. Send `slashCommand` alone. Sleep ~3s.
 *   3. Send `followUpPrompt` as a chat message — this is the plan content
 *      the skill reviews. Slash commands with trailing args are rejected by
 *      Claude Code unless the skill defines them, so the plan goes as a
 *      follow-up message (the proven pattern at
 *      skill-e2e-plan-design-with-ui.test.ts:57-71).
 *   4. Poll loop:
 *      - Skip permission dialogs (auto-grant with `defaultPick`).
 *      - On a new numbered-option list, parse prompt + options, build
 *        fingerprint via `auqFingerprint`. Empty-prompt parses are skipped
 *        and re-polled (avoids the empty-prompt collision documented in
 *        the `parseQuestionPrompt` contract).
 *      - First time we see a fingerprint: push it, classify as Step 0 or
 *        review-phase based on `boundaryFired`, press `defaultPick` to
 *        advance.
 *      - After pressing, evaluate `isLastStep0AUQ(fingerprint)`. If true,
 *        all subsequent AUQs are review-phase.
 *      - Hard ceiling: if `reviewCount >= reviewCountCeiling`, return
 *        `ceiling_reached`. This bounds runaway counts; tests should set
 *        the ceiling above their assertion CEILING.
 *      - Soft terminals: `COMPLETION_SUMMARY_RE` match → `completion_summary`;
 *        plan-ready confirmation → `plan_ready`; silent write outside
 *        sanctioned dirs → `silent_write`; process exited → `exited`;
 *        wall clock exceeded → `timeout`.
 *
 * Boundary detection (D14): event-based, fired against the answered AUQ's
 * fingerprint, not against later rendered content. This avoids the race
 * where Step-0-final and Section-1-first AUQs straddle a section header
 * regex match.
 *
 * Fingerprint composition (D9): `auqFingerprint(prompt, options)` mixes
 * normalized prompt text with the options signature so distinct findings
 * with shared menu structure (the generic A/B/C TODO menu) get distinct
 * fingerprints.
 */
export async function runPlanSkillCounting(opts: {
  /** Skill name, e.g. 'plan-ceo-review'. Used for diagnostic strings only. */
  skillName: string;
  /** Slash command to send alone, e.g. '/plan-ceo-review'. No trailing args. */
  slashCommand: string;
  /** Plan content sent as a follow-up message ~3s after the slash command. */
  followUpPrompt: string;
  /** Per-skill predicate: which answered AUQ is the last Step-0 question. */
  isLastStep0AUQ: Step0BoundaryPredicate;
  /** Hard cap on review-phase count; helper returns when reached. Should be
   * set ABOVE the test's assertion ceiling so the test sees the cap as a
   * failure rather than a silent stop. */
  reviewCountCeiling: number;
  /** Numbered option to press by default. Defaults to 1 (recommended). */
  defaultPick?: number;
  /**
   * Optional override for the FIRST AUQ observed. Receives the fingerprint;
   * returns the option index to press. Subsequent AUQs always use defaultPick.
   *
   * Skill-specific routing helper: /plan-ceo-review's first AUQ asks "what
   * scope?" with options like "branch diff" / "describe inline" / "skip
   * interview". Pressing the default 1 routes to "branch diff" (the wrong
   * review target for a seeded fixture). firstAUQPick lets the test pick
   * "Skip interview" or "describe inline" so the agent reviews the
   * follow-up plan content the test sent, not the git diff.
   */
  firstAUQPick?: (fp: AskUserQuestionFingerprint) => number;
  /** Working directory. Default process.cwd() (repo cwd holds skill registry). */
  cwd?: string;
  /** Total budget for skill to reach a terminal outcome. Default 1_500_000 (25 min). */
  timeoutMs?: number;
  /** Extra env merged into the spawned `claude` process. */
  env?: Record<string, string>;
}): Promise<PlanSkillCountObservation> {
  const startedAt = Date.now();
  const defaultPick = opts.defaultPick ?? 1;
  const timeoutMs = opts.timeoutMs ?? 1_500_000;

  const session = await launchClaudePty({
    permissionMode: 'plan',
    cwd: opts.cwd,
    timeoutMs: timeoutMs + 60_000,
    env: opts.env,
  });

  const fingerprints: AskUserQuestionFingerprint[] = [];
  const seen = new Set<string>();
  let boundaryFired = false;
  let step0Count = 0;
  let reviewCount = 0;
  let isFirstAUQ = true;
  let lastSig = '';

  function snapshot(
    outcome: PlanSkillCountObservation['outcome'],
    summary: string,
    visible: string,
  ): PlanSkillCountObservation {
    return {
      outcome,
      summary,
      evidence: visible.slice(-3000),
      elapsedMs: Date.now() - startedAt,
      fingerprints,
      step0Count,
      reviewCount,
    };
  }

  try {
    await Bun.sleep(8000); // boot grace + auto-trust handler window
    const since = session.mark();
    session.send(`${opts.slashCommand}\r`);
    await Bun.sleep(3000);
    session.send(`${opts.followUpPrompt}\r`);

    const budgetStart = Date.now();
    while (Date.now() - budgetStart < timeoutMs) {
      await Bun.sleep(2000);
      const visible = session.visibleSince(since);

      // Process exited?
      if (session.exited()) {
        return snapshot(
          'exited',
          `claude exited (code=${session.exitCode()}) during counting (step0=${step0Count}, review=${reviewCount})`,
          visible,
        );
      }
      if (visible.includes('Unknown command:')) {
        return snapshot(
          'exited',
          `claude rejected ${opts.slashCommand} as unknown command (skill not registered in this cwd)`,
          visible,
        );
      }

      // Silent write detection — only fires if no numbered prompt is on
      // screen (otherwise the write is gated by a permission/AUQ).
      const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
      let m: RegExpExecArray | null;
      while ((m = writeRe.exec(visible)) !== null) {
        const target = m[1] ?? '';
        const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) =>
          target.includes(s),
        );
        if (!sanctioned && !isNumberedOptionListVisible(visible)) {
          return snapshot(
            'silent_write',
            `Write/Edit to ${target} fired before any AskUserQuestion`,
            visible,
          );
        }
      }

      // Soft terminal signals — check before AUQ processing so a final
      // completion-summary doesn't get misclassified as a bonus AUQ.
      if (COMPLETION_SUMMARY_RE.test(visible)) {
        return snapshot(
          'completion_summary',
          `skill emitted completion summary / verdict / status line (step0=${step0Count}, review=${reviewCount})`,
          visible,
        );
      }
      if (isPlanReadyVisible(visible)) {
        return snapshot(
          'plan_ready',
          `skill emitted plan-mode "Ready to execute" confirmation (step0=${step0Count}, review=${reviewCount})`,
          visible,
        );
      }

      // Numbered option list?
      if (!isNumberedOptionListVisible(visible)) continue;

      // Permission dialog? Auto-grant with defaultPick. Only act on the
      // recent tail to avoid re-triggering on stale dialogs in scrollback.
      if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
        session.send(`${defaultPick}\r`);
        await Bun.sleep(1500);
        continue;
      }

      // Parse the active AUQ. Skip same-redraw and empty-prompt cases.
      const options = parseNumberedOptions(visible);
      if (options.length < 2) continue;
      const sig = optionsSignature(options);
      if (sig === lastSig) continue;
      const promptSnippet = parseQuestionPrompt(visible);
      if (promptSnippet === '') continue; // not yet rendered, poll again
      lastSig = sig;

      const fingerprintHash = auqFingerprint(promptSnippet, options);
      if (seen.has(fingerprintHash)) {
        // Same content, already counted (TTY redrew with whitespace diff).
        continue;
      }
      seen.add(fingerprintHash);

      const fp: AskUserQuestionFingerprint = {
        signature: fingerprintHash,
        promptSnippet,
        options,
        observedAtMs: Date.now() - startedAt,
        preReview: !boundaryFired,
      };
      fingerprints.push(fp);
      if (boundaryFired) reviewCount += 1;
      else step0Count += 1;

      // Press to advance — first AUQ may use the override pick.
      const pickIdx =
        isFirstAUQ && opts.firstAUQPick ? opts.firstAUQPick(fp) : defaultPick;
      isFirstAUQ = false;
      session.send(`${pickIdx}\r`);

      // Evaluate boundary AFTER pressing — if THIS AUQ was the last Step 0
      // question, all subsequent AUQs go to reviewCount.
      if (!boundaryFired && opts.isLastStep0AUQ(fp)) {
        boundaryFired = true;
      }

      // Hard ceiling — runaway protection.
      if (reviewCount >= opts.reviewCountCeiling) {
        return snapshot(
          'ceiling_reached',
          `review-phase AUQ count reached ceiling (${opts.reviewCountCeiling})`,
          session.visibleSince(since),
        );
      }

      // Give the agent a beat to advance to the next state.
      await Bun.sleep(2000);
    }

    return snapshot(
      'timeout',
      `no terminal outcome within ${timeoutMs}ms (step0=${step0Count}, review=${reviewCount})`,
      session.visibleSince(since),
    );
  } finally {
    await session.close();
  }
}
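
The counting rules in the poll loop (dedupe by signature, classify by `boundaryFired`, evaluate the boundary after counting, stop at the ceiling) can be isolated from the PTY. The following is a pure sketch under stated assumptions: `sketchCount`, `SketchAuq`, and the sample data are hypothetical, and the live loop's timing, permission-dialog handling, and `lastSig` guard are left out.

```typescript
interface SketchAuq {
  signature: string;
  promptSnippet: string;
}
interface SketchCounts {
  step0Count: number;
  reviewCount: number;
  ceilingReached: boolean;
}

// Pure restatement of the counting rules only (no PTY, no sleeps).
function sketchCount(
  observed: SketchAuq[],
  isLastStep0: (auq: SketchAuq) => boolean,
  reviewCountCeiling: number,
): SketchCounts {
  const seen: Record<string, boolean> = {};
  let boundaryFired = false;
  let step0Count = 0;
  let reviewCount = 0;

  for (const auq of observed) {
    if (seen[auq.signature]) continue; // redraw of an already-counted AUQ
    seen[auq.signature] = true;

    if (boundaryFired) reviewCount += 1;
    else step0Count += 1;

    // Boundary is evaluated AFTER counting, so the boundary AUQ itself is Step 0.
    if (!boundaryFired && isLastStep0(auq)) boundaryFired = true;

    if (reviewCount >= reviewCountCeiling) {
      return { step0Count, reviewCount, ceilingReached: true };
    }
  }
  return { step0Count, reviewCount, ceilingReached: false };
}

const observed: SketchAuq[] = [
  { signature: 'a1', promptSnippet: 'Premise challenge' },
  { signature: 'a2', promptSnippet: 'Choose your scope mode' },
  { signature: 'a2', promptSnippet: 'Choose your scope mode' }, // redraw
  { signature: 'f1', promptSnippet: 'Finding 1' },
  { signature: 'f2', promptSnippet: 'Finding 2' },
  { signature: 'f1', promptSnippet: 'Finding 1' }, // stale redraw
  { signature: 'f3', promptSnippet: 'Finding 3' },
];
const isModePick = (auq: SketchAuq): boolean => /scope mode/i.test(auq.promptSnippet);

const full = sketchCount(observed, isModePick, 10);
const capped = sketchCount(observed, isModePick, 2);
console.log(full, capped);
```

With a ceiling of 10 the run counts two Step-0 questions and three findings; with a ceiling of 2 it stops at the second finding, which is why tests are told to set the ceiling above the count they assert on.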

@@ -0,0 +1,749 @@
/**
 * Deterministic unit tests for claude-pty-runner.ts behavior changes.
 *
 * Free-tier (no EVALS=1 needed). Runs in <1s on every `bun test`. Catches
 * harness plumbing bugs before stochastic PTY runs surface them.
 *
 * Two surface areas tested:
 *
 * 1. Permission-dialog short-circuit in 'asked' classification: a TTY frame
 *    that matches BOTH isPermissionDialogVisible AND isNumberedOptionListVisible
 *    must NOT be classified as a skill question — permission dialogs render
 *    as numbered lists too, but they're not what we're guarding.
 *
 * 2. Env passthrough surface: runPlanSkillObservation accepts an `env`
 *    option and threads it to launchClaudePty. We can't fully exercise the
 *    spawn pipeline without paying for a PTY session, but we CAN verify the
 *    option exists in the type signature and that calling without env still
 *    works (no regression).
 *
 * The PTY test (skill-e2e-plan-ceo-plan-mode.test.ts) is the integration
 * check; this file is the cheap deterministic guard for the harness primitives
 * those tests stand on.
 */

import { describe, test, expect } from 'bun:test';
import {
  isPermissionDialogVisible,
  isNumberedOptionListVisible,
  isPlanReadyVisible,
  parseNumberedOptions,
  classifyVisible,
  TAIL_SCAN_BYTES,
  optionsSignature,
  parseQuestionPrompt,
  auqFingerprint,
  COMPLETION_SUMMARY_RE,
  assertReviewReportAtBottom,
  ceoStep0Boundary,
  engStep0Boundary,
  designStep0Boundary,
  devexStep0Boundary,
  type ClaudePtyOptions,
  type AskUserQuestionFingerprint,
} from './claude-pty-runner';

describe('isPermissionDialogVisible', () => {
  test('matches "Bash command requires permission" prompts', () => {
    const sample = `
Some preamble output

Bash command \`gstack-config get telemetry\` requires permission to run.

❯ 1. Yes
2. Yes, and always allow
3. No, abort
`;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('matches "allow all edits" file-edit prompts', () => {
    // Isolated to the "allow all edits" clause only — no overlapping
    // "Do you want to proceed?" co-trigger, so this asserts the clause works.
    const sample = `
Edit to ~/.gstack/config.yaml

❯ 1. Yes
2. Yes, allow all edits during this session
3. No
`;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('matches the "Do you want to proceed?" file-edit confirmation by itself', () => {
    // Separate fixture so weakening this clause is detected by a dedicated test.
    const sample = `
Edit to ~/.gstack/config.yaml

Do you want to proceed?

❯ 1. Yes
2. No
`;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('matches workspace-trust "always allow access to" prompt', () => {
    const sample = `
Do you trust the files in this folder?

❯ 1. Yes, proceed
2. Yes, and always allow access to /Users/me/repo
3. No, exit
`;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('does NOT match a skill AskUserQuestion list', () => {
    const sample = `
D1 — Premise challenge: do users actually want this?

❯ 1. Yes, validated
2. No, premise is wrong
3. Need more info
`;
    expect(isPermissionDialogVisible(sample)).toBe(false);
  });

  test('does NOT match a plan-ready confirmation', () => {
    const sample = `
Ready to execute the plan?

❯ 1. Yes
2. No, keep planning
`;
    expect(isPermissionDialogVisible(sample)).toBe(false);
  });

  test('does NOT match a skill question that contains the bare phrase "Do you want to proceed?"', () => {
    // Co-trigger requirement: "Do you want to proceed?" alone is not enough.
    // It must appear with "Edit to <path>" or "Write to <path>" to count as
    // a permission dialog. This guards against a skill question like
    // "Do you want to proceed with HOLD SCOPE?" being mis-classified.
    const sample = `
Choose your scope mode for this review.
Do you want to proceed?

❯ 1. HOLD SCOPE
2. SCOPE EXPANSION
3. SELECTIVE EXPANSION
`;
    expect(isPermissionDialogVisible(sample)).toBe(false);
  });

  test('does NOT mis-match when adversarial prose includes "Edit to <path>" alongside the bare proceed phrase', () => {
    // Adversarial fixture: a skill question whose body legitimately mentions
    // "Edit to <path>" in prose AND ends with "Do you want to proceed?".
    // KNOWN LIMITATION: the current co-trigger regex mis-classifies this as
    // a permission dialog. The ideal result is `false`, which needs a
    // tighter regex (e.g., a proximity constraint, or anchoring "Edit to"
    // to a line start). Until then, this test pins the current behavior
    // (`true`) so a future fix can flip the assertion intentionally.
    const sample = `
Plan: I will Edit to ./plan.md to capture the decision.
Do you want to proceed?

❯ 1. HOLD SCOPE
2. SCOPE EXPANSION
`;
    // KNOWN LIMITATION: the co-trigger fires here. Documented as a
    // post-merge follow-up. Flip this assertion once the regex tightens.
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });
});

describe('isNumberedOptionListVisible', () => {
  test('matches a basic ❯ 1. + 2. cursor list', () => {
    const sample = `
❯ 1. Option one
2. Option two
3. Option three
`;
    expect(isNumberedOptionListVisible(sample)).toBe(true);
  });

  test('returns false on a single-option prompt', () => {
    const sample = `
❯ 1. Only option
`;
    expect(isNumberedOptionListVisible(sample)).toBe(false);
  });

  test('returns false when no cursor renders', () => {
    const sample = `
Just some prose with 1. a numbered point and 2. another.
`;
    expect(isNumberedOptionListVisible(sample)).toBe(false);
  });

  test('overlaps permission dialogs (this is why D5 short-circuits)', () => {
    // The whole point of D5: this string matches BOTH classifiers, so the
    // runner must consult isPermissionDialogVisible to disambiguate.
    const sample = `
Bash command \`do-thing\` requires permission to run.

❯ 1. Yes
2. No
`;
    expect(isNumberedOptionListVisible(sample)).toBe(true);
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });
});
|
||||
describe('classifyVisible (runtime path through the runner classifier)', () => {
  // These tests call the actual classifier so a future contributor who
  // reorders branches (e.g. moves the permission short-circuit before
  // isPlanReadyVisible) is caught deterministically.

  test('skill question → returns asked', () => {
    const visible = `
      D1 — Choose your scope mode

      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
        3. SELECTIVE EXPANSION
        4. SCOPE REDUCTION
    `;
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('asked');
  });

  test('permission dialog (Bash) → returns null (skip, keep polling)', () => {
    const visible = `
      Bash command \`gstack-update-check\` requires permission to run.

      ❯ 1. Yes
        2. No
    `;
    expect(isNumberedOptionListVisible(visible)).toBe(true); // pre-filter
    expect(classifyVisible(visible)).toBeNull(); // post-filter
  });

  test('plan-ready confirmation → returns plan_ready (wins over asked)', () => {
    const visible = `
      Ready to execute the plan?

      ❯ 1. Yes, proceed
        2. No, keep planning
    `;
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('plan_ready');
  });

  test('silent write to unsanctioned path → returns silent_write', () => {
    const visible = `
      ⏺ Write(src/app/dangerous-write.ts)
        ⎿ Wrote 42 lines
    `;
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('silent_write');
    expect(result?.summary).toContain('src/app/dangerous-write.ts');
  });

  test('write to sanctioned path (.claude/plans) → returns null (allowed)', () => {
    const visible = `
      ⏺ Write(/Users/me/.claude/plans/some-plan.md)
        ⎿ Wrote 42 lines
    `;
    expect(classifyVisible(visible)).toBeNull();
  });

  test('write while a permission dialog is on screen → returns null (gated, not silent, not asked)', () => {
    const visible = `
      ⏺ Write(src/app/edit-with-permission.ts)

      Edit to src/app/edit-with-permission.ts

      Do you want to proceed?

      ❯ 1. Yes
        2. No
    `;
    // The numbered prompt is a permission dialog (Edit to + Do you want to proceed?);
    // silent_write is suppressed because a numbered prompt is visible, AND
    // 'asked' is suppressed because the prompt is a permission dialog.
    expect(classifyVisible(visible)).toBeNull();
  });

  test('write while a real skill question is on screen → returns asked (write is captured but not silent)', () => {
    const visible = `
      ⏺ Write(src/app/foo.ts)

      D1 — Choose your scope mode

      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
    `;
    // The numbered prompt is a skill question, not a permission dialog;
    // silent_write is suppressed (numbered prompt is visible) and the
    // outcome is 'asked' — Step 0 fired.
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('asked');
  });

  test('idle / no signals → returns null', () => {
    const visible = `
      Some prose without any classifier signals.
    `;
    expect(classifyVisible(visible)).toBeNull();
  });

  test('TAIL_SCAN_BYTES is exported as 1500', () => {
    // Shared between runner and routing test; a regression that desyncs the
    // recent-tail window would surface here.
    expect(TAIL_SCAN_BYTES).toBe(1500);
  });
});
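
The branch order these tests pin down (silent_write, then plan_ready, then asked, with permission dialogs filtered out of asked) can be sketched as follows. This is a minimal illustration under stated assumptions, not the runner's implementation: `sketchClassify` and its three predicates are hypothetical names, and the regexes are simplified stand-ins chosen only to match the sample strings used in the tests above.

```typescript
// Hypothetical sketch of the classifier's branch order. The predicates are
// simplified stand-ins, not the real regexes in claude-pty-runner.ts.
type Outcome = 'silent_write' | 'plan_ready' | 'asked';

interface Classification {
  outcome: Outcome;
  summary: string;
}

// A numbered prompt: cursor on option 1 plus at least an option 2 line.
const hasNumberedList = (s: string): boolean =>
  /❯\s*1\./.test(s) && /^\s*2\./m.test(s);

// Assumed permission signals: the Bash phrasing, or the bare question
// together with a file-edit context line (the co-trigger).
const isPermissionDialog = (s: string): boolean =>
  /requires permission/i.test(s) ||
  (/Do you want to proceed\?/.test(s) && /Edit to \S+/.test(s));

const isPlanReady = (s: string): boolean => /Ready to execute/i.test(s);

function sketchClassify(visible: string): Classification | null {
  const promptVisible = hasNumberedList(visible);

  // 1. silent_write: a Write outside .claude/plans with no prompt on screen.
  const write = /Write\(([^)]+)\)/.exec(visible);
  if (write && !promptVisible) {
    const path = write[1] ?? '';
    if (!path.includes('.claude/plans')) {
      return { outcome: 'silent_write', summary: `wrote ${path}` };
    }
  }

  // 2. plan_ready is checked before asked, so it wins when both match.
  if (isPlanReady(visible)) {
    return { outcome: 'plan_ready', summary: 'plan confirmation visible' };
  }

  // 3. asked: a numbered prompt that is not a permission dialog.
  if (promptVisible && !isPermissionDialog(visible)) {
    return { outcome: 'asked', summary: 'skill question visible' };
  }

  return null; // nothing conclusive yet; the caller keeps polling
}
```

Returning `null` for a permission dialog, rather than a fourth outcome, keeps the polling loop simple: the caller treats it the same as "nothing on screen yet" and polls again.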

describe('parseNumberedOptions', () => {
  test('extracts options from a clean cursor list', () => {
    const visible = `
      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
    `;
    const opts = parseNumberedOptions(visible);
    expect(opts).toHaveLength(2);
    expect(opts[0]).toEqual({ index: 1, label: 'HOLD SCOPE' });
    expect(opts[1]).toEqual({ index: 2, label: 'SCOPE EXPANSION' });
  });

  test('returns empty array on prose-with-numbers (no cursor)', () => {
    expect(parseNumberedOptions('text 1. one 2. two')).toEqual([]);
  });

  test('extracts options when the cursor is INLINE with prompt header (box-layout)', () => {
    // Real /plan-ceo-review rendering: the TTY's cursor-positioning escapes
    // collapse divider + header + prompt + cursor onto one logical line.
    // Subsequent options (2..7) still start their own lines.
    const visible = [
      '────────────────────────────────────────',
      '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main',
      ' Review the full branch: ~10K LOC.',
      '2. A specific plan file or design doc',
      ' You point me at a file (path) and I review that.',
      '3. An idea you\'ll describe inline',
      '4. Cancel — wrong skill',
      '5. Type something.',
      '────────────────────────────────────────',
      '6. Chat about this',
      '7. Skip interview and plan immediately',
    ].join('\n');
    const opts = parseNumberedOptions(visible);
    expect(opts).toHaveLength(7);
    expect(opts[0]).toEqual({ index: 1, label: "The branch's diff vs main" });
    expect(opts[1]?.index).toBe(2);
    expect(opts[6]?.index).toBe(7);
    expect(opts[6]?.label).toBe('Skip interview and plan immediately');
  });

  test('inline-cursor and start-of-line cursor both produce 7 options for the box-layout case', () => {
    // The inline path captures option 1 from the cursor line itself; the
    // subsequent-lines path captures 2..7 with the existing optionRe.
    const inlineLayout = [
      'header text ❯ 1. first option',
      '2. second',
      '3. third',
    ].join('\n');
    expect(parseNumberedOptions(inlineLayout)).toEqual([
      { index: 1, label: 'first option' },
      { index: 2, label: 'second' },
      { index: 3, label: 'third' },
    ]);

    const cleanLayout = [
      ' ❯ 1. first option',
      ' 2. second',
      ' 3. third',
    ].join('\n');
    expect(parseNumberedOptions(cleanLayout)).toEqual([
      { index: 1, label: 'first option' },
      { index: 2, label: 'second' },
      { index: 3, label: 'third' },
    ]);
  });
});
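
The two-path parse described in the comments above (option 1 taken from the cursor line itself, options 2..N taken from the following lines) might look roughly like this. It is a sketch only: `sketchParseOptions` is a hypothetical name, it assumes the cursor always sits on option 1, and its regexes are illustrative, not the runner's real `optionRe`.

```typescript
// Hypothetical sketch: parse a numbered option list whose cursor may be
// inline with the prompt header. Assumes the cursor is on option 1.
interface NumberedOption {
  index: number;
  label: string;
}

function sketchParseOptions(visible: string): NumberedOption[] {
  const lines = visible.split('\n');

  // Scan from the bottom so an older, already-answered list is ignored.
  let cursorLine = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
    if (/❯\s*1\.\s+\S/.test(lines[i] ?? '')) {
      cursorLine = i;
      break;
    }
  }
  if (cursorLine < 0) return []; // prose with numbers but no cursor

  const opts: NumberedOption[] = [];

  // Inline path: everything after "❯ 1." on the cursor line is the label.
  const first = /❯\s*1\.\s+(.*\S)\s*$/.exec(lines[cursorLine] ?? '');
  if (first) opts.push({ index: 1, label: first[1] ?? '' });

  // Subsequent-lines path: options 2..N each start their own line.
  const optionRe = /^\s*(\d+)\.\s+(.*\S)\s*$/;
  for (let i = cursorLine + 1; i < lines.length; i++) {
    const m = optionRe.exec(lines[i] ?? '');
    if (!m) continue; // description lines and dividers are skipped
    const index = Number(m[1]);
    // Only accept the next expected number, so stray digits do not count.
    if (index === opts.length + 1) opts.push({ index, label: m[2] ?? '' });
  }
  return opts;
}
```

Requiring each index to be exactly one more than the last accepted option is one simple way to keep numbered prose inside a description line from being mistaken for a menu entry.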

describe('runPlanSkillObservation env passthrough surface', () => {
  test('ClaudePtyOptions exposes env: Record<string, string>', () => {
    // Type-level guard: this file would fail to compile if the env field
    // were removed or its shape regressed. The actual env merge happens in
    // launchClaudePty's spawn call (`env: { ...process.env, ...opts.env }`),
    // so a regression where `env: opts.env` gets dropped from the
    // runPlanSkillObservation -> launchClaudePty handoff is only caught by
    // the live PTY test, not here.
    const opts: ClaudePtyOptions = {
      env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
    };
    expect(opts.env).toEqual({ QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' });
  });
});

// ────────────────────────────────────────────────────────────────────────────
// Per-finding count primitives — Section 3 unit tests #1–#5, #7, #12.
// ────────────────────────────────────────────────────────────────────────────

describe('optionsSignature', () => {
  test('returns a "|"-joined `index:label` string for a clean list', () => {
    const sig = optionsSignature([
      { index: 1, label: 'HOLD SCOPE' },
      { index: 2, label: 'SCOPE EXPANSION' },
    ]);
    expect(sig).toBe('1:HOLD SCOPE|2:SCOPE EXPANSION');
  });

  test('order-independent: shuffled inputs produce the same signature', () => {
    // parseNumberedOptions already returns sorted, but defensive sort means
    // a future caller that hands us shuffled input still produces a stable
    // dedupe signature.
    const a = optionsSignature([
      { index: 2, label: 'B' },
      { index: 1, label: 'A' },
      { index: 3, label: 'C' },
    ]);
    const b = optionsSignature([
      { index: 1, label: 'A' },
      { index: 2, label: 'B' },
      { index: 3, label: 'C' },
    ]);
    expect(a).toBe(b);
  });

  test('empty list returns empty string', () => {
    expect(optionsSignature([])).toBe('');
  });

  test('single-item list returns just that entry', () => {
    expect(optionsSignature([{ index: 1, label: 'Only' }])).toBe('1:Only');
  });
});

describe('parseQuestionPrompt', () => {
  test('captures 1-line prompt above the cursor', () => {
    const visible = `
      D1 — Pick a mode

      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
    `;
    const prompt = parseQuestionPrompt(visible);
    expect(prompt).toBe('D1 — Pick a mode');
  });

  test('captures multi-line prompt above the cursor', () => {
    const visible = `
      D2 — Approach selection

      Which architecture should we follow?

      ❯ 1. Bypass existing helper
        2. Reuse existing helper
    `;
    const prompt = parseQuestionPrompt(visible);
    // Multi-line prompts get joined with single spaces.
    expect(prompt).toContain('D2 — Approach selection');
    expect(prompt).toContain('Which architecture should we follow?');
  });

  test('returns "" when no cursor is rendered', () => {
    expect(parseQuestionPrompt('Just some prose.\nNo cursor.')).toBe('');
  });

  test('truncates to 240 chars', () => {
    const longPrompt = 'A'.repeat(500);
    const visible = `${longPrompt}\n\n ❯ 1. yes\n 2. no`;
    expect(parseQuestionPrompt(visible).length).toBeLessThanOrEqual(240);
  });

  test('does not pull text from a previous numbered list above', () => {
    const visible = `
      ❯ 1. previous answered question
        2. previous option two

      D2 — A new question text

      ❯ 1. fresh option A
        2. fresh option B
    `;
    const prompt = parseQuestionPrompt(visible);
    // Stops at the previous numbered-list line; should NOT contain "previous answered question".
    expect(prompt).toContain('D2 — A new question text');
    expect(prompt).not.toContain('previous answered question');
  });

  test('normalizes whitespace (collapses runs of spaces and tabs)', () => {
    // The input needs runs of whitespace for this test to exercise anything;
    // the exact spacing here is illustrative.
    const visible = `D1   —   Spaced    out

  ❯ 1. yes
    2. no`;
    expect(parseQuestionPrompt(visible)).toBe('D1 — Spaced out');
  });

  test('inline-cursor box-layout: extracts prompt text BEFORE ❯1. on the cursor line', () => {
    // Real /plan-ceo-review rendering: divider + ☐ header + prompt text +
    // cursor are all on one logical line because TTY cursor-positioning
    // escapes collapse the box layout under stripAnsi.
    const visible = [
      '──────────────────',
      '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main',
      '2. A specific plan file',
      '3. An idea inline',
    ].join('\n');
    const prompt = parseQuestionPrompt(visible);
    // Should extract "Review scope" and the prompt text, dropping the ☐ box-drawing sigil.
    expect(prompt).toContain('Review scope');
    expect(prompt).toContain('What scope do you want me to CEO-review?');
    expect(prompt).not.toContain('❯');
    expect(prompt).not.toMatch(/^☐/);
  });
});

describe('auqFingerprint', () => {
  test('returns the same fingerprint for identical inputs', () => {
    const opts = [
      { index: 1, label: 'A' },
      { index: 2, label: 'B' },
    ];
    expect(auqFingerprint('hello', opts)).toBe(auqFingerprint('hello', opts));
  });

  test('different prompts with shared option labels produce DIFFERENT fingerprints', () => {
    // The collision regression Codex F1 caught: option-label-only fingerprints
    // collapsed multiple distinct findings into one when they shared menu shape.
    const sharedOpts = [
      { index: 1, label: 'Add to plan' },
      { index: 2, label: 'Defer' },
      { index: 3, label: 'Build now' },
    ];
    const fpFinding1 = auqFingerprint('D5 — Architecture: bypass helper?', sharedOpts);
    const fpFinding2 = auqFingerprint('D6 — Tests: zero coverage?', sharedOpts);
    expect(fpFinding1).not.toBe(fpFinding2);
  });

  test('same prompt with different options produces DIFFERENT fingerprints', () => {
    const prompt = 'D1 — Pick a mode';
    const fpA = auqFingerprint(prompt, [
      { index: 1, label: 'HOLD SCOPE' },
      { index: 2, label: 'SCOPE EXPANSION' },
    ]);
    const fpB = auqFingerprint(prompt, [
      { index: 1, label: 'HOLD SCOPE' },
      { index: 2, label: 'SCOPE REDUCTION' },
    ]);
    expect(fpA).not.toBe(fpB);
  });

  test('whitespace-only differences in prompt do NOT change the fingerprint', () => {
    // Same content, different rendering whitespace (TTY redraw artifact)
    // must produce the same fingerprint so dedupe survives reflow.
    // The second prompt's exact spacing is illustrative.
    const opts = [{ index: 1, label: 'A' }, { index: 2, label: 'B' }];
    const fpA = auqFingerprint('Pick a mode', opts);
    const fpB = auqFingerprint('Pick  a   mode', opts);
    expect(fpA).toBe(fpB);
  });

  test('empty prompt + same options collide (caller must guard against this)', () => {
    // Documents the contract: empty-prompt fingerprints WILL collide if the
    // caller fingerprints them. runPlanSkillCounting must skip empty-prompt
    // AUQs and re-poll instead.
    const opts = [{ index: 1, label: 'A' }];
    expect(auqFingerprint('', opts)).toBe(auqFingerprint('', opts));
  });
});
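
The contract these tests describe (prompt and options both contribute, whitespace is normalized, option order does not matter) can be illustrated with a small sketch. `sketchFingerprint` is a hypothetical name; the real `auqFingerprint` may hash its input or use a different separator, which this example does not attempt to reproduce.

```typescript
// Hypothetical sketch of a prompt-plus-options dedupe fingerprint.
// The real auqFingerprint may hash the result; this version returns the
// normalized string directly so the behavior is easy to inspect.
interface NumberedOption {
  index: number;
  label: string;
}

// Collapse every run of whitespace to one space and trim the ends, so a
// TTY reflow of the same prompt yields the same key.
const normalizeWs = (s: string): string => s.replace(/\s+/g, ' ').trim();

// Order-independent "index:label" signature, joined with "|".
function sketchSignature(options: NumberedOption[]): string {
  return options
    .slice() // copy first: sort() mutates in place
    .sort((a, b) => a.index - b.index)
    .map((o) => `${o.index}:${o.label}`)
    .join('|');
}

function sketchFingerprint(prompt: string, options: NumberedOption[]): string {
  // A newline separator cannot appear in the normalized prompt, so the
  // prompt part and the options part cannot bleed into each other.
  return `${normalizeWs(prompt)}\n${sketchSignature(options)}`;
}
```

Including the prompt is what separates two findings that share a menu shape such as "Add to plan / Defer / Build now". The empty-prompt collision remains, which is why the caller has to skip empty-prompt questions and poll again.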

describe('COMPLETION_SUMMARY_RE', () => {
  test('matches GSTACK REVIEW REPORT heading', () => {
    expect(COMPLETION_SUMMARY_RE.test('## GSTACK REVIEW REPORT')).toBe(true);
  });

  test('matches Completion Summary heading (ceo + eng)', () => {
    expect(COMPLETION_SUMMARY_RE.test('## Completion Summary')).toBe(true);
    expect(COMPLETION_SUMMARY_RE.test('## Completion summary')).toBe(true);
  });

  test('matches Status: clean (CEO review-log shape)', () => {
    expect(COMPLETION_SUMMARY_RE.test('Status: clean')).toBe(true);
    expect(COMPLETION_SUMMARY_RE.test('Status: issues_open')).toBe(true);
  });

  test('matches VERDICT: line', () => {
    expect(COMPLETION_SUMMARY_RE.test('VERDICT: CLEARED — Eng Review passed')).toBe(true);
  });

  test('does NOT match prose mentions of "verdict" mid-line', () => {
    // VERDICT must be at the start of a line to count.
    expect(COMPLETION_SUMMARY_RE.test('the final verdict: undecided')).toBe(false);
  });
});
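
A line-anchored alternation is one way to satisfy all five cases above. The pattern below is a guess at the shape, not the exported `COMPLETION_SUMMARY_RE`; `SKETCH_COMPLETION_RE` is a hypothetical name and the accepted `Status:` values are limited to the two that appear in the tests.

```typescript
// Hypothetical sketch of a completion-marker regex. Every alternative is
// anchored with ^ under the m flag, so a marker only counts at the start
// of a line. The real COMPLETION_SUMMARY_RE may differ.
const SKETCH_COMPLETION_RE =
  /^(?:##\s+GSTACK REVIEW REPORT|##\s+Completion [Ss]ummary|Status:\s*(?:clean|issues_open)|VERDICT:)/m;

// Scan a multi-line terminal buffer for any completion marker.
function sketchHasCompletionMarker(buffer: string): boolean {
  return SKETCH_COMPLETION_RE.test(buffer);
}
```

The sketch deliberately omits the `g` flag: a global regex keeps `lastIndex` between `.test()` calls, so reusing one shared constant across polls could skip a match.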

describe('assertReviewReportAtBottom', () => {
  test('passes when REVIEW REPORT is the only/last ## heading', () => {
    const content = `# Plan

## Context
stuff

## Approach
more stuff

## GSTACK REVIEW REPORT

| col | col |
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(true);
  });

  test('fails when REVIEW REPORT is missing', () => {
    const content = `# Plan

## Context
stuff
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(false);
    expect(r.reason).toMatch(/no GSTACK REVIEW REPORT/);
  });

  test('fails when REVIEW REPORT exists but a ## heading follows it', () => {
    const content = `# Plan

## GSTACK REVIEW REPORT

| col | col |

## Late Section
oops
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(false);
    expect(r.reason).toMatch(/trailing ## heading/);
    expect(r.trailingHeadings).toEqual(['## Late Section']);
  });

  test('passes when only ### subheadings follow REVIEW REPORT (deeper nesting allowed)', () => {
    const content = `## GSTACK REVIEW REPORT

### Cross-model tension
- F1: resolved
- F2: resolved
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(true);
  });

  test('fails with multiple trailing ## headings reported', () => {
    const content = `## GSTACK REVIEW REPORT

## First trailing

## Second trailing
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(false);
    expect(r.trailingHeadings).toHaveLength(2);
  });
});
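
The report-at-bottom rule exercised above reduces to two checks: the report heading exists, and no level-2 heading follows it. A minimal sketch, assuming a result shape of `{ ok, reason?, trailingHeadings? }` as the tests suggest; `sketchAssertReportAtBottom` is a hypothetical name and its reason strings are illustrative wording.

```typescript
// Hypothetical sketch of the report-at-bottom check. Field names follow
// what the tests read (ok, reason, trailingHeadings); wording is assumed.
interface ReportVerdict {
  ok: boolean;
  reason?: string;
  trailingHeadings?: string[];
}

function sketchAssertReportAtBottom(content: string): ReportVerdict {
  const lines = content.split('\n');

  // Locate the LAST report heading, in case the file contains several.
  let reportLine = -1;
  for (let i = 0; i < lines.length; i++) {
    if (/^##\s+GSTACK REVIEW REPORT\b/.test(lines[i] ?? '')) reportLine = i;
  }
  if (reportLine < 0) {
    return { ok: false, reason: 'has no GSTACK REVIEW REPORT section' };
  }

  // "## " followed by whitespace matches level-2 headings only; "###" and
  // deeper do not match because the third character is "#", not a space.
  const trailing = lines
    .slice(reportLine + 1)
    .filter((l) => /^##\s/.test(l))
    .map((l) => l.trim());

  if (trailing.length > 0) {
    return {
      ok: false,
      reason: `has ${trailing.length} trailing ## heading(s) after the report`,
      trailingHeadings: trailing,
    };
  }
  return { ok: true };
}
```

Returning the offending headings alongside the reason lets the calling test print them in its failure message instead of dumping the whole plan file.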

describe('Step0BoundaryPredicate per-skill', () => {
  // Helper to build a synthetic fingerprint for predicate tests.
  function fp(promptSnippet: string, optionLabels: string[]): AskUserQuestionFingerprint {
    const options = optionLabels.map((label, i) => ({ index: i + 1, label }));
    return {
      signature: auqFingerprint(promptSnippet, options),
      promptSnippet,
      options,
      observedAtMs: 0,
      preReview: true,
    };
  }

  describe('ceoStep0Boundary', () => {
    test('FIRES on Step 0F mode-pick AUQ (HOLD SCOPE in options)', () => {
      const f = fp('Pick a mode', ['HOLD SCOPE', 'SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'SCOPE REDUCTION']);
      expect(ceoStep0Boundary(f)).toBe(true);
    });

    test('FIRES on scope-selection AUQ with "Skip interview" option (skip-interview path)', () => {
      // After calibration run 1: plan-ceo's first AUQ is scope-selection,
      // and we route via "Skip interview and plan immediately" to bypass
      // Step 0 entirely. Boundary must fire on this AUQ so subsequent
      // AUQs go to reviewCount.
      const f = fp(
        'What scope do you want me to CEO-review?',
        [
          "The branch's diff vs main",
          'A specific plan file',
          "An idea you'll describe inline",
          'Cancel — wrong skill',
          'Type something.',
          'Chat about this',
          'Skip interview and plan immediately',
        ],
      );
      expect(ceoStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on premise challenge AUQs', () => {
      const f = fp('D1 — Premise check: is this the right problem?', ['Yes', 'No', 'Other']);
      expect(ceoStep0Boundary(f)).toBe(false);
    });

    test('does NOT fire on review-section AUQs', () => {
      const f = fp('Architecture: bypass helper?', ['Reuse existing', 'Roll new', 'Defer']);
      expect(ceoStep0Boundary(f)).toBe(false);
    });
  });

  describe('engStep0Boundary', () => {
    test('FIRES on cross-project learnings prompt', () => {
      const f = fp('Enable cross-project learnings on this machine?', ['Yes', 'No']);
      expect(engStep0Boundary(f)).toBe(true);
    });

    test('FIRES on scope reduction recommendation', () => {
      const f = fp('Scope reduction recommendation: cut to MVP?', ['Reduce', 'Proceed', 'Modify']);
      expect(engStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on review-section AUQs', () => {
      const f = fp('Architecture: shared mutable state?', ['Refactor', 'Defer', 'Skip']);
      expect(engStep0Boundary(f)).toBe(false);
    });
  });

  describe('designStep0Boundary', () => {
    test('FIRES on design system / posture mention', () => {
      const f = fp('Pick a design posture for this review', ['Polish', 'Triage', 'Expansion']);
      expect(designStep0Boundary(f)).toBe(true);
    });

    test('FIRES on first-dimension prompt', () => {
      const f = fp('First dimension: visual hierarchy. Score?', ['7', '8', '9']);
      expect(designStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on later dimension AUQs', () => {
      const f = fp('Spacing dimension score?', ['7', '8', '9']);
      expect(designStep0Boundary(f)).toBe(false);
    });
  });

  describe('devexStep0Boundary', () => {
    test('FIRES on developer persona selection', () => {
      const f = fp('Pick the target persona for this review', ['Senior backend', 'Junior frontend', 'Other']);
      expect(devexStep0Boundary(f)).toBe(true);
    });

    test('FIRES on TTHW target prompt', () => {
      const f = fp('What is the TTHW target for first run?', ['<5 min', '<15 min', '<30 min']);
      expect(devexStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on review-section AUQs', () => {
      const f = fp('Friction point: 5-min CI wait. Address?', ['Now', 'Defer', 'Skip']);
      expect(devexStep0Boundary(f)).toBe(false);
    });
  });
});
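
How a boundary predicate splits observed questions into Step 0 and review-phase counts can be sketched as below. Everything here is an assumption for illustration: `sketchCeoBoundary` and `splitCounts` are hypothetical names, the option-label patterns are inferred from the ceo tests above, and counting the boundary question itself as Step 0 is a guess based on the `isLastStep0AUQ` option name.

```typescript
// Hypothetical sketch: split observed AUQ fingerprints into Step 0 and
// review-phase counts using a per-skill boundary predicate.
interface NumberedOption {
  index: number;
  label: string;
}

interface Fingerprint {
  promptSnippet: string;
  options: NumberedOption[];
}

type BoundaryPredicate = (fp: Fingerprint) => boolean;

// Assumed ceo boundary: the mode-pick menu (HOLD SCOPE) or the
// scope-selection menu that offers "Skip interview".
const sketchCeoBoundary: BoundaryPredicate = (fp) =>
  fp.options.some((o) => /HOLD SCOPE|skip\s+interview/i.test(o.label));

function splitCounts(
  fps: Fingerprint[],
  isLastStep0: BoundaryPredicate,
): { step0Count: number; reviewCount: number } {
  let boundarySeen = false;
  let step0Count = 0;
  let reviewCount = 0;
  for (const fp of fps) {
    if (boundarySeen) {
      reviewCount += 1; // everything after the boundary is review-phase
      continue;
    }
    step0Count += 1; // the boundary question itself still counts as Step 0
    if (isLastStep0(fp)) boundarySeen = true;
  }
  return { step0Count, reviewCount };
}
```

Under this sketch, a run that never reaches the boundary reports `reviewCount` of 0, so a band assertion on the review count would fail loudly instead of silently counting Step 0 questions as findings.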
@@ -103,6 +103,15 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
  'autoplan-chain-pty': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
  'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'],

  // Per-finding AskUserQuestion count + review-report-at-bottom assertion.
  // Each test drives its skill end-to-end; touchfiles include preamble +
  // completion-status resolvers because they affect question cadence and
  // terminal output (the regression surface this test catches).
  'plan-ceo-finding-count': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-ceo-finding-count.test.ts'],
  'plan-eng-finding-count': ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-eng-finding-count.test.ts'],
  'plan-design-finding-count': ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-design-finding-count.test.ts'],
  'plan-devex-finding-count': ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-devex-finding-count.test.ts'],
  'brain-privacy-gate': ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-brain-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'],

  // AskUserQuestion format regression (RECOMMENDATION + Completeness: N/10)
@@ -381,6 +390,15 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode
  'autoplan-chain-pty': 'periodic', // ~$8/run, all 3 phases sequential

  // Per-finding count + review-report-at-bottom — periodic because each
  // run drives a full skill end-to-end (~25 min, ~$5/run). Sequential
  // execution during calibration; concurrent opt-in only after measured
  // comparison agrees (plan §D15).
  'plan-ceo-finding-count': 'periodic',
  'plan-eng-finding-count': 'periodic',
  'plan-design-finding-count': 'periodic',
  'plan-devex-finding-count': 'periodic',

  // Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call,
  // costs ~$0.30-$0.50 per run, not needed on every commit)
  'brain-privacy-gate': 'periodic',

@@ -0,0 +1,253 @@
/**
 * /plan-ceo-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
 *
 * Asserts the load-bearing rule "One issue = one AskUserQuestion call" by
 * driving /plan-ceo-review against a 5-finding seeded plan and counting
 * distinct review-phase AUQs. Passes when count is in [N-1, N+2].
 *
 * Two tests in this file:
 *   - 5-finding distinct fixture: count band assertion + D19 review-report-at-bottom.
 *   - 2-finding paired control (D12 positive control): related findings still
 *     produce 2 distinct AUQs, not 1 batched, when the rule is honored.
 *
 * Tier: periodic. Each run drives Step 0 + 11 review sections end-to-end
 * (~25 min, ~$5/run). Sequential by default per plan §D15. See
 * test/helpers/claude-pty-runner.ts for runPlanSkillCounting internals.
 */

import { describe, test } from 'bun:test';
import * as fs from 'node:fs';
import {
  runPlanSkillCounting,
  ceoStep0Boundary,
  assertReviewReportAtBottom,
  type AskUserQuestionFingerprint,
} from './helpers/claude-pty-runner';

/**
 * /plan-ceo-review's first AUQ asks "what scope?" with options like
 *   1. Branch diff vs main
 *   2. A specific plan file or design doc
 *   3. An idea you'll describe inline
 *   ...
 *   7. Skip interview and plan immediately
 *
 * The default pick (1) routes to "branch diff vs main" — the wrong target
 * for our seeded fixture (the agent would review the gstack PR itself,
 * recursively). Picking "Skip interview and plan immediately" bypasses
 * Step 0 and routes the agent to review the chat context (where our
 * follow-up plan was pasted).
 */
function pickSkipInterview(fp: AskUserQuestionFingerprint): number {
  const skipOpt = fp.options.find((o) =>
    /skip\s+interview|plan\s+immediately/i.test(o.label),
  );
  if (skipOpt) return skipOpt.index;
  // Fallback: "describe inline" also routes to using our pasted plan.
  const inlineOpt = fp.options.find((o) =>
    /describe.*inline|inline.*idea/i.test(o.label),
  );
  if (inlineOpt) return inlineOpt.index;
  return 1;
}

const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;

const N_DISTINCT = 5;
const FLOOR_DISTINCT = N_DISTINCT - 1; // 4 (D11)
const CEILING_DISTINCT = N_DISTINCT + 2; // 7 (D11)

const N_PAIRED = 2;
const FLOOR_PAIRED = 2;
const CEILING_PAIRED = 4;
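
The band constants above feed three-way checks later in the file: below floor means batching, above ceiling means over-asking. That comparison can be sketched as a small helper. `checkBand` and `BandResult` are hypothetical names introduced for illustration; the test file itself inlines these comparisons.

```typescript
// Hypothetical helper: classify a review-phase question count against an
// inclusive [floor, ceiling] band.
type BandResult = 'below_floor' | 'in_band' | 'above_ceiling';

function checkBand(count: number, floor: number, ceiling: number): BandResult {
  if (count < floor) return 'below_floor'; // likely batching regression
  if (count > ceiling) return 'above_ceiling'; // possible over-asking
  return 'in_band';
}

// Distinct fixture: N = 5 gives the band [N - 1, N + 2] = [4, 7].
const distinctBand = { floor: 5 - 1, ceiling: 5 + 2 };
// Paired control: the floor equals N (2), so any batching at all fails.
const pairedBand = { floor: 2, ceiling: 4 };
```

The paired control has no slack below N because its purpose is to prove that two related findings are never merged into one question.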

const PLAN_CEO_5_FINDINGS = [
  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo.md (use Edit/Write to that exact path).',
  '',
  '# Plan: Payment Processing Integration',
  '',
  '## Architecture',
  "We're adding a new `PaymentService` class that will handle Stripe webhooks.",
  'This bypasses the existing `WebhookDispatcher` module — we want a clean',
  'namespace separation.',
  '',
  '## Database access',
  'The new endpoint reads `request.params.userId` directly into a raw SQL',
  'fragment for the lookup query.',
  '',
  '## Webhook fan-out',
  'On payment success we update the user record AND fire a notification email.',
  'Both happen inline; no error handling on the email leg.',
  '',
  '## Tests',
  "None planned. We'll rely on the existing integration suite catching regressions.",
  '',
  '## Performance',
  'Each webhook lookup hits the database for the user, then fetches each',
  'order in a loop.',
].join('\n');

const PLAN_CEO_2_PAIRED_FINDINGS = [
  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo-paired.md (use Edit/Write to that exact path).',
  '',
  '# Plan: Payment Processing — Test Coverage',
  '',
  '## Tests',
  'We need test coverage for `processPayment()`. Specifically:',
  '1. The happy path (successful Stripe charge — assert correct receipt is generated).',
  '2. The error/timeout path (Stripe returns 502 — assert retry-with-backoff fires once, then fails clean).',
  '',
  'Currently neither has a unit test. These are deliberately separate concerns:',
  'the success path is correctness, the failure path is graceful degradation.',
].join('\n');

const PLAN_CEO_PATH = '/tmp/gstack-test-plan-ceo.md';
const PLAN_CEO_PAIRED_PATH = '/tmp/gstack-test-plan-ceo-paired.md';

describeE2E('/plan-ceo-review per-finding AskUserQuestion count (periodic)', () => {
  test(
    `5-finding plan emits ${FLOOR_DISTINCT}-${CEILING_DISTINCT} review-phase AskUserQuestions`,
    async () => {
      try {
        fs.rmSync(PLAN_CEO_PATH, { force: true });
      } catch {
        /* best-effort */
      }

      const obs = await runPlanSkillCounting({
        skillName: 'plan-ceo-review',
        slashCommand: '/plan-ceo-review',
        followUpPrompt: PLAN_CEO_5_FINDINGS,
        isLastStep0AUQ: ceoStep0Boundary,
        reviewCountCeiling: CEILING_DISTINCT + 1, // hard cap above assertion ceiling
        firstAUQPick: pickSkipInterview, // bypass scope-selection, route to review
        cwd: process.cwd(),
        timeoutMs: 1_500_000, // 25 min
        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
      });

      try {
        if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
          throw new Error(
            `plan-ceo-review finding-count FAILED: outcome=${obs.outcome}\n` +
              `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
              `fingerprints (last 8):\n` +
              obs.fingerprints
                .slice(-8)
                .map(
                  (f, i) =>
                    ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
                )
                .join('\n') +
              `\n--- evidence (last 3KB) ---\n${obs.evidence}`,
          );
        }
        if (obs.reviewCount < FLOOR_DISTINCT) {
          throw new Error(
            `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_DISTINCT}.\n` +
              `Likely batching regression — agent collapsed multiple findings into fewer questions.\n` +
              `Fingerprints (review-phase only):\n` +
              obs.fingerprints
                .filter((f) => !f.preReview)
                .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
                .join('\n'),
          );
        }
        if (obs.reviewCount > CEILING_DISTINCT) {
          throw new Error(
            `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING_DISTINCT}.\n` +
              `Possible over-asking regression. Review-phase fingerprints:\n` +
              obs.fingerprints
                .filter((f) => !f.preReview)
                .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
                .join('\n'),
          );
        }

        // D19: review report at bottom of plan file.
        if (!fs.existsSync(PLAN_CEO_PATH)) {
          throw new Error(
            `D19 FAIL: agent did not produce expected plan file at ${PLAN_CEO_PATH}.\n` +
              `Either the agent ignored the path instruction in the follow-up prompt, or\n` +
              `the helper exited before the agent wrote the file. ` +
              `outcome=${obs.outcome} review=${obs.reviewCount}`,
          );
        }
        const planContent = fs.readFileSync(PLAN_CEO_PATH, 'utf-8');
        const verdict = assertReviewReportAtBottom(planContent);
        if (!verdict.ok) {
          throw new Error(
            `D19 FAIL: plan file at ${PLAN_CEO_PATH} ${verdict.reason}\n` +
              (verdict.trailingHeadings
                ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
                : '') +
              `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
          );
        }
      } finally {
        try {
          fs.rmSync(PLAN_CEO_PATH, { force: true });
        } catch {
          /* best-effort */
        }
      }
    },
    1_700_000,
  );

test(
|
||||
`paired-finding positive control: ${N_PAIRED} related findings produce ${FLOOR_PAIRED}-${CEILING_PAIRED} AskUserQuestions`,
|
||||
async () => {
|
||||
try {
|
||||
fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true });
|
||||
} catch {
|
||||
/* best-effort */
|
||||
}
|
||||
|
||||
const obs = await runPlanSkillCounting({
|
||||
skillName: 'plan-ceo-review',
|
||||
slashCommand: '/plan-ceo-review',
|
||||
followUpPrompt: PLAN_CEO_2_PAIRED_FINDINGS,
|
||||
isLastStep0AUQ: ceoStep0Boundary,
|
||||
reviewCountCeiling: CEILING_PAIRED + 1,
|
||||
cwd: process.cwd(),
|
||||
timeoutMs: 1_500_000,
|
||||
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
|
||||
});
|
||||
|
||||
try {
|
||||
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
|
||||
throw new Error(
|
||||
`paired-finding control FAILED: outcome=${obs.outcome}\n` +
|
||||
`step0=${obs.step0Count} review=${obs.reviewCount}\n` +
|
||||
`--- evidence (last 3KB) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
if (obs.reviewCount < FLOOR_PAIRED) {
|
||||
throw new Error(
|
||||
`PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_PAIRED}.\n` +
|
||||
`Two deliberately related findings were batched into <2 questions — the rule failed under D12.\n` +
|
||||
`Review-phase fingerprints:\n` +
|
||||
obs.fingerprints
|
||||
.filter((f) => !f.preReview)
|
||||
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
|
||||
.join('\n'),
|
||||
);
|
||||
}
|
||||
if (obs.reviewCount > CEILING_PAIRED) {
|
||||
throw new Error(
|
||||
`PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} > CEILING=${CEILING_PAIRED} (over-asking on a 2-finding fixture).`,
|
||||
);
|
||||
}
|
||||
} finally {
|
||||
try {
|
||||
fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true });
|
||||
} catch {
|
||||
/* best-effort */
|
||||
}
|
||||
}
|
||||
},
|
||||
1_700_000,
|
||||
);
|
||||
});
|
||||
@@ -37,14 +37,15 @@ import {
   isPermissionDialogVisible,
   parseNumberedOptions,
   isPlanReadyVisible,
+  MODE_RE,
+  optionsSignature,
+  TAIL_SCAN_BYTES,
   type ClaudePtySession,
 } from './helpers/claude-pty-runner';
 
 const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
 const describeE2E = shouldRun ? describe : describe.skip;
 
-const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i;
-
 interface ModeCase {
   mode: 'HOLD SCOPE' | 'SCOPE EXPANSION';
   /** Regex applied to visible-since-mode-pick text. At least one must match. */
@@ -95,8 +96,8 @@ async function navigateToModeAskUserQuestion(
 
     // Has the rendered list changed since last poll? If not, we're seeing
     // the same prompt and shouldn't double-press.
-    const sig = opts.map(o => `${o.index}:${o.label}`).join('|');
-    const lastSig = lastSeenList.map(o => `${o.index}:${o.label}`).join('|');
+    const sig = optionsSignature(opts);
+    const lastSig = optionsSignature(lastSeenList);
     if (sig === lastSig) continue;
     lastSeenList = opts;
 
@@ -115,7 +116,14 @@ async function navigateToModeAskUserQuestion(
     // Permission dialog? Grant with "1" but don't count it against nav budget.
     // Classify on the recent tail only — old permission text persists in
     // visibleSince and would re-trigger forever.
-    if (isPermissionDialogVisible(visible.slice(-1500))) {
+    //
+    // Note: runPlanSkillObservation has its own permission-dialog filter that
+    // simply skips classification (since it observes, doesn't drive). This nav
+    // loop drives the PTY directly via launchClaudePty and so owns its own
+    // dialog handling — granting with "1" so the workflow advances. Both
+    // paths share TAIL_SCAN_BYTES as the recent-tail window so tuning stays
+    // in sync.
+    if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
       session.send('1\r');
       await Bun.sleep(1500);
       continue;
@@ -1,22 +1,34 @@
 /**
  * plan-ceo-review plan-mode smoke (gate, paid, real-PTY).
  *
- * Asserts: when /plan-ceo-review is invoked in plan mode, the skill reaches
- * a terminal outcome that is either:
- *   - 'asked'       — skill emitted its Step 0 numbered prompt (scope mode
- *                     selection, or the routing-injection prompt that runs
- *                     before Step 0)
- *   - 'plan_ready'  — skill ran end-to-end and surfaced claude's native
- *                     "Ready to execute" confirmation
+ * Asserts: when /plan-ceo-review is invoked in plan mode, the FIRST terminal
+ * outcome is 'asked' — a skill-question numbered list. Permission dialogs
+ * (which also render numbered lists) are filtered out by `runPlanSkillObservation`
+ * via its `isPermissionDialogVisible(visible.slice(-1500))` short-circuit.
  *
- * FAIL conditions: silent Write/Edit before any prompt, claude crash,
- * timeout.
+ * Reaching 'plan_ready' first IS the regression we want to catch: the agent
+ * skipped Step 0 entirely and went straight to ExitPlanMode. The original
+ * failure had the assistant read a diff, write a plan with two issues, and
+ * call ExitPlanMode without ever firing AskUserQuestion — the user had to
+ * manually call out the missing per-issue questions.
  *
- * Replaces the SDK-based test that never worked: the SDK's canUseTool
- * interceptor on AskUserQuestion never fires in plan mode because plan
- * mode renders its native confirmation as TTY UI, not via the
- * AskUserQuestion tool. The real PTY harness observes the rendered
- * terminal output directly.
+ * Why this skill is special: unlike plan-eng-review / plan-design-review /
+ * plan-devex-review (whose smokes accept either 'asked' or 'plan_ready'),
+ * plan-ceo-review's template mandates Step 0A premise challenge (3 baked-in
+ * questions) AND Step 0F mode selection BEFORE any plan write. There is no
+ * legitimate path to plan_ready that does not first emit a skill-question
+ * numbered prompt.
+ *
+ * Env passthrough: passes `QUESTION_TUNING=false` and `EXPLAIN_LEVEL=default`
+ * via the runner's env option. Today these are advisory — `gstack-config`
+ * reads `~/.gstack/config.yaml`, not env vars, so a contributor with
+ * `question_tuning: true` set in their YAML config can still see AUTO_DECIDE
+ * masking. The env passthrough is wired so a future gstack-config change to
+ * honor env overrides will make this test hermetic without further edits.
+ * Tracked as a post-merge follow-up.
+ *
+ * FAIL conditions: 'plan_ready' first, silent Write/Edit before any prompt,
+ * claude crash, timeout.
  *
  * See test/helpers/claude-pty-runner.ts for runner internals.
  */
@@ -28,21 +40,33 @@ const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
 const describeE2E = shouldRun ? describe : describe.skip;
 
 describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
-  test('reaches a terminal outcome (asked or plan_ready) without silent writes', async () => {
+  test('first terminal outcome is asked (Step 0 fires before any plan write)', async () => {
     const obs = await runPlanSkillObservation({
       skillName: 'plan-ceo-review',
       inPlanMode: true,
       timeoutMs: 300_000,
+      env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
     });
 
-    if (obs.outcome === 'silent_write' || obs.outcome === 'exited' || obs.outcome === 'timeout') {
+    if (obs.outcome !== 'asked') {
+      const diagnosis =
+        obs.outcome === 'plan_ready'
+          ? `'plan_ready' first means the agent skipped Step 0 entirely and went straight to ExitPlanMode without asking.`
+          : obs.outcome === 'timeout'
+            ? `Timeout means the agent neither asked nor completed within the budget — likely hung mid-question or stuck on a permission dialog.`
+            : obs.outcome === 'silent_write'
+              ? `Silent Write/Edit fired to an unsanctioned path before any AskUserQuestion — also a Step 0 skip.`
+              : `Outcome '${obs.outcome}' is unexpected; investigate the evidence below.`;
       throw new Error(
-        `plan-ceo-review plan-mode smoke FAILED: outcome=${obs.outcome}\n` +
+        `plan-ceo-review smoke FAILED: outcome=${obs.outcome}\n` +
+          `${diagnosis}\n` +
+          `Expected 'asked'. See plan-ceo-review/SKILL.md.tmpl: the Step 0 STOP rules ` +
+          `and the "One issue = one AskUserQuestion call" rule under "CRITICAL RULE — ` +
+          `How to ask questions".\n` +
           `summary: ${obs.summary}\n` +
           `elapsed: ${obs.elapsedMs}ms\n` +
           `--- evidence (last 2KB visible) ---\n${obs.evidence}`,
       );
     }
-    expect(['asked', 'plan_ready']).toContain(obs.outcome);
   }, 360_000);
 });
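The per-outcome diagnosis in the smoke test above could also be factored into a pure function so it can be unit-tested without a PTY. The following is only a sketch under that assumption: `Outcome` and `diagnoseOutcome` are hypothetical names chosen for illustration, not identifiers from the gstack repository, and the outcome set is limited to the values visible in this diff.

```typescript
// Hypothetical sketch: `Outcome` and `diagnoseOutcome` are illustrative names,
// not identifiers from the gstack repository.
type Outcome = 'asked' | 'plan_ready' | 'silent_write' | 'exited' | 'timeout';

/** Returns null when the outcome is acceptable, otherwise a one-line diagnosis. */
function diagnoseOutcome(outcome: Outcome): string | null {
  switch (outcome) {
    case 'asked':
      // Step 0 fired first: the only passing outcome for plan-ceo-review.
      return null;
    case 'plan_ready':
      return 'agent skipped Step 0 and went straight to ExitPlanMode';
    case 'timeout':
      return 'agent neither asked nor completed within the budget';
    case 'silent_write':
      return 'Write/Edit fired before any AskUserQuestion';
    default:
      // Any remaining outcome (here: 'exited') is reported as unexpected.
      return `unexpected outcome '${outcome}'`;
  }
}
```

A test body would then throw only when `diagnoseOutcome(obs.outcome)` is non-null and include the returned line in the error message.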
@@ -0,0 +1,135 @@
+/**
+ * /plan-design-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
+ *
+ * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-design-review
+ * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2].
+ * Plus D19: review report at bottom of produced plan file.
+ *
+ * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15.
+ */
+
+import { describe, test } from 'bun:test';
+import * as fs from 'node:fs';
+import {
+  runPlanSkillCounting,
+  designStep0Boundary,
+  assertReviewReportAtBottom,
+} from './helpers/claude-pty-runner';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+const N = 5;
+const FLOOR = N - 1;
+const CEILING = N + 2;
+
+const PLAN_DESIGN_5_FINDINGS = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-design.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: Settings Page UI redesign',
+  '',
+  '## Visual Hierarchy',
+  'The "Save" button is rendered with the same size, weight, and color as',
+  'three other buttons in the page header (Reset, Cancel, Export). Nothing',
+  'tells the user which is the primary action.',
+  '',
+  '## Spacing',
+  'Between sections we have 24px in some places, 32px in others, and 16px',
+  'in a third — no consistent vertical rhythm.',
+  '',
+  '## Color',
+  'The error message uses red text on a light pink background. Contrast',
+  'ratio is approximately 3:1 (below WCAG AA).',
+  '',
+  '## Typography',
+  'We use 14px, 16px, and 18px font sizes across the form labels. Two',
+  'sizes would suffice and create stronger hierarchy.',
+  '',
+  '## Motion',
+  'The "Save" action takes 2-5 seconds with no loading indicator. Users',
+  'see a frozen page; we should add a spinner or skeleton state.',
+].join('\n');
+
+const PLAN_DESIGN_PATH = '/tmp/gstack-test-plan-design.md';
+
+describeE2E('/plan-design-review per-finding AskUserQuestion count (periodic)', () => {
+  test(
+    `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`,
+    async () => {
+      try {
+        fs.rmSync(PLAN_DESIGN_PATH, { force: true });
+      } catch {
+        /* best-effort */
+      }
+
+      const obs = await runPlanSkillCounting({
+        skillName: 'plan-design-review',
+        slashCommand: '/plan-design-review',
+        followUpPrompt: PLAN_DESIGN_5_FINDINGS,
+        isLastStep0AUQ: designStep0Boundary,
+        reviewCountCeiling: CEILING + 1,
+        cwd: process.cwd(),
+        timeoutMs: 1_500_000,
+        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+      });
+
+      try {
+        if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
+          throw new Error(
+            `plan-design-review finding-count FAILED: outcome=${obs.outcome}\n` +
+              `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
+              `fingerprints (last 8):\n` +
+              obs.fingerprints
+                .slice(-8)
+                .map(
+                  (f, i) =>
+                    ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
+                )
+                .join('\n') +
+              `\n--- evidence (last 3KB) ---\n${obs.evidence}`,
+          );
+        }
+        if (obs.reviewCount < FLOOR) {
+          throw new Error(
+            `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` +
+              `Likely batching regression. Review-phase fingerprints:\n` +
+              obs.fingerprints
+                .filter((f) => !f.preReview)
+                .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
+                .join('\n'),
+          );
+        }
+        if (obs.reviewCount > CEILING) {
+          throw new Error(
+            `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`,
+          );
+        }
+
+        if (!fs.existsSync(PLAN_DESIGN_PATH)) {
+          throw new Error(
+            `D19 FAIL: agent did not produce expected plan file at ${PLAN_DESIGN_PATH}. ` +
+              `outcome=${obs.outcome} review=${obs.reviewCount}`,
+          );
+        }
+        const planContent = fs.readFileSync(PLAN_DESIGN_PATH, 'utf-8');
+        const verdict = assertReviewReportAtBottom(planContent);
+        if (!verdict.ok) {
+          throw new Error(
+            `D19 FAIL: plan file at ${PLAN_DESIGN_PATH} ${verdict.reason}\n` +
+              (verdict.trailingHeadings
+                ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
+                : '') +
+              `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
+          );
+        }
+      } finally {
+        try {
+          fs.rmSync(PLAN_DESIGN_PATH, { force: true });
+        } catch {
+          /* best-effort */
+        }
+      }
+    },
+    1_700_000,
+  );
+});
@@ -0,0 +1,135 @@
+/**
+ * /plan-devex-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
+ *
+ * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-devex-review
+ * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2].
+ * Plus D19: review report at bottom of produced plan file.
+ *
+ * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15.
+ */
+
+import { describe, test } from 'bun:test';
+import * as fs from 'node:fs';
+import {
+  runPlanSkillCounting,
+  devexStep0Boundary,
+  assertReviewReportAtBottom,
+} from './helpers/claude-pty-runner';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+const N = 5;
+const FLOOR = N - 1;
+const CEILING = N + 2;
+
+const PLAN_DEVEX_5_FINDINGS = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-devex.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: Public SDK Beta Launch',
+  '',
+  '## Persona',
+  "The plan doesn't specify which developer persona is the target — we're",
+  "shipping for \"everyone,\" which means we tune for nobody.",
+  '',
+  '## TTHW (time to hello world)',
+  'Time-to-hello-world is not measured. No benchmark data referenced. We',
+  "don't know if first-run takes 5 minutes or 50.",
+  '',
+  '## Friction Point',
+  'First-run currently requires a 5-minute mandatory CI step before the',
+  'developer can run their first eval. There is no way to skip it.',
+  '',
+  '## Magical Moment',
+  'Getting-started flow has no delight beat. Pure documentation, no',
+  'interactive demo, no "ah-ha" moment that makes the developer trust us.',
+  '',
+  '## Competitive Blind Spot',
+  "The plan doesn't reference how peer SDKs (LangChain, Semantic Kernel,",
+  'OpenAI) handle this DX surface. We may be reinventing worse versions',
+  'of solved problems.',
+].join('\n');
+
+const PLAN_DEVEX_PATH = '/tmp/gstack-test-plan-devex.md';
+
+describeE2E('/plan-devex-review per-finding AskUserQuestion count (periodic)', () => {
+  test(
+    `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`,
+    async () => {
+      try {
+        fs.rmSync(PLAN_DEVEX_PATH, { force: true });
+      } catch {
+        /* best-effort */
+      }
+
+      const obs = await runPlanSkillCounting({
+        skillName: 'plan-devex-review',
+        slashCommand: '/plan-devex-review',
+        followUpPrompt: PLAN_DEVEX_5_FINDINGS,
+        isLastStep0AUQ: devexStep0Boundary,
+        reviewCountCeiling: CEILING + 1,
+        cwd: process.cwd(),
+        timeoutMs: 1_500_000,
+        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+      });
+
+      try {
+        if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
+          throw new Error(
+            `plan-devex-review finding-count FAILED: outcome=${obs.outcome}\n` +
+              `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
+              `fingerprints (last 8):\n` +
+              obs.fingerprints
+                .slice(-8)
+                .map(
+                  (f, i) =>
+                    ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
+                )
+                .join('\n') +
+              `\n--- evidence (last 3KB) ---\n${obs.evidence}`,
+          );
+        }
+        if (obs.reviewCount < FLOOR) {
+          throw new Error(
+            `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` +
+              `Likely batching regression. Review-phase fingerprints:\n` +
+              obs.fingerprints
+                .filter((f) => !f.preReview)
+                .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
+                .join('\n'),
+          );
+        }
+        if (obs.reviewCount > CEILING) {
+          throw new Error(
+            `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`,
+          );
+        }
+
+        if (!fs.existsSync(PLAN_DEVEX_PATH)) {
+          throw new Error(
+            `D19 FAIL: agent did not produce expected plan file at ${PLAN_DEVEX_PATH}. ` +
+              `outcome=${obs.outcome} review=${obs.reviewCount}`,
+          );
+        }
+        const planContent = fs.readFileSync(PLAN_DEVEX_PATH, 'utf-8');
+        const verdict = assertReviewReportAtBottom(planContent);
+        if (!verdict.ok) {
+          throw new Error(
+            `D19 FAIL: plan file at ${PLAN_DEVEX_PATH} ${verdict.reason}\n` +
+              (verdict.trailingHeadings
+                ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
+                : '') +
+              `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
+          );
+        }
+      } finally {
+        try {
+          fs.rmSync(PLAN_DEVEX_PATH, { force: true });
+        } catch {
+          /* best-effort */
+        }
+      }
+    },
+    1_700_000,
+  );
+});
@@ -0,0 +1,134 @@
+/**
+ * /plan-eng-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
+ *
+ * Same shape as skill-e2e-plan-ceo-finding-count: drives /plan-eng-review
+ * against a 5-finding seeded plan and asserts review-phase AUQ count ∈ [N-1, N+2].
+ * Plus D19: review report at bottom of produced plan file.
+ *
+ * Tier: periodic (~25 min, ~$5/run). Sequential by default per plan §D15.
+ */
+
+import { describe, test } from 'bun:test';
+import * as fs from 'node:fs';
+import {
+  runPlanSkillCounting,
+  engStep0Boundary,
+  assertReviewReportAtBottom,
+} from './helpers/claude-pty-runner';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+const N = 5;
+const FLOOR = N - 1; // 4
+const CEILING = N + 2; // 7
+
+const PLAN_ENG_5_FINDINGS = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-eng.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: Multi-tenant Auth Refactor',
+  '',
+  '## Architecture',
+  'Two new services (`AuthBroker` and `SessionMint`) share a global mutable',
+  '`AuthCache` instance via module-level export. Both services mutate it.',
+  '',
+  '## Code quality',
+  'The `validateAndDispatch()` function is 60 lines with three nested',
+  'try/catch blocks; each catch swallows a different error class.',
+  '',
+  '## Tests',
+  'The existing `legacyAuthFlow()` will get rewritten as part of this work;',
+  'no regression test for the prior behavior is planned.',
+  '',
+  '## Performance',
+  'Token validation issues 5 sequential API calls to the IDP; they could be',
+  'parallelized via Promise.all trivially (calls are independent).',
+  '',
+  '## Architecture (scope smell)',
+  'This touches 12 files and introduces 4 new classes (TokenStore,',
+  'SessionMint, AuthCache, RequestPolicy). Worth flagging the complexity check.',
+].join('\n');
+
+const PLAN_ENG_PATH = '/tmp/gstack-test-plan-eng.md';
+
+describeE2E('/plan-eng-review per-finding AskUserQuestion count (periodic)', () => {
+  test(
+    `5-finding plan emits ${FLOOR}-${CEILING} review-phase AskUserQuestions`,
+    async () => {
+      try {
+        fs.rmSync(PLAN_ENG_PATH, { force: true });
+      } catch {
+        /* best-effort */
+      }
+
+      const obs = await runPlanSkillCounting({
+        skillName: 'plan-eng-review',
+        slashCommand: '/plan-eng-review',
+        followUpPrompt: PLAN_ENG_5_FINDINGS,
+        isLastStep0AUQ: engStep0Boundary,
+        reviewCountCeiling: CEILING + 1,
+        cwd: process.cwd(),
+        timeoutMs: 1_500_000,
+        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+      });
+
+      try {
+        if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
+          throw new Error(
+            `plan-eng-review finding-count FAILED: outcome=${obs.outcome}\n` +
+              `step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
+              `fingerprints (last 8):\n` +
+              obs.fingerprints
+                .slice(-8)
+                .map(
+                  (f, i) =>
+                    ` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
+                )
+                .join('\n') +
+              `\n--- evidence (last 3KB) ---\n${obs.evidence}`,
+          );
+        }
+        if (obs.reviewCount < FLOOR) {
+          throw new Error(
+            `BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR}.\n` +
+              `Likely batching regression. Review-phase fingerprints:\n` +
+              obs.fingerprints
+                .filter((f) => !f.preReview)
+                .map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
+                .join('\n'),
+          );
+        }
+        if (obs.reviewCount > CEILING) {
+          throw new Error(
+            `BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING}.`,
+          );
+        }
+
+        if (!fs.existsSync(PLAN_ENG_PATH)) {
+          throw new Error(
+            `D19 FAIL: agent did not produce expected plan file at ${PLAN_ENG_PATH}. ` +
+              `outcome=${obs.outcome} review=${obs.reviewCount}`,
+          );
+        }
+        const planContent = fs.readFileSync(PLAN_ENG_PATH, 'utf-8');
+        const verdict = assertReviewReportAtBottom(planContent);
+        if (!verdict.ok) {
+          throw new Error(
+            `D19 FAIL: plan file at ${PLAN_ENG_PATH} ${verdict.reason}\n` +
+              (verdict.trailingHeadings
+                ? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
+                : '') +
+              `--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
+          );
+        }
+      } finally {
+        try {
+          fs.rmSync(PLAN_ENG_PATH, { force: true });
+        } catch {
+          /* best-effort */
+        }
+      }
+    },
+    1_700_000,
+  );
+});
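The three finding-count tests above repeat the same band check: with N seeded findings, the review-phase count must fall in [N-1, N+2]. A minimal sketch of that check as a shared pure function follows; `BandVerdict` and `checkBand` are hypothetical names introduced here for illustration and do not appear in the gstack helpers.

```typescript
// Hypothetical sketch: `BandVerdict` and `checkBand` are illustrative names,
// not exports of test/helpers/claude-pty-runner.
interface BandVerdict {
  ok: boolean;
  reason?: string;
}

/** Checks that reviewCount lies in the inclusive band [n - 1, n + 2]. */
function checkBand(reviewCount: number, n: number): BandVerdict {
  const floor = n - 1; // one batched pair is tolerated
  const ceiling = n + 2; // up to two extra questions are tolerated
  if (reviewCount < floor) {
    return { ok: false, reason: `below floor: reviewCount=${reviewCount} < FLOOR=${floor}` };
  }
  if (reviewCount > ceiling) {
    return { ok: false, reason: `above ceiling: reviewCount=${reviewCount} > CEILING=${ceiling}` };
  }
  return { ok: true };
}
```

For N = 5 this accepts counts 4 through 7, matching the `// 4` and `// 7` comments in the plan-eng-review file.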
@@ -97,8 +97,10 @@ describe('selectTests', () => {
     expect(result.selected).toContain('ask-user-question-format-pty');
     expect(result.selected).toContain('plan-ceo-mode-routing');
     expect(result.selected).toContain('autoplan-chain-pty');
-    expect(result.selected.length).toBe(18);
-    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 18);
+    // Per-finding count + review-report-at-bottom (v1.21.x)
+    expect(result.selected).toContain('plan-ceo-finding-count');
+    expect(result.selected.length).toBe(19);
+    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 19);
   });
 
   test('global touchfile triggers ALL tests', () => {