gstack/test/helpers/claude-pty-runner.unit.test.ts
Garry Tan 454423aeb3 v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255)
* test: extract classifyVisible() + permission-dialog filter in PTY runner

Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:

- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
  gstack-config does not yet honor env overrides; plumbing is in place for a
  future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
  number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
  in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
  requires a file-edit context co-trigger. Other clauses unchanged. Skill
  questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
  plan_ready → asked → null. Permission dialogs filtered out of the
  'asked' classification so a permission prompt cannot pose as a Step 0
  skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.
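
The branch order and the permission filter described above can be sketched roughly as follows. This is a minimal illustration, not the runner's code: `classifySketch`, the detector regexes, and the `SANCTIONED` list are assumptions made for the example; the real predicates live in test/helpers/claude-pty-runner.ts.

```typescript
// Hypothetical sketch of the documented branch order:
// silent_write -> plan_ready -> asked -> null.
type Outcome = 'silent_write' | 'plan_ready' | 'asked';
interface Classification { outcome: Outcome; summary: string }

// Assumed stand-ins for the runner's detectors (patterns are guesses).
const hasNumberedPrompt = (v: string): boolean =>
  /❯\s*1\./.test(v) && /^\s*2\./m.test(v);
const isPlanReady = (v: string): boolean => /Ready to execute the plan\?/.test(v);
const isPermissionDialog = (v: string): boolean =>
  /requires permission/.test(v) ||
  /allow all edits/.test(v) ||
  /always allow access to/.test(v) ||
  // The bare phrase only counts together with a file-edit co-trigger.
  (/Do you want to proceed\?/.test(v) && /(Edit|Write) to \S+/.test(v));
const SANCTIONED = ['.claude/plans']; // assumed list of allowed write targets

function classifySketch(visible: string): Classification | null {
  const prompt = hasNumberedPrompt(visible);
  // 1. silent_write: a write to an unsanctioned path with no prompt on screen.
  const path = visible.match(/Write\(([^)]+)\)/)?.[1];
  if (path !== undefined && !prompt && !SANCTIONED.some((s) => path.includes(s))) {
    return { outcome: 'silent_write', summary: `wrote ${path}` };
  }
  // 2. plan_ready wins over asked.
  if (isPlanReady(visible)) return { outcome: 'plan_ready', summary: 'plan ready' };
  // 3. asked: a numbered prompt that is not a permission dialog.
  if (prompt && !isPermissionDialog(visible)) {
    return { outcome: 'asked', summary: 'skill question' };
  }
  return null; // nothing conclusive yet: keep polling
}
```

Keeping the function pure (string in, classification out) is what lets the unit tests feed it synthetic frames without a PTY session.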

* test: tighten plan-ceo-review smoke to require Step 0 fires first

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.
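
As an illustration of branching the failure message on outcome, a helper along these lines would do. The function name `diagnose` and the wording of each line are hypothetical; only the three outcomes and the two quoted template section names come from the description above.

```typescript
// Hypothetical sketch: one tailored diagnosis line per failing outcome.
type FailedOutcome = 'plan_ready' | 'timeout' | 'silent_write';

function diagnose(outcome: FailedOutcome): string {
  switch (outcome) {
    case 'plan_ready':
      // The agent went straight to ExitPlanMode without asking.
      return 'Reached plan_ready before any question: see "Step 0 STOP rules" in the skill template.';
    case 'timeout':
      return 'No skill question or plan appeared before the deadline.';
    case 'silent_write':
      return 'A file was written before Step 0 fired: see "One issue = one AskUserQuestion call".';
  }
}
```

Referring to sections by name rather than by line number keeps the message valid when the template is edited.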

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.
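
A minimal sketch of the passthrough, assuming the merge is a plain object spread where caller-supplied values win over the inherited environment. `mergeEnv` and `PtyOptionsSketch` are illustrative names, not the runner's exports.

```typescript
// Hypothetical sketch of env passthrough: overrides layered on a base env.
interface PtyOptionsSketch {
  env?: Record<string, string>;
}

function mergeEnv(
  base: Record<string, string | undefined>,
  opts: PtyOptionsSketch,
): Record<string, string | undefined> {
  // Spreading an undefined `opts.env` is a no-op, so omitting env is safe.
  return { ...base, ...opts.env };
}
```

Because gstack-config reads only ~/.gstack/config.yaml today, the merged values reach the child process but do not yet change its behavior.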

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.

* chore: capture v1.21.1.0 follow-ups in TODOS.md

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.

* chore: bump version and changelog (v1.21.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract MODE_RE + optionsSignature into PTY runner exports

Refactor prep for the upcoming per-finding AskUserQuestion count test
across plan-{ceo,eng,design,devex}-review. Both new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe — pulling them into one source of truth in
test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the
fingerprint shape) updates everywhere instead of drifting per-test.

Mechanical: no behavior change in the mode-routing test.
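
For orientation, the option-list fingerprint can be sketched as below. The output shape ("index:label" entries joined by "|", sorted by index) matches what the unit tests assert; the function name and the implementation itself are assumptions.

```typescript
// Hypothetical sketch of an order-independent option-list signature.
interface NumberedOption {
  index: number;
  label: string;
}

function optionsSignatureSketch(options: NumberedOption[]): string {
  return [...options] // copy so the caller's array is not reordered
    .sort((a, b) => a.index - b.index)
    .map((o) => `${o.index}:${o.label}`)
    .join('|');
}
```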

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add per-finding count primitives + unit tests

Pure helpers landing ahead of runPlanSkillCounting:

  - parseQuestionPrompt(visible) — extract the 1-3 line prompt above
    the latest "❯ 1." cursor, normalize to a 240-char snippet
  - auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
    options signature; distinct prompts with shared option labels
    (the generic A/B/C TODO menu) get distinct fingerprints
  - COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
    plan-review skills' completion / verdict markers
  - assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
    REPORT" is present and is the last "## " heading in a plan file
  - Step0BoundaryPredicate type + four per-skill predicates
    (ceo / eng / design / devex) — fire on the answered AUQ's
    fingerprint, marking the end of Step 0 deterministically
    (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).
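
The fingerprint idea can be sketched as follows: hash the whitespace-normalized prompt together with the sorted options signature, so that two findings sharing the same A/B/C menu still differ. The real helper uses Bun.hash; this sketch swaps in a small FNV-1a hash so it runs anywhere, and all names here are illustrative.

```typescript
// Hypothetical sketch of a prompt + options fingerprint.
interface NumberedOption {
  index: number;
  label: string;
}

// Collapse runs of whitespace so TTY reflow does not change the fingerprint.
const normalizePrompt = (s: string): string => s.replace(/\s+/g, ' ').trim();

// 32-bit FNV-1a, used here only as a stand-in for Bun.hash.
function fnv1a(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

function fingerprintSketch(prompt: string, options: NumberedOption[]): string {
  const signature = [...options]
    .sort((a, b) => a.index - b.index)
    .map((o) => `${o.index}:${o.label}`)
    .join('|');
  // A NUL separator keeps prompt text from bleeding into the options part.
  return fnv1a(`${normalizePrompt(prompt)}\u0000${signature}`);
}
```

An empty prompt reduces the input to the options signature alone, which is why callers are expected to skip empty-prompt frames instead of fingerprinting them.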

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add runPlanSkillCounting PTY helper

Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:

  - Boot + auto-trust handler (existing launchClaudePty)
  - Send slash command alone, sleep 3s, send plan content as follow-up
    message (proven pattern from skill-e2e-plan-design-with-ui)
  - Poll loop with permission-dialog auto-grant, same-redraw skip,
    empty-prompt re-poll
  - Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
    the answered AUQ's fingerprint (Codex F7 — boundary is observed
    event, not later rendered content)
  - Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
    plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.
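
The counting rules above (skip empty prompts, skip repeated redraws, switch from Step 0 to review once the boundary AUQ is answered) can be sketched over a list of already-parsed frames. `Frame`, `countDistinct`, and the predicate signature are hypothetical; the real helper polls a live PTY instead of iterating an array.

```typescript
// Hypothetical sketch of the dedupe-and-count logic, minus the PTY polling.
interface Frame {
  prompt: string;    // parsed question text ('' if extraction failed)
  signature: string; // fingerprint of prompt + options
}

function countDistinct(
  frames: Frame[],
  isLastStep0: (f: Frame) => boolean,
): { step0: number; review: number } {
  const seen = new Set<string>();
  let step0 = 0;
  let review = 0;
  let pastStep0 = false;
  for (const f of frames) {
    if (f.prompt === '') continue;       // empty prompt: re-poll, never fingerprint
    if (seen.has(f.signature)) continue; // same question redrawn: not a new AUQ
    seen.add(f.signature);
    if (pastStep0) review++;
    else step0++;
    // The boundary is an observed event: it flips when the matching AUQ is handled.
    if (!pastStep0 && isLastStep0(f)) pastStep0 = true;
  }
  return { step0, review };
}
```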

No E2E tests yet — those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: register four finding-count tests in touchfiles + tier map

Each new test depends on its skill template, the runner, and three
preamble resolvers (preamble.ts, generate-ask-user-format.ts,
generate-completion-status.ts) — those affect question cadence and
completion rendering, which is exactly what the test asserts on.

All four classified periodic. Sequential execution during calibration;
opt-in to concurrent only after measured comparison agrees (plan §D15).

Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests
(was 18) because plan-ceo-finding-count joins the family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)

Each test drives its plan-* skill through Step 0 then asserts the
review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5
seeded plan, plus D19: produced plan file ends with
"## GSTACK REVIEW REPORT" as its last "## " heading.

plan-ceo also runs a paired-finding positive control: 2 deliberately
related findings should still produce 2 distinct AUQs, not 1 batched.
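
The two assertions can be sketched as small pure checks. The names `inBand` and `reportAtBottomSketch` are illustrative, and the choice to treat the first report heading as the anchor is an assumption of this sketch; the result shape (`ok`, `reason`, `trailingHeadings`) mirrors what the unit tests read.

```typescript
// Hypothetical sketch: tolerance band for the AUQ count, for N seeded findings.
const inBand = (count: number, n: number): boolean =>
  count >= n - 1 && count <= n + 2;

interface ReportCheck {
  ok: boolean;
  reason?: string;
  trailingHeadings?: string[];
}

// Hypothetical sketch: the report heading must be the last "## " heading.
function reportAtBottomSketch(content: string): ReportCheck {
  const lines = content.split('\n');
  const at = lines.findIndex((l) => l.trim() === '## GSTACK REVIEW REPORT');
  if (at === -1) {
    return { ok: false, reason: 'no GSTACK REVIEW REPORT heading found' };
  }
  // Only level-2 headings count; "### " lines do not start with "## ".
  const trailing = lines.slice(at + 1).filter((l) => l.startsWith('## '));
  if (trailing.length > 0) {
    return { ok: false, reason: 'trailing ## heading after report', trailingHeadings: trailing };
  }
  return { ok: true };
}
```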

Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic).
Sequential execution by plan §D15. Each fixture is inline TypeScript
content delivered as a follow-up message after the slash command, per
the proven pattern at skill-e2e-plan-design-with-ui.test.ts.

Calibration loop (5 runs per skill) and the manual pre-merge negative
check (D7 + D12) are required before merge per plan §Verification.
NOT yet run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: fix parseNumberedOptions for inline-cursor box-layout AUQs

Calibration run 1 timed out with step0=0 review=0 because the parser
could not find the cursor in /plan-ceo-review's scope-selection AUQ.
The TTY's box-layout rendering inlines divider + header + prompt +
"1." onto one logical line — cursor escapes get stripped, leaving
text crushed onto a single line.

Cursor anchor regex changed from anchored to unanchored so it matches
mid-line. Cursor-line option extraction uses a non-anchored regex;
subsequent options stay with the original start-of-line parser.

parseQuestionPrompt picks up the inline prompt text BEFORE the cursor
on the cursor line (after stripping box-drawing chars + sigil) and
appends it after any walked-up multi-line prompt above.

Three new unit tests: clean-cursor still works, inline-cursor
extracts all 7 options, prompt extraction strips box chars.
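
A rough sketch of the two-regex approach: an unanchored pattern finds the cursor and option 1 anywhere on a line, and the start-of-line pattern picks up the remaining options. The regexes and the name `parseOptionsSketch` are assumptions, and the sketch assumes the cursor sits on the first option.

```typescript
// Hypothetical sketch of inline-cursor option parsing.
interface NumberedOption {
  index: number;
  label: string;
}

const cursorRe = /❯\s*(\d+)\.\s+(.*\S)\s*$/; // unanchored: may match mid-line
const optionRe = /^\s*(\d+)\.\s+(.*\S)\s*$/; // anchored: option starts the line

function parseOptionsSketch(visible: string): NumberedOption[] {
  const lines = visible.split('\n');
  const cursorAt = lines.findIndex((l) => cursorRe.test(l));
  if (cursorAt === -1) return []; // no cursor: prose with numbers is not a list
  const out: NumberedOption[] = [];
  const first = cursorRe.exec(lines[cursorAt] ?? '');
  if (first) out.push({ index: Number(first[1]), label: first[2] ?? '' });
  for (const line of lines.slice(cursorAt + 1)) {
    const m = optionRe.exec(line);
    // Description lines and dividers simply fail to match and are skipped.
    if (m) out.push({ index: Number(m[1]), label: m[2] ?? '' });
  }
  return out;
}
```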

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add firstAUQPick + plan-ceo skip-interview routing

Calibration run 1 surfaced a second issue beyond the parser bug: the
default pick of 1 on /plan-ceo-review's scope-selection AUQ routes
the agent to "branch diff vs main" — so it reviews the gstack PR
itself (recursive!) instead of the seeded fixture plan we sent.

Added firstAUQPick callback to runPlanSkillCounting. Override applies
only to the FIRST AUQ; subsequent presses keep using defaultPick.

ceoStep0Boundary now fires on either the mode-pick AUQ (existing path)
or any AUQ containing "Skip interview and plan immediately" — which
is the scope-selection AUQ. Picking that option bypasses Step 0 and
routes straight to review-phase using the chat-paste plan as context.

Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the
"Skip interview" option by label. Falls back to "describe inline" if
the option labels change.
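
The label-based pick with its fallback can be sketched like this. `pickSkipInterviewSketch` is an illustrative name, and the final fallback to option 1 when neither label is present is an assumption of this sketch, not something the description above states.

```typescript
// Hypothetical sketch: choose an option by label rather than by position.
interface NumberedOption {
  index: number;
  label: string;
}

function pickSkipInterviewSketch(options: NumberedOption[]): number {
  const byLabel = (needle: string): NumberedOption | undefined =>
    options.find((o) => o.label.toLowerCase().includes(needle));
  const chosen = byLabel('skip interview') ?? byLabel('describe inline');
  // Assumed last resort: fall back to the first option.
  return chosen?.index ?? 1;
}
```

Matching on the label keeps the pick stable if the menu is reordered, which a hard-coded index would not survive.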

Two new unit tests: ceoStep0Boundary fires on the scope-selection
fixture; existing mode-pick fixture still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:50:09 -07:00

750 lines
26 KiB
TypeScript
/**
 * Deterministic unit tests for claude-pty-runner.ts behavior changes.
 *
 * Free-tier (no EVALS=1 needed). Runs in <1s on every `bun test`. Catches
 * harness plumbing bugs before stochastic PTY runs surface them.
 *
 * Two surface areas tested:
 *
 * 1. Permission-dialog short-circuit in 'asked' classification: a TTY frame
 *    that matches BOTH isPermissionDialogVisible AND isNumberedOptionListVisible
 *    must NOT be classified as a skill question — permission dialogs render
 *    as numbered lists too, but they're not what we're guarding.
 *
 * 2. Env passthrough surface: runPlanSkillObservation accepts an `env`
 *    option and threads it to launchClaudePty. We can't fully exercise the
 *    spawn pipeline without paying for a PTY session, but we CAN verify the
 *    option exists in the type signature and that calling without env still
 *    works (no regression).
 *
 * The PTY test (skill-e2e-plan-ceo-plan-mode.test.ts) is the integration
 * check; this file is the cheap deterministic guard for the harness primitives
 * those tests stand on.
 */
import { describe, test, expect } from 'bun:test';
import {
  isPermissionDialogVisible,
  isNumberedOptionListVisible,
  isPlanReadyVisible,
  parseNumberedOptions,
  classifyVisible,
  TAIL_SCAN_BYTES,
  optionsSignature,
  parseQuestionPrompt,
  auqFingerprint,
  COMPLETION_SUMMARY_RE,
  assertReviewReportAtBottom,
  ceoStep0Boundary,
  engStep0Boundary,
  designStep0Boundary,
  devexStep0Boundary,
  type ClaudePtyOptions,
  type AskUserQuestionFingerprint,
} from './claude-pty-runner';

describe('isPermissionDialogVisible', () => {
  test('matches "Bash command requires permission" prompts', () => {
    const sample = `
      Some preamble output
      Bash command \`gstack-config get telemetry\` requires permission to run.
      ❯ 1. Yes
        2. Yes, and always allow
        3. No, abort
    `;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('matches "allow all edits" file-edit prompts', () => {
    // Isolated to the "allow all edits" clause only — no overlapping
    // "Do you want to proceed?" co-trigger, so this asserts the clause works.
    const sample = `
      Edit to ~/.gstack/config.yaml
      ❯ 1. Yes
        2. Yes, allow all edits during this session
        3. No
    `;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('matches the "Do you want to proceed?" file-edit confirmation by itself', () => {
    // Separate fixture so weakening this clause is detected by a dedicated test.
    const sample = `
      Edit to ~/.gstack/config.yaml
      Do you want to proceed?
      ❯ 1. Yes
        2. No
    `;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('matches workspace-trust "always allow access to" prompt', () => {
    const sample = `
      Do you trust the files in this folder?
      ❯ 1. Yes, proceed
        2. Yes, and always allow access to /Users/me/repo
        3. No, exit
    `;
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });

  test('does NOT match a skill AskUserQuestion list', () => {
    const sample = `
      D1 — Premise challenge: do users actually want this?
      ❯ 1. Yes, validated
        2. No, premise is wrong
        3. Need more info
    `;
    expect(isPermissionDialogVisible(sample)).toBe(false);
  });

  test('does NOT match a plan-ready confirmation', () => {
    const sample = `
      Ready to execute the plan?
      ❯ 1. Yes
        2. No, keep planning
    `;
    expect(isPermissionDialogVisible(sample)).toBe(false);
  });

  test('does NOT match a skill question that contains the bare phrase "Do you want to proceed?"', () => {
    // Co-trigger requirement: "Do you want to proceed?" alone is not enough.
    // It must appear with "Edit to <path>" or "Write to <path>" to count as
    // a permission dialog. This guards against a skill question like
    // "Do you want to proceed with HOLD SCOPE?" being mis-classified.
    const sample = `
      Choose your scope mode for this review.
      Do you want to proceed?
      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
        3. SELECTIVE EXPANSION
    `;
    expect(isPermissionDialogVisible(sample)).toBe(false);
  });

  test('KNOWN LIMITATION: still matches when skill-question prose includes "Edit to <path>" alongside the bare proceed phrase', () => {
    // Adversarial fixture: a skill question whose body legitimately mentions
    // "Edit to <path>" in prose AND ends with "Do you want to proceed?". The
    // current co-trigger regex mis-classifies this as a permission dialog.
    // Ideally this would return false, but that requires tightening the
    // regex further (e.g., a proximity constraint, or anchoring "Edit to"
    // to a line start). Until then the test pins the current behavior so a
    // future fix can flip the assertion intentionally.
    const sample = `
      Plan: I will Edit to ./plan.md to capture the decision.
      Do you want to proceed?
      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
    `;
    // The co-trigger fires here. Documented as a post-merge follow-up.
    // Flip this assertion once the regex tightens.
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });
});

describe('isNumberedOptionListVisible', () => {
  test('matches a basic ❯ 1. + 2. cursor list', () => {
    const sample = `
      ❯ 1. Option one
        2. Option two
        3. Option three
    `;
    expect(isNumberedOptionListVisible(sample)).toBe(true);
  });

  test('returns false on a single-option prompt', () => {
    const sample = `
      ❯ 1. Only option
    `;
    expect(isNumberedOptionListVisible(sample)).toBe(false);
  });

  test('returns false when no cursor renders', () => {
    const sample = `
      Just some prose with 1. a numbered point and 2. another.
    `;
    expect(isNumberedOptionListVisible(sample)).toBe(false);
  });

  test('overlaps permission dialogs (this is why D5 short-circuits)', () => {
    // The whole point of D5: this string matches BOTH classifiers, so the
    // runner must consult isPermissionDialogVisible to disambiguate.
    const sample = `
      Bash command \`do-thing\` requires permission to run.
      ❯ 1. Yes
        2. No
    `;
    expect(isNumberedOptionListVisible(sample)).toBe(true);
    expect(isPermissionDialogVisible(sample)).toBe(true);
  });
});

describe('classifyVisible (runtime path through the runner classifier)', () => {
  // These tests call the actual classifier so a future contributor who
  // reorders branches (e.g. moves the permission short-circuit before
  // isPlanReadyVisible) is caught deterministically.
  test('skill question → returns asked', () => {
    const visible = `
      D1 — Choose your scope mode
      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
        3. SELECTIVE EXPANSION
        4. SCOPE REDUCTION
    `;
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('asked');
  });

  test('permission dialog (Bash) → returns null (skip, keep polling)', () => {
    const visible = `
      Bash command \`gstack-update-check\` requires permission to run.
      ❯ 1. Yes
        2. No
    `;
    expect(isNumberedOptionListVisible(visible)).toBe(true); // pre-filter
    expect(classifyVisible(visible)).toBeNull(); // post-filter
  });

  test('plan-ready confirmation → returns plan_ready (wins over asked)', () => {
    const visible = `
      Ready to execute the plan?
      ❯ 1. Yes, proceed
        2. No, keep planning
    `;
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('plan_ready');
  });

  test('silent write to unsanctioned path → returns silent_write', () => {
    const visible = `
      ⏺ Write(src/app/dangerous-write.ts)
      ⎿ Wrote 42 lines
    `;
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('silent_write');
    expect(result?.summary).toContain('src/app/dangerous-write.ts');
  });

  test('write to sanctioned path (.claude/plans) → returns null (allowed)', () => {
    const visible = `
      ⏺ Write(/Users/me/.claude/plans/some-plan.md)
      ⎿ Wrote 42 lines
    `;
    expect(classifyVisible(visible)).toBeNull();
  });

  test('write while a permission dialog is on screen → returns null (gated, not silent, not asked)', () => {
    const visible = `
      ⏺ Write(src/app/edit-with-permission.ts)
      Edit to src/app/edit-with-permission.ts
      Do you want to proceed?
      ❯ 1. Yes
        2. No
    `;
    // The numbered prompt is a permission dialog (Edit to + Do you want to proceed?);
    // silent_write is suppressed because a numbered prompt is visible, AND
    // 'asked' is suppressed because the prompt is a permission dialog.
    expect(classifyVisible(visible)).toBeNull();
  });

  test('write while a real skill question is on screen → returns asked (write is captured but not silent)', () => {
    const visible = `
      ⏺ Write(src/app/foo.ts)
      D1 — Choose your scope mode
      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
    `;
    // The numbered prompt is a skill question, not a permission dialog;
    // silent_write is suppressed (numbered prompt is visible) and the
    // outcome is 'asked' — Step 0 fired.
    const result = classifyVisible(visible);
    expect(result?.outcome).toBe('asked');
  });

  test('idle / no signals → returns null', () => {
    const visible = `
      Some prose without any classifier signals.
    `;
    expect(classifyVisible(visible)).toBeNull();
  });

  test('TAIL_SCAN_BYTES is exported as 1500', () => {
    // Shared between runner and routing test; a regression that desyncs the
    // recent-tail window would surface here.
    expect(TAIL_SCAN_BYTES).toBe(1500);
  });
});

describe('parseNumberedOptions', () => {
  test('extracts options from a clean cursor list', () => {
    const visible = `
      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
    `;
    const opts = parseNumberedOptions(visible);
    expect(opts).toHaveLength(2);
    expect(opts[0]).toEqual({ index: 1, label: 'HOLD SCOPE' });
    expect(opts[1]).toEqual({ index: 2, label: 'SCOPE EXPANSION' });
  });

  test('returns empty array on prose-with-numbers (no cursor)', () => {
    expect(parseNumberedOptions('text 1. one 2. two')).toEqual([]);
  });

  test('extracts options when the cursor is INLINE with prompt header (box-layout)', () => {
    // Real /plan-ceo-review rendering: the TTY's cursor-positioning escapes
    // collapse divider + header + prompt + cursor onto one logical line.
    // Subsequent options (2..7) still start their own lines.
    const visible = [
      '────────────────────────────────────────',
      '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main',
      ' Review the full branch: ~10K LOC.',
      '2. A specific plan file or design doc',
      ' You point me at a file (path) and I review that.',
      '3. An idea you\'ll describe inline',
      '4. Cancel — wrong skill',
      '5. Type something.',
      '────────────────────────────────────────',
      '6. Chat about this',
      '7. Skip interview and plan immediately',
    ].join('\n');
    const opts = parseNumberedOptions(visible);
    expect(opts).toHaveLength(7);
    expect(opts[0]).toEqual({ index: 1, label: "The branch's diff vs main" });
    expect(opts[1]?.index).toBe(2);
    expect(opts[6]?.index).toBe(7);
    expect(opts[6]?.label).toBe('Skip interview and plan immediately');
  });

  test('inline-cursor and start-of-line cursor layouts both parse to the same 3 options', () => {
    // The inline path captures option 1 from the cursor line itself; the
    // subsequent-lines path captures the remaining options with the
    // existing optionRe.
    const inlineLayout = [
      'header text ❯ 1. first option',
      '2. second',
      '3. third',
    ].join('\n');
    expect(parseNumberedOptions(inlineLayout)).toEqual([
      { index: 1, label: 'first option' },
      { index: 2, label: 'second' },
      { index: 3, label: 'third' },
    ]);
    const cleanLayout = [
      '❯ 1. first option',
      '  2. second',
      '  3. third',
    ].join('\n');
    expect(parseNumberedOptions(cleanLayout)).toEqual([
      { index: 1, label: 'first option' },
      { index: 2, label: 'second' },
      { index: 3, label: 'third' },
    ]);
  });
});

describe('runPlanSkillObservation env passthrough surface', () => {
  test('ClaudePtyOptions exposes env: Record<string, string>', () => {
    // Type-level guard: this file would fail to compile if the env field
    // were removed or its shape regressed. The actual env merge happens in
    // launchClaudePty's spawn call (`env: { ...process.env, ...opts.env }`),
    // so a regression where `env: opts.env` gets dropped from the
    // runPlanSkillObservation -> launchClaudePty handoff is only caught by
    // the live PTY test, not here.
    const opts: ClaudePtyOptions = {
      env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
    };
    expect(opts.env).toEqual({ QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' });
  });
});

// ────────────────────────────────────────────────────────────────────────────
// Per-finding count primitives — Section 3 unit tests #1-#5, #7, #12.
// ────────────────────────────────────────────────────────────────────────────
describe('optionsSignature', () => {
  test('returns a "|"-joined `index:label` string for a clean list', () => {
    const sig = optionsSignature([
      { index: 1, label: 'HOLD SCOPE' },
      { index: 2, label: 'SCOPE EXPANSION' },
    ]);
    expect(sig).toBe('1:HOLD SCOPE|2:SCOPE EXPANSION');
  });

  test('order-independent: shuffled inputs produce the same signature', () => {
    // parseNumberedOptions already returns sorted, but defensive sort means
    // a future caller that hands us shuffled input still produces a stable
    // dedupe signature.
    const a = optionsSignature([
      { index: 2, label: 'B' },
      { index: 1, label: 'A' },
      { index: 3, label: 'C' },
    ]);
    const b = optionsSignature([
      { index: 1, label: 'A' },
      { index: 2, label: 'B' },
      { index: 3, label: 'C' },
    ]);
    expect(a).toBe(b);
  });

  test('empty list returns empty string', () => {
    expect(optionsSignature([])).toBe('');
  });

  test('single-item list returns just that entry', () => {
    expect(optionsSignature([{ index: 1, label: 'Only' }])).toBe('1:Only');
  });
});

describe('parseQuestionPrompt', () => {
  test('captures 1-line prompt above the cursor', () => {
    const visible = `
      D1 — Pick a mode
      ❯ 1. HOLD SCOPE
        2. SCOPE EXPANSION
    `;
    const prompt = parseQuestionPrompt(visible);
    expect(prompt).toBe('D1 — Pick a mode');
  });

  test('captures multi-line prompt above the cursor', () => {
    const visible = `
      D2 — Approach selection
      Which architecture should we follow?
      ❯ 1. Bypass existing helper
        2. Reuse existing helper
    `;
    const prompt = parseQuestionPrompt(visible);
    // Multi-line prompts get joined with single spaces.
    expect(prompt).toContain('D2 — Approach selection');
    expect(prompt).toContain('Which architecture should we follow?');
  });

  test('returns "" when no cursor is rendered', () => {
    expect(parseQuestionPrompt('Just some prose.\nNo cursor.')).toBe('');
  });

  test('truncates to 240 chars', () => {
    const longPrompt = 'A'.repeat(500);
    const visible = `${longPrompt}\n\n❯ 1. yes\n  2. no`;
    expect(parseQuestionPrompt(visible).length).toBeLessThanOrEqual(240);
  });

  test('does not pull text from a previous numbered list above', () => {
    const visible = `
        1. previous answered question
        2. previous option two
      D2 — A new question text
      ❯ 1. fresh option A
        2. fresh option B
    `;
    const prompt = parseQuestionPrompt(visible);
    // Stops at the previous numbered-list line; should NOT contain "previous answered question".
    expect(prompt).toContain('D2 — A new question text');
    expect(prompt).not.toContain('previous answered question');
  });

  test('normalizes whitespace (collapses runs of spaces and tabs)', () => {
    // The prompt line deliberately contains runs of spaces and tabs.
    const visible = `D1   —   Spaced\t\tout
❯ 1. yes
  2. no`;
    expect(parseQuestionPrompt(visible)).toBe('D1 — Spaced out');
  });

  test('inline-cursor box-layout: extracts prompt text BEFORE ❯ 1. on the cursor line', () => {
    // Real /plan-ceo-review rendering: divider + ☐ header + prompt text +
    // cursor are all on one logical line because TTY cursor-positioning
    // escapes collapse the box layout under stripAnsi.
    const visible = [
      '──────────────────',
      '☐ Review scope What scope do you want me to CEO-review? ❯ 1. The branch\'s diff vs main',
      '2. A specific plan file',
      '3. An idea inline',
    ].join('\n');
    const prompt = parseQuestionPrompt(visible);
    // Should extract "Review scope" and the prompt text, dropping the ☐ box-drawing sigil.
    expect(prompt).toContain('Review scope');
    expect(prompt).toContain('What scope do you want me to CEO-review?');
    expect(prompt).not.toContain('❯');
    expect(prompt).not.toMatch(/^☐/);
  });
});

describe('auqFingerprint', () => {
  test('returns the same fingerprint for identical inputs', () => {
    const opts = [
      { index: 1, label: 'A' },
      { index: 2, label: 'B' },
    ];
    expect(auqFingerprint('hello', opts)).toBe(auqFingerprint('hello', opts));
  });

  test('different prompts with shared option labels produce DIFFERENT fingerprints', () => {
    // The collision regression Codex F1 caught: option-label-only fingerprints
    // collapsed multiple distinct findings into one when they shared menu shape.
    const sharedOpts = [
      { index: 1, label: 'Add to plan' },
      { index: 2, label: 'Defer' },
      { index: 3, label: 'Build now' },
    ];
    const fpFinding1 = auqFingerprint('D5 — Architecture: bypass helper?', sharedOpts);
    const fpFinding2 = auqFingerprint('D6 — Tests: zero coverage?', sharedOpts);
    expect(fpFinding1).not.toBe(fpFinding2);
  });

  test('same prompt with different options produces DIFFERENT fingerprints', () => {
    const prompt = 'D1 — Pick a mode';
    const fpA = auqFingerprint(prompt, [
      { index: 1, label: 'HOLD SCOPE' },
      { index: 2, label: 'SCOPE EXPANSION' },
    ]);
    const fpB = auqFingerprint(prompt, [
      { index: 1, label: 'HOLD SCOPE' },
      { index: 2, label: 'SCOPE REDUCTION' },
    ]);
    expect(fpA).not.toBe(fpB);
  });

  test('whitespace-only differences in prompt do NOT change the fingerprint', () => {
    // Same content, different rendering whitespace (TTY redraw artifact)
    // must produce the same fingerprint so dedupe survives reflow.
    const opts = [{ index: 1, label: 'A' }, { index: 2, label: 'B' }];
    const fpA = auqFingerprint('Pick a mode', opts);
    const fpB = auqFingerprint('Pick   a    mode', opts);
    expect(fpA).toBe(fpB);
  });

  test('empty prompt + same options collide (caller must guard against this)', () => {
    // Documents the contract: empty-prompt fingerprints WILL collide if the
    // caller fingerprints them. runPlanSkillCounting must skip empty-prompt
    // AUQs and re-poll instead.
    const opts = [{ index: 1, label: 'A' }];
    expect(auqFingerprint('', opts)).toBe(auqFingerprint('', opts));
  });
});

describe('COMPLETION_SUMMARY_RE', () => {
  test('matches GSTACK REVIEW REPORT heading', () => {
    expect(COMPLETION_SUMMARY_RE.test('## GSTACK REVIEW REPORT')).toBe(true);
  });

  test('matches Completion Summary heading (ceo + eng)', () => {
    expect(COMPLETION_SUMMARY_RE.test('## Completion Summary')).toBe(true);
    expect(COMPLETION_SUMMARY_RE.test('## Completion summary')).toBe(true);
  });

  test('matches Status: clean (CEO review-log shape)', () => {
    expect(COMPLETION_SUMMARY_RE.test('Status: clean')).toBe(true);
    expect(COMPLETION_SUMMARY_RE.test('Status: issues_open')).toBe(true);
  });

  test('matches VERDICT: line', () => {
    expect(COMPLETION_SUMMARY_RE.test('VERDICT: CLEARED — Eng Review passed')).toBe(true);
  });

  test('does NOT match prose mentions of "verdict" mid-line', () => {
    // VERDICT must be at the start of a line to count.
    expect(COMPLETION_SUMMARY_RE.test('the final verdict: undecided')).toBe(false);
  });
});

describe('assertReviewReportAtBottom', () => {
  test('passes when REVIEW REPORT is the only/last ## heading', () => {
    const content = `# Plan
## Context
stuff
## Approach
more stuff
## GSTACK REVIEW REPORT
| col | col |
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(true);
  });

  test('fails when REVIEW REPORT is missing', () => {
    const content = `# Plan
## Context
stuff
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(false);
    expect(r.reason).toMatch(/no GSTACK REVIEW REPORT/);
  });

  test('fails when REVIEW REPORT exists but a ## heading follows it', () => {
    const content = `# Plan
## GSTACK REVIEW REPORT
| col | col |
## Late Section
oops
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(false);
    expect(r.reason).toMatch(/trailing ## heading/);
    expect(r.trailingHeadings).toEqual(['## Late Section']);
  });

  test('passes when only ### subheadings follow REVIEW REPORT (deeper nesting allowed)', () => {
    const content = `## GSTACK REVIEW REPORT
### Cross-model tension
- F1: resolved
- F2: resolved
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(true);
  });

  test('fails with multiple trailing ## headings reported', () => {
    const content = `## GSTACK REVIEW REPORT
## First trailing
## Second trailing
`;
    const r = assertReviewReportAtBottom(content);
    expect(r.ok).toBe(false);
    expect(r.trailingHeadings).toHaveLength(2);
  });
});

describe('Step0BoundaryPredicate per-skill', () => {
  // Helper to build a synthetic fingerprint for predicate tests.
  function fp(promptSnippet: string, optionLabels: string[]): AskUserQuestionFingerprint {
    const options = optionLabels.map((label, i) => ({ index: i + 1, label }));
    return {
      signature: auqFingerprint(promptSnippet, options),
      promptSnippet,
      options,
      observedAtMs: 0,
      preReview: true,
    };
  }

  describe('ceoStep0Boundary', () => {
    test('FIRES on Step 0F mode-pick AUQ (HOLD SCOPE in options)', () => {
      const f = fp('Pick a mode', ['HOLD SCOPE', 'SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'SCOPE REDUCTION']);
      expect(ceoStep0Boundary(f)).toBe(true);
    });

    test('FIRES on scope-selection AUQ with "Skip interview" option (skip-interview path)', () => {
      // After calibration run 1: plan-ceo's first AUQ is scope-selection,
      // and we route via "Skip interview and plan immediately" to bypass
      // Step 0 entirely. Boundary must fire on this AUQ so subsequent
      // AUQs go to reviewCount.
      const f = fp(
        'What scope do you want me to CEO-review?',
        [
          "The branch's diff vs main",
          'A specific plan file',
          "An idea you'll describe inline",
          'Cancel — wrong skill',
          'Type something.',
          'Chat about this',
          'Skip interview and plan immediately',
        ],
      );
      expect(ceoStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on premise challenge AUQs', () => {
      const f = fp('D1 — Premise check: is this the right problem?', ['Yes', 'No', 'Other']);
      expect(ceoStep0Boundary(f)).toBe(false);
    });

    test('does NOT fire on review-section AUQs', () => {
      const f = fp('Architecture: bypass helper?', ['Reuse existing', 'Roll new', 'Defer']);
      expect(ceoStep0Boundary(f)).toBe(false);
    });
  });

  describe('engStep0Boundary', () => {
    test('FIRES on cross-project learnings prompt', () => {
      const f = fp('Enable cross-project learnings on this machine?', ['Yes', 'No']);
      expect(engStep0Boundary(f)).toBe(true);
    });

    test('FIRES on scope reduction recommendation', () => {
      const f = fp('Scope reduction recommendation: cut to MVP?', ['Reduce', 'Proceed', 'Modify']);
      expect(engStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on review-section AUQs', () => {
      const f = fp('Architecture: shared mutable state?', ['Refactor', 'Defer', 'Skip']);
      expect(engStep0Boundary(f)).toBe(false);
    });
  });

  describe('designStep0Boundary', () => {
    test('FIRES on design system / posture mention', () => {
      const f = fp('Pick a design posture for this review', ['Polish', 'Triage', 'Expansion']);
      expect(designStep0Boundary(f)).toBe(true);
    });

    test('FIRES on first-dimension prompt', () => {
      const f = fp('First dimension: visual hierarchy. Score?', ['7', '8', '9']);
      expect(designStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on later dimension AUQs', () => {
      const f = fp('Spacing dimension score?', ['7', '8', '9']);
      expect(designStep0Boundary(f)).toBe(false);
    });
  });

  describe('devexStep0Boundary', () => {
    test('FIRES on developer persona selection', () => {
      const f = fp('Pick the target persona for this review', ['Senior backend', 'Junior frontend', 'Other']);
      expect(devexStep0Boundary(f)).toBe(true);
    });

    test('FIRES on TTHW target prompt', () => {
      const f = fp('What is the TTHW target for first run?', ['<5 min', '<15 min', '<30 min']);
      expect(devexStep0Boundary(f)).toBe(true);
    });

    test('does NOT fire on review-section AUQs', () => {
      const f = fp('Friction point: 5-min CI wait. Address?', ['Now', 'Defer', 'Skip']);
      expect(devexStep0Boundary(f)).toBe(false);
    });
  });
});