mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 11:17:50 +02:00
454423aeb3
* test: extract classifyVisible() + permission-dialog filter in PTY runner
Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:
- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
gstack-config does not yet honor env overrides; plumbing is in place for a
future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
requires a file-edit context co-trigger. Other clauses unchanged. Skill
questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
plan_ready → asked → null. Permission dialogs filtered out of the
'asked' classification so a permission prompt cannot pose as a Step 0
skill question.
Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.
* test: tighten plan-ceo-review smoke to require Step 0 fires first
Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.
Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.
Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.
Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.
Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.
* chore: capture v1.21.1.0 follow-ups in TODOS.md
- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS
All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.
* chore: bump version and changelog (v1.21.1.0)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: extract MODE_RE + optionsSignature into PTY runner exports
Refactor prep for the upcoming per-finding AskUserQuestion count test
across plan-{ceo,eng,design,devex}-review. Both new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe — pulling them into one source of truth in
test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the
fingerprint shape) updates everywhere instead of drifting per-test.
Mechanical: no behavior change in the mode-routing test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add per-finding count primitives + unit tests
Pure helpers landing ahead of runPlanSkillCounting:
- parseQuestionPrompt(visible) — extract the 1-3 line prompt above
the latest "❯ 1." cursor, normalize to a 240-char snippet
- auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
options signature; distinct prompts with shared option labels
(the generic A/B/C TODO menu) get distinct fingerprints
- COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
plan-review skills' completion / verdict markers
- assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
REPORT" is present and is the last "## " heading in a plan file
- Step0BoundaryPredicate type + four per-skill predicates
(ceo / eng / design / devex) — fire on the answered AUQ's
fingerprint, marking the end of Step 0 deterministically
(event-based, not content-based, per Codex F7)
Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add runPlanSkillCounting PTY helper
Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:
- Boot + auto-trust handler (existing launchClaudePty)
- Send slash command alone, sleep 3s, send plan content as follow-up
message (proven pattern from skill-e2e-plan-design-with-ui)
- Poll loop with permission-dialog auto-grant, same-redraw skip,
empty-prompt re-poll
- Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
the answered AUQ's fingerprint (Codex F7 — boundary is observed
event, not later rendered content)
- Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
plan_ready, silent_write, exited, timeout
Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.
No E2E tests yet — those land in commit 5 with the four skill fixtures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: register four finding-count tests in touchfiles + tier map
Each new test depends on its skill template, the runner, and three
preamble resolvers (preamble.ts, generate-ask-user-format.ts,
generate-completion-status.ts) — those affect question cadence and
completion rendering, which is exactly what the test asserts on.
All four classified periodic. Sequential execution during calibration;
opt-in to concurrent only after measured comparison agrees (plan §D15).
Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests
(was 18) because plan-ceo-finding-count joins the family.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)
Each test drives its plan-* skill through Step 0 then asserts the
review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5
seeded plan, plus D19: produced plan file ends with
"## GSTACK REVIEW REPORT" as its last "## " heading.
plan-ceo also runs a paired-finding positive control: 2 deliberately
related findings should still produce 2 distinct AUQs, not 1 batched.
Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic).
Sequential execution by plan §D15. Each fixture is inline TypeScript
content delivered as a follow-up message after the slash command, per
the proven pattern at skill-e2e-plan-design-with-ui.test.ts.
Calibration loop (5 runs per skill) and the manual pre-merge negative
check (D7 + D12) are required before merge per plan §Verification.
NOT yet run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: fix parseNumberedOptions for inline-cursor box-layout AUQs
Calibration run 1 timed out with step0=0 review=0 because the parser
could not find the cursor in /plan-ceo-review's scope-selection AUQ.
The TTY's box-layout rendering inlines divider + header + prompt +
"1." onto one logical line — cursor escapes get stripped, leaving
text crushed onto a single line.
Cursor anchor regex changed from anchored to unanchored so it matches
mid-line. Cursor-line option extraction uses a non-anchored regex;
subsequent options stay with the original start-of-line parser.
parseQuestionPrompt picks up the inline prompt text BEFORE the cursor
on the cursor line (after stripping box-drawing chars + sigil) and
appends it after any walked-up multi-line prompt above.
Three new unit tests: clean-cursor still works, inline-cursor
extracts all 7 options, prompt extraction strips box chars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add firstAUQPick + plan-ceo skip-interview routing
Calibration run 1 surfaced a second issue beyond the parser bug: the
default pick of 1 on /plan-ceo-review's scope-selection AUQ routes
the agent to "branch diff vs main" — so it reviews the gstack PR
itself (recursive!) instead of the seeded fixture plan we sent.
Added firstAUQPick callback to runPlanSkillCounting. Override applies
only to the FIRST AUQ; subsequent presses keep using defaultPick.
ceoStep0Boundary now fires on either the mode-pick AUQ (existing path)
or any AUQ containing "Skip interview and plan immediately" — which
is the scope-selection AUQ. Picking that option bypasses Step 0 and
routes straight to review-phase using the chat-paste plan as context.
Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the
"Skip interview" option by label. Falls back to "describe inline" if
the option labels change.
Two new unit tests: ceoStep0Boundary fires on the scope-selection
fixture; existing mode-pick fixture still fires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
254 lines
9.7 KiB
TypeScript
254 lines
9.7 KiB
TypeScript
/**
|
|
* /plan-ceo-review per-finding AskUserQuestion count (periodic, paid, real-PTY).
|
|
*
|
|
* Asserts the load-bearing rule "One issue = one AskUserQuestion call" by
|
|
* driving /plan-ceo-review against a 5-finding seeded plan and counting
|
|
* distinct review-phase AUQs. Passes when count is in [N-1, N+2].
|
|
*
|
|
* Two tests in this file:
|
|
* - 5-finding distinct fixture: count band assertion + D19 review-report-at-bottom.
|
|
* - 2-finding paired control (D12 positive control): related findings still
|
|
* produce 2 distinct AUQs, not 1 batched, when the rule is honored.
|
|
*
|
|
* Tier: periodic. Each run drives Step 0 + 11 review sections end-to-end
|
|
* (~25 min, ~$5/run). Sequential by default per plan §D15. See
|
|
* test/helpers/claude-pty-runner.ts for runPlanSkillCounting internals.
|
|
*/
|
|
|
|
import { describe, test } from 'bun:test';
|
|
import * as fs from 'node:fs';
|
|
import {
|
|
runPlanSkillCounting,
|
|
ceoStep0Boundary,
|
|
assertReviewReportAtBottom,
|
|
type AskUserQuestionFingerprint,
|
|
} from './helpers/claude-pty-runner';
|
|
|
|
/**
|
|
* /plan-ceo-review's first AUQ asks "what scope?" with options like
|
|
* 1. Branch diff vs main
|
|
* 2. A specific plan file or design doc
|
|
* 3. An idea you'll describe inline
|
|
* ...
|
|
* 7. Skip interview and plan immediately
|
|
*
|
|
* The default pick (1) routes to "branch diff vs main" — the wrong target
|
|
* for our seeded fixture (the agent would review the gstack PR itself,
|
|
* recursively). Picking "Skip interview and plan immediately" bypasses
|
|
* Step 0 and routes the agent to review the chat context (where our
|
|
* follow-up plan was pasted).
|
|
*/
|
|
function pickSkipInterview(fp: AskUserQuestionFingerprint): number {
|
|
const skipOpt = fp.options.find((o) =>
|
|
/skip\s+interview|plan\s+immediately/i.test(o.label),
|
|
);
|
|
if (skipOpt) return skipOpt.index;
|
|
// Fallback: "describe inline" also routes to using our pasted plan.
|
|
const inlineOpt = fp.options.find((o) =>
|
|
/describe.*inline|inline.*idea/i.test(o.label),
|
|
);
|
|
if (inlineOpt) return inlineOpt.index;
|
|
return 1;
|
|
}
|
|
|
|
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
|
|
const describeE2E = shouldRun ? describe : describe.skip;
|
|
|
|
const N_DISTINCT = 5;
|
|
const FLOOR_DISTINCT = N_DISTINCT - 1; // 4 (D11)
|
|
const CEILING_DISTINCT = N_DISTINCT + 2; // 7 (D11)
|
|
|
|
const N_PAIRED = 2;
|
|
const FLOOR_PAIRED = 2;
|
|
const CEILING_PAIRED = 4;
|
|
|
|
const PLAN_CEO_5_FINDINGS = [
|
|
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo.md (use Edit/Write to that exact path).',
|
|
'',
|
|
'# Plan: Payment Processing Integration',
|
|
'',
|
|
'## Architecture',
|
|
"We're adding a new `PaymentService` class that will handle Stripe webhooks.",
|
|
'This bypasses the existing `WebhookDispatcher` module — we want a clean',
|
|
'namespace separation.',
|
|
'',
|
|
'## Database access',
|
|
'The new endpoint reads `request.params.userId` directly into a raw SQL',
|
|
'fragment for the lookup query.',
|
|
'',
|
|
'## Webhook fan-out',
|
|
'On payment success we update the user record AND fire a notification email.',
|
|
'Both happen inline; no error handling on the email leg.',
|
|
'',
|
|
'## Tests',
|
|
"None planned. We'll rely on the existing integration suite catching regressions.",
|
|
'',
|
|
'## Performance',
|
|
'Each webhook lookup hits the database for the user, then fetches each',
|
|
'order in a loop.',
|
|
].join('\n');
|
|
|
|
const PLAN_CEO_2_PAIRED_FINDINGS = [
|
|
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo-paired.md (use Edit/Write to that exact path).',
|
|
'',
|
|
'# Plan: Payment Processing — Test Coverage',
|
|
'',
|
|
'## Tests',
|
|
'We need test coverage for `processPayment()`. Specifically:',
|
|
'1. The happy path (successful Stripe charge — assert correct receipt is generated).',
|
|
'2. The error/timeout path (Stripe returns 502 — assert retry-with-backoff fires once, then fails clean).',
|
|
'',
|
|
'Currently neither has a unit test. These are deliberately separate concerns:',
|
|
'the success path is correctness, the failure path is graceful degradation.',
|
|
].join('\n');
|
|
|
|
const PLAN_CEO_PATH = '/tmp/gstack-test-plan-ceo.md';
|
|
const PLAN_CEO_PAIRED_PATH = '/tmp/gstack-test-plan-ceo-paired.md';
|
|
|
|
describeE2E('/plan-ceo-review per-finding AskUserQuestion count (periodic)', () => {
|
|
test(
|
|
`5-finding plan emits ${FLOOR_DISTINCT}-${CEILING_DISTINCT} review-phase AskUserQuestions`,
|
|
async () => {
|
|
try {
|
|
fs.rmSync(PLAN_CEO_PATH, { force: true });
|
|
} catch {
|
|
/* best-effort */
|
|
}
|
|
|
|
const obs = await runPlanSkillCounting({
|
|
skillName: 'plan-ceo-review',
|
|
slashCommand: '/plan-ceo-review',
|
|
followUpPrompt: PLAN_CEO_5_FINDINGS,
|
|
isLastStep0AUQ: ceoStep0Boundary,
|
|
reviewCountCeiling: CEILING_DISTINCT + 1, // hard cap above assertion ceiling
|
|
firstAUQPick: pickSkipInterview, // bypass scope-selection, route to review
|
|
cwd: process.cwd(),
|
|
timeoutMs: 1_500_000, // 25 min
|
|
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
|
|
});
|
|
|
|
try {
|
|
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
|
|
throw new Error(
|
|
`plan-ceo-review finding-count FAILED: outcome=${obs.outcome}\n` +
|
|
`step0=${obs.step0Count} review=${obs.reviewCount} elapsed=${obs.elapsedMs}ms\n` +
|
|
`fingerprints (last 8):\n` +
|
|
obs.fingerprints
|
|
.slice(-8)
|
|
.map(
|
|
(f, i) =>
|
|
` ${i}. preReview=${f.preReview} sig=${f.signature.slice(0, 12)} prompt="${f.promptSnippet.slice(0, 60)}"`,
|
|
)
|
|
.join('\n') +
|
|
`\n--- evidence (last 3KB) ---\n${obs.evidence}`,
|
|
);
|
|
}
|
|
if (obs.reviewCount < FLOOR_DISTINCT) {
|
|
throw new Error(
|
|
`BAND FAIL (below floor): reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_DISTINCT}.\n` +
|
|
`Likely batching regression — agent collapsed multiple findings into fewer questions.\n` +
|
|
`Fingerprints (review-phase only):\n` +
|
|
obs.fingerprints
|
|
.filter((f) => !f.preReview)
|
|
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
|
|
.join('\n'),
|
|
);
|
|
}
|
|
if (obs.reviewCount > CEILING_DISTINCT) {
|
|
throw new Error(
|
|
`BAND FAIL (above ceiling): reviewCount=${obs.reviewCount} > CEILING=${CEILING_DISTINCT}.\n` +
|
|
`Possible over-asking regression. Review-phase fingerprints:\n` +
|
|
obs.fingerprints
|
|
.filter((f) => !f.preReview)
|
|
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
|
|
.join('\n'),
|
|
);
|
|
}
|
|
|
|
// D19: review report at bottom of plan file.
|
|
if (!fs.existsSync(PLAN_CEO_PATH)) {
|
|
throw new Error(
|
|
`D19 FAIL: agent did not produce expected plan file at ${PLAN_CEO_PATH}.\n` +
|
|
`Either the agent ignored the path instruction in the follow-up prompt, or\n` +
|
|
`the helper exited before the agent wrote the file. ` +
|
|
`outcome=${obs.outcome} review=${obs.reviewCount}`,
|
|
);
|
|
}
|
|
const planContent = fs.readFileSync(PLAN_CEO_PATH, 'utf-8');
|
|
const verdict = assertReviewReportAtBottom(planContent);
|
|
if (!verdict.ok) {
|
|
throw new Error(
|
|
`D19 FAIL: plan file at ${PLAN_CEO_PATH} ${verdict.reason}\n` +
|
|
(verdict.trailingHeadings
|
|
? `Trailing headings: ${verdict.trailingHeadings.join(' | ')}\n`
|
|
: '') +
|
|
`--- plan content (last 1KB) ---\n${planContent.slice(-1024)}`,
|
|
);
|
|
}
|
|
} finally {
|
|
try {
|
|
fs.rmSync(PLAN_CEO_PATH, { force: true });
|
|
} catch {
|
|
/* best-effort */
|
|
}
|
|
}
|
|
},
|
|
1_700_000,
|
|
);
|
|
|
|
test(
|
|
`paired-finding positive control: ${N_PAIRED} related findings produce ${FLOOR_PAIRED}-${CEILING_PAIRED} AskUserQuestions`,
|
|
async () => {
|
|
try {
|
|
fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true });
|
|
} catch {
|
|
/* best-effort */
|
|
}
|
|
|
|
const obs = await runPlanSkillCounting({
|
|
skillName: 'plan-ceo-review',
|
|
slashCommand: '/plan-ceo-review',
|
|
followUpPrompt: PLAN_CEO_2_PAIRED_FINDINGS,
|
|
isLastStep0AUQ: ceoStep0Boundary,
|
|
reviewCountCeiling: CEILING_PAIRED + 1,
|
|
cwd: process.cwd(),
|
|
timeoutMs: 1_500_000,
|
|
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
|
|
});
|
|
|
|
try {
|
|
if (!['plan_ready', 'completion_summary', 'ceiling_reached'].includes(obs.outcome)) {
|
|
throw new Error(
|
|
`paired-finding control FAILED: outcome=${obs.outcome}\n` +
|
|
`step0=${obs.step0Count} review=${obs.reviewCount}\n` +
|
|
`--- evidence (last 3KB) ---\n${obs.evidence}`,
|
|
);
|
|
}
|
|
if (obs.reviewCount < FLOOR_PAIRED) {
|
|
throw new Error(
|
|
`PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} < FLOOR=${FLOOR_PAIRED}.\n` +
|
|
`Two deliberately related findings were batched into <2 questions — the rule failed under D12.\n` +
|
|
`Review-phase fingerprints:\n` +
|
|
obs.fingerprints
|
|
.filter((f) => !f.preReview)
|
|
.map((f) => ` - "${f.promptSnippet.slice(0, 80)}"`)
|
|
.join('\n'),
|
|
);
|
|
}
|
|
if (obs.reviewCount > CEILING_PAIRED) {
|
|
throw new Error(
|
|
`PAIRED CONTROL FAIL: reviewCount=${obs.reviewCount} > CEILING=${CEILING_PAIRED} (over-asking on a 2-finding fixture).`,
|
|
);
|
|
}
|
|
} finally {
|
|
try {
|
|
fs.rmSync(PLAN_CEO_PAIRED_PATH, { force: true });
|
|
} catch {
|
|
/* best-effort */
|
|
}
|
|
}
|
|
},
|
|
1_700_000,
|
|
);
|
|
});
|