v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills (#1354)

* feat(test/helpers): runPlanSkillFloorCheck — minimal AskUserQuestion-floor observer

Adds a focused PTY observer that exits at the first non-permission
numbered-option render. Catches the May 2026 transcript-bug class
(model wrote plan + ExitPlanMode without firing any AUQ) without
needing to fingerprint or navigate past the AUQ.

Why separate from runPlanSkillCounting: plan-mode AUQs render every
option on a single logical line via cursor-positioning escapes that
stripAnsi can't simulate, so parseNumberedOptions returns < 2 options
and never records a fingerprint. Counting tests work on 25-min budgets
because eventually one frame parses cleanly; gate-tier floor tests
need to exit early on the first observation. Trades fingerprint
precision for early-exit reliability.

Also drops COMPLETION_SUMMARY_RE check from this helper — it matches
"GSTACK REVIEW REPORT" anywhere in the buffer including when the
agent does recon by reading existing plan files. plan_ready
(claude's actual "Ready to execute" confirmation) is the reliable
terminal signal for "agent finished without asking."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resolvers): generateAntiShortcutClause shared resolver

Adds {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver
function in scripts/resolvers/review.ts. Plan-* review skills can now
include the clause via one placeholder line in their .tmpl rather than
cloning the paragraph four times. Future tightening edits one resolver,
and all four skills update on the next gen-skill-docs run.

Wired into the existing RESOLVERS map alongside generateReviewDashboard
and generatePlanFileReviewReport — no gen-skill-docs.ts change needed
because the generator already does generic placeholder substitution
against that map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(plan-*-review): anti-shortcut clause in all four review skills

Inserts {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the
**Anti-skip rule:** paragraph in plan-{eng,ceo,design,devex}-review
SKILL.md.tmpl. The four templates use different surrounding section
headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex
variants), so anchoring on the paragraph rather than the heading works
across all four.

Closes the May 2026 transcript-bug loophole: existing STOP gates name
forbidden actions only AFTER a per-section finding is identified. The
anti-shortcut clause adds the pre-emptive rule — "the plan file is the
OUTPUT of the interactive review, not a substitute for it" — covering
the case the transcript exhibited (skip per-section walk, dump every
finding into one plan write, call ExitPlanMode).

Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: gate-tier AskUserQuestion floor tests for all plan-* review skills

Adds 4 finding-floor tests (one per plan-* skill) that catch the May
2026 transcript-bug class — model wrote a plan and called ExitPlanMode
without firing any review-phase AskUserQuestion. Asserts via
runPlanSkillFloorCheck that ANY non-permission AUQ render fires before
the agent reaches plan_ready.

Verified:
- Eng floor: passed in 59s
- CEO floor: passed in 197s
- Design floor: passed
- Devex floor: passed
- Total ~$2-6 per CI run; only triggers on diff against the 4 plan-*
  templates, the shared resolver review.ts, the seeds fixture, or the
  PTY runner helper.

Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant
per skill. Each seed is engineered to force at least one obvious
finding under that skill's review focus (architectural smell for eng,
scope-creep for ceo, UI-slop for design, painful onboarding for devex).

Touchfiles wiring:
- E2E_TOUCHFILES: 4 plan-*-finding-floor entries with deps on the
  matching skill template, the shared resolver, the seeds fixture,
  and the PTY runner helper
- E2E_TIERS: all 4 entries marked 'gate'
- touchfiles.test.ts: count assertion bumped 21→22 with explicit
  plan-ceo-finding-floor containment check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.27.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-05-06 20:27:20 -07:00
committed by GitHub
parent f44de365c5
commit 7b4738bca0
21 changed files with 532 additions and 5 deletions
+80
@@ -1,5 +1,85 @@
# Changelog
## [1.27.1.0] - 2026-05-06
## **Plan-mode reviews now refuse to dump findings without asking. Four gate-tier tests catch the regression on every PR.**
The four `/plan-*-review` skills (eng, ceo, design, devex) gain an
anti-shortcut clause baked in via a single shared resolver. The clause
names the May 2026 transcript-bug failure mode directly: model explores,
finds issues, dumps every finding into one plan write, calls
ExitPlanMode without firing AskUserQuestion. The new clause closes that
loophole: "the plan file is the OUTPUT of the interactive review, not a
substitute for it." Future tightening edits one resolver, and all four
skills update on the next gen-skill-docs run.
Four gate-tier E2E tests catch the regression class on every PR that
touches the four templates, the shared resolver, the seeds fixture, or
the PTY runner helper.
Each test drives the matching skill against a small "forcing finding"
seed and asserts the agent fires at least one AskUserQuestion before
reaching plan_ready. ~1-3 min wall time per test, ~$2-6 total per CI
hit. Eng floor: 59s. CEO floor: 197s. All four pass against the new
template.
### The numbers that matter
Verified end-to-end via live PTY runs against `claude` plan mode:
| Surface | Before | After | Δ |
|---|---|---|---|
| Plan-mode reviews with anti-shortcut clause | 0/4 | 4/4 | full coverage of plan-* family |
| Gate-tier regression tests for the transcript-bug class | 0 | 4 | one per skill |
| Wall time per floor test (typical) | n/a | 30s-3m | early exit on first AUQ render |
| Cost per gate run (when triggered) | n/a | ~$2-6 | diff-gated; only fires on relevant edits |
| Lines added / deleted | — | +450 / 3 | additive; no breaking changes |
The floor tests use a focused observer (`runPlanSkillFloorCheck`) that
exits at the first non-permission numbered-option render. Existing
periodic finding-count tests use `runPlanSkillCounting` for full
fingerprint analysis on a 25-min budget; the floor variant trades
fingerprint precision for early-exit reliability so it fits gate-tier
constraints. Both helpers live side-by-side in
`test/helpers/claude-pty-runner.ts`.
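
The order in which the floor observer decides an outcome on each poll can be sketched as a pure function. This is a simplified illustration, not the helper itself: `FrameSignals`, `classifyFrame`, and the `'pending'` label (meaning "keep polling") are hypothetical names, and the real helper derives these signals from the PTY buffer.

```typescript
// Hypothetical sketch of the per-poll decision order in a floor observer.
// 'pending' is an illustrative label for "no decision yet, poll again".
type FloorOutcome = 'auq_observed' | 'plan_ready' | 'silent_write' | 'exited' | 'pending';

interface FrameSignals {
  processExited: boolean;
  numberedOptionsVisible: boolean;
  permissionDialogVisible: boolean;
  unsanctionedWrite: boolean;
  planReadyVisible: boolean;
}

function classifyFrame(s: FrameSignals): FloorOutcome {
  // A dead process can never render an AUQ, so check it first.
  if (s.processExited) return 'exited';
  // Any non-permission numbered-option list counts as an AUQ render: floor met.
  if (s.numberedOptionsVisible && !s.permissionDialogVisible) return 'auq_observed';
  // A write outside sanctioned paths with no option list on screen is the bug shape.
  if (s.unsanctionedWrite && !s.numberedOptionsVisible) return 'silent_write';
  // Reaching the plan-ready confirmation without an AUQ is the regression.
  if (s.planReadyVisible) return 'plan_ready';
  return 'pending';
}

const quiet: FrameSignals = {
  processExited: false,
  numberedOptionsVisible: false,
  permissionDialogVisible: false,
  unsanctionedWrite: false,
  planReadyVisible: false,
};

const auqFrame = classifyFrame({ ...quiet, numberedOptionsVisible: true });
const permissionThenReady = classifyFrame({
  ...quiet,
  numberedOptionsVisible: true,
  permissionDialogVisible: true,
  planReadyVisible: true,
});
```

Checking the AUQ signal before the terminal signals is what allows an early exit: one qualifying frame is enough, and no fingerprinting is needed.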
### What this means for the four review skills
Every plan-* review now has a structural rule against the precise
failure mode the transcript exhibited. The anti-shortcut clause
appears in the rendered prompt right after the existing Anti-skip
rule, so it's read alongside the per-section STOP gates v1.26.2.0
already added. If a future model regression revives the bug, the
gate-tier floor test fires with full PTY evidence on the next PR.
### Itemized changes
#### Added
- **`generateAntiShortcutClause` resolver** in `scripts/resolvers/review.ts`,
registered as `{{ANTI_SHORTCUT_CLAUSE}}` in the `RESOLVERS` map.
Plan-* SKILL.md.tmpl files include it via one placeholder line.
- **`runPlanSkillFloorCheck` PTY helper** in
`test/helpers/claude-pty-runner.ts` — minimal "did the agent fire ANY
AskUserQuestion?" observer with early exit on first non-permission
numbered-option render.
- **Four gate-tier finding-floor E2E tests** in
`test/skill-e2e-plan-{eng,ceo,design,devex}-finding-floor.test.ts`,
each using the shared `runPlanSkillFloorCheck` helper.
- **Four forcing-finding seeds** in `test/fixtures/forcing-finding-seeds.ts`,
one per skill, each engineered to surface at least one finding under
that skill's review focus.
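
The resolver registration relies on generic placeholder substitution against the `RESOLVERS` map. A minimal sketch of how such a substitution could work, assuming a simplified `TemplateContext` and a hypothetical `renderTemplate` function (the actual generator in gen-skill-docs may differ, and the clause text here is abbreviated):

```typescript
// Simplified, hypothetical shapes; the real TemplateContext and ResolverFn live in the repo.
type TemplateContext = { host: string };
type ResolverFn = (ctx: TemplateContext) => string;

const RESOLVERS: Record<string, ResolverFn> = {
  // Abbreviated stand-in for the real clause text.
  ANTI_SHORTCUT_CLAUSE: () => '**Anti-shortcut clause:** (clause text)',
};

// Replace every {{NAME}} whose NAME is a key in RESOLVERS.
function renderTemplate(tmpl: string, ctx: TemplateContext): string {
  return tmpl.replace(/\{\{([A-Z0-9_]+)\}\}/g, (match: string, name: string) => {
    const resolver = RESOLVERS[name];
    // Unknown placeholders are left untouched in this sketch.
    return resolver ? resolver(ctx) : match;
  });
}

const rendered = renderTemplate(
  '**Anti-skip rule:** (rule text)\n\n{{ANTI_SHORTCUT_CLAUSE}}\n',
  { host: 'claude' },
);
```

Because each template carries only the placeholder, editing the one resolver function changes the rendered output for every skill that includes it.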
#### Changed
- **All four `plan-*-review` SKILL.md** files now include the
anti-shortcut clause immediately after the `**Anti-skip rule:**`
paragraph. Anchored on the paragraph (not the surrounding heading)
so the same insertion works across all four templates regardless of
their differing section labels.
- **`test/helpers/touchfiles.ts`** adds 4 entries to `E2E_TOUCHFILES`
and marks each as `'gate'` in `E2E_TIERS`. The new entries depend on
the matching skill template, the shared resolver, the seeds fixture,
and the PTY runner helper.
- **`test/touchfiles.test.ts`** count assertion bumped 21→22 with
explicit `plan-ceo-finding-floor` containment.
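
Diff-gated selection from the touchfiles map can be sketched as follows. This is an illustration under stated assumptions: `matchesPattern` and `selectTests` are hypothetical helpers, the matcher only understands exact paths and a trailing `/**`, and the dependency lists are abridged from the real entries.

```typescript
type Tier = 'gate' | 'periodic';

// Abridged dependency lists; the real entries list more files.
const TOUCHFILES: Record<string, string[]> = {
  'plan-eng-finding-count': ['plan-eng-review/**', 'test/helpers/claude-pty-runner.ts'],
  'plan-eng-finding-floor': ['plan-eng-review/**', 'scripts/resolvers/review.ts', 'test/fixtures/forcing-finding-seeds.ts'],
  'plan-ceo-finding-floor': ['plan-ceo-review/**', 'scripts/resolvers/review.ts', 'test/fixtures/forcing-finding-seeds.ts'],
};

const TIERS: Record<string, Tier> = {
  'plan-eng-finding-count': 'periodic',
  'plan-eng-finding-floor': 'gate',
  'plan-ceo-finding-floor': 'gate',
};

// Simplified matcher: exact path, or a directory prefix written as "dir/**".
function matchesPattern(file: string, pattern: string): boolean {
  if (pattern.endsWith('/**')) {
    return file.startsWith(pattern.slice(0, -2)); // keeps the trailing slash
  }
  return file === pattern;
}

// Return the tests in the given tier whose dependencies intersect the changed files.
function selectTests(changed: string[], tier: Tier): string[] {
  return Object.entries(TOUCHFILES)
    .filter(([name, patterns]) =>
      TIERS[name] === tier &&
      changed.some((file) => patterns.some((p) => matchesPattern(file, p))))
    .map(([name]) => name);
}

const onResolverEdit = selectTests(['scripts/resolvers/review.ts'], 'gate');
const onEngTemplateEdit = selectTests(['plan-eng-review/SKILL.md.tmpl'], 'gate');
const onUnrelatedEdit = selectTests(['README.md'], 'gate');
```

In this sketch an edit to the shared resolver selects both floor tests, an edit to one template selects only that skill's floor test, and an unrelated edit selects none, which is how the paid runs stay off most PRs.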
## [1.27.0.0] - 2026-05-06
## **`/setup-gbrain` connects to a remote brain in one paste. Brain repo renamed to gstack-artifacts.**
+1 -1
@@ -1 +1 @@
1.27.0.0
1.27.1.0
+1 -1
@@ -1,6 +1,6 @@
{
"name": "gstack",
"version": "1.27.0.0",
"version": "1.27.1.0",
"description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
"license": "MIT",
"type": "module",
+2
@@ -1337,6 +1337,8 @@ Present these mode options via AskUserQuestion using the preamble's AskUserQuest
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-11) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.
### Section 1: Architecture Review
Evaluate and diagram:
* Overall system design and component boundaries. Draw the dependency graph.
+2
@@ -411,6 +411,8 @@ Present these mode options via AskUserQuestion using the preamble's AskUserQuest
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-11) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
{{ANTI_SHORTCUT_CLAUSE}}
### Section 1: Architecture Review
Evaluate and diagram:
* Overall system design and component boundaries. Draw the dependency graph.
+2
@@ -1352,6 +1352,8 @@ descriptions of what 10/10 looks like.
**Anti-skip rule:** Never condense, abbreviate, or skip any review pass (1-7) regardless of plan type (strategy, spec, code, infra). Every pass in this skill exists for a reason. "This is a strategy doc so design passes don't apply" is always wrong — design gaps are where implementation breaks down. If a pass genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.
## Prior Learnings
Search for relevant learnings from previous sessions:
+2
@@ -265,6 +265,8 @@ descriptions of what 10/10 looks like.
**Anti-skip rule:** Never condense, abbreviate, or skip any review pass (1-7) regardless of plan type (strategy, spec, code, infra). Every pass in this skill exists for a reason. "This is a strategy doc so design passes don't apply" is always wrong — design gaps are where implementation breaks down. If a pass genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
{{ANTI_SHORTCUT_CLAUSE}}
{{LEARNINGS_SEARCH}}
### Pass 1: Information Architecture
+2
@@ -1323,6 +1323,8 @@ Pattern:
**Anti-skip rule:** Never condense, abbreviate, or skip any review pass (1-8) regardless of plan type (strategy, spec, code, infra). Every pass in this skill exists for a reason. "This is a strategy doc so DX passes don't apply" is always wrong — DX gaps are where adoption breaks down. If a pass genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.
## Prior Learnings
Search for relevant learnings from previous sessions:
+2
@@ -449,6 +449,8 @@ Pattern:
**Anti-skip rule:** Never condense, abbreviate, or skip any review pass (1-8) regardless of plan type (strategy, spec, code, infra). Every pass in this skill exists for a reason. "This is a strategy doc so DX passes don't apply" is always wrong — DX gaps are where adoption breaks down. If a pass genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
{{ANTI_SHORTCUT_CLAUSE}}
{{LEARNINGS_SEARCH}}
### DX Trend Check
+2
@@ -899,6 +899,8 @@ Always work through the full interactive review: one section at a time (Architec
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-4) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.
## Prior Learnings
Search for relevant learnings from previous sessions:
+2
@@ -127,6 +127,8 @@ Always work through the full interactive review: one section at a time (Architec
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-4) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
{{ANTI_SHORTCUT_CLAUSE}}
{{LEARNINGS_SEARCH}}
### 1. Architecture review
+2 -1
@@ -11,7 +11,7 @@ import { generateTestFailureTriage } from './preamble';
import { generateCommandReference, generateSnapshotFlags, generateBrowseSetup } from './browse';
import { generateDesignMethodology, generateDesignHardRules, generateDesignOutsideVoices, generateDesignReviewLite, generateDesignSketch, generateDesignSetup, generateDesignMockup, generateDesignShotgunLoop, generateTasteProfile, generateUXPrinciples } from './design';
import { generateTestBootstrap, generateTestCoverageAuditPlan, generateTestCoverageAuditShip, generateTestCoverageAuditReview } from './testing';
import { generateReviewDashboard, generatePlanFileReviewReport, generateSpecReviewLoop, generateBenefitsFrom, generateCodexSecondOpinion, generateAdversarialStep, generateCodexPlanReview, generatePlanCompletionAuditShip, generatePlanCompletionAuditReview, generatePlanVerificationExec, generateScopeDrift, generateCrossReviewDedup } from './review';
import { generateReviewDashboard, generatePlanFileReviewReport, generateAntiShortcutClause, generateSpecReviewLoop, generateBenefitsFrom, generateCodexSecondOpinion, generateAdversarialStep, generateCodexPlanReview, generatePlanCompletionAuditShip, generatePlanCompletionAuditReview, generatePlanVerificationExec, generateScopeDrift, generateCrossReviewDedup } from './review';
import { generateSlugEval, generateSlugSetup, generateBaseBranchDetect, generateDeployBootstrap, generateQAMethodology, generateCoAuthorTrailer, generateChangelogWorkflow } from './utility';
import { generateLearningsSearch, generateLearningsLog } from './learnings';
import { generateConfidenceCalibration } from './confidence';
@@ -39,6 +39,7 @@ export const RESOLVERS: Record<string, ResolverFn> = {
DESIGN_REVIEW_LITE: generateDesignReviewLite,
REVIEW_DASHBOARD: generateReviewDashboard,
PLAN_FILE_REVIEW_REPORT: generatePlanFileReviewReport,
ANTI_SHORTCUT_CLAUSE: generateAntiShortcutClause,
TEST_BOOTSTRAP: generateTestBootstrap,
TEST_COVERAGE_AUDIT_PLAN: generateTestCoverageAuditPlan,
TEST_COVERAGE_AUDIT_SHIP: generateTestCoverageAuditShip,
+4
@@ -158,6 +158,10 @@ there — the user then sees a plan whose review report is not at the bottom and
(correctly) rejects it.`;
}
export function generateAntiShortcutClause(_ctx: TemplateContext): string {
return `**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.`;
}
export function generateSpecReviewLoop(_ctx: TemplateContext): string {
return `## Spec Review Loop
+83
@@ -0,0 +1,83 @@
/**
* Per-skill draft-plan seeds engineered to surface at least one
* review-phase finding in the corresponding plan-* review skill.
*
* Used by gate-tier finding-floor tests
* (test/skill-e2e-plan-{eng,ceo,design,devex}-finding-floor.test.ts) as
* the minimum-cost regression for the May 2026 transcript bug:
* "/plan-eng-review reviewed a real PR diff, wrote a multi-section
* review plan to ~/.claude/plans/ and called ExitPlanMode without
* ever firing AskUserQuestion."
*
* Each seed is small and pre-loaded with one obvious finding the
* matching skill cannot honestly miss. Floor tests assert
* `outcome === 'auq_observed'`, i.e., the model fired at least one
* review-phase AUQ before reaching plan_ready or the timeout.
*
* Each seed includes the standard "write your plan-mode plan to /tmp/…"
* preamble that the existing periodic finding-count fixtures use, so
* the agent has a concrete plan-file target. The /tmp path is unique
* per skill to avoid collisions if floor tests run in parallel.
*
* For a deeper [N-1, N+2] count band assertion, see the periodic
* test/skill-e2e-plan-{X}-finding-count.test.ts fixtures.
*/
export const FORCING_FLOOR_ENG = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-eng-floor.md (use Edit/Write to that exact path).',
'',
'# Plan: Add request-id propagation across services',
'',
'## Architecture',
"We'll roll a custom UUIDv7 generator inline in each service rather than",
"use Node's crypto.randomUUID() built-in. Same shape, but we want full",
'control over the entropy source for "future flexibility" — no concrete',
'reason yet.',
].join('\n');
export const FORCING_FLOOR_CEO = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo-floor.md (use Edit/Write to that exact path).',
'',
'# Plan: Launch a "developer-friendly" pricing tier',
'',
'## Goal',
'Increase developer adoption.',
'',
'## Success metric',
'More signups.',
'',
'## Premise',
"We haven't talked to any developers about whether the current pricing",
'is actually a barrier. The team agreed it "feels like" it should be cheaper.',
].join('\n');
export const FORCING_FLOOR_DESIGN = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-design-floor.md (use Edit/Write to that exact path).',
'',
'# Plan: Marketing landing page',
'',
'## Layout',
'All headings, taglines, and body copy will be center-aligned for a',
'"clean modern look." The hero h1 sits 8px above the subhead with no',
'breathing room; the CTA button is the same visual weight as a',
'secondary "Learn more" link directly beside it.',
].join('\n');
export const FORCING_FLOOR_DEVEX = [
'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-devex-floor.md (use Edit/Write to that exact path).',
'',
'# Plan: SDK quickstart docs',
'',
'## Onboarding flow',
'Step 1: clone the repo.',
'Step 2: install bun manually if not present.',
'Step 3: copy .env.example to .env and fill in 8 environment variables.',
'Step 4: run database migrations against your local Postgres.',
'Step 5: start the dev server.',
'Step 6: open the docs in a separate tab.',
'Step 7: register an API key by emailing the team.',
'Step 8: paste the key into your .env, restart the server, then make',
'your first SDK call.',
'',
'No quickstart command, no hosted sandbox, no copy-pasteable curl example.',
].join('\n');
+164
@@ -1550,3 +1550,167 @@ export async function runPlanSkillCounting(opts: {
await session.close();
}
}
// ────────────────────────────────────────────────────────────────────────────
// runPlanSkillFloorCheck — minimal "did the agent fire ANY AskUserQuestion?"
// observer for gate-tier floor tests catching the May 2026 transcript bug
// (model wrote plan + ExitPlanMode'd with reviewCount=0).
//
// Why this exists separately from runPlanSkillCounting: plan-mode AUQs render
// every option on a single logical line via cursor-positioning escapes that
// stripAnsi can't simulate. parseNumberedOptions therefore returns < 2 options
// from those frames and never records a fingerprint. The full counting helper
// works for periodic finding-count tests because their 25-min budgets give the
// agent enough redraws that one frame eventually parses cleanly. Gate-tier
// floor tests don't have that wall-time budget and need to exit early on the
// first observation. This helper trades fingerprint precision for early-exit
// reliability.
//
// Contract:
// - PASS → outcome === 'auq_observed' (agent rendered any non-permission
// numbered-option list; we exit immediately and report success)
// - FAIL → outcome === 'plan_ready' | 'silent_write'
// (agent reached a terminal state without ever firing an AUQ;
// this IS the transcript bug)
// - FAIL → outcome === 'exited' (claude exited, or rejected the slash
// command as unknown, before any AUQ render)
// - SOFT → outcome === 'timeout' (neither happened in budget; agent may
// just be slow — test should retry with a larger budget rather
// than treat as a hard regression)
// ────────────────────────────────────────────────────────────────────────────
export interface PlanSkillFloorObservation {
/** True iff a review-phase AUQ render was observed. */
auqObserved: boolean;
outcome:
| 'auq_observed'
| 'plan_ready'
| 'silent_write'
| 'exited'
| 'timeout';
summary: string;
/** Visible TTY tail (last 3KB) at terminal time. */
evidence: string;
/** Wall time (ms) until the outcome was decided. */
elapsedMs: number;
}
/**
* Drive a plan-* skill in plan mode and exit at the first non-permission
* numbered-option render. See block comment above for the contract.
*/
export async function runPlanSkillFloorCheck(opts: {
/** Skill name, e.g. 'plan-eng-review'. Used for diagnostic strings only. */
skillName: string;
/** Slash command to send alone, e.g. '/plan-eng-review'. */
slashCommand: string;
/** Plan content sent as a follow-up message ~3s after the slash command. */
followUpPrompt: string;
/** Working directory. Default process.cwd(). */
cwd?: string;
/** Total budget. Default 600000 (10 min). Tests exit early on AUQ. */
timeoutMs?: number;
/** Extra env merged into the spawned `claude` process. */
env?: Record<string, string>;
}): Promise<PlanSkillFloorObservation> {
const startedAt = Date.now();
const timeoutMs = opts.timeoutMs ?? 600_000;
const session = await launchClaudePty({
permissionMode: 'plan',
cwd: opts.cwd,
timeoutMs: timeoutMs + 60_000,
env: opts.env,
});
try {
await Bun.sleep(8000); // boot grace + auto-trust handler window
const since = session.mark();
session.send(`${opts.slashCommand}\r`);
await Bun.sleep(3000);
session.send(`${opts.followUpPrompt}\r`);
const start = Date.now();
while (Date.now() - start < timeoutMs) {
await Bun.sleep(2000);
const visible = session.visibleSince(since);
if (session.exited()) {
return {
auqObserved: false,
outcome: 'exited',
summary: `claude exited (code=${session.exitCode()}) before any AUQ render`,
evidence: visible.slice(-3000),
elapsedMs: Date.now() - startedAt,
};
}
if (visible.includes('Unknown command:')) {
return {
auqObserved: false,
outcome: 'exited',
summary: `claude rejected ${opts.slashCommand} as unknown command`,
evidence: visible.slice(-3000),
elapsedMs: Date.now() - startedAt,
};
}
// Success: ANY non-permission numbered-option list is an AUQ render.
// The bug we're catching is "fired zero AUQs," so observing one is
// sufficient — we don't need to fingerprint or navigate past it.
if (
isNumberedOptionListVisible(visible) &&
!isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))
) {
return {
auqObserved: true,
outcome: 'auq_observed',
summary: 'agent rendered an AskUserQuestion (floor met)',
evidence: visible.slice(-3000),
elapsedMs: Date.now() - startedAt,
};
}
// Silent write outside sanctioned dirs is the transcript-bug shape.
const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
let m: RegExpExecArray | null;
while ((m = writeRe.exec(visible)) !== null) {
const target = m[1] ?? '';
const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s));
if (!sanctioned && !isNumberedOptionListVisible(visible)) {
return {
auqObserved: false,
outcome: 'silent_write',
summary: `Write/Edit to ${target} fired before any AskUserQuestion`,
evidence: visible.slice(-3000),
elapsedMs: Date.now() - startedAt,
};
}
}
// Reached terminal without AUQ → transcript-bug regression.
// Note: COMPLETION_SUMMARY_RE is intentionally NOT checked here — it
// matches "GSTACK REVIEW REPORT" anywhere in the buffer, including
// when the agent does recon by reading existing plan files (which
// contain that string as a generated section). The plan_ready check
// (claude's actual "Ready to execute" confirmation) is the reliable
// terminal signal for "agent finished without asking."
if (isPlanReadyVisible(visible)) {
return {
auqObserved: false,
outcome: 'plan_ready',
summary: 'agent reached plan_ready without firing any AskUserQuestion',
evidence: visible.slice(-3000),
elapsedMs: Date.now() - startedAt,
};
}
}
return {
auqObserved: false,
outcome: 'timeout',
summary: `no AUQ render and no terminal outcome within ${timeoutMs}ms`,
evidence: session.visibleSince(since).slice(-3000),
elapsedMs: Date.now() - startedAt,
};
} finally {
await session.close();
}
}
+14
@@ -133,6 +133,16 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
'plan-eng-finding-count': ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-eng-finding-count.test.ts'],
'plan-design-finding-count': ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-design-finding-count.test.ts'],
'plan-devex-finding-count': ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-devex-finding-count.test.ts'],
// Gate-tier reviewCount-floor counterparts. Catch the May 2026 transcript
// bug (model wrote a plan-mode plan and ExitPlanMode'd without firing any
// review-phase AskUserQuestion). Uses runPlanSkillFloorCheck — minimal
// "did agent fire ANY AUQ?" observer that exits early on first non-permission
// numbered-option render. ~1-3 min typical wall time per test, ~$2-6 total.
'plan-eng-finding-floor': ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-eng-finding-floor.test.ts'],
'plan-ceo-finding-floor': ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-ceo-finding-floor.test.ts'],
'plan-design-finding-floor': ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-design-finding-floor.test.ts'],
'plan-devex-finding-floor': ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-devex-finding-floor.test.ts'],
'brain-privacy-gate': ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-artifacts-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'],
// /setup-gbrain Path 4 (Remote MCP) — happy + bad-token end-to-end via
@@ -429,6 +439,10 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
'plan-eng-finding-count': 'periodic',
'plan-design-finding-count': 'periodic',
'plan-devex-finding-count': 'periodic',
'plan-eng-finding-floor': 'gate',
'plan-ceo-finding-floor': 'gate',
'plan-design-finding-floor': 'gate',
'plan-devex-finding-floor': 'gate',
// Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call,
// costs ~$0.30-$0.50 per run, not needed on every commit)
@@ -0,0 +1,37 @@
/**
* /plan-ceo-review AskUserQuestion floor regression (gate, paid, real-PTY).
*
* See test/skill-e2e-plan-eng-finding-floor.test.ts for the contract.
*/
import { describe, test } from 'bun:test';
import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
import { FORCING_FLOOR_CEO } from './fixtures/forcing-finding-seeds';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
describeE2E('/plan-ceo-review AskUserQuestion floor (gate)', () => {
test(
'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
async () => {
const obs = await runPlanSkillFloorCheck({
skillName: 'plan-ceo-review',
slashCommand: '/plan-ceo-review',
followUpPrompt: FORCING_FLOOR_CEO,
cwd: process.cwd(),
timeoutMs: 600_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
if (obs.outcome !== 'auq_observed') {
throw new Error(
`floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
`summary: ${obs.summary}\n` +
`--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
},
660_000,
);
});
@@ -0,0 +1,37 @@
/**
* /plan-design-review AskUserQuestion floor regression (gate, paid, real-PTY).
*
* See test/skill-e2e-plan-eng-finding-floor.test.ts for the contract.
*/
import { describe, test } from 'bun:test';
import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
import { FORCING_FLOOR_DESIGN } from './fixtures/forcing-finding-seeds';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
describeE2E('/plan-design-review AskUserQuestion floor (gate)', () => {
test(
'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
async () => {
const obs = await runPlanSkillFloorCheck({
skillName: 'plan-design-review',
slashCommand: '/plan-design-review',
followUpPrompt: FORCING_FLOOR_DESIGN,
cwd: process.cwd(),
timeoutMs: 600_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
if (obs.outcome !== 'auq_observed') {
throw new Error(
`floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
`summary: ${obs.summary}\n` +
`--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
},
660_000,
);
});
@@ -0,0 +1,37 @@
/**
* /plan-devex-review AskUserQuestion floor regression (gate, paid, real-PTY).
*
* See test/skill-e2e-plan-eng-finding-floor.test.ts for the contract.
*/
import { describe, test } from 'bun:test';
import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
import { FORCING_FLOOR_DEVEX } from './fixtures/forcing-finding-seeds';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
describeE2E('/plan-devex-review AskUserQuestion floor (gate)', () => {
test(
'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
async () => {
const obs = await runPlanSkillFloorCheck({
skillName: 'plan-devex-review',
slashCommand: '/plan-devex-review',
followUpPrompt: FORCING_FLOOR_DEVEX,
cwd: process.cwd(),
timeoutMs: 600_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
if (obs.outcome !== 'auq_observed') {
throw new Error(
`floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
`summary: ${obs.summary}\n` +
`--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
},
660_000,
);
});
@@ -0,0 +1,52 @@
/**
* /plan-eng-review AskUserQuestion floor regression (gate, paid, real-PTY).
*
* Catches the May 2026 transcript bug where /plan-eng-review wrote a
* multi-section review plan to ~/.claude/plans/ and called ExitPlanMode
* without firing any AskUserQuestion. See
* `.context/attachments/pasted_text_2026-05-06_10-25-23.txt`.
*
* Uses runPlanSkillFloorCheck, a minimal "did the agent fire ANY AUQ?"
* observer that exits early on the first non-permission numbered-option
* render. See claude-pty-runner.ts for why this is separate from the
* runPlanSkillCounting harness used by periodic finding-count tests.
*
* Tier: gate. Budget: 10 min (early exit on success ~30-90s typical).
* Cost: ~$0.50-$1.50 per run depending on early-exit timing.
*/
import { describe, test } from 'bun:test';
import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
import { FORCING_FLOOR_ENG } from './fixtures/forcing-finding-seeds';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
describeE2E('/plan-eng-review AskUserQuestion floor (gate)', () => {
test(
'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
async () => {
const obs = await runPlanSkillFloorCheck({
skillName: 'plan-eng-review',
slashCommand: '/plan-eng-review',
followUpPrompt: FORCING_FLOOR_ENG,
cwd: process.cwd(),
timeoutMs: 600_000,
env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
});
if (obs.outcome !== 'auq_observed') {
throw new Error(
`floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
`summary: ${obs.summary}\n` +
`If outcome is plan_ready or completion_summary, this is the transcript-bug ` +
`regression — agent reached terminal without firing AskUserQuestion. See ` +
`.context/attachments/pasted_text_2026-05-06_10-25-23.txt.\n` +
`If outcome is timeout, agent may just be slow — re-run or increase budget.\n` +
`--- evidence (last 3KB) ---\n${obs.evidence}`,
);
}
},
660_000,
);
});
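The header comment above describes the floor check as an observer that exits on the first non-permission numbered-option render and treats the plan-mode "Ready to execute" confirmation as the terminal signal for "agent finished without asking." A minimal sketch of that classification follows. It is not the real `runPlanSkillFloorCheck`: `classifyFrame`, `countOptionMarkers`, the `pending` outcome, and both regexes (especially the permission-prompt wording) are hypothetical names and assumptions made for illustration.

```typescript
// Hypothetical sketch; not the implementation in claude-pty-runner.ts.
type FloorOutcome = 'auq_observed' | 'plan_ready' | 'pending';

// Assumed placeholder wording for a permission prompt; the real matcher may differ.
const PERMISSION_RE = /permission|allow this/i;
// "Ready to execute" is the confirmation text named in the commit message.
const PLAN_READY_RE = /Ready to execute/;

function countOptionMarkers(text: string): number {
  // Count distinct "N." markers anywhere in the text, so options that
  // render on a single logical line are still counted.
  const re = /(?:^|\s)([1-9])\.\s+\S/g;
  const seen: Record<string, boolean> = {};
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) seen[m[1]] = true;
  return Object.keys(seen).length;
}

function classifyFrame(visibleText: string): FloorOutcome {
  // Assumption: the plan-ready confirmation may itself show numbered
  // choices, so it is checked before the option count.
  if (PLAN_READY_RE.test(visibleText)) return 'plan_ready';
  const options = countOptionMarkers(visibleText);
  if (options >= 2 && !PERMISSION_RE.test(visibleText)) return 'auq_observed';
  return 'pending';
}
```

Counting markers rather than parsing full option lines mirrors the trade-off stated in the commit message: less fingerprint precision in exchange for a reliable early exit.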
@@ -103,8 +103,10 @@ describe('selectTests', () => {
// auto-decide-preserved also depend on plan-ceo-review/**
expect(result.selected).toContain('autoplan-auto-mode');
expect(result.selected).toContain('auto-decide-preserved');
-expect(result.selected.length).toBe(21);
-expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 21);
+// v1.27+ gate-tier reviewCount-floor regression for transcript bug
+expect(result.selected).toContain('plan-ceo-finding-floor');
+expect(result.selected.length).toBe(22);
+expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 22);
});
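The selected count rises from 21 to 22 because the new `plan-ceo-finding-floor` entry lists `plan-ceo-review/**` among its touchfiles, so a change under that directory now selects one more test. A rough sketch of touchfile-based selection is below. The real `selectTests` signature and glob semantics are not shown in this diff, so `selectByTouchfiles`, `matchesPattern`, the trailing-`/**`-only matching, and the demo file path are all hypothetical; global touchfiles are left out.

```typescript
// Hypothetical sketch; not the project's selectTests implementation.
type Touchfiles = Record<string, string[]>;

function matchesPattern(file: string, pattern: string): boolean {
  // Assumption: only a trailing "/**" wildcard is handled; anything else is an exact match.
  if (pattern.slice(-3) === '/**') {
    const prefix = pattern.slice(0, -2); // keeps the trailing "/"
    return file.indexOf(prefix) === 0;
  }
  return file === pattern;
}

function selectByTouchfiles(changed: string[], touchfiles: Touchfiles) {
  const selected: string[] = [];
  const skipped: string[] = [];
  for (const name of Object.keys(touchfiles)) {
    // A test is selected when any changed file matches any of its patterns.
    const hit = touchfiles[name].some((p) => changed.some((f) => matchesPattern(f, p)));
    (hit ? selected : skipped).push(name);
  }
  return { selected, skipped };
}

// Trimmed-down map; entries are abbreviated from the touchfile lists in this diff.
const demo: Touchfiles = {
  'plan-ceo-finding-floor': ['plan-ceo-review/**', 'scripts/resolvers/review.ts'],
  'plan-eng-finding-floor': ['plan-eng-review/**', 'scripts/resolvers/review.ts'],
  'brain-privacy-gate': ['bin/gstack-config'],
};

// 'plan-ceo-review/SKILL.md.tmpl' is a made-up changed path for illustration.
const result = selectByTouchfiles(['plan-ceo-review/SKILL.md.tmpl'], demo);
console.log(result.selected, result.skipped.length);
```

Under these assumptions, `selected.length + skipped.length` always equals the number of map entries, which is the invariant the paired `selected`/`skipped` assertions in the test rely on.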
test('global touchfile triggers ALL tests', () => {