v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement (#1313)

* fix(plan-eng-review): tighten STOP gates with anti-rationalization clause

Five sites in SKILL.md.tmpl uplift to the office-hours b512be71 pattern:
the four review-section gates (Architecture, Code Quality, Test, Performance)
plus the Step 0 complexity-check trigger. Adds tool_use reminder ("call the
tool directly"), names blocked next steps explicitly, anti-rationalization
clause naming the precise failure mode (loading the schema via ToolSearch
and writing the recommendation as chat prose).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(test/helpers): initialPlanContent + wrote_findings_before_asking + shared report-at-bottom assertion

Three additions to claude-pty-runner.ts:

1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a
   user message containing the seeded plan before invoking the skill, with
   a 3s gap so the message renders before the slash command. claude has no
   --plan-file flag (verified via claude --help), so message-pump is the
   route. Lets STOP-gate regression tests force complexity findings.

2. ClassifyResult gains wrote_findings_before_asking with companion
   strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/
   Edit to .claude/plans/* precedes any AskUserQuestion render in the
   session window. Default off — preserves zero-findings → write plan →
   plan_ready as legitimate for unseeded smokes. Six new unit tests cover
   before/after-AUQ ordering, permission-dialog edge case, strict-off path.

3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing
   assertReviewReportAtBottom(content) and gates on obs.planFile (artifact
   existing), so the assertion fires under both 'asked' and 'plan_ready'
   when a plan was actually written.

Also: runPlanSkillObservation now captures obs.planFile on every classifier
outcome, not just 'plan_ready'. Catches the case where the skill wrote a
plan partway through then paused on a question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: wire assertReportAtBottomIfPlanWritten into 4 plan-mode E2E tests + add seeded-plan STOP-gate case

Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts
that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last
## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract;
nothing tested it until now.

Plan-eng additionally gains a third test case: STOP gate fires when seeded
plan forces Step 0 findings. Combines the new initialPlanContent runner
option with --disallowedTools AskUserQuestion to force the Conductor
MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in
{wrote_findings_before_asking, auto_decided, silent_write, exited, timeout}
and that plan_ready outcomes carry a ## Decisions section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(touchfiles): delete duplicate plan-design-review-plan-mode keys

Verified duplicates in test/helpers/touchfiles.ts:
- E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps)
  AND line 243 (smaller deps); JS object literals: later wins.
- E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same
  later-wins rule.

Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings
ran on every PR; design ran weekly only.

Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and
line 399 ('gate'). Also extend the four plan-mode-test entries to include
scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}}
trigger all four siblings in bun run eval:select.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.26.2.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: tighten CHANGELOG voice for v1.26.2.0

Move contributor-flavored bullet (runPlanSkillObservation seeding) into
For contributors. Drop branch-internal narrative (Codex review pass,
plan iteration tracking) per CHANGELOG-for-users style.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-05-03 20:26:59 -07:00
committed by GitHub
parent a0bfa001d3
commit 30fe6bb11c
12 changed files with 394 additions and 31 deletions
+130 -9
View File
@@ -454,6 +454,7 @@ export function optionsSignature(
*/
export type ClassifyResult =
| { outcome: 'silent_write'; summary: string }
| { outcome: 'wrote_findings_before_asking'; summary: string }
| { outcome: 'auto_decided'; summary: string }
| { outcome: 'plan_ready'; summary: string }
| { outcome: 'asked'; summary: string }
@@ -467,16 +468,73 @@ const SANCTIONED_WRITE_SUBSTRINGS = [
'TODOS.md',
];
export function classifyVisible(visible: string): ClassifyResult {
/**
* Find the position of the first AskUserQuestion-style numbered-option list
* that is NOT a permission dialog. Returns -1 if none has rendered yet.
*
* Used by the strict-plan-writes detector (D4) to distinguish legitimate
* post-AUQ plan writes from the transcript bug ("write findings to plan
* before asking").
*/
function findFirstAuqRenderIndex(visible: string): number {
const re = /\s*1\./g;
let m: RegExpExecArray | null;
while ((m = re.exec(visible)) !== null) {
// 200 bytes back + TAIL_SCAN_BYTES forward gives enough context for
// isPermissionDialogVisible to recognize the typical permission UI.
const surroundStart = Math.max(0, m.index - 200);
const surroundEnd = Math.min(visible.length, m.index + TAIL_SCAN_BYTES);
const surround = visible.slice(surroundStart, surroundEnd);
if (!isPermissionDialogVisible(surround)) {
return m.index;
}
}
return -1;
}
export function classifyVisible(
visible: string,
opts?: {
/**
* When true, treat Write/Edit to `.claude/plans/*` BEFORE any
* AskUserQuestion render as `wrote_findings_before_asking` rather than
* letting the sanctioned-write list silently approve it. Used by tests
* that seed a draft plan with guaranteed-finding-triggering complexity
* (D3-B), where a pre-AUQ plan write is the precise transcript bug.
* Default false — preserves existing behavior for unseeded smoke tests
* where zero-findings → write plan → plan_ready is legitimate.
*/
strictPlanWrites?: boolean;
},
): ClassifyResult {
// Silent-write detection: any Write/Edit tool render that targets a path
// OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen
// (a numbered prompt means a permission/AskUserQuestion is gating the write,
// not an actual silent write).
const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
let m: RegExpExecArray | null;
const auqRenderIdx = opts?.strictPlanWrites ? findFirstAuqRenderIndex(visible) : -1;
while ((m = writeRe.exec(visible)) !== null) {
const target = m[1] ?? '';
const writePos = m.index;
const isPlanWrite = target.includes('.claude/plans');
const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s));
// D4-B: when strictPlanWrites is on, plan writes that precede the first
// AUQ render are flagged. Legitimate end-of-workflow plan writes happen
// AFTER an AUQ has rendered (i.e., the user has been asked). The
// transcript bug is a plan write WITHOUT any AUQ render preceding it.
if (opts?.strictPlanWrites && isPlanWrite) {
if (auqRenderIdx < 0 || writePos < auqRenderIdx) {
return {
outcome: 'wrote_findings_before_asking',
summary: `Write/Edit to ${target} fired before any AskUserQuestion render`,
};
}
// post-AUQ plan write — legitimate, fall through to other writes
continue;
}
if (!sanctioned && !isNumberedOptionListVisible(visible)) {
return {
outcome: 'silent_write',
@@ -712,6 +770,36 @@ export function assertReviewReportAtBottom(
return { ok: true };
}
/**
* Test helper: if `obs.planFile` was set, read it and assert
* `## GSTACK REVIEW REPORT` is the last `## ` section. Throws on
* violation with a diagnostic message including the plan path,
* the reason, any trailing headings, and the last 2KB of TTY output.
*
* Used by the four plan-mode E2E tests
* (skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts) to enforce
* the {{PLAN_FILE_REVIEW_REPORT}} resolver contract uniformly. Gates on
* `obs.planFile` (artifact existing), not on `obs.outcome === 'plan_ready'`,
* so it also catches the report-missing case under `'asked'` /
* `'wrote_findings_before_asking'` when a plan was already written.
*/
export function assertReportAtBottomIfPlanWritten(
obs: { planFile?: string; evidence: string },
): void {
if (!obs.planFile) return;
const content = fs.readFileSync(obs.planFile, 'utf-8');
const verdict = assertReviewReportAtBottom(content);
if (!verdict.ok) {
const trailing = verdict.trailingHeadings?.length
? `\ntrailing headings: ${verdict.trailingHeadings.join(', ')}`
: '';
throw new Error(
`GSTACK REVIEW REPORT contract violation in ${obs.planFile}: ${verdict.reason}${trailing}\n` +
`--- evidence (last 2KB) ---\n${obs.evidence}`,
);
}
}
/**
* Per-skill Step-0 boundary predicates. Each fires `true` when the answered
* AUQ's fingerprint matches the LAST question of that skill's Step 0 phase.
@@ -1085,6 +1173,20 @@ export async function runPlanSkillObservation(opts: {
* rendered AskUserQuestion list).
*/
env?: Record<string, string>;
/**
* Seed an initial plan that the spawned `claude` process operates on.
* STOP-gate regression tests need a plan with guaranteed-finding-triggering
* complexity (8+ files, custom-vs-builtin smell) so the skill MUST emit
* AskUserQuestion or fall back to a Decisions section. Without this,
* plan-mode creates a fresh empty plan and the skill has nothing to find
* issues with.
*
* Implementation: claude has no `--plan-file` flag (verified via
* `claude --help`). We pre-pump a user message containing the draft
* plan, wait for it to register, then invoke the skill. The skill's
* Step 0 reads the prior conversation context so it sees the draft.
*/
initialPlanContent?: string;
}): Promise<PlanSkillObservation> {
const startedAt = Date.now();
const session = await launchClaudePty({
@@ -1098,6 +1200,21 @@ export async function runPlanSkillObservation(opts: {
try {
// Boot grace + trust-dialog auto-handle.
await Bun.sleep(8000);
if (opts.initialPlanContent) {
// Pre-pump the draft as a user message so the skill's Step 0 has
// concrete content to scope-challenge. The trailing `\r` submits
// the message; embedded `\n` are preserved as line breaks within
// the message (claude-code uses Enter to send, Shift+Enter for
// newlines, but raw `\r` from a PTY just submits whatever's in
// the input buffer).
const seed = `Please review the following draft plan when I run the skill below:\n\n${opts.initialPlanContent}`;
session.send(`${seed}\r`);
// Wait for the seed message to render before sending the skill
// command. Without this gap the two messages can fuse and the
// skill name becomes part of the user prompt instead of a slash
// command.
await Bun.sleep(3000);
}
const since = session.mark();
session.send(`/${opts.skillName}\r`);
@@ -1123,20 +1240,24 @@ export async function runPlanSkillObservation(opts: {
elapsedMs: Date.now() - startedAt,
};
}
const classified = classifyVisible(visible);
const classified = classifyVisible(visible, {
strictPlanWrites: !!opts.initialPlanContent,
});
if (classified) {
const obs: PlanSkillObservation = {
...classified,
evidence: visible.slice(-2000),
elapsedMs: Date.now() - startedAt,
};
// For plan_ready outcomes, capture the plan file path from the full
// visible buffer — tests under --disallowedTools verify the file's
// contents to distinguish legitimate fallback flow from silent-skip.
if (classified.outcome === 'plan_ready') {
const planFile = extractPlanFilePath(visible);
if (planFile) obs.planFile = planFile;
}
// Capture the plan file path on any outcome where one may have been
// written. Gating only on 'plan_ready' missed two cases: (1) the
// 'asked' outcome where the model wrote a plan partway through then
// paused on a question, and (2) 'wrote_findings_before_asking' where
// the bug is precisely that the plan was written. The
// assertReviewReportAtBottom checks downstream gate on planFile
// existing, not on the outcome.
const planFile = extractPlanFilePath(visible);
if (planFile) obs.planFile = planFile;
return obs;
}
}
@@ -295,6 +295,78 @@ describe('classifyVisible (runtime path through the runner classifier)', () => {
// recent-tail window would surface here.
expect(TAIL_SCAN_BYTES).toBe(1500);
});
// D4-B: strictPlanWrites detector. Catches the transcript bug where the
// model writes findings to the plan file before any AskUserQuestion fires.
test('strictPlanWrites: plan write before any AUQ → wrote_findings_before_asking', () => {
const visible = `
⏺ Edit(/Users/me/.claude/plans/some-plan.md)
⎿ Updated 12 lines
`;
const result = classifyVisible(visible, { strictPlanWrites: true });
expect(result?.outcome).toBe('wrote_findings_before_asking');
expect(result?.summary).toContain('.claude/plans/some-plan.md');
});
test('strictPlanWrites: plan write AFTER an AUQ render → not flagged', () => {
// AUQ renders first, then the model writes the plan post-answer. This is
// the legitimate end-of-workflow flow and must NOT trigger the detector.
const visible = `
D1 — Some scope question
1. Option A
2. Option B
⏺ Edit(/Users/me/.claude/plans/some-plan.md)
⎿ Updated 12 lines
`;
const result = classifyVisible(visible, { strictPlanWrites: true });
// Outcome is 'asked' (the numbered list rendered); the post-AUQ plan
// write is ignored by the detector.
expect(result?.outcome).toBe('asked');
});
test('strictPlanWrites: AUQ first then plan write — write_pos > auq_pos → not flagged', () => {
// Same scenario, more explicit ordering: the regex finds the write at a
// position AFTER the numbered list. Detector lets it through.
const visible = [
'D1 — Choose your approach',
'',
' 1. Approach A',
' 2. Approach B',
'',
'⏺ Write(/Users/me/.claude/plans/draft.md)',
'⎿ Wrote 42 lines',
].join('\n');
const result = classifyVisible(visible, { strictPlanWrites: true });
expect(result?.outcome).toBe('asked');
});
test('strictPlanWrites: only a permission dialog visible → plan write still flagged', () => {
// A permission dialog 1./2. is NOT an AUQ; pre-AUQ plan writes still
// hit the detector even when a permission prompt is on screen.
const visible = `
⏺ Edit(/Users/me/.claude/plans/some-plan.md)
Edit to /Users/me/.claude/plans/some-plan.md
Do you want to proceed?
1. Yes
2. No
`;
const result = classifyVisible(visible, { strictPlanWrites: true });
expect(result?.outcome).toBe('wrote_findings_before_asking');
});
test('strictPlanWrites OFF: plan write before AUQ → returns null (legacy behavior preserved)', () => {
const visible = `
⏺ Edit(/Users/me/.claude/plans/some-plan.md)
⎿ Updated 12 lines
`;
// Without strictPlanWrites, the sanctioned-path list lets this through.
expect(classifyVisible(visible)).toBeNull();
});
});
describe('parseNumberedOptions', () => {
+4 -6
View File
@@ -89,10 +89,10 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
// include question-tuning.ts and generate-ask-user-format.ts because the
// AUTO_DECIDE preamble injection lives there and changes can flip the
// regression test outcome between 'asked' and 'auto_decided'.
'plan-ceo-review-plan-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-eng-review-plan-mode': ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-devex-review-plan-mode': ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-ceo-review-plan-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-eng-review-plan-mode': ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-devex-review-plan-mode': ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-mode-no-op': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
// v1.21+ AskUserQuestion-blocked regression tests — Conductor launches
@@ -240,7 +240,6 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
'design-consultation-existing': ['design-consultation/**', 'scripts/gen-skill-docs.ts'],
'design-consultation-research': ['design-consultation/**', 'scripts/gen-skill-docs.ts'],
'design-consultation-preview': ['design-consultation/**', 'scripts/gen-skill-docs.ts'],
'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/gen-skill-docs.ts'],
'plan-design-review-no-ui-scope': ['plan-design-review/**', 'scripts/gen-skill-docs.ts'],
'design-review-fix': ['design-review/**', 'browse/src/**', 'scripts/gen-skill-docs.ts'],
@@ -521,7 +520,6 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
'design-consultation-existing': 'periodic',
'design-consultation-research': 'gate',
'design-consultation-preview': 'gate',
'plan-design-review-plan-mode': 'periodic',
'plan-design-review-no-ui-scope': 'gate',
'design-review-fix': 'periodic',
'design-shotgun-path': 'gate',