v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement (#1313)

* fix(plan-eng-review): tighten STOP gates with anti-rationalization clause Five sites in SKILL.md.tmpl uplift to the office-hours b512be71 pattern: the four review-section gates (Architecture, Code Quality, Test, Performance) plus the Step 0 complexity-check trigger. Adds tool_use reminder ("call the tool directly"), names blocked next steps explicitly, anti-rationalization clause naming the precise failure mode (loading the schema via ToolSearch and writing the recommendation as chat prose). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(test/helpers): initialPlanContent + wrote_findings_before_asking + shared report-at-bottom assertion Three additions to claude-pty-runner.ts: 1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a user message containing the seeded plan before invoking the skill, with a 3s gap so the message renders before the slash command. claude has no --plan-file flag (verified via claude --help), so message-pump is the route. Lets STOP-gate regression tests force complexity findings. 2. ClassifyResult gains wrote_findings_before_asking with companion strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/ Edit to .claude/plans/* precedes any AskUserQuestion render in the session window. Default off — preserves zero-findings → write plan → plan_ready as legitimate for unseeded smokes. Six new unit tests cover before/after-AUQ ordering, permission-dialog edge case, strict-off path. 3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing assertReviewReportAtBottom(content) and gates on obs.planFile (artifact existing), so the assertion fires under both 'asked' and 'plan_ready' when a plan was actually written. Also: runPlanSkillObservation now captures obs.planFile on every classifier outcome, not just 'plan_ready'. Catches the case where the skill wrote a plan partway through then paused on a question. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: wire assertReportAtBottomIfPlanWritten into 4 plan-mode E2E tests + add seeded-plan STOP-gate case Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last ## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract; nothing tested it until now. Plan-eng additionally gains a third test case: STOP gate fires when seeded plan forces Step 0 findings. Combines the new initialPlanContent runner option with --disallowedTools AskUserQuestion to force the Conductor MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in {wrote_findings_before_asking, auto_decided, silent_write, exited, timeout} and that plan_ready outcomes carry a ## Decisions section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(touchfiles): delete duplicate plan-design-review-plan-mode keys Verified duplicates in test/helpers/touchfiles.ts: - E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps) AND line 243 (smaller deps); JS object literals: later wins. - E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same later-wins rule. Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings ran on every PR; design ran weekly only. Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and line 399 ('gate'). Also extend the four plan-mode-test entries to include scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}} trigger all four siblings in bun run eval:select. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.26.2.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: tighten CHANGELOG voice for v1.26.2.0 Move contributor-flavored bullet (runPlanSkillObservation seeding) into For contributors. Drop branch-internal narrative (Codex review pass, plan iteration tracking) per CHANGELOG-for-users style. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:46:40 +02:00 · 2026-05-03 20:26:59 -07:00
parent a0bfa001d3
commit 30fe6bb11c
12 changed files with 394 additions and 31 deletions
@@ -454,6 +454,7 @@ export function optionsSignature(
 */
 export type ClassifyResult =
  | { outcome: 'silent_write'; summary: string }
+  | { outcome: 'wrote_findings_before_asking'; summary: string }
  | { outcome: 'auto_decided'; summary: string }
  | { outcome: 'plan_ready'; summary: string }
  | { outcome: 'asked'; summary: string }
@@ -467,16 +468,73 @@ const SANCTIONED_WRITE_SUBSTRINGS = [
  'TODOS.md',
 ];

-export function classifyVisible(visible: string): ClassifyResult {
+/**
+ * Find the position of the first AskUserQuestion-style numbered-option list
+ * that is NOT a permission dialog. Returns -1 if none has rendered yet.
+ *
+ * Used by the strict-plan-writes detector (D4) to distinguish legitimate
+ * post-AUQ plan writes from the transcript bug ("write findings to plan
+ * before asking").
+ */
+function findFirstAuqRenderIndex(visible: string): number {
+  const re = /❯\s*1\./g;
+  let m: RegExpExecArray | null;
+  while ((m = re.exec(visible)) !== null) {
+    // 200 bytes back + TAIL_SCAN_BYTES forward gives enough context for
+    // isPermissionDialogVisible to recognize the typical permission UI.
+    const surroundStart = Math.max(0, m.index - 200);
+    const surroundEnd = Math.min(visible.length, m.index + TAIL_SCAN_BYTES);
+    const surround = visible.slice(surroundStart, surroundEnd);
+    if (!isPermissionDialogVisible(surround)) {
+      return m.index;
+    }
+  }
+  return -1;
+}
+
+export function classifyVisible(
+  visible: string,
+  opts?: {
+    /**
+     * When true, treat Write/Edit to `.claude/plans/*` BEFORE any
+     * AskUserQuestion render as `wrote_findings_before_asking` rather than
+     * letting the sanctioned-write list silently approve it. Used by tests
+     * that seed a draft plan with guaranteed-finding-triggering complexity
+     * (D3-B), where a pre-AUQ plan write is the precise transcript bug.
+     * Default false — preserves existing behavior for unseeded smoke tests
+     * where zero-findings → write plan → plan_ready is legitimate.
+     */
+    strictPlanWrites?: boolean;
+  },
+): ClassifyResult {
  // Silent-write detection: any Write/Edit tool render that targets a path
  // OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen
  // (a numbered prompt means a permission/AskUserQuestion is gating the write,
  // not an actual silent write).
  const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
  let m: RegExpExecArray | null;
+  const auqRenderIdx = opts?.strictPlanWrites ? findFirstAuqRenderIndex(visible) : -1;
  while ((m = writeRe.exec(visible)) !== null) {
    const target = m[1] ?? '';
+    const writePos = m.index;
+    const isPlanWrite = target.includes('.claude/plans');
    const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s));
+
+    // D4-B: when strictPlanWrites is on, plan writes that precede the first
+    // AUQ render are flagged. Legitimate end-of-workflow plan writes happen
+    // AFTER an AUQ has rendered (i.e., the user has been asked). The
+    // transcript bug is a plan write WITHOUT any AUQ render preceding it.
+    if (opts?.strictPlanWrites && isPlanWrite) {
+      if (auqRenderIdx < 0 || writePos < auqRenderIdx) {
+        return {
+          outcome: 'wrote_findings_before_asking',
+          summary: `Write/Edit to ${target} fired before any AskUserQuestion render`,
+        };
+      }
+      // post-AUQ plan write — legitimate, fall through to other writes
+      continue;
+    }
+
    if (!sanctioned && !isNumberedOptionListVisible(visible)) {
      return {
        outcome: 'silent_write',
@@ -712,6 +770,36 @@ export function assertReviewReportAtBottom(
  return { ok: true };
 }

+/**
+ * Test helper: if `obs.planFile` was set, read it and assert
+ * `## GSTACK REVIEW REPORT` is the last `## ` section. Throws on
+ * violation with a diagnostic message including the plan path,
+ * the reason, any trailing headings, and the last 2KB of TTY output.
+ *
+ * Used by the four plan-mode E2E tests
+ * (skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts) to enforce
+ * the {{PLAN_FILE_REVIEW_REPORT}} resolver contract uniformly. Gates on
+ * `obs.planFile` (artifact existing), not on `obs.outcome === 'plan_ready'`,
+ * so it also catches the report-missing case under `'asked'` /
+ * `'wrote_findings_before_asking'` when a plan was already written.
+ */
+export function assertReportAtBottomIfPlanWritten(
+  obs: { planFile?: string; evidence: string },
+): void {
+  if (!obs.planFile) return;
+  const content = fs.readFileSync(obs.planFile, 'utf-8');
+  const verdict = assertReviewReportAtBottom(content);
+  if (!verdict.ok) {
+    const trailing = verdict.trailingHeadings?.length
+      ? `\ntrailing headings: ${verdict.trailingHeadings.join(', ')}`
+      : '';
+    throw new Error(
+      `GSTACK REVIEW REPORT contract violation in ${obs.planFile}: ${verdict.reason}${trailing}\n` +
+        `--- evidence (last 2KB) ---\n${obs.evidence}`,
+    );
+  }
+}
+
 /**
 * Per-skill Step-0 boundary predicates. Each fires `true` when the answered
 * AUQ's fingerprint matches the LAST question of that skill's Step 0 phase.
@@ -1085,6 +1173,20 @@ export async function runPlanSkillObservation(opts: {
   * rendered AskUserQuestion list).
   */
  env?: Record<string, string>;
+  /**
+   * Seed an initial plan that the spawned `claude` process operates on.
+   * STOP-gate regression tests need a plan with guaranteed-finding-triggering
+   * complexity (8+ files, custom-vs-builtin smell) so the skill MUST emit
+   * AskUserQuestion or fall back to a Decisions section. Without this,
+   * plan-mode creates a fresh empty plan and the skill has nothing to find
+   * issues with.
+   *
+   * Implementation: claude has no `--plan-file` flag (verified via
+   * `claude --help`). We pre-pump a user message containing the draft
+   * plan, wait for it to register, then invoke the skill. The skill's
+   * Step 0 reads the prior conversation context so it sees the draft.
+   */
+  initialPlanContent?: string;
 }): Promise<PlanSkillObservation> {
  const startedAt = Date.now();
  const session = await launchClaudePty({
@@ -1098,6 +1200,21 @@ export async function runPlanSkillObservation(opts: {
  try {
    // Boot grace + trust-dialog auto-handle.
    await Bun.sleep(8000);
+    if (opts.initialPlanContent) {
+      // Pre-pump the draft as a user message so the skill's Step 0 has
+      // concrete content to scope-challenge. The trailing `\r` submits
+      // the message; embedded `\n` are preserved as line breaks within
+      // the message (claude-code uses Enter to send, Shift+Enter for
+      // newlines, but raw `\r` from a PTY just submits whatever's in
+      // the input buffer).
+      const seed = `Please review the following draft plan when I run the skill below:\n\n${opts.initialPlanContent}`;
+      session.send(`${seed}\r`);
+      // Wait for the seed message to render before sending the skill
+      // command. Without this gap the two messages can fuse and the
+      // skill name becomes part of the user prompt instead of a slash
+      // command.
+      await Bun.sleep(3000);
+    }
    const since = session.mark();
    session.send(`/${opts.skillName}\r`);

@@ -1123,20 +1240,24 @@ export async function runPlanSkillObservation(opts: {
          elapsedMs: Date.now() - startedAt,
        };
      }
-      const classified = classifyVisible(visible);
+      const classified = classifyVisible(visible, {
+        strictPlanWrites: !!opts.initialPlanContent,
+      });
      if (classified) {
        const obs: PlanSkillObservation = {
          ...classified,
          evidence: visible.slice(-2000),
          elapsedMs: Date.now() - startedAt,
        };
-        // For plan_ready outcomes, capture the plan file path from the full
-        // visible buffer — tests under --disallowedTools verify the file's
-        // contents to distinguish legitimate fallback flow from silent-skip.
-        if (classified.outcome === 'plan_ready') {
-          const planFile = extractPlanFilePath(visible);
-          if (planFile) obs.planFile = planFile;
-        }
+        // Capture the plan file path on any outcome where one may have been
+        // written. Gating only on 'plan_ready' missed two cases: (1) the
+        // 'asked' outcome where the model wrote a plan partway through then
+        // paused on a question, and (2) 'wrote_findings_before_asking' where
+        // the bug is precisely that the plan was written. The
+        // assertReviewReportAtBottom checks downstream gate on planFile
+        // existing, not on the outcome.
+        const planFile = extractPlanFilePath(visible);
+        if (planFile) obs.planFile = planFile;
        return obs;
      }
    }
@@ -295,6 +295,78 @@ describe('classifyVisible (runtime path through the runner classifier)', () => {
    // recent-tail window would surface here.
    expect(TAIL_SCAN_BYTES).toBe(1500);
  });
+
+  // D4-B: strictPlanWrites detector. Catches the transcript bug where the
+  // model writes findings to the plan file before any AskUserQuestion fires.
+  test('strictPlanWrites: plan write before any AUQ → wrote_findings_before_asking', () => {
+    const visible = `
+      ⏺ Edit(/Users/me/.claude/plans/some-plan.md)
+      ⎿  Updated 12 lines
+    `;
+    const result = classifyVisible(visible, { strictPlanWrites: true });
+    expect(result?.outcome).toBe('wrote_findings_before_asking');
+    expect(result?.summary).toContain('.claude/plans/some-plan.md');
+  });
+
+  test('strictPlanWrites: plan write AFTER an AUQ render → not flagged', () => {
+    // AUQ renders first, then the model writes the plan post-answer. This is
+    // the legitimate end-of-workflow flow and must NOT trigger the detector.
+    const visible = `
+      D1 — Some scope question
+
+      ❯ 1. Option A
+        2. Option B
+
+      ⏺ Edit(/Users/me/.claude/plans/some-plan.md)
+      ⎿  Updated 12 lines
+    `;
+    const result = classifyVisible(visible, { strictPlanWrites: true });
+    // Outcome is 'asked' (the numbered list rendered); the post-AUQ plan
+    // write is ignored by the detector.
+    expect(result?.outcome).toBe('asked');
+  });
+
+  test('strictPlanWrites: AUQ first then plan write — write_pos > auq_pos → not flagged', () => {
+    // Same scenario, more explicit ordering: the regex finds the write at a
+    // position AFTER the numbered list. Detector lets it through.
+    const visible = [
+      'D1 — Choose your approach',
+      '',
+      '❯ 1. Approach A',
+      '  2. Approach B',
+      '',
+      '⏺ Write(/Users/me/.claude/plans/draft.md)',
+      '⎿  Wrote 42 lines',
+    ].join('\n');
+    const result = classifyVisible(visible, { strictPlanWrites: true });
+    expect(result?.outcome).toBe('asked');
+  });
+
+  test('strictPlanWrites: only a permission dialog visible → plan write still flagged', () => {
+    // A permission dialog ❯ 1./2. is NOT an AUQ; pre-AUQ plan writes still
+    // hit the detector even when a permission prompt is on screen.
+    const visible = `
+      ⏺ Edit(/Users/me/.claude/plans/some-plan.md)
+
+      Edit to /Users/me/.claude/plans/some-plan.md
+
+      Do you want to proceed?
+
+      ❯ 1. Yes
+        2. No
+    `;
+    const result = classifyVisible(visible, { strictPlanWrites: true });
+    expect(result?.outcome).toBe('wrote_findings_before_asking');
+  });
+
+  test('strictPlanWrites OFF: plan write before AUQ → returns null (legacy behavior preserved)', () => {
+    const visible = `
+      ⏺ Edit(/Users/me/.claude/plans/some-plan.md)
+      ⎿  Updated 12 lines
+    `;
+    // Without strictPlanWrites, the sanctioned-path list lets this through.
+    expect(classifyVisible(visible)).toBeNull();
+  });
 });

 describe('parseNumberedOptions', () => {
@@ -89,10 +89,10 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  // include question-tuning.ts and generate-ask-user-format.ts because the
  // AUTO_DECIDE preamble injection lives there and changes can flip the
  // regression test outcome between 'asked' and 'auto_decided'.
-  'plan-ceo-review-plan-mode':    ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
-  'plan-eng-review-plan-mode':    ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
-  'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
-  'plan-devex-review-plan-mode':  ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
+  'plan-ceo-review-plan-mode':    ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
+  'plan-eng-review-plan-mode':    ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
+  'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
+  'plan-devex-review-plan-mode':  ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts'],
  'plan-mode-no-op':              ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],

  // v1.21+ AskUserQuestion-blocked regression tests — Conductor launches
@@ -240,7 +240,6 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'design-consultation-existing':   ['design-consultation/**', 'scripts/gen-skill-docs.ts'],
  'design-consultation-research':   ['design-consultation/**', 'scripts/gen-skill-docs.ts'],
  'design-consultation-preview':    ['design-consultation/**', 'scripts/gen-skill-docs.ts'],
-  'plan-design-review-plan-mode':   ['plan-design-review/**', 'scripts/gen-skill-docs.ts'],
  'plan-design-review-no-ui-scope': ['plan-design-review/**', 'scripts/gen-skill-docs.ts'],
  'design-review-fix':              ['design-review/**', 'browse/src/**', 'scripts/gen-skill-docs.ts'],

@@ -521,7 +520,6 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  'design-consultation-existing': 'periodic',
  'design-consultation-research': 'gate',
  'design-consultation-preview': 'gate',
-  'plan-design-review-plan-mode': 'periodic',
  'plan-design-review-no-ui-scope': 'gate',
  'design-review-fix': 'periodic',
  'design-shotgun-path': 'gate',