v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills (#1354)

* feat(test/helpers): runPlanSkillFloorCheck, minimal AskUserQuestion-floor observer

Adds a focused PTY observer that exits at the first non-permission
numbered-option render. Catches the May 2026 transcript-bug class (model
wrote plan + ExitPlanMode without firing any AUQ) without needing to
fingerprint or navigate past the AUQ.

Why separate from runPlanSkillCounting: plan-mode AUQs render every option
on a single logical line via cursor-positioning escapes that stripAnsi
can't simulate, so parseNumberedOptions returns < 2 options and never
records a fingerprint. Counting tests work on 25-min budgets because
eventually one frame parses cleanly; gate-tier floor tests need to exit
early on the first observation. Trades fingerprint precision for
early-exit reliability.

Also drops the COMPLETION_SUMMARY_RE check from this helper: it matches
"GSTACK REVIEW REPORT" anywhere in the buffer, including when the agent
does recon by reading existing plan files. plan_ready (claude's actual
"Ready to execute" confirmation) is the reliable terminal signal for
"agent finished without asking."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resolvers): generateAntiShortcutClause shared resolver

Adds a {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver
function in scripts/resolvers/review.ts. Plan-* review skills can now
include the clause via one placeholder line in their .tmpl rather than
cloning the paragraph four times. Future tightening edits one resolver;
all four skills update on the next gen-skill-docs run.

Wired into the existing RESOLVERS map alongside generateReviewDashboard
and generatePlanFileReviewReport. No gen-skill-docs.ts change is needed
because the generator already does generic placeholder substitution
against that map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(plan-*-review): anti-shortcut clause in all four review skills

Inserts the {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the
**Anti-skip rule:** paragraph in plan-{eng,ceo,design,devex}-review
SKILL.md.tmpl. The four templates use different surrounding section
headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex
variants), so anchoring on the paragraph rather than the heading works
across all four.

Closes the May 2026 transcript-bug loophole: existing STOP gates name
forbidden actions only AFTER a per-section finding is identified. The
anti-shortcut clause adds the pre-emptive rule ("the plan file is the
OUTPUT of the interactive review, not a substitute for it"), covering the
case the transcript exhibited (skip per-section walk, dump every finding
into one plan write, call ExitPlanMode).

Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: gate-tier AskUserQuestion floor tests for all plan-* review skills

Adds 4 finding-floor tests (one per plan-* skill) that catch the May 2026
transcript-bug class: model wrote a plan and called ExitPlanMode without
firing any review-phase AskUserQuestion. Asserts via
runPlanSkillFloorCheck that ANY non-permission AUQ render fires before the
agent reaches plan_ready.

Verified:
- Eng floor: passed in 59s
- CEO floor: passed in 197s
- Design floor: passed
- Devex floor: passed
- Total ~$2-6 per CI run; only triggers on diff against the 4 plan-*
  templates, the shared resolver review.ts, the seeds fixture, or the PTY
  runner helper.

Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant per
skill. Each seed is engineered to force at least one obvious finding under
that skill's review focus (architectural smell for eng, scope-creep for
ceo, UI-slop for design, painful onboarding for devex).

Touchfiles wiring:
- E2E_TOUCHFILES: 4 plan-*-finding-floor entries with deps on the matching
  skill template, the shared resolver, the seeds fixture, and the PTY
  runner helper
- E2E_TIERS: all 4 entries marked 'gate'
- touchfiles.test.ts: count assertion bumped 21→22 with explicit
  plan-ceo-finding-floor containment check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.27.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:06:42 +02:00 · 2026-05-06 20:27:20 -07:00
parent f44de365c5
commit 7b4738bca0
21 changed files with 532 additions and 5 deletions
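
The shared-resolver change described in the commit message relies on generic placeholder substitution against a resolver map. A minimal sketch of that idea follows. Everything in it is an assumption for illustration: the map shape, the `renderTemplate` function, and the "Anti-shortcut rule" label are hypothetical stand-ins, not the contents of scripts/resolvers/review.ts or gen-skill-docs.ts. Only the quoted clause sentence comes from the commit message.

```typescript
// Hypothetical sketch of generic {{PLACEHOLDER}} substitution against a
// resolver map. Names and the clause label are illustrative assumptions.
type Resolver = () => string;

// Assumed shape: placeholder name -> function producing the replacement text.
const RESOLVERS: Record<string, Resolver> = {
  ANTI_SHORTCUT_CLAUSE: () =>
    '**Anti-shortcut rule:** the plan file is the OUTPUT of the interactive review, not a substitute for it.',
};

// Replace every {{NAME}} that has a resolver; leave unknown placeholders as-is.
function renderTemplate(
  tmpl: string,
  resolvers: Record<string, Resolver> = RESOLVERS,
): string {
  return tmpl.replace(/\{\{([A-Z0-9_]+)\}\}/g, (match: string, name: string) => {
    const resolver = resolvers[name];
    return resolver ? resolver() : match;
  });
}

const tmpl = [
  '**Anti-skip rule:** never skip a section.',
  '{{ANTI_SHORTCUT_CLAUSE}}',
  '{{UNKNOWN}}',
].join('\n');

// The known placeholder is expanded; {{UNKNOWN}} passes through untouched.
console.log(renderTemplate(tmpl));
```

With this shape, tightening the clause means editing one resolver function, and every template that carries the placeholder picks up the change on the next render.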
@@ -0,0 +1,83 @@
+/**
+ * Per-skill draft-plan seeds engineered to surface at least one
+ * review-phase finding in the corresponding plan-* review skill.
+ *
+ * Used by gate-tier finding-floor tests
+ * (test/skill-e2e-plan-{eng,ceo,design,devex}-finding-floor.test.ts) as
+ * the minimum-cost regression for the May 2026 transcript bug:
+ *   "/plan-eng-review reviewed a real PR diff, wrote a multi-section
+ *    review plan to ~/.claude/plans/ and called ExitPlanMode without
+ *    ever firing AskUserQuestion."
+ *
+ * Each seed is small and pre-loaded with one obvious finding the
+ * matching skill cannot honestly miss. Floor tests assert
+ * `outcome === 'auq_observed'`, i.e. the model fired at least one
+ * review-phase AUQ before reaching plan_ready or the timeout.
+ *
+ * Each seed includes the standard "write your plan-mode plan to /tmp/…"
+ * preamble that the existing periodic finding-count fixtures use, so
+ * the agent has a concrete plan-file target. The /tmp path is unique
+ * per skill to avoid collisions if floor tests run in parallel.
+ *
+ * For a deeper [N-1, N+2] count band assertion, see the periodic
+ * test/skill-e2e-plan-{X}-finding-count.test.ts fixtures.
+ */
+
+export const FORCING_FLOOR_ENG = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-eng-floor.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: Add request-id propagation across services',
+  '',
+  '## Architecture',
+  "We'll roll a custom UUIDv7 generator inline in each service rather than",
+  "use Node's crypto.randomUUID() built-in. Same shape, but we want full",
+  'control over the entropy source for "future flexibility" — no concrete',
+  'reason yet.',
+].join('\n');
+
+export const FORCING_FLOOR_CEO = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-ceo-floor.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: Launch a "developer-friendly" pricing tier',
+  '',
+  '## Goal',
+  'Increase developer adoption.',
+  '',
+  '## Success metric',
+  'More signups.',
+  '',
+  '## Premise',
+  "We haven't talked to any developers about whether the current pricing",
+  'is actually a barrier. The team agreed it "feels like" it should be cheaper.',
+].join('\n');
+
+export const FORCING_FLOOR_DESIGN = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-design-floor.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: Marketing landing page',
+  '',
+  '## Layout',
+  'All headings, taglines, and body copy will be center-aligned for a',
+  '"clean modern look." The hero h1 sits 8px above the subhead with no',
+  'breathing room; the CTA button is the same visual weight as a',
+  'secondary "Learn more" link directly beside it.',
+].join('\n');
+
+export const FORCING_FLOOR_DEVEX = [
+  'Please review this plan thoroughly. As you go, write your plan-mode plan to /tmp/gstack-test-plan-devex-floor.md (use Edit/Write to that exact path).',
+  '',
+  '# Plan: SDK quickstart docs',
+  '',
+  '## Onboarding flow',
+  'Step 1: clone the repo.',
+  'Step 2: install bun manually if not present.',
+  'Step 3: copy .env.example to .env and fill in 8 environment variables.',
+  'Step 4: run database migrations against your local Postgres.',
+  'Step 5: start the dev server.',
+  'Step 6: open the docs in a separate tab.',
+  'Step 7: register an API key by emailing the team.',
+  'Step 8: paste the key into your .env, restart the server, then make',
+  'your first SDK call.',
+  '',
+  'No quickstart command, no hosted sandbox, no copy-pasteable curl example.',
+].join('\n');
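
The header comment above says each seed's /tmp path is unique per skill so that parallel floor tests do not collide. A small guard along these lines could check that property. `extractPlanPath`, `findDuplicatePaths`, and the inline sample strings are hypothetical; the samples copy only the path-bearing part of each seed's preamble.

```typescript
// Hypothetical guard: every seed preamble should name a distinct /tmp plan path.
// The sample strings mirror only the preamble fragment of each seed.
const SAMPLE_SEEDS = {
  eng: 'write your plan-mode plan to /tmp/gstack-test-plan-eng-floor.md (use Edit/Write to that exact path).',
  ceo: 'write your plan-mode plan to /tmp/gstack-test-plan-ceo-floor.md (use Edit/Write to that exact path).',
  design: 'write your plan-mode plan to /tmp/gstack-test-plan-design-floor.md (use Edit/Write to that exact path).',
  devex: 'write your plan-mode plan to /tmp/gstack-test-plan-devex-floor.md (use Edit/Write to that exact path).',
};

// Pull the first /tmp/*.md path out of a seed, or null if none is present.
function extractPlanPath(seed: string): string | null {
  const m = /\/tmp\/[\w.-]+\.md/.exec(seed);
  return m ? m[0] : null;
}

// Return every path that appears in more than one seed.
function findDuplicatePaths(seeds: Record<string, string>): string[] {
  const seen = new Set<string>();
  const dupes: string[] = [];
  for (const [skill, seed] of Object.entries(seeds)) {
    const path = extractPlanPath(seed);
    if (path === null) throw new Error(`seed for ${skill} has no /tmp plan path`);
    if (seen.has(path)) dupes.push(path);
    seen.add(path);
  }
  return dupes;
}

// The four sample paths are distinct, so no duplicates are reported.
console.log(findDuplicatePaths(SAMPLE_SEEDS).length);
```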
@@ -1550,3 +1550,167 @@ export async function runPlanSkillCounting(opts: {
    await session.close();
  }
 }
+
+// ────────────────────────────────────────────────────────────────────────────
+// runPlanSkillFloorCheck — minimal "did the agent fire ANY AskUserQuestion?"
+// observer for gate-tier floor tests catching the May 2026 transcript bug
+// (model wrote plan + ExitPlanMode'd with reviewCount=0).
+//
+// Why this exists separately from runPlanSkillCounting: plan-mode AUQs render
+// every option on a single logical line via cursor-positioning escapes that
+// stripAnsi can't simulate. parseNumberedOptions therefore returns < 2 options
+// from those frames and never records a fingerprint. The full counting helper
+// works for periodic finding-count tests because their 25-min budgets give the
+// agent enough redraws that one frame eventually parses cleanly. Gate-tier
+// floor tests don't have that wall-time budget and need to exit early on the
+// first observation. This helper trades fingerprint precision for early-exit
+// reliability.
+//
+// Contract:
+//   - PASS  → outcome === 'auq_observed' (agent rendered any non-permission
+//             numbered-option list; we exit immediately and report success)
+//   - FAIL  → outcome === 'plan_ready' | 'silent_write' | 'exited'
+//             (agent reached a terminal state without ever firing an AUQ;
+//             'plan_ready' and 'silent_write' ARE the transcript bug)
+//   - SOFT  → outcome === 'timeout' (neither happened in budget; agent may
+//             just be slow — test should retry with a larger budget rather
+//             than treat as a hard regression)
+// ────────────────────────────────────────────────────────────────────────────
+
+export interface PlanSkillFloorObservation {
+  /** True iff a review-phase AUQ render was observed. */
+  auqObserved: boolean;
+  outcome:
+    | 'auq_observed'
+    | 'plan_ready'
+    | 'silent_write'
+    | 'exited'
+    | 'timeout';
+  summary: string;
+  /** Visible TTY tail (last 3KB) at terminal time. */
+  evidence: string;
+  /** Wall time (ms) until the outcome was decided. */
+  elapsedMs: number;
+}
+
+/**
+ * Drive a plan-* skill in plan mode and exit at the first non-permission
+ * numbered-option render. See block comment above for the contract.
+ */
+export async function runPlanSkillFloorCheck(opts: {
+  /** Skill name, e.g. 'plan-eng-review'. Used for diagnostic strings only. */
+  skillName: string;
+  /** Slash command to send alone, e.g. '/plan-eng-review'. */
+  slashCommand: string;
+  /** Plan content sent as a follow-up message ~3s after the slash command. */
+  followUpPrompt: string;
+  /** Working directory. Default process.cwd(). */
+  cwd?: string;
+  /** Total budget. Default 600000 (10 min). Tests exit early on AUQ. */
+  timeoutMs?: number;
+  /** Extra env merged into the spawned `claude` process. */
+  env?: Record<string, string>;
+}): Promise<PlanSkillFloorObservation> {
+  const startedAt = Date.now();
+  const timeoutMs = opts.timeoutMs ?? 600_000;
+
+  const session = await launchClaudePty({
+    permissionMode: 'plan',
+    cwd: opts.cwd,
+    timeoutMs: timeoutMs + 60_000,
+    env: opts.env,
+  });
+
+  try {
+    await Bun.sleep(8000); // boot grace + auto-trust handler window
+    const since = session.mark();
+    session.send(`${opts.slashCommand}\r`);
+    await Bun.sleep(3000);
+    session.send(`${opts.followUpPrompt}\r`);
+
+    const start = Date.now();
+    while (Date.now() - start < timeoutMs) {
+      await Bun.sleep(2000);
+      const visible = session.visibleSince(since);
+
+      if (session.exited()) {
+        return {
+          auqObserved: false,
+          outcome: 'exited',
+          summary: `claude exited (code=${session.exitCode()}) before any AUQ render`,
+          evidence: visible.slice(-3000),
+          elapsedMs: Date.now() - startedAt,
+        };
+      }
+      if (visible.includes('Unknown command:')) {
+        return {
+          auqObserved: false,
+          outcome: 'exited',
+          summary: `claude rejected ${opts.slashCommand} as unknown command`,
+          evidence: visible.slice(-3000),
+          elapsedMs: Date.now() - startedAt,
+        };
+      }
+
+      // Success: ANY non-permission numbered-option list is an AUQ render.
+      // The bug we're catching is "fired zero AUQs," so observing one is
+      // sufficient — we don't need to fingerprint or navigate past it.
+      if (
+        isNumberedOptionListVisible(visible) &&
+        !isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))
+      ) {
+        return {
+          auqObserved: true,
+          outcome: 'auq_observed',
+          summary: 'agent rendered an AskUserQuestion (floor met)',
+          evidence: visible.slice(-3000),
+          elapsedMs: Date.now() - startedAt,
+        };
+      }
+
+      // Silent write outside sanctioned dirs is the transcript-bug shape.
+      const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
+      let m: RegExpExecArray | null;
+      while ((m = writeRe.exec(visible)) !== null) {
+        const target = m[1] ?? '';
+        const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s));
+        if (!sanctioned && !isNumberedOptionListVisible(visible)) {
+          return {
+            auqObserved: false,
+            outcome: 'silent_write',
+            summary: `Write/Edit to ${target} fired before any AskUserQuestion`,
+            evidence: visible.slice(-3000),
+            elapsedMs: Date.now() - startedAt,
+          };
+        }
+      }
+
+      // Reached terminal without AUQ → transcript-bug regression.
+      // Note: COMPLETION_SUMMARY_RE is intentionally NOT checked here — it
+      // matches "GSTACK REVIEW REPORT" anywhere in the buffer, including
+      // when the agent does recon by reading existing plan files (which
+      // contain that string as a generated section). The plan_ready check
+      // (claude's actual "Ready to execute" confirmation) is the reliable
+      // terminal signal for "agent finished without asking."
+      if (isPlanReadyVisible(visible)) {
+        return {
+          auqObserved: false,
+          outcome: 'plan_ready',
+          summary: 'agent reached plan_ready without firing any AskUserQuestion',
+          evidence: visible.slice(-3000),
+          elapsedMs: Date.now() - startedAt,
+        };
+      }
+    }
+
+    return {
+      auqObserved: false,
+      outcome: 'timeout',
+      summary: `no AUQ render and no terminal outcome within ${timeoutMs}ms`,
+      evidence: session.visibleSince(since).slice(-3000),
+      elapsedMs: Date.now() - startedAt,
+    };
+  } finally {
+    await session.close();
+  }
+}
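
The block comment on runPlanSkillFloorCheck explains that plan-mode AUQs place each option with cursor-positioning escapes, so stripping the escapes leaves every option on one logical line. The sketch below reproduces that effect with a simplified escape stripper and a line-anchored option parser. Both `stripCsi` and `parseLineAnchoredOptions` are hypothetical simplifications, not the repository's stripAnsi or parseNumberedOptions, and the frame content is a made-up example.

```typescript
// Simplified CSI stripper (assumption: stands in for a stripAnsi-style helper).
// It deletes escape sequences but does not move text to the row they target.
function stripCsi(s: string): string {
  return s.replace(/\x1b\[[0-9;?]*[ -\/]*[@-~]/g, '');
}

// Line-anchored parser (assumption): an option is a line that starts with "N. ".
function parseLineAnchoredOptions(text: string): string[] {
  return text
    .split('\n')
    .map((line) => /^\s*(?:❯\s*)?(\d+)\.\s+(.+)$/.exec(line))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => `${m[1]}. ${m[2]}`);
}

// Three options drawn on rows 5-7 via "cursor position" (CSI row;col H), no newlines.
const positioned = '\x1b[5;1H1. Fix now\x1b[6;1H2. Defer\x1b[7;1H3. Skip';
// The same three options drawn with real newlines.
const newlineDrawn = '1. Fix now\n2. Defer\n3. Skip';

console.log(parseLineAnchoredOptions(stripCsi(positioned)).length);   // 1
console.log(parseLineAnchoredOptions(stripCsi(newlineDrawn)).length); // 3
```

In the cursor-positioned frame the parser sees a single line, so it reports fewer than two options. That is consistent with the reason given for having the floor helper look only for the presence of a numbered-option list instead of a parsed fingerprint.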
@@ -133,6 +133,16 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'plan-eng-finding-count':      ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-eng-finding-count.test.ts'],
  'plan-design-finding-count':   ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-design-finding-count.test.ts'],
  'plan-devex-finding-count':    ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-devex-finding-count.test.ts'],
+
+  // Gate-tier reviewCount-floor counterparts. Catch the May 2026 transcript
+  // bug (model wrote a plan-mode plan and ExitPlanMode'd without firing any
+  // review-phase AskUserQuestion). Uses runPlanSkillFloorCheck — minimal
+  // "did agent fire ANY AUQ?" observer that exits early on first non-permission
+  // numbered-option render. ~1-3 min typical wall time per test, ~$2-6 total.
+  'plan-eng-finding-floor':      ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-eng-finding-floor.test.ts'],
+  'plan-ceo-finding-floor':      ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-ceo-finding-floor.test.ts'],
+  'plan-design-finding-floor':   ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-design-finding-floor.test.ts'],
+  'plan-devex-finding-floor':    ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/review.ts', 'test/helpers/claude-pty-runner.ts', 'test/fixtures/forcing-finding-seeds.ts', 'test/skill-e2e-plan-devex-finding-floor.test.ts'],
  'brain-privacy-gate':           ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-artifacts-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'],

  // /setup-gbrain Path 4 (Remote MCP) — happy + bad-token end-to-end via
@@ -429,6 +439,10 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  'plan-eng-finding-count':    'periodic',
  'plan-design-finding-count': 'periodic',
  'plan-devex-finding-count':  'periodic',
+  'plan-eng-finding-floor':    'gate',
+  'plan-ceo-finding-floor':    'gate',
+  'plan-design-finding-floor': 'gate',
+  'plan-devex-finding-floor':  'gate',

  // Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call,
  // costs ~$0.30-$0.50 per run, not needed on every commit)
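
The two maps above work together: a test becomes a candidate when a changed file matches one of its touchfile patterns, and its tier records whether it belongs to the gate or the periodic set. A minimal sketch of that selection follows. The `matchesPattern` and `selectGateTests` helpers are hypothetical, the glob handling covers only exact paths and a trailing `/**`, and the real selectTests may behave differently.

```typescript
type Tier = 'gate' | 'periodic';

// Trimmed copies of two entries, for illustration only.
const TOUCHFILES: Record<string, string[]> = {
  'plan-eng-finding-floor': [
    'plan-eng-review/**',
    'scripts/resolvers/review.ts',
    'test/fixtures/forcing-finding-seeds.ts',
  ],
  'plan-eng-finding-count': ['plan-eng-review/**', 'test/helpers/claude-pty-runner.ts'],
};

const TIERS: Record<string, Tier> = {
  'plan-eng-finding-floor': 'gate',
  'plan-eng-finding-count': 'periodic',
};

// Simplified matcher (assumption): "dir/**" matches anything under dir/,
// every other pattern must equal the file path exactly.
function matchesPattern(file: string, pattern: string): boolean {
  if (pattern.endsWith('/**')) return file.startsWith(pattern.slice(0, -2));
  return file === pattern;
}

// Names of gate-tier tests touched by at least one changed file.
function selectGateTests(changedFiles: string[]): string[] {
  return Object.entries(TOUCHFILES)
    .filter(
      ([name, patterns]) =>
        TIERS[name] === 'gate' &&
        changedFiles.some((f) => patterns.some((p) => matchesPattern(f, p))),
    )
    .map(([name]) => name);
}

// Only the floor test depends on the shared resolver.
console.log(selectGateTests(['scripts/resolvers/review.ts']));
// Both tests match the template change, but the count test is periodic.
console.log(selectGateTests(['plan-eng-review/SKILL.md.tmpl']));
// Unrelated change: nothing selected.
console.log(selectGateTests(['README.md']));
```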
@@ -0,0 +1,37 @@
+/**
+ * /plan-ceo-review AskUserQuestion floor regression (gate, paid, real-PTY).
+ *
+ * See test/skill-e2e-plan-eng-finding-floor.test.ts for the contract.
+ */
+
+import { describe, test } from 'bun:test';
+import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
+import { FORCING_FLOOR_CEO } from './fixtures/forcing-finding-seeds';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+describeE2E('/plan-ceo-review AskUserQuestion floor (gate)', () => {
+  test(
+    'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
+    async () => {
+      const obs = await runPlanSkillFloorCheck({
+        skillName: 'plan-ceo-review',
+        slashCommand: '/plan-ceo-review',
+        followUpPrompt: FORCING_FLOOR_CEO,
+        cwd: process.cwd(),
+        timeoutMs: 600_000,
+        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+      });
+
+      if (obs.outcome !== 'auq_observed') {
+        throw new Error(
+          `floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
+            `summary: ${obs.summary}\n` +
+            `--- evidence (last 3KB) ---\n${obs.evidence}`,
+        );
+      }
+    },
+    660_000,
+  );
+});
@@ -0,0 +1,37 @@
+/**
+ * /plan-design-review AskUserQuestion floor regression (gate, paid, real-PTY).
+ *
+ * See test/skill-e2e-plan-eng-finding-floor.test.ts for the contract.
+ */
+
+import { describe, test } from 'bun:test';
+import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
+import { FORCING_FLOOR_DESIGN } from './fixtures/forcing-finding-seeds';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+describeE2E('/plan-design-review AskUserQuestion floor (gate)', () => {
+  test(
+    'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
+    async () => {
+      const obs = await runPlanSkillFloorCheck({
+        skillName: 'plan-design-review',
+        slashCommand: '/plan-design-review',
+        followUpPrompt: FORCING_FLOOR_DESIGN,
+        cwd: process.cwd(),
+        timeoutMs: 600_000,
+        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+      });
+
+      if (obs.outcome !== 'auq_observed') {
+        throw new Error(
+          `floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
+            `summary: ${obs.summary}\n` +
+            `--- evidence (last 3KB) ---\n${obs.evidence}`,
+        );
+      }
+    },
+    660_000,
+  );
+});
@@ -0,0 +1,37 @@
+/**
+ * /plan-devex-review AskUserQuestion floor regression (gate, paid, real-PTY).
+ *
+ * See test/skill-e2e-plan-eng-finding-floor.test.ts for the contract.
+ */
+
+import { describe, test } from 'bun:test';
+import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
+import { FORCING_FLOOR_DEVEX } from './fixtures/forcing-finding-seeds';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+describeE2E('/plan-devex-review AskUserQuestion floor (gate)', () => {
+  test(
+    'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
+    async () => {
+      const obs = await runPlanSkillFloorCheck({
+        skillName: 'plan-devex-review',
+        slashCommand: '/plan-devex-review',
+        followUpPrompt: FORCING_FLOOR_DEVEX,
+        cwd: process.cwd(),
+        timeoutMs: 600_000,
+        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+      });
+
+      if (obs.outcome !== 'auq_observed') {
+        throw new Error(
+          `floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
+            `summary: ${obs.summary}\n` +
+            `--- evidence (last 3KB) ---\n${obs.evidence}`,
+        );
+      }
+    },
+    660_000,
+  );
+});
@@ -0,0 +1,52 @@
+/**
+ * /plan-eng-review AskUserQuestion floor regression (gate, paid, real-PTY).
+ *
+ * Catches the May 2026 transcript bug where /plan-eng-review wrote a
+ * multi-section review plan to ~/.claude/plans/ and called ExitPlanMode
+ * without firing any AskUserQuestion. See
+ * `.context/attachments/pasted_text_2026-05-06_10-25-23.txt`.
+ *
+ * Uses runPlanSkillFloorCheck — a minimal "did the agent fire ANY AUQ?"
+ * observer that exits early on the first non-permission numbered-option
+ * render. See claude-pty-runner.ts for why this is separate from the
+ * runPlanSkillCounting harness used by periodic finding-count tests.
+ *
+ * Tier: gate. Budget: 10 min (early exit on success ~30-90s typical).
+ * Cost: ~$0.50-$1.50 per run depending on early-exit timing.
+ */
+
+import { describe, test } from 'bun:test';
+import { runPlanSkillFloorCheck } from './helpers/claude-pty-runner';
+import { FORCING_FLOOR_ENG } from './fixtures/forcing-finding-seeds';
+
+const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
+const describeE2E = shouldRun ? describe : describe.skip;
+
+describeE2E('/plan-eng-review AskUserQuestion floor (gate)', () => {
+  test(
+    'seeded forcing finding causes the agent to fire at least one AskUserQuestion',
+    async () => {
+      const obs = await runPlanSkillFloorCheck({
+        skillName: 'plan-eng-review',
+        slashCommand: '/plan-eng-review',
+        followUpPrompt: FORCING_FLOOR_ENG,
+        cwd: process.cwd(),
+        timeoutMs: 600_000,
+        env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+      });
+
+      if (obs.outcome !== 'auq_observed') {
+        throw new Error(
+          `floor test FAILED: outcome=${obs.outcome} elapsed=${obs.elapsedMs}ms\n` +
+            `summary: ${obs.summary}\n` +
+            `If outcome is plan_ready or silent_write, this is the transcript-bug ` +
+            `regression: agent reached terminal without firing AskUserQuestion. See ` +
+            `.context/attachments/pasted_text_2026-05-06_10-25-23.txt.\n` +
+            `If outcome is timeout, agent may just be slow — re-run or increase budget.\n` +
+            `--- evidence (last 3KB) ---\n${obs.evidence}`,
+        );
+      }
+    },
+    660_000,
+  );
+});
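
The failure message in the eng floor test treats outcomes differently: a timeout is a soft signal worth a re-run, while the other non-AUQ outcomes point at a regression. One way to make that distinction explicit is a small classifier over the outcome union. `classifyFloorOutcome` and the `Verdict` type are hypothetical and not part of the helper; only the outcome names come from PlanSkillFloorObservation, and treating 'exited' as a hard failure is an assumption.

```typescript
// Outcome names copied from PlanSkillFloorObservation['outcome'].
type FloorOutcome = 'auq_observed' | 'plan_ready' | 'silent_write' | 'exited' | 'timeout';
// Hypothetical verdict labels mirroring the PASS / FAIL / SOFT contract.
type Verdict = 'pass' | 'fail' | 'soft';

function classifyFloorOutcome(outcome: FloorOutcome): Verdict {
  switch (outcome) {
    case 'auq_observed':
      return 'pass'; // floor met: at least one AUQ rendered
    case 'timeout':
      return 'soft'; // agent may just be slow; retry with a larger budget
    case 'plan_ready':
    case 'silent_write':
    case 'exited': // assumption: an early exit is treated as a hard failure
      return 'fail';
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // outcome is added to the union without a matching case.
      const unreachable: never = outcome;
      throw new Error(`unknown outcome: ${String(unreachable)}`);
    }
  }
}

console.log(classifyFloorOutcome('auq_observed')); // pass
console.log(classifyFloorOutcome('timeout'));      // soft
console.log(classifyFloorOutcome('plan_ready'));   // fail
```

A test could then retry once on 'soft' and throw only on 'fail', which keeps slow runs from being reported as transcript-bug regressions.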
@@ -103,8 +103,10 @@ describe('selectTests', () => {
    // auto-decide-preserved also depend on plan-ceo-review/**
    expect(result.selected).toContain('autoplan-auto-mode');
    expect(result.selected).toContain('auto-decide-preserved');
-    expect(result.selected.length).toBe(21);
-    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 21);
+    // v1.27+ gate-tier reviewCount-floor regression for transcript bug
+    expect(result.selected).toContain('plan-ceo-finding-floor');
+    expect(result.selected.length).toBe(22);
+    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 22);
  });

  test('global touchfile triggers ALL tests', () => {