v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255)

* test: extract classifyVisible() + permission-dialog filter in PTY runner

Pure classifier extracted from runPlanSkillObservation's polling loop so unit tests can exercise the actual branch order with synthetic input strings.

Runner gains:
- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty). gstack-config does not yet honor env overrides; plumbing is in place for a future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now requires a file-edit context co-trigger. Other clauses unchanged. Skill questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write → plan_ready → asked → null. Permission dialogs filtered out of the 'asked' classification so a permission prompt cannot pose as a Step 0 skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the co-trigger contract.

* test: tighten plan-ceo-review smoke to require Step 0 fires first

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching plan_ready first means the agent skipped Step 0 entirely and went straight to ExitPlanMode, which is the regression we want to catch.

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex (whose smokes legitimately reach plan_ready on certain branches without asking), plan-ceo-review's template mandates Step 0A premise challenge plus Step 0F mode selection BEFORE any plan write. There is no legitimate path to plan_ready that does not first emit a skill-question numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs silent_write) with a tailored diagnosis line per case. References the skill template by section name ("Step 0 STOP rules", "One issue = one AskUserQuestion call") instead of line numbers, so it survives template edits.

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' } through the runner. Today this is advisory (gstack-config reads only ~/.gstack/config.yaml, not env vars), but the wiring is in place for a future change. Documented honestly in the docstring.

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.

* chore: capture v1.21.1.0 follow-ups in TODOS.md

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial review passes. Captured here so the design intent persists.

* chore: bump version and changelog (v1.21.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract MODE_RE + optionsSignature into PTY runner exports

Refactor prep for the upcoming per-finding AskUserQuestion count test across plan-{ceo,eng,design,devex}-review. Both new tests and the existing mode-routing test need the same mode regex and the same option-list fingerprint dedupe; pulling them into one source of truth in test/helpers/claude-pty-runner.ts means a fifth mode (or a tweak to the fingerprint shape) updates everywhere instead of drifting per-test.

Mechanical: no behavior change in the mode-routing test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add per-finding count primitives + unit tests

Pure helpers landing ahead of runPlanSkillCounting:
- parseQuestionPrompt(visible): extract the 1-3 line prompt above the latest "❯ 1." cursor, normalize to a 240-char snippet
- auqFingerprint(prompt, opts): Bun.hash of normalized prompt + sorted options signature; distinct prompts with shared option labels (the generic A/B/C TODO menu) get distinct fingerprints
- COMPLETION_SUMMARY_RE: terminal-signal regex matching all four plan-review skills' completion / verdict markers
- assertReviewReportAtBottom(content): checks "## GSTACK REVIEW REPORT" is present and is the last "## " heading in a plan file
- Step0BoundaryPredicate type + four per-skill predicates (ceo / eng / design / devex): fire on the answered AUQ's fingerprint, marking the end of Step 0 deterministically (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision regression, prompt extraction edge cases, predicate positive AND negative cases, and review-report-at-bottom triple-check (missing / mid-file / multiple trailing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add runPlanSkillCounting PTY helper

Drives a plan-* skill end-to-end and counts distinct review-phase AskUserQuestions. Composes the primitives from the previous commit:
- Boot + auto-trust handler (existing launchClaudePty)
- Send slash command alone, sleep 3s, send plan content as follow-up message (proven pattern from skill-e2e-plan-design-with-ui)
- Poll loop with permission-dialog auto-grant, same-redraw skip, empty-prompt re-poll
- Event-based Step-0 boundary via isLastStep0AUQ predicate fired on the answered AUQ's fingerprint (Codex F7: boundary is observed event, not later rendered content)
- Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE, plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in auqFingerprint's unit tests; fingerprinting them would re-introduce the option-label collision regression Codex F1 caught.

No E2E tests yet; those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: register four finding-count tests in touchfiles + tier map

Each new test depends on its skill template, the runner, and three preamble resolvers (preamble.ts, generate-ask-user-format.ts, generate-completion-status.ts). Those affect question cadence and completion rendering, which is exactly what the test asserts on.

All four classified periodic. Sequential execution during calibration; opt-in to concurrent only after measured comparison agrees (plan §D15).

Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests (was 18) because plan-ceo-finding-count joins the family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)

Each test drives its plan-* skill through Step 0 then asserts the review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5 seeded plan, plus D19: produced plan file ends with "## GSTACK REVIEW REPORT" as its last "## " heading.

plan-ceo also runs a paired-finding positive control: 2 deliberately related findings should still produce 2 distinct AUQs, not 1 batched.

Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic). Sequential execution by plan §D15. Each fixture is inline TypeScript content delivered as a follow-up message after the slash command, per the proven pattern at skill-e2e-plan-design-with-ui.test.ts.

Calibration loop (5 runs per skill) and the manual pre-merge negative check (D7 + D12) are required before merge per plan §Verification. NOT yet run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: fix parseNumberedOptions for inline-cursor box-layout AUQs

Calibration run 1 timed out with step0=0 review=0 because the parser could not find the cursor in /plan-ceo-review's scope-selection AUQ. The TTY's box-layout rendering inlines divider + header + prompt + "1." onto one logical line: cursor escapes get stripped, leaving text crushed onto a single line.

Cursor anchor regex changed from anchored to unanchored so it matches mid-line. Cursor-line option extraction uses a non-anchored regex; subsequent options stay with the original start-of-line parser.

parseQuestionPrompt picks up the inline prompt text BEFORE the cursor on the cursor line (after stripping box-drawing chars + sigil) and appends it after any walked-up multi-line prompt above.

Three new unit tests: clean-cursor still works, inline-cursor extracts all 7 options, prompt extraction strips box chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add firstAUQPick + plan-ceo skip-interview routing

Calibration run 1 surfaced a second issue beyond the parser bug: the default pick of 1 on /plan-ceo-review's scope-selection AUQ routes the agent to "branch diff vs main", so it reviews the gstack PR itself (recursive!) instead of the seeded fixture plan we sent.

Added firstAUQPick callback to runPlanSkillCounting. Override applies only to the FIRST AUQ; subsequent presses keep using defaultPick.

ceoStep0Boundary now fires on either the mode-pick AUQ (existing path) or any AUQ containing "Skip interview and plan immediately", which is the scope-selection AUQ. Picking that option bypasses Step 0 and routes straight to review-phase using the chat-paste plan as context.

Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the "Skip interview" option by label. Falls back to "describe inline" if the option labels change.

Two new unit tests: ceoStep0Boundary fires on the scope-selection fixture; existing mode-pick fixture still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
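
The co-trigger rule from the first commit in this message can be sketched in isolation. The regexes are copied from `isPermissionDialogVisible` in this commit's diff; the function name `looksLikePermissionDialog` and the three sample buffers are hypothetical, made up only to illustrate the behavior.

```typescript
// Sketch of the co-trigger rule. Regexes are copied from the diff;
// the function name and the sample buffers are hypothetical.
function looksLikePermissionDialog(visible: string): boolean {
  // Standalone signatures count on their own.
  if (/requested\s+permissions?\s+to/i.test(visible)) return true;
  if (/\ballow\s+all\s+edits\b/i.test(visible)) return true;
  if (/always\s+allow\s+access\s+to/i.test(visible)) return true;
  if (/Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)) return true;
  // The bare phrase only counts when a file-edit context is also visible.
  return (
    /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) &&
    /(Edit|Write)\s+to\s+\S+/i.test(visible)
  );
}

// A skill question that happens to use the bare phrase.
const skillQuestion = 'Do you want to proceed?\n❯ 1. HOLD SCOPE\n  2. SCOPE EXPANSION';
// A file-edit permission dialog: bare phrase plus "Edit to <path>".
const editDialog = 'Edit to src/app.ts\nDo you want to proceed?\n❯ 1. Yes\n  2. No';
// A workspace-trust dialog: matched by a standalone signature.
const trustDialog = '❯ 1. Yes\n  2. Yes, and always allow access to /tmp/work\n  3. No';

const skillIsPermission = looksLikePermissionDialog(skillQuestion);
const editIsPermission = looksLikePermissionDialog(editDialog);
const trustIsPermission = looksLikePermissionDialog(trustDialog);

console.log(skillIsPermission); // false
console.log(editIsPermission); // true
console.log(trustIsPermission); // true
```

Because the file-edit regex is case-insensitive and unanchored, a skill question that itself mentions something like "write to disk" next to the bare phrase would presumably still be treated as a permission dialog; the sketch does not try to address that.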
2026-05-01 19:25:10 +02:00 · 2026-04-30 02:50:09 -07:00
parent e8893a18b1
commit 454423aeb3
14 changed files with 2271 additions and 78 deletions
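
The option-label collision that the commit message mentions is easier to see with the pre-hash key written out. In the sketch below, `optionsSignature` follows the shape shown in the diff, while `fingerprintKey` is a hypothetical helper that returns the string the commit passes to `Bun.hash`; the hash step is left out so the example runs outside Bun. The menu labels and prompts are invented for illustration.

```typescript
type Option = { index: number; label: string };

// Same shape as optionsSignature in the diff: sort by index, then join.
function optionsSignature(opts: Option[]): string {
  return [...opts]
    .sort((a, b) => a.index - b.index)
    .map((o) => `${o.index}:${o.label}`)
    .join('|');
}

// Hypothetical helper: builds the pre-hash key only.
// The commit's auqFingerprint hashes this string with Bun.hash.
function fingerprintKey(prompt: string, opts: Option[]): string {
  const normalized = prompt.replace(/\s+/g, ' ').trim();
  return normalized + '||' + optionsSignature(opts);
}

// A generic menu shared by several questions (labels are made up).
const menu: Option[] = [
  { index: 2, label: 'Defer' },
  { index: 1, label: 'Add to plan' },
  { index: 3, label: 'Build now' },
];

// Two polls of the same question, differing only in whitespace and option order.
const redrawA = fingerprintKey('Add  retry logic?', menu);
const redrawB = fingerprintKey(' Add retry logic? ', [...menu].reverse());
// A different question that reuses the same menu.
const other = fingerprintKey('Add rate limiting?', menu);

console.log(optionsSignature(menu)); // 1:Add to plan|2:Defer|3:Build now
console.log(redrawA === redrawB); // true
console.log(redrawA === other); // false
```

Keying on the options alone would merge the last two questions into one; including the normalized prompt keeps them apart while still treating a redraw as the same question.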
@@ -138,6 +138,15 @@ export function isPlanReadyVisible(visible: string): boolean {
  return /ready to execute|Would you like to proceed/i.test(visible);
 }

+/**
+ * Recent-tail window (in bytes of stripped TTY text) used when classifying
+ * permission dialogs. Old permission text persists in the visibleSince buffer
+ * after the dialog is dismissed, so callers should pass `visible.slice(-TAIL_SCAN_BYTES)`
+ * to avoid re-triggering on stale scrollback. Shared between `runPlanSkillObservation`
+ * and `navigateToModeAskUserQuestion` in the routing test so tuning stays in sync.
+ */
+export const TAIL_SCAN_BYTES = 1500;
+
 /**
 * Detect a Claude Code permission dialog. These render as a numbered
 * option list (so isNumberedOptionListVisible matches them) but they
@@ -145,23 +154,37 @@ export function isPlanReadyVisible(visible: string): boolean {
 * whether to grant a tool/file permission. Tests that look for skill
 * AskUserQuestions must explicitly skip these.
 *
- * Both English phrases below are stable across recent Claude Code
+ * The English phrases below are stable across recent Claude Code
 * versions. The check is permissive on whitespace because TTY rendering
 * may wrap or reflow text.
+ *
+ * Co-trigger requirement: the bare phrase "Do you want to proceed?" is
+ * generic enough that a skill question could legitimately use it
+ * ("Do you want to proceed with HOLD SCOPE?"). To avoid mis-classifying
+ * skill questions as permission dialogs, this phrase only counts when it
+ * co-occurs with a file-edit context ("Edit to <path>" or "Write to <path>").
+ * The standalone permission signatures (`requested permissions to`,
+ * `allow all edits`, `always allow access to`, `Bash command requires permission`)
+ * remain unconditional.
 */
 export function isPermissionDialogVisible(visible: string): boolean {
-  return (
-    /requested\s+permissions?\s+to/i.test(visible) ||
-    /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) ||
-    // "Yes / Yes, allow all edits / No" shape rendered by Claude Code for
-    // file-edit permission grants. The middle option's "allow all" phrase
-    // is the unique signature.
-    /\ballow\s+all\s+edits\b/i.test(visible) ||
-    // "Yes, and always allow access to <dir>" shape (workspace trust).
-    /always\s+allow\s+access\s+to/i.test(visible) ||
-    // Bash command permission prompts.
-    /Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)
-  );
+  // Standalone signatures — high specificity, never appear in skill questions.
+  if (/requested\s+permissions?\s+to/i.test(visible)) return true;
+  // "Yes / Yes, allow all edits / No" shape — file-edit permission grants.
+  if (/\ballow\s+all\s+edits\b/i.test(visible)) return true;
+  // "Yes, and always allow access to <dir>" shape — workspace trust.
+  if (/always\s+allow\s+access\s+to/i.test(visible)) return true;
+  // Bash command permission prompts.
+  if (/Bash\s+command\s+.*\s+requires\s+permission/i.test(visible)) return true;
+  // "Do you want to proceed?" only counts as a permission dialog when paired
+  // with a file-edit context. Skill questions can use the bare phrase.
+  if (
+    /Do\s+you\s+want\s+to\s+proceed\?/i.test(visible) &&
+    /(Edit|Write)\s+to\s+\S+/i.test(visible)
+  ) {
+    return true;
+  }
+  return false;
 }

 /** Detect any AskUserQuestion-shaped numbered option list with cursor. */
@@ -211,12 +234,14 @@ export function parseNumberedOptions(
  // this, parseNumberedOptions returns stale options after the dialog is
  // dismissed.
  const lines = tail.split('\n');
-  // Anchor on the LAST `❯ 1.` line (cursor is on option 1 of the active
-  // AskUserQuestion). Greedy character classes don't help here — we need a literal
-  // `❯` after optional leading whitespace.
+  // Anchor on the LAST line containing `❯<spaces>1.` ANYWHERE on the line.
+  // The /plan-*-review skill's box-layout AUQ uses TTY cursor-positioning
+  // escapes that stripAnsi removes — leaving the cursor `❯1.` mid-line,
+  // after dividers + header + prompt text on the same logical line. The
+  // earlier `^\s*❯` anchor missed those entirely.
  let cursorLineIdx = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
-    if (/^\s*❯\s*1\./.test(lines[i] ?? '')) {
+    if (/❯\s*1\./.test(lines[i] ?? '')) {
      cursorLineIdx = i;
      break;
    }
@@ -236,7 +261,37 @@ export function parseNumberedOptions(
  if (cursorLineIdx < 0) return [];
  const found: Array<{ index: number; label: string }> = [];
  const seenIndices = new Set<number>();
-  for (let i = cursorLineIdx; i < lines.length; i++) {
+
+  // Cursor line: option 1 may be inline after box dividers + prompt header
+  // (`...divider...header...❯1. label`). Use a non-anchored regex that
+  // captures `❯N. label` from anywhere on the line through end-of-line.
+  // Only used for the cursor line — subsequent options are parsed with the
+  // start-of-line `optionRe`.
+  const cursorLine = lines[cursorLineIdx] ?? '';
+  const cursorInlineRe = /❯\s*([1-9])\.\s*(\S.*?)\s*$/;
+  const inlineMatch = cursorInlineRe.exec(cursorLine);
+  if (inlineMatch) {
+    const idx = Number(inlineMatch[1]);
+    const label = (inlineMatch[2] ?? '').trim();
+    if (label.length > 0 && !seenIndices.has(idx)) {
+      seenIndices.add(idx);
+      found.push({ index: idx, label });
+    }
+  } else {
+    // No inline cursor match — fall back to start-of-line regex.
+    const startMatch = optionRe.exec(cursorLine);
+    if (startMatch) {
+      const idx = Number(startMatch[1]);
+      const label = (startMatch[2] ?? '').trim();
+      if (label.length > 0 && !seenIndices.has(idx)) {
+        seenIndices.add(idx);
+        found.push({ index: idx, label });
+      }
+    }
+  }
+
+  // Subsequent lines: standard start-of-line option parsing.
+  for (let i = cursorLineIdx + 1; i < lines.length; i++) {
    const m = optionRe.exec(lines[i] ?? '');
    if (!m) continue;
    const idx = Number(m[1]);
@@ -261,6 +316,333 @@ export function parseNumberedOptions(
  return found;
 }

+/**
+ * The four /plan-ceo-review modes. Used by `skill-e2e-plan-ceo-mode-routing`
+ * to detect Step 0F mode-selection AskUserQuestions, and by the upcoming
+ * finding-count tests as a Step-0 boundary signal: an AUQ whose options
+ * match this regex IS the mode pick (the last Step-0 question for plan-ceo).
+ *
+ * Lifted out of the mode-routing test so multiple PTY tests can share one
+ * source of truth — when /plan-ceo-review adds a fifth mode, one regex updates
+ * everywhere instead of drifting per-test.
+ */
+export const MODE_RE = /HOLD SCOPE|SCOPE EXPANSION|SELECTIVE EXPANSION|SCOPE REDUCTION/i;
+
+/**
+ * Stable signature for a parsed numbered-option list — used by tests to detect
+ * "is this AUQ the same as the last poll, or has the agent advanced to a new
+ * one?" Joins each option as `${index}:${label}` after sorting by index.
+ *
+ * Defensive sort means the signature is order-independent at the input level,
+ * even though `parseNumberedOptions` already returns indices in ascending order.
+ */
+export function optionsSignature(
+  opts: Array<{ index: number; label: string }>,
+): string {
+  return [...opts]
+    .sort((a, b) => a.index - b.index)
+    .map((o) => `${o.index}:${o.label}`)
+    .join('|');
+}
+
+/**
+ * Pure classifier for the visible TTY buffer. Decides which outcome the
+ * polling loop should return on this tick, or `null` to keep polling.
+ *
+ * Extracted from `runPlanSkillObservation` so the unit suite can exercise
+ * the actual branch order with synthetic input strings — a future contributor
+ * who reorders the branches (e.g., moves the permission short-circuit) gets
+ * caught by the unit tests, not by a stochastic E2E run.
+ *
+ * Live-state branches (process exited, "Unknown command") stay in the runner
+ * since they need the session handle.
+ */
+export type ClassifyResult =
+  | { outcome: 'silent_write'; summary: string }
+  | { outcome: 'plan_ready'; summary: string }
+  | { outcome: 'asked'; summary: string }
+  | null;
+
+const SANCTIONED_WRITE_SUBSTRINGS = [
+  '.claude/plans',
+  '.gstack/',
+  '/.context/',
+  'CHANGELOG.md',
+  'TODOS.md',
+];
+
+export function classifyVisible(visible: string): ClassifyResult {
+  // Silent-write detection: any Write/Edit tool render that targets a path
+  // OUTSIDE the sanctioned dirs, AND no numbered prompt is currently on screen
+  // (a numbered prompt means a permission/AskUserQuestion is gating the write,
+  // not an actual silent write).
+  const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
+  let m: RegExpExecArray | null;
+  while ((m = writeRe.exec(visible)) !== null) {
+    const target = m[1] ?? '';
+    const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) => target.includes(s));
+    if (!sanctioned && !isNumberedOptionListVisible(visible)) {
+      return {
+        outcome: 'silent_write',
+        summary: `Write/Edit to ${target} fired before any AskUserQuestion`,
+      };
+    }
+  }
+  if (isPlanReadyVisible(visible)) {
+    return {
+      outcome: 'plan_ready',
+      summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation',
+    };
+  }
+  if (isNumberedOptionListVisible(visible)) {
+    // Permission dialogs render numbered lists too. Skip them — the
+    // bug we want to catch is "skill question never fired."
+    if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
+      return null;
+    }
+    return {
+      outcome: 'asked',
+      summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)',
+    };
+  }
+  return null;
+}
+
+// ────────────────────────────────────────────────────────────────────────────
+// Per-finding AskUserQuestion count primitives (used by runPlanSkillCounting).
+//
+// These are pure helpers extracted up-front so the unit suite can exercise
+// them deterministically before the live-PTY counter runs them. Each one is
+// independently unit-testable against synthetic visible-buffer strings.
+// ────────────────────────────────────────────────────────────────────────────
+
+/**
+ * Captured identity of an AskUserQuestion — the rendered question text plus
+ * its numbered options. Used by `runPlanSkillCounting` to dedupe redrawn
+ * prompts and to feed `Step0BoundaryPredicate` callers.
+ *
+ * `signature` is the stable hash. Two AUQs with identical prompt + options
+ * produce the same signature; differences in either field produce different
+ * signatures. Critically: two AUQs with shared option labels (e.g. the
+ * generic "A) Add to plan / B) Defer / C) Build now" menu) but different
+ * question text get DIFFERENT signatures because the prompt is in the hash.
+ */
+export interface AskUserQuestionFingerprint {
+  /** Stable hash combining normalized prompt text + options signature. */
+  signature: string;
+  /** First 240 chars of the rendered question prompt (post-normalization). */
+  promptSnippet: string;
+  /** Captured option labels, in index order. */
+  options: Array<{ index: number; label: string }>;
+  /** Wall-clock when first observed (ms since the helper started polling). */
+  observedAtMs: number;
+  /** True if observed BEFORE the Step-0 boundary fired. */
+  preReview: boolean;
+}
+
+/**
+ * Predicate fired against the AUQ we just answered (not the visible buffer).
+ * Returns true if this AUQ's fingerprint marks the LAST Step-0 question for
+ * its skill — all subsequent AUQs are review-phase findings.
+ *
+ * Event-based by design: matching against an answered AUQ's fingerprint
+ * (prompt + options) is deterministic, whereas matching against later
+ * rendered content (section headers, summary text) races with the agent's
+ * output cadence. See plan §D14 for the rationale.
+ */
+export type Step0BoundaryPredicate = (
+  answeredFingerprint: AskUserQuestionFingerprint,
+) => boolean;
+
+/**
+ * Parse the rendered question prompt out of a visible TTY buffer. The prompt
+ * is the 1–3 lines of text immediately ABOVE the latest `❯ 1.` cursor line —
+ * not part of the option list, not the permission-dialog header.
+ *
+ * Returns the prompt normalized to a single-spaced 240-char snippet (strip
+ * ANSI residue, collapse internal whitespace, trim) — short enough to use as
+ * a hash key, long enough to disambiguate distinct questions.
+ *
+ * Returns "" when no prompt could be parsed (cursor not yet rendered, or
+ * cursor is at the top of the buffer with no preceding text). Callers that
+ * use the empty string as a fingerprint input should treat empty-prompt
+ * AUQs as "wait one more poll" rather than fingerprinting them — otherwise
+ * the same options + empty prompt across two distinct questions collide.
+ */
+export function parseQuestionPrompt(visible: string): string {
+  // Tail-only — older prompts higher in the buffer are stale.
+  const tail = visible.length > 4096 ? visible.slice(-4096) : visible;
+  const lines = tail.split('\n');
+
+  // Find the latest line containing `❯<spaces>1.` (matching parseNumberedOptions —
+  // unanchored to handle the box-layout case where cursor is mid-line after
+  // divider + header + prompt text on the same logical line).
+  let cursorLineIdx = -1;
+  for (let i = lines.length - 1; i >= 0; i--) {
+    if (/❯\s*1\./.test(lines[i] ?? '')) {
+      cursorLineIdx = i;
+      break;
+    }
+  }
+  if (cursorLineIdx < 0) return '';
+
+  // Box-layout case: prompt text may be ON the cursor line, BEFORE `❯1.`.
+  // Extract that prefix (after stripping leading box-drawing characters and
+  // dividers) as the last piece of the prompt — appended after any prior
+  // multi-line prompt text we walk up to find.
+  const cursorLine = lines[cursorLineIdx] ?? '';
+  let inlinePrompt = '';
+  const cursorPos = cursorLine.search(/❯\s*1\./);
+  if (cursorPos > 0) {
+    inlinePrompt = cursorLine
+      .slice(0, cursorPos)
+      // Strip box-drawing chars + dividers + leading checkbox sigil.
+      .replace(/^[─━┄┅┈┉─┌┐└┘├┤┬┴┼│┃☐□■\s]+/, '')
+      .trim();
+  }
+
+  // Walk up at most 6 lines collecting prompt text. Stop at:
+  //   - a blank line preceded by another blank line (paragraph break)
+  //   - top of buffer
+  //   - a line that itself starts with `N.` (we're inside an option list)
+  const promptLines: string[] = [];
+  let blankRun = 0;
+  for (let i = cursorLineIdx - 1; i >= 0 && promptLines.length < 6; i--) {
+    const raw = lines[i] ?? '';
+    const trimmed = raw.trim();
+    if (trimmed === '') {
+      blankRun += 1;
+      if (blankRun >= 2 && promptLines.length > 0) break;
+      continue;
+    }
+    blankRun = 0;
+    // Stop if we hit what looks like a previous numbered list.
+    if (/^[\s❯]*[1-9]\.\s+\S/.test(raw)) break;
+    promptLines.unshift(trimmed);
+  }
+
+  const all = inlinePrompt.length > 0 ? [...promptLines, inlinePrompt] : promptLines;
+  const joined = all.join(' ').replace(/\s+/g, ' ').trim();
+  return joined.slice(0, 240);
+}
+
+/**
+ * Stable hash for an AskUserQuestion's identity — combines normalized prompt
+ * text with the options signature so two distinct questions with shared menu
+ * labels (the generic A/B/C TODO-proposal menu, for instance) get different
+ * fingerprints.
+ *
+ * Uses Bun's fast non-crypto hash since these strings are short and we only
+ * need collision resistance against accidental TTY redraws, not adversaries.
+ * Hex-encoded for diagnostic dumps.
+ */
+export function auqFingerprint(
+  promptSnippet: string,
+  opts: Array<{ index: number; label: string }>,
+): string {
+  const normalized = promptSnippet.replace(/\s+/g, ' ').trim();
+  const sig = optionsSignature(opts);
+  // eslint-disable-next-line @typescript-eslint/no-explicit-any
+  return (Bun as any).hash(normalized + '||' + sig).toString(16);
+}
+
+/**
+ * Detects when a plan-* skill has reached its Completion Summary / Review
+ * Report — a terminal signal complementary to plan-mode's "Ready to execute"
+ * confirmation. Each plan-review skill writes one of these phrasings near
+ * the end of its run; matching any one is enough to stop counting.
+ *
+ * Best-effort: this is a content marker, not a deterministic event. Hard
+ * ceiling (`reviewCountCeiling` in `runPlanSkillCounting`) is the reliable
+ * stop signal; this regex is the "we're done, go gracefully" hint.
+ */
+export const COMPLETION_SUMMARY_RE =
+  /(GSTACK REVIEW REPORT|## Completion [Ss]ummary|Status:\s*(clean|issues_open)|^VERDICT:)/m;
+
+/**
+ * Result of asserting that a plan file ends with `## GSTACK REVIEW REPORT`
+ * as its last `## ` heading. `ok` is true iff the report is present AND no
+ * other `## ` heading appears after it. Diagnostic fields are populated only
+ * on failure to keep the success path cheap.
+ */
+export interface ReviewReportAtBottomResult {
+  ok: boolean;
+  reason?: string;
+  trailingHeadings?: string[];
+}
+
+/**
+ * Assert that `## GSTACK REVIEW REPORT` is the last `## ` heading in a plan
+ * file's content. Pure string operation — no filesystem access. Used by the
+ * finding-count E2E tests as a second assertion on each test's produced plan.
+ *
+ * The plan-mode skill template mandates the agent move/append the review
+ * report so it's always the last `##` section. A regression where the agent
+ * appends additional sections after the report (or skips it entirely) ships
+ * silently today; this assertion catches both.
+ */
+export function assertReviewReportAtBottom(
+  content: string,
+): ReviewReportAtBottomResult {
+  const re = /^## GSTACK REVIEW REPORT\s*$/m;
+  const match = re.exec(content);
+  if (!match) {
+    return { ok: false, reason: 'no GSTACK REVIEW REPORT section' };
+  }
+  const after = content.slice(match.index + match[0].length);
+  // Match any `## ` heading after the report. Reject `## ` followed by
+  // newline-only (trailing-whitespace ## headers) to avoid false positives.
+  const trailingHeadings = Array.from(
+    after.matchAll(/^## \S.*$/gm),
+  ).map((m) => m[0]);
+  if (trailingHeadings.length > 0) {
+    return {
+      ok: false,
+      reason: 'trailing ## heading(s) after GSTACK REVIEW REPORT',
+      trailingHeadings,
+    };
+  }
+  return { ok: true };
+}
+
+/**
+ * Per-skill Step-0 boundary predicates. Each fires `true` when the answered
+ * AUQ's fingerprint matches the LAST question of that skill's Step 0 phase.
+ *
+ * - `ceoStep0Boundary`: matches the mode-pick AUQ (options match `MODE_RE`).
+ * - `engStep0Boundary`: matches the cross-project-learnings or scope-reduction
+ *   AUQ that closes plan-eng-review's preamble.
+ * - `designStep0Boundary`: matches plan-design-review's first dimension /
+ *   posture AUQ.
+ * - `devexStep0Boundary`: matches plan-devex-review's persona-selection AUQ.
+ *
+ * Predicates live alongside the helper so the unit suite can exercise each
+ * against synthetic fingerprints (positive AND negative cases). Skill test
+ * files import them directly.
+ */
+export const ceoStep0Boundary: Step0BoundaryPredicate = (fp) =>
+  // Mode-pick path (Step 0F): one of HOLD SCOPE / SCOPE EXPANSION / etc.
+  fp.options.some((o) => MODE_RE.test(o.label)) ||
+  // Skip-interview path: scope-selection AUQ has "Skip interview and plan
+  // immediately" — picking it bypasses the rest of Step 0 and routes
+  // directly to review-phase. Boundary fires on the scope AUQ itself.
+  fp.options.some((o) => /skip\s+interview|plan\s+immediately/i.test(o.label));
+
+export const engStep0Boundary: Step0BoundaryPredicate = (fp) =>
+  /scope reduction recommendation|cross[\s-]?project learnings/i.test(
+    fp.promptSnippet,
+  );
+
+export const designStep0Boundary: Step0BoundaryPredicate = (fp) =>
+  /design system|design posture|design score|first dimension/i.test(
+    fp.promptSnippet,
+  );
+
+export const devexStep0Boundary: Step0BoundaryPredicate = (fp) =>
+  /developer persona|target persona|persona selection|TTHW target/i.test(
+    fp.promptSnippet,
+  );
+
 /**
 * Spawn `claude --permission-mode plan` in a real PTY and return a session
 * handle. Caller is responsible for `await session.close()` to release the
@@ -566,12 +948,21 @@ export async function runPlanSkillObservation(opts: {
  cwd?: string;
  /** Total budget for skill to reach a terminal outcome. Default 180000. */
  timeoutMs?: number;
+  /**
+   * Extra env merged into the spawned `claude` process. `launchClaudePty`
+   * already supports this; exposing it here lets per-skill tests isolate
+   * from local config that would mask the regression they're trying to
+   * catch (e.g., `QUESTION_TUNING=true` causing AUTO_DECIDE to skip the
+   * rendered AskUserQuestion list).
+   */
+  env?: Record<string, string>;
 }): Promise<PlanSkillObservation> {
  const startedAt = Date.now();
  const session = await launchClaudePty({
    permissionMode: opts.inPlanMode === false ? null : 'plan',
    cwd: opts.cwd,
    timeoutMs: (opts.timeoutMs ?? 180_000) + 30_000,
+    env: opts.env,
  });

  try {
@@ -602,40 +993,10 @@ export async function runPlanSkillObservation(opts: {
          elapsedMs: Date.now() - startedAt,
        };
      }
-      // Silent-write detection: any Write/Edit tool render that targets a
-      // path OUTSIDE ~/.claude/plans, ~/.gstack/, or the active worktree's
-      // .gstack/. Plan files and gbrain artifacts are sanctioned.
-      const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
-      let m: RegExpExecArray | null;
-      while ((m = writeRe.exec(visible)) !== null) {
-        const target = m[1] ?? '';
-        const sanctioned =
-          target.includes('.claude/plans') ||
-          target.includes('.gstack/') ||
-          target.includes('/.context/') ||
-          target.includes('CHANGELOG.md') ||
-          target.includes('TODOS.md');
-        if (!sanctioned && !isNumberedOptionListVisible(visible)) {
-          return {
-            outcome: 'silent_write',
-            summary: `Write/Edit to ${target} fired before any AskUserQuestion`,
-            evidence: visible.slice(-2000),
-            elapsedMs: Date.now() - startedAt,
-          };
-        }
-      }
-      if (isPlanReadyVisible(visible)) {
+      const classified = classifyVisible(visible);
+      if (classified) {
        return {
-          outcome: 'plan_ready',
-          summary: 'skill ran end-to-end and emitted plan-mode "Ready to execute" confirmation',
-          evidence: visible.slice(-2000),
-          elapsedMs: Date.now() - startedAt,
-        };
-      }
-      if (isNumberedOptionListVisible(visible)) {
-        return {
-          outcome: 'asked',
-          summary: 'skill fired a numbered-option prompt (AskUserQuestion or routing-injection)',
+          ...classified,
          evidence: visible.slice(-2000),
          elapsedMs: Date.now() - startedAt,
        };
@@ -652,3 +1013,281 @@ export async function runPlanSkillObservation(opts: {
    await session.close();
  }
 }
+
+// ────────────────────────────────────────────────────────────────────────────
+// runPlanSkillCounting — drives a plan-* skill end-to-end through Step 0 then
+// counts distinct review-phase AskUserQuestion fingerprints. That count is
+// what the per-finding-count tests assert on.
+// ────────────────────────────────────────────────────────────────────────────
+
+/**
+ * Result of a `runPlanSkillCounting` run. Includes both the count summary
+ * (`step0Count`, `reviewCount`) and the full fingerprint list for diagnostic
+ * dumps when an assertion fails.
+ */
+export interface PlanSkillCountObservation {
+  outcome:
+    | 'plan_ready'
+    | 'completion_summary'
+    | 'ceiling_reached'
+    | 'silent_write'
+    | 'exited'
+    | 'timeout';
+  summary: string;
+  /** Visible terminal text when the outcome was decided (last 3,000 chars). */
+  evidence: string;
+  /** Wall time (ms) until the outcome was decided. */
+  elapsedMs: number;
+  /** All distinct AskUserQuestions observed, in observation order. */
+  fingerprints: AskUserQuestionFingerprint[];
+  /** Count of fingerprints with `preReview === true`. */
+  step0Count: number;
+  /** Count of fingerprints with `preReview === false`. */
+  reviewCount: number;
+}
+
+/**
+ * Drive a plan-* skill in plan mode and count distinct review-phase
+ * AskUserQuestions until a terminal signal fires.
+ *
+ * Flow:
+ *   1. Boot PTY in plan mode (8s grace + auto-trust dialog).
+ *   2. Send `slashCommand` alone. Sleep ~3s.
+ *   3. Send `followUpPrompt` as a chat message — this is the plan content
+ *      the skill reviews. Slash commands with trailing args are rejected by
+ *      Claude Code unless the skill defines them, so the plan goes as a
+ *      follow-up message (the proven pattern at
+ *      skill-e2e-plan-design-with-ui.test.ts:57-71).
+ *   4. Poll loop:
+ *      - Skip permission dialogs (auto-grant with `defaultPick`).
+ *      - On a new numbered-option list, parse prompt + options, build
+ *        fingerprint via `auqFingerprint`. Empty-prompt parses are skipped
+ *        and re-polled (avoids the empty-prompt collision documented in
+ *        the auqFingerprint contract).
+ *      - First time we see a fingerprint: push it, classify as Step 0 or
+ *        review-phase based on `boundaryFired`, press `defaultPick` to
+ *        advance.
+ *      - After pressing, evaluate `isLastStep0AUQ(fingerprint)`. If true,
+ *        all subsequent AUQs are review-phase.
+ *      - Hard ceiling: if `reviewCount >= reviewCountCeiling`, return
+ *        `ceiling_reached`. This bounds runaway counts; tests should set
+ *        the ceiling above their assertion CEILING.
+ *      - Soft terminals: `COMPLETION_SUMMARY_RE` match → `completion_summary`;
+ *        plan-ready confirmation → `plan_ready`; silent write outside
+ *        sanctioned dirs → `silent_write`; process exited → `exited`;
+ *        wall clock exceeded → `timeout`.
+ *
+ * Boundary detection (D14): event-based, fired against the answered AUQ's
+ * fingerprint, not against later rendered content. This avoids the race
+ * where Step-0-final and Section-1-first AUQs straddle a section header
+ * regex match.
+ *
+ * Fingerprint composition (D9): `auqFingerprint(prompt, options)` mixes
+ * normalized prompt text with the options signature so distinct findings
+ * with shared menu structure (the generic A/B/C TODO menu) get distinct
+ * fingerprints.
+ */
+export async function runPlanSkillCounting(opts: {
+  /** Skill name, e.g. 'plan-ceo-review'. Used for diagnostic strings only. */
+  skillName: string;
+  /** Slash command to send alone, e.g. '/plan-ceo-review'. No trailing args. */
+  slashCommand: string;
+  /** Plan content sent as a follow-up message ~3s after the slash command. */
+  followUpPrompt: string;
+  /** Per-skill predicate: which answered AUQ is the last Step-0 question. */
+  isLastStep0AUQ: Step0BoundaryPredicate;
+  /** Hard cap on review-phase count; helper returns when reached. Should be
+   *  set ABOVE the test's assertion ceiling so the test sees the cap as a
+   *  failure rather than a silent stop. */
+  reviewCountCeiling: number;
+  /** Numbered option to press by default. Defaults to 1 (recommended). */
+  defaultPick?: number;
+  /**
+   * Optional override for the FIRST AUQ observed. Receives the fingerprint;
+   * returns the option index to press. Subsequent AUQs always use defaultPick.
+   *
+   * Skill-specific routing helper: /plan-ceo-review's first AUQ asks "what
+   * scope?" with options like "branch diff" / "describe inline" / "skip
+   * interview". Pressing the default 1 routes to "branch diff" (the wrong
+   * review target for a seeded fixture). firstAUQPick lets the test pick
+   * "Skip interview" or "describe inline" so the agent reviews the
+   * follow-up plan content the test sent, not the git diff.
+   */
+  firstAUQPick?: (fp: AskUserQuestionFingerprint) => number;
+  /** Working directory. Default process.cwd() (repo cwd holds skill registry). */
+  cwd?: string;
+  /** Total budget for skill to reach a terminal outcome. Default 1_500_000 (25 min). */
+  timeoutMs?: number;
+  /** Extra env merged into the spawned `claude` process. */
+  env?: Record<string, string>;
+}): Promise<PlanSkillCountObservation> {
+  const startedAt = Date.now();
+  const defaultPick = opts.defaultPick ?? 1;
+  const timeoutMs = opts.timeoutMs ?? 1_500_000;
+
+  const session = await launchClaudePty({
+    permissionMode: 'plan',
+    cwd: opts.cwd,
+    timeoutMs: timeoutMs + 60_000,
+    env: opts.env,
+  });
+
+  const fingerprints: AskUserQuestionFingerprint[] = [];
+  const seen = new Set<string>();
+  let boundaryFired = false;
+  let step0Count = 0;
+  let reviewCount = 0;
+  let isFirstAUQ = true;
+  let lastSig = '';
+
+  function snapshot(
+    outcome: PlanSkillCountObservation['outcome'],
+    summary: string,
+    visible: string,
+  ): PlanSkillCountObservation {
+    return {
+      outcome,
+      summary,
+      evidence: visible.slice(-3000),
+      elapsedMs: Date.now() - startedAt,
+      fingerprints,
+      step0Count,
+      reviewCount,
+    };
+  }
+
+  try {
+    await Bun.sleep(8000); // boot grace + auto-trust handler window
+    const since = session.mark();
+    session.send(`${opts.slashCommand}\r`);
+    await Bun.sleep(3000);
+    session.send(`${opts.followUpPrompt}\r`);
+
+    const budgetStart = Date.now();
+    while (Date.now() - budgetStart < timeoutMs) {
+      await Bun.sleep(2000);
+      const visible = session.visibleSince(since);
+
+      // Process exited?
+      if (session.exited()) {
+        return snapshot(
+          'exited',
+          `claude exited (code=${session.exitCode()}) during counting (step0=${step0Count}, review=${reviewCount})`,
+          visible,
+        );
+      }
+      if (visible.includes('Unknown command:')) {
+        return snapshot(
+          'exited',
+          `claude rejected ${opts.slashCommand} as unknown command (skill not registered in this cwd)`,
+          visible,
+        );
+      }
+
+      // Silent write detection — only fires if no numbered prompt is on
+      // screen (otherwise the write is gated by a permission/AUQ).
+      const writeRe = /⏺\s*(?:Write|Edit)\(([^)]+)\)/g;
+      let m: RegExpExecArray | null;
+      while ((m = writeRe.exec(visible)) !== null) {
+        const target = m[1] ?? '';
+        const sanctioned = SANCTIONED_WRITE_SUBSTRINGS.some((s) =>
+          target.includes(s),
+        );
+        if (!sanctioned && !isNumberedOptionListVisible(visible)) {
+          return snapshot(
+            'silent_write',
+            `Write/Edit to ${target} fired before any AskUserQuestion`,
+            visible,
+          );
+        }
+      }
+
+      // Soft terminal signals — check before AUQ processing so a final
+      // completion-summary doesn't get misclassified as a bonus AUQ.
+      if (COMPLETION_SUMMARY_RE.test(visible)) {
+        return snapshot(
+          'completion_summary',
+          `skill emitted completion summary / verdict / status line (step0=${step0Count}, review=${reviewCount})`,
+          visible,
+        );
+      }
+      if (isPlanReadyVisible(visible)) {
+        return snapshot(
+          'plan_ready',
+          `skill emitted plan-mode "Ready to execute" confirmation (step0=${step0Count}, review=${reviewCount})`,
+          visible,
+        );
+      }
+
+      // Numbered option list?
+      if (!isNumberedOptionListVisible(visible)) continue;
+
+      // Permission dialog? Answer with defaultPick (the default 1 is "Yes").
+      // Only scan the recent tail so stale dialogs in scrollback don't re-trigger.
+      if (isPermissionDialogVisible(visible.slice(-TAIL_SCAN_BYTES))) {
+        session.send(`${defaultPick}\r`);
+        await Bun.sleep(1500);
+        continue;
+      }
+
+      // Parse the active AUQ. Skip same-redraw and empty-prompt cases.
+      const options = parseNumberedOptions(visible);
+      if (options.length < 2) continue;
+      const sig = optionsSignature(options);
+      if (sig === lastSig) continue;
+      const promptSnippet = parseQuestionPrompt(visible);
+      if (promptSnippet === '') continue; // not yet rendered, poll again
+      lastSig = sig;
+
+      const fingerprintHash = auqFingerprint(promptSnippet, options);
+      if (seen.has(fingerprintHash)) {
+        // Same content, already counted (TTY redrew with whitespace diff).
+        continue;
+      }
+      seen.add(fingerprintHash);
+
+      const fp: AskUserQuestionFingerprint = {
+        signature: fingerprintHash,
+        promptSnippet,
+        options,
+        observedAtMs: Date.now() - startedAt,
+        preReview: !boundaryFired,
+      };
+      fingerprints.push(fp);
+      if (boundaryFired) reviewCount += 1;
+      else step0Count += 1;
+
+      // Press to advance — first AUQ may use the override pick.
+      const pickIdx =
+        isFirstAUQ && opts.firstAUQPick ? opts.firstAUQPick(fp) : defaultPick;
+      isFirstAUQ = false;
+      session.send(`${pickIdx}\r`);
+
+      // Evaluate boundary AFTER pressing — if THIS AUQ was the last Step 0
+      // question, all subsequent AUQs go to reviewCount.
+      if (!boundaryFired && opts.isLastStep0AUQ(fp)) {
+        boundaryFired = true;
+      }
+
+      // Hard ceiling — runaway protection.
+      if (reviewCount >= opts.reviewCountCeiling) {
+        return snapshot(
+          'ceiling_reached',
+          `review-phase AUQ count reached ceiling (${opts.reviewCountCeiling})`,
+          session.visibleSince(since),
+        );
+      }
+
+      // Give the agent a beat to advance to the next state.
+      await Bun.sleep(2000);
+    }
+
+    return snapshot(
+      'timeout',
+      `no terminal outcome within ${timeoutMs}ms (step0=${step0Count}, review=${reviewCount})`,
+      session.visibleSince(since),
+    );
+  } finally {
+    await session.close();
+  }
+}
@@ -0,0 +1,749 @@
+/**
+ * Deterministic unit tests for claude-pty-runner.ts behavior changes.
+ *
+ * Free-tier (no EVALS=1 needed). Runs in <1s on every `bun test`. Catches
+ * harness plumbing bugs before stochastic PTY runs surface them.
+ *
+ * Two surface areas tested:
+ *
+ * 1. Permission-dialog short-circuit in 'asked' classification: a TTY frame
+ *    that matches BOTH isPermissionDialogVisible AND isNumberedOptionListVisible
+ *    must NOT be classified as a skill question — permission dialogs render
+ *    as numbered lists too, but they're not what we're guarding.
+ *
+ * 2. Env passthrough surface: runPlanSkillObservation accepts an `env`
+ *    option and threads it to launchClaudePty. We can't fully exercise the
+ *    spawn pipeline without paying for a PTY session, but we CAN verify the
+ *    option exists in the type signature and that calling without env still
+ *    works (no regression).
+ *
+ * The PTY test (skill-e2e-plan-ceo-plan-mode.test.ts) is the integration
+ * check; this file is the cheap deterministic guard for the harness primitives
+ * those tests stand on.
+ */
+
+import { describe, test, expect } from 'bun:test';
+import {
+  isPermissionDialogVisible,
+  isNumberedOptionListVisible,
+  isPlanReadyVisible,
+  parseNumberedOptions,
+  classifyVisible,
+  TAIL_SCAN_BYTES,
+  optionsSignature,
+  parseQuestionPrompt,
+  auqFingerprint,
+  COMPLETION_SUMMARY_RE,
+  assertReviewReportAtBottom,
+  ceoStep0Boundary,
+  engStep0Boundary,
+  designStep0Boundary,
+  devexStep0Boundary,
+  type ClaudePtyOptions,
+  type AskUserQuestionFingerprint,
+} from './claude-pty-runner';
+
+describe('isPermissionDialogVisible', () => {
+  test('matches "Bash command requires permission" prompts', () => {
+    const sample = `
+      Some preamble output
+
+      Bash command \`gstack-config get telemetry\` requires permission to run.
+
+      ❯ 1. Yes
+        2. Yes, and always allow
+        3. No, abort
+    `;
+    expect(isPermissionDialogVisible(sample)).toBe(true);
+  });
+
+  test('matches "allow all edits" file-edit prompts', () => {
+    // Isolated to the "allow all edits" clause only — no overlapping
+    // "Do you want to proceed?" co-trigger, so this asserts the clause works.
+    const sample = `
+      Edit to ~/.gstack/config.yaml
+
+      ❯ 1. Yes
+        2. Yes, allow all edits during this session
+        3. No
+    `;
+    expect(isPermissionDialogVisible(sample)).toBe(true);
+  });
+
+  test('matches the "Do you want to proceed?" file-edit confirmation by itself', () => {
+    // Separate fixture so weakening this clause is detected by a dedicated test.
+    const sample = `
+      Edit to ~/.gstack/config.yaml
+
+      Do you want to proceed?
+
+      ❯ 1. Yes
+        2. No
+    `;
+    expect(isPermissionDialogVisible(sample)).toBe(true);
+  });
+
+  test('matches workspace-trust "always allow access to" prompt', () => {
+    const sample = `
+      Do you trust the files in this folder?
+
+      ❯ 1. Yes, proceed
+        2. Yes, and always allow access to /Users/me/repo
+        3. No, exit
+    `;
+    expect(isPermissionDialogVisible(sample)).toBe(true);
+  });
+
+  test('does NOT match a skill AskUserQuestion list', () => {
+    const sample = `
+      D1 — Premise challenge: do users actually want this?
+
+      ❯ 1. Yes, validated
+        2. No, premise is wrong
+        3. Need more info
+    `;
+    expect(isPermissionDialogVisible(sample)).toBe(false);
+  });
+
+  test('does NOT match a plan-ready confirmation', () => {
+    const sample = `
+      Ready to execute the plan?
+
+      ❯ 1. Yes
+        2. No, keep planning
+    `;
+    expect(isPermissionDialogVisible(sample)).toBe(false);
+  });
+
+  test('does NOT match a skill question that contains the bare phrase "Do you want to proceed?"', () => {
+    // Co-trigger requirement: "Do you want to proceed?" alone is not enough.
+    // It must appear with "Edit to <path>" or "Write to <path>" to count as
+    // a permission dialog. This guards against a skill question like
+    // "Do you want to proceed with HOLD SCOPE?" being mis-classified.
+    const sample = `
+      Choose your scope mode for this review.
+      Do you want to proceed?
+
+      ❯ 1. HOLD SCOPE
+        2. SCOPE EXPANSION
+        3. SELECTIVE EXPANSION
+    `;
+    expect(isPermissionDialogVisible(sample)).toBe(false);
+  });
+
+  test('KNOWN LIMITATION: mis-matches when adversarial prose includes "Edit to <path>" alongside the bare proceed phrase', () => {
+    // Adversarial fixture: a skill question whose body legitimately mentions
+    // "Edit to <path>" in prose AND ends with "Do you want to proceed?". The
+    // current co-trigger regex mis-classifies this as a permission dialog.
+    // Ideally it would not, but fixing that needs a tighter regex (e.g., a
+    // proximity constraint, or anchoring "Edit to" to a line-start). For
+    // now this is documented as a known limitation: a skill question that
+    // talks about "Edit to" in prose IS still treated as a permission
+    // dialog. The test asserts the current behavior so a future fix can
+    // flip it intentionally.
+    const sample = `
+      Plan: I will Edit to ./plan.md to capture the decision.
+      Do you want to proceed?
+
+      ❯ 1. HOLD SCOPE
+        2. SCOPE EXPANSION
+    `;
+    // KNOWN LIMITATION: the co-trigger fires here. Documented as a
+    // post-merge follow-up. Flip this assertion once the regex tightens.
+    expect(isPermissionDialogVisible(sample)).toBe(true);
+  });
+});
+
+describe('isNumberedOptionListVisible', () => {
+  test('matches a basic ❯ 1. + 2. cursor list', () => {
+    const sample = `
+      ❯ 1. Option one
+        2. Option two
+        3. Option three
+    `;
+    expect(isNumberedOptionListVisible(sample)).toBe(true);
+  });
+
+  test('returns false on a single-option prompt', () => {
+    const sample = `
+      ❯ 1. Only option
+    `;
+    expect(isNumberedOptionListVisible(sample)).toBe(false);
+  });
+
+  test('returns false when no cursor renders', () => {
+    const sample = `
+      Just some prose with 1. a numbered point and 2. another.
+    `;
+    expect(isNumberedOptionListVisible(sample)).toBe(false);
+  });
+
+  test('overlaps permission dialogs (this is why D5 short-circuits)', () => {
+    // The whole point of D5: this string matches BOTH classifiers, so the
+    // runner must consult isPermissionDialogVisible to disambiguate.
+    const sample = `
+      Bash command \`do-thing\` requires permission to run.
+
+      ❯ 1. Yes
+        2. No
+    `;
+    expect(isNumberedOptionListVisible(sample)).toBe(true);
+    expect(isPermissionDialogVisible(sample)).toBe(true);
+  });
+});
+
+describe('classifyVisible (runtime path through the runner classifier)', () => {
+  // These tests call the actual classifier so a future contributor who
+  // reorders branches (e.g. moves the permission short-circuit before
+  // isPlanReadyVisible) is caught deterministically.
+
+  test('skill question → returns asked', () => {
+    const visible = `
+      D1 — Choose your scope mode
+
+      ❯ 1. HOLD SCOPE
+        2. SCOPE EXPANSION
+        3. SELECTIVE EXPANSION
+        4. SCOPE REDUCTION
+    `;
+    const result = classifyVisible(visible);
+    expect(result?.outcome).toBe('asked');
+  });
+
+  test('permission dialog (Bash) → returns null (skip, keep polling)', () => {
+    const visible = `
+      Bash command \`gstack-update-check\` requires permission to run.
+
+      ❯ 1. Yes
+        2. No
+    `;
+    expect(isNumberedOptionListVisible(visible)).toBe(true); // pre-filter
+    expect(classifyVisible(visible)).toBeNull(); // post-filter
+  });
+
+  test('plan-ready confirmation → returns plan_ready (wins over asked)', () => {
+    const visible = `
+      Ready to execute the plan?
+
+      ❯ 1. Yes, proceed
+        2. No, keep planning
+    `;
+    const result = classifyVisible(visible);
+    expect(result?.outcome).toBe('plan_ready');
+  });
+
+  test('silent write to unsanctioned path → returns silent_write', () => {
+    const visible = `
+      ⏺ Write(src/app/dangerous-write.ts)
+      ⎿  Wrote 42 lines
+    `;
+    const result = classifyVisible(visible);
+    expect(result?.outcome).toBe('silent_write');
+    expect(result?.summary).toContain('src/app/dangerous-write.ts');
+  });
+
+  test('write to sanctioned path (.claude/plans) → returns null (allowed)', () => {
+    const visible = `
+      ⏺ Write(/Users/me/.claude/plans/some-plan.md)
+      ⎿  Wrote 42 lines
+    `;
+    expect(classifyVisible(visible)).toBeNull();
+  });
+
+  test('write while a permission dialog is on screen → returns null (gated, not silent, not asked)', () => {
+    const visible = `
+      ⏺ Write(src/app/edit-with-permission.ts)
+
+      Edit to src/app/edit-with-permission.ts
+
+      Do you want to proceed?
+
+      ❯ 1. Yes
+        2. No
+    `;
+    // The numbered prompt is a permission dialog (Edit to + Do you want to proceed?);
+    // silent_write is suppressed because a numbered prompt is visible, AND
+    // 'asked' is suppressed because the prompt is a permission dialog.
+    expect(classifyVisible(visible)).toBeNull();
+  });
+
+  test('write while a real skill question is on screen → returns asked (write is captured but not silent)', () => {
+    const visible = `
+      ⏺ Write(src/app/foo.ts)
+
+      D1 — Choose your scope mode
+
+      ❯ 1. HOLD SCOPE
+        2. SCOPE EXPANSION
+    `;
+    // The numbered prompt is a skill question, not a permission dialog;
+    // silent_write is suppressed (numbered prompt is visible) and the
+    // outcome is 'asked' — Step 0 fired.
+    const result = classifyVisible(visible);
+    expect(result?.outcome).toBe('asked');
+  });
+
+  test('idle / no signals → returns null', () => {
+    const visible = `
+      Some prose without any classifier signals.
+    `;
+    expect(classifyVisible(visible)).toBeNull();
+  });
+
+  test('TAIL_SCAN_BYTES is exported as 1500', () => {
+    // Shared between runner and routing test; a regression that desyncs the
+    // recent-tail window would surface here.
+    expect(TAIL_SCAN_BYTES).toBe(1500);
+  });
+});
+
+describe('parseNumberedOptions', () => {
+  test('extracts options from a clean cursor list', () => {
+    const visible = `
+      ❯ 1. HOLD SCOPE
+        2. SCOPE EXPANSION
+    `;
+    const opts = parseNumberedOptions(visible);
+    expect(opts).toHaveLength(2);
+    expect(opts[0]).toEqual({ index: 1, label: 'HOLD SCOPE' });
+    expect(opts[1]).toEqual({ index: 2, label: 'SCOPE EXPANSION' });
+  });
+
+  test('returns empty array on prose-with-numbers (no cursor)', () => {
+    expect(parseNumberedOptions('text 1. one 2. two')).toEqual([]);
+  });
+
+  test('extracts options when the cursor is INLINE with prompt header (box-layout)', () => {
+    // Real /plan-ceo-review rendering: the TTY's cursor-positioning escapes
+    // collapse divider + header + prompt + cursor onto one logical line.
+    // Subsequent options (2..7) still start their own lines.
+    const visible = [
+      '────────────────────────────────────────',
+      '☐ Review scope                                                     What scope do you want me to CEO-review?                                                     ❯ 1. The branch\'s diff vs main',
+      '   Review the full branch: ~10K LOC.',
+      '2. A specific plan file or design doc',
+      '   You point me at a file (path) and I review that.',
+      '3. An idea you\'ll describe inline',
+      '4. Cancel — wrong skill',
+      '5. Type something.',
+      '────────────────────────────────────────',
+      '6. Chat about this',
+      '7. Skip interview and plan immediately',
+    ].join('\n');
+    const opts = parseNumberedOptions(visible);
+    expect(opts).toHaveLength(7);
+    expect(opts[0]).toEqual({ index: 1, label: "The branch's diff vs main" });
+    expect(opts[1]?.index).toBe(2);
+    expect(opts[6]?.index).toBe(7);
+    expect(opts[6]?.label).toBe('Skip interview and plan immediately');
+  });
+
+  test('inline-cursor and start-of-line cursor layouts both produce the same 3 options', () => {
+    // The inline path captures option 1 from the cursor line itself; the
+    // subsequent-lines path captures 2..3 with the existing optionRe.
+    const inlineLayout = [
+      'header text                                                     ❯ 1. first option',
+      '2. second',
+      '3. third',
+    ].join('\n');
+    expect(parseNumberedOptions(inlineLayout)).toEqual([
+      { index: 1, label: 'first option' },
+      { index: 2, label: 'second' },
+      { index: 3, label: 'third' },
+    ]);
+
+    const cleanLayout = [
+      '  ❯ 1. first option',
+      '    2. second',
+      '    3. third',
+    ].join('\n');
+    expect(parseNumberedOptions(cleanLayout)).toEqual([
+      { index: 1, label: 'first option' },
+      { index: 2, label: 'second' },
+      { index: 3, label: 'third' },
+    ]);
+  });
+});
+
+describe('runPlanSkillObservation env passthrough surface', () => {
+  test('ClaudePtyOptions exposes env: Record<string, string>', () => {
+    // Type-level guard: this file would fail to compile if the env field
+    // were removed or its shape regressed. The actual env merge happens in
+    // launchClaudePty's spawn call (`env: { ...process.env, ...opts.env }`),
+    // so a regression where `env: opts.env` gets dropped from the
+    // runPlanSkillObservation -> launchClaudePty handoff is only caught by
+    // the live PTY test, not here.
+    const opts: ClaudePtyOptions = {
+      env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' },
+    };
+    expect(opts.env).toEqual({ QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' });
+  });
+});
+
+// ────────────────────────────────────────────────────────────────────────────
+// Per-finding count primitives — Section 3 unit tests #1–#5, #7, #12.
+// ────────────────────────────────────────────────────────────────────────────
+
+describe('optionsSignature', () => {
+  test('returns a "|"-joined `index:label` string for a clean list', () => {
+    const sig = optionsSignature([
+      { index: 1, label: 'HOLD SCOPE' },
+      { index: 2, label: 'SCOPE EXPANSION' },
+    ]);
+    expect(sig).toBe('1:HOLD SCOPE|2:SCOPE EXPANSION');
+  });
+
+  test('order-independent: shuffled inputs produce the same signature', () => {
+    // parseNumberedOptions already returns sorted, but defensive sort means
+    // a future caller that hands us shuffled input still produces a stable
+    // dedupe signature.
+    const a = optionsSignature([
+      { index: 2, label: 'B' },
+      { index: 1, label: 'A' },
+      { index: 3, label: 'C' },
+    ]);
+    const b = optionsSignature([
+      { index: 1, label: 'A' },
+      { index: 2, label: 'B' },
+      { index: 3, label: 'C' },
+    ]);
+    expect(a).toBe(b);
+  });
+
+  test('empty list returns empty string', () => {
+    expect(optionsSignature([])).toBe('');
+  });
+
+  test('single-item list returns just that entry', () => {
+    expect(optionsSignature([{ index: 1, label: 'Only' }])).toBe('1:Only');
+  });
+});
+
+describe('parseQuestionPrompt', () => {
+  test('captures 1-line prompt above the cursor', () => {
+    const visible = `
+      D1 — Pick a mode
+
+      ❯ 1. HOLD SCOPE
+        2. SCOPE EXPANSION
+    `;
+    const prompt = parseQuestionPrompt(visible);
+    expect(prompt).toBe('D1 — Pick a mode');
+  });
+
+  test('captures multi-line prompt above the cursor', () => {
+    const visible = `
+      D2 — Approach selection
+
+      Which architecture should we follow?
+
+      ❯ 1. Bypass existing helper
+        2. Reuse existing helper
+    `;
+    const prompt = parseQuestionPrompt(visible);
+    // Multi-line prompts get joined with single spaces.
+    expect(prompt).toContain('D2 — Approach selection');
+    expect(prompt).toContain('Which architecture should we follow?');
+  });
+
+  test('returns "" when no cursor is rendered', () => {
+    expect(parseQuestionPrompt('Just some prose.\nNo cursor.')).toBe('');
+  });
+
+  test('truncates to 240 chars', () => {
+    const longPrompt = 'A'.repeat(500);
+    const visible = `${longPrompt}\n\n      ❯ 1. yes\n        2. no`;
+    expect(parseQuestionPrompt(visible).length).toBeLessThanOrEqual(240);
+  });
+
+  test('does not pull text from a previous numbered list above', () => {
+    const visible = `
+      ❯ 1. previous answered question
+        2. previous option two
+
+      D2 — A new question text
+
+      ❯ 1. fresh option A
+        2. fresh option B
+    `;
+    const prompt = parseQuestionPrompt(visible);
+    // Stops at the previous numbered-list line; should NOT contain "previous answered question".
+    expect(prompt).toContain('D2 — A new question text');
+    expect(prompt).not.toContain('previous answered question');
+  });
+
+  test('normalizes whitespace (collapses runs of spaces and tabs)', () => {
+    const visible = `D1   —    Spaced     out
+
+      ❯ 1. yes
+        2. no`;
+    expect(parseQuestionPrompt(visible)).toBe('D1 — Spaced out');
+  });
+
+  test('inline-cursor box-layout: extracts prompt text BEFORE ❯1. on the cursor line', () => {
+    // Real /plan-ceo-review rendering: divider + ☐ header + prompt text +
+    // cursor are all on one logical line because TTY cursor-positioning
+    // escapes collapse the box layout under stripAnsi.
+    const visible = [
+      '──────────────────',
+      '☐ Review scope                                                     What scope do you want me to CEO-review?                                                     ❯ 1. The branch\'s diff vs main',
+      '2. A specific plan file',
+      '3. An idea inline',
+    ].join('\n');
+    const prompt = parseQuestionPrompt(visible);
+    // Should extract "Review scope" and the prompt text, dropping the leading ☐ checkbox sigil.
+    expect(prompt).toContain('Review scope');
+    expect(prompt).toContain('What scope do you want me to CEO-review?');
+    expect(prompt).not.toContain('❯');
+    expect(prompt).not.toMatch(/^☐/);
+  });
+});
+
+describe('auqFingerprint', () => {
+  test('returns the same fingerprint for identical inputs', () => {
+    const opts = [
+      { index: 1, label: 'A' },
+      { index: 2, label: 'B' },
+    ];
+    expect(auqFingerprint('hello', opts)).toBe(auqFingerprint('hello', opts));
+  });
+
+  test('different prompts with shared option labels produce DIFFERENT fingerprints', () => {
+    // The collision regression Codex F1 caught: option-label-only fingerprints
+    // collapsed multiple distinct findings into one when they shared menu shape.
+    const sharedOpts = [
+      { index: 1, label: 'Add to plan' },
+      { index: 2, label: 'Defer' },
+      { index: 3, label: 'Build now' },
+    ];
+    const fpFinding1 = auqFingerprint('D5 — Architecture: bypass helper?', sharedOpts);
+    const fpFinding2 = auqFingerprint('D6 — Tests: zero coverage?', sharedOpts);
+    expect(fpFinding1).not.toBe(fpFinding2);
+  });
+
+  test('same prompt with different options produces DIFFERENT fingerprints', () => {
+    const prompt = 'D1 — Pick a mode';
+    const fpA = auqFingerprint(prompt, [
+      { index: 1, label: 'HOLD SCOPE' },
+      { index: 2, label: 'SCOPE EXPANSION' },
+    ]);
+    const fpB = auqFingerprint(prompt, [
+      { index: 1, label: 'HOLD SCOPE' },
+      { index: 2, label: 'SCOPE REDUCTION' },
+    ]);
+    expect(fpA).not.toBe(fpB);
+  });
+
+  test('whitespace-only differences in prompt do NOT change the fingerprint', () => {
+    // Same content, different rendering whitespace (TTY redraw artifact)
+    // must produce the same fingerprint so dedupe survives reflow.
+    const opts = [{ index: 1, label: 'A' }, { index: 2, label: 'B' }];
+    const fpA = auqFingerprint('Pick   a     mode', opts);
+    const fpB = auqFingerprint('Pick a mode', opts);
+    expect(fpA).toBe(fpB);
+  });
+
+  test('empty prompt + same options collide (caller must guard against this)', () => {
+    // Documents the contract: empty-prompt fingerprints WILL collide if the
+    // caller fingerprints them. runPlanSkillCounting must skip empty-prompt
+    // AUQs and re-poll instead.
+    const opts = [{ index: 1, label: 'A' }];
+    expect(auqFingerprint('', opts)).toBe(auqFingerprint('', opts));
+  });
+});
+
+describe('COMPLETION_SUMMARY_RE', () => {
+  test('matches GSTACK REVIEW REPORT heading', () => {
+    expect(COMPLETION_SUMMARY_RE.test('## GSTACK REVIEW REPORT')).toBe(true);
+  });
+
+  test('matches Completion Summary heading (ceo + eng)', () => {
+    expect(COMPLETION_SUMMARY_RE.test('## Completion Summary')).toBe(true);
+    expect(COMPLETION_SUMMARY_RE.test('## Completion summary')).toBe(true);
+  });
+
+  test('matches Status: clean (CEO review-log shape)', () => {
+    expect(COMPLETION_SUMMARY_RE.test('Status: clean')).toBe(true);
+    expect(COMPLETION_SUMMARY_RE.test('Status: issues_open')).toBe(true);
+  });
+
+  test('matches VERDICT: line', () => {
+    expect(COMPLETION_SUMMARY_RE.test('VERDICT: CLEARED — Eng Review passed')).toBe(true);
+  });
+
+  test('does NOT match prose mentions of "verdict" mid-line', () => {
+    // VERDICT must be at the start of a line to count.
+    expect(COMPLETION_SUMMARY_RE.test('the final verdict: undecided')).toBe(false);
+  });
+});
+
+describe('assertReviewReportAtBottom', () => {
+  test('passes when REVIEW REPORT is the only/last ## heading', () => {
+    const content = `# Plan
+
+## Context
+stuff
+
+## Approach
+more stuff
+
+## GSTACK REVIEW REPORT
+
+| col | col |
+`;
+    const r = assertReviewReportAtBottom(content);
+    expect(r.ok).toBe(true);
+  });
+
+  test('fails when REVIEW REPORT is missing', () => {
+    const content = `# Plan
+
+## Context
+stuff
+`;
+    const r = assertReviewReportAtBottom(content);
+    expect(r.ok).toBe(false);
+    expect(r.reason).toMatch(/no GSTACK REVIEW REPORT/);
+  });
+
+  test('fails when REVIEW REPORT exists but a ## heading follows it', () => {
+    const content = `# Plan
+
+## GSTACK REVIEW REPORT
+
+| col | col |
+
+## Late Section
+oops
+`;
+    const r = assertReviewReportAtBottom(content);
+    expect(r.ok).toBe(false);
+    expect(r.reason).toMatch(/trailing ## heading/);
+    expect(r.trailingHeadings).toEqual(['## Late Section']);
+  });
+
+  test('passes when only ### subheadings follow REVIEW REPORT (deeper nesting allowed)', () => {
+    const content = `## GSTACK REVIEW REPORT
+
+### Cross-model tension
+- F1: resolved
+- F2: resolved
+`;
+    const r = assertReviewReportAtBottom(content);
+    expect(r.ok).toBe(true);
+  });
+
+  test('fails with multiple trailing ## headings reported', () => {
+    const content = `## GSTACK REVIEW REPORT
+
+## First trailing
+
+## Second trailing
+`;
+    const r = assertReviewReportAtBottom(content);
+    expect(r.ok).toBe(false);
+    expect(r.trailingHeadings).toHaveLength(2);
+  });
+});
+
+describe('Step0BoundaryPredicate per-skill', () => {
+  // Helper to build a synthetic fingerprint for predicate tests.
+  function fp(promptSnippet: string, optionLabels: string[]): AskUserQuestionFingerprint {
+    const options = optionLabels.map((label, i) => ({ index: i + 1, label }));
+    return {
+      signature: auqFingerprint(promptSnippet, options),
+      promptSnippet,
+      options,
+      observedAtMs: 0,
+      preReview: true,
+    };
+  }
+
+  describe('ceoStep0Boundary', () => {
+    test('FIRES on Step 0F mode-pick AUQ (HOLD SCOPE in options)', () => {
+      const f = fp('Pick a mode', ['HOLD SCOPE', 'SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'SCOPE REDUCTION']);
+      expect(ceoStep0Boundary(f)).toBe(true);
+    });
+
+    test('FIRES on scope-selection AUQ with "Skip interview" option (skip-interview path)', () => {
+      // Learned from calibration run 1: plan-ceo's first AUQ is scope-selection,
+      // and we route via "Skip interview and plan immediately" to bypass
+      // Step 0 entirely. The boundary must fire on this AUQ so that every
+      // subsequent AUQ is counted toward reviewCount.
+      const f = fp(
+        'What scope do you want me to CEO-review?',
+        [
+          "The branch's diff vs main",
+          'A specific plan file',
+          "An idea you'll describe inline",
+          'Cancel — wrong skill',
+          'Type something.',
+          'Chat about this',
+          'Skip interview and plan immediately',
+        ],
+      );
+      expect(ceoStep0Boundary(f)).toBe(true);
+    });
+
+    test('does NOT fire on premise challenge AUQs', () => {
+      const f = fp('D1 — Premise check: is this the right problem?', ['Yes', 'No', 'Other']);
+      expect(ceoStep0Boundary(f)).toBe(false);
+    });
+
+    test('does NOT fire on review-section AUQs', () => {
+      const f = fp('Architecture: bypass helper?', ['Reuse existing', 'Roll new', 'Defer']);
+      expect(ceoStep0Boundary(f)).toBe(false);
+    });
+  });
+
+  describe('engStep0Boundary', () => {
+    test('FIRES on cross-project learnings prompt', () => {
+      const f = fp('Enable cross-project learnings on this machine?', ['Yes', 'No']);
+      expect(engStep0Boundary(f)).toBe(true);
+    });
+
+    test('FIRES on scope reduction recommendation', () => {
+      const f = fp('Scope reduction recommendation: cut to MVP?', ['Reduce', 'Proceed', 'Modify']);
+      expect(engStep0Boundary(f)).toBe(true);
+    });
+
+    test('does NOT fire on review-section AUQs', () => {
+      const f = fp('Architecture: shared mutable state?', ['Refactor', 'Defer', 'Skip']);
+      expect(engStep0Boundary(f)).toBe(false);
+    });
+  });
+
+  describe('designStep0Boundary', () => {
+    test('FIRES on design system / posture mention', () => {
+      const f = fp('Pick a design posture for this review', ['Polish', 'Triage', 'Expansion']);
+      expect(designStep0Boundary(f)).toBe(true);
+    });
+
+    test('FIRES on first-dimension prompt', () => {
+      const f = fp('First dimension: visual hierarchy. Score?', ['7', '8', '9']);
+      expect(designStep0Boundary(f)).toBe(true);
+    });
+
+    test('does NOT fire on later dimension AUQs', () => {
+      const f = fp('Spacing dimension score?', ['7', '8', '9']);
+      expect(designStep0Boundary(f)).toBe(false);
+    });
+  });
+
+  describe('devexStep0Boundary', () => {
+    test('FIRES on developer persona selection', () => {
+      const f = fp('Pick the target persona for this review', ['Senior backend', 'Junior frontend', 'Other']);
+      expect(devexStep0Boundary(f)).toBe(true);
+    });
+
+    test('FIRES on TTHW target prompt', () => {
+      const f = fp('What is the TTHW target for first run?', ['<5 min', '<15 min', '<30 min']);
+      expect(devexStep0Boundary(f)).toBe(true);
+    });
+
+    test('does NOT fire on review-section AUQs', () => {
+      const f = fp('Friction point: 5-min CI wait. Address?', ['Now', 'Defer', 'Skip']);
+      expect(devexStep0Boundary(f)).toBe(false);
+    });
+  });
+});
@@ -103,6 +103,15 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'ship-idempotency-pty':        ['ship/**', 'bin/gstack-next-version', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
  'autoplan-chain-pty':          ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
  'e2e-harness-audit':            ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'],
+
+  // Per-finding AskUserQuestion count + review-report-at-bottom assertions.
+  // Each test drives its skill end-to-end; touchfiles include the preamble and
+  // completion-status resolvers because they affect question cadence and
+  // terminal output (the regression surface these tests catch).
+  'plan-ceo-finding-count':      ['plan-ceo-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-ceo-finding-count.test.ts'],
+  'plan-eng-finding-count':      ['plan-eng-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-eng-finding-count.test.ts'],
+  'plan-design-finding-count':   ['plan-design-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-design-finding-count.test.ts'],
+  'plan-devex-finding-count':    ['plan-devex-review/**', 'scripts/resolvers/preamble.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/claude-pty-runner.ts', 'test/skill-e2e-plan-devex-finding-count.test.ts'],
  'brain-privacy-gate':           ['scripts/resolvers/preamble/generate-brain-sync-block.ts', 'scripts/resolvers/preamble.ts', 'bin/gstack-brain-sync', 'bin/gstack-brain-init', 'bin/gstack-config', 'test/helpers/agent-sdk-runner.ts'],

  // AskUserQuestion format regression (RECOMMENDATION + Completeness: N/10)
@@ -381,6 +390,15 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  'ship-idempotency-pty':      'periodic',   // ~$3/run, real /ship in plan mode
  'autoplan-chain-pty':        'periodic',   // ~$8/run, all 3 phases sequential

+  // Per-finding count + review-report-at-bottom: periodic because each
+  // run drives a full skill end-to-end (~25 min, ~$5/run). Tests run
+  // sequentially during calibration; concurrency becomes opt-in only after
+  // a measured comparison shows the results agree (plan §D15).
+  'plan-ceo-finding-count':    'periodic',
+  'plan-eng-finding-count':    'periodic',
+  'plan-design-finding-count': 'periodic',
+  'plan-devex-finding-count':  'periodic',
+
  // Privacy gate for gstack-brain-sync — periodic (non-deterministic LLM call,
  // costs ~$0.30-$0.50 per run, not needed on every commit)
  'brain-privacy-gate': 'periodic',