v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)

* refactor(plan-ceo-review): carve review body into on-demand section Carve the largest skill (138,838 B) into a skeleton + one on-demand section, the documented next Phase B target after /ship (v2_PLAN.md:216). - sections/review-sections.md(.tmpl): the 11-section deep review, codex/ outside-voice rules, how-to-ask, Required Outputs, registries, Completion Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps, docs/designs promotion, Formatting Rules, and the Mode Quick Reference. - sections/manifest.json: passive registry (CM2), one entry. - SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a Section self-check. All of Step 0 (the scope/mode conversation) stays in the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section. Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens off every invocation). Union (skeleton + section) 139,110 B, behavior held. Boundary honors Codex P1: nothing review-governing (formatting rules, mode reference, how-to-ask, required outputs) sits in the skeleton below the STOP. Housekeeping resolvers ride in the section, matching the ship precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS). Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs freshness on every push, so source + regen + tests must land together): - parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000 (measured 80,731 + headroom); content/minBytes run against the union. - skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED. - section-manifest-consistency: generalized to discover every carved skill, vars computed per-skill-case (Codex P2). - skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after Step 0, review body absent from skeleton, report writer in the section, nothing review-governing below the STOP. - skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the installed skill first (Codex P1), drives full Step 0, asserts the section is Read before the report. - gen-skill-docs + skill-validation: read the skeleton+sections union for carved skills so relocated prose still counts. - touchfiles: plan-ceo-section-loading registered (periodic). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0) MINOR: carves the largest skill into skeleton + on-demand section, dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B, ~14.4K tokens off every invocation). User-facing release notes lead with the measured token win. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section carves stay modest because the ~40-50KB shared preamble dominates the always-loaded surface. A single preamble-reference carve would help every tier->=2 skill at once. Records the why, the cold-vs-hot split to measure, and the guards it needs. Not implemented this PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): Layer 0 — guarantee AUQ format spec is always-loaded Deterministic, free, per-PR keystone for the token-reduction era. For every interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/ self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an on-demand section. Plus a roster guard (a carve can't silently drop the block) and per-skill rule survival in the skeleton+sections union. 51 cases + a negative control. Fails the instant a future carve strands AUQ-governing text where it won't be loaded when a question fires. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a skill to its AUQ and captures the verbatim text the model GENERATES, cleanly (real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an absolute path with Read/Write-only tools so the agent can't wander to the global install. gradeAuqRecommendation normalizes a non-"because" connective before grading so substantive reasons aren't false-flagged (without touching the pinned shared judge). The A/B drives the same prompt through the carved 80KB skeleton and the pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7 format, substance 5 — proven no degradation, transcript-verified each side read its own planted SKILL.md. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): consistency — same trigger N runs, stable format + substance Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format element appears in one run but not another, or substance craters. Targets the "fine one run, broken the next" failure class a single snapshot can't see. Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): behavioral matrix across AUQ-heavy skills Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex, office-hours, cso, spec, design-consultation) to its first AskUserQuestion and grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation substance >=4. One case per skill (isolated failures), env-subsettable via AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded (comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted skills pass 7/7 with substance 4-5. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(codex): live recommendation-substance grade for /codex Closes the gap where /codex's synthesis recommendation was only checked statically (template grep) and via fixtures. Drives the real /codex skill over a flawed diff and grades the emitted "Recommendation: ... because ..." line with judgeRecommendation (present/commits/has_because/substance>=4). The named weak spot holds up: substance 5. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): deterministic trigger for format-compliance gate A bare /plan-ceo-review against a repo whose work is already implemented makes the model improvise an off-script "what should I review?" scope question that skips the decision-brief format, which the gate test then times out waiting for. Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real Step 0 mode-selection AUQ that is the intended format check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(office-hours): carve Phase 5+6 into on-demand section Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the session's output + closing tail, only reached after the conversation and alternatives are done — into sections/design-and-handoff.md, behind a single STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the always-run Important Rules stay in the always-loaded skeleton. Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved. The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5), and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ paranoid suite de-risked this carve end to end. Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on every push, so source + regen + tests land together; --host all regenerates the inlined non-Claude variants): - sections/manifest.json: passive registry, one entry. - parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000 (measured 88,975 + headroom); content/minBytes run against the union. - skill-size-budget: office-hours added to SECTIONS_EXTRACTED. - gen-skill-docs + skill-validation: read the skeleton+sections union for office-hours so relocated Phase 5/6 prose still counts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(preamble): carve CJK-escaping manual to on-demand doc The AskUserQuestion format block is inlined into every interactive skill (~33). It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but that rationale only matters when a question contains CJK text and the operative rule already lives in the always-loaded self-check. Moved the justification to docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer. Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B x ~33 skills). Layer 0 still passes — the core decision-brief format stays always-loaded; only the rare CJK rationale moved. Atomic with the all-host regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-eng-review): carve review body into on-demand section Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture, Code Quality, Tests, Performance), outside voice, required outputs, and review report — everything after Step 0 scope — into sections/review-sections.md behind a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the always-loaded skeleton. Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with tests + all-host regen (freshness gate): parity flipped to sectioned (maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-design-review): carve review body into on-demand section Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design passes, required outputs, and review report — everything after Step 0 scope and the mockup/rating phase — into sections/review-sections.md behind a STOP-Read. Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the always-loaded skeleton. Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-devex-review): carve review body into on-demand section Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report — everything after the Step 0 DX investigation — into sections/review-sections.md behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace, roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded. Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green. (No parity invariant entry for plan-devex-review.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: refresh ship golden baselines + gbrain-detection union after carves Two follow-ups the carve commits should have carried (caught by the full suite, missed by targeted subsets): - ship golden baselines (claude/codex/factory) regenerated: the preamble CJK trim (v1.58) changed ship's always-loaded AskUserQuestion block. - gbrain-detection-override probes the office-hours skeleton+section union: GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours was carved, so the detection assertions now check both files. Full `bun test` green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): grade format-compliance gate from SDK capture, not the TUI The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified directly that this cannot work: plan-mode AUQs render as a cursor picker whose cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and parseNumberedOptions returns 0. The test was grading a lossy projection and failed by construction. Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent writes the verbatim question it would have shown — clean text, no rendering loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same property, reliable, environment-independent; shares the engine with the periodic A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed ask-user-question-format-pty -> auq-format-gate (no longer a PTY test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: fix carve-broken CI evals (union reads + section fixtures) Two CI eval jobs failed on the carved plan-* skills because they read content that moved into sections/: - llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers like "## Review Sections" / "## CRITICAL RULE" that now live in sections/review-sections.md. The markers vanished from the skeleton, so the judge scored empty/wrong content. Fix: read the skeleton+sections union. Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS (25/25). - e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the fixture, not sections/. The carved skill's STOP pointed at a section file that was absent, so the model improvised a compressed report table instead of the canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS, canonical table emitted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails) Proactive sweep beyond the two CI logs: every e2e test that copies a carved skill's SKILL.md into a temp fixture must also copy its sections/, or the model hits a STOP pointing at a missing section file and improvises/degrades. - skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across planDir/reviewDir/ohDir/benefitsDir dests now copy sections/. - skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering loop now copy sections/. - skill-e2e-design.test.ts: plan-design-review copy now copies sections/. - skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/. - skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into the section, so check the regenerated skeleton+section UNION for the gbrain put block, ship both into the workdir, and restore both (the section regen was also leaking into the working tree — finally now restores it). ship copies (single-file Step-0 slices) and review/retro (not carved) untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: migrate section-loading E2E to lossless SDK tool-stream detection The /ship and /plan-ceo-review section-loading tests drove a real PTY and scraped the ANSI screen buffer for sections/<file>.md paths. That silently saw nothing in a Conductor PTY (cursor-positioned tool renders and an unanswered Step 0 question loop both defeat the regex), so both reported read: [] even when the agent did the work. They now run the skill through claude -p (the same SDK path the AUQ matrix uses) and detect section reads from the tool-use stream — Read calls whose file_path contains sections/<file>.md — with no rendering layer to mangle. The run is also hermetic: the freshly-generated worktree skeleton + sections are copied into a throwaway fixture with the absolute path pinned, so the test validates this branch's carve without mutating the user's ~/.claude install. Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md; ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min combined on the old PTY path where both were failing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: consolidate branch to v1.56.0.0 (single MINOR above main) The branch bumped VERSION several times during development (1.56 → 1.57 → 1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per the "never orphan branch-internal versions" discipline, collapse all four into a single 1.56.0.0 entry — one MINOR release covering the whole branch: five skills carved (plan-ceo, office-hours, plan-eng, plan-design, plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid AUQ no-degradation test suite + lossless section-loading tests. VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved below the consolidated entry. No SKILL.md drift (VERSION is not embedded in generated bodies). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-20 22:40:58 +02:00 · 2026-06-04 11:14:43 -07:00
parent c43c850cae
commit cab774cced
95 changed files with 7704 additions and 6784 deletions
@@ -1,205 +1,91 @@
 /**
- * AskUserQuestion format-compliance smoke (gate, paid, real-PTY).
+ * AskUserQuestion format-compliance gate (gate, paid, SDK capture).
 *
- * Asserts: when /plan-ceo-review fires its first AskUserQuestion in plan
- * mode, the rendered TTY output contains every element the preamble
- * format spec mandates (scripts/resolvers/preamble/generate-ask-user-format.ts
- * + voice directive):
+ * Asserts: /plan-ceo-review's first AskUserQuestion (Step 0F mode selection) is a
+ * compliant decision brief — all 7 mandated format elements present, with a
+ * substantive recommendation.
 *
- *   1. ELI10 prose paragraph
- *   2. "Recommendation:" line
- *   3. Pros/Cons header
- *   4. ✅ pro bullet AND ❌ con bullet
- *   5. "Net:" closer line
- *   6. "(recommended)" label on one option
+ * Why SDK capture, not real-PTY (changed v1.59+): the prior version launched an
+ * interactive `claude` PTY and grepped the rendered TUI after stripAnsi. But
+ * plan-mode AUQs render as an interactive cursor picker whose cursor-positioning
+ * escapes stripAnsi CANNOT faithfully flatten — verified directly: the picker
+ * renders fine for a human (cursorSeen=45) but the flattened text drops `ELI10:`
+ * and `(recommended)` and `parseNumberedOptions` returns 0. So the old test was
+ * grading a lossy projection of the TUI, not the question's actual format, and
+ * failed by construction in this environment.
 *
- * Why real-PTY: the existing skill-e2e-plan-format tests cover what the
- * AGENT writes via the SDK (capture-to-file harness). This test covers
- * what the USER actually sees in the terminal — different bug class
- * (e.g., AskUserQuestion tool truncates long prose, conductor renderer mangles
- * bullets, model collapses sections under token pressure). Two layers
- * of defense for a format-discipline regression that previously ate ~6
- * weeks of compliance drift before it was noticed.
- *
- * Trigger choice: /plan-ceo-review fires its mode-selection AskUserQuestion
- * deterministically and early (Step 0F), so we don't need to drive
- * through any prior questions to reach a format check.
- *
- * See test/helpers/claude-pty-runner.ts for runner internals.
+ * This version drives the skill via the SDK $OUT_FILE capture path (the agent
+ * writes the verbatim AskUserQuestion it would have shown to a file — clean text,
+ * zero rendering loss) and grades that. Same property tested (does the question
+ * carry every format element), reliably, environment-independent. The rendering
+ * layer is identical across skills/content, so it is not where format regressions
+ * hide; the model's composed question is. Shares the engine with the periodic
+ * A/B and matrix evals (test/helpers/auq-sdk-capture.ts).
 */
-
 import { describe, test, expect } from 'bun:test';
+import * as fs from 'node:fs';
 import {
-  launchClaudePty,
-  isNumberedOptionListVisible,
-  isPermissionDialogVisible,
-  parseNumberedOptions,
-} from './helpers/claude-pty-runner';
+  setupPlanCeoDir,
+  captureModeSelectionAuq,
+  scoreAuqFormat,
+  gradeAuqRecommendation,
+  carvedSkill,
+} from './helpers/auq-sdk-capture';

 const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
 const describeE2E = shouldRun ? describe : describe.skip;
-
-// Format predicates. Permissive on whitespace and capitalization.
-// Tightening these is V2 if real drift is observed.
-const ELI10_RE        = /ELI10\s*:/i;
-const RECOMMEND_RE    = /Recommendation\s*:/i;
-const PROS_CONS_RE    = /Pros\s*\/\s*cons\s*:/i;
-const PRO_BULLET_RE   = /✅/;
-const CON_BULLET_RE   = /❌/;
-const NET_LINE_RE     = /^[\s|]*Net\s*:/im;
-const RECOMMENDED_LBL = /\(recommended\)/i;
-
-interface FormatGap {
-  field: string;
-  re: RegExp;
-}
-
-function findFormatGaps(visible: string): FormatGap[] {
-  const checks: FormatGap[] = [
-    { field: 'ELI10:', re: ELI10_RE },
-    { field: 'Recommendation:', re: RECOMMEND_RE },
-    { field: 'Pros / cons:', re: PROS_CONS_RE },
-    { field: '✅ pro bullet', re: PRO_BULLET_RE },
-    { field: '❌ con bullet', re: CON_BULLET_RE },
-    { field: 'Net:', re: NET_LINE_RE },
-    { field: '(recommended) label', re: RECOMMENDED_LBL },
-  ];
-  return checks.filter(c => !c.re.test(visible));
-}
+const runId = `auq-format-gate-${process.env.EVALS_RUN_ID ?? 'local'}`;

 describeE2E('AskUserQuestion format compliance (gate)', () => {
  test(
-    'first AskUserQuestion from /plan-ceo-review contains all 7 mandated format elements',
+    "/plan-ceo-review's first AskUserQuestion is a compliant decision brief (7/7 + substance)",
    async () => {
-      const session = await launchClaudePty({
-        permissionMode: 'plan',
-        timeoutMs: 600_000,
+      const carved = carvedSkill();
+      const dir = setupPlanCeoDir({
+        skillMd: carved.skillMd,
+        sectionsFrom: carved.sectionsFrom,
+        tmpPrefix: 'auq-format-gate-',
      });

+      let text = '';
      try {
-        // Boot grace + auto trust-dialog handler.
-        await Bun.sleep(8000);
-        const since = session.mark();
-        session.send('/plan-ceo-review\r');
-
-        // Wait for a SKILL AskUserQuestion. Strategy: poll the visible buffer until it
-        // contains both a numbered-option list AND the format markers we
-        // expect (ELI10 + Recommendation). When both are present, it IS a
-        // real format-compliant AskUserQuestion — not a permission dialog or trust
-        // prompt.
-        //
-        // While polling, auto-grant any permission dialogs we see in the
-        // recent tail (preamble side-effects: touch on a sensitive file,
-        // etc) so the agent isn't blocked.
-        //
-        // Budget bumped 300s → 540s in v1.32: /plan-ceo-review's preamble runs
-        // multiple bash blocks (gbrain sync probe, telemetry, learnings search,
-        // dashboard read) before reaching its mode-selection AskUserQuestion in
-        // Step 0F. On substantive branches (or under contention from concurrent
-        // tests running at max-concurrency 15), 300s sometimes wasn't enough
-        // for the model to drain Step 0 work before emitting the first AUQ.
-        // 540s sits below the suite-level 360s/9min timeout headroom and
-        // tracks the same magnitude the plan-design-with-ui test uses.
-        const budgetMs = 540_000;
-        const start = Date.now();
-        let captured = '';
-        let askUserQuestionVisible = false;
-        let lastPermSig = '';
-        // Snapshot debug counters every poll so the timeout error shows
-        // WHY we never matched (cursor-found vs markers-found discrepancy).
-        let debugCursorSeen = 0;
-        let debugMarkersSeen = 0;
-        let debugBothSeen = 0;
-
-        while (Date.now() - start < budgetMs) {
-          await Bun.sleep(2000);
-          if (session.exited()) {
-            throw new Error(
-              `claude exited (code=${session.exitCode()}) before AskUserQuestion rendered.\n` +
-                `Last visible:\n${session.visibleSince(since).slice(-2000)}`,
-            );
-          }
-          const visible = session.visibleSince(since);
-          // Marker check: anywhere in the post-slash region. Since `since`
-          // is set right after sending /plan-ceo-review, there's no stale
-          // AskUserQuestion above this line — the only AskUserQuestion that can produce these
-          // markers is the current one.
-          const hasEli10 = /ELI10\s*:/i.test(visible);
-          const hasRecommend = /Recommendation\s*:/i.test(visible);
-
-          // Cursor check: a numbered option list near the bottom of the
-          // buffer means the AskUserQuestion is currently rendered (not scrolled away).
-          const cursorTail = visible.slice(-4000);
-          const hasCursor = isNumberedOptionListVisible(cursorTail) &&
-                            parseNumberedOptions(cursorTail).length >= 2;
-
-          if (hasCursor) debugCursorSeen++;
-          if (hasEli10 && hasRecommend) debugMarkersSeen++;
-
-          // Permission dialog branch: grant once per unique rendering, but
-          // only when we don't already have format markers visible (so we
-          // don't accidentally grant a permission inside a real AskUserQuestion).
-          if (
-            hasCursor &&
-            !(hasEli10 && hasRecommend) &&
-            isPermissionDialogVisible(cursorTail)
-          ) {
-            const sig = visible.slice(-500);
-            if (sig !== lastPermSig) {
-              lastPermSig = sig;
-              session.send('1\r');
-              await Bun.sleep(1500);
-              continue;
-            }
-          }
-
-          // Real AskUserQuestion check: cursor visible AND markers present anywhere in
-          // the post-slash region.
-          if (hasCursor && hasEli10 && hasRecommend) {
-            debugBothSeen++;
-            captured = visible;
-            askUserQuestionVisible = true;
-            break;
-          }
-        }
-        if (!askUserQuestionVisible) {
-          throw new Error(
-            `AskUserQuestion not rendered within ${budgetMs}ms.\n` +
-              `Debug counts: cursorSeen=${debugCursorSeen} markersSeen=${debugMarkersSeen} bothSeen=${debugBothSeen}\n` +
-              `Last visible (4KB):\n${session.visibleSince(since).slice(-4000)}`,
-          );
-        }
-        const gaps = findFormatGaps(captured);
-        if (gaps.length > 0) {
-          // Surface the captured text last 3KB on failure for debugging.
-          const tail = captured.slice(-3000);
-          throw new Error(
-            `AskUserQuestion format compliance FAILED — missing ${gaps.length} mandated field(s):\n` +
-              gaps.map(g => `  - ${g.field} (regex: ${g.re.source})`).join('\n') +
-              `\n--- captured (last 3KB) ---\n${tail}`,
-          );
-        }
-
-        // Sanity: the parsed option list contains at least 2 options and
-        // one of them carries the (recommended) marker.
-        const opts = parseNumberedOptions(captured);
-        expect(opts.length).toBeGreaterThanOrEqual(2);
-        const hasRecommended = opts.some(o => /\(recommended\)/i.test(o.label));
-        if (!hasRecommended) {
-          // It's also acceptable for the (recommended) marker to live in
-          // prose above the box (some renderers wrap labels). The text-level
-          // RECOMMENDED_LBL check above already covers that case.
-          // Surface a friendlier message if the box itself missed it.
-          // (This is non-fatal because findFormatGaps already passed.)
-          // eslint-disable-next-line no-console
-          console.warn(
-            '(recommended) label appears in prose but not on a parsed option label — acceptable but watch for drift',
-          );
-        }
+        text = await captureModeSelectionAuq({ planDir: dir, testName: 'auq-format-gate', runId });
      } finally {
-        await session.close();
+        fs.rmSync(dir, { recursive: true, force: true });
+      }
+
+      if (!text.trim()) {
+        throw new Error('No AskUserQuestion captured — the skill never reached its mode-selection question.');
+      }
+
+      // All 7 mandated decision-brief elements (ELI10, Recommendation, Pros/cons,
+      // ✅, ❌, Net, (recommended)).
+      const fmt = scoreAuqFormat(text);
+      if (fmt.missing.length > 0) {
+        throw new Error(
+          `AskUserQuestion missing ${fmt.missing.length} mandated format element(s): ` +
+            `${fmt.missing.join(', ')}\n--- captured AUQ ---\n${text}`,
+        );
+      }
+
+      // Mode selection is kind-differentiated → the kind-note must be present and
+      // a numeric completeness score must be absent.
+      expect(text).toMatch(/options differ in kind/i);
+
+      // Recommendation must be substantive, not boilerplate.
+      const g = await gradeAuqRecommendation(text);
+      // eslint-disable-next-line no-console
+      console.log(
+        `[auq-format-gate] format=${fmt.present}/${fmt.total} substance=${g.substance} ` +
+          `recPresent=${g.present} literalBecause=${g.hadLiteralBecause}`,
+      );
+      expect(g.present).toBe(true);
+      if (g.substance < 4) {
+        throw new Error(
+          `Recommendation substance ${g.substance} < 4 (boilerplate/weak):\n--- captured AUQ ---\n${text}`,
+        );
      }
    },
-    660_000,
+    300_000,
  );
 });