v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)

* refactor(plan-ceo-review): carve review body into on-demand section Carve the largest skill (138,838 B) into a skeleton + one on-demand section, the documented next Phase B target after /ship (v2_PLAN.md:216). - sections/review-sections.md(.tmpl): the 11-section deep review, codex/ outside-voice rules, how-to-ask, Required Outputs, registries, Completion Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps, docs/designs promotion, Formatting Rules, and the Mode Quick Reference. - sections/manifest.json: passive registry (CM2), one entry. - SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a Section self-check. All of Step 0 (the scope/mode conversation) stays in the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section. Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens off every invocation). Union (skeleton + section) 139,110 B, behavior held. Boundary honors Codex P1: nothing review-governing (formatting rules, mode reference, how-to-ask, required outputs) sits in the skeleton below the STOP. Housekeeping resolvers ride in the section, matching the ship precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS). Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs freshness on every push, so source + regen + tests must land together): - parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000 (measured 80,731 + headroom); content/minBytes run against the union. - skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED. - section-manifest-consistency: generalized to discover every carved skill, vars computed per-skill-case (Codex P2). - skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after Step 0, review body absent from skeleton, report writer in the section, nothing review-governing below the STOP. - skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the installed skill first (Codex P1), drives full Step 0, asserts the section is Read before the report. - gen-skill-docs + skill-validation: read the skeleton+sections union for carved skills so relocated prose still counts. - touchfiles: plan-ceo-section-loading registered (periodic). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0) MINOR: carves the largest skill into skeleton + on-demand section, dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B, ~14.4K tokens off every invocation). User-facing release notes lead with the measured token win. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section carves stay modest because the ~40-50KB shared preamble dominates the always-loaded surface. A single preamble-reference carve would help every tier->=2 skill at once. Records the why, the cold-vs-hot split to measure, and the guards it needs. Not implemented this PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): Layer 0 — guarantee AUQ format spec is always-loaded Deterministic, free, per-PR keystone for the token-reduction era. For every interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/ self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an on-demand section. Plus a roster guard (a carve can't silently drop the block) and per-skill rule survival in the skeleton+sections union. 51 cases + a negative control. Fails the instant a future carve strands AUQ-governing text where it won't be loaded when a question fires. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a skill to its AUQ and captures the verbatim text the model GENERATES, cleanly (real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an absolute path with Read/Write-only tools so the agent can't wander to the global install. gradeAuqRecommendation normalizes a non-"because" connective before grading so substantive reasons aren't false-flagged (without touching the pinned shared judge). The A/B drives the same prompt through the carved 80KB skeleton and the pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7 format, substance 5 — proven no degradation, transcript-verified each side read its own planted SKILL.md. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): consistency — same trigger N runs, stable format + substance Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format element appears in one run but not another, or substance craters. Targets the "fine one run, broken the next" failure class a single snapshot can't see. Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): behavioral matrix across AUQ-heavy skills Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex, office-hours, cso, spec, design-consultation) to its first AskUserQuestion and grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation substance >=4. One case per skill (isolated failures), env-subsettable via AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded (comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted skills pass 7/7 with substance 4-5. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(codex): live recommendation-substance grade for /codex Closes the gap where /codex's synthesis recommendation was only checked statically (template grep) and via fixtures. Drives the real /codex skill over a flawed diff and grades the emitted "Recommendation: ... because ..." line with judgeRecommendation (present/commits/has_because/substance>=4). The named weak spot holds up: substance 5. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): deterministic trigger for format-compliance gate A bare /plan-ceo-review against a repo whose work is already implemented makes the model improvise an off-script "what should I review?" scope question that skips the decision-brief format, which the gate test then times out waiting for. Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real Step 0 mode-selection AUQ that is the intended format check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(office-hours): carve Phase 5+6 into on-demand section Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the session's output + closing tail, only reached after the conversation and alternatives are done — into sections/design-and-handoff.md, behind a single STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the always-run Important Rules stay in the always-loaded skeleton. Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved. The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5), and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ paranoid suite de-risked this carve end to end. Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on every push, so source + regen + tests land together; --host all regenerates the inlined non-Claude variants): - sections/manifest.json: passive registry, one entry. - parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000 (measured 88,975 + headroom); content/minBytes run against the union. - skill-size-budget: office-hours added to SECTIONS_EXTRACTED. - gen-skill-docs + skill-validation: read the skeleton+sections union for office-hours so relocated Phase 5/6 prose still counts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(preamble): carve CJK-escaping manual to on-demand doc The AskUserQuestion format block is inlined into every interactive skill (~33). It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but that rationale only matters when a question contains CJK text and the operative rule already lives in the always-loaded self-check. Moved the justification to docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer. Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B x ~33 skills). Layer 0 still passes — the core decision-brief format stays always-loaded; only the rare CJK rationale moved. Atomic with the all-host regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-eng-review): carve review body into on-demand section Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture, Code Quality, Tests, Performance), outside voice, required outputs, and review report — everything after Step 0 scope — into sections/review-sections.md behind a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the always-loaded skeleton. Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with tests + all-host regen (freshness gate): parity flipped to sectioned (maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-design-review): carve review body into on-demand section Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design passes, required outputs, and review report — everything after Step 0 scope and the mockup/rating phase — into sections/review-sections.md behind a STOP-Read. Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the always-loaded skeleton. Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-devex-review): carve review body into on-demand section Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report — everything after the Step 0 DX investigation — into sections/review-sections.md behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace, roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded. Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green. (No parity invariant entry for plan-devex-review.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: refresh ship golden baselines + gbrain-detection union after carves Two follow-ups the carve commits should have carried (caught by the full suite, missed by targeted subsets): - ship golden baselines (claude/codex/factory) regenerated: the preamble CJK trim (v1.58) changed ship's always-loaded AskUserQuestion block. - gbrain-detection-override probes the office-hours skeleton+section union: GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours was carved, so the detection assertions now check both files. Full `bun test` green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): grade format-compliance gate from SDK capture, not the TUI The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified directly that this cannot work: plan-mode AUQs render as a cursor picker whose cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and parseNumberedOptions returns 0. The test was grading a lossy projection and failed by construction. Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent writes the verbatim question it would have shown — clean text, no rendering loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same property, reliable, environment-independent; shares the engine with the periodic A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed ask-user-question-format-pty -> auq-format-gate (no longer a PTY test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: fix carve-broken CI evals (union reads + section fixtures) Two CI eval jobs failed on the carved plan-* skills because they read content that moved into sections/: - llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers like "## Review Sections" / "## CRITICAL RULE" that now live in sections/review-sections.md. The markers vanished from the skeleton, so the judge scored empty/wrong content. Fix: read the skeleton+sections union. Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS (25/25). - e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the fixture, not sections/. The carved skill's STOP pointed at a section file that was absent, so the model improvised a compressed report table instead of the canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS, canonical table emitted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails) Proactive sweep beyond the two CI logs: every e2e test that copies a carved skill's SKILL.md into a temp fixture must also copy its sections/, or the model hits a STOP pointing at a missing section file and improvises/degrades. - skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across planDir/reviewDir/ohDir/benefitsDir dests now copy sections/. - skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering loop now copy sections/. - skill-e2e-design.test.ts: plan-design-review copy now copies sections/. - skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/. - skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into the section, so check the regenerated skeleton+section UNION for the gbrain put block, ship both into the workdir, and restore both (the section regen was also leaking into the working tree — finally now restores it). ship copies (single-file Step-0 slices) and review/retro (not carved) untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: migrate section-loading E2E to lossless SDK tool-stream detection The /ship and /plan-ceo-review section-loading tests drove a real PTY and scraped the ANSI screen buffer for sections/<file>.md paths. That silently saw nothing in a Conductor PTY (cursor-positioned tool renders and an unanswered Step 0 question loop both defeat the regex), so both reported read: [] even when the agent did the work. They now run the skill through claude -p (the same SDK path the AUQ matrix uses) and detect section reads from the tool-use stream — Read calls whose file_path contains sections/<file>.md — with no rendering layer to mangle. The run is also hermetic: the freshly-generated worktree skeleton + sections are copied into a throwaway fixture with the absolute path pinned, so the test validates this branch's carve without mutating the user's ~/.claude install. Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md; ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min combined on the old PTY path where both were failing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: consolidate branch to v1.56.0.0 (single MINOR above main) The branch bumped VERSION several times during development (1.56 → 1.57 → 1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per the "never orphan branch-internal versions" discipline, collapse all four into a single 1.56.0.0 entry — one MINOR release covering the whole branch: five skills carved (plan-ceo, office-hours, plan-eng, plan-design, plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid AUQ no-degradation test suite + lossless section-loading tests. VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved below the consolidated entry. No SKILL.md drift (VERSION is not embedded in generated bodies). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-20 22:40:58 +02:00 · 2026-06-04 11:14:43 -07:00
parent c43c850cae
commit cab774cced
95 changed files with 7704 additions and 6784 deletions
@@ -0,0 +1,350 @@
+/**
+ * SDK-based AUQ capture — the reliable way to grade AskUserQuestion content.
+ *
+ * Real-PTY capture is lossy for plan-mode AUQs: they render every option on one
+ * cursor-positioned logical line that stripAnsi can't reconstruct, so format
+ * predicates (ELI10:, Net:, ✅) silently miss even when the question is
+ * well-formed. This helper instead uses the `claude -p` SDK path (the same one
+ * skill-e2e-plan-format uses): the agent is told to WRITE the verbatim text of
+ * the AskUserQuestion it would have asked to a file. That captures exactly what
+ * the model GENERATES — the surface where carving could degrade quality — with
+ * zero rendering loss. The TTY rendering layer is identical for fat and slim
+ * skills, so it is not where token-reduction degradation can hide.
+ */
+import * as fs from 'node:fs';
+import * as os from 'node:os';
+import * as path from 'node:path';
+import { spawnSync } from 'node:child_process';
+import { runSkillTest, type SkillTestResult } from './session-runner';
+
+const ROOT = path.resolve(__dirname, '..', '..');
+
+/** The 7 decision-brief format elements graded on the captured AUQ text. */
+export const AUQ_FORMAT_ELEMENTS: Array<{ field: string; re: RegExp }> = [
+  { field: 'ELI10:', re: /ELI10\s*:/i },
+  { field: 'Recommendation:', re: /Recommendation\s*:/i },
+  { field: 'Pros / cons:', re: /Pros\s*\/\s*cons/i },
+  { field: '✅', re: /✅/ },
+  { field: '❌', re: /❌/ },
+  { field: 'Net:', re: /Net\s*:/i },
+  { field: '(recommended)', re: /\(recommended\)/i },
+];
+
+export function scoreAuqFormat(text: string): { present: number; total: number; missing: string[] } {
+  const missing = AUQ_FORMAT_ELEMENTS.filter(e => !e.re.test(text)).map(e => e.field);
+  return { present: AUQ_FORMAT_ELEMENTS.length - missing.length, total: AUQ_FORMAT_ELEMENTS.length, missing };
+}
+
+/**
+ * Grade recommendation substance ROBUST to the connective. judgeRecommendation()
+ * keys on the literal "because" (correct for the spec, pinned by
+ * llm-judge-recommendation.test.ts), but skills routinely write equally
+ * substantive reasons as "Recommendation: A. <reason>" / "A — <reason>" /
+ * "A: <reason>". Grading those as substance-1 would make the matrix cry wolf on
+ * genuinely good recommendations. So we normalize a non-"because" connective to
+ * "because" purely for grading, then call the shared judge. We also report
+ * whether the ORIGINAL used the literal "because" — a soft style signal, since
+ * the format spec prefers it and the voice rule forbids the em-dash form.
+ *
+ * This does NOT touch judgeRecommendation or its pinned fixtures.
+ */
+export async function gradeAuqRecommendation(
+  text: string,
+): Promise<{ substance: number; present: boolean; hadLiteralBecause: boolean; reason: string }> {
+  const { judgeRecommendation } = await import('./llm-judge');
+  const recLine = text.match(/^[*_]*\s*recommendation\s*[*_]*\s*:\s*(.+)$/im);
+  const hadLiteralBecause = !!recLine && /\bbecause\s+\S/i.test(recLine[1]);
+
+  let graded = text;
+  if (recLine && !hadLiteralBecause) {
+    // Rewrite "Recommendation: <choice><sep><reason>" → "...<choice> because <reason>"
+    // sep ∈ {". ", " — ", " - ", ": "} right after a short choice token.
+    const normalizedLine = recLine[1].replace(
+      /^([^.:—-]{1,40}?)\s*(?:\.\s+|\s*[—-]\s+|:\s+)(\S.+)$/,
+      '$1 because $2',
+    );
+    if (normalizedLine !== recLine[1]) {
+      graded = text.replace(recLine[0], `Recommendation: ${normalizedLine}`);
+    }
+  }
+
+  try {
+    const r = await judgeRecommendation(graded);
+    return { substance: r.reason_substance, present: r.present, hadLiteralBecause, reason: r.reason_text };
+  } catch {
+    return { substance: 0, present: !!recLine, hadLiteralBecause, reason: '' };
+  }
+}
+
+/**
+ * Build a throwaway plan dir holding a SPECIFIC plan-ceo-review SKILL.md (so we
+ * can pit the carved skeleton against the verbose monolith). `sectionsFrom`, if
+ * given, copies that dir's sections/ alongside (for the carved variant).
+ */
+export function setupPlanCeoDir(opts: {
+  skillMd: string;
+  sectionsFrom?: string | null;
+  tmpPrefix?: string;
+}): string {
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), opts.tmpPrefix ?? 'auq-sdk-'));
+  const run = (cmd: string, args: string[]) => spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
+  run('git', ['init', '-b', 'main']);
+  run('git', ['config', 'user.email', 'test@test.com']);
+  run('git', ['config', 'user.name', 'Test']);
+  fs.writeFileSync(
+    path.join(dir, 'plan.md'),
+    [
+      '# Plan: Launch a "developer-friendly" pricing tier',
+      '',
+      '## Goal',
+      'Increase developer adoption.',
+      '',
+      '## Success metric',
+      'More signups.',
+      '',
+      '## Premise',
+      "We haven't talked to any developers about whether the current pricing is a",
+      'barrier. The team agreed it "feels like" it should be cheaper.',
+    ].join('\n'),
+  );
+  fs.mkdirSync(path.join(dir, 'plan-ceo-review'), { recursive: true });
+  fs.writeFileSync(path.join(dir, 'plan-ceo-review', 'SKILL.md'), opts.skillMd);
+  if (opts.sectionsFrom && fs.existsSync(opts.sectionsFrom)) {
+    fs.cpSync(opts.sectionsFrom, path.join(dir, 'plan-ceo-review', 'sections'), { recursive: true });
+  }
+  run('git', ['add', '.']);
+  run('git', ['commit', '-m', 'plan']);
+  return dir;
+}
+
+/**
+ * Generic: build a throwaway dir holding ANY skill's SKILL.md (+ optional
+ * sections) plus arbitrary fixture files, so the matrix can drive each skill to
+ * its first AUQ. Mirrors setupPlanCeoDir but skill-agnostic.
+ */
+export function setupSkillDir(opts: {
+  skillName: string;
+  skillMd: string;
+  sectionsFrom?: string | null;
+  fixtures?: Record<string, string>;
+  tmpPrefix?: string;
+}): string {
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), opts.tmpPrefix ?? `auq-${opts.skillName}-`));
+  const run = (cmd: string, args: string[]) => spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
+  run('git', ['init', '-b', 'main']);
+  run('git', ['config', 'user.email', 'test@test.com']);
+  run('git', ['config', 'user.name', 'Test']);
+  for (const [name, content] of Object.entries(opts.fixtures ?? {})) {
+    const p = path.join(dir, name);
+    fs.mkdirSync(path.dirname(p), { recursive: true });
+    fs.writeFileSync(p, content);
+  }
+  fs.mkdirSync(path.join(dir, opts.skillName), { recursive: true });
+  fs.writeFileSync(path.join(dir, opts.skillName, 'SKILL.md'), opts.skillMd);
+  if (opts.sectionsFrom && fs.existsSync(opts.sectionsFrom)) {
+    fs.cpSync(opts.sectionsFrom, path.join(dir, opts.skillName, 'sections'), { recursive: true });
+  }
+  run('git', ['add', '.']);
+  run('git', ['commit', '-m', 'fixture']);
+  return dir;
+}
+
+/** Read any skill's current (worktree) SKILL.md + its sections dir if present. */
+export function skillFromWorktree(skillName: string): { skillMd: string; sectionsFrom: string | null } {
+  const sec = path.join(ROOT, skillName, 'sections');
+  return {
+    skillMd: fs.readFileSync(path.join(ROOT, skillName, 'SKILL.md'), 'utf-8'),
+    sectionsFrom: fs.existsSync(sec) ? sec : null,
+  };
+}
+
+/**
+ * Generic: drive ANY skill to its FIRST AskUserQuestion and capture the
+ * verbatim decision-brief text the model would have shown. `scenario` is the
+ * per-skill prose that triggers a real AUQ (e.g. "review plan.md", "audit
+ * vuln.ts for security"). Absolute skill path + Read/Write-only so the agent
+ * cannot wander to the global install.
+ */
+export async function captureFirstAuq(opts: {
+  planDir: string;
+  skillName: string;
+  scenario: string;
+  testName: string;
+  runId?: string;
+  model?: string;
+}): Promise<string> {
+  const outFile = path.join(opts.planDir, 'ask-capture.md');
+  const skillPath = path.join(opts.planDir, opts.skillName, 'SKILL.md');
+  const prompt = `You are running a format-capture test. The ONLY skill file you may read is this absolute path: ${skillPath}. Do NOT search for, Glob, find, or read any other SKILL.md anywhere — especially nothing under ~/.claude or /Users.
+
+Read ${skillPath} and follow its workflow for this scenario:
+
+${opts.scenario}
+
+This is a capture test, not an interactive session. Skip any system-audit / environment-setup / codebase-exploration steps. When you reach the FIRST point where the skill would call AskUserQuestion, write the verbatim full decision-brief text of that question (title, ELI10, stakes, recommendation, every option with its ✅/❌ pros/cons bullets, and the Net line) to ${outFile}. Do NOT call any tool to ask the user. Do NOT paraphrase. After writing the file, STOP.`;
+
+  await runSkillTest({
+    prompt,
+    workingDirectory: opts.planDir,
+    allowedTools: ['Read', 'Write'],
+    maxTurns: 14,
+    timeout: 240_000,
+    testName: opts.testName,
+    runId: opts.runId,
+    model: opts.model ?? 'claude-opus-4-7',
+  });
+
+  try {
+    return fs.readFileSync(outFile, 'utf-8');
+  } catch {
+    return '';
+  }
+}
+
+/**
+ * Drive ANY carved skill through a real `claude -p` run and detect, LOSSLESSLY,
+ * which `sections/<file>.md` files the agent actually Read — from the tool-use
+ * stream, not the ANSI screen buffer. This is the reliable replacement for the
+ * real-PTY `visibleSince()` screen-scraping the section-loading tests used to do
+ * (which silently saw nothing in a Conductor PTY: cursor-positioned renders and
+ * an unanswered Step 0 question loop both defeat the regex).
+ *
+ * The skill under test is the planted copy in `planDir` (pin the absolute path so
+ * the agent cannot wander to the global install). AskUserQuestion is declared
+ * unavailable so the agent auto-picks the recommended option and proceeds far
+ * enough to hit the post-Step-0 STOP-Read directives; Read is the tool a STOP-Read
+ * resolves to, so Read/Grep/Glob/Write is all the agent needs (no Bash → it cannot
+ * `find /` its way out, nor run git/gh mutations).
+ */
+export async function captureSectionReads(opts: {
+  planDir: string;
+  skillName: string;
+  scenario: string;
+  /** Relative filename the agent writes its final output to (terminal signal). */
+  reportFile?: string;
+  /** Marker proving a real report/plan was produced (default: any non-empty text). */
+  reportMarker?: RegExp;
+  testName: string;
+  runId?: string;
+  model?: string;
+  maxTurns?: number;
+  timeout?: number;
+}): Promise<{ readSections: Set<string>; reportProduced: boolean; toolCalls: SkillTestResult['toolCalls']; output: string }> {
+  const outFile = path.join(opts.planDir, opts.reportFile ?? 'REPORT.md');
+  const skillPath = path.join(opts.planDir, opts.skillName, 'SKILL.md');
+  const prompt = `You are running an automated skill-execution test. No human is present, so AskUserQuestion is unavailable. The ONLY skill file you may read is this absolute path: ${skillPath}. Do NOT Glob/find/search for any other SKILL.md anywhere — especially nothing under ~/.claude or /Users.
+
+Read ${skillPath} and EXECUTE its workflow for this scenario:
+
+${opts.scenario}
+
+Rules for this run:
+- Skip system-audit, environment-setup, telemetry, and codebase-exploration steps.
+- At any decision point that would call AskUserQuestion, silently pick the skill's recommended option and continue. Do NOT stop to ask.
+- This skill's body has been carved into on-demand sections/. When the skill gives a STOP-Read directive (for example "Read \`.../sections/<file>\` and execute it in full"), you MUST actually Read that sections/ file with the Read tool BEFORE doing the work it covers. Do not work from memory.
+- Do NOT run git, gh, commit, push, or any mutating command.
+- When the workflow is complete, write the skill's final output (the full review report / ship plan, including any required report table) to ${outFile}.`;
+
+  const result = await runSkillTest({
+    prompt,
+    workingDirectory: opts.planDir,
+    allowedTools: ['Read', 'Grep', 'Glob', 'Write'],
+    maxTurns: opts.maxTurns ?? 25,
+    timeout: opts.timeout ?? 300_000,
+    testName: opts.testName,
+    runId: opts.runId,
+    model: opts.model ?? 'claude-opus-4-7',
+  });
+
+  const readSections = new Set<string>();
+  for (const c of result.toolCalls) {
+    if (c.tool !== 'Read') continue;
+    const fp = String(c.input?.file_path ?? '');
+    const m = fp.match(/sections\/([A-Za-z0-9._-]+\.md)/);
+    if (m) readSections.add(m[1]);
+  }
+
+  let output = '';
+  try { output = fs.readFileSync(outFile, 'utf-8'); } catch { output = result.output ?? ''; }
+  const reportProduced = opts.reportMarker ? opts.reportMarker.test(output) : output.trim().length > 0;
+
+  return { readSections, reportProduced, toolCalls: result.toolCalls, output };
+}
+
+/** Read the carved (current worktree) plan-ceo SKILL.md + its sections dir. */
+export function carvedSkill(): { skillMd: string; sectionsFrom: string | null } {
+  const sec = path.join(ROOT, 'plan-ceo-review', 'sections');
+  return {
+    skillMd: fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8'),
+    sectionsFrom: fs.existsSync(sec) ? sec : null,
+  };
+}
+
+/** Read the pre-carve verbose monolith plan-ceo SKILL.md from git. */
+export function verboseSkill(gitRef = 'ab66193e^'): string {
+  return execGit(['show', `${gitRef}:plan-ceo-review/SKILL.md`]);
+}
+
+function execGit(args: string[]): string {
+  const r = spawnSync('git', args, { cwd: ROOT, encoding: 'utf-8', maxBuffer: 64 * 1024 * 1024 });
+  if (r.status !== 0) throw new Error(`git ${args.join(' ')} failed: ${r.stderr}`);
+  return r.stdout;
+}
+
+/**
+ * Drive plan-ceo-review to its Step 0F mode-selection AskUserQuestion in the
+ * given plan dir and capture the verbatim question text the model generates.
+ * Returns the captured text ('' if the agent never wrote the file).
+ */
+export async function captureModeSelectionAuq(opts: {
+  planDir: string;
+  testName: string;
+  runId?: string;
+  model?: string;
+}): Promise<string> {
+  const outFile = path.join(opts.planDir, 'ask-capture.md');
+  const skillPath = path.join(opts.planDir, 'plan-ceo-review', 'SKILL.md');
+  const planPath = path.join(opts.planDir, 'plan.md');
+  // CRITICAL: pin the EXACT skill file. Without this the agent runs
+  // `find / -name SKILL.md` / Glob and reads the GLOBAL install
+  // (~/.claude/skills/...) instead of the version-under-test in the temp dir —
+  // which silently invalidates a carved-vs-verbose A/B (both sides end up
+  // reading the same global skill). Absolute path + no-wander instruction +
+  // Bash disallowed (so `find /` is impossible) locks it to the planted file.
+  const prompt = `You are running a format-capture test. Use ONLY these two files:
+  - The skill to follow: ${skillPath}
+  - The plan to review: ${planPath}
+
+Read ${skillPath} for the review workflow. Do NOT search for, Glob, find, or read any OTHER SKILL.md anywhere on the system — especially nothing under ~/.claude or /Users. The ONLY skill file you may read is the absolute path above.
+
+Read ${planPath} — that is the plan to review. It is a standalone plan document, not a codebase. Skip any codebase exploration or system-audit steps.
+
+Proceed to Step 0F (Mode Selection), where the skill presents the 4 review-mode options to the user via AskUserQuestion.
+
+Write the verbatim text of that AskUserQuestion (the full decision brief: title, ELI10, stakes, recommendation, every option with its pros/cons bullets, and the Net line) to ${outFile}. Do NOT call any tool to ask the user. Do NOT paraphrase. After writing the file, stop.`;
+
+  await runSkillTest({
+    prompt,
+    workingDirectory: opts.planDir,
+    // Read + Write only: no Bash means the agent cannot `find /` its way to the
+    // global install, and the skill's preamble bash blocks (irrelevant to format
+    // capture) can't run and wander.
+    allowedTools: ['Read', 'Write'],
+    maxTurns: 12,
+    timeout: 240_000,
+    testName: opts.testName,
+    runId: opts.runId,
+    model: opts.model ?? 'claude-opus-4-7',
+  });
+
+  try {
+    const text = fs.readFileSync(outFile, 'utf-8');
+    // Defense in depth: verify the agent actually read the planted skill, not a
+    // global one. If the captured run somehow read elsewhere we can't detect it
+    // from the output file alone, so callers should also confirm via the run
+    // log; this guard at least catches an empty/placeholder capture.
+    return text;
+  } catch {
+    return '';
+  }
+}
@@ -226,7 +226,14 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 120_000,
  },
  {
+    // Carved (v2 plan T9): skeleton SKILL.md + sections/review-sections.md.
+    // Content + size floors run against the union (relocated prose still counts);
+    // maxSkeletonBytes asserts the always-loaded skeleton shrank from the ~138KB
+    // monolith to ~81KB (measured 80,731 B, -42%). Headroom to 90KB so a small
+    // skeleton edit doesn't trip CI, but a 10KB regression does.
    skill: 'plan-ceo-review',
+    sectioned: true,
+    maxSkeletonBytes: 90_000,
    mustContain: [
      'SCOPE EXPANSION',
      'SELECTIVE EXPANSION',
@@ -238,7 +245,13 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 80_000,
  },
  {
+    // Carved (v2 plan T9): skeleton + sections/review-sections.md. The 4-section
+    // review, outside voice, and required outputs moved to the section; content
+    // checks run against the union. Skeleton shrank 106,984 -> 54,892 B (-48.7%);
+    // maxSkeletonBytes 62KB = measured + headroom.
    skill: 'plan-eng-review',
+    sectioned: true,
+    maxSkeletonBytes: 62_000,
    mustContain: [
      'Architecture',
      'Code Quality',
@@ -250,7 +263,13 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 70_000,
  },
  {
+    // Carved (v2 plan T9): skeleton + sections/review-sections.md. The 7 design
+    // passes + required outputs moved to the section; content checks run against
+    // the union. Skeleton shrank 112,057 -> 76,024 B (-32.2%); maxSkeletonBytes
+    // 82KB = measured + headroom.
    skill: 'plan-design-review',
+    sectioned: true,
+    maxSkeletonBytes: 82_000,
    mustContain: [
      'design',
      'visual',
@@ -281,7 +300,15 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 30_000,
  },
  {
+    // Carved (v2 plan T9): skeleton SKILL.md + sections/design-and-handoff.md.
+    // Phase 5 (design doc) + Phase 6 (handoff) moved into the section, so
+    // 'design doc' / 'problem statement' now live there — content checks run
+    // against the union. maxSkeletonBytes asserts the always-loaded skeleton
+    // shrank from the ~118KB monolith to ~89KB (measured 88,975 B, -24.8%);
+    // headroom to 96KB so a small skeleton edit doesn't trip CI.
    skill: 'office-hours',
+    sectioned: true,
+    maxSkeletonBytes: 96_000,
    mustContain: ['design doc', 'problem statement'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
@@ -116,12 +116,13 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  // Real-PTY E2E batch (#6 new tests on the harness).
  // Each one tests behavior the SDK harness can't observe (rendered TTY,
  // numbered-option lists, multi-phase ordering, idempotency state echo).
-  'ask-user-question-format-pty':              ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completeness-section.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
+  'auq-format-gate':                           ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completeness-section.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/auq-sdk-capture.ts', 'test/helpers/session-runner.ts', 'test/helpers/llm-judge.ts'],
  'plan-ceo-mode-routing':       ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
  'plan-design-with-ui-scope':   ['plan-design-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
  'budget-regression-pty':       ['test/helpers/eval-store.ts', 'test/skill-budget-regression.test.ts'],
  'ship-idempotency-pty':        ['ship/**', 'bin/gstack-next-version', 'bin/gstack-version-bump', 'scripts/resolvers/sections.ts', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
-  'ship-section-loading':        ['ship/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/required-reads.ts', 'test/helpers/transcript-section-logger.ts', 'test/helpers/claude-pty-runner.ts'],
+  'ship-section-loading':        ['ship/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/auq-sdk-capture.ts', 'test/helpers/session-runner.ts'],
+  'plan-ceo-section-loading':    ['plan-ceo-review/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/auq-sdk-capture.ts', 'test/helpers/session-runner.ts'],
  'autoplan-chain-pty':          ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
  'e2e-harness-audit':            ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'],

@@ -504,12 +505,13 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  // Real-PTY E2E batch — tier classification:
  //   gate: cheap, deterministic, run on every PR
  //   periodic: long-running or expensive (>$3/run), run weekly
-  'ask-user-question-format-pty':            'gate',       // ~$0.50/run, single skill probe
+  'auq-format-gate':                         'gate',       // ~$0.50/run, SDK capture, single skill probe
  'plan-ceo-mode-routing':     'periodic',   // ~$3/run, deep navigation through 8-12 prior AskUserQuestions
  'plan-design-with-ui-scope': 'gate',       // ~$0.80/run
  'budget-regression-pty':     'gate',       // free, library-only assertion
  'ship-idempotency-pty':      'periodic',   // ~$3/run, real /ship in plan mode
  'ship-section-loading':      'periodic',   // ~$3/run, real /ship; asserts section reads
+  'plan-ceo-section-loading':  'periodic',   // ~$3-5/run, real /plan-ceo-review; asserts section read
  'autoplan-chain-pty':        'periodic',   // ~$8/run, all 3 phases sequential

  // Per-finding count + review-report-at-bottom — periodic because each