fix(plan-reviews): restore RECOMMENDATION + Completeness split + Codex ELI10 (v1.6.3.0) (#1149)

* test: add AskUserQuestion format regression eval for plan reviews

Four-case periodic-tier eval that captures the verbatim AskUserQuestion
text /plan-ceo-review and /plan-eng-review produce, then asserts the
format rule is honored: RECOMMENDATION always, Completeness: N/10 only
on coverage-differentiated options, and an explicit "options differ in
kind" note on kind-differentiated options.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Classified periodic because behavior depends on Opus non-determinism —
gate-tier would flake and block merges.

Test harness instructs the agent to write its would-be AskUserQuestion
text to $OUT_FILE rather than invoke a real tool (MCP AskUserQuestion
isn't wired in the test subprocess). Regex predicates then validate
the captured content.

Cost: ~$2 per full run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plan-reviews): restore RECOMMENDATION + split Completeness by question type

Opus 4.7 users reported /plan-ceo-review and /plan-eng-review stopped
emitting the RECOMMENDATION line and per-option Completeness: X/10
scores. E2E capture showed the real failure mode: on kind-differentiated
questions (mode selection, architectural A-vs-B, cherry-pick), Opus 4.7
either fabricated filler scores (10/10 on every option — conveys nothing)
or dropped the format entirely when the metric didn't fit.

The fix is at two layers (sketch after the list):

1. scripts/resolvers/preamble/generate-ask-user-format.ts splits the old
   run-on step 3 into:
   - Step 3 "Recommend (ALWAYS)": RECOMMENDATION is required on every
     question, coverage- or kind-differentiated.
   - Step 4 "Score completeness (when meaningful)": emit Completeness: N/10
     only when options differ in coverage. When options differ in kind,
     skip the score and include a one-line explanatory note. Do not
     fabricate scores.

2. scripts/resolvers/preamble/generate-completeness-section.ts updates
   the Completeness Principle tail to match. Without this, the preamble
   contained two rules (one conditional, one unconditional) and the
   model hedged.

Template anchors reinforce the distinction where agent judgment is most
likely to drift:

- plan-ceo-review Section 0C-bis (approach menu) gets the
  coverage-differentiated anchor.
- plan-ceo-review Section 0F (mode selection) gets the kind-differentiated
  anchor.
- plan-eng-review CRITICAL RULE section gets the coverage-vs-kind rule
  for every per-issue AskUserQuestion raised during the review.

Regenerated SKILL.md for all T2 skills + golden fixtures refreshed. Every
skill using the T2 preamble now has the same conditional scoring rule.

Verified via new periodic-tier eval (test/skill-e2e-plan-format.test.ts):
all 4 cases fail on prior behavior, all 4 pass with this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.6.2.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: add Codex eval for AskUserQuestion format compliance

Four-case periodic-tier eval mirrors test/skill-e2e-plan-format.test.ts
but drives the plan review skills via codex exec instead of claude -p.

Context: Codex under the gpt.md "No preamble / Prefer doing over listing"
overlay tends to skip the Simplify/ELI10 paragraph and the RECOMMENDATION
line on AskUserQuestion calls. Users have to manually re-prompt "ELI10
and don't forget to recommend" almost every time. This test pins the
behavior so regressions surface.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Assertions on captured AskUserQuestion text (sketched after this list):
- RECOMMENDATION: Choose present (all cases)
- Completeness: N/10 present on coverage, absent on kind
- "options differ in kind" note present on kind
- ELI10 length floor (>400 chars) — catches bare options-only output
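
A minimal sketch of what those assertions amount to, assuming the same
file-capture approach as the Claude eval (helper wiring omitted; the
function and parameter names below are illustrative):

```ts
import { expect } from 'bun:test';
import * as fs from 'fs';

// Sketch only, not the literal test source. `outFile` and `kindDifferentiated`
// stand in for the per-case wiring the real eval sets up.
function assertAskUserFormat(outFile: string, kindDifferentiated: boolean): void {
  const captured = fs.readFileSync(outFile, 'utf-8');
  expect(captured).toMatch(/RECOMMENDATION:[*\s]*Choose/);  // every case
  expect(captured.length).toBeGreaterThan(400);             // ELI10 floor: bare options-only output fails
  if (kindDifferentiated) {
    expect(captured).not.toMatch(/Completeness:\s*\d{1,2}\/10/);
    expect(captured).toMatch(/options differ in kind/i);
  } else {
    expect(captured).toMatch(/Completeness:\s*\d{1,2}\/10/);
  }
}
```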

Cost: ~$2-4 per full run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(preamble): harden AskUserQuestion Format + Codex ELI10 carve-out

Follow-up to v1.6.2.0. Codex (GPT-5.4) under the gpt.md overlay
treated "No preamble / Prefer doing over listing" as license to skip
the Simplify paragraph and the RECOMMENDATION line on AskUserQuestion
calls. Users had to manually re-prompt "ELI10 and don't forget to
recommend" almost every time.

Two layers:

1. model-overlays/gpt.md — adds an explicit "AskUserQuestion is NOT
   preamble" carve-out. The "No preamble" rule applies to direct
   answers; AskUserQuestion content must emit the full format
   (Re-ground, Simplify/ELI10, Recommend, Options). Tells the model:
   if you find yourself about to skip any of these, back up and emit
   them — the user will ask anyway, so do it the first time.

2. scripts/resolvers/preamble/generate-ask-user-format.ts — step 2
   renamed to "Simplify (ELI10, ALWAYS)" with explicit "not optional
   verbosity, not preamble" framing. Step 3 "Recommend (ALWAYS)"
   hardened: "Never omit, never collapse into the options list."

All T2 skills regenerated across all hosts. Golden fixtures refreshed
(claude-ship, codex-ship, factory-ship). Updated the ELI10 assertion
in test/gen-skill-docs.test.ts to match the new wording.

Codex compliance to be verified empirically via test/codex-e2e-plan-format.test.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: fix Codex eval sandbox + collector API

Two test-infrastructure bugs shipped with the initial Codex eval in the
prior commit (both fixes sketched below):

1. sandbox: 'read-only' (the default) blocked Codex from writing
   $OUT_FILE. Test reported "STATUS: BLOCKED" and exited 0 without
   a capture file. Fixed: sandbox: 'workspace-write' for all 4 cases,
   allowing writes inside the tempdir.

2. recordCodexResult called a non-existent evalCollector.record()
   API (I invented it). The real surface is addTest() with a
   different field schema. Aligned with test/codex-e2e.test.ts
   pattern.

With both fixed, the eval now actually measures Codex AskUserQuestion
format compliance. All 4 cases pass on v1.6.2.0 with the gpt.md
carve-out: RECOMMENDATION always, Completeness: N/10 only on coverage,
"options differ in kind" note on kind, ELI10 explanation present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: bump version and changelog (v1.6.3.0)

Adds to the changelog the Codex ELI10 + RECOMMENDATION carve-out work
landed after v1.6.2.0's Claude-verified fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan committed via GitHub on 2026-04-23 07:25:20 -07:00
parent 656df0e37e
commit 69733e2622
46 changed files with 984 additions and 175 deletions

New file: test/skill-e2e-plan-format.test.ts (+297 lines)
/**
 * AskUserQuestion format regression test for /plan-ceo-review and /plan-eng-review.
 *
 * Context: a user on Opus 4.7 reported the RECOMMENDATION line and the
 * `Completeness: N/10` per-option score stopped appearing on AskUserQuestion
 * prompts. This test captures the agent's AskUserQuestion output verbatim
 * and asserts the format rule is applied.
 *
 * Capture shape: `claude -p` sessions inside this harness do not have the
 * AskUserQuestion MCP tool wired. We instruct the agent to write the verbatim
 * AskUserQuestion text it would have made to $OUT_FILE instead of calling
 * any tool. Assertions read that file.
 *
 * Coverage-vs-kind split: the format rule says to include `Completeness: N/10`
 * only when options differ in coverage. When options differ in kind (mode
 * selection, posture choice, cherry-pick Add/Defer/Skip), the score is
 * intentionally absent and a one-line note explains why. Assertions split
 * accordingly.
 */
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
  ROOT, runId,
  describeIfSelected, testConcurrentIfSelected,
  logCost, recordE2E,
  createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

const evalCollector = createEvalCollector('e2e-plan-format');
// Regex predicates applied to captured AskUserQuestion content.
// RECOMMENDATION regex is lenient on intervening markdown markers (e.g.
// agent writes `**RECOMMENDATION:** Choose` — the `**` closers are benign).
const RECOMMENDATION_RE = /RECOMMENDATION:[*\s]*Choose/;
const COMPLETENESS_RE = /Completeness:\s*\d{1,2}\/10/;
const KIND_NOTE_RE = /options differ in kind/i;
const SAMPLE_PLAN = `# Plan: Add User Dashboard
## Context
We're building a new user dashboard that shows recent activity, notifications, and quick actions.
## Changes
1. New React component \`UserDashboard\` in \`src/components/\`
2. REST API endpoint \`GET /api/dashboard\` returning user stats
3. PostgreSQL query for activity aggregation
4. Redis cache layer for dashboard data (5min TTL)
## Architecture
- Frontend: React + TailwindCSS
- Backend: Express.js REST API
- Database: PostgreSQL with existing user/activity tables
- Cache: Redis for dashboard aggregates
`;

function setupPlanDir(tmpPrefix: string, skillName: 'plan-ceo-review' | 'plan-eng-review'): string {
  const planDir = fs.mkdtempSync(path.join(os.tmpdir(), tmpPrefix));
  const run = (cmd: string, args: string[]) =>
    spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
  run('git', ['init', '-b', 'main']);
  run('git', ['config', 'user.email', 'test@test.com']);
  run('git', ['config', 'user.name', 'Test']);
  fs.writeFileSync(path.join(planDir, 'plan.md'), SAMPLE_PLAN);
  run('git', ['add', '.']);
  run('git', ['commit', '-m', 'add plan']);
  fs.mkdirSync(path.join(planDir, skillName), { recursive: true });
  fs.copyFileSync(
    path.join(ROOT, skillName, 'SKILL.md'),
    path.join(planDir, skillName, 'SKILL.md'),
  );
  return planDir;
}

// The capture instruction passed to every case. Tells the agent to dump
// AskUserQuestion content to a file instead of calling a tool.
function captureInstruction(outFile: string): string {
  return `Write the verbatim text of every AskUserQuestion you would have made to ${outFile} (one question per session, full text including options and recommendation line). Do NOT call any tool to ask the user. Do NOT paraphrase — include the exact prose you would have shown. This is a format-capture test, not an interactive session.`;
}

// --- Case 1: plan-ceo-review mode selection (kind-differentiated) ---
describeIfSelected('Plan Format — CEO Mode Selection', ['plan-ceo-review-format-mode'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-format-ceo-mode-', 'plan-ceo-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-ceo-review-format-mode', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Proceed to Step 0F (Mode Selection). This is where the skill presents 4 mode options (SCOPE EXPANSION, SELECTIVE EXPANSION, HOLD SCOPE, SCOPE REDUCTION) to the user via AskUserQuestion. These options differ in kind (review posture), not in coverage.
${captureInstruction(outFile)}
After writing the file, stop. Do not continue the review.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-ceo-review-format-mode',
      runId,
      model: 'claude-opus-4-7',
    });
    logCost('/plan-ceo-review format (mode)', result);
    recordE2E(evalCollector, '/plan-ceo-review-format-mode', 'Plan Format — CEO Mode Selection', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);
    expect(fs.existsSync(outFile)).toBe(true);

    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(100);
    // Kind-differentiated: RECOMMENDATION required, Completeness: N/10 must NOT appear,
    // "options differ in kind" note must appear.
    expect(captured).toMatch(RECOMMENDATION_RE);
    expect(captured).not.toMatch(COMPLETENESS_RE);
    expect(captured).toMatch(KIND_NOTE_RE);
  }, 300_000);
});

// --- Case 2: plan-ceo-review approach menu (coverage-differentiated) ---
describeIfSelected('Plan Format — CEO Approach Menu', ['plan-ceo-review-format-approach'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-format-ceo-approach-', 'plan-ceo-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-ceo-review-format-approach', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Proceed to Step 0C-bis (Implementation Alternatives / Approach Menu). This is where the skill generates 2-3 approaches (minimal viable vs ideal architecture) and presents them via AskUserQuestion. These options differ in coverage (complete vs shortcut), so Completeness: N/10 applies.
${captureInstruction(outFile)}
After writing the file, stop. Do not continue the review.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-ceo-review-format-approach',
      runId,
      model: 'claude-opus-4-7',
    });
    logCost('/plan-ceo-review format (approach)', result);
    recordE2E(evalCollector, '/plan-ceo-review-format-approach', 'Plan Format — CEO Approach Menu', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);
    expect(fs.existsSync(outFile)).toBe(true);

    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(100);
    // Coverage-differentiated: both RECOMMENDATION and Completeness: N/10 required.
    expect(captured).toMatch(RECOMMENDATION_RE);
    expect(captured).toMatch(COMPLETENESS_RE);
  }, 300_000);
});

// --- Case 3: plan-eng-review coverage-differentiated per-issue AskUserQuestion ---
describeIfSelected('Plan Format — Eng Coverage Issue', ['plan-eng-review-format-coverage'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-format-eng-cov-', 'plan-eng-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-eng-review-format-coverage', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
During your review (Section 3 Test Review is the natural place), generate ONE AskUserQuestion about test coverage depth where the options are clearly coverage-differentiated. For example:
A) Full coverage: happy path + edge cases + error paths (Completeness 10/10)
B) Happy path only (Completeness 7/10)
C) Smoke test (Completeness 3/10)
${captureInstruction(outFile)}
After writing the file with that ONE question, stop. Do not continue the review.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-eng-review-format-coverage',
      runId,
      model: 'claude-opus-4-7',
    });
    logCost('/plan-eng-review format (coverage)', result);
    recordE2E(evalCollector, '/plan-eng-review-format-coverage', 'Plan Format — Eng Coverage Issue', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);
    expect(fs.existsSync(outFile)).toBe(true);

    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(100);
    // Coverage-differentiated: both RECOMMENDATION and Completeness: N/10 required.
    expect(captured).toMatch(RECOMMENDATION_RE);
    expect(captured).toMatch(COMPLETENESS_RE);
  }, 300_000);
});

// --- Case 4: plan-eng-review kind-differentiated per-issue AskUserQuestion ---
describeIfSelected('Plan Format — Eng Kind Issue', ['plan-eng-review-format-kind'], () => {
  let planDir: string;
  let outFile: string;

  beforeAll(() => {
    planDir = setupPlanDir('skill-e2e-plan-format-eng-kind-', 'plan-eng-review');
    outFile = path.join(planDir, 'ask-capture.md');
  });

  afterAll(() => {
    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
  });

  testConcurrentIfSelected('plan-eng-review-format-kind', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
During your review (Section 1 Architecture), generate ONE AskUserQuestion about an architectural choice where the options differ in kind, not in coverage. For example, "should we use Redis or Postgres for the cache layer?" — the options are different kinds of systems with different tradeoffs, not more-or-less-complete versions of the same thing.
${captureInstruction(outFile)}
After writing the file with that ONE question, stop. Do not continue the review.`,
      workingDirectory: planDir,
      maxTurns: 10,
      timeout: 240_000,
      testName: 'plan-eng-review-format-kind',
      runId,
      model: 'claude-opus-4-7',
    });
    logCost('/plan-eng-review format (kind)', result);
    recordE2E(evalCollector, '/plan-eng-review-format-kind', 'Plan Format — Eng Kind Issue', result, {
      passed: ['success', 'error_max_turns'].includes(result.exitReason),
    });
    expect(['success', 'error_max_turns']).toContain(result.exitReason);
    expect(fs.existsSync(outFile)).toBe(true);

    const captured = fs.readFileSync(outFile, 'utf-8');
    expect(captured.length).toBeGreaterThan(100);
    // Kind-differentiated: RECOMMENDATION required, Completeness: N/10 must NOT appear,
    // "options differ in kind" note must appear.
    expect(captured).toMatch(RECOMMENDATION_RE);
    expect(captured).not.toMatch(COMPLETENESS_RE);
    expect(captured).toMatch(KIND_NOTE_RE);
  }, 300_000);
});

afterAll(async () => {
  await finalizeEvalCollector(evalCollector);
});