Garry Tan 69733e2622 fix(plan-reviews): restore RECOMMENDATION + Completeness split + Codex ELI10 (v1.6.3.0) (#1149)
* test: add AskUserQuestion format regression eval for plan reviews

Four-case periodic-tier eval that captures the verbatim AskUserQuestion
text /plan-ceo-review and /plan-eng-review produce, then asserts the
format rule is honored: RECOMMENDATION always, Completeness: N/10 only
on coverage-differentiated options, and an explicit "options differ in
kind" note on kind-differentiated options.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Classified periodic because behavior depends on Opus non-determinism —
gate-tier would flake and block merges.

Test harness instructs the agent to write its would-be AskUserQuestion
text to $OUT_FILE rather than invoke a real tool (MCP AskUserQuestion
isn't wired in the test subprocess). Regex predicates then validate
the captured content.
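
A minimal sketch of the capture step described above, assuming a Node-compatible runtime. The helper names (`buildCapturePrompt`, `readCapture`, `newCaptureFile`), the instruction wording, and the capture file name are hypothetical illustrations, not the actual harness code:

```typescript
import { existsSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: appends a capture instruction to a skill prompt so the
// agent writes its would-be AskUserQuestion text to a file instead of calling
// the tool. The instruction wording is an assumption.
export function buildCapturePrompt(basePrompt: string, outFile: string): string {
  return [
    basePrompt,
    "",
    "Do not call AskUserQuestion. Instead, write the exact text you would",
    `have passed to it, verbatim, to this file: ${outFile}`,
  ].join("\n");
}

// Returns the captured text, or null when the agent never wrote the file.
export function readCapture(outFile: string): string | null {
  return existsSync(outFile) ? readFileSync(outFile, "utf8") : null;
}

// Gives each case its own temp directory so captures cannot leak between cases.
export function newCaptureFile(caseName: string): string {
  const dir = mkdtempSync(join(tmpdir(), `plan-format-${caseName}-`));
  return join(dir, "ask-user-question.txt");
}
```

Returning null for a missing file lets the caller tell "agent wrote nothing" apart from "agent wrote text that fails the format predicates".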

Cost: ~$2 per full run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plan-reviews): restore RECOMMENDATION + split Completeness by question type

Opus 4.7 users reported /plan-ceo-review and /plan-eng-review stopped
emitting the RECOMMENDATION line and per-option Completeness: N/10
scores. E2E capture showed the real failure mode: on kind-differentiated
questions (mode selection, architectural A-vs-B, cherry-pick), Opus 4.7
either fabricated filler scores (10/10 on every option — conveys nothing)
or dropped the format entirely when the metric didn't fit.

Fix is at two layers:

1. scripts/resolvers/preamble/generate-ask-user-format.ts splits the old
   run-on step 3 into:
   - Step 3 "Recommend (ALWAYS)": RECOMMENDATION is required on every
     question, coverage- or kind-differentiated.
   - Step 4 "Score completeness (when meaningful)": emit Completeness: N/10
     only when options differ in coverage. When options differ in kind,
     skip the score and include a one-line explanatory note. Do not
     fabricate scores.

2. scripts/resolvers/preamble/generate-completeness-section.ts updates
   the Completeness Principle tail to match. Without this, the preamble
   contained two rules (one conditional, one unconditional) and the
   model hedged.
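
The conditional rule from steps 3 and 4 can be illustrated as a small renderer. This is only a sketch of the expected output shape, not the resolver itself (the resolver generates prompt text, and the model produces the question). The `Question` and `Option` types, `renderQuestion`, and the exact note wording are assumptions:

```typescript
type Differentiation = "coverage" | "kind";

interface Option {
  label: string;
  summary: string;
  completeness?: number; // 0-10, only meaningful on coverage questions
}

interface Question {
  differentiation: Differentiation;
  recommendation: string; // label of the recommended option
  options: Option[];
}

// Hypothetical renderer encoding the two rules:
//   Step 3: RECOMMENDATION is emitted on every question.
//   Step 4: Completeness: N/10 is emitted only on coverage questions;
//           kind questions get a one-line note and no score.
export function renderQuestion(q: Question): string {
  const lines: string[] = [`RECOMMENDATION: Choose ${q.recommendation}`];
  if (q.differentiation === "kind") {
    lines.push("Note: options differ in kind, so no completeness score is given.");
  }
  for (const opt of q.options) {
    const showScore =
      q.differentiation === "coverage" && opt.completeness !== undefined;
    const score = showScore ? ` (Completeness: ${opt.completeness}/10)` : "";
    lines.push(`${opt.label}) ${opt.summary}${score}`);
  }
  return lines.join("\n");
}
```

Even if a caller supplies a `completeness` value on a kind-differentiated question, the sketch drops it, which mirrors the "do not fabricate scores" rule.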

Template anchors reinforce the distinction where agent judgment is most
likely to drift:

- plan-ceo-review Section 0C-bis (approach menu) gets the
  coverage-differentiated anchor.
- plan-ceo-review Section 0F (mode selection) gets the kind-differentiated
  anchor.
- plan-eng-review CRITICAL RULE section gets the coverage-vs-kind rule
  for every per-issue AskUserQuestion raised during the review.

Regenerated SKILL.md for all T2 skills + golden fixtures refreshed. Every
skill using the T2 preamble now has the same conditional scoring rule.

Verified via new periodic-tier eval (test/skill-e2e-plan-format.test.ts):
all 4 cases fail on prior behavior, all 4 pass with this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.6.2.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: add Codex eval for AskUserQuestion format compliance

Four-case periodic-tier eval mirrors test/skill-e2e-plan-format.test.ts
but drives the plan review skills via codex exec instead of claude -p.

Context: Codex under the gpt.md "No preamble / Prefer doing over listing"
overlay tends to skip the Simplify/ELI10 paragraph and the RECOMMENDATION
line on AskUserQuestion calls. Users have to manually re-prompt "ELI10
and don't forget to recommend" almost every time. This test pins the
behavior so regressions surface.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Assertions on captured AskUserQuestion text:
- RECOMMENDATION: Choose present (all cases)
- Completeness: N/10 present on coverage, absent on kind
- "options differ in kind" note present on kind
- ELI10 length floor (>400 chars) — catches bare options-only output
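
The four assertions above could be expressed as regex predicates along these lines. This is a sketch under assumptions: the patterns, the `checkFormat` name, and measuring the ELI10 floor against the whole captured text are illustrative choices, not the test file's actual code:

```typescript
type QuestionType = "coverage" | "kind";

// Patterns are assumptions derived from the assertion list.
// None use the g flag, so RegExp.test() carries no state between calls.
const RECOMMENDATION = /RECOMMENDATION:\s*Choose/;
const COMPLETENESS = /Completeness:\s*\d{1,2}\/10/;
const KIND_NOTE = /options differ in kind/i;
const ELI10_MIN_CHARS = 400; // floor is exclusive: text must be longer than this

// Returns a list of failure messages; an empty list means the capture passes.
export function checkFormat(text: string, type: QuestionType): string[] {
  const failures: string[] = [];
  if (!RECOMMENDATION.test(text)) failures.push("missing RECOMMENDATION");
  if (text.length <= ELI10_MIN_CHARS) failures.push("too short for ELI10");

  const hasScore = COMPLETENESS.test(text);
  if (type === "coverage" && !hasScore) {
    failures.push("missing Completeness: N/10");
  }
  if (type === "kind") {
    if (hasScore) failures.push("unexpected Completeness score");
    if (!KIND_NOTE.test(text)) failures.push("missing kind note");
  }
  return failures;
}
```

Collecting every failure instead of stopping at the first makes a flaky periodic run easier to diagnose from a single report.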

Cost: ~$2-4 per full run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(preamble): harden AskUserQuestion Format + Codex ELI10 carve-out

Follow-up to v1.6.2.0. Codex (GPT-5.4) under the gpt.md overlay
treated "No preamble / Prefer doing over listing" as license to skip
the Simplify paragraph and the RECOMMENDATION line on AskUserQuestion
calls. Users had to manually re-prompt "ELI10 and don't forget to
recommend" almost every time.

Two layers:

1. model-overlays/gpt.md — adds an explicit "AskUserQuestion is NOT
   preamble" carve-out. The "No preamble" rule applies to direct
   answers; AskUserQuestion content must emit the full format
   (Re-ground, Simplify/ELI10, Recommend, Options). Tells the model:
   if you find yourself about to skip any of these, back up and emit
   them — the user will ask anyway, so do it the first time.

2. scripts/resolvers/preamble/generate-ask-user-format.ts — step 2
   renamed to "Simplify (ELI10, ALWAYS)" with explicit "not optional
   verbosity, not preamble" framing. Step 3 "Recommend (ALWAYS)"
   hardened: "Never omit, never collapse into the options list."

All T2 skills regenerated across all hosts. Golden fixtures refreshed
(claude-ship, codex-ship, factory-ship). Updated the ELI10 assertion
in test/gen-skill-docs.test.ts to match the new wording.

Codex compliance to be verified empirically via test/codex-e2e-plan-format.test.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: fix Codex eval sandbox + collector API

Two test infrastructure bugs landed with the initial Codex eval in the
prior commit:

1. sandbox: 'read-only' (the default) blocked Codex from writing
   $OUT_FILE. Test reported "STATUS: BLOCKED" and exited 0 without
   a capture file. Fixed: sandbox: 'workspace-write' for all 4 cases,
   allowing writes inside the tempdir.

2. recordCodexResult called a non-existent evalCollector.record()
   API (I invented it). The real surface is addTest() with a
   different field schema. Aligned with test/codex-e2e.test.ts
   pattern.
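
Because the blocked run exited 0, the exit code alone could not distinguish a real capture from a sandbox block. One way to guard against that is sketched below; `classifyRun` and `RunOutcome` are hypothetical names, and the assumption is that the harness has the subprocess stdout and the capture text (or null) available:

```typescript
type RunOutcome = "captured" | "blocked" | "no-capture";

// Hypothetical guard: a run only counts as measured when a non-empty capture
// exists. Otherwise the outcome records whether the agent reported a block.
export function classifyRun(stdout: string, capture: string | null): RunOutcome {
  if (capture !== null && capture.trim().length > 0) return "captured";
  return /STATUS:\s*BLOCKED/.test(stdout) ? "blocked" : "no-capture";
}
```

A test would then fail on anything other than "captured" before running format assertions, so a sandbox misconfiguration surfaces as an infrastructure failure and not as a format result.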

With both fixed, the eval now actually measures Codex AskUserQuestion
format compliance. All 4 cases pass on v1.6.2.0 with the gpt.md
carve-out: RECOMMENDATION always, Completeness: N/10 only on coverage,
"options differ in kind" note on kind, ELI10 explanation present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: bump version and changelog (v1.6.3.0)

Adds the Codex ELI10 + RECOMMENDATION carve-out scope that landed after
the Claude-verified fix in v1.6.2.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 07:25:20 -07:00