Files
gstack/scripts
Garry Tan 5fe1814310 fix(plan-reviews): restore RECOMMENDATION + split Completeness by question type
Opus 4.7 users reported /plan-ceo-review and /plan-eng-review stopped
emitting the RECOMMENDATION line and per-option Completeness: X/10
scores. E2E capture showed the real failure mode: on kind-differentiated
questions (mode selection, architectural A-vs-B, cherry-pick), Opus 4.7
either fabricated filler scores (10/10 on every option — conveys nothing)
or dropped the format entirely when the metric didn't fit.

Fix is at two layers:

1. scripts/resolvers/preamble/generate-ask-user-format.ts splits the old
   run-on step 3 into:
   - Step 3 "Recommend (ALWAYS)": RECOMMENDATION is required on every
     question, coverage- or kind-differentiated.
   - Step 4 "Score completeness (when meaningful)": emit Completeness: N/10
     only when options differ in coverage. When options differ in kind,
     skip the score and include a one-line explanatory note. Do not
     fabricate scores.

2. scripts/resolvers/preamble/generate-completeness-section.ts updates
   the Completeness Principle tail to match. Without this, the preamble
   contained two rules (one conditional, one unconditional) and the
   model hedged.

Template anchors reinforce the distinction where agent judgment is most
likely to drift:

- plan-ceo-review Section 0C-bis (approach menu) gets the
  coverage-differentiated anchor.
- plan-ceo-review Section 0F (mode selection) gets the kind-differentiated
  anchor.
- plan-eng-review CRITICAL RULE section gets the coverage-vs-kind rule
  for every per-issue AskUserQuestion raised during the review.

Regenerated SKILL.md for all T2 skills + golden fixtures refreshed. Every
skill using the T2 preamble now has the same conditional scoring rule.

Verified via new periodic-tier eval (test/skill-e2e-plan-format.test.ts):
all 4 cases fail on prior behavior, all 4 pass with this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 01:11:05 -07:00
..