mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-05 05:05:08 +02:00
fix(plan-reviews): restore RECOMMENDATION + split Completeness by question type
Opus 4.7 users reported /plan-ceo-review and /plan-eng-review stopped
emitting the RECOMMENDATION line and per-option Completeness: X/10
scores. E2E capture showed the real failure mode: on kind-differentiated
questions (mode selection, architectural A-vs-B, cherry-pick), Opus 4.7
either fabricated filler scores (10/10 on every option — conveys nothing)
or dropped the format entirely when the metric didn't fit.
The fix lands at two layers:
1. scripts/resolvers/preamble/generate-ask-user-format.ts splits the old
run-on step 3 into:
- Step 3 "Recommend (ALWAYS)": RECOMMENDATION is required on every
question, coverage- or kind-differentiated.
- Step 4 "Score completeness (when meaningful)": emit Completeness: N/10
only when options differ in coverage. When options differ in kind,
skip the score and include a one-line explanatory note. Do not
fabricate scores.
2. scripts/resolvers/preamble/generate-completeness-section.ts updates
the Completeness Principle tail to match. Without this, the preamble
contained two rules (one conditional, one unconditional) and the
model hedged.
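The rule the regenerated preamble encodes can be sketched as follows. This is an illustrative helper, not the actual generator code; the names (`QuestionKind`, `AskOption`, `formatRecommendation`) are assumptions made for the sketch:

```typescript
// Hypothetical sketch of the conditional-scoring rule. The real logic lives in
// the prose template emitted by generate-ask-user-format.ts, not in code.
type QuestionKind = "coverage" | "kind";

interface AskOption {
  label: string;          // e.g. "A"
  summary: string;
  completeness?: number;  // 1-10; only meaningful for coverage-differentiated questions
}

function formatRecommendation(
  kind: QuestionKind,
  options: AskOption[],
  pick: string,
  reason: string
): string {
  const lines: string[] = [];
  if (kind === "coverage") {
    // Coverage-differentiated: every option gets a real score on its own line.
    for (const o of options) {
      if (o.completeness == null) {
        // Never fabricate a filler score; a missing score is a caller bug.
        throw new Error(`coverage option ${o.label} needs a Completeness score`);
      }
      lines.push(`${o.label}) ${o.summary}`);
      lines.push(`Completeness: ${o.completeness}/10`);
    }
  } else {
    // Kind-differentiated: the completeness axis does not apply; say so once.
    for (const o of options) lines.push(`${o.label}) ${o.summary}`);
    lines.push("Note: options differ in kind, not coverage — no completeness score.");
  }
  // RECOMMENDATION is unconditional, emitted for every question kind.
  lines.push(`RECOMMENDATION: Choose [${pick}] because ${reason}`);
  return lines.join("\n");
}
```

The point of the split is visible in the two branches: the RECOMMENDATION line is outside the conditional, while the score is inside it.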
Template anchors reinforce the distinction where agent judgment is most
likely to drift:
- plan-ceo-review Section 0C-bis (approach menu) gets the
coverage-differentiated anchor.
- plan-ceo-review Section 0F (mode selection) gets the kind-differentiated
anchor.
- plan-eng-review CRITICAL RULE section gets the coverage-vs-kind rule
for every per-issue AskUserQuestion raised during the review.
Regenerated SKILL.md for all T2 skills and refreshed the golden fixtures.
Every skill using the T2 preamble now carries the same conditional scoring
rule.
Verified via new periodic-tier eval (test/skill-e2e-plan-format.test.ts):
all 4 cases fail on prior behavior, all 4 pass with this fix.
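The shape of the eval's assertions might look like the sketch below. This is a guess at the checking logic, not the contents of test/skill-e2e-plan-format.test.ts; the function name and failure strings are invented for illustration:

```typescript
// Hypothetical checker mirroring the two rules the eval enforces:
// RECOMMENDATION always present, Completeness scores only on coverage questions.
function checkAskUserOutput(text: string, coverageDifferentiated: boolean): string[] {
  const failures: string[] = [];
  // Rule 1: the RECOMMENDATION line is required on every question.
  if (!/^RECOMMENDATION: Choose \[[A-Z]\] because .+/m.test(text)) {
    failures.push("missing RECOMMENDATION line");
  }
  const hasScores = /Completeness: \d{1,2}\/10/.test(text);
  // Rule 2: scores must appear exactly when the options differ in coverage.
  if (coverageDifferentiated && !hasScores) {
    failures.push("coverage question missing Completeness scores");
  }
  if (!coverageDifferentiated && hasScores) {
    failures.push("kind question fabricated Completeness scores");
  }
  return failures;
}
```

Under this sketch, the prior behavior (filler 10/10 scores on a mode-selection question, or a dropped RECOMMENDATION line) produces a non-empty failure list, while the fixed format passes both branches.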
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -6,8 +6,9 @@ export function generateAskUserFormat(_ctx: TemplateContext): string {
 **ALWAYS follow this structure for every AskUserQuestion call:**
 
 1. **Re-ground:** State the project, the current branch (use the \`_BRANCH\` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called.
-3. **Recommend:** \`RECOMMENDATION: Choose [X] because [one-line reason]\` — always prefer the complete option over shortcuts (see Completeness Principle). Include \`Completeness: X/10\` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it.
-4. **Options:** Lettered options: \`A) ... B) ... C) ...\` — when an option involves effort, show both scales: \`(human: ~X / CC: ~Y)\`
+3. **Recommend (ALWAYS):** Every question ends with \`RECOMMENDATION: Choose [X] because [one-line reason]\`. Never omit this line. It is required for every AskUserQuestion, regardless of whether the options are coverage-differentiated or different-in-kind.
+4. **Score completeness (when meaningful):** When options differ in coverage (e.g. full test coverage vs happy path vs shortcut, complete error handling vs partial), score each with \`Completeness: N/10\` on its own line. Calibration: 10 = complete (all edge cases, full coverage), 7 = happy path only, 3 = shortcut. Flag any option ≤5 where a higher-completeness option exists. When options differ in kind (picking a review posture, picking an architectural approach, cherry-pick Add/Defer/Skip, choosing between two different kinds of systems), the completeness axis doesn't apply — skip \`Completeness: N/10\` entirely and write one line: \`Note: options differ in kind, not coverage — no completeness score.\` Do not fabricate filler scores.
+5. **Options:** Lettered options: \`A) ... B) ... C) ...\` — when an option involves effort, show both scales: \`(human: ~X / CC: ~Y)\`
 
 Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex.
@@ -14,6 +14,6 @@ AI makes completeness near-free. Always recommend the complete option over short
 | Feature | 1 week | 30 min | ~30x |
 | Bug fix | 4 hours | 15 min | ~20x |
 
-Include \`Completeness: X/10\` for each option (10=all edge cases, 7=happy path, 3=shortcut).`;
+When options differ in coverage (e.g. full vs happy-path vs shortcut), include \`Completeness: X/10\` on each option (10 = all edge cases, 7 = happy path, 3 = shortcut). When options differ in kind (mode posture, architectural choice, cherry-pick A/B/C where each is a different kind of thing, not a more-or-less-complete version of the same thing), skip the score and write one line explaining why: \`Note: options differ in kind, not coverage — no completeness score.\` Do not fabricate scores.`;
 
 }