v1.10.0.0: fix AskUserQuestion cadence + Pros/Cons format upgrade (#1178)

* fix(preamble): reorder AskUserQuestion Format above model overlay + rewrite Opus 4.7 pacing directive

Root cause of plan-review regression (v1.6.4.0): model overlays rendered
ABOVE the pacing rule in every SKILL.md, so Opus 4.7 read "Batch your
questions" first and absorbed it as the ambient default. The overlay's
claimed subordination ("skill wins on pacing, always") didn't stick —
literal-interpretation mode reads physical order, not claimed hierarchy.

Part 1 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md):

scripts/resolvers/preamble.ts
- Move generateAskUserFormat above generateModelOverlay in section array
- Comment explains why — prevents future refactors from silently reverting

model-overlays/opus-4-7.md
- Replace "Batch your questions" block with "Pace questions to the skill"
- New wording makes one-question-per-turn the default when the skill
  contains STOP directives; batching becomes the explicit exception

Regenerated 30 SKILL.md files via bun run gen:skill-docs.

Verified:
- With --model opus-4-7: Format renders at line 359, Model-Specific
  Patch at 373, "Pace questions" at 419 (Format comes first, overlay
  second, pacing directive intact).
- bun test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plan-reviews): tighten STOP/escape-hatch directives across 4 templates

Part 2 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).

Codex caught that v1.6.3.0's reasoning collapsed on Opus 4.7: the old
escape-hatch wording ("If no issues or fix is obvious, state what
you'll do and move on — don't waste a question") let the literal
interpreter classify every finding as having an "obvious fix" and skip
AskUserQuestion entirely. Reviews became reports.

Per-template hardening (16 sites total, verified by rg):

plan-ceo-review/SKILL.md.tmpl (13 sites):
- 12 inline STOP directives: replace the full escape-hatch clause with
  "zero findings → say so and proceed; findings → MUST call AskUserQuestion
  as a tool_use, including for obvious fixes."
- 1 Escape hatch bullet in CRITICAL RULE section: tightened.

plan-eng-review, plan-design-review, plan-devex-review (1 site each):
- Each template's Escape hatch bullet tightened to match the new CEO wording,
  adapted for each review's domain (issue/gap, decision/design/DX alternatives).

After regeneration: rg "don't waste a question" returns 0 across all
*SKILL.md.tmpl and *SKILL.md files. "zero findings, state" wording
present 16 times (matches prior count of escape-hatch sites).

bun test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(preamble): upgrade AskUserQuestion format to Pros/Cons decision brief

Part 4 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).

Every AskUserQuestion now renders as a decision brief, not a bullet list:
D-numbered header, ELI10, Stakes-if-we-pick-wrong, Recommendation, Pros/Cons
with / markers per option, closing Net: tradeoff synthesis.

scripts/resolvers/preamble/generate-ask-user-format.ts
- Full rewrite. Preserves prior rules (Re-ground, ELI10, Recommend,
  Completeness, Options) and adds:
  - D-numbering per skill invocation (model-level, not runtime state)
  - Stakes line (pain avoided / capability unlocked / consequence named)
  - Pros/Cons block with min 2  + 1  per option, min 40 chars/bullet
  - Hard-stop escape: " No cons — this is a hard-stop choice" for
    genuine one-sided choices (destructive-action confirmations)
  - Neutral-posture handling (CT1-compliant): (recommended) label
    STAYS on default option to preserve AUTO_DECIDE contract; neutrality
    expressed as prose in Recommendation line only
  - Net line closes the decision with a one-sentence tradeoff frame
  - Rule 11: tool_use mandate (prose "Question:" blocks don't count)
  - Self-check list before emitting

test/skill-validation.test.ts
- Update format assertions to check for new Pros/Cons tokens
  (Pros / cons:, Recommendation: <choice>, Net:, ELI10, Stakes if we
  pick wrong:, , ) across all tier-2+ skills
- Old "RECOMMENDATION: Choose" expectation removed (the new format uses
  mixed-case "Recommendation:" with no literal "Choose")

test/skill-e2e-plan-format.test.ts
- Add v1.7.0.0 format token regexes (PROS_CONS_HEADER_RE, PRO_BULLET_RE,
  CON_BULLET_RE, NET_LINE_RE, D_NUMBER_RE, STAKES_RE)
- Existing RECOMMENDATION_RE loosened to accept mixed-case "Recommendation:"
  (canonical v1.7.0.0 form) alongside all-caps (legacy). Tests are
  additive — the strict new-format gate is the upcoming cadence eval.

Regenerated 30 SKILL.md files via bun run gen:skill-docs.

Verified:
- bun test: 319 pass (1 pre-existing security-bench fixture oversize
  failure on main, unrelated — confirmed via git stash test on main HEAD)
- New format tokens render in all tier-2+ skills (plan-ceo-review,
  plan-eng-review, ship, office-hours verified)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format

Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).

Gate-tier (E1, free, runs on every `bun test`):

test/preamble-compose.test.ts — pins the composition order
  Asserts AskUserQuestion Format section renders BEFORE Model-Specific
  Behavioral Patch in tier-≥2 preamble output. Covers claude default,
  opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit
  to scripts/resolvers/preamble.ts that silently reverts the order.

test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract
  14 assertions against generateAskUserFormat output: D<N>, ELI10,
  Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:,
  / markers, min 2 pros + 1 con rules, hard-stop escape exact
  phrase, neutral-posture CT1 rule ((recommended) label preserved for
  AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate
  (rule 11), self-check list, D-numbering model-level caveat.

test/model-overlay-opus-4-7.test.ts — pins the pacing directive
  Asserts raw overlay file + resolved overlay output contain "Pace
  questions to the skill" and NOT "Batch your questions". Verifies
  INHERIT:claude chain still works (Todo-list, subordination wrapper),
  Fan out / Effort-match / Literal interpretation nudges preserved.
  Also asserts claude base overlay does NOT carry the Opus-specific
  pacing directive (no cross-contamination).

Periodic-tier (E2, Opus-dependent, ~$1-2/run):

test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness
  1. Format positive — every token present when plan has real tradeoff
  2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to
     "No cons — hard-stop choice" escape
  3. Neutral-posture NEGATIVE — plan where one option dominates must emit
     (recommended) label + "because <reason>", must NOT dodge to
     "taste call" / "no preference"
  4. Hard-stop POSITIVE — destructive-action plan may legitimately use
     the hard-stop escape

test/helpers/touchfiles.ts — entries for all new eval cases
  Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and
  the 4 plan-review templates. Diff-based selection triggers the evals
  whenever those files change. Also added entries for 7 expanded-coverage
  cases (ship, office-hours, investigate, qa, review, design-review,
  document-release) — test cases will land in follow-up PRs per skill.

Follow-ups noted in test file header:
- True multi-turn cadence eval (3 findings → 3 distinct asks) — current
  harness captures one $OUT_FILE per session; multi-turn capture needs
  new harness support.
- Expanded-coverage test cases for the 7 non-plan-review skills.

Verified:
- bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench
  oversize failure on main (unrelated, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: regenerate golden fixtures + update ELI10 phrase check for v1.7.0.0

Pros/Cons format rewrite (6b99df9d) changed the resolver output across all
tier-2+ SKILL.md files. Three golden-file regression tests in
test/host-config.test.ts and one phrase-check test in test/gen-skill-docs.test.ts
were failing as expected.

- test/fixtures/golden/claude-ship-SKILL.md
- test/fixtures/golden/codex-ship-SKILL.md
- test/fixtures/golden/factory-ship-SKILL.md
  Regenerated via `bun run gen:skill-docs --host all` + cp into fixtures.

- test/gen-skill-docs.test.ts line 244: rename test from "ELI16 simplification
  rules" to "ELI10 simplification rules" and match the new phrase pattern.
  v1.7.0.0 uses "ELI10 (ALWAYS)" rather than legacy "Simplify (ELI10, ALWAYS)".

bun test: 744 pass, 1 fail (pre-existing security-bench fixture oversize,
unrelated to this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.7.0.0: plan reviews walk you through each issue with Pros/Cons

Restores AskUserQuestion cadence on Opus 4.7 (v1.6.4.0 regression) and
upgrades the format to a numbered decision brief — D<N> header, ELI10,
Stakes, Recommendation, per-option / bullets, Net: closing line.

Fix: composition reorder + overlay rewrite + 16-site escape-hatch hardening
across the 4 plan-review templates.
Feature: Pros/Cons format in the preamble resolver, inherited by every
tier-2+ skill automatically.

30 new gate-tier unit tests pin the format contract (runs in <100ms, $0).
4 new periodic-tier eval cases defend against escape-hatch abuse
(2 positive, 2 negative). Golden fixtures regenerated.

CEO + Eng + Codex reviews completed. 5 of 8 Codex findings incorporated;
CT2 (16 sites, not 31) and CT1 (AUTO_DECIDE contract break) were
load-bearing catches the primary reviews missed.

bun test: 774 pass, 1 fail (pre-existing security-bench oversize, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.10.0.0: bump VERSION (was v1.7.0.0, align with branch discipline)

Per user direction — jumping to 1.10.0.0 for versioning alignment.
No functional changes from the prior ship commit (5f038ab7). The
regression fix + Pros/Cons format are identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-23 18:25:34 -07:00
committed by GitHub
parent 9dbaf906cf
commit a81be53621
51 changed files with 5500 additions and 523 deletions
+129 -14
View File
@@ -349,6 +349,135 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
- Focus on completing the task and reporting results via prose output.
- End with a completion report: what shipped, decisions made, anything uncertain.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. Every element is non-skippable. If you find yourself about to skip any of them, stop and back up.**
### Required shape
Every AskUserQuestion reads like a decision brief, not a bullet list:
```
D<N> — <one-line question title>
ELI10: <plain English a 16-year-old could follow, 2-4 sentences, name the stakes>
Stakes if we pick wrong: <one sentence on what breaks, what user sees, what's lost>
Recommendation: <choice> because <one-line reason>
Completeness: A=X/10, B=Y/10 (or: Note: options differ in kind, not coverage — no completeness score)
Pros / cons:
A) <option label> (recommended)
✅ <pro — concrete, observable, ≥40 chars>
✅ <pro>
❌ <con — honest, ≥40 chars>
B) <option label>
✅ <pro>
❌ <con>
Net: <one-line synthesis of what you're actually trading off>
```
### Element rules
1. **D-numbering.** First question in a skill invocation is `D1`. Increment per
question within the same skill. This is a model-level instruction, not a
runtime counter — you count your own questions. Nested skill invocation
(e.g., `/plan-ceo-review` running `/office-hours` inline) starts its own
D1; label as `D1 (office-hours)` to disambiguate when the user will see
both. Drift is expected over long sessions; minor inconsistency is fine.
2. **Re-ground.** Before ELI10, state the project, current branch (use the
`_BRANCH` value from the preamble, NOT conversation history or gitStatus),
and the current plan/task. 1-2 sentences. Assume the user hasn't looked at
this window in 20 minutes.
3. **ELI10 (ALWAYS).** Explain in plain English a smart 16-year-old could
follow. Concrete examples and analogies, not function names. Say what it
DOES, not what it's called. This is not preamble — the user is about to
make a decision and needs context. Even in terse mode, emit the ELI10.
4. **Stakes if we pick wrong (ALWAYS).** One sentence naming what breaks in
concrete terms (pain avoided / capability unlocked / consequence named).
"Users see a 3-second spinner" beats "performance may degrade." Forces
the trade-off to be real.
5. **Recommendation (ALWAYS).** `Recommendation: <choice> because <one-line
reason>` on its own line. Never omit it. Required for every AskUserQuestion,
even when neutral-posture (see rule 8). The `(recommended)` label on the
option is REQUIRED — `scripts/resolvers/question-tuning.ts` reads it to
power the AUTO_DECIDE path. Omitting it breaks auto-decide.
6. **Completeness scoring (when meaningful).** When options differ in
coverage (full test coverage vs happy path vs shortcut, complete error
handling vs partial), score each `Completeness: N/10` on its own line.
Calibration: 10 = complete, 7 = happy path only, 3 = shortcut. Flag any
option ≤5 where a higher-completeness option exists. When options differ
in kind (review posture, architectural A-vs-B, cherry-pick Add/Defer/Skip,
two different kinds of systems), SKIP the score and write one line:
`Note: options differ in kind, not coverage — no completeness score.`
Do NOT fabricate filler scores — empty 10/10 on every option is worse
than no score.
7. **Pros / cons block.** Every option gets per-bullet ✅ (pro) and ❌ (con)
markers. Rules:
- **Minimum 2 pros and 1 con per option.** If you can't name a con for
the recommended option, the recommendation is hollow — go find one. If
you can't name a pro for the rejected option, the question isn't real.
- **Minimum 40 characters per bullet.** `✅ Simple` is not a pro. `
Reuses the YAML frontmatter format already in MEMORY.md, zero new
parser` is a pro. Concrete, observable, specific.
- **Hard-stop escape** for genuinely one-sided choices (destructive-action
confirmation, one-way doors): a single bullet `✅ No cons — this is a
hard-stop choice` satisfies the rule. Use sparingly; overuse flips a
decision brief into theater.
8. **Net line (ALWAYS).** Closes the decision with a one-sentence synthesis
of what the user is actually trading off. From the reference screenshot:
*"The new-format case is speculative. The copy-format case is immediate
leverage. Copy now, evolve later if a real pattern emerges."* Not a
summary — a verdict frame.
9. **Neutral-posture handling.** When the skill explicitly says "neutral
recommendation posture" (SELECTIVE EXPANSION cherry-picks, taste calls,
kind-differentiated choices where neither side dominates), the
Recommendation line reads: `Recommendation: <default-choice> — this is a
taste call, no strong preference either way`. The `(recommended)` label
STAYS on the default option (machine-readable hint for AUTO_DECIDE). The
`— this is a taste call` prose is the human-readable neutrality signal.
Both coexist.
10. **Effort both-scales.** When an option involves effort, show both human
and CC scales: `(human: ~2 days / CC: ~15 min)`.
11. **Tool_use, not prose.** A markdown block labeled `Question:` is not a
question — the user never sees it as interactive. If you wrote one in
prose, stop and reissue as an actual AskUserQuestion tool_use. The rich
markdown goes in the question body; the `options` array stays short
labels (A, B, C).
### Self-check before emitting
Before calling AskUserQuestion, verify:
- [ ] D<N> header present
- [ ] ELI10 paragraph present (stakes line too)
- [ ] Recommendation line present with concrete reason
- [ ] Completeness scored (coverage) OR kind-note present (kind)
- [ ] Every option has ≥2 ✅ and ≥1 ❌, each ≥40 chars (or hard-stop escape)
- [ ] (recommended) label on one option (even for neutral-posture — see rule 9)
- [ ] Net line closes the decision
- [ ] You are calling the tool, not writing prose
If you'd need to read the source to understand your own explanation, it's
too complex — simplify before emitting.
Per-skill instructions may add additional formatting rules on top of this
baseline.
## GBrain Sync (skill start)
```bash
@@ -561,20 +690,6 @@ are shown, synthesize a one-paragraph welcome briefing before proceeding:
"Welcome back to {branch}. Last session: /{skill} ({outcome}). [Checkpoint summary if
available]. [Health score if available]." Keep it to 2-3 sentences.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call. All four elements are non-skippable. If you find yourself about to skip any of them, stop and back up.**
1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
2. **Simplify (ELI10, ALWAYS):** Explain what's happening in plain English a smart 16-year-old could follow. Concrete examples and analogies, not function names or internal jargon. Say what it DOES, not what it's called. State the stakes: what breaks if we pick wrong. This is NOT optional verbosity and it is NOT preamble — the user is about to make a decision and needs context. Even if you'd normally stay terse, emit the ELI10 paragraph. The user will ask for it anyway; do it the first time.
3. **Recommend (ALWAYS):** Every question ends with `RECOMMENDATION: Choose [X] because [one-line reason]` on its own line. Never omit it. Never collapse it into the options list. Required for every AskUserQuestion, regardless of whether the options are coverage-differentiated or different-in-kind.
4. **Score completeness (when meaningful):** When options differ in coverage (e.g. full test coverage vs happy path vs shortcut, complete error handling vs partial), score each with `Completeness: N/10` on its own line. Calibration: 10 = complete (all edge cases, full coverage), 7 = happy path only, 3 = shortcut. Flag any option ≤5 where a higher-completeness option exists. When options differ in kind (picking a review posture, picking an architectural approach, cherry-pick Add/Defer/Skip, choosing between two different kinds of systems), the completeness axis doesn't apply — skip `Completeness: N/10` entirely and write one line: `Note: options differ in kind, not coverage — no completeness score.` Do not fabricate filler scores.
5. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`
Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Writing Style (skip entirely if `EXPLAIN_LEVEL: terse` appears in the preamble echo OR the user's current message explicitly requests terse / no-explanations output)
These rules apply to every AskUserQuestion, every response you write to the user, and every review finding. They compose with the AskUserQuestion Format section above: Format = *how* a question is structured; Writing Style = *the prose quality of the content inside it*.