Three gate-tier E2E tests detect when preamble / template changes
flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or
/office-hours (startup Q3, builder mode). The V1 regression that this
PR fixes shipped without anyone catching it at ship time — this is the
ongoing signal so the same thing doesn't happen again.
Pieces:
- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet
judge with mode-specific dual-axis rubric (expansion: surface_framing
+ decision_preservation; forcing: stacking_preserved +
domain_matched_consequence; builder: unexpected_combinations +
excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/` — deterministic input
for expansion proposal generation, Q3 forcing question, and builder
adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to
`test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge:
Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with
`office-hours-forcing-energy` + `office-hours-builder-wildness`
cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts` — all three as
`gate` tier in `E2E_TIERS`, triggered by changes to
`scripts/resolvers/preamble.ts`, the relevant skill template, the
judge helper, or any mode-posture fixture.
Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus
generator for the plan-ceo-review case dominates.
Known V1.1 tradeoff: judges test prose markers more than deep behavior.
V1.2 candidate is a cross-provider (Codex) adversarial judge on the
same output to decouple house-style bias.