Files
gstack/test/helpers
Garry Tan 6a46b9099f test: add AskUserQuestion format regression eval for plan reviews
Four-case periodic-tier eval that captures the verbatim AskUserQuestion
text /plan-ceo-review and /plan-eng-review produce, then asserts the
format rule is honored: RECOMMENDATION always, Completeness: N/10 only
on coverage-differentiated options, and an explicit "options differ in
kind" note on kind-differentiated options.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Classified periodic because behavior depends on Opus non-determinism —
gate-tier would flake and block merges.

Test harness instructs the agent to write its would-be AskUserQuestion
text to $OUT_FILE rather than invoke a real tool (MCP AskUserQuestion
isn't wired in the test subprocess). Regex predicates then validate
the captured content.

Cost: ~$2 per full run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 01:10:35 -07:00
..