gstack/test
Garry Tan 14b1ba07e9 test: wire judgeRecommendation into plan-format E2E with threshold >= 4
All four plan-format cases (CEO mode, CEO approach, eng coverage, eng kind)
now run the judge after the existing regex assertions. Threshold reason_substance
>= 4 catches both boilerplate ("because it's better") and generic ("because
it's faster") tier reasoning — exactly the failure modes the regex couldn't.
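
The threshold gate described above could be sketched roughly as follows. Only the field name reason_substance and the value 4 come from this commit message; the JudgeScores interface, the constant, and meetsThreshold are hypothetical names used for illustration, and the real judgeRecommendation return shape may differ.

```typescript
// Hypothetical shape: only the reason_substance field name is taken from
// the commit message. The real judge output may carry more fields.
interface JudgeScores {
  reason_substance: number;
}

// The commit message states the threshold as >= 4.
const REASON_SUBSTANCE_THRESHOLD = 4;

// Returns true when the judge rated the stated reason as substantive enough.
function meetsThreshold(
  scores: JudgeScores,
  threshold: number = REASON_SUBSTANCE_THRESHOLD,
): boolean {
  return scores.reason_substance >= threshold;
}
```

In a test, the result of such a check would be asserted after the regex assertions, so that a plan with well-formed structure but a low-scoring reason still fails.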

Move recordE2E to after the judge call so judge_scores and judge_reasoning
land in the eval-store JSON for diagnostics. Booleans are encoded as 0/1 to
fit the Record<string, number> shape EvalTestEntry.judge_scores expects.
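
A minimal sketch of the 0/1 encoding, assuming the judge returns a mix of numeric and boolean fields. The Record<string, number> target type comes from this commit message; RawJudgeScores, toNumericScores, and the example field names are hypothetical.

```typescript
// Hypothetical input shape: judge output mixing numeric scores and boolean flags.
type RawJudgeScores = Record<string, number | boolean>;

// Converts booleans to 0/1 and passes numbers through unchanged, so the
// result fits a Record<string, number> field such as judge_scores.
function toNumericScores(raw: RawJudgeScores): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [key, value] of Object.entries(raw)) {
    out[key] = typeof value === "boolean" ? (value ? 1 : 0) : value;
  }
  return out;
}

// Example with made-up field names other than reason_substance.
const encoded = toNumericScores({
  reason_substance: 4,
  has_recommendation: true,
  is_boilerplate: false,
});
```

Encoding booleans as numbers keeps the stored entry to a single value type, at the cost of readers needing to know which keys are flags.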

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:18:06 -07:00