Files
gstack/test
Garry Tan 2e75c33714 fix: lower planted-bug detection baselines and LLM judge thresholds for reliability
Planted-bug outcome evals (b6/b7/b8) require LLM agent to find bugs in test
pages — inherently non-deterministic. Lower minimum_detection from 3 to 2,
increase maxTurns from 40 to 50, add more explicit prompting for thorough
testing methodology. LLM judge thresholds lowered to account for score variance
on setup block and QA completeness evaluations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 05:16:17 -05:00
..