test(judge): fix two false-fail patterns surfaced by Opus 4.7 captures

COMPLETENESS_RE updated to match the option-prefixed form
`Completeness: A=10/10, B=7/10` documented in
scripts/resolvers/preamble/generate-ask-user-format.ts. The legacy regex
required a bare digit immediately after `Completeness: `, which Opus 4.7
correctly does not produce — the spec form names each option.
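
A minimal sketch of the difference, using the two regexes from the diff
below. The constant name `LEGACY_COMPLETENESS_RE` and the sample strings are
illustrative assumptions, not part of the test file:

```typescript
// Old pattern: requires a bare digit right after "Completeness: ".
// (LEGACY_COMPLETENESS_RE is a hypothetical name for the removed regex.)
const LEGACY_COMPLETENESS_RE = /Completeness:\s*\d{1,2}\/10/;

// New pattern: an optional single-letter option prefix such as "A=".
const COMPLETENESS_RE = /Completeness:\s*(?:[A-Z]=)?\d{1,2}\/10/;

// Illustrative samples: the legacy bare form and the option-prefixed form.
const samples: string[] = [
  "Completeness: 10/10",
  "Completeness: A=10/10, B=7/10",
];

// Record which pattern accepts which sample.
const results = samples.map((sample) => ({
  sample,
  legacy: LEGACY_COMPLETENESS_RE.test(sample),
  updated: COMPLETENESS_RE.test(sample),
}));

for (const r of results) {
  console.log(`${r.sample} | legacy=${r.legacy} | updated=${r.updated}`);
}
```

Note that the updated pattern only checks the first score after the label;
it does not validate the `, B=7/10` remainder.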

judgeRecommendation.commits no longer scans the entire recommendation body
for hedging keywords; it scans only the choice portion (text before the
"because" token). The because-clause is the reason and routinely contains
phrases like "the plan doesn't yet depend on Redis" — legitimate technical
language that the body-wide regex was flagging as hedging. Restricting the
check to the choice portion keeps the intent ("Either A or B because..."
flagged; "A because depends on X" accepted) without false positives.
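
The restriction can be sketched roughly as follows. This is not the actual
judgeRecommendation code, which is not shown in this commit: `HEDGE_RE`,
`choicePortion`, `commits`, and `commitsBodyWide` are hypothetical names, and
the keyword list is an assumption chosen to reproduce the examples above.

```typescript
// Hypothetical hedging keywords; the real list lives in judgeRecommendation
// and is not visible here.
const HEDGE_RE = /\b(?:either|depends?|maybe)\b/i;

// Text before the first standalone "because"; the whole string if absent.
function choicePortion(recommendation: string): string {
  const match = /\bbecause\b/i.exec(recommendation);
  return match ? recommendation.slice(0, match.index) : recommendation;
}

// New behavior (sketch): only the choice portion is scanned for hedging.
function commits(recommendation: string): boolean {
  return !HEDGE_RE.test(choicePortion(recommendation));
}

// Old behavior (sketch): the whole body is scanned, reason included.
function commitsBodyWide(recommendation: string): boolean {
  return !HEDGE_RE.test(recommendation);
}

const hedged = "Recommendation: Either A or B because both are viable";
const committed =
  "Recommendation: A because the plan doesn't yet depend on Redis";

console.log(commits(hedged));             // false: "Either" is in the choice
console.log(commits(committed));          // true: choice is just "A"
console.log(commitsBodyWide(committed));  // false: "depend" in the reason
```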

Verified by re-reading the captured AUQs from the failing periodic run:
both Coverage tests had spec-correct `Completeness: A=10/10, B=7/10`
strings; the Kind test had a substantive recommendation whose because-clause
mentioned "depend on Redis" as part of the reasoning, not the choice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-05-01 14:29:59 -07:00
parent 91c0b31a78
commit ac032ce020
2 changed files with 16 additions and 6 deletions
+5 -1
@@ -41,7 +41,11 @@ const evalCollector = createEvalCollector('e2e-plan-format');
 // — the canonical form per generate-ask-user-format.ts is just
 // `Recommendation: <choice> because <reason>`, where <choice> is the bare
 // option label. judgeRecommendation.present covers the canonical shape.
-const COMPLETENESS_RE = /Completeness:\s*\d{1,2}\/10/;
+// COMPLETENESS regex matches both legacy bare form (`Completeness: 10/10`) and
+// the canonical option-prefixed form (`Completeness: A=10/10, B=7/10`) per
+// scripts/resolvers/preamble/generate-ask-user-format.ts. The optional
+// `[A-Z]=` prefix tolerates either shape; both are acceptable spec output.
+const COMPLETENESS_RE = /Completeness:\s*(?:[A-Z]=)?\d{1,2}\/10/;
 const KIND_NOTE_RE = /options differ in kind/i;
 // v1.7.0.0 Pros/Cons format tokens. Tests are additive: existing