From ac032ce020c7558a67b18d010ba95bc18167b5cd Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Fri, 1 May 2026 14:29:59 -0700
Subject: [PATCH] test(judge): fix two false-fail patterns surfaced by Opus 4.7 captures
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

COMPLETENESS_RE updated to match the option-prefixed form
`Completeness: A=10/10, B=7/10` documented in
scripts/resolvers/preamble/generate-ask-user-format.ts. The legacy regex
required a bare digit immediately after `Completeness: `, which Opus 4.7
correctly does not produce; the spec form names each option.

judgeRecommendation.commits no longer scans the entire recommendation
body for hedging keywords; it scans only the choice portion (text before
the "because" token). The because-clause is the reason and routinely
contains phrases like "the plan doesn't yet depend on Redis", which is
legitimate technical language that the body-wide regex was flagging as
hedging. Restricting the check to the choice portion keeps the intent
("Either A or B because..." flagged; "A because depends on X" accepted)
without false positives.

Verified by re-reading the captured AUQs from the failing periodic run:
both Coverage tests had spec-correct `Completeness: A=10/10, B=7/10`
strings; the Kind test had a substantive recommendation whose
because-clause mentioned "depend on Redis" as part of the reasoning, not
the choice.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 test/helpers/llm-judge.ts          | 16 +++++++++++-----
 test/skill-e2e-plan-format.test.ts |  6 +++++-
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/test/helpers/llm-judge.ts b/test/helpers/llm-judge.ts
index 5647cb31..af19cd34 100644
--- a/test/helpers/llm-judge.ts
+++ b/test/helpers/llm-judge.ts
@@ -236,16 +236,22 @@ export async function judgeRecommendation(askUserText: string): Promise because `, where is the bare
 // option label. judgeRecommendation.present covers the canonical shape.
-const COMPLETENESS_RE = /Completeness:\s*\d{1,2}\/10/;
+// COMPLETENESS regex matches both legacy bare form (`Completeness: 10/10`) and
+// the canonical option-prefixed form (`Completeness: A=10/10, B=7/10`) per
+// scripts/resolvers/preamble/generate-ask-user-format.ts. The optional
+// `[A-Z]=` prefix tolerates either shape; both are acceptable spec output.
+const COMPLETENESS_RE = /Completeness:\s*(?:[A-Z]=)?\d{1,2}\/10/;
 const KIND_NOTE_RE = /options differ in kind/i;
 // v1.7.0.0 Pros/Cons format tokens. Tests are additive: existing