feat(codex+review): require synthesis Recommendation in cross-model skills

Extends the v1.25.1.0 AskUserQuestion recommendation-quality coverage to the cross-model synthesis surfaces that were previously emitting prose without a structured recommendation: - /codex review (Step 2A) — after presenting Codex output + GATE verdict, must emit `Recommendation: <action> because <reason>` line. Reason must compare against alternatives (other findings, fix-vs-ship, fix-order). - /codex challenge (Step 2B) — same requirement after adversarial output. - /codex consult (Step 2C) — same requirement after consult presentation, with examples for plan-review consults that engage with specific Codex insights. - Claude adversarial subagent (scripts/resolvers/review.ts:446, used by /ship Step 11 + standalone /review) — subagent prompt now ends with "After listing findings, end your output with ONE line in the canonical format Recommendation: <action> because <reason>". Codex adversarial command (line 461) gets the same final-line requirement. The same `judgeRecommendation` helper grades both AskUserQuestion and cross-model synthesis — one rubric, two surfaces. Substance-5 cross-model recommendations explicitly compare against alternatives (a different finding, fix-vs-ship, fix-order). Generic synthesis ("because adversarial review found things") fails at threshold ≥ 4. Tests: - test/llm-judge-recommendation.test.ts gains 5 cross-model fixtures (3 substance ≥ 4, 2 substance < 4). Existing rubric correctly grades them. - test/skill-cross-model-recommendation-emit.test.ts (new, free-tier) — static guard greps codex/SKILL.md.tmpl + scripts/resolvers/review.ts for the canonical emit instruction. Trips before any paid eval if the templates drift. Touchfile: extended `llm-judge-recommendation` entry with codex/SKILL.md.tmpl and scripts/resolvers/review.ts so synthesis-template edits invalidate the fixture re-run. Verified: free `bun test` exits 0 (5/5 static emit-guard tests pass), paid fixture passes 45/45 expect calls in 24s with the cross-model substance-5 fixtures correctly judged at >= 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:45:35 +02:00 · 2026-05-01 19:38:12 -07:00
parent 336c2cfe4a
commit 4ab0269729
9 changed files with 232 additions and 8 deletions
@@ -443,7 +443,7 @@ If \`OLD_CFG\` is \`disabled\`: skip Codex passes only. Claude adversarial subag
 Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to.

 Subagent prompt:
-"Read the diff for this branch with \`git diff origin/<base>\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)."
+"Read the diff for this branch with \`git diff origin/<base>\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment). After listing findings, end your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\` — examples: \`Recommendation: Fix the unbounded retry at queue.ts:78 because it'll DoS the worker pool under sustained 429s\` or \`Recommendation: Ship as-is because the strongest finding is a theoretical race that requires conditions we can't trigger in production\`. The reason must point to a specific finding (or no-fix rationale). Generic reasons like 'because it's safer' do not qualify."

 Present findings under an \`ADVERSARIAL REVIEW (Claude subagent):\` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational.

@@ -458,7 +458,7 @@ If Codex is available AND \`OLD_CFG\` is NOT \`disabled\`:
 \`\`\`bash
 TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
 _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
-codex exec "${CODEX_BOUNDARY}Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_ADV"
+codex exec "${CODEX_BOUNDARY}Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems. End your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\`. Generic reasons like 'because it's safer' do not qualify; the reason must point to a specific finding or no-fix rationale." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_ADV"
 \`\`\`

 Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. After the command completes, read stderr: