feat(codex+review): require synthesis Recommendation in cross-model skills

Extends the v1.25.1.0 AskUserQuestion recommendation-quality coverage to the
cross-model synthesis surfaces that were previously emitting prose without a
structured recommendation:

- /codex review (Step 2A) — after presenting Codex output + GATE verdict,
  must emit `Recommendation: <action> because <reason>` line. Reason must
  compare against alternatives (other findings, fix-vs-ship, fix-order).
- /codex challenge (Step 2B) — same requirement after adversarial output.
- /codex consult (Step 2C) — same requirement after consult presentation,
  with examples for plan-review consults that engage with specific Codex
  insights.
- Claude adversarial subagent (scripts/resolvers/review.ts:446, used by
  /ship Step 11 + standalone /review) — subagent prompt now ends with
  "After listing findings, end your output with ONE line in the canonical
  format Recommendation: <action> because <reason>". Codex adversarial
  command (line 461) gets the same final-line requirement.

The same `judgeRecommendation` helper grades both AskUserQuestion and
cross-model synthesis — one rubric, two surfaces. Substance-5 cross-model
recommendations explicitly compare against alternatives (a different
finding, fix-vs-ship, fix-order). Generic synthesis ("because adversarial
review found things") fails at threshold ≥ 4.

Tests:
- test/llm-judge-recommendation.test.ts gains 5 cross-model fixtures (3
  substance ≥ 4, 2 substance < 4). Existing rubric correctly grades them.
- test/skill-cross-model-recommendation-emit.test.ts (new, free-tier) —
  static guard greps codex/SKILL.md.tmpl + scripts/resolvers/review.ts for
  the canonical emit instruction. Trips before any paid eval if the
  templates drift.

Touchfile: extended `llm-judge-recommendation` entry with codex/SKILL.md.tmpl
and scripts/resolvers/review.ts so synthesis-template edits invalidate the
fixture re-run.

Verified: free `bun test` exits 0 (5/5 static emit-guard tests pass), paid
fixture passes 45/45 expect calls in 24s with the cross-model substance-5
fixtures correctly judged at >= 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-05-01 19:38:12 -07:00
parent 336c2cfe4a
commit 4ab0269729
9 changed files with 232 additions and 8 deletions
+2 -2
View File
@@ -443,7 +443,7 @@ If \`OLD_CFG\` is \`disabled\`: skip Codex passes only. Claude adversarial subag
Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to.
Subagent prompt:
"Read the diff for this branch with \`git diff origin/<base>\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)."
"Read the diff for this branch with \`git diff origin/<base>\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment). After listing findings, end your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\` — examples: \`Recommendation: Fix the unbounded retry at queue.ts:78 because it'll DoS the worker pool under sustained 429s\` or \`Recommendation: Ship as-is because the strongest finding is a theoretical race that requires conditions we can't trigger in production\`. The reason must point to a specific finding (or no-fix rationale). Generic reasons like 'because it's safer' do not qualify."
Present findings under an \`ADVERSARIAL REVIEW (Claude subagent):\` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational.
@@ -458,7 +458,7 @@ If Codex is available AND \`OLD_CFG\` is NOT \`disabled\`:
\`\`\`bash
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
codex exec "${CODEX_BOUNDARY}Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_ADV"
codex exec "${CODEX_BOUNDARY}Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems. End your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\`. Generic reasons like 'because it's safer' do not qualify; the reason must point to a specific finding or no-fix rationale." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_ADV"
\`\`\`
Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. After the command completes, read stderr: