fix: plan-design-review-audit eval — bump turns to 30, add efficiency hints

The test was flaky at 20 turns because the agent has to read a 300-line SKILL.md, navigate the site, extract design data, and write a report. Added prompt hints to skip the preamble, batch commands, and write the report early, while still testing the real SKILL.md. The run now completes in ~13 turns consistently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-07-11 18:26:36 +02:00 · 2026-03-17 14:41:13 -07:00
parent 7fed990f9e
commit 64bbbb2198
1 changed file with 4 additions and 2 deletions
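For context, the change amounts to appending an efficiency hint to the eval prompt and raising the turn budget. The following is a minimal TypeScript sketch of that idea, not the harness's actual code: `EvalOptions`, `buildAuditPrompt`, `turnHeadroom`, and the shortened prompt strings are hypothetical names and values introduced only for illustration. Only `maxTurns: 30`, `timeout: 360_000`, and the ~13-turn figure come from the commit.

```typescript
// Hypothetical sketch: these names are illustrative, not the real eval harness API.
interface EvalOptions {
  prompt: string;
  maxTurns: number;
  timeout: number; // milliseconds
}

// Shortened stand-in for the hint text added in the diff.
const EFFICIENCY_HINT =
  "EFFICIENCY: Skip the preamble bash block. " +
  "Combine multiple browse commands into single bash blocks. " +
  "Write the report as soon as you have enough data.";

// Append the hint as its own paragraph so the base task text stays unchanged.
function buildAuditPrompt(basePrompt: string): string {
  return `${basePrompt}\n\n${EFFICIENCY_HINT}`;
}

// Ratio of the turn budget to the turns a typical run actually uses.
function turnHeadroom(maxTurns: number, typicalTurns: number): number {
  if (typicalTurns <= 0) {
    throw new RangeError("typicalTurns must be positive");
  }
  return maxTurns / typicalTurns;
}

const options: EvalOptions = {
  prompt: buildAuditPrompt("Review the site. Write your audit report to ./design-audit.md."),
  maxTurns: 30,
  timeout: 360_000,
};

// With ~13 turns observed (per the commit message), a 30-turn budget
// leaves roughly 2.3x headroom for slower runs.
console.log(turnHeadroom(options.maxTurns, 13).toFixed(1)); // prints "2.3"
```

Keeping the hint in a separate paragraph means the base task instructions stay byte-for-byte identical, so the eval still exercises the real SKILL.md workflow while the budget absorbs occasional slow runs.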
@@ -2128,9 +2128,11 @@ B="${browseBin}"

 Read plan-design-review/SKILL.md for the design review workflow.

-Review the site at ${testServer.url}. Use --quick mode (homepage + 2 pages). Skip any AskUserQuestion calls — this is non-interactive. Write your audit report to ./design-audit.md. Do not offer to create DESIGN.md.`,
+Review the site at ${testServer.url}. Use --quick mode (homepage + 2 pages). Skip any AskUserQuestion calls — this is non-interactive. Write your audit report to ./design-audit.md. Do not offer to create DESIGN.md.
+
+EFFICIENCY: Skip the preamble bash block. Combine multiple browse commands into single bash blocks (e.g. run all Phase 2 JS extractions in one block). Write the report as soon as you have enough data — do not over-explore.`,
      workingDirectory: reviewDir,
-      maxTurns: 20,
+      maxTurns: 30,
      timeout: 360_000,
      testName: 'plan-design-review-audit',
      runId,