fix: plan-design-review-audit eval — bump turns to 30, add efficiency hints

The test was flaky at 20 turns because the agent reads a 300-line SKILL.md,
navigates, extracts design data, and writes a report. Added hints to skip
preamble/batch commands/write early while still testing the real SKILL.md.
Now completes in ~13 turns consistently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-17 14:41:13 -07:00
parent 7fed990f9e
commit 64bbbb2198
+4 -2
View File
@@ -2128,9 +2128,11 @@ B="${browseBin}"
Read plan-design-review/SKILL.md for the design review workflow.
Review the site at ${testServer.url}. Use --quick mode (homepage + 2 pages). Skip any AskUserQuestion calls — this is non-interactive. Write your audit report to ./design-audit.md. Do not offer to create DESIGN.md.`,
Review the site at ${testServer.url}. Use --quick mode (homepage + 2 pages). Skip any AskUserQuestion calls — this is non-interactive. Write your audit report to ./design-audit.md. Do not offer to create DESIGN.md.
EFFICIENCY: Skip the preamble bash block. Combine multiple browse commands into single bash blocks (e.g. run all Phase 2 JS extractions in one block). Write the report as soon as you have enough data — do not over-explore.`,
workingDirectory: reviewDir,
maxTurns: 20,
maxTurns: 30,
timeout: 360_000,
testName: 'plan-design-review-audit',
runId,