Commit Graph

2 Commits

Author SHA1 Message Date
Garry Tan 76829b76dc feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction
Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
(e.g. literal-interpretation 'fix the failing tests' needs write access).
Per-fixture maxTurns lets harder prompts run longer without changing the
default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
- bashToolCallCount — count of Bash tool_use across the session
- turnsToCompletion — SDK-reported num_turns at result
- uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
and fixture.maxTurns through runArm.
2026-04-23 11:07:37 -07:00
Garry Tan e432f4bd94 test(eval): paid periodic overlay-efficacy harness
`test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES,
runs two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with
bounded concurrency. Arms use SDK preset `claude_code` so both include
the real Claude Code system prompt; overlay-ON appends the resolved
overlay text. Saves per-trial raw event streams to
`~/.gstack/projects/<slug>/transcripts/` for forensic recovery.

Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).
2026-04-23 09:13:58 -07:00