Commit Graph

3 Commits

Author SHA1 Message Date
Garry Tan d646bc12d8 test(opus-4.7): rewrite scratch-root helper + add afterAll cleanup
First run of the Opus 4.7 eval exposed two test-setup gaps that made
results misleading:

- Only the root gstack SKILL.md was installed. Claude Code auto-discovers
  skills per directory under .claude/skills/{name}/SKILL.md, so without
  individual skill dirs the Skill tool had nothing to route to. Positive
  routing cases all failed.
- `claude -p` does not load SKILL.md content as system context the way
  the Claude Code harness does. The overlay nudges in SKILL.md were
  invisible to the model, so the fanout A/B could not actually differ.

New `mkEvalRoot(suffix, includeOverlay)` helper, modelled on the pattern
in skill-routing-e2e.test.ts:

- Installs per-skill SKILL.md under .claude/skills/ for ~14 key skills
  so the Skill tool has discoverable targets.
- Writes an explicit routing block into project CLAUDE.md.
- When includeOverlay is true, inlines the content of
  model-overlays/opus-4-7.md into CLAUDE.md too. This is what makes the
  fanout A/B observable in `claude -p`: arm ON gets the overlay in
  context, arm OFF does not.
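A minimal sketch of the helper shape described above, assuming Node/Bun fs APIs. The skill names, SKILL.md bodies, and overlay marker below are illustrative stand-ins, not the repo's real content:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Stand-ins for the ~14 real skills installed by the helper.
const SKILLS = ["qa", "deploy", "review"];

function mkEvalRoot(suffix: string, includeOverlay: boolean): string {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), `gstack-eval-${suffix}-`));

  // Per-skill dirs so Claude Code's per-directory auto-discovery
  // gives the Skill tool discoverable targets.
  for (const name of SKILLS) {
    const dir = path.join(root, ".claude", "skills", name);
    fs.mkdirSync(dir, { recursive: true });
    fs.writeFileSync(path.join(dir, "SKILL.md"), `# ${name}\n`);
  }

  // Explicit routing block; when includeOverlay is true, inline the
  // overlay so its nudges are visible to `claude -p` (which does not
  // load SKILL.md as system context on its own).
  let claudeMd = "## Routing\nUse the Skill tool for matching requests.\n";
  if (includeOverlay) {
    claudeMd += "\n<!-- inlined: model-overlays/opus-4-7.md -->\n";
  }
  fs.writeFileSync(path.join(root, "CLAUDE.md"), claudeMd);
  return root;
}
```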

Plus an afterAll that re-runs gen-skill-docs at the default model so
the working tree is not left with opus-4-7-generated SKILL.md files
after the eval finishes (would break golden-file tests in the next
`bun test` run otherwise).
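The cleanup contract can be sketched as a restore stack drained once at teardown; in the real suite the drain would run inside bun:test's afterAll, and the doc-regeneration step itself is the repo's gen-skill-docs script (not shown here):

```typescript
// Restore actions registered during setup, run once at teardown so the
// working tree is left at the default-model state for golden-file tests.
const cleanups: Array<() => void> = [];

function registerCleanup(fn: () => void): void {
  cleanups.push(fn);
}

function runCleanups(): void {
  // In the real suite this body sits inside afterAll(() => { ... }).
  while (cleanups.length > 0) cleanups.pop()!();
}
```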

With this setup in place: routing went from 3/3 FAIL to 3/3 PASS
(correct skill or clarification in every positive case, zero false
positives on negatives). Fanout A/B is now a fair comparison; still
shows 0 parallel in both arms under `claude -p` (tracked as a P0 TODO
for re-measurement inside Claude Code's harness, where fanout may land
differently).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 00:27:53 -07:00
Garry Tan 723f9957f2 test(opus-4.7): tighten ambiguous /qa routing prompt
"does this feature work on mobile? can you check the deploy?" was too
vague — a reasonable agent asks "which feature?" via AskUserQuestion
instead of routing to /qa. That's not a routing miss, it's an under-
specified prompt.

Replaced with "I just pushed the login flow changes. Test the deployed
site and find any bugs." — concrete subject + clear QA verb.

Result: pos-does-it-work went from MISS to OK, routing TP rate 2/3 -> 3/3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 00:27:28 -07:00
Garry Tan 7e90b0f092 test(opus-4.7): E2E eval for fanout rate + routing precision
Closes the measurement gap flagged by the ship-quality review: "zero
tests exercise Opus 4.7 behavior; every skill-e2e hardcodes 4.6."

Two cases, both pinned to claude-opus-4-7:

1. Fanout rate (A/B)
   - Arm A: regen SKILL.md with --model opus-4-7 (overlay ON, includes
     "Fan out explicitly" nudge).
   - Arm B: regen SKILL.md with --model claude (overlay OFF, only
     model-agnostic nudges).
   - Prompt: "Read alpha.txt, beta.txt, gamma.txt. These are independent."
   - Measure: parallel tool calls in first assistant turn.
   - Assert: arm A >= arm B.

2. Routing precision (6-case mini-benchmark)
   - 3 positive prompts that should route (wtf bug, send it, does it work)
   - 3 negative prompts that match keywords but should NOT route
     (syntax question, algorithm question, slack message)
   - Assert: TP rate >= 66%, FP rate <= 33%.
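The fanout metric in case 1 reduces to counting tool calls in the first assistant turn. A sketch, assuming a transcript shaped like Anthropic-style content blocks (the eval's exact transcript format may differ):

```typescript
type ContentBlock = { type: string; name?: string };
type Message = { role: string; content: ContentBlock[] };

// Count tool calls issued in the model's first assistant turn; parallel
// fanout shows up as multiple tool_use blocks in that single turn.
function firstTurnFanout(messages: Message[]): number {
  const first = messages.find((m) => m.role === "assistant");
  if (!first) return 0;
  return first.content.filter((b) => b.type === "tool_use").length;
}
```

The A/B assertion then compares `firstTurnFanout(armA) >= firstTurnFanout(armB)`.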
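Scoring for case 2 is a straightforward TP/FP rate over the six cases; a sketch of the computation backing the thresholds above:

```typescript
type RoutingCase = { positive: boolean; routed: boolean };

// TP rate: routed positives / all positives.
// FP rate: routed negatives / all negatives.
function score(cases: RoutingCase[]): { tpRate: number; fpRate: number } {
  const pos = cases.filter((c) => c.positive);
  const neg = cases.filter((c) => !c.positive);
  return {
    tpRate: pos.filter((c) => c.routed).length / pos.length,
    fpRate: neg.filter((c) => c.routed).length / neg.length,
  };
}
```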

Cost estimate: ~$3-5 per full run. Classified as periodic tier per
CLAUDE.md convention (Opus model, non-deterministic). Runs only with
EVALS=1 env var, touchfile-gated so unrelated diffs don't trigger it.
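The two gates combine as env-opt-in AND touchfile-match; a sketch with an illustrative touchfile list (the repo's real list and gating wiring are not shown here):

```typescript
// Hypothetical touchfiles: paths whose changes should trigger the eval.
const TOUCHFILES = ["model-overlays/opus-4-7.md", "skills/"];

function shouldRunEval(
  env: Record<string, string | undefined>,
  changedFiles: string[],
): boolean {
  if (env.EVALS !== "1") return false; // opt-in only; never runs by default
  // Skip unless some changed file falls under a touchfile prefix.
  return changedFiles.some((f) => TOUCHFILES.some((t) => f.startsWith(t)));
}
```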

Test plan artifact at
~/.gstack/projects/garrytan-gstack/garrytan-feat-opus-4.7-migration-eng-review-test-plan-20260421-230611.md
tracks the full specification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 00:11:38 -07:00