gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-06-19 00:00:13 +02:00

Garry Tan 7e90b0f092 test(opus-4.7): E2E eval for fanout rate + routing precision

Closes the measurement gap flagged by the ship-quality review: "zero
tests exercise Opus 4.7 behavior; every skill-e2e hardcodes 4.6."

Two cases, both pinned to claude-opus-4-7:

1. Fanout rate (A/B)
   - Arm A: regen SKILL.md with --model opus-4-7 (overlay ON, includes
     "Fan out explicitly" nudge).
   - Arm B: regen SKILL.md with --model claude (overlay OFF, only
     model-agnostic nudges).
   - Prompt: "Read alpha.txt, beta.txt, gamma.txt. These are independent."
   - Measure: parallel tool calls in first assistant turn.
   - Assert: arm A >= arm B.
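
A minimal sketch of how the fanout measurement could be computed, assuming a transcript is available as a list of turns whose content blocks are tagged `text` or `tool_use`. The `Turn` and `ContentBlock` types and the `firstTurnToolCalls` helper are hypothetical; the commit does not show the harness's actual types.

```typescript
// Hypothetical transcript shape; the real eval harness may differ.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string };

interface Turn {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Count tool calls issued in the FIRST assistant turn only.
// Several tool_use blocks in one turn are treated as a parallel fan-out.
function firstTurnToolCalls(transcript: Turn[]): number {
  const first = transcript.find((t) => t.role === "assistant");
  if (!first) return 0;
  return first.content.filter((b) => b.type === "tool_use").length;
}

// Illustrative data only, not measured results.
const userTurn: Turn = {
  role: "user",
  content: [{ type: "text", text: "Read alpha.txt, beta.txt, gamma.txt. These are independent." }],
};

const armA: Turn[] = [
  userTurn,
  {
    role: "assistant",
    content: [
      { type: "tool_use", name: "Read" },
      { type: "tool_use", name: "Read" },
      { type: "tool_use", name: "Read" },
    ],
  },
];

const armB: Turn[] = [
  userTurn,
  { role: "assistant", content: [{ type: "tool_use", name: "Read" }] },
  { role: "assistant", content: [{ type: "tool_use", name: "Read" }] },
];

// The A/B assertion described above: overlay ON fans out at least as much as overlay OFF.
const fanoutHolds = firstTurnToolCalls(armA) >= firstTurnToolCalls(armB);
```

Only the first assistant turn is counted, so a model that reads the three files sequentially across later turns scores 1, not 3.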

2. Routing precision (6-case mini-benchmark)
   - 3 positive prompts that should route (wtf bug, send it, does it work)
   - 3 negative prompts that match keywords but should NOT route
     (syntax question, algorithm question, slack message)
   - Assert: TP rate >= 66%, FP rate <= 33%.
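
The routing assertion can be sketched as below. `RoutingCase` and `routingRates` are hypothetical names, and the thresholds are written as 2/3 and 1/3 on the assumption that "66%" and "33%" are rounded thirds. With three cases per class, a literal 0.33 ceiling would reject a single false positive, since 1/3 is about 0.333.

```typescript
// Hypothetical result record for one benchmark prompt.
interface RoutingCase {
  prompt: string;
  shouldRoute: boolean; // ground truth
  routed: boolean;      // what the model actually did
}

interface RoutingRates {
  tpRate: number; // routed positives / all positives
  fpRate: number; // routed negatives / all negatives
}

function routingRates(cases: RoutingCase[]): RoutingRates {
  const positives = cases.filter((c) => c.shouldRoute);
  const negatives = cases.filter((c) => !c.shouldRoute);
  const tp = positives.filter((c) => c.routed).length;
  const fp = negatives.filter((c) => c.routed).length;
  return {
    // Guard against empty classes so the rates are never NaN.
    tpRate: positives.length === 0 ? 0 : tp / positives.length,
    fpRate: negatives.length === 0 ? 0 : fp / negatives.length,
  };
}

// Assumed thresholds: rounded thirds, not values taken from the test source.
const MIN_TP_RATE = 2 / 3;
const MAX_FP_RATE = 1 / 3;

function passesRouting(cases: RoutingCase[]): boolean {
  const { tpRate, fpRate } = routingRates(cases);
  return tpRate >= MIN_TP_RATE && fpRate <= MAX_FP_RATE;
}

// Illustrative outcomes only, not measured results.
const sampleCases: RoutingCase[] = [
  { prompt: "wtf bug", shouldRoute: true, routed: true },
  { prompt: "send it", shouldRoute: true, routed: true },
  { prompt: "does it work", shouldRoute: true, routed: false },
  { prompt: "syntax question", shouldRoute: false, routed: false },
  { prompt: "algorithm question", shouldRoute: false, routed: true },
  { prompt: "slack message", shouldRoute: false, routed: false },
];
```

Under these assumed thresholds the suite tolerates one miss and one false positive out of three each, which leaves room for run-to-run variance in a non-deterministic model.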

Cost estimate: ~$3-5 per full run. Classified as periodic tier per
CLAUDE.md convention (Opus model, non-deterministic). Runs only with
EVALS=1 env var, touchfile-gated so unrelated diffs don't trigger it.
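
A rough sketch of the two gates described above, assuming a touchfile list is a set of paths or directory prefixes that must appear in the diff before the eval runs. `shouldRunEval` and the example paths are hypothetical; the repository's actual touchfile mechanism is not shown in this commit.

```typescript
// Returns true only when EVALS=1 is set AND the diff touches a listed path.
// Entries ending in "/" are treated as directory prefixes (an assumption).
function shouldRunEval(
  env: Record<string, string | undefined>,
  changedFiles: string[],
  touchfiles: string[],
): boolean {
  if (env["EVALS"] !== "1") return false;
  return changedFiles.some((file) =>
    touchfiles.some((t) => (t.endsWith("/") ? file.startsWith(t) : file === t)),
  );
}

// Hypothetical touchfile list for this eval.
const exampleTouchfiles = ["model-overlays/", "SKILL.md.tmpl"];
```

Checking the environment variable first keeps the paid Opus run opt-in even when a relevant file changes.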

Test plan artifact at
~/.gstack/projects/garrytan-gstack/garrytan-feat-opus-4.7-migration-eng-review-test-plan-20260421-230611.md
tracks the full specification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-22 00:11:38 -07:00

providers

feat(v1.3.0.0): open agents learnings + cross-model benchmark skill (#1040)

2026-04-19 17:50:31 +08:00

benchmark-judge.ts

feat(v1.3.0.0): open agents learnings + cross-model benchmark skill (#1040)

2026-04-19 17:50:31 +08:00

benchmark-runner.ts

feat(v1.3.0.0): open agents learnings + cross-model benchmark skill (#1040)

2026-04-19 17:50:31 +08:00

codex-session-runner.ts

fix: enforce Codex 1024-char description limit + auto-heal stale installs (v0.11.9.0) (#391)