mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 03:35:09 +02:00
b86ef06706
Adds two new benches that permanently guard the v2 tuning: - security-bench-ensemble-live.test.ts (opt-in via GSTACK_BENCH_ENSEMBLE=1). Runs full ensemble on BrowseSafe-Bench smoke with real Haiku calls. Worker-pool concurrency (default 8, tunable via GSTACK_BENCH_ENSEMBLE_CONCURRENCY) cuts wall clock from ~2hr to ~25min on 500 cases. Captures Haiku responses to fixture for replay. Subsampling via GSTACK_BENCH_ENSEMBLE_CASES for faster iteration. Stop-loss iterations write to ~/.gstack-dev/evals/stop-loss-iter-N-* WITHOUT overwriting canonical fixture. - security-bench-ensemble.test.ts (CI gate, deterministic replay). Replays captured fixture through combineVerdict, asserts detection >= 55% AND FP <= 25%. Fail-closed when fixture is missing AND security-layer files changed in branch diff. Uses `git diff --name-only base` (two-dot) to catch both committed and working-tree changes — `git diff base...HEAD` would silently skip in CI after fixture lands. - browse/test/fixtures/security-bench-haiku-responses.json — 500 cases × 3 classifier signals each. Header includes schema_version, pinned model, component hashes (prompt, exemplars, thresholds, combiner, dataset version). Any change invalidates the fixture and forces fresh live capture. - docs/evals/security-bench-ensemble-v2.json — durable PR artifact with measured TP/FN/FP/TN, 95% CIs, knob state, v1 baseline delta. Checked in so reviewers can see the numbers that justified the ship. Measured baseline on the new harness: TP=146 FN=114 FP=55 TN=185 → 56.2% / 22.9% → GATE PASS Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>