Adds two new benches that permanently guard the v2 tuning, plus the
fixture and results artifact that back them:
- security-bench-ensemble-live.test.ts (opt-in via GSTACK_BENCH_ENSEMBLE=1).
  Runs the full ensemble on BrowseSafe-Bench smoke with real Haiku
  calls. Worker-pool concurrency (default 8, tunable via
  GSTACK_BENCH_ENSEMBLE_CONCURRENCY) cuts wall clock from ~2hr to
  ~25min on 500 cases. Captures Haiku responses to a fixture for
  replay. Subsampling via GSTACK_BENCH_ENSEMBLE_CASES allows faster
  iteration. Stop-loss iterations write to
  ~/.gstack-dev/evals/stop-loss-iter-N-* WITHOUT overwriting the
  canonical fixture.
- security-bench-ensemble.test.ts (CI gate, deterministic replay).
  Replays the captured fixture through combineVerdict and asserts
  detection >= 55% AND FP <= 25%. Fails closed when the fixture is
  missing AND security-layer files changed in the branch diff. Uses
  `git diff --name-only base` (single revision, compared against the
  working tree) to catch both committed and working-tree changes;
  `git diff base...HEAD` would silently skip in CI after the fixture
  lands.
- browse/test/fixtures/security-bench-haiku-responses.json: 500 cases
  × 3 classifier signals each. Header includes schema_version, pinned
  model, and component hashes (prompt, exemplars, thresholds, combiner,
  dataset version). A change to any of these invalidates the fixture
  and forces a fresh live capture.
- docs/evals/security-bench-ensemble-v2.json: durable PR artifact
  with measured TP/FN/FP/TN, 95% CIs, knob state, and v1 baseline delta.
  Checked in so reviewers can see the numbers that justified the ship.
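
The fail-closed rule in the CI gate can be sketched as a small pure
function. This is an illustration only: `decideGate`, `parseNameOnly`,
and the `browse/src/security` prefix are hypothetical names, and
treating the remaining case (fixture missing, no security-layer
changes) as a skip is an assumption inferred from the description
above, not taken from the test file.

```typescript
type GateAction = "replay" | "skip" | "fail";

// Hypothetical prefix list; the commit does not enumerate the
// security-layer files, so this value is a placeholder.
const SECURITY_PATH_PREFIXES: readonly string[] = ["browse/src/security"];

// Turns the stdout of `git diff --name-only <base>` into a list of paths.
function parseNameOnly(stdout: string): string[] {
  return stdout
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// True when any changed path falls under a security-layer prefix.
function touchesSecurityLayer(
  changedFiles: readonly string[],
  prefixes: readonly string[] = SECURITY_PATH_PREFIXES,
): boolean {
  return changedFiles.some((file) =>
    prefixes.some((prefix) => file.startsWith(prefix)),
  );
}

// Fixture present: replay it. Fixture missing: fail only if the branch
// touched the security layer, otherwise skip (assumed behavior).
function decideGate(
  fixtureExists: boolean,
  changedFiles: readonly string[],
): GateAction {
  if (fixtureExists) return "replay";
  return touchesSecurityLayer(changedFiles) ? "fail" : "skip";
}
```

Keeping the decision separate from the `git` invocation means the
three outcomes can be unit-tested without a repository.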

Measured baseline on the new harness:
  TP=146 FN=114 FP=55 TN=185 → detection 56.2% / FP 22.9% → GATE PASS
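
The figures above can be reproduced with a few lines of arithmetic.
A minimal sketch, assuming detection is TP / (TP + FN) and FP rate is
FP / (FP + TN), which matches the reported 56.2% / 22.9%. `rates` and
`gatePasses` are hypothetical names, not the harness's API.

```typescript
interface Confusion {
  tp: number;
  fn: number;
  fp: number;
  tn: number;
}

// Detection = share of attack cases flagged; FP rate = share of benign
// cases flagged. Assumes both classes are non-empty.
function rates(c: Confusion): { detection: number; falsePositive: number } {
  return {
    detection: c.tp / (c.tp + c.fn),
    falsePositive: c.fp / (c.fp + c.tn),
  };
}

// Gate thresholds as stated in this commit: detection >= 55% AND FP <= 25%.
function gatePasses(
  c: Confusion,
  minDetection = 0.55,
  maxFalsePositive = 0.25,
): boolean {
  const r = rates(c);
  return r.detection >= minDetection && r.falsePositive <= maxFalsePositive;
}

const baseline: Confusion = { tp: 146, fn: 114, fp: 55, tn: 185 };
const measured = rates(baseline);
console.log((measured.detection * 100).toFixed(1));     // 56.2
console.log((measured.falsePositive * 100).toFixed(1)); // 22.9
console.log(gatePasses(baseline));                      // true
```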

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>