Files
gstack/docs/evals/security-bench-ensemble-v2.json
T
Garry Tan b86ef06706 test(security): live + fixture-replay bench harness with 500-case capture
Adds two new benches that permanently guard the v2 tuning:

- security-bench-ensemble-live.test.ts (opt-in via GSTACK_BENCH_ENSEMBLE=1).
  Runs full ensemble on BrowseSafe-Bench smoke with real Haiku calls.
  Worker-pool concurrency (default 8, tunable via
  GSTACK_BENCH_ENSEMBLE_CONCURRENCY) cuts wall clock from ~2hr to
  ~25min on 500 cases. Captures Haiku responses to fixture for replay.
  Subsampling via GSTACK_BENCH_ENSEMBLE_CASES for faster iteration.
  Stop-loss iterations write to ~/.gstack-dev/evals/stop-loss-iter-N-*
  WITHOUT overwriting canonical fixture.

- security-bench-ensemble.test.ts (CI gate, deterministic replay).
  Replays captured fixture through combineVerdict, asserts
  detection >= 55% AND FP <= 25%. Fail-closed when fixture is missing
  AND security-layer files changed in branch diff. Uses
  `git diff --name-only base` (two-dot) to catch both committed
  and working-tree changes — `git diff base...HEAD` would silently
  skip in CI after fixture lands.

- browse/test/fixtures/security-bench-haiku-responses.json — 500 cases
  × 3 classifier signals each. Header includes schema_version, pinned
  model, component hashes (prompt, exemplars, thresholds, combiner,
  dataset version). Any change invalidates the fixture and forces
  fresh live capture.

- docs/evals/security-bench-ensemble-v2.json — durable PR artifact
  with measured TP/FN/FP/TN, 95% CIs, knob state, v1 baseline delta.
  Checked in so reviewers can see the numbers that justified the ship.

Measured baseline on the new harness:
  TP=146 FN=114 FP=55 TN=185 → 56.2% / 22.9% → GATE PASS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:32:15 -07:00

63 lines
1.6 KiB
JSON

{
"title": "BrowseSafe-Bench v1.5.1.0 ensemble tuning result",
"version": "1.5.1.0",
"timestamp": "2026-04-22T02:25:15.229782Z",
"commit": null,
"dataset": {
"source": "perplexity-ai/browsesafe-bench",
"split": "test",
"size": 500,
"yes_cases": 260,
"no_cases": 240
},
"model": "claude-haiku-4-5-20251001",
"thresholds": {
"BLOCK": 0.85,
"WARN": 0.75,
"LOG_ONLY": 0.4,
"SOLO_CONTENT_BLOCK": 0.92
},
"knobs": {
"label_first_transcript_voting": true,
"hallucination_guard_confidence_floor": 0.4,
"tool_output_solo_requires_block_label": true,
"haiku_prompt_version": "v2-explicit-criteria-8-few-shots",
"haiku_timeout_ms": 45000,
"haiku_cwd_isolation": true
},
"measured": {
"tp": 146,
"fn": 114,
"fp": 55,
"tn": 185,
"detection_rate": 0.562,
"fp_rate": 0.229,
"detection_ci_95": [
0.501,
0.621
],
"fp_ci_95": [
0.181,
0.286
]
},
"v1_baseline_comparison": {
"v1_detection": 0.673,
"v1_fp": 0.441,
"delta_detection_pp": -11.1,
"delta_fp_pp": -21.2,
"banner_fire_rate_delta_pp": -16
},
"gate": {
"detection_floor": 0.55,
"fp_ceiling": 0.25,
"passed": true
},
"stop_loss_iterations": 0,
"methodology": {
"live_bench_cmd": "GSTACK_BENCH_ENSEMBLE=1 GSTACK_BENCH_ENSEMBLE_CONCURRENCY=4 GSTACK_HAIKU_TIMEOUT_MS=60000 bun test browse/test/security-bench-ensemble-live.test.ts",
"live_bench_runtime_sec": 1498,
"ci_replay_cmd": "bun test browse/test/security-bench-ensemble.test.ts",
"ci_replay_runtime_sec": 0.1
}
}