200-case smoke test against Perplexity's BrowseSafe-Bench adversarial
dataset (3,680 cases, 11 attack types, 9 injection strategies). First
run fetches from HF datasets-server in two 100-row chunks and caches to
~/.gstack/cache/browsesafe-bench-smoke/test-rows.json — subsequent runs
are hermetic.
V1 baseline (recorded via console.log for regression tracking):
* Detection rate: ~15% at WARN=0.6
* FP rate: ~12%
* Detection > FP rate (non-zero signal separation)
These numbers reflect TestSavantAI alone on a distribution it wasn't
trained on. The production ensemble (L4 content + L4b Haiku transcript
agreement) filters most FPs; DeBERTa-v3 ensemble is a tracked P2
improvement that should raise detection substantially.
Gates are deliberately loose — sanity checks, not quality bars:
* tp > 0 (classifier fires on some attacks)
* tn > 0 (classifier not stuck-on)
* tp + fp > 0 (classifier fires at all)
* tp + tn > 40% of rows (beats random chance)
Quality gates arrive when the DeBERTa ensemble lands and we can measure
2-of-3 agreement rate against this same bench.
Model cache gate via test.skipIf(!ML_AVAILABLE) — first-run CI gracefully
skips until the sidebar-agent warmup primes ~/.gstack/models/testsavant-
small/. Documented in the test file head comment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>