docs(changelog): measured BrowseSafe-Bench lift from Haiku unbreak

Before/after on the 200-case smoke cache:
  L4-only:  15.3% detection / 11.8% FP
  Ensemble: 67.3% detection / 44.1% FP

4.4x lift in detection from fixing the model alias + timeout + removing
the pre-Haiku gate on tool outputs. FP rate up 3.7x — Haiku is more
aggressive than L4 on edge cases. Review banner makes those recoverable;
P1 follow-up to tune Haiku WARN threshold from 0.6 to ~0.7-0.85 once
real attempts.jsonl data arrives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -49,6 +49,31 @@ Two independent adversarial reviewers (Claude subagent and Codex/gpt-5.4) conver
Also in this release:
* attribute-injection fix in `escapeHtml` (escapes `"` and `'` now; sketch below)
* `GSTACK_SECURITY_OFF=1` is now a real gate in `loadTestsavant`/`loadDeberta` (not just a doc promise)
* device salt cached in-process so FS-unwritable environments don't break hash correlation
* tool-use registry entries evicted on `tool_result` (memory leak fix)
* dashboard uses `jq` for brace-balanced JSON parse when available
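For the attribute-escaping point, a minimal sketch of what a quote-aware `escapeHtml` looks like (illustrative only; the real gstack implementation may differ in shape):

```ts
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, "&amp;")  // first, so later entities aren't double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;") // the attribute-injection fix: double quotes...
    .replace(/'/g, "&#39;"); // ...and single quotes are now escaped too
}
```

Without the last two replacements, output interpolated into an HTML attribute value can break out of the attribute and inject new ones.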
### Haiku transcript classifier unbroken (silent bug + gate removal)
The transcript classifier (`checkTranscript` calling `claude -p --model haiku`) was shipping dead. Two bugs:
1. Model alias `haiku-4-5` returned 404 from the CLI. Correct shorthand is `haiku` (resolves to `claude-haiku-4-5-20251001` today, stays on the latest Haiku as models roll).
2. The 2-second timeout was below the floor. Fresh `claude -p` spawn has ~2-3s CLI cold start + 5-12s inference on ~1KB prompts. At 2s every call timed out. Bumped to 15s.
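A minimal sketch of the fixed invocation, assuming a Node-style wrapper around the CLI (the real `checkTranscript` plumbing likely differs; prompt handling here is illustrative):

```ts
import { execFile } from "node:child_process";

// "haiku" is the CLI shorthand that tracks the latest Haiku; "haiku-4-5" 404s.
// 15s covers the ~2-3s cold start plus 5-12s inference; the old 2s never could.
function checkTranscript(prompt: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = execFile(
      "claude",
      ["-p", "--model", "haiku"],
      { timeout: 15_000 },
      (err, stdout) => (err ? reject(err) : resolve(stdout)),
    );
    child.stdin?.write(prompt); // prompt piped on stdin (illustrative)
    child.stdin?.end();
  });
}
```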
Compounding the dead classifier: `shouldRunTranscriptCheck` gated Haiku on any other layer firing at `>= LOG_ONLY`. On the ~85% of BrowseSafe-Bench attacks that L4 misses (TestSavantAI recall is ~15% on browser-agent-specific attacks), Haiku never got a chance to vote. We were gating our best signal on our weakest. For tool outputs this gate is now removed — L4 + L4c + Haiku always run in parallel.
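Roughly, the post-fix flow for tool outputs looks like the sketch below; the layer function names and `Verdict` shape are assumptions, only the always-parallel structure comes from this change:

```ts
type Verdict = { layer: string; score: number };

// Stubs standing in for the real layer calls (assumed signatures).
declare function runTestsavant(text: string): Promise<Verdict>;   // L4
declare function runDeberta(text: string): Promise<Verdict>;      // L4c
declare function checkTranscript(text: string): Promise<Verdict>; // Haiku
declare function combineVerdicts(...verdicts: Verdict[]): Verdict;

// All three layers now always run in parallel on tool outputs; previously
// checkTranscript was skipped whenever another layer fired at >= LOG_ONLY.
async function scanToolOutput(text: string): Promise<Verdict> {
  const [l4, l4c, haiku] = await Promise.all([
    runTestsavant(text),
    runDeberta(text),
    checkTranscript(text),
  ]);
  return combineVerdicts(l4, l4c, haiku);
}
```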
Review-on-BLOCK UX (centered alert-heavy banner with suspected text excerpt + per-layer scores + Allow / Block session buttons) lands alongside so false positives are recoverable instead of session-killing.
### Measured: BrowseSafe-Bench (200-case smoke)
Same 200 cases, before and after the fixes above:
| Metric | L4-only (before) | Ensemble with Haiku (after) |
|---|---|---|
| Detection rate | 15.3% | **67.3%** |
| False-positive rate | 11.8% | 44.1% |
| Runtime | ~90s | ~41 min (Haiku is the long pole) |
**4.4x lift in detection.** FP rate also climbed 3.7x — Haiku is more aggressive and fires on edge cases that TestSavantAI smiles through. The review banner makes those FPs recoverable: user sees the suspected excerpt + layer scores, clicks Allow once, session continues. A P1 follow-up is tuning the Haiku WARN threshold (currently 0.6, probably should be 0.7-0.85) against real-world attempts.jsonl data once gstack users start reporting.
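For concreteness, the tuning under discussion amounts to moving one cutoff in a mapping like this (names and the BLOCK cutoff are assumptions for illustration; only the 0.6 WARN default comes from this release):

```ts
const HAIKU_WARN_THRESHOLD = 0.6;  // P1: likely moves to 0.7-0.85 after tuning
const HAIKU_BLOCK_THRESHOLD = 0.9; // assumed, not stated in the changelog

function haikuVerdict(score: number): "BLOCK" | "WARN" | "PASS" {
  if (score >= HAIKU_BLOCK_THRESHOLD) return "BLOCK";
  if (score >= HAIKU_WARN_THRESHOLD) return "WARN";
  return "PASS";
}
```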
Honest shipping posture: this is meaningfully safer than v1.3.x, not bulletproof. Canary (deterministic), content-security L1-L3 (structural), and the review banner remain the load-bearing defenses when the ML layers miss or over-fire.
### Env knobs
* `GSTACK_SECURITY_OFF=1` — emergency kill switch (canary still injected, ML skipped)
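A sketch of how the gate now behaves in the loaders (exact code is an assumption; the changelog only promises the behavior):

```ts
// Sketch only: the changelog guarantees the behavior, not this exact shape.
type Classifier = { classify(text: string): Promise<number> };
declare function loadModel(name: string): Promise<Classifier>; // assumed loader

function securityOff(): boolean {
  return process.env.GSTACK_SECURITY_OFF === "1";
}

async function loadTestsavant(): Promise<Classifier | null> {
  // Real gate now: ML layers are skipped entirely under the kill switch.
  // Canary injection is deterministic and happens elsewhere regardless.
  if (securityOff()) return null;
  return loadModel("testsavant");
}
```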