docs(security): document the sidebar security stack in CLAUDE.md

Adds a security section to the Browser interaction block. Covers: * Layered defense table showing which modules live where (content-security.ts in both contexts vs security-classifier.ts only in sidebar-agent) and why the split exists (onnxruntime-node incompatibility with compiled Bun) * Threshold constants (0.85 / 0.60 / 0.40) and the ensemble rule that prevents single-classifier false-positives (the Stack Overflow FP story) * Env knobs — GSTACK_SECURITY_OFF kill switch, cache paths, salt file, attack log rotation, session state file This is the "before you modify the security stack, read this" doc. It lives next to the existing Sidebar architecture note that points at SIDEBAR_MESSAGE_FLOW.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-22 09:39:59 +02:00 · 2026-04-19 19:10:03 +08:00
parent 7e9600ffc8
commit 0df847d886
1 changed files with 36 additions and 0 deletions
@@ -206,6 +206,42 @@ failure modes. The sidebar spans 5 files across 2 codebases (extension + server)
 with non-obvious ordering dependencies. The doc exists to prevent the kind of
 silent failures that come from not understanding the cross-component flow.

+**Sidebar security stack** (layered defense against prompt injection):
+
+| Layer | Module | Lives in |
+|-------|--------|----------|
+| L1-L3 | `content-security.ts` | both server and agent — datamarking, hidden element strip, ARIA regex, URL blocklist, envelope wrapping |
+| L4 | `security-classifier.ts` (TestSavantAI ONNX) | **sidebar-agent only** |
+| L4b | `security-classifier.ts` (Claude Haiku transcript) | **sidebar-agent only** |
+| L5 | `security.ts` (canary) | both — inject in compiled, check in agent |
+| L6 | `security.ts` (combineVerdict ensemble) | both |
+
+**Critical constraint:** `security-classifier.ts` CANNOT be imported from the
+compiled browse binary. `@huggingface/transformers` v4 requires `onnxruntime-node`
+which fails to `dlopen` from Bun compile's temp extract dir. Only `security.ts`
+(pure-string operations — canary, verdict combiner, attack log, status) is safe
+for `server.ts`. See `~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-prompt-injection-guard.md`
+§"Pre-Impl Gate 1 Outcome" for full architectural decision.
+
+**Thresholds** (in `security.ts`):
+- `BLOCK: 0.85` — single-layer score that would cause BLOCK if cross-confirmed
+- `WARN: 0.60` — cross-confirm threshold. When L4 AND L4b both >= 0.60 → BLOCK
+- `LOG_ONLY: 0.40` — gates transcript classifier (skip Haiku when all layers < 0.40)
+
+**Ensemble rule:** BLOCK only when the ML content classifier AND the transcript
+classifier both report >= WARN. Single-layer high confidence degrades to WARN —
+this is the Stack Overflow instruction-writing FP mitigation. Canary leak
+always BLOCKs (deterministic).
+
+**Env knobs:**
+- `GSTACK_SECURITY_OFF=1` — emergency kill switch. Classifier stays off even if
+  warmed. Canary is still injected; just the ML scan is skipped.
+- Classifier model cache: `~/.gstack/models/testsavant-small/` (112MB, first run only)
+- Attack log: `~/.gstack/security/attempts.jsonl` (salted sha256 + domain only,
+  rotates at 10MB, 5 generations)
+- Per-device salt: `~/.gstack/security/device-salt` (0600)
+- Session state: `~/.gstack/security/session-state.json` (cross-process, atomic)
+
 ## Dev symlink awareness

 When developing gstack, `.claude/skills/gstack` may be a symlink back to this