mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-27 12:10:00 +02:00
docs: align prompt-injection thresholds to security.ts (v1.6.4.0 catch-up)
CLAUDE.md:290 and ARCHITECTURE.md:159 were missed when WARN was bumped
0.60 → 0.75 in d75402bb (v1.6.4.0, "cut Haiku classifier FP from 44% to
23%, gate now enforced", #1135). browse/src/security.ts:37 has WARN: 0.75
and BROWSER.md:743 was updated alongside that commit; CLAUDE.md and
ARCHITECTURE.md still read 0.60.
Also adds the SOLO_CONTENT_BLOCK: 0.92 entry to CLAUDE.md (already in
security.ts:50 and BROWSER.md:745, missing from CLAUDE.md's threshold
table).
No code change. No behavior change. Pure doc-vs-code alignment.
Verification:
$ grep -n "WARN" browse/src/security.ts CLAUDE.md ARCHITECTURE.md BROWSER.md
browse/src/security.ts:37: WARN: 0.75,
CLAUDE.md:290: - \`WARN: 0.75\` ...
ARCHITECTURE.md:159: ...>= \`WARN\` (0.75)...
BROWSER.md:743: - \`WARN: 0.75\` ...
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+1
-1
@@ -156,7 +156,7 @@ The Chrome sidebar agent has tools (Bash, Read, Glob, Grep, WebFetch) and reads
|
||||
|
||||
4. **L5 canary token (`browse/src/security.ts`).** A random token injected into the system prompt at session start. Rolling-buffer detection across `text_delta` and `input_json_delta` streams catches the token if it shows up anywhere in Claude's output, tool arguments, URLs, or file writes. Deterministic BLOCK — if the token leaks, the attacker convinced Claude to reveal the system prompt, and the session ends.
|
||||
|
||||
5. **L6 ensemble combiner (`combineVerdict`).** BLOCK requires agreement from two ML classifiers at >= `WARN` (0.60), not a single confident hit. This is the Stack Overflow instruction-writing false-positive mitigation. On tool-output scans, single-layer high confidence BLOCKs directly — the content wasn't user-authored, so the FP concern doesn't apply.
|
||||
5. **L6 ensemble combiner (`combineVerdict`).** BLOCK requires agreement from two ML classifiers at >= `WARN` (0.75), not a single confident hit. This is the Stack Overflow instruction-writing false-positive mitigation. On tool-output scans, single-layer high confidence BLOCKs directly — the content wasn't user-authored, so the FP concern doesn't apply.
|
||||
|
||||
**Critical constraint:** `security-classifier.ts` runs only in the sidebar-agent process, never in the compiled browse binary. `@huggingface/transformers` v4 requires `onnxruntime-node`, which fails `dlopen` from Bun compile's temp extract directory. Only the pure-string pieces (canary inject/check, verdict combiner, attack log, status) are in `security.ts`, which is safe to import from `server.ts`.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user