23 tests covering realistic attack shapes that a hostile QA engineer would
write to break the security layer. All pure logic — no model download, no
subprocess, no network. Covers two groups:
Canary channel coverage (14 tests; illustrative scan sketch after this list)
* leak via goto URL query, fragment, screenshot path, Write file_path,
Write content, form fill, curl, deep-nested BatchTool args
* key-vs-value distinction (canary in value = leak; canary in key = miss,
which is fine because Claude doesn't build keys from attacker content)
* benign deeply-nested object stays clean (no false positive)
* partial-prefix substring does NOT trigger (full-token requirement)
* canary embedded in base64-looking blob still fires on raw text
* stream text_delta chunk triggers (matches sidebar-agent detectCanaryLeak)
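A minimal sketch of the scan behavior these tests pin down, assuming a
recursive walk over tool-call args. The names below (CANARY,
scanToolArgsForCanary) are illustrative, not the repo's API; the real
detection path is sidebar-agent's detectCanaryLeak:

```ts
// Illustrative only: shows the two properties the tests assert —
// string VALUES (never keys) are scanned, and only the full canary
// token fires, never a partial prefix.
const CANARY = "cnry_3f9a1c"; // hypothetical full token

function scanToolArgsForCanary(value: unknown): boolean {
  if (typeof value === "string") {
    // Full-token containment: a base64-looking blob that embeds the raw
    // token still fires; a prefix like "cnry_3f" alone does not.
    return value.includes(CANARY);
  }
  if (Array.isArray(value)) {
    return value.some(scanToolArgsForCanary);
  }
  if (value && typeof value === "object") {
    // Only values are walked; keys are never inspected.
    return Object.values(value).some(scanToolArgsForCanary);
  }
  return false;
}

// Deep-nested BatchTool-style args still fire:
scanToolArgsForCanary({
  invocations: [
    { tool: "Write", input: { file_path: "/tmp/x", content: `token=${CANARY}` } },
  ],
}); // => true

// Canary appearing only as a key: no leak reported.
scanToolArgsForCanary({ [CANARY]: "benign value" }); // => false
```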
Verdict combiner (9 tests; illustrative combiner sketch after this list)
* ensemble_agreement blocks when both ML layers >= WARN (Haiku rescues
StackOne-style FPs — e.g. Stack Overflow instruction content)
* single_layer_high degrades to WARN (the canonical Stack Overflow FP
mitigation — one classifier's 0.99 does NOT kill the session alone)
* canary leak trumps all ML safe signals (deterministic > probabilistic)
* threshold boundary behavior at exactly WARN
* aria_regex + content co-correlation does NOT count as ensemble
agreement (addresses Codex review's "correlated signal amplification"
critique — ensemble needs testsavant + transcript specifically)
* degraded classifiers (confidence 0, meta.degraded) produce safe verdict
— fail-open contract preserved
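A hedged sketch of the combiner contract the nine tests exercise. The
type names, threshold value, and function signature below are
illustrative assumptions, not the repo's actual API:

```ts
type Level = "SAFE" | "WARN" | "BLOCK";

interface Signal {
  source: "testsavant" | "transcript" | "aria_regex" | "content";
  confidence: number;  // 0..1
  degraded?: boolean;  // classifier unavailable or errored
}

const WARN_THRESHOLD = 0.5; // assumed boundary; tests probe behavior at exactly WARN

function combine(signals: Signal[], canaryLeaked: boolean): Level {
  // Deterministic beats probabilistic: a canary leak blocks no matter
  // how "safe" every ML layer looks.
  if (canaryLeaked) return "BLOCK";

  // Degraded classifiers (confidence 0, degraded flag) are treated as
  // absent, preserving the fail-open contract.
  const live = signals.filter((s) => !s.degraded && s.confidence > 0);

  const hot = (src: Signal["source"]) =>
    live.some((s) => s.source === src && s.confidence >= WARN_THRESHOLD);

  // Ensemble agreement means BOTH ML layers specifically; an aria_regex
  // hit co-correlating with content does not count.
  if (hot("testsavant") && hot("transcript")) return "BLOCK";

  // A single high-confidence layer only degrades to WARN, so one
  // classifier's 0.99 on Stack Overflow instruction text cannot kill
  // the session on its own.
  if (live.some((s) => s.confidence >= WARN_THRESHOLD)) return "WARN";

  return "SAFE";
}

// One classifier at 0.99 alone: WARN, not BLOCK.
combine([{ source: "transcript", confidence: 0.99 }], false); // => "WARN"
// All classifiers degraded but canary leaked: still BLOCK.
combine([{ source: "testsavant", confidence: 0, degraded: true }], true); // => "BLOCK"
```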
All 23 tests pass in 82ms. Combined with security.test.ts, we now have
48 tests across 90 expectations for the pure-logic security surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>