merge: origin/main into garrytan/injection-tuning; bump v1.5.1.0 → v1.5.2.0

Main shipped v1.5.1.0 for /make-pdf entity + font fixes while this branch
was in flight, creating a version collision. Resolving by bumping this
branch's security tuning release to v1.5.2.0 (next PATCH after main's
v1.5.1.0) and retaining both CHANGELOG entries: my v1.5.2.0 on top,
main's v1.5.1.0 below.

Updated v1.5.1.0 → v1.5.2.0 references in security.ts, security-classifier.ts,
adversarial.test.ts, bench-ensemble.test.ts, bench-ensemble-live.test.ts,
bench.test.ts, and TODOS.md. Main's CHANGELOG entry left untouched.

All 231 security tests + fixture-replay gate still pass:
  TP=146 FN=114 FP=55 TN=185 → detection 56.2% / FP rate 22.9% → GATE PASS
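
The two percentages above can be reproduced from the confusion counts. A minimal sketch, assuming the first figure is the detection rate TP / (TP + FN) and the second is the false-positive rate FP / (FP + TN); the variable and helper names are illustrative and not taken from the bench code:

```typescript
// Confusion counts copied from the gate output above.
const counts = { tp: 146, fn: 114, fp: 55, tn: 185 };

// Assumed definitions (the gate output does not spell them out):
// detection rate = share of attack cases that were flagged,
// false-positive rate = share of benign cases that were flagged.
const detectionRate = counts.tp / (counts.tp + counts.fn);
const falsePositiveRate = counts.fp / (counts.fp + counts.tn);

// Hypothetical formatting helper: fraction -> percentage with one decimal.
const pct = (x: number): string => (x * 100).toFixed(1) + '%';

console.log(`${pct(detectionRate)} / ${pct(falsePositiveRate)}`); // 56.2% / 22.9%
```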

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-21 20:50:02 -07:00
18 changed files with 299 additions and 38 deletions
+2 -2
@@ -490,7 +490,7 @@ export async function checkTranscript(params: {
// repo with a prompt-injection-defense CLAUDE.md (like gstack itself),
// Haiku reads "we have a strict security classifier" and responds with
// meta-commentary instead of classifying the input — we measured 100%
-// timeout rate in the v1.5.1.0 ensemble bench because of this, plus
+// timeout rate in the v1.5.2.0 ensemble bench because of this, plus
// ~44k cache_creation tokens per call (massive cost inflation).
// Using os.tmpdir() gives Haiku a clean context for pure classification.
const p = spawn('claude', [
@@ -539,7 +539,7 @@ export async function checkTranscript(params: {
p.on('error', () => {
finish({ layer: 'transcript_classifier', confidence: 0, meta: { degraded: true, reason: 'spawn_error' } });
});
-// Hard timeout. Measured in v1.5.1.0 bench: `claude -p --model
+// Hard timeout. Measured in v1.5.2.0 bench: `claude -p --model
// claude-haiku-4-5-20251001` takes 17-33s end-to-end even for trivial
// prompts (CLI session startup + Haiku API). The v1 15s timeout caused
// 100% timeout rate when re-measured in v2 — v1's ensemble was
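
The hunk above shows the spawn-error path calling `finish` with a degraded signal, and the comment describes a hard timeout on the same call. One way these paths could share a single result is a first-caller-wins guard. This is only a sketch under assumptions: `once`, `degraded`, `armHardTimeout`, and the `'timeout'` reason string are hypothetical names, not taken from security-classifier.ts.

```typescript
type LayerSignal = {
  layer: string;
  confidence: number;
  meta?: Record<string, unknown>;
};

// Hypothetical helper: wraps a callback so only the first call is delivered.
// Later calls (e.g. a timeout firing after an error was already reported)
// are silently ignored.
function once(deliver: (s: LayerSignal) => void): (s: LayerSignal) => void {
  let done = false;
  return (s) => {
    if (done) return;
    done = true;
    deliver(s);
  };
}

// Hypothetical builder for the degraded signal shape seen in the hunk above.
function degraded(reason: string): LayerSignal {
  return { layer: 'transcript_classifier', confidence: 0, meta: { degraded: true, reason } };
}

// Arms a timer that reports a degraded 'timeout' signal (assumed reason name)
// unless it is cancelled first. Returns the cancel function.
function armHardTimeout(finish: (s: LayerSignal) => void, ms: number): () => void {
  const timer = setTimeout(() => finish(degraded('timeout')), ms);
  return () => clearTimeout(timer);
}
```

With this shape, the error handler, the timer, and the normal exit path can all call the same guarded `finish` without coordinating with each other; whichever fires first decides the signal.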
+2 -2
@@ -88,7 +88,7 @@ export interface StatusDetail {
/**
* Combine per-layer signals into a single verdict. Post-v2 ensemble rule
- * (v1.5.1.0+) is label-first for the transcript layer: Haiku's verdict
+ * (v1.5.2.0+) is label-first for the transcript layer: Haiku's verdict
* label is the primary signal, not its self-reported confidence. Other ML
* layers (testsavant_content, deberta_content) remain confidence-based
* because they emit only a scalar.
@@ -205,7 +205,7 @@ export function combineVerdict(signals: LayerSignal[], opts: CombineVerdictOpts
// Single-layer BLOCK. For tool-output, BLOCK directly; for user-input,
// degrade to WARN (SO-FP mitigation).
//
-// Asymmetric thresholds (v1.5.1.0+):
+// Asymmetric thresholds (v1.5.2.0+):
// - Content classifiers (testsavant, deberta): require confidence
// >= THRESHOLDS.SOLO_CONTENT_BLOCK (0.92). These are label-less so the
// bar is higher — pattern-matching on "suspicious text" alone isn't
+3 -3
@@ -267,11 +267,11 @@ describe('combineVerdict — realistic attack/defense scenarios', () => {
});
});
-// ─── Label-first voting (v1.5.1.0+) ──────────────────────────
+// ─── Label-first voting (v1.5.2.0+) ──────────────────────────
describe('combineVerdict — label-first voting for transcript_classifier', () => {
test('Haiku verdict=warn at high confidence is a soft signal only, not a block-vote', () => {
-// Under v1.5.1.0 label-first: Haiku's 'warn' label means "suspicious but
+// Under v1.5.2.0 label-first: Haiku's 'warn' label means "suspicious but
// not hijack-level" regardless of its confidence. It should NOT single-
// handedly upgrade the ensemble to BLOCK even when pointed at 0.80.
const r = combineVerdict([
@@ -329,7 +329,7 @@ describe('combineVerdict — label-first voting for transcript_classifier', () =
});
test('backward-compat: transcript signal with no meta.verdict never block-votes', () => {
-// Pre-v1.5.1.0 signals (or adversarial tests) may arrive without
+// Pre-v1.5.2.0 signals (or adversarial tests) may arrive without
// meta.verdict. Under the new rule, missing meta is warn-vote-only
// when confidence >= WARN, never a block-vote. Even at 0.95 (high
// confidence), transcript alone doesn't upgrade the ensemble.
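
The label-first rule exercised by these tests, together with the 0.92 solo threshold for content classifiers, can be sketched as a single block-vote predicate. This is an illustration under assumptions, not the body of `combineVerdict`: `isBlockVote` is a hypothetical name, and the `'block'` verdict label is assumed (the diff only shows `'warn'`).

```typescript
type LayerSignal = {
  layer: string;
  confidence: number;
  meta?: { verdict?: string };
};

// Value quoted in the security.ts comment for THRESHOLDS.SOLO_CONTENT_BLOCK.
const SOLO_CONTENT_BLOCK = 0.92;

// Hypothetical predicate: does this single signal count as a block-vote?
function isBlockVote(s: LayerSignal): boolean {
  if (s.layer === 'transcript_classifier') {
    // Label-first: only the verdict label matters, not the confidence.
    // A 'warn' label or a missing meta.verdict never block-votes.
    // ASSUMPTION: the blocking label is spelled 'block'.
    return s.meta?.verdict === 'block';
  }
  // Content classifiers emit only a scalar, so they stay confidence-based
  // and must clear the higher solo bar.
  return s.confidence >= SOLO_CONTENT_BLOCK;
}

// Mirrors the two test cases above.
console.log(isBlockVote({ layer: 'transcript_classifier', confidence: 0.8, meta: { verdict: 'warn' } })); // false
console.log(isBlockVote({ layer: 'transcript_classifier', confidence: 0.95 })); // false
```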
@@ -1,5 +1,5 @@
/**
- * BrowseSafe-Bench ensemble LIVE bench (v1.5.1.0+).
+ * BrowseSafe-Bench ensemble LIVE bench (v1.5.2.0+).
*
* Runs the 200-case smoke through the full ensemble with real Haiku calls.
* Measures detection + FP rates at the ENSEMBLE level (not just L4 like
+1 -1
@@ -1,5 +1,5 @@
/**
- * BrowseSafe-Bench ensemble fixture-replay gate (v1.5.1.0+).
+ * BrowseSafe-Bench ensemble fixture-replay gate (v1.5.2.0+).
*
* Runs the 200-case smoke through combineVerdict using recorded Haiku
* responses from a committed fixture. Deterministic, free, gate-tier.
+1 -1
@@ -118,7 +118,7 @@ describe('BrowseSafe-Bench smoke (200 cases)', () => {
let tp = 0, fp = 0, tn = 0, fn = 0;
// intentionally 0.6 — L4-only bench pinned to v1 WARN for historical
// continuity. The ensemble bench (security-bench-ensemble.test.ts) uses
-// THRESHOLDS.WARN from security.ts (0.75 in v1.5.1.0+).
+// THRESHOLDS.WARN from security.ts (0.75 in v1.5.2.0+).
const WARN = 0.6;
for (const row of rows) {
const signal = await scanPageContent(row.content);