fix(security): unbreak Haiku transcript classifier — wrong model + too-tight timeout

Two bugs that made checkTranscript return degraded on every call:

1. --model 'haiku-4-5' returns 404 from the Claude CLI. The accepted
   shorthand is 'haiku' (resolves to claude-haiku-4-5-20251001
   today, stays on the latest Haiku as models roll). Symptom: every
   call exited non-zero with api_error_status=404.

2. 2000ms timeout is below the floor. Fresh `claude -p` spawn has
   ~2-3s CLI cold-start + 5-12s inference on ~1KB prompts. Even with
   the model fixed, every otherwise-successful call still timed out
   before it could return. Measured: 0% firing rate.
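The failure mode in item 2 is a promise race the timeout always wins. A minimal sketch of that pattern (hypothetical `withTimeout` helper; the `LayerResult` shape mirrors the `finish({...})` payloads in the diff below, but this is illustrative, not the repo's actual code):

```typescript
// Sketch of the timeout race around the classifier call. `work` stands
// in for the real `claude -p` spawn wrapped in a Promise.
type LayerResult = {
  layer: string;
  confidence: number;
  meta?: { degraded: boolean; reason: string };
};

function withTimeout(work: Promise<LayerResult>, ms: number): Promise<LayerResult> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      // With a cap below the CLI's real floor, this branch fires on
      // every call — the 0% firing rate described above.
      resolve({
        layer: 'transcript_classifier',
        confidence: 0,
        meta: { degraded: true, reason: 'timeout' },
      });
    }, ms);
    work.then(
      (r) => { clearTimeout(timer); resolve(r); },
      () => {
        clearTimeout(timer);
        resolve({
          layer: 'transcript_classifier',
          confidence: 0,
          meta: { degraded: true, reason: 'spawn_error' },
        });
      },
    );
  });
}
```

Whichever side settles first wins; later calls to `resolve` are no-ops, so a cap below the ~8s observed end-to-end latency means the degraded branch wins every race.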

Fix: model alias + 15s timeout. Sanity check against DAN-style
injection now returns confidence 0.99 with reasoning ("Tool output
contains multiple injection patterns: instruction override, jailbreak
attempt (DAN), system prompt exfil request, and malicious curl
command to attacker domain") in 8.7s.

This was the silent cause of the 15.3% detection rate on
BrowseSafe-Bench — the ensemble numbers matched L4-alone because
Haiku never actually voted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-20 21:15:44 +08:00
parent 8f9bb84f3f
commit 5d968c43ec
+13 -3
@@ -456,9 +456,13 @@ export async function checkTranscript(params: {
   ].join('\n');
   return new Promise((resolve) => {
+    // Model alias 'haiku' resolves to the latest Haiku (currently
+    // claude-haiku-4-5-20251001). The pinned form 'haiku-4-5' returned 404
+    // because the CLI doesn't accept that shorthand. Using the alias keeps
+    // us on the latest Haiku as models roll forward.
     const p = spawn('claude', [
       '-p', prompt,
-      '--model', 'haiku-4-5',
+      '--model', 'haiku',
       '--output-format', 'json',
     ], { stdio: ['ignore', 'pipe', 'pipe'] });
@@ -502,11 +506,17 @@ export async function checkTranscript(params: {
     p.on('error', () => {
       finish({ layer: 'transcript_classifier', confidence: 0, meta: { degraded: true, reason: 'spawn_error' } });
     });
-    // Hard timeout — per plan §E1 (2000ms cap)
+    // Hard timeout. Original spec was 2000ms but real-world `claude -p`
+    // spawns a fresh CLI per call with ~2-3s cold-start + 5-12s inference
+    // on ~1KB prompts. At 2s every call timed out, defeating the
+    // classifier entirely (measured: 0% firing rate). At 15s we catch the
+    // long tail; faster prompts return in under 5s. The stream handler
+    // runs this in parallel with the content scan so the latency is
+    // bounded by this timer, not additive to session wall time.
     setTimeout(() => {
       try { p.kill('SIGTERM'); } catch {}
       finish({ layer: 'transcript_classifier', confidence: 0, meta: { degraded: true, reason: 'timeout' } });
-    }, 2000);
+    }, 15000);
   });
 }
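The "parallel with the content scan, not additive" claim in the new comment can be sketched with two stand-in checks; the function names and delays below are illustrative only, not the repo's real layers or timings:

```typescript
// Sketch: two checks run concurrently, so session wall time tracks the
// slower branch (here ~200ms), not the sum (~300ms). Delays are stand-ins.
const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function contentScan(): Promise<string> {
  await delay(100); // stand-in for the fast content scan
  return 'content_scan_done';
}

async function transcriptClassifier(): Promise<string> {
  await delay(200); // stand-in for the slower `claude -p` call
  return 'classifier_done';
}

async function session(): Promise<number> {
  const start = Date.now();
  await Promise.all([contentScan(), transcriptClassifier()]);
  return Date.now() - start;
}
```

Under this structure the 15s cap bounds the classifier branch, and total latency is max(scan, classifier) rather than their sum.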