fix(security): unbreak Haiku transcript classifier — wrong model + too-tight timeout

Two bugs that made checkTranscript return degraded on every call:

1. --model 'haiku-4-5' returns 404 from the Claude CLI. The accepted
   shorthand is 'haiku' (resolves to claude-haiku-4-5-20251001
   today, stays on the latest Haiku as models roll). Symptom: every
   call exited non-zero with api_error_status=404.

2. 2000ms timeout is below the floor. Fresh `claude -p` spawn has
   ~2-3s CLI cold-start + 5-12s inference on ~1KB prompts. Even with
   the model fixed, every otherwise-successful call still timed out
   before it could return. Measured: 0% firing rate.
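The failure mode in item 2 is a promise race the timeout always wins. A minimal sketch of that pattern (hypothetical `withTimeout` helper; the `LayerResult` shape mirrors the `finish({...})` payloads in the diff below, but this is illustrative, not the repo's actual code):

```typescript
// Sketch of the timeout race around the classifier call. `work` stands
// in for the real `claude -p` spawn wrapped in a Promise.
type LayerResult = {
  layer: string;
  confidence: number;
  meta?: { degraded: boolean; reason: string };
};

function withTimeout(work: Promise<LayerResult>, ms: number): Promise<LayerResult> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      // With a cap below the CLI's real floor, this branch fires on
      // every call — the 0% firing rate described above.
      resolve({
        layer: 'transcript_classifier',
        confidence: 0,
        meta: { degraded: true, reason: 'timeout' },
      });
    }, ms);
    work.then(
      (r) => { clearTimeout(timer); resolve(r); },
      () => {
        clearTimeout(timer);
        resolve({
          layer: 'transcript_classifier',
          confidence: 0,
          meta: { degraded: true, reason: 'spawn_error' },
        });
      },
    );
  });
}
```

Whichever side settles first wins; later calls to `resolve` are no-ops, so a cap below the ~8s observed end-to-end latency means the degraded branch wins every race.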

Fix: model alias + 15s timeout. Sanity check against DAN-style
injection now returns confidence 0.99 with reasoning ("Tool output
contains multiple injection patterns: instruction override, jailbreak
attempt (DAN), system prompt exfil request, and malicious curl
command to attacker domain") in 8.7s.

This was the silent cause of the 15.3% detection rate on
BrowseSafe-Bench — the ensemble numbers matched L4-alone because
Haiku never actually voted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-20 21:15:44 +08:00
parent 8f9bb84f3f
commit 5d968c43ec
+13 -3
@@ -456,9 +456,13 @@ export async function checkTranscript(params: {
   ].join('\n');
   return new Promise((resolve) => {
+    // Model alias 'haiku' resolves to the latest Haiku (currently
+    // claude-haiku-4-5-20251001). The pinned form 'haiku-4-5' returned 404
+    // because the CLI doesn't accept that shorthand. Using the alias keeps
+    // us on the latest Haiku as models roll forward.
     const p = spawn('claude', [
       '-p', prompt,
-      '--model', 'haiku-4-5',
+      '--model', 'haiku',
       '--output-format', 'json',
     ], { stdio: ['ignore', 'pipe', 'pipe'] });
@@ -502,11 +506,17 @@ export async function checkTranscript(params: {
     p.on('error', () => {
       finish({ layer: 'transcript_classifier', confidence: 0, meta: { degraded: true, reason: 'spawn_error' } });
     });
-    // Hard timeout — per plan §E1 (2000ms cap)
+    // Hard timeout. Original spec was 2000ms but real-world `claude -p`
+    // spawns a fresh CLI per call with ~2-3s cold-start + 5-12s inference
+    // on ~1KB prompts. At 2s every call timed out, defeating the
+    // classifier entirely (measured: 0% firing rate). At 15s we catch the
+    // long tail; faster prompts return in under 5s. The stream handler
+    // runs this in parallel with the content scan so the latency is
+    // bounded by this timer, not additive to session wall time.
     setTimeout(() => {
       try { p.kill('SIGTERM'); } catch {}
       finish({ layer: 'transcript_classifier', confidence: 0, meta: { degraded: true, reason: 'timeout' } });
-    }, 2000);
+    }, 15000);
   });
 }
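The "parallel with the content scan, not additive" claim in the new comment can be sketched with two stand-in checks; the function names and delays below are illustrative only, not the repo's real layers or timings:

```typescript
// Sketch: two checks run concurrently, so session wall time tracks the
// slower branch (here ~200ms), not the sum (~300ms). Delays are stand-ins.
const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function contentScan(): Promise<string> {
  await delay(100); // stand-in for the fast content scan
  return 'content_scan_done';
}

async function transcriptClassifier(): Promise<string> {
  await delay(200); // stand-in for the slower `claude -p` call
  return 'classifier_done';
}

async function session(): Promise<number> {
  const start = Date.now();
  await Promise.all([contentScan(), transcriptClassifier()]);
  return Date.now() - start;
}
```

Under this structure the 15s cap bounds the classifier branch, and total latency is max(scan, classifier) rather than their sum.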