feat(security): always run Haiku on tool outputs (drop the L4 gate)

The tool-result scan previously short-circuited when both content
layers, L4 (TestSavantAI) and L4c, scored below WARN, and then only
ran Haiku when shouldRunTranscriptCheck saw a layer firing at >=
LOG_ONLY. On BrowseSafe-Bench that meant Haiku almost never ran,
because TestSavantAI has ~15% recall on browser-agent-specific
attacks (social engineering, indirect injection). We were gating our
best signal on our weakest.

Run all three classifiers (L4 + L4c + Haiku) in parallel. Cost:
~$0.002 + ~8s Haiku wall time per tool result, bounded by the 15s
Haiku timeout. Because Haiku runs concurrently with the content
scans, that latency counts only against the stream-handler budget,
not against session wall time.
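
In sketch form, the bound on the Haiku leg looks roughly like this
(illustrative only: withTimeout, the fallback signal, and the
LayerSignal shape below are assumptions, not the repo's actual
helpers):

  // Sketch: all three classifiers start at once; total wall time is
  // max(L4, L4c, Haiku), with the Haiku leg capped at 15s.
  interface LayerSignal { layer: string; confidence: number; }

  // Hypothetical helper: resolve to a degraded fallback instead of
  // rejecting, so a slow Haiku call never fails the whole ensemble.
  function withTimeout<T>(p: Promise<T>, ms: number, fallback: T): Promise<T> {
    return Promise.race([
      p,
      new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms)),
    ]);
  }

  async function scanToolResult(
    l4: Promise<LayerSignal>,
    l4c: Promise<LayerSignal>,
    haiku: Promise<LayerSignal>,
  ): Promise<LayerSignal[]> {
    return Promise.all([
      l4,
      l4c,
      withTimeout(haiku, 15_000, { layer: 'haiku', confidence: 0 }),
    ]);
  }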

User-input pre-spawn path unchanged — shouldRunTranscriptCheck still
gates there. The Stack Overflow FP mitigation that the original gate was
built for still applies to direct user input; tool outputs have
different characteristics.
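
For reference, the gate kept on that path works roughly like this
(sketch with assumed threshold values and signal shape, not the
repo's real constants):

  // Hypothetical sketch of the kept user-input gate: spend Haiku only
  // when a cheaper layer already fired at LOG_ONLY or above, which is
  // what keeps benign Stack Overflow style pastes from paying for a
  // Haiku call on every message.
  const THRESHOLDS = { LOG_ONLY: 0.3, WARN: 0.6 }; // assumed values

  function shouldRunTranscriptCheck(
    signals: Array<{ confidence: number }>,
  ): boolean {
    return signals.some((s) => s.confidence >= THRESHOLDS.LOG_ONLY);
  }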

Source-contract test updated to pin the new parallel-three shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-20 21:15:57 +08:00
parent 5d968c43ec
commit b515f31400
2 changed files with 23 additions and 13 deletions
@@ -624,22 +624,26 @@ async function askClaude(queueEntry: QueueEntry): Promise<void> {
     scan: async (toolName: string, text: string) => {
       if (toolResultBlockFired) return;
-      // Parallel L4 + L4c ensemble scan (DeBERTa no-op when disabled).
-      const [contentSignal, debertaSignal] = await Promise.all([
+      // We run L4/L4c AND Haiku in parallel on tool outputs regardless of
+      // L4's score, because BrowseSafe-Bench shows L4 (TestSavantAI) has
+      // low recall on browser-agent-specific attacks (~15% at v1). Gating
+      // Haiku on L4 meant our best signal almost never ran. The cost is
+      // ~$0.002 + ~8s per tool output, bounded by the 15s Haiku timeout
+      // and offset by Haiku actually seeing the real attack context.
+      //
+      // Haiku only runs when the Claude CLI is available (checkHaikuAvailable
+      // caches the probe). In environments without it, the call returns a
+      // degraded signal and the verdict falls back to L4 alone.
+      const [contentSignal, debertaSignal, transcriptSignal] = await Promise.all([
         scanPageContent(text),
         scanPageContentDeberta(text),
-      ]);
-      // Short-circuit if neither content layer crossed WARN — no point
-      // spinning up Haiku for a clean scan.
-      const maxContent = Math.max(contentSignal.confidence, debertaSignal.confidence);
-      if (maxContent < THRESHOLDS.WARN) return;
-      const signals: LayerSignal[] = [contentSignal, debertaSignal];
-      if (shouldRunTranscriptCheck(signals)) {
-        signals.push(await checkTranscript({
+        checkTranscript({
           user_message: queueEntry.message ?? '',
           tool_calls: [{ tool_name: toolName, tool_input: {} }],
           tool_output: text,
-        }));
-      }
+        }),
+      ]);
+      const signals: LayerSignal[] = [contentSignal, debertaSignal, transcriptSignal];
       const result = combineVerdict(signals, { toolOutput: true });
       if (result.verdict !== 'block') return;
       toolResultBlockFired = true;
@@ -116,8 +116,14 @@ describe('askClaude — pre-spawn + tool-result defense wiring', () => {
     expect(AGENT_SRC).toContain('}, 2000);');
   });
-  test('tool-result scan short-circuits when both content layers below WARN', () => {
-    expect(AGENT_SRC).toMatch(/maxContent < THRESHOLDS\.WARN/);
+  test('tool-result scan runs all three classifiers in parallel (no L4 gate)', () => {
+    // Regression guard for the Haiku-always change. Previously the scan
+    // short-circuited when L4/L4c both returned below WARN, which meant
+    // Haiku (our best signal per BrowseSafe-Bench) rarely ran. Now we run
+    // all three in parallel and let combineVerdict decide.
+    expect(AGENT_SRC).toMatch(/scanPageContent\(text\),[\s\S]*scanPageContentDeberta\(text\),[\s\S]*checkTranscript\(/);
+    // The old short-circuit must be gone.
+    expect(AGENT_SRC).not.toMatch(/if \(maxContent < THRESHOLDS\.WARN\) return;/);
   });
   test('onCanaryLeaked fires both security_event and agent_error for legacy clients', () => {