mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)
* refactor: export readOverlay from model-overlay resolver

  Needed by the overlay-efficacy eval harness to resolve INHERIT directives
  without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

  Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
  harness to drive the model through a closer-to-real Claude Code harness
  than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

  Verifies: the SDK loads, claude-opus-4-7 is a live API model, the
  SDKMessage event shape matches assumptions, and readOverlay resolves
  INHERIT directives and includes the expected content. Run with
  `bun run scripts/preflight-agent-sdk.ts`. PREFLIGHT OK on first run,
  $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

  `test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
  with explicit `AgentSdkResult` types, a process-level API concurrency
  semaphore, and retry for all three 429 shapes (thrown error,
  result-message error, mid-stream SDKRateLimitEvent). Pins the local
  claude binary via `pathToClaudeCodeExecutable`.

  `test/fixtures/overlay-nudges.ts` holds the typed registry. Two fixtures
  for the first measurement: `opus-4-7-fanout-toy` (3-file read) and
  `opus-4-7-fanout-realistic` (mixed-tool audit). A strict validator
  rejects duplicate ids, non-integer trials, unsafe overlay paths, unsafe
  id characters, and missing overlay files at module load. Adding a future
  overlay-nudge eval is one fixture entry.

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

  A stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
  happy-path shape, all three rate-limit shapes plus retry, workspace
  reset on retry, persistent 429 -> `RateLimitExhaustedError`, non-429
  propagation, the process-level concurrency cap, options propagation,
  artifact path uniqueness, cost/turn mapping, and every validator
  rejection case.
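The concurrency semaphore and rate-limit retry described above can be sketched as below. This is a minimal illustration, not the real `agent-sdk-runner.ts`: the names `RateLimitExhaustedError` and the three-attempt retry come from the commit message, while the semaphore implementation and the `withRateLimitRetry` signature are assumptions.

```typescript
// Hypothetical sketch of the runner's two reliability pieces: a
// process-level concurrency semaphore and a retry loop for 429-shaped
// failures. Implementation details are assumed, not taken from gstack.

class Semaphore {
  private queue: Array<() => void> = [];
  private active = 0;
  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // Wait until release() hands this waiter an existing slot.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the slot directly to a waiter
    else this.active--; // or free it
  }
}

class RateLimitExhaustedError extends Error {}

// Retry a task whenever isRateLimited classifies the failure as a rate
// limit, up to maxAttempts; rethrow anything else unchanged.
async function withRateLimitRetry<T>(
  task: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  maxAttempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (!isRateLimited(err)) throw err; // non-429: propagate as-is
      lastErr = err;
    }
  }
  throw new RateLimitExhaustedError(String(lastErr));
}
```

In the real runner the classifier would have to recognize all three 429 shapes the message lists (thrown error, result-message error, mid-stream SDKRateLimitEvent); here it is a single predicate for brevity.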
* test(eval): paid periodic overlay-efficacy harness

  `test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES,
  running two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials
  with bounded concurrency. Both arms use the SDK preset `claude_code`, so
  both include the real Claude Code system prompt; overlay-ON appends the
  resolved overlay text. Per-trial raw event streams are saved to
  `~/.gstack/projects/<slug>/transcripts/` for forensic recovery. Gated on
  `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

  Entries for `overlay-harness-opus-4-7-fanout-toy` and
  `opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
  the fixtures file, the runner, the resolver) and E2E_TIERS (`periodic`).
  Passes the `test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

  Measured counterproductive under the new SDK harness. Baseline Opus 4.7
  emits first-turn parallel tool_use blocks 70% of the time on a 3-file
  read prompt. With the custom nudge: 10%. With Anthropic's own canonical
  `<use_parallel_tool_calls>` block from their parallel-tool-use docs: 0%.
  Both overlays suppress fanout; neither improves it.

  On realistic multi-tool prompts (audit a project: read files + glob +
  summarize), Opus 4.7 never fans out in the first turn regardless of
  overlay: zero of 20 trials. Not a prompt problem.

  Keeping the other three nudges (effort-match, batch questions, literal
  interpretation) pending their own measurement. The harness is ready for
  follow-up fixtures: add one entry to `test/fixtures/overlay-nudges.ts`
  to measure any overlay bullet. Cost of investigation: ~$7 total across
  3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

  A per-fixture tool allowlist unblocks measuring nudges that need
  Edit/Write (e.g. literal-interpretation's 'fix the failing tests' needs
  write access). Per-fixture maxTurns lets harder prompts run longer
  without changing the default. `direction` is cosmetic metadata for test
  output labeling.

  Also adds reusable predicates and metrics:
  - lowerIsBetter20Pct / higherIsBetter20Pct: 20% lift threshold vs baseline
  - bashToolCallCount: count of Bash tool_use across the session
  - turnsToCompletion: SDK-reported num_turns at result
  - uniqueFilesEdited: Edit/Write/MultiEdit file_path set size

  test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
  and fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

  Measures three overlay bullets that haven't been tested yet:

  - claude-dedicated-tools-vs-bash: claude.md says 'prefer
    Read/Edit/Write/Glob/Grep over cat/sed/find/grep'. The fixture prompts
    'list every TypeScript file under src/ and tell me what each exports'
    and counts Bash tool_use across the session. Overlay-ON should drop it
    by >=20%.
  - opus-4-7-effort-match-trivial: opus-4-7.md says 'simple file reads
    don't need deep reasoning.' The fixture uses a trivial one-file prompt
    (a config.json lookup) and measures turns_used. Overlay-ON should be
    <=80% of baseline turns.
  - opus-4-7-literal-interpretation: opus-4-7.md says 'fix ALL failing
    tests, not just the obvious one.' The fixture seeds three failing test
    files with deliberately distinct failure modes and counts unique files
    edited. Overlay-ON should touch >=20% more files.

  Adding a fourth fixture for any remaining overlay nudge is a single
  entry. The harness is now proven on: fanout (deleted after measurement),
  dedicated tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

  Some @anthropic-ai/claude-agent-sdk versions throw from the query
  generator when maxTurns is reached instead of emitting a result message
  with subtype='error_max_turns'. The runner treated that as a
  non-retryable error and killed the whole periodic run on the first
  fixture that exceeded its turn cap.

  Added an isMaxTurnsError() detector and a catch branch that synthesizes
  an AgentSdkResult from the events captured before the throw, with
  exitReason='error_max_turns' and costUsd=0 (unknown on the thrown path).
  The metric function still runs against whatever assistant turns were
  collected, so the trial produces a usable number.

  Hoisted events/assistantTurns/toolCalls/assistantTextParts and the
  timing counters out of the inner try so the catch branch can read them.
  No behavior change on the success path or on rate-limit retry paths.

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

  The prompt 'list every TypeScript file under src/ and tell me what each
  exports' needs 1 turn for Glob, ~5 for Reads, and 1 for the summary.
  The default maxTurns=5 was not enough; a prior run threw from the SDK on
  this fixture and tanked the whole periodic eval. Bumping to 15 gives
  headroom. The runner now also handles max-turns gracefully even if a
  future fixture underestimates, so this is belt and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

  Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
  Tests whether the overlays behave differently on a weaker Claude model
  where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
  Opus, so these 5 add ~$4.50 to a full run.

  Measurement results from the first paired run (100 trials total, ~$14.55):

  - **Sonnet + effort-match shows real overlay benefit.** With the overlay
    on, Sonnet averages 2.5 turns on a trivial `What's the version in
    config.json?` prompt. Without it, Sonnet takes exactly 3.0 turns in
    all 10 trials. That is a ~17% reduction, below the 20% pass threshold,
    but the signal is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3]
    vs overlay-OFF [3,3,3,3,3,3,3,3,3,3].
  - All other Sonnet dimensions are flat (fanout, dedicated-tools, literal
    interpretation). Same as Opus on those axes.
  - Opus effort-match remains flat (2.60 vs 2.50, +4% slower with the
    overlay).

  Implication: the effect is model-stratified. The overlay stack helps
  Sonnet on some axes where it does nothing on Opus. Wholesale removal
  would hurt Sonnet. Per-nudge, per-model measurement is the right move
  going forward.

* chore: bump version to 1.10.1.0

  Updates VERSION, package.json, the CHANGELOG header, and the TODOS
  completion marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
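The "one fixture entry" claim and the 20%-lift threshold can be made concrete with a sketch. The field names (`id`, `trials`, `allowedTools`, `maxTurns`, `direction`) and the predicate names come from the commit message; the exact interface shape is an assumption, not the real `test/fixtures/overlay-nudges.ts`.

```typescript
// Hypothetical shape of an OverlayFixture entry plus the 20%-lift
// predicate. Only the names are taken from the commit message; the
// structure here is illustrative.

interface OverlayFixture {
  id: string;               // validator rejects unsafe id characters
  model: string;            // e.g. 'claude-sonnet-4-6'
  prompt: string;
  trials: number;           // validator rejects non-integer trials
  allowedTools?: string[];  // per-fixture tool allowlist
  maxTurns?: number;        // per-fixture turn cap
  direction?: string;       // cosmetic label for test output
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// 'Lower is better' with a 20% lift threshold: the overlay-ON mean must
// be <= 80% of the overlay-OFF (baseline) mean to count as a pass.
const lowerIsBetter20Pct = (on: number, off: number): boolean =>
  on <= off * 0.8;

// The Sonnet effort-match distributions reported above: 2.5 vs 3.0 turns
// is a ~17% reduction, which is why it lands just under the threshold.
const onTurns = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3]; // mean 2.5
const offTurns = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]; // mean 3.0
console.log(lowerIsBetter20Pct(mean(onTurns), mean(offTurns))); // false: 2.5 > 2.4
```

Under this shape, adding a measurement really is one literal object appended to the registry array, which is what makes the per-nudge, per-model strategy cheap to extend.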
@@ -0,0 +1,133 @@
/**
 * Preflight for the overlay efficacy harness.
 *
 * Confirms, before any paid eval runs:
 * 1. `@anthropic-ai/claude-agent-sdk` loads and `query()` is the expected shape.
 * 2. `claude-opus-4-7` is a live API model ID (not a Claude Code alias).
 * 3. The SDK event stream contains the types we assume (system init, assistant,
 *    result) with the fields we destructure.
 * 4. `scripts/resolvers/model-overlay.ts` resolves `{{INHERIT:claude}}` against
 *    `opus-4-7.md` AND the resolved text contains the "Fan out explicitly" nudge.
 * 5. A local `claude` binary exists at `which claude` so binary pinning is possible.
 *
 * Run: bun run scripts/preflight-agent-sdk.ts
 *
 * Exit 0 on success. Exit non-zero with a clear message on any failure. No
 * side effects beyond stdout and a ~15 token API call.
 */

import { query, type SDKMessage } from '@anthropic-ai/claude-agent-sdk';
import { readOverlay } from './resolvers/model-overlay';
import { execSync } from 'child_process';

async function main() {
  const failures: string[] = [];
  const pass = (msg: string) => console.log(` ok ${msg}`);
  const fail = (msg: string) => {
    console.log(` FAIL ${msg}`);
    failures.push(msg);
  };

  // 1. Overlay resolver + fanout nudge text
  console.log('1. Overlay resolver');
  const resolved = readOverlay('opus-4-7');
  if (!resolved) {
    fail("readOverlay('opus-4-7') returned empty");
  } else {
    pass(`resolved overlay length: ${resolved.length} chars`);
    if (resolved.includes('{{INHERIT:')) {
      fail('resolved overlay still contains {{INHERIT:...}} directive');
    } else {
      pass('no unresolved INHERIT directives');
    }
    if (!/Fan out explicitly/i.test(resolved)) {
      fail('resolved overlay does not contain "Fan out explicitly" text');
    } else {
      pass('fanout nudge text present in resolved overlay');
    }
  }

  // 2. Local claude binary exists
  console.log('\n2. Binary pinning');
  let claudePath: string | null = null;
  try {
    claudePath = execSync('which claude', { encoding: 'utf-8' }).trim();
    pass(`local claude binary: ${claudePath}`);
  } catch {
    fail('`which claude` failed — cannot pin binary');
  }

  // 3. SDK query end-to-end
  console.log('\n3. SDK query end-to-end');
  if (!process.env.ANTHROPIC_API_KEY) {
    console.log(' skip ANTHROPIC_API_KEY not set — cannot test live query');
  } else {
    try {
      const events: SDKMessage[] = [];
      const q = query({
        prompt: 'say pong',
        options: {
          model: 'claude-opus-4-7',
          systemPrompt: '',
          tools: [],
          permissionMode: 'bypassPermissions',
          allowDangerouslySkipPermissions: true,
          settingSources: [],
          maxTurns: 1,
          pathToClaudeCodeExecutable: claudePath ?? undefined,
          env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY },
        },
      });
      for await (const ev of q) events.push(ev);
      pass(`received ${events.length} events`);

      const init = events.find(
        (e) => e.type === 'system' && (e as { subtype?: string }).subtype === 'init',
      ) as { claude_code_version?: string; model?: string } | undefined;
      if (!init) {
        fail('no system/init event received');
      } else {
        pass(`system init: claude_code_version=${init.claude_code_version}, model=${init.model}`);
      }

      const assistantEvents = events.filter((e) => e.type === 'assistant');
      if (assistantEvents.length === 0) {
        fail('no assistant events received — model ID may be rejected');
      } else {
        pass(`received ${assistantEvents.length} assistant event(s)`);
        const first = assistantEvents[0] as { message?: { content?: unknown[] } };
        const content = first.message?.content;
        if (!Array.isArray(content)) {
          fail('first assistant event has no content[] array');
        } else {
          pass(`first assistant content[] has ${content.length} block(s)`);
        }
      }

      const result = events.find((e) => e.type === 'result') as
        | { subtype?: string; total_cost_usd?: number; num_turns?: number }
        | undefined;
      if (!result) {
        fail('no result event received');
      } else {
        pass(
          `result: subtype=${result.subtype}, cost=$${result.total_cost_usd?.toFixed(4)}, turns=${result.num_turns}`,
        );
      }
    } catch (err) {
      fail(`SDK query threw: ${err instanceof Error ? err.message : String(err)}`);
    }
  }

  console.log();
  if (failures.length > 0) {
    console.log(`PREFLIGHT FAILED: ${failures.length} check(s) failed`);
    process.exit(1);
  }
  console.log('PREFLIGHT OK');
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
@@ -24,7 +24,7 @@ const OVERLAY_DIR = path.resolve(import.meta.dir, '../../model-overlays');
 const INHERIT_RE = /^\s*\{\{INHERIT:([a-z0-9-]+(?:\.[0-9]+)*)\}\}\s*\n/;

-function readOverlay(model: string, seen: Set<string> = new Set()): string {
+export function readOverlay(model: string, seen: Set<string> = new Set()): string {
   if (seen.has(model)) return ''; // cycle guard
   seen.add(model);
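The exported `readOverlay` in the hunk above shows only its signature, regex, and cycle guard. A minimal sketch of how INHERIT resolution could work from those pieces follows. The overlay directory path and the file layout (`<model>.md` files under a flat directory) are inferred from the commit message, not confirmed against the real `scripts/resolvers/model-overlay.ts`.

```typescript
// Hypothetical reconstruction of INHERIT resolution. The regex and the
// cycle guard are from the diff; everything else is an assumption.
import * as fs from 'fs';
import * as path from 'path';

const OVERLAY_DIR = '/tmp/model-overlays-demo'; // stand-in for the real dir

const INHERIT_RE = /^\s*\{\{INHERIT:([a-z0-9-]+(?:\.[0-9]+)*)\}\}\s*\n/;

function readOverlay(model: string, seen: Set<string> = new Set()): string {
  if (seen.has(model)) return ''; // cycle guard
  seen.add(model);
  const file = path.join(OVERLAY_DIR, `${model}.md`);
  if (!fs.existsSync(file)) return '';
  const text = fs.readFileSync(file, 'utf-8');
  const m = text.match(INHERIT_RE);
  if (!m) return text;
  // Splice the parent overlay in place of the directive line, so e.g.
  // opus-4-7.md starting with {{INHERIT:claude}} inlines claude.md first.
  return readOverlay(m[1], seen) + '\n' + text.slice(m[0].length);
}
```

The cycle guard via the shared `seen` set is what makes an accidental pair of mutually inheriting overlays terminate with an empty splice instead of recursing forever.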