mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)
* refactor: export readOverlay from model-overlay resolver

  Needed by the overlay-efficacy eval harness to resolve INHERIT directives
  without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

  Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
  harness to drive the model through a closer-to-real Claude Code harness
  than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

  Verifies: the SDK loads, claude-opus-4-7 is a live API model, the
  SDKMessage event shape matches assumptions, and readOverlay resolves
  INHERIT directives and includes the expected content. Run with
  `bun run scripts/preflight-agent-sdk.ts`. PREFLIGHT OK on first run,
  $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

  `test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
  with explicit `AgentSdkResult` types, a process-level API concurrency
  semaphore, and retry for all three 429 shapes (thrown error,
  result-message error, mid-stream SDKRateLimitEvent). Pins the local
  claude binary via `pathToClaudeCodeExecutable`.

  `test/fixtures/overlay-nudges.ts` holds the typed registry. Two fixtures
  for the first measurement: `opus-4-7-fanout-toy` (3-file read) and
  `opus-4-7-fanout-realistic` (mixed-tool audit). A strict validator
  rejects duplicate ids, non-integer trials, unsafe overlay paths, unsafe
  id characters, and missing overlay files at module load. Adding a future
  overlay-nudge eval is one fixture entry.

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

  A stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
  happy-path shape, all three rate-limit shapes plus retry, workspace
  reset on retry, persistent 429 -> `RateLimitExhaustedError`, non-429
  propagation, the process-level concurrency cap, options propagation,
  artifact path uniqueness, cost/turn mapping, and every validator
  rejection case.
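The concurrency semaphore and rate-limit retry described above can be sketched as below. This is a minimal illustration, not the real `agent-sdk-runner.ts`: the names `RateLimitExhaustedError` and the three-attempt retry come from the commit message, while the semaphore implementation and the `withRateLimitRetry` signature are assumptions.

```typescript
// Hypothetical sketch of the runner's two reliability pieces: a
// process-level concurrency semaphore and a retry loop for 429-shaped
// failures. Implementation details are assumed, not taken from gstack.

class Semaphore {
  private queue: Array<() => void> = [];
  private active = 0;
  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // Wait until release() hands this waiter an existing slot.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the slot directly to a waiter
    else this.active--; // or free it
  }
}

class RateLimitExhaustedError extends Error {}

// Retry a task whenever isRateLimited classifies the failure as a rate
// limit, up to maxAttempts; rethrow anything else unchanged.
async function withRateLimitRetry<T>(
  task: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  maxAttempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (!isRateLimited(err)) throw err; // non-429: propagate as-is
      lastErr = err;
    }
  }
  throw new RateLimitExhaustedError(String(lastErr));
}
```

In the real runner the classifier would have to recognize all three 429 shapes the message lists (thrown error, result-message error, mid-stream SDKRateLimitEvent); here it is a single predicate for brevity.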
* test(eval): paid periodic overlay-efficacy harness

  `test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES,
  running two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials
  with bounded concurrency. Both arms use the SDK preset `claude_code`, so
  both include the real Claude Code system prompt; overlay-ON appends the
  resolved overlay text. Per-trial raw event streams are saved to
  `~/.gstack/projects/<slug>/transcripts/` for forensic recovery. Gated on
  `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

  Entries for `overlay-harness-opus-4-7-fanout-toy` and
  `opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
  the fixtures file, the runner, the resolver) and E2E_TIERS (`periodic`).
  Passes the `test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

  Measured counterproductive under the new SDK harness. Baseline Opus 4.7
  emits first-turn parallel tool_use blocks 70% of the time on a 3-file
  read prompt. With the custom nudge: 10%. With Anthropic's own canonical
  `<use_parallel_tool_calls>` block from their parallel-tool-use docs: 0%.
  Both overlays suppress fanout; neither improves it.

  On realistic multi-tool prompts (audit a project: read files + glob +
  summarize), Opus 4.7 never fans out in the first turn regardless of
  overlay: zero of 20 trials. Not a prompt problem.

  Keeping the other three nudges (effort-match, batch questions, literal
  interpretation) pending their own measurement. The harness is ready for
  follow-up fixtures: add one entry to `test/fixtures/overlay-nudges.ts`
  to measure any overlay bullet. Cost of investigation: ~$7 total across
  3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

  A per-fixture tool allowlist unblocks measuring nudges that need
  Edit/Write (e.g. literal-interpretation's 'fix the failing tests' needs
  write access). Per-fixture maxTurns lets harder prompts run longer
  without changing the default. `direction` is cosmetic metadata for test
  output labeling.

  Also adds reusable predicates and metrics:
  - lowerIsBetter20Pct / higherIsBetter20Pct: 20% lift threshold vs baseline
  - bashToolCallCount: count of Bash tool_use across the session
  - turnsToCompletion: SDK-reported num_turns at result
  - uniqueFilesEdited: Edit/Write/MultiEdit file_path set size

  test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
  and fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

  Measures three overlay bullets that haven't been tested yet:

  - claude-dedicated-tools-vs-bash: claude.md says 'prefer
    Read/Edit/Write/Glob/Grep over cat/sed/find/grep'. The fixture prompts
    'list every TypeScript file under src/ and tell me what each exports'
    and counts Bash tool_use across the session. Overlay-ON should drop it
    by >=20%.
  - opus-4-7-effort-match-trivial: opus-4-7.md says 'simple file reads
    don't need deep reasoning.' The fixture uses a trivial one-file prompt
    (a config.json lookup) and measures turns_used. Overlay-ON should be
    <=80% of baseline turns.
  - opus-4-7-literal-interpretation: opus-4-7.md says 'fix ALL failing
    tests, not just the obvious one.' The fixture seeds three failing test
    files with deliberately distinct failure modes and counts unique files
    edited. Overlay-ON should touch >=20% more files.

  Adding a fourth fixture for any remaining overlay nudge is a single
  entry. The harness is now proven on: fanout (deleted after measurement),
  dedicated tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

  Some @anthropic-ai/claude-agent-sdk versions throw from the query
  generator when maxTurns is reached instead of emitting a result message
  with subtype='error_max_turns'. The runner treated that as a
  non-retryable error and killed the whole periodic run on the first
  fixture that exceeded its turn cap.

  Added an isMaxTurnsError() detector and a catch branch that synthesizes
  an AgentSdkResult from the events captured before the throw, with
  exitReason='error_max_turns' and costUsd=0 (unknown on the thrown path).
  The metric function still runs against whatever assistant turns were
  collected, so the trial produces a usable number.

  Hoisted events/assistantTurns/toolCalls/assistantTextParts and the
  timing counters out of the inner try so the catch branch can read them.
  No behavior change on the success path or on rate-limit retry paths.

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

  The prompt 'list every TypeScript file under src/ and tell me what each
  exports' needs 1 turn for Glob, ~5 for Reads, and 1 for the summary.
  The default maxTurns=5 was not enough; a prior run threw from the SDK on
  this fixture and tanked the whole periodic eval. Bumping to 15 gives
  headroom. The runner now also handles max-turns gracefully even if a
  future fixture underestimates, so this is belt and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

  Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
  Tests whether the overlays behave differently on a weaker Claude model
  where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
  Opus, so these 5 add ~$4.50 to a full run.

  Measurement results from the first paired run (100 trials total, ~$14.55):

  - **Sonnet + effort-match shows real overlay benefit.** With the overlay
    on, Sonnet averages 2.5 turns on a trivial `What's the version in
    config.json?` prompt. Without it, Sonnet takes exactly 3.0 turns in
    all 10 trials. That is a ~17% reduction, below the 20% pass threshold,
    but the signal is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3]
    vs overlay-OFF [3,3,3,3,3,3,3,3,3,3].
  - All other Sonnet dimensions are flat (fanout, dedicated-tools, literal
    interpretation). Same as Opus on those axes.
  - Opus effort-match remains flat (2.60 vs 2.50, +4% slower with the
    overlay).

  Implication: the effect is model-stratified. The overlay stack helps
  Sonnet on some axes where it does nothing on Opus. Wholesale removal
  would hurt Sonnet. Per-nudge, per-model measurement is the right move
  going forward.

* chore: bump version to 1.10.1.0

  Updates VERSION, package.json, the CHANGELOG header, and the TODOS
  completion marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
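The "one fixture entry" claim and the 20%-lift threshold can be made concrete with a sketch. The field names (`id`, `trials`, `allowedTools`, `maxTurns`, `direction`) and the predicate names come from the commit message; the exact interface shape is an assumption, not the real `test/fixtures/overlay-nudges.ts`.

```typescript
// Hypothetical shape of an OverlayFixture entry plus the 20%-lift
// predicate. Only the names are taken from the commit message; the
// structure here is illustrative.

interface OverlayFixture {
  id: string;               // validator rejects unsafe id characters
  model: string;            // e.g. 'claude-sonnet-4-6'
  prompt: string;
  trials: number;           // validator rejects non-integer trials
  allowedTools?: string[];  // per-fixture tool allowlist
  maxTurns?: number;        // per-fixture turn cap
  direction?: string;       // cosmetic label for test output
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// 'Lower is better' with a 20% lift threshold: the overlay-ON mean must
// be <= 80% of the overlay-OFF (baseline) mean to count as a pass.
const lowerIsBetter20Pct = (on: number, off: number): boolean =>
  on <= off * 0.8;

// The Sonnet effort-match distributions reported above: 2.5 vs 3.0 turns
// is a ~17% reduction, which is why it lands just under the threshold.
const onTurns = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3]; // mean 2.5
const offTurns = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]; // mean 3.0
console.log(lowerIsBetter20Pct(mean(onTurns), mean(offTurns))); // false: 2.5 > 2.4
```

Under this shape, adding a measurement really is one literal object appended to the registry array, which is what makes the per-nudge, per-model strategy cheap to extend.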
@@ -0,0 +1,133 @@
/**
 * Preflight for the overlay efficacy harness.
 *
 * Confirms, before any paid eval runs:
 * 1. `@anthropic-ai/claude-agent-sdk` loads and `query()` is the expected shape.
 * 2. `claude-opus-4-7` is a live API model ID (not a Claude Code alias).
 * 3. The SDK event stream contains the types we assume (system init, assistant,
 *    result) with the fields we destructure.
 * 4. `scripts/resolvers/model-overlay.ts` resolves `{{INHERIT:claude}}` against
 *    `opus-4-7.md` AND the resolved text contains the "Fan out explicitly" nudge.
 * 5. A local `claude` binary exists at `which claude` so binary pinning is possible.
 *
 * Run: bun run scripts/preflight-agent-sdk.ts
 *
 * Exit 0 on success. Exit non-zero with a clear message on any failure. No
 * side effects beyond stdout and a ~15 token API call.
 */

import { query, type SDKMessage } from '@anthropic-ai/claude-agent-sdk';
import { readOverlay } from './resolvers/model-overlay';
import { execSync } from 'child_process';

async function main() {
  const failures: string[] = [];
  const pass = (msg: string) => console.log(` ok ${msg}`);
  const fail = (msg: string) => {
    console.log(` FAIL ${msg}`);
    failures.push(msg);
  };

  // 1. Overlay resolver + fanout nudge text
  console.log('1. Overlay resolver');
  const resolved = readOverlay('opus-4-7');
  if (!resolved) {
    fail("readOverlay('opus-4-7') returned empty");
  } else {
    pass(`resolved overlay length: ${resolved.length} chars`);
    if (resolved.includes('{{INHERIT:')) {
      fail('resolved overlay still contains {{INHERIT:...}} directive');
    } else {
      pass('no unresolved INHERIT directives');
    }
    if (!/Fan out explicitly/i.test(resolved)) {
      fail('resolved overlay does not contain "Fan out explicitly" text');
    } else {
      pass('fanout nudge text present in resolved overlay');
    }
  }

  // 2. Local claude binary exists
  console.log('\n2. Binary pinning');
  let claudePath: string | null = null;
  try {
    claudePath = execSync('which claude', { encoding: 'utf-8' }).trim();
    pass(`local claude binary: ${claudePath}`);
  } catch {
    fail('`which claude` failed — cannot pin binary');
  }

  // 3. SDK query end-to-end
  console.log('\n3. SDK query end-to-end');
  if (!process.env.ANTHROPIC_API_KEY) {
    console.log(' skip ANTHROPIC_API_KEY not set — cannot test live query');
  } else {
    try {
      const events: SDKMessage[] = [];
      const q = query({
        prompt: 'say pong',
        options: {
          model: 'claude-opus-4-7',
          systemPrompt: '',
          tools: [],
          permissionMode: 'bypassPermissions',
          allowDangerouslySkipPermissions: true,
          settingSources: [],
          maxTurns: 1,
          pathToClaudeCodeExecutable: claudePath ?? undefined,
          env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY },
        },
      });
      for await (const ev of q) events.push(ev);
      pass(`received ${events.length} events`);

      const init = events.find(
        (e) => e.type === 'system' && (e as { subtype?: string }).subtype === 'init',
      ) as { claude_code_version?: string; model?: string } | undefined;
      if (!init) {
        fail('no system/init event received');
      } else {
        pass(`system init: claude_code_version=${init.claude_code_version}, model=${init.model}`);
      }

      const assistantEvents = events.filter((e) => e.type === 'assistant');
      if (assistantEvents.length === 0) {
        fail('no assistant events received — model ID may be rejected');
      } else {
        pass(`received ${assistantEvents.length} assistant event(s)`);
        const first = assistantEvents[0] as { message?: { content?: unknown[] } };
        const content = first.message?.content;
        if (!Array.isArray(content)) {
          fail('first assistant event has no content[] array');
        } else {
          pass(`first assistant content[] has ${content.length} block(s)`);
        }
      }

      const result = events.find((e) => e.type === 'result') as
        | { subtype?: string; total_cost_usd?: number; num_turns?: number }
        | undefined;
      if (!result) {
        fail('no result event received');
      } else {
        pass(
          `result: subtype=${result.subtype}, cost=$${result.total_cost_usd?.toFixed(4)}, turns=${result.num_turns}`,
        );
      }
    } catch (err) {
      fail(`SDK query threw: ${err instanceof Error ? err.message : String(err)}`);
    }
  }

  console.log();
  if (failures.length > 0) {
    console.log(`PREFLIGHT FAILED: ${failures.length} check(s) failed`);
    process.exit(1);
  }
  console.log('PREFLIGHT OK');
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
@@ -24,7 +24,7 @@ const OVERLAY_DIR = path.resolve(import.meta.dir, '../../model-overlays');
 const INHERIT_RE = /^\s*\{\{INHERIT:([a-z0-9-]+(?:\.[0-9]+)*)\}\}\s*\n/;

-function readOverlay(model: string, seen: Set<string> = new Set()): string {
+export function readOverlay(model: string, seen: Set<string> = new Set()): string {
   if (seen.has(model)) return ''; // cycle guard
   seen.add(model);
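The exported `readOverlay` in the hunk above shows only its signature, regex, and cycle guard. A minimal sketch of how INHERIT resolution could work from those pieces follows. The overlay directory path and the file layout (`<model>.md` files under a flat directory) are inferred from the commit message, not confirmed against the real `scripts/resolvers/model-overlay.ts`.

```typescript
// Hypothetical reconstruction of INHERIT resolution. The regex and the
// cycle guard are from the diff; everything else is an assumption.
import * as fs from 'fs';
import * as path from 'path';

const OVERLAY_DIR = '/tmp/model-overlays-demo'; // stand-in for the real dir

const INHERIT_RE = /^\s*\{\{INHERIT:([a-z0-9-]+(?:\.[0-9]+)*)\}\}\s*\n/;

function readOverlay(model: string, seen: Set<string> = new Set()): string {
  if (seen.has(model)) return ''; // cycle guard
  seen.add(model);
  const file = path.join(OVERLAY_DIR, `${model}.md`);
  if (!fs.existsSync(file)) return '';
  const text = fs.readFileSync(file, 'utf-8');
  const m = text.match(INHERIT_RE);
  if (!m) return text;
  // Splice the parent overlay in place of the directive line, so e.g.
  // opus-4-7.md starting with {{INHERIT:claude}} inlines claude.md first.
  return readOverlay(m[1], seen) + '\n' + text.slice(m[0].length);
}
```

The cycle guard via the shared `seen` set is what makes an accidental pair of mutually inheriting overlays terminate with an empty splice instead of recursing forever.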