mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
e3d7f49c74
* refactor: export readOverlay from model-overlay resolver

  Needed by the overlay-efficacy eval harness to resolve INHERIT directives
  without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

  Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
  harness to drive the model through a closer-to-real Claude Code harness
  than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

  Verifies: SDK loads, claude-opus-4-7 is a live API model, SDKMessage event
  shape matches assumptions, readOverlay resolves INHERIT directives and
  includes expected content. Run with `bun run scripts/preflight-agent-sdk.ts`.
  PREFLIGHT OK on first run, $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

  `test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
  with explicit `AgentSdkResult` types, a process-level API concurrency
  semaphore, and 3-shape 429 retry (thrown error, result-message error,
  mid-stream SDKRateLimitEvent). Pins the local claude binary via
  `pathToClaudeCodeExecutable`.

  `test/fixtures/overlay-nudges.ts` holds the typed registry. Two fixtures
  for the first measurement: `opus-4-7-fanout-toy` (3-file read) and
  `opus-4-7-fanout-realistic` (mixed-tool audit). A strict validator rejects
  duplicate ids, non-integer trials, unsafe overlay paths, non-safe id chars,
  and missing overlay files at module load. Adding a future overlay nudge
  eval = one fixture entry.

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

  A stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
  happy-path shape, all 3 rate-limit shapes + retry, workspace reset on
  retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation,
  process-level concurrency cap, options propagation, artifact path
  uniqueness, cost/turn mapping, and every validator rejection case.
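The process-level concurrency semaphore the runner commit above describes can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual `agent-sdk-runner.ts` code: `Semaphore`, `withApiSlot`, and the limit of 3 are all assumed names and values.

```typescript
// Minimal counting semaphore — assumed shape only; the real
// agent-sdk-runner.ts implementation may differ.
class Semaphore {
  private waiters: Array<() => void> = [];
  private active = 0;

  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // At capacity: park until a release hands this caller the slot.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot directly to a waiter; `active` stays the same
    } else {
      this.active--;
    }
  }
}

// One process-wide cap on concurrent API calls (3 is an illustrative limit).
const apiSemaphore = new Semaphore(3);

async function withApiSlot<T>(fn: () => Promise<T>): Promise<T> {
  await apiSemaphore.acquire();
  try {
    return await fn();
  } finally {
    apiSemaphore.release();
  }
}
```

Because the cap lives in one module-level instance, every caller in the process shares it, so the true limit holds even when several fixtures run their trial loops concurrently.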
* test(eval): paid periodic overlay-efficacy harness

  `test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES, running
  two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with bounded
  concurrency. Arms use SDK preset `claude_code` so both include the real
  Claude Code system prompt; overlay-ON appends the resolved overlay text.
  Saves per-trial raw event streams to
  `~/.gstack/projects/<slug>/transcripts/` for forensic recovery. Gated on
  `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

  Entries for `overlay-harness-opus-4-7-fanout-toy` and
  `opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
  fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes the
  `test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

  Measured counterproductive under the new SDK harness. Baseline Opus 4.7
  emits first-turn parallel tool_use blocks 70% of the time on a 3-file read
  prompt. With the custom nudge: 10%. With Anthropic's own canonical
  `<use_parallel_tool_calls>` block from their parallel-tool-use docs: 0%.
  Both overlays suppress fanout; neither improves it.

  On realistic multi-tool prompts (audit a project: read files + glob +
  summarize), Opus 4.7 never fans out in the first turn regardless of
  overlay. Zero of 20 trials. Not a prompt problem.

  Keeping the other three nudges (effort-match, batch questions, literal
  interpretation) pending their own measurement. The harness is ready for
  follow-up fixtures — add one entry to `test/fixtures/overlay-nudges.ts` to
  measure any overlay bullet. Cost of investigation: ~$7 total across 3 eval
  runs.

* chore: bump version and changelog (v1.6.5.0)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

  A per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
  (e.g. literal-interpretation 'fix the failing tests' needs write access).
  Per-fixture maxTurns lets harder prompts run longer without changing the
  default. `direction` is cosmetic metadata for test output labeling.

  Also adds reusable predicates and metrics:
  - lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
  - bashToolCallCount — count of Bash tool_use across the session
  - turnsToCompletion — SDK-reported num_turns at result
  - uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

  test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools and
  fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

  Measures three overlay bullets that haven't been tested yet:
  - claude-dedicated-tools-vs-bash — claude.md says 'prefer
    Read/Edit/Write/Glob/Grep over cat/sed/find/grep'. The fixture prompts
    'list every TypeScript file under src/ and tell me what each exports' and
    counts Bash tool_use across the session. Overlay-ON should drop it by
    >=20%.
  - opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads don't
    need deep reasoning.' The fixture uses a trivial one-file prompt
    (config.json lookup) and measures turns_used. Overlay-ON should be <=80%
    of baseline turns.
  - opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing
    tests, not just the obvious one.' The fixture seeds three failing test
    files with deliberately distinct failure modes and counts unique files
    edited. Overlay-ON should touch >=20% more files.

  Adding a fourth fixture for any remaining overlay nudge is a single entry.
  The harness is now proven on: fanout (deleted after measurement), dedicated
  tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

  Some @anthropic-ai/claude-agent-sdk versions throw from the query generator
  when maxTurns is reached, instead of emitting a result message with
  subtype='error_max_turns'.
  The runner treated that as a non-retryable error and killed the whole
  periodic run on the first fixture that exceeded its turn cap.

  Added an isMaxTurnsError() detector and a catch branch that synthesizes an
  AgentSdkResult from events captured before the throw, with
  exitReason='error_max_turns' and costUsd=0 (unknown from the thrown path).
  The metric function still runs against whatever assistant turns were
  collected, so the trial produces a usable number.

  Hoisted events/assistantTurns/toolCalls/assistantTextParts and the timing
  counters out of the inner try so the catch branch can read them. No
  behavior change on the success path or on rate-limit retry paths.

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

  The prompt 'list every TypeScript file under src/ and tell me what each
  exports' needs 1 turn for Glob + ~5 for Reads + 1 for summary. The default
  maxTurns=5 was not enough; a prior run threw from the SDK on this fixture
  and tanked the whole periodic eval. Bumping to 15 gives headroom. The
  runner now also handles max-turns gracefully even if a future fixture
  underestimates, so this is belt and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

  Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
  Tests whether the overlays behave differently on a weaker Claude model
  where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
  Opus, so these 5 add ~$4.50 to a full run.

  Measurement result from the first paired run (100 trials total, ~$14.55):
  - **Sonnet + effort-match shows real overlay benefit.** With the overlay
    on, Sonnet takes 2.5 turns on a trivial `What's the version in
    config.json?` prompt. Without, it takes exactly 3.0 turns in all 10
    trials. ~17% reduction, below the 20% pass threshold, but the signal is
    clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF
    [3,3,3,3,3,3,3,3,3,3].
  - All other Sonnet dimensions are flat (fanout, dedicated-tools, literal
    interpretation). Same as Opus on those axes.
  - Opus effort-match remains flat (2.60 vs 2.50, +4% slower with overlay).

  Implication: model-stratified. The overlay stack helps Sonnet on some axes
  where it does nothing on Opus. Wholesale removal would hurt Sonnet.
  Per-nudge per-model measurement is the right move going forward.

* chore: bump version to 1.10.1.0

  Updates VERSION, package.json, CHANGELOG header, and TODOS completion
  marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
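The effort-match arithmetic above can be made concrete with a sketch of the 20%-lift predicates. This is illustrative only, not the actual code in `test/fixtures/overlay-nudges.ts`: the `Arms` shape and the exact predicate signatures are assumptions based on the commit text.

```typescript
// Assumed shape: per-trial metric arrays for the two arms.
interface Arms {
  overlay: number[]; // overlay-ON per-trial metrics
  off: number[];     // overlay-OFF per-trial metrics
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// Passes when the overlay-ON mean is at least 20% lower than baseline.
const lowerIsBetter20Pct = (arms: Arms): boolean =>
  mean(arms.overlay) <= 0.8 * mean(arms.off);

// Passes when the overlay-ON mean is at least 20% higher than baseline.
const higherIsBetter20Pct = (arms: Arms): boolean =>
  mean(arms.overlay) >= 1.2 * mean(arms.off);

// The Sonnet effort-match distributions reported above:
const sonnetEffortMatch: Arms = {
  overlay: [2, 2, 2, 2, 2, 3, 3, 3, 3, 3], // mean 2.5 turns
  off: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],     // mean 3.0 turns
};

// 1 - 2.5/3.0 ≈ 16.7%: a real reduction, but short of the 20% threshold,
// so lowerIsBetter20Pct(sonnetEffortMatch) is false.
const reduction =
  1 - mean(sonnetEffortMatch.overlay) / mean(sonnetEffortMatch.off);
```

This is why the run above reads as "clean signal, below threshold": the predicate is a blunt pass/fail gate, while the raw distributions carry the evidence.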
321 lines
11 KiB
TypeScript
/**
 * Overlay-efficacy harness (periodic tier, paid).
 *
 * Measures whether a model-specific overlay nudge actually changes model
 * behavior when run through the real Claude Agent SDK — the harness
 * Claude Code itself is built on. This complements test/skill-e2e-opus-47.test.ts,
 * which measures the same thing via a `claude -p` subprocess (a different
 * harness with different prompt composition).
 *
 * For each fixture in test/fixtures/overlay-nudges.ts, runs two arms at
 * `fixture.trials` trials per arm with bounded concurrency:
 *   - overlay-on: systemPrompt preset 'claude_code' with the resolved
 *     overlay text appended
 *   - overlay-off: systemPrompt preset 'claude_code' alone
 *
 * Both arms have no CLAUDE.md, no skills directory, no setting-source
 * inheritance (settingSources: []). This is the TRUE bare comparison —
 * the only variable is the overlay text.
 *
 * Budget ~$20 per run at 40 trials (2 fixtures × 2 arms × 10 trials).
 * Gated by EVALS=1 AND EVALS_TIER=periodic. Never runs under test:gate.
 */

import { describe, test, expect, afterAll } from 'bun:test';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
import {
  runAgentSdkTest,
  resolveClaudeBinary,
  type AgentSdkResult,
  type SystemPromptOption,
} from './helpers/agent-sdk-runner';
import { EvalCollector, getProjectEvalDir } from './helpers/eval-store';
import {
  OVERLAY_FIXTURES,
  type OverlayFixture,
} from './fixtures/overlay-nudges';
import { readOverlay } from '../scripts/resolvers/model-overlay';

const evalsEnabled = !!process.env.EVALS;
const periodicTier = process.env.EVALS_TIER === 'periodic';
const shouldRun = evalsEnabled && periodicTier;

const describeE2E = shouldRun ? describe : describe.skip;
// EvalCollector's tier must be 'e2e' | 'llm-judge' per its type signature.
// The existing paid evals violate this by passing descriptive names like
// 'e2e-opus-47' — a pre-existing pattern that only works because bun-test
// runs without strict typechecking. We stay conforming here.
const evalCollector = shouldRun ? new EvalCollector('e2e') : null;

const REPO_ROOT = path.resolve(import.meta.dir, '..');
const runId = new Date()
  .toISOString()
  .replace(/[:.]/g, '')
  .replace('T', '-')
  .slice(0, 15);
const TRANSCRIPTS_DIR = path.join(
  path.dirname(getProjectEvalDir()),
  'transcripts',
  `overlay-harness-${runId}`,
);

// ---------------------------------------------------------------------------
// Per-arm helpers
// ---------------------------------------------------------------------------

type Arm = 'overlay-on' | 'overlay-off';

function mkTrialDir(fixtureId: string, arm: Arm, n: number): string {
  const dir = fs.mkdtempSync(
    path.join(os.tmpdir(), `overlay-harness-${fixtureId}-${arm}-${n}-`),
  );
  return dir;
}

function saveRawTranscript(
  fixtureId: string,
  arm: Arm,
  n: number,
  result: AgentSdkResult,
): void {
  fs.mkdirSync(TRANSCRIPTS_DIR, { recursive: true });
  const out = path.join(TRANSCRIPTS_DIR, `${fixtureId}-${arm}-${n}.jsonl`);
  const lines = result.events.map((e) => JSON.stringify(e));
  fs.writeFileSync(out, lines.join('\n') + '\n');
}

function overlayContentFor(fixture: OverlayFixture): string {
  const family = path.basename(fixture.overlayPath, '.md');
  const resolved = readOverlay(family);
  if (!resolved) {
    throw new Error(
      `fixture ${fixture.id}: resolver returned empty content for ${family}`,
    );
  }
  return resolved;
}

// ---------------------------------------------------------------------------
// Per-fixture runner
// ---------------------------------------------------------------------------

interface ArmResult {
  metrics: number[];
  costs: number[];
  durations: number[];
  rateLimitExhausted: number;
  sdkClaudeCodeVersions: Set<string>;
}

async function runArm(
  fixture: OverlayFixture,
  arm: Arm,
  systemPrompt: SystemPromptOption,
  claudeBinary: string | null,
): Promise<ArmResult> {
  const result: ArmResult = {
    metrics: [],
    costs: [],
    durations: [],
    rateLimitExhausted: 0,
    sdkClaudeCodeVersions: new Set(),
  };

  const trials = fixture.trials;
  const concurrency = fixture.concurrency ?? 3;

  // Simple bounded executor: run trials in chunks of `concurrency`.
  // The process-level semaphore in agent-sdk-runner.ts enforces the true cap.
  let nextTrial = 0;
  const workers = Array.from({ length: concurrency }, async () => {
    while (true) {
      const n = nextTrial++;
      if (n >= trials) return;

      const dir = mkTrialDir(fixture.id, arm, n);
      fixture.setupWorkspace(dir);
      try {
        const sdkResult = await runAgentSdkTest({
          systemPrompt,
          userPrompt: fixture.userPrompt,
          workingDirectory: dir,
          model: fixture.model,
          maxTurns: fixture.maxTurns ?? 5,
          allowedTools: fixture.allowedTools ?? ['Read', 'Glob', 'Grep', 'Bash'],
          permissionMode: 'bypassPermissions',
          settingSources: [],
          env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY ?? '' },
          pathToClaudeCodeExecutable: claudeBinary ?? undefined,
          testName: `${fixture.id}-${arm}-${n}`,
          runId,
          fixtureId: fixture.id,
          onRetry: (_) => {
            // Reset the workspace before the retry so partial Bash side effects
            // from the failed attempt don't contaminate.
            fs.rmSync(dir, { recursive: true, force: true });
            fs.mkdirSync(dir, { recursive: true });
            fixture.setupWorkspace(dir);
          },
        });

        saveRawTranscript(fixture.id, arm, n, sdkResult);

        const metric = fixture.metric(sdkResult);
        result.metrics.push(metric);
        result.costs.push(sdkResult.costUsd);
        result.durations.push(sdkResult.durationMs);
        result.sdkClaudeCodeVersions.add(sdkResult.sdkClaudeCodeVersion);

        evalCollector?.addTest({
          name: `${fixture.id}-${arm}-${n}`,
          suite: 'overlay-harness',
          tier: 'e2e',
          passed: true,
          duration_ms: sdkResult.durationMs,
          cost_usd: sdkResult.costUsd,
          transcript: sdkResult.events,
          prompt: fixture.userPrompt,
          output: sdkResult.output,
          turns_used: sdkResult.turnsUsed,
          browse_errors: sdkResult.browseErrors,
          exit_reason: sdkResult.exitReason,
          model: sdkResult.model,
          first_response_ms: sdkResult.firstResponseMs,
          max_inter_turn_ms: sdkResult.maxInterTurnMs,
        });
      } catch (err) {
        if (err instanceof Error && err.name === 'RateLimitExhaustedError') {
          result.rateLimitExhausted++;
          // Record a failed trial so the collector captures the attempt.
          evalCollector?.addTest({
            name: `${fixture.id}-${arm}-${n}`,
            suite: 'overlay-harness',
            tier: 'e2e',
            passed: false,
            duration_ms: 0,
            cost_usd: 0,
            exit_reason: 'rate_limit_exhausted',
            error: err.message,
          });
        } else {
          throw err;
        }
      } finally {
        try {
          fs.rmSync(dir, { recursive: true, force: true });
        } catch {
          // best-effort cleanup
        }
      }
    }
  });

  await Promise.all(workers);
  return result;
}

function mean(xs: number[]): number {
  if (xs.length === 0) return 0;
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sum(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0);
}

// ---------------------------------------------------------------------------
// Test bodies
// ---------------------------------------------------------------------------

describeE2E('overlay efficacy harness (SDK)', () => {
  // Resolve binary once
  const claudeBinary = resolveClaudeBinary();

  if (!claudeBinary) {
    test.skip(
      'no local `claude` binary on PATH — cannot pin for harness parity',
      () => {},
    );
    return;
  }

  for (const fixture of OVERLAY_FIXTURES) {
    test(
      `${fixture.id}: overlay-ON vs overlay-OFF, N=${fixture.trials} per arm`,
      async () => {
        const overlayText = overlayContentFor(fixture);
        expect(overlayText.length).toBeGreaterThan(100);

        // Arm composition: both arms use the real Claude Code default system
        // prompt (preset). Overlay-ON APPENDS the overlay text; overlay-OFF
        // uses the default alone. This measures the overlay's marginal effect
        // ON TOP of Claude Code's normal behavioral scaffolding — which is
        // the only measurement that matches how real Claude Code composes
        // overlays into its system prompt stack.
        const [onArm, offArm] = await Promise.all([
          runArm(
            fixture,
            'overlay-on',
            { type: 'preset', preset: 'claude_code', append: overlayText },
            claudeBinary,
          ),
          runArm(
            fixture,
            'overlay-off',
            { type: 'preset', preset: 'claude_code' },
            claudeBinary,
          ),
        ]);

        const arms = {
          overlay: onArm.metrics,
          off: offArm.metrics,
        };

        const meanOn = mean(arms.overlay);
        const meanOff = mean(arms.off);
        const lift = meanOn - meanOff;
        const floorHits = arms.overlay.filter((n) => n >= 2).length;
        const totalCost = sum(onArm.costs) + sum(offArm.costs);
        const versionSet = new Set([
          ...onArm.sdkClaudeCodeVersions,
          ...offArm.sdkClaudeCodeVersions,
        ]);

        // Loud output for the next person reading the eval JSON:
        // eslint-disable-next-line no-console
        console.log(
          `\n[${fixture.id}]\n` +
            ` binary: ${claudeBinary}\n` +
            ` claude_code_version(s): ${[...versionSet].join(', ')}\n` +
            ` overlay-ON metrics: [${arms.overlay.join(', ')}] mean=${meanOn.toFixed(2)}\n` +
            ` overlay-OFF metrics: [${arms.off.join(', ')}] mean=${meanOff.toFixed(2)}\n` +
            ` lift: ${lift.toFixed(2)} floor_hits(>=2): ${floorHits}/${fixture.trials}\n` +
            ` rate_limit_exhausted: on=${onArm.rateLimitExhausted} off=${offArm.rateLimitExhausted}\n` +
            ` total_cost_usd: $${totalCost.toFixed(4)}\n` +
            ` transcripts: ${TRANSCRIPTS_DIR}`,
        );

        // Demand enough trials actually completed to make the assertion
        // meaningful. If rate-limit exhaustion took out more than half of an
        // arm, fail loudly rather than pass/fail on a fragment.
        const minTrials = Math.ceil(fixture.trials / 2);
        expect(arms.overlay.length).toBeGreaterThanOrEqual(minTrials);
        expect(arms.off.length).toBeGreaterThanOrEqual(minTrials);

        expect(fixture.pass(arms)).toBe(true);
      },
      30 * 60 * 1000, // 30 minute timeout per fixture
    );
  }
});

afterAll(async () => {
  if (evalCollector) {
    const filepath = await evalCollector.finalize();
    // eslint-disable-next-line no-console
    console.log(`\n[overlay-harness] eval results: ${filepath}`);
  }
});