diff --git a/CHANGELOG.md b/CHANGELOG.md index a8ac2c2c..e9780e1f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,100 @@ # Changelog +## [1.6.5.0] - 2026-04-23 + +## **We tried to make Opus 4.7 faster with a prompt. Measurement said it got slower. Pulled the bullet.** + +gstack shipped a "Fan out explicitly" overlay nudge in `model-overlays/opus-4-7.md` +back in v1.5.2.0. The idea: tell Opus 4.7 to emit multiple tool calls in one +assistant turn instead of one per turn, so "read three files" takes one API +round-trip instead of three. Sounded obvious. This release removes that +bullet after measuring that it actively hurt performance, and ships the eval +harness we used to prove it so you can measure your own overlay changes. + +### The numbers that matter + +Source: new `test/skill-e2e-overlay-harness.test.ts`, N=10 trials per arm per +fixture, 40 trials per run, ~$3 per run. Pinned to `claude-opus-4-7` via +Anthropic's published Agent SDK (`@anthropic-ai/claude-agent-sdk@0.2.117`) +with `pathToClaudeCodeExecutable` set to the locally-installed `claude` binary +(2.1.118). Metric: number of parallel `tool_use` blocks in the first assistant +turn. + +| Prompt text in overlay | First-turn fanout rate (toy: read 3 files) | Lift vs baseline | +|---|---|---| +| No overlay (default Claude Code system prompt only) | **70%** (7/10) | baseline | +| gstack's original "Fan out explicitly" nudge (v1.5.2.0 through v1.6.3.0) | 10% (1/10) | **-60%** | +| Anthropic's own canonical `` text from their parallel-tool-use docs | **0%** (0/10) | **-70%** | + +On a realistic multi-file audit prompt (`read app.ts + config.ts + README.md, +glob src/*.ts, summarize`), Opus 4.7 never fanned out in the first turn at all, +regardless of overlay. Zero of 20 trials. The nudge had nothing to grip. + +Total cost of the investigation: **$7** across three eval runs. + +### What this means for you + +If you ship system-prompt nudges for Claude, measure them. Anthropic's own +published best-practice text dropped our fanout rate to zero. That's not a +claim about Anthropic, it's a claim about measurement: the model, the SDK, +the binary, and the context all move under the advice, and the advice sits +still. The harness is in the repo now. Run +`EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-overlay-harness.test.ts`. +Three dollars per run. + +### Itemized changes + +#### Fixed + +- `model-overlays/opus-4-7.md` — removed the "Fan out explicitly" block. The + other three nudges (effort-match, batch questions, literal interpretation) + are untested and stay in for now. They're candidates for their own + measurement in a follow-up PR. + +#### Added + +- `test/skill-e2e-overlay-harness.test.ts` — periodic-tier eval that iterates a + typed fixture registry and runs A/B arms through `@anthropic-ai/claude-agent-sdk`. + Uses SDK preset `claude_code` so the arms include Claude Code's real system + prompt; overlay-ON appends the resolved overlay text. Saves per-trial raw + event streams for forensic recovery. Gated on both `EVALS=1` and + `EVALS_TIER=periodic`. +- `test/fixtures/overlay-nudges.ts` — typed `OverlayFixture` registry with + strict validator. Adding a future nudge to measure = one fixture entry. + First two fixtures: `opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic`. +- `test/helpers/agent-sdk-runner.ts` — parametric SDK wrapper with explicit + `AgentSdkResult` types, process-level API concurrency semaphore, and + three-shape 429 retry (thrown error, result-message error, mid-stream + `SDKRateLimitEvent`). Binary pinning via `pathToClaudeCodeExecutable`. +- `test/agent-sdk-runner.test.ts` — 36 free-tier unit tests covering happy + path, all three rate-limit shapes, persistent-429 `RateLimitExhaustedError`, + non-429 propagation, options propagation, concurrency cap, and every + validator rejection case. +- `scripts/preflight-agent-sdk.ts` — 20-line sanity check that confirms the + SDK loads, `claude-opus-4-7` is a live API model, the `SDKMessage` event + shape matches assumptions, and the overlay resolver produces the expected + text. Run manually before paid runs if you suspect drift. Costs ~$0.013. +- `@anthropic-ai/claude-agent-sdk@0.2.117` in `devDependencies`. Exact pin, + no caret — SDK event shapes can drift on minor versions. + +#### Changed + +- `scripts/resolvers/model-overlay.ts` — exported `readOverlay` so the eval + harness can resolve `{{INHERIT:claude}}` directives without synthesizing a + full `TemplateContext`. + +#### For contributors + +- `test/helpers/touchfiles.ts` — registered the new eval in both + `E2E_TOUCHFILES` (deps: `model-overlays/**`, `overlay-nudges.ts`, runner, + resolver) and `E2E_TIERS` (`periodic`). Passes the + `test/touchfiles.test.ts` completeness check. +- The harness is deliberately parametric. Adding a second overlay nudge + measurement (for the remaining three nudges in `opus-4-7.md`, or any + future nudge in any overlay file) is a single entry in + `test/fixtures/overlay-nudges.ts`. Total incremental effort: ~15 minutes + per fixture. + ## [1.6.3.0] - 2026-04-23 ## **Codex finally explains what it's asking about. No more "ELI10 please" the 10th time in a row.** diff --git a/TODOS.md b/TODOS.md index eeac8c15..d2b3a31c 100644 --- a/TODOS.md +++ b/TODOS.md @@ -18,22 +18,6 @@ **Priority:** P3 (nice-to-have, not blocking anyone yet) **Depends on:** `/context-save` + `/context-restore` rename stable in production (v1.0.1.0+). Research: does Conductor expose a spawn-workspace CLI? -## P0: Verify Opus 4.7 fanout nudge inside Claude Code harness (next rev) - -**What:** Re-run the fanout A/B from `test/skill-e2e-opus-47.test.ts` against Opus 4.7 **inside Claude Code's interactive harness**, not via `claude -p`. The current eval calls `claude -p` as a subprocess, which does not load SKILL.md content as system context and uses different tool wiring than the live Claude Code session. Build a small harness (Claude Code extension hook, direct API call with the same system prompt Claude Code uses, or a scripted MCP invocation) that reproduces the real tool_use context, then run the same 3-file-read A/B with and without the `model-overlays/opus-4-7.md` overlay. Record parallel-tool-call count in the first assistant turn for each arm. - -**Why:** v1.6.1.0 shipped a rewritten "Fan out explicitly" nudge with a concrete tool_use example (`[Read(a), Read(b), Read(c)]`). Under `claude -p` on `claude-opus-4-7`, both overlay-ON and overlay-OFF arms emitted zero parallel tool calls in the first turn. The routing A/B worked fine in the same harness (3/3 positives routed correctly), so the gap is specific to fanout, and likely specific to how `claude -p` constructs system prompts and tool schemas. Without measurement inside the real harness, we do not know whether the nudge ever lands for a real user. The PR went to production with the fanout claim asserted but unverified; this TODO closes that loop. - -**Pros:** Produces the "actually shipped fanout" measurement the ship-quality review flagged as missing. If the nudge works in Claude Code harness, we can gate it with a `periodic` eval and stop worrying. If it does not, we know to rewrite or drop the nudge rather than carry dead prompt weight. Either answer is better than the current "unverified." - -**Cons:** Requires instrumenting Claude Code's harness (or a faithful replica) rather than the easier `claude -p` path. A faithful replica needs the same system prompt, the same tool definitions, and the same stop-sequence handling. Estimated one afternoon to wire, plus $3-5 per eval run. - -**Context:** See `~/.gstack/projects/garrytan-gstack/evals/1.6.0.0-feat-opus-4.7-migration-e2e-opus-47-*.json` for the raw transcripts showing 0 parallel calls in first turn across both arms. The overlay is at `model-overlays/opus-4-7.md` with an explicit wrong/right tool_use example. The eval file at `test/skill-e2e-opus-47.test.ts` has the full setup including per-skill SKILL.md install, CLAUDE.md routing block, and overlay inlining. - -**Effort:** M (human: ~1 day / CC: ~45 min for the harness wiring, plus the eval run cost) -**Priority:** P0 (ship-quality commitment from v1.6.1.0 — do not let it drift) -**Depends on / blocked by:** Access to Claude Code's system prompt + tool schema (or a reproducible way to mirror them). May require a small MCP server or a direct Messages API call that mirrors Claude Code's session setup. - ## P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1) **What:** Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in `docs/designs/PACING_UPDATES_V0.md`. Requires: session-state model, `phase` field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), `bin/gstack-flip-decision` command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values. @@ -1262,6 +1246,15 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr ## Completed +### Overlay efficacy harness + Opus 4.7 fanout nudge removal (v1.6.5.0) +- Built `test/skill-e2e-overlay-harness.test.ts`, a parametric periodic-tier eval that drives `@anthropic-ai/claude-agent-sdk` and measures first-turn fanout rate (overlay-ON vs overlay-OFF) across registered fixtures +- Measured the original "Fan out explicitly" overlay nudge: baseline Opus 4.7 = 70% first-turn fanout on toy prompt, with our nudge = 10%, with Anthropic's own canonical `` text = 0% +- Removed the counterproductive nudge from `model-overlays/opus-4-7.md` +- Shipped 36-test free-tier unit suite for the SDK runner + strict fixture validator +- Registered `overlay-harness-opus-4-7-fanout-{toy,realistic}` in E2E_TOUCHFILES and E2E_TIERS +- Total investigation cost: ~$7 across 3 eval runs +**Completed:** v1.6.5.0 + ### CI eval pipeline (v0.9.9.0) - GitHub Actions eval upload on Ubicloud runners ($0.006/run) - Within-file test concurrency (test() → testConcurrentIfSelected()) diff --git a/VERSION b/VERSION index 4d019863..51b0f476 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.6.3.0 +1.6.5.0 diff --git a/package.json b/package.json index 83008f0b..faead288 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gstack", - "version": "1.6.3.0", + "version": "1.6.5.0", "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.", "license": "MIT", "type": "module",