chore: bump version and changelog (v1.6.5.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-23 10:37:25 -07:00
parent bfb8ac3b7f
commit 8a3f197aea
4 changed files with 106 additions and 18 deletions
+95
@@ -1,5 +1,100 @@
# Changelog
## [1.6.5.0] - 2026-04-23
## **We tried to make Opus 4.7 faster with a prompt. Measurement said it got slower. Pulled the bullet.**
gstack shipped a "Fan out explicitly" overlay nudge in `model-overlays/opus-4-7.md`
back in v1.5.2.0. The idea: tell Opus 4.7 to emit multiple tool calls in one
assistant turn instead of one per turn, so "read three files" takes one API
round-trip instead of three. Sounded obvious. This release removes that
bullet after measuring that it actively hurt performance, and ships the eval
harness we used to prove it so you can measure your own overlay changes.
### The numbers that matter
Source: new `test/skill-e2e-overlay-harness.test.ts`, N=10 trials per arm per
fixture, 40 trials per run, ~$3 per run. Pinned to `claude-opus-4-7` via
Anthropic's published Agent SDK (`@anthropic-ai/claude-agent-sdk@0.2.117`)
with `pathToClaudeCodeExecutable` set to the locally-installed `claude` binary
(2.1.118). Metric: the number of parallel `tool_use` blocks in the first
assistant turn; a trial counts as fanning out when that number exceeds one,
and each rate below is the fraction of trials that fanned out.
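Concretely, the per-trial measurement reduces to something like the sketch
below. `SDKMessage` is the Agent SDK's exported message type; the helper name
and the loop are illustrative, not the harness's actual code.

```ts
import type { SDKMessage } from "@anthropic-ai/claude-agent-sdk";

// Sketch only: given one trial's raw event stream, count the parallel
// tool_use blocks in the first assistant turn. The helper name is ours.
function firstTurnToolUseCount(messages: SDKMessage[]): number {
  for (const message of messages) {
    if (message.type === "assistant") {
      // message.message is the underlying Anthropic API message; its
      // content array holds the turn's text and tool_use blocks.
      return message.message.content.filter((b) => b.type === "tool_use")
        .length;
    }
  }
  return 0; // no assistant turn observed
}
```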
| Overlay arm | First-turn fanout rate (toy: read 3 files) | Change vs baseline |
|---|---|---|
| No overlay (default Claude Code system prompt only) | **70%** (7/10) | baseline |
| gstack's original "Fan out explicitly" nudge (v1.5.2.0 through v1.6.3.0) | **10%** (1/10) | **-60 pts** |
| Anthropic's own canonical `<use_parallel_tool_calls>` text from their parallel-tool-use docs | **0%** (0/10) | **-70 pts** |
On a realistic multi-file audit prompt (`read app.ts + config.ts + README.md,
glob src/*.ts, summarize`), Opus 4.7 never fanned out in the first turn at all,
regardless of overlay. Zero of 20 trials. The nudge had nothing to grip.
Total cost of the investigation: **$7** across three eval runs.
### What this means for you
If you ship system-prompt nudges for Claude, measure them. Anthropic's own
published best-practice text dropped our fanout rate to zero. That's not a
claim about Anthropic; it's a claim about measurement: the model, the SDK,
the binary, and the context all move under the advice while the advice sits
still. The harness is in the repo now. Run
`EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-overlay-harness.test.ts`.
Three dollars per run.
### Itemized changes
#### Fixed
- `model-overlays/opus-4-7.md` — removed the "Fan out explicitly" block. The
other three nudges (effort-match, batch questions, literal interpretation)
are untested and stay in for now. They're candidates for their own
measurement in a follow-up PR.
#### Added
- `test/skill-e2e-overlay-harness.test.ts` — periodic-tier eval that iterates a
typed fixture registry and runs A/B arms through `@anthropic-ai/claude-agent-sdk`.
Uses SDK preset `claude_code` so the arms include Claude Code's real system
prompt; overlay-ON appends the resolved overlay text (arm setup sketched
after this list). Saves per-trial raw
event streams for forensic recovery. Gated on both `EVALS=1` and
`EVALS_TIER=periodic`.
- `test/fixtures/overlay-nudges.ts` — typed `OverlayFixture` registry with
strict validator. Adding a future nudge to measure = one fixture entry.
First two fixtures: `opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic`.
- `test/helpers/agent-sdk-runner.ts` — parametric SDK wrapper with explicit
`AgentSdkResult` types, process-level API concurrency semaphore, and
three-shape 429 retry (thrown error, result-message error, mid-stream
`SDKRateLimitEvent`). Binary pinning via `pathToClaudeCodeExecutable`.
- `test/agent-sdk-runner.test.ts` — 36 free-tier unit tests covering happy
path, all three rate-limit shapes, persistent-429 `RateLimitExhaustedError`,
non-429 propagation, options propagation, concurrency cap, and every
validator rejection case.
- `scripts/preflight-agent-sdk.ts` — 20-line sanity check that confirms the
SDK loads, `claude-opus-4-7` is a live API model, the `SDKMessage` event
shape matches assumptions, and the overlay resolver produces the expected
text. Run manually before paid runs if you suspect drift. Costs ~$0.013.
- `@anthropic-ai/claude-agent-sdk@0.2.117` in `devDependencies`. Exact pin,
no caret — SDK event shapes can drift on minor versions.
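For orientation, one arm of the A/B looks roughly like the sketch below at
the SDK call site. `query`, the `claude_code` system-prompt preset, and
`pathToClaudeCodeExecutable` are real Agent SDK options; `runArm`, the binary
path, and the overlay plumbing are placeholders, not the harness's real code.

```ts
import { query, type SDKMessage } from "@anthropic-ai/claude-agent-sdk";

// Sketch of one A/B arm under the option shapes named above. `runArm` and
// the binary path are hypothetical placeholders.
async function runArm(
  prompt: string,
  overlayText?: string, // undefined = overlay-OFF arm
): Promise<SDKMessage[]> {
  const events: SDKMessage[] = [];
  for await (const message of query({
    prompt,
    options: {
      model: "claude-opus-4-7",
      // Both arms run on Claude Code's real system prompt via the preset;
      // the overlay-ON arm appends the resolved overlay text to it.
      systemPrompt: overlayText
        ? { type: "preset", preset: "claude_code", append: overlayText }
        : { type: "preset", preset: "claude_code" },
      // Pin the binary so results don't drift with a global upgrade.
      pathToClaudeCodeExecutable: "/usr/local/bin/claude",
    },
  })) {
    events.push(message); // keep raw events for forensic recovery
  }
  return events;
}
```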
#### Changed
- `scripts/resolvers/model-overlay.ts` — exported `readOverlay` so the eval
harness can resolve `{{INHERIT:claude}}` directives without synthesizing a
full `TemplateContext`.
#### For contributors
- `test/helpers/touchfiles.ts` — registered the new eval in both
`E2E_TOUCHFILES` (deps: `model-overlays/**`, `overlay-nudges.ts`, runner,
resolver) and `E2E_TIERS` (`periodic`). Passes the
`test/touchfiles.test.ts` completeness check.
- The harness is deliberately parametric. Adding a second overlay nudge
measurement (for the remaining three nudges in `opus-4-7.md`, or any
future nudge in any overlay file) is a single entry in
`test/fixtures/overlay-nudges.ts`; a sketch of such an entry follows. Total
incremental effort: ~15 minutes
per fixture.
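To make "one fixture entry" concrete, a registration might look like the
hypothetical sketch below. Every field name is a guess for illustration; the
real `OverlayFixture` type and its strict validator live in
`test/fixtures/overlay-nudges.ts`.

```ts
// Hypothetical OverlayFixture entry; every field name is a guess for
// illustration. Check the real type in test/fixtures/overlay-nudges.ts.
const effortMatchToy = {
  id: "opus-4-7-effort-match-toy",
  overlayFile: "model-overlays/opus-4-7.md",
  nudge: "Effort-match", // which overlay bullet this fixture measures
  prompt: "Summarize README.md in one sentence.",
  trialsPerArm: 10,
  tier: "periodic",
} as const;
```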
## [1.6.3.0] - 2026-04-23
## **Codex finally explains what it's asking about. No more "ELI10 please" the 10th time in a row.**
+9 -16
@@ -18,22 +18,6 @@
**Priority:** P3 (nice-to-have, not blocking anyone yet)
**Depends on:** `/context-save` + `/context-restore` rename stable in production (v1.0.1.0+). Research: does Conductor expose a spawn-workspace CLI?
## P0: Verify Opus 4.7 fanout nudge inside Claude Code harness (next rev)
**What:** Re-run the fanout A/B from `test/skill-e2e-opus-47.test.ts` against Opus 4.7 **inside Claude Code's interactive harness**, not via `claude -p`. The current eval calls `claude -p` as a subprocess, which does not load SKILL.md content as system context and uses different tool wiring than the live Claude Code session. Build a small harness (Claude Code extension hook, direct API call with the same system prompt Claude Code uses, or a scripted MCP invocation) that reproduces the real tool_use context, then run the same 3-file-read A/B with and without the `model-overlays/opus-4-7.md` overlay. Record parallel-tool-call count in the first assistant turn for each arm.
**Why:** v1.6.1.0 shipped a rewritten "Fan out explicitly" nudge with a concrete tool_use example (`[Read(a), Read(b), Read(c)]`). Under `claude -p` on `claude-opus-4-7`, both overlay-ON and overlay-OFF arms emitted zero parallel tool calls in the first turn. The routing A/B worked fine in the same harness (3/3 positives routed correctly), so the gap is specific to fanout, and likely specific to how `claude -p` constructs system prompts and tool schemas. Without measurement inside the real harness, we do not know whether the nudge ever lands for a real user. The PR went to production with the fanout claim asserted but unverified; this TODO closes that loop.
**Pros:** Produces the "actually shipped fanout" measurement the ship-quality review flagged as missing. If the nudge works in Claude Code harness, we can gate it with a `periodic` eval and stop worrying. If it does not, we know to rewrite or drop the nudge rather than carry dead prompt weight. Either answer is better than the current "unverified."
**Cons:** Requires instrumenting Claude Code's harness (or a faithful replica) rather than the easier `claude -p` path. A faithful replica needs the same system prompt, the same tool definitions, and the same stop-sequence handling. Estimated one afternoon to wire, plus $3-5 per eval run.
**Context:** See `~/.gstack/projects/garrytan-gstack/evals/1.6.0.0-feat-opus-4.7-migration-e2e-opus-47-*.json` for the raw transcripts showing 0 parallel calls in first turn across both arms. The overlay is at `model-overlays/opus-4-7.md` with an explicit wrong/right tool_use example. The eval file at `test/skill-e2e-opus-47.test.ts` has the full setup including per-skill SKILL.md install, CLAUDE.md routing block, and overlay inlining.
**Effort:** M (human: ~1 day / CC: ~45 min for the harness wiring, plus the eval run cost)
**Priority:** P0 (ship-quality commitment from v1.6.1.0 — do not let it drift)
**Depends on / blocked by:** Access to Claude Code's system prompt + tool schema (or a reproducible way to mirror them). May require a small MCP server or a direct Messages API call that mirrors Claude Code's session setup.
## P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1)
**What:** Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in `docs/designs/PACING_UPDATES_V0.md`. Requires: session-state model, `phase` field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), `bin/gstack-flip-decision` command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values.
@@ -1262,6 +1246,15 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr
## Completed
### Overlay efficacy harness + Opus 4.7 fanout nudge removal (v1.6.5.0)
- Built `test/skill-e2e-overlay-harness.test.ts`, a parametric periodic-tier eval that drives `@anthropic-ai/claude-agent-sdk` and measures first-turn fanout rate (overlay-ON vs overlay-OFF) across registered fixtures
- Measured the original "Fan out explicitly" overlay nudge: baseline Opus 4.7 = 70% first-turn fanout on toy prompt, with our nudge = 10%, with Anthropic's own canonical `<use_parallel_tool_calls>` text = 0%
- Removed the counterproductive nudge from `model-overlays/opus-4-7.md`
- Shipped 36-test free-tier unit suite for the SDK runner + strict fixture validator
- Registered `overlay-harness-opus-4-7-fanout-{toy,realistic}` in E2E_TOUCHFILES and E2E_TIERS
- Total investigation cost: ~$7 across 3 eval runs
**Completed:** v1.6.5.0
### CI eval pipeline (v0.9.9.0)
- GitHub Actions eval upload on Ubicloud runners ($0.006/run)
- Within-file test concurrency (test() → testConcurrentIfSelected())
+1 -1
@@ -1 +1 @@
1.6.3.0
1.6.5.0
+1 -1
@@ -1,6 +1,6 @@
{
"name": "gstack",
"version": "1.6.3.0",
"version": "1.6.5.0",
"description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
"license": "MIT",
"type": "module",