mirror of https://github.com/garrytan/gstack.git
synced 2026-05-02 11:45:20 +02:00
chore: bump version and changelog (v1.6.5.0)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,5 +1,100 @@
# Changelog

## [1.6.5.0] - 2026-04-23

## **We tried to make Opus 4.7 faster with a prompt. Measurement said it got slower. Pulled the bullet.**

gstack shipped a "Fan out explicitly" overlay nudge in `model-overlays/opus-4-7.md`
back in v1.5.2.0. The idea: tell Opus 4.7 to emit multiple tool calls in one
assistant turn instead of one per turn, so "read three files" takes one API
round-trip instead of three. Sounded obvious. This release removes that
bullet after measuring that it actively hurt performance, and ships the eval
harness we used to prove it so you can measure your own overlay changes.
### The numbers that matter

Source: new `test/skill-e2e-overlay-harness.test.ts`, N=10 trials per arm per
fixture, 40 trials per run, ~$3 per run. Pinned to `claude-opus-4-7` via
Anthropic's published Agent SDK (`@anthropic-ai/claude-agent-sdk@0.2.117`)
with `pathToClaudeCodeExecutable` set to the locally-installed `claude` binary
(2.1.118). Metric: number of parallel `tool_use` blocks in the first assistant
turn.
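
The metric itself is easy to state in code. A minimal sketch, not the shipped harness code: the message shapes below are simplified stand-ins for the SDK's `SDKMessage` union, and the field names are assumptions.

```typescript
// Simplified stand-ins for the SDK event stream; real SDKMessage shapes
// carry more fields, but only these matter for the metric.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string };
type Message =
  | { type: "assistant"; content: ContentBlock[] }
  | { type: "result" };

/** Fanout = number of parallel tool_use blocks in the first assistant turn. */
function firstTurnFanout(messages: Message[]): number {
  const first = messages.find((m) => m.type === "assistant");
  if (!first || first.type !== "assistant") return 0;
  return first.content.filter((b) => b.type === "tool_use").length;
}
```

For the toy fixture, a trial counts as a fanout success when all three reads land in one turn, i.e. `firstTurnFanout(messages) >= 3`.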

| Prompt text in overlay | First-turn fanout rate (toy: read 3 files) | Lift vs baseline |
|---|---|---|
| No overlay (default Claude Code system prompt only) | **70%** (7/10) | baseline |
| gstack's original "Fan out explicitly" nudge (v1.5.2.0 through v1.6.3.0) | 10% (1/10) | **-60%** |
| Anthropic's own canonical `<use_parallel_tool_calls>` text from their parallel-tool-use docs | **0%** (0/10) | **-70%** |

On a realistic multi-file audit prompt (`read app.ts + config.ts + README.md,
glob src/*.ts, summarize`), Opus 4.7 never fanned out in the first turn at all,
regardless of overlay. Zero of 20 trials. The nudge had nothing to grip.

Total cost of the investigation: **$7** across three eval runs.

### What this means for you

If you ship system-prompt nudges for Claude, measure them. Anthropic's own
published best-practice text dropped our fanout rate to zero. That's not a
claim about Anthropic; it's a claim about measurement: the model, the SDK,
the binary, and the context all move under the advice, while the advice sits
still. The harness is in the repo now. Run
`EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-overlay-harness.test.ts`.
Three dollars per run.

### Itemized changes

#### Fixed

- `model-overlays/opus-4-7.md` — removed the "Fan out explicitly" block. The
  other three nudges (effort-match, batch questions, literal interpretation)
  are untested and stay in for now. They're candidates for their own
  measurement in a follow-up PR.

#### Added

- `test/skill-e2e-overlay-harness.test.ts` — periodic-tier eval that iterates a
  typed fixture registry and runs A/B arms through `@anthropic-ai/claude-agent-sdk`.
  Uses SDK preset `claude_code` so the arms include Claude Code's real system
  prompt; overlay-ON appends the resolved overlay text. Saves per-trial raw
  event streams for forensic recovery. Gated on both `EVALS=1` and
  `EVALS_TIER=periodic`.
- `test/fixtures/overlay-nudges.ts` — typed `OverlayFixture` registry with a
  strict validator. Adding a future nudge to measure = one fixture entry.
  First two fixtures: `opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic`.
- `test/helpers/agent-sdk-runner.ts` — parametric SDK wrapper with explicit
  `AgentSdkResult` types, a process-level API concurrency semaphore, and
  three-shape 429 retry (thrown error, result-message error, mid-stream
  `SDKRateLimitEvent`). Binary pinning via `pathToClaudeCodeExecutable`.
- `test/agent-sdk-runner.test.ts` — 36 free-tier unit tests covering the happy
  path, all three rate-limit shapes, persistent-429 `RateLimitExhaustedError`,
  non-429 propagation, options propagation, the concurrency cap, and every
  validator rejection case.
- `scripts/preflight-agent-sdk.ts` — 20-line sanity check that confirms the
  SDK loads, `claude-opus-4-7` is a live API model, the `SDKMessage` event
  shape matches assumptions, and the overlay resolver produces the expected
  text. Run manually before paid runs if you suspect drift. Costs ~$0.013.
- `@anthropic-ai/claude-agent-sdk@0.2.117` in `devDependencies`. Exact pin,
  no caret — SDK event shapes can drift on minor versions.
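
The concurrency semaphore the runner bullet describes can be sketched as follows. This is an illustrative stand-in, not the code in `test/helpers/agent-sdk-runner.ts`; the `Semaphore` and `withPermit` names are invented for this example.

```typescript
// A minimal async semaphore: at most `permits` paid SDK calls in flight.
class Semaphore {
  private queue: Array<() => void> = [];
  private available: number;
  constructor(permits: number) {
    this.available = permits;
  }
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No permit free: park until a release hands one over.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }
  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the permit directly to a waiter
    else this.available++;
  }
}

// Wrap every API call so the cap holds even when trials run concurrently.
async function withPermit<T>(sem: Semaphore, fn: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await fn();
  } finally {
    sem.release();
  }
}
```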

#### Changed

- `scripts/resolvers/model-overlay.ts` — exported `readOverlay` so the eval
  harness can resolve `{{INHERIT:claude}}` directives without synthesizing a
  full `TemplateContext`.

#### For contributors

- `test/helpers/touchfiles.ts` — registered the new eval in both
  `E2E_TOUCHFILES` (deps: `model-overlays/**`, `overlay-nudges.ts`, runner,
  resolver) and `E2E_TIERS` (`periodic`). Passes the
  `test/touchfiles.test.ts` completeness check.
- The harness is deliberately parametric. Adding a second overlay nudge
  measurement (for the remaining three nudges in `opus-4-7.md`, or any
  future nudge in any overlay file) is a single entry in
  `test/fixtures/overlay-nudges.ts`. Total incremental effort: ~15 minutes
  per fixture.
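
For illustration, a fixture entry might look like the sketch below. The field names are assumptions for this example, not the shipped `OverlayFixture` schema in `test/fixtures/overlay-nudges.ts`.

```typescript
// Hypothetical fixture shape; the real registry's fields may differ.
interface OverlayFixture {
  id: string;            // e.g. "opus-4-7-fanout-toy"
  overlayPath: string;   // overlay file whose nudge is under test
  prompt: string;        // task sent to both A/B arms
  trialsPerArm: number;  // N per arm (the shipped harness uses 10)
  minFanout: number;     // first-turn tool_use count that counts as success
}

// Adding a future nudge to measure = one more entry like this.
const fixtures: OverlayFixture[] = [
  {
    id: "opus-4-7-fanout-toy",
    overlayPath: "model-overlays/opus-4-7.md",
    prompt: "Read a.ts, b.ts, and c.ts, then summarize each.",
    trialsPerArm: 10,
    minFanout: 3,
  },
];

// A strict validator in the spirit of the one described above: reject
// duplicate ids and nonsensical counts before spending API dollars.
function validateFixtures(all: OverlayFixture[]): void {
  const seen = new Set<string>();
  for (const f of all) {
    if (seen.has(f.id)) throw new Error(`duplicate fixture id: ${f.id}`);
    seen.add(f.id);
    if (f.trialsPerArm < 1) throw new Error(`${f.id}: trialsPerArm must be >= 1`);
    if (f.minFanout < 1) throw new Error(`${f.id}: minFanout must be >= 1`);
  }
}
```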

## [1.6.3.0] - 2026-04-23

## **Codex finally explains what it's asking about. No more "ELI10 please" the 10th time in a row.**

@@ -18,22 +18,6 @@

**Priority:** P3 (nice-to-have, not blocking anyone yet)
**Depends on:** `/context-save` + `/context-restore` rename stable in production (v1.0.1.0+). Research: does Conductor expose a spawn-workspace CLI?

## P0: Verify Opus 4.7 fanout nudge inside Claude Code harness (next rev)

**What:** Re-run the fanout A/B from `test/skill-e2e-opus-47.test.ts` against Opus 4.7 **inside Claude Code's interactive harness**, not via `claude -p`. The current eval calls `claude -p` as a subprocess, which does not load SKILL.md content as system context and uses different tool wiring than the live Claude Code session. Build a small harness (a Claude Code extension hook, a direct API call with the same system prompt Claude Code uses, or a scripted MCP invocation) that reproduces the real tool_use context, then run the same 3-file-read A/B with and without the `model-overlays/opus-4-7.md` overlay. Record the parallel-tool-call count in the first assistant turn for each arm.

**Why:** v1.6.1.0 shipped a rewritten "Fan out explicitly" nudge with a concrete tool_use example (`[Read(a), Read(b), Read(c)]`). Under `claude -p` on `claude-opus-4-7`, both overlay-ON and overlay-OFF arms emitted zero parallel tool calls in the first turn. The routing A/B worked fine in the same harness (3/3 positives routed correctly), so the gap is specific to fanout, and likely specific to how `claude -p` constructs system prompts and tool schemas. Without measurement inside the real harness, we do not know whether the nudge ever lands for a real user. The PR went to production with the fanout claim asserted but unverified; this TODO closes that loop.

**Pros:** Produces the "actually shipped fanout" measurement the ship-quality review flagged as missing. If the nudge works in the Claude Code harness, we can gate it with a `periodic` eval and stop worrying. If it does not, we know to rewrite or drop the nudge rather than carry dead prompt weight. Either answer is better than the current "unverified."

**Cons:** Requires instrumenting Claude Code's harness (or a faithful replica) rather than the easier `claude -p` path. A faithful replica needs the same system prompt, the same tool definitions, and the same stop-sequence handling. Estimated one afternoon to wire, plus $3-5 per eval run.

**Context:** See `~/.gstack/projects/garrytan-gstack/evals/1.6.0.0-feat-opus-4.7-migration-e2e-opus-47-*.json` for the raw transcripts showing 0 parallel calls in the first turn across both arms. The overlay is at `model-overlays/opus-4-7.md` with an explicit wrong/right tool_use example. The eval file at `test/skill-e2e-opus-47.test.ts` has the full setup including per-skill SKILL.md install, CLAUDE.md routing block, and overlay inlining.

**Effort:** M (human: ~1 day / CC: ~45 min for the harness wiring, plus the eval run cost)
**Priority:** P0 (ship-quality commitment from v1.6.1.0 — do not let it drift)
**Depends on / blocked by:** Access to Claude Code's system prompt + tool schema (or a reproducible way to mirror them). May require a small MCP server or a direct Messages API call that mirrors Claude Code's session setup.

## P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1)

**What:** Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in `docs/designs/PACING_UPDATES_V0.md`. Requires: session-state model, `phase` field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), `bin/gstack-flip-decision` command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values.
@@ -1262,6 +1246,15 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr

## Completed

### Overlay efficacy harness + Opus 4.7 fanout nudge removal (v1.6.5.0)

- Built `test/skill-e2e-overlay-harness.test.ts`, a parametric periodic-tier eval that drives `@anthropic-ai/claude-agent-sdk` and measures first-turn fanout rate (overlay-ON vs overlay-OFF) across registered fixtures
- Measured the original "Fan out explicitly" overlay nudge: baseline Opus 4.7 = 70% first-turn fanout on the toy prompt, with our nudge = 10%, with Anthropic's own canonical `<use_parallel_tool_calls>` text = 0%
- Removed the counterproductive nudge from `model-overlays/opus-4-7.md`
- Shipped a 36-test free-tier unit suite for the SDK runner + strict fixture validator
- Registered `overlay-harness-opus-4-7-fanout-{toy,realistic}` in E2E_TOUCHFILES and E2E_TIERS
- Total investigation cost: ~$7 across 3 eval runs

**Completed:** v1.6.5.0

### CI eval pipeline (v0.9.9.0)

- GitHub Actions eval upload on Ubicloud runners ($0.006/run)
- Within-file test concurrency (test() → testConcurrentIfSelected())

@@ -1,6 +1,6 @@
 {
   "name": "gstack",
-  "version": "1.6.3.0",
+  "version": "1.6.5.0",
   "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
   "license": "MIT",
   "type": "module",