* refactor: export readOverlay from model-overlay resolver

Needed by the overlay-efficacy eval harness to resolve INHERIT directives without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

Pinned exact for SDK event-shape stability. Used by the overlay-efficacy harness to drive the model through a closer-to-real Claude Code harness than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

Verifies that the SDK loads, that claude-opus-4-7 is a live API model, that the SDKMessage event shape matches our assumptions, and that readOverlay resolves INHERIT directives and includes the expected content. Run with `bun run scripts/preflight-agent-sdk.ts`. PREFLIGHT OK on first run, $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

`test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk with explicit `AgentSdkResult` types, a process-level API concurrency semaphore, and a 3-shape 429 retry (thrown error, result-message error, mid-stream SDKRateLimitEvent); see the retry sketch after this list. Pins the local claude binary via `pathToClaudeCodeExecutable`.

`test/fixtures/overlay-nudges.ts` holds the typed registry. Two fixtures cover the first measurement: `opus-4-7-fanout-toy` (3-file read) and `opus-4-7-fanout-realistic` (mixed-tool audit). A strict validator, also sketched below, rejects duplicate ids, non-integer trials, unsafe overlay paths, unsafe id chars, and missing overlay files at module load. Adding a future overlay nudge eval = one fixture entry.

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

A stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers: happy-path shape, all 3 rate-limit shapes + retry, workspace reset on retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation, the process-level concurrency cap, options propagation, artifact path uniqueness, cost/turn mapping, and every validator rejection case.

* test(eval): paid periodic overlay-efficacy harness

`test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES, running two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with bounded concurrency. Both arms use the SDK preset `claude_code`, so both include the real Claude Code system prompt; overlay-ON appends the resolved overlay text. Per-trial raw event streams are saved to `~/.gstack/projects/<slug>/transcripts/` for forensic recovery. Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

Entries for `overlay-harness-opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/, the fixtures file, the runner, the resolver) and in E2E_TIERS (`periodic`). Passes the `test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

Measured as counterproductive under the new SDK harness. Baseline Opus 4.7 emits first-turn parallel tool_use blocks 70% of the time on a 3-file read prompt. With the custom nudge: 10%. With Anthropic's own canonical `<use_parallel_tool_calls>` block from their parallel-tool-use docs: 0%. Both overlays suppress fanout; neither improves it.

On realistic multi-tool prompts (audit a project: read files + glob + summarize), Opus 4.7 never fans out in the first turn regardless of overlay: zero of 20 trials. Not a prompt problem.

Keeping the other three nudges (effort-match, batch questions, literal interpretation) pending their own measurement.
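The semaphore-plus-retry core referenced above might look roughly like the following. This is a minimal sketch, not the real `agent-sdk-runner.ts`: `Semaphore`, `runTrialWithRetry`, `runOnce`, `resetWorkspace`, the concurrency cap of 4, and the string-based 429 detection are all illustrative stand-ins; only `RateLimitExhaustedError` and the three 429 shapes come from the commits.

```ts
class RateLimitExhaustedError extends Error {}

// Process-level semaphore: one shared instance caps concurrent API calls
// across every trial in the process.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private slots: number) {}
  async acquire(): Promise<void> {
    if (this.slots > 0) { this.slots -= 1; return; }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next();      // hand the slot directly to the next waiter
    else this.slots += 1;
  }
}

const apiSemaphore = new Semaphore(4); // assumed cap, not from the commits

// Crude stand-in for detecting the three 429 shapes the commit names:
// a thrown error, an error result message, a mid-stream rate-limit event.
function looksRateLimited(x: unknown): boolean {
  const text =
    x instanceof Error ? x.message : typeof x === "string" ? x : JSON.stringify(x ?? "");
  return text.includes("429") || /rate.?limit/i.test(text);
}

async function runTrialWithRetry<T extends { events: unknown[] }>(
  runOnce: () => Promise<T>,
  resetWorkspace: () => Promise<void>,
  maxAttempts = 3,
): Promise<T> {
  await apiSemaphore.acquire();
  try {
    for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
      try {
        const result = await runOnce();
        // Shapes 2/3: a 429 surfaced inside the stream or result message.
        if (!result.events.some(looksRateLimited)) return result;
      } catch (err) {
        // Shape 1: the SDK threw. Anything that isn't a 429 propagates.
        if (!looksRateLimited(err)) throw err;
      }
      await resetWorkspace(); // each retry starts from a clean workspace
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1_000)); // backoff
    }
    throw new RateLimitExhaustedError(`still rate-limited after ${maxAttempts} attempts`);
  } finally {
    apiSemaphore.release();
  }
}
```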
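The module-load validation, likewise, could be as small as the sketch below; the `OverlayFixture` fields are guesses at the registry's shape, and only the five rejection cases are taken from the commit.

```ts
import { existsSync } from "node:fs";
import { isAbsolute, normalize, sep } from "node:path";

// Hypothetical registry shape; field names are illustrative.
interface OverlayFixture {
  id: string;          // e.g. "opus-4-7-fanout-toy"
  overlayPath: string; // relative path into model-overlays/
  trials: number;
  prompt: string;
}

const SAFE_ID = /^[a-z0-9-]+$/;

function validateFixtures(fixtures: readonly OverlayFixture[]): void {
  const seen = new Set<string>();
  for (const f of fixtures) {
    if (!SAFE_ID.test(f.id)) throw new Error(`unsafe id chars: ${f.id}`);
    if (seen.has(f.id)) throw new Error(`duplicate id: ${f.id}`);
    seen.add(f.id);
    if (!Number.isInteger(f.trials) || f.trials < 1)
      throw new Error(`non-integer trials: ${f.id}`);
    const normalized = normalize(f.overlayPath);
    if (isAbsolute(normalized) || normalized.split(sep).includes(".."))
      throw new Error(`unsafe overlay path: ${f.overlayPath}`);
    if (!existsSync(normalized))
      throw new Error(`missing overlay file: ${f.overlayPath}`);
  }
}

// Running at module load means a bad entry fails the suite
// before any paid trial starts.
validateFixtures([]);
```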
Harness is ready for follow-up fixtures: add one entry to `test/fixtures/overlay-nudges.ts` to measure any overlay bullet. Cost of investigation: ~$7 total across 3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

A per-fixture tool allowlist unblocks measuring nudges that need Edit/Write (e.g. literal-interpretation's 'fix the failing tests' needs write access). Per-fixture maxTurns lets harder prompts run longer without changing the default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics (sketched after this list):

- lowerIsBetter20Pct / higherIsBetter20Pct: 20% lift threshold vs baseline
- bashToolCallCount: count of Bash tool_use across the session
- turnsToCompletion: SDK-reported num_turns at result
- uniqueFilesEdited: Edit/Write/MultiEdit file_path set size

`test/skill-e2e-overlay-harness.test.ts` now threads fixture.allowedTools and fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

Measures three overlay bullets that haven't been tested yet:

- claude-dedicated-tools-vs-bash: claude.md says 'prefer Read/Edit/Write/Glob/Grep over cat/sed/find/grep'. The fixture prompts 'list every TypeScript file under src/ and tell me what each exports' and counts Bash tool_use across the session. Overlay-ON should drop it by >=20%.
- opus-4-7-effort-match-trivial: opus-4-7.md says 'simple file reads don't need deep reasoning.' The fixture uses a trivial one-file prompt (config.json lookup) and measures turns_used. Overlay-ON should be <=80% of baseline turns.
- opus-4-7-literal-interpretation: opus-4-7.md says 'fix ALL failing tests, not just the obvious one.' The fixture seeds three failing test files with deliberately distinct failure modes and counts unique files edited. Overlay-ON should touch >=20% more files.

Adding a fourth fixture for any remaining overlay nudge is a single entry. The harness is now proven on: fanout (deleted after measurement), dedicated tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

Some @anthropic-ai/claude-agent-sdk versions throw from the query generator when maxTurns is reached instead of emitting a result message with subtype='error_max_turns'. The runner treated that as a non-retryable error and killed the whole periodic run on the first fixture that exceeded its turn cap.

Added an isMaxTurnsError() detector and a catch branch that synthesizes an AgentSdkResult from the events captured before the throw, with exitReason='error_max_turns' and costUsd=0 (unknown from the thrown path); see the sketch below. The metric function still runs against whatever assistant turns were collected, so the trial produces a usable number. Hoisted events/assistantTurns/toolCalls/assistantTextParts and the timing counters out of the inner try so the catch branch can read them. No behavior change on the success path or on the rate-limit retry paths.

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

The prompt 'list every TypeScript file under src/ and tell me what each exports' needs 1 turn for Glob, ~5 for Reads, and 1 for the summary. The default maxTurns=5 was not enough; a prior run threw from the SDK on this fixture and tanked the whole periodic eval. Bumping to 15 gives headroom. The runner now also handles max-turns gracefully even if a future fixture underestimates, so this is belt and suspenders.
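A plausible shape for the new predicates and two of the metrics follows; the names and the 20% thresholds come from the commit, while the signatures and the simplified event shape are assumptions.

```ts
// 20%-lift predicates over per-arm means; exact signatures are assumed.
type ArmPredicate = (overlayOnMean: number, baselineMean: number) => boolean;

// Pass when overlay-ON is at most 80% of baseline (e.g. turns to completion).
const lowerIsBetter20Pct: ArmPredicate = (on, off) => on <= off * 0.8;

// Pass when overlay-ON is at least 120% of baseline (e.g. unique files edited).
const higherIsBetter20Pct: ArmPredicate = (on, off) => on >= off * 1.2;

// Simplified view of a session event; the SDK's real message shape is richer.
interface SessionEvent {
  type: string;
  name?: string;
  input?: { file_path?: string };
}

// Count Bash tool_use blocks across the whole session.
function bashToolCallCount(events: SessionEvent[]): number {
  return events.filter((e) => e.type === "tool_use" && e.name === "Bash").length;
}

// Edit/Write/MultiEdit file_path set size, for the literal-interpretation fixture.
function uniqueFilesEdited(events: SessionEvent[]): number {
  const edited = new Set<string>();
  for (const e of events) {
    if (
      e.type === "tool_use" &&
      ["Edit", "Write", "MultiEdit"].includes(e.name ?? "") &&
      e.input?.file_path
    ) {
      edited.add(e.input.file_path);
    }
  }
  return edited.size;
}
```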
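And the graceful max-turns path might look like this sketch. `isMaxTurnsError`'s message test is a guess at how the SDK words its throw, and the event and result types are simplified; the hoisting, the synthesized `exitReason='error_max_turns'`, and `costUsd=0` on the thrown path mirror the commit.

```ts
type SdkEvent = { type: string; [key: string]: unknown };

interface AgentSdkResult {
  events: SdkEvent[];
  turns: number;
  exitReason: "success" | "error_max_turns";
  costUsd: number;
}

// Guessed detector: match however the SDK words its max-turns throw.
function isMaxTurnsError(err: unknown): boolean {
  return err instanceof Error && /max.?turns/i.test(err.message);
}

async function collectTrial(stream: AsyncIterable<SdkEvent>): Promise<AgentSdkResult> {
  // Hoisted above the try so the catch branch can still read them,
  // mirroring the hoist described in the commit.
  const events: SdkEvent[] = [];
  let turns = 0;
  let costUsd = 0;
  try {
    for await (const ev of stream) {
      events.push(ev);
      if (ev.type === "assistant") turns += 1;
      if (ev.type === "result" && typeof ev.total_cost_usd === "number") {
        costUsd = ev.total_cost_usd;
      }
    }
    return { events, turns, exitReason: "success", costUsd };
  } catch (err) {
    if (!isMaxTurnsError(err)) throw err; // everything else still propagates
    // Synthesize a usable result from whatever arrived before the throw;
    // cost is unknown on this path, so it stays 0.
    return { events, turns, exitReason: "error_max_turns", costUsd: 0 };
  }
}
```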
* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'` (see the sketch below). Tests whether the overlays behave differently on a weaker Claude model where baseline behavior is shakier. Sonnet trials cost roughly a third to a quarter of Opus trials, so these 5 fixtures add ~$4.50 to a full run.

Measurement result from the first paired run (100 trials total, ~$14.55):

- **Sonnet + effort-match shows real overlay benefit.** With the overlay on, Sonnet averages 2.5 turns on a trivial `What's the version in config.json?` prompt. Without it, Sonnet takes exactly 3.0 turns in all 10 trials. That is a ~17% reduction, below the 20% pass threshold, but the signal is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF [3,3,3,3,3,3,3,3,3,3].
- All other Sonnet dimensions are flat (fanout, dedicated-tools, literal interpretation), same as Opus on those axes.
- Opus effort-match remains flat (2.60 vs 2.50 turns, +4% slower with the overlay).

Implication: the effect is model-stratified. The overlay stack helps Sonnet on some axes where it does nothing on Opus, so wholesale removal would hurt Sonnet. Per-nudge, per-model measurement is the right move going forward.

* chore: bump version to 1.10.1.0

Updates VERSION, package.json, the CHANGELOG header, and the TODOS completion marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
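A hypothetical sketch of how the Sonnet variants can be stamped out from the Opus fixtures, along with the arithmetic behind the ~17% effort-match figure; the `Fixture` fields and the `opusFixtures` sample entry are illustrative only.

```ts
interface Fixture { id: string; model: string; prompt: string }

const opusFixtures: Fixture[] = [
  {
    id: "opus-4-7-effort-match-trivial",
    model: "claude-opus-4-7",
    prompt: "What's the version in config.json?",
  },
];

// Same overlays, prompts, and metrics; only the id and model change.
const sonnetFixtures = opusFixtures.map((f) => ({
  ...f,
  id: f.id.replace("opus-4-7", "sonnet-4-6"),
  model: "claude-sonnet-4-6",
}));

// The effort-match distributions quoted above:
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
const overlayOn = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3];
const overlayOff = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3];
mean(overlayOn);                        // 2.5 turns
mean(overlayOff);                       // 3.0 turns
1 - mean(overlayOn) / mean(overlayOff); // 0.1666..., i.e. ~17%, under the 20% gate
```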
model-overlays/opus-4-7.md (1.5 KiB):
{{INHERIT:claude}}
Effort-match the step. Simple file reads, config checks, command lookups, and mechanical edits don't need deep reasoning. Complete them quickly and move on. Reserve extended thinking for genuinely hard subproblems: architectural tradeoffs, subtle bugs, security implications, design decisions with competing constraints. Over-thinking simple steps wastes tokens and time.
Pace questions to the skill. If the current skill's text contains STOP. AskUserQuestion anywhere, pace one question per turn: emit the question as a tool_use, stop, wait for the user's response, then continue. Do not batch. A finding with an "obvious fix" is still a finding and still needs user approval before it lands in the plan. Only batch clarifying questions upfront when (a) the skill has no STOP. AskUserQuestion directive AND (b) you need multiple unrelated clarifications before you can begin. When in doubt, ask one question per turn.
Literal interpretation awareness. Opus 4.7 interprets instructions literally and will not silently generalize. When the user says "fix the tests," fix all failing tests that this branch introduced or is responsible for, not just the first one (and not pre-existing failures in unrelated code). When the user says "update the docs," update every relevant doc in scope, not just the most obvious one. Read the full scope of what was asked and deliver the full scope. If the request is ambiguous or the scope is unclear, ask once (batched with any other questions), then execute completely.