Commit Graph

2 Commits

Author SHA1 Message Date
Garry Tan 9e244c0bed v1.11.1.0 fix: plan-mode handshake + canUseTool test harness (#1182)
* feat: plan-mode handshake for interactive review skills

Add a preamble-level STOP-Ask handshake that fires when the user invokes any
of the 4 interactive review skills (plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review) while their Claude Code session is
in plan mode. Without this gate, plan mode's "this supercedes any other
instructions" system-reminder outranked the skills' interactive STOP gates
and the skills silently wrote plan files without any per-finding AskUserQuestion.

The handshake offers 2 options (exit-and-rerun, cancel) — the original
third "stay and batch" option was dropped after two independent reviewers
flagged it as a silent bypass of the skills' anti-skip rule.

Architecture decisions (CEO+Eng review):
- Preamble-level resolver, not per-template injection (Codex finding #2)
- Position 1 in preamble composition: after bash block (_SESSION_ID live),
  before onboarding AskUserQuestion gates (so fresh-install users see the
  handshake first, not drowned in telemetry/proactive/routing prompts)
- Generator-only `interactive: true` frontmatter flag, following the
  `preamble-tier` precedent (no host-config frontmatter allowlist edits)
- Host-scoped to Claude via `ctx.host === 'claude'` check inside the
  resolver (simpler than `suppressedResolvers` which only gates `{{}}`
  placeholders)
- One-way-door classification in scripts/question-registry.ts for all 4
  skills so question-tuning `never-ask` preferences can't suppress the gate
- Synchronous telemetry write to ~/.gstack/analytics/skill-usage.jsonl on
  handshake fire (captures A-exit and C-cancel outcomes that terminate the
  skill before end-of-run telemetry runs)

Also adds an explicit STOP block to plan-ceo-review Step 0C-bis so the
approach-selection question can't silently skip to mode selection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: extend agent-sdk-runner with canUseTool for AskUserQuestion interception

Test harness at test/helpers/agent-sdk-runner.ts gains an optional
`canUseTool` callback parameter. When a test supplies it, the harness
flips `permissionMode` from `bypassPermissions` (overlay-harness default)
to `default` so the SDK actually invokes the callback on every tool use,
and auto-adds `AskUserQuestion` to `allowedTools` so Claude can fire it
at all.

Exports a `passThroughNonAskUserQuestion` helper so tests that only want
to intercept AskUserQuestion can auto-allow every other tool with one
line: `return passThroughNonAskUserQuestion(toolName, input)`.

This is the foundation for D14 — every future interactive-skill E2E test
can now assert on AskUserQuestion shape and routing. Previous E2E tests
at `test/skill-e2e.test.ts` explicitly instructed the model to skip
AskUserQuestion ("non-interactive run") which meant no test could actually
verify the question content or routing.

6 new unit tests in test/agent-sdk-runner.test.ts cover:
- permissionMode flips to 'default' when canUseTool supplied
- permissionMode stays 'bypassPermissions' when canUseTool absent
- canUseTool callback reaches the SDK options
- AskUserQuestion auto-added to allowedTools when canUseTool supplied
- AskUserQuestion NOT added when canUseTool absent
- passThroughNonAskUserQuestion helper returns allow+updatedInput

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: plan-mode handshake E2E coverage and unit assertions

Adds 6 E2E test files and 8 new unit assertions to verify the plan-mode
handshake works end-to-end and stays correct under regeneration.

E2E tests (gate-tier, paid, EVALS=1 EVALS_TIER=gate):
- test/skill-e2e-plan-ceo-plan-mode.test.ts — handshake fires before any
  Write/Edit when plan-mode distinctive phrase is present; 2-option shape
  (Exit/Cancel); option A routes to ExitPlanMode cleanly
- test/skill-e2e-plan-eng-plan-mode.test.ts — same contract for plan-eng
- test/skill-e2e-plan-design-plan-mode.test.ts — same contract for
  plan-design; exercises C-cancel branch instead of A-exit
- test/skill-e2e-plan-devex-plan-mode.test.ts — same contract for plan-devex
- test/skill-e2e-plan-mode-no-op.test.ts — negative regression: handshake
  must NOT fire when distinctive phrase is absent; skill proceeds normally
  through Step 0 (REGRESSION RULE guardrail against breaking existing
  interactive-review sessions)
- test/e2e-harness-audit.test.ts — free unit test asserting every
  `interactive: true` skill has at least one canUseTool-using test file
  (prevents future drift where a skill opts in without coverage)

Shared helper test/helpers/plan-mode-handshake-helpers.ts centralizes the
canUseTool interceptor + distinctive-phrase injection so the 4 sibling
E2E tests are thin wiring (~20 LOC each) and can't drift out of sync.

Unit assertions added to test/gen-skill-docs.test.ts:
- handshake section present in all 4 Claude-generated SKILL.md files
- handshake section absent from non-interactive Claude skills (ship,
  review, qa, office-hours, codex, retro, cso)
- handshake section absent from non-Claude host outputs (.agents, etc.)
- 0C-bis STOP block present in plan-ceo-review/SKILL.md at correct
  position (between the "Present these approach options" line and
  "### 0D-prelude" header)
- handshake resolver wired BEFORE generateUpgradeCheck in preamble
  composition order

6 new gate-tier entries added to test/helpers/touchfiles.ts so any change
to the handshake resolver, preamble composition, skill templates, question
registry, one-way-door classifier, or agent-sdk-runner fires the relevant
E2E tests. test/touchfiles.test.ts updated for the new selection count
(plan-ceo-review/** now triggers 15 tests, up from 8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(v1.11.1.0): VERSION bump + CHANGELOG entry + TODOS follow-ups

Bumps from main's v1.11.0.0 to v1.11.1.0 (PATCH — bug-fix release, no new
user-facing artifacts). CHANGELOG entry covers the plan-mode handshake,
agent-sdk-runner canUseTool extension, and the 2 follow-up TODOs.

CHANGELOG order: v1.11.1.0 (this) → v1.11.0.0 (workspace-aware ship,
merged from main) → v1.10.1.0 (overlay efficacy harness). No duplicate
headers.

Syncs package.json version to match VERSION per the Step 12 idempotency
invariant (both files must agree or /ship halts).

TODOS.md:
- Preserves the Testing/security-bench-haiku-responses P1 added on main
- Adds P1 "Structural STOP-Ask forcing function" — broader class of the
  bug this release fixes
- Adds P2 "Apply interactive: true to non-review skills (office-hours,
  codex, investigate, qa, retro, cso)"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 00:04:53 -07:00
Garry Tan e3d7f49c74 feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)
* refactor: export readOverlay from model-overlay resolver

Needed by the overlay-efficacy eval harness to resolve INHERIT directives
without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
harness to drive the model through a closer-to-real Claude Code harness
than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

Verifies: SDK loads, claude-opus-4-7 is a live API model, SDKMessage
event shape matches assumptions, readOverlay resolves INHERIT directives
and includes expected content. Run with `bun run scripts/preflight-agent-sdk.ts`.

PREFLIGHT OK on first run, $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

`test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
with explicit `AgentSdkResult` types, process-level API concurrency
semaphore, and 3-shape 429 retry (thrown error, result-message error,
mid-stream SDKRateLimitEvent). Pins the local claude binary via
`pathToClaudeCodeExecutable`.

`test/fixtures/overlay-nudges.ts` holds the typed registry. Two
fixtures for the first measurement: `opus-4-7-fanout-toy` (3-file read)
and `opus-4-7-fanout-realistic` (mixed-tool audit). Strict validator
rejects duplicate ids, non-integer trials, unsafe overlay paths, non-safe
id chars, and missing overlay files at module load.

Adding a future overlay nudge eval = one fixture entry.

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

Stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
happy-path shape, all 3 rate-limit shapes + retry, workspace reset on
retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation,
process-level concurrency cap, options propagation, artifact path
uniqueness, cost/turn mapping, and every validator rejection case.

* test(eval): paid periodic overlay-efficacy harness

`test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES,
runs two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with
bounded concurrency. Arms use SDK preset `claude_code` so both include
the real Claude Code system prompt; overlay-ON appends the resolved
overlay text. Saves per-trial raw event streams to
`~/.gstack/projects/<slug>/transcripts/` for forensic recovery.

Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

Entries for `overlay-harness-opus-4-7-fanout-toy` and
`opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes
`test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

Measured counterproductive under the new SDK harness. Baseline Opus 4.7
emits first-turn parallel tool_use blocks 70% of the time on a 3-file
read prompt. With the custom nudge: 10%. With Anthropic's own canonical
`<use_parallel_tool_calls>` block from their parallel-tool-use docs:
0%. Both overlays suppress fanout; neither improves it.

On realistic multi-tool prompts (audit a project: read files + glob +
summarize), Opus 4.7 never fans out in first turn regardless of overlay.
Zero of 20 trials. Not a prompt problem.

Keeping the other three nudges (effort-match, batch questions, literal
interpretation) pending their own measurement. Harness is ready for
follow-up fixtures — add one entry to
`test/fixtures/overlay-nudges.ts` to measure any overlay bullet.

Cost of investigation: ~$7 total across 3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
(e.g. literal-interpretation 'fix the failing tests' needs write access).
Per-fixture maxTurns lets harder prompts run longer without changing the
default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
- bashToolCallCount — count of Bash tool_use across the session
- turnsToCompletion — SDK-reported num_turns at result
- uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
and fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

Measures three overlay bullets that haven't been tested yet:

- claude-dedicated-tools-vs-bash — claude.md says 'prefer Read/Edit/Write/
  Glob/Grep over cat/sed/find/grep'. Fixture prompts 'list every TypeScript
  file under src/ and tell me what each exports' and counts Bash tool_use
  across the session. Overlay-ON should drop it by >=20%.
- opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads
  don't need deep reasoning.' Fixture uses a trivial one-file prompt
  (config.json lookup) and measures turns_used. Overlay-ON should be
  <=80% of baseline turns.
- opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing
  tests, not just the obvious one.' Fixture seeds three failing test
  files with deliberately distinct failure modes and counts unique files
  edited. Overlay-ON should touch >=20% more files.

Adding a fourth fixture for any remaining overlay nudge is a single entry.
The harness is now proven on: fanout (deleted after measurement), dedicated
tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

Some @anthropic-ai/claude-agent-sdk versions throw from the query
generator when maxTurns is reached, instead of emitting a result
message with subtype='error_max_turns'. The runner treated that as
a non-retryable error and killed the whole periodic run on the first
fixture that exceeded its turn cap.

Added isMaxTurnsError() detector and a catch branch that synthesizes
an AgentSdkResult from events captured before the throw, with
exitReason='error_max_turns' and costUsd=0 (unknown from the thrown
path). The metric function still runs against whatever assistant
turns were collected, so the trial produces a usable number.

Hoisted events/assistantTurns/toolCalls/assistantTextParts and the
timing counters out of the inner try so the catch branch can read
them. No behavior change on the success path or on rate-limit retry
paths.

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

The prompt 'list every TypeScript file under src/ and tell me what
each exports' needs 1 turn for Glob + ~5 for Reads + 1 for summary.
Default maxTurns=5 was not enough; prior run threw from the SDK on
this fixture and tanked the whole periodic eval.

Bumping to 15 gives headroom. The runner now also handles max-turns
gracefully even if a future fixture underestimates, so this is belt
and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
Tests whether the overlays behave differently on a weaker Claude model
where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
Opus so these 5 add ~$4.50 to a full run.

Measurement result from the first paired run (100 trials total,
~$14.55):

- **Sonnet + effort-match shows real overlay benefit.** With the overlay
  on, Sonnet takes 2.5 turns on a trivial `What's the version in
  config.json?` prompt. Without, it takes exactly 3.0 turns in all 10
  trials. ~17% reduction, below the 20% pass threshold but the signal
  is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF
  [3,3,3,3,3,3,3,3,3,3].
- All other Sonnet dimensions flat (fanout, dedicated-tools, literal
  interpretation). Same as Opus on those axes.
- Opus effort-match remains flat (2.60 vs 2.50, +4% slower with overlay).

Implication: model-stratified. The overlay stack helps Sonnet on some
axes where it does nothing on Opus. Wholesale removal would hurt Sonnet.
Per-nudge per-model measurement is the right move going forward.

* chore: bump version to 1.10.1.0

Updates VERSION, package.json, CHANGELOG header, and TODOS completion
marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:42:58 -07:00