Files
gstack/test/skill-e2e-plan-ceo-plan-mode.test.ts
T
Garry Tan 9e244c0bed v1.11.1.0 fix: plan-mode handshake + canUseTool test harness (#1182)
* feat: plan-mode handshake for interactive review skills

Add a preamble-level STOP-Ask handshake that fires when the user invokes any
of the 4 interactive review skills (plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review) while their Claude Code session is
in plan mode. Without this gate, plan mode's "this supercedes any other
instructions" system-reminder outranked the skills' interactive STOP gates
and the skills silently wrote plan files without any per-finding AskUserQuestion.

The handshake offers 2 options (exit-and-rerun, cancel) — the original
third "stay and batch" option was dropped after two independent reviewers
flagged it as a silent bypass of the skills' anti-skip rule.

Architecture decisions (CEO+Eng review):
- Preamble-level resolver, not per-template injection (Codex finding #2)
- Position 1 in preamble composition: after bash block (_SESSION_ID live),
  before onboarding AskUserQuestion gates (so fresh-install users see the
  handshake first, not drowned in telemetry/proactive/routing prompts)
- Generator-only `interactive: true` frontmatter flag, following the
  `preamble-tier` precedent (no host-config frontmatter allowlist edits)
- Host-scoped to Claude via `ctx.host === 'claude'` check inside the
  resolver (simpler than `suppressedResolvers` which only gates `{{}}`
  placeholders)
- One-way-door classification in scripts/question-registry.ts for all 4
  skills so question-tuning `never-ask` preferences can't suppress the gate
- Synchronous telemetry write to ~/.gstack/analytics/skill-usage.jsonl on
  handshake fire (captures A-exit and C-cancel outcomes that terminate the
  skill before end-of-run telemetry runs)

Also adds an explicit STOP block to plan-ceo-review Step 0C-bis so the
approach-selection question can't silently skip to mode selection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: extend agent-sdk-runner with canUseTool for AskUserQuestion interception

Test harness at test/helpers/agent-sdk-runner.ts gains an optional
`canUseTool` callback parameter. When a test supplies it, the harness
flips `permissionMode` from `bypassPermissions` (overlay-harness default)
to `default` so the SDK actually invokes the callback on every tool use,
and auto-adds `AskUserQuestion` to `allowedTools` so Claude can fire it
at all.

Exports a `passThroughNonAskUserQuestion` helper so tests that only want
to intercept AskUserQuestion can auto-allow every other tool with one
line: `return passThroughNonAskUserQuestion(toolName, input)`.

This is the foundation for D14 — every future interactive-skill E2E test
can now assert on AskUserQuestion shape and routing. Previous E2E tests
at `test/skill-e2e.test.ts` explicitly instructed the model to skip
AskUserQuestion ("non-interactive run") which meant no test could actually
verify the question content or routing.

6 new unit tests in test/agent-sdk-runner.test.ts cover:
- permissionMode flips to 'default' when canUseTool supplied
- permissionMode stays 'bypassPermissions' when canUseTool absent
- canUseTool callback reaches the SDK options
- AskUserQuestion auto-added to allowedTools when canUseTool supplied
- AskUserQuestion NOT added when canUseTool absent
- passThroughNonAskUserQuestion helper returns allow+updatedInput

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: plan-mode handshake E2E coverage and unit assertions

Adds 6 E2E test files and 8 new unit assertions to verify the plan-mode
handshake works end-to-end and stays correct under regeneration.

E2E tests (gate-tier, paid, EVALS=1 EVALS_TIER=gate):
- test/skill-e2e-plan-ceo-plan-mode.test.ts — handshake fires before any
  Write/Edit when plan-mode distinctive phrase is present; 2-option shape
  (Exit/Cancel); option A routes to ExitPlanMode cleanly
- test/skill-e2e-plan-eng-plan-mode.test.ts — same contract for plan-eng
- test/skill-e2e-plan-design-plan-mode.test.ts — same contract for
  plan-design; exercises C-cancel branch instead of A-exit
- test/skill-e2e-plan-devex-plan-mode.test.ts — same contract for plan-devex
- test/skill-e2e-plan-mode-no-op.test.ts — negative regression: handshake
  must NOT fire when distinctive phrase is absent; skill proceeds normally
  through Step 0 (REGRESSION RULE guardrail against breaking existing
  interactive-review sessions)
- test/e2e-harness-audit.test.ts — free unit test asserting every
  `interactive: true` skill has at least one canUseTool-using test file
  (prevents future drift where a skill opts in without coverage)

Shared helper test/helpers/plan-mode-handshake-helpers.ts centralizes the
canUseTool interceptor + distinctive-phrase injection so the 4 sibling
E2E tests are thin wiring (~20 LOC each) and can't drift out of sync.

Unit assertions added to test/gen-skill-docs.test.ts:
- handshake section present in all 4 Claude-generated SKILL.md files
- handshake section absent from non-interactive Claude skills (ship,
  review, qa, office-hours, codex, retro, cso)
- handshake section absent from non-Claude host outputs (.agents, etc.)
- 0C-bis STOP block present in plan-ceo-review/SKILL.md at correct
  position (between the "Present these approach options" line and
  "### 0D-prelude" header)
- handshake resolver wired BEFORE generateUpgradeCheck in preamble
  composition order

6 new gate-tier entries added to test/helpers/touchfiles.ts so any change
to the handshake resolver, preamble composition, skill templates, question
registry, one-way-door classifier, or agent-sdk-runner fires the relevant
E2E tests. test/touchfiles.test.ts updated for the new selection count
(plan-ceo-review/** now triggers 15 tests, up from 8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(v1.11.1.0): VERSION bump + CHANGELOG entry + TODOS follow-ups

Bumps from main's v1.11.0.0 to v1.11.1.0 (PATCH — bug-fix release, no new
user-facing artifacts). CHANGELOG entry covers the plan-mode handshake,
agent-sdk-runner canUseTool extension, and the 2 follow-up TODOs.

CHANGELOG order: v1.11.1.0 (this) → v1.11.0.0 (workspace-aware ship,
merged from main) → v1.10.1.0 (overlay efficacy harness). No duplicate
headers.

Syncs package.json version to match VERSION per the Step 12 idempotency
invariant (both files must agree or /ship halts).

TODOS.md:
- Preserves the Testing/security-bench-haiku-responses P1 added on main
- Adds P1 "Structural STOP-Ask forcing function" — broader class of the
  bug this release fixes
- Adds P2 "Apply interactive: true to non-review skills (office-hours,
  codex, investigate, qa, retro, cso)"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 00:04:53 -07:00

41 lines
1.6 KiB
TypeScript
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
/**
* plan-ceo-review plan-mode handshake E2E (gate tier, paid).
*
* Asserts: when /plan-ceo-review is invoked with the plan-mode distinctive
* phrase in the system reminder, the skill fires AskUserQuestion FIRST
* (before any Write or Edit), the question has exactly 2 options (A exit,
* C cancel), picking "Exit" leads to an orderly exit with no plan file
* written.
*
* Cost: ~$0.50$1.00 per run. Gated: EVALS=1 EVALS_TIER=gate.
* Depends on: scripts/resolvers/preamble/generate-plan-mode-handshake.ts,
* test/helpers/agent-sdk-runner.ts (canUseTool extension).
*/
import { describe, test, expect } from 'bun:test';
import {
runPlanModeHandshakeTest,
assertHandshakeShape,
} from './helpers/plan-mode-handshake-helpers';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
describeE2E('plan-ceo-review plan-mode handshake (gate)', () => {
test('handshake fires before any Write/Edit when plan mode is detected', async () => {
const result = await runPlanModeHandshakeTest({
skillName: 'plan-ceo-review',
answerLabel: 'Exit',
});
// Handshake must have fired at least once.
expect(result.askUserQuestions.length).toBeGreaterThanOrEqual(1);
// Critically: no Write or Edit fired before the first AskUserQuestion.
// This is the bug v1.10.2.0 fixes — plan mode used to allow silent
// plan-file writes without any interactive gate.
expect(result.writeOrEditBeforeAsk).toBe(false);
// Handshake shape: 2 options (Exit/Cancel), Option B dropped per D8.
assertHandshakeShape(result.askUserQuestions[0]!);
}, 120_000);
});