diff --git a/CHANGELOG.md b/CHANGELOG.md index 311b2525..9212bdd0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,58 @@ # Changelog +## [1.11.1.0] - 2026-04-23 + +## **Plan mode stopped silently rubber-stamping your reviews. The forcing questions actually fire now.** + +If you ran `/plan-ceo-review` or any interactive review skill while in plan mode, the skill used to read your diff, skip every STOP gate, write a plan file, and exit. Zero AskUserQuestion calls. Zero mode selection. Zero per-section decisions. The skill's interactive contract got outranked by plan mode's system-reminder, which tells the model to run its own workflow and ignore everything else. This release adds a preamble-level STOP gate that fires before any analysis, so you always get the interactive review the skill was designed to run. + +### What shipped + +Four interactive review skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review) now emit a two-option AskUserQuestion the moment plan mode is detected: exit-and-rerun interactively, or cancel. No silent bypass. The gate is classified one-way-door in the question registry so `/plan-tune` preferences can't auto-decide past it. Outcome gets logged to `~/.gstack/analytics/skill-usage.jsonl` synchronously when the handshake fires, so A-exit and C-cancel are captured even though they terminate the skill before the end-of-run telemetry block. + +The test harness got a canUseTool extension built on Anthropic's Agent SDK (already installed at v0.2.117). When a test supplies a canUseTool callback, `test/helpers/agent-sdk-runner.ts` flips `permissionMode` from `bypassPermissions` to `default` so the callback actually fires. This is the foundation for asserting AskUserQuestion content end-to-end, which gstack's E2E tests previously couldn't do at all. They had to instruct the model to skip AskUserQuestion entirely. Every future interactive-skill test builds on this. + +### The numbers that matter + +Source: new unit tests in `test/gen-skill-docs.test.ts` (8 tests covering handshake presence, absence, composition ordering, 0C-bis STOP block) and `test/agent-sdk-runner.test.ts` (6 tests covering canUseTool + permission-mode + passThrough helper). All 14 pass locally in <250ms, free tier. + +| Surface | Before | After | +|---|---|---| +| Claude skills rendering the handshake | 0 | 4 (plan-ceo, plan-eng, plan-design, plan-devex) | +| Non-Claude host outputs with handshake text | N/A | 0 (host-scoped via `ctx.host === 'claude'` check) | +| E2E tests that can assert AskUserQuestion content | 0 | 1 harness primitive, ready for every interactive skill | +| Plan-mode entry to any of 4 review skills | Silent bypass | Two-option STOP gate | +| Step 0C-bis in plan-ceo-review | No STOP block, could drift to 0F | Explicit `**STOP.**` block matching 0F pattern | +| Post-handshake telemetry outcomes captured | Neither A-exit nor C-cancel | Both (synchronous write before ExitPlanMode) | + +### What this means for builders + +If you're running gstack in plan mode on a PR review, you'll see one question before the skill does anything: "Exit plan mode and run interactively, or cancel?" Pick A, press esc-esc, rerun the skill in normal mode, get the full interactive review you expected. Pick C to bail cleanly. No more silent rubber-stamp. + +If you're building new interactive skills (yours or contributing to gstack), you can now write real E2E tests that assert on AskUserQuestion shape and routing via the canUseTool harness. See `test/agent-sdk-runner.test.ts` for the pattern and `test/helpers/agent-sdk-runner.ts` for the API. + +### Itemized changes + +#### Fixed + +- Plan mode no longer silently skips AskUserQuestion gates in `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, or `/plan-devex-review`. A preamble-level handshake fires as the first thing the skill does when the plan-mode system-reminder is present, forcing a user choice before any analysis or plan-file writes. +- `/plan-ceo-review` Step 0C-bis now has an explicit STOP block matching the pattern used at Step 0F, so the approach-selection question can't be silently skipped when the skill continues to mode selection. + +#### Added + +- New resolver `scripts/resolvers/preamble/generate-plan-mode-handshake.ts` emits the handshake prose and telemetry bash. Host-scoped to Claude only via `ctx.host === 'claude'` check. Opt-in per skill via `interactive: true` in frontmatter. +- New frontmatter field `interactive: boolean` on skill templates. Generator-only input parsed by `scripts/gen-skill-docs.ts`, never written to generated SKILL.md output (follows the `preamble-tier` precedent). +- New question registry entries `plan-{ceo,eng,design,devex}-review-plan-mode-handshake` with `door_type: 'one-way'` in `scripts/question-registry.ts`. Question-tuning `never-ask` preferences cannot suppress this gate. +- New telemetry field `plan_mode_handshake` in `~/.gstack/analytics/skill-usage.jsonl` with outcomes `fired`, `A-exit`, `C-cancel` written synchronously as the handshake fires. Captures outcomes that would otherwise terminate the skill before end-of-run telemetry runs. +- `test/helpers/agent-sdk-runner.ts` extended with optional `canUseTool` callback parameter. When supplied, flips `permissionMode` to `default`, auto-adds `AskUserQuestion` to `allowedTools`, and passes the callback to the SDK. Exports `passThroughNonAskUserQuestion` helper for tests that only want to assert on AskUserQuestion but auto-allow other tools. + +#### For contributors + +- Added 5 unit tests in `test/gen-skill-docs.test.ts` verifying handshake presence in 4 interactive skills, absence in non-interactive skills, absence in non-Claude host outputs, composition ordering (handshake precedes upgrade-check), and 0C-bis STOP block wiring. +- Added 6 unit tests in `test/agent-sdk-runner.test.ts` verifying permission-mode flip, allowedTools auto-injection, canUseTool callback propagation, and pass-through helper behavior. +- Added 6 gate-tier entries to `test/helpers/touchfiles.ts` covering the new E2E test surface. Dependency glob fires any of the new tests when: the relevant skill template, the handshake resolver, preamble composition, the question registry, the one-way-door classifier, or the agent-sdk-runner changes. +- Filed 2 P1/P2 follow-ups in `TODOS.md`: structural STOP-Ask forcing function across all skills (broader class of bug beyond plan-mode entry), and extending `interactive: true` audit to non-review interactive skills like `/office-hours`, `/codex`, `/investigate`, `/qa`. + ## [1.11.0.0] - 2026-04-23 ## **Workspace-aware ship. Two open PRs can't both claim the same VERSION anymore.** @@ -51,6 +104,57 @@ If you're routinely running 3-10 Conductor windows against the same repo, this i - `test/gstack-next-version.test.ts`. 21 pure-function tests (parseVersion / bumpVersion / cmpVersion / pickNextSlot with 8 collision scenarios / markActiveSiblings 4 cases) plus a CLI smoke test against the live repo. - Golden ship fixtures refreshed for all three hosts (claude, codex, factory) after Step 12 and Step 19 template changes. This is exactly the blast radius Codex flagged during the CEO review (cross-model tension #8), handled in the same PR rather than as a follow-up. +## **Plan mode stopped silently rubber-stamping your reviews. The forcing questions actually fire now.** + +If you ran `/plan-ceo-review` or any interactive review skill while in plan mode, the skill used to read your diff, skip every STOP gate, write a plan file, and exit. Zero AskUserQuestion calls. Zero mode selection. Zero per-section decisions. The skill's interactive contract got outranked by plan mode's system-reminder, which tells the model to run its own workflow and ignore everything else. This release adds a preamble-level STOP gate that fires before any analysis, so you always get the interactive review the skill was designed to run. + +### What shipped + +Four interactive review skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review) now emit a two-option AskUserQuestion the moment plan mode is detected: exit-and-rerun interactively, or cancel. No silent bypass. The gate is classified one-way-door in the question registry so `/plan-tune` preferences can't auto-decide past it. Outcome gets logged to `~/.gstack/analytics/skill-usage.jsonl` synchronously when the handshake fires, so A-exit and C-cancel are captured even though they terminate the skill before the end-of-run telemetry block. + +The test harness got a canUseTool extension built on Anthropic's Agent SDK (already installed at v0.2.117). When a test supplies a canUseTool callback, `test/helpers/agent-sdk-runner.ts` flips `permissionMode` from `bypassPermissions` to `default` so the callback actually fires. This is the foundation for asserting AskUserQuestion content end-to-end, which gstack's E2E tests previously couldn't do at all. They had to instruct the model to skip AskUserQuestion entirely. Every future interactive-skill test builds on this. + +### The numbers that matter + +Source: new unit tests in `test/gen-skill-docs.test.ts` (8 tests covering handshake presence, absence, composition ordering, 0C-bis STOP block) and `test/agent-sdk-runner.test.ts` (6 tests covering canUseTool + permission-mode + passThrough helper). All 14 pass locally in <250ms, free tier. + +| Surface | Before | After | +|---|---|---| +| Claude skills rendering the handshake | 0 | 4 (plan-ceo, plan-eng, plan-design, plan-devex) | +| Non-Claude host outputs with handshake text | N/A | 0 (host-scoped via `ctx.host === 'claude'` check) | +| E2E tests that can assert AskUserQuestion content | 0 | 1 harness primitive, ready for every interactive skill | +| Plan-mode entry to any of 4 review skills | Silent bypass | Two-option STOP gate | +| Step 0C-bis in plan-ceo-review | No STOP block, could drift to 0F | Explicit `**STOP.**` block matching 0F pattern | +| Post-handshake telemetry outcomes captured | Neither A-exit nor C-cancel | Both (synchronous write before ExitPlanMode) | + +### What this means for builders + +If you're running gstack in plan mode on a PR review, you'll see one question before the skill does anything: "Exit plan mode and run interactively, or cancel?" Pick A, press esc-esc, rerun the skill in normal mode, get the full interactive review you expected. Pick C to bail cleanly. No more silent rubber-stamp. + +If you're building new interactive skills (yours or contributing to gstack), you can now write real E2E tests that assert on AskUserQuestion shape and routing via the canUseTool harness. See `test/agent-sdk-runner.test.ts` for the pattern and `test/helpers/agent-sdk-runner.ts` for the API. + +### Itemized changes + +#### Fixed + +- Plan mode no longer silently skips AskUserQuestion gates in `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, or `/plan-devex-review`. A preamble-level handshake fires as the first thing the skill does when the plan-mode system-reminder is present, forcing a user choice before any analysis or plan-file writes. +- `/plan-ceo-review` Step 0C-bis now has an explicit STOP block matching the pattern used at Step 0F, so the approach-selection question can't be silently skipped when the skill continues to mode selection. + +#### Added + +- New resolver `scripts/resolvers/preamble/generate-plan-mode-handshake.ts` emits the handshake prose and telemetry bash. Host-scoped to Claude only via `ctx.host === 'claude'` check. Opt-in per skill via `interactive: true` in frontmatter. +- New frontmatter field `interactive: boolean` on skill templates. Generator-only input parsed by `scripts/gen-skill-docs.ts`, never written to generated SKILL.md output (follows the `preamble-tier` precedent). +- New question registry entry `plan-mode-handshake` with `door_type: 'one-way'` in `scripts/question-registry.ts`. Question-tuning `never-ask` preferences cannot suppress this gate. +- New telemetry field `plan_mode_handshake` in `~/.gstack/analytics/skill-usage.jsonl` with outcomes `fired`, `A-exit`, `C-cancel` written synchronously as the handshake fires. Captures outcomes that would otherwise terminate the skill before end-of-run telemetry runs. +- `test/helpers/agent-sdk-runner.ts` extended with optional `canUseTool` callback parameter. When supplied, flips `permissionMode` to `default`, auto-adds `AskUserQuestion` to `allowedTools`, and passes the callback to the SDK. Exports `passThroughNonAskUserQuestion` helper for tests that only want to assert on AskUserQuestion but auto-allow other tools. + +#### For contributors + +- Added 8 unit tests in `test/gen-skill-docs.test.ts` verifying handshake presence in 4 interactive skills, absence in non-interactive skills, absence in non-Claude host outputs, composition ordering (handshake precedes upgrade-check), and 0C-bis STOP block wiring. +- Added 6 unit tests in `test/agent-sdk-runner.test.ts` verifying permission-mode flip, allowedTools auto-injection, canUseTool callback propagation, and pass-through helper behavior. +- Added 6 gate-tier entries to `test/helpers/touchfiles.ts` covering the new E2E test surface. Dependency glob fires any of the new tests when: the relevant skill template, the handshake resolver, preamble composition, the question registry, the one-way-door classifier, or the agent-sdk-runner changes. +- Filed 2 P1/P2 follow-ups in `TODOS.md`: structural STOP-Ask forcing function across all skills (broader class of bug beyond plan-mode entry), and extending `interactive: true` audit to non-review interactive skills like `/office-hours`, `/codex`, `/investigate`, `/qa`. + ## [1.10.1.0] - 2026-04-23 ## **We tried to make Opus 4.7 faster with a prompt. Measurement said it got slower. Pulled the bullet.** diff --git a/TODOS.md b/TODOS.md index 1dda875b..5264574c 100644 --- a/TODOS.md +++ b/TODOS.md @@ -20,6 +20,38 @@ --- +## P1: Structural STOP-Ask forcing function across all skills (v1.11.1.0 follow-up) + +**What:** Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired ` in this turn for the literal phrase:** + +> `Plan mode is active. The user indicated that they do not want you to execute yet` + +If that phrase is **absent**: proceed normally. This section is a no-op. + +If that phrase is **present**, the user is in plan mode. Plan mode's system +reminder says "This supercedes any other instructions you have received," +which conflicts with this skill's interactive STOP-Ask workflow. You MUST +resolve the conflict via AskUserQuestion BEFORE reading any files, running +any bash, or composing any plan content. + +### What to do when plan mode is detected + +Before emitting the AskUserQuestion, run this bash block synchronously to +log that the handshake fired (captures A-exit and C-cancel outcomes that +would terminate the skill before end-of-skill telemetry runs): + +```bash +# PLAN MODE EXCEPTION — ALWAYS RUN (telemetry-only write to ~/.gstack/) +mkdir -p ~/.gstack/analytics +echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"fired","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true +``` + +Then emit exactly **one** AskUserQuestion with `question_id: "${SKILL_NAME}-plan-mode-handshake"` +(e.g., `plan-ceo-review-plan-mode-handshake`, using the current skill's name) +and these two options. The question is classified `door_type: one-way` in +the question registry for every interactive skill, so question-tuning +preferences (`never-ask`, `always-ask`) do NOT apply — this gate always fires. + +**Question body (follow the AskUserQuestion Format section below):** + +> This skill runs an interactive review that stops at every finding to ask +> you a question. Plan mode's default workflow is "read files, write plan, +> exit" — that silently bypasses every STOP gate in this skill. How do you +> want to proceed? +> +> **Recommendation: A** because this skill was designed for back-and-forth. +> Each scope call and each per-section finding needs your decision before it +> lands in the plan. Exiting plan mode and running the skill normally is the +> only path that preserves the interactive contract. +> +> *Note: options differ in kind (workflow shape), not coverage — no +> completeness score.* +> +> **A) Exit plan mode and run interactively (recommended)** +> ✅ Every STOP gate in this skill fires as designed — you approve each +> scope call, each per-section finding, each cross-model tension before any +> decision lands in the plan. No silent bypass. +> ✅ Matches the skill's documented workflow. Each AskUserQuestion has a +> clear recommendation, pros/cons, and net line you can skim in ~5 seconds. +> ❌ Two-step: press esc-esc to exit plan mode, then rerun +> `/plan-{skill-name}`. Slight context-switch friction, but the alternative +> is shipping a rubber-stamp review. +> +> **C) Cancel — I meant to run something else** +> ✅ Clean exit, no partial state, no plan file written, no findings +> recorded. Use this if you invoked the skill by mistake. +> ❌ No output at all — no review, no plan file. Fine if that's what you +> want; otherwise pick A. +> +> **Net.** Plan mode is incompatible with this skill's per-finding STOP +> gates. A is the right choice for any real review; C is the bail-out. + +### Routing the user's answer + +**If the user picks A (exit and rerun):** + +1. Append the outcome to the telemetry log (synchronous, before ExitPlanMode): + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"A-exit","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Respond to the user: "Press **esc-esc** to exit plan mode, then rerun + `/{skill-name}`. The skill will run interactively with every STOP gate + firing as designed." +3. Call `ExitPlanMode` with an empty plan body (plan mode requires + turn-end via AskUserQuestion or ExitPlanMode; there is no plan to + approve, so ExitPlanMode with an empty message is the correct exit). + +**If the user picks C (cancel):** + +1. Append the outcome: + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"C-cancel","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Tell the user: "Cancelled. No plan written." +3. Call `ExitPlanMode` with an empty message noting the user cancelled. + +**After the handshake completes (either A or C),** do NOT continue with the +rest of this skill's workflow. The handshake is terminal for this turn. + + If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills AND do not auto-invoke skills based on conversation context. Only run skills the user explicitly types (e.g., /qa, /ship). If you would have auto-invoked a skill, instead briefly say: @@ -1410,6 +1505,9 @@ Rules: Present these approach options via AskUserQuestion using the preamble's AskUserQuestion Format section: include RECOMMENDATION and `Completeness: N/10` on every option. These approaches differ in coverage (minimal viable vs ideal architecture), so completeness scoring applies directly. +**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. Do NOT proceed to Step 0D or 0F until the user responds to 0C-bis. A "clearly winning approach" is still an approach decision and still needs explicit user approval before it lands in the plan. +**Reminder: Do NOT make any code changes. Review only.** + ### 0D-prelude. Expansion Framing (shared by EXPANSION and SELECTIVE EXPANSION) Every expansion proposal you generate in SCOPE EXPANSION or SELECTIVE EXPANSION mode follows this framing pattern: diff --git a/plan-ceo-review/SKILL.md.tmpl b/plan-ceo-review/SKILL.md.tmpl index 9fb66a2c..45648f80 100644 --- a/plan-ceo-review/SKILL.md.tmpl +++ b/plan-ceo-review/SKILL.md.tmpl @@ -1,6 +1,7 @@ --- name: plan-ceo-review preamble-tier: 3 +interactive: true version: 1.0.0 description: | CEO/founder-mode plan review. Rethink the problem, find the 10-star product, @@ -248,6 +249,9 @@ Rules: Present these approach options via AskUserQuestion using the preamble's AskUserQuestion Format section: include RECOMMENDATION and `Completeness: N/10` on every option. These approaches differ in coverage (minimal viable vs ideal architecture), so completeness scoring applies directly. +**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. Do NOT proceed to Step 0D or 0F until the user responds to 0C-bis. A "clearly winning approach" is still an approach decision and still needs explicit user approval before it lands in the plan. +**Reminder: Do NOT make any code changes. Review only.** + ### 0D-prelude. Expansion Framing (shared by EXPANSION and SELECTIVE EXPANSION) Every expansion proposal you generate in SCOPE EXPANSION or SELECTIVE EXPANSION mode follows this framing pattern: diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md index 95dbbc76..dcf0474b 100644 --- a/plan-design-review/SKILL.md +++ b/plan-design-review/SKILL.md @@ -1,6 +1,7 @@ --- name: plan-design-review preamble-tier: 3 +interactive: true version: 2.0.0 description: | Designer's eye plan review — interactive, like CEO and Eng review. @@ -115,6 +116,100 @@ echo "CHECKPOINT_PUSH: $_CHECKPOINT_PUSH" [ -n "$OPENCLAW_SESSION" ] && echo "SPAWNED_SESSION: true" || true ``` +## Plan Mode Handshake — FIRST, BEFORE ANY ANALYSIS + +**Check every `` in this turn for the literal phrase:** + +> `Plan mode is active. The user indicated that they do not want you to execute yet` + +If that phrase is **absent**: proceed normally. This section is a no-op. + +If that phrase is **present**, the user is in plan mode. Plan mode's system +reminder says "This supercedes any other instructions you have received," +which conflicts with this skill's interactive STOP-Ask workflow. You MUST +resolve the conflict via AskUserQuestion BEFORE reading any files, running +any bash, or composing any plan content. + +### What to do when plan mode is detected + +Before emitting the AskUserQuestion, run this bash block synchronously to +log that the handshake fired (captures A-exit and C-cancel outcomes that +would terminate the skill before end-of-skill telemetry runs): + +```bash +# PLAN MODE EXCEPTION — ALWAYS RUN (telemetry-only write to ~/.gstack/) +mkdir -p ~/.gstack/analytics +echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"fired","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true +``` + +Then emit exactly **one** AskUserQuestion with `question_id: "${SKILL_NAME}-plan-mode-handshake"` +(e.g., `plan-ceo-review-plan-mode-handshake`, using the current skill's name) +and these two options. The question is classified `door_type: one-way` in +the question registry for every interactive skill, so question-tuning +preferences (`never-ask`, `always-ask`) do NOT apply — this gate always fires. + +**Question body (follow the AskUserQuestion Format section below):** + +> This skill runs an interactive review that stops at every finding to ask +> you a question. Plan mode's default workflow is "read files, write plan, +> exit" — that silently bypasses every STOP gate in this skill. How do you +> want to proceed? +> +> **Recommendation: A** because this skill was designed for back-and-forth. +> Each scope call and each per-section finding needs your decision before it +> lands in the plan. Exiting plan mode and running the skill normally is the +> only path that preserves the interactive contract. +> +> *Note: options differ in kind (workflow shape), not coverage — no +> completeness score.* +> +> **A) Exit plan mode and run interactively (recommended)** +> ✅ Every STOP gate in this skill fires as designed — you approve each +> scope call, each per-section finding, each cross-model tension before any +> decision lands in the plan. No silent bypass. +> ✅ Matches the skill's documented workflow. Each AskUserQuestion has a +> clear recommendation, pros/cons, and net line you can skim in ~5 seconds. +> ❌ Two-step: press esc-esc to exit plan mode, then rerun +> `/plan-{skill-name}`. Slight context-switch friction, but the alternative +> is shipping a rubber-stamp review. +> +> **C) Cancel — I meant to run something else** +> ✅ Clean exit, no partial state, no plan file written, no findings +> recorded. Use this if you invoked the skill by mistake. +> ❌ No output at all — no review, no plan file. Fine if that's what you +> want; otherwise pick A. +> +> **Net.** Plan mode is incompatible with this skill's per-finding STOP +> gates. A is the right choice for any real review; C is the bail-out. + +### Routing the user's answer + +**If the user picks A (exit and rerun):** + +1. Append the outcome to the telemetry log (synchronous, before ExitPlanMode): + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"A-exit","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Respond to the user: "Press **esc-esc** to exit plan mode, then rerun + `/{skill-name}`. The skill will run interactively with every STOP gate + firing as designed." +3. Call `ExitPlanMode` with an empty plan body (plan mode requires + turn-end via AskUserQuestion or ExitPlanMode; there is no plan to + approve, so ExitPlanMode with an empty message is the correct exit). + +**If the user picks C (cancel):** + +1. Append the outcome: + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"C-cancel","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Tell the user: "Cancelled. No plan written." +3. Call `ExitPlanMode` with an empty message noting the user cancelled. + +**After the handshake completes (either A or C),** do NOT continue with the +rest of this skill's workflow. The handshake is terminal for this turn. + + If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills AND do not auto-invoke skills based on conversation context. Only run skills the user explicitly types (e.g., /qa, /ship). If you would have auto-invoked a skill, instead briefly say: diff --git a/plan-design-review/SKILL.md.tmpl b/plan-design-review/SKILL.md.tmpl index 5c224d13..e44ba7da 100644 --- a/plan-design-review/SKILL.md.tmpl +++ b/plan-design-review/SKILL.md.tmpl @@ -1,6 +1,7 @@ --- name: plan-design-review preamble-tier: 3 +interactive: true version: 2.0.0 description: | Designer's eye plan review — interactive, like CEO and Eng review. diff --git a/plan-devex-review/SKILL.md b/plan-devex-review/SKILL.md index 96590d9e..e2fccc5d 100644 --- a/plan-devex-review/SKILL.md +++ b/plan-devex-review/SKILL.md @@ -1,6 +1,7 @@ --- name: plan-devex-review preamble-tier: 3 +interactive: true version: 2.0.0 description: | Interactive developer experience plan review. Explores developer personas, @@ -119,6 +120,100 @@ echo "CHECKPOINT_PUSH: $_CHECKPOINT_PUSH" [ -n "$OPENCLAW_SESSION" ] && echo "SPAWNED_SESSION: true" || true ``` +## Plan Mode Handshake — FIRST, BEFORE ANY ANALYSIS + +**Check every `` in this turn for the literal phrase:** + +> `Plan mode is active. The user indicated that they do not want you to execute yet` + +If that phrase is **absent**: proceed normally. This section is a no-op. + +If that phrase is **present**, the user is in plan mode. Plan mode's system +reminder says "This supercedes any other instructions you have received," +which conflicts with this skill's interactive STOP-Ask workflow. You MUST +resolve the conflict via AskUserQuestion BEFORE reading any files, running +any bash, or composing any plan content. + +### What to do when plan mode is detected + +Before emitting the AskUserQuestion, run this bash block synchronously to +log that the handshake fired (captures A-exit and C-cancel outcomes that +would terminate the skill before end-of-skill telemetry runs): + +```bash +# PLAN MODE EXCEPTION — ALWAYS RUN (telemetry-only write to ~/.gstack/) +mkdir -p ~/.gstack/analytics +echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"fired","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true +``` + +Then emit exactly **one** AskUserQuestion with `question_id: "${SKILL_NAME}-plan-mode-handshake"` +(e.g., `plan-ceo-review-plan-mode-handshake`, using the current skill's name) +and these two options. The question is classified `door_type: one-way` in +the question registry for every interactive skill, so question-tuning +preferences (`never-ask`, `always-ask`) do NOT apply — this gate always fires. + +**Question body (follow the AskUserQuestion Format section below):** + +> This skill runs an interactive review that stops at every finding to ask +> you a question. Plan mode's default workflow is "read files, write plan, +> exit" — that silently bypasses every STOP gate in this skill. How do you +> want to proceed? +> +> **Recommendation: A** because this skill was designed for back-and-forth. +> Each scope call and each per-section finding needs your decision before it +> lands in the plan. Exiting plan mode and running the skill normally is the +> only path that preserves the interactive contract. +> +> *Note: options differ in kind (workflow shape), not coverage — no +> completeness score.* +> +> **A) Exit plan mode and run interactively (recommended)** +> ✅ Every STOP gate in this skill fires as designed — you approve each +> scope call, each per-section finding, each cross-model tension before any +> decision lands in the plan. No silent bypass. +> ✅ Matches the skill's documented workflow. Each AskUserQuestion has a +> clear recommendation, pros/cons, and net line you can skim in ~5 seconds. +> ❌ Two-step: press esc-esc to exit plan mode, then rerun +> `/plan-{skill-name}`. Slight context-switch friction, but the alternative +> is shipping a rubber-stamp review. +> +> **C) Cancel — I meant to run something else** +> ✅ Clean exit, no partial state, no plan file written, no findings +> recorded. Use this if you invoked the skill by mistake. +> ❌ No output at all — no review, no plan file. Fine if that's what you +> want; otherwise pick A. +> +> **Net.** Plan mode is incompatible with this skill's per-finding STOP +> gates. A is the right choice for any real review; C is the bail-out. + +### Routing the user's answer + +**If the user picks A (exit and rerun):** + +1. Append the outcome to the telemetry log (synchronous, before ExitPlanMode): + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"A-exit","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Respond to the user: "Press **esc-esc** to exit plan mode, then rerun + `/{skill-name}`. The skill will run interactively with every STOP gate + firing as designed." +3. Call `ExitPlanMode` with an empty plan body (plan mode requires + turn-end via AskUserQuestion or ExitPlanMode; there is no plan to + approve, so ExitPlanMode with an empty message is the correct exit). + +**If the user picks C (cancel):** + +1. Append the outcome: + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"C-cancel","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Tell the user: "Cancelled. No plan written." +3. Call `ExitPlanMode` with an empty message noting the user cancelled. + +**After the handshake completes (either A or C),** do NOT continue with the +rest of this skill's workflow. The handshake is terminal for this turn. + + If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills AND do not auto-invoke skills based on conversation context. Only run skills the user explicitly types (e.g., /qa, /ship). If you would have auto-invoked a skill, instead briefly say: diff --git a/plan-devex-review/SKILL.md.tmpl b/plan-devex-review/SKILL.md.tmpl index 5b4c69a9..bd824dc2 100644 --- a/plan-devex-review/SKILL.md.tmpl +++ b/plan-devex-review/SKILL.md.tmpl @@ -1,6 +1,7 @@ --- name: plan-devex-review preamble-tier: 3 +interactive: true version: 2.0.0 description: | Interactive developer experience plan review. Explores developer personas, diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index ce4cb6e9..a90314f0 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -1,6 +1,7 @@ --- name: plan-eng-review preamble-tier: 3 +interactive: true version: 1.0.0 description: | Eng manager-mode plan review. Lock in the execution plan — architecture, @@ -117,6 +118,100 @@ echo "CHECKPOINT_PUSH: $_CHECKPOINT_PUSH" [ -n "$OPENCLAW_SESSION" ] && echo "SPAWNED_SESSION: true" || true ``` +## Plan Mode Handshake — FIRST, BEFORE ANY ANALYSIS + +**Check every `` in this turn for the literal phrase:** + +> `Plan mode is active. The user indicated that they do not want you to execute yet` + +If that phrase is **absent**: proceed normally. This section is a no-op. + +If that phrase is **present**, the user is in plan mode. Plan mode's system +reminder says "This supercedes any other instructions you have received," +which conflicts with this skill's interactive STOP-Ask workflow. You MUST +resolve the conflict via AskUserQuestion BEFORE reading any files, running +any bash, or composing any plan content. + +### What to do when plan mode is detected + +Before emitting the AskUserQuestion, run this bash block synchronously to +log that the handshake fired (captures A-exit and C-cancel outcomes that +would terminate the skill before end-of-skill telemetry runs): + +```bash +# PLAN MODE EXCEPTION — ALWAYS RUN (telemetry-only write to ~/.gstack/) +mkdir -p ~/.gstack/analytics +echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"fired","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true +``` + +Then emit exactly **one** AskUserQuestion with `question_id: "${SKILL_NAME}-plan-mode-handshake"` +(e.g., `plan-ceo-review-plan-mode-handshake`, using the current skill's name) +and these two options. The question is classified `door_type: one-way` in +the question registry for every interactive skill, so question-tuning +preferences (`never-ask`, `always-ask`) do NOT apply — this gate always fires. + +**Question body (follow the AskUserQuestion Format section below):** + +> This skill runs an interactive review that stops at every finding to ask +> you a question. Plan mode's default workflow is "read files, write plan, +> exit" — that silently bypasses every STOP gate in this skill. How do you +> want to proceed? +> +> **Recommendation: A** because this skill was designed for back-and-forth. +> Each scope call and each per-section finding needs your decision before it +> lands in the plan. Exiting plan mode and running the skill normally is the +> only path that preserves the interactive contract. +> +> *Note: options differ in kind (workflow shape), not coverage — no +> completeness score.* +> +> **A) Exit plan mode and run interactively (recommended)** +> ✅ Every STOP gate in this skill fires as designed — you approve each +> scope call, each per-section finding, each cross-model tension before any +> decision lands in the plan. No silent bypass. +> ✅ Matches the skill's documented workflow. Each AskUserQuestion has a +> clear recommendation, pros/cons, and net line you can skim in ~5 seconds. +> ❌ Two-step: press esc-esc to exit plan mode, then rerun +> `/plan-{skill-name}`. Slight context-switch friction, but the alternative +> is shipping a rubber-stamp review. +> +> **C) Cancel — I meant to run something else** +> ✅ Clean exit, no partial state, no plan file written, no findings +> recorded. Use this if you invoked the skill by mistake. +> ❌ No output at all — no review, no plan file. Fine if that's what you +> want; otherwise pick A. +> +> **Net.** Plan mode is incompatible with this skill's per-finding STOP +> gates. A is the right choice for any real review; C is the bail-out. + +### Routing the user's answer + +**If the user picks A (exit and rerun):** + +1. Append the outcome to the telemetry log (synchronous, before ExitPlanMode): + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"A-exit","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Respond to the user: "Press **esc-esc** to exit plan mode, then rerun + `/{skill-name}`. The skill will run interactively with every STOP gate + firing as designed." +3. Call `ExitPlanMode` with an empty plan body (plan mode requires + turn-end via AskUserQuestion or ExitPlanMode; there is no plan to + approve, so ExitPlanMode with an empty message is the correct exit). + +**If the user picks C (cancel):** + +1. Append the outcome: + ```bash + echo '{"skill":"'"${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"C-cancel","branch":"'"${_BRANCH:-unknown}"'","session":"'"${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + ``` +2. Tell the user: "Cancelled. No plan written." +3. Call `ExitPlanMode` with an empty message noting the user cancelled. + +**After the handshake completes (either A or C),** do NOT continue with the +rest of this skill's workflow. The handshake is terminal for this turn. + + If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills AND do not auto-invoke skills based on conversation context. Only run skills the user explicitly types (e.g., /qa, /ship). If you would have auto-invoked a skill, instead briefly say: diff --git a/plan-eng-review/SKILL.md.tmpl b/plan-eng-review/SKILL.md.tmpl index a16083ed..2d267837 100644 --- a/plan-eng-review/SKILL.md.tmpl +++ b/plan-eng-review/SKILL.md.tmpl @@ -1,6 +1,7 @@ --- name: plan-eng-review preamble-tier: 3 +interactive: true version: 1.0.0 description: | Eng manager-mode plan review. Lock in the execution plan — architecture, diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index 40f08369..c801af08 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -425,7 +425,11 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: const tierMatch = tmplContent.match(/^preamble-tier:\s*(\d+)$/m); const preambleTier = tierMatch ? parseInt(tierMatch[1], 10) : undefined; - const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host], preambleTier, model: MODEL_ARG_VAL }; + // Extract interactive flag from frontmatter (generator-only; controls plan-mode handshake inclusion) + const interactiveMatch = tmplContent.match(/^interactive:\s*(true|false)\s*$/m); + const interactive = interactiveMatch ? interactiveMatch[1] === 'true' : undefined; + + const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host], preambleTier, model: MODEL_ARG_VAL, interactive }; // Replace placeholders (supports parameterized: {{NAME:arg1:arg2}}) // Config-driven: suppressedResolvers return empty string for this host diff --git a/scripts/question-registry.ts b/scripts/question-registry.ts index bae5950c..3d90222a 100644 --- a/scripts/question-registry.ts +++ b/scripts/question-registry.ts @@ -261,6 +261,45 @@ export const QUESTIONS = { description: "Approve the design doc, revise sections, or start over?", }, + // ----------------------------------------------------------------------- + // Plan-mode handshake — fires at the top of any interactive review skill + // when the user is in plan mode. Safety-critical, always asked regardless + // of user's tuning preferences. See scripts/resolvers/preamble/generate- + // plan-mode-handshake.ts. + // ----------------------------------------------------------------------- + 'plan-ceo-review-plan-mode-handshake': { + id: 'plan-ceo-review-plan-mode-handshake', + skill: 'plan-ceo-review', + category: 'routing', + door_type: 'one-way', + options: ['exit-and-rerun', 'cancel'], + description: "Plan mode detected — exit and rerun interactively, or cancel?", + }, + 'plan-eng-review-plan-mode-handshake': { + id: 'plan-eng-review-plan-mode-handshake', + skill: 'plan-eng-review', + category: 'routing', + door_type: 'one-way', + options: ['exit-and-rerun', 'cancel'], + description: "Plan mode detected — exit and rerun interactively, or cancel?", + }, + 'plan-design-review-plan-mode-handshake': { + id: 'plan-design-review-plan-mode-handshake', + skill: 'plan-design-review', + category: 'routing', + door_type: 'one-way', + options: ['exit-and-rerun', 'cancel'], + description: "Plan mode detected — exit and rerun interactively, or cancel?", + }, + 'plan-devex-review-plan-mode-handshake': { + id: 'plan-devex-review-plan-mode-handshake', + skill: 'plan-devex-review', + category: 'routing', + door_type: 'one-way', + options: ['exit-and-rerun', 'cancel'], + description: "Plan mode detected — exit and rerun interactively, or cancel?", + }, + // ----------------------------------------------------------------------- // /plan-ceo-review — scope & strategy // ----------------------------------------------------------------------- diff --git a/scripts/resolvers/preamble.ts b/scripts/resolvers/preamble.ts index ba80e73f..ac32f4a9 100644 --- a/scripts/resolvers/preamble.ts +++ b/scripts/resolvers/preamble.ts @@ -22,6 +22,7 @@ import { generateQuestionTuning } from './question-tuning'; // Core bootstrap import { generatePreambleBash } from './preamble/generate-preamble-bash'; +import { generatePlanModeHandshake } from './preamble/generate-plan-mode-handshake'; import { generateUpgradeCheck } from './preamble/generate-upgrade-check'; import { generateCompletionStatus } from './preamble/generate-completion-status'; @@ -78,6 +79,13 @@ export function generatePreamble(ctx: TemplateContext): string { } const sections = [ generatePreambleBash(ctx), + // Plan-mode handshake at position 1: after bash (so _SESSION_ID / _BRANCH / + // _TEL env vars are live for the synchronous telemetry write) and before + // all onboarding AskUserQuestion gates (so fresh-install users in plan mode + // see the handshake first, not drowned in telemetry / proactive / routing + // prompts). Host-scoped to Claude + interactive-frontmatter-scoped inside + // the resolver — no-op for every other skill/host combination. + generatePlanModeHandshake(ctx), generateUpgradeCheck(ctx), generateWritingStyleMigration(ctx), generateLakeIntro(), diff --git a/scripts/resolvers/preamble/generate-plan-mode-handshake.ts b/scripts/resolvers/preamble/generate-plan-mode-handshake.ts new file mode 100644 index 00000000..e1b81a05 --- /dev/null +++ b/scripts/resolvers/preamble/generate-plan-mode-handshake.ts @@ -0,0 +1,141 @@ +/** + * Plan-mode handshake resolver. + * + * Emits a STOP-Ask gate at the very top of the preamble that fires when a user + * invokes an interactive review skill while their Claude Code session is in + * plan mode. Without this gate, plan mode's "This supercedes any other + * instructions you have received" system-reminder wins against the skill's + * interactive STOP-Ask workflow and the skill silently writes a plan file + * instead of running the per-finding AskUserQuestion loop (v1.10.2.0 bug fix). + * + * Host scope + * ---------- + * Only renders for Claude host (ctx.host === 'claude'). Other hosts use + * different plan-mode semantics (Codex, OpenClaw, etc.) and should not see + * Claude-specific ExitPlanMode / esc-esc prose. + * + * Opt-in + * ------ + * Only renders when the consuming skill's frontmatter has `interactive: true`. + * That flag is a generator-only input parsed by scripts/gen-skill-docs.ts + * from the skill's .tmpl frontmatter and passed through TemplateContext. + * Currently used by: plan-ceo-review, plan-eng-review, plan-design-review, + * plan-devex-review. + * + * Composition position + * -------------------- + * Inserted at index 1 in scripts/resolvers/preamble.ts — after + * generatePreambleBash (so _SESSION_ID, _BRANCH, _TEL env vars are live for + * the synchronous telemetry write) and before generateUpgradeCheck and all + * onboarding AskUserQuestion gates (so fresh-install users in plan mode see + * the handshake first, not drowned in telemetry / proactive / routing + * prompts). + * + * One-way door + * ------------ + * The handshake question_id `plan-mode-handshake` is classified door_type + * one-way in scripts/question-registry.ts. gstack-question-preference --check + * always returns ASK_NORMALLY for it, so a user who set `never-ask` on + * another question cannot accidentally suppress this safety gate. + */ + +import type { TemplateContext } from '../types'; + +export function generatePlanModeHandshake(ctx: TemplateContext): string { + if (ctx.host !== 'claude') return ''; + if (!ctx.interactive) return ''; + + return `## Plan Mode Handshake — FIRST, BEFORE ANY ANALYSIS + +**Check every \`\` in this turn for the literal phrase:** + +> \`Plan mode is active. The user indicated that they do not want you to execute yet\` + +If that phrase is **absent**: proceed normally. This section is a no-op. + +If that phrase is **present**, the user is in plan mode. Plan mode's system +reminder says "This supercedes any other instructions you have received," +which conflicts with this skill's interactive STOP-Ask workflow. You MUST +resolve the conflict via AskUserQuestion BEFORE reading any files, running +any bash, or composing any plan content. + +### What to do when plan mode is detected + +Before emitting the AskUserQuestion, run this bash block synchronously to +log that the handshake fired (captures A-exit and C-cancel outcomes that +would terminate the skill before end-of-skill telemetry runs): + +\`\`\`bash +# PLAN MODE EXCEPTION — ALWAYS RUN (telemetry-only write to ~/.gstack/) +mkdir -p ~/.gstack/analytics +echo '{"skill":"'"\${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"fired","branch":"'"\${_BRANCH:-unknown}"'","session":"'"\${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true +\`\`\` + +Then emit exactly **one** AskUserQuestion with \`question_id: "\${SKILL_NAME}-plan-mode-handshake"\` +(e.g., \`plan-ceo-review-plan-mode-handshake\`, using the current skill's name) +and these two options. The question is classified \`door_type: one-way\` in +the question registry for every interactive skill, so question-tuning +preferences (\`never-ask\`, \`always-ask\`) do NOT apply — this gate always fires. + +**Question body (follow the AskUserQuestion Format section below):** + +> This skill runs an interactive review that stops at every finding to ask +> you a question. Plan mode's default workflow is "read files, write plan, +> exit" — that silently bypasses every STOP gate in this skill. How do you +> want to proceed? +> +> **Recommendation: A** because this skill was designed for back-and-forth. +> Each scope call and each per-section finding needs your decision before it +> lands in the plan. Exiting plan mode and running the skill normally is the +> only path that preserves the interactive contract. +> +> *Note: options differ in kind (workflow shape), not coverage — no +> completeness score.* +> +> **A) Exit plan mode and run interactively (recommended)** +> ✅ Every STOP gate in this skill fires as designed — you approve each +> scope call, each per-section finding, each cross-model tension before any +> decision lands in the plan. No silent bypass. +> ✅ Matches the skill's documented workflow. Each AskUserQuestion has a +> clear recommendation, pros/cons, and net line you can skim in ~5 seconds. +> ❌ Two-step: press esc-esc to exit plan mode, then rerun +> \`/plan-{skill-name}\`. Slight context-switch friction, but the alternative +> is shipping a rubber-stamp review. +> +> **C) Cancel — I meant to run something else** +> ✅ Clean exit, no partial state, no plan file written, no findings +> recorded. Use this if you invoked the skill by mistake. +> ❌ No output at all — no review, no plan file. Fine if that's what you +> want; otherwise pick A. +> +> **Net.** Plan mode is incompatible with this skill's per-finding STOP +> gates. A is the right choice for any real review; C is the bail-out. + +### Routing the user's answer + +**If the user picks A (exit and rerun):** + +1. Append the outcome to the telemetry log (synchronous, before ExitPlanMode): + \`\`\`bash + echo '{"skill":"'"\${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"A-exit","branch":"'"\${_BRANCH:-unknown}"'","session":"'"\${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + \`\`\` +2. Respond to the user: "Press **esc-esc** to exit plan mode, then rerun + \`/{skill-name}\`. The skill will run interactively with every STOP gate + firing as designed." +3. Call \`ExitPlanMode\` with an empty plan body (plan mode requires + turn-end via AskUserQuestion or ExitPlanMode; there is no plan to + approve, so ExitPlanMode with an empty message is the correct exit). + +**If the user picks C (cancel):** + +1. Append the outcome: + \`\`\`bash + echo '{"skill":"'"\${_SKILL_NAME:-unknown}"'","event":"plan_mode_handshake","outcome":"C-cancel","branch":"'"\${_BRANCH:-unknown}"'","session":"'"\${_SESSION_ID:-unknown}"'","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true + \`\`\` +2. Tell the user: "Cancelled. No plan written." +3. Call \`ExitPlanMode\` with an empty message noting the user cancelled. + +**After the handshake completes (either A or C),** do NOT continue with the +rest of this skill's workflow. The handshake is terminal for this turn. +`; +} diff --git a/scripts/resolvers/types.ts b/scripts/resolvers/types.ts index 634dd2eb..c8a44425 100644 --- a/scripts/resolvers/types.ts +++ b/scripts/resolvers/types.ts @@ -61,6 +61,7 @@ export interface TemplateContext { paths: HostPaths; preambleTier?: number; // 1-4, controls which preamble sections are included model?: Model; // model family for behavioral overlay. Omitted/undefined → no overlay. + interactive?: boolean; // true → emit plan-mode handshake in preamble. Generator-only, not written to SKILL.md. } /** Resolver function signature. args is populated for parameterized placeholders like {{INVOKE_SKILL:name}}. */ diff --git a/test/agent-sdk-runner.test.ts b/test/agent-sdk-runner.test.ts index eb256092..39c5db81 100644 --- a/test/agent-sdk-runner.test.ts +++ b/test/agent-sdk-runner.test.ts @@ -366,6 +366,101 @@ describe('runAgentSdkTest — options propagation', () => { }); }); +// --------------------------------------------------------------------------- +// canUseTool extension (D10 CEO / D4 eng) +// --------------------------------------------------------------------------- + +describe('runAgentSdkTest — canUseTool extension', () => { + test('permissionMode flips to "default" when canUseTool is supplied', async () => { + freshSem(); + const stub: StubConfig = { + streams: [[systemInit(), assistantTurn([{ type: 'text', text: 'ok' }]), resultSuccess()]], + calls: [], + }; + await runAgentSdkTest({ + ...BASE_OPTS, + queryProvider: makeStubProvider(stub), + canUseTool: async (_toolName, input) => ({ behavior: 'allow', updatedInput: input }), + }); + const opts = stub.calls[0]!.options!; + expect(opts.permissionMode).toBe('default'); + expect(opts.allowDangerouslySkipPermissions).toBe(false); + }); + + test('permissionMode stays "bypassPermissions" when canUseTool is NOT supplied', async () => { + freshSem(); + const stub: StubConfig = { + streams: [[systemInit(), assistantTurn([{ type: 'text', text: 'ok' }]), resultSuccess()]], + calls: [], + }; + await runAgentSdkTest({ + ...BASE_OPTS, + queryProvider: makeStubProvider(stub), + }); + const opts = stub.calls[0]!.options!; + expect(opts.permissionMode).toBe('bypassPermissions'); + expect(opts.allowDangerouslySkipPermissions).toBe(true); + }); + + test('canUseTool callback reaches the SDK options', async () => { + freshSem(); + const stub: StubConfig = { + streams: [[systemInit(), assistantTurn([{ type: 'text', text: 'ok' }]), resultSuccess()]], + calls: [], + }; + const cb = async (_toolName: string, input: Record) => ({ + behavior: 'allow' as const, + updatedInput: input, + }); + await runAgentSdkTest({ + ...BASE_OPTS, + queryProvider: makeStubProvider(stub), + canUseTool: cb, + }); + const opts = stub.calls[0]!.options! as Options & { canUseTool?: unknown }; + expect(typeof opts.canUseTool).toBe('function'); + }); + + test('AskUserQuestion is auto-added to allowedTools when canUseTool is supplied', async () => { + freshSem(); + const stub: StubConfig = { + streams: [[systemInit(), assistantTurn([{ type: 'text', text: 'ok' }]), resultSuccess()]], + calls: [], + }; + await runAgentSdkTest({ + ...BASE_OPTS, + allowedTools: ['Read', 'Grep'], // explicitly omits AskUserQuestion + queryProvider: makeStubProvider(stub), + canUseTool: async (_toolName, input) => ({ behavior: 'allow', updatedInput: input }), + }); + const opts = stub.calls[0]!.options!; + expect(opts.allowedTools).toContain('AskUserQuestion'); + expect(opts.tools).toContain('AskUserQuestion'); + }); + + test('AskUserQuestion is NOT auto-added when canUseTool is absent', async () => { + freshSem(); + const stub: StubConfig = { + streams: [[systemInit(), assistantTurn([{ type: 'text', text: 'ok' }]), resultSuccess()]], + calls: [], + }; + await runAgentSdkTest({ + ...BASE_OPTS, + allowedTools: ['Read', 'Grep'], + queryProvider: makeStubProvider(stub), + }); + const opts = stub.calls[0]!.options!; + expect(opts.allowedTools).not.toContain('AskUserQuestion'); + }); + + test('passThroughNonAskUserQuestion helper returns allow+updatedInput', async () => { + const { passThroughNonAskUserQuestion } = await import('../test/helpers/agent-sdk-runner'); + const result = passThroughNonAskUserQuestion('Read', { file_path: '/tmp/x' }); + expect(result.behavior).toBe('allow'); + expect(result.updatedInput).toEqual({ file_path: '/tmp/x' }); + }); +}); + // --------------------------------------------------------------------------- // Rate-limit retry (three shapes) // --------------------------------------------------------------------------- diff --git a/test/e2e-harness-audit.test.ts b/test/e2e-harness-audit.test.ts new file mode 100644 index 00000000..b517ef84 --- /dev/null +++ b/test/e2e-harness-audit.test.ts @@ -0,0 +1,113 @@ +/** + * E2E harness audit — every skill with `interactive: true` in its frontmatter + * must have at least one test file that uses `canUseTool` via the extended + * agent-sdk-runner. This prevents future drift where a skill opts into the + * handshake without adding real coverage. + * + * Runs as a free unit test (no API calls). Pure filesystem scan. + */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; + +const ROOT = path.resolve(import.meta.dir, '..'); +const SKILL_GLOBS = [ + 'plan-ceo-review', + 'plan-eng-review', + 'plan-design-review', + 'plan-devex-review', + 'office-hours', + 'codex', + 'investigate', + 'qa', + 'retro', + 'cso', + 'review', + 'ship', + 'design-review', + 'devex-review', + 'qa-only', + 'design-consultation', + 'design-shotgun', + 'autoplan', + 'land-and-deploy', + 'plan-tune', + 'document-release', + 'context-save', + 'context-restore', + 'health', + 'setup-deploy', + 'setup-browser-cookies', + 'canary', + 'learn', + 'benchmark', + 'benchmark-models', + 'make-pdf', + 'open-gstack-browser', + 'gstack-upgrade', + 'pair-agent', + 'design-html', + 'freeze', + 'unfreeze', + 'careful', + 'guard', +]; + +/** + * Load .tmpl files for each skill and return the names of those that have + * `interactive: true` in frontmatter. + */ +function findInteractiveSkills(): string[] { + const interactive: string[] = []; + for (const skill of SKILL_GLOBS) { + const tmplPath = path.join(ROOT, skill, 'SKILL.md.tmpl'); + if (!fs.existsSync(tmplPath)) continue; + const content = fs.readFileSync(tmplPath, 'utf-8'); + // Frontmatter lives between the first '---' and the next '---'. + const fmEnd = content.indexOf('\n---', 4); + if (fmEnd < 0) continue; + const frontmatter = content.slice(0, fmEnd); + if (/^interactive:\s*true\s*$/m.test(frontmatter)) { + interactive.push(skill); + } + } + return interactive; +} + +/** + * Scan a test file's contents for the canUseTool-via-harness pattern. + * Either: direct canUseTool usage in runAgentSdkTest, or usage of the + * shared plan-mode-handshake-helpers that wrap it. + */ +function hasCanUseToolCoverage(testFile: string): boolean { + const content = fs.readFileSync(testFile, 'utf-8'); + if (content.includes('canUseTool')) return true; + if (content.includes('runPlanModeHandshakeTest')) return true; + return false; +} + +describe('E2E harness audit — interactive skills must have canUseTool coverage', () => { + test('every interactive: true skill has at least one canUseTool test', () => { + const interactive = findInteractiveSkills(); + expect(interactive.length).toBeGreaterThan(0); + + const testFiles = fs + .readdirSync(path.join(ROOT, 'test')) + .filter((f) => f.startsWith('skill-e2e-') && f.endsWith('.test.ts')) + .map((f) => path.join(ROOT, 'test', f)); + + const filesWithCoverage = testFiles.filter(hasCanUseToolCoverage); + + for (const skill of interactive) { + // Match the skill name in any test file that uses canUseTool. File + // naming convention is `skill-e2e--*.test.ts` — either the full + // name (plan-ceo-review) or a subset token. + const hasDedicatedTest = filesWithCoverage.some((f) => { + const base = path.basename(f, '.test.ts'); + return base.includes(skill) || base.includes(skill.replace(/-review$/, '')); + }); + expect(hasDedicatedTest, `skill "${skill}" has interactive:true but no canUseTool-based E2E test`).toBe(true); + } + }); +}); diff --git a/test/gen-skill-docs.test.ts b/test/gen-skill-docs.test.ts index 7e262317..60dc8ad9 100644 --- a/test/gen-skill-docs.test.ts +++ b/test/gen-skill-docs.test.ts @@ -2774,3 +2774,93 @@ describe('voice-triggers processing', () => { expect(frontmatter).not.toContain('voice-triggers:'); }); }); + +describe('plan-mode handshake (interactive: true) resolver', () => { + const INTERACTIVE_SKILLS = [ + 'plan-ceo-review', + 'plan-eng-review', + 'plan-design-review', + 'plan-devex-review', + ]; + + const HANDSHAKE_MARKER = '## Plan Mode Handshake'; + + test.each(INTERACTIVE_SKILLS)( + '%s (Claude host) SKILL.md contains the handshake section', + (skill) => { + const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8'); + expect(content).toContain(HANDSHAKE_MARKER); + expect(content).toContain( + 'Plan mode is active. The user indicated that they do not want you to execute yet', + ); + }, + ); + + test('handshake is absent from non-interactive Claude skills', () => { + const nonInteractive = ['ship', 'review', 'qa', 'office-hours', 'codex', 'retro', 'cso']; + for (const skill of nonInteractive) { + const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8'); + expect(content).not.toContain(HANDSHAKE_MARKER); + } + }); + + test('handshake is absent from non-Claude host outputs when present on disk', () => { + // Non-Claude hosts render to hostSubdirs (.agents/, .openclaw/, etc). The + // handshake resolver returns '' when ctx.host !== 'claude', so those + // outputs must not contain the marker. The current gen-skill-docs layout + // prefixes skill names as `gstack-` under the hostSubdir; older + // layouts used `gstack/` (no prefix). Only stable-present paths + // are asserted — older ones may or may not exist per install history. + const candidateOutputs = [ + // Current prefixed layout + path.join(ROOT, '.agents', 'skills', 'gstack-plan-ceo-review', 'SKILL.md'), + path.join(ROOT, '.openclaw', 'skills', 'gstack-plan-ceo-review', 'SKILL.md'), + path.join(ROOT, '.opencode', 'skills', 'gstack-plan-ceo-review', 'SKILL.md'), + path.join(ROOT, '.factory', 'skills', 'gstack-plan-ceo-review', 'SKILL.md'), + path.join(ROOT, '.hermes', 'skills', 'gstack-plan-ceo-review', 'SKILL.md'), + ]; + let checked = 0; + for (const out of candidateOutputs) { + if (fs.existsSync(out)) { + const content = fs.readFileSync(out, 'utf-8'); + expect(content).not.toContain(HANDSHAKE_MARKER); + checked++; + } + } + // At least one non-Claude host's output should exist after a full gen + // run; this test is meaningful only if we checked something. If no + // non-Claude outputs exist locally, the cross-host guarantee is still + // enforced by the resolver's ctx.host check; this test is belt-and- + // suspenders and becomes a no-op rather than a false positive. + if (checked === 0) { + // eslint-disable-next-line no-console + console.warn( + 'plan-mode handshake: no non-Claude host outputs found for cross-host absence check — ' + + 'run `bun run gen:skill-docs --host all` to populate', + ); + } + }); + + test('0C-bis STOP block present in plan-ceo-review/SKILL.md', () => { + const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8'); + const presentIdx = content.indexOf('Present these approach options via AskUserQuestion'); + const preludeIdx = content.indexOf('### 0D-prelude'); + expect(presentIdx).toBeGreaterThan(0); + expect(preludeIdx).toBeGreaterThan(presentIdx); + const between = content.slice(presentIdx, preludeIdx); + expect(between).toContain('**STOP.**'); + expect(between).toContain('Do NOT proceed to Step 0D or 0F until the user responds to 0C-bis'); + }); + + test('handshake resolver is wired BEFORE generateUpgradeCheck in preamble', () => { + const content = fs.readFileSync( + path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), + 'utf-8', + ); + const handshakeIdx = content.indexOf(HANDSHAKE_MARKER); + const upgradeIdx = content.indexOf('UPGRADE_AVAILABLE'); + expect(handshakeIdx).toBeGreaterThan(0); + expect(upgradeIdx).toBeGreaterThan(0); + expect(handshakeIdx).toBeLessThan(upgradeIdx); + }); +}); diff --git a/test/helpers/agent-sdk-runner.ts b/test/helpers/agent-sdk-runner.ts index a4df71d9..cea7bf76 100644 --- a/test/helpers/agent-sdk-runner.ts +++ b/test/helpers/agent-sdk-runner.ts @@ -31,6 +31,7 @@ import { type PermissionMode, type SettingSource, type Options, + type CanUseTool, } from '@anthropic-ai/claude-agent-sdk'; import * as fs from 'fs'; import * as path from 'path'; @@ -111,6 +112,43 @@ export interface RunAgentSdkOptions { * retries reuse the original workingDirectory (fine for read-only tests). */ onRetry?: (freshDir: string) => void; + /** + * Optional canUseTool callback. When supplied, the harness flips + * permissionMode from 'bypassPermissions' to 'default' so the SDK actually + * routes tool-use approval decisions through the callback. Without this + * flip, bypassPermissions short-circuits the callback and tests that want + * to assert on AskUserQuestion content silently pass without asserting. + * + * Callback contract matches the SDK: fires on every tool-use approval + * request and on AskUserQuestion invocations. For non-AskUserQuestion + * tools that tests don't care about, use `passThroughNonAskUserQuestion` + * to auto-allow them. + */ + canUseTool?: CanUseTool; +} + +/** + * Pass-through helper: auto-allows any tool_use that isn't AskUserQuestion. + * Most plan-mode handshake tests only care about the handshake AskUserQuestion; + * every other tool (Read, Grep, Bash, Write, Edit, ExitPlanMode) should just + * run. Compose with a test-specific AskUserQuestion handler: + * + * canUseTool: async (toolName, input, options) => { + * if (toolName === 'AskUserQuestion') { + * // custom assertions + canned answer + * return { behavior: 'allow', updatedInput: { questions: input.questions, answers: {...} } }; + * } + * return passThroughNonAskUserQuestion(toolName, input); + * } + */ +export function passThroughNonAskUserQuestion( + toolName: string, + input: Record, +): { behavior: 'allow'; updatedInput: Record } { + // SDK requires an allow response to include updatedInput — pass the original + // input through unchanged so the tool runs as the model intended. + void toolName; + return { behavior: 'allow', updatedInput: input }; } export class RateLimitExhaustedError extends Error { @@ -287,19 +325,37 @@ export async function runAgentSdkTest( let terminalResult: SDKResultMessage | null = null; try { + // When canUseTool is supplied, the SDK must route tool-use approval + // decisions through the callback. bypassPermissions short-circuits + // that. Flip to 'default' mode so canUseTool actually fires. Tests + // that want AskUserQuestion interception without this flip would + // silently auto-pass — the exact testability gap D14/D4-eng fix. + const hasCanUseTool = typeof opts.canUseTool === 'function'; + const resolvedPermissionMode: PermissionMode = + opts.permissionMode ?? (hasCanUseTool ? 'default' : 'bypassPermissions'); + + // When canUseTool is supplied, ensure AskUserQuestion is in the allowed + // tools list. Without it, Claude can't invoke AskUserQuestion at all + // and the callback never has a chance to fire on it. + const baseTools = opts.allowedTools ?? ['Read', 'Glob', 'Grep', 'Bash']; + const resolvedTools = + hasCanUseTool && !baseTools.includes('AskUserQuestion') + ? [...baseTools, 'AskUserQuestion'] + : baseTools; + const sdkOpts: Options = { model, cwd: opts.workingDirectory, maxTurns: opts.maxTurns ?? 5, - tools: opts.allowedTools ?? ['Read', 'Glob', 'Grep', 'Bash'], + tools: resolvedTools, disallowedTools: opts.disallowedTools, - allowedTools: opts.allowedTools ?? ['Read', 'Glob', 'Grep', 'Bash'], - permissionMode: opts.permissionMode ?? 'bypassPermissions', - allowDangerouslySkipPermissions: - (opts.permissionMode ?? 'bypassPermissions') === 'bypassPermissions', + allowedTools: resolvedTools, + permissionMode: resolvedPermissionMode, + allowDangerouslySkipPermissions: resolvedPermissionMode === 'bypassPermissions', settingSources: opts.settingSources ?? [], env: opts.env, pathToClaudeCodeExecutable: opts.pathToClaudeCodeExecutable, + ...(hasCanUseTool ? { canUseTool: opts.canUseTool } : {}), }; // Empty bare string means "omit entirely" (SDK runs with no override). // Any object or non-empty string is passed through. diff --git a/test/helpers/plan-mode-handshake-helpers.ts b/test/helpers/plan-mode-handshake-helpers.ts new file mode 100644 index 00000000..581932be --- /dev/null +++ b/test/helpers/plan-mode-handshake-helpers.ts @@ -0,0 +1,166 @@ +/** + * Shared helpers for plan-mode handshake E2E tests. + * + * Four sibling test files (plan-ceo, plan-eng, plan-design, plan-devex) exercise + * the identical handshake contract against different skills. This helper + * centralizes the canUseTool interceptor and the assertion shape so the four + * test files are thin wiring (~40 LOC each) and can't drift out of sync. + * + * See scripts/resolvers/preamble/generate-plan-mode-handshake.ts for the + * handshake prose that the tests below assert against. + */ + +import { expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import * as os from 'os'; +import { execSync } from 'child_process'; +import { + runAgentSdkTest, + passThroughNonAskUserQuestion, + resolveClaudeBinary, + type AgentSdkResult, +} from './agent-sdk-runner'; + +/** Distinctive phrase matching what Claude Code's harness actually injects. */ +export const PLAN_MODE_REMINDER = + 'Plan mode is active. The user indicated that they do not want you to execute yet'; + +export interface HandshakeCaptureResult { + sdkResult: AgentSdkResult; + /** Each AskUserQuestion that fired, with its input payload. */ + askUserQuestions: Array<{ input: Record; orderIndex: number }>; + /** Tool-use events in the order they fired (names only). */ + toolOrder: string[]; + /** Whether any Write or Edit tool fired BEFORE the first AskUserQuestion. */ + writeOrEditBeforeAsk: boolean; +} + +/** + * Run a skill via the Agent SDK with canUseTool intercepting every tool use. + * Inject the plan-mode distinctive phrase into the system prompt and auto- + * answer the handshake with the given answerLabel ("Exit" or "Cancel"). Return + * the captured events for assertion. + */ +export async function runPlanModeHandshakeTest(opts: { + /** Skill name, e.g. 'plan-ceo-review'. */ + skillName: string; + /** "Exit" to pick option A (exit-and-rerun) or "Cancel" for option C. */ + answerLabel: 'Exit' | 'Cancel'; + /** If true, DO NOT inject the reminder — used by the no-op regression test. */ + omitPlanModeReminder?: boolean; + /** Max turns for the SDK call (default 4 — handshake + exit should fit easily). */ + maxTurns?: number; +}): Promise { + const { skillName, answerLabel, omitPlanModeReminder, maxTurns } = opts; + + const askUserQuestions: HandshakeCaptureResult['askUserQuestions'] = []; + const toolOrder: string[] = []; + let toolIndex = 0; + let firstAskIndex = -1; + + const workingDir = fs.mkdtempSync( + path.join(os.tmpdir(), `plan-mode-handshake-${skillName}-`), + ); + + // The SDK requires AskUserQuestion to be in the allowed tools list. The + // harness auto-adds it when canUseTool is supplied, but we also want Read + // so the skill can load its own file if it tries to. + const binary = resolveClaudeBinary(); + + try { + // Inject the distinctive phrase into the system prompt by appending it to + // the default Claude Code preset. Claude Code's real plan mode uses an + // injected system-reminder; in SDK tests we use systemPrompt.append which + // the model treats as equally authoritative. + const reminderAppend = omitPlanModeReminder + ? '' + : `\n\n\n${PLAN_MODE_REMINDER}. This supercedes any other instructions you have received.\n\n`; + + const sdkResult = await runAgentSdkTest({ + systemPrompt: { + type: 'preset', + preset: 'claude_code', + append: reminderAppend, + }, + userPrompt: `Read the skill file at ${path.resolve( + import.meta.dir, + '..', + '..', + skillName, + 'SKILL.md', + )} and follow its instructions. There is no real plan to review — just start the skill and respond to any AskUserQuestion that fires.`, + workingDirectory: workingDir, + maxTurns: maxTurns ?? 4, + allowedTools: ['Read', 'Grep', 'Glob', 'Bash'], + ...(binary ? { pathToClaudeCodeExecutable: binary } : {}), + canUseTool: async (toolName, input) => { + toolOrder.push(toolName); + if (toolName === 'AskUserQuestion') { + if (firstAskIndex === -1) firstAskIndex = toolIndex; + askUserQuestions.push({ input, orderIndex: toolIndex }); + toolIndex++; + // Auto-answer with the label the test specified. + const q = (input.questions as Array<{ question: string; options: Array<{ label: string }> }>)[0]; + const matched = q.options.find((o) => o.label.includes(answerLabel)); + const answer = matched ? matched.label : q.options[0]!.label; + return { + behavior: 'allow', + updatedInput: { + questions: input.questions, + answers: { [q.question]: answer }, + }, + }; + } + toolIndex++; + return passThroughNonAskUserQuestion(toolName, input); + }, + }); + + const writeOrEditBeforeAsk = + firstAskIndex > 0 && + toolOrder.slice(0, firstAskIndex).some((t) => t === 'Write' || t === 'Edit'); + + return { sdkResult, askUserQuestions, toolOrder, writeOrEditBeforeAsk }; + } finally { + try { + fs.rmSync(workingDir, { recursive: true, force: true }); + } catch { /* ignore cleanup errors */ } + } +} + +/** Assert the shape of a fired handshake AskUserQuestion. */ +export function assertHandshakeShape( + aq: { input: Record }, +): void { + const questions = aq.input.questions as Array<{ + question: string; + options: Array<{ label: string }>; + }>; + expect(questions).toBeDefined(); + expect(questions.length).toBe(1); + const q = questions[0]!; + // D8 dropped Option B; handshake has exactly 2 options. + expect(q.options.length).toBe(2); + const labels = q.options.map((o) => o.label); + expect(labels.some((l) => l.includes('Exit'))).toBe(true); + expect(labels.some((l) => l.includes('Cancel'))).toBe(true); +} + +/** Read the skill-usage.jsonl log and return handshake entries. */ +export function readHandshakeLog(): Array> { + const logPath = path.join(os.homedir(), '.gstack', 'analytics', 'skill-usage.jsonl'); + if (!fs.existsSync(logPath)) return []; + const lines = fs.readFileSync(logPath, 'utf-8').split('\n').filter(Boolean); + return lines + .map((line) => { + try { + return JSON.parse(line); + } catch { + return null; + } + }) + .filter((x): x is Record => x !== null && x.event === 'plan_mode_handshake'); +} + +export { execSync }; diff --git a/test/helpers/touchfiles.ts b/test/helpers/touchfiles.ts index 4872f5de..acde310d 100644 --- a/test/helpers/touchfiles.ts +++ b/test/helpers/touchfiles.ts @@ -82,6 +82,17 @@ export const E2E_TOUCHFILES: Record = { 'plan-eng-review-artifact': ['plan-eng-review/**'], 'plan-review-report': ['plan-eng-review/**', 'scripts/gen-skill-docs.ts'], + // Plan-mode handshake (v1.10.2.0) — gate-tier safety regression tests. + // Each fires when any of: the interactive skill's template, the resolver, + // preamble composition, the Agent SDK harness, the question registry, or + // the one-way-door classifier changes. + 'plan-ceo-review-plan-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-plan-mode-handshake.ts', 'scripts/resolvers/preamble.ts', 'scripts/question-registry.ts', 'scripts/one-way-doors.ts', 'test/helpers/agent-sdk-runner.ts'], + 'plan-eng-review-plan-mode': ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-plan-mode-handshake.ts', 'scripts/resolvers/preamble.ts', 'scripts/question-registry.ts', 'scripts/one-way-doors.ts', 'test/helpers/agent-sdk-runner.ts'], + 'plan-design-review-plan-mode-handshake': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-plan-mode-handshake.ts', 'scripts/resolvers/preamble.ts', 'scripts/question-registry.ts', 'scripts/one-way-doors.ts', 'test/helpers/agent-sdk-runner.ts'], + 'plan-devex-review-plan-mode': ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-plan-mode-handshake.ts', 'scripts/resolvers/preamble.ts', 'scripts/question-registry.ts', 'scripts/one-way-doors.ts', 'test/helpers/agent-sdk-runner.ts'], + 'plan-mode-no-op': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-plan-mode-handshake.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/agent-sdk-runner.ts'], + 'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-plan-mode-handshake.ts', 'test/helpers/agent-sdk-runner.ts'], + // AskUserQuestion format regression (RECOMMENDATION + Completeness: N/10) // Fires when either template OR the two preamble resolvers change. 'plan-ceo-review-format-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completeness-section.ts', 'scripts/resolvers/preamble.ts', 'model-overlays/opus-4-7.md'], @@ -317,6 +328,14 @@ export const E2E_TIERS: Record = { 'plan-eng-coverage-audit': 'gate', 'plan-review-report': 'gate', + // Plan-mode handshake — deterministic safety regression, gate-tier + 'plan-ceo-review-plan-mode': 'gate', + 'plan-eng-review-plan-mode': 'gate', + 'plan-design-review-plan-mode-handshake': 'gate', + 'plan-devex-review-plan-mode': 'gate', + 'plan-mode-no-op': 'gate', + 'e2e-harness-audit': 'gate', + // AskUserQuestion format regression — periodic (Opus 4.7 non-deterministic benchmark) 'plan-ceo-review-format-mode': 'periodic', 'plan-ceo-review-format-approach': 'periodic', diff --git a/test/skill-e2e-plan-ceo-plan-mode.test.ts b/test/skill-e2e-plan-ceo-plan-mode.test.ts new file mode 100644 index 00000000..858e07eb --- /dev/null +++ b/test/skill-e2e-plan-ceo-plan-mode.test.ts @@ -0,0 +1,40 @@ +/** + * plan-ceo-review plan-mode handshake E2E (gate tier, paid). + * + * Asserts: when /plan-ceo-review is invoked with the plan-mode distinctive + * phrase in the system reminder, the skill fires AskUserQuestion FIRST + * (before any Write or Edit), the question has exactly 2 options (A exit, + * C cancel), picking "Exit" leads to an orderly exit with no plan file + * written. + * + * Cost: ~$0.50–$1.00 per run. Gated: EVALS=1 EVALS_TIER=gate. + * Depends on: scripts/resolvers/preamble/generate-plan-mode-handshake.ts, + * test/helpers/agent-sdk-runner.ts (canUseTool extension). + */ + +import { describe, test, expect } from 'bun:test'; +import { + runPlanModeHandshakeTest, + assertHandshakeShape, +} from './helpers/plan-mode-handshake-helpers'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; +const describeE2E = shouldRun ? describe : describe.skip; + +describeE2E('plan-ceo-review plan-mode handshake (gate)', () => { + test('handshake fires before any Write/Edit when plan mode is detected', async () => { + const result = await runPlanModeHandshakeTest({ + skillName: 'plan-ceo-review', + answerLabel: 'Exit', + }); + + // Handshake must have fired at least once. + expect(result.askUserQuestions.length).toBeGreaterThanOrEqual(1); + // Critically: no Write or Edit fired before the first AskUserQuestion. + // This is the bug v1.10.2.0 fixes — plan mode used to allow silent + // plan-file writes without any interactive gate. + expect(result.writeOrEditBeforeAsk).toBe(false); + // Handshake shape: 2 options (Exit/Cancel), Option B dropped per D8. + assertHandshakeShape(result.askUserQuestions[0]!); + }, 120_000); +}); diff --git a/test/skill-e2e-plan-design-plan-mode.test.ts b/test/skill-e2e-plan-design-plan-mode.test.ts new file mode 100644 index 00000000..1fb7aaf5 --- /dev/null +++ b/test/skill-e2e-plan-design-plan-mode.test.ts @@ -0,0 +1,28 @@ +/** + * plan-design-review plan-mode handshake E2E (gate tier, paid). + * + * See test/skill-e2e-plan-ceo-plan-mode.test.ts for the shared assertion + * contract. This file exercises the same handshake against /plan-design-review. + */ + +import { describe, test, expect } from 'bun:test'; +import { + runPlanModeHandshakeTest, + assertHandshakeShape, +} from './helpers/plan-mode-handshake-helpers'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; +const describeE2E = shouldRun ? describe : describe.skip; + +describeE2E('plan-design-review plan-mode handshake (gate)', () => { + test('handshake fires before any Write/Edit when plan mode is detected', async () => { + const result = await runPlanModeHandshakeTest({ + skillName: 'plan-design-review', + answerLabel: 'Cancel', // exercise the C-cancel branch instead of A-exit + }); + + expect(result.askUserQuestions.length).toBeGreaterThanOrEqual(1); + expect(result.writeOrEditBeforeAsk).toBe(false); + assertHandshakeShape(result.askUserQuestions[0]!); + }, 120_000); +}); diff --git a/test/skill-e2e-plan-devex-plan-mode.test.ts b/test/skill-e2e-plan-devex-plan-mode.test.ts new file mode 100644 index 00000000..2ede50e2 --- /dev/null +++ b/test/skill-e2e-plan-devex-plan-mode.test.ts @@ -0,0 +1,28 @@ +/** + * plan-devex-review plan-mode handshake E2E (gate tier, paid). + * + * See test/skill-e2e-plan-ceo-plan-mode.test.ts for the shared assertion + * contract. This file exercises the same handshake against /plan-devex-review. + */ + +import { describe, test, expect } from 'bun:test'; +import { + runPlanModeHandshakeTest, + assertHandshakeShape, +} from './helpers/plan-mode-handshake-helpers'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; +const describeE2E = shouldRun ? describe : describe.skip; + +describeE2E('plan-devex-review plan-mode handshake (gate)', () => { + test('handshake fires before any Write/Edit when plan mode is detected', async () => { + const result = await runPlanModeHandshakeTest({ + skillName: 'plan-devex-review', + answerLabel: 'Exit', + }); + + expect(result.askUserQuestions.length).toBeGreaterThanOrEqual(1); + expect(result.writeOrEditBeforeAsk).toBe(false); + assertHandshakeShape(result.askUserQuestions[0]!); + }, 120_000); +}); diff --git a/test/skill-e2e-plan-eng-plan-mode.test.ts b/test/skill-e2e-plan-eng-plan-mode.test.ts new file mode 100644 index 00000000..16da9d7a --- /dev/null +++ b/test/skill-e2e-plan-eng-plan-mode.test.ts @@ -0,0 +1,28 @@ +/** + * plan-eng-review plan-mode handshake E2E (gate tier, paid). + * + * See test/skill-e2e-plan-ceo-plan-mode.test.ts for the shared assertion + * contract. This file exercises the same handshake against /plan-eng-review. + */ + +import { describe, test, expect } from 'bun:test'; +import { + runPlanModeHandshakeTest, + assertHandshakeShape, +} from './helpers/plan-mode-handshake-helpers'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; +const describeE2E = shouldRun ? describe : describe.skip; + +describeE2E('plan-eng-review plan-mode handshake (gate)', () => { + test('handshake fires before any Write/Edit when plan mode is detected', async () => { + const result = await runPlanModeHandshakeTest({ + skillName: 'plan-eng-review', + answerLabel: 'Exit', + }); + + expect(result.askUserQuestions.length).toBeGreaterThanOrEqual(1); + expect(result.writeOrEditBeforeAsk).toBe(false); + assertHandshakeShape(result.askUserQuestions[0]!); + }, 120_000); +}); diff --git a/test/skill-e2e-plan-mode-no-op.test.ts b/test/skill-e2e-plan-mode-no-op.test.ts new file mode 100644 index 00000000..e222fbff --- /dev/null +++ b/test/skill-e2e-plan-mode-no-op.test.ts @@ -0,0 +1,43 @@ +/** + * Plan-mode handshake negative regression (gate tier, paid). + * + * Asserts: when /plan-ceo-review is invoked WITHOUT the plan-mode distinctive + * phrase in the system reminder, the handshake does NOT fire. The skill + * should proceed to its normal Step 0 flow. This is the REGRESSION RULE + * guardrail — the handshake must be a no-op outside plan mode or it breaks + * every existing interactive-review session. + * + * Cost: ~$0.50 per run. Gated: EVALS=1 EVALS_TIER=gate. + */ + +import { describe, test, expect } from 'bun:test'; +import { + runPlanModeHandshakeTest, + PLAN_MODE_REMINDER, +} from './helpers/plan-mode-handshake-helpers'; + +const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate'; +const describeE2E = shouldRun ? describe : describe.skip; + +describeE2E('plan-mode handshake no-op outside plan mode (gate regression)', () => { + test('handshake does NOT fire when distinctive phrase is absent', async () => { + const result = await runPlanModeHandshakeTest({ + skillName: 'plan-ceo-review', + answerLabel: 'Exit', // ignored — handshake should never fire + omitPlanModeReminder: true, + maxTurns: 3, // enough to see Step 0 start, but bounded + }); + + // The handshake AskUserQuestion should NOT have fired during Step 0 entry. + // Other AskUserQuestions may fire later in the skill (e.g., Step 0C-bis), + // but they will NOT have the handshake's question text. + for (const aq of result.askUserQuestions) { + const questions = aq.input.questions as Array<{ question: string }>; + for (const q of questions) { + // The handshake's question mentions the distinctive phrase in its + // prose; a non-handshake AskUserQuestion won't. + expect(q.question).not.toContain(PLAN_MODE_REMINDER); + } + } + }, 120_000); +}); diff --git a/test/touchfiles.test.ts b/test/touchfiles.test.ts index 5daae1c3..6ae0718e 100644 --- a/test/touchfiles.test.ts +++ b/test/touchfiles.test.ts @@ -85,8 +85,16 @@ describe('selectTests', () => { expect(result.selected).toContain('codex-offered-ceo-review'); expect(result.selected).toContain('plan-ceo-review-format-mode'); expect(result.selected).toContain('plan-ceo-review-format-approach'); - expect(result.selected.length).toBe(8); - expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 8); + // v1.10.2.0 plan-mode handshake entries also depend on plan-ceo-review/** + expect(result.selected).toContain('plan-ceo-review-plan-mode'); + expect(result.selected).toContain('plan-mode-no-op'); + expect(result.selected).toContain('e2e-harness-audit'); + expect(result.selected).toContain('plan-ceo-review-prosons-cadence'); + expect(result.selected).toContain('plan-review-prosons-format'); + expect(result.selected).toContain('plan-review-prosons-hardstop-neg'); + expect(result.selected).toContain('plan-review-prosons-neutral-neg'); + expect(result.selected.length).toBe(15); + expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 15); }); test('global touchfile triggers ALL tests', () => {