mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
v1.25.0.0 fix: AskUserQuestion resolves to host MCP variant when native is disallowed (#1287)
* test(harness): plumb extraArgs and auto_decided outcome through PTY runner runPlanSkillObservation now accepts extraArgs that pass through to launchClaudePty (which already supported them at the lower level), and exposes a new 'auto_decided' outcome detected via isAutoDecidedVisible when the AUTO_DECIDE preamble template fires (Auto-decided ... (your preference)). Both pieces are needed for the v1.21+ AskUserQuestion-blocked regression tests in the next commit. Detection order is deliberate: 'asked' (rendered numbered list) wins over 'auto_decided' (text only, no list), which wins over 'plan_ready' so the auto-decide evidence isn't masked by a downstream plan-mode confirmation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): add AskUserQuestion-blocked regression cases for 6 plan-mode skills Conductor launches Claude Code with --disallowedTools AskUserQuestion --permission-mode default --permission-prompt-tool stdio (verified by inspecting the live conductor claude process via ps -p ... -o args=). Native AskUserQuestion is removed from the model's tool registry; without fallback guidance the plan-mode skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review, autoplan, office-hours) silently proceed and never surface decisions to the user. Adds 6 gate-tier real-PTY regression cases: - 4 inline test cases inside the existing plan-X-review-plan-mode.test files, each exercising the same skill with extraArgs ['--disallowedTools', 'AskUserQuestion'] and asserting outcome === 'asked'. plan-design-review keeps the ['asked', 'plan_ready'] envelope (legitimate short-circuit on no-UI-scope) but explicitly fails on 'auto_decided'. - 2 standalone test files for autoplan + office-hours (which had no prior plan-mode test). autoplan asserts the FIRST non-auto-decided gate fires (Phase 1 premise confirmation) — autoplan auto-decides intermediate questions BY DESIGN. Touchfile entries: - autoplan-auto-mode + office-hours-auto-mode added to E2E_TOUCHFILES + E2E_TIERS (gate) - existing plan-X-review-plan-mode entries gain question-tuning.ts and generate-ask-user-format.ts touchfile deps so AUTO_DECIDE-related resolver changes correctly invalidate the regression tests - touchfiles.test.ts count updated 18 -> 19 to cover the autoplan touchfile dependency on plan-ceo-review/** Filenames retain `auto-mode` for branch-history continuity. Auto-mode (the AUTO_DECIDE preamble path when QUESTION_TUNING=true) is a related but distinct silencing mechanism; both share the same fix surface in the preamble. These tests are expected to FAIL on this branch until the fix lands. The failure is the receipt for the regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(preamble): teach the model to prefer mcp__*__AskUserQuestion when registered When a host launches Claude Code with --disallowedTools AskUserQuestion (Conductor does this by default — verified via ps on the live conductor claude process), the native AskUserQuestion tool is removed from the model's tool registry. Skill templates that say "call AskUserQuestion" silently fail in that environment: the model can't ask, the user never sees the question, the skill auto-proceeds without input. The fix is preamble guidance, not a skill-template change: generate-ask-user-format.ts: new "Tool resolution" section at the top of the AskUserQuestion Format block. Tells the model that "AskUserQuestion" can resolve to two tools at runtime — the host MCP variant (e.g. mcp__conductor__AskUserQuestion, registered when the host injects it) and the native tool — and to PREFER any mcp__*__AskUserQuestion variant. Same questions/options shape; same decision-brief format. If neither variant is callable, fall back to writing a "## Decisions to confirm" section into the plan file plus ExitPlanMode (the native plan-mode confirmation surfaces it). Never silently auto-decide. generate-completion-status.ts: the plan-mode-info block (preamble position 1) now explicitly notes that AskUserQuestion satisfies plan mode's end-of-turn requirement for "any variant" and points at the Tool resolution section for the fallback path. This puts the resolution rule in front of every tier-≥2 skill via the preamble, so plan-mode review skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review, autoplan, office-hours) all gain the fix without per-template surgery. Includes regenerated SKILL.md files for all 41 skills + the 3 host-ship golden fixtures used by test/host-config.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(periodic): AUTO_DECIDE opt-in preserved under Conductor flags Periodic-tier eval that exercises the legitimate /plan-tune AUTO_DECIDE path under the same flags Conductor uses (--disallowedTools AskUserQuestion). Confirms the new Tool resolution preamble doesn't trip opt-in users: when the user has set a never-ask preference for a question, the model should auto-pick (outcome 'auto_decided' or 'plan_ready') rather than surface the prompt. Setup runs in an isolated GSTACK_HOME tmpdir — never touches the user's real ~/.gstack state. Writes question_tuning=true + a never-ask preference for plan-ceo-review-mode (source: 'plan-tune', which bypasses the inline-user origin gate). Spawns claude with --disallowedTools AskUserQuestion in plan mode, runs /plan-ceo-review, asserts outcome is NOT 'asked' (i.e., the model honored the preference). Periodic tier because AUTO_DECIDE behavior depends on the model adhering to the QUESTION_TUNING preamble injection — non-deterministic, weekly cron is the right cadence rather than CI gating. Touchfiles cover the AUTO_DECIDE-bearing resolvers + the question-tuning binaries the test setup invokes. touchfiles.test.ts count updates 19 -> 20 because auto-decide-preserved also depends on plan-ceo-review/**. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v1.21.0.0: AskUserQuestion resolves to host MCP variant when native is disallowed MINOR scale per scale-aware bumps in CLAUDE.md: substantial coordinated multi-file change (preamble fix + new test infrastructure + 6 gate-tier regression cases + 1 periodic eval) and a user-visible regression fix that affects every plan-mode review skill running under Conductor's default flag set. User originally targeted v1.21.2.0; landing as v1.21.0.0 since this is the first 1.21.x release on main and there's no prior 1.21.0.0/1.21.1.0 to skip past. Adjust at /ship time if a different number is preferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): fix detection order + whitespace-tolerant pattern matching Two bugs surfaced when validating the v1.21 fix end-to-end: 1. PlanSkillObservation outcome detection ran 'asked' (any numbered options list) BEFORE 'plan_ready'. Plan-mode's "Ready to execute?" confirmation IS a numbered options list (1=auto, 2=manual, ...), so any skill that successfully reached the native confirmation got misclassified as 'asked'. Reorder: 'auto_decided' (most specific, requires AUTO_DECIDE annotation) > 'plan_ready' (next, requires the "ready to execute" stem) > 'asked' (any remaining numbered list). 2. isPlanReadyVisible and isAutoDecidedVisible regexes only matched spaced forms ("ready to execute", "(your preference)"). stripAnsi removes cursor-positioning escapes (`\x1b[40C`) entirely instead of replacing them with spaces, so the same text can render as "readytoexecute" or "(yourpreference)". Both detectors now test the spaced form first, fall through to a whitespace-collapsed comparison. Inline unit smoke confirms both forms match. Updates to the 5 strict 'asked' regression test cases (plan-ceo, plan-eng, plan-devex, autoplan, office-hours): with the detection order corrected, the model's plan-file fallback flow legitimately lands at 'plan_ready' instead of 'asked'. Pass envelope expanded to ['asked', 'plan_ready'] (matching plan-design-review's existing pattern). Failure signals tightened to include 'auto_decided' (catches AUTO_DECIDE without opt-in) plus the standard silent_write/exited/timeout. plan-design was already on this contract from v1.21's first commit, no change needed. The expanded envelope is correct: under --disallowedTools AskUserQuestion the Tool resolution preamble routes the question through plan-mode's native "Ready to execute?" surface — the user still sees the decision, just via the plan-file flow rather than a numbered prompt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): require ## Decisions section under --disallowedTools plan_ready Adversarial review (during /ship Step 11) found that the previous gate-test envelope ['asked', 'plan_ready'] for the AskUserQuestion-blocked regression cases accepted the bug they exist to catch: a model that silently skips Step 0 entirely (writes a plan with no questions, no `## Decisions to confirm` section, just ExitPlanModes) reaches plan_ready and passes. The fix tightens the contract in two layers: 1. Harness: PlanSkillObservation gains a `planFile?: string` field populated when outcome is plan_ready. extractPlanFilePath() walks the visible TTY buffer for "Plan saved to:", "Plan file:", or ".claude/plans/<name>.md" patterns and resolves tilde to absolute. planFileHasDecisionsSection() reads the resolved file and returns true if it contains a `## Decisions` heading (any form: "to confirm", "needed", etc.). 2. Tests: 5 of 6 regression cases now require, when outcome is plan_ready, that obs.planFile is set AND planFileHasDecisionsSection returns true. Otherwise the test fails with a "Step 0 was silently skipped" diagnosis. plan-design-review remains the sole exception — it legitimately short-circuits to plan_ready on no-UI-scope branches and we have no deterministic way to distinguish that from a silent skip. This closes the loophole the adversarial review identified. The fix preamble flow already tells the model to write `## Decisions to confirm` when neither AUQ variant is callable — now the test verifies the model actually did it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(harness): anchor extractPlanFilePath path captures on /Users|~|/home|/var|/tmp Adversarial-tightened gate sweep surfaced a real bug in the path extraction: stripAnsi collapses whitespace via cursor-positioning escape removal, so "yet at /Users/..." in the visible buffer becomes "yetat/Users/..." with no space between. The previous fallback pattern `(~?\/?\S*\.claude\/plans\/[\w-]+\.md)` greedily matched non-whitespace characters BEFORE the path, producing `yetat/Users/garrytan/.claude/...` which then fails fs.readFileSync. Fix: every regex now requires the path to START at a known path-anchor: `~/`, `/Users/`, `/home/`, `/var/`, `/tmp/`, or `./`. Earlier non-whitespace runs can't be glommed in. Verified against the failing fixture (`yetat/Users/...`) plus the four canonical render forms ("Plan saved to:", "Plan file:", `·`-decorated ctrl-g hint, and the bare fallback). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+11
-1
@@ -113,7 +113,7 @@ In plan mode, allowed because they inform the plan: `$B`, `$D`, `codex exec`/`co
|
||||
|
||||
## Skill Invocation During Plan Mode
|
||||
|
||||
If the user invokes a skill in plan mode, the skill takes precedence over generic plan mode behavior. **Treat the skill file as executable instructions, not reference.** Follow it step by step starting from Step 0; the first AskUserQuestion is the workflow entering plan mode, not a violation of it. AskUserQuestion satisfies plan mode's end-of-turn requirement. At a STOP point, stop immediately. Do not continue the workflow or call ExitPlanMode there. Commands marked "PLAN MODE EXCEPTION — ALWAYS RUN" execute. Call ExitPlanMode only after the skill workflow completes, or if the user tells you to cancel the skill or leave plan mode.
|
||||
If the user invokes a skill in plan mode, the skill takes precedence over generic plan mode behavior. **Treat the skill file as executable instructions, not reference.** Follow it step by step starting from Step 0; the first AskUserQuestion is the workflow entering plan mode, not a violation of it. AskUserQuestion (any variant — `mcp__*__AskUserQuestion` or native; see "AskUserQuestion Format → Tool resolution") satisfies plan mode's end-of-turn requirement. If no variant is callable, fall back to writing the decision brief into the plan file as a `## Decisions to confirm` section + ExitPlanMode — never silently auto-decide. At a STOP point, stop immediately. Do not continue the workflow or call ExitPlanMode there. Commands marked "PLAN MODE EXCEPTION — ALWAYS RUN" execute. Call ExitPlanMode only after the skill workflow completes, or if the user tells you to cancel the skill or leave plan mode.
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not auto-invoke or proactively suggest skills. If a skill seems useful, ask: "I think /skillname might help here — want me to run it?"
|
||||
|
||||
@@ -278,6 +278,16 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
|
||||
|
||||
## AskUserQuestion Format
|
||||
|
||||
### Tool resolution (read first)
|
||||
|
||||
"AskUserQuestion" can resolve to two tools at runtime: the **host MCP variant** (e.g. `mcp__conductor__AskUserQuestion` — appears in your tool list when the host registers it) or the **native** Claude Code tool.
|
||||
|
||||
**Rule:** if any `mcp__*__AskUserQuestion` variant is in your tool list, prefer it. Hosts may disable native AUQ via `--disallowedTools AskUserQuestion` (Conductor does, by default) and route through their MCP variant; calling native there silently fails. Same questions/options shape; same decision-brief format applies.
|
||||
|
||||
**Fallback when neither variant is callable:** in plan mode, write the decision brief into the plan file as a `## Decisions to confirm` section + ExitPlanMode (the native "Ready to execute?" surfaces it). Outside plan mode, output the brief as prose and stop. **Never silently auto-decide** — only `/plan-tune` AUTO_DECIDE opt-ins authorize auto-picking.
|
||||
|
||||
### Format
|
||||
|
||||
Every AskUserQuestion is a decision brief and must be sent as tool_use, not prose.
|
||||
|
||||
```
|
||||
|
||||
+11
-1
@@ -102,7 +102,7 @@ In plan mode, allowed because they inform the plan: `$B`, `$D`, `codex exec`/`co
|
||||
|
||||
## Skill Invocation During Plan Mode
|
||||
|
||||
If the user invokes a skill in plan mode, the skill takes precedence over generic plan mode behavior. **Treat the skill file as executable instructions, not reference.** Follow it step by step starting from Step 0; the first AskUserQuestion is the workflow entering plan mode, not a violation of it. AskUserQuestion satisfies plan mode's end-of-turn requirement. At a STOP point, stop immediately. Do not continue the workflow or call ExitPlanMode there. Commands marked "PLAN MODE EXCEPTION — ALWAYS RUN" execute. Call ExitPlanMode only after the skill workflow completes, or if the user tells you to cancel the skill or leave plan mode.
|
||||
If the user invokes a skill in plan mode, the skill takes precedence over generic plan mode behavior. **Treat the skill file as executable instructions, not reference.** Follow it step by step starting from Step 0; the first AskUserQuestion is the workflow entering plan mode, not a violation of it. AskUserQuestion (any variant — `mcp__*__AskUserQuestion` or native; see "AskUserQuestion Format → Tool resolution") satisfies plan mode's end-of-turn requirement. If no variant is callable, fall back to writing the decision brief into the plan file as a `## Decisions to confirm` section + ExitPlanMode — never silently auto-decide. At a STOP point, stop immediately. Do not continue the workflow or call ExitPlanMode there. Commands marked "PLAN MODE EXCEPTION — ALWAYS RUN" execute. Call ExitPlanMode only after the skill workflow completes, or if the user tells you to cancel the skill or leave plan mode.
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not auto-invoke or proactively suggest skills. If a skill seems useful, ask: "I think /skillname might help here — want me to run it?"
|
||||
|
||||
@@ -267,6 +267,16 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
|
||||
|
||||
## AskUserQuestion Format
|
||||
|
||||
### Tool resolution (read first)
|
||||
|
||||
"AskUserQuestion" can resolve to two tools at runtime: the **host MCP variant** (e.g. `mcp__conductor__AskUserQuestion` — appears in your tool list when the host registers it) or the **native** Claude Code tool.
|
||||
|
||||
**Rule:** if any `mcp__*__AskUserQuestion` variant is in your tool list, prefer it. Hosts may disable native AUQ via `--disallowedTools AskUserQuestion` (Conductor does, by default) and route through their MCP variant; calling native there silently fails. Same questions/options shape; same decision-brief format applies.
|
||||
|
||||
**Fallback when neither variant is callable:** in plan mode, write the decision brief into the plan file as a `## Decisions to confirm` section + ExitPlanMode (the native "Ready to execute?" surfaces it). Outside plan mode, output the brief as prose and stop. **Never silently auto-decide** — only `/plan-tune` AUTO_DECIDE opt-ins authorize auto-picking.
|
||||
|
||||
### Format
|
||||
|
||||
Every AskUserQuestion is a decision brief and must be sent as tool_use, not prose.
|
||||
|
||||
```
|
||||
|
||||
+11
-1
@@ -104,7 +104,7 @@ In plan mode, allowed because they inform the plan: `$B`, `$D`, `codex exec`/`co
|
||||
|
||||
## Skill Invocation During Plan Mode
|
||||
|
||||
If the user invokes a skill in plan mode, the skill takes precedence over generic plan mode behavior. **Treat the skill file as executable instructions, not reference.** Follow it step by step starting from Step 0; the first AskUserQuestion is the workflow entering plan mode, not a violation of it. AskUserQuestion satisfies plan mode's end-of-turn requirement. At a STOP point, stop immediately. Do not continue the workflow or call ExitPlanMode there. Commands marked "PLAN MODE EXCEPTION — ALWAYS RUN" execute. Call ExitPlanMode only after the skill workflow completes, or if the user tells you to cancel the skill or leave plan mode.
|
||||
If the user invokes a skill in plan mode, the skill takes precedence over generic plan mode behavior. **Treat the skill file as executable instructions, not reference.** Follow it step by step starting from Step 0; the first AskUserQuestion is the workflow entering plan mode, not a violation of it. AskUserQuestion (any variant — `mcp__*__AskUserQuestion` or native; see "AskUserQuestion Format → Tool resolution") satisfies plan mode's end-of-turn requirement. If no variant is callable, fall back to writing the decision brief into the plan file as a `## Decisions to confirm` section + ExitPlanMode — never silently auto-decide. At a STOP point, stop immediately. Do not continue the workflow or call ExitPlanMode there. Commands marked "PLAN MODE EXCEPTION — ALWAYS RUN" execute. Call ExitPlanMode only after the skill workflow completes, or if the user tells you to cancel the skill or leave plan mode.
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not auto-invoke or proactively suggest skills. If a skill seems useful, ask: "I think /skillname might help here — want me to run it?"
|
||||
|
||||
@@ -269,6 +269,16 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions:
|
||||
|
||||
## AskUserQuestion Format
|
||||
|
||||
### Tool resolution (read first)
|
||||
|
||||
"AskUserQuestion" can resolve to two tools at runtime: the **host MCP variant** (e.g. `mcp__conductor__AskUserQuestion` — appears in your tool list when the host registers it) or the **native** Claude Code tool.
|
||||
|
||||
**Rule:** if any `mcp__*__AskUserQuestion` variant is in your tool list, prefer it. Hosts may disable native AUQ via `--disallowedTools AskUserQuestion` (Conductor does, by default) and route through their MCP variant; calling native there silently fails. Same questions/options shape; same decision-brief format applies.
|
||||
|
||||
**Fallback when neither variant is callable:** in plan mode, write the decision brief into the plan file as a `## Decisions to confirm` section + ExitPlanMode (the native "Ready to execute?" surfaces it). Outside plan mode, output the brief as prose and stop. **Never silently auto-decide** — only `/plan-tune` AUTO_DECIDE opt-ins authorize auto-picking.
|
||||
|
||||
### Format
|
||||
|
||||
Every AskUserQuestion is a decision brief and must be sent as tool_use, not prose.
|
||||
|
||||
```
|
||||
|
||||
@@ -133,9 +133,104 @@ export function isTrustDialogVisible(visible: string): boolean {
|
||||
return visible.includes('trust this folder');
|
||||
}
|
||||
|
||||
/** Detect plan-mode's native "ready to execute" confirmation. */
|
||||
/**
|
||||
* Detect plan-mode's native "ready to execute" confirmation. Tests both the
|
||||
* spaced and whitespace-collapsed forms because stripAnsi removes cursor-
|
||||
* positioning escapes (e.g. `\x1b[40C`) that render visually as spaces but
|
||||
* leave no character behind — so "ready to execute" can come through as
|
||||
* "readytoexecute" depending on the rendering path.
|
||||
*/
|
||||
export function isPlanReadyVisible(visible: string): boolean {
|
||||
return /ready to execute|Would you like to proceed/i.test(visible);
|
||||
if (/ready to execute|Would you like to proceed/i.test(visible)) return true;
|
||||
const collapsed = visible.replace(/\s+/g, '');
|
||||
return /readytoexecute|Wouldyouliketoproceed/i.test(collapsed);
|
||||
}
|
||||
|
||||
/**
|
||||
* Detect the AUTO_DECIDE preamble template firing. The model prints
|
||||
* "Auto-decided <summary> → <option> (your preference). Change with /plan-tune."
|
||||
* when it short-circuits an AskUserQuestion via the question-tuning resolver
|
||||
* (`scripts/resolvers/question-tuning.ts:26`). The "Auto-decided ..." stem +
|
||||
* "(your preference)" tail combination is the tightest signal. Whitespace-
|
||||
* collapsed forms covered for the same TTY-rendering reason as
|
||||
* isPlanReadyVisible.
|
||||
*/
|
||||
export function isAutoDecidedVisible(visible: string): boolean {
|
||||
const stemMatch =
|
||||
/Auto-decided\b/i.test(visible) || /Auto-decided/i.test(visible.replace(/\s+/g, ''));
|
||||
if (!stemMatch) return false;
|
||||
if (/\(your preference\)/i.test(visible)) return true;
|
||||
return /\(yourpreference\)/i.test(visible.replace(/\s+/g, ''));
|
||||
}
|
||||
|
||||
/**
|
||||
* Extract the plan file path from rendered TTY output. Plan-mode's native
|
||||
* confirmation includes one of these formats near the "Ready to execute?"
|
||||
* prompt:
|
||||
* - `Plan saved to: /path/to/plan.md`
|
||||
* - `Plan file: /path/to/plan.md`
|
||||
* - `ctrl-g to edit in VSCode · ~/.claude/plans/<name>.md`
|
||||
*
|
||||
* stripAnsi may collapse whitespace via cursor-positioning escape removal,
|
||||
* so the regex tolerates variable spacing. Returns the resolved absolute
|
||||
* path with `~` expanded, or null if no path was rendered.
|
||||
*
|
||||
* Used by v1.22 AskUserQuestion-blocked regression tests to read the plan
|
||||
* file post-`plan_ready` and verify it contains a decisions section, which
|
||||
* distinguishes the legitimate fallback flow ("write decision brief into
|
||||
* plan file") from the silent-skip regression ("write a plan that didn't
|
||||
* surface any decisions").
|
||||
*/
|
||||
export function extractPlanFilePath(visible: string): string | null {
|
||||
// Patterns checked in order of specificity. Each captures the .md path.
|
||||
// The visible buffer may have stripAnsi-collapsed whitespace ("yet at" can
|
||||
// become "yetat"), so the captured path MUST start at a clear path-anchor
|
||||
// character: `~/`, `/Users/`, `/home/`, `/var/`, or `/tmp/`. Anchoring on
|
||||
// these prefixes prevents earlier non-whitespace characters from being
|
||||
// glommed into the path (real bug seen in the wild: `yetat/Users/...`).
|
||||
const PATH_ANCHOR = '(~\\/|\\/Users\\/|\\/home\\/|\\/var\\/|\\/tmp\\/|\\.\\/)';
|
||||
const patterns: RegExp[] = [
|
||||
new RegExp(`Plan\\s*saved\\s*to\\s*:?\\s*(${PATH_ANCHOR}\\S+\\.md)`, 'i'),
|
||||
new RegExp(`Plan\\s*file\\s*:?\\s*(${PATH_ANCHOR}\\S+\\.md)`, 'i'),
|
||||
new RegExp(`·\\s*(${PATH_ANCHOR}\\S*\\.claude\\/plans\\/\\S+\\.md)`, 'i'),
|
||||
// Fallback: any path-anchored reference to a .claude/plans .md file.
|
||||
new RegExp(`(${PATH_ANCHOR}\\S*\\.claude\\/plans\\/[\\w-]+\\.md)`, 'i'),
|
||||
];
|
||||
for (const p of patterns) {
|
||||
const m = visible.match(p);
|
||||
if (m && m[1]) {
|
||||
let raw = m[1];
|
||||
// Strip trailing punctuation that some patterns may capture.
|
||||
raw = raw.replace(/\.+$/, '.md').replace(/\.md\.+$/, '.md');
|
||||
// Tilde expansion to absolute path.
|
||||
if (raw.startsWith('~')) {
|
||||
const home = process.env.HOME ?? '';
|
||||
raw = home + raw.slice(1);
|
||||
}
|
||||
return raw;
|
||||
}
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Read a plan file written by a plan-mode skill and verify it contains a
|
||||
* "decisions" section — evidence the skill surfaced the decisions it was
|
||||
* supposed to gate on, even when AskUserQuestion is --disallowedTools and
|
||||
* the model used the plan-file fallback flow instead of a numbered prompt.
|
||||
*
|
||||
* Accepts any `## Decisions ...` heading (the canonical form from the
|
||||
* preamble is `## Decisions to confirm`, but small variants like
|
||||
* `## Decisions needed` or `## Decisions for review` are common). Returns
|
||||
* false if the file is unreadable, missing, or has no decisions section.
|
||||
*/
|
||||
export function planFileHasDecisionsSection(planFile: string): boolean {
|
||||
try {
|
||||
const content = fs.readFileSync(planFile, 'utf-8');
|
||||
return /^##\s+Decisions\b/im.test(content);
|
||||
} catch {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
@@ -359,6 +454,7 @@ export function optionsSignature(
|
||||
*/
|
||||
export type ClassifyResult =
|
||||
| { outcome: 'silent_write'; summary: string }
|
||||
| { outcome: 'auto_decided'; summary: string }
|
||||
| { outcome: 'plan_ready'; summary: string }
|
||||
| { outcome: 'asked'; summary: string }
|
||||
| null;
|
||||
@@ -388,6 +484,17 @@ export function classifyVisible(visible: string): ClassifyResult {
|
||||
};
|
||||
}
|
||||
}
|
||||
// 'auto_decided' must beat 'plan_ready': when AUTO_DECIDE fires upstream of
|
||||
// plan-ready, both signals are visible by the time the polling loop checks.
|
||||
// The annotation text is the more informative outcome — it explains WHY
|
||||
// we got to plan_ready without surfacing the question.
|
||||
if (isAutoDecidedVisible(visible)) {
|
||||
return {
|
||||
outcome: 'auto_decided',
|
||||
summary:
|
||||
'skill auto-decided an AskUserQuestion via the AUTO_DECIDE preamble (the user never saw the prompt)',
|
||||
};
|
||||
}
|
||||
if (isPlanReadyVisible(visible)) {
|
||||
return {
|
||||
outcome: 'plan_ready',
|
||||
@@ -903,22 +1010,38 @@ export async function invokeAndObserve(
|
||||
export interface PlanSkillObservation {
|
||||
/**
|
||||
* What happened first. One of:
|
||||
* - 'asked' — skill emitted a numbered-option prompt (its Step 0
|
||||
* AskUserQuestion or the routing-injection prompt)
|
||||
* - 'plan_ready' — claude wrote a plan and emitted its native
|
||||
* "Ready to execute" confirmation
|
||||
* - 'asked' — skill emitted a numbered-option prompt (its Step 0
|
||||
* AskUserQuestion or the routing-injection prompt)
|
||||
* - 'auto_decided' — visible TTY shows "Auto-decided ... → ..." (the
|
||||
* AUTO_DECIDE preamble template fired). Distinguishes
|
||||
* "the regression we're tracking" (auto-mode silently
|
||||
* auto-deciding questions the user wanted to see) from
|
||||
* "skill legitimately reached plan_ready". Detected
|
||||
* before plan_ready/silent_write so the auto-decide
|
||||
* evidence wins when both are present.
|
||||
* - 'plan_ready' — claude wrote a plan and emitted its native
|
||||
* "Ready to execute" confirmation
|
||||
* - 'silent_write' — a Write/Edit landed BEFORE any prompt, to a path
|
||||
* outside the sanctioned plan/project directories
|
||||
* - 'exited' — claude process died before any of the above
|
||||
* - 'timeout' — none of the above within budget
|
||||
* outside the sanctioned plan/project directories
|
||||
* - 'exited' — claude process died before any of the above
|
||||
* - 'timeout' — none of the above within budget
|
||||
*/
|
||||
outcome: 'asked' | 'plan_ready' | 'silent_write' | 'exited' | 'timeout';
|
||||
outcome: 'asked' | 'auto_decided' | 'plan_ready' | 'silent_write' | 'exited' | 'timeout';
|
||||
/** Human-readable summary. */
|
||||
summary: string;
|
||||
/** Visible terminal text since the slash command was sent (last 2KB). */
|
||||
evidence: string;
|
||||
/** Wall time (ms) until the outcome was decided. */
|
||||
elapsedMs: number;
|
||||
/**
|
||||
* Path to the plan file the skill wrote (if outcome is 'plan_ready').
|
||||
* Extracted from the visible TTY via {@link extractPlanFilePath}. Lets the
|
||||
* v1.22 AskUserQuestion-blocked regression tests verify the plan file
|
||||
* contains a `## Decisions to confirm` section under --disallowedTools —
|
||||
* a model that silently skips Step 0 reaches plan_ready WITHOUT writing
|
||||
* the section, and that's the regression we want to catch.
|
||||
*/
|
||||
planFile?: string;
|
||||
}
|
||||
|
||||
/**
|
||||
@@ -948,6 +1071,12 @@ export async function runPlanSkillObservation(opts: {
|
||||
cwd?: string;
|
||||
/** Total budget for skill to reach a terminal outcome. Default 180000. */
|
||||
timeoutMs?: number;
|
||||
/** Extra CLI args appended after --permission-mode. Used by the v1.22+
|
||||
* AskUserQuestion-blocked regression tests to pass
|
||||
* `['--disallowedTools', 'AskUserQuestion']` (the flag set Conductor
|
||||
* uses to remove native AskUserQuestion in favor of its MCP variant).
|
||||
* Plumbs straight through to launchClaudePty. */
|
||||
extraArgs?: string[];
|
||||
/**
|
||||
* Extra env merged into the spawned `claude` process. `launchClaudePty`
|
||||
* already supports this; exposing it here lets per-skill tests isolate
|
||||
@@ -962,6 +1091,7 @@ export async function runPlanSkillObservation(opts: {
|
||||
permissionMode: opts.inPlanMode === false ? null : 'plan',
|
||||
cwd: opts.cwd,
|
||||
timeoutMs: (opts.timeoutMs ?? 180_000) + 30_000,
|
||||
extraArgs: opts.extraArgs,
|
||||
env: opts.env,
|
||||
});
|
||||
|
||||
@@ -995,11 +1125,19 @@ export async function runPlanSkillObservation(opts: {
|
||||
}
|
||||
const classified = classifyVisible(visible);
|
||||
if (classified) {
|
||||
return {
|
||||
const obs: PlanSkillObservation = {
|
||||
...classified,
|
||||
evidence: visible.slice(-2000),
|
||||
elapsedMs: Date.now() - startedAt,
|
||||
};
|
||||
// For plan_ready outcomes, capture the plan file path from the full
|
||||
// visible buffer — tests under --disallowedTools verify the file's
|
||||
// contents to distinguish legitimate fallback flow from silent-skip.
|
||||
if (classified.outcome === 'plan_ready') {
|
||||
const planFile = extractPlanFilePath(visible);
|
||||
if (planFile) obs.planFile = planFile;
|
||||
}
|
||||
return obs;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -82,17 +82,36 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
|
||||
'plan-eng-review-artifact': ['plan-eng-review/**'],
|
||||
'plan-review-report': ['plan-eng-review/**', 'scripts/gen-skill-docs.ts'],
|
||||
|
||||
// Plan-mode smoke tests — gate-tier safety regression tests. Each fires when
|
||||
// any of: the interactive skill's template, the plan-mode resolver
|
||||
// (completion-status owns generatePlanModeInfo), preamble composition, or
|
||||
// the real-PTY runner (which the tests now use instead of the SDK harness)
|
||||
// change.
|
||||
'plan-ceo-review-plan-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'plan-eng-review-plan-mode': ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'plan-devex-review-plan-mode': ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
// Plan-mode smoke tests — gate-tier safety regression tests. Each test file
|
||||
// contains TWO test cases as of v1.21: the baseline plan-mode case and the
|
||||
// AskUserQuestion-blocked regression case (--disallowedTools AskUserQuestion
|
||||
// parameterized — the flag set Conductor uses by default). Touchfiles
|
||||
// include question-tuning.ts and generate-ask-user-format.ts because the
|
||||
// AUTO_DECIDE preamble injection lives there and changes can flip the
|
||||
// regression test outcome between 'asked' and 'auto_decided'.
|
||||
'plan-ceo-review-plan-mode': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'plan-eng-review-plan-mode': ['plan-eng-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'plan-design-review-plan-mode': ['plan-design-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'plan-devex-review-plan-mode': ['plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'plan-mode-no-op': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
|
||||
// v1.21+ AskUserQuestion-blocked regression tests — Conductor launches
|
||||
// claude with `--disallowedTools AskUserQuestion --permission-mode default`
|
||||
// (verified via `ps`); skills must still surface user-decisions through a
|
||||
// fallback path (mcp__conductor__AskUserQuestion or plan-file flow) rather
|
||||
// than silently auto-deciding. Parameterized regression test cases live
|
||||
// INSIDE the existing 4 plan-X-review-plan-mode test files (covered
|
||||
// transitively by the entries above). Two new standalone files exist for
|
||||
// skills with no prior plan-mode test:
|
||||
'autoplan-auto-mode': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
'office-hours-auto-mode': ['office-hours/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
|
||||
// v1.21+ AUTO_DECIDE preserve eval (periodic). Verifies the Tool resolution
|
||||
// fix doesn't trip the legitimate /plan-tune opt-in path: when the user has
|
||||
// written a never-ask preference, AUQ should still auto-decide rather than
|
||||
// surfacing the question. Touches the question-tuning + preference
|
||||
// infrastructure plus the resolvers that own the AUTO_DECIDE preamble.
|
||||
'auto-decide-preserved': ['scripts/resolvers/question-tuning.ts', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completion-status.ts', 'plan-ceo-review/**', 'bin/gstack-question-preference', 'bin/gstack-config', 'bin/gstack-slug', 'test/helpers/claude-pty-runner.ts'],
|
||||
|
||||
// Real-PTY E2E batch (#6 new tests on the harness).
|
||||
// Each one tests behavior the SDK harness can't observe (rendered TTY,
|
||||
// numbered-option lists, multi-phase ordering, idempotency state echo).
|
||||
@@ -378,6 +397,10 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
|
||||
'plan-design-review-plan-mode': 'gate',
|
||||
'plan-devex-review-plan-mode': 'gate',
|
||||
'plan-mode-no-op': 'gate',
|
||||
// v1.21+ auto-mode regression tests
|
||||
'autoplan-auto-mode': 'gate',
|
||||
'office-hours-auto-mode': 'gate',
|
||||
'auto-decide-preserved': 'periodic',
|
||||
'e2e-harness-audit': 'gate',
|
||||
|
||||
// Real-PTY E2E batch — tier classification:
|
||||
|
||||
@@ -0,0 +1,131 @@
|
||||
/**
|
||||
* AUTO_DECIDE opt-in preserved under Conductor flags (periodic-tier, paid, real-PTY).
|
||||
*
|
||||
* Regression test for v1.21+ fix: the new "Tool resolution" preamble
|
||||
* (scripts/resolvers/preamble/generate-ask-user-format.ts) tells the model
|
||||
* to prefer mcp__*__AskUserQuestion variants and fall back to plan-file
|
||||
* decisions when neither is callable. This must NOT break the legitimate
|
||||
* `/plan-tune` AUTO_DECIDE path: when the user has explicitly opted into
|
||||
* auto-deciding a specific question via `gstack-question-preference --write
|
||||
* never-ask`, the model is supposed to honor that — it should still
|
||||
* auto-pick the recommended option and emit the AUTO_DECIDE annotation
|
||||
* ("Auto-decided <summary> → <option> (your preference). Change with
|
||||
* /plan-tune.") instead of opening a question prompt.
|
||||
*
|
||||
* Periodic tier: AUTO_DECIDE behavior depends on the model adhering to
|
||||
* the QUESTION_TUNING preamble injection. Non-deterministic; runs weekly
|
||||
* or manually rather than gating CI.
|
||||
*
|
||||
* Set up:
|
||||
* - tmpDir as GSTACK_HOME (isolated state, doesn't touch the user's
|
||||
* real ~/.gstack)
|
||||
* - question_tuning=true in the tmp config
|
||||
* - preference for plan-ceo-review-mode → never-ask (source: plan-tune)
|
||||
*
|
||||
* Spawn:
|
||||
* claude --permission-mode plan --disallowedTools AskUserQuestion
|
||||
* /plan-ceo-review
|
||||
*
|
||||
* Expected:
|
||||
* - outcome === 'auto_decided' (the AUTO_DECIDE preamble fired and the
|
||||
* "Auto-decided ... (your preference)" text rendered)
|
||||
*
|
||||
* If outcome is 'asked', the model ignored the user's `/plan-tune`
|
||||
* preference — that's a regression against the opt-in feature. If outcome
|
||||
* is 'plan_ready' with no AUTO_DECIDE text, the model auto-decided BUT
|
||||
* skipped the annotation (acceptable; AUTO_DECIDE annotation is good
|
||||
* practice but not the load-bearing behavior).
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { runPlanSkillObservation } from './helpers/claude-pty-runner';
|
||||
import * as fs from 'fs';
|
||||
import * as os from 'os';
|
||||
import * as path from 'path';
|
||||
import { spawnSync } from 'child_process';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
|
||||
const ROOT = path.resolve(import.meta.dir, '..');
|
||||
|
||||
describeE2E('AUTO_DECIDE opt-in preserved under Conductor flags (periodic)', () => {
|
||||
test('user-opted-in question still auto-decides when AskUserQuestion is --disallowedTools', async () => {
|
||||
const tmpHome = fs.mkdtempSync(path.join(os.tmpdir(), 'gstack-auto-decide-'));
|
||||
try {
|
||||
// 1. Bootstrap the tmp GSTACK_HOME with question_tuning=true.
|
||||
const configBin = path.join(ROOT, 'bin', 'gstack-config');
|
||||
const setRes = spawnSync(configBin, ['set', 'question_tuning', 'true'], {
|
||||
env: { ...process.env, GSTACK_HOME: tmpHome },
|
||||
encoding: 'utf-8',
|
||||
});
|
||||
if (setRes.status !== 0) {
|
||||
throw new Error(`gstack-config set failed: ${setRes.stderr || setRes.stdout}`);
|
||||
}
|
||||
|
||||
// 2. Resolve slug for the project (uses git remote — same as the spawned
|
||||
// claude would resolve). The preference file path keys on this slug.
|
||||
const slugBin = path.join(ROOT, 'bin', 'gstack-slug');
|
||||
const slugRes = spawnSync(slugBin, [], {
|
||||
cwd: ROOT,
|
||||
env: { ...process.env, GSTACK_HOME: tmpHome },
|
||||
encoding: 'utf-8',
|
||||
});
|
||||
// gstack-slug emits `eval`-able shell exports like `SLUG=garrytan-gstack`.
|
||||
const slug = (slugRes.stdout.match(/SLUG=([^\s;]+)/)?.[1] ?? 'unknown').replace(/['"]/g, '');
|
||||
|
||||
// 3. Write the preference: plan-ceo-review-mode → never-ask. The
|
||||
// 'plan-tune' source bypasses the inline-user origin gate.
|
||||
const prefBin = path.join(ROOT, 'bin', 'gstack-question-preference');
|
||||
const writeRes = spawnSync(
|
||||
prefBin,
|
||||
['--write', JSON.stringify({
|
||||
question_id: 'plan-ceo-review-mode',
|
||||
preference: 'never-ask',
|
||||
source: 'plan-tune',
|
||||
})],
|
||||
{
|
||||
env: { ...process.env, GSTACK_HOME: tmpHome },
|
||||
encoding: 'utf-8',
|
||||
},
|
||||
);
|
||||
if (writeRes.status !== 0) {
|
||||
throw new Error(`gstack-question-preference --write failed: ${writeRes.stderr || writeRes.stdout}`);
|
||||
}
|
||||
|
||||
// Sanity: the preference file landed where we expect.
|
||||
const prefFile = path.join(tmpHome, 'projects', slug, 'question-preferences.json');
|
||||
if (!fs.existsSync(prefFile)) {
|
||||
throw new Error(`expected preference file at ${prefFile}; not found. slug=${slug}`);
|
||||
}
|
||||
|
||||
// 4. Run /plan-ceo-review with the Conductor flag set + isolated state.
|
||||
const obs = await runPlanSkillObservation({
|
||||
skillName: 'plan-ceo-review',
|
||||
inPlanMode: true,
|
||||
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
|
||||
timeoutMs: 300_000,
|
||||
});
|
||||
|
||||
// 5. Pass: 'auto_decided' (the strongest signal) or 'plan_ready' with
|
||||
// no question rendered. Fail: 'asked' (model ignored the opt-in).
|
||||
if (obs.outcome === 'asked') {
|
||||
throw new Error(
|
||||
`AUTO_DECIDE regression: the model surfaced an AskUserQuestion despite the user's never-ask preference.\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
if (obs.outcome === 'silent_write' || obs.outcome === 'exited' || obs.outcome === 'timeout') {
|
||||
throw new Error(
|
||||
`AUTO_DECIDE preserve test inconclusive: outcome=${obs.outcome}\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
expect(['auto_decided', 'plan_ready']).toContain(obs.outcome);
|
||||
} finally {
|
||||
try { fs.rmSync(tmpHome, { recursive: true, force: true }); } catch { /* best-effort */ }
|
||||
}
|
||||
}, 360_000);
|
||||
});
|
||||
@@ -0,0 +1,67 @@
|
||||
/**
|
||||
* autoplan AskUserQuestion-blocked regression (gate, paid, real-PTY).
|
||||
*
|
||||
* v1.21+ regression: Conductor launches Claude Code with
|
||||
* `--disallowedTools AskUserQuestion --permission-mode default` (verified
|
||||
* by inspecting the parent claude process via `ps`). The native
|
||||
* AskUserQuestion tool is removed from the model's tool registry; without
|
||||
* fallback guidance the model can't ask the user and silently proceeds.
|
||||
*
|
||||
* Autoplan auto-decides INTERMEDIATE questions BY DESIGN
|
||||
* (autoplan/SKILL.md.tmpl:45), but Phase 1's premise confirmation gate is
|
||||
* one of the few non-auto-decided AskUserQuestions and MUST surface to the
|
||||
* user. This test asserts that gate still surfaces when AskUserQuestion is
|
||||
* disallowed at the tool-registry level — the fix must route the question
|
||||
* through a Conductor-side variant (mcp__conductor__AskUserQuestion) or
|
||||
* through the plan-file + ExitPlanMode flow.
|
||||
*
|
||||
* Filename keeps `auto-mode` for branch-history continuity. Auto-mode (the
|
||||
* AUTO_DECIDE preamble path when QUESTION_TUNING=true) is a related but
|
||||
* distinct silencing mechanism; both share the same fix surface.
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
|
||||
describeE2E('autoplan AskUserQuestion-blocked smoke (gate)', () => {
|
||||
// Pass envelope is ['asked', 'plan_ready']: model either renders the
|
||||
// first non-auto-decided gate (Phase 1 premise confirmation) as numbered
|
||||
// prose or surfaces it through the plan file + ExitPlanMode flow.
|
||||
// Autoplan auto-decides intermediate questions BY DESIGN; the failure
|
||||
// signal we care about is the AUTO_DECIDE preamble firing on a gate it
|
||||
// shouldn't (caught explicitly via the 'auto_decided' outcome).
|
||||
test('a non-auto-decided gate surfaces when AskUserQuestion is --disallowedTools', async () => {
|
||||
const obs = await runPlanSkillObservation({
|
||||
skillName: 'autoplan',
|
||||
inPlanMode: true,
|
||||
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
|
||||
timeoutMs: 300_000,
|
||||
});
|
||||
|
||||
if (
|
||||
obs.outcome === 'auto_decided' ||
|
||||
obs.outcome === 'silent_write' ||
|
||||
obs.outcome === 'exited' ||
|
||||
obs.outcome === 'timeout'
|
||||
) {
|
||||
throw new Error(
|
||||
`autoplan AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`elapsed: ${obs.elapsedMs}ms\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
if (obs.outcome === 'plan_ready') {
|
||||
if (!obs.planFile || !planFileHasDecisionsSection(obs.planFile)) {
|
||||
throw new Error(
|
||||
`autoplan AskUserQuestion-blocked regression: plan_ready without a "## Decisions" section in ${obs.planFile ?? '<no plan file detected>'} — Phase 1 premise gate was silently skipped.\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
});
|
||||
@@ -0,0 +1,59 @@
|
||||
/**
|
||||
* office-hours AskUserQuestion-blocked regression (gate, paid, real-PTY).
|
||||
*
|
||||
* v1.21+ regression: Conductor launches Claude Code with
|
||||
* `--disallowedTools AskUserQuestion --permission-mode default` (verified
|
||||
* by inspecting the parent claude process via `ps`). office-hours' first
|
||||
* step issues a startup-vs-builder mode AskUserQuestion
|
||||
* (office-hours/SKILL.md.tmpl:69); when AskUserQuestion is disallowed at
|
||||
* the tool-registry level the model cannot ask and silently picks one mode,
|
||||
* breaking the whole interactive premise. This test asserts that question
|
||||
* still surfaces — fix must route through mcp__conductor__AskUserQuestion
|
||||
* (when present) or plan-file + ExitPlanMode flow.
|
||||
*
|
||||
* Filename keeps `auto-mode` for branch-history continuity. Auto-mode (the
|
||||
* AUTO_DECIDE preamble path when QUESTION_TUNING=true) is a related but
|
||||
* distinct silencing mechanism; both share the same fix surface.
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
|
||||
describeE2E('office-hours AskUserQuestion-blocked smoke (gate)', () => {
|
||||
// Pass envelope is ['asked', 'plan_ready']; failure signals are
|
||||
// 'auto_decided' + silent_write/exited/timeout.
|
||||
test('AskUserQuestion surfaces when --disallowedTools AskUserQuestion is set', async () => {
|
||||
const obs = await runPlanSkillObservation({
|
||||
skillName: 'office-hours',
|
||||
inPlanMode: true,
|
||||
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
|
||||
timeoutMs: 300_000,
|
||||
});
|
||||
|
||||
if (
|
||||
obs.outcome === 'auto_decided' ||
|
||||
obs.outcome === 'silent_write' ||
|
||||
obs.outcome === 'exited' ||
|
||||
obs.outcome === 'timeout'
|
||||
) {
|
||||
throw new Error(
|
||||
`office-hours AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`elapsed: ${obs.elapsedMs}ms\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
if (obs.outcome === 'plan_ready') {
|
||||
if (!obs.planFile || !planFileHasDecisionsSection(obs.planFile)) {
|
||||
throw new Error(
|
||||
`office-hours AskUserQuestion-blocked regression: plan_ready without a "## Decisions" section in ${obs.planFile ?? '<no plan file detected>'} — startup-vs-builder mode question was silently skipped.\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
});
|
||||
@@ -34,7 +34,7 @@
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { runPlanSkillObservation } from './helpers/claude-pty-runner';
|
||||
import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
@@ -69,4 +69,65 @@ describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
|
||||
);
|
||||
}
|
||||
}, 360_000);
|
||||
|
||||
// v1.21+ regression: Conductor launches Claude Code with
|
||||
// `--disallowedTools AskUserQuestion --permission-mode default` (verified
|
||||
// via `ps` on the live Conductor claude process). Native AskUserQuestion
|
||||
// is removed from the model's tool registry; without fallback guidance
|
||||
// the model can't ask and silently proceeds.
|
||||
//
|
||||
// The fix (Tool resolution preamble) accepts two surface paths under
|
||||
// --disallowedTools:
|
||||
// - 'asked' — model emits a numbered-option prompt as prose (with
|
||||
// the same D<N> + Pros/cons format as a real AUQ)
|
||||
// - 'plan_ready' — model writes the question into the plan file as a
|
||||
// "## Decisions to confirm" section + ExitPlanMode;
|
||||
// the native plan-mode "Ready to execute?" surfaces
|
||||
// it through the TTY confirmation
|
||||
//
|
||||
// Both let the user see the decision. Failure signals are
|
||||
// silent_write/exited/timeout (model never surfaced the question) and
|
||||
// 'auto_decided' (the AUTO_DECIDE preamble fired without a /plan-tune
|
||||
// opt-in — caught explicitly).
|
||||
test('AskUserQuestion surfaces when --disallowedTools AskUserQuestion is set', async () => {
|
||||
const obs = await runPlanSkillObservation({
|
||||
skillName: 'plan-ceo-review',
|
||||
inPlanMode: true,
|
||||
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
|
||||
timeoutMs: 300_000,
|
||||
});
|
||||
|
||||
if (
|
||||
obs.outcome === 'auto_decided' ||
|
||||
obs.outcome === 'silent_write' ||
|
||||
obs.outcome === 'exited' ||
|
||||
obs.outcome === 'timeout'
|
||||
) {
|
||||
throw new Error(
|
||||
`plan-ceo-review AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`elapsed: ${obs.elapsedMs}ms\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
// plan_ready under --disallowedTools is only a pass when the model used
|
||||
// the plan-file fallback (wrote a `## Decisions to confirm` section).
|
||||
// Without that section, plan_ready means the model silently skipped Step 0
|
||||
// and went straight to ExitPlanMode — the regression we're catching.
|
||||
if (obs.outcome === 'plan_ready') {
|
||||
if (!obs.planFile) {
|
||||
throw new Error(
|
||||
`plan-ceo-review AskUserQuestion-blocked regression: outcome=plan_ready but no plan file path detected in TTY output. Cannot verify the model used the fallback flow.\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
if (!planFileHasDecisionsSection(obs.planFile)) {
|
||||
throw new Error(
|
||||
`plan-ceo-review AskUserQuestion-blocked regression: model wrote ${obs.planFile} without a "## Decisions" section. Step 0 was silently skipped.\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
});
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { runPlanSkillObservation } from './helpers/claude-pty-runner';
|
||||
import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
@@ -33,4 +33,40 @@ describeE2E('plan-design-review plan-mode smoke (gate)', () => {
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
|
||||
// v1.21+ regression: see skill-e2e-plan-ceo-plan-mode.test.ts for the
|
||||
// contract. plan-design-review legitimately short-circuits on no-UI-scope
|
||||
// branches, so this case keeps the same ['asked', 'plan_ready'] envelope
|
||||
// as the baseline. The discriminating regression signals are
|
||||
// 'auto_decided' (AUTO_DECIDE preamble fired upstream) or any failure
|
||||
// outcome — both mean the user never saw a question they should have.
|
||||
test('does not silently auto-decide when --disallowedTools AskUserQuestion is set', async () => {
|
||||
const obs = await runPlanSkillObservation({
|
||||
skillName: 'plan-design-review',
|
||||
inPlanMode: true,
|
||||
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
|
||||
timeoutMs: 300_000,
|
||||
});
|
||||
|
||||
if (
|
||||
obs.outcome === 'auto_decided' ||
|
||||
obs.outcome === 'silent_write' ||
|
||||
obs.outcome === 'exited' ||
|
||||
obs.outcome === 'timeout'
|
||||
) {
|
||||
throw new Error(
|
||||
`plan-design-review AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`elapsed: ${obs.elapsedMs}ms\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
// plan-design-review legitimately short-circuits to plan_ready on no-UI
|
||||
// branches. Allow plan_ready WITHOUT a decisions section ONLY if the
|
||||
// plan file genuinely has no UI scope (we don't have a deterministic way
|
||||
// to check this from the test, so this skill keeps the looser envelope).
|
||||
// Other plan-mode skills require the decisions section under
|
||||
// --disallowedTools; design is the special case.
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
});
|
||||
|
||||
@@ -6,7 +6,7 @@
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { runPlanSkillObservation } from './helpers/claude-pty-runner';
|
||||
import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
@@ -29,4 +29,40 @@ describeE2E('plan-devex-review plan-mode smoke (gate)', () => {
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
|
||||
// v1.21+ regression: see skill-e2e-plan-ceo-plan-mode.test.ts for the
|
||||
// contract. Pass envelope is ['asked', 'plan_ready']; failure signals
|
||||
// are 'auto_decided' (AUTO_DECIDE without opt-in) plus the standard
|
||||
// silent_write/exited/timeout.
|
||||
test('AskUserQuestion surfaces when --disallowedTools AskUserQuestion is set', async () => {
|
||||
const obs = await runPlanSkillObservation({
|
||||
skillName: 'plan-devex-review',
|
||||
inPlanMode: true,
|
||||
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
|
||||
timeoutMs: 300_000,
|
||||
});
|
||||
|
||||
if (
|
||||
obs.outcome === 'auto_decided' ||
|
||||
obs.outcome === 'silent_write' ||
|
||||
obs.outcome === 'exited' ||
|
||||
obs.outcome === 'timeout'
|
||||
) {
|
||||
throw new Error(
|
||||
`plan-devex-review AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`elapsed: ${obs.elapsedMs}ms\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
if (obs.outcome === 'plan_ready') {
|
||||
if (!obs.planFile || !planFileHasDecisionsSection(obs.planFile)) {
|
||||
throw new Error(
|
||||
`plan-devex-review AskUserQuestion-blocked regression: plan_ready without a "## Decisions" section in ${obs.planFile ?? '<no plan file detected>'} — Step 0 was silently skipped.\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
});
|
||||
|
||||
@@ -6,7 +6,7 @@
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { runPlanSkillObservation } from './helpers/claude-pty-runner';
|
||||
import { runPlanSkillObservation, planFileHasDecisionsSection } from './helpers/claude-pty-runner';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
@@ -29,4 +29,40 @@ describeE2E('plan-eng-review plan-mode smoke (gate)', () => {
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
|
||||
// v1.21+ regression: see skill-e2e-plan-ceo-plan-mode.test.ts for the
|
||||
// contract. Pass envelope is ['asked', 'plan_ready']; failure signals
|
||||
// are 'auto_decided' (AUTO_DECIDE without opt-in) plus the standard
|
||||
// silent_write/exited/timeout.
|
||||
test('AskUserQuestion surfaces when --disallowedTools AskUserQuestion is set', async () => {
|
||||
const obs = await runPlanSkillObservation({
|
||||
skillName: 'plan-eng-review',
|
||||
inPlanMode: true,
|
||||
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
|
||||
timeoutMs: 300_000,
|
||||
});
|
||||
|
||||
if (
|
||||
obs.outcome === 'auto_decided' ||
|
||||
obs.outcome === 'silent_write' ||
|
||||
obs.outcome === 'exited' ||
|
||||
obs.outcome === 'timeout'
|
||||
) {
|
||||
throw new Error(
|
||||
`plan-eng-review AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
|
||||
`summary: ${obs.summary}\n` +
|
||||
`elapsed: ${obs.elapsedMs}ms\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
if (obs.outcome === 'plan_ready') {
|
||||
if (!obs.planFile || !planFileHasDecisionsSection(obs.planFile)) {
|
||||
throw new Error(
|
||||
`plan-eng-review AskUserQuestion-blocked regression: plan_ready without a "## Decisions" section in ${obs.planFile ?? '<no plan file detected>'} — Step 0 was silently skipped.\n` +
|
||||
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
|
||||
);
|
||||
}
|
||||
}
|
||||
expect(['asked', 'plan_ready']).toContain(obs.outcome);
|
||||
}, 360_000);
|
||||
});
|
||||
|
||||
@@ -99,8 +99,12 @@ describe('selectTests', () => {
|
||||
expect(result.selected).toContain('autoplan-chain-pty');
|
||||
// Per-finding count + review-report-at-bottom (v1.21.x)
|
||||
expect(result.selected).toContain('plan-ceo-finding-count');
|
||||
expect(result.selected.length).toBe(19);
|
||||
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 19);
|
||||
// v1.22+ AskUserQuestion-blocked regression: autoplan-auto-mode +
|
||||
// auto-decide-preserved also depend on plan-ceo-review/**
|
||||
expect(result.selected).toContain('autoplan-auto-mode');
|
||||
expect(result.selected).toContain('auto-decide-preserved');
|
||||
expect(result.selected.length).toBe(21);
|
||||
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 21);
|
||||
});
|
||||
|
||||
test('global touchfile triggers ALL tests', () => {
|
||||
|
||||
Reference in New Issue
Block a user