Files
gstack/test/skill-e2e-autoplan-auto-mode.test.ts
T
Garry Tan aaf2668103 test(e2e): add AskUserQuestion-blocked regression cases for 6 plan-mode skills
Conductor launches Claude Code with --disallowedTools AskUserQuestion
--permission-mode default --permission-prompt-tool stdio (verified by
inspecting the live conductor claude process via ps -p ... -o args=).
Native AskUserQuestion is removed from the model's tool registry; without
fallback guidance the plan-mode skills (plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review, autoplan, office-hours) silently
proceed and never surface decisions to the user.

Adds 6 gate-tier real-PTY regression cases:

  - 4 inline test cases inside the existing plan-X-review-plan-mode.test
    files, each exercising the same skill with extraArgs ['--disallowedTools',
    'AskUserQuestion'] and asserting outcome === 'asked'. plan-design-review
    keeps the ['asked', 'plan_ready'] envelope (legitimate short-circuit on
    no-UI-scope) but explicitly fails on 'auto_decided'.
  - 2 standalone test files for autoplan + office-hours (which had no prior
    plan-mode test). autoplan asserts the FIRST non-auto-decided gate fires
    (Phase 1 premise confirmation) — autoplan auto-decides intermediate
    questions BY DESIGN.

Touchfile entries:
  - autoplan-auto-mode + office-hours-auto-mode added to E2E_TOUCHFILES +
    E2E_TIERS (gate)
  - existing plan-X-review-plan-mode entries gain question-tuning.ts and
    generate-ask-user-format.ts touchfile deps so AUTO_DECIDE-related
    resolver changes correctly invalidate the regression tests
  - touchfiles.test.ts count updated 18 -> 19 to cover the autoplan
    touchfile dependency on plan-ceo-review/**

Filenames retain `auto-mode` for branch-history continuity. Auto-mode (the
AUTO_DECIDE preamble path when QUESTION_TUNING=true) is a related but
distinct silencing mechanism; both share the same fix surface in the
preamble.

These tests are expected to FAIL on this branch until the fix lands. The
failure is the receipt for the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:13:48 -07:00

49 lines
2.1 KiB
TypeScript

/**
* autoplan AskUserQuestion-blocked regression (gate, paid, real-PTY).
*
* v1.21+ regression: Conductor launches Claude Code with
* `--disallowedTools AskUserQuestion --permission-mode default` (verified
* by inspecting the parent claude process via `ps`). The native
* AskUserQuestion tool is removed from the model's tool registry; without
* fallback guidance the model can't ask the user and silently proceeds.
*
* Autoplan auto-decides INTERMEDIATE questions BY DESIGN
* (autoplan/SKILL.md.tmpl:45), but Phase 1's premise confirmation gate is
* one of the few non-auto-decided AskUserQuestions and MUST surface to the
* user. This test asserts that gate still surfaces when AskUserQuestion is
* disallowed at the tool-registry level — the fix must route the question
* through a Conductor-side variant (mcp__conductor__AskUserQuestion) or
* through the plan-file + ExitPlanMode flow.
*
* Filename keeps `auto-mode` for branch-history continuity. Auto-mode (the
* AUTO_DECIDE preamble path when QUESTION_TUNING=true) is a related but
* distinct silencing mechanism; both share the same fix surface.
*/
import { describe, test, expect } from 'bun:test';
import { runPlanSkillObservation } from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
describeE2E('autoplan AskUserQuestion-blocked smoke (gate)', () => {
test('a non-auto-decided gate surfaces when AskUserQuestion is --disallowedTools', async () => {
const obs = await runPlanSkillObservation({
skillName: 'autoplan',
inPlanMode: true,
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
timeoutMs: 300_000,
});
if (obs.outcome !== 'asked') {
throw new Error(
`autoplan AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
`summary: ${obs.summary}\n` +
`elapsed: ${obs.elapsedMs}ms\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
expect(obs.outcome).toEqual('asked');
}, 360_000);
});