Adversarial review (during /ship Step 11) found that the previous gate-test
envelope ['asked', 'plan_ready'] for the AskUserQuestion-blocked regression
cases accepted the bug they exist to catch: a model that silently skips
Step 0 entirely (writes a plan with no questions, no `## Decisions to
confirm` section, just ExitPlanModes) reaches plan_ready and passes.
The fix tightens the contract in two layers:
1. Harness: PlanSkillObservation gains a `planFile?: string` field
populated when outcome is plan_ready. extractPlanFilePath() walks the
visible TTY buffer for "Plan saved to:", "Plan file:", or
".claude/plans/<name>.md" patterns and resolves tilde to absolute.
planFileHasDecisionsSection() reads the resolved file and returns true
if it contains a `## Decisions` heading (any form: "to confirm",
"needed", etc.).
2. Tests: 5 of 6 regression cases now require, when outcome is plan_ready,
that obs.planFile is set AND planFileHasDecisionsSection returns true.
Otherwise the test fails with a "Step 0 was silently skipped" diagnosis.
plan-design-review remains the sole exception — it legitimately
short-circuits to plan_ready on no-UI-scope branches and we have no
deterministic way to distinguish that from a silent skip.
This closes the loophole the adversarial review identified. The fix
preamble flow already tells the model to write `## Decisions to confirm`
when neither AUQ variant is callable — now the test verifies the model
actually did it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced when validating the v1.21 fix end-to-end:
1. PlanSkillObservation outcome detection ran 'asked' (any numbered
options list) BEFORE 'plan_ready'. Plan-mode's "Ready to execute?"
confirmation IS a numbered options list (1=auto, 2=manual, ...), so
any skill that successfully reached the native confirmation got
misclassified as 'asked'. Reorder: 'auto_decided' (most specific,
requires AUTO_DECIDE annotation) > 'plan_ready' (next, requires the
"ready to execute" stem) > 'asked' (any remaining numbered list).
2. isPlanReadyVisible and isAutoDecidedVisible regexes only matched
spaced forms ("ready to execute", "(your preference)"). stripAnsi
removes cursor-positioning escapes (`\x1b[40C`) entirely instead of
replacing them with spaces, so the same text can render as
"readytoexecute" or "(yourpreference)". Both detectors now test the
spaced form first, fall through to a whitespace-collapsed comparison.
Inline unit smoke confirms both forms match.
Updates to the 5 strict 'asked' regression test cases (plan-ceo,
plan-eng, plan-devex, autoplan, office-hours): with the detection order
corrected, the model's plan-file fallback flow legitimately lands at
'plan_ready' instead of 'asked'. Pass envelope expanded to ['asked',
'plan_ready'] (matching plan-design-review's existing pattern). Failure
signals tightened to include 'auto_decided' (catches AUTO_DECIDE without
opt-in) plus the standard silent_write/exited/timeout. plan-design was
already on this contract from v1.21's first commit, no change needed.
The expanded envelope is correct: under --disallowedTools AskUserQuestion
the Tool resolution preamble routes the question through plan-mode's
native "Ready to execute?" surface — the user still sees the decision,
just via the plan-file flow rather than a numbered prompt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conductor launches Claude Code with --disallowedTools AskUserQuestion
--permission-mode default --permission-prompt-tool stdio (verified by
inspecting the live conductor claude process via ps -p ... -o args=).
Native AskUserQuestion is removed from the model's tool registry; without
fallback guidance the plan-mode skills (plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review, autoplan, office-hours) silently
proceed and never surface decisions to the user.
Adds 6 gate-tier real-PTY regression cases:
- 4 inline test cases inside the existing plan-X-review-plan-mode.test
files, each exercising the same skill with extraArgs ['--disallowedTools',
'AskUserQuestion'] and asserting outcome === 'asked'. plan-design-review
keeps the ['asked', 'plan_ready'] envelope (legitimate short-circuit on
no-UI-scope) but explicitly fails on 'auto_decided'.
- 2 standalone test files for autoplan + office-hours (which had no prior
plan-mode test). autoplan asserts the FIRST non-auto-decided gate fires
(Phase 1 premise confirmation) — autoplan auto-decides intermediate
questions BY DESIGN.
Touchfile entries:
- autoplan-auto-mode + office-hours-auto-mode added to E2E_TOUCHFILES +
E2E_TIERS (gate)
- existing plan-X-review-plan-mode entries gain question-tuning.ts and
generate-ask-user-format.ts touchfile deps so AUTO_DECIDE-related
resolver changes correctly invalidate the regression tests
- touchfiles.test.ts count updated 18 -> 19 to cover the autoplan
touchfile dependency on plan-ceo-review/**
Filenames retain `auto-mode` for branch-history continuity. Auto-mode (the
AUTO_DECIDE preamble path when QUESTION_TUNING=true) is a related but
distinct silencing mechanism; both share the same fix surface in the
preamble.
These tests are expected to FAIL on this branch until the fix lands. The
failure is the receipt for the regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>