test: delete --disallowedTools AskUserQuestion-blocked test variants

These tests simulated a fictional environment that doesn't exist in
production. Real Conductor sessions launch claude with
`--disallowedTools AskUserQuestion` AND register
`mcp__conductor__AskUserQuestion` — the model has the MCP variant. But
the tests passed `--disallowedTools` without standing up any MCP server,
so they tested "model behavior with NO AUQ available," which no real
user state produces.

Combined with bare `/plan-ceo-review` invocation (no follow-up content),
this forced the model into a 5+ minute deliberation loop trying to
prose-render a question with options it had to first invent. The result
was persistent flakes that consumed nine paid E2E runs trying to fix
"the model takes too long" — but the actual problem was the test
configuration, not the model.

Removals:
- test/skill-e2e-autoplan-auto-mode.test.ts (deleted; the entire file
  was a single AUQ-blocked test)
- test/skill-e2e-plan-ceo-plan-mode.test.ts test 2 (the migrated
  --disallowedTools test); test 1 (baseline plan-mode smoke) stays
- test/skill-e2e-plan-design-plan-mode.test.ts test 2 (same shape);
  test 1 stays
- test/skill-e2e-plan-eng-plan-mode.test.ts test 2 (same shape); test 1
  (baseline) and test 3 (STOP-gate with seeded plan, different
  contract) stay
- test/helpers/touchfiles.ts: autoplan-auto-mode entry removed
- test/touchfiles.test.ts: assertion count + commentary updated

Coverage retained: test 1 of each plan-mode file already verifies the
model fires AUQ; the periodic finding-count tests verify per-finding
AUQ cadence end-to-end. The harness improvements landed during this
debugging cycle (isProseAUQVisible regex, LLM judge, snapshot logging,
high-water-mark tracking, ENOENT-tolerant assertReportAtBottomIfPlanWritten)
all stay — they're useful for the remaining plan-mode tests that can
also encounter prose rendering and slow-thinking phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-05-08 23:43:37 -07:00
parent 4d577830a8
commit 558e195c46
6 changed files with 6 additions and 342 deletions
+1 -126
View File
@@ -33,43 +33,15 @@
* See test/helpers/claude-pty-runner.ts for runner internals.
*/
import { describe, test, expect } from 'bun:test';
import { describe, test } from 'bun:test';
import {
runPlanSkillObservation,
planFileHasDecisionsSection,
assertReportAtBottomIfPlanWritten,
isProseAUQVisible,
} from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;
// Concrete plan to review. Used by the --disallowedTools test to skip
// the "what should I review?" deliberation that otherwise eats the
// model's budget. Has CEO-review-shaped issues (premise gap, vague
// success metric, scope-creep smell) so Step 0 has real material.
const SEED_PLAN_FOR_CEO_REVIEW = `
# Plan: Launch a "developer-friendly" pricing tier
## Goal
Increase developer adoption.
## Success metric
More signups.
## Premise
We haven't talked to any developers about whether the current pricing
is actually a barrier. The team agreed it "feels like" it should be
cheaper. No data yet on what dev users would pay for or what the unit
economics would look like at the new price point.
## Plan
- Pick a 30% discount as the developer tier
- Add an email field to /pricing for "verify with @company.com"
- Auto-enroll anyone with @gmail/@hotmail addresses too as a pilot
- Ship next week
`.trim();
describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
test('first terminal outcome is asked (Step 0 fires before any plan write)', async () => {
const obs = await runPlanSkillObservation({
@@ -101,101 +73,4 @@ describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
}
assertReportAtBottomIfPlanWritten(obs);
}, 360_000);
// v1.21+ regression: Conductor launches Claude Code with
// `--disallowedTools AskUserQuestion --permission-mode default` (verified
// via `ps` on the live Conductor claude process). Native AskUserQuestion
// is removed from the model's tool registry; without fallback guidance
// the model can't ask and silently proceeds.
//
// After v1.28+ (forever-war fix), the preamble fallback that wrote a
// "## Decisions to confirm" section was deleted in favor of a hard
// BLOCKED rule. The pass envelope under --disallowedTools accepts:
// - 'asked' — model emits a numbered-option prompt as prose
// - 'plan_ready' WITH (## Decisions section [legacy]
// OR BLOCKED string visible [post-fix])
// - 'exited' WITH BLOCKED string visible [post-fix]
//
// The legacy `## Decisions` path stays in the envelope so this test
// keeps passing during the migration window when the fallback delete
// and resolver edits land in the same PR but mid-rebase states are
// possible. Once the deletion has been on main long enough that the
// generated SKILL.md cache has flushed, the legacy branch can be
// removed in a follow-up.
//
// Failure signals (regression we DO want to catch):
// - 'auto_decided' — AUTO_DECIDE preamble fired without /plan-tune opt-in
// - 'silent_write' — Write/Edit before any AUQ surface
// - 'timeout' — neither asked nor terminated in budget
// - 'plan_ready' or 'exited' WITHOUT either Decisions section or BLOCKED
test('AskUserQuestion surfaces when --disallowedTools AskUserQuestion is set', async () => {
// Pre-prime with concrete plan content so the model doesn't burn its
// budget deliberating about WHICH artifact to review. Without this seed,
// a bare /plan-ceo-review under --disallowedTools puts the model in a
// 5-minute thinking loop trying to enumerate scope options before
// surfacing them as prose. With the seed, the model has a real plan to
// critique and can move directly to Step 0 / Section 1 findings.
//
// The test still exercises the regression we care about: under
// --disallowedTools, does the skill SURFACE its first decision question
// (via prose, BLOCKED, or some visible surface) rather than silently
// ExitPlanMode-ing?
const obs = await runPlanSkillObservation({
skillName: 'plan-ceo-review',
inPlanMode: true,
extraArgs: ['--disallowedTools', 'AskUserQuestion'],
initialPlanContent: SEED_PLAN_FOR_CEO_REVIEW,
timeoutMs: 300_000,
});
// The user must SEE the question one way or another. Three valid surfaces:
// 1. `## Decisions to confirm` section in the plan file (legacy fallback)
// 2. `BLOCKED — AskUserQuestion` string visible in TTY (post-v1.28 BLOCKED rule)
// 3. Numbered/lettered options visible in TTY as prose (post-v1.28 prose-AUQ rendering)
const blockedVisible = /BLOCKED\s*[—-]\s*AskUserQuestion/i.test(obs.evidence);
const proseAUQVisible = isProseAUQVisible(obs.evidence) || obs.proseAUQEverObserved === true;
const surfaceVisible = blockedVisible || proseAUQVisible || obs.waitingEverObserved === true;
if (
obs.outcome === 'auto_decided' ||
obs.outcome === 'silent_write' ||
obs.outcome === 'timeout'
) {
throw new Error(
`plan-ceo-review AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
`summary: ${obs.summary}\n` +
`elapsed: ${obs.elapsedMs}ms\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
if (obs.outcome === 'exited' && !surfaceVisible) {
throw new Error(
`plan-ceo-review AskUserQuestion-blocked regression: outcome=exited without any visible question surface (no BLOCKED string, no prose-rendered AUQ options). Model quit silently.\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
if (obs.outcome === 'plan_ready') {
if (!obs.planFile) {
if (!surfaceVisible) {
throw new Error(
`plan-ceo-review AskUserQuestion-blocked regression: outcome=plan_ready but no plan file path detected, no BLOCKED string, no prose AUQ options. Cannot verify the model used any legitimate path.\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
} else if (!planFileHasDecisionsSection(obs.planFile) && !surfaceVisible) {
throw new Error(
`plan-ceo-review AskUserQuestion-blocked regression: model wrote ${obs.planFile} without a "## Decisions" section AND no BLOCKED string AND no prose AUQ options in TTY. Step 0 was silently skipped.\n` +
`--- evidence (last 2KB visible) ---\n${obs.evidence}`,
);
}
}
expect(['asked', 'plan_ready', 'exited']).toContain(obs.outcome);
// NOTE: assertReportAtBottomIfPlanWritten is intentionally NOT called
// here. This test runs --disallowedTools AskUserQuestion and only
// checks "did the question surface" — the model can't run the full
// multi-section review without AUQ tools, so no review report exists
// to enforce the at-bottom contract against. The contract is
// exercised by the periodic finding-count tests, which DO run the
// full review.
}, 360_000);
});