test: delete --disallowedTools AskUserQuestion-blocked test variants

These tests simulated a fictional environment that doesn't exist in production. Real Conductor sessions launch claude with `--disallowedTools AskUserQuestion` AND register `mcp__conductor__AskUserQuestion` — the model has the MCP variant. But the tests passed `--disallowedTools` without standing up any MCP server, so they tested "model behavior with NO AUQ available," which no real user state produces. Combined with bare `/plan-ceo-review` invocation (no follow-up content), this forced the model into a 5+ minute deliberation loop trying to prose-render a question with options it had to first invent. The result was persistent flakes that consumed nine paid E2E runs trying to fix "the model takes too long" — but the actual problem was the test configuration, not the model. Removals: - test/skill-e2e-autoplan-auto-mode.test.ts (deleted; the entire file was a single AUQ-blocked test) - test/skill-e2e-plan-ceo-plan-mode.test.ts test 2 (the migrated --disallowedTools test); test 1 (baseline plan-mode smoke) stays - test/skill-e2e-plan-design-plan-mode.test.ts test 2 (same shape); test 1 stays - test/skill-e2e-plan-eng-plan-mode.test.ts test 2 (same shape); test 1 (baseline) and test 3 (STOP-gate with seeded plan, different contract) stay - test/helpers/touchfiles.ts: autoplan-auto-mode entry removed - test/touchfiles.test.ts: assertion count + commentary updated Coverage retained: test 1 of each plan-mode file already verifies the model fires AUQ; the periodic finding-count tests verify per-finding AUQ cadence end-to-end. The harness improvements landed during this debugging cycle (isProseAUQVisible regex, LLM judge, snapshot logging, high-water-mark tracking, ENOENT-tolerant assertReportAtBottomIfPlanWritten) all stay — they're useful for the remaining plan-mode tests that can also encounter prose rendering and slow-thinking phases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-01 14:05:43 +02:00 · 2026-05-08 23:43:37 -07:00
parent 4d577830a8
commit 558e195c46
6 changed files with 6 additions and 342 deletions
@@ -33,43 +33,15 @@
 * See test/helpers/claude-pty-runner.ts for runner internals.
 */

-import { describe, test, expect } from 'bun:test';
+import { describe, test } from 'bun:test';
 import {
  runPlanSkillObservation,
-  planFileHasDecisionsSection,
  assertReportAtBottomIfPlanWritten,
-  isProseAUQVisible,
 } from './helpers/claude-pty-runner';

 const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
 const describeE2E = shouldRun ? describe : describe.skip;

-// Concrete plan to review. Used by the --disallowedTools test to skip
-// the "what should I review?" deliberation that otherwise eats the
-// model's budget. Has CEO-review-shaped issues (premise gap, vague
-// success metric, scope-creep smell) so Step 0 has real material.
-const SEED_PLAN_FOR_CEO_REVIEW = `
-# Plan: Launch a "developer-friendly" pricing tier
-
-## Goal
-Increase developer adoption.
-
-## Success metric
-More signups.
-
-## Premise
-We haven't talked to any developers about whether the current pricing
-is actually a barrier. The team agreed it "feels like" it should be
-cheaper. No data yet on what dev users would pay for or what the unit
-economics would look like at the new price point.
-
-## Plan
- Pick a 30% discount as the developer tier
- Add an email field to /pricing for "verify with @company.com"
- Auto-enroll anyone with @gmail/@hotmail addresses too as a pilot
- Ship next week
-`.trim();
-
 describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
  test('first terminal outcome is asked (Step 0 fires before any plan write)', async () => {
    const obs = await runPlanSkillObservation({
@@ -101,101 +73,4 @@ describeE2E('plan-ceo-review plan-mode smoke (gate)', () => {
    }
    assertReportAtBottomIfPlanWritten(obs);
  }, 360_000);
-
-  // v1.21+ regression: Conductor launches Claude Code with
-  // `--disallowedTools AskUserQuestion --permission-mode default` (verified
-  // via `ps` on the live Conductor claude process). Native AskUserQuestion
-  // is removed from the model's tool registry; without fallback guidance
-  // the model can't ask and silently proceeds.
-  //
-  // After v1.28+ (forever-war fix), the preamble fallback that wrote a
-  // "## Decisions to confirm" section was deleted in favor of a hard
-  // BLOCKED rule. The pass envelope under --disallowedTools accepts:
-  //   - 'asked'      — model emits a numbered-option prompt as prose
-  //   - 'plan_ready' WITH (## Decisions section [legacy]
-  //                  OR BLOCKED string visible [post-fix])
-  //   - 'exited'     WITH BLOCKED string visible [post-fix]
-  //
-  // The legacy `## Decisions` path stays in the envelope so this test
-  // keeps passing during the migration window when the fallback delete
-  // and resolver edits land in the same PR but mid-rebase states are
-  // possible. Once the deletion has been on main long enough that the
-  // generated SKILL.md cache has flushed, the legacy branch can be
-  // removed in a follow-up.
-  //
-  // Failure signals (regression we DO want to catch):
-  //   - 'auto_decided' — AUTO_DECIDE preamble fired without /plan-tune opt-in
-  //   - 'silent_write' — Write/Edit before any AUQ surface
-  //   - 'timeout'      — neither asked nor terminated in budget
-  //   - 'plan_ready' or 'exited' WITHOUT either Decisions section or BLOCKED
-  test('AskUserQuestion surfaces when --disallowedTools AskUserQuestion is set', async () => {
-    // Pre-prime with concrete plan content so the model doesn't burn its
-    // budget deliberating about WHICH artifact to review. Without this seed,
-    // a bare /plan-ceo-review under --disallowedTools puts the model in a
-    // 5-minute thinking loop trying to enumerate scope options before
-    // surfacing them as prose. With the seed, the model has a real plan to
-    // critique and can move directly to Step 0 / Section 1 findings.
-    //
-    // The test still exercises the regression we care about: under
-    // --disallowedTools, does the skill SURFACE its first decision question
-    // (via prose, BLOCKED, or some visible surface) rather than silently
-    // ExitPlanMode-ing?
-    const obs = await runPlanSkillObservation({
-      skillName: 'plan-ceo-review',
-      inPlanMode: true,
-      extraArgs: ['--disallowedTools', 'AskUserQuestion'],
-      initialPlanContent: SEED_PLAN_FOR_CEO_REVIEW,
-      timeoutMs: 300_000,
-    });
-
-    // The user must SEE the question one way or another. Three valid surfaces:
-    //   1. `## Decisions to confirm` section in the plan file (legacy fallback)
-    //   2. `BLOCKED — AskUserQuestion` string visible in TTY (post-v1.28 BLOCKED rule)
-    //   3. Numbered/lettered options visible in TTY as prose (post-v1.28 prose-AUQ rendering)
-    const blockedVisible = /BLOCKED\s*[—-]\s*AskUserQuestion/i.test(obs.evidence);
-    const proseAUQVisible = isProseAUQVisible(obs.evidence) || obs.proseAUQEverObserved === true;
-    const surfaceVisible = blockedVisible || proseAUQVisible || obs.waitingEverObserved === true;
-
-    if (
-      obs.outcome === 'auto_decided' ||
-      obs.outcome === 'silent_write' ||
-      obs.outcome === 'timeout'
-    ) {
-      throw new Error(
-        `plan-ceo-review AskUserQuestion-blocked regression: outcome=${obs.outcome}\n` +
-          `summary: ${obs.summary}\n` +
-          `elapsed: ${obs.elapsedMs}ms\n` +
-          `--- evidence (last 2KB visible) ---\n${obs.evidence}`,
-      );
-    }
-    if (obs.outcome === 'exited' && !surfaceVisible) {
-      throw new Error(
-        `plan-ceo-review AskUserQuestion-blocked regression: outcome=exited without any visible question surface (no BLOCKED string, no prose-rendered AUQ options). Model quit silently.\n` +
-          `--- evidence (last 2KB visible) ---\n${obs.evidence}`,
-      );
-    }
-    if (obs.outcome === 'plan_ready') {
-      if (!obs.planFile) {
-        if (!surfaceVisible) {
-          throw new Error(
-            `plan-ceo-review AskUserQuestion-blocked regression: outcome=plan_ready but no plan file path detected, no BLOCKED string, no prose AUQ options. Cannot verify the model used any legitimate path.\n` +
-              `--- evidence (last 2KB visible) ---\n${obs.evidence}`,
-          );
-        }
-      } else if (!planFileHasDecisionsSection(obs.planFile) && !surfaceVisible) {
-        throw new Error(
-          `plan-ceo-review AskUserQuestion-blocked regression: model wrote ${obs.planFile} without a "## Decisions" section AND no BLOCKED string AND no prose AUQ options in TTY. Step 0 was silently skipped.\n` +
-            `--- evidence (last 2KB visible) ---\n${obs.evidence}`,
-        );
-      }
-    }
-    expect(['asked', 'plan_ready', 'exited']).toContain(obs.outcome);
-    // NOTE: assertReportAtBottomIfPlanWritten is intentionally NOT called
-    // here. This test runs --disallowedTools AskUserQuestion and only
-    // checks "did the question surface" — the model can't run the full
-    // multi-section review without AUQ tools, so no review report exists
-    // to enforce the at-bottom contract against. The contract is
-    // exercised by the periodic finding-count tests, which DO run the
-    // full review.
-  }, 360_000);
 });