mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-17 15:20:11 +02:00
558e195c46
These tests simulated a fictional environment that doesn't exist in production. Real Conductor sessions launch claude with `--disallowedTools AskUserQuestion` AND register `mcp__conductor__AskUserQuestion` — the model has the MCP variant. But the tests passed `--disallowedTools` without standing up any MCP server, so they tested "model behavior with NO AUQ available," which no real user state produces. Combined with bare `/plan-ceo-review` invocation (no follow-up content), this forced the model into a 5+ minute deliberation loop trying to prose-render a question with options it had to first invent. The result was persistent flakes that consumed nine paid E2E runs trying to fix "the model takes too long" — but the actual problem was the test configuration, not the model. Removals: - test/skill-e2e-autoplan-auto-mode.test.ts (deleted; the entire file was a single AUQ-blocked test) - test/skill-e2e-plan-ceo-plan-mode.test.ts test 2 (the migrated --disallowedTools test); test 1 (baseline plan-mode smoke) stays - test/skill-e2e-plan-design-plan-mode.test.ts test 2 (same shape); test 1 stays - test/skill-e2e-plan-eng-plan-mode.test.ts test 2 (same shape); test 1 (baseline) and test 3 (STOP-gate with seeded plan, different contract) stay - test/helpers/touchfiles.ts: autoplan-auto-mode entry removed - test/touchfiles.test.ts: assertion count + commentary updated Coverage retained: test 1 of each plan-mode file already verifies the model fires AUQ; the periodic finding-count tests verify per-finding AUQ cadence end-to-end. The harness improvements landed during this debugging cycle (isProseAUQVisible regex, LLM judge, snapshot logging, high-water-mark tracking, ENOENT-tolerant assertReportAtBottomIfPlanWritten) all stay — they're useful for the remaining plan-mode tests that can also encounter prose rendering and slow-thinking phases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>