Files
gstack/test
Garry Tan 1b85b29ff2 test: add office-hours Phase 4 silent-auto-decide regression
Reproduces the production bug: agent in builder mode reaches Phase 4, presents
A/B/C alternatives, writes "Recommendation: C" in chat prose, and starts
editing the design doc immediately — never calls AskUserQuestion. The Phase 4
STOP-gate fix is the production-side change; this test traps regressions.

SDK + captureInstruction pattern (mirrors skill-e2e-plan-format). The PTY
harness can't seed builder mode + accept-premises to reach Phase 4
(runPlanSkillObservation only sends /skill\\r and waits), so we instruct the
agent to dump the verbatim Phase 4 AskUserQuestion to a file and assert on it
directly. The captured file IS the question — no false-pass risk on which
question got asked, since earlier-phase AUQs cannot satisfy the Phase-4-vocab
regex (approach / alternative / architecture / implementation).

Periodic-tier: Phase 4 requires the agent to invent 2-3 distinct architectures,
more open-ended than the 4 plan-format cases. Reclassify to gate if stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:18:28 -07:00
..