test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified directly that this cannot work: plan-mode AUQs render as a cursor picker whose cursor-positioning escape sequences stripAnsi cannot flatten. The picker renders fine for a human (cursorSeen=45), but the flattened text drops ELI10:/(recommended) and parseNumberedOptions returns 0. The test was grading a lossy projection and failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent writes the verbatim question it would have shown: clean text, no rendering loss) and to grade 7/7 format + kind-note + recommendation substance >=4. Same property, reliable, environment-independent; shares the engine with the periodic A/B and matrix evals.

Result: 7/7 format, substance 5.

Touchfiles key renamed ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
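The failure mode described in the commit message can be illustrated with a minimal sketch. It is not the repository's code: `stripCsi` and `countNumberedOptions` are hypothetical stand-ins for stripAnsi and parseNumberedOptions, and the frame is an invented two-option picker. The sketch only shows the general reason a stripped PTY stream is a lossy projection: cursor-positioning sequences carry the layout, so removing them leaves text in emission order with no line structure for a line-oriented parser to grade.

```typescript
// Hypothetical stand-in for stripAnsi: removes CSI escape sequences only.
const CSI = /\x1b\[[0-9;?]*[ -\/]*[@-~]/g;

function stripCsi(s: string): string {
  return s.replace(CSI, '');
}

// Hypothetical stand-in for parseNumberedOptions: counts lines like "1. Foo".
function countNumberedOptions(text: string): number {
  return text.split('\n').filter((line) => /^\s*\d+\.\s+\S/.test(line)).length;
}

// An invented picker frame: each row is placed with a cursor-position
// sequence (ESC [ row ; col H) instead of a newline, then the cursor
// glyph is drawn at the start of row 1.
const frame =
  '\x1b[1;3H' + '1. Ship (recommended)' +
  '\x1b[2;3H' + '2. Wait' +
  '\x1b[1;1H' + '>';

// What a capture file would hold: the question as plain text.
const captured = '1. Ship (recommended)\n2. Wait';

const flattened = stripCsi(frame);

console.log(JSON.stringify(flattened));
console.log(countNumberedOptions(flattened), countNumberedOptions(captured));
```

On screen the frame shows two rows, but the stripped stream is a single run of text, so the two options collapse into one line. Grading the plain-text capture avoids the terminal layer entirely, which is the design choice the commit makes.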
2026-06-29 21:15:37 +02:00 · 2026-06-02 21:42:32 -07:00
parent b0a6977c3f
commit 0690066c3f
3 changed files with 72 additions and 196 deletions
@@ -94,7 +94,7 @@ describe('selectTests', () => {
    expect(result.selected).toContain('plan-review-prosons-hardstop-neg');
    expect(result.selected).toContain('plan-review-prosons-neutral-neg');
    // v1.13.x real-PTY E2E batch entries that also depend on plan-ceo-review/**
-    expect(result.selected).toContain('ask-user-question-format-pty');
+    expect(result.selected).toContain('auq-format-gate');
    expect(result.selected).toContain('plan-ceo-mode-routing');
    expect(result.selected).toContain('autoplan-chain-pty');
    // Per-finding count + review-report-at-bottom (v1.21.x)