test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-06-02 21:42:32 -07:00
parent b0a6977c3f
commit 0690066c3f
3 changed files with 72 additions and 196 deletions
+1 -1
View File
@@ -94,7 +94,7 @@ describe('selectTests', () => {
expect(result.selected).toContain('plan-review-prosons-hardstop-neg');
expect(result.selected).toContain('plan-review-prosons-neutral-neg');
// v1.13.x real-PTY E2E batch entries that also depend on plan-ceo-review/**
expect(result.selected).toContain('ask-user-question-format-pty');
expect(result.selected).toContain('auq-format-gate');
expect(result.selected).toContain('plan-ceo-mode-routing');
expect(result.selected).toContain('autoplan-chain-pty');
// Per-finding count + review-report-at-bottom (v1.21.x)