Commit Graph

2 Commits

Author SHA1 Message Date
Garry Tan 09c82222ea test: fix Codex eval sandbox + collector API
Two test infrastructure bugs in the initial Codex eval landed in the
prior commit:

1. sandbox: 'read-only' (the default) blocked Codex from writing
   $OUT_FILE. Test reported "STATUS: BLOCKED" and exited 0 without
   a capture file. Fixed: sandbox: 'workspace-write' for all 4 cases,
   allowing writes inside the tempdir.

2. recordCodexResult called a non-existent evalCollector.record()
   API (I invented it). The real surface is addTest() with a
   different field schema. Aligned with test/codex-e2e.test.ts
   pattern.

With both fixed, the eval now actually measures Codex AskUserQuestion
format compliance. All 4 cases pass on v1.6.2.0 with the gpt.md
carve-out: RECOMMENDATION always, Completeness: N/10 only on coverage,
"options differ in kind" note on kind, ELI10 explanation present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 22:02:32 -07:00
Garry Tan b7f6246061 test: add Codex eval for AskUserQuestion format compliance
Four-case periodic-tier eval mirrors test/skill-e2e-plan-format.test.ts
but drives the plan review skills via codex exec instead of claude -p.

Context: Codex under the gpt.md "No preamble / Prefer doing over listing"
overlay tends to skip the Simplify/ELI10 paragraph and the RECOMMENDATION
line on AskUserQuestion calls. Users have to manually re-prompt "ELI10
and don't forget to recommend" almost every time. This test pins the
behavior so regressions surface.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Assertions on captured AskUserQuestion text:
- RECOMMENDATION: Choose present (all cases)
- Completeness: N/10 present on coverage, absent on kind
- "options differ in kind" note present on kind
- ELI10 length floor (>400 chars) — catches bare options-only output

Cost: ~\$2-4 per full run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 21:34:11 -07:00