The prior commit's initial Codex eval landed with two
test-infrastructure bugs:
1. sandbox: 'read-only' (the default) blocked Codex from writing
   $OUT_FILE, so the test reported "STATUS: BLOCKED" and exited 0
   without producing a capture file. Fixed: sandbox: 'workspace-write'
   for all 4 cases, allowing writes inside the tempdir.
2. recordCodexResult called a non-existent evalCollector.record()
   API (I invented it). The collector's real surface is addTest(),
   with a different field schema. Aligned with the
   test/codex-e2e.test.ts pattern; see the sketch after this list.
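Roughly, the corrected shape. The helper, option plumbing, and the
addTest() field names below are illustrative stand-ins, not the real
schema (that lives with the test/codex-e2e.test.ts pattern):

    // Sketch only; types and field names are placeholders.
    type CodexOptions = { sandbox: 'read-only' | 'workspace-write'; cwd: string };
    type CodexResult = { passed: boolean; capturedQuestion: string };
    type EvalCollector = {
      addTest(entry: { name: string; passed: boolean; output: string }): void;
    };

    // Hypothetical wrapper around the codex CLI used by the eval cases.
    declare function runCodexCase(prompt: string, opts: CodexOptions): Promise<CodexResult>;

    async function runAndRecord(
      collector: EvalCollector,
      caseName: string,
      prompt: string,
      tempDir: string,
    ): Promise<void> {
      // Fix 1: 'workspace-write' (was 'read-only', the default) so Codex can
      // write $OUT_FILE inside the tempdir instead of exiting 0 as BLOCKED.
      const result = await runCodexCase(prompt, { sandbox: 'workspace-write', cwd: tempDir });

      // Fix 2: record through the collector's real surface, addTest();
      // the previously used record() never existed.
      collector.addTest({ name: caseName, passed: result.passed, output: result.capturedQuestion });
    }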
With both fixed, the eval now actually measures Codex AskUserQuestion
format compliance. All 4 cases pass on v1.6.2.0 with the gpt.md
carve-out: RECOMMENDATION always present, Completeness: N/10 only on
coverage-differentiated cases, the "options differ in kind" note on
kind-differentiated cases, and the ELI10 explanation present.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four-case periodic-tier eval that mirrors
test/skill-e2e-plan-format.test.ts but drives the plan-review skills
via codex exec instead of claude -p (drive step sketched below).
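Roughly, each case shells out to the CLI non-interactively and reads
the captured AskUserQuestion back from the tempdir. Helper name, file
name, and the OUT_FILE plumbing below are illustrative, not the
harness's actual wiring:

    import { execFile } from 'node:child_process';
    import { promisify } from 'node:util';
    import { readFile } from 'node:fs/promises';
    import { join } from 'node:path';

    const run = promisify(execFile);

    // Drive one case with `codex exec` (instead of `claude -p`) and read
    // back the capture file the skill prompt asks Codex to write.
    async function driveCase(prompt: string, tempDir: string): Promise<string> {
      const outFile = join(tempDir, 'ask-user-question.txt'); // stands in for $OUT_FILE
      await run('codex', ['exec', prompt], {
        cwd: tempDir,
        env: { ...process.env, OUT_FILE: outFile },
      });
      return readFile(outFile, 'utf8'); // captured AskUserQuestion text
    }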
Context: Codex under the gpt.md "No preamble / Prefer doing over listing"
overlay tends to skip the Simplify/ELI10 paragraph and the RECOMMENDATION
line on AskUserQuestion calls. Users have to manually re-prompt "ELI10
and don't forget to recommend" almost every time. This test pins the
behavior so regressions surface.
Cases (also sketched as data after this list):
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)
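As a rough data shape; field names, and the coverage label on the
third case, are inferred from the case names rather than taken from
the eval's actual structure:

    type Differentiation = 'kind' | 'coverage';

    interface EvalCase {
      skill: 'plan-ceo-review' | 'plan-eng-review';
      scenario: string;
      differentiation: Differentiation;
    }

    // Illustrative case table; the real prompts live in the eval itself.
    const cases: EvalCase[] = [
      { skill: 'plan-ceo-review', scenario: 'mode selection', differentiation: 'kind' },
      { skill: 'plan-ceo-review', scenario: 'approach menu', differentiation: 'coverage' },
      { skill: 'plan-eng-review', scenario: 'per-issue coverage decision', differentiation: 'coverage' },
      { skill: 'plan-eng-review', scenario: 'per-issue architectural choice', differentiation: 'kind' },
    ];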
Assertions on the captured AskUserQuestion text (sketched after this list):
- RECOMMENDATION: Choose present (all cases)
- Completeness: N/10 present on coverage, absent on kind
- "options differ in kind" note present on kind
- ELI10 length floor (>400 chars) — catches bare options-only output
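A minimal sketch of those checks; the matchers are illustrative, and
the real test may scope the 400-char floor to the ELI10 paragraph
rather than the whole capture:

    import assert from 'node:assert';

    // Checks against one captured AskUserQuestion payload.
    function assertAskUserQuestionFormat(
      captured: string,
      differentiation: 'kind' | 'coverage',
    ): void {
      // RECOMMENDATION line required on every case.
      assert.match(captured, /RECOMMENDATION: Choose/);

      // Completeness score present on coverage cases, absent on kind cases.
      const hasScore = /Completeness: \d+\/10/.test(captured);
      assert.strictEqual(hasScore, differentiation === 'coverage');

      // Kind-differentiated cases must carry the explanatory note.
      if (differentiation === 'kind') {
        assert.ok(captured.includes('options differ in kind'));
      }

      // Length floor: bare options-only output fails this.
      assert.ok(captured.length > 400, 'ELI10 explanation missing or too short');
    }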
Cost: ~$2-4 per full run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>