Files
gstack/test/skill-e2e-plan.test.ts
T
Garry Tan cab774cced v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section

Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).

- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
  outside-voice rules, how-to-ask, Required Outputs, registries, Completion
  Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
  docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
  {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
  Section self-check. All of Step 0 (the scope/mode conversation) stays in
  the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.

Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.

Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).

Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
  (measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
  vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
  Step 0, review body absent from skeleton, report writer in the section,
  nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
  installed skill first (Codex P1), drives full Step 0, asserts the section
  is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)

MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks

Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier->=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded

Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B

Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).

The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): consistency — same trigger N runs, stable format + substance

Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): behavioral matrix across AUQ-heavy skills

Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(codex): live recommendation-substance grade for /codex

Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): deterministic trigger for format-compliance gate

A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(office-hours): carve Phase 5+6 into on-demand section

Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.

Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.

Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
  (measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  office-hours so relocated Phase 5/6 prose still counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(preamble): carve CJK-escaping manual to on-demand doc

The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.

Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-eng-review): carve review body into on-demand section

Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-design-review): carve review body into on-demand section

Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-devex-review): carve review body into on-demand section

Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.

Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines + gbrain-detection union after carves

Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
  trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
  GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
  was carved, so the detection assertions now check both files.

Full `bun test` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: fix carve-broken CI evals (union reads + section fixtures)

Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:

- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
  like "## Review Sections" / "## CRITICAL RULE" that now live in
  sections/review-sections.md. The markers vanished from the skeleton, so the
  judge scored empty/wrong content. Fix: read the skeleton+sections union.
  Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
  (25/25).

- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
  fixture, not sections/. The carved skill's STOP pointed at a section file that
  was absent, so the model improvised a compressed report table instead of the
  canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
  sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
  canonical table emitted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)

Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.

- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
  planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
  loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
  the section, so check the regenerated skeleton+section UNION for the gbrain put
  block, ship both into the workdir, and restore both (the section regen was also
  leaking into the working tree — finally now restores it).

ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: migrate section-loading E2E to lossless SDK tool-stream detection

The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.

They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.

Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: consolidate branch to v1.56.0.0 (single MINOR above main)

The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.

VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:14:43 -07:00

846 lines
34 KiB
TypeScript

import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
ROOT, browseBin, runId, evalsEnabled,
describeIfSelected, testConcurrentIfSelected,
copyDirSync, setupBrowseShims, logCost, recordE2E,
createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { judgePosture } from './helpers/llm-judge';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
const evalCollector = createEvalCollector('e2e-plan');
// --- Plan CEO Review E2E ---
describeIfSelected('Plan CEO Review E2E', ['plan-ceo-review'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
// Init git repo (CEO review SKILL.md has a "System Audit" step that runs git)
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create a simple plan document for the agent to review
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add User Dashboard
## Context
We're building a new user dashboard that shows recent activity, notifications, and quick actions.
## Changes
1. New React component \`UserDashboard\` in \`src/components/\`
2. REST API endpoint \`GET /api/dashboard\` returning user stats
3. PostgreSQL query for activity aggregation
4. Redis cache layer for dashboard data (5min TTL)
## Architecture
- Frontend: React + TailwindCSS
- Backend: Express.js REST API
- Database: PostgreSQL with existing user/activity tables
- Cache: Redis for dashboard aggregates
## Open questions
- Should we use WebSocket for real-time updates?
- How do we handle users with 100k+ activity records?
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-ceo-review skill
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
);
// Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
{ const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Choose HOLD SCOPE mode. Skip any AskUserQuestion calls — this is non-interactive.
Write your complete review directly to ${planDir}/review-output.md
Focus on reviewing the plan content: architecture, error handling, security, and performance.`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-ceo-review',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-ceo-review', result);
recordE2E(evalCollector, '/plan-ceo-review', 'Plan CEO Review E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Accept error_max_turns — the CEO review is very thorough and may exceed turns
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify the review was written
const reviewPath = path.join(planDir, 'review-output.md');
if (fs.existsSync(reviewPath)) {
const review = fs.readFileSync(reviewPath, 'utf-8');
expect(review.length).toBeGreaterThan(200);
}
}, 420_000);
});
// --- Plan CEO Review (SELECTIVE EXPANSION) E2E ---
describeIfSelected('Plan CEO Review SELECTIVE EXPANSION E2E', ['plan-ceo-review-selective'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-sel-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add User Dashboard
## Context
We're building a new user dashboard that shows recent activity, notifications, and quick actions.
## Changes
1. New React component \`UserDashboard\` in \`src/components/\`
2. REST API endpoint \`GET /api/dashboard\` returning user stats
3. PostgreSQL query for activity aggregation
4. Redis cache layer for dashboard data (5min TTL)
## Architecture
- Frontend: React + TailwindCSS
- Backend: Express.js REST API
- Database: PostgreSQL with existing user/activity tables
- Cache: Redis for dashboard aggregates
## Open questions
- Should we use WebSocket for real-time updates?
- How do we handle users with 100k+ activity records?
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
);
// Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
{ const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review-selective', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Choose SELECTIVE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive.
For the cherry-pick ceremony, accept all expansion proposals automatically.
Write your complete review directly to ${planDir}/review-output-selective.md
Focus on reviewing the plan content: architecture, error handling, security, and performance.`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-ceo-review-selective',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-ceo-review (SELECTIVE)', result);
recordE2E(evalCollector, '/plan-ceo-review-selective', 'Plan CEO Review SELECTIVE EXPANSION E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
const reviewPath = path.join(planDir, 'review-output-selective.md');
if (fs.existsSync(reviewPath)) {
const review = fs.readFileSync(reviewPath, 'utf-8');
expect(review.length).toBeGreaterThan(200);
}
}, 420_000);
});
// --- Plan CEO Review SCOPE EXPANSION energy (V1.1 mode-posture regression gate) ---
describeIfSelected('Plan CEO Review Expansion Energy E2E', ['plan-ceo-review-expansion-energy'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-exp-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Use the shared fixture so expansion-energy regressions are reproducible.
const fixture = fs.readFileSync(
path.join(ROOT, 'test', 'fixtures', 'mode-posture', 'expansion-plan.md'),
'utf-8',
);
fs.writeFileSync(path.join(planDir, 'plan.md'), fixture);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
);
// Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
{ const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review-expansion-energy', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
Choose SCOPE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive. Auto-approve the ideal-architecture approach in 0C-bis. For 0D, run all three analyses (10x check, platonic ideal, delight opportunities), then emit exactly 2 concrete expansion proposals in the opt-in ceremony.
Write your expansion proposals to ${planDir}/proposals.md with ONLY the proposal text — no conversational wrapper, no review summary, no mode analysis. Each proposal separated by "---".`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-ceo-review-expansion-energy',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-ceo-review (EXPANSION ENERGY)', result);
recordE2E(evalCollector, '/plan-ceo-review-expansion-energy', 'Plan CEO Review Expansion Energy E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Transient API failure escape hatch — see /plan-review-report for the
// full rationale. Same shape: error_api with 0 turns means the API call
// never reached the model, so nothing the test verifies could have run.
if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) {
console.warn('[transient] /plan-ceo-review-expansion-energy: error_api with 0 turns — treating as inconclusive');
return;
}
expect(['success', 'error_max_turns']).toContain(result.exitReason);
const proposalsPath = path.join(planDir, 'proposals.md');
if (!fs.existsSync(proposalsPath)) {
throw new Error('Agent did not emit proposals.md — expansion energy eval requires proposal output');
}
const proposalText = fs.readFileSync(proposalsPath, 'utf-8');
expect(proposalText.length).toBeGreaterThan(200);
const scores = await judgePosture('expansion', proposalText);
console.log('Expansion energy scores:', JSON.stringify(scores, null, 2));
// Pass threshold: 4/5 on both axes (good — matches posture with minor weakness).
expect(scores.axis_a).toBeGreaterThanOrEqual(4); // surface_framing
expect(scores.axis_b).toBeGreaterThanOrEqual(4); // decision_preservation
}, 600_000);
});
// --- Plan Eng Review E2E ---
describeIfSelected('Plan Eng Review E2E', ['plan-eng-review'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-eng-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create a plan with more engineering detail
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Migrate Auth to JWT
## Context
Replace session-cookie auth with JWT tokens. Currently using express-session + Redis store.
## Changes
1. Add \`jsonwebtoken\` package
2. New middleware \`auth/jwt-verify.ts\` replacing \`auth/session-check.ts\`
3. Login endpoint returns { accessToken, refreshToken }
4. Refresh endpoint rotates tokens
5. Migration script to invalidate existing sessions
## Files Modified
| File | Change |
|------|--------|
| auth/jwt-verify.ts | NEW: JWT verification middleware |
| auth/session-check.ts | DELETED |
| routes/login.ts | Return JWT instead of setting cookie |
| routes/refresh.ts | NEW: Token refresh endpoint |
| middleware/index.ts | Swap session-check for jwt-verify |
## Error handling
- Expired token: 401 with \`token_expired\` code
- Invalid token: 401 with \`invalid_token\` code
- Refresh with revoked token: 403
## Not in scope
- OAuth/OIDC integration
- Rate limiting on refresh endpoint
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-eng-review skill
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
);
// Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
{ const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-eng-review', async () => {
const result = await runSkillTest({
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
Write your complete review directly to ${planDir}/review-output.md
Focus on architecture, code quality, tests, and performance sections.`,
workingDirectory: planDir,
maxTurns: 15,
timeout: 360_000,
testName: 'plan-eng-review',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-eng-review', result);
recordE2E(evalCollector, '/plan-eng-review', 'Plan Eng Review E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify the review was written
const reviewPath = path.join(planDir, 'review-output.md');
if (fs.existsSync(reviewPath)) {
const review = fs.readFileSync(reviewPath, 'utf-8');
expect(review.length).toBeGreaterThan(200);
}
}, 420_000);
});
// --- Plan-Eng-Review Test-Plan Artifact E2E ---
describeIfSelected('Plan-Eng-Review Test-Plan Artifact E2E', ['plan-eng-review-artifact'], () => {
let planDir: string;
let projectDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-artifact-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create base commit on main
fs.writeFileSync(path.join(planDir, 'app.ts'), 'export function greet() { return "hello"; }\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
// Create feature branch with changes
run('git', ['checkout', '-b', 'feature/add-dashboard']);
fs.writeFileSync(path.join(planDir, 'dashboard.ts'), `export function Dashboard() {
const data = fetchStats();
return { users: data.users, revenue: data.revenue };
}
function fetchStats() {
return fetch('/api/stats').then(r => r.json());
}
`);
fs.writeFileSync(path.join(planDir, 'app.ts'), `import { Dashboard } from "./dashboard";
export function greet() { return "hello"; }
export function main() { return Dashboard(); }
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'feat: add dashboard']);
// Plan document
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add Dashboard
## Changes
1. New \`dashboard.ts\` with Dashboard component and fetchStats API call
2. Updated \`app.ts\` to import and use Dashboard
## Architecture
- Dashboard fetches from \`/api/stats\` endpoint
- Returns user count and revenue metrics
`);
run('git', ['add', 'plan.md']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-eng-review skill
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
);
// Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
{ const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
// Set up remote-slug shim and browse shims (plan-eng-review uses remote-slug for artifact path)
setupBrowseShims(planDir);
// Create project directory for artifacts
projectDir = path.join(os.homedir(), '.gstack', 'projects', 'test-project');
fs.mkdirSync(projectDir, { recursive: true });
// Clean up stale test-plan files from previous runs
try {
const staleFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
for (const f of staleFiles) {
fs.unlinkSync(path.join(projectDir, f));
}
} catch {}
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
// Clean up test-plan artifacts (but not the project dir itself)
try {
const files = fs.readdirSync(projectDir);
for (const f of files) {
if (f.includes('test-plan')) {
fs.unlinkSync(path.join(projectDir, f));
}
}
} catch {}
});
testConcurrentIfSelected('plan-eng-review-artifact', async () => {
// Count existing test-plan files before
const beforeFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
const result = await runSkillTest({
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Skip the preamble bash block, lake intro, telemetry, and contributor mode sections — go straight to the review.
Read plan.md — that's the plan to review. This is a standalone plan with source code in app.ts and dashboard.ts.
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
IMPORTANT: After your review, you MUST write the test-plan artifact as described in the "Test Plan Artifact" section of SKILL.md. The remote-slug shim is at ${planDir}/browse/bin/remote-slug.
Write your review to ${planDir}/review-output.md`,
workingDirectory: planDir,
maxTurns: 25,
allowedTools: ['Bash', 'Read', 'Write', 'Glob', 'Grep'],
timeout: 360_000,
testName: 'plan-eng-review-artifact',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-eng-review artifact', result);
recordE2E(evalCollector, '/plan-eng-review test-plan artifact', 'Plan-Eng-Review Test-Plan Artifact E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify test-plan artifact was written
const afterFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
const newFiles = afterFiles.filter(f => !beforeFiles.includes(f));
console.log(`Test-plan artifacts: ${beforeFiles.length} before, ${afterFiles.length} after, ${newFiles.length} new`);
if (newFiles.length > 0) {
const content = fs.readFileSync(path.join(projectDir, newFiles[0]), 'utf-8');
console.log(`Test-plan artifact (${newFiles[0]}): ${content.length} chars`);
expect(content.length).toBeGreaterThan(50);
} else {
console.warn('No test-plan artifact found — agent may not have followed artifact instructions');
}
// Soft assertion: we expect an artifact but agent compliance is not guaranteed.
// Log rather than fail — the test-plan artifact is a bonus output, not the core test.
if (newFiles.length === 0) {
console.warn('SOFT FAIL: No test-plan artifact written — agent did not follow artifact instructions');
}
}, 420_000);
});
// --- Office Hours Spec Review E2E ---
describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => {
let ohDir: string;
beforeAll(() => {
ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
// Copy office-hours skill
fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'office-hours', 'SKILL.md'),
path.join(ohDir, 'office-hours', 'SKILL.md'),
);
{ const _sec = path.join(ROOT, 'office-hours', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(ohDir, 'office-hours', 'sections'), { recursive: true }); }
});
afterAll(() => {
try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('office-hours-spec-review', async () => {
const result = await runSkillTest({
prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop.
Summarize what the "Spec Review Loop" section does — specifically:
1. How many dimensions does the reviewer check?
2. What tool is used to dispatch the reviewer?
3. What's the maximum number of iterations?
4. What metrics are tracked?
Write your summary to ${ohDir}/spec-review-summary.md`,
workingDirectory: ohDir,
maxTurns: 8,
timeout: 120_000,
testName: 'office-hours-spec-review',
runId,
});
logCost('/office-hours spec review', result);
recordE2E(evalCollector, '/office-hours-spec-review', 'Office Hours Spec Review E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(ohDir, 'spec-review-summary.md');
if (fs.existsSync(summaryPath)) {
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/);
expect(summary).toMatch(/agent|subagent/);
expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/);
}
}, 180_000);
});
// --- Plan CEO Review Benefits-From E2E ---
describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => {
let benefitsDir: string;
beforeAll(() => {
benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
);
{ const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(benefitsDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
});
afterAll(() => {
try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {}
});
testConcurrentIfSelected('plan-ceo-review-benefits', async () => {
const result = await runSkillTest({
prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found".
Summarize what happens when no design doc is found — specifically:
1. Is /office-hours offered as a prerequisite?
2. What options does the user get?
3. Is there a mid-session detection for when the user seems lost?
Write your summary to ${benefitsDir}/benefits-summary.md`,
workingDirectory: benefitsDir,
maxTurns: 8,
timeout: 120_000,
testName: 'plan-ceo-review-benefits',
runId,
});
logCost('/plan-ceo-review benefits-from', result);
recordE2E(evalCollector, '/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(benefitsDir, 'benefits-summary.md');
if (fs.existsSync(summaryPath)) {
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
expect(summary).toMatch(/office.hours/);
expect(summary).toMatch(/design doc|no design/i);
}
}, 180_000);
});
// --- Plan Review Report E2E ---
// Verifies that plan-eng-review writes a "## GSTACK REVIEW REPORT" section
// to the bottom of the plan file (the living review status footer).
describeIfSelected('Plan Review Report E2E', ['plan-review-report'], () => {
let planDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-review-report-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add Notifications System
## Context
We're building a real-time notification system for our SaaS app.
## Changes
1. WebSocket server for push notifications
2. Notification preferences API
3. Email digest fallback for offline users
4. PostgreSQL table for notification storage
## Architecture
- WebSocket: Socket.io on Express
- Queue: Bull + Redis for email digests
- Storage: PostgreSQL notifications table
- Frontend: React toast component
## Open questions
- Retry policy for failed WebSocket delivery?
- Max notifications stored per user?
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-eng-review skill
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
);
// Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
{ const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
});
test('/plan-eng-review writes GSTACK REVIEW REPORT to plan file', async () => {
const result = await runSkillTest({
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
Skip the preamble bash block, lake intro, telemetry, and contributor mode sections.
CRITICAL REQUIREMENT: plan.md IS the plan file for this review session. After completing your review, you MUST write a "## GSTACK REVIEW REPORT" section to the END of plan.md, exactly as described in the "Plan File Review Report" section of SKILL.md. If gstack-review-read is not available or returns NO_REVIEWS, write the placeholder table with all four review rows (CEO, Codex, Eng, Design). Use the Edit tool to append to plan.md — do NOT overwrite the existing plan content.
This review report at the bottom of the plan is the MOST IMPORTANT deliverable of this test.`,
workingDirectory: planDir,
maxTurns: 20,
timeout: 360_000,
testName: 'plan-review-report',
runId,
model: 'claude-opus-4-7',
});
logCost('/plan-eng-review report', result);
recordE2E(evalCollector, '/plan-review-report', 'Plan Review Report E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Transient API failure escape hatch: when the SDK returns error_api with
// zero turns / zero tokens, the API call died before the model ever ran —
// no skill code executed, no file was written. Bun retries the test up to
// 3x; if every attempt hits the same API hiccup, surface a warning and
// treat as inconclusive rather than gating the build on Anthropic
// availability. Logic regressions still surface as success/error_max_turns
// with a missing artifact, which the downstream assertions catch.
if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) {
console.warn('[transient] /plan-review-report: error_api with 0 turns — treating as inconclusive (likely Anthropic API hiccup, see CLAUDE.md eval-blame protocol)');
return;
}
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify the review report was written to the plan file
const planContent = fs.readFileSync(path.join(planDir, 'plan.md'), 'utf-8');
// Original plan content should still be present
expect(planContent).toContain('# Plan: Add Notifications System');
expect(planContent).toContain('WebSocket');
// Review report section must exist
expect(planContent).toContain('## GSTACK REVIEW REPORT');
// Report should be at the bottom of the file
const reportIndex = planContent.lastIndexOf('## GSTACK REVIEW REPORT');
const afterReport = planContent.slice(reportIndex);
// Should contain the review table with standard rows
expect(afterReport).toMatch(/\|\s*Review\s*\|/);
expect(afterReport).toContain('CEO Review');
expect(afterReport).toContain('Eng Review');
expect(afterReport).toContain('Design Review');
console.log('Plan review report found at bottom of plan.md');
}, 420_000);
});
// --- Codex Offering E2E ---
// Verifies that Codex is properly offered (with availability check, user prompt,
// and fallback) in office-hours, plan-ceo-review, plan-design-review, plan-eng-review.
describeIfSelected('Codex Offering E2E', [
'codex-offered-office-hours', 'codex-offered-ceo-review',
'codex-offered-design-review', 'codex-offered-eng-review',
], () => {
let testDir: string;
beforeAll(() => {
testDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-codex-offer-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: testDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(testDir, 'README.md'), '# Test Project\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
// Copy all 4 SKILL.md files
for (const skill of ['office-hours', 'plan-ceo-review', 'plan-design-review', 'plan-eng-review']) {
fs.mkdirSync(path.join(testDir, skill), { recursive: true });
fs.copyFileSync(
path.join(ROOT, skill, 'SKILL.md'),
path.join(testDir, skill, 'SKILL.md'),
);
// Carved skills (v2 plan T9): copy sections/ so codex/outside-voice content
// (carved into review-sections.md) is present for the search.
const _sec = path.join(ROOT, skill, 'sections');
if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(testDir, skill, 'sections'), { recursive: true });
}
});
afterAll(() => {
try { fs.rmSync(testDir, { recursive: true, force: true }); } catch {}
});
async function checkCodexOffering(skill: string, testName: string, featureName: string) {
const result = await runSkillTest({
prompt: `Read ${skill}/SKILL.md. Search for ALL sections related to "codex", "outside voice", or "second opinion".
Summarize the Codex/${featureName} integration — answer these specific questions:
1. How is Codex availability checked? (what exact bash command?)
2. How is the user prompted? (via AskUserQuestion? what are the options?)
3. What happens when Codex is NOT available? (fallback to subagent? skip entirely?)
4. Is this step blocking (gates the workflow) or optional (can be skipped)?
5. What prompt/context is sent to Codex?
Write your summary to ${testDir}/${testName}-summary.md`,
workingDirectory: testDir,
maxTurns: 8,
timeout: 120_000,
testName,
runId,
});
logCost(`/${skill} codex offering`, result);
recordE2E(evalCollector, `/${testName}`, 'Codex Offering E2E', result);
expect(result.exitReason).toBe('success');
const summaryPath = path.join(testDir, `${testName}-summary.md`);
expect(fs.existsSync(summaryPath)).toBe(true);
const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
// All skills should have codex availability check (command -v per #1197)
expect(summary).toMatch(/command -v codex/);
// All skills should have fallback behavior
expect(summary).toMatch(/fallback|subagent|unavailable|not available|skip/);
// All skills should show it's optional/non-blocking
expect(summary).toMatch(/optional|non.?blocking|skip|not.*required/);
console.log(`${skill}: Codex offering verified`);
}
testConcurrentIfSelected('codex-offered-office-hours', async () => {
await checkCodexOffering('office-hours', 'codex-offered-office-hours', 'second opinion');
}, 180_000);
testConcurrentIfSelected('codex-offered-ceo-review', async () => {
await checkCodexOffering('plan-ceo-review', 'codex-offered-ceo-review', 'outside voice');
}, 180_000);
testConcurrentIfSelected('codex-offered-design-review', async () => {
await checkCodexOffering('plan-design-review', 'codex-offered-design-review', 'design outside voices');
}, 180_000);
testConcurrentIfSelected('codex-offered-eng-review', async () => {
await checkCodexOffering('plan-eng-review', 'codex-offered-eng-review', 'outside voice');
}, 180_000);
});
// Module-level afterAll — finalize eval collector after all tests complete
afterAll(async () => {
await finalizeEvalCollector(evalCollector);
});