feat: /codex skill — multi-AI second opinion + proactive suggestions (#197)

* feat: /codex skill — multi-AI second opinion (review, challenge, consult)

Three modes: code review with pass/fail gate, adversarial challenge mode,
and conversational consult with session continuity. First multi-AI skill
in gstack, wrapping OpenAI's Codex CLI.

* feat: integrate /codex into /review, /ship, /plan-eng-review + dashboard

/review offers Codex second opinion after completing its own review.
/ship offers Codex review as optional gate before pushing.
/plan-eng-review offers Codex plan critique after scope challenge.
Review Readiness Dashboard shows Codex Review as optional row.

* chore: bump version and changelog (v0.8.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: codex skill validation (12 stub tests) + E2E eval test

Stub tests (free tier): verify template content — three modes, gate verdict,
session continuity, cost tracking, cross-model comparison, binary discovery,
error handling, mktemp usage, and integrations into /review, /ship, /plan-eng-review.

E2E test (paid tier): runs /codex review on vulnerable fixture repo via
session-runner, verifies output contains findings and GATE verdict.

* fix: codex auth error message — use codex login, not OPENAI_API_KEY

Codex authenticates via ChatGPT OAuth (codex login), not an env var.

* feat: codex uses high reasoning effort by default

gpt-5.2-codex is the only model available with ChatGPT login.
All commands now use model_reasoning_effort="high" for maximum
depth — the whole point is a thorough second opinion.

* feat: crank codex reasoning to xhigh (maximum)

* feat: per-mode reasoning (high for review/consult, xhigh for challenge) + web search

Review and consult use high reasoning — thorough but not slow.
Challenge (adversarial) uses xhigh — maximum depth for breaking code.
All modes enable web_search_cached so Codex can look up docs/APIs.
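
A minimal sketch of the per-mode mapping described above, assuming the effort level is passed as a `model_reasoning_effort="…"` config string as quoted earlier in this message. The names `CodexMode`, `REASONING_EFFORT`, and `configOverrides` are illustrative and do not come from the gstack source; the web search setting is not modeled here.

```typescript
// Hypothetical names for illustration; not taken from the skill template.
type CodexMode = "review" | "challenge" | "consult";
type ReasoningEffort = "high" | "xhigh";

// Per the commit message: review and consult use high, challenge uses xhigh.
const REASONING_EFFORT: Record<CodexMode, ReasoningEffort> = {
  review: "high",
  consult: "high",
  challenge: "xhigh",
};

// Builds the key=value config strings for a given mode.
function configOverrides(mode: CodexMode): string[] {
  return [`model_reasoning_effort="${REASONING_EFFORT[mode]}"`];
}
```

Keeping the mapping in one table means a later change to a single mode's depth touches one line rather than every command.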

* refactor: don't hardcode model — use codex default (always latest)

* feat: JSONL output for codex challenge + consult modes

Use --json flag to parse codex's JSONL events, extracting reasoning
traces ([codex thinking]), tool calls ([codex ran]), and token counts.
This gives richer output than the -o flag alone — you can see what
codex thought through before its answer.
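
The parsing step could look roughly like the sketch below. The event shape (`kind`, `text`, `tokens`) is a placeholder assumption: this commit does not show Codex's actual JSONL schema, and the real field names may differ. Only the `[codex thinking]` and `[codex ran]` labels come from the message above.

```typescript
// ASSUMPTION: SketchEvent is a stand-in shape, not Codex's real event schema.
interface SketchEvent {
  kind?: string;   // hypothetical discriminator: "reasoning" | "command" | "usage"
  text?: string;   // hypothetical payload for reasoning traces and commands
  tokens?: number; // hypothetical token count on usage events
}

interface Summary {
  lines: string[];
  tokens: number;
}

// Reads JSONL text line by line, skipping blank or malformed lines.
function summarizeJsonl(raw: string): Summary {
  const summary: Summary = { lines: [], tokens: 0 };
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // not JSON: ignore rather than abort the whole run
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const event = parsed as SketchEvent;
    if (event.kind === "reasoning" && event.text) {
      summary.lines.push(`[codex thinking] ${event.text}`);
    } else if (event.kind === "command" && event.text) {
      summary.lines.push(`[codex ran] ${event.text}`);
    } else if (event.kind === "usage" && typeof event.tokens === "number") {
      summary.tokens += event.tokens;
    }
  }
  return summary;
}
```

Tolerating malformed lines is a deliberate choice in this sketch: a single stray log line should not discard the reasoning trace already collected.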

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: only persist codex-review log when code review actually ran

Don't write a codex-review entry to reviews.jsonl when only the
adversarial challenge (option B) was selected — there's no gate
verdict to record, and a false entry misleads the Review Readiness
Dashboard into thinking a code review happened.
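
The guard can be sketched as a pure function that returns a log line only when a review actually produced a gate verdict. The `CodexRun` shape and the entry fields (`skill`, `timestamp`, `gate`) are assumptions for illustration; the real reviews.jsonl entry format is not shown in this commit.

```typescript
// Hypothetical shapes; the actual log entry fields may differ.
interface CodexRun {
  ranReview: boolean;
  ranChallenge: boolean;
  gate?: "PASS" | "FAIL"; // only set when a code review ran
}

// Returns the JSONL line to append, or null when nothing should be recorded.
function reviewLogLine(run: CodexRun, timestamp: string): string | null {
  // Challenge-only runs have no gate verdict, so they must not be logged.
  if (!run.ranReview || run.gate === undefined) return null;
  return JSON.stringify({ skill: "codex-review", timestamp, gate: run.gate });
}
```

Returning `null` instead of writing inside the function keeps the decision testable without touching the filesystem.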

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add codex plan review option to /plan-eng-review

After scope challenge (Step 0), offer to have Codex independently
review the plan with a brutally honest tech reviewer persona.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: update e2e test for codex skill

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: codex integration bugs — plan content, review persistence, quoting, stderr

- plan-eng-review: Codex now reads the plan file itself instead of inlining
  content as a CLI arg (avoids ARG_MAX for large plans)
- review: add missing echo to persist codex-review results to reviews.jsonl
- codex: consult mode uses $TMPERR (mktemp) instead of hardcoded stderr path
- codex + review: quote $SLUG/$BRANCH_SLUG in review log paths
- codex: scope plan lookup to current project, warn on cross-project fallback
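
To illustrate the first fix in the list above: passing a path keeps the argument size constant, while inlining the content makes it grow with the plan. Both helper names and the prompt wording are hypothetical, not the text used by the skill.

```typescript
// Hypothetical helpers; prompt wording is illustrative only.

// Old approach: the whole plan travels as a single CLI argument.
function inlinePlanPrompt(planContent: string): string {
  return `Review this plan:\n${planContent}`;
}

// New approach: the argument carries only the path, and the reviewer
// reads the file itself, so argument size no longer depends on plan size.
function planPathPrompt(planPath: string): string {
  return `Read the plan file at ${planPath} and review it.`;
}
```

With a multi-megabyte plan the inline prompt can exceed the operating system's argument limit, whereas the path-based prompt stays a short, fixed-size string.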

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add .context/ to .gitignore to prevent session ID leaks

Codex consult mode stores session IDs in .context/codex-session-id.
Without this ignore rule, session IDs could leak into commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: proactive skill suggestions + opt-out + trigger phrase tests

- Preamble reads proactive config via gstack-config
- Root SKILL.md.tmpl has lifecycle map (stage → skill suggestion)
- Users can opt out ("stop suggesting") / opt in ("be proactive again")
- Restored trigger phrase validation tests (16 skills × "Use when" check)
- Added missing "Use when" trigger phrases to /debug and /office-hours
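
A rough sketch of how the lifecycle map and the opt-out phrases could fit together. The stage names and the stage-to-skill pairing are assumptions for illustration; only the skill names and the two phrases ("stop suggesting", "be proactive again") come from this commit, and the real behavior lives in prompt templates rather than code.

```typescript
// Hypothetical stage names and pairing; not copied from SKILL.md.tmpl.
type Stage = "planning" | "debugging" | "reviewing" | "shipping";

const LIFECYCLE: Record<Stage, string> = {
  planning: "/plan-eng-review",
  debugging: "/debug",
  reviewing: "/review",
  shipping: "/ship",
};

interface ProactiveState {
  proactive: boolean;
}

// Updates the stored preference when the user opts out or back in.
function applyUserPhrase(state: ProactiveState, message: string): ProactiveState {
  const lower = message.toLowerCase();
  if (lower.includes("stop suggesting")) return { proactive: false };
  if (lower.includes("be proactive again")) return { proactive: true };
  return state;
}

// Returns a skill to suggest, or null when suggestions are switched off.
function suggest(state: ProactiveState, stage: Stage): string | null {
  return state.proactive ? LIFECYCLE[stage] : null;
}
```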

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update changelog for v0.8.0 — add proactive suggestions note

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Garry Tan
2026-03-19 00:22:52 -05:00
committed by GitHub
parent 823772ff0b
commit d85233017b
29 changed files with 1372 additions and 63 deletions
+3
@@ -73,6 +73,9 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
// Document-release
'document-release': ['document-release/**'],
// Codex
'codex-review': ['codex/**'],
// QA bootstrap
'qa-bootstrap': ['qa/**', 'browse/src/**', 'ship/**'],
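
For context, a touchfile table like the one above can drive test selection along these lines. This is a simplified sketch: `matches` only understands a trailing `/**`, and `selectTests` is a hypothetical name; the repo's real selection logic is not part of this diff.

```typescript
// Subset of the table shown in the hunk above.
const TOUCHFILES: Record<string, string[]> = {
  "document-release": ["document-release/**"],
  "codex-review": ["codex/**"],
  "qa-bootstrap": ["qa/**", "browse/src/**", "ship/**"],
};

// Simplified glob: "dir/**" matches anything under dir/, otherwise exact match.
function matches(pattern: string, file: string): boolean {
  if (pattern.endsWith("/**")) {
    return file.startsWith(pattern.slice(0, -2)); // keep the trailing slash
  }
  return pattern === file;
}

// Returns the names of tests whose touchfiles include any changed file.
function selectTests(changed: string[]): string[] {
  return Object.entries(TOUCHFILES)
    .filter(([, patterns]) =>
      patterns.some((p) => changed.some((f) => matches(p, f))))
    .map(([name]) => name);
}
```

Under this sketch, a change to `codex/SKILL.md.tmpl` selects only `codex-review`, which is the point of the new entry: the paid E2E test runs only when the skill itself changes.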
+86 -16
@@ -387,7 +387,7 @@ File a contributor report about this issue. Then tell me what you filed.`,
// Set up a git repo so there's project/branch context to reference
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: sessionDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(sessionDir, 'app.rb'), '# my app\n');
@@ -518,7 +518,7 @@ describeIfSelected('Review skill E2E', ['review-sql-injection'], () => {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: reviewDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -575,7 +575,7 @@ describeIfSelected('Review enum completeness E2E', ['review-enum-completeness'],
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: enumDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -647,7 +647,7 @@ describeE2E('Review design lite E2E', () => {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: designDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -910,7 +910,7 @@ describeIfSelected('Plan CEO Review E2E', ['plan-ceo-review'], () => {
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
// Init git repo (CEO review SKILL.md has a "System Audit" step that runs git)
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -996,7 +996,7 @@ describeIfSelected('Plan CEO Review SELECTIVE EXPANSION E2E', ['plan-ceo-review-
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -1079,7 +1079,7 @@ describeIfSelected('Plan Eng Review E2E', ['plan-eng-review'], () => {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -1174,7 +1174,7 @@ describeIfSelected('Retro E2E', ['retro'], () => {
spawnSync(cmd, args, { cwd: retroDir, stdio: 'pipe', timeout: 5000 });
// Create a git repo with varied commit history
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'dev@example.com']);
run('git', ['config', 'user.name', 'Dev']);
@@ -1273,7 +1273,7 @@ describeIfSelected('QA-Only skill E2E', ['qa-only-no-fix'], () => {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: qaOnlyDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(qaOnlyDir, 'index.html'), '<h1>Test</h1>\n');
@@ -1373,7 +1373,7 @@ describeIfSelected('QA Fix Loop E2E', ['qa-fix-loop'], () => {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: qaFixDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
run('git', ['add', '.']);
@@ -1460,7 +1460,7 @@ describeIfSelected('Plan-Eng-Review Test-Plan Artifact E2E', ['plan-eng-review-a
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -1777,7 +1777,7 @@ describeIfSelected('Document-Release skill E2E', ['document-release'], () => {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: docReleaseDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -2030,7 +2030,7 @@ describeIfSelected('Design Consultation E2E', [
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: designDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -2302,7 +2302,7 @@ describeIfSelected('Plan Design Review E2E', ['plan-design-review-plan-mode', 'p
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: reviewDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -2453,7 +2453,7 @@ describeIfSelected('Design Review E2E', ['design-review-fix'], () => {
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: qaDesignDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
@@ -2620,7 +2620,7 @@ export function divide(a, b) { return a / b; } // BUG: no zero check
// Init git repo
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: bootstrapDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
run('git', ['add', '.']);
@@ -2841,6 +2841,76 @@ Output the diagram directly.`,
}, 180_000);
});
// --- Codex skill E2E ---
describeIfSelected('Codex skill E2E', ['codex-review'], () => {
let codexDir: string;
beforeAll(() => {
codexDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-codex-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: codexDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init', '-b', 'main']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Commit a clean base on main
fs.writeFileSync(path.join(codexDir, 'app.rb'), '# clean base\nclass App\nend\n');
run('git', ['add', 'app.rb']);
run('git', ['commit', '-m', 'initial commit']);
// Create feature branch with vulnerable code (reuse review fixture)
run('git', ['checkout', '-b', 'feature/add-vuln']);
const vulnContent = fs.readFileSync(path.join(ROOT, 'test', 'fixtures', 'review-eval-vuln.rb'), 'utf-8');
fs.writeFileSync(path.join(codexDir, 'user_controller.rb'), vulnContent);
run('git', ['add', 'user_controller.rb']);
run('git', ['commit', '-m', 'add vulnerable controller']);
// Copy the codex skill file
fs.copyFileSync(path.join(ROOT, 'codex', 'SKILL.md'), path.join(codexDir, 'codex-SKILL.md'));
});
afterAll(() => {
try { fs.rmSync(codexDir, { recursive: true, force: true }); } catch {}
});
test('/codex review produces findings and GATE verdict', async () => {
// Check codex is available — skip if not installed
const codexCheck = spawnSync('which', ['codex'], { stdio: 'pipe', timeout: 3000 });
if (codexCheck.status !== 0) {
console.warn('codex CLI not installed — skipping E2E test');
return;
}
const result = await runSkillTest({
prompt: `You are in a git repo on branch feature/add-vuln with changes against main.
Read codex-SKILL.md for the /codex skill instructions.
Run /codex review to review the current diff against main.
Write the full output (including the GATE verdict) to ${codexDir}/codex-output.md`,
workingDirectory: codexDir,
maxTurns: 10,
timeout: 300_000,
testName: 'codex-review',
runId,
});
logCost('/codex review', result);
recordE2E('/codex review', 'Codex skill E2E', result);
expect(result.exitReason).toBe('success');
// Check that output file was created with review content
const outputPath = path.join(codexDir, 'codex-output.md');
if (fs.existsSync(outputPath)) {
const output = fs.readFileSync(outputPath, 'utf-8');
// Should contain the CODEX SAYS header or GATE verdict
const hasCodexOutput = output.includes('CODEX') || output.includes('GATE') || output.includes('codex');
expect(hasCodexOutput).toBe(true);
}
}, 360_000);
});
// Module-level afterAll — finalize eval collector after all tests complete
afterAll(async () => {
if (evalCollector) {
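
One observation on the `/codex review` E2E test above: the content check runs only if `codex-output.md` exists, and it accepts any occurrence of `codex`, so a run that writes nothing still passes. If a stricter check is wanted, something like the sketch below could be used. `checkCodexOutput` is a hypothetical helper, and whether the output file is reliably produced is an assumption that would need confirming before tightening the test.

```typescript
// Hypothetical helper; not part of this commit.
interface OutputCheck {
  ok: boolean;
  reason: string;
}

// Pass null when the output file does not exist.
function checkCodexOutput(output: string | null): OutputCheck {
  if (output === null) {
    return { ok: false, reason: "output file missing" };
  }
  // The stub tests expect the template to mention GATE: PASS and GATE: FAIL.
  const gate = /GATE:\s*(PASS|FAIL)/.exec(output);
  if (gate === null) {
    return { ok: false, reason: "no GATE verdict" };
  }
  return { ok: true, reason: `gate=${gate[1]}` };
}
```

Returning a reason string makes a failing E2E run easier to diagnose than a bare `false`.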
+102 -2
@@ -447,6 +447,7 @@ describe('No hardcoded branch names in SKILL templates', () => {
'document-release/SKILL.md.tmpl',
'plan-eng-review/SKILL.md.tmpl',
'plan-design-review/SKILL.md.tmpl',
'codex/SKILL.md.tmpl',
];
// Patterns that indicate hardcoded 'main' in git commands
@@ -1121,16 +1122,109 @@ describe('QA report template', () => {
});
});
// --- Codex skill validation ---
describe('Codex skill', () => {
test('codex/SKILL.md exists and has correct frontmatter', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('name: codex');
expect(content).toContain('version: 1.0.0');
expect(content).toContain('allowed-tools:');
});
test('codex/SKILL.md contains all three modes', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('Step 2A: Review Mode');
expect(content).toContain('Step 2B: Challenge');
expect(content).toContain('Step 2C: Consult Mode');
});
test('codex/SKILL.md contains gate verdict logic', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('[P1]');
expect(content).toContain('GATE: PASS');
expect(content).toContain('GATE: FAIL');
});
test('codex/SKILL.md contains session continuity', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('codex-session-id');
expect(content).toContain('codex exec resume');
});
test('codex/SKILL.md contains cost tracking', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('tokens used');
expect(content).toContain('Est. cost');
});
test('codex/SKILL.md contains cross-model comparison', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('CROSS-MODEL ANALYSIS');
expect(content).toContain('Agreement rate');
});
test('codex/SKILL.md contains review log persistence', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('codex-review');
expect(content).toContain('reviews.jsonl');
});
test('codex/SKILL.md uses which for binary discovery, not hardcoded path', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('which codex');
expect(content).not.toContain('/opt/homebrew/bin/codex');
});
test('codex/SKILL.md contains error handling for missing binary and auth', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('NOT_FOUND');
expect(content).toContain('codex login');
});
test('codex/SKILL.md uses mktemp for temp files', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('mktemp');
});
test('codex integration in /review offers second opinion', () => {
const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
expect(content).toContain('Codex second opinion');
expect(content).toContain('codex review');
expect(content).toContain('adversarial');
});
test('codex integration in /ship offers review gate', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Codex');
expect(content).toContain('codex review');
expect(content).toContain('codex-review');
});
test('codex integration in /plan-eng-review offers plan critique', () => {
const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
expect(content).toContain('Codex');
expect(content).toContain('codex exec');
});
test('Review Readiness Dashboard includes Codex Review row', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Codex Review');
expect(content).toContain('codex-review');
});
});
// --- Trigger phrase validation ---
describe('Skill trigger phrases', () => {
// Skills that must have "Use when" trigger phrases in their description.
// Excluded: root gstack (browser tool), gstack-upgrade (gstack-specific),
// setup-browser-cookies (utility), humanizer (text tool), browse (subskill of gstack)
// humanizer (text tool)
const SKILLS_REQUIRING_TRIGGERS = [
'qa', 'qa-only', 'ship', 'review', 'debug', 'office-hours',
'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
'design-review', 'design-consultation', 'retro', 'document-release',
'codex', 'browse', 'setup-browser-cookies',
];
for (const skill of SKILLS_REQUIRING_TRIGGERS) {
@@ -1146,7 +1240,13 @@ describe('Skill trigger phrases', () => {
}
// Skills with proactive triggers should have "Proactively suggest" in description
for (const skill of SKILLS_REQUIRING_TRIGGERS) {
const SKILLS_REQUIRING_PROACTIVE = [
'qa', 'qa-only', 'ship', 'review', 'debug', 'office-hours',
'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
'design-review', 'design-consultation', 'retro', 'document-release',
];
for (const skill of SKILLS_REQUIRING_PROACTIVE) {
test(`${skill}/SKILL.md has "Proactively suggest" phrase`, () => {
const skillPath = path.join(ROOT, skill, 'SKILL.md');
if (!fs.existsSync(skillPath)) return;
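
The split into two lists above means `codex`, `browse`, and `setup-browser-cookies` must carry a "Use when" phrase but are not required to carry "Proactively suggest". A compact sketch of that rule follows; `requiredPhrases` and `missingPhrases` are hypothetical names, and the sketch checks the whole file text rather than only the description field, which is a simplification.

```typescript
// Skills that must also advertise proactive suggestions (from the hunk above).
const REQUIRING_PROACTIVE = new Set<string>([
  "qa", "qa-only", "ship", "review", "debug", "office-hours",
  "plan-ceo-review", "plan-eng-review", "plan-design-review",
  "design-review", "design-consultation", "retro", "document-release",
]);

// Assumes the skill is already in the trigger-phrase list.
function requiredPhrases(skill: string): string[] {
  return REQUIRING_PROACTIVE.has(skill)
    ? ["Use when", "Proactively suggest"]
    : ["Use when"];
}

// Returns the required phrases that the SKILL.md text lacks.
function missingPhrases(skill: string, skillMd: string): string[] {
  return requiredPhrases(skill).filter((phrase) => !skillMd.includes(phrase));
}
```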