feat: /codex skill — multi-AI second opinion + proactive suggestions (#197)

* feat: /codex skill - multi-AI second opinion (review, challenge, consult)

  Three modes: code review with pass/fail gate, adversarial challenge mode, and conversational consult with session continuity. First multi-AI skill in gstack, wrapping OpenAI's Codex CLI.

* feat: integrate /codex into /review, /ship, /plan-eng-review + dashboard

  /review offers Codex second opinion after completing its own review. /ship offers Codex review as optional gate before pushing. /plan-eng-review offers Codex plan critique after scope challenge. Review Readiness Dashboard shows Codex Review as optional row.

* chore: bump version and changelog (v0.8.0)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: codex skill validation (12 stub tests) + E2E eval test

  Stub tests (free tier): verify template content: three modes, gate verdict, session continuity, cost tracking, cross-model comparison, binary discovery, error handling, mktemp usage, and integrations into /review, /ship, /plan-eng-review. E2E test (paid tier): runs /codex review on vulnerable fixture repo via session-runner, verifies output contains findings and GATE verdict.

* fix: codex auth error message - use codex login, not OPENAI_API_KEY

  Codex authenticates via ChatGPT OAuth (codex login), not an env var.

* feat: codex uses high reasoning effort by default

  gpt-5.2-codex is the only model available with ChatGPT login. All commands now use model_reasoning_effort="high" for maximum depth; the whole point is a thorough second opinion.

* feat: crank codex reasoning to xhigh (maximum)

* feat: per-mode reasoning (high for review/consult, xhigh for challenge) + web search

  Review and consult use high reasoning: thorough but not slow. Challenge (adversarial) uses xhigh: maximum depth for breaking code. All modes enable web_search_cached so Codex can look up docs/APIs.

* refactor: don't hardcode model - use codex default (always latest)

* feat: JSONL output for codex challenge + consult modes

  Use --json flag to parse codex's JSONL events, extracting reasoning traces ([codex thinking]), tool calls ([codex ran]), and token counts. This gives richer output than the -o flag alone: you can see what codex thought through before its answer.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: only persist codex-review log when code review actually ran

  Don't write a codex-review entry to reviews.jsonl when only the adversarial challenge (option B) was selected. There's no gate verdict to record, and a false entry misleads the Review Readiness Dashboard into thinking a code review happened.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add codex plan review option to /plan-eng-review

  After scope challenge (Step 0), offer to have Codex independently review the plan with a brutally honest tech reviewer persona.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: update e2e test for codex skill

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: codex integration bugs - plan content, review persistence, quoting, stderr

  - plan-eng-review: Codex now reads the plan file itself instead of inlining content as a CLI arg (avoids ARG_MAX for large plans)
  - review: add missing echo to persist codex-review results to reviews.jsonl
  - codex: consult mode uses $TMPERR (mktemp) instead of hardcoded stderr path
  - codex + review: quote $SLUG/$BRANCH_SLUG in review log paths
  - codex: scope plan lookup to current project, warn on cross-project fallback

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add .context/ to .gitignore to prevent session ID leaks

  Codex consult mode stores session IDs in .context/codex-session-id. Without this ignore rule, session IDs could leak into commits.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: proactive skill suggestions + opt-out + trigger phrase tests

  - Preamble reads proactive config via gstack-config
  - Root SKILL.md.tmpl has lifecycle map (stage → skill suggestion)
  - Users can opt out ("stop suggesting") / opt in ("be proactive again")
  - Restored trigger phrase validation tests (16 skills × "Use when" check)
  - Added missing "Use when" trigger phrases to /debug and /office-hours

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update changelog for v0.8.0 - add proactive suggestions note

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-27 20:20:03 +02:00 · 2026-03-19 00:22:52 -05:00
parent 823772ff0b
commit d85233017b
29 changed files with 1372 additions and 63 deletions
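
The commit message describes two rules: a pass/fail gate on the review output, and "only persist codex-review log when code review actually ran". A minimal TypeScript sketch of that logic follows, for illustration only. The skill itself appears to implement this in shell inside SKILL.md, so every name here (`Mode`, `gateVerdict`, `reviewLogLine`, the entry fields) is hypothetical. The rule that a `[P1]` finding fails the gate is an assumption inferred from the strings the stub tests check for.

```typescript
// Hypothetical sketch; none of these names come from the gstack repo.
type Mode = 'review' | 'challenge' | 'consult';
type Gate = 'PASS' | 'FAIL';

interface ReviewLogEntry {
  skill: 'codex-review';
  gate: Gate;
  timestamp: string; // ISO 8601
}

// Assumption: any finding tagged [P1] in the Codex output fails the gate.
function gateVerdict(codexOutput: string): Gate {
  return /\[P1\]/.test(codexOutput) ? 'FAIL' : 'PASS';
}

// Returns the JSONL line to append to reviews.jsonl, or null when no code
// review ran (e.g. only the adversarial challenge was selected), so the
// dashboard never sees a review entry without a gate verdict behind it.
function reviewLogLine(
  modesRun: Mode[],
  codexOutput: string,
  now: Date = new Date(),
): string | null {
  const reviewRan = modesRun.some((m) => m === 'review');
  if (!reviewRan) return null;
  const entry: ReviewLogEntry = {
    skill: 'codex-review',
    gate: gateVerdict(codexOutput),
    timestamp: now.toISOString(),
  };
  return JSON.stringify(entry);
}
```

Returning `null` instead of writing directly keeps the decision testable without touching the filesystem; the caller appends the line only when one is returned.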
@@ -447,6 +447,7 @@ describe('No hardcoded branch names in SKILL templates', () => {
    'document-release/SKILL.md.tmpl',
    'plan-eng-review/SKILL.md.tmpl',
    'plan-design-review/SKILL.md.tmpl',
+    'codex/SKILL.md.tmpl',
  ];

  // Patterns that indicate hardcoded 'main' in git commands
@@ -1121,16 +1122,109 @@ describe('QA report template', () => {
  });
 });

+// --- Codex skill validation ---
+
+describe('Codex skill', () => {
+  test('codex/SKILL.md exists and has correct frontmatter', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('name: codex');
+    expect(content).toContain('version: 1.0.0');
+    expect(content).toContain('allowed-tools:');
+  });
+
+  test('codex/SKILL.md contains all three modes', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Step 2A: Review Mode');
+    expect(content).toContain('Step 2B: Challenge');
+    expect(content).toContain('Step 2C: Consult Mode');
+  });
+
+  test('codex/SKILL.md contains gate verdict logic', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('[P1]');
+    expect(content).toContain('GATE: PASS');
+    expect(content).toContain('GATE: FAIL');
+  });
+
+  test('codex/SKILL.md contains session continuity', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('codex-session-id');
+    expect(content).toContain('codex exec resume');
+  });
+
+  test('codex/SKILL.md contains cost tracking', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('tokens used');
+    expect(content).toContain('Est. cost');
+  });
+
+  test('codex/SKILL.md contains cross-model comparison', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('CROSS-MODEL ANALYSIS');
+    expect(content).toContain('Agreement rate');
+  });
+
+  test('codex/SKILL.md contains review log persistence', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('codex-review');
+    expect(content).toContain('reviews.jsonl');
+  });
+
+  test('codex/SKILL.md uses which for binary discovery, not hardcoded path', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('which codex');
+    expect(content).not.toContain('/opt/homebrew/bin/codex');
+  });
+
+  test('codex/SKILL.md contains error handling for missing binary and auth', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('NOT_FOUND');
+    expect(content).toContain('codex login');
+  });
+
+  test('codex/SKILL.md uses mktemp for temp files', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('mktemp');
+  });
+
+  test('codex integration in /review offers second opinion', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Codex second opinion');
+    expect(content).toContain('codex review');
+    expect(content).toContain('adversarial');
+  });
+
+  test('codex integration in /ship offers review gate', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Codex');
+    expect(content).toContain('codex review');
+    expect(content).toContain('codex-review');
+  });
+
+  test('codex integration in /plan-eng-review offers plan critique', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Codex');
+    expect(content).toContain('codex exec');
+  });
+
+  test('Review Readiness Dashboard includes Codex Review row', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Codex Review');
+    expect(content).toContain('codex-review');
+  });
+});
+
 // --- Trigger phrase validation ---

 describe('Skill trigger phrases', () => {
  // Skills that must have "Use when" trigger phrases in their description.
  // Excluded: root gstack (browser tool), gstack-upgrade (gstack-specific),
-  // setup-browser-cookies (utility), humanizer (text tool), browse (subskill of gstack)
+  // humanizer (text tool)
  const SKILLS_REQUIRING_TRIGGERS = [
    'qa', 'qa-only', 'ship', 'review', 'debug', 'office-hours',
    'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
    'design-review', 'design-consultation', 'retro', 'document-release',
+    'codex', 'browse', 'setup-browser-cookies',
  ];

  for (const skill of SKILLS_REQUIRING_TRIGGERS) {
@@ -1146,7 +1240,13 @@ describe('Skill trigger phrases', () => {
  }

  // Skills with proactive triggers should have "Proactively suggest" in description
-  for (const skill of SKILLS_REQUIRING_TRIGGERS) {
+  const SKILLS_REQUIRING_PROACTIVE = [
+    'qa', 'qa-only', 'ship', 'review', 'debug', 'office-hours',
+    'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
+    'design-review', 'design-consultation', 'retro', 'document-release',
+  ];
+
+  for (const skill of SKILLS_REQUIRING_PROACTIVE) {
    test(`${skill}/SKILL.md has "Proactively suggest" phrase`, () => {
      const skillPath = path.join(ROOT, skill, 'SKILL.md');
      if (!fs.existsSync(skillPath)) return;