v1.57.10.0 feat: Codex review default-on across review/ship/plan/docs (#1966)

* feat(config): make codex_reviews the master switch for all Codex review Broaden the codex_reviews doc to describe it governing /review, /ship, /document-release, plan reviews, and /autoplan. Reject invalid values on set (preserving the existing value) so a typo can never silently flip paid Codex calls on or off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(review): Codex review default-on across review/ship/plan/docs Add a shared codexPreflight() helper (constants.ts) that, in one bash block, reads codex_reviews, sources gstack-codex-probe, checks install + auth, and echoes a single canonical mode (ready/not_installed/not_authed/ disabled). All Codex resolvers route through it. - generateCodexPlanReview: opt-in question removed; the outside voice now runs automatically (default-on), falling back to a Claude subagent when Codex is missing/unauthed. Cross-model tension still gates on user approval (sovereignty preserved). - generateAdversarialStep: probe-based availability (install AND auth), distinct not-installed vs not-authed guidance; 200-line structured-review threshold unchanged. - generateCodexDocReview (new, wired via CODEX_DOC_REVIEW): reviews the release's docs against the shipped diff range, informational + an explicit apply-fixes decision point, never auto-edits. - autoplan Phase 0.5 now honors codex_reviews=disabled so the switch is truly global. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(docs): regenerate SKILL docs + refresh ship golden Output of gen:skill-docs for the Codex-default-on resolver/template changes. Refreshes the factory-ship golden fixture (codex-host output unchanged — resolvers strip for the codex host). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(infra): widen size-budget guards for default-on Codex outside-voice The codexPreflight() block + CODEX_MODE branch prose (replacing the smaller opt-in question) grows plan-ceo/eng/devex-review and review by 5-7% over baseline. Each bump carries a comment justifying it as intentional capability, not slop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: guard Codex default-on + config reject-on-set skill-validation: assert plan reviews no longer carry the opt-in question and render the default-on outside-voice, document-release carries the doc review, and the codex host strips all of it. gstack-config: codex_reviews defaults to enabled, accepts enabled/disabled, and rejects an invalid value while preserving the existing one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(test): align gstack-config tests with defaults-fallback behavior Three tests (last touched v0.13.7.0) asserted get/list print empty for unset keys, but gstack-config falls back to the documented defaults table (get returns the default, list shows the active-values block). Update the assertions to the real behavior and split out an unknown-key case that does still return empty. Pre-existing red, unrelated to codex review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * v1.57.10.0 feat: Codex review default-on across review/ship/plan/docs Codex cross-model review now runs by default on /review, /ship, all four plan reviews, /document-release, and /autoplan, governed by one master switch (codex_reviews, default enabled). Plan-review outside voice is default-on; /document-release gets a new Codex doc-vs-diff audit; every call site detects install AND auth and falls back to a Claude subagent with a clear reason. Disable everything with: gstack-config set codex_reviews disabled Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-08-03 12:58:40 +02:00 · 2026-06-10 21:14:58 -07:00
parent 8241949357
commit a5833c413f
21 changed files with 766 additions and 196 deletions
@@ -2332,23 +2332,47 @@ For each comment in `comments`:

 Every diff gets adversarial review from both Claude and Codex. LOC is not a proxy for risk — a 5-line auth change can be critical.

-**Detect diff size and tool availability:**
+**Detect diff size:**

 ```bash
 DIFF_BASE=$(git merge-base origin/<base> HEAD)
 DIFF_INS=$(git diff "$DIFF_BASE" --stat | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
 DIFF_DEL=$(git diff "$DIFF_BASE" --stat | tail -1 | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo "0")
 DIFF_TOTAL=$((DIFF_INS + DIFF_DEL))
-command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
-# Legacy opt-out — only gates Codex passes, Claude always runs
-OLD_CFG=$($GSTACK_ROOT/bin/gstack-config get codex_reviews 2>/dev/null || true)
 echo "DIFF_SIZE: $DIFF_TOTAL"
-echo "OLD_CFG: ${OLD_CFG:-not_set}"
 ```

-If `OLD_CFG` is `disabled`: skip Codex passes only. Claude adversarial subagent still runs (it's free and fast). Jump to the "Claude adversarial subagent" section.
+**Detect the Codex master switch + tool availability:**

-**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size.
+```bash
+# Codex preflight: one block (functions sourced here don't persist to later blocks).
+_TEL=$($GSTACK_ROOT/bin/gstack-config get telemetry 2>/dev/null || echo off)
+_CODEX_CFG=$($GSTACK_ROOT/bin/gstack-config get codex_reviews 2>/dev/null || echo enabled)
+source $GSTACK_ROOT/bin/gstack-codex-probe 2>/dev/null || true
+if [ "$_CODEX_CFG" = "disabled" ]; then
+  _CODEX_MODE="disabled"
+elif ! command -v codex >/dev/null 2>&1; then
+  _CODEX_MODE="not_installed"; _gstack_codex_log_event "codex_cli_missing" 2>/dev/null || true
+elif ! _gstack_codex_auth_probe >/dev/null 2>&1; then
+  _CODEX_MODE="not_authed"; _gstack_codex_log_event "codex_auth_failed" 2>/dev/null || true
+else
+  _CODEX_MODE="ready"; _gstack_codex_version_check 2>/dev/null || true
+fi
+echo "CODEX_MODE: $_CODEX_MODE"
+```
+
+Branch on the echoed `CODEX_MODE`:
+- **`disabled`** — the user turned Codex reviews off (`codex_reviews=disabled`). Skip the Codex passes only; the Claude adversarial subagent below STILL runs (it is free and fast). Print: "Codex passes skipped (codex_reviews disabled) — running Claude adversarial only."
+- **`not_installed`** — Codex CLI absent. Print: "Codex not installed — using Claude subagent. Install for cross-model coverage: `npm install -g @openai/codex`." Fall back to the Claude subagent path.
+- **`not_authed`** — installed but no credentials. Print: "Codex installed but not authenticated — using Claude subagent. Run `codex login` or set `$CODEX_API_KEY`." Fall back to the Claude subagent path.
+- **`ready`** — run the Codex pass below.
+
+For this diff-review path, `CODEX_MODE: disabled` means skip the Codex passes ONLY — the
+Claude adversarial subagent below still runs (it's free and fast). `ready` runs the Codex
+passes; `not_installed` / `not_authed` skip them with the printed note and continue with
+Claude only.
+
+**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size (still requires `CODEX_MODE: ready`).

 ---

@@ -2369,9 +2393,9 @@ If the subagent fails or times out: "Claude adversarial subagent unavailable. Co

 ---

-### Codex adversarial challenge (always runs when available)
+### Codex adversarial challenge (runs whenever `CODEX_MODE: ready`)

-If Codex is available AND `OLD_CFG` is NOT `disabled`:
+If `CODEX_MODE` is `ready`:

 ```bash
 TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
@@ -2393,13 +2417,13 @@ Present the full output verbatim. This is informational — it never blocks ship

 **Cleanup:** Run `rm -f "$TMPERR_ADV"` after processing.

-If Codex is NOT available: "Codex CLI not found — running Claude adversarial only. Install Codex for cross-model coverage: `npm install -g @openai/codex`"
+If `CODEX_MODE` is `not_installed` / `not_authed` / `disabled`: the preflight already printed the reason; run Claude adversarial only.

 ---

 ### Codex structured review (large diffs only, 200+ lines)

-If `DIFF_TOTAL >= 200` AND Codex is available AND `OLD_CFG` is NOT `disabled`:
+If `DIFF_TOTAL >= 200` AND `CODEX_MODE` is `ready`:

 ```bash
 TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
@@ -145,6 +145,9 @@ export const CARVE_GUARDS: Record<string, CarveGuard> = {
    maxSkeletonBytes: 90_000,
    minUnionBytes: 80_000,
    mustContain: ['SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'HOLD SCOPE', 'SCOPE REDUCTION'],
+    // Default-on Codex outside-voice (codexPreflight block + CODEX_MODE branch
+    // prose replacing the smaller opt-in question) lands this ~5.2% over baseline.
+    maxSizeRatio: 1.08,
  },
  'plan-eng-review': {
    skill: 'plan-eng-review',
@@ -162,9 +165,11 @@ export const CARVE_GUARDS: Record<string, CarveGuard> = {
    minUnionBytes: 70_000,
    mustContain: ['Architecture', 'Code Quality', 'Test', 'Performance'],
    // Cross-cutting preamble growth (v1.57.2.0 AUQ-failure prose fallback + the
-    // decision-memory nudge + the v1.57.4.0 Boil-the-Ocean rename) lands this just
-    // over the strict 1.05; small headroom for the shared preamble additions.
-    maxSizeRatio: 1.06,
+    // decision-memory nudge + the v1.57.4.0 Boil-the-Ocean rename) plus the
+    // default-on Codex outside-voice (codexPreflight block + CODEX_MODE branch
+    // prose, replacing the smaller opt-in question) land this at ~6.6% over the
+    // v1.53.0.0 baseline. Headroom for those intentional additions.
+    maxSizeRatio: 1.08,
  },
  'plan-design-review': {
    skill: 'plan-design-review',
@@ -197,6 +202,9 @@ export const CARVE_GUARDS: Record<string, CarveGuard> = {
    maxSkeletonBytes: 76_000,
    minUnionBytes: 70_000,
    mustContain: ['developer experience', 'Getting Started'],
+    // Default-on Codex outside-voice (codexPreflight block + CODEX_MODE branch
+    // prose replacing the smaller opt-in question) lands this ~5.7% over baseline.
+    maxSizeRatio: 1.08,
  },
  'office-hours': {
    skill: 'office-hours',
@@ -232,11 +240,15 @@ export const CARVE_GUARDS: Record<string, CarveGuard> = {
    maxSkeletonBytes: 50_000,
    minUnionBytes: 55_000,
    mustContain: ['CHANGELOG', 'Diataxis', 'coverage'],
-    // The AUQ-failure prose fallback (v1.57.2.0) adds ~2KB to every skill's
-    // always-loaded preamble; on this small carved skeleton that lands at ~5.9%
-    // over the pre-carve/pre-AUQ v1.53.0.0 baseline. Headroom for the
-    // cross-cutting addition; all other skills keep the strict 1.05 ceiling.
-    maxSizeRatio: 1.08,
+    // Two intentional additions stack on this small skill: the AUQ-failure prose
+    // fallback (v1.57.2.0, ~2KB to every preamble) AND the new default-on Codex
+    // documentation-review section (codexPreflight + prompt + apply-gate, carved
+    // into release-body so the SKELETON stays under maxSkeletonBytes). On a ~55KB
+    // baseline that whole new capability is ~18.6% of union bytes. The doc review
+    // is a deliberate new feature, not preamble creep; the union ceiling is raised
+    // to match while the skeleton budget (50_000) still holds the always-loaded
+    // cost flat.
+    maxSizeRatio: 1.20,
  },
  'design-consultation': {
    skill: 'design-consultation',
@@ -210,7 +210,11 @@ const MONOLITH_INVARIANTS: ParityInvariant[] = [
    skill: 'review',
    mustContain: ['confidence', 'P1', 'P2'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
-    maxSizeRatio: 1.05,
+    // The adversarial step swapped its bare `command -v codex` check for the shared
+    // codexPreflight() block (install + auth tri-state + CODEX_MODE branch prose),
+    // landing ~6.3% over the v1.53.0.0 baseline. Intentional: it adds proper
+    // not-installed vs not-authed handling, not slop.
+    maxSizeRatio: 1.08,
    minBytes: 70_000,
  },
  {
@@ -1386,15 +1386,16 @@ describe('Codex skill', () => {
    expect(content).toContain('Adversarial review (always-on)');
    // Always-on: both Claude and Codex adversarial
    expect(content).toContain('Claude adversarial subagent (always runs)');
-    expect(content).toContain('Codex adversarial challenge (always runs when available)');
+    expect(content).toContain('Codex adversarial challenge (runs whenever');
    // Claude adversarial subagent dispatch
    expect(content).toContain('Agent tool');
    expect(content).toContain('FIXABLE');
    expect(content).toContain('INVESTIGATE');
-    // Codex availability check
-    expect(content).toContain('CODEX_NOT_AVAILABLE');
-    // OLD_CFG only gates Codex, not Claude
-    expect(content).toContain('skip Codex passes only');
+    // Probe-based availability via the shared codexPreflight() (install + auth)
+    expect(content).toContain('CODEX_MODE');
+    expect(content).toContain('command -v codex'); // install check kept literal
+    // codex_reviews=disabled gates Codex passes only; Claude adversarial still runs
+    expect(content).toContain('skip the Codex passes ONLY');
    // Review log
    expect(content).toContain('adversarial-review');
    expect(content).toContain('reasoning_effort="high"');
@@ -1449,6 +1450,43 @@ describe('Codex skill', () => {
    expect(content).toContain('codex exec');
  });

+  // D5 regression guard: the Codex outside voice is default-on, not opt-in. A future
+  // gen-skill-docs change must not silently reintroduce the "Want an outside voice?"
+  // AskUserQuestion. The CODEX_PLAN_REVIEW content renders into each skill's
+  // sections/review-sections.md (the skeleton points at it). plan-design-review uses
+  // DESIGN_OUTSIDE_VOICES, not CODEX_PLAN_REVIEW, so it is excluded here.
+  test('plan reviews run the Codex outside voice default-on (no opt-in question)', () => {
+    for (const skill of ['plan-eng-review', 'plan-ceo-review', 'plan-devex-review']) {
+      const content = fs.readFileSync(
+        path.join(ROOT, skill, 'sections', 'review-sections.md'), 'utf-8');
+      expect(content).not.toContain('Want an outside voice');
+      expect(content).toContain('Outside Voice — Independent Plan Challenge (default-on)');
+      expect(content).toContain('CODEX_MODE');
+      expect(content).toContain('command -v codex'); // preflight install check (e2e relies on it)
+    }
+  });
+
+  test('/document-release includes the default-on Codex documentation review', () => {
+    // The doc-review renders into the carved release-body section (kept out of the
+    // always-loaded skeleton to respect the skeleton-byte budget).
+    const content = fs.readFileSync(
+      path.join(ROOT, 'document-release', 'sections', 'release-body.md'), 'utf-8');
+    expect(content).toContain('Codex Documentation Review (default-on)');
+    expect(content).toContain('CODEX_MODE');
+    expect(content).toContain('codex-doc-review');
+  });
+
+  test('codex-host document-release does NOT contain the Codex doc review', () => {
+    // .agents/ is gitignored — generate on demand (codex never invokes itself)
+    Bun.spawnSync(['bun', 'run', 'scripts/gen-skill-docs.ts', '--host', 'codex'], {
+      cwd: ROOT, stdout: 'pipe', stderr: 'pipe',
+    });
+    const content = fs.readFileSync(
+      path.join(ROOT, '.agents', 'skills', 'gstack-document-release', 'SKILL.md'), 'utf-8');
+    expect(content).not.toContain('Codex Documentation Review');
+    expect(content).not.toContain('codex-doc-review');
+  });
+
  test('codex review invocations avoid the prompt plus --base argument shape', () => {
    for (const rel of ['codex/SKILL.md', 'review/SKILL.md', 'ship/SKILL.md']) {
      // ship's codex command moved into sections/adversarial.md (T9 carve).