fix: Codex hang fixes — plan visibility, stdout buffering, reasoning effort (v0.12.4.0) (#536)

* fix: unbuffer Python stdout in codex --json streaming Python fully buffers stdout when piped (not a TTY). The `codex exec --json | python3 -c "..."` pattern meant zero output visible until process exit — users saw nothing for 30+ minutes. Add PYTHONUNBUFFERED=1 env var, python3 -u flag, and flush=True to all print() calls in all three Python parser blocks (Challenge, Consult new session, Consult resumed session). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: per-mode reasoning effort defaults, add --xhigh override xhigh reasoning uses ~23x more tokens and causes 50+ minute hangs on large context tasks (OpenAI issues #8545, #8402, #6931). Per-mode defaults for /codex skill: - Review: high (bounded diff, needs thoroughness) - Challenge: high (adversarial but bounded by diff) - Consult: medium (large context, interactive, needs speed) Also changes all Outside Voice / adversarial codex invocations across gstack (resolvers, gen-skill-docs) from xhigh to high. Users can override with --xhigh flag when they want max reasoning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: explicit plan content embedding for codex sandbox visibility Codex runs sandboxed to repo root (-C) and cannot access ~/.claude/plans/. The template already instructed content embedding but wasn't explicit enough — Claude sometimes shortcut to referencing the file path, causing Codex to waste 10+ tool calls searching before giving up. Strengthen the instruction to make embedding unambiguous: "embed FULL CONTENT, do NOT reference the file path." Also extract referenced source file paths from the plan so Codex reads them directly instead of discovering via rg/find. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add --xhigh reminder to challenge and consult modes The --xhigh override was only documented in Step 2A (review). Steps 2B (challenge) and 2C (consult) lacked the reminder, so the flag would silently do nothing for those modes. Found by adversarial review. * chore: bump version and changelog (v0.12.4.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 19:25:10 +02:00 · 2026-03-26 18:19:26 -06:00
parent de20228c2c
commit 1b60acd576
13 changed files with 138 additions and 59 deletions
@@ -407,6 +407,14 @@ Parse the user's input to determine which mode to run:
   - Otherwise, ask: "What would you like to ask Codex?"
 4. `/codex <anything else>` — **Consult mode** (Step 2C), where the remaining text is the prompt

+**Reasoning effort override:** If the user's input contains `--xhigh` anywhere,
+note it and remove it from the prompt text before passing to Codex. When `--xhigh`
+is present, use `model_reasoning_effort="xhigh"` for all modes regardless of the
+per-mode default below. Otherwise, use the per-mode defaults:
+- Review (2A): `high` — bounded diff input, needs thoroughness
+- Challenge (2B): `high` — adversarial but bounded by diff
+- Consult (2C): `medium` — large context, interactive, needs speed
+
 ---

 ## Step 2A: Review Mode
@@ -420,13 +428,15 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)

 2. Run the review (5-minute timeout):
 ```bash
-codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
+codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
 ```

+If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`.
+
 Use `timeout: 300000` on the Bash call. If the user provided custom instructions
 (e.g., `/codex review focus on security`), pass them as the prompt argument:
 ```bash
-codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
+codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
 ```

 3. Capture the output. Then parse cost from stderr:
@@ -563,8 +573,11 @@ With focus (e.g., "security"):
 "Review the changes on this branch against the base branch. Run `git diff origin/<base>` to see the diff. Focus specifically on SECURITY. Your job is to find every way an attacker could exploit this code. Think about injection vectors, auth bypasses, privilege escalation, data exposure, and timing attacks. Be adversarial."

 2. Run codex exec with **JSONL output** to capture reasoning traces and tool calls (5-minute timeout):
+
+If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`.
+
 ```bash
-codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>/dev/null | python3 -c "
+codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>/dev/null | PYTHONUNBUFFERED=1 python3 -u -c "
 import sys, json
 for line in sys.stdin:
    line = line.strip()
@@ -577,17 +590,17 @@ for line in sys.stdin:
            itype = item.get('type','')
            text = item.get('text','')
            if itype == 'reasoning' and text:
-                print(f'[codex thinking] {text}')
-                print()
+                print(f'[codex thinking] {text}', flush=True)
+                print(flush=True)
            elif itype == 'agent_message' and text:
-                print(text)
+                print(text, flush=True)
            elif itype == 'command_execution':
                cmd = item.get('command','')
-                if cmd: print(f'[codex ran] {cmd}')
+                if cmd: print(f'[codex ran] {cmd}', flush=True)
        elif t == 'turn.completed':
            usage = obj.get('usage',{})
            tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0)
-            if tokens: print(f'\ntokens used: {tokens}')
+            if tokens: print(f'\ntokens used: {tokens}', flush=True)
    except: pass
 "
 ```
@@ -636,20 +649,34 @@ ls -t ~/.claude/plans/*.md 2>/dev/null | xargs grep -l "$(basename $(pwd))" 2>/d
 ```
 If no project-scoped match, fall back to `ls -t ~/.claude/plans/*.md 2>/dev/null | head -1`
 but warn: "Note: this plan may be from a different project — verify before sending to Codex."
-Read the plan file and prepend the persona to the user's prompt:
+
+**IMPORTANT — embed content, don't reference path:** Codex runs sandboxed to the repo
+root (`-C`) and cannot access `~/.claude/plans/` or any files outside the repo. You MUST
+read the plan file yourself and embed its FULL CONTENT in the prompt below. Do NOT tell
+Codex the file path or ask it to read the plan file — it will waste 10+ tool calls
+searching and fail.
+
+Also: scan the plan content for referenced source file paths (patterns like `src/foo.ts`,
+`lib/bar.py`, paths containing `/` that exist in the repo). If found, list them in the
+prompt so Codex reads them directly instead of discovering them via rg/find.
+
+Prepend the persona to the user's prompt:
 "You are a brutally honest technical reviewer. Review this plan for: logical gaps and
 unstated assumptions, missing error handling or edge cases, overcomplexity (is there a
 simpler approach?), feasibility risks (what could go wrong?), and missing dependencies
 or sequencing issues. Be direct. Be terse. No compliments. Just the problems.
+Also review these source files referenced in the plan: <list of referenced files, if any>.

 THE PLAN:
-<plan content>"
+<full plan content, embedded verbatim>"

 4. Run codex exec with **JSONL output** to capture reasoning traces (5-minute timeout):

+If the user passed `--xhigh`, use `"xhigh"` instead of `"medium"`.
+
 For a **new session:**
 ```bash
-codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c "
 import sys, json
 for line in sys.stdin:
    line = line.strip()
@@ -659,31 +686,31 @@ for line in sys.stdin:
        t = obj.get('type','')
        if t == 'thread.started':
            tid = obj.get('thread_id','')
-            if tid: print(f'SESSION_ID:{tid}')
+            if tid: print(f'SESSION_ID:{tid}', flush=True)
        elif t == 'item.completed' and 'item' in obj:
            item = obj['item']
            itype = item.get('type','')
            text = item.get('text','')
            if itype == 'reasoning' and text:
-                print(f'[codex thinking] {text}')
-                print()
+                print(f'[codex thinking] {text}', flush=True)
+                print(flush=True)
            elif itype == 'agent_message' and text:
-                print(text)
+                print(text, flush=True)
            elif itype == 'command_execution':
                cmd = item.get('command','')
-                if cmd: print(f'[codex ran] {cmd}')
+                if cmd: print(f'[codex ran] {cmd}', flush=True)
        elif t == 'turn.completed':
            usage = obj.get('usage',{})
            tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0)
-            if tokens: print(f'\ntokens used: {tokens}')
+            if tokens: print(f'\ntokens used: {tokens}', flush=True)
    except: pass
 "
 ```

 For a **resumed session** (user chose "Continue"):
 ```bash
-codex exec resume <session-id> "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
-<same python streaming parser as above>
+codex exec resume <session-id> "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c "
+<same python streaming parser as above, with flush=True on all print() calls>
 "
 ```

@@ -718,7 +745,14 @@ Session saved — run /codex again to continue this conversation.
 agentic coding model). This means as OpenAI ships newer models, /codex automatically
 uses them. If the user wants a specific model, pass `-m` through to codex.

-**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.
+**Reasoning effort (per-mode defaults):**
+- **Review (2A):** `high` — bounded diff input, needs thoroughness but not max tokens
+- **Challenge (2B):** `high` — adversarial but bounded by diff size
+- **Consult (2C):** `medium` — large context (plans, codebase), interactive, needs speed
+
+`xhigh` uses ~23x more tokens than `high` and causes 50+ minute hangs on large context
+tasks (OpenAI issues #8545, #8402, #6931). Users can override with `--xhigh` flag
+(e.g., `/codex review --xhigh`) when they want maximum reasoning and are willing to wait.

 **Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
 docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.
@@ -67,6 +67,14 @@ Parse the user's input to determine which mode to run:
   - Otherwise, ask: "What would you like to ask Codex?"
 4. `/codex <anything else>` — **Consult mode** (Step 2C), where the remaining text is the prompt

+**Reasoning effort override:** If the user's input contains `--xhigh` anywhere,
+note it and remove it from the prompt text before passing to Codex. When `--xhigh`
+is present, use `model_reasoning_effort="xhigh"` for all modes regardless of the
+per-mode default below. Otherwise, use the per-mode defaults:
+- Review (2A): `high` — bounded diff input, needs thoroughness
+- Challenge (2B): `high` — adversarial but bounded by diff
+- Consult (2C): `medium` — large context, interactive, needs speed
+
 ---

 ## Step 2A: Review Mode
@@ -80,13 +88,15 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)

 2. Run the review (5-minute timeout):
 ```bash
-codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
+codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
 ```

+If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`.
+
 Use `timeout: 300000` on the Bash call. If the user provided custom instructions
 (e.g., `/codex review focus on security`), pass them as the prompt argument:
 ```bash
-codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
+codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
 ```

 3. Capture the output. Then parse cost from stderr:
@@ -158,8 +168,11 @@ With focus (e.g., "security"):
 "Review the changes on this branch against the base branch. Run `git diff origin/<base>` to see the diff. Focus specifically on SECURITY. Your job is to find every way an attacker could exploit this code. Think about injection vectors, auth bypasses, privilege escalation, data exposure, and timing attacks. Be adversarial."

 2. Run codex exec with **JSONL output** to capture reasoning traces and tool calls (5-minute timeout):
+
+If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`.
+
 ```bash
-codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>/dev/null | python3 -c "
+codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>/dev/null | PYTHONUNBUFFERED=1 python3 -u -c "
 import sys, json
 for line in sys.stdin:
    line = line.strip()
@@ -172,17 +185,17 @@ for line in sys.stdin:
            itype = item.get('type','')
            text = item.get('text','')
            if itype == 'reasoning' and text:
-                print(f'[codex thinking] {text}')
-                print()
+                print(f'[codex thinking] {text}', flush=True)
+                print(flush=True)
            elif itype == 'agent_message' and text:
-                print(text)
+                print(text, flush=True)
            elif itype == 'command_execution':
                cmd = item.get('command','')
-                if cmd: print(f'[codex ran] {cmd}')
+                if cmd: print(f'[codex ran] {cmd}', flush=True)
        elif t == 'turn.completed':
            usage = obj.get('usage',{})
            tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0)
-            if tokens: print(f'\ntokens used: {tokens}')
+            if tokens: print(f'\ntokens used: {tokens}', flush=True)
    except: pass
 "
 ```
@@ -231,20 +244,34 @@ ls -t ~/.claude/plans/*.md 2>/dev/null | xargs grep -l "$(basename $(pwd))" 2>/d
 ```
 If no project-scoped match, fall back to `ls -t ~/.claude/plans/*.md 2>/dev/null | head -1`
 but warn: "Note: this plan may be from a different project — verify before sending to Codex."
-Read the plan file and prepend the persona to the user's prompt:
+
+**IMPORTANT — embed content, don't reference path:** Codex runs sandboxed to the repo
+root (`-C`) and cannot access `~/.claude/plans/` or any files outside the repo. You MUST
+read the plan file yourself and embed its FULL CONTENT in the prompt below. Do NOT tell
+Codex the file path or ask it to read the plan file — it will waste 10+ tool calls
+searching and fail.
+
+Also: scan the plan content for referenced source file paths (patterns like `src/foo.ts`,
+`lib/bar.py`, paths containing `/` that exist in the repo). If found, list them in the
+prompt so Codex reads them directly instead of discovering them via rg/find.
+
+Prepend the persona to the user's prompt:
 "You are a brutally honest technical reviewer. Review this plan for: logical gaps and
 unstated assumptions, missing error handling or edge cases, overcomplexity (is there a
 simpler approach?), feasibility risks (what could go wrong?), and missing dependencies
 or sequencing issues. Be direct. Be terse. No compliments. Just the problems.
+Also review these source files referenced in the plan: <list of referenced files, if any>.

 THE PLAN:
-<plan content>"
+<full plan content, embedded verbatim>"

 4. Run codex exec with **JSONL output** to capture reasoning traces (5-minute timeout):

+If the user passed `--xhigh`, use `"xhigh"` instead of `"medium"`.
+
 For a **new session:**
 ```bash
-codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c "
 import sys, json
 for line in sys.stdin:
    line = line.strip()
@@ -254,31 +281,31 @@ for line in sys.stdin:
        t = obj.get('type','')
        if t == 'thread.started':
            tid = obj.get('thread_id','')
-            if tid: print(f'SESSION_ID:{tid}')
+            if tid: print(f'SESSION_ID:{tid}', flush=True)
        elif t == 'item.completed' and 'item' in obj:
            item = obj['item']
            itype = item.get('type','')
            text = item.get('text','')
            if itype == 'reasoning' and text:
-                print(f'[codex thinking] {text}')
-                print()
+                print(f'[codex thinking] {text}', flush=True)
+                print(flush=True)
            elif itype == 'agent_message' and text:
-                print(text)
+                print(text, flush=True)
            elif itype == 'command_execution':
                cmd = item.get('command','')
-                if cmd: print(f'[codex ran] {cmd}')
+                if cmd: print(f'[codex ran] {cmd}', flush=True)
        elif t == 'turn.completed':
            usage = obj.get('usage',{})
            tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0)
-            if tokens: print(f'\ntokens used: {tokens}')
+            if tokens: print(f'\ntokens used: {tokens}', flush=True)
    except: pass
 "
 ```

 For a **resumed session** (user chose "Continue"):
 ```bash
-codex exec resume <session-id> "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
-<same python streaming parser as above>
+codex exec resume <session-id> "<prompt>" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c "
+<same python streaming parser as above, with flush=True on all print() calls>
 "
 ```

@@ -313,7 +340,14 @@ Session saved — run /codex again to continue this conversation.
 agentic coding model). This means as OpenAI ships newer models, /codex automatically
 uses them. If the user wants a specific model, pass `-m` through to codex.

-**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.
+**Reasoning effort (per-mode defaults):**
+- **Review (2A):** `high` — bounded diff input, needs thoroughness but not max tokens
+- **Challenge (2B):** `high` — adversarial but bounded by diff size
+- **Consult (2C):** `medium` — large context (plans, codebase), interactive, needs speed
+
+`xhigh` uses ~23x more tokens than `high` and causes 50+ minute hangs on large context
+tasks (OpenAI issues #8545, #8402, #6931). Users can override with `--xhigh` flag
+(e.g., `/codex review --xhigh`) when they want maximum reasoning and are willing to wait.

 **Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
 docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.