v1.58.4.0 fix: high-priority community bug wave + PTY plan-mode smoke gate (#2077)

* fix(gbrain): stop forcing GBRAIN_PREPARE on transaction-mode poolers (#1965) buildGbrainEnv auto-set GBRAIN_PREPARE=true whenever DATABASE_URL targeted port 6543, and the /sync-gbrain capability check exported it for the rest of the skill run. Both had the semantics inverted: gbrain auto-disables prepared statements on transaction-mode poolers because they break every write there ("prepared statement does not exist"); GBRAIN_PREPARE=true is gbrain's documented override for SESSION-mode poolers on 6543, not a requirement for transaction mode. The #1435 search symptom the auto-set worked around was fixed gbrain-side. Remove both force-sets. A caller-set GBRAIN_PREPARE (either value) still passes through untouched, preserving the session-mode-on-6543 escape hatch. isTransactionModePooler stays exported. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(gbrain): classify probe timeout as its own status; sync proceeds instead of skipping (#1964) The 5s engine probe misclassified healthy-but-slow engines (cold Supabase pooler connections measured at 6.9-10.7s) as broken-config, so /sync-gbrain silently skipped code+memory and told the user their config was malformed. - New "timeout" status: probe killed at the deadline with no recognized stderr pattern. Default deadline is now 15s, overridable via GSTACK_GBRAIN_PROBE_TIMEOUT_MS (tests set 300ms against a fake that sleeps 2s). - Sync stages PROCEED on timeout with a stderr warning naming the env knob; a genuinely-dead engine surfaces its real error at the first operation instead of a false config diagnosis. - Consistency everywhere "ok" gated behavior: gstack-gbrain-detect --is-ok exits 0 on timeout, and gen-skill-docs' detection gate accepts it, so a slow engine no longer silently suppresses brain-aware features. - Status cache: key now includes the effective probe timeout (raising it invalidates a cached timeout) and GBRAIN_HOME; config detection honors GBRAIN_HOME so relocated-home users stop being misclassified as missing-config. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(bins): cygpath-normalize SCRIPT_DIR for bun imports; surface learnings-log errors (#1950) Under Windows git-bash, pwd yields a POSIX path (/c/Users/...) that Bun on Windows cannot resolve as an ES module specifier. gstack-learnings-log interpolates SCRIPT_DIR into a bun -e import, so every invocation died with "Cannot find module" — and 2>/dev/null swallowed the error, silently dropping every AI-logged learning for Windows users. - 3-line cygpath -m guard in gstack-learnings-log and gstack-question-log (which gains the same import shape in the next commit). Matches the duplicated IS_WINDOWS convention in setup; no shared shell lib exists. - learnings-log adopts question-log's set +e / TMPERR capture pattern wholesale: validation errors now print to stderr. The old `if [ $? -ne 0 ]` check was dead code under set -euo pipefail — the script exited at the failing assignment before reaching it. - New test/bin-windows-bun-import-paths.test.ts: static invariant (any bash bin interpolating $SCRIPT_DIR into a bun -e import must carry the guard) + behavioral end-to-end run invoked via `bash <bin>` — added to the windows-free-tests workflow list so the conversion is proven on the only platform where the bug exists. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(question-log): dedupe INJECTION_PATTERNS via lib/jsonl-store (#1934) bin/gstack-question-log carried a local copy of the injection-pattern list, so pattern fixes to lib/jsonl-store.ts never propagated — including the /override[:\s]/i false-positive fix arriving via community PR #1940. Import the shared hasInjection instead (enabled by the previous commit's cygpath guard). question-log also gets the lib's stricter superset (human:, disregard, from-now-on, approve-all patterns). Tests pin the contract in a #1940-order-independent way: an "Override: ignore all previous instructions" header is rejected, "prose overrides the deterministic table" is accepted, and a static invariant keeps local INJECTION_PATTERNS duplicates out of the bin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(security): community-pulse + both dashboards never report fake zeros (#1947) The security-signaling surface failed open at three layers — every failure mode read as a reassuring "0 attacks" / "0 installs": - community-pulse edge function: supabase-js returns {data,error} without throwing, and all five queries discarded `error` — a DB outage produced real-looking zeros via the SUCCESS path, and the catch (also returning zeros with HTTP 200) was unreachable for query failures. Every query now destructures and throws; the catch serves the stale cache (marked "stale": true) when one exists, else 503 {"error":"pulse_unavailable"}. Success responses carry "status":"ok" so clients can distinguish authoritative data from legacy backends. NOTE: the edge function deploys out-of-band (supabase functions deploy community-pulse). - gstack-security-dashboard: captures the HTTP status; non-200 / network failure / error body / missing section → "unknown — backend error"; jq missing → "unknown — install jq" (the lossy grep fallback broke on nested arrays and under-reported attacks as zero — removed); a 200 without the new marker shows figures with an "unverified (legacy backend)" note. Also fixes a latent display bug: the TOTAL grep matched the digit 7 inside "attacks_last_7_days" and misreported every count. - gstack-community-dashboard: same class — curl || echo "{}" plus grep || echo "0" printed "Weekly active installs: 0" on any failure. Now "unknown — backend error (HTTP N)". test/security-dashboard-fallback.test.ts pins the matrix (200+marker, 200-legacy, 503, network failure) x (jq present, jq absent) for both bins: "unknown" states never render as 0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(telemetry): redact error_message spans before they leave the machine (#1947) error_message was uploaded with only quote/newline escaping — stack traces and failed-API errors can embed credentials, private paths, and hostnames, and the sync path strips only _repo_slug/_branch. New lib/redact-engine.ts export redactFindingSpans(): replaces EVERY finding's span with <REDACTED-{id}> regardless of tier (applyRedactions is the interactive PII-only path and exits nonzero on credential findings, so it can't serve machine egress). Returns null when a span can't be located — callers drop the whole payload rather than risk a leak. gstack-telemetry-log pipes error_message through it at LOG time, so the local JSONL at rest is clean too; surrounding text survives for crash triage. FAIL CLOSED: bun missing, engine error, or non-JSON-string output all null the field. Tests pin: embedded ghp_ token → <REDACTED-github.pat> with context intact; redactor unavailable → null; raw bytes on disk never contain the token. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(redact): prepush guard fails closed on git failure; /ship owns hook install (#1946) Two gaps closed: 1. Fail closed. The git() helper returned "" on ANY non-zero exit or maxBuffer overflow (status null), addedLinesFor produced an empty string, and the push sailed through unscanned — fail-open on exactly the oversized-diff case where a large secret-bearing blob is most likely. The diff call now uses a strict variant that throws; main blocks with a clear message naming the GSTACK_REDACT_PREPUSH=skip escape valve. Probe calls (symbolic-ref, rev-parse, merge-base) keep the permissive helper — their failures are normal control flow. 2. Install path. The hook was installed by nothing ("opt-in, installed by nothing" was the issue's words). ./setup runs in the gstack checkout — the wrong repo for a per-project hook — so it gets a one-line hint only. /ship owns per-repo install: config redact_prepush_hook=true + hook missing → silent install (consent already given); config unset + no ~/.gstack/.redact-prepush-prompted marker → one-time machine-wide AskUserQuestion offer, answer persisted. ship/SKILL.md regenerated in this same commit (check-freshness bisect discipline). Tests: unscannable diff (bogus SHAs) → exit 1 + valve named; empty-but- successful diff → exit 0; static asserts pin setup as hint-only and the ship template as the installer surface. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(redact): six new credential patterns — GitLab, HuggingFace, npm, DigitalOcean, Bearer, GCP SA (#1946) Coverage gaps from the #1946 security review, including token types for tooling gstack itself drives (glab): HIGH (block): gitlab.token (glpat-/glptt-/gldt-), huggingface.token (hf_), npm.token (npm_), digitalocean.token (dop_v1_), gcp.service_account (the JSON-escaped "private_key" form that dodges pem.private_key's literal-block match when minified, confirmed by "private_key_id" proximity). MEDIUM (warn): auth.bearer — the most FP-prone shape in the set (docs are full of "Authorization: Bearer <token>"), so it requires header-context proximity and the same entropy>=3.0 + placeholder validator recipe as env.kv. "Bearer YOUR_TOKEN_HERE" never fires; calibration over coverage, per the cries-wolf principle. All shapes are linear-time; test/redact-pattern-lint.test.ts covers them automatically. Engine tests add positive + placeholder-negative cases per pattern. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: coverage-audit additions for the fix wave Ship Step 7 gap-fill (all passing, 248 tests across the touched suites): memory + dream stage probe-timeout proceeds, gbrain-detect override paths, stale-flag passthrough, 200-body-missing-.security fail-closed case, telemetry redaction edges, and credential-pattern edge cases. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: pre-landing review fixes Review army findings (1 critical, auto-fixed with regression tests): - CRITICAL (security specialist, verified live): redactFindingSpans spliced only the regex capture span, and pem.private_key / gcp.service_account capture just the BEGIN-header — the key body survived "redaction" and shipped via telemetry. Marker-only patterns now drop the whole payload (null, fail closed). Overlapping spans (Bearer+JWT on the same bytes) are coalesced before splicing so stale offsets can't leave partial secret bytes behind. - gitStrict: drop the dead `|| r.status === null` disjunct (null !== 0 already covers it); add the signal-kill/null-status regression test the docstring promised. - security-dashboard human mode flags stale snapshots ("figures may be out of date") instead of presenting frozen counts as current. - community-dashboard marker check uses jq when available — the grep-only variant misclassified whitespaced/reserialized bodies as legacy. - telemetry fail-closed test now shadows bun with a failing stub (deterministic on any host layout); stale "five status cases" describe title renamed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: adversarial review fixes (Claude + Codex cross-model passes) Both adversarial passes ran against the wave; every FIXABLE finding landed with a regression test: - probeTimeoutMs clamps to >=1ms: a fractional override floored to 0, and execFileSync treats timeout:0 as NO timeout — the probe that exists to bound hangs could hang forever (found by both models independently). - /ship silent hook install now requires the hooks dir to live inside .git: with core.hooksPath (husky's COMMITTED .husky/), the chaining installer would have renamed the team's committed pre-push and written a machine-local wrapper into the working tree (found by both models). - gstack-config gbrain-refresh accepts the "timeout" status — the last consumer still gating on literal "ok" (Codex); gstack-gbrain-detect's config-derived fields honor GBRAIN_HOME so the detection JSON can't report status ok alongside config_exists false (Codex). - prepush: a remote sha absent locally (shallow clone / stale fetch) falls back to the merge-base/empty-tree range — scans MORE, never blocks a legitimate push into training users toward --no-verify. - dashboards: curl's own 000 no longer doubles to "HTTP 000000"; the community dashboard flags stale snapshots like the security one; array sections parse via jq (the sed/grep loops truncated at the first ']'); the no-jq marker grep tolerates whitespace. - telemetry: multi-line redactor output nulls the field instead of corrupting the JSONL record; setup's hint fires only when the config key is genuinely unset (an explicit false is a recorded decline); the /ship prompt marker honors GSTACK_HOME. Kept as designed (cross-model tension noted): Bearer stays MEDIUM in the prepush gate — a HIGH Bearer would block every docs example; the entropy validator can't eliminate that FP class, and MEDIUM warns visibly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: bump version and changelog (v1.57.11.0) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: P1 TODO — eval harness live progress + incremental persistence Root-caused during this ship: a killed eval run was indistinguishable from a healthy one for hours (per-file output buffering across mega test files, no incremental eval-store writes, no honest liveness signal). Full context and starting points in the entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: fix operational-learning E2E fixture — copy lib/jsonl-store.ts Pre-existing breakage, proven on main: gstack-learnings-log has imported lib/jsonl-store.ts (shared injection patterns) since v1.57.5.0 / #1910, but the fixture copies only the bin scripts — the bin exits 1 before writing anything, on main silently (stderr swallowed) and on this branch loudly (the #1950 error-surfacing made the four-day-old failure visible). A real install always ships bin/ and lib/ together; the fixture now does too. Verified: the fixture-shaped invocation writes the learning (exit 0) with lib present, exits 1 on both main and this branch without it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(ios-qa): isolate E2E tests under --concurrent (3 real races) The ios-qa E2E file failed intermittently under `bun test --concurrent` (the eval harness default). Three distinct shared-state races, all fixed: 1. Shared pidfile: a module-level `workDir` reassigned in beforeEach was clobbered by parallel tests, so concurrent daemons collided on the same pidfile and the loser returned `already_running`. Each test now gets its own dir via makeWorkDir(). 2. process.env path globals: tests set GSTACK_IOS_AUDIT_PATH / _ATTEMPTS_PATH / _ALLOWLIST_PATH on the shared process env; concurrent tests stomped each other's audit/attempts destinations. Threaded auditPath/attemptsPath/allowlistPath through DaemonOptions (and mintForCaller) as explicit args — env is no longer load-bearing. 3. afterEach cleanup race: the per-test cleanup drained a shared dir array, so the first test to finish deleted still-running tests' workDirs mid-assertion. Moved to afterAll (cleans once, after all settle). Verified: 5/5 clean full-suite runs at --max-concurrency 15 (was intermittent); daemon unit suite 91/91; daemon source compiles. The paths default to the env-derived locations when options are omitted, so the production CLI path is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(pty): pin spawned claude to EVALS model chain (default claude-sonnet-4-6) launchClaudePty spawned the interactive `claude` TUI with no --model flag, so the child inherited the operator's ~/.claude/settings.json model. On a slow-thinking model that meant 5+ min of extended thinking on empty plan-mode context, timing out the plan-mode smoke tests regardless of contention. Pin the model via opts.model ?? EVALS_MODEL ?? 'claude-sonnet-4-6' — byte-identical to session-runner.ts:144, so PTY and `claude -p` evals always agree. Pushed before extraArgs (last flag wins, so a per-test --model still overrides). Placement leaves the spawn region byte-stable for a clean merge with the in-flight hermetic-env branch. Plumbed model through the three plan-skill wrappers. Static-grep tripwires guard the pin, its fallback chain, the before-extraArgs ordering, and all three wrapper forwards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(pty): detect markdown bold-bullet prose AUQs (fixes office-hours smoke) office-hours auto-mode renders its mode question as `- **Building a startup**` markdown bullets (office-hours/SKILL.md.tmpl:102) with no letter/number marker. isProseAUQVisible only matched `A)`-style lettered or `1.`-style numbered options, so the question went undetected: the model surfaced it at ~2m19s (well under the 300s budget) but the harness kept scoring the run "working" off the spinner glyphs and timed out — a false timeout on a question that was already on screen. Add Pattern 3: when an interrogative line ('?') is present AND 3+ bold-bullet markers (`- **`) appear in the 4KB tail, classify as a prose AUQ. Bold is the discriminator vs incidental prose bullets; the line anchor is dropped (stripAnsi can collapse option lines) and the existing `❯ 1.` cursor gate still defers to a live native list. Wires through the existing classifyVisible 'asked' path and the timeout high-water-mark, so office-hours now classifies 'asked' instead of 'timeout'. Five unit cases: the office-hours render passes; no-'?', <3-bullet, plain-bullet, and native-cursor cases stay false. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(pty): detect stripAnsi-collapsed prose AUQs + judge spinner-precedence The plan-eng/plan-design plan-mode + finding-floor smokes timed out even when the skill HAD rendered a complete prose AskUserQuestion and was waiting: the PTY strips cursor-positioning escapes, collapsing the option newlines/spaces so "A) ..." arrives as "A(recommended)" / "-B:" and "Reply with A, B, or C" as "ReplywithA,B,orC". Every line-anchored detector (Patterns 1-3) returns false on those bytes, so proseAUQEverObserved never latched and the run timed out on a question that was already on screen. Add Pattern 4/5: a two-signal collapsed-form detector — a reply/recommendation marker (space-insensitive "reply with [A-D]", "Recommendation:", or "(recommended)") AND 2+ distinct A-D letters each punctuated by ) : or (. The conjunction is what separates a real AUQ from incidental report prose; verified true on the verbatim failing-run buffers where Patterns 1-3 return false. Also fix the Haiku judge spinner bias: of 614 verdicts, 569 were 'working' and 95 of those noted a question was visible — Claude Code keeps the spinner animating at an idle prose decision, so the judge coin-flipped. Add a precedence override: when an option list AND a Recommendation/Reply instruction are both visible, classify WAITING even with spinner glyphs. Kept the strict dual-signal gate (never option-list-alone) so auto-decide-preserved doesn't flip. 5 unit tests pin the two-signal contract (2 true on real collapsed bytes, 3 false guards). 90 -> 95 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(plan-review): ask-first scope gate for plan-eng + plan-design review On an empty/cold invocation, plan-eng-review and plan-design-review would dive straight into repo exploration (plan-eng) or a 7-pass mockup+audit (plan-design) and only ask the user much later, if at all. plan-ceo-review already asks first via an unconditional Step-0 gate and behaves well; these two did not. Add a hard-STOP scope gate as the FIRST operational instruction in each skill (above the design-doc check / pre-review audit / mockup defaults it explicitly overrides): the first tool call must be AskUserQuestion confirming the review target, before any git/Read/Grep/Glob/Bash or mockup generation. Under --disallowedTools the options render as plain column-0 lettered prose with a Recommendation + "Reply with A, B, or C" line so the answer is detectable. This is correct cold-start UX (confirm what to review before grinding a full review on nothing) and it is the product half of the plan-mode smoke fix; the harness collapsed-form detector is the deterministic half that catches the ask however it renders. Templates + regenerated SKILL.md (default variant). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(tiers): reclassify stochastic plan-eng/plan-design ask-first smokes as periodic plan-eng-review and plan-design-review run a long explore/audit before their first AskUserQuestion, so whether the plan-mode + finding-floor smokes reach a terminal outcome within the 300s/600s budget depends on stochastic ask-first compliance (measured ~50-67%/run even with the hardened gate). Per the "non-deterministic -> periodic" tiering rule, move the four affected smokes (plan-eng/plan-design review-plan-mode + finding-floor) to periodic. The deterministic harness fix (collapsed-form detector + judge precedence) and the ask-first gate lift these from always-failing to mostly-passing and are the real product+harness improvements; periodic monitoring tracks the rate weekly without blocking PRs on an LLM coin-flip. plan-ceo/plan-devex ask-first reliably and stay gate-tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(evals): gate the deterministic PTY plan-mode smokes in CI The real-PTY plan-mode smokes never ran in CI — the gate was local-only. Add an e2e-pty-plan-smoke matrix suite running the two deterministically-reliable ones (office-hours-auto-mode, plan-mode-no-op) so a regression there blocks PRs. The stochastic plan-eng/plan-design ask-first smokes stay periodic (touchfiles E2E_TIERS) and are not CI-gated. A fresh CI container has no ~/.claude.json, so the spawned interactive `claude` would wedge on the onboarding + API-key-approval dialog. Add a scoped seed step (hasCompletedOnboarding + key approval, its own ANTHROPIC_API_KEY env) before the run — mirrors what the hermetic E2E child env seeds. Per-suite timeout override (35 min) via matrix.suite.timeout so the PTY suite has headroom for --retry 2 without bumping the other 12 suites. Report runner count 12 -> 13. Validate via workflow_dispatch before relying on the gate (PTY-in-CI is new). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(evals): install gstack skill registry for the PTY smoke suite The first dry-run of e2e-pty-plan-smoke failed: the spawned interactive `claude` printed "Unknown command: /plan-ceo-review". .claude/skills is gitignored, so a fresh CI checkout has no gstack skill registry and the TUI can't resolve /office-hours or /plan-ceo-review. Add a Register step (scoped to the suite, after Seed, before Run) that mirrors setup's --no-prefix user-scoped registry minimally: $HOME/.claude/skills/gstack -> repo (resolves the preambles' absolute ~/.claude/skills/gstack/bin/* and <skill>/sections/* paths) + per-skill SKILL.md/sections symlinks for the two skills these tests invoke. HOME is /github/home in this container and the runner adds no HOME/CLAUDE_CONFIG_DIR override (no hermetic mode), so $HOME is the right anchor — the Seed step already proved claude reads it. No ./setup (binary build + Chromium + fonts + /dev/tty prompt); SKILL.md + bin/ + sections/ are committed. Self-validating: fails the step loudly on a dangling symlink or missing `name:` frontmatter, so a moved target surfaces here instead of as a silent 35-min "Unknown command" timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.58.4.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-25 19:20:00 +02:00 · 2026-06-21 07:15:19 -07:00
parent a861c00cfa
commit 9fd03fae9e
54 changed files with 2376 additions and 248 deletions
@@ -71,6 +71,15 @@ export interface ClaudePtyOptions {
  permissionMode?: 'plan' | 'default' | 'acceptEdits' | 'bypassPermissions' | 'auto' | 'dontAsk' | null;
  /** Extra args after the permission-mode flag. */
  extraArgs?: string[];
+  /**
+   * Model for the spawned interactive `claude`. Without an explicit --model the
+   * child inherits the operator's ~/.claude/settings.json model (e.g.
+   * claude-fable-5[1m]), which can spend 5+ min in extended thinking on an empty
+   * plan-mode context and blow every smoke budget. Resolution mirrors
+   * session-runner.ts:144 exactly: opts.model ?? EVALS_MODEL ?? 'claude-sonnet-4-6'.
+   * Pushed BEFORE extraArgs so a test-supplied --model still wins (last flag wins).
+   */
+  model?: string;
  /** Terminal size. Default 120x40. Plan-mode UI lays out cleanly at this size. */
  cols?: number;
  rows?: number;
@@ -406,8 +415,10 @@ export function judgePtyState(
  const prompt = `You are reading a snapshot of a terminal where Claude Code is running in plan mode for an automated test. Your job: classify the agent's current state.

 Pick exactly ONE:
- WAITING — agent surfaced a question or option list and is sitting at the input prompt waiting for user reply. Signs: numbered/lettered options visible (1./2./3. or A)/B)/C)), "Recommendation:" line, cursor at empty input prompt with no recent generation activity.
+- WAITING — agent surfaced a question or option list and is sitting at the input prompt waiting for user reply. Signs: numbered/lettered options visible (1./2./3. or A)/B)/C)), "Recommendation:" line, cursor at empty input prompt with no recent generation activity, OR a fully-rendered question + reply-instruction (e.g. "Reply with A, B, or C" / "Recommendation:") is visible.
 - WORKING — agent is actively generating or running tools. Signs: spinner glyphs (✻ ✶ ✳ ✢ ✽), "Musing..." or "Churned for ..." text, recent tool-call blocks (Read/Edit/Bash/Grep), in-flight token output.
+
+PRECEDENCE OVERRIDE: if a lettered/numbered option list (A)/B)/1./2.) AND a "Recommendation:" or "Reply with"/"Reply A" instruction are BOTH visible in this snapshot, classify WAITING even when spinner glyphs (✻ ✶ ✳ ✢ ✽) are still animating — Claude Code keeps the spinner up at an idle prose decision, so a spinner alongside a fully-rendered question + reply-instruction is a residual render artifact, not active generation.
 - HUNG — agent has stopped without surfacing a question and without any spinner/work activity. Rare; usually means a crash.

 Respond with strict JSON ONLY (no markdown fences, no prose):
@@ -503,6 +514,16 @@ ${tail}
 *     for plan-eng / plan-design / plan-devex prose AUQ
 *   - 3+ distinct numbered options (1. 2. 3.) at line starts WITHOUT a
 *     `❯<spaces>1.` cursor — typical for autoplan / office-hours prose AUQ
+ *   - 3+ markdown bold-bullet options (`- **label**`) following an
+ *     interrogative line — office-hours renders its mode question this way
+ *     (`> - **Building a startup**`), which has no letter/number marker
+ *   - Pattern 4/5 (collapsed-form): a reply-instruction OR recommendation
+ *     marker PLUS 2+ distinct A-D letter markers each punctuated by ) : or (
+ *     anywhere in the tail. stripAnsi destroys the newlines + inter-word
+ *     spaces that the line-anchored patterns above need, so a real prose AUQ
+ *     arrives collapsed ("ReplywithA,B,orC", "A(recommended)", "-B:") and is
+ *     invisible to Patterns 1-3. This is the dominant Shape-B render mode in
+ *     the plan-design smoke + floor timeouts (verified against real run bytes).
 *
 * Used by classifyVisible and runPlanSkillFloorCheck to return outcome='asked'
 * (or auq_observed) instead of letting the harness time out when the model
@@ -546,7 +567,55 @@ export function isProseAUQVisible(visible: string): boolean {
  while ((nm = numberedRe.exec(tail)) !== null) {
    if (nm[1]) numberedHits.add(nm[1]);
  }
-  return numberedHits.size >= 2;
+  if (numberedHits.size >= 2) return true;
+
+  // Pattern 3: markdown bold-bullet option list. office-hours renders its
+  // mode question as `> - **Building a startup**` lines under
+  // --disallowedTools — no letter/number marker, so Patterns 1-2 miss it,
+  // and the model keeps a spinner up so the Haiku judge scores it 'working'
+  // and the run times out despite the question being on screen.
+  // Require both: an interrogative line (the question stem ends in '?') AND
+  // 3+ bold-bullet markers. The bold (`- **`) requirement is what separates
+  // an option list from incidental prose bullets; the line anchor is dropped
+  // because stripAnsi can collapse option lines (see Pattern 1 note), so we
+  // count markers anywhere in the tail. The `❯ 1.` cursor gate above already
+  // excludes a live native list.
+  if (/\?/.test(tail)) {
+    const boldBulletHits = (tail.match(/[-*•]\s+\*\*/g) || []).length;
+    if (boldBulletHits >= 3) return true;
+  }
+
+  // Pattern 4/5: collapsed-form prose AUQ. stripAnsi removes the
+  // cursor-positioning escapes that render option newlines + inter-word
+  // spaces, so "Reply with A, B, or C" arrives as "ReplywithA,B,orC" and
+  // "A) ..." as "A(recommended)" / "-B:" — defeating every line-anchored or
+  // ')'-anchored pattern above (Patterns 1-3 all return false on the real
+  // plan-design smoke + floor timeout bytes). Detect via two INDEPENDENT
+  // signals that must BOTH hold — the corroboration is what separates a real
+  // AUQ from incidental report prose that happens to mention a recommendation:
+  //   (1) a reply-instruction matched space-insensitively OR a recommendation
+  //       marker, AND
+  //   (2) 2+ distinct A-D letter markers each punctuated by ) : or ( anywhere
+  //       in the tail.
+  // A single 'B)' + the word "recommendation", or a comma-only collapsed
+  // "ReplywithA,B,orC" with no )/:/( punctuation on the letters, both stay
+  // false — the two-signal contract is pinned by unit tests.
+  const replyOrRec =
+    /reply\s*(?:with)?\s*[A-D]/i.test(tail) ||
+    /reply(?:with)?[A-D]/i.test(tail.replace(/\s+/g, '')) ||
+    /\bRecommendation\s*:/i.test(tail) ||
+    /\(recommended\)/i.test(tail);
+  if (replyOrRec) {
+    const collapsedLetterRe = /\b([A-D])[):(]/g;
+    const collapsedHits = new Set<string>();
+    let cm: RegExpExecArray | null;
+    while ((cm = collapsedLetterRe.exec(tail)) !== null) {
+      if (cm[1]) collapsedHits.add(cm[1]);
+    }
+    if (collapsedHits.size >= 2) return true;
+  }
+
+  return false;
 }

 /**
@@ -1145,9 +1214,15 @@ export async function launchClaudePty(
  let exited = false;
  let exitCodeCaptured: number | null = null;

+  const args: string[] = [];
+  // Pin the model so smokes don't inherit the operator's settings.json model
+  // (see ClaudePtyOptions.model). Chain mirrors session-runner.ts:144 so PTY and
+  // `claude -p` evals always agree. Pushed before extraArgs => a test-supplied
+  // --model wins (last flag wins).
+  const model = opts.model ?? process.env.EVALS_MODEL ?? 'claude-sonnet-4-6';
+  args.push('--model', model);
  // Permission mode: 'plan' default, null => omit flag entirely.
  const permissionMode = opts.permissionMode === undefined ? 'plan' : opts.permissionMode;
-  const args: string[] = [];
  if (permissionMode !== null) {
    args.push('--permission-mode', permissionMode);
  }
@@ -1498,6 +1573,9 @@ export async function runPlanSkillObservation(opts: {
   * Step 0 reads the prior conversation context so it sees the draft.
   */
  initialPlanContent?: string;
+  /** Override the spawned model. Defaults via launchClaudePty's chain
+   *  (opts.model ?? EVALS_MODEL ?? 'claude-sonnet-4-6'). */
+  model?: string;
 }): Promise<PlanSkillObservation> {
  const startedAt = Date.now();
  const session = await launchClaudePty({
@@ -1506,6 +1584,7 @@ export async function runPlanSkillObservation(opts: {
    timeoutMs: (opts.timeoutMs ?? 180_000) + 30_000,
    extraArgs: opts.extraArgs,
    env: opts.env,
+    model: opts.model,
  });

  try {
@@ -1762,6 +1841,8 @@ export async function runPlanSkillCounting(opts: {
  timeoutMs?: number;
  /** Extra env merged into the spawned `claude` process. */
  env?: Record<string, string>;
+  /** Override the spawned model. Defaults via launchClaudePty's chain. */
+  model?: string;
 }): Promise<PlanSkillCountObservation> {
  const startedAt = Date.now();
  const defaultPick = opts.defaultPick ?? 1;
@@ -1772,6 +1853,7 @@ export async function runPlanSkillCounting(opts: {
    cwd: opts.cwd,
    timeoutMs: timeoutMs + 60_000,
    env: opts.env,
+    model: opts.model,
  });

  const fingerprints: AskUserQuestionFingerprint[] = [];
@@ -1993,6 +2075,8 @@ export async function runPlanSkillFloorCheck(opts: {
  timeoutMs?: number;
  /** Extra env merged into the spawned `claude` process. */
  env?: Record<string, string>;
+  /** Override the spawned model. Defaults via launchClaudePty's chain. */
+  model?: string;
 }): Promise<PlanSkillFloorObservation> {
  const startedAt = Date.now();
  const timeoutMs = opts.timeoutMs ?? 600_000;
@@ -2002,6 +2086,7 @@ export async function runPlanSkillFloorCheck(opts: {
    cwd: opts.cwd,
    timeoutMs: timeoutMs + 60_000,
    env: opts.env,
+    model: opts.model,
  });

  try {
@@ -23,6 +23,7 @@
 */

 import { describe, test, expect } from 'bun:test';
+import { readFileSync } from 'node:fs';
 import {
  isPermissionDialogVisible,
  isNumberedOptionListVisible,
@@ -290,6 +291,109 @@ This refers to (see option B) above and also to point A) earlier.
    expect(isProseAUQVisible('Just some plain text output from the model.')).toBe(false);
    expect(isProseAUQVisible('')).toBe(false);
  });
+
+  // Pattern 3: markdown bold-bullet options — office-hours renders its mode
+  // question this way under --disallowedTools, with no letter/number marker.
+  test('matches office-hours markdown bold-bullet mode question (Pattern 3)', () => {
+    const sample = `
+> Before we dig in — what's your goal with this?
+>
+> - **Building a startup** (or thinking about it)
+> - **Intrapreneurship** — internal project at a company, need to ship fast
+> - **Hackathon / demo** — time-boxed, need to impress
+> - **Open source / research** — building for a community
+> - **Learning** — teaching yourself to code
+❯
+`;
+    expect(isProseAUQVisible(sample)).toBe(true);
+  });
+
+  test('bold-bullets require a preceding interrogative — no "?" => false', () => {
+    // 3+ bold bullets but no question stem: this is a feature list, not an AUQ.
+    const sample = `
+Here is what shipped:
+- **Faster builds** via caching
+- **Smaller binaries** through tree-shaking
+- **Better errors** with source maps
+`;
+    expect(isProseAUQVisible(sample)).toBe(false);
+  });
+
+  test('a question with fewer than 3 bold bullets stays false (guard)', () => {
+    const sample = `
+Which approach do you prefer?
+- **Option one** is simpler
+- **Option two** is faster
+`;
+    expect(isProseAUQVisible(sample)).toBe(false);
+  });
+
+  test('plain (non-bold) bullets after a question do not trigger Pattern 3', () => {
+    // Only bold bullets count — plain "- text" prose lists are too common.
+    const sample = `
+What should we do about this?
+- run the tests
+- ship the fix
+- file a follow-up
+`;
+    expect(isProseAUQVisible(sample)).toBe(false);
+  });
+
+  test('Pattern 3 still defers to a live native cursor list (❯ 1.)', () => {
+    const sample = `
+> What's your goal?
+❯ 1. **Building a startup**
+  2. **Intrapreneurship**
+  3. **Hackathon**
+`;
+    // The ❯1. cursor gate fires first — native list handling owns this.
+    expect(isProseAUQVisible(sample)).toBe(false);
+  });
+
+  // Pattern 4/5: collapsed-form prose AUQ. stripAnsi destroys the newlines +
+  // inter-word spaces, so a real prose AUQ arrives collapsed and defeats the
+  // line-anchored Patterns 1-3. These are the dominant Shape-B render mode in
+  // the plan-design smoke + floor timeouts — verbatim de-spinnered bytes from
+  // the real failing runs (bdm3sucql.output).
+  test('matches the real collapsed floor render (colon-delimited, Pattern 4/5)', () => {
+    const sample =
+      'The review is blocked on D1—reply withA, B, r Cabovetocontinue:' +
+      '- A(recommended): Spec thefull P1AskUserQuestioncopy in this review' +
+      '-B:LeaveP1copytotheimplementerwithstructuralrequirements' +
+      'C: Add a placeholder template to the plan';
+    expect(isProseAUQVisible(sample)).toBe(true);
+  });
+
+  test('matches the real collapsed plan-mode render (Recommendation + collapsed A)/B), Pattern 4/5)', () => {
+    const sample =
+      'Recommendation:A—writethecopynow.(recommended)A) Writ the fullcopy in thisdesign review— now.' +
+      '(recommended) Completeness:10/10 B) Leveit to theimplemente — task spec is enough.' +
+      'Reply withA (write the copy now)orB(leavetoimplementer)';
+    expect(isProseAUQVisible(sample)).toBe(true);
+  });
+
+  test('collapsed-form requires BOTH signals — single B) + word "recommendation" stays false', () => {
+    // Only one punctuated letter marker: the two-signal contract is not met.
+    const sample =
+      'We should consider option B) here. My recommendation is to do it now.';
+    expect(isProseAUQVisible(sample)).toBe(false);
+  });
+
+  test('collapsed-form requires letter punctuation — comma-only "ReplywithA,B,orC" stays false', () => {
+    // Reply-instruction present, but the letters carry no ) : or ( punctuation,
+    // so they could be incidental enumerations in running prose. Stays false.
+    const sample = 'ReplywithA,B,orC';
+    expect(isProseAUQVisible(sample)).toBe(false);
+  });
+
+  test('collapsed-form does not regress the existing FP guard (see option B) ... point A))', () => {
+    // The classic citation FP: a model referencing prior options in prose.
+    // No reply-instruction / recommendation marker on its own line, so the
+    // collapsed-form signal does not fire either.
+    const sample =
+      'As noted (see option B) above, and the earlier point A) we discussed, this is fine.';
+    expect(isProseAUQVisible(sample)).toBe(false);
+  });
 });

 describe('classifyVisible (runtime path through the runner classifier)', () => {
@@ -552,6 +656,45 @@ describe('runPlanSkillObservation env passthrough surface', () => {
  });
 });

+describe('launchClaudePty model pin (static tripwire)', () => {
+  // Why static-grep, not a behavioral assert: the spawn fires immediately
+  // inside launchClaudePty, so asserting the built args array would require
+  // extracting an arg-builder seam — which rewrites the exact region kyoto-v5's
+  // hermetic --strict-mcp-config insertion edits, reintroducing a merge
+  // conflict the placement deliberately avoids. The end-to-end behavioral proof
+  // is the live PTY smoke (skill-e2e-plan-*-plan-mode.test.ts) running under the
+  // pinned model. These grep-level guards stop a refactor from silently
+  // dropping the pin or reordering it past extraArgs.
+  const src = readFileSync(new URL('./claude-pty-runner.ts', import.meta.url), 'utf-8');
+
+  test('ClaudePtyOptions exposes model?: string', () => {
+    const opts: ClaudePtyOptions = { model: 'claude-sonnet-4-6' };
+    expect(opts.model).toBe('claude-sonnet-4-6');
+  });
+
+  test('spawn args push --model from the EVALS_MODEL fallback chain', () => {
+    expect(src).toContain("args.push('--model', model)");
+    // opts.model -> EVALS_MODEL -> 'claude-sonnet-4-6' (mirrors session-runner.ts:144)
+    expect(src).toMatch(
+      /opts\.model\s*\?\?\s*process\.env\.EVALS_MODEL\s*\?\?\s*'claude-sonnet-4-6'/,
+    );
+  });
+
+  test('--model is pushed BEFORE extraArgs so a per-test --model override wins', () => {
+    const modelPush = src.indexOf("args.push('--model', model)");
+    const extraArgsPush = src.indexOf('if (opts.extraArgs) args.push(...opts.extraArgs)');
+    expect(modelPush).toBeGreaterThan(-1);
+    expect(extraArgsPush).toBeGreaterThan(-1);
+    expect(modelPush).toBeLessThan(extraArgsPush);
+  });
+
+  test('all three plan-skill wrappers forward model to launchClaudePty', () => {
+    // Count must match the number of wrappers (observation, counting, floor).
+    const forwards = src.match(/^\s*model: opts\.model,$/gm) ?? [];
+    expect(forwards.length).toBe(3);
+  });
+});
+
 // ────────────────────────────────────────────────────────────────────────────
 // Per-finding count primitives — Section 3 unit tests #1–#5, #7, #12.
 // ────────────────────────────────────────────────────────────────────────────
@@ -516,10 +516,16 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  'plan-eng-coverage-audit': 'gate',
  'plan-review-report': 'gate',

-  // Plan-mode handshake — deterministic safety regression, gate-tier
+  // Plan-mode handshake. plan-ceo/plan-devex ask-first reliably (gate-tier);
+  // plan-eng/plan-design run a long explore/audit before their first
+  // AskUserQuestion, so whether they reach a terminal outcome within the 300s
+  // budget hinges on stochastic ask-first compliance (~50-67%/run measured).
+  // Per the "non-deterministic -> periodic" tiering rule they are periodic:
+  // the hardened ask-first gate + the collapsed-form detector lifted them from
+  // always-failing to mostly-passing, but they are not deterministic gates.
  'plan-ceo-review-plan-mode': 'gate',
-  'plan-eng-review-plan-mode': 'gate',
-  'plan-design-review-plan-mode': 'gate',
+  'plan-eng-review-plan-mode': 'periodic',
+  'plan-design-review-plan-mode': 'periodic',
  'plan-devex-review-plan-mode': 'gate',
  'plan-mode-no-op': 'gate',
  // v1.21+ auto-mode regression tests
@@ -549,9 +555,9 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  'plan-eng-finding-count':    'periodic',
  'plan-design-finding-count': 'periodic',
  'plan-devex-finding-count':  'periodic',
-  'plan-eng-finding-floor':    'gate',
+  'plan-eng-finding-floor':    'periodic',  // stochastic ask-first (see plan-mode-handshake note); periodic
  'plan-ceo-finding-floor':    'gate',
-  'plan-design-finding-floor': 'gate',
+  'plan-design-finding-floor': 'periodic',  // stochastic ask-first (see plan-mode-handshake note); periodic
  'plan-devex-finding-floor':  'gate',
  'plan-eng-multi-finding-batching': 'periodic',
  'plan-ceo-split-overflow': 'periodic',