mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-27 05:22:29 +02:00
7ca04d8ef0
* fix(gstack-paths): guard CLAUDE_PLUGIN_DATA against cross-plugin contamination (#1569) gstack-paths previously trusted CLAUDE_PLUGIN_DATA as a fallback for GSTACK_STATE_ROOT whenever GSTACK_HOME was unset. When another plugin (e.g. Codex) persists its own CLAUDE_PLUGIN_DATA into the session env via CLAUDE_ENV_FILE, gstack picked it up and wrote checkpoints, analytics, and learnings into that plugin's directory. Anyone with the Codex plugin installed alongside gstack hit this silently. Fix: guard the CLAUDE_PLUGIN_DATA branch so it only fires when CLAUDE_PLUGIN_ROOT confirms we're running as the gstack plugin (path contains "gstack"). Skill installs fall through to \$HOME/.gstack. Contributed by @ElliotDrel via #1570. Closes #1569. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(gbrain-sync): sourceLocalPath handles wrapped {sources:[...]} shape from gbrain v0.20+ gbrain v0.20+ changed `gbrain sources list --json` to return {sources: [...]} instead of a flat array. sourceLocalPath crashed upstream with `list.find is not a function` on every /sync-gbrain invocation against modern gbrain. Accept both shapes for forward/backward compat, matching probeSource/sourcePageCount in lib/gbrain-sources.ts. Contributed by @jakehann11 via #1571. Closes #1567. Supersedes #1564 (@tonyjzhou, same fix, different shape — credit retained). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(brain-context-load): probe gbrain via execFile, not shell builtin (#1559) gbrainAvailable() used `execFileSync("command", ["-v", "gbrain"])`, which fails in any environment where the `command` builtin isn't on the spawned process's PATH (most non-interactive shells). The probe then reported gbrain as missing even when it was installed, and context-load silently skipped vector/list queries. Fix: probe `gbrain --version` directly with a 500ms timeout (matching the rest of the file's MCP_TIMEOUT_MS). Same semantics, works everywhere execFile works. Contributed by @jbetala7 via #1560. Closes #1559. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(gbrain-doctor): pin schema_version:2 doctor parse path (#1418) Adds an exec-path regression test that runs a fake gbrain shim emitting the v0.25+ doctor JSON shape (schema_version: 2, status: "warnings", exit 1 for health_score < 100, no top-level `engine` field). Confirms freshDetectEngineTier recovers stdout from the non-zero exit and falls back to GBRAIN_HOME/config.json for the engine label. The pre-existing test for #1415 only stripped gbrain from PATH; this test exercises the actual doctor parse path, closing the gap that codex's plan review flagged. Also documents the schema_version separation in lib/gbrain-local-status.ts: the local CacheEntry stays at version 1, distinct from the doctor-output schema_version which we accept across versions in gstack-memory-helpers. Closes #1418 (credit @mvanhorn for surfacing the doctor + schema_v2 collapse). The fix landed pre-emptively in v1.29.x; this commit pins it with a stronger test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(memory-ingest): pin put_page regression + scrub stale name from --help and comments (#1346) #1346 reported that gstack-memory-ingest still called the renamed gbrain put_page subcommand on gbrain v0.18+. The actual code migrated to `gbrain put` and later to batch `gbrain import <dir>` before this report landed — only documentation lag remained. This commit: - Updates the --help string ("Skip gbrain put calls (still updates state file)") so user-facing docs match the shipped subcommand - Updates two inline comments that still referenced the old name - Adds test/memory-ingest-no-put_page.test.ts: a regression pin that strips comments from bin/gstack-memory-ingest.ts and fails the build if "put_page" appears in any active code or string literal, plus a sanity check that the file still calls a supported gbrain page-write verb (put or import) Closes #1346. Reporter @kylma-code surfaced the doc lag; the original code migration credit is on the v1.27.x wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(resolvers): rewrite all gbrain put_page instructions to canonical put <slug> scripts/resolvers/gbrain.ts emitted user-facing copy-paste instructions using the renamed `gbrain put_page` subcommand across 10 skills (office-hours, investigate, plan-ceo-review, retro, plan-eng-review, ship, cso, design-consultation, fallback, entity-stub). Every gstack user copying those snippets hit "unknown command: put_page" on gbrain v0.18+. This commit: - Rewrites all 10 instruction templates to use `gbrain put <slug> --content "$(cat <<EOF...EOF)"` with title/tags moved into YAML frontmatter inside --content, matching the v0.18+ subcommand shape - Updates README.md and USING_GBRAIN_WITH_GSTACK.md "common commands" table to reference `gbrain put` and `gbrain get` - Adds test/resolvers-gbrain-put-rewrite.test.ts pinning two invariants: (a) resolver source ships only canonical instructions, (b) every tracked SKILL.md file is free of `gbrain put_page` CHANGELOG entries are deliberately left untouched (historical record). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build): extract package.json build to scripts/build.sh for Windows Bun compat (#1538, #1537, #1530, #1457, #1561) Bun's Windows shell parser rejects multiple constructs the inline package.json build chain used: brace groups `{ cmd; }`, subshells with redirection `( git ... ) > path/.version`, and (in Bun 1.3.x) subshells near redirections in general. Every Windows install + every auto-upgrade since v1.34.2.0 has failed on `bun run build`. Extracts the build chain to scripts/build.sh and the .version writes to scripts/write-version-files.sh. POSIX-portable, no Bun shell parsing involved. Also adds Windows-specific bun.exe handling for non-ASCII PATHs (a separate Windows footgun where Bun's --compile fails when the binary lives under a path with non-ASCII chars). Updates test/build-script-shell-compat.test.ts to assert the new shape: no subshells with redirections anywhere in the build chain, and build delegates to scripts/build.sh which delegates .version writes. Contributed by @Charlie-El via #1544. Supersedes #1531 (@scarson, fixed in build helper), #1480 (@mikepsinn, partial overlap), #1460 (@realcarsonterry, brace-group fix subsumed) — credit retained. Closes #1538, #1537, #1530, #1457, #1561. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): .exe glob in .gitignore + .exe extension resolution in find-browse (#1554) bun build --compile on Windows appends .exe to the output filename, producing browse.exe instead of browse. find-browse's existsSync probe only checked the bare path and returned null on Windows even when the binary was correctly built. .gitignore similarly only excluded the bare bin/gstack-global-discover path, leaving the .exe variant tracked. This commit: - .gitignore: changes `bin/gstack-global-discover` → `bin/gstack-global-discover*` so the Windows .exe variant is ignored - browse/src/find-browse.ts: adds isExecutable + findExecutable helpers that fall back to .exe/.cmd/.bat probing on Windows, mirroring the same helper already in make-pdf/src/browseClient.ts and pdftotext.ts Contributed by @Mike-E-Log via #1554. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(windows): add fresh-install E2E gate that runs bun run build on windows-latest Adds .github/workflows/windows-setup-e2e.yml as the gate that catches Bun shell-parser regressions in the build chain before they reach users. Triggers on PRs touching package.json, scripts/build.sh, scripts/write-version-files.sh, setup, browse cli/find-browse, or gstack-paths. What it verifies: 1. bun run build completes on Windows (the previously-broken path that #1538/#1537/#1530/#1457/#1561 reported) 2. All compiled binaries land on disk (browse.exe, find-browse.exe, design.exe, gstack-global-discover.exe) 3. find-browse resolves to the .exe variant on Windows (regression gate for #1554) 4. gstack-paths returns non-empty GSTACK_STATE_ROOT/PLAN_ROOT/TMP_ROOT on Windows (regression gate for #1570) Complements the existing windows-free-tests.yml (curated unit subset); this new workflow exercises the install path itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(codex): move diff scope into prompt instead of --base (Codex CLI 0.130+ argv conflict) (#1209) Codex CLI ≥ 0.130.0 rejects passing a custom prompt and --base together (mutually exclusive at argv level). Every /codex review, /review, and /ship structured Codex review call ended with an argv error before the model ran. Fix: scope the diff in prompt text using "Run git diff origin/<base>...HEAD 2>/dev/null || git diff <base>...HEAD" instead of `--base <base>`. Preserves the filesystem boundary instruction across all invocations and keeps Codex's review prompt tuning. Touches: - codex/SKILL.md.tmpl + regenerated codex/SKILL.md - scripts/resolvers/review.ts + regenerated review/SKILL.md, ship/SKILL.md - test/gen-skill-docs.test.ts: new regression that fails if any of the five known files still contain the prompt+--base shape - test/skill-validation.test.ts: corresponding negative + positive pin on the rendered SKILL.md files Contributed by @jbetala7 via #1209. Closes #1479. Supersedes #1527 (@mvanhorn — same intent, different patch shape, CONFLICTING) and #1449 (@Gujiassh — broader refactor, CONFLICTING). Credit retained in CHANGELOG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review): diff from git merge-base, not git diff origin/<base> (#1492) git diff origin/<base> shows everything since the common ancestor in both directions — it includes commits that landed on origin/<base> after this branch was created as deletions. That made /review and /ship's pre-landing structured review report inflated diff totals and flagged "removed" code that was actually still present in the working tree. Fix: compute DIFF_BASE via git merge-base origin/<base> HEAD and diff the working tree against that point. Same coverage of uncommitted edits, no phantom deletions from out-of-order base advancement. Applies to /review's Step 1 (diff existence check), Step 3 (get the diff), the build-on-intent scope-creep check, the structured review DIFF_INS/DIFF_DEL stats, and the Claude adversarial subagent prompt. Same change flows into ship/SKILL.md via the shared resolver. Touches: - review/SKILL.md.tmpl + regenerated review/SKILL.md, ship/SKILL.md - scripts/resolvers/review.ts - scripts/resolvers/review-army.ts Contributed by @mvanhorn via #1492. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(codex): pin filesystem-boundary preservation across all codex review surfaces (#1503, #1522) #1503 reported that the bare codex review --base path stripped the filesystem boundary instruction, letting Codex spend tokens reading .claude/skills/ and agents/. #1522 proposed adding a skill-path detector that switched to the custom-instructions route when the diff touched skill files. After C10 (#1209) restructured codex review to always carry the boundary in the prompt (the prompt+--base argv conflict forced the restructure), the skill-path detector becomes redundant — every default call already preserves the boundary. This commit pins the post-#1209 invariant with a test that fails the build if any future refactor strips the boundary from codex/SKILL.md, review/SKILL.md, or ship/SKILL.md. Closes #1503 by regression test. #1522 (@genisis0x) is superseded by #1209 (the prompt rewrite covers its safety concern); credit retained in CHANGELOG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(skills): use command -v instead of which for codex detection (#1197) `which` is not on PATH in every shell — some Windows shells, BusyBox- only containers, and minimal CI images all fail when skills probe codex availability via `which codex`. `command -v` is a POSIX builtin and always available where the skill is running. Touched: - codex/SKILL.md.tmpl: CODEX_BIN=$(command -v codex || echo "") - scripts/resolvers/review.ts and scripts/resolvers/design.ts: 3 + 3 sites each rewritten to `command -v codex >/dev/null 2>&1` - Regenerated all 10 affected SKILL.md files (codex, review, ship, design-consultation, design-review, office-hours, plan-ceo-review, plan-design-review, plan-devex-review, plan-eng-review) - test/skill-validation.test.ts: updated pin + defensive regression test that fails if `which codex` returns to codex/SKILL.md - test/skill-e2e-plan.test.ts: updated summary regex Contributed by @mvanhorn via #1197. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(codex): surface non-zero exits so wrappers stop reading as silent stalls (#1467, #1327) When codex exits non-zero (parse errors, arg-shape breaks, model API errors that propagate as non-zero status), the calling agent previously saw an empty output and burned 30-60 minutes misdiagnosing as a silent model/API stall. The hang-detection block only caught exit 124 (the timeout-wrapper signal). Adds elif blocks in all four codex invocation sites (Review default, Challenge, Consult new-session, Consult resume) that: - Echo "[codex exit N] <stderr first line>" to stdout - Indent the first 20 stderr lines for inline context - Log codex_nonzero_exit telemetry tagged with the call site Contributed by @genisis0x via #1467. Closes #1327. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(design): disclose OpenAI key source + warn on cwd .env match (#1278, closes #1248) The design binary previously called process.env.OPENAI_API_KEY without checking where the key came from. If a user ran $D inside someone else's project that had OPENAI_API_KEY in its .env, the resulting generation billed that project's account. Silent and irreversible. Fix: resolveApiKeyInfo() returns both the key and its source. When the env-var path matches an OPENAI_API_KEY entry in the current directory's .env, .env.<NODE_ENV>, or .env.local file, we set a warning. requireApiKey() prints "Using OpenAI key from <source>" plus the warning before the run — never the key itself. Adds 6 unit tests covering: config-vs-env precedence, env-only (no match), env+cwd .env match, quoted/exported values, value-mismatch (no false positive), and the no-leak invariant for requireApiKey stderr output. Contributed by @jbetala7 via #1278. Closes #1248. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(browse): guard full-page screenshots against Anthropic vision API >2000px brick (#1214) Full-page screenshots of tall pages routinely exceeded 2000px on the longest dimension, silently bricking the agent's session: the resulting base64 reached the Anthropic vision API which rejected the oversized image, leaving the agent burning turns on a useless blob with no stderr trace from the browse side. Adds browse/src/screenshot-size-guard.ts as a shared helper: - guardScreenshotBuffer(buf) → downscales in-memory if max(w,h) > 2000 - guardScreenshotPath(path) → file-mode variant that rewrites in place - Aspect ratio preserved via sharp's resize fit:inside - Stderr diagnostic on any downscale so callers can see when it fired - Lazy sharp import so non-screenshot paths pay no startup cost Wires the guard into all three full-page callsites codex review flagged: - browse/src/snapshot.ts: annotated + heatmap fullPage captures - browse/src/meta-commands.ts: screenshot command (path + base64 fullPage modes) plus the responsive 3-viewport sweep - browse/src/write-commands.ts: prettyscreenshot fullPage path Covers seven unit cases (pass-through, downscale, aspect ratio, exactly-2000px edge, file-mode rewrite) plus a static invariant test that fails the build if any of the three callsites stops importing the guard. Closes #1214. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add Node sidecar entry for L4 prompt-injection classifier (#1370) The L4 TestSavant classifier in browse/src/security-classifier.ts can't be imported into the compiled browse server (onnxruntime-node dlopen fails from Bun's compile extract dir per CLAUDE.md). The agent that used to host it (sidebar-agent.ts) was removed when the PTY proved out — leaving the classifier file shipped but with zero callers. Exactly the gap codex flagged in #1370. Adds browse/src/security-sidecar-entry.ts: a Node script that runs the classifier as a subprocess of the browse server. It reads NDJSON requests from stdin and writes id-correlated NDJSON responses to stdout, supporting: - op: "scan-page-content" — full L4 classifier scan - op: "ping" — liveness probe for the client's health check - op: "status" — classifier readiness (used by /pty-inject-scan to surface l4 { available: bool } in its response) Plus browse/src/find-security-sidecar.ts: a resolver that locates node + the bundled JS entry (browse/dist/security-sidecar.js, built in a follow-up package.json change) or falls back to the dev TS entry. Returns null cleanly when node isn't on PATH so the calling endpoint can degrade per D7 (extension WARN + user confirm). C17 of the security-stack wave. C18 adds the IPC client + lifecycle management; C19 wires the endpoint; C20 routes the extension through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): sidecar IPC client with lifecycle + circuit breaker (#1370) Adds browse/src/security-sidecar-client.ts to manage the Node L4 classifier subprocess from the compiled browse server: - Lazy spawn on first scan; reuses the same process across requests - Id-correlated request/response via NDJSON over stdio - 5s default per-scan timeout; 64KB payload cap (short-circuits before spawn so oversized requests don't waste a process) - 3-in-10-minutes respawn cap → trips circuit breaker; subsequent scans throw immediately so the /pty-inject-scan endpoint can surface l4 { available: false } to the extension and degrade to WARN+confirm - process.on('exit') sends SIGTERM to the child for clean teardown - isSidecarAvailable() lets the endpoint probe before scan calls so the response shape reflects degraded mode honestly Unit tests cover the payload cap, the availability probe, and the breaker-doesn't-crash invariant under repeated rejected calls. C18 of the security-stack wave. C19 adds POST /pty-inject-scan; C20 routes the extension through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(security): add POST /pty-inject-scan endpoint for pre-PTY-inject scans (#1370) The sidebar's gstackInjectToTerminal callers (toolbar Cleanup, Inspector "Send to Code") were piping page-derived text directly into the live claude PTY with ZERO classifier processing — the gap codex flagged in #1370. The documented sidebar security stack had a hole the size of every Cleanup-button click. Adds POST /pty-inject-scan to browse/src/server.ts: - Local-only binding (NOT in TUNNEL_PATHS — tunnel attempts get the general 404 path; never reaches the scan logic) - Root-token auth via existing validateAuth() — 401 on unauth - 64KB request cap → 413 + payload-too-large body - 5s scan timeout via sidecar client - URL-blocklist forced to BLOCK in PTY context (page-derived REPL input is higher-risk than ordinary tool output) - L4 ML classifier via the sidecar when available; degrades to WARN per D7 when sidecar is unavailable - Response goes through JSON.stringify(..., sanitizeReplacer) per v1.38.0.0 Unicode-egress hardening - Imports only from security-sidecar-client.ts, never directly from security-classifier.ts (which would brick the compiled Bun binary) Seven static-invariant tests pin the POST verb, auth gate, 64KB cap, tunnel-listener exclusion, sanitizeReplacer wrapping, l4 availability shape, and the no-direct-classifier-import rule. C19 of the security-stack wave. C20 routes the extension through it; C21 adds the invariant AST check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(extension): route gstackInjectToTerminal through /pty-inject-scan (#1370) Closes the documented-vs-shipped gap codex flagged in #1370. The sidebar's two PTY-injection call sites (Inspector "Send to Code" and toolbar Cleanup) now pre-scan via the new /pty-inject-scan endpoint before writing to the live claude REPL. Adds window.gstackScanForPTYInject(text, origin) to extension/sidepanel-terminal.js: - Async, returns { allow, verdict, reasons, l4 } - POST to /pty-inject-scan with the existing root-token auth - WARN+confirm on scan failure (network down, sidecar absent, etc.) rather than silent PASS — D7 honest-degradation gstackInjectToTerminal stays synchronous, returns boolean. Per D6: keeping the inject sync means existing `const ok = ...?.()` callers don't break, and the invariant test in test/extension-pty-inject-invariant.test.ts can statically pin that every call goes through the scan first. extension/sidepanel.js call sites updated: - inspectorSendBtn click → await scan, BLOCK drops + WARN prompts via window.confirm, PASS injects silently - runCleanup() → same flow. Static cleanup prompt always PASSes but still routes through scan to honor the invariant. C20 of the security-stack wave. C21 adds the static invariant test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(security): invariant — extension PTY inject must be scan-gated (#1370) Static-analysis invariant test that fails the build if any extension/*.js path calls window.gstackInjectToTerminal without a preceding window.gstackScanForPTYInject in the same enclosing function. Closes the documented-vs-shipped gap codex demanded a machine check on. Rules: - Rule 1: any file that calls inject must also reference scan - Rule 2: in the enclosing function (function declaration, arrow, async (), event handler), a scan call must appear before the inject call by source position - Exemption: sidepanel-terminal.js (the file that DEFINES the inject function) is exempt from Rule 2 since the definition is not a call Plus two structural checks: - sidepanel-terminal.js defines both the inject and scan functions - inject stays SYNCHRONOUS (no `async` modifier) per D6 — async would silently break the `const ok = ...?.()` pattern at every caller C21 of the security-stack wave. The sidecar architecture (#1370) is complete: server-side L1-L3 + L4-via-sidecar (C17+C18+C19), extension pre-scan wiring (C20), and now the regression gate (C21). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(browse): opt-in extended stealth mode with 6 detection-vector patches (#1112) Rebases @garrytan's PR #1112 (Apr 2026, abandoned) onto the current browse/src/stealth.ts contract. The existing minimal "codex narrowed" stealth (webdriver-mask + AutomationControlled launch arg) stays the default. PR #1112's six additional patches are added behind an opt-in GSTACK_STEALTH=extended env flag. Extended-mode patches (applied AFTER the default mask, in order): 1. delete navigator.webdriver from prototype (not just the getter — detectors check `"webdriver" in navigator`) 2. WebGL renderer spoof to Apple M1 Pro (SwiftShader was the #1 software-GPU tell in containers) 3. navigator.plugins returns a PluginArray-prototype-passing array with MimeType objects and namedItem() 4. window.chrome populated with chrome.app, chrome.runtime, chrome.loadTimes(), chrome.csi() with realistic shapes 5. navigator.mediaDevices backfilled when headless drops it 6. CDP cdc_*-prefixed window globals cleared Why opt-in: the default mode's contract is fingerprint CONSISTENCY, which protects against detectors that flag spoofing mismatch. Extended mode actively lies about the environment; sites that reflect on these properties can break. Users who hit detection in default mode can flip GSTACK_STEALTH=extended for SannySoft 100% pass-rate. Twenty unit tests pin the env-flag semantics, all six patches' code presence, and the applyStealth wiring order. Live SannySoft pass-rate verification stays in the periodic-tier E2E suite. Contributed by @garrytan via #1112 (rebased — original PR opened before the codex-narrowed minimum landed; rebase preserves the narrowed default while adding the SannySoft-passing path as opt-in). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(fixtures): regenerate ship-SKILL.md golden baselines after C10-C13 + C16 templates Updates the three ship-SKILL.md golden baselines (claude, codex, factory hosts) to match the new shape produced by: - C10 #1209 codex argv (prompt + diff scope, no --base) - C11 #1492 merge-base diff (DIFF_BASE= preamble) - C13 #1197 command -v for codex detection - C12 + boundary preservation per regen-enforcing test Per CLAUDE.md SKILL.md workflow: edit the .tmpl, run gen:skill-docs, commit the regenerated outputs together. Goldens are part of the regen contract — without this commit, test/host-config.test.ts' golden-baseline checks fail with the diff codex review surfaced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): v1.41.0.0 — Daegu wave (24 bisect commits, 14 user-facing fixes) Bumps VERSION 1.40.0.0 → 1.41.0.0. CHANGELOG entry follows the release-summary format in CLAUDE.md: two-line headline, lead paragraph, "The numbers that matter" table, "What this means for builders" closer, then itemized Added/Changed/Fixed/For contributors with inline credit to every PR author and original issue reporter. Scale-aware bump per CLAUDE.md: 24 commits, ~6000 LOC net, substantial new capability across security (PTY sidecar wiring), install (Windows build chain), compat (gbrain 0.18-0.35, Codex CLI 0.130+), and quality (screenshot guard, design key disclosure, extended stealth opt-in). MINOR is the right call. Closes for users: #1567, #1559, #1569, #1346, #1418, #1538, #1537, #1530, #1457, #1561, #1554, #1479, #1503, #1248, #1214, #1370, #1327, #1193 pattern, #1152 pattern. Credit retained inline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(find-browse): resolve source-checkout layout <git-root>/browse/dist/browse[.exe] windows-setup-e2e.yml runs `bun browse/src/find-browse.ts` against a freshly-built repo where binaries land at browse/dist/browse.exe (no .claude/skills/gstack/ install layout). The previous markers chain only matched .codex/.agents/.claude prefixed paths, so find-browse exited "not found" even when the binary was present. Adds a source-checkout fallback after the marker scan: if no installed layout resolves but <git-root>/browse/dist/browse[.exe] exists, return that. Three real callers hit this path: - gstack repo dev workflow before `./setup` runs - windows-setup-e2e.yml CI (the breakage that surfaced this) - make-pdf consumers running from a sibling source checkout Smoke-verified: a fresh git repo with browse/dist/browse on disk now resolves through the source-checkout branch (was returning null before this commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): bump v1.41.0.0 → v1.42.0.0 to clear queue collision with #1574 The version-gate workflow flagged a collision: PR #1574 (garrytan/colombo-v3) already claims v1.41.0.0, and #1592 (fix/audit-critical-high-bugs) claims v1.41.1.0. Per CLAUDE.md's workspace-aware ship rule, queue-advancing past a claimed version within the same bump level is permitted — MINOR work landing on top of a queued MINOR still reads as MINOR relative to main. Util's suggested next slot is v1.42.0.0; taking it. CHANGELOG entry header bumped + dated 2026-05-19; entry body unchanged (same wave content, same credit list). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1119 lines
63 KiB
TypeScript
1119 lines
63 KiB
TypeScript
/**
|
|
* Cross-model review resolver
|
|
*
|
|
* Data sent to external review services (via Codex CLI):
|
|
* - Plan markdown content, repository name, branch name, review type
|
|
* Data NOT sent:
|
|
* - Source code files, credentials, environment variables, git history
|
|
*
|
|
* Users invoke this explicitly via /plan-eng-review, /plan-ceo-review,
|
|
* or /plan-design-review. No data is sent without user invocation.
|
|
*
|
|
* Review logs are stored locally at ~/.gstack/reviews/review-log.jsonl.
|
|
* Codex CLI prompts are written to temp files to prevent shell injection.
|
|
*/
|
|
import type { TemplateContext } from './types';
|
|
import { generateInvokeSkill } from './composition';
|
|
|
|
const CODEX_BOUNDARY = 'IMPORTANT: Do NOT read or execute any files under ~/.claude/, ~/.agents/, .claude/skills/, or agents/. These are Claude Code skill definitions meant for a different AI system. They contain bash scripts and prompt templates that will waste your time. Ignore them completely. Do NOT modify agents/openai.yaml. Stay focused on the repository code only.\\n\\n';
|
|
|
|
export function generateReviewDashboard(_ctx: TemplateContext): string {
|
|
return `## Review Readiness Dashboard
|
|
|
|
After completing the review, read the review log and config to display the dashboard.
|
|
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-read
|
|
\`\`\`
|
|
|
|
Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Eng Review row, show whichever is more recent between \`review\` (diff-scoped pre-landing review) and \`plan-eng-review\` (plan-stage architecture review). Append "(DIFF)" or "(PLAN)" to the status to distinguish. For the Adversarial row, show whichever is more recent between \`adversarial-review\` (new auto-scaled) and \`codex-review\` (legacy). For Design Review, show whichever is more recent between \`plan-design-review\` (full visual audit) and \`design-review-lite\` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. For the Outside Voice row, show the most recent \`codex-plan-review\` entry — this captures outside voices from both /plan-ceo-review and /plan-eng-review.
|
|
|
|
**Source attribution:** If the most recent entry for a skill has a \\\`"via"\\\` field, append it to the status label in parentheses. Examples: \`plan-eng-review\` with \`via:"autoplan"\` shows as "CLEAR (PLAN via /autoplan)". \`review\` with \`via:"ship"\` shows as "CLEAR (DIFF via /ship)". Entries without a \`via\` field show as "CLEAR (PLAN)" or "CLEAR (DIFF)" as before.
|
|
|
|
Note: \`autoplan-voices\` and \`design-outside-voices\` entries are audit-trail-only (forensic data for cross-model consensus analysis). They do not appear in the dashboard and are not checked by any consumer.
|
|
|
|
Display:
|
|
|
|
\`\`\`
|
|
+====================================================================+
|
|
| REVIEW READINESS DASHBOARD |
|
|
+====================================================================+
|
|
| Review | Runs | Last Run | Status | Required |
|
|
|-----------------|------|---------------------|-----------|----------|
|
|
| Eng Review | 1 | 2026-03-16 15:00 | CLEAR | YES |
|
|
| CEO Review | 0 | — | — | no |
|
|
| Design Review | 0 | — | — | no |
|
|
| Adversarial | 0 | — | — | no |
|
|
| Outside Voice | 0 | — | — | no |
|
|
+--------------------------------------------------------------------+
|
|
| VERDICT: CLEARED — Eng Review passed |
|
|
+====================================================================+
|
|
\`\`\`
|
|
|
|
**Review tiers:**
|
|
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \\\`gstack-config set skip_eng_review true\\\` (the "don't bother me" setting).
|
|
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
|
|
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
|
|
- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed.
|
|
- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping.
|
|
|
|
**Verdict logic:**
|
|
- **CLEARED**: Eng Review has >= 1 entry within 7 days from either \\\`review\\\` or \\\`plan-eng-review\\\` with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`)
|
|
- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
|
|
- CEO, Design, and Codex reviews are shown for context but never block shipping
|
|
- If \\\`skip_eng_review\\\` config is \\\`true\\\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED
|
|
|
|
**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
|
|
- Parse the \\\`---HEAD---\\\` section from the bash output to get the current HEAD commit hash
|
|
- For each review entry that has a \\\`commit\\\` field: compare it against the current HEAD. If different, count elapsed commits: \\\`git rev-list --count STORED_COMMIT..HEAD\\\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
|
|
- For entries without a \\\`commit\\\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
|
|
- If all reviews match the current HEAD, do not display any staleness notes`;
|
|
}
|
|
|
|
export function generatePlanFileReviewReport(_ctx: TemplateContext): string {
|
|
return `## Plan File Review Report
|
|
|
|
After displaying the Review Readiness Dashboard in conversation output, also update the
|
|
**plan file** itself so review status is visible to anyone reading the plan.
|
|
|
|
### Detect the plan file
|
|
|
|
1. Check if there is an active plan file in this conversation (the host provides plan file
|
|
paths in system messages — look for plan file references in the conversation context).
|
|
2. If not found, skip this section silently — not every review runs in plan mode.
|
|
|
|
### Generate the report
|
|
|
|
Read the review log output you already have from the Review Readiness Dashboard step above.
|
|
Parse each JSONL entry. Each skill logs different fields:
|
|
|
|
- **plan-ceo-review**: \\\`status\\\`, \\\`unresolved\\\`, \\\`critical_gaps\\\`, \\\`mode\\\`, \\\`scope_proposed\\\`, \\\`scope_accepted\\\`, \\\`scope_deferred\\\`, \\\`commit\\\`
|
|
→ Findings: "{scope_proposed} proposals, {scope_accepted} accepted, {scope_deferred} deferred"
|
|
→ If scope fields are 0 or missing (HOLD/REDUCTION mode): "mode: {mode}, {critical_gaps} critical gaps"
|
|
- **plan-eng-review**: \\\`status\\\`, \\\`unresolved\\\`, \\\`critical_gaps\\\`, \\\`issues_found\\\`, \\\`mode\\\`, \\\`commit\\\`
|
|
→ Findings: "{issues_found} issues, {critical_gaps} critical gaps"
|
|
- **plan-design-review**: \\\`status\\\`, \\\`initial_score\\\`, \\\`overall_score\\\`, \\\`unresolved\\\`, \\\`decisions_made\\\`, \\\`commit\\\`
|
|
→ Findings: "score: {initial_score}/10 → {overall_score}/10, {decisions_made} decisions"
|
|
- **plan-devex-review**: \\\`status\\\`, \\\`initial_score\\\`, \\\`overall_score\\\`, \\\`product_type\\\`, \\\`tthw_current\\\`, \\\`tthw_target\\\`, \\\`mode\\\`, \\\`persona\\\`, \\\`competitive_tier\\\`, \\\`unresolved\\\`, \\\`commit\\\`
|
|
→ Findings: "score: {initial_score}/10 → {overall_score}/10, TTHW: {tthw_current} → {tthw_target}"
|
|
- **devex-review**: \\\`status\\\`, \\\`overall_score\\\`, \\\`product_type\\\`, \\\`tthw_measured\\\`, \\\`dimensions_tested\\\`, \\\`dimensions_inferred\\\`, \\\`boomerang\\\`, \\\`commit\\\`
|
|
→ Findings: "score: {overall_score}/10, TTHW: {tthw_measured}, {dimensions_tested} tested/{dimensions_inferred} inferred"
|
|
- **codex-review**: \\\`status\\\`, \\\`gate\\\`, \\\`findings\\\`, \\\`findings_fixed\\\`
|
|
→ Findings: "{findings} findings, {findings_fixed}/{findings} fixed"
|
|
|
|
All fields needed for the Findings column are now present in the JSONL entries.
|
|
For the review you just completed, you may use richer details from your own Completion
|
|
Summary. For prior reviews, use the JSONL fields directly — they contain all required data.
|
|
|
|
Produce this markdown table:
|
|
|
|
\\\`\\\`\\\`markdown
|
|
## GSTACK REVIEW REPORT
|
|
|
|
| Review | Trigger | Why | Runs | Status | Findings |
|
|
|--------|---------|-----|------|--------|----------|
|
|
| CEO Review | \\\`/plan-ceo-review\\\` | Scope & strategy | {runs} | {status} | {findings} |
|
|
| Codex Review | \\\`/codex review\\\` | Independent 2nd opinion | {runs} | {status} | {findings} |
|
|
| Eng Review | \\\`/plan-eng-review\\\` | Architecture & tests (required) | {runs} | {status} | {findings} |
|
|
| Design Review | \\\`/plan-design-review\\\` | UI/UX gaps | {runs} | {status} | {findings} |
|
|
| DX Review | \\\`/plan-devex-review\\\` | Developer experience gaps | {runs} | {status} | {findings} |
|
|
\\\`\\\`\\\`
|
|
|
|
Below the table, add these lines (omit any that are empty/not applicable):
|
|
|
|
- **CODEX:** (only if codex-review ran) — one-line summary of codex fixes
|
|
- **CROSS-MODEL:** (only if both Claude and Codex reviews exist) — overlap analysis
|
|
- **UNRESOLVED:** total unresolved decisions across all reviews
|
|
- **VERDICT:** list reviews that are CLEAR (e.g., "CEO + ENG CLEARED — ready to implement").
|
|
If Eng Review is not CLEAR and not skipped globally, append "eng review required".
|
|
|
|
### Write to the plan file
|
|
|
|
**PLAN MODE EXCEPTION — ALWAYS RUN:** This writes to the plan file, which is the one
|
|
file you are allowed to edit in plan mode. The plan file review report is part of the
|
|
plan's living status.
|
|
|
|
The report must always be the LAST section of the plan file — never mid-file.
|
|
Use a single delete-then-append flow:
|
|
|
|
1. Read the plan file (Read tool) to see its full current content. Search the read
|
|
output for a \\\`## GSTACK REVIEW REPORT\\\` heading anywhere in the file.
|
|
2. If found, use the Edit tool to DELETE the entire existing section. Match from
|
|
\\\`## GSTACK REVIEW REPORT\\\` through either the next \\\`## \\\` heading or end of
|
|
file, whichever comes first. Replace with the empty string. This applies
|
|
regardless of where the section currently lives — mid-file deletion is
|
|
intentional, not a special case. If the Edit fails (e.g., concurrent edit
|
|
changed the content), re-read the plan file and retry once.
|
|
3. After the delete (or skipped, if no section existed), append the new
|
|
\\\`## GSTACK REVIEW REPORT\\\` section at the END of the file. Use the Edit
|
|
tool to match the file's current last paragraph and add the section after it,
|
|
or use Write to re-emit the whole file with the section at the end.
|
|
4. Verify with the Read tool that \\\`## GSTACK REVIEW REPORT\\\` is the last
|
|
\\\`## \\\` heading in the file before continuing. If it isn't, repeat steps
|
|
2-3 once.
|
|
|
|
Do NOT replace the section in place. The "replace mid-file" path is what allowed
|
|
prior versions to leave the report mid-file when an older report already lived
|
|
there — the user then sees a plan whose review report is not at the bottom and
|
|
(correctly) rejects it.`;
|
|
}
|
|
|
|
export function generateExitPlanModeGate(_ctx: TemplateContext): string {
|
|
return `## EXIT PLAN MODE GATE (BLOCKING)
|
|
|
|
Before calling ExitPlanMode, run this self-check. If any item fails, do the
|
|
missing work — do NOT call ExitPlanMode:
|
|
|
|
1. Read the plan file with the Read tool (after your most recent write to it).
|
|
2. Confirm the LAST \`## \` heading in the file is \`## GSTACK REVIEW REPORT\`.
|
|
In-body prose that mentions "outside voice", "codex findings", or similar
|
|
does NOT count — only the structured \`## GSTACK REVIEW REPORT\` section
|
|
satisfies this check.
|
|
3. Confirm the report contains: a Runs / Status / Findings table, a VERDICT
|
|
line, and absorbs CODEX / CROSS-MODEL / UNRESOLVED lines if applicable.
|
|
4. If a plan file is in context for this skill invocation: confirm
|
|
\`gstack-review-log\` was called and \`gstack-review-read\` was run at least
|
|
once. If no plan file is in context (e.g. \`/codex consult\` against a
|
|
diff with no plan), this check short-circuits — checks 1-3 already
|
|
short-circuit when no plan file exists.
|
|
|
|
Failing this gate and calling ExitPlanMode anyway is a contract violation —
|
|
the user will see a plan whose review report is missing or stale, and will
|
|
(correctly) reject it. Self-deception failure mode to watch for: feeling
|
|
"done" after writing review prose into the plan body. The body prose is not
|
|
the report. The report is a separate, structured, table-bearing section that
|
|
must be the file's terminal heading.`;
|
|
}
|
|
|
|
export function generateAntiShortcutClause(_ctx: TemplateContext): string {
|
|
return `**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.`;
|
|
}
|
|
|
|
export function generateSpecReviewLoop(_ctx: TemplateContext): string {
|
|
return `## Spec Review Loop
|
|
|
|
Before presenting the document to the user for approval, run an adversarial review.
|
|
|
|
**Step 1: Dispatch reviewer subagent**
|
|
|
|
Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
|
|
and cannot see the brainstorming conversation — only the document. This ensures genuine
|
|
adversarial independence.
|
|
|
|
Prompt the subagent with:
|
|
- The file path of the document just written
|
|
- "Read this document and review it on 5 dimensions. For each dimension, note PASS or
|
|
list specific issues with suggested fixes. At the end, output a quality score (1-10)
|
|
across all dimensions."
|
|
|
|
**Dimensions:**
|
|
1. **Completeness** — Are all requirements addressed? Missing edge cases?
|
|
2. **Consistency** — Do parts of the document agree with each other? Contradictions?
|
|
3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
|
|
4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
|
|
5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
|
|
|
|
The subagent should return:
|
|
- A quality score (1-10)
|
|
- PASS if no issues, or a numbered list of issues with dimension, description, and fix
|
|
|
|
**Step 2: Fix and re-dispatch**
|
|
|
|
If the reviewer returns issues:
|
|
1. Fix each issue in the document on disk (use Edit tool)
|
|
2. Re-dispatch the reviewer subagent with the updated document
|
|
3. Maximum 3 iterations total
|
|
|
|
**Convergence guard:** If the reviewer returns the same issues on consecutive iterations
|
|
(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
|
|
and persist those issues as "Reviewer Concerns" in the document rather than looping
|
|
further.
|
|
|
|
If the subagent fails, times out, or is unavailable — skip the review loop entirely.
|
|
Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
|
|
already written to disk; the review is a quality bonus, not a gate.
|
|
|
|
**Step 3: Report and persist metrics**
|
|
|
|
After the loop completes (PASS, max iterations, or convergence guard):
|
|
|
|
1. Tell the user the result — summary by default:
|
|
"Your doc survived N rounds of adversarial review. M issues caught and fixed.
|
|
Quality score: X/10."
|
|
If they ask "what did the reviewer find?", show the full reviewer output.
|
|
|
|
2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
|
|
section to the document listing each unresolved issue. Downstream skills will see this.
|
|
|
|
3. Append metrics:
|
|
\`\`\`bash
|
|
mkdir -p ~/.gstack/analytics
|
|
echo '{"skill":"${_ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
|
|
\`\`\`
|
|
Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.`;
|
|
}
|
|
|
|
export function generateBenefitsFrom(ctx: TemplateContext): string {
|
|
if (!ctx.benefitsFrom || ctx.benefitsFrom.length === 0) return '';
|
|
|
|
const skillList = ctx.benefitsFrom.map(s => `\`/${s}\``).join(' or ');
|
|
const first = ctx.benefitsFrom[0];
|
|
|
|
// Reuse the INVOKE_SKILL resolver for the actual loading instructions
|
|
const invokeBlock = generateInvokeSkill(ctx, [first]);
|
|
|
|
return `## Prerequisite Skill Offer
|
|
|
|
When the design doc check above prints "No design doc found," offer the prerequisite
|
|
skill before proceeding.
|
|
|
|
Say to the user via AskUserQuestion:
|
|
|
|
> "No design doc found for this branch. ${skillList} produces a structured problem
|
|
> statement, premise challenge, and explored alternatives — it gives this review much
|
|
> sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
|
|
> not per-product — it captures the thinking behind this specific change."
|
|
|
|
Options:
|
|
- A) Run /${first} now (we'll pick up the review right after)
|
|
- B) Skip — proceed with standard review
|
|
|
|
If they skip: "No worries — standard review. If you ever want sharper input, try
|
|
/${first} first next time." Then proceed normally. Do not re-offer later in the session.
|
|
|
|
If they choose A:
|
|
|
|
Say: "Running /${first} inline. Once the design doc is ready, I'll pick up
|
|
the review right where we left off."
|
|
|
|
${invokeBlock}
|
|
|
|
After /${first} completes, re-run the design doc check:
|
|
\`\`\`bash
|
|
setopt +o nomatch 2>/dev/null || true # zsh compat
|
|
SLUG=$(~/.claude/skills/gstack/browse/bin/remote-slug 2>/dev/null || basename "$(git rev-parse --show-toplevel 2>/dev/null || pwd)")
|
|
BRANCH=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-' || echo 'no-branch')
|
|
DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head -1)
|
|
[ -z "$DESIGN" ] && DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-design-*.md 2>/dev/null | head -1)
|
|
[ -n "$DESIGN" ] && echo "Design doc found: $DESIGN" || echo "No design doc found"
|
|
\`\`\`
|
|
|
|
If a design doc is now found, read it and continue the review.
|
|
If none was produced (user may have cancelled), proceed with standard review.`;
|
|
}
|
|
|
|
export function generateCodexSecondOpinion(ctx: TemplateContext): string {
|
|
// Codex host: strip entirely — Codex should never invoke itself
|
|
if (ctx.host === 'codex') return '';
|
|
|
|
return `## Phase 3.5: Cross-Model Second Opinion (optional)
|
|
|
|
**Binary check first:**
|
|
|
|
\`\`\`bash
|
|
command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
|
\`\`\`
|
|
|
|
Use AskUserQuestion (regardless of codex availability):
|
|
|
|
> Want a second opinion from an independent AI perspective? It will review your problem statement, key answers, premises, and any landscape findings from this session without having seen this conversation — it gets a structured summary. Usually takes 2-5 minutes.
|
|
> A) Yes, get a second opinion
|
|
> B) No, proceed to alternatives
|
|
|
|
If B: skip Phase 3.5 entirely. Remember that the second opinion did NOT run (affects design doc, founder signals, and Phase 4 below).
|
|
|
|
**If A: Run the Codex cold read.**
|
|
|
|
1. Assemble a structured context block from Phases 1-3:
|
|
- Mode (Startup or Builder)
|
|
- Problem statement (from Phase 1)
|
|
- Key answers from Phase 2A/2B (summarize each Q&A in 1-2 sentences, include verbatim user quotes)
|
|
- Landscape findings (from Phase 2.75, if search was run)
|
|
- Agreed premises (from Phase 3)
|
|
- Codebase context (project name, languages, recent activity)
|
|
|
|
2. **Write the assembled prompt to a temp file** (prevents shell injection from user-derived content):
|
|
|
|
\`\`\`bash
|
|
CODEX_PROMPT_FILE=$(mktemp /tmp/gstack-codex-oh-XXXXXXXX.txt)
|
|
\`\`\`
|
|
|
|
Write the full prompt to this file. **Always start with the filesystem boundary:**
|
|
"${CODEX_BOUNDARY}"
|
|
Then add the context block and mode-appropriate instructions:
|
|
|
|
**Startup mode instructions:** "You are an independent technical advisor reading a transcript of a startup brainstorming session. [CONTEXT BLOCK HERE]. Your job: 1) What is the STRONGEST version of what this person is trying to build? Steelman it in 2-3 sentences. 2) What is the ONE thing from their answers that reveals the most about what they should actually build? Quote it and explain why. 3) Name ONE agreed premise you think is wrong, and what evidence would prove you right. 4) If you had 48 hours and one engineer to build a prototype, what would you build? Be specific — tech stack, features, what you'd skip. Be direct. Be terse. No preamble."
|
|
|
|
**Builder mode instructions:** "You are an independent technical advisor reading a transcript of a builder brainstorming session. [CONTEXT BLOCK HERE]. Your job: 1) What is the COOLEST version of this they haven't considered? 2) What's the ONE thing from their answers that reveals what excites them most? Quote it. 3) What existing open source project or tool gets them 50% of the way there — and what's the 50% they'd need to build? 4) If you had a weekend to build this, what would you build first? Be specific. Be direct. No preamble."
|
|
|
|
3. Run Codex:
|
|
|
|
\`\`\`bash
|
|
TMPERR_OH=$(mktemp /tmp/codex-oh-err-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
codex exec "$(cat "$CODEX_PROMPT_FILE")" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_OH"
|
|
\`\`\`
|
|
|
|
Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr:
|
|
\`\`\`bash
|
|
cat "$TMPERR_OH"
|
|
rm -f "$TMPERR_OH" "$CODEX_PROMPT_FILE"
|
|
\`\`\`
|
|
|
|
**Error handling:** All errors are non-blocking — second opinion is a quality enhancement, not a prerequisite.
|
|
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run \\\`codex login\\\` to authenticate." Fall back to Claude subagent.
|
|
- **Timeout:** "Codex timed out after 5 minutes." Fall back to Claude subagent.
|
|
- **Empty response:** "Codex returned no response." Fall back to Claude subagent.
|
|
|
|
On any Codex error, fall back to the Claude subagent below.
|
|
|
|
**If CODEX_NOT_AVAILABLE (or Codex errored):**
|
|
|
|
Dispatch via the Agent tool. The subagent has fresh context — genuine independence.
|
|
|
|
Subagent prompt: same mode-appropriate prompt as above (Startup or Builder variant).
|
|
|
|
Present findings under a \`SECOND OPINION (Claude subagent):\` header.
|
|
|
|
If the subagent fails or times out: "Second opinion unavailable. Continuing to Phase 4."
|
|
|
|
4. **Presentation:**
|
|
|
|
If Codex ran:
|
|
\`\`\`
|
|
SECOND OPINION (Codex):
|
|
════════════════════════════════════════════════════════════
|
|
<full codex output, verbatim — do not truncate or summarize>
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
If Claude subagent ran:
|
|
\`\`\`
|
|
SECOND OPINION (Claude subagent):
|
|
════════════════════════════════════════════════════════════
|
|
<full subagent output, verbatim — do not truncate or summarize>
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
5. **Cross-model synthesis:** After presenting the second opinion output, provide 3-5 bullet synthesis:
|
|
- Where Claude agrees with the second opinion
|
|
- Where Claude disagrees and why
|
|
- Whether the challenged premise changes Claude's recommendation
|
|
|
|
6. **Premise revision check:** If Codex challenged an agreed premise, use AskUserQuestion:
|
|
|
|
> Codex challenged premise #{N}: "{premise text}". Their argument: "{reasoning}".
|
|
> A) Revise this premise based on Codex's input
|
|
> B) Keep the original premise — proceed to alternatives
|
|
|
|
If A: revise the premise and note the revision. If B: proceed (and note that the user defended this premise with reasoning — this is a founder signal if they articulate WHY they disagree, not just dismiss).`;
|
|
}
|
|
|
|
// ─── Scope Drift Detection (shared between /review and /ship) ────────
|
|
|
|
export function generateScopeDrift(ctx: TemplateContext): string {
|
|
const isShip = ctx.skillName === 'ship';
|
|
const stepNum = isShip ? '8.2' : '1.5';
|
|
|
|
return `## Step ${stepNum}: Scope Drift Detection
|
|
|
|
Before reviewing code quality, check: **did they build what was requested — nothing more, nothing less?**
|
|
|
|
1. Read \`TODOS.md\` (if it exists). Read PR description (\`gh pr view --json body --jq .body 2>/dev/null || true\`).
|
|
Read commit messages (\`git log origin/<base>..HEAD --oneline\`).
|
|
**If no PR exists:** rely on commit messages and TODOS.md for stated intent — this is the common case since /review runs before /ship creates the PR.
|
|
2. Identify the **stated intent** — what was this branch supposed to accomplish?
|
|
3. Run \`DIFF_BASE=$(git merge-base origin/<base> HEAD) && git diff "$DIFF_BASE" --stat\` and compare the files changed against the stated intent.
|
|
|
|
4. Evaluate with skepticism (incorporating plan completion results if available from an earlier step or adjacent section):
|
|
|
|
**SCOPE CREEP detection:**
|
|
- Files changed that are unrelated to the stated intent
|
|
- New features or refactors not mentioned in the plan
|
|
- "While I was in there..." changes that expand blast radius
|
|
|
|
**MISSING REQUIREMENTS detection:**
|
|
- Requirements from TODOS.md/PR description not addressed in the diff
|
|
- Test coverage gaps for stated requirements
|
|
- Partial implementations (started but not finished)
|
|
|
|
5. Output (before the main review begins):
|
|
\\\`\\\`\\\`
|
|
Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING]
|
|
Intent: <1-line summary of what was requested>
|
|
Delivered: <1-line summary of what the diff actually does>
|
|
[If drift: list each out-of-scope change]
|
|
[If missing: list each unaddressed requirement]
|
|
\\\`\\\`\\\`
|
|
|
|
6. This is **INFORMATIONAL** — does not block the review. Proceed to the next step.
|
|
|
|
---`;
|
|
}
|
|
|
|
// ─── Adversarial Review (always-on) ──────────────────────────────────
|
|
|
|
export function generateAdversarialStep(ctx: TemplateContext): string {
|
|
// Codex host: strip entirely — Codex should never invoke itself
|
|
if (ctx.host === 'codex') return '';
|
|
|
|
const isShip = ctx.skillName === 'ship';
|
|
const stepNum = isShip ? '11' : '5.7';
|
|
|
|
return `## Step ${stepNum}: Adversarial review (always-on)
|
|
|
|
Every diff gets adversarial review from both Claude and Codex. LOC is not a proxy for risk — a 5-line auth change can be critical.
|
|
|
|
**Detect diff size and tool availability:**
|
|
|
|
\`\`\`bash
|
|
DIFF_BASE=$(git merge-base origin/<base> HEAD)
|
|
DIFF_INS=$(git diff "$DIFF_BASE" --stat | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
|
|
DIFF_DEL=$(git diff "$DIFF_BASE" --stat | tail -1 | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo "0")
|
|
DIFF_TOTAL=$((DIFF_INS + DIFF_DEL))
|
|
command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
|
# Legacy opt-out — only gates Codex passes, Claude always runs
|
|
OLD_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
|
|
echo "DIFF_SIZE: $DIFF_TOTAL"
|
|
echo "OLD_CFG: \${OLD_CFG:-not_set}"
|
|
\`\`\`
|
|
|
|
If \`OLD_CFG\` is \`disabled\`: skip Codex passes only. Claude adversarial subagent still runs (it's free and fast). Jump to the "Claude adversarial subagent" section.
|
|
|
|
**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size.
|
|
|
|
---
|
|
|
|
### Claude adversarial subagent (always runs)
|
|
|
|
Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to.
|
|
|
|
Subagent prompt:
|
|
"Read the diff for this branch with \`DIFF_BASE=$(git merge-base origin/<base> HEAD) && git diff "$DIFF_BASE"\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment). After listing findings, end your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\` — examples: \`Recommendation: Fix the unbounded retry at queue.ts:78 because it'll DoS the worker pool under sustained 429s\` or \`Recommendation: Ship as-is because the strongest finding is a theoretical race that requires conditions we can't trigger in production\`. The reason must point to a specific finding (or no-fix rationale). Generic reasons like 'because it's safer' do not qualify."
|
|
|
|
Present findings under an \`ADVERSARIAL REVIEW (Claude subagent):\` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational.
|
|
|
|
If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing."
|
|
|
|
---
|
|
|
|
### Codex adversarial challenge (always runs when available)
|
|
|
|
If Codex is available AND \`OLD_CFG\` is NOT \`disabled\`:
|
|
|
|
\`\`\`bash
|
|
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
codex exec "${CODEX_BOUNDARY}Review the changes on this branch against the base branch. Run DIFF_BASE=$(git merge-base origin/<base> HEAD) && git diff "$DIFF_BASE" to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems. End your output with ONE line in the canonical format \`Recommendation: <action> because <one-line reason naming the most exploitable finding>\`. Generic reasons like 'because it's safer' do not qualify; the reason must point to a specific finding or no-fix rationale." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_ADV"
|
|
\`\`\`
|
|
|
|
Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. After the command completes, read stderr:
|
|
\`\`\`bash
|
|
cat "$TMPERR_ADV"
|
|
\`\`\`
|
|
|
|
Present the full output verbatim. This is informational — it never blocks shipping.
|
|
|
|
**Error handling:** All errors are non-blocking — adversarial review is a quality enhancement, not a prerequisite.
|
|
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run \\\`codex login\\\` to authenticate."
|
|
- **Timeout:** "Codex timed out after 5 minutes."
|
|
- **Empty response:** "Codex returned no response. Stderr: <paste relevant error>."
|
|
|
|
**Cleanup:** Run \`rm -f "$TMPERR_ADV"\` after processing.
|
|
|
|
If Codex is NOT available: "Codex CLI not found — running Claude adversarial only. Install Codex for cross-model coverage: \`npm install -g @openai/codex\`"
|
|
|
|
---
|
|
|
|
### Codex structured review (large diffs only, 200+ lines)
|
|
|
|
If \`DIFF_TOTAL >= 200\` AND Codex is available AND \`OLD_CFG\` is NOT \`disabled\`:
|
|
|
|
\`\`\`bash
|
|
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
cd "$_REPO_ROOT"
|
|
codex review "${CODEX_BOUNDARY}Review the changes on this branch against the base branch <base>. Run git diff origin/<base>...HEAD 2>/dev/null || git diff <base>...HEAD to see the diff and review only those changes." -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR"
|
|
\`\`\`
|
|
|
|
Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. Present output under \`CODEX SAYS (code review):\` header.
|
|
Check for \`[P1]\` markers: found → \`GATE: FAIL\`, not found → \`GATE: PASS\`.
|
|
|
|
If GATE is FAIL, use AskUserQuestion:
|
|
\`\`\`
|
|
Codex found N critical issues in the diff.
|
|
|
|
A) Investigate and fix now (recommended)
|
|
B) Continue — review will still complete
|
|
\`\`\`
|
|
|
|
If A: address the findings${isShip ? '. After fixing, re-run tests (Step 5) since code has changed' : ''}. Re-run \`codex review\` to verify.
|
|
|
|
Read stderr for errors (same error handling as Codex adversarial above).
|
|
|
|
After stderr: \`rm -f "$TMPERR"\`
|
|
|
|
If \`DIFF_TOTAL < 200\`: skip this section silently. The Claude + Codex adversarial passes provide sufficient coverage for smaller diffs.
|
|
|
|
---
|
|
|
|
### Persist the review result
|
|
|
|
After all passes complete, persist:
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"always","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
|
\`\`\`
|
|
Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), "skipped" if diff < 200, or "informational" if Codex was unavailable. If all passes failed, do NOT persist.
|
|
|
|
---
|
|
|
|
### Cross-model synthesis
|
|
|
|
After all passes complete, synthesize findings across all sources:
|
|
|
|
\`\`\`
|
|
ADVERSARIAL REVIEW SYNTHESIS (always-on, N lines):
|
|
════════════════════════════════════════════════════════════
|
|
High confidence (found by multiple sources): [findings agreed on by >1 pass]
|
|
Unique to Claude structured review: [from earlier step]
|
|
Unique to Claude adversarial: [from subagent]
|
|
Unique to Codex: [from codex adversarial or code review, if ran]
|
|
Models used: Claude structured ✓ Claude adversarial ✓/✗ Codex ✓/✗
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
High-confidence findings (agreed on by multiple sources) should be prioritized for fixes.
|
|
|
|
---`;
|
|
}
|
|
|
|
export function generateCodexPlanReview(ctx: TemplateContext): string {
|
|
// Codex host: strip entirely — Codex should never invoke itself
|
|
if (ctx.host === 'codex') return '';
|
|
|
|
return `## Outside Voice — Independent Plan Challenge (optional, recommended)
|
|
|
|
After all review sections are complete, offer an independent second opinion from a
|
|
different AI system. Two models agreeing on a plan is stronger signal than one model's
|
|
thorough review.
|
|
|
|
**Check tool availability:**
|
|
|
|
\`\`\`bash
|
|
command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
|
\`\`\`
|
|
|
|
Use AskUserQuestion:
|
|
|
|
> "All review sections are complete. Want an outside voice? A different AI system can
|
|
> give a brutally honest, independent challenge of this plan — logical gaps, feasibility
|
|
> risks, and blind spots that are hard to catch from inside the review. Takes about 2
|
|
> minutes."
|
|
>
|
|
> RECOMMENDATION: Choose A — an independent second opinion catches structural blind
|
|
> spots. Two different AI models agreeing on a plan is stronger signal than one model's
|
|
> thorough review. Completeness: A=9/10, B=7/10.
|
|
|
|
Options:
|
|
- A) Get the outside voice (recommended)
|
|
- B) Skip — proceed to outputs
|
|
|
|
**If B:** Print "Skipping outside voice." and continue to the next section.
|
|
|
|
**If A:** Construct the plan review prompt. Read the plan file being reviewed (the file
|
|
the user pointed this review at, or the branch diff scope). If a CEO plan document
|
|
was written in Step 0D-POST, read that too — it contains the scope decisions and vision.
|
|
|
|
Construct this prompt (substitute the actual plan content — if plan content exceeds 30KB,
|
|
truncate to the first 30KB and note "Plan truncated for size"). **Always start with the
|
|
filesystem boundary instruction:**
|
|
|
|
"${CODEX_BOUNDARY}You are a brutally honest technical reviewer examining a development plan that has
|
|
already been through a multi-section review. Your job is NOT to repeat that review.
|
|
Instead, find what it missed. Look for: logical gaps and unstated assumptions that
|
|
survived the review scrutiny, overcomplexity (is there a fundamentally simpler
|
|
approach the review was too deep in the weeds to see?), feasibility risks the review
|
|
took for granted, missing dependencies or sequencing issues, and strategic
|
|
miscalibration (is this the right thing to build at all?). Be direct. Be terse. No
|
|
compliments. Just the problems.
|
|
|
|
THE PLAN:
|
|
<plan content>"
|
|
|
|
**If CODEX_AVAILABLE:**
|
|
|
|
\`\`\`bash
|
|
TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
codex exec "<prompt>" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_PV"
|
|
\`\`\`
|
|
|
|
Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr:
|
|
\`\`\`bash
|
|
cat "$TMPERR_PV"
|
|
\`\`\`
|
|
|
|
Present the full output verbatim:
|
|
|
|
\`\`\`
|
|
CODEX SAYS (plan review — outside voice):
|
|
════════════════════════════════════════════════════════════
|
|
<full codex output, verbatim — do not truncate or summarize>
|
|
════════════════════════════════════════════════════════════
|
|
\`\`\`
|
|
|
|
**Error handling:** All errors are non-blocking — the outside voice is informational.
|
|
- Auth failure (stderr contains "auth", "login", "unauthorized"): "Codex auth failed. Run \\\`codex login\\\` to authenticate."
|
|
- Timeout: "Codex timed out after 5 minutes."
|
|
- Empty response: "Codex returned no response."
|
|
|
|
On any Codex error, fall back to the Claude adversarial subagent.
|
|
|
|
**If CODEX_NOT_AVAILABLE (or Codex errored):**
|
|
|
|
Dispatch via the Agent tool. The subagent has fresh context — genuine independence.
|
|
|
|
Subagent prompt: same plan review prompt as above.
|
|
|
|
Present findings under an \`OUTSIDE VOICE (Claude subagent):\` header.
|
|
|
|
If the subagent fails or times out: "Outside voice unavailable. Continuing to outputs."
|
|
|
|
**Cross-model tension:**
|
|
|
|
After presenting the outside voice findings, note any points where the outside voice
|
|
disagrees with the review findings from earlier sections. Flag these as:
|
|
|
|
\`\`\`
|
|
CROSS-MODEL TENSION:
|
|
[Topic]: Review said X. Outside voice says Y. [Present both perspectives neutrally.
|
|
State what context you might be missing that would change the answer.]
|
|
\`\`\`
|
|
|
|
**User Sovereignty:** Do NOT auto-incorporate outside voice recommendations into the plan.
|
|
Present each tension point to the user. The user decides. Cross-model agreement is a
|
|
strong signal — present it as such — but it is NOT permission to act. You may state
|
|
which argument you find more compelling, but you MUST NOT apply the change without
|
|
explicit user approval.
|
|
|
|
For each substantive tension point, use AskUserQuestion:
|
|
|
|
> "Cross-model disagreement on [topic]. The review found [X] but the outside voice
|
|
> argues [Y]. [One sentence on what context you might be missing.]"
|
|
>
|
|
> RECOMMENDATION: Choose [A or B] because [one-line reason explaining which argument
|
|
> is more compelling and why]. Completeness: A=X/10, B=Y/10.
|
|
|
|
Options:
|
|
- A) Accept the outside voice's recommendation (I'll apply this change)
|
|
- B) Keep the current approach (reject the outside voice)
|
|
- C) Investigate further before deciding
|
|
- D) Add to TODOS.md for later
|
|
|
|
Wait for the user's response. Do NOT default to accepting because you agree with the
|
|
outside voice. If the user chooses B, the current approach stands — do not re-argue.
|
|
|
|
If no tension points exist, note: "No cross-model tension — both reviewers agree."
|
|
|
|
**Persist the result:**
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-plan-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
|
\`\`\`
|
|
|
|
Substitute: STATUS = "clean" if no findings, "issues_found" if findings exist.
|
|
SOURCE = "codex" if Codex ran, "claude" if subagent ran.
|
|
|
|
**Cleanup:** Run \`rm -f "$TMPERR_PV"\` after processing (if Codex was used).
|
|
|
|
---`;
|
|
}
|
|
|
|
// ─── Plan File Discovery (shared helper) ──────────────────────────────
|
|
|
|
function generatePlanFileDiscovery(): string {
|
|
return `### Plan File Discovery
|
|
|
|
1. **Conversation context (primary):** Check if there is an active plan file in this conversation. The host agent's system messages include plan file paths when in plan mode. If found, use it directly — this is the most reliable signal.
|
|
|
|
2. **Content-based search (fallback):** If no plan file is referenced in conversation context, search by content:
|
|
|
|
\`\`\`bash
|
|
setopt +o nomatch 2>/dev/null || true # zsh compat
|
|
BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-')
|
|
REPO=$(basename "$(git rev-parse --show-toplevel 2>/dev/null)")
|
|
# Compute project slug for ~/.gstack/projects/ lookup
|
|
_PLAN_SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\\([^/]*/[^/]*\\)\\.git$|\\1|;s|.*[:/]\\([^/]*/[^/]*\\)$|\\1|' | tr '/' '-' | tr -cd 'a-zA-Z0-9._-') || true
|
|
_PLAN_SLUG="\${_PLAN_SLUG:-$(basename "$PWD" | tr -cd 'a-zA-Z0-9._-')}"
|
|
# Search common plan file locations (project designs first, then personal/local)
|
|
for PLAN_DIR in "$HOME/.gstack/projects/$_PLAN_SLUG" "$HOME/.claude/plans" "$HOME/.codex/plans" ".gstack/plans"; do
|
|
[ -d "$PLAN_DIR" ] || continue
|
|
PLAN=$(ls -t "$PLAN_DIR"/*.md 2>/dev/null | xargs grep -l "$BRANCH" 2>/dev/null | head -1)
|
|
[ -z "$PLAN" ] && PLAN=$(ls -t "$PLAN_DIR"/*.md 2>/dev/null | xargs grep -l "$REPO" 2>/dev/null | head -1)
|
|
[ -z "$PLAN" ] && PLAN=$(find "$PLAN_DIR" -name '*.md' -mmin -1440 -maxdepth 1 2>/dev/null | xargs ls -t 2>/dev/null | head -1)
|
|
[ -n "$PLAN" ] && break
|
|
done
|
|
[ -n "$PLAN" ] && echo "PLAN_FILE: $PLAN" || echo "NO_PLAN_FILE"
|
|
\`\`\`
|
|
|
|
3. **Validation:** If a plan file was found via content-based search (not conversation context), read the first 20 lines and verify it is relevant to the current branch's work. If it appears to be from a different project or feature, treat as "no plan file found."
|
|
|
|
**Error handling:**
|
|
- No plan file found → skip with "No plan file detected — skipping."
|
|
- Plan file found but unreadable (permissions, encoding) → skip with "Plan file found but unreadable — skipping."`;
|
|
}
|
|
|
|
// ─── Plan Completion Audit ────────────────────────────────────────────
|
|
|
|
type PlanCompletionMode = 'ship' | 'review';
|
|
|
|
function generatePlanCompletionAuditInner(mode: PlanCompletionMode): string {
|
|
const sections: string[] = [];
|
|
|
|
// ── Plan file discovery (shared) ──
|
|
sections.push(generatePlanFileDiscovery());
|
|
|
|
// ── Item extraction ──
|
|
sections.push(`
|
|
### Actionable Item Extraction
|
|
|
|
Read the plan file. Extract every actionable item — anything that describes work to be done. Look for:
|
|
|
|
- **Checkbox items:** \`- [ ] ...\` or \`- [x] ...\`
|
|
- **Numbered steps** under implementation headings: "1. Create ...", "2. Add ...", "3. Modify ..."
|
|
- **Imperative statements:** "Add X to Y", "Create a Z service", "Modify the W controller"
|
|
- **File-level specifications:** "New file: path/to/file.ts", "Modify path/to/existing.rb"
|
|
- **Test requirements:** "Test that X", "Add test for Y", "Verify Z"
|
|
- **Data model changes:** "Add column X to table Y", "Create migration for Z"
|
|
|
|
**Ignore:**
|
|
- Context/Background sections (\`## Context\`, \`## Background\`, \`## Problem\`)
|
|
- Questions and open items (marked with ?, "TBD", "TODO: decide")
|
|
- Review report sections (\`## GSTACK REVIEW REPORT\`)
|
|
- Explicitly deferred items ("Future:", "Out of scope:", "NOT in scope:", "P2:", "P3:", "P4:")
|
|
- CEO Review Decisions sections (these record choices, not work items)
|
|
|
|
**Cap:** Extract at most 50 items. If the plan has more, note: "Showing top 50 of N plan items — full list in plan file."
|
|
|
|
**No items found:** If the plan contains no extractable actionable items, skip with: "Plan file contains no actionable items — skipping completion audit."
|
|
|
|
For each item, note:
|
|
- The item text (verbatim or concise summary)
|
|
- Its category: CODE | TEST | MIGRATION | CONFIG | DOCS`);
|
|
|
|
// ── Verification Mode (per PR #1302 — VAS-449 remediation) ──
|
|
sections.push(`
|
|
### Verification Mode
|
|
|
|
Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to \`git diff\`.
|
|
|
|
- **DIFF-VERIFIABLE** — A code change in this repo would manifest in \`git diff <base>...HEAD\`. Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears).
|
|
- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., \`domain-hq/docs/dashboard.md\`, \`~/Development/<other-repo>/...\`). The current diff CANNOT prove this.
|
|
- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this.
|
|
- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE.
|
|
|
|
**Verification dispatch:**
|
|
|
|
- **DIFF-VERIFIABLE** → cross-reference against diff (next section).
|
|
- **CROSS-REPO** → if the sibling repo is reachable on disk (try \`~/Development/<repo>/\`, \`~/code/<repo>/\`, the parent of the current repo), run \`[ -f <path> ]\` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check).
|
|
- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform.
|
|
- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm.
|
|
|
|
**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, \`~/...\`, or \`<sibling-repo>/<file>\`), it MUST be classified DONE or NOT DONE based on \`[ -f <path> ]\`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable.
|
|
|
|
**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's \`package.json\` for any script matching \`validate-*\`, \`lint-wiki\`, \`check-docs\`, or similar. If found, invoke it with the relevant path argument (e.g., \`npm run validate-wiki -- <path>\`). For multi-target validators (e.g., \`validate-wiki --all\`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE.
|
|
|
|
**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable.`);
|
|
|
|
// ── Cross-reference against diff ──
|
|
sections.push(`
|
|
### Cross-Reference Against Diff
|
|
|
|
Run \`git diff origin/<base>...HEAD\` and \`git log origin/<base>..HEAD --oneline\` to understand what was implemented.
|
|
|
|
For each extracted plan item, run the verification dispatch from the previous section, then classify:
|
|
|
|
- **DONE** — Clear evidence the item shipped. Cite the specific file(s) changed in the diff for DIFF-VERIFIABLE items, or the verified path that exists for CROSS-REPO items with a reachable sibling repo.
|
|
- **PARTIAL** — Some work toward this item exists but is incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
|
|
- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent).
|
|
- **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference.
|
|
- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo").
|
|
|
|
**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present.
|
|
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
|
|
**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE.`);
|
|
|
|
// ── Output format ──
|
|
sections.push(`
|
|
### Output Format
|
|
|
|
\`\`\`
|
|
PLAN COMPLETION AUDIT
|
|
═══════════════════════════════
|
|
Plan: {plan file path}
|
|
|
|
## Implementation Items
|
|
[DONE] Create UserService — src/services/user_service.rb (+142 lines)
|
|
[PARTIAL] Add validation — model validates but missing controller checks
|
|
[NOT DONE] Add caching layer — no cache-related changes in diff
|
|
[CHANGED] "Redis queue" → implemented with Sidekiq instead
|
|
|
|
## Test Items
|
|
[DONE] Unit tests for UserService — test/services/user_service_test.rb
|
|
[NOT DONE] E2E test for signup flow
|
|
|
|
## Migration Items
|
|
[DONE] Create users table — db/migrate/20240315_create_users.rb
|
|
|
|
## Cross-Repo / External Items
|
|
[DONE] sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md
|
|
[UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required
|
|
[UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard
|
|
|
|
─────────────────────────────────
|
|
COMPLETION: 5/9 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE
|
|
─────────────────────────────────
|
|
\`\`\``);
|
|
|
|
// ── Gate logic (mode-specific) ──
|
|
if (mode === 'ship') {
|
|
sections.push(`
|
|
### Gate Logic
|
|
|
|
After producing the completion checklist, evaluate in priority order:
|
|
|
|
1. **Any NOT DONE items** (highest priority — known missing work). Use AskUserQuestion:
|
|
- Show the completion checklist above
|
|
- "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation."
|
|
- RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A.
|
|
- Options:
|
|
A) Stop — implement the missing items before shipping
|
|
B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5)
|
|
C) These items were intentionally dropped — remove from scope
|
|
- If A: STOP. List the missing items for the user to implement.
|
|
- If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}".
|
|
- If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}."
|
|
|
|
2. **Any UNVERIFIABLE items** (silent gaps — the diff cannot prove them either way). Only fires after NOT DONE is resolved or absent.
|
|
|
|
**Per-item confirmation is mandatory.** Do NOT use a single AskUserQuestion to blanket-confirm all UNVERIFIABLE items. Blanket confirmation is the failure mode that surfaced in VAS-449 (user clicks A without opening any file). Instead:
|
|
|
|
- Loop through UNVERIFIABLE items one at a time.
|
|
- For each item, use AskUserQuestion with the item's *specific* manual check (e.g., "Confirm: does \`~/Development/domain-hq/docs/dashboard.md\` exist?", not "Have you checked all items?").
|
|
- Options per item:
|
|
Y) Confirmed done — cite what you verified (free-text, embedded in PR body)
|
|
N) Not done — block ship; treat as NOT DONE and re-enter the priority-1 gate
|
|
D) Intentionally dropped — note in PR body: "Plan item intentionally dropped: {item}"
|
|
- RECOMMENDATION per item: Y if the item is concrete and easily verified; N if it's critical-path (auth, DNS, deliverables to other repos) and the user shows hesitation.
|
|
|
|
**Exit conditions:**
|
|
- Any N: STOP. Surface the missing items, suggest re-running /ship after they're addressed.
|
|
- All Y or D: Continue. Embed \`## Plan Completion — Manual Verifications\` section in PR body listing each Y'd item with the user's free-text evidence and each D'd item with "intentionally dropped".
|
|
|
|
**Cap.** If there are more than 5 UNVERIFIABLE items, present them as a numbered list first and ask whether the user wants to (1) confirm each individually, (2) stop and reduce scope, or (3) explicitly accept blanket-confirmation with the warning that this is the VAS-449 failure shape. Default and recommended option is (1).
|
|
|
|
3. **Only PARTIAL items (no NOT DONE, no UNVERIFIABLE):** Continue with a note in the PR body. Not blocking.
|
|
|
|
4. **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue.
|
|
|
|
**No plan file found:** Skip entirely. "No plan file detected — skipping plan completion audit."
|
|
|
|
**Include in PR body (Step 8):** Add a \`## Plan Completion\` section with the checklist summary.`);
|
|
} else {
|
|
// review mode — enhanced Delivery Integrity (Release 2: Review Army)
|
|
sections.push(`
|
|
### Fallback Intent Sources (when no plan file found)
|
|
|
|
When no plan file is detected, use these secondary intent sources:
|
|
|
|
1. **Commit messages:** Run \`git log origin/<base>..HEAD --oneline\`. Use judgment to extract real intent:
|
|
- Commits with actionable verbs ("add", "implement", "fix", "create", "remove", "update") are intent signals
|
|
- Skip noise: "WIP", "tmp", "squash", "merge", "chore", "typo", "fixup"
|
|
- Extract the intent behind the commit, not the literal message
|
|
2. **TODOS.md:** If it exists, check for items related to this branch or recent dates
|
|
3. **PR description:** Run \`gh pr view --json body -q .body 2>/dev/null\` for intent context
|
|
|
|
**With fallback sources:** Apply the same Cross-Reference classification (DONE/PARTIAL/NOT DONE/CHANGED) using best-effort matching. Note that fallback-sourced items are lower confidence than plan-file items.
|
|
|
|
### Investigation Depth
|
|
|
|
For each PARTIAL or NOT DONE item, investigate WHY:
|
|
|
|
1. Check \`git log origin/<base>..HEAD --oneline\` for commits that suggest the work was started, attempted, or reverted
|
|
2. Read the relevant code to understand what was built instead
|
|
3. Determine the likely reason from this list:
|
|
- **Scope cut** — evidence of intentional removal (revert commit, removed TODO)
|
|
- **Context exhaustion** — work started but stopped mid-way (partial implementation, no follow-up commits)
|
|
- **Misunderstood requirement** — something was built but it doesn't match what the plan described
|
|
- **Blocked by dependency** — plan item depends on something that isn't available
|
|
- **Genuinely forgotten** — no evidence of any attempt
|
|
|
|
Output for each discrepancy:
|
|
\`\`\`
|
|
DISCREPANCY: {PARTIAL|NOT_DONE} | {plan item} | {what was actually delivered}
|
|
INVESTIGATION: {likely reason with evidence from git log / code}
|
|
IMPACT: {HIGH|MEDIUM|LOW} — {what breaks or degrades if this stays undelivered}
|
|
\`\`\`
|
|
|
|
### Learnings Logging (plan-file discrepancies only)
|
|
|
|
**Only for discrepancies sourced from plan files** (not commit messages or TODOS.md), log a learning so future sessions know this pattern occurred:
|
|
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-learnings-log '{
|
|
"type": "pitfall",
|
|
"key": "plan-delivery-gap-KEBAB_SUMMARY",
|
|
"insight": "Planned X but delivered Y because Z",
|
|
"confidence": 8,
|
|
"source": "observed",
|
|
"files": ["PLAN_FILE_PATH"]
|
|
}'
|
|
\`\`\`
|
|
|
|
Replace KEBAB_SUMMARY with a kebab-case summary of the gap, and fill in the actual values.
|
|
|
|
**Do NOT log learnings from commit-message-derived or TODOS.md-derived discrepancies.** These are informational in the review output but too noisy for durable memory.
|
|
|
|
### Integration with Scope Drift Detection
|
|
|
|
The plan completion results augment the existing Scope Drift Detection. If a plan file is found:
|
|
|
|
- **NOT DONE items** become additional evidence for **MISSING REQUIREMENTS** in the scope drift report.
|
|
- **Items in the diff that don't match any plan item** become evidence for **SCOPE CREEP** detection.
|
|
- **HIGH-impact discrepancies** trigger AskUserQuestion:
|
|
- Show the investigation findings
|
|
- Options: A) Stop and implement missing items, B) Ship anyway + create P1 TODOs, C) Intentionally dropped
|
|
|
|
This is **INFORMATIONAL** unless HIGH-impact discrepancies are found (then it gates via AskUserQuestion).
|
|
|
|
Update the scope drift output to include plan file context:
|
|
|
|
\`\`\`
|
|
Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING]
|
|
Intent: <from plan file — 1-line summary>
|
|
Plan: <plan file path>
|
|
Delivered: <1-line summary of what the diff actually does>
|
|
Plan items: N DONE, M PARTIAL, K NOT DONE
|
|
[If NOT DONE: list each missing item with investigation]
|
|
[If scope creep: list each out-of-scope change not in the plan]
|
|
\`\`\`
|
|
|
|
**No plan file found:** Use commit messages and TODOS.md as fallback sources (see above). If no intent sources at all, skip with: "No intent sources detected — skipping completion audit."`);
|
|
}
|
|
|
|
return sections.join('\n');
|
|
}
|
|
|
|
export function generatePlanCompletionAuditShip(_ctx: TemplateContext): string {
|
|
return generatePlanCompletionAuditInner('ship');
|
|
}
|
|
|
|
export function generatePlanCompletionAuditReview(_ctx: TemplateContext): string {
|
|
return generatePlanCompletionAuditInner('review');
|
|
}
|
|
|
|
// ─── Plan Verification Execution ──────────────────────────────────────
|
|
|
|
export function generatePlanVerificationExec(_ctx: TemplateContext): string {
|
|
return `## Step 8.1: Plan Verification
|
|
|
|
Automatically verify the plan's testing/verification steps using the \`/qa-only\` skill.
|
|
|
|
### 1. Check for verification section
|
|
|
|
Using the plan file already discovered in Step 8, look for a verification section. Match any of these headings: \`## Verification\`, \`## Test plan\`, \`## Testing\`, \`## How to test\`, \`## Manual testing\`, or any section with verification-flavored items (URLs to visit, things to check visually, interactions to test).
|
|
|
|
**If no verification section found:** Skip with "No verification steps found in plan — skipping auto-verification."
|
|
**If no plan file was found in Step 8:** Skip (already handled).
|
|
|
|
### 2. Check for running dev server
|
|
|
|
Before invoking browse-based verification, check if a dev server is reachable:
|
|
|
|
\`\`\`bash
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:3000 2>/dev/null || \\
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:8080 2>/dev/null || \\
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:5173 2>/dev/null || \\
|
|
curl -s -o /dev/null -w '%{http_code}' http://localhost:4000 2>/dev/null || echo "NO_SERVER"
|
|
\`\`\`
|
|
|
|
**If NO_SERVER:** Skip with "No dev server detected — skipping plan verification. Run /qa separately after deploying."
|
|
|
|
### 3. Invoke /qa-only inline
|
|
|
|
Read the \`/qa-only\` skill from disk:
|
|
|
|
\`\`\`bash
|
|
cat \${CLAUDE_SKILL_DIR}/../qa-only/SKILL.md
|
|
\`\`\`
|
|
|
|
**If unreadable:** Skip with "Could not load /qa-only — skipping plan verification."
|
|
|
|
Follow the /qa-only workflow with these modifications:
|
|
- **Skip the preamble** (already handled by /ship)
|
|
- **Use the plan's verification section as the primary test input** — treat each verification item as a test case
|
|
- **Use the detected dev server URL** as the base URL
|
|
- **Skip the fix loop** — this is report-only verification during /ship
|
|
- **Cap at the verification items from the plan** — do not expand into general site QA
|
|
|
|
### 4. Gate logic
|
|
|
|
- **All verification items PASS:** Continue silently. "Plan verification: PASS."
|
|
- **Any FAIL:** Use AskUserQuestion:
|
|
- Show the failures with screenshot evidence
|
|
- RECOMMENDATION: Choose A if failures indicate broken functionality. Choose B if cosmetic only.
|
|
- Options:
|
|
A) Fix the failures before shipping (recommended for functional issues)
|
|
B) Ship anyway — known issues (acceptable for cosmetic issues)
|
|
- **No verification section / no server / unreadable skill:** Skip (non-blocking).
|
|
|
|
### 5. Include in PR body
|
|
|
|
Add a \`## Verification Results\` section to the PR body (Step 19):
|
|
- If verification ran: summary of results (N PASS, M FAIL, K SKIPPED)
|
|
- If skipped: reason for skipping (no plan, no server, no verification section)`;
|
|
}
|
|
|
|
// ─── Cross-Review Finding Dedup ──────────────────────────────────────
|
|
|
|
export function generateCrossReviewDedup(ctx: TemplateContext): string {
|
|
const isShip = ctx.skillName === 'ship';
|
|
const stepNum = isShip ? '9.3' : '5.0';
|
|
const findingsRef = isShip
|
|
? 'the checklist pass (Step 9) and specialist review (Step 9.1-9.2)'
|
|
: 'Step 4 critical pass and Step 4.5-4.6 specialists';
|
|
|
|
return `### Step ${stepNum}: Cross-review finding dedup
|
|
|
|
Before classifying findings, check if any were previously skipped by the user in a prior review on this branch.
|
|
|
|
\`\`\`bash
|
|
~/.claude/skills/gstack/bin/gstack-review-read
|
|
\`\`\`
|
|
|
|
Parse the output: only lines BEFORE \`---CONFIG---\` are JSONL entries (the output also contains \`---CONFIG---\` and \`---HEAD---\` footer sections that are not JSONL — ignore those).
|
|
|
|
For each JSONL entry that has a \`findings\` array:
|
|
1. Collect all fingerprints where \`action: "skipped"\`
|
|
2. Note the \`commit\` field from that entry
|
|
|
|
If skipped fingerprints exist, get the list of files changed since that review:
|
|
|
|
\`\`\`bash
|
|
git diff --name-only <prior-review-commit> HEAD
|
|
\`\`\`
|
|
|
|
For each current finding (from both ${findingsRef}), check:
|
|
- Does its fingerprint match a previously skipped finding?
|
|
- Is the finding's file path NOT in the changed-files set?
|
|
|
|
If both conditions are true: suppress the finding. It was intentionally skipped and the relevant code hasn't changed.
|
|
|
|
Print: "Suppressed N findings from prior reviews (previously skipped by user)"
|
|
|
|
**Only suppress \`skipped\` findings — never \`fixed\` or \`auto-fixed\`** (those might regress and should be re-checked).
|
|
|
|
If no prior reviews exist or none have a \`findings\` array, skip this step silently.
|
|
|
|
Output a summary header: \`Pre-Landing Review: N issues (X critical, Y informational)\``;
|
|
}
|