chore: merge main into feature branch

Resolved conflicts in VERSION (keep 0.9.0), CHANGELOG.md (prepend 0.9.0 entry above 0.8.5), and gen-skill-docs.ts (keep dynamic scan over static list). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 11:45:20 +02:00 · 2026-03-19 17:10:08 -07:00
parent 48c906dfe8 bd834aeadb
commit c9d2335222
57 changed files with 1398 additions and 285 deletions
@@ -17,6 +17,7 @@ This document covers the command reference and internals of gstack's headless br
 | Tabs | `tabs`, `tab`, `newtab`, `closetab` | Multi-page workflows |
 | Cookies | `cookie-import`, `cookie-import-browser` | Import cookies from file or real browser |
 | Multi-step | `chain` (JSON from stdin) | Batch commands in one call |
+| Handoff | `handoff [reason]`, `resume` | Switch to visible Chrome for user takeover |

 All selector arguments accept CSS selectors, `@e` refs after `snapshot`, or `@c` refs after `snapshot -C`. 50+ commands total plus cookie import.

@@ -123,6 +124,18 @@ The server hooks into Playwright's `page.on('console')`, `page.on('response')`,

 The `console`, `network`, and `dialog` commands read from the in-memory buffers, not disk.

+### User handoff
+
+When the headless browser can't proceed (CAPTCHA, MFA, complex auth), `handoff` opens a visible Chrome window at the exact same page with all cookies, localStorage, and tabs preserved. The user solves the problem manually, then `resume` returns control to the agent with a fresh snapshot.
+
+```bash
+$B handoff "Stuck on CAPTCHA at login page"   # opens visible Chrome
+# User solves CAPTCHA...
+$B resume                                       # returns to headless with fresh snapshot
+```
+
+The browser auto-suggests `handoff` after 3 consecutive failures. State is fully preserved across the switch — no re-login needed.
+
 ### Dialog handling

 Dialogs (alert, confirm, prompt) are auto-accepted by default to prevent browser lockup. The `dialog-accept` and `dialog-dismiss` commands control this behavior. For prompts, `dialog-accept <text>` provides the response text. All dialogs are logged to the dialog buffer with type, message, and action taken.
@@ -9,6 +9,59 @@
 - **Codex-adapted output.** Frontmatter is stripped to just name + description (Codex doesn't need allowed-tools or hooks). Paths are rewritten from `~/.claude/` to `~/.codex/`. The `/codex` skill itself is excluded from Codex output -- it's a Claude wrapper around `codex exec`, which would be self-referential.
 - **CI checks both hosts.** The freshness check now validates Claude and Codex output independently. Stale Codex docs break the build just like stale Claude docs.

+## [0.8.5] - 2026-03-19
+
+### Fixed
+
+- **`/retro` now counts full calendar days.** Running a retro late at night no longer silently misses commits from earlier in the day. Git treats bare dates like `--since="2026-03-11"` as "11pm on March 11" if you run it at 11pm — now we pass `--since="2026-03-11T00:00:00"` so it always starts from midnight. Compare mode windows get the same fix.
+- **Review log no longer breaks on branch names with `/`.** Branch names like `garrytan/design-system` caused review log writes to fail because Claude Code runs multi-line bash blocks as separate shell invocations, losing variables between commands. New `gstack-review-log` and `gstack-review-read` atomic helpers encapsulate the entire operation in a single command.
+- **All skill templates are now platform-agnostic.** Removed Rails-specific patterns (`bin/test-lane`, `RAILS_ENV`, `.includes()`, `rescue StandardError`, etc.) from `/ship`, `/review`, `/plan-ceo-review`, and `/plan-eng-review`. The review checklist now shows examples for Rails, Node, Python, and Django side-by-side.
+- **`/ship` reads CLAUDE.md to discover test commands** instead of hardcoding `bin/test-lane` and `npm run test`. If no test commands are found, it asks the user and persists the answer to CLAUDE.md.
+
+### Added
+
+- **Platform-agnostic design principle** codified in CLAUDE.md — skills must read project config, never hardcode framework commands.
+- **`## Testing` section** in CLAUDE.md for `/ship` test command discovery.
+
+## [0.8.4] - 2026-03-19
+
+### Added
+
+- **`/ship` now automatically syncs your docs.** After creating the PR, `/ship` runs `/document-release` as Step 8.5 — README, ARCHITECTURE, CONTRIBUTING, and CLAUDE.md all stay current without an extra command. No more stale docs after shipping.
+- **Six new skills in the docs.** README, docs/skills.md, and BROWSER.md now cover `/codex` (multi-AI second opinion), `/careful` (destructive command warnings), `/freeze` (directory-scoped edit lock), `/guard` (full safety mode), `/unfreeze`, and `/gstack-upgrade`. The sprint skill table keeps its 15 specialists; a new "Power tools" section covers the rest.
+- **Browse handoff documented everywhere.** BROWSER.md command table, docs/skills.md deep-dive, and README "What's new" all explain `$B handoff` and `$B resume` for CAPTCHA/MFA/auth walls.
+- **Proactive suggestions know about all skills.** Root SKILL.md.tmpl now suggests `/codex`, `/careful`, `/freeze`, `/guard`, `/unfreeze`, and `/gstack-upgrade` at the right workflow stages.
+
+## [0.8.3] - 2026-03-19
+
+### Added
+
+- **Plan reviews now guide you to the next step.** After running `/plan-ceo-review`, `/plan-eng-review`, or `/plan-design-review`, you get a recommendation for what to run next — eng review is always suggested as the required shipping gate, design review is suggested when UI changes are detected, and CEO review is softly mentioned for big product changes. No more remembering the workflow yourself.
+- **Reviews know when they're stale.** Each review now records the commit it was run at. The dashboard compares that against your current HEAD and tells you exactly how many commits have elapsed — "eng review may be stale — 13 commits since review" instead of guessing.
+- **`skip_eng_review` respected everywhere.** If you've opted out of eng review globally, the chaining recommendations won't nag you about it.
+- **Design review lite now tracks commits too.** The lightweight design check that runs inside `/review` and `/ship` gets the same staleness tracking as full reviews.
+
+### Fixed
+
+- **Browse no longer navigates to dangerous URLs.** `goto`, `diff`, and `newtab` now block `file://`, `javascript:`, `data:` schemes and cloud metadata endpoints (`169.254.169.254`, `metadata.google.internal`). Localhost and private IPs are still allowed for local QA testing. (Closes #17)
+- **Setup script tells you what's missing.** Running `./setup` without `bun` installed now shows a clear error with install instructions instead of a cryptic "command not found." (Closes #147)
+- **`/debug` renamed to `/investigate`.** Claude Code has a built-in `/debug` command that shadowed the gstack skill. The systematic root-cause debugging workflow now lives at `/investigate`. (Closes #190)
+- **Shell injection surface removed.** All skill templates now use `source <(gstack-slug)` instead of `eval $(gstack-slug)`. Same behavior, no `eval`. (Closes #133)
+- **25 new security tests.** URL validation (16 tests) and path traversal validation (14 tests) now have dedicated unit test suites covering scheme blocking, metadata IP blocking, directory escapes, and prefix collision edge cases.
+
+## [0.8.2] - 2026-03-19
+
+### Added
+
+- **Hand off to a real Chrome when the headless browser gets stuck.** Hit a CAPTCHA, auth wall, or MFA prompt? Run `$B handoff "reason"` and a visible Chrome opens at the exact same page with all your cookies and tabs intact. Solve the problem, tell Claude you're done, and `$B resume` picks up right where you left off with a fresh snapshot.
+- **Auto-handoff hint after 3 consecutive failures.** If the browse tool fails 3 times in a row, it suggests using `handoff` — so you don't waste time watching the AI retry a CAPTCHA.
+- **15 new tests for the handoff feature.** Unit tests for state save/restore, failure tracking, edge cases, plus integration tests for the full headless-to-headed flow with cookie and tab preservation.
+
+### Changed
+
+- `recreateContext()` refactored to use shared `saveState()`/`restoreState()` helpers — same behavior, less code, ready for future state persistence features.
+- `browser.close()` now has a 5-second timeout to prevent hangs when closing headed browsers on macOS.
+
 ## [0.8.1] - 2026-03-19

 ### Fixed
@@ -64,7 +117,6 @@ When both `/review` (Claude) and `/codex review` have run, you get a cross-model
 ### Fixed

 - `/debug` and `/office-hours` were completely invisible to natural language — no trigger phrases at all. Now both have full reactive + proactive triggers.
->>>>>>> origin/main

 ## [0.7.0] - 2026-03-18 — YC Office Hours

@@ -102,7 +154,7 @@ When something is broken and you don't know why, `/debug` is your systematic deb
 ### Added

 - **Every PR touching frontend code now gets a design review automatically.** `/review` and `/ship` apply a 20-item design checklist against changed CSS, HTML, JSX, and view files. Catches AI slop patterns (purple gradients, 3-column icon grids, generic hero copy), typography issues (body text < 16px, blacklisted fonts), accessibility gaps (`outline: none`), and `!important` abuse. Mechanical CSS fixes are auto-applied; design judgment calls ask you first.
- **`gstack-diff-scope` categorizes what changed in your branch.** Run `eval $(gstack-diff-scope main)` and get `SCOPE_FRONTEND=true/false`, `SCOPE_BACKEND`, `SCOPE_PROMPTS`, `SCOPE_TESTS`, `SCOPE_DOCS`, `SCOPE_CONFIG`. Design review uses it to skip silently on backend-only PRs. Ship pre-flight uses it to recommend design review when frontend files are touched.
+- **`gstack-diff-scope` categorizes what changed in your branch.** Run `source <(gstack-diff-scope main)` and get `SCOPE_FRONTEND=true/false`, `SCOPE_BACKEND`, `SCOPE_PROMPTS`, `SCOPE_TESTS`, `SCOPE_DOCS`, `SCOPE_CONFIG`. Design review uses it to skip silently on backend-only PRs. Ship pre-flight uses it to recommend design review when frontend files are touched.
 - **Design review shows up in the Review Readiness Dashboard.** The dashboard now distinguishes between "LITE" (code-level, runs automatically in /review and /ship) and "FULL" (visual audit via /plan-design-review with browse binary). Both show up as Design Review entries.
 - **E2E eval for design review detection.** Planted CSS/HTML fixtures with 7 known anti-patterns (Papyrus font, 14px body text, `outline: none`, `!important`, purple gradient, generic hero copy, 3-column feature grid). The eval verifies `/review` catches at least 4 of 7.

@@ -218,7 +270,7 @@ Read the philosophy: https://garryslist.org/posts/boil-the-ocean
 ## 0.5.1 — 2026-03-17
 - **Know where you stand before you ship.** Every `/plan-ceo-review`, `/plan-eng-review`, and `/plan-design-review` now logs its result to a review tracker. At the end of each review, you see a **Review Readiness Dashboard** showing which reviews are done, when they ran, and whether they're clean — with a clear CLEARED TO SHIP or NOT READY verdict.
 - **`/ship` checks your reviews before creating the PR.** Pre-flight now reads the dashboard and asks if you want to continue when reviews are missing. Informational only — it won't block you, but you'll know what you skipped.
- **One less thing to copy-paste.** The SLUG computation (that opaque sed pipeline for computing `owner-repo` from git remote) is now a shared `bin/gstack-slug` helper. All 14 inline copies across templates replaced with `eval $(gstack-slug)`. If the format ever changes, fix it once.
+- **One less thing to copy-paste.** The SLUG computation (that opaque sed pipeline for computing `owner-repo` from git remote) is now a shared `bin/gstack-slug` helper. All 14 inline copies across templates replaced with `source <(gstack-slug)`. If the format ever changes, fix it once.
 - **Screenshots are now visible during QA and browse sessions.** When gstack takes screenshots, they now show up as clickable image elements in your output — no more invisible `/tmp/browse-screenshot.png` paths you can't see. Works in `/qa`, `/qa-only`, `/plan-design-review`, `/qa-design-review`, `/browse`, and `/gstack`.

 ### For contributors
@@ -30,6 +30,17 @@ on `git diff` against the base branch. Each test declares its file dependencies
 llm-judge, gen-skill-docs) trigger all tests. Use `EVALS_ALL=1` or the `:all` script
 variants to force all tests. Run `eval:select` to preview which tests would run.

+## Testing
+
+```bash
+bun test             # run before every commit — free, <2s
+bun run test:evals   # run before shipping — paid, diff-based (~$4/run max)
+```
+
+`bun test` runs skill validation, gen-skill-docs quality checks, and browse
+integration tests. `bun run test:evals` runs LLM-judge quality evals and E2E
+tests via `claude -p`. Both must pass before creating a PR.
+
 ## Project structure

 ```
@@ -59,7 +70,7 @@ gstack/
 ├── plan-ceo-review/ # /plan-ceo-review skill
 ├── plan-eng-review/ # /plan-eng-review skill
 ├── office-hours/    # /office-hours skill (YC Office Hours — startup diagnostic + builder brainstorm)
-├── debug/           # /debug skill (systematic root-cause debugging)
+├── investigate/     # /investigate skill (systematic root-cause debugging)
 ├── retro/           # Retrospective skill
 ├── document-release/ # /document-release skill (post-ship doc updates)
 ├── setup            # One-time setup: build binary + symlink skills
@@ -79,6 +90,18 @@ SKILL.md files are **generated** from `.tmpl` templates. To update docs:
 To add a new browse command: add it to `browse/src/commands.ts` and rebuild.
 To add a snapshot flag: add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts` and rebuild.

+## Platform-agnostic design
+
+Skills must NEVER hardcode framework-specific commands, file patterns, or directory
+structures. Instead:
+
+1. **Read CLAUDE.md** for project-specific config (test commands, eval commands, etc.)
+2. **If missing, AskUserQuestion** — let the user tell you or let gstack search the repo
+3. **Persist the answer to CLAUDE.md** so we never have to ask again
+
+This applies to test commands, eval commands, deploy commands, and any other
+project-specific behavior. The project owns its config; gstack reads it.
+
 ## Writing SKILL templates

 SKILL.md.tmpl files are **prompt templates read by Claude**, not bash scripts.
@@ -16,7 +16,7 @@ In the last 60 days I have written **over 600,000 lines of production code** —

 Same person. Different era. The difference is the tooling.

-**gstack is how I do it.** It is my open source software factory. It turns Claude Code into a virtual engineering team you actually manage — a CEO who rethinks the product, an eng manager who locks the architecture, a designer who catches AI slop, a paranoid reviewer who finds production bugs, a QA lead who opens a real browser and clicks through your app, and a release engineer who ships the PR. Fifteen specialists, all as slash commands, all Markdown, **all free, MIT license, available right now.**
+**gstack is how I do it.** It is my open source software factory. It turns Claude Code into a virtual engineering team you actually manage — a CEO who rethinks the product, an eng manager who locks the architecture, a designer who catches AI slop, a paranoid reviewer who finds production bugs, a QA lead who opens a real browser and clicks through your app, and a release engineer who ships the PR. Fifteen specialists and six power tools, all as slash commands, all Markdown, **all free, MIT license, available right now.**

 I am learning how to get to the edge of what agentic systems can do as of March 2026, and this is my live experiment. I am sharing it because I want the whole world on this journey with me.

@@ -48,11 +48,11 @@ Expect first useful run in under 5 minutes on any repo with tests already set up

 Open Claude Code and paste this. Claude does the rest.

-> Install gstack: run **`git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup`** then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /retro, /debug, /document-release. Then ask the user if they also want to add gstack to the current project so teammates get it.
+> Install gstack: run **`git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup`** then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /retro, /investigate, /document-release, /codex, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade. Then ask the user if they also want to add gstack to the current project so teammates get it.

 ### Step 2: Add to your repo so teammates get it (optional)

-> Add gstack to this project: run **`cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup`** then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /retro, /debug, /document-release, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.
+> Add gstack to this project: run **`cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup`** then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /retro, /investigate, /document-release, /codex, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.

 Real files get committed to your repo (not a submodule), so `git clone` just works. Everything lives inside `.claude/`. Nothing touches your PATH or runs in the background.

@@ -135,7 +135,7 @@ One sprint, one person, one feature — that takes about 30 minutes with gstack.
 | `/plan-design-review` | **Senior Designer** | Rates each design dimension 0-10, explains what a 10 looks like, then edits the plan to get there. AI Slop detection. Interactive — one AskUserQuestion per design choice. |
 | `/design-consultation` | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. |
 | `/review` | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. |
-| `/debug` | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
+| `/investigate` | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
 | `/design-review` | **Designer Who Codes** | Same audit as /plan-design-review, then fixes what it finds. Atomic commits, before/after screenshots. |
 | `/qa` | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
 | `/qa-only` | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
@@ -145,6 +145,17 @@ One sprint, one person, one feature — that takes about 30 minutes with gstack.
 | `/browse` | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
 | `/setup-browser-cookies` | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |

+### Power tools
+
+| Skill | What it does |
+|-------|-------------|
+| `/codex` | **Second Opinion** — independent code review from OpenAI Codex CLI. Three modes: review (pass/fail gate), adversarial challenge, and open consultation. Cross-model analysis when both `/review` and `/codex` have run. |
+| `/careful` | **Safety Guardrails** — warns before destructive commands (rm -rf, DROP TABLE, force-push). Say "be careful" to activate. Override any warning. |
+| `/freeze` | **Edit Lock** — restrict file edits to one directory. Prevents accidental changes outside scope while debugging. |
+| `/guard` | **Full Safety** — `/careful` + `/freeze` in one command. Maximum safety for prod work. |
+| `/unfreeze` | **Unlock** — remove the `/freeze` boundary. |
+| `/gstack-upgrade` | **Self-Updater** — upgrade gstack to latest. Detects global vs vendored install, syncs both, shows what changed. |
+
 **[Deep dives with examples and philosophy for every skill →](docs/skills.md)**

 ## What's new and why it matters
@@ -159,7 +170,15 @@ One sprint, one person, one feature — that takes about 30 minutes with gstack.

 **Test everything.** `/ship` bootstraps test frameworks from scratch if your project doesn't have one. Every `/ship` run produces a coverage audit. Every `/qa` bug fix generates a regression test. 100% test coverage is the goal — tests make vibe coding safe instead of yolo coding.

-**`/document-release` is the engineer you never had.** It reads every doc file in your project, cross-references the diff, and updates everything that drifted. README, ARCHITECTURE, CONTRIBUTING, CLAUDE.md, TODOS — all kept current automatically.
+**`/document-release` is the engineer you never had.** It reads every doc file in your project, cross-references the diff, and updates everything that drifted. README, ARCHITECTURE, CONTRIBUTING, CLAUDE.md, TODOS — all kept current automatically. And now `/ship` auto-invokes it — docs stay current without an extra command.
+
+**Browser handoff when the AI gets stuck.** Hit a CAPTCHA, auth wall, or MFA prompt? `$B handoff` opens a visible Chrome at the exact same page with all your cookies and tabs intact. Solve the problem, tell Claude you're done, `$B resume` picks up right where it left off. The agent even suggests it automatically after 3 consecutive failures.
+
+**Multi-AI second opinion.** `/codex` gets an independent review from OpenAI's Codex CLI — a completely different AI looking at the same diff. Three modes: code review with a pass/fail gate, adversarial challenge that actively tries to break your code, and open consultation with session continuity. When both `/review` (Claude) and `/codex` (OpenAI) have reviewed the same branch, you get a cross-model analysis showing which findings overlap and which are unique to each.
+
+**Safety guardrails on demand.** Say "be careful" and `/careful` warns before any destructive command — rm -rf, DROP TABLE, force-push, git reset --hard. `/freeze` locks edits to one directory while debugging so Claude can't accidentally "fix" unrelated code. `/guard` activates both. `/investigate` auto-freezes to the module being investigated.
+
+**Proactive skill suggestions.** gstack notices what stage you're in — brainstorming, reviewing, debugging, testing — and suggests the right skill. Don't like it? Say "stop suggesting" and it remembers across sessions.

 ## 10-15 parallel sprints

@@ -181,7 +200,7 @@ Same tools, different outcome — because gstack gives you structured roles and

 The models are getting better fast. The people who figure out how to work with them now — really work with them, not just dabble — are going to have a massive advantage. This is that window. Let's go.

-Fifteen specialists. All slash commands. All Markdown. All free. **[github.com/garrytan/gstack](https://github.com/garrytan/gstack)** — MIT License
+Fifteen specialists and six power tools. All slash commands. All Markdown. All free. **[github.com/garrytan/gstack](https://github.com/garrytan/gstack)** — MIT License

 > **We're hiring.** Want to ship 10K+ LOC/day and help harden gstack?
 > Come work at YC — [ycombinator.com/software](https://ycombinator.com/software)
@@ -212,7 +231,8 @@ Fifteen specialists. All slash commands. All Markdown. All free. **[github.com/g
 Use /browse from gstack for all web browsing. Never use mcp__claude-in-chrome__* tools.
 Available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review,
 /design-consultation, /review, /ship, /browse, /qa, /qa-only, /design-review,
-/setup-browser-cookies, /retro, /debug, /document-release.
+/setup-browser-cookies, /retro, /investigate, /document-release, /codex, /careful,
+/freeze, /guard, /unfreeze, /gstack-upgrade.
 ```

 ## License
@@ -15,13 +15,19 @@ description: |
  - Reviewing a plan (architecture) → suggest /plan-eng-review
  - Reviewing a plan (design) → suggest /plan-design-review
  - Creating a design system → suggest /design-consultation
-  - Debugging errors → suggest /debug
+  - Debugging errors → suggest /investigate
  - Testing the app → suggest /qa
  - Code review before merge → suggest /review
  - Visual design audit → suggest /design-review
  - Ready to deploy / create PR → suggest /ship
  - Post-ship doc updates → suggest /document-release
  - Weekly retrospective → suggest /retro
+  - Wanting a second opinion or adversarial code review → suggest /codex
+  - Working with production or live systems → suggest /careful
+  - Want to scope edits to one module/directory → suggest /freeze
+  - Maximum safety mode (destructive warnings + edit restrictions) → suggest /guard
+  - Removing edit restrictions → suggest /unfreeze
+  - Upgrading gstack to latest version → suggest /gstack-upgrade

  If the user pushes back on skill suggestions ("stop suggesting things",
  "I don't need suggestions", "too aggressive"):
@@ -529,7 +535,9 @@ Refs are invalidated on navigation — run `snapshot` again after `goto`.
 ### Server
 | Command | Description |
 |---------|-------------|
+| `handoff [message]` | Open visible Chrome at current page for user takeover |
 | `restart` | Restart server |
+| `resume` | Re-snapshot after user takeover, return control to AI |
 | `status` | Health check |
 | `stop` | Shutdown server |

@@ -15,13 +15,19 @@ description: |
  - Reviewing a plan (architecture) → suggest /plan-eng-review
  - Reviewing a plan (design) → suggest /plan-design-review
  - Creating a design system → suggest /design-consultation
-  - Debugging errors → suggest /debug
+  - Debugging errors → suggest /investigate
  - Testing the app → suggest /qa
  - Code review before merge → suggest /review
  - Visual design audit → suggest /design-review
  - Ready to deploy / create PR → suggest /ship
  - Post-ship doc updates → suggest /document-release
  - Weekly retrospective → suggest /retro
+  - Wanting a second opinion or adversarial code review → suggest /codex
+  - Working with production or live systems → suggest /careful
+  - Want to scope edits to one module/directory → suggest /freeze
+  - Maximum safety mode (destructive warnings + edit restrictions) → suggest /guard
+  - Removing edit restrictions → suggest /unfreeze
+  - Upgrading gstack to latest version → suggest /gstack-upgrade

  If the user pushes back on skill suggestions ("stop suggesting things",
  "I don't need suggestions", "too aggressive"):
@@ -52,7 +52,9 @@

 **Why:** Enables "resume where I left off" for QA sessions and repeatable auth states.

-**Effort:** M
+**Context:** The `saveState()`/`restoreState()` helpers from the handoff feature (browser-manager.ts) already capture cookies + localStorage + sessionStorage + URLs. Adding file I/O on top is ~20 lines.
+
+**Effort:** S
 **Priority:** P3
 **Depends on:** Sessions

@@ -440,17 +442,9 @@ Shipped as v0.5.0 on main. Includes `/plan-design-review` (report-only design au

 ## Document-Release

-### Auto-invoke /document-release from /ship
+### Auto-invoke /document-release from /ship — SHIPPED

-**What:** Add Step 8.5 to /ship that reads document-release/SKILL.md and executes the doc update workflow after creating the PR.
-
-**Why:** Zero-friction doc updates — user runs /ship and docs are automatically current. No extra command to remember.
-
-**Context:** /ship currently ends at Step 8 (PR URL output). Step 8.5 would continue into the document-release workflow. Same pattern as /ship calling /review's checklist in Step 3.5.
-
-**Effort:** S
-**Priority:** P1
-**Depends on:** /document-release shipped
+Shipped in v0.8.3. Step 8.5 added to `/ship` — after creating the PR, `/ship` automatically reads `document-release/SKILL.md` and executes the doc update workflow. Zero-friction doc updates.

 ### `{{DOC_VOICE}}` shared resolver

@@ -518,25 +512,25 @@ Shipped as `/careful`, `/freeze`, `/guard`, and `/unfreeze` in v0.6.5. Includes

 Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into preamble telemetry line. Analytics CLI (`bun run analytics`) for querying. /retro integration shows skills-used-this-week.

-### /debug scoped debugging enhancements (gated on telemetry)
+### /investigate scoped debugging enhancements (gated on telemetry)

-**What:** Six enhancements to /debug auto-freeze, contingent on telemetry showing the freeze hook actually fires in real debugging sessions.
+**What:** Six enhancements to /investigate auto-freeze, contingent on telemetry showing the freeze hook actually fires in real debugging sessions.

-**Why:** /debug v0.7.1 auto-freezes edits to the module being debugged. If telemetry shows the hook fires often, these enhancements make the experience smarter. If it never fires, the problem wasn't real and these aren't worth building.
+**Why:** /investigate v0.7.1 auto-freezes edits to the module being debugged. If telemetry shows the hook fires often, these enhancements make the experience smarter. If it never fires, the problem wasn't real and these aren't worth building.

-**Context:** All items are prose additions to `debug/SKILL.md.tmpl`. No new scripts.
+**Context:** All items are prose additions to `investigate/SKILL.md.tmpl`. No new scripts.

 **Items:**
 1. Stack trace auto-detection for freeze directory (parse deepest app frame)
 2. Freeze boundary widening (ask to widen instead of hard-block when hitting boundary)
 3. Post-fix auto-unfreeze + full test suite run
 4. Debug instrumentation cleanup (tag with DEBUG-TEMP, remove before commit)
-5. Debug session persistence (~/.gstack/debug-sessions/ — save investigation for reuse)
+5. Debug session persistence (~/.gstack/investigate-sessions/ — save investigation for reuse)
 6. Investigation timeline in debug report (hypothesis log with timing)

 **Effort:** M (all 6 combined)
 **Priority:** P3
-**Depends on:** Telemetry data showing freeze hook fires in real /debug sessions
+**Depends on:** Telemetry data showing freeze hook fires in real /investigate sessions

 ## Completed

@@ -1,6 +1,6 @@
 #!/usr/bin/env bash
 # gstack-diff-scope — categorize what changed in the diff against a base branch
-# Usage: eval $(gstack-diff-scope main)  → sets SCOPE_FRONTEND=true SCOPE_BACKEND=false ...
+# Usage: source <(gstack-diff-scope main)  → sets SCOPE_FRONTEND=true SCOPE_BACKEND=false ...
 # Or:    gstack-diff-scope main           → prints SCOPE_*=... lines
 set -euo pipefail

@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+# gstack-review-log — atomically log a review result
+# Usage: gstack-review-log '{"skill":"...","timestamp":"...","status":"..."}'
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+eval $("$SCRIPT_DIR/gstack-slug" 2>/dev/null)
+GSTACK_HOME="${GSTACK_HOME:-$HOME/.gstack}"
+mkdir -p "$GSTACK_HOME/projects/$SLUG"
+echo "$1" >> "$GSTACK_HOME/projects/$SLUG/$BRANCH-reviews.jsonl"
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+# gstack-review-read — read review log and config for dashboard
+# Usage: gstack-review-read
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+eval $("$SCRIPT_DIR/gstack-slug" 2>/dev/null)
+GSTACK_HOME="${GSTACK_HOME:-$HOME/.gstack}"
+cat "$GSTACK_HOME/projects/$SLUG/$BRANCH-reviews.jsonl" 2>/dev/null || echo "NO_REVIEWS"
+echo "---CONFIG---"
+"$SCRIPT_DIR/gstack-config" get skip_eng_review 2>/dev/null || echo "false"
+echo "---HEAD---"
+git rev-parse --short HEAD 2>/dev/null || echo "unknown"
@@ -1,6 +1,6 @@
 #!/usr/bin/env bash
 # gstack-slug — output project slug and sanitized branch name
-# Usage: eval $(gstack-slug)  → sets SLUG and BRANCH variables
+# Usage: source <(gstack-slug)  → sets SLUG and BRANCH variables
 # Or:    gstack-slug           → prints SLUG=... and BRANCH=... lines
 set -euo pipefail
 SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
@@ -259,6 +259,32 @@ $B diff https://staging.app.com https://prod.app.com
 ### 11. Show screenshots to the user
 After `$B screenshot`, `$B snapshot -a -o`, or `$B responsive`, always use the Read tool on the output PNG(s) so the user can see them. Without this, screenshots are invisible.

+## User Handoff
+
+When you hit something you can't handle in headless mode (CAPTCHA, complex auth, multi-factor
+login), hand off to the user:
+
+```bash
+# 1. Open a visible Chrome at the current page
+$B handoff "Stuck on CAPTCHA at login page"
+
+# 2. Tell the user what happened (via AskUserQuestion)
+#    "I've opened Chrome at the login page. Please solve the CAPTCHA
+#     and let me know when you're done."
+
+# 3. When user says "done", re-snapshot and continue
+$B resume
+```
+
+**When to use handoff:**
+- CAPTCHAs or bot detection
+- Multi-factor authentication (SMS, authenticator app)
+- OAuth flows that require user interaction
+- Complex interactions the AI can't handle after 3 attempts
+
+The browser preserves all state (cookies, localStorage, tabs) across the handoff.
+After `resume`, you get a fresh snapshot of wherever the user left off.
+
 ## Snapshot Flags

 The snapshot is your primary tool for understanding and interacting with pages.
@@ -381,6 +407,8 @@ Refs are invalidated on navigation — run `snapshot` again after `goto`.
 ### Server
 | Command | Description |
 |---------|-------------|
+| `handoff [message]` | Open visible Chrome at current page for user takeover |
 | `restart` | Restart server |
+| `resume` | Re-snapshot after user takeover, return control to AI |
 | `status` | Health check |
 | `stop` | Shutdown server |
@@ -106,6 +106,32 @@ $B diff https://staging.app.com https://prod.app.com
 ### 11. Show screenshots to the user
 After `$B screenshot`, `$B snapshot -a -o`, or `$B responsive`, always use the Read tool on the output PNG(s) so the user can see them. Without this, screenshots are invisible.

+## User Handoff
+
+When you hit something you can't handle in headless mode (CAPTCHA, complex auth, multi-factor
+login), hand off to the user:
+
+```bash
+# 1. Open a visible Chrome at the current page
+$B handoff "Stuck on CAPTCHA at login page"
+
+# 2. Tell the user what happened (via AskUserQuestion)
+#    "I've opened Chrome at the login page. Please solve the CAPTCHA
+#     and let me know when you're done."
+
+# 3. When user says "done", re-snapshot and continue
+$B resume
+```
+
+**When to use handoff:**
+- CAPTCHAs or bot detection
+- Multi-factor authentication (SMS, authenticator app)
+- OAuth flows that require user interaction
+- Complex interactions the AI can't handle after 3 attempts
+
+The browser preserves all state (cookies, localStorage, tabs) across the handoff.
+After `resume`, you get a fresh snapshot of wherever the user left off.
+
 ## Snapshot Flags

 {{SNAPSHOT_FLAGS}}
@@ -15,8 +15,9 @@
 *   restores state. Falls back to clean slate on any failure.
 */

-import { chromium, type Browser, type BrowserContext, type BrowserContextOptions, type Page, type Locator } from 'playwright';
+import { chromium, type Browser, type BrowserContext, type BrowserContextOptions, type Page, type Locator, type Cookie } from 'playwright';
 import { addConsoleEntry, addNetworkEntry, addDialogEntry, networkBuffer, type DialogEntry } from './buffers';
+import { validateNavigationUrl } from './url-validation';

 export interface RefEntry {
  locator: Locator;
@@ -24,6 +25,15 @@ export interface RefEntry {
  name: string;
 }

+export interface BrowserState {
+  cookies: Cookie[];
+  pages: Array<{
+    url: string;
+    isActive: boolean;
+    storage: { localStorage: Record<string, string>; sessionStorage: Record<string, string> } | null;
+  }>;
+}
+
 export class BrowserManager {
  private browser: Browser | null = null;
  private context: BrowserContext | null = null;
@@ -47,6 +57,10 @@ export class BrowserManager {
  private dialogAutoAccept: boolean = true;
  private dialogPromptText: string | null = null;

+  // ─── Handoff State ─────────────────────────────────────────
+  private isHeaded: boolean = false;
+  private consecutiveFailures: number = 0;
+
  async launch() {
    this.browser = await chromium.launch({ headless: true });

@@ -77,7 +91,11 @@ export class BrowserManager {
    if (this.browser) {
      // Remove disconnect handler to avoid exit during intentional close
      this.browser.removeAllListeners('disconnected');
-      await this.browser.close();
+      // Timeout: headed browser.close() can hang on macOS
+      await Promise.race([
+        this.browser.close(),
+        new Promise(resolve => setTimeout(resolve, 5000)),
+      ]).catch(() => {});
      this.browser = null;
    }
  }
@@ -102,6 +120,11 @@ export class BrowserManager {
  async newTab(url?: string): Promise<number> {
    if (!this.context) throw new Error('Browser not launched');

+    // Validate URL before allocating page to avoid zombie tabs on rejection
+    if (url) {
+      validateNavigationUrl(url);
+    }
+
    const page = await this.context.newPage();
    const id = this.nextTabId++;
    this.pages.set(id, page);
@@ -269,6 +292,92 @@ export class BrowserManager {
    return this.customUserAgent;
  }

+  // ─── State Save/Restore (shared by recreateContext + handoff) ─
+  /**
+   * Capture browser state: cookies, localStorage, sessionStorage, URLs, active tab.
+   * Skips pages that fail storage reads (e.g., already closed).
+   */
+  async saveState(): Promise<BrowserState> {
+    if (!this.context) throw new Error('Browser not launched');
+
+    const cookies = await this.context.cookies();
+    const pages: BrowserState['pages'] = [];
+
+    for (const [id, page] of this.pages) {
+      const url = page.url();
+      let storage = null;
+      try {
+        storage = await page.evaluate(() => ({
+          localStorage: { ...localStorage },
+          sessionStorage: { ...sessionStorage },
+        }));
+      } catch {}
+      pages.push({
+        url: url === 'about:blank' ? '' : url,
+        isActive: id === this.activeTabId,
+        storage,
+      });
+    }
+
+    return { cookies, pages };
+  }
+
+  /**
+   * Restore browser state into the current context: cookies, pages, storage.
+   * Navigates to saved URLs, restores storage, wires page events.
+   * Failures on individual pages are swallowed — partial restore is better than none.
+   */
+  async restoreState(state: BrowserState): Promise<void> {
+    if (!this.context) throw new Error('Browser not launched');
+
+    // Restore cookies
+    if (state.cookies.length > 0) {
+      await this.context.addCookies(state.cookies);
+    }
+
+    // Re-create pages
+    let activeId: number | null = null;
+    for (const saved of state.pages) {
+      const page = await this.context.newPage();
+      const id = this.nextTabId++;
+      this.pages.set(id, page);
+      this.wirePageEvents(page);
+
+      if (saved.url) {
+        await page.goto(saved.url, { waitUntil: 'domcontentloaded', timeout: 15000 }).catch(() => {});
+      }
+
+      if (saved.storage) {
+        try {
+          await page.evaluate((s: { localStorage: Record<string, string>; sessionStorage: Record<string, string> }) => {
+            if (s.localStorage) {
+              for (const [k, v] of Object.entries(s.localStorage)) {
+                localStorage.setItem(k, v);
+              }
+            }
+            if (s.sessionStorage) {
+              for (const [k, v] of Object.entries(s.sessionStorage)) {
+                sessionStorage.setItem(k, v);
+              }
+            }
+          }, saved.storage);
+        } catch {}
+      }
+
+      if (saved.isActive) activeId = id;
+    }
+
+    // If no pages were saved, create a blank one
+    if (this.pages.size === 0) {
+      await this.newTab();
+    } else {
+      this.activeTabId = activeId ?? [...this.pages.keys()][0];
+    }
+
+    // Clear refs — pages are new, locators are stale
+    this.clearRefs();
+  }
+
  /**
   * Recreate the browser context to apply user agent changes.
   * Saves and restores cookies, localStorage, sessionStorage, and open pages.
@@ -280,25 +389,8 @@ export class BrowserManager {
    }

    try {
-      // 1. Save state from current context
-      const savedCookies = await this.context.cookies();
-      const savedPages: Array<{ url: string; isActive: boolean; storage: { localStorage: Record<string, string>; sessionStorage: Record<string, string> } | null }> = [];
-
-      for (const [id, page] of this.pages) {
-        const url = page.url();
-        let storage = null;
-        try {
-          storage = await page.evaluate(() => ({
-            localStorage: { ...localStorage },
-            sessionStorage: { ...sessionStorage },
-          }));
-        } catch {}
-        savedPages.push({
-          url: url === 'about:blank' ? '' : url,
-          isActive: id === this.activeTabId,
-          storage,
-        });
-      }
+      // 1. Save state
+      const state = await this.saveState();

      // 2. Close old pages and context
      for (const page of this.pages.values()) {
@@ -320,53 +412,8 @@ export class BrowserManager {
        await this.context.setExtraHTTPHeaders(this.extraHeaders);
      }

-      // 4. Restore cookies
-      if (savedCookies.length > 0) {
-        await this.context.addCookies(savedCookies);
-      }
-
-      // 5. Re-create pages
-      let activeId: number | null = null;
-      for (const saved of savedPages) {
-        const page = await this.context.newPage();
-        const id = this.nextTabId++;
-        this.pages.set(id, page);
-        this.wirePageEvents(page);
-
-        if (saved.url) {
-          await page.goto(saved.url, { waitUntil: 'domcontentloaded', timeout: 15000 }).catch(() => {});
-        }
-
-        // 6. Restore storage
-        if (saved.storage) {
-          try {
-            await page.evaluate((s: { localStorage: Record<string, string>; sessionStorage: Record<string, string> }) => {
-              if (s.localStorage) {
-                for (const [k, v] of Object.entries(s.localStorage)) {
-                  localStorage.setItem(k, v);
-                }
-              }
-              if (s.sessionStorage) {
-                for (const [k, v] of Object.entries(s.sessionStorage)) {
-                  sessionStorage.setItem(k, v);
-                }
-              }
-            }, saved.storage);
-          } catch {}
-        }
-
-        if (saved.isActive) activeId = id;
-      }
-
-      // If no pages were saved, create a blank one
-      if (this.pages.size === 0) {
-        await this.newTab();
-      } else {
-        this.activeTabId = activeId ?? [...this.pages.keys()][0];
-      }
-
-      // Clear refs — pages are new, locators are stale
-      this.clearRefs();
+      // 4. Restore state
+      await this.restoreState(state);

      return null; // success
    } catch (err: unknown) {
@@ -391,6 +438,118 @@ export class BrowserManager {
    }
  }

+  // ─── Handoff: Headless → Headed ─────────────────────────────
+  /**
+   * Hand off browser control to the user by relaunching in headed mode.
+   *
+   * Flow (launch-first-close-second for safe rollback):
+   *   1. Save state from current headless browser
+   *   2. Launch NEW headed browser
+   *   3. Restore state into new browser
+   *   4. Close OLD headless browser
+   *   If step 2 fails → return error, headless browser untouched
+   */
+  async handoff(message: string): Promise<string> {
+    if (this.isHeaded) {
+      return `HANDOFF: Already in headed mode at ${this.getCurrentUrl()}`;
+    }
+    if (!this.browser || !this.context) {
+      throw new Error('Browser not launched');
+    }
+
+    // 1. Save state from current browser
+    const state = await this.saveState();
+    const currentUrl = this.getCurrentUrl();
+
+    // 2. Launch new headed browser (try-catch — if this fails, headless stays running)
+    let newBrowser: Browser;
+    try {
+      newBrowser = await chromium.launch({ headless: false, timeout: 15000 });
+    } catch (err: unknown) {
+      const msg = err instanceof Error ? err.message : String(err);
+      return `ERROR: Cannot open headed browser — ${msg}. Headless browser still running.`;
+    }
+
+    // 3. Create context and restore state into new headed browser
+    try {
+      const contextOptions: BrowserContextOptions = {
+        viewport: { width: 1280, height: 720 },
+      };
+      if (this.customUserAgent) {
+        contextOptions.userAgent = this.customUserAgent;
+      }
+      const newContext = await newBrowser.newContext(contextOptions);
+
+      if (Object.keys(this.extraHeaders).length > 0) {
+        await newContext.setExtraHTTPHeaders(this.extraHeaders);
+      }
+
+      // Swap to new browser/context before restoreState (it uses this.context)
+      const oldBrowser = this.browser;
+      const oldContext = this.context;
+
+      this.browser = newBrowser;
+      this.context = newContext;
+      this.pages.clear();
+
+      // Register crash handler on new browser
+      this.browser.on('disconnected', () => {
+        console.error('[browse] FATAL: Chromium process crashed or was killed. Server exiting.');
+        console.error('[browse] Console/network logs flushed to .gstack/browse-*.log');
+        process.exit(1);
+      });
+
+      await this.restoreState(state);
+      this.isHeaded = true;
+
+      // 4. Close old headless browser (fire-and-forget — close() can hang
+      // when another Playwright instance is active, so we don't await it)
+      oldBrowser.removeAllListeners('disconnected');
+      oldBrowser.close().catch(() => {});
+
+      return [
+        `HANDOFF: Browser opened at ${currentUrl}`,
+        `MESSAGE: ${message}`,
+        `STATUS: Waiting for user. Run 'resume' when done.`,
+      ].join('\n');
+    } catch (err: unknown) {
+      // Restore failed — close the new browser, keep old one
+      await newBrowser.close().catch(() => {});
+      const msg = err instanceof Error ? err.message : String(err);
+      return `ERROR: Handoff failed during state restore — ${msg}. Headless browser still running.`;
+    }
+  }
+
+  /**
+   * Resume AI control after user handoff.
+   * Clears stale refs and resets failure counter.
+   * The meta-command handler calls handleSnapshot() after this.
+   */
+  resume(): void {
+    this.clearRefs();
+    this.resetFailures();
+  }
+
+  getIsHeaded(): boolean {
+    return this.isHeaded;
+  }
+
+  // ─── Auto-handoff Hint (consecutive failure tracking) ───────
+  incrementFailures(): void {
+    this.consecutiveFailures++;
+  }
+
+  resetFailures(): void {
+    this.consecutiveFailures = 0;
+  }
+
+  getFailureHint(): string | null {
+    if (this.consecutiveFailures >= 3 && !this.isHeaded) {
+      return `HINT: ${this.consecutiveFailures} consecutive failures. Consider using 'handoff' to let the user help.`;
+    }
+    return null;
+  }
+
  // ─── Console/Network/Dialog/Ref Wiring ────────────────────
  private wirePageEvents(page: Page) {
    // Clear ref map on navigation — refs point to stale elements after page change
@@ -30,6 +30,7 @@ export const META_COMMANDS = new Set([
  'screenshot', 'pdf', 'responsive',
  'chain', 'diff',
  'url', 'snapshot',
+  'handoff', 'resume',
 ]);

 export const ALL_COMMANDS = new Set([...READ_COMMANDS, ...WRITE_COMMANDS, ...META_COMMANDS]);
@@ -94,6 +95,9 @@ export const COMMAND_DESCRIPTIONS: Record<string, { category: string; descriptio
  // Meta
  'snapshot':{ category: 'Snapshot', description: 'Accessibility tree with @e refs for element selection. Flags: -i interactive only, -c compact, -d N depth limit, -s sel scope, -D diff vs previous, -a annotated screenshot, -o path output, -C cursor-interactive @c refs', usage: 'snapshot [flags]' },
  'chain':   { category: 'Meta', description: 'Run commands from JSON stdin. Format: [["cmd","arg1",...],...]' },
+  // Handoff
+  'handoff': { category: 'Server', description: 'Open visible Chrome at current page for user takeover', usage: 'handoff [message]' },
+  'resume':  { category: 'Server', description: 'Re-snapshot after user takeover, return control to AI', usage: 'resume' },
 };

 // Load-time validation: descriptions must cover exactly the command sets
@@ -6,6 +6,7 @@ import type { BrowserManager } from './browser-manager';
 import { handleSnapshot } from './snapshot';
 import { getCleanText } from './read-commands';
 import { READ_COMMANDS, WRITE_COMMANDS, META_COMMANDS } from './commands';
+import { validateNavigationUrl } from './url-validation';
 import * as Diff from 'diff';
 import * as fs from 'fs';
 import * as path from 'path';
@@ -13,7 +14,7 @@ import * as path from 'path';
 // Security: Path validation to prevent path traversal attacks
 const SAFE_DIRECTORIES = ['/tmp', process.cwd()];

-function validateOutputPath(filePath: string): void {
+export function validateOutputPath(filePath: string): void {
  const resolved = path.resolve(filePath);
  const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/'));
  if (!isSafe) {
@@ -221,9 +222,11 @@ export async function handleMetaCommand(
      if (!url1 || !url2) throw new Error('Usage: browse diff <url1> <url2>');

      const page = bm.getPage();
+      validateNavigationUrl(url1);
      await page.goto(url1, { waitUntil: 'domcontentloaded', timeout: 15000 });
      const text1 = await getCleanText(page);

+      validateNavigationUrl(url2);
      await page.goto(url2, { waitUntil: 'domcontentloaded', timeout: 15000 });
      const text2 = await getCleanText(page);

@@ -246,6 +249,19 @@ export async function handleMetaCommand(
      return await handleSnapshot(args, bm);
    }

+    // ─── Handoff ────────────────────────────────────
+    case 'handoff': {
+      const message = args.join(' ') || 'User takeover requested';
+      return await bm.handoff(message);
+    }
+
+    case 'resume': {
+      bm.resume();
+      // Re-snapshot to capture current page state after human interaction
+      const snapshot = await handleSnapshot(['-i'], bm);
+      return `RESUMED\n${snapshot}`;
+    }
+
    default:
      throw new Error(`Unknown meta command: ${command}`);
  }
@@ -38,7 +38,7 @@ function wrapForEvaluate(code: string): string {
 // Security: Path validation to prevent path traversal attacks
 const SAFE_DIRECTORIES = ['/tmp', process.cwd()];

-function validateReadPath(filePath: string): void {
+export function validateReadPath(filePath: string): void {
  if (path.isAbsolute(filePath)) {
    const resolved = path.resolve(filePath);
    const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/'));
@@ -249,12 +249,17 @@ async function handleCommand(body: any): Promise<Response> {
      });
    }

+    browserManager.resetFailures();
    return new Response(result, {
      status: 200,
      headers: { 'Content-Type': 'text/plain' },
    });
  } catch (err: any) {
-    return new Response(JSON.stringify({ error: wrapError(err) }), {
+    browserManager.incrementFailures();
+    let errorMsg = wrapError(err);
+    const hint = browserManager.getFailureHint();
+    if (hint) errorMsg += '\n' + hint;
+    return new Response(JSON.stringify({ error: errorMsg }), {
      status: 500,
      headers: { 'Content-Type': 'application/json' },
    });
@@ -0,0 +1,67 @@
+/**
+ * URL validation for navigation commands — blocks dangerous schemes and cloud metadata endpoints.
+ * Localhost and private IPs are allowed (primary use case: QA testing local dev servers).
+ */
+
+const BLOCKED_METADATA_HOSTS = new Set([
+  '169.254.169.254',  // AWS/GCP/Azure instance metadata
+  'fd00::',           // IPv6 unique local (metadata in some cloud setups)
+  'metadata.google.internal', // GCP metadata
+]);
+
+/**
+ * Normalize hostname for blocklist comparison:
+ * - Strip trailing dot (DNS fully-qualified notation)
+ * - Strip IPv6 brackets (URL.hostname includes [] for IPv6)
+ * - Resolve hex (0xA9FEA9FE) and decimal (2852039166) IP representations
+ */
+function normalizeHostname(hostname: string): string {
+  // Strip IPv6 brackets
+  let h = hostname.startsWith('[') && hostname.endsWith(']')
+    ? hostname.slice(1, -1)
+    : hostname;
+  // Strip trailing dot
+  if (h.endsWith('.')) h = h.slice(0, -1);
+  return h;
+}
+
+/**
+ * Check if a hostname resolves to the link-local metadata IP 169.254.169.254.
+ * Catches hex (0xA9FEA9FE), decimal (2852039166), and octal (0251.0376.0251.0376) forms.
+ */
+function isMetadataIp(hostname: string): boolean {
+  // Try to parse as a numeric IP via URL constructor — it normalizes all forms
+  try {
+    const probe = new URL(`http://${hostname}`);
+    const normalized = probe.hostname;
+    if (BLOCKED_METADATA_HOSTS.has(normalized)) return true;
+    // Also check after stripping trailing dot
+    if (normalized.endsWith('.') && BLOCKED_METADATA_HOSTS.has(normalized.slice(0, -1))) return true;
+  } catch {
+    // Not a valid hostname — can't be a metadata IP
+  }
+  return false;
+}
+
+export function validateNavigationUrl(url: string): void {
+  let parsed: URL;
+  try {
+    parsed = new URL(url);
+  } catch {
+    throw new Error(`Invalid URL: ${url}`);
+  }
+
+  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
+    throw new Error(
+      `Blocked: scheme "${parsed.protocol}" is not allowed. Only http: and https: URLs are permitted.`
+    );
+  }
+
+  const hostname = normalizeHostname(parsed.hostname.toLowerCase());
+
+  if (BLOCKED_METADATA_HOSTS.has(hostname) || isMetadataIp(hostname)) {
+    throw new Error(
+      `Blocked: ${parsed.hostname} is a cloud metadata endpoint. Access is denied for security.`
+    );
+  }
+}
@@ -7,6 +7,7 @@

 import type { BrowserManager } from './browser-manager';
 import { findInstalledBrowsers, importCookies } from './cookie-import-browser';
+import { validateNavigationUrl } from './url-validation';
 import * as fs from 'fs';
 import * as path from 'path';

@@ -21,6 +22,7 @@ export async function handleWriteCommand(
    case 'goto': {
      const url = args[0];
      if (!url) throw new Error('Usage: browse goto <url>');
+      validateNavigationUrl(url);
      const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 15000 });
      const status = response?.status() || 'unknown';
      return `Navigated to ${url} (${status})`;
@@ -0,0 +1,235 @@
+/**
+ * Tests for handoff/resume commands — headless-to-headed browser switching.
+ *
+ * Unit tests cover saveState/restoreState, failure tracking, and edge cases.
+ * Integration tests cover the full handoff flow with real Playwright browsers.
+ */
+
+import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
+import { startTestServer } from './test-server';
+import { BrowserManager, type BrowserState } from '../src/browser-manager';
+import { handleWriteCommand } from '../src/write-commands';
+import { handleMetaCommand } from '../src/meta-commands';
+
+let testServer: ReturnType<typeof startTestServer>;
+let bm: BrowserManager;
+let baseUrl: string;
+
+beforeAll(async () => {
+  testServer = startTestServer(0);
+  baseUrl = testServer.url;
+
+  bm = new BrowserManager();
+  await bm.launch();
+});
+
+afterAll(() => {
+  try { testServer.server.stop(); } catch {}
+  setTimeout(() => process.exit(0), 500);
+});
+
+// ─── Unit Tests: Failure Tracking (no browser needed) ────────────
+
+describe('failure tracking', () => {
+  test('getFailureHint returns null when below threshold', () => {
+    const tracker = new BrowserManager();
+    tracker.incrementFailures();
+    tracker.incrementFailures();
+    expect(tracker.getFailureHint()).toBeNull();
+  });
+
+  test('getFailureHint returns hint after 3 consecutive failures', () => {
+    const tracker = new BrowserManager();
+    tracker.incrementFailures();
+    tracker.incrementFailures();
+    tracker.incrementFailures();
+    const hint = tracker.getFailureHint();
+    expect(hint).not.toBeNull();
+    expect(hint).toContain('handoff');
+    expect(hint).toContain('3');
+  });
+
+  test('hint suppressed when already headed', () => {
+    const tracker = new BrowserManager();
+    (tracker as any).isHeaded = true;
+    tracker.incrementFailures();
+    tracker.incrementFailures();
+    tracker.incrementFailures();
+    expect(tracker.getFailureHint()).toBeNull();
+  });
+
+  test('resetFailures clears the counter', () => {
+    const tracker = new BrowserManager();
+    tracker.incrementFailures();
+    tracker.incrementFailures();
+    tracker.incrementFailures();
+    expect(tracker.getFailureHint()).not.toBeNull();
+    tracker.resetFailures();
+    expect(tracker.getFailureHint()).toBeNull();
+  });
+
+  test('getIsHeaded returns false by default', () => {
+    const tracker = new BrowserManager();
+    expect(tracker.getIsHeaded()).toBe(false);
+  });
+});
+
+// ─── Unit Tests: State Save/Restore (shared browser) ─────────────
+
+describe('saveState', () => {
+  test('captures cookies and page URLs', async () => {
+    await handleWriteCommand('goto', [baseUrl + '/basic.html'], bm);
+    await handleWriteCommand('cookie', ['testcookie=testvalue'], bm);
+
+    const state = await bm.saveState();
+
+    expect(state.cookies.length).toBeGreaterThan(0);
+    expect(state.cookies.some(c => c.name === 'testcookie')).toBe(true);
+    expect(state.pages.length).toBeGreaterThanOrEqual(1);
+    expect(state.pages.some(p => p.url.includes('/basic.html'))).toBe(true);
+  }, 15000);
+
+  test('captures localStorage and sessionStorage', async () => {
+    await handleWriteCommand('goto', [baseUrl + '/basic.html'], bm);
+    const page = bm.getPage();
+    await page.evaluate(() => {
+      localStorage.setItem('lsKey', 'lsValue');
+      sessionStorage.setItem('ssKey', 'ssValue');
+    });
+
+    const state = await bm.saveState();
+    const activePage = state.pages.find(p => p.isActive);
+
+    expect(activePage).toBeDefined();
+    expect(activePage!.storage).not.toBeNull();
+    expect(activePage!.storage!.localStorage).toHaveProperty('lsKey', 'lsValue');
+    expect(activePage!.storage!.sessionStorage).toHaveProperty('ssKey', 'ssValue');
+  }, 15000);
+
+  test('captures multiple tabs', async () => {
+    while (bm.getTabCount() > 1) {
+      await bm.closeTab();
+    }
+    await handleWriteCommand('goto', [baseUrl + '/basic.html'], bm);
+    await handleMetaCommand('newtab', [baseUrl + '/form.html'], bm, () => {});
+
+    const state = await bm.saveState();
+    expect(state.pages.length).toBe(2);
+    const activePage = state.pages.find(p => p.isActive);
+    expect(activePage).toBeDefined();
+    expect(activePage!.url).toContain('/form.html');
+
+    await bm.closeTab();
+  }, 15000);
+});
+
+describe('restoreState', () => {
+  test('state survives recreateContext round-trip', async () => {
+    await handleWriteCommand('goto', [baseUrl + '/basic.html'], bm);
+    await handleWriteCommand('cookie', ['restored=yes'], bm);
+
+    const stateBefore = await bm.saveState();
+    expect(stateBefore.cookies.some(c => c.name === 'restored')).toBe(true);
+
+    await bm.recreateContext();
+
+    const stateAfter = await bm.saveState();
+    expect(stateAfter.cookies.some(c => c.name === 'restored')).toBe(true);
+    expect(stateAfter.pages.length).toBeGreaterThanOrEqual(1);
+  }, 30000);
+});
+
+// ─── Unit Tests: Handoff Edge Cases ──────────────────────────────
+
+describe('handoff edge cases', () => {
+  test('handoff when already headed returns no-op', async () => {
+    (bm as any).isHeaded = true;
+    const result = await bm.handoff('test');
+    expect(result).toContain('Already in headed mode');
+    (bm as any).isHeaded = false;
+  }, 10000);
+
+  test('resume clears refs and resets failures', () => {
+    bm.incrementFailures();
+    bm.incrementFailures();
+    bm.incrementFailures();
+    bm.resume();
+    expect(bm.getFailureHint()).toBeNull();
+    expect(bm.getRefCount()).toBe(0);
+  });
+
+  test('resume without prior handoff works via meta command', async () => {
+    await handleWriteCommand('goto', [baseUrl + '/basic.html'], bm);
+    const result = await handleMetaCommand('resume', [], bm, () => {});
+    expect(result).toContain('RESUMED');
+  }, 15000);
+});
+
+// ─── Integration Tests: Full Handoff Flow ────────────────────────
+// Each handoff test creates its own BrowserManager since handoff swaps the browser.
+// These tests run sequentially (one browser at a time) to avoid resource issues.
+
+describe('handoff integration', () => {
+  test('full handoff: cookies preserved, headed mode active, commands work', async () => {
+    const hbm = new BrowserManager();
+    await hbm.launch();
+
+    try {
+      // Set up state
+      await handleWriteCommand('goto', [baseUrl + '/basic.html'], hbm);
+      await handleWriteCommand('cookie', ['handoff_test=preserved'], hbm);
+
+      // Handoff
+      const result = await hbm.handoff('Testing handoff');
+      expect(result).toContain('HANDOFF:');
+      expect(result).toContain('Testing handoff');
+      expect(result).toContain('resume');
+      expect(hbm.getIsHeaded()).toBe(true);
+
+      // Verify cookies survived
+      const { handleReadCommand } = await import('../src/read-commands');
+      const cookiesResult = await handleReadCommand('cookies', [], hbm);
+      expect(cookiesResult).toContain('handoff_test');
+
+      // Verify commands still work
+      const text = await handleReadCommand('text', [], hbm);
+      expect(text.length).toBeGreaterThan(0);
+
+      // Resume
+      const resumeResult = await handleMetaCommand('resume', [], hbm, () => {});
+      expect(resumeResult).toContain('RESUMED');
+    } finally {
+      await hbm.close();
+    }
+  }, 45000);
+
+  test('multi-tab handoff preserves all tabs', async () => {
+    const hbm = new BrowserManager();
+    await hbm.launch();
+
+    try {
+      await handleWriteCommand('goto', [baseUrl + '/basic.html'], hbm);
+      await handleMetaCommand('newtab', [baseUrl + '/form.html'], hbm, () => {});
+      expect(hbm.getTabCount()).toBe(2);
+
+      await hbm.handoff('multi-tab test');
+      expect(hbm.getTabCount()).toBe(2);
+      expect(hbm.getIsHeaded()).toBe(true);
+    } finally {
+      await hbm.close();
+    }
+  }, 45000);
+
+  test('handoff meta command joins args as message', async () => {
+    const hbm = new BrowserManager();
+    await hbm.launch();
+
+    try {
+      await handleWriteCommand('goto', [baseUrl + '/basic.html'], hbm);
+      const result = await handleMetaCommand('handoff', ['CAPTCHA', 'stuck'], hbm, () => {});
+      expect(result).toContain('CAPTCHA stuck');
+    } finally {
+      await hbm.close();
+    }
+  }, 45000);
+});
@@ -0,0 +1,63 @@
+import { describe, it, expect } from 'bun:test';
+import { validateOutputPath } from '../src/meta-commands';
+import { validateReadPath } from '../src/read-commands';
+
+describe('validateOutputPath', () => {
+  it('allows paths within /tmp', () => {
+    expect(() => validateOutputPath('/tmp/screenshot.png')).not.toThrow();
+  });
+
+  it('allows paths in subdirectories of /tmp', () => {
+    expect(() => validateOutputPath('/tmp/browse/output.png')).not.toThrow();
+  });
+
+  it('allows paths within cwd', () => {
+    expect(() => validateOutputPath(`${process.cwd()}/output.png`)).not.toThrow();
+  });
+
+  it('blocks paths outside safe directories', () => {
+    expect(() => validateOutputPath('/etc/cron.d/backdoor.png')).toThrow(/Path must be within/);
+  });
+
+  it('blocks /tmpevil prefix collision', () => {
+    expect(() => validateOutputPath('/tmpevil/file.png')).toThrow(/Path must be within/);
+  });
+
+  it('blocks home directory paths', () => {
+    expect(() => validateOutputPath('/Users/someone/file.png')).toThrow(/Path must be within/);
+  });
+
+  it('blocks path traversal via ..', () => {
+    expect(() => validateOutputPath('/tmp/../etc/passwd')).toThrow(/Path must be within/);
+  });
+});
+
+describe('validateReadPath', () => {
+  it('allows absolute paths within /tmp', () => {
+    expect(() => validateReadPath('/tmp/script.js')).not.toThrow();
+  });
+
+  it('allows absolute paths within cwd', () => {
+    expect(() => validateReadPath(`${process.cwd()}/test.js`)).not.toThrow();
+  });
+
+  it('allows relative paths without traversal', () => {
+    expect(() => validateReadPath('src/index.js')).not.toThrow();
+  });
+
+  it('blocks absolute paths outside safe directories', () => {
+    expect(() => validateReadPath('/etc/passwd')).toThrow(/Absolute path must be within/);
+  });
+
+  it('blocks /tmpevil prefix collision', () => {
+    expect(() => validateReadPath('/tmpevil/file.js')).toThrow(/Absolute path must be within/);
+  });
+
+  it('blocks path traversal sequences', () => {
+    expect(() => validateReadPath('../../../etc/passwd')).toThrow(/Path traversal/);
+  });
+
+  it('blocks nested path traversal', () => {
+    expect(() => validateReadPath('src/../../etc/passwd')).toThrow(/Path traversal/);
+  });
+});
@@ -0,0 +1,68 @@
+import { describe, it, expect } from 'bun:test';
+import { validateNavigationUrl } from '../src/url-validation';
+
+describe('validateNavigationUrl', () => {
+  it('allows http URLs', () => {
+    expect(() => validateNavigationUrl('http://example.com')).not.toThrow();
+  });
+
+  it('allows https URLs', () => {
+    expect(() => validateNavigationUrl('https://example.com/path?q=1')).not.toThrow();
+  });
+
+  it('allows localhost', () => {
+    expect(() => validateNavigationUrl('http://localhost:3000')).not.toThrow();
+  });
+
+  it('allows 127.0.0.1', () => {
+    expect(() => validateNavigationUrl('http://127.0.0.1:8080')).not.toThrow();
+  });
+
+  it('allows private IPs', () => {
+    expect(() => validateNavigationUrl('http://192.168.1.1')).not.toThrow();
+  });
+
+  it('blocks file:// scheme', () => {
+    expect(() => validateNavigationUrl('file:///etc/passwd')).toThrow(/scheme.*not allowed/i);
+  });
+
+  it('blocks javascript: scheme', () => {
+    expect(() => validateNavigationUrl('javascript:alert(1)')).toThrow(/scheme.*not allowed/i);
+  });
+
+  it('blocks data: scheme', () => {
+    expect(() => validateNavigationUrl('data:text/html,<h1>hi</h1>')).toThrow(/scheme.*not allowed/i);
+  });
+
+  it('blocks AWS/GCP metadata endpoint', () => {
+    expect(() => validateNavigationUrl('http://169.254.169.254/latest/meta-data/')).toThrow(/cloud metadata/i);
+  });
+
+  it('blocks GCP metadata hostname', () => {
+    expect(() => validateNavigationUrl('http://metadata.google.internal/computeMetadata/v1/')).toThrow(/cloud metadata/i);
+  });
+
+  it('blocks metadata hostname with trailing dot', () => {
+    expect(() => validateNavigationUrl('http://metadata.google.internal./computeMetadata/v1/')).toThrow(/cloud metadata/i);
+  });
+
+  it('blocks metadata IP in hex form', () => {
+    expect(() => validateNavigationUrl('http://0xA9FEA9FE/')).toThrow(/cloud metadata/i);
+  });
+
+  it('blocks metadata IP in decimal form', () => {
+    expect(() => validateNavigationUrl('http://2852039166/')).toThrow(/cloud metadata/i);
+  });
+
+  it('blocks metadata IP in octal form', () => {
+    expect(() => validateNavigationUrl('http://0251.0376.0251.0376/')).toThrow(/cloud metadata/i);
+  });
+
+  it('blocks IPv6 metadata with brackets', () => {
+    expect(() => validateNavigationUrl('http://[fd00::]/')).toThrow(/cloud metadata/i);
+  });
+
+  it('throws on malformed URLs', () => {
+    expect(() => validateNavigationUrl('not-a-url')).toThrow(/Invalid URL/i);
+  });
+});
@@ -279,10 +279,7 @@ CROSS-MODEL ANALYSIS:

 7. Persist the review result:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-BRANCH_SLUG=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
-mkdir -p ~/.gstack/projects/"$SLUG"
-echo '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE","findings":N}' >> ~/.gstack/projects/"$SLUG"/"$BRANCH_SLUG"-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE","findings":N}'
 ```

 Substitute: TIMESTAMP (ISO 8601), STATUS ("clean" if PASS, "issues_found" if FAIL),
@@ -126,10 +126,7 @@ CROSS-MODEL ANALYSIS:

 7. Persist the review result:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-BRANCH_SLUG=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
-mkdir -p ~/.gstack/projects/"$SLUG"
-echo '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE","findings":N}' >> ~/.gstack/projects/"$SLUG"/"$BRANCH_SLUG"-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE","findings":N}'
 ```

 Substitute: TIMESTAMP (ISO 8601), STATUS ("clean" if PASS, "issues_found" if FAIL),
@@ -188,7 +188,7 @@ ls src/ app/ pages/ components/ 2>/dev/null | head -30
 Look for office-hours output:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
 ls ~/.gstack/projects/$SLUG/*office-hours* 2>/dev/null | head -5
 ls .context/*office-hours* .context/attachments/*office-hours* 2>/dev/null | head -5
 ```
@@ -52,7 +52,7 @@ ls src/ app/ pages/ components/ 2>/dev/null | head -30
 Look for office-hours output:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
 ls ~/.gstack/projects/$SLUG/*office-hours* 2>/dev/null | head -5
 ls .context/*office-hours* .context/attachments/*office-hours* 2>/dev/null | head -5
 ```
@@ -635,8 +635,7 @@ Compare screenshots and observations across pages for:

 **Project-scoped:**
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 ```
 Write to: `~/.gstack/projects/{slug}/{user}-{branch}-design-audit-{datetime}.md`

@@ -854,8 +853,7 @@ Write the report to both local and project-scoped locations:

 **Project-scoped:**
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 ```
 Write to `~/.gstack/projects/{slug}/{user}-{branch}-design-audit-{datetime}.md`

@@ -220,8 +220,7 @@ Write the report to both local and project-scoped locations:

 **Project-scoped:**
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 ```
 Write to `~/.gstack/projects/{slug}/{user}-{branch}-design-audit-{datetime}.md`

@@ -10,7 +10,7 @@ Detailed guides for every gstack skill — philosophy, workflow, and examples.
 | [`/plan-design-review`](#plan-design-review) | **Senior Designer** | Interactive plan-mode design review. Rates each dimension 0-10, explains what a 10 looks like, fixes the plan. Works in plan mode. |
 | [`/design-consultation`](#design-consultation) | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. |
 | [`/review`](#review) | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. |
-| [`/debug`](#debug) | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
+| [`/investigate`](#investigate) | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
 | [`/design-review`](#design-review) | **Designer Who Codes** | Live-site visual audit + fix loop. 80-item audit, then fixes what it finds. Atomic commits, before/after screenshots. |
 | [`/qa`](#qa) | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
 | [`/qa-only`](#qa) | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
@@ -19,6 +19,16 @@ Detailed guides for every gstack skill — philosophy, workflow, and examples.
 | [`/retro`](#retro) | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
 | [`/browse`](#browse) | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
 | [`/setup-browser-cookies`](#setup-browser-cookies) | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
+| | | |
+| **Multi-AI** | | |
+| [`/codex`](#codex) | **Second Opinion** | Independent review from OpenAI Codex CLI. Three modes: code review (pass/fail gate), adversarial challenge, and open consultation with session continuity. Cross-model analysis when both `/review` and `/codex` have run. |
+| | | |
+| **Safety & Utility** | | |
+| [`/careful`](#safety--guardrails) | **Safety Guardrails** | Warns before destructive commands (rm -rf, DROP TABLE, force-push, git reset --hard). Override any warning. Common build cleanups whitelisted. |
+| [`/freeze`](#safety--guardrails) | **Edit Lock** | Restrict all file edits to a single directory. Blocks Edit and Write outside the boundary. Accident prevention for debugging. |
+| [`/guard`](#safety--guardrails) | **Full Safety** | Combines /careful + /freeze in one command. Maximum safety for prod work. |
+| [`/unfreeze`](#safety--guardrails) | **Unlock** | Remove the /freeze boundary, allowing edits everywhere again. |
+| [`/gstack-upgrade`](#gstack-upgrade) | **Self-Updater** | Upgrade gstack to the latest version. Detects global vs vendored install, syncs both, shows what changed. |

 ---

@@ -440,9 +450,9 @@ I want the model imagining the production incident before it happens.

 ---

-## `/debug`
+## `/investigate`

-When something is broken and you don't know why, `/debug` is your systematic debugger. It follows the Iron Law: **no fixes without root cause investigation first.**
+When something is broken and you don't know why, `/investigate` is your systematic debugger. It follows the Iron Law: **no fixes without root cause investigation first.**

 Instead of guessing and patching, it traces data flow, matches against known bug patterns, and tests hypotheses one at a time. If three fix attempts fail, it stops and questions the architecture instead of thrashing. This prevents the "let me try one more thing" spiral that wastes hours.

@@ -616,6 +626,29 @@ Claude: [18 tool calls, ~60 seconds]

 18 tool calls, about a minute. Full QA pass. No browser opened.

+### Browser handoff
+
+When the headless browser gets stuck — CAPTCHA, MFA, complex auth — hand off to the user:
+
+```
+Claude: I'm stuck on a CAPTCHA at the login page. Opening a visible
+        Chrome so you can solve it.
+
+        > browse handoff "Stuck on CAPTCHA at login page"
+
+        Chrome opened at https://app.example.com/login with all your
+        cookies and tabs intact. Solve the CAPTCHA and tell me when
+        you're done.
+
+You:    done
+
+Claude: > browse resume
+
+        Got a fresh snapshot. Logged in successfully. Continuing QA.
+```
+
+The browser preserves all state (cookies, localStorage, tabs) across the handoff. After `resume`, the agent gets a fresh snapshot of wherever you left off. If the browse tool fails 3 times in a row, it automatically suggests using `handoff`.
+
 **Security note:** `/browse` runs a persistent Chromium session. Cookies, localStorage, and session state carry over between commands. Do not use it against sensitive production environments unless you intend to — it is a real browser with real state. The session auto-shuts down after 30 minutes of idle time.

 For the full command reference, see [BROWSER.md](../BROWSER.md).
@@ -653,6 +686,116 @@ Claude: Imported 12 cookies for github.com from Comet.

 ---

+## `/codex`
+
+This is my **second opinion mode**.
+
+When `/review` catches bugs from Claude's perspective, `/codex` brings a completely different AI — OpenAI's Codex CLI — to review the same diff. Different training, different blind spots, different strengths. The overlap tells you what's definitely real. The unique findings from each are where you find the bugs neither would catch alone.
+
+### Three modes
+
+**Review** — run `codex review` against the current diff. Codex reads every changed file, classifies findings by severity (P1 critical, P2 high, P3 medium), and returns a PASS/FAIL verdict. Any P1 finding = FAIL. The review is fully independent — Codex doesn't see Claude's review.
+
+**Challenge** — adversarial mode. Codex actively tries to break your code. It looks for edge cases, race conditions, security holes, and assumptions that would fail under load. Uses maximum reasoning effort (`xhigh`). Think of it as a penetration test for your logic.
+
+**Consult** — open conversation with session continuity. Ask Codex anything about the codebase. Follow-up questions reuse the same session, so context carries over. Great for "am I thinking about this correctly?" moments.
+
+### Cross-model analysis
+
+When both `/review` (Claude) and `/codex` (OpenAI) have reviewed the same branch, you get a cross-model comparison: which findings overlap (high confidence), which are unique to Codex (different perspective), and which are unique to Claude. This is the "two doctors, same patient" approach to code review.
+
+```
+You:   /codex review
+
+Claude: Running independent Codex review...
+
+        CODEX REVIEW: PASS (3 findings)
+        [P2] Race condition in payment handler — concurrent charges
+             can double-debit without advisory lock
+        [P3] Missing null check on user.email before downcase
+        [P3] Token comparison not using constant-time compare
+
+        Cross-model analysis (vs /review):
+        OVERLAP: Race condition in payment handler (both caught it)
+        UNIQUE TO CODEX: Token comparison timing attack
+        UNIQUE TO CLAUDE: N+1 query in listing photos
+```
+
+---
+
+## Safety & Guardrails
+
+Four skills that add safety rails to any Claude Code session. They work via Claude Code's PreToolUse hooks — transparent, session-scoped, no configuration files.
+
+### `/careful`
+
+Say "be careful" or run `/careful` when you're working near production, running destructive commands, or just want a safety net. Every Bash command gets checked against known-dangerous patterns:
+
+- `rm -rf` / `rm -r` — recursive delete
+- `DROP TABLE` / `DROP DATABASE` / `TRUNCATE` — data loss
+- `git push --force` / `git push -f` — history rewrite
+- `git reset --hard` — discard commits
+- `git checkout .` / `git restore .` — discard uncommitted work
+- `kubectl delete` — production resource deletion
+- `docker rm -f` / `docker system prune` — container/image loss
+
+Common build artifact cleanups (`rm -rf node_modules`, `dist`, `.next`, `__pycache__`, `build`, `coverage`) are whitelisted — no false alarms on routine operations.
+
+You can override any warning. The guardrails are accident prevention, not access control.
+
+### `/freeze`
+
+Restrict all file edits to a single directory. When you're debugging a billing bug, you don't want Claude accidentally "fixing" unrelated code in `src/auth/`. `/freeze src/billing` blocks all Edit and Write operations outside that path.
+
+`/investigate` activates this automatically — it detects the module being debugged and freezes edits to that directory.
+
+```
+You:   /freeze src/billing
+
+Claude: Edits restricted to src/billing/. Run /unfreeze to remove.
+
+        [Later, Claude tries to edit src/auth/middleware.ts]
+
+Claude: BLOCKED — Edit outside freeze boundary (src/billing/).
+        Skipping this change.
+```
+
+Note: this blocks Edit and Write tools only. Bash commands like `sed` can still modify files outside the boundary — it's accident prevention, not a security sandbox.
+
+### `/guard`
+
+Full safety mode — combines `/careful` + `/freeze` in one command. Destructive command warnings plus directory-scoped edits. Use when touching prod or debugging live systems.
+
+### `/unfreeze`
+
+Remove the `/freeze` boundary, allowing edits everywhere again. The hooks stay registered for the session — they just allow everything. Run `/freeze` again to set a new boundary.
+
+---
+
+## `/gstack-upgrade`
+
+Keep gstack current with one command. It detects your install type (global at `~/.claude/skills/gstack` vs vendored in your project at `.claude/skills/gstack`), runs the upgrade, syncs both copies if you have dual installs, and shows you what changed.
+
+```
+You:   /gstack-upgrade
+
+Claude: Current version: 0.7.4
+        Latest version: 0.8.2
+
+        What's new:
+        - Browse handoff for CAPTCHAs and auth walls
+        - /codex multi-AI second opinion
+        - /qa always uses browser now
+        - Safety skills: /careful, /freeze, /guard
+        - Proactive skill suggestions
+
+        Upgraded to 0.8.2. Both global and project installs synced.
+```
+
+Set `auto_upgrade: true` in `~/.gstack/config.yaml` to skip the prompt entirely — gstack upgrades silently at the start of each session when a new version is available.
+
+---
+
 ## Greptile integration

 [Greptile](https://greptile.com) is a YC company that reviews your PRs automatically. It catches real bugs — race conditions, security issues, things that pass CI and blow up in production. It has genuinely saved my ass more than once. I love these guys.
@@ -1,5 +1,5 @@
 ---
-name: debug
+name: investigate
 version: 1.0.0
 description: |
  Systematic debugging with root cause investigation. Four phases: investigate,
@@ -49,7 +49,7 @@ echo "PROACTIVE: $_PROACTIVE"
 _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
 echo "LAKE_INTRO: $_LAKE_SEEN"
 mkdir -p ~/.gstack/analytics
-echo '{"skill":"debug","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}'  >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
+echo '{"skill":"investigate","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}'  >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
 ```

 If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
@@ -1,5 +1,5 @@
 ---
-name: debug
+name: investigate
 version: 1.0.0
 description: |
  Systematic debugging with root cause investigation. Four phases: investigate,
@@ -172,7 +172,7 @@ You are a **YC office hours partner**. Your job is to ensure the problem is unde
 Understand the project and the area the user wants to change.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
 ```

 1. Read `CLAUDE.md`, `TODOS.md` (if they exist).
@@ -445,10 +445,9 @@ Count the signals. You'll use this count in Phase 6 to determine which tier of c
 Write the design document to the project directory.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 USER=$(whoami)
 DATETIME=$(date +%Y%m%d-%H%M%S)
-mkdir -p ~/.gstack/projects/$SLUG
 ```

 **Design lineage:** Before writing, check for existing design docs on this branch:
@@ -36,7 +36,7 @@ You are a **YC office hours partner**. Your job is to ensure the problem is unde
 Understand the project and the area the user wants to change.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
 ```

 1. Read `CLAUDE.md`, `TODOS.md` (if they exist).
@@ -309,10 +309,9 @@ Count the signals. You'll use this count in Phase 6 to determine which tier of c
 Write the design document to the project directory.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 USER=$(whoami)
 DATETIME=$(date +%Y%m%d-%H%M%S)
-mkdir -p ~/.gstack/projects/$SLUG
 ```

 **Design lineage:** Before writing, check for existing design docs on this branch:
@@ -190,7 +190,7 @@ Do NOT make any code changes. Do NOT start implementation. Your only job right n

 ## Prime Directives
 1. Zero silent failures. Every failure mode must be visible — to the system, to the team, to the user. If a failure can happen silently, that is a critical defect in the plan.
-2. Every error has a name. Don't say "handle errors." Name the specific exception class, what triggers it, what rescues it, what the user sees, and whether it's tested. rescue StandardError is a code smell — call it out.
+2. Every error has a name. Don't say "handle errors." Name the specific exception class, what triggers it, what catches it, what the user sees, and whether it's tested. Catch-all error handling (e.g., catch Exception, rescue StandardError, except Exception) is a code smell — call it out.
 3. Data flows have shadow paths. Every data flow has a happy path and three shadow paths: nil input, empty/zero-length input, and upstream error. Trace all four for every new flow.
 4. Interactions have edge cases. Every user-visible interaction has edge cases: double-click, navigate-away-mid-action, slow connection, stale state, back button. Map them.
 5. Observability is scope, not afterthought. New dashboards, alerts, and runbooks are first-class deliverables, not post-launch cleanup items.
@@ -248,8 +248,8 @@ Run the following commands:
 git log --oneline -30                          # Recent history
 git diff <base> --stat                           # What's already changed
 git stash list                                 # Any stashed work
-grep -r "TODO\|FIXME\|HACK\|XXX" --include="*.rb" --include="*.js" -l
-find . -name "*.rb" -newer Gemfile.lock | head -20  # Recently touched files
+grep -r "TODO\|FIXME\|HACK\|XXX" -l --exclude-dir=node_modules --exclude-dir=vendor --exclude-dir=.git . | head -30
+git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -20  # Recently touched files
 ```
 Then read CLAUDE.md, TODOS.md, and any existing architecture docs.

@@ -362,8 +362,7 @@ Rules:
 After the opt-in/cherry-pick ceremony, write the plan to disk so the vision and decisions survive beyond this conversation. Only run this step for EXPANSION and SELECTIVE EXPANSION modes.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG/ceo-plans
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG/ceo-plans
 ```

 Before writing, check for existing CEO plans in the ceo-plans/ directory. If any are >30 days old or their branch has been merged/deleted, offer to archive them:
@@ -478,24 +477,24 @@ For every new method, service, or codepath that can fail, fill in this table:
 ```
  METHOD/CODEPATH          | WHAT CAN GO WRONG           | EXCEPTION CLASS
  -------------------------|-----------------------------|-----------------
-  ExampleService#call      | API timeout                 | Faraday::TimeoutError
+  ExampleService#call      | API timeout                 | TimeoutError
                           | API returns 429             | RateLimitError
-                           | API returns malformed JSON  | JSON::ParserError
-                           | DB connection pool exhausted| ActiveRecord::ConnectionTimeoutError
-                           | Record not found            | ActiveRecord::RecordNotFound
+                           | API returns malformed JSON  | JSONParseError
+                           | DB connection pool exhausted| ConnectionPoolExhausted
+                           | Record not found            | RecordNotFound
  -------------------------|-----------------------------|-----------------

  EXCEPTION CLASS              | RESCUED?  | RESCUE ACTION          | USER SEES
  -----------------------------|-----------|------------------------|------------------
-  Faraday::TimeoutError        | Y         | Retry 2x, then raise   | "Service temporarily unavailable"
+  TimeoutError                 | Y         | Retry 2x, then raise   | "Service temporarily unavailable"
  RateLimitError               | Y         | Backoff + retry         | Nothing (transparent)
-  JSON::ParserError            | N ← GAP   | —                      | 500 error ← BAD
-  ConnectionTimeoutError       | N ← GAP   | —                      | 500 error ← BAD
-  ActiveRecord::RecordNotFound | Y         | Return nil, log warning | "Not found" message
+  JSONParseError               | N ← GAP   | —                      | 500 error ← BAD
+  ConnectionPoolExhausted      | N ← GAP   | —                      | 500 error ← BAD
+  RecordNotFound               | Y         | Return nil, log warning | "Not found" message
 ```
 Rules for this section:
-* `rescue StandardError` is ALWAYS a smell. Name the specific exceptions.
-* `rescue => e` with only `Rails.logger.error(e.message)` is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
+* Catch-all error handling (`rescue StandardError`, `catch (Exception e)`, `except Exception`) is ALWAYS a smell. Name the specific exceptions.
+* Catching an error with only a generic log message is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
 * Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
 * For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
 * For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
@@ -792,9 +791,7 @@ If any AskUserQuestion goes unanswered, note it here. Never silently default.
 After producing the Completion Summary above, persist the review result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","commit":"COMMIT"}'
 ```

 Before running this command, substitute the placeholder values from the Completion Summary you just produced:
@@ -803,16 +800,14 @@ Before running this command, substitute the placeholder values from the Completi
 - **unresolved**: number from "Unresolved decisions" in the summary
 - **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
 - **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
+- **COMMIT**: output of `git rev-parse --short HEAD`

 ## Review Readiness Dashboard

 After completing the review, read the review log and config to display the dashboard.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
-echo "---CONFIG---"
-~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
+~/.claude/skills/gstack/bin/gstack-review-read
 ```

 Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, codex-review). Ignore entries with timestamps older than 7 days. For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display:
@@ -844,6 +839,27 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - CEO, Design, and Codex reviews are shown for context but never block shipping
 - If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED

+**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
+- Parse the \`---HEAD---\` section from the bash output to get the current HEAD commit hash
+- For each review entry that has a \`commit\` field: compare it against the current HEAD. If different, count elapsed commits: \`git rev-list --count STORED_COMMIT..HEAD\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
+- For entries without a \`commit\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
+- If all reviews match the current HEAD, do not display any staleness notes
+
+## Next Steps — Review Chaining
+
+After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this CEO review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
+
+**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this CEO review expanded scope, changed architectural direction, or accepted scope expansions, emphasize that a fresh eng review is needed. If an eng review already exists in the dashboard but the commit hash shows it predates this CEO review, note that it may be stale and should be re-run.
+
+**Recommend /plan-design-review if UI scope was detected** — specifically if Section 11 (Design & UX Review) was NOT skipped, or if accepted scope expansions included UI-facing features. If an existing design review is stale (commit hash drift), note that. In SCOPE REDUCTION mode, skip this recommendation — design review is unlikely relevant for scope cuts.
+
+**If both are needed, recommend eng review first** (required gate), then design review.
+
+Use AskUserQuestion to present the next step. Include only applicable options:
+- **A)** Run /plan-eng-review next (required gate)
+- **B)** Run /plan-design-review next (only if UI scope detected)
+- **C)** Skip — I'll handle reviews manually
+
 ## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)

 At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
@@ -37,7 +37,7 @@ Do NOT make any code changes. Do NOT start implementation. Your only job right n

 ## Prime Directives
 1. Zero silent failures. Every failure mode must be visible — to the system, to the team, to the user. If a failure can happen silently, that is a critical defect in the plan.
-2. Every error has a name. Don't say "handle errors." Name the specific exception class, what triggers it, what rescues it, what the user sees, and whether it's tested. rescue StandardError is a code smell — call it out.
+2. Every error has a name. Don't say "handle errors." Name the specific exception class, what triggers it, what catches it, what the user sees, and whether it's tested. Catch-all error handling (e.g., catch Exception, rescue StandardError, except Exception) is a code smell — call it out.
 3. Data flows have shadow paths. Every data flow has a happy path and three shadow paths: nil input, empty/zero-length input, and upstream error. Trace all four for every new flow.
 4. Interactions have edge cases. Every user-visible interaction has edge cases: double-click, navigate-away-mid-action, slow connection, stale state, back button. Map them.
 5. Observability is scope, not afterthought. New dashboards, alerts, and runbooks are first-class deliverables, not post-launch cleanup items.
@@ -95,8 +95,8 @@ Run the following commands:
 git log --oneline -30                          # Recent history
 git diff <base> --stat                           # What's already changed
 git stash list                                 # Any stashed work
-grep -r "TODO\|FIXME\|HACK\|XXX" --include="*.rb" --include="*.js" -l
-find . -name "*.rb" -newer Gemfile.lock | head -20  # Recently touched files
+grep -r "TODO\|FIXME\|HACK\|XXX" -l --exclude-dir=node_modules --exclude-dir=vendor --exclude-dir=.git . | head -30
+git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -20  # Recently touched files
 ```
 Then read CLAUDE.md, TODOS.md, and any existing architecture docs.

@@ -209,8 +209,7 @@ Rules:
 After the opt-in/cherry-pick ceremony, write the plan to disk so the vision and decisions survive beyond this conversation. Only run this step for EXPANSION and SELECTIVE EXPANSION modes.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG/ceo-plans
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG/ceo-plans
 ```

 Before writing, check for existing CEO plans in the ceo-plans/ directory. If any are >30 days old or their branch has been merged/deleted, offer to archive them:
@@ -325,24 +324,24 @@ For every new method, service, or codepath that can fail, fill in this table:
 ```
  METHOD/CODEPATH          | WHAT CAN GO WRONG           | EXCEPTION CLASS
  -------------------------|-----------------------------|-----------------
-  ExampleService#call      | API timeout                 | Faraday::TimeoutError
+  ExampleService#call      | API timeout                 | TimeoutError
                           | API returns 429             | RateLimitError
-                           | API returns malformed JSON  | JSON::ParserError
-                           | DB connection pool exhausted| ActiveRecord::ConnectionTimeoutError
-                           | Record not found            | ActiveRecord::RecordNotFound
+                           | API returns malformed JSON  | JSONParseError
+                           | DB connection pool exhausted| ConnectionPoolExhausted
+                           | Record not found            | RecordNotFound
  -------------------------|-----------------------------|-----------------

  EXCEPTION CLASS              | RESCUED?  | RESCUE ACTION          | USER SEES
  -----------------------------|-----------|------------------------|------------------
-  Faraday::TimeoutError        | Y         | Retry 2x, then raise   | "Service temporarily unavailable"
+  TimeoutError                 | Y         | Retry 2x, then raise   | "Service temporarily unavailable"
  RateLimitError               | Y         | Backoff + retry         | Nothing (transparent)
-  JSON::ParserError            | N ← GAP   | —                      | 500 error ← BAD
-  ConnectionTimeoutError       | N ← GAP   | —                      | 500 error ← BAD
-  ActiveRecord::RecordNotFound | Y         | Return nil, log warning | "Not found" message
+  JSONParseError               | N ← GAP   | —                      | 500 error ← BAD
+  ConnectionPoolExhausted      | N ← GAP   | —                      | 500 error ← BAD
+  RecordNotFound               | Y         | Return nil, log warning | "Not found" message
 ```
 Rules for this section:
-* `rescue StandardError` is ALWAYS a smell. Name the specific exceptions.
-* `rescue => e` with only `Rails.logger.error(e.message)` is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
+* Catch-all error handling (`rescue StandardError`, `catch (Exception e)`, `except Exception`) is ALWAYS a smell. Name the specific exceptions.
+* Catching an error with only a generic log message is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
 * Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
 * For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
 * For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
@@ -639,9 +638,7 @@ If any AskUserQuestion goes unanswered, note it here. Never silently default.
 After producing the Completion Summary above, persist the review result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","commit":"COMMIT"}'
 ```

 Before running this command, substitute the placeholder values from the Completion Summary you just produced:
@@ -650,9 +647,25 @@ Before running this command, substitute the placeholder values from the Completi
 - **unresolved**: number from "Unresolved decisions" in the summary
 - **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
 - **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
+- **COMMIT**: output of `git rev-parse --short HEAD`

 {{REVIEW_DASHBOARD}}

+## Next Steps — Review Chaining
+
+After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this CEO review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
+
+**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this CEO review expanded scope, changed architectural direction, or accepted scope expansions, emphasize that a fresh eng review is needed. If an eng review already exists in the dashboard but the commit hash shows it predates this CEO review, note that it may be stale and should be re-run.
+
+**Recommend /plan-design-review if UI scope was detected** — specifically if Section 11 (Design & UX Review) was NOT skipped, or if accepted scope expansions included UI-facing features. If an existing design review is stale (commit hash drift), note that. In SCOPE REDUCTION mode, skip this recommendation — design review is unlikely relevant for scope cuts.
+
+**If both are needed, recommend eng review first** (required gate), then design review.
+
+Use AskUserQuestion to present the next step. Include only applicable options:
+- **A)** Run /plan-eng-review next (required gate)
+- **B)** Run /plan-design-review next (only if UI scope detected)
+- **C)** Skip — I'll handle reviews manually
+
 ## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)

 At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
@@ -422,9 +422,7 @@ If any AskUserQuestion goes unanswered, note it here. Never silently default to
 After producing the Completion Summary above, persist the review result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"plan-design-review","timestamp":"TIMESTAMP","status":"STATUS","overall_score":N,"unresolved":N,"decisions_made":N}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-design-review","timestamp":"TIMESTAMP","status":"STATUS","overall_score":N,"unresolved":N,"decisions_made":N,"commit":"COMMIT"}'
 ```

 Substitute values from the Completion Summary:
@@ -433,16 +431,14 @@ Substitute values from the Completion Summary:
 - **overall_score**: final overall design score (0-10)
 - **unresolved**: number of unresolved design decisions
 - **decisions_made**: number of design decisions added to the plan
+- **COMMIT**: output of `git rev-parse --short HEAD`

 ## Review Readiness Dashboard

 After completing the review, read the review log and config to display the dashboard.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
-echo "---CONFIG---"
-~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
+~/.claude/skills/gstack/bin/gstack-review-read
 ```

 Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, codex-review). Ignore entries with timestamps older than 7 days. For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display:
@@ -474,6 +470,27 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - CEO, Design, and Codex reviews are shown for context but never block shipping
 - If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED

+**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
+- Parse the \`---HEAD---\` section from the bash output to get the current HEAD commit hash
+- For each review entry that has a \`commit\` field: compare it against the current HEAD. If different, count elapsed commits: \`git rev-list --count STORED_COMMIT..HEAD\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
+- For entries without a \`commit\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
+- If all reviews match the current HEAD, do not display any staleness notes
+
+## Next Steps — Review Chaining
+
+After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this design review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
+
+**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this design review added significant interaction specifications, new user flows, or changed the information architecture, emphasize that eng review needs to validate the architectural implications. If an eng review already exists but the commit hash shows it predates this design review, note that it may be stale and should be re-run.
+
+**Consider recommending /plan-ceo-review** — but only if this design review revealed fundamental product direction gaps. Specifically: if the overall design score started below 4/10, if the information architecture had major structural problems, or if the review surfaced questions about whether the right problem is being solved. AND no CEO review exists in the dashboard. This is a selective recommendation — most design reviews should NOT trigger a CEO review.
+
+**If both are needed, recommend eng review first** (required gate).
+
+Use AskUserQuestion to present the next step. Include only applicable options:
+- **A)** Run /plan-eng-review next (required gate)
+- **B)** Run /plan-ceo-review (only if fundamental product gaps found)
+- **C)** Skip — I'll handle reviews manually
+
 ## Formatting Rules
 * NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
 * Label with NUMBER + LETTER (e.g., "3A", "3B").
@@ -269,9 +269,7 @@ If any AskUserQuestion goes unanswered, note it here. Never silently default to
 After producing the Completion Summary above, persist the review result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"plan-design-review","timestamp":"TIMESTAMP","status":"STATUS","overall_score":N,"unresolved":N,"decisions_made":N}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-design-review","timestamp":"TIMESTAMP","status":"STATUS","overall_score":N,"unresolved":N,"decisions_made":N,"commit":"COMMIT"}'
 ```

 Substitute values from the Completion Summary:
@@ -280,9 +278,25 @@ Substitute values from the Completion Summary:
 - **overall_score**: final overall design score (0-10)
 - **unresolved**: number of unresolved design decisions
 - **decisions_made**: number of design decisions added to the plan
+- **COMMIT**: output of `git rev-parse --short HEAD`

 {{REVIEW_DASHBOARD}}

+## Next Steps — Review Chaining
+
+After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this design review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
+
+**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this design review added significant interaction specifications, new user flows, or changed the information architecture, emphasize that eng review needs to validate the architectural implications. If an eng review already exists but the commit hash shows it predates this design review, note that it may be stale and should be re-run.
+
+**Consider recommending /plan-ceo-review** — but only if this design review revealed fundamental product direction gaps. Specifically: if the overall design score started below 4/10, if the information architecture had major structural problems, or if the review surfaced questions about whether the right problem is being solved. AND no CEO review exists in the dashboard. This is a selective recommendation — most design reviews should NOT trigger a CEO review.
+
+**If both are needed, recommend eng review first** (required gate).
+
+Use AskUserQuestion to present the next step. Include only applicable options:
+- **A)** Run /plan-eng-review next (required gate)
+- **B)** Run /plan-ceo-review (only if fundamental product gaps found)
+- **C)** Skip — I'll handle reviews manually
+
 ## Formatting Rules
 * NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
 * Label with NUMBER + LETTER (e.g., "3A", "3B").
@@ -273,7 +273,7 @@ Evaluate:
 **STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved.

 ### 3. Test review
-Make a diagram of all new UX, new data flow, new codepaths, and new branching if statements or outcomes. For each, note what is new about the features discussed in this branch and plan. Then, for each new item in the diagram, make sure there is a JS or Rails test.
+Make a diagram of all new UX, new data flow, new codepaths, and new branching if statements or outcomes. For each, note what is new about the features discussed in this branch and plan. Then, for each new item in the diagram, make sure there is a corresponding test.

 For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in CLAUDE.md. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against. Then use AskUserQuestion to confirm the eval scope with the user.

@@ -284,10 +284,9 @@ For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in C
 After producing the test diagram, write a test plan artifact to the project directory so `/qa` and `/qa-only` can consume it as primary test input (replacing the lossy git-diff heuristic):

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 USER=$(whoami)
 DATETIME=$(date +%Y%m%d-%H%M%S)
-mkdir -p ~/.gstack/projects/$SLUG
 ```

 Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-plan-{datetime}.md`:
@@ -393,9 +392,7 @@ Check the git log for this branch. If there are prior commits suggesting a previ
 After producing the Completion Summary above, persist the review result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"plan-eng-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-eng-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","commit":"COMMIT"}'
 ```

 Substitute values from the Completion Summary:
@@ -404,16 +401,14 @@ Substitute values from the Completion Summary:
 - **unresolved**: number from "Unresolved decisions" count
 - **critical_gaps**: number from "Failure modes: ___ critical gaps flagged"
 - **MODE**: FULL_REVIEW / SCOPE_REDUCED
+- **COMMIT**: output of `git rev-parse --short HEAD`

 ## Review Readiness Dashboard

 After completing the review, read the review log and config to display the dashboard.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
-echo "---CONFIG---"
-~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
+~/.claude/skills/gstack/bin/gstack-review-read
 ```

 Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, codex-review). Ignore entries with timestamps older than 7 days. For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display:
@@ -445,5 +440,28 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - CEO, Design, and Codex reviews are shown for context but never block shipping
 - If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED

+**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
+- Parse the \`---HEAD---\` section from the bash output to get the current HEAD commit hash
+- For each review entry that has a \`commit\` field: compare it against the current HEAD. If different, count elapsed commits: \`git rev-list --count STORED_COMMIT..HEAD\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
+- For entries without a \`commit\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
+- If all reviews match the current HEAD, do not display any staleness notes
+
+## Next Steps — Review Chaining
+
+After displaying the Review Readiness Dashboard, check if additional reviews would be valuable. Read the dashboard output to see which reviews have already been run and whether they are stale.
+
+**Suggest /plan-design-review if UI changes exist and no design review has been run** — detect from the test diagram, architecture review, or any section that touched frontend components, CSS, views, or user-facing interaction flows. If an existing design review's commit hash shows it predates significant changes found in this eng review, note that it may be stale.
+
+**Mention /plan-ceo-review if this is a significant product change and no CEO review exists** — this is a soft suggestion, not a push. CEO review is optional. Only mention it if the plan introduces new user-facing features, changes product direction, or expands scope substantially.
+
+**Note staleness** of existing CEO or design reviews if this eng review found assumptions that contradict them, or if the commit hash shows significant drift.
+
+**If no additional reviews are needed** (or `skip_eng_review` is `true` in the dashboard config, meaning this eng review was optional): state "All relevant reviews complete. Run /ship when ready."
+
+Use AskUserQuestion with only the applicable options:
+- **A)** Run /plan-design-review (only if UI scope detected and no design review exists)
+- **B)** Run /plan-ceo-review (only if significant product change and no CEO review exists)
+- **C)** Ready to implement — run /ship when done
+
 ## Unresolved decisions
 If the user does not respond to an AskUserQuestion or interrupts to move on, note which decisions were left unresolved. At the end of the review, list these as "Unresolved decisions that may bite you later" — never silently default to an option.
@@ -137,7 +137,7 @@ Evaluate:
 **STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved.

 ### 3. Test review
-Make a diagram of all new UX, new data flow, new codepaths, and new branching if statements or outcomes. For each, note what is new about the features discussed in this branch and plan. Then, for each new item in the diagram, make sure there is a JS or Rails test.
+Make a diagram of all new UX, new data flow, new codepaths, and new branching if statements or outcomes. For each, note what is new about the features discussed in this branch and plan. Then, for each new item in the diagram, make sure there is a corresponding test.

 For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in CLAUDE.md. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against. Then use AskUserQuestion to confirm the eval scope with the user.

@@ -148,10 +148,9 @@ For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in C
 After producing the test diagram, write a test plan artifact to the project directory so `/qa` and `/qa-only` can consume it as primary test input (replacing the lossy git-diff heuristic):

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 USER=$(whoami)
 DATETIME=$(date +%Y%m%d-%H%M%S)
-mkdir -p ~/.gstack/projects/$SLUG
 ```

 Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-plan-{datetime}.md`:
@@ -257,9 +256,7 @@ Check the git log for this branch. If there are prior commits suggesting a previ
 After producing the Completion Summary above, persist the review result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"plan-eng-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-eng-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","commit":"COMMIT"}'
 ```

 Substitute values from the Completion Summary:
@@ -268,8 +265,26 @@ Substitute values from the Completion Summary:
 - **unresolved**: number from "Unresolved decisions" count
 - **critical_gaps**: number from "Failure modes: ___ critical gaps flagged"
 - **MODE**: FULL_REVIEW / SCOPE_REDUCED
+- **COMMIT**: output of `git rev-parse --short HEAD`

 {{REVIEW_DASHBOARD}}

+## Next Steps — Review Chaining
+
+After displaying the Review Readiness Dashboard, check if additional reviews would be valuable. Read the dashboard output to see which reviews have already been run and whether they are stale.
+
+**Suggest /plan-design-review if UI changes exist and no design review has been run** — detect from the test diagram, architecture review, or any section that touched frontend components, CSS, views, or user-facing interaction flows. If an existing design review's commit hash shows it predates significant changes found in this eng review, note that it may be stale.
+
+**Mention /plan-ceo-review if this is a significant product change and no CEO review exists** — this is a soft suggestion, not a push. CEO review is optional. Only mention it if the plan introduces new user-facing features, changes product direction, or expands scope substantially.
+
+**Note staleness** of existing CEO or design reviews if this eng review found assumptions that contradict them, or if the commit hash shows significant drift.
+
+**If no additional reviews are needed** (or `skip_eng_review` is `true` in the dashboard config, meaning this eng review was optional): state "All relevant reviews complete. Run /ship when ready."
+
+Use AskUserQuestion with only the applicable options:
+- **A)** Run /plan-design-review (only if UI scope detected and no design review exists)
+- **B)** Run /plan-ceo-review (only if significant product change and no CEO review exists)
+- **C)** Ready to implement — run /ship when done
+
 ## Unresolved decisions
 If the user does not respond to an AskUserQuestion or interrupts to move on, note which decisions were left unresolved. At the end of the review, list these as "Unresolved decisions that may bite you later" — never silently default to an option.
@@ -206,7 +206,7 @@ Before falling back to git diff heuristics, check for richer test plan sources:

 1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
   ```
 2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
@@ -502,8 +502,7 @@ Write the report to both local and project-scoped locations:

 **Project-scoped:** Write test outcome artifact for cross-session context:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 ```
 Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`

@@ -53,7 +53,7 @@ Before falling back to git diff heuristics, check for richer test plan sources:

 1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
   ```
 2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
@@ -73,8 +73,7 @@ Write the report to both local and project-scoped locations:

 **Project-scoped:** Write test outcome artifact for cross-session context:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 ```
 Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`

@@ -410,7 +410,7 @@ Before falling back to git diff heuristics, check for richer test plan sources:

 1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
   ```
 2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
@@ -874,8 +874,7 @@ Write the report to both local and project-scoped locations:

 **Project-scoped:** Write test outcome artifact for cross-session context:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 ```
 Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`

@@ -89,7 +89,7 @@ Before falling back to git diff heuristics, check for richer test plan sources:

 1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
   ```
 2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
@@ -277,8 +277,7 @@ Write the report to both local and project-scoped locations:

 **Project-scoped:** Write test outcome artifact for cross-session context:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 ```
 Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`

@@ -182,7 +182,7 @@ When the user types `/retro`, run this skill.

 Parse the argument to determine the time window. Default to 7 days if no argument given. All times should be reported in the user's **local timezone** (use the system default — do NOT set `TZ`).

-**Midnight-aligned windows:** For day (`d`) and week (`w`) units, compute an absolute start date at local midnight, not a relative string. For example, if today is 2026-03-18 and the window is 7 days: the start date is 2026-03-11. Use `--since="2026-03-11"` for git log queries — git interprets a bare date as midnight in the local timezone, so this captures full calendar days regardless of what time the retro runs. For week units, multiply by 7 to get days (e.g., `2w` = 14 days back). For hour (`h`) units, use `--since="N hours ago"` since midnight alignment does not apply to sub-day windows.
+**Midnight-aligned windows:** For day (`d`) and week (`w`) units, compute an absolute start date at local midnight, not a relative string. For example, if today is 2026-03-18 and the window is 7 days: the start date is 2026-03-11. Use `--since="2026-03-11T00:00:00"` for git log queries — the explicit `T00:00:00` suffix ensures git starts from midnight. Without it, git uses the current wall-clock time (e.g., `--since="2026-03-11"` at 11pm means 11pm, not midnight). For week units, multiply by 7 to get days (e.g., `2w` = 14 days back). For hour (`h`) units, use `--since="N hours ago"` since midnight alignment does not apply to sub-day windows.

 **Argument validation:** If the argument doesn't match a number followed by `d`, `h`, or `w`, the word `compare`, or `compare` followed by a number and `d`/`h`/`w`, show this usage and stop:
 ```
@@ -628,8 +628,8 @@ Small, practical, realistic. Each must be something that takes <5 minutes to ado

 When the user runs `/retro compare` (or `/retro compare 14d`):

-1. Compute metrics for the current window (default 7d) using the midnight-aligned start date (same logic as the main retro — e.g., if today is 2026-03-18 and window is 7d, use `--since="2026-03-11"`)
-2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` with midnight-aligned dates to avoid overlap (e.g., for a 7d window starting 2026-03-11: prior window is `--since="2026-03-04" --until="2026-03-11"`)
+1. Compute metrics for the current window (default 7d) using the midnight-aligned start date (same logic as the main retro — e.g., if today is 2026-03-18 and window is 7d, use `--since="2026-03-11T00:00:00"`)
+2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` with midnight-aligned dates to avoid overlap (e.g., for a 7d window starting 2026-03-11: prior window is `--since="2026-03-04T00:00:00" --until="2026-03-11T00:00:00"`)
 3. Show a side-by-side comparison table with deltas and arrows
 4. Write a brief narrative highlighting the biggest improvements and regressions
 5. Save only the current-window snapshot to `.context/retros/` (same as a normal retro run); do **not** persist the prior-window metrics.
@@ -46,7 +46,7 @@ When the user types `/retro`, run this skill.

 Parse the argument to determine the time window. Default to 7 days if no argument given. All times should be reported in the user's **local timezone** (use the system default — do NOT set `TZ`).

-**Midnight-aligned windows:** For day (`d`) and week (`w`) units, compute an absolute start date at local midnight, not a relative string. For example, if today is 2026-03-18 and the window is 7 days: the start date is 2026-03-11. Use `--since="2026-03-11"` for git log queries — git interprets a bare date as midnight in the local timezone, so this captures full calendar days regardless of what time the retro runs. For week units, multiply by 7 to get days (e.g., `2w` = 14 days back). For hour (`h`) units, use `--since="N hours ago"` since midnight alignment does not apply to sub-day windows.
+**Midnight-aligned windows:** For day (`d`) and week (`w`) units, compute an absolute start date at local midnight, not a relative string. For example, if today is 2026-03-18 and the window is 7 days: the start date is 2026-03-11. Use `--since="2026-03-11T00:00:00"` for git log queries — the explicit `T00:00:00` suffix ensures git starts from midnight. Without it, git uses the current wall-clock time (e.g., `--since="2026-03-11"` at 11pm means 11pm, not midnight). For week units, multiply by 7 to get days (e.g., `2w` = 14 days back). For hour (`h`) units, use `--since="N hours ago"` since midnight alignment does not apply to sub-day windows.

 **Argument validation:** If the argument doesn't match a number followed by `d`, `h`, or `w`, the word `compare`, or `compare` followed by a number and `d`/`h`/`w`, show this usage and stop:
 ```
@@ -492,8 +492,8 @@ Small, practical, realistic. Each must be something that takes <5 minutes to ado

 When the user runs `/retro compare` (or `/retro compare 14d`):

-1. Compute metrics for the current window (default 7d) using the midnight-aligned start date (same logic as the main retro — e.g., if today is 2026-03-18 and window is 7d, use `--since="2026-03-11"`)
-2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` with midnight-aligned dates to avoid overlap (e.g., for a 7d window starting 2026-03-11: prior window is `--since="2026-03-04" --until="2026-03-11"`)
+1. Compute metrics for the current window (default 7d) using the midnight-aligned start date (same logic as the main retro — e.g., if today is 2026-03-18 and window is 7d, use `--since="2026-03-11T00:00:00"`)
+2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` with midnight-aligned dates to avoid overlap (e.g., for a 7d window starting 2026-03-11: prior window is `--since="2026-03-04T00:00:00" --until="2026-03-11T00:00:00"`)
 3. Show a side-by-side comparison table with deltas and arrows
 4. Write a brief narrative highlighting the biggest improvements and regressions
 5. Save only the current-window snapshot to `.context/retros/` (same as a normal retro run); do **not** persist the prior-window metrics.
@@ -271,7 +271,7 @@ Follow the output format specified in the checklist. Respect the suppressions
 Check if the diff touches frontend files using `gstack-diff-scope`:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 ```

 **If `SCOPE_FRONTEND=false`:** Skip design review silently. No output.
@@ -294,12 +294,10 @@ eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 6. **Log the result** for the Review Readiness Dashboard:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M,"commit":"COMMIT"}'
 ```

-Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count.
+Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count, COMMIT = output of `git rev-parse --short HEAD`.

 Include any design findings alongside the findings from Step 4. They follow the same Fix-First flow in Step 5 — AUTO-FIX for mechanical CSS fixes, ASK for everything else.

@@ -453,10 +451,7 @@ Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` h

 **Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-BRANCH_SLUG=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
-mkdir -p ~/.gstack/projects/"$SLUG"
-echo '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}' >> ~/.gstack/projects/"$SLUG"/"$BRANCH_SLUG"-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
 ```

 Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
@@ -267,10 +267,7 @@ Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` h

 **Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-BRANCH_SLUG=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
-mkdir -p ~/.gstack/projects/"$SLUG"
-echo '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}' >> ~/.gstack/projects/"$SLUG"/"$BRANCH_SLUG"-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
 ```

 Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
@@ -35,16 +35,16 @@ Be terse. For each issue: one line describing the problem, one line with the fix
 ### Pass 1 — CRITICAL

 #### SQL & Data Safety
- String interpolation in SQL (even if values are `.to_i`/`.to_f` — use `sanitize_sql_array` or Arel)
+- String interpolation in SQL (even if values are `.to_i`/`.to_f` — use parameterized queries (Rails: sanitize_sql_array/Arel; Node: prepared statements; Python: parameterized queries))
 - TOCTOU races: check-then-set patterns that should be atomic `WHERE` + `update_all`
- `update_column`/`update_columns` bypassing validations on fields that have or should have constraints
- N+1 queries: `.includes()` missing for associations used in loops/views (especially avatar, attachments)
+- Bypassing model validations for direct DB writes (Rails: update_column; Django: QuerySet.update(); Prisma: raw queries)
+- N+1 queries: Missing eager loading (Rails: .includes(); SQLAlchemy: joinedload(); Prisma: include) for associations used in loops/views

 #### Race Conditions & Concurrency
- Read-check-write without uniqueness constraint or `rescue RecordNotUnique; retry` (e.g., `where(hash:).first` then `save!` without handling concurrent insert)
- `find_or_create_by` on columns without unique DB index — concurrent calls can create duplicates
+- Read-check-write without uniqueness constraint or catch duplicate key error and retry (e.g., `where(hash:).first` then `save!` without handling concurrent insert)
+- find-or-create without unique DB index — concurrent calls can create duplicates
 - Status transitions that don't use atomic `WHERE old_status = ? UPDATE SET new_status` — concurrent updates can skip or double-apply transitions
- `html_safe` on user-controlled data (XSS) — check any `.html_safe`, `raw()`, or string interpolation into `html_safe` output
+- Unsafe HTML rendering (Rails: .html_safe/raw(); React: dangerouslySetInnerHTML; Vue: v-html; Django: |safe/mark_safe) on user-controlled data (XSS)

 #### LLM Output Trust Boundary
 - LLM-generated values (emails, URLs, names) written to DB or passed to mailers without format validation. Add lightweight guards (`EMAIL_REGEXP`, `URI.parse`, `.strip`) before persisting.
@@ -141,7 +141,7 @@ the agent auto-fixes a finding or asks the user.
 ```
 AUTO-FIX (agent fixes without asking):     ASK (needs human judgment):
 ├─ Dead code / unused variables            ├─ Security (auth, XSS, injection)
-├─ N+1 queries (missing .includes())      ├─ Race conditions
+├─ N+1 queries (missing eager loading)      ├─ Race conditions
 ├─ Stale comments contradicting code       ├─ Design decisions
 ├─ Magic numbers → named constants         ├─ Large fixes (>20 lines)
 ├─ Missing LLM output validation           ├─ Enum completeness
@@ -9,7 +9,7 @@ This checklist applies to **source code in the diff** — not rendered output. R
 **Trigger:** Only run this checklist if the diff touches frontend files. Use `gstack-diff-scope` to detect:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 ```

 If `SCOPE_FRONTEND=false`, skip the entire design review silently.
@@ -626,7 +626,7 @@ function generateDesignReviewLite(_ctx: TemplateContext): string {
 Check if the diff touches frontend files using \`gstack-diff-scope\`:

 \`\`\`bash
-eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 \`\`\`

 **If \`SCOPE_FRONTEND=false\`:** Skip design review silently. No output.
@@ -649,12 +649,10 @@ eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 6. **Log the result** for the Review Readiness Dashboard:

 \`\`\`bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M,"commit":"COMMIT"}'
 \`\`\`

-Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count.`;
+Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count, COMMIT = output of \`git rev-parse --short HEAD\`.`;
 }

 // NOTE: design-checklist.md is a subset of this methodology for code-level detection.
@@ -909,8 +907,7 @@ Compare screenshots and observations across pages for:

 **Project-scoped:**
 \`\`\`bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
+source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
 \`\`\`
 Write to: \`~/.gstack/projects/{slug}/{user}-{branch}-design-audit-{datetime}.md\`

@@ -999,10 +996,7 @@ function generateReviewDashboard(_ctx: TemplateContext): string {
 After completing the review, read the review log and config to display the dashboard.

 \`\`\`bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
-echo "---CONFIG---"
-~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
+~/.claude/skills/gstack/bin/gstack-review-read
 \`\`\`

 Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, codex-review). Ignore entries with timestamps older than 7 days. For Design Review, show whichever is more recent between \`plan-design-review\` (full visual audit) and \`design-review-lite\` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display:
@@ -1032,7 +1026,13 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`)
 - **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
 - CEO, Design, and Codex reviews are shown for context but never block shipping
- If \\\`skip_eng_review\\\` config is \\\`true\\\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED`;
+- If \\\`skip_eng_review\\\` config is \\\`true\\\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED
+
+**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
+- Parse the \\\`---HEAD---\\\` section from the bash output to get the current HEAD commit hash
+- For each review entry that has a \\\`commit\\\` field: compare it against the current HEAD. If different, count elapsed commits: \\\`git rev-list --count STORED_COMMIT..HEAD\\\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
+- For entries without a \\\`commit\\\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
+- If all reviews match the current HEAD, do not display any staleness notes`;
 }

 function generateTestBootstrap(_ctx: TemplateContext): string {
@@ -2,6 +2,12 @@
 # gstack setup — build browser binary + register skills with Claude Code / Codex
 set -e

+if ! command -v bun >/dev/null 2>&1; then
+  echo "Error: bun is required but not installed." >&2
+  echo "Install it: curl -fsSL https://bun.sh/install | bash" >&2
+  exit 1
+fi
+
 GSTACK_DIR="$(cd "$(dirname "$0")" && pwd)"
 SKILLS_DIR="$(dirname "$GSTACK_DIR")"
 BROWSE_BIN="$GSTACK_DIR/browse/dist/browse"
@@ -213,10 +213,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat
 After completing the review, read the review log and config to display the dashboard.

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
-echo "---CONFIG---"
-~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
+~/.claude/skills/gstack/bin/gstack-review-read
 ```

 Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, codex-review). Ignore entries with timestamps older than 7 days. For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display:
@@ -248,11 +245,17 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - CEO, Design, and Codex reviews are shown for context but never block shipping
 - If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED

+**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
+- Parse the \`---HEAD---\` section from the bash output to get the current HEAD commit hash
+- For each review entry that has a \`commit\` field: compare it against the current HEAD. If different, count elapsed commits: \`git rev-list --count STORED_COMMIT..HEAD\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
+- For entries without a \`commit\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
+- If all reviews match the current HEAD, do not display any staleness notes
+
 If the Eng Review is NOT "CLEAR":

 1. **Check for a prior override on this branch:**
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   grep '"skill":"ship-review-override"' ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_OVERRIDE"
   ```
   If an override exists, display the dashboard and note "Review gate previously accepted — continuing." Do NOT ask again.
@@ -262,11 +265,11 @@ If the Eng Review is NOT "CLEAR":
   - RECOMMENDATION: Choose C if the change is obviously trivial (< 20 lines, typo fix, config-only); Choose B for larger changes
   - Options: A) Ship anyway  B) Abort — run /plan-eng-review first  C) Change is too small to need eng review
   - If CEO Review is missing, mention as informational ("CEO Review not run — recommended for product changes") but do NOT block
-   - For Design Review: run `eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)`. If `SCOPE_FRONTEND=true` and no design review (plan-design-review or design-review-lite) exists in the dashboard, mention: "Design Review not run — this PR changes frontend code. The lite design check will run automatically in Step 3.5, but consider running /design-review for a full visual audit post-implementation." Still never block.
+   - For Design Review: run `source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)`. If `SCOPE_FRONTEND=true` and no design review (plan-design-review or design-review-lite) exists in the dashboard, mention: "Design Review not run — this PR changes frontend code. The lite design check will run automatically in Step 3.5, but consider running /design-review for a full visual audit post-implementation." Still never block.

 3. **If the user chooses A or C,** persist the decision so future `/ship` runs on this branch skip the gate:
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   echo '{"skill":"ship-review-override","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","decision":"USER_CHOICE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
   ```
   Substitute USER_CHOICE with "ship_anyway" or "not_relevant".
@@ -683,7 +686,7 @@ Review the diff for structural issues that tests don't catch.
 Check if the diff touches frontend files using `gstack-diff-scope`:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 ```

 **If `SCOPE_FRONTEND=false`:** Skip design review silently. No output.
@@ -706,12 +709,10 @@ eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 6. **Log the result** for the Review Readiness Dashboard:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M,"commit":"COMMIT"}'
 ```

-Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count.
+Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count, COMMIT = output of `git rev-parse --short HEAD`.

   Include any design findings alongside the code review findings. They follow the same Fix-First flow below.

@@ -803,10 +804,7 @@ Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]`
 to determine pass/fail gate. Persist the result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-BRANCH_SLUG=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}' >> ~/.gstack/projects/$SLUG/$BRANCH_SLUG-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
 ```

 If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
@@ -1028,7 +1026,28 @@ EOF
 )"
 ```

-**Output the PR URL** — this should be the final output the user sees.
+**Output the PR URL** — then proceed to Step 8.5.
+
+---
+
+## Step 8.5: Auto-invoke /document-release
+
+After the PR is created, automatically sync project documentation. Read the
+`document-release/SKILL.md` skill file (adjacent to this skill's directory) and
+execute its full workflow:
+
+1. Read the `/document-release` skill: `cat ${CLAUDE_SKILL_DIR}/../document-release/SKILL.md`
+2. Follow its instructions — it reads all .md files in the project, cross-references
+   the diff, and updates anything that drifted (README, ARCHITECTURE, CONTRIBUTING,
+   CLAUDE.md, TODOS, etc.)
+3. If any docs were updated, commit the changes and push to the same branch:
+   ```bash
+   git add -A && git commit -m "docs: sync documentation with shipped changes" && git push
+   ```
+4. If no docs needed updating, say "Documentation is current — no updates needed."
+
+This step is automatic. Do not ask the user for confirmation. The goal is zero-friction
+doc updates — the user runs `/ship` and documentation stays current without a separate command.

 ---

@@ -1045,4 +1064,4 @@ EOF
 - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
 - **Never push without fresh verification evidence.** If code changed after Step 3 tests, re-run before pushing.
 - **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests.
- **The goal is: user says `/ship`, next thing they see is the review + PR URL.**
+- **The goal is: user says `/ship`, next thing they see is the review + PR URL + auto-synced docs.**
@@ -61,7 +61,7 @@ If the Eng Review is NOT "CLEAR":

 1. **Check for a prior override on this branch:**
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   grep '"skill":"ship-review-override"' ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_OVERRIDE"
   ```
   If an override exists, display the dashboard and note "Review gate previously accepted — continuing." Do NOT ask again.
@@ -71,11 +71,11 @@ If the Eng Review is NOT "CLEAR":
   - RECOMMENDATION: Choose C if the change is obviously trivial (< 20 lines, typo fix, config-only); Choose B for larger changes
   - Options: A) Ship anyway  B) Abort — run /plan-eng-review first  C) Change is too small to need eng review
   - If CEO Review is missing, mention as informational ("CEO Review not run — recommended for product changes") but do NOT block
-   - For Design Review: run `eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)`. If `SCOPE_FRONTEND=true` and no design review (plan-design-review or design-review-lite) exists in the dashboard, mention: "Design Review not run — this PR changes frontend code. The lite design check will run automatically in Step 3.5, but consider running /design-review for a full visual audit post-implementation." Still never block.
+   - For Design Review: run `source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)`. If `SCOPE_FRONTEND=true` and no design review (plan-design-review or design-review-lite) exists in the dashboard, mention: "Design Review not run — this PR changes frontend code. The lite design check will run automatically in Step 3.5, but consider running /design-review for a full visual audit post-implementation." Still never block.

 3. **If the user chooses A or C,** persist the decision so future `/ship` runs on this branch skip the gate:
   ```bash
-   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
   echo '{"skill":"ship-review-override","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","decision":"USER_CHOICE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
   ```
   Substitute USER_CHOICE with "ship_anyway" or "not_relevant".
@@ -428,10 +428,7 @@ Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]`
 to determine pass/fail gate. Persist the result:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-BRANCH_SLUG=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}' >> ~/.gstack/projects/$SLUG/$BRANCH_SLUG-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
 ```

 If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
@@ -653,7 +650,28 @@ EOF
 )"
 ```

-**Output the PR URL** — this should be the final output the user sees.
+**Output the PR URL** — then proceed to Step 8.5.
+
+---
+
+## Step 8.5: Auto-invoke /document-release
+
+After the PR is created, automatically sync project documentation. Read the
+`document-release/SKILL.md` skill file (adjacent to this skill's directory) and
+execute its full workflow:
+
+1. Read the `/document-release` skill: `cat ${CLAUDE_SKILL_DIR}/../document-release/SKILL.md`
+2. Follow its instructions — it reads all .md files in the project, cross-references
+   the diff, and updates anything that drifted (README, ARCHITECTURE, CONTRIBUTING,
+   CLAUDE.md, TODOS, etc.)
+3. If any docs were updated, commit the changes and push to the same branch:
+   ```bash
+   git add -A && git commit -m "docs: sync documentation with shipped changes" && git push
+   ```
+4. If no docs needed updating, say "Documentation is current — no updates needed."
+
+This step is automatic. Do not ask the user for confirmation. The goal is zero-friction
+doc updates — the user runs `/ship` and documentation stays current without a separate command.

 ---

@@ -670,4 +688,4 @@ EOF
 - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
 - **Never push without fresh verification evidence.** If code changed after Step 3 tests, re-run before pushing.
 - **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests.
- **The goal is: user says `/ship`, next thing they see is the review + PR URL.**
+- **The goal is: user says `/ship`, next thing they see is the review + PR URL + auto-synced docs.**
@@ -347,7 +347,7 @@ describe('REVIEW_DASHBOARD resolver', () => {
  for (const skill of REVIEW_SKILLS) {
    test(`review dashboard appears in ${skill} generated file`, () => {
      const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
-      expect(content).toContain('reviews.jsonl');
+      expect(content).toContain('gstack-review');
      expect(content).toContain('REVIEW READINESS DASHBOARD');
    });
  }
@@ -367,6 +367,53 @@ describe('REVIEW_DASHBOARD resolver', () => {
    expect(content).toContain('Design Review');
    expect(content).toContain('skip_eng_review');
  });
+
+  test('dashboard bash block includes git HEAD for staleness detection', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('git rev-parse --short HEAD');
+    expect(content).toContain('---HEAD---');
+  });
+
+  test('dashboard includes staleness detection prose', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Staleness detection');
+    expect(content).toContain('commit');
+  });
+
+  for (const skill of REVIEW_SKILLS) {
+    test(`${skill} contains review chaining section`, () => {
+      const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
+      expect(content).toContain('Review Chaining');
+    });
+
+    test(`${skill} Review Log includes commit field`, () => {
+      const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
+      expect(content).toContain('"commit"');
+    });
+  }
+
+  test('plan-ceo-review chaining mentions eng and design reviews', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('/plan-eng-review');
+    expect(content).toContain('/plan-design-review');
+  });
+
+  test('plan-eng-review chaining mentions design and ceo reviews', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('/plan-design-review');
+    expect(content).toContain('/plan-ceo-review');
+  });
+
+  test('plan-design-review chaining mentions eng and ceo reviews', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('/plan-eng-review');
+    expect(content).toContain('/plan-ceo-review');
+  });
+
+  test('ship does NOT contain review chaining', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).not.toContain('Review Chaining');
+  });
 });

 // ─── Codex Generation Tests ─────────────────────────────────
@@ -50,7 +50,7 @@ function installSkills(tmpDir: string) {
    '', // root gstack SKILL.md
    'qa', 'qa-only', 'ship', 'review', 'plan-ceo-review', 'plan-eng-review',
    'plan-design-review', 'design-review', 'design-consultation', 'retro',
-    'document-release', 'debug', 'office-hours', 'browse', 'setup-browser-cookies',
+    'document-release', 'investigate', 'office-hours', 'browse', 'setup-browser-cookies',
    'gstack-upgrade', 'humanizer',
  ];

@@ -277,7 +277,7 @@ export default app;
      run('git', ['checkout', '-b', 'feature/waitlist-api']);

      const testName = 'journey-debug';
-      const expectedSkill = 'debug';
+      const expectedSkill = 'investigate';
      const result = await runSkillTest({
        prompt: "The GET /api/waitlist endpoint was working fine yesterday but now it's returning 500 errors. The tests are passing locally but the endpoint fails when I hit it with curl. Can you figure out what's going on?",
        workingDirectory: tmpDir,
@@ -218,7 +218,7 @@ describe('Update check preamble', () => {
    'ship/SKILL.md', 'review/SKILL.md',
    'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
    'retro/SKILL.md',
-    'office-hours/SKILL.md', 'debug/SKILL.md',
+    'office-hours/SKILL.md', 'investigate/SKILL.md',
    'plan-design-review/SKILL.md',
    'design-review/SKILL.md',
    'design-consultation/SKILL.md',
@@ -530,7 +530,7 @@ describe('v0.4.1 preamble features', () => {
    'ship/SKILL.md', 'review/SKILL.md',
    'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
    'retro/SKILL.md',
-    'office-hours/SKILL.md', 'debug/SKILL.md',
+    'office-hours/SKILL.md', 'investigate/SKILL.md',
    'plan-design-review/SKILL.md',
    'design-review/SKILL.md',
    'design-consultation/SKILL.md',
@@ -646,8 +646,8 @@ describe('office-hours skill structure', () => {
  });
 });

-describe('debug skill structure', () => {
-  const content = fs.readFileSync(path.join(ROOT, 'debug', 'SKILL.md'), 'utf-8');
+describe('investigate skill structure', () => {
+  const content = fs.readFileSync(path.join(ROOT, 'investigate', 'SKILL.md'), 'utf-8');
  for (const section of ['Iron Law', 'Root Cause', 'Pattern Analysis', 'Hypothesis',
                          'DEBUG REPORT', '3-strike', 'BLOCKED']) {
    test(`contains ${section}`, () => expect(content).toContain(section));
@@ -1167,7 +1167,7 @@ describe('Codex skill', () => {
  test('codex/SKILL.md contains review log persistence', () => {
    const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
    expect(content).toContain('codex-review');
-    expect(content).toContain('reviews.jsonl');
+    expect(content).toContain('gstack-review-log');
  });

  test('codex/SKILL.md uses which for binary discovery, not hardcoded path', () => {
@@ -1221,7 +1221,7 @@ describe('Skill trigger phrases', () => {
  // Excluded: root gstack (browser tool), gstack-upgrade (gstack-specific),
  // humanizer (text tool)
  const SKILLS_REQUIRING_TRIGGERS = [
-    'qa', 'qa-only', 'ship', 'review', 'debug', 'office-hours',
+    'qa', 'qa-only', 'ship', 'review', 'investigate', 'office-hours',
    'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
    'design-review', 'design-consultation', 'retro', 'document-release',
    'codex', 'browse', 'setup-browser-cookies',
@@ -1241,7 +1241,7 @@ describe('Skill trigger phrases', () => {

  // Skills with proactive triggers should have "Proactively suggest" in description
  const SKILLS_REQUIRING_PROACTIVE = [
-    'qa', 'qa-only', 'ship', 'review', 'debug', 'office-hours',
+    'qa', 'qa-only', 'ship', 'review', 'investigate', 'office-hours',
    'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
    'design-review', 'design-consultation', 'retro', 'document-release',
  ];