refactor(plan-ceo-review): carve review body into on-demand section

Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).

- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
  outside-voice rules, how-to-ask, Required Outputs, registries, Completion
  Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
  docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
  {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
  Section self-check. All of Step 0 (the scope/mode conversation) stays in
  the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.

Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.

Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).

Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
  (measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
  vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
  Step 0, review body absent from skeleton, report writer in the section,
  nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
  installed skill first (Codex P1), drives full Step 0, asserts the section
  is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-05-31 08:54:56 -07:00
parent 3bef43bc5a
commit ab66193e2e
14 changed files with 1831 additions and 1457 deletions
+18 -895
View File
@@ -1119,6 +1119,15 @@ rm -f /tmp/.gstack-brain-context-$$.md 2>/dev/null || true
`gstack/`, `concepts/` only). Personal/family/therapy content never leaks here.
## Section index — Read each section when its situation applies
This skill is a decision-tree skeleton. The steps below point to on-demand
sections. Read a section in full before doing its step; do not work from memory.
| When | Read this section |
|------|-------------------|
| running the 11-section deep review, required outputs, and review report (only after Step 0 scope and mode are agreed) | `sections/review-sections.md` |
## Step 0: Nuclear Scope Challenge + Mode Selection
### 0A. Premise Challenge
@@ -1364,904 +1373,18 @@ Present these mode options via AskUserQuestion using the preamble's AskUserQuest
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
## Review Sections (11 sections, after scope and mode are agreed)
> **STOP.** Before running the 11-section deep review, required outputs, and review report (only after Step 0 scope and mode are agreed), Read `~/.claude/skills/gstack/plan-ceo-review/sections/review-sections.md` and execute it
> in full. Do not work from memory — that section is the source of truth for this step.
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-11) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
## Section self-check (before you finish)
**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.
You ran a carved skill. The Section index above named `sections/review-sections.md`
as the source of truth for the 11-section deep review, the required outputs, and the
review report. Confirm you issued a Read for it and executed every section from the
file, not from memory. If you produced the Completion Summary or wrote the review
report without Reading that section, STOP, Read it now, and redo the review from the
source of truth.
### Section 1: Architecture Review
Evaluate and diagram:
* Overall system design and component boundaries. Draw the dependency graph.
* Data flow — all four paths. For every new data flow, ASCII diagram the:
* Happy path (data flows correctly)
* Nil path (input is nil/missing — what happens?)
* Empty path (input is present but empty/zero-length — what happens?)
* Error path (upstream call fails — what happens?)
* State machines. ASCII diagram for every new stateful object. Include impossible/invalid transitions and what prevents them.
* Coupling concerns. Which components are now coupled that weren't before? Is that coupling justified? Draw the before/after dependency graph.
* Scaling characteristics. What breaks first under 10x load? Under 100x?
* Single points of failure. Map them.
* Security architecture. Auth boundaries, data access patterns, API surfaces. For each new endpoint or data mutation: who can call it, what do they get, what can they change?
* Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it.
* Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"?
* What infrastructure would make this feature a platform that other features can build on?
**SELECTIVE EXPANSION:** If any accepted cherry-picks from Step 0D affect the architecture, evaluate their architectural fit here. Flag any that create coupling concerns or don't integrate cleanly — this is a chance to revisit the decision with new information.
Required ASCII diagram: full system architecture showing new components and their relationships to existing ones.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 2: Error & Rescue Map
This is the section that catches silent failures. It is not optional.
For every new method, service, or codepath that can fail, fill in this table:
```
METHOD/CODEPATH | WHAT CAN GO WRONG | EXCEPTION CLASS
-------------------------|-----------------------------|-----------------
ExampleService#call | API timeout | TimeoutError
| API returns 429 | RateLimitError
| API returns malformed JSON | JSONParseError
| DB connection pool exhausted| ConnectionPoolExhausted
| Record not found | RecordNotFound
-------------------------|-----------------------------|-----------------
EXCEPTION CLASS | RESCUED? | RESCUE ACTION | USER SEES
-----------------------------|-----------|------------------------|------------------
TimeoutError | Y | Retry 2x, then raise | "Service temporarily unavailable"
RateLimitError | Y | Backoff + retry | Nothing (transparent)
JSONParseError | N ← GAP | — | 500 error ← BAD
ConnectionPoolExhausted | N ← GAP | — | 500 error ← BAD
RecordNotFound | Y | Return nil, log warning | "Not found" message
```
Rules for this section:
* Catch-all error handling (`rescue StandardError`, `catch (Exception e)`, `except Exception`) is ALWAYS a smell. Name the specific exceptions.
* Catching an error with only a generic log message is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
* Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
* For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
* For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 3: Security & Threat Model
Security is not a sub-bullet of architecture. It gets its own section.
Evaluate:
* Attack surface expansion. What new attack vectors does this plan introduce? New endpoints, new params, new file paths, new background jobs?
* Input validation. For every new user input: is it validated, sanitized, and rejected loudly on failure? What happens with: nil, empty string, string when integer expected, string exceeding max length, unicode edge cases, HTML/script injection attempts?
* Authorization. For every new data access: is it scoped to the right user/role? Is there a direct object reference vulnerability? Can user A access user B's data by manipulating IDs?
* Secrets and credentials. New secrets? In env vars, not hardcoded? Rotatable?
* Dependency risk. New gems/npm packages? Security track record?
* Data classification. PII, payment data, credentials? Handling consistent with existing patterns?
* Injection vectors. SQL, command, template, LLM prompt injection — check all.
* Audit logging. For sensitive operations: is there an audit trail?
For each finding: threat, likelihood (High/Med/Low), impact (High/Med/Low), and whether the plan mitigates it.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 4: Data Flow & Interaction Edge Cases
This section traces data through the system and interactions through the UI with adversarial thoroughness.
**Data Flow Tracing:** For every new data flow, produce an ASCII diagram showing:
```
INPUT ──▶ VALIDATION ──▶ TRANSFORM ──▶ PERSIST ──▶ OUTPUT
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
[nil?] [invalid?] [exception?] [conflict?] [stale?]
[empty?] [too long?] [timeout?] [dup key?] [partial?]
[wrong [wrong type?] [OOM?] [locked?] [encoding?]
type?]
```
For each node: what happens on each shadow path? Is it tested?
**Interaction Edge Cases:** For every new user-visible interaction, evaluate:
```
INTERACTION | EDGE CASE | HANDLED? | HOW?
---------------------|------------------------|----------|--------
Form submission | Double-click submit | ? |
| Submit with stale CSRF | ? |
| Submit during deploy | ? |
Async operation | User navigates away | ? |
| Operation times out | ? |
| Retry while in-flight | ? |
List/table view | Zero results | ? |
| 10,000 results | ? |
| Results change mid-page| ? |
Background job | Job fails after 3 of | ? |
| 10 items processed | |
| Job runs twice (dup) | ? |
| Queue backs up 2 hours | ? |
```
Flag any unhandled edge case as a gap. For each gap, specify the fix.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 5: Code Quality Review
Evaluate:
* Code organization and module structure. Does new code fit existing patterns? If it deviates, is there a reason?
* DRY violations. Be aggressive. If the same logic exists elsewhere, flag it and reference the file and line.
* Naming quality. Are new classes, methods, and variables named for what they do, not how they do it?
* Error handling patterns. (Cross-reference with Section 2 — this section reviews the patterns; Section 2 maps the specifics.)
* Missing edge cases. List explicitly: "What happens when X is nil?" "When the API returns 429?" etc.
* Over-engineering check. Any new abstraction solving a problem that doesn't exist yet?
* Under-engineering check. Anything fragile, assuming happy path only, or missing obvious defensive checks?
* Cyclomatic complexity. Flag any new method that branches more than 5 times. Propose a refactor.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 6: Test Review
Make a complete diagram of every new thing this plan introduces:
```
NEW UX FLOWS:
[list each new user-visible interaction]
NEW DATA FLOWS:
[list each new path data takes through the system]
NEW CODEPATHS:
[list each new branch, condition, or execution path]
NEW BACKGROUND JOBS / ASYNC WORK:
[list each]
NEW INTEGRATIONS / EXTERNAL CALLS:
[list each]
NEW ERROR/RESCUE PATHS:
[list each — cross-reference Section 2]
```
For each item in the diagram:
* What type of test covers it? (Unit / Integration / System / E2E)
* Does a test for it exist in the plan? If not, write the test spec header.
* What is the happy path test?
* What is the failure path test? (Be specific — which failure?)
* What is the edge case test? (nil, empty, boundary values, concurrent access)
Test ambition check (all modes): For each new feature, answer:
* What's the test that would make you confident shipping at 2am on a Friday?
* What's the test a hostile QA engineer would write to break this?
* What's the chaos test?
Test pyramid check: Many unit, fewer integration, few E2E? Or inverted?
Flakiness risk: Flag any test depending on time, randomness, external services, or ordering.
Load/stress test requirements: For any new codepath called frequently or processing significant data.
For LLM/prompt changes: Check CLAUDE.md for the "Prompt/LLM changes" file patterns. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 7: Performance Review
Evaluate:
* N+1 queries. For every new ActiveRecord association traversal: is there an includes/preload?
* Memory usage. For every new data structure: what's the maximum size in production?
* Database indexes. For every new query: is there an index?
* Caching opportunities. For every expensive computation or external call: should it be cached?
* Background job sizing. For every new job: worst-case payload, runtime, retry behavior?
* Slow paths. Top 3 slowest new codepaths and estimated p99 latency.
* Connection pool pressure. New DB connections, Redis connections, HTTP connections?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 8: Observability & Debuggability Review
New systems break. This section ensures you can see why.
Evaluate:
* Logging. For every new codepath: structured log lines at entry, exit, and each significant branch?
* Metrics. For every new feature: what metric tells you it's working? What tells you it's broken?
* Tracing. For new cross-service or cross-job flows: trace IDs propagated?
* Alerting. What new alerts should exist?
* Dashboards. What new dashboard panels do you want on day 1?
* Debuggability. If a bug is reported 3 weeks post-ship, can you reconstruct what happened from logs alone?
* Admin tooling. New operational tasks that need admin UI or rake tasks?
* Runbooks. For each new failure mode: what's the operational response?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What observability would make this feature a joy to operate? (For SELECTIVE EXPANSION, include observability for any accepted cherry-picks.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 9: Deployment & Rollout Review
Evaluate:
* Migration safety. For every new DB migration: backward-compatible? Zero-downtime? Table locks?
* Feature flags. Should any part be behind a feature flag?
* Rollout order. Correct sequence: migrate first, deploy second?
* Rollback plan. Explicit step-by-step.
* Deploy-time risk window. Old code and new code running simultaneously — what breaks?
* Environment parity. Tested in staging?
* Post-deploy verification checklist. First 5 minutes? First hour?
* Smoke tests. What automated checks should run immediately post-deploy?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What deploy infrastructure would make shipping this feature routine? (For SELECTIVE EXPANSION, assess whether accepted cherry-picks change the deployment risk profile.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 10: Long-Term Trajectory Review
Evaluate:
* Technical debt introduced. Code debt, operational debt, testing debt, documentation debt.
* Path dependency. Does this make future changes harder?
* Knowledge concentration. Documentation sufficient for a new engineer?
* Reversibility. Rate 1-5: 1 = one-way door, 5 = easily reversible.
* Ecosystem fit. Aligns with Rails/JS ecosystem direction?
* The 1-year question. Read this plan as a new engineer in 12 months — obvious?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory?
* Platform potential. Does this create capabilities other features can leverage?
* (SELECTIVE EXPANSION only) Retrospective: Were the right cherry-picks accepted? Did any rejected expansions turn out to be load-bearing for the accepted ones?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 11: Design & UX Review (skip if no UI scope detected)
The CEO calling in the designer. Not a pixel-level audit — that's /plan-design-review and /design-review. This is ensuring the plan has design intentionality.
Evaluate:
* Information architecture — what does the user see first, second, third?
* Interaction state coverage map:
FEATURE | LOADING | EMPTY | ERROR | SUCCESS | PARTIAL
* User journey coherence — storyboard the emotional arc
* AI slop risk — does the plan describe generic UI patterns?
* DESIGN.md alignment — does the plan match the stated design system?
* Responsive intention — is mobile mentioned or afterthought?
* Accessibility basics — keyboard nav, screen readers, contrast, touch targets
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this UI feel *inevitable*?
* What 30-minute UI touches would make users think "oh nice, they thought of that"?
Required ASCII diagram: user flow showing screens/states and transitions.
If this plan has significant UI scope, recommend: "Consider running /plan-design-review for a deep design review of this plan before implementation."
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
## Outside Voice — Independent Plan Challenge (optional, recommended)
After all review sections are complete, offer an independent second opinion from a
different AI system. Two models agreeing on a plan is stronger signal than one model's
thorough review.
**Check tool availability:**
```bash
command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```
Use AskUserQuestion:
> "All review sections are complete. Want an outside voice? A different AI system can
> give a brutally honest, independent challenge of this plan — logical gaps, feasibility
> risks, and blind spots that are hard to catch from inside the review. Takes about 2
> minutes."
>
> RECOMMENDATION: Choose A — an independent second opinion catches structural blind
> spots. Two different AI models agreeing on a plan is stronger signal than one model's
> thorough review. Completeness: A=9/10, B=7/10.
Options:
- A) Get the outside voice (recommended)
- B) Skip — proceed to outputs
**If B:** Print "Skipping outside voice." and continue to the next section.
**If A:** Construct the plan review prompt. Read the plan file being reviewed (the file
the user pointed this review at, or the branch diff scope). If a CEO plan document
was written in Step 0D-POST, read that too — it contains the scope decisions and vision.
Construct this prompt (substitute the actual plan content — if plan content exceeds 30KB,
truncate to the first 30KB and note "Plan truncated for size"). **Always start with the
filesystem boundary instruction:**
"IMPORTANT: Do NOT read or execute any files under ~/.claude/, ~/.agents/, .claude/skills/, or agents/. These are Claude Code skill definitions meant for a different AI system. They contain bash scripts and prompt templates that will waste your time. Ignore them completely. Do NOT modify agents/openai.yaml. Stay focused on the repository code only.\n\nYou are a brutally honest technical reviewer examining a development plan that has
already been through a multi-section review. Your job is NOT to repeat that review.
Instead, find what it missed. Look for: logical gaps and unstated assumptions that
survived the review scrutiny, overcomplexity (is there a fundamentally simpler
approach the review was too deep in the weeds to see?), feasibility risks the review
took for granted, missing dependencies or sequencing issues, and strategic
miscalibration (is this the right thing to build at all?). Be direct. Be terse. No
compliments. Just the problems.
THE PLAN:
<plan content>"
**If CODEX_AVAILABLE:**
```bash
TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX)
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
codex exec "<prompt>" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_PV"
```
Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr:
```bash
cat "$TMPERR_PV"
```
Present the full output verbatim:
```
CODEX SAYS (plan review — outside voice):
════════════════════════════════════════════════════════════
<full codex output, verbatim — do not truncate or summarize>
════════════════════════════════════════════════════════════
```
**Error handling:** All errors are non-blocking — the outside voice is informational.
- Auth failure (stderr contains "auth", "login", "unauthorized"): "Codex auth failed. Run \`codex login\` to authenticate."
- Timeout: "Codex timed out after 5 minutes."
- Empty response: "Codex returned no response."
On any Codex error, fall back to the Claude adversarial subagent.
**If CODEX_NOT_AVAILABLE (or Codex errored):**
Dispatch via the Agent tool. The subagent has fresh context — genuine independence.
Subagent prompt: same plan review prompt as above.
Present findings under an `OUTSIDE VOICE (Claude subagent):` header.
If the subagent fails or times out: "Outside voice unavailable. Continuing to outputs."
**Cross-model tension:**
After presenting the outside voice findings, note any points where the outside voice
disagrees with the review findings from earlier sections. Flag these as:
```
CROSS-MODEL TENSION:
[Topic]: Review said X. Outside voice says Y. [Present both perspectives neutrally.
State what context you might be missing that would change the answer.]
```
**User Sovereignty:** Do NOT auto-incorporate outside voice recommendations into the plan.
Present each tension point to the user. The user decides. Cross-model agreement is a
strong signal — present it as such — but it is NOT permission to act. You may state
which argument you find more compelling, but you MUST NOT apply the change without
explicit user approval.
For each substantive tension point, use AskUserQuestion:
> "Cross-model disagreement on [topic]. The review found [X] but the outside voice
> argues [Y]. [One sentence on what context you might be missing.]"
>
> RECOMMENDATION: Choose [A or B] because [one-line reason explaining which argument
> is more compelling and why]. Completeness: A=X/10, B=Y/10.
Options:
- A) Accept the outside voice's recommendation (I'll apply this change)
- B) Keep the current approach (reject the outside voice)
- C) Investigate further before deciding
- D) Add to TODOS.md for later
Wait for the user's response. Do NOT default to accepting because you agree with the
outside voice. If the user chooses B, the current approach stands — do not re-argue.
If no tension points exist, note: "No cross-model tension — both reviewers agree."
**Persist the result:**
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-plan-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}'
```
Substitute: STATUS = "clean" if no findings, "issues_found" if findings exist.
SOURCE = "codex" if Codex ran, "claude" if subagent ran.
**Cleanup:** Run `rm -f "$TMPERR_PV"` after processing (if Codex was used).
---
### Outside Voice Integration Rule
Outside voice findings are INFORMATIONAL until the user explicitly approves each one.
Do NOT incorporate outside voice recommendations into the plan without presenting each
finding via AskUserQuestion and getting explicit approval. This applies even when you
agree with the outside voice. Cross-model consensus is a strong signal — present it as
such — but the user makes the decision.
## Post-Implementation Design Audit (if UI scope detected)
After implementation, run `/design-review` on the live site to catch visual issues that can only be evaluated with rendered output.
## CRITICAL RULE — How to ask questions
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where reasonable.
* For each option: effort, risk, and maintenance burden in one line.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference.
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Zero findings:** if a section has zero findings, state "No issues, moving on" and proceed. Otherwise, use AskUserQuestion for each finding — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan.
## Required Outputs
### "NOT in scope" section
List work considered and explicitly deferred, with one-line rationale each.
### "What already exists" section
List existing code/flows that partially solve sub-problems and whether the plan reuses them.
### "Dream state delta" section
Where this plan leaves us relative to the 12-month ideal.
### Error & Rescue Registry (from Section 2)
Complete table of every method that can fail, every exception class, rescued status, rescue action, user impact.
### Failure Modes Registry
```
CODEPATH | FAILURE MODE | RESCUED? | TEST? | USER SEES? | LOGGED?
---------|----------------|----------|-------|----------------|--------
```
Any row with RESCUED=N, TEST=N, USER SEES=Silent → **CRITICAL GAP**.
### TODOS.md updates
Present each potential TODO as its own individual AskUserQuestion. Never batch TODOs — one per question. Never silently skip this step. Follow the format in `.claude/skills/review/TODOS-format.md`.
For each TODO, describe:
* **What:** One-line description of the work.
* **Why:** The concrete problem it solves or value it unlocks.
* **Pros:** What you gain by doing this work.
* **Cons:** Cost, complexity, or risks of doing it.
* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L
* **Priority:** P1/P2/P3
* **Depends on / blocked by:** Any prerequisites or ordering constraints.
Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring.
### Scope Expansion Decisions (EXPANSION and SELECTIVE EXPANSION only)
For EXPANSION and SELECTIVE EXPANSION modes: expansion opportunities and delight items were surfaced and decided in Step 0D (opt-in/cherry-pick ceremony). The decisions are persisted in the CEO plan document. Reference the CEO plan for the full record. Do not re-surface them here — list the accepted expansions for completeness:
* Accepted: {list items added to scope}
* Deferred: {list items sent to TODOS.md}
* Skipped: {list items rejected}
### Diagrams (mandatory, produce all that apply)
1. System architecture
2. Data flow (including shadow paths)
3. State machine
4. Error flow
5. Deployment sequence
6. Rollback flowchart
### Stale Diagram Audit
List every ASCII diagram in files this plan touches. Still accurate?
## Implementation Tasks
Before closing this review, synthesize the findings above into a flat list of
build-actionable tasks. Each task derives from a specific finding — no padding.
Emit the markdown section AND write a JSONL artifact that `/autoplan` can
aggregate across phases.
### Markdown section (always emit)
```markdown
## Implementation Tasks
Synthesized from this review's findings. Each task derives from a specific
finding above. Run with Claude Code or Codex; checkbox as you ship.
- [ ] **T1 (P1, human: ~2h / CC: ~15min)** — <component> — <imperative title>
- Surfaced by: <section name> — <specific finding text or line reference>
- Files: <paths to touch>
- Verify: <test command or manual check>
- [ ] **T2 (P2, human: ~30min / CC: ~5min)** — ...
```
Rules:
- P1 blocks ship; P2 should land same branch; P3 is a follow-up TODO.
- If a finding produced no actionable task, do not invent one.
- If a section had zero findings, emit `_No new tasks from <section>._`
- Effort uses the AI-compression table from CLAUDE.md.
### JSONL artifact (always write, even if zero tasks)
`/autoplan` reads this file to aggregate across phases. Build each line with
`jq -nc` so titles and source findings containing quotes, newlines, or
backslashes serialize cleanly — never use hand-rolled `echo` / `printf`.
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
TASKS_DIR="${HOME}/.gstack/projects/${SLUG:-unknown}"
mkdir -p "$TASKS_DIR"
TASKS_FILE="$TASKS_DIR/tasks-ceo-review-$(date +%Y%m%d-%H%M%S).jsonl"
COMMIT=$(git rev-parse HEAD 2>/dev/null || echo unknown)
BRANCH=$(git branch --show-current 2>/dev/null || echo unknown)
RUN_ID="$(date -u +%Y%m%dT%H%M%SZ)-$$"
# Repeat ONE jq invocation per task identified during this review.
# Substitute the placeholders inline with shell variables you set per task:
# TASK_ID (T1, T2, ...), PRIORITY (P1/P2/P3), COMPONENT, TITLE,
# SOURCE_FINDING, EFFORT_HUMAN, EFFORT_CC, FILES_JSON (a JSON array literal
# like '["browse/src/sanitize.ts","browse/src/server.ts"]').
jq -nc \
--arg phase 'ceo-review' \
--arg run_id "$RUN_ID" \
--arg branch "$BRANCH" \
--arg commit "$COMMIT" \
--arg id "$TASK_ID" \
--arg priority "$PRIORITY" \
--arg component "$COMPONENT" \
--arg effort_human "$EFFORT_HUMAN" \
--arg effort_cc "$EFFORT_CC" \
--arg title "$TITLE" \
--arg source_finding "$SOURCE_FINDING" \
--argjson files "$FILES_JSON" \
'{phase:$phase, run_id:$run_id, branch:$branch, commit:$commit, id:$id, priority:$priority, component:$component, files:$files, effort_human:$effort_human, effort_cc:$effort_cc, title:$title, source_finding:$source_finding}' \
>> "$TASKS_FILE"
```
If `jq` is not installed, fall back to skipping the JSONL write and warn
the user to install jq for autoplan aggregation. Never hand-roll JSONL.
If zero tasks were identified in this review, still touch the JSONL file
(`: > "$TASKS_FILE"`) so the aggregator sees that the phase produced output
this run (an empty file means "ran, no findings" — distinct from "didn't run").
### Completion Summary
```
+====================================================================+
| MEGA PLAN REVIEW — COMPLETION SUMMARY |
+====================================================================+
| Mode selected | EXPANSION / SELECTIVE / HOLD / REDUCTION |
| System Audit | [key findings] |
| Step 0 | [mode + key decisions] |
| Section 1 (Arch) | ___ issues found |
| Section 2 (Errors) | ___ error paths mapped, ___ GAPS |
| Section 3 (Security)| ___ issues found, ___ High severity |
| Section 4 (Data/UX) | ___ edge cases mapped, ___ unhandled |
| Section 5 (Quality) | ___ issues found |
| Section 6 (Tests) | Diagram produced, ___ gaps |
| Section 7 (Perf) | ___ issues found |
| Section 8 (Observ) | ___ gaps found |
| Section 9 (Deploy) | ___ risks flagged |
| Section 10 (Future) | Reversibility: _/5, debt items: ___ |
| Section 11 (Design) | ___ issues / SKIPPED (no UI scope) |
+--------------------------------------------------------------------+
| NOT in scope | written (___ items) |
| What already exists | written |
| Dream state delta | written |
| Error/rescue registry| ___ methods, ___ CRITICAL GAPS |
| Failure modes | ___ total, ___ CRITICAL GAPS |
| TODOS.md updates | ___ items proposed |
| Scope proposals | ___ proposed, ___ accepted (EXP + SEL) |
| CEO plan | written / skipped (HOLD/REDUCTION) |
| Outside voice | ran (codex/claude) / skipped |
| Lake Score | X/Y recommendations chose complete option |
| Diagrams produced | ___ (list types) |
| Stale diagrams found | ___ |
| Unresolved decisions | ___ (listed below) |
+====================================================================+
```
### Unresolved Decisions
If any AskUserQuestion goes unanswered, note it here. Never silently default.
## Handoff Note Cleanup
After producing the Completion Summary, clean up any handoff notes for this branch —
the review is complete and the context is no longer needed.
```bash
setopt +o nomatch 2>/dev/null || true # zsh compat
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
rm -f ~/.gstack/projects/$SLUG/*-$BRANCH-ceo-handoff-*.md 2>/dev/null || true
```
## Review Log
After producing the Completion Summary above, persist the review result.
**PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes review metadata to
`~/.gstack/` (user config directory, not project files). The skill preamble
already writes to `~/.gstack/sessions/` and `~/.gstack/analytics/` — this is
the same pattern. The review dashboard depends on this data. Skipping this
command breaks the review readiness dashboard in /ship.
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","scope_proposed":N,"scope_accepted":N,"scope_deferred":N,"commit":"COMMIT"}'
```
Before running this command, substitute the placeholder values from the Completion Summary you just produced:
- **TIMESTAMP**: current ISO 8601 datetime (e.g., 2026-03-16T14:30:00)
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" in the summary
- **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
- **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
- **scope_proposed**: number from "Scope proposals: ___ proposed" in the summary (0 for HOLD/REDUCTION)
- **scope_accepted**: number from "Scope proposals: ___ accepted" in the summary (0 for HOLD/REDUCTION)
- **scope_deferred**: number of items deferred to TODOS.md from scope decisions (0 for HOLD/REDUCTION)
- **COMMIT**: output of `git rev-parse --short HEAD`
## Review Readiness Dashboard
After completing the review, read the review log and config to display the dashboard.
```bash
~/.claude/skills/gstack/bin/gstack-review-read
```
Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Eng Review row, show whichever is more recent between `review` (diff-scoped pre-landing review) and `plan-eng-review` (plan-stage architecture review). Append "(DIFF)" or "(PLAN)" to the status to distinguish. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. For the Outside Voice row, show the most recent `codex-plan-review` entry — this captures outside voices from both /plan-ceo-review and /plan-eng-review.
**Source attribution:** If the most recent entry for a skill has a \`"via"\` field, append it to the status label in parentheses. Examples: `plan-eng-review` with `via:"autoplan"` shows as "CLEAR (PLAN via /autoplan)". `review` with `via:"ship"` shows as "CLEAR (DIFF via /ship)". Entries without a `via` field show as "CLEAR (PLAN)" or "CLEAR (DIFF)" as before.
Note: `autoplan-voices` and `design-outside-voices` entries are audit-trail-only (forensic data for cross-model consensus analysis). They do not appear in the dashboard and are not checked by any consumer.
Display:
```
+====================================================================+
| REVIEW READINESS DASHBOARD |
+====================================================================+
| Review | Runs | Last Run | Status | Required |
|-----------------|------|---------------------|-----------|----------|
| Eng Review | 1 | 2026-03-16 15:00 | CLEAR | YES |
| CEO Review | 0 | — | — | no |
| Design Review | 0 | — | — | no |
| Adversarial | 0 | — | — | no |
| Outside Voice | 0 | — | — | no |
+--------------------------------------------------------------------+
| VERDICT: CLEARED — Eng Review passed |
+====================================================================+
```
**Review tiers:**
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed.
- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping.
**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days from either \`review\` or \`plan-eng-review\` with status "clean" (or \`skip_eng_review\` is \`true\`)
- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
- CEO, Design, and Codex reviews are shown for context but never block shipping
- If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED
**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
- Parse the \`---HEAD---\` section from the bash output to get the current HEAD commit hash
- For each review entry that has a \`commit\` field: compare it against the current HEAD. If different, count elapsed commits: \`git rev-list --count STORED_COMMIT..HEAD\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
- For entries without a \`commit\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
- If all reviews match the current HEAD, do not display any staleness notes
## Plan File Review Report
After displaying the Review Readiness Dashboard in conversation output, also update the
**plan file** itself so review status is visible to anyone reading the plan.
### Detect the plan file
1. Check if there is an active plan file in this conversation (the host provides plan file
paths in system messages — look for plan file references in the conversation context).
2. If not found, skip this section silently — not every review runs in plan mode.
### Generate the report
Read the review log output you already have from the Review Readiness Dashboard step above.
Parse each JSONL entry. Each skill logs different fields:
- **plan-ceo-review**: \`status\`, \`unresolved\`, \`critical_gaps\`, \`mode\`, \`scope_proposed\`, \`scope_accepted\`, \`scope_deferred\`, \`commit\`
→ Findings: "{scope_proposed} proposals, {scope_accepted} accepted, {scope_deferred} deferred"
→ If scope fields are 0 or missing (HOLD/REDUCTION mode): "mode: {mode}, {critical_gaps} critical gaps"
- **plan-eng-review**: \`status\`, \`unresolved\`, \`critical_gaps\`, \`issues_found\`, \`mode\`, \`commit\`
→ Findings: "{issues_found} issues, {critical_gaps} critical gaps"
- **plan-design-review**: \`status\`, \`initial_score\`, \`overall_score\`, \`unresolved\`, \`decisions_made\`, \`commit\`
→ Findings: "score: {initial_score}/10 → {overall_score}/10, {decisions_made} decisions"
- **plan-devex-review**: \`status\`, \`initial_score\`, \`overall_score\`, \`product_type\`, \`tthw_current\`, \`tthw_target\`, \`mode\`, \`persona\`, \`competitive_tier\`, \`unresolved\`, \`commit\`
→ Findings: "score: {initial_score}/10 → {overall_score}/10, TTHW: {tthw_current} → {tthw_target}"
- **devex-review**: \`status\`, \`overall_score\`, \`product_type\`, \`tthw_measured\`, \`dimensions_tested\`, \`dimensions_inferred\`, \`boomerang\`, \`commit\`
→ Findings: "score: {overall_score}/10, TTHW: {tthw_measured}, {dimensions_tested} tested/{dimensions_inferred} inferred"
- **codex-review**: \`status\`, \`gate\`, \`findings\`, \`findings_fixed\`
→ Findings: "{findings} findings, {findings_fixed}/{findings} fixed"
All fields needed for the Findings column are now present in the JSONL entries.
For the review you just completed, you may use richer details from your own Completion
Summary. For prior reviews, use the JSONL fields directly — they contain all required data.
Produce this markdown table:
\`\`\`markdown
## GSTACK REVIEW REPORT
| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| CEO Review | \`/plan-ceo-review\` | Scope & strategy | {runs} | {status} | {findings} |
| Codex Review | \`/codex review\` | Independent 2nd opinion | {runs} | {status} | {findings} |
| Eng Review | \`/plan-eng-review\` | Architecture & tests (required) | {runs} | {status} | {findings} |
| Design Review | \`/plan-design-review\` | UI/UX gaps | {runs} | {status} | {findings} |
| DX Review | \`/plan-devex-review\` | Developer experience gaps | {runs} | {status} | {findings} |
\`\`\`
Below the table, add these lines (omit any that are empty/not applicable):
- **CODEX:** (only if codex-review ran) — one-line summary of codex fixes
- **CROSS-MODEL:** (only if both Claude and Codex reviews exist) — overlap analysis
- **UNRESOLVED:** total unresolved decisions across all reviews
- **VERDICT:** list reviews that are CLEAR (e.g., "CEO + ENG CLEARED — ready to implement").
If Eng Review is not CLEAR and not skipped globally, append "eng review required".
### Write to the plan file
**PLAN MODE EXCEPTION — ALWAYS RUN:** This writes to the plan file, which is the one
file you are allowed to edit in plan mode. The plan file review report is part of the
plan's living status.
The report must always be the LAST section of the plan file — never mid-file.
Use a single delete-then-append flow:
1. Read the plan file (Read tool) to see its full current content. Search the read
output for a \`## GSTACK REVIEW REPORT\` heading anywhere in the file.
2. If found, use the Edit tool to DELETE the entire existing section. Match from
\`## GSTACK REVIEW REPORT\` through either the next \`## \` heading or end of
file, whichever comes first. Replace with the empty string. This applies
regardless of where the section currently lives — mid-file deletion is
intentional, not a special case. If the Edit fails (e.g., concurrent edit
changed the content), re-read the plan file and retry once.
3. After the delete (or skipped, if no section existed), append the new
\`## GSTACK REVIEW REPORT\` section at the END of the file. Use the Edit
tool to match the file's current last paragraph and add the section after it,
or use Write to re-emit the whole file with the section at the end.
4. Verify with the Read tool that \`## GSTACK REVIEW REPORT\` is the last
\`## \` heading in the file before continuing. If it isn't, repeat steps
2-3 once.
Do NOT replace the section in place. The "replace mid-file" path is what allowed
prior versions to leave the report mid-file when an older report already lived
there — the user then sees a plan whose review report is not at the bottom and
(correctly) rejects it.
## Next Steps — Review Chaining
After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this CEO review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this CEO review expanded scope, changed architectural direction, or accepted scope expansions, emphasize that a fresh eng review is needed. If an eng review already exists in the dashboard but the commit hash shows it predates this CEO review, note that it may be stale and should be re-run.
**Recommend /plan-design-review if UI scope was detected** — specifically if Section 11 (Design & UX Review) was NOT skipped, or if accepted scope expansions included UI-facing features. If an existing design review is stale (commit hash drift), note that. In SCOPE REDUCTION mode, skip this recommendation — design review is unlikely relevant for scope cuts.
**If both are needed, recommend eng review first** (required gate), then design review.
Use AskUserQuestion to present the next step. Include only applicable options:
- **A)** Run /plan-eng-review next (required gate)
- **B)** Run /plan-design-review next (only if UI scope detected)
- **C)** Skip — I'll handle reviews manually
## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)
At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
"The vision from this review produced {N} accepted scope expansions. Want to promote it to a design doc in the repo?"
- **A)** Promote to `docs/designs/{FEATURE}.md` (committed to repo, visible to the team)
- **B)** Keep in `~/.gstack/projects/` only (local, personal reference)
- **C)** Skip
If promoted, copy the CEO plan content to `docs/designs/{FEATURE}.md` (create the directory if needed) and update the `status` field in the original CEO plan from `ACTIVE` to `PROMOTED`.
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option.
* After each section, pause and wait for feedback.
* Use **CRITICAL GAP** / **WARNING** / **OK** for scannability.
## Capture Learnings
If you discovered a non-obvious pattern, pitfall, or architectural insight during
this session, log it for future sessions:
```bash
~/.claude/skills/gstack/bin/gstack-learnings-log '{"skill":"plan-ceo-review","type":"TYPE","key":"SHORT_KEY","insight":"DESCRIPTION","confidence":N,"source":"SOURCE","files":["path/to/relevant/file"]}'
```
**Types:** `pattern` (reusable approach), `pitfall` (what NOT to do), `preference`
(user stated), `architecture` (structural decision), `tool` (library/framework insight),
`operational` (project environment/CLI/workflow knowledge).
**Sources:** `observed` (you found this in the code), `user-stated` (user told you),
`inferred` (AI deduction), `cross-model` (both Claude and Codex agree).
**Confidence:** 1-10. Be honest. An observed pattern you verified in the code is 8-9.
An inference you're not sure about is 4-5. A user preference they explicitly stated is 10.
**files:** Include the specific file paths this learning references. This enables
staleness detection: if those files are later deleted, the learning can be flagged.
**Only log genuine discoveries.** Don't log obvious things. Don't log things the user
already knows. A good test: would this insight save time in a future session? If yes, log it.
## Brain Calibration Write-Back (Phase 2 / gated)
When the skill makes a typed prediction worth tracking (scope decision,
TTHW target, architectural bet, wedge commitment), it MAY write a
`kind=bet` take to the brain so a calibration profile builds over time.
**Gated on two things:**
1. Brain trust policy for the active endpoint is `personal` (check via
`~/.claude/skills/gstack/bin/gstack-config get brain_trust_policy@<endpoint-hash>`).
Shared brains skip write-back to avoid polluting team calibration.
2. Feature flag `BRAIN_CALIBRATION_WRITEBACK` is set (today: false; flips
to true when upstream gbrain v0.42+ ships `takes_add` MCP op).
When both gates pass, the write-back path uses `mcp__gbrain__takes_add`
to record a take with weight 0.8 (per SKILL_CALIBRATION_WEIGHTS).
If the MCP op is unavailable, fall back to `mcp__gbrain__put_page` with
a gstack:takes fence block (documented but uglier path).
Mandatory take frontmatter shape:
```yaml
kind: bet
holder: <user identity from whoami>
claim: <one-line prediction the skill is making>
weight: 0.8
since_date: <today's date>
expected_resolution: <date in 1-3 months depending on skill>
source_skill: plan-ceo-review
```
After write, invalidate the affected digests so the next preflight reflects
the new state:
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" 2>/dev/null || true
~/.claude/skills/gstack/bin/gstack-brain-cache invalidate product --project "$SLUG" 2>/dev/null || true
~/.claude/skills/gstack/bin/gstack-brain-cache invalidate goals --project "$SLUG" 2>/dev/null || true
~/.claude/skills/gstack/bin/gstack-brain-cache invalidate competitive-intel --project "$SLUG" 2>/dev/null || true
```
## Brain Cache Background Refresh
After the skill's work completes (and telemetry has logged), kick a
background refresh of any cache digest that's getting close to its TTL.
This is non-blocking — the user doesn't wait. Next invocation benefits
from the warm cache.
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" 2>/dev/null || true
(~/.claude/skills/gstack/bin/gstack-brain-cache refresh --project "$SLUG" 2>/dev/null &) || true
```
## Mode Quick Reference
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ MODE COMPARISON │
├─────────────┬──────────────┬──────────────┬──────────────┬────────────────────┤
│ │ EXPANSION │ SELECTIVE │ HOLD SCOPE │ REDUCTION │
├─────────────┼──────────────┼──────────────┼──────────────┼────────────────────┤
│ Scope │ Push UP │ Hold + offer │ Maintain │ Push DOWN │
│ │ (opt-in) │ │ │ │
│ Recommend │ Enthusiastic │ Neutral │ N/A │ N/A │
│ posture │ │ │ │ │
│ 10x check │ Mandatory │ Surface as │ Optional │ Skip │
│ │ │ cherry-pick │ │ │
│ Platonic │ Yes │ No │ No │ No │
│ ideal │ │ │ │ │
│ Delight │ Opt-in │ Cherry-pick │ Note if seen │ Skip │
│ opps │ ceremony │ ceremony │ │ │
│ Complexity │ "Is it big │ "Is it right │ "Is it too │ "Is it the bare │
│ question │ enough?" │ + what else │ complex?" │ minimum?" │
│ │ │ is tempting"│ │ │
│ Taste │ Yes │ Yes │ No │ No │
│ calibration │ │ │ │ │
│ Temporal │ Full (hr 1-6)│ Full (hr 1-6)│ Key decisions│ Skip │
│ interrogate │ │ │ only │ │
│ Observ. │ "Joy to │ "Joy to │ "Can we │ "Can we see if │
│ standard │ operate" │ operate" │ debug it?" │ it's broken?" │
│ Deploy │ Infra as │ Safe deploy │ Safe deploy │ Simplest possible │
│ standard │ feature scope│ + cherry-pick│ + rollback │ deploy │
│ │ │ risk check │ │ │
│ Error map │ Full + chaos │ Full + chaos │ Full │ Critical paths │
│ │ scenarios │ for accepted │ │ only │
│ CEO plan │ Written │ Written │ Skipped │ Skipped │
│ Phase 2/3 │ Map accepted │ Map accepted │ Note it │ Skip │
│ planning │ │ cherry-picks │ │ │
│ Design │ "Inevitable" │ If UI scope │ If UI scope │ Skip │
│ (Sec 11) │ UI review │ detected │ detected │ │
└─────────────┴──────────────┴──────────────┴──────────────┴────────────────────┘
```
## EXIT PLAN MODE GATE (BLOCKING)
+10 -486
View File
@@ -224,6 +224,8 @@ Feed into the Premise Challenge (0A) and Dream State Mapping (0C). If you find a
{{BRAIN_PREFLIGHT}}
{{SECTION_INDEX:plan-ceo-review}}
## Step 0: Nuclear Scope Challenge + Mode Selection
### 0A. Premise Challenge
@@ -409,494 +411,16 @@ Present these mode options via AskUserQuestion using the preamble's AskUserQuest
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
## Review Sections (11 sections, after scope and mode are agreed)
{{SECTION:review-sections}}
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-11) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
## Section self-check (before you finish)
{{ANTI_SHORTCUT_CLAUSE}}
You ran a carved skill. The Section index above named `sections/review-sections.md`
as the source of truth for the 11-section deep review, the required outputs, and the
review report. Confirm you issued a Read for it and executed every section from the
file, not from memory. If you produced the Completion Summary or wrote the review
report without Reading that section, STOP, Read it now, and redo the review from the
source of truth.
### Section 1: Architecture Review
Evaluate and diagram:
* Overall system design and component boundaries. Draw the dependency graph.
* Data flow — all four paths. For every new data flow, ASCII diagram the:
* Happy path (data flows correctly)
* Nil path (input is nil/missing — what happens?)
* Empty path (input is present but empty/zero-length — what happens?)
* Error path (upstream call fails — what happens?)
* State machines. ASCII diagram for every new stateful object. Include impossible/invalid transitions and what prevents them.
* Coupling concerns. Which components are now coupled that weren't before? Is that coupling justified? Draw the before/after dependency graph.
* Scaling characteristics. What breaks first under 10x load? Under 100x?
* Single points of failure. Map them.
* Security architecture. Auth boundaries, data access patterns, API surfaces. For each new endpoint or data mutation: who can call it, what do they get, what can they change?
* Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it.
* Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"?
* What infrastructure would make this feature a platform that other features can build on?
**SELECTIVE EXPANSION:** If any accepted cherry-picks from Step 0D affect the architecture, evaluate their architectural fit here. Flag any that create coupling concerns or don't integrate cleanly — this is a chance to revisit the decision with new information.
Required ASCII diagram: full system architecture showing new components and their relationships to existing ones.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 2: Error & Rescue Map
This is the section that catches silent failures. It is not optional.
For every new method, service, or codepath that can fail, fill in this table:
```
METHOD/CODEPATH | WHAT CAN GO WRONG | EXCEPTION CLASS
-------------------------|-----------------------------|-----------------
ExampleService#call | API timeout | TimeoutError
| API returns 429 | RateLimitError
| API returns malformed JSON | JSONParseError
| DB connection pool exhausted| ConnectionPoolExhausted
| Record not found | RecordNotFound
-------------------------|-----------------------------|-----------------
EXCEPTION CLASS | RESCUED? | RESCUE ACTION | USER SEES
-----------------------------|-----------|------------------------|------------------
TimeoutError | Y | Retry 2x, then raise | "Service temporarily unavailable"
RateLimitError | Y | Backoff + retry | Nothing (transparent)
JSONParseError | N ← GAP | — | 500 error ← BAD
ConnectionPoolExhausted | N ← GAP | — | 500 error ← BAD
RecordNotFound | Y | Return nil, log warning | "Not found" message
```
Rules for this section:
* Catch-all error handling (`rescue StandardError`, `catch (Exception e)`, `except Exception`) is ALWAYS a smell. Name the specific exceptions.
* Catching an error with only a generic log message is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
* Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
* For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
* For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 3: Security & Threat Model
Security is not a sub-bullet of architecture. It gets its own section.
Evaluate:
* Attack surface expansion. What new attack vectors does this plan introduce? New endpoints, new params, new file paths, new background jobs?
* Input validation. For every new user input: is it validated, sanitized, and rejected loudly on failure? What happens with: nil, empty string, string when integer expected, string exceeding max length, unicode edge cases, HTML/script injection attempts?
* Authorization. For every new data access: is it scoped to the right user/role? Is there a direct object reference vulnerability? Can user A access user B's data by manipulating IDs?
* Secrets and credentials. New secrets? In env vars, not hardcoded? Rotatable?
* Dependency risk. New gems/npm packages? Security track record?
* Data classification. PII, payment data, credentials? Handling consistent with existing patterns?
* Injection vectors. SQL, command, template, LLM prompt injection — check all.
* Audit logging. For sensitive operations: is there an audit trail?
For each finding: threat, likelihood (High/Med/Low), impact (High/Med/Low), and whether the plan mitigates it.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 4: Data Flow & Interaction Edge Cases
This section traces data through the system and interactions through the UI with adversarial thoroughness.
**Data Flow Tracing:** For every new data flow, produce an ASCII diagram showing:
```
INPUT ──▶ VALIDATION ──▶ TRANSFORM ──▶ PERSIST ──▶ OUTPUT
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
[nil?] [invalid?] [exception?] [conflict?] [stale?]
[empty?] [too long?] [timeout?] [dup key?] [partial?]
[wrong [wrong type?] [OOM?] [locked?] [encoding?]
type?]
```
For each node: what happens on each shadow path? Is it tested?
**Interaction Edge Cases:** For every new user-visible interaction, evaluate:
```
INTERACTION | EDGE CASE | HANDLED? | HOW?
---------------------|------------------------|----------|--------
Form submission | Double-click submit | ? |
| Submit with stale CSRF | ? |
| Submit during deploy | ? |
Async operation | User navigates away | ? |
| Operation times out | ? |
| Retry while in-flight | ? |
List/table view | Zero results | ? |
| 10,000 results | ? |
| Results change mid-page| ? |
Background job | Job fails after 3 of | ? |
| 10 items processed | |
| Job runs twice (dup) | ? |
| Queue backs up 2 hours | ? |
```
Flag any unhandled edge case as a gap. For each gap, specify the fix.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 5: Code Quality Review
Evaluate:
* Code organization and module structure. Does new code fit existing patterns? If it deviates, is there a reason?
* DRY violations. Be aggressive. If the same logic exists elsewhere, flag it and reference the file and line.
* Naming quality. Are new classes, methods, and variables named for what they do, not how they do it?
* Error handling patterns. (Cross-reference with Section 2 — this section reviews the patterns; Section 2 maps the specifics.)
* Missing edge cases. List explicitly: "What happens when X is nil?" "When the API returns 429?" etc.
* Over-engineering check. Any new abstraction solving a problem that doesn't exist yet?
* Under-engineering check. Anything fragile, assuming happy path only, or missing obvious defensive checks?
* Cyclomatic complexity. Flag any new method that branches more than 5 times. Propose a refactor.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 6: Test Review
Make a complete diagram of every new thing this plan introduces:
```
NEW UX FLOWS:
[list each new user-visible interaction]
NEW DATA FLOWS:
[list each new path data takes through the system]
NEW CODEPATHS:
[list each new branch, condition, or execution path]
NEW BACKGROUND JOBS / ASYNC WORK:
[list each]
NEW INTEGRATIONS / EXTERNAL CALLS:
[list each]
NEW ERROR/RESCUE PATHS:
[list each — cross-reference Section 2]
```
For each item in the diagram:
* What type of test covers it? (Unit / Integration / System / E2E)
* Does a test for it exist in the plan? If not, write the test spec header.
* What is the happy path test?
* What is the failure path test? (Be specific — which failure?)
* What is the edge case test? (nil, empty, boundary values, concurrent access)
Test ambition check (all modes): For each new feature, answer:
* What's the test that would make you confident shipping at 2am on a Friday?
* What's the test a hostile QA engineer would write to break this?
* What's the chaos test?
Test pyramid check: Many unit, fewer integration, few E2E? Or inverted?
Flakiness risk: Flag any test depending on time, randomness, external services, or ordering.
Load/stress test requirements: For any new codepath called frequently or processing significant data.
For LLM/prompt changes: Check CLAUDE.md for the "Prompt/LLM changes" file patterns. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 7: Performance Review
Evaluate:
* N+1 queries. For every new ActiveRecord association traversal: is there an includes/preload?
* Memory usage. For every new data structure: what's the maximum size in production?
* Database indexes. For every new query: is there an index?
* Caching opportunities. For every expensive computation or external call: should it be cached?
* Background job sizing. For every new job: worst-case payload, runtime, retry behavior?
* Slow paths. Top 3 slowest new codepaths and estimated p99 latency.
* Connection pool pressure. New DB connections, Redis connections, HTTP connections?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 8: Observability & Debuggability Review
New systems break. This section ensures you can see why.
Evaluate:
* Logging. For every new codepath: structured log lines at entry, exit, and each significant branch?
* Metrics. For every new feature: what metric tells you it's working? What tells you it's broken?
* Tracing. For new cross-service or cross-job flows: trace IDs propagated?
* Alerting. What new alerts should exist?
* Dashboards. What new dashboard panels do you want on day 1?
* Debuggability. If a bug is reported 3 weeks post-ship, can you reconstruct what happened from logs alone?
* Admin tooling. New operational tasks that need admin UI or rake tasks?
* Runbooks. For each new failure mode: what's the operational response?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What observability would make this feature a joy to operate? (For SELECTIVE EXPANSION, include observability for any accepted cherry-picks.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 9: Deployment & Rollout Review
Evaluate:
* Migration safety. For every new DB migration: backward-compatible? Zero-downtime? Table locks?
* Feature flags. Should any part be behind a feature flag?
* Rollout order. Correct sequence: migrate first, deploy second?
* Rollback plan. Explicit step-by-step.
* Deploy-time risk window. Old code and new code running simultaneously — what breaks?
* Environment parity. Tested in staging?
* Post-deploy verification checklist. First 5 minutes? First hour?
* Smoke tests. What automated checks should run immediately post-deploy?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What deploy infrastructure would make shipping this feature routine? (For SELECTIVE EXPANSION, assess whether accepted cherry-picks change the deployment risk profile.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 10: Long-Term Trajectory Review
Evaluate:
* Technical debt introduced. Code debt, operational debt, testing debt, documentation debt.
* Path dependency. Does this make future changes harder?
* Knowledge concentration. Documentation sufficient for a new engineer?
* Reversibility. Rate 1-5: 1 = one-way door, 5 = easily reversible.
* Ecosystem fit. Aligns with Rails/JS ecosystem direction?
* The 1-year question. Read this plan as a new engineer in 12 months — obvious?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory?
* Platform potential. Does this create capabilities other features can leverage?
* (SELECTIVE EXPANSION only) Retrospective: Were the right cherry-picks accepted? Did any rejected expansions turn out to be load-bearing for the accepted ones?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 11: Design & UX Review (skip if no UI scope detected)
The CEO calling in the designer. Not a pixel-level audit — that's /plan-design-review and /design-review. This is ensuring the plan has design intentionality.
Evaluate:
* Information architecture — what does the user see first, second, third?
* Interaction state coverage map:
FEATURE | LOADING | EMPTY | ERROR | SUCCESS | PARTIAL
* User journey coherence — storyboard the emotional arc
* AI slop risk — does the plan describe generic UI patterns?
* DESIGN.md alignment — does the plan match the stated design system?
* Responsive intention — is mobile mentioned or afterthought?
* Accessibility basics — keyboard nav, screen readers, contrast, touch targets
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this UI feel *inevitable*?
* What 30-minute UI touches would make users think "oh nice, they thought of that"?
Required ASCII diagram: user flow showing screens/states and transitions.
If this plan has significant UI scope, recommend: "Consider running /plan-design-review for a deep design review of this plan before implementation."
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
{{CODEX_PLAN_REVIEW}}
### Outside Voice Integration Rule
Outside voice findings are INFORMATIONAL until the user explicitly approves each one.
Do NOT incorporate outside voice recommendations into the plan without presenting each
finding via AskUserQuestion and getting explicit approval. This applies even when you
agree with the outside voice. Cross-model consensus is a strong signal — present it as
such — but the user makes the decision.
## Post-Implementation Design Audit (if UI scope detected)
After implementation, run `/design-review` on the live site to catch visual issues that can only be evaluated with rendered output.
## CRITICAL RULE — How to ask questions
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where reasonable.
* For each option: effort, risk, and maintenance burden in one line.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference.
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Zero findings:** if a section has zero findings, state "No issues, moving on" and proceed. Otherwise, use AskUserQuestion for each finding — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan.
## Required Outputs
### "NOT in scope" section
List work considered and explicitly deferred, with one-line rationale each.
### "What already exists" section
List existing code/flows that partially solve sub-problems and whether the plan reuses them.
### "Dream state delta" section
Where this plan leaves us relative to the 12-month ideal.
### Error & Rescue Registry (from Section 2)
Complete table of every method that can fail, every exception class, rescued status, rescue action, user impact.
### Failure Modes Registry
```
CODEPATH | FAILURE MODE | RESCUED? | TEST? | USER SEES? | LOGGED?
---------|----------------|----------|-------|----------------|--------
```
Any row with RESCUED=N, TEST=N, USER SEES=Silent → **CRITICAL GAP**.
### TODOS.md updates
Present each potential TODO as its own individual AskUserQuestion. Never batch TODOs — one per question. Never silently skip this step. Follow the format in `.claude/skills/review/TODOS-format.md`.
For each TODO, describe:
* **What:** One-line description of the work.
* **Why:** The concrete problem it solves or value it unlocks.
* **Pros:** What you gain by doing this work.
* **Cons:** Cost, complexity, or risks of doing it.
* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L
* **Priority:** P1/P2/P3
* **Depends on / blocked by:** Any prerequisites or ordering constraints.
Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring.
### Scope Expansion Decisions (EXPANSION and SELECTIVE EXPANSION only)
For EXPANSION and SELECTIVE EXPANSION modes: expansion opportunities and delight items were surfaced and decided in Step 0D (opt-in/cherry-pick ceremony). The decisions are persisted in the CEO plan document. Reference the CEO plan for the full record. Do not re-surface them here — list the accepted expansions for completeness:
* Accepted: {list items added to scope}
* Deferred: {list items sent to TODOS.md}
* Skipped: {list items rejected}
### Diagrams (mandatory, produce all that apply)
1. System architecture
2. Data flow (including shadow paths)
3. State machine
4. Error flow
5. Deployment sequence
6. Rollback flowchart
### Stale Diagram Audit
List every ASCII diagram in files this plan touches. Still accurate?
{{TASKS_SECTION_EMIT:ceo-review}}
### Completion Summary
```
+====================================================================+
| MEGA PLAN REVIEW — COMPLETION SUMMARY |
+====================================================================+
| Mode selected | EXPANSION / SELECTIVE / HOLD / REDUCTION |
| System Audit | [key findings] |
| Step 0 | [mode + key decisions] |
| Section 1 (Arch) | ___ issues found |
| Section 2 (Errors) | ___ error paths mapped, ___ GAPS |
| Section 3 (Security)| ___ issues found, ___ High severity |
| Section 4 (Data/UX) | ___ edge cases mapped, ___ unhandled |
| Section 5 (Quality) | ___ issues found |
| Section 6 (Tests) | Diagram produced, ___ gaps |
| Section 7 (Perf) | ___ issues found |
| Section 8 (Observ) | ___ gaps found |
| Section 9 (Deploy) | ___ risks flagged |
| Section 10 (Future) | Reversibility: _/5, debt items: ___ |
| Section 11 (Design) | ___ issues / SKIPPED (no UI scope) |
+--------------------------------------------------------------------+
| NOT in scope | written (___ items) |
| What already exists | written |
| Dream state delta | written |
| Error/rescue registry| ___ methods, ___ CRITICAL GAPS |
| Failure modes | ___ total, ___ CRITICAL GAPS |
| TODOS.md updates | ___ items proposed |
| Scope proposals | ___ proposed, ___ accepted (EXP + SEL) |
| CEO plan | written / skipped (HOLD/REDUCTION) |
| Outside voice | ran (codex/claude) / skipped |
| Lake Score | X/Y recommendations chose complete option |
| Diagrams produced | ___ (list types) |
| Stale diagrams found | ___ |
| Unresolved decisions | ___ (listed below) |
+====================================================================+
```
### Unresolved Decisions
If any AskUserQuestion goes unanswered, note it here. Never silently default.
## Handoff Note Cleanup
After producing the Completion Summary, clean up any handoff notes for this branch —
the review is complete and the context is no longer needed.
```bash
setopt +o nomatch 2>/dev/null || true # zsh compat
{{SLUG_EVAL}}
rm -f ~/.gstack/projects/$SLUG/*-$BRANCH-ceo-handoff-*.md 2>/dev/null || true
```
## Review Log
After producing the Completion Summary above, persist the review result.
**PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes review metadata to
`~/.gstack/` (user config directory, not project files). The skill preamble
already writes to `~/.gstack/sessions/` and `~/.gstack/analytics/` — this is
the same pattern. The review dashboard depends on this data. Skipping this
command breaks the review readiness dashboard in /ship.
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","scope_proposed":N,"scope_accepted":N,"scope_deferred":N,"commit":"COMMIT"}'
```
Before running this command, substitute the placeholder values from the Completion Summary you just produced:
- **TIMESTAMP**: current ISO 8601 datetime (e.g., 2026-03-16T14:30:00)
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" in the summary
- **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
- **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
- **scope_proposed**: number from "Scope proposals: ___ proposed" in the summary (0 for HOLD/REDUCTION)
- **scope_accepted**: number from "Scope proposals: ___ accepted" in the summary (0 for HOLD/REDUCTION)
- **scope_deferred**: number of items deferred to TODOS.md from scope decisions (0 for HOLD/REDUCTION)
- **COMMIT**: output of `git rev-parse --short HEAD`
{{REVIEW_DASHBOARD}}
{{PLAN_FILE_REVIEW_REPORT}}
## Next Steps — Review Chaining
After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this CEO review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this CEO review expanded scope, changed architectural direction, or accepted scope expansions, emphasize that a fresh eng review is needed. If an eng review already exists in the dashboard but the commit hash shows it predates this CEO review, note that it may be stale and should be re-run.
**Recommend /plan-design-review if UI scope was detected** — specifically if Section 11 (Design & UX Review) was NOT skipped, or if accepted scope expansions included UI-facing features. If an existing design review is stale (commit hash drift), note that. In SCOPE REDUCTION mode, skip this recommendation — design review is unlikely relevant for scope cuts.
**If both are needed, recommend eng review first** (required gate), then design review.
Use AskUserQuestion to present the next step. Include only applicable options:
- **A)** Run /plan-eng-review next (required gate)
- **B)** Run /plan-design-review next (only if UI scope detected)
- **C)** Skip — I'll handle reviews manually
## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)
At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
"The vision from this review produced {N} accepted scope expansions. Want to promote it to a design doc in the repo?"
- **A)** Promote to `docs/designs/{FEATURE}.md` (committed to repo, visible to the team)
- **B)** Keep in `~/.gstack/projects/` only (local, personal reference)
- **C)** Skip
If promoted, copy the CEO plan content to `docs/designs/{FEATURE}.md` (create the directory if needed) and update the `status` field in the original CEO plan from `ACTIVE` to `PROMOTED`.
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option.
* After each section, pause and wait for feedback.
* Use **CRITICAL GAP** / **WARNING** / **OK** for scannability.
{{LEARNINGS_LOG}}
{{GBRAIN_SAVE_RESULTS}}
{{BRAIN_WRITE_BACK}}
{{BRAIN_CACHE_REFRESH}}
## Mode Quick Reference
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ MODE COMPARISON │
├─────────────┬──────────────┬──────────────┬──────────────┬────────────────────┤
│ │ EXPANSION │ SELECTIVE │ HOLD SCOPE │ REDUCTION │
├─────────────┼──────────────┼──────────────┼──────────────┼────────────────────┤
│ Scope │ Push UP │ Hold + offer │ Maintain │ Push DOWN │
│ │ (opt-in) │ │ │ │
│ Recommend │ Enthusiastic │ Neutral │ N/A │ N/A │
│ posture │ │ │ │ │
│ 10x check │ Mandatory │ Surface as │ Optional │ Skip │
│ │ │ cherry-pick │ │ │
│ Platonic │ Yes │ No │ No │ No │
│ ideal │ │ │ │ │
│ Delight │ Opt-in │ Cherry-pick │ Note if seen │ Skip │
│ opps │ ceremony │ ceremony │ │ │
│ Complexity │ "Is it big │ "Is it right │ "Is it too │ "Is it the bare │
│ question │ enough?" │ + what else │ complex?" │ minimum?" │
│ │ │ is tempting"│ │ │
│ Taste │ Yes │ Yes │ No │ No │
│ calibration │ │ │ │ │
│ Temporal │ Full (hr 1-6)│ Full (hr 1-6)│ Key decisions│ Skip │
│ interrogate │ │ │ only │ │
│ Observ. │ "Joy to │ "Joy to │ "Can we │ "Can we see if │
│ standard │ operate" │ operate" │ debug it?" │ it's broken?" │
│ Deploy │ Infra as │ Safe deploy │ Safe deploy │ Simplest possible │
│ standard │ feature scope│ + cherry-pick│ + rollback │ deploy │
│ │ │ risk check │ │ │
│ Error map │ Full + chaos │ Full + chaos │ Full │ Critical paths │
│ │ scenarios │ for accepted │ │ only │
│ CEO plan │ Written │ Written │ Skipped │ Skipped │
│ Phase 2/3 │ Map accepted │ Map accepted │ Note it │ Skip │
│ planning │ │ cherry-picks │ │ │
│ Design │ "Inevitable" │ If UI scope │ If UI scope │ Skip │
│ (Sec 11) │ UI review │ detected │ detected │ │
└─────────────┴──────────────┴──────────────┴──────────────┴────────────────────┘
```
{{EXIT_PLAN_MODE_GATE}}
+14
View File
@@ -0,0 +1,14 @@
{
"$schema": "https://gstack.dev/schemas/section-manifest.json",
"skill": "plan-ceo-review",
"version": 1,
"note": "PASSIVE registry (v2 plan T9 / CM2). Fields are IDs, file paths, human titles, and human-readable trigger text ONLY. The skeleton's decision-tree prose is the ONLY place that decides WHEN to read a section; required-reads live in the E2E fixtures. No machine predicate here — see docs/designs/v2_PLAN.md:663.",
"sections": [
{
"id": "review-sections",
"file": "review-sections.md",
"title": "11-section deep review, required outputs + review report",
"trigger": "running the 11-section deep review, required outputs, and review report (only after Step 0 scope and mode are agreed)"
}
]
}
+900
View File
@@ -0,0 +1,900 @@
<!-- AUTO-GENERATED from review-sections.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Review Sections (11 sections, after scope and mode are agreed)
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-11) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
**Anti-shortcut clause:** The plan file is the OUTPUT of the interactive review, not a substitute for it. Writing every finding into one plan write and calling ExitPlanMode without firing AskUserQuestion is the precise failure mode of the May 2026 transcript bug — the model explored, found issues, and dumped them into a deliverable rather than walking the user through them. If you have ANY non-trivial finding in any review section, the path from finding to ExitPlanMode goes THROUGH AskUserQuestion. Zero findings in every section is the only path to ExitPlanMode that bypasses AskUserQuestion. If you find yourself wanting to write a plan with findings before asking, stop and call AskUserQuestion now — that's the bug, recognize it.
### Section 1: Architecture Review
Evaluate and diagram:
* Overall system design and component boundaries. Draw the dependency graph.
* Data flow — all four paths. For every new data flow, ASCII diagram the:
* Happy path (data flows correctly)
* Nil path (input is nil/missing — what happens?)
* Empty path (input is present but empty/zero-length — what happens?)
* Error path (upstream call fails — what happens?)
* State machines. ASCII diagram for every new stateful object. Include impossible/invalid transitions and what prevents them.
* Coupling concerns. Which components are now coupled that weren't before? Is that coupling justified? Draw the before/after dependency graph.
* Scaling characteristics. What breaks first under 10x load? Under 100x?
* Single points of failure. Map them.
* Security architecture. Auth boundaries, data access patterns, API surfaces. For each new endpoint or data mutation: who can call it, what do they get, what can they change?
* Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it.
* Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"?
* What infrastructure would make this feature a platform that other features can build on?
**SELECTIVE EXPANSION:** If any accepted cherry-picks from Step 0D affect the architecture, evaluate their architectural fit here. Flag any that create coupling concerns or don't integrate cleanly — this is a chance to revisit the decision with new information.
Required ASCII diagram: full system architecture showing new components and their relationships to existing ones.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 2: Error & Rescue Map
This is the section that catches silent failures. It is not optional.
For every new method, service, or codepath that can fail, fill in this table:
```
METHOD/CODEPATH | WHAT CAN GO WRONG | EXCEPTION CLASS
-------------------------|-----------------------------|-----------------
ExampleService#call | API timeout | TimeoutError
| API returns 429 | RateLimitError
| API returns malformed JSON | JSONParseError
| DB connection pool exhausted| ConnectionPoolExhausted
| Record not found | RecordNotFound
-------------------------|-----------------------------|-----------------
EXCEPTION CLASS | RESCUED? | RESCUE ACTION | USER SEES
-----------------------------|-----------|------------------------|------------------
TimeoutError | Y | Retry 2x, then raise | "Service temporarily unavailable"
RateLimitError | Y | Backoff + retry | Nothing (transparent)
JSONParseError | N ← GAP | — | 500 error ← BAD
ConnectionPoolExhausted | N ← GAP | — | 500 error ← BAD
RecordNotFound | Y | Return nil, log warning | "Not found" message
```
Rules for this section:
* Catch-all error handling (`rescue StandardError`, `catch (Exception e)`, `except Exception`) is ALWAYS a smell. Name the specific exceptions.
* Catching an error with only a generic log message is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
* Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
* For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
* For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 3: Security & Threat Model
Security is not a sub-bullet of architecture. It gets its own section.
Evaluate:
* Attack surface expansion. What new attack vectors does this plan introduce? New endpoints, new params, new file paths, new background jobs?
* Input validation. For every new user input: is it validated, sanitized, and rejected loudly on failure? What happens with: nil, empty string, string when integer expected, string exceeding max length, unicode edge cases, HTML/script injection attempts?
* Authorization. For every new data access: is it scoped to the right user/role? Is there a direct object reference vulnerability? Can user A access user B's data by manipulating IDs?
* Secrets and credentials. New secrets? In env vars, not hardcoded? Rotatable?
* Dependency risk. New gems/npm packages? Security track record?
* Data classification. PII, payment data, credentials? Handling consistent with existing patterns?
* Injection vectors. SQL, command, template, LLM prompt injection — check all.
* Audit logging. For sensitive operations: is there an audit trail?
For each finding: threat, likelihood (High/Med/Low), impact (High/Med/Low), and whether the plan mitigates it.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 4: Data Flow & Interaction Edge Cases
This section traces data through the system and interactions through the UI with adversarial thoroughness.
**Data Flow Tracing:** For every new data flow, produce an ASCII diagram showing:
```
INPUT ──▶ VALIDATION ──▶ TRANSFORM ──▶ PERSIST ──▶ OUTPUT
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
[nil?] [invalid?] [exception?] [conflict?] [stale?]
[empty?] [too long?] [timeout?] [dup key?] [partial?]
[wrong [wrong type?] [OOM?] [locked?] [encoding?]
type?]
```
For each node: what happens on each shadow path? Is it tested?
**Interaction Edge Cases:** For every new user-visible interaction, evaluate:
```
INTERACTION | EDGE CASE | HANDLED? | HOW?
---------------------|------------------------|----------|--------
Form submission | Double-click submit | ? |
| Submit with stale CSRF | ? |
| Submit during deploy | ? |
Async operation | User navigates away | ? |
| Operation times out | ? |
| Retry while in-flight | ? |
List/table view | Zero results | ? |
| 10,000 results | ? |
| Results change mid-page| ? |
Background job | Job fails after 3 of | ? |
| 10 items processed | |
| Job runs twice (dup) | ? |
| Queue backs up 2 hours | ? |
```
Flag any unhandled edge case as a gap. For each gap, specify the fix.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 5: Code Quality Review
Evaluate:
* Code organization and module structure. Does new code fit existing patterns? If it deviates, is there a reason?
* DRY violations. Be aggressive. If the same logic exists elsewhere, flag it and reference the file and line.
* Naming quality. Are new classes, methods, and variables named for what they do, not how they do it?
* Error handling patterns. (Cross-reference with Section 2 — this section reviews the patterns; Section 2 maps the specifics.)
* Missing edge cases. List explicitly: "What happens when X is nil?" "When the API returns 429?" etc.
* Over-engineering check. Any new abstraction solving a problem that doesn't exist yet?
* Under-engineering check. Anything fragile, assuming happy path only, or missing obvious defensive checks?
* Cyclomatic complexity. Flag any new method that branches more than 5 times. Propose a refactor.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 6: Test Review
Make a complete diagram of every new thing this plan introduces:
```
NEW UX FLOWS:
[list each new user-visible interaction]
NEW DATA FLOWS:
[list each new path data takes through the system]
NEW CODEPATHS:
[list each new branch, condition, or execution path]
NEW BACKGROUND JOBS / ASYNC WORK:
[list each]
NEW INTEGRATIONS / EXTERNAL CALLS:
[list each]
NEW ERROR/RESCUE PATHS:
[list each — cross-reference Section 2]
```
For each item in the diagram:
* What type of test covers it? (Unit / Integration / System / E2E)
* Does a test for it exist in the plan? If not, write the test spec header.
* What is the happy path test?
* What is the failure path test? (Be specific — which failure?)
* What is the edge case test? (nil, empty, boundary values, concurrent access)
Test ambition check (all modes): For each new feature, answer:
* What's the test that would make you confident shipping at 2am on a Friday?
* What's the test a hostile QA engineer would write to break this?
* What's the chaos test?
Test pyramid check: Many unit, fewer integration, few E2E? Or inverted?
Flakiness risk: Flag any test depending on time, randomness, external services, or ordering.
Load/stress test requirements: For any new codepath called frequently or processing significant data.
For LLM/prompt changes: Check CLAUDE.md for the "Prompt/LLM changes" file patterns. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 7: Performance Review
Evaluate:
* N+1 queries. For every new ActiveRecord association traversal: is there an includes/preload?
* Memory usage. For every new data structure: what's the maximum size in production?
* Database indexes. For every new query: is there an index?
* Caching opportunities. For every expensive computation or external call: should it be cached?
* Background job sizing. For every new job: worst-case payload, runtime, retry behavior?
* Slow paths. Top 3 slowest new codepaths and estimated p99 latency.
* Connection pool pressure. New DB connections, Redis connections, HTTP connections?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 8: Observability & Debuggability Review
New systems break. This section ensures you can see why.
Evaluate:
* Logging. For every new codepath: structured log lines at entry, exit, and each significant branch?
* Metrics. For every new feature: what metric tells you it's working? What tells you it's broken?
* Tracing. For new cross-service or cross-job flows: trace IDs propagated?
* Alerting. What new alerts should exist?
* Dashboards. What new dashboard panels do you want on day 1?
* Debuggability. If a bug is reported 3 weeks post-ship, can you reconstruct what happened from logs alone?
* Admin tooling. New operational tasks that need admin UI or rake tasks?
* Runbooks. For each new failure mode: what's the operational response?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What observability would make this feature a joy to operate? (For SELECTIVE EXPANSION, include observability for any accepted cherry-picks.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 9: Deployment & Rollout Review
Evaluate:
* Migration safety. For every new DB migration: backward-compatible? Zero-downtime? Table locks?
* Feature flags. Should any part be behind a feature flag?
* Rollout order. Correct sequence: migrate first, deploy second?
* Rollback plan. Explicit step-by-step.
* Deploy-time risk window. Old code and new code running simultaneously — what breaks?
* Environment parity. Tested in staging?
* Post-deploy verification checklist. First 5 minutes? First hour?
* Smoke tests. What automated checks should run immediately post-deploy?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What deploy infrastructure would make shipping this feature routine? (For SELECTIVE EXPANSION, assess whether accepted cherry-picks change the deployment risk profile.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 10: Long-Term Trajectory Review
Evaluate:
* Technical debt introduced. Code debt, operational debt, testing debt, documentation debt.
* Path dependency. Does this make future changes harder?
* Knowledge concentration. Documentation sufficient for a new engineer?
* Reversibility. Rate 1-5: 1 = one-way door, 5 = easily reversible.
* Ecosystem fit. Aligns with Rails/JS ecosystem direction?
* The 1-year question. Read this plan as a new engineer in 12 months — obvious?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory?
* Platform potential. Does this create capabilities other features can leverage?
* (SELECTIVE EXPANSION only) Retrospective: Were the right cherry-picks accepted? Did any rejected expansions turn out to be load-bearing for the accepted ones?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 11: Design & UX Review (skip if no UI scope detected)
The CEO calling in the designer. Not a pixel-level audit — that's /plan-design-review and /design-review. This is ensuring the plan has design intentionality.
Evaluate:
* Information architecture — what does the user see first, second, third?
* Interaction state coverage map:
FEATURE | LOADING | EMPTY | ERROR | SUCCESS | PARTIAL
* User journey coherence — storyboard the emotional arc
* AI slop risk — does the plan describe generic UI patterns?
* DESIGN.md alignment — does the plan match the stated design system?
* Responsive intention — is mobile mentioned or afterthought?
* Accessibility basics — keyboard nav, screen readers, contrast, touch targets
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this UI feel *inevitable*?
* What 30-minute UI touches would make users think "oh nice, they thought of that"?
Required ASCII diagram: user flow showing screens/states and transitions.
If this plan has significant UI scope, recommend: "Consider running /plan-design-review for a deep design review of this plan before implementation."
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
## Outside Voice — Independent Plan Challenge (optional, recommended)
After all review sections are complete, offer an independent second opinion from a
different AI system. Two models agreeing on a plan is stronger signal than one model's
thorough review.
**Check tool availability:**
```bash
command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```
Use AskUserQuestion:
> "All review sections are complete. Want an outside voice? A different AI system can
> give a brutally honest, independent challenge of this plan — logical gaps, feasibility
> risks, and blind spots that are hard to catch from inside the review. Takes about 2
> minutes."
>
> RECOMMENDATION: Choose A — an independent second opinion catches structural blind
> spots. Two different AI models agreeing on a plan is stronger signal than one model's
> thorough review. Completeness: A=9/10, B=7/10.
Options:
- A) Get the outside voice (recommended)
- B) Skip — proceed to outputs
**If B:** Print "Skipping outside voice." and continue to the next section.
**If A:** Construct the plan review prompt. Read the plan file being reviewed (the file
the user pointed this review at, or the branch diff scope). If a CEO plan document
was written in Step 0D-POST, read that too — it contains the scope decisions and vision.
Construct this prompt (substitute the actual plan content — if plan content exceeds 30KB,
truncate to the first 30KB and note "Plan truncated for size"). **Always start with the
filesystem boundary instruction:**
"IMPORTANT: Do NOT read or execute any files under ~/.claude/, ~/.agents/, .claude/skills/, or agents/. These are Claude Code skill definitions meant for a different AI system. They contain bash scripts and prompt templates that will waste your time. Ignore them completely. Do NOT modify agents/openai.yaml. Stay focused on the repository code only.\n\nYou are a brutally honest technical reviewer examining a development plan that has
already been through a multi-section review. Your job is NOT to repeat that review.
Instead, find what it missed. Look for: logical gaps and unstated assumptions that
survived the review scrutiny, overcomplexity (is there a fundamentally simpler
approach the review was too deep in the weeds to see?), feasibility risks the review
took for granted, missing dependencies or sequencing issues, and strategic
miscalibration (is this the right thing to build at all?). Be direct. Be terse. No
compliments. Just the problems.
THE PLAN:
<plan content>"
**If CODEX_AVAILABLE:**
```bash
TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX)
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
codex exec "<prompt>" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_PV"
```
Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr:
```bash
cat "$TMPERR_PV"
```
Present the full output verbatim:
```
CODEX SAYS (plan review — outside voice):
════════════════════════════════════════════════════════════
<full codex output, verbatim — do not truncate or summarize>
════════════════════════════════════════════════════════════
```
**Error handling:** All errors are non-blocking — the outside voice is informational.
- Auth failure (stderr contains "auth", "login", "unauthorized"): "Codex auth failed. Run \`codex login\` to authenticate."
- Timeout: "Codex timed out after 5 minutes."
- Empty response: "Codex returned no response."
On any Codex error, fall back to the Claude adversarial subagent.
**If CODEX_NOT_AVAILABLE (or Codex errored):**
Dispatch via the Agent tool. The subagent has fresh context — genuine independence.
Subagent prompt: same plan review prompt as above.
Present findings under an `OUTSIDE VOICE (Claude subagent):` header.
If the subagent fails or times out: "Outside voice unavailable. Continuing to outputs."
**Cross-model tension:**
After presenting the outside voice findings, note any points where the outside voice
disagrees with the review findings from earlier sections. Flag these as:
```
CROSS-MODEL TENSION:
[Topic]: Review said X. Outside voice says Y. [Present both perspectives neutrally.
State what context you might be missing that would change the answer.]
```
**User Sovereignty:** Do NOT auto-incorporate outside voice recommendations into the plan.
Present each tension point to the user. The user decides. Cross-model agreement is a
strong signal — present it as such — but it is NOT permission to act. You may state
which argument you find more compelling, but you MUST NOT apply the change without
explicit user approval.
For each substantive tension point, use AskUserQuestion:
> "Cross-model disagreement on [topic]. The review found [X] but the outside voice
> argues [Y]. [One sentence on what context you might be missing.]"
>
> RECOMMENDATION: Choose [A or B] because [one-line reason explaining which argument
> is more compelling and why]. Completeness: A=X/10, B=Y/10.
Options:
- A) Accept the outside voice's recommendation (I'll apply this change)
- B) Keep the current approach (reject the outside voice)
- C) Investigate further before deciding
- D) Add to TODOS.md for later
Wait for the user's response. Do NOT default to accepting because you agree with the
outside voice. If the user chooses B, the current approach stands — do not re-argue.
If no tension points exist, note: "No cross-model tension — both reviewers agree."
**Persist the result:**
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-plan-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}'
```
Substitute: STATUS = "clean" if no findings, "issues_found" if findings exist.
SOURCE = "codex" if Codex ran, "claude" if subagent ran.
**Cleanup:** Run `rm -f "$TMPERR_PV"` after processing (if Codex was used).
---
### Outside Voice Integration Rule
Outside voice findings are INFORMATIONAL until the user explicitly approves each one.
Do NOT incorporate outside voice recommendations into the plan without presenting each
finding via AskUserQuestion and getting explicit approval. This applies even when you
agree with the outside voice. Cross-model consensus is a strong signal — present it as
such — but the user makes the decision.
## Post-Implementation Design Audit (if UI scope detected)
After implementation, run `/design-review` on the live site to catch visual issues that can only be evaluated with rendered output.
## CRITICAL RULE — How to ask questions
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where reasonable.
* For each option: effort, risk, and maintenance burden in one line.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference.
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Zero findings:** if a section has zero findings, state "No issues, moving on" and proceed. Otherwise, use AskUserQuestion for each finding — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan.
## Required Outputs
### "NOT in scope" section
List work considered and explicitly deferred, with one-line rationale each.
### "What already exists" section
List existing code/flows that partially solve sub-problems and whether the plan reuses them.
### "Dream state delta" section
Where this plan leaves us relative to the 12-month ideal.
### Error & Rescue Registry (from Section 2)
Complete table of every method that can fail, every exception class, rescued status, rescue action, user impact.
### Failure Modes Registry
```
CODEPATH | FAILURE MODE | RESCUED? | TEST? | USER SEES? | LOGGED?
---------|----------------|----------|-------|----------------|--------
```
Any row with RESCUED=N, TEST=N, USER SEES=Silent → **CRITICAL GAP**.
### TODOS.md updates
Present each potential TODO as its own individual AskUserQuestion. Never batch TODOs — one per question. Never silently skip this step. Follow the format in `.claude/skills/review/TODOS-format.md`.
For each TODO, describe:
* **What:** One-line description of the work.
* **Why:** The concrete problem it solves or value it unlocks.
* **Pros:** What you gain by doing this work.
* **Cons:** Cost, complexity, or risks of doing it.
* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L
* **Priority:** P1/P2/P3
* **Depends on / blocked by:** Any prerequisites or ordering constraints.
Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring.
### Scope Expansion Decisions (EXPANSION and SELECTIVE EXPANSION only)
For EXPANSION and SELECTIVE EXPANSION modes: expansion opportunities and delight items were surfaced and decided in Step 0D (opt-in/cherry-pick ceremony). The decisions are persisted in the CEO plan document. Reference the CEO plan for the full record. Do not re-surface them here — list the accepted expansions for completeness:
* Accepted: {list items added to scope}
* Deferred: {list items sent to TODOS.md}
* Skipped: {list items rejected}
### Diagrams (mandatory, produce all that apply)
1. System architecture
2. Data flow (including shadow paths)
3. State machine
4. Error flow
5. Deployment sequence
6. Rollback flowchart
### Stale Diagram Audit
List every ASCII diagram in files this plan touches. Still accurate?
## Implementation Tasks
Before closing this review, synthesize the findings above into a flat list of
build-actionable tasks. Each task derives from a specific finding — no padding.
Emit the markdown section AND write a JSONL artifact that `/autoplan` can
aggregate across phases.
### Markdown section (always emit)
```markdown
## Implementation Tasks
Synthesized from this review's findings. Each task derives from a specific
finding above. Run with Claude Code or Codex; checkbox as you ship.
- [ ] **T1 (P1, human: ~2h / CC: ~15min)** — <component> — <imperative title>
- Surfaced by: <section name> — <specific finding text or line reference>
- Files: <paths to touch>
- Verify: <test command or manual check>
- [ ] **T2 (P2, human: ~30min / CC: ~5min)** — ...
```
Rules:
- P1 blocks ship; P2 should land same branch; P3 is a follow-up TODO.
- If a finding produced no actionable task, do not invent one.
- If a section had zero findings, emit `_No new tasks from <section>._`
- Effort uses the AI-compression table from CLAUDE.md.
### JSONL artifact (always write, even if zero tasks)
`/autoplan` reads this file to aggregate across phases. Build each line with
`jq -nc` so titles and source findings containing quotes, newlines, or
backslashes serialize cleanly — never use hand-rolled `echo` / `printf`.
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
TASKS_DIR="${HOME}/.gstack/projects/${SLUG:-unknown}"
mkdir -p "$TASKS_DIR"
TASKS_FILE="$TASKS_DIR/tasks-ceo-review-$(date +%Y%m%d-%H%M%S).jsonl"
COMMIT=$(git rev-parse HEAD 2>/dev/null || echo unknown)
BRANCH=$(git branch --show-current 2>/dev/null || echo unknown)
RUN_ID="$(date -u +%Y%m%dT%H%M%SZ)-$$"
# Repeat ONE jq invocation per task identified during this review.
# Substitute the placeholders inline with shell variables you set per task:
# TASK_ID (T1, T2, ...), PRIORITY (P1/P2/P3), COMPONENT, TITLE,
# SOURCE_FINDING, EFFORT_HUMAN, EFFORT_CC, FILES_JSON (a JSON array literal
# like '["browse/src/sanitize.ts","browse/src/server.ts"]').
jq -nc \
--arg phase 'ceo-review' \
--arg run_id "$RUN_ID" \
--arg branch "$BRANCH" \
--arg commit "$COMMIT" \
--arg id "$TASK_ID" \
--arg priority "$PRIORITY" \
--arg component "$COMPONENT" \
--arg effort_human "$EFFORT_HUMAN" \
--arg effort_cc "$EFFORT_CC" \
--arg title "$TITLE" \
--arg source_finding "$SOURCE_FINDING" \
--argjson files "$FILES_JSON" \
'{phase:$phase, run_id:$run_id, branch:$branch, commit:$commit, id:$id, priority:$priority, component:$component, files:$files, effort_human:$effort_human, effort_cc:$effort_cc, title:$title, source_finding:$source_finding}' \
>> "$TASKS_FILE"
```
If `jq` is not installed, fall back to skipping the JSONL write and warn
the user to install jq for autoplan aggregation. Never hand-roll JSONL.
If zero tasks were identified in this review, still touch the JSONL file
(`: > "$TASKS_FILE"`) so the aggregator sees that the phase produced output
this run (an empty file means "ran, no findings" — distinct from "didn't run").
### Completion Summary
```
+====================================================================+
| MEGA PLAN REVIEW — COMPLETION SUMMARY |
+====================================================================+
| Mode selected | EXPANSION / SELECTIVE / HOLD / REDUCTION |
| System Audit | [key findings] |
| Step 0 | [mode + key decisions] |
| Section 1 (Arch) | ___ issues found |
| Section 2 (Errors) | ___ error paths mapped, ___ GAPS |
| Section 3 (Security)| ___ issues found, ___ High severity |
| Section 4 (Data/UX) | ___ edge cases mapped, ___ unhandled |
| Section 5 (Quality) | ___ issues found |
| Section 6 (Tests) | Diagram produced, ___ gaps |
| Section 7 (Perf) | ___ issues found |
| Section 8 (Observ) | ___ gaps found |
| Section 9 (Deploy) | ___ risks flagged |
| Section 10 (Future) | Reversibility: _/5, debt items: ___ |
| Section 11 (Design) | ___ issues / SKIPPED (no UI scope) |
+--------------------------------------------------------------------+
| NOT in scope | written (___ items) |
| What already exists | written |
| Dream state delta | written |
| Error/rescue registry| ___ methods, ___ CRITICAL GAPS |
| Failure modes | ___ total, ___ CRITICAL GAPS |
| TODOS.md updates | ___ items proposed |
| Scope proposals | ___ proposed, ___ accepted (EXP + SEL) |
| CEO plan | written / skipped (HOLD/REDUCTION) |
| Outside voice | ran (codex/claude) / skipped |
| Lake Score | X/Y recommendations chose complete option |
| Diagrams produced | ___ (list types) |
| Stale diagrams found | ___ |
| Unresolved decisions | ___ (listed below) |
+====================================================================+
```
### Unresolved Decisions
If any AskUserQuestion goes unanswered, note it here. Never silently default.
## Handoff Note Cleanup
After producing the Completion Summary, clean up any handoff notes for this branch —
the review is complete and the context is no longer needed.
```bash
setopt +o nomatch 2>/dev/null || true # zsh compat
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
rm -f ~/.gstack/projects/$SLUG/*-$BRANCH-ceo-handoff-*.md 2>/dev/null || true
```
## Review Log
After producing the Completion Summary above, persist the review result.
**PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes review metadata to
`~/.gstack/` (user config directory, not project files). The skill preamble
already writes to `~/.gstack/sessions/` and `~/.gstack/analytics/` — this is
the same pattern. The review dashboard depends on this data. Skipping this
command breaks the review readiness dashboard in /ship.
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","scope_proposed":N,"scope_accepted":N,"scope_deferred":N,"commit":"COMMIT"}'
```
Before running this command, substitute the placeholder values from the Completion Summary you just produced:
- **TIMESTAMP**: current ISO 8601 datetime (e.g., 2026-03-16T14:30:00)
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" in the summary
- **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
- **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
- **scope_proposed**: number from "Scope proposals: ___ proposed" in the summary (0 for HOLD/REDUCTION)
- **scope_accepted**: number from "Scope proposals: ___ accepted" in the summary (0 for HOLD/REDUCTION)
- **scope_deferred**: number of items deferred to TODOS.md from scope decisions (0 for HOLD/REDUCTION)
- **COMMIT**: output of `git rev-parse --short HEAD`
## Review Readiness Dashboard
After completing the review, read the review log and config to display the dashboard.
```bash
~/.claude/skills/gstack/bin/gstack-review-read
```
Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Eng Review row, show whichever is more recent between `review` (diff-scoped pre-landing review) and `plan-eng-review` (plan-stage architecture review). Append "(DIFF)" or "(PLAN)" to the status to distinguish. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. For the Outside Voice row, show the most recent `codex-plan-review` entry — this captures outside voices from both /plan-ceo-review and /plan-eng-review.
**Source attribution:** If the most recent entry for a skill has a \`"via"\` field, append it to the status label in parentheses. Examples: `plan-eng-review` with `via:"autoplan"` shows as "CLEAR (PLAN via /autoplan)". `review` with `via:"ship"` shows as "CLEAR (DIFF via /ship)". Entries without a `via` field show as "CLEAR (PLAN)" or "CLEAR (DIFF)" as before.
Note: `autoplan-voices` and `design-outside-voices` entries are audit-trail-only (forensic data for cross-model consensus analysis). They do not appear in the dashboard and are not checked by any consumer.
Display:
```
+====================================================================+
| REVIEW READINESS DASHBOARD |
+====================================================================+
| Review | Runs | Last Run | Status | Required |
|-----------------|------|---------------------|-----------|----------|
| Eng Review | 1 | 2026-03-16 15:00 | CLEAR | YES |
| CEO Review | 0 | — | — | no |
| Design Review | 0 | — | — | no |
| Adversarial | 0 | — | — | no |
| Outside Voice | 0 | — | — | no |
+--------------------------------------------------------------------+
| VERDICT: CLEARED — Eng Review passed |
+====================================================================+
```
**Review tiers:**
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed.
- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping.
**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days from either \`review\` or \`plan-eng-review\` with status "clean" (or \`skip_eng_review\` is \`true\`)
- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
- CEO, Design, and Codex reviews are shown for context but never block shipping
- If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED
**Staleness detection:** After displaying the dashboard, check if any existing reviews may be stale:
- Parse the \`---HEAD---\` section from the bash output to get the current HEAD commit hash
- For each review entry that has a \`commit\` field: compare it against the current HEAD. If different, count elapsed commits: \`git rev-list --count STORED_COMMIT..HEAD\`. Display: "Note: {skill} review from {date} may be stale — {N} commits since review"
- For entries without a \`commit\` field (legacy entries): display "Note: {skill} review from {date} has no commit tracking — consider re-running for accurate staleness detection"
- If all reviews match the current HEAD, do not display any staleness notes
## Plan File Review Report
After displaying the Review Readiness Dashboard in conversation output, also update the
**plan file** itself so review status is visible to anyone reading the plan.
### Detect the plan file
1. Check if there is an active plan file in this conversation (the host provides plan file
paths in system messages — look for plan file references in the conversation context).
2. If not found, skip this section silently — not every review runs in plan mode.
### Generate the report
Read the review log output you already have from the Review Readiness Dashboard step above.
Parse each JSONL entry. Each skill logs different fields:
- **plan-ceo-review**: \`status\`, \`unresolved\`, \`critical_gaps\`, \`mode\`, \`scope_proposed\`, \`scope_accepted\`, \`scope_deferred\`, \`commit\`
→ Findings: "{scope_proposed} proposals, {scope_accepted} accepted, {scope_deferred} deferred"
→ If scope fields are 0 or missing (HOLD/REDUCTION mode): "mode: {mode}, {critical_gaps} critical gaps"
- **plan-eng-review**: \`status\`, \`unresolved\`, \`critical_gaps\`, \`issues_found\`, \`mode\`, \`commit\`
→ Findings: "{issues_found} issues, {critical_gaps} critical gaps"
- **plan-design-review**: \`status\`, \`initial_score\`, \`overall_score\`, \`unresolved\`, \`decisions_made\`, \`commit\`
→ Findings: "score: {initial_score}/10 → {overall_score}/10, {decisions_made} decisions"
- **plan-devex-review**: \`status\`, \`initial_score\`, \`overall_score\`, \`product_type\`, \`tthw_current\`, \`tthw_target\`, \`mode\`, \`persona\`, \`competitive_tier\`, \`unresolved\`, \`commit\`
→ Findings: "score: {initial_score}/10 → {overall_score}/10, TTHW: {tthw_current} → {tthw_target}"
- **devex-review**: \`status\`, \`overall_score\`, \`product_type\`, \`tthw_measured\`, \`dimensions_tested\`, \`dimensions_inferred\`, \`boomerang\`, \`commit\`
→ Findings: "score: {overall_score}/10, TTHW: {tthw_measured}, {dimensions_tested} tested/{dimensions_inferred} inferred"
- **codex-review**: \`status\`, \`gate\`, \`findings\`, \`findings_fixed\`
→ Findings: "{findings} findings, {findings_fixed}/{findings} fixed"
All fields needed for the Findings column are now present in the JSONL entries.
For the review you just completed, you may use richer details from your own Completion
Summary. For prior reviews, use the JSONL fields directly — they contain all required data.
Produce this markdown table:
\`\`\`markdown
## GSTACK REVIEW REPORT
| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| CEO Review | \`/plan-ceo-review\` | Scope & strategy | {runs} | {status} | {findings} |
| Codex Review | \`/codex review\` | Independent 2nd opinion | {runs} | {status} | {findings} |
| Eng Review | \`/plan-eng-review\` | Architecture & tests (required) | {runs} | {status} | {findings} |
| Design Review | \`/plan-design-review\` | UI/UX gaps | {runs} | {status} | {findings} |
| DX Review | \`/plan-devex-review\` | Developer experience gaps | {runs} | {status} | {findings} |
\`\`\`
Below the table, add these lines (omit any that are empty/not applicable):
- **CODEX:** (only if codex-review ran) — one-line summary of codex fixes
- **CROSS-MODEL:** (only if both Claude and Codex reviews exist) — overlap analysis
- **UNRESOLVED:** total unresolved decisions across all reviews
- **VERDICT:** list reviews that are CLEAR (e.g., "CEO + ENG CLEARED — ready to implement").
If Eng Review is not CLEAR and not skipped globally, append "eng review required".
### Write to the plan file
**PLAN MODE EXCEPTION — ALWAYS RUN:** This writes to the plan file, which is the one
file you are allowed to edit in plan mode. The plan file review report is part of the
plan's living status.
The report must always be the LAST section of the plan file — never mid-file.
Use a single delete-then-append flow:
1. Read the plan file (Read tool) to see its full current content. Search the read
output for a \`## GSTACK REVIEW REPORT\` heading anywhere in the file.
2. If found, use the Edit tool to DELETE the entire existing section. Match from
\`## GSTACK REVIEW REPORT\` through either the next \`## \` heading or end of
file, whichever comes first. Replace with the empty string. This applies
regardless of where the section currently lives — mid-file deletion is
intentional, not a special case. If the Edit fails (e.g., concurrent edit
changed the content), re-read the plan file and retry once.
3. After the delete (or skipped, if no section existed), append the new
\`## GSTACK REVIEW REPORT\` section at the END of the file. Use the Edit
tool to match the file's current last paragraph and add the section after it,
or use Write to re-emit the whole file with the section at the end.
4. Verify with the Read tool that \`## GSTACK REVIEW REPORT\` is the last
\`## \` heading in the file before continuing. If it isn't, repeat steps
2-3 once.
Do NOT replace the section in place. The "replace mid-file" path is what allowed
prior versions to leave the report mid-file when an older report already lived
there — the user then sees a plan whose review report is not at the bottom and
(correctly) rejects it.
## Next Steps — Review Chaining
After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this CEO review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this CEO review expanded scope, changed architectural direction, or accepted scope expansions, emphasize that a fresh eng review is needed. If an eng review already exists in the dashboard but the commit hash shows it predates this CEO review, note that it may be stale and should be re-run.
**Recommend /plan-design-review if UI scope was detected** — specifically if Section 11 (Design & UX Review) was NOT skipped, or if accepted scope expansions included UI-facing features. If an existing design review is stale (commit hash drift), note that. In SCOPE REDUCTION mode, skip this recommendation — design review is unlikely relevant for scope cuts.
**If both are needed, recommend eng review first** (required gate), then design review.
Use AskUserQuestion to present the next step. Include only applicable options:
- **A)** Run /plan-eng-review next (required gate)
- **B)** Run /plan-design-review next (only if UI scope detected)
- **C)** Skip — I'll handle reviews manually
## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)
At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
"The vision from this review produced {N} accepted scope expansions. Want to promote it to a design doc in the repo?"
- **A)** Promote to `docs/designs/{FEATURE}.md` (committed to repo, visible to the team)
- **B)** Keep in `~/.gstack/projects/` only (local, personal reference)
- **C)** Skip
If promoted, copy the CEO plan content to `docs/designs/{FEATURE}.md` (create the directory if needed) and update the `status` field in the original CEO plan from `ACTIVE` to `PROMOTED`.
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option.
* After each section, pause and wait for feedback.
* Use **CRITICAL GAP** / **WARNING** / **OK** for scannability.
## Capture Learnings
If you discovered a non-obvious pattern, pitfall, or architectural insight during
this session, log it for future sessions:
```bash
~/.claude/skills/gstack/bin/gstack-learnings-log '{"skill":"plan-ceo-review","type":"TYPE","key":"SHORT_KEY","insight":"DESCRIPTION","confidence":N,"source":"SOURCE","files":["path/to/relevant/file"]}'
```
**Types:** `pattern` (reusable approach), `pitfall` (what NOT to do), `preference`
(user stated), `architecture` (structural decision), `tool` (library/framework insight),
`operational` (project environment/CLI/workflow knowledge).
**Sources:** `observed` (you found this in the code), `user-stated` (user told you),
`inferred` (AI deduction), `cross-model` (both Claude and Codex agree).
**Confidence:** 1-10. Be honest. An observed pattern you verified in the code is 8-9.
An inference you're not sure about is 4-5. A user preference they explicitly stated is 10.
**files:** Include the specific file paths this learning references. This enables
staleness detection: if those files are later deleted, the learning can be flagged.
**Only log genuine discoveries.** Don't log obvious things. Don't log things the user
already knows. A good test: would this insight save time in a future session? If yes, log it.
## Brain Calibration Write-Back (Phase 2 / gated)
When the skill makes a typed prediction worth tracking (scope decision,
TTHW target, architectural bet, wedge commitment), it MAY write a
`kind=bet` take to the brain so a calibration profile builds over time.
**Gated on two things:**
1. Brain trust policy for the active endpoint is `personal` (check via
`~/.claude/skills/gstack/bin/gstack-config get brain_trust_policy@<endpoint-hash>`).
Shared brains skip write-back to avoid polluting team calibration.
2. Feature flag `BRAIN_CALIBRATION_WRITEBACK` is set (today: false; flips
to true when upstream gbrain v0.42+ ships `takes_add` MCP op).
When both gates pass, the write-back path uses `mcp__gbrain__takes_add`
to record a take with weight 0.8 (per SKILL_CALIBRATION_WEIGHTS).
If the MCP op is unavailable, fall back to `mcp__gbrain__put_page` with
a gstack:takes fence block (documented but uglier path).
Mandatory take frontmatter shape:
```yaml
kind: bet
holder: <user identity from whoami>
claim: <one-line prediction the skill is making>
weight: 0.8
since_date: <today's date>
expected_resolution: <date in 1-3 months depending on skill>
source_skill: plan-ceo-review
```
After write, invalidate the affected digests so the next preflight reflects
the new state:
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" 2>/dev/null || true
~/.claude/skills/gstack/bin/gstack-brain-cache invalidate product --project "$SLUG" 2>/dev/null || true
~/.claude/skills/gstack/bin/gstack-brain-cache invalidate goals --project "$SLUG" 2>/dev/null || true
~/.claude/skills/gstack/bin/gstack-brain-cache invalidate competitive-intel --project "$SLUG" 2>/dev/null || true
```
## Brain Cache Background Refresh
After the skill's work completes (and telemetry has logged), kick a
background refresh of any cache digest that's getting close to its TTL.
This is non-blocking — the user doesn't wait. Next invocation benefits
from the warm cache.
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" 2>/dev/null || true
(~/.claude/skills/gstack/bin/gstack-brain-cache refresh --project "$SLUG" 2>/dev/null &) || true
```
## Mode Quick Reference
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ MODE COMPARISON │
├─────────────┬──────────────┬──────────────┬──────────────┬────────────────────┤
│ │ EXPANSION │ SELECTIVE │ HOLD SCOPE │ REDUCTION │
├─────────────┼──────────────┼──────────────┼──────────────┼────────────────────┤
│ Scope │ Push UP │ Hold + offer │ Maintain │ Push DOWN │
│ │ (opt-in) │ │ │ │
│ Recommend │ Enthusiastic │ Neutral │ N/A │ N/A │
│ posture │ │ │ │ │
│ 10x check │ Mandatory │ Surface as │ Optional │ Skip │
│ │ │ cherry-pick │ │ │
│ Platonic │ Yes │ No │ No │ No │
│ ideal │ │ │ │ │
│ Delight │ Opt-in │ Cherry-pick │ Note if seen │ Skip │
│ opps │ ceremony │ ceremony │ │ │
│ Complexity │ "Is it big │ "Is it right │ "Is it too │ "Is it the bare │
│ question │ enough?" │ + what else │ complex?" │ minimum?" │
│ │ │ is tempting"│ │ │
│ Taste │ Yes │ Yes │ No │ No │
│ calibration │ │ │ │ │
│ Temporal │ Full (hr 1-6)│ Full (hr 1-6)│ Key decisions│ Skip │
│ interrogate │ │ │ only │ │
│ Observ. │ "Joy to │ "Joy to │ "Can we │ "Can we see if │
│ standard │ operate" │ operate" │ debug it?" │ it's broken?" │
│ Deploy │ Infra as │ Safe deploy │ Safe deploy │ Simplest possible │
│ standard │ feature scope│ + cherry-pick│ + rollback │ deploy │
│ │ │ risk check │ │ │
│ Error map │ Full + chaos │ Full + chaos │ Full │ Critical paths │
│ │ scenarios │ for accepted │ │ only │
│ CEO plan │ Written │ Written │ Skipped │ Skipped │
│ Phase 2/3 │ Map accepted │ Map accepted │ Note it │ Skip │
│ planning │ │ cherry-picks │ │ │
│ Design │ "Inevitable" │ If UI scope │ If UI scope │ Skip │
│ (Sec 11) │ UI review │ detected │ detected │ │
└─────────────┴──────────────┴──────────────┴──────────────┴────────────────────┘
```
@@ -0,0 +1,489 @@
## Review Sections (11 sections, after scope and mode are agreed)
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-11) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
{{ANTI_SHORTCUT_CLAUSE}}
### Section 1: Architecture Review
Evaluate and diagram:
* Overall system design and component boundaries. Draw the dependency graph.
* Data flow — all four paths. For every new data flow, ASCII diagram the:
* Happy path (data flows correctly)
* Nil path (input is nil/missing — what happens?)
* Empty path (input is present but empty/zero-length — what happens?)
* Error path (upstream call fails — what happens?)
* State machines. ASCII diagram for every new stateful object. Include impossible/invalid transitions and what prevents them.
* Coupling concerns. Which components are now coupled that weren't before? Is that coupling justified? Draw the before/after dependency graph.
* Scaling characteristics. What breaks first under 10x load? Under 100x?
* Single points of failure. Map them.
* Security architecture. Auth boundaries, data access patterns, API surfaces. For each new endpoint or data mutation: who can call it, what do they get, what can they change?
* Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it.
* Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"?
* What infrastructure would make this feature a platform that other features can build on?
**SELECTIVE EXPANSION:** If any accepted cherry-picks from Step 0D affect the architecture, evaluate their architectural fit here. Flag any that create coupling concerns or don't integrate cleanly — this is a chance to revisit the decision with new information.
Required ASCII diagram: full system architecture showing new components and their relationships to existing ones.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 2: Error & Rescue Map
This is the section that catches silent failures. It is not optional.
For every new method, service, or codepath that can fail, fill in this table:
```
METHOD/CODEPATH | WHAT CAN GO WRONG | EXCEPTION CLASS
-------------------------|-----------------------------|-----------------
ExampleService#call | API timeout | TimeoutError
| API returns 429 | RateLimitError
| API returns malformed JSON | JSONParseError
| DB connection pool exhausted| ConnectionPoolExhausted
| Record not found | RecordNotFound
-------------------------|-----------------------------|-----------------
EXCEPTION CLASS | RESCUED? | RESCUE ACTION | USER SEES
-----------------------------|-----------|------------------------|------------------
TimeoutError | Y | Retry 2x, then raise | "Service temporarily unavailable"
RateLimitError | Y | Backoff + retry | Nothing (transparent)
JSONParseError | N ← GAP | — | 500 error ← BAD
ConnectionPoolExhausted | N ← GAP | — | 500 error ← BAD
RecordNotFound | Y | Return nil, log warning | "Not found" message
```
Rules for this section:
* Catch-all error handling (`rescue StandardError`, `catch (Exception e)`, `except Exception`) is ALWAYS a smell. Name the specific exceptions.
* Catching an error with only a generic log message is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
* Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
* For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
* For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 3: Security & Threat Model
Security is not a sub-bullet of architecture. It gets its own section.
Evaluate:
* Attack surface expansion. What new attack vectors does this plan introduce? New endpoints, new params, new file paths, new background jobs?
* Input validation. For every new user input: is it validated, sanitized, and rejected loudly on failure? What happens with: nil, empty string, string when integer expected, string exceeding max length, unicode edge cases, HTML/script injection attempts?
* Authorization. For every new data access: is it scoped to the right user/role? Is there a direct object reference vulnerability? Can user A access user B's data by manipulating IDs?
* Secrets and credentials. New secrets? In env vars, not hardcoded? Rotatable?
* Dependency risk. New gems/npm packages? Security track record?
* Data classification. PII, payment data, credentials? Handling consistent with existing patterns?
* Injection vectors. SQL, command, template, LLM prompt injection — check all.
* Audit logging. For sensitive operations: is there an audit trail?
For each finding: threat, likelihood (High/Med/Low), impact (High/Med/Low), and whether the plan mitigates it.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 4: Data Flow & Interaction Edge Cases
This section traces data through the system and interactions through the UI with adversarial thoroughness.
**Data Flow Tracing:** For every new data flow, produce an ASCII diagram showing:
```
INPUT ──▶ VALIDATION ──▶ TRANSFORM ──▶ PERSIST ──▶ OUTPUT
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
[nil?] [invalid?] [exception?] [conflict?] [stale?]
[empty?] [too long?] [timeout?] [dup key?] [partial?]
[wrong [wrong type?] [OOM?] [locked?] [encoding?]
type?]
```
For each node: what happens on each shadow path? Is it tested?
**Interaction Edge Cases:** For every new user-visible interaction, evaluate:
```
INTERACTION | EDGE CASE | HANDLED? | HOW?
---------------------|------------------------|----------|--------
Form submission | Double-click submit | ? |
| Submit with stale CSRF | ? |
| Submit during deploy | ? |
Async operation | User navigates away | ? |
| Operation times out | ? |
| Retry while in-flight | ? |
List/table view | Zero results | ? |
| 10,000 results | ? |
| Results change mid-page| ? |
Background job | Job fails after 3 of | ? |
| 10 items processed | |
| Job runs twice (dup) | ? |
| Queue backs up 2 hours | ? |
```
Flag any unhandled edge case as a gap. For each gap, specify the fix.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 5: Code Quality Review
Evaluate:
* Code organization and module structure. Does new code fit existing patterns? If it deviates, is there a reason?
* DRY violations. Be aggressive. If the same logic exists elsewhere, flag it and reference the file and line.
* Naming quality. Are new classes, methods, and variables named for what they do, not how they do it?
* Error handling patterns. (Cross-reference with Section 2 — this section reviews the patterns; Section 2 maps the specifics.)
* Missing edge cases. List explicitly: "What happens when X is nil?" "When the API returns 429?" etc.
* Over-engineering check. Any new abstraction solving a problem that doesn't exist yet?
* Under-engineering check. Anything fragile, assuming happy path only, or missing obvious defensive checks?
* Cyclomatic complexity. Flag any new method that branches more than 5 times. Propose a refactor.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 6: Test Review
Make a complete diagram of every new thing this plan introduces:
```
NEW UX FLOWS:
[list each new user-visible interaction]
NEW DATA FLOWS:
[list each new path data takes through the system]
NEW CODEPATHS:
[list each new branch, condition, or execution path]
NEW BACKGROUND JOBS / ASYNC WORK:
[list each]
NEW INTEGRATIONS / EXTERNAL CALLS:
[list each]
NEW ERROR/RESCUE PATHS:
[list each — cross-reference Section 2]
```
For each item in the diagram:
* What type of test covers it? (Unit / Integration / System / E2E)
* Does a test for it exist in the plan? If not, write the test spec header.
* What is the happy path test?
* What is the failure path test? (Be specific — which failure?)
* What is the edge case test? (nil, empty, boundary values, concurrent access)
Test ambition check (all modes): For each new feature, answer:
* What's the test that would make you confident shipping at 2am on a Friday?
* What's the test a hostile QA engineer would write to break this?
* What's the chaos test?
Test pyramid check: Many unit, fewer integration, few E2E? Or inverted?
Flakiness risk: Flag any test depending on time, randomness, external services, or ordering.
Load/stress test requirements: For any new codepath called frequently or processing significant data.
For LLM/prompt changes: Check CLAUDE.md for the "Prompt/LLM changes" file patterns. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 7: Performance Review
Evaluate:
* N+1 queries. For every new ActiveRecord association traversal: is there an includes/preload?
* Memory usage. For every new data structure: what's the maximum size in production?
* Database indexes. For every new query: is there an index?
* Caching opportunities. For every expensive computation or external call: should it be cached?
* Background job sizing. For every new job: worst-case payload, runtime, retry behavior?
* Slow paths. Top 3 slowest new codepaths and estimated p99 latency.
* Connection pool pressure. New DB connections, Redis connections, HTTP connections?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 8: Observability & Debuggability Review
New systems break. This section ensures you can see why.
Evaluate:
* Logging. For every new codepath: structured log lines at entry, exit, and each significant branch?
* Metrics. For every new feature: what metric tells you it's working? What tells you it's broken?
* Tracing. For new cross-service or cross-job flows: trace IDs propagated?
* Alerting. What new alerts should exist?
* Dashboards. What new dashboard panels do you want on day 1?
* Debuggability. If a bug is reported 3 weeks post-ship, can you reconstruct what happened from logs alone?
* Admin tooling. New operational tasks that need admin UI or rake tasks?
* Runbooks. For each new failure mode: what's the operational response?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What observability would make this feature a joy to operate? (For SELECTIVE EXPANSION, include observability for any accepted cherry-picks.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 9: Deployment & Rollout Review
Evaluate:
* Migration safety. For every new DB migration: backward-compatible? Zero-downtime? Table locks?
* Feature flags. Should any part be behind a feature flag?
* Rollout order. Correct sequence: migrate first, deploy second?
* Rollback plan. Explicit step-by-step.
* Deploy-time risk window. Old code and new code running simultaneously — what breaks?
* Environment parity. Tested in staging?
* Post-deploy verification checklist. First 5 minutes? First hour?
* Smoke tests. What automated checks should run immediately post-deploy?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What deploy infrastructure would make shipping this feature routine? (For SELECTIVE EXPANSION, assess whether accepted cherry-picks change the deployment risk profile.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 10: Long-Term Trajectory Review
Evaluate:
* Technical debt introduced. Code debt, operational debt, testing debt, documentation debt.
* Path dependency. Does this make future changes harder?
* Knowledge concentration. Documentation sufficient for a new engineer?
* Reversibility. Rate 1-5: 1 = one-way door, 5 = easily reversible.
* Ecosystem fit. Aligns with Rails/JS ecosystem direction?
* The 1-year question. Read this plan as a new engineer in 12 months — obvious?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory?
* Platform potential. Does this create capabilities other features can leverage?
* (SELECTIVE EXPANSION only) Retrospective: Were the right cherry-picks accepted? Did any rejected expansions turn out to be load-bearing for the accepted ones?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 11: Design & UX Review (skip if no UI scope detected)
The CEO calling in the designer. Not a pixel-level audit — that's /plan-design-review and /design-review. This is ensuring the plan has design intentionality.
Evaluate:
* Information architecture — what does the user see first, second, third?
* Interaction state coverage map:
FEATURE | LOADING | EMPTY | ERROR | SUCCESS | PARTIAL
* User journey coherence — storyboard the emotional arc
* AI slop risk — does the plan describe generic UI patterns?
* DESIGN.md alignment — does the plan match the stated design system?
* Responsive intention — is mobile mentioned or afterthought?
* Accessibility basics — keyboard nav, screen readers, contrast, touch targets
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this UI feel *inevitable*?
* What 30-minute UI touches would make users think "oh nice, they thought of that"?
Required ASCII diagram: user flow showing screens/states and transitions.
If this plan has significant UI scope, recommend: "Consider running /plan-design-review for a deep design review of this plan before implementation."
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
{{CODEX_PLAN_REVIEW}}
### Outside Voice Integration Rule
Outside voice findings are INFORMATIONAL until the user explicitly approves each one.
Do NOT incorporate outside voice recommendations into the plan without presenting each
finding via AskUserQuestion and getting explicit approval. This applies even when you
agree with the outside voice. Cross-model consensus is a strong signal — present it as
such — but the user makes the decision.
## Post-Implementation Design Audit (if UI scope detected)
After implementation, run `/design-review` on the live site to catch visual issues that can only be evaluated with rendered output.
## CRITICAL RULE — How to ask questions
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where reasonable.
* For each option: effort, risk, and maintenance burden in one line.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference.
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Zero findings:** if a section has zero findings, state "No issues, moving on" and proceed. Otherwise, use AskUserQuestion for each finding — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan.
## Required Outputs
### "NOT in scope" section
List work considered and explicitly deferred, with one-line rationale each.
### "What already exists" section
List existing code/flows that partially solve sub-problems and whether the plan reuses them.
### "Dream state delta" section
Where this plan leaves us relative to the 12-month ideal.
### Error & Rescue Registry (from Section 2)
Complete table of every method that can fail, every exception class, rescued status, rescue action, user impact.
### Failure Modes Registry
```
CODEPATH | FAILURE MODE | RESCUED? | TEST? | USER SEES? | LOGGED?
---------|----------------|----------|-------|----------------|--------
```
Any row with RESCUED=N, TEST=N, USER SEES=Silent → **CRITICAL GAP**.
### TODOS.md updates
Present each potential TODO as its own individual AskUserQuestion. Never batch TODOs — one per question. Never silently skip this step. Follow the format in `.claude/skills/review/TODOS-format.md`.
For each TODO, describe:
* **What:** One-line description of the work.
* **Why:** The concrete problem it solves or value it unlocks.
* **Pros:** What you gain by doing this work.
* **Cons:** Cost, complexity, or risks of doing it.
* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L
* **Priority:** P1/P2/P3
* **Depends on / blocked by:** Any prerequisites or ordering constraints.
Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring.
### Scope Expansion Decisions (EXPANSION and SELECTIVE EXPANSION only)
For EXPANSION and SELECTIVE EXPANSION modes: expansion opportunities and delight items were surfaced and decided in Step 0D (opt-in/cherry-pick ceremony). The decisions are persisted in the CEO plan document. Reference the CEO plan for the full record. Do not re-surface them here — list the accepted expansions for completeness:
* Accepted: {list items added to scope}
* Deferred: {list items sent to TODOS.md}
* Skipped: {list items rejected}
### Diagrams (mandatory, produce all that apply)
1. System architecture
2. Data flow (including shadow paths)
3. State machine
4. Error flow
5. Deployment sequence
6. Rollback flowchart
### Stale Diagram Audit
List every ASCII diagram in files this plan touches. Still accurate?
{{TASKS_SECTION_EMIT:ceo-review}}
### Completion Summary
```
+====================================================================+
| MEGA PLAN REVIEW — COMPLETION SUMMARY |
+====================================================================+
| Mode selected | EXPANSION / SELECTIVE / HOLD / REDUCTION |
| System Audit | [key findings] |
| Step 0 | [mode + key decisions] |
| Section 1 (Arch) | ___ issues found |
| Section 2 (Errors) | ___ error paths mapped, ___ GAPS |
| Section 3 (Security)| ___ issues found, ___ High severity |
| Section 4 (Data/UX) | ___ edge cases mapped, ___ unhandled |
| Section 5 (Quality) | ___ issues found |
| Section 6 (Tests) | Diagram produced, ___ gaps |
| Section 7 (Perf) | ___ issues found |
| Section 8 (Observ) | ___ gaps found |
| Section 9 (Deploy) | ___ risks flagged |
| Section 10 (Future) | Reversibility: _/5, debt items: ___ |
| Section 11 (Design) | ___ issues / SKIPPED (no UI scope) |
+--------------------------------------------------------------------+
| NOT in scope | written (___ items) |
| What already exists | written |
| Dream state delta | written |
| Error/rescue registry| ___ methods, ___ CRITICAL GAPS |
| Failure modes | ___ total, ___ CRITICAL GAPS |
| TODOS.md updates | ___ items proposed |
| Scope proposals | ___ proposed, ___ accepted (EXP + SEL) |
| CEO plan | written / skipped (HOLD/REDUCTION) |
| Outside voice | ran (codex/claude) / skipped |
| Lake Score | X/Y recommendations chose complete option |
| Diagrams produced | ___ (list types) |
| Stale diagrams found | ___ |
| Unresolved decisions | ___ (listed below) |
+====================================================================+
```
### Unresolved Decisions
If any AskUserQuestion goes unanswered, note it here. Never silently default.
## Handoff Note Cleanup
After producing the Completion Summary, clean up any handoff notes for this branch —
the review is complete and the context is no longer needed.
```bash
setopt +o nomatch 2>/dev/null || true # zsh compat
{{SLUG_EVAL}}
rm -f ~/.gstack/projects/$SLUG/*-$BRANCH-ceo-handoff-*.md 2>/dev/null || true
```
## Review Log
After producing the Completion Summary above, persist the review result.
**PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes review metadata to
`~/.gstack/` (user config directory, not project files). The skill preamble
already writes to `~/.gstack/sessions/` and `~/.gstack/analytics/` — this is
the same pattern. The review dashboard depends on this data. Skipping this
command breaks the review readiness dashboard in /ship.
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","scope_proposed":N,"scope_accepted":N,"scope_deferred":N,"commit":"COMMIT"}'
```
Before running this command, substitute the placeholder values from the Completion Summary you just produced:
- **TIMESTAMP**: current ISO 8601 datetime (e.g., 2026-03-16T14:30:00)
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" in the summary
- **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
- **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
- **scope_proposed**: number from "Scope proposals: ___ proposed" in the summary (0 for HOLD/REDUCTION)
- **scope_accepted**: number from "Scope proposals: ___ accepted" in the summary (0 for HOLD/REDUCTION)
- **scope_deferred**: number of items deferred to TODOS.md from scope decisions (0 for HOLD/REDUCTION)
- **COMMIT**: output of `git rev-parse --short HEAD`
{{REVIEW_DASHBOARD}}
{{PLAN_FILE_REVIEW_REPORT}}
## Next Steps — Review Chaining
After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this CEO review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this CEO review expanded scope, changed architectural direction, or accepted scope expansions, emphasize that a fresh eng review is needed. If an eng review already exists in the dashboard but the commit hash shows it predates this CEO review, note that it may be stale and should be re-run.
**Recommend /plan-design-review if UI scope was detected** — specifically if Section 11 (Design & UX Review) was NOT skipped, or if accepted scope expansions included UI-facing features. If an existing design review is stale (commit hash drift), note that. In SCOPE REDUCTION mode, skip this recommendation — design review is unlikely relevant for scope cuts.
**If both are needed, recommend eng review first** (required gate), then design review.
Use AskUserQuestion to present the next step. Include only applicable options:
- **A)** Run /plan-eng-review next (required gate)
- **B)** Run /plan-design-review next (only if UI scope detected)
- **C)** Skip — I'll handle reviews manually
## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)
At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
"The vision from this review produced {N} accepted scope expansions. Want to promote it to a design doc in the repo?"
- **A)** Promote to `docs/designs/{FEATURE}.md` (committed to repo, visible to the team)
- **B)** Keep in `~/.gstack/projects/` only (local, personal reference)
- **C)** Skip
If promoted, copy the CEO plan content to `docs/designs/{FEATURE}.md` (create the directory if needed) and update the `status` field in the original CEO plan from `ACTIVE` to `PROMOTED`.
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option.
* After each section, pause and wait for feedback.
* Use **CRITICAL GAP** / **WARNING** / **OK** for scannability.
{{LEARNINGS_LOG}}
{{GBRAIN_SAVE_RESULTS}}
{{BRAIN_WRITE_BACK}}
{{BRAIN_CACHE_REFRESH}}
## Mode Quick Reference
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ MODE COMPARISON │
├─────────────┬──────────────┬──────────────┬──────────────┬────────────────────┤
│ │ EXPANSION │ SELECTIVE │ HOLD SCOPE │ REDUCTION │
├─────────────┼──────────────┼──────────────┼──────────────┼────────────────────┤
│ Scope │ Push UP │ Hold + offer │ Maintain │ Push DOWN │
│ │ (opt-in) │ │ │ │
│ Recommend │ Enthusiastic │ Neutral │ N/A │ N/A │
│ posture │ │ │ │ │
│ 10x check │ Mandatory │ Surface as │ Optional │ Skip │
│ │ │ cherry-pick │ │ │
│ Platonic │ Yes │ No │ No │ No │
│ ideal │ │ │ │ │
│ Delight │ Opt-in │ Cherry-pick │ Note if seen │ Skip │
│ opps │ ceremony │ ceremony │ │ │
│ Complexity │ "Is it big │ "Is it right │ "Is it too │ "Is it the bare │
│ question │ enough?" │ + what else │ complex?" │ minimum?" │
│ │ │ is tempting"│ │ │
│ Taste │ Yes │ Yes │ No │ No │
│ calibration │ │ │ │ │
│ Temporal │ Full (hr 1-6)│ Full (hr 1-6)│ Key decisions│ Skip │
│ interrogate │ │ │ only │ │
│ Observ. │ "Joy to │ "Joy to │ "Can we │ "Can we see if │
│ standard │ operate" │ operate" │ debug it?" │ it's broken?" │
│ Deploy │ Infra as │ Safe deploy │ Safe deploy │ Simplest possible │
│ standard │ feature scope│ + cherry-pick│ + rollback │ deploy │
│ │ │ risk check │ │ │
│ Error map │ Full + chaos │ Full + chaos │ Full │ Critical paths │
│ │ scenarios │ for accepted │ │ only │
│ CEO plan │ Written │ Written │ Skipped │ Skipped │
│ Phase 2/3 │ Map accepted │ Map accepted │ Note it │ Skip │
│ planning │ │ cherry-picks │ │ │
│ Design │ "Inevitable" │ If UI scope │ If UI scope │ Skip │
│ (Sec 11) │ UI review │ detected │ detected │ │
└─────────────┴──────────────┴──────────────┴──────────────┴────────────────────┘
```
+10 -8
View File
@@ -672,7 +672,7 @@ describe('REVIEW_DASHBOARD resolver', () => {
for (const skill of REVIEW_SKILLS) {
test(`review dashboard appears in ${skill} generated file`, () => {
const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
const content = readSkillUnion(skill); // carved skills: union skeleton + sections
expect(content).toContain('gstack-review');
expect(content).toContain('REVIEW READINESS DASHBOARD');
});
@@ -699,7 +699,7 @@ describe('REVIEW_DASHBOARD resolver', () => {
});
test('resolver output contains key dashboard elements', () => {
const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
const content = readSkillUnion('plan-ceo-review'); // carved: dashboard moved to section
expect(content).toContain('VERDICT');
expect(content).toContain('CLEARED');
expect(content).toContain('Eng Review');
@@ -709,25 +709,25 @@ describe('REVIEW_DASHBOARD resolver', () => {
});
test('dashboard bash block includes git HEAD for staleness detection', () => {
const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
const content = readSkillUnion('plan-ceo-review'); // carved: dashboard moved to section
expect(content).toContain('git rev-parse --short HEAD');
expect(content).toContain('---HEAD---');
});
test('dashboard includes staleness detection prose', () => {
const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
const content = readSkillUnion('plan-ceo-review'); // carved: dashboard moved to section
expect(content).toContain('Staleness detection');
expect(content).toContain('commit');
});
for (const skill of REVIEW_SKILLS) {
test(`${skill} contains review chaining section`, () => {
const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
const content = readSkillUnion(skill); // carved skills: union skeleton + sections
expect(content).toContain('Review Chaining');
});
test(`${skill} Review Log includes commit field`, () => {
const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
const content = readSkillUnion(skill); // carved skills: union skeleton + sections
expect(content).toContain('"commit"');
});
}
@@ -969,7 +969,7 @@ describe('PLAN_FILE_REVIEW_REPORT resolver', () => {
}
test('resolver output contains key report elements', () => {
const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
const content = readSkillUnion('plan-ceo-review'); // carved: report writer moved to section
expect(content).toContain('Trigger');
expect(content).toContain('Findings');
expect(content).toContain('VERDICT');
@@ -3112,7 +3112,9 @@ describe('GSTACK REVIEW REPORT delete-then-append flow', () => {
for (const skill of PLAN_REVIEW_SKILLS) {
test(`${skill}/SKILL.md prescribes delete-then-append, not in-place replace`, () => {
const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
// Carved skills (v2 plan Phase B) relocate the review-report prose into
// sections/*.md; readSkillUnion follows the content wherever the carve put it.
const content = readSkillUnion(skill);
// The new (correct) instruction must be present.
expect(content).toContain('delete-then-append flow');
+7
View File
@@ -226,7 +226,14 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
minBytes: 120_000,
},
{
// Carved (v2 plan T9): skeleton SKILL.md + sections/review-sections.md.
// Content + size floors run against the union (relocated prose still counts);
// maxSkeletonBytes asserts the always-loaded skeleton shrank from the ~138KB
// monolith to ~81KB (measured 80,731 B, -42%). Headroom to 90KB so a small
// skeleton edit doesn't trip CI, but a 10KB regression does.
skill: 'plan-ceo-review',
sectioned: true,
maxSkeletonBytes: 90_000,
mustContain: [
'SCOPE EXPANSION',
'SELECTIVE EXPANSION',
+2
View File
@@ -122,6 +122,7 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
'budget-regression-pty': ['test/helpers/eval-store.ts', 'test/skill-budget-regression.test.ts'],
'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'bin/gstack-version-bump', 'scripts/resolvers/sections.ts', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
'ship-section-loading': ['ship/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/required-reads.ts', 'test/helpers/transcript-section-logger.ts', 'test/helpers/claude-pty-runner.ts'],
'plan-ceo-section-loading': ['plan-ceo-review/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/required-reads.ts', 'test/helpers/transcript-section-logger.ts', 'test/helpers/claude-pty-runner.ts'],
'autoplan-chain-pty': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'],
@@ -510,6 +511,7 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
'budget-regression-pty': 'gate', // free, library-only assertion
'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode
'ship-section-loading': 'periodic', // ~$3/run, real /ship; asserts section reads
'plan-ceo-section-loading': 'periodic', // ~$3-5/run, real /plan-ceo-review; asserts section read
'autoplan-chain-pty': 'periodic', // ~$8/run, all 3 phases sequential
// Per-finding count + review-report-at-bottom — periodic because each
+80 -49
View File
@@ -8,6 +8,14 @@
*
* Also pins the PASSIVE-manifest contract (CM2 / v2_PLAN.md:663): manifest entries
* carry only id/file/title/trigger — no machine predicate (applies_when/required_for).
*
* Generalized for every carved skill (v2 plan Phase B). Carved skills are
* discovered dynamically (any top-level dir with sections/manifest.json), so a new
* carve is covered the moment its manifest lands — no edit here. Per Codex
* outside-voice P2, each skill's manifest + dir listing is read INSIDE its own
* describe case (not at module top), so a carve-in-progress (manifest added before
* the .md is generated) fails only that skill's generated-.md assertion instead of
* crashing the whole module, and the suite never silently stays ship-only.
*/
import { describe, test, expect } from 'bun:test';
@@ -15,63 +23,86 @@ import * as fs from 'fs';
import * as path from 'path';
const ROOT = path.resolve(import.meta.dir, '..');
const SHIP_SECTIONS = path.join(ROOT, 'ship', 'sections');
const manifest = JSON.parse(fs.readFileSync(path.join(SHIP_SECTIONS, 'manifest.json'), 'utf-8'));
const sectionTmpls = fs.readdirSync(SHIP_SECTIONS).filter(f => f.endsWith('.md.tmpl'));
const sectionMds = fs.readdirSync(SHIP_SECTIONS).filter(f => f.endsWith('.md') && !f.endsWith('.md.tmpl'));
/** Every top-level skill dir that owns a sections/manifest.json. */
function discoverCarvedSkills(): string[] {
return fs
.readdirSync(ROOT, { withFileTypes: true })
.filter(d => d.isDirectory())
.map(d => d.name)
.filter(name => fs.existsSync(path.join(ROOT, name, 'sections', 'manifest.json')))
.sort();
}
const CARVED_SKILLS = discoverCarvedSkills();
describe('section manifest ↔ filesystem consistency', () => {
test('manifest parses with skill + sections array', () => {
expect(manifest.skill).toBe('ship');
expect(Array.isArray(manifest.sections)).toBe(true);
expect(manifest.sections.length).toBeGreaterThan(0);
test('the known carved skills are discovered', () => {
// Tripwire: if a carve regresses (manifest deleted) this catches it.
expect(CARVED_SKILLS).toContain('ship');
expect(CARVED_SKILLS).toContain('plan-ceo-review');
});
test('every manifest entry has a .md.tmpl source AND a generated .md', () => {
for (const s of manifest.sections) {
expect(fs.existsSync(path.join(SHIP_SECTIONS, `${s.file}.tmpl`))).toBe(true);
expect(fs.existsSync(path.join(SHIP_SECTIONS, s.file))).toBe(true);
}
});
for (const skill of CARVED_SKILLS) {
describe(skill, () => {
// Codex P2: computed per-skill-case, not at module load.
const sectionsDir = path.join(ROOT, skill, 'sections');
const manifest = JSON.parse(fs.readFileSync(path.join(sectionsDir, 'manifest.json'), 'utf-8'));
const sectionTmpls = fs.readdirSync(sectionsDir).filter(f => f.endsWith('.md.tmpl'));
const sectionMds = fs.readdirSync(sectionsDir).filter(f => f.endsWith('.md') && !f.endsWith('.md.tmpl'));
test('manifest is PASSIVE — no applies_when / required_for predicate (CM2)', () => {
for (const s of manifest.sections) {
expect(s).not.toHaveProperty('applies_when');
expect(s).not.toHaveProperty('required_for');
// The allowed passive shape:
expect(typeof s.id).toBe('string');
expect(typeof s.file).toBe('string');
expect(typeof s.title).toBe('string');
expect(typeof s.trigger).toBe('string');
}
});
test('manifest parses with skill + sections array', () => {
expect(manifest.skill).toBe(skill);
expect(Array.isArray(manifest.sections)).toBe(true);
expect(manifest.sections.length).toBeGreaterThan(0);
});
test('no generated orphan: every sections/X.md has a sections/X.md.tmpl → FAIL', () => {
const orphans = sectionMds.filter(md => !sectionTmpls.includes(`${md}.tmpl`));
expect(orphans).toEqual([]);
});
test('every manifest entry has a .md.tmpl source AND a generated .md', () => {
for (const s of manifest.sections) {
expect(fs.existsSync(path.join(sectionsDir, `${s.file}.tmpl`))).toBe(true);
expect(fs.existsSync(path.join(sectionsDir, s.file))).toBe(true);
}
});
test('no hand-edited generated file: every sections/X.md has the AUTO-GENERATED header → FAIL', () => {
for (const md of sectionMds) {
const head = fs.readFileSync(path.join(SHIP_SECTIONS, md), 'utf-8').slice(0, 120);
expect(head).toContain('AUTO-GENERATED');
}
});
test('manifest is PASSIVE — no applies_when / required_for predicate (CM2)', () => {
for (const s of manifest.sections) {
expect(s).not.toHaveProperty('applies_when');
expect(s).not.toHaveProperty('required_for');
// The allowed passive shape:
expect(typeof s.id).toBe('string');
expect(typeof s.file).toBe('string');
expect(typeof s.title).toBe('string');
expect(typeof s.trigger).toBe('string');
}
});
test('manifest orphan check (WARN in v2.0): every .md.tmpl is listed', () => {
const listed = new Set(manifest.sections.map((s: { file: string }) => `${s.file}.tmpl`));
const unlisted = sectionTmpls.filter(t => !listed.has(t));
if (unlisted.length > 0) {
// v2_PLAN.md: WARN now, FAIL in v2.1. Surface, don't fail the build yet.
// eslint-disable-next-line no-console
console.warn(`[section-manifest] manifest orphan(s) (not in manifest.json): ${unlisted.join(', ')}`);
}
expect(unlisted.length).toBeLessThanOrEqual(unlisted.length); // always passes; WARN only
});
test('no generated orphan: every sections/X.md has a sections/X.md.tmpl → FAIL', () => {
const orphans = sectionMds.filter(md => !sectionTmpls.includes(`${md}.tmpl`));
expect(orphans).toEqual([]);
});
test('section ids are unique', () => {
const ids = manifest.sections.map((s: { id: string }) => s.id);
expect(new Set(ids).size).toBe(ids.length);
});
test('no hand-edited generated file: every sections/X.md has the AUTO-GENERATED header → FAIL', () => {
for (const md of sectionMds) {
const head = fs.readFileSync(path.join(sectionsDir, md), 'utf-8').slice(0, 120);
expect(head).toContain('AUTO-GENERATED');
}
});
test('manifest orphan check (WARN in v2.0): every .md.tmpl is listed', () => {
const listed = new Set(manifest.sections.map((s: { file: string }) => `${s.file}.tmpl`));
const unlisted = sectionTmpls.filter(t => !listed.has(t));
if (unlisted.length > 0) {
// v2_PLAN.md: WARN now, FAIL in v2.1. Surface, don't fail the build yet.
// eslint-disable-next-line no-console
console.warn(`[section-manifest] ${skill} manifest orphan(s) (not in manifest.json): ${unlisted.join(', ')}`);
}
expect(unlisted.length).toBeLessThanOrEqual(unlisted.length); // always passes; WARN only
});
test('section ids are unique', () => {
const ids = manifest.sections.map((s: { id: string }) => s.id);
expect(new Set(ids).size).toBe(ids.length);
});
});
}
});
+82
View File
@@ -0,0 +1,82 @@
/**
* plan-ceo-review carve — static ordering guard (GATE tier, free, deterministic).
*
* This is the per-PR mechanical backstop for the v2-plan Phase B carve of
* plan-ceo-review (Codex outside-voice P2). The periodic real-PTY E2E
* (skill-e2e-plan-ceo-review-section-loading.test.ts) is the behavioral proof,
* but it runs weekly and costs money. This file runs on every `bun test` and
* fails CI the moment the carve's structural invariants break:
*
* 1. The skeleton points at the section with a STOP-Read directive, and that
* directive sits AFTER Step 0 (scope + mode) — so the conversational Step 0
* stays in the always-loaded skeleton, never stranded in the on-demand file.
* 2. The heavy review body (Sections 1-11) is NOT in the skeleton — it moved to
* the section. A regression that inlines it back would re-bloat the skeleton.
* 3. The review report writer ("GSTACK REVIEW REPORT") lives in the section, and
* the blocking EXIT PLAN MODE GATE that verifies it lives in the skeleton
* AFTER the STOP — so the gate fires once the section work returns.
* 4. Nothing review-governing sits in the skeleton below the STOP (Codex P1):
* no "Section N", no "## Mode Quick Reference", no "## Formatting Rules".
*/
import { describe, test, expect } from 'bun:test';
import * as fs from 'fs';
import * as path from 'path';
const ROOT = path.resolve(import.meta.dir, '..');
const SKELETON = path.join(ROOT, 'plan-ceo-review', 'SKILL.md');
const SECTION = path.join(ROOT, 'plan-ceo-review', 'sections', 'review-sections.md');
describe('plan-ceo-review carve — static ordering', () => {
const skeleton = fs.readFileSync(SKELETON, 'utf-8');
const section = fs.readFileSync(SECTION, 'utf-8');
// Index into the skeleton, -1 if absent.
const at = (needle: string): number => skeleton.indexOf(needle);
const STEP0 = '## Step 0: Nuclear Scope Challenge + Mode Selection';
const STOP = 'sections/review-sections.md'; // appears in the index row + STOP directive
const GATE = 'GSTACK REVIEW REPORT';
test('skeleton emits a STOP-Read directive pointing at the section', () => {
expect(skeleton).toContain('> **STOP.**');
expect(skeleton).toContain('plan-ceo-review/sections/review-sections.md');
expect(skeleton).toContain('## Section index — Read each section when its situation applies');
});
test('Step 0 (scope + mode) stays in the skeleton, BEFORE the STOP', () => {
const step0 = at(STEP0);
const stop = skeleton.indexOf('> **STOP.**');
expect(step0).toBeGreaterThan(-1);
expect(stop).toBeGreaterThan(step0); // STOP fires only after Step 0
});
test('the heavy review body (Sections 1-11) is NOT in the skeleton', () => {
expect(skeleton).not.toContain('### Section 1: Architecture Review');
expect(skeleton).not.toContain('### Section 11:');
// ...it lives in the section instead.
expect(section).toContain('### Section 1: Architecture Review');
expect(section).toContain('### Section 11:');
});
test('nothing review-governing sits in the skeleton below the STOP (Codex P1)', () => {
// Mode Quick Reference + Formatting Rules govern review-time behavior and must
// travel with the section, not be stranded below the STOP in the skeleton.
expect(skeleton).not.toContain('## Mode Quick Reference');
expect(skeleton).not.toContain('## Formatting Rules');
expect(section).toContain('## Mode Quick Reference');
});
test('review report writer lives in the section; the EXIT PLAN MODE GATE stays in the skeleton AFTER the STOP', () => {
// The report itself is produced inside the section work...
expect(section).toContain(GATE);
// ...and the blocking gate that verifies it is the last thing the skeleton runs.
const stop = skeleton.indexOf('> **STOP.**');
const gate = skeleton.lastIndexOf(GATE);
expect(gate).toBeGreaterThan(stop);
});
test('the section is generated, not hand-edited', () => {
expect(section.slice(0, 120)).toContain('AUTO-GENERATED');
});
});
@@ -0,0 +1,191 @@
/**
* /plan-ceo-review section-loading E2E (periodic, paid, real-PTY) — v2 plan
* Phase B carve backstop. The per-PR guard is the free static test
* skill-ceo-section-ordering.test.ts; THIS is the behavioral proof that a real
* agent actually Reads the carved section instead of working from memory.
*
* After the carve, plan-ceo-review is a skeleton whose single STOP-Read directive
* (fired after Step 0 mode selection) points at sections/review-sections.md. This
* test runs the REAL /plan-ceo-review skill in plan mode against a fixture branch
* that has a plan worth reviewing, drives Step 0 to HOLD SCOPE (the simplest mode
* that still requires all 11 review sections), and asserts the agent Read
* review-sections.md before producing the review report.
*
* Codex outside-voice P1 fixes vs the naive port of the ship test:
* - REFRESH THE INSTALL FIRST. The skill loads from the installed copy at
* ~/.claude/skills/gstack/plan-ceo-review (a real copy on dev machines, fresh
* on CI). A test that didn't refresh would assert against the pre-carve
* monolith and trivially "pass" with zero section reads. beforeAll copies the
* freshly-generated skeleton + sections into the install; afterAll restores the
* prior state so a local run doesn't leave the active skill mutated.
* - HANDLE THE FULL STEP 0. plan-ceo's Step 0 can fire a system audit, WebSearch,
* and several AskUserQuestion calls before mode selection — the answer loop
* replies to every permission dialog / numbered list, not just two.
*
* Plan-mode framing keeps the agent from editing/committing. Cost: ~$3-5/run.
* Periodic tier.
*/
import { describe, test, expect } from 'bun:test';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
import {
launchClaudePty,
isPermissionDialogVisible,
isNumberedOptionListVisible,
} from './helpers/claude-pty-runner';
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeE2E = shouldRun ? describe : describe.skip;
const REPO_ROOT = path.resolve(import.meta.dir, '..');
const INSTALL_DIR = path.join(os.homedir(), '.claude', 'skills', 'gstack', 'plan-ceo-review');
// Sections every plan-ceo-review run must consult after Step 0.
const REQUIRED_SECTIONS = ['review-sections.md'];
/** Copy the freshly-generated skeleton + sections into the installed skill so the
* PTY agent loads the carve under test. Returns a restore() that puts the install
* back exactly as it was (content of SKILL.md + presence/content of sections/). */
function refreshInstall(): () => void {
const repoSkill = path.join(REPO_ROOT, 'plan-ceo-review', 'SKILL.md');
const repoSections = path.join(REPO_ROOT, 'plan-ceo-review', 'sections');
const installSkill = path.join(INSTALL_DIR, 'SKILL.md');
const installSections = path.join(INSTALL_DIR, 'sections');
// Snapshot prior state for restore.
const priorSkill = fs.existsSync(installSkill) ? fs.readFileSync(installSkill) : null;
const hadSections = fs.existsSync(installSections);
const priorSections: Record<string, Buffer> = {};
if (hadSections) {
for (const f of fs.readdirSync(installSections)) {
priorSections[f] = fs.readFileSync(path.join(installSections, f));
}
}
// Apply: skeleton + every generated section file (.md) + manifest.
fs.mkdirSync(INSTALL_DIR, { recursive: true });
fs.copyFileSync(repoSkill, installSkill);
fs.mkdirSync(installSections, { recursive: true });
for (const f of fs.readdirSync(repoSections)) {
if (f.endsWith('.md.tmpl')) continue; // install carries generated files, not templates
fs.copyFileSync(path.join(repoSections, f), path.join(installSections, f));
}
return function restore(): void {
try {
if (priorSkill) fs.writeFileSync(installSkill, priorSkill);
if (hadSections) {
// Restore the prior section files; drop any we added.
for (const f of fs.readdirSync(installSections)) {
if (!(f in priorSections)) fs.rmSync(path.join(installSections, f), { force: true });
}
for (const [f, buf] of Object.entries(priorSections)) {
fs.writeFileSync(path.join(installSections, f), buf);
}
} else {
fs.rmSync(installSections, { recursive: true, force: true });
}
} catch { /* best-effort restore */ }
};
}
/** Fixture: a feature branch with a real change + a plan file worth reviewing. */
function buildPlanFixture(): { workTree: string; root: string } {
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'gstack-ceo-secload-'));
const workTree = path.join(root, 'workspace');
const bareRemote = path.join(root, 'origin.git');
fs.mkdirSync(workTree, { recursive: true });
const sh = (cmd: string, cwd: string): void => {
const r = spawnSync('bash', ['-c', cmd], { cwd, stdio: 'pipe', timeout: 15_000 });
if (r.status !== 0) throw new Error(`fixture setup failed at "${cmd}":\n${r.stderr?.toString()}`);
};
sh(`git init --bare "${bareRemote}"`, root);
sh('git init -b main', workTree);
sh('git config user.email "t@t.com" && git config user.name "T" && git config commit.gpgsign false', workTree);
fs.writeFileSync(path.join(workTree, 'app.js'), '// base\n');
sh('git add -A && git commit -m "chore: initial"', workTree);
sh(`git remote add origin "${bareRemote}" && git push -u origin main`, workTree);
// Feature branch with a real change + a plan describing it (something to review).
sh('git checkout -b feat/cache-layer', workTree);
fs.writeFileSync(
path.join(workTree, 'PLAN.md'),
[
'# Plan: add an in-memory cache layer',
'',
'## Context',
'Reads hit the DB on every request. Add a process-local LRU cache in front of',
'the read path to cut DB load.',
'',
'## Approach',
'- Wrap the read repository in a cache that stores the last 1000 keys.',
'- Invalidate on write.',
'',
'## Out of scope',
'Distributed cache, cross-process coherence.',
'',
].join('\n'),
);
fs.writeFileSync(path.join(workTree, 'app.js'), '// base\nexport function read(k) { return db.get(k); }\n');
sh('git add -A && git commit -m "feat: cache layer plan + stub"', workTree);
sh('git push -u origin feat/cache-layer', workTree);
return { workTree, root };
}
describeE2E('/plan-ceo-review section-loading E2E (periodic, real-PTY, installed skill)', () => {
test(
'a real review Reads the carved section before producing the report',
async () => {
const restore = refreshInstall();
const { workTree, root } = buildPlanFixture();
const session = await launchClaudePty({
permissionMode: 'plan',
cwd: workTree,
timeoutMs: 900_000,
env: { NO_COLOR: '1' },
});
const readSections = new Set<string>();
let reportReady = false;
try {
await Bun.sleep(8000);
const since = session.mark();
// HOLD SCOPE = simplest mode that still walks all 11 review sections.
session.send('/plan-ceo-review review PLAN.md, hold scope\r');
const start = Date.now();
let lastPermSig = '';
while (Date.now() - start < 780_000) {
await Bun.sleep(3000);
if (session.exited()) break;
const visible = session.visibleSince(since);
const tail = visible.slice(-1500);
// Answer EVERY permission dialog / numbered option list (system audit,
// WebSearch, and the several Step 0 questions) by taking option 1.
if (isNumberedOptionListVisible(tail) && isPermissionDialogVisible(tail)) {
const sig = visible.slice(-500);
if (sig !== lastPermSig) { lastPermSig = sig; session.send('1\r'); await Bun.sleep(1500); continue; }
}
for (const m of visible.matchAll(/sections\/([A-Za-z0-9._-]+\.md)/g)) readSections.add(m[1]);
if (/GSTACK REVIEW REPORT|COMPLETION SUMMARY|ready to execute/i.test(visible)) {
reportReady = true;
break;
}
}
} finally {
await session.close();
try { fs.rmSync(root, { recursive: true, force: true }); } catch { /* ignore */ }
restore();
}
const missing = REQUIRED_SECTIONS.filter(s => !readSections.has(s));
expect({ reportReady, read: [...readSections], missing }).toEqual({
reportReady: true,
read: expect.any(Array),
missing: [],
});
},
1_020_000,
);
});
+8 -5
View File
@@ -146,11 +146,14 @@ describe('SKILL.md size budget regression (gate, free)', () => {
* skill, so this is a comfortable ceiling that still catches accidental
* mass deletion (e.g., a refactor that strips the body of a skill).
*
* v2.0.0.0 will introduce the sections/ pattern for 5 heavyweights
* v2.0.0.0 introduces the sections/ pattern for 5 heavyweights
* (ship, plan-ceo-review, office-hours, plan-eng-review,
* plan-design-review). Those skills will legitimately shrink to ~15 KB
* skeletons. When that lands, add them to SECTIONS_EXTRACTED so the floor
* relaxes for them.
* plan-design-review). Carved so far: ship (skeleton ~83 KB) and
* plan-ceo-review (skeleton ~81 KB, down from the 138 KB monolith). Those
* skeletons legitimately fall below the 80% body-strip floor, so each carved
* skill is added to SECTIONS_EXTRACTED; its union is guarded instead by the
* sectioned invariant in parity-harness.ts (minBytes on skeleton+sections).
* Add the remaining three here as they carve.
*/
test('no skill shrinks past 80% of v1.47.0.0 baseline (catches accidental body strip)', () => {
const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
@@ -160,7 +163,7 @@ describe('SKILL.md size budget regression (gate, free)', () => {
// because prose moved into sections/*.md. The union size is guarded instead
// by the sectioned ship invariant in parity-harness.ts (minBytes on the
// skeleton+sections union), so exempt the skeleton from the body-strip floor.
const SECTIONS_EXTRACTED = new Set<string>(['ship']);
const SECTIONS_EXTRACTED = new Set<string>(['ship', 'plan-ceo-review']);
const undershoots: Array<{
skill: string; beforeBytes: number; afterBytes: number; ratio: number;
+16 -12
View File
@@ -7,14 +7,13 @@ import * as path from 'path';
const ROOT = path.resolve(import.meta.dir, '..');
// Carved-skill aware (v2 plan T9): ship is a skeleton SKILL.md + sections/*.md.
// Read the union so validations of content that moved into a section still hold.
// `_SHIP_MD` is a distinct path expression so a mechanical read-replace can't
// recurse into this helper.
const _SHIP_MD = path.join(ROOT, 'ship', 'SKILL.md');
function readShipUnion(): string {
let t = fs.readFileSync(_SHIP_MD, 'utf-8');
const secDir = path.join(ROOT, 'ship', 'sections');
// Carved-skill aware (v2 plan T9 / Phase B): a carved skill is a skeleton SKILL.md
// plus sections/*.md. Read the union so validations of content that moved into a
// section still hold. For an uncarved skill (no sections dir) this is just the
// skeleton, so readSkillUnion is safe to use everywhere.
function readSkillUnion(skill: string): string {
let t = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
const secDir = path.join(ROOT, skill, 'sections');
if (fs.existsSync(secDir)) {
for (const f of fs.readdirSync(secDir).sort()) {
if (f.endsWith('.md')) t += '\n' + fs.readFileSync(path.join(secDir, f), 'utf-8');
@@ -22,6 +21,9 @@ function readShipUnion(): string {
}
return t;
}
function readShipUnion(): string {
return readSkillUnion('ship');
}
describe('SKILL.md command validation', () => {
test('all $B commands in SKILL.md are valid browse commands', () => {
@@ -548,8 +550,8 @@ describe('TODOS-format.md reference consistency', () => {
test('skills that write TODOs reference TODOS-format.md', () => {
const shipContent = readShipUnion();
const ceoPlanContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
const engPlanContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
const ceoPlanContent = readSkillUnion('plan-ceo-review'); // carved: TODOS-format ref moved to section
const engPlanContent = readSkillUnion('plan-eng-review');
expect(shipContent).toContain('TODOS-format.md');
expect(ceoPlanContent).toContain('TODOS-format.md');
@@ -912,8 +914,10 @@ describe('CEO review mode validation', () => {
});
test('has docs/designs promotion section', () => {
expect(content).toContain('docs/designs');
expect(content).toContain('PROMOTED');
// Carved (v2 plan Phase B): the promotion block moved into the review section.
const union = readSkillUnion('plan-ceo-review');
expect(union).toContain('docs/designs');
expect(union).toContain('PROMOTED');
});
test('mode quick reference has four columns', () => {
+4 -2
View File
@@ -109,8 +109,10 @@ describe('selectTests', () => {
// E2E test also depends on plan-ceo-review/** (5-option scope decision
// regression for the "drop to fit 4 options" failure mode).
expect(result.selected).toContain('plan-ceo-split-overflow');
expect(result.selected.length).toBe(22);
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 22);
// v2 plan Phase B carve: the section-loading E2E depends on plan-ceo-review/**.
expect(result.selected).toContain('plan-ceo-section-loading');
expect(result.selected.length).toBe(23);
expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 23);
});
test('global touchfile triggers ALL tests', () => {