Files
Garry Tan 45cc95d5f4 v1.57.5.0 feat: cross-session decision memory + gbrain dream-stage call graph (#1910)
* feat(gbrain-sync): add cycleCompleted() cycle-state probe

Reads `gbrain doctor` cycle_freshness to classify whether a source has
completed a full cycle (completed/never/unknown). A fail naming this source
-> never; a fail naming only other sources -> completed; an absent or
unparseable check -> unknown, so an unrelated doctor failure never masks a
real state. Gates the automatic call-graph build on --full.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gbrain-sync): --dream call-graph stage with lock-free gate + honest outcome guard

Adds a source-scoped `gbrain dream --source <id>` stage that builds this
worktree's call graph (code-callers/code-callees). Runs lock-free after the
sync lock releases so it never blocks sibling worktrees; a .dream-in-progress
marker dedupes concurrent dreams. --full auto-runs it only when the cycle was
never built; explicit --dream always forces; --no-dream opts out.

The stage parses the cycle's own output and reports the truth, not a flat
"built": a WARN when the schema pack can't extract code symbols, when the
embed phase failed for a missing key, or when 0 edges resolved; OK with the
resolved-edge count otherwise. gbrain exits 0 even when it skips on a held
cycle lock (e.g. autopilot), so that case reports SKIP, not success.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: ignore gbrain .sources/ local staging dir

gbrain writes per-source staging and capability-check artifacts under
.sources/ in the repo root. It's machine-local runtime state, not source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(gbrain): honest call-graph guidance in /sync-gbrain + pin works on gbrain>=0.41.38

sync-gbrain frames the --dream offer honestly: building a call graph requires a
code-aware schema pack, and the dream stage reports a WARN when it can't. The
verdict's Call graph row mirrors the dream stage's real outcome instead of
assuming a completed cycle means edges exist. The ## GBrain Search Guidance
block written into CLAUDE.md drops the old code-callers --source caveat:
gbrain >=0.41.38.0 honors the .gbrain-source pin for code-callers/code-callees.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(jsonl-store): shared audited JSONL plumbing (injection-reject + atomic append + tolerant read)

Single source of truth extracted for D2A: gstack-learnings-* and the upcoming
gstack-decision-* bins share one injection-pattern list, one atomic single-line
appender, and one tolerant reader. No more drift between stores.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(learnings-log): use shared hasInjection from lib/jsonl-store (D2A)

Replace the inline injection-pattern copy with the shared list. One audited
write-path rejection across learnings + the upcoming decision store. Behavior
unchanged (35/35 learnings tests green); learnings-search keeps its inline copy
because a structural test pins its bash/bun shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(decision): event-sourced decision-memory model (lib/gstack-decision)

decide/supersede/redact events on lib/jsonl-store; active set is computed (no
mutable status), dangling refs tolerated. Free-text is injection-checked and
redact-scanned on write (HIGH secret -> reject). Scope filter (repo/branch/issue)
for relevant resurfacing. File-only + reliable; gbrain not required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(decision): bounded active snapshot + compaction (redact expunges, supersede archives)

writeSnapshot/readSnapshot/rebuildSnapshot give an O(active) bounded read for the
session-start hot path (D1A). compact() rewrites the log to active, archives
superseded decisions for history, and EXPUNGES redacted ones (dropped, never
archived) so an accidentally-captured secret leaves the store for good.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(decision): gstack-decision-log + gstack-decision-search bins (non-interactive)

Two bins mirroring gstack-learnings-* (D3A). log writes decide/--supersede/--redact/
--compact events + refreshes the bounded snapshot + enqueues for cross-machine sync;
search reads the O(active) snapshot, scope-filtered to current branch, newest-first,
--all to include superseded, --json for machines. Empty store returns silently
(no snapshot write on an empty read).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(memory): surface active decisions at session start + capture nudge (Context Recovery)

Context Recovery now shows recent scope-relevant active decisions (bounded read of
decisions.active.json via gstack-decision-search) and instructs the agent to treat
them as settled calls and to log durable decisions/reversals. Closes the Phase-1
capture->curate->resurface loop, reliable + file-only. Regen across all hosts folded
in (squash-with-regen); parity 10/10, freshness green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines for the memory-loop preamble change

Context Recovery now emits the cross-session-decisions block, so ship's preamble
(all hosts) changed. Golden baselines are hand-maintained copies (gen does not
write them); refresh them from the fresh gen so golden-file regression passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(memory): document the cross-session decision-memory loop in CLAUDE.md

Adds a '## Cross-session decision memory' section: how to resurface
(gstack-decision-search) and capture (gstack-decision-log) durable decisions,
the supersede/redact/compact verbs, and a crisp durable-vs-trivial definition
so the store stays signal. Reliable file-only path; gbrain not required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(memory): emit durable decisions from ship/ceo/eng/spec at structured points

Wires the four skills that finalize real decisions to capture them in the
cross-session decision store, from their STRUCTURED outputs (never free-text
scraping):
- ship: the version bump (level + why) at write time
- plan-ceo-review: accepted scope + verdict (branch-scoped)
- plan-eng-review: the architecture verdict + key call (branch-scoped)
- spec: the filed issue's core approach (issue-scoped)

All emits are non-interactive, schema-correct (content in decision/rationale,
source=skill, confidence 1-10), and best-effort (|| true) so a decision-log
failure never blocks the workflow. Includes regen across hosts + refreshed ship
golden baselines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(memory): optional gbrain --semantic recall for decision search

Adds gstack-decision-search --semantic (with --query): appends a 'Related from
memory' block from gbrain semantic search, scoped to the curated-memory source.
Pure enhancement, reliability-first: a new lib/gstack-decision-semantic.ts is the
ONLY decision module that touches gbrain and is imported lazily only on --semantic,
so the reliable file path never loads gbrain code. Every path degrades to the
reliable file results when gbrain is off, unconfigured, empty, or errors (never
throws, 10s timeout).

Built against the verified gbrain 0.42.x surface (text output [score] slug --
snippet, NOT JSON; curated-memory source resolved by worktree path, not a
gstack-brain-<user> id). Deterministic-contract tests only: parser units,
degrade-to-null when gbrain absent, and a fake-gbrain shim proving scope+search
end-to-end. find-contradictions deferred (no verifiable CLI surface yet + curated
memory not indexed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gbrain-sync): self-heal stale autopilot lock (dead-pid)

detectAutopilot treated a lock FILE as proof of life, so a crashed gbrain daemon
left a stale lock that wedged every sync forever (observed: a dead pid refused
--full indefinitely). Now read the holder pid (bare or JSON body) and check
liveness via signal-0: ESRCH=dead → ignore the stale signal and keep checking;
EPERM=alive (other user) → active. A stale lock never masks a live autopilot
process. Pure decision function — does not delete the file; the caller may clean it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(review): drop stray trailing code fence in TODOS-format

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(test): align section-loading E2E testNames with their TOUCHFILES keys

Pre-existing on main (v1.56.x): the two section-loading E2E tests used
human-label testNames ('/ship section-loading') that don't match their slug
keys ('ship-section-loading') in E2E_TOUCHFILES/E2E_TIERS. Every other E2E test
uses the slug as its testName, and the TOUCHFILES completeness gate requires
testName to be a registered key — so the gate was red. Align both testNames to
their slug keys (also fixes tier lookup for these two periodic tests).

Verified failing on a clean origin/main checkout before the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: pre-landing review fixes (datamark, DRY, compact, coverage)

Addresses the pre-landing review findings (all INFORMATIONAL, no criticals):
- security: datamark resurfaced decision text at the render boundary
  (lib/gstack-decision.ts datamark() — neutralizes code fences, --- banners,
  <|role|>/</system> markers, control chars, newlines). Applied in
  gstack-decision-search human output so stored text can't masquerade as
  instructions in Context Recovery (codex hardening #3 / AC #7). --json stays raw.
- DRY: extract resolveSlug/gitBranch/flagValue to lib/bin-context.ts; both
  decision bins use it instead of duplicating the helpers.
- compact(): batch the archive append (one write, not N) and shrink the
  mid-compact crash window; simplify the opaque branch/issue ternary.
- coverage: learnings-log injection rejection (D2A wiring), search --recent/
  --scope + NaN-safe --recent, datamark-applied, unparseable lock body,
  compact-empty, corrupt-snapshot degrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(security): close adversarial-review findings in decision memory

Adversarial review (Claude subagent) found a CRITICAL the specialist pass missed:
- F1 (CRITICAL): 'Human:'/'Assistant:' turn-prefixes bypassed BOTH the write-time
  denylist AND datamark(), landing verbatim in agent context inside the trusted
  ACTIVE DECISIONS fence. Add 'human:' (+ 'disregard previous', 'from now on') to
  the shared denylist, and have datamark() neutralize Human:/Assistant:/System:/User:
  turn-prefixes (ZWSP) at the render boundary.
- F2: datamark() only stripped ASCII C0; extend to Unicode line terminators
  (U+0085/2028/2029) and U+007F so 'strip newlines' actually holds.
- F3: validateDecide blocked only HIGH secrets; MEDIUM-tier PII (e.g. SSN) persisted
  silently and synced cross-machine. The store is non-interactive (no confirm path),
  so fail closed on MEDIUM too.
- F4: compact() was a lock-free read-modify-rewrite that could clobber a concurrent
  append (lost decision). Add an O_EXCL compact lock + a pre-rename size recheck that
  aborts untouched (skipped=true) if an append landed; caller re-runs.
- F7: filterByScope unknown/garbage scope fell through to 'return true' (leaked into
  every context); fail conservative (false).

F5 (pid reuse) and F6 (pgrep over-match) are intentionally left as-is: both fail SAFE
(over-refuse sync); making them precise would introduce a fail-DANGEROUS path
(allowing sync during a real autopilot). True disambiguation needs gbrain to stamp the
lock with a start-time, which gstack doesn't own. F8 (compact moves history to archive)
is by design.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(security): close cross-model (Codex) adversarial findings

Codex adversarial review found a HIGH the Claude pass missed plus 3 mediums:
- C1 (HIGH): gstack-decision-search --all returned every decide and IGNORED redact
  events, so a redacted secret still resurfaced via --all until compact ran. --all
  now excludes redacted (redact = expunge from every read path), still showing
  superseded history.
- C-med: semantic (external gbrain) slug/snippet were printed raw — datamark them too
  so a gbrain hit can't spoof role markers / fences into agent context.
- C4: semanticRecall fell back to an UNSCOPED gbrain search when no curated-memory
  source resolved, pulling code/doc corpora mislabeled as 'related decisions'. Now
  returns null (degrade) when there's no worktree-backed memory source.
- C5: validateDecide scanned only decision/rationale/alternatives; branch and issue
  are stored + surfaced (raw via --json), so include them in the injection+secret scan.

C2 (snapshot staleness) / C3 (compact TOCTOU residual): accepted for a single-user
store — atomic appends never lose the event, rebuilds self-heal, and the compact
size-recheck leaves only a sub-ms window; full append-locking would break the
lock-free append design.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.57.5.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 06:20:58 -07:00

493 lines
35 KiB
Cheetah

## Review Sections (11 sections, after scope and mode are agreed)
**Anti-skip rule:** Never condense, abbreviate, or skip any review section (1-11) regardless of plan type (strategy, spec, code, infra). Every section in this skill exists for a reason. "This is a strategy doc so implementation sections don't apply" is always wrong — implementation details are where strategy breaks down. If a section genuinely has zero findings, say "No issues found" and move on — but you must evaluate it.
{{ANTI_SHORTCUT_CLAUSE}}
### Section 1: Architecture Review
Evaluate and diagram:
* Overall system design and component boundaries. Draw the dependency graph.
* Data flow — all four paths. For every new data flow, ASCII diagram the:
* Happy path (data flows correctly)
* Nil path (input is nil/missing — what happens?)
* Empty path (input is present but empty/zero-length — what happens?)
* Error path (upstream call fails — what happens?)
* State machines. ASCII diagram for every new stateful object. Include impossible/invalid transitions and what prevents them.
* Coupling concerns. Which components are now coupled that weren't before? Is that coupling justified? Draw the before/after dependency graph.
* Scaling characteristics. What breaks first under 10x load? Under 100x?
* Single points of failure. Map them.
* Security architecture. Auth boundaries, data access patterns, API surfaces. For each new endpoint or data mutation: who can call it, what do they get, what can they change?
* Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it.
* Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"?
* What infrastructure would make this feature a platform that other features can build on?
**SELECTIVE EXPANSION:** If any accepted cherry-picks from Step 0D affect the architecture, evaluate their architectural fit here. Flag any that create coupling concerns or don't integrate cleanly — this is a chance to revisit the decision with new information.
Required ASCII diagram: full system architecture showing new components and their relationships to existing ones.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 2: Error & Rescue Map
This is the section that catches silent failures. It is not optional.
For every new method, service, or codepath that can fail, fill in this table:
```
METHOD/CODEPATH | WHAT CAN GO WRONG | EXCEPTION CLASS
-------------------------|-----------------------------|-----------------
ExampleService#call | API timeout | TimeoutError
| API returns 429 | RateLimitError
| API returns malformed JSON | JSONParseError
| DB connection pool exhausted| ConnectionPoolExhausted
| Record not found | RecordNotFound
-------------------------|-----------------------------|-----------------
EXCEPTION CLASS | RESCUED? | RESCUE ACTION | USER SEES
-----------------------------|-----------|------------------------|------------------
TimeoutError | Y | Retry 2x, then raise | "Service temporarily unavailable"
RateLimitError | Y | Backoff + retry | Nothing (transparent)
JSONParseError | N ← GAP | — | 500 error ← BAD
ConnectionPoolExhausted | N ← GAP | — | 500 error ← BAD
RecordNotFound | Y | Return nil, log warning | "Not found" message
```
Rules for this section:
* Catch-all error handling (`rescue StandardError`, `catch (Exception e)`, `except Exception`) is ALWAYS a smell. Name the specific exceptions.
* Catching an error with only a generic log message is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
* Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
* For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
* For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 3: Security & Threat Model
Security is not a sub-bullet of architecture. It gets its own section.
Evaluate:
* Attack surface expansion. What new attack vectors does this plan introduce? New endpoints, new params, new file paths, new background jobs?
* Input validation. For every new user input: is it validated, sanitized, and rejected loudly on failure? What happens with: nil, empty string, string when integer expected, string exceeding max length, unicode edge cases, HTML/script injection attempts?
* Authorization. For every new data access: is it scoped to the right user/role? Is there a direct object reference vulnerability? Can user A access user B's data by manipulating IDs?
* Secrets and credentials. New secrets? In env vars, not hardcoded? Rotatable?
* Dependency risk. New gems/npm packages? Security track record?
* Data classification. PII, payment data, credentials? Handling consistent with existing patterns?
* Injection vectors. SQL, command, template, LLM prompt injection — check all.
* Audit logging. For sensitive operations: is there an audit trail?
For each finding: threat, likelihood (High/Med/Low), impact (High/Med/Low), and whether the plan mitigates it.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 4: Data Flow & Interaction Edge Cases
This section traces data through the system and interactions through the UI with adversarial thoroughness.
**Data Flow Tracing:** For every new data flow, produce an ASCII diagram showing:
```
INPUT ──▶ VALIDATION ──▶ TRANSFORM ──▶ PERSIST ──▶ OUTPUT
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
[nil?] [invalid?] [exception?] [conflict?] [stale?]
[empty?] [too long?] [timeout?] [dup key?] [partial?]
[wrong [wrong type?] [OOM?] [locked?] [encoding?]
type?]
```
For each node: what happens on each shadow path? Is it tested?
**Interaction Edge Cases:** For every new user-visible interaction, evaluate:
```
INTERACTION | EDGE CASE | HANDLED? | HOW?
---------------------|------------------------|----------|--------
Form submission | Double-click submit | ? |
| Submit with stale CSRF | ? |
| Submit during deploy | ? |
Async operation | User navigates away | ? |
| Operation times out | ? |
| Retry while in-flight | ? |
List/table view | Zero results | ? |
| 10,000 results | ? |
| Results change mid-page| ? |
Background job | Job fails after 3 of | ? |
| 10 items processed | |
| Job runs twice (dup) | ? |
| Queue backs up 2 hours | ? |
```
Flag any unhandled edge case as a gap. For each gap, specify the fix.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 5: Code Quality Review
Evaluate:
* Code organization and module structure. Does new code fit existing patterns? If it deviates, is there a reason?
* DRY violations. Be aggressive. If the same logic exists elsewhere, flag it and reference the file and line.
* Naming quality. Are new classes, methods, and variables named for what they do, not how they do it?
* Error handling patterns. (Cross-reference with Section 2 — this section reviews the patterns; Section 2 maps the specifics.)
* Missing edge cases. List explicitly: "What happens when X is nil?" "When the API returns 429?" etc.
* Over-engineering check. Any new abstraction solving a problem that doesn't exist yet?
* Under-engineering check. Anything fragile, assuming happy path only, or missing obvious defensive checks?
* Cyclomatic complexity. Flag any new method that branches more than 5 times. Propose a refactor.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 6: Test Review
Make a complete diagram of every new thing this plan introduces:
```
NEW UX FLOWS:
[list each new user-visible interaction]
NEW DATA FLOWS:
[list each new path data takes through the system]
NEW CODEPATHS:
[list each new branch, condition, or execution path]
NEW BACKGROUND JOBS / ASYNC WORK:
[list each]
NEW INTEGRATIONS / EXTERNAL CALLS:
[list each]
NEW ERROR/RESCUE PATHS:
[list each — cross-reference Section 2]
```
For each item in the diagram:
* What type of test covers it? (Unit / Integration / System / E2E)
* Does a test for it exist in the plan? If not, write the test spec header.
* What is the happy path test?
* What is the failure path test? (Be specific — which failure?)
* What is the edge case test? (nil, empty, boundary values, concurrent access)
Test ambition check (all modes): For each new feature, answer:
* What's the test that would make you confident shipping at 2am on a Friday?
* What's the test a hostile QA engineer would write to break this?
* What's the chaos test?
Test pyramid check: Many unit, fewer integration, few E2E? Or inverted?
Flakiness risk: Flag any test depending on time, randomness, external services, or ordering.
Load/stress test requirements: For any new codepath called frequently or processing significant data.
For LLM/prompt changes: Check CLAUDE.md for the "Prompt/LLM changes" file patterns. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against.
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 7: Performance Review
Evaluate:
* N+1 queries. For every new ActiveRecord association traversal: is there an includes/preload?
* Memory usage. For every new data structure: what's the maximum size in production?
* Database indexes. For every new query: is there an index?
* Caching opportunities. For every expensive computation or external call: should it be cached?
* Background job sizing. For every new job: worst-case payload, runtime, retry behavior?
* Slow paths. Top 3 slowest new codepaths and estimated p99 latency.
* Connection pool pressure. New DB connections, Redis connections, HTTP connections?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 8: Observability & Debuggability Review
New systems break. This section ensures you can see why.
Evaluate:
* Logging. For every new codepath: structured log lines at entry, exit, and each significant branch?
* Metrics. For every new feature: what metric tells you it's working? What tells you it's broken?
* Tracing. For new cross-service or cross-job flows: trace IDs propagated?
* Alerting. What new alerts should exist?
* Dashboards. What new dashboard panels do you want on day 1?
* Debuggability. If a bug is reported 3 weeks post-ship, can you reconstruct what happened from logs alone?
* Admin tooling. New operational tasks that need admin UI or rake tasks?
* Runbooks. For each new failure mode: what's the operational response?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What observability would make this feature a joy to operate? (For SELECTIVE EXPANSION, include observability for any accepted cherry-picks.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 9: Deployment & Rollout Review
Evaluate:
* Migration safety. For every new DB migration: backward-compatible? Zero-downtime? Table locks?
* Feature flags. Should any part be behind a feature flag?
* Rollout order. Correct sequence: migrate first, deploy second?
* Rollback plan. Explicit step-by-step.
* Deploy-time risk window. Old code and new code running simultaneously — what breaks?
* Environment parity. Tested in staging?
* Post-deploy verification checklist. First 5 minutes? First hour?
* Smoke tests. What automated checks should run immediately post-deploy?
**EXPANSION and SELECTIVE EXPANSION addition:**
* What deploy infrastructure would make shipping this feature routine? (For SELECTIVE EXPANSION, assess whether accepted cherry-picks change the deployment risk profile.)
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 10: Long-Term Trajectory Review
Evaluate:
* Technical debt introduced. Code debt, operational debt, testing debt, documentation debt.
* Path dependency. Does this make future changes harder?
* Knowledge concentration. Documentation sufficient for a new engineer?
* Reversibility. Rate 1-5: 1 = one-way door, 5 = easily reversible.
* Ecosystem fit. Aligns with Rails/JS ecosystem direction?
* The 1-year question. Read this plan as a new engineer in 12 months — obvious?
**EXPANSION and SELECTIVE EXPANSION additions:**
* What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory?
* Platform potential. Does this create capabilities other features can leverage?
* (SELECTIVE EXPANSION only) Retrospective: Were the right cherry-picks accepted? Did any rejected expansions turn out to be load-bearing for the accepted ones?
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
### Section 11: Design & UX Review (skip if no UI scope detected)
The CEO calling in the designer. Not a pixel-level audit — that's /plan-design-review and /design-review. This is ensuring the plan has design intentionality.
Evaluate:
* Information architecture — what does the user see first, second, third?
* Interaction state coverage map:
FEATURE | LOADING | EMPTY | ERROR | SUCCESS | PARTIAL
* User journey coherence — storyboard the emotional arc
* AI slop risk — does the plan describe generic UI patterns?
* DESIGN.md alignment — does the plan match the stated design system?
* Responsive intention — is mobile mentioned or afterthought?
* Accessibility basics — keyboard nav, screen readers, contrast, touch targets
**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this UI feel *inevitable*?
* What 30-minute UI touches would make users think "oh nice, they thought of that"?
Required ASCII diagram: user flow showing screens/states and transitions.
If this plan has significant UI scope, recommend: "Consider running /plan-design-review for a deep design review of this plan before implementation."
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If this section turned up zero findings, state "No issues, moving on" and proceed. If the section has findings, you MUST call AskUserQuestion as a tool_use — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan. Do NOT proceed until the user responds.
**Reminder: Do NOT make any code changes. Review only.**
{{CODEX_PLAN_REVIEW}}
### Outside Voice Integration Rule
Outside voice findings are INFORMATIONAL until the user explicitly approves each one.
Do NOT incorporate outside voice recommendations into the plan without presenting each
finding via AskUserQuestion and getting explicit approval. This applies even when you
agree with the outside voice. Cross-model consensus is a strong signal — present it as
such — but the user makes the decision.
## Post-Implementation Design Audit (if UI scope detected)
After implementation, run `/design-review` on the live site to catch visual issues that can only be evaluated with rendered output.
## CRITICAL RULE — How to ask questions
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where reasonable.
* For each option: effort, risk, and maintenance burden in one line.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference.
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Zero findings:** if a section has zero findings, state "No issues, moving on" and proceed. Otherwise, use AskUserQuestion for each finding — a finding with an "obvious fix" is still a finding and still needs user approval before any change lands in the plan.
## Required Outputs
### "NOT in scope" section
List work considered and explicitly deferred, with one-line rationale each.
### "What already exists" section
List existing code/flows that partially solve sub-problems and whether the plan reuses them.
### "Dream state delta" section
Where this plan leaves us relative to the 12-month ideal.
### Error & Rescue Registry (from Section 2)
Complete table of every method that can fail, every exception class, rescued status, rescue action, user impact.
### Failure Modes Registry
```
CODEPATH | FAILURE MODE | RESCUED? | TEST? | USER SEES? | LOGGED?
---------|----------------|----------|-------|----------------|--------
```
Any row with RESCUED=N, TEST=N, USER SEES=Silent → **CRITICAL GAP**.
### TODOS.md updates
Present each potential TODO as its own individual AskUserQuestion. Never batch TODOs — one per question. Never silently skip this step. Follow the format in `.claude/skills/review/TODOS-format.md`.
For each TODO, describe:
* **What:** One-line description of the work.
* **Why:** The concrete problem it solves or value it unlocks.
* **Pros:** What you gain by doing this work.
* **Cons:** Cost, complexity, or risks of doing it.
* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L
* **Priority:** P1/P2/P3
* **Depends on / blocked by:** Any prerequisites or ordering constraints.
Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring.
### Scope Expansion Decisions (EXPANSION and SELECTIVE EXPANSION only)
For EXPANSION and SELECTIVE EXPANSION modes: expansion opportunities and delight items were surfaced and decided in Step 0D (opt-in/cherry-pick ceremony). The decisions are persisted in the CEO plan document. Reference the CEO plan for the full record. Do not re-surface them here — list the accepted expansions for completeness:
* Accepted: {list items added to scope}
* Deferred: {list items sent to TODOS.md}
* Skipped: {list items rejected}
### Diagrams (mandatory, produce all that apply)
1. System architecture
2. Data flow (including shadow paths)
3. State machine
4. Error flow
5. Deployment sequence
6. Rollback flowchart
### Stale Diagram Audit
List every ASCII diagram in files this plan touches. Still accurate?
{{TASKS_SECTION_EMIT:ceo-review}}
### Completion Summary
```
+====================================================================+
| MEGA PLAN REVIEW — COMPLETION SUMMARY |
+====================================================================+
| Mode selected | EXPANSION / SELECTIVE / HOLD / REDUCTION |
| System Audit | [key findings] |
| Step 0 | [mode + key decisions] |
| Section 1 (Arch) | ___ issues found |
| Section 2 (Errors) | ___ error paths mapped, ___ GAPS |
| Section 3 (Security)| ___ issues found, ___ High severity |
| Section 4 (Data/UX) | ___ edge cases mapped, ___ unhandled |
| Section 5 (Quality) | ___ issues found |
| Section 6 (Tests) | Diagram produced, ___ gaps |
| Section 7 (Perf) | ___ issues found |
| Section 8 (Observ) | ___ gaps found |
| Section 9 (Deploy) | ___ risks flagged |
| Section 10 (Future) | Reversibility: _/5, debt items: ___ |
| Section 11 (Design) | ___ issues / SKIPPED (no UI scope) |
+--------------------------------------------------------------------+
| NOT in scope | written (___ items) |
| What already exists | written |
| Dream state delta | written |
| Error/rescue registry| ___ methods, ___ CRITICAL GAPS |
| Failure modes | ___ total, ___ CRITICAL GAPS |
| TODOS.md updates | ___ items proposed |
| Scope proposals | ___ proposed, ___ accepted (EXP + SEL) |
| CEO plan | written / skipped (HOLD/REDUCTION) |
| Outside voice | ran (codex/claude) / skipped |
| Lake Score | X/Y recommendations chose complete option |
| Diagrams produced | ___ (list types) |
| Stale diagrams found | ___ |
| Unresolved decisions | ___ (listed below) |
+====================================================================+
```
### Unresolved Decisions
If any AskUserQuestion goes unanswered, note it here. Never silently default.
## Handoff Note Cleanup
After producing the Completion Summary, clean up any handoff notes for this branch —
the review is complete and the context is no longer needed.
```bash
setopt +o nomatch 2>/dev/null || true # zsh compat
{{SLUG_EVAL}}
rm -f ~/.gstack/projects/$SLUG/*-$BRANCH-ceo-handoff-*.md 2>/dev/null || true
```
## Review Log
After producing the Completion Summary above, persist the review result.
**PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes review metadata to
`~/.gstack/` (user config directory, not project files). The skill preamble
already writes to `~/.gstack/sessions/` and `~/.gstack/analytics/` — this is
the same pattern. The review dashboard depends on this data. Skipping this
command breaks the review readiness dashboard in /ship.
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-ceo-review","timestamp":"TIMESTAMP","status":"STATUS","unresolved":N,"critical_gaps":N,"mode":"MODE","scope_proposed":N,"scope_accepted":N,"scope_deferred":N,"commit":"COMMIT"}'
~/.claude/skills/gstack/bin/gstack-decision-log '{"decision":"CEO review (MODE): SCOPE_SUMMARY","rationale":"VERDICT","scope":"branch","source":"skill","confidence":8}' 2>/dev/null || true
```
The second command records the accepted scope as a durable cross-session decision so the next session sees what was settled (and why) without re-litigating it. It writes to `~/.gstack/` (same pattern as review-log), is non-interactive, and is best-effort (`|| true` — never blocks the review). Substitute `SCOPE_SUMMARY` (e.g. "accepted 4 of 6 proposals" for expansion, or "held scope" / "cut 3 items" for HOLD/REDUCTION) and `VERDICT` (the one-line verdict from the summary).
Before running this command, substitute the placeholder values from the Completion Summary you just produced:
- **TIMESTAMP**: current ISO 8601 datetime (e.g., 2026-03-16T14:30:00)
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" in the summary
- **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
- **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
- **scope_proposed**: number from "Scope proposals: ___ proposed" in the summary (0 for HOLD/REDUCTION)
- **scope_accepted**: number from "Scope proposals: ___ accepted" in the summary (0 for HOLD/REDUCTION)
- **scope_deferred**: number of items deferred to TODOS.md from scope decisions (0 for HOLD/REDUCTION)
- **COMMIT**: output of `git rev-parse --short HEAD`
{{REVIEW_DASHBOARD}}
{{PLAN_FILE_REVIEW_REPORT}}
## Next Steps — Review Chaining
After displaying the Review Readiness Dashboard, recommend the next review(s) based on what this CEO review discovered. Read the dashboard output to see which reviews have already been run and whether they are stale.
**Recommend /plan-eng-review if eng review is not skipped globally** — check the dashboard output for `skip_eng_review`. If it is `true`, eng review is opted out — do not recommend it. Otherwise, eng review is the required shipping gate. If this CEO review expanded scope, changed architectural direction, or accepted scope expansions, emphasize that a fresh eng review is needed. If an eng review already exists in the dashboard but the commit hash shows it predates this CEO review, note that it may be stale and should be re-run.
**Recommend /plan-design-review if UI scope was detected** — specifically if Section 11 (Design & UX Review) was NOT skipped, or if accepted scope expansions included UI-facing features. If an existing design review is stale (commit hash drift), note that. In SCOPE REDUCTION mode, skip this recommendation — design review is unlikely relevant for scope cuts.
**If both are needed, recommend eng review first** (required gate), then design review.
Use AskUserQuestion to present the next step. Include only applicable options:
- **A)** Run /plan-eng-review next (required gate)
- **B)** Run /plan-design-review next (only if UI scope detected)
- **C)** Skip — I'll handle reviews manually
## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)
At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
"The vision from this review produced {N} accepted scope expansions. Want to promote it to a design doc in the repo?"
- **A)** Promote to `docs/designs/{FEATURE}.md` (committed to repo, visible to the team)
- **B)** Keep in `~/.gstack/projects/` only (local, personal reference)
- **C)** Skip
If promoted, copy the CEO plan content to `docs/designs/{FEATURE}.md` (create the directory if needed) and update the `status` field in the original CEO plan from `ACTIVE` to `PROMOTED`.
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option.
* After each section, pause and wait for feedback.
* Use **CRITICAL GAP** / **WARNING** / **OK** for scannability.
{{LEARNINGS_LOG}}
{{GBRAIN_SAVE_RESULTS}}
{{BRAIN_WRITE_BACK}}
{{BRAIN_CACHE_REFRESH}}
## Mode Quick Reference
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ MODE COMPARISON │
├─────────────┬──────────────┬──────────────┬──────────────┬────────────────────┤
│ │ EXPANSION │ SELECTIVE │ HOLD SCOPE │ REDUCTION │
├─────────────┼──────────────┼──────────────┼──────────────┼────────────────────┤
│ Scope │ Push UP │ Hold + offer │ Maintain │ Push DOWN │
│ │ (opt-in) │ │ │ │
│ Recommend │ Enthusiastic │ Neutral │ N/A │ N/A │
│ posture │ │ │ │ │
│ 10x check │ Mandatory │ Surface as │ Optional │ Skip │
│ │ │ cherry-pick │ │ │
│ Platonic │ Yes │ No │ No │ No │
│ ideal │ │ │ │ │
│ Delight │ Opt-in │ Cherry-pick │ Note if seen │ Skip │
│ opps │ ceremony │ ceremony │ │ │
│ Complexity │ "Is it big │ "Is it right │ "Is it too │ "Is it the bare │
│ question │ enough?" │ + what else │ complex?" │ minimum?" │
│ │ │ is tempting"│ │ │
│ Taste │ Yes │ Yes │ No │ No │
│ calibration │ │ │ │ │
│ Temporal │ Full (hr 1-6)│ Full (hr 1-6)│ Key decisions│ Skip │
│ interrogate │ │ │ only │ │
│ Observ. │ "Joy to │ "Joy to │ "Can we │ "Can we see if │
│ standard │ operate" │ operate" │ debug it?" │ it's broken?" │
│ Deploy │ Infra as │ Safe deploy │ Safe deploy │ Simplest possible │
│ standard │ feature scope│ + cherry-pick│ + rollback │ deploy │
│ │ │ risk check │ │ │
│ Error map │ Full + chaos │ Full + chaos │ Full │ Critical paths │
│ │ scenarios │ for accepted │ │ only │
│ CEO plan │ Written │ Written │ Skipped │ Skipped │
│ Phase 2/3 │ Map accepted │ Map accepted │ Note it │ Skip │
│ planning │ │ cherry-picks │ │ │
│ Design │ "Inevitable" │ If UI scope │ If UI scope │ Skip │
│ (Sec 11) │ UI review │ detected │ detected │ │
└─────────────┴──────────────┴──────────────┴──────────────┴────────────────────┘
```