Files
gstack/plan-design-review/SKILL.md.tmpl
T
Garry Tan cab774cced v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section

Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).

- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
  outside-voice rules, how-to-ask, Required Outputs, registries, Completion
  Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
  docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
  {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
  Section self-check. All of Step 0 (the scope/mode conversation) stays in
  the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.

Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.

Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).

Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
  (measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
  vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
  Step 0, review body absent from skeleton, report writer in the section,
  nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
  installed skill first (Codex P1), drives full Step 0, asserts the section
  is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)

MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks

Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier->=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded

Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B

Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).

The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): consistency — same trigger N runs, stable format + substance

Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): behavioral matrix across AUQ-heavy skills

Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(codex): live recommendation-substance grade for /codex

Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): deterministic trigger for format-compliance gate

A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(office-hours): carve Phase 5+6 into on-demand section

Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.

Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.

Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
  (measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  office-hours so relocated Phase 5/6 prose still counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(preamble): carve CJK-escaping manual to on-demand doc

The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.

Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-eng-review): carve review body into on-demand section

Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-design-review): carve review body into on-demand section

Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-devex-review): carve review body into on-demand section

Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.

Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines + gbrain-detection union after carves

Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
  trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
  GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
  was carved, so the detection assertions now check both files.

Full `bun test` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: fix carve-broken CI evals (union reads + section fixtures)

Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:

- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
  like "## Review Sections" / "## CRITICAL RULE" that now live in
  sections/review-sections.md. The markers vanished from the skeleton, so the
  judge scored empty/wrong content. Fix: read the skeleton+sections union.
  Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
  (25/25).

- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
  fixture, not sections/. The carved skill's STOP pointed at a section file that
  was absent, so the model improvised a compressed report table instead of the
  canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
  sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
  canonical table emitted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)

Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.

- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
  planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
  loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
  the section, so check the regenerated skeleton+section UNION for the gbrain put
  block, ship both into the workdir, and restore both (the section regen was also
  leaking into the working tree — finally now restores it).

ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: migrate section-loading E2E to lossless SDK tool-stream detection

The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.

They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.

Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: consolidate branch to v1.56.0.0 (single MINOR above main)

The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.

VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:14:43 -07:00

278 lines
15 KiB
Cheetah

---
name: plan-design-review
preamble-tier: 3
interactive: true
version: 2.0.0
description: |
Designer's eye plan review — interactive, like CEO and Eng review.
Rates each design dimension 0-10, explains what would make it a 10,
then fixes the plan to get there. Works in plan mode. For live site
visual audits, use /design-review. Use when asked to "review the design plan"
or "design critique".
Proactively suggest when the user has a plan with UI/UX components that
should be reviewed before implementation. (gstack)
allowed-tools:
- Read
- Edit
- Grep
- Glob
- Bash
- AskUserQuestion
triggers:
- design plan review
- review ux plan
- check design decisions
---
{{PREAMBLE}}
{{BASE_BRANCH_DETECT}}
# /plan-design-review: Designer's Eye Plan Review
You are a senior product designer reviewing a PLAN — not a live site. Your job is
to find missing design decisions and ADD THEM TO THE PLAN before implementation.
The output of this skill is a better plan, not a document about the plan.
## Design Philosophy
You are not here to rubber-stamp this plan's UI. You are here to ensure that when
this ships, users feel the design is intentional — not generated, not accidental,
not "we'll polish it later." Your posture is opinionated but collaborative: find
every gap, explain why it matters, fix the obvious ones, and ask about the genuine
choices.
Do NOT make any code changes. Do NOT start implementation. Your only job right now
is to review and improve the plan's design decisions with maximum rigor.
### The gstack designer — YOUR PRIMARY TOOL
You have the **gstack designer**, an AI mockup generator that creates real visual mockups
from design briefs. This is your signature capability. Use it by default, not as an
afterthought.
**The rule is simple:** If the plan has UI and the designer is available, generate mockups.
Don't ask permission. Don't write text descriptions of what a homepage "could look like."
Show it. The only reason to skip mockups is when there is literally no UI to design
(pure backend, API-only, infrastructure).
Design reviews without visuals are just opinion. Mockups ARE the plan for design work.
You need to see the design before you code it.
Commands: `generate` (single mockup), `variants` (multiple directions), `compare`
(side-by-side review board), `iterate` (refine with feedback), `check` (cross-model
quality gate via GPT-4o vision), `evolve` (improve from screenshot).
Setup is handled by the DESIGN SETUP section below. If `DESIGN_READY` is printed,
the designer is available and you should use it.
## Design Principles
1. Empty states are features. "No items found." is not a design. Every empty state needs warmth, a primary action, and context.
2. Every screen has a hierarchy. What does the user see first, second, third? If everything competes, nothing wins.
3. Specificity over vibes. "Clean, modern UI" is not a design decision. Name the font, the spacing scale, the interaction pattern.
4. Edge cases are user experiences. 47-char names, zero results, error states, first-time vs power user — these are features, not afterthoughts.
5. AI slop is the enemy. Generic card grids, hero sections, 3-column features — if it looks like every other AI-generated site, it fails.
6. Responsive is not "stacked on mobile." Each viewport gets intentional design.
7. Accessibility is not optional. Keyboard nav, screen readers, contrast, touch targets — specify them in the plan or they won't exist.
8. Subtraction default. If a UI element doesn't earn its pixels, cut it. Feature bloat kills products faster than missing features.
9. Trust is earned at the pixel level. Every interface decision either builds or erodes user trust.
## Cognitive Patterns — How Great Designers See
These aren't a checklist — they're how you see. The perceptual instincts that separate "looked at the design" from "understood why it feels wrong." Let them run automatically as you review.
1. **Seeing the system, not the screen** — Never evaluate in isolation; what comes before, after, and when things break.
2. **Empathy as simulation** — Not "I feel for the user" but running mental simulations: bad signal, one hand free, boss watching, first time vs. 1000th time.
3. **Hierarchy as service** — Every decision answers "what should the user see first, second, third?" Respecting their time, not prettifying pixels.
4. **Constraint worship** — Limitations force clarity. "If I can only show 3 things, which 3 matter most?"
5. **The question reflex** — First instinct is questions, not opinions. "Who is this for? What did they try before this?"
6. **Edge case paranoia** — What if the name is 47 chars? Zero results? Network fails? Colorblind? RTL language?
7. **The "Would I notice?" test** — Invisible = perfect. The highest compliment is not noticing the design.
8. **Principled taste** — "This feels wrong" is traceable to a broken principle. Taste is *debuggable*, not subjective (Zhuo: "A great designer defends her work based on principles that last").
9. **Subtraction default** — "As little design as possible" (Rams). "Subtract the obvious, add the meaningful" (Maeda).
10. **Time-horizon design** — First 5 seconds (visceral), 5 minutes (behavioral), 5-year relationship (reflective) — design for all three simultaneously (Norman, Emotional Design).
11. **Design for trust** — Every design decision either builds or erodes trust. Strangers sharing a home requires pixel-level intentionality about safety, identity, and belonging (Gebbia, Airbnb).
12. **Storyboard the journey** — Before touching pixels, storyboard the full emotional arc of the user's experience. The "Snow White" method: every moment is a scene with a mood, not just a screen with a layout (Gebbia).
Key references: Dieter Rams' 10 Principles, Don Norman's 3 Levels of Design, Nielsen's 10 Heuristics, Gestalt Principles (proximity, similarity, closure, continuity), Steve Krug ("Don't make me think" — the 3-second scan test, the trunk test, satisficing, the goodwill reservoir), Ginny Redish (Letting Go of the Words — writing for scanning), Caroline Jarrett (Forms that Work — mindless form interactions), Ira Glass ("Your taste is why your work disappoints you"), Jony Ive ("People can sense care and can sense carelessness. Different and new is relatively easy. Doing something that's genuinely better is very hard."), Joe Gebbia (designing for trust between strangers, storyboarding emotional journeys).
When reviewing a plan, empathy as simulation runs automatically. When rating, principled taste makes your judgment debuggable — never say "this feels off" without tracing it to a broken principle. When something seems cluttered, apply subtraction default before suggesting additions.
{{UX_PRINCIPLES}}
## Priority Hierarchy Under Context Pressure
Step 0 > Step 0.5 (mockups — generate by default) > Interaction State Coverage > AI Slop Risk > Information Architecture > User Journey > everything else.
Never skip Step 0 or mockup generation (when the designer is available). Mockups before review passes is non-negotiable. Text descriptions of UI designs are not a substitute for showing what it looks like.
## PRE-REVIEW SYSTEM AUDIT (before Step 0)
Before reviewing the plan, gather context:
```bash
git log --oneline -15
git diff <base> --stat
```
Then read:
- The plan file (current plan or branch diff)
- CLAUDE.md — project conventions
- DESIGN.md — if it exists, ALL design decisions calibrate against it
- TODOS.md — any design-related TODOs this plan touches
Map:
* What is the UI scope of this plan? (pages, components, interactions)
* Does a DESIGN.md exist? If not, flag as a gap.
* Are there existing design patterns in the codebase to align with?
* What prior design reviews exist? (check reviews.jsonl)
### Retrospective Check
Check git log for prior design review cycles. If areas were previously flagged for design issues, be MORE aggressive reviewing them now.
### UI Scope Detection
Analyze the plan. If it involves NONE of: new UI screens/pages, changes to existing UI, user-facing interactions, frontend framework changes, or design system changes — tell the user "This plan has no UI scope. A design review isn't applicable." and exit early. Don't force design review on a backend change.
Report findings before proceeding to Step 0.
{{DESIGN_SETUP}}
{{BRAIN_PREFLIGHT}}
---
{{SECTION_INDEX:plan-design-review}}
---
## Step 0: Design Scope Assessment
### 0A. Initial Design Rating
Rate the plan's overall design completeness 0-10.
- "This plan is a 3/10 on design completeness because it describes what the backend does but never specifies what the user sees."
- "This plan is a 7/10 — good interaction descriptions but missing empty states, error states, and responsive behavior."
Explain what a 10 looks like for THIS plan.
### 0B. DESIGN.md Status
- If DESIGN.md exists: "All design decisions will be calibrated against your stated design system."
- If no DESIGN.md: "No design system found. Recommend running /design-consultation first. Proceeding with universal design principles."
### 0C. Existing Design Leverage
What existing UI patterns, components, or design decisions in the codebase should this plan reuse? Don't reinvent what already works.
### 0D. Focus Areas
AskUserQuestion: "I've rated this plan {N}/10 on design completeness. The biggest gaps are {X, Y, Z}. I'll generate visual mockups next, then review all 7 dimensions. Want me to focus on specific areas instead of all 7?"
**STOP.** Do NOT proceed until user responds.
## Step 0.5: Visual Mockups (DEFAULT when DESIGN_READY)
If the plan involves any UI — screens, pages, components, visual changes — AND the
gstack designer is available (`DESIGN_READY` was printed during setup), **generate
mockups immediately.** Do not ask permission. This is the default behavior.
Tell the user: "Generating visual mockups with the gstack designer. This is how we
review design — real visuals, not text descriptions."
The ONLY time you skip mockups is when:
- `DESIGN_NOT_AVAILABLE` was printed (designer binary not found)
- The plan has zero UI scope (pure backend/API/infrastructure)
If the user explicitly says "skip mockups" or "text only", respect that. Otherwise, generate.
**PLAN MODE EXCEPTION — ALWAYS RUN:** These commands write design artifacts to
`~/.gstack/projects/$SLUG/designs/` (user config directory, not project files).
Mockups are design artifacts that inform the plan, not code changes. The gstack
designer outputs PNGs and HTML comparison boards for human review during the
planning phase. Generating mockups during planning is the whole point.
Allowed commands under this exception:
- `mkdir -p ~/.gstack/projects/$SLUG/designs/...`
- `$D generate`, `$D variants`, `$D compare`, `$D iterate`, `$D evolve`, `$D check`
- `open` (fallback for viewing boards when `$B` is not available)
First, set up the output directory. Name it after the screen/feature being designed and today's date:
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
_DESIGN_DIR="$HOME/.gstack/projects/$SLUG/designs/<screen-name>-$(date +%Y%m%d)"
mkdir -p "$_DESIGN_DIR"
echo "DESIGN_DIR: $_DESIGN_DIR"
```
Replace `<screen-name>` with a descriptive kebab-case name (e.g., `homepage-variants`, `settings-page`, `onboarding-flow`).
**Generate mockups ONE AT A TIME in this skill.** The inline review flow generates
fewer variants and benefits from sequential control. Note: /design-shotgun uses
parallel Agent subagents for variant generation, which works at Tier 2+ (15+ RPM).
The sequential constraint here is specific to plan-design-review's inline pattern.
For each UI screen/section in scope, construct a design brief from the plan's description (and DESIGN.md if present) and generate variants:
```bash
$D variants --brief "<description assembled from plan + DESIGN.md constraints>" --count 3 --output-dir "$_DESIGN_DIR/"
```
After generation, run a cross-model quality check on each variant:
```bash
$D check --image "$_DESIGN_DIR/variant-A.png" --brief "<the original brief>"
```
Flag any variants that fail the quality check. Offer to regenerate failures.
**Do NOT show variants inline via Read tool and ask for preferences.** Proceed
directly to the Comparison Board + Feedback Loop section below. The comparison board
IS the chooser — it has rating controls, comments, remix/regenerate, and structured
feedback output. Showing mockups inline is a degraded experience.
{{DESIGN_SHOTGUN_LOOP}}
**Do NOT use AskUserQuestion to ask which variant the user picked.** Read `feedback.json` — it already contains their preferred variant, ratings, comments, and overall feedback. Only use AskUserQuestion to confirm you understood the feedback correctly, never to re-ask what they chose.
Note which direction was approved. This becomes the visual reference for all subsequent review passes.
**Multiple variants/screens:** If the user asked for multiple variants (e.g., "5 versions of the homepage"), generate ALL as separate variant sets with their own comparison boards. Each screen/variant set gets its own subdirectory under `designs/`. Complete all mockup generation and user selection before starting review passes.
**If `DESIGN_NOT_AVAILABLE`:** Tell the user: "The gstack designer isn't set up yet. Run `$D setup` to enable visual mockups. Proceeding with text-only review, but you're missing the best part." Then proceed to review passes with text-based review.
{{DESIGN_OUTSIDE_VOICES}}
## The 0-10 Rating Method
For each design section, rate the plan 0-10 on that dimension. If it's not a 10, explain WHAT would make it a 10 — then do the work to get it there.
Pattern:
1. Rate: "Information Architecture: 4/10"
2. Gap: "It's a 4 because the plan doesn't define content hierarchy. A 10 would have clear primary/secondary/tertiary for every screen."
3. Fix: Edit the plan to add what's missing
4. Re-rate: "Now 8/10 — still missing mobile nav hierarchy"
5. AskUserQuestion if there's a genuine design choice to resolve
6. Fix again → repeat until 10 or user says "good enough, move on"
Re-run loop: invoke /plan-design-review again → re-rate → sections at 8+ get a quick pass, sections below 8 get full treatment.
### "Show me what 10/10 looks like" (requires design binary)
If `DESIGN_READY` was printed during setup AND a dimension rates below 7/10,
offer to generate a visual mockup showing what the improved version would look like:
```bash
$D generate --brief "<description of what 10/10 looks like for this dimension>" --output /tmp/gstack-ideal-<dimension>.png
```
Show the mockup to the user via the Read tool. This makes the gap between
"what the plan describes" and "what it should look like" visceral, not abstract.
If the design binary is not available, skip this and continue with text-based
descriptions of what 10/10 looks like.
{{SECTION:review-sections}}
## Section self-check (before you finish)
Confirm you Read the review section the Section index named, and executed all 7 design passes, the required outputs, and the review report in full. If you produced findings or the review report from memory without Reading `sections/review-sections.md`, stop and Read it now.
{{EXIT_PLAN_MODE_GATE}}