mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 03:35:09 +02:00
623 lines
36 KiB
Markdown
# Design: gstack Visual Design Generation (`design` binary)

Generated by /office-hours on 2026-03-26
Branch: garrytan/agent-design-tools
Repo: gstack
Status: DRAFT
Mode: Intrapreneurship

## Context

gstack's design skills (/office-hours, /design-consultation, /plan-design-review, /design-review) all produce **text descriptions** of design — DESIGN.md files with hex codes, plan docs with pixel specs in prose, ASCII art wireframes. The creator is a designer who hand-designed HelloSign in OmniGraffle and finds this embarrassing.

The unit of value is wrong. Users don't need richer design language — they need an executable visual artifact that changes the conversation from "do you like this spec?" to "is this the screen?"

## Problem Statement

Design skills describe design in text instead of showing it. The Argus UX overhaul plan is the canonical example: 487 lines of detailed emotional-arc specs, typography choices, and animation timing — and zero visual artifacts. An AI coding agent that "designs" should produce something you can look at and react to viscerally.

## Demand Evidence

The creator/primary user finds the current output embarrassing. Every design skill session ends with prose where a mockup should be. GPT Image API now generates pixel-perfect UI mockups with accurate text rendering — the capability gap that justified text-only output no longer exists.

## Narrowest Wedge

A compiled TypeScript binary (`design/dist/design`) that wraps the OpenAI Images/Responses API, callable from skill templates via `$D` (mirroring the existing `$B` browse binary pattern). Priority integration order: /office-hours → /plan-design-review → /design-consultation → /design-review.

## Agreed Premises

1. GPT Image API (via OpenAI Responses API) is the right engine. Google Stitch SDK is the backup.
2. **Visual mockups are default-on for design skills** with an easy skip path — not opt-in. (Revised per Codex challenge.)
3. The integration is a shared utility (not a per-skill reimplementation) — a `design` binary that any skill can call.
4. Priority: /office-hours first, then /plan-design-review, /design-consultation, /design-review.

## Cross-Model Perspective (Codex)

Codex independently validated the core thesis: "The failure is not output quality within markdown; it is that the current unit of value is wrong." Key contributions:

- Challenged premise #2 (opt-in → default-on) — accepted
- Proposed a vision-based quality gate: use GPT-4o vision to check generated mockups for unreadable text, missing sections, and broken layout, with one auto-retry
- Scoped a 48-hour prototype: shared `visual_mockup.ts` utility, /office-hours + /plan-design-review only, hero mockup + 2 variants

## Recommended Approach: `design` Binary (Approach B)

### Architecture

**Shares the browse binary's compilation and distribution pattern** (bun build --compile, setup script, $VARIABLE resolution in skill templates) but is architecturally simpler — no persistent daemon server, no Chromium, no health checks, no token auth. The design binary is a stateless CLI that makes OpenAI API calls and writes PNGs to disk. Session state (for multi-turn iteration) is a JSON file.

**New dependency:** the `openai` npm package (added to `devDependencies`, NOT runtime deps). The design binary is compiled separately from browse so openai doesn't bloat the browse binary.

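The session file can be a small JSON blob holding the last response ID plus enough context to recover if threading fails. A minimal sketch (the field names here are illustrative assumptions — the real `session.ts` is authoritative):

```typescript
import { readFileSync, writeFileSync, existsSync } from "node:fs";

// Hypothetical shape — the real session.ts may track more fields.
interface DesignSession {
  responseId: string; // last OpenAI response ID, for multi-turn threading
  brief: string;      // original brief, re-sent if threading loses visual context
  mockups: string[];  // paths of PNGs generated so far
}

function loadSession(path: string): DesignSession | null {
  if (!existsSync(path)) return null;
  return JSON.parse(readFileSync(path, "utf8")) as DesignSession;
}

function saveSession(path: string, session: DesignSession): void {
  writeFileSync(path, JSON.stringify(session, null, 2));
}
```

Because the binary is stateless, every `iterate` call loads this file, threads off `responseId`, and writes the file back — no daemon required.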
```
design/
├── src/
│   ├── cli.ts            # Entry point, command dispatch
│   ├── commands.ts       # Command registry (source of truth for docs + validation)
│   ├── generate.ts       # Generate mockups from a structured brief
│   ├── iterate.ts        # Multi-turn iteration on existing mockups
│   ├── variants.ts       # Generate N design variants from a brief
│   ├── check.ts          # Vision-based quality gate (GPT-4o)
│   ├── brief.ts          # Structured brief type + assembly helpers
│   └── session.ts        # Session state (response IDs for multi-turn)
├── dist/
│   ├── design            # Compiled binary
│   └── .version          # Git hash
└── test/
    └── design.test.ts    # Integration tests
```

### Commands

```bash
# Generate a hero mockup from a structured brief
$D generate --brief "Dashboard for a coding assessment tool. Dark theme, cream accents. Shows: builder name, score badge, narrative letter, score cards. Target: technical users." --output /tmp/mockup-hero.png

# Generate 3 design variants
$D variants --brief "..." --count 3 --output-dir /tmp/mockups/

# Iterate on an existing mockup with feedback
$D iterate --session /tmp/design-session.json --feedback "Make the score cards larger, move the narrative above the scores" --output /tmp/mockup-v2.png

# Vision-based quality check (returns PASS/FAIL + issues)
$D check --image /tmp/mockup-hero.png --brief "Dashboard with builder name, score badge, narrative"

# One-shot with quality gate + auto-retry
$D generate --brief "..." --output /tmp/mockup.png --check --retry 1

# Pass a structured brief via JSON file
$D generate --brief-file /tmp/brief.json --output /tmp/mockup.png

# Generate comparison board HTML for user review
$D compare --images /tmp/mockups/variant-*.png --output /tmp/design-board.html

# Guided API key setup + smoke test
$D setup
```
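The `--check --retry 1` combination is a simple gate loop: generate, run the vision check, and regenerate once on FAIL with the reported issues folded into the brief. A sketch of that control flow (the helper signatures are hypothetical stand-ins for the real `generate.ts` / `check.ts`):

```typescript
// Hypothetical signatures — stand-ins, not the real implementation.
type CheckResult = { pass: boolean; issues: string[] };

async function generateWithGate(
  generate: (brief: string) => Promise<string>, // returns path of written PNG
  check: (image: string, brief: string) => Promise<CheckResult>,
  brief: string,
  retries: number,
): Promise<{ image: string; check: CheckResult }> {
  let image = await generate(brief);
  let result = await check(image, brief);
  // Auto-retry: regenerate with the reported issues appended to the brief.
  for (let i = 0; i < retries && !result.pass; i++) {
    const amended = `${brief}\nAvoid these issues: ${result.issues.join("; ")}`;
    image = await generate(amended);
    result = await check(image, brief);
  }
  return { image, check: result };
}
```

The check is always run against the *original* brief, so a retry can't pass by satisfying a weaker amended prompt.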

**Brief input modes:**

- `--brief "plain text"` — free-form text prompt (simple mode)
- `--brief-file path.json` — structured JSON matching the `DesignBrief` interface (rich mode)
- Skills construct a JSON brief file, write it to /tmp, and pass `--brief-file`

**All commands are registered in `commands.ts`**, including `--check` and `--retry` as flags on `generate`.

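For concreteness, here is one possible shape for the structured brief and its flattening into a prompt string. The field names are illustrative assumptions; the real interface in `brief.ts` is the contract:

```typescript
// Illustrative shape for --brief-file JSON; brief.ts defines the real contract.
interface DesignBrief {
  screen: string;     // e.g. "Dashboard for a coding assessment tool"
  theme?: string;     // e.g. "Dark theme, cream accents"
  sections?: string[]; // e.g. ["builder name", "score badge", "narrative letter"]
  audience?: string;  // e.g. "technical users"
  designMd?: string;  // extracted DESIGN.md constraints, if any
}

// Flatten a structured brief into the prompt string sent to the image API.
function briefToPrompt(b: DesignBrief): string {
  const parts = [b.screen];
  if (b.theme) parts.push(b.theme);
  if (b.sections?.length) parts.push(`Shows: ${b.sections.join(", ")}.`);
  if (b.audience) parts.push(`Target: ${b.audience}.`);
  if (b.designMd) parts.push(`Match this design language:\n${b.designMd}`);
  return parts.join(" ");
}
```

A skill would serialize a `DesignBrief` to `/tmp/brief.json` and invoke `$D generate --brief-file /tmp/brief.json`.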
### Design Exploration Workflow (from eng review)

The workflow is sequential, not parallel. PNGs are for visual exploration (human-facing); HTML wireframes are for implementation (agent-facing):

```
1. $D variants --brief "..." --count 3 --output-dir /tmp/mockups/
   → Generates 2-5 PNG mockup variations

2. $D compare --images /tmp/mockups/*.png --output /tmp/design-board.html
   → Generates HTML comparison board (spec below)

3. $B goto file:///tmp/design-board.html
   → User reviews all variants in headed Chrome

4. User picks favorite, rates, comments, clicks [Submit]
   Agent polls: $B eval document.getElementById('status').textContent
   Agent reads: $B eval document.getElementById('feedback-result').textContent
   → No clipboard, no pasting. Agent reads feedback directly from the page.

5. Claude generates HTML wireframe via DESIGN_SKETCH matching the approved direction
   → Agent implements from the inspectable HTML, not the opaque PNG
```
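Step 4's polling loop can be sketched as follows, assuming a generic `evalInPage` wrapper around `$B eval` (the wrapper is hypothetical — the actual skill templates shell out to `$B` directly):

```typescript
// Hypothetical wrapper: runs a JS expression in the headed Chrome tab via
// `$B eval` and returns its result as a string.
type EvalInPage = (js: string) => Promise<string>;

async function waitForFeedback(
  evalInPage: EvalInPage,
  intervalMs = 5000,
  timeoutMs = 5 * 60 * 1000,
): Promise<unknown> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await evalInPage(
      "document.getElementById('status').textContent",
    );
    if (status === "submitted") {
      const raw = await evalInPage(
        "document.getElementById('feedback-result').textContent",
      );
      return JSON.parse(raw); // structured feedback: pick, ratings, comments
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error("Timed out waiting for comparison-board feedback");
}
```

The agent only ever reads two DOM ids, so the board can evolve its layout freely as long as `#status` and `#feedback-result` keep their meaning.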

### Comparison Board Design Spec (from /plan-design-review)

**Classifier: APP UI** (task-focused, utility page). No product branding.

**Layout: single column, full-width mockups.** Each variant gets the full viewport width for maximum image fidelity. Users scroll vertically through variants.

```
┌─────────────────────────────────────────────────────────────┐
│ HEADER BAR                                                  │
│ "Design Exploration" . project name . "3 variants"          │
│ Mode indicator: [Wide exploration] | [Matching DESIGN.md]   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ VARIANT A (full width)                                │  │
│  │ [ mockup PNG, max-width: 1200px ]                     │  │
│  ├───────────────────────────────────────────────────────┤  │
│  │ (●) Pick  ★★★★☆  [What do you like/dislike?____]      │  │
│  │ [More like this]                                      │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ VARIANT B (full width)                                │  │
│  │ [ mockup PNG, max-width: 1200px ]                     │  │
│  ├───────────────────────────────────────────────────────┤  │
│  │ ( ) Pick  ★★★☆☆  [What do you like/dislike?____]      │  │
│  │ [More like this]                                      │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│ ... (scroll for more variants)                              │
│                                                             │
│ ─── separator ───────────────────────────────────────────── │
│ Overall direction (optional, collapsed by default)          │
│ [textarea, 3 lines, expand on focus]                        │
│                                                             │
│ ─── REGENERATE BAR (#f7f7f7 bg) ─────────────────────────── │
│ "Want to explore more?"                                     │
│ [Totally different] [Match my design] [Custom: ______]      │
│ [Regenerate ->]                                             │
│ ─────────────────────────────────────────────────────────── │
│                                              [ ✓ Submit ]   │
└─────────────────────────────────────────────────────────────┘
```

**Visual spec:**
- Background: #fff. No shadows, no card borders. Variant separation: 1px #e5e5e5 line.
- Typography: system font stack. Header: 16px semibold. Labels: 14px semibold. Feedback placeholder: 13px regular #999.
- Star rating: 5 clickable stars, filled=#000, unfilled=#ddd. Not colored, not animated.
- Radio button "Pick": explicit favorite selection. One per variant, mutually exclusive.
- "More like this" button: per-variant, triggers regeneration with that variant's style as seed.
- Submit button: #000 background, white text, right-aligned. Single CTA.
- Regenerate bar: #f7f7f7 background, visually distinct from feedback area.
- Max-width: 1200px centered for mockup images. Margins: 24px sides.

**Interaction states:**
- Loading (page opens before images ready): skeleton pulse with "Generating variant A..." per card. Stars/textarea/pick disabled.
- Partial failure (2 of 3 succeed): show the good ones, render an error card for the failed one with a per-variant [Retry].
- Post-submit: "Feedback submitted! Return to your coding agent." Page stays open.
- Regeneration: smooth transition, fade out old variants, skeleton pulses, fade in new. Scroll resets to top. Previous feedback cleared.

**Feedback JSON structure** (written to a hidden #feedback-result element):

```json
{
  "preferred": "A",
  "ratings": { "A": 4, "B": 3, "C": 2 },
  "comments": {
    "A": "Love the spacing, header feels right",
    "B": "Too busy, but good color palette",
    "C": "Wrong mood entirely"
  },
  "overall": "Go with A, make the CTA bigger",
  "regenerated": false
}
```
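
On the agent side this payload needs a defensive parse, since it comes off a live DOM. A minimal sketch, assuming the JSON shape above; the `BoardFeedback` and `parseFeedback` names are hypothetical, not part of the spec:

```typescript
// Sketch of the feedback payload type matching the JSON above.
// Names (BoardFeedback, parseFeedback) are illustrative.
interface BoardFeedback {
  preferred: string;
  ratings: Record<string, number>;
  comments: Record<string, string>;
  overall: string;
  regenerated: boolean;
}

function parseFeedback(raw: string): BoardFeedback | null {
  try {
    const f = JSON.parse(raw) as BoardFeedback;
    // Minimal shape check before the agent acts on it
    return typeof f.preferred === "string" && f.ratings ? f : null;
  } catch {
    return null; // page not submitted yet, or malformed
  }
}
```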

**Accessibility:** Star ratings keyboard navigable (arrow keys). Textareas labeled ("Feedback for Variant A"). Submit/Regenerate keyboard accessible with a visible focus ring. All text #333 or darker on white.

**Responsive:** >1200px: comfortable margins. 768-1200px: tighter margins. <768px: full-width, no horizontal scroll.

**Screenshot consent (first-time only for $D evolve):** "This will send a screenshot of your live site to OpenAI for design evolution. [Proceed] [Don't ask again]" Stored in ~/.gstack/config.yaml as design_screenshot_consent.

Why sequential: the Codex adversarial review identified that raster PNGs are opaque to agents (no DOM, no states, no diffable structure). HTML wireframes preserve a bridge back to code. The PNG is for the human to say "yes, that's right." The HTML is for the agent to say "I know how to build this."

### Key Design Decisions

**1. Stateless CLI, not daemon**
Browse needs a persistent Chromium instance. Design is just API calls — no reason for a server. Session state for multi-turn iteration is a JSON file written to `/tmp/design-session-{id}.json` containing `previous_response_id`.
- **Session ID:** generated from `${PID}-${timestamp}`, passed via the `--session` flag
- **Discovery:** the `generate` command creates the session file and prints its path; `iterate` reads it via `--session`
- **Cleanup:** session files in /tmp are ephemeral (the OS cleans them up); no explicit cleanup needed
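
The session lifecycle above can be sketched in a few lines. A minimal sketch, assuming the `/tmp/design-session-{id}.json` layout from this section; the helper names are illustrative, and `os.tmpdir()` stands in for the literal `/tmp`:

```typescript
// Minimal sketch of the session file lifecycle described above.
// DesignSession fields beyond lastResponseId are illustrative.
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

interface DesignSession {
  id: string;
  lastResponseId?: string;
}

const sessionPath = (id: string) =>
  path.join(os.tmpdir(), `design-session-${id}.json`);

// `generate` creates the session file and prints its path
function createSession(): DesignSession {
  const session: DesignSession = { id: `${process.pid}-${Date.now()}` };
  fs.writeFileSync(sessionPath(session.id), JSON.stringify(session));
  return session;
}

// `iterate` reads it back via --session
function loadSession(id: string): DesignSession {
  return JSON.parse(fs.readFileSync(sessionPath(id), "utf8"));
}
```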

**2. Structured brief input**
The brief is the interface between skill prose and image generation. Skills construct it from design context:

```typescript
interface DesignBrief {
  goal: string;         // "Dashboard for coding assessment tool"
  audience: string;     // "Technical users, YC partners"
  style: string;        // "Dark theme, cream accents, minimal"
  elements: string[];   // ["builder name", "score badge", "narrative letter"]
  constraints?: string; // "Max width 1024px, mobile-first"
  reference?: string;   // Path to existing screenshot or DESIGN.md excerpt
  screenType: string;   // "desktop-dashboard" | "mobile-app" | "landing-page" | etc.
}
```
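
One possible shape for flattening the brief into a prompt (the API Details section calls `briefToPrompt(brief)`). The exact prompt wording here is a sketch under my own assumptions, not the shipped implementation:

```typescript
// Illustrative briefToPrompt(): flattens the structured brief into prose
// for the Responses API call. Prompt wording is a sketch, not final.
interface DesignBrief {
  goal: string;
  audience: string;
  style: string;
  elements: string[];
  constraints?: string;
  reference?: string;
  screenType: string;
}

function briefToPrompt(b: DesignBrief): string {
  const lines = [
    `Generate a realistic ${b.screenType} UI mockup.`,
    `Goal: ${b.goal}`,
    `Audience: ${b.audience}`,
    `Style: ${b.style}`,
    `Required elements: ${b.elements.join(", ")}`,
  ];
  if (b.constraints) lines.push(`Constraints: ${b.constraints}`);
  return lines.join("\n");
}
```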

**3. Default-on in design skills**
Skills generate mockups by default. The template includes skip language:

```
Generating visual mockup of the proposed design... (say "skip" if you don't need visuals)
```

**4. Vision quality gate**
After generating, optionally pass the image through GPT-4o vision to check:
- Text readability (are labels/headings legible?)
- Layout completeness (are all requested elements present?)
- Visual coherence (does it look like a real UI, not a collage?)

Auto-retry once on failure. If it still fails, present anyway with a warning.

**5. Output location: explorations in /tmp, approved finals in `docs/designs/`**
- Exploration variants go to `/tmp/gstack-mockups-{session}/` (ephemeral, not committed)
- Only the **user-approved final** mockup gets saved to `docs/designs/` (checked in)
- Default output directory configurable via the CLAUDE.md `design_output_dir` setting
- Filename pattern: `{skill}-{description}-{timestamp}.png`
- Create `docs/designs/` if it doesn't exist (mkdir -p)
- The design doc references the committed image path
- Always show to the user via the Read tool (which renders images inline in Claude Code)
- This avoids repo bloat: only approved designs are committed, not every exploration variant
- Fallback: if not in a git repo, save to `/tmp/gstack-mockup-{timestamp}.png`

**6. Trust boundary acknowledgment**
Default-on generation sends design brief text to OpenAI. This is a new external data flow versus the existing HTML wireframe path, which is entirely local. The brief contains only abstract design descriptions (goal, style, elements), never source code or user data. Screenshots from $B are NOT sent to OpenAI (the `reference` field in DesignBrief is a local file path used by the agent, not uploaded to the API). Document this in CLAUDE.md.

**7. Rate limit mitigation**
Variant generation uses staggered parallel starts: launch each API call 1 second apart via `Promise.allSettled()` with delays. This avoids the 5-7 RPM rate limit on image generation while still being faster than fully serial. If any call hits a 429, retry with exponential backoff (2s, 4s, 8s).
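
The stagger-plus-backoff scheme can be sketched generically. This is a sketch under the assumptions above; the `err.status === 429` check assumes the OpenAI SDK's error shape, and the helper names are hypothetical:

```typescript
// Sketch of decision 7: staggered starts + exponential backoff on 429.
const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

async function withBackoff<T>(
  fn: () => Promise<T>,
  delays = [2000, 4000, 8000] // 2s, 4s, 8s per the doc
): Promise<T> {
  for (const delayMs of delays) {
    try {
      return await fn();
    } catch (err: any) {
      if (err?.status !== 429) throw err; // only retry rate limits
      await sleep(delayMs);
    }
  }
  return fn(); // final attempt; let any error propagate
}

async function staggered<T>(tasks: Array<() => Promise<T>>, gapMs = 1000) {
  // Launch each task gapMs after the previous one, then wait for all.
  return Promise.allSettled(
    tasks.map((task, i) => sleep(i * gapMs).then(() => withBackoff(task)))
  );
}
```

`Promise.allSettled` (rather than `Promise.all`) is what makes the partial-failure state on the comparison board possible: one variant's 429 exhaustion doesn't reject the whole batch.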

### Template Integration

**Add to existing resolver:** `scripts/resolvers/design.ts` (NOT a new file)
- Add `generateDesignSetup()` for the `{{DESIGN_SETUP}}` placeholder (mirrors `generateBrowseSetup()`)
- Add `generateDesignMockup()` for the `{{DESIGN_MOCKUP}}` placeholder (full exploration workflow)
- Keeps all design resolvers in one file (consistent with existing codebase convention)

**New HostPaths entry:** `types.ts`

```typescript
// claude host:
designDir: '~/.claude/skills/gstack/design/dist'

// codex host:
designDir: '$GSTACK_DESIGN'
```

Note: Codex runtime setup (the `setup` script) must also export the `GSTACK_DESIGN` env var, similar to how `GSTACK_BROWSE` is set.

**`$D` resolution bash block** (generated by `{{DESIGN_SETUP}}`):

```bash
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
D=""
[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/design/dist/design" ] && D="$_ROOT/.claude/skills/gstack/design/dist/design"
[ -z "$D" ] && D=~/.claude/skills/gstack/design/dist/design
if [ -x "$D" ]; then
  echo "DESIGN_READY: $D"
else
  echo "DESIGN_NOT_AVAILABLE"
fi
```

If `DESIGN_NOT_AVAILABLE`: skills fall back to HTML wireframe generation (the existing `DESIGN_SKETCH` pattern). Design mockup is a progressive enhancement, not a hard requirement.

### Skill Integration (Priority Order)

**1. /office-hours** — Replace the Visual Sketch section
- After approach selection (Phase 4), generate hero mockup + 2 variants
- Present all three via the Read tool, ask the user to pick
- Iterate if requested
- Save the chosen mockup alongside the design doc

**2. /plan-design-review** — "What better looks like"
- When rating a design dimension <7/10, generate a mockup showing what 10/10 would look like
- Side-by-side: current (screenshot via $B) vs. proposed (mockup via $D)

**3. /design-consultation** — Design system preview
- Generate a visual preview of the proposed design system (typography, colors, components)
- Replace the /tmp HTML preview page with a proper mockup

**4. /design-review** — Design intent comparison
- Generate a "design intent" mockup from the plan/DESIGN.md specs
- Compare against the live site screenshot for the visual delta

### Files to Create

| File | Purpose |
|------|---------|
| `design/src/cli.ts` | Entry point, command dispatch |
| `design/src/commands.ts` | Command registry |
| `design/src/generate.ts` | GPT Image generation via Responses API |
| `design/src/iterate.ts` | Multi-turn iteration with session state |
| `design/src/variants.ts` | Generate N design variants |
| `design/src/check.ts` | Vision-based quality gate |
| `design/src/brief.ts` | Structured brief types + helpers |
| `design/src/session.ts` | Session state management |
| `design/src/compare.ts` | HTML comparison board generator |
| `design/test/design.test.ts` | Integration tests (mock OpenAI API) |
| (none — add to existing `scripts/resolvers/design.ts`) | `{{DESIGN_SETUP}}` + `{{DESIGN_MOCKUP}}` resolvers |

### Files to Modify

| File | Change |
|------|--------|
| `scripts/resolvers/types.ts` | Add `designDir` to `HostPaths` |
| `scripts/resolvers/index.ts` | Register DESIGN_SETUP + DESIGN_MOCKUP resolvers |
| `package.json` | Add `design` build command |
| `setup` | Build design binary alongside browse + Codex/Kiro asset linking |
| `scripts/resolvers/preamble.ts` | Add `GSTACK_DESIGN` env var export for Codex host |
| `test/gen-skill-docs.test.ts` | Update DESIGN_SKETCH test suite for new resolvers |
| `office-hours/SKILL.md.tmpl` | Replace Visual Sketch section with `{{DESIGN_MOCKUP}}` |
| `plan-design-review/SKILL.md.tmpl` | Add `{{DESIGN_SETUP}}` + mockup generation for low-scoring dimensions |

### Existing Code to Reuse

| Code | Location | Used For |
|------|----------|----------|
| Browse CLI pattern | `browse/src/cli.ts` | Command dispatch architecture |
| `commands.ts` registry | `browse/src/commands.ts` | Single source of truth pattern |
| `generateBrowseSetup()` | `scripts/resolvers/browse.ts` | Template for `generateDesignSetup()` |
| `DESIGN_SKETCH` resolver | `scripts/resolvers/design.ts` | Template for `DESIGN_MOCKUP` resolver |
| HostPaths system | `scripts/resolvers/types.ts` | Multi-host path resolution |
| Build pipeline | `package.json` build script | `bun build --compile` pattern |

### API Details

**Generate:** OpenAI Responses API with the `image_generation` tool

```typescript
import fs from "fs";

const response = await openai.responses.create({
  model: "gpt-4o",
  input: briefToPrompt(brief),
  tools: [{ type: "image_generation", size: "1536x1024", quality: "high" }],
});

// Extract the image from the response output items
const imageItem = response.output.find(item => item.type === "image_generation_call");
if (!imageItem?.result) throw new Error("No image in response output");
const base64Data = imageItem.result; // base64-encoded PNG
fs.writeFileSync(outputPath, Buffer.from(base64Data, "base64"));
```

**Iterate:** Same API with `previous_response_id`

```typescript
const response = await openai.responses.create({
  model: "gpt-4o",
  input: feedback,
  previous_response_id: session.lastResponseId,
  tools: [{ type: "image_generation" }],
});
```

**NOTE:** Multi-turn image iteration via `previous_response_id` is an assumption that needs prototype validation. The Responses API supports conversation threading, but whether it retains the visual context of generated images for edit-style iteration is not confirmed in the docs. **Fallback:** if multi-turn doesn't work, `iterate` falls back to re-generating with the original brief plus accumulated feedback in a single prompt.

**Check:** GPT-4o vision

```typescript
const check = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{
    role: "user",
    content: [
      { type: "image_url", image_url: { url: `data:image/png;base64,${imageData}` } },
      { type: "text", text: `Check this UI mockup. Brief: ${brief}. Is text readable? Are all elements present? Does it look like a real UI? Return PASS or FAIL with issues.` }
    ]
  }]
});
```
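
The gate's reply is free text, so the PASS/FAIL verdict needs a tolerant parse. A minimal sketch; the `parseCheck` name is hypothetical, and the PASS/FAIL convention comes from the prompt above:

```typescript
// Hypothetical parsing of the vision gate's free-text reply.
// Relies only on the "Return PASS or FAIL" instruction in the prompt.
function parseCheck(reply: string): { pass: boolean; issues: string } {
  const trimmed = reply.trim();
  const pass = /^PASS\b/i.test(trimmed);
  // On FAIL, keep the whole reply as the issue list for the retry prompt
  return { pass, issues: pass ? "" : trimmed };
}
```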

**Cost:** ~$0.10-$0.40 per design session (1 hero + 2 variants + 1 quality check + 1 iteration). Negligible next to the LLM costs already in each skill invocation.

### Auth (validated via smoke test)

**Codex OAuth tokens DO NOT work for image generation.** Tested 2026-03-26: both the Images API and the Responses API reject the `~/.codex/auth.json` access_token with "Missing scopes: api.model.images.request". The Codex CLI also has no native imagegen capability.

**Auth resolution order:**
1. Read `~/.gstack/openai.json` → `{ "api_key": "sk-..." }` (file permissions 0600)
2. Fall back to the `OPENAI_API_KEY` environment variable
3. If neither exists → guided setup flow:
   - Tell the user: "Design mockups need an OpenAI API key with image generation permissions. Get one at platform.openai.com/api-keys"
   - Prompt the user to paste the key
   - Write it to `~/.gstack/openai.json` with 0600 permissions
   - Run a smoke test (generate a 1024x1024 test image) to verify the key works
   - If the smoke test passes, proceed. If it fails, show the error and fall back to DESIGN_SKETCH.
4. If auth exists but the API call fails → fall back to DESIGN_SKETCH (the existing HTML wireframe approach). Design mockups are a progressive enhancement, never a hard requirement.

**New command:** `$D setup` — guided API key setup + smoke test. Can be run anytime to update the key.
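
Steps 1-2 of the resolution order can be sketched as a single function. The file path and `api_key` field come from this section; the `resolveApiKey` name and parameterization are illustrative:

```typescript
// Sketch of auth resolution steps 1-2: key file first, env var second.
// A null result means neither exists → trigger the guided setup flow.
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

function resolveApiKey(
  file = path.join(os.homedir(), ".gstack", "openai.json"),
  env: NodeJS.ProcessEnv = process.env
): string | null {
  try {
    const key = JSON.parse(fs.readFileSync(file, "utf8")).api_key;
    if (key) return key; // step 1: key file wins
  } catch {
    // file absent or unreadable: fall through to the env var
  }
  return env.OPENAI_API_KEY ?? null; // step 2, else guided setup
}
```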

## Assumptions to Validate in Prototype

1. **Image quality:** "Pixel-perfect UI mockups" is aspirational. GPT Image generation may not reliably produce accurate text rendering, alignment, and spacing at true UI fidelity. The vision quality gate helps, but the success criterion "good enough to implement from" needs prototype validation before full skill integration.
2. **Multi-turn iteration:** Whether `previous_response_id` retains visual context is unproven (see the API Details section).
3. **Cost model:** The estimated $0.10-$0.40/session needs real-world validation.

**Prototype validation plan:** Build Commit 1 (core generate + check), run 10 design briefs across different screen types, and evaluate output quality before proceeding to skill integration.

## CEO Expansion Scope (accepted via /plan-ceo-review SCOPE EXPANSION)

### 1. Design Memory + Exploration Width Control
- Auto-extract visual language from approved mockups into DESIGN.md
- If DESIGN.md exists, constrain future mockups to the established design language
- If no DESIGN.md (bootstrap), explore WIDE across diverse directions
- Progressive constraint: more established design = narrower exploration band
- Comparison board gets a REGENERATE section with exploration controls:
  - "Something totally different" (wide exploration)
  - "More like option ___" (narrow around a favorite)
  - "Match my existing design" (constrain to DESIGN.md)
  - Free text input for specific direction changes
- Regenerate refreshes the page; the agent polls for the new submission

### 2. Mockup Diffing
- `$D diff --before old.png --after new.png` generates a visual diff
- Side-by-side with changed regions highlighted
- Uses GPT-4o vision to identify differences
- Used in: /design-review, iteration feedback, PR review

### 3. Screenshot-to-Mockup Evolution
- `$D evolve --screenshot current.png --brief "make it calmer"`
- Takes a live site screenshot, generates a mockup showing how it SHOULD look
- Starts from reality, not a blank canvas
- Bridge between /design-review critique and a visual fix proposal

### 4. Design Intent Verification
- During /design-review, overlay the approved mockup (docs/designs/) onto the live screenshot
- Highlight divergence: "You designed X, you built Y, here's the gap"
- Closes the full loop: design -> implement -> verify visually
- Combines $B screenshot + $D diff + vision analysis

### 5. Responsive Variants
- `$D variants --brief "..." --viewports desktop,tablet,mobile`
- Auto-generates mockups at multiple viewport sizes
- Comparison board shows a responsive grid for simultaneous approval
- Makes responsive design a first-class concern from the mockup stage

### 6. Design-to-Code Prompt
- After comparison board approval, auto-generate a structured implementation prompt
- Extracts colors, typography, layout from the approved PNG via vision analysis
- Combines with DESIGN.md and the HTML wireframe as a structured spec
- Bridges "approved design" to "agent starts coding" with zero interpretation gap

### Future Engines (NOT in this plan's scope)
- Magic Patterns integration (extract patterns from existing designs)
- Variant API (when they ship it, multi-variation React code + preview)
- Figma MCP (bidirectional design file access)
- Google Stitch SDK (free TypeScript alternative)

## Open Questions

1. When Variant ships an API, what's the integration path? (A separate engine in the design binary, or a standalone Variant binary?)
2. How should Magic Patterns integrate? (Another engine in $D, or a separate tool?)
3. At what point does the design binary need a plugin/engine architecture to support multiple generation backends?

## Success Criteria

- Running `/office-hours` on a UI idea produces actual PNG mockups alongside the design doc
- Running `/plan-design-review` shows "what better looks like" as a mockup, not prose
- Mockups are good enough that a developer could implement from them
- The quality gate catches obviously broken mockups and retries
- Cost per design session stays under $0.50

## Distribution Plan

The design binary is compiled and distributed alongside the browse binary:
- `bun build --compile design/src/cli.ts --outfile design/dist/design`
- Built during `./setup` and `bun run build`
- Symlinked via the existing `~/.claude/skills/gstack/` install path

## Next Steps (Implementation Order)

### Commit 0: Prototype validation (MUST PASS before building infrastructure)
- Single-file prototype script (~50 lines) that sends 3 different design briefs to the GPT Image API
- Validates: text rendering quality, layout accuracy, visual coherence
- If the output is "embarrassingly bad AI art" for UI mockups, STOP. Re-evaluate the approach.
- This is the cheapest way to validate the core assumption before building 8 files of infrastructure.

### Commit 1: Design binary core (generate + check + compare)
- `design/src/` with cli.ts, commands.ts, generate.ts, check.ts, brief.ts, session.ts, compare.ts
- Auth module (read ~/.gstack/openai.json, fall back to env var, guided setup flow)
- `compare` command generates the HTML comparison board with per-variant feedback textareas
- `package.json` build command (separate `bun build --compile` from browse)
- `setup` script integration (including Codex + Kiro asset linking)
- Unit tests with a mock OpenAI API server

### Commit 2: Variants + iterate
- `design/src/variants.ts`, `design/src/iterate.ts`
- Staggered parallel generation (1s delay between starts, exponential backoff on 429)
- Session state management for multi-turn
- Tests for the iteration flow + rate limit handling

### Commit 3: Template integration
- Add `generateDesignSetup()` + `generateDesignMockup()` to the existing `scripts/resolvers/design.ts`
- Add `designDir` to `HostPaths` in `scripts/resolvers/types.ts`
- Register DESIGN_SETUP + DESIGN_MOCKUP in `scripts/resolvers/index.ts`
- Add the GSTACK_DESIGN env var export to `scripts/resolvers/preamble.ts` (Codex host)
- Update `test/gen-skill-docs.test.ts` (DESIGN_SKETCH test suite)
- Regenerate SKILL.md files

### Commit 4: /office-hours integration
- Replace the Visual Sketch section with `{{DESIGN_MOCKUP}}`
- Sequential workflow: generate variants → $D compare → user feedback → DESIGN_SKETCH HTML wireframe
- Save the approved mockup to docs/designs/ (only the approved one, not the explorations)

### Commit 5: /plan-design-review integration
- Add `{{DESIGN_SETUP}}` and mockup generation for low-scoring dimensions
- "What 10/10 looks like" mockup comparison

### Commit 6: Design Memory + Exploration Width Control (CEO expansion)
- After mockup approval, extract visual language via GPT-4o vision
- Write/update DESIGN.md with extracted colors, typography, spacing, layout patterns
- If DESIGN.md exists, feed it as constraint context to all future mockup prompts
- Add the REGENERATE section to the comparison board HTML (chiclets + free text + refresh loop)
- Progressive constraint logic in brief construction

### Commit 7: Mockup Diffing + Design Intent Verification (CEO expansion)
- `$D diff` command: takes two PNGs, uses GPT-4o vision to identify differences, generates an overlay
- `$D verify` command: screenshots the live site via $B, diffs against the approved mockup from docs/designs/
- Integration into the /design-review template: auto-verify when an approved mockup exists

### Commit 8: Screenshot-to-Mockup Evolution (CEO expansion)
- `$D evolve` command: takes a screenshot + brief, generates a "how it should look" mockup
- Sends the screenshot as a reference image to the GPT Image API
- Integration into /design-review: "Here's what the fix should look like" visual proposals

### Commit 9: Responsive Variants + Design-to-Code Prompt (CEO expansion)
- `--viewports` flag on `$D variants` for multi-size generation
- Comparison board responsive grid layout
- Auto-generate a structured implementation prompt after approval
- Vision analysis of the approved PNG to extract colors, typography, layout for the prompt

## The Assignment

Tell Variant to build an API. As their investor: "I'm building a workflow where AI agents generate visual designs programmatically. GPT Image API works today — but I'd rather use Variant because the multi-variation approach is better for design exploration. Ship an API endpoint: prompt in, React code + preview image out. I'll be your first integration partner."

## Verification

1. `bun run build` compiles the `design/dist/design` binary
2. `$D generate --brief "Landing page for a developer tool" --output /tmp/test.png` produces a real PNG
3. `$D check --image /tmp/test.png --brief "Landing page"` returns PASS/FAIL
4. `$D variants --brief "..." --count 3 --output-dir /tmp/variants/` produces 3 PNGs
5. Running `/office-hours` on a UI idea produces mockups inline
6. `bun test` passes (skill validation, gen-skill-docs)
7. `bun run test:evals` passes (E2E tests)

## What I noticed about how you think

- You said "that isn't design" about text descriptions and ASCII art. That's a designer's instinct — you know the difference between describing a thing and showing a thing. Most people building AI tools don't notice this gap because they were never designers.
- You prioritized /office-hours first — the upstream leverage point. If the brainstorm produces real mockups, every downstream skill (/plan-design-review, /design-review) has a visual artifact to reference instead of re-interpreting prose.
- You funded Variant and immediately thought "they should have an API." That's investor-as-user thinking — you're not just evaluating the company, you're designing how their product fits into your workflow.
- When Codex challenged the opt-in premise, you accepted it immediately. No ego defense. That's the fastest path to the right answer.

## Spec Review Results

The doc survived 1 round of adversarial review. 11 issues caught and fixed.
Quality score: 7/10 → estimated 8.5/10 after fixes.

Issues fixed:
1. OpenAI SDK dependency declared
2. Image data extraction path specified (response.output item shape)
3. --check and --retry flags formally registered in the command registry
4. Brief input modes specified (plain text vs JSON file)
5. Resolver file contradiction fixed (add to existing design.ts)
6. HostPaths Codex env var setup noted
7. "Mirrors browse" reframed to "shares compilation/distribution pattern"
8. Session state specified (ID generation, discovery, cleanup)
9. "Pixel-perfect" flagged as an assumption needing prototype validation
10. Multi-turn iteration flagged as unproven, with a fallback plan
11. $D discovery bash block fully specified with fallback to DESIGN_SKETCH

## Eng Review Completion Summary

- Step 0: Scope Challenge — scope accepted as-is (full binary; user overrode the reduction recommendation)
- Architecture Review: 5 issues found (openai dep separation, graceful degrade, output dir config, auth model, trust boundary)
- Code Quality Review: 1 issue found (8 files vs 5, kept 8)
- Test Review: diagram produced, 42 gaps identified, test plan written
- Performance Review: 1 issue found (parallel variants with staggered start)
- NOT in scope: Google Stitch SDK integration, Figma MCP, Variant API (deferred)
- What already exists: browse CLI pattern, DESIGN_SKETCH resolver, HostPaths system, gen-skill-docs pipeline
- Outside voice: 4 passes (Claude structured: 12 issues; Codex structured: 8 issues; Claude adversarial: 1 fatal flaw; Codex adversarial: 1 fatal flaw). Key insight: the sequential PNG→HTML workflow resolved the "opaque raster" fatal flaw.
- Failure modes: 0 critical gaps (all identified failure modes have error handling + tests planned)
- Lake Score: 7/7 recommendations chose the complete option

## GSTACK REVIEW REPORT

| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| Office Hours | `/office-hours` | Design brainstorm | 1 | DONE | 4 premises, 1 revised (Codex: opt-in->default-on) |
| CEO Review | `/plan-ceo-review` | Scope & strategy | 1 | CLEAR | EXPANSION: 6 proposed, 6 accepted, 0 deferred |
| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR | 7 issues, 0 critical gaps, 4 outside voices |
| Design Review | `/plan-design-review` | UI/UX gaps | 1 | CLEAR | score: 2/10 -> 8/10, 5 decisions made |
| Outside Voice | structured + adversarial | Independent challenge | 4 | DONE | Sequential PNG->HTML workflow, trust boundary noted |

**CEO EXPANSIONS:** Design Memory + Exploration Width, Mockup Diffing, Screenshot Evolution, Design Intent Verification, Responsive Variants, Design-to-Code Prompt.

**DESIGN DECISIONS:** Single-column full-width layout, per-card "More like this", explicit radio Pick, smooth fade regeneration, skeleton loading states.

**UNRESOLVED:** 0

**VERDICT:** CEO + ENG + DESIGN CLEARED. Ready to implement. Start with Commit 0 (prototype validation).