gstack/docs/designs/DESIGN_SHOTGUN.md
Garry Tan 78bc1d1968 feat: design binary — real UI mockup generation for gstack skills (v0.13.0.0) (#551)
* docs: design tools v1 plan — visual mockup generation for gstack skills

Full design doc covering the `design` binary that wraps OpenAI's GPT Image API
to generate real UI mockups from gstack's design skills. Includes comparison
board UX spec, auth model, 6 CEO expansions (design memory, mockup diffing,
screenshot evolution, design intent verification, responsive variants,
design-to-code prompt), and 9-commit implementation plan.

Reviewed: /office-hours + /plan-eng-review (CLEARED) + /plan-ceo-review
(EXPANSION, 6/6 accepted) + /plan-design-review (2/10 → 8/10).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design tools prototype validation — GPT Image API works

Prototype script sends 3 design briefs to OpenAI Responses API with
image_generation tool. Results: dashboard (47s, 2.1MB), landing page
(42s, 1.3MB), settings page (37s, 1.3MB) all produce real, implementable
UI mockups with accurate text rendering and clean layouts.

Key finding: Codex OAuth tokens lack image generation scopes. Direct
API key (sk-proj-*) required, stored in ~/.gstack/openai.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design binary core — generate, check, compare commands

Stateless CLI (design/dist/design) wrapping OpenAI Responses API for
UI mockup generation. Three working commands:

- generate: brief -> PNG mockup via gpt-4o + image_generation tool
- check: vision-based quality gate via GPT-4o (text readability, layout
  completeness, visual coherence)
- compare: generates self-contained HTML comparison board with star
  ratings, radio Pick, per-variant feedback, regenerate controls,
  and Submit button that writes structured JSON for agent polling

Auth reads from ~/.gstack/openai.json (0600), falls back to
OPENAI_API_KEY env var. Compiled separately from browse binary
(openai added to devDependencies, not runtime deps).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design binary variants + iterate commands

variants: generates N style variations with staggered parallel (1.5s
between launches, exponential backoff on 429). 7 built-in style
variations (bold, calm, warm, corporate, dark, playful + default).
Tested: 3/3 variants in 41.6s.

iterate: multi-turn design iteration using previous_response_id for
conversational threading. Falls back to re-generation with accumulated
feedback if threading doesn't retain visual context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: DESIGN_SETUP + DESIGN_MOCKUP template resolvers

Add generateDesignSetup() and generateDesignMockup() to the existing
design.ts resolver file. Add designDir to HostPaths (claude + codex).
Register DESIGN_SETUP and DESIGN_MOCKUP in the resolver index.

DESIGN_SETUP: $D binary discovery (mirrors $B browse setup pattern).
Falls back to DESIGN_SKETCH if binary not available.

DESIGN_MOCKUP: full visual exploration workflow template — construct
brief from DESIGN.md context, generate 3 variants, open comparison
board in Chrome, poll for user feedback, save approved mockup to
docs/designs/, generate HTML wireframe for implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sync package.json version with VERSION file (0.12.2.0)

Pre-existing mismatch: VERSION was 0.12.2.0 but package.json was
0.12.0.0. Also adds design binary to build script and dev:design
convenience command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /office-hours visual design exploration integration

Add {{DESIGN_MOCKUP}} to office-hours template before the existing
{{DESIGN_SKETCH}}. When the design binary is available, /office-hours
generates 3 visual mockup variants, opens a comparison board in Chrome,
and polls for user feedback. Falls back to HTML wireframes if the
design binary isn't built.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /plan-design-review visual mockup integration

Add {{DESIGN_SETUP}} to pre-review audit and "show me what 10/10
looks like" mockup generation to the 0-10 rating method. When a
design dimension rates below 7/10, the review can generate a mockup
showing the improved version. Falls back to text descriptions if
the design binary isn't available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design memory — extract visual language from mockups into DESIGN.md

New `$D extract` command: sends approved mockup to GPT-4o vision,
extracts color palette, typography, spacing, and layout patterns,
writes/updates DESIGN.md with an "Extracted Design Language" section.

Progressive constraint: if DESIGN.md exists, future mockup briefs
include it as style context. If no DESIGN.md, explorations run wide.
readDesignConstraints() reads existing DESIGN.md for brief construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: mockup diffing + design intent verification

New commands:
- $D diff --before old.png --after new.png: visual diff using GPT-4o
  vision. Returns differences by area with severity (high/medium/low)
  and a matchScore (0-100).
- $D verify --mockup approved.png --screenshot live.png: compares live
  site screenshot against approved design mockup. Pass if matchScore
  >= 70 and no high-severity differences.

Used by /design-review to close the design loop: design -> implement ->
verify visually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: screenshot-to-mockup evolution ($D evolve)

New command: $D evolve --screenshot current.png --brief "make it calmer"

Two-step process: first analyzes the screenshot via GPT-4o vision to
produce a detailed description, then generates a new mockup that keeps
the existing layout structure but applies the requested changes. Starts
from reality, not blank canvas.

Bridges the gap between /design-review critique ("the spacing is off")
and a visual proposal of the fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: responsive variants + design-to-code prompt

Responsive variants: $D variants --viewports desktop,tablet,mobile
generates mockups at 1536x1024, 1024x1024, and 1024x1536 (portrait)
with viewport-appropriate layout instructions.

Design-to-code prompt: $D prompt --image approved.png extracts colors,
typography, layout, and components via GPT-4o vision, producing a
structured implementation prompt. Reads DESIGN.md for additional
constraint context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.13.0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: gstack designer as first-class tool in /plan-design-review

Brand the gstack designer prominently, add Step 0.5 for proactive visual
mockup generation before review passes, and update priority hierarchy.
When a plan describes new UI, the skill now offers to generate mockups
with $D variants, run $D check for quality gating, and present a
comparison board via $B goto before any review passes begin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: integrate mockups into review passes and outputs

Thread Step 0.5 mockups through the review workflow: Pass 4 (AI Slop)
evaluates generated mockups visually, Pass 7 uses mockups as evidence
for unresolved decisions, post-pass offers one-shot regeneration after
design changes, and Approved Mockups section records chosen variants
with paths for the implementer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: gstack designer target mockups in /design-review fix loop

Add $D generate for target mockups in Phase 8a.5 — before fixing a
design finding, generate a mockup showing what it should look like.
Add $D verify in Phase 9 to compare fix results against targets.
Not plan mode — goes straight to implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: gstack designer AI mockups in /design-consultation Phase 5

Replace HTML preview with $D variants + comparison board when designer
is available (Path A). Use $D extract to derive DESIGN.md tokens from
the approved mockup. Handles both plan mode (write to plan) and
non-plan mode (implement immediately). Falls back to HTML preview
(Path B) when designer binary is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make gstack designer the default in /plan-design-review, not optional

The transcript showed the agent writing 5 text descriptions of homepage
variants instead of generating visual mockups, even when the user explicitly
asked for design tools. The skill treated mockups as optional ("Want me to
generate?") when they should be the default behavior.

Changes:
- Rename "Your Visual Design Tool" to "YOUR PRIMARY TOOL" with aggressive
  language: "Don't ask permission. Show it."
- Step 0.5 now generates mockups automatically when DESIGN_READY, no
  AskUserQuestion gatekeeping the default path
- Priority hierarchy: mockups are "non-negotiable" not "if available"
- Step 0D tells the user mockups are coming next
- DESIGN_NOT_AVAILABLE fallback now tells user what they're missing

The only valid reasons to skip mockups: no UI scope, or designer not
installed. Everything else generates by default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: persist design mockups to ~/.gstack/projects/$SLUG/designs/

Mockups were going to .context/mockups/ (gitignored, workspace-local).
This meant designs disappeared when switching workspaces or conversations,
and downstream skills couldn't reference approved mockups from earlier
reviews.

Now all three design skills save to persistent project-scoped dirs:
- /plan-design-review: ~/.gstack/projects/$SLUG/designs/<screen>-<date>/
- /design-consultation: ~/.gstack/projects/$SLUG/designs/design-system-<date>/
- /design-review: ~/.gstack/projects/$SLUG/designs/design-audit-<date>/

Each directory gets an approved.json recording the user's pick, feedback,
and branch. This lets /design-review verify against mockups that
/plan-design-review approved, and design history is browsable via
ls ~/.gstack/projects/$SLUG/designs/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate codex ship skill with zsh glob guards

Picked up setopt +o nomatch guards from main's v0.12.8.1 merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add browse binary discovery to DESIGN_SETUP resolver

The design setup block now discovers $B alongside $D, so skills can
open comparison boards via $B goto and poll feedback via $B eval.
Falls back to `open` on macOS when browse binary is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: comparison board DOM polling in plan-design-review

After opening the comparison board, the agent now polls
#status via $B eval instead of asking a rigid AskUserQuestion.
Handles submit (read structured JSON feedback), regenerate
(new variants with updated brief), and $B-unavailable fallback
(free-form text response). The user interacts with the real
board UI, not a constrained option picker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: comparison board feedback loop integration test

16 tests covering the full DOM polling cycle: structure verification,
submit with pick/rating/comment, regenerate flows (totally different,
more like this, custom text), and the agent polling pattern
(empty → submitted → read JSON). Uses real generateCompareHtml()
from design/src/compare.ts, served via HTTP. Runs in <1s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add $D serve command for HTTP-based comparison board feedback

The comparison board feedback loop was fundamentally broken: browse blocks
file:// URLs (url-validation.ts:71), so $B goto file://board.html always
fails. The fallback open + $B eval polls a different browser instance.

$D serve fixes this by serving the board over HTTP on localhost. The server
is stateful: stays alive across regeneration rounds, exposes /api/progress
for the board to poll, and accepts /api/reload from the agent to swap in
new board HTML. Stdout carries feedback JSON only; stderr carries telemetry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: dual-mode feedback + post-submit lifecycle in comparison board

When __GSTACK_SERVER_URL is set (injected by $D serve), the board POSTs
feedback to the server instead of only writing to hidden DOM elements.
After submit: disables all inputs, shows "Return to your coding agent."
After regenerate: shows spinner, polls /api/progress, auto-refreshes on
ready. On POST failure: shows copyable JSON fallback. On progress timeout
(5 min): shows error with /design-shotgun prompt. DOM fallback preserved
for headed browser mode and tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: HTTP serve command endpoints and regeneration lifecycle

11 tests covering: HTML serving with injected server URL, /api/progress
state reporting, submit → done lifecycle, regenerate → regenerating state,
remix with remixSpec, malformed JSON rejection, /api/reload HTML swapping,
missing file validation, and full regenerate → reload → submit round-trip.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add DESIGN_SHOTGUN_LOOP resolver + fix design artifact paths

Adds generateDesignShotgunLoop() resolver for the shared comparison board
feedback loop (serve via HTTP, handle regenerate/remix, AskUserQuestion
fallback, feedback confirmation). Registered as {{DESIGN_SHOTGUN_LOOP}}.

Fixes generateDesignMockup() to use ~/.gstack/projects/$SLUG/designs/
instead of /tmp/ and docs/designs/. Replaces broken $B goto file:// +
$B eval polling with $D compare --serve (HTTP-based, stdout feedback).

Adds CRITICAL PATH RULE guardrail to DESIGN_SETUP: design artifacts must
go to ~/.gstack/projects/$SLUG/designs/, never .context/ or /tmp/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /design-shotgun standalone design exploration skill

New skill for visual brainstorming: generate AI design variants, open a
comparison board in the user's browser, collect structured feedback, and
iterate. Features: session detection (revisit prior explorations), 5-dimension
context gathering (who, job to be done, what exists, user flow, edge cases),
taste memory (prior approved designs bias new generations), inline variant
preview, configurable variant count, screenshot-to-variants via $D evolve.

Uses {{DESIGN_SHOTGUN_LOOP}} resolver for the feedback loop. Saves all
artifacts to ~/.gstack/projects/$SLUG/designs/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files for design-shotgun + resolver changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add remix UI to comparison board

Per-variant element selectors (Layout, Colors, Typography, Spacing) with
radio buttons in a grid. Remix button collects selections into a remixSpec
object and sends via the same HTTP POST feedback mechanism. Enabled only
when at least one element is selected. Board shows regenerating spinner
while agent generates the hybrid variant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add $D gallery command for design history timeline

Generates a self-contained HTML page showing all prior design explorations
for a project: every variant (approved or not), feedback notes, organized
by date (newest first). Images embedded as base64. Handles corrupted
approved.json gracefully (skips, still shows the session). Empty state
shows "No history yet" with /design-shotgun prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: gallery generation — sessions, dates, corruption, empty state

7 tests: empty dir, nonexistent dir, single session with approved variant,
multiple sessions sorted newest-first, corrupted approved.json handled
gracefully, session without approved.json, self-contained HTML (no
external dependencies).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: replace broken file:// polling with {{DESIGN_SHOTGUN_LOOP}}

plan-design-review and design-consultation templates previously used
$B goto file:// + $B eval polling for the comparison board feedback loop.
This was broken (browse blocks file:// URLs). Both templates now use
{{DESIGN_SHOTGUN_LOOP}} which serves via HTTP, handles regeneration in
the same browser tab, and falls back to AskUserQuestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add design-shotgun touchfile entries and tier classifications

design-shotgun-path (gate): verify artifacts go to ~/.gstack/, not .context/
design-shotgun-session (gate): verify repeat-run detection + AskUserQuestion
design-shotgun-full (periodic): full round-trip with real design binary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files for template refactor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: comparison board UI improvements — option headers, pick confirmation, grid view

Three changes to the design comparison board:

1. Pick confirmation: selecting "Pick" on Option A shows "We'll move
   forward with Option A" in green, plus a status line above the submit
   button repeating the choice.

2. Clear option headers: each variant now has "Option A" in bold with a
   subtitle above the image, instead of just the raw image.

3. View toggle: top-right Large/Grid buttons switch between single-column
   (default) and 3-across grid view.

Also restructured the bottom section into a 2-column grid: submit/overall
feedback on the left, regenerate controls on the right.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use 127.0.0.1 instead of localhost for serve URL

Avoids DNS resolution issues on some systems where localhost may resolve
to IPv6 ::1 while Bun listens on IPv4 only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: write ALL feedback to disk so agent can poll in background mode

The agent backgrounds $D serve (Claude Code can't block on a subprocess
and do other work simultaneously). With stdout-only feedback delivery,
the agent never sees regenerate/remix feedback.

Fix: write feedback-pending.json (regenerate/remix) and feedback.json
(submit) to disk next to the board HTML. Agent polls the filesystem
instead of reading stdout. Both channels (stdout + disk) are always
active so foreground mode still works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: DESIGN_SHOTGUN_LOOP uses file polling instead of stdout reading

Update the template resolver to instruct the agent to background $D serve
and poll for feedback-pending.json / feedback.json on a 5-second loop.
This matches the real-world pattern where Claude Code / Conductor agents
can't block on subprocess stdout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files for file-polling feedback loop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: null-safe DOM selectors for post-submit and regenerating states

The user's layout restructure renamed .regenerate-bar → .regen-column,
.submit-bar → .submit-column, and .overall-section → .bottom-section.
The JS still referenced the old class names, causing querySelector to
return null and showPostSubmitState() / showRegeneratingState() to
silently crash. This meant Submit and Regenerate buttons appeared to
work (DOM elements updated, HTTP POST succeeded) but the visual
feedback (disabled inputs, spinner, success message) never appeared.

Fix: use fallback selectors that check both old and new class names,
with null guards so a missing element doesn't crash the function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: end-to-end feedback roundtrip — browser click to file on disk

The test that proves "changes on the website propagate to Claude Code."
Opens the comparison board in a real headless browser with __GSTACK_SERVER_URL
injected, simulates user clicks (Submit, Regenerate, More Like This), and
verifies that feedback.json / feedback-pending.json land on disk with the
correct structured data.

6 tests covering: submit → feedback.json, post-submit UI lockdown,
regenerate → feedback-pending.json, more-like-this → feedback-pending.json,
regenerate spinner display, and full regen → reload → submit round-trip.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: comprehensive design doc for Design Shotgun feedback loop

Documents the full browser-to-agent feedback architecture: state machine,
file-based polling, port discovery, post-submit lifecycle, and every known
edge case (zombie forms, dead servers, stale spinners, file:// bug,
double-click races, port coordination, sequential generate rule).

Includes ASCII diagrams of the data flow and state transitions, complete
step-by-step walkthrough of happy path and regeneration path, test coverage
map with gaps, and short/medium/long-term improvement ideas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: plan-design-review agent guardrails for feedback loop

Four fixes to prevent agents from reinventing the feedback loop badly:

1. Sequential generate rule: explicit instruction that $D generate calls
   must run one at a time (API rate-limits concurrent image generation).
2. No-AskUserQuestion-for-feedback rule: agent reads feedback.json instead
   of re-asking what the user picked.
3. Remove file:// references: $B goto file:// was always rejected by
   url-validation.ts. The --serve flag handles everything.
4. Remove $B eval polling reference: no longer needed with HTTP POST.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: design-shotgun Step 3 progressive reveal, silent failure detection, timing estimate

Three production UX bugs fixed:
1. Dead air — now shows timing estimate before generation starts
2. Silent variant drop — replaced $D variants batch with individual $D generate
   calls, each verified for existence and non-zero size with retry
3. No progressive reveal — each variant shown inline via Read tool immediately
   after generation (~60s increments instead of all at ~180s)

Also: /tmp/ then cp as default output pattern (sandbox workaround),
screenshot taken once for evolve path (not per-variant).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: parallel design-shotgun with concept-first confirmation

Step 3 rewritten to concept-first + parallel Agent architecture:
- 3a: generate text concepts (free, instant)
- 3b: AskUserQuestion to confirm/modify before spending API credits
- 3c: launch N Agent subagents in parallel (~60s total regardless of count)
- 3d: show all results, dynamic image list for comparison board

Adds Agent to allowed-tools. Softens plan-design-review sequential
warning to note design-shotgun uses parallel at Tier 2+.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.13.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: untrack .agents/skills/ — generated at setup, already gitignored

These files were committed despite .agents/ being in .gitignore.
They regenerate from ./setup --host codex on any machine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate design-shotgun SKILL.md for v0.12.12.0 preamble changes

Merge from main brought updated preamble resolver (conditional telemetry,
local JSONL logging) but design-shotgun/SKILL.md wasn't regenerated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:32:59 -06:00

Design: Design Shotgun — Browser-to-Agent Feedback Loop

Generated on: 2026-03-27
Branch: garrytan/agent-design-tools
Status: LIVING DOCUMENT — update as bugs are found and fixed

What This Feature Does

Design Shotgun generates multiple AI design mockups, opens them side-by-side in the user's real browser as a comparison board, and collects structured feedback (pick a favorite, rate alternatives, leave notes, request regeneration). The feedback flows back to the coding agent, which acts on it: either proceeding with the approved variant or generating new variants and reloading the board.

The user never leaves their browser tab. The agent never asks redundant questions. The board is the feedback mechanism.

The Core Problem: Two Worlds That Must Talk

  ┌─────────────────────┐          ┌──────────────────────┐
  │   USER'S BROWSER    │          │   CODING AGENT       │
  │   (real Chrome)     │          │   (Claude Code /     │
  │                     │          │    Conductor)         │
  │  Comparison board   │          │                      │
  │  with buttons:      │   ???    │  Needs to know:      │
  │  - Submit           │ ──────── │  - What was picked   │
  │  - Regenerate       │          │  - Star ratings      │
  │  - More like this   │          │  - Comments          │
  │  - Remix            │          │  - Regen requested?  │
  └─────────────────────┘          └──────────────────────┘

The "???" is the hard part. The user clicks a button in Chrome. The agent running in a terminal needs to know about it. These are two completely separate processes with no shared memory, no shared event bus, no WebSocket connection.

Architecture: How the Linkage Works

  USER'S BROWSER                    $D serve (Bun HTTP)              AGENT
  ═══════════════                   ═══════════════════              ═════
       │                                   │                           │
       │  GET /                            │                           │
       │ ◄─────── serves board HTML ──────►│                           │
       │    (with __GSTACK_SERVER_URL      │                           │
       │     injected into <head>)         │                           │
       │                                   │                           │
       │  [user rates, picks, comments]    │                           │
       │                                   │                           │
       │  POST /api/feedback               │                           │
       │ ─────── {preferred:"A",...} ─────►│                           │
       │                                   │                           │
       │  ◄── {received:true} ────────────│                           │
       │                                   │── writes feedback.json ──►│
       │  [inputs disabled,                │   (or feedback-pending    │
       │   "Return to agent" shown]        │    .json for regen)       │
       │                                   │                           │
       │                                   │                  [agent polls
       │                                   │                   every 5s,
       │                                   │                   reads file]

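The server side of this diagram is small. The following is a minimal sketch of the routing, assuming Bun.serve() and the endpoint names shown above; it is illustrative, not the actual design/src/serve.ts source.

  // Illustrative sketch only: route names taken from the diagram above, details assumed.
  import { writeFileSync } from "fs";
  import { join } from "path";

  let boardHtml = "<html><!-- generated comparison board --></html>";
  const boardDir = "/path/to/design/session";         // directory holding board.html (assumed)
  let state: "serving" | "regenerating" | "done" = "serving";

  const server = Bun.serve({
    port: 0,                                          // bind a random free port
    hostname: "127.0.0.1",
    async fetch(req) {
      const { pathname } = new URL(req.url);
      if (req.method === "GET" && pathname === "/") {
        return new Response(boardHtml, { headers: { "Content-Type": "text/html" } });
      }
      if (req.method === "GET" && pathname === "/api/progress") {
        return Response.json({ status: state });      // board polls this while regenerating
      }
      if (req.method === "POST" && pathname === "/api/feedback") {
        const feedback = await req.json();
        const file = feedback.regenerated ? "feedback-pending.json" : "feedback.json";
        writeFileSync(join(boardDir, file), JSON.stringify(feedback, null, 2));
        state = feedback.regenerated ? "regenerating" : "done";
        return Response.json({ received: true });
      }
      if (req.method === "POST" && pathname === "/api/reload") {
        const { html } = await req.json();
        boardHtml = await Bun.file(html).text();      // agent swaps in the new board HTML
        state = "serving";
        return Response.json({ received: true });
      }
      return new Response("not found", { status: 404 });
    },
  });

  console.error(`SERVE_STARTED: port=${server.port} html=${join(boardDir, "board.html")}`);
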
The Three Files

feedback.json
  Written when: user clicks Submit
  Means: final selection, done
  Agent action: read it, proceed

feedback-pending.json
  Written when: user clicks Regenerate/More Like This
  Means: wants new options
  Agent action: read it, delete it, generate new variants, reload board

feedback.json (round 2+)
  Written when: user clicks Submit after regeneration
  Means: final selection after iteration
  Agent action: read it, proceed
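
On the agent side, the DESIGN_SHOTGUN_LOOP template describes this flow in prose. A minimal sketch of the equivalent polling logic (file names from the table above; the directory variable and everything else is assumed):

  import { existsSync, readFileSync, unlinkSync } from "fs";
  import { join } from "path";

  // Sketch of the agent-side polling loop; the real agent follows template
  // instructions rather than running this code. DESIGN_DIR is a stand-in for
  // wherever the board HTML was written.
  const designDir = process.env.DESIGN_DIR ?? ".";
  const finalPath = join(designDir, "feedback.json");
  const pendingPath = join(designDir, "feedback-pending.json");

  async function pollForFeedback(): Promise<void> {
    while (true) {
      if (existsSync(finalPath)) {
        const feedback = JSON.parse(readFileSync(finalPath, "utf8"));
        console.log("User picked:", feedback.preferred);        // proceed with this variant
        return;
      }
      if (existsSync(pendingPath)) {
        const pending = JSON.parse(readFileSync(pendingPath, "utf8"));
        unlinkSync(pendingPath);                                 // delete so it is not re-read
        console.log("Regeneration requested:", pending.regenerateAction);
        // generate new variants here, then POST the new board HTML to /api/reload
      }
      await new Promise((resolve) => setTimeout(resolve, 5000)); // 5-second poll interval
    }
  }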

The State Machine

  $D serve starts
       │
       ▼
  ┌──────────┐
  │ SERVING  │◄──────────────────────────────────────┐
  │          │                                        │
  │ Board is │  POST /api/feedback                    │
  │ live,    │  {regenerated: true}                   │
  │ waiting  │──────────────────►┌──────────────┐     │
  │          │                   │ REGENERATING │     │
  │          │                   │              │     │
  └────┬─────┘                   │ Agent has    │     │
       │                         │ 10 min to    │     │
       │  POST /api/feedback     │ POST new     │     │
       │  {regenerated: false}   │ board HTML   │     │
       │                         └──────┬───────┘     │
       ▼                                │             │
  ┌──────────┐                POST /api/reload        │
  │  DONE    │                {html: "/new/board"}    │
  │          │                          │             │
  │ exit 0   │                          ▼             │
  └──────────┘                   ┌──────────────┐     │
                                 │  RELOADING   │─────┘
                                 │              │
                                 │ Board auto-  │
                                 │ refreshes    │
                                 │ (same tab)   │
                                 └──────────────┘

Port Discovery

The agent backgrounds $D serve and reads stderr for the port:

SERVE_STARTED: port=54321 html=/path/to/board.html
SERVE_BROWSER_OPENED: url=http://127.0.0.1:54321

The agent parses port=XXXXX from stderr. This port is needed later to POST /api/reload when the user requests regeneration. If the agent loses the port number, it cannot reload the board.
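
A sketch of that parsing step (the stderr line format is taken from the example above; the rest is illustrative):

  // Extract the port from the SERVE_STARTED line so /api/reload can be reached later.
  const stderrLine = "SERVE_STARTED: port=54321 html=/path/to/board.html";
  const match = stderrLine.match(/port=(\d+)/);
  if (!match) throw new Error("could not find port in $D serve output");
  const port = Number(match[1]);
  const reloadUrl = `http://127.0.0.1:${port}/api/reload`;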

Why 127.0.0.1, Not localhost

localhost can resolve to IPv6 ::1 on some systems while Bun.serve() listens on IPv4 only. More importantly, the browser attaches every cookie previously set on localhost by other local dev servers to requests for any localhost port. On a machine with many active dev sessions, that accumulated cookie header blows past Bun's default header size limit (HTTP 431 error). 127.0.0.1 avoids both issues.

Every Edge Case and Pitfall

1. The Zombie Form Problem

What: User submits feedback, the POST succeeds, the server exits. But the HTML page is still open in Chrome. It looks interactive. The user might edit their feedback and click Submit again. Nothing happens because the server is gone.

Fix: After successful POST, the board JS:

  • Disables ALL inputs (buttons, radios, textareas, star ratings)
  • Hides the Regenerate bar entirely
  • Replaces the Submit button with: "Feedback received! Return to your coding agent."
  • Shows: "Want to make more changes? Run /design-shotgun again."
  • The page becomes a read-only record of what was submitted

Implemented in: compare.ts:showPostSubmitState() (line 484)
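
A sketch of that lockdown, using the class names mentioned elsewhere in this document (illustrative, not the actual compare.ts source):

  function showPostSubmitState(): void {
    // Disable every interactive control: buttons, radios, textareas, star ratings
    document
      .querySelectorAll<HTMLInputElement | HTMLButtonElement | HTMLTextAreaElement>("input, button, textarea")
      .forEach((el) => { el.disabled = true; });

    // Hide the regenerate controls entirely (null-safe across old/new class names)
    const regen = document.querySelector(".regen-column") ?? document.querySelector(".regenerate-bar");
    if (regen instanceof HTMLElement) regen.style.display = "none";

    // Replace the submit area with a read-only confirmation
    const submit = document.querySelector(".submit-column") ?? document.querySelector(".submit-bar");
    if (submit instanceof HTMLElement) {
      submit.innerHTML =
        "<p>Feedback received! Return to your coding agent.</p>" +
        "<p>Want to make more changes? Run /design-shotgun again.</p>";
    }
  }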

2. The Dead Server Problem

What: The server times out (10 min default) or crashes while the user still has the board open. User clicks Submit. The fetch() fails silently.

Fix: The postFeedback() function has a .catch() handler. On network failure:

  • Shows red error banner: "Connection lost"
  • Displays the collected feedback JSON in a copyable <pre> block
  • User can copy-paste it directly into their coding agent

Implemented in: compare.ts:showPostFailure() (line 546)
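
A sketch of that failure path, assuming a postFeedback() helper shaped like the one described above (names illustrative):

  function postFeedback(serverUrl: string, payload: unknown): void {
    fetch(`${serverUrl}/api/feedback`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    })
      .then(() => showPostSubmitState())
      .catch(() => {
        // Connection lost: surface the feedback as copyable JSON so nothing is lost
        const pre = document.createElement("pre");
        pre.textContent = JSON.stringify(payload, null, 2);
        document.body.append("Connection lost. Copy this into your coding agent:", pre);
      });
  }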

3. The Stale Regeneration Spinner

What: User clicks Regenerate. Board shows spinner and polls /api/progress every 2 seconds. Agent crashes or takes too long to generate new variants. The spinner spins forever.

Fix: Progress polling has a hard 5-minute timeout (150 polls x 2s interval). After 5 minutes:

  • Spinner replaced with: "Something went wrong."
  • Shows: "Run /design-shotgun again in your coding agent."
  • Polling stops. Page becomes informational.

Implemented in: compare.ts:startProgressPolling() (line 511)
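
A sketch of the polling-with-timeout behavior (interval and poll count taken from the text above; everything else assumed):

  function startProgressPolling(serverUrl: string): void {
    const maxPolls = 150;                       // 150 polls x 2s = 5 minutes
    let polls = 0;
    const timer = setInterval(async () => {
      if (++polls > maxPolls) {
        clearInterval(timer);
        document.body.append("Something went wrong. Run /design-shotgun again in your coding agent.");
        return;
      }
      try {
        const res = await fetch(`${serverUrl}/api/progress`);
        const { status } = await res.json();
        if (status === "serving") {
          clearInterval(timer);
          window.location.reload();             // new board HTML is ready
        }
      } catch {
        // server unreachable mid-regeneration; keep polling until the timeout
      }
    }, 2000);
  }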

4. The file:// URL Problem (THE ORIGINAL BUG)

What: The skill template originally used $B goto file:///path/to/board.html. But browse/src/url-validation.ts:71 blocks file:// URLs for security. The fallback open file://... opens the user's macOS browser, but $B eval polls Playwright's headless browser (different process, never loaded the page). Agent polls empty DOM forever.

Fix: $D serve serves over HTTP. Never use file:// for the board. The --serve flag on $D compare combines board generation and HTTP serving in one command.

Evidence: See .context/attachments/image-v2.png — a real user hit this exact bug. The agent correctly diagnosed: (1) $B goto rejects file:// URLs, (2) no polling loop even with the browse daemon.

5. The Double-Click Race

What: User clicks Submit twice rapidly. Two POST requests arrive at the server. First one sets state to "done" and schedules exit(0) in 100ms. Second one arrives during that 100ms window.

Current state: NOT fully guarded. The handleFeedback() function doesn't check if state is already "done" before processing. The second POST would succeed and write a second feedback.json (harmless, same data). The exit still fires after 100ms.

Risk: Low. The board disables all inputs on the first successful POST response, so a second click would need to arrive within ~1ms. And both writes would contain the same feedback data.

Potential fix: Add if (state === 'done') return Response.json({error: 'already submitted'}, {status: 409}) at the top of handleFeedback().
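
Sketched out, with state as a module-level variable (as assumed throughout this document; not the current serve.ts source):

  let state: "serving" | "regenerating" | "done" = "serving";

  async function handleFeedback(req: Request): Promise<Response> {
    if (state === "done") {
      // A submit already landed; reject duplicates instead of re-processing them
      return Response.json({ error: "already submitted" }, { status: 409 });
    }
    const feedback = await req.json();
    // ...existing file-write logic goes here...
    state = feedback.regenerated ? "regenerating" : "done";
    return Response.json({ received: true, action: feedback.regenerated ? "regenerate" : "submitted" });
  }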

6. The Port Coordination Problem

What: Agent backgrounds $D serve and parses port=54321 from stderr. Agent needs this port later to POST /api/reload during regeneration. If the agent loses context (conversation compresses, context window fills up), it may not remember the port.

Current state: The port is printed to stderr once. The agent must remember it. There is no port file written to disk.

Potential fix: Write a serve.pid or serve.port file next to the board HTML on startup. Agent can read it anytime:

cat "$_DESIGN_DIR/serve.port"  # → 54321
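
The serve-side half of that mitigation would be a few lines (a sketch of the proposed change, not existing code):

  import { writeFileSync } from "fs";
  import { dirname, join } from "path";

  // Proposed (not yet implemented): persist the port next to the board HTML on startup
  // so the agent can recover it even after losing the stderr output.
  function writePortFile(boardHtmlPath: string, port: number): void {
    writeFileSync(join(dirname(boardHtmlPath), "serve.port"), `${port}\n`);
  }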

7. The Feedback File Cleanup Problem

What: feedback-pending.json from a regeneration round is left on disk. If the agent crashes before reading it, the next $D serve session finds a stale file.

Current state: The polling loop in the resolver template says to delete feedback-pending.json after reading it. But this depends on the agent following instructions perfectly. Stale files could confuse a new session.

Potential fix: $D serve could check for and delete stale feedback files on startup. Or: name files with timestamps (feedback-pending-1711555200.json).
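
The startup-cleanup option is also only a few lines (sketch of the proposal, not existing behavior):

  import { existsSync, unlinkSync } from "fs";
  import { dirname, join } from "path";

  // Proposed: remove leftover feedback files from a crashed session before serving.
  function cleanStaleFeedback(boardHtmlPath: string): void {
    const dir = dirname(boardHtmlPath);
    for (const name of ["feedback.json", "feedback-pending.json"]) {
      const staleFile = join(dir, name);
      if (existsSync(staleFile)) unlinkSync(staleFile);
    }
  }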

8. Sequential Generate Rule

What: The underlying OpenAI GPT Image API rate-limits concurrent image generation requests. When 3 $D generate calls run in parallel, 1 succeeds and 2 get aborted.

Fix: The skill template must explicitly say: "Generate mockups ONE AT A TIME. Do not parallelize $D generate calls." This is a prompt-level instruction, not a code-level lock. The design binary does not enforce sequential execution.

Risk: Agents are trained to parallelize independent work. Without an explicit instruction, they will try to run 3 generates simultaneously. This wastes API calls and money.

9. The AskUserQuestion Redundancy

What: After the user submits feedback via the board (with preferred variant, ratings, comments all in the JSON), the agent asks them again: "Which variant do you prefer?" This is annoying. The whole point of the board is to avoid this.

Fix: The skill template must say: "Do NOT use AskUserQuestion to ask the user's preference. Read feedback.json, it contains their selection. Only AskUserQuestion to confirm you understood correctly, not to re-ask."

10. The CORS Problem

What: If the board HTML references external resources (fonts, images from CDN), the browser sends requests with Origin: http://127.0.0.1:PORT. Most CDNs allow this, but some might block it.

Current state: The server does not set CORS headers. The board HTML is self-contained (images base64-encoded, styles inline), so this hasn't been an issue in practice.

Risk: Low for current design. Would matter if the board loaded external resources.

11. The Large Payload Problem

What: No size limit on POST bodies to /api/feedback. If the board somehow sends a multi-MB payload, req.json() will parse it all into memory.

Current state: In practice, feedback JSON is ~500 bytes to ~2KB. The risk is theoretical, not practical. The board JS constructs a fixed-shape JSON object.

12. The fs.writeFileSync Error

What: feedback.json write in serve.ts:138 uses fs.writeFileSync() with no try/catch. If the disk is full or the directory is read-only, this throws and crashes the server. The user sees a spinner forever (server is dead, but board doesn't know).

Risk: Low in practice (the board HTML was just written to the same directory, proving it's writable). But a try/catch with a 500 response would be cleaner.
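
The cleaner version is a small wrapper (sketch of the suggested hardening, not the current serve.ts code):

  import { writeFileSync } from "fs";

  function writeFeedbackFile(path: string, feedback: unknown): Response {
    try {
      writeFileSync(path, JSON.stringify(feedback, null, 2));
      return Response.json({ received: true });
    } catch (err) {
      // Disk full or read-only directory: report it instead of crashing the server
      return Response.json({ error: `could not persist feedback: ${err}` }, { status: 500 });
    }
  }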

The Complete Flow (Step by Step)

Happy Path: User Picks on First Try

1. Agent runs: $D compare --images "A.png,B.png,C.png" --output board.html --serve &
2. $D serve starts Bun.serve() on random port (e.g. 54321)
3. $D serve opens http://127.0.0.1:54321 in user's browser
4. $D serve prints to stderr: SERVE_STARTED: port=54321 html=/path/board.html
5. $D serve writes board HTML with injected __GSTACK_SERVER_URL
6. User sees comparison board with 3 variants side by side
7. User picks Option B, rates A: 3/5, B: 5/5, C: 2/5
8. User writes "B has better spacing, go with that" in overall feedback
9. User clicks Submit
10. Board JS POSTs to http://127.0.0.1:54321/api/feedback
    Body: {"preferred":"B","ratings":{"A":3,"B":5,"C":2},"overall":"B has better spacing","regenerated":false}
11. Server writes feedback.json to disk (next to board.html)
12. Server prints feedback JSON to stdout
13. Server responds {received:true, action:"submitted"}
14. Board disables all inputs, shows "Return to your coding agent"
15. Server exits with code 0 after 100ms
16. Agent's polling loop finds feedback.json
17. Agent reads it, summarizes to user, proceeds
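
The shape of the payload in step 10, written out as a type (inferred from the example bodies in this document; not a published schema, and the remixSpec shape is assumed):

  interface BoardFeedback {
    preferred: string;                   // "A" | "B" | "C", or "" when regenerating
    ratings: Record<string, number>;     // e.g. { A: 3, B: 5, C: 2 }
    overall: string;                     // free-text overall comment
    regenerated: boolean;                // true when the user asked for new variants
    regenerateAction?: string;           // e.g. "different" or "more_like_B"
    remixSpec?: Record<string, string>;  // per-variant element picks when remixing (assumed shape)
  }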

Regeneration Path: User Wants Different Options

1-6.  Same as above
7.  User clicks "Totally different" chiclet
8.  User clicks Regenerate
9.  Board JS POSTs to /api/feedback
    Body: {"regenerated":true,"regenerateAction":"different","preferred":"","ratings":{},...}
10. Server writes feedback-pending.json to disk
11. Server state → "regenerating"
12. Server responds {received:true, action:"regenerate"}
13. Board shows spinner: "Generating new designs..."
14. Board starts polling GET /api/progress every 2s

    Meanwhile, in the agent:
15. Agent's polling loop finds feedback-pending.json
16. Agent reads it, deletes it
17. Agent runs: $D variants --brief "totally different direction" --count 3
    (ONE AT A TIME, not parallel)
18. Agent runs: $D compare --images "new-A.png,new-B.png,new-C.png" --output board-v2.html
19. Agent POSTs: curl -X POST http://127.0.0.1:54321/api/reload -d '{"html":"/path/board-v2.html"}'
20. Server swaps htmlContent to new board
21. Server state → "serving" (from reloading)
22. Board's next /api/progress poll returns {"status":"serving"}
23. Board auto-refreshes: window.location.reload()
24. User sees new board with 3 fresh variants
25. User picks one, clicks Submit → happy path from step 10

"More Like This" Path

Same as regeneration, except:
- regenerateAction is "more_like_B" (references the variant)
- Agent uses $D iterate --image B.png --brief "more like this, keep the spacing"
  instead of $D variants

Fallback Path: $D serve Fails

1. Agent tries $D compare --serve, it fails (binary missing, port error, etc.)
2. Agent falls back to: open file:///path/board.html
3. Agent uses AskUserQuestion: "I've opened the design board. Which variant
   do you prefer? Any feedback?"
4. User responds in text
5. Agent proceeds with text feedback (no structured JSON)

Files That Implement This

File                                      Role
design/src/serve.ts                       HTTP server, state machine, file writing, browser launch
design/src/compare.ts                     Board HTML generation, JS for ratings/picks/regen, POST logic, post-submit lifecycle
design/src/cli.ts                         CLI entry point, wires serve and compare --serve commands
design/src/commands.ts                    Command registry, defines serve and compare with their args
scripts/resolvers/design.ts               generateDesignShotgunLoop() — template resolver that outputs the polling loop and reload instructions
design-shotgun/SKILL.md.tmpl              Skill template that orchestrates the full flow: context gathering, variant generation, {{DESIGN_SHOTGUN_LOOP}}, feedback confirmation
design/test/serve.test.ts                 Unit tests for HTTP endpoints and state transitions
design/test/feedback-roundtrip.test.ts    E2E test: browser click → JS fetch → HTTP POST → file on disk
browse/test/compare-board.test.ts         DOM-level tests for the comparison board UI

What Could Still Go Wrong

Known Risks (ordered by likelihood)

  1. Agent doesn't follow sequential generate rule — most LLMs want to parallelize. Without enforcement in the binary, this is a prompt-level instruction that can be ignored.

  2. Agent loses port number — context compression drops the stderr output. Agent can't reload the board. Mitigation: write port to a file.

  3. Stale feedback files — leftover feedback-pending.json from a crashed session confuses the next run. Mitigation: clean on startup.

  4. fs.writeFileSync crash — no try/catch on the feedback file write. Silent server death if disk is full. User sees infinite spinner.

  5. Progress polling drift — setInterval(fn, 2000) over 5 minutes. In practice, JavaScript timers are accurate enough. But if the browser tab is backgrounded, Chrome may throttle intervals to once per minute.

Things That Work Well

  1. Dual-channel feedback — stdout for foreground mode, files for background mode. Both always active. Agent can use whichever works.

  2. Self-contained HTML — board has all CSS, JS, and base64-encoded images inline. No external dependencies. Works offline.

  3. Same-tab regeneration — user stays in one tab. Board auto-refreshes via /api/progress polling + window.location.reload(). No tab explosion.

  4. Graceful degradation — POST failure shows copyable JSON. Progress timeout shows clear error message. No silent failures.

  5. Post-submit lifecycle — board becomes read-only after submit. No zombie forms. Clear "what to do next" message.

Test Coverage

What's Tested

Flow                                    What it checks                    Test file
Submit → feedback.json on disk          browser click → file              feedback-roundtrip.test.ts
Post-submit UI lockdown                 inputs disabled, success shown    feedback-roundtrip.test.ts
Regenerate → feedback-pending.json      chiclet + regen click → file      feedback-roundtrip.test.ts
"More like this" → specific action      more_like_B in JSON               feedback-roundtrip.test.ts
Spinner after regenerate                DOM shows loading text            feedback-roundtrip.test.ts
Full regen → reload → submit            2-round trip                      feedback-roundtrip.test.ts
Server starts on random port            port 0 binding                    serve.test.ts
HTML injection of server URL            __GSTACK_SERVER_URL check         serve.test.ts
Invalid JSON rejection                  400 response                      serve.test.ts
HTML file validation                    exit 1 if missing                 serve.test.ts
Timeout behavior                        exit 1 after timeout              serve.test.ts
Board DOM structure                     radios, stars, chiclets           compare-board.test.ts

What's NOT Tested

Gap                                             Risk                                              Priority
Double-click submit race                        Low — inputs disable on first response            P3
Progress polling timeout (150 iterations)       Medium — 5 min is long to wait in a test          P2
Server crash during regeneration                Medium — user sees infinite spinner               P2
Network timeout during POST                     Low — localhost is fast                           P3
Backgrounded Chrome tab throttling intervals    Medium — could extend 5-min timeout to 30+ min    P2
Large feedback payload                          Low — board constructs fixed-shape JSON           P3
Concurrent sessions (two boards, one server)    Low — each $D serve gets its own port             P3
Stale feedback file from prior session          Medium — could confuse new polling loop           P2

Potential Improvements

Short-term (this branch)

  1. Write port to file — serve.ts writes serve.port to disk on startup. Agent reads it anytime. 5 lines.
  2. Clean stale files on startup — serve.ts deletes feedback*.json before starting. 3 lines.
  3. Guard double-click — check state === 'done' at top of handleFeedback(). 2 lines.
  4. try/catch file write — wrap fs.writeFileSync in try/catch, return 500 on failure. 5 lines.

Medium-term (follow-up)

  1. WebSocket instead of polling — replace setInterval + GET /api/progress with a WebSocket connection. Board gets instant notification when new HTML is ready. Eliminates polling drift and backgrounded-tab throttling. ~50 lines in serve.ts + ~20 lines in compare.ts.

  2. Port file for agent — write {"port": 54321, "pid": 12345, "html": "/path/board.html"} to $_DESIGN_DIR/serve.json. Agent reads this instead of parsing stderr. Makes the system more robust to context loss.

  3. Feedback schema validation — validate the POST body against a JSON schema before writing. Catch malformed feedback early instead of confusing the agent downstream.

Long-term (design direction)

  1. Persistent design server — instead of launching $D serve per session, run a long-lived design daemon (like the browse daemon). Multiple boards share one server. Eliminates cold start. But adds daemon lifecycle management complexity.

  2. Real-time collaboration — two agents (or one agent + one human) working on the same board simultaneously. Server broadcasts state changes via WebSocket. Requires conflict resolution on feedback.