Files
gstack/ARCHITECTURE.md
T
Garry Tan 78bc1d1968 feat: design binary — real UI mockup generation for gstack skills (v0.13.0.0) (#551)
* docs: design tools v1 plan — visual mockup generation for gstack skills

Full design doc covering the `design` binary that wraps OpenAI's GPT Image API
to generate real UI mockups from gstack's design skills. Includes comparison
board UX spec, auth model, 6 CEO expansions (design memory, mockup diffing,
screenshot evolution, design intent verification, responsive variants,
design-to-code prompt), and 9-commit implementation plan.

Reviewed: /office-hours + /plan-eng-review (CLEARED) + /plan-ceo-review
(EXPANSION, 6/6 accepted) + /plan-design-review (2/10 → 8/10).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design tools prototype validation — GPT Image API works

Prototype script sends 3 design briefs to OpenAI Responses API with
image_generation tool. Results: dashboard (47s, 2.1MB), landing page
(42s, 1.3MB), settings page (37s, 1.3MB) all produce real, implementable
UI mockups with accurate text rendering and clean layouts.

Key finding: Codex OAuth tokens lack image generation scopes. Direct
API key (sk-proj-*) required, stored in ~/.gstack/openai.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design binary core — generate, check, compare commands

Stateless CLI (design/dist/design) wrapping OpenAI Responses API for
UI mockup generation. Three working commands:

- generate: brief -> PNG mockup via gpt-4o + image_generation tool
- check: vision-based quality gate via GPT-4o (text readability, layout
  completeness, visual coherence)
- compare: generates self-contained HTML comparison board with star
  ratings, radio Pick, per-variant feedback, regenerate controls,
  and Submit button that writes structured JSON for agent polling

Auth reads from ~/.gstack/openai.json (0600), falls back to
OPENAI_API_KEY env var. Compiled separately from browse binary
(openai added to devDependencies, not runtime deps).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design binary variants + iterate commands

variants: generates N style variations with staggered parallel (1.5s
between launches, exponential backoff on 429). 7 built-in style
variations (bold, calm, warm, corporate, dark, playful + default).
Tested: 3/3 variants in 41.6s.

iterate: multi-turn design iteration using previous_response_id for
conversational threading. Falls back to re-generation with accumulated
feedback if threading doesn't retain visual context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: DESIGN_SETUP + DESIGN_MOCKUP template resolvers

Add generateDesignSetup() and generateDesignMockup() to the existing
design.ts resolver file. Add designDir to HostPaths (claude + codex).
Register DESIGN_SETUP and DESIGN_MOCKUP in the resolver index.

DESIGN_SETUP: $D binary discovery (mirrors $B browse setup pattern).
Falls back to DESIGN_SKETCH if binary not available.

DESIGN_MOCKUP: full visual exploration workflow template — construct
brief from DESIGN.md context, generate 3 variants, open comparison
board in Chrome, poll for user feedback, save approved mockup to
docs/designs/, generate HTML wireframe for implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sync package.json version with VERSION file (0.12.2.0)

Pre-existing mismatch: VERSION was 0.12.2.0 but package.json was
0.12.0.0. Also adds design binary to build script and dev:design
convenience command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /office-hours visual design exploration integration

Add {{DESIGN_MOCKUP}} to office-hours template before the existing
{{DESIGN_SKETCH}}. When the design binary is available, /office-hours
generates 3 visual mockup variants, opens a comparison board in Chrome,
and polls for user feedback. Falls back to HTML wireframes if the
design binary isn't built.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /plan-design-review visual mockup integration

Add {{DESIGN_SETUP}} to pre-review audit and "show me what 10/10
looks like" mockup generation to the 0-10 rating method. When a
design dimension rates below 7/10, the review can generate a mockup
showing the improved version. Falls back to text descriptions if
the design binary isn't available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design memory — extract visual language from mockups into DESIGN.md

New `$D extract` command: sends approved mockup to GPT-4o vision,
extracts color palette, typography, spacing, and layout patterns,
writes/updates DESIGN.md with an "Extracted Design Language" section.

Progressive constraint: if DESIGN.md exists, future mockup briefs
include it as style context. If no DESIGN.md, explorations run wide.
readDesignConstraints() reads existing DESIGN.md for brief construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: mockup diffing + design intent verification

New commands:
- $D diff --before old.png --after new.png: visual diff using GPT-4o
  vision. Returns differences by area with severity (high/medium/low)
  and a matchScore (0-100).
- $D verify --mockup approved.png --screenshot live.png: compares live
  site screenshot against approved design mockup. Pass if matchScore
  >= 70 and no high-severity differences.

Used by /design-review to close the design loop: design -> implement ->
verify visually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: screenshot-to-mockup evolution ($D evolve)

New command: $D evolve --screenshot current.png --brief "make it calmer"

Two-step process: first analyzes the screenshot via GPT-4o vision to
produce a detailed description, then generates a new mockup that keeps
the existing layout structure but applies the requested changes. Starts
from reality, not blank canvas.

Bridges the gap between /design-review critique ("the spacing is off")
and a visual proposal of the fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: responsive variants + design-to-code prompt

Responsive variants: $D variants --viewports desktop,tablet,mobile
generates mockups at 1536x1024, 1024x1024, and 1024x1536 (portrait)
with viewport-appropriate layout instructions.

Design-to-code prompt: $D prompt --image approved.png extracts colors,
typography, layout, and components via GPT-4o vision, producing a
structured implementation prompt. Reads DESIGN.md for additional
constraint context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.13.0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: gstack designer as first-class tool in /plan-design-review

Brand the gstack designer prominently, add Step 0.5 for proactive visual
mockup generation before review passes, and update priority hierarchy.
When a plan describes new UI, the skill now offers to generate mockups
with $D variants, run $D check for quality gating, and present a
comparison board via $B goto before any review passes begin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: integrate mockups into review passes and outputs

Thread Step 0.5 mockups through the review workflow: Pass 4 (AI Slop)
evaluates generated mockups visually, Pass 7 uses mockups as evidence
for unresolved decisions, post-pass offers one-shot regeneration after
design changes, and Approved Mockups section records chosen variants
with paths for the implementer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: gstack designer target mockups in /design-review fix loop

Add $D generate for target mockups in Phase 8a.5 — before fixing a
design finding, generate a mockup showing what it should look like.
Add $D verify in Phase 9 to compare fix results against targets.
Not plan mode — goes straight to implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: gstack designer AI mockups in /design-consultation Phase 5

Replace HTML preview with $D variants + comparison board when designer
is available (Path A). Use $D extract to derive DESIGN.md tokens from
the approved mockup. Handles both plan mode (write to plan) and
non-plan mode (implement immediately). Falls back to HTML preview
(Path B) when designer binary is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make gstack designer the default in /plan-design-review, not optional

The transcript showed the agent writing 5 text descriptions of homepage
variants instead of generating visual mockups, even when the user explicitly
asked for design tools. The skill treated mockups as optional ("Want me to
generate?") when they should be the default behavior.

Changes:
- Rename "Your Visual Design Tool" to "YOUR PRIMARY TOOL" with aggressive
  language: "Don't ask permission. Show it."
- Step 0.5 now generates mockups automatically when DESIGN_READY, no
  AskUserQuestion gatekeeping the default path
- Priority hierarchy: mockups are "non-negotiable" not "if available"
- Step 0D tells the user mockups are coming next
- DESIGN_NOT_AVAILABLE fallback now tells user what they're missing

The only valid reasons to skip mockups: no UI scope, or designer not
installed. Everything else generates by default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: persist design mockups to ~/.gstack/projects/$SLUG/designs/

Mockups were going to .context/mockups/ (gitignored, workspace-local).
This meant designs disappeared when switching workspaces or conversations,
and downstream skills couldn't reference approved mockups from earlier
reviews.

Now all three design skills save to persistent project-scoped dirs:
- /plan-design-review: ~/.gstack/projects/$SLUG/designs/<screen>-<date>/
- /design-consultation: ~/.gstack/projects/$SLUG/designs/design-system-<date>/
- /design-review: ~/.gstack/projects/$SLUG/designs/design-audit-<date>/

Each directory gets an approved.json recording the user's pick, feedback,
and branch. This lets /design-review verify against mockups that
/plan-design-review approved, and design history is browsable via
ls ~/.gstack/projects/$SLUG/designs/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate codex ship skill with zsh glob guards

Picked up setopt +o nomatch guards from main's v0.12.8.1 merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add browse binary discovery to DESIGN_SETUP resolver

The design setup block now discovers $B alongside $D, so skills can
open comparison boards via $B goto and poll feedback via $B eval.
Falls back to `open` on macOS when browse binary is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: comparison board DOM polling in plan-design-review

After opening the comparison board, the agent now polls
#status via $B eval instead of asking a rigid AskUserQuestion.
Handles submit (read structured JSON feedback), regenerate
(new variants with updated brief), and $B-unavailable fallback
(free-form text response). The user interacts with the real
board UI, not a constrained option picker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: comparison board feedback loop integration test

16 tests covering the full DOM polling cycle: structure verification,
submit with pick/rating/comment, regenerate flows (totally different,
more like this, custom text), and the agent polling pattern
(empty → submitted → read JSON). Uses real generateCompareHtml()
from design/src/compare.ts, served via HTTP. Runs in <1s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add $D serve command for HTTP-based comparison board feedback

The comparison board feedback loop was fundamentally broken: browse blocks
file:// URLs (url-validation.ts:71), so $B goto file://board.html always
fails. The fallback open + $B eval polls a different browser instance.

$D serve fixes this by serving the board over HTTP on localhost. The server
is stateful: stays alive across regeneration rounds, exposes /api/progress
for the board to poll, and accepts /api/reload from the agent to swap in
new board HTML. Stdout carries feedback JSON only; stderr carries telemetry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: dual-mode feedback + post-submit lifecycle in comparison board

When __GSTACK_SERVER_URL is set (injected by $D serve), the board POSTs
feedback to the server instead of only writing to hidden DOM elements.
After submit: disables all inputs, shows "Return to your coding agent."
After regenerate: shows spinner, polls /api/progress, auto-refreshes on
ready. On POST failure: shows copyable JSON fallback. On progress timeout
(5 min): shows error with /design-shotgun prompt. DOM fallback preserved
for headed browser mode and tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: HTTP serve command endpoints and regeneration lifecycle

11 tests covering: HTML serving with injected server URL, /api/progress
state reporting, submit → done lifecycle, regenerate → regenerating state,
remix with remixSpec, malformed JSON rejection, /api/reload HTML swapping,
missing file validation, and full regenerate → reload → submit round-trip.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add DESIGN_SHOTGUN_LOOP resolver + fix design artifact paths

Adds generateDesignShotgunLoop() resolver for the shared comparison board
feedback loop (serve via HTTP, handle regenerate/remix, AskUserQuestion
fallback, feedback confirmation). Registered as {{DESIGN_SHOTGUN_LOOP}}.

Fixes generateDesignMockup() to use ~/.gstack/projects/$SLUG/designs/
instead of /tmp/ and docs/designs/. Replaces broken $B goto file:// +
$B eval polling with $D compare --serve (HTTP-based, stdout feedback).

Adds CRITICAL PATH RULE guardrail to DESIGN_SETUP: design artifacts must
go to ~/.gstack/projects/$SLUG/designs/, never .context/ or /tmp/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /design-shotgun standalone design exploration skill

New skill for visual brainstorming: generate AI design variants, open a
comparison board in the user's browser, collect structured feedback, and
iterate. Features: session detection (revisit prior explorations), 5-dimension
context gathering (who, job to be done, what exists, user flow, edge cases),
taste memory (prior approved designs bias new generations), inline variant
preview, configurable variant count, screenshot-to-variants via $D evolve.

Uses {{DESIGN_SHOTGUN_LOOP}} resolver for the feedback loop. Saves all
artifacts to ~/.gstack/projects/$SLUG/designs/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files for design-shotgun + resolver changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add remix UI to comparison board

Per-variant element selectors (Layout, Colors, Typography, Spacing) with
radio buttons in a grid. Remix button collects selections into a remixSpec
object and sends via the same HTTP POST feedback mechanism. Enabled only
when at least one element is selected. Board shows regenerating spinner
while agent generates the hybrid variant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add $D gallery command for design history timeline

Generates a self-contained HTML page showing all prior design explorations
for a project: every variant (approved or not), feedback notes, organized
by date (newest first). Images embedded as base64. Handles corrupted
approved.json gracefully (skips, still shows the session). Empty state
shows "No history yet" with /design-shotgun prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: gallery generation — sessions, dates, corruption, empty state

7 tests: empty dir, nonexistent dir, single session with approved variant,
multiple sessions sorted newest-first, corrupted approved.json handled
gracefully, session without approved.json, self-contained HTML (no
external dependencies).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: replace broken file:// polling with {{DESIGN_SHOTGUN_LOOP}}

plan-design-review and design-consultation templates previously used
$B goto file:// + $B eval polling for the comparison board feedback loop.
This was broken (browse blocks file:// URLs). Both templates now use
{{DESIGN_SHOTGUN_LOOP}} which serves via HTTP, handles regeneration in
the same browser tab, and falls back to AskUserQuestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add design-shotgun touchfile entries and tier classifications

design-shotgun-path (gate): verify artifacts go to ~/.gstack/, not .context/
design-shotgun-session (gate): verify repeat-run detection + AskUserQuestion
design-shotgun-full (periodic): full round-trip with real design binary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files for template refactor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: comparison board UI improvements — option headers, pick confirmation, grid view

Three changes to the design comparison board:

1. Pick confirmation: selecting "Pick" on Option A shows "We'll move
   forward with Option A" in green, plus a status line above the submit
   button repeating the choice.

2. Clear option headers: each variant now has "Option A" in bold with a
   subtitle above the image, instead of just the raw image.

3. View toggle: top-right Large/Grid buttons switch between single-column
   (default) and 3-across grid view.

Also restructured the bottom section into a 2-column grid: submit/overall
feedback on the left, regenerate controls on the right.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use 127.0.0.1 instead of localhost for serve URL

Avoids DNS resolution issues on some systems where localhost may resolve
to IPv6 ::1 while Bun listens on IPv4 only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: write ALL feedback to disk so agent can poll in background mode

The agent backgrounds $D serve (Claude Code can't block on a subprocess
and do other work simultaneously). With stdout-only feedback delivery,
the agent never sees regenerate/remix feedback.

Fix: write feedback-pending.json (regenerate/remix) and feedback.json
(submit) to disk next to the board HTML. Agent polls the filesystem
instead of reading stdout. Both channels (stdout + disk) are always
active so foreground mode still works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: DESIGN_SHOTGUN_LOOP uses file polling instead of stdout reading

Update the template resolver to instruct the agent to background $D serve
and poll for feedback-pending.json / feedback.json on a 5-second loop.
This matches the real-world pattern where Claude Code / Conductor agents
can't block on subprocess stdout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files for file-polling feedback loop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: null-safe DOM selectors for post-submit and regenerating states

The user's layout restructure renamed .regenerate-bar → .regen-column,
.submit-bar → .submit-column, and .overall-section → .bottom-section.
The JS still referenced the old class names, causing querySelector to
return null and showPostSubmitState() / showRegeneratingState() to
silently crash. This meant Submit and Regenerate buttons appeared to
work (DOM elements updated, HTTP POST succeeded) but the visual
feedback (disabled inputs, spinner, success message) never appeared.

Fix: use fallback selectors that check both old and new class names,
with null guards so a missing element doesn't crash the function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: end-to-end feedback roundtrip — browser click to file on disk

The test that proves "changes on the website propagate to Claude Code."
Opens the comparison board in a real headless browser with __GSTACK_SERVER_URL
injected, simulates user clicks (Submit, Regenerate, More Like This), and
verifies that feedback.json / feedback-pending.json land on disk with the
correct structured data.

6 tests covering: submit → feedback.json, post-submit UI lockdown,
regenerate → feedback-pending.json, more-like-this → feedback-pending.json,
regenerate spinner display, and full regen → reload → submit round-trip.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: comprehensive design doc for Design Shotgun feedback loop

Documents the full browser-to-agent feedback architecture: state machine,
file-based polling, port discovery, post-submit lifecycle, and every known
edge case (zombie forms, dead servers, stale spinners, file:// bug,
double-click races, port coordination, sequential generate rule).

Includes ASCII diagrams of the data flow and state transitions, complete
step-by-step walkthrough of happy path and regeneration path, test coverage
map with gaps, and short/medium/long-term improvement ideas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: plan-design-review agent guardrails for feedback loop

Four fixes to prevent agents from reinventing the feedback loop badly:

1. Sequential generate rule: explicit instruction that $D generate calls
   must run one at a time (API rate-limits concurrent image generation).
2. No-AskUserQuestion-for-feedback rule: agent reads feedback.json instead
   of re-asking what the user picked.
3. Remove file:// references: $B goto file:// was always rejected by
   url-validation.ts. The --serve flag handles everything.
4. Remove $B eval polling reference: no longer needed with HTTP POST.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: design-shotgun Step 3 progressive reveal, silent failure detection, timing estimate

Three production UX bugs fixed:
1. Dead air — now shows timing estimate before generation starts
2. Silent variant drop — replaced $D variants batch with individual $D generate
   calls, each verified for existence and non-zero size with retry
3. No progressive reveal — each variant shown inline via Read tool immediately
   after generation (~60s increments instead of all at ~180s)

Also: /tmp/ then cp as default output pattern (sandbox workaround),
screenshot taken once for evolve path (not per-variant).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: parallel design-shotgun with concept-first confirmation

Step 3 rewritten to concept-first + parallel Agent architecture:
- 3a: generate text concepts (free, instant)
- 3b: AskUserQuestion to confirm/modify before spending API credits
- 3c: launch N Agent subagents in parallel (~60s total regardless of count)
- 3d: show all results, dynamic image list for comparison board

Adds Agent to allowed-tools. Softens plan-design-review sequential
warning to note design-shotgun uses parallel at Tier 2+.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.13.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: untrack .agents/skills/ — generated at setup, already gitignored

These files were committed despite .agents/ being in .gitignore.
They regenerate from ./setup --host codex on any machine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate design-shotgun SKILL.md for v0.12.12.0 preamble changes

Merge from main brought updated preamble resolver (conditional telemetry,
local JSONL logging) but design-shotgun/SKILL.md wasn't regenerated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:32:59 -06:00

363 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Architecture
This document explains **why** gstack is built the way it is. For setup and commands, see CLAUDE.md. For contributing, see CONTRIBUTING.md.
## The core idea
gstack gives Claude Code a persistent browser and a set of opinionated workflow skills. The browser is the hard part — everything else is Markdown.
The key insight: an AI agent interacting with a browser needs **sub-second latency** and **persistent state**. If every command cold-starts a browser, you're waiting 3-5 seconds per tool call. If the browser dies between commands, you lose cookies, tabs, and login sessions. So gstack runs a long-lived Chromium daemon that the CLI talks to over localhost HTTP.
```
Claude Code gstack
───────── ──────
┌──────────────────────┐
Tool call: $B snapshot -i │ CLI (compiled binary)│
─────────────────────────→ │ • reads state file │
│ • POST /command │
│ to localhost:PORT │
└──────────┬───────────┘
│ HTTP
┌──────────▼───────────┐
│ Server (Bun.serve) │
│ • dispatches command │
│ • talks to Chromium │
│ • returns plain text │
└──────────┬───────────┘
│ CDP
┌──────────▼───────────┐
│ Chromium (headless) │
│ • persistent tabs │
│ • cookies carry over │
│ • 30min idle timeout │
└───────────────────────┘
```
First call starts everything (~3s). Every call after: ~100-200ms.
## Why Bun
Node.js would work. Bun is better here for three reasons:
1. **Compiled binaries.** `bun build --compile` produces a single ~58MB executable. No `node_modules` at runtime, no `npx`, no PATH configuration. The binary just runs. This matters because gstack installs into `~/.claude/skills/` where users don't expect to manage a Node.js project.
2. **Native SQLite.** Cookie decryption reads Chromium's SQLite cookie database directly. Bun has `new Database()` built in — no `better-sqlite3`, no native addon compilation, no gyp. One less thing that breaks on different machines.
3. **Native TypeScript.** The server runs as `bun run server.ts` during development. No compilation step, no `ts-node`, no source maps to debug. The compiled binary is for deployment; source files are for development.
4. **Built-in HTTP server.** `Bun.serve()` is fast, simple, and doesn't need Express or Fastify. The server handles ~10 routes total. A framework would be overhead.
The bottleneck is always Chromium, not the CLI or server. Bun's startup speed (~1ms for the compiled binary vs ~100ms for Node) is nice but not the reason we chose it. The compiled binary and native SQLite are.
## The daemon model
### Why not start a browser per command?
Playwright can launch Chromium in ~2-3 seconds. For a single screenshot, that's fine. For a QA session with 20+ commands, it's 40+ seconds of browser startup overhead. Worse: you lose all state between commands. Cookies, localStorage, login sessions, open tabs — all gone.
The daemon model means:
- **Persistent state.** Log in once, stay logged in. Open a tab, it stays open. localStorage persists across commands.
- **Sub-second commands.** After the first call, every command is just an HTTP POST. ~100-200ms round-trip including Chromium's work.
- **Automatic lifecycle.** The server auto-starts on first use, auto-shuts down after 30 minutes idle. No process management needed.
### State file
The server writes `.gstack/browse.json` (atomic write via tmp + rename, mode 0o600):
```json
{ "pid": 12345, "port": 34567, "token": "uuid-v4", "startedAt": "...", "binaryVersion": "abc123" }
```
The CLI reads this file to find the server. If the file is missing or the server fails an HTTP health check, the CLI spawns a new server. On Windows, PID-based process detection is unreliable in Bun binaries, so the health check (GET /health) is the primary liveness signal on all platforms.
### Port selection
Random port between 10000-60000 (retry up to 5 on collision). This means 10 Conductor workspaces can each run their own browse daemon with zero configuration and zero port conflicts. The old approach (scanning 9400-9409) broke constantly in multi-workspace setups.
### Version auto-restart
The build writes `git rev-parse HEAD` to `browse/dist/.version`. On each CLI invocation, if the binary's version doesn't match the running server's `binaryVersion`, the CLI kills the old server and starts a new one. This prevents the "stale binary" class of bugs entirely — rebuild the binary, next command picks it up automatically.
## Security model
### Localhost only
The HTTP server binds to `localhost`, not `0.0.0.0`. It's not reachable from the network.
### Bearer token auth
Every server session generates a random UUID token, written to the state file with mode 0o600 (owner-only read). Every HTTP request must include `Authorization: Bearer <token>`. If the token doesn't match, the server returns 401.
This prevents other processes on the same machine from talking to your browse server. The cookie picker UI (`/cookie-picker`) and health check (`/health`) are exempt — they're localhost-only and don't execute commands.
### Cookie security
Cookies are the most sensitive data gstack handles. The design:
1. **Keychain access requires user approval.** First cookie import per browser triggers a macOS Keychain dialog. The user must click "Allow" or "Always Allow." gstack never silently accesses credentials.
2. **Decryption happens in-process.** Cookie values are decrypted in memory (PBKDF2 + AES-128-CBC), loaded into the Playwright context, and never written to disk in plaintext. The cookie picker UI never displays cookie values — only domain names and counts.
3. **Database is read-only.** gstack copies the Chromium cookie DB to a temp file (to avoid SQLite lock conflicts with the running browser) and opens it read-only. It never modifies your real browser's cookie database.
4. **Key caching is per-session.** The Keychain password + derived AES key are cached in memory for the server's lifetime. When the server shuts down (idle timeout or explicit stop), the cache is gone.
5. **No cookie values in logs.** Console, network, and dialog logs never contain cookie values. The `cookies` command outputs cookie metadata (domain, name, expiry) but values are truncated.
### Shell injection prevention
The browser registry (Comet, Chrome, Arc, Brave, Edge) is hardcoded. Database paths are constructed from known constants, never from user input. Keychain access uses `Bun.spawn()` with explicit argument arrays, not shell string interpolation.
## The ref system
Refs (`@e1`, `@e2`, `@c1`) are how the agent addresses page elements without writing CSS selectors or XPath.
### How it works
```
1. Agent runs: $B snapshot -i
2. Server calls Playwright's page.accessibility.snapshot()
3. Parser walks the ARIA tree, assigns sequential refs: @e1, @e2, @e3...
4. For each ref, builds a Playwright Locator: getByRole(role, { name }).nth(index)
5. Stores Map<string, RefEntry> on the BrowserManager instance (role + name + Locator)
6. Returns the annotated tree as plain text
Later:
7. Agent runs: $B click @e3
8. Server resolves @e3 → Locator → locator.click()
```
### Why Locators, not DOM mutation
The obvious approach is to inject `data-ref="@e1"` attributes into the DOM. This breaks on:
- **CSP (Content Security Policy).** Many production sites block DOM modification from scripts.
- **React/Vue/Svelte hydration.** Framework reconciliation can strip injected attributes.
- **Shadow DOM.** Can't reach inside shadow roots from the outside.
Playwright Locators are external to the DOM. They use the accessibility tree (which Chromium maintains internally) and `getByRole()` queries. No DOM mutation, no CSP issues, no framework conflicts.
### Ref lifecycle
Refs are cleared on navigation (the `framenavigated` event on the main frame). This is correct — after navigation, all locators are stale. The agent must run `snapshot` again to get fresh refs. This is by design: stale refs should fail loudly, not click the wrong element.
### Ref staleness detection
SPAs can mutate the DOM without triggering `framenavigated` (e.g. React router transitions, tab switches, modal opens). This makes refs stale even though the page URL didn't change. To catch this, `resolveRef()` performs an async `count()` check before using any ref:
```
resolveRef(@e3) → entry = refMap.get("e3")
→ count = await entry.locator.count()
→ if count === 0: throw "Ref @e3 is stale — element no longer exists. Run 'snapshot' to get fresh refs."
→ if count > 0: return { locator }
```
This fails fast (~5ms overhead) instead of letting Playwright's 30-second action timeout expire on a missing element. The `RefEntry` stores `role` and `name` metadata alongside the Locator so the error message can tell the agent what the element was.
### Cursor-interactive refs (@c)
The `-C` flag finds elements that are clickable but not in the ARIA tree — things styled with `cursor: pointer`, elements with `onclick` attributes, or custom `tabindex`. These get `@c1`, `@c2` refs in a separate namespace. This catches custom components that frameworks render as `<div>` but are actually buttons.
## Logging architecture
Three ring buffers (50,000 entries each, O(1) push):
```
Browser events → CircularBuffer (in-memory) → Async flush to .gstack/*.log
```
Console messages, network requests, and dialog events each have their own buffer. Flushing happens every 1 second — the server appends only new entries since the last flush. This means:
- HTTP request handling is never blocked by disk I/O
- Logs survive server crashes (up to 1 second of data loss)
- Memory is bounded (50K entries × 3 buffers)
- Disk files are append-only, readable by external tools
The `console`, `network`, and `dialog` commands read from the in-memory buffers, not disk. Disk files are for post-mortem debugging.
## SKILL.md template system
### The problem
SKILL.md files tell Claude how to use the browse commands. If the docs list a flag that doesn't exist, or miss a command that was added, the agent hits errors. Hand-maintained docs always drift from code.
### The solution
```
SKILL.md.tmpl (human-written prose + placeholders)
gen-skill-docs.ts (reads source code metadata)
SKILL.md (committed, auto-generated sections)
```
Templates contain the workflows, tips, and examples that require human judgment. Placeholders are filled from source code at build time:
| Placeholder | Source | What it generates |
|-------------|--------|-------------------|
| `{{COMMAND_REFERENCE}}` | `commands.ts` | Categorized command table |
| `{{SNAPSHOT_FLAGS}}` | `snapshot.ts` | Flag reference with examples |
| `{{PREAMBLE}}` | `gen-skill-docs.ts` | Startup block: update check, session tracking, contributor mode, AskUserQuestion format |
| `{{BROWSE_SETUP}}` | `gen-skill-docs.ts` | Binary discovery + setup instructions |
| `{{BASE_BRANCH_DETECT}}` | `gen-skill-docs.ts` | Dynamic base branch detection for PR-targeting skills (ship, review, qa, plan-ceo-review) |
| `{{QA_METHODOLOGY}}` | `gen-skill-docs.ts` | Shared QA methodology block for /qa and /qa-only |
| `{{DESIGN_METHODOLOGY}}` | `gen-skill-docs.ts` | Shared design audit methodology for /plan-design-review and /design-review |
| `{{REVIEW_DASHBOARD}}` | `gen-skill-docs.ts` | Review Readiness Dashboard for /ship pre-flight |
| `{{TEST_BOOTSTRAP}}` | `gen-skill-docs.ts` | Test framework detection, bootstrap, CI/CD setup for /qa, /ship, /design-review |
| `{{CODEX_PLAN_REVIEW}}` | `gen-skill-docs.ts` | Optional cross-model plan review (Codex or Claude subagent fallback) for /plan-ceo-review and /plan-eng-review |
| `{{DESIGN_SETUP}}` | `resolvers/design.ts` | Discovery pattern for `$D` design binary, mirrors `{{BROWSE_SETUP}}` |
| `{{DESIGN_SHOTGUN_LOOP}}` | `resolvers/design.ts` | Shared comparison board feedback loop for /design-shotgun, /plan-design-review, /design-consultation |
This is structurally sound — if a command exists in code, it appears in docs. If it doesn't exist, it can't appear.
### The preamble
Every skill starts with a `{{PREAMBLE}}` block that runs before the skill's own logic. It handles five things in a single bash command:
1. **Update check** — calls `gstack-update-check`, reports if an upgrade is available.
2. **Session tracking** — touches `~/.gstack/sessions/$PPID` and counts active sessions (files modified in the last 2 hours). When 3+ sessions are running, all skills enter "ELI16 mode" — every question re-grounds the user on context because they're juggling windows.
3. **Contributor mode** — reads `gstack_contributor` from config. When true, the agent files casual field reports to `~/.gstack/contributor-logs/` when gstack itself misbehaves.
4. **AskUserQuestion format** — universal format: context, question, `RECOMMENDATION: Choose X because ___`, lettered options. Consistent across all skills.
5. **Search Before Building** — before building infrastructure or unfamiliar patterns, search first. Three layers of knowledge: tried-and-true (Layer 1), new-and-popular (Layer 2), first-principles (Layer 3). When first-principles reasoning reveals conventional wisdom is wrong, the agent names the "eureka moment" and logs it. See `ETHOS.md` for the full builder philosophy.
### Why committed, not generated at runtime?
Three reasons:
1. **Claude reads SKILL.md at skill load time.** There's no build step when a user invokes `/browse`. The file must already exist and be correct.
2. **CI can validate freshness.** `gen:skill-docs --dry-run` + `git diff --exit-code` catches stale docs before merge.
3. **Git blame works.** You can see when a command was added and in which commit.
### Template test tiers
| Tier | What | Cost | Speed |
|------|------|------|-------|
| 1 — Static validation | Parse every `$B` command in SKILL.md, validate against registry | Free | <2s |
| 2 — E2E via `claude -p` | Spawn real Claude session, run each skill, check for errors | ~$3.85 | ~20min |
| 3 — LLM-as-judge | Sonnet scores docs on clarity/completeness/actionability | ~$0.15 | ~30s |
Tier 1 runs on every `bun test`. Tiers 2+3 are gated behind `EVALS=1`. The idea is: catch 95% of issues for free, use LLMs only for judgment calls.
## Command dispatch
Commands are categorized by side effects:
- **READ** (text, html, links, console, cookies, ...): No mutations. Safe to retry. Returns page state.
- **WRITE** (goto, click, fill, press, ...): Mutates page state. Not idempotent.
- **META** (snapshot, screenshot, tabs, chain, ...): Server-level operations that don't fit neatly into read/write.
This isn't just organizational. The server uses it for dispatch:
```typescript
if (READ_COMMANDS.has(cmd)) handleReadCommand(cmd, args, bm)
if (WRITE_COMMANDS.has(cmd)) handleWriteCommand(cmd, args, bm)
if (META_COMMANDS.has(cmd)) handleMetaCommand(cmd, args, bm, shutdown)
```
The `help` command returns all three sets so agents can self-discover available commands.
## Error philosophy
Errors are for AI agents, not humans. Every error message must be actionable:
- "Element not found" → "Element not found or not interactable. Run `snapshot -i` to see available elements."
- "Selector matched multiple elements" → "Selector matched multiple elements. Use @refs from `snapshot` instead."
- Timeout → "Navigation timed out after 30s. The page may be slow or the URL may be wrong."
Playwright's native errors are rewritten through `wrapError()` to strip internal stack traces and add guidance. The agent should be able to read the error and know what to do next without human intervention.
### Crash recovery
The server doesn't try to self-heal. If Chromium crashes (`browser.on('disconnected')`), the server exits immediately. The CLI detects the dead server on the next command and auto-restarts. This is simpler and more reliable than trying to reconnect to a half-dead browser process.
## E2E test infrastructure
### Session runner (`test/helpers/session-runner.ts`)
E2E tests spawn `claude -p` as a completely independent subprocess — not via the Agent SDK, which can't nest inside Claude Code sessions. The runner:
1. Writes the prompt to a temp file (avoids shell escaping issues)
2. Spawns `sh -c 'cat prompt | claude -p --output-format stream-json --verbose'`
3. Streams NDJSON from stdout for real-time progress
4. Races against a configurable timeout
5. Parses the full NDJSON transcript into structured results
The `parseNDJSON()` function is pure — no I/O, no side effects — making it independently testable.
### Observability data flow
```
skill-e2e-*.test.ts
│ generates runId, passes testName + runId to each call
┌─────┼──────────────────────────────┐
│ │ │
│ runSkillTest() evalCollector
│ (session-runner.ts) (eval-store.ts)
│ │ │
│ per tool call: per addTest():
│ ┌──┼──────────┐ savePartial()
│ │ │ │ │
│ ▼ ▼ ▼ ▼
│ [HB] [PL] [NJ] _partial-e2e.json
│ │ │ │ (atomic overwrite)
│ │ │ │
│ ▼ ▼ ▼
│ e2e- prog- {name}
│ live ress .ndjson
│ .json .log
│ on failure:
│ {name}-failure.json
│ ALL files in ~/.gstack-dev/
│ Run dir: e2e-runs/{runId}/
│ eval-watch.ts
│ │
│ ┌─────┴─────┐
│ read HB read partial
│ └─────┬─────┘
│ ▼
│ render dashboard
│ (stale >10min? warn)
```
**Split ownership:** session-runner owns the heartbeat (current test state), eval-store owns partial results (completed test state). The watcher reads both. Neither component knows about the other — they share data only through the filesystem.
**Non-fatal everything:** All observability I/O is wrapped in try/catch. A write failure never causes a test to fail. The tests themselves are the source of truth; observability is best-effort.
**Machine-readable diagnostics:** Each test result includes `exit_reason` (success, timeout, error_max_turns, error_api, exit_code_N), `timeout_at_turn`, and `last_tool_call`. This enables `jq` queries like:
```bash
jq '.tests[] | select(.exit_reason == "timeout") | .last_tool_call' ~/.gstack-dev/evals/_partial-e2e.json
```
### Eval persistence (`test/helpers/eval-store.ts`)
The `EvalCollector` accumulates test results and writes them in two ways:
1. **Incremental:** `savePartial()` writes `_partial-e2e.json` after each test (atomic: write `.tmp`, `fs.renameSync`). Survives kills.
2. **Final:** `finalize()` writes a timestamped eval file (e.g. `e2e-20260314-143022.json`). The partial file is never cleaned up — it persists alongside the final file for observability.
`eval:compare` diffs two eval runs. `eval:summary` aggregates stats across all runs in `~/.gstack-dev/evals/`.
### Test tiers
| Tier | What | Cost | Speed |
|------|------|------|-------|
| 1 — Static validation | Parse `$B` commands, validate against registry, observability unit tests | Free | <5s |
| 2 — E2E via `claude -p` | Spawn real Claude session, run each skill, scan for errors | ~$3.85 | ~20min |
| 3 — LLM-as-judge | Sonnet scores docs on clarity/completeness/actionability | ~$0.15 | ~30s |
Tier 1 runs on every `bun test`. Tiers 2+3 are gated behind `EVALS=1`. The idea: catch 95% of issues for free, use LLMs only for judgment calls and integration testing.
## What's intentionally not here
- **No WebSocket streaming.** HTTP request/response is simpler, debuggable with curl, and fast enough. Streaming would add complexity for marginal benefit.
- **No MCP protocol.** MCP adds JSON schema overhead per request and requires a persistent connection. Plain HTTP + plain text output is lighter on tokens and easier to debug.
- **No multi-user support.** One server per workspace, one user. The token auth is defense-in-depth, not multi-tenancy.
- **No Windows/Linux cookie decryption.** macOS Keychain is the only supported credential store. Linux (GNOME Keyring/kwallet) and Windows (DPAPI) are architecturally possible but not implemented.
- **No iframe auto-discovery.** `$B frame` supports cross-frame interaction (CSS selector, @ref, `--name`, `--url` matching), but the ref system does not auto-crawl iframes during `snapshot`. You must explicitly enter a frame context first.