feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41)

* refactor: extract command registry to commands.ts, add SNAPSHOT_FLAGS metadata

- NEW: browse/src/commands.ts: command sets + COMMAND_DESCRIPTIONS + load-time validation (zero side effects)
- server.ts imports from commands.ts instead of declaring sets inline
- snapshot.ts: SNAPSHOT_FLAGS array drives parseSnapshotArgs (metadata-driven, no duplication)
- All 186 existing tests pass

* feat: SKILL.md template system with auto-generated command references

- SKILL.md.tmpl + browse/SKILL.md.tmpl with {{COMMAND_REFERENCE}} and {{SNAPSHOT_FLAGS}} placeholders
- scripts/gen-skill-docs.ts generates SKILL.md from templates (supports --dry-run)
- Build pipeline runs gen:skill-docs before binary compilation
- Generated files have AUTO-GENERATED header, committed to git

* test: Tier 1 static validation, 34 tests for SKILL.md command correctness

- test/helpers/skill-parser.ts: extracts $B commands from code blocks, validates against registry
- test/skill-parser.test.ts: 13 parser/validator unit tests
- test/skill-validation.test.ts: 13 tests validating all SKILL.md files + registry consistency
- test/gen-skill-docs.test.ts: 8 generator tests (categories, sorting, freshness)

* feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

- scripts/skill-check.ts: health summary for all SKILL.md files (commands, templates, freshness)
- scripts/dev-skill.ts: watch mode for template development
- test/helpers/session-runner.ts: Agent SDK wrapper for E2E skill tests
- test/skill-e2e.test.ts: 2 E2E tests + 3 stubs (auto-skip inside Claude Code sessions)
- E2E tests must run from plain terminal: SKILL_E2E=1 bun test test/skill-e2e.test.ts

* ci: SKILL.md freshness check on push/PR + TODO updates

- .github/workflows/skill-docs.yml: fails if generated SKILL.md files are stale
- TODO.md: add E2E cost tracking and model pinning to future ideas

* fix: restore rich descriptions lost in auto-generation

- Snapshot flags: add back value hints (-d <N>, -s <sel>, -o <path>)
- Snapshot flags: restore parenthetical context (@e refs, @c refs, etc.)
- Commands: is → includes valid states enum
- Commands: console → notes --errors filter behavior
- Commands: press → lists common keys (Enter, Tab, Escape)
- Commands: cookie-import-browser → describes picker UI
- Commands: dialog-accept → specifies alert/confirm/prompt
- Tips: restore → arrow (was downgraded to ->)

* test: quality evals for generated SKILL.md descriptions

Catches the exact regressions we shipped and caught in review:

- Snapshot flags must include value hints (-d <N>, -s <sel>, -o <path>)
- is command must list all valid states (visible/hidden/enabled/...)
- press command must list example keys (Enter, Tab, Escape)
- console command must describe --errors behavior
- Snapshot -i must mention @e refs, -C must mention @c refs
- All descriptions must be >= 8 chars (no empty stubs)
- Tips section must use → not ->

* feat: LLM-as-judge evals for SKILL.md documentation quality

4 eval tests using Anthropic API (claude-haiku, ~$0.01-0.03/run):

- Command reference table: clarity/completeness/actionability >= 4/5
- Snapshot flags section: same thresholds
- browse/SKILL.md overall quality
- Regression: generated version must score >= hand-maintained baseline

Requires ANTHROPIC_API_KEY. Auto-skips without it.
Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts)

* chore: bump version to 0.3.3, update changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add ARCHITECTURE.md, update CLAUDE.md and CONTRIBUTING.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: conductor.json lifecycle hooks + .env propagation across worktrees

bin/dev-setup now copies .env from main worktree so API keys carry over to Conductor workspaces automatically. conductor.json wires up setup and archive hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: complete CHANGELOG for v0.3.3 (architecture, conductor, .env)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-28 12:39:58 +02:00 · 2026-03-13 21:08:12 -07:00
parent ea0c0dad5e
commit 5205070299
29 changed files with 2479 additions and 135 deletions
@@ -0,0 +1,240 @@
+# Architecture
+
+This document explains **why** gstack is built the way it is. For setup and commands, see CLAUDE.md. For contributing, see CONTRIBUTING.md.
+
+## The core idea
+
+gstack gives Claude Code a persistent browser and a set of opinionated workflow skills. The browser is the hard part — everything else is Markdown.
+
+The key insight: an AI agent interacting with a browser needs **sub-second latency** and **persistent state**. If every command cold-starts a browser, you're waiting 3-5 seconds per tool call. If the browser dies between commands, you lose cookies, tabs, and login sessions. So gstack runs a long-lived Chromium daemon that the CLI talks to over localhost HTTP.
+
+```
+Claude Code                     gstack
+─────────                      ──────
+                               ┌──────────────────────┐
+  Tool call: $B snapshot -i    │  CLI (compiled binary)│
+  ─────────────────────────→   │  • reads state file   │
+                               │  • POST /command      │
+                               │    to localhost:PORT   │
+                               └──────────┬───────────┘
+                                          │ HTTP
+                               ┌──────────▼───────────┐
+                               │  Server (Bun.serve)   │
+                               │  • dispatches command  │
+                               │  • talks to Chromium   │
+                               │  • returns plain text  │
+                               └──────────┬───────────┘
+                                          │ CDP
+                               ┌──────────▼───────────┐
+                               │  Chromium (headless)   │
+                               │  • persistent tabs     │
+                               │  • cookies carry over  │
+                               │  • 30min idle timeout  │
+                               └───────────────────────┘
+```
+
+First call starts everything (~3s). Every call after: ~100-200ms.
+
+## Why Bun
+
+Node.js would work. Bun is better here for four reasons:
+
+1. **Compiled binaries.** `bun build --compile` produces a single ~58MB executable. No `node_modules` at runtime, no `npx`, no PATH configuration. The binary just runs. This matters because gstack installs into `~/.claude/skills/` where users don't expect to manage a Node.js project.
+
+2. **Native SQLite.** Cookie decryption reads Chromium's SQLite cookie database directly. Bun has `new Database()` built in — no `better-sqlite3`, no native addon compilation, no gyp. One less thing that breaks on different machines.
+
+3. **Native TypeScript.** The server runs as `bun run server.ts` during development. No compilation step, no `ts-node`, no source maps to debug. The compiled binary is for deployment; source files are for development.
+
+4. **Built-in HTTP server.** `Bun.serve()` is fast, simple, and doesn't need Express or Fastify. The server handles ~10 routes total. A framework would be overhead.
+
+The bottleneck is always Chromium, not the CLI or server. Bun's startup speed (~1ms for the compiled binary vs ~100ms for Node) is nice but not the reason we chose it. The compiled binary and native SQLite are.
+
+## The daemon model
+
+### Why not start a browser per command?
+
+Playwright can launch Chromium in ~2-3 seconds. For a single screenshot, that's fine. For a QA session with 20+ commands, it's 40+ seconds of browser startup overhead. Worse: you lose all state between commands. Cookies, localStorage, login sessions, open tabs — all gone.
+
+The daemon model means:
+
+- **Persistent state.** Log in once, stay logged in. Open a tab, it stays open. localStorage persists across commands.
+- **Sub-second commands.** After the first call, every command is just an HTTP POST. ~100-200ms round-trip including Chromium's work.
+- **Automatic lifecycle.** The server auto-starts on first use, auto-shuts down after 30 minutes idle. No process management needed.
+
+### State file
+
+The server writes `.gstack/browse.json` (atomic write via tmp + rename, mode 0o600):
+
+```json
+{ "pid": 12345, "port": 34567, "token": "uuid-v4", "startedAt": "...", "binaryVersion": "abc123" }
+```
+
+The CLI reads this file to find the server. If the file is missing, stale, or the PID is dead, the CLI spawns a new server.
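+
+A minimal sketch of the write and read paths described above, assuming the Node-compatible `fs` APIs that Bun also provides. The function names `writeStateAtomic` and `readLiveState` are hypothetical, not taken from the gstack source, and probing the PID with signal 0 is one common liveness check rather than a confirmed detail of the CLI.
+
+```typescript
+import { mkdtempSync, readFileSync, renameSync, statSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+
+interface ServerState {
+  pid: number;
+  port: number;
+  token: string;
+  startedAt: string;
+  binaryVersion: string;
+}
+
+// Write to a temp file in the same directory, then rename it into place.
+// rename() is atomic on POSIX, so readers never see a half-written file.
+function writeStateAtomic(path: string, state: ServerState): void {
+  const tmp = `${path}.${process.pid}.tmp`;
+  writeFileSync(tmp, JSON.stringify(state), { mode: 0o600 });
+  renameSync(tmp, path);
+}
+
+// Returns null when the file is missing or malformed, or when the PID cannot
+// be signalled. A caller would treat null as "spawn a new server".
+function readLiveState(path: string): ServerState | null {
+  try {
+    const state = JSON.parse(readFileSync(path, "utf8")) as ServerState;
+    process.kill(state.pid, 0); // signal 0 checks existence without killing
+    return state;
+  } catch {
+    return null;
+  }
+}
+
+// Usage: write a state file for the current process and read it back.
+const statePath = join(mkdtempSync(join(tmpdir(), "gstack-demo-")), "browse.json");
+writeStateAtomic(statePath, {
+  pid: process.pid,
+  port: 34567,
+  token: "example-token",
+  startedAt: new Date().toISOString(),
+  binaryVersion: "abc123",
+});
+console.log(readLiveState(statePath)?.port, (statSync(statePath).mode & 0o777).toString(8));
+```
+
+Because the rename replaces the file in one step, a CLI reading concurrently sees either the old state or the new one, never a partial write.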
+
+### Port selection
+
+The server picks a random port between 10000 and 60000 (retrying up to 5 times on collision). This means 10 Conductor workspaces can each run their own browse daemon with zero configuration and effectively no port conflicts. The old approach (scanning 9400-9409) broke constantly in multi-workspace setups.
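+
+The selection loop can be sketched as follows. This is an illustration under assumptions: `pickPort` and the injected `tryListen` predicate are hypothetical names, and "retry up to 5" is read here as one initial attempt plus five retries.
+
+```typescript
+// Returns true when the port was bound successfully (hypothetical hook; in a
+// real server this would be the bind attempt itself).
+type TryListen = (port: number) => boolean;
+
+const MIN_PORT = 10000;
+const MAX_PORT = 60000;
+
+function pickPort(
+  tryListen: TryListen,
+  random: () => number = Math.random,
+  maxRetries = 5,
+): number {
+  for (let attempt = 0; attempt <= maxRetries; attempt++) {
+    // Uniform integer in [MIN_PORT, MAX_PORT], both ends inclusive.
+    const port = MIN_PORT + Math.floor(random() * (MAX_PORT - MIN_PORT + 1));
+    if (tryListen(port)) return port;
+  }
+  throw new Error(`No free port found after ${maxRetries} retries`);
+}
+```
+
+As a rough check on the collision claim: 10 daemons form 45 pairs, and each pair collides with probability 1/50,001, so the chance that any two daemons draw the same port on the first try is about 45/50,001, or roughly 0.09%, before retries are counted.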
+
+### Version auto-restart
+
+The build writes `git rev-parse HEAD` to `browse/dist/.version`. On each CLI invocation, if the binary's version doesn't match the running server's `binaryVersion`, the CLI kills the old server and starts a new one. This prevents the "stale binary" class of bugs entirely — rebuild the binary, next command picks it up automatically.
+
+## Security model
+
+### Localhost only
+
+The HTTP server binds to `localhost`, not `0.0.0.0`. It's not reachable from the network.
+
+### Bearer token auth
+
+Every server session generates a random UUID token, written to the state file with mode 0o600 (owner-only read/write). Every HTTP request must include `Authorization: Bearer <token>`. If the token doesn't match, the server returns 401.
+
+This prevents processes run by other users on the same machine from talking to your browse server, because they cannot read the token file. The cookie picker UI (`/cookie-picker`) and health check (`/health`) are exempt because they're localhost-only and don't execute commands.
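+
+A sketch of the check, assuming exact-match exempt paths and a constant-time comparison. Both are illustrative choices, and `isAuthorized` is a hypothetical name; the source only states that a mismatched token yields 401.
+
+```typescript
+import { randomUUID, timingSafeEqual } from "node:crypto";
+
+// Routes that skip the token check, per the text above.
+const EXEMPT_PATHS = new Set(["/health", "/cookie-picker"]);
+
+function isAuthorized(path: string, authHeader: string | null, token: string): boolean {
+  if (EXEMPT_PATHS.has(path)) return true;
+  if (authHeader === null) return false;
+  const expected = Buffer.from(`Bearer ${token}`);
+  const given = Buffer.from(authHeader);
+  // timingSafeEqual requires equal lengths, so check that first.
+  return given.length === expected.length && timingSafeEqual(given, expected);
+}
+
+// One token per server session; a caller would answer 401 when this is false.
+const sessionToken = randomUUID();
+console.log(isAuthorized("/command", `Bearer ${sessionToken}`, sessionToken));
+```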
+
+### Cookie security
+
+Cookies are the most sensitive data gstack handles. The design:
+
+1. **Keychain access requires user approval.** First cookie import per browser triggers a macOS Keychain dialog. The user must click "Allow" or "Always Allow." gstack never silently accesses credentials.
+
+2. **Decryption happens in-process.** Cookie values are decrypted in memory (PBKDF2 + AES-128-CBC), loaded into the Playwright context, and never written to disk in plaintext. The cookie picker UI never displays cookie values — only domain names and counts.
+
+3. **Database is read-only.** gstack copies the Chromium cookie DB to a temp file (to avoid SQLite lock conflicts with the running browser) and opens it read-only. It never modifies your real browser's cookie database.
+
+4. **Key caching is per-session.** The Keychain password + derived AES key are cached in memory for the server's lifetime. When the server shuts down (idle timeout or explicit stop), the cache is gone.
+
+5. **No cookie values in logs.** Console, network, and dialog logs never contain cookie values. The `cookies` command outputs cookie metadata (domain, name, expiry) but values are truncated.
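+
+For reference, the in-memory decryption step can be sketched with `node:crypto`. The constants below (salt `saltysalt`, 1003 iterations, SHA-1, a 16-space IV, and a `v10` prefix) are the values commonly documented for Chromium on macOS; they are assumptions here, not confirmed by the gstack source, and newer Chromium versions may add further framing to the plaintext. `encryptCookieValue` exists only to make the round trip checkable.
+
+```typescript
+import { createCipheriv, createDecipheriv, pbkdf2Sync } from "node:crypto";
+
+// Assumed Chromium-on-macOS parameters (see the note above).
+const SALT = "saltysalt";
+const ITERATIONS = 1003;
+const KEY_LENGTH = 16; // 16 bytes = AES-128
+const IV = Buffer.alloc(16, " "); // 16 space characters
+const PREFIX = "v10";
+
+// PBKDF2 turns the Keychain password into the AES key.
+function deriveKey(keychainPassword: string): Buffer {
+  return pbkdf2Sync(keychainPassword, SALT, ITERATIONS, KEY_LENGTH, "sha1");
+}
+
+// Decrypts entirely in memory; nothing is written to disk.
+function decryptCookieValue(encrypted: Buffer, key: Buffer): string {
+  if (encrypted.subarray(0, PREFIX.length).toString("latin1") !== PREFIX) {
+    throw new Error("Unsupported cookie encryption version");
+  }
+  const decipher = createDecipheriv("aes-128-cbc", key, IV);
+  const body = encrypted.subarray(PREFIX.length);
+  return Buffer.concat([decipher.update(body), decipher.final()]).toString("utf8");
+}
+
+// Hypothetical helper for testing only: encrypts the same way.
+function encryptCookieValue(plain: string, key: Buffer): Buffer {
+  const cipher = createCipheriv("aes-128-cbc", key, IV);
+  return Buffer.concat([Buffer.from(PREFIX), cipher.update(plain, "utf8"), cipher.final()]);
+}
+```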
+
+### Shell injection prevention
+
+The browser registry (Comet, Chrome, Arc, Brave, Edge) is hardcoded. Database paths are constructed from known constants, never from user input. Keychain access uses `Bun.spawn()` with explicit argument arrays, not shell string interpolation.
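+
+A sketch of the pattern, assuming the Keychain lookup goes through the macOS `security` CLI. The registry below is a hypothetical two-entry excerpt, and the service names are illustrative assumptions rather than values read from the gstack source.
+
+```typescript
+// Hardcoded registry: user input selects a key, it never becomes part of a command string.
+const KEYCHAIN_SERVICES = {
+  chrome: "Chrome Safe Storage",
+  brave: "Brave Safe Storage",
+} as const;
+
+type BrowserId = keyof typeof KEYCHAIN_SERVICES;
+
+function isBrowserId(name: string): name is BrowserId {
+  // hasOwnProperty avoids matching inherited keys such as "constructor".
+  return Object.prototype.hasOwnProperty.call(KEYCHAIN_SERVICES, name);
+}
+
+// Builds an argument array. Passed to a spawn API as-is, each element reaches
+// the child process as one argument, with no shell parsing in between.
+function keychainArgv(browser: string): string[] {
+  if (!isBrowserId(browser)) throw new Error(`Unknown browser: ${browser}`);
+  return ["security", "find-generic-password", "-s", KEYCHAIN_SERVICES[browser], "-w"];
+}
+```
+
+Since the array is never joined into a shell string, input such as `chrome"; rm -rf ~` fails the registry lookup instead of reaching a shell.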
+
+## The ref system
+
+Refs (`@e1`, `@e2`, `@c1`) are how the agent addresses page elements without writing CSS selectors or XPath.
+
+### How it works
+
+```
+1. Agent runs: $B snapshot -i
+2. Server calls Playwright's page.accessibility.snapshot()
+3. Parser walks the ARIA tree, assigns sequential refs: @e1, @e2, @e3...
+4. For each ref, builds a Playwright Locator: getByRole(role, { name }).nth(index)
+5. Stores Map<string, Locator> on the BrowserManager instance
+6. Returns the annotated tree as plain text
+
+Later:
+7. Agent runs: $B click @e3
+8. Server resolves @e3 → Locator → locator.click()
+```
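+
+Steps 3 to 5 and step 8 can be sketched without Playwright. The sketch below assumes a simplified tree shape (`AXNode`) and stores the role, name, and index that a real implementation would pass to `getByRole(role, { name }).nth(index)`. All names here are hypothetical, and the real parser presumably filters which nodes receive refs.
+
+```typescript
+interface AXNode {
+  role: string;
+  name: string;
+  children?: AXNode[];
+}
+
+// What is needed to rebuild a locator later: getByRole(role, { name }).nth(nth).
+interface RefTarget {
+  role: string;
+  name: string;
+  nth: number;
+}
+
+function assignRefs(root: AXNode): Map<string, RefTarget> {
+  const refs = new Map<string, RefTarget>();
+  const seen = new Map<string, number>(); // occurrences of each role+name pair so far
+  let counter = 0;
+
+  const walk = (node: AXNode): void => {
+    const key = JSON.stringify([node.role, node.name]);
+    const nth = seen.get(key) ?? 0;
+    seen.set(key, nth + 1);
+    counter += 1;
+    refs.set(`@e${counter}`, { role: node.role, name: node.name, nth });
+    for (const child of node.children ?? []) walk(child);
+  };
+
+  walk(root);
+  return refs;
+}
+
+// Stale or unknown refs fail loudly instead of resolving to the wrong element.
+function resolveRef(refs: Map<string, RefTarget>, ref: string): RefTarget {
+  const target = refs.get(ref);
+  if (!target) throw new Error(`Ref ${ref} not found. Run snapshot again to get fresh refs.`);
+  return target;
+}
+```
+
+The `nth` index is what disambiguates two elements with the same role and name, for example two "Save" buttons on one page.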
+
+### Why Locators, not DOM mutation
+
+The obvious approach is to inject `data-ref="@e1"` attributes into the DOM. This breaks on:
+
+- **CSP (Content Security Policy).** Many production sites block DOM modification from scripts.
+- **React/Vue/Svelte hydration.** Framework reconciliation can strip injected attributes.
+- **Shadow DOM.** Can't reach inside shadow roots from the outside.
+
+Playwright Locators are external to the DOM. They use the accessibility tree (which Chromium maintains internally) and `getByRole()` queries. No DOM mutation, no CSP issues, no framework conflicts.
+
+### Ref lifecycle
+
+Refs are cleared on navigation (the `framenavigated` event on the main frame). This is correct — after navigation, all locators are stale. The agent must run `snapshot` again to get fresh refs. This is by design: stale refs should fail loudly, not click the wrong element.
+
+### Cursor-interactive refs (@c)
+
+The `-C` flag finds elements that are clickable but not in the ARIA tree — things styled with `cursor: pointer`, elements with `onclick` attributes, or custom `tabindex`. These get `@c1`, `@c2` refs in a separate namespace. This catches custom components that frameworks render as `<div>` but are actually buttons.
+
+## Logging architecture
+
+Three ring buffers (50,000 entries each, O(1) push):
+
+```
+Browser events → CircularBuffer (in-memory) → Async flush to .gstack/*.log
+```
+
+Console messages, network requests, and dialog events each have their own buffer. Flushing happens every 1 second — the server appends only new entries since the last flush. This means:
+
+- HTTP request handling is never blocked by disk I/O
+- Logs survive server crashes (up to 1 second of data loss)
+- Memory is bounded (50K entries × 3 buffers)
+- Disk files are append-only, readable by external tools
+
+The `console`, `network`, and `dialog` commands read from the in-memory buffers, not disk. Disk files are for post-mortem debugging.
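+
+A minimal sketch of a ring buffer with incremental flushing, under the assumption that the flusher tracks a cursor of "total entries ever pushed". The class shape and the `flushNew` helper are illustrative, not the gstack implementation.
+
+```typescript
+class CircularBuffer<T> {
+  private readonly items: (T | undefined)[];
+  private readonly capacity: number;
+  private total = 0; // entries ever pushed, including overwritten ones
+
+  constructor(capacity: number) {
+    this.capacity = capacity;
+    this.items = new Array<T | undefined>(capacity);
+  }
+
+  // O(1): overwrite the oldest slot once the buffer is full.
+  push(item: T): void {
+    this.items[this.total % this.capacity] = item;
+    this.total += 1;
+  }
+
+  get totalPushed(): number {
+    return this.total;
+  }
+
+  // Entries pushed at or after position `cursor`, oldest first.
+  // Entries that were already overwritten are skipped.
+  since(cursor: number): T[] {
+    const start = Math.max(cursor, this.total - this.capacity, 0);
+    const out: T[] = [];
+    for (let i = start; i < this.total; i++) {
+      out.push(this.items[i % this.capacity] as T);
+    }
+    return out;
+  }
+}
+
+// Appends only the entries added since the last flush and returns the new cursor.
+function flushNew<T>(
+  buffer: CircularBuffer<T>,
+  cursor: number,
+  append: (lines: string[]) => void,
+): number {
+  const fresh = buffer.since(cursor);
+  if (fresh.length > 0) append(fresh.map((entry) => JSON.stringify(entry)));
+  return buffer.totalPushed;
+}
+```
+
+A timer would call `flushNew` once per second with an `append` that writes to the log file, which keeps disk I/O off the request path.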
+
+## SKILL.md template system
+
+### The problem
+
+SKILL.md files tell Claude how to use the browse commands. If the docs list a flag that doesn't exist, or miss a command that was added, the agent hits errors. Hand-maintained docs always drift from code.
+
+### The solution
+
+```
+SKILL.md.tmpl          (human-written prose + placeholders)
+       ↓
+gen-skill-docs.ts      (reads source code metadata)
+       ↓
+SKILL.md               (committed, auto-generated sections)
+```
+
+Templates contain the workflows, tips, and examples that require human judgment. The `{{COMMAND_REFERENCE}}` and `{{SNAPSHOT_FLAGS}}` placeholders are filled from `commands.ts` and `snapshot.ts` at build time. This is structurally sound — if a command exists in code, it appears in docs. If it doesn't exist, it can't appear.
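+
+The substitution step can be sketched as below. This assumes the registry is a plain name-to-description map and that the reference renders as a Markdown table; `renderCommandReference` and `fillTemplate` are hypothetical names, not the functions in `gen-skill-docs.ts`.
+
+```typescript
+// Render a sorted Markdown table from a command registry.
+function renderCommandReference(commands: Record<string, string>): string {
+  const rows = Object.keys(commands)
+    .sort()
+    .map((name) => `| \`${name}\` | ${commands[name]} |`);
+  return ["| Command | Description |", "|---------|-------------|", ...rows].join("\n");
+}
+
+// Replace {{PLACEHOLDER}} tokens; an unknown placeholder is an error, not a silent gap.
+function fillTemplate(template: string, sections: Record<string, string>): string {
+  return template.replace(/\{\{([A-Z_]+)\}\}/g, (match: string, name: string) => {
+    const value = sections[name];
+    if (value === undefined) throw new Error(`Unknown placeholder: ${match}`);
+    return value;
+  });
+}
+
+// Usage with a two-command registry (descriptions are made up for the example).
+const reference = renderCommandReference({
+  goto: "Navigate to a URL",
+  click: "Click an element",
+});
+console.log(fillTemplate("# Browse\n\n{{COMMAND_REFERENCE}}\n", { COMMAND_REFERENCE: reference }));
+```
+
+Throwing on an unknown placeholder means a typo in a template fails the build instead of shipping a literal `{{...}}` to the agent.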
+
+### Why committed, not generated at runtime?
+
+Three reasons:
+
+1. **Claude reads SKILL.md at skill load time.** There's no build step when a user invokes `/browse`. The file must already exist and be correct.
+2. **CI can validate freshness.** `gen:skill-docs --dry-run` + `git diff --exit-code` catches stale docs before merge.
+3. **Git blame works.** You can see when a command was added and in which commit.
+
+### Test tiers
+
+| Tier | What | Cost | Speed |
+|------|------|------|-------|
+| 1 — Static validation | Parse every `$B` command in SKILL.md, validate against registry | Free | <2s |
+| 2 — E2E via Agent SDK | Spawn real Claude session, run `/qa`, check for errors | ~$0.50 | ~60s |
+| 3 — LLM-as-judge | Haiku scores docs on clarity/completeness/actionability | ~$0.03 | ~10s |
+
+Tier 1 runs on every `bun test`. Tiers 2 and 3 are gated behind env vars. The idea is: catch 95% of issues for free, use LLMs only for the judgment calls.
+
+## Command dispatch
+
+Commands are categorized by side effects:
+
+- **READ** (text, html, links, console, cookies, ...): No mutations. Safe to retry. Returns page state.
+- **WRITE** (goto, click, fill, press, ...): Mutates page state. Not idempotent.
+- **META** (snapshot, screenshot, tabs, chain, ...): Server-level operations that don't fit neatly into read/write.
+
+This isn't just organizational. The server uses it for dispatch:
+
+```typescript
+// Route to the handler for the command's category.
+if (READ_COMMANDS.has(cmd)) return handleReadCommand(cmd, args, bm);
+if (WRITE_COMMANDS.has(cmd)) return handleWriteCommand(cmd, args, bm);
+if (META_COMMANDS.has(cmd)) return handleMetaCommand(cmd, args, bm, shutdown);
+```
+
+The `help` command returns all three sets so agents can self-discover available commands.
+
+## Error philosophy
+
+Errors are for AI agents, not humans. Every error message must be actionable:
+
+- "Element not found" → "Element not found or not interactable. Run `snapshot -i` to see available elements."
+- "Selector matched multiple elements" → "Selector matched multiple elements. Use @refs from `snapshot` instead."
+- Timeout → "Navigation timed out after 30s. The page may be slow or the URL may be wrong."
+
+Playwright's native errors are rewritten through `wrapError()` to strip internal stack traces and add guidance. The agent should be able to read the error and know what to do next without human intervention.
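+
+A sketch of what such a rewrite layer might look like. The patterns are assumptions about the shape of Playwright's messages, and the pattern-to-message mapping is illustrative; only the three guidance strings come from the list above.
+
+```typescript
+interface Rewrite {
+  pattern: RegExp;
+  message: string;
+}
+
+// Checked in order, so the navigation-specific pattern comes before the generic timeout.
+const REWRITES: Rewrite[] = [
+  {
+    pattern: /strict mode violation/i,
+    message: "Selector matched multiple elements. Use @refs from `snapshot` instead.",
+  },
+  {
+    pattern: /goto: Timeout \d+ms exceeded/i,
+    message: "Navigation timed out after 30s. The page may be slow or the URL may be wrong.",
+  },
+  {
+    pattern: /Timeout \d+ms exceeded/i,
+    message: "Element not found or not interactable. Run `snapshot -i` to see available elements.",
+  },
+];
+
+function wrapError(err: unknown): string {
+  const raw = err instanceof Error ? err.message : String(err);
+  // Keep only the first line: later lines hold call logs and internals.
+  const firstLine = (raw.split("\n")[0] ?? "").trim();
+  for (const { pattern, message } of REWRITES) {
+    if (pattern.test(firstLine)) return message;
+  }
+  return firstLine;
+}
+```
+
+Unrecognized errors fall through with their first line intact, so the agent still gets some signal while call logs and stack details are dropped.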
+
+### Crash recovery
+
+The server doesn't try to self-heal. If Chromium crashes (`browser.on('disconnected')`), the server exits immediately. The CLI detects the dead server on the next command and auto-restarts. This is simpler and more reliable than trying to reconnect to a half-dead browser process.
+
+## What's intentionally not here
+
+- **No WebSocket streaming.** HTTP request/response is simpler, debuggable with curl, and fast enough. Streaming would add complexity for marginal benefit.
+- **No MCP protocol.** MCP adds JSON schema overhead per request and requires a persistent connection. Plain HTTP + plain text output is lighter on tokens and easier to debug.
+- **No multi-user support.** One server per workspace, one user. The token auth is defense-in-depth, not multi-tenancy.
+- **No Windows/Linux cookie decryption.** macOS Keychain is the only supported credential store. Linux (GNOME Keyring/KWallet) and Windows (DPAPI) are architecturally possible but not implemented.
+- **No iframe support.** Playwright can handle iframes but the ref system doesn't cross frame boundaries yet. This is the most-requested missing feature.