mirror of https://github.com/garrytan/gstack.git
synced 2026-05-05 05:05:08 +02:00
docs: add ARCHITECTURE.md, update CLAUDE.md and CONTRIBUTING.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+240
@@ -0,0 +1,240 @@
# Architecture

This document explains **why** gstack is built the way it is. For setup and commands, see CLAUDE.md. For contributing, see CONTRIBUTING.md.

## The core idea

gstack gives Claude Code a persistent browser and a set of opinionated workflow skills. The browser is the hard part; everything else is Markdown.

The key insight: an AI agent interacting with a browser needs **sub-second latency** and **persistent state**. If every command cold-starts a browser, you're waiting 3-5 seconds per tool call. If the browser dies between commands, you lose cookies, tabs, and login sessions. So gstack runs a long-lived Chromium daemon that the CLI talks to over localhost HTTP.

```
  Claude Code                          gstack
  ─────────                            ──────
                                ┌──────────────────────┐
  Tool call: $B snapshot -i     │ CLI (compiled binary)│
  ─────────────────────────→    │  • reads state file  │
                                │  • POST /command     │
                                │    to localhost:PORT │
                                └──────────┬───────────┘
                                           │ HTTP
                                ┌──────────▼───────────┐
                                │  Server (Bun.serve)  │
                                │  • dispatches command│
                                │  • talks to Chromium │
                                │  • returns plain text│
                                └──────────┬───────────┘
                                           │ CDP
                                ┌──────────▼───────────┐
                                │  Chromium (headless) │
                                │  • persistent tabs   │
                                │  • cookies carry over│
                                │  • 30min idle timeout│
                                └──────────────────────┘
```

First call starts everything (~3s). Every call after: ~100-200ms.
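
The CLI half of the diagram can be sketched as a small request builder. This is an illustration only: `ServerState`, `buildCommandRequest`, and the JSON payload shape (`command`, `args`) are assumptions made for the example, not gstack's actual wire format.

```typescript
// Sketch only: the payload shape and all names here are hypothetical.
interface ServerState {
  pid: number;
  port: number;
  token: string;
}

interface CommandRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the HTTP request the CLI would send for one tool call.
function buildCommandRequest(
  state: ServerState,
  command: string,
  args: string[],
): CommandRequest {
  return {
    url: `http://localhost:${state.port}/command`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${state.token}`,
      },
      body: JSON.stringify({ command, args }), // assumed payload shape
    },
  };
}

// Usage (the server replies with plain text):
//   const { url, init } = buildCommandRequest(state, "snapshot", ["-i"]);
//   const text = await (await fetch(url, init)).text();
```

Because every call after the first is a single localhost round-trip like this one, the per-command cost is dominated by Chromium's work rather than by process startup.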

## Why Bun

Node.js would work. Bun is better here for four reasons:

1. **Compiled binaries.** `bun build --compile` produces a single ~58MB executable. No `node_modules` at runtime, no `npx`, no PATH configuration. The binary just runs. This matters because gstack installs into `~/.claude/skills/` where users don't expect to manage a Node.js project.

2. **Native SQLite.** Cookie decryption reads Chromium's SQLite cookie database directly. Bun has `new Database()` built in (via `bun:sqlite`): no `better-sqlite3`, no native addon compilation, no gyp. One less thing that breaks on different machines.

3. **Native TypeScript.** The server runs as `bun run server.ts` during development. No compilation step, no `ts-node`, no source maps to debug. The compiled binary is for deployment; source files are for development.

4. **Built-in HTTP server.** `Bun.serve()` is fast, simple, and doesn't need Express or Fastify. The server handles ~10 routes total. A framework would be overhead.

The bottleneck is always Chromium, not the CLI or server. Bun's startup speed (~1ms for the compiled binary vs ~100ms for Node) is nice but not the reason we chose it. The compiled binary and native SQLite are.

## The daemon model

### Why not start a browser per command?

Playwright can launch Chromium in ~2-3 seconds. For a single screenshot, that's fine. For a QA session with 20+ commands, it's 40+ seconds of browser startup overhead. Worse: you lose all state between commands. Cookies, localStorage, login sessions, open tabs: all gone.

The daemon model means:

- **Persistent state.** Log in once, stay logged in. Open a tab, it stays open. localStorage persists across commands.
- **Sub-second commands.** After the first call, every command is just an HTTP POST. ~100-200ms round-trip including Chromium's work.
- **Automatic lifecycle.** The server auto-starts on first use, auto-shuts down after 30 minutes idle. No process management needed.

### State file

The server writes `.gstack/browse.json` (atomic write via tmp + rename, mode 0o600):

```json
{ "pid": 12345, "port": 34567, "token": "uuid-v4", "startedAt": "...", "binaryVersion": "abc123" }
```

The CLI reads this file to find the server. If the file is missing, stale, or the PID is dead, the CLI spawns a new server.
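
The "missing, stale, or dead PID" check could look roughly like the sketch below. `BrowseState`, `isProcessAlive`, and `readLiveState` are hypothetical names for this example; the only technique relied on is that sending signal `0` checks for a process's existence without actually signalling it.

```typescript
import { readFileSync } from "node:fs";

// Field names mirror the JSON example above; the helper names are hypothetical.
interface BrowseState {
  pid: number;
  port: number;
  token: string;
  startedAt: string;
  binaryVersion: string;
}

// Signal 0 performs the existence check without delivering a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns the state if it points at a live server, otherwise null.
// On null, the caller would spawn a new server.
function readLiveState(path: string): BrowseState | null {
  try {
    const state = JSON.parse(readFileSync(path, "utf8")) as BrowseState;
    if (typeof state.pid !== "number" || typeof state.port !== "number") {
      return null; // malformed state file
    }
    return isProcessAlive(state.pid) ? state : null;
  } catch {
    return null; // missing file or invalid JSON
  }
}
```

A PID check alone cannot tell whether the process is really the browse server, which is one reason a health check over HTTP is a natural follow-up step.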

### Port selection

The server picks a random port between 10000 and 60000, retrying up to 5 times on collision. This means 10 Conductor workspaces can each run their own browse daemon with zero configuration and zero port conflicts. The old approach (scanning 9400-9409) broke constantly in multi-workspace setups.
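
A minimal sketch of that retry loop, assuming "retry up to 5" means at most five attempts in total. `pickPort` and the `tryBind` callback are hypothetical; in a real server `tryBind` would attempt to listen on the port and report whether that succeeded.

```typescript
const MIN_PORT = 10000;
const MAX_PORT = 60000;
const MAX_ATTEMPTS = 5; // assumption: five attempts in total

// Uniform random integer in [MIN_PORT, MAX_PORT].
function randomPort(): number {
  return MIN_PORT + Math.floor(Math.random() * (MAX_PORT - MIN_PORT + 1));
}

// tryBind is a hypothetical callback: true if the server could listen on the port.
// `next` is injectable so the loop can be tested deterministically.
function pickPort(
  tryBind: (port: number) => boolean,
  next: () => number = randomPort,
): number {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const port = next();
    if (tryBind(port)) return port;
  }
  throw new Error(`No free port found after ${MAX_ATTEMPTS} attempts`);
}
```

With roughly 50,000 candidate ports and a handful of daemons per machine, a collision on any single attempt is already unlikely, so a small retry budget is enough.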

### Version auto-restart

The build writes `git rev-parse HEAD` to `browse/dist/.version`. On each CLI invocation, if the binary's version doesn't match the running server's `binaryVersion`, the CLI kills the old server and starts a new one. This prevents the "stale binary" class of bugs entirely: rebuild the binary, and the next command picks it up automatically.

## Security model

### Localhost only

The HTTP server binds to `localhost`, not `0.0.0.0`. It's not reachable from the network.

### Bearer token auth

Every server session generates a random UUID token, written to the state file with mode 0o600 (owner-only read/write). Every HTTP request must include `Authorization: Bearer <token>`. If the token doesn't match, the server returns 401.

This prevents other processes on the same machine from talking to your browse server. The cookie picker UI (`/cookie-picker`) and health check (`/health`) are exempt: they're localhost-only and don't execute commands.
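
The auth rule can be sketched as a single predicate. `isAuthorized` and `PUBLIC_PATHS` are hypothetical names, and matching the exempt routes by exact path is an assumption; a real server might also exempt sub-routes.

```typescript
// Routes the text above lists as exempt from token auth (exact match assumed).
const PUBLIC_PATHS = new Set(["/cookie-picker", "/health"]);

// True if the request may proceed; otherwise the server would answer 401.
function isAuthorized(
  pathname: string,
  authorization: string | null,
  token: string,
): boolean {
  if (PUBLIC_PATHS.has(pathname)) return true;
  return authorization === `Bearer ${token}`;
}
```

For example, `isAuthorized("/command", null, token)` is `false`, while `isAuthorized("/health", null, token)` is `true`.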

### Cookie security

Cookies are the most sensitive data gstack handles. The design:

1. **Keychain access requires user approval.** First cookie import per browser triggers a macOS Keychain dialog. The user must click "Allow" or "Always Allow." gstack never silently accesses credentials.

2. **Decryption happens in-process.** Cookie values are decrypted in memory (PBKDF2 + AES-128-CBC), loaded into the Playwright context, and never written to disk in plaintext. The cookie picker UI never displays cookie values, only domain names and counts.

3. **Database is read-only.** gstack copies the Chromium cookie DB to a temp file (to avoid SQLite lock conflicts with the running browser) and opens it read-only. It never modifies your real browser's cookie database.

4. **Key caching is per-session.** The Keychain password + derived AES key are cached in memory for the server's lifetime. When the server shuts down (idle timeout or explicit stop), the cache is gone.

5. **No cookie values in logs.** Console, network, and dialog logs never contain cookie values. The `cookies` command outputs cookie metadata (domain, name, expiry) but values are truncated.
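
To make point 2 concrete, here is a sketch of in-memory PBKDF2 + AES-128-CBC decryption using `node:crypto`. The salt, iteration count, IV, and `v10` prefix are the parameters commonly documented for Chromium on macOS and are assumptions here, not values taken from gstack's source. `deriveKey`, `decryptCookieValue`, and `encryptForDemo` are hypothetical names; newer Chromium releases may add extra framing inside the plaintext, which this sketch ignores.

```typescript
import { createCipheriv, createDecipheriv, pbkdf2Sync } from "node:crypto";

// Assumed parameters (commonly documented for Chromium on macOS).
const SALT = "saltysalt";
const ITERATIONS = 1003;
const KEY_LENGTH = 16; // bytes, i.e. AES-128
const IV = Buffer.alloc(16, " "); // 16 space characters
const PREFIX = Buffer.from("v10"); // version tag on encrypted values

// Derive the AES key from the Keychain password; cached in memory only.
function deriveKey(keychainPassword: string): Buffer {
  return pbkdf2Sync(keychainPassword, SALT, ITERATIONS, KEY_LENGTH, "sha1");
}

// Decrypt one cookie value entirely in memory.
function decryptCookieValue(encrypted: Buffer, key: Buffer): string {
  if (!encrypted.subarray(0, PREFIX.length).equals(PREFIX)) {
    throw new Error("Unsupported cookie encryption version");
  }
  const decipher = createDecipheriv("aes-128-cbc", key, IV);
  const plain = Buffer.concat([
    decipher.update(encrypted.subarray(PREFIX.length)),
    decipher.final(),
  ]);
  return plain.toString("utf8");
}

// Round-trip helper for this example only; not something gstack would need.
function encryptForDemo(value: string, key: Buffer): Buffer {
  const cipher = createCipheriv("aes-128-cbc", key, IV);
  return Buffer.concat([PREFIX, cipher.update(value, "utf8"), cipher.final()]);
}
```

Nothing in this path touches the filesystem: the key and the decrypted value exist only as in-process buffers, which matches the "never written to disk in plaintext" guarantee.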

### Shell injection prevention

The browser registry (Comet, Chrome, Arc, Brave, Edge) is hardcoded. Database paths are constructed from known constants, never from user input. Keychain access uses `Bun.spawn()` with explicit argument arrays, not shell string interpolation.
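
A sketch of the argument-array approach, assuming the Keychain lookup goes through the macOS `security` CLI. The registry contents, `KEYCHAIN_SERVICES`, and `keychainArgv` are illustrative assumptions; only two browsers are shown.

```typescript
// Hardcoded registry. Entries are illustrative; the real list is longer.
const KEYCHAIN_SERVICES = new Map<string, string>([
  ["chrome", "Chrome Safe Storage"],
  ["brave", "Brave Safe Storage"],
]);

// Build an argv array for the Keychain lookup. Anything not in the
// registry is rejected, so user input never reaches the command line.
function keychainArgv(browser: string): string[] {
  const service = KEYCHAIN_SERVICES.get(browser);
  if (service === undefined) throw new Error(`Unknown browser: ${browser}`);
  // Each element is passed to the process as-is; no shell parses it.
  return ["security", "find-generic-password", "-s", service, "-w"];
}

// With Bun, the array is handed over directly:
//   const proc = Bun.spawn(keychainArgv("chrome"), { stdout: "pipe" });
```

Because the service name travels as its own array element, a value containing spaces, quotes, or `;` cannot change which command runs.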

## The ref system

Refs (`@e1`, `@e2`, `@c1`) are how the agent addresses page elements without writing CSS selectors or XPath.

### How it works

```
1. Agent runs: $B snapshot -i
2. Server calls Playwright's page.accessibility.snapshot()
3. Parser walks the ARIA tree, assigns sequential refs: @e1, @e2, @e3...
4. For each ref, builds a Playwright Locator: getByRole(role, { name }).nth(index)
5. Stores Map<string, Locator> on the BrowserManager instance
6. Returns the annotated tree as plain text

Later:
7. Agent runs: $B click @e3
8. Server resolves @e3 → Locator → locator.click()
```
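
Steps 3 to 5 and step 8 can be sketched without Playwright by storing the data a Locator would be built from. `AriaNode`, `RefTarget`, `assignRefs`, `resolveRef`, and the output line format are assumptions made for this example; the real implementation stores Playwright Locators rather than plain objects.

```typescript
interface AriaNode {
  role: string;
  name?: string;
  children?: AriaNode[];
}

// What a Locator would be built from: getByRole(role, { name }).nth(nth)
interface RefTarget {
  role: string;
  name: string;
  nth: number;
}

// Walk the tree depth-first, assigning @e1, @e2, ... in visit order.
function assignRefs(root: AriaNode): { refs: Map<string, RefTarget>; text: string } {
  const refs = new Map<string, RefTarget>();
  const seen = new Map<string, number>(); // occurrences per role+name pair
  const lines: string[] = [];

  const walk = (node: AriaNode, depth: number): void => {
    const name = node.name ?? "";
    const key = `${node.role}\u0000${name}`;
    const nth = seen.get(key) ?? 0; // disambiguates identical role+name pairs
    seen.set(key, nth + 1);
    const ref = `@e${refs.size + 1}`;
    refs.set(ref, { role: node.role, name, nth });
    lines.push(`${"  ".repeat(depth)}${ref} [${node.role}] "${name}"`);
    for (const child of node.children ?? []) walk(child, depth + 1);
  };

  walk(root, 0);
  return { refs, text: lines.join("\n") };
}

// Unknown or cleared refs fail loudly instead of guessing.
function resolveRef(refs: Map<string, RefTarget>, ref: string): RefTarget {
  const target = refs.get(ref);
  if (!target) {
    throw new Error(`Ref ${ref} not found. Run \`snapshot -i\` to get fresh refs.`);
  }
  return target;
}
```

The `nth` index is what lets two buttons with the same role and name receive distinct refs that still resolve to different elements.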

### Why Locators, not DOM mutation

The obvious approach is to inject `data-ref="@e1"` attributes into the DOM. This breaks on:

- **CSP (Content Security Policy).** Many production sites block DOM modification from scripts.
- **React/Vue/Svelte hydration.** Framework reconciliation can strip injected attributes.
- **Shadow DOM.** Can't reach inside shadow roots from the outside.

Playwright Locators are external to the DOM. They use the accessibility tree (which Chromium maintains internally) and `getByRole()` queries. No DOM mutation, no CSP issues, no framework conflicts.

### Ref lifecycle

Refs are cleared on navigation (the `framenavigated` event on the main frame). This is correct: after navigation, all locators are stale. The agent must run `snapshot` again to get fresh refs. This is by design: stale refs should fail loudly, not click the wrong element.

### Cursor-interactive refs (@c)

The `-C` flag finds elements that are clickable but not in the ARIA tree: things styled with `cursor: pointer`, elements with `onclick` attributes, or custom `tabindex`. These get `@c1`, `@c2` refs in a separate namespace. This catches custom components that frameworks render as `<div>` but are actually buttons.

## Logging architecture

Three ring buffers (50,000 entries each, O(1) push):

```
Browser events → CircularBuffer (in-memory) → Async flush to .gstack/*.log
```

Console messages, network requests, and dialog events each have their own buffer. Flushing happens every 1 second; the server appends only new entries since the last flush. This means:

- HTTP request handling is never blocked by disk I/O
- Logs survive server crashes (up to 1 second of data loss)
- Memory is bounded (50K entries × 3 buffers)
- Disk files are append-only, readable by external tools

The `console`, `network`, and `dialog` commands read from the in-memory buffers, not disk. Disk files are for post-mortem debugging.
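
A minimal ring buffer with O(1) push and "new entries since the last flush" tracking might look like this. The class body and the `takeUnflushed` method name are assumptions for illustration, not gstack's actual `CircularBuffer`.

```typescript
class CircularBuffer<T> {
  private readonly capacity: number;
  private readonly items: (T | undefined)[];
  private total = 0; // entries ever pushed
  private flushed = 0; // entries already handed to the flusher

  constructor(capacity: number) {
    this.capacity = capacity;
    this.items = new Array<T | undefined>(capacity);
  }

  // O(1): overwrite the oldest slot once the buffer is full.
  push(item: T): void {
    this.items[this.total % this.capacity] = item;
    this.total++;
  }

  // Entries still in memory, oldest first (what read commands would serve).
  toArray(): T[] {
    return this.range(Math.max(0, this.total - this.capacity), this.total);
  }

  // Entries added since the previous call, skipping any already overwritten.
  takeUnflushed(): T[] {
    const start = Math.max(this.flushed, this.total - this.capacity);
    const out = this.range(start, this.total);
    this.flushed = this.total;
    return out;
  }

  private range(start: number, end: number): T[] {
    const out: T[] = [];
    for (let i = start; i < end; i++) {
      out.push(this.items[i % this.capacity] as T);
    }
    return out;
  }
}
```

A timer firing once per second would call `takeUnflushed()` and append the result to the log file, so request handlers only ever pay for the in-memory `push`.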

## SKILL.md template system

### The problem

SKILL.md files tell Claude how to use the browse commands. If the docs list a flag that doesn't exist, or miss a command that was added, the agent hits errors. Hand-maintained docs always drift from code.

### The solution

```
SKILL.md.tmpl (human-written prose + placeholders)
       ↓
gen-skill-docs.ts (reads source code metadata)
       ↓
SKILL.md (committed, auto-generated sections)
```

Templates contain the workflows, tips, and examples that require human judgment. The `{{COMMAND_REFERENCE}}` and `{{SNAPSHOT_FLAGS}}` placeholders are filled from `commands.ts` and `snapshot.ts` at build time. This is structurally sound: if a command exists in code, it appears in docs. If it doesn't exist, it can't appear.
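
The placeholder substitution can be sketched in a few lines. `CommandMeta`, `commandReference`, `renderTemplate`, and the bullet format of the generated section are hypothetical; the actual generator's metadata shape and output layout are not shown in this document.

```typescript
// Hypothetical metadata shape; the real registry lives in commands.ts.
interface CommandMeta {
  name: string;
  description: string;
}

// Render the generated section from code metadata.
function commandReference(commands: CommandMeta[]): string {
  return commands.map((c) => `- \`${c.name}\`: ${c.description}`).join("\n");
}

// Replace every {{NAME}} placeholder; unknown names are an error,
// so a typo in the template fails the build instead of shipping.
function renderTemplate(template: string, sections: Record<string, string>): string {
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (_match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(sections, name)) {
      throw new Error(`Unknown placeholder: ${name}`);
    }
    return sections[name];
  });
}
```

Using a replacer function (rather than a replacement string) means text such as `$B` in the generated section is inserted literally and is not interpreted as a replacement pattern.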

### Why committed, not generated at runtime?

Three reasons:

1. **Claude reads SKILL.md at skill load time.** There's no build step when a user invokes `/browse`. The file must already exist and be correct.
2. **CI can validate freshness.** `gen:skill-docs --dry-run` + `git diff --exit-code` catches stale docs before merge.
3. **Git blame works.** You can see when a command was added and in which commit.

### Test tiers

| Tier | What | Cost | Speed |
|------|------|------|-------|
| 1: Static validation | Parse every `$B` command in SKILL.md, validate against registry | Free | <2s |
| 2: E2E via Agent SDK | Spawn real Claude session, run `/qa`, check for errors | ~$0.50 | ~60s |
| 3: LLM-as-judge | Haiku scores docs on clarity/completeness/actionability | ~$0.03 | ~10s |

Tier 1 runs on every `bun test`. Tiers 2 and 3 are gated behind env vars. The idea is: catch 95% of issues for free, use LLMs only for the judgment calls.
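
Tier 1 is cheap because it is plain string processing. A sketch, assuming the validator scans the whole Markdown file for `$B <command>` occurrences; `findUnknownCommands` and the sample registry are hypothetical.

```typescript
// Hypothetical registry; the real one would be derived from commands.ts.
const REGISTRY = new Set(["goto", "click", "snapshot", "text"]);

// Return every command name that follows `$B` but is not in the registry.
function findUnknownCommands(markdown: string, registry: Set<string>): string[] {
  const unknown: string[] = [];
  for (const match of markdown.matchAll(/\$B\s+([a-z][\w-]*)/g)) {
    const command = match[1];
    if (!registry.has(command)) unknown.push(command);
  }
  return unknown;
}
```

A test would then assert that `findUnknownCommands(skillMd, REGISTRY)` is empty for every generated SKILL.md.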

## Command dispatch

Commands are categorized by side effects:

- **READ** (text, html, links, console, cookies, ...): No mutations. Safe to retry. Returns page state.
- **WRITE** (goto, click, fill, press, ...): Mutates page state. Not idempotent.
- **META** (snapshot, screenshot, tabs, chain, ...): Server-level operations that don't fit neatly into read/write.

This isn't just organizational. The server uses it for dispatch:

```typescript
// Simplified: route each command to the handler for its category.
if (READ_COMMANDS.has(cmd)) return handleReadCommand(cmd, args, bm);
if (WRITE_COMMANDS.has(cmd)) return handleWriteCommand(cmd, args, bm);
if (META_COMMANDS.has(cmd)) return handleMetaCommand(cmd, args, bm, shutdown);
```

The `help` command returns all three sets so agents can self-discover available commands.

## Error philosophy

Errors are for AI agents, not humans. Every error message must be actionable:

- "Element not found" → "Element not found or not interactable. Run `snapshot -i` to see available elements."
- "Selector matched multiple elements" → "Selector matched multiple elements. Use @refs from `snapshot` instead."
- Timeout → "Navigation timed out after 30s. The page may be slow or the URL may be wrong."

Playwright's native errors are rewritten through `wrapError()` to strip internal stack traces and add guidance. The agent should be able to read the error and know what to do next without human intervention.
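
A rough sketch of what such a rewrite could look like. Only the name `wrapError()` and the three output messages come from this document; the regular expressions used to recognize Playwright's messages are assumptions and would need checking against the Playwright version in use.

```typescript
// Sketch only: the matching patterns below are assumed, not gstack's.
function wrapError(err: unknown): string {
  const raw = err instanceof Error ? err.message : String(err);

  if (/strict mode violation/i.test(raw)) {
    return "Selector matched multiple elements. Use @refs from `snapshot` instead.";
  }
  if (/timeout/i.test(raw) && /navigat|goto/i.test(raw)) {
    return "Navigation timed out after 30s. The page may be slow or the URL may be wrong.";
  }
  if (/timeout/i.test(raw)) {
    return "Element not found or not interactable. Run `snapshot -i` to see available elements.";
  }
  // Fallback: keep only the first line, dropping call logs and stack frames.
  return raw.split("\n")[0].trim();
}
```

The output is always a single short line of plain text, which keeps it cheap in tokens and easy for the agent to act on.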

### Crash recovery

The server doesn't try to self-heal. If Chromium crashes (`browser.on('disconnected')`), the server exits immediately. The CLI detects the dead server on the next command and auto-restarts. This is simpler and more reliable than trying to reconnect to a half-dead browser process.

## What's intentionally not here

- **No WebSocket streaming.** HTTP request/response is simpler, debuggable with curl, and fast enough. Streaming would add complexity for marginal benefit.
- **No MCP protocol.** MCP adds JSON schema overhead per request and requires a persistent connection. Plain HTTP + plain text output is lighter on tokens and easier to debug.
- **No multi-user support.** One server per workspace, one user. The token auth is defense-in-depth, not multi-tenancy.
- **No Windows/Linux cookie decryption.** macOS Keychain is the only supported credential store. Linux (GNOME Keyring/kwallet) and Windows (DPAPI) are architecturally possible but not implemented.
- **No iframe support.** Playwright can handle iframes but the ref system doesn't cross frame boundaries yet. This is the most-requested missing feature.

@@ -4,9 +4,14 @@

```bash
bun install            # install dependencies
bun test               # run integration tests (browse + snapshot)
bun test               # run tests (browse + snapshot + skill validation)
bun run test:eval      # run LLM-as-judge evals (needs ANTHROPIC_API_KEY)
bun run test:e2e       # run E2E skill tests (needs SKILL_E2E=1, ~$0.50/run)
bun run dev <cmd>      # run CLI in dev mode, e.g. bun run dev goto https://example.com
bun run build          # compile binary to browse/dist/browse
bun run build          # gen docs + compile binaries
bun run gen:skill-docs # regenerate SKILL.md files from templates
bun run skill:check    # health dashboard for all skills
bun run dev:skill      # watch mode: auto-regen + validate on change
```

## Project structure
@@ -15,18 +20,42 @@ bun run build # compile binary to browse/dist/browse
gstack/
├── browse/                      # Headless browser CLI (Playwright)
│   ├── src/                     # CLI + server + commands
│   │   ├── commands.ts          # Command registry (single source of truth)
│   │   └── snapshot.ts          # SNAPSHOT_FLAGS metadata array
│   ├── test/                    # Integration tests + fixtures
│   └── dist/                    # Compiled binary
├── scripts/                     # Build + DX tooling
│   ├── gen-skill-docs.ts        # Template → SKILL.md generator
│   ├── skill-check.ts           # Health dashboard
│   └── dev-skill.ts             # Watch mode
├── test/                        # Skill validation + eval tests
│   ├── helpers/                 # skill-parser.ts, session-runner.ts
│   ├── skill-validation.test.ts # Tier 1: static command validation
│   ├── gen-skill-docs.test.ts   # Tier 1: generator + quality evals
│   ├── skill-e2e.test.ts        # Tier 2: Agent SDK E2E
│   └── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge
├── ship/                        # Ship workflow skill
├── review/                      # PR review skill
├── plan-ceo-review/             # /plan-ceo-review skill
├── plan-eng-review/             # /plan-eng-review skill
├── retro/                       # Retrospective skill
├── setup                        # One-time setup: build binary + symlink skills
├── SKILL.md                     # Browse skill (Claude discovers this)
├── SKILL.md                     # Generated from SKILL.md.tmpl (don't edit directly)
├── SKILL.md.tmpl                # Template: edit this, run gen:skill-docs
└── package.json                 # Build scripts for browse
```

## SKILL.md workflow

SKILL.md files are **generated** from `.tmpl` templates. To update docs:

1. Edit the `.tmpl` file (e.g. `SKILL.md.tmpl` or `browse/SKILL.md.tmpl`)
2. Run `bun run gen:skill-docs` (or `bun run build` which does it automatically)
3. Commit both the `.tmpl` and generated `.md` files

To add a new browse command: add it to `browse/src/commands.ts` and rebuild.
To add a snapshot flag: add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts` and rebuild.

## Browser interaction

When you need to interact with a browser (QA, dogfooding, cookie setup), use the

+28
-3
@@ -63,16 +63,41 @@ bin/dev-teardown
## Running tests

```bash
bun test               # all tests (browse integration + snapshot)
bun test               # Tier 1: browse integration + skill validation (free, <5s)
bun run test:eval      # Tier 3: LLM-as-judge quality evals (needs ANTHROPIC_API_KEY, ~$0.03)
bun run test:e2e       # Tier 2: E2E skill tests via Agent SDK (needs SKILL_E2E=1, ~$0.50)
bun run test:all       # Tier 1 + Tier 2
bun run dev <cmd>      # run CLI in dev mode, e.g. bun run dev goto https://example.com
bun run build          # compile binary to browse/dist/browse
bun run build          # gen docs + compile binaries
```

**Tier 1** (static validation) runs automatically: it parses every `$B` command in SKILL.md files and validates them against the command registry. **Tier 2** (E2E) spawns real Claude sessions and costs money. **Tier 3** (LLM-as-judge) uses Haiku to score generated docs on clarity/completeness/actionability.

Tests run against the browse binary directly; they don't require dev mode.

## Editing SKILL.md files

SKILL.md files are **generated** from `.tmpl` templates. Don't edit the `.md` directly; your changes will be overwritten on the next build.

```bash
# 1. Edit the template
vim SKILL.md.tmpl # or browse/SKILL.md.tmpl

# 2. Regenerate
bun run gen:skill-docs

# 3. Check health
bun run skill:check

# Or use watch mode: auto-regenerates on save
bun run dev:skill
```

To add a browse command, add it to `browse/src/commands.ts`. To add a snapshot flag, add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts`. Then rebuild.

## Things to know

- **SKILL.md changes are instant.** They're just Markdown. Edit, save, invoke.
- **SKILL.md files are generated.** Edit the `.tmpl` template, not the `.md`. Run `bun run gen:skill-docs` to regenerate.
- **Browse source changes need a rebuild.** If you touch `browse/src/*.ts`, run `bun run build`.
- **Dev mode shadows your global install.** Project-local skills take priority over `~/.claude/skills/gstack`. `bin/dev-teardown` restores the global one.
- **Conductor workspaces are independent.** Each workspace is its own clone. Run `bin/dev-setup` in the one you're working in.