gstack/BROWSER.md

# Browser — technical details

This document covers the command reference and internals of gstack's headless browser.

## Command reference

| Category | Commands | What for |
|----------|----------|----------|
| Navigate | `goto`, `back`, `forward`, `reload`, `url` | Get to a page |
| Read | `text`, `html`, `links`, `forms`, `accessibility` | Extract content |
| Snapshot | `snapshot [-i] [-c] [-d N] [-s sel] [-D] [-a] [-o] [-C]` | Get refs, diff, annotate |
| Interact | `click`, `fill`, `select`, `hover`, `type`, `press`, `scroll`, `wait`, `viewport`, `upload` | Use the page |
| Inspect | `js`, `eval`, `css`, `attrs`, `is`, `console`, `network`, `dialog`, `cookies`, `storage`, `perf` | Debug and verify |
| Visual | `screenshot [--viewport] [--clip x,y,w,h] [sel\|@ref] [path]`, `pdf`, `responsive` | See what Claude sees |
| Compare | `diff <url1> <url2>` | Spot differences between environments |
| Dialogs | `dialog-accept [text]`, `dialog-dismiss` | Control alert/confirm/prompt handling |
| Tabs | `tabs`, `tab`, `newtab`, `closetab` | Multi-page workflows |
| Cookies | `cookie-import`, `cookie-import-browser` | Import cookies from file or real browser |
| Multi-step | `chain` (JSON from stdin) | Batch commands in one call |
| Handoff | `handoff [reason]`, `resume` | Switch to visible Chrome for user takeover |
| Real browser | `connect`, `disconnect`, `focus` | Control real Chrome, visible window |

All selector arguments accept CSS selectors, `@e` refs after `snapshot`, or `@c` refs after `snapshot -C`. 50+ commands total plus cookie import.

## How it works

gstack's browser is a compiled CLI binary that talks to a persistent local Chromium daemon over HTTP. The CLI is a thin client — it reads a state file, sends a command, and prints the response to stdout. The server does the real work via [Playwright](https://playwright.dev/).

```
┌─────────────────────────────────────────────────────────────────┐
│  Claude Code                                                    │
│                                                                 │
│  "browse goto https://staging.myapp.com"                        │
│       │                                                         │
│       ▼                                                         │
│  ┌──────────┐    HTTP POST     ┌──────────────┐                 │
│  │ browse   │ ──────────────── │ Bun HTTP     │                 │
│  │ CLI      │  localhost:rand  │ server       │                 │
│  │          │  Bearer token    │              │                 │
│  │ compiled │ ◄──────────────  │  Playwright  │──── Chromium    │
│  │ binary   │  plain text      │  API calls   │    (headless)   │
│  └──────────┘                  └──────────────┘                 │
│   ~1ms startup                  persistent daemon               │
│                                 auto-starts on first call       │
│                                 auto-stops after 30 min idle    │
└─────────────────────────────────────────────────────────────────┘
```

### Lifecycle

1. **First call**: CLI checks `.gstack/browse.json` (in the project root) for a running server. None found — it spawns `bun run browse/src/server.ts` in the background. The server launches headless Chromium via Playwright, picks a random port (10000-60000), generates a bearer token, writes the state file, and starts accepting HTTP requests. This takes ~3 seconds.

2. **Subsequent calls**: CLI reads the state file, sends an HTTP POST with the bearer token, prints the response. ~100-200ms round trip.

3. **Idle shutdown**: After 30 minutes with no commands, the server shuts down and cleans up the state file. Next call restarts it automatically.

4. **Crash recovery**: If Chromium crashes, the server exits immediately (no self-healing — don't hide failure). The CLI detects the dead server on the next call and starts a fresh one.

### Key components

```
browse/
├── src/
│   ├── cli.ts              # Thin client — reads state file, sends HTTP, prints response
│   ├── server.ts           # Bun.serve HTTP server — routes commands to Playwright
│   ├── browser-manager.ts  # Chromium lifecycle — launch, tabs, ref map, crash handling
│   ├── snapshot.ts         # Accessibility tree → @ref assignment → Locator map + diff/annotate/-C
│   ├── read-commands.ts    # Non-mutating commands (text, html, links, js, css, is, dialog, etc.)
│   ├── write-commands.ts   # Mutating commands (click, fill, select, upload, dialog-accept, etc.)
│   ├── meta-commands.ts    # Server management, chain, diff, snapshot routing
│   ├── cookie-import-browser.ts  # Decrypt + import cookies from real Chromium browsers
│   ├── cookie-picker-routes.ts   # HTTP routes for interactive cookie picker UI
│   ├── cookie-picker-ui.ts       # Self-contained HTML/CSS/JS for cookie picker
│   ├── activity.ts         # Activity streaming (SSE) for Chrome extension
│   └── buffers.ts          # CircularBuffer<T> + console/network/dialog capture
├── test/                   # Integration tests + HTML fixtures
└── dist/
    └── browse              # Compiled binary (~58MB, Bun --compile)
```

### The snapshot system

The browser's key innovation is ref-based element selection, built on Playwright's accessibility tree API:

1. `page.locator(scope).ariaSnapshot()` returns a YAML-like accessibility tree
2. The snapshot parser assigns refs (`@e1`, `@e2`, ...) to each element
3. For each ref, it builds a Playwright `Locator` (using `getByRole` + nth-child)
4. The ref-to-Locator map is stored on `BrowserManager`
5. Later commands like `click @e3` look up the Locator and call `locator.click()`

No DOM mutation. No injected scripts. Just Playwright's native accessibility API.

**Ref staleness detection:** SPAs can mutate the DOM without navigation (React router, tab switches, modals). When this happens, refs collected from a previous `snapshot` may point to elements that no longer exist. To handle this, `resolveRef()` runs an async `count()` check before using any ref — if the element count is 0, it throws immediately with a message telling the agent to re-run `snapshot`. This fails fast (~5ms) instead of waiting for Playwright's 30-second action timeout.

**Extended snapshot features:**
- `--diff` (`-D`): Stores each snapshot as a baseline. On the next `-D` call, returns a unified diff showing what changed. Use this to verify that an action (click, fill, etc.) actually worked.
- `--annotate` (`-a`): Injects temporary overlay divs at each ref's bounding box, takes a screenshot with ref labels visible, then removes the overlays. Use `-o <path>` to control the output path.
- `--cursor-interactive` (`-C`): Scans for non-ARIA interactive elements (divs with `cursor:pointer`, `onclick`, `tabindex>=0`) using `page.evaluate`. Assigns `@c1`, `@c2`... refs with deterministic `nth-child` CSS selectors. These are elements the ARIA tree misses but users can still click.

### Screenshot modes

The `screenshot` command supports four modes:

| Mode | Syntax | Playwright API |
|------|--------|----------------|
| Full page (default) | `screenshot [path]` | `page.screenshot({ fullPage: true })` |
| Viewport only | `screenshot --viewport [path]` | `page.screenshot({ fullPage: false })` |
| Element crop | `screenshot "#sel" [path]` or `screenshot @e3 [path]` | `locator.screenshot()` |
| Region clip | `screenshot --clip x,y,w,h [path]` | `page.screenshot({ clip })` |

Element crop accepts CSS selectors (`.class`, `#id`, `[attr]`) or `@e`/`@c` refs from `snapshot`. Auto-detection: `@e`/`@c` prefix = ref, `.`/`#`/`[` prefix = CSS selector, `--` prefix = flag, everything else = output path.

Mutual exclusion: `--clip` + selector and `--viewport` + `--clip` both throw errors. Unknown flags (e.g. `--bogus`) also throw.

### Authentication

Each server session generates a random UUID as a bearer token. The token is written to the state file (`.gstack/browse.json`) with chmod 600. Every HTTP request must include `Authorization: Bearer <token>`. This prevents other processes on the machine from controlling the browser.

### Console, network, and dialog capture

The server hooks into Playwright's `page.on('console')`, `page.on('response')`, and `page.on('dialog')` events. All entries are kept in O(1) circular buffers (50,000 capacity each) and flushed to disk asynchronously via `Bun.write()`:

- Console: `.gstack/browse-console.log`
- Network: `.gstack/browse-network.log`
- Dialog: `.gstack/browse-dialog.log`

The `console`, `network`, and `dialog` commands read from the in-memory buffers, not disk.

### Real browser mode (`connect`)

Instead of headless Chromium, `connect` launches your real Chrome as a headed window controlled by Playwright. You see everything Claude does in real time.

```bash
$B connect              # launch real Chrome, headed
$B goto https://app.com # navigates in the visible window
$B snapshot -i          # refs from the real page
$B click @e3            # clicks in the real window
$B focus                # bring Chrome window to foreground (macOS)
$B status               # shows Mode: cdp
$B disconnect           # back to headless mode
```

The window has a subtle green shimmer line at the top edge and a floating "gstack" pill in the bottom-right corner so you always know which Chrome window is being controlled.

**How it works:** Playwright's `channel: 'chrome'` launches your system Chrome binary via a native pipe protocol — not CDP WebSocket. All existing browse commands work unchanged because they go through Playwright's abstraction layer.

**When to use it:**
- QA testing where you want to watch Claude click through your app
- Design review where you need to see exactly what Claude sees
- Debugging where headless behavior differs from real Chrome
- Demos where you're sharing your screen

**Commands:**

| Command | What it does |
|---------|-------------|
| `connect` | Launch real Chrome, restart server in headed mode |
| `disconnect` | Close real Chrome, restart in headless mode |
| `focus` | Bring Chrome to foreground (macOS). `focus @e3` also scrolls element into view |
| `status` | Shows `Mode: cdp` when connected, `Mode: launched` when headless |

**CDP-aware skills:** When in real-browser mode, `/qa` and `/design-review` automatically skip cookie import prompts and headless workarounds.

### Chrome extension (Side Panel)

A Chrome extension that shows a live activity feed of browse commands in a Side Panel, plus @ref overlays on the page.

#### Automatic install (recommended)

When you run `$B connect`, the extension **auto-loads** into the Playwright-controlled Chrome window. No manual steps needed — the Side Panel is immediately available.

```bash
$B connect              # launches Chrome with extension pre-loaded
# Click the gstack icon in toolbar → Open Side Panel
```

The port is auto-configured. You're done.

#### Manual install (for your regular Chrome)

If you want the extension in your everyday Chrome (not the Playwright-controlled one), run:

```bash
bin/gstack-extension    # opens chrome://extensions, copies path to clipboard
```

Or do it manually:

1. **Go to `chrome://extensions`** in Chrome's address bar
2. **Toggle "Developer mode" ON** (top-right corner)
3. **Click "Load unpacked"** — a file picker opens
4. **Navigate to the extension folder:** Press **Cmd+Shift+G** in the file picker to open "Go to folder", then paste one of these paths:
   - Global install: `~/.claude/skills/gstack/extension`
   - Dev/source: `<gstack-repo>/extension`

   Press Enter, then click **Select**.

   (Tip: macOS hides folders starting with `.` — press **Cmd+Shift+.** in the file picker to reveal them if you prefer to navigate manually.)

5. **Pin it:** Click the puzzle piece icon (Extensions) in the toolbar → pin "gstack browse"
6. **Set the port:** Click the gstack icon → enter the port from `$B status` or `.gstack/browse.json`
7. **Open Side Panel:** Click the gstack icon → "Open Side Panel"

#### What you get

| Feature | What it does |
|---------|-------------|
| **Toolbar badge** | Green dot when the browse server is reachable, gray when not |
| **Side Panel** | Live scrolling feed of every browse command — shows command name, args, duration, status (success/error) |
| **Refs tab** | After `$B snapshot`, shows the current @ref list (role + name) |
| **@ref overlays** | Floating panel on the page showing current refs |
| **Connection pill** | Small "gstack" pill in the bottom-right corner of every page when connected |

#### Troubleshooting

- **Badge stays gray:** Check that the port is correct. The browse server may have restarted on a different port — re-run `$B status` and update the port in the popup.
- **Side Panel is empty:** The feed only shows activity after the extension connects. Run a browse command (`$B snapshot`) to see it appear.
- **Extension disappeared after Chrome update:** Sideloaded extensions persist across updates. If it's gone, reload it from Step 3.

### Sidebar agent

The Chrome side panel includes a chat interface. Type a message and a child Claude instance executes it in the browser. The sidebar agent has access to `Bash`, `Read`, `Glob`, and `Grep` tools (same as Claude Code, minus `Edit` and `Write` ... read-only by design).

**How it works:**

1. You type a message in the side panel chat
2. The extension POSTs to the local browse server (`/sidebar-command`)
3. The server queues the message and the sidebar-agent process spawns `claude -p` with your message + the current page context
4. Claude executes browse commands via Bash (`$B snapshot`, `$B click @e3`, etc.)
5. Progress streams back to the side panel in real time

**What you can do:**
- "Take a snapshot and describe what you see"
- "Click the Login button, fill in test@example.com / password123, and submit"
- "Go through every row in this table and extract the names and emails"
- "Navigate to Settings > Account and screenshot it"

**Timeout:** Each task gets up to 5 minutes. Multi-page workflows (navigating a directory, filling forms across pages) work within this window. If a task times out, the side panel shows an error and you can retry or break it into smaller steps.

**Session isolation:** Each sidebar session runs in its own git worktree. The sidebar agent won't interfere with your main Claude Code session.

**Authentication:** The sidebar agent uses the same browser session as headed mode. Two options:
1. Log in manually in the headed browser ... your session persists for the sidebar agent
2. Import cookies from your real Chrome via `/setup-browser-cookies`

**Random delays:** If you need the agent to pause between actions (e.g., to avoid rate limits), use `sleep` in bash or `$B wait <milliseconds>`.

### User handoff

When the headless browser can't proceed (CAPTCHA, MFA, complex auth), `handoff` opens a visible Chrome window at the exact same page with all cookies, localStorage, and tabs preserved. The user solves the problem manually, then `resume` returns control to the agent with a fresh snapshot.

```bash
$B handoff "Stuck on CAPTCHA at login page"   # opens visible Chrome
# User solves CAPTCHA...
$B resume                                       # returns to headless with fresh snapshot
```

The browser auto-suggests `handoff` after 3 consecutive failures. State is fully preserved across the switch — no re-login needed.

### Dialog handling

Dialogs (alert, confirm, prompt) are auto-accepted by default to prevent browser lockup. The `dialog-accept` and `dialog-dismiss` commands control this behavior. For prompts, `dialog-accept <text>` provides the response text. All dialogs are logged to the dialog buffer with type, message, and action taken.

### JavaScript execution (`js` and `eval`)

`js` runs a single expression, `eval` runs a JS file. Both support `await` — expressions containing `await` are automatically wrapped in an async context:

```bash
$B js "await fetch('/api/data').then(r => r.json())"  # works
$B js "document.title"                                  # also works (no wrapping needed)
$B eval my-script.js                                    # file with await works too
```

For `eval` files, single-line files return the expression value directly. Multi-line files need explicit `return` when using `await`. Comments containing "await" don't trigger wrapping.

### Multi-workspace support

Each workspace gets its own isolated browser instance with its own Chromium process, tabs, cookies, and logs. State is stored in `.gstack/` inside the project root (detected via `git rev-parse --show-toplevel`).

| Workspace | State file | Port |
|-----------|------------|------|
| `/code/project-a` | `/code/project-a/.gstack/browse.json` | random (10000-60000) |
| `/code/project-b` | `/code/project-b/.gstack/browse.json` | random (10000-60000) |

No port collisions. No shared state. Each project is fully isolated.

### Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `BROWSE_PORT` | 0 (random 10000-60000) | Fixed port for the HTTP server (debug override) |
| `BROWSE_IDLE_TIMEOUT` | 1800000 (30 min) | Idle shutdown timeout in ms |
| `BROWSE_STATE_FILE` | `.gstack/browse.json` | Path to state file (CLI passes to server) |
| `BROWSE_SERVER_SCRIPT` | auto-detected | Path to server.ts |
| `BROWSE_CDP_URL` | (none) | Set to `channel:chrome` for real browser mode |
| `BROWSE_CDP_PORT` | 0 | CDP port (used internally) |

### Performance

| Tool | First call | Subsequent calls | Context overhead per call |
|------|-----------|-----------------|--------------------------|
| Chrome MCP | ~5s | ~2-5s | ~2000 tokens (schema + protocol) |
| Playwright MCP | ~3s | ~1-3s | ~1500 tokens (schema + protocol) |
| **gstack browse** | **~3s** | **~100-200ms** | **0 tokens** (plain text stdout) |

The context overhead difference compounds fast. In a 20-command browser session, MCP tools burn 30,000-40,000 tokens on protocol framing alone. gstack burns zero.

### Why CLI over MCP?

MCP (Model Context Protocol) works well for remote services, but for local browser automation it adds pure overhead:

- **Context bloat**: every MCP call includes full JSON schemas and protocol framing. A simple "get the page text" costs 10x more context tokens than it should.
- **Connection fragility**: persistent WebSocket/stdio connections drop and fail to reconnect.
- **Unnecessary abstraction**: Claude Code already has a Bash tool. A CLI that prints to stdout is the simplest possible interface.

gstack skips all of this. Compiled binary. Plain text in, plain text out. No protocol. No schema. No connection management.

## Acknowledgments

The browser automation layer is built on [Playwright](https://playwright.dev/) by Microsoft. Playwright's accessibility tree API, locator system, and headless Chromium management are what make ref-based interaction possible. The snapshot system — assigning `@ref` labels to accessibility tree nodes and mapping them back to Playwright Locators — is built entirely on top of Playwright's primitives. Thank you to the Playwright team for building such a solid foundation.

## Development

### Prerequisites

- [Bun](https://bun.sh/) v1.0+
- Playwright's Chromium (installed automatically by `bun install`)

### Quick start

```bash
bun install              # install dependencies + Playwright Chromium
bun test                 # run integration tests (~3s)
bun run dev <cmd>        # run CLI from source (no compile)
bun run build            # compile to browse/dist/browse
```

### Dev mode vs compiled binary

During development, use `bun run dev` instead of the compiled binary. It runs `browse/src/cli.ts` directly with Bun, so you get instant feedback without a compile step:

```bash
bun run dev goto https://example.com
bun run dev text
bun run dev snapshot -i
bun run dev click @e3
```

The compiled binary (`bun run build`) is only needed for distribution. It produces a single ~58MB executable at `browse/dist/browse` using Bun's `--compile` flag.

### Running tests

```bash
bun test                         # run all tests
bun test browse/test/commands              # run command integration tests only
bun test browse/test/snapshot              # run snapshot tests only
bun test browse/test/cookie-import-browser # run cookie import unit tests only
```

Tests spin up a local HTTP server (`browse/test/test-server.ts`) serving HTML fixtures from `browse/test/fixtures/`, then exercise the CLI commands against those pages. 203 tests across 3 files, ~15 seconds total.

### Source map

| File | Role |
|------|------|
| `browse/src/cli.ts` | Entry point. Reads `.gstack/browse.json`, sends HTTP to the server, prints response. |
| `browse/src/server.ts` | Bun HTTP server. Routes commands to the right handler. Manages idle timeout. |
| `browse/src/browser-manager.ts` | Chromium lifecycle — launch, tab management, ref map, crash detection. |
| `browse/src/snapshot.ts` | Parses accessibility tree, assigns `@e`/`@c` refs, builds Locator map. Handles `--diff`, `--annotate`, `-C`. |
| `browse/src/read-commands.ts` | Non-mutating commands: `text`, `html`, `links`, `js`, `css`, `is`, `dialog`, `forms`, etc. Exports `getCleanText()`. |
| `browse/src/write-commands.ts` | Mutating commands: `goto`, `click`, `fill`, `upload`, `dialog-accept`, `useragent` (with context recreation), etc. |
| `browse/src/meta-commands.ts` | Server management, chain routing, diff (DRY via `getCleanText`), snapshot delegation. |
| `browse/src/cookie-import-browser.ts` | Decrypt Chromium cookies from macOS and Linux browser profiles using platform-specific safe-storage key lookup. Auto-detects installed browsers. |
| `browse/src/cookie-picker-routes.ts` | HTTP routes for `/cookie-picker/*` — browser list, domain search, import, remove. |
| `browse/src/cookie-picker-ui.ts` | Self-contained HTML generator for the interactive cookie picker (dark theme, no frameworks). |
| `browse/src/activity.ts` | Activity streaming — `ActivityEntry` type, `CircularBuffer`, privacy filtering, SSE subscriber management. |
| `browse/src/buffers.ts` | `CircularBuffer<T>` (O(1) ring buffer) + console/network/dialog capture with async disk flush. |

### Deploying to the active skill

The active skill lives at `~/.claude/skills/gstack/`. After making changes:

1. Push your branch
2. Pull in the skill directory: `cd ~/.claude/skills/gstack && git pull`
3. Rebuild: `cd ~/.claude/skills/gstack && bun run build`

Or copy the binary directly: `cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse`

### Adding a new command

1. Add the handler in `read-commands.ts` (non-mutating) or `write-commands.ts` (mutating)
2. Register the route in `server.ts`
3. Add a test case in `browse/test/commands.test.ts` with an HTML fixture if needed
4. Run `bun test` to verify
5. Run `bun run build` to compile