Merge branch 'main' into garrytan/team-supabase-store

Brings in 48 commits from main (v0.15.7–v0.15.16): deterministic slugs,
TabSession refactor, pair-agent tunnel fix, content security layers,
community security wave, team-friendly install, interactive snapshots.

Conflict resolution:
- .gitignore: merged both sides (kept .factory/ + added .kiro/.opencode/
  .slate/.cursor/.openclaw/ from main)
- open-gstack-browser/SKILL.md: accepted main (renamed from .factory/)
- setup-team-sync/SKILL.md: regenerated via gen:skill-docs
- test/fixtures/golden/*: updated golden baselines for ship SKILL.md
- codex-ship-SKILL.md: accepted main (renamed from .factory/)
- package.json version: synced to VERSION (0.15.16.0)
- bin/gstack-uninstall: check settings file exists before claiming
  SessionStart hook removal (fixes false positive on clean systems)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-04-07 20:47:07 -10:00
258 changed files with 55174 additions and 2692 deletions
# Adding a New Host to gstack
gstack uses a declarative host config system. Each supported AI coding agent
(Claude, Codex, Factory, Kiro, OpenCode, Slate, Cursor, OpenClaw) is defined
as a typed TypeScript config object. Adding a new host means creating one file
and re-exporting it. Zero code changes to the generator, setup, or tooling.
## How it works
```
hosts/
├── claude.ts # Primary host
├── codex.ts # OpenAI Codex CLI
├── factory.ts # Factory Droid
├── kiro.ts # Amazon Kiro
├── opencode.ts # OpenCode
├── slate.ts # Slate (Random Labs)
├── cursor.ts # Cursor
├── openclaw.ts # OpenClaw (hybrid: config + adapter)
└── index.ts # Registry: imports all, derives Host type
```
Each config file exports a `HostConfig` object that tells the generator:
- Where to put generated skills (paths)
- How to transform frontmatter (allowlist/denylist fields)
- What Claude-specific references to rewrite (paths, tool names)
- What binary to detect for auto-install
- What resolver sections to suppress
- What assets to symlink at install time
The generator, setup script, platform-detect, uninstall, health checks, worktree
copy, and tests all read from these configs. None of them have per-host code.
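As a concrete illustration, the core of a config-driven pass can be sketched as a pure function over `pathRewrites` (the helper name and example strings here are invented; the real generator lives in the gen-skill-docs pipeline):

```typescript
// Hypothetical sketch of how a generator pass might apply a host's
// pathRewrites to skill content. The rewrite shape matches HostConfig;
// applyPathRewrites itself is illustrative, not the real implementation.
interface PathRewrite {
  from: string;
  to: string;
}

function applyPathRewrites(content: string, rewrites: PathRewrite[]): string {
  // Literal replaceAll, in declaration order — order matters because
  // longer paths must be rewritten before their shorter prefixes.
  return rewrites.reduce(
    (acc, { from, to }) => acc.split(from).join(to),
    content,
  );
}

const rewritten = applyPathRewrites(
  'Skills live in ~/.claude/skills/gstack and .claude/skills.',
  [
    { from: '~/.claude/skills/gstack', to: '~/.myhost/skills/gstack' },
    { from: '.claude/skills/gstack', to: '.myhost/skills/gstack' },
    { from: '.claude/skills', to: '.myhost/skills' },
  ],
);
// rewritten === 'Skills live in ~/.myhost/skills/gstack and .myhost/skills.'
```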
## Step-by-step: add a new host
### 1. Create the config file
Copy an existing config as a starting point. `hosts/opencode.ts` is a good
minimal example. `hosts/factory.ts` shows tool rewrites and conditional fields.
`hosts/openclaw.ts` shows the adapter pattern for hosts with different tool models.
Create `hosts/myhost.ts`:
```typescript
import type { HostConfig } from '../scripts/host-config';

const myhost: HostConfig = {
  name: 'myhost',
  displayName: 'MyHost',
  cliCommand: 'myhost',       // binary name for `command -v` detection
  cliAliases: [],             // alternative binary names
  globalRoot: '.myhost/skills/gstack',
  localSkillRoot: '.myhost/skills/gstack',
  hostSubdir: '.myhost',
  usesEnvVars: true,          // false only for Claude (uses literal ~ paths)
  frontmatter: {
    mode: 'allowlist',        // 'allowlist' keeps only listed fields
    keepFields: ['name', 'description'],
    descriptionLimit: null,   // set to 1024 for hosts with limits
  },
  generation: {
    generateMetadata: false,  // true only for Codex (openai.yaml)
    skipSkills: ['codex'],    // codex skill is Claude-only
  },
  pathRewrites: [
    { from: '~/.claude/skills/gstack', to: '~/.myhost/skills/gstack' },
    { from: '.claude/skills/gstack', to: '.myhost/skills/gstack' },
    { from: '.claude/skills', to: '.myhost/skills' },
  ],
  runtimeRoot: {
    globalSymlinks: ['bin', 'browse/dist', 'browse/bin', 'gstack-upgrade', 'ETHOS.md'],
    globalFiles: { 'review': ['checklist.md', 'TODOS-format.md'] },
  },
  install: {
    prefixable: false,
    linkingStrategy: 'symlink-generated',
  },
  learningsMode: 'basic',
};

export default myhost;
```
### 2. Register in the index
Edit `hosts/index.ts`:
```typescript
import myhost from './myhost';

// Add to ALL_HOST_CONFIGS array:
export const ALL_HOST_CONFIGS: HostConfig[] = [
  claude, codex, factory, kiro, opencode, slate, cursor, openclaw, myhost,
];

// Add to re-exports:
export { claude, codex, factory, kiro, opencode, slate, cursor, openclaw, myhost };
```
### 3. Add to .gitignore
Add `.myhost/` to `.gitignore` (generated skill docs are gitignored).
### 4. Generate and verify
```bash
# Generate skill docs for the new host
bun run gen:skill-docs --host myhost
# Verify output exists and has no .claude/skills leakage
ls .myhost/skills/gstack-*/SKILL.md
grep -r ".claude/skills" .myhost/skills/ | head -5
# (should be empty)
# Generate for all hosts (includes the new one)
bun run gen:skill-docs --host all
# Health dashboard shows the new host
bun run skill:check
```
### 5. Run tests
```bash
bun test test/gen-skill-docs.test.ts
bun test test/host-config.test.ts
```
The parameterized smoke tests automatically pick up the new host. Zero test
code to write. They verify: output exists, no path leakage, valid frontmatter,
freshness check passes, codex skill excluded.
### 6. Update README.md
Add install instructions for the new host in the appropriate section.
## Config field reference
See `scripts/host-config.ts` for the full `HostConfig` interface with JSDoc
comments on every field.
Key fields:

| Field | Purpose |
|-------|---------|
| `frontmatter.mode` | `allowlist` (keep only listed) or `denylist` (strip listed) |
| `frontmatter.descriptionLimit` | Max chars, `null` for no limit |
| `frontmatter.descriptionLimitBehavior` | `error` (fail build), `truncate`, `warn` |
| `frontmatter.conditionalFields` | Add fields based on template values (e.g., sensitive → disable-model-invocation) |
| `frontmatter.renameFields` | Rename template fields (e.g., voice-triggers → triggers) |
| `pathRewrites` | Literal replaceAll on content. Order matters. |
| `toolRewrites` | Rewrite Claude tool names (e.g., "use the Bash tool" → "run this command") |
| `suppressedResolvers` | Resolver functions that return empty for this host |
| `coAuthorTrailer` | Git co-author string for commits |
| `boundaryInstruction` | Anti-prompt-injection warning for cross-model invocations |
| `adapter` | Path to adapter module for complex transformations |
## Adapter pattern (for hosts with different tool models)
If string-replace tool rewrites aren't enough (the host has fundamentally
different tool semantics), use the adapter pattern. See `hosts/openclaw.ts`
and `scripts/host-adapters/openclaw-adapter.ts`.
The adapter runs as a post-processing step after all generic rewrites. It
exports `transform(content: string, config: HostConfig): string`.
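A minimal sketch of an adapter module in that shape (the `HostConfig` stand-in and the rewrite shown are invented for illustration; the real adapter does OpenClaw-specific work):

```typescript
// Stand-in for the real HostConfig interface in scripts/host-config.ts
type HostConfig = { name: string };

// Runs after all generic rewrites; free to do structure-aware
// transformations that literal string replacement can't express.
export function transform(content: string, config: HostConfig): string {
  // Invented example: collapse Claude-specific tool phrasing
  return content.replace(/use the Bash tool to run/g, 'run this command:');
}

const out = transform('To build, use the Bash tool to run `bun build`.', {
  name: 'openclaw',
});
// out === 'To build, run this command: `bun build`.'
```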
## Validation
The `validateHostConfig()` function in `scripts/host-config.ts` checks:
- Name: lowercase alphanumeric with hyphens
- CLI command: alphanumeric with hyphens/underscores
- Paths: safe characters only (alphanumeric, `.`, `/`, `$`, `{}`, `~`, `-`, `_`)
- No duplicate names, hostSubdirs, or globalRoots across configs
Run `bun run scripts/host-config-export.ts validate` to check all configs.
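The name and path rules can be expressed roughly as the following regexes (a sketch of the rules listed above; `scripts/host-config.ts` is the source of truth):

```typescript
// Sketches of the validation rules above, not the actual implementation.
function validHostName(name: string): boolean {
  // lowercase alphanumeric with hyphens
  return /^[a-z0-9]+(-[a-z0-9]+)*$/.test(name);
}

function validHostPath(p: string): boolean {
  // safe characters only: alphanumeric, '.', '/', '$', '{}', '~', '-', '_'
  return /^[A-Za-z0-9./${}~_-]+$/.test(p);
}

validHostName('myhost');                // true
validHostName('My_Host');               // false
validHostPath('.myhost/skills/gstack'); // true
```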
# gstack x OpenClaw Integration
gstack integrates with OpenClaw as a methodology source, not a ported codebase.
OpenClaw's ACP runtime spawns Claude Code sessions natively. gstack provides the
planning discipline and methodology that makes those sessions better.
This is a lightweight protocol encoded as prompt text. No daemon. No JSON-RPC.
No compatibility matrices. The prompt is the bridge.
## Architecture
```
OpenClaw                                   gstack repo
─────────────────────                      ──────────────
Orchestrator: messaging,                   Source of truth for
calendar, memory, EA                       methodology + planning
│                                          │
├── Native skills (conversational)         ├── Generates native skills
│     office-hours, ceo-review,            │   via gen-skill-docs pipeline
│     investigate, retro                   │
│                                          ├── Generates gstack-lite
├── sessions_spawn(runtime: "acp")         │   (planning discipline)
│     │                                    │
│     └── Claude Code                      ├── Generates gstack-full
│         └── gstack installed at          │   (complete pipeline)
│             ~/.claude/skills/gstack      │
│                                          └── docs/OPENCLAW.md (this file)
└── Dispatch routing (AGENTS.md)
```
## Dispatch Routing
OpenClaw decides at spawn time which tier of gstack support to use:

| Tier | When | Prompt prefix |
|------|------|---------------|
| **Simple** | One-file edits, typos, config changes | No gstack context injected |
| **Medium** | Multi-file features, refactors | gstack-lite CLAUDE.md appended |
| **Heavy** | Specific gstack skill needed | "Load gstack. Run /X" |
| **Full** | Complete features, objectives, projects | gstack-full pipeline appended |
| **Plan** | "Help me plan a Claude Code project" | gstack-plan pipeline appended |
### Decision heuristic
- Can it be done in <10 lines of code? -> **Simple**
- Does it touch multiple files but the approach is obvious? -> **Medium**
- Does the user name a specific skill (/cso, /review, /qa)? -> **Heavy**
- Is it a feature, project, or objective (not a task)? -> **Full**
- Does the user want to PLAN something for Claude Code without implementing yet? -> **Plan**
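For illustration only, the heuristic reads as a routing function (the request fields are invented; OpenClaw's real routing is prompt text in AGENTS.md, not code):

```typescript
type Tier = 'simple' | 'medium' | 'heavy' | 'full' | 'plan';

// Invented signal shape for the sketch
interface SpawnRequest {
  estimatedLines: number;        // rough size of the change
  multiFile: boolean;            // touches multiple files, obvious approach
  namedSkill?: string;           // e.g. '/cso', '/review', '/qa'
  isFeatureOrObjective: boolean; // a feature/project, not a task
  planOnly: boolean;             // plan for Claude Code, don't implement
}

function routeTier(req: SpawnRequest): Tier {
  if (req.planOnly) return 'plan';
  if (req.namedSkill) return 'heavy';
  if (req.isFeatureOrObjective) return 'full';
  if (req.multiFile) return 'medium';
  return req.estimatedLines < 10 ? 'simple' : 'medium';
}
```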
### Dispatch routing guide (for AGENTS.md)
The complete ready-to-paste section lives in `openclaw/agents-gstack-section.md`.
Copy it into your OpenClaw AGENTS.md.
Key behavioral rules (these go ABOVE the dispatch tiers):
1. **Always spawn, never redirect.** When the user asks to use ANY gstack skill,
ALWAYS spawn a Claude Code session. Never tell the user to open Claude Code.
2. **Resolve the repo.** If the user names a repo, set the working directory. If
unknown, ask which repo.
3. **Autoplan runs end-to-end.** Spawn, let it run the full pipeline, report back
in chat. User should never have to leave Telegram.
### CLAUDE.md collision handling
When spawning Claude Code in a repo that already has a CLAUDE.md, APPEND
gstack-lite/full as a new section. Do not replace the repo's existing instructions.
## What gstack generates for OpenClaw
All artifacts live in the `openclaw/` directory and are generated by
`bun run gen:skill-docs --host openclaw`:
### gstack-lite (Medium tier)
`openclaw/gstack-lite-CLAUDE.md` — ~15 lines of planning discipline:
1. Read every file before modifying
2. Write a 5-line plan: what, why, which files, test case, risk
3. Resolve ambiguity using decision principles
4. Self-review before reporting done
5. Completion report: what shipped, decisions made, anything uncertain
A/B tested: 2x time, meaningfully better output.
### gstack-full (Full tier)
`openclaw/gstack-full-CLAUDE.md` — chains existing gstack skills:
1. Read CLAUDE.md and understand the project
2. Run /autoplan (CEO + eng + design review)
3. Implement the approved plan
4. Run /ship to create a PR
5. Report back with PR URL and decisions
### gstack-plan (Plan tier)
`openclaw/gstack-plan-CLAUDE.md` — full review gauntlet, no implementation:
1. Run /office-hours to produce a design doc
2. Run /autoplan (CEO + eng + design + DX reviews + codex adversarial)
3. Save the reviewed plan to `plans/<project-slug>-plan-<date>.md`
4. Report back: plan path, summary, key decisions, recommended next step
The orchestrator persists the plan link to its own memory store (brain repo,
knowledge base, or whatever is configured in AGENTS.md). When the user is
ready to build, spawn a FULL session that references the saved plan.
### Native methodology skills
Published to ClawHub. Install with `clawhub install`:
- `gstack-openclaw-office-hours` — Product interrogation (6 forcing questions)
- `gstack-openclaw-ceo-review` — Strategic challenge (10-section review, 4 modes)
- `gstack-openclaw-investigate` — Operational debugging (4-phase methodology)
- `gstack-openclaw-retro` — Operational retrospective (weekly review)
Source lives in `openclaw/skills/` in the gstack repo. These are hand-crafted
adaptations of the gstack methodology for OpenClaw's conversational context.
No gstack infrastructure (no browse, no telemetry, no preamble).
## Spawned session detection
When Claude Code runs inside a session spawned by OpenClaw, the `OPENCLAW_SESSION`
environment variable should be set. gstack detects this and adjusts:
- Skips interactive prompts (auto-chooses recommended options)
- Skips upgrade checks and telemetry prompts
- Focuses on task completion and prose reporting
Set the env var in sessions_spawn: `env: { OPENCLAW_SESSION: "1" }`
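On the gstack side, detection amounts to an env check roughly like this (a sketch; the actual check may differ):

```typescript
// Sketch of spawned-session detection via OPENCLAW_SESSION
function isSpawnedSession(env: Record<string, string | undefined>): boolean {
  return env.OPENCLAW_SESSION === '1';
}

if (isSpawnedSession(process.env)) {
  // skip interactive prompts, upgrade checks, and telemetry prompts;
  // auto-choose recommended options and report in prose
}
```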
## Installation
For OpenClaw users: tell your OpenClaw agent "install gstack for openclaw."
The agent should:
1. Install gstack-lite CLAUDE.md into its coding session templates
2. Install the 4 native methodology skills
3. Add dispatch routing to AGENTS.md
4. Verify with a test spawn
For gstack developers: `./setup --host openclaw` outputs this documentation.
The actual artifacts are generated by `bun run gen:skill-docs --host openclaw`.
## What we don't do
- No dispatch daemon (ACP handles session spawning)
- No Clawvisor relay (no security layer needed)
- No bidirectional learnings bridge (brain repo is the knowledge store)
- No JSON schemas or protocol versioning
- No SOUL.md from gstack (OpenClaw has its own)
- No full skill porting (coding skills stay native to Claude Code)
# Remote Browser Access — How to Pair With a GStack Browser
A GStack Browser server can be shared with any AI agent that can make HTTP requests.
The agent gets scoped access to a real Chromium browser: navigate pages, read content,
click elements, fill forms, take screenshots. Each agent gets its own tab.
This document is the reference for remote agents. The quick-start instructions are
generated by `$B pair-agent` with the actual credentials baked in.
## Architecture
```
Your Machine                           Remote Agent
─────────────                          ────────────
GStack Browser Server                  Any AI agent
├── Chromium (Playwright)              (OpenClaw, Hermes, Codex, etc.)
├── HTTP API on localhost:PORT               │
├── ngrok tunnel (optional)                  │
│     https://xxx.ngrok.dev  ────────────────┘
└── Token Registry
    ├── Root token (local only)
    ├── Setup keys (5 min, one-time)
    └── Session tokens (24h, scoped)
```
## Connection Flow
1. **User runs** `$B pair-agent` (or `/pair-agent` in Claude Code)
2. **Server creates** a one-time setup key (expires in 5 minutes)
3. **User copies** the instruction block into the other agent's chat
4. **Remote agent runs** `POST /connect` with the setup key
5. **Server returns** a scoped session token (24h default)
6. **Remote agent creates** its own tab via `POST /command` with `newtab`
7. **Remote agent browses** using `POST /command` with its session token + tabId
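From the remote agent's side, the flow can be sketched with `fetch` (the server URL is a placeholder, and parsing the tab id out of the `newtab` response text is an assumption):

```typescript
const SERVER = 'https://example.ngrok.dev'; // placeholder tunnel URL

// Every /command call is a Bearer-authenticated JSON POST
function commandRequest(token: string, body: object): [string, RequestInit] {
  return [`${SERVER}/command`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  }];
}

async function connectAndBrowse(setupKey: string): Promise<string> {
  // 1. Exchange the one-time setup key for a session token (no auth)
  const res = await fetch(`${SERVER}/connect`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ setup_key: setupKey }),
  });
  const { token } = await res.json();

  const run = (body: object) =>
    fetch(...commandRequest(token, body)).then((r) => r.text());

  // 2. Own a tab before writing, then navigate in it
  const tabId = Number(await run({ command: 'newtab', args: [] })); // assumed: id in response text
  return run({ command: 'goto', args: ['https://example.com'], tabId });
}
```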
## API Reference
### Authentication
All endpoints except `/connect` and `/health` require a Bearer token:
```
Authorization: Bearer gsk_sess_...
```
### Endpoints
#### POST /connect
Exchange a setup key for a session token. No auth required. Rate-limited to 3/minute.
```json
Request: {"setup_key": "gsk_setup_..."}
Response: {"token": "gsk_sess_...", "expires": "ISO8601", "scopes": ["read","write"], "agent": "agent-name"}
```
#### POST /command
Send a browser command. Requires Bearer auth.
```json
Request: {"command": "goto", "args": ["https://example.com"], "tabId": 1}
Response: (plain text result of the command)
```
#### GET /health
Server status. No auth required. Returns status, tabs, mode, uptime.
### Commands
#### Navigation
| Command | Args | Description |
|---------|------|-------------|
| `goto` | `["URL"]` | Navigate to a URL |
| `back` | `[]` | Go back |
| `forward` | `[]` | Go forward |
| `reload` | `[]` | Reload page |
#### Reading Content
| Command | Args | Description |
|---------|------|-------------|
| `snapshot` | `["-i"]` | Interactive snapshot with @ref labels (most useful) |
| `text` | `[]` | Full page text |
| `html` | `["selector?"]` | HTML of element or full page |
| `links` | `[]` | All links on page |
| `screenshot` | `["/tmp/s.png"]` | Take a screenshot |
| `url` | `[]` | Current URL |
#### Interaction
| Command | Args | Description |
|---------|------|-------------|
| `click` | `["@e3"]` | Click an element (use @ref from snapshot) |
| `fill` | `["@e5", "text"]` | Fill a form field |
| `select` | `["@e7", "option"]` | Select dropdown value |
| `type` | `["text"]` | Type text (keyboard) |
| `press` | `["Enter"]` | Press a key |
| `scroll` | `["down"]` | Scroll the page |
#### Tabs
| Command | Args | Description |
|---------|------|-------------|
| `newtab` | `["URL?"]` | Create a new tab (required before writing) |
| `tabs` | `[]` | List all tabs |
| `closetab` | `["id?"]` | Close a tab |
## The Snapshot → @ref Pattern
This is the most powerful browsing pattern. Instead of writing CSS selectors:
1. Run `snapshot -i` to get an interactive snapshot with labeled elements
2. The snapshot returns text like:
```
[Page Title]
@e1 [link] "Home"
@e2 [button] "Sign In"
@e3 [input] "Search..."
```
3. Use the `@e` refs directly in commands: `click @e2`, `fill @e3 "search query"`
This pattern is far more reliable than guessing CSS selectors. Always run
`snapshot -i` first, then use the refs.
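A remote agent can also treat snapshot output as structured data. An illustrative parser for the ref lines shown above (the real snapshot format may carry more detail):

```typescript
// Map @refs to their role and label from snapshot text
function parseRefs(snapshot: string): Map<string, { role: string; label: string }> {
  const refs = new Map<string, { role: string; label: string }>();
  for (const line of snapshot.split('\n')) {
    const m = line.match(/^(@e\d+) \[(\w+)\] "(.*)"$/);
    if (m) refs.set(m[1], { role: m[2], label: m[3] });
  }
  return refs;
}

const refs = parseRefs('[Page Title]\n@e2 [button] "Sign In"\n@e3 [input] "Search..."');
// refs.get('@e2') → { role: 'button', label: 'Sign In' }
```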
## Scopes
| Scope | What it allows |
|-------|---------------|
| `read` | snapshot, text, html, links, screenshot, url, tabs, console, etc. |
| `write` | goto, click, fill, scroll, newtab, closetab, etc. |
| `admin` | eval, js, cookies, storage, cookie-import, useragent, etc. |
| `meta` | tab, diff, frame, responsive, watch |
Default tokens get `read` + `write`. Admin requires `--admin` flag when pairing.
## Tab Isolation
Each agent owns the tabs it creates. Rules:
- **Read:** Any agent can read any tab (snapshot, text, screenshot)
- **Write:** Only the tab owner can write (click, fill, goto, etc.)
- **Unowned tabs:** Pre-existing tabs are root-only for writes
- **First step:** Always `newtab` before trying to interact
## Error Codes
| Code | Meaning | What to do |
|------|---------|------------|
| 401 | Token invalid, expired, or revoked | Ask user to run /pair-agent again |
| 403 | Command not in scope, or tab not yours | Use newtab, or ask for --admin |
| 429 | Rate limit exceeded (>10 req/s) | Wait for Retry-After header |
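For 429s, a client can honor `Retry-After` with a small helper (a generic sketch, not part of any GStack client):

```typescript
// Parse Retry-After (seconds) into a wait in milliseconds; default 1s
function retryAfterMs(header: string | null): number {
  return Number(header ?? '1') * 1000;
}

// One retry after waiting, per the guidance above
async function withRetryAfter(send: () => Promise<Response>): Promise<Response> {
  const res = await send();
  if (res.status !== 429) return res;
  await new Promise((r) => setTimeout(r, retryAfterMs(res.headers.get('Retry-After'))));
  return send();
}
```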
## Security Model
- Setup keys expire in 5 minutes and can only be used once
- Session tokens expire in 24 hours (configurable)
- The root token never appears in instruction blocks or connection strings
- Admin scope (JS execution, cookie access) is denied by default
- Tokens can be revoked instantly: `$B tunnel revoke agent-name`
- All agent activity is logged with attribution (clientId)
## Same-Machine Shortcut
If both agents are on the same machine, skip the copy-paste:
```bash
$B pair-agent --local openclaw # writes to ~/.openclaw/skills/gstack/browse-remote.json
$B pair-agent --local codex # writes to ~/.codex/skills/gstack/browse-remote.json
$B pair-agent --local cursor # writes to ~/.cursor/skills/gstack/browse-remote.json
```
No tunnel needed. Uses localhost directly.
## ngrok Tunnel Setup
For remote agents on different machines:
1. Sign up at [ngrok.com](https://ngrok.com) (free tier works)
2. Copy your auth token from the dashboard
3. Save it: `echo 'NGROK_AUTHTOKEN=your_token' > ~/.gstack/ngrok.env`
4. Optionally claim a stable domain: `echo 'NGROK_DOMAIN=your-name.ngrok-free.dev' >> ~/.gstack/ngrok.env`
5. Start with tunnel: `BROWSE_TUNNEL=1 $B restart`
6. Run `$B pair-agent` — it will use the tunnel URL automatically
# GStack Browser V0 — The AI-Native Development Browser
**Date:** 2026-03-30
**Author:** Garry Tan + Claude Code
**Status:** Phase 1a shipped, Phase 1b in progress
**Branch:** garrytan/gstack-as-browser
## The Thesis
Every other AI browser (Atlas, Dia, Comet, Chrome Auto Browse) starts with a
consumer browser and bolts AI onto it. GStack Browser inverts this. It starts
with Claude Code as the runtime and gives it a browser viewport.
The agent is the primary citizen. The browser is the canvas. Skills are
first-class capabilities. You don't "use a browser with AI help." You use
an AI that can see and interact with the web.
This is the IDE for the post-IDE era. Code lives in the terminal. The product
lives in the browser. The AI works across both simultaneously. What Cursor did
for text editors, GStack Browser does for the browser.
## What It Is Today (Phase 1a, shipped)
A double-clickable macOS .app that wraps Playwright's Chromium with the gstack
sidebar extension baked in. You open it and Claude Code can see your screen,
navigate pages, fill forms, take screenshots, inspect CSS, clean up overlays,
and run any gstack skill. All without touching a terminal.
```
GStack Browser.app (389MB, 189MB DMG)
├── Compiled browse binary (58MB) — CLI + HTTP server
├── Chrome extension (172KB) — sidebar, activity feed, inspector
├── Playwright's Chromium (330MB) — the actual browser
└── Launcher script — binds project dir, sets env vars
```
Launch → Chromium opens with sidebar → extension auto-connects to browse server
→ agent ready in ~5 seconds.
## What It Will Be
### Phase 1b: Developer UX (next)
**Command Palette (Cmd+K):** The signature interaction. Opens a fuzzy-filtered
skill picker. Type "/qa" to start QA testing, "/investigate" to debug, "/ship"
to create a PR. Skills are fetched from the browse server, not hardcoded. The
palette is the entry point to everything.
**Quick Screenshot (Cmd+Shift+S):** Capture the current viewport and pipe it into
the sidebar chat with "What do you see?" context. The AI analyzes the screenshot
and gives you actionable feedback. Visual bug reports in one keystroke.
**Status Bar:** A persistent 30px bar at the bottom of every page. Shows agent
status (idle/thinking), workspace name, current branch, and auto-detected dev
servers. Click a dev server pill to navigate. Always-visible context about what
the AI is doing.
**Auto-Detect Dev Servers:** On launch, scans common ports (3000, 3001, 4200,
5173, 5174, 8000, 8080). If exactly one server is found, auto-navigates to it.
Dev server pills in the status bar for one-click switching.
### Phase 2: BoomLooper Integration
The sidebar connects to BoomLooper's Phoenix/Elixir APIs instead of a local
`claude -p` subprocess. BoomLooper provides:
- **Multi-agent orchestration.** Spawn 5 agents in parallel, each with its own
browser tab. One runs QA, one does design review, one watches for regressions.
- **Docker infrastructure.** Each agent gets an isolated container. The browser
inside the container tests the dev server. No port conflicts, no state leakage.
- **Session persistence.** Agent conversations survive browser restarts. Pick up
where you left off.
- **Team visibility.** Your teammates can watch what your agents are doing in
real-time. Like pair programming, but the pair is 5 AI agents and you're the
conductor.
### Phase 3: Browse as BoomLooper Tool
The browse binary becomes an MCP tool in BoomLooper. Agents in Docker containers
use browse commands to test dev servers, take screenshots, fill forms, and verify
deployments. Cross-platform compilation (linux-arm64/x64) required.
### Phase 4: Chromium Fork (trigger-gated)
When the extension side panel hits hard API limits, GStack Browser ships to
external users, build infra exists, and the business justifies maintenance:
fork Chromium. Brave's `chromium_src` override pattern, CC-powered 6-week
rebases (2-4 hours with CC vs 1-2 weeks human). ~20-30 files modified.
### Phase 5: Native Shell
SwiftUI/AppKit app shell with native sidebar, isolated Chromium service. Full
platform integration. May be superseded by Phase 4 if the Chromium fork includes
a native sidebar.
## Vision: What an AI Browser Can Do
### 1. See What You See
The browser is the AI's eyes. Not through screenshots (though it can do that),
but through DOM access, CSS inspection, network monitoring, and accessibility
tree parsing. The AI understands the page structure, not just the pixels.
**Today:** `snapshot` command returns an accessibility-tree representation of any
page. The AI can "see" every button, link, form field, and text element. Element
references (`@e1`, `@e2`) let the AI click, fill, and interact.
**Next:** Real-time page observation. The AI notices when a page changes, when an
error appears in the console, when a network request fails. Proactive debugging
without being asked.
**Future:** Visual understanding. The AI compares before/after screenshots to catch
visual regressions. Pixel-level design review. "This button moved 3px left and the
font changed from 14px to 13px."
### 2. Act on What It Sees
Not just reading pages, but interacting with them like a human user would.
**Today:** Click, fill, select, hover, type, scroll, upload files, handle dialogs,
navigate, manage tabs. All via simple commands through the browse server.
**Next:** Multi-step user flows. "Log in, go to settings, change the timezone,
verify the confirmation message." The AI chains commands with verification at each
step.
**Future:** Autonomous QA agent. "Test every link on this page. Fill every form.
Try to break it." The AI runs exhaustive interaction testing without a script.
Finds bugs a human tester would miss because it tries combinations humans don't
think of.
### 3. Write Code While Browsing
This is the key differentiator. The AI can see the bug in the browser AND fix it
in the code simultaneously.
**Today:** The sidebar chat connects to Claude Code. You say "this button is
misaligned" and the AI reads the CSS, identifies the issue, and proposes a fix.
The `/design-review` skill takes screenshots, identifies visual issues, and
commits fixes with before/after evidence.
**Next:** Live reload loop. The AI edits CSS/HTML, the browser auto-reloads, the
AI verifies the fix visually. No human in the loop for simple visual fixes.
"Fix every spacing issue on this page" becomes a 30-second task.
**Future:** Full-stack debugging. The AI sees a 500 error in the browser, reads
the server logs, traces to the failing line, writes the fix, and verifies in the
browser. One command: "This page is broken. Fix it."
### 4. Understand the Whole Stack
The browser isn't just a viewport. It's a window into the application's health.
**Today:**
- Console log capture — every `console.log`, `console.error`, and warning
- Network request monitoring — every XHR, fetch, websocket, and static asset
- Performance metrics — Core Web Vitals, resource timing, paint events
- Cookie and storage inspection — read and write localStorage, sessionStorage
- CSS inspection — computed styles, box model, rule cascade
**Next:**
- Network request replay — "replay this failing request with different params"
- Performance regression detection — "this page is 200ms slower than yesterday"
- Dependency auditing — "this page loads 47 third-party scripts"
- Accessibility auditing — "this form has no labels, these colors fail contrast"
**Future:**
- Full application telemetry — CPU, memory, GPU usage in real-time
- Cross-browser testing — same test suite across Chrome, Firefox, Safari
- Real user monitoring correlation — "this bug affects 12% of production users"
### 5. The Workspace Model
The browser IS the workspace. Not a tab in a workspace. The workspace itself.
**Today:** Each browser session is bound to a project directory. The sidebar shows
the current branch. The status bar shows detected dev servers.
**Next:** Multi-project support. Switch between projects without closing the
browser. Each project gets its own set of tabs, its own agent, its own context.
Like VSCode workspaces, but for the browser.
**Future:** Team workspaces. Multiple developers share a browser workspace. See
each other's agents working. Collaborative debugging where one person navigates
and the other watches the AI fix things in real-time.
### 6. Skills as Browser Capabilities
Every gstack skill becomes a browser capability.

| Skill | Browser Capability |
|-------|-------------------|
| `/qa` | Test every page, find bugs, fix them, verify fixes |
| `/design-review` | Screenshot → analyze → fix CSS → screenshot again |
| `/investigate` | See the error in browser → trace to code → fix → verify |
| `/benchmark` | Measure page performance → detect regressions → alert |
| `/canary` | Monitor deployed site → screenshot periodically → alert on changes |
| `/ship` | Run tests → review diff → create PR → verify deployment in browser |
| `/cso` | Audit page for XSS, open redirects, clickjacking in real browser |
| `/office-hours` | Browse competitor sites → synthesize observations → design doc |
The command palette (Cmd+K) is the hub. You don't need to know the skills exist.
You type what you want, the fuzzy filter finds the right skill, and the AI runs it
with the browser as context.
### 7. The Design Loop
AI-powered design is a loop, not a handoff.
```
Generate mockup (GPT Image API)
→ Review in browser (side-by-side with live site)
→ Iterate with feedback ("make the header taller")
→ Approve direction
→ Generate production HTML/CSS
→ Preview in browser
→ Fine-tune with /design-review
→ Ship
```
The browser closes the gap between "what it looks like in Figma" and "what it
looks like in production." Because the AI can see both simultaneously.
### 8. The Security Loop
CSO review in a real browser, not just static analysis.
- Inject XSS payloads into every input field, check if they execute
- Test CSRF by replaying requests from a different origin
- Check for open redirects by navigating to crafted URLs
- Verify CSP headers are actually enforced (not just present)
- Test auth flows by manipulating cookies and tokens in real-time
- Check for clickjacking by loading the site in an iframe
Static analysis catches patterns. Browser testing catches reality.
### 9. The Monitoring Loop
Post-deploy canary monitoring, in a real browser.
```
Deploy → Browser loads production URL
→ Screenshot baseline
→ Every 5 minutes: screenshot, compare, check console
→ Alert on: visual regression, new console errors, performance drop
→ Auto-rollback if critical error detected
```
Synthetic monitoring with AI judgment. Not just "did the page return 200" but
"does the page look right and work correctly."
## Architecture
```
+-------------------------------------------------------+
|                     GStack Browser                     |
|                                                        |
|  +------------------+  +--------------------------+   |
|  | Chromium         |  | Extension Side Panel     |   |
|  | (Playwright)     |  | ├── Chat (Claude Code)   |   |
|  |                  |  | ├── Activity Feed        |   |
|  | ┌────────────┐   |  | ├── Element Refs         |   |
|  | │ Status Bar │   |  | ├── CSS Inspector        |   |
|  | └────────────┘   |  | ├── Command Palette      |   |
|  +--------┬---------+  | └── Settings             |   |
|           │            +-------------┬------------+   |
+-----------┼──────────────────────────┼────────────────+
            │                          │
            v                          v
+-----------┴---------+    +-----------┴-----------+
| Browse Server       |    | Sidebar Agent         |
| (HTTP + SSE)        |    | (claude -p wrapper)   |
| :34567              |    | Runs gstack skills    |
|                     |    | Per-tab isolation     |
| Commands:           |    |                       |
| goto, click, fill   |    | Future: BoomLooper    |
| snapshot, screenshot|    | GenServer agents      |
| css, inspect, eval  |    |                       |
+---------┬-----------+    +-----------┬-----------+
          │                            │
          v                            v
+---------┴-----------+    +-----------┴-----------+
| User's App          |    | Claude Code           |
| localhost:3000      |    | (reads/writes code)   |
| (or any URL)        |    |                       |
+---------------------+    +-----------------------+
```
## Competitive Landscape
| Browser | Approach | Differentiator | Weakness |
|---------|----------|---------------|----------|
| **Atlas** | Chromium fork + AI layer | Agentic browser, "OWL" isolated Chromium | Consumer-focused, no code integration |
| **Dia** | AI-native browser | Clean UI, built for AI interaction | No dev tools, no code editing |
| **Comet** | AI browser | Multi-agent browsing | Early, unclear dev workflow |
| **Chrome Auto Browse** | Extension | Google's own, deep Chrome integration | Extension-only, no code editing |
| **Cursor** | VSCode fork + AI | Best-in-class code editing | No browser viewport |
| **GStack Browser** | CC runtime + browser viewport | See bug in browser, fix in code, verify | Currently macOS-only, no consumer features |
GStack Browser doesn't compete with consumer browsers. It competes with the
workflow of switching between browser and editor. The goal is to make that switch
invisible.
## Design System
From DESIGN.md:
- **Primary accent:** Amber-500 (#F59E0B) — agent active, focus states, pulse
- **Background:** Zinc-950 (#09090B) through Zinc-800 (#27272A) — dark, dense
- **Typography:** JetBrains Mono (code/status), DM Sans (UI/labels)
- **Border radius:** 8px (md), 12px (lg), full (pills)
- **Motion:** Pulse animation on agent active, 200ms transitions
- **Layout:** Sidebar (right), status bar (bottom), palette (centered overlay)
## Implementation Status
| Component | Status | Notes |
|-----------|--------|-------|
| .app bundle | **SHIPPED** | 389MB, launches in ~5s |
| DMG packaging | **SHIPPED** | 189MB compressed |
| `GSTACK_CHROMIUM_PATH` | **SHIPPED** | Custom Chromium binary support |
| `BROWSE_EXTENSIONS_DIR` | **SHIPPED** | Extension path override |
| Auth via `/health` | **SHIPPED** | Replaces .auth.json file approach, auto-refreshes on server restart |
| Build script | **SHIPPED** | `scripts/build-app.sh` |
| Model routing | **SHIPPED** | Sonnet for actions, Opus for analysis (`pickSidebarModel`) |
| Debug logging | **SHIPPED** | 40+ silent catches → prefixed console logging across 4 files |
| No idle timeout (headed) | **SHIPPED** | Browser stays alive as long as window is open |
| Cookie import button | **SHIPPED** | One-click in sidebar footer, opens `/cookie-picker` |
| Sidebar arrow hint | **SHIPPED** | Points to sidebar, hides only when sidebar actually opens |
| Architecture doc | **SHIPPED** | `docs/designs/SIDEBAR_MESSAGE_FLOW.md` |
| Command palette | Planned | Phase 1b |
| Quick screenshot | Planned | Phase 1b |
| Status bar | Planned | Phase 1b |
| Dev server detection | Planned | Phase 1b |
| BoomLooper integration | Future | Phase 2 |
| Cross-platform | Future | Phase 3 |
| Chromium fork | Trigger-gated | Phase 4 |
| Native shell | Deferred | Phase 5 |
## The 12-Month Vision
```
TODAY (Phase 1) 6 MONTHS (Phase 2-3) 12 MONTHS (Phase 4-5)
───────────── ────────────────── ────────────────────
macOS .app wrapper BoomLooper multi-agent Chromium fork OR
Extension sidebar Docker containers Native SwiftUI shell
Local claude -p agent Team workspaces Cross-platform
Single project Linux/x64 browse Auto-update
Manual skill invocation Autonomous QA loops Skill marketplace
Performance monitoring Plugin API
Real-time collaboration Enterprise features
```
The 12-month ideal: you open GStack Browser, it detects your project, starts
your dev server, runs your test suite, and reports what's broken. You say "fix
it" and the AI fixes every bug, verifies each fix visually, and creates a PR.
You review the PR in the same browser, approve it, and the AI deploys it and
monitors the canary. All in one window.
That's the browser as AI workspace. Not a browser with AI bolted on. An AI
with a browser bolted on.
## Review History
This plan went through 4 reviews:
1. **CEO Review** (`/plan-ceo-review`, SELECTIVE EXPANSION) — 9 scope proposals,
3 accepted (Cmd+K, Cmd+Shift+S, status bar), 5 deferred, 1 skipped
2. **Design Review** (`/plan-design-review`) — scored 5/10 → 8/10, 9 design
decisions added, 2 approved mockups generated
3. **Eng Review** (`/plan-eng-review`) — 4 issues found, 0 critical gaps,
test plan produced
4. **Codex Review** (outside voice) — 9 findings, 3 critical gaps caught
(server bundling, auth file location, project binding). All resolved.
The Codex review caught 3 real architecture gaps that survived 3 prior reviews.
Cross-model review works.
# Design: GStack Self-Learning Infrastructure
Generated by /office-hours + /plan-ceo-review + /plan-eng-review on 2026-03-28
Updated: 2026-04-01 (post-Session Intelligence, reviewed by Codex)
Branch: garrytan/ce-features
Repo: gstack
Status: ACTIVE
Mode: Open Source / Community
## Problem Statement
GStack runs 30+ skills across sessions but learns nothing between them. A /review
session catches an N+1 query pattern, and the next /review on the same codebase
starts from scratch. A /ship run discovers the test command, and every future /ship
re-discovers it. A /investigate finds a tricky race condition, and no future session
knows about it.
Every AI coding tool has this problem. Cursor has per-user memory. Claude Code has
CLAUDE.md. Windsurf has persistent context. But none of them compound. None of them
structure what they learn. None of them share knowledge across skills.
## What We're Building
Per-project institutional knowledge that compounds across sessions and skills.
Structured, typed, confidence-scored learnings that every gstack skill can read and
write. The goal: after 20 sessions on the same codebase, gstack knows every
architectural decision, every past bug pattern, and every time it was wrong.
## North Star
/autoship (Release 5). A full engineering team in one command. Describe a feature,
approve the plan, everything else is automatic. /autoship can't work without
learnings (R1), review quality (R2), session persistence (R3), and adaptive ceremony
(R4). Releases 1-4 are the infrastructure that makes /autoship actually work.
## Audience
YC founders building with AI. The people who run gstack on real codebases 20+ times
a week and notice when it asks the same question twice.
## Differentiation
| Tool | Memory model | Scope | Structure |
|------|-------------|-------|-----------|
| Cursor | Per-user chat memory | Per-session | Unstructured |
| CLAUDE.md | Static file | Per-project | Manual |
| Windsurf | Persistent context | Per-session | Unstructured |
| **GStack** | **Per-project JSONL** | **Cross-session, cross-skill** | **Typed, scored, decaying** |
---
## State Systems
gstack has four distinct persistence layers. They share storage patterns
(JSONL in `~/.gstack/projects/$SLUG/`) but serve different purposes:
| System | File | What it stores | Written by | Read by |
|--------|------|---------------|------------|---------|
| **Learnings** | `learnings.jsonl` | Institutional knowledge (pitfalls, patterns, preferences) | All skills | All skills (preamble) |
| **Timeline** | `timeline.jsonl` | Event history (skill start/complete, branch, outcome) | Preamble (automatic) | /retro, preamble context recovery |
| **Checkpoints** | `checkpoints/*.md` | Working state snapshots (decisions, remaining work, files) | /checkpoint, /ship, /investigate | Preamble context recovery, /checkpoint resume |
| **Health** | `health-history.jsonl` | Code quality scores over time (per-tool, composite) | /health | /retro, /ship (gate), /health (trends) |
These are not overlapping. Learnings = what you know. Timeline = what happened.
Checkpoints = where you are. Health = how good the code is. Each answers a
different question.
---
## Release Roadmap
### Release 1: "GStack Learns" (v0.13-0.14) — SHIPPED
**Headline:** Every session makes the next one smarter.
What shipped:
- Learnings persistence at `~/.gstack/projects/{slug}/learnings.jsonl`
- `/learn` skill for manual review, search, prune, export
- Confidence calibration on all review findings (1-10 scores with display rules)
- Confidence decay for observed/inferred learnings (1pt/30d)
- Cross-project learnings discovery (opt-in, AskUserQuestion consent)
- "Learning applied" callouts when reviews match past learnings
- Integration into /review, /ship, /plan-*, /office-hours, /investigate, /retro
Schema:
```json
{
"ts": "2026-03-28T12:00:00Z",
"skill": "review",
"type": "pitfall",
"key": "n-plus-one-activerecord",
"insight": "Always check includes() for has_many in list endpoints",
"confidence": 8,
"source": "observed",
"branch": "feature-x",
"commit": "abc1234",
"files": ["app/models/user.rb"]
}
```
Types: `pattern` | `pitfall` | `preference` | `architecture` | `tool`
Sources: `observed` | `user-stated` | `inferred` | `cross-model`
Architecture: append-only JSONL. Duplicates resolved at read time ("latest winner"
per key+type). No write-time mutation, no race conditions.
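The read-time resolution can be sketched as a single pass over the file. This is a minimal sketch, not the real gstack reader — the function name and the trimmed-down `Learning` shape are assumptions; only the "latest line wins per key+type" rule comes from the design above.

```typescript
interface Learning {
  ts: string;
  type: string;
  key: string;
  insight: string;
  confidence: number;
}

// Resolve duplicates at read time: the last line written for a given
// key+type pair wins, so the append-only file needs no write-time mutation.
function resolveLearnings(jsonl: string): Learning[] {
  const latest = new Map<string, Learning>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line) as Learning;
    latest.set(`${entry.key}:${entry.type}`, entry);
  }
  return [...latest.values()];
}
```

Because resolution happens entirely at read time, concurrent writers only ever append, which is what makes the no-race-condition claim hold.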
### Release 2: "Review Army" (v0.14.3-0.14.4) — SHIPPED
**Headline:** 10 specialist reviewers on every PR.
What shipped:
- 7 parallel specialist subagents: always-on (testing, maintainability) +
conditional (security, performance, data-migration, API contract, design) +
red team (large diffs / critical findings)
- JSON-structured findings with confidence scores + fingerprint dedup across agents
- PR quality score (0-10) logged per review + /retro trending
- Learning-informed specialist prompts, past pitfalls injected per domain
- Multi-specialist consensus highlighting, confirmed findings get boosted
- Enhanced Delivery Integrity via PLAN_COMPLETION_AUDIT
- Checklist refactored: CRITICAL categories stay in main pass, specialist
categories extracted to focused checklists in review/specialists/
### Release 2.5: "Review Army Expansions" — NOT YET SHIPPED
**Headline:** Ships after R2 proves stable and the core loop checks out.
Pre-check: review R2 quality metrics (PR quality scores, specialist hit rates,
false positive rates, E2E test stability). If core loop has issues, fix those first.
What ships:
- E1: Adaptive specialist gating. Auto-skip specialists with a 0-finding track record.
  Store per-project hit rates via gstack-learnings-log. User can force with --security etc.
- E3: Test stub generation. Each specialist outputs a TEST_STUB alongside findings.
  Framework detected from the project (Jest/Vitest/RSpec/pytest/Go test).
  Flows into Fix-First: AUTO-FIX applies the fix and creates the test file.
- E5: Cross-review finding dedup. Read gstack-review-read for prior review entries.
  Suppress findings matching a prior user-skipped finding.
- E7: Specialist performance tracking. Log per-specialist metrics via gstack-review-log.
  Timeline integration: specialist runs appear in timeline.jsonl for /retro trending.
### Release 3: "Session Intelligence" (v0.15.0) — SHIPPED
**Headline:** Your AI sessions remember what happened.
What shipped:
- Session timeline: every skill auto-logs start/complete events to
`~/.gstack/projects/$SLUG/timeline.jsonl`. Local-only, never sent anywhere,
always on regardless of telemetry setting.
- Context recovery: after compaction or session start, preamble lists recent CEO
plans, checkpoints, and reviews. Agent reads the most recent to recover context.
- Cross-session injection: preamble prints LAST_SESSION and LATEST_CHECKPOINT for
the current branch. You see where you left off before typing anything.
- Predictive skill suggestion: if your last 3 sessions follow a pattern
(review, ship, review), gstack suggests what you probably want next.
- "Welcome back" synthesized context message on session start.
- `/checkpoint` skill: save/resume/list working state snapshots. Cross-branch
listing for Conductor workspace handoff between agents.
- `/health` skill: code quality scorekeeper wrapping project tools (tsc, biome,
knip, shellcheck, tests). Composite 0-10 score, trend tracking, improvement
suggestions when scores drop.
- Timeline binaries: `bin/gstack-timeline-log` and `bin/gstack-timeline-read`.
- Routing rules: /checkpoint and /health added to preamble skill routing.
Design doc: `docs/designs/SESSION_INTELLIGENCE.md`
### Release 4: "Adaptive Ceremony" — NOT YET SHIPPED
**Headline:** GStack respects your time without compromising your safety.
Ceremony and trust are separate concerns. Ceremony = the set of review/test/QA
steps a PR goes through. Trust = a policy engine that determines which ceremony
level applies. They interact but don't merge.
What ships:
**Ceremony levels:**
- FULL: all specialists, adversarial, Codex structured review, coverage audit, plan
completion. For large diffs, new features, migrations, auth changes.
- STANDARD: adversarial + Codex, coverage audit, plan completion. For medium diffs,
typical feature work.
- FAST: adversarial only. For small, well-tested changes on trusted projects.
**Trust policy engine:**
- Scope-aware trust. Trust is earned per change class, not globally. Clean history on
docs-only PRs does not buy trust on migration PRs.
- Change class detection: docs, tests, config, frontend, backend, migrations, auth,
infra. Each class has its own trust threshold.
- Trust signals: consecutive clean reviews (per class), /health score stability,
regression frequency, test coverage trends.
- Trust never fast-tracks: migrations, auth/permission changes, new API endpoints,
infrastructure changes. These always get FULL ceremony regardless of trust level.
- Gradual degradation, not binary reset. A single regression doesn't reset all trust.
It degrades trust for that change class by one level.
**Scope assessment:**
- TINY/SMALL/MEDIUM/LARGE classification in /review, /ship, /autoplan based on
diff size, files touched, and change class.
- Ceremony level = f(scope, trust, change class).
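The `f(scope, trust, change class)` rule above can be sketched as a small pure function. The thresholds and the function shape are illustrative assumptions; the two hard rules it encodes come from the design: never-fast-track classes always get FULL ceremony, and FAST is only earned per change class.

```typescript
type Ceremony = "FULL" | "STANDARD" | "FAST";
type Scope = "TINY" | "SMALL" | "MEDIUM" | "LARGE";

// Change classes the policy says never fast-track, regardless of trust.
const ALWAYS_FULL = new Set(["migrations", "auth", "infra", "new-endpoints"]);

// Sketch of ceremonyLevel = f(scope, trust, change class). Trust is a
// per-class integer earned from clean reviews; thresholds are hypothetical.
function ceremonyLevel(scope: Scope, trust: number, changeClass: string): Ceremony {
  if (ALWAYS_FULL.has(changeClass)) return "FULL";
  if (scope === "LARGE") return "FULL";
  if (scope === "MEDIUM") return trust >= 2 ? "STANDARD" : "FULL";
  // TINY/SMALL: FAST only with earned per-class trust.
  return trust >= 3 ? "FAST" : "STANDARD";
}
```

Note the order of checks: the never-fast-track guard runs first, so no amount of accumulated trust can route a migration or auth change past FULL ceremony.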
**TODO lifecycle:**
- /triage for interactive approval of incoming TODOs
- /resolve for batch resolution via parallel agents
### Release 5: "/autoship — One Command, Full Feature" — NOT YET SHIPPED
**Headline:** Describe a feature. Approve the plan. Everything else is automatic.
/autoship is a resumable state machine, not a linear pipeline. Review and QA can
send work back to build/fix. Compaction can interrupt any phase. The system must
recover gracefully.
```
┌──────────┐
│ START │
└────┬─────┘
┌────▼─────┐
│ /office- │
│ hours │
└────┬─────┘
┌────▼─────┐
│/autoplan │ ◄── single approval gate
└────┬─────┘
┌──────────▼──────────┐
│ BUILD │ ◄── /checkpoint auto-save
└──────────┬──────────┘
┌──────────▼──────────┐
│ /health │ ◄── quality gate
│ (score >= 7.0) │
└──────────┬──────────┘
│ fail → back to BUILD
┌──────────▼──────────┐
│ /review │
└──────────┬──────────┘
│ ASK items → back to BUILD
┌──────────▼──────────┐
│ /qa │
└──────────┬──────────┘
│ bugs found → back to BUILD
┌──────────▼──────────┐
│ /ship │
└──────────┬──────────┘
┌──────────▼──────────┐
│ /checkpoint archive │ ◄── preserve, don't destroy
└─────────────────────┘
```
What ships:
- /autoship autonomous pipeline with the state machine above.
Each phase writes to timeline.jsonl. Checkpoints auto-save before each phase.
Compaction recovery: context recovery reads checkpoint + timeline, resumes at
the last completed phase.
- Checkpoint archival on completion (not deletion). Recovery state is preserved
for debugging failed autoship runs.
- /ideate brainstorming skill (parallel divergent agents + adversarial filtering)
- Research agents in /plan-eng-review (codebase analyst, history analyst,
best practices researcher, learnings researcher)
Depends on: R1 (learnings for research agents), R2 (review army for quality),
R3 (session intelligence for persistence), R4 (adaptive ceremony for speed).
### Release 6: "Execution Studio" — NOT YET SHIPPED
**Headline:** Parallel execution infrastructure.
What ships:
- Swarm orchestration: multi-worktree parallel builds. Builds on Conductor
workspace handoff from /checkpoint (R3). An orchestrator skill dispatches
independent workstreams to parallel agents, each with its own worktree.
- Codex build delegation: auto-detect when to delegate implementation to Codex
CLI based on task type (boilerplate, test generation, mechanical refactors).
- PR feedback resolution: parallel comment resolver across review platforms.
- /onboard: auto-generated contributor guide from codebase analysis.
- /triage-prs: batch PR triage for maintainers.
### Release 7: "Design & Media" — NOT YET SHIPPED
**Headline:** Visual design integration.
What ships:
- Figma design sync (pixel-matching iteration loop)
- Feature video recording (auto-generated PR demos)
- Cross-platform portability (Copilot, Kiro, Windsurf output)
---
## Risk Register
### Proxy signals as permission to skip scrutiny
(Identified by Codex review, 2026-04-01)
/health scores, clean review history, and timeline patterns are useful signals.
They are not proof of safety. If those signals feed ceremony reduction AND /autoship,
the failure mode is rare, silent, high-severity mistakes. Mitigations:
- Certain change classes never fast-track (migrations, auth, infra, new endpoints).
- Trust degrades gradually, not binary reset.
- /autoship always runs FULL ceremony on its first run per project. Trust is earned.
### Stale context recovery
(Identified by Codex review, 2026-04-01)
Context recovery can inject wrong-branch state, obsolete plans, or invalid
checkpoints. Mitigations:
- Checkpoints include branch name in YAML frontmatter. Context recovery filters
by current branch.
- Timeline grep filters by branch before showing LAST_SESSION.
- Stale artifact detection: if checkpoint is >7 days old, note it as potentially
stale rather than presenting as current.
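The staleness check is simple enough to pin down. A sketch under stated assumptions — the helper name and label strings are hypothetical; the 7-day threshold is the one named above:

```typescript
const STALE_AFTER_DAYS = 7;

// Label a checkpoint as potentially stale rather than presenting it as
// current when it is older than the threshold.
function checkpointLabel(checkpointTs: string, now: Date): string {
  const ageDays = (now.getTime() - new Date(checkpointTs).getTime()) / 86_400_000;
  return ageDays > STALE_AFTER_DAYS
    ? "LATEST_CHECKPOINT (possibly stale, >7 days old)"
    : "LATEST_CHECKPOINT";
}
```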
### Validation metrics needed
(Identified by Codex review, 2026-04-01)
Before shipping R4 (Adaptive Ceremony), measure:
- Predictive suggestion accuracy (did the user run the suggested skill?)
- Trust policy false-skip rate (did fast-tracked PRs have post-merge issues?)
- Context recovery accuracy (did recovered context match actual state?)
- /health score correlation with actual code quality (do high scores predict
fewer production bugs?)
These metrics should be collected during R3 usage and reviewed before R4 ships.
---
## Acknowledged Inspiration
The self-learning roadmap was inspired by ideas from the [Compound Engineering](https://github.com/nicobailon/compound-engineering) project by Nico Bailon. Their exploration of learnings persistence, parallel review agents, and autonomous pipelines catalyzed the design of GStack's approach. We adapted every concept to fit GStack's template system, voice, and architecture rather than porting directly.
# Session Intelligence Layer
## The Problem
Claude Code's context window is ephemeral. Every session starts fresh. When
auto-compaction fires at ~167K tokens, it preserves a generic summary but
destroys file reads, reasoning chains, and intermediate decisions.
gstack already produces valuable artifacts that survive on disk: CEO plans,
eng reviews, design reviews, QA reports, learnings. These files contain
decisions, constraints, and context that shaped the current work. But Claude
doesn't know they exist. After compaction, the plans and reviews that
informed every decision silently vanish from context.
The ecosystem is working on this. claude-mem (9K+ stars) captures tool usage
and injects context into future sessions. Claude HUD shows real-time agent
status. Anthropic's own `claude-progress.txt` pattern uses a progress file
that agents read at the start of each session.
Nobody is solving the specific problem of making **skill-produced artifacts**
survive compaction. Because nobody else has gstack's artifact architecture.
## The Insight
gstack already writes structured artifacts to `~/.gstack/projects/$SLUG/`:
- CEO plans: `ceo-plans/`
- Design reviews: `design-reviews/`
- Eng reviews: `eng-reviews/`
- Learnings: `learnings.jsonl`
- Skill usage: `../analytics/skill-usage.jsonl`
The missing piece is not storage. It's awareness. The preamble needs to tell
the agent: "These files exist. They contain decisions you've already made.
After compaction, re-read them."
## The Architecture
```
┌─────────────────────────────────────┐
│ Claude Context Window │
│ (ephemeral, ~167K token limit) │
│ │
│ Compaction fires ──► summary only │
└──────────────┬──────────────────────┘
reads on start / after compaction
┌──────────────▼──────────────────────┐
│ ~/.gstack/projects/$SLUG/ │
│ (persistent, survives everything) │
│ │
│ ceo-plans/ ← /plan-ceo-review
│ eng-reviews/ ← /plan-eng-review
│ design-reviews/ ← /plan-design-review
│ checkpoints/ ← /checkpoint (new)
│ timeline.jsonl ← every skill (new)
│ learnings.jsonl ← /learn
└─────────────────────────────────────┘
rolled up weekly
┌──────────────▼──────────────────────┐
│ /retro │
│ Timeline: 3 /review, 2 /ship, ... │
│ Health trends: compile 8/10 (↑2) │
│ Learnings applied: 4 this week │
└─────────────────────────────────────┘
```
## The Features
### Layer 1: Context Recovery (preamble, all skills)
~10 lines of prose in the preamble. After compaction or context degradation,
the agent checks `~/.gstack/projects/$SLUG/` for recent plans, reviews, and
checkpoints. Lists the directory, reads the most recent file.
Cost: near-zero. Benefit: every skill's plans/reviews survive compaction.
### Layer 2: Session Timeline (preamble, all skills)
Every skill appends a one-line JSONL entry to `timeline.jsonl`: timestamp,
skill name, branch, key outcome. `/retro` renders it.
Makes the project's AI-assisted work history visible. "This week: 3 /review,
2 /ship, 1 /investigate across branches feature-auth and fix-billing."
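The append itself is a one-liner per skill run. A minimal sketch — field names follow the description above (timestamp, skill name, branch, key outcome), but the exact schema and helper name are assumptions, not the real `gstack-timeline-log` binary:

```typescript
import { appendFileSync } from "node:fs";

// Append one JSONL line per skill run; /retro renders the accumulated file.
function logTimelineEvent(path: string, skill: string, branch: string, outcome: string): string {
  const entry = JSON.stringify({ ts: new Date().toISOString(), skill, branch, outcome });
  appendFileSync(path, entry + "\n");
  return entry;
}
```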
### Layer 3: Cross-Session Injection (preamble, all skills)
When a new session starts on a branch with recent artifacts, the preamble
prints a one-liner: "Last session: implemented JWT auth, 3/5 tasks done.
Plan: ~/.gstack/projects/$SLUG/checkpoints/latest.md"
The agent knows where you left off before reading any files.
### Layer 4: /checkpoint (opt-in skill)
Manual snapshot of working state: what's being done, files being edited,
decisions made, what's remaining. Useful before stepping away, before
complex operations, for workspace handoffs, or coming back after days.
### Layer 5: /health (opt-in skill)
Code quality dashboard: type-check, lint, test suite, dead code scan.
Composite 0-10 score. Tracks over time. `/retro` shows trends. `/ship`
gates on configurable threshold.
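One plausible shape for the composite score is a mean of per-tool scores. This is an assumption for illustration — /health's real formula and tool names may differ; only the 0-10 composite and per-tool inputs come from the description above:

```typescript
// Composite health: average the per-tool 0-10 scores to one decimal place.
function compositeHealth(scores: Record<string, number>): number {
  const values = Object.values(scores);
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.round(mean * 10) / 10;
}
```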
## The Compounding Effect
Each feature is independently useful. Together, they create something
that compounds:
Session 1: /plan-ceo-review produces a plan. Saved to disk.
Session 2: Agent reads the plan after preamble. Doesn't re-ask decisions.
Session 3: /checkpoint saves progress. Timeline shows 2 /review, 1 /ship.
Session 4: Compaction fires mid-refactor. Agent re-reads the checkpoint.
Recovers key decisions, types, remaining work. Continues.
Session 5: /retro rolls up the week. Health trend: 6/10 → 8/10.
Timeline shows 12 skill invocations across 3 branches.
The project's AI history is no longer ephemeral. It persists, compounds,
and makes every future session smarter. That's the session intelligence
layer.
## What This Is Not
- Not a replacement for Claude's built-in compaction (that handles session
state; we handle gstack artifacts)
- Not a full memory system like claude-mem (that handles cross-session
memory via SQLite; we handle structured skill artifacts)
- Not a database or service (just markdown files on disk)
## Research Sources
- [Anthropic: Effective harnesses for long-running agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents)
- [Anthropic: Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [claude-mem](https://github.com/thedotmack/claude-mem)
- [Claude HUD](https://github.com/jarrodwatts/claude-hud)
- [CodeScene: Agentic AI coding best practices](https://codescene.com/blog/agentic-ai-coding-best-practice-patterns-for-speed-with-quality)
- [Post-compaction recovery via git-persisted state (Beads)](https://dev.to/jeremy_longshore/building-post-compaction-recovery-for-ai-agent-workflows-with-beads-207l)
# Sidebar Message Flow
How the GStack Browser sidebar actually works. Read this before touching
sidepanel.js, background.js, content.js, server.ts sidebar endpoints,
or sidebar-agent.ts.
## Components
```
┌─────────────────┐ ┌──────────────┐ ┌─────────────┐ ┌────────────────┐
│ sidepanel.js │────▶│ background.js│────▶│ server.ts │────▶│sidebar-agent.ts│
│ (Chrome panel) │ │ (svc worker) │ │ (Bun HTTP) │ │ (Bun process) │
└─────────────────┘ └──────────────┘ └─────────────┘ └────────────────┘
▲ │ │
│ polls /sidebar-chat │ polls queue file │
└───────────────────────────────────────────┘ │
◀──────────────────────┘
POST /sidebar-agent/event
```
## Startup Timeline
```
T+0ms CLI runs `$B connect`
├── Server starts on port 34567
├── Writes state to .gstack/browse.json (pid, port, token)
├── Launches headed Chromium with extension
└── Clears sidebar-agent-queue.jsonl
T+500ms sidebar-agent.ts spawned by CLI
├── Reads auth token from .gstack/browse.json
├── Creates queue file if missing
├── Sets lastLine = current line count
└── Starts polling every 200ms
T+1-3s Extension loads in Chromium
├── background.js: health poll every 1s (fast startup)
│ └── GET /health → gets auth token
├── content.js: injects on welcome page
│ └── Does NOT fire gstack-extension-ready (waits for sidebar)
└── Side panel: may auto-open via chrome.sidePanel.open()
T+2-10s Side panel connects
├── tryConnect() → asks background for port/token
├── Fallback: direct GET /health for token
├── updateConnection(url, token)
│ ├── Starts chat polling (1s interval)
│ ├── Starts tab polling (2s interval)
│ ├── Connects SSE activity stream
│ └── Sends { type: 'sidebarOpened' } to background
└── background relays to content script → hides welcome arrow
T+10s+ Ready for messages
```
## Message Flow: User Types → Claude Responds
```
1. User types "go to hn" in sidebar, hits Enter
2. sidepanel.js sendMessage()
├── Renders user bubble immediately (optimistic)
├── Renders thinking dots immediately
├── Switches to fast poll (300ms)
└── chrome.runtime.sendMessage({ type: 'sidebar-command', message, tabId })
3. background.js
├── Gets active Chrome tab URL
└── POST /sidebar-command { message, activeTabUrl }
with Authorization: Bearer ${authToken}
4. server.ts /sidebar-command handler
├── validateAuth(req)
├── syncActiveTabByUrl(extensionUrl) — syncs Playwright tab to Chrome tab
├── pickSidebarModel(message) — 'sonnet' for actions, 'opus' for analysis
├── Adds user message to chat buffer
├── Builds system prompt + args
└── Appends JSON to ~/.gstack/sidebar-agent-queue.jsonl
5. sidebar-agent.ts poll() (within 200ms)
├── Reads new line from queue file
├── Parses JSON entry
├── Checks processingTabs — skips if tab already has agent running
└── askClaude(entry) — fire and forget
6. sidebar-agent.ts askClaude()
├── spawn('claude', ['-p', prompt, '--model', model, ...])
├── Streams stdout line-by-line (stream-json format)
├── For each event: POST /sidebar-agent/event { type, tool, text, tabId }
└── On close: POST /sidebar-agent/event { type: 'agent_done' }
7. server.ts processAgentEvent()
├── Adds entry to chat buffer (in-memory + disk)
├── On agent_done: sets tab status to 'idle'
└── On agent_done: processes next queued message for that tab
8. sidepanel.js pollChat() (every 300ms during fast poll)
├── GET /sidebar-chat?after=${chatLineCount}&tabId=${tabId}
├── Renders new entries (text, tool_use, agent_done)
└── On agent idle: removes thinking dots, stops fast poll
```
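Step 5's queue read can be sketched as line-offset tracking over the JSONL file. Simplified from the flow above — the real poller also checks `processingTabs` before spawning, and the function name is hypothetical:

```typescript
import { readFileSync, existsSync } from "node:fs";

// Track how many queue lines have been consumed (lastLine) and only
// parse lines past that mark on each 200ms poll.
function pollQueue(path: string, lastLine: number): { entries: unknown[]; lastLine: number } {
  if (!existsSync(path)) return { entries: [], lastLine };
  const lines = readFileSync(path, "utf8").split("\n").filter((l) => l.trim());
  const fresh = lines.slice(lastLine).map((l) => JSON.parse(l));
  return { entries: fresh, lastLine: lines.length };
}
```

This is why the startup timeline sets `lastLine = current line count` when the agent spawns: anything already in the file at that point is treated as consumed.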
## Arrow Hint Hide Flow (4-step signal chain)
The welcome page shows a right-pointing arrow until the sidebar opens.
```
1. sidepanel.js updateConnection()
└── chrome.runtime.sendMessage({ type: 'sidebarOpened' })
2. background.js
└── chrome.tabs.sendMessage(activeTabId, { type: 'sidebarOpened' })
3. content.js onMessage handler
└── document.dispatchEvent(new CustomEvent('gstack-extension-ready'))
4. welcome.html script
└── addEventListener('gstack-extension-ready', () => arrow.classList.add('hidden'))
```
The arrow does NOT hide when the extension loads. Only when the sidebar connects.
## Auth Token Flow
```
Server starts → AUTH_TOKEN = crypto.randomUUID()
├── GET /health (no auth) → returns { token: AUTH_TOKEN }
├── background.js checkHealth() → authToken = data.token
│ └── Refreshes on EVERY health poll (fixes stale token on restart)
├── sidepanel.js tryConnect() → serverToken from background or /health
│ └── Used for chat polling: Authorization: Bearer ${serverToken}
└── sidebar-agent.ts refreshToken() → reads from .gstack/browse.json
└── Used for event relay: Authorization: Bearer ${authToken}
```
If the server restarts, all three components get fresh tokens within 10s
(background health poll interval).
## Model Routing
`pickSidebarModel(message)` in server.ts classifies messages:
| Pattern | Model | Why |
|---------|-------|-----|
| "click @e24", "go to hn", "screenshot" | sonnet | Deterministic tool calls, no thinking needed |
| "what does this page say?", "summarize" | opus | Needs comprehension |
| "find bugs", "check for broken links" | opus | Analysis task |
| "navigate to X and fill the form" | sonnet | Action-oriented, no analysis words |
Analysis words (`what`, `why`, `how`, `summarize`, `describe`, `analyze`, `read X and Y`)
always override action verbs and force opus.
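The routing rule reduces to "analysis words win, everything else defaults to the action model." A sketch of that rule — the word list and function name are assumptions, not the real `pickSidebarModel` implementation:

```typescript
// Analysis words force opus even when action verbs are also present;
// messages without them default to the cheaper action model.
const ANALYSIS_WORDS = /\b(what|why|how|summarize|describe|analyze|find|check)\b/i;

function pickModel(message: string): "sonnet" | "opus" {
  return ANALYSIS_WORDS.test(message) ? "opus" : "sonnet";
}
```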
## Known Failure Modes
| Failure | Symptom | Root Cause | Fix |
|---------|---------|------------|-----|
| Stale auth token | "Unauthorized" in input | Server restarted, background had old token | background.js refreshes token on every health poll |
| Tab ID mismatch | Message sent, no response visible | Server assigned tabId 1, sidebar polling tabId 0 | switchChatTab preserves optimistic UI during switch |
| Sidebar agent not running | Messages queue forever | Agent process failed to spawn or crashed | Check `ps aux \| grep sidebar-agent` |
| Agent stale token | Agent runs but no events appear in sidebar | sidebar-agent has old token from .gstack/browse.json | Agent re-reads token before each event POST |
| Queue file missing | spawnClaude fails | Race between server start and agent start | Both sides create file if missing |
| Optimistic UI blown away | User bubble + dots vanish | switchChatTab replaced DOM with welcome screen | Preserved DOM when lastOptimisticMsg is set |
## Per-Tab Concurrency
Each browser tab can run its own agent simultaneously:
- Server: `tabAgents: Map<number, TabAgentState>` with per-tab queue (max 5)
- sidebar-agent: `processingTabs: Set<number>` prevents duplicate spawns
- Two messages on same tab: queued sequentially, processed in order
- Two messages on different tabs: run concurrently
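The concurrency rules above can be sketched as one dispatcher combining both sides' state (`processingTabs` guard plus the per-tab queue). A simplified model with hypothetical method names, not the real server/agent split:

```typescript
// One agent per tab; extra messages on a busy tab queue (max 5);
// different tabs run concurrently.
class TabDispatcher {
  private processingTabs = new Set<number>();
  private queues = new Map<number, string[]>();

  submit(tabId: number, message: string): "spawned" | "queued" | "dropped" {
    if (this.processingTabs.has(tabId)) {
      const q = this.queues.get(tabId) ?? [];
      if (q.length >= 5) return "dropped"; // per-tab queue cap
      q.push(message);
      this.queues.set(tabId, q);
      return "queued";
    }
    this.processingTabs.add(tabId);
    return "spawned";
  }

  // On agent_done: free the tab, then process its next queued message.
  done(tabId: number): string | undefined {
    this.processingTabs.delete(tabId);
    const next = this.queues.get(tabId)?.shift();
    if (next !== undefined) this.processingTabs.add(tabId);
    return next;
  }
}
```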
## File Locations
| Component | File | Runs in |
|-----------|------|---------|
| Sidebar UI | `extension/sidepanel.js` | Chrome side panel |
| Service worker | `extension/background.js` | Chrome background |
| Content script | `extension/content.js` | Page context |
| Welcome page | `browse/src/welcome.html` | Page context |
| HTTP server | `browse/src/server.ts` | Bun (compiled binary) |
| Agent process | `browse/src/sidebar-agent.ts` | Bun (non-compiled, can spawn) |
| CLI entry | `browse/src/cli.ts` | Bun (compiled binary) |
| Queue file | `~/.gstack/sidebar-agent-queue.jsonl` | Filesystem |
| State file | `.gstack/browse.json` | Filesystem |
| Chat log | `~/.gstack/sessions/<id>/chat.jsonl` | Filesystem |
# Slate Host Integration — Research & Design Doc
**Date:** 2026-04-02
**Branch:** garrytan/slate-agent-support
**Status:** Research complete, blocked on host config refactor
**Supersedes:** None
## What is Slate
Slate is a proprietary coding agent CLI from Random Labs.
Install: `npm i -g @randomlabs/slate` or `brew install anthropic/tap/slate`.
License: Proprietary. 85MB compiled Bun binary (arm64/x64, darwin/linux/windows).
npm package: `@randomlabs/slate@1.0.25` (thin 8.8KB launcher + platform-specific optional deps).
Multi-model: dynamically selects Claude Sonnet/Opus/Haiku, plus other models.
Built for "swarm orchestration" with extended multi-hour sessions.
## Slate is an OpenCode fork
**Confirmed via binary strings analysis** of the 85MB Mach-O arm64 binary:
- Internal name: `name: "opencode"` (literal string in binary)
- All `OPENCODE_*` env vars present alongside `SLATE_*` equivalents
- Shares OpenCode's tool/skill architecture, LSP integration, terminal management
- Own branding, API endpoints (`api.randomlabs.ai`, `agent-worker-prod.randomlabs.workers.dev`), and config paths
This matters for integration: OpenCode conventions mostly apply, but Slate adds
its own paths and env vars on top.
## Skill Discovery (confirmed from binary)
Slate scans ALL four directory families for skills. Error messages in binary confirm:
```
"failed .slate directory scan for skills"
"failed .claude directory scan for skills"
"failed .agents directory scan for skills"
"failed .opencode directory scan for skills"
```
**Discovery paths (priority order from Slate docs):**
1. `.slate/skills/<name>/SKILL.md` — project-level, highest priority
2. `~/.slate/skills/<name>/SKILL.md` — global
3. `.opencode/skills/`, `.agents/skills/` — compatibility fallback
4. `.claude/skills/` — Claude Code compatibility fallback (lowest)
5. Custom paths via `slate.json`
**Glob patterns:** `**/SKILL.md` and `{skill,skills}/**/SKILL.md`
**Commands:** Same directory structure but under `commands/` subdirs:
`/.slate/commands/`, `/.claude/commands/`, `/.agents/commands/`, `/.opencode/commands/`
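The skill discovery order above can be sketched as a resolver. This is a hypothetical helper illustrating the documented priority, not Slate's actual implementation; it also omits the `slate.json` custom paths and the glob matching:

```typescript
// Hypothetical sketch of the documented search order — not Slate's real code.
function skillSearchDirs(projectRoot: string, home: string): string[] {
  return [
    `${projectRoot}/.slate/skills`,    // 1. project-level, highest priority
    `${home}/.slate/skills`,           // 2. global
    `${projectRoot}/.opencode/skills`, // 3. compatibility fallback
    `${projectRoot}/.agents/skills`,   //    compatibility fallback
    `${projectRoot}/.claude/skills`,   // 4. Claude Code fallback, lowest
  ];
}
```

Presumably the first directory containing a given `<name>/SKILL.md` wins, though the docs don't spell out per-skill shadowing behavior.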
**Skill frontmatter:** YAML with `name` and `description` fields (per Slate docs).
No documented length limits on either field.
## Project Instructions
Slate reads both `CLAUDE.md` and `AGENTS.md` for project instructions.
Both literal strings confirmed in binary. No changes needed to existing
gstack projects... CLAUDE.md works as-is.
## Configuration
**Config file:** `slate.json` / `slate.jsonc` (NOT opencode.json)
**Config options (from Slate docs):**
- `privacy` (boolean) — disables telemetry/logging
- Permissions: `allow`, `ask`, `deny` per tool (`read`, `edit`, `bash`, `grep`, `webfetch`, `websearch`, `*`)
- Model slots: `models.main`, `models.subagent`, `models.search`, `models.reasoning`
- MCP servers: local or remote with custom commands and headers
- Custom commands: `/commands` with templates
The setup script should NOT create `slate.json`. Users configure their own permissions.
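A type-level sketch of the documented options. The field names come from the list above; everything else (nesting, optionality) is an assumption, not a verified schema:

```typescript
// Inferred shape of slate.json — structure beyond the documented field
// names is an assumption, not a verified schema.
type PermissionMode = "allow" | "ask" | "deny";
type SlateTool = "read" | "edit" | "bash" | "grep" | "webfetch" | "websearch" | "*";

interface SlateConfigSketch {
  privacy?: boolean; // disables telemetry/logging
  permissions?: Partial<Record<SlateTool, PermissionMode>>;
  models?: {
    main?: string;
    subagent?: string;
    search?: string;
    reasoning?: string;
  };
}

// Example a user (not the setup script) might write:
const example: SlateConfigSketch = {
  privacy: true,
  permissions: { bash: "ask", "*": "allow" },
  models: { main: "anthropic/claude-sonnet-4.6" },
};
```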
## CLI Flags (Headless Mode)
```
--stream-json / --output-format stream-json — JSONL output, "compatible with Anthropic Claude Code SDK"
--dangerously-skip-permissions — bypass all permission checks (CI/automation)
--input-format stream-json — programmatic input
-q — non-interactive mode
-w <dir> — workspace directory
--output-format text — plain text output (default)
```
**Stream-JSON format:** Slate docs claim "compatible with Anthropic Claude Code SDK."
Not yet empirically verified. Given OpenCode heritage, likely matches Claude Code's
NDJSON event schema (type: "assistant", type: "tool_result", type: "result").
**Need to verify:** Run `slate -q "hello" --stream-json` with valid credits and
capture actual JSONL events before building the session runner parser.
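If the compatibility claim holds, a tolerant line parser would look something like this. The event shape is assumed from Claude Code's NDJSON and is exactly what the verification run above needs to confirm:

```typescript
// Assumed event envelope based on Claude Code's NDJSON — unverified for Slate.
type StreamEvent =
  | { type: "assistant"; message?: unknown }
  | { type: "tool_result"; content?: unknown }
  | { type: "result"; result?: string; session_id?: string };

function parseLine(line: string): StreamEvent | null {
  try {
    const ev = JSON.parse(line);
    // Keep only objects that carry a string `type` discriminator.
    return typeof ev?.type === "string" ? (ev as StreamEvent) : null;
  } catch {
    return null; // tolerate non-JSON noise on stdout
  }
}
```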
## Environment Variables (from binary strings)
### Slate-specific
```
SLATE_API_KEY — API key
SLATE_AGENT — agent selection
SLATE_AUTO_SHARE — auto-share setting
SLATE_CLIENT — client identifier
SLATE_CONFIG — config override
SLATE_CONFIG_CONTENT — inline config
SLATE_CONFIG_DIR — config directory
SLATE_DANGEROUSLY_SKIP_PERMISSIONS — bypass permissions
SLATE_DIR — data directory override
SLATE_DISABLE_AUTOUPDATE — disable auto-update
SLATE_DISABLE_CLAUDE_CODE — disable Claude Code integration entirely
SLATE_DISABLE_CLAUDE_CODE_PROMPT — disable Claude Code prompt loading
SLATE_DISABLE_CLAUDE_CODE_SKILLS — disable .claude/skills/ loading
SLATE_DISABLE_DEFAULT_PLUGINS — disable default plugins
SLATE_DISABLE_FILETIME_CHECK — disable file time checks
SLATE_DISABLE_LSP_DOWNLOAD — disable LSP auto-download
SLATE_DISABLE_MODELS_FETCH — disable models config fetch
SLATE_DISABLE_PROJECT_CONFIG — disable project-level config
SLATE_DISABLE_PRUNE — disable session pruning
SLATE_DISABLE_TERMINAL_TITLE — disable terminal title updates
SLATE_ENABLE_EXA — enable Exa search
SLATE_ENABLE_EXPERIMENTAL_MODELS — enable experimental models
SLATE_EXPERIMENTAL — enable experimental features
SLATE_EXPERIMENTAL_BASH_DEFAULT_TIMEOUT_MS — bash timeout override
SLATE_EXPERIMENTAL_DISABLE_COPY_ON_SELECT — disable copy on select
SLATE_EXPERIMENTAL_DISABLE_FILEWATCHER — disable file watcher
SLATE_EXPERIMENTAL_EXA — Exa search (alt flag)
SLATE_EXPERIMENTAL_FILEWATCHER — enable file watcher
SLATE_EXPERIMENTAL_ICON_DISCOVERY — icon discovery
SLATE_EXPERIMENTAL_LSP_TOOL — LSP tool
SLATE_EXPERIMENTAL_LSP_TY — LSP type checking
SLATE_EXPERIMENTAL_MARKDOWN — markdown mode
SLATE_EXPERIMENTAL_OUTPUT_TOKEN_MAX — output token limit
SLATE_EXPERIMENTAL_OXFMT — oxfmt integration
SLATE_EXPERIMENTAL_PLAN_MODE — plan mode
SLATE_FAKE_VCS — fake VCS for testing
SLATE_GIT_BASH_PATH — git bash path (Windows)
SLATE_MODELS_URL — models config URL
SLATE_PERMISSION — permission override
SLATE_SERVER_PASSWORD — server auth
SLATE_SERVER_USERNAME — server auth
SLATE_TELEMETRY_DISABLED — disable telemetry
SLATE_TEST_HOME — test home directory
SLATE_TOKEN_DIR — token storage directory
```
### OpenCode legacy (still functional)
```
OPENCODE_DISABLE_LSP_DOWNLOAD
OPENCODE_EXPERIMENTAL_DISABLE_FILEWATCHER
OPENCODE_EXPERIMENTAL_FILEWATCHER
OPENCODE_EXPERIMENTAL_ICON_DISCOVERY
OPENCODE_EXPERIMENTAL_LSP_TY
OPENCODE_EXPERIMENTAL_OXFMT
OPENCODE_FAKE_VCS
OPENCODE_GIT_BASH_PATH
OPENCODE_LIBC
OPENCODE_TERMINAL
```
### Critical env vars for gstack integration
**`SLATE_DISABLE_CLAUDE_CODE_SKILLS`** — When set, `.claude/skills/` loading is disabled.
This makes publishing to `.slate/skills/` load-bearing, not just an optimization.
Without native `.slate/` publishing, gstack skills vanish when this flag is set.
**`SLATE_TEST_HOME`** — Useful for E2E tests. Can redirect Slate's home directory
to an isolated temp directory, similar to how Codex tests use a temp HOME.
**`SLATE_DANGEROUSLY_SKIP_PERMISSIONS`** — Required for headless E2E tests.
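Taken together, a hypothetical E2E harness would build its env block from these. The variable names are from the binary strings above; the helper itself doesn't exist in gstack yet:

```typescript
// Hypothetical E2E env setup — var names come from the binary strings above.
function slateTestEnv(tempHome: string): Record<string, string> {
  return {
    SLATE_TEST_HOME: tempHome,                // isolate Slate's home directory
    SLATE_DANGEROUSLY_SKIP_PERMISSIONS: "1",  // required for headless runs
    SLATE_DISABLE_CLAUDE_CODE_SKILLS: "1",    // prove .slate/skills/ actually loads
    SLATE_TELEMETRY_DISABLED: "1",            // keep test runs out of telemetry
    SLATE_DISABLE_AUTOUPDATE: "1",            // deterministic binary under test
  };
}
```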
## Model References (from binary)
```
anthropic/claude-sonnet-4.6
anthropic/claude-opus-4
anthropic/claude-haiku-4
anthropic/slate — Slate's own model routing
openai/gpt-5.3-codex
google/nano-banana
randomlabs/fast-default-alpha
```
## API Endpoints (from binary)
```
https://api.randomlabs.ai — main API
https://api.randomlabs.ai/exaproxy — Exa search proxy
https://agent-worker-prod.randomlabs.workers.dev — production worker
https://agent-worker-dev.randomlabs.workers.dev — dev worker
https://dashboard.randomlabs.ai — dashboard
https://docs.randomlabs.ai — documentation
https://randomlabs.ai/config.json — remote config
```
Brew tap: `anthropic/tap/slate` (notable: it lives under Anthropic's tap, not Random Labs's)
## npm Package Structure
```
@randomlabs/slate (8.8 kB, thin launcher)
├── bin/slate — Node.js launcher (finds platform binary in node_modules)
├── bin/slate1 — Bun launcher (same logic, import.meta.filename)
├── postinstall.mjs — Verifies platform binary exists, symlinks if needed
└── package.json — Declares optionalDependencies for all platforms
Platform packages (85MB each):
├── @randomlabs/slate-darwin-arm64
├── @randomlabs/slate-darwin-x64
├── @randomlabs/slate-linux-arm64
├── @randomlabs/slate-linux-x64
├── @randomlabs/slate-linux-x64-musl
├── @randomlabs/slate-linux-arm64-musl
├── @randomlabs/slate-linux-x64-baseline
├── @randomlabs/slate-linux-x64-baseline-musl
├── @randomlabs/slate-darwin-x64-baseline
├── @randomlabs/slate-windows-x64
└── @randomlabs/slate-windows-x64-baseline
```
Binary override: `SLATE_BIN_PATH` env var skips all discovery, runs the specified binary directly.
## What Already Works Today
gstack skills already work in Slate via the `.claude/skills/` fallback path.
No changes needed for basic functionality. Users who install gstack for Claude Code
and also use Slate will find their skills available in both agents.
## What First-Class Support Adds
1. **Reliability** — `.slate/skills/` is Slate's highest-priority path. Immune to
   `SLATE_DISABLE_CLAUDE_CODE_SKILLS`.
2. **Optimized frontmatter** — Strip Claude-specific fields (allowed-tools, hooks, version)
that Slate doesn't use. Keep only `name` and `description`.
3. **Setup script** — Auto-detect `slate` binary, install skills to `~/.slate/skills/`.
4. **E2E tests** — Verify skills work when invoked by Slate directly.
## Blocked On: Host Config Refactor
Codex's outside voice review identified that adding Slate as a 4th host (after Claude,
Codex, Factory) is "host explosion for a path alias." The current architecture has:
- Hard-coded host names in `type Host = 'claude' | 'codex' | 'factory'`
- Per-host branches in `transformFrontmatter()` with near-duplicate logic
- Per-host config in `EXTERNAL_HOST_CONFIG` with similar patterns
- Per-host functions in the setup script (`create_codex_runtime_root`, `link_codex_skill_dirs`)
- Host names duplicated in `bin/gstack-platform-detect`, `bin/gstack-uninstall`, `bin/dev-setup`
Adding Slate means copying all of these patterns again. A refactor to make hosts
data-driven (config objects instead of if/else branches) would make Slate integration
trivial AND make future hosts (any new OpenCode fork, any new agent) zero-effort.
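Under that refactor, Slate becomes one config object instead of five new branches. A sketch only; the field names are illustrative, not the final `HostConfig` interface:

```typescript
// Illustrative sketch of a data-driven host entry — the real HostConfig
// interface may differ once the refactor lands.
interface HostConfigSketch {
  name: string;
  binary: string;                 // binary detected for auto-install
  globalSkillsDir: string;        // where setup publishes global skills
  projectSkillsDir: string;       // project-level publish target
  frontmatterAllowlist: string[]; // fields kept when transforming SKILL.md
}

const slateHost: HostConfigSketch = {
  name: "slate",
  binary: "slate",
  globalSkillsDir: "~/.slate/skills",
  projectSkillsDir: ".slate/skills",
  frontmatterAllowlist: ["name", "description"], // strip Claude-only fields
};
```

With hosts in a registry like this, `transformFrontmatter()` and the setup script iterate over config objects rather than branching per host name.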
### Missing from the plan (identified by Codex)
- `lib/worktree.ts` only copies `.agents/`, not `.slate/` — E2E tests in worktrees won't
have Slate skills
- `bin/gstack-uninstall` doesn't know about `.slate/`
- `bin/dev-setup` doesn't wire `.slate/` for contributor dev mode
- `bin/gstack-platform-detect` doesn't detect Slate
- E2E tests should set `SLATE_DISABLE_CLAUDE_CODE_SKILLS=1` to prove `.slate/` path
actually works (not just falling back to `.claude/`)
## Session Runner Design (for later)
When the JSONL format is verified, the session runner should:
- Spawn: `slate -q "<prompt>" --stream-json --dangerously-skip-permissions -w <dir>`
- Parse: Claude Code SDK-compatible NDJSON (assumed, needs verification)
- Skills: Install to `.slate/skills/` in test fixture (not `.claude/skills/`)
- Auth: Use `SLATE_API_KEY` or existing `~/.slate/` credentials
- Isolation: Use `SLATE_TEST_HOME` for home directory isolation
- Timeout: 300s default (same as Codex)
```typescript
export interface SlateResult {
output: string;
toolCalls: string[];
tokens: number;
exitCode: number;
durationMs: number;
sessionId: string | null;
rawLines: string[];
stderr: string;
}
```
## Docs References
- Slate docs: https://docs.randomlabs.ai
- Quickstart: https://docs.randomlabs.ai/en/getting-started/quickstart
- Skills: https://docs.randomlabs.ai/en/using-slate/skills
- Configuration: https://docs.randomlabs.ai/en/using-slate/configuration
- Hotkeys: https://docs.randomlabs.ai/en/using-slate/hotkey_reference
+283
@@ -12,14 +12,21 @@ Detailed guides for every gstack skill — philosophy, workflow, and examples.
| [`/review`](#review) | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. |
| [`/investigate`](#investigate) | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
| [`/design-review`](#design-review) | **Designer Who Codes** | Live-site visual audit + fix loop. 80-item audit, then fixes what it finds. Atomic commits, before/after screenshots. |
| [`/design-shotgun`](#design-shotgun) | **Design Explorer** | Generate multiple AI design variants, open a comparison board in your browser, and iterate until you approve a direction. Taste memory biases toward your preferences. |
| [`/design-html`](#design-html) | **Design Engineer** | Generates production-quality Pretext-native HTML. Works with approved mockups, CEO plans, design reviews, or from scratch. Text reflows on resize, heights adjust to content. Smart API routing per design type. Framework detection for React/Svelte/Vue. |
| [`/qa`](#qa) | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
| [`/qa-only`](#qa) | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
| [`/ship`](#ship) | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
| [`/land-and-deploy`](#land-and-deploy) | **Release Engineer** | Merge the PR, wait for CI and deploy, verify production health. One command from "approved" to "verified in production." |
| [`/canary`](#canary) | **SRE** | Post-deploy monitoring loop. Watches for console errors, performance regressions, and page failures using the browse daemon. |
| [`/benchmark`](#benchmark) | **Performance Engineer** | Baseline page load times, Core Web Vitals, and resource sizes. Compare before/after on every PR. Track trends over time. |
| [`/cso`](#cso) | **Chief Security Officer** | OWASP Top 10 + STRIDE threat modeling security audit. Scans for injection, auth, crypto, and access control issues. |
| [`/document-release`](#document-release) | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. |
| [`/retro`](#retro) | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
| [`/browse`](#browse) | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
| [`/setup-browser-cookies`](#setup-browser-cookies) | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
| [`/autoplan`](#autoplan) | **Review Pipeline** | One command, fully reviewed plan. Runs CEO → design → eng review automatically with encoded decision principles. Surfaces only taste decisions for your approval. |
| [`/learn`](#learn) | **Memory** | Manage what gstack learned across sessions. Review, search, prune, and export project-specific patterns and preferences. |
| | | |
| **Multi-AI** | | |
| [`/codex`](#codex) | **Second Opinion** | Independent review from OpenAI Codex CLI. Three modes: code review (pass/fail gate), adversarial challenge, and open consultation with session continuity. Cross-model analysis when both `/review` and `/codex` have run. |
@@ -29,6 +36,8 @@ Detailed guides for every gstack skill — philosophy, workflow, and examples.
| [`/freeze`](#safety--guardrails) | **Edit Lock** | Restrict all file edits to a single directory. Blocks Edit and Write outside the boundary. Accident prevention for debugging. |
| [`/guard`](#safety--guardrails) | **Full Safety** | Combines /careful + /freeze in one command. Maximum safety for prod work. |
| [`/unfreeze`](#safety--guardrails) | **Unlock** | Remove the /freeze boundary, allowing edits everywhere again. |
| [`/open-gstack-browser`](#open-gstack-browser) | **GStack Browser** | Launch GStack Browser with sidebar, anti-bot stealth, auto model routing, cookie import, and Claude Code integration. Watch every action live. |
| [`/setup-deploy`](#setup-deploy) | **Deploy Configurator** | One-time setup for `/land-and-deploy`. Detects your platform, production URL, and deploy commands. |
| [`/gstack-upgrade`](#gstack-upgrade) | **Self-Updater** | Upgrade gstack to the latest version. Detects global vs vendored install, syncs both, shows what changed. |
---
@@ -399,6 +408,110 @@ Nine commits, each touching one concern. The AI Slop score went from D to A beca
---
## `/design-shotgun`
This is my **design exploration mode**.
You know the feeling. You have a feature, a page, a landing screen... and you're not sure what it should look like. You could describe it to Claude and get one answer. But one answer means one perspective, and design is a taste game. You need to see options.
`/design-shotgun` generates 3 visual design variants using the GPT Image API, opens a comparison board in your browser, and waits for your feedback. You pick a direction, request changes, or ask for entirely new variants. The board supports remix, regenerate, and approval actions.
### The loop
1. You describe what you want (or point at an existing page)
2. The skill reads your `DESIGN.md` for brand constraints (if it exists)
3. It generates 3 distinct design variants as PNGs
4. A comparison board opens in your browser with all 3 side-by-side
5. You click "Approve" on the one you like, or give feedback for another round
6. The approved variant saves to `~/.gstack/projects/$SLUG/designs/` with an `approved.json`
That `approved.json` is one way to feed `/design-html`. The design pipeline chains: shotgun picks the direction, design-html renders it as working code. But `/design-html` also works with CEO plans, design reviews, or just a description.
### Taste memory
The skill remembers your preferences across sessions. If you consistently prefer minimal designs over busy ones, it biases future generations. This isn't a setting you configure... it emerges from your approvals.
### Example
```
You: /design-shotgun — hero section for a developer tools landing page
Claude: [Generates 3 variants]
Variant A: Bold typography, dark background, code snippet hero
Variant B: Split layout, product screenshot left, copy right
Variant C: Minimal, centered headline, gradient accent
[Opens comparison board at localhost:PORT]
You: [Clicks "Approve" on Variant A in the browser]
Claude: Approved Variant A. Saved to ~/.gstack/projects/myapp/designs/
Next: run /design-html to generate production HTML from this mockup.
```
---
## `/design-html`
This is my **design-to-code mode**.
Every AI code generation tool produces static CSS. Hardcoded heights. Text that overflows on resize. Breakpoints that snap instead of flowing. The output looks right at exactly one viewport size and breaks at every other.
`/design-html` fixes this. It generates HTML using [Pretext](https://github.com/chenglou/pretext) by Cheng Lou (ex-React core, Midjourney frontend). Pretext is a 15KB library that computes text layout without DOM measurement. Text reflows. Heights adjust to content. Cards size themselves. Chat bubbles shrinkwrap. All sub-millisecond, all dynamic.
It works with multiple input sources: an approved mockup from `/design-shotgun`, a CEO plan from `/plan-ceo-review`, design review context from `/plan-design-review`, a PNG you provide, or just a description of what you want. It detects what context exists and asks how you want to proceed.
### Smart API routing
Not every page needs the full Pretext engine. The skill reads the design and picks the right tools:
- **Simple layouts** (landing, marketing): `prepare()` + `layout()` for resize-aware heights
- **Card grids** (dashboard, listing): `prepare()` + `layout()` for self-sizing cards
- **Chat UIs**: `walkLineRanges()` for tight-fit bubbles with zero wasted pixels
- **Editorial layouts**: `layoutNextLine()` for text flowing around obstacles
- **Complex editorial**: Full engine with `layoutWithLines()` for manual line rendering
### The refinement loop
1. Reads the approved mockup from `approved.json`
2. Uses GPT-4o vision to extract implementation spec (colors, typography, layout)
3. Generates self-contained HTML with Pretext inlined (15KB, zero network dependency)
4. Spins up a live-reload server so you see changes instantly
5. Screenshots at 3 viewports (mobile, tablet, desktop) to verify layout
6. AskUserQuestion: what needs to change?
7. Surgical edits via the Edit tool (not full regeneration)
8. Repeat until you say "done"
### Framework detection
If your project uses React, Svelte, or Vue (detected from `package.json`), the skill offers to generate a framework component instead of vanilla HTML. Framework output uses `npm install @chenglou/pretext` instead of inline vendoring.
### Example
```
You: /design-html
Claude: Found approved mockup: variant-A.png (hero section)
Design type: simple layout → using prepare() + layout()
[Generates HTML with Pretext, opens live-reload server]
[Screenshots at 375px, 768px, 1440px — all verified]
The HTML is live at localhost:3456. Resize your browser.
Text reflows dynamically. What needs to change?
You: Make the headline bigger and add more spacing below the CTA
Claude: [Edits font-size and margin, browser auto-reloads]
Updated. Two changes applied.
You: Done
Claude: Saved to ~/.gstack/projects/myapp/designs/hero-20260330/finalized.html
```
---
## `/review`
This is my **paranoid staff engineer mode**.
@@ -525,6 +638,82 @@ A lot of branches die when the interesting work is done and only the boring rele
---
## `/land-and-deploy`
This is my **deploy pipeline mode**.
`/ship` creates the PR. `/land-and-deploy` finishes the job: merge, deploy, verify.
It merges the PR, waits for CI, waits for the deploy to finish, then runs canary checks against production. One command from "approved" to "verified in production." If the deploy breaks, it tells you what failed and whether to rollback.
First run on a new project triggers a dry-run walk-through so you can verify the pipeline before it does anything irreversible. After that, it trusts the config and runs straight through.
### Setup
Run `/setup-deploy` first. It detects your platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, or custom), discovers your production URL and health check endpoints, and writes the config to CLAUDE.md. One-time, 60 seconds.
### Example
```
You: /land-and-deploy
Claude: Merging PR #42...
CI: 3/3 checks passed
Deploy: Fly.io — deploying v2.1.0...
Health check: https://myapp.fly.dev/health → 200 OK
Canary: 5 pages checked, 0 console errors, p95 < 800ms
Production verified. v2.1.0 is live.
```
---
## `/canary`
This is my **post-deploy monitoring mode**.
After deploy, `/canary` watches the live site for trouble. It loops through your key pages using the browse daemon, checking for console errors, performance regressions, page failures, and visual anomalies. Takes periodic screenshots and compares against pre-deploy baselines.
Use it right after `/land-and-deploy`, or schedule it to run periodically after a risky deploy.
```
You: /canary https://myapp.com
Claude: Monitoring 8 pages every 2 minutes...
Cycle 1: ✓ All pages healthy. p95: 340ms. 0 console errors.
Cycle 2: ✓ All pages healthy. p95: 380ms. 0 console errors.
Cycle 3: ⚠ /dashboard — new console error: "TypeError: Cannot read
property 'map' of undefined" at dashboard.js:142
Screenshot saved.
Alert: 1 new console error after 3 monitoring cycles.
```
---
## `/benchmark`
This is my **performance engineer mode**.
`/benchmark` establishes performance baselines for your pages: load time, Core Web Vitals (LCP, CLS, INP), resource counts, and total transfer size. Run it before and after a PR to catch regressions.
It uses the browse daemon for real Chromium measurements, not synthetic estimates. Multiple runs averaged. Results persist so you can track trends across PRs.
```
You: /benchmark https://myapp.com
Claude: Benchmarking 5 pages (3 runs each)...
/ load: 1.2s LCP: 0.9s CLS: 0.01 resources: 24 (890KB)
/dashboard load: 2.1s LCP: 1.8s CLS: 0.03 resources: 31 (1.4MB)
/settings load: 0.8s LCP: 0.6s CLS: 0.00 resources: 18 (420KB)
Baseline saved. Run again after changes to compare.
```
---
## `/cso`
This is my **Chief Security Officer**.
@@ -711,6 +900,100 @@ Claude: Imported 12 cookies for github.com from Comet.
---
## `/autoplan`
This is my **review autopilot mode**.
Running `/plan-ceo-review`, then `/plan-design-review`, then `/plan-eng-review` individually means answering 15-30 intermediate questions. Each question is valuable, but sometimes you want the gauntlet to run without stopping for every decision.
`/autoplan` reads all three review skills from disk and runs them sequentially: CEO → Design → Eng. It makes decisions automatically using six encoded principles (prefer completeness, match existing patterns, choose reversible options, prefer the option the user chose for similar past decisions, defer ambiguous items, and escalate security concerns). Taste decisions (close approaches, borderline scope expansions, cross-model disagreements) get saved and presented at a final approval gate.
One command, fully reviewed plan out.
```
You: /autoplan
Claude: Running CEO review... [4 scope decisions auto-resolved]
Running design review... [3 design dimensions auto-scored]
Running eng review... [2 architecture decisions auto-resolved]
TASTE DECISIONS (need your input):
1. Scope: Codex suggested adding search — borderline expansion. Add?
2. Design: Two approaches scored within 1 point. Which feels right?
[Shows both options with context]
You: 1) Yes, add search. 2) Option A.
Claude: Plan complete. 9 decisions auto-resolved, 2 taste decisions approved.
```
---
## `/learn`
This is my **institutional memory mode**.
gstack learns from every session. Patterns, pitfalls, preferences, architectural decisions... they accumulate in `~/.gstack/projects/$SLUG/learnings.jsonl`. Each learning has a confidence score, source attribution, and the files it references.
`/learn` lets you see what gstack has absorbed, search for specific patterns, prune stale entries (when referenced files no longer exist), and export learnings for team sharing. The real magic is in other skills... they automatically search learnings before making recommendations, and display "Prior learning applied" when a past insight is relevant.
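The staleness check described above is simple in principle. A sketch, assuming a record shape inferred from this description (the real `learnings.jsonl` schema isn't documented here):

```typescript
// Hypothetical learning record — confidence and file references match the
// description above; the exact on-disk schema is an assumption.
interface Learning {
  text: string;
  confidence: number; // 1-10
  files: string[];    // files this learning references
}

// A learning is "potentially stale" when every file it references is gone.
// Learnings with no file references are never flagged by this check.
function isStale(l: Learning, exists: (path: string) => boolean): boolean {
  return l.files.length > 0 && l.files.every((f) => !exists(f));
}
```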
```
You: /learn
Claude: 23 learnings for this project (14 high confidence, 6 medium, 3 low)
Top patterns:
- [9/10] API responses always wrapped in { data, error } envelope
- [8/10] Tests use factory helpers in test/support/factories.ts
- [8/10] All DB queries go through repository pattern, never direct
3 potentially stale (referenced files deleted):
- "auth middleware uses JWT" — auth/middleware.ts was deleted
[Prune these? Y/N]
```
---
## `/open-gstack-browser`
This is my **co-presence mode**.
`/browse` runs headless by default. You don't see what the agent sees. `/open-gstack-browser` changes that. It launches GStack Browser (rebranded Chromium with anti-bot stealth) controlled by Playwright, with the sidebar extension auto-loaded. You watch every action in real time.
The sidebar chat is a Claude instance that controls the browser. It auto-routes to the right model: Sonnet for navigation and actions (click, goto, fill, screenshot), Opus for reading and analysis (summarize, find bugs, describe). One-click cookie import from the sidebar footer. The browser stays alive as long as the window is open... no idle timeout in headed mode. The menu bar says "GStack Browser" instead of "Chrome for Testing."
```
You: /open-gstack-browser
Claude: Launched GStack Browser with sidebar extension.
Anti-bot stealth active. All $B commands run in headed mode.
Type in the sidebar to direct the browser agent.
Sidebar model routing: sonnet for actions, opus for analysis.
```
---
## `/setup-deploy`
One-time deploy configuration. Run this before your first `/land-and-deploy`.
It auto-detects your deploy platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, or custom), discovers your production URL, health check endpoints, and deploy status commands. Writes everything to CLAUDE.md so all future deploys are automatic.
```
You: /setup-deploy
Claude: Detected: Fly.io (fly.toml found)
Production URL: https://myapp.fly.dev
Health check: /health → expects 200
Deploy command: fly deploy
Status command: fly status
Written to CLAUDE.md. Run /land-and-deploy when ready.
```
---
## `/codex`
This is my **second opinion mode**.