Files
gstack/TODOS.md
T
Garry Tan 920a13a17f v1.44.0.0 feat: long-lived sidebar — keepalive, restart, re-attach, scrollback replay (#1678)
* fix(browse): identity-based terminal-agent kill replaces pkill regex

Commit 0 of the v1.44 long-lived-sidebar PR — foundation for the watchdog
and removes a latent cross-session footgun.

`pkill -f terminal-agent\.ts` (cli.ts spawn site + server.ts shutdown) matched
by argv regex and would kill ANY process whose argv contained the string —
sibling gstack sessions on the same host, an editor with the file open, a
second `$B connect` run. Identity-based PID kill via a new helper module
removes that whole class of bug.

  * New `browse/src/terminal-agent-control.ts`: `readAgentRecord`,
    `writeAgentRecord`, `clearAgentRecord`, `killAgentByRecord`. Validates
    PID liveness via `isProcessAlive` before signaling (PID-reuse defense).
  * `terminal-agent.ts` writes `<stateDir>/terminal-agent-pid` (JSON
    `{pid, gen, startedAt}`) at boot; clears on SIGTERM/SIGINT.
  * New per-boot `CURRENT_GEN` (16-byte random); `/internal/*` callers can
    include `X-Browse-Gen` to defend against split-brain in the upcoming
    watchdog. Absent header is accepted (backward compat); mismatch returns
    409. New `checkInternalAuth` helper centralizes bearer + gen checks.
  * New `/internal/healthz` route — agent liveness probe used by the
    upcoming watchdog (returns pid/gen/sessions, no claude-binary lookup).
  * `cli.ts` and `server.ts` both call `killAgentByRecord` instead of pkill.
  * `ServerConfig.ownsTerminalAgent` JSDoc updated; the gated teardown now
    runs 4 side effects (was 3) — adds the new agent-record unlink.

Test changes:

  * New `browse/test/terminal-agent-pid-identity.test.ts` — static-grep
    tripwire that fails CI if any source file re-introduces `pkill ...
    terminal-agent` or `spawnSync('pkill', ...)`; round-trips
    write/read/clear; verifies killAgentByRecord no-ops on dead PIDs.
  * `browse/test/server-embedder-terminal-port.test.ts` rewritten to
    intercept `process.kill` (not `child_process.spawnSync`); writes a
    sentinel agent-record with a guaranteed-dead PID; asserts probe-only
    (signal 0) calls, no termination signals; verifies all 3 discovery
    files including the new terminal-agent-pid.

Closes TODOS.md P3 ("Identity-based terminal-agent kill").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): repair 7 pre-existing failures (env pollution + stale markers)

All 7 failures existed on main before this branch — verified via `git stash`
round-trip. Bundling them into the long-lived-sidebar PR because we kept
tripping over them while running `bun test` to verify Commit 0.

  * Global afterEach restores `process.env.PATH` (new bunfig.toml +
    test-setup.ts). browser-skill-commands.test.ts sets
    `PATH = '/test/bin:/usr/bin'` to exercise a scrubbed-env fixture and
    used the broken `process.env = origEnv` reassignment pattern that
    swaps the proxy reference; the underlying env stayed mutated and
    leaked downstream. Fixed three call sites in that file and added a
    narrow PATH-only global guardrail so a future polluter can't bring
    the bug back. Killed: pair-agent-tunnel-eval (bun ENOENT),
    security.test.ts > resolveBashBinary (Bun.which('bash') null),
    server-no-import-side-effects (bun ENOENT).
  * server-auth.test.ts: two `sliceBetween` markers referenced strings
    deleted when sidebar-agent.ts was ripped — `'Sidebar agent started'`
    → `'Terminal agent started'`, `'Sidebar endpoints'` → `'Batch endpoint'`.
    Also fixed the pair-agent BROWSE_PARENT_PID assertion (the literal
    `serverEnv.BROWSE_PARENT_PID` never existed in source; the actual
    contract is the object-literal `BROWSE_PARENT_PID: '0'` inside the
    `const serverEnv` declaration).
  * test/upgrade-migration-v1.test.ts: also overrides HOME in the spawn
    env. The migration shells out to `${HOME}/.claude/skills/gstack/bin/gstack-config`
    and a developer's real config with `explain_level` set causes the
    script to take the "user already decided" branch and skip writing
    the pending-prompt flag the test asserts on.
  * test/setup-codesign.test.ts: replaced fragile `bun run build`
    string-match (which hit a comment 700 lines later) with the actual
    invocation `bun_cmd run build` used in the setup script.

Net: full suite is now green; CI no longer trips on bash/bun-ENOENT
from PATH pollution or on test markers that drifted with the codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(terminal-agent): extract internalHandler<T> helper for /internal/* routes

Replaces the copy-pasted bearer-auth + X-Browse-Gen + req.json().then().catch()
boilerplate on /internal/grant and /internal/revoke with a single
internalHandler<T>(req, fn) wrapper. Future /internal/* routes added by the
v1.44 long-lived-sidebar work (/internal/lease-refresh, /internal/restart)
land as one-liners using the same helper. Pure refactor; no behavior change.

/internal/healthz stays on the bare checkInternalAuth gate because it's a
GET with no JSON body to parse — the helper's body-parse path would 400 it.

  * browse/src/terminal-agent.ts — new internalHandler<T>; /internal/grant
    + /internal/revoke routed through it.
  * browse/test/terminal-agent-internal-handler.test.ts — static-grep
    tripwire that fails CI if the helper goes away or either of the two
    refactored routes regresses to the old inline pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(terminal-agent): 25s WS keepalive ping/pong + client keepalive frames

PTY connections were dying silently after NAT idle timeouts (30-60s on most
home routers, even shorter on some carrier-grade NAT) and Chrome MV3 panel
suspension. Neither side noticed until the user's next keystroke produced
no output. Both sides now drive a 25s keepalive cycle.

Server side (browse/src/terminal-agent.ts):
  * New ws.open handler constructs the PtySession eagerly and starts a
    setInterval that sends `{type:"ping",ts:Date.now()}` every 25s.
    Interval handle stored on session.pingInterval so close() can clear it.
  * PtySession.pingInterval field added; cleared in ws.close before
    disposeSession runs. Prevents timer leak across reconnects.
  * Message handler accepts `{type:"ping"|"pong"|"keepalive"}` silently —
    keepalive frames are a liveness signal at the TCP layer, no state to
    update. Existing resize/tabSwitch/tabState handling unchanged.
  * GSTACK_PTY_KEEPALIVE_INTERVAL_MS env knob (default 25000) lets the
    upcoming e2e tests compress idle assertions without 30s waits.

Client side (extension/sidepanel-terminal.js):
  * Belt-and-suspenders: client also runs a 25s setInterval that sends
    `{type:"keepalive"}`. Defends against Chrome pausing our timers if
    the server-side ping ever gets dropped (rare but possible in MV3).
  * Ping reply: on `{type:"ping",ts}` from the server, immediately send
    `{type:"pong",ts}`. Lets the agent observe round-trip latency for
    free and confirms the channel is bidirectional.
  * Interval cleared in three teardown paths: ws.close handler,
    teardown(), forceRestart(). Three paths exist because the sidebar
    can exit the LIVE state through any of them; all three must clean up
    or we leak timers across reconnects.

Test (browse/test/terminal-agent-keepalive.test.ts):
  * Static-grep tripwires for the 7-point protocol contract: agent has
    a configurable interval, open() starts the ping, close() clears it,
    message handler accepts keepalive vocabulary, client sends keepalive
    + replies pong, and all three client teardown paths clear the timer.
  * Wire-level tests (actually observe a ping after 25s) belong in the
    e2e tier — adding them here would either flake on slow CI or require
    a real Bun.serve listener per test which we don't want to pay for
    in the free tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sidebar): patient tryAutoConnect — poll forever with ascending status, abort only on 401

The 15s give-up message ("Browse server not ready. Reload sidebar to retry.")
fired on every cold start where the daemon took >15s to bind — common on
Conductor workspaces, CI runners, and any system under load. The user
already opened the sidebar; telling them to give up is the wrong default.

Now polls every 2s indefinitely with ascending status messages:
  *   0 - 15s : silent (handles the happy path on a warm laptop)
  *  15 - 60s : "Waiting for browse server..."
  *  60s - 5m : "Still waiting — browse server may be slow to start."
  *      > 5m : "Browse server still not responding after 5 min. Try `$B status`."

Loop aborts on three signals only:
  * state transitions out of IDLE (connect succeeded or user navigated)
  * autoConnectAborted sticky flag set on unrecoverable error
  * the panel itself unloading (browser handles this; pagehide cleanup
    arrives with T8 of the larger plan)

401 from /pty-session sets the sticky flag with a clear "Auth invalid —
reload the sidebar or restart your gstack session." message. Without the
flag, the loop would re-call connect() every 2s and spam the same error;
with it, the user sees the message once and the loop holds. forceRestart()
clears the flag so clicking Restart is the explicit "try again" escape hatch.

Bumped poll interval 200ms → 2000ms — the legacy tight loop burned CPU
for no reason. 2s is plenty fast for a "did the daemon come up yet" check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): terminal-agent watchdog with PID liveness + crash-loop guard

terminal-agent could die independently of the server — SIGKILL from the OS
OOM killer, an uncaught exception under PTY churn, an external `pkill` from
a sibling debugging session. Pre-v1.44 the sidebar would observe the broken
connection and stay broken until the user reloaded the sidebar. Now a 60s
ticker checks the recorded agent PID and respawns via the shared
spawnTerminalAgent helper when dead.

Identity-based liveness (T4 from the eng review):
  * Uses readAgentRecord + isProcessAlive (signal 0 probe), not a name match.
  * Slow-but-alive agents intentionally fall through — respawning around a
    living agent would create split-brain (two agents writing the port
    file, tokens diverging between them, mystery upgrade 401s).
  * Pairs with the v1.44 generation counter in /internal/* loopback calls:
    if a stale agent does come back to life mid-cycle, its X-Browse-Gen
    no longer matches and the parent's calls 409 cleanly.

Crash-loop guard:
  * 3 respawn attempts inside a rolling 60s window → stop trying. A daemon
    up for a week with one crash a day shouldn't trip the guard.
  * On trip: one-line error to console (`respawn guard tripped`) and the
    watchdog goes dormant. Manual restart via the sidebar Restart button
    is the explicit signal to re-arm (added in Commit 2 of the larger PR).

Shared spawn path (refactor):
  * New spawnTerminalAgent(opts) in terminal-agent-control.ts handles:
    prior-PID cleanup → spawn → record stash. Both the CLI cold-start path
    in cli.ts and the new server.ts watchdog route through it. Removes the
    copy-paste between them; future env wiring lands in one place.

Gated on cfg.ownsTerminalAgent — embedders that pre-launch their own PTY
server (gbrowser phoenix overlay) still own the full lifecycle.

GSTACK_AGENT_WATCHDOG_TICK_MS env knob compresses the 60s tick for e2e
tests without 60s waits per assertion.

Tests:
  * browse/test/terminal-agent-watchdog.test.ts — 7 static-grep tripwires
    for the load-bearing invariants (ownsTerminalAgent gate, PID-based
    liveness, crash-loop guard with window pruning, shutdown cleanup,
    CLI cold-start uses the same helper, env knob exists).
  * Live process-kill tests belong in the e2e tier; cheaper invariants
    here catch refactor regressions in ~1ms each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cli): opt-in outer supervisor — respawn browse server on crash

Pre-v1.44 `$B connect` was fire-and-forget: spawn server detached, CLI
exits, server runs unsupervised. If the server crashed (OOM, uncaught
exception, signal kill from a runaway debugger), the user had to notice,
re-run `$B connect`, and resume work. The v1.44 terminal-agent watchdog
recovers from one layer of failure; this commit closes the outer loop.

Opt-in via `--supervise` flag or `BROWSE_SUPERVISE=1` env. Default
behavior is unchanged — every existing caller (Claude Code's Bash tool,
scripts, CI) still gets a prompt return. When the flag is set:

  * CLI stays attached, polls server PID every 30s via readState() +
    isProcessAlive (same identity primitive as the terminal-agent watchdog).
  * On unexpected exit: respawn via the same headed-mode startServer path
    used initially, then re-spawn the terminal-agent so the PTY recovers
    too (otherwise sidebar Restart is the only path back).
  * Crash-loop guard: 5 respawns in a rolling 5-min window → exit 1 with
    a clear error. Window pruning means a long-lived daemon with sporadic
    crashes does NOT trip the guard (otherwise we punish the user for the
    supervisor doing its job).
  * Backoff: 1s, 2s, 4s, 8s, 30s capped. Env-overridable via
    GSTACK_SUPERVISOR_BACKOFF for tests.
  * SIGINT / SIGTERM: clean teardown — signals the supervised server
    before exiting itself. Without this, Ctrl-C leaves an orphaned server.

Out of scope (deferred follow-up): routing the Chromium-disconnect
exit-code-1 path back through this supervisor. The terminal-agent
watchdog already covers the highest-frequency restart case; Chromium
crash recovery joins the queue as its own commit.

Test (browse/test/cli-supervisor.test.ts):
  * 6 static-grep tripwires: opt-in default, signal wiring, crash-loop
    guard with window pruning, backoff schedule env knob, tick interval
    env knob, terminal-agent re-spawn after server respawn.
  * Live respawn tests belong in the e2e tier (real spawn cycles take
    3-8s each; spamming these in the free tier would balloon CI time).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): pty-session-lease registry — stable sessionId + lease lifecycle

Foundation for Commit 2 of the long-lived-sidebar PR. Separates two
concerns that pre-v1.44 were conflated under one token:

  * sessionId — stable, non-secret identifier for a single PTY session.
    Safe to log, safe in URLs, safe in DevTools. Identifies "this terminal,"
    not "you're allowed to use this terminal."
  * lease — server-side bookkeeping that maps sessionId → expiresAt.
    Re-attach within the lease window resumes the same PTY; expiry tears
    it down.

The companion attach-token primitive (short-lived 30s bearer) reuses the
existing browse/src/pty-session-cookie.ts module unchanged — the lease
adds a name-space alongside, it doesn't replace anything.

Codex outside-voice (T1 of the eng review) flagged the original D4
"token IS sessionId" design as conflating identity with auth. The fix
is this lease registry: re-attach URLs carry the stable sessionId
(loggable), the short-lived attachToken stays out of logs.

API:
  * mintLease() → { sessionId, expiresAt }
  * validateLease(sessionId) → { ok: true, expiresAt } | { ok: false }
  * refreshLease(sessionId) — validate-first, never resurrects expired
    leases. Security-critical: the 30-min TTL is what bounds blast
    radius for a leaked attachToken whose lease should have GC'd.
  * revokeLease(sessionId) — explicit dispose path.
  * leaseCount() — observability helper.
  * __resetLeases() — test-only.

TTL env knob (GSTACK_PTY_LEASE_TTL_MS) lets v1.44 e2e tests compress
the detach window to 1s instead of waiting 30 minutes per assertion.

Server.ts wiring + /pty-session shape change + /pty-restart + /pty-dispose
+ /pty-session/reattach all land in subsequent commits in this branch.

Test (browse/test/pty-session-lease.test.ts):
  * 8 cases pinning mint uniqueness, validate-first refresh contract,
    revoke idempotency, null/undefined tolerance, and the negative case
    that refresh never resurrects a revoked lease (same code path as
    expired-and-pruned).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(terminal-agent): sessionId-aware grant + scoped restart + eager spawn

Wires the pty-session-lease primitive (3aada48b) into terminal-agent so
the Commit 2 work in server.ts (next commit) can route /pty-restart and
re-attach by session identity rather than by single-use token.

Changes:

  * validTokens: Set<string> → Map<string, string|null>. Each grant carries
    its bound sessionId (or null for legacy single-grant callers). On WS
    upgrade, the agent surfaces the bound sessionId via ws.data so open()
    can register the session in the new reverse index.
  * sessionsById: Map<sessionId, PtySession> — populated in open(),
    cleared in close(). Required so /internal/restart can find and dispose
    one specific session by id rather than enumerating all live sessions.
  * /internal/restart: scoped to one sessionId. Codex T2 of the eng review
    caught the gap — pre-spec the route would have disposed every PTY on
    the agent, breaking pair-agent and any future multi-sidebar setup.
    The body now requires `{sessionId}`; missing or unknown id returns
    `{killed: 0}` and leaves siblings alone.
  * maybeSpawnPty(ws, session): hoisted from the inline binary-frame spawn
    block so both the legacy "spawn on first keystroke" trigger AND the
    new `{type:"start"}` text-frame trigger land in the same code path.
    Idempotent on session.spawned.
  * `{type:"start"}` text frame: explicit spawn trigger. forceRestart
    (extension side, lands in Commit 2C) sends this immediately on every
    fresh WS so claude boots without requiring a keystroke. Pre-v1.44 the
    lazy-binary-spawn pattern made the restart feel stuck.
  * close(ws): drops the sessionsById entry alongside the existing
    sessions WeakMap + validTokens cleanup. Commit 3 will revisit this to
    keep the session alive for a 60s detach window before disposing.

Test (browse/test/terminal-agent-session-routing.test.ts):
  * 8 static-grep tripwires pinning the load-bearing properties: validTokens
    is a Map (not Set), sessionsById exists, /internal/restart is scoped
    (negative-assert against enumerate-all patterns), WS upgrade plumbs
    sessionId, maybeSpawnPty is the single spawn entry, close() drops the
    index. Live spawn cycles belong in the e2e tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): /pty-session 4-tuple + /pty-restart + /pty-dispose + lease-refresh

Wires the lease + attachToken model end-to-end on the server side. The
client side (extension) lands in the next commit; agent side already
shipped in 449144cd.

Routes:
  * POST /pty-session — mints sessionId (stable, loggable) + lease
    (server-side bookkeeping) + attachToken (short-lived bearer for the
    WS upgrade). Returns the 4-tuple in one round trip. Legacy
    ptySessionToken / expiresAt aliases kept for one minor release so
    extensions on the v1.43 wire shape keep working.
  * POST /pty-session/reattach — validates a sessionId's lease and mints
    a FRESH attachToken bound to the same sessionId. Used by Commit 3's
    re-attach loop; 410 Gone when the lease has expired so the client
    knows to fall back to a brand-new /pty-session.
  * POST /pty-restart — one transaction: dispose the caller's existing
    PtySession on the agent (via /internal/restart, scoped to one
    sessionId — codex T2), revoke the old lease, mint a fresh
    sessionId + lease + attachToken, return the 4-tuple. Zero race
    window between kill and mint (codex T2 + D8 of the eng review).
  * POST /pty-dispose — explicit teardown. sendBeacon-compatible: accepts
    auth token in the body so the extension's pagehide handler (Commit 2C)
    can fire it without setting custom headers (sendBeacon doesn't
    support those). Without this route, every clean browser quit leaves
    a zombie PTY alive for the 60s detach window — codex T3 caught it.
  * POST /internal/lease-refresh — loopback from terminal-agent on its
    25s keepalive cycle (lazy: only when lease is within 5 min of
    expiry). Refreshes the lease AND resets the daemon idle timer. T6
    of the eng review: PTY activity (not arbitrary SSE consumers) is
    what keeps the daemon alive when the sidebar is in use.

Helpers:
  * grantPtyToken now accepts optional sessionId and passes it through
    to the agent's /internal/grant body. The agent binds token → sessionId
    in its validTokens Map so /ws upgrades carry the sessionId for
    /internal/restart and Commit 3 re-attach lookups.
  * restartPtySession() — new loopback helper that POSTs the agent's
    scoped /internal/restart with a sessionId body. Used by /pty-restart
    and /pty-dispose.

Auth contract on /pty-dispose deliberately accepts the auth token in
EITHER the Authorization header OR the request body. The body path is
required for sendBeacon (which can't set custom headers); the header
path stays available for non-beacon callers and tests.

Test (browse/test/server-pty-lease-routes.test.ts):
  * 7 static-grep tripwires pinning the 4-tuple shape, validate-first
    re-attach with 410 fallback, one-transaction restart semantics,
    sendBeacon-compatible dispose auth, and the T6 PTY-only idle reset.
  * Live route exercises (full mint + grant + WS upgrade cycle) belong
    in the e2e tier — they require a real terminal-agent loopback and
    take seconds per assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sidebar): forceRestart via /pty-restart + pagehide /pty-dispose

Closes the Commit 2 loop: server-side lease + restart routes shipped in
25ef24e9; this commit wires the extension client to use them. End-to-end
result — clicking Restart now actually kills the server's PTY before
opening a new WS (zero race window), and closing the sidebar / quitting
the browser disposes the PTY immediately instead of letting it linger
for the upcoming 60s detach window.

sidepanel-terminal.js:
  * mintSession callers read the v1.44 4-tuple (sessionId + attachToken)
    from /pty-session, with a backward-compat fallback to ptySessionToken
    so a partially-updated extension still works against a fresh server
    for one minor release.
  * Eager spawn via {type:"start"} text frame replaces the legacy
    `TextEncoder().encode("\n")` newline hack. Pre-v1.44, the lazy-binary-
    spawn pattern made forceRestart look stuck until the user typed —
    now claude boots before the prompt renders.
  * forceRestart() rewritten as an async one-transaction handler:
      1. close current WS with code 4001 (intentional-restart)
      2. POST /pty-restart with priorSessionId so the server can scope
         the dispose, then mint fresh sessionId + lease + attachToken
         in the same response
      3. Open new WS with the returned attachToken, send {type:"start"}
         immediately for eager spawn
      4. On 401: sticky-abort the auto-connect loop (no spam)
      5. On 503 / network failure: fall back to patient autoconnect
  * currentSessionId tracked and exposed on window.gstackPtySession so
    sidepanel.js's pagehide handler can sendBeacon the dispose.

sidepanel.js:
  * New pagehide handler fires navigator.sendBeacon('/pty-dispose',
    {sessionId, authToken}) on tab close, panel close, browser quit,
    or extension reload. sendBeacon-compatible: auth token rides in
    the body since sendBeacon can't set custom headers (server route
    accepts body-auth per 25ef24e9).
  * try/catch around the entire body so a sendBeacon failure can't
    interfere with the browser's unload sequence — the 60s detach
    window from Commit 3 catches anything we miss.

There's bounded duplication between connect() and forceRestart() (~70
lines of WS attach/handler wiring). Extracting a shared helper is a
clean follow-up but out of scope for the v1.44 ship — both paths are
exercised by the same e2e test.

Test (browse/test/sidepanel-restart-dispose.test.ts):
  * 9 static-grep tripwires pinning the 4-tuple parse, eager spawn,
    close-code 4001 contract, /pty-restart wire shape, sticky-abort
    401 path, sessionId window plumbing, sendBeacon body contract,
    and the best-effort try/catch around pagehide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(terminal-agent): scrollback ring buffer + detach state machine + re-attach

The agent side of Commit 3 — the "magic" feature. A network blip (wifi
hiccup, MV3 panel suspend, brief Chromium pause) now silently reconnects
the sidebar to the SAME claude session with scrollback intact. No more
"Session ended" message + manual Restart click + losing your tool-call
output. Server-side /pty-session/reattach (25ef24e9) and the extension
re-attach loop (next commit) close the loop end-to-end.

Ring buffer (T10):
  * Per-session frames: Buffer[] capped at 1 MB (env-overridable via
    GSTACK_PTY_RING_BUFFER_BYTES). Each PTY write is one frame, so
    eviction is at frame boundaries and never cuts a UTF-8 sequence or
    ANSI CSI in half.
  * appendToRingBuffer eviction loop keeps at least one frame even at
    extreme caps — a single oversized frame can't empty the buffer.
  * Alt-screen tracking via canonical xterm CSI ?1049h / CSI ?1049l
    sequences. lastIndexOf comparison so trailing state wins when both
    appear in one render frame (quick tool-call open+close).

Replay payload (T5 — codex outside-voice):
  * buildReplayPayload prefixes DECSTR soft reset (\x1b[!p) and
    conditionally re-enters alt-screen if claude was in a tool call at
    detach. The client writes RIS (\x1bc) FIRST to clear pre-blip xterm
    content; the server's prelude resets character attributes; the ring
    buffer replays cleanly on top.
  * Order is enforced by the {type:"reattach-begin"} text frame the
    agent sends right before the binary replay — client waits for it,
    writes RIS, then treats the next binary frame as the replay payload.

Detach state machine (T9):
  * PtySession.liveWs decouples the PTY callback from the original ws
    closure. On re-attach, swapping session.liveWs is enough — the
    on-data callback writes to the new ws automatically.
  * close(ws, code, _reason): codes 4001 (intentional restart), 4404
    (no-claude), and 1000 (clean exit) trigger immediate dispose.
    Anything else (1006 abnormal, 1001 going-away from network blip /
    panel suspend) starts a 60s detach timer instead. claude keeps
    running, output keeps accumulating in the ring buffer.
  * Detach timer is unref'd so the bun process can still exit cleanly
    on natural shutdown.
  * Sessions without a sessionId (legacy single-shot grants) can't
    re-attach by definition — those fall through to immediate dispose.

Re-attach lookup (T9):
  * WS open() checks sessionsById[sessionId] FIRST. If a detached
    session is sitting there, cancel its detach timer, swap liveWs,
    rebind the WS-keyed map, restart keepalive, send reattach-begin
    + replay payload. The PTY process is unchanged.
  * /internal/restart now cancels any pending detach timer before
    disposal — otherwise the timer would later try to dispose an
    already-disposed session.

Env knobs for e2e:
  * GSTACK_PTY_RING_BUFFER_BYTES — compress to 256 for eviction tests.
  * GSTACK_PTY_DETACH_WINDOW_MS — compress to 1000 for "did the timer
    fire?" tests without waiting a minute per assertion.

Tests:
  * browse/test/terminal-agent-detach-reattach.test.ts — 10 static-grep
    tripwires for the load-bearing properties: interface shape, env
    knobs, eviction floor, alt-screen tracking, replay prelude
    composition, re-attach lookup, close-code routing, detach timer
    unref, /internal/restart timer cancellation, on-data through
    session.liveWs.
  * browse/test/terminal-agent-session-routing.test.ts test 7 widened
    to match the new close(ws, code, _reason) signature.
  * browse/test/terminal-agent-keepalive.test.ts test 3 widened
    similarly. Both stay regressions for the prior contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sidebar): silent re-attach with scrollback replay (Commit 3 client side)

Closes the v1.44 long-lived-sidebar loop end-to-end. When the WS dies for
a transient reason (wifi blip, MV3 panel suspend, brief Chromium pause),
the sidebar now silently re-attaches to the SAME claude session inside the
server's 60s detach window. Scrollback replays cleanly; the user keeps
typing without noticing anything happened.

State machine:
  * New STATE.RECONNECTING covers the in-flight re-attach window.
    setState transitions out of this state reset reattachInFlight so a
    concurrent user action (Restart click, panel navigate) short-circuits
    cleanly.
  * Backoff schedule REATTACH_BACKOFF_MS = [1000, 2000, 4000, 8000] then
    8s steady until REATTACH_WINDOW_MS (60s) elapses. Past that point
    the server has disposed our session and /pty-session/reattach
    returns 410 Gone.

startReattachLoop(prevSessionId):
  * Posts /pty-session/reattach with sessionId.
  * On 200 with a valid 4-tuple, opens the post-reattach WS directly.
  * On 410 (lease expired) — short-circuits to ENDED. No retry; the user
    clicks Restart for a fresh session.
  * On 401 — sticky-aborts the auto-connect loop. Same defense as 25ef24e9
    so we don't spam "Auth invalid" every 2s.
  * On network failure or other non-OK status — schedules the next
    backoff tick.

openReattachWebSocket(terminalPort, attachToken, sessionId):
  * Mostly a clone of connect()'s attach wiring. Reuses the live xterm
    element — RIS clears the buffer cleanly when the agent's
    {type:"reattach-begin"} arrives, so the visual flash is minimal.
  * Handshake: on `{type:"reattach-begin"}` text frame → write `\x1bc`
    (RIS) to xterm + set nextBinaryIsReplay = true. The next binary
    frame IS the server-built replay payload (DECSTR soft-reset prefix
    + optional alt-screen re-enter + ring buffer contents).
  * If THIS reattach WS also dies uncleanly, recurses into another
    re-attach loop with the same sessionId — the server's detach window
    may still be open. State guard prevents runaway recursion.

connect() + forceRestart() close handlers (existing):
  * Both updated to call startReattachLoop on transient close codes
    (anything other than 1000 / 4001 / 4404). Was just setState(ENDED).
  * Clean codes still bypass — re-attaching to a force-restart's
    pre-restart session would be the bug we're avoiding.

Test (browse/test/sidepanel-reattach.test.ts):
  * 8 static-grep tripwires for the load-bearing properties: state
    constant, backoff schedule, /pty-session/reattach wiring, 410
    short-circuit (no retry past lease window), 401 sticky-abort,
    reattach-begin → RIS handshake, all three close handlers route
    through the loop, clean-code bypass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.44.0.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(terminal-agent): runtime tests for ring buffer + replay + alt-screen tracking

Companion to browse/test/terminal-agent-detach-reattach.test.ts (static-grep
tripwires) — calls appendToRingBuffer + buildReplayPayload directly to prove
behavioral correctness without spinning up a real Bun.serve listener.

  * 11 runtime cases: append + byte counting, oversize eviction with
    one-frame floor (the eviction loop guard that prevents an oversized
    single frame from emptying the buffer), alt-screen tracking via
    canonical xterm CSI ?1049h / CSI ?1049l, trailing-state-wins for
    enter+exit pairs inside a single render frame, soft-reset prefix
    ordering, optional alt-screen re-enter, payload length math.
  * Exports appendToRingBuffer, buildReplayPayload, and the PtySession
    interface from terminal-agent.ts (purely for testability — they
    were module-private; the change is annotation-only).
  * Lease registry sanity check: mint two sessions, verify distinct
    sessionIds, both valid simultaneously. Catches future refactors
    that accidentally couple lease + ring buffer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): explain_level unset returns the documented default, not empty

Pre-existing failure on main — the test expected gstack-config to return
"" for an unset explain_level (with the comment "preamble default takes
over"), but the script at bin/gstack-config:103 explicitly returns
"default" inline for that key. Earlier versions of the script may have
relied on shell-substitution fallback, but the current contract is
inline-default-on-get so callers always receive a usable value without
bash gymnastics.

Updated the test to match the actual contract. Also added GSTACK_HOME
override alongside GSTACK_STATE_DIR in the spawn env so developer-machine
config doesn't bleed into the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 01:43:51 -07:00

108 KiB
Raw Blame History

TODOS

browse server: terminal-agent teardown follow-ups (filed v1.41 via /plan-eng-review)

DONE (v1.44.0.0): Identity-based terminal-agent kill (replace pkill regex with PID)

Resolved: Bundled into the v1.44.0.0 long-lived-sidebar PR as Commit 0. browse/src/terminal-agent-control.ts is the new home for readAgentRecord, writeAgentRecord, clearAgentRecord, and killAgentByRecord. The agent writes <stateDir>/terminal-agent-pid (JSON {pid, gen, startedAt}) at boot and clears it on SIGTERM/SIGINT. cli.ts and server.ts both route through killAgentByRecord instead of pkill -f terminal-agent\.ts. The new browse/test/terminal-agent-pid-identity.test.ts is the static-grep tripwire that fails CI if pkill ... terminal-agent or spawnSync('pkill', ...) reappears in any source file.


P3: shutdown() reads module-level config, not cfg.config (composition gap)

What: browse/src/server.ts:shutdown() reads path.dirname(config.stateFile) where config is the module-level value resolved at import time, not the cfg.config passed into buildFetchHandler. Same gap applies to cleanSingletonLocks(resolveChromiumProfile()) at server.ts:1298 — should read cfg.chromiumProfile.

Why: Embedders today happen to share state-dir resolution with the CLI (both go through resolveConfig() against the same env), so this doesn't bite. But if an embedder ever passes a divergent cfg.config (e.g., a test harness pointing at a temp dir), shutdown will operate on the wrong paths. The ownsTerminalAgent flag exposes the problem without fixing it.

Pros: Closes the embedder-composition story properly. Pairs with cfg.chromiumProfile to give a single coherent "this factory teardown respects cfg" contract.

Cons: Pre-existing — not a regression. Two call sites today (1285 for terminal files, 1298 for chromium locks). Threading cfg.config and cfg.chromiumProfile into the right closures is straightforward but broader than the v1.41 fix.

Context: Flagged by both Codex and Claude subagent in the /plan-eng-review dual voices. Documented as out-of-scope in the v1.41 plan; same shape as the chromiumProfile PR-body note to the gbrowser team.

Depends on: None.


P3: Ownership-object refactor if a 4th caller-owned teardown gate appears

What: Today ServerConfig has three caller-owned teardown gates: xvfb? (presence ⇒ don't close), proxyBridge? (same), and now ownsTerminalAgent (explicit boolean). If a 4th gate appears, collapse to cfg.callerOwns?: Set<'terminalAgent' | 'xvfb' | 'proxyBridge' | ...> or similar.

Why: Three independent flags is below the refactor threshold — each field has clear, distinct semantics and the JSDoc voice is consistent. A fourth tips the cost balance: the per-field surface gets noisy, and "what does this factory own?" becomes a question you have to ask of three or four scattered fields instead of one explicit set.

Pros: Single source of truth for "what gstack tears down". Trivial extension surface for future caller-owned resources. Easier to assert in tests ("the set should contain X, not Y").

Cons: Premature today. The polarity-inversion note in the ownsTerminalAgent JSDoc only hurts a little — it's one anomaly, not a pattern. Refactoring now to an ownership object would touch every embedder.

Context: Recommended by Claude subagent during /plan-ceo-review dual voice (autoplan). Trigger: a 4th caller-owned teardown gate in this same ServerConfig shape.

Depends on: A 4th gate to motivate the refactor.


/sync-gbrain memory stage perf follow-up

P2: Investigate gbrain import perf on large staging dirs

What: Cold-run time on a 5131-file staging dir is >10 min in gbrain import alone (after gstack's prepare phase, which is now <10s after dropping per-file gitleaks). On 501 files it took 10s. The scaling is worse than linear and the bottleneck is inside gbrain, not the gstack orchestrator.

Why: With memory-ingest's prepare phase now fast, the remaining cold-run cost is entirely on the gbrain side. Users with large corpora (5K+ files) currently pay ~15-30 min on first ingest. Likely culprits in ~/git/gbrain/src/core/import-file.ts:

  • N+1 SQL queries: engine.getPage(slug) for each file's content_hash check (line 242 + 478) — should be batched into a single query
  • Per-page auto-link reconciliation that fires even for unchanged content
  • FTS / vector index updates without batching transactions

Pros: Lives in gbrain (cleaner separation). Fix in gbrain benefits other gbrain callers too (gbrain sync, MCP put_page workflows). Likely 10-50x speedup from batched queries alone.

Cons: Cross-repo change, requires gbrain test coverage for the new batched path. Not on the gstack critical path; gstack's architecture is already correct.

Context: Verified on real corpus 2026-05-10. gstack-side prepare with --scan-secrets off runs in <10s. The full gbrain import on the same staged dir consumes 100% CPU for >10 min. Both observations from bin/gstack-memory-ingest.ts:ingestPass reaching the runGbrainImport call quickly, then the child process taking the bulk of the wall time.

Depends on: None — gstack's batch-ingest architecture (D1-D8 in docs/designs/SYNC_GBRAIN_BATCH_INGEST.md) is already shipped and correct.


P3: Cache "no changes since last import" at the prepare-batch level

What: Even with the prepare phase fast (<10s for 5135 files), walking and mtime-stat'ing every file on a true no-op run adds a few seconds and creates spurious staging dirs. Cache the most-recent-source-mtime per-source in the state file; if no source dir has a newer mtime, skip the walk + stage + import entirely.

Why: Most /sync-gbrain invocations have nothing new to ingest. The fastest path is "do nothing, fast." gbrain doctor should still report state, but the actual ingest pipeline can short-circuit when last_full_walk is recent and no source-tree mtime has moved.

Pros: Trivial implementation (~20 lines in ingestPass). Makes the incremental fast-path actually live up to "<30s" in the original plan.

Cons: Adds a cache invalidation surface. If a user edits a file but its parent dir's mtime doesn't update (rare on macOS APFS), changes get missed. Mitigation: only short-circuit when last_full_walk is recent (e.g. <1 min ago).

Context: Filed during 2026-05-10 perf testing after --scan-secrets was made opt-in. Lower priority than the gbrain-side perf issue above.


Browser-skills follow-on (Phases 2-4)

P1: Browser-skills Phase 2 — /scrape and /skillify skill templates

What: Phase 2a of the browser-skills design (docs/designs/BROWSER_SKILLS_V1.md). Two new gstack skills: /scrape <intent> (read-only) is the single entry point for pulling page data — first call prototypes via $B primitives, subsequent calls on a matching intent route to a codified browser-skill in ~200ms. /skillify codifies the most recent successful prototype into a permanent browser-skill on disk: synthesizes script.ts + script.test.ts + fixture from the agent's own context (final-attempt $B calls only), runs the test in a temp dir, asks before committing, atomic rename to ~/.gstack/browser-skills/<name>/. The mutating-flow sibling /automate is split out as its own P0 (below) — same skillify pattern, different trust profile.

Why: Phase 1 shipped the runtime — humans can hand-write deterministic browser scripts that gstack runs. Phase 2a unlocks the productivity gain: an agent that gets a flow right once via 20+ $B commands says /skillify and the script becomes a 200ms call forever after. Same skillify pattern Garry's articles describe, applied to the read-only browser activity (scraping) most amenable to deterministic compression. Mutating actions ship next as /automate because the failure mode (unintended writes) needs stronger gates.

Pros: The 100x productivity gain lives here. Closes the loop: agents prototype, codify, then reach for the codified skill in future sessions instead of re-exploring. Replaces the original "self-authoring $B commands" P1 — same user-visible goal, no in-daemon isolation problem (skill scripts run as standalone Bun processes, never imported into the daemon). Synthesis question (Codex finding #6) is resolved by re-prompting from the agent's own conversation context (option b in the design doc), bounded to final-attempt $B calls per /plan-eng-review D2.

Cons: Bun runtime distribution (Codex finding #7). Phase 1 sidesteps this because the bundled reference skill ships inside the gstack install. User-authored skills land on machines without Bun unless we ship a runtime alongside, compile to a self-contained binary, or use Node + the existing cli.ts pattern. Deferred to Phase 4 — /skillify documents the assumption that gstack is installed (which means Bun is on PATH).

Context: The Phase 1 architecture (3-tier lookup, scoped tokens, sibling SDK, frontmatter contract) is locked and exercised by the bundled hackernews-frontpage reference skill. Phase 2a plugs /scrape and /skillify into that runtime via two skill templates plus one new helper (browse/src/browser-skill-write.ts for atomic temp-dir-then-rename per /plan-eng-review D3) — no new storage primitives.

Effort: M (human: ~1 week / CC: ~1 day) Priority: P1 (this branch — garrytan/browserharness shipping as v1.19.0.0) Depends on: Phase 1 shipped (this branch).


P2: Browser-skills Phase 3 — resolver injection at session start

What: Mirror the domain-skill resolver at browse/src/server.ts:722-743. When a sidebar-agent session starts on a host with matching browser-skills, inject a list block telling the agent which skills exist for that host and how to invoke them ($B skill run <name> --arg ...). UNTRUSTED-wrapped via the existing L1-L6 security stack. Add gstack-config browser_skillify_prompts knob (default off) controlling end-of-task nudges in /qa, /design-review, etc. when activity feed shows ≥N commands on a single host AND no skill exists yet for that host+intent.

Why: Without the resolver, browser-skills only work when the user explicitly types $B skill run <name>. With the resolver, agents auto-discover existing skills for the current host and reach for them instead of re-exploring. Same compounding pattern as domain-skills.

Pros: Closes the discoverability gap. Agents that wouldn't know a skill exists now see it in their system prompt automatically. End-of-task nudges (opt-in via knob) catch the moments where skillify is most valuable.

Cons: The resolver block lives in the system prompt and competes with other resolver blocks for prompt budget. Need to gate carefully so it doesn't fire on every host with a skill — only when the skill is plausibly relevant to the current task. v1.8.0.0 domain-skills handles this by only firing for the active tab's hostname; same pattern here.

Effort: S (human: ~3 days / CC: ~4 hours) Priority: P2 Depends on: Phase 2.


P2: Browser-skills Phase 4 — eval infrastructure + fixture staleness + OS sandbox

What: Three loosely-coupled extensions: (a) LLM-judge eval ("did the agent reach for the skill instead of re-exploring?"), classified periodic per test/helpers/touchfiles.ts. (b) Fixture-staleness detection — periodic comparison of bundled fixtures against live pages, flagging mismatches before they break tests silently. (c) OS-level FS sandbox for untrusted spawns: sandbox-exec profile on macOS, namespaces / seccomp on Linux. Drops in cleanly behind the existing trusted/untrusted contract (Phase 1 just stripped env; Phase 4 adds real FS isolation).

Why: Phase 1's trust model has the daemon-side capability boundary right (scoped tokens) but the process-side env scrub is hygiene, not a sandbox (Codex finding #1). For genuinely untrusted skills (Phase 2 agent-authored), real FS isolation matters. Eval + fixture staleness keep the skill quality bar honest as flows drift.

Pros: Closes the last credible attack surface from Codex finding #1 (FS read of ~/.ssh/id_rsa etc.). Eval data tells us whether the resolver injection is actually working. Fixture staleness catches HTML drift before users.

Cons: Three different concerns, three different design passes. Tempting to bundle. Resist: each can ship independently. OS sandbox is the hardest piece (macOS sandbox-exec is Apple-private but stable; Linux requires namespaces + bind mounts).

Effort: L (human: ~2-3 weeks / CC: ~3-5 days) Priority: P2 Depends on: Phase 2 (need agent-authored skills to motivate sandbox); Phase 3 (eval needs resolver injection).


P2: Migrate /learn to SQLite

What: The current ~/.gstack/projects/<slug>/learnings.jsonl storage works (append-only, tolerant parser, idle compactor) but Codex outside-voice (T5) flagged JSONL as "the wrong primitive" for multi-writer canonical state: lost-update on rewrite, partial-line corruption on crash, no transactions. v1.8.0.0 hardened JSONL with flock + O_APPEND but the right long-term primitive is SQLite (which Bun has built in via bun:sqlite).

Why: Domain skills now live in the same learnings.jsonl (per CEO D1 unification). As volume grows, the JSONL hardening compactor + tolerant parser approach becomes the long pole. SQLite gives atomic transactions, indexes (huge for hostname lookup), and crash-safety without a custom compactor.

Pros: Atomic writes. Real schema. Fast indexed lookups by hostname/key/type. Crash-safe.

Cons: Migration touches every consumer of learnings.jsonl/learn scripts (gstack-learnings-log, gstack-learnings-search), domain-skills.ts read/write, gbrain-sync (which currently treats it as a flat file). Old learnings.jsonl files in the wild need a one-shot migration script.

Context: The JSONL hardening in v1.8.0.0 was the right call for that release scope (preserve unification, not boil-the-ocean). But the failure modes are bounded, not eliminated. SQLite is the boil-the-ocean fix.

Effort: M (human: ~1 week / CC: ~1 day) Priority: P2 Depends on: v1.8.0.0 in production for ~1 month to measure JSONL pain (compactor frequency, partial-line drops, write contention).


P2: Remove plan-mode handshake from /plan-devex-review SKILL.md.tmpl

What: /plan-devex-review has a "Plan Mode Handshake" section at the top that contradicts the preamble's "Skill Invocation During Plan Mode" contract (which says AskUserQuestion satisfies plan mode's end-of-turn requirement). The handshake forces an extra exit-plan-mode step that no other interactive review skill needs. /plan-ceo-review, /plan-eng-review, /plan-design-review all run fine in plan mode without it.

Why: Found during the v1.8.0.0 DevEx review. The inconsistency cost a turn and confused the flow. Either remove the handshake from plan-devex-review (clean fix, recommended) OR add it to every interactive skill for consistency.

Pros: Fixes a real DX bug for anyone running /plan-devex-review in plan mode. Five-minute change.

Cons: Need to think about WHY it was added in the first place — there may be context this TODO is missing.

Context: The handshake section in plan-devex-review/SKILL.md.tmpl says it's needed because plan mode's "this supersedes any other instructions" warning could otherwise bypass the skill's per-finding STOP gates. But the same warning exists for the other review skills, and they all work fine because AskUserQuestion satisfies the end-of-turn contract.

Effort: S (human: ~15 min / CC: ~5 min) Priority: P2 Depends on: Nothing.


P2: Bump gbrain install-pin in lockstep with gstack memory-feature releases (#1305 part 2)

What: bin/gstack-gbrain-install pins gbrain to commit 08b3698 (v0.18.2). When gstack ships features that depend on newer gbrain ops or schema (e.g. v1.26.0 manifests + code-def/code-refs/reindex-code), the pin doesn't move with it. Fresh /setup-gbrain installs an old gbrain that fails gbrain doctor schema_version checks (24 vs latest 32+) until the user manually upgrades.

Why: Filed in #1305 alongside the put_page CLI bug. Out of scope for the v1.26.5.0 fix wave (separate release-coordination concern: which gbrain version we install vs. how we call it). The install-pin should either (a) auto-bump whenever gstack releases features that need newer gbrain, or (b) detect a stale pin during preamble and either auto-upgrade gbrain or print a one-line FIX hint.

Pros: Closes the "fresh-install paper-cut" path. New users land on a healthy schema. Reduces support noise on /setup-gbrain flows. Makes the gstack/gbrain release contract visible.

Cons: Adds release-cadence coupling between gstack and gbrain. Needs a policy: pin = "minimum version that still works" vs "latest known good." If gbrain ships a breaking change to put shape and gstack doesn't update the pin, fresh installs break in a new way.

Context: Issue #1305 part 1 (the put_page CLI verb bug) was handled in v1.26.5.0. Part 2 (this TODO) is the install-pin staleness. Pin lives in bin/gstack-gbrain-install near the top as a constant. Easiest minimal fix: ship the pin as a tracked release artifact (e.g. write it from package.json at build time) and add a doctor-style preamble check.

Effort: S (human: ~2 days / CC: ~3 hours) Priority: P2 Depends on: Nothing.


P3: Source-id host-collision risk in deriveCodeSourceId (cross-host duplicate org/repo)

What: v1.26.5.0's deriveCodeSourceId drops the host segment to fit gbrain's 32-char source-id budget. This means github.com/acme/foo and gitlab.com/acme/foo collapse to the same gstack-code-acme-foo. ensureSourceRegisteredSync() in bin/gstack-gbrain-sync.ts:323 will silently re-register the source when local_path differs, evicting one side.

Why: Vanishingly rare in practice — same <org>/<repo> shape across both github.com and gitlab.com on the same machine almost never happens. But the failure mode is silent (one repo evicts the other in the brain), and the user has no signal anything is wrong.

Pros: Closes the silent-eviction edge. Two viable approaches: short host marker (gh- / gl- / bb-) eats 3 chars but keeps cross-host uniqueness; OR include a 3-char hash of the host alongside the org-repo.

Cons: Source IDs change shape again — anyone with existing registrations on v1.26.5.0 gets a one-time re-register. Net break-even because the current scheme also changed from v1.26.4.0.

Context: Filed in #1320 / #1322 / #1323 / #1331 (the underlying source-id validation bugs), addressed in v1.26.5.0 by dropping host segment + hash-truncating. Cross-host collision was a known accepted tradeoff in PR #1330's design ("vanishingly rare in practice"). Codex outside-voice plan review surfaced it as a long-tail concern; this TODO captures it for a future bump.

Effort: XS (human: ~4 hours / CC: ~30 min) Priority: P3 Depends on: Nothing.


P3: GBrain skillpack publishing for domain skills

What: Domain skills are agent-authored notes per hostname. Right now they're per-machine or per-agent-repo. The natural compounding extension: publish curated skill packs to GBrain (gstack-brain-sync) so others can subscribe. "Louise's LinkedIn skills" or "Garry's GitHub skills" become packs anyone can pull.

Why: v1.8.0.0 gets us per-machine compounding. Cross-user compounding is the network effect — every user contributes, every user benefits.

Pros: Massive compounding potential. Hard part is trust/moderation (existing problem GBrain-sync has thought through).

Cons: Publishing infra, signature/redaction model, moderation when packs go bad. Real plan needed.

Context: GBrain-sync infra (v1.7.0.0) already does private cross-machine sync for the user's own data. Skillpack publishing is the public/shared layer on top of that.

Effort: M (human: ~1 week / CC: ~1 day) Priority: P3 Depends on: GBrain-sync stable in production. Some user demand signal first.


P3: Replay/record demonstrated flows to domain-skills

What: Watch a human drive a site once (record DOM events + screenshots + nav), generalize to a domain-skill. "Teach by showing." Different research dream than v1.8.0.0's per-site notes.

Why: The highest-quality skill content is one a human demonstrated, not one the agent figured out from scratch. Pairs with skillpack publishing — recorded flows are the most valuable packs.

Pros: Skill quality jumps. Some sites are too complex for an agent to figure out alone (multi-step OAuth, captcha-gated forms).

Cons: Record fidelity vs. selector stability over time. DOM changes break recordings. Real research needed.

Context: Browser-use has experimented with this. Playwright has a recorder. Codeception/Cypress recorders exist. None of them do the "generalize the recording into a markdown note" step.

Effort: L (human: ~2-3 weeks / CC: ~2-3 days) Priority: P3 Depends on: Probably its own /office-hours session before committing eng time.


P3: $B commands review batch-mode UX

What: Originally an alternative for the inline-on-first-use approval gate (DevEx D6 alternative C). Instead of approving each agent-authored command at first invocation, batch them: agent scaffolds many, human reviews $B commands review at a convenient time, approves/rejects in one pass.

Why: If self-authoring commands ever ships (the P1 above), the inline approval at first-use can interrupt the agent mid-task. Batch review is friendlier for the human.

Pros: Reduces interrupt frequency. Lets humans review with full context.

Cons: Defers approval — agent can't use the new command until the human comes back. If the agent needs the command immediately, this is worse than inline.

Context: Tied to the P1 above. Won't ship before that does.

Effort: S (human: ~half day / CC: ~30 min) Priority: P3 Depends on: P1 self-authoring $B commands.


P3: Heuristic command-gap watcher

What: Sidebar-agent watches the activity feed; when an agent repeats a similar action 3+ times (e.g., calls $B js with structurally similar arguments), suggest scaffolding a command. From DevEx D4 alternative C.

Why: Closes the discoverability loop on self-authoring commands. Agent is most likely to write a command when it just hit the same friction multiple times.

Pros: Surgical. Fires only when a command would have demonstrably helped. Uses real telemetry, not heuristics.

Cons: False positives (legitimate repeated actions) feel intrusive. Hard to design without telemetry first.

Context: Telemetry from v1.8.0.0 (cdp_method_called, cdp_method_denied counters) gives us the data to design this well. Don't design until we have ~1 month of production data.

Effort: M (human: ~1 week / CC: ~1 day) Priority: P3 Depends on: v1.8.0.0 telemetry in production. P1 self-authoring commands.


Sidebar Terminal (cc-pty-import follow-ups)

v1.1: PTY session survives sidebar reload

What: Today the Terminal tab's PTY dies with the WebSocket — sidebar reload, side-panel close, even a quick navigate-away in another tab close the session. v1.1 should key the PTY on a tab/session id so a reload reattaches to the existing claude process and you keep /resume history.

Why: Mid-task resilience. When you've been pair-programming with claude for 20 minutes and an accidental Cmd-R blows it away, the cost is real.

Pros: Better UX, fewer interrupted sessions. Cons: Session-tracking state, ghost-process risk, lifecycle bugs (when DOES the PTY actually go away?). v1 chose the simple "PTY dies with WS" model deliberately.

Context: /plan-eng-review Issue 1C decision (cc-pty-import branch, 2026-04-25). v1 ships with phoenix's lifecycle. Depends on: cc-pty-import landed.

Priority: P2 (nice-to-have). Effort: M. Likely needs a per-tab session map keyed by chrome.tabs.id plus a TTL so abandoned PTYs eventually exit.


v1.1+: Audit /health token distribution

What: Codex's outside-voice review on cc-pty-import flagged that /health already surfaces AUTH_TOKEN to any localhost caller in headed mode (server.ts:1657). That's a pre-existing soft leak — anything running on localhost gets the root token by hitting /health.

Why: cc-pty-import sidesteps it by NOT putting the PTY token there (uses an HttpOnly cookie path instead). But the underlying leak is still shippable surface. A second extension or a localhost web app could currently scrape AUTH_TOKEN and hit any browse-server endpoint.

Pros: Closes a real privilege-escalation path on multi-extension machines. Cons: Either we tighten the gate (Origin must be OUR extension id, not just any chrome-extension://) or we move bootstrap discovery off /health entirely. Either has migration cost for tests and the existing extension.

Context: codex finding #2 on cc-pty-import plan-eng review. Not in scope of that PR; deliberately deferred to keep PTY-import small.

Priority: P2. Effort: M.


Testing

P2: Per-finding AskUserQuestion count assertion for /plan-ceo-review

What: PTY E2E test that drives /plan-ceo-review through Step 0 with a stable fixture diff containing N known findings, asserts that exactly N distinct AskUserQuestions fire (one per finding) before plan_ready.

Why: The skill template repeats "One issue = one AskUserQuestion call. Never combine multiple issues into one question." at every review checkpoint. No test enforces it. The current skill-e2e-plan-ceo-plan-mode.test.ts smoke (post-v1.21.1.0) only catches "agent skipped Step 0 entirely." Batching findings into one question slips through silently.

Pros: Locks in the strongest contract the skill mandates. Catches a real failure mode (the original attachment showed 2 findings batched as 0 questions). Cons: Needs a stable fixture diff to keep finding count deterministic (~1 day human / ~30 min CC). Opus may reasonably consolidate two related findings, so the assertion needs a forgiving lower bound (e.g., >= ceil(N * 0.6)) rather than strict equality.

Context: The PTY harness (runPlanSkillObservation) returns at first terminal outcome — for V2 we need a streaming variant that counts AskUserQuestions across the whole session up to plan_ready. Probably a new helper alongside runPlanSkillObservation.

Depends on: Stable fixture diff (test/fixtures/plans/multi-finding.diff or similar) with a small known set of issues that triggers all 4 review sections.

Priority: P2. Effort: S (CC: ~30 min once fixture exists). Captured from v1.21.1.0 plan-eng-review D2.


P3: Honor env vars in gstack-config (so QUESTION_TUNING/EXPLAIN_LEVEL actually isolate tests)

What: gstack-config get <key> reads ~/.gstack/config.yaml. runPlanSkillObservation plumbs env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' } through to the spawned claude process — but the skill preamble bash uses gstack-config get question_tuning, which never looks at env. The env passthrough is theater on current code.

Why: Without env honoring, the v1.21.1.0 plan-ceo-review smoke is still flaky on machines with question_tuning: true set in YAML. AUTO_DECIDE preferences would skip the rendered AskUserQuestion list, masking the regression we want to catch.

Pros: Makes the gate test hermetic across machines. The env wiring is already in place — only gstack-config needs to read env first, fall back to YAML. Cons: Touches the gstack-config binary across all 3 platforms (linux/darwin/windows). Cross-binary refactor.

Context: Captured from v1.21.1.0 adversarial review. Documented honestly in the test docstring as a known limitation.

Priority: P3. Effort: S. Single-file edit to bin/gstack-config (~10 LOC for env-first lookup).


P3: Path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

What: runPlanSkillObservation's silent-write detector uses substring matching on a few sanctioned paths (.gstack/, CHANGELOG.md, TODOS.md, etc). A write to node_modules/some-pkg/CHANGELOG.md or src/foo/.gstack/leak.ts is currently sanctioned because the substring matches anywhere in the path.

Why: Defensive — no current bug exploits this, but a malicious skill or fixture could write to a path that happens to contain .gstack/ or CHANGELOG.md and slip past silent-write detection.

Pros: Hardens the harness against future skill misbehavior. Aligns substring rules with their intent. Cons: Need to anchor against absolute prefixes (os.homedir() + '/.gstack/', worktree root) which makes the test less portable across machines.

Context: Captured from v1.21.1.0 adversarial review (HIGH/FIXABLE finding, pre-existing). Refactored into a SANCTIONED_WRITE_SUBSTRINGS constant in v1.21.1.0 but the substring-includes logic is unchanged from before.

Priority: P3. Effort: S.


P1: Structural STOP-Ask forcing function across all skills

What: Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired <N), typed question templates (skill hands the model pre-built AskUserQuestion payloads rather than prose instructions), or a canUseTool-based post-run audit that compares declared-gates-fired vs expected.

Why: The authoritative "Skill Invocation During Plan Mode" rule (hoisted to preamble position 1) tells the model AskUserQuestion satisfies plan mode's end-of-turn requirement. That fixes plan-mode entry, but NOT the broader class of failures: the model silently substitutes batch-synthesis for STOP-Ask loops whenever the skill's interactive contract collides with any other rule surface (auto mode, tool-count anxiety, cognitive load). Without structural enforcement, every skill with STOP-per-issue contracts remains vulnerable.

Pros: Catches a class-of-bug, not an instance. Applies to every skill that declares STOP gates. Builds on canUseTool primitive in test/helpers/agent-sdk-runner.ts.

Cons: Real design work. How does a skill declare expected question count — static value in frontmatter, or dynamic based on number of review sections that surface findings? Is the audit inline (blocking, same-turn) or post-hoc (after skill completion)? Calibration of expected-vs-actual thresholds depends on real V0 question-log data across skills.

Context: Relevant files — scripts/question-registry.ts (typed question catalog), scripts/resolvers/question-tuning.ts (preference classification), bin/gstack-question-log (event log), bin/gstack-question-preference (read/write preferences), test/helpers/agent-sdk-runner.ts (canUseTool harness). Existing question-log already captures fire events; the gap is declaring expected counts and auditing against them.

Effort: L (human: ~1-2 weeks / CC+gstack: ~2-3 hours for design doc + first-pass implementation). Priority: P1 if interactive-skill volume is growing; P2 otherwise. Depends on / blocked by: design doc — likely its own docs/designs/STOP_ASK_ENFORCEMENT_V0.md.

Context skills

/context-save --lane + /context-restore --lane for parallel workstreams

What: Let users save and restore per-workstream (lane) context independently. On save: /context-save --lane A "backend refactor" writes a lane-tagged file. Or /context-save lanes reads the "Parallelization Strategy" section of the most recent plan file and auto-generates one saved context per lane. On restore: /context-restore --lane A loads just that lane's context. Useful when a plan has 3 independent workstreams and the user wants to pick one up in each of 3 Conductor windows.

Why: Plans produced by /plan-eng-review already emit a lane table (Lane A: touches models/ and controllers/ sequentially; Lane B: touches api/ independently; etc.). Right now there's no way to transfer that structure into resumable saved state. Users manually re-describe the scope in each window. Lane-tagged save/restore would be the bridge between "here's the plan" and "three people (or three AIs) are now working in parallel on it."

Pros: Turns /plan-eng-review's parallelization output into actionable resume state. Reduces context-loss across Conductor workspace handoffs for multi-workstream plans.

Cons: Net-new functionality (not a port from the old /checkpoint skill). The "spawn new Conductor windows" part needs research into whether Conductor has a spawn CLI. Also requires lane-tagging discipline in the save step (manual or extracted).

Context: Source of the lane data model is plan-eng-review/SKILL.md.tmpl:240-249 (the "Parallelization Strategy" output with Lane A/B/C dependency tables and conflict flags). Deferred from the v0.18.5.0 rename PR so the rename could land as a tight, low-risk fix. Saved files currently live at ~/.gstack/projects/$SLUG/checkpoints/YYYYMMDD-HHMMSS-<title>.md with YAML frontmatter (branch, timestamp, etc.). The lane feature would add a lane: field to frontmatter and a --lane filter to both skills.

Effort: M (human: ~1-2 days / CC: ~45-60 min) Priority: P3 (nice-to-have, not blocking anyone yet) Depends on: /context-save + /context-restore rename stable in production (v1.0.1.0+). Research: does Conductor expose a spawn-workspace CLI?

P0: Browser-skills Phase 2 follow-up — /automate skill

What: The mutating-flow sibling of /scrape (Phase 2b). /automate <intent> codifies form fills, click sequences, and multi-step interactions into permanent browser-skills. Reuses Phase 2a's skillify machinery (/skillify is shared) and the D3 atomic-write helper. Adds: per-mutating-step UNTRUSTED-wrapped summary + AskUserQuestion confirmation gate when running non-codified (codified skills run unattended after the initial human approval). Defaults to trusted: false per Phase 1 — env-scrubbed spawn, scoped-token capability, no admin scope.

Why: Read-only scraping is the safer wedge to validate the skillify pattern (failure mode: wrong data = benign). Mutating actions are the other half of the 100x productivity gain — agents that codify "log into example.com → click Settings → toggle X" save real time on every future session. Splitting from Phase 2a means we ship the productivity loop first, validate the architecture, then add the higher-trust surface with confidence.

Pros: Unlocks deterministic automation authoring without self-authoring safety concerns — Phase 1's scoped-token model applies equally to mutating skills. The codified script enumerates exactly which $B click/$B fill/$B type calls run; nothing else is possible at runtime. Reuses 100% of /skillify, the D3 helper, and the storage tier. Per-step confirmation gate surfaces the actions to the user before they run for the first time.

Cons: Mutating intents have higher blast radius (the wrong selector clicks "Delete Account" instead of "Delete Comment"). Phase 4 OS-level FS sandbox is a stronger answer; until then, the user trust burden is real. Confirmation-gate UX needs care — too many prompts and users hit "yes" reflexively. Mitigation: only gate first-run; after /skillify codifies, the skill runs unattended.

Context: Original Phase 2 plan in docs/designs/BROWSER_SKILLS_V1.md bundled /scrape + /automate. Split during the v1.19.0.0 plan review (/plan-eng-review on garrytan/browserharness) — the user's source doc framed both as primary, but in practice scraping is where users start because the failure mode is benign. Ship /scrape + /skillify first (this branch), validate the skillify pattern works, then /automate lands on top of the same machinery.

Effort: M (human: ~3-5 days / CC: ~1 day) Priority: P0 (next branch after v1.19.0.0) Depends on: Phase 2a (/scrape + /skillify) shipped at v1.19.0.0. The D3 atomic-write helper (browse/src/browser-skill-write.ts) and the bundled SDK pattern are reused as-is.


P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1)

What: Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in docs/designs/PACING_UPDATES_V0.md. Requires: session-state model, phase field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), bin/gstack-flip-decision command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values.

Why: Louise de Sadeleer's "yes yes yes" during /autoplan was pacing + agency, not (only) jargon density. V1 addresses jargon (ELI10 writing). V1.1 addresses the interruption-volume half. Without this, V1 only gets halfway to the HOLY SHIT outcome.

Pros: End-to-end answer to Louise's feedback. Ships real calibration data from V1 usage. Completes the V0 → V2 pacing arc started in PLAN_TUNING_V0.

Cons: Substantial scope (10 items in docs/designs/PACING_UPDATES_V0.md). Needs its own CEO + Codex + DX + Eng review cycle. Calibration depends on real V0 question-log distribution.

Context: PLAN_TUNING_V1 attempted to bundle pacing. Three eng-review passes + two Codex passes surfaced 10 structural gaps unfixable via plan-text editing. Extracted to V1.1 as a dedicated plan.

Depends on / blocked by: V1 shipping (provides Louise's baseline transcript for calibration).

Plan Tune (v2 deferrals from v0.19.0.0 rollback)

All six items are gated on v1 dogfood results and the acceptance criteria in docs/designs/PLAN_TUNING_V0.md. They were explicitly deferred after Codex's outside-voice review drove a scope rollback from the CEO EXPANSION plan. v1 ships the observational substrate only; v2 adds behavior adaptation.

E1 — Substrate wiring (5 skills consume profile)

What: Add {{PROFILE_ADAPTATION:<skill>}} placeholder to ship, review, office-hours, plan-ceo-review, plan-eng-review SKILL.md.tmpl files. Implement scripts/resolvers/profile-consumer.ts with a per-skill adaptation registry (scripts/profile-adaptations/{skill}.ts). Each consumer reads ~/.gstack/developer-profile.json on preamble and adapts skill-specific defaults (verbosity, mode selection, severity thresholds, pushback intensity).

Why: v1 observational profile writes a file nobody reads. The substrate claim only becomes real when skills actually consume it. Without this, /plan-tune is a fancy config page.

Pros: gstack feels personal. Every skill adapts to the user's steering style instead of defaulting to middle-of-the-road.

Cons: Risk of psychographic drift if profile is noisy. Requires calibrated profile (v1 acceptance criteria: 90+ days stable across 3+ skills).

Context: See docs/designs/PLAN_TUNING_V0.md §Deferred to v2. v1 ships the signal map + inferred computation; it's displayed in /plan-tune but no skill reads it yet.

Effort: L (human: ~1 week / CC: ~4h) Priority: P0 Depends on: 2+ weeks of v1 dogfood, profile diversity check passing.

E3 — /plan-tune narrative + /plan-tune vibe

What: Event-anchored narrative ("You accepted 7 scope expansions, overrode test_failure_triage 4 times, called every PR 'boil the lake'") + one-word vibe archetype (Cathedral Builder, Ship-It Pragmatist, Deep Craft, etc). scripts/archetypes.ts is ALREADY SHIPPED in v1 (8 archetypes + Polymath fallback). v2 work is the narrative generator + /plan-tune skill wiring.

Why: Makes profile tangible and shareable. Screenshot-able.

Pros: Killer delight feature. Social surface for gstack. Concrete, specific output anchored in real events (not generic AI slop).

Cons: Requires stable inferred profile — without calibration it produces generic paragraphs. Gen-tests need to validate no-slop.

Context: Archetypes already defined. Just need the /plan-tune narrative subcommand + slop-check test.

Effort: S+ (human: ~1 day / CC: ~1h) Priority: P0 Depends on: Calibrated profile (>= 20 events, 3+ skills, 7+ days span).

E4 — Blind-spot coach

What: Preamble injection that surfaces the OPPOSITE of the user's profile once per session per tier >= 2 skill. Boil-the-ocean user gets challenged on scope ("what's the 80% version?"); small-scope user gets challenged on ambition. scripts/resolvers/blind-spot-coach.ts. Marker file for session dedup. Opt-out via gstack-config set blind_spot_coach false.

Why: Makes gstack a coach (challenges you) instead of a mirror (reflects you). The killer differentiation vs. a settings menu.

Pros: The feature that makes gstack feel like Garry. Surfaces assumptions the user hasn't challenged.

Cons: Logically conflicts with E1 (which adapts TO profile) and E6 (which flags mismatch). Requires interaction-budget design: global session budget + escalation rules + explicit exclusion from mismatch detection. Risk of feeling like a nag if fires wrong.

Context: v2 must redesign to resolve the E1/E4/E6 composition issue Codex caught. Dogfood required to calibrate frequency.

Effort: M (human: ~3 days / CC: ~2h design + ~1h impl) Priority: P0 Depends on: E1 shipped + interaction-budget design spec.

E5 — LANDED celebration HTML page

What: When a PR authored by the user is newly merged to the base branch, open an animated HTML celebration page in the browser. Confetti + typewriter headline + stats counter. Shows: what we built (PR stats + CHANGELOG entry), road traveled (scope decisions from CEO plan), road not traveled (deferred items), where we're going (next TODOs), who you are as a builder (vibe + narrative + profile delta for this ship). Self-contained HTML (CSS animations only, no JS deps).

CRITICAL REVISION from v0 plan: Passive detection must NOT live in the preamble (Codex #9). When promoted, moves to explicit /plan-tune show-landed OR post-ship hook — not passive detection in the hot path.

Why: Biggest personality moment in gstack. The "one-word thing that makes you remember why you built this."

Pros: Screenshot-worthy. Shareable. The kind of dopamine hit that turns power users into evangelists.

Cons: Product theater if the substrate isn't solid. Needs /design-shotgun → /design-html for the visual direction. Requires E2 unified profile for narrative/vibe data.

Context: /land-and-deploy trust/adoption is low, so passive detection is the right trigger shape. Dedup marker per PR in ~/.gstack/.landed-celebrated-*. E2E tests for squash/merge-commit/rebase/co-author/fresh-clone/dedup variants.

Effort: M+ (human: ~1 week / CC: ~3h total) Priority: P0 Depends on: E3 narrative/vibe shipped. /design-shotgun run on real PR data to pick a visual direction, then /design-html to finalize.

E6 — Auto-adjustment based on declared ↔ inferred mismatch

What: Currently /plan-tune shows the gap between declared and inferred (v1 observational). v2 auto-suggests declaration updates when the gap exceeds a threshold ("Your profile says hands-off but you've overridden 40% of recommendations — you're actually taste-driven. Update declared autonomy from 0.8 to 0.5?"). Requires explicit user confirmation before any mutation (Codex trust-boundary #15 already baked into v1).

Why: Profile drifts silently without correction. Self-correcting profile stays honest.

Pros: Profile becomes more accurate over time. User sees the gap and decides.

Cons: Requires stable inferred profile (diversity check). False positives nag the user.

Context: v1 has --check-mismatch that flags > 0.3 gaps but doesn't suggest fixes. v2 adds the suggestion UX + per-dimension threshold tuning from real data.

Effort: S (human: ~1 day / CC: ~45min) Priority: P0 Depends on: Calibrated profile + real mismatch data from v1 dogfood.

E7 — Psychographic auto-decide

What: When inferred profile is calibrated AND a question is two-way AND the user's dimensions strongly favor one option, auto-choose without asking (visible annotation: "Auto-decided via profile. Change with /plan-tune."). v1 only auto-decides via EXPLICIT per-question preferences; v2 adds profile-driven auto-decide.

Why: The whole point of the psychographic. Silent, correct defaults based on who the user IS, not just what they've said.

Pros: Friction-free skill invocation for calibrated power users. Over time, gstack feels like it's reading your mind.

Cons: Highest-risk deferral. Wrong auto-decides are costly. Requires very high confidence in the signal map AND calibration gate.

Context: v1 diversity gate is sample_size >= 20 AND skills_covered >= 3 AND question_ids_covered >= 8 AND days_span >= 7. v2 must prove this gate actually catches noisy profiles before shipping.

Effort: M (human: ~3 days / CC: ~2h) Priority: P0 Depends on: E1 (skills consuming profile) + real observed data showing calibration gate is trustworthy.

Browse

Scope sidebar-agent kill to session PID, not pkill -f sidebar-agent\.ts

What: shutdown() in browse/src/server.ts:1193 uses pkill -f sidebar-agent\.ts to kill the sidebar-agent daemon, which matches every sidebar-agent on the machine, not just the one this server spawned. Replace with PID tracking: store the sidebar-agent PID when cli.ts spawns it (via state file or env), then process.kill(pid, 'SIGTERM') in shutdown().

Why: A user running two Conductor worktrees (or any multi-session setup), each with its own $B connect, closes one browser window ... and the other worktree's sidebar-agent gets killed too. The blast radius was there before, but the v0.18.1.0 disconnect-cleanup fix makes it more reachable: every user-close now runs the full shutdown() path, whereas before user-close bypassed it.

Context: Surfaced by /ship's adversarial review on v0.18.1.0. Pre-existing code, not introduced by the fix. Fix requires propagating the sidebar-agent PID from cli.ts spawn site (~line 885) into the server's state file so shutdown() can target just this session's agent. Related: browse/src/cli.ts spawns with Bun.spawn(...).unref() and already captures agentProc.pid.

Effort: S (human: ~2h / CC: ~15min) Priority: P2 Depends on: None

Sidebar Security

ML Prompt Injection Classifier — v1 SHIPPED (branch garrytan/prompt-injection-guard)

Status: IN PROGRESS on branch garrytan/prompt-injection-guard. Classifier swap: TestSavantAI replaces DeBERTa (better on developer content — HN/Reddit/Wikipedia/tech blogs all score SAFE 0.98+, attacks score INJECTION 0.99+). Pre-impl gate 3 (benign corpus dry-run) forced this pivot — see ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-prompt-injection-guard.md.

What shipped in v1:

  • browse/src/security.ts — canary injection + check, verdict combiner (ensemble rule), attack log with rotation, cross-process session state, status reporting
  • browse/src/security-classifier.ts — TestSavantAI ONNX classifier + Haiku transcript classifier (reasoning-blind), both with graceful degradation
  • Canary flows end-to-end: server.ts injects, sidebar-agent.ts checks every outbound channel (text, tool args, URLs, file writes) and kills session on leak
  • Pre-spawn ML scan of user message with ensemble rule (BLOCK requires both classifiers)
  • /health endpoint exposes security status for shield icon
  • 25 unit tests + 12 regression tests all passing

Branch 2 architecture (decided from pre-impl gate 1): The ML classifier ONLY runs in sidebar-agent.ts (non-compiled bun script). The compiled browse binary cannot link onnxruntime-node. Architectural controls (XML framing + allowlist) defend the compiled-side ingress.

ML Prompt Injection Classifier — v2 Follow-ups

~Cut Haiku false-positive rate from 44% toward 15% (P0) — SHIPPED in v1.5.2.0

Measured result (500-case BrowseSafe-Bench smoke): detection 67.3% → 56.2%, FP 44.1% → 22.9%. Gate passes (detection ≥ 55%, FP ≤ 25%). Knobs that landed: label-first ensemble voting (verdict label trumps numeric confidence for transcript layer), hallucination guard (verdict=block at conf < 0.40 → warn-vote), new THRESHOLDS.SOLO_CONTENT_BLOCK = 0.92 for label-less content classifiers, label-first extension to toolOutput path, tighter Haiku prompt + 8 few-shot exemplars, pinned Haiku model, claude -p spawn from os.tmpdir() so CLAUDE.md can't poison the classifier, timeout bumped 15s → 45s. CI gate: browse/test/security-bench-ensemble.test.ts replays fixture, fail-closed on missing fixture + security-layer diff. The original plan's stop-loss revert order didn't move the FP needle (FPs came from single-layer-BLOCK paths, not ensemble); the real levers turned out to be architectural (label-first) plus a new decoupled threshold.

See CHANGELOG.md [1.5.2.0] for the full shipped summary.

Original spec (pre-ship, retained for archive)

What: v1 ships the Haiku transcript classifier on every tool output (Read/Grep/Bash/Glob/WebFetch). BrowseSafe-Bench smoke measured detection 67.3% + FP 44.1% — a 4.4x detection lift from L4-only, but FP tripled because Haiku is more aggressive than L4 on edge cases (phishing-style benign content, borderline social engineering). The review banner makes FPs recoverable but 44% is too high for a delightful default.

Why: User clicks review banner roughly every-other tool output = real UX friction. Tuning these four knobs together should cut FP to ~15-20% while keeping detection in the 60-70% range:

  1. Switch ensemble counting to Haiku's verdict field, not confidence. Right now combineVerdict treats Haiku warn-at-0.6 as a BLOCK vote. Haiku reserves verdict: "block" for clear-cut cases and uses "warn" liberally. Count only verdict === "block" as a BLOCK vote; warn becomes a soft signal that participates in 2-of-N ensemble but doesn't single-handedly BLOCK.
  2. Tighten Haiku's classifier prompt. Current prompt is generic. Rewrite to: "Return block only if the text contains explicit instruction-override, role-reset, exfil request, or malicious code execution. Return warn for social engineering that doesn't try to hijack the agent. Return safe otherwise." More specific instructions → fewer false flags.
  3. Add 6-8 few-shot exemplars to Haiku's prompt. Pairs of (injection text → block) and (benign-looking-but-safe → safe). LLM few-shot consistently outperforms zero-shot on classification.
  4. Bump Haiku's WARN threshold from 0.6 to 0.75. Borderline fires drop out of the ensemble pool.

Ship all four together, re-run BrowseSafe-Bench smoke, record before/after. Target: 60-70% detection / 15-25% FP.

Effort: S (human: ~1 day / CC: ~30-45 min + ~45min bench) Priority: P0 (direct UX impact post-ship; ship v1 as-is with review banner, file this as the immediate follow-up) Depends on: v1.4.0.0 prompt-injection-guard branch merged

Cache review decisions per (domain, payload-hash-prefix) (P1)

What: If Haiku fires on a page twice in the same session (e.g., user does Bash then Grep on the same suspicious file), the second fire shouldn't re-prompt. Cache the user's decision keyed by a per-session (domain, payloadHash-prefix) pair. Small LRU, ~100 entries, session-scoped (not persistent across sidebar restarts — we want fresh decisions on new sessions).

Why: Reduces review-banner fatigue when the same bit of sketchy content gets scanned multiple times via different tools. At 44% FP on v1, this matters most.

Effort: S (human: ~0.5 day / CC: ~20 min) Priority: P1

Fine-tune a small classifier on BrowseSafe-Bench + Qualifire + xxz224 (P2 research)

What: TestSavantAI was trained on direct-injection text, wrong distribution for browser-agent attacks (measured 15% recall). Take BERT-base, fine-tune on BrowseSafe-Bench (3,680 cases) + Qualifire prompt-injection-benchmark (5k) + xxz224 (3.7k) combined, ship in ~/.gstack/models/ as replacement L4 classifier.

Why: Expected 15% → 70%+ recall on the actual threat distribution without needing Haiku. Would also cut latency (no CLI subprocess) and drop Haiku cost.

Effort: XL (human: ~3-5 days + ~$50 GPU / CC: ~4-6 hours setup + ~$50 GPU) Priority: P2 research — validate the lift on a held-out test set before committing to replace TestSavant

DeBERTa-v3 ensemble as default (P2)

What: Flip GSTACK_SECURITY_ENSEMBLE=deberta from opt-in to default. Adds a 3rd ML vote; 2-of-3 agreement rule should reduce FPs while catching attacks that only DeBERTa sees.

Why: More votes = better calibration. Currently opt-in because 721MB is a big first-run download; flipping to default requires lazy-download UX.

Cons: 721MB first-run download for every user. Costs user bandwidth + disk.

Effort: M (human: ~2 days / CC: ~1 hour + UX) Priority: P2 (after #1 tuning to see how much room is left)

User-feedback flywheel — decisions become training data (P3)

What: Every Allow/Block click is labeled data. Log (suspected_text hash, layer scores, user decision, ts) to ~/.gstack/security/feedback.jsonl. Aggregate via community-pulse when telemetry: community. Periodically retrain the classifier on aggregate feedback.

Why: The system gets better the more it's used. Closes the loop between user reality and defense quality.

Cons: Feedback loop can be poisoned if attacker controls enough devices. Need guardrails (stratified sampling, reviewer validation, k-anon minimums on training batch).

Effort: L (human: ~1 week for local logging + aggregation pipe, another week for retrain cron / CC: ~2-4 hours per sub-part) Priority: P3 — only worth building after v2 tuning proves the architecture is the right shape

Shield icon + canary leak banner UI (P0) — SHIPPED

Banner landed in commits a9f702a7 (HTML+CSS, variant A mockup) + ffb064af (JS wiring + security_event routing + a11y + Escape-to-dismiss). Shield icon landed in 59e0635e with 3 states (protected/degraded/inactive), custom SVG + mono SEC label per design review Pass 7, hover tooltip with per-layer detail.

Known v1 limitation logged as follow-up: shield only updates at connect — see "Shield icon continuous polling" above.

Shield icon continuous polling (P2) — SHIPPED

Commit 06002a82: /sidebar-chat response now includes security: getSecurityStatus(), and sidepanel.js calls updateSecurityShield(data.security) on every poll tick. Shield flips to 'protected' as soon as classifier warmup completes (typically ~30s after initial connect on first run), no reload needed.

Attack telemetry via gstack-telemetry-log (P1) — SHIPPED

Landed in commits 28ce883c (binary) + f68fa4a9 (security.ts wiring). The telemetry binary now accepts --event-type attack_attempt --url-domain --payload-hash --confidence --layer --verdict. logAttempt() spawns the binary fire-and-forget. Existing tier gating carries the events.

Downstream follow-up still open: update the community-pulse Supabase edge function to accept the new event type and store in a typed security_attempts table. Dashboard read path is a separate TODO ("Cross-user aggregate attack dashboard" below).

Full BrowseSafe-Bench at gate tier (P2)

What: Promote browse/test/security-bench.test.ts from smoke-200 (gate) to full-3680 (gate) once smoke/full detection rate correlation is measured (~2 weeks post-ship).

Why: BrowseSafe-Bench is Perplexity's 3,680-case browser-agent injection benchmark. Smoke-200 is a sample; full coverage catches the long tail. Run time ~5min hermetic.

Effort: S (CC: ~45min) Priority: P2 Depends on: v1 shipped + ~2 weeks real data

Cross-user aggregate attack dashboard (P2) — CLI SHIPPED, web UI remains

CLI dashboard shipped in commits a5588ec0 (schema migration) + 2d107978 (community-pulse edge function security aggregation) + 756875a7 (bin/gstack- security-dashboard). Users can now run gstack-security-dashboard to see attacks last 7 days, top attacked domains, detection-layer distribution, and verdict counts — all aggregated from the Supabase community-pulse pipe.

Web UI at gstack.gg/dashboard/security is still open — that's a separate webapp project outside this repo's scope.

TestSavantAI ensemble → DeBERTa-v3 ensemble (P2) — SHIPPED (opt-in)

Commits b4e49d08 + 8e9ec52d + 4e051603 + 7a815fa7: DeBERTa-v3-base-injection-onnx is now wired as an opt-in L4c ensemble classifier. Enable via GSTACK_SECURITY_ENSEMBLE=deberta — sidebar-agent warmup downloads the 721MB model to ~/.gstack/models/deberta-v3-injection/ on first run. combineVerdict becomes a 2-of-3 agreement rule (testsavant + deberta + transcript) when enabled. Default behavior unchanged (2-of-2 testsavant + transcript).

TestSavantAI + DeBERTa-v3 ensemble — SHIPPED opt-in (see entry above)

Read/Glob/Grep tool-output injection coverage (P2) — SHIPPED

Commits f2e80dd7 + 0098d574: sidebar-agent.ts now scans tool outputs from Read, Glob, Grep, WebFetch, and Bash via SCANNED_TOOLS set. Content >= 32 chars runs through the ML ensemble; BLOCK verdict kills the session and emits security_event. The content-security.ts envelope path was already wrapping browse-command output; this extension closes the non-browse path Codex flagged.

During /ship for v1.4.0.0 this path got additional hardening (commit 407c36b4 + 88b12c2b + c51ebdf4): transcript classifier now receives the tool output text (was empty before), and combineVerdict accepts a toolOutput: true opt that blocks on a single ML classifier at BLOCK threshold (user-input default unchanged for SO-FP mitigation).

Adversarial + integration + smoke-bench test suites (P1) — SHIPPED

Four test files shipped this round:

  • browse/test/security-adversarial.test.ts (94a83c50) — 23 canary-channel
    • verdict-combiner attack-shape tests
  • browse/test/security-integration.test.ts (07745e04) — 10 layer-coexistence
    • defense-in-depth regression guards
  • browse/test/security-live-playwright.test.ts (b9677519) — 7 live-Chromium fixture tests (5 deterministic + 2 ML, skipped if model cache absent)
  • browse/test/security-bench.test.ts (afc6661f) — BrowseSafe-Bench 200-case smoke harness with hermetic dataset cache + v1 baseline metrics

Bun-native 5ms inference (P3 research) — SKELETON SHIPPED, forward pass open

Research skeleton landed this round (browse/src/security-bunnative.ts, docs/designs/BUN_NATIVE_INFERENCE.md, browse/test/security-bunnative.test.ts):

  • Pure-TS WordPiece tokenizer — reads HF tokenizer.json directly, matches transformers.js output on fixture strings (correctness-tested in CI)
  • Stable classify() API that current callers can wire against today
  • Benchmark harness with p50/p95/p99 reporting — anchors v1 WASM baseline for future regressions

Design doc captures the roadmap:

  • Approach A: pure-TS + Float32Array SIMD — ruled out (can't beat WASM)
  • Approach B: Bun FFI + Apple Accelerate cblas_sgemm — target ~3-6ms p50, macOS-only, ~1000 LOC
  • Approach C: Bun WebGPU — unexplored, worth a spike

Remaining work (XL, multi-week):

  • FFI proof-of-concept for cblas_sgemm
  • Single transformer layer implementation + correctness check vs onnxruntime
  • Full forward pass + weight loader + correctness regression fixtures
  • Production swap in security-bunnative.ts classify() body

Builder Ethos

First-time Search Before Building intro

What: Add a generateSearchIntro() function (like generateLakeIntro()) that introduces the Search Before Building principle on first use, with a link to the blog essay.

Why: Boil the Lake has an intro flow that links to the essay and marks .completeness-intro-seen. Search Before Building should have the same pattern for discoverability.

Context: Blocked on a blog post to link to. When the essay exists, add the intro flow with a .search-intro-seen marker file. Pattern: generateLakeIntro() at gen-skill-docs.ts:176.

Effort: S Priority: P2 Depends on: Blog post about Search Before Building

Chrome DevTools MCP Integration

Real Chrome session access

What: Integrate Chrome DevTools MCP to connect to the user's real Chrome session with real cookies, real state, no Playwright middleman.

Why: Right now, headed mode launches a fresh Chromium profile. Users must log in manually or import cookies. Chrome DevTools MCP connects to the user's actual Chrome ... instant access to every authenticated site. This is the future of browser automation for AI agents.

Context: Google shipped Chrome DevTools MCP in Chrome 146+ (June 2025). It provides screenshots, console messages, performance traces, Lighthouse audits, and full page interaction through the user's real browser. gstack should use it for real-session access while keeping Playwright for headless CI/testing workflows.

Potential new skills:

  • /debug-browser: JS error tracing with source-mapped stack traces
  • /perf-debug: performance traces, Core Web Vitals, network waterfall

May replace /setup-browser-cookies for most use cases since the user's real cookies are already there.

Effort: L (human: ~2 weeks / CC: ~2 hours) Priority: P0 Depends on: Chrome 146+, DevTools MCP server installed

Browse

Bundle server.ts into compiled binary

What: Eliminate resolveServerScript() fallback chain entirely — bundle server.ts into the compiled browse binary.

Why: The current fallback chain (check adjacent to cli.ts, check global install) is fragile and caused bugs in v0.3.2. A single compiled binary is simpler and more reliable.

Context: Bun's --compile flag can bundle multiple entry points. The server is currently resolved at runtime via file path lookup. Bundling it removes the resolution step entirely.

Effort: M Priority: P2 Depends on: None

Sessions (isolated browser instances)

What: Isolated browser instances with separate cookies/storage/history, addressable by name.

Why: Enables parallel testing of different user roles, A/B test verification, and clean auth state management.

Context: Requires Playwright browser context isolation. Each session gets its own context with independent cookies/localStorage. Prerequisite for video recording (clean context lifecycle) and auth vault.

Effort: L Priority: P3

Video recording

What: Record browser interactions as video (start/stop controls).

Why: Video evidence in QA reports and PR bodies. Currently deferred because recreateContext() destroys page state.

Context: Needs sessions for clean context lifecycle. Playwright supports video recording per context. Also needs WebM → GIF conversion for PR embedding.

Effort: M Priority: P3 Depends on: Sessions

v20 encryption format support

What: AES-256-GCM support for future Chromium cookie DB versions (currently v10).

Why: Future Chromium versions may change encryption format. Proactive support prevents breakage.

Effort: S Priority: P3

State persistence — SHIPPED

What: Save/load cookies + localStorage to JSON files for reproducible test sessions.

$B state save/load ships in v0.12.1.0. V1 saves cookies + URLs only (not localStorage, which breaks on load-before-navigate). Files at .gstack/browse-states/{name}.json with 0o600 permissions. Load replaces session (closes all pages first). Name sanitized to [a-zA-Z0-9_-].

Remaining: V2 localStorage support (needs pre-navigation injection strategy). Completed: v0.12.1.0 (2026-03-26)

Auth vault

What: Encrypted credential storage, referenced by name. LLM never sees passwords.

Why: Security — currently auth credentials flow through the LLM context. Vault keeps secrets out of the AI's view.

Effort: L Priority: P3 Depends on: Sessions, state persistence

Iframe support — SHIPPED

What: frame <sel> and frame main commands for cross-frame interaction.

$B frame ships in v0.12.1.0. Supports CSS selector, @ref, --name, and --url pattern matching. Execution target abstraction (getActiveFrameOrPage()) across all read/write/snapshot commands. Frame context cleared on navigation, tab switch, resume. Detached frame auto-recovery. Page-only operations (goto, screenshot, viewport) throw clear error when in frame context.

Completed: v0.12.1.0 (2026-03-26)

Semantic locators

What: find role/label/text/placeholder/testid with attached actions.

Why: More resilient element selection than CSS selectors or ref numbers.

Effort: M Priority: P4

Device emulation presets

What: set device "iPhone 16 Pro" for mobile/tablet testing.

Why: Responsive layout testing without manual viewport resizing.

Effort: S Priority: P4

Network mocking/routing

What: Intercept, block, and mock network requests.

Why: Test error states, loading states, and offline behavior.

Effort: M Priority: P4

Download handling

What: Click-to-download with path control.

Why: Test file download flows end-to-end.

Effort: S Priority: P4

Content safety

What: --max-output truncation, --allowed-domains filtering.

Why: Prevent context window overflow and restrict navigation to safe domains.

Effort: S Priority: P4

Streaming (WebSocket live preview)

What: WebSocket-based live preview for pair browsing sessions.

Why: Enables real-time collaboration — human watches AI browse.

Effort: L Priority: P4

Headed mode with Chrome extension — SHIPPED

$B connect launches Playwright's bundled Chromium in headed mode with the gstack Chrome extension auto-loaded. $B handoff now produces the same result (extension + side panel). Sidebar chat gated behind --chat flag.

$B watch — SHIPPED

Claude observes user browsing in passive read-only mode with periodic snapshots. $B watch stop exits with summary. Mutation commands blocked during watch.

Sidebar scout / file drop relay — SHIPPED

Sidebar agent writes structured messages to .context/sidebar-inbox/. Workspace agent reads via $B inbox. Message format: {type, timestamp, page, userMessage, sidebarSessionId}.

Multi-agent tab isolation

What: Two Claude sessions connect to the same browser, each operating on different tabs. No cross-contamination.

Why: Enables parallel /qa + /design-review on different tabs in the same browser.

Context: Requires tab ownership model for concurrent headed connections. Playwright may not cleanly support two persistent contexts. Needs investigation.

Effort: L (human: ~2 weeks / CC: ~2 hours) Priority: P3 Depends on: Headed mode (shipped)

Sidebar agent needs Write tool + better error visibility — SHIPPED

What: Two issues with the sidebar agent (sidebar-agent.ts): (1) --allowedTools is hardcoded to Bash,Read,Glob,Grep, missing Write. Claude can't create files (like CSVs) when asked. (2) When Claude errors or returns empty, the sidebar UI shows nothing, just a green dot. No error message, no "I tried but failed", nothing.

Completed: v0.15.4.0 (2026-04-04). Write tool added to allowedTools. 40+ empty catch blocks replaced with [gstack sidebar], [gstack bg], [browse], [sidebar-agent] prefixed console logging across all 4 files (sidepanel.js, background.js, server.ts, sidebar-agent.ts). Error placeholder text now shows in red. Auth token stale-refresh bug fixed.

Sidebar direct API calls (eliminate claude -p startup tax)

What: Each sidebar message spawns a fresh claude -p process (~2-3s cold start overhead). For "click @e24" that's absurd. Direct Anthropic API calls would be sub-second.

Why: The claude -p startup cost is: process spawn (~100ms) + CLI init (~500ms-1s) + API connection (~200ms) + first token. Model routing (Sonnet for actions) helps but doesn't fix the CLI overhead.

Context: server.ts:spawnClaude() builds args and writes to queue file. sidebar-agent.ts:askClaude() spawns claude -p. Replace with direct fetch('https://api.anthropic.com/...') with tool use. Requires ANTHROPIC_API_KEY accessible to the browse server.

Effort: M (human: ~1 week / CC: ~30min) Priority: P2 Depends on: None

Chrome Web Store publishing

What: Publish the gstack browse Chrome extension to Chrome Web Store for easier install.

Why: Currently sideloaded via chrome://extensions. Web Store makes install one-click.

Effort: S Priority: P4 Depends on: Chrome extension proving value via sideloading

What: GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.

Linux cookie import shipped in v0.11.11.0 (Wave 3). Supports Chrome, Chromium, Brave, Edge on Linux with GNOME Keyring (libsecret) and "peanuts" fallback. Windows DPAPI support remains deferred.

Remaining: Windows cookie decryption (DPAPI). Needs complete rewrite — PR #64 was 1346 lines and stale.

Effort: L (Windows only) Priority: P4 Completed (Linux): v0.11.11.0 (2026-03-23)

Ship

/ship Step 12 test harness should exec the actual template bash, not a reimplementation

What: test/ship-version-sync.test.ts currently reimplements the bash from ship/SKILL.md.tmpl Step 12 inside template literals. When the template changes, both sides must be updated — exactly the drift-risk pattern the Step 12 fix is meant to prevent, applied to our own testing strategy. Replace with a helper that extracts the fenced bash blocks from the template at test time and runs them verbatim (similar to the skill-parser.ts pattern).

Why: Surfaced by the Claude adversarial subagent during the v1.0.1.0 ship. Today the tests would stay green while the template regresses, because the error-message strings already differ between test and template. It's a silent-drift bug waiting to happen.

Context: The fixed test file is at test/ship-version-sync.test.ts (branched off garrytan/ship-version-sync). Existing precedent for extracting-from-skill-md is at test/helpers/skill-parser.ts. Pattern: read the template, slice from ## Step 12 to the next ---, grep fenced bash, feed to /bin/bash with substituted fixtures.

Effort: S (human: ~2h / CC: ~30min) Priority: P2 Depends on: None.

/ship Step 12 BASE_VERSION silent fallback to 0.0.0.0 when git show fails

What: BASE_VERSION=$(git show origin/<base>:VERSION 2>/dev/null || echo "0.0.0.0") silently defaults to 0.0.0.0 in any failure mode — detached HEAD, no origin, offline, base branch renamed. In such states, a real drift could be misclassified or silently repaired with the wrong value. Distinguish "origin/ unreachable" from "origin/:VERSION absent" and fail loudly on the former.

Why: Flagged as CRITICAL (confidence 8/10) by the Claude adversarial subagent during the v1.0.1.0 ship. Low practical risk because /ship Step 3 already fetches origin before Step 12 runs — any reachability failure would abort Step 3 long before this code runs. Still, defense in depth: if someone invokes Step 12 bash outside the full /ship pipeline (e.g., via a standalone helper), the fallback masks a real problem.

Context: Fix: wrap with git rev-parse --verify origin/<base> probe; if that fails, error out rather than defaulting. Touches ship/SKILL.md.tmpl Step 12 idempotency block (around line 409). Tests need a case where git show fails.

Effort: S (human: ~1h / CC: ~15min) Priority: P3 Depends on: None.

GitLab support for /land-and-deploy

What: Add GitLab MR merge + CI polling support to /land-and-deploy skill. Currently uses gh pr view, gh pr checks, gh pr merge, and gh run list/view in 15+ places — each needs a GitLab conditional path using glab ci status, glab mr merge, etc.

Why: Without this, GitLab users can /ship (create MR) but can't /land-and-deploy (merge + verify). Completes the GitLab story end-to-end.

Context: /retro, /ship, and /document-release now support GitLab via the multi-platform BASE_BRANCH_DETECT resolver. /land-and-deploy has deeper GitHub-specific semantics (merge queues, required checks via gh pr checks, deploy workflow polling) that have different shapes on GitLab. The glab CLI (v1.90.0) supports glab mr merge, glab ci status, glab ci view but with different output formats and no merge queue concept.

Effort: L Priority: P2 Depends on: None (BASE_BRANCH_DETECT multi-platform resolver is already done)

Multi-commit CHANGELOG completeness eval

What: Add a periodic E2E eval that creates a branch with 5+ commits spanning 3+ themes (features, cleanup, infra), runs /ship's Step 5 CHANGELOG generation, and verifies the CHANGELOG mentions all themes.

Why: The bug fixed in v0.11.22 (garrytan/ship-full-commit-coverage) showed that /ship's CHANGELOG generation biased toward recent commits on long branches. The prompt fix adds a cross-check, but no test exercises the multi-commit failure mode. The existing ship-local-workflow E2E only uses a single-commit branch.

Context: Would be a periodic tier test (~$4/run, non-deterministic since it tests LLM instruction-following). Setup: create bare remote, clone, add 5+ commits across different themes on a feature branch, run Step 5 via claude -p, verify CHANGELOG output covers all themes. Pattern: ship-local-workflow in test/skill-e2e-workflow.test.ts.

Effort: M Priority: P3 Depends on: None

Ship log — persistent record of /ship runs

What: Append structured JSON entry to .gstack/ship-log.json at end of every /ship run (version, date, branch, PR URL, review findings, Greptile stats, todos completed, test results).

Why: /retro has no structured data about shipping velocity. Ship log enables: PRs-per-week trending, review finding rates, Greptile signal over time, test suite growth.

Context: /retro already reads greptile-history.md — same pattern. Eval persistence (eval-store.ts) shows the JSON append pattern exists in the codebase. ~15 lines in ship template.

Effort: S Priority: P2 Depends on: None

Visual verification with screenshots in PR body

What: /ship Step 7.5: screenshot key pages after push, embed in PR body.

Why: Visual evidence in PRs. Reviewers see what changed without deploying locally.

Context: Part of Phase 3.6. Needs S3 upload for image hosting.

Effort: M Priority: P2 Depends on: /setup-gstack-upload

Review

Inline PR annotations

What: /ship and /review post inline review comments at specific file:line locations using gh api to create pull request review comments.

Why: Line-level annotations are more actionable than top-level comments. The PR thread becomes a line-by-line conversation between Greptile, Claude, and human reviewers.

Context: GitHub supports inline review comments via gh api repos/$REPO/pulls/$PR/reviews. Pairs naturally with Phase 3.6 visual annotations.

Effort: S Priority: P2 Depends on: None

Greptile training feedback export

What: Aggregate greptile-history.md into machine-readable JSON summary of false positive patterns, exportable to the Greptile team for model improvement.

Why: Closes the feedback loop — Greptile can use FP data to stop making the same mistakes on your codebase.

Context: Was a P3 Future Idea. Upgraded to P2 now that greptile-history.md data infrastructure exists. The signal data is already being collected; this just makes it exportable. ~40 lines.

Effort: S Priority: P2 Depends on: Enough FP data accumulated (10+ entries)

Visual review with annotated screenshots

What: /review Step 4.5: browse PR's preview deploy, annotated screenshots of changed pages, compare against production, check responsive layouts, verify accessibility tree.

Why: Visual diff catches layout regressions that code review misses.

Context: Part of Phase 3.6. Needs S3 upload for image hosting.

Effort: M Priority: P2 Depends on: /setup-gstack-upload

QA

QA trend tracking

What: Compare baseline.json over time, detect regressions across QA runs.

Why: Spot quality trends — is the app getting better or worse?

Context: QA already writes structured reports. This adds cross-run comparison.

Effort: S Priority: P2

CI/CD QA integration

What: /qa as GitHub Action step, fail PR if health score drops.

Why: Automated quality gate in CI. Catch regressions before merge.

Effort: M Priority: P2

Smart default QA tier

What: After a few runs, check index.md for user's usual tier pick, skip the AskUserQuestion.

Why: Reduces friction for repeat users.

Effort: S Priority: P2

Accessibility audit mode

What: --a11y flag for focused accessibility testing.

Why: Dedicated accessibility testing beyond the general QA checklist.

Effort: S Priority: P3

CI/CD generation for non-GitHub providers

What: Extend CI/CD bootstrap to generate GitLab CI (.gitlab-ci.yml), CircleCI (.circleci/config.yml), and Bitrise pipelines.

Why: Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone.

Context: v1 ships with GitHub Actions only. Detection logic already checks for .gitlab-ci.yml, .circleci/, bitrise.yml and skips with an informational note. Each provider needs ~20 lines of template text in generateTestBootstrap().

Effort: M Priority: P3 Depends on: Test bootstrap (shipped)

Auto-upgrade weak tests (★) to strong tests (★★★)

What: When Step 7 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths.

Why: Many codebases have tests that technically exist but don't catch real bugs — expect(component).toBeDefined() isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests."

Context: Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original.

Effort: M Priority: P3 Depends on: Test quality scoring (shipped)

Retro

Deployment health tracking (retro + browse)

What: Screenshot production state, check perf metrics (page load times), count console errors across key pages, track trends over retro window.

Why: Retro should include production health alongside code metrics.

Context: Requires browse integration. Screenshots + metrics fed into retro output.

Effort: L Priority: P3 Depends on: Browse sessions

Infrastructure

/setup-gstack-upload skill (S3 bucket)

What: Configure S3 bucket for image hosting. One-time setup for visual PR annotations.

Why: Prerequisite for visual PR annotations in /ship and /review.

Effort: M Priority: P2

gstack-upload helper

What: browse/bin/gstack-upload — upload file to S3, return public URL.

Why: Shared utility for all skills that need to embed images in PRs.

Effort: S Priority: P2 Depends on: /setup-gstack-upload

WebM to GIF conversion

What: ffmpeg-based WebM → GIF conversion for video evidence in PRs.

Why: GitHub PR bodies render GIFs but not WebM. Needed for video recording evidence.

Effort: S Priority: P3 Depends on: Video recording

Extend worktree isolation to Claude E2E tests

What: Add useWorktree?: boolean option to runSkillTest() so any Claude E2E test can opt into worktree mode for full repo context instead of tmpdir fixtures.

Why: Some Claude E2E tests (CSO audit, review-sql-injection) create minimal fake repos but would produce more realistic results with full repo context. The infrastructure exists (describeWithWorktree() in e2e-helpers.ts) — this extends it to the session-runner level.

Context: WorktreeManager shipped in v0.11.12.0. Currently only Gemini/Codex tests use worktrees. Claude tests use planted-bug fixture repos which are correct for their purpose, but new tests that want real repo context can use describeWithWorktree() today. This TODO is about making it even easier via a flag on runSkillTest().

Effort: M (human: ~2 days / CC: ~20 min) Priority: P3 Depends on: Worktree isolation (shipped v0.11.12.0)

E2E model pinning — SHIPPED

What: Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.

Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). --retry 2 added. EVALS_MODEL env var for override. test:e2e:fast tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.

Eval web dashboard

What: bun run eval:dashboard serves local HTML with charts: cost trending, detection rate, pass/fail history.

Why: Visual charts better for spotting trends than CLI tools.

Context: Reads ~/.gstack-dev/evals/*.json. ~200 lines HTML + chart.js via Bun HTTP server.

Effort: M Priority: P3 Depends on: Eval persistence (shipped in v0.3.6)

CI/CD QA quality gate

What: Run /qa as a GitHub Action step, fail PR if health score drops below threshold.

Why: Automated quality gate catches regressions before merge. Currently QA is manual — CI integration makes it part of the standard workflow.

Context: Requires headless browse binary available in CI. The /qa skill already produces baseline.json with health scores — CI step would compare against the main branch baseline and fail if score drops. Would need ANTHROPIC_API_KEY in CI secrets since /qa uses Claude.

Effort: M Priority: P2 Depends on: None

Cross-platform URL open helper

What: gstack-open-url helper script — detect platform, use open (macOS) or xdg-open (Linux).

Why: The first-time Completeness Principle intro uses macOS open to launch the essay. If gstack ever supports Linux, this silently fails.

Effort: S (human: ~30 min / CC: ~2 min) Priority: P4 Depends on: Nothing

CDP-based DOM mutation detection for ref staleness

What: Use Chrome DevTools Protocol DOM.documentUpdated / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit snapshot call.

Why: Current ref staleness detection (async count() check) only catches stale refs at action time. CDP mutation detection would proactively warn when refs become stale, preventing the 5-second timeout entirely for SPA re-renders.

Context: Parts 1+2 of ref staleness fix (RefEntry metadata + eager validation via count()) are shipped. This is Part 3 — the most ambitious piece. Requires CDP session alongside Playwright, MutationObserver bridge, and careful performance tuning to avoid overhead on every DOM change.

Effort: L Priority: P3 Depends on: Ref staleness Parts 1+2 (shipped)

Office Hours / Design

Design docs → Supabase team store sync

What: Add design docs (*-design-*.md) to the Supabase sync pipeline alongside test plans, retro snapshots, and QA reports.

Why: Cross-team design discovery at scale. Local ~/.gstack/projects/$SLUG/ keyword-grep discovery works for same-machine users now, but Supabase sync makes it work across the whole team. Duplicate ideas surface, everyone sees what's been explored.

Context: /office-hours writes design docs to ~/.gstack/projects/$SLUG/. The team store already syncs test plans, retro snapshots, QA reports. Design docs follow the same pattern — just add a sync adapter.

Effort: S Priority: P2 Depends on: garrytan/team-supabase-store branch landing on main

/yc-prep skill

What: Skill that helps founders prepare their YC application after /office-hours identifies strong signal. Pulls from the design doc, structures answers to YC app questions, runs a mock interview.

Why: Closes the loop. /office-hours identifies the founder, /yc-prep helps them apply well. The design doc already contains most of the raw material for a YC application.

Effort: M (human: ~2 weeks / CC: ~2 hours) Priority: P2 Depends on: office-hours founder discovery engine shipping first

Design Review

/plan-design-review + /qa-design-review + /design-consultation — SHIPPED

Shipped as v0.5.0 on main. Includes /plan-design-review (report-only design audit), /qa-design-review (audit + fix loop), and /design-consultation (interactive DESIGN.md creation). {{DESIGN_METHODOLOGY}} resolver provides shared 80-item design audit checklist.

Design outside voices in /plan-eng-review

What: Extend the parallel dual-voice pattern (Codex + Claude subagent) to /plan-eng-review's architecture review section.

Why: The design beachhead (v0.11.3.0) proves cross-model consensus works for subjective reviews. Architecture reviews have similar subjectivity in tradeoff decisions.

Context: Depends on learnings from the design beachhead. If the litmus scorecard format proves useful, adapt it for architecture dimensions (coupling, scaling, reversibility).

Effort: S Priority: P3 Depends on: Design outside voices shipped (v0.11.3.0)

Outside voices in /qa visual regression detection

What: Add Codex design voice to /qa for detecting visual regressions during bug-fix verification.

Why: When fixing bugs, the fix can introduce visual regressions that code-level checks miss. Codex could flag "the fix broke the responsive layout" during re-test.

Context: Depends on /qa having design awareness. Currently /qa focuses on functional testing.

Effort: M Priority: P3 Depends on: Design outside voices shipped (v0.11.3.0)

Document-Release

Auto-invoke /document-release from /ship — SHIPPED

Shipped in v0.8.3. Step 8.5 added to /ship — after creating the PR, /ship automatically reads document-release/SKILL.md and executes the doc update workflow. Zero-friction doc updates.

{{DOC_VOICE}} shared resolver

What: Create a placeholder resolver in gen-skill-docs.ts encoding the gstack voice guide (friendly, user-forward, lead with benefits). Inject into /ship Step 5, /document-release Step 5, and reference from CLAUDE.md.

Why: DRY — voice rules currently live inline in 3 places (CLAUDE.md CHANGELOG style section, /ship Step 5, /document-release Step 5). When the voice evolves, all three drift.

Context: Same pattern as {{QA_METHODOLOGY}} — shared block injected into multiple templates to prevent drift. ~20 lines in gen-skill-docs.ts.

Effort: S Priority: P2 Depends on: None

Ship Confidence Dashboard

Smart review relevance detection — PARTIALLY SHIPPED

What: Auto-detect which of the 4 reviews are relevant based on branch changes (skip Design Review if no CSS/view changes, skip Code Review if plan-only).

bin/gstack-diff-scope shipped — categorizes diff into SCOPE_FRONTEND, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Used by design-review-lite to skip when no frontend files changed. Dashboard integration for conditional row display is a follow-up.

Remaining: Dashboard conditional row display (hide "Design Review: NOT YET RUN" when SCOPE_FRONTEND=false). Extend to Eng Review (skip for docs-only) and CEO Review (skip for config-only).

Effort: S Priority: P3 Depends on: gstack-diff-scope (shipped)

Codex

Codex→Claude reverse buddy check skill

What: A Codex-native skill (.agents/skills/gstack-claude/SKILL.md) that runs claude -p to get an independent second opinion from Claude — the reverse of what /codex does today from Claude Code.

Why: Codex users deserve the same cross-model challenge that Claude users get via /codex. Currently the flow is one-way (Claude→Codex). Codex users have no way to get a Claude second opinion.

Context: The /codex skill template (codex/SKILL.md.tmpl) shows the pattern — it wraps codex exec with JSONL parsing, timeout handling, and structured output. The reverse skill would wrap claude -p with similar infrastructure. Would be generated into .agents/skills/gstack-claude/ by gen-skill-docs --host codex.

Effort: M (human: ~2 weeks / CC: ~30 min) Priority: P1 Depends on: None

Completeness

Completeness metrics dashboard

What: Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time.

Why: Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts).

Context: Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence.

Effort: M (human) / S (CC) Priority: P3 Depends on: Boil the Lake shipped (v0.6.1)

Safety & Observability

On-demand hook skills (/careful, /freeze, /guard) — SHIPPED

What: Three new skills that use Claude Code's session-scoped PreToolUse hooks to add safety guardrails on demand.

Shipped as /careful, /freeze, /guard, and /unfreeze in v0.6.5. Includes hook fire-rate telemetry (pattern name only, no command content) and inline skill activation telemetry.

Skill usage telemetry — SHIPPED

What: Track which skills get invoked, how often, from which repo.

Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into preamble telemetry line. Analytics CLI (bun run analytics) for querying. /retro integration shows skills-used-this-week.

/investigate scoped debugging enhancements (gated on telemetry)

What: Six enhancements to /investigate auto-freeze, contingent on telemetry showing the freeze hook actually fires in real debugging sessions.

Why: /investigate v0.7.1 auto-freezes edits to the module being debugged. If telemetry shows the hook fires often, these enhancements make the experience smarter. If it never fires, the problem wasn't real and these aren't worth building.

Context: All items are prose additions to investigate/SKILL.md.tmpl. No new scripts.

Items:

  1. Stack trace auto-detection for freeze directory (parse deepest app frame)
  2. Freeze boundary widening (ask to widen instead of hard-block when hitting boundary)
  3. Post-fix auto-unfreeze + full test suite run
  4. Debug instrumentation cleanup (tag with DEBUG-TEMP, remove before commit)
  5. Debug session persistence (~/.gstack/investigate-sessions/ — save investigation for reuse)
  6. Investigation timeline in debug report (hypothesis log with timing)

Effort: M (all 6 combined) Priority: P3 Depends on: Telemetry data showing freeze hook fires in real /investigate sessions

Context Intelligence

Context recovery preamble

What: Add ~10 lines of prose to the preamble telling the agent to re-read gstack artifacts (CEO plans, design reviews, eng reviews, checkpoints) after compaction or context degradation.

Why: gstack skills produce valuable artifacts stored at ~/.gstack/projects/$SLUG/. When Claude's auto-compaction fires, it preserves a generic summary but doesn't know these artifacts exist. The plans and reviews that shaped the current work silently vanish from context, even though they're still on disk. This is the thing nobody else in the Claude Code ecosystem is solving, because nobody else has gstack's artifact architecture.

Context: Inspired by Anthropic's claude-progress.txt pattern for long-running agents. Also informed by claude-mem's "progressive disclosure" approach. See docs/designs/SESSION_INTELLIGENCE.md for the broader vision. CEO plan: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-03-31-session-intelligence-layer.md.

Effort: S (human: ~30 min / CC: ~5 min) Priority: P1 Depends on: None Key files: scripts/resolvers/preamble.ts

Session timeline

What: Append one-line JSONL entry to ~/.gstack/projects/$SLUG/timeline.jsonl after every skill run (timestamp, skill, branch, outcome). /retro renders the timeline.

Why: Makes AI-assisted work history visible. /retro can show "this week: 3 /review, 2 /ship, 1 /investigate." Provides the observability layer for the session intelligence architecture.

Effort: S (human: ~1h / CC: ~5 min) Priority: P1 Depends on: None Key files: scripts/resolvers/preamble.ts, retro/SKILL.md.tmpl

Cross-session context injection

What: When a new gstack session starts on a branch with recent checkpoints or plans, the preamble prints a one-line summary: "Last session: implemented JWT auth, 3/5 tasks done." Agent knows where you left off before reading any files.

Why: Claude starts every session fresh. This one-liner orients the agent immediately. Similar to claude-mem's SessionStart hook pattern but simpler and integrated.

Effort: S (human: ~2h / CC: ~10 min) Priority: P2 Depends on: Context recovery preamble

/checkpoint skill

What: Manual skill to snapshot current working state: what's being done and why, files being edited, decisions made (and rationale), what's done vs. remaining, critical types/signatures. Saved to ~/.gstack/projects/$SLUG/checkpoints/<timestamp>.md.

Why: Useful before stepping away from a long session, before known-complex operations that might trigger compaction, for handing off context to a different agent/workspace, or coming back to a project after days away.

Effort: M (human: ~1 week / CC: ~30 min) Priority: P2 Depends on: Context recovery preamble Key files: New checkpoint/SKILL.md.tmpl, scripts/gen-skill-docs.ts

Session Intelligence Layer design doc

What: Write docs/designs/SESSION_INTELLIGENCE.md describing the architectural vision: gstack as the persistent brain that survives Claude's ephemeral context. Every skill writes to ~/.gstack/projects/$SLUG/, preamble re-reads, /retro rolls up.

Why: Connects context recovery, health, checkpoint, and timeline features into a coherent architecture. Nobody else in the ecosystem is building this.

Effort: S (human: ~2h / CC: ~15 min) Priority: P1 Depends on: None

Health

/health — Project Health Dashboard

What: Skill that runs type-check, lint, test suite, and dead code scan, then reports a composite 0-10 health score with breakdown by category. Tracks over time in ~/.gstack/health/<project-slug>/ for trend detection. Optionally integrates CodeScene MCP for deeper complexity/cohesion/coupling analysis.

Why: No quick way to get "state of the codebase" before starting work. CodeScene peer-reviewed research shows AI-generated code increases static analysis warnings by 30%, code complexity by 41%, and change failure rates by 30%. Users need guardrails. Like /qa but for code quality rather than browser behavior.

Context: Reads CLAUDE.md for project-specific commands (platform-agnostic principle). Runs checks in parallel. /retro can pull from health history for trend sparklines.

Effort: M (human: ~1 week / CC: ~30 min) Priority: P1 Depends on: None Key files: New health/SKILL.md.tmpl, scripts/gen-skill-docs.ts

/health as /ship gate

What: If health score exists and drops below a configurable threshold, /ship warns before creating the PR: "Health dropped from 8/10 to 5/10 this branch — 3 new lint warnings, 1 test failure. Ship anyway?"

Why: Quality gate that prevents shipping degraded code. Configurable threshold so it's not blocking for teams that don't use /health.

Effort: S (human: ~1h / CC: ~5 min) Priority: P2 Depends on: /health skill

Swarm

Swarm primitive — reusable multi-agent dispatch

What: Extract Review Army's dispatch pattern into a reusable resolver (scripts/resolvers/swarm.ts). Wire into /ship for parallel pre-ship checks (type-check + lint + test in parallel sub-agents). Make available to /qa, /investigate, /health.

Why: Review Army proved parallel sub-agents work brilliantly (5 agents = 835K tokens of working memory vs. 167K for one). The pattern is locked inside review-army.ts. Other skills need it too. Claude Code Agent Teams (official, Feb 2026) validates the team-lead-delegates-to-specialists pattern. Gartner: multi-agent inquiries surged 1,445% in one year.

Context: Start with the specific /ship use case. Extract shared parts only after 2+ consumers reveal what config parameters are actually needed. Avoid premature abstraction. Can leverage existing WorktreeManager for isolation.

Effort: L (human: ~2 weeks / CC: ~2 hours) Priority: P2 Depends on: None Key files: scripts/resolvers/review-army.ts, new scripts/resolvers/swarm.ts, ship/SKILL.md.tmpl, lib/worktree.ts

Refactoring

/refactor-prep — Pre-Refactor Token Hygiene

What: Skill that detects project language/framework, runs appropriate dead code detection (knip/ts-prune for TS/JS, vulture/autoflake for Python, staticcheck/deadcode for Go, cargo udeps for Rust), strips dead imports/exports/props/console.logs, and commits cleanup separately.

Why: Dirty codebases accelerate context compaction. Dead imports, unused exports, and orphaned code eat tokens that contribute nothing but everything to triggering compaction mid-refactor. Cleaning first buys back 20%+ of context budget. Reports lines removed and estimated token savings.

Effort: M (human: ~1 week / CC: ~30 min) Priority: P2 Depends on: None Key files: New refactor-prep/SKILL.md.tmpl, scripts/gen-skill-docs.ts

Factory Droid

Browse MCP server for Factory Droid

What: Expose gstack's browse binary and key workflows as an MCP server that Factory Droid connects to natively. Factory users would run /mcp, add the gstack server, and get browse, QA, and review capabilities as Factory tools.

Why: Factory already supports 40+ MCP servers in its registry. Getting gstack's browse binary listed there is a distribution play. Nobody else has a real compiled browser binary as an MCP tool. This is the thing that makes gstack uniquely valuable on Factory Droid.

Context: Option A (--host factory compatibility shim) ships first in v0.13.4.0. Option B is the follow-up that provides deeper integration. The browse binary is already a stateless CLI, so wrapping it as an MCP server is straightforward (stdin/stdout JSON-RPC). Each browse command becomes an MCP tool.

Effort: L (human: ~1 week / CC: ~5 hours) Priority: P1 Depends on: --host factory (Option A, shipping in v0.13.4.0)

.agent/skills/ dual output for cross-agent compatibility

What: Factory also reads from <repo>/.agent/skills/ as a cross-agent compatibility path. Could output there in addition to .factory/skills/ for broader reach across other agents that use the .agent convention.

Why: Multiple AI agents beyond Factory may adopt the .agent/skills/ convention. Outputting there too would give free compatibility.

Effort: S Priority: P3 Depends on: --host factory

Custom Droid definitions alongside skills

What: Factory has "custom droids" (subagents with tool restrictions, model selection, autonomy levels). Could ship gstack-qa.md droid configs alongside skills that restrict tools to read-only + execute for safety.

Why: Deeper Factory integration. Droid configs give Factory users tighter control over what gstack skills can do.

Effort: M Priority: P3 Depends on: --host factory

GStack Browser

Anti-bot stealth: Playwright CDP patches (rebrowser-style)

What: Write a postinstall script that patches Playwright's CDP layer to suppress Runtime.enable and use addBinding for context ID discovery, same approach as rebrowser-patches. Eliminates the navigator.webdriver, cdc_ markers, and other CDP artifacts that sites like Google use to detect automation.

Why: Our current stealth narrows to navigator.webdriver masking + ChromeDriver cdc_ runtime cleanup + Permissions API patch (v1.28.0.0 narrowed it from also faking plugins/languages, since modern fingerprinters punish inconsistent fakes more than they punish admitted defaults). That's enough for most sites but Google still triggers captchas, because the real detection is at the CDP protocol level. rebrowser-patches proved the approach works but their patches target Playwright 1.52.0 and don't apply to our 1.58.2. We need our own patcher using string matching instead of line-number diffs. 6 files, ~200 lines of patches total.

Context: Full analysis of rebrowser-patches source: patches 6 files in playwright-core/lib/server/ (crConnection.js, crDevTools.js, crPage.js, crServiceWorker.js, frames.js, page.js). Key technique: suppress Runtime.enable (the main CDP detection vector), use Runtime.addBinding + CustomEvent trick to discover execution context IDs without it. Our extension communicates via Chrome extension APIs, not CDP Runtime, so it should be unaffected. Write E2E tests that verify: (1) extension still loads and connects, (2) Google.com loads without captcha, (3) sidebar chat still works.

Effort: L (human: ~2 weeks / CC: ~3 hours) Priority: P1 Depends on: None

Chromium fork (long-term alternative to CDP patches)

What: Maintain a Chromium fork where anti-bot stealth, GStack Browser branding, and native sidebar support live in the source code, not as runtime monkey-patches.

Why: The CDP patches are brittle. They break on every Playwright upgrade and target compiled JS with fragile string matching. A proper fork means: (1) stealth is permanent, not patched, (2) branding is native (no plist hacking at launch), (3) native sidebar replaces the extension (Phase 4 of V0 roadmap), (4) custom protocols (gstack://) for internal pages. Companies like Brave, Arc, and Vivaldi maintain Chromium forks with small teams. With CC, the rebase-on-upstream maintenance could be largely automated.

Context: Trigger criteria from V0 design doc: fork when extension side panel becomes the bottleneck, when anti-bot patches need to live deeper than CDP, or when native UI integration (sidebar, status bar) can't be done via extension. The Chromium build takes ~4 hours on a 32-core machine and produces ~50GB of build artifacts. CI would need dedicated build infra. See docs/designs/GSTACK_BROWSER_V0.md Phase 5 for full analysis.

Effort: XL (human: ~1 quarter / CC: ~2-3 weeks of focused work) Priority: P2 Depends on: CDP patches proving the value of anti-bot stealth first

Completed

Slim preamble + real-PTY plan-mode E2E harness (v1.13.1.0)

  • Compressed 18 preamble resolvers; total SKILL.md corpus dropped from 3.08 MB to 2.30 MB across 47 outputs (-25.5%, ~196K tokens saved).
  • Built test/helpers/claude-pty-runner.ts — real-PTY harness using Bun.spawn({terminal:}) (Bun 1.3.10+ has built-in PTY, no node-pty needed).
  • Rewrote 5 plan-mode E2E tests (plan-ceo, plan-eng, plan-design, plan-devex, plan-mode-no-op); all 5 pass for the first time ever (790s sequential).
  • Same tests were 0/5 on origin/main, on v1.0.0.0, and on this branch with the SDK harness — the SDK couldn't observe Claude's plan-mode confirmation UI.
  • Side fixes folded in: scripts/skill-check.ts sidecar-symlink helper, test/skill-validation.test.ts exemption for browse/test/fixtures/security-bench-haiku-responses.json (resolves the size-warning noise from main's warn-only conversion).

Completed: v1.13.1.0 (2026-04-25)


Pre-existing test failures surfaced during v1.12.0.0 ship — RESOLVED

  • test/brain-sync.test.ts GSTACK_HOME isolation fixed on main in v1.13.0.0.
  • test/model-overlay-opus-4-7.test.ts updated on main to match the new overlay content (the v1.10.1.0 removal of "Fan out explicitly" was correct — measured 60pp fanout vs baseline).

Completed: v1.13.0.0 (2026-04-25, on main)


security-bench-haiku-responses.json size gate — RESOLVED

  • Main converted the 2 MB tracked-file gate to warn-only in v1.13.0.0.
  • v1.13.1.0 added a knownLargeFixtures exemption to suppress the warning for this specific intentional fixture.

Completed: v1.13.1.0 (2026-04-25)


Bearer-token secret-scan regression fixed + E2E coverage added for privacy gate + gh auto-create (v1.12.0.0)

  • Fixed the bearer-token-json regression in bin/gstack-brain-sync — the value charset [A-Za-z0-9_./+=-]{16,} didn't permit spaces, so auth headers with the standard Bearer <token> form (literal space after the scheme name) slipped past the scanner. Added an optional (Bearer |Basic |Token )? prefix to the pattern. Validated against 5 positive cases (including the regression fixture) + 3 negative cases (short tokens, non-secret keys, random JSON). The 7-pattern secret scanner now passes all fixtures including bearer-json.
  • Added test/gstack-brain-init-gh-mock.test.ts — 8 tests exercising the gh CLI auto-create path that previously had zero coverage. Stubs gh on PATH to record every call, asserts gh repo create --private --description "..." --source <GSTACK_HOME> fires with the computed gstack-brain-<user> default name. Covers: happy path, fall-through-to-gh repo view when create hits already-exists, user-provided-URL-bypasses-gh, gh-not-on-path prompts for URL, gh-not-authed prompts for URL, idempotent --remote re-runs, conflicting-remote rejection.
  • Added test/skill-e2e-brain-privacy-gate.test.ts — periodic-tier E2E (~$0.30-$0.50/run). Stages a fake gbrain on PATH + gbrain_sync_mode_prompted=false in config, runs a real skill via runAgentSdkTest, intercepts tool-use via canUseTool, and asserts the preamble fires the 3-option privacy AskUserQuestion with canonical prose ("publish session memory" / "artifact" / "decline"). Second test asserts the gate is silent when prompted=true (idempotency-within-session).
  • Registered brain-privacy-gate in test/helpers/touchfiles.ts (periodic tier) with dependency tracking on scripts/resolvers/preamble/generate-brain-sync-block.ts, bin/gstack-brain-sync, bin/gstack-brain-init, bin/gstack-config, and the Agent SDK runner. Diff-based selection will re-run the E2E whenever any of those change.

Completed: v1.12.0.0 (2026-04-24)


Overlay efficacy harness + Opus 4.7 fanout nudge removal (v1.10.1.0)

  • Built test/skill-e2e-overlay-harness.test.ts, a parametric periodic-tier eval that drives @anthropic-ai/claude-agent-sdk and measures first-turn fanout rate (overlay-ON vs overlay-OFF) across registered fixtures
  • Measured the original "Fan out explicitly" overlay nudge: baseline Opus 4.7 = 70% first-turn fanout on toy prompt, with our nudge = 10%, with Anthropic's own canonical <use_parallel_tool_calls> text = 0%
  • Removed the counterproductive nudge from model-overlays/opus-4-7.md
  • Shipped 36-test free-tier unit suite for the SDK runner + strict fixture validator
  • Registered overlay-harness-opus-4-7-fanout-{toy,realistic} in E2E_TOUCHFILES and E2E_TIERS
  • Total investigation cost: ~$7 across 3 eval runs Completed: v1.10.1.0

CI eval pipeline (v0.9.9.0)

  • GitHub Actions eval upload on Ubicloud runners ($0.006/run)
  • Within-file test concurrency (test() → testConcurrentIfSelected())
  • Eval artifact upload + PR comment with pass/fail + cost
  • Baseline comparison via artifact download from main
  • EVALS_CONCURRENCY=40 for ~6min wall clock (was ~18min) Completed: v0.9.9.0

Deploy pipeline (v0.9.8.0)

  • /land-and-deploy — merge PR, wait for CI/deploy, canary verification
  • /canary — post-deploy monitoring loop with anomaly detection
  • /benchmark — performance regression detection with Core Web Vitals
  • /setup-deploy — one-time deploy platform configuration
  • /review Performance & Bundle Impact pass
  • E2E model pinning (Sonnet default, Opus for quality tests)
  • E2E timing telemetry (first_response_ms, max_inter_turn_ms, wall_clock_ms)
  • test:e2e:fast tier, --retry 2 on all E2E scripts Completed: v0.9.8.0

Phase 1: Foundations (v0.2.0)

  • Rename to gstack
  • Restructure to monorepo layout
  • Setup script for skill symlinks
  • Snapshot command with ref-based element selection
  • Snapshot tests Completed: v0.2.0

Phase 2: Enhanced Browser (v0.2.0)

  • Annotated screenshots, snapshot diffing, dialog handling, file upload
  • Cursor-interactive elements, element state checks
  • CircularBuffer, async buffer flush, health check
  • Playwright error wrapping, useragent fix
  • 148 integration tests Completed: v0.2.0

Phase 3: QA Testing Agent (v0.3.0)

  • /qa SKILL.md with 6-phase workflow, 3 modes (full/quick/regression)
  • Issue taxonomy, severity classification, exploration checklist
  • Report template, health score rubric, framework detection
  • wait/console/cookie-import commands, find-browse binary Completed: v0.3.0
  • cookie-import-browser command (Chromium cookie DB decryption)
  • Cookie picker web UI, /setup-browser-cookies skill
  • 18 unit tests, browser registry (Comet, Chrome, Arc, Brave, Edge) Completed: v0.3.1

E2E test cost tracking

  • Track cumulative API spend, warn if over threshold Completed: v0.3.6

Auto-upgrade mode + smart update check

  • Config CLI (bin/gstack-config), auto-upgrade via ~/.gstack/config.yaml, 12h cache TTL, exponential snooze backoff (24h→48h→1wk), "never ask again" option, vendored copy sync on upgrade Completed: v0.3.8