mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
ed1e4be2f6
* build: vendor xterm@5 for the Terminal sidebar tab
Adds xterm@5 + xterm-addon-fit as devDependencies and a `vendor:xterm`
build step that copies the assets into `extension/lib/` at build time.
The vendored files are .gitignored so the npm version stays the source
of truth. xterm@5 is eval-free, so no MV3 CSP changes needed.
No runtime callers yet — this just stages the assets.
* feat(server): add pty-session-cookie module for the Terminal tab
Mirrors `sse-session-cookie.ts` exactly. Mints short-lived 30-min HttpOnly
cookies for authenticating the Terminal-tab WebSocket upgrade against
the terminal-agent. Same TTL, same opportunistic-pruning shape, same
"scoped tokens never valid as root" invariant. Two registries instead of
one because the cookie names are different (`gstack_sse` vs `gstack_pty`)
and the token spaces must not overlap.
No callers yet — wired up in the next commit.
* feat(server): add terminal-agent.ts (PTY for the Terminal sidebar tab)
Translates phoenix gbrowser's Go PTY (cmd/gbd/terminal.go) into a Bun
non-compiled process. Lives separately from `sidebar-agent.ts` so a
WS-framing or PTY-cleanup bug can't take down the chat path (codex
outside-voice review caught the coupling risk).
Architecture:
- Bun.serve on 127.0.0.1:0 (never tunneled).
- POST /internal/grant accepts cookie tokens from the parent server over
loopback, authenticated with a per-boot internal token.
- GET /ws upgrades require BOTH (a) Origin: chrome-extension://<id> and
(b) the gstack_pty cookie minted by /pty-session. Either gate alone is
insufficient (CSWSH defense + auth defense).
- Lazy spawn: claude PTY is not started until the WS receives its first
data frame. Idle sidebar opens cost nothing.
- Bun PTY API: `terminal: { rows, cols, data(t, chunk) }` — verified at
impl time on Bun 1.3.10. proc.terminal.write() for input,
proc.terminal.resize() for resize, proc.kill() + 3s SIGKILL fallback
on close.
- process.on('uncaughtException'|'unhandledRejection') handlers so a
framing bug logs but doesn't kill the listener loop.
Test-only `BROWSE_TERMINAL_BINARY` env override lets the integration
tests spawn /bin/bash instead of requiring claude on every CI runner.
Not yet spawned by anything — wired in the next commit.
* feat(server): wire /pty-session route + spawn terminal-agent
Server-side glue connecting the Terminal sidebar tab to the new
terminal-agent process.
server.ts:
- New POST /pty-session route. Validates AUTH_TOKEN, mints a gstack_pty
HttpOnly cookie via pty-session-cookie.ts, posts the cookie value to
the agent's loopback /internal/grant. Returns the terminalPort + Set-Cookie
to the extension.
- /health response gains `terminalPort` (just the port number — never a
shell token). Tokens flow via the cookie path, never /health, because
/health already surfaces AUTH_TOKEN to localhost callers in headed mode
(that's a separate v1.1+ TODO).
- /pty-session and /terminal/* are deliberately NOT added to TUNNEL_PATHS,
so the dual-listener tunnel surface 404s by default-deny.
- Shutdown path now also pkills terminal-agent and unlinks its state files
(terminal-port + terminal-internal-token) so a reconnect doesn't try to
hit a dead port.
cli.ts:
- After spawning sidebar-agent.ts, also spawn terminal-agent.ts. Same
pattern: pkill old instances, Bun.spawn(['bun', 'run', script]) with
BROWSE_STATE_FILE + BROWSE_SERVER_PORT env. Non-fatal if the spawn
fails — chat still works without the terminal agent.
* feat(extension): Terminal as default sidebar tab
Adds a primary tab bar (Terminal | Chat) above the existing tab-content
panes. Terminal is the default-active tab; clicking Chat returns to the
existing claude -p one-shot flow which is preserved verbatim.
manifest.json: adds ws://127.0.0.1:*/ to host_permissions so MV3 doesn't
block the WebSocket upgrade.
sidepanel.html: new primary-tabs nav, new #tab-terminal pane with a
"Press any key to start Claude Code" bootstrap card, claude-not-found
install card, xterm mount point, and "session ended" restart UI. Loads
xterm.js + xterm-addon-fit + sidepanel-terminal.js. tab-chat is no
longer the .active default.
sidepanel.js: new activePrimaryPaneId() helper that reads which primary
tab is selected. Debug-close paths now route back to whichever primary
pane is active (was hardcoded to tab-chat). Primary-tab click handler
toggles .active classes and aria-selected. window.gstackServerPort and
window.gstackAuthToken exposed so sidepanel-terminal.js can build the
/pty-session POST and the WS URL.
sidepanel-terminal.js (new): xterm.js lifecycle. Lazy-spawn — first
keystroke fires POST /pty-session, then opens
ws://127.0.0.1:<terminalPort>/ws. Origin + cookie are set automatically
by the browser. Resize observer sends {type:"resize"} text frames.
ResizeObserver, tab-switch hooks, restart button, install-card retry.
On WS close shows "Session ended, click to restart" — no auto-reconnect
(codex outside-voice flagged that as session-burning).
sidepanel.css: primary-tabs bar + Terminal pane styling (full-height
xterm container, install card, ended state).
* test: terminal-agent + cookie module + sidebar default-tab regression
Three new test files:
terminal-agent.test.ts (16 tests): pty-session-cookie mint/validate/
revoke, Set-Cookie shape (HttpOnly + SameSite=Strict + Path=/, NO Secure
since 127.0.0.1 over HTTP), source-level guards that /pty-session and
/terminal/* are NOT in TUNNEL_PATHS, /health does NOT surface ptyToken
or gstack_pty, terminal-agent binds 127.0.0.1, /ws upgrade enforces
chrome-extension:// Origin AND gstack_pty cookie, lazy-spawn invariant
(spawnClaude is called from message handler, not upgrade), uncaughtException/
unhandledRejection handlers exist, SIGINT-then-SIGKILL cleanup.
terminal-agent-integration.test.ts (7 tests): spawns the agent as a real
subprocess in a tmp state dir. Verifies /internal/grant accepts/rejects
the loopback token, /ws gates (no Origin → 403, bad Origin → 403, no
cookie → 401), real WebSocket round-trip with /bin/bash via the
BROWSE_TERMINAL_BINARY override (write 'echo hello-pty-world\n', read it
back), and resize message acceptance.
sidebar-tabs.test.ts (13 tests): structural regression suite locking the
load-bearing invariants of the default-tab change — Terminal is .active,
Chat is not, xterm assets are loaded, debug-close path no longer hardcodes
tab-chat (uses activePrimaryPaneId), primary-tab click handler exists,
chat surface is not accidentally deleted, terminal JS does NOT auto-
reconnect on close, manifest declares ws:// + http:// localhost host
permissions, no unsafe-eval.
Plan called for Playwright + extension regression; the codebase doesn't
ship Playwright extension launcher infra, so we follow the existing
extension-test pattern (source-level structural assertions). Same
load-bearing intent — locks the invariants before they regress.
* docs: Terminal flow + threat model + v1.1 follow-ups
SIDEBAR_MESSAGE_FLOW.md: new "Terminal flow" section. Documents the WS
upgrade path (/pty-session cookie mint → /ws Origin + cookie gate →
lazy claude spawn), the dual-token model (AUTH_TOKEN for /pty-session,
gstack_pty cookie for /ws, INTERNAL_TOKEN for server↔agent loopback),
and the threat-model boundary — the Terminal tab bypasses the entire
prompt-injection security stack on purpose; user keystrokes are the
trust source. That trust assumption is load-bearing on three transport
guarantees: local-only listener, Origin gate, cookie auth. Drop any
one of those three and the tab becomes unsafe.
CLAUDE.md: extends the "Sidebar architecture" note to include
terminal-agent.ts in the read-this-first list. Adds a "Terminal tab is
its own process" note so a future contributor doesn't bolt PTY logic
onto sidebar-agent.ts.
TODOS.md: three new follow-ups under a new "Sidebar Terminal" section:
- v1.1: PTY session survives sidebar reload (Issue 1C deferred).
- v1.1+: audit /health AUTH_TOKEN distribution (codex finding #2 —
a pre-existing soft leak that cc-pty-import sidesteps but doesn't
fix).
- v1.1+: apply terminal-agent's process.on exception handlers to
sidebar-agent.ts (codex finding #4 — chat path has no fatal
handlers).
* feat(extension): Terminal-only sidebar — auth fix, UX polish, chat rip
The chat queue path is gone. The Chrome side panel is now just an
interactive claude PTY in xterm.js. Activity / Refs / Inspector still
exist behind the `debug` toggle in the footer.
Three threads of change, all from dogfood iteration on top of
cc-pty-import:
1. fix(server): cross-port WS auth via Sec-WebSocket-Protocol
- Browsers can't set Authorization on a WebSocket upgrade. We had
been minting an HttpOnly gstack_pty cookie via /pty-session, but
SameSite=Strict cookies don't survive the cross-port jump from
server.ts:34567 to the agent's random port from a chrome-extension
origin. The WS opened then immediately closed → "Session ended."
- /pty-session now also returns ptySessionToken in the JSON body.
- Extension calls `new WebSocket(url, [`gstack-pty.<token>`])`.
Browser sends Sec-WebSocket-Protocol on the upgrade.
- Agent reads the protocol header, validates against validTokens,
and MUST echo the protocol back (Chromium closes the connection
immediately if a server doesn't pick one of the offered protocols).
- Cookie path is kept as a fallback for non-browser callers (curl,
integration tests).
- New integration test exercises the full protocol-auth round-trip
via raw fetch+Upgrade so a future regression of this exact class
fails in CI.
2. fix(extension): UX polish on the Terminal pane
- Eager auto-connect when the sidebar opens — no "Press any key to
start" friction every reload.
- Always-visible ↻ Restart button in the terminal toolbar (not
gated on the ENDED state) so the user can force a fresh claude
mid-session.
- MutationObserver on #tab-terminal's class attribute drives a
fitAddon.fit() + term.refresh() when the pane becomes visible
again — xterm doesn't auto-redraw after display:none → display:flex.
3. feat(extension): rip the chat tab + sidebar-agent.ts
- Sidebar is Terminal-only. No more Terminal | Chat primary nav.
- sidebar-agent.ts deleted. /sidebar-command, /sidebar-chat,
/sidebar-agent/event, /sidebar-tabs* and friends all deleted.
- The pickSidebarModel router (sonnet vs opus) is gone — the live
PTY uses whatever model the user's `claude` CLI is configured with.
- Quick-actions (🧹 Cleanup / 📸 Screenshot / 🍪 Cookies) survive
in the Terminal toolbar. Cleanup now injects its prompt into the
live PTY via window.gstackInjectToTerminal — no more
/sidebar-command POST. The Inspector "Send to Code" action uses
the same injection path.
- clear-chat button removed from the footer.
- sidepanel.js shed ~900 lines of chat polling, optimistic UI,
stop-agent, etc.
Net diff: -3.4k lines across 16 files. CLAUDE.md, TODOS.md, and
docs/designs/SIDEBAR_MESSAGE_FLOW.md rewritten to match. The sidebar
regression test (browse/test/sidebar-tabs.test.ts) is rewritten as 27
structural assertions locking the new layout — Terminal sole pane,
no chat input, quick-actions in toolbar, eager-connect, MutationObserver
repaint, restart helper.
* feat: live tab awareness for the Terminal pane
claude in the PTY now has continuous tab-aware context. Three pieces:
1. Live state files. background.js listens to chrome.tabs.onActivated /
onCreated / onRemoved / onUpdated (throttled to URL/title/status==
complete so loading spinners don't spam) and pushes a snapshot. The
sidepanel relays it as a custom event; sidepanel-terminal.js sends
{type:"tabState"} text frames over the live PTY WebSocket.
terminal-agent.ts writes:
<stateDir>/tabs.json all open tabs (id, url, title, active,
pinned, audible, windowId)
<stateDir>/active-tab.json current active tab (skips chrome:// and
chrome-extension:// internal pages)
Atomic write via tmp + rename so claude never reads a half-written
document. A fresh snapshot is pushed on WS open so the files exist by
the time claude finishes booting.
2. New $B tab-each <command> [args...] meta-command. Fans out a single
command across every open tab, returns
{command, args, total, results: [{tabId, url, title, status, output}]}.
Skips chrome:// pages; restores the originally active tab in a finally
block (so a mid-batch error doesn't leave the user looking at a
different tab); uses bringToFront: false so the OS window doesn't
jump on every fanout. Scope-checks the inner command BEFORE the loop.
3. --append-system-prompt hint at spawn time. Claude is told about both
the state files and the $B tab-each command up front, so it doesn't
have to discover the surface by trial. Passed via the --append-system-
prompt CLI flag, NOT as a leading PTY write — the hint stays out of
the visible transcript.
Tests:
- browse/test/tab-each.test.ts (new) — registration + source-level
invariants (scope check before loop, finally-restore, bringToFront:false,
chrome:// skip) + behavior tests with a mock BrowserManager that verify
iteration order, JSON shape, error handling, and active-tab restore.
- browse/test/terminal-agent.test.ts — three new assertions for
tabState handler shape, atomic-write pattern, and the
--append-system-prompt wiring at spawn.
Verified live: opened 5 tabs, ran $B tab-each url against the live
server, got per-tab JSON results back, original active tab restored
without OS focus stealing.
* chore: drop sidebar-agent test refs after chat rip
Five test files / describe blocks targeted the deleted chat path:
- browse/test/security-e2e-fullstack.test.ts (full-stack chat-pipeline E2E
with mock claude — whole file gone)
- browse/test/security-review-fullstack.test.ts (review-flow E2E with real
classifier — whole file gone)
- browse/test/security-review-sidepanel-e2e.test.ts (Playwright E2E for
the security event banner that was ripped from sidepanel.html)
- browse/test/security-audit-r2.test.ts (5 describe blocks: agent queue
permissions, isValidQueueEntry stateFile traversal, loadSession session-ID
validation, switchChatTab DocumentFragment, pollChat reentrancy guard,
/sidebar-tabs URL sanitization, sidebar-agent SIGTERM→SIGKILL escalation,
AGENT_SRC top-level read converted to graceful fallback)
- browse/test/security-adversarial-fixes.test.ts (canary stream-chunk split
detection on detectCanaryLeak; one tool-output test on sidebar-agent)
- test/skill-validation.test.ts (sidebar agent #584 describe block)
These all assumed sidebar-agent.ts existed and tested chat-queue plumbing,
chat-tab DOM round-trip, chat-polling reentrancy, or per-message classifier
canary detection. With the live PTY there is no chat queue, no chat tab,
no LLM stream to canary-scan, and no per-message subprocess. The Terminal
pane's invariants are covered by the new browse/test/sidebar-tabs.test.ts
(27 structural assertions), browse/test/terminal-agent.test.ts, and
browse/test/terminal-agent-integration.test.ts.
bun test → exit 0, 0 failures.
* chore: bump version and changelog (v1.14.0.0)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(extension): xterm fills the full Terminal panel height
The Terminal pane only rendered into the top portion of the panel — most
of the panel below the prompt was an empty black gap. Three layered
issues, all about xterm.js measuring dimensions during a layout state
that wasn't ready yet:
1. order-of-operations in connect(): ensureXterm() ran BEFORE
setState(LIVE), so term.open() measured els.mount while it was still
display:none. xterm caches a 0-size viewport synchronously inside
open() and never auto-recovers when the container goes visible.
Flipped: setState(LIVE) → ensureXterm.
2. first fit() ran synchronously before the browser had applied the
.active class transition. Wrapped in requestAnimationFrame so layout
has settled before fit() reads clientHeight.
3. CSS flex-overflow trap: .terminal-mount has flex:1 inside the
flex-column #tab-terminal, but .tab-content's `overflow-y: auto` and
the lack of `min-height: 0` on .terminal-mount meant the item
couldn't shrink below content size. flex:1 then refused to expand
into available space and xterm rendered into whatever its initial
2x2 measurement happened to be.
Fixes:
- extension/sidepanel-terminal.js: reorder + RAF fit
- extension/sidepanel.css: .terminal-mount gets `flex: 1 1 0` +
`min-height: 0` + `position: relative`. #tab-terminal overrides
.tab-content's `overflow-y: auto` to `overflow: hidden` (xterm has
its own viewport scroll; the parent shouldn't compete) and explicitly
re-declares `display: flex; flex-direction: column` for #tab-terminal.active.
bun test browse/test/sidebar-tabs.test.ts → 27/27 pass.
Manually verified: side panel opens → Terminal fills full panel height,
xterm scrollback works, debug-tab toggle still repaints correctly.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1433 lines
80 KiB
Markdown
1433 lines
80 KiB
Markdown
# TODOS
|
|
|
|
## Sidebar Terminal (cc-pty-import follow-ups)
|
|
|
|
### v1.1: PTY session survives sidebar reload
|
|
|
|
**What:** Today the Terminal tab's PTY dies with the WebSocket — sidebar
|
|
reload, side-panel close, even a quick navigate-away in another tab close
|
|
the session. v1.1 should key the PTY on a tab/session id so a reload
|
|
reattaches to the existing claude process and you keep `/resume` history.
|
|
|
|
**Why:** Mid-task resilience. When you've been pair-programming with claude
|
|
for 20 minutes and an accidental Cmd-R blows it away, the cost is real.
|
|
|
|
**Pros:** Better UX, fewer interrupted sessions. **Cons:** Session-tracking
|
|
state, ghost-process risk, lifecycle bugs (when DOES the PTY actually go
|
|
away?). v1 chose the simple "PTY dies with WS" model deliberately.
|
|
|
|
**Context:** /plan-eng-review Issue 1C decision (cc-pty-import branch,
|
|
2026-04-25). v1 ships with phoenix's lifecycle. **Depends on:**
|
|
cc-pty-import landed.
|
|
|
|
**Priority:** P2 (nice-to-have).
|
|
**Effort:** M. Likely needs a per-tab session map keyed by chrome.tabs.id
|
|
plus a TTL so abandoned PTYs eventually exit.
|
|
|
|
---
|
|
|
|
### v1.1+: Audit `/health` token distribution
|
|
|
|
**What:** Codex's outside-voice review on cc-pty-import flagged that
|
|
`/health` already surfaces `AUTH_TOKEN` to any localhost caller in headed
|
|
mode (`server.ts:1657`). That's a pre-existing soft leak — anything
|
|
running on localhost gets the root token by hitting `/health`.
|
|
|
|
**Why:** cc-pty-import sidesteps it by NOT putting the PTY token there
|
|
(uses an HttpOnly cookie path instead). But the underlying leak is still
|
|
shippable surface. A second extension or a localhost web app could
|
|
currently scrape `AUTH_TOKEN` and hit any browse-server endpoint.
|
|
|
|
**Pros:** Closes a real privilege-escalation path on multi-extension
|
|
machines. **Cons:** Either we tighten the gate (Origin must be OUR
|
|
extension id, not just any chrome-extension://) or we move bootstrap
|
|
discovery off `/health` entirely. Either has migration cost for tests
|
|
and the existing extension.
|
|
|
|
**Context:** codex finding #2 on cc-pty-import plan-eng review. Not in
|
|
scope of that PR; deliberately deferred to keep PTY-import small.
|
|
|
|
**Priority:** P2.
|
|
**Effort:** M.
|
|
|
|
---
|
|
|
|
## Testing
|
|
|
|
### Pre-existing test failures surfaced during v1.12.0.0 ship
|
|
|
|
**What:** Two remaining test failures on bare main that have been shipping as-is for multiple versions. (The bearer-json secret-scan regression flagged here originally was a real leak path and has been fixed in this PR — see Completed section below.)
|
|
|
|
1. `gstack-config gbrain keys > GSTACK_HOME overrides real config dir` (`test/brain-sync.test.ts:104`) — the GSTACK_HOME env override leaks into the real `~/.gstack/config.yaml`. Test asserts real config does NOT contain `gbrain_sync_mode: full` but it does. Either the test environment isn't isolated correctly or `bin/gstack-config` is writing to both locations.
|
|
2. `Opus 4.7 overlay — pacing directive > keeps Fan out / Effort-match / Literal interpretation nudges` (`test/model-overlay-opus-4-7.test.ts:87`) — v1.10.1.0 (#1166) removed the "Fan out explicitly" nudge from the overlay but the assertion was never updated. Either the nudge should come back (intentional removal reverted) or the test should be updated to match the new expected content.
|
|
|
|
**Why:** Both have been green-washing through recent `/ship` runs via "pre-existing test failures skipped: <name>." #1 signals a real config isolation bug; #2 is a stale assertion since the overlay intentionally removed that nudge.
|
|
|
|
**Priority:** P0 (both)
|
|
|
|
**Effort:** S each. #1 likely a test harness fix in `test/brain-sync.test.ts`'s setup hook. #2 is a one-line test update OR a revert of #1166.
|
|
|
|
---
|
|
|
|
### `security-bench-haiku-responses.json` is 27MB, violates the 2MB tracked-file gate
|
|
|
|
**What:** `browse/test/fixtures/security-bench-haiku-responses.json` landed on main at v1.6.4.0 (PR #1135) at 27MB. The `no compiled binaries in git > git tracks no files larger than 2MB` gate in `test/skill-validation.test.ts:1623` fails on main and on every feature branch that merges main afterward.
|
|
|
|
**Why:** The fixture is a legitimate CI replay corpus (real Haiku responses from the 500-case BrowseSafe-Bench) used to verify the ensemble classifier deterministically. But 13x over the 2MB limit means it will keep failing the validation test for every future ship.
|
|
|
|
**Pros:** Removes a pre-existing failure that wastes a triage slot in every /ship run.
|
|
|
|
**Cons:** Moving to git-lfs adds a dependency. Splitting into chunks risks breaking the bench test. External hosting adds a CI fetch step.
|
|
|
|
**Context:** Noticed during workspace-aware-ship /ship on 2026-04-23 when the post-merge test suite flagged this single failure. Introduced on main in PR #1135 (`v1.6.4.0: cut Haiku classifier FP from 44% to 23%`), commit d75402bb. Two reasonable paths: (a) split into multiple ≤2MB chunks and load them in the bench test, (b) move to git-lfs.
|
|
|
|
**Effort:** M (human: ~2-3h / CC: ~20 min)
|
|
**Priority:** P1 (not blocking ship, but every future /ship triages the same failure)
|
|
**Depends on:** nothing
|
|
|
|
---
|
|
|
|
## P1: Structural STOP-Ask forcing function across all skills
|
|
|
|
**What:** Design and implement a structural forcing function that catches when a skill mandates per-issue AskUserQuestion but the model silently substitutes batch-synthesis. Candidate mechanisms: question-count assertion (skill declares expected question count in frontmatter; post-run audit logs if model fired <N), typed question templates (skill hands the model pre-built AskUserQuestion payloads rather than prose instructions), or a canUseTool-based post-run audit that compares declared-gates-fired vs expected.
|
|
|
|
**Why:** The authoritative "Skill Invocation During Plan Mode" rule (hoisted to preamble position 1) tells the model AskUserQuestion satisfies plan mode's end-of-turn requirement. That fixes plan-mode entry, but NOT the broader class of failures: the model silently substitutes batch-synthesis for STOP-Ask loops whenever the skill's interactive contract collides with any other rule surface (auto mode, tool-count anxiety, cognitive load). Without structural enforcement, every skill with STOP-per-issue contracts remains vulnerable.
|
|
|
|
**Pros:** Catches a class-of-bug, not an instance. Applies to every skill that declares STOP gates. Builds on `canUseTool` primitive in `test/helpers/agent-sdk-runner.ts`.
|
|
|
|
**Cons:** Real design work. How does a skill declare expected question count — static value in frontmatter, or dynamic based on number of review sections that surface findings? Is the audit inline (blocking, same-turn) or post-hoc (after skill completion)? Calibration of expected-vs-actual thresholds depends on real V0 question-log data across skills.
|
|
|
|
**Context:** Relevant files — `scripts/question-registry.ts` (typed question catalog), `scripts/resolvers/question-tuning.ts` (preference classification), `bin/gstack-question-log` (event log), `bin/gstack-question-preference` (read/write preferences), `test/helpers/agent-sdk-runner.ts` (canUseTool harness). Existing question-log already captures fire events; the gap is declaring expected counts and auditing against them.
|
|
|
|
**Effort:** L (human: ~1-2 weeks / CC+gstack: ~2-3 hours for design doc + first-pass implementation).
|
|
**Priority:** P1 if interactive-skill volume is growing; P2 otherwise.
|
|
**Depends on / blocked by:** design doc — likely its own `docs/designs/STOP_ASK_ENFORCEMENT_V0.md`.
|
|
|
|
## Context skills
|
|
|
|
### `/context-save --lane` + `/context-restore --lane` for parallel workstreams
|
|
|
|
**What:** Let users save and restore per-workstream (lane) context independently. On save: `/context-save --lane A "backend refactor"` writes a lane-tagged file. Or `/context-save lanes` reads the "Parallelization Strategy" section of the most recent plan file and auto-generates one saved context per lane. On restore: `/context-restore --lane A` loads just that lane's context. Useful when a plan has 3 independent workstreams and the user wants to pick one up in each of 3 Conductor windows.
|
|
|
|
**Why:** Plans produced by `/plan-eng-review` already emit a lane table (Lane A: touches `models/` and `controllers/` sequentially; Lane B: touches `api/` independently; etc.). Right now there's no way to transfer that structure into resumable saved state. Users manually re-describe the scope in each window. Lane-tagged save/restore would be the bridge between "here's the plan" and "three people (or three AIs) are now working in parallel on it."
|
|
|
|
**Pros:** Turns `/plan-eng-review`'s parallelization output into actionable resume state. Reduces context-loss across Conductor workspace handoffs for multi-workstream plans.
|
|
|
|
**Cons:** Net-new functionality (not a port from the old `/checkpoint` skill). The "spawn new Conductor windows" part needs research into whether Conductor has a spawn CLI. Also requires lane-tagging discipline in the save step (manual or extracted).
|
|
|
|
**Context:** Source of the lane data model is `plan-eng-review/SKILL.md.tmpl:240-249` (the "Parallelization Strategy" output with Lane A/B/C dependency tables and conflict flags). Deferred from the v0.18.5.0 rename PR so the rename could land as a tight, low-risk fix. Saved files currently live at `~/.gstack/projects/$SLUG/checkpoints/YYYYMMDD-HHMMSS-<title>.md` with YAML frontmatter (branch, timestamp, etc.). The lane feature would add a `lane:` field to frontmatter and a `--lane` filter to both skills.
|
|
|
|
**Effort:** M (human: ~1-2 days / CC: ~45-60 min)
|
|
**Priority:** P3 (nice-to-have, not blocking anyone yet)
|
|
**Depends on:** `/context-save` + `/context-restore` rename stable in production (v1.0.1.0+). Research: does Conductor expose a spawn-workspace CLI?
|
|
|
|
## P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1)
|
|
|
|
**What:** Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in `docs/designs/PACING_UPDATES_V0.md`. Requires: session-state model, `phase` field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), `bin/gstack-flip-decision` command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values.
|
|
|
|
**Why:** Louise de Sadeleer's "yes yes yes" during `/autoplan` was pacing + agency, not (only) jargon density. V1 addresses jargon (ELI10 writing). V1.1 addresses the interruption-volume half. Without this, V1 only gets halfway to the HOLY SHIT outcome.
|
|
|
|
**Pros:** End-to-end answer to Louise's feedback. Ships real calibration data from V1 usage. Completes the V0 → V2 pacing arc started in PLAN_TUNING_V0.
|
|
|
|
**Cons:** Substantial scope (10 items in `docs/designs/PACING_UPDATES_V0.md`). Needs its own CEO + Codex + DX + Eng review cycle. Calibration depends on real V0 question-log distribution.
|
|
|
|
**Context:** PLAN_TUNING_V1 attempted to bundle pacing. Three eng-review passes + two Codex passes surfaced 10 structural gaps unfixable via plan-text editing. Extracted to V1.1 as a dedicated plan.
|
|
|
|
**Depends on / blocked by:** V1 shipping (provides Louise's baseline transcript for calibration).
|
|
|
|
## Plan Tune (v2 deferrals from v0.19.0.0 rollback)
|
|
|
|
All six items are gated on v1 dogfood results and the acceptance criteria in
|
|
`docs/designs/PLAN_TUNING_V0.md`. They were explicitly deferred after Codex's
|
|
outside-voice review drove a scope rollback from the CEO EXPANSION plan. v1
|
|
ships the observational substrate only; v2 adds behavior adaptation.
|
|
|
|
### E1 — Substrate wiring (5 skills consume profile)
|
|
|
|
**What:** Add `{{PROFILE_ADAPTATION:<skill>}}` placeholder to ship, review,
|
|
office-hours, plan-ceo-review, plan-eng-review SKILL.md.tmpl files. Implement
|
|
`scripts/resolvers/profile-consumer.ts` with a per-skill adaptation registry
|
|
(`scripts/profile-adaptations/{skill}.ts`). Each consumer reads
|
|
`~/.gstack/developer-profile.json` on preamble and adapts skill-specific
|
|
defaults (verbosity, mode selection, severity thresholds, pushback intensity).
|
|
|
|
**Why:** v1 observational profile writes a file nobody reads. The substrate
|
|
claim only becomes real when skills actually consume it. Without this, /plan-tune
|
|
is a fancy config page.
|
|
|
|
**Pros:** gstack feels personal. Every skill adapts to the user's steering
|
|
style instead of defaulting to middle-of-the-road.
|
|
|
|
**Cons:** Risk of psychographic drift if profile is noisy. Requires calibrated
|
|
profile (v1 acceptance criteria: 90+ days stable across 3+ skills).
|
|
|
|
**Context:** See `docs/designs/PLAN_TUNING_V0.md` §Deferred to v2. v1 ships the
|
|
signal map + inferred computation; it's displayed in /plan-tune but no skill
|
|
reads it yet.
|
|
|
|
**Effort:** L (human: ~1 week / CC: ~4h)
|
|
**Priority:** P0
|
|
**Depends on:** 2+ weeks of v1 dogfood, profile diversity check passing.
|
|
|
|
### E3 — `/plan-tune narrative` + `/plan-tune vibe`
|
|
|
|
**What:** Event-anchored narrative ("You accepted 7 scope expansions, overrode
|
|
test_failure_triage 4 times, called every PR 'boil the lake'") + one-word vibe
|
|
archetype (Cathedral Builder, Ship-It Pragmatist, Deep Craft, etc).
|
|
scripts/archetypes.ts is ALREADY SHIPPED in v1 (8 archetypes + Polymath
|
|
fallback). v2 work is the narrative generator + /plan-tune skill wiring.
|
|
|
|
**Why:** Makes profile tangible and shareable. Screenshot-able.
|
|
|
|
**Pros:** Killer delight feature. Social surface for gstack. Concrete, specific
|
|
output anchored in real events (not generic AI slop).
|
|
|
|
**Cons:** Requires stable inferred profile — without calibration it produces
|
|
generic paragraphs. Gen-tests need to validate no-slop.
|
|
|
|
**Context:** Archetypes already defined. Just need the /plan-tune narrative
|
|
subcommand + slop-check test.
|
|
|
|
**Effort:** S+ (human: ~1 day / CC: ~1h)
|
|
**Priority:** P0
|
|
**Depends on:** Calibrated profile (>= 20 events, 3+ skills, 7+ days span).
|
|
|
|
### E4 — Blind-spot coach
|
|
|
|
**What:** Preamble injection that surfaces the OPPOSITE of the user's profile
|
|
once per session per tier >= 2 skill. Boil-the-ocean user gets challenged on
|
|
scope ("what's the 80% version?"); small-scope user gets challenged on ambition.
|
|
`scripts/resolvers/blind-spot-coach.ts`. Marker file for session dedup. Opt-out
|
|
via `gstack-config set blind_spot_coach false`.
|
|
|
|
**Why:** Makes gstack a coach (challenges you) instead of a mirror (reflects
|
|
you). The killer differentiation vs. a settings menu.
|
|
|
|
**Pros:** The feature that makes gstack feel like Garry. Surfaces assumptions
|
|
the user hasn't challenged.
|
|
|
|
**Cons:** Logically conflicts with E1 (which adapts TO profile) and E6 (which
|
|
flags mismatch). Requires interaction-budget design: global session budget +
|
|
escalation rules + explicit exclusion from mismatch detection. Risk of feeling
|
|
like a nag if fires wrong.
|
|
|
|
**Context:** v2 must redesign to resolve the E1/E4/E6 composition issue Codex
|
|
caught. Dogfood required to calibrate frequency.
|
|
|
|
**Effort:** M (human: ~3 days / CC: ~2h design + ~1h impl)
|
|
**Priority:** P0
|
|
**Depends on:** E1 shipped + interaction-budget design spec.
|
|
|
|
### E5 — LANDED celebration HTML page
|
|
|
|
**What:** When a PR authored by the user is newly merged to the base branch,
|
|
open an animated HTML celebration page in the browser. Confetti + typewriter
|
|
headline + stats counter. Shows: what we built (PR stats + CHANGELOG entry),
|
|
road traveled (scope decisions from CEO plan), road not traveled (deferred
|
|
items), where we're going (next TODOs), who you are as a builder (vibe +
|
|
narrative + profile delta for this ship). Self-contained HTML (CSS animations
|
|
only, no JS deps).
|
|
|
|
**CRITICAL REVISION from v0 plan:** Passive detection must NOT live in the
|
|
preamble (Codex #9). When promoted, moves to explicit `/plan-tune show-landed`
|
|
OR post-ship hook — not passive detection in the hot path.
|
|
|
|
**Why:** Biggest personality moment in gstack. The "one-word thing that makes
|
|
you remember why you built this."
|
|
|
|
**Pros:** Screenshot-worthy. Shareable. The kind of dopamine hit that turns
|
|
power users into evangelists.
|
|
|
|
**Cons:** Product theater if the substrate isn't solid. Needs /design-shotgun
|
|
→ /design-html for the visual direction. Requires E2 unified profile for
|
|
narrative/vibe data.
|
|
|
|
**Context:** /land-and-deploy trust/adoption is low, so passive detection is
|
|
the right trigger shape. Dedup marker per PR in `~/.gstack/.landed-celebrated-*`.
|
|
E2E tests for squash/merge-commit/rebase/co-author/fresh-clone/dedup variants.
|
|
|
|
**Effort:** M+ (human: ~1 week / CC: ~3h total)
|
|
**Priority:** P0
|
|
**Depends on:** E3 narrative/vibe shipped. /design-shotgun run on real PR data
|
|
to pick a visual direction, then /design-html to finalize.
|
|
|
|
### E6 — Auto-adjustment based on declared ↔ inferred mismatch
|
|
|
|
**What:** Currently `/plan-tune` shows the gap between declared and inferred
|
|
(v1 observational). v2 auto-suggests declaration updates when the gap exceeds
|
|
a threshold ("Your profile says hands-off but you've overridden 40% of
|
|
recommendations — you're actually taste-driven. Update declared autonomy from
|
|
0.8 to 0.5?"). Requires explicit user confirmation before any mutation (Codex
|
|
trust-boundary #15 already baked into v1).
|
|
|
|
**Why:** Profile drifts silently without correction. Self-correcting profile
|
|
stays honest.
|
|
|
|
**Pros:** Profile becomes more accurate over time. User sees the gap and
|
|
decides.
|
|
|
|
**Cons:** Requires stable inferred profile (diversity check). False positives
|
|
nag the user.
|
|
|
|
**Context:** v1 has `--check-mismatch` that flags > 0.3 gaps but doesn't
|
|
suggest fixes. v2 adds the suggestion UX + per-dimension threshold tuning from
|
|
real data.
|
|
|
|
**Effort:** S (human: ~1 day / CC: ~45min)
|
|
**Priority:** P0
|
|
**Depends on:** Calibrated profile + real mismatch data from v1 dogfood.
|
|
|
|
### E7 — Psychographic auto-decide
|
|
|
|
**What:** When inferred profile is calibrated AND a question is two-way AND
|
|
the user's dimensions strongly favor one option, auto-choose without asking
|
|
(visible annotation: "Auto-decided via profile. Change with /plan-tune."). v1
|
|
only auto-decides via EXPLICIT per-question preferences; v2 adds profile-driven
|
|
auto-decide.
|
|
|
|
**Why:** The whole point of the psychographic. Silent, correct defaults based
|
|
on who the user IS, not just what they've said.
|
|
|
|
**Pros:** Friction-free skill invocation for calibrated power users. Over time,
|
|
gstack feels like it's reading your mind.
|
|
|
|
**Cons:** Highest-risk deferral. Wrong auto-decides are costly. Requires very
|
|
high confidence in the signal map AND calibration gate.
|
|
|
|
**Context:** v1 diversity gate is `sample_size >= 20 AND skills_covered >= 3
|
|
AND question_ids_covered >= 8 AND days_span >= 7`. v2 must prove this gate
|
|
actually catches noisy profiles before shipping.
|
|
|
|
**Effort:** M (human: ~3 days / CC: ~2h)
|
|
**Priority:** P0
|
|
**Depends on:** E1 (skills consuming profile) + real observed data showing
|
|
calibration gate is trustworthy.
|
|
|
|
## Browse
|
|
|
|
### Scope sidebar-agent kill to session PID, not `pkill -f sidebar-agent\.ts`
|
|
|
|
**What:** `shutdown()` in `browse/src/server.ts:1193` uses `pkill -f sidebar-agent\.ts` to kill the sidebar-agent daemon, which matches every sidebar-agent on the machine, not just the one this server spawned. Replace with PID tracking: store the sidebar-agent PID when `cli.ts` spawns it (via state file or env), then `process.kill(pid, 'SIGTERM')` in `shutdown()`.
|
|
|
|
**Why:** A user running two Conductor worktrees (or any multi-session setup), each with its own `$B connect`, closes one browser window ... and the other worktree's sidebar-agent gets killed too. The blast radius was there before, but the v0.18.1.0 disconnect-cleanup fix makes it more reachable: every user-close now runs the full `shutdown()` path, whereas before user-close bypassed it.
|
|
|
|
**Context:** Surfaced by /ship's adversarial review on v0.18.1.0. Pre-existing code, not introduced by the fix. Fix requires propagating the sidebar-agent PID from `cli.ts` spawn site (~line 885) into the server's state file so `shutdown()` can target just this session's agent. Related: `browse/src/cli.ts` spawns with `Bun.spawn(...).unref()` and already captures `agentProc.pid`.
|
|
|
|
**Effort:** S (human: ~2h / CC: ~15min)
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
|
|
## Sidebar Security
|
|
|
|
### ML Prompt Injection Classifier — v1 SHIPPED (branch garrytan/prompt-injection-guard)
|
|
|
|
**Status:** IN PROGRESS on branch `garrytan/prompt-injection-guard`. Classifier swap:
|
|
**TestSavantAI** replaces DeBERTa (better on developer content — HN/Reddit/Wikipedia/tech blogs all
|
|
score SAFE 0.98+, attacks score INJECTION 0.99+). Pre-impl gate 3 (benign corpus dry-run)
|
|
forced this pivot — see `~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-prompt-injection-guard.md`.
|
|
|
|
**What shipped in v1:**
|
|
- `browse/src/security.ts` — canary injection + check, verdict combiner (ensemble rule),
|
|
attack log with rotation, cross-process session state, status reporting
|
|
- `browse/src/security-classifier.ts` — TestSavantAI ONNX classifier + Haiku transcript
|
|
classifier (reasoning-blind), both with graceful degradation
|
|
- Canary flows end-to-end: server.ts injects, sidebar-agent.ts checks every outbound
|
|
channel (text, tool args, URLs, file writes) and kills session on leak
|
|
- Pre-spawn ML scan of user message with ensemble rule (BLOCK requires both classifiers)
|
|
- `/health` endpoint exposes security status for shield icon
|
|
- 25 unit tests + 12 regression tests all passing
|
|
|
|
**Branch 2 architecture (decided from pre-impl gate 1):**
|
|
The ML classifier ONLY runs in `sidebar-agent.ts` (non-compiled bun script). The compiled
|
|
browse binary cannot link onnxruntime-node. Architectural controls (XML framing + allowlist)
|
|
defend the compiled-side ingress.
|
|
|
|
### ML Prompt Injection Classifier — v2 Follow-ups
|
|
|
|
#### ~~Cut Haiku false-positive rate from 44% toward ~15% (P0)~~ — SHIPPED in v1.5.2.0
|
|
|
|
Measured result (500-case BrowseSafe-Bench smoke): detection 67.3% → **56.2%**, FP 44.1% → **22.9%**. Gate passes (detection ≥ 55%, FP ≤ 25%). Knobs that landed: label-first ensemble voting (verdict label trumps numeric confidence for transcript layer), hallucination guard (`verdict=block` at conf < 0.40 → warn-vote), new `THRESHOLDS.SOLO_CONTENT_BLOCK = 0.92` for label-less content classifiers, label-first extension to toolOutput path, tighter Haiku prompt + 8 few-shot exemplars, pinned Haiku model, `claude -p` spawn from `os.tmpdir()` so CLAUDE.md can't poison the classifier, timeout bumped 15s → 45s. CI gate: `browse/test/security-bench-ensemble.test.ts` replays fixture, fail-closed on missing fixture + security-layer diff. The original plan's stop-loss revert order didn't move the FP needle (FPs came from single-layer-BLOCK paths, not ensemble); the real levers turned out to be architectural (label-first) plus a new decoupled threshold.
|
|
|
|
See CHANGELOG.md [1.5.2.0] for the full shipped summary.
|
|
|
|
#### Original spec (pre-ship, retained for archive)
|
|
|
|
**What:** v1 ships the Haiku transcript classifier on every tool output (Read/Grep/Bash/Glob/WebFetch). BrowseSafe-Bench smoke measured detection 67.3% + FP 44.1% — a 4.4x detection lift from L4-only, but FP tripled because Haiku is more aggressive than L4 on edge cases (phishing-style benign content, borderline social engineering). The review banner makes FPs recoverable but 44% is too high for a delightful default.
|
|
|
|
**Why:** User clicks review banner roughly every-other tool output = real UX friction. Tuning these four knobs together should cut FP to ~15-20% while keeping detection in the 60-70% range:
|
|
|
|
1. **Switch ensemble counting to Haiku's `verdict` field, not `confidence`.** Right now `combineVerdict` treats Haiku warn-at-0.6 as a BLOCK vote. Haiku reserves `verdict: "block"` for clear-cut cases and uses `"warn"` liberally. Count only `verdict === "block"` as a BLOCK vote; `warn` becomes a soft signal that participates in 2-of-N ensemble but doesn't single-handedly BLOCK.
|
|
2. **Tighten Haiku's classifier prompt.** Current prompt is generic. Rewrite to: "Return `block` only if the text contains explicit instruction-override, role-reset, exfil request, or malicious code execution. Return `warn` for social engineering that doesn't try to hijack the agent. Return `safe` otherwise." More specific instructions → fewer false flags.
|
|
3. **Add 6-8 few-shot exemplars to Haiku's prompt.** Pairs of (injection text → block) and (benign-looking-but-safe → safe). LLM few-shot consistently outperforms zero-shot on classification.
|
|
4. **Bump Haiku's WARN threshold from 0.6 to 0.75.** Borderline fires drop out of the ensemble pool.
|
|
|
|
Ship all four together, re-run BrowseSafe-Bench smoke, record before/after. Target: 60-70% detection / 15-25% FP.
|
|
|
|
**Effort:** S (human: ~1 day / CC: ~30-45 min + ~45min bench)
|
|
**Priority:** P0 (direct UX impact post-ship; ship v1 as-is with review banner, file this as the immediate follow-up)
|
|
**Depends on:** v1.4.0.0 prompt-injection-guard branch merged
|
|
|
|
#### Cache review decisions per (domain, payload-hash-prefix) (P1)
|
|
|
|
**What:** If Haiku fires on a page twice in the same session (e.g., user does Bash then Grep on the same suspicious file), the second fire shouldn't re-prompt. Cache the user's decision keyed by a per-session (domain, payloadHash-prefix) pair. Small LRU, ~100 entries, session-scoped (not persistent across sidebar restarts — we want fresh decisions on new sessions).
|
|
|
|
**Why:** Reduces review-banner fatigue when the same bit of sketchy content gets scanned multiple times via different tools. At 44% FP on v1, this matters most.
|
|
|
|
**Effort:** S (human: ~0.5 day / CC: ~20 min)
|
|
**Priority:** P1
|
|
|
|
#### Fine-tune a small classifier on BrowseSafe-Bench + Qualifire + xxz224 (P2 research)
|
|
|
|
**What:** TestSavantAI was trained on direct-injection text, wrong distribution for browser-agent attacks (measured 15% recall). Take BERT-base, fine-tune on BrowseSafe-Bench (3,680 cases) + Qualifire prompt-injection-benchmark (5k) + xxz224 (3.7k) combined, ship in ~/.gstack/models/ as replacement L4 classifier.
|
|
|
|
**Why:** Expected 15% → 70%+ recall on the actual threat distribution without needing Haiku. Would also cut latency (no CLI subprocess) and drop Haiku cost.
|
|
|
|
**Effort:** XL (human: ~3-5 days + ~$50 GPU / CC: ~4-6 hours setup + ~$50 GPU)
|
|
**Priority:** P2 research — validate the lift on a held-out test set before committing to replace TestSavant
|
|
|
|
#### DeBERTa-v3 ensemble as default (P2)
|
|
|
|
**What:** Flip `GSTACK_SECURITY_ENSEMBLE=deberta` from opt-in to default. Adds a 3rd ML vote; 2-of-3 agreement rule should reduce FPs while catching attacks that only DeBERTa sees.
|
|
|
|
**Why:** More votes = better calibration. Currently opt-in because 721MB is a big first-run download; flipping to default requires lazy-download UX.
|
|
|
|
**Cons:** 721MB first-run download for every user. Costs user bandwidth + disk.
|
|
|
|
**Effort:** M (human: ~2 days / CC: ~1 hour + UX)
|
|
**Priority:** P2 (after #1 tuning to see how much room is left)
|
|
|
|
#### User-feedback flywheel — decisions become training data (P3)
|
|
|
|
**What:** Every Allow/Block click is labeled data. Log (suspected_text hash, layer scores, user decision, ts) to ~/.gstack/security/feedback.jsonl. Aggregate via community-pulse when `telemetry: community`. Periodically retrain the classifier on aggregate feedback.
|
|
|
|
**Why:** The system gets better the more it's used. Closes the loop between user reality and defense quality.
|
|
|
|
**Cons:** Feedback loop can be poisoned if attacker controls enough devices. Need guardrails (stratified sampling, reviewer validation, k-anon minimums on training batch).
|
|
|
|
**Effort:** L (human: ~1 week for local logging + aggregation pipe, another week for retrain cron / CC: ~2-4 hours per sub-part)
|
|
**Priority:** P3 — only worth building after v2 tuning proves the architecture is the right shape
|
|
|
|
#### ~~Shield icon + canary leak banner UI (P0)~~ — SHIPPED
|
|
|
|
Banner landed in commits a9f702a7 (HTML+CSS, variant A mockup) + ffb064af
|
|
(JS wiring + security_event routing + a11y + Escape-to-dismiss). Shield
|
|
icon landed in 59e0635e with 3 states (protected/degraded/inactive),
|
|
custom SVG + mono SEC label per design review Pass 7, hover tooltip with
|
|
per-layer detail.
|
|
|
|
Known v1 limitation logged as follow-up: shield only updates at connect —
|
|
see "Shield icon continuous polling" above.
|
|
|
|
#### ~~Shield icon continuous polling (P2)~~ — SHIPPED
|
|
|
|
Commit 06002a82: `/sidebar-chat` response now includes `security:
|
|
getSecurityStatus()`, and sidepanel.js calls `updateSecurityShield(data.security)`
|
|
on every poll tick. Shield flips to 'protected' as soon as classifier warmup
|
|
completes (typically ~30s after initial connect on first run), no reload needed.
|
|
|
|
#### ~~Attack telemetry via gstack-telemetry-log (P1)~~ — SHIPPED
|
|
|
|
Landed in commits 28ce883c (binary) + f68fa4a9 (security.ts wiring). The
|
|
telemetry binary now accepts `--event-type attack_attempt --url-domain
|
|
--payload-hash --confidence --layer --verdict`. `logAttempt()` spawns the
|
|
binary fire-and-forget. Existing tier gating carries the events.
|
|
|
|
Downstream follow-up still open: update the `community-pulse` Supabase edge
|
|
function to accept the new event type and store in a typed `security_attempts`
|
|
table. Dashboard read path is a separate TODO ("Cross-user aggregate attack
|
|
dashboard" below).
|
|
|
|
#### Full BrowseSafe-Bench at gate tier (P2)
|
|
|
|
**What:** Promote `browse/test/security-bench.test.ts` from smoke-200 (gate) to full-3680
|
|
(gate) once smoke/full detection rate correlation is measured (~2 weeks post-ship).
|
|
|
|
**Why:** BrowseSafe-Bench is Perplexity's 3,680-case browser-agent injection benchmark.
|
|
Smoke-200 is a sample; full coverage catches the long tail. Run time ~5min hermetic.
|
|
|
|
**Effort:** S (CC: ~45min)
|
|
**Priority:** P2
|
|
**Depends on:** v1 shipped + ~2 weeks real data
|
|
|
|
#### ~~Cross-user aggregate attack dashboard (P2)~~ — CLI SHIPPED, web UI remains
|
|
|
|
CLI dashboard shipped in commits a5588ec0 (schema migration) + 2d107978
|
|
(community-pulse edge function security aggregation) + 756875a7 (bin/gstack-
|
|
security-dashboard). Users can now run `gstack-security-dashboard` to see
|
|
attacks last 7 days, top attacked domains, detection-layer distribution,
|
|
and verdict counts — all aggregated from the Supabase community-pulse pipe.
|
|
|
|
Web UI at gstack.gg/dashboard/security is still open — that's a separate
|
|
webapp project outside this repo's scope.
|
|
|
|
#### TestSavantAI ensemble → DeBERTa-v3 ensemble (P2) — SHIPPED (opt-in)
|
|
|
|
Commits b4e49d08 + 8e9ec52d + 4e051603 + 7a815fa7: DeBERTa-v3-base-injection-onnx
|
|
is now wired as an opt-in L4c ensemble classifier. Enable via
|
|
`GSTACK_SECURITY_ENSEMBLE=deberta` — sidebar-agent warmup downloads the 721MB
|
|
model to ~/.gstack/models/deberta-v3-injection/ on first run. combineVerdict
|
|
becomes a 2-of-3 agreement rule (testsavant + deberta + transcript) when
|
|
enabled. Default behavior unchanged (2-of-2 testsavant + transcript).
|
|
|
|
#### ~~TestSavantAI + DeBERTa-v3 ensemble~~ — SHIPPED opt-in (see entry above)
|
|
|
|
#### ~~Read/Glob/Grep tool-output injection coverage (P2)~~ — SHIPPED
|
|
|
|
Commits f2e80dd7 + 0098d574: sidebar-agent.ts now scans tool outputs from
|
|
Read, Glob, Grep, WebFetch, and Bash via `SCANNED_TOOLS` set. Content >= 32
|
|
chars runs through the ML ensemble; BLOCK verdict kills the session and
|
|
emits security_event. The content-security.ts envelope path was already
|
|
wrapping browse-command output; this extension closes the non-browse path
|
|
Codex flagged.
|
|
|
|
During /ship for v1.4.0.0 this path got additional hardening (commit
|
|
407c36b4 + 88b12c2b + c51ebdf4): transcript classifier now receives the
|
|
tool output text (was empty before), and combineVerdict accepts a
|
|
`toolOutput: true` opt that blocks on a single ML classifier at BLOCK
|
|
threshold (user-input default unchanged for SO-FP mitigation).
|
|
|
|
#### ~~Adversarial + integration + smoke-bench test suites (P1)~~ — SHIPPED
|
|
|
|
Four test files shipped this round:
|
|
* `browse/test/security-adversarial.test.ts` (94a83c50) — 23 canary-channel
|
|
+ verdict-combiner attack-shape tests
|
|
* `browse/test/security-integration.test.ts` (07745e04) — 10 layer-coexistence
|
|
+ defense-in-depth regression guards
|
|
* `browse/test/security-live-playwright.test.ts` (b9677519) — 7 live-Chromium
|
|
fixture tests (5 deterministic + 2 ML, skipped if model cache absent)
|
|
* `browse/test/security-bench.test.ts` (afc6661f) — BrowseSafe-Bench 200-case
|
|
smoke harness with hermetic dataset cache + v1 baseline metrics
|
|
|
|
#### Bun-native 5ms inference (P3 research) — SKELETON SHIPPED, forward pass open
|
|
|
|
Research skeleton landed this round (browse/src/security-bunnative.ts,
|
|
docs/designs/BUN_NATIVE_INFERENCE.md, browse/test/security-bunnative.test.ts):
|
|
|
|
* Pure-TS WordPiece tokenizer — reads HF tokenizer.json directly, matches
|
|
transformers.js output on fixture strings (correctness-tested in CI)
|
|
* Stable `classify()` API that current callers can wire against today
|
|
* Benchmark harness with p50/p95/p99 reporting — anchors v1 WASM baseline
|
|
for future regressions
|
|
|
|
Design doc captures the roadmap:
|
|
* Approach A: pure-TS + Float32Array SIMD — ruled out (can't beat WASM)
|
|
* Approach B: Bun FFI + Apple Accelerate cblas_sgemm — target ~3-6ms p50,
|
|
macOS-only, ~1000 LOC
|
|
* Approach C: Bun WebGPU — unexplored, worth a spike
|
|
|
|
Remaining work (XL, multi-week):
|
|
* FFI proof-of-concept for cblas_sgemm
|
|
* Single transformer layer implementation + correctness check vs onnxruntime
|
|
* Full forward pass + weight loader + correctness regression fixtures
|
|
* Production swap in security-bunnative.ts `classify()` body
|
|
|
|
## Builder Ethos
|
|
|
|
### First-time Search Before Building intro
|
|
|
|
**What:** Add a `generateSearchIntro()` function (like `generateLakeIntro()`) that introduces the Search Before Building principle on first use, with a link to the blog essay.
|
|
|
|
**Why:** Boil the Lake has an intro flow that links to the essay and marks `.completeness-intro-seen`. Search Before Building should have the same pattern for discoverability.
|
|
|
|
**Context:** Blocked on a blog post to link to. When the essay exists, add the intro flow with a `.search-intro-seen` marker file. Pattern: `generateLakeIntro()` at gen-skill-docs.ts:176.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
**Depends on:** Blog post about Search Before Building
|
|
|
|
## Chrome DevTools MCP Integration
|
|
|
|
### Real Chrome session access
|
|
|
|
**What:** Integrate Chrome DevTools MCP to connect to the user's real Chrome session with real cookies, real state, no Playwright middleman.
|
|
|
|
**Why:** Right now, headed mode launches a fresh Chromium profile. Users must log in manually or import cookies. Chrome DevTools MCP connects to the user's actual Chrome ... instant access to every authenticated site. This is the future of browser automation for AI agents.
|
|
|
|
**Context:** Google shipped Chrome DevTools MCP in Chrome 146+ (June 2025). It provides screenshots, console messages, performance traces, Lighthouse audits, and full page interaction through the user's real browser. gstack should use it for real-session access while keeping Playwright for headless CI/testing workflows.
|
|
|
|
Potential new skills:
|
|
- `/debug-browser`: JS error tracing with source-mapped stack traces
|
|
- `/perf-debug`: performance traces, Core Web Vitals, network waterfall
|
|
|
|
May replace `/setup-browser-cookies` for most use cases since the user's real cookies are already there.
|
|
|
|
**Effort:** L (human: ~2 weeks / CC: ~2 hours)
|
|
**Priority:** P0
|
|
**Depends on:** Chrome 146+, DevTools MCP server installed
|
|
|
|
## Browse
|
|
|
|
### Bundle server.ts into compiled binary
|
|
|
|
**What:** Eliminate `resolveServerScript()` fallback chain entirely — bundle server.ts into the compiled browse binary.
|
|
|
|
**Why:** The current fallback chain (check adjacent to cli.ts, check global install) is fragile and caused bugs in v0.3.2. A single compiled binary is simpler and more reliable.
|
|
|
|
**Context:** Bun's `--compile` flag can bundle multiple entry points. The server is currently resolved at runtime via file path lookup. Bundling it removes the resolution step entirely.
|
|
|
|
**Effort:** M
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
|
|
### Sessions (isolated browser instances)
|
|
|
|
**What:** Isolated browser instances with separate cookies/storage/history, addressable by name.
|
|
|
|
**Why:** Enables parallel testing of different user roles, A/B test verification, and clean auth state management.
|
|
|
|
**Context:** Requires Playwright browser context isolation. Each session gets its own context with independent cookies/localStorage. Prerequisite for video recording (clean context lifecycle) and auth vault.
|
|
|
|
**Effort:** L
|
|
**Priority:** P3
|
|
|
|
### Video recording
|
|
|
|
**What:** Record browser interactions as video (start/stop controls).
|
|
|
|
**Why:** Video evidence in QA reports and PR bodies. Currently deferred because `recreateContext()` destroys page state.
|
|
|
|
**Context:** Needs sessions for clean context lifecycle. Playwright supports video recording per context. Also needs WebM → GIF conversion for PR embedding.
|
|
|
|
**Effort:** M
|
|
**Priority:** P3
|
|
**Depends on:** Sessions
|
|
|
|
### v20 encryption format support
|
|
|
|
**What:** AES-256-GCM support for future Chromium cookie DB versions (currently v10).
|
|
|
|
**Why:** Future Chromium versions may change encryption format. Proactive support prevents breakage.
|
|
|
|
**Effort:** S
|
|
**Priority:** P3
|
|
|
|
### State persistence — SHIPPED
|
|
|
|
~~**What:** Save/load cookies + localStorage to JSON files for reproducible test sessions.~~
|
|
|
|
`$B state save/load` ships in v0.12.1.0. V1 saves cookies + URLs only (not localStorage, which breaks on load-before-navigate). Files at `.gstack/browse-states/{name}.json` with 0o600 permissions. Load replaces session (closes all pages first). Name sanitized to `[a-zA-Z0-9_-]`.
|
|
|
|
**Remaining:** V2 localStorage support (needs pre-navigation injection strategy).
|
|
**Completed:** v0.12.1.0 (2026-03-26)
|
|
|
|
### Auth vault
|
|
|
|
**What:** Encrypted credential storage, referenced by name. LLM never sees passwords.
|
|
|
|
**Why:** Security — currently auth credentials flow through the LLM context. Vault keeps secrets out of the AI's view.
|
|
|
|
**Effort:** L
|
|
**Priority:** P3
|
|
**Depends on:** Sessions, state persistence
|
|
|
|
### Iframe support — SHIPPED
|
|
|
|
~~**What:** `frame <sel>` and `frame main` commands for cross-frame interaction.~~
|
|
|
|
`$B frame` ships in v0.12.1.0. Supports CSS selector, @ref, `--name`, and `--url` pattern matching. Execution target abstraction (`getActiveFrameOrPage()`) across all read/write/snapshot commands. Frame context cleared on navigation, tab switch, resume. Detached frame auto-recovery. Page-only operations (goto, screenshot, viewport) throw clear error when in frame context.
|
|
|
|
**Completed:** v0.12.1.0 (2026-03-26)
|
|
|
|
### Semantic locators
|
|
|
|
**What:** `find role/label/text/placeholder/testid` with attached actions.
|
|
|
|
**Why:** More resilient element selection than CSS selectors or ref numbers.
|
|
|
|
**Effort:** M
|
|
**Priority:** P4
|
|
|
|
### Device emulation presets
|
|
|
|
**What:** `set device "iPhone 16 Pro"` for mobile/tablet testing.
|
|
|
|
**Why:** Responsive layout testing without manual viewport resizing.
|
|
|
|
**Effort:** S
|
|
**Priority:** P4
|
|
|
|
### Network mocking/routing
|
|
|
|
**What:** Intercept, block, and mock network requests.
|
|
|
|
**Why:** Test error states, loading states, and offline behavior.
|
|
|
|
**Effort:** M
|
|
**Priority:** P4
|
|
|
|
### Download handling
|
|
|
|
**What:** Click-to-download with path control.
|
|
|
|
**Why:** Test file download flows end-to-end.
|
|
|
|
**Effort:** S
|
|
**Priority:** P4
|
|
|
|
### Content safety
|
|
|
|
**What:** `--max-output` truncation, `--allowed-domains` filtering.
|
|
|
|
**Why:** Prevent context window overflow and restrict navigation to safe domains.
|
|
|
|
**Effort:** S
|
|
**Priority:** P4
|
|
|
|
### Streaming (WebSocket live preview)
|
|
|
|
**What:** WebSocket-based live preview for pair browsing sessions.
|
|
|
|
**Why:** Enables real-time collaboration — human watches AI browse.
|
|
|
|
**Effort:** L
|
|
**Priority:** P4
|
|
|
|
### Headed mode with Chrome extension — SHIPPED
|
|
|
|
`$B connect` launches Playwright's bundled Chromium in headed mode with the gstack Chrome extension auto-loaded. `$B handoff` now produces the same result (extension + side panel). Sidebar chat gated behind `--chat` flag.
|
|
|
|
### `$B watch` — SHIPPED
|
|
|
|
Claude observes user browsing in passive read-only mode with periodic snapshots. `$B watch stop` exits with summary. Mutation commands blocked during watch.
|
|
|
|
### Sidebar scout / file drop relay — SHIPPED
|
|
|
|
Sidebar agent writes structured messages to `.context/sidebar-inbox/`. Workspace agent reads via `$B inbox`. Message format: `{type, timestamp, page, userMessage, sidebarSessionId}`.
|
|
|
|
### Multi-agent tab isolation
|
|
|
|
**What:** Two Claude sessions connect to the same browser, each operating on different tabs. No cross-contamination.
|
|
|
|
**Why:** Enables parallel /qa + /design-review on different tabs in the same browser.
|
|
|
|
**Context:** Requires tab ownership model for concurrent headed connections. Playwright may not cleanly support two persistent contexts. Needs investigation.
|
|
|
|
**Effort:** L (human: ~2 weeks / CC: ~2 hours)
|
|
**Priority:** P3
|
|
**Depends on:** Headed mode (shipped)
|
|
|
|
### Sidebar agent needs Write tool + better error visibility — SHIPPED
|
|
|
|
**What:** Two issues with the sidebar agent (`sidebar-agent.ts`): (1) `--allowedTools` is hardcoded to `Bash,Read,Glob,Grep`, missing `Write`. Claude can't create files (like CSVs) when asked. (2) When Claude errors or returns empty, the sidebar UI shows nothing, just a green dot. No error message, no "I tried but failed", nothing.
|
|
|
|
**Completed:** v0.15.4.0 (2026-04-04). Write tool added to allowedTools. 40+ empty catch blocks replaced with `[gstack sidebar]`, `[gstack bg]`, `[browse]`, `[sidebar-agent]` prefixed console logging across all 4 files (sidepanel.js, background.js, server.ts, sidebar-agent.ts). Error placeholder text now shows in red. Auth token stale-refresh bug fixed.
|
|
|
|
### Sidebar direct API calls (eliminate claude -p startup tax)
|
|
|
|
**What:** Each sidebar message spawns a fresh `claude -p` process (~2-3s cold start overhead). For "click @e24" that's absurd. Direct Anthropic API calls would be sub-second.
|
|
|
|
**Why:** The `claude -p` startup cost is: process spawn (~100ms) + CLI init (~500ms-1s) + API connection (~200ms) + first token. Model routing (Sonnet for actions) helps but doesn't fix the CLI overhead.
|
|
|
|
**Context:** `server.ts:spawnClaude()` builds args and writes to queue file. `sidebar-agent.ts:askClaude()` spawns `claude -p`. Replace with direct `fetch('https://api.anthropic.com/...')` with tool use. Requires `ANTHROPIC_API_KEY` accessible to the browse server.
|
|
|
|
**Effort:** M (human: ~1 week / CC: ~30min)
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
|
|
### Chrome Web Store publishing
|
|
|
|
**What:** Publish the gstack browse Chrome extension to Chrome Web Store for easier install.
|
|
|
|
**Why:** Currently sideloaded via chrome://extensions. Web Store makes install one-click.
|
|
|
|
**Effort:** S
|
|
**Priority:** P4
|
|
**Depends on:** Chrome extension proving value via sideloading
|
|
|
|
### Linux cookie decryption — PARTIALLY SHIPPED
|
|
|
|
~~**What:** GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.~~
|
|
|
|
Linux cookie import shipped in v0.11.11.0 (Wave 3). Supports Chrome, Chromium, Brave, Edge on Linux with GNOME Keyring (libsecret) and "peanuts" fallback. Windows DPAPI support remains deferred.
|
|
|
|
**Remaining:** Windows cookie decryption (DPAPI). Needs complete rewrite — PR #64 was 1346 lines and stale.
|
|
|
|
**Effort:** L (Windows only)
|
|
**Priority:** P4
|
|
**Completed (Linux):** v0.11.11.0 (2026-03-23)
|
|
|
|
## Ship
|
|
|
|
### /ship Step 12 test harness should exec the actual template bash, not a reimplementation
|
|
|
|
**What:** `test/ship-version-sync.test.ts` currently reimplements the bash from `ship/SKILL.md.tmpl` Step 12 inside template literals. When the template changes, both sides must be updated — exactly the drift-risk pattern the Step 12 fix is meant to prevent, applied to our own testing strategy. Replace with a helper that extracts the fenced bash blocks from the template at test time and runs them verbatim (similar to the `skill-parser.ts` pattern).
|
|
|
|
**Why:** Surfaced by the Claude adversarial subagent during the v1.0.1.0 ship. Today the tests would stay green while the template regresses, because the error-message strings already differ between test and template. It's a silent-drift bug waiting to happen.
|
|
|
|
**Context:** The fixed test file is at `test/ship-version-sync.test.ts` (branched off garrytan/ship-version-sync). Existing precedent for extracting-from-skill-md is at `test/helpers/skill-parser.ts`. Pattern: read the template, slice from `## Step 12` to the next `---`, grep fenced bash, feed to `/bin/bash` with substituted fixtures.
|
|
|
|
**Effort:** S (human: ~2h / CC: ~30min)
|
|
**Priority:** P2
|
|
**Depends on:** None.
|
|
|
|
### /ship Step 12 BASE_VERSION silent fallback to 0.0.0.0 when git show fails
|
|
|
|
**What:** `BASE_VERSION=$(git show origin/<base>:VERSION 2>/dev/null || echo "0.0.0.0")` silently defaults to `0.0.0.0` in any failure mode — detached HEAD, no origin, offline, base branch renamed. In such states, a real drift could be misclassified or silently repaired with the wrong value. Distinguish "origin/<base> unreachable" from "origin/<base>:VERSION absent" and fail loudly on the former.
|
|
|
|
**Why:** Flagged as CRITICAL (confidence 8/10) by the Claude adversarial subagent during the v1.0.1.0 ship. Low practical risk because `/ship` Step 3 already fetches origin before Step 12 runs — any reachability failure would abort Step 3 long before this code runs. Still, defense in depth: if someone invokes Step 12 bash outside the full /ship pipeline (e.g., via a standalone helper), the fallback masks a real problem.
|
|
|
|
**Context:** Fix: wrap with `git rev-parse --verify origin/<base>` probe; if that fails, error out rather than defaulting. Touches `ship/SKILL.md.tmpl` Step 12 idempotency block (around line 409). Tests need a case where `git show` fails.
|
|
|
|
**Effort:** S (human: ~1h / CC: ~15min)
|
|
**Priority:** P3
|
|
**Depends on:** None.
|
|
|
|
### GitLab support for /land-and-deploy
|
|
|
|
**What:** Add GitLab MR merge + CI polling support to `/land-and-deploy` skill. Currently uses `gh pr view`, `gh pr checks`, `gh pr merge`, and `gh run list/view` in 15+ places — each needs a GitLab conditional path using `glab ci status`, `glab mr merge`, etc.
|
|
|
|
**Why:** Without this, GitLab users can `/ship` (create MR) but can't `/land-and-deploy` (merge + verify). Completes the GitLab story end-to-end.
|
|
|
|
**Context:** `/retro`, `/ship`, and `/document-release` now support GitLab via the multi-platform `BASE_BRANCH_DETECT` resolver. `/land-and-deploy` has deeper GitHub-specific semantics (merge queues, required checks via `gh pr checks`, deploy workflow polling) that have different shapes on GitLab. The `glab` CLI (v1.90.0) supports `glab mr merge`, `glab ci status`, `glab ci view` but with different output formats and no merge queue concept.
|
|
|
|
**Effort:** L
|
|
**Priority:** P2
|
|
**Depends on:** None (BASE_BRANCH_DETECT multi-platform resolver is already done)
|
|
|
|
### Multi-commit CHANGELOG completeness eval
|
|
|
|
**What:** Add a periodic E2E eval that creates a branch with 5+ commits spanning 3+ themes (features, cleanup, infra), runs /ship's Step 5 CHANGELOG generation, and verifies the CHANGELOG mentions all themes.
|
|
|
|
**Why:** The bug fixed in v0.11.22 (garrytan/ship-full-commit-coverage) showed that /ship's CHANGELOG generation biased toward recent commits on long branches. The prompt fix adds a cross-check, but no test exercises the multi-commit failure mode. The existing `ship-local-workflow` E2E only uses a single-commit branch.
|
|
|
|
**Context:** Would be a `periodic` tier test (~$4/run, non-deterministic since it tests LLM instruction-following). Setup: create bare remote, clone, add 5+ commits across different themes on a feature branch, run Step 5 via `claude -p`, verify CHANGELOG output covers all themes. Pattern: `ship-local-workflow` in `test/skill-e2e-workflow.test.ts`.
|
|
|
|
**Effort:** M
|
|
**Priority:** P3
|
|
**Depends on:** None
|
|
|
|
### Ship log — persistent record of /ship runs
|
|
|
|
**What:** Append structured JSON entry to `.gstack/ship-log.json` at end of every /ship run (version, date, branch, PR URL, review findings, Greptile stats, todos completed, test results).
|
|
|
|
**Why:** /retro has no structured data about shipping velocity. Ship log enables: PRs-per-week trending, review finding rates, Greptile signal over time, test suite growth.
|
|
|
|
**Context:** /retro already reads greptile-history.md — same pattern. Eval persistence (eval-store.ts) shows the JSON append pattern exists in the codebase. ~15 lines in ship template.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
|
|
|
|
### Visual verification with screenshots in PR body
|
|
|
|
**What:** /ship Step 7.5: screenshot key pages after push, embed in PR body.
|
|
|
|
**Why:** Visual evidence in PRs. Reviewers see what changed without deploying locally.
|
|
|
|
**Context:** Part of Phase 3.6. Needs S3 upload for image hosting.
|
|
|
|
**Effort:** M
|
|
**Priority:** P2
|
|
**Depends on:** /setup-gstack-upload
|
|
|
|
## Review
|
|
|
|
### Inline PR annotations
|
|
|
|
**What:** /ship and /review post inline review comments at specific file:line locations using `gh api` to create pull request review comments.
|
|
|
|
**Why:** Line-level annotations are more actionable than top-level comments. The PR thread becomes a line-by-line conversation between Greptile, Claude, and human reviewers.
|
|
|
|
**Context:** GitHub supports inline review comments via `gh api repos/$REPO/pulls/$PR/reviews`. Pairs naturally with Phase 3.6 visual annotations.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
|
|
### Greptile training feedback export
|
|
|
|
**What:** Aggregate greptile-history.md into machine-readable JSON summary of false positive patterns, exportable to the Greptile team for model improvement.
|
|
|
|
**Why:** Closes the feedback loop — Greptile can use FP data to stop making the same mistakes on your codebase.
|
|
|
|
**Context:** Was a P3 Future Idea. Upgraded to P2 now that greptile-history.md data infrastructure exists. The signal data is already being collected; this just makes it exportable. ~40 lines.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
**Depends on:** Enough FP data accumulated (10+ entries)
|
|
|
|
### Visual review with annotated screenshots
|
|
|
|
**What:** /review Step 4.5: browse PR's preview deploy, annotated screenshots of changed pages, compare against production, check responsive layouts, verify accessibility tree.
|
|
|
|
**Why:** Visual diff catches layout regressions that code review misses.
|
|
|
|
**Context:** Part of Phase 3.6. Needs S3 upload for image hosting.
|
|
|
|
**Effort:** M
|
|
**Priority:** P2
|
|
**Depends on:** /setup-gstack-upload
|
|
|
|
## QA
|
|
|
|
### QA trend tracking
|
|
|
|
**What:** Compare baseline.json over time, detect regressions across QA runs.
|
|
|
|
**Why:** Spot quality trends — is the app getting better or worse?
|
|
|
|
**Context:** QA already writes structured reports. This adds cross-run comparison.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
|
|
### CI/CD QA integration
|
|
|
|
**What:** `/qa` as GitHub Action step, fail PR if health score drops.
|
|
|
|
**Why:** Automated quality gate in CI. Catch regressions before merge.
|
|
|
|
**Effort:** M
|
|
**Priority:** P2
|
|
|
|
### Smart default QA tier
|
|
|
|
**What:** After a few runs, check index.md for user's usual tier pick, skip the AskUserQuestion.
|
|
|
|
**Why:** Reduces friction for repeat users.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
|
|
### Accessibility audit mode
|
|
|
|
**What:** `--a11y` flag for focused accessibility testing.
|
|
|
|
**Why:** Dedicated accessibility testing beyond the general QA checklist.
|
|
|
|
**Effort:** S
|
|
**Priority:** P3
|
|
|
|
### CI/CD generation for non-GitHub providers
|
|
|
|
**What:** Extend CI/CD bootstrap to generate GitLab CI (`.gitlab-ci.yml`), CircleCI (`.circleci/config.yml`), and Bitrise pipelines.
|
|
|
|
**Why:** Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone.
|
|
|
|
**Context:** v1 ships with GitHub Actions only. Detection logic already checks for `.gitlab-ci.yml`, `.circleci/`, `bitrise.yml` and skips with an informational note. Each provider needs ~20 lines of template text in `generateTestBootstrap()`.
|
|
|
|
**Effort:** M
|
|
**Priority:** P3
|
|
**Depends on:** Test bootstrap (shipped)
|
|
|
|
### Auto-upgrade weak tests (★) to strong tests (★★★)
|
|
|
|
**What:** When Step 7 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths.
|
|
|
|
**Why:** Many codebases have tests that technically exist but don't catch real bugs — `expect(component).toBeDefined()` isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests."
|
|
|
|
**Context:** Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original.
|
|
|
|
**Effort:** M
|
|
**Priority:** P3
|
|
**Depends on:** Test quality scoring (shipped)
|
|
|
|
## Retro
|
|
|
|
### Deployment health tracking (retro + browse)
|
|
|
|
**What:** Screenshot production state, check perf metrics (page load times), count console errors across key pages, track trends over retro window.
|
|
|
|
**Why:** Retro should include production health alongside code metrics.
|
|
|
|
**Context:** Requires browse integration. Screenshots + metrics fed into retro output.
|
|
|
|
**Effort:** L
|
|
**Priority:** P3
|
|
**Depends on:** Browse sessions
|
|
|
|
## Infrastructure
|
|
|
|
### /setup-gstack-upload skill (S3 bucket)
|
|
|
|
**What:** Configure S3 bucket for image hosting. One-time setup for visual PR annotations.
|
|
|
|
**Why:** Prerequisite for visual PR annotations in /ship and /review.
|
|
|
|
**Effort:** M
|
|
**Priority:** P2
|
|
|
|
### gstack-upload helper
|
|
|
|
**What:** `browse/bin/gstack-upload` — upload file to S3, return public URL.
|
|
|
|
**Why:** Shared utility for all skills that need to embed images in PRs.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
**Depends on:** /setup-gstack-upload
|
|
|
|
### WebM to GIF conversion
|
|
|
|
**What:** ffmpeg-based WebM → GIF conversion for video evidence in PRs.
|
|
|
|
**Why:** GitHub PR bodies render GIFs but not WebM. Needed for video recording evidence.
|
|
|
|
**Effort:** S
|
|
**Priority:** P3
|
|
**Depends on:** Video recording
|
|
|
|
|
|
|
|
### Extend worktree isolation to Claude E2E tests
|
|
|
|
**What:** Add `useWorktree?: boolean` option to `runSkillTest()` so any Claude E2E test can opt into worktree mode for full repo context instead of tmpdir fixtures.
|
|
|
|
**Why:** Some Claude E2E tests (CSO audit, review-sql-injection) create minimal fake repos but would produce more realistic results with full repo context. The infrastructure exists (`describeWithWorktree()` in e2e-helpers.ts) — this extends it to the session-runner level.
|
|
|
|
**Context:** WorktreeManager shipped in v0.11.12.0. Currently only Gemini/Codex tests use worktrees. Claude tests use planted-bug fixture repos which are correct for their purpose, but new tests that want real repo context can use `describeWithWorktree()` today. This TODO is about making it even easier via a flag on `runSkillTest()`.
|
|
|
|
**Effort:** M (human: ~2 days / CC: ~20 min)
|
|
**Priority:** P3
|
|
**Depends on:** Worktree isolation (shipped v0.11.12.0)
|
|
|
|
### E2E model pinning — SHIPPED
|
|
|
|
~~**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.~~
|
|
|
|
Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). `--retry 2` added. `EVALS_MODEL` env var for override. `test:e2e:fast` tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
|
|
|
|
### Eval web dashboard
|
|
|
|
**What:** `bun run eval:dashboard` serves local HTML with charts: cost trending, detection rate, pass/fail history.
|
|
|
|
**Why:** Visual charts better for spotting trends than CLI tools.
|
|
|
|
**Context:** Reads `~/.gstack-dev/evals/*.json`. ~200 lines HTML + chart.js via Bun HTTP server.
|
|
|
|
**Effort:** M
|
|
**Priority:** P3
|
|
**Depends on:** Eval persistence (shipped in v0.3.6)
|
|
|
|
### CI/CD QA quality gate
|
|
|
|
**What:** Run `/qa` as a GitHub Action step, fail PR if health score drops below threshold.
|
|
|
|
**Why:** Automated quality gate catches regressions before merge. Currently QA is manual — CI integration makes it part of the standard workflow.
|
|
|
|
**Context:** Requires headless browse binary available in CI. The `/qa` skill already produces `baseline.json` with health scores — CI step would compare against the main branch baseline and fail if score drops. Would need `ANTHROPIC_API_KEY` in CI secrets since `/qa` uses Claude.
|
|
|
|
**Effort:** M
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
|
|
### Cross-platform URL open helper
|
|
|
|
**What:** `gstack-open-url` helper script — detect platform, use `open` (macOS) or `xdg-open` (Linux).
|
|
|
|
**Why:** The first-time Completeness Principle intro uses macOS `open` to launch the essay. If gstack ever supports Linux, this silently fails.
|
|
|
|
**Effort:** S (human: ~30 min / CC: ~2 min)
|
|
**Priority:** P4
|
|
**Depends on:** Nothing
|
|
|
|
### CDP-based DOM mutation detection for ref staleness
|
|
|
|
**What:** Use Chrome DevTools Protocol `DOM.documentUpdated` / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit `snapshot` call.
|
|
|
|
**Why:** Current ref staleness detection (async count() check) only catches stale refs at action time. CDP mutation detection would proactively warn when refs become stale, preventing the 5-second timeout entirely for SPA re-renders.
|
|
|
|
**Context:** Parts 1+2 of ref staleness fix (RefEntry metadata + eager validation via count()) are shipped. This is Part 3 — the most ambitious piece. Requires CDP session alongside Playwright, MutationObserver bridge, and careful performance tuning to avoid overhead on every DOM change.
|
|
|
|
**Effort:** L
|
|
**Priority:** P3
|
|
**Depends on:** Ref staleness Parts 1+2 (shipped)
|
|
|
|
## Office Hours / Design
|
|
|
|
### Design docs → Supabase team store sync
|
|
|
|
**What:** Add design docs (`*-design-*.md`) to the Supabase sync pipeline alongside test plans, retro snapshots, and QA reports.
|
|
|
|
**Why:** Cross-team design discovery at scale. Local `~/.gstack/projects/$SLUG/` keyword-grep discovery works for same-machine users now, but Supabase sync makes it work across the whole team. Duplicate ideas surface, everyone sees what's been explored.
|
|
|
|
**Context:** /office-hours writes design docs to `~/.gstack/projects/$SLUG/`. The team store already syncs test plans, retro snapshots, QA reports. Design docs follow the same pattern — just add a sync adapter.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
**Depends on:** `garrytan/team-supabase-store` branch landing on main
|
|
|
|
### /yc-prep skill
|
|
|
|
**What:** Skill that helps founders prepare their YC application after /office-hours identifies strong signal. Pulls from the design doc, structures answers to YC app questions, runs a mock interview.
|
|
|
|
**Why:** Closes the loop. /office-hours identifies the founder, /yc-prep helps them apply well. The design doc already contains most of the raw material for a YC application.
|
|
|
|
**Effort:** M (human: ~2 weeks / CC: ~2 hours)
|
|
**Priority:** P2
|
|
**Depends on:** office-hours founder discovery engine shipping first
|
|
|
|
## Design Review
|
|
|
|
### /plan-design-review + /qa-design-review + /design-consultation — SHIPPED
|
|
|
|
Shipped as v0.5.0 on main. Includes `/plan-design-review` (report-only design audit), `/qa-design-review` (audit + fix loop), and `/design-consultation` (interactive DESIGN.md creation). `{{DESIGN_METHODOLOGY}}` resolver provides shared 80-item design audit checklist.
|
|
|
|
### Design outside voices in /plan-eng-review
|
|
|
|
**What:** Extend the parallel dual-voice pattern (Codex + Claude subagent) to /plan-eng-review's architecture review section.
|
|
|
|
**Why:** The design beachhead (v0.11.3.0) proves cross-model consensus works for subjective reviews. Architecture reviews have similar subjectivity in tradeoff decisions.
|
|
|
|
**Context:** Depends on learnings from the design beachhead. If the litmus scorecard format proves useful, adapt it for architecture dimensions (coupling, scaling, reversibility).
|
|
|
|
**Effort:** S
|
|
**Priority:** P3
|
|
**Depends on:** Design outside voices shipped (v0.11.3.0)
|
|
|
|
### Outside voices in /qa visual regression detection
|
|
|
|
**What:** Add Codex design voice to /qa for detecting visual regressions during bug-fix verification.
|
|
|
|
**Why:** When fixing bugs, the fix can introduce visual regressions that code-level checks miss. Codex could flag "the fix broke the responsive layout" during re-test.
|
|
|
|
**Context:** Depends on /qa having design awareness. Currently /qa focuses on functional testing.
|
|
|
|
**Effort:** M
|
|
**Priority:** P3
|
|
**Depends on:** Design outside voices shipped (v0.11.3.0)
|
|
|
|
## Document-Release
|
|
|
|
### Auto-invoke /document-release from /ship — SHIPPED
|
|
|
|
Shipped in v0.8.3. Step 8.5 added to `/ship` — after creating the PR, `/ship` automatically reads `document-release/SKILL.md` and executes the doc update workflow. Zero-friction doc updates.
|
|
|
|
### `{{DOC_VOICE}}` shared resolver
|
|
|
|
**What:** Create a placeholder resolver in gen-skill-docs.ts encoding the gstack voice guide (friendly, user-forward, lead with benefits). Inject into /ship Step 5, /document-release Step 5, and reference from CLAUDE.md.
|
|
|
|
**Why:** DRY — voice rules currently live inline in 3 places (CLAUDE.md CHANGELOG style section, /ship Step 5, /document-release Step 5). When the voice evolves, all three drift.
|
|
|
|
**Context:** Same pattern as `{{QA_METHODOLOGY}}` — shared block injected into multiple templates to prevent drift. ~20 lines in gen-skill-docs.ts.
|
|
|
|
**Effort:** S
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
|
|
## Ship Confidence Dashboard
|
|
|
|
### Smart review relevance detection — PARTIALLY SHIPPED
|
|
|
|
~~**What:** Auto-detect which of the 4 reviews are relevant based on branch changes (skip Design Review if no CSS/view changes, skip Code Review if plan-only).~~
|
|
|
|
`bin/gstack-diff-scope` shipped — categorizes diff into SCOPE_FRONTEND, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Used by design-review-lite to skip when no frontend files changed. Dashboard integration for conditional row display is a follow-up.
|
|
|
|
**Remaining:** Dashboard conditional row display (hide "Design Review: NOT YET RUN" when SCOPE_FRONTEND=false). Extend to Eng Review (skip for docs-only) and CEO Review (skip for config-only).
|
|
|
|
**Effort:** S
|
|
**Priority:** P3
|
|
**Depends on:** gstack-diff-scope (shipped)
|
|
|
|
|
|
## Codex
|
|
|
|
### Codex→Claude reverse buddy check skill
|
|
|
|
**What:** A Codex-native skill (`.agents/skills/gstack-claude/SKILL.md`) that runs `claude -p` to get an independent second opinion from Claude — the reverse of what `/codex` does today from Claude Code.
|
|
|
|
**Why:** Codex users deserve the same cross-model challenge that Claude users get via `/codex`. Currently the flow is one-way (Claude→Codex). Codex users have no way to get a Claude second opinion.
|
|
|
|
**Context:** The `/codex` skill template (`codex/SKILL.md.tmpl`) shows the pattern — it wraps `codex exec` with JSONL parsing, timeout handling, and structured output. The reverse skill would wrap `claude -p` with similar infrastructure. Would be generated into `.agents/skills/gstack-claude/` by `gen-skill-docs --host codex`.
|
|
|
|
**Effort:** M (human: ~2 weeks / CC: ~30 min)
|
|
**Priority:** P1
|
|
**Depends on:** None
|
|
|
|
## Completeness
|
|
|
|
### Completeness metrics dashboard
|
|
|
|
**What:** Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time.
|
|
|
|
**Why:** Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts).
|
|
|
|
**Context:** Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence.
|
|
|
|
**Effort:** M (human) / S (CC)
|
|
**Priority:** P3
|
|
**Depends on:** Boil the Lake shipped (v0.6.1)
|
|
|
|
## Safety & Observability
|
|
|
|
### On-demand hook skills (/careful, /freeze, /guard) — SHIPPED
|
|
|
|
~~**What:** Three new skills that use Claude Code's session-scoped PreToolUse hooks to add safety guardrails on demand.~~
|
|
|
|
Shipped as `/careful`, `/freeze`, `/guard`, and `/unfreeze` in v0.6.5. Includes hook fire-rate telemetry (pattern name only, no command content) and inline skill activation telemetry.
|
|
|
|
### Skill usage telemetry — SHIPPED
|
|
|
|
~~**What:** Track which skills get invoked, how often, from which repo.~~
|
|
|
|
Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into preamble telemetry line. Analytics CLI (`bun run analytics`) for querying. /retro integration shows skills-used-this-week.
|
|
|
|
### /investigate scoped debugging enhancements (gated on telemetry)
|
|
|
|
**What:** Six enhancements to /investigate auto-freeze, contingent on telemetry showing the freeze hook actually fires in real debugging sessions.
|
|
|
|
**Why:** /investigate v0.7.1 auto-freezes edits to the module being debugged. If telemetry shows the hook fires often, these enhancements make the experience smarter. If it never fires, the problem wasn't real and these aren't worth building.
|
|
|
|
**Context:** All items are prose additions to `investigate/SKILL.md.tmpl`. No new scripts.
|
|
|
|
**Items:**
|
|
1. Stack trace auto-detection for freeze directory (parse deepest app frame)
|
|
2. Freeze boundary widening (ask to widen instead of hard-block when hitting boundary)
|
|
3. Post-fix auto-unfreeze + full test suite run
|
|
4. Debug instrumentation cleanup (tag with DEBUG-TEMP, remove before commit)
|
|
5. Debug session persistence (~/.gstack/investigate-sessions/ — save investigation for reuse)
|
|
6. Investigation timeline in debug report (hypothesis log with timing)
|
|
|
|
**Effort:** M (all 6 combined)
|
|
**Priority:** P3
|
|
**Depends on:** Telemetry data showing freeze hook fires in real /investigate sessions
|
|
|
|
## Context Intelligence
|
|
|
|
### Context recovery preamble
|
|
|
|
**What:** Add ~10 lines of prose to the preamble telling the agent to re-read gstack artifacts (CEO plans, design reviews, eng reviews, checkpoints) after compaction or context degradation.
|
|
|
|
**Why:** gstack skills produce valuable artifacts stored at `~/.gstack/projects/$SLUG/`. When Claude's auto-compaction fires, it preserves a generic summary but doesn't know these artifacts exist. The plans and reviews that shaped the current work silently vanish from context, even though they're still on disk. This is the thing nobody else in the Claude Code ecosystem is solving, because nobody else has gstack's artifact architecture.
|
|
|
|
**Context:** Inspired by Anthropic's `claude-progress.txt` pattern for long-running agents. Also informed by claude-mem's "progressive disclosure" approach. See `docs/designs/SESSION_INTELLIGENCE.md` for the broader vision. CEO plan: `~/.gstack/projects/garrytan-gstack/ceo-plans/2026-03-31-session-intelligence-layer.md`.
|
|
|
|
**Effort:** S (human: ~30 min / CC: ~5 min)
|
|
**Priority:** P1
|
|
**Depends on:** None
|
|
**Key files:** `scripts/resolvers/preamble.ts`
|
|
|
|
### Session timeline
|
|
|
|
**What:** Append one-line JSONL entry to `~/.gstack/projects/$SLUG/timeline.jsonl` after every skill run (timestamp, skill, branch, outcome). `/retro` renders the timeline.
|
|
|
|
**Why:** Makes AI-assisted work history visible. `/retro` can show "this week: 3 /review, 2 /ship, 1 /investigate." Provides the observability layer for the session intelligence architecture.
|
|
|
|
**Effort:** S (human: ~1h / CC: ~5 min)
|
|
**Priority:** P1
|
|
**Depends on:** None
|
|
**Key files:** `scripts/resolvers/preamble.ts`, `retro/SKILL.md.tmpl`
|
|
|
|
### Cross-session context injection
|
|
|
|
**What:** When a new gstack session starts on a branch with recent checkpoints or plans, the preamble prints a one-line summary: "Last session: implemented JWT auth, 3/5 tasks done." Agent knows where you left off before reading any files.
|
|
|
|
**Why:** Claude starts every session fresh. This one-liner orients the agent immediately. Similar to claude-mem's SessionStart hook pattern but simpler and integrated.
|
|
|
|
**Effort:** S (human: ~2h / CC: ~10 min)
|
|
**Priority:** P2
|
|
**Depends on:** Context recovery preamble
|
|
|
|
### /checkpoint skill
|
|
|
|
**What:** Manual skill to snapshot current working state: what's being done and why, files being edited, decisions made (and rationale), what's done vs. remaining, critical types/signatures. Saved to `~/.gstack/projects/$SLUG/checkpoints/<timestamp>.md`.
|
|
|
|
**Why:** Useful before stepping away from a long session, before known-complex operations that might trigger compaction, for handing off context to a different agent/workspace, or coming back to a project after days away.
|
|
|
|
**Effort:** M (human: ~1 week / CC: ~30 min)
|
|
**Priority:** P2
|
|
**Depends on:** Context recovery preamble
|
|
**Key files:** New `checkpoint/SKILL.md.tmpl`, `scripts/gen-skill-docs.ts`
|
|
|
|
### Session Intelligence Layer design doc
|
|
|
|
**What:** Write `docs/designs/SESSION_INTELLIGENCE.md` describing the architectural vision: gstack as the persistent brain that survives Claude's ephemeral context. Every skill writes to `~/.gstack/projects/$SLUG/`, preamble re-reads, `/retro` rolls up.
|
|
|
|
**Why:** Connects context recovery, health, checkpoint, and timeline features into a coherent architecture. Nobody else in the ecosystem is building this.
|
|
|
|
**Effort:** S (human: ~2h / CC: ~15 min)
|
|
**Priority:** P1
|
|
**Depends on:** None
|
|
|
|
## Health
|
|
|
|
### /health — Project Health Dashboard
|
|
|
|
**What:** Skill that runs type-check, lint, test suite, and dead code scan, then reports a composite 0-10 health score with breakdown by category. Tracks over time in `~/.gstack/health/<project-slug>/` for trend detection. Optionally integrates CodeScene MCP for deeper complexity/cohesion/coupling analysis.
|
|
|
|
**Why:** No quick way to get "state of the codebase" before starting work. CodeScene peer-reviewed research shows AI-generated code increases static analysis warnings by 30%, code complexity by 41%, and change failure rates by 30%. Users need guardrails. Like `/qa` but for code quality rather than browser behavior.
|
|
|
|
**Context:** Reads CLAUDE.md for project-specific commands (platform-agnostic principle). Runs checks in parallel. `/retro` can pull from health history for trend sparklines.
|
|
|
|
**Effort:** M (human: ~1 week / CC: ~30 min)
|
|
**Priority:** P1
|
|
**Depends on:** None
|
|
**Key files:** New `health/SKILL.md.tmpl`, `scripts/gen-skill-docs.ts`
|
|
|
|
### /health as /ship gate
|
|
|
|
**What:** If health score exists and drops below a configurable threshold, `/ship` warns before creating the PR: "Health dropped from 8/10 to 5/10 this branch — 3 new lint warnings, 1 test failure. Ship anyway?"
|
|
|
|
**Why:** Quality gate that prevents shipping degraded code. Configurable threshold so it's not blocking for teams that don't use `/health`.
|
|
|
|
**Effort:** S (human: ~1h / CC: ~5 min)
|
|
**Priority:** P2
|
|
**Depends on:** /health skill
|
|
|
|
## Swarm
|
|
|
|
### Swarm primitive — reusable multi-agent dispatch
|
|
|
|
**What:** Extract Review Army's dispatch pattern into a reusable resolver (`scripts/resolvers/swarm.ts`). Wire into `/ship` for parallel pre-ship checks (type-check + lint + test in parallel sub-agents). Make available to `/qa`, `/investigate`, `/health`.
|
|
|
|
**Why:** Review Army proved parallel sub-agents work brilliantly (5 agents = 835K tokens of working memory vs. 167K for one). The pattern is locked inside `review-army.ts`. Other skills need it too. Claude Code Agent Teams (official, Feb 2026) validates the team-lead-delegates-to-specialists pattern. Gartner: multi-agent inquiries surged 1,445% in one year.
|
|
|
|
**Context:** Start with the specific `/ship` use case. Extract shared parts only after 2+ consumers reveal what config parameters are actually needed. Avoid premature abstraction. Can leverage existing WorktreeManager for isolation.
|
|
|
|
**Effort:** L (human: ~2 weeks / CC: ~2 hours)
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
**Key files:** `scripts/resolvers/review-army.ts`, new `scripts/resolvers/swarm.ts`, `ship/SKILL.md.tmpl`, `lib/worktree.ts`
|
|
|
|
## Refactoring
|
|
|
|
### /refactor-prep — Pre-Refactor Token Hygiene
|
|
|
|
**What:** Skill that detects project language/framework, runs appropriate dead code detection (knip/ts-prune for TS/JS, vulture/autoflake for Python, staticcheck/deadcode for Go, cargo udeps for Rust), strips dead imports/exports/props/console.logs, and commits cleanup separately.
|
|
|
|
**Why:** Dirty codebases accelerate context compaction. Dead imports, unused exports, and orphaned code eat tokens that contribute nothing but everything to triggering compaction mid-refactor. Cleaning first buys back 20%+ of context budget. Reports lines removed and estimated token savings.
|
|
|
|
**Effort:** M (human: ~1 week / CC: ~30 min)
|
|
**Priority:** P2
|
|
**Depends on:** None
|
|
**Key files:** New `refactor-prep/SKILL.md.tmpl`, `scripts/gen-skill-docs.ts`
|
|
|
|
## Factory Droid
|
|
|
|
### Browse MCP server for Factory Droid
|
|
|
|
**What:** Expose gstack's browse binary and key workflows as an MCP server that Factory Droid connects to natively. Factory users would run /mcp, add the gstack server, and get browse, QA, and review capabilities as Factory tools.
|
|
|
|
**Why:** Factory already supports 40+ MCP servers in its registry. Getting gstack's browse binary listed there is a distribution play. Nobody else has a real compiled browser binary as an MCP tool. This is the thing that makes gstack uniquely valuable on Factory Droid.
|
|
|
|
**Context:** Option A (--host factory compatibility shim) ships first in v0.13.4.0. Option B is the follow-up that provides deeper integration. The browse binary is already a stateless CLI, so wrapping it as an MCP server is straightforward (stdin/stdout JSON-RPC). Each browse command becomes an MCP tool.
|
|
|
|
**Effort:** L (human: ~1 week / CC: ~5 hours)
|
|
**Priority:** P1
|
|
**Depends on:** --host factory (Option A, shipping in v0.13.4.0)
|
|
|
|
### .agent/skills/ dual output for cross-agent compatibility
|
|
|
|
**What:** Factory also reads from `<repo>/.agent/skills/` as a cross-agent compatibility path. Could output there in addition to `.factory/skills/` for broader reach across other agents that use the `.agent` convention.
|
|
|
|
**Why:** Multiple AI agents beyond Factory may adopt the `.agent/skills/` convention. Outputting there too would give free compatibility.
|
|
|
|
**Effort:** S
|
|
**Priority:** P3
|
|
**Depends on:** --host factory
|
|
|
|
### Custom Droid definitions alongside skills
|
|
|
|
**What:** Factory has "custom droids" (subagents with tool restrictions, model selection, autonomy levels). Could ship `gstack-qa.md` droid configs alongside skills that restrict tools to read-only + execute for safety.
|
|
|
|
**Why:** Deeper Factory integration. Droid configs give Factory users tighter control over what gstack skills can do.
|
|
|
|
**Effort:** M
|
|
**Priority:** P3
|
|
**Depends on:** --host factory
|
|
|
|
## GStack Browser
|
|
|
|
### Anti-bot stealth: Playwright CDP patches (rebrowser-style)
|
|
|
|
**What:** Write a postinstall script that patches Playwright's CDP layer to suppress `Runtime.enable` and use `addBinding` for context ID discovery, same approach as rebrowser-patches. Eliminates the `navigator.webdriver`, `cdc_` markers, and other CDP artifacts that sites like Google use to detect automation.
|
|
|
|
**Why:** Our current stealth patches (UA override, navigator.webdriver=false, fake plugins) work on most sites but Google still triggers captchas. The real detection is at the CDP protocol level. rebrowser-patches proved the approach works but their patches target Playwright 1.52.0 and don't apply to our 1.58.2. We need our own patcher using string matching instead of line-number diffs. 6 files, ~200 lines of patches total.
|
|
|
|
**Context:** Full analysis of rebrowser-patches source: patches 6 files in `playwright-core/lib/server/` (crConnection.js, crDevTools.js, crPage.js, crServiceWorker.js, frames.js, page.js). Key technique: suppress `Runtime.enable` (the main CDP detection vector), use `Runtime.addBinding` + `CustomEvent` trick to discover execution context IDs without it. Our extension communicates via Chrome extension APIs, not CDP Runtime, so it should be unaffected. Write E2E tests that verify: (1) extension still loads and connects, (2) Google.com loads without captcha, (3) sidebar chat still works.
|
|
|
|
**Effort:** L (human: ~2 weeks / CC: ~3 hours)
|
|
**Priority:** P1
|
|
**Depends on:** None
|
|
|
|
### Chromium fork (long-term alternative to CDP patches)
|
|
|
|
**What:** Maintain a Chromium fork where anti-bot stealth, GStack Browser branding, and native sidebar support live in the source code, not as runtime monkey-patches.
|
|
|
|
**Why:** The CDP patches are brittle. They break on every Playwright upgrade and target compiled JS with fragile string matching. A proper fork means: (1) stealth is permanent, not patched, (2) branding is native (no plist hacking at launch), (3) native sidebar replaces the extension (Phase 4 of V0 roadmap), (4) custom protocols (gstack://) for internal pages. Companies like Brave, Arc, and Vivaldi maintain Chromium forks with small teams. With CC, the rebase-on-upstream maintenance could be largely automated.
|
|
|
|
**Context:** Trigger criteria from V0 design doc: fork when extension side panel becomes the bottleneck, when anti-bot patches need to live deeper than CDP, or when native UI integration (sidebar, status bar) can't be done via extension. The Chromium build takes ~4 hours on a 32-core machine and produces ~50GB of build artifacts. CI would need dedicated build infra. See `docs/designs/GSTACK_BROWSER_V0.md` Phase 5 for full analysis.
|
|
|
|
**Effort:** XL (human: ~1 quarter / CC: ~2-3 weeks of focused work)
|
|
**Priority:** P2
|
|
**Depends on:** CDP patches proving the value of anti-bot stealth first
|
|
|
|
## Completed
|
|
|
|
### Bearer-token secret-scan regression fixed + E2E coverage added for privacy gate + gh auto-create (v1.12.0.0)
|
|
|
|
- **Fixed the `bearer-token-json` regression in `bin/gstack-brain-sync`** — the value charset `[A-Za-z0-9_./+=-]{16,}` didn't permit spaces, so auth headers with the standard `Bearer <token>` form (literal space after the scheme name) slipped past the scanner. Added an optional `(Bearer |Basic |Token )?` prefix to the pattern. Validated against 5 positive cases (including the regression fixture) + 3 negative cases (short tokens, non-secret keys, random JSON). The 7-pattern secret scanner now passes all fixtures including bearer-json.
|
|
- **Added `test/gstack-brain-init-gh-mock.test.ts`** — 8 tests exercising the `gh` CLI auto-create path that previously had zero coverage. Stubs `gh` on PATH to record every call, asserts `gh repo create --private --description "..." --source <GSTACK_HOME>` fires with the computed `gstack-brain-<user>` default name. Covers: happy path, fall-through-to-`gh repo view` when create hits already-exists, user-provided-URL-bypasses-gh, gh-not-on-path prompts for URL, gh-not-authed prompts for URL, idempotent `--remote` re-runs, conflicting-remote rejection.
|
|
- **Added `test/skill-e2e-brain-privacy-gate.test.ts`** — periodic-tier E2E (~$0.30-$0.50/run). Stages a fake `gbrain` on PATH + `gbrain_sync_mode_prompted=false` in config, runs a real skill via `runAgentSdkTest`, intercepts tool-use via `canUseTool`, and asserts the preamble fires the 3-option privacy AskUserQuestion with canonical prose ("publish session memory" / "artifact" / "decline"). Second test asserts the gate is silent when `prompted=true` (idempotency-within-session).
|
|
- **Registered `brain-privacy-gate` in `test/helpers/touchfiles.ts`** (periodic tier) with dependency tracking on `scripts/resolvers/preamble/generate-brain-sync-block.ts`, `bin/gstack-brain-sync`, `bin/gstack-brain-init`, `bin/gstack-config`, and the Agent SDK runner. Diff-based selection will re-run the E2E whenever any of those change.
|
|
|
|
**Completed:** v1.12.0.0 (2026-04-24)
|
|
|
|
---
|
|
|
|
### Overlay efficacy harness + Opus 4.7 fanout nudge removal (v1.10.1.0)
|
|
- Built `test/skill-e2e-overlay-harness.test.ts`, a parametric periodic-tier eval that drives `@anthropic-ai/claude-agent-sdk` and measures first-turn fanout rate (overlay-ON vs overlay-OFF) across registered fixtures
|
|
- Measured the original "Fan out explicitly" overlay nudge: baseline Opus 4.7 = 70% first-turn fanout on toy prompt, with our nudge = 10%, with Anthropic's own canonical `<use_parallel_tool_calls>` text = 0%
|
|
- Removed the counterproductive nudge from `model-overlays/opus-4-7.md`
|
|
- Shipped 36-test free-tier unit suite for the SDK runner + strict fixture validator
|
|
- Registered `overlay-harness-opus-4-7-fanout-{toy,realistic}` in E2E_TOUCHFILES and E2E_TIERS
|
|
- Total investigation cost: ~$7 across 3 eval runs
|
|
**Completed:** v1.10.1.0
|
|
|
|
### CI eval pipeline (v0.9.9.0)
|
|
- GitHub Actions eval upload on Ubicloud runners ($0.006/run)
|
|
- Within-file test concurrency (test() → testConcurrentIfSelected())
|
|
- Eval artifact upload + PR comment with pass/fail + cost
|
|
- Baseline comparison via artifact download from main
|
|
- EVALS_CONCURRENCY=40 for ~6min wall clock (was ~18min)
|
|
**Completed:** v0.9.9.0
|
|
|
|
### Deploy pipeline (v0.9.8.0)
|
|
- /land-and-deploy — merge PR, wait for CI/deploy, canary verification
|
|
- /canary — post-deploy monitoring loop with anomaly detection
|
|
- /benchmark — performance regression detection with Core Web Vitals
|
|
- /setup-deploy — one-time deploy platform configuration
|
|
- /review Performance & Bundle Impact pass
|
|
- E2E model pinning (Sonnet default, Opus for quality tests)
|
|
- E2E timing telemetry (first_response_ms, max_inter_turn_ms, wall_clock_ms)
|
|
- test:e2e:fast tier, --retry 2 on all E2E scripts
|
|
**Completed:** v0.9.8.0
|
|
|
|
### Phase 1: Foundations (v0.2.0)
|
|
- Rename to gstack
|
|
- Restructure to monorepo layout
|
|
- Setup script for skill symlinks
|
|
- Snapshot command with ref-based element selection
|
|
- Snapshot tests
|
|
**Completed:** v0.2.0
|
|
|
|
### Phase 2: Enhanced Browser (v0.2.0)
|
|
- Annotated screenshots, snapshot diffing, dialog handling, file upload
|
|
- Cursor-interactive elements, element state checks
|
|
- CircularBuffer, async buffer flush, health check
|
|
- Playwright error wrapping, useragent fix
|
|
- 148 integration tests
|
|
**Completed:** v0.2.0
|
|
|
|
### Phase 3: QA Testing Agent (v0.3.0)
|
|
- /qa SKILL.md with 6-phase workflow, 3 modes (full/quick/regression)
|
|
- Issue taxonomy, severity classification, exploration checklist
|
|
- Report template, health score rubric, framework detection
|
|
- wait/console/cookie-import commands, find-browse binary
|
|
**Completed:** v0.3.0
|
|
|
|
### Phase 3.5: Browser Cookie Import (v0.3.x)
|
|
- cookie-import-browser command (Chromium cookie DB decryption)
|
|
- Cookie picker web UI, /setup-browser-cookies skill
|
|
- 18 unit tests, browser registry (Comet, Chrome, Arc, Brave, Edge)
|
|
**Completed:** v0.3.1
|
|
|
|
### E2E test cost tracking
|
|
- Track cumulative API spend, warn if over threshold
|
|
**Completed:** v0.3.6
|
|
|
|
### Auto-upgrade mode + smart update check
|
|
- Config CLI (`bin/gstack-config`), auto-upgrade via `~/.gstack/config.yaml`, 12h cache TTL, exponential snooze backoff (24h→48h→1wk), "never ask again" option, vendored copy sync on upgrade
|
|
**Completed:** v0.3.8
|