* refactor: renumber /ship steps to clean integers (1-20)
Replaces fractional step numbers (1.5, 2.5, 3.25, 3.4, 3.45, 3.47, 3.48,
3.5, 3.55, 3.56, 3.57, 3.75, 3.8, 5.5, 6.5, 8.5, 8.75) with clean
integers 1 through 20, plus allowed resolver sub-steps 8.1, 8.2,
9.1, 9.2, 9.3. Fractional numbering signaled "optional appendix" and
contributed to /ship's habit of skipping late-stage steps.
Affects:
- ship/SKILL.md.tmpl (all headings + ~30 cross-references)
- scripts/resolvers/review.ts (ship-side 3.47/3.48/3.57/3.8 conditionals)
- scripts/resolvers/review-army.ts (ship-side 3.55/3.56 conditionals)
- scripts/resolvers/testing.ts (ship-side 2.5/3.4 references, 5 sites)
- scripts/resolvers/utility.ts (CHANGELOG heading gets Step 13 prefix)
- test/gen-skill-docs.test.ts (5 step-number assertions updated)
- test/skill-validation.test.ts (3 step-number assertions updated)
/review step numbering (1.5, 2.5, 4.5, 5.5-5.8) intentionally unchanged —
only the ship-side of each isShip conditional was updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: subagent isolation for /ship's 4 context-heaviest sub-workflows
Fights context rot. By late /ship, the parent context is bloated with
500-1,750 lines of intermediate tool output from tests, coverage audits,
reviews, adversarial checks, and PR body construction. The model is
at its least intelligent when it reaches doc-sync — which is why
/document-release was being skipped ~80% of the time.
Applies subagent dispatch (proven pattern from Review Army at Step 9.1
and Adversarial at Step 11) to four sub-workflows where the parent
only needs the conclusion, not the intermediate output:
- Step 7 (Test Coverage Audit) — subagent returns coverage_pct, gaps,
diagram, tests_added
- Step 8 (Plan Completion Audit) — subagent returns total_items, done,
changed, deferred, summary
- Step 10 (Greptile Triage) — subagent fetches + classifies, parent
handles user interaction and commits fixes (AskUserQuestion + Edit
can't run in subagents)
- Step 18 (Documentation Sync) — subagent invokes full /document-release
skill in fresh context; parent embeds documentation_section in PR body
Sequencing fix for Step 18: runs AFTER Step 17 (Push) and BEFORE Step 19
(Create PR). The PR is created once from final HEAD with the
## Documentation section baked into the initial body — no create-then-
re-edit dance, no race conditions with document-release's own PR body
editor.
Adds "You are NOT done" guardrail after Step 17 (Push) to break the
natural stopping point that currently causes doc-release skips.
Each subagent falls back to inline execution if it fails or returns
invalid JSON. /ship never blocks on subagent failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: regression guard for /ship step numbering
Three regression guards in skill-validation.test.ts to prevent future
drift back to fractional step numbering:
1. ship/SKILL.md.tmpl contains no fractional step numbers except the
allowed resolver sub-steps (8.1, 8.2, 9.1, 9.2, 9.3). A contributor
adding "Step 3.75" next month will fail this test with a clear error.
2. ship/SKILL.md main headings use clean integer step numbers. If a
renumber accidentally leaves a decimal heading, this catches it.
3. review/SKILL.md step numbers unchanged — regression guard for the
resolver conditionals in review.ts/review-army.ts. If a future edit
accidentally touches the review-side of an isShip ternary, /review's
fractional numbering (1.5, 4.5, 5.7) would vanish. This test catches
that cross-contamination.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: sync ship step references after renumber
CLAUDE.md: "At /ship time (Step 5)" → "(Step 13)" — CHANGELOG is now
explicitly Step 13 after the renumber (was implicit between old
Step 4 and Step 5.5).
TODOS.md: "Step 3.4 coverage audit" → "Step 7" — references the open
TODO for auto-upgrading ★-rated tests, which hooks into the coverage
audit step.
Both are historical references to ship's step numbering that became
stale when clean integer renumbering landed in 566d42c2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: update golden ship skill baselines after renumber + subagent refactor
The golden fixtures at test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md
regression-test that generated ship/SKILL.md output matches a committed baseline.
After renumbering steps to clean integers and converting 4 sub-workflows to
subagent dispatches, the generated output changed substantially — refresh the
baselines to reflect the new expected output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.18.1.0)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: gitignore Claude Code harness runtime artifacts
.claude/scheduled_tasks.lock appears when ScheduleWakeup fires. It's a
runtime lock file owned by the Claude Code harness, not project source.
Add .claude/*.lock too so future harness artifacts in that directory
don't need their own gitignore entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
45 KiB
TODOS
Browse
Scope sidebar-agent kill to session PID, not pkill -f sidebar-agent\.ts
What: shutdown() in browse/src/server.ts:1193 uses pkill -f sidebar-agent\.ts to kill the sidebar-agent daemon, which matches every sidebar-agent on the machine, not just the one this server spawned. Replace with PID tracking: store the sidebar-agent PID when cli.ts spawns it (via state file or env), then process.kill(pid, 'SIGTERM') in shutdown().
Why: A user running two Conductor worktrees (or any multi-session setup), each with its own $B connect, closes one browser window ... and the other worktree's sidebar-agent gets killed too. The blast radius was there before, but the v0.18.1.0 disconnect-cleanup fix makes it more reachable: every user-close now runs the full shutdown() path, whereas before user-close bypassed it.
Context: Surfaced by /ship's adversarial review on v0.18.1.0. Pre-existing code, not introduced by the fix. Fix requires propagating the sidebar-agent PID from cli.ts spawn site (~line 885) into the server's state file so shutdown() can target just this session's agent. Related: browse/src/cli.ts spawns with Bun.spawn(...).unref() and already captures agentProc.pid.
Effort: S (human: ~2h / CC: ~15min) Priority: P2 Depends on: None
Sidebar Security
ML Prompt Injection Classifier
What: Add DeBERTa-v3-base-prompt-injection-v2 via @huggingface/transformers v4 (WASM backend) as an ML defense layer for the Chrome sidebar. Reusable browse/src/security.ts module with checkInjection() API. Includes canary tokens, attack logging, shield icon, special telemetry (AskUserQuestion on detection even when telemetry off), and BrowseSafe-bench red team test harness (3,680 adversarial cases from Perplexity).
Why: PR 1 fixes the architecture (command allowlist, XML framing, Opus default). But attackers can still trick Claude into navigating to phishing sites or exfiltrating visible page data via allowed browse commands. The ML classifier catches prompt injection patterns that architectural controls can't see. 94.8% accuracy, 99.6% recall, ~50-100ms inference via WASM. Defense-in-depth.
Context: Full design doc with industry research, open source tool landscape, Codex review findings, and ambitious Bun-native vision (5ms inference via FFI + Apple Accelerate): docs/designs/ML_PROMPT_INJECTION_KILLER.md. CEO plan with scope decisions: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-03-28-sidebar-prompt-injection-defense.md.
Effort: L (human: ~2 weeks / CC: ~3-4 hours) Priority: P0 Depends on: Sidebar security fix PR (command allowlist + XML framing + arg fix) landing first
Builder Ethos
First-time Search Before Building intro
What: Add a generateSearchIntro() function (like generateLakeIntro()) that introduces the Search Before Building principle on first use, with a link to the blog essay.
Why: Boil the Lake has an intro flow that links to the essay and marks .completeness-intro-seen. Search Before Building should have the same pattern for discoverability.
Context: Blocked on a blog post to link to. When the essay exists, add the intro flow with a .search-intro-seen marker file. Pattern: generateLakeIntro() at gen-skill-docs.ts:176.
Effort: S Priority: P2 Depends on: Blog post about Search Before Building
Chrome DevTools MCP Integration
Real Chrome session access
What: Integrate Chrome DevTools MCP to connect to the user's real Chrome session with real cookies, real state, no Playwright middleman.
Why: Right now, headed mode launches a fresh Chromium profile. Users must log in manually or import cookies. Chrome DevTools MCP connects to the user's actual Chrome ... instant access to every authenticated site. This is the future of browser automation for AI agents.
Context: Google shipped Chrome DevTools MCP in Chrome 146+ (June 2025). It provides screenshots, console messages, performance traces, Lighthouse audits, and full page interaction through the user's real browser. gstack should use it for real-session access while keeping Playwright for headless CI/testing workflows.
Potential new skills:
/debug-browser: JS error tracing with source-mapped stack traces/perf-debug: performance traces, Core Web Vitals, network waterfall
May replace /setup-browser-cookies for most use cases since the user's real cookies are already there.
Effort: L (human: ~2 weeks / CC: ~2 hours) Priority: P0 Depends on: Chrome 146+, DevTools MCP server installed
Browse
Bundle server.ts into compiled binary
What: Eliminate resolveServerScript() fallback chain entirely — bundle server.ts into the compiled browse binary.
Why: The current fallback chain (check adjacent to cli.ts, check global install) is fragile and caused bugs in v0.3.2. A single compiled binary is simpler and more reliable.
Context: Bun's --compile flag can bundle multiple entry points. The server is currently resolved at runtime via file path lookup. Bundling it removes the resolution step entirely.
Effort: M Priority: P2 Depends on: None
Sessions (isolated browser instances)
What: Isolated browser instances with separate cookies/storage/history, addressable by name.
Why: Enables parallel testing of different user roles, A/B test verification, and clean auth state management.
Context: Requires Playwright browser context isolation. Each session gets its own context with independent cookies/localStorage. Prerequisite for video recording (clean context lifecycle) and auth vault.
Effort: L Priority: P3
Video recording
What: Record browser interactions as video (start/stop controls).
Why: Video evidence in QA reports and PR bodies. Currently deferred because recreateContext() destroys page state.
Context: Needs sessions for clean context lifecycle. Playwright supports video recording per context. Also needs WebM → GIF conversion for PR embedding.
Effort: M Priority: P3 Depends on: Sessions
v20 encryption format support
What: AES-256-GCM support for future Chromium cookie DB versions (currently v10).
Why: Future Chromium versions may change encryption format. Proactive support prevents breakage.
Effort: S Priority: P3
State persistence — SHIPPED
What: Save/load cookies + localStorage to JSON files for reproducible test sessions.
$B state save/load ships in v0.12.1.0. V1 saves cookies + URLs only (not localStorage, which breaks on load-before-navigate). Files at .gstack/browse-states/{name}.json with 0o600 permissions. Load replaces session (closes all pages first). Name sanitized to [a-zA-Z0-9_-].
Remaining: V2 localStorage support (needs pre-navigation injection strategy). Completed: v0.12.1.0 (2026-03-26)
Auth vault
What: Encrypted credential storage, referenced by name. LLM never sees passwords.
Why: Security — currently auth credentials flow through the LLM context. Vault keeps secrets out of the AI's view.
Effort: L Priority: P3 Depends on: Sessions, state persistence
Iframe support — SHIPPED
What: frame <sel> and frame main commands for cross-frame interaction.
$B frame ships in v0.12.1.0. Supports CSS selector, @ref, --name, and --url pattern matching. Execution target abstraction (getActiveFrameOrPage()) across all read/write/snapshot commands. Frame context cleared on navigation, tab switch, resume. Detached frame auto-recovery. Page-only operations (goto, screenshot, viewport) throw clear error when in frame context.
Completed: v0.12.1.0 (2026-03-26)
Semantic locators
What: find role/label/text/placeholder/testid with attached actions.
Why: More resilient element selection than CSS selectors or ref numbers.
Effort: M Priority: P4
Device emulation presets
What: set device "iPhone 16 Pro" for mobile/tablet testing.
Why: Responsive layout testing without manual viewport resizing.
Effort: S Priority: P4
Network mocking/routing
What: Intercept, block, and mock network requests.
Why: Test error states, loading states, and offline behavior.
Effort: M Priority: P4
Download handling
What: Click-to-download with path control.
Why: Test file download flows end-to-end.
Effort: S Priority: P4
Content safety
What: --max-output truncation, --allowed-domains filtering.
Why: Prevent context window overflow and restrict navigation to safe domains.
Effort: S Priority: P4
Streaming (WebSocket live preview)
What: WebSocket-based live preview for pair browsing sessions.
Why: Enables real-time collaboration — human watches AI browse.
Effort: L Priority: P4
Headed mode with Chrome extension — SHIPPED
$B connect launches Playwright's bundled Chromium in headed mode with the gstack Chrome extension auto-loaded. $B handoff now produces the same result (extension + side panel). Sidebar chat gated behind --chat flag.
$B watch — SHIPPED
Claude observes user browsing in passive read-only mode with periodic snapshots. $B watch stop exits with summary. Mutation commands blocked during watch.
Sidebar scout / file drop relay — SHIPPED
Sidebar agent writes structured messages to .context/sidebar-inbox/. Workspace agent reads via $B inbox. Message format: {type, timestamp, page, userMessage, sidebarSessionId}.
Multi-agent tab isolation
What: Two Claude sessions connect to the same browser, each operating on different tabs. No cross-contamination.
Why: Enables parallel /qa + /design-review on different tabs in the same browser.
Context: Requires tab ownership model for concurrent headed connections. Playwright may not cleanly support two persistent contexts. Needs investigation.
Effort: L (human: ~2 weeks / CC: ~2 hours) Priority: P3 Depends on: Headed mode (shipped)
Sidebar agent needs Write tool + better error visibility — SHIPPED
What: Two issues with the sidebar agent (sidebar-agent.ts): (1) --allowedTools is hardcoded to Bash,Read,Glob,Grep, missing Write. Claude can't create files (like CSVs) when asked. (2) When Claude errors or returns empty, the sidebar UI shows nothing, just a green dot. No error message, no "I tried but failed", nothing.
Completed: v0.15.4.0 (2026-04-04). Write tool added to allowedTools. 40+ empty catch blocks replaced with [gstack sidebar], [gstack bg], [browse], [sidebar-agent] prefixed console logging across all 4 files (sidepanel.js, background.js, server.ts, sidebar-agent.ts). Error placeholder text now shows in red. Auth token stale-refresh bug fixed.
Sidebar direct API calls (eliminate claude -p startup tax)
What: Each sidebar message spawns a fresh claude -p process (~2-3s cold start overhead). For "click @e24" that's absurd. Direct Anthropic API calls would be sub-second.
Why: The claude -p startup cost is: process spawn (~100ms) + CLI init (~500ms-1s) + API connection (~200ms) + first token. Model routing (Sonnet for actions) helps but doesn't fix the CLI overhead.
Context: server.ts:spawnClaude() builds args and writes to queue file. sidebar-agent.ts:askClaude() spawns claude -p. Replace with direct fetch('https://api.anthropic.com/...') with tool use. Requires ANTHROPIC_API_KEY accessible to the browse server.
Effort: M (human: ~1 week / CC: ~30min) Priority: P2 Depends on: None
Chrome Web Store publishing
What: Publish the gstack browse Chrome extension to Chrome Web Store for easier install.
Why: Currently sideloaded via chrome://extensions. Web Store makes install one-click.
Effort: S Priority: P4 Depends on: Chrome extension proving value via sideloading
Linux cookie decryption — PARTIALLY SHIPPED
What: GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.
Linux cookie import shipped in v0.11.11.0 (Wave 3). Supports Chrome, Chromium, Brave, Edge on Linux with GNOME Keyring (libsecret) and "peanuts" fallback. Windows DPAPI support remains deferred.
Remaining: Windows cookie decryption (DPAPI). Needs complete rewrite — PR #64 was 1346 lines and stale.
Effort: L (Windows only) Priority: P4 Completed (Linux): v0.11.11.0 (2026-03-23)
Ship
GitLab support for /land-and-deploy
What: Add GitLab MR merge + CI polling support to /land-and-deploy skill. Currently uses gh pr view, gh pr checks, gh pr merge, and gh run list/view in 15+ places — each needs a GitLab conditional path using glab ci status, glab mr merge, etc.
Why: Without this, GitLab users can /ship (create MR) but can't /land-and-deploy (merge + verify). Completes the GitLab story end-to-end.
Context: /retro, /ship, and /document-release now support GitLab via the multi-platform BASE_BRANCH_DETECT resolver. /land-and-deploy has deeper GitHub-specific semantics (merge queues, required checks via gh pr checks, deploy workflow polling) that have different shapes on GitLab. The glab CLI (v1.90.0) supports glab mr merge, glab ci status, glab ci view but with different output formats and no merge queue concept.
Effort: L Priority: P2 Depends on: None (BASE_BRANCH_DETECT multi-platform resolver is already done)
Multi-commit CHANGELOG completeness eval
What: Add a periodic E2E eval that creates a branch with 5+ commits spanning 3+ themes (features, cleanup, infra), runs /ship's Step 5 CHANGELOG generation, and verifies the CHANGELOG mentions all themes.
Why: The bug fixed in v0.11.22 (garrytan/ship-full-commit-coverage) showed that /ship's CHANGELOG generation biased toward recent commits on long branches. The prompt fix adds a cross-check, but no test exercises the multi-commit failure mode. The existing ship-local-workflow E2E only uses a single-commit branch.
Context: Would be a periodic tier test (~$4/run, non-deterministic since it tests LLM instruction-following). Setup: create bare remote, clone, add 5+ commits across different themes on a feature branch, run Step 5 via claude -p, verify CHANGELOG output covers all themes. Pattern: ship-local-workflow in test/skill-e2e-workflow.test.ts.
Effort: M Priority: P3 Depends on: None
Ship log — persistent record of /ship runs
What: Append structured JSON entry to .gstack/ship-log.json at end of every /ship run (version, date, branch, PR URL, review findings, Greptile stats, todos completed, test results).
Why: /retro has no structured data about shipping velocity. Ship log enables: PRs-per-week trending, review finding rates, Greptile signal over time, test suite growth.
Context: /retro already reads greptile-history.md — same pattern. Eval persistence (eval-store.ts) shows the JSON append pattern exists in the codebase. ~15 lines in ship template.
Effort: S Priority: P2 Depends on: None
Visual verification with screenshots in PR body
What: /ship Step 7.5: screenshot key pages after push, embed in PR body.
Why: Visual evidence in PRs. Reviewers see what changed without deploying locally.
Context: Part of Phase 3.6. Needs S3 upload for image hosting.
Effort: M Priority: P2 Depends on: /setup-gstack-upload
Review
Inline PR annotations
What: /ship and /review post inline review comments at specific file:line locations using gh api to create pull request review comments.
Why: Line-level annotations are more actionable than top-level comments. The PR thread becomes a line-by-line conversation between Greptile, Claude, and human reviewers.
Context: GitHub supports inline review comments via gh api repos/$REPO/pulls/$PR/reviews. Pairs naturally with Phase 3.6 visual annotations.
Effort: S Priority: P2 Depends on: None
Greptile training feedback export
What: Aggregate greptile-history.md into machine-readable JSON summary of false positive patterns, exportable to the Greptile team for model improvement.
Why: Closes the feedback loop — Greptile can use FP data to stop making the same mistakes on your codebase.
Context: Was a P3 Future Idea. Upgraded to P2 now that greptile-history.md data infrastructure exists. The signal data is already being collected; this just makes it exportable. ~40 lines.
Effort: S Priority: P2 Depends on: Enough FP data accumulated (10+ entries)
Visual review with annotated screenshots
What: /review Step 4.5: browse PR's preview deploy, annotated screenshots of changed pages, compare against production, check responsive layouts, verify accessibility tree.
Why: Visual diff catches layout regressions that code review misses.
Context: Part of Phase 3.6. Needs S3 upload for image hosting.
Effort: M Priority: P2 Depends on: /setup-gstack-upload
QA
QA trend tracking
What: Compare baseline.json over time, detect regressions across QA runs.
Why: Spot quality trends — is the app getting better or worse?
Context: QA already writes structured reports. This adds cross-run comparison.
Effort: S Priority: P2
CI/CD QA integration
What: /qa as GitHub Action step, fail PR if health score drops.
Why: Automated quality gate in CI. Catch regressions before merge.
Effort: M Priority: P2
Smart default QA tier
What: After a few runs, check index.md for user's usual tier pick, skip the AskUserQuestion.
Why: Reduces friction for repeat users.
Effort: S Priority: P2
Accessibility audit mode
What: --a11y flag for focused accessibility testing.
Why: Dedicated accessibility testing beyond the general QA checklist.
Effort: S Priority: P3
CI/CD generation for non-GitHub providers
What: Extend CI/CD bootstrap to generate GitLab CI (.gitlab-ci.yml), CircleCI (.circleci/config.yml), and Bitrise pipelines.
Why: Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone.
Context: v1 ships with GitHub Actions only. Detection logic already checks for .gitlab-ci.yml, .circleci/, bitrise.yml and skips with an informational note. Each provider needs ~20 lines of template text in generateTestBootstrap().
Effort: M Priority: P3 Depends on: Test bootstrap (shipped)
Auto-upgrade weak tests (★) to strong tests (★★★)
What: When Step 7 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths.
Why: Many codebases have tests that technically exist but don't catch real bugs — expect(component).toBeDefined() isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests."
Context: Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original.
Effort: M Priority: P3 Depends on: Test quality scoring (shipped)
Retro
Deployment health tracking (retro + browse)
What: Screenshot production state, check perf metrics (page load times), count console errors across key pages, track trends over retro window.
Why: Retro should include production health alongside code metrics.
Context: Requires browse integration. Screenshots + metrics fed into retro output.
Effort: L Priority: P3 Depends on: Browse sessions
Infrastructure
/setup-gstack-upload skill (S3 bucket)
What: Configure S3 bucket for image hosting. One-time setup for visual PR annotations.
Why: Prerequisite for visual PR annotations in /ship and /review.
Effort: M Priority: P2
gstack-upload helper
What: browse/bin/gstack-upload — upload file to S3, return public URL.
Why: Shared utility for all skills that need to embed images in PRs.
Effort: S Priority: P2 Depends on: /setup-gstack-upload
WebM to GIF conversion
What: ffmpeg-based WebM → GIF conversion for video evidence in PRs.
Why: GitHub PR bodies render GIFs but not WebM. Needed for video recording evidence.
Effort: S Priority: P3 Depends on: Video recording
Extend worktree isolation to Claude E2E tests
What: Add useWorktree?: boolean option to runSkillTest() so any Claude E2E test can opt into worktree mode for full repo context instead of tmpdir fixtures.
Why: Some Claude E2E tests (CSO audit, review-sql-injection) create minimal fake repos but would produce more realistic results with full repo context. The infrastructure exists (describeWithWorktree() in e2e-helpers.ts) — this extends it to the session-runner level.
Context: WorktreeManager shipped in v0.11.12.0. Currently only Gemini/Codex tests use worktrees. Claude tests use planted-bug fixture repos which are correct for their purpose, but new tests that want real repo context can use describeWithWorktree() today. This TODO is about making it even easier via a flag on runSkillTest().
Effort: M (human: ~2 days / CC: ~20 min) Priority: P3 Depends on: Worktree isolation (shipped v0.11.12.0)
E2E model pinning — SHIPPED
What: Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). --retry 2 added. EVALS_MODEL env var for override. test:e2e:fast tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
Eval web dashboard
What: bun run eval:dashboard serves local HTML with charts: cost trending, detection rate, pass/fail history.
Why: Visual charts better for spotting trends than CLI tools.
Context: Reads ~/.gstack-dev/evals/*.json. ~200 lines HTML + chart.js via Bun HTTP server.
Effort: M Priority: P3 Depends on: Eval persistence (shipped in v0.3.6)
CI/CD QA quality gate
What: Run /qa as a GitHub Action step, fail PR if health score drops below threshold.
Why: Automated quality gate catches regressions before merge. Currently QA is manual — CI integration makes it part of the standard workflow.
Context: Requires headless browse binary available in CI. The /qa skill already produces baseline.json with health scores — CI step would compare against the main branch baseline and fail if score drops. Would need ANTHROPIC_API_KEY in CI secrets since /qa uses Claude.
Effort: M Priority: P2 Depends on: None
Cross-platform URL open helper
What: gstack-open-url helper script — detect platform, use open (macOS) or xdg-open (Linux).
Why: The first-time Completeness Principle intro uses macOS open to launch the essay. If gstack ever supports Linux, this silently fails.
Effort: S (human: ~30 min / CC: ~2 min) Priority: P4 Depends on: Nothing
CDP-based DOM mutation detection for ref staleness
What: Use Chrome DevTools Protocol DOM.documentUpdated / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit snapshot call.
Why: Current ref staleness detection (async count() check) only catches stale refs at action time. CDP mutation detection would proactively warn when refs become stale, preventing the 5-second timeout entirely for SPA re-renders.
Context: Parts 1+2 of ref staleness fix (RefEntry metadata + eager validation via count()) are shipped. This is Part 3 — the most ambitious piece. Requires CDP session alongside Playwright, MutationObserver bridge, and careful performance tuning to avoid overhead on every DOM change.
Effort: L Priority: P3 Depends on: Ref staleness Parts 1+2 (shipped)
Office Hours / Design
Design docs → Supabase team store sync
What: Add design docs (*-design-*.md) to the Supabase sync pipeline alongside test plans, retro snapshots, and QA reports.
Why: Cross-team design discovery at scale. Local ~/.gstack/projects/$SLUG/ keyword-grep discovery works for same-machine users now, but Supabase sync makes it work across the whole team. Duplicate ideas surface, everyone sees what's been explored.
Context: /office-hours writes design docs to ~/.gstack/projects/$SLUG/. The team store already syncs test plans, retro snapshots, QA reports. Design docs follow the same pattern — just add a sync adapter.
Effort: S
Priority: P2
Depends on: garrytan/team-supabase-store branch landing on main
/yc-prep skill
What: Skill that helps founders prepare their YC application after /office-hours identifies strong signal. Pulls from the design doc, structures answers to YC app questions, runs a mock interview.
Why: Closes the loop. /office-hours identifies the founder, /yc-prep helps them apply well. The design doc already contains most of the raw material for a YC application.
Effort: M (human: ~2 weeks / CC: ~2 hours) Priority: P2 Depends on: office-hours founder discovery engine shipping first
Design Review
/plan-design-review + /qa-design-review + /design-consultation — SHIPPED
Shipped as v0.5.0 on main. Includes /plan-design-review (report-only design audit), /qa-design-review (audit + fix loop), and /design-consultation (interactive DESIGN.md creation). {{DESIGN_METHODOLOGY}} resolver provides shared 80-item design audit checklist.
Design outside voices in /plan-eng-review
What: Extend the parallel dual-voice pattern (Codex + Claude subagent) to /plan-eng-review's architecture review section.
Why: The design beachhead (v0.11.3.0) proves cross-model consensus works for subjective reviews. Architecture reviews have similar subjectivity in tradeoff decisions.
Context: Depends on learnings from the design beachhead. If the litmus scorecard format proves useful, adapt it for architecture dimensions (coupling, scaling, reversibility).
Effort: S Priority: P3 Depends on: Design outside voices shipped (v0.11.3.0)
Outside voices in /qa visual regression detection
What: Add Codex design voice to /qa for detecting visual regressions during bug-fix verification.
Why: When fixing bugs, the fix can introduce visual regressions that code-level checks miss. Codex could flag "the fix broke the responsive layout" during re-test.
Context: Depends on /qa having design awareness. Currently /qa focuses on functional testing.
Effort: M Priority: P3 Depends on: Design outside voices shipped (v0.11.3.0)
Document-Release
Auto-invoke /document-release from /ship — SHIPPED
Shipped in v0.8.3. Step 8.5 added to /ship — after creating the PR, /ship automatically reads document-release/SKILL.md and executes the doc update workflow. Zero-friction doc updates.
{{DOC_VOICE}} shared resolver
What: Create a placeholder resolver in gen-skill-docs.ts encoding the gstack voice guide (friendly, user-forward, lead with benefits). Inject into /ship Step 5, /document-release Step 5, and reference from CLAUDE.md.
Why: DRY — voice rules currently live inline in 3 places (CLAUDE.md CHANGELOG style section, /ship Step 5, /document-release Step 5). When the voice evolves, all three drift.
Context: Same pattern as {{QA_METHODOLOGY}} — shared block injected into multiple templates to prevent drift. ~20 lines in gen-skill-docs.ts.
Effort: S Priority: P2 Depends on: None
Ship Confidence Dashboard
Smart review relevance detection — PARTIALLY SHIPPED
What: Auto-detect which of the 4 reviews are relevant based on branch changes (skip Design Review if no CSS/view changes, skip Code Review if plan-only).
bin/gstack-diff-scope shipped — categorizes diff into SCOPE_FRONTEND, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Used by design-review-lite to skip when no frontend files changed. Dashboard integration for conditional row display is a follow-up.
Remaining: Dashboard conditional row display (hide "Design Review: NOT YET RUN" when SCOPE_FRONTEND=false). Extend to Eng Review (skip for docs-only) and CEO Review (skip for config-only).
Effort: S Priority: P3 Depends on: gstack-diff-scope (shipped)
Codex
Codex→Claude reverse buddy check skill
What: A Codex-native skill (.agents/skills/gstack-claude/SKILL.md) that runs claude -p to get an independent second opinion from Claude — the reverse of what /codex does today from Claude Code.
Why: Codex users deserve the same cross-model challenge that Claude users get via /codex. Currently the flow is one-way (Claude→Codex). Codex users have no way to get a Claude second opinion.
Context: The /codex skill template (codex/SKILL.md.tmpl) shows the pattern — it wraps codex exec with JSONL parsing, timeout handling, and structured output. The reverse skill would wrap claude -p with similar infrastructure. Would be generated into .agents/skills/gstack-claude/ by gen-skill-docs --host codex.
Effort: M (human: ~2 weeks / CC: ~30 min) Priority: P1 Depends on: None
Completeness
Completeness metrics dashboard
What: Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time.
Why: Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts).
Context: Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence.
Effort: M (human) / S (CC) Priority: P3 Depends on: Boil the Lake shipped (v0.6.1)
Safety & Observability
On-demand hook skills (/careful, /freeze, /guard) — SHIPPED
What: Three new skills that use Claude Code's session-scoped PreToolUse hooks to add safety guardrails on demand.
Shipped as /careful, /freeze, /guard, and /unfreeze in v0.6.5. Includes hook fire-rate telemetry (pattern name only, no command content) and inline skill activation telemetry.
Skill usage telemetry — SHIPPED
What: Track which skills get invoked, how often, from which repo.
Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into preamble telemetry line. Analytics CLI (bun run analytics) for querying. /retro integration shows skills-used-this-week.
/investigate scoped debugging enhancements (gated on telemetry)
What: Six enhancements to /investigate auto-freeze, contingent on telemetry showing the freeze hook actually fires in real debugging sessions.
Why: /investigate v0.7.1 auto-freezes edits to the module being debugged. If telemetry shows the hook fires often, these enhancements make the experience smarter. If it never fires, the problem wasn't real and these aren't worth building.
Context: All items are prose additions to investigate/SKILL.md.tmpl. No new scripts.
Items:
- Stack trace auto-detection for freeze directory (parse deepest app frame)
- Freeze boundary widening (ask to widen instead of hard-block when hitting boundary)
- Post-fix auto-unfreeze + full test suite run
- Debug instrumentation cleanup (tag with DEBUG-TEMP, remove before commit)
- Debug session persistence (~/.gstack/investigate-sessions/ — save investigation for reuse)
- Investigation timeline in debug report (hypothesis log with timing)
Effort: M (all 6 combined) Priority: P3 Depends on: Telemetry data showing freeze hook fires in real /investigate sessions
Context Intelligence
Context recovery preamble
What: Add ~10 lines of prose to the preamble telling the agent to re-read gstack artifacts (CEO plans, design reviews, eng reviews, checkpoints) after compaction or context degradation.
Why: gstack skills produce valuable artifacts stored at ~/.gstack/projects/$SLUG/. When Claude's auto-compaction fires, it preserves a generic summary but doesn't know these artifacts exist. The plans and reviews that shaped the current work silently vanish from context, even though they're still on disk. This is the thing nobody else in the Claude Code ecosystem is solving, because nobody else has gstack's artifact architecture.
Context: Inspired by Anthropic's claude-progress.txt pattern for long-running agents. Also informed by claude-mem's "progressive disclosure" approach. See docs/designs/SESSION_INTELLIGENCE.md for the broader vision. CEO plan: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-03-31-session-intelligence-layer.md.
Effort: S (human: ~30 min / CC: ~5 min)
Priority: P1
Depends on: None
Key files: scripts/resolvers/preamble.ts
Session timeline
What: Append one-line JSONL entry to ~/.gstack/projects/$SLUG/timeline.jsonl after every skill run (timestamp, skill, branch, outcome). /retro renders the timeline.
Why: Makes AI-assisted work history visible. /retro can show "this week: 3 /review, 2 /ship, 1 /investigate." Provides the observability layer for the session intelligence architecture.
Effort: S (human: ~1h / CC: ~5 min)
Priority: P1
Depends on: None
Key files: scripts/resolvers/preamble.ts, retro/SKILL.md.tmpl
Cross-session context injection
What: When a new gstack session starts on a branch with recent checkpoints or plans, the preamble prints a one-line summary: "Last session: implemented JWT auth, 3/5 tasks done." Agent knows where you left off before reading any files.
Why: Claude starts every session fresh. This one-liner orients the agent immediately. Similar to claude-mem's SessionStart hook pattern but simpler and integrated.
Effort: S (human: ~2h / CC: ~10 min) Priority: P2 Depends on: Context recovery preamble
/checkpoint skill
What: Manual skill to snapshot current working state: what's being done and why, files being edited, decisions made (and rationale), what's done vs. remaining, critical types/signatures. Saved to ~/.gstack/projects/$SLUG/checkpoints/<timestamp>.md.
Why: Useful before stepping away from a long session, before known-complex operations that might trigger compaction, for handing off context to a different agent/workspace, or coming back to a project after days away.
Effort: M (human: ~1 week / CC: ~30 min)
Priority: P2
Depends on: Context recovery preamble
Key files: New checkpoint/SKILL.md.tmpl, scripts/gen-skill-docs.ts
Session Intelligence Layer design doc
What: Write docs/designs/SESSION_INTELLIGENCE.md describing the architectural vision: gstack as the persistent brain that survives Claude's ephemeral context. Every skill writes to ~/.gstack/projects/$SLUG/, preamble re-reads, /retro rolls up.
Why: Connects context recovery, health, checkpoint, and timeline features into a coherent architecture. Nobody else in the ecosystem is building this.
Effort: S (human: ~2h / CC: ~15 min) Priority: P1 Depends on: None
Health
/health — Project Health Dashboard
What: Skill that runs type-check, lint, test suite, and dead code scan, then reports a composite 0-10 health score with breakdown by category. Tracks over time in ~/.gstack/health/<project-slug>/ for trend detection. Optionally integrates CodeScene MCP for deeper complexity/cohesion/coupling analysis.
Why: No quick way to get "state of the codebase" before starting work. CodeScene peer-reviewed research shows AI-generated code increases static analysis warnings by 30%, code complexity by 41%, and change failure rates by 30%. Users need guardrails. Like /qa but for code quality rather than browser behavior.
Context: Reads CLAUDE.md for project-specific commands (platform-agnostic principle). Runs checks in parallel. /retro can pull from health history for trend sparklines.
Effort: M (human: ~1 week / CC: ~30 min)
Priority: P1
Depends on: None
Key files: New health/SKILL.md.tmpl, scripts/gen-skill-docs.ts
/health as /ship gate
What: If health score exists and drops below a configurable threshold, /ship warns before creating the PR: "Health dropped from 8/10 to 5/10 this branch — 3 new lint warnings, 1 test failure. Ship anyway?"
Why: Quality gate that prevents shipping degraded code. Configurable threshold so it's not blocking for teams that don't use /health.
Effort: S (human: ~1h / CC: ~5 min) Priority: P2 Depends on: /health skill
Swarm
Swarm primitive — reusable multi-agent dispatch
What: Extract Review Army's dispatch pattern into a reusable resolver (scripts/resolvers/swarm.ts). Wire into /ship for parallel pre-ship checks (type-check + lint + test in parallel sub-agents). Make available to /qa, /investigate, /health.
Why: Review Army proved parallel sub-agents work brilliantly (5 agents = 835K tokens of working memory vs. 167K for one). The pattern is locked inside review-army.ts. Other skills need it too. Claude Code Agent Teams (official, Feb 2026) validates the team-lead-delegates-to-specialists pattern. Gartner: multi-agent inquiries surged 1,445% in one year.
Context: Start with the specific /ship use case. Extract shared parts only after 2+ consumers reveal what config parameters are actually needed. Avoid premature abstraction. Can leverage existing WorktreeManager for isolation.
Effort: L (human: ~2 weeks / CC: ~2 hours)
Priority: P2
Depends on: None
Key files: scripts/resolvers/review-army.ts, new scripts/resolvers/swarm.ts, ship/SKILL.md.tmpl, lib/worktree.ts
Refactoring
/refactor-prep — Pre-Refactor Token Hygiene
What: Skill that detects project language/framework, runs appropriate dead code detection (knip/ts-prune for TS/JS, vulture/autoflake for Python, staticcheck/deadcode for Go, cargo udeps for Rust), strips dead imports/exports/props/console.logs, and commits cleanup separately.
Why: Dirty codebases accelerate context compaction. Dead imports, unused exports, and orphaned code eat tokens that contribute nothing but everything to triggering compaction mid-refactor. Cleaning first buys back 20%+ of context budget. Reports lines removed and estimated token savings.
Effort: M (human: ~1 week / CC: ~30 min)
Priority: P2
Depends on: None
Key files: New refactor-prep/SKILL.md.tmpl, scripts/gen-skill-docs.ts
Factory Droid
Browse MCP server for Factory Droid
What: Expose gstack's browse binary and key workflows as an MCP server that Factory Droid connects to natively. Factory users would run /mcp, add the gstack server, and get browse, QA, and review capabilities as Factory tools.
Why: Factory already supports 40+ MCP servers in its registry. Getting gstack's browse binary listed there is a distribution play. Nobody else has a real compiled browser binary as an MCP tool. This is the thing that makes gstack uniquely valuable on Factory Droid.
Context: Option A (--host factory compatibility shim) ships first in v0.13.4.0. Option B is the follow-up that provides deeper integration. The browse binary is already a stateless CLI, so wrapping it as an MCP server is straightforward (stdin/stdout JSON-RPC). Each browse command becomes an MCP tool.
Effort: L (human: ~1 week / CC: ~5 hours) Priority: P1 Depends on: --host factory (Option A, shipping in v0.13.4.0)
.agent/skills/ dual output for cross-agent compatibility
What: Factory also reads from <repo>/.agent/skills/ as a cross-agent compatibility path. Could output there in addition to .factory/skills/ for broader reach across other agents that use the .agent convention.
Why: Multiple AI agents beyond Factory may adopt the .agent/skills/ convention. Outputting there too would give free compatibility.
Effort: S Priority: P3 Depends on: --host factory
Custom Droid definitions alongside skills
What: Factory has "custom droids" (subagents with tool restrictions, model selection, autonomy levels). Could ship gstack-qa.md droid configs alongside skills that restrict tools to read-only + execute for safety.
Why: Deeper Factory integration. Droid configs give Factory users tighter control over what gstack skills can do.
Effort: M Priority: P3 Depends on: --host factory
GStack Browser
Anti-bot stealth: Playwright CDP patches (rebrowser-style)
What: Write a postinstall script that patches Playwright's CDP layer to suppress Runtime.enable and use addBinding for context ID discovery, same approach as rebrowser-patches. Eliminates the navigator.webdriver, cdc_ markers, and other CDP artifacts that sites like Google use to detect automation.
Why: Our current stealth patches (UA override, navigator.webdriver=false, fake plugins) work on most sites but Google still triggers captchas. The real detection is at the CDP protocol level. rebrowser-patches proved the approach works but their patches target Playwright 1.52.0 and don't apply to our 1.58.2. We need our own patcher using string matching instead of line-number diffs. 6 files, ~200 lines of patches total.
Context: Full analysis of rebrowser-patches source: patches 6 files in playwright-core/lib/server/ (crConnection.js, crDevTools.js, crPage.js, crServiceWorker.js, frames.js, page.js). Key technique: suppress Runtime.enable (the main CDP detection vector), use Runtime.addBinding + CustomEvent trick to discover execution context IDs without it. Our extension communicates via Chrome extension APIs, not CDP Runtime, so it should be unaffected. Write E2E tests that verify: (1) extension still loads and connects, (2) Google.com loads without captcha, (3) sidebar chat still works.
Effort: L (human: ~2 weeks / CC: ~3 hours) Priority: P1 Depends on: None
Chromium fork (long-term alternative to CDP patches)
What: Maintain a Chromium fork where anti-bot stealth, GStack Browser branding, and native sidebar support live in the source code, not as runtime monkey-patches.
Why: The CDP patches are brittle. They break on every Playwright upgrade and target compiled JS with fragile string matching. A proper fork means: (1) stealth is permanent, not patched, (2) branding is native (no plist hacking at launch), (3) native sidebar replaces the extension (Phase 4 of V0 roadmap), (4) custom protocols (gstack://) for internal pages. Companies like Brave, Arc, and Vivaldi maintain Chromium forks with small teams. With CC, the rebase-on-upstream maintenance could be largely automated.
Context: Trigger criteria from V0 design doc: fork when extension side panel becomes the bottleneck, when anti-bot patches need to live deeper than CDP, or when native UI integration (sidebar, status bar) can't be done via extension. The Chromium build takes ~4 hours on a 32-core machine and produces ~50GB of build artifacts. CI would need dedicated build infra. See docs/designs/GSTACK_BROWSER_V0.md Phase 5 for full analysis.
Effort: XL (human: ~1 quarter / CC: ~2-3 weeks of focused work) Priority: P2 Depends on: CDP patches proving the value of anti-bot stealth first
Completed
CI eval pipeline (v0.9.9.0)
- GitHub Actions eval upload on Ubicloud runners ($0.006/run)
- Within-file test concurrency (test() → testConcurrentIfSelected())
- Eval artifact upload + PR comment with pass/fail + cost
- Baseline comparison via artifact download from main
- EVALS_CONCURRENCY=40 for ~6min wall clock (was ~18min) Completed: v0.9.9.0
Deploy pipeline (v0.9.8.0)
- /land-and-deploy — merge PR, wait for CI/deploy, canary verification
- /canary — post-deploy monitoring loop with anomaly detection
- /benchmark — performance regression detection with Core Web Vitals
- /setup-deploy — one-time deploy platform configuration
- /review Performance & Bundle Impact pass
- E2E model pinning (Sonnet default, Opus for quality tests)
- E2E timing telemetry (first_response_ms, max_inter_turn_ms, wall_clock_ms)
- test:e2e:fast tier, --retry 2 on all E2E scripts Completed: v0.9.8.0
Phase 1: Foundations (v0.2.0)
- Rename to gstack
- Restructure to monorepo layout
- Setup script for skill symlinks
- Snapshot command with ref-based element selection
- Snapshot tests Completed: v0.2.0
Phase 2: Enhanced Browser (v0.2.0)
- Annotated screenshots, snapshot diffing, dialog handling, file upload
- Cursor-interactive elements, element state checks
- CircularBuffer, async buffer flush, health check
- Playwright error wrapping, useragent fix
- 148 integration tests Completed: v0.2.0
Phase 3: QA Testing Agent (v0.3.0)
- /qa SKILL.md with 6-phase workflow, 3 modes (full/quick/regression)
- Issue taxonomy, severity classification, exploration checklist
- Report template, health score rubric, framework detection
- wait/console/cookie-import commands, find-browse binary Completed: v0.3.0
Phase 3.5: Browser Cookie Import (v0.3.x)
- cookie-import-browser command (Chromium cookie DB decryption)
- Cookie picker web UI, /setup-browser-cookies skill
- 18 unit tests, browser registry (Comet, Chrome, Arc, Brave, Edge) Completed: v0.3.1
E2E test cost tracking
- Track cumulative API spend, warn if over threshold Completed: v0.3.6
Auto-upgrade mode + smart update check
- Config CLI (
bin/gstack-config), auto-upgrade via~/.gstack/config.yaml, 12h cache TTL, exponential snooze backoff (24h→48h→1wk), "never ask again" option, vendored copy sync on upgrade Completed: v0.3.8