gstack/TODOS.md
Garry Tan 94c1530efc feat: /debug sub-agent escalation from /qa + recommendations in /review and /ship (v0.6.5.0) (#192)
* feat: add browse access to /debug for visual verification

Debug skill can now use the browse binary to visually reproduce bugs,
take screenshots as evidence, and verify fixes. This makes /debug
effective for web app bugs when spawned as a sub-agent from /qa.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /debug sub-agent escalation to /qa (Phase 8g)

When QA fix attempts fail twice on the same bug (reverted due to
regressions), /qa now spawns a /debug sub-agent with a structured
bug brief including symptoms, repro steps, failed fix details, and
file paths. Results are reported in Phase 10's debug escalation summary.

Sequential execution: one debug investigation at a time, working tree
cleaned between investigations. Graceful degradation on all failure
modes (BLOCKED, agent failure → deferred in report).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /debug recommendation to /review (Step 5.7)

When /review finds what appears to be a pre-existing bug in the base
branch (not introduced by the PR's diff), it now classifies it as
INFORMATIONAL and recommends running /debug for systematic root-cause
investigation. No Agent spawning — /review's scope stays on the diff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add reverted QA commit detection to /ship

During pre-landing review, /ship now checks for reverted fix(qa):
commits in the branch history and recommends /debug for systematic
investigation. Informational only — does not block shipping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add debug escalation tests (validation + LLM judge + E2E)

Skill validation: 11 new assertions covering Phase 8g trigger, structured
handoff fields, agent result handlers, debug escalation summary, Step 5.7
recommendation, ship reverted QA detection, and debug browse setup.

LLM judge: evaluates Phase 8g template quality — structured brief format,
result handling, working tree cleanup, sequential processing.

E2E: prompt-level deterministic test (verifies escalation prompt has all
required fields) + full flow stub (fixture TODO for planted regression).

Touchfile entries for diff-based test selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add worktree parallel debug agents to TODOS.md (P2)

When /qa hits multiple stubborn bugs, parallel debug agents in
isolated git worktrees could investigate simultaneously. Deferred
from the sequential debug escalation PR as a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.6.5.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add E2E evals for /review pre-existing bug + /ship reverted QA detection

Two new E2E tests:
- review-pre-existing-bug: plants SQL injection in base branch, verifies
  Step 5.7 classifies as INFORMATIONAL and recommends /debug
- ship-reverted-qa-commits: creates branch with reverted fix(qa): commits,
  verifies /ship detects them and recommends /debug

Also fixes qa-debug-prompt-logic to use correct workingDirectory, and
ensures test repo init uses -b main for portability.

All 4 debug-related evals pass: $0.34 total, 94s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 17:59:32 -05:00


# TODOS
## Browse
### Bundle server.ts into compiled binary
**What:** Eliminate `resolveServerScript()` fallback chain entirely — bundle server.ts into the compiled browse binary.
**Why:** The current fallback chain (check adjacent to cli.ts, check global install) is fragile and caused bugs in v0.3.2. A single compiled binary is simpler and more reliable.
**Context:** Bun's `--compile` flag can bundle multiple entry points. The server is currently resolved at runtime via file path lookup. Bundling it removes the resolution step entirely.
**Effort:** M
**Priority:** P2
**Depends on:** None
### Sessions (isolated browser instances)
**What:** Isolated browser instances with separate cookies/storage/history, addressable by name.
**Why:** Enables parallel testing of different user roles, A/B test verification, and clean auth state management.
**Context:** Requires Playwright browser context isolation. Each session gets its own context with independent cookies/localStorage. Prerequisite for video recording (clean context lifecycle) and auth vault.
**Effort:** L
**Priority:** P3
### Video recording
**What:** Record browser interactions as video (start/stop controls).
**Why:** Video evidence in QA reports and PR bodies. Currently deferred because `recreateContext()` destroys page state.
**Context:** Needs sessions for clean context lifecycle. Playwright supports video recording per context. Also needs WebM → GIF conversion for PR embedding.
**Effort:** M
**Priority:** P3
**Depends on:** Sessions
### v20 encryption format support
**What:** AES-256-GCM support for future Chromium cookie DB versions (currently v10).
**Why:** Future Chromium versions may change encryption format. Proactive support prevents breakage.
**Effort:** S
**Priority:** P3
### State persistence
**What:** Save/load cookies + localStorage to JSON files for reproducible test sessions.
**Why:** Enables "resume where I left off" for QA sessions and repeatable auth states.
**Effort:** M
**Priority:** P3
**Depends on:** Sessions
### Auth vault
**What:** Encrypted credential storage, referenced by name. LLM never sees passwords.
**Why:** Security — currently auth credentials flow through the LLM context. Vault keeps secrets out of the AI's view.
**Effort:** L
**Priority:** P3
**Depends on:** Sessions, state persistence
### Iframe support
**What:** `frame <sel>` and `frame main` commands for cross-frame interaction.
**Why:** Many web apps use iframes (embeds, payment forms, ads). Currently invisible to browse.
**Effort:** M
**Priority:** P4
### Semantic locators
**What:** `find role/label/text/placeholder/testid` with attached actions.
**Why:** More resilient element selection than CSS selectors or ref numbers.
**Effort:** M
**Priority:** P4
### Device emulation presets
**What:** `set device "iPhone 16 Pro"` for mobile/tablet testing.
**Why:** Responsive layout testing without manual viewport resizing.
**Effort:** S
**Priority:** P4
### Network mocking/routing
**What:** Intercept, block, and mock network requests.
**Why:** Test error states, loading states, and offline behavior.
**Effort:** M
**Priority:** P4
### Download handling
**What:** Click-to-download with path control.
**Why:** Test file download flows end-to-end.
**Effort:** S
**Priority:** P4
### Content safety
**What:** `--max-output` truncation, `--allowed-domains` filtering.
**Why:** Prevent context window overflow and restrict navigation to safe domains.
**Effort:** S
**Priority:** P4
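**Sketch:** One possible shape for the two guards, written as pure helpers. The function names, the truncation marker, and the subdomain-matching rule are all assumptions for illustration; browse does not currently expose them.

```typescript
// Hypothetical helpers; names and semantics are assumptions, not browse's actual API.

/** Truncate output to at most maxChars, appending a marker when text was cut. */
function truncateOutput(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const omitted = text.length - maxChars;
  return `${text.slice(0, maxChars)}\n[truncated ${omitted} chars]`;
}

/** True when hostname equals an allowed domain or is a subdomain of one. */
function isAllowedHost(hostname: string, allowedDomains: string[]): boolean {
  const host = hostname.toLowerCase();
  return allowedDomains.some((domain) => {
    const d = domain.toLowerCase();
    // Require a dot boundary so "evilexample.com" does not match "example.com".
    return host === d || host.endsWith(`.${d}`);
  });
}
```

Matching on a dot boundary rather than a bare suffix is the important design choice here; a plain `endsWith(domain)` check would let look-alike hosts through.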
### Streaming (WebSocket live preview)
**What:** WebSocket-based live preview for pair browsing sessions.
**Why:** Enables real-time collaboration — human watches AI browse.
**Effort:** L
**Priority:** P4
### CDP mode
**What:** Connect to already-running Chrome/Electron apps via Chrome DevTools Protocol.
**Why:** Test production apps, Electron apps, and existing browser sessions without launching new instances.
**Effort:** M
**Priority:** P4
### Linux/Windows cookie decryption
**What:** GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.
**Why:** Cross-platform cookie import. Currently macOS-only (Keychain).
**Effort:** L
**Priority:** P4
## Ship
### Ship log — persistent record of /ship runs
**What:** Append structured JSON entry to `.gstack/ship-log.json` at end of every /ship run (version, date, branch, PR URL, review findings, Greptile stats, todos completed, test results).
**Why:** /retro has no structured data about shipping velocity. Ship log enables: PRs-per-week trending, review finding rates, Greptile signal over time, test suite growth.
**Context:** /retro already reads greptile-history.md — same pattern. Eval persistence (eval-store.ts) shows the JSON append pattern exists in the codebase. ~15 lines in ship template.
**Effort:** S
**Priority:** P2
**Depends on:** None
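**Sketch:** A minimal version of the append step, assuming the log is a single JSON array. The field names below are guesses derived from the list in **What** (only a subset is modeled), and `appendShipLogEntry` is a hypothetical helper, not existing gstack code.

```typescript
// Field names mirror part of the list in the TODO; the exact schema is an assumption.
interface ShipLogEntry {
  version: string;
  date: string; // ISO 8601
  branch: string;
  prUrl: string;
  reviewFindings: number;
  testsPassed: boolean;
}

/** Return the new file contents after appending one entry to a JSON array log. */
function appendShipLogEntry(existingJson: string | null, entry: ShipLogEntry): string {
  // A missing or empty file starts a fresh log.
  const log: unknown = existingJson && existingJson.trim() ? JSON.parse(existingJson) : [];
  if (!Array.isArray(log)) throw new Error("ship log must be a JSON array");
  return JSON.stringify([...log, entry], null, 2) + "\n";
}
```

The caller would read `.gstack/ship-log.json` (passing `null` if it does not exist) and write the returned string back. Keeping the function pure makes it easy to test without touching the filesystem.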
### Post-deploy verification (ship + browse)
**What:** After push, browse staging/preview URL, screenshot key pages, check console for JS errors, compare staging vs prod via snapshot diff. Include verification screenshots in PR body. STOP if critical errors found.
**Why:** Catch deployment-time regressions (JS errors, broken layouts) before merge.
**Context:** Requires S3 upload infrastructure for PR screenshots. Pairs with visual PR annotations.
**Effort:** L
**Priority:** P2
**Depends on:** /setup-gstack-upload, visual PR annotations
### Visual verification with screenshots in PR body
**What:** /ship Step 7.5: screenshot key pages after push, embed in PR body.
**Why:** Visual evidence in PRs. Reviewers see what changed without deploying locally.
**Context:** Part of Phase 3.6. Needs S3 upload for image hosting.
**Effort:** M
**Priority:** P2
**Depends on:** /setup-gstack-upload
## Review
### Inline PR annotations
**What:** /ship and /review post inline review comments at specific file:line locations using `gh api` to create pull request review comments.
**Why:** Line-level annotations are more actionable than top-level comments. The PR thread becomes a line-by-line conversation between Greptile, Claude, and human reviewers.
**Context:** GitHub supports inline review comments via `gh api repos/$REPO/pulls/$PR/reviews`. Pairs naturally with Phase 3.6 visual annotations.
**Effort:** S
**Priority:** P2
**Depends on:** None
### Greptile training feedback export
**What:** Aggregate greptile-history.md into machine-readable JSON summary of false positive patterns, exportable to the Greptile team for model improvement.
**Why:** Closes the feedback loop — Greptile can use FP data to stop making the same mistakes on your codebase.
**Context:** Was a P3 Future Idea. Upgraded to P2 now that greptile-history.md data infrastructure exists. The signal data is already being collected; this just makes it exportable. ~40 lines.
**Effort:** S
**Priority:** P2
**Depends on:** Enough FP data accumulated (10+ entries)
### Visual review with annotated screenshots
**What:** /review Step 4.5: browse the PR's preview deploy, take annotated screenshots of changed pages, compare against production, check responsive layouts, verify accessibility tree.
**Why:** Visual diff catches layout regressions that code review misses.
**Context:** Part of Phase 3.6. Needs S3 upload for image hosting.
**Effort:** M
**Priority:** P2
**Depends on:** /setup-gstack-upload
## QA
### QA trend tracking
**What:** Compare baseline.json over time, detect regressions across QA runs.
**Why:** Spot quality trends — is the app getting better or worse?
**Context:** QA already writes structured reports. This adds cross-run comparison.
**Effort:** S
**Priority:** P2
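**Sketch:** Cross-run comparison could be as small as the function below. It assumes each run's `baseline.json` can be reduced to a date and a numeric health score; the `QaRun` shape and `findRegressions` name are hypothetical.

```typescript
// Assumed shape: each run's baseline.json reduced to a date and a health score.
interface QaRun {
  date: string; // ISO 8601, so lexical order matches chronological order
  healthScore: number;
}

interface Regression {
  from: QaRun;
  to: QaRun;
  drop: number;
}

/** Report every consecutive pair of runs where the score fell by more than minDrop. */
function findRegressions(runs: QaRun[], minDrop = 0): Regression[] {
  const ordered = [...runs].sort((a, b) => a.date.localeCompare(b.date));
  const regressions: Regression[] = [];
  let prev: QaRun | undefined;
  for (const run of ordered) {
    if (prev && prev.healthScore - run.healthScore > minDrop) {
      regressions.push({ from: prev, to: run, drop: prev.healthScore - run.healthScore });
    }
    prev = run;
  }
  return regressions;
}
```

A non-zero `minDrop` gives some tolerance for run-to-run noise in the score.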
### CI/CD QA integration
**What:** `/qa` as GitHub Action step, fail PR if health score drops.
**Why:** Automated quality gate in CI. Catch regressions before merge.
**Effort:** M
**Priority:** P2
### Smart default QA tier
**What:** After a few runs, read index.md to find the user's usual tier pick and skip the AskUserQuestion prompt.
**Why:** Reduces friction for repeat users.
**Effort:** S
**Priority:** P2
### Accessibility audit mode
**What:** `--a11y` flag for focused accessibility testing.
**Why:** Dedicated accessibility testing beyond the general QA checklist.
**Effort:** S
**Priority:** P3
### CI/CD generation for non-GitHub providers
**What:** Extend CI/CD bootstrap to generate GitLab CI (`.gitlab-ci.yml`), CircleCI (`.circleci/config.yml`), and Bitrise pipelines.
**Why:** Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone.
**Context:** v1 ships with GitHub Actions only. Detection logic already checks for `.gitlab-ci.yml`, `.circleci/`, `bitrise.yml` and skips with an informational note. Each provider needs ~20 lines of template text in `generateTestBootstrap()`.
**Effort:** M
**Priority:** P3
**Depends on:** Test bootstrap (shipped)
### Auto-upgrade weak tests (★) to strong tests (★★★)
**What:** When Step 3.4 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths.
**Why:** Many codebases have tests that technically exist but don't catch real bugs — `expect(component).toBeDefined()` isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests."
**Context:** Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original.
**Effort:** M
**Priority:** P3
**Depends on:** Test quality scoring (shipped)
## Retro
### Deployment health tracking (retro + browse)
**What:** Screenshot production state, check perf metrics (page load times), count console errors across key pages, track trends over retro window.
**Why:** Retro should include production health alongside code metrics.
**Context:** Requires browse integration. Screenshots + metrics fed into retro output.
**Effort:** L
**Priority:** P3
**Depends on:** Browse sessions
## Infrastructure
### /setup-gstack-upload skill (S3 bucket)
**What:** Configure S3 bucket for image hosting. One-time setup for visual PR annotations.
**Why:** Prerequisite for visual PR annotations in /ship and /review.
**Effort:** M
**Priority:** P2
### gstack-upload helper
**What:** `browse/bin/gstack-upload` — upload file to S3, return public URL.
**Why:** Shared utility for all skills that need to embed images in PRs.
**Effort:** S
**Priority:** P2
**Depends on:** /setup-gstack-upload
### WebM to GIF conversion
**What:** ffmpeg-based WebM → GIF conversion for video evidence in PRs.
**Why:** GitHub PR bodies render GIFs but not WebM. Needed for video recording evidence.
**Effort:** S
**Priority:** P3
**Depends on:** Video recording
### Deploy-verify skill
**What:** Lightweight post-deploy smoke test: hit key URLs, verify 200s, screenshot critical pages, console error check, compare against baseline snapshots. Pass/fail with evidence.
**Why:** Fast post-deploy confidence check, separate from full QA.
**Effort:** M
**Priority:** P2
### GitHub Actions eval upload
**What:** Run eval suite in CI, upload result JSON as artifact, post summary comment on PR.
**Why:** CI integration catches quality regressions before merge and provides persistent eval records per PR.
**Context:** Requires `ANTHROPIC_API_KEY` in CI secrets. Cost is ~$4/run. Eval persistence system (v0.3.6) writes JSON to `~/.gstack-dev/evals/` — CI would upload as GitHub Actions artifacts and use `eval:compare` to post delta comment.
**Effort:** M
**Priority:** P2
**Depends on:** Eval persistence (shipped in v0.3.6)
### E2E model pinning
**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
**Why:** Reduce E2E test cost and flakiness.
**Effort:** XS
**Priority:** P2
### Eval web dashboard
**What:** `bun run eval:dashboard` serves local HTML with charts: cost trending, detection rate, pass/fail history.
**Why:** Visual charts better for spotting trends than CLI tools.
**Context:** Reads `~/.gstack-dev/evals/*.json`. ~200 lines HTML + chart.js via Bun HTTP server.
**Effort:** M
**Priority:** P3
**Depends on:** Eval persistence (shipped in v0.3.6)
### CI/CD QA quality gate
**What:** Run `/qa` as a GitHub Action step, fail PR if health score drops below threshold.
**Why:** Automated quality gate catches regressions before merge. Currently QA is manual — CI integration makes it part of the standard workflow.
**Context:** Requires headless browse binary available in CI. The `/qa` skill already produces `baseline.json` with health scores — CI step would compare against the main branch baseline and fail if score drops. Would need `ANTHROPIC_API_KEY` in CI secrets since `/qa` uses Claude.
**Effort:** M
**Priority:** P2
**Depends on:** None
### Cross-platform URL open helper
**What:** `gstack-open-url` helper script — detect platform, use `open` (macOS) or `xdg-open` (Linux).
**Why:** The first-time Completeness Principle intro uses macOS `open` to launch the essay. If gstack ever supports Linux, this silently fails.
**Effort:** S (human: ~30 min / CC: ~2 min)
**Priority:** P4
**Depends on:** Nothing
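**Sketch:** The platform detection itself is tiny. The version below is an illustration in TypeScript (the actual helper may well be a shell script); `openCommandFor` and `buildOpenArgv` are hypothetical names, and the platform strings are the values Node and Bun report in `process.platform`.

```typescript
/** Map a platform string (as reported by process.platform) to a URL opener command. */
function openCommandFor(platform: string): string | null {
  switch (platform) {
    case "darwin":
      return "open";
    case "linux":
      return "xdg-open";
    default:
      return null; // unsupported: caller should print the URL instead of failing silently
  }
}

/** Build an argv array (no shell involved) or null when no opener is known. */
function buildOpenArgv(platform: string, url: string): string[] | null {
  const cmd = openCommandFor(platform);
  return cmd ? [cmd, url] : null;
}
```

Returning `null` rather than throwing lets the caller fall back to printing the URL, which addresses the silent-failure concern in **Why**.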
### CDP-based DOM mutation detection for ref staleness
**What:** Use Chrome DevTools Protocol `DOM.documentUpdated` / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit `snapshot` call.
**Why:** Current ref staleness detection (async count() check) only catches stale refs at action time. CDP mutation detection would proactively warn when refs become stale, preventing the 5-second timeout entirely for SPA re-renders.
**Context:** Parts 1+2 of ref staleness fix (RefEntry metadata + eager validation via count()) are shipped. This is Part 3 — the most ambitious piece. Requires CDP session alongside Playwright, MutationObserver bridge, and careful performance tuning to avoid overhead on every DOM change.
**Effort:** L
**Priority:** P3
**Depends on:** Ref staleness Parts 1+2 (shipped)
## Office Hours / Design
### Design docs → Supabase team store sync
**What:** Add design docs (`*-design-*.md`) to the Supabase sync pipeline alongside test plans, retro snapshots, and QA reports.
**Why:** Cross-team design discovery at scale. Local `~/.gstack/projects/$SLUG/` keyword-grep discovery works for same-machine users now, but Supabase sync makes it work across the whole team. Duplicate ideas surface, everyone sees what's been explored.
**Context:** /office-hours writes design docs to `~/.gstack/projects/$SLUG/`. The team store already syncs test plans, retro snapshots, QA reports. Design docs follow the same pattern — just add a sync adapter.
**Effort:** S
**Priority:** P2
**Depends on:** `garrytan/team-supabase-store` branch landing on main
## Debug
### Worktree-based parallel debug sub-agents
**What:** When /qa hits multiple stubborn bugs, spawn parallel debug agents each in their own git worktree (`isolation: "worktree"` on the Agent call). They investigate simultaneously while /qa continues other work. Results cherry-picked back.
**Why:** Sequential debug investigations add ~2-5 min per bug. With 3-4 hard bugs, that's roughly 6-20 min of sequential waiting. Parallel worktree agents could investigate all of them simultaneously, with zero working-tree conflicts.
**Context:** v1 ships sequential (debug-subagent-escalation PR). The Agent tool supports `isolation: "worktree"` which creates a temporary git worktree — each agent gets its own copy of the repo. Especially powerful in plan mode where /qa might find multiple hard bugs across different subsystems.
**Effort:** M (human: ~1 week / CC: ~30min)
**Priority:** P2
**Depends on:** Sequential debug escalation (debug-subagent-escalation PR)
## Design Review
### /plan-design-review + /qa-design-review + /design-consultation — SHIPPED
Shipped as v0.5.0 on main. Includes `/plan-design-review` (report-only design audit), `/qa-design-review` (audit + fix loop), and `/design-consultation` (interactive DESIGN.md creation). `{{DESIGN_METHODOLOGY}}` resolver provides shared 80-item design audit checklist.
## Document-Release
### Auto-invoke /document-release from /ship
**What:** Add Step 8.5 to /ship that reads document-release/SKILL.md and executes the doc update workflow after creating the PR.
**Why:** Zero-friction doc updates — user runs /ship and docs are automatically current. No extra command to remember.
**Context:** /ship currently ends at Step 8 (PR URL output). Step 8.5 would continue into the document-release workflow. Same pattern as /ship calling /review's checklist in Step 3.5.
**Effort:** S
**Priority:** P1
**Depends on:** /document-release shipped
### `{{DOC_VOICE}}` shared resolver
**What:** Create a placeholder resolver in gen-skill-docs.ts encoding the gstack voice guide (friendly, user-forward, lead with benefits). Inject into /ship Step 5, /document-release Step 5, and reference from CLAUDE.md.
**Why:** DRY — voice rules currently live inline in 3 places (CLAUDE.md CHANGELOG style section, /ship Step 5, /document-release Step 5). When the voice evolves, all three drift.
**Context:** Same pattern as `{{QA_METHODOLOGY}}` — shared block injected into multiple templates to prevent drift. ~20 lines in gen-skill-docs.ts.
**Effort:** S
**Priority:** P2
**Depends on:** None
## Ship Confidence Dashboard
### Smart review relevance detection — PARTIALLY SHIPPED
~~**What:** Auto-detect which of the 4 reviews are relevant based on branch changes (skip Design Review if no CSS/view changes, skip Code Review if plan-only).~~
`bin/gstack-diff-scope` shipped — categorizes diff into SCOPE_FRONTEND, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Used by design-review-lite to skip when no frontend files changed. Dashboard integration for conditional row display is a follow-up.
**Remaining:** Dashboard conditional row display (hide "Design Review: NOT YET RUN" when SCOPE_FRONTEND=false). Extend to Eng Review (skip for docs-only) and CEO Review (skip for config-only).
**Effort:** S
**Priority:** P3
**Depends on:** gstack-diff-scope (shipped)
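**Sketch:** The remaining dashboard logic could be expressed as a function from scope flags to visible rows. The `SCOPE_*` names come from the entry above; the row labels, the `ScopeFlags` type, and the "only this scope is set" interpretation of docs-only and config-only are assumptions.

```typescript
type Scope =
  | "SCOPE_FRONTEND"
  | "SCOPE_BACKEND"
  | "SCOPE_PROMPTS"
  | "SCOPE_TESTS"
  | "SCOPE_DOCS"
  | "SCOPE_CONFIG";

type ScopeFlags = Record<Scope, boolean>;

/** Decide which review rows to display for a given diff scope. */
function relevantReviews(scope: ScopeFlags): string[] {
  const active = (Object.keys(scope) as Scope[]).filter((k) => scope[k]);
  // "docs-only" / "config-only" is read here as: exactly one scope is set, and it is this one.
  const only = (s: Scope): boolean => active.length === 1 && active[0] === s;

  const rows: string[] = [];
  if (!only("SCOPE_CONFIG")) rows.push("CEO Review");
  if (!only("SCOPE_DOCS")) rows.push("Eng Review");
  if (scope.SCOPE_FRONTEND) rows.push("Design Review");
  return rows;
}
```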
### /merge skill — review-gated PR merge
**What:** Create a `/merge` skill that merges an approved PR, but first checks the Review Readiness Dashboard and runs `/review` (Fix-First) if code review hasn't been done. Separates "ship" (create PR) from "merge" (land it).
**Why:** Currently `/review` runs inside `/ship` Step 3.5 but isn't tracked as a gate. A `/merge` skill ensures code review always happens before landing, and enables workflows where someone else reviews the PR first.
**Context:** `/ship` creates the PR. `/merge` would: check dashboard → run `/review` if needed → `gh pr merge`. This is where code review tracking belongs — at merge time, not at plan time.
**Effort:** M
**Priority:** P2
**Depends on:** Ship Confidence Dashboard (shipped)
## Completeness
### Completeness metrics dashboard
**What:** Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time.
**Why:** Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts).
**Context:** Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence.
**Effort:** M (human) / S (CC)
**Priority:** P3
**Depends on:** Boil the Lake shipped (v0.6.1)
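**Sketch:** Assuming each resolved AskUserQuestion appends one JSON object per line, the aggregation step might look like this. The record shape (`skill`, `choice`) and the function name are hypothetical; no such log format exists yet.

```typescript
// Assumed record shape for one resolved AskUserQuestion; not an existing gstack format.
interface ChoiceRecord {
  skill: string;
  choice: "complete" | "shortcut";
}

/** Parse JSONL text and return the share of "complete" choices per skill (0..1). */
function completenessBySkill(jsonl: string): Map<string, number> {
  const totals = new Map<string, { complete: number; all: number }>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // skip blank lines, including the trailing newline
    const rec = JSON.parse(line) as ChoiceRecord;
    const t = totals.get(rec.skill) ?? { complete: 0, all: 0 };
    t.all += 1;
    if (rec.choice === "complete") t.complete += 1;
    totals.set(rec.skill, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, skill) => rates.set(skill, t.complete / t.all));
  return rates;
}
```

Per-skill rates are what would surface the pattern mentioned in **Why**, where certain skills still bias toward shortcuts.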
## Safety & Observability
### On-demand hook skills (/careful, /freeze, /guard)
**What:** Three new skills that use Claude Code's session-scoped PreToolUse hooks to add safety guardrails on demand.
**Why:** Anthropic's internal skill best practices recommend on-demand hooks for safety. Claude Code already handles destructive command permissions, but these add an explicit opt-in layer for high-risk sessions (touching prod, debugging live systems).
**Skills:**
- `/careful` — PreToolUse hook on Bash tool. Warns (not blocks) before destructive commands: `rm -rf`, `DROP TABLE`, `git push --force`, `git reset --hard`, `kubectl delete`, `docker system prune`. Uses `permissionDecision: "ask"` so user can override.
- `/freeze` — PreToolUse hook on Edit/Write tools. Restricts file edits to a user-specified directory. Great for debugging without accidentally "fixing" unrelated code.
- `/guard` — meta-skill composing `/careful` + `/freeze` into one command.
**Implementation notes:** Use `${CLAUDE_SKILL_DIR}` (not `${SKILL_DIR}`) for script paths in hook commands. Pure bash JSON parsing (no jq dependency). Freeze dir storage: `${CLAUDE_PLUGIN_DATA}/freeze-dir.txt` with `~/.gstack/freeze-dir.txt` fallback. Ensure trailing `/` on freeze dir paths to prevent `/src` matching `/src-old`.
**Effort:** M (human) / S (CC)
**Priority:** P3
**Depends on:** None
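**Sketch:** The trailing-slash rule from the implementation notes, shown in TypeScript purely for illustration (the notes call for bash in the real hook). It assumes both paths are already absolute and normalized, with no `..` segments; the helper names are hypothetical.

```typescript
/** Normalize a directory so it always ends in exactly one "/". */
function withTrailingSlash(dir: string): string {
  return dir.replace(/\/+$/, "") + "/";
}

/** True when filePath is the freeze dir itself or lives underneath it. */
function isInsideFreezeDir(filePath: string, freezeDir: string): boolean {
  const dir = withTrailingSlash(freezeDir);
  // Comparing against "/src/" rather than "/src" is what keeps "/src-old/..." out.
  return withTrailingSlash(filePath) === dir || filePath.startsWith(dir);
}
```

Without the normalization, a bare `startsWith("/src")` check would accept `/src-old/file.ts`, which is exactly the false match the notes warn about.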
### Skill usage telemetry
**What:** Track which skills get invoked, how often, from which repo.
**Why:** Enables finding undertriggering skills and measuring adoption. Anthropic uses a PreToolUse hook for this; simpler approach is appending JSONL from the preamble.
**Context:** Add to `generatePreamble()` in `scripts/gen-skill-docs.ts`. Append to `~/.gstack/analytics/skill-usage.jsonl` with skill name, timestamp, and repo name. `mkdir -p` ensures the directory exists.
**Effort:** S (human) / S (CC)
**Priority:** P3
**Depends on:** None
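**Sketch:** The record format could be as simple as the one below. The field names (`skill`, `ts`, `repo`) are assumptions based on "skill name, timestamp, and repo name", and `toJsonlLine` is a hypothetical helper; the real preamble may emit the line from shell instead.

```typescript
// Field names are assumptions based on "skill name, timestamp, and repo name".
interface SkillUsage {
  skill: string;
  ts: string; // ISO 8601 timestamp
  repo: string;
}

/** Serialize one usage event as a single newline-terminated JSONL line. */
function toJsonlLine(skill: string, repo: string, now: Date = new Date()): string {
  const record: SkillUsage = { skill, ts: now.toISOString(), repo };
  return JSON.stringify(record) + "\n";
}
```

One self-contained line per invocation means the file can be appended to without reading or parsing what is already there.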
## Completed
### Phase 1: Foundations (v0.2.0)
- Rename to gstack
- Restructure to monorepo layout
- Setup script for skill symlinks
- Snapshot command with ref-based element selection
- Snapshot tests
**Completed:** v0.2.0
### Phase 2: Enhanced Browser (v0.2.0)
- Annotated screenshots, snapshot diffing, dialog handling, file upload
- Cursor-interactive elements, element state checks
- CircularBuffer, async buffer flush, health check
- Playwright error wrapping, useragent fix
- 148 integration tests
**Completed:** v0.2.0
### Phase 3: QA Testing Agent (v0.3.0)
- /qa SKILL.md with 6-phase workflow, 3 modes (full/quick/regression)
- Issue taxonomy, severity classification, exploration checklist
- Report template, health score rubric, framework detection
- wait/console/cookie-import commands, find-browse binary
**Completed:** v0.3.0
### Phase 3.5: Browser Cookie Import (v0.3.x)
- cookie-import-browser command (Chromium cookie DB decryption)
- Cookie picker web UI, /setup-browser-cookies skill
- 18 unit tests, browser registry (Comet, Chrome, Arc, Brave, Edge)
**Completed:** v0.3.1
### E2E test cost tracking
- Track cumulative API spend, warn if over threshold
**Completed:** v0.3.6
### Auto-upgrade mode + smart update check
- Config CLI (`bin/gstack-config`), auto-upgrade via `~/.gstack/config.yaml`, 12h cache TTL, exponential snooze backoff (24h→48h→1wk), "never ask again" option, vendored copy sync on upgrade
**Completed:** v0.3.8
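**Sketch:** For reference, the snooze schedule in the entry above (24h, 48h, 1wk) can be written as a small lookup. Whether the shipped code caps at one week for later snoozes is an assumption, as is the function name.

```typescript
const HOUR_MS = 60 * 60 * 1000;

/**
 * Snooze duration for the nth consecutive snooze (0-based).
 * Schedule from the entry above: 24h, then 48h, then 1 week.
 * Assumption: every snooze after the third stays at 1 week.
 */
function snoozeDurationMs(snoozeCount: number): number {
  if (snoozeCount <= 0) return 24 * HOUR_MS;
  if (snoozeCount === 1) return 48 * HOUR_MS;
  return 7 * 24 * HOUR_MS;
}
```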