Merge remote-tracking branch 'origin/main' into garrytan/chrome-extension-ctrl

# Conflicts:
#	browse/src/browser-manager.ts
#	browse/src/cli.ts
Garry Tan
2026-03-26 00:08:38 -06:00
156 changed files with 9667 additions and 1161 deletions
+37 -11
@@ -179,14 +179,17 @@ Sidebar agent writes structured messages to `.context/sidebar-inbox/`. Workspace
**Priority:** P4
**Depends on:** Chrome extension proving value via sideloading
### Linux/Windows cookie decryption
### Linux cookie decryption — PARTIALLY SHIPPED
**What:** GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.
~~**What:** GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.~~
**Why:** Cross-platform cookie import. Currently macOS-only (Keychain).
Linux cookie import shipped in v0.11.11.0 (Wave 3). Supports Chrome, Chromium, Brave, and Edge on Linux, using GNOME Keyring (libsecret) with a "peanuts" fallback. Windows DPAPI support remains deferred.
**Effort:** L
**Remaining:** Windows cookie decryption (DPAPI). Needs complete rewrite — PR #64 was 1346 lines and stale.
**Effort:** L (Windows only)
**Priority:** P4
**Completed (Linux):** v0.11.11.0 (2026-03-23)
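For readers unfamiliar with the "peanuts" fallback mentioned above, here is a minimal sketch of how Chromium-family browsers on Linux derive the cookie key. The constants (salt `saltysalt`, one PBKDF2-HMAC-SHA1 iteration, AES-128-CBC with an IV of 16 spaces, `v10`/`v11` prefixes) come from Chromium's Linux `os_crypt` code, not from this repo. The function names are hypothetical and do not reflect the shipped implementation. Fetching the keyring password via libsecret is out of scope, so the caller supplies it.

```typescript
import { createCipheriv, createDecipheriv, pbkdf2Sync } from "node:crypto";

// Constants from Chromium's Linux os_crypt implementation (assumption: not taken from this repo).
const SALT = "saltysalt";
const ITERATIONS = 1;
const KEY_BYTES = 16; // AES-128
const IV = Buffer.alloc(16, " "); // 16 ASCII spaces

function deriveKey(password: string): Buffer {
  return pbkdf2Sync(password, SALT, ITERATIONS, KEY_BYTES, "sha1");
}

// "v10" blobs use the hardcoded "peanuts" password; "v11" blobs need the
// password stored in the keyring, which the caller must supply.
function passwordFor(prefix: string, keyringPassword?: string): string {
  if (prefix === "v10") return "peanuts";
  if (prefix === "v11" && keyringPassword !== undefined) return keyringPassword;
  throw new Error(`cannot decrypt cookie with prefix "${prefix}"`);
}

// Hypothetical helper: strips the 3-byte version prefix and decrypts the rest.
function decryptCookieValue(blob: Buffer, keyringPassword?: string): Buffer {
  const prefix = blob.subarray(0, 3).toString("latin1");
  const key = deriveKey(passwordFor(prefix, keyringPassword));
  const decipher = createDecipheriv("aes-128-cbc", key, IV);
  return Buffer.concat([decipher.update(blob.subarray(3)), decipher.final()]);
}

// Inverse of the above, included only to build a sample blob for the demo.
function encryptCookieValue(
  plain: string,
  prefix: "v10" | "v11",
  keyringPassword?: string,
): Buffer {
  const key = deriveKey(passwordFor(prefix, keyringPassword));
  const cipher = createCipheriv("aes-128-cbc", key, IV);
  return Buffer.concat([Buffer.from(prefix), cipher.update(plain, "utf8"), cipher.final()]);
}

const sample = encryptCookieValue("session=abc123", "v10");
console.log(decryptCookieValue(sample).toString("utf8")); // round-trips the input
```

The sketch returns raw bytes rather than a string because some recent Chromium versions are reported to prepend extra bytes to the plaintext. A real importer would need to handle that case, and the sketch does not.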
## Ship
@@ -363,17 +366,18 @@ Sidebar agent writes structured messages to `.context/sidebar-inbox/`. Workspace
**Depends on:** Video recording
### GitHub Actions eval upload
**What:** Run eval suite in CI, upload result JSON as artifact, post summary comment on PR.
### Extend worktree isolation to Claude E2E tests
**Why:** CI integration catches quality regressions before merge and provides persistent eval records per PR.
**What:** Add `useWorktree?: boolean` option to `runSkillTest()` so any Claude E2E test can opt into worktree mode for full repo context instead of tmpdir fixtures.
**Context:** Requires `ANTHROPIC_API_KEY` in CI secrets. Cost is ~$4/run. Eval persistence system (v0.3.6) writes JSON to `~/.gstack-dev/evals/` — CI would upload as GitHub Actions artifacts and use `eval:compare` to post delta comment.
**Why:** Some Claude E2E tests (CSO audit, review-sql-injection) create minimal fake repos but would produce more realistic results with full repo context. The infrastructure exists (`describeWithWorktree()` in e2e-helpers.ts) — this extends it to the session-runner level.
**Effort:** M
**Priority:** P2
**Depends on:** Eval persistence (shipped in v0.3.6)
**Context:** WorktreeManager shipped in v0.11.12.0. Currently only Gemini/Codex tests use worktrees. Claude tests use planted-bug fixture repos which are correct for their purpose, but new tests that want real repo context can use `describeWithWorktree()` today. This TODO is about making it even easier via a flag on `runSkillTest()`.
**Effort:** M (human: ~2 days / CC: ~20 min)
**Priority:** P3
**Depends on:** Worktree isolation (shipped v0.11.12.0)
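One possible shape for the `useWorktree` flag is sketched below. Only the option name `useWorktree?: boolean` and the function name `runSkillTest()` come from the TODO. `SkillTestOptions`, `WorktreeFactory`, and `resolveWorkingDir` are hypothetical names for illustration, and the real WorktreeManager API is not shown here, so the worktree factory is injected as a plain function.

```typescript
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical subset of the options runSkillTest() might accept.
interface SkillTestOptions {
  prompt: string;
  // Opt into a full-repo worktree instead of a tmpdir fixture.
  useWorktree?: boolean;
}

// Stand-in for whatever WorktreeManager exposes; returns the worktree path.
type WorktreeFactory = (testName: string) => string;

// Chooses the working directory for one test run. The default (flag absent
// or false) keeps today's behavior: a fresh temporary directory.
function resolveWorkingDir(
  testName: string,
  opts: SkillTestOptions,
  createWorktree: WorktreeFactory,
): string {
  if (opts.useWorktree) {
    return createWorktree(testName);
  }
  return mkdtempSync(join(tmpdir(), `skill-e2e-${testName}-`));
}

// Usage: a fake factory keeps the example runnable without a git repo.
const fakeFactory: WorktreeFactory = (name) => join("/tmp/worktrees", name);
console.log(resolveWorkingDir("cso-audit", { prompt: "audit", useWorktree: true }, fakeFactory));
console.log(resolveWorkingDir("planted-bug", { prompt: "review" }, fakeFactory));
```

Keeping the flag optional and defaulting to the tmpdir path means existing planted-bug tests would not change behavior unless they opt in.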
### E2E model pinning — SHIPPED
@@ -514,6 +518,20 @@ Shipped in v0.8.3. Step 8.5 added to `/ship` — after creating the PR, `/ship`
**Depends on:** gstack-diff-scope (shipped)
## Codex
### Codex→Claude reverse buddy check skill
**What:** A Codex-native skill (`.agents/skills/gstack-claude/SKILL.md`) that runs `claude -p` to get an independent second opinion from Claude — the reverse of what `/codex` does today from Claude Code.
**Why:** Codex users deserve the same cross-model challenge that Claude users get via `/codex`. Currently the flow is one-way (Claude→Codex). Codex users have no way to get a Claude second opinion.
**Context:** The `/codex` skill template (`codex/SKILL.md.tmpl`) shows the pattern — it wraps `codex exec` with JSONL parsing, timeout handling, and structured output. The reverse skill would wrap `claude -p` with similar infrastructure. Would be generated into `.agents/skills/gstack-claude/` by `gen-skill-docs --host codex`.
**Effort:** M (human: ~2 weeks / CC: ~30 min)
**Priority:** P1
**Depends on:** None
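The timeout handling described in the context above could be sketched roughly as follows. This is an illustrative wrapper, not the `/codex` template's code: `runSecondOpinion`, `SecondOpinionOptions`, and `SecondOpinionResult` are hypothetical names, and the only CLI detail assumed is the `claude -p` invocation named in the TODO. The command and base arguments are parameters so the wrapper can be exercised without the `claude` binary installed.

```typescript
import { spawnSync } from "node:child_process";

interface SecondOpinionOptions {
  command?: string;    // defaults to "claude"
  baseArgs?: string[]; // defaults to ["-p"]
  timeoutMs?: number;  // defaults to 5 minutes (arbitrary choice for this sketch)
}

interface SecondOpinionResult {
  ok: boolean;
  timedOut: boolean;
  output: string;
}

// Runs the CLI once with the prompt as the final argument and reports
// whether it finished, failed, or hit the timeout.
function runSecondOpinion(prompt: string, opts: SecondOpinionOptions = {}): SecondOpinionResult {
  const command = opts.command ?? "claude";
  const baseArgs = opts.baseArgs ?? ["-p"];
  const result = spawnSync(command, [...baseArgs, prompt], {
    encoding: "utf8",
    timeout: opts.timeoutMs ?? 5 * 60 * 1000,
  });
  // spawnSync sets error.code to ETIMEDOUT when the timeout kills the child.
  const timedOut = (result.error as NodeJS.ErrnoException | undefined)?.code === "ETIMEDOUT";
  return {
    ok: !result.error && result.status === 0,
    timedOut,
    output: (result.stdout ?? "").trim(),
  };
}

// Usage with a stand-in command (the current Node binary) instead of claude.
const demo = runSecondOpinion("review this diff", {
  command: process.execPath,
  baseArgs: ["-e", "process.stdout.write('looks fine')"],
});
console.log(demo);
```

A synchronous call keeps the sketch short. A real skill would more likely stream output, as the JSONL parsing mentioned above suggests.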
## Completeness
### Completeness metrics dashboard
@@ -564,6 +582,14 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr
## Completed
### CI eval pipeline (v0.9.9.0)
- GitHub Actions eval upload on Ubicloud runners ($0.006/run)
- Within-file test concurrency (test() → testConcurrentIfSelected())
- Eval artifact upload + PR comment with pass/fail + cost
- Baseline comparison via artifact download from main
- EVALS_CONCURRENCY=40 for ~6min wall clock (was ~18min)
**Completed:** v0.9.9.0
### Deploy pipeline (v0.9.8.0)
- /land-and-deploy — merge PR, wait for CI/deploy, canary verification
- /canary — post-deploy monitoring loop with anomaly detection