mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 03:35:09 +02:00
ed802d0c7f
Add eval:list, eval:compare, eval:summary CLI scripts for exploring eval history from ~/.gstack-dev/evals/. eval:compare reuses the shared comparison functions from eval-store.ts.

- eval:list: sorted table with branch/tier/cost filters
- eval:compare: thin wrapper around compareEvalResults + formatComparison
- eval:summary: aggregate stats, flaky test detection, branch rankings
- Remove unused @anthropic-ai/claude-agent-sdk from devDependencies
- Update CLAUDE.md: streaming docs, eval CLI commands, remove Agent SDK refs
- Add GH Actions eval upload (P2) and web dashboard (P3) to TODOS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
49 lines
3.1 KiB
Markdown
# TODOS

## Auto-upgrade mode (zero-prompt)

**What:** Add a `GSTACK_AUTO_UPGRADE=1` env var or `~/.gstack/config` option that skips the AskUserQuestion prompt and upgrades automatically when a new version is detected.

**Why:** Power users and CI environments may want zero-friction upgrades without being asked every time.

**Context:** The current upgrade system (v0.3.4) always prompts via AskUserQuestion. This TODO adds an opt-in bypass. Implementation is ~10 lines in the preamble instructions: check for the env var/config before calling AskUserQuestion and, if it is set, go straight to the upgrade flow. Depends on the full upgrade system being stable first, so wait for user feedback on the prompt-based flow before adding this.

**Effort:** S (small)
**Priority:** P3 (nice-to-have, revisit after adoption data)
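
The opt-in check described in the Context above could look roughly like the sketch below. It is illustrative only: the TODO places the real check in the preamble instructions, and the `auto_upgrade` config key, the `key=value` config format, and all function names here are assumptions rather than anything gstack defines.

```typescript
// Hypothetical sketch of the zero-prompt opt-in check.
// Assumptions: ~/.gstack/config holds `key=value` lines and uses an
// `auto_upgrade` key. Only GSTACK_AUTO_UPGRADE=1 comes from the TODO itself.

type Env = Record<string, string | undefined>;

/** Parse simple `key=value` lines, skipping blanks and `#` comments. */
function parseConfig(text: string): Map<string, string> {
  const entries = new Map<string, string>();
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq === -1) continue; // not a key=value line
    entries.set(line.slice(0, eq).trim(), line.slice(eq + 1).trim());
  }
  return entries;
}

/** True when the user has opted in via the env var or the config text. */
function shouldAutoUpgrade(env: Env, configText: string | null): boolean {
  if (env["GSTACK_AUTO_UPGRADE"] === "1") return true;
  if (configText === null) return false; // no config file present
  return parseConfig(configText).get("auto_upgrade") === "true";
}

/** Choose the next step once a new version has been detected. */
function nextUpgradeStep(env: Env, configText: string | null): "upgrade" | "prompt" {
  return shouldAutoUpgrade(env, configText) ? "upgrade" : "prompt";
}
```

Passing the environment and config text in as arguments keeps the decision testable without touching the real `~/.gstack/config`; the default with nothing set stays `"prompt"`, which matches the current behavior.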

## Convert remaining skills to .tmpl files

**What:** Convert ship/, review/, plan-ceo-review/, plan-eng-review/, retro/ SKILL.md files to .tmpl templates using the `{{UPDATE_CHECK}}` placeholder.

**Why:** These 5 skills still have the update check preamble copy-pasted. When the preamble changes (like the `|| true` fix in v0.3.5), all 5 need manual updates. The `{{UPDATE_CHECK}}` resolver already exists in `scripts/gen-skill-docs.ts`, so these skills just need to be converted.

**Context:** The browse-using skills (SKILL.md, browse/, qa/, setup-browser-cookies/) were converted to .tmpl in v0.3.5. The remaining 5 skills only use `{{UPDATE_CHECK}}` (no `{{BROWSE_SETUP}}`), so the conversion is mechanical: replace the preamble with `{{UPDATE_CHECK}}`, add the path to `findTemplates()` in `scripts/gen-skill-docs.ts`, and commit both .tmpl + generated .md.

**Depends on:** v0.3.5 shipping first (the `{{UPDATE_CHECK}}` resolver).

**Effort:** S (small, ~20 min)
**Priority:** P2 (prevents drift on next preamble change)
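
To illustrate why a single placeholder prevents drift, here is a minimal sketch of placeholder resolution. It is not the code in `scripts/gen-skill-docs.ts`; the `renderTemplate` function, the resolver map, and the stand-in preamble text are all hypothetical.

```typescript
// Hypothetical sketch of {{PLACEHOLDER}} resolution. This is NOT the actual
// scripts/gen-skill-docs.ts implementation; names and shapes are assumed.

type Resolver = () => string;

/** Replace every {{NAME}} in the template with the matching resolver output. */
function renderTemplate(template: string, resolvers: Map<string, Resolver>): string {
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (_match: string, name: string) => {
    const resolver = resolvers.get(name);
    if (resolver === undefined) {
      // Fail loudly so a typo in a .tmpl file never ships as literal text.
      throw new Error(`Unknown placeholder: {{${name}}}`);
    }
    return resolver();
  });
}

// Stand-in preamble text; the real preamble lives in the generator.
const resolvers = new Map<string, Resolver>([
  ["UPDATE_CHECK", () => "<update check preamble>"],
]);

const shipTemplate = "# ship\n{{UPDATE_CHECK}}\nSkill body";
const reviewTemplate = "# review\n{{UPDATE_CHECK}}\nSkill body";

// Both generated files pick up the same preamble from one definition.
const generated = [shipTemplate, reviewTemplate].map((t) => renderTemplate(t, resolvers));
```

With this shape, a preamble change means editing one resolver and regenerating, instead of hand-editing five SKILL.md files.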

## GitHub Actions eval upload

**What:** Run eval suite in CI, upload result JSON as artifact, post summary comment on PR.

**Why:** Currently evals only run locally. CI integration would catch quality regressions before merge and provide a persistent record of eval results per PR.

**Context:** Requires `ANTHROPIC_API_KEY` in CI secrets. Cost is ~$4/run. The eval persistence system (v0.3.6) writes JSON to `~/.gstack-dev/evals/`; CI would upload these files as GitHub Actions artifacts and use `eval:compare` to post a delta comment on the PR.

**Depends on:** Eval persistence shipping (v0.3.6).

**Effort:** M (medium)
**Priority:** P2
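
As a rough picture of the delta comment, the sketch below diffs two eval runs and renders a short Markdown summary. The `EvalRun` shape and the `diffRuns` / `formatDeltaComment` helpers are hypothetical; the real result schema and the `compareEvalResults` / `formatComparison` functions in `eval-store.ts` may look quite different.

```typescript
// Hypothetical sketch of a PR delta comment. The EvalRun shape is an
// assumption and does not reflect the real eval JSON schema.

interface EvalTest { name: string; passed: boolean; costUsd: number; }
interface EvalRun { branch: string; tests: EvalTest[]; }
interface Delta { regressions: string[]; fixes: string[]; costDeltaUsd: number; }

/** Compare a head run against a baseline run, test by test. */
function diffRuns(base: EvalRun, head: EvalRun): Delta {
  const basePassed = new Map(base.tests.map((t) => [t.name, t.passed] as const));
  const regressions: string[] = [];
  const fixes: string[] = [];
  for (const t of head.tests) {
    const before = basePassed.get(t.name);
    if (before === undefined) continue; // new test, no baseline to compare
    if (before && !t.passed) regressions.push(t.name);
    if (!before && t.passed) fixes.push(t.name);
  }
  const total = (run: EvalRun) => run.tests.reduce((sum, t) => sum + t.costUsd, 0);
  return { regressions, fixes, costDeltaUsd: total(head) - total(base) };
}

/** Render the delta as a short Markdown comment body. */
function formatDeltaComment(delta: Delta): string {
  const list = (names: string[]) => (names.length === 0 ? "none" : names.join(", "));
  const sign = delta.costDeltaUsd >= 0 ? "+" : "-";
  return [
    "### Eval delta",
    `- Regressions: ${list(delta.regressions)}`,
    `- Fixes: ${list(delta.fixes)}`,
    `- Cost change: ${sign}$${Math.abs(delta.costDeltaUsd).toFixed(2)}`,
  ].join("\n");
}
```

In CI, the baseline would presumably come from an artifact uploaded by an earlier run on the base branch, and the resulting string would be posted as the PR comment.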

## Eval web dashboard

**What:** `bun run eval:dashboard` serves local HTML with charts: cost trending, detection rate over time, pass/fail history.

**Why:** The CLI tools (`eval:list`, `eval:compare`, `eval:summary`) are good for quick checks, but visual charts are better for spotting trends over many runs.

**Context:** Reads the same `~/.gstack-dev/evals/*.json` files. ~200 lines of HTML + chart.js code served via a simple Bun HTTP server. No external dependencies beyond what's already installed.

**Depends on:** Eval persistence + `eval:list` shipping (v0.3.6).

**Effort:** M (medium)
**Priority:** P3 (nice-to-have, revisit after eval system sees regular use)
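
The data side of such a dashboard might be sketched as below: turn a set of run summaries into the parallel arrays a line chart expects. The `EvalSummary` fields and the `toChartSeries` helper are assumptions for illustration and do not reflect the real eval JSON schema.

```typescript
// Hypothetical sketch: shape eval history into chart-ready series.
// The EvalSummary fields are assumed, not taken from the real eval files.

interface EvalSummary {
  timestamp: string; // assumed ISO-8601 in UTC, so string order is time order
  totalCostUsd: number;
  passed: number;
  total: number;
}

interface ChartSeries {
  labels: string[];   // x-axis: run timestamps, oldest first
  cost: number[];     // y-axis for the cost trend chart
  passRate: number[]; // y-axis for pass/fail history, in the range 0..1
}

/** Sort runs chronologically and split them into parallel arrays. */
function toChartSeries(runs: EvalSummary[]): ChartSeries {
  const sorted = [...runs].sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0,
  );
  return {
    labels: sorted.map((r) => r.timestamp),
    cost: sorted.map((r) => r.totalCostUsd),
    // Guard against empty runs so the chart never receives NaN.
    passRate: sorted.map((r) => (r.total === 0 ? 0 : r.passed / r.total)),
  };
}
```

A server could serialize the returned object with `JSON.stringify` into the HTML page and hand the arrays to the charting code on the client.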