feat: eval CLI tools + docs cleanup

Add eval:list, eval:compare, eval:summary CLI scripts for exploring eval history from ~/.gstack-dev/evals/. eval:compare reuses the shared comparison functions from eval-store.ts.

- eval:list: sorted table with branch/tier/cost filters
- eval:compare: thin wrapper around compareEvalResults + formatComparison
- eval:summary: aggregate stats, flaky test detection, branch rankings
- Remove unused @anthropic-ai/claude-agent-sdk from devDependencies
- Update CLAUDE.md: streaming docs, eval CLI commands, remove Agent SDK refs
- Add GH Actions eval upload (P2) and web dashboard (P3) to TODOS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-29 21:15:37 +02:00 · 2026-03-14 03:49:57 -05:00
parent 84f52f3bad
commit ed802d0c7f
6 changed files with 373 additions and 11 deletions
@@ -22,3 +22,27 @@
 **Depends on:** v0.3.5 shipping first (the `{{UPDATE_CHECK}}` resolver).
 **Effort:** S (small, ~20 min)
 **Priority:** P2 (prevents drift on next preamble change)
+
+## GitHub Actions eval upload
+
+**What:** Run eval suite in CI, upload result JSON as artifact, post summary comment on PR.
+
+**Why:** Currently evals only run locally. CI integration would catch quality regressions before merge and provide a persistent record of eval results per PR.
+
+**Context:** Requires `ANTHROPIC_API_KEY` in CI secrets. Cost is ~$4/run. The eval persistence system (v0.3.6) writes JSON to `~/.gstack-dev/evals/` — CI would upload these as GitHub Actions artifacts and use `eval:compare` to post a delta comment on the PR.
+
+**Depends on:** Eval persistence shipping (v0.3.6).
+**Effort:** M (medium)
+**Priority:** P2
+
+## Eval web dashboard
+
+**What:** `bun run eval:dashboard` serves local HTML with charts: cost trending, detection rate over time, pass/fail history.
+
+**Why:** The CLI tools (`eval:list`, `eval:compare`, `eval:summary`) are good for quick checks, but visual charts are better for spotting trends over many runs.
+
+**Context:** Reads the same `~/.gstack-dev/evals/*.json` files. ~200 lines HTML + chart.js code served via a simple Bun HTTP server. No external dependencies beyond what's already installed.
+
+**Depends on:** Eval persistence + eval:list shipping (v0.3.6).
+**Effort:** M (medium)
+**Priority:** P3 (nice-to-have, revisit after eval system sees regular use)
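
The dashboard entry above describes charting cost and pass/fail history from the persisted eval JSON. As a rough illustration of the data preparation that would feed those charts, here is a minimal sketch. The `EvalRun` shape, its field names, and the `toTrendSeries` helper are all hypothetical; the commit does not show the actual schema written to `~/.gstack-dev/evals/`, so the real files may differ.

```typescript
// Hypothetical shape of one persisted eval run. The real JSON schema
// is not shown in this commit, so every field name here is an assumption.
interface EvalRun {
  timestamp: string; // ISO 8601 time the run finished
  branch: string;
  totalCostUsd: number;
  tests: { name: string; passed: boolean }[];
}

// One point on a cost / pass-rate chart.
interface TrendPoint {
  timestamp: string;
  costUsd: number;
  passRate: number; // fraction of tests that passed, in the range 0..1
}

// Convert raw runs into a chronologically sorted series for charting.
// The input array is copied before sorting, so the caller's data is untouched.
function toTrendSeries(runs: EvalRun[]): TrendPoint[] {
  return [...runs]
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp))
    .map((run) => {
      const passedCount = run.tests.filter((t) => t.passed).length;
      return {
        timestamp: run.timestamp,
        costUsd: run.totalCostUsd,
        // A run with no tests is reported as 0 rather than NaN.
        passRate: run.tests.length === 0 ? 0 : passedCount / run.tests.length,
      };
    });
}
```

Keeping this step as a pure function, separate from file reading and HTTP serving, would let the same series back both the planned dashboard and the existing CLI output, though that split is a design suggestion rather than something the commit prescribes.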