From e28033353dd64cc7958f8c95cac83114559b03f0 Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Sun, 15 Mar 2026 16:55:34 -0500
Subject: [PATCH] chore: bump v0.3.10, update CHANGELOG and docs

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 CHANGELOG.md    | 18 ++++++++++++++++++
 CLAUDE.md       |  1 +
 CONTRIBUTING.md |  5 ++++-
 VERSION         |  2 +-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4c571e6e..b040306b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,23 @@
 # Changelog
 
+## 0.3.10 — 2026-03-15
+
+### Added
+- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost, extracted from `resultLine.modelUsage` in the `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
+- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache is keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely (~$0.18/run savings). Set `EVAL_CACHE=0` to force a re-run.
+- **Dynamic model selection** — `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus; default: sonnet). `EVAL_TIER` pins the E2E test model via the `--model` flag to `claude -p`.
+- **`bun run eval:trend`** — per-test pass-rate tracking over the last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading, with a sparkline table and `--limit`, `--tier`, `--test` filters. Answers "is /retro getting more reliable?" instantly.
+- **`CostEntry` extended** — optional `cache_read_input_tokens`, `cache_creation_input_tokens`, and `cost_usd` fields for accurate cache-aware cost reporting.
+- 22 new tests: 10 cache/tier integration (`llm-judge.test.ts`), 12 trend classification (`lib-eval-trend.test.ts`).
+
+### Changed
+- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains its simple return type for E2E callers.
+- `EvalCollector.finalize()` aggregates per-test `costs[]` into a result-level cost breakdown.
+- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
+- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
+- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.
+- Regression test refactored from a direct `Anthropic()` client to `callJudge()` (benefits from cache + tier).
+
 ## 0.3.9 — 2026-03-15
 
 ### Added

diff --git a/CLAUDE.md b/CLAUDE.md
index c6909357..681566b3 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -15,6 +15,7 @@ bun run dev:skill # watch mode: auto-regen + validate on change
 bun run eval:list    # list all eval runs from ~/.gstack-dev/evals/
 bun run eval:compare # compare two eval runs (auto-picks most recent)
 bun run eval:summary # aggregate stats across all eval runs
+bun run eval:trend   # per-test pass rate trends (flaky detection)
 ```
 
 `test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 34e502ea..0116be43 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -134,6 +134,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
 bun run eval:list    # list all eval runs
 bun run eval:compare # compare two runs (auto-picks most recent)
 bun run eval:summary # aggregate stats across all runs
+bun run eval:trend   # per-test pass rate over last N runs (flaky detection)
+bun run eval:cache stats # check LLM judge cache hit rate
 ```
 
 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
@@ -152,7 +154,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**.
 T # Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
 ```
 
-- Uses `claude-sonnet-4-6` for scoring stability
+- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
+- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
 - Tests live in `test/skill-llm-eval.test.ts`
 - Calls the Anthropic API directly (not `claude -p`), so it works from anywhere
   including inside Claude Code

diff --git a/VERSION b/VERSION
index 940ac09a..5503126d 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.3.9
+0.3.10
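
Reviewer note on the "LLM judge caching" entry: the `model:prompt` cache key described in the CHANGELOG can be pictured with a minimal sketch. This is a hypothetical helper, not the project's actual `eval-cache.ts`; the function name and SHA-256 choice are assumptions (the patch only says "SHA-based").

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: hash "model:prompt" so that an unchanged judge
// prompt for the same model maps to the same cache key, letting the
// runner skip the API call and reuse the stored verdict.
function judgeCacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}:${prompt}`).digest("hex");
}

// Identical model + prompt produce identical keys; editing SKILL.md
// (and thus the prompt) produces a new key and a fresh API call.
const first = judgeCacheKey("claude-sonnet-4-6", "Evaluate this SKILL.md ...");
const second = judgeCacheKey("claude-sonnet-4-6", "Evaluate this SKILL.md ...");
console.log(first === second); // true
```

Keying on the model as well as the prompt means switching `EVAL_JUDGE_TIER` never serves a stale verdict from a different judge model.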
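
Reviewer note on the `eval:trend` entry: the five classes it reports (stable-pass, stable-fail, flaky, improving, degrading) suggest a classifier over each test's recent pass/fail history. A sketch under assumed semantics, since the patch does not define the thresholds; the 0.25 cutoff and half-split comparison are invented for illustration:

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// Hypothetical sketch: bucket the last N outcomes for one test.
// All pass / all fail are stable; otherwise compare the pass rate of
// the older half against the newer half to detect a direction.
function classify(history: boolean[]): Trend {
  if (history.every(Boolean)) return "stable-pass";
  if (history.every((p) => !p)) return "stable-fail";
  const half = Math.floor(history.length / 2);
  const rate = (xs: boolean[]) => xs.filter(Boolean).length / xs.length;
  const early = rate(history.slice(0, half));
  const late = rate(history.slice(half));
  if (late - early > 0.25) return "improving"; // illustrative threshold
  if (early - late > 0.25) return "degrading";
  return "flaky";
}

console.log(classify([true, true, true, true]));   // "stable-pass"
console.log(classify([false, false, true, true])); // "improving"
console.log(classify([true, false, true, false])); // "flaky"
```

This is the shape of signal `eval:summary` can use when it "hints to run `eval:trend`": any test landing in the flaky bucket warrants a look at its history rather than its latest run.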
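
Reviewer note on the "Per-model cost tracking" entry: the claim that `computeCosts()` "prefers exact `cost_usd` over MODEL_PRICING estimates" can be sketched as a fallback. The function below is a simplified stand-in (single entry, invented `MODEL_PRICING` shape and rates), not the patch's actual implementation:

```typescript
// Extended CostEntry per the CHANGELOG: cost_usd is optional because
// only some result lines carry an API-reported exact cost.
interface CostEntry {
  input_tokens: number;
  output_tokens: number;
  cost_usd?: number;
}

// Illustrative per-million-token rates; real MODEL_PRICING differs.
const MODEL_PRICING = { inputPerMTok: 3, outputPerMTok: 15 };

// Hypothetical sketch: exact API-reported cost wins; otherwise estimate
// from token counts. Estimates overstate cost when prompt caching makes
// many input tokens cheap, which is why the exact figure is preferred.
function computeCost(entry: CostEntry): number {
  if (entry.cost_usd !== undefined) return entry.cost_usd;
  return (
    (entry.input_tokens / 1e6) * MODEL_PRICING.inputPerMTok +
    (entry.output_tokens / 1e6) * MODEL_PRICING.outputPerMTok
  );
}

console.log(computeCost({ input_tokens: 1_000, output_tokens: 500, cost_usd: 0.0021 })); // 0.0021
console.log(computeCost({ input_tokens: 1_000_000, output_tokens: 0 })); // 3
```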