From e28033353dd64cc7958f8c95cac83114559b03f0 Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Sun, 15 Mar 2026 16:55:34 -0500
Subject: [PATCH] chore: bump v0.3.10, update CHANGELOG and docs

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 CHANGELOG.md    | 18 ++++++++++++++++++
 CLAUDE.md       |  1 +
 CONTRIBUTING.md |  5 ++++-
 VERSION         |  2 +-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4c571e6e..b040306b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,23 @@
 # Changelog
 
+## 0.3.10 — 2026-03-15
+
+### Added
+- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost, extracted from `resultLine.modelUsage` in the `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
+- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache is keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely (~$0.18/run savings). Set `EVAL_CACHE=0` to force a re-run.
+- **Dynamic model selection** — `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus; default: sonnet). `EVAL_TIER` pins the E2E test model via the `--model` flag to `claude -p`.
+- **`bun run eval:trend`** — per-test pass-rate tracking over the last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading, with a sparkline table and `--limit`, `--tier`, `--test` filters. Answers "is /retro getting more reliable?" instantly.
+- **`CostEntry` extended** — optional `cache_read_input_tokens`, `cache_creation_input_tokens`, and `cost_usd` fields for accurate cache-aware cost reporting.
+- 22 new tests: 10 cache/tier integration (`llm-judge.test.ts`), 12 trend classification (`lib-eval-trend.test.ts`).
+
+### Changed
+- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains its simple return type for E2E callers.
+- `EvalCollector.finalize()` aggregates per-test `costs[]` into a result-level cost breakdown.
+- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
+- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
+- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.
+- Regression test refactored from a direct `Anthropic()` client to `callJudge()` (benefits from cache + tier).
+
 ## 0.3.9 — 2026-03-15
 
 ### Added

diff --git a/CLAUDE.md b/CLAUDE.md
index c6909357..681566b3 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -15,6 +15,7 @@ bun run dev:skill # watch mode: auto-regen + validate on change
 bun run eval:list    # list all eval runs from ~/.gstack-dev/evals/
 bun run eval:compare # compare two eval runs (auto-picks most recent)
 bun run eval:summary # aggregate stats across all eval runs
+bun run eval:trend   # per-test pass rate trends (flaky detection)
 ```
 
 `test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 34e502ea..0116be43 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -134,6 +134,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
 bun run eval:list    # list all eval runs
 bun run eval:compare # compare two runs (auto-picks most recent)
 bun run eval:summary # aggregate stats across all runs
+bun run eval:trend   # per-test pass rate over last N runs (flaky detection)
+bun run eval:cache stats # check LLM judge cache hit rate
 ```
 
 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
@@ -152,7 +154,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**.
 T # Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
 ```
 
-- Uses `claude-sonnet-4-6` for scoring stability
+- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
+- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
 - Tests live in `test/skill-llm-eval.test.ts`
 - Calls the Anthropic API directly (not `claude -p`), so it works from anywhere
   including inside Claude Code

diff --git a/VERSION b/VERSION
index 940ac09a..5503126d 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-0.3.9
+0.3.10
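
Reviewer note on the "LLM judge caching" entry: the `model:prompt` cache key described in the CHANGELOG can be pictured with a minimal sketch. This is a hypothetical helper, not the project's actual `eval-cache.ts`; the function name and SHA-256 choice are assumptions (the patch only says "SHA-based").

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: hash "model:prompt" so that an unchanged judge
// prompt for the same model maps to the same cache key, letting the
// runner skip the API call and reuse the stored verdict.
function judgeCacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}:${prompt}`).digest("hex");
}

// Identical model + prompt produce identical keys; editing SKILL.md
// (and thus the prompt) produces a new key and a fresh API call.
const first = judgeCacheKey("claude-sonnet-4-6", "Evaluate this SKILL.md ...");
const second = judgeCacheKey("claude-sonnet-4-6", "Evaluate this SKILL.md ...");
console.log(first === second); // true
```

Keying on the model as well as the prompt means switching `EVAL_JUDGE_TIER` never serves a stale verdict from a different judge model.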
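
Reviewer note on the `eval:trend` entry: the five classes it reports (stable-pass, stable-fail, flaky, improving, degrading) suggest a classifier over each test's recent pass/fail history. A sketch under assumed semantics, since the patch does not define the thresholds; the 0.25 cutoff and half-split comparison are invented for illustration:

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// Hypothetical sketch: bucket the last N outcomes for one test.
// All pass / all fail are stable; otherwise compare the pass rate of
// the older half against the newer half to detect a direction.
function classify(history: boolean[]): Trend {
  if (history.every(Boolean)) return "stable-pass";
  if (history.every((p) => !p)) return "stable-fail";
  const half = Math.floor(history.length / 2);
  const rate = (xs: boolean[]) => xs.filter(Boolean).length / xs.length;
  const early = rate(history.slice(0, half));
  const late = rate(history.slice(half));
  if (late - early > 0.25) return "improving"; // illustrative threshold
  if (early - late > 0.25) return "degrading";
  return "flaky";
}

console.log(classify([true, true, true, true]));   // "stable-pass"
console.log(classify([false, false, true, true])); // "improving"
console.log(classify([true, false, true, false])); // "flaky"
```

This is the shape of signal `eval:summary` can use when it "hints to run `eval:trend`": any test landing in the flaky bucket warrants a look at its history rather than its latest run.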
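
Reviewer note on the "Per-model cost tracking" entry: the claim that `computeCosts()` "prefers exact `cost_usd` over MODEL_PRICING estimates" can be sketched as a fallback. The function below is a simplified stand-in (single entry, invented `MODEL_PRICING` shape and rates), not the patch's actual implementation:

```typescript
// Extended CostEntry per the CHANGELOG: cost_usd is optional because
// only some result lines carry an API-reported exact cost.
interface CostEntry {
  input_tokens: number;
  output_tokens: number;
  cost_usd?: number;
}

// Illustrative per-million-token rates; real MODEL_PRICING differs.
const MODEL_PRICING = { inputPerMTok: 3, outputPerMTok: 15 };

// Hypothetical sketch: exact API-reported cost wins; otherwise estimate
// from token counts. Estimates overstate cost when prompt caching makes
// many input tokens cheap, which is why the exact figure is preferred.
function computeCost(entry: CostEntry): number {
  if (entry.cost_usd !== undefined) return entry.cost_usd;
  return (
    (entry.input_tokens / 1e6) * MODEL_PRICING.inputPerMTok +
    (entry.output_tokens / 1e6) * MODEL_PRICING.outputPerMTok
  );
}

console.log(computeCost({ input_tokens: 1_000, output_tokens: 500, cost_usd: 0.0021 })); // 0.0021
console.log(computeCost({ input_tokens: 1_000_000, output_tokens: 0 })); // 3
```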