Mirror of https://github.com/garrytan/gstack.git, synced 2026-05-05 05:05:08 +02:00
chore: bump v0.3.10, update CHANGELOG and docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,5 +1,23 @@
# Changelog
## 0.3.10 — 2026-03-15
### Added
- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost. Extracted from `resultLine.modelUsage` in `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely. ~$0.18/run savings. Set `EVAL_CACHE=0` to force re-run.
- **Dynamic model selection** — `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus, default: sonnet). `EVAL_TIER` pins the E2E test model via `--model` flag to `claude -p`.
- **`bun run eval:trend`** — per-test pass rate tracking over last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading. Sparkline table with `--limit`, `--tier`, `--test` filters. Answers "is /retro getting more reliable?" instantly.
- **CostEntry extended** — `cache_read_input_tokens`, `cache_creation_input_tokens`, `cost_usd` optional fields for accurate cache-aware cost reporting.
- 22 new tests: 10 cache/tier integration (`llm-judge.test.ts`), 12 trend classification (`lib-eval-trend.test.ts`).
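
The SHA-based judge cache described in the list above can be pictured with the minimal sketch below. It hashes the `model:prompt` pair into a key and skips the lookup when `EVAL_CACHE=0`. Everything beyond those two details is an assumption made for illustration: the names `cacheKey`, `cachedJudge`, and `JudgeCall`, the choice of SHA-256, the in-memory `Map` (the real `eval-cache.ts` may well persist results elsewhere), and the `call` callback that stands in for an API request.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a cache key from the `model:prompt` pair.
// SHA-256 is an assumption; the changelog only says "SHA-based".
function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}:${prompt}`).digest("hex");
}

// In-memory stand-in for whatever storage the real cache uses.
const judgeCache = new Map<string, string>();

// Stand-in for the function that actually calls the judge model.
type JudgeCall = (model: string, prompt: string) => Promise<string>;

async function cachedJudge(
  model: string,
  prompt: string,
  call: JudgeCall,
  env: Record<string, string | undefined> = process.env,
): Promise<{ result: string; cached: boolean }> {
  const key = cacheKey(model, prompt);
  // EVAL_CACHE=0 forces a re-run by skipping the lookup.
  if (env.EVAL_CACHE !== "0") {
    const hit = judgeCache.get(key);
    if (hit !== undefined) return { result: hit, cached: true };
  }
  const result = await call(model, prompt);
  judgeCache.set(key, result);
  return { result, cached: false };
}
```

Assuming the judge prompt embeds the SKILL.md content, any edit to that content produces a new key and therefore a fresh API call, while unchanged content resolves from the cache.
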
### Changed
- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains simple return type for E2E callers.
- `EvalCollector.finalize()` aggregates per-test `costs[]` into result-level cost breakdown.
- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.
- Regression test refactored from direct `Anthropic()` client to `callJudge()` (benefits from cache + tier).
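
The cost handling mentioned in this release (prefer the API-reported `cost_usd`, fall back to a pricing-table estimate, then aggregate per-test entries) might look roughly like the sketch below. Only the three optional `CostEntry` fields come from the changelog. The other field names, the helper names `entryCost` and `totalCost`, and the `PLACEHOLDER_PRICING` table with its made-up rates are assumptions, not the project's actual `computeCosts()` or `MODEL_PRICING`.

```typescript
interface CostEntry {
  model: string; // assumed field name
  input_tokens: number; // assumed field name
  output_tokens: number; // assumed field name
  cache_read_input_tokens?: number; // from the changelog
  cache_creation_input_tokens?: number; // from the changelog
  cost_usd?: number; // from the changelog: API-reported cost
}

// Placeholder rates in USD per million tokens. These are NOT real prices.
const PLACEHOLDER_PRICING: Record<string, { input: number; output: number }> = {
  "example-model": { input: 1, output: 2 },
};

// Prefer the exact API-reported cost; otherwise estimate from token counts.
function entryCost(entry: CostEntry): number {
  if (entry.cost_usd !== undefined) return entry.cost_usd;
  const rates = PLACEHOLDER_PRICING[entry.model];
  if (!rates) return 0; // unknown model: no estimate available
  return (
    (entry.input_tokens * rates.input + entry.output_tokens * rates.output) /
    1_000_000
  );
}

// Sum per-test entries into a single result-level total.
function totalCost(entries: CostEntry[]): number {
  return entries.reduce((sum, entry) => sum + entryCost(entry), 0);
}
```

The fallback here deliberately ignores the cache token fields, which hints at why a table-based estimate can drift from the API-reported figure once prompt caching is involved.
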
## 0.3.9 — 2026-03-15
### Added
@@ -15,6 +15,7 @@ bun run dev:skill # watch mode: auto-regen + validate on change
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runs
bun run eval:trend # per-test pass rate trends (flaky detection)
```
`test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time
@@ -134,6 +134,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
bun run eval:list # list all eval runs
bun run eval:compare # compare two runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all runs
bun run eval:trend # per-test pass rate over last N runs (flaky detection)
bun run eval:cache stats # check LLM judge cache hit rate
```
Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
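
For the trend analysis that these accumulated runs feed, `eval:trend` is described as sorting each test into stable-pass, stable-fail, flaky, improving, or degrading. The sketch below shows one plausible way to derive those labels from a pass/fail history. The function names, the split into older and newer halves, and the 0.5 threshold are all assumptions for illustration, not the rules the project actually uses.

```typescript
type Trend =
  | "stable-pass"
  | "stable-fail"
  | "flaky"
  | "improving"
  | "degrading";

// Fraction of passing runs; 0 for an empty slice.
function passRate(runs: boolean[]): number {
  if (runs.length === 0) return 0;
  return runs.filter(Boolean).length / runs.length;
}

// Hypothetical classifier. `history` is ordered oldest to newest,
// with `true` meaning the test passed in that run.
function classifyTrend(history: boolean[]): Trend {
  if (history.length === 0) throw new Error("history must not be empty");
  if (history.every((passed) => passed)) return "stable-pass";
  if (history.every((passed) => !passed)) return "stable-fail";

  // Compare the older half of the window against the newer half.
  const mid = Math.floor(history.length / 2);
  const delta = passRate(history.slice(mid)) - passRate(history.slice(0, mid));

  // Assumed threshold: a swing of 0.5 or more counts as a trend.
  if (delta >= 0.5) return "improving";
  if (delta <= -0.5) return "degrading";
  return "flaky";
}
```

Under these assumed rules, a history of two failures followed by two passes is labeled improving, while alternating passes and failures come out as flaky.
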
@@ -152,7 +154,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
# Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
```
- Uses `claude-sonnet-4-6` for scoring stability
- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
- Tests live in `test/skill-llm-eval.test.ts`
- Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code