2 Commits

Author SHA1 Message Date
Garry Tan 02925cfc7a feat: wire costs[] from modelUsage into eval results
Extract per-model token usage from resultLine.modelUsage (including
cache tokens and exact API cost), flow CostEntry[] through EvalCollector,
aggregate in finalize(). Extend CostEntry with cache_read_input_tokens,
cache_creation_input_tokens, cost_usd. computeCosts() prefers exact
cost_usd over MODEL_PRICING when available (~4x more accurate with
prompt caching).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:27 -05:00
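
To illustrate the "prefer exact cost_usd over MODEL_PRICING" rule described in the commit above, here is a minimal sketch. It is not the repository's code: the CostEntry layout beyond the three field names in the commit message, the helper entryCost(), the Map-based pricing table, the "example-model" key, and all rates are assumptions made for illustration.

```typescript
// Hypothetical shape. Only cache_read_input_tokens, cache_creation_input_tokens,
// and cost_usd are named in the commit message; the other fields are assumed.
interface CostEntry {
  model: string;
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
  cost_usd?: number; // exact cost reported by the API, when available
}

interface Rates {
  input: number; // USD per million input tokens
  output: number; // USD per million output tokens
}

// Placeholder rates for illustration only, NOT real pricing.
const MODEL_PRICING = new Map<string, Rates>([
  ["example-model", { input: 1.0, output: 5.0 }],
]);

// Assumed fallback used when a model is missing from the table.
const FALLBACK_RATES: Rates = { input: 1.0, output: 5.0 };

// Cost of a single entry: the exact figure wins, the table is only an estimate.
function entryCost(entry: CostEntry): number {
  if (typeof entry.cost_usd === "number") {
    return entry.cost_usd;
  }
  const rates = MODEL_PRICING.get(entry.model) ?? FALLBACK_RATES;
  return (entry.input_tokens * rates.input + entry.output_tokens * rates.output) / 1e6;
}

// Total cost across all entries.
function computeCosts(entries: CostEntry[]): number {
  return entries.reduce((sum, entry) => sum + entryCost(entry), 0);
}
```

The table-based estimate in this sketch ignores cache tokens, which is one plausible reason an exact cost_usd would be preferred when prompt caching is in use.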
Garry Tan 9bc6c9416f feat: add eval format validation, tier selection, cost tracking
- lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(),
  normalizeFromLegacy/normalizeToLegacy round-trip converters
- lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env,
  tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full)
- lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(),
  formatCostDashboard(), aggregateCosts(), fallback for unknown models
- 42 tests across 3 test files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:18 -05:00
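
The alias mapping listed for lib/eval-tier.ts (haiku→fast, sonnet→standard, opus→full) could be resolved roughly as follows. This is a sketch under stated assumptions, not the file's actual contents: the resolveTier() signature, the "standard" default, the Map-based alias table, and the choice to fall back rather than throw on unknown values are all hypothetical.

```typescript
// Tier names taken from the alias targets in the commit message.
type EvalTier = "fast" | "standard" | "full";

const TIERS: readonly EvalTier[] = ["fast", "standard", "full"];

// Alias pairs as listed in the commit message.
const TIER_ALIASES = new Map<string, EvalTier>([
  ["haiku", "fast"],
  ["sonnet", "standard"],
  ["opus", "full"],
]);

// Assumed signature: takes the raw value (for example, read from an
// environment variable by the caller) and returns a canonical tier.
function resolveTier(raw: string | undefined, fallback: EvalTier = "standard"): EvalTier {
  if (!raw) {
    return fallback;
  }
  const key = raw.trim().toLowerCase();
  // Canonical tier names pass through unchanged.
  const canonical = TIERS.find((tier) => tier === key);
  if (canonical) {
    return canonical;
  }
  // Otherwise try the alias table, then fall back (assumed behavior).
  return TIER_ALIASES.get(key) ?? fallback;
}
```

Under these assumptions, resolveTier("opus") yields "full", and an unset or unrecognized value yields the fallback tier.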