diff --git a/TODOS.md b/TODOS.md
index 9327d343..1c2380b0 100644
--- a/TODOS.md
+++ b/TODOS.md
@@ -336,14 +336,11 @@
 **Priority:** P2
 **Depends on:** Eval persistence (shipped in v0.3.6)
 
-### E2E model pinning
+### E2E model pinning — SHIPPED
 
-**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
+~~**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.~~
 
-**Why:** Reduce E2E test cost and flakiness.
-
-**Effort:** XS
-**Priority:** P2
+Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). `--retry 2` added. `EVALS_MODEL` env var for override. `test:e2e:fast` tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
 
 ### Eval web dashboard
 