From 2bb6df73f44e8d6d49573e345ea7a972c68dc7bc Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Sat, 21 Mar 2026 13:12:30 -0700
Subject: [PATCH] docs: mark E2E model pinning TODO as shipped

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 TODOS.md | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/TODOS.md b/TODOS.md
index 9327d343..1c2380b0 100644
--- a/TODOS.md
+++ b/TODOS.md
@@ -336,14 +336,11 @@
 **Priority:** P2
 **Depends on:** Eval persistence (shipped in v0.3.6)
 
-### E2E model pinning
+### E2E model pinning — SHIPPED
 
-**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
+~~**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.~~
 
-**Why:** Reduce E2E test cost and flakiness.
-
-**Effort:** XS
-**Priority:** P2
+Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). `--retry 2` added. `EVALS_MODEL` env var for override. `test:e2e:fast` tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
 
 ### Eval web dashboard
 
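
Note on the "Shipped" entry: the per-tier default plus `EVALS_MODEL` override it describes could be resolved roughly as sketched below. This is a minimal illustration under stated assumptions, not the repository's actual code: `TestTier`, `DEFAULT_MODELS`, and `resolveModel` are hypothetical names, the Opus model ID is a guess (the patch only names `claude-sonnet-4-6`), and it assumes the override applies to both tiers, which the patch does not specify.

```typescript
// Hypothetical sketch: every name here is illustrative, not taken from the repo.
type TestTier = "structure" | "quality";

// Assumed defaults: Sonnet for structure tests, Opus for quality tests.
const DEFAULT_MODELS: Record<TestTier, string> = {
  structure: "claude-sonnet-4-6",
  quality: "claude-opus-4-6", // assumed ID; the patch does not state it
};

// Return the model for a tier. A non-empty EVALS_MODEL value in `env`
// takes precedence over the tier default (assumed to apply to both tiers).
function resolveModel(
  tier: TestTier,
  env: Record<string, string | undefined> = {},
): string {
  const override = env["EVALS_MODEL"]?.trim();
  return override ? override : DEFAULT_MODELS[tier];
}
```

In a Node or Bun runtime the caller would pass `process.env` as the second argument, for example `resolveModel("structure", process.env)`. Taking the environment as a parameter keeps the function pure and easy to test.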