docs: mark E2E model pinning TODO as shipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Garry Tan
2026-03-21 13:12:30 -07:00
parent fa61e2fe3c
commit 2bb6df73f4
+3 -6
@@ -336,14 +336,11 @@
**Priority:** P2
**Depends on:** Eval persistence (shipped in v0.3.6)
-### E2E model pinning
+### E2E model pinning — SHIPPED
-**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
+~~**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.~~
-**Why:** Reduce E2E test cost and flakiness.
-**Effort:** XS
-**Priority:** P2
+Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). `--retry 2` added. `EVALS_MODEL` env var for override. `test:e2e:fast` tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
### Eval web dashboard
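For readers who want a picture of the model selection described in the "Shipped" note, here is a minimal sketch, not the project's actual code. Only `EVALS_MODEL` and `claude-sonnet-4-6` come from the text above. The `Tier` type, `DEFAULT_MODELS`, `resolveModel`, and the Opus placeholder string are hypothetical, and the sketch assumes the override applies to both tiers.

```typescript
// Hypothetical sketch: only EVALS_MODEL and claude-sonnet-4-6 appear in the source.
type Tier = "structure" | "quality";

// Assumed per-tier defaults: Sonnet for structure tests, Opus for quality tests.
// The Opus identifier is not given in the source, so a placeholder is used.
const DEFAULT_MODELS: Record<Tier, string> = {
  structure: "claude-sonnet-4-6",
  quality: "<opus-model-id>",
};

// Returns the model for a test tier, letting a non-empty EVALS_MODEL
// value override the default (assumed here to apply to every tier).
function resolveModel(
  tier: Tier,
  env: Record<string, string | undefined>,
): string {
  const override = env["EVALS_MODEL"]?.trim();
  return override ? override : DEFAULT_MODELS[tier];
}
```

In a Node or Bun test runner this would typically be called as `resolveModel("structure", process.env)`. Passing the environment in as an argument keeps the function easy to test.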