mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-07 05:56:41 +02:00
feat: unified gstack eval CLI with list, compare, push, cache, cost
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch subcommands. Ports logic from 4 separate scripts into a unified entry. Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list.
- bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern
- package.json: eval:* scripts now point to lib/cli-eval.ts
- supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS
- docs/eval-result-format.md: public format spec for any language
- test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess) including 3 push failure modes (file-not-found, invalid schema, sync unavailable)

215 tests passing across 13 files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Standard Eval Result Format
This document defines the JSON format that any language can produce and push into gstack's eval infrastructure via `gstack eval push <file>`.
## Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `schema_version` | `number` | Format version (currently `1`) |
| `version` | `string` | Version of the tool/system being evaluated |
| `git_branch` | `string` | Git branch name |
| `git_sha` | `string` | Git commit SHA (short or full) |
| `timestamp` | `string` | ISO 8601 timestamp |
| `tier` | `string` | Eval tier: `"e2e"`, `"llm-judge"`, or custom |
| `total` | `number` | Total number of test cases |
| `passed` | `number` | Number of passing test cases |
| `failed` | `number` | Number of failing test cases |
| `total_cost_usd` | `number` | Total estimated cost in USD |
| `duration_seconds` | `number` | Total wall-clock duration in seconds |
| `all_results` | `array` | Array of test result objects (see below) |
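A minimal producer can be sketched in Python. The field values below are illustrative placeholders, not output from a real run:

```python
import json
from datetime import datetime, timezone

# Build a payload containing only the required fields from the table above.
# All values here are illustrative; a real harness would fill them in.
result = {
    "schema_version": 1,
    "version": "0.3.3",
    "git_branch": "main",
    "git_sha": "abc1234",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "tier": "e2e",
    "total": 1,
    "passed": 1,
    "failed": 0,
    "total_cost_usd": 0.0,
    "duration_seconds": 0.0,
    "all_results": [{"name": "smoke-test", "passed": True}],
}

# Write the file that would be handed to `gstack eval push`.
with open("results.json", "w") as f:
    json.dump(result, f, indent=2)
```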
## Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `hostname` | `string` | Machine hostname |
| `label` | `string` | Human-readable label for this run |
| `prompt_sha` | `string` | SHA of the prompt(s) used |
| `by_category` | `object` | `{ category: { passed, failed } }` breakdown |
| `costs` | `array` | Per-model cost entries (see below) |
| `comparison` | `array` | A/B comparison entries |
| `failures` | `array` | Structured failure details |
| `_partial` | `boolean` | `true` for incremental saves, absent in the final result |
## Test Result Entry (`all_results[]`)
Each entry in `all_results` must have:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | `string` | Yes | Unique test name |
| `passed` | `boolean` | Yes | Whether this test passed |
| `suite` | `string` | No | Suite/group name |
| `tier` | `string` | No | Test tier |
| `duration_ms` | `number` | No | Duration in milliseconds |
| `cost_usd` | `number` | No | Cost for this test |
| `output` | `object` | No | Open-ended output data |
| `turns_used` | `number` | No | LLM conversation turns |
| `exit_reason` | `string` | No | `"success"`, `"timeout"`, `"error_max_turns"`, etc. |
| `detection_rate` | `number` | No | Bugs detected (for QA evals) |
| `judge_scores` | `object` | No | `{ dimension: score }` from LLM judge |
| `judge_reasoning` | `string` | No | LLM judge's reasoning |
| `error` | `string` | No | Error message if test failed |
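For illustration, an `llm-judge` tier entry using the optional judge fields might look like the following. The dimension names, scores, and reasoning text are invented for this example:

```python
# Hypothetical llm-judge entry; dimension names and values are invented
# to illustrate the optional judge_scores / judge_reasoning fields.
entry = {
    "name": "summarize-article",
    "tier": "llm-judge",
    "passed": True,
    "cost_usd": 0.12,
    "judge_scores": {"accuracy": 4, "clarity": 5},
    "judge_reasoning": "Summary covers all key points with minor omissions.",
}
```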
## Cost Entry (`costs[]`)
| Field | Type | Description |
|-------|------|-------------|
| `model` | `string` | Model ID (e.g., `"claude-sonnet-4-6"`) |
| `calls` | `number` | Number of API calls |
| `input_tokens` | `number` | Total input tokens |
| `output_tokens` | `number` | Total output tokens |
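Consumers can aggregate across these entries however they like. As a sketch, a hypothetical helper (not part of gstack) summing total token usage:

```python
# Hypothetical helper: sum token usage across the per-model cost
# entries described above.
def total_tokens(costs: list) -> int:
    return sum(c["input_tokens"] + c["output_tokens"] for c in costs)

costs = [
    {"model": "claude-sonnet-4-6", "calls": 10,
     "input_tokens": 500_000, "output_tokens": 250_000},
]
print(total_tokens(costs))  # → 750000
```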
## Example
```json
{
  "schema_version": 1,
  "version": "0.3.3",
  "git_branch": "main",
  "git_sha": "abc1234",
  "timestamp": "2025-05-01T12:00:00Z",
  "hostname": "ci-runner-01",
  "tier": "e2e",
  "total": 2,
  "passed": 1,
  "failed": 1,
  "total_cost_usd": 1.50,
  "duration_seconds": 120,
  "all_results": [
    {
      "name": "login-flow",
      "suite": "auth",
      "passed": true,
      "duration_ms": 60000,
      "cost_usd": 0.75,
      "turns_used": 5
    },
    {
      "name": "checkout-flow",
      "suite": "commerce",
      "passed": false,
      "duration_ms": 60000,
      "cost_usd": 0.75,
      "error": "Timed out waiting for payment confirmation"
    }
  ],
  "costs": [
    {
      "model": "claude-sonnet-4-6",
      "calls": 10,
      "input_tokens": 500000,
      "output_tokens": 250000
    }
  ]
}
```
## Legacy Format
gstack's internal eval system uses a slightly different format (from `test/helpers/eval-store.ts`). The `normalizeFromLegacy()` and `normalizeToLegacy()` functions in `lib/eval-format.ts` handle conversion:
| Legacy field | Standard field |
|-------------|---------------|
| `branch` | `git_branch` |
| `total_tests` | `total` |
| `total_duration_ms` | `duration_seconds` (÷ 1000) |
| `tests` | `all_results` |
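The mapping above can be sketched as a standalone conversion. This mirrors the table only; the actual implementation is `normalizeFromLegacy()` in `lib/eval-format.ts`:

```python
# Illustrative sketch of the legacy -> standard field mapping in the
# table above; not the real lib/eval-format.ts implementation.
def normalize_from_legacy(legacy: dict) -> dict:
    result = dict(legacy)  # work on a copy
    result["git_branch"] = result.pop("branch")
    result["total"] = result.pop("total_tests")
    result["duration_seconds"] = result.pop("total_duration_ms") / 1000
    result["all_results"] = result.pop("tests")
    return result

legacy = {"branch": "main", "total_tests": 2,
          "total_duration_ms": 120000, "tests": []}
print(normalize_from_legacy(legacy)["duration_seconds"])  # → 120.0
```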
## Validation
Use `gstack eval push <file>` to validate and push a result file. Validation checks:

- All required fields present with correct types
- `all_results` is an array of objects
- Each entry has `name` (string) and `passed` (boolean)
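These checks can be approximated client-side before pushing. This is a sketch of that pre-flight, not the validator the CLI actually runs:

```python
# Approximate pre-flight validation mirroring the checks listed above.
# A sketch only; gstack eval push performs its own validation.
REQUIRED = {
    "schema_version": (int, float), "version": str, "git_branch": str,
    "git_sha": str, "timestamp": str, "tier": str,
    "total": (int, float), "passed": (int, float), "failed": (int, float),
    "total_cost_usd": (int, float), "duration_seconds": (int, float),
    "all_results": list,
}

def validate(result: dict) -> list:
    errors = []
    for field, types in REQUIRED.items():
        if field not in result:
            errors.append(f"missing required field: {field}")
        elif not isinstance(result[field], types):
            errors.append(f"wrong type for {field}")
    for i, entry in enumerate(result.get("all_results", [])):
        if not isinstance(entry, dict):
            errors.append(f"all_results[{i}] is not an object")
        elif not (isinstance(entry.get("name"), str)
                  and isinstance(entry.get("passed"), bool)):
            errors.append(f"all_results[{i}] needs name (string) "
                          f"and passed (boolean)")
    return errors
```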
## Pushing Results
```bash
# Validate + save locally + push to team Supabase (if configured)
gstack eval push my-eval-results.json

# From any language — just write JSON and push:
python run_evals.py --output results.json
gstack eval push results.json
```