Files
gstack/docs/eval-result-format.md
Garry Tan 4ad73f7362 feat: unified gstack eval CLI with list, compare, push, cache, cost
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch
  subcommands. Ports logic from 4 separate scripts into unified entry.
  Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list.
- bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern
- package.json: eval:* scripts now point to lib/cli-eval.ts
- supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS
- docs/eval-result-format.md: public format spec for any language
- test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess)
  including 3 push failure modes (file-not-found, invalid schema,
  sync unavailable)

215 tests passing across 13 files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:36 -05:00

4.4 KiB

Standard Eval Result Format

This document defines the JSON format that any language can produce and push into gstack's eval infrastructure via gstack eval push <file>.

Required Fields

Field Type Description
schema_version number Format version (currently 1)
version string Version of the tool/system being evaluated
git_branch string Git branch name
git_sha string Git commit SHA (short or full)
timestamp string ISO 8601 timestamp
tier string Eval tier: "e2e", "llm-judge", or custom
total number Total number of test cases
passed number Number of passing test cases
failed number Number of failing test cases
total_cost_usd number Total estimated cost in USD
duration_seconds number Total wall-clock duration in seconds
all_results array Array of test result objects (see below)

Optional Fields

Field Type Description
hostname string Machine hostname
label string Human-readable label for this run
prompt_sha string SHA of the prompt(s) used
by_category object { category: { passed, failed } } breakdown
costs array Per-model cost entries (see below)
comparison array A/B comparison entries
failures array Structured failure details
_partial boolean true for incremental saves, absent in final

Test Result Entry (all_results[])

Each entry in all_results must have:

Field Type Required Description
name string Yes Unique test name
passed boolean Yes Whether this test passed
suite string No Suite/group name
tier string No Test tier
duration_ms number No Duration in milliseconds
cost_usd number No Cost for this test
output object No Open-ended output data
turns_used number No LLM conversation turns
exit_reason string No "success", "timeout", "error_max_turns", etc.
detection_rate number No Bugs detected (for QA evals)
judge_scores object No { dimension: score } from LLM judge
judge_reasoning string No LLM judge's reasoning
error string No Error message if test failed

Cost Entry (costs[])

Field Type Description
model string Model ID (e.g., "claude-sonnet-4-6")
calls number Number of API calls
input_tokens number Total input tokens
output_tokens number Total output tokens

Example

{
  "schema_version": 1,
  "version": "0.3.3",
  "git_branch": "main",
  "git_sha": "abc1234",
  "timestamp": "2025-05-01T12:00:00Z",
  "hostname": "ci-runner-01",
  "tier": "e2e",
  "total": 2,
  "passed": 1,
  "failed": 1,
  "total_cost_usd": 1.50,
  "duration_seconds": 120,
  "all_results": [
    {
      "name": "login-flow",
      "suite": "auth",
      "passed": true,
      "duration_ms": 60000,
      "cost_usd": 0.75,
      "turns_used": 5
    },
    {
      "name": "checkout-flow",
      "suite": "commerce",
      "passed": false,
      "duration_ms": 60000,
      "cost_usd": 0.75,
      "error": "Timed out waiting for payment confirmation"
    }
  ],
  "costs": [
    {
      "model": "claude-sonnet-4-6",
      "calls": 10,
      "input_tokens": 500000,
      "output_tokens": 250000
    }
  ]
}

Legacy Format

gstack's internal eval system uses a slightly different format (from test/helpers/eval-store.ts). The normalizeFromLegacy() and normalizeToLegacy() functions in lib/eval-format.ts handle conversion:

Legacy field Standard field
branch git_branch
total_tests total
total_duration_ms duration_seconds (÷ 1000)
tests all_results

Validation

Use gstack eval push <file> to validate and push a result file. Validation checks:

  • All required fields present with correct types
  • all_results is an array of objects
  • Each entry has name (string) and passed (boolean)

Pushing Results

# Validate + save locally + push to team Supabase (if configured)
gstack eval push my-eval-results.json

# From any language — just write JSON and push:
python run_evals.py --output results.json
gstack eval push results.json