Standard Eval Result Format
This document defines the JSON format that tools written in any language can produce and push into gstack's eval infrastructure via `gstack eval push <file>`.
Required Fields
| Field | Type | Description |
|---|---|---|
| `schema_version` | `number` | Format version (currently `1`) |
| `version` | `string` | Version of the tool/system being evaluated |
| `git_branch` | `string` | Git branch name |
| `git_sha` | `string` | Git commit SHA (short or full) |
| `timestamp` | `string` | ISO 8601 timestamp |
| `tier` | `string` | Eval tier: `"e2e"`, `"llm-judge"`, or custom |
| `total` | `number` | Total number of test cases |
| `passed` | `number` | Number of passing test cases |
| `failed` | `number` | Number of failing test cases |
| `total_cost_usd` | `number` | Total estimated cost in USD |
| `duration_seconds` | `number` | Total wall-clock duration in seconds |
| `all_results` | `array` | Array of test result objects (see below) |
Optional Fields
| Field | Type | Description |
|---|---|---|
| `hostname` | `string` | Machine hostname |
| `label` | `string` | Human-readable label for this run |
| `prompt_sha` | `string` | SHA of the prompt(s) used |
| `by_category` | `object` | `{ category: { passed, failed } }` breakdown |
| `costs` | `array` | Per-model cost entries (see below) |
| `comparison` | `array` | A/B comparison entries |
| `failures` | `array` | Structured failure details |
| `_partial` | `boolean` | `true` for incremental saves, absent in final |
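The `by_category` breakdown can be derived from the per-test entries. The following is a minimal sketch, not gstack's own logic: it assumes each entry's optional `suite` field doubles as its category (the spec does not say how categories are assigned), and both the `group_by_category` helper and the `"uncategorized"` fallback are hypothetical.

```python
from collections import defaultdict


def group_by_category(all_results):
    """Build a { category: { passed, failed } } mapping from test entries.

    Assumption: the optional `suite` field is used as the category key, and
    entries without one fall into a hypothetical "uncategorized" bucket.
    """
    breakdown = defaultdict(lambda: {"passed": 0, "failed": 0})
    for entry in all_results:
        category = entry.get("suite", "uncategorized")
        outcome = "passed" if entry["passed"] else "failed"
        breakdown[category][outcome] += 1
    return dict(breakdown)


results = [
    {"name": "login-flow", "suite": "auth", "passed": True},
    {"name": "checkout-flow", "suite": "commerce", "passed": False},
]
print(group_by_category(results))
```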
Test Result Entry (all_results[])
Each entry in `all_results` is an object with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | `string` | Yes | Unique test name |
| `passed` | `boolean` | Yes | Whether this test passed |
| `suite` | `string` | No | Suite/group name |
| `tier` | `string` | No | Test tier |
| `duration_ms` | `number` | No | Duration in milliseconds |
| `cost_usd` | `number` | No | Cost for this test |
| `output` | `object` | No | Open-ended output data |
| `turns_used` | `number` | No | LLM conversation turns |
| `exit_reason` | `string` | No | `"success"`, `"timeout"`, `"error_max_turns"`, etc. |
| `detection_rate` | `number` | No | Bugs detected (for QA evals) |
| `judge_scores` | `object` | No | `{ dimension: score }` from LLM judge |
| `judge_reasoning` | `string` | No | LLM judge's reasoning |
| `error` | `string` | No | Error message if test failed |
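One way a producer might model a test entry is a small record type that emits only the fields that were set, so optional keys stay absent rather than `null`. This is an illustrative sketch: `ResultEntry` is a hypothetical name, it covers only a subset of the optional fields above, and omitting unset fields is a design choice made here rather than something the spec requires.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ResultEntry:
    """Hypothetical producer-side model of one all_results[] entry."""

    # Required fields
    name: str
    passed: bool
    # A subset of the optional fields; None means "not recorded"
    suite: Optional[str] = None
    tier: Optional[str] = None
    duration_ms: Optional[float] = None
    cost_usd: Optional[float] = None
    turns_used: Optional[int] = None
    exit_reason: Optional[str] = None
    error: Optional[str] = None

    def to_dict(self):
        """Return a JSON-ready dict that leaves out unset optional fields."""
        return {key: value for key, value in asdict(self).items() if value is not None}


entry = ResultEntry(name="login-flow", passed=True, suite="auth", turns_used=5)
print(entry.to_dict())
```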
Cost Entry (costs[])
| Field | Type | Description |
|---|---|---|
| `model` | `string` | Model ID (e.g., `"claude-sonnet-4-6"`) |
| `calls` | `number` | Number of API calls |
| `input_tokens` | `number` | Total input tokens |
| `output_tokens` | `number` | Total output tokens |
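If a harness records token usage per API call, the per-model entries can be produced by a simple roll-up. The sketch below assumes a hypothetical per-call record shape (`model`, `input_tokens`, `output_tokens`); only the output shape comes from the table above.

```python
def aggregate_costs(call_records):
    """Roll per-call usage records up into one costs[] entry per model.

    Assumption: each record is a dict with `model`, `input_tokens`, and
    `output_tokens`; this input shape is hypothetical, not part of the spec.
    """
    by_model = {}
    for record in call_records:
        entry = by_model.setdefault(
            record["model"],
            {"model": record["model"], "calls": 0, "input_tokens": 0, "output_tokens": 0},
        )
        entry["calls"] += 1
        entry["input_tokens"] += record["input_tokens"]
        entry["output_tokens"] += record["output_tokens"]
    return list(by_model.values())


calls = [
    {"model": "claude-sonnet-4-6", "input_tokens": 100, "output_tokens": 50},
    {"model": "claude-sonnet-4-6", "input_tokens": 200, "output_tokens": 25},
]
print(aggregate_costs(calls))
```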
Example
    {
      "schema_version": 1,
      "version": "0.3.3",
      "git_branch": "main",
      "git_sha": "abc1234",
      "timestamp": "2025-05-01T12:00:00Z",
      "hostname": "ci-runner-01",
      "tier": "e2e",
      "total": 2,
      "passed": 1,
      "failed": 1,
      "total_cost_usd": 1.50,
      "duration_seconds": 120,
      "all_results": [
        {
          "name": "login-flow",
          "suite": "auth",
          "passed": true,
          "duration_ms": 60000,
          "cost_usd": 0.75,
          "turns_used": 5
        },
        {
          "name": "checkout-flow",
          "suite": "commerce",
          "passed": false,
          "duration_ms": 60000,
          "cost_usd": 0.75,
          "error": "Timed out waiting for payment confirmation"
        }
      ],
      "costs": [
        {
          "model": "claude-sonnet-4-6",
          "calls": 10,
          "input_tokens": 500000,
          "output_tokens": 250000
        }
      ]
    }
Legacy Format
gstack's internal eval system uses a slightly different format (from `test/helpers/eval-store.ts`). The `normalizeFromLegacy()` and `normalizeToLegacy()` functions in `lib/eval-format.ts` handle conversion:
| Legacy field | Standard field |
|---|---|
| `branch` | `git_branch` |
| `total_tests` | `total` |
| `total_duration_ms` | `duration_seconds` (÷ 1000) |
| `tests` | `all_results` |
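The mapping above can be illustrated with a short conversion routine. This is not the `normalizeFromLegacy()` implementation from `lib/eval-format.ts`; it is a sketch that applies only the four documented renames and assumes every other field passes through unchanged. The `from_legacy` name is hypothetical.

```python
# Straight renames from the table; total_duration_ms is handled separately
# because it also needs a unit conversion.
LEGACY_TO_STANDARD = {
    "branch": "git_branch",
    "total_tests": "total",
    "tests": "all_results",
}


def from_legacy(legacy):
    """Convert a legacy-format dict to the standard field names.

    Assumption: fields not listed in the mapping table are copied as-is.
    """
    standard = {}
    for key, value in legacy.items():
        if key == "total_duration_ms":
            standard["duration_seconds"] = value / 1000  # milliseconds to seconds
        else:
            standard[LEGACY_TO_STANDARD.get(key, key)] = value
    return standard


legacy_run = {"branch": "main", "total_tests": 2, "total_duration_ms": 120000, "tests": []}
print(from_legacy(legacy_run))
```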
Validation
Use `gstack eval push <file>` to validate and push a result file. Validation checks:
- All required fields present with correct types
- `all_results` is an array of objects
- Each entry has `name` (string) and `passed` (boolean)
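A producer can run the same checks locally before pushing. The following sketch mirrors the three checks listed above; it is not the validator that `gstack eval push` runs, and the `validate` function and its error messages are hypothetical.

```python
# Required top-level fields and their JSON types, taken from the table above.
REQUIRED_FIELDS = {
    "schema_version": "number", "version": "string", "git_branch": "string",
    "git_sha": "string", "timestamp": "string", "tier": "string",
    "total": "number", "passed": "number", "failed": "number",
    "total_cost_usd": "number", "duration_seconds": "number",
    "all_results": "array",
}


def _has_type(value, kind):
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "string":
        return isinstance(value, str)
    return isinstance(value, list)  # "array"


def validate(result):
    """Return a list of problems; an empty list means the checks passed."""
    errors = []
    for field, kind in REQUIRED_FIELDS.items():
        if field not in result:
            errors.append(f"missing required field: {field}")
        elif not _has_type(result[field], kind):
            errors.append(f"{field} must be of type {kind}")
    entries = result.get("all_results")
    if isinstance(entries, list):
        for index, entry in enumerate(entries):
            if not isinstance(entry, dict):
                errors.append(f"all_results[{index}] must be an object")
                continue
            if not isinstance(entry.get("name"), str):
                errors.append(f"all_results[{index}].name must be a string")
            if not isinstance(entry.get("passed"), bool):
                errors.append(f"all_results[{index}].passed must be a boolean")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a file at once.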
Pushing Results
    # Validate + save locally + push to team Supabase (if configured)
    gstack eval push my-eval-results.json

    # From any language: write JSON, then push
    python run_evals.py --output results.json
    gstack eval push results.json
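As a rough idea of what the writing side of a script like `run_evals.py` could look like, the sketch below assembles the required top-level fields and writes them as JSON. The `write_result_file` helper is hypothetical, and deriving `total`, `passed`, `failed`, and `total_cost_usd` from the entries is an assumption that happens to match the example above rather than a stated rule of the format.

```python
import json
from datetime import datetime, timezone


def write_result_file(path, all_results, *, version, git_branch, git_sha, tier,
                      duration_seconds):
    """Write a result file containing every required top-level field.

    Assumption: the summary counts and total cost are derived from the
    entries; callers supply the git metadata and wall-clock duration.
    """
    passed = sum(1 for entry in all_results if entry["passed"])
    result = {
        "schema_version": 1,
        "version": version,
        "git_branch": git_branch,
        "git_sha": git_sha,
        # ISO 8601 timestamp in UTC, e.g. 2025-05-01T12:00:00Z
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tier": tier,
        "total": len(all_results),
        "passed": passed,
        "failed": len(all_results) - passed,
        "total_cost_usd": sum(entry.get("cost_usd", 0.0) for entry in all_results),
        "duration_seconds": duration_seconds,
        "all_results": all_results,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle, indent=2)
    return result
```

The resulting file can then be handed to `gstack eval push`, which performs the validation described earlier.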