gstack

CalvinBackup/gstack


mirror of https://github.com/garrytan/gstack.git synced 2026-05-05 13:15:24 +02:00

Commit Graph

Author	SHA1	Message	Date

Garry Tan

02925cfc7a

feat: wire costs[] from modelUsage into eval results

Extract per-model token usage from resultLine.modelUsage (including
cache tokens and exact API cost), flow CostEntry[] through EvalCollector,
aggregate in finalize(). Extend CostEntry with cache_read_input_tokens,
cache_creation_input_tokens, cost_usd. computeCosts() prefers exact
cost_usd over MODEL_PRICING when available (~4x more accurate with
prompt caching).
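The preference order described above (exact `cost_usd` first, pricing table as fallback) might look roughly like the following sketch. Only `CostEntry`, `computeCosts`, `MODEL_PRICING`, and the three new field names come from the commit message. The remaining fields, the table shape, and the prices are placeholder assumptions, not the repository's actual code.

```typescript
// Hypothetical shape: model, input_tokens and output_tokens are assumed fields.
interface CostEntry {
  model: string;
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
  cost_usd?: number; // exact cost reported by the API, when available
}

// Placeholder prices in USD per million tokens. NOT real pricing.
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  "example-model": { input: 3, output: 15 },
};

function computeCosts(entries: CostEntry[]): number {
  let total = 0;
  for (const entry of entries) {
    // Prefer the exact figure: it already accounts for cache reads and writes.
    if (typeof entry.cost_usd === "number") {
      total += entry.cost_usd;
      continue;
    }
    // Fallback estimate from the pricing table; unknown models contribute 0.
    const price = MODEL_PRICING[entry.model];
    if (!price) continue;
    total +=
      (entry.input_tokens * price.input + entry.output_tokens * price.output) /
      1_000_000;
  }
  return total;
}
```

The table-based fallback in this sketch ignores cache tokens entirely, which is one plausible reason an estimate can drift from the exact cost when prompt caching is in use.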

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-15 16:47:27 -05:00

Garry Tan

82e204179b

feat: hook eval-store sync, use shared utils, add 30 lib tests

- eval-store.ts: import shared getGitInfo/getVersion, add pushEvalRun()
  hook in finalize() (non-blocking, non-fatal)
- session-runner.ts: import shared atomicWriteSync/sanitizeForFilename
- eval-store.test.ts: fix pre-existing bug in double-finalize test
  (was counting _partial file)
- 30 new tests for lib/util, lib/sync-config, lib/sync
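A "non-blocking, non-fatal" hook of the kind mentioned for `pushEvalRun()` could be sketched as below. The helper name `fireAndForget`, its signature, and the `onError` callback are hypothetical and only illustrate the pattern: the caller does not await the push, and both synchronous throws and rejected promises are routed to an error handler instead of propagating.

```typescript
// Hypothetical push signature; the real pushEvalRun() may differ.
type PushFn = (path: string) => Promise<void>;

function fireAndForget(
  push: PushFn,
  path: string,
  onError: (err: unknown) => void = () => {},
): void {
  try {
    // Not awaited, so the caller returns immediately (non-blocking).
    // A rejection is handled here instead of surfacing (non-fatal).
    void push(path).catch(onError);
  } catch (err) {
    // Covers a push function that throws before returning a promise.
    onError(err);
  }
}
```

In a `finalize()`-style method, a call such as `fireAndForget(pushEvalRun, filePath)` would sit after the local file has been written, so a sync failure never affects the eval result on disk.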

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-15 02:02:54 -05:00

Garry Tan

84f52f3bad

feat: eval persistence with auto-compare against previous run

EvalCollector accumulates test results during eval runs, writes JSON to
~/.gstack-dev/evals/{version}-{branch}-{tier}-{timestamp}.json, prints
a summary table, and automatically compares against the previous run.

- EvalCollector class with addTest() / finalize() / summary table
- findPreviousRun() prefers same branch, falls back to any branch
- compareEvalResults() matches tests by name, detects improved/regressed
- extractToolSummary() counts tool types from transcript events
- formatComparison() renders delta table with per-test + aggregate diffs
- Wire into skill-e2e.test.ts (recordE2E helper) and skill-llm-eval.test.ts
- 19 unit tests for collector + comparison functions
- schema_version: 1 for forward compatibility
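The branch preference and name-based matching in the list above can be illustrated with a minimal sketch. The function names come from the commit message. The `EvalRun` and `TestResult` shapes, the pass/fail model, and the assumption that timestamps are ISO 8601 strings that sort lexicographically are all illustrative, not taken from the repository.

```typescript
// Hypothetical shapes for illustration only.
interface TestResult { name: string; passed: boolean; }
interface EvalRun { branch: string; timestamp: string; tests: TestResult[]; }

type Change = "improved" | "regressed" | "unchanged";

// Prefer the newest run on the same branch, else the newest run on any branch.
function findPreviousRun(runs: EvalRun[], branch: string): EvalRun | undefined {
  const newestFirst = [...runs].sort((a, b) =>
    b.timestamp.localeCompare(a.timestamp),
  );
  return newestFirst.find((run) => run.branch === branch) ?? newestFirst[0];
}

// Match tests by name; tests that exist in only one run are skipped.
function compareEvalResults(prev: EvalRun, curr: EvalRun): Map<string, Change> {
  const before = new Map(prev.tests.map((t) => [t.name, t.passed] as const));
  const changes = new Map<string, Change>();
  for (const test of curr.tests) {
    const was = before.get(test.name);
    if (was === undefined) continue; // new test, nothing to compare against
    if (was === test.passed) changes.set(test.name, "unchanged");
    else changes.set(test.name, test.passed ? "improved" : "regressed");
  }
  return changes;
}
```

A real comparison would likely also diff numeric fields such as cost or duration; this sketch covers only the pass/fail transition.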

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-14 03:49:47 -05:00

3 Commits