Commit Graph

3 Commits

Author SHA1 Message Date
Garry Tan 5aae3ce117 fix: never clean up observability artifacts — partial file persists after finalize
Remove the _partial-e2e.json deletion from finalize(). These files are small and
live on local disk, and their persistence is the whole point of observability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:37:38 -05:00
Garry Tan f9cfabeda8 feat: add E2E observability — heartbeat, progress.log, NDJSON persistence, savePartial()
session-runner: atomic heartbeat file (e2e-live.json), per-run log directory
(~/.gstack-dev/e2e-runs/{runId}/), progress.log + per-test NDJSON persistence,
failure transcripts to persistent run dir instead of tmpdir.

eval-store: 3 new diagnostic fields (exit_reason, timeout_at_turn, last_tool_call),
savePartial() writes _partial-e2e.json after each addTest() for crash resilience,
finalize() cleans up partial file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 11:04:16 -05:00
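
The atomic heartbeat and NDJSON persistence described in f9cfabeda8 could be sketched roughly as follows. This is an illustration only, not the code from the commit: the Heartbeat fields and the helper names writeJsonAtomic, writeHeartbeat, and appendNdjson are hypothetical, and the write-then-rename approach is one common way to get an atomic file update, assumed here rather than confirmed by the message. Only the file name e2e-live.json comes from the commit.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical heartbeat shape; the commit message does not list the real fields.
interface Heartbeat {
  runId: string;
  currentTest: string;
  turn: number;
  updatedAt: string;
}

// Write JSON by creating a temp file next to the target and renaming it over
// the target. On POSIX filesystems a same-directory rename is atomic, so a
// reader sees either the old file or the new one, never a half-written file.
function writeJsonAtomic(filePath: string, data: unknown): void {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2));
  fs.renameSync(tmpPath, filePath);
}

// Overwrite the heartbeat file in the given directory (the directory is an assumption).
function writeHeartbeat(dir: string, heartbeat: Heartbeat): void {
  writeJsonAtomic(path.join(dir, "e2e-live.json"), heartbeat);
}

// Append one JSON record per line (NDJSON). Lines already written survive a
// crash later in the run, which is what makes the format useful for transcripts.
function appendNdjson(filePath: string, record: unknown): void {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.appendFileSync(filePath, JSON.stringify(record) + "\n");
}
```

A savePartial()-style helper could reuse writeJsonAtomic after each recorded test, so that an interrupted run still leaves a complete, parseable snapshot on disk.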
Garry Tan 84f52f3bad feat: eval persistence with auto-compare against previous run
EvalCollector accumulates test results during eval runs, writes JSON to
~/.gstack-dev/evals/{version}-{branch}-{tier}-{timestamp}.json, prints
a summary table, and automatically compares against the previous run.

- EvalCollector class with addTest() / finalize() / summary table
- findPreviousRun() prefers same branch, falls back to any branch
- compareEvalResults() matches tests by name, detects improved/regressed
- extractToolSummary() counts tool types from transcript events
- formatComparison() renders delta table with per-test + aggregate diffs
- Wire into skill-e2e.test.ts (recordE2E helper) and skill-llm-eval.test.ts
- 19 unit tests for collector + comparison functions
- schema_version: 1 for forward compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 03:49:47 -05:00
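
The previous-run lookup and name-based comparison listed in 84f52f3bad might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the repository's implementation: the TestResult and EvalRun shapes, the pass/fail-only notion of improved versus regressed, and the function names pickPreviousRun and compareByName are all hypothetical. It also assumes ISO 8601 timestamps, which sort correctly as plain strings.

```typescript
// Hypothetical, simplified shapes; the real eval schema carries more fields.
interface TestResult {
  name: string;
  passed: boolean;
}

interface EvalRun {
  branch: string;
  timestamp: string; // assumed ISO 8601, so string order matches time order
  tests: TestResult[];
}

type Change = "improved" | "regressed" | "unchanged";

// Prefer the most recent run on the same branch; otherwise fall back to the
// most recent run on any branch. Returns undefined when there are no runs.
function pickPreviousRun(runs: EvalRun[], branch: string): EvalRun | undefined {
  const newestFirst = [...runs].sort((a, b) => b.timestamp.localeCompare(a.timestamp));
  return newestFirst.find((run) => run.branch === branch) ?? newestFirst[0];
}

// Match tests by name and classify each one. Tests that exist in only one of
// the two runs are skipped in this sketch.
function compareByName(previous: EvalRun, current: EvalRun): Map<string, Change> {
  const before = new Map(previous.tests.map((t) => [t.name, t.passed] as const));
  const changes = new Map<string, Change>();
  for (const test of current.tests) {
    const wasPassing = before.get(test.name);
    if (wasPassing === undefined) continue;
    if (wasPassing === test.passed) {
      changes.set(test.name, "unchanged");
    } else {
      changes.set(test.name, test.passed ? "improved" : "regressed");
    }
  }
  return changes;
}
```

Matching by name rather than by position keeps the comparison stable when tests are added, removed, or reordered between runs.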