Commit Graph

81 Commits

Author SHA1 Message Date
Garry Tan cce407b218 Merge remote-tracking branch 'origin/garrytan/team-supabase-store' into garrytan/dev-mode 2026-03-16 00:22:05 -05:00
Garry Tan 6e14689f0e docs: add team sync TODOs — streaming parser, effectiveness scoring, weekly digest
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:40 -05:00
Garry Tan 3a57a3f59e feat: add /setup-team-sync skill, auto-push transcript hooks in skills
- setup-team-sync/SKILL.md.tmpl: idempotent guided setup (create config,
  OAuth, verify connectivity, configure settings, summary)
- ship/retro/qa SKILL.md.tmpl: add push-transcript hook after existing
  push-ship/push-retro/push-qa hooks (silent, non-fatal)
- scripts/gen-skill-docs.ts: add setup-team-sync to template list
- Regenerated all SKILL.md files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:36 -05:00
Garry Tan a104471272 feat: add push-transcript CLI, show sessions, interactive setup, 36 tests
- cli-sync.ts: push-transcript command, show sessions with formatSessionTable(),
  upgrade cmdSetup() to interactively create .gstack-sync.json if missing
- bin/gstack-sync: add push-transcript case and help text
- test/lib-llm-summarize.test.ts: 10 tests with mocked fetch (429 retry,
  5xx backoff, malformed response, no API key, cache)
- test/lib-transcript-sync.test.ts: 22 tests for parsing, grouping,
  session file extraction, marker management, slug resolution
- test/lib-sync-show.test.ts: 4 tests for formatSessionTable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:26 -05:00
Garry Tan 0e29d7d1a3 feat: add enriched transcript sync — Haiku summaries, session file enrichment
Add session intelligence pipeline for team transcript sync:
- lib/transcript-sync.ts: parse history.jsonl, enrich with Claude session
  file data (tools_used, full turn count), sync marker management,
  10-concurrent push with 5-concurrent Haiku summarization
- lib/llm-summarize.ts: raw fetch() to Anthropic Messages API (no SDK dep),
  retry-after on 429, exponential backoff on 5xx, SHA-based eval-cache
- lib/sync.ts: pushTranscript() and pullTranscripts() following existing patterns
- 006_transcript_sync.sql: unique index on (team_id, session_id) for
  idempotent upsert, RLS changed from admin-only to team-wide read

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:19 -05:00
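The retry policy named in commit 0e29d7d1a3 (retry-after on 429, exponential backoff on 5xx) can be sketched as a pure delay calculation. This is a minimal sketch, not the code in lib/llm-summarize.ts: the function name `retryDelayMs`, the 500 ms base delay, and the 30 s cap are assumptions, and only the seconds form of the Retry-After header is handled.

```typescript
// Hypothetical helper: how long to wait before retrying a failed request.
// Returns a delay in milliseconds, or null when the request is not retried.
const BASE_DELAY_MS = 500; // assumed base delay
const MAX_DELAY_MS = 30_000; // assumed cap

function retryDelayMs(
  status: number,
  retryAfterHeader: string | null,
  attempt: number, // 0 for the first retry
): number | null {
  if (status === 429) {
    // Honor Retry-After when it is a non-negative number of seconds.
    const raw = retryAfterHeader === null ? "" : retryAfterHeader.trim();
    const seconds = raw === "" ? NaN : Number(raw);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return Math.min(seconds * 1000, MAX_DELAY_MS);
    }
    return BASE_DELAY_MS; // header missing or not in seconds form
  }
  if (status >= 500 && status <= 599) {
    // Exponential backoff: base, 2x base, 4x base, ... up to the cap.
    return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  }
  return null; // other statuses are not retried in this sketch
}
```

A caller would sleep for the returned delay and give up when the result is null or when its own attempt limit is reached.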
Garry Tan 2d42e15b5c chore: bump version and changelog (v0.3.11)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 20:46:10 -05:00
Garry Tan b07e842f13 Merge remote-tracking branch 'origin/garrytan/team-supabase-store' into garrytan/dev-mode 2026-03-15 20:41:33 -05:00
Garry Tan 5e641bdf76 feat: add Enum & Value Completeness to /review critical checklist
New CRITICAL review category that traces new enum values, status strings,
and type constants through every consumer outside the diff. Catches the
class of bugs where a new value is added but not handled in all switch/case
chains, allowlists, or frontend-backend contracts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 20:41:17 -05:00
Garry Tan 87cb769c35 feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests
- 005_sync_heartbeats.sql migration for connectivity testing
- eval:trend --team flag pulls team eval data (graceful fallback)
- docs/TEAM_SYNC_SETUP.md step-by-step setup guide
- Design doc status updated to Phase 2 complete
- 10 new tests for sync show formatting functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:43:03 -05:00
Garry Tan 06f2da2019 feat: wire team sync push into ship, retro, qa, and greptile skills
Add non-fatal sync steps to all 4 skill templates:
- /ship Step 8.5: write ship log JSON + push after PR creation
- /retro Step 13: push snapshot after JSON save
- /qa Phase 6.7: write qa-sync.json + push after health score
- greptile-triage: push each triage entry after history file writes

All calls use || true for zero disruption. Silent when sync not
configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:42:54 -05:00
Garry Tan dc3fcc8611 feat: DRY push functions, add push-greptile + sync test/show commands
Extract pushWithSync() helper to eliminate boilerplate across 6 push
functions. Add pushHeartbeat() for connectivity testing. Add push-greptile
to CLI. New commands: gstack-sync test (validates full push/pull flow
via sync_heartbeats table), gstack-sync show (terminal team data
dashboard with summary/evals/ships/retros views). Guard main block
with import.meta.main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:42:45 -05:00
Garry Tan c11cb708a5 Merge remote-tracking branch 'origin/garrytan/team-supabase-store' into garrytan/dev-mode 2026-03-15 17:29:37 -05:00
Garry Tan e97108ae10 feat: contributor mode, session awareness, universal RECOMMENDATION format
- Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates
- Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions)
- ELI16 mode when 3+ concurrent sessions detected (re-ground user on context)
- Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/
- Universal AskUserQuestion format: context → question → RECOMMENDATION → options
- Update plan-ceo-review and plan-eng-review to reference preamble baseline
- Add vendored symlink awareness section to CLAUDE.md
- Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing
- Add tests for contributor mode and session awareness in generated output
- Add E2E eval for contributor mode report filing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 17:29:34 -05:00
Garry Tan 704fe34e98 docs: clean up sync example, add team sync section to README
Remove _comment hacks from JSON example file. Add short team sync
section to README explaining what it is, that it's optional, and
how to set it up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:06:51 -05:00
Garry Tan 14320469b0 docs: CHANGELOG covers full branch scope including team sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:05:45 -05:00
Garry Tan eb7ef2153b docs: add setup comments to .gstack-sync.json.example
Explain what team sync gives you, that it's optional, and how to
set it up. Points to TEAM_COORDINATION_STORE.md for full guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:04:49 -05:00
Garry Tan e28033353d chore: bump v0.3.10, update CHANGELOG and docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:55:34 -05:00
Garry Tan 33c9552870 chore: update gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:46 -05:00
Garry Tan daea165333 feat: add eval:trend CLI for per-test pass rate tracking
computeTrends() classifies tests as stable-pass/stable-fail/flaky/
improving/degrading based on pass rate, flip count, and recent streak.
gstack eval trend shows sparkline table with --limit, --tier, --test
filters. Guard CLI main block with import.meta.main to prevent
execution on import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:41 -05:00
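Commit daea165333 names the five trend classes and the three inputs (pass rate, flip count, recent streak) but not the thresholds. The sketch below shows one way such a classifier could work. The function name `classifyTrend`, the "exactly one flip" rule, and the streak threshold of 3 are assumptions for illustration, not the rules in computeTrends().

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// history is ordered oldest to newest; true means the test passed in that run.
function classifyTrend(history: boolean[]): Trend {
  if (history.length === 0) throw new Error("history must not be empty");

  // Pass rate over the whole window.
  const passRate = history.filter(Boolean).length / history.length;
  if (passRate === 1) return "stable-pass";
  if (passRate === 0) return "stable-fail";

  // Flip count: how often the outcome changed between consecutive runs.
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }

  // Recent streak: length of the run of identical outcomes at the end.
  const last = history[history.length - 1];
  let streak = 1;
  for (let i = history.length - 2; i >= 0 && history[i] === last; i--) streak++;

  // Assumed rule: one flip followed by a streak of 3 or more is a trend.
  if (flips === 1 && streak >= 3) return last ? "improving" : "degrading";
  return "flaky";
}
```

Under these assumed rules, a test that failed twice and then passed three times in a row is "improving", while one that alternates every run is "flaky".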
Garry Tan 59752fc510 feat: wire eval-cache + eval-tier into LLM judge, pin E2E model
callJudge/judge now return {result, meta} with SHA-based caching
(~$0.18/run savings when SKILL.md unchanged) and dynamic model
selection via EVAL_JUDGE_TIER env var. E2E tests pass --model from
EVAL_TIER to claude -p. outcomeJudge retains simple return type.
All 8 LLM eval test sites updated with real costs and costs[].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:35 -05:00
Garry Tan 02925cfc7a feat: wire costs[] from modelUsage into eval results
Extract per-model token usage from resultLine.modelUsage (including
cache tokens and exact API cost), flow CostEntry[] through EvalCollector,
aggregate in finalize(). Extend CostEntry with cache_read_input_tokens,
cache_creation_input_tokens, cost_usd. computeCosts() prefers exact
cost_usd over MODEL_PRICING when available (~4x more accurate with
prompt caching).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:27 -05:00
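The preference described in commit 02925cfc7a (use the exact cost_usd when present, otherwise estimate from a pricing table) can be sketched as follows. Only cache_read_input_tokens, cache_creation_input_tokens, and cost_usd come from the commit message. The function name `entryCostUsd`, the other field names, the per-million-token pricing shape, and the zero fallback for unknown models are assumptions.

```typescript
interface CostEntry {
  model: string;
  input_tokens: number; // assumed field name
  output_tokens: number; // assumed field name
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
  cost_usd?: number; // exact cost reported by the API, when available
}

// Assumed pricing shape: USD per million tokens.
interface Pricing {
  input: number;
  output: number;
}

function entryCostUsd(entry: CostEntry, pricing: Record<string, Pricing>): number {
  // Prefer the exact figure: it already accounts for prompt-cache discounts.
  if (entry.cost_usd !== undefined) return entry.cost_usd;

  const p = pricing[entry.model];
  if (!p) return 0; // assumed fallback for unknown models

  // Estimate from token counts; cache tokens are ignored in this sketch.
  return (entry.input_tokens * p.input + entry.output_tokens * p.output) / 1_000_000;
}
```

Passing the pricing table as a parameter keeps the sketch free of real prices. Any values supplied by a caller here are placeholders, not the contents of MODEL_PRICING.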
Garry Tan 4ad73f7362 feat: unified gstack eval CLI with list, compare, push, cache, cost
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch
  subcommands. Ports logic from 4 separate scripts into unified entry.
  Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list.
- bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern
- package.json: eval:* scripts now point to lib/cli-eval.ts
- supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS
- docs/eval-result-format.md: public format spec for any language
- test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess)
  including 3 push failure modes (file-not-found, invalid schema,
  sync unavailable)

215 tests passing across 13 files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:36 -05:00
Garry Tan 1f5b7882e6 feat: add SHA-based eval caching with EVAL_CACHE=0 bypass
Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys
from source file contents + test input via Bun.CryptoHasher SHA256.
Supports read/write/stats/clear/verify operations. EVAL_CACHE=0
skips reads for force-rerun. 16 tests including corrupt JSON handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:26 -05:00
Garry Tan 9bc6c9416f feat: add eval format validation, tier selection, cost tracking
- lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(),
  normalizeFromLegacy/normalizeToLegacy round-trip converters
- lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env,
  tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full)
- lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(),
  formatCostDashboard(), aggregateCosts(), fallback for unknown models
- 42 tests across 3 test files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:18 -05:00
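The alias mapping listed in commit 9bc6c9416f (haiku→fast, sonnet→standard, opus→full) could be applied to an environment value roughly as below. The three aliases are from the commit message. The function name `tierFromEnvValue`, the case-insensitive matching, and the "standard" default are assumptions, and the real resolveTier() signature may differ.

```typescript
type EvalTier = "fast" | "standard" | "full";

// Alias table from the commit message. A Map avoids accidental hits on
// inherited object properties such as "constructor".
const TIER_ALIASES = new Map<string, EvalTier>([
  ["haiku", "fast"],
  ["sonnet", "standard"],
  ["opus", "full"],
]);

const DEFAULT_TIER: EvalTier = "standard"; // assumed default

// raw is the environment variable value, or undefined when it is unset.
function tierFromEnvValue(raw: string | undefined): EvalTier {
  const value = (raw ?? "").trim().toLowerCase();
  if (value === "fast" || value === "standard" || value === "full") return value;
  return TIER_ALIASES.get(value) ?? DEFAULT_TIER;
}
```

A caller would pass the value of EVAL_TIER or EVAL_JUDGE_TIER, the two variables mentioned in the later commit 59752fc510.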
Garry Tan 7f7035f55a feat: add listEvalFiles, loadEvalResults, formatTimestamp to lib/util.ts
DRY up eval I/O duplicated across scripts/eval-list.ts,
eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant,
formatTimestamp(), listEvalFiles(), loadEvalResults() with
--limit support. 13 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:09 -05:00
Garry Tan 82e204179b feat: hook eval-store sync, use shared utils, add 30 lib tests
- eval-store.ts: import shared getGitInfo/getVersion, add pushEvalRun()
  hook in finalize() (non-blocking, non-fatal)
- session-runner.ts: import shared atomicWriteSync/sanitizeForFilename
- eval-store.test.ts: fix pre-existing bug in double-finalize test
  (was counting _partial file)
- 30 new tests for lib/util, lib/sync-config, lib/sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:54 -05:00
Garry Tan f7ae465415 feat: add Supabase migration SQL for team data store
- 001_teams.sql: teams + team_members + RLS
- 002_eval_runs.sql: eval results with universal format, indexes, upsert key
- 003_data_tables.sql: retro, QA, ship, greptile, transcripts + RLS

All tables use RLS: team members read/insert, admins delete.
Transcript table has tighter policy (admin-only read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:47 -05:00
Garry Tan 3713c3b9b9 feat: add team sync infrastructure (config, auth, push/pull, CLI)
- lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json
- lib/auth.ts: device auth flow (browser OAuth, local HTTP callback)
- lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache
- lib/cli-sync.ts: CLI handler for gstack-sync commands
- bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain)
- .gstack-sync.json.example: template for team setup

Zero new dependencies — uses raw fetch() against PostgREST API.
All sync is non-fatal with 5s timeout and offline queue fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:40 -05:00
Garry Tan caed287496 feat: extract shared utilities into lib/util.ts
DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug,
and sanitizeForFilename from eval-store.ts, session-runner.ts, and
eval-watch.ts into a shared module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:32 -05:00
Garry Tan 5c1ea088d8 docs: scrub proprietary refs, close eval format gaps, integrate gstack-config
- Replace project-specific references with generic language
- Add missing fields to eval result format: prompt_sha, by_category,
  timestamp, response_preview
- Enrich failure format with details array, scores dict, expectation_type
- Add EVAL_JUDGE_CACHE, EVAL_VERBOSE, multiprocess worker support,
  dedup on push, run scopes, model aliases, judge profiles
- Restructure credential storage to 4 layers with gstack-config (v0.3.9)
  for user preferences (sync_enabled, sync_transcripts)
- Update integration points, observability, and reuse map

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 01:47:30 -05:00
Garry Tan 89311653df Merge remote-tracking branch 'origin/main' into garrytan/team-supabase-store 2026-03-15 01:32:11 -05:00
Garry Tan f87bc21865 docs: add team coordination store design doc
Design doc for Supabase-backed team data store and universal eval
infrastructure. Covers architecture, credential storage, eval formats,
YAML test case spec, Supabase schema, phased rollout, and security model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 01:32:06 -05:00
Garry Tan bb46ca6b21 feat: smart update check with auto-upgrade, snooze backoff, config CLI (v0.3.9) (#62)
* feat: add bin/gstack-config CLI for reading/writing ~/.gstack/config.yaml

Simple get/set/list interface for persistent gstack configuration.
Used by update-check and upgrade skill for auto_upgrade and update_check settings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: smart update check with 12h cache, snooze backoff, config disable

- Reduce cache TTL from 24h to 12h for faster update detection
- Add exponential snooze backoff: 24h → 48h → 1 week (resets on new version)
- Add update_check: false config option to disable checks entirely
- Clear snooze file on just-upgraded
- 14 new tests covering snooze levels, expiry, corruption, and config paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: upgrade skill with auto-upgrade, 4-option prompt, vendored sync

- Auto-upgrade mode via config or GSTACK_AUTO_UPGRADE=1 env var
- 4-option AskUserQuestion: upgrade once, always, not now, never
- Step 4.5: sync local vendored copy after upgrading primary install
- Snooze write with escalating backoff on "Not now"
- Update preamble text in gen-skill-docs for new upgrade flow
- Regenerate all SKILL.md files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: simplify upgrade instructions, move auto-upgrade to completed

README now points to /gstack-upgrade instead of long paste commands.
Auto-upgrade TODO moved to Completed section (v0.3.8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.9)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:28:02 -07:00
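The snooze backoff in commit bb46ca6b21 (24h → 48h → 1 week, reset on a new version) can be sketched with two small functions. The durations are from the commit message. The function names, the 1-based level numbering, and the way a reset is detected (comparing the snoozed version with the current remote version) are assumptions.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Level 1 snoozes for 24h, level 2 for 48h, level 3 and above for one week.
function snoozeDurationMs(level: number): number {
  if (level <= 1) return 24 * HOUR_MS;
  if (level === 2) return 48 * HOUR_MS;
  return 7 * 24 * HOUR_MS;
}

// Each "Not now" for the same remote version escalates one level (capped at 3).
// A different remote version starts over at level 1.
function nextSnoozeLevel(
  previousLevel: number,
  snoozedVersion: string,
  remoteVersion: string,
): number {
  if (snoozedVersion !== remoteVersion) return 1;
  return Math.min(previousLevel + 1, 3);
}
```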
Garry Tan 41141007c1 feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61)
* fix: log non-ENOENT errors in ensureStateDir() instead of silently swallowing

Replace bare catch {} with ENOENT-only silence. Non-ENOENT errors (EACCES,
ENOSPC) are now logged to .gstack/browse-server.log. Includes test for
permission-denied scenario with chmod 444.

* feat: merge TODO.md + TODOS.md into unified backlog with shared format reference

Merge TODO.md (roadmap) and TODOS.md (near-term) into one file organized by
skill/component with P0-P4 priority ordering and Completed section. Add shared
review/TODOS-format.md for canonical format. Add static validation tests.

* feat: add 2-tier Greptile reply system with escalation detection

Add reply templates (Tier 1 friendly, Tier 2 firm), explicit escalation
detection algorithm, and severity re-ranking guidance to greptile-triage.md.

* feat: cross-skill TODOS awareness + Greptile template refs in all skills

/ship Step 5.5: auto-detect completed TODOs, offer reorganization.
/review Step 5.5: cross-reference PR against open TODOs.
/plan-ceo-review, /plan-eng-review: TODOS context in planning.
/retro: Backlog Health metric. /qa: bug TODO context in diff-aware mode.
All Greptile-aware skills now reference reply templates and escalation detection.

* chore: bump version and changelog (v0.3.8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md for v0.3.8 changes

Clarify test tier cost table (Tier 3 standalone vs combined), add TODOS.md
to "Things to know", mention Greptile triage in ship workflow description.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 20:15:11 -07:00
Garry Tan 2aa745cb0e feat: screenshot element/region clipping (v0.3.7) (#56)
* feat: screenshot element/region clipping (--clip, --viewport, CSS/@ref)

Add element crop (CSS selector or @ref), region clip (--clip x,y,w,h),
and viewport-only (--viewport) modes to the screenshot command. Uses
Playwright's native locator.screenshot() and page.screenshot({ clip }).
Full page remains the default. Includes 10 new tests covering all modes
and error paths.

* chore: bump version and changelog (v0.3.7)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add screenshot modes to BROWSER.md command reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:47:42 -07:00
Garry Tan 0ac7ef4e81 fix: harden planted-bug eval prompt for reliable form testing
Phase 3 was too vague ("click every nav link") causing the agent to
wander instead of systematically testing form fields. Now explicitly
directs: fill every input, clear it, try invalid values, submit and
check console. Added Phase 4 finalize step to ensure report is updated
with all findings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 13:28:18 -05:00
Garry Tan 7d26666164 Merge pull request #55 from garrytan/v0.3.6-qa-upgrades
feat: E2E observability + eval infrastructure + all skills templated
2026-03-14 11:24:24 -07:00
Garry Tan baf8acd55c fix: update check ignores stale UP_TO_DATE cache after version change
The UP_TO_DATE cache path exited immediately without checking if the
cached version still matched the local VERSION. After upgrading (e.g.
0.3.3 → 0.3.4), the cache still said "UP_TO_DATE 0.3.3" and the
script never re-checked against remote — so updates were invisible
until the 24h cache expired.

Now both UP_TO_DATE and UPGRADE_AVAILABLE verify cached version vs
local before trusting the cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 13:23:25 -05:00
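The fix in commit baf8acd55c amounts to a guard: trust the cache only if the version it recorded matches the local VERSION. A minimal sketch follows. The "UP_TO_DATE 0.3.3" form is quoted from the commit message. The function name `cacheMatchesLocal` and the assumption that UPGRADE_AVAILABLE lines also carry the local version as their second token are hypothetical.

```typescript
// Returns true when a cached status line may be trusted for the current install.
// Assumed cache line layout: "<STATUS> <local version at cache time> [...]".
function cacheMatchesLocal(cacheLine: string, localVersion: string): boolean {
  const [status, cachedVersion] = cacheLine.trim().split(/\s+/);
  if (status !== "UP_TO_DATE" && status !== "UPGRADE_AVAILABLE") return false;
  // A mismatch means the install changed since the cache was written,
  // so the caller should ignore the cache and re-check against the remote.
  return cachedVersion === localVersion;
}
```

With the scenario from the commit message, a cache line of "UP_TO_DATE 0.3.3" checked against a local version of 0.3.4 is rejected, which forces a fresh remote check.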
Garry Tan 4e31acbd47 fix: auto-clear stale heartbeat when process is dead
Add PID to heartbeat file. eval-watch checks process.kill(pid, 0) and
auto-deletes the heartbeat when the PID is no longer alive — no manual
cleanup needed after crashed/killed E2E runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:55:40 -05:00
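The liveness check in commit 4e31acbd47 relies on signal 0, which tests whether a process exists without delivering a signal to it. The sketch below takes the signal function as a parameter so the logic can be exercised without a real process. The names `isPidAlive` and `SendSignal` are hypothetical, and treating EPERM as "alive" is an assumption about the desired behavior.

```typescript
// Anything with the shape of process.kill(pid, signal).
type SendSignal = (pid: number, signal: number) => unknown;

function isPidAlive(pid: number, sendSignal: SendSignal): boolean {
  try {
    sendSignal(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (err) {
    const code =
      typeof err === "object" && err !== null
        ? (err as { code?: unknown }).code
        : undefined;
    // EPERM: the process exists but belongs to another user.
    // Anything else (typically ESRCH) is treated as "no such process".
    return code === "EPERM";
  }
}
```

In Node or Bun the second argument would be `(pid, signal) => process.kill(pid, signal)`. A watcher could delete the heartbeat file when this returns false for the PID stored in it.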
Garry Tan 43fbe165a4 docs: update README, CONTRIBUTING, ARCHITECTURE for v0.3.6
Update test tier costs and commands (Agent SDK → claude -p, SKILL_E2E → EVALS),
add E2E observability section to CONTRIBUTING and ARCHITECTURE, add testing
quick-start to README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:47:00 -05:00
Garry Tan 4ace0c2f6f chore: bump version and changelog (v0.3.6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:44:41 -05:00
Garry Tan 9f5aa32e67 fix: fail fast on API connectivity — pre-check before E2E suite
Spawn a quick claude -p ping before running 13 tests. If the Anthropic API
is unreachable (ConnectionRefused), throw immediately instead of burning
through the entire suite with silent false passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:37:44 -05:00
Garry Tan 5aae3ce117 fix: never clean up observability artifacts — partial file persists after finalize
Remove the _partial-e2e.json deletion from finalize(). These are small files
on a local disk, and their persistence is the whole point of observability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:37:38 -05:00
Garry Tan 336dbaa50d fix: detect is_error from claude -p result line (ConnectionRefused was PASS)
claude -p can return subtype="success" with is_error=true when the API is
unreachable. Previously we only checked subtype, so API failures silently
passed. Now check is_error first and report as 'error_api'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:35:43 -05:00
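The ordering fix in commit 336dbaa50d (check is_error before subtype) can be sketched as below. The fields subtype and is_error and the 'error_api' label are from the commit message. The `ResultLine` interface and the function name `exitReason` are hypothetical, and the real result line carries more fields.

```typescript
// Minimal shape of the final result line emitted by `claude -p` (assumed subset).
interface ResultLine {
  subtype: string;
  is_error?: boolean;
}

function exitReason(result: ResultLine): string {
  // is_error wins: subtype can be "success" even when the API was unreachable.
  if (result.is_error === true) return "error_api";
  return result.subtype;
}
```

Checking subtype alone would report the first case in the commit message (subtype "success", is_error true) as a pass.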
Garry Tan 029a7c2a37 feat: eval-watch dashboard + observability unit tests (15 tests, 11 codepaths)
eval-watch: live terminal dashboard reads heartbeat + partial file every 1s,
shows completed/running tests, stale detection (>10min), --tail flag for
progress.log tail. Pure renderDashboard() function for testability.

observability.test.ts: unit tests for sanitizeTestName, heartbeat schema,
progress.log format, NDJSON file naming, savePartial() with _partial flag,
finalize() cleanup, diagnostic fields, watcher rendering, stale detection,
and non-fatal I/O guarantees.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 11:04:40 -05:00
Garry Tan 510a8d8dda feat: wire runId + testName + diagnostics through all E2E tests
Generate per-session runId, pass testName + runId to every runSkillTest() call,
wire exit_reason/timeout_at_turn/last_tool_call through recordE2E(). Add
eval:watch script entry to package.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 11:04:28 -05:00
Garry Tan f9cfabeda8 feat: add E2E observability — heartbeat, progress.log, NDJSON persistence, savePartial()
session-runner: atomic heartbeat file (e2e-live.json), per-run log directory
(~/.gstack-dev/e2e-runs/{runId}/), progress.log + per-test NDJSON persistence,
failure transcripts to persistent run dir instead of tmpdir.

eval-store: 3 new diagnostic fields (exit_reason, timeout_at_turn, last_tool_call),
savePartial() writes _partial-e2e.json after each addTest() for crash resilience,
finalize() cleans up partial file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 11:04:16 -05:00
Garry Tan eb9a9193c9 fix: plan-ceo-review timeout — init git repo, skip codebase exploration, bump to 420s
The CEO review SKILL.md has a "System Audit" step that runs git commands.
In an empty tmpdir without a git repo, the agent wastes turns exploring.
Fix: init minimal git repo, tell agent to skip codebase exploration,
bump test timeouts to 420s for all review/retro tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 08:39:26 -05:00
Garry Tan 7d5036db1a fix: increase timeouts for plan-review and retro E2E tests
plan-ceo-review takes ~300s (thorough 10-section review), retro takes
~220s (many git commands for history analysis). Bumped runSkillTest
timeout to 300s and test timeout to 360s. Also accept error_max_turns
for these verbose skills.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 07:54:48 -05:00
Garry Tan f1ee3d924e feat: template-ify all skills + E2E tests for plan-ceo-review, plan-eng-review, retro
- Convert gstack-upgrade to SKILL.md.tmpl template system
- All 10 skills now use templates (consistent auto-generated headers)
- Add comprehensive template validation tests (22 tests):
  every skill has .tmpl, generated SKILL.md has header, valid frontmatter,
  --dry-run reports FRESH, no unresolved placeholders
- Add E2E tests for /plan-ceo-review, /plan-eng-review, /retro
- Mark /ship, /setup-browser-cookies, /gstack-upgrade as test.todo (destructive/interactive)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 07:28:02 -05:00