gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-06-22 01:30:03 +02:00

Author	SHA1	Message	Date
Garry Tan	a05a670189	feat: support await in $B js and eval commands Auto-wrap await expressions in async IIFE context so $B js "await fetch(...)" works without SyntaxError. - hasAwait() strips comments before detection - js: expression wrapping (async()=>(expr))() - eval: smart wrapping — single-line=expression, multi-line=block - 6 new unit tests covering async, false-positive, and return semantics	2026-03-16 09:43:41 -05:00
Garry Tan	2357f134ce	merge: integrate origin/main (v0.4.0, v0.4.1) into team-supabase-store Resolves conflicts in CHANGELOG.md (ordering), CONTRIBUTING.md (eval tools list merge), VERSION (take main's 0.4.1), qa/SKILL.md.tmpl (keep full methodology + baseline line), eval-store.test.ts (drop redundant comment). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 07:49:27 -05:00
Garry Tan	83bfc7f88d	feat: add /setup-team-dashboard skill, post-ship leaderboard callout Interactive 8-step setup skill for deploying dashboard + edge functions. Post-ship callout shows team leaderboard after successful sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:44:59 -05:00
Garry Tan	78840c64a8	feat: add shared team dashboard, regression alerts, weekly digest edge functions Dashboard: Supabase edge function serving self-contained HTML with PKCE OAuth, 6 parallel client-side REST queries, SVG charts, dark theme, auto-refresh, who's-online from heartbeats. Public URL. Regression alert: webhook on eval_runs INSERT, 5-min cooldown dedup via alert_cooldowns, Slack notification on >5% pass rate drop. Weekly digest: pg_cron Monday 9am UTC, aggregates 7-day team data, Slack message with evals/ships/sessions/costs. 15 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:44:47 -05:00
Garry Tan	46c82ce8ec	feat: add team admin CLI + migration 007 (settings, cooldowns, create_team RPC) New `gstack team` CLI with create, members, set subcommands. Migration adds team_settings (admin-only), alert_cooldowns (edge-fn dedup), and create_team() SECURITY DEFINER RPC for atomic team + first member creation. 9 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:44:24 -05:00
Garry Tan	4985c8e7e9	feat: add CLI leaderboard, refactor formatTeamSummary to use dashboard-queries New `gstack eval leaderboard` subcommand pulls team data and renders weekly stats per contributor. Refactored formatTeamSummary to use computeVelocity from dashboard-queries (DRY). 4 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:44:12 -05:00
Garry Tan	e969c6dadf	feat: add dashboard query functions — pure transforms for team analytics 6 functions: detectRegressions, computeVelocity, computeCostTrend, computeLeaderboard, computeQATrend, computeEvalTrend. All pure, no I/O, with division-by-zero guards. 28 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:43:52 -05:00
Garry Tan	3e3843c4a9	feat: contributor mode, session awareness, recommendation format (#90) * feat: contributor mode, session awareness, universal RECOMMENDATION format - Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates - Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions) - ELI16 mode when 3+ concurrent sessions detected (re-ground user on context) - Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/ - Universal AskUserQuestion format: context → question → RECOMMENDATION → options - Update plan-ceo-review and plan-eng-review to reference preamble baseline - Add vendored symlink awareness section to CLAUDE.md - Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing - Add tests for contributor mode and session awareness in generated output - Add E2E eval for contributor mode report filing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: add Enum & Value Completeness to /review critical checklist New CRITICAL review category that traces new enum values, status strings, and type constants through every consumer outside the diff. Catches the class of bugs where a new value is added but not handled in all switch/case chains, allowlists, or frontend-backend contracts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add CHANGELOG style guide — user-facing, sell the feature Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: rewrite v0.4.1 changelog to be user-facing and sell the features Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add evals for RECOMMENDATION format, session awareness, and enum completeness Free tests (Tier 1): RECOMMENDATION format + session awareness in all preamble SKILL.md files, enum completeness checklist structure and CRITICAL classification. E2E eval: /review catches missed enum handlers when a new status value is added but not handled in case/switch and notify methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add E2E eval for session awareness ELI16 mode Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments branch, verifies the output re-grounds the user with project, branch, context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: contributor mode eval marked FAIL due to expected browse error The test intentionally runs a nonexistent binary to trigger contributor mode. The session runner's browse error detection catches "no such file or directory...browse" and sets browseErrors, causing recordE2E to mark passed=false. Override passed to check only exitReason since the browse error is the expected scenario. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 01:45:50 -05:00
Garry Tan	6e14689f0e	docs: add team sync TODOs — streaming parser, effectiveness scoring, weekly digest Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 00:15:40 -05:00
Garry Tan	3a57a3f59e	feat: add /setup-team-sync skill, auto-push transcript hooks in skills - setup-team-sync/SKILL.md.tmpl: idempotent guided setup (create config, OAuth, verify connectivity, configure settings, summary) - ship/retro/qa SKILL.md.tmpl: add push-transcript hook after existing push-ship/push-retro/push-qa hooks (silent, non-fatal) - scripts/gen-skill-docs.ts: add setup-team-sync to template list - Regenerated all SKILL.md files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 00:15:36 -05:00
Garry Tan	a104471272	feat: add push-transcript CLI, show sessions, interactive setup, 36 tests - cli-sync.ts: push-transcript command, show sessions with formatSessionTable(), upgrade cmdSetup() to interactively create .gstack-sync.json if missing - bin/gstack-sync: add push-transcript case and help text - test/lib-llm-summarize.test.ts: 10 tests with mocked fetch (429 retry, 5xx backoff, malformed response, no API key, cache) - test/lib-transcript-sync.test.ts: 22 tests for parsing, grouping, session file extraction, marker management, slug resolution - test/lib-sync-show.test.ts: 4 tests for formatSessionTable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 00:15:26 -05:00
Garry Tan	0e29d7d1a3	feat: add enriched transcript sync — Haiku summaries, session file enrichment Add session intelligence pipeline for team transcript sync: - lib/transcript-sync.ts: parse history.jsonl, enrich with Claude session file data (tools_used, full turn count), sync marker management, 10-concurrent push with 5-concurrent Haiku summarization - lib/llm-summarize.ts: raw fetch() to Anthropic Messages API (no SDK dep), retry-after on 429, exponential backoff on 5xx, SHA-based eval-cache - lib/sync.ts: pushTranscript() and pullTranscripts() following existing patterns - 006_transcript_sync.sql: unique index on (team_id, session_id) for idempotent upsert, RLS changed from admin-only to team-wide read Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 00:15:19 -05:00
Garry Tan	f3ee0ee28a	feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83) * feat: browser ref staleness detection via async count() validation resolveRef() now checks element count to detect stale refs after page mutations (e.g. SPA navigation). RefEntry stores role+name metadata for better diagnostics. 3 new snapshot tests for staleness detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: qa-only skill, qa fix loop, plan-to-QA artifact flow Add /qa-only (report-only, Edit tool blocked), restructure /qa with find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for shared methodology. /plan-eng-review now writes test-plan artifacts to ~/.gstack/projects/<slug>/ for QA consumption. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: eval efficiency metrics — turns, duration, commentary across all surfaces Add generateCommentary() for natural-language delta interpretation, per-test turns/duration in comparison and summary output, judgePassed unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.4.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0 - ARCHITECTURE: add ref staleness detection section, update RefEntry type - BROWSER: add ref staleness paragraph to snapshot system docs - CONTRIBUTING: update eval tool descriptions with commentary feature - README: fix missing qa-only in project-local uninstall command Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add user-facing benefit descriptions to v0.4.0 changelog Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 23:55:39 -05:00
Garry Tan	87cb769c35	feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests - 005_sync_heartbeats.sql migration for connectivity testing - eval:trend --team flag pulls team eval data (graceful fallback) - docs/TEAM_SYNC_SETUP.md step-by-step setup guide - Design doc status updated to Phase 2 complete - 10 new tests for sync show formatting functions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 19:43:03 -05:00
Garry Tan	06f2da2019	feat: wire team sync push into ship, retro, qa, and greptile skills Add non-fatal sync steps to all 4 skill templates: - /ship Step 8.5: write ship log JSON + push after PR creation - /retro Step 13: push snapshot after JSON save - /qa Phase 6.7: write qa-sync.json + push after health score - greptile-triage: push each triage entry after history file writes All calls use || true for zero disruption. Silent when sync not configured. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 19:42:54 -05:00
Garry Tan	dc3fcc8611	feat: DRY push functions, add push-greptile + sync test/show commands Extract pushWithSync() helper to eliminate boilerplate across 6 push functions. Add pushHeartbeat() for connectivity testing. Add push-greptile to CLI. New commands: gstack-sync test (validates full push/pull flow via sync_heartbeats table), gstack-sync show (terminal team data dashboard with summary/evals/ships/retros views). Guard main block with import.meta.main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 19:42:45 -05:00
Garry Tan	704fe34e98	docs: clean up sync example, add team sync section to README Remove _comment hacks from JSON example file. Add short team sync section to README explaining what it is, that it's optional, and how to set it up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 17:06:51 -05:00
Garry Tan	14320469b0	docs: CHANGELOG covers full branch scope including team sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 17:05:45 -05:00
Garry Tan	eb7ef2153b	docs: add setup comments to .gstack-sync.json.example Explain what team sync gives you, that it's optional, and how to set it up. Points to TEAM_COORDINATION_STORE.md for full guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 17:04:49 -05:00
Garry Tan	e28033353d	chore: bump v0.3.10, update CHANGELOG and docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 16:55:34 -05:00
Garry Tan	33c9552870	chore: update gitignore Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 16:47:46 -05:00
Garry Tan	daea165333	feat: add eval:trend CLI for per-test pass rate tracking computeTrends() classifies tests as stable-pass/stable-fail/flaky/improving/degrading based on pass rate, flip count, and recent streak. gstack eval trend shows sparkline table with --limit, --tier, --test filters. Guard CLI main block with import.meta.main to prevent execution on import. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 16:47:41 -05:00
Garry Tan	59752fc510	feat: wire eval-cache + eval-tier into LLM judge, pin E2E model callJudge/judge now return {result, meta} with SHA-based caching (~$0.18/run savings when SKILL.md unchanged) and dynamic model selection via EVAL_JUDGE_TIER env var. E2E tests pass --model from EVAL_TIER to claude -p. outcomeJudge retains simple return type. All 8 LLM eval test sites updated with real costs and costs[]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 16:47:35 -05:00
Garry Tan	02925cfc7a	feat: wire costs[] from modelUsage into eval results Extract per-model token usage from resultLine.modelUsage (including cache tokens and exact API cost), flow CostEntry[] through EvalCollector, aggregate in finalize(). Extend CostEntry with cache_read_input_tokens, cache_creation_input_tokens, cost_usd. computeCosts() prefers exact cost_usd over MODEL_PRICING when available (~4x more accurate with prompt caching). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 16:47:27 -05:00
Garry Tan	4ad73f7362	feat: unified gstack eval CLI with list, compare, push, cache, cost - lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch subcommands. Ports logic from 4 separate scripts into unified entry. Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list. - bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern - package.json: eval:* scripts now point to lib/cli-eval.ts - supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS - docs/eval-result-format.md: public format spec for any language - test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess) including 3 push failure modes (file-not-found, invalid schema, sync unavailable) 215 tests passing across 13 files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:36 -05:00
Garry Tan	1f5b7882e6	feat: add SHA-based eval caching with EVAL_CACHE=0 bypass Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys from source file contents + test input via Bun.CryptoHasher SHA256. Supports read/write/stats/clear/verify operations. EVAL_CACHE=0 skips reads for force-rerun. 16 tests including corrupt JSON handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:26 -05:00
Garry Tan	9bc6c9416f	feat: add eval format validation, tier selection, cost tracking - lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(), normalizeFromLegacy/normalizeToLegacy round-trip converters - lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env, tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full) - lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(), formatCostDashboard(), aggregateCosts(), fallback for unknown models - 42 tests across 3 test files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:18 -05:00
Garry Tan	7f7035f55a	feat: add listEvalFiles, loadEvalResults, formatTimestamp to lib/util.ts DRY up eval I/O duplicated across scripts/eval-list.ts, eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant, formatTimestamp(), listEvalFiles(), loadEvalResults() with --limit support. 13 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:09 -05:00
Garry Tan	82e204179b	feat: hook eval-store sync, use shared utils, add 30 lib tests - eval-store.ts: import shared getGitInfo/getVersion, add pushEvalRun() hook in finalize() (non-blocking, non-fatal) - session-runner.ts: import shared atomicWriteSync/sanitizeForFilename - eval-store.test.ts: fix pre-existing bug in double-finalize test (was counting _partial file) - 30 new tests for lib/util, lib/sync-config, lib/sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 02:02:54 -05:00
Garry Tan	f7ae465415	feat: add Supabase migration SQL for team data store - 001_teams.sql: teams + team_members + RLS - 002_eval_runs.sql: eval results with universal format, indexes, upsert key - 003_data_tables.sql: retro, QA, ship, greptile, transcripts + RLS All tables use RLS: team members read/insert, admins delete. Transcript table has tighter policy (admin-only read). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 02:02:47 -05:00
Garry Tan	3713c3b9b9	feat: add team sync infrastructure (config, auth, push/pull, CLI) - lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json - lib/auth.ts: device auth flow (browser OAuth, local HTTP callback) - lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache - lib/cli-sync.ts: CLI handler for gstack-sync commands - bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain) - .gstack-sync.json.example: template for team setup Zero new dependencies — uses raw fetch() against PostgREST API. All sync is non-fatal with 5s timeout and offline queue fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 02:02:40 -05:00
Garry Tan	caed287496	feat: extract shared utilities into lib/util.ts DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug, and sanitizeForFilename from eval-store.ts, session-runner.ts, and eval-watch.ts into a shared module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 02:02:32 -05:00
Garry Tan	5c1ea088d8	docs: scrub proprietary refs, close eval format gaps, integrate gstack-config - Replace project-specific references with generic language - Add missing fields to eval result format: prompt_sha, by_category, timestamp, response_preview - Enrich failure format with details array, scores dict, expectation_type - Add EVAL_JUDGE_CACHE, EVAL_VERBOSE, multiprocess worker support, dedup on push, run scopes, model aliases, judge profiles - Restructure credential storage to 4 layers with gstack-config (v0.3.9) for user preferences (sync_enabled, sync_transcripts) - Update integration points, observability, and reuse map Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 01:47:30 -05:00
Garry Tan	89311653df	Merge remote-tracking branch 'origin/main' into garrytan/team-supabase-store	2026-03-15 01:32:11 -05:00
Garry Tan	f87bc21865	docs: add team coordination store design doc Design doc for Supabase-backed team data store and universal eval infrastructure. Covers architecture, credential storage, eval formats, YAML test case spec, Supabase schema, phased rollout, and security model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 01:32:06 -05:00
Garry Tan	bb46ca6b21	feat: smart update check with auto-upgrade, snooze backoff, config CLI (v0.3.9) (#62) * feat: add bin/gstack-config CLI for reading/writing ~/.gstack/config.yaml Simple get/set/list interface for persistent gstack configuration. Used by update-check and upgrade skill for auto_upgrade and update_check settings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: smart update check with 12h cache, snooze backoff, config disable - Reduce cache TTL from 24h to 12h for faster update detection - Add exponential snooze backoff: 24h → 48h → 1 week (resets on new version) - Add update_check: false config option to disable checks entirely - Clear snooze file on just-upgraded - 14 new tests covering snooze levels, expiry, corruption, and config paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: upgrade skill with auto-upgrade, 4-option prompt, vendored sync - Auto-upgrade mode via config or GSTACK_AUTO_UPGRADE=1 env var - 4-option AskUserQuestion: upgrade once, always, not now, never - Step 4.5: sync local vendored copy after upgrading primary install - Snooze write with escalating backoff on "Not now" - Update preamble text in gen-skill-docs for new upgrade flow - Regenerate all SKILL.md files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: simplify upgrade instructions, move auto-upgrade to completed README now points to /gstack-upgrade instead of long paste commands. Auto-upgrade TODO moved to Completed section (v0.3.8). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.3.9) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 23:28:02 -07:00
Garry Tan	41141007c1	feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61) * fix: log non-ENOENT errors in ensureStateDir() instead of silently swallowing Replace bare catch {} with ENOENT-only silence. Non-ENOENT errors (EACCES, ENOSPC) are now logged to .gstack/browse-server.log. Includes test for permission-denied scenario with chmod 444. * feat: merge TODO.md + TODOS.md into unified backlog with shared format reference Merge TODO.md (roadmap) and TODOS.md (near-term) into one file organized by skill/component with P0-P4 priority ordering and Completed section. Add shared review/TODOS-format.md for canonical format. Add static validation tests. * feat: add 2-tier Greptile reply system with escalation detection Add reply templates (Tier 1 friendly, Tier 2 firm), explicit escalation detection algorithm, and severity re-ranking guidance to greptile-triage.md. * feat: cross-skill TODOS awareness + Greptile template refs in all skills /ship Step 5.5: auto-detect completed TODOs, offer reorganization. /review Step 5.5: cross-reference PR against open TODOs. /plan-ceo-review, /plan-eng-review: TODOS context in planning. /retro: Backlog Health metric. /qa: bug TODO context in diff-aware mode. All Greptile-aware skills now reference reply templates and escalation detection. * chore: bump version and changelog (v0.3.8) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update CONTRIBUTING.md for v0.3.8 changes Clarify test tier cost table (Tier 3 standalone vs combined), add TODOS.md to "Things to know", mention Greptile triage in ship workflow description. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 20:15:11 -07:00
Garry Tan	2aa745cb0e	feat: screenshot element/region clipping (v0.3.7) (#56) * feat: screenshot element/region clipping (--clip, --viewport, CSS/@ref) Add element crop (CSS selector or @ref), region clip (--clip x,y,w,h), and viewport-only (--viewport) modes to the screenshot command. Uses Playwright's native locator.screenshot() and page.screenshot({ clip }). Full page remains the default. Includes 10 new tests covering all modes and error paths. * chore: bump version and changelog (v0.3.7) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add screenshot modes to BROWSER.md command reference Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:47:42 -07:00
Garry Tan	0ac7ef4e81	fix: harden planted-bug eval prompt for reliable form testing Phase 3 was too vague ("click every nav link") causing the agent to wander instead of systematically testing form fields. Now explicitly directs: fill every input, clear it, try invalid values, submit and check console. Added Phase 4 finalize step to ensure report is updated with all findings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 13:28:18 -05:00
Garry Tan	7d26666164	Merge pull request #55 from garrytan/v0.3.6-qa-upgrades feat: E2E observability + eval infrastructure + all skills templated	2026-03-14 11:24:24 -07:00
Garry Tan	baf8acd55c	fix: update check ignores stale UP_TO_DATE cache after version change The UP_TO_DATE cache path exited immediately without checking if the cached version still matched the local VERSION. After upgrading (e.g. 0.3.3 → 0.3.4), the cache still said "UP_TO_DATE 0.3.3" and the script never re-checked against remote — so updates were invisible until the 24h cache expired. Now both UP_TO_DATE and UPGRADE_AVAILABLE verify cached version vs local before trusting the cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 13:23:25 -05:00
Garry Tan	4e31acbd47	fix: auto-clear stale heartbeat when process is dead Add PID to heartbeat file. eval-watch checks process.kill(pid, 0) and auto-deletes the heartbeat when the PID is no longer alive — no manual cleanup needed after crashed/killed E2E runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:55:40 -05:00
Garry Tan	43fbe165a4	docs: update README, CONTRIBUTING, ARCHITECTURE for v0.3.6 Update test tier costs and commands (Agent SDK → claude -p, SKILL_E2E → EVALS), add E2E observability section to CONTRIBUTING and ARCHITECTURE, add testing quick-start to README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:47:00 -05:00
Garry Tan	4ace0c2f6f	chore: bump version and changelog (v0.3.6) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:44:41 -05:00
Garry Tan	9f5aa32e67	fix: fail fast on API connectivity — pre-check before E2E suite Spawn a quick claude -p ping before running 13 tests. If the Anthropic API is unreachable (ConnectionRefused), throw immediately instead of burning through the entire suite with silent false passes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:37:44 -05:00
Garry Tan	5aae3ce117	fix: never clean up observability artifacts — partial file persists after finalize Removing the _partial-e2e.json deletion from finalize(). These are small files on a local disk and their persistence is the whole point of observability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:37:38 -05:00
Garry Tan	336dbaa50d	fix: detect is_error from claude -p result line (ConnectionRefused was PASS) claude -p can return subtype="success" with is_error=true when the API is unreachable. Previously we only checked subtype, so API failures silently passed. Now check is_error first and report as 'error_api'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:35:43 -05:00
Garry Tan	029a7c2a37	feat: eval-watch dashboard + observability unit tests (15 tests, 11 codepaths) eval-watch: live terminal dashboard reads heartbeat + partial file every 1s, shows completed/running tests, stale detection (>10min), --tail flag for progress.log tail. Pure renderDashboard() function for testability. observability.test.ts: unit tests for sanitizeTestName, heartbeat schema, progress.log format, NDJSON file naming, savePartial() with _partial flag, finalize() cleanup, diagnostic fields, watcher rendering, stale detection, and non-fatal I/O guarantees. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 11:04:40 -05:00
Garry Tan	510a8d8dda	feat: wire runId + testName + diagnostics through all E2E tests Generate per-session runId, pass testName + runId to every runSkillTest() call, wire exit_reason/timeout_at_turn/last_tool_call through recordE2E(). Add eval:watch script entry to package.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 11:04:28 -05:00
Garry Tan	f9cfabeda8	feat: add E2E observability — heartbeat, progress.log, NDJSON persistence, savePartial() session-runner: atomic heartbeat file (e2e-live.json), per-run log directory (~/.gstack-dev/e2e-runs/{runId}/), progress.log + per-test NDJSON persistence, failure transcripts to persistent run dir instead of tmpdir. eval-store: 3 new diagnostic fields (exit_reason, timeout_at_turn, last_tool_call), savePartial() writes _partial-e2e.json after each addTest() for crash resilience, finalize() cleans up partial file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 11:04:16 -05:00
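
Commit a05a670189 in the log above describes auto-wrapping await expressions in an async IIFE for the js and eval commands. The TypeScript below is a minimal sketch of that idea, not the gstack source: only hasAwait is named in the commit message, so stripComments, wrapForJs, and wrapForEval are hypothetical helper names, and the regex-based comment stripping is an assumption that may differ from the real implementation.

```typescript
// Illustrative sketch only; not the gstack source.

// Naive comment stripping (assumption): it does not understand string
// literals, so a "//" inside a string is also treated as a comment start.
function stripComments(code: string): string {
  return code
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\/\/.*$/gm, "");        // line comments
}

// True when the word `await` appears outside comments.
function hasAwait(code: string): boolean {
  return /\bawait\b/.test(stripComments(code));
}

// js command: treat the input as a single expression.
function wrapForJs(expr: string): string {
  return hasAwait(expr) ? `(async()=>(${expr}))()` : expr;
}

// eval command: single-line input is wrapped as an expression,
// multi-line input as a block (which then needs an explicit `return`).
function wrapForEval(code: string): string {
  if (!hasAwait(code)) return code;
  const trimmed = code.trim();
  return trimmed.includes("\n")
    ? `(async()=>{\n${trimmed}\n})()`
    : `(async()=>(${trimmed}))()`;
}
```

Under these assumptions, `wrapForJs('await fetch("/api")')` yields `(async()=>(await fetch("/api")))()`, while input without `await` passes through unchanged. A single-line input ending in a semicolon would not form a valid expression in this sketch; a fuller version would need to handle that case.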


84 Commits