gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-05-06 05:35:46 +02:00

Author	SHA1	Message	Date
Garry Tan	15e6d9d8f1	Merge branch 'main' into garrytan/team-supabase-store Brings in 55 commits from main (v0.12.x–v0.13.5.0): Factory Droid compat, prompt injection defense, user sovereignty, security audit, design binary, skill namespacing, modular resolvers, Chrome sidebar, and more. Conflict resolution: - .agents/ SKILL.md files: deleted (main moved to .factory/) - 8 .tmpl templates: accepted main (new features: CDP mode, design tools, global retro, parallelization, distribution checks, plan audits) - scripts/gen-skill-docs.ts: accepted main's modular resolver refactor - test/helpers/session-runner.ts: accepted main + layered back CostEntry tracking from team branch - Generated SKILL.md files: regenerated via bun run gen:skill-docs - Updated tests to match main's gstack-slug output (2 lines, no PROJECTS_DIR) and review log mechanism (gstack-review-log, not $BRANCH.jsonl) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 15:12:12 -07:00
Garry Tan	dc5e0538e5	feat: worktree isolation for E2E tests + infrastructure elegance (v0.11.12.0) (#425) * refactor: extract gen-skill-docs into modular resolver architecture Break the 3000-line monolith into 10 domain modules under scripts/resolvers/: types, constants, preamble, utility, browse, design, testing, review, codex-helpers, and index. Each module owns one domain of template generation. The preamble module introduces a 4-tier composition system (T1-T4) so skills only pay for the preamble sections they actually need, reducing token usage for lightweight skills by ~40%. Adds a token budget dashboard that prints after every generation run showing per-skill and total token counts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: tiered preamble — skills only pay for what they use Tag all 23 templates with preamble-tier (T1-T4). Lightweight skills like /browse and /benchmark get a minimal preamble (~40% fewer tokens), while review skills get the full stack. Regenerate all SKILL.md files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: migrate eval storage to project-scoped paths Move eval results and E2E run artifacts from ~/.gstack-dev/evals/ to ~/.gstack/projects/$SLUG/evals/ so each project's eval history lives alongside its other gstack data. Falls back to legacy path if slug detection fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sync package.json version with VERSION after merge Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add WorktreeManager for isolated test environments Reusable platform module (lib/worktree.ts) that creates git worktrees for test isolation and harvests useful changes as patches. Includes SHA-256 dedup, original SHA tracking for committed change detection, and automatic gitignored artifact copying (.agents/, browse/dist/). 12 unit tests covering lifecycle, harvest, dedup, and error handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate worktree isolation into E2E test infrastructure Add createTestWorktree(), harvestAndCleanup(), and describeWithWorktree() helpers to e2e-helpers.ts. Add harvest field to EvalTestEntry for eval-store integration. Register lib/worktree.ts as a global touchfile. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: run Gemini and Codex E2E tests in worktrees Switch both test suites from cwd: ROOT to worktree isolation. Gemini (--yolo) no longer pollutes the working tree. Codex (read-only) gets worktree for consistency. Useful changes are harvested as patches for cherry-picking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: skip symlinks in copyDirSync to prevent infinite recursion Adversarial review caught that .claude/skills/gstack may be a symlink back to the repo root, causing copyDirSync to recurse infinitely when copying gitignored artifacts into worktrees. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.11.12.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: relax session-awareness assertion to accept structured options The LLM consistently presents well-formatted A/B choices with pros/cons but doesn't always use the exact string "RECOMMENDATION". Accept case-insensitive "recommend", "option a", "which do you want", or "which approach" as equivalent signals of a structured recommendation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 23:05:22 -07:00
Garry Tan	468c5eb55f	fix: normalize StandardEvalResult to legacy format before local save P1 from Codex review: gstack eval push copied standard-format JSON verbatim to ~/.gstack-dev/evals/, but eval:summary and eval:trend expect legacy fields (branch, total_tests, tests). Now uses normalizeToLegacy() before writing locally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 01:28:19 -07:00
Garry Tan	7808ee380b	fix: resolve team_id during auth and preserve across token refresh P1 from Codex review: interactive auth saved team_id: '' making all subsequent sync operations fail. Now resolves team_id from team_members table immediately after OAuth callback. Also fixes token refresh in sync.ts to preserve the existing team_id instead of resetting it to empty, and removes order=created_at.desc from pullTable() default query since sync_heartbeats and team_members tables don't have that column (P2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 01:28:15 -07:00
Garry Tan	624e4f234a	feat: add gstack projects ls CLI command New CLI for inspecting project artifacts: gstack projects ls — list all projects with artifact counts and sizes gstack projects show — detailed view of one project with manifest entries gstack projects clean — find and remove E2E test garbage directories Reads .manifest.jsonl when available for richer output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 23:50:30 -07:00
Garry Tan	e943c82e67	feat: add gstack-upload helper for Supabase Storage lib/upload.ts provides uploadScreenshot() that uploads to Supabase Storage and returns a public CDN URL. Falls back gracefully to local path with stderr warning on any failure (no config, expired auth, network error). Exit code 0 always — never breaks calling templates. bin/gstack-upload is a thin bash wrapper for CLI use. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 23:49:38 -07:00
Garry Tan	b332364b43	refactor: add PROJECTS_DIR to gstack-slug and getProjectsDir() to util.ts gstack-slug now outputs PROJECTS_DIR (respecting GSTACK_STATE_DIR env var). lib/util.ts gets getProjectsDir(slug?) as single source of truth for TypeScript consumers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 23:45:27 -07:00
Garry Tan	07aad3562f	refactor: lowercase slug in gstack-slug and getRemoteSlug() Fixes mixed-case slugs like Garry-s-List-garryslist by adding tr '[:upper:]' '[:lower:]' to bash and .toLowerCase() to TypeScript. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 23:45:16 -07:00
Garry Tan	238e89db9a	docs: cross-reference leaderboard duplication, service-role-key warning - Add cross-reference comments between dashboard-queries.ts computeLeaderboard() and dashboard/ui.ts renderLeaderboard() so maintainers know to update both - Add security note in setup-team-dashboard about service-role-key visibility in pg_cron job table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 19:12:49 -07:00
Garry Tan	4093c5e031	fix: DRY getValidToken — cli-team delegates to sync.ts, remove phantom Joined column - Export getValidToken from sync.ts (was private) - cli-team.ts now uses sync.ts version (supports auto-refresh, was missing) - Remove unused isTokenExpired/getAuthTokens imports from cli-team - Remove "Joined" column from formatMembersTable (team_members has no created_at) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 19:12:44 -07:00
Garry Tan	721abce5a5	fix: review-driven hardening — env guards, token expiry, slug validation, dashboard UX From CEO plan review: - Edge functions: early guard on missing env vars instead of non-null assert crash - cli-team: wire isTokenExpired check (was imported but unused) - Migration 007: CHECK constraint on team slug (a-z0-9 hyphens, 2-50 chars) - Dashboard: streak badges on leaderboard, repo slug in who's-online, contextual empty states that teach, 60s refresh (was 30s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 09:59:20 -05:00
Garry Tan	46c82ce8ec	feat: add team admin CLI + migration 007 (settings, cooldowns, create_team RPC) New `gstack team` CLI with create, members, set subcommands. Migration adds team_settings (admin-only), alert_cooldowns (edge-fn dedup), and create_team() SECURITY DEFINER RPC for atomic team + first member creation. 9 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:44:24 -05:00
Garry Tan	4985c8e7e9	feat: add CLI leaderboard, refactor formatTeamSummary to use dashboard-queries New `gstack eval leaderboard` subcommand pulls team data and renders weekly stats per contributor. Refactored formatTeamSummary to use computeVelocity from dashboard-queries (DRY). 4 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:44:12 -05:00
Garry Tan	e969c6dadf	feat: add dashboard query functions — pure transforms for team analytics 6 functions: detectRegressions, computeVelocity, computeCostTrend, computeLeaderboard, computeQATrend, computeEvalTrend. All pure, no I/O, with division-by-zero guards. 28 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 02:43:52 -05:00
Garry Tan	a104471272	feat: add push-transcript CLI, show sessions, interactive setup, 36 tests - cli-sync.ts: push-transcript command, show sessions with formatSessionTable(), upgrade cmdSetup() to interactively create .gstack-sync.json if missing - bin/gstack-sync: add push-transcript case and help text - test/lib-llm-summarize.test.ts: 10 tests with mocked fetch (429 retry, 5xx backoff, malformed response, no API key, cache) - test/lib-transcript-sync.test.ts: 22 tests for parsing, grouping, session file extraction, marker management, slug resolution - test/lib-sync-show.test.ts: 4 tests for formatSessionTable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 00:15:26 -05:00
Garry Tan	0e29d7d1a3	feat: add enriched transcript sync — Haiku summaries, session file enrichment Add session intelligence pipeline for team transcript sync: - lib/transcript-sync.ts: parse history.jsonl, enrich with Claude session file data (tools_used, full turn count), sync marker management, 10-concurrent push with 5-concurrent Haiku summarization - lib/llm-summarize.ts: raw fetch() to Anthropic Messages API (no SDK dep), retry-after on 429, exponential backoff on 5xx, SHA-based eval-cache - lib/sync.ts: pushTranscript() and pullTranscripts() following existing patterns - 006_transcript_sync.sql: unique index on (team_id, session_id) for idempotent upsert, RLS changed from admin-only to team-wide read Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 00:15:19 -05:00
Garry Tan	87cb769c35	feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests - 005_sync_heartbeats.sql migration for connectivity testing - eval:trend --team flag pulls team eval data (graceful fallback) - docs/TEAM_SYNC_SETUP.md step-by-step setup guide - Design doc status updated to Phase 2 complete - 10 new tests for sync show formatting functions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 19:43:03 -05:00
Garry Tan	dc3fcc8611	feat: DRY push functions, add push-greptile + sync test/show commands Extract pushWithSync() helper to eliminate boilerplate across 6 push functions. Add pushHeartbeat() for connectivity testing. Add push-greptile to CLI. New commands: gstack-sync test (validates full push/pull flow via sync_heartbeats table), gstack-sync show (terminal team data dashboard with summary/evals/ships/retros views). Guard main block with import.meta.main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 19:42:45 -05:00
Garry Tan	daea165333	feat: add eval:trend CLI for per-test pass rate tracking computeTrends() classifies tests as stable-pass/stable-fail/flaky/improving/degrading based on pass rate, flip count, and recent streak. gstack eval trend shows sparkline table with --limit, --tier, --test filters. Guard CLI main block with import.meta.main to prevent execution on import. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 16:47:41 -05:00
Garry Tan	02925cfc7a	feat: wire costs[] from modelUsage into eval results Extract per-model token usage from resultLine.modelUsage (including cache tokens and exact API cost), flow CostEntry[] through EvalCollector, aggregate in finalize(). Extend CostEntry with cache_read_input_tokens, cache_creation_input_tokens, cost_usd. computeCosts() prefers exact cost_usd over MODEL_PRICING when available (~4x more accurate with prompt caching). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 16:47:27 -05:00
Garry Tan	4ad73f7362	feat: unified gstack eval CLI with list, compare, push, cache, cost - lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch subcommands. Ports logic from 4 separate scripts into unified entry. Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list. - bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern - package.json: eval:* scripts now point to lib/cli-eval.ts - supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS - docs/eval-result-format.md: public format spec for any language - test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess) including 3 push failure modes (file-not-found, invalid schema, sync unavailable) 215 tests passing across 13 files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:36 -05:00
Garry Tan	1f5b7882e6	feat: add SHA-based eval caching with EVAL_CACHE=0 bypass Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys from source file contents + test input via Bun.CryptoHasher SHA256. Supports read/write/stats/clear/verify operations. EVAL_CACHE=0 skips reads for force-rerun. 16 tests including corrupt JSON handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:26 -05:00
Garry Tan	9bc6c9416f	feat: add eval format validation, tier selection, cost tracking - lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(), normalizeFromLegacy/normalizeToLegacy round-trip converters - lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env, tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full) - lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(), formatCostDashboard(), aggregateCosts(), fallback for unknown models - 42 tests across 3 test files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:18 -05:00
Garry Tan	7f7035f55a	feat: add listEvalFiles, loadEvalResults, formatTimestamp to lib/util.ts DRY up eval I/O duplicated across scripts/eval-list.ts, eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant, formatTimestamp(), listEvalFiles(), loadEvalResults() with --limit support. 13 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 09:39:09 -05:00
Garry Tan	3713c3b9b9	feat: add team sync infrastructure (config, auth, push/pull, CLI) - lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json - lib/auth.ts: device auth flow (browser OAuth, local HTTP callback) - lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache - lib/cli-sync.ts: CLI handler for gstack-sync commands - bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain) - .gstack-sync.json.example: template for team setup Zero new dependencies — uses raw fetch() against PostgREST API. All sync is non-fatal with 5s timeout and offline queue fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 02:02:40 -05:00
Garry Tan	caed287496	feat: extract shared utilities into lib/util.ts DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug, and sanitizeForFilename from eval-store.ts, session-runner.ts, and eval-watch.ts into a shared module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 02:02:32 -05:00
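
Commit 1f5b7882e6 in the log above describes SHA-based eval caching: keys are computed from source file contents plus the test input, results live at ~/.gstack/eval-cache/{suite}/{sha}.json, and EVAL_CACHE=0 skips reads. The sketch below illustrates that idea only; it is not the repository's code. The function names computeCacheKey, cachePath, and shouldReadCache are hypothetical, the length-prefixing scheme is an assumption, and node:crypto's createHash is swapped in for Bun.CryptoHasher so the example also runs under Node.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a SHA-256 cache key from source file contents
// plus the test input. The commit mentions Bun.CryptoHasher; createHash from
// node:crypto is used here as a portable stand-in.
function computeCacheKey(sourceContents: string[], testInput: unknown): string {
  const hasher = createHash("sha256");
  for (const contents of sourceContents) {
    // Assumption: length-prefix each part so ["ab", "c"] and ["a", "bc"]
    // do not collapse into the same byte stream.
    hasher.update(`${contents.length}:`);
    hasher.update(contents);
  }
  // JSON.stringify(undefined) returns undefined, so fall back to "null".
  hasher.update(JSON.stringify(testInput) ?? "null");
  return hasher.digest("hex");
}

// Hypothetical path builder following the layout named in the commit message:
// ~/.gstack/eval-cache/{suite}/{sha}.json
function cachePath(home: string, suite: string, sha: string): string {
  return `${home}/.gstack/eval-cache/${suite}/${sha}.json`;
}

// Per the commit message, EVAL_CACHE=0 skips cache reads to force a re-run.
function shouldReadCache(env: Record<string, string | undefined>): boolean {
  return env.EVAL_CACHE !== "0";
}
```

Because the key covers both the sources and the input, editing either one yields a new key and therefore a cache miss, which is the behavior the commit message implies.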

26 Commits