Brings in 55 commits from main (v0.12.x–v0.13.5.0): Factory Droid compat,
prompt injection defense, user sovereignty, security audit, design binary,
skill namespacing, modular resolvers, Chrome sidebar, and more.
Conflict resolution:
- .agents/ SKILL.md files: deleted (main moved to .factory/)
- 8 .tmpl templates: accepted main (new features: CDP mode, design tools,
global retro, parallelization, distribution checks, plan audits)
- scripts/gen-skill-docs.ts: accepted main's modular resolver refactor
- test/helpers/session-runner.ts: accepted main + layered back CostEntry
tracking from team branch
- Generated SKILL.md files: regenerated via bun run gen:skill-docs
- Updated tests to match main's gstack-slug output (2 lines, no PROJECTS_DIR)
and review log mechanism (gstack-review-log, not $BRANCH.jsonl)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: extract gen-skill-docs into modular resolver architecture
Break the 3000-line monolith into 10 domain modules under scripts/resolvers/:
types, constants, preamble, utility, browse, design, testing, review,
codex-helpers, and index. Each module owns one domain of template generation.
The preamble module introduces a 4-tier composition system (T1-T4) so skills
only pay for the preamble sections they actually need, reducing token usage
for lightweight skills by ~40%.
Adds a token budget dashboard that prints after every generation run showing
per-skill and total token counts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: tiered preamble — skills only pay for what they use
Tag all 23 templates with preamble-tier (T1-T4). Lightweight skills
like /browse and /benchmark get a minimal preamble (~40% fewer tokens),
while review skills get the full stack. Regenerate all SKILL.md files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: migrate eval storage to project-scoped paths
Move eval results and E2E run artifacts from ~/.gstack-dev/evals/ to
~/.gstack/projects/$SLUG/evals/ so each project's eval history lives
alongside its other gstack data. Falls back to legacy path if slug
detection fails.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: sync package.json version with VERSION after merge
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add WorktreeManager for isolated test environments
Reusable platform module (lib/worktree.ts) that creates git worktrees
for test isolation and harvests useful changes as patches. Includes
SHA-256 dedup, original SHA tracking for committed change detection,
and automatic gitignored artifact copying (.agents/, browse/dist/).
12 unit tests covering lifecycle, harvest, dedup, and error handling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: integrate worktree isolation into E2E test infrastructure
Add createTestWorktree(), harvestAndCleanup(), and describeWithWorktree()
helpers to e2e-helpers.ts. Add harvest field to EvalTestEntry for
eval-store integration. Register lib/worktree.ts as a global touchfile.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: run Gemini and Codex E2E tests in worktrees
Switch both test suites from cwd: ROOT to worktree isolation.
Gemini (--yolo) no longer pollutes the working tree. Codex
(read-only) gets worktree for consistency. Useful changes are
harvested as patches for cherry-picking.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: skip symlinks in copyDirSync to prevent infinite recursion
Adversarial review caught that .claude/skills/gstack may be a symlink
back to the repo root, causing copyDirSync to recurse infinitely
when copying gitignored artifacts into worktrees.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: bump version and changelog (v0.11.12.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: relax session-awareness assertion to accept structured options
The LLM consistently presents well-formatted A/B choices with pros/cons
but doesn't always use the exact string "RECOMMENDATION". Accept
case-insensitive "recommend", "option a", "which do you want", or
"which approach" as equivalent signals of a structured recommendation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 from Codex review: gstack eval push copied standard-format JSON
verbatim to ~/.gstack-dev/evals/, but eval:summary and eval:trend
expect legacy fields (branch, total_tests, tests). Now uses
normalizeToLegacy() before writing locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 from Codex review: interactive auth saved team_id: '' making all
subsequent sync operations fail. Now resolves team_id from team_members
table immediately after OAuth callback.
Also fixes token refresh in sync.ts to preserve the existing team_id
instead of resetting it to empty, and removes order=created_at.desc
from pullTable() default query since sync_heartbeats and team_members
tables don't have that column (P2).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New CLI for inspecting project artifacts:
gstack projects ls — list all projects with artifact counts and sizes
gstack projects show — detailed view of one project with manifest entries
gstack projects clean — find and remove E2E test garbage directories
Reads .manifest.jsonl when available for richer output.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lib/upload.ts provides uploadScreenshot() that uploads to Supabase
Storage and returns a public CDN URL. Falls back gracefully to local
path with stderr warning on any failure (no config, expired auth,
network error). Exit code 0 always — never breaks calling templates.
bin/gstack-upload is a thin bash wrapper for CLI use.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gstack-slug now outputs PROJECTS_DIR (respecting GSTACK_STATE_DIR env var).
lib/util.ts gets getProjectsDir(slug?) as single source of truth for
TypeScript consumers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes mixed-case slugs like Garry-s-List-garryslist by adding
tr '[:upper:]' '[:lower:]' to bash and .toLowerCase() to TypeScript.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add cross-reference comments between dashboard-queries.ts computeLeaderboard()
and dashboard/ui.ts renderLeaderboard() so maintainers know to update both
- Add security note in setup-team-dashboard about service-role-key visibility
in pg_cron job table
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Export getValidToken from sync.ts (was private)
- cli-team.ts now uses sync.ts version (supports auto-refresh, was missing)
- Remove unused isTokenExpired/getAuthTokens imports from cli-team
- Remove "Joined" column from formatMembersTable (team_members has no created_at)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
From CEO plan review:
- Edge functions: early guard on missing env vars instead of non-null assert crash
- cli-team: wire isTokenExpired check (was imported but unused)
- Migration 007: CHECK constraint on team slug (a-z0-9 hyphens, 2-50 chars)
- Dashboard: streak badges on leaderboard, repo slug in who's-online,
contextual empty states that teach, 60s refresh (was 30s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New `gstack team` CLI with create, members, set subcommands.
Migration adds team_settings (admin-only), alert_cooldowns (edge-fn
dedup), and create_team() SECURITY DEFINER RPC for atomic team +
first member creation. 9 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New `gstack eval leaderboard` subcommand pulls team data and renders
weekly stats per contributor. Refactored formatTeamSummary to use
computeVelocity from dashboard-queries (DRY). 4 new tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 functions: detectRegressions, computeVelocity, computeCostTrend,
computeLeaderboard, computeQATrend, computeEvalTrend. All pure,
no I/O, with division-by-zero guards. 28 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cli-sync.ts: push-transcript command, show sessions with formatSessionTable(),
upgrade cmdSetup() to interactively create .gstack-sync.json if missing
- bin/gstack-sync: add push-transcript case and help text
- test/lib-llm-summarize.test.ts: 10 tests with mocked fetch (429 retry,
5xx backoff, malformed response, no API key, cache)
- test/lib-transcript-sync.test.ts: 22 tests for parsing, grouping,
session file extraction, marker management, slug resolution
- test/lib-sync-show.test.ts: 4 tests for formatSessionTable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add session intelligence pipeline for team transcript sync:
- lib/transcript-sync.ts: parse history.jsonl, enrich with Claude session
file data (tools_used, full turn count), sync marker management,
10-concurrent push with 5-concurrent Haiku summarization
- lib/llm-summarize.ts: raw fetch() to Anthropic Messages API (no SDK dep),
retry-after on 429, exponential backoff on 5xx, SHA-based eval-cache
- lib/sync.ts: pushTranscript() and pullTranscripts() following existing patterns
- 006_transcript_sync.sql: unique index on (team_id, session_id) for
idempotent upsert, RLS changed from admin-only to team-wide read
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 005_sync_heartbeats.sql migration for connectivity testing
- eval:trend --team flag pulls team eval data (graceful fallback)
- docs/TEAM_SYNC_SETUP.md step-by-step setup guide
- Design doc status updated to Phase 2 complete
- 10 new tests for sync show formatting functions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract pushWithSync() helper to eliminate boilerplate across 6 push
functions. Add pushHeartbeat() for connectivity testing. Add push-greptile
to CLI. New commands: gstack-sync test (validates full push/pull flow
via sync_heartbeats table), gstack-sync show (terminal team data
dashboard with summary/evals/ships/retros views). Guard main block
with import.meta.main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
computeTrends() classifies tests as stable-pass/stable-fail/flaky/
improving/degrading based on pass rate, flip count, and recent streak.
gstack eval trend shows sparkline table with --limit, --tier, --test
filters. Guard CLI main block with import.meta.main to prevent
execution on import.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract per-model token usage from resultLine.modelUsage (including
cache tokens and exact API cost), flow CostEntry[] through EvalCollector,
aggregate in finalize(). Extend CostEntry with cache_read_input_tokens,
cache_creation_input_tokens, cost_usd. computeCosts() prefers exact
cost_usd over MODEL_PRICING when available (~4x more accurate with
prompt caching).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch
subcommands. Ports logic from 4 separate scripts into unified entry.
Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list.
- bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern
- package.json: eval:* scripts now point to lib/cli-eval.ts
- supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS
- docs/eval-result-format.md: public format spec for any language
- test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess)
including 3 push failure modes (file-not-found, invalid schema,
sync unavailable)
215 tests passing across 13 files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys
from source file contents + test input via Bun.CryptoHasher SHA256.
Supports read/write/stats/clear/verify operations. EVAL_CACHE=0
skips reads for force-rerun. 16 tests including corrupt JSON handling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up eval I/O duplicated across scripts/eval-list.ts,
eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant,
formatTimestamp(), listEvalFiles(), loadEvalResults() with
--limit support. 13 new tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json
- lib/auth.ts: device auth flow (browser OAuth, local HTTP callback)
- lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache
- lib/cli-sync.ts: CLI handler for gstack-sync commands
- bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain)
- .gstack-sync.json.example: template for team setup
Zero new dependencies — uses raw fetch() against PostgREST API.
All sync is non-fatal with 5s timeout and offline queue fallback.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug,
and sanitizeForFilename from eval-store.ts, session-runner.ts, and
eval-watch.ts into a shared module.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>