Commit Graph

100 Commits

Author SHA1 Message Date
Garry Tan 2670c96040 merge: integrate origin/main (v0.4.3-v0.5.0) into team-supabase-store
Resolves conflict in scripts/gen-skill-docs.ts (keep both setup-team-sync
and new design/document-release skill templates).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 08:29:30 -07:00
Garry Tan c8c2cbba33 docs: add /design-consultation skill to README (#127)
The skill was fully implemented but completely absent from the README.
Add it to the skill table, write a detailed section with usage example,
and include it in install/uninstall instructions.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 08:10:01 -05:00
Garry Tan 4a77cc2c34 feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102)
* feat: add {{DESIGN_METHODOLOGY}} resolver and register design review skills

Add generateDesignMethodology() to gen-skill-docs.ts with 10-category, 80-item
design audit checklist. Register plan-design-review and qa-design-review templates
in findTemplates(). Add both skills to skill-check.ts SKILL_FILES. Add command
and snapshot flag validation tests for both skills in skill-validation.test.ts.

* feat: add /plan-design-review and /qa-design-review skills

/plan-design-review: report-only designer audit with letter grades, AI slop
scoring, structured first impression, design system extraction, DESIGN.md
inference and export offer. Never modifies code.

/qa-design-review: same audit, then iterative fix loop with style(design):
commits, CSS-safe WTF heuristic, before/after screenshots, final re-audit.

* chore: bump version and changelog (v0.5.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update README, ARCHITECTURE for design review skills (v0.5.0)

- Update skill count to 11, add /plan-design-review and /qa-design-review
  to skill table, install/uninstall commands, and demo walkthrough
- Add narrative sections: "senior designer mode" and "designer who codes mode"
  with compelling examples showing AI Slop detection and design system inference
- Add {{DESIGN_METHODOLOGY}} to ARCHITECTURE.md placeholder table
- Extend demo to show full plan→eng→review→ship→qa→design-review pipeline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: regenerate design review SKILL.md files after merge from main

Picks up BASE_BRANCH_DETECT resolver and updated contributor mode from main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /design-consultation skill — design consultant that creates DESIGN.md

6-phase consultant flow: product context → competitive research (WebSearch) →
complete coherent proposal → drill-downs on demand → font+color preview page →
write DESIGN.md + update CLAUDE.md. Opinionated recommendations grounded in
product context, not menu-driven forms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add E2E tests for design skill family (7 tests + LLM quality judge)

Tests 1-4: /design-consultation (core flow, research integration, existing
DESIGN.md handling, font+color preview generation).
Tests 5-6: /plan-design-review (audit report, DESIGN.md export).
Test 7: /qa-design-review (audit + fix loop).
LLM judge validates font blacklist compliance, coherence, and AI slop avoidance.
Also adds plan-design-review + qa-design-review to ALL_SKILLS test array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: mark /design-consultation as shipped in TODOS.md

Renamed from /setup-design-md to reflect the consultant approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 21:55:07 -05:00
Garry Tan a30f7079da feat: Fix-First Review — auto-fix obvious issues, ask about hard ones (v0.4.5) (#116)
* feat: Fix-First Review — auto-fix obvious issues, ask about hard ones

Replace the CRITICAL-only AskUserQuestion flow with Fix-First:
- Every finding gets action (not just critical ones)
- AUTO-FIX items (dead code, N+1, stale comments) applied directly
- ASK items (security, race conditions, design decisions) batched
  into at most one AskUserQuestion
- Fix-First Heuristic in checklist.md (single source of truth)
- Gate Classification → Severity Classification rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: polish CHANGELOG v0.4.5 voice — lead with user benefit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:52:50 -05:00
Garry Tan 318ffdbdf0 fix: js statement wrapping + click auto-routes option to selectOption (v0.4.5) (#117)
* fix: js statement wrapping + click auto-routes option to selectOption

Bug 1: js command wrapped all code as expressions — const, semicolons,
and multi-line code broke with SyntaxError. Added needsBlockWrapper()
and wrapForEvaluate() helpers (shared with eval) to detect statements
and use block wrapper {…} instead of expression wrapper (…).

Bug 2: clicking <option> refs hung forever because Playwright can't
.click() native select UI. Click handler now checks ARIA role + DOM
tagName and auto-routes to selectOption() via parent <select>.

Bug 3: click timeouts on <option> elements gave no guidance. Now
throws helpful error: "Use browse select instead of click."

* chore: bump version and changelog (v0.4.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 21:50:43 -05:00
Garry Tan 4e9c30076a fix: rename {{UPDATE_CHECK}} to {{PREAMBLE}} in setup-team-sync template
Aligns with main branch rename. Regenerated stale qa/SKILL.md and
ship/SKILL.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:53 -07:00
Garry Tan 238e89db9a docs: cross-reference leaderboard duplication, service-role-key warning
- Add cross-reference comments between dashboard-queries.ts computeLeaderboard()
  and dashboard/ui.ts renderLeaderboard() so maintainers know to update both
- Add security note in setup-team-dashboard about service-role-key visibility
  in pg_cron job table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:49 -07:00
Garry Tan 4093c5e031 fix: DRY getValidToken — cli-team delegates to sync.ts, remove phantom Joined column
- Export getValidToken from sync.ts (was private)
- cli-team.ts now uses sync.ts version (supports auto-refresh, was missing)
- Remove unused isTokenExpired/getAuthTokens imports from cli-team
- Remove "Joined" column from formatMembersTable (team_members has no created_at)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:44 -07:00
Garry Tan 03e61be488 test: add 24 tests for edge function pure functions
Tests for computePassRate, shouldAlert, formatSlackMessage (regression-alert)
and formatDigestMessage (weekly-digest). Covers null inputs, threshold
boundaries, delta formatting, quiet week fallback. Uses Deno/supabase mock
for bun compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 19:12:40 -07:00
Garry Tan c86faa7968 fix: update check cache — 60min UP_TO_DATE TTL + --force flag (v0.4.4) (#110)
* fix: split update check cache TTL + add --force flag

UP_TO_DATE cache now expires after 60 min (was 720 min / 12 hours).
UPGRADE_AVAILABLE keeps 720 min TTL to keep nagging.

--force flag deletes cache before checking, used by /gstack-upgrade
standalone invocation to always get a fresh result from GitHub.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /gstack-upgrade standalone uses --force for fresh check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 14:14:15 -05:00
Garry Tan a68244ab57 feat: /document-release skill — post-ship doc updates (v0.4.3) (#109)
* docs: update project documentation for v0.4.2

- README: skill count 9→10, added /document-release to skills table,
  install/uninstall sections, and dedicated section with example
- CHANGELOG: added /document-release bullet to v0.4.2 entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /document-release skill with smart VERSION handling

New skill runs after /ship but before PR merge. Reads every doc file,
cross-references the diff, auto-updates factual changes, asks about
risky edits. CHANGELOG clobber protection: never uses Write tool on
CHANGELOG.md, only Edit with exact old_string matches.

Smart VERSION logic: instead of silently skipping already-bumped
versions, compares CHANGELOG entry scope against full diff and asks
if significant uncovered changes exist.

Also fixes gstack-upgrade/SKILL.md missing from skill-check.ts
SKILL_FILES array (existing inconsistency with gen-skill-docs.ts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /review Step 5.6 — documentation staleness check

Review skill now cross-references code changes against doc files.
If a doc describes a feature that changed but the doc wasn't updated,
flags it as INFORMATIONAL with a pointer to /document-release.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: /document-release E2E with CHANGELOG clobber guard

E2E test creates a repo with existing CHANGELOG entries, runs
/document-release, and asserts original entries survive. Critical
guardrail against the incident where an agent replaced CHANGELOG
entries during conflict resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump to v0.4.3 — /document-release skill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files after merge

* chore: regenerate SKILL.md files after merge

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 12:30:22 -05:00
Garry Tan 276d0cc6cb feat: always-on ELI16 + branch detection (v0.4.3) (#108)
* feat: always-on ELI16 + branch detection in preamble

- Add _BRANCH detection to preamble bash block (git branch --show-current)
- Merge ELI16 rules into default AskUserQuestion format (always-on)
- Remove _SESSIONS >= 3 conditional — better questions always
- Add simplification rules: plain English, no jargon, no raw function names
- Update tests for branch detection and simplification regression guard

* chore: bump version and changelog (v0.4.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 12:27:47 -05:00
Garry Tan 8ef73a7508 Merge remote-tracking branch 'origin/main' into garrytan/team-supabase-store 2026-03-16 11:29:57 -05:00
Garry Tan 78e519e3b7 feat: await support in browse js/eval + contributor mode v2 (#104)
* feat: support await in $B js and eval commands

Auto-wrap await expressions in async IIFE context so
$B js "await fetch(...)" works without SyntaxError.

- hasAwait() strips comments before detection
- js: expression wrapping (async()=>(expr))()
- eval: smart wrapping — single-line=expression, multi-line=block
- 6 new unit tests covering async, false-positive, and return semantics

* feat: redesign contributor mode — periodic reflection with 0-10 rating

Replace passive "report when things break" with active reflection:
- Rate gstack experience 0-10 at workflow step boundaries
- Historical calibration example (await bug) anchors the reporting bar
- "What would make this a 10" field focuses on actionable improvements
- Removed category lists in favor of judgment-based assessment

* test: add deterministic contributor mode preamble validation

40 new skill-validation tests (4 checks × 10 skills) verify:
- 0-10 rating scale present
- Calibration example present
- "What would make this a 10" field present
- Periodic reflection (not per-command)

Update existing E2E contributor eval for new report format.

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve contributor mode + qa-quick E2E reliability

Contributor mode:
- Add "do not truncate" directive to template — agent was stopping
  after "My rating" without completing Steps/Raw output/What would
  make this a 10 sections
- Restore assertions for Steps to reproduce and Date footer

QA quick:
- Make test server URL prominent: top of prompt, explicit "already
  running" and "do NOT discover ports" instructions
- Bump session timeout 180s→240s and test timeout 240s→300s
- Set B= at top of prompt (was buried in prose)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use flexible assertions for contributor mode E2E

Agent writes thorough reports with creative section names
("Repro Steps" vs "Steps to reproduce"). Match intent not formatting:
- /repro|steps to reproduce/ for reproduction steps
- /date.*2026/ for date footer presence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add E2E eval failure blame protocol

"Not related to our changes" is an extraordinary claim that requires
extraordinary proof. When evals fail during /ship:

1. Run the same eval on main — prove it fails there too
2. If it passes on main, it IS your change — trace the blame
3. If you can't verify, say "unverified" not "pre-existing"

Added to CLAUDE.md and as a comment in skill-e2e.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md and BROWSER.md for v0.4.2

CONTRIBUTING.md: update contributor mode description — now describes
periodic 0-10 reflection loop instead of passive friction detection.

BROWSER.md: add js/eval async documentation — await expressions are
auto-wrapped in async context, single-line eval returns values directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restore v0.4.2 changelog entries lost during cherry-pick conflict

The base branch detection entries from main were dropped when resolving
the CHANGELOG conflict — should have merged both sets, not replaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 11:28:58 -05:00
Garry Tan 1e06b6a5c6 fix: dynamic base branch detection across all SKILL templates (v0.3.10) (#81)
* feat: add {{BASE_BRANCH_DETECT}} resolver to gen-skill-docs

DRY placeholder for dynamic base branch detection across PR-targeting
skills. Detects via gh pr view (existing PR base) → gh repo view
(repo default) → fallback to main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: ship skill detects base branch instead of hardcoding main

Replaces ~14 hardcoded 'main' references with dynamic detection via
{{BASE_BRANCH_DETECT}}. Fixes stacked branches and Conductor workspaces
targeting non-main branches. Adds --base <base> to gh pr create.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review, qa, plan-ceo-review detect base branch dynamically

Same pattern as ship: replaces hardcoded 'main' with {{BASE_BRANCH_DETECT}}.
Also cleans up qa bash-isms (REPORT_DIR variable, port chaining).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: retro detects default branch instead of hardcoding origin/main

Retro queries commit history (not PR targets), so uses simpler detection:
gh repo view defaultBranchRef. Replaces ~11 origin/main refs with
origin/<default>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add explicit cross-step references in gstack-upgrade template

Bash blocks are self-contained, but cross-block variable references
(INSTALL_DIR from Step 2) were implicit. Adds prose making them explicit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs+test: SKILL authoring guidance + regression tests

Adds "Writing SKILL templates" section to CLAUDE.md explaining that
templates are prompts, not scripts. Adds validation test catching
hardcoded 'main' in git commands, and resolver content test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update ARCHITECTURE + CONTRIBUTING for new placeholders

Add {{BASE_BRANCH_DETECT}} to ARCHITECTURE.md placeholder list.
Cross-reference CLAUDE.md template authoring guidance from CONTRIBUTING.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.10)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing blank line between resolver functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add 3 E2E smoke tests for base branch detection

- /review: verifies Step 0 detection + git diff against detected base
- /ship: truncated dry-run (Steps 0-1 only, no push/PR), asserts no
  destructive actions
- /retro: verifies default branch detection for git log queries

Covers the {{BASE_BRANCH_DETECT}} resolver path (review), the ship
template's dual abort check, and retro's inline detection pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 10:59:13 -05:00
Garry Tan 9e67d71f72 docs: add 8 team dashboard TODOs from CEO review, mark weekly digest shipped
New TODOs: regression alert links, projected monthly cost, ship-to-Slack
notifications, dynamic favicon, server-side aggregation, SSE streaming,
GitHub Check Runs, ship_logs index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 10:00:36 -05:00
Garry Tan 721abce5a5 fix: review-driven hardening — env guards, token expiry, slug validation, dashboard UX
From CEO plan review:
- Edge functions: early guard on missing env vars instead of non-null assert crash
- cli-team: wire isTokenExpired check (was imported but unused)
- Migration 007: CHECK constraint on team slug (a-z0-9 hyphens, 2-50 chars)
- Dashboard: streak badges on leaderboard, repo slug in who's-online,
  contextual empty states that teach, 60s refresh (was 30s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 09:59:20 -05:00
Garry Tan 2357f134ce merge: integrate origin/main (v0.4.0, v0.4.1) into team-supabase-store
Resolves conflicts in CHANGELOG.md (ordering), CONTRIBUTING.md (eval
tools list merge), VERSION (take main's 0.4.1), qa/SKILL.md.tmpl
(keep full methodology + baseline line), eval-store.test.ts (drop
redundant comment).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 07:49:27 -05:00
Garry Tan 83bfc7f88d feat: add /setup-team-dashboard skill, post-ship leaderboard callout
Interactive 8-step setup skill for deploying dashboard + edge functions.
Post-ship callout shows team leaderboard after successful sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:59 -05:00
Garry Tan 78840c64a8 feat: add shared team dashboard, regression alerts, weekly digest edge functions
Dashboard: Supabase edge function serving self-contained HTML with
PKCE OAuth, 6 parallel client-side REST queries, SVG charts, dark
theme, auto-refresh, who's-online from heartbeats. Public URL.

Regression alert: webhook on eval_runs INSERT, 5-min cooldown dedup
via alert_cooldowns, Slack notification on >5% pass rate drop.

Weekly digest: pg_cron Monday 9am UTC, aggregates 7-day team data,
Slack message with evals/ships/sessions/costs. 15 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:47 -05:00
Garry Tan 46c82ce8ec feat: add team admin CLI + migration 007 (settings, cooldowns, create_team RPC)
New `gstack team` CLI with create, members, set subcommands.
Migration adds team_settings (admin-only), alert_cooldowns (edge-fn
dedup), and create_team() SECURITY DEFINER RPC for atomic team +
first member creation. 9 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:24 -05:00
Garry Tan 4985c8e7e9 feat: add CLI leaderboard, refactor formatTeamSummary to use dashboard-queries
New `gstack eval leaderboard` subcommand pulls team data and renders
weekly stats per contributor. Refactored formatTeamSummary to use
computeVelocity from dashboard-queries (DRY). 4 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:44:12 -05:00
Garry Tan e969c6dadf feat: add dashboard query functions — pure transforms for team analytics
6 functions: detectRegressions, computeVelocity, computeCostTrend,
computeLeaderboard, computeQATrend, computeEvalTrend. All pure,
no I/O, with division-by-zero guards. 28 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 02:43:52 -05:00
Garry Tan 3e3843c4a9 feat: contributor mode, session awareness, recommendation format (#90)
* feat: contributor mode, session awareness, universal RECOMMENDATION format

- Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates
- Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions)
- ELI16 mode when 3+ concurrent sessions detected (re-ground user on context)
- Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/
- Universal AskUserQuestion format: context → question → RECOMMENDATION → options
- Update plan-ceo-review and plan-eng-review to reference preamble baseline
- Add vendored symlink awareness section to CLAUDE.md
- Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing
- Add tests for contributor mode and session awareness in generated output
- Add E2E eval for contributor mode report filing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Enum & Value Completeness to /review critical checklist

New CRITICAL review category that traces new enum values, status strings,
and type constants through every consumer outside the diff. Catches the
class of bugs where a new value is added but not handled in all switch/case
chains, allowlists, or frontend-backend contracts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add CHANGELOG style guide — user-facing, sell the feature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite v0.4.1 changelog to be user-facing and sell the features

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add evals for RECOMMENDATION format, session awareness, and enum completeness

Free tests (Tier 1): RECOMMENDATION format + session awareness in all
preamble SKILL.md files, enum completeness checklist structure and CRITICAL
classification.

E2E eval: /review catches missed enum handlers when a new status value
is added but not handled in case/switch and notify methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add E2E eval for session awareness ELI16 mode

Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments
branch, verifies the output re-grounds the user with project, branch,
context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: contributor mode eval marked FAIL due to expected browse error

The test intentionally runs a nonexistent binary to trigger contributor
mode. The session runner's browse error detection catches "no such file
or directory...browse" and sets browseErrors, causing recordE2E to mark
passed=false. Override passed to check only exitReason since the browse
error is the expected scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 01:45:50 -05:00
Garry Tan 6e14689f0e docs: add team sync TODOs — streaming parser, effectiveness scoring, weekly digest
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:40 -05:00
Garry Tan 3a57a3f59e feat: add /setup-team-sync skill, auto-push transcript hooks in skills
- setup-team-sync/SKILL.md.tmpl: idempotent guided setup (create config,
  OAuth, verify connectivity, configure settings, summary)
- ship/retro/qa SKILL.md.tmpl: add push-transcript hook after existing
  push-ship/push-retro/push-qa hooks (silent, non-fatal)
- scripts/gen-skill-docs.ts: add setup-team-sync to template list
- Regenerated all SKILL.md files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:36 -05:00
Garry Tan a104471272 feat: add push-transcript CLI, show sessions, interactive setup, 36 tests
- cli-sync.ts: push-transcript command, show sessions with formatSessionTable(),
  upgrade cmdSetup() to interactively create .gstack-sync.json if missing
- bin/gstack-sync: add push-transcript case and help text
- test/lib-llm-summarize.test.ts: 10 tests with mocked fetch (429 retry,
  5xx backoff, malformed response, no API key, cache)
- test/lib-transcript-sync.test.ts: 22 tests for parsing, grouping,
  session file extraction, marker management, slug resolution
- test/lib-sync-show.test.ts: 4 tests for formatSessionTable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:26 -05:00
Garry Tan 0e29d7d1a3 feat: add enriched transcript sync — Haiku summaries, session file enrichment
Add session intelligence pipeline for team transcript sync:
- lib/transcript-sync.ts: parse history.jsonl, enrich with Claude session
  file data (tools_used, full turn count), sync marker management,
  10-concurrent push with 5-concurrent Haiku summarization
- lib/llm-summarize.ts: raw fetch() to Anthropic Messages API (no SDK dep),
  retry-after on 429, exponential backoff on 5xx, SHA-based eval-cache
- lib/sync.ts: pushTranscript() and pullTranscripts() following existing patterns
- 006_transcript_sync.sql: unique index on (team_id, session_id) for
  idempotent upsert, RLS changed from admin-only to team-wide read

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 00:15:19 -05:00
Garry Tan f3ee0ee28a feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)
* feat: browser ref staleness detection via async count() validation

resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: qa-only skill, qa fix loop, plan-to-QA artifact flow

Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: eval efficiency metrics — turns, duration, commentary across all surfaces

Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0

- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add user-facing benefit descriptions to v0.4.0 changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 23:55:39 -05:00
Garry Tan 87cb769c35 feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests
- 005_sync_heartbeats.sql migration for connectivity testing
- eval:trend --team flag pulls team eval data (graceful fallback)
- docs/TEAM_SYNC_SETUP.md step-by-step setup guide
- Design doc status updated to Phase 2 complete
- 10 new tests for sync show formatting functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:43:03 -05:00
Garry Tan 06f2da2019 feat: wire team sync push into ship, retro, qa, and greptile skills
Add non-fatal sync steps to all 4 skill templates:
- /ship Step 8.5: write ship log JSON + push after PR creation
- /retro Step 13: push snapshot after JSON save
- /qa Phase 6.7: write qa-sync.json + push after health score
- greptile-triage: push each triage entry after history file writes

All calls use || true for zero disruption. Silent when sync not
configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:42:54 -05:00
Garry Tan dc3fcc8611 feat: DRY push functions, add push-greptile + sync test/show commands
Extract pushWithSync() helper to eliminate boilerplate across 6 push
functions. Add pushHeartbeat() for connectivity testing. Add push-greptile
to CLI. New commands: gstack-sync test (validates full push/pull flow
via sync_heartbeats table), gstack-sync show (terminal team data
dashboard with summary/evals/ships/retros views). Guard main block
with import.meta.main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:42:45 -05:00
Garry Tan 704fe34e98 docs: clean up sync example, add team sync section to README
Remove _comment hacks from JSON example file. Add short team sync
section to README explaining what it is, that it's optional, and
how to set it up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:06:51 -05:00
Garry Tan 14320469b0 docs: CHANGELOG covers full branch scope including team sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:05:45 -05:00
Garry Tan eb7ef2153b docs: add setup comments to .gstack-sync.json.example
Explain what team sync gives you, that it's optional, and how to
set it up. Points to TEAM_COORDINATION_STORE.md for full guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:04:49 -05:00
Garry Tan e28033353d chore: bump v0.3.10, update CHANGELOG and docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:55:34 -05:00
Garry Tan 33c9552870 chore: update gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:46 -05:00
Garry Tan daea165333 feat: add eval:trend CLI for per-test pass rate tracking
computeTrends() classifies tests as stable-pass/stable-fail/flaky/
improving/degrading based on pass rate, flip count, and recent streak.
gstack eval trend shows sparkline table with --limit, --tier, --test
filters. Guard CLI main block with import.meta.main to prevent
execution on import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:41 -05:00
Garry Tan 59752fc510 feat: wire eval-cache + eval-tier into LLM judge, pin E2E model
callJudge/judge now return {result, meta} with SHA-based caching
(~$0.18/run savings when SKILL.md unchanged) and dynamic model
selection via EVAL_JUDGE_TIER env var. E2E tests pass --model from
EVAL_TIER to claude -p. outcomeJudge retains simple return type.
All 8 LLM eval test sites updated with real costs and costs[].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:35 -05:00
Garry Tan 02925cfc7a feat: wire costs[] from modelUsage into eval results
Extract per-model token usage from resultLine.modelUsage (including
cache tokens and exact API cost), flow CostEntry[] through EvalCollector,
aggregate in finalize(). Extend CostEntry with cache_read_input_tokens,
cache_creation_input_tokens, cost_usd. computeCosts() prefers exact
cost_usd over MODEL_PRICING when available (~4x more accurate with
prompt caching).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 16:47:27 -05:00
Garry Tan 4ad73f7362 feat: unified gstack eval CLI with list, compare, push, cache, cost
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch
  subcommands. Ports logic from 4 separate scripts into unified entry.
  Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list.
- bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern
- package.json: eval:* scripts now point to lib/cli-eval.ts
- supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS
- docs/eval-result-format.md: public format spec for any language
- test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess)
  including 3 push failure modes (file-not-found, invalid schema,
  sync unavailable)

215 tests passing across 13 files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:36 -05:00
Garry Tan 1f5b7882e6 feat: add SHA-based eval caching with EVAL_CACHE=0 bypass
Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys
from source file contents + test input via Bun.CryptoHasher SHA256.
Supports read/write/stats/clear/verify operations. EVAL_CACHE=0
skips reads for force-rerun. 16 tests including corrupt JSON handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:26 -05:00
Garry Tan 9bc6c9416f feat: add eval format validation, tier selection, cost tracking
- lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(),
  normalizeFromLegacy/normalizeToLegacy round-trip converters
- lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env,
  tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full)
- lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(),
  formatCostDashboard(), aggregateCosts(), fallback for unknown models
- 42 tests across 3 test files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:18 -05:00
Garry Tan 7f7035f55a feat: add listEvalFiles, loadEvalResults, formatTimestamp to lib/util.ts
DRY up eval I/O duplicated across scripts/eval-list.ts,
eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant,
formatTimestamp(), listEvalFiles(), loadEvalResults() with
--limit support. 13 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 09:39:09 -05:00
Garry Tan 82e204179b feat: hook eval-store sync, use shared utils, add 30 lib tests
- eval-store.ts: import shared getGitInfo/getVersion, add pushEvalRun()
  hook in finalize() (non-blocking, non-fatal)
- session-runner.ts: import shared atomicWriteSync/sanitizeForFilename
- eval-store.test.ts: fix pre-existing bug in double-finalize test
  (was counting _partial file)
- 30 new tests for lib/util, lib/sync-config, lib/sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:54 -05:00
Garry Tan f7ae465415 feat: add Supabase migration SQL for team data store
- 001_teams.sql: teams + team_members + RLS
- 002_eval_runs.sql: eval results with universal format, indexes, upsert key
- 003_data_tables.sql: retro, QA, ship, greptile, transcripts + RLS

All tables use RLS: team members read/insert, admins delete.
Transcript table has tighter policy (admin-only read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:47 -05:00
Garry Tan 3713c3b9b9 feat: add team sync infrastructure (config, auth, push/pull, CLI)
- lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json
- lib/auth.ts: device auth flow (browser OAuth, local HTTP callback)
- lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache
- lib/cli-sync.ts: CLI handler for gstack-sync commands
- bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain)
- .gstack-sync.json.example: template for team setup

Zero new dependencies — uses raw fetch() against PostgREST API.
All sync is non-fatal with 5s timeout and offline queue fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:40 -05:00
Garry Tan caed287496 feat: extract shared utilities into lib/util.ts
DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug,
and sanitizeForFilename from eval-store.ts, session-runner.ts, and
eval-watch.ts into a shared module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 02:02:32 -05:00
Garry Tan 5c1ea088d8 docs: scrub proprietary refs, close eval format gaps, integrate gstack-config
- Replace project-specific references with generic language
- Add missing fields to eval result format: prompt_sha, by_category,
  timestamp, response_preview
- Enrich failure format with details array, scores dict, expectation_type
- Add EVAL_JUDGE_CACHE, EVAL_VERBOSE, multiprocess worker support,
  dedup on push, run scopes, model aliases, judge profiles
- Restructure credential storage to 4 layers with gstack-config (v0.3.9)
  for user preferences (sync_enabled, sync_transcripts)
- Update integration points, observability, and reuse map

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 01:47:30 -05:00
Garry Tan 89311653df Merge remote-tracking branch 'origin/main' into garrytan/team-supabase-store 2026-03-15 01:32:11 -05:00