gstack/docs/designs/TEAM_COORDINATION_STORE.md
Garry Tan 87cb769c35 feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests
- 005_sync_heartbeats.sql migration for connectivity testing
- eval:trend --team flag pulls team eval data (graceful fallback)
- docs/TEAM_SYNC_SETUP.md step-by-step setup guide
- Design doc status updated to Phase 2 complete
- 10 new tests for sync show formatting functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:43:03 -05:00


Team Coordination Store: gstack as Engineering Intelligence Platform

Design doc for the Supabase-backed team data store and universal eval infrastructure. Authored 2026-03-15. Status: Phases 1 and 2 complete (Phase 2 added skill hooks, sync test/show, team trends); Phases 3 and 4 not started.


The Problem

gstack currently stores all data as local flat files:

| Data | Location | Format |
| --- | --- | --- |
| Eval results | ~/.gstack-dev/evals/*.json | JSON (EvalResult schema v1) |
| Retro snapshots | .context/retros/*.json | JSON (metrics + per-author) |
| Greptile triage | ~/.gstack/greptile-history.md | Pipe-delimited text |
| QA reports | .gstack/qa-reports/ | Markdown + baseline.json |
| Ship logs | Not yet implemented | Planned JSON |
| Claude transcripts | ~/.claude/history.jsonl | JSONL (Claude Code's domain) |

This works for solo developers. For teams on vendored gstack, it means:

  • Zero shared visibility into code quality, shipping velocity, or eval regressions
  • No cross-contributor comparison — each developer's data is isolated on their machine
  • No regression detection — an eval suite can regress and nobody notices until production breaks
  • Duplicated infrastructure — the author runs another project with a sophisticated eval system (60+ runners, S3 storage, caching, cost tracking, baselines) locked inside Ruby/Rails, solving the same problems gstack solves in Bun/TS

The Vision (Platonic Ideal)

Imagine this: a new engineer joins the team. They run gstack sync setup, authenticate in 30 seconds, and immediately see:

  • The team's shipping velocity — 14 PRs merged this week, trending up
  • Which areas of the codebase are most active — app/services/ is a hotspot
  • How the AI is performing — eval detection rate is 92%, up from 85% last month
  • What the AI struggles with — response email evals consistently score low on brevity
  • How senior engineers use Claude differently than juniors — more targeted prompts, fewer turns
  • A weekly digest arriving in Slack every Monday with the team's pulse

They don't need to ask anyone. They don't need to read a wiki. The data is alive, flowing, and organized.

When they run /ship, the last line says "Synced to team ✓". When an eval regresses, a Slack alert fires within minutes. When someone ships a fix that improves detection rate by 10%, it shows up on the leaderboard.

The system is invisible when it works and loud when something breaks. Skills don't know sync exists — they read local files, and the local files happen to contain team data. The infrastructure layer is purely additive. Turn it off with one config change. Delete the config and it's as if it never existed.

This is what "engineering intelligence" means: the team's collective knowledge about code quality, AI effectiveness, and shipping patterns — organized, shared, and actionable.


10-Year Trajectory

YEAR 1 (this plan)
├── Supabase data store — team sync for evals, retros, QA, ships, reviews
├── Universal eval infrastructure — adapter mode, any language pushes results
├── Eval cache, cost tracking, baselines, comparison — ported from existing Rails project
├── Live eval dashboard — browser-based, SSE streaming
├── Team dashboard — velocity, quality trends, cost tracking
├── Edge functions — regression alerts, weekly digests
└── Inline sync in skills — "Synced to team ✓"

YEAR 2
├── Native eval runner — gstack runs evals directly (YAML → LLM → judge)
├── Cross-team benchmarking — opt-in anonymized aggregates across teams
├── AI usage analytics — which prompts/tools are most effective
├── PR-integrated quality gates — eval results as GitHub check runs
├── CI/CD first-class support — GitHub Actions eval workflow
└── Multi-repo support — one team, many repos, unified dashboard

YEAR 3
├── Prompt optimization engine — analyze eval history to suggest prompt improvements
├── Regression prediction — ML on eval trends to predict quality drops before they happen
├── Custom judge profiles — teams define their own quality criteria and scoring rubrics
├── Eval marketplace — share and discover eval suites across the gstack community
└── Voice health dashboard — per-author quality scoring

YEAR 5
├── Engineering intelligence API — other tools consume gstack's data layer
├── Autonomous quality maintenance — gstack detects regressions and proposes fixes
├── Cross-organization insights — "teams like yours typically..." recommendations
├── Real-time collaboration — live pair-eval sessions, shared debugging
└── Training data curation — eval results feed into fine-tuning pipelines

YEAR 10
├── The engineering intelligence layer — as fundamental as git or CI
├── Every AI-assisted engineering team has a shared data substrate
├── Eval-driven development is standard practice, not an afterthought
├── The gap between "how the AI performed" and "what the team shipped" is closed
└── gstack is to AI-native engineering what GitHub is to version control

The key insight: data compounds. Year 1 data makes year 2 features possible. Year 2 data makes year 3 predictions accurate. By year 5, the accumulated eval history is more valuable than any individual eval run. The platform gets smarter the longer a team uses it.


Key Decisions

All decisions were made during the CEO-mode plan review on 2026-03-15.

| # | Decision | Resolution | Rationale |
| --- | --- | --- | --- |
| 1 | Hosting model | Self-hosted Supabase per team | Maximum control, data sovereignty |
| 2 | Transcript handling | Opt-in, no scrubbing | Trust the team — same model as shared Slack. Supabase encrypts at rest + in transit. RLS enforces team isolation. |
| 3 | Read architecture | Cache-based | Skills never touch network. gstack sync pull writes to .gstack/team-cache/. Skills read local files only. Preserves "sync is invisible" invariant. |
| 4 | Eval integration | Adapter mode (not native runner) | Your app runs evals. gstack is infrastructure: storage, comparison, caching, dashboards, sharing. |
| 5 | Test case format | YAML for cases, JSON for results | YAML for human-authored inputs (comments, multiline). JSON for machine-generated outputs. |
| 6 | Queue overflow | No cap, warning-based | Don't silently drop data. gstack sync status warns if >100 items or >24h old. |
| 7 | Queue drain | Parallel 10-concurrent | Promise.allSettled(). 500 items in ~10s instead of 100s. |
| 8 | Cache staleness | Metadata file | .gstack/team-cache/.meta.json tracks last_pull + row counts per table. |

Architecture

System Diagram

                           TEAM SUPABASE INSTANCE
                           ┌─────────────────────────────────────────────┐
                           │  PostgreSQL + RLS                            │
                           │  ┌──────────┐ ┌──────────┐ ┌─────────────┐ │
                           │  │eval_runs │ │retro_    │ │eval_costs   │ │
                           │  │          │ │snapshots │ │(per-model)  │ │
                           │  └──────────┘ └──────────┘ └─────────────┘ │
                           │  ┌──────────┐ ┌──────────┐ ┌─────────────┐ │
                           │  │qa_reports│ │greptile_ │ │ship_logs    │ │
                           │  │          │ │triage    │ │             │ │
                           │  └──────────┘ └──────────┘ └─────────────┘ │
                           │  ┌──────────┐ ┌──────────┐                 │
                           │  │session_  │ │teams +   │  Auth.users     │
                           │  │transcr.  │ │members   │                 │
                           │  └──────────┘ └──────────┘                 │
                           │                                             │
                           │  Edge Functions (Phase 4):                  │
                           │  • regression-alert (on eval_runs INSERT)   │
                           │  • weekly-digest (cron → email/Slack)       │
                           └──────────┬──────────────────────────────────┘
                                      │ HTTPS (REST API)
                                      │
              ┌───────────────────────┼───────────────────────┐
              │                       │                       │
   Developer A Machine      Developer B Machine        CI Runner
   ┌─────────────────┐     ┌─────────────────┐     ┌──────────────┐
   │ gstack eval push │     │ gstack eval push │     │ ENV:          │
   │ gstack eval cache│     │ gstack eval      │     │ ACCESS_TOKEN │
   │ /retro /ship /qa │     │   compare        │     │              │
   │                  │     │ /retro /ship /qa │     │ gstack eval  │
   │ ~/.gstack/       │     │                  │     │   push       │
   │   auth.json(0600)│     │ ~/.gstack/       │     └──────────────┘
   │   eval-cache/    │     │   auth.json(0600)│
   │   sync-queue.json│     │   eval-cache/    │
   │                  │     │   sync-queue.json│
   │ .gstack/         │     │                  │
   │   team-cache/    │     │ .gstack/         │
   │     .meta.json   │     │   team-cache/    │
   └─────────────────┘     └─────────────────┘

Config & Credential Storage: 4 Layers

Layer 1: Project config — .gstack-sync.json (committed to repo)

{
  "supabase_url": "https://xyzcompany.supabase.co",
  "supabase_anon_key": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "team_slug": "xyzcompany"
}

The anon key is safe to commit. This is Supabase's design — the anon key only grants access through RLS policies, which require a valid user JWT. It's the same key that ships in every Supabase client-side app. Without a valid user token, the anon key gets you nothing.

Project-level only: Supabase URL, anon key, team slug. No user preferences here — those are per-developer (Layer 2).

Layer 2: User settings — ~/.gstack/config.yaml (via gstack-config)

# Existing settings (v0.3.9)
auto_upgrade: true
update_check: true

# New sync settings
sync_enabled: true          # enable/disable team sync (per-user)
sync_transcripts: false     # opt-in transcript sharing (per-user)

Managed via the existing gstack-config CLI (bin/gstack-config):

gstack-config get sync_enabled     # → "true" or ""
gstack-config set sync_enabled true
gstack-config set sync_transcripts false
gstack-config list                 # → all settings

Rationale: sync_enabled and sync_transcripts are user preferences, not project config. One developer might want sync off while the rest of the team has it on. gstack-config already handles this pattern for auto_upgrade and update_check.

Layer 3: User auth — ~/.gstack/auth.json (mode 0o600, never committed)

{
  "https://xyzcompany.supabase.co": {
    "access_token": "eyJ...",
    "refresh_token": "v1.xxx...",
    "expires_at": 1710460800,
    "user_id": "uuid",
    "team_id": "uuid",
    "email": "dev@company.com"
  }
}

Keyed by supabase_url so developers on multiple teams/projects just work. Written with chmod 0o600 — same pattern as browse.json in browse/src/server.ts.
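The permissions handling can be sketched in TypeScript (a minimal sketch; `saveAuth` and its argument shape are illustrative names, not the actual gstack code):

```typescript
import { chmodSync, existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: persist tokens keyed by supabase_url, owner-only perms.
export function saveAuth(
  supabaseUrl: string,
  entry: object,
  dir = join(homedir(), ".gstack"),
): string {
  mkdirSync(dir, { recursive: true });
  const path = join(dir, "auth.json");
  const store: Record<string, object> = existsSync(path)
    ? JSON.parse(readFileSync(path, "utf8"))
    : {};
  store[supabaseUrl] = entry; // one entry per team/project Supabase instance
  writeFileSync(path, JSON.stringify(store, null, 2), { mode: 0o600 }); // mode applies on create
  chmodSync(path, 0o600); // enforce even if the file already existed
  return path;
}
```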

Layer 4: Admin bootstrap — one-time Supabase project setup

# Admin runs once to set up the project:
gstack sync init --supabase-url https://xyzcompany.supabase.co

# Prompts for service role key (or reads SUPABASE_SERVICE_ROLE_KEY env).
# Runs migrations, creates team, generates .gstack-sync.json.
# Service role key is NOT saved anywhere.

CI/automation uses GSTACK_SUPABASE_ACCESS_TOKEN env var.

Auth Flow

gstack sync setup reads URL from .gstack-sync.json → opens browser for OAuth or magic link → polls for completion → writes tokens to ~/.gstack/auth.json (mode 0o600) → sets sync_enabled=true via gstack-config.

On first successful auth, shows a team welcome: "3 members, 47 eval runs this week, last ship 2h ago."

Sync Pattern: Bidirectional, Non-Fatal

Writes: Every local data write gets a push*() call after. Pattern:

  • 5-second timeout
  • try/catch (never throws, never blocks the calling skill)
  • Idempotent (upsert on natural keys: timestamp + hostname + repo_slug)
  • Falls back to local queue (~/.gstack/sync-queue.json) if offline
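A minimal sketch of this write pattern, with the network call and queue append injected so the shape is visible (`pushNonFatal`, `QueueItem`, and both callbacks are hypothetical names, not gstack's actual API):

```typescript
type QueueItem = { table: string; row: object; queued_at: string };

// Never throws, never blocks the calling skill for more than timeoutMs.
// On any failure (offline, slow, server error) the item goes to the queue.
export async function pushNonFatal(
  send: () => Promise<void>,          // the actual network call
  enqueue: (item: QueueItem) => void, // appends to ~/.gstack/sync-queue.json
  item: QueueItem,
  timeoutMs = 5000,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("sync timeout")), timeoutMs);
  });
  try {
    await Promise.race([send(), timeout]);
    return true;
  } catch {
    enqueue(item); // offline/failed → queued for a later drain
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```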

Reads: gstack sync pull queries Supabase and writes team data to .gstack/team-cache/. Skills read local files only — they never import sync or touch the network. Cache metadata in .gstack/team-cache/.meta.json tracks freshness:

{
  "last_pull": "2026-03-15T10:30:00Z",
  "tables": {
    "retro_snapshots": { "rows": 47, "latest": "2026-03-14" },
    "eval_runs": { "rows": 123, "latest": "2026-03-15T09:00:00Z" }
  }
}
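A staleness check over this metadata might look like the following sketch (`isCacheStale` is a hypothetical name; the 24-hour default mirrors the queue warning threshold and is an assumption, not a documented cache setting):

```typescript
interface CacheMeta {
  last_pull: string; // ISO 8601, as in .meta.json above
  tables: Record<string, { rows: number; latest: string }>;
}

// Returns true when the last pull is older than maxAgeHours.
export function isCacheStale(meta: CacheMeta, now: Date = new Date(), maxAgeHours = 24): boolean {
  const ageMs = now.getTime() - new Date(meta.last_pull).getTime();
  return ageMs > maxAgeHours * 3_600_000;
}
```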

Queue: No cap on size. gstack sync status warns if >100 items or oldest entry >24h. Drain uses 10-concurrent Promise.allSettled() — 500 items drain in ~10s.
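The batched drain can be sketched as follows (`drainQueue` is a hypothetical name; real code would also rewrite sync-queue.json with the items that remain):

```typescript
// Drain in batches of `concurrency` with Promise.allSettled, so one failed
// item never aborts the rest. Items that still fail stay queued.
export async function drainQueue<T>(
  items: T[],
  push: (item: T) => Promise<void>,
  concurrency = 10,
): Promise<{ sent: number; remaining: T[] }> {
  const remaining: T[] = [];
  let sent = 0;
  for (let i = 0; i < items.length; i += concurrency) {
    const batch = items.slice(i, i + concurrency);
    const results = await Promise.allSettled(batch.map(push));
    results.forEach((r, j) => (r.status === "fulfilled" ? sent++ : remaining.push(batch[j])));
  }
  return { sent, remaining };
}
```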

For skills (retro, review, qa, ship), sync happens via bin/gstack-sync called at end of skill with || true — same pattern as existing bin/gstack-update-check.

Opt-in Transcript Sync

When sync_transcripts: true in ~/.gstack/config.yaml (set via gstack-config set sync_transcripts true):

  • gstack-sync push-transcript reads ~/.claude/history.jsonl (new entries since last sync marker)
  • Stores in session_transcripts table with RLS policy (admin-only read by default)
  • No scrubbing — trust the team. Opt-in = consent. Same trust model as a shared Slack channel.
  • Useful for: team code review of AI usage patterns, onboarding, identifying prompt improvements

gstack eval: Universal Eval Infrastructure

gstack eval is the infrastructure layer for LLM evals. It does not run your evals — your app does that in whatever language it's written in. gstack handles everything after results exist: storage, comparison, caching, dashboards, team sharing.

Design: Adapter Mode

YOUR APP (any language)              GSTACK EVAL (infrastructure)
═══════════════════════              ════════════════════════════

Rails rake eval:run ──┐
Python pytest-evals ──┼──▶ JSON result ──▶ gstack eval push ──▶ Supabase
Go test -run Eval ────┘    (standard        ├──▶ gstack eval compare
                            format)          ├──▶ gstack eval list
                                             ├──▶ gstack eval baselines
                                             ├──▶ gstack eval cost
                                             ├──▶ gstack eval watch (live dashboard)
                                             └──▶ gstack dashboard (team-wide)

Your eval runners keep their language, their models, their service objects. gstack provides the plumbing.

What We're Porting from an Existing Rails Project

The author runs another project with a production-grade eval infrastructure in Ruby/Rails. The patterns are general-purpose and worth extracting into gstack as framework-agnostic infrastructure:

  • 60+ eval runners with YAML test cases
  • Multi-judge LLM evaluation — multiple judge profiles scoring on 8+ quality criteria
  • 3-tier pipeline — progressive refinement across model tiers (cheap → expensive)
  • SHA-based input caching with atomic writes and version invalidation
  • S3 result storage with auto-labeling, deduplication, and score aggregation
  • Cost tracking with per-model dashboards and tier comparison
  • Baseline generation — markdown reports with cross-tier comparison
  • Rake tasks for list, compare, cache management, fixture export

| Existing Rails Pattern | gstack (Bun/TS) | Port scope |
| --- | --- | --- |
| S3 result storage | lib/sync.ts (Supabase) | Full port: upload, list, compare, aggregate |
| Cost tracker | lib/eval-cost.ts | Full port: per-model tracking, terminal + HTML dashboard |
| Eval cache | lib/eval-cache.ts | Full port: SHA-based, atomic, CLI-accessible from any language |
| Baseline generator | lib/eval-baselines.ts | Full port: markdown reports from results |
| Judge tier selection | lib/eval-tier.ts | Full port: fast/standard/full model mapping |
| Rake tasks | bin/gstack-eval CLI | Full port: list, compare, cache, baselines, cost |
| YAML test cases | Standard format spec | Define format, document for any language |
| Eval runners (60+) | Stay in Rails | NOT ported — adapter mode |
| LLM-as-judge | lib/eval-judge.ts | Extend existing with multi-judge |

For existing Rails projects

Integrating an existing Rails eval system requires ~20 lines of change:

# BEFORE (S3):
EvalResultStorage.upload(results, label: auto_label)

# AFTER (gstack):
path = "#{gstack_dir}/result.json"
File.write(path, JSON.pretty_generate(gstack_format(results)))
system("gstack eval push #{path}")

Rails keeps its eval runners, YAML cases, service objects, and models. S3 is replaced by gstack eval push → Supabase.

Standard Eval Result Format (JSON)

Any language produces this. gstack consumes it. Designed as a superset of patterns found across 42+ eval suites covering content generation, tool-calling agents, email generation, scoring/classification, fact-checking, clustering, memory extraction, and A/B comparison testing.

{
  "schema_version": 1,
  "label": "dev_fix-terseness_standard",
  "timestamp": "2026-03-15T10:30:00Z",
  "git_sha": "abc123",
  "git_branch": "dev/fix-terseness",
  "prompt_sha": "a08ff469",
  "hostname": "dev-machine",
  "tier": "standard",
  "total": 18,
  "passed": 17,
  "failed": 1,
  "duration_seconds": 893.4,
  "by_category": {
    "post_generation": { "passed": 16, "total": 17 },
    "tool_usage": { "passed": 1, "total": 1 }
  },
  "all_results": [
    {
      "name": "must_cite_sources",
      "category": "post_generation",
      "passed": true,
      "duration_ms": 45000,
      "failures": [],
      "judge_scores": { "accuracy": 0.85, "voice_fidelity": 0.72 },
      "response_preview": "The proposed legislation would...",
      "output": {},
      "comparison": null
    }
  ],
  "costs": [
    {
      "model": "claude-sonnet-4-6",
      "calls": 25,
      "input_tokens": 45123,
      "output_tokens": 12456
    }
  ]
}
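For reference, the format above can be expressed as TypeScript types with a minimal validator (a sketch derived from the example fields; the validation rules are illustrative, not gstack's actual schema enforcement):

```typescript
interface EvalResultFile {
  schema_version: number;
  label: string;
  timestamp: string;          // ISO 8601
  git_sha: string;
  git_branch: string;
  prompt_sha: string;
  hostname: string;
  tier: string;               // run scope: "quick" | "standard" | "full"
  total: number;
  passed: number;
  failed: number;
  duration_seconds: number;
  by_category: Record<string, { passed: number; total: number }>;
  all_results: Array<{
    name: string;
    category: string;
    passed: boolean;
    duration_ms: number;
    failures: object[];
    judge_scores?: Record<string, number>;
    response_preview?: string;
    output?: Record<string, unknown>;   // open object, suite-specific
    comparison?: object | null;
  }>;
  costs: Array<{ model: string; calls: number; input_tokens: number; output_tokens: number }>;
}

// Minimal consistency checks before push — illustrative only.
export function validateResult(r: EvalResultFile): string[] {
  const errors: string[] = [];
  if (r.schema_version !== 1) errors.push("unsupported schema_version");
  if (r.passed + r.failed !== r.total) errors.push("passed + failed must equal total");
  return errors;
}
```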

Per-result output field — open object, suite-specific. Different eval types populate different keys. gstack stores as-is (JSONB) for display/comparison:

{
  "output": {
    "response": "Agent text response",
    "tool_calls": [{"name": "search", "input": {"query": "..."}}],
    "body": "Generated email body...",
    "subject": "Email subject line",
    "score": 72,
    "reasoning": "High alignment because...",
    "flags": ["red_flag_1"],
    "items": [{"id": "claim_1", "severity": "yellow", "commentary": "..."}],
    "chunks": ["chunk 1 text", "chunk 2 text"],
    "clusters": [{"theme": "Housing", "articles": ["..."]}],
    "memories": [{"content": "Enjoys cycling", "category": "personal"}],
    "extracted_fields": {"occupation": "engineer", "city": "Oakland"},
    "title": "Generated title",
    "structured_content": "Full article body..."
  }
}

Per-result comparison field — for A/B testing and tier-chaining evals:

{
  "comparison": {
    "type": "ab_test",
    "control_scores": {"accuracy": 0.80, "voice": 0.75},
    "treatment_scores": {"accuracy": 0.85, "voice": 0.78},
    "deltas": [
      {"criterion": "accuracy", "control": 0.80, "treatment": 0.85, "delta": 0.05}
    ],
    "tolerance": 0.05
  }
}

failures array format:

{
  "failures": [
    {
      "type": "threshold",
      "expectation_type": "voice_check",
      "message": "Voice check failed: 2 of 5 criteria below threshold",
      "criterion": "voice_fidelity",
      "expected": 0.7,
      "actual": 0.58,
      "details": [
        {"criterion": "no_hedging", "score": 0.3, "threshold": 0.7},
        {"criterion": "direct_tone", "score": 0.4, "threshold": 0.6}
      ],
      "scores": {
        "no_hedging": 0.3, "no_filler": 0.6, "direct_tone": 0.4,
        "uses_specifics": 0.8, "operator_energy": 0.9
      }
    },
    {
      "type": "deterministic",
      "expectation_type": "body_contains",
      "check": "body_contains",
      "pattern": "Series B",
      "message": "Pattern not found in output"
    }
  ]
}

Fields:

  • type — generic class (threshold | deterministic)
  • expectation_type — domain-specific check name from the YAML case
  • details — per-criterion breakdown for multi-criteria checks
  • scores — ALL scores (passing + failing) for context
  • message — human-readable summary

YAML Test Case Format

Human-authored, comments supported, multiline strings via | blocks. Designed as a superset of 60+ expectation types across 42+ eval suites.

Three sections: metadata (universal), input (suite-specific, open-ended), and expectations (standardized assertion types).

Minimal example

name: must_cite_sources
description: Post must cite original source material
category: post_generation
expectations:
  - type: body_contains
    patterns: ["Series B", "$50M"]
  - type: quality_check
    criteria:
      accuracy: 0.7
      no_hallucination: 0.8

Full example (all field categories)

# ── Metadata (universal) ──────────────────────────
name: admin_search_knowledge
description: Admin asks a content question, should use search tool
category: tool_usage
tags: [admin, regression, tool_calling]

# ── Prompt source files (for cache invalidation) ──
# SHA of these files becomes part of the cache key.
prompt_source_files:
  - app/services/chat_responder_service.rb
  - config/system_prompts/agent.txt

# ── Input context (suite-specific, open-ended) ────
# gstack treats input as opaque data passed to the runner.
# Different suites use different shapes:

# Agent/chat evals:
user_message: "What articles have we published about housing policy?"
user_state:
  fixture: admin_user
  overrides:
    city: "San Francisco"

# Email generation evals:
# user_context:
#   first_name: "David"
#   membership_status: active
#   memories: ["Works as ML engineer"]
# conversation_thread:
#   - direction: inbound
#     body: "Hi, I heard about your organization..."

# Content scoring/classification:
# content:
#   title: "Policy Analysis"
#   raw_content: "The proposed legislation..."

# Fixture-based generation:
# fixture_name: bundle_housing_policy

# Text processing:
# text: "Full article text..."
# strategies: [recursive, semantic]
# chunk_size: 80

# Media analysis:
# media_type: youtube
# transcript: "Full transcript..."
# metadata: { duration_seconds: 2700 }

# ── Expectations (standardized) ───────────────────
expectations:

  # ── Tool calling ──
  - type: tool_called
    tool: search_knowledge
    required: true
    input_contains:
      query: "housing"
  - type: tool_not_called
    tool: update_user_profile

  # ── Text matching (supports regex: /pattern/i) ──
  - type: response_contains
    patterns: ["housing", "/\\b(policy|legislation)\\b/i"]
  - type: response_excludes
    patterns: ["I don't have access"]
  - type: body_contains
    patterns: ["Dear David"]
  - type: body_excludes
    patterns: ["Best regards", "/here's the kicker/i"]
  - type: body_contains_any
    patterns: ["housing", "homes", "zoning"]

  # ── Length constraints ──
  - type: body_word_count
    min_words: 80
    max_words: 300
  - type: body_min_length
    min_words: 600

  # ── Structural checks ──
  - type: has_title
    min_words: 3
    max_words: 15
  - type: has_tldr
    min_chars: 50
    max_chars: 300
  - type: subject_not_empty
  - type: has_signoff
  - type: ends_with_question
  - type: body_has_headers
    min_count: 3
  - type: body_integrity
    max_shrinkage_pct: 10

  # ── Numeric scoring ──
  - type: score_range
    min: 40
    max: 65

  # ── Classification ──
  - type: channel_is
    channel: housing_policy
  - type: content_type_in
    values: [advocacy, opinion]
  - type: worthy

  # ── Field extraction ──
  - type: has_field
    field: occupation
    min_length: 5
  - type: has_fields
    fields: [topic_summary, sections]
  - type: min_fields_filled
    value: 4

  # ── Memory extraction ──
  - type: has_category
    value: "issue"
  - type: min_memories
    value: 2

  # ── Clustering / grouping ──
  - type: cluster_count_range
    min: 1
    max: 4
  - type: all_attendees_assigned
  - type: no_duplicate_assignments
  - type: themes_not_generic
    forbidden_themes: ["General group"]

  # ── Fact-check ──
  - type: item_count_range
    min: 5
    max: 20
  - type: no_false_positives
    max_actionable: 6
  - type: has_severity
    severity: green
    min: 1

  # ── LLM-as-judge checks ──
  - type: quality_check
    criteria:
      accuracy: 0.7
      completeness: 0.6
      no_hallucination: 0.8
      voice_fidelity: 0.7
  - type: voice_check
    criteria:
      no_filler: 0.5
      no_hedging: 0.6
      direct_tone: 0.6
      uses_specifics: 0.6
  - type: quality_check
    judge_profile: strict       # named profile (defined in judge-profiles.yaml)
    criteria:
      accuracy: 0.8             # profile can override default thresholds

# ── A/B testing (optional) ─────────────────────────
# comparison:
#   type: ab_test
#   control:
#     env: { DISABLE_FEATURE: "1" }
#   treatment:
#     env: {}
#   tolerance: 0.05
#   flaky_criteria:
#     some_criterion: 0.10

# ── Tier chaining (optional) ───────────────────────
# tier_chain:
#   - tier: quick
#     model: sonnet-4-6
#     output_file: quick_result.json
#   - tier: full
#     model: opus-4-6
#     input_from: quick_result.json

Complete expectation type inventory (60+ types)

| Category | Type | Key Fields | LLM? |
| --- | --- | --- | --- |
| Tool calling | tool_called | tool, required, input_contains | No |
| | tool_not_called | tool | No |
| Text matching | response_contains | patterns | No |
| | response_excludes | patterns | No |
| | response_contains_any | patterns | No |
| | body_contains | patterns | No |
| | body_excludes | patterns | No |
| | body_contains_any | patterns | No |
| | title_excludes | patterns | No |
| | tldr_excludes | patterns | No |
| | reasoning_contains | patterns | No |
| Length | body_word_count | min_words, max_words | No |
| | body_min_length | min_words | No |
| | word_count_range | min, max | No |
| | commentary_length | min_chars, max_chars | No |
| Structure | has_title | min_words, max_words | No |
| | has_tldr | min_chars, max_chars | No |
| | has_subtitle | min_chars, max_chars | No |
| | has_read_time | min, max | No |
| | has_signoff | | No |
| | has_links | min_count | No |
| | has_media_embeds | min_count, max_count, pattern | No |
| | body_has_headers | min_count | No |
| | subject_not_empty | | No |
| | ends_with_question | | No |
| | body_integrity | max_shrinkage_pct | No |
| Scoring | score_range | min, max | No |
| | expect_score_above | value | No |
| | expect_score_below | value | No |
| | bias_score_range | min, max | No |
| | quality_score_range | min, max | No |
| Classification | channel_is | channel | No |
| | channel_not | channel | No |
| | content_type_in | values | No |
| | worthy / not_worthy | | No |
| | expected_pass | value, expected_comment_type | No |
| Field extraction | has_field | field, min_length | No |
| | has_fields | fields | No |
| | field_is | field, value | No |
| | field_contains | field, patterns | No |
| | field_missing | field | No |
| | min_fields_filled | value | No |
| Memory | has_category | value | No |
| | min_memories | value | No |
| | max_memories | value | No |
| Clustering | cluster_count_range | min, max | No |
| | group_count_range | min, max | No |
| | group_size_range | min, max | No |
| | min_stories / max_stories | count | No |
| | all_attendees_assigned | | No |
| | no_duplicate_assignments | | No |
| | themes_not_generic | forbidden_themes | No |
| | has_high_score_cluster | min, score | No |
| | all_clusters_have_evidence | | No |
| Chunks | chunk_count_range | min, max | No |
| | lossless | | No |
| | word_bound | max_words | No |
| Threads | has_tweets | min_count, max_count | No |
| | char_limits | | No |
| | link_in_last_tweet | | No |
| Fact-check | item_count_range | min, max | No |
| | no_false_positives | max_actionable | No |
| | has_severity | severity, min | No |
| | violation_severity_at_least | violation, severity | No |
| Media | selects_expected_images | expected_filenames, min_selected | No |
| | extracts_clean_content | min_length | No |
| | min_concepts | count | No |
| Research | min_sections | count | No |
| | has_commentaries | min | No |
| | title_changed | | No |
| Source audit | source_audit_ran | | No |
| | urls_from_sources | allow_tweets, allow_internal | No |
| | outline_sources_cited | min_ratio | No |
| LLM judge | quality_check | criteria (dict), judge_profile | Yes |
| | voice_check | criteria (dict or string) | Yes |
| | question_quality | criteria | Yes |
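The deterministic text-matching types accept literal substrings or /pattern/flags regexes, as shown in the YAML examples above. A sketch of how a runner might evaluate them (`checkContains` and `toMatcher` are hypothetical names, not gstack's actual matcher):

```typescript
// Patterns are literal substrings unless written /like/this.
function toMatcher(pattern: string): (text: string) => boolean {
  const m = pattern.match(/^\/(.+)\/([a-z]*)$/);
  if (m) {
    const re = new RegExp(m[1], m[2]); // body and flags from /body/flags
    return (text) => re.test(text);
  }
  return (text) => text.includes(pattern);
}

// A *_contains check passes only when every pattern matches.
export function checkContains(
  text: string,
  patterns: string[],
): { passed: boolean; missing: string[] } {
  const missing = patterns.filter((p) => !toMatcher(p)(text));
  return { passed: missing.length === 0, missing };
}
```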

Eval Cache (language-agnostic CLI)

~/.gstack/eval-cache/
  {suite}/
    {sha-key}.json    ← { _cache_version, _cached_at, _suite, _case_name, data }

Cache key = SHA256(source_files_content + test_input)[0..15]

Any language uses the cache via CLI:

# Read (returns JSON to stdout, exit 0 on hit, exit 1 on miss)
gstack eval cache read my_suite abc123def456

# Write (reads JSON from stdin or argument)
gstack eval cache write my_suite abc123def456 '{"data": ...}'

# Management
gstack eval cache stats            # Per-suite file count, disk usage, date range
gstack eval cache verify           # Check all entries for validity
gstack eval cache clear [suite]    # Clear all or per-suite

Env vars: EVAL_CACHE=0 (disable), EVAL_CACHE_CLEAR=1 (clear before run), EVAL_JUDGE_CACHE=0 (skip cached judge scores — re-run LLM judges even if cached).

Judge responses are cached separately from eval data. This lets you re-run deterministic checks (text matching, length, tool calling) without re-calling expensive LLM judges.

Ported from eval_cache.rb — same atomic write (tmp+rename), same version/validation, same SHA computation.
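Both halves of that scheme — the truncated SHA key and the tmp+rename write — can be sketched as follows (hypothetical helper names; the planned implementation is lib/eval-cache.ts):

```typescript
import { createHash } from "node:crypto";
import { readFileSync, renameSync, writeFileSync } from "node:fs";

// Cache key: SHA256 of concatenated source-file contents plus the serialized
// test input, truncated to 16 hex chars ([0..15]) per the scheme above.
export function cacheKey(sourceFiles: string[], testInput: object): string {
  const sources = sourceFiles.map((f) => readFileSync(f, "utf8")).join("");
  return createHash("sha256")
    .update(sources + JSON.stringify(testInput))
    .digest("hex")
    .slice(0, 16);
}

// Atomic write: write a temp file, then rename — readers never see a
// partially written entry.
export function writeCacheEntry(path: string, entry: object): void {
  const tmp = `${path}.tmp.${process.pid}`;
  writeFileSync(tmp, JSON.stringify(entry));
  renameSync(tmp, path);
}
```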

Multiprocess Worker Support

For large test suites (60+ cases), eval workers run in parallel processes:

~/.gstack/eval-partials/{suite}/worker_{pid}.json

Each worker writes partial results. gstack eval push merges them before upload:

  1. Workers write worker_{pid}.json atomically (tmp+rename)
  2. Push reads all worker_*.json in the partials directory
  3. Deduplicates by test name (keeps longest duration_ms)
  4. Merges into a single result JSON
  5. Pushes merged result to Supabase

Env var: EVAL_WORKERS=4 (number of parallel processes, default 1).
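The merge step (3) can be sketched as follows (`mergePartials` is a hypothetical name; only the fields the merge needs are typed):

```typescript
interface PartialResult {
  name: string;        // test name — the dedup key
  duration_ms: number; // tie-breaker: keep the longest run
  [k: string]: unknown;
}

// Merge per-worker partial results, deduplicating by test name and keeping
// the entry with the longest duration_ms.
export function mergePartials(partials: PartialResult[][]): PartialResult[] {
  const byName = new Map<string, PartialResult>();
  for (const worker of partials) {
    for (const result of worker) {
      const existing = byName.get(result.name);
      if (!existing || result.duration_ms > existing.duration_ms) {
        byName.set(result.name, result);
      }
    }
  }
  return [...byName.values()];
}
```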

Eval Cost Tracker

Reads the costs array from result JSON. Terminal dashboard:

┌─────────────────────────────────────────────────────────────┐
│  EVAL COST DASHBOARD (standard tier)                        │
├──────────────────┬───────┬──────────┬──────────┬────────────┤
│ Model            │ Calls │ Input    │ Output   │ Est. Cost  │
├──────────────────┼───────┼──────────┼──────────┼────────────┤
│ sonnet-4-6       │   25  │   45,123 │   12,456 │ $0.1234    │
│ opus-4-6         │    5  │   78,900 │   45,123 │ $0.5678    │
├──────────────────┼───────┼──────────┼──────────┼────────────┤
│ TOTAL            │   30  │  124,023 │   57,579 │ $0.6912    │
│ At full tier: ~$0.9234  │  At fast tier: ~$0.3456           │
└─────────────────────────────────────────────────────────────┘

Also generates HTML dashboard and pushes aggregated costs to Supabase eval_costs table.
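The per-model estimate is simple arithmetic over the costs array. In the sketch below, the per-million-token rates are placeholder numbers for illustration — real pricing would come from configuration:

```typescript
// Placeholder $/million-token rates — assumptions, not real pricing.
const RATES_PER_MTOK: Record<string, { input: number; output: number }> = {
  "claude-sonnet-4-6": { input: 3, output: 15 },
};

// Estimated cost in USD for one entry of the `costs` array; 0 if unknown model.
export function estimateCost(c: {
  model: string;
  input_tokens: number;
  output_tokens: number;
}): number {
  const r = RATES_PER_MTOK[c.model];
  if (!r) return 0;
  return (c.input_tokens * r.input + c.output_tokens * r.output) / 1_000_000;
}
```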

Auto-Labeling

Label = EVAL_LABEL env || sanitized_git_branch
Append tier suffix: _fast, _full (omit for standard)
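A sketch of this rule (`autoLabel` is a hypothetical name; the sanitization regex is an assumption inferred from labels like dev_fix-terseness elsewhere in this doc):

```typescript
// Label = EVAL_LABEL env override, else sanitized branch; tier suffix
// appended per the rule above (omitted for standard).
export function autoLabel(branch: string, tier: string, envLabel?: string): string {
  const base = envLabel || branch.replace(/[^a-zA-Z0-9_-]+/g, "_");
  return tier === "standard" ? base : `${base}_${tier}`;
}
```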

Eval Tier & Run Scopes

Two orthogonal tier concepts:

Run scope — how much of the test suite to execute:

EVAL_TIER=quick      # Subset of cases (fast smoke test)
EVAL_TIER=standard   # Full suite (default)
EVAL_TIER=full       # Full suite + expensive multi-judge checks

Judge model tier — which model judges use:

EVAL_JUDGE_TIER=fast|standard|full
Aliases: haiku→fast, sonnet→standard, opus→full
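Resolving the variable with its aliases can be sketched as follows (`resolveJudgeTier` is a hypothetical name; defaulting to standard when the variable is unset is an assumption):

```typescript
// Model-name aliases listed above map onto the three judge tiers.
const TIER_ALIASES: Record<string, string> = { haiku: "fast", sonnet: "standard", opus: "full" };

export function resolveJudgeTier(raw: string | undefined): "fast" | "standard" | "full" {
  const value = (raw ?? "standard").toLowerCase();
  const tier = TIER_ALIASES[value] ?? value;
  if (tier !== "fast" && tier !== "standard" && tier !== "full") {
    throw new Error(`Invalid EVAL_JUDGE_TIER: ${raw}`);
  }
  return tier;
}
```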

Debug output:

EVAL_VERBOSE=1       # Persistent logging to ~/.gstack/log/evals/
                     # Format: YYYYMMDD-{test-name}-{random}.txt
                     # Includes full untruncated LLM inputs/outputs

CLI Commands

# Result management
gstack eval push <file.json>       # Push result to Supabase + local store
                                   # Dedup: skips insert if git_sha+label+tier already exists
gstack eval list [label]           # List all results (local + Supabase)
gstack eval compare [a] [b]        # Compare two runs — color-coded score deltas
gstack eval baselines [date]       # Generate markdown baseline report
gstack eval cost [file.json]       # Show cost dashboard from result

# Cache (any language, CLI interface)
gstack eval cache read <suite> <key>
gstack eval cache write <suite> <key> [data]
gstack eval cache stats
gstack eval cache clear [suite]
gstack eval cache verify

# Live monitoring
gstack eval watch                  # Browser dashboard (Bun.serve + SSE)

### Live Eval Dashboard (browser-based)

`gstack eval watch` starts a local Bun HTTP server and auto-opens the browser:

- Progress bar, pass/fail tally, cost accumulating in real-time
- Per-test results table updating as each test completes
- Estimated time remaining
- Live updates via Server-Sent Events (SSE) — simpler than WebSocket, one-directional
- Reuses browse server patterns: random port selection, state file, auto-shutdown
- Eval runner writes progress to a known file; dashboard reads and streams it
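The SSE wire format the dashboard relies on is just text framing: each update is an `event:`/`data:` block terminated by a blank line. A sketch of the framing helper, assuming progress updates are serialized as JSON; the `EvalProgress` shape and `sseFrame` name are illustrative, and a real server would write these frames to a `Bun.serve()` response stream.

```typescript
// Hypothetical progress payload pushed to the browser as each test completes.
interface EvalProgress {
  completed: number;
  total: number;
  passed: number;
  failed: number;
  costUsd: number;
}

// Serialize one Server-Sent Events frame: "event:" line, "data:" line,
// then a blank line to terminate the event.
function sseFrame(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

const frame = sseFrame("progress", {
  completed: 3, total: 10, passed: 3, failed: 0, costUsd: 0.04,
} satisfies EvalProgress);
```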

### Future: Native Eval Runner Mode

For projects that want gstack to run evals directly (YAML cases → Anthropic API → judge → result) without any app framework. Deferred as a separate initiative after adapter mode proves valuable.


## Supabase Schema

```sql
-- ═══════════════════════════════════════════════
-- Teams and membership
-- ═══════════════════════════════════════════════

create table teams (
  id uuid primary key default gen_random_uuid(),
  name text not null,
  slug text not null unique,
  created_at timestamptz default now()
);

create table team_members (
  team_id uuid references teams(id) on delete cascade,
  user_id uuid references auth.users(id) on delete cascade,
  role text not null default 'member'
    check (role in ('owner', 'admin', 'member')),
  joined_at timestamptz default now(),
  primary key (team_id, user_id)
);

-- ═══════════════════════════════════════════════
-- Eval results (merges gstack EvalResult + external project format)
-- ═══════════════════════════════════════════════

create table eval_runs (
  id uuid primary key default gen_random_uuid(),
  team_id uuid references teams(id) not null,
  version text not null,
  branch text not null,
  git_sha text not null,
  repo_slug text not null,
  label text not null,                -- auto-label (branch + tier suffix)
  timestamp timestamptz not null,
  hostname text not null,
  user_id uuid references auth.users(id),
  tier text not null
    check (tier in ('e2e', 'llm-judge', 'fast', 'standard', 'full')),
  total_tests int not null,
  passed int not null,
  failed int not null,
  total_cost_usd numeric(10,4) not null,
  total_duration_ms int not null,
  tests jsonb not null,               -- EvalTestEntry[] (transcripts stripped)
  judge_averages jsonb,               -- { criterion: avg_score } (aggregated)
  created_at timestamptz default now()
);

-- Eval cost tracking (per-model, per-run)
create table eval_costs (
  id uuid primary key default gen_random_uuid(),
  team_id uuid references teams(id) not null,
  eval_run_id uuid references eval_runs(id) on delete cascade,
  model text not null,
  calls int not null,
  input_tokens int not null,
  output_tokens int not null,
  estimated_cost_usd numeric(10,6) not null,
  created_at timestamptz default now()
);

-- ═══════════════════════════════════════════════
-- Skill data (retro, review, QA, ship)
-- ═══════════════════════════════════════════════

create table retro_snapshots (
  id uuid primary key default gen_random_uuid(),
  team_id uuid references teams(id) not null,
  repo_slug text not null,
  user_id uuid references auth.users(id),
  date date not null,
  window text not null,               -- '7d', '14d', '30d'
  metrics jsonb not null,             -- commits, LOC, test ratio, sessions, etc.
  authors jsonb not null,             -- per-contributor breakdown
  version_range jsonb,
  streak_days int,
  tweetable text,
  greptile jsonb,
  backlog jsonb,
  created_at timestamptz default now()
);

create table greptile_triage (
  id uuid primary key default gen_random_uuid(),
  team_id uuid references teams(id) not null,
  user_id uuid references auth.users(id),
  date date not null,
  repo text not null,                 -- owner/repo
  triage_type text not null
    check (triage_type in ('fp', 'fix', 'already-fixed')),
  file_pattern text not null,
  category text not null,             -- race-condition, null-check, security, etc.
  created_at timestamptz default now()
);

create table qa_reports (
  id uuid primary key default gen_random_uuid(),
  team_id uuid references teams(id) not null,
  repo_slug text not null,
  user_id uuid references auth.users(id),
  url text not null,
  mode text not null,                 -- full, quick, regression, diff-aware
  health_score numeric(5,2),
  issues jsonb,
  category_scores jsonb,
  report_markdown text,
  created_at timestamptz default now()
);

create table ship_logs (
  id uuid primary key default gen_random_uuid(),
  team_id uuid references teams(id) not null,
  repo_slug text not null,
  user_id uuid references auth.users(id),
  version text not null,
  branch text not null,
  pr_url text,
  review_findings jsonb,
  greptile_stats jsonb,
  todos_completed text[],
  test_results jsonb,
  created_at timestamptz default now()
);

-- ═══════════════════════════════════════════════
-- Session transcripts (opt-in only)
-- ═══════════════════════════════════════════════

create table session_transcripts (
  id uuid primary key default gen_random_uuid(),
  team_id uuid references teams(id) not null,
  user_id uuid references auth.users(id),
  session_id text not null,
  repo_slug text not null,
  messages jsonb not null,            -- [{role, display_text, tool_names, timestamp}]
  total_turns int,
  tools_used jsonb,                   -- {Bash: 8, Read: 3, ...}
  started_at timestamptz,
  ended_at timestamptz,
  created_at timestamptz default now()
);

-- ═══════════════════════════════════════════════
-- Indexes
-- ═══════════════════════════════════════════════

create index idx_eval_runs_team_label on eval_runs(team_id, label, timestamp desc);
create index idx_eval_runs_team_ts on eval_runs(team_id, timestamp desc);
create index idx_eval_costs_run on eval_costs(eval_run_id);
create index idx_retro_team_date on retro_snapshots(team_id, date desc);
create index idx_greptile_team_date on greptile_triage(team_id, date desc);
create index idx_qa_team_created on qa_reports(team_id, created_at desc);
create index idx_ship_team_created on ship_logs(team_id, created_at desc);

-- ═══════════════════════════════════════════════
-- Row Level Security (same pattern all tables)
-- ═══════════════════════════════════════════════

alter table teams enable row level security;
alter table team_members enable row level security;
alter table eval_runs enable row level security;
alter table eval_costs enable row level security;
alter table retro_snapshots enable row level security;
alter table greptile_triage enable row level security;
alter table qa_reports enable row level security;
alter table ship_logs enable row level security;
alter table session_transcripts enable row level security;

-- Team members can read their team's data
create policy "team_read" on eval_runs for select using (
  team_id in (select team_id from team_members where user_id = auth.uid())
);
create policy "team_insert" on eval_runs for insert with check (
  team_id in (select team_id from team_members where user_id = auth.uid())
);
-- Only admins/owners can delete
create policy "team_admin_delete" on eval_runs for delete using (
  team_id in (select team_id from team_members
    where user_id = auth.uid() and role in ('owner', 'admin'))
);
-- (Repeat for all data tables)
```
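On the client side, the row written into `eval_runs.tests` has the heavy `transcript` field removed from each entry before anything leaves the machine (full transcripts stay local-only). A sketch of that preparation step, assuming the shape below; `stripTranscripts` is a hypothetical helper, and the actual Supabase upsert call is elided.

```typescript
// Minimal shape of a test entry; real entries carry more fields, which the
// index signature passes through untouched.
interface EvalTestEntry {
  name: string;
  passed: boolean;
  transcript?: string;              // large; never pushed to Supabase
  [key: string]: unknown;
}

// Drop the transcript from every entry before building the eval_runs row.
function stripTranscripts(
  tests: EvalTestEntry[],
): Omit<EvalTestEntry, "transcript">[] {
  return tests.map(({ transcript, ...rest }) => rest);
}

const pushable = stripTranscripts([
  { name: "retro-summarizes-week", passed: true, transcript: "x".repeat(50_000) },
]);
```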

## Dashboard Queries Unlocked

```sql
-- Eval regression detection
select label, timestamp, passed, total_tests,
  passed::float / total_tests as pass_rate
from eval_runs where team_id = $1
order by timestamp desc limit 20;

-- Team velocity (PRs per week per person)
select date_trunc('week', created_at) as week,
  user_id, count(*) as ships
from ship_logs where team_id = $1
group by 1, 2 order by 1 desc;

-- Cost trending
select date_trunc('week', created_at) as week,
  sum(estimated_cost_usd) as total_cost,
  sum(input_tokens + output_tokens) as total_tokens
from eval_costs where team_id = $1
group by 1 order by 1 desc;

-- Greptile signal quality
select category,
  count(*) filter (where triage_type = 'fp') as fps,
  count(*) filter (where triage_type = 'fix') as fixes,
  round(count(*) filter (where triage_type = 'fp')::numeric / count(*) * 100) as fp_pct
from greptile_triage where team_id = $1
group by category order by count(*) desc;

-- QA health trending
select created_at::date, repo_slug, health_score
from qa_reports where team_id = $1
order by created_at desc;
```
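The regression-detection query returns rows newest-first; flagging a regression on the client is then a simple pass-rate comparison. A sketch under those assumptions (the `RunRow` shape mirrors the selected columns; the threshold and `isRegression` name are illustrative):

```typescript
// Row shape matching the regression-detection query's selected columns.
interface RunRow {
  label: string;
  passed: number;
  total_tests: number;
}

function passRate(r: RunRow): number {
  return r.total_tests === 0 ? 0 : r.passed / r.total_tests;
}

// `runs` is assumed newest-first, matching ORDER BY timestamp DESC.
// Flags a drop in pass rate beyond `threshold` between the two latest runs.
function isRegression(runs: RunRow[], threshold = 0.05): boolean {
  if (runs.length < 2) return false;
  return passRate(runs[1]) - passRate(runs[0]) > threshold;
}
```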

## Integration Points (critical existing files)

| Integration | File | Change |
|---|---|---|
| Eval push | `test/helpers/eval-store.ts:420` (`finalize()`) | After local write, call `pushEvalRun()` |
| Eval judge | `test/helpers/llm-judge.ts` | Extend with multi-judge judging, tier selection |
| Retro push | `retro/SKILL.md.tmpl` Step 13 | Bash call: `gstack-sync push-retro "$FILE"` |
| Greptile push | `review/greptile-triage.md` | After append, call `gstack-sync push-greptile` |
| QA push | `qa/SKILL.md.tmpl` Phase 6 | After baseline, call `gstack-sync push-qa` |
| Ship push | `ship/SKILL.md.tmpl` new Step 9 | Write ship log + push |
| Config reuse | `browse/src/config.ts` | Import `getRemoteSlug()`, `getGitRoot()` |
| User settings | `bin/gstack-config` | Reuse for sync preferences (`sync_enabled`, `sync_transcripts`) |
| Atomic write | `eval-store.ts:413-416` | Extract shared `atomicWriteJSON()` utility |
| Eval watch | `scripts/eval-watch.ts` | Adapt for browser-based SSE dashboard |
| Comparison | `eval-store.ts:167` `compareEvalResults()` | Extend with color-coded diff + cross-team |

## New Files

```text
gstack/
├── lib/                             # Shared library
│   ├── sync.ts                      # Supabase client, push/pull, token refresh
│   ├── sync-config.ts               # .gstack-sync.json + ~/.gstack/auth.json
│   ├── auth.ts                      # Device auth flow, token management
│   ├── eval-cache.ts                # SHA-based cache (ported from eval_cache.rb)
│   ├── eval-cost.ts                 # Token accumulator + dashboards
│   ├── eval-tier.ts                 # Model tier selection (fast/standard/full)
│   ├── eval-baselines.ts            # Markdown baseline generator
│   ├── eval-format.ts               # Standard result format validation + helpers
│   └── util.ts                      # atomicWriteJSON(), numberWithCommas()
├── bin/
│   ├── gstack-sync                  # Bash wrapper (setup, init, pull, status, migrate)
│   └── gstack-eval                  # Bun entry (push, cache, list, compare, etc.)
├── eval/
│   ├── watch-server.ts              # Bun.serve() for live eval dashboard
│   └── watch-ui.html                # SSE-powered live dashboard page
├── supabase/
│   └── migrations/
│       ├── 001_teams.sql
│       ├── 002_eval_runs_and_costs.sql
│       ├── 003_skill_data.sql
│       └── 004_rls_policies.sql
├── docs/
│   └── eval-result-format.md        # Standard format spec for any language
├── .gstack-sync.json.example
└── test/lib/
    ├── sync.test.ts
    ├── eval-cache.test.ts
    ├── eval-cost.test.ts
    └── eval-format.test.ts
```

## Phased Rollout

### Phase 1: Foundation + eval infrastructure

- `lib/sync.ts`, `lib/auth.ts`, `lib/sync-config.ts`, `lib/util.ts`
- `bin/gstack-sync` (setup, init, pull, status, migrate)
- Supabase migrations (teams, team_members, eval_runs, eval_costs)
- Standard eval result format spec (`docs/eval-result-format.md`, `lib/eval-format.ts`)
- `bin/gstack-eval` (push, list, compare, cost, cache)
- `lib/eval-cache.ts` (port from existing Rails eval cache pattern)
- `lib/eval-cost.ts` (port from existing Rails cost tracker pattern)
- `lib/eval-tier.ts` (fast/standard/full model mapping)
- Hook `EvalCollector.finalize()` → auto-push when sync configured
- YAML test case format spec + `yaml` npm dependency
- First-run team welcome in `gstack sync setup`
- Color-coded visual diff in `gstack eval compare`

### Phase 2: Ship logs + Greptile + skill sync + live dashboard

- Add `ship_logs`, `greptile_triage` tables
- Ship log local write + push (new Step 9 in ship template)
- Greptile triage push after append
- `gstack eval watch` — live browser dashboard (Bun.serve + SSE)
- `lib/eval-baselines.ts` (markdown baseline generator)
- Inline sync indicator in skill output ("Synced to team ✓")

### Phase 3: Retro + QA + transcript sync

- Add `retro_snapshots`, `qa_reports`, `session_transcripts` tables
- Hook retro and QA write paths
- Opt-in transcript sync

### Phase 4: Team dashboard + edge functions

- `gstack dashboard` — team-wide HTML dashboard, reads from Supabase
- Supabase edge function: regression alerts on `eval_runs` INSERT
- Weekly digest edge function (cron → email/Slack)
- Team admin commands (create, invite)
- `gstack eval leaderboard` — fun weekly team stats

## Data Flows

### Push (write) flow — all four paths

```text
  Skill writes local file
         │
         ▼
  loadSyncConfig()
         │
    ┌────┴────┐
    │ config? │
    │         │
   NO        YES
    │         │
    ▼         ▼
  RETURN   refreshTokenIfNeeded()
  (noop)      │
         ┌────┴────┐
         │ token   │
         │ valid?  │
        NO        YES
         │         │
         ▼         ▼
      queue to   supabase.from(table).upsert(data)
      sync-         │
      queue.    ┌───┴───────┬──────────┐
      json      │           │          │
               OK      TIMEOUT     ERROR
                │       (5s)        │
                ▼         │         ▼
             DONE      queue to   log warning
                       sync-      + queue
                       queue.json

  NIL PATH:   .gstack-sync.json missing → noop
  EMPTY PATH: sync_enabled=false → noop
  ERROR PATH: Supabase unreachable → 5s timeout → queue + continue
```
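The timeout branch of this flow can be sketched by racing the upsert against a 5s timer, queueing on either timeout or error so the skill never fails. This is a sketch, not the shipped `lib/sync.ts`; `upsert` and `enqueue` are injected stand-ins for the Supabase call and the sync-queue.json writer.

```typescript
// Hypothetical push wrapper: resolve "ok" on success, otherwise queue and
// resolve "queued" — the caller (a skill) never sees a rejection.
async function pushWithTimeout(
  upsert: () => Promise<void>,            // stand-in for supabase upsert
  enqueue: (reason: string) => void,      // stand-in for sync-queue.json write
  timeoutMs = 5000,
): Promise<"ok" | "queued"> {
  const timer = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), timeoutMs),
  );
  try {
    const result = await Promise.race([
      upsert().then(() => "ok" as const),
      timer,
    ]);
    if (result === "ok") return "ok";
    enqueue("timeout");                   // TIMEOUT path: queue + continue
    return "queued";
  } catch (err) {
    enqueue(String(err));                 // ERROR path: log warning + queue
    return "queued";
  }
}
```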

### Pull-to-cache (read) flow

```text
  gstack sync pull
         │
         ▼
  loadSyncConfig()
         │
    ┌────┴────┐
    │ config? │
   NO        YES
    │         │
    ▼         ▼
  skip     supabase.from(table).select(...)
             │
        ┌───┴──────┬──────────┐
        │          │          │
       OK      TIMEOUT     ERROR
        │       (3s)        │
        ▼          │         ▼
     write to    keep       keep
     cache/      stale      stale
        │        cache      cache
        ▼
     update
     .meta.json
```
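The `.meta.json` written at the end of a pull is what lets skills render "team data as of 2h ago". A sketch of that age computation, with the meta-file shape assumed (only `last_pull` matters here) and `cacheAge` as an illustrative helper name:

```typescript
// Assumed shape of .gstack/team-cache/.meta.json.
interface CacheMeta {
  last_pull: string;                  // ISO timestamp of last successful pull
  rows: Record<string, number>;       // per-table row counts
}

// Human-readable staleness for status output and skill banners.
function cacheAge(meta: CacheMeta, now: Date = new Date()): string {
  const ms = now.getTime() - new Date(meta.last_pull).getTime();
  const hours = Math.floor(ms / 3_600_000);
  if (hours >= 1) return `${hours}h old`;
  return `${Math.max(0, Math.floor(ms / 60_000))}m old`;
}
```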

## Error & Rescue Map

```text
METHOD/CODEPATH              | WHAT CAN GO WRONG              | RESCUED? | ACTION                    | USER SEES
-----------------------------|--------------------------------|----------|---------------------------|------------------
loadSyncConfig()             | .gstack-sync.json missing      | Y        | Return null → noop        | Nothing
                             | JSON malformed                 | Y        | Log warning, return null  | Nothing
                             | auth.json missing              | Y        | Return null → noop        | Nothing
refreshToken()               | Supabase auth down             | Y        | Queue + continue          | Nothing
                             | Token revoked                  | Y        | Clear token, prompt setup | "Run gstack sync setup"
pushEvalRun() (all push*)    | Supabase 503                   | Y        | Queue for retry           | Nothing
                             | Network timeout (5s)           | Y        | Queue for retry           | Nothing
                             | Rate limit (429)               | Y        | Backoff + queue           | Nothing
                             | RLS violation (403)            | Y        | Log, skip                 | Warning in status
                             | Duplicate (409)                | Y        | Ignore (idempotent)       | Nothing
                             | Token expired                  | Y        | Refresh → retry once      | Nothing
pullToCache()                | Supabase timeout (3s)          | Y        | Use stale cache           | Stale data
                             | Empty result set               | Y        | Write empty cache         | Nothing
                             | Cache dir EACCES               | Y        | Log warning               | Warning in status
                             | Cache JSON corrupt             | Y        | Delete + re-pull          | Nothing
queueForRetry()              | Queue file EACCES              | Y        | Log, data lost            | Warning in status
drainQueue()                 | Partial failure                | Y        | Failed items stay queued  | Nothing
pushTranscript()             | history.jsonl EBUSY            | Y        | Skip this cycle           | Nothing
gstack sync setup            | OAuth timeout                  | Y        | Clear error message       | Error
                             | Localhost port in use          | Y        | Try 3 ports               | Error if all fail
                             | Already authenticated          | Y        | "Re-auth or keep?"        | Prompt
gstack sync init             | Tables already exist           | Y        | Idempotent (IF NOT EXISTS)| Nothing
                             | Service key invalid            | Y        | Clear error               | Error
```

All 23 error paths listed are rescued. 0 critical gaps.
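The "Backoff + queue" action for 429s implies a delay schedule. A sketch of the exponential-with-cap rule, noting that the base delay and cap are illustrative constants, not values taken from the implementation:

```typescript
// Hypothetical backoff schedule for rate-limited pushes:
// attempt 0 → 500ms, 1 → 1s, 2 → 2s, ... capped at 30s, after which the
// payload would simply wait in sync-queue.json for the next drain.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```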


## Security & Threat Model

| # | Threat | Likelihood | Impact | Mitigated? | How |
|---|---|---|---|---|---|
| 1 | Anon key exposed in repo | Certain | LOW | YES | By Supabase design — RLS enforces access |
| 2 | Auth token stolen from auth.json | Low | HIGH | YES | 0o600, per-machine, auto-expire |
| 3 | MITM on Supabase HTTPS | Very Low | HIGH | YES | TLS 1.2+, Supabase cert management |
| 4 | RLS bypass via malformed JWT | Low | HIGH | YES | Supabase validates JWTs server-side |
| 5 | Cross-team data leak via REST API | Low | HIGH | YES | RLS on all tables |
| 6 | CI token leaked via logs | Medium | HIGH | PARTIAL | Document short-lived + scoped tokens |
| 7 | Transcript contains secrets | Medium | MEDIUM | YES | Opt-in = consent, trust the team |
| 8 | sync-queue.json has pending data | Medium | LOW | YES | 0o600 on file |
| 9 | Service role key in shell history | Low | CRITICAL | YES | Prompt-based, never stored, or env var |
| 10 | Supabase JS SDK supply chain | Very Low | HIGH | PARTIAL | Pin version, audit |

## Observability

### Sync log

`~/.gstack/sync.log` — append-only, one line per operation:

```text
[2026-03-15T10:30:00Z] PUSH eval_runs OK 5 tests, 0.3s
[2026-03-15T10:30:01Z] PUSH retro_snapshots QUEUED timeout after 5s
[2026-03-15T10:35:00Z] DRAIN 47/47 OK 2.1s
```
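A sketch of a formatter producing lines in the shape shown above; `syncLogLine` is a hypothetical helper, and the outcome/detail vocabulary is taken from the examples rather than from a spec.

```typescript
// Hypothetical formatter for one append-only sync.log line:
// [ISO timestamp] OPERATION target OUTCOME detail
function syncLogLine(
  op: "PUSH" | "PULL" | "DRAIN",
  target: string,                       // table name, or "47/47" for drains
  outcome: "OK" | "QUEUED",
  detail: string,
  at: Date = new Date(),
): string {
  const ts = at.toISOString().replace(/\.\d{3}Z$/, "Z"); // drop milliseconds
  return `[${ts}] ${op} ${target} ${outcome} ${detail}`;
}
```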

### Status command

```text
$ gstack sync status
─────────────────
  Project:       .gstack-sync.json (supabase_url: https://xyzcompany.supabase.co)
  User settings: sync_enabled=true, sync_transcripts=false (via gstack-config)
  Connected:     yes
  Authenticated: yes (dev@company.com, team: xyzcompany)
  Last push:     2 min ago (eval_runs)
  Last pull:     1h ago
  Queue:         0 items
  Cache:         retro: 47 rows (2h old), eval: 123 rows (2h old)
  Sync log:      ~/.gstack/sync.log (1.2KB)
```

### Inline sync in skills

After /ship or /retro completes, the skill prints one of:

- `Synced to team ✓`
- `Queued (offline)`
- nothing (sync not configured)


## What Already Exists (reuse map)

| Existing code | File | Reuse |
|---|---|---|
| `EvalCollector` + `finalize()` | `test/helpers/eval-store.ts:420` | Hook for eval push |
| `getRemoteSlug()` | `browse/src/config.ts:119` | Repo identification |
| `getGitRoot()` | `browse/src/config.ts:28` | Project root detection |
| Atomic write (tmp+rename) | `eval-store.ts:413-416` | Extract to `atomicWriteJSON()` |
| Bash wrapper pattern | `bin/gstack-update-check` | Template for `bin/gstack-sync` + `bin/gstack-eval` |
| 0o600 state file | `browse/src/server.ts` | Pattern for auth.json |
| `compareEvalResults()` | `eval-store.ts:167` | Extend for cross-team |
| `formatComparison()` | `eval-store.ts:267` | Extend with color diff |
| `llm-judge.ts` | `test/helpers/llm-judge.ts` | Extend with multi-judge |
| `eval-watch.ts` | `scripts/eval-watch.ts` | Adapt for browser SSE |
| `gstack-config` get/set/list | `bin/gstack-config` | User settings for sync preferences (v0.3.9) |

## What's NOT in Scope

| Item | Rationale |
|---|---|
| Native eval runner mode | Adapter-only first. Future TODO after adapter proves out. |
| Hosted gstack cloud service | Self-hosted Supabase per team. |
| Cross-team benchmarking | Phase 5+ — needs anonymization + multi-team opt-in. |
| Porting existing eval runners | Runners stay in their source language. gstack is infrastructure. |
| Real-time sync (WebSocket) | Push-on-write + cache pull is sufficient. |
| Transcript scrubbing | Trust the team. Opt-in = consent. |

## Risks & Mitigations

| Risk | Mitigation |
|---|---|
| Supabase adds a dependency | `@supabase/supabase-js` imported conditionally. If missing or unconfigured, all sync functions return immediately. Zero impact on non-sync users. |
| Sync failures slow down skills | All push: 5s timeout, non-fatal. All pull: cache-based, skills never block on network. |
| Large eval transcripts | Strip `transcript` field from `EvalTestEntry` before push. Full transcripts stay local-only. |
| Token expiry mid-session | Auto-refresh before each push. If refresh fails, queue to sync-queue.json for retry. |
| Schema drift | Flexible fields use jsonb. Only fields needed for indexing/querying are proper columns. `schema_version` for forward compat. |
| Queue overflow | No cap. Warn via `gstack sync status` if >100 items or oldest entry >24h. |
| Concurrent queue writes | Atomic read-modify-write via `atomicWriteJSON()` (tmp+rename pattern). |
| Cache staleness | `.meta.json` tracks last_pull + row counts per table. Skills can display "team data as of 2h ago". |
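The tmp+rename pattern behind the atomic-write mitigation can be sketched in a few lines. This is a sketch of the general technique, not the extracted `atomicWriteJSON()` from `eval-store.ts`; the temp-file naming scheme is an assumption.

```typescript
import { writeFileSync, renameSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Write to a sibling temp file, then rename over the target. Readers either
// see the old complete document or the new one — never a half-written JSON.
function atomicWriteJSON(path: string, data: unknown): void {
  const tmp = `${path}.tmp-${process.pid}-${Date.now()}`; // assumed naming
  writeFileSync(tmp, JSON.stringify(data, null, 2), "utf8");
  renameSync(tmp, path); // atomic on POSIX when both paths share a filesystem
}

// Example: persisting a sync queue.
const queuePath = join(tmpdir(), `gstack-demo-queue-${process.pid}.json`);
atomicWriteJSON(queuePath, { queue: [1, 2, 3] });
```

Rename is only atomic when the temp file lives on the same filesystem as the target, which is why the temp file is a sibling of the destination rather than in a system temp dir.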

## Verification Plan

1. `gstack sync setup` → complete auth → verify `~/.gstack/auth.json` written with 0o600
2. `gstack eval push result.json` → verify row in Supabase dashboard
3. `gstack eval cache stats` → verify cache populated after eval run
4. `gstack eval compare main feature-branch` → verify color-coded delta output
5. `gstack eval cost result.json` → verify cost dashboard renders
6. `gstack sync pull` → verify `.gstack/team-cache/` populated with `.meta.json`
7. Offline test: disconnect network → run evals → reconnect → verify queued syncs drain
8. `/ship` → verify ship log in Supabase
9. `/retro` → verify team data from cache appears in output
10. `gstack sync status` → verify health output (connected, authenticated, queue, cache)

## Review Decisions Log

All decisions from the /plan-ceo-review session on 2026-03-15:

| # | Question | Options | Chosen | Rationale |
|---|---|---|---|---|
| 0F | Mode selection | Expansion / Hold / Reduction | EXPANSION | Greenfield team infra, cathedral-tier vision |
| 1 | Read-side architecture | Cache / Direct / Hybrid | Cache-based | Skills never touch network. "Sync is invisible" invariant. |
| 2 | Queue overflow | Cap / Warn / Both | Warn only | Don't silently drop data. Surface via status. |
| 3 | Transcript secrets | Scrub / Trust / Metadata-only | Trust the team | Supabase is encrypted. Opt-in = consent. |
| 4 | Cache staleness | Meta file / File mtime / None | Meta file | `.meta.json` gives skills + status a single source of truth. |
| 5 | Queue drain performance | Parallel / Sequential / Background | Parallel 10x | 500 items in ~10s vs 100s. |
|  | Scope expansion | Full convergence / Eval sync only / Defer | Full convergence | Existing Rails eval infra + gstack team sync = universal platform |
|  | Integration mode | Native + Adapter / Native only / Adapter only | Adapter only | App runs evals, gstack is infrastructure. Start with C, add B as TODO. |
|  | Case format | YAML / JSON / Both | YAML cases, JSON results | YAML for human-authored (comments, multiline), JSON for machine output. |
| T1 | Regression alerts | TODOS / Skip / Build Phase 4 | Phase 4 | Killer feature of team sync. |
| T2 | Weekly digest | TODOS / Skip / Build Phase 4 | Phase 4 | Passive team visibility. |
| T3 | Eval case format spec | Phase 1 / TODOS / Port directly | Phase 1 | Foundational to eval CLI. |
| D1 | Live eval dashboard | Phase 1 / TODOS / Phase 4 | Phase 2 | Bun.serve + SSE, reuses browse patterns. |
| D2 | Team leaderboard | TODOS / Skip / Phase 4 | Phase 4 | Fun gamification alongside dashboard. |
| D3 | Inline sync indicator | Phase 2 / TODOS / Skip | Phase 2 | XS effort, builds trust in sync. |
| D4 | First-run welcome | Phase 1 / TODOS / Skip | Phase 1 | Part of setup flow. |
| D5 | Visual eval diff | Phase 1 / TODOS / Skip | Phase 1 | Color-coded compare is essential UX. |