docs: scrub proprietary refs, close eval format gaps, integrate gstack-config

- Replace project-specific references with generic language
- Add missing fields to eval result format: prompt_sha, by_category,
  timestamp, response_preview
- Enrich failure format with details array, scores dict, expectation_type
- Add EVAL_JUDGE_CACHE, EVAL_VERBOSE, multiprocess worker support,
  dedup on push, run scopes, model aliases, judge profiles
- Restructure credential storage to 4 layers with gstack-config (v0.3.9)
  for user preferences (sync_enabled, sync_transcripts)
- Update integration points, observability, and reuse map

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-15 01:47:30 -05:00
parent 89311653df
commit 5c1ea088d8
+112 -14
@@ -44,7 +44,7 @@ This works for solo developers. For teams on vendored gstack, it means:
- **Zero shared visibility** into code quality, shipping velocity, or eval regressions
- **No cross-contributor comparison** — each developer's data is isolated on their machine
- **No regression detection** — an eval suite can regress and nobody notices until production breaks
-- **Duplicated infrastructure** — Garry has another project with a sophisticated eval system (60+ runners, S3 storage, caching, cost tracking, baselines) locked inside Ruby/Rails that solves the same problems gstack solves in Bun/TS
+- **Duplicated infrastructure** — the author runs another project with a sophisticated eval system (60+ runners, S3 storage, caching, cost tracking, baselines) locked inside Ruby/Rails that solves the same problems gstack solves in Bun/TS
---
@@ -178,7 +178,7 @@ All decisions were made during the CEO-mode plan review on 2026-03-15.
└─────────────────┘ └─────────────────┘
```
-### Credential Storage: 3 Layers
+### Config & Credential Storage: 4 Layers
**Layer 1: Project config — `.gstack-sync.json` (committed to repo)**
@@ -186,15 +186,37 @@ All decisions were made during the CEO-mode plan review on 2026-03-15.
{
"supabase_url": "https://xyzcompany.supabase.co",
"supabase_anon_key": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
-"team_slug": "xyzcompany",
-"sync_enabled": true,
-"sync_transcripts": false
+"team_slug": "xyzcompany"
}
```
The anon key is **safe to commit**. This is Supabase's design — the anon key only grants access through RLS policies, which require a valid user JWT. It's the same key that ships in every Supabase client-side app. Without a valid user token, the anon key gets you nothing.
-**Layer 2: User auth — `~/.gstack/auth.json` (mode 0o600, never committed)**
+Project-level only: Supabase URL, anon key, team slug. No user preferences here — those are per-developer (Layer 2).
+**Layer 2: User settings — `~/.gstack/config.yaml` (via `gstack-config`)**
```yaml
# Existing settings (v0.3.9)
auto_upgrade: true
update_check: true
# New sync settings
sync_enabled: true # enable/disable team sync (per-user)
sync_transcripts: false # opt-in transcript sharing (per-user)
```
Managed via the existing `gstack-config` CLI (`bin/gstack-config`):
```bash
gstack-config get sync_enabled # → "true" or ""
gstack-config set sync_enabled true
gstack-config set sync_transcripts false
gstack-config list # → all settings
```
Rationale: `sync_enabled` and `sync_transcripts` are **user preferences**, not project config. One developer might want sync off while the rest of the team has it on. `gstack-config` already handles this pattern for `auto_upgrade` and `update_check`.
**Layer 3: User auth — `~/.gstack/auth.json` (mode 0o600, never committed)**
```json
{
@@ -211,7 +233,7 @@ The anon key is **safe to commit**. This is Supabase's design — the anon key o
Keyed by `supabase_url` so developers on multiple teams/projects just work. Written with `chmod 0o600` — same pattern as `browse.json` in `browse/src/server.ts`.
-**Layer 3: Admin bootstrap — one-time Supabase project setup**
+**Layer 4: Admin bootstrap — one-time Supabase project setup**
```bash
# Admin runs once to set up the project:
@@ -226,7 +248,7 @@ CI/automation uses `GSTACK_SUPABASE_ACCESS_TOKEN` env var.
### Auth Flow
-`gstack sync setup` reads URL from `.gstack-sync.json` → opens browser for OAuth or magic link → polls for completion → writes tokens to `~/.gstack/auth.json` (mode 0o600).
+`gstack sync setup` reads URL from `.gstack-sync.json` → opens browser for OAuth or magic link → polls for completion → writes tokens to `~/.gstack/auth.json` (mode 0o600) → sets `sync_enabled=true` via `gstack-config`.
On first successful auth, shows a team welcome: "3 members, 47 eval runs this week, last ship 2h ago."
@@ -256,7 +278,7 @@ For skills (retro, review, qa, ship), sync happens via `bin/gstack-sync` called
### Opt-in Transcript Sync
-When `"sync_transcripts": true` in `.gstack-sync.json`:
+When `sync_transcripts: true` in `~/.gstack/config.yaml` (set via `gstack-config set sync_transcripts true`):
- `gstack-sync push-transcript` reads `~/.claude/history.jsonl` (new entries since last sync marker)
- Stores in `session_transcripts` table with RLS policy (admin-only read by default)
- No scrubbing — trust the team. Opt-in = consent. Same trust model as a shared Slack channel.
@@ -288,7 +310,7 @@ Your eval runners keep their language, their models, their service objects. gsta
### What We're Porting from an Existing Rails Project
-Garry has another project with a production-grade eval infrastructure in Ruby/Rails. The patterns are general-purpose and worth extracting into gstack as framework-agnostic infrastructure:
+The author runs another project with a production-grade eval infrastructure in Ruby/Rails. The patterns are general-purpose and worth extracting into gstack as framework-agnostic infrastructure:
- **60+ eval runners** with YAML test cases
- **Multi-judge LLM evaluation** — multiple judge profiles scoring on 8+ quality criteria
@@ -338,14 +360,20 @@ and A/B comparison testing.
{
"schema_version": 1,
"label": "dev_fix-terseness_standard",
"timestamp": "2026-03-15T10:30:00Z",
"git_sha": "abc123",
"git_branch": "dev/fix-terseness",
"prompt_sha": "a08ff469",
"hostname": "dev-machine",
"tier": "standard",
"total": 18,
"passed": 17,
"failed": 1,
"duration_seconds": 893.4,
"by_category": {
"post_generation": { "passed": 16, "total": 17 },
"tool_usage": { "passed": 1, "total": 1 }
},
"all_results": [
{
"name": "must_cite_sources",
@@ -354,6 +382,7 @@ and A/B comparison testing.
"duration_ms": 45000,
"failures": [],
"judge_scores": { "accuracy": 0.85, "voice_fidelity": 0.72 },
"response_preview": "The proposed legislation would...",
"output": {},
"comparison": null
}
@@ -385,7 +414,7 @@ populate different keys. gstack stores as-is (JSONB) for display/comparison:
"items": [{"id": "claim_1", "severity": "yellow", "commentary": "..."}],
"chunks": ["chunk 1 text", "chunk 2 text"],
"clusters": [{"theme": "Housing", "articles": ["..."]}],
-"memories": [{"content": "Lives in SF", "category": "personal"}],
+"memories": [{"content": "Enjoys cycling", "category": "personal"}],
"extracted_fields": {"occupation": "engineer", "city": "Oakland"},
"title": "Generated title",
"structured_content": "Full article body..."
@@ -416,12 +445,23 @@ populate different keys. gstack stores as-is (JSONB) for display/comparison:
"failures": [
{
"type": "threshold",
"expectation_type": "voice_check",
"message": "Voice check failed: 2 of 5 criteria below threshold",
"criterion": "voice_fidelity",
"expected": 0.7,
-"actual": 0.58
+"actual": 0.58,
"details": [
{"criterion": "no_hedging", "score": 0.3, "threshold": 0.7},
{"criterion": "direct_tone", "score": 0.4, "threshold": 0.6}
],
"scores": {
"no_hedging": 0.3, "no_filler": 0.6, "direct_tone": 0.4,
"uses_specifics": 0.8, "operator_energy": 0.9
}
},
{
"type": "deterministic",
"expectation_type": "body_contains",
"check": "body_contains",
"pattern": "Series B",
"message": "Pattern not found in output"
@@ -430,6 +470,10 @@ populate different keys. gstack stores as-is (JSONB) for display/comparison:
}
```
Fields: `type` = generic class (`threshold` | `deterministic`). `expectation_type` = domain-specific
check name from the YAML case. `details` = per-criterion breakdown for multi-criteria checks.
`scores` = ALL scores (passing + failing) for context. `message` = human-readable summary.
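As a sketch, the failure record described above maps to a shape like the following — the interface and helper names are illustrative assumptions, not gstack's actual types:

```typescript
type FailureType = "threshold" | "deterministic";

// Shape of one entry in the `failures` array, per the fields described above.
interface EvalFailure {
  type: FailureType;                // generic class
  expectation_type: string;         // domain-specific check name from the YAML case
  message: string;                  // human-readable summary
  criterion?: string;               // threshold: headline criterion that tripped
  expected?: number;
  actual?: number;
  details?: { criterion: string; score: number; threshold: number }[];
  scores?: Record<string, number>;  // ALL scores (passing + failing) for context
  check?: string;                   // deterministic: check name
  pattern?: string;                 // deterministic: pattern searched for
}

// Illustrative helper: which criteria in a multi-criteria check fell short.
function failedCriteria(f: EvalFailure): string[] {
  return (f.details ?? [])
    .filter((d) => d.score < d.threshold)
    .map((d) => d.criterion);
}
```

With the example data above, only the criteria whose score sits below its threshold appear in the result, even though `scores` carries every criterion.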
### YAML Test Case Format
Human-authored, comments supported, multiline strings via `|` blocks.
@@ -611,6 +655,10 @@ expectations:
no_hedging: 0.6
direct_tone: 0.6
uses_specifics: 0.6
- type: quality_check
judge_profile: strict # named profile (defined in judge-profiles.yaml)
criteria:
accuracy: 0.8 # profile can override default thresholds
# ── A/B testing (optional) ─────────────────────────
# comparison:
@@ -739,10 +787,31 @@ gstack eval cache verify # Check all entries for validity
gstack eval cache clear [suite] # Clear all or per-suite
```
-Env vars: `EVAL_CACHE=0` (disable), `EVAL_CACHE_CLEAR=1` (clear before run).
+Env vars: `EVAL_CACHE=0` (disable), `EVAL_CACHE_CLEAR=1` (clear before run),
+`EVAL_JUDGE_CACHE=0` (skip cached judge scores — re-run LLM judges even if cached).
Judge responses are cached separately from eval data. This lets you re-run deterministic
checks (text matching, length, tool calling) without re-calling expensive LLM judges.
Ported from `eval_cache.rb` — same atomic write (tmp+rename), same version/validation, same SHA computation.
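A plausible shape for the separate judge cache key — a sketch only; the actual key components in `eval_cache.rb` may differ:

```typescript
import { createHash } from "node:crypto";

// Hypothetical judge-cache key. Judge responses are keyed independently of
// eval data, so deterministic checks can re-run without invalidating judges,
// and changing the judge model or prompt naturally misses the cache.
function judgeCacheKey(judgeModel: string, judgePrompt: string, response: string): string {
  return createHash("sha256")
    .update([judgeModel, judgePrompt, response].join("\x00")) // NUL-separated to avoid collisions
    .digest("hex");
}
```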
### Multiprocess Worker Support
For large test suites (60+ cases), eval workers run in parallel processes:
```
~/.gstack/eval-partials/{suite}/worker_{pid}.json
```
Each worker writes partial results. `gstack eval push` merges them before upload:
1. Workers write `worker_{pid}.json` atomically (tmp+rename)
2. Push reads all `worker_*.json` in the partials directory
3. Deduplicates by test name (keeps longest `duration_ms`)
4. Merges into a single result JSON
5. Pushes merged result to Supabase
Env var: `EVAL_WORKERS=4` (number of parallel processes, default 1).
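Steps 2–4 above can be sketched as follows — `mergePartials` and `PartialResult` are assumed names for illustration:

```typescript
interface PartialResult {
  name: string;        // test name — the dedup key
  duration_ms: number;
  status?: string;
}

// Merge all worker partials into one list: dedup by test name,
// keeping the entry with the longest duration_ms.
function mergePartials(workerFiles: PartialResult[][]): PartialResult[] {
  const byName = new Map<string, PartialResult>();
  for (const result of workerFiles.flat()) {
    const seen = byName.get(result.name);
    if (!seen || result.duration_ms > seen.duration_ms) {
      byName.set(result.name, result);
    }
  }
  return [...byName.values()];
}
```

The merged list is then serialized into a single result JSON for push.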
### Eval Cost Tracker
Reads the `costs` array from result JSON. Terminal dashboard:
@@ -770,11 +839,36 @@ Label = EVAL_LABEL env || sanitized_git_branch
Append tier suffix: _fast, _full (omit for standard)
```
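The label rule above, as a minimal sketch — the exact branch sanitization is an assumption (here, slashes and whitespace become underscores):

```typescript
// EVAL_LABEL env wins; otherwise sanitize the git branch.
// Tier suffix appended for fast/full, omitted for standard.
function evalLabel(branch: string, tier: "fast" | "standard" | "full", envLabel?: string): string {
  const base = envLabel?.trim() || branch.replace(/[\/\s]+/g, "_");
  return tier === "standard" ? base : `${base}_${tier}`;
}
```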
### Eval Tier & Run Scopes
Two orthogonal tier concepts:
**Run scope** — how much of the test suite to execute:
```
EVAL_TIER=quick # Subset of cases (fast smoke test)
EVAL_TIER=standard # Full suite (default)
EVAL_TIER=full # Full suite + expensive multi-judge checks
```
**Judge model tier** — which model judges use:
```
EVAL_JUDGE_TIER=fast|standard|full
Aliases: haiku→fast, sonnet→standard, opus→full
```
**Debug output:**
```
EVAL_VERBOSE=1 # Persistent logging to ~/.gstack/log/evals/
# Format: YYYYMMDD-{test-name}-{random}.txt
# Includes full untruncated LLM inputs/outputs
```
### CLI Commands
```bash
# Result management
gstack eval push <file.json> # Push result to Supabase + local store
# Dedup: skips insert if git_sha+label+tier already exists
gstack eval list [label] # List all results (local + Supabase)
gstack eval compare [a] [b] # Compare two runs — color-coded score deltas
gstack eval baselines [date] # Generate markdown baseline report
@@ -1041,6 +1135,7 @@ order by created_at desc;
| QA push | `qa/SKILL.md.tmpl` Phase 6 | After baseline, call `gstack-sync push-qa` |
| Ship push | `ship/SKILL.md.tmpl` new Step 9 | Write ship log + push |
| Config reuse | `browse/src/config.ts` | Import `getRemoteSlug()`, `getGitRoot()` |
| User settings | `bin/gstack-config` | Reuse for sync preferences (`sync_enabled`, `sync_transcripts`) |
| Atomic write | `eval-store.ts:413-416` | Extract shared `atomicWriteJSON()` utility |
| Eval watch | `scripts/eval-watch.ts` | Adapt for browser-based SSE dashboard |
| Comparison | `eval-store.ts:167` `compareEvalResults()` | Extend with color-coded diff + cross-team |
@@ -1265,7 +1360,9 @@ All 16 error paths are rescued. 0 critical gaps.
```
$ gstack sync status
─────────────────
-Connected: yes (https://xyzcompany.supabase.co)
+Project: .gstack-sync.json (supabase_url: https://xyzcompany.supabase.co)
+User settings: sync_enabled=true, sync_transcripts=false (via gstack-config)
+Connected: yes
Authenticated: yes (dev@company.com, team: xyzcompany)
Last push: 2 min ago (eval_runs)
Last pull: 1h ago
@@ -1302,6 +1399,7 @@ or nothing (sync not configured).
| `formatComparison()` | `eval-store.ts:267` | Extend with color diff |
| `llm-judge.ts` | `test/helpers/llm-judge.ts` | Extend with multi-judge |
| eval-watch.ts | `scripts/eval-watch.ts` | Adapt for browser SSE |
| `gstack-config` get/set/list | `bin/gstack-config` | User settings for sync preferences (v0.3.9) |
---