v1.46.0.0 feat: gstack v2 foundation — catalog tokens drop 56%, eval-first floor covers all 51 skills (#1712)

* docs(designs): add v2_PLAN.md — gstack v2 the lightest opinionated skill pack The approved plan from /plan-ceo-review → /plan-eng-review → /codex×2 → /plan-devex-review. Captures the v1.45/v2.0 hybrid release shape, cathedral parity-eval suite, sequential v1.45 execution, sections/*.md.tmpl pipeline, EVALS_BUDGET_HARD_CAP override path, and v2 launch copy specs. This commit just lands the design doc. Implementation follows in the rest of the v1.45.0.0 branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(parity): T0a — capture v1.44.1 baseline + capture helper + diff utility Cathedral parity-eval suite primitive. captureBaseline() walks every top-level SKILL.md and records bytes, lines, estimated tokens, frontmatter description length, and eval coverage. diffBaselines() reports per-skill delta + total corpus delta + catalog tokens delta. Locks the v1.44.1 reference snapshot at test/fixtures/parity-baseline-v1.44.1.json. After Phase A+B+C land, scripts/capture-baseline.ts --tag v1.45.0.0 produces a comparable snapshot; diff supplies the real numbers the v2 CHANGELOG quotes. Never invent baseline numbers; ship them only if they came from a real run. v1.44.1 numbers captured this commit: - 51 skills - 2,847 KB total corpus - ~9,319 catalog tokens (sum of description bytes / 4) - top 3: ship 160 KB, plan-ceo-review 128 KB, office-hours 108 KB Test plan: - bun test test/helpers/capture-parity-baseline.test.ts passes 4/4 - The baseline JSON file is committed so reviewers can audit v1→v2 numbers Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(resolvers): T2 — ResolverEntry + appliesTo gate infrastructure Adds the conditional-resolver-injection plumbing from the v2_PLAN A.1 step. Resolvers can now be either a bare ResolverFn (always fires, current behavior) or a ResolverEntry { resolve, appliesTo? } (gated; appliesTo returning false skips the resolver, substitutes empty string). Why infrastructure-only: the audit during T0a confirmed most resolvers don't need gating. The {{NAME}} placeholder system is already conditional at the template level — a resolver only fires for skills that reference it. The gate is for future use when a placeholder's audience needs a structural guardrail beyond social convention, or when a sub-resolver inside a larger composed resolver (e.g. preamble) needs per-skill skip. scripts/gen-skill-docs.ts:444 now uses unwrapResolver() to handle both shapes. RESOLVERS map signature widens from Record<string, ResolverFn> to Record<string, ResolverValue>. All existing resolvers stay bare functions and work unchanged. Test plan: - bun test test/resolver-entry.test.ts: 6 pass (gate plumbing + registry) - bun test test/gen-skill-docs.test.ts: 389 pass (no regression) - bun run gen:skill-docs --dry-run: all SKILL.md files FRESH (no diff) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(preamble): T3 — jargon dedup + terse-build flag (Phase A.2 + A.3) A.2 jargon dedup: generate-writing-style.ts replaces the inlined 80-term jargon list with a one-line pointer to scripts/jargon-list.json. The list was duplicated into every tier-2+ skill (48 of 51 skills); inlining cost was ~1.5 KB × 48 = ~70 KB across the corpus. Pointer cost is ~30 bytes per skill. Agents Read the JSON once per session on first jargon term encountered; thereafter the terms array is the canonical reference. A.3 terse build flag: --explain-level=terse compresses preamble prose at gen time. When the flag is set, writing-style collapses to a one-line terse directive and completeness-section + confusion-protocol + context-health are dropped entirely. The default build keeps the runtime-conditional behavior intact (sections still render; the model skips them when EXPLAIN_LEVEL: terse appears in the preamble echo). Terse build is opt-in for users who want shipped skills to match their runtime preference and avoid the per-session terse-mode dead prose. TemplateContext gains an optional `explainLevel: 'default' | 'terse'` field. Default builds set it to 'default'; --explain-level=terse sets 'terse'. Resolvers gate their output via `ctx?.explainLevel === 'terse'`. Measured impact (default build, post-T3): - Total corpus: 2,847 KB → 2,812 KB (saved 35 KB) - ship.md: 160 → 159 KB - plan-ceo-review.md: 128 → 127 KB - Top 10 heaviest: all slightly smaller from jargon pointer Larger compression lands in T4 (catalog trim) and T7 (atomic regen across the full Phase A pipeline). The terse build path further compresses to ~711K tokens vs default ~725K (saved ~14K tokens corpus-wide). Test plan: - bun test test/gen-skill-docs.test.ts: 389 pass (no regression) - bun test test/resolver-entry.test.ts: 6 pass - bun test test/helpers/capture-parity-baseline.test.ts: 4 pass - bun run gen:skill-docs --explain-level=terse: ship.md drops completeness + confusion-protocol + context-health sections; writing-style collapses to one-line terse directive 48 SKILL.md files updated (every tier-2+ skill picks up the jargon pointer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(catalog): T4 — catalog trim + proactive-suggestions.json (Phase A.4) Shortens frontmatter `description:` in every Claude SKILL.md to a single lead sentence + (gstack) tag. The routing prose ("Use when asked to...", "Proactively suggest...") and voice triggers move to a "## When to invoke" body section so they remain discoverable inside the skill. A per-run registry at scripts/proactive-suggestions.json aggregates the routing/ voice text for all 52 skills so agents can pull guidance on demand without paying for it in the always-loaded catalog. Build flag --catalog-mode=full restores v1.44 legacy behavior (full multi-line descriptions in frontmatter). Default is trim. splitCatalogDescription() extracts: lead sentence, routing paragraphs, voice-triggers line, (gstack) tag presence. Short descriptions (<120 chars, already trimmed) are skipped via a guard so re-runs are idempotent. Measured impact (vs v1.44.1 baseline): - Catalog tokens (sum of description bytes / 4): 9,319 → 4,045 (-56.6%) - Total SKILL.md corpus bytes: 2,915 KB → 2,880 KB (-1.2%) - Routing prose preserved as in-skill "## When to invoke" sections - 52 skill entries in scripts/proactive-suggestions.json (on-demand registry) The corpus drop is small because catalog trim MOVES text from frontmatter to body, it doesn't delete it. The headline win is the catalog: the always-loaded system prompt surface drops by more than half. Test plan: - bun test test/gen-skill-docs.test.ts: 389 pass, 0 fail - Manual: ship/SKILL.md frontmatter description is now ONE line ending with `(gstack)`; allowed-tools field on next line (YAML well-formed) - Manual: scripts/proactive-suggestions.json contains 52 entries - bun run gen:skill-docs --catalog-mode=full restores legacy behavior 53 files changed (52 SKILL.md across hosts + the new proactive-suggestions.json). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(budget): T5 — hard token budgets + override audit trail (Phase A.6) Two new gate-tier guardrails for the v1.45.0.0 compression baseline: 1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget. Compares current state to test/fixtures/parity-baseline-v1.44.1.json. Three checks: per-skill (×1.05 default ratio), total corpus, and catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05 not 1.0 because the T4 catalog trim moves text from frontmatter to a body section; small skills see a tiny body growth that's fine when offset by the much larger catalog-token win. 2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry, model price changes) before they amortize across PRs. Both checks support an override path with audit trail: GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" — size EVALS_BUDGET_OVERRIDE_REASON="why this is OK" — cost Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with timestamp + scope + reason + CI provenance (runner, branch, commit) via test/helpers/budget-override.ts. Why the override audit: a hard cap with no escape valve becomes operationally hostile (legit price changes, longer transcripts, new required evals can all blow the cap). An override with no audit becomes "everyone overrides everything and the gate is theater." This module ships the audit half so reviewers can see what was waived and why. Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json already in test/fixtures/). Test plan: - bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists) - bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks) - Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cso): T6 — pin must-preserve security phrases (Phase A.5) cso/SKILL.md is a content-heavy security audit skill (75 KB after T3+T4). Codex 2nd-pass critique #9: "cso exemption too broad ... should still get resolver dedup, catalog trim, sectioning if safe, and targeted evals around must-not-miss checks." T3 (jargon dedup) and T4 (catalog trim) already applied to cso the same way they applied to every other skill — confirmed by inspection: - jargon list NOT inlined (0 inline term lines) - catalog description trimmed to one line (74 bytes vs 774 bytes baseline) - "## When to invoke" body section present T6 work: lock in the security-prose preservation via a gate-tier test that fails CI if future compression strips load-bearing phrases: - OWASP, STRIDE positioning - daily / comprehensive mode discipline - confidence scoring language - active verification ("verif" prefix catches verify/verified/verification) - ## Preamble heading (preamble resolver still fires) Also guards cso against accidental over-stripping: SKILL.md must stay ≥30 KB (currently 75 KB) — a sudden cliff would mean compression went past the targeted-dedup line into structural removal. No structural change to cso. Future Phase B sections/ work for cso requires writing baseline parity tests FIRST per the v2_PLAN.md sequencing. Test plan: - bun test test/cso-preserved.test.ts: 5 pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(parity): T0b — cathedral parity-suite harness + invariant registry Adds the harness that the v2_PLAN.md cathedral parity-eval suite is built on. Compares CURRENT SKILL.md output to v1.44.1 baseline along three axes: STRUCTURE frontmatter shape (catalog trim landed, "## When to invoke" present) CONTENT must-preserve phrases per skill family (cso: OWASP/STRIDE; plan-ceo: SCOPE EXPANSION/HOLD SCOPE/REDUCTION; ship: VERSION/CHANGELOG/PR; etc.) SIZE per-skill byte budget (maxSizeRatio + minBytes guards) PARITY_INVARIANTS registry pins 10 load-bearing skills (cso, ship, plan-*- review, review, qa, investigate, office-hours, autoplan). Each entry declares what must NOT regress; future compression that strips these phrases or shrinks a skill past its minBytes cliff fails CI. Periodic-tier LLM-judge parity (paid, ~$0.20/skill) lands in v2.0.0.0 sections/ phase. Same registry, same harness, judge added on top. Test plan: - bun test test/parity-suite.test.ts: 10/10 invariants pass vs v1.44.1 - Per-skill failures get actionable per-line breakdown so a reviewer can see which phrase / heading / size limit went sideways Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(coverage): T1 — skill coverage matrix + structural-compliance floor Phase 0 deliverable — eval-first foundation. Two new test files plus the registry: 1. test/skill-coverage-matrix.ts — single source of truth mapping each skill to its gate-tier + periodic-tier test files. SKILL_COVERAGE record with 51 entries; every gstack skill on disk has at least one gate-tier entry. 2. test/skill-coverage-matrix.test.ts — CI gate. Asserts every skill on disk has a registry entry AND that gate[] is non-empty. Catches "skill added but eval not registered" the moment a new SKILL.md lands. 3. test/skill-coverage-floor.test.ts — per-skill structural compliance (FREE, file-IO only). For each of 51 skills, verifies: - SKILL.md exists - Frontmatter well-formed (name + description fields) - Catalog-trim contract (inline description ≤ 250 chars, or block form) - Generated header present (edit .tmpl, not .md) - Body ≥ 200 bytes (non-trivial content) - No unresolved {{TEMPLATE}} placeholders leaked The "floor" is the minimum eval that every skill ships with. Skills that need deeper behavioral testing get additional entries in their coverage record (e.g., ship has skill-e2e-ship-idempotency + workflow + floor). Future skills only need to add the floor entry and the matrix gate unblocks them. Codex 2nd-pass critique #1 mitigation: eval-first floor is structural compliance (the testable part) — judgment-skill behavior gets layered periodic-tier evals on top. We don't pretend the floor proves correctness, only that the skill structurally compiles. Test plan: - bun test test/skill-coverage-matrix.test.ts: 4 pass (matrix shape + coverage) - bun test test/skill-coverage-floor.test.ts: 309 pass (6 checks × 51 skills + 3 registry-level) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * build(skills): T7 — atomic regenerate + capture v1.45.0.0 baseline Final regen pass across all hosts after T1-T6 work landed. Captures the v1.45.0.0 parity baseline at test/fixtures/parity-baseline-v1.45.0.0.json for diffing against the v1.44.1 reference. Measured deltas (real numbers from test/helpers/capture-parity-baseline.ts): Total SKILL.md corpus 2,847 KB → 2,813 KB (-1.2%) Catalog tokens (always-loaded) ~9,319 → ~4,045 tokens (-56.6%) Top 10 heaviest skills 0.5-1.0% drop each The catalog token cut is the headline. It's the always-loaded surface, i.e. tokens charged on every session start. Per-skill SKILL.md sizes barely moved because T4 catalog trim MOVES routing prose from frontmatter to a body "## When to invoke" section rather than deleting it — the catalog wins without amputating discoverability. The bigger per-skill compression lands in v2.0.0.0 (Phase B sections/ pattern on the 5 heavyweights). v1.45 is the foundation: eval-first infrastructure + cheap wins. scripts/proactive-suggestions.json regenerated with the latest 52 skills listed (one-time write per gen-skill-docs run; aggregated catalog parts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v1.45.0.0 — gstack v2 foundation: catalog tokens drop 56%, eval-first floor Bumps VERSION + package.json to 1.45.0.0. CHANGELOG entry covers what shipped between v1.44.1 and this release: the cathedral parity-eval foundation, conditional resolver injection plumbing, jargon dedup, terse build flag, catalog trim with one-line frontmatter descriptions, hard token + dollar budget gates with override audit, cso preservation pins, and the v1.44.1 ↔ v1.45.0.0 parity baselines committed to test/fixtures/. Numbers (measured, not estimated): - Catalog tokens: ~9,319 → ~4,045 (-56.6%) - Total corpus: 2,847 KB → 2,813 KB (-1.2%) - Skills with gate-tier eval coverage: 32/51 → 51/51 (floor achieved) This is the foundation release. v2.0.0.0 will ship the architectural break (sections/*.md.tmpl pattern + mechanical Read enforcement + eval-coverage annotations) as a coordinated marketing-grade launch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(catalog): refresh proactive-suggestions.json timestamp after v1.45 bump The generated_at field updates on every gen-skill-docs run; this is the T7 atomic-regenerate output landed alongside the v1.45.0.0 bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(catalog): deterministic proactive-suggestions.json (no per-run timestamp) Original implementation wrote a generated_at timestamp on every gen-skill-docs run. That made CI dry-run freshness checks flap because the file changed on every regeneration even when the actual content (skill descriptions, routing prose, voice triggers) was unchanged. Two fixes: 1. Drop the generated_at field. The file is purely a content registry now. 2. Only write the file when serialized content actually differs from disk. Reproducible test: bun run gen:skill-docs twice in a row now leaves scripts/proactive-suggestions.json unchanged on the second run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(catalog): preserve routing prose when first sentence exceeds 200 chars splitCatalogDescription truncated the lead BEFORE computing routing extraction, which meant skills whose first sentence was over 200 chars (design-consultation: 207 chars) had their entire routing prose silently dropped — the "## When to invoke" body section came out empty. Root cause: routing was extracted via `collapsed.indexOf(lead)` after lead was suffixed with "...". The "..." never appeared in the original string, so indexOf returned -1 and routingProse fell back to empty. Fix: compute routing from sentenceLead (the untruncated first sentence) BEFORE truncating the displayed lead. The displayed lead still gets "..." when over 200 chars, but the routing extraction uses the real boundary. Also: refresh golden snapshots for claude/codex/factory ship and update two unit tests that asserted v1.44 behavior: - skill-validation.test.ts: trigger-phrase + proactive-routing tests now search whole content, not just frontmatter (T4 moved them to a body "## When to invoke" section) - writing-style-resolver.test.ts: jargon-list assertion now expects the T3 reference pointer, not the inline list Test plan: - bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts test/host-config.test.ts test/skill-size-budget.test.ts test/parity-suite.test.ts test/skill-coverage-matrix.test.ts test/skill-coverage-floor.test.ts test/cso-preserved.test.ts test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts test/gen-skill-docs.test.ts: 1134 pass, 0 fail - Manual verify: design-consultation/SKILL.md "## When to invoke this skill" body section now contains "Use when asked to..." + "Proactively suggest..." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(catalog): deterministic proactive-suggestions.json across machines CI check-freshness failed because scripts/proactive-suggestions.json serialized differently on local vs CI: 1. Root-skill key leaked the directory name. processTemplate's outer loop computed `dir = path.basename(path.dirname(tmplPath))`. For the root SKILL.md.tmpl at ROOT/SKILL.md.tmpl, that returns the repo-checkout directory name — "seville-v3" in a Conductor worktree, "gstack" on GitHub Actions, anything-else for a fork. Fix: detect root via `path.dirname(tmplPath) === ROOT` and hardcode the key to "gstack" for that one case. 2. Aggregate key order was filesystem-iteration order. discoverTemplates doesn't guarantee stable ordering across platforms, so the JSON `skills` object came out shuffled between machines. Fix: sort Object.keys(proactiveAggregate) alphabetically before serializing. After the fix, the generated file is identical on every machine and matches what's committed. CI freshness check (bun run gen:skill-docs && git diff --exit-code) now passes. Test plan: - bun run gen:skill-docs && bun run gen:skill-docs --dry-run: all FRESH - node -e 'verify keys sorted': sorted match: true - grep -c '"seville-v3"' scripts/proactive-suggestions.json: 0 - Focused test suite: 704 pass, 0 fail Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(catalog): unit + regression coverage for catalog-trim helpers Four exported functions in scripts/gen-skill-docs.ts handle every skill's frontmatter rewrite at gen time but had zero unit tests. Both real bugs we shipped (and fixed) on this branch lived in these functions: v1.45.0.0 design-consultation: when the first sentence exceeded 200 chars, routing-prose extraction lost the entire tail (anchored on truncated lead with "..." that didn't substring-match the original). v1.45.0.0 CI freshness: root-skill key leaked the checkout directory name ("seville-v3" vs "gstack") and aggregate order was filesystem- iteration order. Both shapes are now regression-tested: - splitCatalogDescription: 7 tests covering simple multi-line, >200-char first sentence (design-consultation regression), voice-trigger extraction, no-(gstack) handling, embedded periods (documents known fallback), no-period fragments, and idempotency. - buildTrimmedDescription: 3 tests. - buildWhenToInvokeSection: 3 tests. - applyCatalogTrim: 4 tests covering the standard rewrite, no-op for already-short descriptions, the YAML-collision newline fix, and the malformed-frontmatter null return. - proactive-suggestions.json determinism: 3 tests asserting sorted keys, root keyed as "gstack" (not the worktree directory), and no timestamp/generated_at field that would flap CI freshness. Test plan: - bun test test/catalog-trim.test.ts: 20 pass, 0 fail Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(coverage): fill three remaining v1.46.0.0 test gaps Three untested surfaces from the v1.46.0.0 work. All three would have caught real bugs we shipped (and fixed) on this branch. 1. test/helpers/budget-override.test.ts — 7 tests pin the audit-trail contract for EVALS_BUDGET_OVERRIDE_REASON and GSTACK_SIZE_BUDGET_OVERRIDE_REASON. Without this, the audit logger could silently drop events and overrides become invisible. Tests cover: required fields per JSONL line, CI provenance capture (CI/GITHUB_ACTIONS/branch/commit), local-runner defaults, append-only behavior, missing-directory recovery, and unwritable- path resilience (logs warning instead of throwing). 2. test/terse-build.test.ts — 16 tests pin --explain-level=terse behavior across the 4 gated resolvers and the composed preamble. Default vs terse vs undefined-ctx all asserted. Without this, a refactor that breaks the explainLevel threading silently regresses the opt-in compression path; the runtime EXPLAIN_LEVEL: terse gate still works so users wouldn't notice. Tier-1 invariant pinned (terse-only-affects-tier-2+). 3. test/gen-skill-docs-idempotency.test.ts — 2 tests catch the class of bug behind the v1.45.0.0 timestamp flap. Two consecutive gen-skill-docs runs must produce byte-identical outputs across STABLE_OUTPUTS (proactive-suggestions.json, SKILL.md, ship/SKILL.md, plan-ceo-review/SKILL.md, office-hours/SKILL.md, gstack/llms.txt). --dry-run reports zero stale files after a fresh gen. CI freshness regressions surface as test failures BEFORE a PR is opened. Test plan: - bun test test/helpers/budget-override.test.ts: 7 pass - bun test test/terse-build.test.ts: 16 pass - bun test test/gen-skill-docs-idempotency.test.ts: 2 pass - Full focused suite (15 test files): 1179 pass, 0 fail (+45 new tests vs the pre-fill baseline of 1134) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(coverage): close 5 remaining v1.46.0.0 test gaps (A-E) Five behaviors that v1.46 ships but had no test coverage. All now pinned. A) --host all idempotency (test/gen-skill-docs-idempotency.test.ts) The default test ran Claude host only. Non-Claude hosts (Codex, Factory, Cursor, OpenClaw, GBrain, Slate, OpenCode, Hermes, Kiro) each have their own output paths and could carry their own non-deterministic fields. We hit a "--host all needed for freshness check" mid-/ship. Now: two consecutive `bun run gen:skill-docs --host all` runs must produce byte-identical outputs across a per-host sample (.agents/, .cursor/, .factory/, .gbrain/). Catches per-host adapter regressions before CI. B) --catalog-mode=full opt-out (test/catalog-mode-full.test.ts) The legacy escape hatch had zero tests. 6 new tests across two layers: static (CATALOG_MODE_ARG parsed; conditional gate present; default is "trim"; invalid value throws) + smoke (actual --catalog-mode=full run produces a multi-line `description: |` block + omits "## When to invoke" body section; mutates the working tree then restores in a finally block). C) parity-baseline-v1.44.1.json integrity (test/parity-baseline-integrity.test.ts) The baseline is the source of every v1→v2 number cited in the CHANGELOG v1.46.0.0 entry. Anyone could edit it without test failure until now. 8 new tests pin: existence, tag, capturedFromCommit allowlist, expected v1.44 numbers (51 skills, ~2,915 KB, ~9,319 catalog tokens), CHANGELOG references this file by path, per-skill shape, and a SHA256 byte-stability hash. Any edit fails with a clear "if intentional, update EXPECTED_HASH AND the CHANGELOG numbers" signal. D) Live appliesTo gate end-to-end (test/resolver-entry.test.ts extended) The unwrapResolver unit tests covered the function; the gen-skill-docs.ts substitution loop that USES the gate had no integration coverage. 6 new tests simulate the exact 4-line shape from gen-skill-docs.ts:457-467 against synthetic registries: plain-function fires unconditionally, gated fires when true / empty-string when false, mixed registries compose, parameterized resolvers respect gates, unknown resolvers throw. E) Per-skill min-size floor (test/skill-size-budget.test.ts extended) The existing 200-byte body coverage-floor is a noise floor — a skill that lost 99.75% of content still passes. 1 new test asserts every skill stays ≥80% of its v1.44.1 baseline size (the parity-suite content invariants only covered 10 of 51 skills; the remaining 41 were uncovered). SECTIONS_EXTRACTED hook in place for v2.0.0.0 when the sections/ pattern legitimately shrinks ship/plan-ceo/etc. past the floor. Test plan: - bun test focused 17-file suite: 1202 pass, 0 fail (+23 new tests vs the pre-fill 1179 baseline) - catalog-mode=full mutates working tree then restores cleanly - --host all idempotency runs two full gen passes in <1s on this machine Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-19 05:57:37 +02:00 · 2026-05-26 16:50:03 -07:00
parent cf50443b63
commit 22f8c7f4e1
91 changed files with 6044 additions and 3885 deletions
@@ -0,0 +1,267 @@
+{
+  "$schema": "https://gstack.dev/schemas/proactive-suggestions.json",
+  "catalog_mode": "trim",
+  "note": "Routing / voice-trigger prose extracted from SKILL.md frontmatter descriptions during catalog trim. Loaded on demand when routing guidance is needed.",
+  "skills": {
+    "autoplan": {
+      "lead": "Auto-review pipeline — reads the full CEO, design, eng, and DX review skills from disk and runs them sequentially with auto-decisions using 6 decision principles.",
+      "routing": "Surfaces\ntaste decisions (close approaches, borderline scope, codex disagreements) at a final\napproval gate. One command, fully reviewed plan out.\nUse when asked to \"auto review\", \"autoplan\", \"run all reviews\", \"review this plan\nautomatically\", or \"make the decisions for me\".\nProactively suggest when the user has a plan file and wants to run the full review\ngauntlet without answering 15-30 intermediate questions.",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"auto plan\", \"automatic review\"."
+    },
+    "benchmark": {
+      "lead": "Performance regression detection using the browse daemon.",
+      "routing": "Establishes\nbaselines for page load times, Core Web Vitals, and resource sizes.\nCompares before/after on every PR. Tracks performance trends over time.\nUse when: \"performance\", \"benchmark\", \"page speed\", \"lighthouse\", \"web vitals\",\n\"bundle size\", \"load time\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"speed test\", \"check performance\"."
+    },
+    "benchmark-models": {
+      "lead": "Cross-model benchmark for gstack skills.",
+      "routing": "Runs the same prompt through Claude,\nGPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost,\nand optionally quality via LLM judge. Answers \"which model is actually best\nfor this skill?\" with data instead of vibes. Separate from /benchmark, which\nmeasures web page performance. Use when: \"benchmark models\", \"compare models\",\n\"which model is best for X\", \"cross-model comparison\", \"model shootout\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"compare models\", \"model shootout\", \"which model is best\"."
+    },
+    "browse": {
+      "lead": "Fast headless browser for QA testing and site dogfooding.",
+      "routing": "Navigate any URL, interact with\nelements, verify page state, diff before/after actions, take annotated screenshots, check\nresponsive layouts, test forms and uploads, handle dialogs, and assert element states.\n~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a\nuser flow, or file a bug with evidence. Use when asked to \"open in browser\", \"test the\nsite\", \"take a screenshot\", or \"dogfood this\".",
+      "voice_line": null
+    },
+    "canary": {
+      "lead": "Post-deploy canary monitoring.",
+      "routing": "Watches the live app for console errors,\nperformance regressions, and page failures using the browse daemon. Takes\nperiodic screenshots, compares against pre-deploy baselines, and alerts\non anomalies. Use when: \"monitor deploy\", \"canary\", \"post-deploy check\",\n\"watch production\", \"verify deploy\".",
+      "voice_line": null
+    },
+    "careful": {
+      "lead": "Safety guardrails for destructive commands.",
+      "routing": "Warns before rm -rf, DROP TABLE,\nforce-push, git reset --hard, kubectl delete, and similar destructive operations.\nUser can override each warning. Use when touching prod, debugging live systems,\nor working in a shared environment. Use when asked to \"be careful\", \"safety mode\",\n\"prod mode\", or \"careful mode\".",
+      "voice_line": null
+    },
+    "codex": {
+      "lead": "OpenAI Codex CLI wrapper — three modes.",
+      "routing": "Code review: independent diff review via\ncodex review with pass/fail gate. Challenge: adversarial mode that tries to break\nyour code. Consult: ask codex anything with session continuity for follow-ups.\nThe \"200 IQ autistic developer\" second opinion. Use when asked to \"codex review\",\n\"codex challenge\", \"ask codex\", \"second opinion\", or \"consult codex\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"code x\", \"code ex\", \"get another opinion\"."
+    },
+    "context-restore": {
+      "lead": "Restore working context saved earlier by /context-save.",
+      "routing": "Loads the most recent\nsaved state (across all branches by default) so you can pick up where you\nleft off — even across Conductor workspace handoffs.\nUse when asked to \"resume\", \"restore context\", \"where was I\", or\n\"pick up where I left off\". Pair with /context-save.\nFormerly /checkpoint resume — renamed because Claude Code treats /checkpoint\nas a native rewind alias in current environments.",
+      "voice_line": null
+    },
+    "context-save": {
+      "lead": "Save working context.",
+      "routing": "Captures git state, decisions made, and remaining work\nso any future session can pick up without losing a beat.\nUse when asked to \"save progress\", \"save state\", \"context save\", or\n\"save my work\". Pair with /context-restore to resume later.\nFormerly /checkpoint — renamed because Claude Code treats /checkpoint as a\nnative rewind alias in current environments, which was shadowing this skill.",
+      "voice_line": null
+    },
+    "cso": {
+      "lead": "Chief Security Officer mode.",
+      "routing": "Infrastructure-first security audit: secrets archaeology,\ndependency supply chain, CI/CD pipeline security, LLM/AI security, skill supply chain\nscanning, plus OWASP Top 10, STRIDE threat modeling, and active verification.\nTwo modes: daily (zero-noise, 8/10 confidence gate) and comprehensive (monthly deep\nscan, 2/10 bar). Trend tracking across audit runs.\nUse when: \"security audit\", \"threat model\", \"pentest review\", \"OWASP\", \"CSO review\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"see-so\", \"see so\", \"security review\", \"security check\", \"vulnerability scan\", \"run security\"."
+    },
+    "design-consultation": {
+      "lead": "Design consultation: understands your product, researches the landscape, proposes a complete design system (aesthetic, typography, color, layout, spacing, motion), and generates font+color preview...",
+      "routing": "Creates DESIGN.md as your project's design source\nof truth. For existing sites, use /plan-design-review to infer the system instead.\nUse when asked to \"design system\", \"brand guidelines\", or \"create DESIGN.md\".\nProactively suggest when starting a new project's UI with no existing\ndesign system or DESIGN.md.",
+      "voice_line": null
+    },
+    "design-html": {
+      "lead": "Design finalization: generates production-quality Pretext-native HTML/CSS.",
+      "routing": "Works with approved mockups from /design-shotgun, CEO plans from /plan-ceo-review,\ndesign review context from /plan-design-review, or from scratch with a user\ndescription. Text actually reflows, heights are computed, layouts are dynamic.\n30KB overhead, zero deps. Smart API routing: picks the right Pretext patterns\nfor each design type. Use when: \"finalize this design\", \"turn this into HTML\",\n\"build me a page\", \"implement this design\", or after any planning skill.\nProactively suggest when user has approved a design or has a plan ready.",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"build the design\", \"code the mockup\", \"make it real\"."
+    },
+    "design-review": {
+      "lead": "Designer's eye QA: finds visual inconsistency, spacing issues, hierarchy problems, AI slop patterns, and slow interactions — then fixes them.",
+      "routing": "Iteratively fixes issues\nin source code, committing each fix atomically and re-verifying with before/after\nscreenshots. For plan-mode design review (before implementation), use /plan-design-review.\nUse when asked to \"audit the design\", \"visual QA\", \"check if it looks good\", or \"design polish\".\nProactively suggest when the user mentions visual inconsistencies or\nwants to polish the look of a live site.",
+      "voice_line": null
+    },
+    "design-shotgun": {
+      "lead": "Design shotgun: generate multiple AI design variants, open a comparison board, collect structured feedback, and iterate.",
+      "routing": "Standalone design exploration you can\nrun anytime. Use when: \"explore designs\", \"show me options\", \"design variants\",\n\"visual brainstorm\", or \"I don't like how this looks\".\nProactively suggest when the user describes a UI feature but hasn't seen\nwhat it could look like.",
+      "voice_line": null
+    },
+    "devex-review": {
+      "lead": "Live developer experience audit.",
+      "routing": "Uses the browse tool to actually TEST the\ndeveloper experience: navigates docs, tries the getting started flow, times\nTTHW, screenshots error messages, evaluates CLI help text. Produces a DX\nscorecard with evidence. Compares against /plan-devex-review scores if they\nexist (the boomerang: plan said 3 minutes, reality says 8). Use when asked to\n\"test the DX\", \"DX audit\", \"developer experience test\", or \"try the\nonboarding\". Proactively suggest after shipping a developer-facing feature.",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"dx audit\", \"test the developer experience\", \"try the onboarding\", \"developer experience test\"."
+    },
+    "document-generate": {
+      "lead": "Generate missing documentation from scratch for a feature, module, or entire project.",
+      "routing": "Uses the Diataxis framework (tutorial / how-to / reference / explanation) to produce\ncomplete, structured documentation. Can be invoked standalone or called by\n/document-release when it finds coverage gaps. Use when asked to \"write docs\",\n\"generate documentation\", \"document this feature\", \"create a tutorial\", or\n\"explain this module\".",
+      "voice_line": null
+    },
+    "document-release": {
+      "lead": "Post-ship documentation update.",
+      "routing": "Reads all project docs, cross-references the\ndiff, builds a Diataxis coverage map (reference/how-to/tutorial/explanation),\nupdates README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md to match what shipped,\ndetects architecture diagram drift, polishes CHANGELOG voice with a sell-test\nrubric, cleans up TODOS, and optionally bumps VERSION. Surfaces documentation\ndebt in the PR body. Use when asked to \"update the docs\", \"sync documentation\",\nor \"post-ship docs\". Proactively suggest after a PR is merged or code is shipped.",
+      "voice_line": null
+    },
+    "freeze": {
+      "lead": "Restrict file edits to a specific directory for the session.",
+      "routing": "Blocks Edit and\nWrite outside the allowed path. Use when debugging to prevent accidentally\n\"fixing\" unrelated code, or when you want to scope changes to one module.\nUse when asked to \"freeze\", \"restrict edits\", \"only edit this folder\",\nor \"lock down edits\".",
+      "voice_line": null
+    },
+    "gstack": {
+      "lead": "Fast headless browser for QA testing and site dogfooding.",
+      "routing": "Navigate pages, interact with\nelements, verify state, diff before/after, take annotated screenshots, test responsive\nlayouts, forms, uploads, dialogs, and capture bug evidence. Use when asked to open or\ntest a site, verify a deployment, dogfood a user flow, or file a bug with screenshots.",
+      "voice_line": null
+    },
+    "gstack-upgrade": {
+      "lead": "Upgrade gstack to the latest version.",
+      "routing": "Detects global vs vendored install,\nruns the upgrade, and shows what's new. Use when asked to \"upgrade gstack\",\n\"update gstack\", or \"get latest version\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"upgrade the tools\", \"update the tools\", \"gee stack upgrade\", \"g stack upgrade\"."
+    },
+    "guard": {
+      "lead": "Full safety mode: destructive command warnings + directory-scoped edits.",
+      "routing": "Combines /careful (warns before rm -rf, DROP TABLE, force-push, etc.) with\n/freeze (blocks edits outside a specified directory). Use for maximum safety\nwhen touching prod or debugging live systems. Use when asked to \"guard mode\",\n\"full safety\", \"lock it down\", or \"maximum safety\".",
+      "voice_line": null
+    },
+    "health": {
+      "lead": "Code quality dashboard.",
+      "routing": "Wraps existing project tools (type checker, linter,\ntest runner, dead code detector, shell linter), computes a weighted composite\n0-10 score, and tracks trends over time. Use when: \"health check\",\n\"code quality\", \"how healthy is the codebase\", \"run all checks\",\n\"quality score\".",
+      "voice_line": null
+    },
+    "investigate": {
+      "lead": "Systematic debugging with root cause investigation.",
+      "routing": "Four phases: investigate,\nanalyze, hypothesize, implement. Iron Law: no fixes without root cause.\nUse when asked to \"debug this\", \"fix this bug\", \"why is this broken\",\n\"investigate this error\", or \"root cause analysis\".\nProactively invoke this skill (do NOT debug directly) when the user reports\nerrors, 500 errors, stack traces, unexpected behavior, \"it was working\nyesterday\", or is troubleshooting why something stopped working.",
+      "voice_line": null
+    },
+    "ios-clean": {
+      "lead": "Remove the DebugBridge SPM package and all #if DEBUG wiring from an iOS app.",
+      "routing": "Cleans up StateServer, DebugOverlay, accessor codegen output, and\napp-side hooks installed by /ios-qa. This is a convenience wrapper —\nthe structural Release-build guard (Package.swift conditional + CI\nswift build -c release check) is the safety-critical path.\nUse when asked to \"clean the iOS debug bridge\", \"remove DebugBridge\",\nor \"strip the gstack iOS instrumentation\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"clean the iOS debug bridge\", \"remove DebugBridge\", \"strip the gstack iOS instrumentation\"."
+    },
+    "ios-design-review": {
+      "lead": "Visual design audit for iOS apps on real hardware.",
+      "routing": "Connects to a real\niPhone via the same StateServer as /ios-qa, screenshots every screen,\nevaluates against Apple HIG, DESIGN.md, and design best practices. Scores\neach dimension 0-10 with \"what would make it a 10\" framing — mirrors\n/plan-design-review for browser. For plan-stage design review (before\nimplementation), use /plan-design-review. For live web visual audits, use\n/design-review.\nUse when asked to \"review the iOS design\", \"audit the iPhone app's\nvisuals\", or \"design QA the iOS app\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"review the iOS design\", \"audit the iPhone app's visuals\", \"design QA the iPhone app\"."
+    },
+    "ios-fix": {
+      "lead": "Autonomous iOS bug fixer.",
+      "routing": "Takes a bug found by /ios-qa, reads the source,\nwrites the fix, rebuilds, redeploys, and verifies the fix on the real\ndevice. Closes the loop: find bug → fix bug → confirm fix — zero human\nintervention. Captures the pre-bug state snapshot as a regression test\nfixture, so the bug can never recur silently.\nUse when /ios-qa reports a bug and you want it fixed automatically, or\nwhen asked to \"fix this iOS bug\", \"patch the iPhone app\", or \"auto-fix\nthe iOS issue\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"fix the iOS bug\", \"patch the iPhone app\", \"auto-fix the iOS issue\"."
+    },
+    "ios-qa": {
+      "lead": "Live-device iOS QA for SwiftUI apps.",
+      "routing": "Connects to a real iPhone via USB\nCoreDevice IPv6 tunnel, reads Swift source to understand every screen, then\nruns a vision-driven agent loop: screenshot → analyze → decide → act →\nverify → repeat. All interaction happens via HTTP to an embedded\nStateServer in the app under test. Optionally exposes the device over\nTailscale so remote agents (OpenClaw, Codex, any HTTP-capable agent) can\nrun iOS QA from anywhere without touching the hardware.\nUse when asked to \"ios qa\", \"test my iPhone app\", \"find bugs on the device\",\nor \"qa the iOS app\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"iOS quality check\", \"test the iPhone app\", \"run iOS QA\"."
+    },
+    "ios-sync": {
+      "lead": "Regenerate the iOS debug bridge against the latest upstream gstack templates.",
+      "routing": "Updates StateServer.swift, DebugOverlay.swift, Package.swift,\nand the typed @Observable state accessors. Use after you upgrade gstack\nor add new ViewModels/properties that need accessor coverage.\nUse when asked to \"resync the iOS debug bridge\", \"regenerate iOS\naccessors\", or \"update the gstack iOS instrumentation\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"resync the iOS debug bridge\", \"regenerate iOS accessors\", \"update the gstack iOS instrumentation\"."
+    },
+    "land-and-deploy": {
+      "lead": "Land and deploy workflow.",
+      "routing": "Merges the PR, waits for CI and deploy,\nverifies production health via canary checks. Takes over after /ship\ncreates the PR. Use when: \"merge\", \"land\", \"deploy\", \"merge and verify\",\n\"land it\", \"ship it to production\".",
+      "voice_line": null
+    },
+    "landing-report": {
+      "lead": "Read-only queue dashboard for workspace-aware ship.",
+      "routing": "Shows which VERSION slots\nare currently claimed by open PRs, which sibling Conductor workspaces have\nWIP work likely to ship soon, and what slot /ship would pick next. No\nmutations — just a snapshot. Use when asked to \"landing report\", \"what's in\nthe queue\", \"show me open PRs\", or \"which version do I claim next\".",
+      "voice_line": null
+    },
+    "learn": {
+      "lead": "Manage project learnings.",
+      "routing": "Review, search, prune, and export what gstack\nhas learned across sessions. Use when asked to \"what have we learned\",\n\"show learnings\", \"prune stale learnings\", or \"export learnings\".\nProactively suggest when the user asks about past patterns or wonders\n\"didn't we fix this before?\"",
+      "voice_line": null
+    },
+    "make-pdf": {
+      "lead": "Turn any markdown file into a publication-quality PDF.",
+      "routing": "Proper 1in margins,\nintelligent page breaks, page numbers, cover pages, running headers, curly\nquotes and em dashes, clickable TOC, diagonal DRAFT watermark. Not a draft\nartifact — a finished artifact. Use when asked to \"make a PDF\", \"export to\nPDF\", \"turn this markdown into a PDF\", or \"generate a document\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"make this a pdf\", \"make it a pdf\", \"export to pdf\", \"turn this into a pdf\", \"turn this markdown into a pdf\", \"generate a pdf\", \"make a pdf from\", \"pdf this markdown\"."
+    },
+    "office-hours": {
+      "lead": "YC Office Hours — two modes.",
+      "routing": "Startup mode: six forcing questions that expose\ndemand reality, status quo, desperate specificity, narrowest wedge, observation,\nand future-fit. Builder mode: design thinking brainstorming for side projects,\nhackathons, learning, and open source. Saves a design doc.\nUse when asked to \"brainstorm this\", \"I have an idea\", \"help me think through\nthis\", \"office hours\", or \"is this worth building\".\nProactively invoke this skill (do NOT answer directly) when the user describes\na new product idea, asks whether something is worth building, wants to think\nthrough design decisions for something that doesn't exist yet, or is exploring\na concept before any code is written.\nUse before /plan-ceo-review or /plan-eng-review.",
+      "voice_line": null
+    },
+    "open-gstack-browser": {
+      "lead": "Launch GStack Browser — AI-controlled Chromium with the sidebar extension baked in.",
+      "routing": "Opens a visible browser window where you can watch every action in real time.\nThe sidebar shows a live activity feed and chat. Anti-bot stealth built in.\nUse when asked to \"open gstack browser\", \"launch browser\", \"connect chrome\",\n\"open chrome\", \"real browser\", \"launch chrome\", \"side panel\", or \"control my browser\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"show me the browser\"."
+    },
+    "pair-agent": {
+      "lead": "Pair a remote AI agent with your browser.",
+      "routing": "One command generates a setup key and\nprints instructions the other agent can follow to connect. Works with OpenClaw,\nHermes, Codex, Cursor, or any agent that can make HTTP requests. The remote agent\ngets its own tab with scoped access (read+write by default, admin on request).\nUse when asked to \"pair agent\", \"connect agent\", \"share browser\", \"remote browser\",\n\"let another agent use my browser\", or \"give browser access\".",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"pair agent\", \"connect agent\", \"share my browser\", \"remote browser access\"."
+    },
+    "plan-ceo-review": {
+      "lead": "CEO/founder-mode plan review.",
+      "routing": "Rethink the problem, find the 10-star product,\nchallenge premises, expand scope when it creates a better product. Four modes:\nSCOPE EXPANSION (dream big), SELECTIVE EXPANSION (hold scope + cherry-pick\nexpansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials).\nUse when asked to \"think bigger\", \"expand scope\", \"strategy review\", \"rethink this\",\nor \"is this ambitious enough\".\nProactively suggest when the user is questioning scope or ambition of a plan,\nor when the plan feels like it could be thinking bigger.",
+      "voice_line": null
+    },
+    "plan-design-review": {
+      "lead": "Designer's eye plan review — interactive, like CEO and Eng review.",
+      "routing": "Rates each design dimension 0-10, explains what would make it a 10,\nthen fixes the plan to get there. Works in plan mode. For live site\nvisual audits, use /design-review. Use when asked to \"review the design plan\"\nor \"design critique\".\nProactively suggest when the user has a plan with UI/UX components that\nshould be reviewed before implementation.",
+      "voice_line": null
+    },
+    "plan-devex-review": {
+      "lead": "Interactive developer experience plan review.",
+      "routing": "Explores developer personas,\nbenchmarks against competitors, designs magical moments, and traces friction\npoints before scoring. Three modes: DX EXPANSION (competitive advantage),\nDX POLISH (bulletproof every touchpoint), DX TRIAGE (critical gaps only).\nUse when asked to \"DX review\", \"developer experience audit\", \"devex review\",\nor \"API design review\".\nProactively suggest when the user has a plan for developer-facing products\n(APIs, CLIs, SDKs, libraries, platforms, docs).",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"dx review\", \"developer experience review\", \"devex review\", \"devex audit\", \"API design review\", \"onboarding review\"."
+    },
+    "plan-eng-review": {
+      "lead": "Eng manager-mode plan review.",
+      "routing": "Lock in the execution plan — architecture,\ndata flow, diagrams, edge cases, test coverage, performance. Walks through\nissues interactively with opinionated recommendations. Use when asked to\n\"review the architecture\", \"engineering review\", or \"lock in the plan\".\nProactively suggest when the user has a plan or design doc and is about to\nstart coding — to catch architecture issues before implementation.",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"tech review\", \"technical review\", \"plan engineering review\"."
+    },
+    "plan-tune": {
+      "lead": "Self-tuning question sensitivity + developer psychographic for gstack (v1: observational).",
+      "routing": "Review which AskUserQuestion prompts fire across gstack skills, set per-question preferences\n(never-ask / always-ask / ask-only-for-one-way), inspect the dual-track\nprofile (what you declared vs what your behavior suggests), and enable/disable\nquestion tuning. Conversational interface — no CLI syntax required.\n\nUse when asked to \"tune questions\", \"stop asking me that\", \"too many questions\",\n\"show my profile\", \"what questions have I been asked\", \"show my vibe\",\n\"developer profile\", or \"turn off question tuning\". \n\nProactively suggest when the user says the same gstack question has come up before,\nor when they explicitly override a recommendation for the Nth time.",
+      "voice_line": null
+    },
+    "qa": {
+      "lead": "Systematically QA test a web application and fix bugs found.",
+      "routing": "Runs QA testing,\nthen iteratively fixes bugs in source code, committing each fix atomically and\nre-verifying. Use when asked to \"qa\", \"QA\", \"test this site\", \"find bugs\",\n\"test and fix\", or \"fix what's broken\".\nProactively suggest when the user says a feature is ready for testing\nor asks \"does this work?\". Three tiers: Quick (critical/high only),\nStandard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,\nfix evidence, and a ship-readiness summary. For report-only mode, use /qa-only.",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"quality check\", \"test the app\", \"run QA\"."
+    },
+    "qa-only": {
+      "lead": "Report-only QA testing.",
+      "routing": "Systematically tests a web application and produces a\nstructured report with health score, screenshots, and repro steps — but never\nfixes anything. Use when asked to \"just report bugs\", \"qa report only\", or\n\"test but don't fix\". For the full test-fix-verify loop, use /qa instead.\nProactively suggest when the user wants a bug report without any code changes.",
+      "voice_line": "Voice triggers (speech-to-text aliases): \"bug report\", \"just check for bugs\"."
+    },
+    "retro": {
+      "lead": "Weekly engineering retrospective.",
+      "routing": "Analyzes commit history, work patterns,\nand code quality metrics with persistent history and trend tracking.\nTeam-aware: breaks down per-person contributions with praise and growth areas.\nUse when asked to \"weekly retro\", \"what did we ship\", or \"engineering retrospective\".\nProactively suggest at the end of a work week or sprint.",
+      "voice_line": null
+    },
+    "review": {
+      "lead": "Pre-landing PR review.",
+      "routing": "Analyzes diff against the base branch for SQL safety, LLM trust\nboundary violations, conditional side effects, and other structural issues. Use when\nasked to \"review this PR\", \"code review\", \"pre-landing review\", or \"check my diff\".\nProactively suggest when the user is about to merge or land code changes.",
+      "voice_line": null
+    },
+    "scrape": {
+      "lead": "Pull data from a web page.",
+      "routing": "First call on a new intent prototypes the flow\nvia $B primitives and returns JSON. Subsequent calls on a matching intent\nroute to a codified browser-skill and return in ~200ms. Read-only — for\nmutating flows (form fills, clicks, submissions), use /automate.\nUse when asked to \"scrape\", \"get data from\", \"pull\", \"extract from\", or\n\"what's on\" a page.",
+      "voice_line": null
+    },
+    "setup-browser-cookies": {
+      "lead": "Import cookies from your real Chromium browser into the headless browse session.",
+      "routing": "Opens an interactive picker UI where you select which cookie domains to import.\nUse before QA testing authenticated pages. Use when asked to \"import cookies\",\n\"login to the site\", or \"authenticate the browser\".",
+      "voice_line": null
+    },
+    "setup-deploy": {
+      "lead": "Configure deployment settings for /land-and-deploy.",
+      "routing": "Detects your deploy\nplatform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, custom),\nproduction URL, health check endpoints, and deploy status commands. Writes\nthe configuration to CLAUDE.md so all future deploys are automatic.\nUse when: \"setup deploy\", \"configure deployment\", \"set up land-and-deploy\",\n\"how do I deploy with gstack\", \"add deploy config\".",
+      "voice_line": null
+    },
+    "setup-gbrain": {
+      "lead": "Set up gbrain for this coding agent: install the CLI, initialize a local PGLite or Supabase brain, register MCP, capture per-remote trust policy.",
+      "routing": "One command from zero to \"gbrain is running, and this agent\ncan call it.\" Use when: \"setup gbrain\", \"connect gbrain\", \"start\ngbrain\", \"install gbrain\", \"configure gbrain for this machine\".",
+      "voice_line": null
+    },
+    "ship": {
+      "lead": "Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, update CHANGELOG, commit, push, create PR.",
+      "routing": "Use when asked to \"ship\", \"deploy\",\n\"push to main\", \"create a PR\", \"merge and push\", or \"get it deployed\".\nProactively invoke this skill (do NOT push/PR directly) when the user says code\nis ready, asks about deploying, wants to push code up, or asks to create a PR.",
+      "voice_line": null
+    },
+    "skillify": {
+      "lead": "Codify the most recent successful /scrape flow into a permanent browser-skill on disk.",
+      "routing": "Future /scrape calls with the same intent run\nthe codified script in ~200ms instead of re-driving the page. Walks\nback through the conversation, synthesizes script.ts + script.test.ts\n+ fixture, runs the test in a temp dir, and asks before committing.\nUse when asked to \"skillify\", \"codify\", \"save this scrape\", or\n\"make this permanent\".",
+      "voice_line": null
+    },
+    "sync-gbrain": {
+      "lead": "Keep gbrain current with this repo's code and refresh agent search guidance in CLAUDE.md. Wraps the gstack-gbrain-sync orchestrator with state",
+      "routing": "probing, native code-surface registration, capability checks,\nand a verdict block. Re-runnable, idempotent. Use when: \"sync gbrain\",\n\"refresh gbrain\", \"re-index this repo\", \"gbrain search isn't finding\nthings\".",
+      "voice_line": null
+    },
+    "unfreeze": {
+      "lead": "Clear the freeze boundary set by /freeze, allowing edits to all directories again.",
+      "routing": "Use when you want to widen edit scope without ending the session.\nUse when asked to \"unfreeze\", \"unlock edits\", \"remove freeze\", or\n\"allow all edits\".",
+      "voice_line": null
+    }
+  }
+}