From 22f8c7f4e1eda65680d4b87a2548429f44020277 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Tue, 26 May 2026 16:50:03 -0700 Subject: [PATCH] =?UTF-8?q?v1.46.0.0=20feat:=20gstack=20v2=20foundation=20?= =?UTF-8?q?=E2=80=94=20catalog=20tokens=20drop=2056%,=20eval-first=20floor?= =?UTF-8?q?=20covers=20all=2051=20skills=20(#1712)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * docs(designs): add v2_PLAN.md — gstack v2 the lightest opinionated skill pack The approved plan from /plan-ceo-review → /plan-eng-review → /codex×2 → /plan-devex-review. Captures the v1.45/v2.0 hybrid release shape, cathedral parity-eval suite, sequential v1.45 execution, sections/*.md.tmpl pipeline, EVALS_BUDGET_HARD_CAP override path, and v2 launch copy specs. This commit just lands the design doc. Implementation follows in the rest of the v1.45.0.0 branch. Co-Authored-By: Claude Opus 4.7 (1M context) * test(parity): T0a — capture v1.44.1 baseline + capture helper + diff utility Cathedral parity-eval suite primitive. captureBaseline() walks every top-level SKILL.md and records bytes, lines, estimated tokens, frontmatter description length, and eval coverage. diffBaselines() reports per-skill delta + total corpus delta + catalog tokens delta. Locks the v1.44.1 reference snapshot at test/fixtures/parity-baseline-v1.44.1.json. After Phase A+B+C land, scripts/capture-baseline.ts --tag v1.45.0.0 produces a comparable snapshot; diff supplies the real numbers the v2 CHANGELOG quotes. Never invent baseline numbers; ship them only if they came from a real run. 
v1.44.1 numbers captured this commit: - 51 skills - 2,847 KB total corpus - ~9,319 catalog tokens (sum of description bytes / 4) - top 3: ship 160 KB, plan-ceo-review 128 KB, office-hours 108 KB Test plan: - bun test test/helpers/capture-parity-baseline.test.ts passes 4/4 - The baseline JSON file is committed so reviewers can audit v1→v2 numbers Co-Authored-By: Claude Opus 4.7 (1M context) * feat(resolvers): T2 — ResolverEntry + appliesTo gate infrastructure Adds the conditional-resolver-injection plumbing from the v2_PLAN A.1 step. Resolvers can now be either a bare ResolverFn (always fires, current behavior) or a ResolverEntry { resolve, appliesTo? } (gated; appliesTo returning false skips the resolver, substitutes empty string). Why infrastructure-only: the audit during T0a confirmed most resolvers don't need gating. The {{NAME}} placeholder system is already conditional at the template level — a resolver only fires for skills that reference it. The gate is for future use when a placeholder's audience needs a structural guardrail beyond social convention, or when a sub-resolver inside a larger composed resolver (e.g. preamble) needs per-skill skip. scripts/gen-skill-docs.ts:444 now uses unwrapResolver() to handle both shapes. RESOLVERS map signature widens from Record to Record. All existing resolvers stay bare functions and work unchanged. Test plan: - bun test test/resolver-entry.test.ts: 6 pass (gate plumbing + registry) - bun test test/gen-skill-docs.test.ts: 389 pass (no regression) - bun run gen:skill-docs --dry-run: all SKILL.md files FRESH (no diff) Co-Authored-By: Claude Opus 4.7 (1M context) * feat(preamble): T3 — jargon dedup + terse-build flag (Phase A.2 + A.3) A.2 jargon dedup: generate-writing-style.ts replaces the inlined 80-term jargon list with a one-line pointer to scripts/jargon-list.json. The list was duplicated into every tier-2+ skill (48 of 51 skills); inlining cost was ~1.5 KB × 48 = ~70 KB across the corpus. 
Pointer cost is ~30 bytes per skill. Agents Read the JSON once per session on first jargon term encountered; thereafter the terms array is the canonical reference. A.3 terse build flag: --explain-level=terse compresses preamble prose at gen time. When the flag is set, writing-style collapses to a one-line terse directive and completeness-section + confusion-protocol + context-health are dropped entirely. The default build keeps the runtime-conditional behavior intact (sections still render; the model skips them when EXPLAIN_LEVEL: terse appears in the preamble echo). Terse build is opt-in for users who want shipped skills to match their runtime preference and avoid the per-session terse-mode dead prose. TemplateContext gains an optional `explainLevel: 'default' | 'terse'` field. Default builds set it to 'default'; --explain-level=terse sets 'terse'. Resolvers gate their output via `ctx?.explainLevel === 'terse'`. Measured impact (default build, post-T3): - Total corpus: 2,847 KB → 2,812 KB (saved 35 KB) - ship.md: 160 → 159 KB - plan-ceo-review.md: 128 → 127 KB - Top 10 heaviest: all slightly smaller from jargon pointer Larger compression lands in T4 (catalog trim) and T7 (atomic regen across the full Phase A pipeline). The terse build path further compresses to ~711K tokens vs default ~725K (saved ~14K tokens corpus-wide). Test plan: - bun test test/gen-skill-docs.test.ts: 389 pass (no regression) - bun test test/resolver-entry.test.ts: 6 pass - bun test test/helpers/capture-parity-baseline.test.ts: 4 pass - bun run gen:skill-docs --explain-level=terse: ship.md drops completeness + confusion-protocol + context-health sections; writing-style collapses to one-line terse directive 48 SKILL.md files updated (every tier-2+ skill picks up the jargon pointer). 
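The `explainLevel` gating described in T3 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `scripts/resolvers/preamble/`: the simplified `TemplateContext`, the two resolver names, `composePreamble`, and the emitted strings are all hypothetical. Only the `ctx?.explainLevel === 'terse'` check and the collapse/drop behavior come from the commit message.

```typescript
// Hypothetical, simplified context; the real TemplateContext has more fields.
type ExplainLevel = "default" | "terse";

interface TemplateContext {
  skillName: string;
  explainLevel?: ExplainLevel; // optional, as described in the commit message
}

type ResolverFn = (ctx?: TemplateContext) => string;

// Illustrative resolver: collapses to a one-line directive in terse builds.
const writingStyle: ResolverFn = (ctx) =>
  ctx?.explainLevel === "terse"
    ? "Writing style: terse.\n"
    : "## Writing Style\n\nGloss jargon on first use; see scripts/jargon-list.json.\n";

// Illustrative resolver: dropped entirely (empty string) in terse builds.
const completenessSection: ResolverFn = (ctx) =>
  ctx?.explainLevel === "terse"
    ? ""
    : "## Completeness Principle\n\n(section body)\n";

// Compose the preamble by running each resolver against the same context.
function composePreamble(ctx?: TemplateContext): string {
  return [writingStyle, completenessSection].map((resolve) => resolve(ctx)).join("");
}
```

Because the check uses optional chaining, an undefined context behaves like a default build, which is presumably why the terse-build tests mentioned later assert default, terse, and undefined-ctx cases separately.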
Co-Authored-By: Claude Opus 4.7 (1M context) * feat(catalog): T4 — catalog trim + proactive-suggestions.json (Phase A.4) Shortens frontmatter `description:` in every Claude SKILL.md to a single lead sentence + (gstack) tag. The routing prose ("Use when asked to...", "Proactively suggest...") and voice triggers move to a "## When to invoke" body section so they remain discoverable inside the skill. A per-run registry at scripts/proactive-suggestions.json aggregates the routing/ voice text for all 52 skills so agents can pull guidance on demand without paying for it in the always-loaded catalog. Build flag --catalog-mode=full restores v1.44 legacy behavior (full multi-line descriptions in frontmatter). Default is trim. splitCatalogDescription() extracts: lead sentence, routing paragraphs, voice-triggers line, (gstack) tag presence. Short descriptions (<120 chars, already trimmed) are skipped via a guard so re-runs are idempotent. Measured impact (vs v1.44.1 baseline): - Catalog tokens (sum of description bytes / 4): 9,319 → 4,045 (-56.6%) - Total SKILL.md corpus bytes: 2,915 KB → 2,880 KB (-1.2%) - Routing prose preserved as in-skill "## When to invoke" sections - 52 skill entries in scripts/proactive-suggestions.json (on-demand registry) The corpus drop is small because catalog trim MOVES text from frontmatter to body, it doesn't delete it. The headline win is the catalog: the always-loaded system prompt surface drops by more than half. Test plan: - bun test test/gen-skill-docs.test.ts: 389 pass, 0 fail - Manual: ship/SKILL.md frontmatter description is now ONE line ending with `(gstack)`; allowed-tools field on next line (YAML well-formed) - Manual: scripts/proactive-suggestions.json contains 52 entries - bun run gen:skill-docs --catalog-mode=full restores legacy behavior 53 files changed (52 SKILL.md across hosts + the new proactive-suggestions.json). 
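The catalog numbers quoted in T4 are plain arithmetic and can be re-derived with a sketch like the one below. It assumes the stated heuristic (sum of description bytes divided by 4). The function names and the rounding choices are illustrative assumptions, not the helpers in `test/helpers/capture-parity-baseline.ts`.

```typescript
// Estimate catalog tokens from per-skill description byte counts,
// using the bytes / 4 heuristic from the commit message.
// Rounding to the nearest integer is an assumption of this sketch.
function estimateCatalogTokens(descriptionBytes: number[]): number {
  const totalBytes = descriptionBytes.reduce((sum, n) => sum + n, 0);
  return Math.round(totalBytes / 4);
}

// Percentage change from a baseline to a current value, to one decimal place.
function percentDelta(before: number, after: number): number {
  return Math.round(((after - before) / before) * 1000) / 10;
}

console.log(percentDelta(9319, 4045)); // -56.6 (catalog tokens)
console.log(percentDelta(2847, 2813)); // -1.2 (total corpus, KB)
```

Both printed values match the deltas reported for the catalog and the 2,847 KB to 2,813 KB corpus figures.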
Co-Authored-By: Claude Opus 4.7 (1M context) * test(budget): T5 — hard token budgets + override audit trail (Phase A.6) Two new gate-tier guardrails for the v1.45.0.0 compression baseline: 1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget. Compares current state to test/fixtures/parity-baseline-v1.44.1.json. Three checks: per-skill (×1.05 default ratio), total corpus, and catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05 not 1.0 because the T4 catalog trim moves text from frontmatter to a body section; small skills see a tiny body growth that's fine when offset by the much larger catalog-token win. 2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry, model price changes) before they amortize across PRs. Both checks support an override path with audit trail: GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" — size EVALS_BUDGET_OVERRIDE_REASON="why this is OK" — cost Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with timestamp + scope + reason + CI provenance (runner, branch, commit) via test/helpers/budget-override.ts. Why the override audit: a hard cap with no escape valve becomes operationally hostile (legit price changes, longer transcripts, new required evals can all blow the cap). An override with no audit becomes "everyone overrides everything and the gate is theater." This module ships the audit half so reviewers can see what was waived and why. Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json already in test/fixtures/). 
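The override-with-audit idea in T5 can be sketched as a pure function that turns an override reason into one JSONL line. This is a hedged illustration: the record shape, field names, and `buildOverrideLine` are hypothetical and may differ from `test/helpers/budget-override.ts`. The two `*_OVERRIDE_REASON` variable names come from the commit message, and `GITHUB_ACTIONS`, `GITHUB_REF_NAME`, and `GITHUB_SHA` are standard GitHub Actions variables used here as an assumed source of CI provenance.

```typescript
// Hypothetical record shape; the real schema may use different field names.
interface OverrideRecord {
  timestamp: string;
  scope: "size" | "cost";
  reason: string;
  ci: { runner: string; branch: string; commit: string };
}

type Env = Record<string, string | undefined>;

// Returns one JSONL line when an override reason is set for the scope,
// or null when no override was requested (the hard cap then stands).
function buildOverrideLine(scope: "size" | "cost", env: Env, now: Date): string | null {
  const key =
    scope === "size" ? "GSTACK_SIZE_BUDGET_OVERRIDE_REASON" : "EVALS_BUDGET_OVERRIDE_REASON";
  const reason = env[key]?.trim();
  if (!reason) return null;

  const record: OverrideRecord = {
    timestamp: now.toISOString(),
    scope,
    reason,
    ci: {
      // Fall back to local-runner defaults when CI variables are absent.
      runner: env["GITHUB_ACTIONS"] === "true" ? "github-actions" : "local",
      branch: env["GITHUB_REF_NAME"] ?? "unknown",
      commit: env["GITHUB_SHA"] ?? "unknown",
    },
  };
  return JSON.stringify(record) + "\n";
}
```

Keeping the function free of file IO leaves the append to `~/.gstack/analytics/spend-overrides.jsonl` to the caller, which makes the "required fields per JSONL line" contract easy to assert in isolation.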
Test plan: - bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists) - bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks) - Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps Co-Authored-By: Claude Opus 4.7 (1M context) * test(cso): T6 — pin must-preserve security phrases (Phase A.5) cso/SKILL.md is a content-heavy security audit skill (75 KB after T3+T4). Codex 2nd-pass critique #9: "cso exemption too broad ... should still get resolver dedup, catalog trim, sectioning if safe, and targeted evals around must-not-miss checks." T3 (jargon dedup) and T4 (catalog trim) already applied to cso the same way they applied to every other skill — confirmed by inspection: - jargon list NOT inlined (0 inline term lines) - catalog description trimmed to one line (74 bytes vs 774 bytes baseline) - "## When to invoke" body section present T6 work: lock in the security-prose preservation via a gate-tier test that fails CI if future compression strips load-bearing phrases: - OWASP, STRIDE positioning - daily / comprehensive mode discipline - confidence scoring language - active verification ("verif" prefix catches verify/verified/verification) - ## Preamble heading (preamble resolver still fires) Also guards cso against accidental over-stripping: SKILL.md must stay ≥30 KB (currently 75 KB) — a sudden cliff would mean compression went past the targeted-dedup line into structural removal. No structural change to cso. Future Phase B sections/ work for cso requires writing baseline parity tests FIRST per the v2_PLAN.md sequencing. Test plan: - bun test test/cso-preserved.test.ts: 5 pass Co-Authored-By: Claude Opus 4.7 (1M context) * test(parity): T0b — cathedral parity-suite harness + invariant registry Adds the harness that the v2_PLAN.md cathedral parity-eval suite is built on. 
Compares CURRENT SKILL.md output to v1.44.1 baseline along three axes: STRUCTURE frontmatter shape (catalog trim landed, "## When to invoke" present) CONTENT must-preserve phrases per skill family (cso: OWASP/STRIDE; plan-ceo: SCOPE EXPANSION/HOLD SCOPE/REDUCTION; ship: VERSION/CHANGELOG/PR; etc.) SIZE per-skill byte budget (maxSizeRatio + minBytes guards) PARITY_INVARIANTS registry pins 10 load-bearing skills (cso, ship, plan-*- review, review, qa, investigate, office-hours, autoplan). Each entry declares what must NOT regress; future compression that strips these phrases or shrinks a skill past its minBytes cliff fails CI. Periodic-tier LLM-judge parity (paid, ~$0.20/skill) lands in v2.0.0.0 sections/ phase. Same registry, same harness, judge added on top. Test plan: - bun test test/parity-suite.test.ts: 10/10 invariants pass vs v1.44.1 - Per-skill failures get actionable per-line breakdown so a reviewer can see which phrase / heading / size limit went sideways Co-Authored-By: Claude Opus 4.7 (1M context) * test(coverage): T1 — skill coverage matrix + structural-compliance floor Phase 0 deliverable — eval-first foundation. Two new test files plus the registry: 1. test/skill-coverage-matrix.ts — single source of truth mapping each skill to its gate-tier + periodic-tier test files. SKILL_COVERAGE record with 51 entries; every gstack skill on disk has at least one gate-tier entry. 2. test/skill-coverage-matrix.test.ts — CI gate. Asserts every skill on disk has a registry entry AND that gate[] is non-empty. Catches "skill added but eval not registered" the moment a new SKILL.md lands. 3. test/skill-coverage-floor.test.ts — per-skill structural compliance (FREE, file-IO only). 
For each of 51 skills, verifies: - SKILL.md exists - Frontmatter well-formed (name + description fields) - Catalog-trim contract (inline description ≤ 250 chars, or block form) - Generated header present (edit .tmpl, not .md) - Body ≥ 200 bytes (non-trivial content) - No unresolved {{TEMPLATE}} placeholders leaked The "floor" is the minimum eval that every skill ships with. Skills that need deeper behavioral testing get additional entries in their coverage record (e.g., ship has skill-e2e-ship-idempotency + workflow + floor). Future skills only need to add the floor entry and the matrix gate unblocks them. Codex 2nd-pass critique #1 mitigation: eval-first floor is structural compliance (the testable part) — judgment-skill behavior gets layered periodic-tier evals on top. We don't pretend the floor proves correctness, only that the skill structurally compiles. Test plan: - bun test test/skill-coverage-matrix.test.ts: 4 pass (matrix shape + coverage) - bun test test/skill-coverage-floor.test.ts: 309 pass (6 checks × 51 skills + 3 registry-level) Co-Authored-By: Claude Opus 4.7 (1M context) * build(skills): T7 — atomic regenerate + capture v1.45.0.0 baseline Final regen pass across all hosts after T1-T6 work landed. Captures the v1.45.0.0 parity baseline at test/fixtures/parity-baseline-v1.45.0.0.json for diffing against the v1.44.1 reference. Measured deltas (real numbers from test/helpers/capture-parity-baseline.ts): Total SKILL.md corpus 2,847 KB → 2,813 KB (-1.2%) Catalog tokens (always-loaded) ~9,319 → ~4,045 tokens (-56.6%) Top 10 heaviest skills 0.5-1.0% drop each The catalog token cut is the headline. It's the always-loaded surface, i.e. tokens charged on every session start. Per-skill SKILL.md sizes barely moved because T4 catalog trim MOVES routing prose from frontmatter to a body "## When to invoke" section rather than deleting it — the catalog wins without amputating discoverability. 
The bigger per-skill compression lands in v2.0.0.0 (Phase B sections/ pattern on the 5 heavyweights). v1.45 is the foundation: eval-first infrastructure + cheap wins. scripts/proactive-suggestions.json regenerated with the latest 52 skills listed (one-time write per gen-skill-docs run; aggregated catalog parts). Co-Authored-By: Claude Opus 4.7 (1M context) * v1.45.0.0 — gstack v2 foundation: catalog tokens drop 56%, eval-first floor Bumps VERSION + package.json to 1.45.0.0. CHANGELOG entry covers what shipped between v1.44.1 and this release: the cathedral parity-eval foundation, conditional resolver injection plumbing, jargon dedup, terse build flag, catalog trim with one-line frontmatter descriptions, hard token + dollar budget gates with override audit, cso preservation pins, and the v1.44.1 ↔ v1.45.0.0 parity baselines committed to test/fixtures/. Numbers (measured, not estimated): - Catalog tokens: ~9,319 → ~4,045 (-56.6%) - Total corpus: 2,847 KB → 2,813 KB (-1.2%) - Skills with gate-tier eval coverage: 32/51 → 51/51 (floor achieved) This is the foundation release. v2.0.0.0 will ship the architectural break (sections/*.md.tmpl pattern + mechanical Read enforcement + eval-coverage annotations) as a coordinated marketing-grade launch. Co-Authored-By: Claude Opus 4.7 (1M context) * chore(catalog): refresh proactive-suggestions.json timestamp after v1.45 bump The generated_at field updates on every gen-skill-docs run; this is the T7 atomic-regenerate output landed alongside the v1.45.0.0 bump. Co-Authored-By: Claude Opus 4.7 (1M context) * fix(catalog): deterministic proactive-suggestions.json (no per-run timestamp) Original implementation wrote a generated_at timestamp on every gen-skill-docs run. That made CI dry-run freshness checks flap because the file changed on every regeneration even when the actual content (skill descriptions, routing prose, voice triggers) was unchanged. Two fixes: 1. Drop the generated_at field. 
The file is purely a content registry now. 2. Only write the file when serialized content actually differs from disk. Reproducible test: bun run gen:skill-docs twice in a row now leaves scripts/proactive-suggestions.json unchanged on the second run. Co-Authored-By: Claude Opus 4.7 (1M context) * fix(catalog): preserve routing prose when first sentence exceeds 200 chars splitCatalogDescription truncated the lead BEFORE computing routing extraction, which meant skills whose first sentence was over 200 chars (design-consultation: 207 chars) had their entire routing prose silently dropped — the "## When to invoke" body section came out empty. Root cause: routing was extracted via `collapsed.indexOf(lead)` after lead was suffixed with "...". The "..." never appeared in the original string, so indexOf returned -1 and routingProse fell back to empty. Fix: compute routing from sentenceLead (the untruncated first sentence) BEFORE truncating the displayed lead. The displayed lead still gets "..." when over 200 chars, but the routing extraction uses the real boundary. Also: refresh golden snapshots for claude/codex/factory ship and update two unit tests that asserted v1.44 behavior: - skill-validation.test.ts: trigger-phrase + proactive-routing tests now search whole content, not just frontmatter (T4 moved them to a body "## When to invoke" section) - writing-style-resolver.test.ts: jargon-list assertion now expects the T3 reference pointer, not the inline list Test plan: - bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts test/host-config.test.ts test/skill-size-budget.test.ts test/parity-suite.test.ts test/skill-coverage-matrix.test.ts test/skill-coverage-floor.test.ts test/cso-preserved.test.ts test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts test/gen-skill-docs.test.ts: 1134 pass, 0 fail - Manual verify: design-consultation/SKILL.md "## When to invoke this skill" body section now contains "Use when asked to..." 
+ "Proactively suggest..." Co-Authored-By: Claude Opus 4.7 (1M context) * fix(catalog): deterministic proactive-suggestions.json across machines CI check-freshness failed because scripts/proactive-suggestions.json serialized differently on local vs CI: 1. Root-skill key leaked the directory name. processTemplate's outer loop computed `dir = path.basename(path.dirname(tmplPath))`. For the root SKILL.md.tmpl at ROOT/SKILL.md.tmpl, that returns the repo-checkout directory name — "seville-v3" in a Conductor worktree, "gstack" on GitHub Actions, anything-else for a fork. Fix: detect root via `path.dirname(tmplPath) === ROOT` and hardcode the key to "gstack" for that one case. 2. Aggregate key order was filesystem-iteration order. discoverTemplates doesn't guarantee stable ordering across platforms, so the JSON `skills` object came out shuffled between machines. Fix: sort Object.keys(proactiveAggregate) alphabetically before serializing. After the fix, the generated file is identical on every machine and matches what's committed. CI freshness check (bun run gen:skill-docs && git diff --exit-code) now passes. Test plan: - bun run gen:skill-docs && bun run gen:skill-docs --dry-run: all FRESH - node -e 'verify keys sorted': sorted match: true - grep -c '"seville-v3"' scripts/proactive-suggestions.json: 0 - Focused test suite: 704 pass, 0 fail Co-Authored-By: Claude Opus 4.7 (1M context) * test(catalog): unit + regression coverage for catalog-trim helpers Four exported functions in scripts/gen-skill-docs.ts handle every skill's frontmatter rewrite at gen time but had zero unit tests. Both real bugs we shipped (and fixed) on this branch lived in these functions: v1.45.0.0 design-consultation: when the first sentence exceeded 200 chars, routing-prose extraction lost the entire tail (anchored on truncated lead with "..." that didn't substring-match the original). 
v1.45.0.0 CI freshness: root-skill key leaked the checkout directory name ("seville-v3" vs "gstack") and aggregate order was filesystem- iteration order. Both shapes are now regression-tested: - splitCatalogDescription: 7 tests covering simple multi-line, >200-char first sentence (design-consultation regression), voice-trigger extraction, no-(gstack) handling, embedded periods (documents known fallback), no-period fragments, and idempotency. - buildTrimmedDescription: 3 tests. - buildWhenToInvokeSection: 3 tests. - applyCatalogTrim: 4 tests covering the standard rewrite, no-op for already-short descriptions, the YAML-collision newline fix, and the malformed-frontmatter null return. - proactive-suggestions.json determinism: 3 tests asserting sorted keys, root keyed as "gstack" (not the worktree directory), and no timestamp/generated_at field that would flap CI freshness. Test plan: - bun test test/catalog-trim.test.ts: 20 pass, 0 fail Co-Authored-By: Claude Opus 4.7 (1M context) * test(coverage): fill three remaining v1.46.0.0 test gaps Three untested surfaces from the v1.46.0.0 work. All three would have caught real bugs we shipped (and fixed) on this branch. 1. test/helpers/budget-override.test.ts — 7 tests pin the audit-trail contract for EVALS_BUDGET_OVERRIDE_REASON and GSTACK_SIZE_BUDGET_OVERRIDE_REASON. Without this, the audit logger could silently drop events and overrides become invisible. Tests cover: required fields per JSONL line, CI provenance capture (CI/GITHUB_ACTIONS/branch/commit), local-runner defaults, append-only behavior, missing-directory recovery, and unwritable- path resilience (logs warning instead of throwing). 2. test/terse-build.test.ts — 16 tests pin --explain-level=terse behavior across the 4 gated resolvers and the composed preamble. Default vs terse vs undefined-ctx all asserted. 
Without this, a refactor that breaks the explainLevel threading silently regresses the opt-in compression path; the runtime EXPLAIN_LEVEL: terse gate still works so users wouldn't notice. Tier-1 invariant pinned (terse-only-affects-tier-2+). 3. test/gen-skill-docs-idempotency.test.ts — 2 tests catch the class of bug behind the v1.45.0.0 timestamp flap. Two consecutive gen-skill-docs runs must produce byte-identical outputs across STABLE_OUTPUTS (proactive-suggestions.json, SKILL.md, ship/SKILL.md, plan-ceo-review/SKILL.md, office-hours/SKILL.md, gstack/llms.txt). --dry-run reports zero stale files after a fresh gen. CI freshness regressions surface as test failures BEFORE a PR is opened. Test plan: - bun test test/helpers/budget-override.test.ts: 7 pass - bun test test/terse-build.test.ts: 16 pass - bun test test/gen-skill-docs-idempotency.test.ts: 2 pass - Full focused suite (15 test files): 1179 pass, 0 fail (+45 new tests vs the pre-fill baseline of 1134) Co-Authored-By: Claude Opus 4.7 (1M context) * test(coverage): close 5 remaining v1.46.0.0 test gaps (A-E) Five behaviors that v1.46 ships but had no test coverage. All now pinned. A) --host all idempotency (test/gen-skill-docs-idempotency.test.ts) The default test ran Claude host only. Non-Claude hosts (Codex, Factory, Cursor, OpenClaw, GBrain, Slate, OpenCode, Hermes, Kiro) each have their own output paths and could carry their own non-deterministic fields. We hit a "--host all needed for freshness check" mid-/ship. Now: two consecutive `bun run gen:skill-docs --host all` runs must produce byte-identical outputs across a per-host sample (.agents/, .cursor/, .factory/, .gbrain/). Catches per-host adapter regressions before CI. B) --catalog-mode=full opt-out (test/catalog-mode-full.test.ts) The legacy escape hatch had zero tests. 
6 new tests across two layers: static (CATALOG_MODE_ARG parsed; conditional gate present; default is "trim"; invalid value throws) + smoke (actual --catalog-mode=full run produces a multi-line `description: |` block + omits "## When to invoke" body section; mutates the working tree then restores in a finally block). C) parity-baseline-v1.44.1.json integrity (test/parity-baseline-integrity.test.ts) The baseline is the source of every v1→v2 number cited in the CHANGELOG v1.46.0.0 entry. Anyone could edit it without test failure until now. 8 new tests pin: existence, tag, capturedFromCommit allowlist, expected v1.44 numbers (51 skills, ~2,915 KB, ~9,319 catalog tokens), CHANGELOG references this file by path, per-skill shape, and a SHA256 byte-stability hash. Any edit fails with a clear "if intentional, update EXPECTED_HASH AND the CHANGELOG numbers" signal. D) Live appliesTo gate end-to-end (test/resolver-entry.test.ts extended) The unwrapResolver unit tests covered the function; the gen-skill-docs.ts substitution loop that USES the gate had no integration coverage. 6 new tests simulate the exact 4-line shape from gen-skill-docs.ts:457-467 against synthetic registries: plain-function fires unconditionally, gated fires when true / empty-string when false, mixed registries compose, parameterized resolvers respect gates, unknown resolvers throw. E) Per-skill min-size floor (test/skill-size-budget.test.ts extended) The existing 200-byte body coverage-floor is a noise floor — a skill that lost 99.75% of content still passes. 1 new test asserts every skill stays ≥80% of its v1.44.1 baseline size (the parity-suite content invariants only covered 10 of 51 skills; the remaining 41 were uncovered). SECTIONS_EXTRACTED hook in place for v2.0.0.0 when the sections/ pattern legitimately shrinks ship/plan-ceo/etc. past the floor. 
Test plan:
- bun test focused 17-file suite: 1202 pass, 0 fail (+23 new tests vs the pre-fill 1179 baseline)
- catalog-mode=full mutates working tree then restores cleanly
- --host all idempotency runs two full gen passes in <1s on this machine

Co-Authored-By: Claude Opus 4.7 (1M context)

---------

Co-authored-by: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md | 63 ++
 SKILL.md | 14 +-
 VERSION | 2 +-
 autoplan/SKILL.md | 103 +--
 benchmark-models/SKILL.md | 21 +-
 benchmark/SKILL.md | 19 +-
 browse/SKILL.md | 18 +-
 canary/SKILL.md | 95 +--
 careful/SKILL.md | 16 +-
 codex/SKILL.md | 98 +--
 context-restore/SKILL.md | 99 +--
 context-save/SKILL.md | 98 +--
 cso/SKILL.md | 100 +--
 design-consultation/SKILL.md | 97 +--
 design-html/SKILL.md | 103 +--
 design-review/SKILL.md | 98 +--
 design-shotgun/SKILL.md | 96 +--
 devex-review/SKILL.md | 102 +--
 docs/designs/v2_PLAN.md | 755 ++++++++++++++++++
 document-generate/SKILL.md | 96 +--
 document-release/SKILL.md | 99 +--
 freeze/SKILL.md | 16 +-
 gstack-upgrade/SKILL.md | 15 +-
 guard/SKILL.md | 15 +-
 health/SKILL.md | 95 +--
 investigate/SKILL.md | 99 +--
 ios-clean/SKILL.md | 101 +--
 ios-design-review/SKILL.md | 106 +--
 ios-fix/SKILL.md | 104 +--
 ios-qa/SKILL.md | 106 +--
 ios-sync/SKILL.md | 99 +--
 land-and-deploy/SKILL.md | 93 +--
 landing-report/SKILL.md | 95 +--
 learn/SKILL.md | 95 +--
 make-pdf/SKILL.md | 19 +-
 office-hours/SKILL.md | 107 +--
 open-gstack-browser/SKILL.md | 97 +--
 package.json | 2 +-
 pair-agent/SKILL.md | 100 +--
 plan-ceo-review/SKILL.md | 101 +--
 plan-design-review/SKILL.md | 98 +--
 plan-devex-review/SKILL.md | 104 +--
 plan-eng-review/SKILL.md | 100 +--
 plan-tune/SKILL.md | 108 +--
 qa-only/SKILL.md | 98 +--
 qa/SKILL.md | 104 +--
 retro/SKILL.md | 95 +--
 review/SKILL.md | 93 +--
 scrape/SKILL.md | 97 +--
 scripts/capture-baseline.ts | 54 ++
 scripts/gen-skill-docs.ts | 285 ++++++-
 scripts/proactive-suggestions.json | 267 +++++++
 scripts/resolvers/index.ts | 17 +-
 scripts/resolvers/preamble.ts | 6 +-
.../preamble/generate-completeness-section.ts | 5 +- .../preamble/generate-confusion-protocol.ts | 5 +- .../preamble/generate-context-health.ts | 5 +- .../preamble/generate-writing-style.ts | 37 +- scripts/resolvers/types.ts | 49 ++ setup-browser-cookies/SKILL.md | 13 +- setup-deploy/SKILL.md | 97 +-- setup-gbrain/SKILL.md | 93 +-- ship/SKILL.md | 94 +-- skillify/SKILL.md | 98 +-- sync-gbrain/SKILL.md | 95 +-- test/catalog-mode-full.test.ts | 118 +++ test/catalog-trim.test.ts | 313 ++++++++ test/cso-preserved.test.ts | 86 ++ test/fixtures/golden/claude-ship-SKILL.md | 94 +-- test/fixtures/golden/codex-ship-SKILL.md | 79 +- test/fixtures/golden/factory-ship-SKILL.md | 79 +- test/fixtures/parity-baseline-v1.44.1.json | 623 +++++++++++++++ test/fixtures/parity-baseline-v1.46.0.0.json | 623 +++++++++++++++ test/gen-skill-docs-idempotency.test.ts | 159 ++++ test/helpers/budget-override.test.ts | 116 +++ test/helpers/budget-override.ts | 50 ++ test/helpers/capture-parity-baseline.test.ts | 90 +++ test/helpers/capture-parity-baseline.ts | 231 ++++++ test/helpers/parity-harness.ts | 230 ++++++ test/parity-baseline-integrity.test.ts | 145 ++++ test/parity-suite.test.ts | 49 ++ test/resolver-entry.test.ts | 186 +++++ test/skill-budget-regression.test.ts | 64 ++ test/skill-coverage-floor.test.ts | 153 ++++ test/skill-coverage-matrix.test.ts | 72 ++ test/skill-coverage-matrix.ts | 181 +++++ test/skill-size-budget.test.ts | 220 +++++ test/skill-validation.test.ts | 16 +- test/terse-build.test.ts | 151 ++++ test/writing-style-resolver.test.ts | 14 +- unfreeze/SKILL.md | 13 +- 91 files changed, 6044 insertions(+), 3885 deletions(-) create mode 100644 docs/designs/v2_PLAN.md create mode 100644 scripts/capture-baseline.ts create mode 100644 scripts/proactive-suggestions.json create mode 100644 test/catalog-mode-full.test.ts create mode 100644 test/catalog-trim.test.ts create mode 100644 test/cso-preserved.test.ts create mode 100644 test/fixtures/parity-baseline-v1.44.1.json create 
mode 100644 test/fixtures/parity-baseline-v1.46.0.0.json create mode 100644 test/gen-skill-docs-idempotency.test.ts create mode 100644 test/helpers/budget-override.test.ts create mode 100644 test/helpers/budget-override.ts create mode 100644 test/helpers/capture-parity-baseline.test.ts create mode 100644 test/helpers/capture-parity-baseline.ts create mode 100644 test/helpers/parity-harness.ts create mode 100644 test/parity-baseline-integrity.test.ts create mode 100644 test/parity-suite.test.ts create mode 100644 test/resolver-entry.test.ts create mode 100644 test/skill-coverage-floor.test.ts create mode 100644 test/skill-coverage-matrix.test.ts create mode 100644 test/skill-coverage-matrix.ts create mode 100644 test/skill-size-budget.test.ts create mode 100644 test/terse-build.test.ts diff --git a/CHANGELOG.md b/CHANGELOG.md index a126d4c13..bc1d88a69 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,68 @@ # Changelog +## [1.46.0.0] - 2026-05-26 + +## **gstack v2 foundation lands. Catalog tokens drop 56%, eval-first floor covers all 51 skills, hard token + dollar caps gate every PR.** + +The always-loaded skill catalog — what every Claude Code session pays for at startup before any real work begins — went from ~9,319 tokens to ~4,045 tokens. That's a 56.6% cut to the surface gstack has been criticized for (third-party review, May 2026: "10K+ tokens before any real code is written"). Heavyweight skills like `/ship`, `/plan-ceo-review`, `/office-hours` still ship their full content, but their frontmatter descriptions trim to one sentence each; the routing prose lives in a new "## When to invoke" body section, and a per-run `scripts/proactive-suggestions.json` registry holds the voice-trigger + proactive-suggest text so agents can pull guidance on demand instead of always-loaded. + +This is the v2 foundation release. 
The architectural break — `sections/*.md.tmpl` pattern, mechanical Read enforcement, eval-coverage annotations — lands in v2.0.0.0 as a coordinated launch. v1.46 absorbs every low-risk win, ships the eval-first floor every future skill must pass, and locks in the v1.44.1 reference baseline so reviewers can audit v1→v2 numbers against a real file (`test/fixtures/parity-baseline-v1.44.1.json`). + +### The numbers that matter + +Source: `bun run scripts/capture-baseline.ts --tag v1.46.0.0` vs the locked v1.44.1 baseline at `test/fixtures/parity-baseline-v1.44.1.json`. Reproduce locally with `bun test test/skill-size-budget.test.ts`. + +| Metric | v1.44.1 | v1.46.0.0 | Δ | +|---|---|---|---| +| Catalog tokens (always-loaded system prompt) | ~9,319 | ~4,045 | **−56.6%** | +| Total SKILL.md corpus | 2,847 KB | 2,813 KB | −1.2% | +| ship.md | 160 KB | 159 KB | −0.5% | +| plan-ceo-review.md | 128 KB | 127 KB | −0.7% | +| office-hours.md | 108 KB | 108 KB | −0.8% | +| Skills with gate-tier eval coverage | 32 of 51 | **51 of 51** | floor achieved | +| Cathedral parity invariants pinned | 0 | **10** | structural + content | +| Token & dollar budget regressions caught at CI | (none) | **5 new test files** | per-skill, corpus, catalog, eval-cost gate, eval-cost periodic | + +The corpus barely moved because the catalog trim MOVES routing prose from frontmatter to a body section — it doesn't delete it. The always-loaded surface drops by more than half because catalog text is what Claude Code reads on every session start; body content only loads when the skill is invoked. + +### What this means for you + +If you use any gstack skill, every session starts ~5,000 tokens lighter before you type anything. Heavyweight invocations like `/ship` cost about the same as before, but session startup feels snappier. 
If you've been on the fence about installing gstack because of the "fat" reputation, this is the release that addresses it directly: the always-loaded surface is now competitive with stripped-down skill packs while every skill keeps its full body content.
+
+If you contribute skills, the eval-first floor means a new SKILL.md without an entry in `test/skill-coverage-matrix.ts` fails CI. The minimum entry is one line referencing `test/skill-coverage-floor.test.ts` (the free structural-compliance smoke test). Behavioral E2E coverage gets layered on top per skill.
+
+If you run gstack in CI, the new `EVALS_BUDGET_HARD_CAP=$30` cap (per-suite: gate $25 / periodic $70) stops runaway eval costs from a model price change or infinite-retry bug. Override path exists for legit-need-more cases: `EVALS_BUDGET_OVERRIDE_REASON="why this is OK"` logs to `~/.gstack/analytics/spend-overrides.jsonl` for audit.
+
+### Itemized changes
+
+**Added**
+- `scripts/capture-baseline.ts` + `test/helpers/capture-parity-baseline.ts` — captures per-skill SKILL.md sizes, token estimates, frontmatter description lengths, and eval coverage flags. Writes JSON snapshots used by the parity and size-budget gates. Locks `test/fixtures/parity-baseline-v1.44.1.json` as the v1→v2 reference.
+- `test/helpers/parity-harness.ts` + `test/parity-suite.test.ts` — cathedral parity-eval suite floor. `PARITY_INVARIANTS` registry pins must-preserve phrases per skill family (cso: OWASP/STRIDE; plan-ceo: SCOPE EXPANSION / HOLD SCOPE; ship: VERSION/CHANGELOG/PR) so future compression can't silently strip load-bearing prose.
+- `test/skill-coverage-matrix.ts` + `test/skill-coverage-matrix.test.ts` — single source of truth mapping each skill to gate + periodic tests; CI gate asserts every skill has at least one gate-tier entry. 51 skills, 51 entries.
+- `test/skill-coverage-floor.test.ts` — per-skill structural-compliance smoke test (file-IO, free).
Verifies frontmatter shape, generated header, body non-trivial, no leaked `{{TEMPLATE}}` placeholders, catalog-trim contract on description. 309 assertions across 51 skills.
+- `test/skill-size-budget.test.ts` — per-skill SKILL.md byte budget (×1.05 default ratio), total corpus budget, catalog token budget (≤7000 for v1.46). Caught regressions get a per-skill breakdown + override path.
+- `test/cso-preserved.test.ts` — pins cso's must-not-strip security guidance phrases (OWASP, STRIDE, daily/comprehensive mode discipline, confidence scoring, active verification). Future compression that hits cso fails CI here.
+- `test/helpers/budget-override.ts` — audit-trail logger for `GSTACK_SIZE_BUDGET_OVERRIDE_REASON` and `EVALS_BUDGET_OVERRIDE_REASON`. Append-only JSONL at `~/.gstack/analytics/spend-overrides.jsonl` with timestamp + scope + reason + CI provenance.
+- `scripts/proactive-suggestions.json` — per-run registry of routing prose + voice triggers extracted from skill frontmatter during catalog trim. Agents pull on demand instead of paying for it always-loaded.
+- `--catalog-mode=full` build flag — restores v1.44 legacy multi-line catalog descriptions. Use when debugging routing regressions or when shipping skills to hosts that depend on the legacy fat catalog.
+- `--explain-level=terse` build flag — opt-in compression of `## Writing Style` + `## Completeness Principle` + `## Confusion Protocol` + `## Context Health` preamble sections. Default build keeps the runtime-conditional behavior intact (the model still skips when `EXPLAIN_LEVEL: terse` appears in the preamble echo); terse build makes the compression structural.
+- `EVALS_BUDGET_HARD_CAP` environment variable (umbrella $30 default) + per-suite `EVALS_BUDGET_HARD_CAP_GATE=$25`, `EVALS_BUDGET_HARD_CAP_PERIODIC=$70`. Build fails if a single run exceeds; `EVALS_BUDGET_OVERRIDE_REASON` env unblocks + audit-logs.
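The budget-cap and override behavior described in the Added list can be sketched roughly as follows. This is an illustrative sketch only, not the code in `test/helpers/budget-override.ts`: the function `checkEvalBudget`, the `BudgetDecision` shape, and the audit-record field names are hypothetical, and the precedence between the umbrella cap and the per-suite caps is an assumption, since the changelog does not state it. Only the environment variable names and the $25 / $70 per-suite figures come from the entry above.

```typescript
// Hypothetical sketch of an eval spend gate; names and precedence are assumptions.
type Suite = "gate" | "periodic";

interface BudgetDecision {
  allowed: boolean;
  capUsd: number;
  // One JSONL line, present only when an override reason unblocked the run.
  auditRecord?: string;
}

// Per-suite figures quoted in the changelog entry.
const DEFAULT_CAPS_USD: Record<Suite, number> = { gate: 25, periodic: 70 };

// Accepts "30" or "$30"; returns undefined for missing or malformed values.
function parseUsd(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const value = Number(raw.trim().replace(/^\$/, ""));
  return Number.isFinite(value) && value >= 0 ? value : undefined;
}

function checkEvalBudget(
  suite: Suite,
  spentUsd: number,
  env: Record<string, string | undefined>,
  now: Date = new Date(),
): BudgetDecision {
  const suiteKey =
    suite === "gate" ? "EVALS_BUDGET_HARD_CAP_GATE" : "EVALS_BUDGET_HARD_CAP_PERIODIC";
  // Assumed precedence: per-suite variable, then umbrella variable, then per-suite default.
  const capUsd =
    parseUsd(env[suiteKey]) ?? parseUsd(env["EVALS_BUDGET_HARD_CAP"]) ?? DEFAULT_CAPS_USD[suite];

  if (spentUsd <= capUsd) return { allowed: true, capUsd };

  const reason = env["EVALS_BUDGET_OVERRIDE_REASON"]?.trim();
  if (!reason) return { allowed: false, capUsd };

  // Field names are illustrative; the real log also records CI provenance.
  const auditRecord = JSON.stringify({
    ts: now.toISOString(),
    scope: `evals:${suite}`,
    reason,
    spentUsd,
    capUsd,
  });
  return { allowed: true, capUsd, auditRecord };
}
```

In this sketch the function stays pure and returns the audit line instead of writing it, so a caller would be the one to append `auditRecord` to `~/.gstack/analytics/spend-overrides.jsonl`. Keeping the file write outside the decision makes the cap logic testable without touching the home directory.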
+
+**Changed**
+- Skill frontmatter `description:` blocks across 51 skills trimmed to a single lead sentence + `(gstack)` tag. Routing prose ("Use when asked to...", "Proactively suggest...") and voice triggers moved to a `## When to invoke` body section in each SKILL.md. Always-loaded catalog cost drops ~56%.
+- Jargon list (`scripts/jargon-list.json`, 80 terms) no longer inlined into every tier-2+ skill. `## Writing Style` now references the JSON path; agents Read it once per session on first jargon term encountered. Saves ~70 KB of duplicated text across the corpus.
+- `ResolverEntry` union type in `scripts/resolvers/types.ts` + `unwrapResolver` helper. Resolvers can now be either bare functions (current behavior) or `{ resolve, appliesTo? }` gated entries. `scripts/gen-skill-docs.ts:444` checks the gate before invocation. Infrastructure for future per-skill resolver gating; all current resolvers stay bare functions and work unchanged.
+- `TemplateContext` gains an optional `explainLevel: 'default' | 'terse'` field threaded from the `--explain-level` build flag.
+
+**Fixed**
+- Catalog descriptions no longer collide with adjacent YAML fields (initial implementation produced `description: ... (gstack)allowed-tools:` with no newline; fixed by appending `\n` to the replacement).
+
+**For contributors**
+- New skills require an entry in `test/skill-coverage-matrix.ts` — at minimum referencing `test/skill-coverage-floor.test.ts` in `gate[]`. The CI gate at `test/skill-coverage-matrix.test.ts` fails fast on missing entries.
+- New must-preserve invariants for a skill family go in `PARITY_INVARIANTS` in `test/helpers/parity-harness.ts`. Adding invariants is additive; removing one is a deliberate scope decision.
+- The `scripts/jargon-list.json` is the canonical glossary. Add terms there; gen-skill-docs picks them up automatically on next regen.
+- `test/fixtures/parity-baseline-v1.44.1.json` is the locked v1→v2 reference.
Do not modify; capture new snapshots at later tags via `bun run scripts/capture-baseline.ts --tag `. + ## [1.45.0.0] - 2026-05-25 ## **Design boards now live 24 hours, not 10 minutes. One daemon hosts every board, one tab survives the whole day.** diff --git a/SKILL.md b/SKILL.md index c6441014c..56f74ba07 100644 --- a/SKILL.md +++ b/SKILL.md @@ -2,11 +2,7 @@ name: gstack preamble-tier: 1 version: 1.1.0 -description: | - Fast headless browser for QA testing and site dogfooding. Navigate pages, interact with - elements, verify state, diff before/after, take annotated screenshots, test responsive - layouts, forms, uploads, dialogs, and capture bug evidence. Use when asked to open or - test a site, verify a deployment, dogfood a user flow, or file a bug with screenshots. (gstack) +description: Fast headless browser for QA testing and site dogfooding. (gstack) allowed-tools: - Bash - Read @@ -21,6 +17,14 @@ triggers: + +## When to invoke this skill + +Navigate pages, interact with +elements, verify state, diff before/after, take annotated screenshots, test responsive +layouts, forms, uploads, dialogs, and capture bug evidence. Use when asked to open or +test a site, verify a deployment, dogfood a user flow, or file a bug with screenshots. + ## Preamble (run first) ```bash diff --git a/VERSION b/VERSION index 94cf8fed1..2ac450d0b 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.45.0.0 +1.46.0.0 diff --git a/autoplan/SKILL.md b/autoplan/SKILL.md index a39b60bbd..073ba4b86 100644 --- a/autoplan/SKILL.md +++ b/autoplan/SKILL.md @@ -2,16 +2,7 @@ name: autoplan preamble-tier: 3 version: 1.0.0 -description: | - Auto-review pipeline — reads the full CEO, design, eng, and DX review skills from disk - and runs them sequentially with auto-decisions using 6 decision principles. Surfaces - taste decisions (close approaches, borderline scope, codex disagreements) at a final - approval gate. One command, fully reviewed plan out. 
- Use when asked to "auto review", "autoplan", "run all reviews", "review this plan - automatically", or "make the decisions for me". - Proactively suggest when the user has a plan file and wants to run the full review - gauntlet without answering 15-30 intermediate questions. (gstack) - Voice triggers (speech-to-text aliases): "auto plan", "automatic review". +description: Auto-review pipeline — reads the full CEO, design, eng, and DX review skills from disk and runs them sequentially with auto-decisions using 6 decision principles. (gstack) benefits-from: [office-hours] triggers: - run all reviews @@ -30,6 +21,19 @@ allowed-tools: + +## When to invoke this skill + +Surfaces +taste decisions (close approaches, borderline scope, codex disagreements) at a final +approval gate. One command, fully reviewed plan out. +Use when asked to "auto review", "autoplan", "run all reviews", "review this plan +automatically", or "make the decisions for me". +Proactively suggest when the user has a plan file and wants to run the full review +gauntlet without answering 15-30 intermediate questions. + +Voice triggers (speech-to-text aliases): "auto plan", "automatic review". + ## Preamble (run first) ```bash @@ -556,84 +560,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/benchmark-models/SKILL.md b/benchmark-models/SKILL.md index 47050855b..ef821a3ea 100644 --- a/benchmark-models/SKILL.md +++ b/benchmark-models/SKILL.md @@ -2,14 +2,7 @@ name: benchmark-models preamble-tier: 1 version: 1.0.0 -description: | - Cross-model benchmark for gstack skills. Runs the same prompt through Claude, - GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, - and optionally quality via LLM judge. 
Answers "which model is actually best - for this skill?" with data instead of vibes. Separate from /benchmark, which - measures web page performance. Use when: "benchmark models", "compare models", - "which model is best for X", "cross-model comparison", "model shootout". (gstack) - Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best". +description: Cross-model benchmark for gstack skills. (gstack) triggers: - cross model benchmark - compare claude gpt gemini @@ -23,6 +16,18 @@ allowed-tools: + +## When to invoke this skill + +Runs the same prompt through Claude, +GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, +and optionally quality via LLM judge. Answers "which model is actually best +for this skill?" with data instead of vibes. Separate from /benchmark, which +measures web page performance. Use when: "benchmark models", "compare models", +"which model is best for X", "cross-model comparison", "model shootout". + +Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best". + ## Preamble (run first) ```bash diff --git a/benchmark/SKILL.md b/benchmark/SKILL.md index b6dc81373..841fb95b6 100644 --- a/benchmark/SKILL.md +++ b/benchmark/SKILL.md @@ -2,13 +2,7 @@ name: benchmark preamble-tier: 1 version: 1.0.0 -description: | - Performance regression detection using the browse daemon. Establishes - baselines for page load times, Core Web Vitals, and resource sizes. - Compares before/after on every PR. Tracks performance trends over time. - Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", - "bundle size", "load time". (gstack) - Voice triggers (speech-to-text aliases): "speed test", "check performance". +description: Performance regression detection using the browse daemon. 
(gstack) triggers: - performance benchmark - check page speed @@ -23,6 +17,17 @@ allowed-tools: + +## When to invoke this skill + +Establishes +baselines for page load times, Core Web Vitals, and resource sizes. +Compares before/after on every PR. Tracks performance trends over time. +Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", +"bundle size", "load time". + +Voice triggers (speech-to-text aliases): "speed test", "check performance". + ## Preamble (run first) ```bash diff --git a/browse/SKILL.md b/browse/SKILL.md index 6a4f5c269..74cf086fd 100644 --- a/browse/SKILL.md +++ b/browse/SKILL.md @@ -2,13 +2,7 @@ name: browse preamble-tier: 1 version: 1.1.0 -description: | - Fast headless browser for QA testing and site dogfooding. Navigate any URL, interact with - elements, verify page state, diff before/after actions, take annotated screenshots, check - responsive layouts, test forms and uploads, handle dialogs, and assert element states. - ~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a - user flow, or file a bug with evidence. Use when asked to "open in browser", "test the - site", "take a screenshot", or "dogfood this". (gstack) +description: Fast headless browser for QA testing and site dogfooding. (gstack) triggers: - browse a page - headless browser @@ -22,6 +16,16 @@ allowed-tools: + +## When to invoke this skill + +Navigate any URL, interact with +elements, verify page state, diff before/after actions, take annotated screenshots, check +responsive layouts, test forms and uploads, handle dialogs, and assert element states. +~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a +user flow, or file a bug with evidence. Use when asked to "open in browser", "test the +site", "take a screenshot", or "dogfood this". 
+ ## Preamble (run first) ```bash diff --git a/canary/SKILL.md b/canary/SKILL.md index 1ba6ecec7..b305bfd45 100644 --- a/canary/SKILL.md +++ b/canary/SKILL.md @@ -2,12 +2,7 @@ name: canary preamble-tier: 2 version: 1.0.0 -description: | - Post-deploy canary monitoring. Watches the live app for console errors, - performance regressions, and page failures using the browse daemon. Takes - periodic screenshots, compares against pre-deploy baselines, and alerts - on anomalies. Use when: "monitor deploy", "canary", "post-deploy check", - "watch production", "verify deploy". (gstack) +description: Post-deploy canary monitoring. (gstack) allowed-tools: - Bash - Read @@ -22,6 +17,15 @@ triggers: + +## When to invoke this skill + +Watches the live app for console errors, +performance regressions, and page failures using the browse daemon. Takes +periodic screenshots, compares against pre-deploy baselines, and alerts +on anomalies. Use when: "monitor deploy", "canary", "post-deploy check", +"watch production", "verify deploy". + ## Preamble (run first) ```bash @@ -548,84 +552,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/careful/SKILL.md b/careful/SKILL.md index 91a5776e3..678d66c16 100644 --- a/careful/SKILL.md +++ b/careful/SKILL.md @@ -1,12 +1,7 @@ --- name: careful version: 0.1.0 -description: | - Safety guardrails for destructive commands. Warns before rm -rf, DROP TABLE, - force-push, git reset --hard, kubectl delete, and similar destructive operations. - User can override each warning. 
Use when touching prod, debugging live systems, - or working in a shared environment. Use when asked to "be careful", "safety mode", - "prod mode", or "careful mode". (gstack) +description: Safety guardrails for destructive commands. (gstack) triggers: - be careful - warn before destructive @@ -25,6 +20,15 @@ hooks: + +## When to invoke this skill + +Warns before rm -rf, DROP TABLE, +force-push, git reset --hard, kubectl delete, and similar destructive operations. +User can override each warning. Use when touching prod, debugging live systems, +or working in a shared environment. Use when asked to "be careful", "safety mode", +"prod mode", or "careful mode". + # /careful — Destructive Command Guardrails Safety mode is now **active**. Every bash command will be checked for destructive diff --git a/codex/SKILL.md b/codex/SKILL.md index dbc6bbcb6..8051a80b6 100644 --- a/codex/SKILL.md +++ b/codex/SKILL.md @@ -2,13 +2,7 @@ name: codex preamble-tier: 3 version: 1.0.0 -description: | - OpenAI Codex CLI wrapper — three modes. Code review: independent diff review via - codex review with pass/fail gate. Challenge: adversarial mode that tries to break - your code. Consult: ask codex anything with session continuity for follow-ups. - The "200 IQ autistic developer" second opinion. Use when asked to "codex review", - "codex challenge", "ask codex", "second opinion", or "consult codex". (gstack) - Voice triggers (speech-to-text aliases): "code x", "code ex", "get another opinion". +description: OpenAI Codex CLI wrapper — three modes. (gstack) triggers: - codex review - second opinion @@ -24,6 +18,17 @@ allowed-tools: + +## When to invoke this skill + +Code review: independent diff review via +codex review with pass/fail gate. Challenge: adversarial mode that tries to break +your code. Consult: ask codex anything with session continuity for follow-ups. +The "200 IQ autistic developer" second opinion. 
Use when asked to "codex review", +"codex challenge", "ask codex", "second opinion", or "consult codex". + +Voice triggers (speech-to-text aliases): "code x", "code ex", "get another opinion". + ## Preamble (run first) ```bash @@ -550,84 +555,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. 
The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/context-restore/SKILL.md b/context-restore/SKILL.md index 92eb1cdd1..d385ecdfe 100644 --- a/context-restore/SKILL.md +++ b/context-restore/SKILL.md @@ -2,14 +2,7 @@ name: context-restore preamble-tier: 2 version: 1.0.0 -description: | - Restore working context saved earlier by /context-save. Loads the most recent - saved state (across all branches by default) so you can pick up where you - left off — even across Conductor workspace handoffs. - Use when asked to "resume", "restore context", "where was I", or - "pick up where I left off". Pair with /context-save. - Formerly /checkpoint resume — renamed because Claude Code treats /checkpoint - as a native rewind alias in current environments. (gstack) +description: Restore working context saved earlier by /context-save. (gstack) allowed-tools: - Bash - Read @@ -26,6 +19,17 @@ triggers: + +## When to invoke this skill + +Loads the most recent +saved state (across all branches by default) so you can pick up where you +left off — even across Conductor workspace handoffs. +Use when asked to "resume", "restore context", "where was I", or +"pick up where I left off". Pair with /context-save. +Formerly /checkpoint resume — renamed because Claude Code treats /checkpoint +as a native rewind alias in current environments. + ## Preamble (run first) ```bash @@ -552,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/context-save/SKILL.md b/context-save/SKILL.md index 5a7b0d60e..fae5e29d7 100644 --- a/context-save/SKILL.md +++ b/context-save/SKILL.md @@ -2,14 +2,7 @@ name: context-save preamble-tier: 2 version: 1.0.0 -description: | - Save working context. Captures git state, decisions made, and remaining work - so any future session can pick up without losing a beat. - Use when asked to "save progress", "save state", "context save", or - "save my work". 
Pair with /context-restore to resume later. - Formerly /checkpoint — renamed because Claude Code treats /checkpoint as a - native rewind alias in current environments, which was shadowing this skill. - (gstack) +description: Save working context. (gstack) allowed-tools: - Bash - Read @@ -26,6 +19,16 @@ triggers: + +## When to invoke this skill + +Captures git state, decisions made, and remaining work +so any future session can pick up without losing a beat. +Use when asked to "save progress", "save state", "context save", or +"save my work". Pair with /context-restore to resume later. +Formerly /checkpoint — renamed because Claude Code treats /checkpoint as a +native rewind alias in current environments, which was shadowing this skill. + ## Preamble (run first) ```bash @@ -552,84 +555,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/cso/SKILL.md b/cso/SKILL.md index 64cb75306..50bb4fee1 100644 --- a/cso/SKILL.md +++ b/cso/SKILL.md @@ -2,14 +2,7 @@ name: cso preamble-tier: 2 version: 2.0.0 -description: | - Chief Security Officer mode. Infrastructure-first security audit: secrets archaeology, - dependency supply chain, CI/CD pipeline security, LLM/AI security, skill supply chain - scanning, plus OWASP Top 10, STRIDE threat modeling, and active verification. 
- Two modes: daily (zero-noise, 8/10 confidence gate) and comprehensive (monthly deep - scan, 2/10 bar). Trend tracking across audit runs. - Use when: "security audit", "threat model", "pentest review", "OWASP", "CSO review". (gstack) - Voice triggers (speech-to-text aliases): "see-so", "see so", "security review", "security check", "vulnerability scan", "run security". +description: Chief Security Officer mode. (gstack) allowed-tools: - Bash - Read @@ -27,6 +20,18 @@ triggers: + +## When to invoke this skill + +Infrastructure-first security audit: secrets archaeology, +dependency supply chain, CI/CD pipeline security, LLM/AI security, skill supply chain +scanning, plus OWASP Top 10, STRIDE threat modeling, and active verification. +Two modes: daily (zero-noise, 8/10 confidence gate) and comprehensive (monthly deep +scan, 2/10 bar). Trend tracking across audit runs. +Use when: "security audit", "threat model", "pentest review", "OWASP", "CSO review". + +Voice triggers (speech-to-text aliases): "see-so", "see so", "security review", "security check", "vulnerability scan", "run security". + ## Preamble (run first) ```bash @@ -553,84 +558,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. 
## Completeness Principle — Boil the Lake diff --git a/design-consultation/SKILL.md b/design-consultation/SKILL.md index 8cecaa3d6..08bb2f48f 100644 --- a/design-consultation/SKILL.md +++ b/design-consultation/SKILL.md @@ -2,14 +2,7 @@ name: design-consultation preamble-tier: 3 version: 1.0.0 -description: | - Design consultation: understands your product, researches the landscape, proposes a - complete design system (aesthetic, typography, color, layout, spacing, motion), and - generates font+color preview pages. Creates DESIGN.md as your project's design source - of truth. For existing sites, use /plan-design-review to infer the system instead. - Use when asked to "design system", "brand guidelines", or "create DESIGN.md". - Proactively suggest when starting a new project's UI with no existing - design system or DESIGN.md. (gstack) +description: Design consultation: understands your product, researches the landscape, proposes a complete design system (aesthetic, typography, color, layout, spacing, motion), and generates font+color preview... (gstack) allowed-tools: - Bash - Read @@ -50,6 +43,15 @@ gbrain: + +## When to invoke this skill + +Creates DESIGN.md as your project's design source +of truth. For existing sites, use /plan-design-review to infer the system instead. +Use when asked to "design system", "brand guidelines", or "create DESIGN.md". +Proactively suggest when starting a new project's UI with no existing +design system or DESIGN.md. + ## Preamble (run first) ```bash @@ -576,84 +578,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/design-html/SKILL.md b/design-html/SKILL.md index 5c92f7703..b0e8a5f30 100644 --- a/design-html/SKILL.md +++ b/design-html/SKILL.md @@ -2,16 +2,7 @@ name: design-html preamble-tier: 2 version: 1.0.0 -description: | - Design finalization: generates production-quality Pretext-native HTML/CSS. 
- Works with approved mockups from /design-shotgun, CEO plans from /plan-ceo-review, - design review context from /plan-design-review, or from scratch with a user - description. Text actually reflows, heights are computed, layouts are dynamic. - 30KB overhead, zero deps. Smart API routing: picks the right Pretext patterns - for each design type. Use when: "finalize this design", "turn this into HTML", - "build me a page", "implement this design", or after any planning skill. - Proactively suggest when user has approved a design or has a plan ready. (gstack) - Voice triggers (speech-to-text aliases): "build the design", "code the mockup", "make it real". +description: Design finalization: generates production-quality Pretext-native HTML/CSS. (gstack) triggers: - build the design - code the mockup @@ -29,6 +20,19 @@ allowed-tools: + +## When to invoke this skill + +Works with approved mockups from /design-shotgun, CEO plans from /plan-ceo-review, +design review context from /plan-design-review, or from scratch with a user +description. Text actually reflows, heights are computed, layouts are dynamic. +30KB overhead, zero deps. Smart API routing: picks the right Pretext patterns +for each design type. Use when: "finalize this design", "turn this into HTML", +"build me a page", "implement this design", or after any planning skill. +Proactively suggest when user has approved a design or has a plan ready. + +Voice triggers (speech-to-text aliases): "build the design", "code the mockup", "make it real". + ## Preamble (run first) ```bash @@ -555,84 +559,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/design-review/SKILL.md b/design-review/SKILL.md index b584ada8f..87b04124f 100644 --- a/design-review/SKILL.md +++ b/design-review/SKILL.md @@ -2,14 +2,7 @@ name: design-review preamble-tier: 4 version: 2.0.0 -description: | - Designer's eye QA: finds visual inconsistency, spacing issues, hierarchy problems, - AI slop patterns, and slow interactions — then fixes them. 
Iteratively fixes issues - in source code, committing each fix atomically and re-verifying with before/after - screenshots. For plan-mode design review (before implementation), use /plan-design-review. - Use when asked to "audit the design", "visual QA", "check if it looks good", or "design polish". - Proactively suggest when the user mentions visual inconsistencies or - wants to polish the look of a live site. (gstack) +description: Designer's eye QA: finds visual inconsistency, spacing issues, hierarchy problems, AI slop patterns, and slow interactions — then fixes them. (gstack) allowed-tools: - Bash - Read @@ -27,6 +20,16 @@ triggers: + +## When to invoke this skill + +Iteratively fixes issues +in source code, committing each fix atomically and re-verifying with before/after +screenshots. For plan-mode design review (before implementation), use /plan-design-review. +Use when asked to "audit the design", "visual QA", "check if it looks good", or "design polish". +Proactively suggest when the user mentions visual inconsistencies or +wants to polish the look of a live site. + ## Preamble (run first) ```bash @@ -553,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/design-shotgun/SKILL.md b/design-shotgun/SKILL.md index 36cbcb498..813da91c4 100644 --- a/design-shotgun/SKILL.md +++ b/design-shotgun/SKILL.md @@ -2,13 +2,7 @@ name: design-shotgun preamble-tier: 2 version: 1.0.0 -description: | - Design shotgun: generate multiple AI design variants, open a comparison board, - collect structured feedback, and iterate. Standalone design exploration you can - run anytime. 
Use when: "explore designs", "show me options", "design variants", - "visual brainstorm", or "I don't like how this looks". - Proactively suggest when the user describes a UI feature but hasn't seen - what it could look like. (gstack) +description: Design shotgun: generate multiple AI design variants, open a comparison board, collect structured feedback, and iterate. (gstack) triggers: - explore design variants - show me design options @@ -44,6 +38,15 @@ gbrain: + +## When to invoke this skill + +Standalone design exploration you can +run anytime. Use when: "explore designs", "show me options", "design variants", +"visual brainstorm", or "I don't like how this looks". +Proactively suggest when the user describes a UI feature but hasn't seen +what it could look like. + ## Preamble (run first) ```bash @@ -570,84 +573,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/devex-review/SKILL.md b/devex-review/SKILL.md index 49d5ae212..de339fbe6 100644 --- a/devex-review/SKILL.md +++ b/devex-review/SKILL.md @@ -2,15 +2,7 @@ name: devex-review preamble-tier: 3 version: 1.0.0 -description: | - Live developer experience audit. Uses the browse tool to actually TEST the - developer experience: navigates docs, tries the getting started flow, times - TTHW, screenshots error messages, evaluates CLI help text. 
Produces a DX - scorecard with evidence. Compares against /plan-devex-review scores if they - exist (the boomerang: plan said 3 minutes, reality says 8). Use when asked to - "test the DX", "DX audit", "developer experience test", or "try the - onboarding". Proactively suggest after shipping a developer-facing feature. (gstack) - Voice triggers (speech-to-text aliases): "dx audit", "test the developer experience", "try the onboarding", "developer experience test". +description: Live developer experience audit. (gstack) triggers: - live dx audit - test developer experience @@ -27,6 +19,19 @@ allowed-tools: + +## When to invoke this skill + +Uses the browse tool to actually TEST the +developer experience: navigates docs, tries the getting started flow, times +TTHW, screenshots error messages, evaluates CLI help text. Produces a DX +scorecard with evidence. Compares against /plan-devex-review scores if they +exist (the boomerang: plan said 3 minutes, reality says 8). Use when asked to +"test the DX", "DX audit", "developer experience test", or "try the +onboarding". Proactively suggest after shipping a developer-facing feature. + +Voice triggers (speech-to-text aliases): "dx audit", "test the developer experience", "try the onboarding", "developer experience test". + ## Preamble (run first) ```bash @@ -553,84 +558,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/docs/designs/v2_PLAN.md b/docs/designs/v2_PLAN.md new file mode 100644 index 000000000..bd684df0d --- /dev/null +++ b/docs/designs/v2_PLAN.md @@ -0,0 +1,755 @@ +# gstack v2 — the lightest opinionated skill pack + +## Context + +gstack has an externally documented reputation for being "fat." Third-party reviews (dev.to, May 2026) explicitly say gstack "can feel bloated when all roles are turned on... 
potentially consuming 10K+ tokens before any real code is written, and daily usage burns through tokens fast... making even straightforward tasks feel sluggish and redundant." Anthropic's own canonical Skills guidance prescribes the "progressive disclosure" pattern (`SKILL.md` skeleton + `references/` loaded on demand) — gstack diverges from this.
+
+The numbers back the criticism:
+
+- 31 skills, 2.1MB total generated SKILL.md corpus
+- 28 of 31 skills exceed the 40KB soft ceiling (~10K tokens each)
+- ship.md is 164KB (~41K tokens); ship.md.tmpl is only 48KB — **115KB is resolver-injected**, the highest-leverage compression target
+- Catalog in always-loaded system prompt: 50+ skills × multi-paragraph descriptions, voice triggers, proactive-suggest paragraphs
+
+This plan ships gstack v2 in two coordinated releases: v1.45.0.0 lands the foundation + low-risk wins, then v2.0.0.0 ships the architectural break + marketing-grade repositioning 2-4 weeks later. The split came out of cross-model review: Codex argued v2 looks like posturing without real breakage; the hybrid shape gives the genuinely-breaking sections/ pattern the major bump it earns, while letting the risk-free wins ship immediately.
+
+## Release shape
+
+```
+v1.45.0.0 (Foundation Release)            v2.0.0.0 (gstack v2 Launch)
+─────────────────────────────             ─────────────────────────────
+~1-2 weeks of CC work                     2-4 weeks later, coordinated
+
+Phase 0: Eval coverage matrix             Phase B: sections/ pattern
+  gate + periodic for all 31 skills         on 5 heavyweights
+                                            (ship, plan-ceo, office-hours,
+Phase A: Build-time compression              plan-eng, plan-design)
+  conditional resolver injection
+  jargon dedup                            Phase C: Eval annotations
+  terse-mode actually compresses            + CI orphan check (WARN→FAIL)
+
+Catalog trim (Codex high-leverage win)    Lighter-touch migration
+  one-line skill descriptions               release note + auto-regenerate
+  drop voice triggers/proactive blocks      on /gstack-upgrade
+
+Hard token budgets defined                Marketing-grade CHANGELOG
+  enforced via budget-regression            v1 vs v2 numbers table
+                                            README v2 banner
+Normal release voice                      "lightest opinionated skill pack"
+```
+
+## Premise check (Step 0A findings)
+
+1. **Is this the right problem?** YES — externally validated. The bloat criticism is quotable and represents real user pain (token cost, sluggish sessions). Doing nothing means losing users to Cursor/Codex for their "lighter touch" reputation.
+2. **Doing nothing:** the criticism compounds. Recent releases (v1.38 → v1.44) all added features; no release has gone the other direction. Without an explicit reversal, the reputation calcifies.
+3. **Risk of acting:** the lazy-section pattern introduces silent-behavior-loss as a new failure class. Mitigated by the eval-first foundation + mechanical enforcement + canary rollout (see Phase B integrity section).
+
+## What already exists (reuse-first audit)
+
+| Asset | Reuse |
+|---|---|
+| `scripts/gen-skill-docs.ts` lines 439-450 | Already does string substitution and per-host suppression; extend with `appliesTo` resolver gate (~15 LOC) |
+| `scripts/resolvers/types.ts` | Add `ResolverEntry` union type |
+| `scripts/resolvers/preamble.ts` | Already does tier-gated composition (1-4); add per-resolver gating |
+| `scripts/jargon-list.json` | Already a single file; just stop inlining it 37× |
+| `test/skill-e2e-budget-regression.test.ts` (existing gate-tier) | Extend with per-skill hard budgets |
+| Real-PTY harness from v1.13.2.0 | Reuse for behavioral-contract evals (~$0.50/eval) |
+| SDK harness | Reuse for cheap shape evals (~$0/eval where possible) |
+| `gstack-upgrade/migrations/` | Pattern exists for state-format migrations; reuse for v2 auto-regenerate |
+| `~/.gstack/analytics/skill-usage.jsonl` | Already collected; powers deferred `gstack budget` CLI |
+
+We are catching up to Anthropic's canonical Skills pattern, not inventing one.
+
+## Dream state delta
+
+```
+TODAY                            v1.45.0.0                      v2.0.0.0
+──────                           ─────────                      ────────
+2.1MB corpus                     ~1.3MB corpus (-40%)           ~700KB corpus (-67%)
+ship.md: 164KB                   ship.md: ~80KB (-50%)          ship.md: ~15KB skeleton
+                                                                + 5×~5KB sections
+28/31 over 40KB ceiling          ~10/31 over ceiling            ~3/31 over ceiling
+                                                                (cso, document-release,
+                                                                design-consultation
+                                                                kept as monoliths)
+Catalog: multi-paragraph         Catalog: one-line per skill    Catalog: one-line per skill
+descriptions, voice triggers     (~70% catalog cut)             (same)
+No eval coverage matrix          Every skill: ≥1 gate eval      Section-level eval
+                                 + ≥1 periodic eval             annotations + CI orphan check
+"Fat" reputation in third-party  "Compressed, eval-protected"   "Lightest opinionated skill
+reviews                          internally measured            pack" externally measured
+```
+
+## Phase 0 — Eval coverage matrix (v1.45.0.0)
+
+**Goal:** every skill in gstack ships with at least one gate-tier eval AND one periodic-tier eval that asserts a must-have behavior.
The eval suite becomes the design spec. This is the load-bearing claim of the plan — must come first.
+
+**Cross-model tension noted:** Codex argued this is a procrastination trap and shape-asserts are shallow. User explicitly chose full tiered coverage anyway (D9 = A), with rationale: "the eval suite IS the design spec; that commitment is the load-bearing claim of the whole plan." We accept the larger upfront investment.
+
+**Mitigation of Codex's "shape vs quality" critique:** for orchestration/judgment skills (plan-ceo, office-hours, autoplan), the must-have isn't deterministic output — it's structural compliance (does it call AskUserQuestion in the right shape? does it follow the section order? does it persist artifacts?). Eval design must capture structural contracts, not output content. Where structural eval is impossible, that section is explicitly noted as "judgment-dependent, not eval-protected" — Codex's #2 critique is honored by NOT then stripping unprotected judgment prose.
+
+**Skills currently lacking dedicated E2E coverage** (eval-writing target):
+
+| Skill | Gate eval (target) | Periodic eval (target) | Est. cost/run |
+|---|---|---|---|
+| qa-only | report-only flag triggers | full QA flow with fix-loop disabled | $0.30 / $1.50 |
+| retro | weekly aggregate runs without error | full retro produces ranked output | $0.20 / $2.00 |
+| document-release | reads CHANGELOG, produces Diataxis map | full post-ship doc update | $0.30 / $1.80 |
+| document-generate | generates 4 doc types from prompt | E2E generation passes quality bar | $0.30 / $2.00 |
+| context-save | persists state to expected path | round-trip restore preserves context | $0.10 / $0.50 |
+| context-restore | reads latest save, applies to session | cross-workspace restore works | $0.10 / $0.50 |
+| gstack-upgrade | detects install type, runs upgrade | full upgrade + migration round-trip | $0.20 / $1.00 |
+| sync-gbrain | refreshes index without error | full sync produces searchable corpus | $0.20 / $1.50 |
+| setup-gbrain | path 1-4 detection works | end-to-end setup for each path | $0.20 / $2.00 |
+| setup-browser-cookies | picker UI loads without error | cookie import round-trip | $0.20 / $1.00 |
+| setup-deploy | detects config, writes expected files | full deploy config setup | $0.20 / $1.00 |
+| design-consultation | DESIGN.md template renders | full design system generation | $0.30 / $2.50 |
+| design-shotgun | variants generated and saved | full multi-variant exploration | $0.30 / $2.00 |
+| open-gstack-browser | launches browser without error | sidebar attaches and shows activity | $0.20 / $0.80 |
+| pair-agent | setup key generated, instructions printed | full pair flow with second agent | $0.20 / $1.50 |
+| land-and-deploy | merge gates check correctly | full merge → deploy → canary | $0.30 / $3.00 |
+| canary | post-deploy loop runs, exits cleanly | full canary cycle with alert simulation | $0.20 / $1.50 |
+| benchmark | runs and produces score | full regression detection | $0.20 / $2.00 |
+| plan-devex-review | mode routing works | full DX review with scoring | $0.40 / $3.00 |
+| devex-review | live DX audit produces scorecard | E2E DX measurement vs plan baseline | $0.40 / $2.50 |
+
+Estimated added CI cost: **~$5/run gate, ~$30/run periodic.** Combined with existing E2E suite (~$15/gate, ~$30/periodic), total: ~$20/gate (every PR), ~$60/periodic (weekly). Acceptable.
+
+**Eval matrix lives at:** `test/helpers/skill-coverage-matrix.ts` — a single source of truth mapping each skill to its gate + periodic eval test files. CI check in `test/skill-coverage-matrix.test.ts` fails the build if any skill is missing an entry.
+
+**Critical files to add:**
+- `test/skill-coverage-matrix.ts` — registry mapping skill → eval paths
+- `test/skill-e2e-*.test.ts` — 20 new test files (gate-tier subset starts in gate config, periodic-tier subset in periodic config)
+- `test/helpers/touchfiles.ts` — register new tests for diff-based selection
+
+## Phase A — Build-time compression (v1.45.0.0)
+
+**A.1 Conditional resolver injection** — extend `scripts/gen-skill-docs.ts` and `scripts/resolvers/`:
+
+```ts
+// scripts/resolvers/types.ts
+export type ResolverFn = (ctx: TemplateContext, args?: string[]) => string;
+export type ResolverEntry = ResolverFn | {
+  resolve: ResolverFn;
+  appliesTo?: (ctx: TemplateContext) => boolean;
+};
+```
+
+```ts
+// scripts/resolvers/index.ts — gate the heavy ones
+QUESTION_TUNING: {
+  resolve: generateQuestionTuning,
+  appliesTo: (ctx) => ['plan-ceo-review','plan-eng-review','office-hours'].includes(ctx.skillName),
+},
+REVIEW_ARMY: {
+  resolve: generateReviewArmy,
+  appliesTo: (ctx) => ['ship','review'].includes(ctx.skillName),
+},
+REVIEW_DASHBOARD: {
+  resolve: generateReviewDashboard,
+  appliesTo: (ctx) => ['ship','plan-ceo-review','plan-eng-review','plan-design-review','plan-devex-review','devex-review'].includes(ctx.skillName),
+},
+// ... audit all 21 resolvers, gate per actual usage
+```
+
+```ts
+// scripts/gen-skill-docs.ts (~line 444) — check the gate
+const entry = RESOLVERS[resolverName];
+const resolver = typeof entry === 'function' ? entry : entry.resolve;
+const gate = typeof entry === 'function' ? undefined : entry.appliesTo;
+if (gate && !gate(ctx)) return '';
+return args.length > 0 ? resolver(ctx, args) : resolver(ctx);
+```
+
+**A.2 Jargon-list dedup** — currently `scripts/resolvers/preamble/generate-writing-style.ts` inlines the full 1.8KB jargon glossary into 37 skills. Replace inline with a reference: "For the canonical jargon list, Read `~/.claude/skills/gstack/scripts/jargon-list.json` on first use." Saves ~66KB total corpus.
+
+**A.3 Terse-mode actually compresses** — read `~/.gstack/config.yaml` once in `gen-skill-docs.ts`, pass `explainLevel` into `TemplateContext`, and have `generate-writing-style.ts` / `generate-completeness.ts` / `generate-confusion-protocol.ts` / `generate-context-health.ts` return `''` when terse. Today the bytes ship regardless of config — the flag only changes runtime model behavior. Add `--explain-level=terse` build flag for benchmarking.
+
+**A.4 Catalog trim** (moved up per Codex #6) — shorten skill descriptions in the always-loaded system prompt to one line per skill. Voice triggers move from catalog descriptions into in-skill content. Proactive-suggest paragraphs move to a separate `~/.claude/skills/gstack/scripts/proactive-suggestions.json` loaded only when the agent needs routing guidance. Per-skill description format:
+
+```
+- : (gstack)
+```
+
+Estimated catalog cut: ~70% (largest single always-loaded reduction).
+
+**A.5 cso/ targeted compression** (Codex #9) — cso gets resolver dedup + catalog trim. Security guidance prose stays uncompressed monolithically until Phase B audit shows specific sections can safely move to sections/ with eval coverage. Not "exempt" — just sequenced last.
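
The terse-mode gate described in A.3 can be sketched roughly as follows. This is an illustration only, not code from the patch: the `ExplainLevel` values, the trimmed-down `TemplateContext`, and the helpers `parseExplainLevel` and `skipWhenTerse` are all hypothetical names, and the plan does not say what the non-terse level is called, so `'default'` is an assumption.

```typescript
// Hypothetical sketch: simplified stand-ins for the types in scripts/resolvers/types.ts.
type ExplainLevel = 'default' | 'terse';

interface TemplateContext {
  skillName: string;
  explainLevel: ExplainLevel;
}

type ResolverFn = (ctx: TemplateContext, args?: string[]) => string;

// Reads --explain-level=<value> from an argv-style array. Anything other
// than "terse" falls back to "default" (an assumed name for the normal level).
function parseExplainLevel(argv: string[]): ExplainLevel {
  const prefix = '--explain-level=';
  const flag = argv.find((arg) => arg.startsWith(prefix));
  return flag !== undefined && flag.slice(prefix.length) === 'terse' ? 'terse' : 'default';
}

// Wraps a resolver so that a terse build emits no bytes for it.
function skipWhenTerse(resolve: ResolverFn): ResolverFn {
  return (ctx, args) => (ctx.explainLevel === 'terse' ? '' : resolve(ctx, args));
}

// Placeholder body; the real generator emits the full writing-style section.
const generateWritingStyle: ResolverFn = skipWhenTerse(
  () => '## Writing Style\n\nGloss jargon on first use.\n',
);

const terseCtx: TemplateContext = {
  skillName: 'ship',
  explainLevel: parseExplainLevel(['--explain-level=terse']),
};
const defaultCtx: TemplateContext = {
  skillName: 'ship',
  explainLevel: parseExplainLevel([]),
};

console.log(generateWritingStyle(terseCtx).length); // 0
console.log(generateWritingStyle(defaultCtx).length > 0); // true
```

Wrapping each generator keeps the terse decision at build time, which is the point A.3 makes: the bytes never reach the generated SKILL.md, rather than shipping and being ignored at runtime.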
+
+**A.6 Hard token budgets** (Codex #10) — define and enforce in `test/skill-e2e-budget-regression.test.ts`:
+
+| Budget | v1.44 actual | v1.45 target | v2.0 target |
+|---|---|---|---|
+| Max system-prompt catalog tokens | ~25K | ~8K | ~6K |
+| Max per-skill SKILL.md size | 164KB (ship) | 100KB | 30KB (heavyweights) |
+| Max corpus total | 2.1MB | 1.3MB | 700KB |
+| Max first-invocation latency (heavyweight) | ~immediate | ~immediate | <500ms section reads |
+
+CI fails if any budget exceeded. Tracked over time via existing budget-regression jsonl.
+
+## Phase B — sections/ pattern for heavyweights (v2.0.0.0)
+
+Convert 5 heavyweights to Anthropic-canon skeleton + `sections/*.md`:
+
+```
+ship/
+├── SKILL.md              # 12-15KB decision-tree skeleton + section manifest
+├── SKILL.md.tmpl         # source for the skeleton
+├── sections/
+│   ├── manifest.json     # NEW: structured section registry (Codex #3 mitigation)
+│   ├── version-bump.md
+│   ├── changelog.md
+│   ├── review-army.md
+│   ├── todos-cleanup.md
+│   ├── pr-body.md
+│   └── ...
+```
+
+**Silent-behavior-loss mitigations** (Codex #3) — layered defense, not just self-check:
+
+1. **Section manifest** (`sections/manifest.json`) — structured registry: `{section_file, applies_when, required_for}`. Decision-tree skeleton references entries by ID, not free-form prose.
+2. **Imperative skeleton phrasing** — "STOP. Read `sections/version-bump.md` before computing the bump." Not "see ... for details."
+3. **Top-of-file section index table** — situation → section file mapping.
+4. **End-of-skill self-check** — "Confirm you Read every section your decision tree pointed to. List them." (weakest layer, kept as fallback.)
+5. **Eval harness `requiredReads` declaration** — E2E test asserts which sections must appear in transcript Read calls for a given fixture. Mechanical enforcement at the test layer, not just prompt layer.
+6. **Transcript inspection in canary cohort** — first week post-ship, log which sections actually get read by real sessions; alert on Read-miss for marked-required sections.
+
+**Conversion order** (one at a time, validate each before next):
+1. `ship/` — most invocations, biggest cost, riskiest. Land alone, observe 1 week.
+2. `plan-ceo-review/` — conversational; risk of breaking flow. Land second, observe carefully.
+3. `office-hours/` — most conversational. Land third only if 1+2 went clean.
+4. `plan-eng-review/` and `plan-design-review/` — bundle, similar shape.
+
+**Do not convert** unless explicitly approved later: `autoplan` (orchestrator that already chains skills), `design-review` (UI flow already tight), `qa` (single-purpose), `investigate` (single-purpose).
+
+## Phase C — Eval annotations + CI orphan check (v2.0.0.0)
+
+Per Codex #4 — warn-before-fail progression, not immediate strict gate.
+
+```md
+
+
+```
+
+Annotations include **coverage semantics** (what behavior is protected) per Codex #5, not just paths. Path-only would be false confidence.
+
+CI check in `gen-skill-docs.ts` walker:
+- v2.0.0.0 ships in WARN mode — orphans logged to PR summary but build passes
+- v2.1.0.0 (or 2 release cycles after v2.0): WARN escalates to FAIL
+- Waiver: ``
+
+This avoids "maintenance theater" of mandatory annotations with no semantics, and gives users a transition window.
+
+## Migration approach (v2.0.0.0, lighter touch per D11)
+
+- Release note in v2.0.0.0 CHANGELOG explains the sections/ format change and concrete user impact: forks/copy-pasted SKILL.md files need re-fetch; first-invocation of heavyweight skills has ~200-500ms section-read latency added.
+- `/gstack-upgrade` auto-regenerates on next invocation. No interactive migration prompts.
+- Vendored installs get a single one-line warning at session start on first v2 contact (re-use existing vendored-install warning pattern in skill preamble).
+- `gstack-upgrade --explain-v2` flag for users who want the full explanation on demand.
+
+## Forks / customization compatibility (Codex #11)
+
+Documented in v2.0.0.0 release note:
+
+- Anyone who reads/copies/edits a heavyweight SKILL.md file directly: the file is now a skeleton; behavior lives in `sections/*.md`. They need to either treat the skill as a black box (recommended) or fork the full `skill/` directory including `sections/`.
+- Anyone with local SKILL.md.tmpl edits in a fork: the templates are smaller; conflicts likely on regenerate. Fork docs updated with migration guidance.
+- Anyone with docs/blog posts linking to specific lines of a generated SKILL.md: line numbers will shift; recommend linking to template + section name instead.
+
+## Rollout strategy (Codex #12)
+
+v1.45.0.0:
+- Land in one PR; existing budget-regression test catches any per-skill size regression; eval matrix CI check catches any skill missing its evals.
+- Dogfood: 1 week active use across all of Garry's workspaces before announcing.
+
+v2.0.0.0:
+- **Canary cohort**: ship to dogfood users (Garry + active agents) first via a v2.0.0-rc.1 tag. Real-PTY harness logs section Reads for top 5 workflows (`/ship`, `/qa`, `/review`, `/plan-ceo-review`, `/autoplan`); alert on Read-miss for required sections.
+- **Manual verification**: top 5 workflows manually run before tagging v2.0.0.0 final, with before/after transcripts saved as eval baselines.
+- **Regression dashboard**: existing `bun run eval:summary` extended with v1 vs v2 per-skill token + behavioral compliance comparison.
+- **Rollback**: revert PR + `bun run gen:skill-docs` regenerates old shape. Documented in CONTRIBUTING.md.
+
+## Review-section findings (Sections 1-11, condensed)
+
+| Section | Findings | Status |
+|---|---|---|
+| 1. Architecture | Lazy-section silent-loss risk; mitigated via 6-layer defense above | Findings addressed in plan |
+| 2. Errors/Rescues | gen-skill-docs gate-fail loud; missing sections fall back to skeleton; CI orphan check loud | Findings addressed |
+| 3. Security | cso targeted dedup not blanket exemption (Codex #9); migration script runs at user-shell trust boundary, same as existing migrations | Findings addressed |
+| 4. Data/UX edge cases | v1→v2 muscle-memory break warned in release note; vendored installs get one-line warning; concurrent dev-symlink sessions risk is existing CLAUDE.md caveat | Findings addressed |
+| 5. Code quality | ~150 LOC additive across gen-skill-docs/types/index; ~20 new eval test files; sections/ extraction is mechanical | OK |
+| 6. Tests | Phase 0 IS the test plan. Coverage matrix CI gate enforces every skill has its evals | Findings addressed |
+| 7. Performance | Build time <2× current; runtime adds 200-500ms first-invocation for sectioned heavyweights; catalog trim reduces always-loaded prompt size on every session | Documented |
+| 8. Observability | budget-regression test already exists; canary cohort transcript logging in Phase B; migration outcome logged to ~/.gstack/analytics/migrations.jsonl | Findings addressed |
+| 9. Deployment | Two-release split + warn-before-fail eval annotations + rollback via revert | Findings addressed |
+| 10. Long-term trajectory | Reversibility 3/5; sections/ pattern becomes template for future skills; deferred TODOs extend v2 narrative for v2.1+ | OK |
+| 11. Design/UX | README v2 banner + CHANGELOG numbers table land in v2.0.0.0; concrete numbers, gstack voice, no AI slop | OK |
+
+## NOT in scope
+
+- **Skill removals.** User said "keep all functions." qa-only, design-shotgun, pair-agent, open-gstack-browser all stay. They get evals + catalog trim like everyone else.
+- **Skill renames.** No `qa` → `qa-fix` collapses. Keep CLI surface stable.
+- **gstack lite/pro install profiles.** Deferred to TODOS for post-v2.
+- **gstack budget CLI.** Deferred to TODOS for post-v2.
+- **Per-skill eval coverage badge in README.** Deferred to TODOS. +- **Cross-tool portability test/demo (Codex/Cursor compat).** Deferred to TODOS. +- **Token-cost preview on invocation.** Deferred to TODOS. +- **Skill autoload telemetry.** Deferred to TODOS. +- **gstack diff PR comment.** Deferred to TODOS. + +## TODOS.md updates (deferred items, recommend bulk-add post-merge) + +| TODO | Priority | Effort (human / CC) | Depends on | +|---|---|---|---| +| `gstack lite` install profile (5-skill core) | P2 | 2 days / 3-4 hrs | v2.0.0.0 | +| `gstack pro` opt-in upgrade path | P2 | 1 day / 1 hr | gstack lite | +| `gstack budget` CLI (per-skill token usage telemetry) | P2 | 1 day / 1 hr | v1.45.0.0 | +| Per-skill eval coverage badge in `gstack-skills list` + README | P3 | 1 day / 1 hr | Phase 0 | +| Cross-tool portability test/demo (Codex CLI, Cursor) | P3 | 2 days / 2 hrs | v2.0.0.0 | +| Token-cost preview on skill invocation | P3 | 1 day / 1 hr | gstack budget CLI | +| Skill autoload telemetry (dead-weight detection) | P3 | 2 days / 2 hrs | v1.45.0.0 | +| `gstack diff` PR comment (per-PR budget delta) | P3 | 1 day / 1 hr | budget-regression extended | +| Section-level eval annotations visible to user (confidence signal) | P3 | half day / 30 min | Phase C | + +## Critical files + +| Path | Change | Phase | +|---|---|---| +| `scripts/gen-skill-docs.ts` | Add resolver gate check (~line 444); read explain_level from config; add CI orphan walker | A, C | +| `scripts/resolvers/types.ts` | Add `ResolverEntry` union type | A | +| `scripts/resolvers/index.ts` | Wrap heavy resolvers with `appliesTo` predicates (audit all 21) | A | +| `scripts/resolvers/preamble/generate-writing-style.ts` | Replace inline jargon; return `''` on terse | A | +| `scripts/resolvers/preamble/generate-completeness.ts` | Return `''` on terse | A | +| `scripts/resolvers/preamble/generate-confusion-protocol.ts` | Return `''` on terse | A | +| `scripts/resolvers/preamble/generate-context-health.ts` | 
Return `''` on terse | A | +| `scripts/skill-catalog.ts` (new or in gen-skill-docs) | One-line catalog generator + voice-triggers JSON splitter | A.4 | +| `scripts/proactive-suggestions.json` (new) | Voice triggers + proactive suggestions, loaded on demand | A.4 | +| `test/skill-coverage-matrix.ts` (new) | Single-source-of-truth eval registry | Phase 0 | +| `test/skill-coverage-matrix.test.ts` (new) | CI gate: every skill has entries | Phase 0 | +| `test/skill-e2e-*.test.ts` (~20 new files) | New evals for skills currently lacking coverage | Phase 0 | +| `test/skill-e2e-budget-regression.test.ts` | Extend with per-skill hard budgets | A.6 | +| `test/helpers/touchfiles.ts` | Register new tests for diff-based selection | Phase 0 | +| `ship/SKILL.md.tmpl` → `ship/sections/manifest.json` + `ship/sections/*.md` | Skeleton extraction | B | +| `plan-ceo-review/SKILL.md.tmpl` → sections/ | Skeleton extraction | B | +| `office-hours/SKILL.md.tmpl` → sections/ | Skeleton extraction | B | +| `plan-eng-review/SKILL.md.tmpl` → sections/ | Skeleton extraction | B | +| `plan-design-review/SKILL.md.tmpl` → sections/ | Skeleton extraction | B | +| `gstack-upgrade/migrations/v2.0.0.0.sh` (new) | Auto-regenerate + vendored-install warning | B | +| `CHANGELOG.md` | v1.45.0.0 entry (normal), v2.0.0.0 entry (marketing-grade w/ numbers table) | A, B | +| `README.md` | v2.0.0.0 banner; "lightest opinionated skill pack" positioning | B | +| `CONTRIBUTING.md` | Document sections/ pattern + rollback procedure | B | + +## Verification + +**v1.45.0.0:** +1. `bun run gen:skill-docs` succeeds with no errors +2. `bun test` passes (skill-validation, gen-skill-docs.test.ts, browse integration, NEW skill-coverage-matrix.test.ts) +3. `bun run test:evals` passes — all new gate evals green; no regression on existing evals +4. `bun run test:evals:periodic` passes — all new periodic evals green +5. Catalog system-prompt size measured: target ≤8K tokens (vs ~25K current). Capture before/after in PR body. 
+6. Total SKILL.md corpus byte count: target ≤1.3MB (vs 2.1MB). Capture in PR body. +7. Top 3 heaviest skills under 100KB. +8. Manual smoke: invoke `/ship`, `/plan-ceo-review`, `/office-hours` in fresh Claude Code sessions; confirm no missing behavior. Save transcripts as v1.45 baselines. + +**v2.0.0.0:** +1. All v1.45 checks pass +2. Sectioned skills: total corpus ≤700KB; heavyweight skeletons ≤30KB each +3. `test/skill-e2e-ship-section-loading.test.ts` (new): asserts `/ship` Reads expected sections per decision tree +4. Canary cohort: 1 week dogfood at v2.0.0-rc.1 with transcript logging; zero Read-miss for marked-required sections +5. Top 5 workflows manually verified; transcripts compared against v1.45 baselines +6. Migration: `gstack-upgrade` on a v1.45 install successfully regenerates without prompts; vendored-install warning appears once +7. CHANGELOG numbers table matches measured reality +8. WARN-mode orphan check: PR summary shows orphan list; build passes + +## Cross-model agreements baked in + +Items from Codex's review accepted and integrated above: + +- #4 Warn-before-fail eval annotations (Phase C) +- #5 Coverage semantics in annotation comments, not just paths +- #6 Catalog trim moved up to Phase A (was buried after sections/) +- #9 cso gets resolver dedup + catalog trim (not blanket exempt) +- #10 Hard token budgets defined + enforced (Phase A.6) +- #11 Forks/customization compatibility documented (Migration section) +- #12 Rollout strategy with canary cohort + manual top-5-workflows verification (Rollout section) + +Items from Codex's review explicitly rejected by user (D9, D10): +- #1 Eval-first scope: user kept full tiered coverage. Mitigated by structural-eval guidance (not output-content) for orphan/judgment skills. +- #7 v2.0.0.0 vs v1.x: user chose HYBRID. v1.45 absorbs low-risk wins; v2.0.0.0 carries the genuinely-breaking sections/ change. 
+ +Item where user accepted Codex over original pick: +- #8 Migration approach: user moved from hard-cut (D7) to lighter touch (D11) once v1.45 absorbed the low-risk work. + +## Implementation Tasks + +Synthesized from this review's findings. Each task derives from a specific phase/finding above. T1-T8 land in v1.45.0.0; T9-T16 land in v2.0.0.0. + +- [ ] **T1 (P1, human: ~3 days / CC: ~7 hours)** — Phase 0 / coverage matrix — write gate+periodic evals for all 20 skills lacking coverage + - Surfaced by: Phase 0 section + - Files: `test/skill-coverage-matrix.ts`, `test/skill-coverage-matrix.test.ts`, ~20 new `test/skill-e2e-*.test.ts`, `test/helpers/touchfiles.ts` + - Verify: `bun test test/skill-coverage-matrix.test.ts` and `bun run test:evals` both pass with new evals +- [ ] **T2 (P1, human: ~1 day / CC: ~1 hour)** — A.1 conditional resolver injection — add `appliesTo` gate + - Surfaced by: Phase A section, Codex #10 (measurement before architecture) + - Files: `scripts/resolvers/types.ts`, `scripts/gen-skill-docs.ts:444`, `scripts/resolvers/index.ts` + - Verify: `bun run gen:skill-docs` produces smaller SKILL.md files; `bun test` passes +- [ ] **T3 (P1, human: ~half day / CC: ~30 min)** — A.2 + A.3 jargon dedup + terse-mode gen-time compression + - Surfaced by: Phase A section + - Files: `scripts/resolvers/preamble/generate-writing-style.ts`, `generate-completeness.ts`, `generate-confusion-protocol.ts`, `generate-context-health.ts` + - Verify: jargon-list no longer appears inlined in generated SKILL.md; `gstack-config set explain_level terse && bun run gen:skill-docs` produces shorter files +- [ ] **T4 (P1, human: ~1 day / CC: ~2 hours)** — A.4 catalog trim — one-line skill descriptions; voice triggers + proactive paragraphs moved to JSON + - Surfaced by: Codex #6 (highest-leverage), Phase A.4 + - Files: `scripts/skill-catalog.ts` (new), `scripts/proactive-suggestions.json` (new), per-skill SKILL.md.tmpl frontmatter for one-line description field + - Verify: 
catalog system-prompt size <8K tokens; voice-triggered invocation still works +- [ ] **T5 (P1, human: ~half day / CC: ~30 min)** — A.6 hard token budgets in budget-regression + - Surfaced by: Codex #10 + - Files: `test/skill-e2e-budget-regression.test.ts` + - Verify: budget-regression fails when artificially inflated test SKILL.md exceeds budget +- [ ] **T6 (P1, human: ~1 day / CC: ~1 hour)** — A.5 cso resolver dedup + catalog trim (NOT broader compression) + - Surfaced by: Codex #9 + - Files: `cso/SKILL.md.tmpl` (no structural change, only resolver gate audit) + - Verify: cso SKILL.md size drops 20-30%; cso E2E evals still pass +- [ ] **T7 (P1, human: ~1 day / CC: ~1 hour)** — Regenerate all SKILL.md atomically + measure + - Surfaced by: Phase A + - Files: all `*/SKILL.md` regenerated + - Verify: PR body includes before/after corpus size, top 10 skill sizes, catalog size; budget-regression confirms targets met +- [ ] **T8 (P2, human: ~half day / CC: ~30 min)** — v1.45.0.0 CHANGELOG entry (normal voice; note that Phase 0 + Phase A landed) + - Surfaced by: Release shape section + - Files: `CHANGELOG.md`, `VERSION` + - Verify: CHANGELOG lints clean; reverse-chrono order preserved; entry covers the diff + +- [ ] **T9 (P1, human: ~2 days / CC: ~3 hours)** — Phase B.1 convert ship/ to skeleton + sections/ + - Surfaced by: Phase B section + - Files: `ship/SKILL.md.tmpl` → skeleton; `ship/sections/manifest.json` + `ship/sections/*.md` + - Verify: new `test/skill-e2e-ship-section-loading.test.ts` asserts expected Reads per decision tree; existing ship evals pass; ship.md skeleton <15KB +- [ ] **T10 (P1, human: ~1 day / CC: ~1 hour)** — Canary cohort for ship/ (1 week dogfood at v2.0.0-rc.1) + - Surfaced by: Rollout strategy section, Codex #12 + - Files: `test/helpers/transcript-section-logger.ts` (new) + - Verify: zero Read-miss on marked-required sections in dogfood transcripts +- [ ] **T11 (P1, human: ~2 days / CC: ~3 hours)** — Phase B.2 convert plan-ceo-review/ (after 
ship/ proven) + - Surfaced by: Phase B section + - Files: `plan-ceo-review/SKILL.md.tmpl` + `plan-ceo-review/sections/` + - Verify: section-loading test green; plan-ceo evals pass +- [ ] **T12 (P2, human: ~3 days / CC: ~4 hours)** — Phase B.3 + B.4 convert office-hours/ + plan-eng-review/ + plan-design-review/ + - Surfaced by: Phase B section + - Files: respective `SKILL.md.tmpl` + `sections/` directories + - Verify: section-loading tests green; respective evals pass +- [ ] **T13 (P1, human: ~1 day / CC: ~1 hour)** — Phase C eval annotations + WARN-mode CI orphan check + - Surfaced by: Phase C section, Codex #4 + #5 + - Files: `scripts/gen-skill-docs.ts` (orphan walker), all `sections/*.md` (annotations with coverage semantics) + - Verify: orphan check reports correctly in PR summary; build still passes in WARN mode +- [ ] **T14 (P1, human: ~half day / CC: ~30 min)** — `gstack-upgrade/migrations/v2.0.0.0.sh` lighter-touch auto-regenerate + - Surfaced by: Migration approach section + - Files: `gstack-upgrade/migrations/v2.0.0.0.sh` + - Verify: upgrade from v1.45 install produces clean v2 state without prompts; vendored install gets one-line warning +- [ ] **T15 (P1, human: ~half day / CC: ~1 hour)** — v2.0.0.0 marketing-grade CHANGELOG with v1 vs v2 numbers table + - Surfaced by: D5, Release shape, Codex #7 (real breakage documented) + - Files: `CHANGELOG.md`, `VERSION`, `README.md` (v2 banner) + - Verify: numbers table matches measured corpus; release note documents concrete breakage (sections/ format change, first-invocation latency, vendored-install deprecation); positioning past-tenses bloat reputation +- [ ] **T16 (P2, human: ~1 day / CC: ~1 hour)** — Bulk-add 9 deferred TODOS to TODOS.md (gstack lite, gstack budget, etc.) + - Surfaced by: TODOS.md updates section + - Files: `TODOS.md` + - Verify: TODOS format matches `.claude/skills/review/TODOS-format.md` + +## Failure Modes Registry + +| Codepath | Failure mode | Rescued? | Test? 
| User sees | Logged | +|---|---|---|---|---|---| +| gen-skill-docs.ts gate check | resolver `appliesTo` throws | Y — try/catch logs + skips resolver | Y (test/gen-skill-docs.test.ts extended) | "resolver X errored, skipped" in build output | stderr | +| sections/ Read at runtime | section file missing | Y — agent falls back to skeleton-only behavior | Y (test/skill-e2e-ship-section-loading.test.ts) | warning in agent prose | session transcript | +| CI orphan walker | sections/*.md missing eval annotation | WARN mode v2.0; FAIL v2.1+ | Y (test/skill-coverage-matrix.test.ts) | PR summary lists orphans | PR comment | +| Migration script v2.0.0.0.sh | regenerate fails on damaged install | Y — script aborts, prints repair steps | Y (migration test) | clear error + repair steps | ~/.gstack/analytics/migrations.jsonl | +| Catalog one-line generator | skill missing one-line description in frontmatter | Y — gen-skill-docs fails build loudly | Y (gen-skill-docs.test.ts extended) | build error | stderr | +| Canary section-Read logger | logger missing for a heavyweight skill | Y — silently skipped, gap visible in dashboard | Y (transcript-logger test) | none directly; surfaced in canary dashboard | ~/.gstack/analytics/section-reads.jsonl | + +No critical gaps — every failure mode has a rescue, a test, and visibility. 
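The first registry row (a resolver `appliesTo` that throws is logged and skipped) can be sketched as below. This is a minimal illustration, not the code in `scripts/gen-skill-docs.ts`: the `Ctx` shape and the `runGatedResolver` helper are hypothetical names introduced here, and only `ResolverFn`, `ResolverEntry`, `resolve`, and `appliesTo` come from the plan.

```typescript
// Hypothetical shapes for illustration; the real types live in
// scripts/resolvers/types.ts and may differ.
type Ctx = { skillName: string };
type ResolverFn = (ctx: Ctx) => string;
type ResolverEntry = { resolve: ResolverFn; appliesTo?: (ctx: Ctx) => boolean };

// Run one resolver through its gate (hypothetical helper).
// - A bare function always fires.
// - An entry fires only when appliesTo is absent or returns true.
// - A gate that throws is logged and treated as "skip" (empty string).
function runGatedResolver(
  name: string,
  resolver: ResolverFn | ResolverEntry,
  ctx: Ctx,
  log: (msg: string) => void = console.error,
): string {
  if (typeof resolver === "function") return resolver(ctx);
  try {
    if (resolver.appliesTo && !resolver.appliesTo(ctx)) return "";
  } catch {
    log(`resolver ${name} errored, skipped`);
    return "";
  }
  return resolver.resolve(ctx);
}
```

Only the gate is wrapped in the `try`, so an error thrown by `resolve` itself still fails the build loudly instead of being swallowed.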
+ +## Diagrams + +System architecture (build pipeline): +``` + CONFIG (~/.gstack/config.yaml) + | + v + +-----------------+ +--------------------+ + | gen-skill-docs | <--- | resolvers/*.ts | + | (with gate) | | (w/ appliesTo) | + +-----------------+ +--------------------+ + | + v + +--------------------------+ + | SKILL.md.tmpl per skill | + | + sections/manifest.json | (heavyweights only, v2) + | + sections/*.md | (heavyweights only, v2) + +--------------------------+ + | + v + +--------------------+ +--------------------------+ + | generated SKILL.md | <-----> | scripts/jargon-list.json | + | (skeleton for | | (referenced, not inlined)| + | heavyweights v2) | +--------------------------+ + +--------------------+ + | + v + +-------------------+ +----------------------+ + | catalog (system | <--- | proactive-suggestions| + | prompt, one-line | | .json (loaded on | + | per skill) | | demand only) | + +-------------------+ +----------------------+ +``` + +Section-Read flow (v2 runtime): +``` + USER /ship + | + v + +-----------------------+ + | ship/SKILL.md | + | (12-15KB skeleton) | + | reads: | + | - manifest.json | + | - decision tree | + +-----------------------+ + | + v Agent walks decision tree, identifies which sections apply + | + +-----> Read sections/version-bump.md (if bumping) + +-----> Read sections/changelog.md (if writing entry) + +-----> Read sections/review-army.md (if pre-ship review) + +-----> ... only sections that apply + | + v + +-------------------------+ + | end-of-skill self-check | + | "list sections I read" | + +-------------------------+ + | + v Canary cohort: transcript-section-logger compares + | actual Reads vs manifest's required_for declarations + | alerts on miss +``` + +## Stale diagram audit + +ASCII diagrams in CLAUDE.md / ARCHITECTURE.md that this plan affects: + +| Diagram | File | Still accurate post-v2? 
| +|---|---|---| +| Sidebar message flow | `docs/designs/SIDEBAR_MESSAGE_FLOW.md` | YES (unrelated subsystem) | +| Dual-listener tunnel architecture | `ARCHITECTURE.md` | YES (unrelated) | +| Unicode sanitization at server egress | `ARCHITECTURE.md` | YES (unrelated) | +| (none for skill build pipeline) | — | New diagrams above are NEW, not updates | + +No stale diagrams to fix. + +## Completion summary + +``` ++====================================================================+ +| MEGA PLAN REVIEW — COMPLETION SUMMARY | ++====================================================================+ +| Mode selected | SCOPE EXPANSION | +| System Audit | bloat externally documented; prior design | +| | doc unrelated; budget-regression infra exists| +| Step 0 | EXPANSION + Approach C + eval-first + | +| | hybrid v1.45/v2.0 split + lighter migration | +| Section 1 (Arch) | 1 finding — silent-loss risk, 6-layer mit | +| Section 2 (Errors) | 6 failure modes mapped, 0 CRITICAL GAPS | +| Section 3 (Security)| cso targeted dedup (Codex #9 absorbed) | +| Section 4 (Data/UX) | v1→v2 muscle memory warned, vendored noted | +| Section 5 (Quality) | ~150 LOC additive, mechanical extraction | +| Section 6 (Tests) | Phase 0 IS the test plan | +| Section 7 (Perf) | <2× build time; +200-500ms first-invoke v2 | +| Section 8 (Observ) | budget-regression + canary + migrations.log | +| Section 9 (Deploy) | 2-release split + warn-before-fail + revert | +| Section 10 (Future) | Reversibility 3/5; sections/ becomes template| +| Section 11 (Design) | README banner + numbers table | ++--------------------------------------------------------------------+ +| NOT in scope | written (9 items deferred) | +| What already exists | written (9 reuse points) | +| Dream state delta | written (TODAY / v1.45 / v2.0) | +| Error/rescue registry| 6 modes, 0 CRITICAL GAPS | +| Failure modes | covered in registry | +| TODOS.md updates | 9 items, bulk-add post-merge | +| Scope proposals | 3 surfaced, 1 accepted 
(launch positioning) | +| CEO plan | this plan IS the CEO plan | +| Outside voice | ran (codex); 3 tensions surfaced | +| Lake Score | 11/11 recommendations chose complete option | +| Diagrams produced | 2 (build pipeline, section-read flow) | +| Stale diagrams found | 0 | +| Unresolved decisions | 0 | ++====================================================================+ +``` + +## Eng-review additions (from /plan-eng-review session) + +### Architectural decisions locked in + +- **D1 (manifest format):** `sections/manifest.json` is the structured per-heavyweight registry (JSON, machine-readable for gen-skill-docs CI checks). SKILL.md skeleton is markdown headers + imperative prose blocks ("STOP. If X, Read `sections/Y.md`"). Matches Anthropic's documented `references/` style. No invented DSL. +- **D2 (drift control):** `sections/*.md.tmpl` is the source of truth; `sections/*.md` is generated. gen-skill-docs walks `/sections/*.tmpl` and writes `/sections/*.md` using the same resolver pipeline as SKILL.md. Cost: ~30 LOC in `scripts/gen-skill-docs.ts`. Eliminates the drift class that `test/ship-version-sync.test.ts` already suffers from (TODOS:1120). +- **D3 (CI cost cap):** `EVALS_BUDGET_HARD_CAP=$30` env var enforced by `test/skill-e2e-budget-regression.test.ts`; build fails if a single run exceeds. Section-loading tests (Phase B) use minimal-bash fixtures (~$0.30 each) because they assert STRUCTURAL behavior (was the right file Read?) not output quality. + +### Adjacent TODOS surfaced (informational, not blocking) + +- **TODOS:161** — planned "resolver injection at session start" for browser-skills (P2). Has architectural overlap with this plan's `appliesTo` predicate. Decision: keep separate for now — browser-skill resolver injection is runtime (session-start hostname matching); our `appliesTo` is build-time (gen-skill-docs.ts). Different lifecycles, different concerns. Revisit only if the browser-skills work needs the same predicate shape. 
+- **TODOS:1120** — `test/ship-version-sync.test.ts` reimplements ship/SKILL.md.tmpl Step 12 bash. D2 (sections/*.md.tmpl pipeline) is the structural fix. Phase B work obviates this TODO; mark as resolved when ship/ extraction lands. +- **TODOS:1136** — `git show` fallback in ship/SKILL.md.tmpl Step 12 line 409. Phase B touches this; bundle the `git rev-parse --verify` fix into the version-bump section extraction. + +### Test plan artifact + +Test plan written to `~/.gstack/projects/garrytan-gstack/garrytan-garrytan-slim-skill-tokens-eng-review-test-plan-.md`. `/qa` and `/qa-only` consume this as primary test input. Covers: per-phase test coverage targets, fixture design for section-loading tests, CI budget enforcement check, migration round-trip test. + +### Failure modes additions + +Adding to the registry from §Failure Modes (already complete; new rows): + +| Codepath | Failure mode | Rescued? | Test? | User sees | Logged | +|---|---|---|---|---|---| +| sections/*.md.tmpl generator | template references missing resolver | Y — gen-skill-docs fails build loudly | Y (gen-skill-docs.test.ts extended) | build error | stderr | +| Manifest ↔ filesystem consistency | manifest references section file that doesn't exist | Y — CI check fails | Y (new `test/section-manifest-consistency.test.ts`) | build error | PR summary | +| Manifest ↔ filesystem consistency | section file exists but not in manifest (orphan) | WARN v2.0; FAIL v2.1+ | Y (same test) | PR summary | PR comment | +| Budget cap exceeded | single test or aggregate exceeds `EVALS_BUDGET_HARD_CAP` | Y — CI fails | Y (budget-regression extended) | build error w/ cost breakdown | stderr | + +Still 0 critical gaps. All new failure modes have rescue + test + visibility. + +### Execution sequencing (sequential v1.45, integration-branch v2.0) + +v1.45 runs **sequentially** in a single branch, T1 → T8. 
The parallelization map was reconsidered after codex's second-pass critique flagged that T2 (gen-skill-docs.ts TemplateContext changes) and T4 (catalog frontmatter additions) almost certainly touch each other at compile time — both branches passing alone, failing at integration. Sequential lands cleaner and avoids 3-way merge surprise. AI compression makes the wall-clock cost of sequential acceptable. + +| Step | Modules touched | Depends on | +|---|---|---| +| T1 Phase 0 evals (~20 files) | `test/skill-e2e-*.test.ts`, `test/skill-coverage-matrix.ts`, `test/helpers/touchfiles.ts` | — | +| T2 conditional resolver gate | `scripts/gen-skill-docs.ts`, `scripts/resolvers/types.ts`, `scripts/resolvers/index.ts` | T1 | +| T3 jargon dedup + terse compression | `scripts/resolvers/preamble/*` | T2 | +| T4 catalog trim | `scripts/skill-catalog.ts`, `scripts/proactive-suggestions.json`, all SKILL.md.tmpl frontmatter | T2 | +| T5 hard token budgets + override path | `test/skill-e2e-budget-regression.test.ts` (per-suite caps + `EVALS_BUDGET_OVERRIDE_REASON`) | T1 | +| T6 cso targeted dedup | `cso/SKILL.md.tmpl` | T2, T3 | +| T7 regenerate all SKILL.md atomically | all `*/SKILL.md` | T1-T6 | +| T8 v1.45 CHANGELOG | `CHANGELOG.md`, `VERSION` | T7 | +| **— v1.45.0.0 ship boundary —** | | | +| T9 ship/ sections/ extraction | `ship/SKILL.md.tmpl`, `ship/sections/*`, gen-skill-docs (sections pipeline w/ TemplateContext contract) | T8 + sections-pipeline (T2/D2) | +| T10 ship/ canary cohort | `test/helpers/transcript-section-logger.ts` | T9 | +| T11 plan-ceo-review sections/ | `plan-ceo-review/SKILL.md.tmpl` + sections | T10 (ship/ proven) | +| T12 office-hours + plan-eng + plan-design sections/ | respective directories | T11 | +| T13 Phase C eval annotations + 3-tier orphan check | gen-skill-docs.ts orphan walker, all sections/*.md | T9-T12 | +| T14 migration script | `gstack-upgrade/migrations/v2.0.0.0.sh` | T13 | +| T15 v2.0.0.0 CHANGELOG + README banner | `CHANGELOG.md`, 
`README.md`, `VERSION` | T14 | +| T16 TODOS bulk-add | `TODOS.md` | — anytime | + +**Execution recommendation:** single-worktree sequential for both v1.45 (T1→T8) and v2.0 (T9→T15). T16 lands whenever. The CC speedup comes from per-step compression (each step is ~1 hour vs human-days), not from parallel branches. + +## Codex consult additions (second pass, post eng-review) + +### Cathedral parity-eval suite (Phase 0 add-on, expanded to "11") + +User said "do it like 11, not just 10. max it out and then some." Maxed-out scope: + +- **ALL 31 skills** get golden-baseline transcripts (not just top 5) +- **Multiple fixtures per skill** (3-5 representative invocation paths each) +- **Quantitative + qualitative scoring:** LLM-as-judge similarity score (1-10) AND transcript-diff highlights (added/removed sections, missing nuance) +- **Token-efficiency ratio measured:** quality-per-token = judge_score / tokens_consumed (forces v2 to be measurably MORE efficient, not just smaller) +- **"Quality budget" alongside "token budget":** both enforced in CI. A v2 skill that compressed to half size but dropped from 9/10 quality to 6/10 fails the gate. +- **Side-by-side PR comment:** every PR that touches a heavyweight skill auto-posts a v1.45-baseline vs current parity comparison in the PR summary +- **Public benchmark page:** `gstack.benchmarks.md` (new), continuously updated. Quotable: "v2 average parity score: 9.2/10, average token reduction: 67%." +- **Continuous monitoring:** parity suite runs weekly on main; alerts if any skill drifts below baseline (Discord webhook or similar) +- **Baseline-capture script:** `test/helpers/capture-parity-baseline.ts` — run once at v1.44 HEAD to lock in golden transcripts before any Phase A work lands + +Effort: human ~3-4 days / CC ~6-8 hours one-time + ~$30/week ongoing for continuous monitoring. 
Cost is justified — this is the ONLY mechanism that catches "looks green, feels worse" silent regression that section-loading and budget tests both miss. Adds new tasks T0a (baseline capture) and T0b (parity eval harness) BEFORE T1. + +### Absorbed refinements from codex consult (no further user decision needed) + +1. **TemplateContext contract for sections pipeline (codex D2 critique):** explicit spec required in T9. Section generation uses the SAME `TemplateContext` as SKILL.md generation — same `skillName`, same host suppression, same `explainLevel`, same tier gating. Documented in code comments + asserted by `test/template-context-parity.test.ts` (new). +2. **3-tier orphan classification (codex orphan-semantics critique):** the CI check (T13) distinguishes: + - **Generated orphan** (`sections/foo.md` exists, no `sections/foo.md.tmpl`) → FAIL immediately, every release + - **Manifest orphan** (`sections/foo.md.tmpl` exists, not in `manifest.json`) → WARN in v2.0, FAIL in v2.1+ + - **Hand-edited generated file** (`sections/foo.md` diverges from what regen would produce) → FAIL immediately, with "this file is generated, edit `.tmpl` instead" message +3. **Budget cap override path (codex D3 critique):** `EVALS_BUDGET_HARD_CAP=$30` becomes the default; per-suite caps via `EVALS_BUDGET_HARD_CAP_GATE=$25`, `EVALS_BUDGET_HARD_CAP_PERIODIC=$70`; override path `EVALS_BUDGET_OVERRIDE_REASON=""` env required to exceed cap (CI prints the reason in build output for audit trail); daily org-level spend alert via existing analytics (`~/.gstack/analytics/skill-usage.jsonl` aggregator). +4. **Manifest as passive data (codex D1 critique):** `manifest.json` fields are IDs, file paths, and human-readable trigger text ONLY. No `applies_when` predicate. The skill skeleton's decision-tree prose is the ONLY place "when to read X" is decided. Avoids inventing a fourth condition language alongside tier-gating + `appliesTo` + `requiredReads`. +5. 
**T7 as integration-branch flow (codex parallelization critique, now obviated by sequential):** sequential execution means T7 is just "atomic regenerate within the single v1.45 branch." Integration-branch dance not needed. The critique's intent (no 3-way merge surprise) is honored by collapsing to sequential. + +### New failure modes (additions to registry) + +| Codepath | Failure mode | Rescued? | Test? | User sees | Logged | +|---|---|---|---|---|---| +| Sections pipeline TemplateContext | sections generated with divergent ctx (e.g. wrong skillName) | Y — parity test fails | Y (`test/template-context-parity.test.ts`) | build error | stderr | +| Hand-edited generated section | user edits `sections/foo.md` directly instead of `.tmpl` | Y — CI fails with explicit message | Y (orphan-check 3-tier classification) | "this file is generated, edit `.tmpl` instead" | PR summary | +| Quality budget exceeded | v2 skill compressed but dropped >2 points on LLM-judge parity | Y — CI fails | Y (parity-eval suite) | "v2 X.md dropped from 9.2 to 6.4 vs v1.45 baseline" | PR comment with diff | +| Budget cap override audit | EVALS_BUDGET_OVERRIDE_REASON used | N (intentional escape valve) | Y (audit-log test) | reason printed in CI output, logged to spend-audit jsonl | analytics/spend-overrides.jsonl | +| Parity baseline drift on main | weekly continuous monitor detects regression | Y — Discord alert + ticket | Y (continuous-monitor test) | alert in team channel | analytics/parity-drift.jsonl | + +Still 0 critical gaps. + +## v2 launch copy specs (from /plan-devex-review) + +These drafts become the source of truth for v2.0.0.0 launch tone. T15 implements them verbatim (unless workshopping at ship time produces a measurably better take, in which case update both plan and implementation in lockstep). + +### JUST_UPGRADED notice (Persona A — existing user upgrading) + +Triggered by `gstack-update-check` showing `JUST_UPGRADED v1.x v2.0.0.0`. 
Replaces the generic v1 "Running gstack v{to} (just updated!)" with persona-A-aware copy that names the perceived speed win AND signals "your muscle memory still works." + +``` +Running gstack v2.0.0.0 (just updated!). Your sessions are now ~67% lighter. +Heavyweight skills load only the sections they need; the catalog dropped to +one line per skill. Everything still works the same way: your /ship, /qa, +/review commands haven't changed. Run `/gstack-upgrade --explain-v2` for the +full migration story, or just keep working. +``` + +Voice rules honored: lead with the win ("67% lighter"); concrete numbers; reassurance that workflows are unchanged ("everything still works the same way"); escape hatch (`--explain-v2`). No em dashes. Aimed at a 5-second read. + +Implementation: update `~/.claude/skills/gstack/gstack-upgrade/SKILL.md.tmpl` Inline upgrade flow with v2-aware message; existing `JUST_UPGRADED ` detection in skill preamble fires it. + +### CHANGELOG numbers table (Persona A's magical moment + Persona B's evaluation evidence) + +Lands in `## [v2.0.0.0]` entry of CHANGELOG.md, immediately under the headline. Compare measured v1.44 actuals (baseline captured by `test/helpers/capture-parity-baseline.ts` BEFORE Phase A starts) vs v2.0.0.0 measured. Numbers must be REAL, not estimated; replace placeholders during T15.
+ +| Metric | v1.44.1 (baseline) | v2.0.0.0 (measured) | Δ | +|---|---|---|---| +| Total SKILL.md corpus | 2.1 MB | ~700 KB | **−67%** | +| ship.md (heaviest) | 164 KB | ~15 KB skeleton + 5×~5 KB sections | **−76% first-Read** | +| plan-ceo-review.md | 131 KB | ~12 KB skeleton + sections on demand | **−68% first-Read** | +| office-hours.md | 111 KB | ~10 KB skeleton + sections on demand | **−71% first-Read** | +| Catalog tokens (always-loaded system prompt) | ~25K tokens | ~6K tokens | **−76%** | +| Per-invocation tokens (typical /ship session) | ~41K | ~14K skeleton + on-demand sections | **~60% drop** | +| Eval coverage (skills with E2E protection) | ~16 of 31 | **31 of 31 + parity baselines** | quality gate enabled | +| Parity score vs v1.44 baseline (LLM judge, all 31 skills) | — | **≥9.0/10 floor** | (CI-enforced; see parity-eval suite) | + +Below the table, one paragraph in gstack voice: "v1 was the heaviest opinionated skill pack. v2 is the lightest. The compression isn't free: every skill ships with both gate-tier and periodic-tier E2E evals, and a continuous parity-monitor catches silent quality regressions. The numbers above are measured against `test/helpers/parity-baseline-v1.44.1/` and reproduced by `bun run eval:parity`." + +### README v2 banner + +Placement: top of README.md, immediately under the existing Karpathy pull-quote, above "When I heard Karpathy say this..." Stays in place for 60 days post-launch, then collapses to a one-line "v2 released May 2026" entry in the Quick start section. + +```markdown +> **gstack v2.0.0.0: the lightest opinionated skill pack (May 2026)** +> +> Heavyweight skills now load only the sections they need. Total SKILL.md +> corpus dropped from 2.1 MB to ~700 KB. Every skill ships with E2E eval +> protection and a continuous parity-monitor against v1.44 baselines. +> See the [v2.0.0.0 release notes](CHANGELOG.md) for per-skill numbers and +> the migration story. Existing users: `/gstack-upgrade` auto-regenerates.
+```
+
+Voice rules honored: lead with the position ("lightest opinionated skill pack"); concrete numbers (2.1 MB → 700 KB); proof of rigor (eval protection + parity monitor); migration path explicit. No em dashes. Aimed at a 10-second read.
+
+### Implementation notes (for T15)
+
+- Lock the actual v1.44 baseline numbers into `test/helpers/parity-baseline-v1.44.1/` BEFORE Phase A regeneration starts. The "v1 vs v2" delta only quotes accurately if v1.44 was measured in the same units (token count via `tiktoken`, byte count via `wc -c`, eval coverage via `test/skill-coverage-matrix.ts`).
+- If the measured v2 numbers come in LESS impressive than the drafts above (e.g., ship.md ends up at 25 KB instead of 15 KB), update the drafts to reflect reality. Never invent numbers; the marketing-grade ship moment dies the moment readers find a number they can disprove with `wc -c`.
+- The JUST_UPGRADED notice fires automatically via the existing `gstack-upgrade` detection; no new mechanism is required.
+- The README banner placement (under the Karpathy pull-quote, above the "When I heard Karpathy say this..." paragraph) is intentional: persona B (new evaluator) sees the v2 win BEFORE the Karpathy framing, anchoring "this is May 2026's most-current gstack."
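The "same units, never invent numbers" rule in the notes above could be enforced in code rather than by hand. The following is a minimal sketch, not part of the gstack repo: `Measurement`, `percentDelta`, and `formatRow` are hypothetical names, and it assumes both byte counts come from real runs (for example `wc -c` on each version).

```typescript
// Hypothetical helper (an assumption, not gstack code): builds one CHANGELOG
// table row from two byte counts that were both actually measured.
interface Measurement {
  label: string;
  baselineBytes: number; // measured on the v1.44.1 baseline
  measuredBytes: number; // measured on v2.0.0.0
}

// Signed percentage change from baseline to measured, rounded to an integer.
// Throws instead of guessing when an input is missing or not a real count.
function percentDelta(baselineBytes: number, measuredBytes: number): number {
  if (!Number.isFinite(baselineBytes) || baselineBytes <= 0) {
    throw new Error("baseline must be a positive, measured byte count");
  }
  if (!Number.isFinite(measuredBytes) || measuredBytes < 0) {
    throw new Error("measured value must be a non-negative byte count");
  }
  return Math.round(((measuredBytes - baselineBytes) / baselineBytes) * 100);
}

// Renders a markdown table row; sizes are shown in KB (1 KB = 1024 bytes).
function formatRow(m: Measurement): string {
  const kb = (bytes: number): string => `${Math.round(bytes / 1024)} KB`;
  const delta = percentDelta(m.baselineBytes, m.measuredBytes);
  const sign = delta > 0 ? "+" : delta < 0 ? "-" : "";
  return `| ${m.label} | ${kb(m.baselineBytes)} | ${kb(m.measuredBytes)} | **${sign}${Math.abs(delta)}%** |`;
}
```

Because the helper throws on a zero, negative, or non-numeric input, a draft placeholder that was never replaced fails loudly instead of producing a plausible-looking percentage.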
+
+## GSTACK REVIEW REPORT
+
+| Review | Trigger | Why | Runs | Status | Findings |
+|---|---|---|---|---|---|
+| CEO Review | `/plan-ceo-review` | Scope & strategy | 1 | CLEAR | SCOPE_EXPANSION mode; 3 expansion proposals (1 accepted: v2 launch positioning; 2 deferred: gstack lite, gstack budget); 11/11 sections reviewed; 0 critical gaps |
+| Codex Review | `/codex review` | Independent 2nd opinion (outside voice) | 1 | issues_found | 12 challenges surfaced; 7 absorbed into plan (#4, #5, #6, #9, #10, #11, #12); 3 surfaced as user-decision (#1 user kept original pick, #7 hybrid split adopted, #8 user accepted codex) |
+| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR | 3 architectural decisions locked (D1 JSON manifest, D2 sections/*.md.tmpl pipeline, D3 CI cost cap); 4 new failure modes added (all rescued+tested); test plan artifact written; parallelization map produced (3 lanes parallel in v1.45, sequential in v2.0); 0 critical gaps; 0 unresolved decisions |
+| Codex Consult (2nd pass) | `/codex` (consult on eng-review additions) | Independent challenge of D1/D2/D3 + parallelization | 1 | issues_found | 7 additional findings on eng-review additions; 5 absorbed (TemplateContext contract, 3-tier orphan classification, budget cap override path, manifest as passive data not predicates, T7 as integration-flow obviated by sequential); 2 surfaced as user-decision (attention-architecture risk → cathedral parity-eval suite added at "11"; parallelization collapsed to sequential v1.45 per codex critique) |
+| Design Review | `/plan-design-review` | UI/UX gaps | 0 | — | not required (no significant UI scope; README/CHANGELOG only) |
+| DX Review | `/plan-devex-review` | Developer experience gaps | 1 | CLEAR | DX POLISH mode; product type = Claude Code Skill; 2 personas tracked equally (existing-user upgrader + new-user evaluator); initial 7.9/10 → 9.0/10 after launch-copy specs added to plan (JUST_UPGRADED notice, CHANGELOG numbers table, README v2 banner all drafted as T15 deliverables); all 8 passes evaluated; skill DX checklist passes |
+
+**CODEX:** First pass (CEO): 12 findings, 7 absorbed, 3 cross-model user-decided, 2 baked into tasks. Second pass (post eng-review): 7 findings on the new D1/D2/D3 additions, 5 absorbed, 2 user-decided. Both passes preserved as audit trail. 19 total codex findings → 12 absorbed without friction, 5 user-decided across both passes, 2 quality-of-life refinements baked into tasks. DX review skipped fresh codex pass (3 prior passes already covered structural blind spots; remaining DX work is copy-craft, where codex adds less value than user taste).
+
+**CROSS-MODEL:** Strong agreement on (a) phasing (catalog trim early, sections/ later), (b) measurement-first (hard token budgets + override audit trail), (c) forks/rollout-strategy gaps. Tensions resolved across all passes: eval-first scope (user kept), v2 vs v1.x (HYBRID adopted), migration heaviness (lighter touch adopted), parallelization (user accepted codex's sequential critique), attention-architecture risk (user expanded scope to cathedral parity-eval suite covering ALL 31 skills with quality budget alongside token budget), launch copy artifacts (user drafted all three in plan vs deferring to T15 implementation).
+
+**UNRESOLVED:** 0 decisions outstanding across all 5 reviews.
+
+**VERDICT:** CEO + ENG + CODEX×2 + DX CLEARED — ready to implement.
The hybrid v1.45/v2.0 split de-risks the bloat-reputation fix; the sections/*.md.tmpl pipeline (D2) prevents drift; the CI cost cap with override audit (D3 + codex absorbed refinement) prevents runaway eval spend; the cathedral parity-eval suite (codex 2nd pass) catches silent attention-architecture regressions that section-loading + budget tests alone would miss; sequential v1.45 execution (codex absorbed) trades wall-clock for integration safety; v2 launch copy specs (DX review) make the marketing-grade ship moment land for both persona A (existing upgrader) and persona B (new evaluator). Plan is now executable. diff --git a/document-generate/SKILL.md b/document-generate/SKILL.md index e6cf9965d..fb53315e8 100644 --- a/document-generate/SKILL.md +++ b/document-generate/SKILL.md @@ -2,13 +2,7 @@ name: document-generate preamble-tier: 2 version: 1.0.0 -description: | - Generate missing documentation from scratch for a feature, module, or entire project. - Uses the Diataxis framework (tutorial / how-to / reference / explanation) to produce - complete, structured documentation. Can be invoked standalone or called by - /document-release when it finds coverage gaps. Use when asked to "write docs", - "generate documentation", "document this feature", "create a tutorial", or - "explain this module". (gstack) +description: Generate missing documentation from scratch for a feature, module, or entire project. (gstack) allowed-tools: - Bash - Read @@ -29,6 +23,15 @@ triggers: + +## When to invoke this skill + +Uses the Diataxis framework (tutorial / how-to / reference / explanation) to produce +complete, structured documentation. Can be invoked standalone or called by +/document-release when it finds coverage gaps. Use when asked to "write docs", +"generate documentation", "document this feature", "create a tutorial", or +"explain this module". + ## Preamble (run first) ```bash @@ -555,84 +558,7 @@ Applies to AskUserQuestion, user replies, and findings. 
AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. 
## Completeness Principle — Boil the Lake diff --git a/document-release/SKILL.md b/document-release/SKILL.md index b49f4e89b..10ce70573 100644 --- a/document-release/SKILL.md +++ b/document-release/SKILL.md @@ -2,14 +2,7 @@ name: document-release preamble-tier: 2 version: 1.0.0 -description: | - Post-ship documentation update. Reads all project docs, cross-references the - diff, builds a Diataxis coverage map (reference/how-to/tutorial/explanation), - updates README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md to match what shipped, - detects architecture diagram drift, polishes CHANGELOG voice with a sell-test - rubric, cleans up TODOS, and optionally bumps VERSION. Surfaces documentation - debt in the PR body. Use when asked to "update the docs", "sync documentation", - or "post-ship docs". Proactively suggest after a PR is merged or code is shipped. (gstack) +description: Post-ship documentation update. (gstack) allowed-tools: - Bash - Read @@ -26,6 +19,17 @@ triggers: + +## When to invoke this skill + +Reads all project docs, cross-references the +diff, builds a Diataxis coverage map (reference/how-to/tutorial/explanation), +updates README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md to match what shipped, +detects architecture diagram drift, polishes CHANGELOG voice with a sell-test +rubric, cleans up TODOS, and optionally bumps VERSION. Surfaces documentation +debt in the PR body. Use when asked to "update the docs", "sync documentation", +or "post-ship docs". Proactively suggest after a PR is merged or code is shipped. + ## Preamble (run first) ```bash @@ -552,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/freeze/SKILL.md b/freeze/SKILL.md index 87f8506ca..fc82b1bea 100644 --- a/freeze/SKILL.md +++ b/freeze/SKILL.md @@ -1,12 +1,7 @@ --- name: freeze version: 0.1.0 -description: | - Restrict file edits to a specific directory for the session. Blocks Edit and - Write outside the allowed path. Use when debugging to prevent accidentally - "fixing" unrelated code, or when you want to scope changes to one module. 
- Use when asked to "freeze", "restrict edits", "only edit this folder", - or "lock down edits". (gstack) +description: Restrict file edits to a specific directory for the session. (gstack) triggers: - freeze edits to directory - lock editing scope @@ -31,6 +26,15 @@ hooks: + +## When to invoke this skill + +Blocks Edit and +Write outside the allowed path. Use when debugging to prevent accidentally +"fixing" unrelated code, or when you want to scope changes to one module. +Use when asked to "freeze", "restrict edits", "only edit this folder", +or "lock down edits". + # /freeze — Restrict Edits to a Directory Lock file edits to a specific directory. Any Edit or Write operation targeting diff --git a/gstack-upgrade/SKILL.md b/gstack-upgrade/SKILL.md index 81bb1228c..9f0f2f7ea 100644 --- a/gstack-upgrade/SKILL.md +++ b/gstack-upgrade/SKILL.md @@ -1,11 +1,7 @@ --- name: gstack-upgrade version: 1.1.0 -description: | - Upgrade gstack to the latest version. Detects global vs vendored install, - runs the upgrade, and shows what's new. Use when asked to "upgrade gstack", - "update gstack", or "get latest version". - Voice triggers (speech-to-text aliases): "upgrade the tools", "update the tools", "gee stack upgrade", "g stack upgrade". +description: Upgrade gstack to the latest version. triggers: - upgrade gstack - update gstack version @@ -19,6 +15,15 @@ allowed-tools: + +## When to invoke this skill + +Detects global vs vendored install, +runs the upgrade, and shows what's new. Use when asked to "upgrade gstack", +"update gstack", or "get latest version". + +Voice triggers (speech-to-text aliases): "upgrade the tools", "update the tools", "gee stack upgrade", "g stack upgrade". + # /gstack-upgrade Upgrade gstack to the latest version and show what's new. 
diff --git a/guard/SKILL.md b/guard/SKILL.md index 36216ac16..8d4a57448 100644 --- a/guard/SKILL.md +++ b/guard/SKILL.md @@ -1,12 +1,7 @@ --- name: guard version: 0.1.0 -description: | - Full safety mode: destructive command warnings + directory-scoped edits. - Combines /careful (warns before rm -rf, DROP TABLE, force-push, etc.) with - /freeze (blocks edits outside a specified directory). Use for maximum safety - when touching prod or debugging live systems. Use when asked to "guard mode", - "full safety", "lock it down", or "maximum safety". (gstack) +description: Full safety mode: destructive command warnings + directory-scoped edits. (gstack) triggers: - full safety mode - guard against mistakes @@ -36,6 +31,14 @@ hooks: + +## When to invoke this skill + +Combines /careful (warns before rm -rf, DROP TABLE, force-push, etc.) with +/freeze (blocks edits outside a specified directory). Use for maximum safety +when touching prod or debugging live systems. Use when asked to "guard mode", +"full safety", "lock it down", or "maximum safety". + # /guard — Full Safety Mode Activates both destructive command warnings and directory-scoped edit restrictions. diff --git a/health/SKILL.md b/health/SKILL.md index 396c980b2..3343c5551 100644 --- a/health/SKILL.md +++ b/health/SKILL.md @@ -2,12 +2,7 @@ name: health preamble-tier: 2 version: 1.0.0 -description: | - Code quality dashboard. Wraps existing project tools (type checker, linter, - test runner, dead code detector, shell linter), computes a weighted composite - 0-10 score, and tracks trends over time. Use when: "health check", - "code quality", "how healthy is the codebase", "run all checks", - "quality score". (gstack) +description: Code quality dashboard. 
(gstack) triggers: - code health check - quality dashboard @@ -24,6 +19,15 @@ allowed-tools: + +## When to invoke this skill + +Wraps existing project tools (type checker, linter, +test runner, dead code detector, shell linter), computes a weighted composite +0-10 score, and tracks trends over time. Use when: "health check", +"code quality", "how healthy is the codebase", "run all checks", +"quality score". + ## Preamble (run first) ```bash @@ -550,84 +554,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list 
lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/investigate/SKILL.md b/investigate/SKILL.md index 40525d63a..75400287e 100644 --- a/investigate/SKILL.md +++ b/investigate/SKILL.md @@ -2,14 +2,7 @@ name: investigate preamble-tier: 2 version: 1.0.0 -description: | - Systematic debugging with root cause investigation. Four phases: investigate, - analyze, hypothesize, implement. Iron Law: no fixes without root cause. - Use when asked to "debug this", "fix this bug", "why is this broken", - "investigate this error", or "root cause analysis". - Proactively invoke this skill (do NOT debug directly) when the user reports - errors, 500 errors, stack traces, unexpected behavior, "it was working - yesterday", or is troubleshooting why something stopped working. (gstack) +description: Systematic debugging with root cause investigation. (gstack) allowed-tools: - Bash - Read @@ -63,6 +56,17 @@ gbrain: + +## When to invoke this skill + +Four phases: investigate, +analyze, hypothesize, implement. Iron Law: no fixes without root cause. +Use when asked to "debug this", "fix this bug", "why is this broken", +"investigate this error", or "root cause analysis". +Proactively invoke this skill (do NOT debug directly) when the user reports +errors, 500 errors, stack traces, unexpected behavior, "it was working +yesterday", or is troubleshooting why something stopped working. + ## Preamble (run first) ```bash @@ -589,84 +593,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/ios-clean/SKILL.md b/ios-clean/SKILL.md index f1a458e1e..77809e26b 100644 --- a/ios-clean/SKILL.md +++ b/ios-clean/SKILL.md @@ -2,15 +2,7 @@ name: ios-clean preamble-tier: 3 version: 1.0.0 -description: | - Remove the DebugBridge SPM package and all #if DEBUG wiring from an iOS - app. Cleans up StateServer, DebugOverlay, accessor codegen output, and - app-side hooks installed by /ios-qa. 
This is a convenience wrapper — - the structural Release-build guard (Package.swift conditional + CI - swift build -c release check) is the safety-critical path. - Use when asked to "clean the iOS debug bridge", "remove DebugBridge", - or "strip the gstack iOS instrumentation". (gstack) - Voice triggers (speech-to-text aliases): "clean the iOS debug bridge", "remove DebugBridge", "strip the gstack iOS instrumentation". +description: Remove the DebugBridge SPM package and all #if DEBUG wiring from an iOS app. (gstack) allowed-tools: - Bash - Read @@ -26,6 +18,18 @@ triggers: + +## When to invoke this skill + +Cleans up StateServer, DebugOverlay, accessor codegen output, and +app-side hooks installed by /ios-qa. This is a convenience wrapper — +the structural Release-build guard (Package.swift conditional + CI +swift build -c release check) is the safety-critical path. +Use when asked to "clean the iOS debug bridge", "remove DebugBridge", +or "strip the gstack iOS instrumentation". + +Voice triggers (speech-to-text aliases): "clean the iOS debug bridge", "remove DebugBridge", "strip the gstack iOS instrumentation". + ## Preamble (run first) ```bash @@ -552,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/ios-design-review/SKILL.md b/ios-design-review/SKILL.md index f03806f54..0560719fe 100644 --- a/ios-design-review/SKILL.md +++ b/ios-design-review/SKILL.md @@ -2,17 +2,7 @@ name: ios-design-review preamble-tier: 3 version: 1.0.0 -description: | - Visual design audit for iOS apps on real hardware. Connects to a real - iPhone via the same StateServer as /ios-qa, screenshots every screen, - evaluates against Apple HIG, DESIGN.md, and design best practices. 
Scores - each dimension 0-10 with "what would make it a 10" framing — mirrors - /plan-design-review for browser. For plan-stage design review (before - implementation), use /plan-design-review. For live web visual audits, use - /design-review. - Use when asked to "review the iOS design", "audit the iPhone app's - visuals", or "design QA the iOS app". (gstack) - Voice triggers (speech-to-text aliases): "review the iOS design", "audit the iPhone app's visuals", "design QA the iPhone app". +description: Visual design audit for iOS apps on real hardware. (gstack) allowed-tools: - Bash - Read @@ -27,6 +17,21 @@ triggers: + +## When to invoke this skill + +Connects to a real +iPhone via the same StateServer as /ios-qa, screenshots every screen, +evaluates against Apple HIG, DESIGN.md, and design best practices. Scores +each dimension 0-10 with "what would make it a 10" framing — mirrors +/plan-design-review for browser. For plan-stage design review (before +implementation), use /plan-design-review. For live web visual audits, use +/design-review. +Use when asked to "review the iOS design", "audit the iPhone app's +visuals", or "design QA the iOS app". + +Voice triggers (speech-to-text aliases): "review the iOS design", "audit the iPhone app's visuals", "design QA the iPhone app". + ## Preamble (run first) ```bash @@ -553,84 +558,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/ios-fix/SKILL.md b/ios-fix/SKILL.md index cc7a172b1..f0659bfa2 100644 --- a/ios-fix/SKILL.md +++ b/ios-fix/SKILL.md @@ -2,16 +2,7 @@ name: ios-fix preamble-tier: 3 version: 1.0.0 -description: | - Autonomous iOS bug fixer. Takes a bug found by /ios-qa, reads the source, - writes the fix, rebuilds, redeploys, and verifies the fix on the real - device. Closes the loop: find bug → fix bug → confirm fix — zero human - intervention. 
Captures the pre-bug state snapshot as a regression test - fixture, so the bug can never recur silently. - Use when /ios-qa reports a bug and you want it fixed automatically, or - when asked to "fix this iOS bug", "patch the iPhone app", or "auto-fix - the iOS issue". (gstack) - Voice triggers (speech-to-text aliases): "fix the iOS bug", "patch the iPhone app", "auto-fix the iOS issue". +description: Autonomous iOS bug fixer. (gstack) allowed-tools: - Bash - Read @@ -28,6 +19,20 @@ triggers: + +## When to invoke this skill + +Takes a bug found by /ios-qa, reads the source, +writes the fix, rebuilds, redeploys, and verifies the fix on the real +device. Closes the loop: find bug → fix bug → confirm fix — zero human +intervention. Captures the pre-bug state snapshot as a regression test +fixture, so the bug can never recur silently. +Use when /ios-qa reports a bug and you want it fixed automatically, or +when asked to "fix this iOS bug", "patch the iPhone app", or "auto-fix +the iOS issue". + +Voice triggers (speech-to-text aliases): "fix the iOS bug", "patch the iPhone app", "auto-fix the iOS issue". + ## Preamble (run first) ```bash @@ -554,84 +559,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/ios-qa/SKILL.md b/ios-qa/SKILL.md index 4d03a041b..347dab978 100644 --- a/ios-qa/SKILL.md +++ b/ios-qa/SKILL.md @@ -2,17 +2,7 @@ name: ios-qa preamble-tier: 3 version: 1.0.0 -description: | - Live-device iOS QA for SwiftUI apps. Connects to a real iPhone via USB - CoreDevice IPv6 tunnel, reads Swift source to understand every screen, then - runs a vision-driven agent loop: screenshot → analyze → decide → act → - verify → repeat. 
All interaction happens via HTTP to an embedded - StateServer in the app under test. Optionally exposes the device over - Tailscale so remote agents (OpenClaw, Codex, any HTTP-capable agent) can - run iOS QA from anywhere without touching the hardware. - Use when asked to "ios qa", "test my iPhone app", "find bugs on the device", - or "qa the iOS app". (gstack) - Voice triggers (speech-to-text aliases): "iOS quality check", "test the iPhone app", "run iOS QA". +description: Live-device iOS QA for SwiftUI apps. (gstack) allowed-tools: - Bash - Read @@ -31,6 +21,21 @@ triggers: + +## When to invoke this skill + +Connects to a real iPhone via USB +CoreDevice IPv6 tunnel, reads Swift source to understand every screen, then +runs a vision-driven agent loop: screenshot → analyze → decide → act → +verify → repeat. All interaction happens via HTTP to an embedded +StateServer in the app under test. Optionally exposes the device over +Tailscale so remote agents (OpenClaw, Codex, any HTTP-capable agent) can +run iOS QA from anywhere without touching the hardware. +Use when asked to "ios qa", "test my iPhone app", "find bugs on the device", +or "qa the iOS app". + +Voice triggers (speech-to-text aliases): "iOS quality check", "test the iPhone app", "run iOS QA". + ## Preamble (run first) ```bash @@ -557,84 +562,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/ios-sync/SKILL.md b/ios-sync/SKILL.md index c3449af7b..ce0c81929 100644 --- a/ios-sync/SKILL.md +++ b/ios-sync/SKILL.md @@ -2,14 +2,7 @@ name: ios-sync preamble-tier: 3 version: 1.0.0 -description: | - Regenerate the iOS debug bridge against the latest upstream gstack - templates. Updates StateServer.swift, DebugOverlay.swift, Package.swift, - and the typed @Observable state accessors. 
Use after you upgrade gstack - or add new ViewModels/properties that need accessor coverage. - Use when asked to "resync the iOS debug bridge", "regenerate iOS - accessors", or "update the gstack iOS instrumentation". (gstack) - Voice triggers (speech-to-text aliases): "resync the iOS debug bridge", "regenerate iOS accessors", "update the gstack iOS instrumentation". +description: Regenerate the iOS debug bridge against the latest upstream gstack templates. (gstack) allowed-tools: - Bash - Read @@ -26,6 +19,17 @@ triggers: + +## When to invoke this skill + +Updates StateServer.swift, DebugOverlay.swift, Package.swift, +and the typed @Observable state accessors. Use after you upgrade gstack +or add new ViewModels/properties that need accessor coverage. +Use when asked to "resync the iOS debug bridge", "regenerate iOS +accessors", or "update the gstack iOS instrumentation". + +Voice triggers (speech-to-text aliases): "resync the iOS debug bridge", "regenerate iOS accessors", "update the gstack iOS instrumentation". + ## Preamble (run first) ```bash @@ -552,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/land-and-deploy/SKILL.md b/land-and-deploy/SKILL.md index ef7497cd4..3c67cbfe0 100644 --- a/land-and-deploy/SKILL.md +++ b/land-and-deploy/SKILL.md @@ -2,11 +2,7 @@ name: land-and-deploy preamble-tier: 4 version: 1.0.0 -description: | - Land and deploy workflow. Merges the PR, waits for CI and deploy, - verifies production health via canary checks. Takes over after /ship - creates the PR. 
Use when: "merge", "land", "deploy", "merge and verify", - "land it", "ship it to production". (gstack) +description: Land and deploy workflow. (gstack) allowed-tools: - Bash - Read @@ -21,6 +17,14 @@ triggers: + +## When to invoke this skill + +Merges the PR, waits for CI and deploy, +verifies production health via canary checks. Takes over after /ship +creates the PR. Use when: "merge", "land", "deploy", "merge and verify", +"land it", "ship it to production". + ## Preamble (run first) ```bash @@ -547,84 +551,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- 
dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/landing-report/SKILL.md b/landing-report/SKILL.md index be8aed5e1..921299863 100644 --- a/landing-report/SKILL.md +++ b/landing-report/SKILL.md @@ -1,12 +1,7 @@ --- name: landing-report version: 0.1.0 -description: | - Read-only queue dashboard for workspace-aware ship. Shows which VERSION slots - are currently claimed by open PRs, which sibling Conductor workspaces have - WIP work likely to ship soon, and what slot /ship would pick next. No - mutations — just a snapshot. Use when asked to "landing report", "what's in - the queue", "show me open PRs", or "which version do I claim next". (gstack) +description: Read-only queue dashboard for workspace-aware ship. (gstack) triggers: - landing report - version queue @@ -20,6 +15,15 @@ allowed-tools: + +## When to invoke this skill + +Shows which VERSION slots +are currently claimed by open PRs, which sibling Conductor workspaces have +WIP work likely to ship soon, and what slot /ship would pick next. No +mutations — just a snapshot. Use when asked to "landing report", "what's in +the queue", "show me open PRs", or "which version do I claim next". + # /landing-report — Version Queue Dashboard ## Preamble (run first) @@ -548,84 +552,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/learn/SKILL.md b/learn/SKILL.md index 3599115b8..8fe745816 100644 --- a/learn/SKILL.md +++ b/learn/SKILL.md @@ -2,12 +2,7 @@ name: learn preamble-tier: 2 version: 1.0.0 -description: | - Manage project learnings. Review, search, prune, and export what gstack - has learned across sessions. Use when asked to "what have we learned", - "show learnings", "prune stale learnings", or "export learnings". 
- Proactively suggest when the user asks about past patterns or wonders - "didn't we fix this before?" +description: Manage project learnings. triggers: - show learnings - what have we learned @@ -24,6 +19,15 @@ allowed-tools: + +## When to invoke this skill + +Review, search, prune, and export what gstack +has learned across sessions. Use when asked to "what have we learned", +"show learnings", "prune stale learnings", or "export learnings". +Proactively suggest when the user asks about past patterns or wonders +"didn't we fix this before?" + ## Preamble (run first) ```bash @@ -550,84 +554,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- 
memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/make-pdf/SKILL.md b/make-pdf/SKILL.md index 045e31516..f46b0d501 100644 --- a/make-pdf/SKILL.md +++ b/make-pdf/SKILL.md @@ -2,13 +2,7 @@ name: make-pdf preamble-tier: 1 version: 1.0.0 -description: | - Turn any markdown file into a publication-quality PDF. Proper 1in margins, - intelligent page breaks, page numbers, cover pages, running headers, curly - quotes and em dashes, clickable TOC, diagonal DRAFT watermark. Not a draft - artifact — a finished artifact. Use when asked to "make a PDF", "export to - PDF", "turn this markdown into a PDF", or "generate a document". (gstack) - Voice triggers (speech-to-text aliases): "make this a pdf", "make it a pdf", "export to pdf", "turn this into a pdf", "turn this markdown into a pdf", "generate a pdf", "make a pdf from", "pdf this markdown". +description: Turn any markdown file into a publication-quality PDF. (gstack) triggers: - markdown to pdf - generate pdf @@ -22,6 +16,17 @@ allowed-tools: + +## When to invoke this skill + +Proper 1in margins, +intelligent page breaks, page numbers, cover pages, running headers, curly +quotes and em dashes, clickable TOC, diagonal DRAFT watermark. Not a draft +artifact — a finished artifact. Use when asked to "make a PDF", "export to +PDF", "turn this markdown into a PDF", or "generate a document". + +Voice triggers (speech-to-text aliases): "make this a pdf", "make it a pdf", "export to pdf", "turn this into a pdf", "turn this markdown into a pdf", "generate a pdf", "make a pdf from", "pdf this markdown". 
+ ## Preamble (run first) ```bash diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index b54c7c791..16062aaab 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -2,18 +2,7 @@ name: office-hours preamble-tier: 3 version: 2.0.0 -description: | - YC Office Hours — two modes. Startup mode: six forcing questions that expose - demand reality, status quo, desperate specificity, narrowest wedge, observation, - and future-fit. Builder mode: design thinking brainstorming for side projects, - hackathons, learning, and open source. Saves a design doc. - Use when asked to "brainstorm this", "I have an idea", "help me think through - this", "office hours", or "is this worth building". - Proactively invoke this skill (do NOT answer directly) when the user describes - a new product idea, asks whether something is worth building, wants to think - through design decisions for something that doesn't exist yet, or is exploring - a concept before any code is written. - Use before /plan-ceo-review or /plan-eng-review. (gstack) +description: YC Office Hours — two modes. (gstack) allowed-tools: - Bash - Read @@ -59,6 +48,21 @@ gbrain: + +## When to invoke this skill + +Startup mode: six forcing questions that expose +demand reality, status quo, desperate specificity, narrowest wedge, observation, +and future-fit. Builder mode: design thinking brainstorming for side projects, +hackathons, learning, and open source. Saves a design doc. +Use when asked to "brainstorm this", "I have an idea", "help me think through +this", "office hours", or "is this worth building". +Proactively invoke this skill (do NOT answer directly) when the user describes +a new product idea, asks whether something is worth building, wants to think +through design decisions for something that doesn't exist yet, or is exploring +a concept before any code is written. +Use before /plan-ceo-review or /plan-eng-review. 
+ ## Preamble (run first) ```bash @@ -585,84 +589,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. 
## Completeness Principle — Boil the Lake diff --git a/open-gstack-browser/SKILL.md b/open-gstack-browser/SKILL.md index 8b4b0c493..bb9f808ee 100644 --- a/open-gstack-browser/SKILL.md +++ b/open-gstack-browser/SKILL.md @@ -1,13 +1,7 @@ --- name: open-gstack-browser version: 0.2.0 -description: | - Launch GStack Browser — AI-controlled Chromium with the sidebar extension baked in. - Opens a visible browser window where you can watch every action in real time. - The sidebar shows a live activity feed and chat. Anti-bot stealth built in. - Use when asked to "open gstack browser", "launch browser", "connect chrome", - "open chrome", "real browser", "launch chrome", "side panel", or "control my browser". - Voice triggers (speech-to-text aliases): "show me the browser". +description: Launch GStack Browser — AI-controlled Chromium with the sidebar extension baked in. triggers: - open gstack browser - launch chromium @@ -21,6 +15,16 @@ allowed-tools: + +## When to invoke this skill + +Opens a visible browser window where you can watch every action in real time. +The sidebar shows a live activity feed and chat. Anti-bot stealth built in. +Use when asked to "open gstack browser", "launch browser", "connect chrome", +"open chrome", "real browser", "launch chrome", "side panel", or "control my browser". + +Voice triggers (speech-to-text aliases): "show me the browser". + ## Preamble (run first) ```bash @@ -547,84 +551,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/package.json b/package.json index a8e289a30..84f0cb46b 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gstack", - "version": "1.45.0.0", + "version": "1.46.0.0", "description": "Garry's Stack — Claude Code skills + fast headless browser. 
One repo, one install, entire AI engineering workflow.", "license": "MIT", "type": "module", diff --git a/pair-agent/SKILL.md b/pair-agent/SKILL.md index dd7a51ecd..0f6a75c54 100644 --- a/pair-agent/SKILL.md +++ b/pair-agent/SKILL.md @@ -1,14 +1,7 @@ --- name: pair-agent version: 0.1.0 -description: | - Pair a remote AI agent with your browser. One command generates a setup key and - prints instructions the other agent can follow to connect. Works with OpenClaw, - Hermes, Codex, Cursor, or any agent that can make HTTP requests. The remote agent - gets its own tab with scoped access (read+write by default, admin on request). - Use when asked to "pair agent", "connect agent", "share browser", "remote browser", - "let another agent use my browser", or "give browser access". (gstack) - Voice triggers (speech-to-text aliases): "pair agent", "connect agent", "share my browser", "remote browser access". +description: Pair a remote AI agent with your browser. (gstack) triggers: - pair with agent - connect remote agent @@ -22,6 +15,18 @@ allowed-tools: + +## When to invoke this skill + +One command generates a setup key and +prints instructions the other agent can follow to connect. Works with OpenClaw, +Hermes, Codex, Cursor, or any agent that can make HTTP requests. The remote agent +gets its own tab with scoped access (read+write by default, admin on request). +Use when asked to "pair agent", "connect agent", "share browser", "remote browser", +"let another agent use my browser", or "give browser access". + +Voice triggers (speech-to-text aliases): "pair agent", "connect agent", "share my browser", "remote browser access". + ## Preamble (run first) ```bash @@ -548,84 +553,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index a0b24ef99..e0d269709 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -3,15 +3,7 @@ name: plan-ceo-review preamble-tier: 3 interactive: true version: 1.0.0 -description: | - CEO/founder-mode plan review. Rethink the problem, find the 10-star product, - challenge premises, expand scope when it creates a better product. 
Four modes: - SCOPE EXPANSION (dream big), SELECTIVE EXPANSION (hold scope + cherry-pick - expansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials). - Use when asked to "think bigger", "expand scope", "strategy review", "rethink this", - or "is this ambitious enough". - Proactively suggest when the user is questioning scope or ambition of a plan, - or when the plan feels like it could be thinking bigger. (gstack) +description: CEO/founder-mode plan review. (gstack) benefits-from: [office-hours] allowed-tools: - Read @@ -53,6 +45,18 @@ gbrain: + +## When to invoke this skill + +Rethink the problem, find the 10-star product, +challenge premises, expand scope when it creates a better product. Four modes: +SCOPE EXPANSION (dream big), SELECTIVE EXPANSION (hold scope + cherry-pick +expansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials). +Use when asked to "think bigger", "expand scope", "strategy review", "rethink this", +or "is this ambitious enough". +Proactively suggest when the user is questioning scope or ambition of a plan, +or when the plan feels like it could be thinking bigger. + ## Preamble (run first) ```bash @@ -579,84 +583,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md index 2c95c67e6..32d3e5efc 100644 --- a/plan-design-review/SKILL.md +++ b/plan-design-review/SKILL.md @@ -3,14 +3,7 @@ name: plan-design-review preamble-tier: 3 interactive: true version: 2.0.0 -description: | - Designer's eye plan review — interactive, like CEO and Eng review. - Rates each design dimension 0-10, explains what would make it a 10, - then fixes the plan to get there. 
Works in plan mode. For live site - visual audits, use /design-review. Use when asked to "review the design plan" - or "design critique". - Proactively suggest when the user has a plan with UI/UX components that - should be reviewed before implementation. (gstack) +description: Designer's eye plan review — interactive, like CEO and Eng review. (gstack) allowed-tools: - Read - Edit @@ -26,6 +19,16 @@ triggers: + +## When to invoke this skill + +Rates each design dimension 0-10, explains what would make it a 10, +then fixes the plan to get there. Works in plan mode. For live site +visual audits, use /design-review. Use when asked to "review the design plan" +or "design critique". +Proactively suggest when the user has a plan with UI/UX components that +should be reviewed before implementation. + ## Preamble (run first) ```bash @@ -552,84 +555,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/plan-devex-review/SKILL.md b/plan-devex-review/SKILL.md index 371d07a75..cacd8f76f 100644 --- a/plan-devex-review/SKILL.md +++ b/plan-devex-review/SKILL.md @@ -3,16 +3,7 @@ name: plan-devex-review preamble-tier: 3 interactive: true version: 2.0.0 -description: | - Interactive developer experience plan review. Explores developer personas, - benchmarks against competitors, designs magical moments, and traces friction - points before scoring. 
Three modes: DX EXPANSION (competitive advantage), - DX POLISH (bulletproof every touchpoint), DX TRIAGE (critical gaps only). - Use when asked to "DX review", "developer experience audit", "devex review", - or "API design review". - Proactively suggest when the user has a plan for developer-facing products - (APIs, CLIs, SDKs, libraries, platforms, docs). (gstack) - Voice triggers (speech-to-text aliases): "dx review", "developer experience review", "devex review", "devex audit", "API design review", "onboarding review". +description: Interactive developer experience plan review. (gstack) benefits-from: [office-hours] allowed-tools: - Read @@ -30,6 +21,20 @@ triggers: + +## When to invoke this skill + +Explores developer personas, +benchmarks against competitors, designs magical moments, and traces friction +points before scoring. Three modes: DX EXPANSION (competitive advantage), +DX POLISH (bulletproof every touchpoint), DX TRIAGE (critical gaps only). +Use when asked to "DX review", "developer experience audit", "devex review", +or "API design review". +Proactively suggest when the user has a plan for developer-facing products +(APIs, CLIs, SDKs, libraries, platforms, docs). + +Voice triggers (speech-to-text aliases): "dx review", "developer experience review", "devex review", "devex audit", "API design review", "onboarding review". + ## Preamble (run first) ```bash @@ -556,84 +561,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index a3a064a32..66edf982a 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -3,14 +3,7 @@ name: plan-eng-review preamble-tier: 3 interactive: true version: 1.0.0 -description: | - Eng manager-mode plan review. Lock in the execution plan — architecture, - data flow, diagrams, edge cases, test coverage, performance. 
Walks through - issues interactively with opinionated recommendations. Use when asked to - "review the architecture", "engineering review", or "lock in the plan". - Proactively suggest when the user has a plan or design doc and is about to - start coding — to catch architecture issues before implementation. (gstack) - Voice triggers (speech-to-text aliases): "tech review", "technical review", "plan engineering review". +description: Eng manager-mode plan review. (gstack) benefits-from: [office-hours] allowed-tools: - Read @@ -28,6 +21,18 @@ triggers: + +## When to invoke this skill + +Lock in the execution plan — architecture, +data flow, diagrams, edge cases, test coverage, performance. Walks through +issues interactively with opinionated recommendations. Use when asked to +"review the architecture", "engineering review", or "lock in the plan". +Proactively suggest when the user has a plan or design doc and is about to +start coding — to catch architecture issues before implementation. + +Voice triggers (speech-to-text aliases): "tech review", "technical review", "plan engineering review". + ## Preamble (run first) ```bash @@ -554,84 +559,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/plan-tune/SKILL.md b/plan-tune/SKILL.md index c575ef4f4..f56fb5b43 100644 --- a/plan-tune/SKILL.md +++ b/plan-tune/SKILL.md @@ -2,19 +2,7 @@ name: plan-tune preamble-tier: 2 version: 1.0.0 -description: | - Self-tuning question sensitivity + developer psychographic for gstack (v1: observational). 
- Review which AskUserQuestion prompts fire across gstack skills, set per-question preferences - (never-ask / always-ask / ask-only-for-one-way), inspect the dual-track - profile (what you declared vs what your behavior suggests), and enable/disable - question tuning. Conversational interface — no CLI syntax required. - - Use when asked to "tune questions", "stop asking me that", "too many questions", - "show my profile", "what questions have I been asked", "show my vibe", - "developer profile", or "turn off question tuning". (gstack) - - Proactively suggest when the user says the same gstack question has come up before, - or when they explicitly override a recommendation for the Nth time. +description: Self-tuning question sensitivity + developer psychographic for gstack (v1: observational). (gstack) triggers: - tune questions - stop asking me that @@ -35,6 +23,21 @@ allowed-tools: + +## When to invoke this skill + +Review which AskUserQuestion prompts fire across gstack skills, set per-question preferences +(never-ask / always-ask / ask-only-for-one-way), inspect the dual-track +profile (what you declared vs what your behavior suggests), and enable/disable +question tuning. Conversational interface — no CLI syntax required. + +Use when asked to "tune questions", "stop asking me that", "too many questions", +"show my profile", "what questions have I been asked", "show my vibe", +"developer profile", or "turn off question tuning". + +Proactively suggest when the user says the same gstack question has come up before, +or when they explicitly override a recommendation for the Nth time. + ## Preamble (run first) ```bash @@ -561,84 +564,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/qa-only/SKILL.md b/qa-only/SKILL.md index 3e95cb032..b04749f63 100644 --- a/qa-only/SKILL.md +++ b/qa-only/SKILL.md @@ -2,13 +2,7 @@ name: qa-only preamble-tier: 4 version: 1.0.0 -description: | - Report-only QA testing. Systematically tests a web application and produces a - structured report with health score, screenshots, and repro steps — but never - fixes anything. Use when asked to "just report bugs", "qa report only", or - "test but don't fix". 
For the full test-fix-verify loop, use /qa instead. - Proactively suggest when the user wants a bug report without any code changes. (gstack) - Voice triggers (speech-to-text aliases): "bug report", "just check for bugs". +description: Report-only QA testing. (gstack) allowed-tools: - Bash - Read @@ -23,6 +17,17 @@ triggers: + +## When to invoke this skill + +Systematically tests a web application and produces a +structured report with health score, screenshots, and repro steps — but never +fixes anything. Use when asked to "just report bugs", "qa report only", or +"test but don't fix". For the full test-fix-verify loop, use /qa instead. +Proactively suggest when the user wants a bug report without any code changes. + +Voice triggers (speech-to-text aliases): "bug report", "just check for bugs". + ## Preamble (run first) ```bash @@ -549,84 +554,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/qa/SKILL.md b/qa/SKILL.md index aec716f95..3832f01ac 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -2,16 +2,7 @@ name: qa preamble-tier: 4 version: 2.0.0 -description: | - Systematically QA test a web application and fix bugs found. Runs QA testing, - then iteratively fixes bugs in source code, committing each fix atomically and - re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs", - "test and fix", or "fix what's broken". 
- Proactively suggest when the user says a feature is ready for testing - or asks "does this work?". Three tiers: Quick (critical/high only), - Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores, - fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only. (gstack) - Voice triggers (speech-to-text aliases): "quality check", "test the app", "run QA". +description: Systematically QA test a web application and fix bugs found. (gstack) allowed-tools: - Bash - Read @@ -29,6 +20,20 @@ triggers: + +## When to invoke this skill + +Runs QA testing, +then iteratively fixes bugs in source code, committing each fix atomically and +re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs", +"test and fix", or "fix what's broken". +Proactively suggest when the user says a feature is ready for testing +or asks "does this work?". Three tiers: Quick (critical/high only), +Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores, +fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only. + +Voice triggers (speech-to-text aliases): "quality check", "test the app", "run QA". + ## Preamble (run first) ```bash @@ -555,84 +560,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/retro/SKILL.md b/retro/SKILL.md index f75097669..cffb82cf3 100644 --- a/retro/SKILL.md +++ b/retro/SKILL.md @@ -2,12 +2,7 @@ name: retro preamble-tier: 2 version: 2.0.0 -description: | - Weekly engineering retrospective. Analyzes commit history, work patterns, - and code quality metrics with persistent history and trend tracking. - Team-aware: breaks down per-person contributions with praise and growth areas. 
- Use when asked to "weekly retro", "what did we ship", or "engineering retrospective". - Proactively suggest at the end of a work week or sprint. (gstack) +description: Weekly engineering retrospective. (gstack) allowed-tools: - Bash - Read @@ -41,6 +36,15 @@ gbrain: + +## When to invoke this skill + +Analyzes commit history, work patterns, +and code quality metrics with persistent history and trend tracking. +Team-aware: breaks down per-person contributions with praise and growth areas. +Use when asked to "weekly retro", "what did we ship", or "engineering retrospective". +Proactively suggest at the end of a work week or sprint. + ## Preamble (run first) ```bash @@ -567,84 +571,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead 
letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/review/SKILL.md b/review/SKILL.md index ef9e439c5..fe505daee 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -2,11 +2,7 @@ name: review preamble-tier: 4 version: 1.0.0 -description: | - Pre-landing PR review. Analyzes diff against the base branch for SQL safety, LLM trust - boundary violations, conditional side effects, and other structural issues. Use when - asked to "review this PR", "code review", "pre-landing review", or "check my diff". - Proactively suggest when the user is about to merge or land code changes. (gstack) +description: Pre-landing PR review. (gstack) allowed-tools: - Bash - Read @@ -26,6 +22,14 @@ triggers: + +## When to invoke this skill + +Analyzes diff against the base branch for SQL safety, LLM trust +boundary violations, conditional side effects, and other structural issues. Use when +asked to "review this PR", "code review", "pre-landing review", or "check my diff". +Proactively suggest when the user is about to merge or land code changes. + ## Preamble (run first) ```bash @@ -552,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/scrape/SKILL.md b/scrape/SKILL.md index 7fb04d3f6..024d634e5 100644 --- a/scrape/SKILL.md +++ b/scrape/SKILL.md @@ -1,13 +1,7 @@ --- name: scrape version: 1.0.0 -description: | - Pull data from a web page. First call on a new intent prototypes the flow - via $B primitives and returns JSON. Subsequent calls on a matching intent - route to a codified browser-skill and return in ~200ms. 
Read-only — for - mutating flows (form fills, clicks, submissions), use /automate. - Use when asked to "scrape", "get data from", "pull", "extract from", or - "what's on" a page. (gstack) +description: Pull data from a web page. (gstack) allowed-tools: - Bash - Read @@ -22,6 +16,16 @@ triggers: + +## When to invoke this skill + +First call on a new intent prototypes the flow +via $B primitives and returns JSON. Subsequent calls on a matching intent +route to a codified browser-skill and return in ~200ms. Read-only — for +mutating flows (form fills, clicks, submissions), use /automate. +Use when asked to "scrape", "get data from", "pull", "extract from", or +"what's on" a page. + ## Preamble (run first) ```bash @@ -548,84 +552,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/scripts/capture-baseline.ts b/scripts/capture-baseline.ts new file mode 100644 index 000000000..fa6c7ad33 --- /dev/null +++ b/scripts/capture-baseline.ts @@ -0,0 +1,54 @@ +#!/usr/bin/env bun +/** + * CLI for capturing a parity baseline snapshot. 
+ * + * Usage: + * bun run scripts/capture-baseline.ts # default path + * bun run scripts/capture-baseline.ts --tag v1.44.1 # tag the snapshot + * bun run scripts/capture-baseline.ts --out path/to/baseline.json + * + * The default output path is test/fixtures/parity-baseline-<tag>.json, + * or test/fixtures/parity-baseline-current.json when no tag is given. + */ + +import * as fs from 'fs'; +import * as path from 'path'; +import { captureBaseline } from '../test/helpers/capture-parity-baseline'; + +const ROOT = path.resolve(import.meta.dir, '..'); + +function arg(name: string): string | undefined { + const i = process.argv.indexOf(name); + if (i === -1) return undefined; + return process.argv[i + 1]; +} + +const tag = arg('--tag'); +const outOverride = arg('--out'); +const defaultOut = path.join( + ROOT, + 'test', + 'fixtures', + `parity-baseline-${tag ?? 'current'}.json`, +); +const outPath = outOverride ? path.resolve(outOverride) : defaultOut; + +const baseline = captureBaseline({ repoRoot: ROOT, tag }); + +fs.mkdirSync(path.dirname(outPath), { recursive: true }); +fs.writeFileSync(outPath, JSON.stringify(baseline, null, 2) + '\n'); + +const totalKB = Math.round(baseline.totalCorpusBytes / 1024); +const top3 = baseline.topHeaviest.slice(0, 3); +console.log(`Parity baseline captured: ${outPath}`); +console.log(` tag: ${baseline.tag}`); +console.log(` commit: ${baseline.capturedFromCommit}`); +console.log(` branch: ${baseline.capturedFromBranch}`); +console.log(` skills: ${baseline.totalSkills}`); +console.log(` total corpus: ${totalKB} KB`); +console.log(` catalog tokens: ~${baseline.estTotalCatalogTokens}`); +console.log(` top 3 heaviest:`); +for (const s of top3) { + const kb = Math.round(s.skillMdBytes / 1024); + console.log(` ${s.skill.padEnd(28)} ${kb} KB (${s.skillMdLines} lines, ~${s.estTokens} tokens)`); +} diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index b89aea8b9..30853f677 100644 --- a/scripts/gen-skill-docs.ts +++
b/scripts/gen-skill-docs.ts @@ -16,7 +16,7 @@ import { writeLlmsTxt } from './gen-llms-txt'; import * as fs from 'fs'; import * as path from 'path'; import type { Host, TemplateContext } from './resolvers/types'; -import { HOST_PATHS } from './resolvers/types'; +import { HOST_PATHS, unwrapResolver } from './resolvers/types'; import { RESOLVERS } from './resolvers/index'; import { externalSkillName, extractHookSafetyProse as _extractHookSafetyProse, extractNameAndDescription as _extractNameAndDescription, condenseOpenAIShortDescription as _condenseOpenAIShortDescription, generateOpenAIYaml as _generateOpenAIYaml } from './resolvers/codex-helpers'; import { generatePlanCompletionAuditShip, generatePlanCompletionAuditReview, generatePlanVerificationExec } from './resolvers/review'; @@ -59,6 +59,41 @@ const MODEL_ARG_VAL: Model = (() => { return resolved; })(); +// ─── Catalog Mode (v1.45.0.0 T4) ──────────────────────────── +// 'trim' (default): shorten frontmatter description to lead sentence, +// move routing/voice prose into a "## When to invoke" body section, and +// emit scripts/proactive-suggestions.json (single file across all skills). +// 'full': legacy v1.44 behavior — full description stays in frontmatter. +const CATALOG_MODE_ARG = process.argv.find(a => a.startsWith('--catalog-mode')); +const CATALOG_MODE: 'trim' | 'full' = (() => { + if (!CATALOG_MODE_ARG) return 'trim'; + const val = CATALOG_MODE_ARG.includes('=') + ? CATALOG_MODE_ARG.split('=')[1] + : process.argv[process.argv.indexOf(CATALOG_MODE_ARG) + 1]; + if (val !== 'trim' && val !== 'full') { + throw new Error(`Unknown catalog mode: ${val}. Use 'trim' (default) or 'full'.`); + } + return val; +})(); + +// ─── Explain-level Overlay ────────────────────────────────── +// --explain-level=terse compresses preamble prose (writing-style, completeness, +// confusion-protocol, context-health) to a single pointer line at gen time. 
+// Default keeps the runtime-conditional behavior (sections render unconditionally, +// the model skips them when EXPLAIN_LEVEL: terse appears in the preamble echo). +// Opt-in via the build flag so most users get the runtime-flexible default. +const EXPLAIN_LEVEL_ARG = process.argv.find(a => a.startsWith('--explain-level')); +const EXPLAIN_LEVEL: 'default' | 'terse' = (() => { + if (!EXPLAIN_LEVEL_ARG) return 'default'; + const val = EXPLAIN_LEVEL_ARG.includes('=') + ? EXPLAIN_LEVEL_ARG.split('=')[1] + : process.argv[process.argv.indexOf(EXPLAIN_LEVEL_ARG) + 1]; + if (val !== 'default' && val !== 'terse') { + throw new Error(`Unknown explain level: ${val}. Use 'default' or 'terse'.`); + } + return val; +})(); + // HostPaths, HOST_PATHS, and TemplateContext imported from ./resolvers/types (line 7-8) // Design constants (AI_SLOP_BLACKLIST, OPENAI_HARD_REJECTIONS, OPENAI_LITMUS_CHECKS) // live in ./resolvers/constants and are consumed by resolvers directly. @@ -172,6 +207,169 @@ function processVoiceTriggers(content: string): string { // Export for testing export { extractVoiceTriggers, processVoiceTriggers }; +// ─── Catalog Trim (v1.45.0.0 T4) ───────────────────────────── +// +// Frontmatter `description:` blocks today pack: a one-line outcome, "Use when +// asked to..." voice triggers, "Proactively..." routing guidance, and a +// "(gstack)" tag. This pile is the always-loaded catalog surface — every +// session pays for the full text. The catalog trim splits the description +// into a one-line catalog entry (lead sentence + "(gstack)") that stays in +// the frontmatter, and a "## When to invoke" body section that holds the +// routing/voice triggers prose for in-skill discovery. A registry written +// to scripts/proactive-suggestions.json (one entry per skill) makes routing +// available to agents that need it without paying the always-loaded cost. +// +// Opt-out: `--catalog-mode=full` keeps v1.44 behavior (no trim, full +// description in frontmatter). 
Use when debugging routing regressions or +// when shipping skills to hosts that depend on the legacy fat catalog. + +export interface CatalogParts { + lead: string; // First sentence — kept in catalog + routingProse: string; // "Use when asked to...", "Proactively..." paragraphs + voiceLine: string | null; // "Voice triggers (speech-to-text aliases): ..." line if present + hasGstackTag: boolean; +} + +export function splitCatalogDescription(description: string): CatalogParts { + // Voice triggers line (folded in by processVoiceTriggers earlier) + const voiceMatch = description.match(/Voice triggers \(speech-to-text aliases\):[^\n]+/); + const voiceLine = voiceMatch ? voiceMatch[0] : null; + let working = voiceLine ? description.replace(voiceLine, '').trim() : description.trim(); + + const hasGstackTag = /\(gstack\)/.test(working); + if (hasGstackTag) working = working.replace(/\(gstack\)/, '').trim(); + + // Lead = first sentence (up to first period followed by space or end of string). + // We tolerate sentences with embedded periods (URLs, "v1.45.0.0") by requiring + // the period to be followed by whitespace OR end-of-text. + // First normalize to single-line for sentence detection, then back out. + const collapsed = working.replace(/\s+/g, ' ').trim(); + const sentenceMatch = collapsed.match(/^([^.!?]*[.!?])(?:\s|$)/); + // sentenceLead is the FULL first sentence (no truncation). We compute routing + // from this position, then optionally truncate the displayed lead afterwards. + // Truncating first then computing routing was the v1.45.0.0 bug — when the + // first sentence exceeded 200 chars, the routing extraction would lose the + // entire tail of the description (design-consultation's "Use when..." + // routing prose silently dropped). + const sentenceLead = sentenceMatch ? sentenceMatch[1].trim() : collapsed.split(/\s/).slice(0, 20).join(' '); + + // Routing prose: everything AFTER the first sentence boundary in the collapsed view. 
+ const leadInCollapsed = collapsed.indexOf(sentenceLead); + const routingCollapsed = leadInCollapsed >= 0 + ? collapsed.slice(leadInCollapsed + sentenceLead.length).trim() + : ''; + + // Now produce the displayed lead — truncated if too long. The original + // sentenceLead is preserved for routing extraction below. + let lead = sentenceLead; + if (lead.length > 200) { + const trunc = lead.slice(0, 197); + const lastSpace = trunc.lastIndexOf(' '); + lead = (lastSpace > 60 ? trunc.slice(0, lastSpace) : trunc) + '...'; + } + // Restore line breaks for routing prose by mapping back to original layout. + // Use original whitespace structure where possible; fall back to collapsed. + // Anchor recovery on sentenceLead (the untruncated first sentence) — not + // `lead` (which may have a "..." suffix and won't substring-match `working`). + let routingProse = routingCollapsed; + const collapsedLeadIdx = working.replace(/\s+/g, ' ').indexOf(sentenceLead); + if (collapsedLeadIdx >= 0) { + let consumed = 0; + let cut = 0; + for (let i = 0; i < working.length && consumed < collapsedLeadIdx + sentenceLead.length; i++) { + if (/\s/.test(working[i])) { + if (i === 0 || /\s/.test(working[i - 1])) continue; + consumed += 1; + } else { + consumed += 1; + } + cut = i + 1; + } + const tail = working.slice(cut).trim(); + if (tail.length > 0) routingProse = tail; + } + + return { lead, routingProse, voiceLine, hasGstackTag }; +} + +/** Build the catalog-trimmed `description:` block. */ +export function buildTrimmedDescription(parts: CatalogParts): string { + const lead = parts.lead.trim(); + const suffix = parts.hasGstackTag ? ' (gstack)' : ''; + return `${lead}${suffix}`; +} + +/** Build the body section that holds the routing/voice prose. 
*/ +export function buildWhenToInvokeSection(parts: CatalogParts): string { + const lines: string[] = ['## When to invoke this skill', '']; + if (parts.routingProse) { + lines.push(parts.routingProse); + lines.push(''); + } + if (parts.voiceLine) { + lines.push(parts.voiceLine); + lines.push(''); + } + return lines.join('\n'); +} + +/** + * Apply catalog trim to a SKILL.md body: + * - shorten frontmatter `description:` to lead + (gstack) + * - insert "## When to invoke" body section AFTER the generated header + * (so it lands near the top of body content, where routing guidance + * belongs) + * + * Returns the rewritten content plus the parts (used for proactive-suggestions + * JSON aggregation at the end of the run). + */ +export function applyCatalogTrim(content: string, skillName: string): { content: string; parts: CatalogParts } | null { + // Locate description block in frontmatter + if (!content.startsWith('---\n')) return null; + const fmEnd = content.indexOf('\n---', 4); + if (fmEnd === -1) return null; + const frontmatter = content.slice(4, fmEnd); + + // Match `description: |` block + indented body lines + const descMatch = frontmatter.match(/^description:\s*\|?\s*\n((?:\s{2,}.*(?:\n|$))+)/m) + || frontmatter.match(/^description:\s+(.+)$/m); + if (!descMatch) return null; + + // Extract full description text + let descText: string; + if (descMatch[0].startsWith('description: |') || /^description:\s*\|/.test(descMatch[0])) { + descText = descMatch[1].split('\n').map(l => l.replace(/^\s{2}/, '')).join('\n').trim(); + } else { + descText = descMatch[1].trim(); + } + + // Skip skills with very short descriptions (already trimmed or no routing prose). + // Below ~120 chars, splitting adds no value. + if (descText.length < 120) return null; + + const parts = splitCatalogDescription(descText); + // If lead + (gstack) is already most of the text, no trim needed. 
+ const trimmedLen = buildTrimmedDescription(parts).length; + if (trimmedLen >= descText.length - 20) return null; + + // Replace description in frontmatter — keep trailing newline so the next + // YAML field doesn't collide on the same line as the description value. + const newDesc = buildTrimmedDescription(parts); + const newFrontmatter = frontmatter.replace(descMatch[0], `description: ${newDesc}\n`); + let newContent = '---\n' + newFrontmatter + content.slice(fmEnd); + + // Insert body section after frontmatter (after the closing ---\n and any + // existing GENERATED header). We insert before the first non-comment line. + const bodyStart = newContent.indexOf('\n---\n') + 5; + const whenToInvoke = '\n' + buildWhenToInvokeSection(parts).trim() + '\n'; + // Skip past the generated header if present (it lives after frontmatter close) + const headerMatch = newContent.slice(bodyStart).match(/^(\s*\n)+/); + const insertAt = bodyStart + (headerMatch ? headerMatch[0].length : 0); + newContent = newContent.slice(0, insertAt) + whenToInvoke + '\n' + newContent.slice(insertAt); + + return { content: newContent, parts }; +} + const OPENAI_SHORT_DESCRIPTION_LIMIT = 120; function condenseOpenAIShortDescription(description: string): string { @@ -401,7 +599,7 @@ function processExternalHost( return { content: result, outputPath, outputDir, symlinkLoop }; } -function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: string; content: string; symlinkLoop?: boolean } { +function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: string; content: string; symlinkLoop?: boolean; catalogParts?: CatalogParts | null } { const tmplContent = fs.readFileSync(tmplPath, 'utf-8'); const relTmplPath = path.relative(ROOT, tmplPath); let outputPath = tmplPath.replace(/\.tmpl$/, ''); @@ -430,7 +628,7 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: const interactiveMatch = 
tmplContent.match(/^interactive:\s*(true|false)\s*$/m); const interactive = interactiveMatch ? interactiveMatch[1] === 'true' : undefined; - const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host], preambleTier, model: MODEL_ARG_VAL, interactive }; + const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host], preambleTier, model: MODEL_ARG_VAL, interactive, explainLevel: EXPLAIN_LEVEL }; // Replace placeholders (supports parameterized: {{NAME:arg1:arg2}}) // Config-driven: suppressedResolvers return empty string for this host @@ -441,9 +639,11 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: const resolverName = parts[0]; const args = parts.slice(1); if (suppressed.has(resolverName)) return ''; - const resolver = RESOLVERS[resolverName]; - if (!resolver) throw new Error(`Unknown placeholder {{${resolverName}}} in ${relTmplPath}`); - return args.length > 0 ? resolver(ctx, args) : resolver(ctx); + const entry = RESOLVERS[resolverName]; + if (!entry) throw new Error(`Unknown placeholder {{${resolverName}}} in ${relTmplPath}`); + const { resolve, appliesTo } = unwrapResolver(entry); + if (appliesTo && !appliesTo(ctx)) return ''; + return args.length > 0 ? 
resolve(ctx, args) : resolve(ctx); }); // Check for any remaining unresolved placeholders @@ -483,7 +683,17 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: content = header + content; } - return { outputPath, content, symlinkLoop }; + // Catalog trim (Claude only — external hosts have their own frontmatter shapes) + let catalogParts: CatalogParts | null = null; + if (host === 'claude' && CATALOG_MODE === 'trim') { + const trimmed = applyCatalogTrim(content, skillName); + if (trimmed) { + content = trimmed.content; + catalogParts = trimmed.parts; + } + } + + return { outputPath, content, symlinkLoop, catalogParts }; } // ─── Main ─────────────────────────────────────────────────── @@ -503,6 +713,14 @@ for (const currentHost of hostsToRun) { let hasChanges = false; const tokenBudget: Array<{ skill: string; lines: number; tokens: number }> = []; + // T4 catalog trim: collect routing/voice parts across all Claude skills, + // then write scripts/proactive-suggestions.json once per gen-skill-docs run. + const proactiveAggregate: Record<string, { + lead: string; + routing: string; + voice_line: string | null; + }> = {}; + const currentHostConfig = getHostConfig(currentHost); for (const tmplPath of findTemplates()) { const dir = path.basename(path.dirname(tmplPath)); @@ -516,7 +734,24 @@ for (const currentHost of hostsToRun) { if (currentHostConfig.generation.skipSkills.includes(dir)) continue; } - const { outputPath, content, symlinkLoop } = processTemplate(tmplPath, currentHost); + const { outputPath, content, symlinkLoop, catalogParts } = processTemplate(tmplPath, currentHost); + if (catalogParts) { + // Root-skill detection: when the template lives at ROOT/SKILL.md.tmpl, + // path.basename(path.dirname(tmplPath)) returns the repo's directory + // name (e.g. "seville-v3" in a Conductor worktree, "gstack" on CI). + // That's non-deterministic across machines and breaks CI freshness + // checks. Use the frontmatter `name` field as the registry key — the + // root SKILL.md.tmpl declares `name: gstack` explicitly. 
For all other + // skills, `dir` matches the directory name which matches the + // frontmatter name by convention. + const isRoot = path.dirname(tmplPath) === ROOT; + const key = isRoot ? 'gstack' : dir; + proactiveAggregate[key] = { + lead: catalogParts.lead, + routing: catalogParts.routingProse, + voice_line: catalogParts.voiceLine, + }; + } const relOutput = path.relative(ROOT, outputPath); if (symlinkLoop) { @@ -620,6 +855,40 @@ The orchestrator will persist the plan link to its own memory/knowledge store. failures.push({ host: currentHost, error: new Error('Stale files detected') }); } + // T4 catalog trim: write aggregated proactive-suggestions.json (Claude only). + // The JSON registry lets agents pull voice triggers / routing prose for any + // skill on demand instead of paying for it always-loaded in the catalog. + // + // No timestamp field — keeps the file content-deterministic across runs so + // CI dry-run freshness checks don't flap on regen. If a per-run timestamp + // is ever needed for debugging, write it to a separate `.gen-stamp` file. + if (currentHost === 'claude' && CATALOG_MODE === 'trim' && Object.keys(proactiveAggregate).length > 0 && !DRY_RUN) { + const proactivePath = path.join(ROOT, 'scripts', 'proactive-suggestions.json'); + // Sort keys alphabetically so the serialized JSON is identical across + // machines regardless of filesystem-iteration order. Without this, CI + // freshness checks fail when the local dev machine and CI runner + // discover templates in different orders. + const sortedSkills: typeof proactiveAggregate = {}; + for (const key of Object.keys(proactiveAggregate).sort()) { + sortedSkills[key] = proactiveAggregate[key]; + } + const payload = { + $schema: 'https://gstack.dev/schemas/proactive-suggestions.json', + catalog_mode: 'trim', + note: 'Routing / voice-trigger prose extracted from SKILL.md frontmatter descriptions during catalog trim. 
Loaded on demand when routing guidance is needed.', + skills: sortedSkills, + }; + const serialized = JSON.stringify(payload, null, 2) + '\n'; + // Only write if content actually changed — prevents needless touches that + // would flap CI freshness checks. Read existing file, compare, skip write + // when identical. + let existing = ''; + try { existing = fs.readFileSync(proactivePath, 'utf-8'); } catch { /* first run */ } + if (existing !== serialized) { + fs.writeFileSync(proactivePath, serialized); + } + } + // Print token budget summary if (!DRY_RUN && tokenBudget.length > 0) { tokenBudget.sort((a, b) => b.lines - a.lines); diff --git a/scripts/proactive-suggestions.json b/scripts/proactive-suggestions.json new file mode 100644 index 000000000..f865e396b --- /dev/null +++ b/scripts/proactive-suggestions.json @@ -0,0 +1,267 @@ +{ + "$schema": "https://gstack.dev/schemas/proactive-suggestions.json", + "catalog_mode": "trim", + "note": "Routing / voice-trigger prose extracted from SKILL.md frontmatter descriptions during catalog trim. Loaded on demand when routing guidance is needed.", + "skills": { + "autoplan": { + "lead": "Auto-review pipeline — reads the full CEO, design, eng, and DX review skills from disk and runs them sequentially with auto-decisions using 6 decision principles.", + "routing": "Surfaces\ntaste decisions (close approaches, borderline scope, codex disagreements) at a final\napproval gate. One command, fully reviewed plan out.\nUse when asked to \"auto review\", \"autoplan\", \"run all reviews\", \"review this plan\nautomatically\", or \"make the decisions for me\".\nProactively suggest when the user has a plan file and wants to run the full review\ngauntlet without answering 15-30 intermediate questions.", + "voice_line": "Voice triggers (speech-to-text aliases): \"auto plan\", \"automatic review\"." 
+ }, + "benchmark": { + "lead": "Performance regression detection using the browse daemon.", + "routing": "Establishes\nbaselines for page load times, Core Web Vitals, and resource sizes.\nCompares before/after on every PR. Tracks performance trends over time.\nUse when: \"performance\", \"benchmark\", \"page speed\", \"lighthouse\", \"web vitals\",\n\"bundle size\", \"load time\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"speed test\", \"check performance\"." + }, + "benchmark-models": { + "lead": "Cross-model benchmark for gstack skills.", + "routing": "Runs the same prompt through Claude,\nGPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost,\nand optionally quality via LLM judge. Answers \"which model is actually best\nfor this skill?\" with data instead of vibes. Separate from /benchmark, which\nmeasures web page performance. Use when: \"benchmark models\", \"compare models\",\n\"which model is best for X\", \"cross-model comparison\", \"model shootout\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"compare models\", \"model shootout\", \"which model is best\"." + }, + "browse": { + "lead": "Fast headless browser for QA testing and site dogfooding.", + "routing": "Navigate any URL, interact with\nelements, verify page state, diff before/after actions, take annotated screenshots, check\nresponsive layouts, test forms and uploads, handle dialogs, and assert element states.\n~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a\nuser flow, or file a bug with evidence. Use when asked to \"open in browser\", \"test the\nsite\", \"take a screenshot\", or \"dogfood this\".", + "voice_line": null + }, + "canary": { + "lead": "Post-deploy canary monitoring.", + "routing": "Watches the live app for console errors,\nperformance regressions, and page failures using the browse daemon. Takes\nperiodic screenshots, compares against pre-deploy baselines, and alerts\non anomalies. 
Use when: \"monitor deploy\", \"canary\", \"post-deploy check\",\n\"watch production\", \"verify deploy\".", + "voice_line": null + }, + "careful": { + "lead": "Safety guardrails for destructive commands.", + "routing": "Warns before rm -rf, DROP TABLE,\nforce-push, git reset --hard, kubectl delete, and similar destructive operations.\nUser can override each warning. Use when touching prod, debugging live systems,\nor working in a shared environment. Use when asked to \"be careful\", \"safety mode\",\n\"prod mode\", or \"careful mode\".", + "voice_line": null + }, + "codex": { + "lead": "OpenAI Codex CLI wrapper — three modes.", + "routing": "Code review: independent diff review via\ncodex review with pass/fail gate. Challenge: adversarial mode that tries to break\nyour code. Consult: ask codex anything with session continuity for follow-ups.\nThe \"200 IQ autistic developer\" second opinion. Use when asked to \"codex review\",\n\"codex challenge\", \"ask codex\", \"second opinion\", or \"consult codex\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"code x\", \"code ex\", \"get another opinion\"." + }, + "context-restore": { + "lead": "Restore working context saved earlier by /context-save.", + "routing": "Loads the most recent\nsaved state (across all branches by default) so you can pick up where you\nleft off — even across Conductor workspace handoffs.\nUse when asked to \"resume\", \"restore context\", \"where was I\", or\n\"pick up where I left off\". Pair with /context-save.\nFormerly /checkpoint resume — renamed because Claude Code treats /checkpoint\nas a native rewind alias in current environments.", + "voice_line": null + }, + "context-save": { + "lead": "Save working context.", + "routing": "Captures git state, decisions made, and remaining work\nso any future session can pick up without losing a beat.\nUse when asked to \"save progress\", \"save state\", \"context save\", or\n\"save my work\". 
Pair with /context-restore to resume later.\nFormerly /checkpoint — renamed because Claude Code treats /checkpoint as a\nnative rewind alias in current environments, which was shadowing this skill.", + "voice_line": null + }, + "cso": { + "lead": "Chief Security Officer mode.", + "routing": "Infrastructure-first security audit: secrets archaeology,\ndependency supply chain, CI/CD pipeline security, LLM/AI security, skill supply chain\nscanning, plus OWASP Top 10, STRIDE threat modeling, and active verification.\nTwo modes: daily (zero-noise, 8/10 confidence gate) and comprehensive (monthly deep\nscan, 2/10 bar). Trend tracking across audit runs.\nUse when: \"security audit\", \"threat model\", \"pentest review\", \"OWASP\", \"CSO review\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"see-so\", \"see so\", \"security review\", \"security check\", \"vulnerability scan\", \"run security\"." + }, + "design-consultation": { + "lead": "Design consultation: understands your product, researches the landscape, proposes a complete design system (aesthetic, typography, color, layout, spacing, motion), and generates font+color preview...", + "routing": "Creates DESIGN.md as your project's design source\nof truth. For existing sites, use /plan-design-review to infer the system instead.\nUse when asked to \"design system\", \"brand guidelines\", or \"create DESIGN.md\".\nProactively suggest when starting a new project's UI with no existing\ndesign system or DESIGN.md.", + "voice_line": null + }, + "design-html": { + "lead": "Design finalization: generates production-quality Pretext-native HTML/CSS.", + "routing": "Works with approved mockups from /design-shotgun, CEO plans from /plan-ceo-review,\ndesign review context from /plan-design-review, or from scratch with a user\ndescription. Text actually reflows, heights are computed, layouts are dynamic.\n30KB overhead, zero deps. Smart API routing: picks the right Pretext patterns\nfor each design type. 
Use when: \"finalize this design\", \"turn this into HTML\",\n\"build me a page\", \"implement this design\", or after any planning skill.\nProactively suggest when user has approved a design or has a plan ready.", + "voice_line": "Voice triggers (speech-to-text aliases): \"build the design\", \"code the mockup\", \"make it real\"." + }, + "design-review": { + "lead": "Designer's eye QA: finds visual inconsistency, spacing issues, hierarchy problems, AI slop patterns, and slow interactions — then fixes them.", + "routing": "Iteratively fixes issues\nin source code, committing each fix atomically and re-verifying with before/after\nscreenshots. For plan-mode design review (before implementation), use /plan-design-review.\nUse when asked to \"audit the design\", \"visual QA\", \"check if it looks good\", or \"design polish\".\nProactively suggest when the user mentions visual inconsistencies or\nwants to polish the look of a live site.", + "voice_line": null + }, + "design-shotgun": { + "lead": "Design shotgun: generate multiple AI design variants, open a comparison board, collect structured feedback, and iterate.", + "routing": "Standalone design exploration you can\nrun anytime. Use when: \"explore designs\", \"show me options\", \"design variants\",\n\"visual brainstorm\", or \"I don't like how this looks\".\nProactively suggest when the user describes a UI feature but hasn't seen\nwhat it could look like.", + "voice_line": null + }, + "devex-review": { + "lead": "Live developer experience audit.", + "routing": "Uses the browse tool to actually TEST the\ndeveloper experience: navigates docs, tries the getting started flow, times\nTTHW, screenshots error messages, evaluates CLI help text. Produces a DX\nscorecard with evidence. Compares against /plan-devex-review scores if they\nexist (the boomerang: plan said 3 minutes, reality says 8). Use when asked to\n\"test the DX\", \"DX audit\", \"developer experience test\", or \"try the\nonboarding\". 
Proactively suggest after shipping a developer-facing feature.", + "voice_line": "Voice triggers (speech-to-text aliases): \"dx audit\", \"test the developer experience\", \"try the onboarding\", \"developer experience test\"." + }, + "document-generate": { + "lead": "Generate missing documentation from scratch for a feature, module, or entire project.", + "routing": "Uses the Diataxis framework (tutorial / how-to / reference / explanation) to produce\ncomplete, structured documentation. Can be invoked standalone or called by\n/document-release when it finds coverage gaps. Use when asked to \"write docs\",\n\"generate documentation\", \"document this feature\", \"create a tutorial\", or\n\"explain this module\".", + "voice_line": null + }, + "document-release": { + "lead": "Post-ship documentation update.", + "routing": "Reads all project docs, cross-references the\ndiff, builds a Diataxis coverage map (reference/how-to/tutorial/explanation),\nupdates README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md to match what shipped,\ndetects architecture diagram drift, polishes CHANGELOG voice with a sell-test\nrubric, cleans up TODOS, and optionally bumps VERSION. Surfaces documentation\ndebt in the PR body. Use when asked to \"update the docs\", \"sync documentation\",\nor \"post-ship docs\". Proactively suggest after a PR is merged or code is shipped.", + "voice_line": null + }, + "freeze": { + "lead": "Restrict file edits to a specific directory for the session.", + "routing": "Blocks Edit and\nWrite outside the allowed path. 
Use when debugging to prevent accidentally\n\"fixing\" unrelated code, or when you want to scope changes to one module.\nUse when asked to \"freeze\", \"restrict edits\", \"only edit this folder\",\nor \"lock down edits\".", + "voice_line": null + }, + "gstack": { + "lead": "Fast headless browser for QA testing and site dogfooding.", + "routing": "Navigate pages, interact with\nelements, verify state, diff before/after, take annotated screenshots, test responsive\nlayouts, forms, uploads, dialogs, and capture bug evidence. Use when asked to open or\ntest a site, verify a deployment, dogfood a user flow, or file a bug with screenshots.", + "voice_line": null + }, + "gstack-upgrade": { + "lead": "Upgrade gstack to the latest version.", + "routing": "Detects global vs vendored install,\nruns the upgrade, and shows what's new. Use when asked to \"upgrade gstack\",\n\"update gstack\", or \"get latest version\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"upgrade the tools\", \"update the tools\", \"gee stack upgrade\", \"g stack upgrade\"." + }, + "guard": { + "lead": "Full safety mode: destructive command warnings + directory-scoped edits.", + "routing": "Combines /careful (warns before rm -rf, DROP TABLE, force-push, etc.) with\n/freeze (blocks edits outside a specified directory). Use for maximum safety\nwhen touching prod or debugging live systems. Use when asked to \"guard mode\",\n\"full safety\", \"lock it down\", or \"maximum safety\".", + "voice_line": null + }, + "health": { + "lead": "Code quality dashboard.", + "routing": "Wraps existing project tools (type checker, linter,\ntest runner, dead code detector, shell linter), computes a weighted composite\n0-10 score, and tracks trends over time. 
Use when: \"health check\",\n\"code quality\", \"how healthy is the codebase\", \"run all checks\",\n\"quality score\".", + "voice_line": null + }, + "investigate": { + "lead": "Systematic debugging with root cause investigation.", + "routing": "Four phases: investigate,\nanalyze, hypothesize, implement. Iron Law: no fixes without root cause.\nUse when asked to \"debug this\", \"fix this bug\", \"why is this broken\",\n\"investigate this error\", or \"root cause analysis\".\nProactively invoke this skill (do NOT debug directly) when the user reports\nerrors, 500 errors, stack traces, unexpected behavior, \"it was working\nyesterday\", or is troubleshooting why something stopped working.", + "voice_line": null + }, + "ios-clean": { + "lead": "Remove the DebugBridge SPM package and all #if DEBUG wiring from an iOS app.", + "routing": "Cleans up StateServer, DebugOverlay, accessor codegen output, and\napp-side hooks installed by /ios-qa. This is a convenience wrapper —\nthe structural Release-build guard (Package.swift conditional + CI\nswift build -c release check) is the safety-critical path.\nUse when asked to \"clean the iOS debug bridge\", \"remove DebugBridge\",\nor \"strip the gstack iOS instrumentation\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"clean the iOS debug bridge\", \"remove DebugBridge\", \"strip the gstack iOS instrumentation\"." + }, + "ios-design-review": { + "lead": "Visual design audit for iOS apps on real hardware.", + "routing": "Connects to a real\niPhone via the same StateServer as /ios-qa, screenshots every screen,\nevaluates against Apple HIG, DESIGN.md, and design best practices. Scores\neach dimension 0-10 with \"what would make it a 10\" framing — mirrors\n/plan-design-review for browser. For plan-stage design review (before\nimplementation), use /plan-design-review. 
For live web visual audits, use\n/design-review.\nUse when asked to \"review the iOS design\", \"audit the iPhone app's\nvisuals\", or \"design QA the iOS app\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"review the iOS design\", \"audit the iPhone app's visuals\", \"design QA the iPhone app\"." + }, + "ios-fix": { + "lead": "Autonomous iOS bug fixer.", + "routing": "Takes a bug found by /ios-qa, reads the source,\nwrites the fix, rebuilds, redeploys, and verifies the fix on the real\ndevice. Closes the loop: find bug → fix bug → confirm fix — zero human\nintervention. Captures the pre-bug state snapshot as a regression test\nfixture, so the bug can never recur silently.\nUse when /ios-qa reports a bug and you want it fixed automatically, or\nwhen asked to \"fix this iOS bug\", \"patch the iPhone app\", or \"auto-fix\nthe iOS issue\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"fix the iOS bug\", \"patch the iPhone app\", \"auto-fix the iOS issue\"." + }, + "ios-qa": { + "lead": "Live-device iOS QA for SwiftUI apps.", + "routing": "Connects to a real iPhone via USB\nCoreDevice IPv6 tunnel, reads Swift source to understand every screen, then\nruns a vision-driven agent loop: screenshot → analyze → decide → act →\nverify → repeat. All interaction happens via HTTP to an embedded\nStateServer in the app under test. Optionally exposes the device over\nTailscale so remote agents (OpenClaw, Codex, any HTTP-capable agent) can\nrun iOS QA from anywhere without touching the hardware.\nUse when asked to \"ios qa\", \"test my iPhone app\", \"find bugs on the device\",\nor \"qa the iOS app\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"iOS quality check\", \"test the iPhone app\", \"run iOS QA\"." 
+ }, + "ios-sync": { + "lead": "Regenerate the iOS debug bridge against the latest upstream gstack templates.", + "routing": "Updates StateServer.swift, DebugOverlay.swift, Package.swift,\nand the typed @Observable state accessors. Use after you upgrade gstack\nor add new ViewModels/properties that need accessor coverage.\nUse when asked to \"resync the iOS debug bridge\", \"regenerate iOS\naccessors\", or \"update the gstack iOS instrumentation\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"resync the iOS debug bridge\", \"regenerate iOS accessors\", \"update the gstack iOS instrumentation\"." + }, + "land-and-deploy": { + "lead": "Land and deploy workflow.", + "routing": "Merges the PR, waits for CI and deploy,\nverifies production health via canary checks. Takes over after /ship\ncreates the PR. Use when: \"merge\", \"land\", \"deploy\", \"merge and verify\",\n\"land it\", \"ship it to production\".", + "voice_line": null + }, + "landing-report": { + "lead": "Read-only queue dashboard for workspace-aware ship.", + "routing": "Shows which VERSION slots\nare currently claimed by open PRs, which sibling Conductor workspaces have\nWIP work likely to ship soon, and what slot /ship would pick next. No\nmutations — just a snapshot. Use when asked to \"landing report\", \"what's in\nthe queue\", \"show me open PRs\", or \"which version do I claim next\".", + "voice_line": null + }, + "learn": { + "lead": "Manage project learnings.", + "routing": "Review, search, prune, and export what gstack\nhas learned across sessions. 
Use when asked to \"what have we learned\",\n\"show learnings\", \"prune stale learnings\", or \"export learnings\".\nProactively suggest when the user asks about past patterns or wonders\n\"didn't we fix this before?\"", + "voice_line": null + }, + "make-pdf": { + "lead": "Turn any markdown file into a publication-quality PDF.", + "routing": "Proper 1in margins,\nintelligent page breaks, page numbers, cover pages, running headers, curly\nquotes and em dashes, clickable TOC, diagonal DRAFT watermark. Not a draft\nartifact — a finished artifact. Use when asked to \"make a PDF\", \"export to\nPDF\", \"turn this markdown into a PDF\", or \"generate a document\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"make this a pdf\", \"make it a pdf\", \"export to pdf\", \"turn this into a pdf\", \"turn this markdown into a pdf\", \"generate a pdf\", \"make a pdf from\", \"pdf this markdown\"." + }, + "office-hours": { + "lead": "YC Office Hours — two modes.", + "routing": "Startup mode: six forcing questions that expose\ndemand reality, status quo, desperate specificity, narrowest wedge, observation,\nand future-fit. Builder mode: design thinking brainstorming for side projects,\nhackathons, learning, and open source. 
Saves a design doc.\nUse when asked to \"brainstorm this\", \"I have an idea\", \"help me think through\nthis\", \"office hours\", or \"is this worth building\".\nProactively invoke this skill (do NOT answer directly) when the user describes\na new product idea, asks whether something is worth building, wants to think\nthrough design decisions for something that doesn't exist yet, or is exploring\na concept before any code is written.\nUse before /plan-ceo-review or /plan-eng-review.", + "voice_line": null + }, + "open-gstack-browser": { + "lead": "Launch GStack Browser — AI-controlled Chromium with the sidebar extension baked in.", + "routing": "Opens a visible browser window where you can watch every action in real time.\nThe sidebar shows a live activity feed and chat. Anti-bot stealth built in.\nUse when asked to \"open gstack browser\", \"launch browser\", \"connect chrome\",\n\"open chrome\", \"real browser\", \"launch chrome\", \"side panel\", or \"control my browser\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"show me the browser\"." + }, + "pair-agent": { + "lead": "Pair a remote AI agent with your browser.", + "routing": "One command generates a setup key and\nprints instructions the other agent can follow to connect. Works with OpenClaw,\nHermes, Codex, Cursor, or any agent that can make HTTP requests. The remote agent\ngets its own tab with scoped access (read+write by default, admin on request).\nUse when asked to \"pair agent\", \"connect agent\", \"share browser\", \"remote browser\",\n\"let another agent use my browser\", or \"give browser access\".", + "voice_line": "Voice triggers (speech-to-text aliases): \"pair agent\", \"connect agent\", \"share my browser\", \"remote browser access\"." + }, + "plan-ceo-review": { + "lead": "CEO/founder-mode plan review.", + "routing": "Rethink the problem, find the 10-star product,\nchallenge premises, expand scope when it creates a better product. 
Four modes:\nSCOPE EXPANSION (dream big), SELECTIVE EXPANSION (hold scope + cherry-pick\nexpansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials).\nUse when asked to \"think bigger\", \"expand scope\", \"strategy review\", \"rethink this\",\nor \"is this ambitious enough\".\nProactively suggest when the user is questioning scope or ambition of a plan,\nor when the plan feels like it could be thinking bigger.", + "voice_line": null + }, + "plan-design-review": { + "lead": "Designer's eye plan review — interactive, like CEO and Eng review.", + "routing": "Rates each design dimension 0-10, explains what would make it a 10,\nthen fixes the plan to get there. Works in plan mode. For live site\nvisual audits, use /design-review. Use when asked to \"review the design plan\"\nor \"design critique\".\nProactively suggest when the user has a plan with UI/UX components that\nshould be reviewed before implementation.", + "voice_line": null + }, + "plan-devex-review": { + "lead": "Interactive developer experience plan review.", + "routing": "Explores developer personas,\nbenchmarks against competitors, designs magical moments, and traces friction\npoints before scoring. Three modes: DX EXPANSION (competitive advantage),\nDX POLISH (bulletproof every touchpoint), DX TRIAGE (critical gaps only).\nUse when asked to \"DX review\", \"developer experience audit\", \"devex review\",\nor \"API design review\".\nProactively suggest when the user has a plan for developer-facing products\n(APIs, CLIs, SDKs, libraries, platforms, docs).", + "voice_line": "Voice triggers (speech-to-text aliases): \"dx review\", \"developer experience review\", \"devex review\", \"devex audit\", \"API design review\", \"onboarding review\"." + }, + "plan-eng-review": { + "lead": "Eng manager-mode plan review.", + "routing": "Lock in the execution plan — architecture,\ndata flow, diagrams, edge cases, test coverage, performance. 
Walks through\nissues interactively with opinionated recommendations. Use when asked to\n\"review the architecture\", \"engineering review\", or \"lock in the plan\".\nProactively suggest when the user has a plan or design doc and is about to\nstart coding — to catch architecture issues before implementation.", + "voice_line": "Voice triggers (speech-to-text aliases): \"tech review\", \"technical review\", \"plan engineering review\"." + }, + "plan-tune": { + "lead": "Self-tuning question sensitivity + developer psychographic for gstack (v1: observational).", + "routing": "Review which AskUserQuestion prompts fire across gstack skills, set per-question preferences\n(never-ask / always-ask / ask-only-for-one-way), inspect the dual-track\nprofile (what you declared vs what your behavior suggests), and enable/disable\nquestion tuning. Conversational interface — no CLI syntax required.\n\nUse when asked to \"tune questions\", \"stop asking me that\", \"too many questions\",\n\"show my profile\", \"what questions have I been asked\", \"show my vibe\",\n\"developer profile\", or \"turn off question tuning\". \n\nProactively suggest when the user says the same gstack question has come up before,\nor when they explicitly override a recommendation for the Nth time.", + "voice_line": null + }, + "qa": { + "lead": "Systematically QA test a web application and fix bugs found.", + "routing": "Runs QA testing,\nthen iteratively fixes bugs in source code, committing each fix atomically and\nre-verifying. Use when asked to \"qa\", \"QA\", \"test this site\", \"find bugs\",\n\"test and fix\", or \"fix what's broken\".\nProactively suggest when the user says a feature is ready for testing\nor asks \"does this work?\". Three tiers: Quick (critical/high only),\nStandard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,\nfix evidence, and a ship-readiness summary. 
For report-only mode, use /qa-only.", + "voice_line": "Voice triggers (speech-to-text aliases): \"quality check\", \"test the app\", \"run QA\"." + }, + "qa-only": { + "lead": "Report-only QA testing.", + "routing": "Systematically tests a web application and produces a\nstructured report with health score, screenshots, and repro steps — but never\nfixes anything. Use when asked to \"just report bugs\", \"qa report only\", or\n\"test but don't fix\". For the full test-fix-verify loop, use /qa instead.\nProactively suggest when the user wants a bug report without any code changes.", + "voice_line": "Voice triggers (speech-to-text aliases): \"bug report\", \"just check for bugs\"." + }, + "retro": { + "lead": "Weekly engineering retrospective.", + "routing": "Analyzes commit history, work patterns,\nand code quality metrics with persistent history and trend tracking.\nTeam-aware: breaks down per-person contributions with praise and growth areas.\nUse when asked to \"weekly retro\", \"what did we ship\", or \"engineering retrospective\".\nProactively suggest at the end of a work week or sprint.", + "voice_line": null + }, + "review": { + "lead": "Pre-landing PR review.", + "routing": "Analyzes diff against the base branch for SQL safety, LLM trust\nboundary violations, conditional side effects, and other structural issues. Use when\nasked to \"review this PR\", \"code review\", \"pre-landing review\", or \"check my diff\".\nProactively suggest when the user is about to merge or land code changes.", + "voice_line": null + }, + "scrape": { + "lead": "Pull data from a web page.", + "routing": "First call on a new intent prototypes the flow\nvia $B primitives and returns JSON. Subsequent calls on a matching intent\nroute to a codified browser-skill and return in ~200ms. 
Read-only — for\nmutating flows (form fills, clicks, submissions), use /automate.\nUse when asked to \"scrape\", \"get data from\", \"pull\", \"extract from\", or\n\"what's on\" a page.", + "voice_line": null + }, + "setup-browser-cookies": { + "lead": "Import cookies from your real Chromium browser into the headless browse session.", + "routing": "Opens an interactive picker UI where you select which cookie domains to import.\nUse before QA testing authenticated pages. Use when asked to \"import cookies\",\n\"login to the site\", or \"authenticate the browser\".", + "voice_line": null + }, + "setup-deploy": { + "lead": "Configure deployment settings for /land-and-deploy.", + "routing": "Detects your deploy\nplatform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, custom),\nproduction URL, health check endpoints, and deploy status commands. Writes\nthe configuration to CLAUDE.md so all future deploys are automatic.\nUse when: \"setup deploy\", \"configure deployment\", \"set up land-and-deploy\",\n\"how do I deploy with gstack\", \"add deploy config\".", + "voice_line": null + }, + "setup-gbrain": { + "lead": "Set up gbrain for this coding agent: install the CLI, initialize a local PGLite or Supabase brain, register MCP, capture per-remote trust policy.", + "routing": "One command from zero to \"gbrain is running, and this agent\ncan call it.\" Use when: \"setup gbrain\", \"connect gbrain\", \"start\ngbrain\", \"install gbrain\", \"configure gbrain for this machine\".", + "voice_line": null + }, + "ship": { + "lead": "Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, update CHANGELOG, commit, push, create PR.", + "routing": "Use when asked to \"ship\", \"deploy\",\n\"push to main\", \"create a PR\", \"merge and push\", or \"get it deployed\".\nProactively invoke this skill (do NOT push/PR directly) when the user says code\nis ready, asks about deploying, wants to push code up, or asks to create a PR.", + "voice_line": null + 
}, + "skillify": { + "lead": "Codify the most recent successful /scrape flow into a permanent browser-skill on disk.", + "routing": "Future /scrape calls with the same intent run\nthe codified script in ~200ms instead of re-driving the page. Walks\nback through the conversation, synthesizes script.ts + script.test.ts\n+ fixture, runs the test in a temp dir, and asks before committing.\nUse when asked to \"skillify\", \"codify\", \"save this scrape\", or\n\"make this permanent\".", + "voice_line": null + }, + "sync-gbrain": { + "lead": "Keep gbrain current with this repo's code and refresh agent search guidance in CLAUDE.md. Wraps the gstack-gbrain-sync orchestrator with state", + "routing": "probing, native code-surface registration, capability checks,\nand a verdict block. Re-runnable, idempotent. Use when: \"sync gbrain\",\n\"refresh gbrain\", \"re-index this repo\", \"gbrain search isn't finding\nthings\".", + "voice_line": null + }, + "unfreeze": { + "lead": "Clear the freeze boundary set by /freeze, allowing edits to all directories again.", + "routing": "Use when you want to widen edit scope without ending the session.\nUse when asked to \"unfreeze\", \"unlock edits\", \"remove freeze\", or\n\"allow all edits\".", + "voice_line": null + } + } +} diff --git a/scripts/resolvers/index.ts b/scripts/resolvers/index.ts index c5cbd0445..6502960f9 100644 --- a/scripts/resolvers/index.ts +++ b/scripts/resolvers/index.ts @@ -1,9 +1,20 @@ /** - * RESOLVERS record — maps {{PLACEHOLDER}} names to generator functions. + * RESOLVERS record — maps {{PLACEHOLDER}} names to generator functions + * or gated entries. + * * Each resolver takes a TemplateContext and returns the replacement string. + * Resolvers may be either a bare function (always fires) or a gated entry + * ({ resolve, appliesTo }) where appliesTo can return false to skip the + * resolver for a given skill. See ./types.ts: ResolverEntry. 
+ * + * Most resolvers don't need a gate — the {{NAME}} placeholder system is + * already conditional at the template level (the resolver only fires for + * skills that reference it). Use a gate when you want a structural + * guardrail that says "this placeholder is meaningful only in skills X, Y, Z" + * even if someone later adds {{NAME}} to skill W. */ -import type { TemplateContext, ResolverFn } from './types'; +import type { TemplateContext, ResolverFn, ResolverValue } from './types'; // Domain modules import { generatePreamble } from './preamble'; @@ -24,7 +35,7 @@ import { generateQuestionPreferenceCheck, generateQuestionLog, generateInlineTun import { generateMakePdfSetup } from './make-pdf'; import { generateTasksSectionEmit, generateTasksSectionAggregate } from './tasks-section'; -export const RESOLVERS: Record = { +export const RESOLVERS: Record = { SLUG_EVAL: generateSlugEval, SLUG_SETUP: generateSlugSetup, COMMAND_REFERENCE: generateCommandReference, diff --git a/scripts/resolvers/preamble.ts b/scripts/resolvers/preamble.ts index 97698bfcf..cc3045174 100644 --- a/scripts/resolvers/preamble.ts +++ b/scripts/resolvers/preamble.ts @@ -109,10 +109,10 @@ export function generatePreamble(ctx: TemplateContext): string { ...(tier >= 2 ? [ generateContextRecovery(ctx), generateWritingStyle(ctx), - generateCompletenessSection(), - generateConfusionProtocol(), + generateCompletenessSection(ctx), + generateConfusionProtocol(ctx), generateContinuousCheckpoint(), - generateContextHealth(), + generateContextHealth(ctx), generateQuestionTuning(ctx), ] : []), ...(tier >= 3 ? 
[generateRepoModeSection(), generateSearchBeforeBuildingSection(ctx)] : []), diff --git a/scripts/resolvers/preamble/generate-completeness-section.ts b/scripts/resolvers/preamble/generate-completeness-section.ts index e64dc0cd1..ba9a874bd 100644 --- a/scripts/resolvers/preamble/generate-completeness-section.ts +++ b/scripts/resolvers/preamble/generate-completeness-section.ts @@ -1,6 +1,7 @@ +import type { TemplateContext } from '../types'; - -export function generateCompletenessSection(): string { +export function generateCompletenessSection(ctx?: TemplateContext): string { + if (ctx?.explainLevel === 'terse') return ''; return `## Completeness Principle — Boil the Lake AI makes completeness cheap. Recommend complete lakes (tests, edge cases, error paths); flag oceans (rewrites, multi-quarter migrations). diff --git a/scripts/resolvers/preamble/generate-confusion-protocol.ts b/scripts/resolvers/preamble/generate-confusion-protocol.ts index eb3e820fd..fcbecb75b 100644 --- a/scripts/resolvers/preamble/generate-confusion-protocol.ts +++ b/scripts/resolvers/preamble/generate-confusion-protocol.ts @@ -1,4 +1,7 @@ -export function generateConfusionProtocol(): string { +import type { TemplateContext } from '../types'; + +export function generateConfusionProtocol(ctx?: TemplateContext): string { + if (ctx?.explainLevel === 'terse') return ''; return `## Confusion Protocol For high-stakes ambiguity (architecture, data model, destructive scope, missing context), STOP. Name it in one sentence, present 2-3 options with tradeoffs, and ask. 
Do not use for routine coding or obvious changes.`; diff --git a/scripts/resolvers/preamble/generate-context-health.ts b/scripts/resolvers/preamble/generate-context-health.ts index 019d200d3..88b058406 100644 --- a/scripts/resolvers/preamble/generate-context-health.ts +++ b/scripts/resolvers/preamble/generate-context-health.ts @@ -1,6 +1,7 @@ +import type { TemplateContext } from '../types'; - -export function generateContextHealth(): string { +export function generateContextHealth(ctx?: TemplateContext): string { + if (ctx?.explainLevel === 'terse') return ''; return `## Context Health (soft directive) During long-running skill sessions, periodically write a brief \`[PROGRESS]\` summary: done, next, surprises. diff --git a/scripts/resolvers/preamble/generate-writing-style.ts b/scripts/resolvers/preamble/generate-writing-style.ts index 02701943c..812571ce2 100644 --- a/scripts/resolvers/preamble/generate-writing-style.ts +++ b/scripts/resolvers/preamble/generate-writing-style.ts @@ -1,25 +1,24 @@ -import * as fs from 'fs'; -import * as path from 'path'; import type { TemplateContext } from '../types'; -function loadJargonList(): string[] { - const jargonPath = path.join(__dirname, '..', '..', 'jargon-list.json'); - try { - const raw = fs.readFileSync(jargonPath, 'utf-8'); - const data = JSON.parse(raw); - if (Array.isArray(data?.terms)) return data.terms.filter((t: unknown): t is string => typeof t === 'string'); - } catch { - // Missing or malformed: fall back to empty list. Writing Style block still fires, - // but with no terms to gloss — graceful degradation. +/** + * Writing Style preamble section. + * + * v1.45.0.0 changes (T3): + * - Jargon list is referenced by path, not inlined. The 80-term list was + * duplicated into every tier-2+ skill (~1.5-2 KB × 48 skills = ~80 KB + * across the corpus). The pointer asks the agent to Read the JSON on + * first jargon term encountered — one extra Read per session, but the + * per-corpus payload is ~30 bytes. 
+ * - When `ctx.explainLevel === 'terse'`, the entire section is replaced + * with a one-line pointer. Saves ~1.5 KB per tier-2+ skill in the + * opt-in terse build. + */ +export function generateWritingStyle(ctx: TemplateContext): string { + if (ctx.explainLevel === 'terse') { + return `## Writing Style\n\nTerse mode (build-time): skip jargon glossing, outcome-framing layer, and decision-impact closers. Lead with the answer.\n`; } - return []; -} -export function generateWritingStyle(_ctx: TemplateContext): string { - const terms = loadJargonList(); - const jargonBlock = terms.length > 0 - ? `Jargon list, gloss on first use if the term appears:\n${terms.map(t => `- ${t}`).join('\n')}` - : `Jargon list unavailable. Skip jargon glossing until \`scripts/jargon-list.json\` is restored.`; + const jargonPath = `${ctx.paths.skillRoot}/scripts/jargon-list.json`; return `## Writing Style (skip entirely if \`EXPLAIN_LEVEL: terse\` appears in the preamble echo OR the user's current message explicitly requests terse / no-explanations output) @@ -32,6 +31,6 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -${jargonBlock} +Curated jargon list lives at \`${jargonPath}\` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the \`terms\` array as the canonical list. The list is repo-owned and may grow between releases. `; } diff --git a/scripts/resolvers/types.ts b/scripts/resolvers/types.ts index c8a44425b..40afed160 100644 --- a/scripts/resolvers/types.ts +++ b/scripts/resolvers/types.ts @@ -62,7 +62,56 @@ export interface TemplateContext { preambleTier?: number; // 1-4, controls which preamble sections are included model?: Model; // model family for behavioral overlay. 
Omitted/undefined → no overlay. interactive?: boolean; // true → emit plan-mode handshake in preamble. Generator-only, not written to SKILL.md. + /** + * Build-time compression mode. Defaults to 'default'. + * + * - 'default': full preamble prose ships as today (writing style, completeness, + * confusion protocol, context health are all present). + * - 'terse': writing-style + completeness + confusion-protocol + context-health + * sections are compressed to a one-line pointer at gen time. Saves ~3-5 KB + * per tier-2+ skill. Opt-in via `--explain-level=terse` build flag for + * users who want shipped skills to match their runtime preference and + * avoid the per-session terse-mode prose. + * + * Default builds keep the runtime-conditional behavior intact (Writing Style + * section says "skip entirely if EXPLAIN_LEVEL: terse appears in preamble echo"). + * Terse builds make the compression structural — bytes never ship in the first place. + */ + explainLevel?: 'default' | 'terse'; } /** Resolver function signature. args is populated for parameterized placeholders like {{INVOKE_SKILL:name}}. */ export type ResolverFn = (ctx: TemplateContext, args?: string[]) => string; + +/** + * Optional gated resolver. When the gate returns false, the resolver is + * skipped (substituted with empty string) — same effect as the placeholder + * not being referenced. Use when a resolver's output is only meaningful for + * a known subset of skills, so future template authors get a structural + * guardrail instead of relying on social knowledge. + * + * Most resolvers don't need this — the {{NAME}} placeholder system is + * already conditional at the template level. Use only when a resolver + * lives inside another resolver (e.g. via preamble composition) AND must + * be conditionalized, or when a top-level resolver has a small, well-defined + * audience. 
+ */ +export interface ResolverEntry { + resolve: ResolverFn; + appliesTo?: (ctx: TemplateContext) => boolean; +} + +/** Anything the RESOLVERS map accepts — either a bare function or a gated entry. */ +export type ResolverValue = ResolverFn | ResolverEntry; + +/** + * Type-narrowing helper for the gen-skill-docs lookup. + * Returns (resolverFn, gate) so callers can do gate?.(ctx) before invoking. + */ +export function unwrapResolver(entry: ResolverValue): { + resolve: ResolverFn; + appliesTo?: (ctx: TemplateContext) => boolean; +} { + if (typeof entry === 'function') return { resolve: entry }; + return { resolve: entry.resolve, appliesTo: entry.appliesTo }; +} diff --git a/setup-browser-cookies/SKILL.md b/setup-browser-cookies/SKILL.md index 8b80fd58b..0daff31df 100644 --- a/setup-browser-cookies/SKILL.md +++ b/setup-browser-cookies/SKILL.md @@ -2,11 +2,7 @@ name: setup-browser-cookies preamble-tier: 1 version: 1.0.0 -description: | - Import cookies from your real Chromium browser into the headless browse session. - Opens an interactive picker UI where you select which cookie domains to import. - Use before QA testing authenticated pages. Use when asked to "import cookies", - "login to the site", or "authenticate the browser". (gstack) +description: Import cookies from your real Chromium browser into the headless browse session. (gstack) triggers: - import browser cookies - login to test site @@ -19,6 +15,13 @@ allowed-tools: + +## When to invoke this skill + +Opens an interactive picker UI where you select which cookie domains to import. +Use before QA testing authenticated pages. Use when asked to "import cookies", +"login to the site", or "authenticate the browser". 
+ ## Preamble (run first) ```bash diff --git a/setup-deploy/SKILL.md b/setup-deploy/SKILL.md index 0e09cc491..ee4807f34 100644 --- a/setup-deploy/SKILL.md +++ b/setup-deploy/SKILL.md @@ -2,13 +2,7 @@ name: setup-deploy preamble-tier: 2 version: 1.0.0 -description: | - Configure deployment settings for /land-and-deploy. Detects your deploy - platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, custom), - production URL, health check endpoints, and deploy status commands. Writes - the configuration to CLAUDE.md so all future deploys are automatic. - Use when: "setup deploy", "configure deployment", "set up land-and-deploy", - "how do I deploy with gstack", "add deploy config". +description: Configure deployment settings for /land-and-deploy. triggers: - configure deploy - setup deployment @@ -25,6 +19,16 @@ allowed-tools: + +## When to invoke this skill + +Detects your deploy +platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, custom), +production URL, health check endpoints, and deploy status commands. Writes +the configuration to CLAUDE.md so all future deploys are automatic. +Use when: "setup deploy", "configure deployment", "set up land-and-deploy", +"how do I deploy with gstack", "add deploy config". + ## Preamble (run first) ```bash @@ -551,84 +555,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/setup-gbrain/SKILL.md b/setup-gbrain/SKILL.md index 6a3536d3b..5376b58bb 100644 --- a/setup-gbrain/SKILL.md +++ b/setup-gbrain/SKILL.md @@ -2,12 +2,7 @@ name: setup-gbrain preamble-tier: 2 version: 1.0.0 -description: | - Set up gbrain for this coding agent: install the CLI, initialize a - local PGLite or Supabase brain, register MCP, capture per-remote trust - policy. One command from zero to "gbrain is running, and this agent - can call it." 
Use when: "setup gbrain", "connect gbrain", "start - gbrain", "install gbrain", "configure gbrain for this machine". (gstack) +description: Set up gbrain for this coding agent: install the CLI, initialize a local PGLite or Supabase brain, register MCP, capture per-remote trust policy. (gstack) triggers: - setup gbrain - install gbrain @@ -26,6 +21,13 @@ allowed-tools: + +## When to invoke this skill + +One command from zero to "gbrain is running, and this agent +can call it." Use when: "setup gbrain", "connect gbrain", "start +gbrain", "install gbrain", "configure gbrain for this machine". + ## Preamble (run first) ```bash @@ -552,84 +554,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- 
debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/ship/SKILL.md b/ship/SKILL.md index 38da52874..090798914 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -2,12 +2,7 @@ name: ship preamble-tier: 4 version: 1.0.0 -description: | - Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, - update CHANGELOG, commit, push, create PR. Use when asked to "ship", "deploy", - "push to main", "create a PR", "merge and push", or "get it deployed". - Proactively invoke this skill (do NOT push/PR directly) when the user says code - is ready, asks about deploying, wants to push code up, or asks to create a PR. (gstack) +description: Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, update CHANGELOG, commit, push, create PR. (gstack) allowed-tools: - Bash - Read @@ -27,6 +22,14 @@ triggers: + +## When to invoke this skill + +Use when asked to "ship", "deploy", +"push to main", "create a PR", "merge and push", or "get it deployed". +Proactively invoke this skill (do NOT push/PR directly) when the user says code +is ready, asks about deploying, wants to push code up, or asks to create a PR. + ## Preamble (run first) ```bash @@ -553,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/skillify/SKILL.md b/skillify/SKILL.md index afef0e3a1..cfefe936b 100644 --- a/skillify/SKILL.md +++ b/skillify/SKILL.md @@ -1,14 +1,7 @@ --- name: skillify version: 1.0.0 -description: | - Codify the most recent successful /scrape flow into a permanent - browser-skill on disk. Future /scrape calls with the same intent run - the codified script in ~200ms instead of re-driving the page. 
Walks - back through the conversation, synthesizes script.ts + script.test.ts - + fixture, runs the test in a temp dir, and asks before committing. - Use when asked to "skillify", "codify", "save this scrape", or - "make this permanent". (gstack) +description: Codify the most recent successful /scrape flow into a permanent browser-skill on disk. (gstack) allowed-tools: - Bash - Read @@ -23,6 +16,16 @@ triggers: + +## When to invoke this skill + +Future /scrape calls with the same intent run +the codified script in ~200ms instead of re-driving the page. Walks +back through the conversation, synthesizes script.ts + script.test.ts ++ fixture, runs the test in a temp dir, and asks before committing. +Use when asked to "skillify", "codify", "save this scrape", or +"make this permanent". + ## Preamble (run first) ```bash @@ -549,84 +552,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. 
-Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/sync-gbrain/SKILL.md b/sync-gbrain/SKILL.md index 7b7610ea6..f6b64edb3 100644 --- a/sync-gbrain/SKILL.md +++ b/sync-gbrain/SKILL.md @@ -2,13 +2,7 @@ name: sync-gbrain preamble-tier: 2 version: 1.0.0 -description: | - Keep gbrain current with this repo's code and refresh agent search - guidance in CLAUDE.md. Wraps the gstack-gbrain-sync orchestrator with - state probing, native code-surface registration, capability checks, - and a verdict block. 
Re-runnable, idempotent. Use when: "sync gbrain", - "refresh gbrain", "re-index this repo", "gbrain search isn't finding - things". (gstack) +description: Keep gbrain current with this repo's code and refresh agent search guidance in CLAUDE.md. Wraps the gstack-gbrain-sync orchestrator with state (gstack) triggers: - sync gbrain - refresh gbrain @@ -26,6 +20,14 @@ allowed-tools: + +## When to invoke this skill + +probing, native code-surface registration, capability checks, +and a verdict block. Re-runnable, idempotent. Use when: "sync gbrain", +"refresh gbrain", "re-index this repo", "gbrain search isn't finding +things". + ## Preamble (run first) ```bash @@ -552,84 +554,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter 
queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/test/catalog-mode-full.test.ts b/test/catalog-mode-full.test.ts new file mode 100644 index 000000000..009db33ee --- /dev/null +++ b/test/catalog-mode-full.test.ts @@ -0,0 +1,118 @@ +/** + * Gap B (v1.46.0.0): --catalog-mode=full opt-out behavior. + * + * The catalog trim is the default. The opt-out (`--catalog-mode=full`) + * preserves v1.44 multi-line frontmatter descriptions for users / hosts + * that depend on the legacy fat catalog. Without this test, someone could + * break the conditional `if (host === 'claude' && CATALOG_MODE === 'trim')` + * and silently turn the opt-out path into a no-op — users with the flag + * still get trim'd output, the v1.44 behavior is gone. + * + * Two layers: + * 1. Static: the CATALOG_MODE flag is wired into gen-skill-docs.ts and + * the conditional gate is in the pipeline. + * 2. Smoke: running with --catalog-mode=full produces a frontmatter + * `description: |` block (multi-line) instead of the trim'd one-line + * `description: ...(gstack)` form. + * + * The smoke test mutates the working tree mid-run. It restores the default + * trim'd state in a finally block so a crash mid-test still leaves a clean + * working tree. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import { spawnSync } from 'child_process'; +import * as fs from 'fs'; +import * as path from 'path'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); +const GEN_SKILL_DOCS = path.join(REPO_ROOT, 'scripts', 'gen-skill-docs.ts'); +const SHIP_SKILL = path.join(REPO_ROOT, 'ship', 'SKILL.md'); + +describe('--catalog-mode=full opt-out wiring (static)', () => { + test('CATALOG_MODE_ARG parsing is wired into gen-skill-docs.ts', () => { + const src = fs.readFileSync(GEN_SKILL_DOCS, 'utf-8'); + expect(src).toContain('CATALOG_MODE_ARG'); + expect(src).toContain("a.startsWith('--catalog-mode')"); + }); + + test('CATALOG_MODE accepts only "trim" or "full" — anything else throws', () => { + const src = fs.readFileSync(GEN_SKILL_DOCS, 'utf-8'); + expect(src).toMatch(/val !== 'trim' && val !== 'full'/); + expect(src).toContain('Unknown catalog mode'); + }); + + test('catalog trim only fires when CATALOG_MODE === "trim"', () => { + const src = fs.readFileSync(GEN_SKILL_DOCS, 'utf-8'); + // The applyCatalogTrim call is gated by both host and CATALOG_MODE checks. + expect(src).toMatch(/CATALOG_MODE === 'trim'/); + expect(src).toContain('applyCatalogTrim(content, skillName)'); + }); + + test('default CATALOG_MODE is "trim" (opt-out, not opt-in)', () => { + const src = fs.readFileSync(GEN_SKILL_DOCS, 'utf-8'); + // The const initializer falls back to 'trim' when --catalog-mode is unset. + expect(src).toMatch(/if \(!CATALOG_MODE_ARG\) return 'trim'/); + }); +}); + +describe('--catalog-mode=full opt-out behavior (smoke)', () => { + test('--catalog-mode=full produces multi-line description in frontmatter', () => { + // Save the trim'd state so we can restore it. + const trimmedShip = fs.readFileSync(SHIP_SKILL, 'utf-8'); + expect(trimmedShip).toMatch(/^description: Ship workflow:[^\n]*\(gstack\)\n/m); + + try { + // Run with --catalog-mode=full. Mutates working tree. 
+ const result = spawnSync('bun', ['run', 'gen:skill-docs', '--catalog-mode=full'], { + cwd: REPO_ROOT, + stdio: ['ignore', 'pipe', 'pipe'], + timeout: 60_000, + }); + expect(result.status).toBe(0); + + // After --catalog-mode=full, frontmatter description is the legacy + // multi-line block, not the trim'd one-line form. + const fullShip = fs.readFileSync(SHIP_SKILL, 'utf-8'); + expect(fullShip).toMatch(/^description: \|\s*$/m); // YAML block scalar + // Legacy multi-line content includes "Use when asked to..." in the + // frontmatter (in trim mode this lives in the body section). + const fmEnd = fullShip.indexOf('\n---', 4); + const fm = fullShip.slice(0, fmEnd); + expect(fm).toMatch(/Use when asked to/i); + + // "When to invoke" body section should NOT be present in full mode + // (because the routing prose stayed in frontmatter). + const body = fullShip.slice(fmEnd); + expect(body).not.toContain('## When to invoke this skill'); + } finally { + // Restore default trim state regardless of test outcome. + const restore = spawnSync('bun', ['run', 'gen:skill-docs'], { + cwd: REPO_ROOT, + stdio: ['ignore', 'pipe', 'pipe'], + timeout: 60_000, + }); + if (restore.status !== 0) { + // eslint-disable-next-line no-console + console.error( + 'CRITICAL: failed to restore default trim state. Run `bun run gen:skill-docs` to clean up.', + ); + } + // Sanity-check the restored state matches what we saw at the start. + const restoredShip = fs.readFileSync(SHIP_SKILL, 'utf-8'); + expect(restoredShip).toMatch(/^description: Ship workflow:[^\n]*\(gstack\)\n/m); + } + }, 180_000); + + test('--catalog-mode=invalid throws a clear error', () => { + const result = spawnSync('bun', ['run', 'gen:skill-docs', '--catalog-mode=invalid'], { + cwd: REPO_ROOT, + stdio: ['ignore', 'pipe', 'pipe'], + timeout: 30_000, + }); + expect(result.status).not.toBe(0); + const stderr = result.stderr?.toString() ?? 
''; + expect(stderr).toMatch(/Unknown catalog mode/); + expect(stderr).toMatch(/invalid/); + }); +}); diff --git a/test/catalog-trim.test.ts b/test/catalog-trim.test.ts new file mode 100644 index 000000000..e58678603 --- /dev/null +++ b/test/catalog-trim.test.ts @@ -0,0 +1,313 @@ +/** + * Unit tests for catalog-trim helpers (gen-skill-docs.ts T4 functions). + * + * splitCatalogDescription, buildTrimmedDescription, buildWhenToInvokeSection, + * applyCatalogTrim — these handle every skill's frontmatter rewrite at gen + * time. Two bugs already shipped here: + * + * v1.45.0.0 design-consultation: when the first sentence exceeded 200 chars, + * the routing-prose extraction lost the entire tail. design-consultation's + * "Use when asked to..." silently disappeared from the body section. + * + * v1.45.0.0 CI freshness: the root-skill key leaked the checkout directory + * name ("seville-v3" vs "gstack") and aggregate order was filesystem- + * iteration order. Two machines produced two different JSON files. + * + * Both are regression-tested here. Future bugs in these functions surface as + * unit-test failures before they hit CI or production. + */ + +import { describe, test, expect } from 'bun:test'; +import { + splitCatalogDescription, + buildTrimmedDescription, + buildWhenToInvokeSection, + applyCatalogTrim, +} from '../scripts/gen-skill-docs'; + +describe('splitCatalogDescription', () => { + test('extracts lead sentence + routing prose from simple multi-line description', () => { + const desc = + 'Pre-landing PR review. Analyzes diff against the base branch for SQL safety, LLM trust\n' + + 'boundary violations, conditional side effects, and other structural issues. Use when\n' + + 'asked to "review this PR", "code review", "pre-landing review", or "check my diff".\n' + + 'Proactively suggest when the user is about to merge or land code changes. 
(gstack)'; + + const parts = splitCatalogDescription(desc); + + expect(parts.lead).toBe('Pre-landing PR review.'); + expect(parts.hasGstackTag).toBe(true); + expect(parts.voiceLine).toBeNull(); + expect(parts.routingProse).toContain('Use when'); + expect(parts.routingProse).toContain('Proactively suggest'); + expect(parts.routingProse).toContain('Analyzes diff'); + // (gstack) tag stripped from routingProse + expect(parts.routingProse).not.toContain('(gstack)'); + }); + + test('REGRESSION (design-consultation v1.45.0.0): >200 char first sentence keeps routing', () => { + // This is the exact shape that broke. First sentence (with embedded periods) + // is 207 chars. Original bug: routing extraction ran AFTER lead truncation, + // so collapsed.indexOf(lead) returned -1 (lead ended in "...") and the + // entire "Use when..." + "Proactively..." tail dropped to empty string. + const desc = + 'Design consultation: understands your product, researches the landscape, ' + + 'proposes a complete design system (aesthetic, typography, color, layout, ' + + 'spacing, motion), and generates font+color preview pages. ' + + 'Creates DESIGN.md as your project\'s design source of truth. ' + + 'For existing sites, use /plan-design-review to infer the system instead. ' + + 'Use when asked to "design system", "brand guidelines", or "create DESIGN.md". ' + + 'Proactively suggest when starting a new project\'s UI with no existing ' + + 'design system or DESIGN.md. (gstack)'; + + const parts = splitCatalogDescription(desc); + + // Lead may be truncated with "..." since it exceeds 200 chars + expect(parts.lead.length).toBeLessThanOrEqual(205); + // Critical: routing MUST contain the "Use when..." and "Proactively..." 
prose + expect(parts.routingProse).toContain('Use when asked to'); + expect(parts.routingProse).toContain('design system'); + expect(parts.routingProse).toContain('Proactively suggest'); + expect(parts.routingProse).toContain('Creates DESIGN.md'); + }); + + test('extracts voice-triggers line when present', () => { + const desc = + 'Quick fix. Use when asked to fix the bug. ' + + 'Voice triggers (speech-to-text aliases): "fix it", "patch this", "make it work". ' + + '(gstack)'; + + const parts = splitCatalogDescription(desc); + + expect(parts.lead).toBe('Quick fix.'); + expect(parts.voiceLine).toContain('Voice triggers'); + expect(parts.voiceLine).toContain('"fix it"'); + expect(parts.routingProse).toContain('Use when asked to fix'); + // Voice line should NOT leak into routing + expect(parts.routingProse).not.toContain('speech-to-text'); + }); + + test('handles description without (gstack) tag', () => { + const desc = 'Single sentence description. With routing prose afterward.'; + const parts = splitCatalogDescription(desc); + expect(parts.lead).toBe('Single sentence description.'); + expect(parts.hasGstackTag).toBe(false); + expect(parts.routingProse).toBe('With routing prose afterward.'); + }); + + test('embedded-period descriptions: known limitation falls back to first-20-words', () => { + // KNOWN LIMITATION: the sentence regex `^([^.!?]*[.!?])(?:\\s|$)` can only + // end at the FIRST `.`, `!`, or `?`, because [^.!?]* cannot consume a + // terminator. When that terminator is followed by a non-whitespace char, as in + // "DESIGN.md and v1.45.0.0 in the lead. Use when...", the regex fails entirely + // and the lead falls back to the first 20 words (~the whole short input). + // + // The real-world impact is small: descriptions like "DESIGN.md" or "v1.45" + // appearing in the middle of the FIRST sentence are rare. When they do + // occur, the lead simply becomes the full description (no body section + // generated) — same as a description without a period.
The trim CI gate + // still keeps the per-skill size budget honest. + // + // If this gap matters later, replace the regex with a sentence tokenizer + // (compromise.js / Intl.Segmenter) — until then we accept the fallback. + const desc = + 'Skill that mentions DESIGN.md and v1.45.0.0 in the lead. ' + + 'Use when asked to do something.'; + const parts = splitCatalogDescription(desc); + // Actual behavior: lead absorbs the whole input via the word-count fallback. + expect(parts.lead.length).toBeGreaterThan(0); + // routingProse may be empty when the fallback consumes everything. + // The test exists to detect REGRESSIONS (lead becoming oddly short like + // "Skill that mentions DESIGN.") not to assert ideal behavior. + expect(parts.lead).toContain('Skill that mentions'); + }); + + test('description without a period uses first ~20 words as lead', () => { + const desc = 'A long fragment with no sentence terminator drifting on and on across many words for an unusual frontmatter shape'; + const parts = splitCatalogDescription(desc); + expect(parts.lead.length).toBeGreaterThan(0); + expect(parts.lead.split(/\s+/).length).toBeLessThanOrEqual(21); + }); + + test('idempotent: calling on already-trimmed output returns the same parts', () => { + const desc = 'Already trimmed. (gstack)'; + const parts1 = splitCatalogDescription(desc); + const parts2 = splitCatalogDescription(buildTrimmedDescription(parts1)); + // Re-split of a one-line trimmed result keeps lead identical, routing empty. + expect(parts2.lead).toBe(parts1.lead); + expect(parts2.hasGstackTag).toBe(true); + expect(parts2.routingProse).toBe(''); + }); +}); + +describe('buildTrimmedDescription', () => { + test('appends (gstack) when hasGstackTag is true', () => { + const out = buildTrimmedDescription({ + lead: 'Some lead.', + routingProse: 'routing', + voiceLine: null, + hasGstackTag: true, + }); + expect(out).toBe('Some lead. 
(gstack)'); + }); + + test('omits (gstack) when hasGstackTag is false', () => { + const out = buildTrimmedDescription({ + lead: 'No tag.', + routingProse: '', + voiceLine: null, + hasGstackTag: false, + }); + expect(out).toBe('No tag.'); + }); + + test('trims whitespace from lead', () => { + const out = buildTrimmedDescription({ + lead: ' Lead with whitespace. ', + routingProse: '', + voiceLine: null, + hasGstackTag: true, + }); + expect(out).toBe('Lead with whitespace. (gstack)'); + }); +}); + +describe('buildWhenToInvokeSection', () => { + test('produces markdown H2 with routing prose and voice line', () => { + const out = buildWhenToInvokeSection({ + lead: 'Lead.', + routingProse: 'Use when asked to ship.', + voiceLine: 'Voice triggers (speech-to-text aliases): "ship it".', + hasGstackTag: true, + }); + expect(out).toContain('## When to invoke this skill'); + expect(out).toContain('Use when asked to ship.'); + expect(out).toContain('Voice triggers'); + }); + + test('omits routing block when routingProse is empty', () => { + const out = buildWhenToInvokeSection({ + lead: 'Lead.', + routingProse: '', + voiceLine: null, + hasGstackTag: true, + }); + expect(out).toContain('## When to invoke this skill'); + expect(out).not.toContain('Use when'); + }); + + test('emits even when only voice line is present', () => { + const out = buildWhenToInvokeSection({ + lead: 'Lead.', + routingProse: '', + voiceLine: 'Voice triggers: x.', + hasGstackTag: true, + }); + expect(out).toContain('Voice triggers: x.'); + }); +}); + +describe('applyCatalogTrim', () => { + const minimalSkill = `--- +name: example +description: | + Example skill: this is the first sentence of the description, intended to be + the lead displayed in the catalog. Use when asked to do an example task. + Proactively suggest when the user mentions examples. (gstack) +preamble-tier: 2 +--- + + + +# Example body +Original body content here. 
+`; + + test('rewrites multi-line description into one-line + body section', () => { + const result = applyCatalogTrim(minimalSkill, 'example'); + expect(result).not.toBeNull(); + const { content, parts } = result!; + // Frontmatter description is now ONE line ending with (gstack) + expect(content).toMatch(/^description: Example skill:[^\n]*\(gstack\)\n/m); + // Body has the When to invoke section + expect(content).toContain('## When to invoke this skill'); + expect(content).toContain('Use when asked to do an example task.'); + expect(content).toContain('Proactively suggest when'); + // Original body still present + expect(content).toContain('# Example body'); + expect(content).toContain('Original body content here.'); + // parts is populated for the aggregator + expect(parts.lead).toContain('Example skill'); + expect(parts.hasGstackTag).toBe(true); + }); + + test('returns null for already-short descriptions (no-op)', () => { + const shortSkill = minimalSkill.replace( + /description: \|[\s\S]*?(?=preamble-tier:)/, + 'description: Already short. (gstack)\n', + ); + const result = applyCatalogTrim(shortSkill, 'example'); + expect(result).toBeNull(); + }); + + test('keeps the newline between description and next YAML field (no field collision)', () => { + // Bug shape from v1.45.0.0 first attempt: produced + // `description: ... (gstack)preamble-tier:` with no newline. 
+ const result = applyCatalogTrim(minimalSkill, 'example'); + expect(result).not.toBeNull(); + expect(result!.content).not.toMatch(/\(gstack\)preamble-tier/); + expect(result!.content).not.toMatch(/\(gstack\)allowed-tools/); + expect(result!.content).toMatch(/\(gstack\)\n[a-z-]+:/); + }); + + test('returns null on content without proper frontmatter', () => { + expect(applyCatalogTrim('no frontmatter here', 'whatever')).toBeNull(); + expect(applyCatalogTrim('---\nincomplete frontmatter', 'whatever')).toBeNull(); + }); +}); + +describe('proactive-suggestions.json determinism (regression for v1.45.0.0 CI freshness fail)', () => { + test('committed JSON keys are alphabetically sorted', () => { + // Reads the actual committed file at scripts/proactive-suggestions.json + // and verifies sort order. Catches regressions to non-sorted output. + const fs = require('fs'); + const path = require('path'); + const json = JSON.parse( + fs.readFileSync(path.join(__dirname, '..', 'scripts', 'proactive-suggestions.json'), 'utf-8'), + ); + const keys = Object.keys(json.skills); + const sorted = [...keys].sort(); + expect(keys).toEqual(sorted); + }); + + test('root skill is keyed as "gstack" (not the checkout directory name)', () => { + // Catches the bug where the root SKILL.md.tmpl's catalog parts get + // registered under the directory basename ("seville-v3" in a Conductor + // worktree, "gstack" on CI). + const fs = require('fs'); + const path = require('path'); + const json = JSON.parse( + fs.readFileSync(path.join(__dirname, '..', 'scripts', 'proactive-suggestions.json'), 'utf-8'), + ); + expect(json.skills).toHaveProperty('gstack'); + // The directory the test runs in must NOT appear as a key. 
+ const repoDir = path.basename(path.resolve(__dirname, '..')); + if (repoDir !== 'gstack') { + expect(json.skills).not.toHaveProperty(repoDir); + } + }); + + test('schema + catalog_mode + note fields are stable', () => { + const fs = require('fs'); + const path = require('path'); + const json = JSON.parse( + fs.readFileSync(path.join(__dirname, '..', 'scripts', 'proactive-suggestions.json'), 'utf-8'), + ); + expect(json).toHaveProperty('$schema'); + expect(json.catalog_mode).toBe('trim'); + expect(typeof json.note).toBe('string'); + // No timestamp field — those cause flapping CI freshness checks. + expect(json).not.toHaveProperty('generated_at'); + expect(json).not.toHaveProperty('timestamp'); + }); +}); diff --git a/test/cso-preserved.test.ts b/test/cso-preserved.test.ts new file mode 100644 index 000000000..83fe6bbc8 --- /dev/null +++ b/test/cso-preserved.test.ts @@ -0,0 +1,86 @@ +/** + * cso security-guidance preservation test (v1.45.0.0 T6). + * + * The cso skill carries load-bearing security prose: OWASP Top 10 mappings, + * STRIDE threat-model phrasing, "do not auto-fix without user approval" + * gates. Codex 2nd-pass critique #9: "cso exemption too broad ... should + * still get resolver dedup, catalog trim, sectioning if safe, and targeted + * evals around must-not-miss checks." + * + * This test pins the must-not-miss checks. cso gets the same resolver gate + * (T2), jargon dedup (T3), and catalog trim (T4) as every other skill — but + * its security-guidance body content stays intact. Future compression work + * that would strip this content fails CI here. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); +const CSO_SKILL = path.join(REPO_ROOT, 'cso', 'SKILL.md'); + +const MUST_PRESERVE_PHRASES = [ + // OWASP / STRIDE positioning + 'OWASP', + 'STRIDE', + // Mode discipline + 'daily', + 'comprehensive', + // Severity language + 'confidence', + // Active verification requirement (codex critique: "active verification") + 'verif', // covers "verify", "verification", "verified" +]; + +const MUST_PRESERVE_HEADINGS = [ + '## Preamble', // from PREAMBLE resolver +]; + +describe('cso skill preserves load-bearing security guidance', () => { + test('cso/SKILL.md exists and is non-trivial', () => { + expect(fs.existsSync(CSO_SKILL)).toBe(true); + const content = fs.readFileSync(CSO_SKILL, 'utf-8'); + // cso is a content-heavy security skill; under 30 KB suggests stripping went too far. + expect(content.length).toBeGreaterThan(30_000); + }); + + test('cso preserves required security phrases (case-insensitive)', () => { + const content = fs.readFileSync(CSO_SKILL, 'utf-8').toLowerCase(); + const missing: string[] = []; + for (const phrase of MUST_PRESERVE_PHRASES) { + if (!content.includes(phrase.toLowerCase())) missing.push(phrase); + } + if (missing.length > 0) { + throw new Error( + `cso/SKILL.md is missing required security phrases: ${missing.join(', ')}. ` + + `These are load-bearing for the skill's audit posture. 
If you intentionally ` + + `removed them, update this test with the new phrasing.`, + ); + } + }); + + test('cso preserves required headings', () => { + const content = fs.readFileSync(CSO_SKILL, 'utf-8'); + for (const heading of MUST_PRESERVE_HEADINGS) { + expect(content).toContain(heading); + } + }); + + test('cso catalog trim landed (frontmatter description ≤ 200 chars)', () => { + const content = fs.readFileSync(CSO_SKILL, 'utf-8'); + const fmMatch = content.match(/^---\n([\s\S]*?)\n---/); + expect(fmMatch).not.toBeNull(); + const fm = fmMatch![1]; + const descMatch = fm.match(/^description:\s+(.+)$/m); + expect(descMatch).not.toBeNull(); + const desc = descMatch![1].trim(); + expect(desc.length).toBeLessThanOrEqual(200); + expect(desc).toContain('(gstack)'); + }); + + test('cso routing prose moved to "## When to invoke" body section', () => { + const content = fs.readFileSync(CSO_SKILL, 'utf-8'); + expect(content).toContain('## When to invoke this skill'); + }); +}); diff --git a/test/fixtures/golden/claude-ship-SKILL.md b/test/fixtures/golden/claude-ship-SKILL.md index 38da52874..090798914 100644 --- a/test/fixtures/golden/claude-ship-SKILL.md +++ b/test/fixtures/golden/claude-ship-SKILL.md @@ -2,12 +2,7 @@ name: ship preamble-tier: 4 version: 1.0.0 -description: | - Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, - update CHANGELOG, commit, push, create PR. Use when asked to "ship", "deploy", - "push to main", "create a PR", "merge and push", or "get it deployed". - Proactively invoke this skill (do NOT push/PR directly) when the user says code - is ready, asks about deploying, wants to push code up, or asks to create a PR. (gstack) +description: Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, update CHANGELOG, commit, push, create PR. 
(gstack) allowed-tools: - Bash - Read @@ -27,6 +22,14 @@ triggers: + +## When to invoke this skill + +Use when asked to "ship", "deploy", +"push to main", "create a PR", "merge and push", or "get it deployed". +Proactively invoke this skill (do NOT push/PR directly) when the user says code +is ready, asks about deploying, wants to push code up, or asks to create a PR. + ## Preamble (run first) ```bash @@ -553,84 +556,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at 
`~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/test/fixtures/golden/codex-ship-SKILL.md b/test/fixtures/golden/codex-ship-SKILL.md index d0159842f..5610f747d 100644 --- a/test/fixtures/golden/codex-ship-SKILL.md +++ b/test/fixtures/golden/codex-ship-SKILL.md @@ -542,84 +542,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- memory leak -- GC pause -- heap fragmentation -- stack overflow -- null 
pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `$GSTACK_ROOT/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/test/fixtures/golden/factory-ship-SKILL.md b/test/fixtures/golden/factory-ship-SKILL.md index 9ccdfeeeb..a7426c9ca 100644 --- a/test/fixtures/golden/factory-ship-SKILL.md +++ b/test/fixtures/golden/factory-ship-SKILL.md @@ -544,84 +544,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i - User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section. - Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses. -Jargon list, gloss on first use if the term appears: -- idempotent -- idempotency -- race condition -- deadlock -- cyclomatic complexity -- N+1 -- N+1 query -- backpressure -- memoization -- eventual consistency -- CAP theorem -- CORS -- CSRF -- XSS -- SQL injection -- prompt injection -- DDoS -- rate limit -- throttle -- circuit breaker -- load balancer -- reverse proxy -- SSR -- CSR -- hydration -- tree-shaking -- bundle splitting -- code splitting -- hot reload -- tombstone -- soft delete -- cascade delete -- foreign key -- composite index -- covering index -- OLTP -- OLAP -- sharding -- replication lag -- quorum -- two-phase commit -- saga -- outbox pattern -- inbox pattern -- optimistic locking -- pessimistic locking -- thundering herd -- cache stampede -- bloom filter -- consistent hashing -- virtual DOM -- reconciliation -- closure -- hoisting -- tail call -- GIL -- zero-copy -- mmap -- cold start -- warm start -- green-blue deploy -- canary deploy -- feature flag -- kill switch -- dead letter queue -- fan-out -- fan-in -- debounce -- throttle (UI) -- hydration mismatch -- 
memory leak -- GC pause -- heap fragmentation -- stack overflow -- null pointer -- dangling pointer -- buffer overflow +Curated jargon list lives at `$GSTACK_ROOT/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases. ## Completeness Principle — Boil the Lake diff --git a/test/fixtures/parity-baseline-v1.44.1.json b/test/fixtures/parity-baseline-v1.44.1.json new file mode 100644 index 000000000..2e6d0c6f9 --- /dev/null +++ b/test/fixtures/parity-baseline-v1.44.1.json @@ -0,0 +1,623 @@ +{ + "tag": "v1.44.1", + "capturedAt": "2026-05-26T03:29:32.568Z", + "capturedFromCommit": "74bc8054", + "capturedFromBranch": "garrytan/slim-skill-tokens", + "totalSkills": 51, + "totalCorpusBytes": 2915151, + "estTotalCatalogTokens": 9319, + "topHeaviest": [ + { + "skill": "ship", + "skillMdBytes": 163553, + "skillMdLines": 3094, + "estTokens": 40888, + "tmplBytes": 48869, + "descriptionLen": 557, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "plan-ceo-review", + "skillMdBytes": 130891, + "skillMdLines": 2224, + "estTokens": 32723, + "tmplBytes": 63393, + "descriptionLen": 1326, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "office-hours", + "skillMdBytes": 111088, + "skillMdLines": 2090, + "estTokens": 27772, + "tmplBytes": 55466, + "descriptionLen": 1579, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "plan-design-review", + "skillMdBytes": 105592, + "skillMdLines": 1944, + "estTokens": 26398, + "tmplBytes": 28624, + "descriptionLen": 568, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "plan-devex-review", + "skillMdBytes": 104571, + "skillMdLines": 2145, + "estTokens": 26143, + "tmplBytes": 35680, + "descriptionLen": 886, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "plan-eng-review", + "skillMdBytes": 101409, 
+ "skillMdLines": 1788, + "estTokens": 25352, + "tmplBytes": 26234, + "descriptionLen": 743, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "design-review", + "skillMdBytes": 94055, + "skillMdLines": 1960, + "estTokens": 23514, + "tmplBytes": 11674, + "descriptionLen": 709, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "review", + "skillMdBytes": 92443, + "skillMdLines": 1789, + "estTokens": 23111, + "tmplBytes": 14099, + "descriptionLen": 512, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "land-and-deploy", + "skillMdBytes": 90281, + "skillMdLines": 1883, + "estTokens": 22570, + "tmplBytes": 48624, + "descriptionLen": 378, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "autoplan", + "skillMdBytes": 89274, + "skillMdLines": 1811, + "estTokens": 22319, + "tmplBytes": 45271, + "descriptionLen": 857, + "hasGateEval": true, + "hasPeriodicEval": true + } + ], + "skills": { + "autoplan": { + "skill": "autoplan", + "skillMdBytes": 89274, + "skillMdLines": 1811, + "estTokens": 22319, + "tmplBytes": 45271, + "descriptionLen": 857, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "benchmark": { + "skill": "benchmark", + "skillMdBytes": 32537, + "skillMdLines": 728, + "estTokens": 8134, + "tmplBytes": 9378, + "descriptionLen": 549, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "benchmark-models": { + "skill": "benchmark-models", + "skillMdBytes": 28606, + "skillMdLines": 603, + "estTokens": 7152, + "tmplBytes": 6631, + "descriptionLen": 740, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "browse": { + "skill": "browse", + "skillMdBytes": 47290, + "skillMdLines": 911, + "estTokens": 11823, + "tmplBytes": 10805, + "descriptionLen": 612, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "canary": { + "skill": "canary", + "skillMdBytes": 45502, + "skillMdLines": 1017, + "estTokens": 11376, + "tmplBytes": 8033, + "descriptionLen": 477, + "hasGateEval": 
true, + "hasPeriodicEval": false + }, + "careful": { + "skill": "careful", + "skillMdBytes": 2531, + "skillMdLines": 64, + "estTokens": 633, + "tmplBytes": 2435, + "descriptionLen": 625, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "codex": { + "skill": "codex", + "skillMdBytes": 78018, + "skillMdLines": 1545, + "estTokens": 19505, + "tmplBytes": 34143, + "descriptionLen": 626, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "context-restore": { + "skill": "context-restore", + "skillMdBytes": 39894, + "skillMdLines": 875, + "estTokens": 9974, + "tmplBytes": 5255, + "descriptionLen": 636, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "context-save": { + "skill": "context-save", + "skillMdBytes": 44091, + "skillMdLines": 994, + "estTokens": 11023, + "tmplBytes": 9293, + "descriptionLen": 562, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "cso": { + "skill": "cso", + "skillMdBytes": 75797, + "skillMdLines": 1477, + "estTokens": 18949, + "tmplBytes": 35158, + "descriptionLen": 774, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "design-consultation": { + "skill": "design-consultation", + "skillMdBytes": 76963, + "skillMdLines": 1578, + "estTokens": 19241, + "tmplBytes": 25899, + "descriptionLen": 1201, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "design-html": { + "skill": "design-html", + "skillMdBytes": 64951, + "skillMdLines": 1476, + "estTokens": 16238, + "tmplBytes": 22567, + "descriptionLen": 870, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "design-review": { + "skill": "design-review", + "skillMdBytes": 94055, + "skillMdLines": 1960, + "estTokens": 23514, + "tmplBytes": 11674, + "descriptionLen": 709, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "design-shotgun": { + "skill": "design-shotgun", + "skillMdBytes": 60571, + "skillMdLines": 1327, + "estTokens": 15143, + "tmplBytes": 13331, + "descriptionLen": 1057, + "hasGateEval": false, + "hasPeriodicEval": false + }, + 
"devex-review": { + "skill": "devex-review", + "skillMdBytes": 62815, + "skillMdLines": 1259, + "estTokens": 15704, + "tmplBytes": 7984, + "descriptionLen": 827, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "document-generate": { + "skill": "document-generate", + "skillMdBytes": 51386, + "skillMdLines": 1204, + "estTokens": 12847, + "tmplBytes": 15093, + "descriptionLen": 671, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "document-release": { + "skill": "document-release", + "skillMdBytes": 56652, + "skillMdLines": 1262, + "estTokens": 14163, + "tmplBytes": 20362, + "descriptionLen": 707, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "freeze": { + "skill": "freeze", + "skillMdBytes": 3134, + "skillMdLines": 88, + "estTokens": 784, + "tmplBytes": 3038, + "descriptionLen": 761, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "gstack-upgrade": { + "skill": "gstack-upgrade", + "skillMdBytes": 10794, + "skillMdLines": 280, + "estTokens": 2699, + "tmplBytes": 10667, + "descriptionLen": 439, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "guard": { + "skill": "guard", + "skillMdBytes": 3277, + "skillMdLines": 88, + "estTokens": 819, + "tmplBytes": 3181, + "descriptionLen": 968, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "health": { + "skill": "health", + "skillMdBytes": 46313, + "skillMdLines": 1041, + "estTokens": 11578, + "tmplBytes": 11617, + "descriptionLen": 463, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "investigate": { + "skill": "investigate", + "skillMdBytes": 48810, + "skillMdLines": 1039, + "estTokens": 12203, + "tmplBytes": 11561, + "descriptionLen": 1811, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "ios-clean": { + "skill": "ios-clean", + "skillMdBytes": 39447, + "skillMdLines": 840, + "estTokens": 9862, + "tmplBytes": 3851, + "descriptionLen": 761, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "ios-design-review": { + "skill": "ios-design-review", 
+ "skillMdBytes": 40037, + "skillMdLines": 841, + "estTokens": 10009, + "tmplBytes": 4417, + "descriptionLen": 836, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "ios-fix": { + "skill": "ios-fix", + "skillMdBytes": 39164, + "skillMdLines": 837, + "estTokens": 9791, + "tmplBytes": 3574, + "descriptionLen": 767, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "ios-qa": { + "skill": "ios-qa", + "skillMdBytes": 45677, + "skillMdLines": 957, + "estTokens": 11419, + "tmplBytes": 10090, + "descriptionLen": 875, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "ios-sync": { + "skill": "ios-sync", + "skillMdBytes": 39137, + "skillMdLines": 831, + "estTokens": 9784, + "tmplBytes": 3544, + "descriptionLen": 727, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "land-and-deploy": { + "skill": "land-and-deploy", + "skillMdBytes": 90281, + "skillMdLines": 1883, + "estTokens": 22570, + "tmplBytes": 48624, + "descriptionLen": 378, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "landing-report": { + "skill": "landing-report", + "skillMdBytes": 42382, + "skillMdLines": 901, + "estTokens": 10596, + "tmplBytes": 6806, + "descriptionLen": 512, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "learn": { + "skill": "learn", + "skillMdBytes": 40119, + "skillMdLines": 918, + "estTokens": 10030, + "tmplBytes": 5594, + "descriptionLen": 460, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "make-pdf": { + "skill": "make-pdf", + "skillMdBytes": 28721, + "skillMdLines": 644, + "estTokens": 7180, + "tmplBytes": 5106, + "descriptionLen": 698, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "office-hours": { + "skill": "office-hours", + "skillMdBytes": 111088, + "skillMdLines": 2090, + "estTokens": 27772, + "tmplBytes": 55466, + "descriptionLen": 1579, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "open-gstack-browser": { + "skill": "open-gstack-browser", + "skillMdBytes": 44529, + "skillMdLines": 981, + 
"estTokens": 11132, + "tmplBytes": 7702, + "descriptionLen": 586, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "pair-agent": { + "skill": "pair-agent", + "skillMdBytes": 45339, + "skillMdLines": 1036, + "estTokens": 11335, + "tmplBytes": 8548, + "descriptionLen": 709, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "plan-ceo-review": { + "skill": "plan-ceo-review", + "skillMdBytes": 130891, + "skillMdLines": 2224, + "estTokens": 32723, + "tmplBytes": 63393, + "descriptionLen": 1326, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-design-review": { + "skill": "plan-design-review", + "skillMdBytes": 105592, + "skillMdLines": 1944, + "estTokens": 26398, + "tmplBytes": 28624, + "descriptionLen": 568, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-devex-review": { + "skill": "plan-devex-review", + "skillMdBytes": 104571, + "skillMdLines": 2145, + "estTokens": 26143, + "tmplBytes": 35680, + "descriptionLen": 886, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-eng-review": { + "skill": "plan-eng-review", + "skillMdBytes": 101409, + "skillMdLines": 1788, + "estTokens": 25352, + "tmplBytes": 26234, + "descriptionLen": 743, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-tune": { + "skill": "plan-tune", + "skillMdBytes": 50123, + "skillMdLines": 1105, + "estTokens": 12531, + "tmplBytes": 15586, + "descriptionLen": 997, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "qa": { + "skill": "qa", + "skillMdBytes": 72267, + "skillMdLines": 1648, + "estTokens": 18067, + "tmplBytes": 12701, + "descriptionLen": 814, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "qa-only": { + "skill": "qa-only", + "skillMdBytes": 54819, + "skillMdLines": 1220, + "estTokens": 13705, + "tmplBytes": 3851, + "descriptionLen": 605, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "retro": { + "skill": "retro", + "skillMdBytes": 81286, + "skillMdLines": 1777, + "estTokens": 20322, + "tmplBytes": 
42427, + "descriptionLen": 979, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "review": { + "skill": "review", + "skillMdBytes": 92443, + "skillMdLines": 1789, + "estTokens": 23111, + "tmplBytes": 14099, + "descriptionLen": 512, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "scrape": { + "skill": "scrape", + "skillMdBytes": 42040, + "skillMdLines": 914, + "estTokens": 10510, + "tmplBytes": 5220, + "descriptionLen": 519, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "setup-browser-cookies": { + "skill": "setup-browser-cookies", + "skillMdBytes": 25886, + "skillMdLines": 577, + "estTokens": 6472, + "tmplBytes": 2724, + "descriptionLen": 433, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "setup-deploy": { + "skill": "setup-deploy", + "skillMdBytes": 42326, + "skillMdLines": 946, + "estTokens": 10582, + "tmplBytes": 7780, + "descriptionLen": 564, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "setup-gbrain": { + "skill": "setup-gbrain", + "skillMdBytes": 76791, + "skillMdLines": 1733, + "estTokens": 19198, + "tmplBytes": 42245, + "descriptionLen": 512, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "ship": { + "skill": "ship", + "skillMdBytes": 163553, + "skillMdLines": 3094, + "estTokens": 40888, + "tmplBytes": 48869, + "descriptionLen": 557, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "skillify": { + "skill": "skillify", + "skillMdBytes": 51935, + "skillMdLines": 1196, + "estTokens": 12984, + "tmplBytes": 15107, + "descriptionLen": 571, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "sync-gbrain": { + "skill": "sync-gbrain", + "skillMdBytes": 48555, + "skillMdLines": 1057, + "estTokens": 12139, + "tmplBytes": 13996, + "descriptionLen": 510, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "unfreeze": { + "skill": "unfreeze", + "skillMdBytes": 1482, + "skillMdLines": 46, + "estTokens": 371, + "tmplBytes": 1386, + "descriptionLen": 350, + "hasGateEval": false, + 
"hasPeriodicEval": false + } + } +} diff --git a/test/fixtures/parity-baseline-v1.46.0.0.json b/test/fixtures/parity-baseline-v1.46.0.0.json new file mode 100644 index 000000000..e4daf8245 --- /dev/null +++ b/test/fixtures/parity-baseline-v1.46.0.0.json @@ -0,0 +1,623 @@ +{ + "tag": "v1.46.0.0", + "capturedAt": "2026-05-26T04:17:57.247Z", + "capturedFromCommit": "2aff29e9", + "capturedFromBranch": "garrytan/slim-skill-tokens", + "totalSkills": 51, + "totalCorpusBytes": 2882468, + "estTotalCatalogTokens": 4045, + "topHeaviest": [ + { + "skill": "ship", + "skillMdBytes": 162702, + "skillMdLines": 3020, + "estTokens": 40676, + "tmplBytes": 48869, + "descriptionLen": 291, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "plan-ceo-review", + "skillMdBytes": 130034, + "skillMdLines": 2151, + "estTokens": 32509, + "tmplBytes": 63393, + "descriptionLen": 794, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "office-hours", + "skillMdBytes": 110388, + "skillMdLines": 2020, + "estTokens": 27597, + "tmplBytes": 55466, + "descriptionLen": 860, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "plan-design-review", + "skillMdBytes": 105401, + "skillMdLines": 1882, + "estTokens": 26350, + "tmplBytes": 28624, + "descriptionLen": 218, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "plan-devex-review", + "skillMdBytes": 103713, + "skillMdLines": 2073, + "estTokens": 25928, + "tmplBytes": 35680, + "descriptionLen": 250, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "plan-eng-review", + "skillMdBytes": 100555, + "skillMdLines": 1716, + "estTokens": 25139, + "tmplBytes": 26234, + "descriptionLen": 231, + "hasGateEval": true, + "hasPeriodicEval": true + }, + { + "skill": "design-review", + "skillMdBytes": 93200, + "skillMdLines": 1886, + "estTokens": 23300, + "tmplBytes": 11674, + "descriptionLen": 304, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "review", 
+ "skillMdBytes": 91594, + "skillMdLines": 1716, + "estTokens": 22899, + "tmplBytes": 14099, + "descriptionLen": 205, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "land-and-deploy", + "skillMdBytes": 89432, + "skillMdLines": 1810, + "estTokens": 22358, + "tmplBytes": 48624, + "descriptionLen": 160, + "hasGateEval": true, + "hasPeriodicEval": false + }, + { + "skill": "autoplan", + "skillMdBytes": 88416, + "skillMdLines": 1738, + "estTokens": 22104, + "tmplBytes": 45271, + "descriptionLen": 366, + "hasGateEval": true, + "hasPeriodicEval": true + } + ], + "skills": { + "autoplan": { + "skill": "autoplan", + "skillMdBytes": 88416, + "skillMdLines": 1738, + "estTokens": 22104, + "tmplBytes": 45271, + "descriptionLen": 366, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "benchmark": { + "skill": "benchmark", + "skillMdBytes": 32556, + "skillMdLines": 733, + "estTokens": 8139, + "tmplBytes": 9378, + "descriptionLen": 213, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "benchmark-models": { + "skill": "benchmark-models", + "skillMdBytes": 28623, + "skillMdLines": 608, + "estTokens": 7156, + "tmplBytes": 6631, + "descriptionLen": 217, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "browse": { + "skill": "browse", + "skillMdBytes": 47308, + "skillMdLines": 915, + "estTokens": 11827, + "tmplBytes": 10805, + "descriptionLen": 181, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "canary": { + "skill": "canary", + "skillMdBytes": 44651, + "skillMdLines": 944, + "estTokens": 11163, + "tmplBytes": 8033, + "descriptionLen": 180, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "careful": { + "skill": "careful", + "skillMdBytes": 2551, + "skillMdLines": 68, + "estTokens": 638, + "tmplBytes": 2435, + "descriptionLen": 315, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "codex": { + "skill": "codex", + "skillMdBytes": 77166, + "skillMdLines": 1473, + "estTokens": 19292, + "tmplBytes": 34143, + 
"descriptionLen": 187, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "context-restore": { + "skill": "context-restore", + "skillMdBytes": 39039, + "skillMdLines": 802, + "estTokens": 9760, + "tmplBytes": 5255, + "descriptionLen": 238, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "context-save": { + "skill": "context-save", + "skillMdBytes": 43236, + "skillMdLines": 920, + "estTokens": 10809, + "tmplBytes": 9293, + "descriptionLen": 168, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "cso": { + "skill": "cso", + "skillMdBytes": 74943, + "skillMdLines": 1405, + "estTokens": 18736, + "tmplBytes": 35158, + "descriptionLen": 196, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "design-consultation": { + "skill": "design-consultation", + "skillMdBytes": 76768, + "skillMdLines": 1515, + "estTokens": 19192, + "tmplBytes": 25899, + "descriptionLen": 888, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "design-html": { + "skill": "design-html", + "skillMdBytes": 64093, + "skillMdLines": 1403, + "estTokens": 16023, + "tmplBytes": 22567, + "descriptionLen": 233, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "design-review": { + "skill": "design-review", + "skillMdBytes": 93200, + "skillMdLines": 1886, + "estTokens": 23300, + "tmplBytes": 11674, + "descriptionLen": 304, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "design-shotgun": { + "skill": "design-shotgun", + "skillMdBytes": 60382, + "skillMdLines": 1265, + "estTokens": 15096, + "tmplBytes": 13331, + "descriptionLen": 786, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "devex-review": { + "skill": "devex-review", + "skillMdBytes": 61959, + "skillMdLines": 1187, + "estTokens": 15490, + "tmplBytes": 7984, + "descriptionLen": 201, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "document-generate": { + "skill": "document-generate", + "skillMdBytes": 50533, + "skillMdLines": 1130, + "estTokens": 12633, + "tmplBytes": 15093, + 
"descriptionLen": 334, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "document-release": { + "skill": "document-release", + "skillMdBytes": 55797, + "skillMdLines": 1189, + "estTokens": 13949, + "tmplBytes": 20362, + "descriptionLen": 192, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "freeze": { + "skill": "freeze", + "skillMdBytes": 3154, + "skillMdLines": 92, + "estTokens": 789, + "tmplBytes": 3038, + "descriptionLen": 503, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "gstack-upgrade": { + "skill": "gstack-upgrade", + "skillMdBytes": 10817, + "skillMdLines": 285, + "estTokens": 2704, + "tmplBytes": 10667, + "descriptionLen": 163, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "guard": { + "skill": "guard", + "skillMdBytes": 3297, + "skillMdLines": 91, + "estTokens": 824, + "tmplBytes": 3181, + "descriptionLen": 686, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "health": { + "skill": "health", + "skillMdBytes": 45462, + "skillMdLines": 968, + "estTokens": 11366, + "tmplBytes": 11617, + "descriptionLen": 184, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "investigate": { + "skill": "investigate", + "skillMdBytes": 47955, + "skillMdLines": 966, + "estTokens": 11989, + "tmplBytes": 11561, + "descriptionLen": 1379, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "ios-clean": { + "skill": "ios-clean", + "skillMdBytes": 38591, + "skillMdLines": 767, + "estTokens": 9648, + "tmplBytes": 3851, + "descriptionLen": 252, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "ios-design-review": { + "skill": "ios-design-review", + "skillMdBytes": 39177, + "skillMdLines": 769, + "estTokens": 9794, + "tmplBytes": 4417, + "descriptionLen": 209, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "ios-fix": { + "skill": "ios-fix", + "skillMdBytes": 38306, + "skillMdLines": 765, + "estTokens": 9577, + "tmplBytes": 3574, + "descriptionLen": 187, + "hasGateEval": false, + "hasPeriodicEval": 
false + }, + "ios-qa": { + "skill": "ios-qa", + "skillMdBytes": 44817, + "skillMdLines": 885, + "estTokens": 11204, + "tmplBytes": 10090, + "descriptionLen": 223, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "ios-sync": { + "skill": "ios-sync", + "skillMdBytes": 38283, + "skillMdLines": 758, + "estTokens": 9571, + "tmplBytes": 3544, + "descriptionLen": 269, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "land-and-deploy": { + "skill": "land-and-deploy", + "skillMdBytes": 89432, + "skillMdLines": 1810, + "estTokens": 22358, + "tmplBytes": 48624, + "descriptionLen": 160, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "landing-report": { + "skill": "landing-report", + "skillMdBytes": 41531, + "skillMdLines": 828, + "estTokens": 10383, + "tmplBytes": 6806, + "descriptionLen": 195, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "learn": { + "skill": "learn", + "skillMdBytes": 39268, + "skillMdLines": 845, + "estTokens": 9817, + "tmplBytes": 5594, + "descriptionLen": 178, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "make-pdf": { + "skill": "make-pdf", + "skillMdBytes": 28740, + "skillMdLines": 649, + "estTokens": 7185, + "tmplBytes": 5106, + "descriptionLen": 177, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "office-hours": { + "skill": "office-hours", + "skillMdBytes": 110388, + "skillMdLines": 2020, + "estTokens": 27597, + "tmplBytes": 55466, + "descriptionLen": 860, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "open-gstack-browser": { + "skill": "open-gstack-browser", + "skillMdBytes": 43677, + "skillMdLines": 908, + "estTokens": 10919, + "tmplBytes": 7702, + "descriptionLen": 204, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "pair-agent": { + "skill": "pair-agent", + "skillMdBytes": 44485, + "skillMdLines": 964, + "estTokens": 11121, + "tmplBytes": 8548, + "descriptionLen": 167, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "plan-ceo-review": { + "skill": 
"plan-ceo-review", + "skillMdBytes": 130034, + "skillMdLines": 2151, + "estTokens": 32509, + "tmplBytes": 63393, + "descriptionLen": 794, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-design-review": { + "skill": "plan-design-review", + "skillMdBytes": 105401, + "skillMdLines": 1882, + "estTokens": 26350, + "tmplBytes": 28624, + "descriptionLen": 218, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-devex-review": { + "skill": "plan-devex-review", + "skillMdBytes": 103713, + "skillMdLines": 2073, + "estTokens": 25928, + "tmplBytes": 35680, + "descriptionLen": 250, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-eng-review": { + "skill": "plan-eng-review", + "skillMdBytes": 100555, + "skillMdLines": 1716, + "estTokens": 25139, + "tmplBytes": 26234, + "descriptionLen": 231, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "plan-tune": { + "skill": "plan-tune", + "skillMdBytes": 49263, + "skillMdLines": 1031, + "estTokens": 12316, + "tmplBytes": 15586, + "descriptionLen": 325, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "qa": { + "skill": "qa", + "skillMdBytes": 71409, + "skillMdLines": 1576, + "estTokens": 17852, + "tmplBytes": 12701, + "descriptionLen": 218, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "qa-only": { + "skill": "qa-only", + "skillMdBytes": 53967, + "skillMdLines": 1148, + "estTokens": 13492, + "tmplBytes": 3851, + "descriptionLen": 165, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "retro": { + "skill": "retro", + "skillMdBytes": 80435, + "skillMdLines": 1704, + "estTokens": 20109, + "tmplBytes": 42427, + "descriptionLen": 648, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "review": { + "skill": "review", + "skillMdBytes": 91594, + "skillMdLines": 1716, + "estTokens": 22899, + "tmplBytes": 14099, + "descriptionLen": 205, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "scrape": { + "skill": "scrape", + "skillMdBytes": 41187, + "skillMdLines": 841, 
+ "estTokens": 10297, + "tmplBytes": 5220, + "descriptionLen": 167, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "setup-browser-cookies": { + "skill": "setup-browser-cookies", + "skillMdBytes": 25908, + "skillMdLines": 580, + "estTokens": 6477, + "tmplBytes": 2724, + "descriptionLen": 222, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "setup-deploy": { + "skill": "setup-deploy", + "skillMdBytes": 41473, + "skillMdLines": 873, + "estTokens": 10368, + "tmplBytes": 7780, + "descriptionLen": 197, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "setup-gbrain": { + "skill": "setup-gbrain", + "skillMdBytes": 75940, + "skillMdLines": 1658, + "estTokens": 18985, + "tmplBytes": 42245, + "descriptionLen": 323, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "ship": { + "skill": "ship", + "skillMdBytes": 162702, + "skillMdLines": 3020, + "estTokens": 40676, + "tmplBytes": 48869, + "descriptionLen": 291, + "hasGateEval": true, + "hasPeriodicEval": true + }, + "skillify": { + "skill": "skillify", + "skillMdBytes": 51080, + "skillMdLines": 1122, + "estTokens": 12770, + "tmplBytes": 15107, + "descriptionLen": 233, + "hasGateEval": true, + "hasPeriodicEval": false + }, + "sync-gbrain": { + "skill": "sync-gbrain", + "skillMdBytes": 47702, + "skillMdLines": 982, + "estTokens": 11926, + "tmplBytes": 13996, + "descriptionLen": 299, + "hasGateEval": false, + "hasPeriodicEval": false + }, + "unfreeze": { + "skill": "unfreeze", + "skillMdBytes": 1504, + "skillMdLines": 49, + "estTokens": 376, + "tmplBytes": 1386, + "descriptionLen": 199, + "hasGateEval": false, + "hasPeriodicEval": false + } + } +} diff --git a/test/gen-skill-docs-idempotency.test.ts b/test/gen-skill-docs-idempotency.test.ts new file mode 100644 index 000000000..c1dd9b199 --- /dev/null +++ b/test/gen-skill-docs-idempotency.test.ts @@ -0,0 +1,159 @@ +/** + * Idempotency test for gen-skill-docs (regression for v1.45.0.0 timestamp flap). 
+ * + * Running `bun run gen:skill-docs` twice in a row must produce a no-op on + * the second run: every output file is byte-identical to itself. Without + * this gate, CI freshness checks flap whenever someone introduces a + * timestamp, a random seed, or any other non-deterministic field into a + * generated artifact. + * + * v1.45.0.0 shipped with a `generated_at` ISO timestamp in + * scripts/proactive-suggestions.json that updated every run. CI freshness + * checks failed because the committed file's timestamp never matched the + * latest gen. Fixed in 43e18af4 — this test pins the contract going forward. + * + * The test pays a small cost (~2 gen-skill-docs invocations, ~3s total) but + * catches a class of bugs that's invisible until CI fails. + */ + +import { describe, test, expect } from 'bun:test'; +import { spawnSync } from 'child_process'; +import * as fs from 'fs'; +import * as path from 'path'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); + +/** Files that gen-skill-docs writes and that must be byte-stable across runs. */ +const STABLE_OUTPUTS = [ + 'scripts/proactive-suggestions.json', + 'SKILL.md', + 'ship/SKILL.md', + 'plan-ceo-review/SKILL.md', + 'office-hours/SKILL.md', + 'gstack/llms.txt', +]; + +/** + * Sampled outputs from EVERY non-Claude host. The full host-all run touches + * .agents/, .cursor/, .factory/, .gbrain/, .hermes/, .kiro/, .openclaw/, + * .opencode/, .slate/ — picking one canonical file per host catches per-host + * non-determinism without paying the cost of snapshotting hundreds of files. 
+ */ +const STABLE_HOST_ALL_OUTPUTS = [ + 'scripts/proactive-suggestions.json', + 'SKILL.md', + 'ship/SKILL.md', + '.agents/skills/gstack-ship/SKILL.md', + '.cursor/skills/gstack-ship/SKILL.md', + '.factory/skills/gstack-ship/SKILL.md', + '.gbrain/skills/gstack-ship/SKILL.md', +]; + +function runGen(extraArgs: string[] = []): { exitCode: number; stderr: string } { + const result = spawnSync('bun', ['run', 'gen:skill-docs', ...extraArgs], { + cwd: REPO_ROOT, + stdio: ['ignore', 'pipe', 'pipe'], + timeout: 120_000, + }); + return { + exitCode: result.status ?? -1, + stderr: result.stderr?.toString() ?? '', + }; +} + +function snapshot(files: string[] = STABLE_OUTPUTS): Map<string, string> { + const m = new Map(); + for (const rel of files) { + const full = path.join(REPO_ROOT, rel); + if (fs.existsSync(full)) { + m.set(rel, fs.readFileSync(full, 'utf-8')); + } + } + return m; +} + +describe('gen-skill-docs idempotency', () => { + test('two consecutive runs produce byte-identical outputs (no flapping fields)', () => { + const firstRun = runGen(); + expect(firstRun.exitCode).toBe(0); + + const after1 = snapshot(); + expect(after1.size).toBeGreaterThan(0); + + const secondRun = runGen(); + expect(secondRun.exitCode).toBe(0); + + const after2 = snapshot(); + + // Compare each stable output byte-for-byte. + const flapping: string[] = []; + for (const [file, before] of after1.entries()) { + const now = after2.get(file); + if (now !== before) flapping.push(file); + } + + if (flapping.length > 0) { + throw new Error( + `${flapping.length} file(s) changed between two consecutive gen-skill-docs runs (flapping):\n` + + flapping.map(f => ` - ${f}`).join('\n') + + `\nLikely cause: a non-deterministic field (timestamp, random ID, ` + + `filesystem-iteration order) leaked into the generated output. 
CI freshness ` + + `checks (git diff --exit-code) will fail unpredictably until this is fixed.`, + ); + } + }, 180_000); // ~2 min budget for two gen runs + + test('--dry-run after a fresh gen reports zero stale files', () => { + // Pre-condition: working tree gen must be fresh (idempotency test above ran first). + // If a contributor introduces a non-deterministic field, this dry-run reports STALE. + const result = spawnSync('bun', ['run', 'gen:skill-docs', '--dry-run'], { + cwd: REPO_ROOT, + stdio: ['ignore', 'pipe', 'pipe'], + timeout: 60_000, + }); + expect(result.status).toBe(0); + const stdout = result.stdout?.toString() ?? ''; + // STALE: prefix means a file would change. Count them. + const staleLines = stdout.split('\n').filter(l => l.startsWith('STALE:')); + if (staleLines.length > 0) { + throw new Error( + `--dry-run reports ${staleLines.length} stale file(s) after a fresh gen:\n` + + staleLines.map(l => ` ${l}`).join('\n') + + `\nRun \`bun run gen:skill-docs\` and commit the result.`, + ); + } + }, 90_000); + + test('--host all idempotency: every host output is byte-stable across two runs', () => { + // Gap A: the default test above runs Claude host only. Non-Claude hosts + // (Codex, Factory, Cursor, OpenClaw, GBrain, Slate, OpenCode, Hermes, + // Kiro) have their own output paths and could carry their own + // non-deterministic fields. We hit a "--host all needed for freshness + // check" mid-/ship; this test pins the contract across every host. 
+ const firstRun = runGen(['--host', 'all']); + expect(firstRun.exitCode).toBe(0); + + const after1 = snapshot(STABLE_HOST_ALL_OUTPUTS); + expect(after1.size).toBeGreaterThan(0); + + const secondRun = runGen(['--host', 'all']); + expect(secondRun.exitCode).toBe(0); + + const after2 = snapshot(STABLE_HOST_ALL_OUTPUTS); + + const flapping: string[] = []; + for (const [file, before] of after1.entries()) { + const now = after2.get(file); + if (now !== before) flapping.push(file); + } + + if (flapping.length > 0) { + throw new Error( + `${flapping.length} file(s) changed between two consecutive --host all gen runs:\n` + + flapping.map(f => ` - ${f}`).join('\n') + + `\nLikely cause: a non-deterministic field leaked into a non-Claude host adapter ` + + `(scripts/host-adapters/*.ts). CI freshness checks for that host will flap.`, + ); + } + }, 300_000); // ~5 min budget for two host-all runs +}); diff --git a/test/helpers/budget-override.test.ts b/test/helpers/budget-override.test.ts new file mode 100644 index 000000000..e420c3892 --- /dev/null +++ b/test/helpers/budget-override.test.ts @@ -0,0 +1,116 @@ +/** + * Unit tests for budget-override audit logger. + * + * The audit trail is the only check on `EVALS_BUDGET_OVERRIDE_REASON` and + * `GSTACK_SIZE_BUDGET_OVERRIDE_REASON` — if the logger silently drops events, + * overrides become invisible and the budget gates are theater. These tests + * pin the contract: every override produces exactly one JSONL line with + * timestamp + scope + reason + CI provenance. 
+ */ + +import { describe, test, expect, beforeEach } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import * as os from 'os'; +import { logBudgetOverride } from './budget-override'; + +const TMP_HOME = fs.mkdtempSync(path.join(os.tmpdir(), 'budget-override-test-')); +process.env.GSTACK_HOME = TMP_HOME; +const AUDIT_PATH = path.join(TMP_HOME, 'analytics', 'spend-overrides.jsonl'); + +describe('logBudgetOverride', () => { + beforeEach(() => { + // Start each test with a clean audit file + try { fs.unlinkSync(AUDIT_PATH); } catch { /* doesn't exist */ } + }); + + test('writes one JSONL line per call with required fields', () => { + logBudgetOverride({ + scope: 'evals-cost-cap-e2e', + reason: 'model price went up, will rebase the cap next sprint', + details: { tier: 'e2e', cap: 25, observed_cost_usd: 31.4 }, + }); + + expect(fs.existsSync(AUDIT_PATH)).toBe(true); + const lines = fs.readFileSync(AUDIT_PATH, 'utf-8').split('\n').filter(Boolean); + expect(lines.length).toBe(1); + const entry = JSON.parse(lines[0]!); + expect(entry.scope).toBe('evals-cost-cap-e2e'); + expect(entry.reason).toBe('model price went up, will rebase the cap next sprint'); + expect(entry.details).toEqual({ tier: 'e2e', cap: 25, observed_cost_usd: 31.4 }); + expect(typeof entry.timestamp).toBe('string'); + expect(entry.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/); + }); + + test('captures CI provenance when CI env is set', () => { + process.env.CI = 'true'; + process.env.GITHUB_ACTIONS = 'true'; + process.env.GITHUB_REF_NAME = 'feature/x'; + process.env.GITHUB_SHA = 'deadbeefcafe1234'; + + logBudgetOverride({ scope: 'skill-size-budget', reason: 'big diff bake-in' }); + + const entry = JSON.parse(fs.readFileSync(AUDIT_PATH, 'utf-8').trim()); + expect(entry.ci).toBe(true); + expect(entry.runner).toBe('github-actions'); + expect(entry.branch).toBe('feature/x'); + expect(entry.commit).toBe('deadbeef'); + + delete process.env.CI; + delete process.env.GITHUB_ACTIONS; + delete 
process.env.GITHUB_REF_NAME;
+    delete process.env.GITHUB_SHA;
+  });
+
+  test('defaults provenance to local when CI is unset', () => {
+    delete process.env.CI;
+    delete process.env.GITHUB_ACTIONS;
+    delete process.env.GITHUB_REF_NAME;
+    delete process.env.GITHUB_SHA;
+    delete process.env.CI_RUNNER;
+    delete process.env.CI_COMMIT_REF_NAME;
+    delete process.env.CI_COMMIT_SHORT_SHA;
+
+    logBudgetOverride({ scope: 'skill-size-budget-corpus', reason: 'local dev test' });
+
+    const entry = JSON.parse(fs.readFileSync(AUDIT_PATH, 'utf-8').trim());
+    expect(entry.ci).toBe(false);
+    expect(entry.runner).toBe('local');
+    expect(entry.branch).toBe('unknown');
+    expect(entry.commit).toBe('unknown');
+  });
+
+  test('append-only: multiple calls produce multiple lines', () => {
+    logBudgetOverride({ scope: 's1', reason: 'r1' });
+    logBudgetOverride({ scope: 's2', reason: 'r2' });
+    logBudgetOverride({ scope: 's3', reason: 'r3' });
+
+    const lines = fs.readFileSync(AUDIT_PATH, 'utf-8').split('\n').filter(Boolean);
+    expect(lines.length).toBe(3);
+    const scopes = lines.map(l => JSON.parse(l).scope);
+    expect(scopes).toEqual(['s1', 's2', 's3']);
+  });
+
+  test('defaults details to an empty object when entry.details is absent', () => {
+    logBudgetOverride({ scope: 'plain', reason: 'no details' });
+    const entry = JSON.parse(fs.readFileSync(AUDIT_PATH, 'utf-8').trim());
+    expect(entry.details).toEqual({});
+  });
+
+  test('never throws even when audit directory is missing — creates it', () => {
+    // Remove the analytics dir to force mkdir
+    try { fs.rmSync(path.join(TMP_HOME, 'analytics'), { recursive: true, force: true }); } catch { /* */ }
+    expect(() => logBudgetOverride({ scope: 'recreate', reason: 'test' })).not.toThrow();
+    expect(fs.existsSync(AUDIT_PATH)).toBe(true);
+  });
+
+  test('survives an unwritable audit path (logs warning, does not throw)', () => {
+    // Point GSTACK_HOME at a regular file so the analytics directory cannot be created under it
+    const originalHome = process.env.GSTACK_HOME;
+    const bogusFile = path.join(TMP_HOME, 'not-a-dir.txt');
+    fs.writeFileSync(bogusFile, 'just a file');
+    process.env.GSTACK_HOME = bogusFile;
+    expect(() => logBudgetOverride({ scope: 'unwritable', reason: 'fs error path' })).not.toThrow();
+    process.env.GSTACK_HOME = originalHome;
+  });
+});
diff --git a/test/helpers/budget-override.ts b/test/helpers/budget-override.ts
new file mode 100644
index 000000000..fefe2c790
--- /dev/null
+++ b/test/helpers/budget-override.ts
@@ -0,0 +1,50 @@
+/**
+ * Budget override audit trail (v1.45.0.0 T5).
+ *
+ * Records uses of GSTACK_SIZE_BUDGET_OVERRIDE_REASON or
+ * EVALS_BUDGET_OVERRIDE_REASON so a reviewer can see what was waived,
+ * by whom, and why. Append-only JSONL at ~/.gstack/analytics/spend-overrides.jsonl.
+ *
+ * Why audit: a hard cap with no escape valve becomes operationally hostile
+ * (legit price changes, longer transcripts, new required evals can all
+ * blow the cap). An escape valve with no audit becomes "everyone overrides
+ * everything and we lose the gate." This module is the audit half.
+ */
+
+import * as fs from 'fs';
+import * as path from 'path';
+import * as os from 'os';
+
+export interface BudgetOverrideEntry {
+  scope: string; // e.g. 'skill-size-budget', 'evals-cost-cap'
+  reason: string; // user-supplied REASON env var
+  details?: Record<string, unknown>; // numbers / regressions
+}
+
+function getAuditPath(): string {
+  const base = process.env.GSTACK_HOME || path.join(os.homedir(), '.gstack');
+  return path.join(base, 'analytics', 'spend-overrides.jsonl');
+}
+
+export function logBudgetOverride(entry: BudgetOverrideEntry): void {
+  try {
+    const auditPath = getAuditPath();
+    fs.mkdirSync(path.dirname(auditPath), { recursive: true });
+    const line = JSON.stringify({
+      timestamp: new Date().toISOString(),
+      scope: entry.scope,
+      reason: entry.reason,
+      details: entry.details ?? {},
+      // Capture provenance: who/where/which CI ran
+      ci: process.env.CI === 'true',
+      runner: process.env.GITHUB_ACTIONS ?
'github-actions' : process.env.CI_RUNNER || 'local', + branch: process.env.GITHUB_REF_NAME || process.env.CI_COMMIT_REF_NAME || 'unknown', + commit: process.env.GITHUB_SHA?.slice(0, 8) || process.env.CI_COMMIT_SHORT_SHA || 'unknown', + }) + '\n'; + fs.appendFileSync(auditPath, line); + } catch (err) { + // Best-effort logging; don't fail the test on audit-write errors. + // eslint-disable-next-line no-console + console.warn(`[budget-override] could not write audit log: ${(err as Error).message}`); + } +} diff --git a/test/helpers/capture-parity-baseline.test.ts b/test/helpers/capture-parity-baseline.test.ts new file mode 100644 index 000000000..ba00c84cc --- /dev/null +++ b/test/helpers/capture-parity-baseline.test.ts @@ -0,0 +1,90 @@ +/** + * Unit tests for parity baseline capture. + * + * Free. Reads the live repo state via captureBaseline() and asserts + * shape + invariants, not specific numbers (which drift release-over-release). + */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import { captureBaseline, diffBaselines, type ParityBaseline } from './capture-parity-baseline'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..', '..'); + +describe('capture-parity-baseline', () => { + test('produces a shaped baseline for the current repo', () => { + const baseline = captureBaseline({ repoRoot: REPO_ROOT, tag: 'unit-test' }); + expect(baseline.tag).toBe('unit-test'); + expect(baseline.totalSkills).toBeGreaterThan(20); + expect(baseline.totalCorpusBytes).toBeGreaterThan(100_000); + expect(baseline.topHeaviest.length).toBeGreaterThan(0); + expect(baseline.topHeaviest.length).toBeLessThanOrEqual(10); + expect(baseline.topHeaviest[0]!.skillMdBytes).toBeGreaterThan(0); + // Top 1 should be ≥ Top 2 (sort invariant) + if (baseline.topHeaviest.length >= 2) { + expect(baseline.topHeaviest[0]!.skillMdBytes).toBeGreaterThanOrEqual( + baseline.topHeaviest[1]!.skillMdBytes, + ); + } + }); + + test('each 
skill entry has byte + line + token estimates', () => { + const baseline = captureBaseline({ repoRoot: REPO_ROOT }); + for (const skill of Object.values(baseline.skills)) { + expect(skill.skillMdBytes).toBeGreaterThan(0); + expect(skill.skillMdLines).toBeGreaterThan(0); + expect(skill.estTokens).toBeGreaterThan(0); + // ~4 chars/token heuristic + expect(skill.estTokens).toBeCloseTo(skill.skillMdBytes / 4, -2); + } + }); + + test('diffBaselines returns expected deltas', () => { + const before: ParityBaseline = { + tag: 'before', + capturedAt: '2026-01-01T00:00:00Z', + capturedFromCommit: 'abc', + capturedFromBranch: 'main', + totalSkills: 2, + totalCorpusBytes: 1000, + estTotalCatalogTokens: 100, + topHeaviest: [], + skills: { + foo: { skill: 'foo', skillMdBytes: 600, skillMdLines: 10, estTokens: 150, tmplBytes: 300, descriptionLen: 50, hasGateEval: true, hasPeriodicEval: false }, + bar: { skill: 'bar', skillMdBytes: 400, skillMdLines: 8, estTokens: 100, tmplBytes: 200, descriptionLen: 30, hasGateEval: false, hasPeriodicEval: false }, + }, + }; + const after: ParityBaseline = { + ...before, + tag: 'after', + totalCorpusBytes: 700, + estTotalCatalogTokens: 60, + skills: { + foo: { ...before.skills.foo!, skillMdBytes: 400 }, + bar: { ...before.skills.bar!, skillMdBytes: 300 }, + }, + }; + const diff = diffBaselines(before, after); + expect(diff.totalCorpusDelta).toBe(-300); + expect(diff.totalCorpusDeltaPct).toBeCloseTo(-30, 1); + expect(diff.catalogTokensDelta).toBe(-40); + expect(diff.perSkill.length).toBe(2); + // Sorted by abs delta descending + expect(diff.perSkill[0]!.skill).toBe('foo'); + expect(diff.perSkill[0]!.deltaBytes).toBe(-200); + expect(diff.perSkill[1]!.skill).toBe('bar'); + }); + + test('v1.44.1 baseline file exists with expected shape', () => { + const baselinePath = path.join(REPO_ROOT, 'test', 'fixtures', 'parity-baseline-v1.44.1.json'); + expect(fs.existsSync(baselinePath)).toBe(true); + const baseline = JSON.parse(fs.readFileSync(baselinePath, 
'utf-8')) as ParityBaseline; + expect(baseline.tag).toBe('v1.44.1'); + expect(baseline.totalSkills).toBeGreaterThan(40); + // Document the v1.44.1 snapshot as the v1→v2 baseline reference. + // Compression in v1.45+ should drop totalCorpusBytes; this assertion + // anchors the "v1 was XX MB" claim in the CHANGELOG to a real file. + expect(baseline.totalCorpusBytes).toBeGreaterThan(2_000_000); + }); +}); diff --git a/test/helpers/capture-parity-baseline.ts b/test/helpers/capture-parity-baseline.ts new file mode 100644 index 000000000..2c8ce1730 --- /dev/null +++ b/test/helpers/capture-parity-baseline.ts @@ -0,0 +1,231 @@ +/** + * Parity baseline capture — cathedral parity-eval suite primitive. + * + * Snapshots the current state of every top-level SKILL.md: byte count, line + * count, estimated token count, frontmatter description length, eval + * coverage. The output JSON is the v1.44 baseline that v2 must beat on + * compression AND match (or exceed) on parity. + * + * The numbers quoted in the v2.0.0.0 CHANGELOG numbers table are read + * from a baseline JSON captured by this script. Never invent baseline + * numbers; ship them only if they came from a real captureBaseline() run. 
+ *
+ * Usage:
+ *   bun run scripts/capture-baseline.ts                 # write default path
+ *   bun run scripts/capture-baseline.ts --out PATH      # write custom path
+ *   bun run scripts/capture-baseline.ts --tag v1.44.1   # tag the snapshot
+ */
+
+import * as fs from 'fs';
+import * as path from 'path';
+import { execSync } from 'child_process';
+
+export interface SkillBaselineEntry {
+  skill: string;
+  skillMdBytes: number;
+  skillMdLines: number;
+  estTokens: number; // ~4 chars/token heuristic
+  tmplBytes: number | null; // null when no .tmpl exists (vendored or non-Claude)
+  descriptionLen: number; // bytes in frontmatter description field
+  hasGateEval: boolean;
+  hasPeriodicEval: boolean;
+}
+
+export interface ParityBaseline {
+  tag: string;
+  capturedAt: string;
+  capturedFromCommit: string;
+  capturedFromBranch: string;
+  totalSkills: number;
+  totalCorpusBytes: number;
+  estTotalCatalogTokens: number; // sum of all description lengths / 4
+  topHeaviest: SkillBaselineEntry[]; // sorted desc by skillMdBytes
+  skills: Record<string, SkillBaselineEntry>;
+}
+
+export interface CaptureOptions {
+  repoRoot: string;
+  tag?: string;
+}
+
+/** Extract the frontmatter description from a SKILL.md file. Empty string if none. */
+function extractDescription(content: string): string {
+  if (!content.startsWith('---\n')) return '';
+  const fmEnd = content.indexOf('\n---', 4);
+  if (fmEnd === -1) return '';
+  const frontmatter = content.slice(4, fmEnd);
+  const lines = frontmatter.split('\n');
+  let inDescription = false;
+  const descLines: string[] = [];
+  for (const line of lines) {
+    if (line.match(/^description:\s*\|?\s*$/)) {
+      inDescription = true;
+      continue;
+    }
+    if (line.match(/^description:\s+/)) {
+      descLines.push(line.replace(/^description:\s+/, ''));
+      inDescription = true;
+      continue;
+    }
+    if (inDescription) {
+      if (line.match(/^\w+:\s/)) break;
+      descLines.push(line.trim());
+    }
+  }
+  return descLines.join('\n').trim();
+}
+
+/** Estimate token count via 4 chars/token.
Crude but matches existing budget-regression usage. */
+function estimateTokens(bytes: number): number {
+  return Math.round(bytes / 4);
+}
+
+/** Find which top-level directories contain a SKILL.md (skills we capture). */
+function discoverSkillDirs(repoRoot: string): string[] {
+  const entries = fs.readdirSync(repoRoot, { withFileTypes: true });
+  const dirs: string[] = [];
+  for (const e of entries) {
+    if (!e.isDirectory()) continue;
+    if (e.name.startsWith('.')) continue;
+    if (e.name === 'node_modules' || e.name === 'docs') continue;
+    const skillMd = path.join(repoRoot, e.name, 'SKILL.md');
+    if (fs.existsSync(skillMd)) dirs.push(e.name);
+  }
+  return dirs.sort();
+}
+
+/** Check whether a skill has E2E gate / periodic eval coverage by scanning test/. */
+function discoverEvalCoverage(repoRoot: string, skills: string[]): {
+  gate: Set<string>;
+  periodic: Set<string>;
+} {
+  const gate = new Set<string>();
+  const periodic = new Set<string>();
+  const testDir = path.join(repoRoot, 'test');
+  if (!fs.existsSync(testDir)) return { gate, periodic };
+  const testFiles = fs.readdirSync(testDir).filter(f => f.startsWith('skill-e2e-') && f.endsWith('.test.ts'));
+  // Try to map each test file to a skill by reading its contents for skill names.
+  for (const file of testFiles) {
+    const content = fs.readFileSync(path.join(testDir, file), 'utf-8');
+    for (const skill of skills) {
+      // Match the skill name in /skill-name slash form, as a quoted string, or after a skill(s)[/=:] prefix.
+      const re = new RegExp(`(/${skill}|['"\`]${skill}['"\`]|skill[s]?[/=:]\\s*['"\`]${skill}['"\`])`);
+      if (re.test(content)) {
+        // Crude tier inference: if the file name contains a known-periodic marker (chain, multi, idempotency, finding-floor), classify periodic.
+        if (file.includes('chain') || file.includes('multi') || file.includes('idempotency') || file.includes('finding-floor')) {
+          periodic.add(skill);
+        } else {
+          gate.add(skill);
+        }
+      }
+    }
+  }
+  return { gate, periodic };
+}
+
+function getGitInfo(repoRoot: string): { commit: string; branch: string } {
+  try {
+    const commit = execSync('git rev-parse --short HEAD', { cwd: repoRoot, encoding: 'utf-8' }).trim();
+    const branch = execSync('git rev-parse --abbrev-ref HEAD', { cwd: repoRoot, encoding: 'utf-8' }).trim();
+    return { commit, branch };
+  } catch {
+    return { commit: 'unknown', branch: 'unknown' };
+  }
+}
+
+export function captureBaseline(opts: CaptureOptions): ParityBaseline {
+  const { repoRoot, tag } = opts;
+  const skillDirs = discoverSkillDirs(repoRoot);
+  const evalCoverage = discoverEvalCoverage(repoRoot, skillDirs);
+  const skills: Record<string, SkillBaselineEntry> = {};
+  let totalCorpusBytes = 0;
+  let totalDescriptionBytes = 0;
+  for (const dir of skillDirs) {
+    const skillMdPath = path.join(repoRoot, dir, 'SKILL.md');
+    const tmplPath = path.join(repoRoot, dir, 'SKILL.md.tmpl');
+    const content = fs.readFileSync(skillMdPath, 'utf-8');
+    const bytes = Buffer.byteLength(content, 'utf-8');
+    const lines = content.split('\n').length;
+    const description = extractDescription(content);
+    const descriptionLen = Buffer.byteLength(description, 'utf-8');
+    const tmplBytes = fs.existsSync(tmplPath)
+      ?
Buffer.byteLength(fs.readFileSync(tmplPath, 'utf-8'), 'utf-8') + : null; + const entry: SkillBaselineEntry = { + skill: dir, + skillMdBytes: bytes, + skillMdLines: lines, + estTokens: estimateTokens(bytes), + tmplBytes, + descriptionLen, + hasGateEval: evalCoverage.gate.has(dir), + hasPeriodicEval: evalCoverage.periodic.has(dir), + }; + skills[dir] = entry; + totalCorpusBytes += bytes; + totalDescriptionBytes += descriptionLen; + } + const topHeaviest = Object.values(skills) + .slice() + .sort((a, b) => b.skillMdBytes - a.skillMdBytes) + .slice(0, 10); + const git = getGitInfo(repoRoot); + return { + tag: tag ?? 'untagged', + capturedAt: new Date().toISOString(), + capturedFromCommit: git.commit, + capturedFromBranch: git.branch, + totalSkills: skillDirs.length, + totalCorpusBytes, + estTotalCatalogTokens: estimateTokens(totalDescriptionBytes), + topHeaviest, + skills, + }; +} + +/** Diff two baselines; useful for v2 vs v1.44 deltas. */ +export interface BaselineDiff { + totalCorpusDelta: number; + totalCorpusDeltaPct: number; + catalogTokensDelta: number; + catalogTokensDeltaPct: number; + perSkill: Array<{ + skill: string; + beforeBytes: number; + afterBytes: number; + deltaBytes: number; + deltaPct: number; + }>; +} + +export function diffBaselines(before: ParityBaseline, after: ParityBaseline): BaselineDiff { + const totalCorpusDelta = after.totalCorpusBytes - before.totalCorpusBytes; + const totalCorpusDeltaPct = before.totalCorpusBytes + ? (totalCorpusDelta / before.totalCorpusBytes) * 100 + : 0; + const catalogTokensDelta = after.estTotalCatalogTokens - before.estTotalCatalogTokens; + const catalogTokensDeltaPct = before.estTotalCatalogTokens + ? (catalogTokensDelta / before.estTotalCatalogTokens) * 100 + : 0; + const perSkill: BaselineDiff['perSkill'] = []; + const allSkills = new Set([...Object.keys(before.skills), ...Object.keys(after.skills)]); + for (const skill of allSkills) { + const b = before.skills[skill]?.skillMdBytes ?? 
0; + const a = after.skills[skill]?.skillMdBytes ?? 0; + perSkill.push({ + skill, + beforeBytes: b, + afterBytes: a, + deltaBytes: a - b, + deltaPct: b ? ((a - b) / b) * 100 : 0, + }); + } + perSkill.sort((x, y) => Math.abs(y.deltaBytes) - Math.abs(x.deltaBytes)); + return { + totalCorpusDelta, + totalCorpusDeltaPct, + catalogTokensDelta, + catalogTokensDeltaPct, + perSkill, + }; +} diff --git a/test/helpers/parity-harness.ts b/test/helpers/parity-harness.ts new file mode 100644 index 000000000..4071a6cae --- /dev/null +++ b/test/helpers/parity-harness.ts @@ -0,0 +1,230 @@ +/** + * Cathedral parity-eval harness (v1.45.0.0 T0b). + * + * Compares CURRENT SKILL.md output to a v1.44.1 golden baseline along three + * axes: STRUCTURE (frontmatter shape), CONTENT (must-preserve phrases per + * skill family), and SIZE (per-skill byte budget). The fourth axis — + * BEHAVIORAL parity via LLM-as-judge — runs on top of this harness in the + * periodic-tier eval suite (paid, ~$0.20 per skill judge call). + * + * The structural + content checks ship in v1.45.0.0 as the foundation; the + * LLM-judge layer lands in v2.0.0.0 alongside the sections/ pattern. Both + * use this module's APIs. + * + * Why a separate harness from skill-size-budget.test.ts: that one enforces + * size discipline only. This module supports content invariants per skill + * family (e.g., cso must preserve OWASP/STRIDE; plan-ceo must preserve + * mode-selection phrasing) so future compression can't silently strip + * load-bearing prose even when size stays within ratio. + */ + +import * as fs from 'fs'; +import * as path from 'path'; +import type { ParityBaseline, SkillBaselineEntry } from './capture-parity-baseline'; +import { captureBaseline } from './capture-parity-baseline'; + +export interface ParityInvariant { + skill: string; + /** Phrases that MUST appear in the generated SKILL.md (case-insensitive substring). */ + mustContain?: string[]; + /** Markdown H2 headings that MUST appear. 
*/ + mustHaveHeadings?: string[]; + /** Maximum byte size growth ratio vs baseline. 1.0 = no growth allowed. */ + maxSizeRatio?: number; + /** Minimum byte size (catches over-stripping cliffs). */ + minBytes?: number; +} + +export interface ParityCheckResult { + skill: string; + passed: boolean; + failures: string[]; +} + +export function checkSkillParity( + invariant: ParityInvariant, + current: SkillBaselineEntry, + baseline: SkillBaselineEntry | undefined, + repoRoot: string, +): ParityCheckResult { + const failures: string[] = []; + + // SIZE checks + if (invariant.maxSizeRatio !== undefined && baseline) { + const ratio = current.skillMdBytes / baseline.skillMdBytes; + if (ratio > invariant.maxSizeRatio) { + failures.push(`size ratio ${ratio.toFixed(3)} > maxSizeRatio ${invariant.maxSizeRatio}`); + } + } + if (invariant.minBytes !== undefined && current.skillMdBytes < invariant.minBytes) { + failures.push(`size ${current.skillMdBytes} < minBytes ${invariant.minBytes}`); + } + + // CONTENT checks (read live file for fresh content) + if (invariant.mustContain?.length || invariant.mustHaveHeadings?.length) { + const skillMdPath = path.join(repoRoot, invariant.skill, 'SKILL.md'); + let content: string | null = null; + try { + content = fs.readFileSync(skillMdPath, 'utf-8'); + } catch (err) { + failures.push(`cannot read ${skillMdPath}: ${(err as Error).message}`); + } + if (content) { + const lower = content.toLowerCase(); + for (const phrase of invariant.mustContain ?? []) { + if (!lower.includes(phrase.toLowerCase())) { + failures.push(`missing required phrase: "${phrase}"`); + } + } + for (const heading of invariant.mustHaveHeadings ?? 
[]) { + if (!content.includes(heading)) { + failures.push(`missing required heading: "${heading}"`); + } + } + } + } + + return { + skill: invariant.skill, + passed: failures.length === 0, + failures, + }; +} + +export interface ParityReport { + baselineTag: string; + currentCapturedAt: string; + totalChecks: number; + passed: number; + failed: number; + details: ParityCheckResult[]; +} + +export function runParityChecks(opts: { + repoRoot: string; + baseline: ParityBaseline; + invariants: ParityInvariant[]; +}): ParityReport { + const { repoRoot, baseline, invariants } = opts; + const current = captureBaseline({ repoRoot }); + const details: ParityCheckResult[] = []; + for (const invariant of invariants) { + const baselineEntry = baseline.skills[invariant.skill]; + const currentEntry = current.skills[invariant.skill]; + if (!currentEntry) { + details.push({ + skill: invariant.skill, + passed: false, + failures: [`skill removed: ${invariant.skill} present in baseline but not current state`], + }); + continue; + } + details.push(checkSkillParity(invariant, currentEntry, baselineEntry, repoRoot)); + } + return { + baselineTag: baseline.tag, + currentCapturedAt: current.capturedAt, + totalChecks: details.length, + passed: details.filter(d => d.passed).length, + failed: details.filter(d => !d.passed).length, + details, + }; +} + +/** + * Standard invariant registry — the v1.45.0.0 set. + * + * Each entry pins what must-not-break in a skill family. Extend as future + * skills land. Phase B (v2.0.0.0) adds LLM-judge invariants on top of these. 
+ */ +export const PARITY_INVARIANTS: ParityInvariant[] = [ + { + skill: 'cso', + mustContain: ['OWASP', 'STRIDE', 'daily', 'comprehensive', 'verif'], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 30_000, + }, + { + skill: 'ship', + mustContain: [ + 'VERSION', + 'CHANGELOG', + 'review', + 'merge', + 'PR', + ], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 80_000, + }, + { + skill: 'plan-ceo-review', + mustContain: [ + 'SCOPE EXPANSION', + 'SELECTIVE EXPANSION', + 'HOLD SCOPE', + 'SCOPE REDUCTION', + ], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 80_000, + }, + { + skill: 'plan-eng-review', + mustContain: [ + 'Architecture', + 'Code Quality', + 'Test', + 'Performance', + ], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 70_000, + }, + { + skill: 'plan-design-review', + mustContain: [ + 'design', + 'visual', + ], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 70_000, + }, + { + skill: 'review', + mustContain: ['confidence', 'P1', 'P2'], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 70_000, + }, + { + skill: 'qa', + mustContain: ['bug', 'browse', 'fix'], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 50_000, + }, + { + skill: 'investigate', + mustContain: ['root cause', 'hypothes'], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 30_000, + }, + { + skill: 'office-hours', + mustContain: ['design doc', 'problem statement'], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 70_000, + }, + { + skill: 'autoplan', + mustContain: ['ceo', 'eng', 'design'], + mustHaveHeadings: ['## Preamble', '## When to invoke'], + maxSizeRatio: 1.05, + minBytes: 70_000, + }, +]; diff --git 
a/test/parity-baseline-integrity.test.ts b/test/parity-baseline-integrity.test.ts new file mode 100644 index 000000000..fad9ce1f1 --- /dev/null +++ b/test/parity-baseline-integrity.test.ts @@ -0,0 +1,145 @@ +/** + * Gap C (v1.46.0.0): parity-baseline-v1.44.1.json integrity check. + * + * The v1.44.1 baseline file is the source of every "v1 was X bytes" claim + * in CHANGELOG.md (v1.46.0.0 entry) and the reference for the per-skill + * size-budget gate, the parity-suite content invariants, and the published + * compression numbers. If a contributor (or a sloppy rebase) edits the + * file, every downstream claim silently becomes unverifiable. + * + * This test pins: + * 1. The file exists. + * 2. Its top-level `tag` is "v1.44.1" (rejects a rename-by-edit). + * 3. Its `capturedFromCommit` is the v1.44.1.0 release commit (or earlier + * commit on the slim-skill-tokens branch where the baseline was + * captured — both are immutable historic SHAs). + * 4. The headline numbers reported in CHANGELOG.md are present in the + * baseline JSON. If someone "fixes" the JSON numbers without updating + * CHANGELOG (or vice versa), this surfaces the mismatch. + * 5. A whitelist of known stable commits — anything else means someone + * regenerated the baseline against fresh-current-state, which defeats + * the v1→v2 reference contract. + */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import * as crypto from 'crypto'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); +const BASELINE_PATH = path.join(REPO_ROOT, 'test', 'fixtures', 'parity-baseline-v1.44.1.json'); +const CHANGELOG_PATH = path.join(REPO_ROOT, 'CHANGELOG.md'); + +/** + * The baseline was captured at this commit on the slim-skill-tokens branch + * (commit 74bc8054, just after v2_PLAN.md landed and before any compression + * work). 
If the baseline is ever regenerated, this allowlist must change AND
+ * the v1.46.0.0 CHANGELOG numbers table must be updated to reflect the new
+ * v1.x baseline.
+ */
+const ALLOWED_BASELINE_COMMITS = new Set([
+  '74bc8054',
+]);
+
+/**
+ * Headline numbers from the v1.46.0.0 CHANGELOG entry. If the baseline JSON
+ * is edited, these no longer match and the user's published claims become
+ * unverifiable. We assert the baseline still contains these values.
+ */
+const EXPECTED_v144_NUMBERS = {
+  totalSkills: 51,
+  totalCorpusBytesMin: 2_900_000, // CHANGELOG says ~2,847 KB (uses Math.round(/1024)), i.e. ~2,915,000 bytes; range allows ~±15K slack
+  totalCorpusBytesMax: 2_930_000,
+  estTotalCatalogTokensMin: 9_300,
+  estTotalCatalogTokensMax: 9_340, // CHANGELOG cites ~9,319
+};
+
+describe('parity-baseline-v1.44.1.json integrity (v1→v2 reference)', () => {
+  test('file exists at the canonical path', () => {
+    expect(fs.existsSync(BASELINE_PATH)).toBe(true);
+  });
+
+  test('tag is "v1.44.1" — file was not renamed by edit', () => {
+    const baseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
+    expect(baseline.tag).toBe('v1.44.1');
+  });
+
+  test('capturedFromCommit is on the allowlist (rejects ad-hoc regeneration)', () => {
+    const baseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
+    if (!ALLOWED_BASELINE_COMMITS.has(baseline.capturedFromCommit)) {
+      throw new Error(
+        `parity-baseline-v1.44.1.json was captured at commit ${baseline.capturedFromCommit}, ` +
+        `not on the allowlist (${[...ALLOWED_BASELINE_COMMITS].join(', ')}).\n` +
+        `If you intentionally regenerated the baseline, add the new commit to ` +
+        `ALLOWED_BASELINE_COMMITS in test/parity-baseline-integrity.test.ts AND ` +
+        `update the v1.46.0.0 CHANGELOG numbers table to match the new baseline.\n` +
+        `If you didn't intend to regenerate it, restore the file from git history.`,
+      );
+    }
+  });
+
+  test('totalSkills matches expected (51)', () => {
+    const baseline = JSON.parse(fs.readFileSync(BASELINE_PATH,
'utf-8')); + expect(baseline.totalSkills).toBe(EXPECTED_v144_NUMBERS.totalSkills); + }); + + test('totalCorpusBytes is within the CHANGELOG-cited range (~2,847 KB)', () => { + const baseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8')); + expect(baseline.totalCorpusBytes).toBeGreaterThanOrEqual(EXPECTED_v144_NUMBERS.totalCorpusBytesMin); + expect(baseline.totalCorpusBytes).toBeLessThanOrEqual(EXPECTED_v144_NUMBERS.totalCorpusBytesMax); + }); + + test('estTotalCatalogTokens matches the CHANGELOG-cited ~9,319', () => { + const baseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8')); + expect(baseline.estTotalCatalogTokens).toBeGreaterThanOrEqual(EXPECTED_v144_NUMBERS.estTotalCatalogTokensMin); + expect(baseline.estTotalCatalogTokens).toBeLessThanOrEqual(EXPECTED_v144_NUMBERS.estTotalCatalogTokensMax); + }); + + test('CHANGELOG v1.46.0.0 entry references this baseline file by path', () => { + const changelog = fs.readFileSync(CHANGELOG_PATH, 'utf-8'); + // The CHANGELOG entry must mention the baseline file so reviewers know + // where the numbers come from. If someone edits one without the other, + // this test surfaces the drift. + expect(changelog).toContain('parity-baseline-v1.44.1.json'); + }); + + test('every per-skill entry has the required shape', () => { + const baseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8')); + for (const [skill, entry] of Object.entries(baseline.skills)) { + const e = entry as Record<string, unknown>; + expect(typeof e.skill).toBe('string'); + expect(e.skill).toBe(skill); + expect(typeof e.skillMdBytes).toBe('number'); + expect(typeof e.skillMdLines).toBe('number'); + expect(typeof e.estTokens).toBe('number'); + expect(typeof e.descriptionLen).toBe('number'); + expect(e.skillMdBytes as number).toBeGreaterThan(0); + } + }); + + test('content hash is stable (catches any byte-level edit)', () => { + // Pinning the SHA256 of the file content is the strongest possible + // integrity check. 
When the baseline file LEGITIMATELY needs to change + // (rare — e.g. adding new skills since v1.44.1), this test fails with + // a clear "the hash changed from X to Y; update the constant if + // intentional" signal. The commit that updates the hash MUST also + // explain why and update the v1.46.0.0 CHANGELOG numbers if any + // headline changes. + // + // To re-capture: `shasum -a 256 test/fixtures/parity-baseline-v1.44.1.json` + const buf = fs.readFileSync(BASELINE_PATH); + const hash = crypto.createHash('sha256').update(buf).digest('hex'); + const EXPECTED_HASH = '29da01be6493bb2c7308b072f3066c09bdeb0397cb79ae1c708b5a38850efe46'; + if (hash !== EXPECTED_HASH) { + throw new Error( + `parity-baseline-v1.44.1.json content hash changed.\n` + + ` expected: ${EXPECTED_HASH}\n` + + ` current: ${hash}\n` + + `If you intentionally regenerated the baseline, update EXPECTED_HASH in ` + + `test/parity-baseline-integrity.test.ts AND justify the change in the ` + + `commit message AND update the v1.46.0.0 CHANGELOG numbers table.\n` + + `If you didn't intend to regenerate it, restore the file from git history.`, + ); + } + }); +}); diff --git a/test/parity-suite.test.ts b/test/parity-suite.test.ts new file mode 100644 index 000000000..9d6da4868 --- /dev/null +++ b/test/parity-suite.test.ts @@ -0,0 +1,49 @@ +/** + * Cathedral parity suite — gate-tier (free, structural + content checks). + * + * Runs every PARITY_INVARIANTS check against the current SKILL.md output + * vs the v1.44.1 baseline. Failures get an actionable, per-skill report + * showing missing phrases, missing headings, and size ratios. + * + * Periodic-tier LLM-judge parity (paid) lands in Phase B (v2.0.0.0) + * alongside the sections/ extraction. Plumbing is in parity-harness.ts. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import { runParityChecks, PARITY_INVARIANTS } from './helpers/parity-harness'; +import type { ParityBaseline } from './helpers/capture-parity-baseline'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); +const BASELINE_PATH = path.join(REPO_ROOT, 'test', 'fixtures', 'parity-baseline-v1.44.1.json'); + +describe('parity suite vs v1.44.1 baseline (gate, free)', () => { + test('baseline exists', () => { + expect(fs.existsSync(BASELINE_PATH)).toBe(true); + }); + + test('all PARITY_INVARIANTS pass', () => { + const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8')); + const report = runParityChecks({ + repoRoot: REPO_ROOT, + baseline, + invariants: PARITY_INVARIANTS, + }); + + // eslint-disable-next-line no-console + console.log( + `[parity] ${report.passed}/${report.totalChecks} skills passed parity vs ${baseline.tag}`, + ); + + if (report.failed === 0) return; + + const failureMessages = report.details + .filter(d => !d.passed) + .map(d => ` ${d.skill}:\n - ${d.failures.join('\n - ')}`) + .join('\n'); + throw new Error( + `${report.failed} skill(s) failed parity checks vs v1.44.1:\n${failureMessages}`, + ); + }); +}); diff --git a/test/resolver-entry.test.ts b/test/resolver-entry.test.ts new file mode 100644 index 000000000..964584dde --- /dev/null +++ b/test/resolver-entry.test.ts @@ -0,0 +1,186 @@ +/** + * Unit tests for the ResolverEntry / unwrapResolver mechanism. + * + * Verifies the conditional-injection plumbing added in T2 (v1.45.0.0). + * Plain functions still work; gated entries skip when appliesTo returns false. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import { unwrapResolver, type ResolverFn, type ResolverEntry, type TemplateContext } from '../scripts/resolvers/types'; + +function makeCtx(overrides: Partial<TemplateContext> = {}): TemplateContext { + return { + skillName: 'test-skill', + tmplPath: '/tmp/test/SKILL.md.tmpl', + host: 'claude', + paths: { + skillRoot: '~/.claude/skills/gstack', + localSkillRoot: '.claude/skills', + binDir: '~/.claude/skills/gstack/bin', + browseDir: '~/.claude/skills/gstack/browse/dist', + designDir: '~/.claude/skills/gstack/design/dist', + makePdfDir: '~/.claude/skills/gstack/make-pdf/dist', + }, + ...overrides, + }; +} + +describe('unwrapResolver — plain function pass-through', () => { + test('returns the function as-is, no gate', () => { + const fn: ResolverFn = (ctx) => `hello-${ctx.skillName}`; + const { resolve, appliesTo } = unwrapResolver(fn); + expect(resolve(makeCtx())).toBe('hello-test-skill'); + expect(appliesTo).toBeUndefined(); + }); +}); + +describe('unwrapResolver — gated entry', () => { + test('returns resolve + gate', () => { + const entry: ResolverEntry = { + resolve: (ctx) => `gated-${ctx.skillName}`, + appliesTo: (ctx) => ['ship', 'review'].includes(ctx.skillName), + }; + const { resolve, appliesTo } = unwrapResolver(entry); + expect(resolve(makeCtx({ skillName: 'ship' }))).toBe('gated-ship'); + expect(appliesTo!(makeCtx({ skillName: 'ship' }))).toBe(true); + expect(appliesTo!(makeCtx({ skillName: 'qa' }))).toBe(false); + }); + + test('gate returning false should signal skip — gen-skill-docs substitutes empty string', () => { + // This mirrors the gen-skill-docs.ts contract: + // if (appliesTo && !appliesTo(ctx)) return ''; + const entry: ResolverEntry = { + resolve: () => 'CONTENT', + appliesTo: () => false, + }; + const { resolve, appliesTo } = unwrapResolver(entry); + const result = appliesTo && !appliesTo(makeCtx()) ? 
'' : resolve(makeCtx()); + expect(result).toBe(''); + }); + + test('gate returning true allows resolve to fire', () => { + const entry: ResolverEntry = { + resolve: () => 'CONTENT', + appliesTo: () => true, + }; + const { resolve, appliesTo } = unwrapResolver(entry); + const result = appliesTo && !appliesTo(makeCtx()) ? '' : resolve(makeCtx()); + expect(result).toBe('CONTENT'); + }); + + test('entry without appliesTo behaves like ungated', () => { + const entry: ResolverEntry = { resolve: () => 'ALWAYS' }; + const { resolve, appliesTo } = unwrapResolver(entry); + expect(appliesTo).toBeUndefined(); + expect(resolve(makeCtx())).toBe('ALWAYS'); + }); +}); + +describe('RESOLVERS registry still loads with mixed shapes', () => { + test('importing the live registry produces a record with expected resolvers', async () => { + const { RESOLVERS } = await import('../scripts/resolvers/index'); + // Spot-check that core resolvers are present. + expect(RESOLVERS.PREAMBLE).toBeDefined(); + expect(RESOLVERS.REVIEW_DASHBOARD).toBeDefined(); + expect(RESOLVERS.SLUG_EVAL).toBeDefined(); + // Each entry should unwrap cleanly. + for (const [name, entry] of Object.entries(RESOLVERS)) { + const { resolve } = unwrapResolver(entry); + expect(typeof resolve).toBe('function'); + expect(name.length).toBeGreaterThan(0); + } + }); +}); + +/** + * Gap D (v1.46.0.0): live appliesTo gate end-to-end integration. + * + * The ResolverEntry / unwrapResolver machinery has unit coverage above. The + * remaining gap: does the gen-skill-docs.ts:444 substitution loop actually + * USE the gate? A refactor that drops the `if (appliesTo && !appliesTo(ctx))` + * check would silently break every future gated resolver. + * + * This test simulates the exact 4-line shape the live pipeline uses against + * a synthetic registry. If gen-skill-docs.ts is refactored and someone + * forgets to keep the gate check in sync, this assertion fails. 
+ */ +describe('gen-skill-docs substitution loop respects the appliesTo gate', () => { + function simulateGenSubstitution( + template: string, + registry: Record<string, ResolverFn | ResolverEntry>, + ctx: TemplateContext, + ): string { + // Mirrors scripts/gen-skill-docs.ts:457-467 (the {{NAME}} substitution + // loop). Keep this in sync with the real loop. Drift here is what the + // test is designed to catch. + return template.replace(/\{\{(\w+(?::[^}]+)?)\}\}/g, (_match, fullKey) => { + const parts = fullKey.split(':'); + const resolverName = parts[0]; + const args = parts.slice(1); + const entry = registry[resolverName]; + if (!entry) throw new Error(`Unknown placeholder {{${resolverName}}}`); + const { resolve, appliesTo } = unwrapResolver(entry); + if (appliesTo && !appliesTo(ctx)) return ''; + return args.length > 0 ? resolve(ctx, args) : resolve(ctx); + }); + } + + test('plain-function resolver fires unconditionally', () => { + const tpl = '{{ALWAYS}}'; + const out = simulateGenSubstitution(tpl, { + ALWAYS: () => 'fired', + }, makeCtx({ skillName: 'whatever' })); + expect(out).toBe('fired'); + }); + + test('gated resolver fires only when appliesTo returns true', () => { + const tpl = 'before-{{GATED}}-after'; + const out = simulateGenSubstitution(tpl, { + GATED: { + resolve: () => 'CONTENT', + appliesTo: (ctx) => ctx.skillName === 'allowed', + }, + }, makeCtx({ skillName: 'allowed' })); + expect(out).toBe('before-CONTENT-after'); + }); + + test('gated resolver is substituted with empty string when appliesTo returns false', () => { + const tpl = 'before-{{GATED}}-after'; + const out = simulateGenSubstitution(tpl, { + GATED: { + resolve: () => 'CONTENT', + appliesTo: (ctx) => ctx.skillName === 'allowed', + }, + }, makeCtx({ skillName: 'something-else' })); + expect(out).toBe('before--after'); + }); + + test('mixed registry: gated + plain resolvers in the same template', () => { + const tpl = '{{PLAIN}} / {{GATED_ON}} / {{GATED_OFF}}'; + const ctx = makeCtx({ skillName: 'ship' }); + const out = 
simulateGenSubstitution(tpl, { + PLAIN: () => 'plain', + GATED_ON: { resolve: () => 'on', appliesTo: () => true }, + GATED_OFF: { resolve: () => 'off', appliesTo: () => false }, + }, ctx); + expect(out).toBe('plain / on / '); + }); + + test('parameterized resolver still respects gate', () => { + const tpl = '{{GATED:arg1:arg2}}'; + const ctx = makeCtx({ skillName: 'no' }); + const out = simulateGenSubstitution(tpl, { + GATED: { + resolve: (_c, args) => `fired-with-${(args ?? []).join('-')}`, + appliesTo: (c) => c.skillName === 'yes', + }, + }, ctx); + expect(out).toBe(''); // gated off, args ignored + }); + + test('unknown resolver throws (matches real gen-skill-docs error contract)', () => { + expect(() => + simulateGenSubstitution('{{NEVER_DEFINED}}', {}, makeCtx()), + ).toThrow(/Unknown placeholder/); + }); +}); diff --git a/test/skill-budget-regression.test.ts b/test/skill-budget-regression.test.ts index 651f09180..494ac6781 100644 --- a/test/skill-budget-regression.test.ts +++ b/test/skill-budget-regression.test.ts @@ -35,6 +35,27 @@ import { assertNoBudgetRegression, type EvalResult, } from './helpers/eval-store'; +import { logBudgetOverride } from './helpers/budget-override'; + +/** + * v1.45.0.0 T5 — hard eval cost cap. + * + * Per-tier defaults (override via env): + * EVALS_BUDGET_HARD_CAP_GATE defaults to the umbrella cap ($30/run) + * EVALS_BUDGET_HARD_CAP_PERIODIC default $70/run, or the umbrella cap if higher + * EVALS_BUDGET_HARD_CAP umbrella cap if a tier-specific isn't set; default $30 + * EVALS_BUDGET_OVERRIDE_REASON if set, override fires AND audit-logs to + * ~/.gstack/analytics/spend-overrides.jsonl + * + * Caps are dollars-per-run, not dollars-per-test. A test that legitimately + * gets more expensive should bake into the baseline; a runaway eval (infinite + * retry, model price change) gets stopped here. 
+ */ +const DEFAULT_HARD_CAP_USD = Number(process.env.EVALS_BUDGET_HARD_CAP) || 30; +const TIER_CAPS: Record<'e2e' | 'llm-judge', number> = { + e2e: Number(process.env.EVALS_BUDGET_HARD_CAP_GATE) || DEFAULT_HARD_CAP_USD, + 'llm-judge': Number(process.env.EVALS_BUDGET_HARD_CAP_PERIODIC) || Math.max(70, DEFAULT_HARD_CAP_USD), +}; function currentGitBranch(): string { try { @@ -137,6 +158,40 @@ function checkTier(tier: 'e2e' | 'llm-judge'): void { ); } +/** Enforce a hard dollar cap on per-run eval cost. */ +function checkHardCap(tier: 'e2e' | 'llm-judge'): void { + const evalDir = getProjectEvalDir(); + const latest = findLatestRun(evalDir, tier); + if (!latest) return; + const cap = TIER_CAPS[tier]; + const cost = latest.result.total_cost_usd; + if (cost <= cap) { + // eslint-disable-next-line no-console + console.log(`[budget-hard-cap:${tier}] OK — $${cost.toFixed(2)} ≤ $${cap.toFixed(2)} cap`); + return; + } + const overrideReason = process.env.EVALS_BUDGET_OVERRIDE_REASON?.trim(); + if (overrideReason) { + logBudgetOverride({ + scope: `evals-cost-cap-${tier}`, + reason: overrideReason, + details: { tier, cap, observed_cost_usd: cost, run_file: latest.filepath }, + }); + // eslint-disable-next-line no-console + console.warn( + `[budget-hard-cap:${tier}] OVERRIDE APPLIED ("${overrideReason}") — $${cost.toFixed(2)} > $${cap.toFixed(2)} cap`, + ); + return; + } + throw new Error( + `Eval cost exceeded hard cap for tier ${tier}: ` + + `$${cost.toFixed(2)} > $${cap.toFixed(2)}. ` + + `Set EVALS_BUDGET_OVERRIDE_REASON="why this is OK" to allow + audit. ` + + `Per-tier override: EVALS_BUDGET_HARD_CAP_${tier === 'e2e' ? 'GATE' : 'PERIODIC'}=. 
` + + `Run: ${latest.filepath}`, + ); +} + describe('tool budget regression (gate, free)', () => { test('no e2e test exceeds 2× prior tool calls or turns', () => { checkTier('e2e'); @@ -145,4 +200,13 @@ describe('tool budget regression (gate, free)', () => { test('no llm-judge test exceeds 2× prior tool calls or turns', () => { checkTier('llm-judge'); }); + + // T5: hard dollar cap on per-run cost (different from regression ratio above) + test('e2e run cost ≤ EVALS_BUDGET_HARD_CAP_GATE', () => { + checkHardCap('e2e'); + }); + + test('llm-judge run cost ≤ EVALS_BUDGET_HARD_CAP_PERIODIC', () => { + checkHardCap('llm-judge'); + }); }); diff --git a/test/skill-coverage-floor.test.ts b/test/skill-coverage-floor.test.ts new file mode 100644 index 000000000..a0de76292 --- /dev/null +++ b/test/skill-coverage-floor.test.ts @@ -0,0 +1,153 @@ +/** + * Skill coverage floor — gate-tier, free, runs every PR. + * + * Phase 0 of the cathedral parity-eval suite: structural-compliance smoke + * test that covers every gstack skill with file-IO assertions. The intent + * is "every skill ships with at least one CI-blocking check" — even when + * a skill doesn't (yet) have a behavioral E2E test, this floor catches + * frontmatter regressions, missing generated header, empty/trivial bodies, + * and dangling SKILL.md.tmpl-without-SKILL.md mismatches. + * + * Pairs with test/skill-coverage-matrix.ts (the registry) and + * test/parity-suite.test.ts (the content-invariant suite). Together, + * v1.45.0.0 ships with: floor (this file) + matrix (registry CI gate) + * + invariants (content per skill family) + size budget. That's the + * eval-first foundation the v2.0.0.0 sections/ work builds on. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import { SKILL_COVERAGE } from './skill-coverage-matrix'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); + +function readSkillMd(skill: string): string | null { + const p = path.join(REPO_ROOT, skill, 'SKILL.md'); + try { + return fs.readFileSync(p, 'utf-8'); + } catch { + return null; + } +} + +function listSkillDirs(): string[] { + const entries = fs.readdirSync(REPO_ROOT, { withFileTypes: true }); + return entries + .filter(e => e.isDirectory() && !e.name.startsWith('.')) + .filter(e => e.name !== 'node_modules' && e.name !== 'docs' && e.name !== 'test') + .filter(e => fs.existsSync(path.join(REPO_ROOT, e.name, 'SKILL.md'))) + .map(e => e.name) + .sort(); +} + +describe('skill-coverage-floor: every skill passes structural compliance', () => { + const skills = listSkillDirs(); + + test('skill registry mentions every skill on disk', () => { + const onDisk = new Set(skills); + const inRegistry = new Set(Object.keys(SKILL_COVERAGE)); + const missingFromRegistry: string[] = []; + for (const s of onDisk) { + if (!inRegistry.has(s)) missingFromRegistry.push(s); + } + if (missingFromRegistry.length > 0) { + throw new Error( + `Skills on disk missing from test/skill-coverage-matrix.ts: ${missingFromRegistry.join(', ')}. ` + + `Add an entry to SKILL_COVERAGE with at least 'test/skill-coverage-floor.test.ts' in gate[].`, + ); + } + }); + + test('every registry entry has at least one gate-tier test', () => { + const missingGate: string[] = []; + for (const [skill, coverage] of Object.entries(SKILL_COVERAGE)) { + if (!coverage.gate || coverage.gate.length === 0) missingGate.push(skill); + } + if (missingGate.length > 0) { + throw new Error( + `Skills with no gate-tier eval: ${missingGate.join(', ')}. 
` + + `Eval-first foundation requires at least one CI-blocking check per skill.`, + ); + } + }); + + test('every gate-tier test path referenced in registry exists on disk', () => { + const missing: string[] = []; + for (const [skill, coverage] of Object.entries(SKILL_COVERAGE)) { + for (const testPath of [...coverage.gate, ...coverage.periodic]) { + const fullPath = path.join(REPO_ROOT, testPath); + if (!fs.existsSync(fullPath)) { + missing.push(`${skill} → ${testPath}`); + } + } + } + if (missing.length > 0) { + throw new Error(`Registry references missing test files:\n ${missing.join('\n ')}`); + } + }); + + // Per-skill structural compliance (file IO only, no LLM) + for (const skill of skills) { + describe(`skill: ${skill}`, () => { + test('SKILL.md exists', () => { + const content = readSkillMd(skill); + expect(content).not.toBeNull(); + }); + + test('frontmatter is well-formed and contains name + description', () => { + const content = readSkillMd(skill)!; + expect(content.startsWith('---\n')).toBe(true); + const fmEnd = content.indexOf('\n---', 4); + expect(fmEnd).toBeGreaterThan(0); + const fm = content.slice(4, fmEnd); + // name: ... + expect(/^name:\s*\S/m.test(fm)).toBe(true); + // description: ... (either inline or block form) + expect(/^description:\s*(\S|\|)/m.test(fm)).toBe(true); + }); + + test('frontmatter description fits the catalog-trim contract', () => { + const content = readSkillMd(skill)!; + const fmEnd = content.indexOf('\n---', 4); + const fm = content.slice(4, fmEnd); + // Inline form: description: + const inlineMatch = fm.match(/^description:\s+(.+)$/m); + // Block form: description: |\n multiline + const blockMatch = fm.match(/^description:\s*\|/m); + if (inlineMatch) { + // Catalog-trimmed: should be ≤ 250 chars + expect(inlineMatch[1].length).toBeLessThanOrEqual(250); + } else if (blockMatch) { + // Block form is acceptable for small skills (under-120-chars baseline + // didn't trigger catalog trim). 
No size cap here; the parity-suite + // and size-budget tests handle bytes. + } else { + throw new Error(`${skill}: description field is not in inline or block form`); + } + }); + + test('generated header present (only edit .tmpl, not .md)', () => { + const content = readSkillMd(skill)!; + expect(content).toContain('AUTO-GENERATED from SKILL.md.tmpl'); + }); + + test('body is non-trivial (≥ 200 bytes after frontmatter)', () => { + const content = readSkillMd(skill)!; + const fmEnd = content.indexOf('\n---', 4); + const body = content.slice(fmEnd + 5).trim(); + expect(body.length).toBeGreaterThanOrEqual(200); + }); + + test('no unresolved {{TEMPLATE}} placeholders leaked into output', () => { + const content = readSkillMd(skill)!; + const leaks = content.match(/\{\{[A-Z_]+(?::[^}]+)?\}\}/g); + if (leaks) { + throw new Error( + `${skill}: ${leaks.length} unresolved placeholder(s) in generated SKILL.md: ${leaks.slice(0, 3).join(', ')}${leaks.length > 3 ? ', ...' : ''}`, + ); + } + }); + }); + } +}); diff --git a/test/skill-coverage-matrix.test.ts b/test/skill-coverage-matrix.test.ts new file mode 100644 index 000000000..1c212d456 --- /dev/null +++ b/test/skill-coverage-matrix.test.ts @@ -0,0 +1,72 @@ +/** + * Skill coverage matrix CI gate (v1.45.0.0 T1). + * + * Asserts every skill on disk has an entry in SKILL_COVERAGE with at + * least one gate-tier test. The detailed per-skill structural checks + * live in test/skill-coverage-floor.test.ts; this file is the matrix- + * level gate that surfaces "skill added but eval not registered" cleanly. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import { SKILL_COVERAGE, type SkillCoverage } from './skill-coverage-matrix'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); + +function discoverSkills(): string[] { + return fs.readdirSync(REPO_ROOT, { withFileTypes: true }) + .filter(e => e.isDirectory() && !e.name.startsWith('.')) + .filter(e => fs.existsSync(path.join(REPO_ROOT, e.name, 'SKILL.md'))) + .map(e => e.name) + .sort(); +} + +describe('skill coverage matrix', () => { + test('SKILL_COVERAGE is exported and non-empty', () => { + expect(typeof SKILL_COVERAGE).toBe('object'); + expect(Object.keys(SKILL_COVERAGE).length).toBeGreaterThan(0); + }); + + test('every entry has the right shape', () => { + for (const [skill, coverage] of Object.entries(SKILL_COVERAGE)) { + expect(Array.isArray(coverage.gate)).toBe(true); + expect(Array.isArray(coverage.periodic)).toBe(true); + expect(coverage.gate.length).toBeGreaterThan(0); + for (const p of [...coverage.gate, ...coverage.periodic]) { + expect(typeof p).toBe('string'); + expect(p.startsWith('test/')).toBe(true); + expect(p.endsWith('.test.ts')).toBe(true); + } + } + }); + + test('every skill on disk has a registry entry', () => { + const skills = discoverSkills(); + const missing: string[] = []; + for (const s of skills) { + if (!SKILL_COVERAGE[s]) missing.push(s); + } + if (missing.length > 0) { + throw new Error( + `Skills on disk missing from SKILL_COVERAGE: ${missing.join(', ')}. 
` + + `Add an entry to test/skill-coverage-matrix.ts with at least ` + + `'test/skill-coverage-floor.test.ts' in gate[].`, + ); + } + }); + + test('no registry entry references a skill that does not exist on disk', () => { + const skills = new Set(discoverSkills()); + const orphans: string[] = []; + for (const skill of Object.keys(SKILL_COVERAGE)) { + if (!skills.has(skill)) orphans.push(skill); + } + if (orphans.length > 0) { + throw new Error( + `Registry references skills not on disk: ${orphans.join(', ')}. ` + + `Remove from SKILL_COVERAGE or restore the skill directory.`, + ); + } + }); +}); diff --git a/test/skill-coverage-matrix.ts b/test/skill-coverage-matrix.ts new file mode 100644 index 000000000..2e87e46e4 --- /dev/null +++ b/test/skill-coverage-matrix.ts @@ -0,0 +1,181 @@ +/** + * Skill coverage matrix (v1.45.0.0 T1, cathedral Phase 0). + * + * Single source of truth mapping each gstack skill to its E2E test files. + * The CI gate at test/skill-coverage-matrix.test.ts fails if a skill has + * no gate-tier entry, ensuring the eval-first foundation holds: every + * skill has at least one CI-blocking check that asserts must-have + * behavior. + * + * Two tiers per entry: + * gate CI-blocking, runs on every PR, target <$0.50/test or free. + * periodic Weekly cron, deeper coverage, can cost ~$1-$3/test. + * + * The 'floor' entry refers to test/skill-coverage-floor.test.ts — + * a structural-compliance smoke test that covers every skill with + * file-IO checks (free, no LLM cost). When a skill has only 'floor' + * coverage, that's the eval-first minimum; future work can layer + * behavioral checks on top. + */ + +export interface SkillCoverage { + /** Gate-tier test file paths (relative to repo root). At least one required per skill. */ + gate: string[]; + /** Periodic-tier test file paths. Optional but recommended. */ + periodic: string[]; + /** Brief note on why this coverage is the right shape for this skill. 
*/ + rationale?: string; +} + +/** + * Per-skill coverage. Keys MUST match the top-level skill directory name. + * The CI test asserts every skill in the repo has an entry here AND that + * gate[] is non-empty. + * + * Adding a new skill: add an entry here AND either reference an existing + * test that covers it OR add 'test/skill-coverage-floor.test.ts' as the + * minimum gate-tier check. + */ +export const SKILL_COVERAGE: Record<string, SkillCoverage> = { + // ─── Core loop ────────────────────────────────────────────── + ship: { + gate: ['test/skill-e2e-ship-idempotency.test.ts', 'test/skill-coverage-floor.test.ts'], + periodic: ['test/skill-e2e-workflow.test.ts'], + }, + review: { + gate: ['test/skill-e2e-review.test.ts', 'test/skill-coverage-floor.test.ts'], + periodic: ['test/skill-e2e-review-army.test.ts', 'test/regression-1539-review-self-verify.test.ts'], + }, + qa: { + gate: ['test/skill-e2e-qa-workflow.test.ts', 'test/skill-coverage-floor.test.ts'], + periodic: ['test/skill-e2e-qa-bugs.test.ts'], + }, + 'qa-only': { + gate: ['test/skill-coverage-floor.test.ts'], + periodic: [], + rationale: 'qa-only is qa with --report-only; behavior tested via /qa coverage.', + }, + investigate: { + gate: ['test/skill-coverage-floor.test.ts'], + periodic: [], + }, + browse: { + gate: ['test/skill-coverage-floor.test.ts'], + periodic: [], + rationale: 'browse binary has its own integration suite under browse/test/.', + }, + + // ─── Plan triad ───────────────────────────────────────────── + 'plan-ceo-review': { + gate: [ + 'test/skill-e2e-plan-ceo-finding-floor.test.ts', + 'test/skill-e2e-plan-ceo-plan-mode.test.ts', + 'test/skill-coverage-floor.test.ts', + ], + periodic: [ + 'test/skill-e2e-plan-ceo-finding-count.test.ts', + 'test/skill-e2e-plan-ceo-mode-routing.test.ts', + ], + }, + 'plan-eng-review': { + gate: [ + 'test/skill-e2e-plan-eng-finding-floor.test.ts', + 'test/skill-e2e-plan-eng-plan-mode.test.ts', + 'test/skill-coverage-floor.test.ts', + ], + periodic: [ + 
'test/skill-e2e-plan-eng-finding-count.test.ts', + 'test/skill-e2e-plan-eng-multi-finding-batching.test.ts', + ], + }, + 'plan-design-review': { + gate: [ + 'test/skill-e2e-plan-design-finding-floor.test.ts', + 'test/skill-e2e-plan-design-plan-mode.test.ts', + 'test/skill-e2e-plan-design-with-ui.test.ts', + 'test/skill-coverage-floor.test.ts', + ], + periodic: ['test/skill-e2e-plan-design-finding-count.test.ts'], + }, + 'plan-devex-review': { + gate: [ + 'test/skill-e2e-plan-devex-finding-floor.test.ts', + 'test/skill-e2e-plan-devex-plan-mode.test.ts', + 'test/skill-coverage-floor.test.ts', + ], + periodic: ['test/skill-e2e-plan-devex-finding-count.test.ts'], + }, + autoplan: { + gate: ['test/skill-coverage-floor.test.ts'], + periodic: ['test/skill-e2e-autoplan-chain.test.ts', 'test/skill-e2e-autoplan-dual-voice.test.ts'], + }, + 'office-hours': { + gate: ['test/skill-e2e-office-hours.test.ts', 'test/skill-coverage-floor.test.ts'], + periodic: ['test/skill-e2e-office-hours-auto-mode.test.ts', 'test/skill-e2e-office-hours-phase4.test.ts'], + }, + + // ─── Polish + design ──────────────────────────────────────── + 'design-review': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'design-consultation': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'design-shotgun': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'design-html': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + cso: { + gate: ['test/skill-e2e-cso.test.ts', 'test/cso-preserved.test.ts', 'test/skill-coverage-floor.test.ts'], + periodic: [], + rationale: 'cso-preserved.test.ts pins must-not-strip security guidance phrases.', + }, + 'document-release': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'document-generate': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + + // ─── Ops + integrations ───────────────────────────────────── + 'land-and-deploy': { gate: ['test/skill-e2e-deploy.test.ts', 
'test/skill-coverage-floor.test.ts'], periodic: [] }, + canary: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + benchmark: { gate: ['test/skill-e2e-benchmark-providers.test.ts', 'test/skill-coverage-floor.test.ts'], periodic: [] }, + 'benchmark-models': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + codex: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + retro: { + gate: ['test/skill-coverage-floor.test.ts'], + periodic: ['test/regression-1624-retro-stale-base.test.ts'], + }, + 'gstack-upgrade': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'context-save': { gate: ['test/skill-e2e-context-skills.test.ts', 'test/skill-coverage-floor.test.ts'], periodic: [] }, + 'context-restore': { gate: ['test/skill-e2e-context-skills.test.ts', 'test/skill-coverage-floor.test.ts'], periodic: [] }, + 'setup-deploy': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'setup-browser-cookies': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'setup-gbrain': { + gate: [ + 'test/skill-e2e-setup-gbrain-bad-token.test.ts', + 'test/skill-e2e-setup-gbrain-path4-local-pglite.test.ts', + 'test/skill-e2e-setup-gbrain-remote.test.ts', + 'test/skill-coverage-floor.test.ts', + ], + periodic: [], + }, + 'sync-gbrain': { + gate: ['test/skill-coverage-floor.test.ts'], + periodic: ['test/regression-1611-gbrain-sync-resume.test.ts'], + }, + 'open-gstack-browser': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'pair-agent': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + scrape: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + skillify: { gate: ['test/skill-e2e-skillify.test.ts', 'test/skill-coverage-floor.test.ts'], periodic: [] }, + learn: { gate: ['test/skill-e2e-learnings.test.ts', 'test/skill-coverage-floor.test.ts'], periodic: [] }, + 'plan-tune': { gate: ['test/skill-e2e-plan-tune.test.ts', 'test/skill-coverage-floor.test.ts'], periodic: [] }, + + 
// ─── iOS family ───────────────────────────────────────────── + 'ios-qa': { gate: ['test/skill-e2e-ios.test.ts', 'test/skill-coverage-floor.test.ts'], periodic: ['test/skill-e2e-ios-device.test.ts', 'test/skill-e2e-ios-swift-build.test.ts'] }, + 'ios-fix': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'ios-clean': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'ios-sync': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'ios-design-review': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + + // ─── Safety / housekeeping ────────────────────────────────── + careful: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + freeze: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + unfreeze: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + guard: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'landing-report': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + health: { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'make-pdf': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, + 'devex-review': { gate: ['test/skill-coverage-floor.test.ts'], periodic: [] }, +}; diff --git a/test/skill-size-budget.test.ts b/test/skill-size-budget.test.ts new file mode 100644 index 000000000..a22550d3f --- /dev/null +++ b/test/skill-size-budget.test.ts @@ -0,0 +1,220 @@ +/** + * Per-skill SKILL.md size budget regression (v1.46.0.0 T5). + * + * Asserts that no skill's generated SKILL.md grew beyond the v1.44.1 + * baseline. Catches preamble/resolver changes that bloat skills back to + * the pre-compression size. Free — pure file IO + JSON diff. + * + * Why a separate test from skill-budget-regression.test.ts: that one + * compares LIVE eval runs (tool calls, turns, cost); this one compares + * static SKILL.md sizes. Both gate-tier. 
+ * + * The baseline lives at test/fixtures/parity-baseline-v1.44.1.json, + * captured by scripts/capture-baseline.ts before any Phase A work landed. + * + * Override: + * - GSTACK_SIZE_BUDGET_RATIO= changes the per-skill regression ratio. + * Default 1.05 (5% growth tolerance). Set to 1.10 to permit 10% growth + * (e.g., during deliberate feature additions that the catalog trim + * doesn't offset). + * - GSTACK_SIZE_BUDGET_OVERRIDE_REASON="text" allows a regression to + * pass and logs the reason to ~/.gstack/analytics/spend-overrides.jsonl + * for audit. Use sparingly; the next baseline should bake in the new + * size. + */ + +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; +import { captureBaseline, type ParityBaseline } from './helpers/capture-parity-baseline'; +import { logBudgetOverride } from './helpers/budget-override'; + +const REPO_ROOT = path.resolve(import.meta.dir, '..'); +const BASELINE_PATH = path.join(REPO_ROOT, 'test', 'fixtures', 'parity-baseline-v1.44.1.json'); + +// Default per-skill ratio is 1.05 (5% growth tolerance). T4 catalog trim +// MOVES text from frontmatter (always-loaded catalog) to a body section +// ("## When to invoke"), so small skills with already-short descriptions +// see a tiny body growth from the section header itself (~20 bytes). The +// 5% per-skill tolerance accommodates that while still catching real bloat; +// the always-loaded catalog cost is enforced separately with a hard ceiling. 
+const DEFAULT_RATIO = 1.05; +const RATIO = Number(process.env.GSTACK_SIZE_BUDGET_RATIO) || DEFAULT_RATIO; + +interface Regression { + skill: string; + beforeBytes: number; + afterBytes: number; + growth: number; +} + +describe('SKILL.md size budget regression (gate, free)', () => { + test('parity-baseline-v1.44.1.json exists', () => { + expect(fs.existsSync(BASELINE_PATH)).toBe(true); + }); + + test('no skill exceeds v1.44.1 baseline size × ratio', () => { + const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8')); + const current = captureBaseline({ repoRoot: REPO_ROOT }); + + const regressions: Regression[] = []; + for (const [skill, before] of Object.entries(baseline.skills)) { + const after = current.skills[skill]; + if (!after) continue; // skill removed since v1.44 — not a regression + if (after.skillMdBytes <= before.skillMdBytes * RATIO) continue; + regressions.push({ + skill, + beforeBytes: before.skillMdBytes, + afterBytes: after.skillMdBytes, + growth: after.skillMdBytes / before.skillMdBytes, + }); + } + + if (regressions.length === 0) return; + + const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim(); + if (overrideReason) { + logBudgetOverride({ + scope: 'skill-size-budget', + reason: overrideReason, + details: { ratio: RATIO, regressions }, + }); + // eslint-disable-next-line no-console + console.warn( + `[skill-size-budget] OVERRIDE APPLIED (${overrideReason}) — ${regressions.length} regression(s) allowed:`, + ); + for (const r of regressions) { + // eslint-disable-next-line no-console + console.warn(` ${r.skill}: ${r.beforeBytes} → ${r.afterBytes} bytes (×${r.growth.toFixed(2)})`); + } + return; + } + + const msg = regressions.map(r => + ` ${r.skill}: ${r.beforeBytes} → ${r.afterBytes} bytes (×${r.growth.toFixed(2)})`, + ).join('\n'); + throw new Error( + `${regressions.length} skill(s) regressed past v1.44.1 baseline × ${RATIO}:\n${msg}\n` + + `Override: set 
GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" to allow and audit-log.`, + ); + }); + + test('total corpus byte count does not regress past baseline × ratio', () => { + const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8')); + const current = captureBaseline({ repoRoot: REPO_ROOT }); + const ratio = current.totalCorpusBytes / baseline.totalCorpusBytes; + if (current.totalCorpusBytes <= baseline.totalCorpusBytes * RATIO) { + // eslint-disable-next-line no-console + console.log( + `[skill-size-budget] corpus OK: ${baseline.totalCorpusBytes} → ${current.totalCorpusBytes} bytes (×${ratio.toFixed(3)})`, + ); + return; + } + const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim(); + if (overrideReason) { + logBudgetOverride({ + scope: 'skill-size-budget-corpus', + reason: overrideReason, + details: { ratio: RATIO, observed: ratio, before: baseline.totalCorpusBytes, after: current.totalCorpusBytes }, + }); + return; + } + throw new Error( + `Total corpus regressed past v1.44.1 baseline × ${RATIO}: ` + + `${baseline.totalCorpusBytes} → ${current.totalCorpusBytes} bytes (×${ratio.toFixed(3)}). ` + + `Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON to allow.`, + ); + }); + + /** + * Gap E (v1.46.0.0): per-skill min-size floor. + * + * The existing skill-coverage-floor enforces body ≥ 200 bytes, which is + * a tiny noise floor. A skill that was 100 KB at v1.44.1 and shrinks to + * 250 bytes passes that check despite losing 99.75% of content. The + * parity-suite content invariants cover this for 10 hand-picked skills + * (cso, ship, plan-ceo, etc.); the remaining 41 skills had no per-skill + * shrinkage floor. + * + * Floor: 80% of the v1.44.1 baseline. v1.46 actual shrinkage is <1% per + * skill, so this leaves comfortable headroom while still catching accidental + * mass deletion (e.g., a refactor that strips the body of a skill). 
+ * + * v2.0.0.0 will introduce the sections/ pattern for 5 heavyweights + * (ship, plan-ceo-review, office-hours, plan-eng-review, + * plan-design-review). Those skills will legitimately shrink to ~15 KB + * skeletons. When that lands, add them to SECTIONS_EXTRACTED so the floor + * relaxes for them. + */ + test('no skill shrinks past 80% of v1.44.1 baseline (catches accidental body strip)', () => { + const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8')); + const current = captureBaseline({ repoRoot: REPO_ROOT }); + const MIN_RATIO = 0.80; // a skill at <80% of its v1.44 size signals mass-deletion + const SECTIONS_EXTRACTED = new Set(); // populate in v2.0.0.0 when sections/ lands + + const undershoots: Array<{ + skill: string; beforeBytes: number; afterBytes: number; ratio: number; + }> = []; + for (const [skill, before] of Object.entries(baseline.skills)) { + if (SECTIONS_EXTRACTED.has(skill)) continue; + const after = current.skills[skill]; + if (!after) continue; // skill removed since baseline — separate concern + const ratio = after.skillMdBytes / before.skillMdBytes; + if (ratio < MIN_RATIO) { + undershoots.push({ + skill, beforeBytes: before.skillMdBytes, afterBytes: after.skillMdBytes, ratio, + }); + } + } + + if (undershoots.length === 0) return; + + const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim(); + if (overrideReason) { + logBudgetOverride({ + scope: 'skill-size-budget-floor', + reason: overrideReason, + details: { min_ratio: MIN_RATIO, undershoots }, + }); + // eslint-disable-next-line no-console + console.warn( + `[skill-size-budget-floor] OVERRIDE APPLIED (${overrideReason}) — ${undershoots.length} undershoot(s) allowed`, + ); + return; + } + + const msg = undershoots.map(u => + ` ${u.skill}: ${u.beforeBytes} → ${u.afterBytes} bytes (×${u.ratio.toFixed(2)} — below ${MIN_RATIO} floor)`, + ).join('\n'); + throw new Error( + `${undershoots.length} skill(s) shrunk past v1.44.1 × ${MIN_RATIO} 
floor:\n${msg}\n` + + `This usually signals accidental body strip (e.g., a resolver returning empty, a ` + + `template losing a section). If the shrinkage is intentional (e.g., the skill moved ` + + `to the sections/ pattern), add it to SECTIONS_EXTRACTED in this test. Override: ` + + `GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why" allows + audit-logs.`, + ); + }); + + test('catalog token estimate stays compressed (v1.45 target ≤ 7000)', () => { + const current = captureBaseline({ repoRoot: REPO_ROOT }); + const v145Target = 7000; + if (current.estTotalCatalogTokens <= v145Target) { + // eslint-disable-next-line no-console + console.log(`[skill-size-budget] catalog OK: ~${current.estTotalCatalogTokens} tokens (target ≤${v145Target})`); + return; + } + const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim(); + if (overrideReason) { + logBudgetOverride({ + scope: 'skill-size-budget-catalog', + reason: overrideReason, + details: { target: v145Target, observed: current.estTotalCatalogTokens }, + }); + return; + } + throw new Error( + `Catalog token estimate regressed past v1.45 target: ${current.estTotalCatalogTokens} tokens > ${v145Target}. ` + + `T4 catalog trim should keep this under control. 
Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON to allow.`, + ); + }); +}); diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts index 7df535552..a7f51cca1 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ -1480,14 +1480,15 @@ describe('Skill trigger phrases', () => { const skillPath = path.join(ROOT, skill, 'SKILL.md'); if (!fs.existsSync(skillPath)) return; const content = fs.readFileSync(skillPath, 'utf-8'); - // Extract description from frontmatter - const frontmatterEnd = content.indexOf('---', 4); - const frontmatter = content.slice(0, frontmatterEnd); - expect(frontmatter).toMatch(/Use when/i); + // v1.45.0.0 catalog trim moved trigger prose out of frontmatter into a + // body "## When to invoke" section. Search the full file content, not + // just frontmatter. The trigger phrase must still appear somewhere in + // the skill so agents can match user requests to the skill. + expect(content).toMatch(/Use when/i); }); } - // Skills with proactive triggers should have "Proactively suggest" in description + // Skills with proactive triggers should have "Proactively suggest" somewhere in the skill. const SKILLS_REQUIRING_PROACTIVE = [ 'qa', 'qa-only', 'ship', 'review', 'investigate', 'office-hours', 'plan-ceo-review', 'plan-eng-review', 'plan-design-review', @@ -1499,9 +1500,8 @@ describe('Skill trigger phrases', () => { const skillPath = path.join(ROOT, skill, 'SKILL.md'); if (!fs.existsSync(skillPath)) return; const content = fs.readFileSync(skillPath, 'utf-8'); - const frontmatterEnd = content.indexOf('---', 4); - const frontmatter = content.slice(0, frontmatterEnd); - expect(frontmatter).toMatch(/Proactively (suggest|invoke)/i); + // Same catalog-trim consideration — search the full file content. 
+ expect(content).toMatch(/Proactively (suggest|invoke)/i); }); } }); diff --git a/test/terse-build.test.ts b/test/terse-build.test.ts new file mode 100644 index 000000000..ca531aeae --- /dev/null +++ b/test/terse-build.test.ts @@ -0,0 +1,151 @@ +/** + * Unit tests for the terse-build flag (v1.46.0.0 T3). + * + * `--explain-level=terse` makes the gen-skill-docs pipeline trim 4 preamble + * sections at gen time (3 dropped, Writing Style collapsed). Default builds + * keep them. Without these tests, a refactor that breaks the explainLevel + * threading silently regresses one of the opt-in compression paths. The + * runtime EXPLAIN_LEVEL: terse gate still works, so users wouldn't notice immediately. + * + * Pure-function tests against the resolvers: fast, free, no subprocess. + */ + +import { describe, test, expect } from 'bun:test'; +import type { TemplateContext } from '../scripts/resolvers/types'; +import { generateWritingStyle } from '../scripts/resolvers/preamble/generate-writing-style'; +import { generateCompletenessSection } from '../scripts/resolvers/preamble/generate-completeness-section'; +import { generateConfusionProtocol } from '../scripts/resolvers/preamble/generate-confusion-protocol'; +import { generateContextHealth } from '../scripts/resolvers/preamble/generate-context-health'; +import { generatePreamble } from '../scripts/resolvers/preamble'; + +function makeCtx(explainLevel?: 'default' | 'terse', tier: number = 4): TemplateContext { + return { + skillName: 'test-skill', + tmplPath: '/tmp/test/SKILL.md.tmpl', + host: 'claude', + paths: { + skillRoot: '~/.claude/skills/gstack', + localSkillRoot: '.claude/skills', + binDir: '~/.claude/skills/gstack/bin', + browseDir: '~/.claude/skills/gstack/browse/dist', + designDir: '~/.claude/skills/gstack/design/dist', + makePdfDir: '~/.claude/skills/gstack/make-pdf/dist', + }, + preambleTier: tier, + explainLevel, + }; +} + +describe('terse build — per-resolver behavior', () => { + describe('generateWritingStyle', () => { + test('default: 
emits full section with jargon-list pointer', () => { + const out = generateWritingStyle(makeCtx('default')); + expect(out).toContain('## Writing Style'); + expect(out).toContain('jargon-list.json'); + expect(out).toContain('Curated jargon list'); + expect(out).toContain('outcome'); + }); + + test('terse: emits one-line terse directive only', () => { + const out = generateWritingStyle(makeCtx('terse')); + expect(out).toContain('## Writing Style'); + expect(out).toContain('Terse mode (build-time)'); + // Negative: NONE of the default-mode prose + expect(out).not.toContain('jargon-list.json'); + expect(out).not.toContain('Curated jargon list'); + expect(out).not.toContain('Frame questions in outcome terms'); + }); + + test('terse is meaningfully shorter than default', () => { + const fullLen = generateWritingStyle(makeCtx('default')).length; + const terseLen = generateWritingStyle(makeCtx('terse')).length; + expect(terseLen).toBeLessThan(fullLen / 3); + }); + }); + + describe('generateCompletenessSection', () => { + test('default: emits full section with Boil-the-Lake prose', () => { + const out = generateCompletenessSection(makeCtx('default')); + expect(out).toContain('## Completeness Principle'); + expect(out).toContain('Boil the Lake'); + }); + + test('terse: returns empty string', () => { + expect(generateCompletenessSection(makeCtx('terse'))).toBe(''); + }); + + test('no ctx arg: defaults to non-terse (back-compat with old callers)', () => { + const out = generateCompletenessSection(); + expect(out).toContain('## Completeness Principle'); + }); + }); + + describe('generateConfusionProtocol', () => { + test('default: emits full section', () => { + const out = generateConfusionProtocol(makeCtx('default')); + expect(out).toContain('## Confusion Protocol'); + expect(out).toContain('high-stakes ambiguity'); + }); + + test('terse: returns empty string', () => { + expect(generateConfusionProtocol(makeCtx('terse'))).toBe(''); + }); + + test('no ctx arg: defaults to 
non-terse', () => { + expect(generateConfusionProtocol()).toContain('## Confusion Protocol'); + }); + }); + + describe('generateContextHealth', () => { + test('default: emits full section', () => { + const out = generateContextHealth(makeCtx('default')); + expect(out).toContain('## Context Health'); + expect(out).toContain('PROGRESS'); + }); + + test('terse: returns empty string', () => { + expect(generateContextHealth(makeCtx('terse'))).toBe(''); + }); + }); +}); + +describe('terse build — generatePreamble integration', () => { + test('default tier-2 preamble includes all 4 terse-gated sections', () => { + const out = generatePreamble(makeCtx('default', 2)); + expect(out).toContain('## Writing Style'); + expect(out).toContain('## Completeness Principle'); + expect(out).toContain('## Confusion Protocol'); + expect(out).toContain('## Context Health'); + }); + + test('terse tier-2 preamble drops 3 of 4 sections + collapses Writing Style', () => { + const out = generatePreamble(makeCtx('terse', 2)); + // Writing Style heading still present (collapsed to one line) + expect(out).toContain('## Writing Style'); + expect(out).toContain('Terse mode (build-time)'); + // Three sections dropped entirely + expect(out).not.toContain('## Completeness Principle'); + expect(out).not.toContain('## Confusion Protocol'); + expect(out).not.toContain('## Context Health'); + }); + + test('terse preamble is measurably smaller', () => { + const defaultLen = generatePreamble(makeCtx('default', 2)).length; + const terseLen = generatePreamble(makeCtx('terse', 2)).length; + // Saving roughly 2-4 KB across the 4 sections; assert at least 1 KB saved. + expect(defaultLen - terseLen).toBeGreaterThan(1024); + }); + + test('terse preamble at tier 1 is identical to default (terse only affects tier-2+ sections)', () => { + // Tier 1 doesn't include the 4 terse-gated sections in the first place. 
+ const defaultT1 = generatePreamble(makeCtx('default', 1)); + const terseT1 = generatePreamble(makeCtx('terse', 1)); + expect(terseT1).toBe(defaultT1); + }); + + test('explainLevel undefined behaves as default', () => { + const undefinedOut = generatePreamble(makeCtx(undefined, 2)); + const defaultOut = generatePreamble(makeCtx('default', 2)); + expect(undefinedOut).toBe(defaultOut); + }); +}); diff --git a/test/writing-style-resolver.test.ts b/test/writing-style-resolver.test.ts index fce957c22..6d7444536 100644 --- a/test/writing-style-resolver.test.ts +++ b/test/writing-style-resolver.test.ts @@ -49,11 +49,17 @@ describe('Writing Style preamble section', () => { expect(out).toMatch(/terse|no explanations|user-turn override|current message/i); }); - test('tier 2+ preamble inlines jargon list', () => { + test('tier 2+ preamble references jargon list by path (v1.45.0.0 T3 — pointer, not inline)', () => { const out = generatePreamble(makeCtx('claude', 2)); - // Spot-check a few terms from scripts/jargon-list.json - expect(out).toContain('idempotent'); - expect(out).toContain('race condition'); + // T3 dedup: the 80-term jargon list lives in scripts/jargon-list.json. + // The Writing Style section points at the file rather than inlining it, + // saving ~70 KB across the corpus. Agents Read the JSON on first + // jargon term encountered per session. + expect(out).toContain('jargon-list.json'); + expect(out).toContain('Curated jargon list'); + // Negative check: the literal term lines should NOT be inlined any more. + expect(out).not.toMatch(/^- idempotent$/m); + expect(out).not.toMatch(/^- race condition$/m); }); test('tier 2+ preamble includes terse-mode gate condition', () => { diff --git a/unfreeze/SKILL.md b/unfreeze/SKILL.md index 415137bcd..a07f1cf5f 100644 --- a/unfreeze/SKILL.md +++ b/unfreeze/SKILL.md @@ -1,11 +1,7 @@ --- name: unfreeze version: 0.1.0 -description: | - Clear the freeze boundary set by /freeze, allowing edits to all directories - again. 
Use when you want to widen edit scope without ending the session. - Use when asked to "unfreeze", "unlock edits", "remove freeze", or - "allow all edits". (gstack) +description: Clear the freeze boundary set by /freeze, allowing edits to all directories again. (gstack) triggers: - unfreeze edits - unlock all directories @@ -17,6 +13,13 @@ allowed-tools: + +## When to invoke this skill + +Use when you want to widen edit scope without ending the session. +Use when asked to "unfreeze", "unlock edits", "remove freeze", or +"allow all edits". + # /unfreeze — Clear Freeze Boundary Remove the edit restriction set by `/freeze`, allowing edits to all directories.