Files
gstack/scripts/gen-skill-docs.ts
T
Garry Tan 22f8c7f4e1 v1.46.0.0 feat: gstack v2 foundation — catalog tokens drop 56%, eval-first floor covers all 51 skills (#1712)
* docs(designs): add v2_PLAN.md — gstack v2 the lightest opinionated skill pack

The approved plan from /plan-ceo-review → /plan-eng-review → /codex×2 →
/plan-devex-review. Captures the v1.45/v2.0 hybrid release shape,
cathedral parity-eval suite, sequential v1.45 execution, sections/*.md.tmpl
pipeline, EVALS_BUDGET_HARD_CAP override path, and v2 launch copy specs.

This commit just lands the design doc. Implementation follows in the rest
of the v1.45.0.0 branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(parity): T0a — capture v1.44.1 baseline + capture helper + diff utility

Cathedral parity-eval suite primitive. captureBaseline() walks every
top-level SKILL.md and records bytes, lines, estimated tokens, frontmatter
description length, and eval coverage. diffBaselines() reports per-skill
delta + total corpus delta + catalog tokens delta.

Locks the v1.44.1 reference snapshot at test/fixtures/parity-baseline-v1.44.1.json.
After Phase A+B+C land, scripts/capture-baseline.ts --tag v1.45.0.0 produces
a comparable snapshot; diff supplies the real numbers the v2 CHANGELOG quotes.
Never invent baseline numbers; ship them only if they came from a real run.

v1.44.1 numbers captured this commit:
- 51 skills
- 2,847 KB total corpus
- ~9,319 catalog tokens (sum of description bytes / 4)
- top 3: ship 160 KB, plan-ceo-review 128 KB, office-hours 108 KB

Test plan:
- bun test test/helpers/capture-parity-baseline.test.ts passes 4/4
- The baseline JSON file is committed so reviewers can audit v1→v2 numbers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resolvers): T2 — ResolverEntry + appliesTo gate infrastructure

Adds the conditional-resolver-injection plumbing from the v2_PLAN A.1
step. Resolvers can now be either a bare ResolverFn (always fires, current
behavior) or a ResolverEntry { resolve, appliesTo? } (gated; appliesTo
returning false skips the resolver, substitutes empty string).

Why infrastructure-only: the audit during T0a confirmed most resolvers
don't need gating. The {{NAME}} placeholder system is already conditional
at the template level — a resolver only fires for skills that reference it.
The gate is for future use when a placeholder's audience needs a structural
guardrail beyond social convention, or when a sub-resolver inside a larger
composed resolver (e.g. preamble) needs per-skill skip.

scripts/gen-skill-docs.ts:444 now uses unwrapResolver() to handle both
shapes. RESOLVERS map signature widens from Record<string, ResolverFn>
to Record<string, ResolverValue>. All existing resolvers stay bare
functions and work unchanged.

Test plan:
- bun test test/resolver-entry.test.ts: 6 pass (gate plumbing + registry)
- bun test test/gen-skill-docs.test.ts: 389 pass (no regression)
- bun run gen:skill-docs --dry-run: all SKILL.md files FRESH (no diff)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(preamble): T3 — jargon dedup + terse-build flag (Phase A.2 + A.3)

A.2 jargon dedup: generate-writing-style.ts replaces the inlined 80-term
jargon list with a one-line pointer to scripts/jargon-list.json. The list
was duplicated into every tier-2+ skill (48 of 51 skills); inlining cost
was ~1.5 KB × 48 = ~70 KB across the corpus. Pointer cost is ~30 bytes per
skill. Agents Read the JSON once per session on first jargon term
encountered; thereafter the terms array is the canonical reference.

A.3 terse build flag: --explain-level=terse compresses preamble prose at
gen time. When the flag is set, writing-style collapses to a one-line
terse directive and completeness-section + confusion-protocol +
context-health are dropped entirely. The default build keeps the
runtime-conditional behavior intact (sections still render; the model
skips them when EXPLAIN_LEVEL: terse appears in the preamble echo). Terse
build is opt-in for users who want shipped skills to match their runtime
preference and avoid the per-session terse-mode dead prose.

TemplateContext gains an optional `explainLevel: 'default' | 'terse'`
field. Default builds set it to 'default'; --explain-level=terse sets
'terse'. Resolvers gate their output via `ctx?.explainLevel === 'terse'`.

Measured impact (default build, post-T3):
- Total corpus: 2,847 KB → 2,812 KB (saved 35 KB)
- ship.md: 160 → 159 KB
- plan-ceo-review.md: 128 → 127 KB
- Top 10 heaviest: all slightly smaller from jargon pointer

Larger compression lands in T4 (catalog trim) and T7 (atomic regen across
the full Phase A pipeline). The terse build path further compresses to
~711K tokens vs default ~725K (saved ~14K tokens corpus-wide).

Test plan:
- bun test test/gen-skill-docs.test.ts: 389 pass (no regression)
- bun test test/resolver-entry.test.ts: 6 pass
- bun test test/helpers/capture-parity-baseline.test.ts: 4 pass
- bun run gen:skill-docs --explain-level=terse: ship.md drops completeness +
  confusion-protocol + context-health sections; writing-style collapses to
  one-line terse directive

48 SKILL.md files updated (every tier-2+ skill picks up the jargon pointer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(catalog): T4 — catalog trim + proactive-suggestions.json (Phase A.4)

Shortens frontmatter `description:` in every Claude SKILL.md to a single
lead sentence + (gstack) tag. The routing prose ("Use when asked to...",
"Proactively suggest...") and voice triggers move to a "## When to invoke"
body section so they remain discoverable inside the skill. A per-run
registry at scripts/proactive-suggestions.json aggregates the routing/
voice text for all 52 skills so agents can pull guidance on demand
without paying for it in the always-loaded catalog.

Build flag --catalog-mode=full restores v1.44 legacy behavior (full
multi-line descriptions in frontmatter). Default is trim.

splitCatalogDescription() extracts: lead sentence, routing paragraphs,
voice-triggers line, (gstack) tag presence. Short descriptions (<120
chars, already trimmed) are skipped via a guard so re-runs are idempotent.

Measured impact (vs v1.44.1 baseline):
- Catalog tokens (sum of description bytes / 4): 9,319 → 4,045  (-56.6%)
- Total SKILL.md corpus bytes:                   2,915 KB → 2,880 KB (-1.2%)
- Routing prose preserved as in-skill "## When to invoke" sections
- 52 skill entries in scripts/proactive-suggestions.json (on-demand registry)

The corpus drop is small because catalog trim MOVES text from frontmatter
to body, it doesn't delete it. The headline win is the catalog: the
always-loaded system prompt surface drops by more than half.

Test plan:
- bun test test/gen-skill-docs.test.ts: 389 pass, 0 fail
- Manual: ship/SKILL.md frontmatter description is now ONE line ending
  with `(gstack)`; allowed-tools field on next line (YAML well-formed)
- Manual: scripts/proactive-suggestions.json contains 52 entries
- bun run gen:skill-docs --catalog-mode=full restores legacy behavior

53 files changed (52 SKILL.md across hosts + the new proactive-suggestions.json).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(budget): T5 — hard token budgets + override audit trail (Phase A.6)

Two new gate-tier guardrails for the v1.45.0.0 compression baseline:

1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget.
   Compares current state to test/fixtures/parity-baseline-v1.44.1.json.
   Three checks: per-skill (×1.05 default ratio), total corpus, and
   catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05
   not 1.0 because the T4 catalog trim moves text from frontmatter to a
   body section; small skills see a tiny body growth that's fine when
   offset by the much larger catalog-token win.

2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on
   per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella
   EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry,
   model price changes) before they amortize across PRs.

Both checks support an override path with audit trail:
   GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK"   — size
   EVALS_BUDGET_OVERRIDE_REASON="why this is OK"          — cost
Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with
timestamp + scope + reason + CI provenance (runner, branch, commit)
via test/helpers/budget-override.ts.

Why the override audit: a hard cap with no escape valve becomes
operationally hostile (legit price changes, longer transcripts, new
required evals can all blow the cap). An override with no audit becomes
"everyone overrides everything and the gate is theater." This module
ships the audit half so reviewers can see what was waived and why.

Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with
auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json
already in test/fixtures/).

Test plan:
- bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists)
- bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks)
- Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cso): T6 — pin must-preserve security phrases (Phase A.5)

cso/SKILL.md is a content-heavy security audit skill (75 KB after T3+T4).
Codex 2nd-pass critique #9: "cso exemption too broad ... should still get
resolver dedup, catalog trim, sectioning if safe, and targeted evals
around must-not-miss checks."

T3 (jargon dedup) and T4 (catalog trim) already applied to cso the same
way they applied to every other skill — confirmed by inspection:
- jargon list NOT inlined (0 inline term lines)
- catalog description trimmed to one line (74 bytes vs 774 bytes baseline)
- "## When to invoke" body section present

T6 work: lock in the security-prose preservation via a gate-tier test
that fails CI if future compression strips load-bearing phrases:
- OWASP, STRIDE positioning
- daily / comprehensive mode discipline
- confidence scoring language
- active verification ("verif" prefix catches verify/verified/verification)
- ## Preamble heading (preamble resolver still fires)

Also guards cso against accidental over-stripping: SKILL.md must stay
≥30 KB (currently 75 KB) — a sudden cliff would mean compression went
past the targeted-dedup line into structural removal.

No structural change to cso. Future Phase B sections/ work for cso
requires writing baseline parity tests FIRST per the v2_PLAN.md
sequencing.

Test plan:
- bun test test/cso-preserved.test.ts: 5 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(parity): T0b — cathedral parity-suite harness + invariant registry

Adds the harness that the v2_PLAN.md cathedral parity-eval suite is built
on. Compares CURRENT SKILL.md output to v1.44.1 baseline along three axes:

  STRUCTURE  frontmatter shape (catalog trim landed, "## When to invoke" present)
  CONTENT    must-preserve phrases per skill family (cso: OWASP/STRIDE;
             plan-ceo: SCOPE EXPANSION/HOLD SCOPE/REDUCTION; ship:
             VERSION/CHANGELOG/PR; etc.)
  SIZE       per-skill byte budget (maxSizeRatio + minBytes guards)

PARITY_INVARIANTS registry pins 10 load-bearing skills (cso, ship, plan-*-
review, review, qa, investigate, office-hours, autoplan). Each entry
declares what must NOT regress; future compression that strips these
phrases or shrinks a skill past its minBytes cliff fails CI.

Periodic-tier LLM-judge parity (paid, ~$0.20/skill) lands in v2.0.0.0
sections/ phase. Same registry, same harness, judge added on top.

Test plan:
- bun test test/parity-suite.test.ts: 10/10 invariants pass vs v1.44.1
- Per-skill failures get actionable per-line breakdown so a reviewer can
  see which phrase / heading / size limit went sideways

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(coverage): T1 — skill coverage matrix + structural-compliance floor

Phase 0 deliverable — eval-first foundation. Two new test files plus the
registry:

1. test/skill-coverage-matrix.ts — single source of truth mapping each
   skill to its gate-tier + periodic-tier test files. SKILL_COVERAGE
   record with 51 entries; every gstack skill on disk has at least one
   gate-tier entry.

2. test/skill-coverage-matrix.test.ts — CI gate. Asserts every skill on
   disk has a registry entry AND that gate[] is non-empty. Catches
   "skill added but eval not registered" the moment a new SKILL.md
   lands.

3. test/skill-coverage-floor.test.ts — per-skill structural compliance
   (FREE, file-IO only). For each of 51 skills, verifies:
   - SKILL.md exists
   - Frontmatter well-formed (name + description fields)
   - Catalog-trim contract (inline description ≤ 250 chars, or block form)
   - Generated header present (edit .tmpl, not .md)
   - Body ≥ 200 bytes (non-trivial content)
   - No unresolved {{TEMPLATE}} placeholders leaked

The "floor" is the minimum eval that every skill ships with. Skills that
need deeper behavioral testing get additional entries in their coverage
record (e.g., ship has skill-e2e-ship-idempotency + workflow + floor).
Future skills only need to add the floor entry and the matrix gate
unblocks them.

Codex 2nd-pass critique #1 mitigation: eval-first floor is structural
compliance (the testable part) — judgment-skill behavior gets layered
periodic-tier evals on top. We don't pretend the floor proves
correctness, only that the skill structurally compiles.

Test plan:
- bun test test/skill-coverage-matrix.test.ts: 4 pass (matrix shape + coverage)
- bun test test/skill-coverage-floor.test.ts: 309 pass (6 checks × 51 skills + 3 registry-level)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build(skills): T7 — atomic regenerate + capture v1.45.0.0 baseline

Final regen pass across all hosts after T1-T6 work landed. Captures the
v1.45.0.0 parity baseline at test/fixtures/parity-baseline-v1.45.0.0.json
for diffing against the v1.44.1 reference.

Measured deltas (real numbers from test/helpers/capture-parity-baseline.ts):

  Total SKILL.md corpus       2,847 KB → 2,813 KB        (-1.2%)
  Catalog tokens (always-loaded) ~9,319 → ~4,045 tokens   (-56.6%)
  Top 10 heaviest skills      0.5-1.0% drop each

The catalog token cut is the headline. It's the always-loaded surface,
i.e. tokens charged on every session start. Per-skill SKILL.md sizes
barely moved because T4 catalog trim MOVES routing prose from frontmatter
to a body "## When to invoke" section rather than deleting it — the
catalog wins without amputating discoverability.

The bigger per-skill compression lands in v2.0.0.0 (Phase B sections/
pattern on the 5 heavyweights). v1.45 is the foundation: eval-first
infrastructure + cheap wins.

scripts/proactive-suggestions.json regenerated with the latest 52 skills
listed (one-time write per gen-skill-docs run; aggregated catalog parts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.45.0.0 — gstack v2 foundation: catalog tokens drop 56%, eval-first floor

Bumps VERSION + package.json to 1.45.0.0. CHANGELOG entry covers what
shipped between v1.44.1 and this release: the cathedral parity-eval
foundation, conditional resolver injection plumbing, jargon dedup, terse
build flag, catalog trim with one-line frontmatter descriptions, hard
token + dollar budget gates with override audit, cso preservation pins,
and the v1.44.1 ↔ v1.45.0.0 parity baselines committed to test/fixtures/.

Numbers (measured, not estimated):
- Catalog tokens: ~9,319 → ~4,045  (-56.6%)
- Total corpus:   2,847 KB → 2,813 KB (-1.2%)
- Skills with gate-tier eval coverage: 32/51 → 51/51 (floor achieved)

This is the foundation release. v2.0.0.0 will ship the architectural
break (sections/*.md.tmpl pattern + mechanical Read enforcement +
eval-coverage annotations) as a coordinated marketing-grade launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(catalog): refresh proactive-suggestions.json timestamp after v1.45 bump

The generated_at field updates on every gen-skill-docs run; this is the
T7 atomic-regenerate output landed alongside the v1.45.0.0 bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalog): deterministic proactive-suggestions.json (no per-run timestamp)

Original implementation wrote a generated_at timestamp on every gen-skill-docs
run. That made CI dry-run freshness checks flap because the file changed on
every regeneration even when the actual content (skill descriptions, routing
prose, voice triggers) was unchanged.

Two fixes:
1. Drop the generated_at field. The file is purely a content registry now.
2. Only write the file when serialized content actually differs from disk.

Reproducible test: bun run gen:skill-docs twice in a row now leaves
scripts/proactive-suggestions.json unchanged on the second run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalog): preserve routing prose when first sentence exceeds 200 chars

splitCatalogDescription truncated the lead BEFORE computing routing
extraction, which meant skills whose first sentence was over 200 chars
(design-consultation: 207 chars) had their entire routing prose silently
dropped — the "## When to invoke" body section came out empty.

Root cause: routing was extracted via `collapsed.indexOf(lead)` after lead
was suffixed with "...". The "..." never appeared in the original string,
so indexOf returned -1 and routingProse fell back to empty.

Fix: compute routing from sentenceLead (the untruncated first sentence)
BEFORE truncating the displayed lead. The displayed lead still gets "..."
when over 200 chars, but the routing extraction uses the real boundary.

Also: refresh golden snapshots for claude/codex/factory ship and update
two unit tests that asserted v1.44 behavior:
- skill-validation.test.ts: trigger-phrase + proactive-routing tests now
  search whole content, not just frontmatter (T4 moved them to a body
  "## When to invoke" section)
- writing-style-resolver.test.ts: jargon-list assertion now expects the
  T3 reference pointer, not the inline list

Test plan:
- bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts
  test/host-config.test.ts test/skill-size-budget.test.ts
  test/parity-suite.test.ts test/skill-coverage-matrix.test.ts
  test/skill-coverage-floor.test.ts test/cso-preserved.test.ts
  test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts
  test/gen-skill-docs.test.ts: 1134 pass, 0 fail
- Manual verify: design-consultation/SKILL.md "## When to invoke this skill"
  body section now contains "Use when asked to..." + "Proactively suggest..."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalog): deterministic proactive-suggestions.json across machines

CI check-freshness failed because scripts/proactive-suggestions.json
serialized differently on local vs CI:

1. Root-skill key leaked the directory name. processTemplate's outer loop
   computed `dir = path.basename(path.dirname(tmplPath))`. For the root
   SKILL.md.tmpl at ROOT/SKILL.md.tmpl, that returns the repo-checkout
   directory name — "seville-v3" in a Conductor worktree, "gstack" on
   GitHub Actions, anything-else for a fork. Fix: detect root via
   `path.dirname(tmplPath) === ROOT` and hardcode the key to "gstack"
   for that one case.

2. Aggregate key order was filesystem-iteration order. discoverTemplates
   doesn't guarantee stable ordering across platforms, so the JSON
   `skills` object came out shuffled between machines. Fix: sort
   Object.keys(proactiveAggregate) alphabetically before serializing.

After the fix, the generated file is identical on every machine and
matches what's committed. CI freshness check (bun run gen:skill-docs &&
git diff --exit-code) now passes.

Test plan:
- bun run gen:skill-docs && bun run gen:skill-docs --dry-run: all FRESH
- node -e 'verify keys sorted': sorted match: true
- grep -c '"seville-v3"' scripts/proactive-suggestions.json: 0
- Focused test suite: 704 pass, 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(catalog): unit + regression coverage for catalog-trim helpers

Four exported functions in scripts/gen-skill-docs.ts handle every skill's
frontmatter rewrite at gen time but had zero unit tests. Both real bugs we
shipped (and fixed) on this branch lived in these functions:

  v1.45.0.0 design-consultation: when the first sentence exceeded 200 chars,
  routing-prose extraction lost the entire tail (anchored on truncated lead
  with "..." that didn't substring-match the original).

  v1.45.0.0 CI freshness: root-skill key leaked the checkout directory
  name ("seville-v3" vs "gstack") and aggregate order was filesystem-
  iteration order.

Both shapes are now regression-tested:

- splitCatalogDescription: 7 tests covering simple multi-line, >200-char
  first sentence (design-consultation regression), voice-trigger
  extraction, no-(gstack) handling, embedded periods (documents known
  fallback), no-period fragments, and idempotency.
- buildTrimmedDescription: 3 tests.
- buildWhenToInvokeSection: 3 tests.
- applyCatalogTrim: 4 tests covering the standard rewrite, no-op for
  already-short descriptions, the YAML-collision newline fix, and the
  malformed-frontmatter null return.
- proactive-suggestions.json determinism: 3 tests asserting sorted keys,
  root keyed as "gstack" (not the worktree directory), and no
  timestamp/generated_at field that would flap CI freshness.

Test plan:
- bun test test/catalog-trim.test.ts: 20 pass, 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(coverage): fill three remaining v1.46.0.0 test gaps

Three untested surfaces from the v1.46.0.0 work. All three would have
caught real bugs we shipped (and fixed) on this branch.

1. test/helpers/budget-override.test.ts — 7 tests pin the audit-trail
   contract for EVALS_BUDGET_OVERRIDE_REASON and
   GSTACK_SIZE_BUDGET_OVERRIDE_REASON. Without this, the audit logger
   could silently drop events and overrides become invisible. Tests
   cover: required fields per JSONL line, CI provenance capture
   (CI/GITHUB_ACTIONS/branch/commit), local-runner defaults,
   append-only behavior, missing-directory recovery, and unwritable-
   path resilience (logs warning instead of throwing).

2. test/terse-build.test.ts — 16 tests pin --explain-level=terse
   behavior across the 4 gated resolvers and the composed preamble.
   Default vs terse vs undefined-ctx all asserted. Without this, a
   refactor that breaks the explainLevel threading silently regresses
   the opt-in compression path; the runtime EXPLAIN_LEVEL: terse gate
   still works so users wouldn't notice. Tier-1 invariant pinned
   (terse-only-affects-tier-2+).

3. test/gen-skill-docs-idempotency.test.ts — 2 tests catch the class
   of bug behind the v1.45.0.0 timestamp flap. Two consecutive
   gen-skill-docs runs must produce byte-identical outputs across
   STABLE_OUTPUTS (proactive-suggestions.json, SKILL.md, ship/SKILL.md,
   plan-ceo-review/SKILL.md, office-hours/SKILL.md, gstack/llms.txt).
   --dry-run reports zero stale files after a fresh gen. CI freshness
   regressions surface as test failures BEFORE a PR is opened.

Test plan:
- bun test test/helpers/budget-override.test.ts: 7 pass
- bun test test/terse-build.test.ts: 16 pass
- bun test test/gen-skill-docs-idempotency.test.ts: 2 pass
- Full focused suite (15 test files): 1179 pass, 0 fail (+45 new tests
  vs the pre-fill baseline of 1134)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(coverage): close 5 remaining v1.46.0.0 test gaps (A-E)

Five behaviors that v1.46 ships but had no test coverage. All now pinned.

A) --host all idempotency (test/gen-skill-docs-idempotency.test.ts)
   The default test ran Claude host only. Non-Claude hosts (Codex, Factory,
   Cursor, OpenClaw, GBrain, Slate, OpenCode, Hermes, Kiro) each have their
   own output paths and could carry their own non-deterministic fields. We
   hit a "--host all needed for freshness check" mid-/ship. Now: two
   consecutive `bun run gen:skill-docs --host all` runs must produce
   byte-identical outputs across a per-host sample (.agents/, .cursor/,
   .factory/, .gbrain/). Catches per-host adapter regressions before CI.

B) --catalog-mode=full opt-out (test/catalog-mode-full.test.ts)
   The legacy escape hatch had zero tests. 6 new tests across two layers:
   static (CATALOG_MODE_ARG parsed; conditional gate present; default is
   "trim"; invalid value throws) + smoke (actual --catalog-mode=full run
   produces a multi-line `description: |` block + omits "## When to invoke"
   body section; mutates the working tree then restores in a finally block).

C) parity-baseline-v1.44.1.json integrity (test/parity-baseline-integrity.test.ts)
   The baseline is the source of every v1→v2 number cited in the
   CHANGELOG v1.46.0.0 entry. Anyone could edit it without test failure
   until now. 8 new tests pin: existence, tag, capturedFromCommit
   allowlist, expected v1.44 numbers (51 skills, ~2,915 KB, ~9,319
   catalog tokens), CHANGELOG references this file by path, per-skill
   shape, and a SHA256 byte-stability hash. Any edit fails with a clear
   "if intentional, update EXPECTED_HASH AND the CHANGELOG numbers" signal.

D) Live appliesTo gate end-to-end (test/resolver-entry.test.ts extended)
   The unwrapResolver unit tests covered the function; the gen-skill-docs.ts
   substitution loop that USES the gate had no integration coverage. 6 new
   tests simulate the exact 4-line shape from gen-skill-docs.ts:457-467
   against synthetic registries: plain-function fires unconditionally,
   gated fires when true / empty-string when false, mixed registries
   compose, parameterized resolvers respect gates, unknown resolvers throw.

E) Per-skill min-size floor (test/skill-size-budget.test.ts extended)
   The existing 200-byte body coverage-floor is a noise floor — a skill
   that lost 99.75% of content still passes. 1 new test asserts every
   skill stays ≥80% of its v1.44.1 baseline size (the parity-suite
   content invariants only covered 10 of 51 skills; the remaining 41
   were uncovered). SECTIONS_EXTRACTED hook in place for v2.0.0.0 when
   the sections/ pattern legitimately shrinks ship/plan-ceo/etc. past
   the floor.

Test plan:
- bun test focused 17-file suite: 1202 pass, 0 fail
  (+23 new tests vs the pre-fill 1179 baseline)
- catalog-mode=full mutates working tree then restores cleanly
- --host all idempotency runs two full gen passes in <1s on this machine

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 16:50:03 -07:00

957 lines
42 KiB
TypeScript

#!/usr/bin/env bun
/**
* Generate SKILL.md files from .tmpl templates.
*
* Pipeline:
* read .tmpl → find {{PLACEHOLDERS}} → resolve from source → format → write .md
*
* Supports --dry-run: generate to memory, exit 1 if different from committed file.
* Used by skill:check and CI freshness checks.
*/
import { COMMAND_DESCRIPTIONS } from '../browse/src/commands';
import { SNAPSHOT_FLAGS } from '../browse/src/snapshot';
import { discoverTemplates } from './discover-skills';
import { writeLlmsTxt } from './gen-llms-txt';
import * as fs from 'fs';
import * as path from 'path';
import type { Host, TemplateContext } from './resolvers/types';
import { HOST_PATHS, unwrapResolver } from './resolvers/types';
import { RESOLVERS } from './resolvers/index';
import { externalSkillName, extractHookSafetyProse as _extractHookSafetyProse, extractNameAndDescription as _extractNameAndDescription, condenseOpenAIShortDescription as _condenseOpenAIShortDescription, generateOpenAIYaml as _generateOpenAIYaml } from './resolvers/codex-helpers';
import { generatePlanCompletionAuditShip, generatePlanCompletionAuditReview, generatePlanVerificationExec } from './resolvers/review';
import { ALL_HOST_CONFIGS, ALL_HOST_NAMES, resolveHostArg, getHostConfig } from '../hosts/index';
import type { HostConfig } from './host-config';
const ROOT = path.resolve(import.meta.dir, '..');
const DRY_RUN = process.argv.includes('--dry-run');
// ─── Host Detection (config-driven) ─────────────────────────
const HOST_ARG = process.argv.find(a => a.startsWith('--host'));
type HostArg = Host | 'all';
const HOST_ARG_VAL: HostArg = (() => {
if (!HOST_ARG) return 'claude';
const val = HOST_ARG.includes('=') ? HOST_ARG.split('=')[1] : process.argv[process.argv.indexOf(HOST_ARG) + 1];
if (val === 'all') return 'all';
try {
return resolveHostArg(val) as Host;
} catch {
throw new Error(`Unknown host: ${val}. Use ${ALL_HOST_NAMES.join(', ')}, or all.`);
}
})();
// For single-host mode, HOST is the host. For --host all, it's set per iteration below.
let HOST: Host = HOST_ARG_VAL === 'all' ? 'claude' : HOST_ARG_VAL;
// ─── Model Overlay Selection ────────────────────────────────
// --model is explicit. We do NOT auto-detect from host (host ≠ model).
// Default is 'claude'. Missing overlay file → empty string (graceful).
import { ALL_MODEL_NAMES, resolveModel, type Model } from './models';
const MODEL_ARG = process.argv.find(a => a.startsWith('--model'));
const MODEL_ARG_VAL: Model = (() => {
if (!MODEL_ARG) return 'claude';
const val = MODEL_ARG.includes('=') ? MODEL_ARG.split('=')[1] : process.argv[process.argv.indexOf(MODEL_ARG) + 1];
const resolved = resolveModel(val);
if (!resolved) {
throw new Error(`Unknown model: ${val}. Use ${ALL_MODEL_NAMES.join(', ')}, or a family variant (e.g., claude-opus-4-7, gpt-5.4-mini, o3).`);
}
return resolved;
})();
// ─── Catalog Mode (v1.45.0.0 T4) ────────────────────────────
// 'trim' (default): shorten frontmatter description to lead sentence,
// move routing/voice prose into a "## When to invoke" body section, and
// emit scripts/proactive-suggestions.json (single file across all skills).
// 'full': legacy v1.44 behavior — full description stays in frontmatter.
const CATALOG_MODE_ARG = process.argv.find(a => a.startsWith('--catalog-mode'));
const CATALOG_MODE: 'trim' | 'full' = (() => {
if (!CATALOG_MODE_ARG) return 'trim';
const val = CATALOG_MODE_ARG.includes('=')
? CATALOG_MODE_ARG.split('=')[1]
: process.argv[process.argv.indexOf(CATALOG_MODE_ARG) + 1];
if (val !== 'trim' && val !== 'full') {
throw new Error(`Unknown catalog mode: ${val}. Use 'trim' (default) or 'full'.`);
}
return val;
})();
// ─── Explain-level Overlay ──────────────────────────────────
// --explain-level=terse compresses preamble prose (writing-style, completeness,
// confusion-protocol, context-health) to a single pointer line at gen time.
// Default keeps the runtime-conditional behavior (sections render unconditionally,
// the model skips them when EXPLAIN_LEVEL: terse appears in the preamble echo).
// Opt-in via the build flag so most users get the runtime-flexible default.
const EXPLAIN_LEVEL_ARG = process.argv.find(a => a.startsWith('--explain-level'));
const EXPLAIN_LEVEL: 'default' | 'terse' = (() => {
if (!EXPLAIN_LEVEL_ARG) return 'default';
const val = EXPLAIN_LEVEL_ARG.includes('=')
? EXPLAIN_LEVEL_ARG.split('=')[1]
: process.argv[process.argv.indexOf(EXPLAIN_LEVEL_ARG) + 1];
if (val !== 'default' && val !== 'terse') {
throw new Error(`Unknown explain level: ${val}. Use 'default' or 'terse'.`);
}
return val;
})();
// HostPaths, HOST_PATHS, and TemplateContext imported from ./resolvers/types (line 7-8)
// Design constants (AI_SLOP_BLACKLIST, OPENAI_HARD_REJECTIONS, OPENAI_LITMUS_CHECKS)
// live in ./resolvers/constants and are consumed by resolvers directly.
// ─── External Host Helpers ───────────────────────────────────
// Re-export local copy for use in this file (matches codex-helpers.ts)
// Accepts optional frontmatter name to support directory/invocation name divergence
function externalSkillName(skillDir: string, frontmatterName?: string): string {
// Root skill (skillDir === '' or '.') always maps to 'gstack' regardless of frontmatter
if (skillDir === '.' || skillDir === '') return 'gstack';
// Use frontmatter name when it differs from directory name (e.g., run-tests/ with name: test)
const baseName = frontmatterName && frontmatterName !== skillDir ? frontmatterName : skillDir;
// Don't double-prefix: gstack-upgrade → gstack-upgrade (not gstack-gstack-upgrade)
if (baseName.startsWith('gstack-')) return baseName;
return `gstack-${baseName}`;
}
function extractNameAndDescription(content: string): { name: string; description: string } {
const fmStart = content.indexOf('---\n');
if (fmStart !== 0) return { name: '', description: '' };
const fmEnd = content.indexOf('\n---', fmStart + 4);
if (fmEnd === -1) return { name: '', description: '' };
const frontmatter = content.slice(fmStart + 4, fmEnd);
const nameMatch = frontmatter.match(/^name:\s*(.+)$/m);
const name = nameMatch ? nameMatch[1].trim() : '';
let description = '';
const lines = frontmatter.split('\n');
let inDescription = false;
const descLines: string[] = [];
for (const line of lines) {
if (line.match(/^description:\s*\|?\s*$/)) {
inDescription = true;
continue;
}
if (line.match(/^description:\s*\S/)) {
description = line.replace(/^description:\s*/, '').trim();
break;
}
if (inDescription) {
if (line === '' || line.match(/^\s/)) {
descLines.push(line.replace(/^ /, ''));
} else {
break;
}
}
}
if (descLines.length > 0) {
description = descLines.join('\n').trim();
}
return { name, description };
}
// ─── Voice Trigger Processing ────────────────────────────────
/**
* Extract voice-triggers YAML list from frontmatter.
* Returns an array of trigger strings, or [] if no voice-triggers field.
*/
function extractVoiceTriggers(content: string): string[] {
const fmStart = content.indexOf('---\n');
if (fmStart !== 0) return [];
const fmEnd = content.indexOf('\n---', fmStart + 4);
if (fmEnd === -1) return [];
const frontmatter = content.slice(fmStart + 4, fmEnd);
const triggers: string[] = [];
let inVoice = false;
for (const line of frontmatter.split('\n')) {
if (/^voice-triggers:/.test(line)) { inVoice = true; continue; }
if (inVoice) {
const m = line.match(/^\s+-\s+"(.+)"$/);
if (m) triggers.push(m[1]);
else if (!/^\s/.test(line)) break;
}
}
return triggers;
}
/**
* Preprocess voice triggers: fold voice-triggers YAML field into description,
* then strip the field from frontmatter. Must run BEFORE transformFrontmatter
* and extractNameAndDescription so all hosts see the updated description.
*/
function processVoiceTriggers(content: string): string {
const triggers = extractVoiceTriggers(content);
if (triggers.length === 0) return content;
// Strip voice-triggers block from frontmatter
content = content.replace(/^voice-triggers:\n(?:\s+-\s+"[^"]*"\n?)*/m, '');
// Get current description (after stripping voice-triggers, so it's clean)
const { description } = extractNameAndDescription(content);
if (!description) return content;
// Build new description with voice triggers appended
const voiceLine = `Voice triggers (speech-to-text aliases): ${triggers.map(t => `"${t}"`).join(', ')}.`;
const newDescription = description + '\n' + voiceLine;
// Replace old indented description with new in frontmatter
const oldIndented = description.split('\n').map(l => ` ${l}`).join('\n');
const newIndented = newDescription.split('\n').map(l => ` ${l}`).join('\n');
content = content.replace(oldIndented, newIndented);
return content;
}
// Export for testing
export { extractVoiceTriggers, processVoiceTriggers };
// ─── Catalog Trim (v1.45.0.0 T4) ─────────────────────────────
//
// Frontmatter `description:` blocks today pack: a one-line outcome, "Use when
// asked to..." voice triggers, "Proactively..." routing guidance, and a
// "(gstack)" tag. This pile is the always-loaded catalog surface — every
// session pays for the full text. The catalog trim splits the description
// into a one-line catalog entry (lead sentence + "(gstack)") that stays in
// the frontmatter, and a "## When to invoke" body section that holds the
// routing/voice triggers prose for in-skill discovery. A registry written
// to scripts/proactive-suggestions.json (one entry per skill) makes routing
// available to agents that need it without paying the always-loaded cost.
//
// Opt-out: `--catalog-mode=full` keeps v1.44 behavior (no trim, full
// description in frontmatter). Use when debugging routing regressions or
// when shipping skills to hosts that depend on the legacy fat catalog.
export interface CatalogParts {
lead: string; // First sentence — kept in catalog
routingProse: string; // "Use when asked to...", "Proactively..." paragraphs
voiceLine: string | null; // "Voice triggers (speech-to-text aliases): ..." line if present
hasGstackTag: boolean;
}
export function splitCatalogDescription(description: string): CatalogParts {
// Voice triggers line (folded in by processVoiceTriggers earlier)
const voiceMatch = description.match(/Voice triggers \(speech-to-text aliases\):[^\n]+/);
const voiceLine = voiceMatch ? voiceMatch[0] : null;
let working = voiceLine ? description.replace(voiceLine, '').trim() : description.trim();
const hasGstackTag = /\(gstack\)/.test(working);
if (hasGstackTag) working = working.replace(/\(gstack\)/, '').trim();
// Lead = first sentence (up to first period followed by space or end of string).
// We tolerate sentences with embedded periods (URLs, "v1.45.0.0") by requiring
// the period to be followed by whitespace OR end-of-text.
// First normalize to single-line for sentence detection, then back out.
const collapsed = working.replace(/\s+/g, ' ').trim();
const sentenceMatch = collapsed.match(/^([^.!?]*[.!?])(?:\s|$)/);
// sentenceLead is the FULL first sentence (no truncation). We compute routing
// from this position, then optionally truncate the displayed lead afterwards.
// Truncating first then computing routing was the v1.45.0.0 bug — when the
// first sentence exceeded 200 chars, the routing extraction would lose the
// entire tail of the description (design-consultation's "Use when..."
// routing prose silently dropped).
const sentenceLead = sentenceMatch ? sentenceMatch[1].trim() : collapsed.split(/\s/).slice(0, 20).join(' ');
// Routing prose: everything AFTER the first sentence boundary in the collapsed view.
const leadInCollapsed = collapsed.indexOf(sentenceLead);
const routingCollapsed = leadInCollapsed >= 0
? collapsed.slice(leadInCollapsed + sentenceLead.length).trim()
: '';
// Now produce the displayed lead — truncated if too long. The original
// sentenceLead is preserved for routing extraction below.
let lead = sentenceLead;
if (lead.length > 200) {
const trunc = lead.slice(0, 197);
const lastSpace = trunc.lastIndexOf(' ');
lead = (lastSpace > 60 ? trunc.slice(0, lastSpace) : trunc) + '...';
}
// Restore line breaks for routing prose by mapping back to original layout.
// Use original whitespace structure where possible; fall back to collapsed.
// Anchor recovery on sentenceLead (the untruncated first sentence) — not
// `lead` (which may have a "..." suffix and won't substring-match `working`).
let routingProse = routingCollapsed;
const collapsedLeadIdx = working.replace(/\s+/g, ' ').indexOf(sentenceLead);
if (collapsedLeadIdx >= 0) {
let consumed = 0;
let cut = 0;
for (let i = 0; i < working.length && consumed < collapsedLeadIdx + sentenceLead.length; i++) {
if (/\s/.test(working[i])) {
if (i === 0 || /\s/.test(working[i - 1])) continue;
consumed += 1;
} else {
consumed += 1;
}
cut = i + 1;
}
const tail = working.slice(cut).trim();
if (tail.length > 0) routingProse = tail;
}
return { lead, routingProse, voiceLine, hasGstackTag };
}
/** Build the catalog-trimmed `description:` block. */
export function buildTrimmedDescription(parts: CatalogParts): string {
const lead = parts.lead.trim();
const suffix = parts.hasGstackTag ? ' (gstack)' : '';
return `${lead}${suffix}`;
}
/** Build the body section that holds the routing/voice prose. */
export function buildWhenToInvokeSection(parts: CatalogParts): string {
const lines: string[] = ['## When to invoke this skill', ''];
if (parts.routingProse) {
lines.push(parts.routingProse);
lines.push('');
}
if (parts.voiceLine) {
lines.push(parts.voiceLine);
lines.push('');
}
return lines.join('\n');
}
/**
* Apply catalog trim to a SKILL.md body:
* - shorten frontmatter `description:` to lead + (gstack)
* - insert "## When to invoke" body section AFTER the generated header
* (so it lands near the top of body content, where routing guidance
* belongs)
*
* Returns the rewritten content plus the parts (used for proactive-suggestions
* JSON aggregation at the end of the run).
*/
export function applyCatalogTrim(content: string, skillName: string): { content: string; parts: CatalogParts } | null {
// Locate description block in frontmatter
if (!content.startsWith('---\n')) return null;
const fmEnd = content.indexOf('\n---', 4);
if (fmEnd === -1) return null;
const frontmatter = content.slice(4, fmEnd);
// Match `description: |` block + indented body lines
const descMatch = frontmatter.match(/^description:\s*\|?\s*\n((?:\s{2,}.*(?:\n|$))+)/m)
|| frontmatter.match(/^description:\s+(.+)$/m);
if (!descMatch) return null;
// Extract full description text
let descText: string;
if (descMatch[0].startsWith('description: |') || /^description:\s*\|/.test(descMatch[0])) {
descText = descMatch[1].split('\n').map(l => l.replace(/^\s{2}/, '')).join('\n').trim();
} else {
descText = descMatch[1].trim();
}
// Skip skills with very short descriptions (already trimmed or no routing prose).
// Below ~120 chars, splitting adds no value.
if (descText.length < 120) return null;
const parts = splitCatalogDescription(descText);
// If lead + (gstack) is already most of the text, no trim needed.
const trimmedLen = buildTrimmedDescription(parts).length;
if (trimmedLen >= descText.length - 20) return null;
// Replace description in frontmatter — keep trailing newline so the next
// YAML field doesn't collide on the same line as the description value.
const newDesc = buildTrimmedDescription(parts);
const newFrontmatter = frontmatter.replace(descMatch[0], `description: ${newDesc}\n`);
let newContent = '---\n' + newFrontmatter + content.slice(fmEnd);
// Insert body section after frontmatter (after the closing ---\n and any
// existing GENERATED header). We insert before the first non-comment line.
const bodyStart = newContent.indexOf('\n---\n') + 5;
const whenToInvoke = '\n' + buildWhenToInvokeSection(parts).trim() + '\n';
// Skip past the generated header if present (it lives after frontmatter close)
const headerMatch = newContent.slice(bodyStart).match(/^(<!--[^>]*-->\s*\n)+/);
const insertAt = bodyStart + (headerMatch ? headerMatch[0].length : 0);
newContent = newContent.slice(0, insertAt) + whenToInvoke + '\n' + newContent.slice(insertAt);
return { content: newContent, parts };
}
const OPENAI_SHORT_DESCRIPTION_LIMIT = 120;
function condenseOpenAIShortDescription(description: string): string {
const firstParagraph = description.split(/\n\s*\n/)[0] || description;
const collapsed = firstParagraph.replace(/\s+/g, ' ').trim();
if (collapsed.length <= OPENAI_SHORT_DESCRIPTION_LIMIT) return collapsed;
const truncated = collapsed.slice(0, OPENAI_SHORT_DESCRIPTION_LIMIT - 3);
const lastSpace = truncated.lastIndexOf(' ');
const safe = lastSpace > 40 ? truncated.slice(0, lastSpace) : truncated;
return `${safe}...`;
}
function generateOpenAIYaml(displayName: string, shortDescription: string): string {
return `interface:
display_name: ${JSON.stringify(displayName)}
short_description: ${JSON.stringify(shortDescription)}
default_prompt: ${JSON.stringify(`Use ${displayName} for this task.`)}
policy:
allow_implicit_invocation: true
`;
}
/**
* Transform frontmatter for external hosts.
* Claude: strips `sensitive:` field (only Factory uses it).
* Codex: keeps name + description only, enforces 1024-char limit.
* Factory: keeps name + description + user-invocable, conditionally adds disable-model-invocation.
*/
function transformFrontmatter(content: string, host: Host): string {
const hostConfig = getHostConfig(host);
const fm = hostConfig.frontmatter;
if (fm.mode === 'denylist') {
// Denylist mode: strip listed fields, keep everything else
for (const field of fm.stripFields || []) {
if (field === 'voice-triggers') {
content = content.replace(/^voice-triggers:\n(?:\s+-\s+"[^"]*"\n?)*/m, '');
} else {
content = content.replace(new RegExp(`^${field}:\\s*.*\\n`, 'm'), '');
}
}
return content;
}
// Allowlist mode: reconstruct frontmatter with only allowed fields
const fmStart = content.indexOf('---\n');
if (fmStart !== 0) return content;
const fmEnd = content.indexOf('\n---', fmStart + 4);
if (fmEnd === -1) return content;
const frontmatter = content.slice(fmStart + 4, fmEnd);
const body = content.slice(fmEnd + 4);
const { name, description } = extractNameAndDescription(content);
// Description limit enforcement
if (fm.descriptionLimit) {
const behavior = fm.descriptionLimitBehavior || 'error';
if (description.length > fm.descriptionLimit) {
if (behavior === 'error') {
throw new Error(
`${hostConfig.displayName} description for "${name}" is ${description.length} chars (max ${fm.descriptionLimit}). ` +
`Compress the description in the .tmpl file.`
);
} else if (behavior === 'warn') {
console.warn(`WARNING: ${hostConfig.displayName} description for "${name}" exceeds ${fm.descriptionLimit} chars`);
}
// 'truncate' — silently proceed
}
}
// Build frontmatter with allowed fields
const indentedDesc = description.split('\n').map(l => ` ${l}`).join('\n');
let newFm = `---\nname: ${name}\ndescription: |\n${indentedDesc}\n`;
// Add extra fields (host-wide)
if (fm.extraFields) {
for (const [key, value] of Object.entries(fm.extraFields)) {
if (key !== 'name' && key !== 'description') {
newFm += `${key}: ${value}\n`;
}
}
}
// Add conditional fields
if (fm.conditionalFields) {
for (const rule of fm.conditionalFields) {
const match = Object.entries(rule.if).every(([k, v]) =>
new RegExp(`^${k}:\\s*${v}`, 'm').test(frontmatter)
);
if (match) {
for (const [key, value] of Object.entries(rule.add)) {
newFm += `${key}: ${value}\n`;
}
}
}
}
// Preserve additional keepFields beyond name and description
if (fm.keepFields) {
for (const field of fm.keepFields) {
if (field === 'name' || field === 'description') continue;
// Match YAML field with possible multi-line/array value (indented lines after colon)
const fieldMatch = frontmatter.match(new RegExp(`^${field}:(.*(?:\\n(?:[ \\t]+.+))*)`, 'm'));
if (fieldMatch) {
newFm += `${field}:${fieldMatch[1]}\n`;
}
}
}
// Rename fields (copy values from template frontmatter with new keys)
if (fm.renameFields) {
for (const [oldName, newName] of Object.entries(fm.renameFields)) {
const fieldMatch = frontmatter.match(new RegExp(`^${oldName}:(.+(?:\\n(?:\\s+.+)*)?)`, 'm'));
if (fieldMatch) {
newFm += `${newName}:${fieldMatch[1]}\n`;
}
}
}
newFm += '---';
return newFm + body;
}
/**
* Extract hook descriptions from frontmatter for inline safety prose.
* Returns a description of what the hooks do, or null if no hooks.
*/
function extractHookSafetyProse(tmplContent: string): string | null {
if (!tmplContent.match(/^hooks:/m)) return null;
// Parse the hook matchers to build a human-readable safety description
const matchers: string[] = [];
const matcherRegex = /matcher:\s*"(\w+)"/g;
let m;
while ((m = matcherRegex.exec(tmplContent)) !== null) {
if (!matchers.includes(m[1])) matchers.push(m[1]);
}
if (matchers.length === 0) return null;
// Build safety prose based on what tools are hooked
const toolDescriptions: Record<string, string> = {
Bash: 'check bash commands for destructive operations (rm -rf, DROP TABLE, force-push, git reset --hard, etc.) before execution',
Edit: 'verify file edits are within the allowed scope boundary before applying',
Write: 'verify file writes are within the allowed scope boundary before applying',
};
const safetyChecks = matchers
.map(t => toolDescriptions[t] || `check ${t} operations for safety`)
.join(', and ');
return `> **Safety Advisory:** This skill includes safety checks that ${safetyChecks}. When using this skill, always pause and verify before executing potentially destructive operations. If uncertain about a command's safety, ask the user for confirmation before proceeding.`;
}
// ─── External Host Config (now derived from hosts/*.ts) ──────
// EXTERNAL_HOST_CONFIG replaced by getHostConfig() from hosts/index.ts
// ─── Template Processing ────────────────────────────────────
const GENERATED_HEADER = `<!-- AUTO-GENERATED from {{SOURCE}} — do not edit directly -->\n<!-- Regenerate: bun run gen:skill-docs -->\n`;
/**
* Process external host output: routing, frontmatter, path rewrites, metadata.
* Shared between Codex and Factory (and future external hosts).
*/
function processExternalHost(
content: string,
tmplContent: string,
host: Host,
skillDir: string,
extractedDescription: string,
ctx: TemplateContext,
frontmatterName?: string,
): { content: string; outputPath: string; outputDir: string; symlinkLoop: boolean } {
const hostConfig = getHostConfig(host);
const name = externalSkillName(skillDir === '.' ? '' : skillDir, frontmatterName);
const outputDir = path.join(ROOT, hostConfig.hostSubdir, 'skills', name);
fs.mkdirSync(outputDir, { recursive: true });
const outputPath = path.join(outputDir, 'SKILL.md');
// Guard against symlink loops
let symlinkLoop = false;
const claudePath = ctx.tmplPath.replace(/\.tmpl$/, '');
try {
const resolvedClaude = fs.realpathSync(claudePath);
const resolvedExternal = fs.realpathSync(path.dirname(outputPath)) + '/' + path.basename(outputPath);
if (resolvedClaude === resolvedExternal) {
symlinkLoop = true;
}
} catch {
// realpathSync fails if file doesn't exist yet — no symlink loop
}
// Extract hook safety prose BEFORE transforming frontmatter (which strips hooks)
const safetyProse = extractHookSafetyProse(tmplContent);
// Transform frontmatter (host-aware)
let result = transformFrontmatter(content, host);
// Insert safety advisory at the top of the body (after frontmatter)
if (safetyProse) {
const bodyStart = result.indexOf('\n---') + 4;
result = result.slice(0, bodyStart) + '\n' + safetyProse + '\n' + result.slice(bodyStart);
}
// Config-driven path rewrites (order matters, replaceAll)
for (const rewrite of hostConfig.pathRewrites) {
result = result.replaceAll(rewrite.from, rewrite.to);
}
// Config-driven tool rewrites
if (hostConfig.toolRewrites) {
for (const [from, to] of Object.entries(hostConfig.toolRewrites)) {
result = result.replaceAll(from, to);
}
}
// Config-driven: generate metadata (e.g., openai.yaml for Codex)
if (hostConfig.generation.generateMetadata && !symlinkLoop) {
const agentsDir = path.join(outputDir, 'agents');
fs.mkdirSync(agentsDir, { recursive: true });
const shortDescription = condenseOpenAIShortDescription(extractedDescription);
fs.writeFileSync(path.join(agentsDir, 'openai.yaml'), generateOpenAIYaml(name, shortDescription));
}
return { content: result, outputPath, outputDir, symlinkLoop };
}
function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: string; content: string; symlinkLoop?: boolean; catalogParts?: CatalogParts | null } {
const tmplContent = fs.readFileSync(tmplPath, 'utf-8');
const relTmplPath = path.relative(ROOT, tmplPath);
let outputPath = tmplPath.replace(/\.tmpl$/, '');
// Determine skill directory relative to ROOT
const skillDir = path.relative(ROOT, path.dirname(tmplPath));
// Extract skill name from frontmatter early — needed for both TemplateContext and external host output paths.
// When frontmatter name: differs from directory name (e.g., run-tests/ with name: test),
// the frontmatter name is used for external skill naming and setup script symlinks.
const { name: extractedName, description: extractedDescription } = extractNameAndDescription(tmplContent);
const skillName = extractedName || path.basename(path.dirname(tmplPath));
// Extract benefits-from list from frontmatter (inline YAML: benefits-from: [a, b])
const benefitsMatch = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m);
const benefitsFrom = benefitsMatch
? benefitsMatch[1].split(',').map(s => s.trim()).filter(Boolean)
: undefined;
// Extract preamble-tier from frontmatter (1-4, controls which preamble sections are included)
const tierMatch = tmplContent.match(/^preamble-tier:\s*(\d+)$/m);
const preambleTier = tierMatch ? parseInt(tierMatch[1], 10) : undefined;
// Extract interactive flag from frontmatter (generator-only; controls plan-mode handshake inclusion)
const interactiveMatch = tmplContent.match(/^interactive:\s*(true|false)\s*$/m);
const interactive = interactiveMatch ? interactiveMatch[1] === 'true' : undefined;
const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host], preambleTier, model: MODEL_ARG_VAL, interactive, explainLevel: EXPLAIN_LEVEL };
// Replace placeholders (supports parameterized: {{NAME:arg1:arg2}})
// Config-driven: suppressedResolvers return empty string for this host
const currentHostConfig = getHostConfig(host);
const suppressed = new Set(currentHostConfig.suppressedResolvers || []);
let content = tmplContent.replace(/\{\{(\w+(?::[^}]+)?)\}\}/g, (match, fullKey) => {
const parts = fullKey.split(':');
const resolverName = parts[0];
const args = parts.slice(1);
if (suppressed.has(resolverName)) return '';
const entry = RESOLVERS[resolverName];
if (!entry) throw new Error(`Unknown placeholder {{${resolverName}}} in ${relTmplPath}`);
const { resolve, appliesTo } = unwrapResolver(entry);
if (appliesTo && !appliesTo(ctx)) return '';
return args.length > 0 ? resolve(ctx, args) : resolve(ctx);
});
// Check for any remaining unresolved placeholders
const remaining = content.match(/\{\{(\w+(?::[^}]+)?)\}\}/g);
if (remaining) {
throw new Error(`Unresolved placeholders in ${relTmplPath}: ${remaining.join(', ')}`);
}
// Preprocess voice triggers: fold into description, strip field from frontmatter.
// Must run BEFORE transformFrontmatter so all hosts see the updated description,
// and BEFORE extractedDescription is used by external host metadata.
content = processVoiceTriggers(content);
// Re-extract description AFTER voice trigger preprocessing so Codex openai.yaml
// metadata gets the updated description with voice triggers included.
const postProcessDescription = extractNameAndDescription(content).description;
// For Claude: strip sensitive: field (only Factory uses it)
// For external hosts: route output, transform frontmatter, rewrite paths
let symlinkLoop = false;
if (host === 'claude') {
content = transformFrontmatter(content, host);
} else {
const result = processExternalHost(content, tmplContent, host, skillDir, postProcessDescription, ctx, extractedName || undefined);
content = result.content;
outputPath = result.outputPath;
symlinkLoop = result.symlinkLoop;
}
// Prepend generated header (after frontmatter)
const header = GENERATED_HEADER.replace('{{SOURCE}}', path.basename(tmplPath));
const fmEnd = content.indexOf('---', content.indexOf('---') + 3);
if (fmEnd !== -1) {
const insertAt = content.indexOf('\n', fmEnd) + 1;
content = content.slice(0, insertAt) + header + content.slice(insertAt);
} else {
content = header + content;
}
// Catalog trim (Claude only — external hosts have their own frontmatter shapes)
let catalogParts: CatalogParts | null = null;
if (host === 'claude' && CATALOG_MODE === 'trim') {
const trimmed = applyCatalogTrim(content, skillName);
if (trimmed) {
content = trimmed.content;
catalogParts = trimmed.parts;
}
}
return { outputPath, content, symlinkLoop, catalogParts };
}
// ─── Main ───────────────────────────────────────────────────
function findTemplates(): string[] {
return discoverTemplates(ROOT).map(t => path.join(ROOT, t.tmpl));
}
const ALL_HOSTS: Host[] = ALL_HOST_NAMES as Host[];
const hostsToRun: Host[] = HOST_ARG_VAL === 'all' ? ALL_HOSTS : [HOST];
const failures: { host: string; error: Error }[] = [];
for (const currentHost of hostsToRun) {
HOST = currentHost;
try {
let hasChanges = false;
const tokenBudget: Array<{ skill: string; lines: number; tokens: number }> = [];
// T4 catalog trim: collect routing/voice parts across all Claude skills,
// then write scripts/proactive-suggestions.json once per gen-skill-docs run.
const proactiveAggregate: Record<string, {
lead: string;
routing: string;
voice_line: string | null;
}> = {};
const currentHostConfig = getHostConfig(currentHost);
for (const tmplPath of findTemplates()) {
const dir = path.basename(path.dirname(tmplPath));
// includeSkills allowlist (union logic: include minus skip)
if (currentHostConfig.generation.includeSkills?.length) {
if (!currentHostConfig.generation.includeSkills.includes(dir)) continue;
}
// skipSkills denylist (subtracts from includeSkills or full set)
if (currentHostConfig.generation.skipSkills?.length) {
if (currentHostConfig.generation.skipSkills.includes(dir)) continue;
}
const { outputPath, content, symlinkLoop, catalogParts } = processTemplate(tmplPath, currentHost);
if (catalogParts) {
// Root-skill detection: when the template lives at ROOT/SKILL.md.tmpl,
// path.basename(path.dirname(tmplPath)) returns the repo's directory
// name (e.g. "seville-v3" in a Conductor worktree, "gstack" on CI).
// That's non-deterministic across machines and breaks CI freshness
// checks. Use the frontmatter `name` field as the registry key — the
// root SKILL.md.tmpl declares `name: gstack` explicitly. For all other
// skills, `dir` matches the directory name which matches the
// frontmatter name by convention.
const isRoot = path.dirname(tmplPath) === ROOT;
const key = isRoot ? 'gstack' : dir;
proactiveAggregate[key] = {
lead: catalogParts.lead,
routing: catalogParts.routingProse,
voice_line: catalogParts.voiceLine,
};
}
const relOutput = path.relative(ROOT, outputPath);
if (symlinkLoop) {
console.log(`SKIPPED (symlink loop): ${relOutput}`);
} else if (DRY_RUN) {
const existing = fs.existsSync(outputPath) ? fs.readFileSync(outputPath, 'utf-8') : '';
if (existing !== content) {
console.log(`STALE: ${relOutput}`);
hasChanges = true;
} else {
console.log(`FRESH: ${relOutput}`);
}
} else {
fs.writeFileSync(outputPath, content);
console.log(`GENERATED: ${relOutput}`);
}
// Track token budget
const lines = content.split('\n').length;
const tokens = Math.round(content.length / 4); // ~4 chars per token
tokenBudget.push({ skill: relOutput, lines, tokens });
// Token ceiling check: warn if any generated SKILL.md exceeds ~40K tokens (160KB).
// The ceiling is a "watch for feature bloat" guardrail, not a hard gate. Modern
// flagship models have 200K-1M context windows, so 40K (4-20% of window) is fine.
// Prompt caching further reduces the marginal cost of larger skills. This ceiling
// exists to catch a runaway preamble or resolver that's grown by 10K+ tokens in
// a release, not to force compression on carefully-tuned big skills (ship,
// plan-ceo-review, office-hours all legitimately pack 25-35K tokens of behavior).
const TOKEN_CEILING_BYTES = 160_000;
if (content.length > TOKEN_CEILING_BYTES) {
console.warn(`⚠️ TOKEN CEILING: ${relOutput} is ${content.length} bytes (~${tokens} tokens), exceeds ${TOKEN_CEILING_BYTES} byte ceiling (~40K tokens)`);
}
}
// Generate gstack-lite and gstack-full for OpenClaw host
if (currentHost === 'openclaw' && !DRY_RUN) {
const openclawDir = path.join(ROOT, 'openclaw');
if (!fs.existsSync(openclawDir)) fs.mkdirSync(openclawDir, { recursive: true });
const gstackLite = `# gstack-lite Planning Discipline
Injected by the orchestrator into spawned Claude Code sessions. Append to existing CLAUDE.md.
## Planning Discipline
1. Read every file you will modify. Understand existing patterns first.
2. Before writing code, state your plan: what, why, which files, test case, risk.
3. When ambiguous, prefer: completeness over shortcuts, existing patterns over new ones,
reversible choices over irreversible ones, safe defaults over clever ones.
4. Self-review your changes before reporting done. Check for: missed files, broken
imports, untested paths, style inconsistencies.
5. Report when done: what shipped, what decisions you made, anything uncertain.
`;
fs.writeFileSync(path.join(openclawDir, 'gstack-lite-CLAUDE.md'), gstackLite);
console.log('GENERATED: openclaw/gstack-lite-CLAUDE.md');
const gstackFull = `# gstack-full Pipeline
Injected by the orchestrator for complete feature builds. Append to existing CLAUDE.md.
## Full Pipeline
1. Read CLAUDE.md and understand the project context.
2. Run /autoplan to review your approach (CEO + eng + design review pipeline).
3. Implement the approved plan. Follow the planning discipline above.
4. Run /ship to create a PR with tests, changelog, and version bump.
5. Report back: PR URL, what shipped, decisions made, anything uncertain.
Do not ask for human input until the PR is ready for review.
`;
fs.writeFileSync(path.join(openclawDir, 'gstack-full-CLAUDE.md'), gstackFull);
console.log('GENERATED: openclaw/gstack-full-CLAUDE.md');
const gstackPlan = `# gstack-plan: Full Review Gauntlet
Injected by the orchestrator when the user wants to plan a Claude Code project.
Append to existing CLAUDE.md.
## Planning Pipeline
1. Read CLAUDE.md and understand the project context.
2. Run /office-hours to produce a design doc (problem statement, premises, alternatives).
3. Run /autoplan to review the design (CEO + eng + design + DX reviews + codex adversarial).
4. Save the final reviewed plan to a file the orchestrator can reference later.
Write it to: plans/<project-slug>-plan-<date>.md in the current repo.
Include the design doc, all review decisions, and the implementation sequence.
5. Report back to the orchestrator:
- Plan file path
- One-paragraph summary of what was designed and the key decisions
- List of accepted scope expansions (if any)
- Recommended next step (usually: spawn a new session with gstack-full to implement)
Do not implement anything. This is planning only.
The orchestrator will persist the plan link to its own memory/knowledge store.
`;
fs.writeFileSync(path.join(openclawDir, 'gstack-plan-CLAUDE.md'), gstackPlan);
console.log('GENERATED: openclaw/gstack-plan-CLAUDE.md');
}
if (DRY_RUN && hasChanges) {
console.error(`\nGenerated SKILL.md files are stale (${currentHost} host). Run: bun run gen:skill-docs --host ${currentHost}`);
if (HOST_ARG_VAL !== 'all') process.exit(1);
failures.push({ host: currentHost, error: new Error('Stale files detected') });
}
// T4 catalog trim: write aggregated proactive-suggestions.json (Claude only).
// The JSON registry lets agents pull voice triggers / routing prose for any
// skill on demand instead of paying for it always-loaded in the catalog.
//
// No timestamp field — keeps the file content-deterministic across runs so
// CI dry-run freshness checks don't flap on regen. If a per-run timestamp
// is ever needed for debugging, write it to a separate `.gen-stamp` file.
if (currentHost === 'claude' && CATALOG_MODE === 'trim' && Object.keys(proactiveAggregate).length > 0 && !DRY_RUN) {
const proactivePath = path.join(ROOT, 'scripts', 'proactive-suggestions.json');
// Sort keys alphabetically so the serialized JSON is identical across
// machines regardless of filesystem-iteration order. Without this, CI
// freshness checks fail when the local dev machine and CI runner
// discover templates in different orders.
const sortedSkills: typeof proactiveAggregate = {};
for (const key of Object.keys(proactiveAggregate).sort()) {
sortedSkills[key] = proactiveAggregate[key];
}
const payload = {
$schema: 'https://gstack.dev/schemas/proactive-suggestions.json',
catalog_mode: 'trim',
note: 'Routing / voice-trigger prose extracted from SKILL.md frontmatter descriptions during catalog trim. Loaded on demand when routing guidance is needed.',
skills: sortedSkills,
};
const serialized = JSON.stringify(payload, null, 2) + '\n';
// Only write if content actually changed — prevents needless touches that
// would flap CI freshness checks. Read existing file, compare, skip write
// when identical.
let existing = '';
try { existing = fs.readFileSync(proactivePath, 'utf-8'); } catch { /* first run */ }
if (existing !== serialized) {
fs.writeFileSync(proactivePath, serialized);
}
}
// Print token budget summary
if (!DRY_RUN && tokenBudget.length > 0) {
tokenBudget.sort((a, b) => b.lines - a.lines);
const totalLines = tokenBudget.reduce((s, t) => s + t.lines, 0);
const totalTokens = tokenBudget.reduce((s, t) => s + t.tokens, 0);
console.log('');
console.log(`Token Budget (${currentHost} host)`);
console.log('═'.repeat(60));
for (const t of tokenBudget) {
const hostSubdirs = ALL_HOST_CONFIGS.map(c => c.hostSubdir.replace('.', '\\.')).join('|');
const name = t.skill.replace(/\/SKILL\.md$/, '').replace(new RegExp(`^\\.(${hostSubdirs})\\/skills\\/`), '');
console.log(` ${name.padEnd(30)} ${String(t.lines).padStart(5)} lines ~${String(t.tokens).padStart(6)} tokens`);
}
console.log('─'.repeat(60));
console.log(` ${'TOTAL'.padEnd(30)} ${String(totalLines).padStart(5)} lines ~${String(totalTokens).padStart(6)} tokens`);
console.log('');
}
} catch (e) {
failures.push({ host: currentHost, error: e as Error });
console.error(`WARNING: ${currentHost} generation failed: ${(e as Error).message}`);
}
}
// --host all: report failures. Only exit(1) if claude failed.
if (failures.length > 0 && HOST_ARG_VAL === 'all') {
console.error(`\n${failures.length} host(s) failed: ${failures.map(f => f.host).join(', ')}`);
if (failures.some(f => f.host === 'claude')) process.exit(1);
}
// Single host dry-run failure already handled above
// After all hosts processed, warn if prefix patches may need re-applying
if (!DRY_RUN) {
try {
const configPath = path.join(process.env.HOME || '', '.gstack', 'config.yaml');
if (fs.existsSync(configPath)) {
const config = fs.readFileSync(configPath, 'utf-8');
if (/^skill_prefix:\s*true/m.test(config)) {
console.log('\nNote: skill_prefix is true. Run gstack-relink to re-apply name: patches.');
}
}
} catch { /* non-fatal */ }
}
// Regenerate gstack/llms.txt — single-file capability index for AI agents.
// Runs after SKILL.md generation so it sees current skill descriptions and
// browse command list. Wrapped in an IIFE so the await-import doesn't make
// this module async (test/gen-skill-docs.test.ts uses require() to pull
// extractVoiceTriggers/processVoiceTriggers, which fails on async modules).
// Freshness is asserted in test/llms-txt-shape.test.ts.
if (!DRY_RUN) {
void (async () => {
try {
const result = await writeLlmsTxt();
if (result.warnings.length > 0) {
for (const w of result.warnings) console.error(`[gen-llms-txt] WARN: ${w}`);
} else {
console.log(`[gen-llms-txt] gstack/llms.txt: ${result.skills.length} skills, ${result.browseCommands.length} browse commands`);
}
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
console.error(`[gen-llms-txt] FAILED: ${msg}`);
}
})();
}