Files
gstack/test/skill-validation.test.ts
T
Garry Tan 22f8c7f4e1 v1.46.0.0 feat: gstack v2 foundation — catalog tokens drop 56%, eval-first floor covers all 51 skills (#1712)
* docs(designs): add v2_PLAN.md — gstack v2 the lightest opinionated skill pack

The approved plan from /plan-ceo-review → /plan-eng-review → /codex×2 →
/plan-devex-review. Captures the v1.45/v2.0 hybrid release shape,
cathedral parity-eval suite, sequential v1.45 execution, sections/*.md.tmpl
pipeline, EVALS_BUDGET_HARD_CAP override path, and v2 launch copy specs.

This commit just lands the design doc. Implementation follows in the rest
of the v1.45.0.0 branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(parity): T0a — capture v1.44.1 baseline + capture helper + diff utility

Cathedral parity-eval suite primitive. captureBaseline() walks every
top-level SKILL.md and records bytes, lines, estimated tokens, frontmatter
description length, and eval coverage. diffBaselines() reports per-skill
delta + total corpus delta + catalog tokens delta.

Locks the v1.44.1 reference snapshot at test/fixtures/parity-baseline-v1.44.1.json.
After Phase A+B+C land, scripts/capture-baseline.ts --tag v1.45.0.0 produces
a comparable snapshot; diff supplies the real numbers the v2 CHANGELOG quotes.
Never invent baseline numbers; ship them only if they came from a real run.

v1.44.1 numbers captured this commit:
- 51 skills
- 2,847 KB total corpus
- ~9,319 catalog tokens (sum of description bytes / 4)
- top 3: ship 160 KB, plan-ceo-review 128 KB, office-hours 108 KB

Test plan:
- bun test test/helpers/capture-parity-baseline.test.ts passes 4/4
- The baseline JSON file is committed so reviewers can audit v1→v2 numbers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resolvers): T2 — ResolverEntry + appliesTo gate infrastructure

Adds the conditional-resolver-injection plumbing from the v2_PLAN A.1
step. Resolvers can now be either a bare ResolverFn (always fires, current
behavior) or a ResolverEntry { resolve, appliesTo? } (gated; appliesTo
returning false skips the resolver, substitutes empty string).

Why infrastructure-only: the audit during T0a confirmed most resolvers
don't need gating. The {{NAME}} placeholder system is already conditional
at the template level — a resolver only fires for skills that reference it.
The gate is for future use when a placeholder's audience needs a structural
guardrail beyond social convention, or when a sub-resolver inside a larger
composed resolver (e.g. preamble) needs per-skill skip.

scripts/gen-skill-docs.ts:444 now uses unwrapResolver() to handle both
shapes. RESOLVERS map signature widens from Record<string, ResolverFn>
to Record<string, ResolverValue>. All existing resolvers stay bare
functions and work unchanged.

Test plan:
- bun test test/resolver-entry.test.ts: 6 pass (gate plumbing + registry)
- bun test test/gen-skill-docs.test.ts: 389 pass (no regression)
- bun run gen:skill-docs --dry-run: all SKILL.md files FRESH (no diff)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(preamble): T3 — jargon dedup + terse-build flag (Phase A.2 + A.3)

A.2 jargon dedup: generate-writing-style.ts replaces the inlined 80-term
jargon list with a one-line pointer to scripts/jargon-list.json. The list
was duplicated into every tier-2+ skill (48 of 51 skills); inlining cost
was ~1.5 KB × 48 = ~70 KB across the corpus. Pointer cost is ~30 bytes per
skill. Agents Read the JSON once per session on first jargon term
encountered; thereafter the terms array is the canonical reference.

A.3 terse build flag: --explain-level=terse compresses preamble prose at
gen time. When the flag is set, writing-style collapses to a one-line
terse directive and completeness-section + confusion-protocol +
context-health are dropped entirely. The default build keeps the
runtime-conditional behavior intact (sections still render; the model
skips them when EXPLAIN_LEVEL: terse appears in the preamble echo). Terse
build is opt-in for users who want shipped skills to match their runtime
preference and avoid the per-session terse-mode dead prose.

TemplateContext gains an optional `explainLevel: 'default' | 'terse'`
field. Default builds set it to 'default'; --explain-level=terse sets
'terse'. Resolvers gate their output via `ctx?.explainLevel === 'terse'`.

Measured impact (default build, post-T3):
- Total corpus: 2,847 KB → 2,812 KB (saved 35 KB)
- ship.md: 160 → 159 KB
- plan-ceo-review.md: 128 → 127 KB
- Top 10 heaviest: all slightly smaller from jargon pointer

Larger compression lands in T4 (catalog trim) and T7 (atomic regen across
the full Phase A pipeline). The terse build path further compresses to
~711K tokens vs default ~725K (saved ~14K tokens corpus-wide).

Test plan:
- bun test test/gen-skill-docs.test.ts: 389 pass (no regression)
- bun test test/resolver-entry.test.ts: 6 pass
- bun test test/helpers/capture-parity-baseline.test.ts: 4 pass
- bun run gen:skill-docs --explain-level=terse: ship.md drops completeness +
  confusion-protocol + context-health sections; writing-style collapses to
  one-line terse directive

48 SKILL.md files updated (every tier-2+ skill picks up the jargon pointer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(catalog): T4 — catalog trim + proactive-suggestions.json (Phase A.4)

Shortens frontmatter `description:` in every Claude SKILL.md to a single
lead sentence + (gstack) tag. The routing prose ("Use when asked to...",
"Proactively suggest...") and voice triggers move to a "## When to invoke"
body section so they remain discoverable inside the skill. A per-run
registry at scripts/proactive-suggestions.json aggregates the routing/
voice text for all 52 skills so agents can pull guidance on demand
without paying for it in the always-loaded catalog.

Build flag --catalog-mode=full restores v1.44 legacy behavior (full
multi-line descriptions in frontmatter). Default is trim.

splitCatalogDescription() extracts: lead sentence, routing paragraphs,
voice-triggers line, (gstack) tag presence. Short descriptions (<120
chars, already trimmed) are skipped via a guard so re-runs are idempotent.

Measured impact (vs v1.44.1 baseline):
- Catalog tokens (sum of description bytes / 4): 9,319 → 4,045  (-56.6%)
- Total SKILL.md corpus bytes:                   2,915 KB → 2,880 KB (-1.2%)
- Routing prose preserved as in-skill "## When to invoke" sections
- 52 skill entries in scripts/proactive-suggestions.json (on-demand registry)

The corpus drop is small because catalog trim MOVES text from frontmatter
to body, it doesn't delete it. The headline win is the catalog: the
always-loaded system prompt surface drops by more than half.

Test plan:
- bun test test/gen-skill-docs.test.ts: 389 pass, 0 fail
- Manual: ship/SKILL.md frontmatter description is now ONE line ending
  with `(gstack)`; allowed-tools field on next line (YAML well-formed)
- Manual: scripts/proactive-suggestions.json contains 52 entries
- bun run gen:skill-docs --catalog-mode=full restores legacy behavior

53 files changed (52 SKILL.md across hosts + the new proactive-suggestions.json).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(budget): T5 — hard token budgets + override audit trail (Phase A.6)

Two new gate-tier guardrails for the v1.45.0.0 compression baseline:

1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget.
   Compares current state to test/fixtures/parity-baseline-v1.44.1.json.
   Three checks: per-skill (×1.05 default ratio), total corpus, and
   catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05
   not 1.0 because the T4 catalog trim moves text from frontmatter to a
   body section; small skills see a tiny body growth that's fine when
   offset by the much larger catalog-token win.

2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on
   per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella
   EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry,
   model price changes) before they amortize across PRs.

Both checks support an override path with audit trail:
   GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK"   — size
   EVALS_BUDGET_OVERRIDE_REASON="why this is OK"          — cost
Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with
timestamp + scope + reason + CI provenance (runner, branch, commit)
via test/helpers/budget-override.ts.

Why the override audit: a hard cap with no escape valve becomes
operationally hostile (legit price changes, longer transcripts, new
required evals can all blow the cap). An override with no audit becomes
"everyone overrides everything and the gate is theater." This module
ships the audit half so reviewers can see what was waived and why.

Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with
auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json
already in test/fixtures/).

Test plan:
- bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists)
- bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks)
- Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cso): T6 — pin must-preserve security phrases (Phase A.5)

cso/SKILL.md is a content-heavy security audit skill (75 KB after T3+T4).
Codex 2nd-pass critique #9: "cso exemption too broad ... should still get
resolver dedup, catalog trim, sectioning if safe, and targeted evals
around must-not-miss checks."

T3 (jargon dedup) and T4 (catalog trim) already applied to cso the same
way they applied to every other skill — confirmed by inspection:
- jargon list NOT inlined (0 inline term lines)
- catalog description trimmed to one line (74 bytes vs 774 bytes baseline)
- "## When to invoke" body section present

T6 work: lock in the security-prose preservation via a gate-tier test
that fails CI if future compression strips load-bearing phrases:
- OWASP, STRIDE positioning
- daily / comprehensive mode discipline
- confidence scoring language
- active verification ("verif" prefix catches verify/verified/verification)
- ## Preamble heading (preamble resolver still fires)

Also guards cso against accidental over-stripping: SKILL.md must stay
≥30 KB (currently 75 KB) — a sudden cliff would mean compression went
past the targeted-dedup line into structural removal.

No structural change to cso. Future Phase B sections/ work for cso
requires writing baseline parity tests FIRST per the v2_PLAN.md
sequencing.

Test plan:
- bun test test/cso-preserved.test.ts: 5 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(parity): T0b — cathedral parity-suite harness + invariant registry

Adds the harness that the v2_PLAN.md cathedral parity-eval suite is built
on. Compares CURRENT SKILL.md output to v1.44.1 baseline along three axes:

  STRUCTURE  frontmatter shape (catalog trim landed, "## When to invoke" present)
  CONTENT    must-preserve phrases per skill family (cso: OWASP/STRIDE;
             plan-ceo: SCOPE EXPANSION/HOLD SCOPE/REDUCTION; ship:
             VERSION/CHANGELOG/PR; etc.)
  SIZE       per-skill byte budget (maxSizeRatio + minBytes guards)

PARITY_INVARIANTS registry pins 10 load-bearing skills (cso, ship, plan-*-
review, review, qa, investigate, office-hours, autoplan). Each entry
declares what must NOT regress; future compression that strips these
phrases or shrinks a skill past its minBytes cliff fails CI.

Periodic-tier LLM-judge parity (paid, ~$0.20/skill) lands in v2.0.0.0
sections/ phase. Same registry, same harness, judge added on top.

Test plan:
- bun test test/parity-suite.test.ts: 10/10 invariants pass vs v1.44.1
- Per-skill failures get actionable per-line breakdown so a reviewer can
  see which phrase / heading / size limit went sideways

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(coverage): T1 — skill coverage matrix + structural-compliance floor

Phase 0 deliverable — eval-first foundation. Two new test files plus the
registry:

1. test/skill-coverage-matrix.ts — single source of truth mapping each
   skill to its gate-tier + periodic-tier test files. SKILL_COVERAGE
   record with 51 entries; every gstack skill on disk has at least one
   gate-tier entry.

2. test/skill-coverage-matrix.test.ts — CI gate. Asserts every skill on
   disk has a registry entry AND that gate[] is non-empty. Catches
   "skill added but eval not registered" the moment a new SKILL.md
   lands.

3. test/skill-coverage-floor.test.ts — per-skill structural compliance
   (FREE, file-IO only). For each of 51 skills, verifies:
   - SKILL.md exists
   - Frontmatter well-formed (name + description fields)
   - Catalog-trim contract (inline description ≤ 250 chars, or block form)
   - Generated header present (edit .tmpl, not .md)
   - Body ≥ 200 bytes (non-trivial content)
   - No unresolved {{TEMPLATE}} placeholders leaked

The "floor" is the minimum eval that every skill ships with. Skills that
need deeper behavioral testing get additional entries in their coverage
record (e.g., ship has skill-e2e-ship-idempotency + workflow + floor).
Future skills only need to add the floor entry and the matrix gate
unblocks them.

Codex 2nd-pass critique #1 mitigation: eval-first floor is structural
compliance (the testable part) — judgment-skill behavior gets layered
periodic-tier evals on top. We don't pretend the floor proves
correctness, only that the skill structurally compiles.

Test plan:
- bun test test/skill-coverage-matrix.test.ts: 4 pass (matrix shape + coverage)
- bun test test/skill-coverage-floor.test.ts: 309 pass (6 checks × 51 skills + 3 registry-level)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build(skills): T7 — atomic regenerate + capture v1.45.0.0 baseline

Final regen pass across all hosts after T1-T6 work landed. Captures the
v1.45.0.0 parity baseline at test/fixtures/parity-baseline-v1.45.0.0.json
for diffing against the v1.44.1 reference.

Measured deltas (real numbers from test/helpers/capture-parity-baseline.ts):

  Total SKILL.md corpus       2,847 KB → 2,813 KB        (-1.2%)
  Catalog tokens (always-loaded) ~9,319 → ~4,045 tokens   (-56.6%)
  Top 10 heaviest skills      0.5-1.0% drop each

The catalog token cut is the headline. It's the always-loaded surface,
i.e. tokens charged on every session start. Per-skill SKILL.md sizes
barely moved because T4 catalog trim MOVES routing prose from frontmatter
to a body "## When to invoke" section rather than deleting it — the
catalog wins without amputating discoverability.

The bigger per-skill compression lands in v2.0.0.0 (Phase B sections/
pattern on the 5 heavyweights). v1.45 is the foundation: eval-first
infrastructure + cheap wins.

scripts/proactive-suggestions.json regenerated with the latest 52 skills
listed (one-time write per gen-skill-docs run; aggregated catalog parts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.45.0.0 — gstack v2 foundation: catalog tokens drop 56%, eval-first floor

Bumps VERSION + package.json to 1.45.0.0. CHANGELOG entry covers what
shipped between v1.44.1 and this release: the cathedral parity-eval
foundation, conditional resolver injection plumbing, jargon dedup, terse
build flag, catalog trim with one-line frontmatter descriptions, hard
token + dollar budget gates with override audit, cso preservation pins,
and the v1.44.1 ↔ v1.45.0.0 parity baselines committed to test/fixtures/.

Numbers (measured, not estimated):
- Catalog tokens: ~9,319 → ~4,045  (-56.6%)
- Total corpus:   2,847 KB → 2,813 KB (-1.2%)
- Skills with gate-tier eval coverage: 32/51 → 51/51 (floor achieved)

This is the foundation release. v2.0.0.0 will ship the architectural
break (sections/*.md.tmpl pattern + mechanical Read enforcement +
eval-coverage annotations) as a coordinated marketing-grade launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(catalog): refresh proactive-suggestions.json timestamp after v1.45 bump

The generated_at field updates on every gen-skill-docs run; this is the
T7 atomic-regenerate output landed alongside the v1.45.0.0 bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalog): deterministic proactive-suggestions.json (no per-run timestamp)

Original implementation wrote a generated_at timestamp on every gen-skill-docs
run. That made CI dry-run freshness checks flap because the file changed on
every regeneration even when the actual content (skill descriptions, routing
prose, voice triggers) was unchanged.

Two fixes:
1. Drop the generated_at field. The file is purely a content registry now.
2. Only write the file when serialized content actually differs from disk.

Reproducible test: bun run gen:skill-docs twice in a row now leaves
scripts/proactive-suggestions.json unchanged on the second run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalog): preserve routing prose when first sentence exceeds 200 chars

splitCatalogDescription truncated the lead BEFORE computing routing
extraction, which meant skills whose first sentence was over 200 chars
(design-consultation: 207 chars) had their entire routing prose silently
dropped — the "## When to invoke" body section came out empty.

Root cause: routing was extracted via `collapsed.indexOf(lead)` after lead
was suffixed with "...". The "..." never appeared in the original string,
so indexOf returned -1 and routingProse fell back to empty.

Fix: compute routing from sentenceLead (the untruncated first sentence)
BEFORE truncating the displayed lead. The displayed lead still gets "..."
when over 200 chars, but the routing extraction uses the real boundary.

Also: refresh golden snapshots for claude/codex/factory ship and update
two unit tests that asserted v1.44 behavior:
- skill-validation.test.ts: trigger-phrase + proactive-routing tests now
  search whole content, not just frontmatter (T4 moved them to a body
  "## When to invoke" section)
- writing-style-resolver.test.ts: jargon-list assertion now expects the
  T3 reference pointer, not the inline list

Test plan:
- bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts
  test/host-config.test.ts test/skill-size-budget.test.ts
  test/parity-suite.test.ts test/skill-coverage-matrix.test.ts
  test/skill-coverage-floor.test.ts test/cso-preserved.test.ts
  test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts
  test/gen-skill-docs.test.ts: 1134 pass, 0 fail
- Manual verify: design-consultation/SKILL.md "## When to invoke this skill"
  body section now contains "Use when asked to..." + "Proactively suggest..."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalog): deterministic proactive-suggestions.json across machines

CI check-freshness failed because scripts/proactive-suggestions.json
serialized differently on local vs CI:

1. Root-skill key leaked the directory name. processTemplate's outer loop
   computed `dir = path.basename(path.dirname(tmplPath))`. For the root
   SKILL.md.tmpl at ROOT/SKILL.md.tmpl, that returns the repo-checkout
   directory name — "seville-v3" in a Conductor worktree, "gstack" on
   GitHub Actions, anything-else for a fork. Fix: detect root via
   `path.dirname(tmplPath) === ROOT` and hardcode the key to "gstack"
   for that one case.

2. Aggregate key order was filesystem-iteration order. discoverTemplates
   doesn't guarantee stable ordering across platforms, so the JSON
   `skills` object came out shuffled between machines. Fix: sort
   Object.keys(proactiveAggregate) alphabetically before serializing.

After the fix, the generated file is identical on every machine and
matches what's committed. CI freshness check (bun run gen:skill-docs &&
git diff --exit-code) now passes.

Test plan:
- bun run gen:skill-docs && bun run gen:skill-docs --dry-run: all FRESH
- node -e 'verify keys sorted': sorted match: true
- grep -c '"seville-v3"' scripts/proactive-suggestions.json: 0
- Focused test suite: 704 pass, 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(catalog): unit + regression coverage for catalog-trim helpers

Four exported functions in scripts/gen-skill-docs.ts handle every skill's
frontmatter rewrite at gen time but had zero unit tests. Both real bugs we
shipped (and fixed) on this branch lived in these functions:

  v1.45.0.0 design-consultation: when the first sentence exceeded 200 chars,
  routing-prose extraction lost the entire tail (anchored on truncated lead
  with "..." that didn't substring-match the original).

  v1.45.0.0 CI freshness: root-skill key leaked the checkout directory
  name ("seville-v3" vs "gstack") and aggregate order was filesystem-
  iteration order.

Both shapes are now regression-tested:

- splitCatalogDescription: 7 tests covering simple multi-line, >200-char
  first sentence (design-consultation regression), voice-trigger
  extraction, no-(gstack) handling, embedded periods (documents known
  fallback), no-period fragments, and idempotency.
- buildTrimmedDescription: 3 tests.
- buildWhenToInvokeSection: 3 tests.
- applyCatalogTrim: 4 tests covering the standard rewrite, no-op for
  already-short descriptions, the YAML-collision newline fix, and the
  malformed-frontmatter null return.
- proactive-suggestions.json determinism: 3 tests asserting sorted keys,
  root keyed as "gstack" (not the worktree directory), and no
  timestamp/generated_at field that would flap CI freshness.

Test plan:
- bun test test/catalog-trim.test.ts: 20 pass, 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(coverage): fill three remaining v1.46.0.0 test gaps

Three untested surfaces from the v1.46.0.0 work. All three would have
caught real bugs we shipped (and fixed) on this branch.

1. test/helpers/budget-override.test.ts — 7 tests pin the audit-trail
   contract for EVALS_BUDGET_OVERRIDE_REASON and
   GSTACK_SIZE_BUDGET_OVERRIDE_REASON. Without this, the audit logger
   could silently drop events and overrides become invisible. Tests
   cover: required fields per JSONL line, CI provenance capture
   (CI/GITHUB_ACTIONS/branch/commit), local-runner defaults,
   append-only behavior, missing-directory recovery, and unwritable-
   path resilience (logs warning instead of throwing).

2. test/terse-build.test.ts — 16 tests pin --explain-level=terse
   behavior across the 4 gated resolvers and the composed preamble.
   Default vs terse vs undefined-ctx all asserted. Without this, a
   refactor that breaks the explainLevel threading silently regresses
   the opt-in compression path; the runtime EXPLAIN_LEVEL: terse gate
   still works so users wouldn't notice. Tier-1 invariant pinned
   (terse-only-affects-tier-2+).

3. test/gen-skill-docs-idempotency.test.ts — 2 tests catch the class
   of bug behind the v1.45.0.0 timestamp flap. Two consecutive
   gen-skill-docs runs must produce byte-identical outputs across
   STABLE_OUTPUTS (proactive-suggestions.json, SKILL.md, ship/SKILL.md,
   plan-ceo-review/SKILL.md, office-hours/SKILL.md, gstack/llms.txt).
   --dry-run reports zero stale files after a fresh gen. CI freshness
   regressions surface as test failures BEFORE a PR is opened.

Test plan:
- bun test test/helpers/budget-override.test.ts: 7 pass
- bun test test/terse-build.test.ts: 16 pass
- bun test test/gen-skill-docs-idempotency.test.ts: 2 pass
- Full focused suite (15 test files): 1179 pass, 0 fail (+45 new tests
  vs the pre-fill baseline of 1134)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(coverage): close 5 remaining v1.46.0.0 test gaps (A-E)

Five behaviors that v1.46 ships but had no test coverage. All now pinned.

A) --host all idempotency (test/gen-skill-docs-idempotency.test.ts)
   The default test ran Claude host only. Non-Claude hosts (Codex, Factory,
   Cursor, OpenClaw, GBrain, Slate, OpenCode, Hermes, Kiro) each have their
   own output paths and could carry their own non-deterministic fields. We
   hit a "--host all needed for freshness check" mid-/ship. Now: two
   consecutive `bun run gen:skill-docs --host all` runs must produce
   byte-identical outputs across a per-host sample (.agents/, .cursor/,
   .factory/, .gbrain/). Catches per-host adapter regressions before CI.

B) --catalog-mode=full opt-out (test/catalog-mode-full.test.ts)
   The legacy escape hatch had zero tests. 6 new tests across two layers:
   static (CATALOG_MODE_ARG parsed; conditional gate present; default is
   "trim"; invalid value throws) + smoke (actual --catalog-mode=full run
   produces a multi-line `description: |` block + omits "## When to invoke"
   body section; mutates the working tree then restores in a finally block).

C) parity-baseline-v1.44.1.json integrity (test/parity-baseline-integrity.test.ts)
   The baseline is the source of every v1→v2 number cited in the
   CHANGELOG v1.46.0.0 entry. Anyone could edit it without test failure
   until now. 8 new tests pin: existence, tag, capturedFromCommit
   allowlist, expected v1.44 numbers (51 skills, ~2,915 KB, ~9,319
   catalog tokens), CHANGELOG references this file by path, per-skill
   shape, and a SHA256 byte-stability hash. Any edit fails with a clear
   "if intentional, update EXPECTED_HASH AND the CHANGELOG numbers" signal.

D) Live appliesTo gate end-to-end (test/resolver-entry.test.ts extended)
   The unwrapResolver unit tests covered the function; the gen-skill-docs.ts
   substitution loop that USES the gate had no integration coverage. 6 new
   tests simulate the exact 4-line shape from gen-skill-docs.ts:457-467
   against synthetic registries: plain-function fires unconditionally,
   gated fires when true / empty-string when false, mixed registries
   compose, parameterized resolvers respect gates, unknown resolvers throw.

E) Per-skill min-size floor (test/skill-size-budget.test.ts extended)
   The existing 200-byte body coverage-floor is a noise floor — a skill
   that lost 99.75% of content still passes. 1 new test asserts every
   skill stays ≥80% of its v1.44.1 baseline size (the parity-suite
   content invariants only covered 10 of 51 skills; the remaining 41
   were uncovered). SECTIONS_EXTRACTED hook in place for v2.0.0.0 when
   the sections/ pattern legitimately shrinks ship/plan-ceo/etc. past
   the floor.

Test plan:
- bun test focused 17-file suite: 1202 pass, 0 fail
  (+23 new tests vs the pre-fill 1179 baseline)
- catalog-mode=full mutates working tree then restores cleanly
- --host all idempotency runs two full gen passes in <1s on this machine

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 16:50:03 -07:00

1914 lines
76 KiB
TypeScript

import { describe, test, expect } from 'bun:test';
import { validateSkill, extractRemoteSlugPatterns, extractWeightsFromTable } from './helpers/skill-parser';
import { ALL_COMMANDS, COMMAND_DESCRIPTIONS, READ_COMMANDS, WRITE_COMMANDS, META_COMMANDS } from '../browse/src/commands';
import { SNAPSHOT_FLAGS } from '../browse/src/snapshot';
import * as fs from 'fs';
import * as path from 'path';
const ROOT = path.resolve(import.meta.dir, '..');
describe('SKILL.md command validation', () => {
test('all $B commands in SKILL.md are valid browse commands', () => {
const result = validateSkill(path.join(ROOT, 'SKILL.md'));
expect(result.invalid).toHaveLength(0);
expect(result.valid.length).toBeGreaterThan(0);
});
test('all snapshot flags in SKILL.md are valid', () => {
const result = validateSkill(path.join(ROOT, 'SKILL.md'));
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in browse/SKILL.md are valid browse commands', () => {
const result = validateSkill(path.join(ROOT, 'browse', 'SKILL.md'));
expect(result.invalid).toHaveLength(0);
expect(result.valid.length).toBeGreaterThan(0);
});
test('all snapshot flags in browse/SKILL.md are valid', () => {
const result = validateSkill(path.join(ROOT, 'browse', 'SKILL.md'));
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in qa/SKILL.md are valid browse commands', () => {
const qaSkill = path.join(ROOT, 'qa', 'SKILL.md');
if (!fs.existsSync(qaSkill)) return; // skip if missing
const result = validateSkill(qaSkill);
expect(result.invalid).toHaveLength(0);
});
test('all snapshot flags in qa/SKILL.md are valid', () => {
const qaSkill = path.join(ROOT, 'qa', 'SKILL.md');
if (!fs.existsSync(qaSkill)) return;
const result = validateSkill(qaSkill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in qa-only/SKILL.md are valid browse commands', () => {
const qaOnlySkill = path.join(ROOT, 'qa-only', 'SKILL.md');
if (!fs.existsSync(qaOnlySkill)) return;
const result = validateSkill(qaOnlySkill);
expect(result.invalid).toHaveLength(0);
});
test('all snapshot flags in qa-only/SKILL.md are valid', () => {
const qaOnlySkill = path.join(ROOT, 'qa-only', 'SKILL.md');
if (!fs.existsSync(qaOnlySkill)) return;
const result = validateSkill(qaOnlySkill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in plan-design-review/SKILL.md are valid browse commands', () => {
const skill = path.join(ROOT, 'plan-design-review', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.invalid).toHaveLength(0);
});
test('all snapshot flags in plan-design-review/SKILL.md are valid', () => {
const skill = path.join(ROOT, 'plan-design-review', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in design-review/SKILL.md are valid browse commands', () => {
const skill = path.join(ROOT, 'design-review', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.invalid).toHaveLength(0);
});
test('all snapshot flags in design-review/SKILL.md are valid', () => {
const skill = path.join(ROOT, 'design-review', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in design-consultation/SKILL.md are valid browse commands', () => {
const skill = path.join(ROOT, 'design-consultation', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.invalid).toHaveLength(0);
});
test('all snapshot flags in design-consultation/SKILL.md are valid', () => {
const skill = path.join(ROOT, 'design-consultation', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in autoplan/SKILL.md are valid browse commands', () => {
const skill = path.join(ROOT, 'autoplan', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.invalid).toHaveLength(0);
});
test('all snapshot flags in autoplan/SKILL.md are valid', () => {
const skill = path.join(ROOT, 'autoplan', 'SKILL.md');
if (!fs.existsSync(skill)) return;
const result = validateSkill(skill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
});
describe('Command registry consistency', () => {
test('COMMAND_DESCRIPTIONS covers all commands in sets', () => {
const allCmds = new Set([...READ_COMMANDS, ...WRITE_COMMANDS, ...META_COMMANDS]);
const descKeys = new Set(Object.keys(COMMAND_DESCRIPTIONS));
for (const cmd of allCmds) {
expect(descKeys.has(cmd)).toBe(true);
}
});
test('COMMAND_DESCRIPTIONS has no extra commands not in sets', () => {
const allCmds = new Set([...READ_COMMANDS, ...WRITE_COMMANDS, ...META_COMMANDS]);
for (const key of Object.keys(COMMAND_DESCRIPTIONS)) {
expect(allCmds.has(key)).toBe(true);
}
});
test('ALL_COMMANDS matches union of all sets', () => {
const union = new Set([...READ_COMMANDS, ...WRITE_COMMANDS, ...META_COMMANDS]);
expect(ALL_COMMANDS.size).toBe(union.size);
for (const cmd of union) {
expect(ALL_COMMANDS.has(cmd)).toBe(true);
}
});
test('SNAPSHOT_FLAGS option keys are valid SnapshotOptions fields', () => {
const validKeys = new Set([
'interactive', 'compact', 'depth', 'selector',
'diff', 'annotate', 'outputPath', 'cursorInteractive',
'heatmap',
]);
for (const flag of SNAPSHOT_FLAGS) {
expect(validKeys.has(flag.optionKey)).toBe(true);
}
});
});
describe('Usage string consistency', () => {
// Normalize a usage string to its structural skeleton for comparison.
// Replaces <param-names> with <>, [optional] with [], strips parenthetical hints.
// This catches format mismatches (e.g., <name>:<value> vs <name> <value>)
// without tripping on abbreviation differences (e.g., <sel> vs <selector>).
function skeleton(usage: string): string {
return usage
.replace(/\(.*?\)/g, '') // strip parenthetical hints like (e.g., Enter, Tab)
.replace(/<[^>]*>/g, '<>') // normalize <param-name> → <>
.replace(/\[[^\]]*\]/g, '[]') // normalize [optional] → []
.replace(/\s+/g, ' ') // collapse whitespace
.trim();
}
// Cross-check Usage: patterns in implementation against COMMAND_DESCRIPTIONS
test('implementation Usage: structural format matches COMMAND_DESCRIPTIONS', () => {
const implFiles = [
path.join(ROOT, 'browse', 'src', 'write-commands.ts'),
path.join(ROOT, 'browse', 'src', 'read-commands.ts'),
path.join(ROOT, 'browse', 'src', 'meta-commands.ts'),
];
// Extract "Usage: browse <pattern>" from throw new Error(...) calls
const usagePattern = /throw new Error\(['"`]Usage:\s*browse\s+(.+?)['"`]\)/g;
const implUsages = new Map<string, string>();
for (const file of implFiles) {
const content = fs.readFileSync(file, 'utf-8');
let match;
while ((match = usagePattern.exec(content)) !== null) {
const usage = match[1].split('\\n')[0].trim();
const cmd = usage.split(/\s/)[0];
implUsages.set(cmd, usage);
}
}
// Compare structural skeletons
const mismatches: string[] = [];
for (const [cmd, implUsage] of implUsages) {
const desc = COMMAND_DESCRIPTIONS[cmd];
if (!desc) continue;
if (!desc.usage) continue;
const descSkel = skeleton(desc.usage);
const implSkel = skeleton(implUsage);
if (descSkel !== implSkel) {
mismatches.push(`${cmd}: docs "${desc.usage}" (${descSkel}) vs impl "${implUsage}" (${implSkel})`);
}
}
expect(mismatches).toEqual([]);
});
});
describe('Generated SKILL.md freshness', () => {
test('no unresolved {{placeholders}} in generated SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
const unresolved = content.match(/\{\{\w+\}\}/g);
expect(unresolved).toBeNull();
});
test('no unresolved {{placeholders}} in generated browse/SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'browse', 'SKILL.md'), 'utf-8');
const unresolved = content.match(/\{\{\w+\}\}/g);
expect(unresolved).toBeNull();
});
test('generated SKILL.md has AUTO-GENERATED header', () => {
const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
expect(content).toContain('AUTO-GENERATED');
});
});
// --- Update check preamble validation ---
describe('Update check preamble', () => {
const skillsWithUpdateCheck = [
'SKILL.md', 'browse/SKILL.md', 'qa/SKILL.md',
'qa-only/SKILL.md',
'setup-browser-cookies/SKILL.md',
'ship/SKILL.md', 'review/SKILL.md',
'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
'retro/SKILL.md',
'office-hours/SKILL.md', 'investigate/SKILL.md',
'plan-design-review/SKILL.md',
'design-review/SKILL.md',
'design-consultation/SKILL.md',
'document-release/SKILL.md',
'canary/SKILL.md',
'benchmark/SKILL.md',
'land-and-deploy/SKILL.md',
'setup-deploy/SKILL.md',
'cso/SKILL.md',
];
for (const skill of skillsWithUpdateCheck) {
test(`${skill} update check line ends with || true`, () => {
const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
// The second line of the bash block must end with || true
// to avoid exit code 1 when _UPD is empty (up to date)
const match = content.match(/\[ -n "\$_UPD" \].*$/m);
expect(match).not.toBeNull();
expect(match![0]).toContain('|| true');
});
}
test('all skills with update check are generated from .tmpl', () => {
for (const skill of skillsWithUpdateCheck) {
const tmplPath = path.join(ROOT, skill + '.tmpl');
expect(fs.existsSync(tmplPath)).toBe(true);
}
});
test('update check bash block exits 0 when up to date', () => {
// Simulate the exact preamble command from SKILL.md
const result = Bun.spawnSync(['bash', '-c',
'_UPD=$(echo "" || true); [ -n "$_UPD" ] && echo "$_UPD" || true'
], { stdout: 'pipe', stderr: 'pipe' });
expect(result.exitCode).toBe(0);
});
test('update check bash block exits 0 when upgrade available', () => {
const result = Bun.spawnSync(['bash', '-c',
'_UPD=$(echo "UPGRADE_AVAILABLE 0.3.3 0.4.0" || true); [ -n "$_UPD" ] && echo "$_UPD" || true'
], { stdout: 'pipe', stderr: 'pipe' });
expect(result.exitCode).toBe(0);
expect(result.stdout.toString().trim()).toBe('UPGRADE_AVAILABLE 0.3.3 0.4.0');
});
});
// --- Part 7: Cross-skill path consistency (A1) ---
describe('Cross-skill path consistency', () => {
test('REMOTE_SLUG derivation pattern is identical across files that use it', () => {
const patterns = extractRemoteSlugPatterns(ROOT, ['qa', 'review']);
const allPatterns: string[] = [];
for (const [, filePatterns] of patterns) {
allPatterns.push(...filePatterns);
}
// Should find at least 2 occurrences (qa/SKILL.md + review/greptile-triage.md)
expect(allPatterns.length).toBeGreaterThanOrEqual(2);
// All occurrences must be character-for-character identical
const unique = new Set(allPatterns);
if (unique.size > 1) {
const variants = Array.from(unique);
throw new Error(
`REMOTE_SLUG pattern differs across files:\n` +
variants.map((v, i) => ` ${i + 1}: ${v}`).join('\n')
);
}
});
test('all greptile-history write references specify both per-project and global paths', () => {
const filesToCheck = [
'review/SKILL.md',
'ship/SKILL.md',
'review/greptile-triage.md',
];
for (const file of filesToCheck) {
const filePath = path.join(ROOT, file);
if (!fs.existsSync(filePath)) continue;
const content = fs.readFileSync(filePath, 'utf-8');
const hasBoth = (content.includes('per-project') && content.includes('global')) ||
(content.includes('$REMOTE_SLUG/greptile-history') && content.includes('~/.gstack/greptile-history'));
expect(hasBoth).toBe(true);
}
});
test('greptile-triage.md contains both project and global history paths', () => {
const content = fs.readFileSync(path.join(ROOT, 'review', 'greptile-triage.md'), 'utf-8');
expect(content).toContain('$REMOTE_SLUG/greptile-history.md');
expect(content).toContain('~/.gstack/greptile-history.md');
});
test('retro/SKILL.md reads global greptile-history (not per-project)', () => {
const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
expect(content).toContain('~/.gstack/greptile-history.md');
// Should NOT reference per-project path for reads
expect(content).not.toContain('$REMOTE_SLUG/greptile-history.md');
});
});
// --- Part 7: QA skill structure validation (A2) ---
describe('QA skill structure validation', () => {
const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
test('qa/SKILL.md has all 11 phases', () => {
const phases = [
'Phase 1', 'Initialize',
'Phase 2', 'Authenticate',
'Phase 3', 'Orient',
'Phase 4', 'Explore',
'Phase 5', 'Document',
'Phase 6', 'Wrap Up',
'Phase 7', 'Triage',
'Phase 8', 'Fix Loop',
'Phase 9', 'Final QA',
'Phase 10', 'Report',
'Phase 11', 'TODOS',
];
for (const phase of phases) {
expect(qaContent).toContain(phase);
}
});
test('has all four QA modes defined', () => {
const modes = [
'Diff-aware',
'Full',
'Quick',
'Regression',
];
for (const mode of modes) {
expect(qaContent).toContain(mode);
}
// Mode triggers/flags
expect(qaContent).toContain('--quick');
expect(qaContent).toContain('--regression');
});
test('has all three tiers defined', () => {
const tiers = ['Quick', 'Standard', 'Exhaustive'];
for (const tier of tiers) {
expect(qaContent).toContain(tier);
}
});
test('health score weights sum to 100%', () => {
const weights = extractWeightsFromTable(qaContent);
expect(weights.size).toBeGreaterThan(0);
let sum = 0;
for (const pct of weights.values()) {
sum += pct;
}
expect(sum).toBe(100);
});
test('health score has all 8 categories', () => {
const weights = extractWeightsFromTable(qaContent);
const expectedCategories = [
'Console', 'Links', 'Visual', 'Functional',
'UX', 'Performance', 'Content', 'Accessibility',
];
for (const cat of expectedCategories) {
expect(weights.has(cat)).toBe(true);
}
expect(weights.size).toBe(8);
});
test('has four mode definitions (Diff-aware/Full/Quick/Regression)', () => {
expect(qaContent).toContain('### Diff-aware');
expect(qaContent).toContain('### Full');
expect(qaContent).toContain('### Quick');
expect(qaContent).toContain('### Regression');
});
test('output structure references report directory layout', () => {
expect(qaContent).toContain('qa-report-');
expect(qaContent).toContain('baseline.json');
expect(qaContent).toContain('screenshots/');
expect(qaContent).toContain('.gstack/qa-reports/');
});
});
// --- Part 7: Greptile history format consistency (A3) ---
describe('Greptile history format consistency', () => {
test('greptile-triage.md defines the canonical history format', () => {
const content = fs.readFileSync(path.join(ROOT, 'review', 'greptile-triage.md'), 'utf-8');
expect(content).toContain('<YYYY-MM-DD>');
expect(content).toContain('<owner/repo>');
expect(content).toContain('<type');
expect(content).toContain('<file-pattern>');
expect(content).toContain('<category>');
});
test('review/SKILL.md and ship/SKILL.md both reference greptile-triage.md for write details', () => {
const reviewContent = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
const shipContent = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(reviewContent.toLowerCase()).toContain('greptile-triage.md');
expect(shipContent.toLowerCase()).toContain('greptile-triage.md');
});
test('greptile-triage.md defines all 9 valid categories', () => {
const content = fs.readFileSync(path.join(ROOT, 'review', 'greptile-triage.md'), 'utf-8');
const categories = [
'race-condition', 'null-check', 'error-handling', 'style',
'type-safety', 'security', 'performance', 'correctness', 'other',
];
for (const cat of categories) {
expect(content).toContain(cat);
}
});
});
// --- Hardcoded branch name detection in templates ---
describe('No hardcoded branch names in SKILL templates', () => {
const tmplFiles = [
'ship/SKILL.md.tmpl',
'review/SKILL.md.tmpl',
'qa/SKILL.md.tmpl',
'plan-ceo-review/SKILL.md.tmpl',
'retro/SKILL.md.tmpl',
'document-release/SKILL.md.tmpl',
'plan-eng-review/SKILL.md.tmpl',
'plan-design-review/SKILL.md.tmpl',
'codex/SKILL.md.tmpl',
];
// Patterns that indicate hardcoded 'main' in git commands
const gitMainPatterns = [
/\bgit\s+diff\s+(?:origin\/)?main\b/,
/\bgit\s+log\s+(?:origin\/)?main\b/,
/\bgit\s+fetch\s+origin\s+main\b/,
/\bgit\s+merge\s+origin\/main\b/,
/\borigin\/main\b/,
];
// Lines that are allowed to mention 'main' (fallback logic, prose)
const allowlist = [
/fall\s*back\s+to\s+`main`/i,
/fall\s*back\s+to\s+`?main`?/i,
/typically\s+`?main`?/i,
/If\s+on\s+`main`/i, // old pattern — should not exist
];
for (const tmplFile of tmplFiles) {
test(`${tmplFile} has no hardcoded 'main' in git commands`, () => {
const filePath = path.join(ROOT, tmplFile);
if (!fs.existsSync(filePath)) return;
const lines = fs.readFileSync(filePath, 'utf-8').split('\n');
const violations: string[] = [];
for (let i = 0; i < lines.length; i++) {
const line = lines[i];
const isAllowlisted = allowlist.some(p => p.test(line));
if (isAllowlisted) continue;
for (const pattern of gitMainPatterns) {
if (pattern.test(line)) {
violations.push(`Line ${i + 1}: ${line.trim()}`);
break;
}
}
}
if (violations.length > 0) {
throw new Error(
`${tmplFile} has hardcoded 'main' in git commands:\n` +
violations.map(v => ` ${v}`).join('\n')
);
}
});
}
});
// --- Part 7b: TODOS-format.md reference consistency ---
describe('TODOS-format.md reference consistency', () => {
test('review/TODOS-format.md exists and defines canonical format', () => {
const content = fs.readFileSync(path.join(ROOT, 'review', 'TODOS-format.md'), 'utf-8');
expect(content).toContain('**What:**');
expect(content).toContain('**Why:**');
expect(content).toContain('**Priority:**');
expect(content).toContain('**Effort:**');
expect(content).toContain('## Completed');
});
test('skills that write TODOs reference TODOS-format.md', () => {
const shipContent = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
const ceoPlanContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
const engPlanContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
expect(shipContent).toContain('TODOS-format.md');
expect(ceoPlanContent).toContain('TODOS-format.md');
expect(engPlanContent).toContain('TODOS-format.md');
});
});
// --- v0.4.1 feature coverage: RECOMMENDATION format, session awareness, enum completeness ---
describe('v0.4.1 preamble features', () => {
// Tier 1 skills have core preamble only (no AskUserQuestion format)
const tier1Skills = ['SKILL.md', 'browse/SKILL.md', 'setup-browser-cookies/SKILL.md', 'benchmark/SKILL.md'];
// Tier 2+ skills have AskUserQuestion format with RECOMMENDATION
const tier2PlusSkills = [
'qa/SKILL.md', 'qa-only/SKILL.md',
'ship/SKILL.md', 'review/SKILL.md',
'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
'retro/SKILL.md',
'office-hours/SKILL.md', 'investigate/SKILL.md',
'plan-design-review/SKILL.md',
'design-review/SKILL.md',
'design-consultation/SKILL.md',
'document-release/SKILL.md',
'canary/SKILL.md',
'land-and-deploy/SKILL.md',
'setup-deploy/SKILL.md',
'cso/SKILL.md',
];
const skillsWithPreamble = [...tier1Skills, ...tier2PlusSkills];
for (const skill of tier2PlusSkills) {
test(`${skill} contains AskUserQuestion Pros/Cons format`, () => {
const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
// v1.7.0.0 Pros/Cons format tokens. The preamble resolver
// (generate-ask-user-format.ts) injects all of these into every
// tier-2+ skill. Drop any of them and the test catches it on the
// next `bun test` run.
expect(content).toContain('AskUserQuestion');
expect(content).toContain('Pros / cons:');
expect(content).toContain('Recommendation: <choice>');
expect(content).toContain('Net:');
expect(content).toContain('ELI10');
expect(content).toContain('Stakes if we pick wrong:');
// Concrete format markers must be documented in the resolver text
expect(content).toMatch(/✅/);
expect(content).toMatch(/❌/);
});
}
for (const skill of skillsWithPreamble) {
test(`${skill} contains session awareness`, () => {
const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
expect(content).toContain('_SESSIONS');
});
}
for (const skill of skillsWithPreamble) {
test(`${skill} contains escalation protocol`, () => {
const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
expect(content).toContain('DONE_WITH_CONCERNS');
expect(content).toContain('BLOCKED');
expect(content).toContain('NEEDS_CONTEXT');
});
}
});
// --- Structural tests for new skills ---
describe('office-hours skill structure', () => {
const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
// Original structural assertions
for (const section of ['Phase 1', 'Phase 2', 'Phase 3', 'Phase 4', 'Phase 5', 'Phase 6',
'Design Doc', 'Supersedes', 'APPROVED', 'Premise Challenge',
'Alternatives', 'Smart-skip']) {
test(`contains ${section}`, () => expect(content).toContain(section));
}
// Dual-mode structure
for (const section of ['Startup mode', 'Builder mode']) {
test(`contains ${section}`, () => expect(content).toContain(section));
}
// Mode detection question
test('contains explicit mode detection question', () => {
expect(content).toContain("what's your goal");
});
// Six forcing questions (startup mode)
for (const question of ['Demand Reality', 'Status Quo', 'Desperate Specificity',
'Narrowest Wedge', 'Observation & Surprise', 'Future-Fit']) {
test(`contains forcing question: ${question}`, () => expect(content).toContain(question));
}
// Builder mode questions
test('contains builder brainstorming questions', () => {
expect(content).toContain('coolest version');
expect(content).toContain('delightful');
});
// Intrapreneurship adaptation
test('contains intrapreneurship adaptation', () => {
expect(content).toContain('Intrapreneurship');
});
// YC founder discovery engine
test('contains YC apply CTA with ref tracking', () => {
expect(content).toContain('ycombinator.com/apply?ref=gstack');
});
test('contains "What I noticed" design doc section', () => {
expect(content).toContain('What I noticed about how you think');
});
test('contains golden age framing', () => {
expect(content).toContain('golden age');
});
test('contains Garry Tan personal plea', () => {
expect(content).toContain('Garry Tan, the creator of GStack');
});
test('contains founder signal synthesis phase', () => {
expect(content).toContain('Founder Signal Synthesis');
});
test('contains three-tier decision rubric', () => {
expect(content).toContain('Top tier');
expect(content).toContain('Middle tier');
expect(content).toContain('Base tier');
});
test('contains anti-slop examples', () => {
expect(content).toContain('GOOD:');
expect(content).toContain('BAD:');
});
test('contains "One more thing" transition beat', () => {
expect(content).toContain('One more thing');
});
// Operating principles per mode
test('contains startup operating principles', () => {
expect(content).toContain('Specificity is the only currency');
});
test('contains builder operating principles', () => {
expect(content).toContain('Delight is the currency');
});
// Spec Review Loop (Phase 5.5)
test('contains spec review loop', () => {
expect(content).toContain('Spec Review Loop');
});
test('contains adversarial review dimensions', () => {
for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
expect(content).toContain(dim);
}
});
test('contains subagent dispatch instruction', () => {
expect(content).toMatch(/Agent.*tool|subagent/i);
});
test('contains max 3 iterations', () => {
expect(content).toMatch(/3.*iteration|maximum.*3/i);
});
test('contains quality score', () => {
expect(content).toContain('quality score');
});
test('contains spec review metrics path', () => {
expect(content).toContain('spec-review.jsonl');
});
test('contains convergence guard', () => {
expect(content).toMatch(/convergence/i);
});
// Visual Sketch (Phase 4.5)
test('contains visual sketch section', () => {
expect(content).toContain('Visual Sketch');
});
test('contains wireframe generation', () => {
expect(content).toMatch(/wireframe|sketch/i);
});
test('contains DESIGN.md awareness', () => {
expect(content).toContain('DESIGN.md');
});
test('contains browse rendering', () => {
expect(content).toContain('$B goto');
expect(content).toContain('$B screenshot');
});
test('contains rough aesthetic instruction', () => {
expect(content).toMatch(/rough|hand-drawn/i);
});
});
describe('investigate skill structure', () => {
const content = fs.readFileSync(path.join(ROOT, 'investigate', 'SKILL.md'), 'utf-8');
for (const section of ['Iron Law', 'Root Cause', 'Pattern Analysis', 'Hypothesis',
'DEBUG REPORT', '3-strike', 'BLOCKED']) {
test(`contains ${section}`, () => expect(content).toContain(section));
}
});
// Contributor mode was removed in v0.13.10.0 — replaced by operational self-improvement.
// Tests for contributor mode preamble structure are no longer applicable.
describe('Enum & Value Completeness in review checklist', () => {
const checklist = fs.readFileSync(path.join(ROOT, 'review', 'checklist.md'), 'utf-8');
test('checklist has Enum & Value Completeness section', () => {
expect(checklist).toContain('Enum & Value Completeness');
});
test('Enum & Value Completeness is classified as CRITICAL', () => {
// It should appear under Pass 1 — CRITICAL, not Pass 2
const pass1Start = checklist.indexOf('### Pass 1');
const pass2Start = checklist.indexOf('### Pass 2');
const enumStart = checklist.indexOf('Enum & Value Completeness');
expect(enumStart).toBeGreaterThan(pass1Start);
expect(enumStart).toBeLessThan(pass2Start);
});
test('Enum & Value Completeness mentions tracing through consumers', () => {
expect(checklist).toContain('Trace it through every consumer');
expect(checklist).toContain('case');
expect(checklist).toContain('allowlist');
});
test('Enum & Value Completeness is in the severity classification as CRITICAL', () => {
const gateSection = checklist.slice(checklist.indexOf('## Severity Classification'));
// The ASCII art has CRITICAL on the left and INFORMATIONAL on the right
// Enum & Value Completeness should appear on a line with the CRITICAL tree (├─ or └─)
const enumLine = gateSection.split('\n').find(l => l.includes('Enum & Value Completeness'));
expect(enumLine).toBeDefined();
// It's on the left (CRITICAL) side — starts with ├─ or └─
expect(enumLine!.trimStart().startsWith('├─') || enumLine!.trimStart().startsWith('└─')).toBe(true);
});
test('Fix-First Heuristic exists in checklist and is referenced by review + ship', () => {
expect(checklist).toContain('## Fix-First Heuristic');
expect(checklist).toContain('AUTO-FIX');
expect(checklist).toContain('ASK');
const reviewSkill = fs.readFileSync(path.join(ROOT, 'review/SKILL.md'), 'utf-8');
const shipSkill = fs.readFileSync(path.join(ROOT, 'ship/SKILL.md'), 'utf-8');
expect(reviewSkill).toContain('AUTO-FIX');
expect(reviewSkill).toContain('[AUTO-FIXED]');
expect(shipSkill).toContain('AUTO-FIX');
expect(shipSkill).toContain('[AUTO-FIXED]');
});
});
// --- Completeness Principle spot-check ---
describe('Completeness Principle in generated SKILL.md files', () => {
const skillsWithPreamble = [
'qa/SKILL.md',
'qa-only/SKILL.md',
'ship/SKILL.md', 'review/SKILL.md',
'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
'retro/SKILL.md',
'plan-design-review/SKILL.md',
'design-review/SKILL.md',
'design-consultation/SKILL.md',
'document-release/SKILL.md',
'cso/SKILL.md', ];
for (const skill of skillsWithPreamble) {
test(`${skill} contains Completeness Principle section`, () => {
const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
expect(content).toContain('Completeness Principle');
expect(content).toContain('Boil the Lake');
});
}
test('Completeness Principle keeps compact scoring guidance in tier 2+ skills', () => {
const content = fs.readFileSync(path.join(ROOT, 'cso', 'SKILL.md'), 'utf-8');
expect(content).toContain('Completeness: X/10');
expect(content).toContain('10 = all edge cases');
expect(content).toContain('Note: options differ in kind, not coverage');
expect(content).toContain('Do not fabricate scores');
});
});
// --- Part 7: Planted-bug fixture validation (A4) ---
describe('Planted-bug fixture validation', () => {
test('qa-eval ground truth has exactly 5 planted bugs', () => {
const groundTruth = JSON.parse(
fs.readFileSync(path.join(ROOT, 'test', 'fixtures', 'qa-eval-ground-truth.json'), 'utf-8')
);
expect(groundTruth.bugs).toHaveLength(5);
expect(groundTruth.total_bugs).toBe(5);
});
test('qa-eval-spa ground truth has exactly 5 planted bugs', () => {
const groundTruth = JSON.parse(
fs.readFileSync(path.join(ROOT, 'test', 'fixtures', 'qa-eval-spa-ground-truth.json'), 'utf-8')
);
expect(groundTruth.bugs).toHaveLength(5);
expect(groundTruth.total_bugs).toBe(5);
});
test('qa-eval-checkout ground truth has exactly 5 planted bugs', () => {
const groundTruth = JSON.parse(
fs.readFileSync(path.join(ROOT, 'test', 'fixtures', 'qa-eval-checkout-ground-truth.json'), 'utf-8')
);
expect(groundTruth.bugs).toHaveLength(5);
expect(groundTruth.total_bugs).toBe(5);
});
test('qa-eval.html contains the planted bugs', () => {
const html = fs.readFileSync(path.join(ROOT, 'browse', 'test', 'fixtures', 'qa-eval.html'), 'utf-8');
// BUG 1: broken link
expect(html).toContain('/nonexistent-404-page');
// BUG 2: disabled submit
expect(html).toContain('disabled');
// BUG 3: overflow
expect(html).toContain('overflow: hidden');
// BUG 4: missing alt
expect(html).toMatch(/<img[^>]*src="\/logo\.png"[^>]*>/);
expect(html).not.toMatch(/<img[^>]*src="\/logo\.png"[^>]*alt=/);
// BUG 5: console error
expect(html).toContain("Cannot read properties of undefined");
});
test('review-eval-vuln.rb contains expected vulnerability patterns', () => {
const content = fs.readFileSync(path.join(ROOT, 'test', 'fixtures', 'review-eval-vuln.rb'), 'utf-8');
expect(content).toContain('params[:id]');
expect(content).toContain('update_column');
});
});
// --- CEO review mode validation ---
describe('CEO review mode validation', () => {
const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
test('has all four CEO review modes defined', () => {
const modes = ['SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'HOLD SCOPE', 'SCOPE REDUCTION'];
for (const mode of modes) {
expect(content).toContain(mode);
}
});
test('has CEO plan persistence step', () => {
expect(content).toContain('ceo-plans');
expect(content).toContain('status: ACTIVE');
});
test('has docs/designs promotion section', () => {
expect(content).toContain('docs/designs');
expect(content).toContain('PROMOTED');
});
test('mode quick reference has four columns', () => {
expect(content).toContain('EXPANSION');
expect(content).toContain('SELECTIVE');
expect(content).toContain('HOLD SCOPE');
expect(content).toContain('REDUCTION');
});
// Skill chaining (benefits-from)
test('contains prerequisite skill offer for office-hours', () => {
expect(content).toContain('Prerequisite Skill Offer');
expect(content).toContain('/office-hours');
});
test('contains mid-session detection', () => {
expect(content).toContain('Mid-session detection');
expect(content).toMatch(/still figuring out|seems lost/i);
});
// Spec review on CEO plans
test('contains spec review loop for CEO plan documents', () => {
expect(content).toContain('Spec Review Loop');
});
});
// --- gstack-slug helper ---
describe('gstack-slug', () => {
const SLUG_BIN = path.join(ROOT, 'bin', 'gstack-slug');
test('binary exists and is executable', () => {
expect(fs.existsSync(SLUG_BIN)).toBe(true);
const stat = fs.statSync(SLUG_BIN);
expect(stat.mode & 0o111).toBeGreaterThan(0);
});
test('outputs SLUG and BRANCH lines in a git repo', () => {
const result = Bun.spawnSync([SLUG_BIN], { cwd: ROOT, stdout: 'pipe', stderr: 'pipe' });
expect(result.exitCode).toBe(0);
const output = result.stdout.toString();
expect(output).toContain('SLUG=');
expect(output).toContain('BRANCH=');
});
test('SLUG does not contain forward slashes', () => {
const result = Bun.spawnSync([SLUG_BIN], { cwd: ROOT, stdout: 'pipe', stderr: 'pipe' });
const slug = result.stdout.toString().match(/SLUG=(.*)/)?.[1] ?? '';
expect(slug).not.toContain('/');
expect(slug.length).toBeGreaterThan(0);
});
test('BRANCH does not contain forward slashes', () => {
const result = Bun.spawnSync([SLUG_BIN], { cwd: ROOT, stdout: 'pipe', stderr: 'pipe' });
const branch = result.stdout.toString().match(/BRANCH=(.*)/)?.[1] ?? '';
expect(branch).not.toContain('/');
expect(branch.length).toBeGreaterThan(0);
});
test('output is eval-compatible (KEY=VALUE format)', () => {
const result = Bun.spawnSync([SLUG_BIN], { cwd: ROOT, stdout: 'pipe', stderr: 'pipe' });
const lines = result.stdout.toString().trim().split('\n');
expect(lines.length).toBe(2);
expect(lines[0]).toMatch(/^SLUG=.+/);
expect(lines[1]).toMatch(/^BRANCH=.+/);
});
test('output values contain only safe characters (no shell metacharacters)', () => {
const result = Bun.spawnSync([SLUG_BIN], { cwd: ROOT, stdout: 'pipe', stderr: 'pipe' });
const slug = result.stdout.toString().match(/SLUG=(.*)/)?.[1] ?? '';
const branch = result.stdout.toString().match(/BRANCH=(.*)/)?.[1] ?? '';
// Only alphanumeric, dot, dash, underscore are allowed (#133)
expect(slug).toMatch(/^[a-zA-Z0-9._-]+$/);
expect(branch).toMatch(/^[a-zA-Z0-9._-]+$/);
});
test('eval sets variables under bash with set -euo pipefail', () => {
const result = Bun.spawnSync(
['bash', '-c', 'set -euo pipefail; eval "$(./bin/gstack-slug 2>/dev/null)"; echo "SLUG=$SLUG"; echo "BRANCH=$BRANCH"'],
{ cwd: ROOT, stdout: 'pipe', stderr: 'pipe' }
);
expect(result.exitCode).toBe(0);
const output = result.stdout.toString();
expect(output).toMatch(/^SLUG=.+/m);
expect(output).toMatch(/^BRANCH=.+/m);
});
test('no templates or bin scripts use source process substitution for gstack-slug', () => {
const result = Bun.spawnSync(
['grep', '-r', 'source <(.*gstack-slug', '--include=*.tmpl', '--include=gstack-review-*', '.'],
{ cwd: ROOT, stdout: 'pipe', stderr: 'pipe' }
);
// grep returns exit code 1 when no matches found — that's what we want
expect(result.stdout.toString().trim()).toBe('');
});
});
// --- Test Bootstrap validation ---
describe('Test Bootstrap ({{TEST_BOOTSTRAP}}) integration', () => {
test('TEST_BOOTSTRAP resolver produces valid content', () => {
const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(qaContent).toContain('Test Framework Bootstrap');
expect(qaContent).toContain('RUNTIME:ruby');
expect(qaContent).toContain('RUNTIME:node');
expect(qaContent).toContain('RUNTIME:python');
expect(qaContent).toContain('no-test-bootstrap');
expect(qaContent).toContain('BOOTSTRAP_DECLINED');
});
test('TEST_BOOTSTRAP appears in qa/SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('Test Framework Bootstrap');
expect(content).toContain('TESTING.md');
expect(content).toContain('CLAUDE.md');
});
test('TEST_BOOTSTRAP appears in ship/SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Test Framework Bootstrap');
expect(content).toContain('Step 4');
});
test('TEST_BOOTSTRAP appears in design-review/SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'design-review', 'SKILL.md'), 'utf-8');
expect(content).toContain('Test Framework Bootstrap');
});
test('TEST_BOOTSTRAP does NOT appear in qa-only/SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa-only', 'SKILL.md'), 'utf-8');
expect(content).not.toContain('Test Framework Bootstrap');
// But should have the recommendation note
expect(content).toContain('No test framework detected');
expect(content).toContain('Run `/qa` to bootstrap');
});
test('bootstrap includes framework knowledge table', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('vitest');
expect(content).toContain('minitest');
expect(content).toContain('pytest');
expect(content).toContain('cargo test');
expect(content).toContain('phpunit');
expect(content).toContain('ExUnit');
});
test('bootstrap includes CI/CD pipeline generation', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('.github/workflows/test.yml');
expect(content).toContain('GitHub Actions');
});
test('bootstrap includes first real tests step', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('First real tests');
expect(content).toContain('git log --since=30.days');
expect(content).toContain('Prioritize by risk');
});
test('bootstrap includes vibe coding philosophy', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('vibe coding');
expect(content).toContain('100% test coverage');
});
test('WebSearch is in allowed-tools for qa, ship, design-review', () => {
const qa = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
const ship = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
const qaDesign = fs.readFileSync(path.join(ROOT, 'design-review', 'SKILL.md'), 'utf-8');
expect(qa).toContain('WebSearch');
expect(ship).toContain('WebSearch');
expect(qaDesign).toContain('WebSearch');
});
});
// --- Phase 8e.5 regression test validation ---
describe('Phase 8e.5 regression test generation', () => {
test('qa/SKILL.md contains Phase 8e.5', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('8e.5. Regression Test');
expect(content).toContain('test(qa): regression test');
expect(content).toContain('WTF-likelihood exclusion');
});
test('qa/SKILL.md Rule 13 is amended for regression tests', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('Only modify tests when generating regression tests in Phase 8e.5');
expect(content).not.toContain('Never modify tests or CI configuration');
});
test('design-review has CSS-aware Phase 8e.5 variant', () => {
const content = fs.readFileSync(path.join(ROOT, 'design-review', 'SKILL.md'), 'utf-8');
expect(content).toContain('8e.5. Regression Test (design-review variant)');
expect(content).toContain('CSS-only');
expect(content).toContain('test(design): regression test');
});
test('regression test includes full attribution comment format', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('// Regression: ISSUE-NNN');
expect(content).toContain('// Found by /qa on');
expect(content).toContain('// Report: .gstack/qa-reports/');
});
test('regression test uses auto-incrementing names', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
expect(content).toContain('auto-incrementing');
expect(content).toContain('max number + 1');
});
});
// --- Step 3.4 coverage audit validation ---
describe('Step 3.4 test coverage audit', () => {
test('ship/SKILL.md contains Step 7', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Step 7: Test Coverage Audit');
// The coverage diagram collapses code-path and user-flow counts onto one
// summary line. Verify that summary is present (labels are stable).
expect(content).toContain('Code paths:');
});
test('Step 3.4 includes quality scoring rubric', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('★★★');
expect(content).toContain('★★');
expect(content).toContain('edge cases AND error paths');
expect(content).toContain('happy path only');
});
test('Step 3.4 includes before/after test count', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Count test files before');
expect(content).toContain('Count test files after');
});
test('ship PR body includes Test Coverage section', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('## Test Coverage');
});
test('ship rules include test generation rule', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Step 7 generates coverage tests');
expect(content).toContain('Never commit failing tests');
});
test('Step 3.4 includes vibe coding philosophy', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('vibe coding becomes yolo coding');
});
test('Step 3.4 traces actual codepaths, not just syntax', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Trace every codepath');
expect(content).toContain('Trace data flow');
expect(content).toContain('Diagram the execution');
});
test('Step 3.4 maps user flows and interaction edge cases', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Map user flows');
expect(content).toContain('Interaction edge cases');
expect(content).toContain('Double-click');
expect(content).toContain('Navigate away');
expect(content).toContain('Error states the user can see');
expect(content).toContain('Empty/zero/boundary states');
});
test('Step 3.4 diagram includes user-flow coverage summary', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
// The diagram was compressed from separate CODE PATH COVERAGE / USER FLOW
// COVERAGE section headers into a single summary line. Assert on the
// labels that still appear on that summary line.
expect(content).toContain('Code paths:');
expect(content).toContain('User flows:');
});
});
// --- Ship step numbering regression guard ---
describe('ship step numbering', () => {
// Allowed sub-steps that are resolver-generated and intentionally nested:
// 8.1 (Plan Verification), 8.2 (Scope Drift), 9.1 (Review Army), 9.2 (Findings Merge),
// 9.3 (Cross-review dedup), 15.0 (WIP squash — continuous checkpoint), 15.1 (Bisectable commits).
const ALLOWED_SUBSTEPS = new Set(['8.1', '8.2', '9.1', '9.2', '9.3', '15.0', '15.1']);
test('ship/SKILL.md.tmpl contains no unexpected fractional step numbers', () => {
const tmpl = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md.tmpl'), 'utf-8');
// Match "Step X.Y" where X.Y is a decimal step reference (e.g., "Step 3.47", "Step 8.1")
const matches = Array.from(tmpl.matchAll(/Step (\d+\.\d+)/g));
const violations = matches
.map((m) => m[1])
.filter((n) => !ALLOWED_SUBSTEPS.has(n));
if (violations.length > 0) {
const unique = Array.from(new Set(violations)).sort();
throw new Error(
`ship/SKILL.md.tmpl contains fractional step numbers that are not in the allowed sub-step list.\n` +
` Found: ${unique.join(', ')}\n` +
` Allowed sub-steps: ${Array.from(ALLOWED_SUBSTEPS).sort().join(', ')}\n` +
` Fix: use clean integer step numbers (1-20), or add to ALLOWED_SUBSTEPS if intentional.`
);
}
});
test('ship/SKILL.md main headings use clean integer step numbers', () => {
const skill = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
// Headings like "## Step 7: Test Coverage Audit" — NOT sub-steps like "## Step 8.1:"
const headings = Array.from(skill.matchAll(/^## Step (\d+(?:\.\d+)?):/gm)).map(
(m) => m[1]
);
const fractional = headings.filter((n) => n.includes('.'));
const unexpected = fractional.filter((n) => !ALLOWED_SUBSTEPS.has(n));
expect(unexpected).toEqual([]);
});
test('review/SKILL.md step numbers unchanged (regression guard for resolver conditionals)', () => {
const skill = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
// /review uses its own fractional numbering: 1.5, 2.5, 4.5, 5.5, 5.6, 5.7, 5.8
// If the ship-side renumber accidentally touched the review-side of resolver conditionals,
// these would vanish. This test catches that.
expect(skill).toContain('## Step 1.5: Scope Drift Detection');
expect(skill).toContain('## Step 4.5: Review Army');
expect(skill).toContain('## Step 5.7: Adversarial review');
});
});
// --- Retro test health validation ---
describe('Retro test health tracking', () => {
test('retro/SKILL.md has test health data gathering commands', () => {
const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
expect(content).toContain('# 10. Test file count');
expect(content).toContain('# 11. Regression test commits');
expect(content).toContain('# 12. Test files changed');
});
test('retro/SKILL.md has Test Health metrics row', () => {
const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
expect(content).toContain('Test Health');
expect(content).toContain('regression tests');
});
test('retro/SKILL.md has Test Health narrative section', () => {
const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
expect(content).toContain('### Test Health');
expect(content).toContain('Total test files');
expect(content).toContain('vibe coding safe');
});
test('retro JSON schema includes test_health field', () => {
const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
expect(content).toContain('test_health');
expect(content).toContain('total_test_files');
expect(content).toContain('regression_test_commits');
});
});
// --- QA report template regression tests section ---
describe('QA report template', () => {
test('qa-report-template.md has Regression Tests section', () => {
const content = fs.readFileSync(path.join(ROOT, 'qa', 'templates', 'qa-report-template.md'), 'utf-8');
expect(content).toContain('## Regression Tests');
expect(content).toContain('committed / deferred / skipped');
expect(content).toContain('### Deferred Tests');
expect(content).toContain('**Precondition:**');
});
});
// --- Codex skill validation ---
describe('Codex skill', () => {
test('codex/SKILL.md exists and has correct frontmatter', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('name: codex');
expect(content).toContain('version: 1.0.0');
expect(content).toContain('allowed-tools:');
});
test('codex/SKILL.md contains all three modes', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('Step 2A: Review Mode');
expect(content).toContain('Step 2B: Challenge');
expect(content).toContain('Step 2C: Consult Mode');
});
test('codex/SKILL.md contains gate verdict logic', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('[P1]');
expect(content).toContain('GATE: PASS');
expect(content).toContain('GATE: FAIL');
});
test('codex/SKILL.md contains session continuity', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('codex-session-id');
expect(content).toContain('codex exec resume');
});
test('codex/SKILL.md resume command only uses resume-supported flags', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
const match = content.match(/codex exec resume[^\n]+/);
expect(match).not.toBeNull();
const resumeCommand = match![0];
expect(resumeCommand).not.toContain(' -C ');
expect(resumeCommand).not.toContain(' -s read-only');
expect(resumeCommand).toContain("-c 'sandbox_mode=\"read-only\"'");
});
test('codex/SKILL.md contains cost tracking', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('tokens used');
expect(content).toContain('Est. cost');
});
test('codex/SKILL.md contains cross-model comparison', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('CROSS-MODEL ANALYSIS');
expect(content).toContain('Agreement rate');
});
test('codex/SKILL.md contains review log persistence', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('codex-review');
expect(content).toContain('gstack-review-log');
});
test('codex/SKILL.md uses command -v for binary discovery, not hardcoded path', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('command -v codex');
expect(content).not.toContain('/opt/homebrew/bin/codex');
// Defensive: catch any future regression that reintroduces `which codex`,
// which fails in environments where `which` isn't on PATH (some Windows
// shells, BusyBox-only containers). #1197.
expect(content).not.toContain('which codex');
});
test('codex/SKILL.md contains error handling for missing binary and auth', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('NOT_FOUND');
expect(content).toContain('codex login');
});
test('codex/SKILL.md uses mktemp for temp files', () => {
const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8');
expect(content).toContain('mktemp');
});
test('codex JSON stream parser uses portable Python discovery', () => {
const files = ['codex/SKILL.md.tmpl', 'codex/SKILL.md'];
for (const rel of files) {
const content = fs.readFileSync(path.join(ROOT, rel), 'utf-8');
expect(content).toContain('PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true)');
expect(content).toContain('PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c');
expect(content).not.toContain('PYTHONUNBUFFERED=1 python3 -u -c');
}
});
test('adversarial review in /review always runs both passes', () => {
const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
expect(content).toContain('Adversarial review (always-on)');
// Always-on: both Claude and Codex adversarial
expect(content).toContain('Claude adversarial subagent (always runs)');
expect(content).toContain('Codex adversarial challenge (always runs when available)');
// Claude adversarial subagent dispatch
expect(content).toContain('Agent tool');
expect(content).toContain('FIXABLE');
expect(content).toContain('INVESTIGATE');
// Codex availability check
expect(content).toContain('CODEX_NOT_AVAILABLE');
// OLD_CFG only gates Codex, not Claude
expect(content).toContain('skip Codex passes only');
// Review log
expect(content).toContain('adversarial-review');
expect(content).toContain('reasoning_effort="high"');
expect(content).toContain('ADVERSARIAL REVIEW SYNTHESIS');
// Large diff structured review still gated
expect(content).toContain('Codex structured review (large diffs only');
expect(content).toContain('200');
});
test('adversarial review in /ship always runs both passes', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Adversarial review (always-on)');
expect(content).toContain('adversarial-review');
expect(content).toContain('reasoning_effort="high"');
expect(content).toContain('Investigate and fix');
expect(content).toContain('Claude adversarial subagent (always runs)');
});
test('scope drift detection in /review and /ship', () => {
const reviewContent = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
const shipContent = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
// Both should contain scope drift from the shared resolver
for (const content of [reviewContent, shipContent]) {
expect(content).toContain('Scope Check:');
expect(content).toContain('DRIFT DETECTED');
expect(content).toContain('SCOPE CREEP');
expect(content).toContain('MISSING REQUIREMENTS');
expect(content).toContain('stated intent');
}
});
test('codex-host ship/review do NOT contain adversarial review step', () => {
// .agents/ is gitignored — generate on demand
Bun.spawnSync(['bun', 'run', 'scripts/gen-skill-docs.ts', '--host', 'codex'], {
cwd: ROOT, stdout: 'pipe', stderr: 'pipe',
});
const shipContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-ship', 'SKILL.md'), 'utf-8');
expect(shipContent).not.toContain('codex review --base');
expect(shipContent).not.toContain('CODEX_REVIEWS');
const reviewContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-review', 'SKILL.md'), 'utf-8');
expect(reviewContent).not.toContain('codex review --base');
expect(reviewContent).not.toContain('codex_reviews');
expect(reviewContent).not.toContain('CODEX_REVIEWS');
expect(reviewContent).not.toContain('adversarial-review');
expect(reviewContent).not.toContain('Investigate and fix');
});
test('codex integration in /plan-eng-review offers plan critique', () => {
const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
expect(content).toContain('Codex');
expect(content).toContain('codex exec');
});
test('codex review invocations avoid the prompt plus --base argument shape', () => {
for (const rel of ['codex/SKILL.md', 'review/SKILL.md', 'ship/SKILL.md']) {
const content = fs.readFileSync(path.join(ROOT, rel), 'utf-8');
expect(content).not.toContain('--base <base> -c \'model_reasoning_effort="high"\'');
expect(content).toContain('Run git diff origin/<base>...HEAD 2>/dev/null || git diff <base>...HEAD');
}
});
test('codex review prompts always carry the filesystem boundary (#1503/#1522 regression)', () => {
// Pre-#1209, the bare `codex review --base` path stripped the filesystem
// boundary instruction, letting Codex spend tokens reading skill files.
// #1209's prompt rewrite restored the boundary by routing every default
// call through a prompt. Pin both halves so a future refactor can't
// regress: (a) the boundary line must appear, (b) the call must be
// through `codex review "<prompt>"` not bare `codex review --base`.
const boundaryLine =
'Do NOT read or execute any files under ~/.claude/, ~/.agents/, .claude/skills/, or agents/';
for (const rel of ['codex/SKILL.md', 'review/SKILL.md', 'ship/SKILL.md']) {
const content = fs.readFileSync(path.join(ROOT, rel), 'utf-8');
expect(content).toContain(boundaryLine);
}
});
test('/review persists a review-log entry for ship readiness', () => {
const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
expect(content).toContain('"skill":"review"');
expect(content).toContain('"issues_found":N');
expect(content).toContain('Persist Eng Review result');
});
test('Review Readiness Dashboard includes Adversarial Review row', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Adversarial');
expect(content).toContain('codex-review');
});
});
// --- Trigger phrase validation ---
describe('Skill trigger phrases', () => {
// Skills that must have "Use when" trigger phrases in their description.
// Excluded: root gstack (browser tool), gstack-upgrade (gstack-specific),
// humanizer (text tool)
const SKILLS_REQUIRING_TRIGGERS = [
'qa', 'qa-only', 'ship', 'review', 'investigate', 'office-hours',
'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
'design-review', 'design-consultation', 'retro', 'document-release',
'codex', 'browse', 'setup-browser-cookies',
];
for (const skill of SKILLS_REQUIRING_TRIGGERS) {
test(`${skill}/SKILL.md has "Use when" trigger phrases`, () => {
const skillPath = path.join(ROOT, skill, 'SKILL.md');
if (!fs.existsSync(skillPath)) return;
const content = fs.readFileSync(skillPath, 'utf-8');
// v1.45.0.0 catalog trim moved trigger prose out of frontmatter into a
// body "## When to invoke" section. Search the full file content, not
// just frontmatter. The trigger phrase must still appear somewhere in
// the skill so agents can match user requests to the skill.
expect(content).toMatch(/Use when/i);
});
}
// Skills with proactive triggers should have "Proactively suggest" somewhere in the skill.
const SKILLS_REQUIRING_PROACTIVE = [
'qa', 'qa-only', 'ship', 'review', 'investigate', 'office-hours',
'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
'design-review', 'design-consultation', 'retro', 'document-release',
];
for (const skill of SKILLS_REQUIRING_PROACTIVE) {
test(`${skill}/SKILL.md has proactive routing phrase`, () => {
const skillPath = path.join(ROOT, skill, 'SKILL.md');
if (!fs.existsSync(skillPath)) return;
const content = fs.readFileSync(skillPath, 'utf-8');
// Same catalog-trim consideration — search the full file content.
expect(content).toMatch(/Proactively (suggest|invoke)/i);
});
}
});
// ─── Private-path leak detector ──────────────────────────────
//
// Catches accidental references to maintainer-private files in skill output.
// Adapted from the McGluut fork's skill-contract-audit.ts (we don't take the
// whole script — these are the unique checks not already covered by
// test/gen-skill-docs.test.ts:1668-2074 .claude/skills leakage tests).
describe('Private-path leak detection', () => {
const PRIVATE_PATTERNS: Array<{ pattern: RegExp; label: string }> = [
{ pattern: /coordination-board\.md/i, label: 'coordination-board.md' },
{ pattern: /SEEKING_LOG\.md/, label: 'SEEKING_LOG.md' },
{ pattern: /RATIONAL_SUBJECT\.md/, label: 'RATIONAL_SUBJECT.md' },
{ pattern: /VALUE_SIGNAL_LOOP\.md/, label: 'VALUE_SIGNAL_LOOP.md' },
{ pattern: /C:\\\\LLM Playground\\\\go/i, label: 'C:\\LLM Playground\\go' },
];
// Walk every SKILL.md and SKILL.md.tmpl in the repo (excluding node_modules,
// generated host outputs, and .git).
function discoverSkillSurface(): string[] {
const results: string[] = [];
function walk(dir: string) {
for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
if (entry.name.startsWith('.') && entry.name !== '.agents') continue;
if (entry.name === 'node_modules' || entry.name === 'dist') continue;
const full = path.join(dir, entry.name);
if (entry.isDirectory()) {
walk(full);
} else if (entry.name === 'SKILL.md' || entry.name === 'SKILL.md.tmpl') {
results.push(full);
}
}
}
walk(ROOT);
return results;
}
test('no SKILL.md or SKILL.md.tmpl references private maintainer files', () => {
const files = discoverSkillSurface();
expect(files.length).toBeGreaterThan(0);
const leaks: string[] = [];
for (const file of files) {
const content = fs.readFileSync(file, 'utf-8');
for (const { pattern, label } of PRIVATE_PATTERNS) {
if (pattern.test(content)) {
leaks.push(`${path.relative(ROOT, file)} mentions ${label}`);
}
}
}
expect(leaks).toEqual([]);
});
});
// ─── Doc-inventory cross-check ───────────────────────────────
//
// Every skill directory (with a SKILL.md.tmpl) must appear in both AGENTS.md
// and docs/skills.md. Catches the inventory drift codex flagged (/debug
// → /investigate; missing /autoplan, /context-save, /plan-devex-review, etc.).
describe('Doc inventory cross-check', () => {
// Skills that don't get user-invocation lines in agent-facing docs.
// - 'qa-only' is a sub-mode of /qa with shared docs.
// - The 5 listed below are infrastructure (model overlays, shipped binary,
// hosts) that don't show up in the user-facing skill table.
const DOC_INVENTORY_EXCLUDE = new Set([
// Infra / non-skills
'agents', 'claude', 'connect-chrome', 'contrib', 'hosts',
'lib', 'model-overlays', 'openclaw', 'supabase', 'scripts', 'test',
]);
function discoverSkillDirs(): string[] {
const dirs: string[] = [];
for (const entry of fs.readdirSync(ROOT, { withFileTypes: true })) {
if (!entry.isDirectory()) continue;
if (entry.name.startsWith('.')) continue;
if (DOC_INVENTORY_EXCLUDE.has(entry.name)) continue;
const tmplPath = path.join(ROOT, entry.name, 'SKILL.md.tmpl');
if (fs.existsSync(tmplPath)) dirs.push(entry.name);
}
return dirs.sort();
}
test('every skill is documented in AGENTS.md', () => {
const agents = fs.readFileSync(path.join(ROOT, 'AGENTS.md'), 'utf-8');
const missing: string[] = [];
for (const skill of discoverSkillDirs()) {
// Match `/skill-name` as a token boundary.
if (!new RegExp(`/${skill}\\b`).test(agents)) missing.push(skill);
}
expect(missing).toEqual([]);
});
test('every skill is documented in docs/skills.md', () => {
const docs = fs.readFileSync(path.join(ROOT, 'docs', 'skills.md'), 'utf-8');
const missing: string[] = [];
for (const skill of discoverSkillDirs()) {
if (!new RegExp(`/${skill}\\b`).test(docs)) missing.push(skill);
}
expect(missing).toEqual([]);
});
});
// ─── Codex Skill Validation ──────────────────────────────────
describe('Codex skill validation', () => {
const AGENTS_DIR = path.join(ROOT, '.agents', 'skills');
// .agents/ is gitignored (v0.11.2.0) — generate on demand for tests
Bun.spawnSync(['bun', 'run', 'scripts/gen-skill-docs.ts', '--host', 'codex'], {
cwd: ROOT, stdout: 'pipe', stderr: 'pipe',
});
// Discover all shared skills with templates.
// Host-exclusive outside-voice skills are intentionally omitted here:
// - /codex is Claude-only
// - /claude is external-host-only
const CLAUDE_SKILLS_WITH_TEMPLATES = (() => {
const skills: string[] = [];
for (const entry of fs.readdirSync(ROOT, { withFileTypes: true })) {
if (!entry.isDirectory() || entry.name.startsWith('.') || entry.name === 'node_modules') continue;
if (entry.name === 'codex') continue; // Claude-only skill
if (entry.name === 'claude') continue; // External-host-only skill
if (fs.existsSync(path.join(ROOT, entry.name, 'SKILL.md.tmpl'))) {
skills.push(entry.name);
}
}
return skills;
})();
test('all skills (except /codex) have both Claude and Codex variants', () => {
for (const skillDir of CLAUDE_SKILLS_WITH_TEMPLATES) {
// Claude variant
const claudeMd = path.join(ROOT, skillDir, 'SKILL.md');
expect(fs.existsSync(claudeMd)).toBe(true);
// Codex variant
const codexName = skillDir.startsWith('gstack-') ? skillDir : `gstack-${skillDir}`;
const codexMd = path.join(AGENTS_DIR, codexName, 'SKILL.md');
expect(fs.existsSync(codexMd)).toBe(true);
}
// Root template has both too
expect(fs.existsSync(path.join(ROOT, 'SKILL.md'))).toBe(true);
expect(fs.existsSync(path.join(AGENTS_DIR, 'gstack', 'SKILL.md'))).toBe(true);
});
test('/codex skill is Claude-only — no Codex variant', () => {
// Claude variant should exist
expect(fs.existsSync(path.join(ROOT, 'codex', 'SKILL.md'))).toBe(true);
// Codex variant must NOT exist
expect(fs.existsSync(path.join(AGENTS_DIR, 'gstack-codex', 'SKILL.md'))).toBe(false);
});
test('/claude skill is external-host-only — no Claude-host variant', () => {
// Claude host should not get an outside-voice skill that shells into Claude.
expect(fs.existsSync(path.join(ROOT, 'claude', 'SKILL.md'))).toBe(false);
// Codex/external hosts should get the generated wrapper.
expect(fs.existsSync(path.join(AGENTS_DIR, 'gstack-claude', 'SKILL.md'))).toBe(true);
});
test('Codex skill names follow gstack-{name} convention', () => {
const codexDirs = fs.readdirSync(AGENTS_DIR);
for (const dir of codexDirs) {
// Every directory should start with gstack
expect(dir.startsWith('gstack')).toBe(true);
// Root is just 'gstack', others are 'gstack-{name}'
if (dir !== 'gstack') {
expect(dir.startsWith('gstack-')).toBe(true);
}
}
});
test('$B commands in Codex SKILL.md files are valid browse commands', () => {
const codexDirs = fs.readdirSync(AGENTS_DIR);
for (const dir of codexDirs) {
const skillMd = path.join(AGENTS_DIR, dir, 'SKILL.md');
if (!fs.existsSync(skillMd)) continue;
const content = fs.readFileSync(skillMd, 'utf-8');
// Only validate if the skill contains $B commands
if (!content.includes('$B ')) continue;
const result = validateSkill(skillMd);
expect(result.invalid).toHaveLength(0);
}
});
});
// --- Repo mode and test failure triage validation ---
describe('Repo mode preamble validation', () => {
test('generated SKILL.md preamble contains REPO_MODE output', () => {
const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
expect(content).toContain('REPO_MODE:');
expect(content).toContain('gstack-repo-mode');
});
test('tier 3+ skills contain See Something Say Something section', () => {
// Root SKILL.md is tier 1 (no Repo Mode). Check a tier 3 skill instead.
const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
expect(content).toContain('See Something, Say Something');
expect(content).toContain('REPO_MODE');
expect(content).toContain('solo');
expect(content).toContain('collaborative');
});
});
describe('Test failure triage in ship skill', () => {
test('ship/SKILL.md contains Test Failure Ownership Triage', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('Test Failure Ownership Triage');
});
test('ship/SKILL.md triage uses git diff for classification', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('git diff origin/<base>...HEAD --name-only');
});
test('ship/SKILL.md triage has solo and collaborative paths', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('REPO_MODE');
expect(content).toContain('solo');
expect(content).toContain('collaborative');
expect(content).toContain('Investigate and fix now');
expect(content).toContain('Add as P0 TODO');
});
test('ship/SKILL.md triage has GitHub issue assignment for collaborative mode', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('gh issue create');
expect(content).toContain('--assignee');
});
test('{{TEST_FAILURE_TRIAGE}} placeholder is fully resolved in ship/SKILL.md', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).not.toContain('{{TEST_FAILURE_TRIAGE}}');
});
test('ship/SKILL.md uses in-branch language for stop condition', () => {
const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
expect(content).toContain('In-branch test failures');
});
});
describe('no compiled binaries in git', () => {
// Tracked files enumerated once and reused by both assertions. git ls-files -z
// + split is ~ms; the previous xargs-per-file shell loops blew past 5s on CI.
const trackedFiles: string[] = require('child_process')
.execSync('git ls-files -z', { cwd: ROOT, encoding: 'utf-8' })
.split('\0')
.filter(Boolean);
test('git tracks no Mach-O or ELF binaries', () => {
// Only mode 100755 (executable) files can be binaries we care about. Pre-filter
// via git ls-files -s to avoid running `file` on every text file.
const lsOut: string = require('child_process').execSync('git ls-files -s', {
cwd: ROOT,
encoding: 'utf-8',
});
const executableFiles = lsOut
.split('\n')
.filter(Boolean)
.map((line: string) => {
const parts = line.split(/\s+/);
return { mode: parts[0], file: line.split('\t')[1] };
})
.filter((e: { mode: string; file: string }) => e.mode === '100755')
.map((e: { mode: string; file: string }) => e.file);
if (executableFiles.length === 0) return;
// Batch-invoke `file --mime-type` across all executable files at once.
const result: string = require('child_process')
.execSync(`file --mime-type -- ${executableFiles.map((f: string) => `'${f.replace(/'/g, "'\\''")}'`).join(' ')}`, {
cwd: ROOT,
encoding: 'utf-8',
})
.trim();
const binaries = result
.split('\n')
.filter((l: string) =>
/application\/(x-mach-binary|x-executable|x-pie-executable|x-sharedlib)/.test(l)
)
.map((l: string) => l.split(':')[0].trim());
expect(binaries).toEqual([]);
});
test('warns about tracked files larger than 2MB', () => {
// Large fixtures can be legitimate test infrastructure. Keep visibility on
// repository size without blocking those fixtures from living in git.
// Known-good fixtures are exempted from the warning to keep CI logs clean.
const MAX_BYTES = 2 * 1024 * 1024;
const knownLargeFixtures = new Set([
// Deterministic replay fixture for BrowseSafe-Bench. The live bench is
// expensive; this file is intentionally committed so the gate is free.
'browse/test/fixtures/security-bench-haiku-responses.json',
]);
const oversized = trackedFiles.flatMap((f: string) => {
if (knownLargeFixtures.has(f)) return [];
const full = path.join(ROOT, f);
try {
const size = fs.statSync(full).size;
return size > MAX_BYTES ? [{ file: f, size }] : [];
} catch {
return [];
}
});
if (oversized.length > 0) {
const formatted = oversized
.map(({ file, size }: { file: string; size: number }) => {
const mib = (size / (1024 * 1024)).toFixed(1);
return `${file} (${mib} MiB)`;
})
.join(', ');
console.warn(`[size-warning] tracked files over 2 MiB: ${formatted}`);
}
expect(Array.isArray(oversized)).toBe(true);
});
});
// `sidebar agent (#584)` describe block was here. sidebar-agent.ts and
// the entire chat-queue path were ripped in favor of the interactive
// claude PTY (terminal-agent.ts); these assertions had no target file.
// Terminal-pane invariants are covered by browse/test/sidebar-tabs.test.ts
// and browse/test/terminal-agent.test.ts.
// ─── Browser-skills validation ──────────────────────────────────
//
// Browser-skills are bundled in <gstack-root>/browser-skills/<name>/. Each
// must have a SKILL.md whose frontmatter satisfies the contract enforced by
// browse/src/browser-skills.ts:parseSkillFile (host required, args + triggers
// parseable as the right shape). This test catches malformed bundled skills
// at CI time, before they ship.
describe('Bundled browser-skills frontmatter contract', () => {
const browserSkillsRoot = path.join(ROOT, 'browser-skills');
function listBundledSkillDirs(): string[] {
if (!fs.existsSync(browserSkillsRoot)) return [];
return fs.readdirSync(browserSkillsRoot)
.filter(name => !name.startsWith('.'))
.map(name => path.join(browserSkillsRoot, name))
.filter(dir => {
try { return fs.statSync(dir).isDirectory(); } catch { return false; }
});
}
test('each bundled skill has a SKILL.md', () => {
for (const dir of listBundledSkillDirs()) {
const skillFile = path.join(dir, 'SKILL.md');
expect(fs.existsSync(skillFile)).toBe(true);
}
});
test('each bundled skill SKILL.md frontmatter parses with required fields', async () => {
const { parseSkillFile } = await import('../browse/src/browser-skills');
for (const dir of listBundledSkillDirs()) {
const name = path.basename(dir);
const content = fs.readFileSync(path.join(dir, 'SKILL.md'), 'utf-8');
// parseSkillFile throws on missing required fields; we just want to
// make sure none of our shipped skills tripwire it.
const { frontmatter } = parseSkillFile(content, { skillName: name });
expect(frontmatter.name).toBe(name);
expect(typeof frontmatter.host).toBe('string');
expect(frontmatter.host.length).toBeGreaterThan(0);
expect(Array.isArray(frontmatter.triggers)).toBe(true);
expect(Array.isArray(frontmatter.args)).toBe(true);
}
});
test('each bundled skill has a script.ts', () => {
for (const dir of listBundledSkillDirs()) {
expect(fs.existsSync(path.join(dir, 'script.ts'))).toBe(true);
}
});
test('each bundled skill ships a sibling SDK at _lib/browse-client.ts', () => {
for (const dir of listBundledSkillDirs()) {
expect(fs.existsSync(path.join(dir, '_lib', 'browse-client.ts'))).toBe(true);
}
});
test('each bundled skill has a script.test.ts', () => {
for (const dir of listBundledSkillDirs()) {
expect(fs.existsSync(path.join(dir, 'script.test.ts'))).toBe(true);
}
});
test("each bundled skill's _lib/browse-client.ts matches the canonical SDK", () => {
// If the canonical SDK changes, the bundled copy must be updated. This
// test enforces that — the _lib copy should be byte-identical.
const canonical = fs.readFileSync(path.join(ROOT, 'browse', 'src', 'browse-client.ts'), 'utf-8');
for (const dir of listBundledSkillDirs()) {
const sibling = fs.readFileSync(path.join(dir, '_lib', 'browse-client.ts'), 'utf-8');
expect(sibling).toBe(canonical);
}
});
test('script.ts imports browse from ./_lib/browse-client', () => {
for (const dir of listBundledSkillDirs()) {
const content = fs.readFileSync(path.join(dir, 'script.ts'), 'utf-8');
expect(content).toMatch(/from\s+['"]\.\/_lib\/browse-client['"]/);
}
});
});