Files
gstack/test/skill-size-budget.test.ts
T
Garry Tan 46c1fae7f1 v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) (#1806)
* feat(test): transcript-section-logger + ship-action fingerprint (T10)

Pure-analysis module over a SkillTestResult/NDJSON transcript:
- extractSectionReads(): which sections/*.md a run opened (post-carve check)
- extractShipActions(): observable action fingerprint (merge/test/bump/
  changelog/commit/push/pr) that works on the MONOLITH too, so a baseline
  captured before the carve can detect a sectioned-ship regression
- baseline read/write + compareShipActions() for baseline-first dogf(T10)

Baseline-first answers the Codex outside-voice critique that a logger in the
same PR as the carve is post-failure telemetry without a pre-carve reference.

11 unit tests, all green. Paid monolith baseline capture runs separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(pipeline): section discovery + generation machinery (T9)

- discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl
- gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext
  as shared helpers (processTemplate and the new processSectionTemplate both call
  them, so a sanitization/rewrite fix can't miss sections) [C1]
- processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice),
  parent-skill TemplateContext (skillName pinned to parent, not 'sections', so
  appliesTo gating + tier behave identically), per-host output routing
- --host all now fails the build on ANY host failure, not just claude, so a stale
  external-host output can't slip the freshness gate [Codex outside-voice #9]

Inert until a skill is carved (no sections/ dirs exist yet). Refactor is
output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE.

5 discovery unit tests + 389 gen-skill-docs tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9)

Two install targets cherry-pick SKILL.md and would leave a carved skill's
sections/ behind, 404ing a runtime 'Read sections/<name>.md':
- link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows
  gets a fresh copy on every ./setup)
- kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under
  ~/.kiro, not ~/.codex/~/.claude

codex/factory/opencode link the whole generated dir, so sections ride free.
Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a
skill is carved. Static-tripwire test + windows-fallback invariant green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9)

Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a
tested CLI instead of bash prose the agent re-derives each run.
- classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION
  vs origin/<base>:VERSION vs package.json.version (pure reader)
- write: validated dual-write to VERSION + package.json (FRESH bump)
- repair: DRIFT_STALE_PKG sync, no re-bump
Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays
bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from
skippable prose into code that can't be skipped or misread.

15 tests (exhaustive state matrix + write/repair fs + real-git classify).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(parity): sectioned-skill parity capability — guards the carve (T9)

Carved skills (skeleton + sections/*.md) need parity checks that see relocated
content, or moving a phrase into a section reads as 'lost':
- readSkillForParity(): union skeleton + all sections/*.md
- checkSkillParity sectioned mode: content checks against the union; minBytes/
  maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes
  asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a
  small skeleton would otherwise make the size floor toothless [Codex #12].

Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the
same commit it lands. Monolith path byte-identical (verified: pre-existing
investigate 1.053 ratio drift fails the same with this change stashed).

7 sectioned-parity tests + existing parity tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(ship): carve into skeleton + on-demand sections (Claude) (T9)

ship/SKILL.md drops 167KB → 68.7KB (~59% of the always-loaded skill) by moving
8 prose-heavy steps into ship/sections/*.md, read on demand:
tests, test-coverage, plan-completion, review-army, greptile, adversarial,
changelog, pr-body. Step 12's version logic now calls the tested
gstack-version-bump CLI instead of inline bash.

Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton +
generated section files) and INLINES the content on every other host, so external
hosts keep the full monolith — verified factory at 162KB with no sections dir.
{{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE
manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures.
Multi-pass resolve expands inlined sections' own resolvers.

Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes
asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/
golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads.
Free suite green except the pre-existing investigate parity drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): manifest-consistency + context-parity + requiredReads helper (T9)

Free deterministic guards for the carve:
- required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the
  mechanical layer-5 check that the agent Read the sections its situation needs
  (required set comes from the fixture, not the passive manifest)
- section-manifest-consistency: 3-tier orphan classification (generated orphan +
  hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and
  pins the PASSIVE-manifest contract (no applies_when/required_for)
- template-context-parity: generated sections have zero unresolved placeholders
  and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW)
  rendered — proving sections resolve with the parent skillName, not 'sections'

16 tests, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): section-loading E2E + idempotency CLI detection (T9)

- skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan
  mode against a fresh version-changing fixture and asserts the agent Read the
  required sections (review-army + changelog). Runs against the INSTALLED skill
  (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface
  [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip.
- skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12
  now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead
  of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a
  gstack-version-bump-write re-bump regression signal.
- touchfiles: register ship-section-loading (periodic) + extend idempotency deps
  with bin/gstack-version-bump + scripts/resolvers/sections.ts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): union-read redaction wiring test for the carve (T9)

main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the
carve, not the skeleton template. Read skeleton + section templates union so the
redaction-wiring assertions follow the relocated content. 9/9 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 12:09:10 -07:00

231 lines
10 KiB
TypeScript
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
/**
* Per-skill SKILL.md size budget regression (v1.46.0.0 T5).
*
* Asserts that no skill's generated SKILL.md grew beyond the v1.47.0.0
* baseline. Catches preamble/resolver changes that bloat skills back to
* the pre-compression size. Free — pure file IO + JSON diff.
*
* Baseline rebased v1.44.1 → v1.47.0.0 in the AskUserQuestion split-rule
* PR after main merged GSTACK_PLAN_MODE + /spec, pushing the v1.44.1
* anchor past the 5% ratchet. Historical v1.44.1.json and v1.46.0.0.json
* are retained in test/fixtures/ for reference.
*
* Why a separate test from skill-budget-regression.test.ts: that one
* compares LIVE eval runs (tool calls, turns, cost); this one compares
* static SKILL.md sizes. Both gate-tier.
*
* The baseline lives at test/fixtures/parity-baseline-v1.47.0.0.json,
* captured by scripts/capture-baseline.ts before any Phase A work landed.
*
* Override:
* - GSTACK_SIZE_BUDGET_RATIO=<n> changes the per-skill regression ratio.
* Default 1.0 (no growth allowed). Set to 1.10 to permit 10% growth
* (e.g., during deliberate feature additions that the catalog trim
* doesn't offset).
* - GSTACK_SIZE_BUDGET_OVERRIDE_REASON="text" allows a regression to
* pass and logs the reason to ~/.gstack/analytics/spend-overrides.jsonl
* for audit. Use sparingly; the next baseline should bake in the new
* size.
*/
import { describe, test, expect } from 'bun:test';
import * as fs from 'fs';
import * as path from 'path';
import { captureBaseline, type ParityBaseline } from './helpers/capture-parity-baseline';
import { logBudgetOverride } from './helpers/budget-override';
const REPO_ROOT = path.resolve(import.meta.dir, '..');
const BASELINE_PATH = path.join(REPO_ROOT, 'test', 'fixtures', 'parity-baseline-v1.47.0.0.json');
// Default per-skill ratio is 1.50 (50% growth tolerance). Adjusted v1.52.0.0
// (cathedral cap audit) from 1.05 → 1.50: a 5% ratio tripped on legitimate
// feature additions (e.g., plan-tune cathedral T13 grew SKILL.md ×1.24
// adding load-bearing Dream cycle + Audit unmarked + Recent auto-decisions
// surfaces). Real bloat is 2-3×; this catches that while not tripping on
// normal feature scope. The always-loaded catalog cost is enforced
// separately with a hard ceiling.
const DEFAULT_RATIO = 1.50;
const RATIO = Number(process.env.GSTACK_SIZE_BUDGET_RATIO) || DEFAULT_RATIO;
interface Regression {
skill: string;
beforeBytes: number;
afterBytes: number;
growth: number;
}
describe('SKILL.md size budget regression (gate, free)', () => {
test('parity-baseline-v1.47.0.0.json exists', () => {
expect(fs.existsSync(BASELINE_PATH)).toBe(true);
});
test('no skill exceeds v1.47.0.0 baseline size × ratio', () => {
const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
const current = captureBaseline({ repoRoot: REPO_ROOT });
const regressions: Regression[] = [];
for (const [skill, before] of Object.entries(baseline.skills)) {
const after = current.skills[skill];
if (!after) continue; // skill removed since v1.44 — not a regression
if (after.skillMdBytes <= before.skillMdBytes * RATIO) continue;
regressions.push({
skill,
beforeBytes: before.skillMdBytes,
afterBytes: after.skillMdBytes,
growth: after.skillMdBytes / before.skillMdBytes,
});
}
if (regressions.length === 0) return;
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
if (overrideReason) {
logBudgetOverride({
scope: 'skill-size-budget',
reason: overrideReason,
details: { ratio: RATIO, regressions },
});
// eslint-disable-next-line no-console
console.warn(
`[skill-size-budget] OVERRIDE APPLIED (${overrideReason}) — ${regressions.length} regression(s) allowed:`,
);
for (const r of regressions) {
// eslint-disable-next-line no-console
console.warn(` ${r.skill}: ${r.beforeBytes}${r.afterBytes} bytes (×${r.growth.toFixed(2)})`);
}
return;
}
const msg = regressions.map(r =>
` ${r.skill}: ${r.beforeBytes}${r.afterBytes} bytes (×${r.growth.toFixed(2)})`,
).join('\n');
throw new Error(
`${regressions.length} skill(s) regressed past v1.47.0.0 baseline × ${RATIO}:\n${msg}\n` +
`Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" to allow and audit-log.`,
);
});
test('total corpus byte count does not regress past baseline × ratio', () => {
const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
const current = captureBaseline({ repoRoot: REPO_ROOT });
const ratio = current.totalCorpusBytes / baseline.totalCorpusBytes;
if (current.totalCorpusBytes <= baseline.totalCorpusBytes * RATIO) {
// eslint-disable-next-line no-console
console.log(
`[skill-size-budget] corpus OK: ${baseline.totalCorpusBytes}${current.totalCorpusBytes} bytes (×${ratio.toFixed(3)})`,
);
return;
}
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
if (overrideReason) {
logBudgetOverride({
scope: 'skill-size-budget-corpus',
reason: overrideReason,
details: { ratio: RATIO, observed: ratio, before: baseline.totalCorpusBytes, after: current.totalCorpusBytes },
});
return;
}
throw new Error(
`Total corpus regressed past v1.47.0.0 baseline × ${RATIO}: ` +
`${baseline.totalCorpusBytes}${current.totalCorpusBytes} bytes (×${ratio.toFixed(3)}). ` +
`Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON to allow.`,
);
});
/**
* Gap E (v1.46.0.0): per-skill min-size floor.
*
* The existing skill-coverage-floor enforces body ≥ 200 bytes, which is
* a tiny noise floor. A skill that was 100 KB at v1.47.0.0 and shrinks to
* 250 bytes passes that check despite losing 99.75% of content. The
* parity-suite content invariants cover this for 10 hand-picked skills
* (cso, ship, plan-ceo, etc.); the remaining 41 skills had no per-skill
* shrinkage floor.
*
* Floor: 80% of the v1.47.0.0 baseline. v1.46 actual shrinkage is <1% per
* skill, so this is a comfortable ceiling that still catches accidental
* mass deletion (e.g., a refactor that strips the body of a skill).
*
* v2.0.0.0 will introduce the sections/ pattern for 5 heavyweights
* (ship, plan-ceo-review, office-hours, plan-eng-review,
* plan-design-review). Those skills will legitimately shrink to ~15 KB
* skeletons. When that lands, add them to SECTIONS_EXTRACTED so the floor
* relaxes for them.
*/
test('no skill shrinks past 80% of v1.47.0.0 baseline (catches accidental body strip)', () => {
const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
const current = captureBaseline({ repoRoot: REPO_ROOT });
const MIN_RATIO = 0.80; // a skill at <80% of its v1.44 size signals mass-deletion
// Carved skills (v2 plan T9): the skeleton SKILL.md intentionally shrinks
// because prose moved into sections/*.md. The union size is guarded instead
// by the sectioned ship invariant in parity-harness.ts (minBytes on the
// skeleton+sections union), so exempt the skeleton from the body-strip floor.
const SECTIONS_EXTRACTED = new Set<string>(['ship']);
const undershoots: Array<{
skill: string; beforeBytes: number; afterBytes: number; ratio: number;
}> = [];
for (const [skill, before] of Object.entries(baseline.skills)) {
if (SECTIONS_EXTRACTED.has(skill)) continue;
const after = current.skills[skill];
if (!after) continue; // skill removed since baseline — separate concern
const ratio = after.skillMdBytes / before.skillMdBytes;
if (ratio < MIN_RATIO) {
undershoots.push({
skill, beforeBytes: before.skillMdBytes, afterBytes: after.skillMdBytes, ratio,
});
}
}
if (undershoots.length === 0) return;
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
if (overrideReason) {
logBudgetOverride({
scope: 'skill-size-budget-floor',
reason: overrideReason,
details: { min_ratio: MIN_RATIO, undershoots },
});
// eslint-disable-next-line no-console
console.warn(
`[skill-size-budget-floor] OVERRIDE APPLIED (${overrideReason}) — ${undershoots.length} undershoot(s) allowed`,
);
return;
}
const msg = undershoots.map(u =>
` ${u.skill}: ${u.beforeBytes}${u.afterBytes} bytes (×${u.ratio.toFixed(2)} — below ${MIN_RATIO} floor)`,
).join('\n');
throw new Error(
`${undershoots.length} skill(s) shrunk past v1.47.0.0 × ${MIN_RATIO} floor:\n${msg}\n` +
`This usually signals accidental body strip (e.g., a resolver returning empty, a ` +
`template losing a section). If the shrinkage is intentional (e.g., the skill moved ` +
`to the sections/ pattern), add it to SECTIONS_EXTRACTED in this test. Override: ` +
`GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why" allows + audit-logs.`,
);
});
test('catalog token estimate stays compressed (v1.45 target ≤ 7000)', () => {
const current = captureBaseline({ repoRoot: REPO_ROOT });
const v145Target = 7000;
if (current.estTotalCatalogTokens <= v145Target) {
// eslint-disable-next-line no-console
console.log(`[skill-size-budget] catalog OK: ~${current.estTotalCatalogTokens} tokens (target ≤${v145Target})`);
return;
}
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
if (overrideReason) {
logBudgetOverride({
scope: 'skill-size-budget-catalog',
reason: overrideReason,
details: { target: v145Target, observed: current.estTotalCatalogTokens },
});
return;
}
throw new Error(
`Catalog token estimate regressed past v1.45 target: ${current.estTotalCatalogTokens} tokens > ${v145Target}. ` +
`T4 catalog trim should keep this under control. Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON to allow.`,
);
});
});