mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-05 17:46:37 +02:00
cab774cced
* refactor(plan-ceo-review): carve review body into on-demand section
Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).
- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
outside-voice rules, how-to-ask, Required Outputs, registries, Completion
Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
{{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
Section self-check. All of Step 0 (the scope/mode conversation) stays in
the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.
Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.
Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).
Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
(measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
Step 0, review body absent from skeleton, report writer in the section,
nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
installed skill first (Codex P1), drives full Step 0, asserts the section
is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)
MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks
Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier->=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded
Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B
Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).
The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): consistency — same trigger N runs, stable format + substance
Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): behavioral matrix across AUQ-heavy skills
Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(codex): live recommendation-substance grade for /codex
Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): deterministic trigger for format-compliance gate
A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(office-hours): carve Phase 5+6 into on-demand section
Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.
Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.
Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
(measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
office-hours so relocated Phase 5/6 prose still counts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(preamble): carve CJK-escaping manual to on-demand doc
The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.
Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-eng-review): carve review body into on-demand section
Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.
Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-design-review): carve review body into on-demand section
Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.
Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-devex-review): carve review body into on-demand section
Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.
Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: refresh ship golden baselines + gbrain-detection union after carves
Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
was carved, so the detection assertions now check both files.
Full `bun test` green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): grade format-compliance gate from SDK capture, not the TUI
The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.
Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: fix carve-broken CI evals (union reads + section fixtures)
Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:
- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
like "## Review Sections" / "## CRITICAL RULE" that now live in
sections/review-sections.md. The markers vanished from the skeleton, so the
judge scored empty/wrong content. Fix: read the skeleton+sections union.
Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
(25/25).
- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
fixture, not sections/. The carved skill's STOP pointed at a section file that
was absent, so the model improvised a compressed report table instead of the
canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
canonical table emitted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)
Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.
- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
the section, so check the regenerated skeleton+section UNION for the gbrain put
block, ship both into the workdir, and restore both (the section regen was also
leaking into the working tree — finally now restores it).
ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: migrate section-loading E2E to lossless SDK tool-stream detection
The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.
They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.
Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: consolidate branch to v1.56.0.0 (single MINOR above main)
The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.
VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
234 lines
10 KiB
TypeScript
234 lines
10 KiB
TypeScript
/**
|
||
* Per-skill SKILL.md size budget regression (v1.46.0.0 T5).
|
||
*
|
||
* Asserts that no skill's generated SKILL.md grew beyond the v1.47.0.0
|
||
* baseline. Catches preamble/resolver changes that bloat skills back to
|
||
* the pre-compression size. Free — pure file IO + JSON diff.
|
||
*
|
||
* Baseline rebased v1.44.1 → v1.47.0.0 in the AskUserQuestion split-rule
|
||
* PR after main merged GSTACK_PLAN_MODE + /spec, pushing the v1.44.1
|
||
* anchor past the 5% ratchet. Historical v1.44.1.json and v1.46.0.0.json
|
||
* are retained in test/fixtures/ for reference.
|
||
*
|
||
* Why a separate test from skill-budget-regression.test.ts: that one
|
||
* compares LIVE eval runs (tool calls, turns, cost); this one compares
|
||
* static SKILL.md sizes. Both gate-tier.
|
||
*
|
||
* The baseline lives at test/fixtures/parity-baseline-v1.47.0.0.json,
|
||
* captured by scripts/capture-baseline.ts before any Phase A work landed.
|
||
*
|
||
* Override:
|
||
* - GSTACK_SIZE_BUDGET_RATIO=<n> changes the per-skill regression ratio.
|
||
* Default 1.0 (no growth allowed). Set to 1.10 to permit 10% growth
|
||
* (e.g., during deliberate feature additions that the catalog trim
|
||
* doesn't offset).
|
||
* - GSTACK_SIZE_BUDGET_OVERRIDE_REASON="text" allows a regression to
|
||
* pass and logs the reason to ~/.gstack/analytics/spend-overrides.jsonl
|
||
* for audit. Use sparingly; the next baseline should bake in the new
|
||
* size.
|
||
*/
|
||
|
||
import { describe, test, expect } from 'bun:test';
|
||
import * as fs from 'fs';
|
||
import * as path from 'path';
|
||
import { captureBaseline, type ParityBaseline } from './helpers/capture-parity-baseline';
|
||
import { logBudgetOverride } from './helpers/budget-override';
|
||
|
||
const REPO_ROOT = path.resolve(import.meta.dir, '..');
|
||
const BASELINE_PATH = path.join(REPO_ROOT, 'test', 'fixtures', 'parity-baseline-v1.47.0.0.json');
|
||
|
||
// Default per-skill ratio is 1.50 (50% growth tolerance). Adjusted v1.52.0.0
|
||
// (cathedral cap audit) from 1.05 → 1.50: a 5% ratio tripped on legitimate
|
||
// feature additions (e.g., plan-tune cathedral T13 grew SKILL.md ×1.24
|
||
// adding load-bearing Dream cycle + Audit unmarked + Recent auto-decisions
|
||
// surfaces). Real bloat is 2-3×; this catches that while not tripping on
|
||
// normal feature scope. The always-loaded catalog cost is enforced
|
||
// separately with a hard ceiling.
|
||
const DEFAULT_RATIO = 1.50;
|
||
const RATIO = Number(process.env.GSTACK_SIZE_BUDGET_RATIO) || DEFAULT_RATIO;
|
||
|
||
interface Regression {
|
||
skill: string;
|
||
beforeBytes: number;
|
||
afterBytes: number;
|
||
growth: number;
|
||
}
|
||
|
||
describe('SKILL.md size budget regression (gate, free)', () => {
|
||
test('parity-baseline-v1.47.0.0.json exists', () => {
|
||
expect(fs.existsSync(BASELINE_PATH)).toBe(true);
|
||
});
|
||
|
||
test('no skill exceeds v1.47.0.0 baseline size × ratio', () => {
|
||
const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
|
||
const current = captureBaseline({ repoRoot: REPO_ROOT });
|
||
|
||
const regressions: Regression[] = [];
|
||
for (const [skill, before] of Object.entries(baseline.skills)) {
|
||
const after = current.skills[skill];
|
||
if (!after) continue; // skill removed since v1.44 — not a regression
|
||
if (after.skillMdBytes <= before.skillMdBytes * RATIO) continue;
|
||
regressions.push({
|
||
skill,
|
||
beforeBytes: before.skillMdBytes,
|
||
afterBytes: after.skillMdBytes,
|
||
growth: after.skillMdBytes / before.skillMdBytes,
|
||
});
|
||
}
|
||
|
||
if (regressions.length === 0) return;
|
||
|
||
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
|
||
if (overrideReason) {
|
||
logBudgetOverride({
|
||
scope: 'skill-size-budget',
|
||
reason: overrideReason,
|
||
details: { ratio: RATIO, regressions },
|
||
});
|
||
// eslint-disable-next-line no-console
|
||
console.warn(
|
||
`[skill-size-budget] OVERRIDE APPLIED (${overrideReason}) — ${regressions.length} regression(s) allowed:`,
|
||
);
|
||
for (const r of regressions) {
|
||
// eslint-disable-next-line no-console
|
||
console.warn(` ${r.skill}: ${r.beforeBytes} → ${r.afterBytes} bytes (×${r.growth.toFixed(2)})`);
|
||
}
|
||
return;
|
||
}
|
||
|
||
const msg = regressions.map(r =>
|
||
` ${r.skill}: ${r.beforeBytes} → ${r.afterBytes} bytes (×${r.growth.toFixed(2)})`,
|
||
).join('\n');
|
||
throw new Error(
|
||
`${regressions.length} skill(s) regressed past v1.47.0.0 baseline × ${RATIO}:\n${msg}\n` +
|
||
`Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" to allow and audit-log.`,
|
||
);
|
||
});
|
||
|
||
test('total corpus byte count does not regress past baseline × ratio', () => {
|
||
const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
|
||
const current = captureBaseline({ repoRoot: REPO_ROOT });
|
||
const ratio = current.totalCorpusBytes / baseline.totalCorpusBytes;
|
||
if (current.totalCorpusBytes <= baseline.totalCorpusBytes * RATIO) {
|
||
// eslint-disable-next-line no-console
|
||
console.log(
|
||
`[skill-size-budget] corpus OK: ${baseline.totalCorpusBytes} → ${current.totalCorpusBytes} bytes (×${ratio.toFixed(3)})`,
|
||
);
|
||
return;
|
||
}
|
||
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
|
||
if (overrideReason) {
|
||
logBudgetOverride({
|
||
scope: 'skill-size-budget-corpus',
|
||
reason: overrideReason,
|
||
details: { ratio: RATIO, observed: ratio, before: baseline.totalCorpusBytes, after: current.totalCorpusBytes },
|
||
});
|
||
return;
|
||
}
|
||
throw new Error(
|
||
`Total corpus regressed past v1.47.0.0 baseline × ${RATIO}: ` +
|
||
`${baseline.totalCorpusBytes} → ${current.totalCorpusBytes} bytes (×${ratio.toFixed(3)}). ` +
|
||
`Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON to allow.`,
|
||
);
|
||
});
|
||
|
||
/**
|
||
* Gap E (v1.46.0.0): per-skill min-size floor.
|
||
*
|
||
* The existing skill-coverage-floor enforces body ≥ 200 bytes, which is
|
||
* a tiny noise floor. A skill that was 100 KB at v1.47.0.0 and shrinks to
|
||
* 250 bytes passes that check despite losing 99.75% of content. The
|
||
* parity-suite content invariants cover this for 10 hand-picked skills
|
||
* (cso, ship, plan-ceo, etc.); the remaining 41 skills had no per-skill
|
||
* shrinkage floor.
|
||
*
|
||
* Floor: 80% of the v1.47.0.0 baseline. v1.46 actual shrinkage is <1% per
|
||
* skill, so this is a comfortable ceiling that still catches accidental
|
||
* mass deletion (e.g., a refactor that strips the body of a skill).
|
||
*
|
||
* v2.0.0.0 introduces the sections/ pattern for 5 heavyweights
|
||
* (ship, plan-ceo-review, office-hours, plan-eng-review,
|
||
* plan-design-review). Carved so far: ship (skeleton ~83 KB) and
|
||
* plan-ceo-review (skeleton ~81 KB, down from the 138 KB monolith). Those
|
||
* skeletons legitimately fall below the 80% body-strip floor, so each carved
|
||
* skill is added to SECTIONS_EXTRACTED; its union is guarded instead by the
|
||
* sectioned invariant in parity-harness.ts (minBytes on skeleton+sections).
|
||
* Add the remaining three here as they carve.
|
||
*/
|
||
test('no skill shrinks past 80% of v1.47.0.0 baseline (catches accidental body strip)', () => {
|
||
const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
|
||
const current = captureBaseline({ repoRoot: REPO_ROOT });
|
||
const MIN_RATIO = 0.80; // a skill at <80% of its v1.44 size signals mass-deletion
|
||
// Carved skills (v2 plan T9): the skeleton SKILL.md intentionally shrinks
|
||
// because prose moved into sections/*.md. The union size is guarded instead
|
||
// by the sectioned ship invariant in parity-harness.ts (minBytes on the
|
||
// skeleton+sections union), so exempt the skeleton from the body-strip floor.
|
||
const SECTIONS_EXTRACTED = new Set<string>(['ship', 'plan-ceo-review', 'office-hours', 'plan-eng-review', 'plan-design-review', 'plan-devex-review']);
|
||
|
||
const undershoots: Array<{
|
||
skill: string; beforeBytes: number; afterBytes: number; ratio: number;
|
||
}> = [];
|
||
for (const [skill, before] of Object.entries(baseline.skills)) {
|
||
if (SECTIONS_EXTRACTED.has(skill)) continue;
|
||
const after = current.skills[skill];
|
||
if (!after) continue; // skill removed since baseline — separate concern
|
||
const ratio = after.skillMdBytes / before.skillMdBytes;
|
||
if (ratio < MIN_RATIO) {
|
||
undershoots.push({
|
||
skill, beforeBytes: before.skillMdBytes, afterBytes: after.skillMdBytes, ratio,
|
||
});
|
||
}
|
||
}
|
||
|
||
if (undershoots.length === 0) return;
|
||
|
||
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
|
||
if (overrideReason) {
|
||
logBudgetOverride({
|
||
scope: 'skill-size-budget-floor',
|
||
reason: overrideReason,
|
||
details: { min_ratio: MIN_RATIO, undershoots },
|
||
});
|
||
// eslint-disable-next-line no-console
|
||
console.warn(
|
||
`[skill-size-budget-floor] OVERRIDE APPLIED (${overrideReason}) — ${undershoots.length} undershoot(s) allowed`,
|
||
);
|
||
return;
|
||
}
|
||
|
||
const msg = undershoots.map(u =>
|
||
` ${u.skill}: ${u.beforeBytes} → ${u.afterBytes} bytes (×${u.ratio.toFixed(2)} — below ${MIN_RATIO} floor)`,
|
||
).join('\n');
|
||
throw new Error(
|
||
`${undershoots.length} skill(s) shrunk past v1.47.0.0 × ${MIN_RATIO} floor:\n${msg}\n` +
|
||
`This usually signals accidental body strip (e.g., a resolver returning empty, a ` +
|
||
`template losing a section). If the shrinkage is intentional (e.g., the skill moved ` +
|
||
`to the sections/ pattern), add it to SECTIONS_EXTRACTED in this test. Override: ` +
|
||
`GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why" allows + audit-logs.`,
|
||
);
|
||
});
|
||
|
||
test('catalog token estimate stays compressed (v1.45 target ≤ 7000)', () => {
|
||
const current = captureBaseline({ repoRoot: REPO_ROOT });
|
||
const v145Target = 7000;
|
||
if (current.estTotalCatalogTokens <= v145Target) {
|
||
// eslint-disable-next-line no-console
|
||
console.log(`[skill-size-budget] catalog OK: ~${current.estTotalCatalogTokens} tokens (target ≤${v145Target})`);
|
||
return;
|
||
}
|
||
const overrideReason = process.env.GSTACK_SIZE_BUDGET_OVERRIDE_REASON?.trim();
|
||
if (overrideReason) {
|
||
logBudgetOverride({
|
||
scope: 'skill-size-budget-catalog',
|
||
reason: overrideReason,
|
||
details: { target: v145Target, observed: current.estTotalCatalogTokens },
|
||
});
|
||
return;
|
||
}
|
||
throw new Error(
|
||
`Catalog token estimate regressed past v1.45 target: ${current.estTotalCatalogTokens} tokens > ${v145Target}. ` +
|
||
`T4 catalog trim should keep this under control. Override: set GSTACK_SIZE_BUDGET_OVERRIDE_REASON to allow.`,
|
||
);
|
||
});
|
||
});
|