mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-01 15:51:41 +02:00
46c1fae7f1
* feat(test): transcript-section-logger + ship-action fingerprint (T10) Pure-analysis module over a SkillTestResult/NDJSON transcript: - extractSectionReads(): which sections/*.md a run opened (post-carve check) - extractShipActions(): observable action fingerprint (merge/test/bump/ changelog/commit/push/pr) that works on the MONOLITH too, so a baseline captured before the carve can detect a sectioned-ship regression - baseline read/write + compareShipActions() for baseline-first dogf(T10) Baseline-first answers the Codex outside-voice critique that a logger in the same PR as the carve is post-failure telemetry without a pre-carve reference. 11 unit tests, all green. Paid monolith baseline capture runs separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(pipeline): section discovery + generation machinery (T9) - discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl - gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext as shared helpers (processTemplate and the new processSectionTemplate both call them, so a sanitization/rewrite fix can't miss sections) [C1] - processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice), parent-skill TemplateContext (skillName pinned to parent, not 'sections', so appliesTo gating + tier behave identically), per-host output routing - --host all now fails the build on ANY host failure, not just claude, so a stale external-host output can't slip the freshness gate [Codex outside-voice #9] Inert until a skill is carved (no sections/ dirs exist yet). Refactor is output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE. 5 discovery unit tests + 389 gen-skill-docs tests green. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9) Two install targets cherry-pick SKILL.md and would leave a carved skill's sections/ behind, 404ing a runtime 'Read sections/<name>.md': - link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows gets a fresh copy on every ./setup) - kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under ~/.kiro, not ~/.codex/~/.claude codex/factory/opencode link the whole generated dir, so sections ride free. Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a skill is carved. Static-tripwire test + windows-fallback invariant green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9) Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a tested CLI instead of bash prose the agent re-derives each run. - classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION vs origin/<base>:VERSION vs package.json.version (pure reader) - write: validated dual-write to VERSION + package.json (FRESH bump) - repair: DRIFT_STALE_PKG sync, no re-bump Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from skippable prose into code that can't be skipped or misread. 15 tests (exhaustive state matrix + write/repair fs + real-git classify). 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(parity): sectioned-skill parity capability — guards the carve (T9) Carved skills (skeleton + sections/*.md) need parity checks that see relocated content, or moving a phrase into a section reads as 'lost': - readSkillForParity(): union skeleton + all sections/*.md - checkSkillParity sectioned mode: content checks against the union; minBytes/ maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a small skeleton would otherwise make the size floor toothless [Codex #12]. Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the same commit it lands. Monolith path byte-identical (verified: pre-existing investigate 1.053 ratio drift fails the same with this change stashed). 7 sectioned-parity tests + existing parity tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(ship): carve into skeleton + on-demand sections (Claude) (T9) ship/SKILL.md drops 167KB → 68.7KB (~59% of the always-loaded skill) by moving 8 prose-heavy steps into ship/sections/*.md, read on demand: tests, test-coverage, plan-completion, review-army, greptile, adversarial, changelog, pr-body. Step 12's version logic now calls the tested gstack-version-bump CLI instead of inline bash. Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton + generated section files) and INLINES the content on every other host, so external hosts keep the full monolith — verified factory at 162KB with no sections dir. {{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures. Multi-pass resolve expands inlined sections' own resolvers. Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes asserts the shrink). 
Carve-fallout fixed across gen-skill-docs/skill-validation/ golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads. Free suite green except the pre-existing investigate parity drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): manifest-consistency + context-parity + requiredReads helper (T9) Free deterministic guards for the carve: - required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the mechanical layer-5 check that the agent Read the sections its situation needs (required set comes from the fixture, not the passive manifest) - section-manifest-consistency: 3-tier orphan classification (generated orphan + hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and pins the PASSIVE-manifest contract (no applies_when/required_for) - template-context-parity: generated sections have zero unresolved placeholders and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW) rendered — proving sections resolve with the parent skillName, not 'sections' 16 tests, all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): section-loading E2E + idempotency CLI detection (T9) - skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan mode against a fresh version-changing fixture and asserts the agent Read the required sections (review-army + changelog). Runs against the INSTALLED skill (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip. - skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12 now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a gstack-version-bump-write re-bump regression signal. 
- touchfiles: register ship-section-loading (periodic) + extend idempotency deps with bin/gstack-version-bump + scripts/resolvers/sections.ts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): union-read redaction wiring test for the carve (T9) main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the carve, not the skeleton template. Read skeleton + section templates union so the redaction-wiring assertions follow the relocated content. 9/9 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
298 lines
10 KiB
TypeScript
/**
 * Cathedral parity-eval harness (v1.45.0.0 T0b).
 *
 * Compares CURRENT SKILL.md output to a v1.44.1 golden baseline along three
 * axes: STRUCTURE (frontmatter shape), CONTENT (must-preserve phrases per
 * skill family), and SIZE (per-skill byte budget). The fourth axis,
 * BEHAVIORAL parity via LLM-as-judge, runs on top of this harness in the
 * periodic-tier eval suite (paid, ~$0.20 per skill judge call).
 *
 * The structural + content checks ship in v1.45.0.0 as the foundation; the
 * LLM-judge layer lands in v2.0.0.0 alongside the sections/ pattern. Both
 * use this module's APIs.
 *
 * Why a separate harness from skill-size-budget.test.ts: that one enforces
 * size discipline only. This module supports content invariants per skill
 * family (e.g., cso must preserve OWASP/STRIDE; plan-ceo must preserve
 * mode-selection phrasing) so future compression can't silently strip
 * load-bearing prose even when size stays within ratio.
 */

import * as fs from 'fs';
import * as path from 'path';
import type { ParityBaseline, SkillBaselineEntry } from './capture-parity-baseline';
import { captureBaseline } from './capture-parity-baseline';

export interface ParityInvariant {
  skill: string;
  /** Phrases that MUST appear in the generated SKILL.md (case-insensitive substring). */
  mustContain?: string[];
  /** Markdown H2 headings that MUST appear (case-sensitive substring match). */
  mustHaveHeadings?: string[];
  /** Maximum byte size growth ratio vs baseline. 1.0 = no growth allowed. */
  maxSizeRatio?: number;
  /** Minimum byte size (catches over-stripping cliffs). */
  minBytes?: number;
  /**
   * Carved skill (v2 plan T9): the skill is a skeleton SKILL.md plus on-demand
   * sections/*.md. When true:
   *  - mustContain / mustHaveHeadings run against skeleton + ALL sections unioned,
   *    so a phrase that moved into a section still counts (content preserved, just
   *    relocated; that's the whole point of the carve).
   *  - minBytes / maxSizeRatio run against the UNION bytes, not the skeleton alone
   *    (total behavior must not shrink; the win is what's no longer always-loaded,
   *    which the union size deliberately does NOT measure; maxSkeletonBytes does).
   *  - maxSkeletonBytes asserts the always-loaded skeleton actually shrank.
   *    Without this, lowering minBytes to fit a 65KB skeleton would make the size
   *    floor toothless (Codex outside-voice #12).
   */
  sectioned?: boolean;
  /** Max bytes for the always-loaded skeleton SKILL.md (carved skills only). */
  maxSkeletonBytes?: number;
}

export interface ParityCheckResult {
  skill: string;
  passed: boolean;
  failures: string[];
}

/**
 * Read a skill's check text + sizes. For a carved skill, union the skeleton with
 * every sections/*.md so relocated content still counts and the union size
 * measures total preserved behavior; skeletonBytes is reported separately so the
 * always-loaded shrink can be asserted. For a monolith, text == skeleton.
 */
export function readSkillForParity(
  repoRoot: string,
  skill: string,
  sectioned: boolean,
): { text: string; unionBytes: number; skeletonBytes: number } {
  const skeleton = fs.readFileSync(path.join(repoRoot, skill, 'SKILL.md'), 'utf-8');
  const skeletonBytes = Buffer.byteLength(skeleton, 'utf-8');
  if (!sectioned) return { text: skeleton, unionBytes: skeletonBytes, skeletonBytes };

  let text = skeleton;
  let unionBytes = skeletonBytes;
  const sectionsDir = path.join(repoRoot, skill, 'sections');
  if (fs.existsSync(sectionsDir)) {
    for (const f of fs.readdirSync(sectionsDir).sort()) {
      if (!f.endsWith('.md')) continue;
      const sec = fs.readFileSync(path.join(sectionsDir, f), 'utf-8');
      text += '\n' + sec;
      unionBytes += Buffer.byteLength(sec, 'utf-8');
    }
  }
  return { text, unionBytes, skeletonBytes };
}

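// Illustrative sketch, not part of the original harness: the same union
// arithmetic as readSkillForParity, applied to in-memory strings so it can be
// reasoned about without a repo on disk. `unionOf` is a hypothetical name.
// Note that the '\n' joiners are added to the text but are not counted in
// unionBytes, which sums only the skeleton and section bytes.
export function unionOf(
  skeleton: string,
  sections: string[],
): { text: string; unionBytes: number; skeletonBytes: number } {
  const skeletonBytes = Buffer.byteLength(skeleton, 'utf-8');
  let text = skeleton;
  let unionBytes = skeletonBytes;
  for (const sec of sections) {
    text += '\n' + sec;
    unionBytes += Buffer.byteLength(sec, 'utf-8');
  }
  return { text, unionBytes, skeletonBytes };
}
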
export function checkSkillParity(
  invariant: ParityInvariant,
  current: SkillBaselineEntry,
  baseline: SkillBaselineEntry | undefined,
  repoRoot: string,
): ParityCheckResult {
  const failures: string[] = [];
  const needText = !!(invariant.mustContain?.length || invariant.mustHaveHeadings?.length);

  // Resolve the text + size to check against. Carved skills union skeleton +
  // sections; monoliths use the skeleton alone. Read on demand so size-only
  // invariants don't pay for a file read they don't need (monolith path).
  let checkText: string | null = null;
  let checkBytes = current.skillMdBytes;
  if (invariant.sectioned) {
    try {
      const r = readSkillForParity(repoRoot, invariant.skill, true);
      checkText = r.text;
      checkBytes = r.unionBytes;
      if (invariant.maxSkeletonBytes !== undefined && r.skeletonBytes > invariant.maxSkeletonBytes) {
        failures.push(`skeleton ${r.skeletonBytes} > maxSkeletonBytes ${invariant.maxSkeletonBytes}`);
      }
    } catch (err) {
      failures.push(`cannot read carved skill ${invariant.skill}: ${(err as Error).message}`);
    }
  } else if (needText) {
    try {
      checkText = fs.readFileSync(path.join(repoRoot, invariant.skill, 'SKILL.md'), 'utf-8');
    } catch (err) {
      failures.push(`cannot read ${path.join(repoRoot, invariant.skill, 'SKILL.md')}: ${(err as Error).message}`);
    }
  }

  // SIZE checks (union bytes for carved skills, skeleton bytes for monoliths)
  if (invariant.maxSizeRatio !== undefined && baseline) {
    const ratio = checkBytes / baseline.skillMdBytes;
    if (ratio > invariant.maxSizeRatio) {
      failures.push(`size ratio ${ratio.toFixed(3)} > maxSizeRatio ${invariant.maxSizeRatio}`);
    }
  }
  if (invariant.minBytes !== undefined && checkBytes < invariant.minBytes) {
    failures.push(`size ${checkBytes} < minBytes ${invariant.minBytes}`);
  }

  // CONTENT checks
  if (needText && checkText !== null) {
    const lower = checkText.toLowerCase();
    for (const phrase of invariant.mustContain ?? []) {
      if (!lower.includes(phrase.toLowerCase())) {
        failures.push(`missing required phrase: "${phrase}"`);
      }
    }
    for (const heading of invariant.mustHaveHeadings ?? []) {
      if (!checkText.includes(heading)) {
        failures.push(`missing required heading: "${heading}"`);
      }
    }
  }

  return {
    skill: invariant.skill,
    passed: failures.length === 0,
    failures,
  };
}

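// Illustrative sketch, not part of the original harness: the CONTENT rules
// applied by checkSkillParity, isolated as a pure function so the matching
// semantics are easy to see. `findMissingContent` and its "phrase:" /
// "heading:" labels are hypothetical. Phrases match case-insensitively;
// headings match as case-sensitive substrings, so '## Preamble' would also be
// satisfied by a '### Preamble' line.
export function findMissingContent(
  text: string,
  mustContain: string[] = [],
  mustHaveHeadings: string[] = [],
): string[] {
  const lower = text.toLowerCase();
  const missing: string[] = [];
  for (const phrase of mustContain) {
    if (!lower.includes(phrase.toLowerCase())) missing.push(`phrase: ${phrase}`);
  }
  for (const heading of mustHaveHeadings) {
    if (!text.includes(heading)) missing.push(`heading: ${heading}`);
  }
  return missing;
}
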
export interface ParityReport {
  baselineTag: string;
  currentCapturedAt: string;
  totalChecks: number;
  passed: number;
  failed: number;
  details: ParityCheckResult[];
}

export function runParityChecks(opts: {
  repoRoot: string;
  baseline: ParityBaseline;
  invariants: ParityInvariant[];
}): ParityReport {
  const { repoRoot, baseline, invariants } = opts;
  const current = captureBaseline({ repoRoot });
  const details: ParityCheckResult[] = [];
  for (const invariant of invariants) {
    const baselineEntry = baseline.skills[invariant.skill];
    const currentEntry = current.skills[invariant.skill];
    if (!currentEntry) {
      details.push({
        skill: invariant.skill,
        passed: false,
        failures: [`skill removed: ${invariant.skill} present in baseline but not current state`],
      });
      continue;
    }
    details.push(checkSkillParity(invariant, currentEntry, baselineEntry, repoRoot));
  }
  return {
    baselineTag: baseline.tag,
    currentCapturedAt: current.capturedAt,
    totalChecks: details.length,
    passed: details.filter(d => d.passed).length,
    failed: details.filter(d => !d.passed).length,
    details,
  };
}

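// Illustrative sketch, not part of the original harness: one way a caller
// might flatten a report's `details` into printable lines. The helper name
// `summarizeParityFailures` and the "skill: failure" line format are
// assumptions; the parameter type mirrors the shape of ParityCheckResult.
export function summarizeParityFailures(
  details: { skill: string; passed: boolean; failures: string[] }[],
): string[] {
  const lines: string[] = [];
  for (const d of details) {
    if (d.passed) continue;
    for (const failure of d.failures) lines.push(`${d.skill}: ${failure}`);
  }
  return lines;
}
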
/**
 * Standard invariant registry: the v1.45.0.0 set.
 *
 * Each entry pins what must-not-break in a skill family. Extend as future
 * skills land. Phase B (v2.0.0.0) adds LLM-judge invariants on top of these.
 */
export const PARITY_INVARIANTS: ParityInvariant[] = [
  {
    skill: 'cso',
    mustContain: ['OWASP', 'STRIDE', 'daily', 'comprehensive', 'verif'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 30_000,
  },
  {
    // Carved (v2 plan T9): skeleton SKILL.md + sections/*.md. Content checks run
    // against the union (relocated phrases still count); size floors run against
    // the union (total behavior preserved); maxSkeletonBytes asserts the
    // always-loaded skeleton actually shrank from the ~167KB monolith.
    skill: 'ship',
    sectioned: true,
    maxSkeletonBytes: 90_000,
    mustContain: [
      'VERSION',
      'CHANGELOG',
      'review',
      'merge',
      'PR',
    ],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 120_000,
  },
  {
    skill: 'plan-ceo-review',
    mustContain: [
      'SCOPE EXPANSION',
      'SELECTIVE EXPANSION',
      'HOLD SCOPE',
      'SCOPE REDUCTION',
    ],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 80_000,
  },
  {
    skill: 'plan-eng-review',
    mustContain: [
      'Architecture',
      'Code Quality',
      'Test',
      'Performance',
    ],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 70_000,
  },
  {
    skill: 'plan-design-review',
    mustContain: [
      'design',
      'visual',
    ],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 70_000,
  },
  {
    skill: 'review',
    mustContain: ['confidence', 'P1', 'P2'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 70_000,
  },
  {
    skill: 'qa',
    mustContain: ['bug', 'browse', 'fix'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 50_000,
  },
  {
    skill: 'investigate',
    mustContain: ['root cause', 'hypothes'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 30_000,
  },
  {
    skill: 'office-hours',
    mustContain: ['design doc', 'problem statement'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 70_000,
  },
  {
    skill: 'autoplan',
    mustContain: ['ceo', 'eng', 'design'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
    minBytes: 70_000,
  },
];