mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-02 08:11:37 +02:00
v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) (#1806)
* feat(test): transcript-section-logger + ship-action fingerprint (T10)

Pure-analysis module over a SkillTestResult/NDJSON transcript:
- extractSectionReads(): which sections/*.md a run opened (post-carve check)
- extractShipActions(): observable action fingerprint (merge/test/bump/changelog/commit/push/pr) that works on the MONOLITH too, so a baseline captured before the carve can detect a sectioned-ship regression
- baseline read/write + compareShipActions() for baseline-first dogf(T10)

Baseline-first answers the Codex outside-voice critique that a logger in the same PR as the carve is post-failure telemetry without a pre-carve reference. 11 unit tests, all green. Paid monolith baseline capture runs separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(pipeline): section discovery + generation machinery (T9)

- discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl
- gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext as shared helpers (processTemplate and the new processSectionTemplate both call them, so a sanitization/rewrite fix can't miss sections) [C1]
- processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice), parent-skill TemplateContext (skillName pinned to parent, not 'sections', so appliesTo gating + tier behave identically), per-host output routing
- --host all now fails the build on ANY host failure, not just claude, so a stale external-host output can't slip the freshness gate [Codex outside-voice #9]

Inert until a skill is carved (no sections/ dirs exist yet). Refactor is output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE. 5 discovery unit tests + 389 gen-skill-docs tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9)

Two install targets cherry-pick SKILL.md and would leave a carved skill's sections/ behind, 404ing a runtime 'Read sections/<name>.md':
- link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows gets a fresh copy on every ./setup)
- kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under ~/.kiro, not ~/.codex/~/.claude

codex/factory/opencode link the whole generated dir, so sections ride free. Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a skill is carved. Static-tripwire test + windows-fallback invariant green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9)

Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a tested CLI instead of bash prose the agent re-derives each run.
- classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION vs origin/<base>:VERSION vs package.json.version (pure reader)
- write: validated dual-write to VERSION + package.json (FRESH bump)
- repair: DRIFT_STALE_PKG sync, no re-bump

Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from skippable prose into code that can't be skipped or misread. 15 tests (exhaustive state matrix + write/repair fs + real-git classify).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(parity): sectioned-skill parity capability — guards the carve (T9)

Carved skills (skeleton + sections/*.md) need parity checks that see relocated content, or moving a phrase into a section reads as 'lost':
- readSkillForParity(): union skeleton + all sections/*.md
- checkSkillParity sectioned mode: content checks against the union; minBytes/maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a small skeleton would otherwise make the size floor toothless [Codex #12].

Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the same commit it lands. Monolith path byte-identical (verified: pre-existing investigate 1.053 ratio drift fails the same with this change stashed). 7 sectioned-parity tests + existing parity tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(ship): carve into skeleton + on-demand sections (Claude) (T9)

ship/SKILL.md drops 167KB → 68.7KB (~59% of the always-loaded skill) by moving 8 prose-heavy steps into ship/sections/*.md, read on demand: tests, test-coverage, plan-completion, review-army, greptile, adversarial, changelog, pr-body. Step 12's version logic now calls the tested gstack-version-bump CLI instead of inline bash.

Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton + generated section files) and INLINES the content on every other host, so external hosts keep the full monolith — verified factory at 162KB with no sections dir. {{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures. Multi-pass resolve expands inlined sections' own resolvers.

Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads. Free suite green except the pre-existing investigate parity drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): manifest-consistency + context-parity + requiredReads helper (T9)

Free deterministic guards for the carve:
- required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the mechanical layer-5 check that the agent Read the sections its situation needs (required set comes from the fixture, not the passive manifest)
- section-manifest-consistency: 3-tier orphan classification (generated orphan + hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and pins the PASSIVE-manifest contract (no applies_when/required_for)
- template-context-parity: generated sections have zero unresolved placeholders and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW) rendered — proving sections resolve with the parent skillName, not 'sections'

16 tests, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): section-loading E2E + idempotency CLI detection (T9)

- skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan mode against a fresh version-changing fixture and asserts the agent Read the required sections (review-army + changelog). Runs against the INSTALLED skill (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip.
- skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12 now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a gstack-version-bump-write re-bump regression signal.
- touchfiles: register ship-section-loading (periodic) + extend idempotency deps with bin/gstack-version-bump + scripts/resolvers/sections.ts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): union-read redaction wiring test for the carve (T9)

main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the carve, not the skeleton template. Read skeleton + section templates union so the redaction-wiring assertions follow the relocated content. 9/9 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
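The four classify states named for gstack-version-bump can be pictured as a pure function over three version strings. The sketch below is only one plausible reading of the commit message: the name `classifyBump`, the input field names, and the exact rules separating the two drift states are assumptions for illustration, not the CLI's actual code.

```typescript
// Hypothetical sketch. The real gstack-version-bump rules are not shown in this diff.
type BumpState = 'FRESH' | 'ALREADY_BUMPED' | 'DRIFT_STALE_PKG' | 'DRIFT_UNEXPECTED';

interface VersionInputs {
  version: string;     // contents of VERSION on the branch
  baseVersion: string; // contents of origin/<base>:VERSION
  pkgVersion: string;  // "version" field of package.json
}

function classifyBump({ version, baseVersion, pkgVersion }: VersionInputs): BumpState {
  const bumped = version !== baseVersion;
  const pkgInSync = pkgVersion === version;
  if (!bumped && pkgInSync) return 'FRESH';          // nothing bumped yet
  if (bumped && pkgInSync) return 'ALREADY_BUMPED';  // a second bump would be a re-bump
  // Assumed rule: VERSION moved but package.json still carries the base version.
  if (bumped && pkgVersion === baseVersion) return 'DRIFT_STALE_PKG';
  return 'DRIFT_UNEXPECTED';
}

const state = classifyBump({ version: '1.54.0.0', baseVersion: '1.53.0.0', pkgVersion: '1.53.0.0' });
```

Under these assumed rules `state` is `DRIFT_STALE_PKG`, the case the commit message says `repair` syncs without bumping again.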
@@ -33,6 +33,22 @@ export interface ParityInvariant {
   maxSizeRatio?: number;
   /** Minimum byte size (catches over-stripping cliffs). */
   minBytes?: number;
+  /**
+   * Carved skill (v2 plan T9): the skill is a skeleton SKILL.md plus on-demand
+   * sections/*.md. When true:
+   *   - mustContain / mustHaveHeadings run against skeleton + ALL sections unioned,
+   *     so a phrase that moved into a section still counts (content preserved, just
+   *     relocated — that's the whole point of the carve).
+   *   - minBytes / maxSizeRatio run against the UNION bytes, not the skeleton alone
+   *     (total behavior must not shrink; the win is what's no longer always-loaded,
+   *     which the union size deliberately does NOT measure — maxSkeletonBytes does).
+   *   - maxSkeletonBytes asserts the always-loaded skeleton actually shrank.
+   *     Without this, lowering minBytes to fit a 65KB skeleton would make the size
+   *     floor toothless (Codex outside-voice #12).
+   */
+  sectioned?: boolean;
+  /** Max bytes for the always-loaded skeleton SKILL.md (carved skills only). */
+  maxSkeletonBytes?: number;
 }

 export interface ParityCheckResult {
@@ -41,6 +57,35 @@ export interface ParityCheckResult {
   failures: string[];
 }

+/**
+ * Read a skill's check text + sizes. For a carved skill, union the skeleton with
+ * every sections/*.md so relocated content still counts and the union size
+ * measures total preserved behavior; skeletonBytes is reported separately so the
+ * always-loaded shrink can be asserted. For a monolith, text == skeleton.
+ */
+export function readSkillForParity(
+  repoRoot: string,
+  skill: string,
+  sectioned: boolean,
+): { text: string; unionBytes: number; skeletonBytes: number } {
+  const skeleton = fs.readFileSync(path.join(repoRoot, skill, 'SKILL.md'), 'utf-8');
+  const skeletonBytes = Buffer.byteLength(skeleton, 'utf-8');
+  if (!sectioned) return { text: skeleton, unionBytes: skeletonBytes, skeletonBytes };
+
+  let text = skeleton;
+  let unionBytes = skeletonBytes;
+  const sectionsDir = path.join(repoRoot, skill, 'sections');
+  if (fs.existsSync(sectionsDir)) {
+    for (const f of fs.readdirSync(sectionsDir).sort()) {
+      if (!f.endsWith('.md')) continue;
+      const sec = fs.readFileSync(path.join(sectionsDir, f), 'utf-8');
+      text += '\n' + sec;
+      unionBytes += Buffer.byteLength(sec, 'utf-8');
+    }
+  }
+  return { text, unionBytes, skeletonBytes };
+}
+
 export function checkSkillParity(
   invariant: ParityInvariant,
   current: SkillBaselineEntry,
@@ -48,38 +93,54 @@ export function checkSkillParity(
   repoRoot: string,
 ): ParityCheckResult {
   const failures: string[] = [];
+  const needText = !!(invariant.mustContain?.length || invariant.mustHaveHeadings?.length);

-  // SIZE checks
+  // Resolve the text + size to check against. Carved skills union skeleton +
+  // sections; monoliths use the skeleton alone. Read on demand so size-only
+  // invariants don't pay for a file read they don't need (monolith path).
+  let checkText: string | null = null;
+  let checkBytes = current.skillMdBytes;
+  if (invariant.sectioned) {
+    try {
+      const r = readSkillForParity(repoRoot, invariant.skill, true);
+      checkText = r.text;
+      checkBytes = r.unionBytes;
+      if (invariant.maxSkeletonBytes !== undefined && r.skeletonBytes > invariant.maxSkeletonBytes) {
+        failures.push(`skeleton ${r.skeletonBytes} > maxSkeletonBytes ${invariant.maxSkeletonBytes}`);
+      }
+    } catch (err) {
+      failures.push(`cannot read carved skill ${invariant.skill}: ${(err as Error).message}`);
+    }
+  } else if (needText) {
+    try {
+      checkText = fs.readFileSync(path.join(repoRoot, invariant.skill, 'SKILL.md'), 'utf-8');
+    } catch (err) {
+      failures.push(`cannot read ${path.join(repoRoot, invariant.skill, 'SKILL.md')}: ${(err as Error).message}`);
+    }
+  }
+
+  // SIZE checks (union bytes for carved skills, skeleton bytes for monoliths)
   if (invariant.maxSizeRatio !== undefined && baseline) {
-    const ratio = current.skillMdBytes / baseline.skillMdBytes;
+    const ratio = checkBytes / baseline.skillMdBytes;
     if (ratio > invariant.maxSizeRatio) {
       failures.push(`size ratio ${ratio.toFixed(3)} > maxSizeRatio ${invariant.maxSizeRatio}`);
     }
   }
-  if (invariant.minBytes !== undefined && current.skillMdBytes < invariant.minBytes) {
-    failures.push(`size ${current.skillMdBytes} < minBytes ${invariant.minBytes}`);
+  if (invariant.minBytes !== undefined && checkBytes < invariant.minBytes) {
+    failures.push(`size ${checkBytes} < minBytes ${invariant.minBytes}`);
   }

-  // CONTENT checks (read live file for fresh content)
-  if (invariant.mustContain?.length || invariant.mustHaveHeadings?.length) {
-    const skillMdPath = path.join(repoRoot, invariant.skill, 'SKILL.md');
-    let content: string | null = null;
-    try {
-      content = fs.readFileSync(skillMdPath, 'utf-8');
-    } catch (err) {
-      failures.push(`cannot read ${skillMdPath}: ${(err as Error).message}`);
-    }
-    if (content) {
-      const lower = content.toLowerCase();
-      for (const phrase of invariant.mustContain ?? []) {
-        if (!lower.includes(phrase.toLowerCase())) {
-          failures.push(`missing required phrase: "${phrase}"`);
-        }
+  // CONTENT checks
+  if (needText && checkText !== null) {
+    const lower = checkText.toLowerCase();
+    for (const phrase of invariant.mustContain ?? []) {
+      if (!lower.includes(phrase.toLowerCase())) {
+        failures.push(`missing required phrase: "${phrase}"`);
       }
-      for (const heading of invariant.mustHaveHeadings ?? []) {
-        if (!content.includes(heading)) {
-          failures.push(`missing required heading: "${heading}"`);
-        }
+    }
+    for (const heading of invariant.mustHaveHeadings ?? []) {
+      if (!checkText.includes(heading)) {
+        failures.push(`missing required heading: "${heading}"`);
       }
     }
   }
@@ -146,7 +207,13 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
     minBytes: 30_000,
   },
   {
+    // Carved (v2 plan T9): skeleton SKILL.md + sections/*.md. Content checks run
+    // against the union (relocated phrases still count); size floors run against
+    // the union (total behavior preserved); maxSkeletonBytes asserts the
+    // always-loaded skeleton actually shrank from the ~167KB monolith.
     skill: 'ship',
+    sectioned: true,
+    maxSkeletonBytes: 90_000,
     mustContain: [
       'VERSION',
       'CHANGELOG',
@@ -156,7 +223,7 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
     ],
     mustHaveHeadings: ['## Preamble', '## When to invoke'],
     maxSizeRatio: 1.05,
-    minBytes: 80_000,
+    minBytes: 120_000,
   },
   {
     skill: 'plan-ceo-review',
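The three size guards on the carved `ship` invariant each catch a different failure, which is easier to see with numbers. The sketch below restates the size checks as a standalone pure function. `sizeFailures` and the byte counts are illustrative assumptions (rounded from the 167KB and 68.7KB figures in the commit message), and unlike `checkSkillParity` it takes the baseline size as a plain argument.

```typescript
// Hypothetical helper mirroring the size checks for a carved skill.
interface SizeInvariant {
  maxSizeRatio?: number;
  minBytes?: number;
  maxSkeletonBytes?: number;
}

function sizeFailures(
  inv: SizeInvariant,
  skeletonBytes: number, // always-loaded SKILL.md
  unionBytes: number,    // skeleton + every sections/*.md
  baselineBytes: number, // recorded baseline size
): string[] {
  const failures: string[] = [];
  if (inv.maxSkeletonBytes !== undefined && skeletonBytes > inv.maxSkeletonBytes) {
    failures.push(`skeleton ${skeletonBytes} > maxSkeletonBytes ${inv.maxSkeletonBytes}`);
  }
  if (inv.maxSizeRatio !== undefined) {
    const ratio = unionBytes / baselineBytes;
    if (ratio > inv.maxSizeRatio) {
      failures.push(`size ratio ${ratio.toFixed(3)} > maxSizeRatio ${inv.maxSizeRatio}`);
    }
  }
  if (inv.minBytes !== undefined && unionBytes < inv.minBytes) {
    failures.push(`size ${unionBytes} < minBytes ${inv.minBytes}`);
  }
  return failures;
}

// Values taken from the ship entry in PARITY_INVARIANTS.
const ship: SizeInvariant = { maxSizeRatio: 1.05, minBytes: 120_000, maxSkeletonBytes: 90_000 };

// Clean carve: content relocated, nothing lost.
const healthy = sizeFailures(ship, 68_700, 167_000, 167_000);
// Skeleton shrank, but content was dropped rather than relocated.
const overStripped = sizeFailures(ship, 68_700, 100_000, 167_000);
// Nothing was moved out of the skeleton.
const notCarved = sizeFailures(ship, 167_000, 167_000, 167_000);
```

With these inputs `healthy` is empty, `overStripped` trips only the `minBytes` floor, and `notCarved` trips only `maxSkeletonBytes`. This is why the floor is measured on the union while the shrink is asserted separately on the skeleton.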
@@ -0,0 +1,40 @@
+/**
+ * requiredReads enforcement (v2 plan T9, mitigation layer 5 — the only CI-failing
+ * layer against silent section-skip).
+ *
+ * Given a /ship run's tool calls and the set of section files the run's SITUATION
+ * required, assert the agent actually Read each one. The required set comes from
+ * the TEST FIXTURE (which situation it set up), NOT from the manifest — the
+ * manifest is passive (CM2). This keeps "when is a section required" in exactly
+ * one machine-checkable place: the eval fixtures.
+ *
+ * Builds on extractSectionReads from transcript-section-logger so section-path
+ * matching (the `/sections/<file>.md` segment, host-layout agnostic) lives in one
+ * place.
+ */
+
+import { extractSectionReads, type TranscriptResultLike } from './transcript-section-logger';
+
+export interface RequiredReadsResult {
+  required: string[];
+  read: string[];
+  missing: string[];
+  ok: boolean;
+}
+
+/**
+ * @param result        the skill run (anything with toolCalls)
+ * @param requiredFiles section basenames the situation required, e.g.
+ *                      ['version-bump.md','changelog.md'] (or with a sections/
+ *                      prefix — normalized to basename here)
+ */
+export function assertRequiredReads(
+  result: TranscriptResultLike,
+  requiredFiles: string[],
+): RequiredReadsResult {
+  const read = extractSectionReads(result);
+  const readSet = new Set(read);
+  const required = requiredFiles.map(f => f.replace(/^.*\//, '')); // tolerate sections/<f>
+  const missing = required.filter(f => !readSet.has(f));
+  return { required, read, missing, ok: missing.length === 0 };
+}
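A synthetic transcript shows how the required-reads check behaves when the agent skips a section. To keep the example runnable on its own, it inlines condensed copies of `extractSectionReads` and `assertRequiredReads`; the transcript contents and the `/home/dev` path prefix are made up for illustration.

```typescript
interface ToolCallLike { tool: string; input: unknown }
interface TranscriptResultLike { toolCalls: ToolCallLike[] }

// Condensed copy of extractSectionReads: basenames of sections/*.md files that were Read.
function extractSectionReads(result: TranscriptResultLike): string[] {
  const ordered: string[] = [];
  for (const call of result.toolCalls) {
    if (call.tool !== 'Read') continue;
    const input = call.input as { file_path?: unknown } | null;
    const fp = input && typeof input === 'object' ? input.file_path : undefined;
    if (typeof fp !== 'string') continue;
    const name = fp.match(/(?:^|\/)sections\/([A-Za-z0-9._-]+\.md)$/)?.[1];
    if (name !== undefined && !ordered.includes(name)) ordered.push(name);
  }
  return ordered;
}

function assertRequiredReads(result: TranscriptResultLike, requiredFiles: string[]) {
  const read = extractSectionReads(result);
  const required = requiredFiles.map(f => f.replace(/^.*\//, '')); // strip any sections/ prefix
  const missing = required.filter(f => !read.includes(f));
  return { required, read, missing, ok: missing.length === 0 };
}

// Hypothetical run: the agent opened review-army.md but never opened changelog.md.
const run: TranscriptResultLike = {
  toolCalls: [
    { tool: 'Read', input: { file_path: '/home/dev/.claude/skills/gstack/ship/sections/review-army.md' } },
    { tool: 'Bash', input: { command: 'bun test' } },
    { tool: 'Read', input: { file_path: 'ship/SKILL.md' } }, // not under sections/, ignored
  ],
};

const check = assertRequiredReads(run, ['sections/review-army.md', 'changelog.md']);
```

Here `check.read` is `['review-army.md']`, `check.missing` is `['changelog.md']`, and `check.ok` is `false`, which is the condition an eval would fail on.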
@@ -120,7 +120,8 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
   'plan-ceo-mode-routing': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
   'plan-design-with-ui-scope': ['plan-design-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
   'budget-regression-pty': ['test/helpers/eval-store.ts', 'test/skill-budget-regression.test.ts'],
-  'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
+  'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'bin/gstack-version-bump', 'scripts/resolvers/sections.ts', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
+  'ship-section-loading': ['ship/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/required-reads.ts', 'test/helpers/transcript-section-logger.ts', 'test/helpers/claude-pty-runner.ts'],
   'autoplan-chain-pty': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
   'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'],

@@ -508,6 +509,7 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
   'plan-design-with-ui-scope': 'gate', // ~$0.80/run
   'budget-regression-pty': 'gate', // free, library-only assertion
   'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode
+  'ship-section-loading': 'periodic', // ~$3/run, real /ship; asserts section reads
   'autoplan-chain-pty': 'periodic', // ~$8/run, all 3 phases sequential

   // Per-finding count + review-report-at-bottom — periodic because each
@@ -0,0 +1,196 @@
+/**
+ * Transcript section logger (v2 plan T10).
+ *
+ * Two jobs, both pure analysis over a SkillTestResult / NDJSON transcript:
+ *
+ *   1. extractSectionReads() — which `sections/*.md` files a run actually Read.
+ *      Used by the sectioned world (post-carve) to verify the agent opened the
+ *      chapters its situation required.
+ *
+ *   2. extractShipActions() — an observable ACTION fingerprint of a /ship run
+ *      (ran tests, bumped VERSION, wrote CHANGELOG, created PR, ...). This works
+ *      on BOTH the monolith and the sectioned skill, which is the whole point:
+ *      capture a baseline on the current monolith ship FIRST, then assert the
+ *      sectioned ship still performs the same actions. A section-read check alone
+ *      can't catch "agent read the chapter but skipped the step"; the action
+ *      fingerprint can.
+ *
+ * Why baseline-first (Codex outside-voice critique on the T9 plan): a logger
+ * shipped in the same PR as the carve is post-failure telemetry unless it has a
+ * pre-carve reference. captureShipBaseline() records the monolith's action
+ * fingerprint so compareShipActions() can flag a regression introduced by the
+ * carve.
+ *
+ * Pure functions, no I/O except the explicit read/write baseline helpers. The
+ * unit tests drive these with synthetic transcripts — no paid run needed to
+ * validate the logic.
+ */
+
+import * as fs from 'fs';
+import * as path from 'path';
+import * as os from 'os';
+
+/** Minimal shape we need from SkillTestResult — kept structural so callers can
+ * pass a full SkillTestResult or a hand-built fixture in unit tests. */
+export interface ToolCallLike {
+  tool: string;
+  input: unknown;
+  output?: string;
+}
+export interface TranscriptResultLike {
+  toolCalls: ToolCallLike[];
+  output?: string;
+}
+
+/** Pull the file_path off a tool-call input, tolerating unknown shapes. */
+function readFilePath(input: unknown): string | null {
+  if (input && typeof input === 'object') {
+    const fp = (input as Record<string, unknown>).file_path;
+    if (typeof fp === 'string') return fp;
+  }
+  return null;
+}
+
+/** Pull the command string off a Bash tool-call input. */
+function bashCommand(input: unknown): string | null {
+  if (input && typeof input === 'object') {
+    const cmd = (input as Record<string, unknown>).command;
+    if (typeof cmd === 'string') return cmd;
+  }
+  return null;
+}
+
+/**
+ * Every `sections/<name>.md` file the run Read, normalized to the section
+ * basename (e.g. "version-bump.md"). Deduped, in first-Read order. Matching is
+ * on the path segment `/sections/<file>.md` so it works regardless of whether
+ * the host resolved a relative, absolute, or prefixed install path.
+ */
+export function extractSectionReads(result: TranscriptResultLike): string[] {
+  const seen = new Set<string>();
+  const ordered: string[] = [];
+  for (const call of result.toolCalls) {
+    if (call.tool !== 'Read') continue;
+    const fp = readFilePath(call.input);
+    if (!fp) continue;
+    const m = fp.match(/(?:^|\/)sections\/([A-Za-z0-9._-]+\.md)$/);
+    if (!m) continue;
+    const name = m[1];
+    if (!seen.has(name)) {
+      seen.add(name);
+      ordered.push(name);
+    }
+  }
+  return ordered;
+}
+
+/**
+ * The canonical /ship action vocabulary. Each action is detected from the Bash
+ * commands the agent ran (plus a couple of Write/Edit signals). Order is the
+ * rough ship sequence; detection is order-independent.
+ *
+ * Keep this list aligned with the ship skeleton's numbered steps. The
+ * section-loading eval asserts the sectioned ship still triggers the same
+ * actions a monolith run did for the same fixture situation.
+ */
+export const SHIP_ACTIONS = [
+  'merged_base',     // git merge <base>
+  'ran_tests',       // bun test / npm test / the project test cmd
+  'bumped_version',  // wrote VERSION / package.json version / ran gstack-version-bump
+  'wrote_changelog', // edited CHANGELOG.md
+  'committed',       // git commit
+  'pushed',          // git push
+  'opened_pr',       // gh pr create / glab mr create
+] as const;
+export type ShipAction = (typeof SHIP_ACTIONS)[number];
+
+const BASH_ACTION_PATTERNS: Array<{ action: ShipAction; re: RegExp }> = [
+  { action: 'merged_base', re: /\bgit\s+merge\b/ },
+  { action: 'ran_tests', re: /\b(bun\s+test|npm\s+(run\s+)?test|yarn\s+test|pytest|go\s+test|cargo\s+test|rspec)\b/ },
+  { action: 'bumped_version', re: /gstack-version-bump\b|gstack-next-version\b|>\s*VERSION\b|npm\s+version\b/ },
+  { action: 'wrote_changelog', re: /CHANGELOG\.md/ },
+  { action: 'committed', re: /\bgit\s+commit\b/ },
+  { action: 'pushed', re: /\bgit\s+push\b/ },
+  { action: 'opened_pr', re: /\bgh\s+pr\s+create\b|\bglab\s+mr\s+create\b/ },
+];
+
+/**
+ * The observable action fingerprint of a ship run. Works on monolith AND
+ * sectioned skills because it reads what the agent DID (Bash + file writes),
+ * not which prose it loaded.
+ */
+export function extractShipActions(result: TranscriptResultLike): ShipAction[] {
+  const found = new Set<ShipAction>();
+  for (const call of result.toolCalls) {
+    if (call.tool === 'Bash') {
+      const cmd = bashCommand(call.input);
+      if (!cmd) continue;
+      for (const { action, re } of BASH_ACTION_PATTERNS) {
+        if (re.test(cmd)) found.add(action);
+      }
+    } else if (call.tool === 'Write' || call.tool === 'Edit') {
+      const fp = readFilePath(call.input);
+      if (fp && /CHANGELOG\.md$/.test(fp)) found.add('wrote_changelog');
+      if (fp && /(?:^|\/)VERSION$/.test(fp)) found.add('bumped_version');
+    }
+  }
+  // Preserve canonical order.
+  return SHIP_ACTIONS.filter(a => found.has(a));
+}
+
+export interface ShipBaseline {
+  tag: string;
+  /** Fixture/situation id this baseline was captured for. */
+  situation: string;
+  /** Action fingerprint observed on the monolith ship. */
+  actions: ShipAction[];
+  /** Section reads observed (empty on the monolith — present after carve). */
+  sectionReads: string[];
+  capturedAt: string;
+}
+
+const DEFAULT_BASELINE_DIR = path.join(os.homedir(), '.gstack-dev', 'ship-baselines');
+
+/** Where a baseline for a given situation lives. */
+export function baselinePath(situation: string, dir = DEFAULT_BASELINE_DIR): string {
+  return path.join(dir, `${situation}.json`);
+}
+
+/** Persist a ship baseline (used once on the monolith, before the carve). */
+export function writeShipBaseline(baseline: ShipBaseline, dir = DEFAULT_BASELINE_DIR): string {
+  fs.mkdirSync(dir, { recursive: true });
+  const p = baselinePath(baseline.situation, dir);
+  fs.writeFileSync(p, JSON.stringify(baseline, null, 2) + '\n');
+  return p;
+}
+
+/** Read a previously-captured baseline, or null if none exists yet. */
+export function readShipBaseline(situation: string, dir = DEFAULT_BASELINE_DIR): ShipBaseline | null {
+  try {
+    return JSON.parse(fs.readFileSync(baselinePath(situation, dir), 'utf-8')) as ShipBaseline;
+  } catch {
+    return null;
+  }
+}
+
+export interface ShipActionDiff {
+  /** Actions the baseline performed that the current run did NOT (the regression set). */
+  missing: ShipAction[];
+  /** Actions the current run performed that the baseline did not (usually fine). */
+  added: ShipAction[];
+  /** True when no baseline action was dropped. */
+  ok: boolean;
+}
+
+/**
+ * Compare a current sectioned-ship run against the monolith baseline. A dropped
+ * action (in baseline, not in current) is the carve regression we care about:
+ * the sectioned ship stopped doing something the monolith did.
+ */
+export function compareShipActions(baseline: ShipBaseline, current: ShipAction[]): ShipActionDiff {
+  const cur = new Set(current);
+  const base = new Set(baseline.actions);
+  const missing = baseline.actions.filter(a => !cur.has(a));
+  const added = current.filter(a => !base.has(a));
+  return { missing, added, ok: missing.length === 0 };
+}
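The baseline-first comparison can be exercised end to end with two synthetic transcripts, one standing in for the monolith run and one for a sectioned run that silently skipped the changelog step. This sketch inlines a condensed `extractShipActions` that takes the tool-call array directly; the transcripts and the helper names `stringField` and `bash` are assumptions for illustration, while the regexes are copied from `BASH_ACTION_PATTERNS`.

```typescript
const SHIP_ACTIONS = [
  'merged_base', 'ran_tests', 'bumped_version', 'wrote_changelog',
  'committed', 'pushed', 'opened_pr',
] as const;
type ShipAction = (typeof SHIP_ACTIONS)[number];

interface ToolCallLike { tool: string; input: unknown }

const PATTERNS: Array<{ action: ShipAction; re: RegExp }> = [
  { action: 'merged_base', re: /\bgit\s+merge\b/ },
  { action: 'ran_tests', re: /\b(bun\s+test|npm\s+(run\s+)?test|yarn\s+test|pytest|go\s+test|cargo\s+test|rspec)\b/ },
  { action: 'bumped_version', re: /gstack-version-bump\b|gstack-next-version\b|>\s*VERSION\b|npm\s+version\b/ },
  { action: 'wrote_changelog', re: /CHANGELOG\.md/ },
  { action: 'committed', re: /\bgit\s+commit\b/ },
  { action: 'pushed', re: /\bgit\s+push\b/ },
  { action: 'opened_pr', re: /\bgh\s+pr\s+create\b|\bglab\s+mr\s+create\b/ },
];

// Hypothetical helper: read one string field off an untyped tool-call input.
function stringField(input: unknown, key: string): string | null {
  if (input && typeof input === 'object') {
    const v = (input as Record<string, unknown>)[key];
    if (typeof v === 'string') return v;
  }
  return null;
}

function extractShipActions(toolCalls: ToolCallLike[]): ShipAction[] {
  const found = new Set<ShipAction>();
  for (const call of toolCalls) {
    if (call.tool === 'Bash') {
      const cmd = stringField(call.input, 'command');
      if (!cmd) continue;
      for (const { action, re } of PATTERNS) if (re.test(cmd)) found.add(action);
    } else if (call.tool === 'Write' || call.tool === 'Edit') {
      const fp = stringField(call.input, 'file_path');
      if (fp && /CHANGELOG\.md$/.test(fp)) found.add('wrote_changelog');
      if (fp && /(?:^|\/)VERSION$/.test(fp)) found.add('bumped_version');
    }
  }
  return SHIP_ACTIONS.filter(a => found.has(a)); // canonical order
}

const bash = (command: string): ToolCallLike => ({ tool: 'Bash', input: { command } });

// Synthetic monolith run that performs every action once.
const monolithRun: ToolCallLike[] = [
  bash('git merge origin/main'),
  bash('bun test'),
  bash('gstack-version-bump write'),
  { tool: 'Edit', input: { file_path: 'CHANGELOG.md' } },
  bash('git commit -m "ship"'),
  bash('git push'),
  bash('gh pr create --fill'),
];
// Simulated regression: the sectioned run never edits CHANGELOG.md.
const sectionedRun = monolithRun.filter(c => c.tool !== 'Edit');

const baselineActions = extractShipActions(monolithRun);
const currentActions = extractShipActions(sectionedRun);
const missing = baselineActions.filter(a => !currentActions.includes(a));
```

`missing` comes out as `['wrote_changelog']`: the dropped action that a section-read check alone would not surface, because the agent could have opened the changelog section and still skipped the edit.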