mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-05 17:46:37 +02:00
v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section
Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).
- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
outside-voice rules, how-to-ask, Required Outputs, registries, Completion
Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
{{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
Section self-check. All of Step 0 (the scope/mode conversation) stays in
the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.
Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.
Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).
Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
(measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
Step 0, review body absent from skeleton, report writer in the section,
nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
installed skill first (Codex P1), drives full Step 0, asserts the section
is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
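The measured numbers and the parity-harness ceiling quoted above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the repo's parity harness: `percentReduction` and `withinBudget` are hypothetical helper names, and only the byte counts and the 90_000 ceiling come from the commit message.

```typescript
// Hypothetical helpers (not from the gstack repo). The byte counts are the
// ones quoted in the commit message.
function percentReduction(beforeBytes: number, afterBytes: number): number {
  return ((beforeBytes - afterBytes) / beforeBytes) * 100;
}

function withinBudget(skeletonBytes: number, maxSkeletonBytes: number): boolean {
  return skeletonBytes <= maxSkeletonBytes;
}

const before = 138_838; // pre-carve always-loaded skeleton
const after = 80_731; // post-carve always-loaded skeleton
const maxSkeletonBytes = 90_000; // parity-harness ceiling

const reduction = percentReduction(before, after);
console.log(`${reduction.toFixed(1)}% smaller`); // 41.9% smaller (reported as -42%)
console.log(withinBudget(after, maxSkeletonBytes)); // true
console.log(maxSkeletonBytes - after); // 9269 bytes of headroom
```

A ceiling set slightly above the measured size lets small edits through while still failing a change that re-inlines the carved review body.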
* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)
MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks
Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier>=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded
Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
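The Layer 0 idea, including its negative control, can be sketched on in-memory strings. This is an illustration under stated assumptions: `REQUIRED` is a three-marker subset, `missingMarkers` is a hypothetical helper, and both skeleton strings are made up for the example rather than taken from a generated SKILL.md.

```typescript
// Illustrative subset of format markers; the full list lives in the test file.
const REQUIRED: Array<{ name: string; re: RegExp }> = [
  { name: 'ELI10 label', re: /ELI10\s*:/i },
  { name: 'Recommendation line', re: /Recommendation\s*:/i },
  { name: 'Net line', re: /Net\s*:/i },
];

// Returns the names of required markers that are absent from `body`.
function missingMarkers(body: string): string[] {
  return REQUIRED.filter(m => !m.re.test(body)).map(m => m.name);
}

// Hypothetical skeleton text standing in for a generated SKILL.md.
const compliantSkeleton = 'ELI10: ...\nRecommendation: A because ...\nNet: ...';
// Negative control: a skeleton whose Recommendation line was carved away.
const strandedSkeleton = 'ELI10: ...\nNet: ...';

console.log(missingMarkers(compliantSkeleton)); // []
console.log(missingMarkers(strandedSkeleton)); // [ 'Recommendation line' ]
```

The negative control matters because a check that never fails on a deliberately broken input gives no evidence that it would catch a real stranded block.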
* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B
Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).
The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): consistency — same trigger N runs, stable format + substance
Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
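The "appears in one run but not another" check reduces to a set comparison across runs. A minimal sketch, assuming each run has already been reduced to the list of format elements it contained; `unstableElements` and the element names are hypothetical, not the repo's grader.

```typescript
// Returns format elements present in at least one run but missing from another.
function unstableElements(runs: string[][]): string[] {
  const seen = new Set<string>();
  for (const run of runs) for (const el of run) seen.add(el);
  return Array.from(seen)
    .filter(el => !runs.every(run => run.includes(el)))
    .sort();
}

const stableRuns = [
  ['ELI10', 'Recommendation', 'Net'],
  ['ELI10', 'Recommendation', 'Net'],
  ['ELI10', 'Recommendation', 'Net'],
];
// Second run silently lost its Recommendation line.
const flakyRuns = [
  ['ELI10', 'Recommendation', 'Net'],
  ['ELI10', 'Net'],
  ['ELI10', 'Recommendation', 'Net'],
];

console.log(unstableElements(stableRuns)); // []
console.log(unstableElements(flakyRuns)); // [ 'Recommendation' ]
```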
* test(auq): behavioral matrix across AUQ-heavy skills
Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(codex): live recommendation-substance grade for /codex
Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): deterministic trigger for format-compliance gate
A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(office-hours): carve Phase 5+6 into on-demand section
Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.
Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.
Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
(measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
office-hours so relocated Phase 5/6 prose still counts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(preamble): carve CJK-escaping manual to on-demand doc
The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.
Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
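The rule that stays always-loaded (emit literal UTF-8, never `\uXXXX`) can be illustrated with plain JSON handling. A small sketch: the question string is the one from the removed worked example, while the "one hex digit off" escape is a made-up stand-in for a miscoded codepoint.

```typescript
const question = '請選擇管理工具';

// JSON.stringify keeps non-ASCII characters literal; it only applies the
// JSON-mandatory escapes (quotes, backslashes, control characters).
const payload = JSON.stringify({ question });
console.log(payload); // {"question":"請選擇管理工具"}

// A correct \u escape round-trips, but an escape that is one hex digit off
// silently decodes to a different character.
const correct = JSON.parse('"\\u7BA1"'); // 管 (U+7BA1)
const miscoded = JSON.parse('"\\u7BA2"'); // some other character
console.log(correct === '管', miscoded === '管'); // true false
```

Nothing in the JSON layer flags the miscoded escape as an error, which is why the rule asks for literal characters rather than relying on a later check.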
* refactor(plan-eng-review): carve review body into on-demand section
Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.
Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-design-review): carve review body into on-demand section
Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.
Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(plan-devex-review): carve review body into on-demand section
Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.
Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: refresh ship golden baselines + gbrain-detection union after carves
Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
was carved, so the detection assertions now check both files.
Full `bun test` green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): grade format-compliance gate from SDK capture, not the TUI
The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.
Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: fix carve-broken CI evals (union reads + section fixtures)
Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:
- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
like "## Review Sections" / "## CRITICAL RULE" that now live in
sections/review-sections.md. The markers vanished from the skeleton, so the
judge scored empty/wrong content. Fix: read the skeleton+sections union.
Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
(25/25).
- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
fixture, not sections/. The carved skill's STOP pointed at a section file that
was absent, so the model improvised a compressed report table instead of the
canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
canonical table emitted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
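The fixture fix described above amounts to copying `sections/` whenever `SKILL.md` is copied. A minimal sketch under assumptions: `copySkillIntoFixture` is a hypothetical helper, not the repo's `setupPlanDir`, and the demo builds a throwaway skill directory rather than touching a real one.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper: copy a skill's SKILL.md and, if present, its sections/
// directory into a fixture so a STOP-Read never points at a missing file.
function copySkillIntoFixture(skillDir: string, destDir: string): void {
  fs.mkdirSync(destDir, { recursive: true });
  fs.copyFileSync(path.join(skillDir, 'SKILL.md'), path.join(destDir, 'SKILL.md'));
  const sections = path.join(skillDir, 'sections');
  if (fs.existsSync(sections)) {
    fs.cpSync(sections, path.join(destDir, 'sections'), { recursive: true });
  }
}

// Demo against a throwaway skill laid out like a carved one.
const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'carve-fixture-'));
const src = path.join(tmp, 'src-skill');
fs.mkdirSync(path.join(src, 'sections'), { recursive: true });
fs.writeFileSync(path.join(src, 'SKILL.md'), 'skeleton');
fs.writeFileSync(path.join(src, 'sections', 'review-sections.md'), 'section body');

const dest = path.join(tmp, 'fixture');
copySkillIntoFixture(src, dest);
console.log(fs.existsSync(path.join(dest, 'sections', 'review-sections.md'))); // true
```

Routing every fixture copy through one helper would also cover skills that are carved later, since the `sections/` check is made at copy time.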
* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)
Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.
- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
the section, so check the regenerated skeleton+section UNION for the gbrain put
block, ship both into the workdir, and restore both (the section regen was also
leaking into the working tree; the `finally` block now restores it).
ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: migrate section-loading E2E to lossless SDK tool-stream detection
The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.
They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.
Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
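Detecting section reads from the tool-use stream can be sketched as a filter over Read calls. The `ToolUseEvent` shape and the `sectionReads` helper are assumptions for illustration, and the real SDK stream may be structured differently. Only the rule itself (a Read whose `file_path` contains `sections/<file>.md`) and the two ship section names come from the commit message.

```typescript
// Assumed event shape for illustration; the real SDK stream may differ.
interface ToolUseEvent {
  name: string;
  input: { file_path?: string };
}

// Returns the section file names opened via Read tool calls, in first-seen order.
function sectionReads(events: ToolUseEvent[]): string[] {
  const found = new Set<string>();
  for (const ev of events) {
    if (ev.name !== 'Read' || !ev.input.file_path) continue;
    const m = ev.input.file_path.match(/sections\/([^/]+\.md)$/);
    if (m) found.add(m[1]);
  }
  return Array.from(found);
}

// Hypothetical stream: fixture paths are made up for the example.
const events: ToolUseEvent[] = [
  { name: 'Read', input: { file_path: '/tmp/fixture/ship/SKILL.md' } },
  { name: 'Read', input: { file_path: '/tmp/fixture/ship/sections/review-army.md' } },
  { name: 'Bash', input: {} },
  { name: 'Read', input: { file_path: '/tmp/fixture/ship/sections/changelog.md' } },
];

console.log(sectionReads(events)); // [ 'review-army.md', 'changelog.md' ]
```

Because this reads structured tool calls rather than rendered terminal output, cursor-positioning escapes cannot hide a read.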
* chore: consolidate branch to v1.56.0.0 (single MINOR above main)
The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.
VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,185 @@
/**
 * AUQ format is ALWAYS-LOADED — the token-reduction safety net (gate, free).
 *
 * The anxiety this kills: carving a skill into a small skeleton + on-demand
 * `sections/*.md` could strand the AskUserQuestion decision-brief format (or a
 * per-skill AUQ rule) in a section that is NOT in context when a question
 * fires. The user would then see an AUQ with no ELI10, no Recommendation, no
 * Pros/Cons — exactly the degradation we must guarantee never happens.
 *
 * The guarantee, made mechanical and per-PR:
 *   1. UNIVERSAL — every interactive skill (anything that ships the
 *      `## AskUserQuestion Format` block, i.e. preamble tier >= 2) carries the
 *      FULL format spec in its always-loaded `SKILL.md` skeleton, NOT only in a
 *      section. The preamble is always in context, so the format spec is present
 *      the instant ANY question fires — Step 0, mode select, or a review finding.
 *   2. REGRESSION — a known roster of interactive skills MUST still ship the
 *      block. A botched carve that drops `{{PREAMBLE}}` from a skeleton fails
 *      here in milliseconds instead of surfacing as a garbled question weeks
 *      later.
 *   3. CARVE-SAFETY — for skills that ARE carved (have a `sections/` dir), the
 *      format block must live in `SKILL.md`, and any per-skill review-cadence
 *      rule that moved into a section must still exist somewhere in the
 *      skeleton+sections union (dropped-entirely is the failure).
 *
 * This is deterministic and free, so it runs on every `bun test`. It is the
 * floor under the paid behavioral/substance/consistency E2Es.
 */
import { describe, test, expect } from 'bun:test';
import * as fs from 'node:fs';
import * as path from 'node:path';

const ROOT = path.resolve(__dirname, '..');

/** Mandatory elements of the AskUserQuestion decision-brief format. Each is a
 * label/marker the preamble resolver emits (generate-ask-user-format.ts) and
 * that the model needs in context to render a compliant question. */
const MANDATORY: Array<{ name: string; re: RegExp }> = [
  { name: '## AskUserQuestion Format header', re: /##\s*AskUserQuestion Format/i },
  { name: 'ELI10 label', re: /ELI10\s*:/i },
  { name: 'Stakes-if-we-pick-wrong line', re: /Stakes if we pick wrong/i },
  { name: 'Recommendation line (mandatory)', re: /Recommendation\s*:/i },
  { name: '(recommended) label', re: /\(recommended\)/i },
  { name: 'Pros / cons header', re: /Pros\s*\/\s*cons/i },
  { name: '✅ pro bullet', re: /✅/ },
  { name: '❌ con bullet', re: /❌/ },
  { name: 'Net: synthesis line', re: /Net\s*:/i },
  { name: 'Completeness coverage rule', re: /Completeness\s*:/i },
  { name: 'kind-vs-coverage rule', re: /options differ in kind/i },
  { name: 'Self-check checklist', re: /Self-check before emitting/i },
];

/** Per-skill AUQ rules that govern review-finding cadence. A carve may move
 * these into a section (they fire only once the section is loaded), but they
 * must never be DROPPED. Asserted against the skeleton+sections union. */
const PER_SKILL_RULES: Record<string, RegExp[]> = {
  'plan-ceo-review': [/One issue = one AskUserQuestion call/i],
  'plan-eng-review': [/One issue = one AskUserQuestion call/i],
  'plan-design-review': [/One issue = one AskUserQuestion call/i],
  'plan-devex-review': [/One issue = one AskUserQuestion call/i],
  // /codex emits its recommendation as prose; the instruction MUST stay in the
  // always-loaded skeleton because codex has no on-demand section.
  codex: [/Synthesis recommendation \(REQUIRED\)/i, /Recommendation\s*:\s*<action>\s*because/i],
};

/** Discover every repo-root skill dir that ships a generated SKILL.md. */
function discoverSkills(): Array<{ skill: string; skillMd: string; sectionsDir: string | null }> {
  return fs
    .readdirSync(ROOT, { withFileTypes: true })
    .filter(d => d.isDirectory())
    .map(d => d.name)
    .filter(skill => fs.existsSync(path.join(ROOT, skill, 'SKILL.md')))
    .map(skill => {
      const sectionsDir = path.join(ROOT, skill, 'sections');
      return {
        skill,
        skillMd: path.join(ROOT, skill, 'SKILL.md'),
        sectionsDir: fs.existsSync(sectionsDir) ? sectionsDir : null,
      };
    });
}

const skills = discoverSkills();
/** A skill is "interactive" if its always-loaded SKILL.md ships the format
 * block. That is the population that must be fully compliant. */
const interactive = skills.filter(s =>
  /##\s*AskUserQuestion Format/i.test(fs.readFileSync(s.skillMd, 'utf-8')),
);

/** Roster guard: these interactive skills MUST keep shipping the format block.
 * If a carve/refactor drops it, this list still expects them and the membership
 * test below fails. Derived from "fires AUQ at the user" — the plan/review/
 * advisory skills plus codex. */
const EXPECTED_INTERACTIVE = [
  'plan-ceo-review',
  'plan-eng-review',
  'plan-design-review',
  'plan-devex-review',
  'office-hours',
  'ship',
  'review',
  'qa',
  'qa-only',
  'codex',
  'autoplan',
  'cso',
  'investigate',
  'retro',
  'design-review',
  'design-consultation',
  'spec',
  'land-and-deploy',
];

describe('AUQ format is always-loaded (token-reduction safety net)', () => {
  test('discovered a sane number of interactive skills', () => {
    // Guards against a glob/path regression that would make the per-skill
    // loop vacuously pass with zero skills.
    expect(interactive.length).toBeGreaterThanOrEqual(15);
  });

  test('every expected interactive skill still ships the AUQ format block', () => {
    const names = new Set(interactive.map(s => s.skill));
    const missing = EXPECTED_INTERACTIVE.filter(s => !names.has(s));
    if (missing.length > 0) {
      throw new Error(
        `These skills lost their always-loaded AskUserQuestion format block ` +
          `(a carve or refactor likely dropped {{PREAMBLE}} from the skeleton):\n` +
          missing.map(s => ` - ${s}/SKILL.md`).join('\n'),
      );
    }
  });

  for (const { skill, skillMd } of interactive) {
    test(`${skill}: full AUQ format spec present in always-loaded SKILL.md`, () => {
      const body = fs.readFileSync(skillMd, 'utf-8');
      const gaps = MANDATORY.filter(m => !m.re.test(body));
      if (gaps.length > 0) {
        throw new Error(
          `${skill}/SKILL.md (the always-loaded skeleton) is missing ${gaps.length} ` +
            `mandatory AUQ format element(s) — a question firing here would degrade:\n` +
            gaps.map(g => ` - ${g.name} (${g.re.source})`).join('\n'),
        );
      }
    });
  }

  // CARVE-SAFETY: for carved skills, the format block must be in the SKELETON,
  // not only a section. (The per-skill loop above already reads SKILL.md, so
  // this is an explicit, named guard for the exact failure mode.)
  for (const { skill, skillMd, sectionsDir } of skills.filter(s => s.sectionsDir)) {
    test(`${skill} (carved): AUQ format block lives in the skeleton, not only sections/`, () => {
      const body = fs.readFileSync(skillMd, 'utf-8');
      expect(body).toMatch(/##\s*AskUserQuestion Format/i);
      expect(body).toMatch(/ELI10\s*:/i);
      expect(body).toMatch(/Recommendation\s*:/i);
      // sanity: confirm there really is a section dir we're guarding against
      expect(fs.readdirSync(sectionsDir!).some(f => f.endsWith('.md'))).toBe(true);
    });
  }

  // PER-SKILL RULES: review-cadence rules may move into a section, but must
  // never be dropped from the skeleton+sections union.
  for (const [skill, rules] of Object.entries(PER_SKILL_RULES)) {
    test(`${skill}: per-skill AUQ rules survive in skeleton+sections union`, () => {
      const skillDir = path.join(ROOT, skill);
      if (!fs.existsSync(path.join(skillDir, 'SKILL.md'))) {
        throw new Error(`${skill}/SKILL.md not found — roster is stale`);
      }
      let union = fs.readFileSync(path.join(skillDir, 'SKILL.md'), 'utf-8');
      const secDir = path.join(skillDir, 'sections');
      if (fs.existsSync(secDir)) {
        for (const f of fs.readdirSync(secDir).filter(f => f.endsWith('.md') && !f.endsWith('.md.tmpl'))) {
          union += '\n' + fs.readFileSync(path.join(secDir, f), 'utf-8');
        }
      }
      const dropped = rules.filter(re => !re.test(union));
      if (dropped.length > 0) {
        throw new Error(
          `${skill}: per-skill AUQ rule(s) dropped from skeleton+sections union:\n` +
            dropped.map(re => ` - ${re.source}`).join('\n'),
        );
      }
    });
  }
});
@@ -0,0 +1,103 @@
/**
 * /codex recommendation substance — LIVE grade (periodic, paid, Codex CLI).
 *
 * The gap this closes: skill-cross-model-recommendation-emit.test.ts only checks
 * the /codex TEMPLATE contains the "Recommendation: <action> because <reason>"
 * instruction (static grep). llm-judge-recommendation.test.ts grades the rubric
 * against FIXTURES. Nothing runs /codex live and grades the recommendation it
 * actually emits. The user reports codex recommendations were the least
 * consistent surface on main — so this is the one that needs live coverage.
 *
 * Method: drive the real /codex skill via codex exec (isolated temp HOME) over a
 * small, deliberately-flawed fixture diff. Capture codex's output, extract its
 * synthesis "Recommendation: ... because ..." line, and grade it with the same
 * judgeRecommendation() rubric used everywhere else:
 *   - present     : a Recommendation line exists
 *   - commits     : names exactly one action (no hedging)
 *   - has_because : a because-clause follows
 *   - substance>=4: the reason is option-specific / names a concrete tradeoff,
 *                   not boilerplate ("because it's better")
 *
 * Periodic tier (Codex non-determinism, ~$2-3/run).
 */
import { describe, test, expect } from 'bun:test';
import * as path from 'node:path';
import { runCodexSkill } from './helpers/codex-session-runner';
import { judgeRecommendation } from './helpers/llm-judge';

const ROOT = path.resolve(import.meta.dir, '..');

const CODEX_AVAILABLE = (() => {
  try {
    return Bun.spawnSync(['which', 'codex']).exitCode === 0;
  } catch {
    return false;
  }
})();
const shouldRun =
  CODEX_AVAILABLE && !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
const describeCodex = shouldRun ? describe : describe.skip;

// A small fixture with two real, comparable problems so a good recommendation
// must CHOOSE (and justify the choice against the alternative) — the exact
// shape judgeRecommendation scores >= 4.
const FIXTURE_DIFF = `
Review this change. It has more than one issue; finish with a single synthesis
recommendation line in your skill's required format: "Recommendation: <action>
because <one-line reason that names the most important finding and why it beats
the alternative>."

--- a/server/auth.ts
+++ b/server/auth.ts
@@
 export function login(req, res) {
-  const user = db.query("SELECT * FROM users WHERE name = ?", [req.body.name]);
+  const user = db.query("SELECT * FROM users WHERE name = '" + req.body.name + "'");
   if (user && user.password === req.body.password) {
     res.cookie('session', user.id); // no HttpOnly, no Secure, no expiry
     return res.json({ ok: true });
   }
   return res.status(401).json({ ok: false });
 }
`;

describeCodex('/codex recommendation substance (live, periodic)', () => {
  test(
    'codex emits a committed, substance>=4 synthesis recommendation',
    async () => {
      const result = await runCodexSkill({
        skillDir: path.join(ROOT, 'codex'),
        skillName: 'codex',
        prompt: FIXTURE_DIFF,
        timeoutMs: 300_000,
      });

      if (result.output.startsWith('SKIP:')) {
        // codex binary missing — describeCodex already guards, but double-safe.
        return;
      }

      const score = await judgeRecommendation(result.output);
      // eslint-disable-next-line no-console
      console.log(
        `[codex-rec] present=${score.present} commits=${score.commits} ` +
          `has_because=${score.has_because} substance=${score.reason_substance}\n` +
          ` reason: ${score.reason_text}`,
      );

      expect(score.present).toBe(true);
      expect(score.has_because).toBe(true);
      expect(score.commits).toBe(true);
      // The named weak spot: substance must clear the boilerplate bar.
      if (score.reason_substance < 4) {
        throw new Error(
          `codex recommendation substance ${score.reason_substance} < 4 (boilerplate/weak):\n` +
            ` reason: ${score.reason_text}\n` +
            ` judge: ${score.reasoning}\n` +
            `--- codex output (last 2KB) ---\n${result.output.slice(-2000)}`,
        );
      }
    },
    360_000,
  );
});
+6
-19
@@ -367,25 +367,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
 **Full rule + worked examples + Hold/dependency semantics:** see
 `docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually , so the user sees `管理工具` rendered as `3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

 ### Self-check before emitting

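The non-ASCII rule in this hunk can be checked mechanically. The following is a minimal sketch, assuming a Node/TypeScript runtime. The helper name `hasUnicodeEscape` is hypothetical and not part of gstack. It shows that `JSON.stringify` already emits CJK text literally, and that a hand-written escape with one wrong hex digit decodes to a different character without any error.

```typescript
// Hypothetical helper (not part of gstack): flags \uXXXX escapes in raw JSON text.
function hasUnicodeEscape(rawJson: string): boolean {
  // A backslash-u escape counts only if the backslash is not itself escaped,
  // so an even run of preceding backslashes is skipped first.
  return /(^|[^\\])(\\\\)*\\u[0-9a-fA-F]{4}/.test(rawJson);
}

// JSON.stringify keeps CJK characters as literal characters in its output.
const literal = JSON.stringify({ question: "請選擇管理工具" });
const literalIsClean = !hasUnicodeEscape(literal);

// A correct escape round-trips to the intended character...
const correct = JSON.parse('"\\u7BA1"') as string; // U+7BA1
// ...but a single wrong hex digit silently decodes to some other character.
const miscoded = JSON.parse('"\\u7BA2"') as string;

console.log({ literal, literalIsClean, correct, miscoded });
```

A check like this could run on a serialized question payload before it is emitted. Whether gstack does anything of the sort is not stated in this diff.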
+6
-19
@@ -353,25 +353,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
 **Full rule + worked examples + Hold/dependency semantics:** see
 `docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually , so the user sees `管理工具` rendered as `3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

 ### Self-check before emitting

+6
-19
@@ -355,25 +355,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
 **Full rule + worked examples + Hold/dependency semantics:** see
 `docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually , so the user sees `管理工具` rendered as `3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

 ### Self-check before emitting

@@ -105,7 +105,12 @@ describe('gbrain detection override → gen-skill-docs', () => {
   // Single skill probe is enough to assert the override pipeline. The
   // resolver unit test (test/resolvers-gbrain-save-results.test.ts) covers
   // per-skill metadata correctness already.
-  const PROBE_FILES = ['office-hours/SKILL.md'];
+  // office-hours is carved (v2 plan T9): GBRAIN_CONTEXT_LOAD stays in the
+  // skeleton, GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md.
+  // Probe the union so the detection override is asserted wherever the blocks land.
+  const PROBE_FILES = ['office-hours/SKILL.md', 'office-hours/sections/design-and-handoff.md'];
+  const probeUnion = (snap: Map<string, string>): string =>
+    (snap.get('office-hours/SKILL.md') ?? '') + '\n' + (snap.get('office-hours/sections/design-and-handoff.md') ?? '');

   test('with detected:true, Claude-host SKILL.md gains brain-aware blocks', () => {
     const { tmpHome, cleanup } = makeFixture(
@@ -117,7 +122,7 @@ describe('gbrain detection override → gen-skill-docs', () => {
         tmpHome,
         files: PROBE_FILES,
       });
-      const content = snap.get('office-hours/SKILL.md')!;
+      const content = probeUnion(snap);

       // GBRAIN_SAVE_RESULTS un-suppressed → resolver output rendered.
       expect(content).toContain('## Save Results to Brain');
@@ -141,7 +146,7 @@ describe('gbrain detection override → gen-skill-docs', () => {
         tmpHome,
         files: PROBE_FILES,
       });
-      const content = snap.get('office-hours/SKILL.md')!;
+      const content = probeUnion(snap);

       // GBRAIN_SAVE_RESULTS suppressed → no rendered block, no gbrain put line.
       expect(content).not.toContain('gbrain put "office-hours/');
@@ -162,7 +167,7 @@ describe('gbrain detection override → gen-skill-docs', () => {
         tmpHome,
         files: PROBE_FILES,
       });
-      const content = snap.get('office-hours/SKILL.md')!;
+      const content = probeUnion(snap);
       expect(content).not.toContain('gbrain put "office-hours/');
     } finally {
       cleanup();
@@ -183,7 +188,7 @@ describe('gbrain detection override → gen-skill-docs', () => {
         tmpHome,
         files: PROBE_FILES,
       });
-      const content = snap.get('office-hours/SKILL.md')!;
+      const content = probeUnion(snap);
       expect(content).not.toContain('gbrain put "office-hours/');
       expect(content).not.toContain('## Save Results to Brain');
     } finally {

+31
-29
@@ -383,7 +383,7 @@ describe('gen-skill-docs', () => {
   });

   test('voice and writing-style preamble sections stay compact', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-eng-review'); // carved: review body moved to section
     const voice = extractMarkdownSection(content, '## Voice');
     const writingStyle = extractMarkdownSection(content, '## Writing Style');

@@ -392,7 +392,7 @@ describe('gen-skill-docs', () => {
   });

   test('slim voice section preserves the gstack voice contract', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-eng-review'); // carved: review body moved to section
     const voice = extractMarkdownSection(content, '## Voice');

     expect(voice).toMatch(/lead with the point|direct/i);
@@ -672,7 +672,7 @@ describe('REVIEW_DASHBOARD resolver', () => {

   for (const skill of REVIEW_SKILLS) {
     test(`review dashboard appears in ${skill} generated file`, () => {
-      const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
+      const content = readSkillUnion(skill); // carved skills: union skeleton + sections
       expect(content).toContain('gstack-review');
       expect(content).toContain('REVIEW READINESS DASHBOARD');
     });
@@ -693,13 +693,13 @@ describe('REVIEW_DASHBOARD resolver', () => {
   });

   test('shared dashboard propagates review source to plan-eng-review', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-eng-review'); // carved: review body moved to section
     expect(content).toContain('plan-eng-review, review, plan-design-review');
     expect(content).toContain('`review` (diff-scoped pre-landing review)');
   });

   test('resolver output contains key dashboard elements', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-ceo-review'); // carved: dashboard moved to section
     expect(content).toContain('VERDICT');
     expect(content).toContain('CLEARED');
     expect(content).toContain('Eng Review');
@@ -709,25 +709,25 @@ describe('REVIEW_DASHBOARD resolver', () => {
   });

   test('dashboard bash block includes git HEAD for staleness detection', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-ceo-review'); // carved: dashboard moved to section
     expect(content).toContain('git rev-parse --short HEAD');
     expect(content).toContain('---HEAD---');
   });

   test('dashboard includes staleness detection prose', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-ceo-review'); // carved: dashboard moved to section
     expect(content).toContain('Staleness detection');
     expect(content).toContain('commit');
   });

   for (const skill of REVIEW_SKILLS) {
     test(`${skill} contains review chaining section`, () => {
-      const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
+      const content = readSkillUnion(skill); // carved skills: union skeleton + sections
       expect(content).toContain('Review Chaining');
     });

     test(`${skill} Review Log includes commit field`, () => {
-      const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
+      const content = readSkillUnion(skill); // carved skills: union skeleton + sections
       expect(content).toContain('"commit"');
     });
   }
@@ -739,13 +739,13 @@ describe('REVIEW_DASHBOARD resolver', () => {
   });

   test('plan-eng-review chaining mentions design and ceo reviews', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-eng-review'); // carved: review body moved to section
     expect(content).toContain('/plan-design-review');
     expect(content).toContain('/plan-ceo-review');
   });

   test('plan-design-review chaining mentions eng, ceo, and design skills', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-design-review');
     expect(content).toContain('/plan-eng-review');
     expect(content).toContain('/plan-ceo-review');
     expect(content).toContain('/design-shotgun');
@@ -761,7 +761,7 @@ describe('REVIEW_DASHBOARD resolver', () => {
 // ─── Test Coverage Audit Resolver Tests ─────────────────────

 describe('TEST_COVERAGE_AUDIT placeholders', () => {
-  const planSkill = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+  const planSkill = readSkillUnion('plan-eng-review'); // carved
   const shipSkill = readShipUnion();
   const reviewSkill = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');

@@ -969,7 +969,7 @@ describe('PLAN_FILE_REVIEW_REPORT resolver', () => {
   }

   test('resolver output contains key report elements', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-ceo-review'); // carved: report writer moved to section
     expect(content).toContain('Trigger');
     expect(content).toContain('Findings');
     expect(content).toContain('VERDICT');
@@ -1144,7 +1144,7 @@ describe('Retro plan completion section', () => {
 describe('Plan status footer in preamble', () => {
   test('preamble contains plan status footer as neutral forward reference to EXIT PLAN MODE GATE', () => {
     // Read any skill that uses PREAMBLE
-    const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('office-hours'); // carved: Phase 5/6 prose moved to section
     expect(content).toContain('Plan Status Footer');
     expect(content).toContain('GSTACK REVIEW REPORT');
     expect(content).toContain('ExitPlanMode');
@@ -1179,7 +1179,7 @@ describe('make-pdf setup ordering', () => {

 describe('Skill invocation during plan mode in preamble', () => {
   test('preamble contains skill invocation plan mode section', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('office-hours'); // carved: Phase 5/6 prose moved to section
     expect(content).toContain('Skill Invocation During Plan Mode');
     expect(content).toContain('precedence over generic plan mode behavior');
     expect(content).toContain('Do not continue the workflow');
@@ -1190,7 +1190,7 @@ describe('Skill invocation during plan mode in preamble', () => {
 // --- {{SPEC_REVIEW_LOOP}} resolver tests ---

 describe('SPEC_REVIEW_LOOP resolver', () => {
-  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+  const content = readSkillUnion('office-hours'); // carved: Phase 5/6 prose moved to section

   test('contains all 5 review dimensions', () => {
     for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
@@ -1226,7 +1226,7 @@ describe('SPEC_REVIEW_LOOP resolver', () => {
 // --- {{DESIGN_SKETCH}} resolver tests ---

 describe('DESIGN_SKETCH resolver', () => {
-  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+  const content = readSkillUnion('office-hours'); // carved: Phase 5/6 prose moved to section

   test('references DESIGN.md for design system constraints', () => {
     expect(content).toContain('DESIGN.md');
@@ -1256,7 +1256,7 @@ describe('DESIGN_SKETCH resolver', () => {
 // --- {{CODEX_SECOND_OPINION}} resolver tests ---

 describe('CODEX_SECOND_OPINION resolver', () => {
-  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+  const content = readSkillUnion('office-hours'); // carved: Phase 5/6 prose moved to section
   const codexContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-office-hours', 'SKILL.md'), 'utf-8');

   test('Phase 3.5 section appears in office-hours SKILL.md', () => {
@@ -1369,7 +1369,7 @@ describe('Codex filesystem boundary', () => {

 describe('BENEFITS_FROM resolver', () => {
   const ceoContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
-  const engContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+  const engContent = readSkillUnion('plan-eng-review'); // carved

   test('plan-ceo-review contains prerequisite skill offer', () => {
     expect(ceoContent).toContain('Prerequisite Skill Offer');
@@ -1551,7 +1551,7 @@ describe('preamble routing injection', () => {

 describe('DESIGN_OUTSIDE_VOICES resolver', () => {
   test('plan-design-review contains outside voices section', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-design-review');
     expect(content).toContain('Design Outside Voices');
     expect(content).toContain('CODEX_AVAILABLE');
     expect(content).toContain('LITMUS SCORECARD');
@@ -1570,7 +1570,7 @@ describe('DESIGN_OUTSIDE_VOICES resolver', () => {
   });

   test('branches correctly per skillName — different prompts', () => {
-    const planContent = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const planContent = readSkillUnion('plan-design-review');
     const consultContent = fs.readFileSync(path.join(ROOT, 'design-consultation', 'SKILL.md'), 'utf-8');
     // plan-design-review uses analytical prompt (high reasoning)
     expect(planContent).toContain('model_reasoning_effort="high"');
@@ -1583,7 +1583,7 @@ describe('DESIGN_OUTSIDE_VOICES resolver', () => {

 describe('DESIGN_HARD_RULES resolver', () => {
   test('plan-design-review Pass 4 contains hard rules', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-design-review');
     expect(content).toContain('Design Hard Rules');
     expect(content).toContain('Classifier');
     expect(content).toContain('MARKETING/LANDING PAGE');
@@ -1596,26 +1596,26 @@ describe('DESIGN_HARD_RULES resolver', () => {
   });

   test('includes all 3 rule sets', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-design-review');
     expect(content).toContain('Landing page rules');
     expect(content).toContain('App UI rules');
     expect(content).toContain('Universal rules');
   });

   test('references shared AI slop blacklist items', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-design-review');
     expect(content).toContain('3-column feature grid');
     expect(content).toContain('Purple/violet/indigo');
   });

   test('includes OpenAI hard rejection criteria', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-design-review');
     expect(content).toContain('Generic SaaS card grid');
     expect(content).toContain('Carousel with no narrative purpose');
   });

   test('includes OpenAI litmus checks', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-design-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-design-review');
     expect(content).toContain('Brand/product unmistakable');
     expect(content).toContain('premium with all decorative shadows removed');
   });
@@ -1624,7 +1624,7 @@ describe('DESIGN_HARD_RULES resolver', () => {
 // --- Extended DESIGN_SKETCH resolver tests ---

 describe('DESIGN_SKETCH extended with outside voices', () => {
-  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+  const content = readSkillUnion('office-hours'); // carved: Phase 5/6 prose moved to section

   test('contains outside design voices step', () => {
     expect(content).toContain('Outside design voices');
@@ -2649,7 +2649,7 @@ describe('community fixes wave', () => {

   // #510 — Context warnings: plan-eng-review has explicit anti-warning
   test('plan-eng-review/SKILL.md contains "Do not preemptively warn"', () => {
-    const content = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    const content = readSkillUnion('plan-eng-review'); // carved: review body moved to section
     expect(content).toContain('Do not preemptively warn');
   });

@@ -3112,7 +3112,9 @@ describe('GSTACK REVIEW REPORT delete-then-append flow', () => {

   for (const skill of PLAN_REVIEW_SKILLS) {
     test(`${skill}/SKILL.md prescribes delete-then-append, not in-place replace`, () => {
-      const content = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
+      // Carved skills (v2 plan Phase B) relocate the review-report prose into
+      // sections/*.md; readSkillUnion follows the content wherever the carve put it.
+      const content = readSkillUnion(skill);

       // The new (correct) instruction must be present.
       expect(content).toContain('delete-then-append flow');

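The hunks above swap a plain `SKILL.md` read for `readSkillUnion(...)`, whose definition is not part of this excerpt. Below is a minimal sketch of what such a helper might look like, assuming a carved skill keeps its on-demand content in `sections/*.md` next to `SKILL.md`. The name `readSkillUnionSketch`, the sorted read order, and the newline separator are all assumptions for illustration, not the repository's implementation.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed behaviour: skeleton SKILL.md followed by every sections/*.md file.
// The real readSkillUnion is not shown in this diff; this is illustrative only.
function readSkillUnionSketch(root: string, skill: string): string {
  const skillDir = path.join(root, skill);
  const parts = [fs.readFileSync(path.join(skillDir, "SKILL.md"), "utf-8")];
  const sectionsDir = path.join(skillDir, "sections");
  if (fs.existsSync(sectionsDir)) {
    // Sorted so the union is deterministic across platforms.
    for (const name of fs.readdirSync(sectionsDir).sort()) {
      if (name.endsWith(".md")) {
        parts.push(fs.readFileSync(path.join(sectionsDir, name), "utf-8"));
      }
    }
  }
  return parts.join("\n");
}

// Usage: a throwaway carved skill whose dashboard text lives only in a section.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), "union-sketch-"));
const demoSkillDir = path.join(demoRoot, "plan-ceo-review");
fs.mkdirSync(path.join(demoSkillDir, "sections"), { recursive: true });
fs.writeFileSync(path.join(demoSkillDir, "SKILL.md"), "# skeleton\n");
fs.writeFileSync(
  path.join(demoSkillDir, "sections", "review-sections.md"),
  "REVIEW READINESS DASHBOARD\n",
);
const union = readSkillUnionSketch(demoRoot, "plan-ceo-review");
console.log(union.includes("REVIEW READINESS DASHBOARD"));
```

Reading the union keeps content assertions valid wherever a carve relocates a block, which is the property the updated tests rely on.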
@@ -0,0 +1,350 @@
+/**
+ * SDK-based AUQ capture — the reliable way to grade AskUserQuestion content.
+ *
+ * Real-PTY capture is lossy for plan-mode AUQs: they render every option on one
+ * cursor-positioned logical line that stripAnsi can't reconstruct, so format
+ * predicates (ELI10:, Net:, ✅) silently miss even when the question is
+ * well-formed. This helper instead uses the `claude -p` SDK path (the same one
+ * skill-e2e-plan-format uses): the agent is told to WRITE the verbatim text of
+ * the AskUserQuestion it would have asked to a file. That captures exactly what
+ * the model GENERATES — the surface where carving could degrade quality — with
+ * zero rendering loss. The TTY rendering layer is identical for fat and slim
+ * skills, so it is not where token-reduction degradation can hide.
+ */
+import * as fs from 'node:fs';
+import * as os from 'node:os';
+import * as path from 'node:path';
+import { spawnSync } from 'node:child_process';
+import { runSkillTest, type SkillTestResult } from './session-runner';
+
+const ROOT = path.resolve(__dirname, '..', '..');
+
+/** The 7 decision-brief format elements graded on the captured AUQ text. */
+export const AUQ_FORMAT_ELEMENTS: Array<{ field: string; re: RegExp }> = [
+  { field: 'ELI10:', re: /ELI10\s*:/i },
+  { field: 'Recommendation:', re: /Recommendation\s*:/i },
+  { field: 'Pros / cons:', re: /Pros\s*\/\s*cons/i },
+  { field: '✅', re: /✅/ },
+  { field: '❌', re: /❌/ },
+  { field: 'Net:', re: /Net\s*:/i },
+  { field: '(recommended)', re: /\(recommended\)/i },
+];
+
+export function scoreAuqFormat(text: string): { present: number; total: number; missing: string[] } {
+  const missing = AUQ_FORMAT_ELEMENTS.filter(e => !e.re.test(text)).map(e => e.field);
+  return { present: AUQ_FORMAT_ELEMENTS.length - missing.length, total: AUQ_FORMAT_ELEMENTS.length, missing };
+}
+
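As a usage illustration for `scoreAuqFormat`, the sketch below scores a made-up decision brief. The element table and function are copied from the diff so the block runs on its own. The sample brief is hypothetical and deliberately omits the ❌ bullet and the `Net:` line.

```typescript
// Copied from the diff above so this example is self-contained.
const AUQ_FORMAT_ELEMENTS: Array<{ field: string; re: RegExp }> = [
  { field: "ELI10:", re: /ELI10\s*:/i },
  { field: "Recommendation:", re: /Recommendation\s*:/i },
  { field: "Pros / cons:", re: /Pros\s*\/\s*cons/i },
  { field: "✅", re: /✅/ },
  { field: "❌", re: /❌/ },
  { field: "Net:", re: /Net\s*:/i },
  { field: "(recommended)", re: /\(recommended\)/i },
];

function scoreAuqFormat(text: string): { present: number; total: number; missing: string[] } {
  const missing = AUQ_FORMAT_ELEMENTS.filter(e => !e.re.test(text)).map(e => e.field);
  return { present: AUQ_FORMAT_ELEMENTS.length - missing.length, total: AUQ_FORMAT_ELEMENTS.length, missing };
}

// Hypothetical captured brief: no ❌ bullet and no Net line.
const sampleBrief = [
  "ELI10: We are choosing how to price the new tier.",
  "Recommendation: A because it needs no new billing code.",
  "Pros / cons:",
  "A) Keep current pricing (recommended)",
  "  ✅ No engineering work",
].join("\n");

const sampleScore = scoreAuqFormat(sampleBrief);
console.log(sampleScore); // present: 5, total: 7, missing: ["❌", "Net:"]
```

The `missing` list preserves table order, so a failing test can name the absent elements directly.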
+/**
+ * Grade recommendation substance ROBUST to the connective. judgeRecommendation()
+ * keys on the literal "because" (correct for the spec, pinned by
+ * llm-judge-recommendation.test.ts), but skills routinely write equally
+ * substantive reasons as "Recommendation: A. <reason>" / "A — <reason>" /
+ * "A: <reason>". Grading those as substance-1 would make the matrix cry wolf on
+ * genuinely good recommendations. So we normalize a non-"because" connective to
+ * "because" purely for grading, then call the shared judge. We also report
+ * whether the ORIGINAL used the literal "because" — a soft style signal, since
+ * the format spec prefers it and the voice rule forbids the em-dash form.
+ *
+ * This does NOT touch judgeRecommendation or its pinned fixtures.
+ */
+export async function gradeAuqRecommendation(
+  text: string,
+): Promise<{ substance: number; present: boolean; hadLiteralBecause: boolean; reason: string }> {
+  const { judgeRecommendation } = await import('./llm-judge');
+  const recLine = text.match(/^[*_]*\s*recommendation\s*[*_]*\s*:\s*(.+)$/im);
+  const hadLiteralBecause = !!recLine && /\bbecause\s+\S/i.test(recLine[1]);
+
+  let graded = text;
+  if (recLine && !hadLiteralBecause) {
+    // Rewrite "Recommendation: <choice><sep><reason>" → "...<choice> because <reason>"
+    // sep ∈ {". ", " — ", " - ", ": "} right after a short choice token.
+    const normalizedLine = recLine[1].replace(
+      /^([^.:—-]{1,40}?)\s*(?:\.\s+|\s*[—-]\s+|:\s+)(\S.+)$/,
+      '$1 because $2',
+    );
+    if (normalizedLine !== recLine[1]) {
+      graded = text.replace(recLine[0], `Recommendation: ${normalizedLine}`);
+    }
+  }
+
+  try {
+    const r = await judgeRecommendation(graded);
+    return { substance: r.reason_substance, present: r.present, hadLiteralBecause, reason: r.reason_text };
+  } catch {
+    return { substance: 0, present: !!recLine, hadLiteralBecause, reason: '' };
+  }
+}
+
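The connective rewrite inside `gradeAuqRecommendation` is easier to follow in isolation. The sketch below lifts the regex from the diff into a standalone function and runs it on a few made-up recommendation tails. The function name `normalizeConnective` and the sample strings are hypothetical, and no judge call is involved.

```typescript
// Hypothetical standalone version of the rewrite step; the regex is copied from the diff above.
function normalizeConnective(recommendation: string): string {
  // Lines that already carry a literal "because <reason>" are left untouched.
  if (/\bbecause\s+\S/i.test(recommendation)) return recommendation;
  // A short choice token, then one of ". ", " — ", " - ", ": ", then the reason.
  return recommendation.replace(
    /^([^.:—-]{1,40}?)\s*(?:\.\s+|\s*[—-]\s+|:\s+)(\S.+)$/,
    "$1 because $2",
  );
}

const normalized = [
  "A. It needs no new billing code", // period separator
  "B — fewer moving parts",          // em-dash separator
  "C: matches the spec",             // colon separator
  "D because it is cheapest",        // already literal, unchanged
  "Ship it now",                     // no separator, unchanged
].map(normalizeConnective);

console.log(normalized);
```

Because the choice token excludes `.`, `:`, and dashes, a choice that itself contains a hyphen (for example "drop-in") is not rewritten and would be graded as written.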
+/**
+ * Build a throwaway plan dir holding a SPECIFIC plan-ceo-review SKILL.md (so we
+ * can pit the carved skeleton against the verbose monolith). `sectionsFrom`, if
+ * given, copies that dir's sections/ alongside (for the carved variant).
+ */
+export function setupPlanCeoDir(opts: {
+  skillMd: string;
+  sectionsFrom?: string | null;
+  tmpPrefix?: string;
+}): string {
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), opts.tmpPrefix ?? 'auq-sdk-'));
+  const run = (cmd: string, args: string[]) => spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
+  run('git', ['init', '-b', 'main']);
+  run('git', ['config', 'user.email', 'test@test.com']);
+  run('git', ['config', 'user.name', 'Test']);
+  fs.writeFileSync(
+    path.join(dir, 'plan.md'),
+    [
+      '# Plan: Launch a "developer-friendly" pricing tier',
+      '',
+      '## Goal',
+      'Increase developer adoption.',
+      '',
+      '## Success metric',
+      'More signups.',
+      '',
+      '## Premise',
+      "We haven't talked to any developers about whether the current pricing is a",
+      'barrier. The team agreed it "feels like" it should be cheaper.',
+    ].join('\n'),
+  );
+  fs.mkdirSync(path.join(dir, 'plan-ceo-review'), { recursive: true });
+  fs.writeFileSync(path.join(dir, 'plan-ceo-review', 'SKILL.md'), opts.skillMd);
+  if (opts.sectionsFrom && fs.existsSync(opts.sectionsFrom)) {
+    fs.cpSync(opts.sectionsFrom, path.join(dir, 'plan-ceo-review', 'sections'), { recursive: true });
+  }
+  run('git', ['add', '.']);
+  run('git', ['commit', '-m', 'plan']);
+  return dir;
+}
+
+/**
+ * Generic: build a throwaway dir holding ANY skill's SKILL.md (+ optional
+ * sections) plus arbitrary fixture files, so the matrix can drive each skill to
+ * its first AUQ. Mirrors setupPlanCeoDir but skill-agnostic.
+ */
+export function setupSkillDir(opts: {
+  skillName: string;
+  skillMd: string;
+  sectionsFrom?: string | null;
+  fixtures?: Record<string, string>;
+  tmpPrefix?: string;
+}): string {
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), opts.tmpPrefix ?? `auq-${opts.skillName}-`));
+  const run = (cmd: string, args: string[]) => spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
+  run('git', ['init', '-b', 'main']);
+  run('git', ['config', 'user.email', 'test@test.com']);
+  run('git', ['config', 'user.name', 'Test']);
+  for (const [name, content] of Object.entries(opts.fixtures ?? {})) {
+    const p = path.join(dir, name);
+    fs.mkdirSync(path.dirname(p), { recursive: true });
+    fs.writeFileSync(p, content);
+  }
+  fs.mkdirSync(path.join(dir, opts.skillName), { recursive: true });
+  fs.writeFileSync(path.join(dir, opts.skillName, 'SKILL.md'), opts.skillMd);
+  if (opts.sectionsFrom && fs.existsSync(opts.sectionsFrom)) {
+    fs.cpSync(opts.sectionsFrom, path.join(dir, opts.skillName, 'sections'), { recursive: true });
+  }
+  run('git', ['add', '.']);
+  run('git', ['commit', '-m', 'fixture']);
+  return dir;
+}
+
+/** Read any skill's current (worktree) SKILL.md + its sections dir if present. */
+export function skillFromWorktree(skillName: string): { skillMd: string; sectionsFrom: string | null } {
+  const sec = path.join(ROOT, skillName, 'sections');
+  return {
+    skillMd: fs.readFileSync(path.join(ROOT, skillName, 'SKILL.md'), 'utf-8'),
+    sectionsFrom: fs.existsSync(sec) ? sec : null,
+  };
+}
+
/**
 * Generic: drive ANY skill to its FIRST AskUserQuestion and capture the
 * verbatim decision-brief text the model would have shown. `scenario` is the
 * per-skill prose that triggers a real AUQ (e.g. "review plan.md", "audit
 * vuln.ts for security"). Absolute skill path + Read/Write-only so the agent
 * cannot wander to the global install.
 */
export async function captureFirstAuq(opts: {
  planDir: string;
  skillName: string;
  scenario: string;
  testName: string;
  runId?: string;
  model?: string;
}): Promise<string> {
  const outFile = path.join(opts.planDir, 'ask-capture.md');
  const skillPath = path.join(opts.planDir, opts.skillName, 'SKILL.md');
  const prompt = `You are running a format-capture test. The ONLY skill file you may read is this absolute path: ${skillPath}. Do NOT search for, Glob, find, or read any other SKILL.md anywhere — especially nothing under ~/.claude or /Users.

Read ${skillPath} and follow its workflow for this scenario:

${opts.scenario}

This is a capture test, not an interactive session. Skip any system-audit / environment-setup / codebase-exploration steps. When you reach the FIRST point where the skill would call AskUserQuestion, write the verbatim full decision-brief text of that question (title, ELI10, stakes, recommendation, every option with its ✅/❌ pros/cons bullets, and the Net line) to ${outFile}. Do NOT call any tool to ask the user. Do NOT paraphrase. After writing the file, STOP.`;

  await runSkillTest({
    prompt,
    workingDirectory: opts.planDir,
    allowedTools: ['Read', 'Write'],
    maxTurns: 14,
    timeout: 240_000,
    testName: opts.testName,
    runId: opts.runId,
    model: opts.model ?? 'claude-opus-4-7',
  });

  try {
    return fs.readFileSync(outFile, 'utf-8');
  } catch {
    return '';
  }
}

/**
 * Drive ANY carved skill through a real `claude -p` run and detect, LOSSLESSLY,
 * which `sections/<file>.md` files the agent actually Read — from the tool-use
 * stream, not the ANSI screen buffer. This is the reliable replacement for the
 * real-PTY `visibleSince()` screen-scraping the section-loading tests used to do
 * (which silently saw nothing in a Conductor PTY: cursor-positioned renders and
 * an unanswered Step 0 question loop both defeat the regex).
 *
 * The skill under test is the planted copy in `planDir` (pin the absolute path so
 * the agent cannot wander to the global install). AskUserQuestion is declared
 * unavailable so the agent auto-picks the recommended option and proceeds far
 * enough to hit the post-Step-0 STOP-Read directives; Read is the tool a STOP-Read
 * resolves to, so Read/Grep/Glob/Write is all the agent needs (no Bash → it cannot
 * `find /` its way out, nor run git/gh mutations).
 */
export async function captureSectionReads(opts: {
  planDir: string;
  skillName: string;
  scenario: string;
  /** Relative filename the agent writes its final output to (terminal signal). */
  reportFile?: string;
  /** Marker proving a real report/plan was produced (default: any non-empty text). */
  reportMarker?: RegExp;
  testName: string;
  runId?: string;
  model?: string;
  maxTurns?: number;
  timeout?: number;
}): Promise<{ readSections: Set<string>; reportProduced: boolean; toolCalls: SkillTestResult['toolCalls']; output: string }> {
  const outFile = path.join(opts.planDir, opts.reportFile ?? 'REPORT.md');
  const skillPath = path.join(opts.planDir, opts.skillName, 'SKILL.md');
  const prompt = `You are running an automated skill-execution test. No human is present, so AskUserQuestion is unavailable. The ONLY skill file you may read is this absolute path: ${skillPath}. Do NOT Glob/find/search for any other SKILL.md anywhere — especially nothing under ~/.claude or /Users.

Read ${skillPath} and EXECUTE its workflow for this scenario:

${opts.scenario}

Rules for this run:
- Skip system-audit, environment-setup, telemetry, and codebase-exploration steps.
- At any decision point that would call AskUserQuestion, silently pick the skill's recommended option and continue. Do NOT stop to ask.
- This skill's body has been carved into on-demand sections/. When the skill gives a STOP-Read directive (for example "Read \`.../sections/<file>\` and execute it in full"), you MUST actually Read that sections/ file with the Read tool BEFORE doing the work it covers. Do not work from memory.
- Do NOT run git, gh, commit, push, or any mutating command.
- When the workflow is complete, write the skill's final output (the full review report / ship plan, including any required report table) to ${outFile}.`;

  const result = await runSkillTest({
    prompt,
    workingDirectory: opts.planDir,
    allowedTools: ['Read', 'Grep', 'Glob', 'Write'],
    maxTurns: opts.maxTurns ?? 25,
    timeout: opts.timeout ?? 300_000,
    testName: opts.testName,
    runId: opts.runId,
    model: opts.model ?? 'claude-opus-4-7',
  });

  const readSections = new Set<string>();
  for (const c of result.toolCalls) {
    if (c.tool !== 'Read') continue;
    const fp = String(c.input?.file_path ?? '');
    const m = fp.match(/sections\/([A-Za-z0-9._-]+\.md)/);
    if (m) readSections.add(m[1]);
  }

  let output = '';
  try { output = fs.readFileSync(outFile, 'utf-8'); } catch { output = result.output ?? ''; }
  const reportProduced = opts.reportMarker ? opts.reportMarker.test(output) : output.trim().length > 0;

  return { readSections, reportProduced, toolCalls: result.toolCalls, output };
}

/** Read the carved (current worktree) plan-ceo SKILL.md + its sections dir. */
export function carvedSkill(): { skillMd: string; sectionsFrom: string | null } {
  const sec = path.join(ROOT, 'plan-ceo-review', 'sections');
  return {
    skillMd: fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8'),
    sectionsFrom: fs.existsSync(sec) ? sec : null,
  };
}

/** Read the pre-carve verbose monolith plan-ceo SKILL.md from git. */
export function verboseSkill(gitRef = 'ab66193e^'): string {
  return execGit(['show', `${gitRef}:plan-ceo-review/SKILL.md`]);
}

function execGit(args: string[]): string {
  const r = spawnSync('git', args, { cwd: ROOT, encoding: 'utf-8', maxBuffer: 64 * 1024 * 1024 });
  if (r.status !== 0) throw new Error(`git ${args.join(' ')} failed: ${r.stderr}`);
  return r.stdout;
}

/**
 * Drive plan-ceo-review to its Step 0F mode-selection AskUserQuestion in the
 * given plan dir and capture the verbatim question text the model generates.
 * Returns the captured text ('' if the agent never wrote the file).
 */
export async function captureModeSelectionAuq(opts: {
  planDir: string;
  testName: string;
  runId?: string;
  model?: string;
}): Promise<string> {
  const outFile = path.join(opts.planDir, 'ask-capture.md');
  const skillPath = path.join(opts.planDir, 'plan-ceo-review', 'SKILL.md');
  const planPath = path.join(opts.planDir, 'plan.md');
  // CRITICAL: pin the EXACT skill file. Without this the agent runs
  // `find / -name SKILL.md` / Glob and reads the GLOBAL install
  // (~/.claude/skills/...) instead of the version-under-test in the temp dir —
  // which silently invalidates a carved-vs-verbose A/B (both sides end up
  // reading the same global skill). Absolute path + no-wander instruction +
  // Bash disallowed (so `find /` is impossible) locks it to the planted file.
  const prompt = `You are running a format-capture test. Use ONLY these two files:
- The skill to follow: ${skillPath}
- The plan to review: ${planPath}

Read ${skillPath} for the review workflow. Do NOT search for, Glob, find, or read any OTHER SKILL.md anywhere on the system — especially nothing under ~/.claude or /Users. The ONLY skill file you may read is the absolute path above.

Read ${planPath} — that is the plan to review. It is a standalone plan document, not a codebase. Skip any codebase exploration or system-audit steps.

Proceed to Step 0F (Mode Selection), where the skill presents the 4 review-mode options to the user via AskUserQuestion.

Write the verbatim text of that AskUserQuestion (the full decision brief: title, ELI10, stakes, recommendation, every option with its pros/cons bullets, and the Net line) to ${outFile}. Do NOT call any tool to ask the user. Do NOT paraphrase. After writing the file, stop.`;

  await runSkillTest({
    prompt,
    workingDirectory: opts.planDir,
    // Read + Write only: no Bash means the agent cannot `find /` its way to the
    // global install, and the skill's preamble bash blocks (irrelevant to format
    // capture) can't run and wander.
    allowedTools: ['Read', 'Write'],
    maxTurns: 12,
    timeout: 240_000,
    testName: opts.testName,
    runId: opts.runId,
    model: opts.model ?? 'claude-opus-4-7',
  });

  try {
    const text = fs.readFileSync(outFile, 'utf-8');
    // Caveat: the output file alone cannot prove the agent read the planted
    // skill rather than a global one, so callers should also confirm via the
    // run log. No content check is applied here; the try/catch only covers a
    // capture file that was never written (returned as '').
    return text;
  } catch {
    return '';
  }
}
|
||||
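The section-read detection in `captureSectionReads` can be exercised without a live run. The following is a minimal sketch, assuming a simplified `ToolCall` shape (hypothetical; the real `SkillTestResult['toolCalls']` type is not shown in this diff) and made-up sample paths. It reuses the same regex as the helper.

```typescript
// Hypothetical, simplified shape of one recorded tool call (assumption for this sketch).
interface ToolCall {
  tool: string;
  input?: { file_path?: unknown };
}

// Collect the sections/<file>.md basenames that were opened with the Read tool.
function sectionsRead(toolCalls: ToolCall[]): Set<string> {
  const read = new Set<string>();
  for (const call of toolCalls) {
    if (call.tool !== 'Read') continue; // Grep/Glob hits do not count as a section read
    const filePath = String(call.input?.file_path ?? '');
    const match = filePath.match(/sections\/([A-Za-z0-9._-]+\.md)/);
    if (match) read.add(match[1]);
  }
  return read;
}

// Made-up sample stream: one skeleton read, one section read (twice), one Glob.
const sample: ToolCall[] = [
  { tool: 'Read', input: { file_path: '/tmp/plan/plan-ceo-review/SKILL.md' } },
  { tool: 'Read', input: { file_path: '/tmp/plan/plan-ceo-review/sections/review-sections.md' } },
  { tool: 'Glob', input: { file_path: '/tmp/plan/plan-ceo-review/sections/other.md' } },
  { tool: 'Read', input: { file_path: '/tmp/plan/plan-ceo-review/sections/review-sections.md' } },
];

console.log([...sectionsRead(sample)]);
```

Because the result is a `Set`, repeated reads of the same section collapse to one entry, so a caller only needs a membership check such as `readSections.has('review-sections.md')`.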
@@ -226,7 +226,14 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 120_000,
  },
  {
    // Carved (v2 plan T9): skeleton SKILL.md + sections/review-sections.md.
    // Content + size floors run against the union (relocated prose still counts);
    // maxSkeletonBytes asserts the always-loaded skeleton shrank from the ~138KB
    // monolith to ~81KB (measured 80,731 B, -42%). Headroom to 90KB so a small
    // skeleton edit doesn't trip CI, but a 10KB regression does.
    skill: 'plan-ceo-review',
    sectioned: true,
    maxSkeletonBytes: 90_000,
    mustContain: [
      'SCOPE EXPANSION',
      'SELECTIVE EXPANSION',
@@ -238,7 +245,13 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 80_000,
  },
  {
    // Carved (v2 plan T9): skeleton + sections/review-sections.md. The 4-section
    // review, outside voice, and required outputs moved to the section; content
    // checks run against the union. Skeleton shrank 106,984 -> 54,892 B (-48.7%);
    // maxSkeletonBytes 62KB = measured + headroom.
    skill: 'plan-eng-review',
    sectioned: true,
    maxSkeletonBytes: 62_000,
    mustContain: [
      'Architecture',
      'Code Quality',
@@ -250,7 +263,13 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 70_000,
  },
  {
    // Carved (v2 plan T9): skeleton + sections/review-sections.md. The 7 design
    // passes + required outputs moved to the section; content checks run against
    // the union. Skeleton shrank 112,057 -> 76,024 B (-32.2%); maxSkeletonBytes
    // 82KB = measured + headroom.
    skill: 'plan-design-review',
    sectioned: true,
    maxSkeletonBytes: 82_000,
    mustContain: [
      'design',
      'visual',
@@ -281,7 +300,15 @@ export const PARITY_INVARIANTS: ParityInvariant[] = [
    minBytes: 30_000,
  },
  {
    // Carved (v2 plan T9): skeleton SKILL.md + sections/design-and-handoff.md.
    // Phase 5 (design doc) + Phase 6 (handoff) moved into the section, so
    // 'design doc' / 'problem statement' now live there — content checks run
    // against the union. maxSkeletonBytes asserts the always-loaded skeleton
    // shrank from the ~118KB monolith to ~89KB (measured 88,975 B, -24.8%);
    // headroom to 96KB so a small skeleton edit doesn't trip CI.
    skill: 'office-hours',
    sectioned: true,
    maxSkeletonBytes: 96_000,
    mustContain: ['design doc', 'problem statement'],
    mustHaveHeadings: ['## Preamble', '## When to invoke'],
    maxSizeRatio: 1.05,
|
||||
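The shrink percentages and headroom quoted in the invariant comments can be rechecked with a few lines of arithmetic. This is only a sketch: the `SkeletonBudget` type and helper names are hypothetical, the byte counts are the measurements quoted in the commit message and comments above, and how the harness actually compares sizes against `maxSkeletonBytes` is not shown in this diff.

```typescript
// Hypothetical record type for this sketch; byte counts are copied from the comments above.
interface SkeletonBudget {
  skill: string;
  monolithBytes: number;    // size before the carve
  skeletonBytes: number;    // measured always-loaded skeleton after the carve
  maxSkeletonBytes: number; // budget asserted by the parity invariant
}

const budgets: SkeletonBudget[] = [
  { skill: 'plan-ceo-review', monolithBytes: 138_838, skeletonBytes: 80_731, maxSkeletonBytes: 90_000 },
  { skill: 'plan-eng-review', monolithBytes: 106_984, skeletonBytes: 54_892, maxSkeletonBytes: 62_000 },
  { skill: 'plan-design-review', monolithBytes: 112_057, skeletonBytes: 76_024, maxSkeletonBytes: 82_000 },
];

// Relative size change of the skeleton in percent (negative means it shrank).
function shrinkPercent(b: SkeletonBudget): number {
  return ((b.skeletonBytes - b.monolithBytes) / b.monolithBytes) * 100;
}

// Bytes the skeleton can still grow before it exceeds its budget.
function headroomBytes(b: SkeletonBudget): number {
  return b.maxSkeletonBytes - b.skeletonBytes;
}

for (const b of budgets) {
  console.log(`${b.skill}: ${shrinkPercent(b).toFixed(1)}% skeleton change, ${headroomBytes(b)} B headroom`);
}
```

For plan-ceo-review the headroom works out to 9,269 B, which is consistent with the comment that a small skeleton edit passes while a 10KB regression trips the budget.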
@@ -116,12 +116,13 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  // Real-PTY E2E batch (#6 new tests on the harness).
  // Each one tests behavior the SDK harness can't observe (rendered TTY,
  // numbered-option lists, multi-phase ordering, idempotency state echo).
  'ask-user-question-format-pty': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completeness-section.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
  'auq-format-gate': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completeness-section.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/auq-sdk-capture.ts', 'test/helpers/session-runner.ts', 'test/helpers/llm-judge.ts'],
  'plan-ceo-mode-routing': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
  'plan-design-with-ui-scope': ['plan-design-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
  'budget-regression-pty': ['test/helpers/eval-store.ts', 'test/skill-budget-regression.test.ts'],
  'ship-idempotency-pty': ['ship/**', 'bin/gstack-next-version', 'bin/gstack-version-bump', 'scripts/resolvers/sections.ts', 'lib/worktree.ts', 'test/helpers/claude-pty-runner.ts'],
  'ship-section-loading': ['ship/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/required-reads.ts', 'test/helpers/transcript-section-logger.ts', 'test/helpers/claude-pty-runner.ts'],
  'ship-section-loading': ['ship/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/auq-sdk-capture.ts', 'test/helpers/session-runner.ts'],
  'plan-ceo-section-loading': ['plan-ceo-review/**', 'scripts/resolvers/sections.ts', 'scripts/gen-skill-docs.ts', 'test/helpers/auq-sdk-capture.ts', 'test/helpers/session-runner.ts'],
  'autoplan-chain-pty': ['autoplan/**', 'plan-ceo-review/**', 'plan-design-review/**', 'plan-eng-review/**', 'plan-devex-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
  'e2e-harness-audit': ['plan-ceo-review/**', 'plan-eng-review/**', 'plan-design-review/**', 'plan-devex-review/**', 'scripts/resolvers/preamble/generate-completion-status.ts', 'test/helpers/agent-sdk-runner.ts', 'test/helpers/claude-pty-runner.ts'],

@@ -504,12 +505,13 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  // Real-PTY E2E batch — tier classification:
  // gate: cheap, deterministic, run on every PR
  // periodic: long-running or expensive (>$3/run), run weekly
  'ask-user-question-format-pty': 'gate', // ~$0.50/run, single skill probe
  'auq-format-gate': 'gate', // ~$0.50/run, SDK capture, single skill probe
  'plan-ceo-mode-routing': 'periodic', // ~$3/run, deep navigation through 8-12 prior AskUserQuestions
  'plan-design-with-ui-scope': 'gate', // ~$0.80/run
  'budget-regression-pty': 'gate', // free, library-only assertion
  'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode
  'ship-section-loading': 'periodic', // ~$3/run, real /ship; asserts section reads
  'plan-ceo-section-loading': 'periodic', // ~$3-5/run, real /plan-ceo-review; asserts section read
  'autoplan-chain-pty': 'periodic', // ~$8/run, all 3 phases sequential

  // Per-finding count + review-report-at-bottom — periodic because each

@@ -8,6 +8,14 @@
 *
 * Also pins the PASSIVE-manifest contract (CM2 / v2_PLAN.md:663): manifest entries
 * carry only id/file/title/trigger — no machine predicate (applies_when/required_for).
 *
 * Generalized for every carved skill (v2 plan Phase B). Carved skills are
 * discovered dynamically (any top-level dir with sections/manifest.json), so a new
 * carve is covered the moment its manifest lands — no edit here. Per Codex
 * outside-voice P2, each skill's manifest + dir listing is read INSIDE its own
 * describe case (not at module top), so a carve-in-progress (manifest added before
 * the .md is generated) fails only that skill's generated-.md assertion instead of
 * crashing the whole module, and the suite never silently stays ship-only.
 */

import { describe, test, expect } from 'bun:test';
@@ -15,63 +23,86 @@ import * as fs from 'fs';
import * as path from 'path';

const ROOT = path.resolve(import.meta.dir, '..');
const SHIP_SECTIONS = path.join(ROOT, 'ship', 'sections');
const manifest = JSON.parse(fs.readFileSync(path.join(SHIP_SECTIONS, 'manifest.json'), 'utf-8'));

const sectionTmpls = fs.readdirSync(SHIP_SECTIONS).filter(f => f.endsWith('.md.tmpl'));
const sectionMds = fs.readdirSync(SHIP_SECTIONS).filter(f => f.endsWith('.md') && !f.endsWith('.md.tmpl'));
/** Every top-level skill dir that owns a sections/manifest.json. */
function discoverCarvedSkills(): string[] {
  return fs
    .readdirSync(ROOT, { withFileTypes: true })
    .filter(d => d.isDirectory())
    .map(d => d.name)
    .filter(name => fs.existsSync(path.join(ROOT, name, 'sections', 'manifest.json')))
    .sort();
}

const CARVED_SKILLS = discoverCarvedSkills();

describe('section manifest ↔ filesystem consistency', () => {
  test('manifest parses with skill + sections array', () => {
    expect(manifest.skill).toBe('ship');
    expect(Array.isArray(manifest.sections)).toBe(true);
    expect(manifest.sections.length).toBeGreaterThan(0);
  test('the known carved skills are discovered', () => {
    // Tripwire: if a carve regresses (manifest deleted) this catches it.
    expect(CARVED_SKILLS).toContain('ship');
    expect(CARVED_SKILLS).toContain('plan-ceo-review');
  });

  test('every manifest entry has a .md.tmpl source AND a generated .md', () => {
    for (const s of manifest.sections) {
      expect(fs.existsSync(path.join(SHIP_SECTIONS, `${s.file}.tmpl`))).toBe(true);
      expect(fs.existsSync(path.join(SHIP_SECTIONS, s.file))).toBe(true);
    }
  });
  for (const skill of CARVED_SKILLS) {
    describe(skill, () => {
      // Codex P2: computed per-skill-case, not at module load.
      const sectionsDir = path.join(ROOT, skill, 'sections');
      const manifest = JSON.parse(fs.readFileSync(path.join(sectionsDir, 'manifest.json'), 'utf-8'));
      const sectionTmpls = fs.readdirSync(sectionsDir).filter(f => f.endsWith('.md.tmpl'));
      const sectionMds = fs.readdirSync(sectionsDir).filter(f => f.endsWith('.md') && !f.endsWith('.md.tmpl'));

  test('manifest is PASSIVE — no applies_when / required_for predicate (CM2)', () => {
    for (const s of manifest.sections) {
      expect(s).not.toHaveProperty('applies_when');
      expect(s).not.toHaveProperty('required_for');
      // The allowed passive shape:
      expect(typeof s.id).toBe('string');
      expect(typeof s.file).toBe('string');
      expect(typeof s.title).toBe('string');
      expect(typeof s.trigger).toBe('string');
    }
  });
      test('manifest parses with skill + sections array', () => {
        expect(manifest.skill).toBe(skill);
        expect(Array.isArray(manifest.sections)).toBe(true);
        expect(manifest.sections.length).toBeGreaterThan(0);
      });

  test('no generated orphan: every sections/X.md has a sections/X.md.tmpl → FAIL', () => {
    const orphans = sectionMds.filter(md => !sectionTmpls.includes(`${md}.tmpl`));
    expect(orphans).toEqual([]);
  });
      test('every manifest entry has a .md.tmpl source AND a generated .md', () => {
        for (const s of manifest.sections) {
          expect(fs.existsSync(path.join(sectionsDir, `${s.file}.tmpl`))).toBe(true);
          expect(fs.existsSync(path.join(sectionsDir, s.file))).toBe(true);
        }
      });

  test('no hand-edited generated file: every sections/X.md has the AUTO-GENERATED header → FAIL', () => {
    for (const md of sectionMds) {
      const head = fs.readFileSync(path.join(SHIP_SECTIONS, md), 'utf-8').slice(0, 120);
      expect(head).toContain('AUTO-GENERATED');
    }
  });
      test('manifest is PASSIVE — no applies_when / required_for predicate (CM2)', () => {
        for (const s of manifest.sections) {
          expect(s).not.toHaveProperty('applies_when');
          expect(s).not.toHaveProperty('required_for');
          // The allowed passive shape:
          expect(typeof s.id).toBe('string');
          expect(typeof s.file).toBe('string');
          expect(typeof s.title).toBe('string');
          expect(typeof s.trigger).toBe('string');
        }
      });

  test('manifest orphan check (WARN in v2.0): every .md.tmpl is listed', () => {
    const listed = new Set(manifest.sections.map((s: { file: string }) => `${s.file}.tmpl`));
    const unlisted = sectionTmpls.filter(t => !listed.has(t));
    if (unlisted.length > 0) {
      // v2_PLAN.md: WARN now, FAIL in v2.1. Surface, don't fail the build yet.
      // eslint-disable-next-line no-console
      console.warn(`[section-manifest] manifest orphan(s) (not in manifest.json): ${unlisted.join(', ')}`);
    }
    expect(unlisted.length).toBeLessThanOrEqual(unlisted.length); // always passes; WARN only
  });
      test('no generated orphan: every sections/X.md has a sections/X.md.tmpl → FAIL', () => {
        const orphans = sectionMds.filter(md => !sectionTmpls.includes(`${md}.tmpl`));
        expect(orphans).toEqual([]);
      });

  test('section ids are unique', () => {
    const ids = manifest.sections.map((s: { id: string }) => s.id);
    expect(new Set(ids).size).toBe(ids.length);
  });
      test('no hand-edited generated file: every sections/X.md has the AUTO-GENERATED header → FAIL', () => {
        for (const md of sectionMds) {
          const head = fs.readFileSync(path.join(sectionsDir, md), 'utf-8').slice(0, 120);
          expect(head).toContain('AUTO-GENERATED');
        }
      });

      test('manifest orphan check (WARN in v2.0): every .md.tmpl is listed', () => {
        const listed = new Set(manifest.sections.map((s: { file: string }) => `${s.file}.tmpl`));
        const unlisted = sectionTmpls.filter(t => !listed.has(t));
        if (unlisted.length > 0) {
          // v2_PLAN.md: WARN now, FAIL in v2.1. Surface, don't fail the build yet.
          // eslint-disable-next-line no-console
          console.warn(`[section-manifest] ${skill} manifest orphan(s) (not in manifest.json): ${unlisted.join(', ')}`);
        }
        expect(unlisted.length).toBeLessThanOrEqual(unlisted.length); // always passes; WARN only
      });

      test('section ids are unique', () => {
        const ids = manifest.sections.map((s: { id: string }) => s.id);
        expect(new Set(ids).size).toBe(ids.length);
      });
    });
  }
});
|
||||
|
||||
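The PASSIVE-manifest contract checked above (entries carry only string `id`/`file`/`title`/`trigger` and no `applies_when`/`required_for` predicate) can be illustrated on in-memory data. This is a sketch only: `passiveViolations` and both sample manifests are hypothetical and not part of the repository; only the key names come from the test.

```typescript
// Loose entry type for this sketch; real manifest entries are parsed from manifest.json.
type ManifestEntry = Record<string, unknown>;

const PASSIVE_KEYS = ['id', 'file', 'title', 'trigger'] as const;
const FORBIDDEN_KEYS = ['applies_when', 'required_for'] as const;

// Return a human-readable problem for every entry that breaks the passive shape.
function passiveViolations(sections: ManifestEntry[]): string[] {
  const problems: string[] = [];
  sections.forEach((entry, i) => {
    for (const key of FORBIDDEN_KEYS) {
      if (key in entry) problems.push(`sections[${i}] carries machine predicate "${key}"`);
    }
    for (const key of PASSIVE_KEYS) {
      if (typeof entry[key] !== 'string') problems.push(`sections[${i}] lacks string field "${key}"`);
    }
  });
  return problems;
}

// Made-up sample data: one compliant entry, one with a predicate and no trigger.
const passiveSample: ManifestEntry[] = [
  { id: 'review-sections', file: 'review-sections.md', title: 'Review sections', trigger: 'after Step 0' },
];
const activeSample: ManifestEntry[] = [
  { id: 'x', file: 'x.md', title: 'X', applies_when: 'always' },
];

console.log(passiveViolations(passiveSample));
console.log(passiveViolations(activeSample));
```

Collecting all violations into a list, rather than asserting on the first one, is a design choice of this sketch; the test in the diff asserts per entry with `expect`.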
@@ -0,0 +1,82 @@
/**
 * plan-ceo-review carve — static ordering guard (GATE tier, free, deterministic).
 *
 * This is the per-PR mechanical backstop for the v2-plan Phase B carve of
 * plan-ceo-review (Codex outside-voice P2). The periodic real-PTY E2E
 * (skill-e2e-plan-ceo-review-section-loading.test.ts) is the behavioral proof,
 * but it runs weekly and costs money. This file runs on every `bun test` and
 * fails CI the moment the carve's structural invariants break:
 *
 *   1. The skeleton points at the section with a STOP-Read directive, and that
 *      directive sits AFTER Step 0 (scope + mode) — so the conversational Step 0
 *      stays in the always-loaded skeleton, never stranded in the on-demand file.
 *   2. The heavy review body (Sections 1-11) is NOT in the skeleton — it moved to
 *      the section. A regression that inlines it back would re-bloat the skeleton.
 *   3. The review report writer ("GSTACK REVIEW REPORT") lives in the section, and
 *      the blocking EXIT PLAN MODE GATE that verifies it lives in the skeleton
 *      AFTER the STOP — so the gate fires once the section work returns.
 *   4. Nothing review-governing sits in the skeleton below the STOP (Codex P1):
 *      no "Section N", no "## Mode Quick Reference", no "## Formatting Rules".
 */

import { describe, test, expect } from 'bun:test';
import * as fs from 'fs';
import * as path from 'path';

const ROOT = path.resolve(import.meta.dir, '..');
const SKELETON = path.join(ROOT, 'plan-ceo-review', 'SKILL.md');
const SECTION = path.join(ROOT, 'plan-ceo-review', 'sections', 'review-sections.md');

describe('plan-ceo-review carve — static ordering', () => {
  const skeleton = fs.readFileSync(SKELETON, 'utf-8');
  const section = fs.readFileSync(SECTION, 'utf-8');

  // Index into the skeleton, -1 if absent.
  const at = (needle: string): number => skeleton.indexOf(needle);

  const STEP0 = '## Step 0: Nuclear Scope Challenge + Mode Selection';
  const STOP = 'sections/review-sections.md'; // appears in the index row + STOP directive
  const GATE = 'GSTACK REVIEW REPORT';

  test('skeleton emits a STOP-Read directive pointing at the section', () => {
    expect(skeleton).toContain('> **STOP.**');
    expect(skeleton).toContain('plan-ceo-review/sections/review-sections.md');
    expect(skeleton).toContain('## Section index — Read each section when its situation applies');
  });

  test('Step 0 (scope + mode) stays in the skeleton, BEFORE the STOP', () => {
    const step0 = at(STEP0);
    const stop = skeleton.indexOf('> **STOP.**');
    expect(step0).toBeGreaterThan(-1);
    expect(stop).toBeGreaterThan(step0); // STOP fires only after Step 0
  });

  test('the heavy review body (Sections 1-11) is NOT in the skeleton', () => {
    expect(skeleton).not.toContain('### Section 1: Architecture Review');
    expect(skeleton).not.toContain('### Section 11:');
    // ...it lives in the section instead.
    expect(section).toContain('### Section 1: Architecture Review');
    expect(section).toContain('### Section 11:');
  });

  test('nothing review-governing sits in the skeleton below the STOP (Codex P1)', () => {
    // Mode Quick Reference + Formatting Rules govern review-time behavior and must
    // travel with the section, not be stranded below the STOP in the skeleton.
    expect(skeleton).not.toContain('## Mode Quick Reference');
    expect(skeleton).not.toContain('## Formatting Rules');
    expect(section).toContain('## Mode Quick Reference');
  });

  test('review report writer lives in the section; the EXIT PLAN MODE GATE stays in the skeleton AFTER the STOP', () => {
    // The report itself is produced inside the section work...
    expect(section).toContain(GATE);
    // ...and the blocking gate that verifies it is the last thing the skeleton runs.
    const stop = skeleton.indexOf('> **STOP.**');
    const gate = skeleton.lastIndexOf(GATE);
    expect(gate).toBeGreaterThan(stop);
  });

  test('the section is generated, not hand-edited', () => {
    expect(section.slice(0, 120)).toContain('AUTO-GENERATED');
  });
});
|
||||
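The ordering invariants in the guard above reduce to comparing string offsets. The sketch below shows that idea on small made-up skeleton strings; `orderingViolations` and both samples are hypothetical, while the three marker strings are the ones the test searches for.

```typescript
const STEP0_HEADING = '## Step 0: Nuclear Scope Challenge + Mode Selection';
const STOP_MARKER = '> **STOP.**';
const GATE_MARKER = 'GSTACK REVIEW REPORT';

// Report every ordering problem found in a skeleton: Step 0 must come first,
// then the STOP directive, then the last mention of the report gate.
function orderingViolations(skeleton: string): string[] {
  const step0 = skeleton.indexOf(STEP0_HEADING);
  const stop = skeleton.indexOf(STOP_MARKER);
  const gate = skeleton.lastIndexOf(GATE_MARKER);
  const problems: string[] = [];
  if (step0 === -1) problems.push('Step 0 heading missing');
  if (stop === -1) problems.push('STOP directive missing');
  if (step0 !== -1 && stop !== -1 && stop <= step0) problems.push('STOP precedes Step 0');
  if (stop !== -1 && gate <= stop) problems.push('report gate does not follow STOP');
  return problems;
}

// Made-up miniature skeletons for illustration only.
const goodSkeleton = [
  STEP0_HEADING,
  'Pick a mode.',
  `${STOP_MARKER} Read sections/review-sections.md and execute it in full.`,
  `Verify the ${GATE_MARKER} was written before exiting plan mode.`,
].join('\n');

const badSkeleton = [
  `${STOP_MARKER} Read sections/review-sections.md and execute it in full.`,
  STEP0_HEADING,
].join('\n');

console.log(orderingViolations(goodSkeleton));
console.log(orderingViolations(badSkeleton));
```

Using `lastIndexOf` for the gate mirrors the test: an earlier mention of the report name in the skeleton does not satisfy the check, only one that appears after the STOP directive.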
@@ -1,205 +1,91 @@
/**
 * AskUserQuestion format-compliance smoke (gate, paid, real-PTY).
 * AskUserQuestion format-compliance gate (gate, paid, SDK capture).
 *
 * Asserts: when /plan-ceo-review fires its first AskUserQuestion in plan
 * mode, the rendered TTY output contains every element the preamble
 * format spec mandates (scripts/resolvers/preamble/generate-ask-user-format.ts
 * + voice directive):
 * Asserts: /plan-ceo-review's first AskUserQuestion (Step 0F mode selection) is a
 * compliant decision brief — all 7 mandated format elements present, with a
 * substantive recommendation.
 *
 *   1. ELI10 prose paragraph
 *   2. "Recommendation:" line
 *   3. Pros/Cons header
 *   4. ✅ pro bullet AND ❌ con bullet
 *   5. "Net:" closer line
 *   6. "(recommended)" label on one option
 * Why SDK capture, not real-PTY (changed v1.59+): the prior version launched an
 * interactive `claude` PTY and grepped the rendered TUI after stripAnsi. But
 * plan-mode AUQs render as an interactive cursor picker whose cursor-positioning
 * escapes stripAnsi CANNOT faithfully flatten — verified directly: the picker
 * renders fine for a human (cursorSeen=45) but the flattened text drops `ELI10:`
 * and `(recommended)` and `parseNumberedOptions` returns 0. So the old test was
 * grading a lossy projection of the TUI, not the question's actual format, and
 * failed by construction in this environment.
 *
 * Why real-PTY: the existing skill-e2e-plan-format tests cover what the
 * AGENT writes via the SDK (capture-to-file harness). This test covers
 * what the USER actually sees in the terminal — different bug class
 * (e.g., AskUserQuestion tool truncates long prose, conductor renderer mangles
 * bullets, model collapses sections under token pressure). Two layers
 * of defense for a format-discipline regression that previously ate ~6
 * weeks of compliance drift before it was noticed.
 *
 * Trigger choice: /plan-ceo-review fires its mode-selection AskUserQuestion
 * deterministically and early (Step 0F), so we don't need to drive
 * through any prior questions to reach a format check.
 *
 * See test/helpers/claude-pty-runner.ts for runner internals.
 * This version drives the skill via the SDK $OUT_FILE capture path (the agent
 * writes the verbatim AskUserQuestion it would have shown to a file — clean text,
 * zero rendering loss) and grades that. Same property tested (does the question
 * carry every format element), reliably, environment-independent. The rendering
 * layer is identical across skills/content, so it is not where format regressions
 * hide; the model's composed question is. Shares the engine with the periodic
 * A/B and matrix evals (test/helpers/auq-sdk-capture.ts).
 */

import { describe, test, expect } from 'bun:test';
import * as fs from 'node:fs';
import {
  launchClaudePty,
  isNumberedOptionListVisible,
  isPermissionDialogVisible,
  parseNumberedOptions,
} from './helpers/claude-pty-runner';
  setupPlanCeoDir,
  captureModeSelectionAuq,
  scoreAuqFormat,
  gradeAuqRecommendation,
  carvedSkill,
} from './helpers/auq-sdk-capture';

const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'gate';
const describeE2E = shouldRun ? describe : describe.skip;

// Format predicates. Permissive on whitespace and capitalization.
// Tightening these is V2 if real drift is observed.
const ELI10_RE = /ELI10\s*:/i;
const RECOMMEND_RE = /Recommendation\s*:/i;
const PROS_CONS_RE = /Pros\s*\/\s*cons\s*:/i;
const PRO_BULLET_RE = /✅/;
const CON_BULLET_RE = /❌/;
const NET_LINE_RE = /^[\s|]*Net\s*:/im;
const RECOMMENDED_LBL = /\(recommended\)/i;

interface FormatGap {
  field: string;
  re: RegExp;
}

function findFormatGaps(visible: string): FormatGap[] {
  const checks: FormatGap[] = [
    { field: 'ELI10:', re: ELI10_RE },
    { field: 'Recommendation:', re: RECOMMEND_RE },
    { field: 'Pros / cons:', re: PROS_CONS_RE },
    { field: '✅ pro bullet', re: PRO_BULLET_RE },
    { field: '❌ con bullet', re: CON_BULLET_RE },
    { field: 'Net:', re: NET_LINE_RE },
    { field: '(recommended) label', re: RECOMMENDED_LBL },
  ];
  return checks.filter(c => !c.re.test(visible));
}
const runId = `auq-format-gate-${process.env.EVALS_RUN_ID ?? 'local'}`;

describeE2E('AskUserQuestion format compliance (gate)', () => {
  test(
    'first AskUserQuestion from /plan-ceo-review contains all 7 mandated format elements',
    "/plan-ceo-review's first AskUserQuestion is a compliant decision brief (7/7 + substance)",
    async () => {
      const session = await launchClaudePty({
        permissionMode: 'plan',
        timeoutMs: 600_000,
      const carved = carvedSkill();
      const dir = setupPlanCeoDir({
        skillMd: carved.skillMd,
        sectionsFrom: carved.sectionsFrom,
        tmpPrefix: 'auq-format-gate-',
      });

      let text = '';
      try {
        // Boot grace + auto trust-dialog handler.
        await Bun.sleep(8000);
        const since = session.mark();
        session.send('/plan-ceo-review\r');

        // Wait for a SKILL AskUserQuestion. Strategy: poll the visible buffer until it
        // contains both a numbered-option list AND the format markers we
        // expect (ELI10 + Recommendation). When both are present, it IS a
        // real format-compliant AskUserQuestion — not a permission dialog or trust
        // prompt.
        //
        // While polling, auto-grant any permission dialogs we see in the
        // recent tail (preamble side-effects: touch on a sensitive file,
        // etc) so the agent isn't blocked.
        //
        // Budget bumped 300s → 540s in v1.32: /plan-ceo-review's preamble runs
        // multiple bash blocks (gbrain sync probe, telemetry, learnings search,
        // dashboard read) before reaching its mode-selection AskUserQuestion in
        // Step 0F. On substantive branches (or under contention from concurrent
        // tests running at max-concurrency 15), 300s sometimes wasn't enough
        // for the model to drain Step 0 work before emitting the first AUQ.
        // 540s sits below the suite-level 360s/9min timeout headroom and
        // tracks the same magnitude the plan-design-with-ui test uses.
        const budgetMs = 540_000;
        const start = Date.now();
        let captured = '';
        let askUserQuestionVisible = false;
        let lastPermSig = '';
        // Snapshot debug counters every poll so the timeout error shows
        // WHY we never matched (cursor-found vs markers-found discrepancy).
        let debugCursorSeen = 0;
        let debugMarkersSeen = 0;
        let debugBothSeen = 0;

        while (Date.now() - start < budgetMs) {
          await Bun.sleep(2000);
          if (session.exited()) {
            throw new Error(
              `claude exited (code=${session.exitCode()}) before AskUserQuestion rendered.\n` +
                `Last visible:\n${session.visibleSince(since).slice(-2000)}`,
            );
          }
|
||||
const visible = session.visibleSince(since);
|
||||
// Marker check: anywhere in the post-slash region. Since `since`
|
||||
// is set right after sending /plan-ceo-review, there's no stale
|
||||
// AskUserQuestion above this line — the only AskUserQuestion that can produce these
|
||||
// markers is the current one.
|
||||
const hasEli10 = /ELI10\s*:/i.test(visible);
|
||||
const hasRecommend = /Recommendation\s*:/i.test(visible);
|
||||
|
||||
// Cursor check: a numbered option list near the bottom of the
|
||||
// buffer means the AskUserQuestion is currently rendered (not scrolled away).
|
||||
const cursorTail = visible.slice(-4000);
|
||||
const hasCursor = isNumberedOptionListVisible(cursorTail) &&
|
||||
parseNumberedOptions(cursorTail).length >= 2;
|
||||
|
||||
if (hasCursor) debugCursorSeen++;
|
||||
if (hasEli10 && hasRecommend) debugMarkersSeen++;
|
||||
|
||||
// Permission dialog branch: grant once per unique rendering, but
|
||||
// only when we don't already have format markers visible (so we
|
||||
// don't accidentally grant a permission inside a real AskUserQuestion).
|
||||
if (
|
||||
hasCursor &&
|
||||
!(hasEli10 && hasRecommend) &&
|
||||
isPermissionDialogVisible(cursorTail)
|
||||
) {
|
||||
const sig = visible.slice(-500);
|
||||
if (sig !== lastPermSig) {
|
||||
lastPermSig = sig;
|
||||
session.send('1\r');
|
||||
await Bun.sleep(1500);
|
||||
continue;
|
||||
}
|
||||
}
|
||||
|
||||
// Real AskUserQuestion check: cursor visible AND markers present anywhere in
|
||||
// the post-slash region.
|
||||
if (hasCursor && hasEli10 && hasRecommend) {
|
||||
debugBothSeen++;
|
||||
captured = visible;
|
||||
askUserQuestionVisible = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (!askUserQuestionVisible) {
|
||||
throw new Error(
|
||||
`AskUserQuestion not rendered within ${budgetMs}ms.\n` +
|
||||
`Debug counts: cursorSeen=${debugCursorSeen} markersSeen=${debugMarkersSeen} bothSeen=${debugBothSeen}\n` +
|
||||
`Last visible (4KB):\n${session.visibleSince(since).slice(-4000)}`,
|
||||
);
|
||||
}
|
||||
const gaps = findFormatGaps(captured);
|
||||
if (gaps.length > 0) {
|
||||
// Surface the captured text last 3KB on failure for debugging.
|
||||
const tail = captured.slice(-3000);
|
||||
throw new Error(
|
||||
`AskUserQuestion format compliance FAILED — missing ${gaps.length} mandated field(s):\n` +
|
||||
gaps.map(g => ` - ${g.field} (regex: ${g.re.source})`).join('\n') +
|
||||
`\n--- captured (last 3KB) ---\n${tail}`,
|
||||
);
|
||||
}
|
||||
|
||||
// Sanity: the parsed option list contains at least 2 options and
|
||||
// one of them carries the (recommended) marker.
|
||||
const opts = parseNumberedOptions(captured);
|
||||
expect(opts.length).toBeGreaterThanOrEqual(2);
|
||||
const hasRecommended = opts.some(o => /\(recommended\)/i.test(o.label));
|
||||
if (!hasRecommended) {
|
||||
// It's also acceptable for the (recommended) marker to live in
|
||||
// prose above the box (some renderers wrap labels). The text-level
|
||||
// RECOMMENDED_LBL check above already covers that case.
|
||||
// Surface a friendlier message if the box itself missed it.
|
||||
// (This is non-fatal because findFormatGaps already passed.)
|
||||
// eslint-disable-next-line no-console
|
||||
console.warn(
|
||||
'(recommended) label appears in prose but not on a parsed option label — acceptable but watch for drift',
|
||||
);
|
||||
}
|
||||
text = await captureModeSelectionAuq({ planDir: dir, testName: 'auq-format-gate', runId });
|
||||
} finally {
|
||||
await session.close();
|
||||
fs.rmSync(dir, { recursive: true, force: true });
|
||||
}
|
||||
|
||||
if (!text.trim()) {
|
||||
throw new Error('No AskUserQuestion captured — the skill never reached its mode-selection question.');
|
||||
}
|
||||
|
||||
// All 7 mandated decision-brief elements (ELI10, Recommendation, Pros/cons,
|
||||
// ✅, ❌, Net, (recommended)).
|
||||
const fmt = scoreAuqFormat(text);
|
||||
if (fmt.missing.length > 0) {
|
||||
throw new Error(
|
||||
`AskUserQuestion missing ${fmt.missing.length} mandated format element(s): ` +
|
||||
`${fmt.missing.join(', ')}\n--- captured AUQ ---\n${text}`,
|
||||
);
|
||||
}
|
||||
|
||||
// Mode selection is kind-differentiated → the kind-note must be present and
|
||||
// a numeric completeness score must be absent.
|
||||
expect(text).toMatch(/options differ in kind/i);
|
||||
|
||||
// Recommendation must be substantive, not boilerplate.
|
||||
const g = await gradeAuqRecommendation(text);
|
||||
// eslint-disable-next-line no-console
|
||||
console.log(
|
||||
`[auq-format-gate] format=${fmt.present}/${fmt.total} substance=${g.substance} ` +
|
||||
`recPresent=${g.present} literalBecause=${g.hadLiteralBecause}`,
|
||||
);
|
||||
expect(g.present).toBe(true);
|
||||
if (g.substance < 4) {
|
||||
throw new Error(
|
||||
`Recommendation substance ${g.substance} < 4 (boilerplate/weak):\n--- captured AUQ ---\n${text}`,
|
||||
);
|
||||
}
|
||||
},
|
||||
660_000,
|
||||
300_000,
|
||||
);
|
||||
});
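The rewritten gate hands the seven-element check to `scoreAuqFormat`, whose body is not part of this hunk. Below is a minimal sketch of what such a scorer might look like, assuming it reuses the seven predicates from the removed inline check and returns the `present` / `total` / `missing` fields the test reads. `FORMAT_ELEMENTS`, `scoreFormatSketch` and `sampleBrief` are hypothetical names for illustration, not the helper's real API.

```typescript
// Hypothetical sketch: NOT the real helper in test/helpers/auq-sdk-capture.ts.
interface FormatElement {
  field: string;
  re: RegExp;
}

// The seven predicates mirror the regexes of the old inline gate.
// None uses the `g` flag, so repeated .test() calls carry no state.
const FORMAT_ELEMENTS: FormatElement[] = [
  { field: 'ELI10:', re: /ELI10\s*:/i },
  { field: 'Recommendation:', re: /Recommendation\s*:/i },
  { field: 'Pros / cons:', re: /Pros\s*\/\s*cons\s*:/i },
  { field: '✅ pro bullet', re: /✅/ },
  { field: '❌ con bullet', re: /❌/ },
  { field: 'Net:', re: /^[\s|]*Net\s*:/im },
  { field: '(recommended) label', re: /\(recommended\)/i },
];

interface FormatScore {
  present: number;   // how many elements were found
  total: number;     // how many elements are mandated
  missing: string[]; // field names that were not found
}

function scoreFormatSketch(text: string): FormatScore {
  const missing = FORMAT_ELEMENTS.filter(e => !e.re.test(text)).map(e => e.field);
  return {
    present: FORMAT_ELEMENTS.length - missing.length,
    total: FORMAT_ELEMENTS.length,
    missing,
  };
}

// Invented sample text, only to exercise the scorer.
const sampleBrief = [
  'ELI10: pick how ambitious the review should be.',
  'Recommendation: Hold scope because the plan is already large.',
  'Pros / cons:',
  '✅ keeps the diff small',
  '❌ defers the bigger idea',
  'Net: ship the focused version first.',
  '1. Hold scope (recommended)',
  '2. Expand scope',
].join('\n');

const sampleScore = scoreFormatSketch(sampleBrief);
console.log(`format=${sampleScore.present}/${sampleScore.total}`);
```

Returning the missing field names, rather than a bare boolean, is what lets the gate's error message list exactly which elements the captured question lacked.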
|
||||
|
||||
@@ -0,0 +1,104 @@
|
||||
/**
 * AUQ consistency: same prompt, N runs, stable format + substance (periodic).
 *
 * The user's core anxiety: AUQ is fine one run and broken the next; sometimes
 * no ELI10, sometimes no recommendation, sometimes minimal context. A single
 * snapshot can't see drift. This drives the carved /plan-ceo-review mode-selection
 * AUQ N times via the SDK capture path (clean text, no TTY mangling) and asserts
 * the decision-brief format holds EVERY time and substance never craters.
 *
 * Pass bar:
 *   - Format: no element present in one run may be missing in another (that IS
 *     the inconsistency the user feels).
 *   - Substance: every run >= 3, spread (max-min) <= 2.
 *
 * Reports per-run scores so drift is visible even on a pass. Periodic tier
 * (N SDK runs, ~$0.50-1 each).
 */
|
||||
import { describe, test } from 'bun:test';
|
||||
import * as fs from 'node:fs';
|
||||
import {
|
||||
setupPlanCeoDir,
|
||||
captureModeSelectionAuq,
|
||||
AUQ_FORMAT_ELEMENTS,
|
||||
carvedSkill,
|
||||
} from './helpers/auq-sdk-capture';
|
||||
import { judgeRecommendation } from './helpers/llm-judge';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
const N_RUNS = Number(process.env.AUQ_CONSISTENCY_RUNS ?? '3');
|
||||
const runId = `auq-consistency-${process.env.EVALS_RUN_ID ?? 'local'}`;
|
||||
|
||||
describeE2E('AUQ consistency across runs (periodic)', () => {
|
||||
test(
|
||||
`carved /plan-ceo-review AUQ format + substance stable across ${N_RUNS} runs`,
|
||||
async () => {
|
||||
const runs: Array<{ i: number; present: Set<string>; substance: number; empty: boolean }> = [];
|
||||
|
||||
for (let i = 0; i < N_RUNS; i++) {
|
||||
const carved = carvedSkill();
|
||||
const dir = setupPlanCeoDir({
|
||||
skillMd: carved.skillMd,
|
||||
sectionsFrom: carved.sectionsFrom,
|
||||
tmpPrefix: `auq-consistency-${i}-`,
|
||||
});
|
||||
let text = '';
|
||||
try {
|
||||
text = await captureModeSelectionAuq({ planDir: dir, testName: `auq-consistency-${i}`, runId });
|
||||
} finally {
|
||||
fs.rmSync(dir, { recursive: true, force: true });
|
||||
}
|
||||
const present = new Set(AUQ_FORMAT_ELEMENTS.filter(e => e.re.test(text)).map(e => e.field));
|
||||
let substance = 0;
|
||||
if (text.trim()) {
|
||||
try {
|
||||
substance = (await judgeRecommendation(text)).reason_substance;
|
||||
} catch { /* judge unavailable */ }
|
||||
}
|
||||
runs.push({ i, present, substance, empty: !text.trim() });
|
||||
// eslint-disable-next-line no-console
|
||||
console.log(
|
||||
`[AUQ-consistency run ${i + 1}/${N_RUNS}] present=${present.size}/${AUQ_FORMAT_ELEMENTS.length} ` +
|
||||
`missing=[${AUQ_FORMAT_ELEMENTS.filter(e => !present.has(e.field)).map(e => e.field).join(',')}] ` +
|
||||
`substance=${substance}${runs[i]?.empty ? ' (EMPTY CAPTURE)' : ''}`,
|
||||
);
|
||||
}
|
||||
|
||||
const problems: string[] = [];
|
||||
|
||||
const anyEmpty = runs.filter(r => r.empty).map(r => r.i + 1);
|
||||
if (anyEmpty.length > 0) problems.push(`run(s) produced no AUQ at all: ${anyEmpty.join(',')}`);
|
||||
|
||||
// Inconsistency = an element present in SOME run but missing in another.
|
||||
const everPresent = new Set<string>();
|
||||
for (const r of runs) for (const f of r.present) everPresent.add(f);
|
||||
for (const f of everPresent) {
|
||||
const runsMissing = runs.filter(r => !r.present.has(f)).map(r => r.i + 1);
|
||||
if (runsMissing.length > 0) problems.push(`format element "${f}" missing in run(s) ${runsMissing.join(',')}`);
|
||||
}
|
||||
|
||||
const subs = runs.map(r => r.substance);
|
||||
const minSub = Math.min(...subs);
|
||||
const maxSub = Math.max(...subs);
|
||||
if (minSub < 3) problems.push(`a run cratered: min substance ${minSub} < 3`);
|
||||
if (maxSub - minSub > 2) problems.push(`substance unstable: spread ${maxSub - minSub} > 2 (${subs.join(',')})`);
|
||||
|
||||
if (problems.length > 0) {
|
||||
throw new Error(
|
||||
`AUQ inconsistency across ${N_RUNS} runs:\n` +
|
||||
problems.map(p => ` - ${p}`).join('\n') +
|
||||
`\nper-run: ` +
|
||||
runs.map(r => `[${r.i + 1}] fmt=${r.present.size}/${AUQ_FORMAT_ELEMENTS.length} sub=${r.substance}`).join(' '),
|
||||
);
|
||||
}
|
||||
|
||||
// eslint-disable-next-line no-console
|
||||
console.log(
|
||||
`[AUQ-consistency] STABLE across ${N_RUNS} runs: all ${AUQ_FORMAT_ELEMENTS.length} ` +
|
||||
`format elements every run; substance ${minSub}-${maxSub}`,
|
||||
);
|
||||
},
|
||||
N_RUNS * 300_000 + 60_000,
|
||||
);
|
||||
});
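The pass bar in the header comment can be restated as a pure function, which makes the rule easy to check against hand-written runs without paying for SDK captures. This is a sketch under assumptions: `RunResult`, `findConsistencyProblems` and the sample data are hypothetical names and values, not part of the test file.

```typescript
// Hypothetical restatement of the pass bar; names and sample data are illustrative.
interface RunResult {
  present: Set<string>; // format elements detected in this run
  substance: number;    // judge score for the recommendation
}

function findConsistencyProblems(runs: RunResult[]): string[] {
  const problems: string[] = [];

  // Format: anything seen in at least one run must appear in every run.
  const everPresent = new Set<string>();
  runs.forEach(r => r.present.forEach(f => everPresent.add(f)));
  everPresent.forEach(f => {
    const missingIn = runs
      .map((r, i) => (r.present.has(f) ? 0 : i + 1))
      .filter(n => n > 0);
    if (missingIn.length > 0) {
      problems.push(`"${f}" missing in run(s) ${missingIn.join(',')}`);
    }
  });

  // Substance: a floor of 3 and a max-min spread of at most 2.
  const subs = runs.map(r => r.substance);
  const minSub = Math.min(...subs);
  const maxSub = Math.max(...subs);
  if (minSub < 3) problems.push(`min substance ${minSub} < 3`);
  if (maxSub - minSub > 2) problems.push(`spread ${maxSub - minSub} > 2`);
  return problems;
}

const stableRuns: RunResult[] = [
  { present: new Set(['ELI10:', 'Net:']), substance: 4 },
  { present: new Set(['ELI10:', 'Net:']), substance: 5 },
];
const driftingRuns: RunResult[] = [
  { present: new Set(['ELI10:', 'Net:']), substance: 5 },
  { present: new Set(['Net:']), substance: 2 },
];

console.log(findConsistencyProblems(stableRuns));
console.log(findConsistencyProblems(driftingRuns));
```

One consequence of the rule as stated: an element that is absent from every run never enters `everPresent`, so the format check alone does not flag it. That matches the division of labor described in the A/B test's header, where absolute compliance is the gate test's job.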
|
||||
@@ -0,0 +1,170 @@
|
||||
/**
 * AUQ behavioral matrix: drive each AUQ-heavy skill to its first
 * AskUserQuestion and grade it to plan-ceo's bar (periodic, paid, SDK capture).
 *
 * Layer 0 (auq-format-always-loaded.test.ts) deterministically guarantees every
 * skill SHIPS the format spec in its always-loaded skeleton. This test proves
 * each skill's model OBEYS it: that the first real AUQ each skill fires is a
 * compliant decision brief (all 7 format elements) with a substantive
 * recommendation (>= 4). One parametrized case per skill so a single weak skill
 * is an isolated failure, not a blocker for the rest.
 *
 * Capture is the SDK $OUT_FILE path (clean text, no TTY mangling), with the skill
 * pinned to an absolute path and the agent restricted to Read/Write so it can't
 * wander to the global install. See test/helpers/auq-sdk-capture.ts.
 *
 * Scope: skills whose first AUQ is reliably reachable from a text fixture. Skills
 * that gate their first decision on external resources (a running browser for
 * /qa, the design binary + comparison boards for /design-shotgun and
 * /design-html, which by project policy use $D compare, not AUQ, for variant
 * choices) are intentionally OUT of this matrix; Layer 0 covers their format
 * spec, and a fixture can't fairly trigger their AUQ.
 *
 * Run a subset in the foreground with AUQ_MATRIX_ONLY="plan-eng-review,cso".
 */
|
||||
import { describe, test } from 'bun:test';
|
||||
import * as fs from 'node:fs';
|
||||
import {
|
||||
setupSkillDir,
|
||||
captureFirstAuq,
|
||||
scoreAuqFormat,
|
||||
skillFromWorktree,
|
||||
gradeAuqRecommendation,
|
||||
} from './helpers/auq-sdk-capture';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
const runId = `auq-matrix-${process.env.EVALS_RUN_ID ?? 'local'}`;
|
||||
const ONLY = (process.env.AUQ_MATRIX_ONLY ?? '').split(',').map(s => s.trim()).filter(Boolean);
|
||||
|
||||
const FLAWED_PLAN = `# Plan: Launch a "developer-friendly" pricing tier
|
||||
|
||||
## Goal
|
||||
Increase developer adoption.
|
||||
|
||||
## Success metric
|
||||
More signups.
|
||||
|
||||
## Premise
|
||||
We haven't talked to any developers about whether price is the barrier. The team
|
||||
agreed it "feels like" it should be cheaper. We'll add a new Stripe tier, a React
|
||||
pricing page, a Postgres entitlements table, and a Redis cache — no tests
|
||||
mentioned, no rollout plan, no auth check on the upgrade endpoint.
|
||||
`;
|
||||
|
||||
const VULN_CODE = `export function login(req, res) {
|
||||
// builds SQL by string concat; sets a session cookie with no flags
|
||||
const user = db.query("SELECT * FROM users WHERE name = '" + req.body.name + "'");
|
||||
if (user && user.password === req.body.password) {
|
||||
res.cookie('session', user.id); // no HttpOnly, Secure, SameSite, or expiry
|
||||
return res.json({ ok: true });
|
||||
}
|
||||
return res.status(401).json({ ok: false });
|
||||
}
|
||||
`;
|
||||
|
||||
interface MatrixSkill {
|
||||
skill: string;
|
||||
fixtures: Record<string, string>;
|
||||
scenario: string;
|
||||
}
|
||||
|
||||
const MATRIX: MatrixSkill[] = [
|
||||
{
|
||||
skill: 'plan-eng-review',
|
||||
fixtures: { 'plan.md': FLAWED_PLAN },
|
||||
scenario: 'Read plan.md — that is the plan to review. It is a standalone plan document, not a codebase. Walk the review until the first AskUserQuestion (a per-issue finding or a scope decision).',
|
||||
},
|
||||
{
|
||||
skill: 'plan-design-review',
|
||||
fixtures: { 'plan.md': FLAWED_PLAN + '\n## UI\nA new pricing page with a comparison table, plan cards, and an upgrade modal.\n' },
|
||||
scenario: 'Read plan.md — that is the plan to review (it has UI scope). Walk the review until the first AskUserQuestion.',
|
||||
},
|
||||
{
|
||||
skill: 'plan-devex-review',
|
||||
fixtures: { 'plan.md': FLAWED_PLAN + '\n## CLI\nShip a `mytool pricing` command and a setup wizard for the new tier.\n' },
|
||||
scenario: 'Read plan.md — that is the plan to review (developer-experience scope). Walk the review until the first AskUserQuestion.',
|
||||
},
|
||||
{
|
||||
skill: 'office-hours',
|
||||
fixtures: {},
|
||||
scenario: 'The founder says: "I am building an AI tool that auto-writes unit tests for any repo. I think it is a great idea but I have zero users. Should I build it, and how do I get my first users?" Run the office-hours diagnostic until the first AskUserQuestion.',
|
||||
},
|
||||
{
|
||||
skill: 'cso',
|
||||
fixtures: { 'server/auth.js': VULN_CODE },
|
||||
scenario: 'Audit the code in this repo (server/auth.js) for security issues. Walk the audit until the first AskUserQuestion (scope/stack confirmation or first finding).',
|
||||
},
|
||||
{
|
||||
skill: 'spec',
|
||||
fixtures: {},
|
||||
scenario: 'Turn this vague intent into a precise spec: "add email notifications when a task is assigned to someone." Walk the spec workflow until the first AskUserQuestion.',
|
||||
},
|
||||
{
|
||||
skill: 'design-consultation',
|
||||
fixtures: { 'product.md': '# Product\nA terminal-first task manager for developers. Audience: senior engineers. Stage: pre-launch.\n' },
|
||||
scenario: 'Read product.md. Run the design consultation for this product until the first AskUserQuestion.',
|
||||
},
|
||||
];
|
||||
|
||||
const selected = ONLY.length ? MATRIX.filter(m => ONLY.includes(m.skill)) : MATRIX;
|
||||
|
||||
describeE2E('AUQ behavioral matrix (periodic)', () => {
|
||||
for (const m of selected) {
|
||||
test(
|
||||
`${m.skill}: first AUQ is a compliant decision brief (7/7 format, substance >=4)`,
|
||||
async () => {
|
||||
const wt = skillFromWorktree(m.skill);
|
||||
const dir = setupSkillDir({
|
||||
skillName: m.skill,
|
||||
skillMd: wt.skillMd,
|
||||
sectionsFrom: wt.sectionsFrom,
|
||||
fixtures: m.fixtures,
|
||||
tmpPrefix: `auq-matrix-${m.skill}-`,
|
||||
});
|
||||
let text = '';
|
||||
try {
|
||||
text = await captureFirstAuq({
|
||||
planDir: dir,
|
||||
skillName: m.skill,
|
||||
scenario: m.scenario,
|
||||
testName: `auq-matrix-${m.skill}`,
|
||||
runId,
|
||||
});
|
||||
} finally {
|
||||
fs.rmSync(dir, { recursive: true, force: true });
|
||||
}
|
||||
|
||||
const fmt = scoreAuqFormat(text);
|
||||
let substance = 0;
|
||||
let recPresent = false;
|
||||
let hadBecause = false;
|
||||
if (text.trim()) {
|
||||
const g = await gradeAuqRecommendation(text);
|
||||
substance = g.substance;
|
||||
recPresent = g.present;
|
||||
hadBecause = g.hadLiteralBecause;
|
||||
}
|
||||
// eslint-disable-next-line no-console
|
||||
console.log(
|
||||
`[AUQ-matrix ${m.skill}] captured=${text.length}B format=${fmt.present}/${fmt.total} ` +
|
||||
`missing=[${fmt.missing.join(',')}] recPresent=${recPresent} substance=${substance} ` +
|
||||
`literalBecause=${hadBecause}`,
|
||||
);
|
||||
|
||||
if (!text.trim()) {
|
||||
throw new Error(`${m.skill}: agent produced NO AUQ capture (never reached a question in budget).`);
|
||||
}
|
||||
const problems: string[] = [];
|
||||
if (fmt.missing.length > 0) problems.push(`missing format element(s): ${fmt.missing.join(', ')}`);
|
||||
if (substance < 4) problems.push(`recommendation substance ${substance} < 4 (boilerplate/weak)`);
|
||||
if (problems.length > 0) {
|
||||
throw new Error(
|
||||
`${m.skill} AUQ not at plan-ceo bar:\n - ${problems.join('\n - ')}\n--- captured AUQ ---\n${text}`,
|
||||
);
|
||||
}
|
||||
},
|
||||
300_000,
|
||||
);
|
||||
}
|
||||
});
|
||||
@@ -0,0 +1,114 @@
|
||||
/**
 * AUQ no-degradation A/B: verbose (full-token) vs carved (slimmed). Periodic,
 * paid, SDK capture.
 *
 * The keystone empirical proof behind the token-reduction work: carving
 * /plan-ceo-review into an 80KB skeleton + on-demand section did NOT degrade the
 * AskUserQuestion it shows the user. Layer 0 (auq-format-always-loaded.test.ts)
 * proves the format SPEC is present in both skeletons deterministically; this
 * proves the model still GENERATES an equal-quality question with the smaller
 * context.
 *
 * Method: identical prompt, two SKILL.md versions, compare:
 *   - CARVED  : this branch's plan-ceo-review/SKILL.md (80KB skeleton) + sections.
 *   - VERBOSE : the pre-carve monolith (137KB) read from git (ab66193e^).
 * Both are driven to Step 0F mode selection via the SDK $OUT_FILE capture path
 * (clean text, no TTY mangling). We score the 7 decision-brief format elements
 * and grade recommendation substance, then assert the carved version is NOT
 * WORSE than verbose. Relative parity is the bar (absolute compliance is the
 * format-compliance gate test's job).
 *
 * Expectation: carved >= verbose. At the mode-selection AUQ the carved skeleton
 * carries the same {{PREAMBLE}} format spec + Step 0 prose as verbose, with
 * strictly less unrelated review-section text in context.
 */
|
||||
import { describe, test } from 'bun:test';
|
||||
import * as fs from 'node:fs';
|
||||
import {
|
||||
setupPlanCeoDir,
|
||||
captureModeSelectionAuq,
|
||||
scoreAuqFormat,
|
||||
carvedSkill,
|
||||
verboseSkill,
|
||||
} from './helpers/auq-sdk-capture';
|
||||
import { judgeRecommendation } from './helpers/llm-judge';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
|
||||
const describeE2E = shouldRun ? describe : describe.skip;
|
||||
const runId = `auq-ab-${process.env.EVALS_RUN_ID ?? 'local'}`;
|
||||
|
||||
async function grade(label: string, dir: string) {
|
||||
const text = await captureModeSelectionAuq({ planDir: dir, testName: `auq-ab-${label}`, runId });
|
||||
const fmt = scoreAuqFormat(text);
|
||||
let substance = 0;
|
||||
let present = false;
|
||||
if (text.trim()) {
|
||||
try {
|
||||
const r = await judgeRecommendation(text);
|
||||
substance = r.reason_substance;
|
||||
present = r.present;
|
||||
} catch { /* judge unavailable */ }
|
||||
}
|
||||
// eslint-disable-next-line no-console
|
||||
console.log(
|
||||
`[AUQ-AB ${label}] captured=${text.length}B format=${fmt.present}/${fmt.total} ` +
|
||||
`missing=[${fmt.missing.join(',')}] recPresent=${present} substance=${substance}`,
|
||||
);
|
||||
return { text, fmt, substance };
|
||||
}
|
||||
|
||||
describeE2E('AUQ no-degradation: verbose vs carved (periodic)', () => {
|
||||
test(
|
||||
'carved plan-ceo-review AUQ is not worse than verbose on the same prompt',
|
||||
async () => {
|
||||
const carved = carvedSkill();
|
||||
const carvedDir = setupPlanCeoDir({
|
||||
skillMd: carved.skillMd,
|
||||
sectionsFrom: carved.sectionsFrom,
|
||||
tmpPrefix: 'auq-ab-carved-',
|
||||
});
|
||||
const verboseDir = setupPlanCeoDir({
|
||||
skillMd: verboseSkill(),
|
||||
tmpPrefix: 'auq-ab-verbose-',
|
||||
});
|
||||
|
||||
let c, v;
|
||||
try {
|
||||
c = await grade('CARVED', carvedDir);
|
||||
v = await grade('VERBOSE', verboseDir);
|
||||
} finally {
|
||||
fs.rmSync(carvedDir, { recursive: true, force: true });
|
||||
fs.rmSync(verboseDir, { recursive: true, force: true });
|
||||
}
|
||||
|
||||
const summary = [
|
||||
`CARVED : format ${c.fmt.present}/${c.fmt.total}, substance ${c.substance}`,
|
||||
`VERBOSE: format ${v.fmt.present}/${v.fmt.total}, substance ${v.substance}`,
|
||||
].join('\n');
|
||||
|
||||
// Both must have actually produced a question, else the comparison is
|
||||
// vacuous — fail loud with the captures.
|
||||
if (!c.text.trim() || !v.text.trim()) {
|
||||
throw new Error(
|
||||
`A/B inconclusive — a side produced no AUQ capture:\n${summary}\n` +
|
||||
`--- carved ---\n${c.text.slice(0, 2000)}\n--- verbose ---\n${v.text.slice(0, 2000)}`,
|
||||
);
|
||||
}
|
||||
|
||||
const formatRegressed = c.fmt.present < v.fmt.present;
|
||||
const substanceRegressed = c.substance < v.substance - 1; // 1-pt judge tolerance
|
||||
if (formatRegressed || substanceRegressed) {
|
||||
throw new Error(
|
||||
`AUQ DEGRADATION carving plan-ceo-review:\n${summary}` +
|
||||
(formatRegressed ? `\n -> carved dropped: [${c.fmt.missing.join(',')}]` : '') +
|
||||
(substanceRegressed ? `\n -> carved substance regressed >1 pt` : '') +
|
||||
`\n--- carved AUQ ---\n${c.text}\n--- verbose AUQ ---\n${v.text}`,
|
||||
);
|
||||
}
|
||||
|
||||
// eslint-disable-next-line no-console
|
||||
console.log('[AUQ-AB] NO DEGRADATION:\n' + summary);
|
||||
},
|
||||
600_000,
|
||||
);
|
||||
});
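The two regression checks in the test body reduce to a small comparison with a one-point judge tolerance on substance. The sketch below isolates that rule so the boundary cases can be read off directly. `SideScore`, `carvedRegressed` and the sample scores are hypothetical, introduced only for illustration.

```typescript
// Hypothetical helper mirroring the two regression checks in the test body.
interface SideScore {
  formatPresent: number; // how many format elements were found
  substance: number;     // judge score for the recommendation
}

function carvedRegressed(carved: SideScore, verbose: SideScore, judgeTolerance = 1) {
  // Format has no tolerance: carved may not find fewer elements than verbose.
  const formatRegressed = carved.formatPresent < verbose.formatPresent;
  // Substance allows carved to trail verbose by up to `judgeTolerance` points.
  const substanceRegressed = carved.substance < verbose.substance - judgeTolerance;
  return { formatRegressed, substanceRegressed, failed: formatRegressed || substanceRegressed };
}

// Invented scores covering the three interesting cases.
const withinTolerance = carvedRegressed(
  { formatPresent: 7, substance: 4 },
  { formatPresent: 7, substance: 5 },
);
const droppedElement = carvedRegressed(
  { formatPresent: 6, substance: 5 },
  { formatPresent: 7, substance: 5 },
);
const weakerReasoning = carvedRegressed(
  { formatPresent: 7, substance: 3 },
  { formatPresent: 7, substance: 5 },
);

console.log(withinTolerance.failed, droppedElement.failed, weakerReasoning.failed);
```

Because `grade` leaves `substance` at 0 when the judge call throws, a judge outage on the verbose side makes the substance comparison pass trivially, while an outage on the carved side can look like a regression. The per-side log lines are the place to spot that.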
|
||||
@@ -326,6 +326,7 @@ describeIfSelected('Plan Design Review E2E', ['plan-design-review-plan-mode', 'p
|
||||
path.join(ROOT, 'plan-design-review', 'SKILL.md'),
|
||||
path.join(dir, 'plan-design-review', 'SKILL.md'),
|
||||
);
|
||||
{ const _sec = path.join(ROOT, 'plan-design-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(dir, 'plan-design-review', 'sections'), { recursive: true }); }
|
||||
|
||||
return dir;
|
||||
}
|
||||
|
||||
@@ -104,6 +104,13 @@ describeIfSelected(
|
||||
);
|
||||
const skillPath = join(ROOT, 'office-hours', 'SKILL.md');
|
||||
const originalSkill = readFileSync(skillPath, 'utf-8');
|
||||
// office-hours is carved (v2 plan T9): GBRAIN_SAVE_RESULTS moved into
|
||||
// sections/design-and-handoff.md. Regen rewrites BOTH the skeleton and the
|
||||
// section, so we snapshot + restore + ship both, and check the UNION for
|
||||
// the gbrain put block.
|
||||
const sectionPath = join(ROOT, 'office-hours', 'sections', 'design-and-handoff.md');
|
||||
const hasSection = existsSync(sectionPath);
|
||||
const originalSection = hasSection ? readFileSync(sectionPath, 'utf-8') : null;
|
||||
try {
|
||||
execFileSync(
|
||||
'bun',
|
||||
@@ -122,17 +129,23 @@ describeIfSelected(
|
||||
},
|
||||
);
|
||||
const brainAwareSkill = readFileSync(skillPath, 'utf-8');
|
||||
if (!brainAwareSkill.includes('gbrain put "office-hours/')) {
|
||||
const brainAwareSection = hasSection ? readFileSync(sectionPath, 'utf-8') : '';
|
||||
if (!(brainAwareSkill + brainAwareSection).includes('gbrain put "office-hours/')) {
|
||||
throw new Error(
|
||||
'Regenerated office-hours/SKILL.md does not contain gbrain put block. ' +
|
||||
'Regenerated office-hours skeleton+section does not contain gbrain put block. ' +
|
||||
'Detection override may be broken — see test/gbrain-detection-override.test.ts.',
|
||||
);
|
||||
}
|
||||
mkdirSync(join(workDir, 'office-hours'), { recursive: true });
|
||||
writeFileSync(join(workDir, 'office-hours', 'SKILL.md'), brainAwareSkill);
|
||||
if (hasSection) {
|
||||
mkdirSync(join(workDir, 'office-hours', 'sections'), { recursive: true });
|
||||
writeFileSync(join(workDir, 'office-hours', 'sections', 'design-and-handoff.md'), brainAwareSection);
|
||||
}
|
||||
} finally {
|
||||
// Always restore the canonical SKILL.md so the working tree stays clean.
|
||||
// Always restore the canonical skeleton + section so the working tree stays clean.
|
||||
writeFileSync(skillPath, originalSkill);
|
||||
if (hasSection && originalSection !== null) writeFileSync(sectionPath, originalSection);
|
||||
rmSync(tmpHome, { recursive: true, force: true });
|
||||
}
|
||||
|
||||
|
||||
@@ -53,6 +53,7 @@ describeIfSelected('Office Hours Forcing Energy E2E', ['office-hours-forcing-ene
|
||||
path.join(ROOT, 'office-hours', 'SKILL.md'),
|
||||
path.join(workDir, 'office-hours', 'SKILL.md'),
|
||||
);
|
||||
{ const _sec = path.join(ROOT, 'office-hours', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(workDir, 'office-hours', 'sections'), { recursive: true }); }
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
@@ -124,6 +125,7 @@ describeIfSelected('Office Hours Builder Wildness E2E', ['office-hours-builder-w
|
||||
path.join(ROOT, 'office-hours', 'SKILL.md'),
|
||||
path.join(workDir, 'office-hours', 'SKILL.md'),
|
||||
);
|
||||
{ const _sec = path.join(ROOT, 'office-hours', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(workDir, 'office-hours', 'sections'), { recursive: true }); }
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
|
||||
@@ -0,0 +1,92 @@
|
||||
/**
 * /plan-ceo-review section-loading E2E (periodic, paid, SDK capture): v2 plan
 * Phase B carve backstop. The per-PR guard is the free static test
 * skill-ceo-section-ordering.test.ts; THIS is the behavioral proof that a real
 * agent actually Reads the carved section instead of working from memory.
 *
 * Detection is LOSSLESS. Earlier this test drove a real PTY and scraped the ANSI
 * screen buffer for the `sections/<file>.md` path. That silently saw nothing in a
 * Conductor PTY: cursor-positioned tool renders and an unanswered Step 0 question
 * loop both defeat the regex, so it reported `read: []` even when the agent did the
 * work. It now runs the skill through `claude -p` (the SDK path the AUQ matrix
 * uses) and detects section reads from the tool-use stream (`Read` calls whose
 * file_path contains `sections/review-sections.md`). No rendering layer to mangle.
 *
 * Hermetic, not install-mutating: the freshly-generated worktree skeleton +
 * sections are copied into a throwaway fixture dir and the absolute path is pinned,
 * so the test validates THIS branch's carve without touching the user's active
 * ~/.claude install. (Install-layout linking is covered separately by
 * setup-sections-linking.test.ts.)
 *
 * The agent is told AskUserQuestion is unavailable, so it auto-picks the
 * recommended option through Step 0 and reaches the post-Step-0 STOP-Read. HOLD
 * SCOPE is the simplest mode that still requires the full review section. Cost:
 * ~$1-2/run. Periodic tier.
 */
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import {
|
||||
setupSkillDir,
|
||||
skillFromWorktree,
|
||||
captureSectionReads,
|
||||
} from './helpers/auq-sdk-capture';
|
||||
|
||||
const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
|
||||
+const describeE2E = shouldRun ? describe : describe.skip;
+
+const runId = `plan-ceo-section-loading-${process.env.EVALS_RUN_ID ?? 'local'}`;
+
+// Sections every plan-ceo-review run must consult after Step 0.
+const REQUIRED_SECTIONS = ['review-sections.md'];
+
+const PLAN_MD = [
+  '# Plan: add an in-memory cache layer',
+  '',
+  '## Context',
+  'Reads hit the DB on every request. Add a process-local LRU cache in front of',
+  'the read path to cut DB load.',
+  '',
+  '## Approach',
+  '- Wrap the read repository in a cache that stores the last 1000 keys.',
+  '- Invalidate on write.',
+  '',
+  '## Out of scope',
+  'Distributed cache, cross-process coherence.',
+  '',
+].join('\n');
+
+describeE2E('/plan-ceo-review section-loading E2E (periodic, SDK capture)', () => {
+  test(
+    'a real review Reads the carved section before producing the report',
+    async () => {
+      const { skillMd, sectionsFrom } = skillFromWorktree('plan-ceo-review');
+      const planDir = setupSkillDir({
+        skillName: 'plan-ceo-review',
+        skillMd,
+        sectionsFrom,
+        fixtures: { 'PLAN.md': PLAN_MD },
+        tmpPrefix: 'gstack-ceo-secload-',
+      });
+
+      const { readSections, reportProduced, output } = await captureSectionReads({
+        planDir,
+        skillName: 'plan-ceo-review',
+        scenario:
+          'Review the plan in PLAN.md. Hold the current scope (HOLD SCOPE mode) — do not challenge or expand scope. Run the full CEO review and produce the review report.',
+        requiredSections: REQUIRED_SECTIONS,
+        reportMarker: /GSTACK REVIEW REPORT|COMPLETION SUMMARY|review/i,
+        testName: '/plan-ceo-review section-loading',
+        runId,
+      });
+
+      const missing = REQUIRED_SECTIONS.filter(s => !readSections.has(s));
+      expect({ reportProduced, read: [...readSections], missing }).toEqual({
+        reportProduced: true,
+        read: expect.any(Array),
+        missing: [],
+      });
+      // Guard against an empty pass: the report must have real content.
+      expect(output.trim().length).toBeGreaterThan(200);
+    },
+    360_000,
+  );
+});
@@ -61,6 +61,8 @@ We're building a new user dashboard that shows recent activity, notifications, a
       path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
       path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
     );
+    // Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
+    { const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -145,6 +147,8 @@ We're building a new user dashboard that shows recent activity, notifications, a
       path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
       path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
     );
+    // Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
+    { const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -213,6 +217,8 @@ describeIfSelected('Plan CEO Review Expansion Energy E2E', ['plan-ceo-review-exp
       path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
       path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
     );
+    // Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
+    { const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -319,6 +325,8 @@ Replace session-cookie auth with JWT tokens. Currently using express-session + R
       path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
       path.join(planDir, 'plan-eng-review', 'SKILL.md'),
     );
+    // Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
+    { const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -415,6 +423,8 @@ export function main() { return Dashboard(); }
       path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
       path.join(planDir, 'plan-eng-review', 'SKILL.md'),
     );
+    // Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
+    { const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
 
     // Set up remote-slug shim and browse shims (plan-eng-review uses remote-slug for artifact path)
     setupBrowseShims(planDir);
@@ -520,6 +530,7 @@ describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'],
       path.join(ROOT, 'office-hours', 'SKILL.md'),
       path.join(ohDir, 'office-hours', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'office-hours', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(ohDir, 'office-hours', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -580,6 +591,7 @@ describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefi
       path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
       path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(benefitsDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -663,6 +675,8 @@ We're building a real-time notification system for our SaaS app.
       path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
       path.join(planDir, 'plan-eng-review', 'SKILL.md'),
     );
+    // Carved skills (v2 plan T9): copy sections/ so the review workflow + report template are present.
+    { const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -760,6 +774,10 @@ describeIfSelected('Codex Offering E2E', [
         path.join(ROOT, skill, 'SKILL.md'),
         path.join(testDir, skill, 'SKILL.md'),
       );
+      // Carved skills (v2 plan T9): copy sections/ so codex/outside-voice content
+      // (carved into review-sections.md) is present for the search.
+      const _sec = path.join(ROOT, skill, 'sections');
+      if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(testDir, skill, 'sections'), { recursive: true });
     }
   });
 

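The hunks above repeat one guarded copy: if a skill has a `sections/` directory, copy it next to the skeleton in the fixture. A minimal sketch of that pattern as a reusable helper follows. The helper name `copySectionsIfPresent` and the throwaway directory layout are assumptions for illustration; the diff itself inlines the copy in each `beforeAll`.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical helper (not in the diff): copies <root>/<skill>/sections into
// <dest>/<skill>/sections when the skill is carved; a no-op for uncarved skills.
function copySectionsIfPresent(root: string, dest: string, skill: string): boolean {
  const src = path.join(root, skill, 'sections');
  if (!fs.existsSync(src)) return false;
  // Make sure the skill directory exists before copying into it.
  fs.mkdirSync(path.join(dest, skill), { recursive: true });
  fs.cpSync(src, path.join(dest, skill, 'sections'), { recursive: true });
  return true;
}

// Usage against throwaway directories: one carved skill, one uncarved skill.
const srcRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'carve-src-'));
const destRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'carve-dst-'));
fs.mkdirSync(path.join(srcRoot, 'carved', 'sections'), { recursive: true });
fs.writeFileSync(path.join(srcRoot, 'carved', 'sections', 'review-sections.md'), '# review\n');
fs.mkdirSync(path.join(srcRoot, 'plain'), { recursive: true });

const copiedCarved = copySectionsIfPresent(srcRoot, destRoot, 'carved');
const copiedPlain = copySectionsIfPresent(srcRoot, destRoot, 'plain');
```

Returning a boolean is a design choice of this sketch only; it lets a caller distinguish carved from uncarved skills without a second `existsSync` call.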
@@ -1,120 +1,83 @@
 /**
- * /ship section-loading E2E (periodic, paid, real-PTY) — v2 plan T9 mitigation
- * layer 5, the ONLY CI-failing guard against silent section-skip.
+ * /ship section-loading E2E (periodic, paid, SDK capture) — v2 plan T9 mitigation
+ * layer 5: the behavioral guard that a real agent Reads the carved sections a
+ * version-changing ship requires instead of working from the skeleton's memory.
  *
- * After the carve, ship is a skeleton whose STOP-Read directives point at
- * sections/*.md. This test runs the REAL /ship skill in plan mode against a
- * fresh version-changing fixture and asserts the agent actually Read the
- * sections its situation requires (review-army + changelog at minimum — every
- * version-changing ship needs the pre-landing review and a CHANGELOG entry).
+ * Detection is LOSSLESS. Earlier this test drove a real PTY and scraped the ANSI
+ * screen buffer for `sections/<file>.md` paths, which silently saw nothing in a
+ * Conductor PTY (cursor-positioned tool renders + an unanswered question loop
+ * defeat the regex — it reported `read: []` even when the agent did the work). It
+ * now runs the skill through `claude -p` (the SDK path the AUQ matrix uses) and
+ * detects section reads from the tool-use stream (`Read` calls whose file_path
+ * contains `sections/review-army.md` / `sections/changelog.md`).
  *
- * Runs against the INSTALLED skill at ~/.claude/skills/gstack/ship (Codex
- * outside-voice #5: an E2E that reads repo paths would miss install-layout
- * 404s). Section reads are detected from the PTY scrollback — when the agent
- * Reads a section the tool render shows the `sections/<file>.md` path.
+ * Hermetic, not install-mutating: the freshly-generated worktree skeleton +
+ * sections are copied into a throwaway fixture dir and the absolute path is pinned,
+ * so the test validates the current carve without touching the user's active
+ * ~/.claude install. (Install-layout linking is covered by
+ * setup-sections-linking.test.ts.)
  *
- * Plan-mode framing keeps the agent from committing/pushing; producing a plan
- * is the terminal signal. Cost: ~$2-4/run. Periodic tier.
- *
- * Situation matrix (T1 = B): this file covers the fresh version-changing ship;
- * the already-bumped re-run is covered by skill-e2e-ship-idempotency.test.ts,
- * and a no-plan-file variant can be added to FIXTURES below.
+ * The agent is told AskUserQuestion is unavailable and is given the version-changing
+ * situation explicitly (no Bash, so it can't and needn't probe git), so it follows
+ * the skeleton's STOP-Read directives for that situation. Cost: ~$1-2/run.
+ * Periodic tier.
  */
 
 import { describe, test, expect } from 'bun:test';
-import { spawnSync } from 'child_process';
-import * as fs from 'fs';
-import * as path from 'path';
-import * as os from 'os';
 import {
-  launchClaudePty,
-  isPermissionDialogVisible,
-  isNumberedOptionListVisible,
-} from './helpers/claude-pty-runner';
+  setupSkillDir,
+  skillFromWorktree,
+  captureSectionReads,
+} from './helpers/auq-sdk-capture';
 
 const shouldRun = !!process.env.EVALS && process.env.EVALS_TIER === 'periodic';
 const describeE2E = shouldRun ? describe : describe.skip;
 
-/** Fresh fixture: feature branch with a real change but VERSION still == base,
- * so /ship must bump (FRESH) and walk the full pre-landing + changelog flow. */
-function buildFreshFixture(): { workTree: string; root: string } {
-  const root = fs.mkdtempSync(path.join(os.tmpdir(), 'gstack-ship-secload-'));
-  const workTree = path.join(root, 'workspace');
-  const bareRemote = path.join(root, 'origin.git');
-  fs.mkdirSync(workTree, { recursive: true });
-  const sh = (cmd: string, cwd: string): void => {
-    const r = spawnSync('bash', ['-c', cmd], { cwd, stdio: 'pipe', timeout: 15_000 });
-    if (r.status !== 0) throw new Error(`fixture setup failed at "${cmd}":\n${r.stderr?.toString()}`);
-  };
-  sh(`git init --bare "${bareRemote}"`, root);
-  sh('git init -b main', workTree);
-  sh('git config user.email "t@t.com" && git config user.name "T" && git config commit.gpgsign false', workTree);
-  fs.writeFileSync(path.join(workTree, 'VERSION'), '0.0.1\n');
-  fs.writeFileSync(path.join(workTree, 'package.json'), JSON.stringify({ name: 'fx', version: '0.0.1', private: true }, null, 2) + '\n');
-  fs.writeFileSync(path.join(workTree, 'CHANGELOG.md'), '# Changelog\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n');
-  fs.writeFileSync(path.join(workTree, 'app.js'), '// base\n');
-  sh('git add -A && git commit -m "chore: initial v0.0.1"', workTree);
-  sh(`git remote add origin "${bareRemote}" && git push -u origin main`, workTree);
-  // Feature branch: a real code change, VERSION untouched → FRESH (needs a bump).
-  sh('git checkout -b feat/new-thing', workTree);
-  fs.writeFileSync(path.join(workTree, 'app.js'), '// base\nexport function newThing() { return 42; }\n');
-  fs.writeFileSync(path.join(workTree, 'app.test.js'), 'test("newThing", () => {});\n');
-  sh('git add -A && git commit -m "feat: add newThing"', workTree);
-  sh('git push -u origin feat/new-thing', workTree);
-  return { workTree, root };
-}
+const runId = `ship-section-loading-${process.env.EVALS_RUN_ID ?? 'local'}`;
 
 // Sections every version-changing ship must consult.
 const REQUIRED_SECTIONS = ['review-army.md', 'changelog.md'];
 
-describeE2E('/ship section-loading E2E (periodic, real-PTY, installed skill)', () => {
+const FIXTURES: Record<string, string> = {
+  VERSION: '0.0.1\n',
+  'package.json': JSON.stringify({ name: 'fx', version: '0.0.1', private: true }, null, 2) + '\n',
+  'CHANGELOG.md': '# Changelog\n\n## [0.0.1] - 2026-01-01\n\n- Initial release\n',
+  'app.js': '// base\nexport function newThing() { return 42; }\n',
+  'app.test.js': 'test("newThing", () => {});\n',
+};
+
+describeE2E('/ship section-loading E2E (periodic, SDK capture)', () => {
   test(
     'fresh version-changing ship Reads the required sections',
     async () => {
-      const { workTree, root } = buildFreshFixture();
-      const session = await launchClaudePty({
-        permissionMode: 'plan',
-        cwd: workTree,
-        timeoutMs: 720_000,
-        env: { GH_TOKEN: 'mock-not-real', NO_COLOR: '1' },
+      const { skillMd, sectionsFrom } = skillFromWorktree('ship');
+      const planDir = setupSkillDir({
+        skillName: 'ship',
+        skillMd,
+        sectionsFrom,
+        fixtures: FIXTURES,
+        tmpPrefix: 'gstack-ship-secload-',
       });
 
-      const readSections = new Set<string>();
-      let planReady = false;
-      try {
-        await Bun.sleep(8000);
-        const since = session.mark();
-        session.send('/ship\r');
-        const start = Date.now();
-        let lastPermSig = '';
-        while (Date.now() - start < 600_000) {
-          await Bun.sleep(3000);
-          if (session.exited()) break;
-          const visible = session.visibleSince(since);
-          const tail = visible.slice(-1500);
-          if (isNumberedOptionListVisible(tail) && isPermissionDialogVisible(tail)) {
-            const sig = visible.slice(-500);
-            if (sig !== lastPermSig) { lastPermSig = sig; session.send('1\r'); await Bun.sleep(1500); continue; }
-          }
-          // Detect section reads from the scrollback (tool render shows the path).
-          for (const m of visible.matchAll(/sections\/([A-Za-z0-9._-]+\.md)/g)) readSections.add(m[1]);
-          if (/ready to execute|Would you like to proceed|GSTACK REVIEW REPORT/i.test(visible)) {
-            planReady = true;
-            break;
-          }
-        }
-      } finally {
-        await session.close();
-        try { fs.rmSync(root, { recursive: true, force: true }); } catch { /* ignore */ }
-      }
+      const { readSections, reportProduced, output } = await captureSectionReads({
+        planDir,
+        skillName: 'ship',
+        scenario:
+          'This is a FRESH version-changing ship: the branch has a real code change (app.js gained a new function with a test), VERSION still equals the base version (0.0.1, so it needs a bump), and CHANGELOG.md needs a new entry. Follow the skill\'s flow for a version-changing ship: run the pre-landing review and prepare the CHANGELOG entry. Produce the ship plan / review report. Do NOT actually commit, push, or open a PR.',
+        requiredSections: REQUIRED_SECTIONS,
+        reportMarker: /version|changelog|review|ship/i,
+        testName: '/ship section-loading',
+        runId,
+      });
 
       const missing = REQUIRED_SECTIONS.filter(s => !readSections.has(s));
-      expect({ planReady, read: [...readSections], missing }).toEqual({
-        planReady: true,
+      expect({ reportProduced, read: [...readSections], missing }).toEqual({
+        reportProduced: true,
         read: expect.any(Array),
         missing: [],
       });
+      // Guard against an empty pass: the report must have real content.
+      expect(output.trim().length).toBeGreaterThan(200);
     },
-    900_000,
+    360_000,
   );
 });

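The docblock above says section reads are now detected from the tool-use stream, as `Read` calls whose `file_path` contains a `sections/<file>.md` path. A minimal sketch of that detection follows, under stated assumptions: the `ToolUseEvent` shape and the `sectionsRead` name are hypothetical, since the real `captureSectionReads` helper in `./helpers/auq-sdk-capture` is not shown in this diff. The path regex is the one the removed PTY loop used.

```typescript
// Hypothetical event shape; the real capture helper's types are not shown in this diff.
interface ToolUseEvent {
  name: string;
  input: { file_path?: string };
}

// Collect the section file names touched by Read tool calls.
function sectionsRead(events: ToolUseEvent[]): Set<string> {
  const read = new Set<string>();
  const sectionPath = /sections\/([A-Za-z0-9._-]+\.md)/;
  for (const ev of events) {
    // Only Read calls count; other tools or missing paths are ignored.
    if (ev.name !== 'Read' || !ev.input.file_path) continue;
    const m = ev.input.file_path.match(sectionPath);
    if (m) read.add(m[1]);
  }
  return read;
}

// Illustrative stream: the skeleton, both required sections, and an unrelated tool call.
const events: ToolUseEvent[] = [
  { name: 'Read', input: { file_path: '/tmp/fx/ship/SKILL.md' } },
  { name: 'Read', input: { file_path: '/tmp/fx/ship/sections/review-army.md' } },
  { name: 'Grep', input: {} },
  { name: 'Read', input: { file_path: '/tmp/fx/ship/sections/changelog.md' } },
];

const required = ['review-army.md', 'changelog.md'];
const readSections = sectionsRead(events);
const missing = required.filter(s => !readSections.has(s));
```

Because the check works on structured tool calls rather than rendered screen text, it does not depend on how a terminal paints the tool output, which is the failure mode the docblock describes for the PTY approach.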
@@ -890,6 +890,7 @@ We're building a new user dashboard that shows recent activity, notifications, a
       path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
       path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -974,6 +975,7 @@ We're building a new user dashboard that shows recent activity, notifications, a
       path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
       path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -1068,6 +1070,7 @@ Replace session-cookie auth with JWT tokens. Currently using express-session + R
       path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
       path.join(planDir, 'plan-eng-review', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -1450,6 +1453,7 @@ export function main() { return Dashboard(); }
       path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
       path.join(planDir, 'plan-eng-review', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'plan-eng-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(planDir, 'plan-eng-review', 'sections'), { recursive: true }); }
 
     // Set up remote-slug shim and browse shims (plan-eng-review uses remote-slug for artifact path)
     setupBrowseShims(planDir);
@@ -2256,6 +2260,7 @@ describeIfSelected('Plan Design Review E2E', ['plan-design-review-plan-mode', 'p
       path.join(ROOT, 'plan-design-review', 'SKILL.md'),
       path.join(reviewDir, 'plan-design-review', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'plan-design-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(reviewDir, 'plan-design-review', 'sections'), { recursive: true }); }
 
     // Create a plan file with intentional design gaps
     fs.writeFileSync(path.join(reviewDir, 'plan.md'), `# Plan: User Dashboard
@@ -3158,6 +3163,7 @@ describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'],
       path.join(ROOT, 'office-hours', 'SKILL.md'),
       path.join(ohDir, 'office-hours', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'office-hours', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(ohDir, 'office-hours', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {
@@ -3220,6 +3226,7 @@ describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefi
       path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
       path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
     );
+    { const _sec = path.join(ROOT, 'plan-ceo-review', 'sections'); if (fs.existsSync(_sec)) fs.cpSync(_sec, path.join(benefitsDir, 'plan-ceo-review', 'sections'), { recursive: true }); }
   });
 
   afterAll(() => {

@@ -540,7 +540,19 @@ async function runWorkflowJudge(opts: {
   const defaults = { clarity: 4, completeness: 3, actionability: 4 };
   const thresholds = { ...defaults, ...opts.thresholds };
 
-  const content = fs.readFileSync(path.join(ROOT, opts.skillPath), 'utf-8');
+  // Read the skeleton + sections UNION so carved skills (v2 plan T9) still
+  // expose markers that moved into sections/*.md (e.g. plan-eng's "## Review
+  // Sections" + "## CRITICAL RULE", plan-design's 7 passes). Without this the
+  // slice markers vanish from the skeleton and the judge scores empty content.
+  let content = fs.readFileSync(path.join(ROOT, opts.skillPath), 'utf-8');
+  const secDir = path.join(ROOT, path.dirname(opts.skillPath), 'sections');
+  if (fs.existsSync(secDir)) {
+    for (const f of fs.readdirSync(secDir).sort()) {
+      if (f.endsWith('.md') && !f.endsWith('.md.tmpl')) {
+        content += '\n' + fs.readFileSync(path.join(secDir, f), 'utf-8');
+      }
+    }
+  }
   const startIdx = content.indexOf(opts.startMarker);
   if (startIdx === -1) throw new Error(`Start marker not found in ${opts.skillPath}: "${opts.startMarker}"`);
 

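The hunk above reads the skeleton plus every generated `sections/*.md` file as one string so that markers which moved into a section are still found. A self-contained sketch of that union read against a throwaway directory follows. The function name `readUnion` and the fixture contents are assumptions for illustration. Note that a name ending in `.md.tmpl` never satisfies `endsWith('.md')`, so the single check already skips templates.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical helper: skeleton SKILL.md followed by each sections/*.md in sorted order.
function readUnion(skillDir: string): string {
  let text = fs.readFileSync(path.join(skillDir, 'SKILL.md'), 'utf-8');
  const secDir = path.join(skillDir, 'sections');
  if (!fs.existsSync(secDir)) return text; // uncarved skill: the skeleton is the whole skill
  for (const f of fs.readdirSync(secDir).sort()) {
    // '.md.tmpl' sources do not end with '.md', so only generated sections are appended.
    if (f.endsWith('.md')) text += '\n' + fs.readFileSync(path.join(secDir, f), 'utf-8');
  }
  return text;
}

// Throwaway carved skill: one skeleton, two sections, one template that must be ignored.
const skillDir = fs.mkdtempSync(path.join(os.tmpdir(), 'union-'));
fs.mkdirSync(path.join(skillDir, 'sections'));
fs.writeFileSync(path.join(skillDir, 'SKILL.md'), '# Skeleton');
fs.writeFileSync(path.join(skillDir, 'sections', 'b.md'), '## Review Sections');
fs.writeFileSync(path.join(skillDir, 'sections', 'a.md'), '## CRITICAL RULE');
fs.writeFileSync(path.join(skillDir, 'sections', 'a.md.tmpl'), 'TEMPLATE');

const union = readUnion(skillDir);
```

Sorting the directory listing keeps the concatenation order stable across platforms, which matters when a consumer slices the union between a start marker and an end marker.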
@@ -146,11 +146,14 @@ describe('SKILL.md size budget regression (gate, free)', () => {
    * skill, so this is a comfortable ceiling that still catches accidental
    * mass deletion (e.g., a refactor that strips the body of a skill).
    *
-   * v2.0.0.0 will introduce the sections/ pattern for 5 heavyweights
+   * v2.0.0.0 introduces the sections/ pattern for 5 heavyweights
    * (ship, plan-ceo-review, office-hours, plan-eng-review,
-   * plan-design-review). Those skills will legitimately shrink to ~15 KB
-   * skeletons. When that lands, add them to SECTIONS_EXTRACTED so the floor
-   * relaxes for them.
+   * plan-design-review). Carved so far: ship (skeleton ~83 KB) and
+   * plan-ceo-review (skeleton ~81 KB, down from the 138 KB monolith). Those
+   * skeletons legitimately fall below the 80% body-strip floor, so each carved
+   * skill is added to SECTIONS_EXTRACTED; its union is guarded instead by the
+   * sectioned invariant in parity-harness.ts (minBytes on skeleton+sections).
+   * Add the remaining three here as they carve.
    */
   test('no skill shrinks past 80% of v1.47.0.0 baseline (catches accidental body strip)', () => {
     const baseline: ParityBaseline = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'));
@@ -160,7 +163,7 @@ describe('SKILL.md size budget regression (gate, free)', () => {
     // because prose moved into sections/*.md. The union size is guarded instead
     // by the sectioned ship invariant in parity-harness.ts (minBytes on the
     // skeleton+sections union), so exempt the skeleton from the body-strip floor.
-    const SECTIONS_EXTRACTED = new Set<string>(['ship']);
+    const SECTIONS_EXTRACTED = new Set<string>(['ship', 'plan-ceo-review', 'office-hours', 'plan-eng-review', 'plan-design-review', 'plan-devex-review']);
 
     const undershoots: Array<{
       skill: string; beforeBytes: number; afterBytes: number; ratio: number;

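The two hunks above describe an 80% body-strip floor with an exemption set for carved skills. A minimal sketch of that check follows, under stated assumptions: the flat `Record<string, number>` baseline shape, the `findUndershoots` name, and the `skill-a` / `skill-b` entries with their byte counts are hypothetical. Only the plan-ceo-review figures (138,838 B before, 80,731 B after) come from the commit message; the real `ParityBaseline` type is not shown in this diff.

```typescript
interface Undershoot {
  skill: string;
  beforeBytes: number;
  afterBytes: number;
  ratio: number;
}

// Flag every non-exempt skill whose current size fell below `floor` of its baseline.
function findUndershoots(
  baseline: Record<string, number>,
  current: Record<string, number>,
  exempt: Set<string>,
  floor = 0.8,
): Undershoot[] {
  const out: Undershoot[] = [];
  for (const [skill, beforeBytes] of Object.entries(baseline)) {
    if (exempt.has(skill)) continue; // carved skeletons are guarded by the union check instead
    const afterBytes = current[skill];
    if (afterBytes === undefined) continue; // skill no longer present: out of scope for this check
    const ratio = afterBytes / beforeBytes;
    if (ratio < floor) out.push({ skill, beforeBytes, afterBytes, ratio });
  }
  return out;
}

// Hypothetical sizes, except plan-ceo-review which uses the commit message's figures.
const baselineBytes = { 'plan-ceo-review': 138_838, 'skill-a': 40_000, 'skill-b': 30_000 };
const currentBytes = { 'plan-ceo-review': 80_731, 'skill-a': 39_000, 'skill-b': 12_000 };

const withExemption = findUndershoots(baselineBytes, currentBytes, new Set(['plan-ceo-review']));
const withoutExemption = findUndershoots(baselineBytes, currentBytes, new Set());
```

In this sketch the carved skeleton sits at roughly 58% of its baseline, so it would trip the floor unless exempted, while the genuinely stripped `skill-b` is flagged either way.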
@@ -7,14 +7,13 @@ import * as path from 'path';
 
 const ROOT = path.resolve(import.meta.dir, '..');
 
-// Carved-skill aware (v2 plan T9): ship is a skeleton SKILL.md + sections/*.md.
-// Read the union so validations of content that moved into a section still hold.
-// `_SHIP_MD` is a distinct path expression so a mechanical read-replace can't
-// recurse into this helper.
-const _SHIP_MD = path.join(ROOT, 'ship', 'SKILL.md');
-function readShipUnion(): string {
-  let t = fs.readFileSync(_SHIP_MD, 'utf-8');
-  const secDir = path.join(ROOT, 'ship', 'sections');
+// Carved-skill aware (v2 plan T9 / Phase B): a carved skill is a skeleton SKILL.md
+// plus sections/*.md. Read the union so validations of content that moved into a
+// section still hold. For an uncarved skill (no sections dir) this is just the
+// skeleton, so readSkillUnion is safe to use everywhere.
+function readSkillUnion(skill: string): string {
+  let t = fs.readFileSync(path.join(ROOT, skill, 'SKILL.md'), 'utf-8');
+  const secDir = path.join(ROOT, skill, 'sections');
   if (fs.existsSync(secDir)) {
     for (const f of fs.readdirSync(secDir).sort()) {
       if (f.endsWith('.md')) t += '\n' + fs.readFileSync(path.join(secDir, f), 'utf-8');
@@ -22,6 +21,9 @@ function readShipUnion(): string {
   }
   return t;
 }
+function readShipUnion(): string {
+  return readSkillUnion('ship');
+}
 
 describe('SKILL.md command validation', () => {
   test('all $B commands in SKILL.md are valid browse commands', () => {
@@ -548,8 +550,8 @@ describe('TODOS-format.md reference consistency', () => {
 
   test('skills that write TODOs reference TODOS-format.md', () => {
     const shipContent = readShipUnion();
-    const ceoPlanContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
-    const engPlanContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
+    const ceoPlanContent = readSkillUnion('plan-ceo-review'); // carved: TODOS-format ref moved to section
+    const engPlanContent = readSkillUnion('plan-eng-review');
 
     expect(shipContent).toContain('TODOS-format.md');
     expect(ceoPlanContent).toContain('TODOS-format.md');
@@ -621,7 +623,10 @@ describe('v0.4.1 preamble features', () => {
 // --- Structural tests for new skills ---
 
 describe('office-hours skill structure', () => {
-  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
+  // Carved (v2 plan T9): Phase 5 (Design Doc) + Phase 6 (handoff) moved into
+  // sections/design-and-handoff.md, so structural phrases now live there — read
+  // the skeleton+sections union.
+  const content = readSkillUnion('office-hours');
 
   // Original structural assertions
   for (const section of ['Phase 1', 'Phase 2', 'Phase 3', 'Phase 4', 'Phase 5', 'Phase 6',
@@ -912,8 +917,10 @@ describe('CEO review mode validation', () => {
   });
 
   test('has docs/designs promotion section', () => {
-    expect(content).toContain('docs/designs');
-    expect(content).toContain('PROMOTED');
+    // Carved (v2 plan Phase B): the promotion block moved into the review section.
+    const union = readSkillUnion('plan-ceo-review');
+    expect(union).toContain('docs/designs');
+    expect(union).toContain('PROMOTED');
   });
 
   test('mode quick reference has four columns', () => {

@@ -94,7 +94,7 @@ describe('selectTests', () => {
     expect(result.selected).toContain('plan-review-prosons-hardstop-neg');
     expect(result.selected).toContain('plan-review-prosons-neutral-neg');
     // v1.13.x real-PTY E2E batch entries that also depend on plan-ceo-review/**
-    expect(result.selected).toContain('ask-user-question-format-pty');
+    expect(result.selected).toContain('auq-format-gate');
     expect(result.selected).toContain('plan-ceo-mode-routing');
     expect(result.selected).toContain('autoplan-chain-pty');
     // Per-finding count + review-report-at-bottom (v1.21.x)
@@ -109,8 +109,10 @@ describe('selectTests', () => {
     // E2E test also depends on plan-ceo-review/** (5-option scope decision
     // regression for the "drop to fit 4 options" failure mode).
     expect(result.selected).toContain('plan-ceo-split-overflow');
-    expect(result.selected.length).toBe(22);
-    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 22);
+    // v2 plan Phase B carve: the section-loading E2E depends on plan-ceo-review/**.
+    expect(result.selected).toContain('plan-ceo-section-loading');
+    expect(result.selected.length).toBe(23);
+    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 23);
   });
 
   test('global touchfile triggers ALL tests', () => {
