gstack/test/touchfiles.test.ts
Garry Tan cab774cced v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section

Carve the largest skill (138,838 B) into a skeleton + one on-demand
section, the documented next Phase B target after /ship (v2_PLAN.md:216).

- sections/review-sections.md(.tmpl): the 11-section deep review, codex/
  outside-voice rules, how-to-ask, Required Outputs, registries, Completion
  Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps,
  docs/designs promotion, Formatting Rules, and the Mode Quick Reference.
- sections/manifest.json: passive registry (CM2), one entry.
- SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single
  {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a
  Section self-check. All of Step 0 (the scope/mode conversation) stays in
  the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section.

Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens
off every invocation). Union (skeleton + section) 139,110 B, behavior held.
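The measured figures above can be re-derived with simple arithmetic. A minimal sketch, assuming sizes are plain byte counts; `reductionPercent` and `withinBudget` are hypothetical helper names for illustration, not functions from this repo, and 90_000 is the maxSkeletonBytes ceiling quoted in the test notes of this commit.

```typescript
// Hypothetical helpers for illustration; not part of the gstack test suite.

/** Percentage drop from `before` to `after`. */
function reductionPercent(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

/** True when a measured size fits under the byte budget. */
function withinBudget(bytes: number, maxBytes: number): boolean {
  return bytes <= maxBytes;
}

const preCarveBytes = 138_838; // always-loaded skeleton before the carve
const postCarveBytes = 80_731; // always-loaded skeleton after the carve

console.log(reductionPercent(preCarveBytes, postCarveBytes).toFixed(1)); // 41.9
console.log(withinBudget(postCarveBytes, 90_000));
```

The same two functions reproduce the other carves' percentages from their before/after byte counts.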

Boundary honors Codex P1: nothing review-governing (formatting rules, mode
reference, how-to-ask, required outputs) sits in the skeleton below the
STOP. Housekeeping resolvers ride in the section, matching the ship
precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS).

Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs
freshness on every push, so source + regen + tests must land together):
- parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000
  (measured 80,731 + headroom); content/minBytes run against the union.
- skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED.
- section-manifest-consistency: generalized to discover every carved skill,
  vars computed per-skill-case (Codex P2).
- skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after
  Step 0, review body absent from skeleton, report writer in the section,
  nothing review-governing below the STOP.
- skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the
  installed skill first (Codex P1), drives full Step 0, asserts the section
  is Read before the report.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  carved skills so relocated prose still counts.
- touchfiles: plan-ceo-section-loading registered (periodic).
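The touchfiles registration in the last bullet ties an E2E test name to the glob patterns whose changes should select it. A rough sketch of that idea follows, assuming `**` spans any depth and `*` stays within one path segment. `matchGlobSketch`, `selectTestsSketch`, and the two-entry map are hypothetical stand-ins; the real helpers live in test/helpers/touchfiles.ts and may differ.

```typescript
// Illustrative sketch only; names and map entries are assumptions.

type TouchfileMap = Record<string, string[]>;

interface Selection {
  selected: string[];
  skipped: string[];
  reason: string;
}

/** Escape regex metacharacters in a literal path fragment. */
function escapeRegExp(fragment: string): string {
  return fragment.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
}

/** Translate a glob into an anchored RegExp: `**` -> any depth, `*` -> one segment. */
function globToRegExp(pattern: string): RegExp {
  const source = pattern
    .split('**')
    .map((part) => part.split('*').map((s) => escapeRegExp(s)).join('[^/]*'))
    .join('.*');
  return new RegExp(`^${source}$`);
}

function matchGlobSketch(file: string, pattern: string): boolean {
  return globToRegExp(pattern).test(file);
}

function selectTestsSketch(changed: string[], touchfiles: TouchfileMap, globals: string[]): Selection {
  const all = Object.keys(touchfiles);
  // A change to any global touchfile selects every test.
  const hitsGlobal = changed.some((f) => globals.some((g) => matchGlobSketch(f, g)));
  if (hitsGlobal) return { selected: all, skipped: [], reason: 'global touchfile changed' };
  // Otherwise select only tests whose patterns match a changed file.
  const selected = all.filter((name) =>
    touchfiles[name].some((p) => changed.some((f) => matchGlobSketch(f, p))),
  );
  const skipped = all.filter((name) => !selected.includes(name));
  return { selected, skipped, reason: 'diff' };
}

const demoTouchfiles: TouchfileMap = {
  'plan-ceo-section-loading': ['plan-ceo-review/**'],
  retro: ['retro/**'],
};
const demoResult = selectTestsSketch(
  ['plan-ceo-review/sections/manifest.json'],
  demoTouchfiles,
  ['test/helpers/session-runner.ts'],
);
console.log(demoResult.selected, demoResult.reason);
```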

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)

MINOR: carves the largest skill into skeleton + on-demand section,
dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B,
~14.4K tokens off every invocation). User-facing release notes lead with
the measured token win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks

Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section
carves stay modest because the ~40-50KB shared preamble dominates the
always-loaded surface. A single preamble-reference carve would help every
tier>=2 skill at once. Records the why, the cold-vs-hot split to measure,
and the guards it needs. Not implemented this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): Layer 0 — guarantee AUQ format spec is always-loaded

Deterministic, free, per-PR keystone for the token-reduction era. For every
interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief
format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/
self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an
on-demand section. Plus a roster guard (a carve can't silently drop the block)
and per-skill rule survival in the skeleton+sections union. 51 cases + a
negative control. Fails the instant a future carve strands AUQ-governing text
where it won't be loaded when a question fires.
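The Layer 0 property described above is a pure string check, so it can be sketched without any model call. This is a minimal sketch under assumptions: the `CarvedSkill` shape, the function names, and the marker strings are hypothetical, and the real test may identify the format block differently.

```typescript
// Hypothetical sketch; types, names, and marker strings are assumptions.

interface CarvedSkill {
  name: string;
  skeleton: string;   // always-loaded SKILL.md text
  sections: string[]; // on-demand section texts
}

/** Markers present only in an on-demand section, i.e. stranded by a carve. */
function strandedMarkers(skill: CarvedSkill, markers: string[]): string[] {
  return markers.filter(
    (m) => !skill.skeleton.includes(m) && skill.sections.some((s) => s.includes(m)),
  );
}

/** Markers missing from the skeleton+sections union entirely (roster guard). */
function droppedMarkers(skill: CarvedSkill, markers: string[]): string[] {
  const union = [skill.skeleton, ...skill.sections].join('\n');
  return markers.filter((m) => !union.includes(m));
}

const auqMarkers = ['ELI10', 'Recommendation', '(recommended)', 'Stakes'];

const healthy: CarvedSkill = {
  name: 'healthy-skill',
  skeleton: 'ELI10: ... Recommendation: ... (recommended) ... Stakes: ...',
  sections: ['## Review Sections'],
};
const badCarve: CarvedSkill = {
  name: 'bad-carve',
  skeleton: 'ELI10: ... Recommendation: ...',
  sections: ['(recommended) ... Stakes: ...'],
};

console.log(strandedMarkers(healthy, auqMarkers));  // []
console.log(strandedMarkers(badCarve, auqMarkers)); // [ '(recommended)', 'Stakes' ]
```

The `badCarve` case plays the role of the negative control: a check that never fails on a deliberately broken input proves nothing.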

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B

Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a
skill to its AUQ and captures the verbatim text the model GENERATES, cleanly
(real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an
absolute path with Read/Write-only tools so the agent can't wander to the
global install. gradeAuqRecommendation normalizes a non-"because" connective
before grading so substantive reasons aren't false-flagged (without touching
the pinned shared judge).

The A/B drives the same prompt through the carved 80KB skeleton and the
pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7
format, substance 5 — proven no degradation, transcript-verified each side read
its own planted SKILL.md. Periodic tier.
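The pass/fail rule of the A/B ("fails if carved scores worse") reduces to comparing two grades. A minimal sketch, assuming each side yields a format-hit count and a substance score; the `AuqGrade` shape and `degradations` are hypothetical names, not the repo's API.

```typescript
// Hypothetical sketch of the no-degradation comparison; shapes are assumptions.

interface AuqGrade {
  formatHits: number; // decision-brief elements found (out of 7 in this suite)
  substance: number;  // judge's recommendation-substance score
}

/** Lists every dimension on which the carved skill scored below the baseline. */
function degradations(baseline: AuqGrade, carved: AuqGrade): string[] {
  const problems: string[] = [];
  if (carved.formatHits < baseline.formatHits) {
    problems.push(`format ${carved.formatHits} < ${baseline.formatHits}`);
  }
  if (carved.substance < baseline.substance) {
    problems.push(`substance ${carved.substance} < ${baseline.substance}`);
  }
  return problems;
}

// Values reported in this commit: both sides 7/7 format, substance 5.
const monolith: AuqGrade = { formatHits: 7, substance: 5 };
const carvedSkeleton: AuqGrade = { formatHits: 7, substance: 5 };

console.log(degradations(monolith, carvedSkeleton)); // []
```

Equal scores pass; only a strictly worse carved score counts as a failure, which keeps the test from flaking when both sides are already at the ceiling.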

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): consistency — same trigger N runs, stable format + substance

Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format
element appears in one run but not another, or substance craters. Targets the
"fine one run, broken the next" failure class a single snapshot can't see.
Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier.
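The stability rule above ("any format element appears in one run but not another") can be expressed as a set comparison across runs. A minimal sketch under assumptions: `RunGrade`, `unstableElements`, and `substanceCratered` are hypothetical names, and the substance floor of 4 is borrowed from the matrix bar rather than confirmed for this test.

```typescript
// Hypothetical sketch; names and the substance floor are assumptions.

interface RunGrade {
  elements: string[]; // format elements detected in this run
  substance: number;  // recommendation-substance score for this run
}

/** Elements seen in at least one run but missing from at least one other. */
function unstableElements(runs: RunGrade[]): string[] {
  const seen = new Set<string>();
  for (const run of runs) {
    for (const el of run.elements) seen.add(el);
  }
  return Array.from(seen)
    .filter((el) => !runs.every((run) => run.elements.includes(el)))
    .sort();
}

/** True when any run falls below the substance floor. */
function substanceCratered(runs: RunGrade[], floor: number): boolean {
  return runs.some((run) => run.substance < floor);
}

const stableRuns: RunGrade[] = [
  { elements: ['ELI10', 'Recommendation', 'Stakes'], substance: 5 },
  { elements: ['ELI10', 'Recommendation', 'Stakes'], substance: 5 },
  { elements: ['ELI10', 'Recommendation', 'Stakes'], substance: 5 },
];
const flakyRuns: RunGrade[] = [
  { elements: ['ELI10', 'Recommendation', 'Stakes'], substance: 5 },
  { elements: ['Recommendation', 'Stakes'], substance: 5 },
  { elements: ['ELI10', 'Recommendation'], substance: 2 },
];

console.log(unstableElements(stableRuns)); // []
console.log(unstableElements(flakyRuns));  // [ 'ELI10', 'Stakes' ]
```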

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): behavioral matrix across AUQ-heavy skills

Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex,
office-hours, cso, spec, design-consultation) to its first AskUserQuestion and
grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation
substance >=4. One case per skill (isolated failures), env-subsettable via
AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded
(comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted
skills pass 7/7 with substance 4-5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(codex): live recommendation-substance grade for /codex

Closes the gap where /codex's synthesis recommendation was only checked
statically (template grep) and via fixtures. Drives the real /codex skill over
a flawed diff and grades the emitted "Recommendation: ... because ..." line
with judgeRecommendation (present/commits/has_because/substance>=4). The named
weak spot holds up: substance 5. Periodic tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): deterministic trigger for format-compliance gate

A bare /plan-ceo-review against a repo whose work is already implemented makes
the model improvise an off-script "what should I review?" scope question that
skips the decision-brief format, which the gate test then times out waiting for.
Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real
Step 0 mode-selection AUQ that is the intended format check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(office-hours): carve Phase 5+6 into on-demand section

Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves
Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the
session's output + closing tail, only reached after the conversation and
alternatives are done — into sections/design-and-handoff.md, behind a single
STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the
always-run Important Rules stay in the always-loaded skeleton.

Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved.
The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5),
and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ
paranoid suite de-risked this carve end to end.

Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on
every push, so source + regen + tests land together; --host all regenerates
the inlined non-Claude variants):
- sections/manifest.json: passive registry, one entry.
- parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000
  (measured 88,975 + headroom); content/minBytes run against the union.
- skill-size-budget: office-hours added to SECTIONS_EXTRACTED.
- gen-skill-docs + skill-validation: read the skeleton+sections union for
  office-hours so relocated Phase 5/6 prose still counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(preamble): carve CJK-escaping manual to on-demand doc

The AskUserQuestion format block is inlined into every interactive skill (~33).
It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but
that rationale only matters when a question contains CJK text and the operative
rule already lives in the always-loaded self-check. Moved the justification to
docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer.

Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B
x ~33 skills). Layer 0 still passes — the core decision-brief format stays
always-loaded; only the rare CJK rationale moved. Atomic with the all-host
regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-eng-review): carve review body into on-demand section

Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture,
Code Quality, Tests, Performance), outside voice, required outputs, and review
report — everything after Step 0 scope — into sections/review-sections.md behind
a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with
tests + all-host regen (freshness gate): parity flipped to sectioned
(maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs
reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-design-review): carve review body into on-demand section

Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design
passes, required outputs, and review report — everything after Step 0 scope and
the mockup/rating phase — into sections/review-sections.md behind a STOP-Read.
Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the
always-loaded skeleton.

Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with
tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to
SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(plan-devex-review): carve review body into on-demand section

Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report
— everything after the Step 0 DX investigation — into sections/review-sections.md
behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace,
roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded.

Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with
tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the
union. Layer 0 green. (No parity invariant entry for plan-devex-review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines + gbrain-detection union after carves

Two follow-ups the carve commits should have carried (caught by the full suite,
missed by targeted subsets):
- ship golden baselines (claude/codex/factory) regenerated: the preamble CJK
  trim (v1.58) changed ship's always-loaded AskUserQuestion block.
- gbrain-detection-override probes the office-hours skeleton+section union:
  GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours
  was carved, so the detection assertions now check both files.

Full `bun test` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(auq): grade format-compliance gate from SDK capture, not the TUI

The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified
directly that this cannot work: plan-mode AUQs render as a cursor picker whose
cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for
a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and
parseNumberedOptions returns 0. The test was grading a lossy projection and
failed by construction.

Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent
writes the verbatim question it would have shown — clean text, no rendering
loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same
property, reliable, environment-independent; shares the engine with the periodic
A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed
ask-user-question-format-pty -> auq-format-gate (no longer a PTY test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: fix carve-broken CI evals (union reads + section fixtures)

Two CI eval jobs failed on the carved plan-* skills because they read content
that moved into sections/:

- llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers
  like "## Review Sections" / "## CRITICAL RULE" that now live in
  sections/review-sections.md. The markers vanished from the skeleton, so the
  judge scored empty/wrong content. Fix: read the skeleton+sections union.
  Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS
  (25/25).

- e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the
  fixture, not sections/. The carved skill's STOP pointed at a section file that
  was absent, so the model improvised a compressed report table instead of the
  canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy
  sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS,
  canonical table emitted.
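The e2e-plan fix amounts to copying the sections/ directory whenever SKILL.md is copied into a fixture. A minimal sketch, assuming Node-compatible `fs` (including `fs.cpSync`, available in Node 16.7+ and Bun); `copySkillIntoFixture` is a hypothetical name, not the repo's setupPlanDir.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper for illustration; the real setup code may differ.

/** Copy SKILL.md plus sections/ (when the skill is carved) into a fixture dir. */
function copySkillIntoFixture(skillDir: string, fixtureDir: string): void {
  fs.mkdirSync(fixtureDir, { recursive: true });
  fs.copyFileSync(path.join(skillDir, 'SKILL.md'), path.join(fixtureDir, 'SKILL.md'));
  const sectionsDir = path.join(skillDir, 'sections');
  if (fs.existsSync(sectionsDir)) {
    // Without this, the skeleton's STOP points at a file the fixture lacks.
    fs.cpSync(sectionsDir, path.join(fixtureDir, 'sections'), { recursive: true });
  }
}

// Demo against a throwaway carved skill in the OS temp directory.
const tmpRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'carve-fixture-'));
const fakeSkill = path.join(tmpRoot, 'skill');
fs.mkdirSync(path.join(fakeSkill, 'sections'), { recursive: true });
fs.writeFileSync(path.join(fakeSkill, 'SKILL.md'), 'skeleton\n');
fs.writeFileSync(path.join(fakeSkill, 'sections', 'review-sections.md'), 'section\n');

const fixture = path.join(tmpRoot, 'fixture');
copySkillIntoFixture(fakeSkill, fixture);
const skeletonCopied = fs.existsSync(path.join(fixture, 'SKILL.md'));
const sectionCopied = fs.existsSync(path.join(fixture, 'sections', 'review-sections.md'));
fs.rmSync(tmpRoot, { recursive: true, force: true });

console.log({ skeletonCopied, sectionCopied });
```

The `existsSync` guard keeps the helper usable for skills that have not been carved and therefore have no sections/ directory.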

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails)

Proactive sweep beyond the two CI logs: every e2e test that copies a carved
skill's SKILL.md into a temp fixture must also copy its sections/, or the
model hits a STOP pointing at a missing section file and improvises/degrades.

- skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across
  planDir/reviewDir/ohDir/benefitsDir dests now copy sections/.
- skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering
  loop now copy sections/.
- skill-e2e-design.test.ts: plan-design-review copy now copies sections/.
- skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/.
- skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into
  the section, so check the regenerated skeleton+section UNION for the gbrain put
  block, ship both into the workdir, and restore both (the section regen was also
  leaking into the working tree; the `finally` block now restores it).

ship copies (single-file Step-0 slices) and review/retro (not carved) untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: migrate section-loading E2E to lossless SDK tool-stream detection

The /ship and /plan-ceo-review section-loading tests drove a real PTY and
scraped the ANSI screen buffer for sections/<file>.md paths. That silently
saw nothing in a Conductor PTY (cursor-positioned tool renders and an
unanswered Step 0 question loop both defeat the regex), so both reported
read: [] even when the agent did the work.

They now run the skill through claude -p (the same SDK path the AUQ matrix
uses) and detect section reads from the tool-use stream — Read calls whose
file_path contains sections/<file>.md — with no rendering layer to mangle.
The run is also hermetic: the freshly-generated worktree skeleton + sections
are copied into a throwaway fixture with the absolute path pinned, so the
test validates this branch's carve without mutating the user's ~/.claude
install.
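Detecting section reads from the tool-use stream, as described above, is a filter over structured events rather than a screen scrape. A minimal sketch under assumptions: the event shape below is a simplified, hypothetical stand-in for whatever the SDK run emits, and `sectionReads` is not the repo's function name.

```typescript
// Hypothetical sketch; the event shape is a simplified assumption.

interface ToolUseEvent {
  type: 'tool_use';
  name: string;
  input: { file_path?: string };
}
interface TextEvent {
  type: 'text';
  text: string;
}
type StreamEvent = ToolUseEvent | TextEvent;

/** Section file names touched by Read calls whose file_path contains sections/<file>.md. */
function sectionReads(events: StreamEvent[]): string[] {
  const reads = new Set<string>();
  for (const event of events) {
    if (event.type !== 'tool_use' || event.name !== 'Read') continue;
    const match = /sections\/([^/]+\.md)/.exec(event.input.file_path ?? '');
    if (match) reads.add(match[1]);
  }
  return Array.from(reads);
}

const demoStream: StreamEvent[] = [
  { type: 'text', text: 'Starting /ship' },
  { type: 'tool_use', name: 'Read', input: { file_path: '/tmp/fixture/ship/SKILL.md' } },
  { type: 'tool_use', name: 'Read', input: { file_path: '/tmp/fixture/ship/sections/review-army.md' } },
  { type: 'tool_use', name: 'Write', input: { file_path: '/tmp/fixture/ship/sections/ignored.md' } },
  { type: 'tool_use', name: 'Read', input: { file_path: '/tmp/fixture/ship/sections/changelog.md' } },
];

console.log(sectionReads(demoStream)); // [ 'review-army.md', 'changelog.md' ]
```

Because the check reads structured events, cursor-positioning escapes and unanswered prompts cannot hide a Read that actually happened.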

Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md;
ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min
combined on the old PTY path where both were failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: consolidate branch to v1.56.0.0 (single MINOR above main)

The branch bumped VERSION several times during development (1.56 → 1.57 →
1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per
the "never orphan branch-internal versions" discipline, collapse all four
into a single 1.56.0.0 entry — one MINOR release covering the whole branch:
five skills carved (plan-ceo, office-hours, plan-eng, plan-design,
plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid
AUQ no-degradation test suite + lossless section-loading tests.

VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved
below the consolidated entry. No SKILL.md drift (VERSION is not embedded in
generated bodies).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:14:43 -07:00

/**
 * Unit tests for diff-based test selection.
 * Free (no API calls), runs with `bun test`.
 */

import { describe, test, expect } from 'bun:test';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
import {
  matchGlob,
  selectTests,
  detectBaseBranch,
  E2E_TOUCHFILES,
  E2E_TIERS,
  LLM_JUDGE_TOUCHFILES,
  GLOBAL_TOUCHFILES,
} from './helpers/touchfiles';

const ROOT = path.resolve(import.meta.dir, '..');

// --- matchGlob ---

describe('matchGlob', () => {
  test('** matches any depth of path segments', () => {
    expect(matchGlob('browse/src/commands.ts', 'browse/src/**')).toBe(true);
    expect(matchGlob('browse/src/deep/nested/file.ts', 'browse/src/**')).toBe(true);
    expect(matchGlob('browse/src/cli.ts', 'browse/src/**')).toBe(true);
  });

  test('** does not match unrelated paths', () => {
    expect(matchGlob('browse/src/commands.ts', 'qa/**')).toBe(false);
    expect(matchGlob('review/SKILL.md', 'qa/**')).toBe(false);
  });

  test('exact match works', () => {
    expect(matchGlob('SKILL.md', 'SKILL.md')).toBe(true);
    expect(matchGlob('SKILL.md.tmpl', 'SKILL.md')).toBe(false);
    expect(matchGlob('qa/SKILL.md', 'SKILL.md')).toBe(false);
  });

  test('* matches within a single segment', () => {
    expect(matchGlob('test/fixtures/review-eval-enum.rb', 'test/fixtures/review-eval-enum*.rb')).toBe(true);
    expect(matchGlob('test/fixtures/review-eval-enum-diff.rb', 'test/fixtures/review-eval-enum*.rb')).toBe(true);
    expect(matchGlob('test/fixtures/review-eval-vuln.rb', 'test/fixtures/review-eval-enum*.rb')).toBe(false);
  });

  test('dots in patterns are escaped correctly', () => {
    expect(matchGlob('SKILL.md', 'SKILL.md')).toBe(true);
    expect(matchGlob('SKILLxmd', 'SKILL.md')).toBe(false);
  });

  test('** at end matches files in the directory', () => {
    expect(matchGlob('qa/SKILL.md', 'qa/**')).toBe(true);
    expect(matchGlob('qa/SKILL.md.tmpl', 'qa/**')).toBe(true);
    expect(matchGlob('qa/templates/report.md', 'qa/**')).toBe(true);
  });
});

// --- selectTests ---

describe('selectTests', () => {
  test('browse/src change selects browse and qa tests', () => {
    const result = selectTests(['browse/src/commands.ts'], E2E_TOUCHFILES);
    expect(result.selected).toContain('browse-basic');
    expect(result.selected).toContain('browse-snapshot');
    expect(result.selected).toContain('qa-quick');
    expect(result.selected).toContain('qa-fix-loop');
    expect(result.selected).toContain('design-review-fix');
    expect(result.reason).toBe('diff');
    // Should NOT include unrelated tests
    expect(result.selected).not.toContain('plan-ceo-review');
    expect(result.selected).not.toContain('retro');
    expect(result.selected).not.toContain('document-release');
  });

  test('skill-specific change selects only that skill and related tests', () => {
    const result = selectTests(['plan-ceo-review/SKILL.md'], E2E_TOUCHFILES);
    expect(result.selected).toContain('plan-ceo-review');
    expect(result.selected).toContain('plan-ceo-review-selective');
    expect(result.selected).toContain('plan-ceo-review-benefits');
    expect(result.selected).toContain('plan-ceo-review-expansion-energy');
    expect(result.selected).toContain('autoplan-core');
    expect(result.selected).toContain('codex-offered-ceo-review');
    expect(result.selected).toContain('plan-ceo-review-format-mode');
    expect(result.selected).toContain('plan-ceo-review-format-approach');
    // v1.10.2.0 plan-mode handshake entries also depend on plan-ceo-review/**
    expect(result.selected).toContain('plan-ceo-review-plan-mode');
    expect(result.selected).toContain('plan-mode-no-op');
    expect(result.selected).toContain('e2e-harness-audit');
    expect(result.selected).toContain('plan-ceo-review-prosons-cadence');
    expect(result.selected).toContain('plan-review-prosons-format');
    expect(result.selected).toContain('plan-review-prosons-hardstop-neg');
    expect(result.selected).toContain('plan-review-prosons-neutral-neg');
    // v1.13.x real-PTY E2E batch entries that also depend on plan-ceo-review/**.
    // auq-format-gate (renamed from ask-user-question-format-pty) now grades
    // from SDK capture rather than a PTY.
    expect(result.selected).toContain('auq-format-gate');
    expect(result.selected).toContain('plan-ceo-mode-routing');
    expect(result.selected).toContain('autoplan-chain-pty');
    // Per-finding count + review-report-at-bottom (v1.21.x)
    expect(result.selected).toContain('plan-ceo-finding-count');
    // v1.22+ AskUserQuestion-blocked regression: auto-decide-preserved
    // also depends on plan-ceo-review/** (autoplan-auto-mode test was
    // removed in v1.28; see commit message for the rationale).
    expect(result.selected).toContain('auto-decide-preserved');
    // v1.27+ gate-tier reviewCount-floor regression for transcript bug
    expect(result.selected).toContain('plan-ceo-finding-floor');
    // garrytan/askuserquestion-split-on-overflow: split-overflow periodic
    // E2E test also depends on plan-ceo-review/** (5-option scope decision
    // regression for the "drop to fit 4 options" failure mode).
    expect(result.selected).toContain('plan-ceo-split-overflow');
    // v2 plan Phase B carve: the section-loading E2E depends on plan-ceo-review/**.
    expect(result.selected).toContain('plan-ceo-section-loading');
    expect(result.selected.length).toBe(23);
    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 23);
  });

  test('global touchfile triggers ALL tests', () => {
    const result = selectTests(['test/helpers/session-runner.ts'], E2E_TOUCHFILES);
    expect(result.selected.length).toBe(Object.keys(E2E_TOUCHFILES).length);
    expect(result.skipped.length).toBe(0);
    expect(result.reason).toContain('global');
  });

  test('gen-skill-docs.ts is a scoped touchfile, not global', () => {
    const result = selectTests(['scripts/gen-skill-docs.ts'], E2E_TOUCHFILES);
    // Should select tests that list gen-skill-docs.ts in their touchfiles, not ALL tests
    expect(result.selected.length).toBeGreaterThan(0);
    expect(result.selected.length).toBeLessThan(Object.keys(E2E_TOUCHFILES).length);
    expect(result.reason).toBe('diff');
    // Should include tests that depend on gen-skill-docs.ts
    expect(result.selected).toContain('skillmd-setup-discovery');
    expect(result.selected).toContain('session-awareness');
    expect(result.selected).toContain('journey-ideation');
    // Should NOT include tests that don't depend on it
    expect(result.selected).not.toContain('retro');
    expect(result.selected).not.toContain('cso-full-audit');
  });

  test('unrelated file selects nothing', () => {
    const result = selectTests(['README.md'], E2E_TOUCHFILES);
    expect(result.selected).toEqual([]);
    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length);
  });

  test('empty changed files selects nothing', () => {
    const result = selectTests([], E2E_TOUCHFILES);
    expect(result.selected).toEqual([]);
  });

  test('multiple changed files union their selections', () => {
    const result = selectTests(
      ['plan-ceo-review/SKILL.md', 'retro/SKILL.md.tmpl'],
      E2E_TOUCHFILES,
    );
    expect(result.selected).toContain('plan-ceo-review');
    expect(result.selected).toContain('plan-ceo-review-selective');
    expect(result.selected).toContain('retro');
    expect(result.selected).toContain('retro-base-branch');
    // Also selects journey routing tests (*/SKILL.md.tmpl matches retro/SKILL.md.tmpl)
    expect(result.selected.length).toBeGreaterThanOrEqual(4);
  });

  test('works with LLM_JUDGE_TOUCHFILES', () => {
    const result = selectTests(['qa/SKILL.md'], LLM_JUDGE_TOUCHFILES);
    expect(result.selected).toContain('qa/SKILL.md workflow');
    expect(result.selected).toContain('qa/SKILL.md health rubric');
    expect(result.selected).toContain('qa/SKILL.md anti-refusal');
    expect(result.selected.length).toBe(3);
  });

  test('SKILL.md.tmpl root template selects root-dependent tests and routing tests', () => {
    const result = selectTests(['SKILL.md.tmpl'], E2E_TOUCHFILES);
    // Should select the 7 tests that depend on root SKILL.md
    expect(result.selected).toContain('skillmd-setup-discovery');
    expect(result.selected).toContain('session-awareness');
    // Also selects journey routing tests (SKILL.md.tmpl in their touchfiles)
    expect(result.selected).toContain('journey-ideation');
    // Should NOT select unrelated non-routing tests
    expect(result.selected).not.toContain('plan-ceo-review');
    expect(result.selected).not.toContain('retro');
  });

  test('global touchfiles work for LLM-judge tests too', () => {
    const result = selectTests(['test/helpers/session-runner.ts'], LLM_JUDGE_TOUCHFILES);
    expect(result.selected.length).toBe(Object.keys(LLM_JUDGE_TOUCHFILES).length);
  });
});

// --- detectBaseBranch ---

describe('detectBaseBranch', () => {
  test('detects local main branch', () => {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'touchfiles-test-'));
    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
    run('git', ['init']);
    run('git', ['config', 'user.email', 'test@test.com']);
    run('git', ['config', 'user.name', 'Test']);
    fs.writeFileSync(path.join(dir, 'test.txt'), 'hello\n');
    run('git', ['add', '.']);
    run('git', ['commit', '-m', 'init']);
    const result = detectBaseBranch(dir);
    // Should find 'main' (or 'master' depending on git default)
    expect(result).toMatch(/^(main|master)$/);
    try { fs.rmSync(dir, { recursive: true, force: true }); } catch {}
  });

  test('returns null for empty repo with no branches', () => {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'touchfiles-test-'));
    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
    run('git', ['init']);
    // No commits = no branches
    const result = detectBaseBranch(dir);
    expect(result).toBeNull();
    try { fs.rmSync(dir, { recursive: true, force: true }); } catch {}
  });

  test('returns null for non-git directory', () => {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'touchfiles-test-'));
    const result = detectBaseBranch(dir);
    expect(result).toBeNull();
    try { fs.rmSync(dir, { recursive: true, force: true }); } catch {}
  });
});

// --- Completeness: every testName in skill-e2e-*.test.ts has a TOUCHFILES entry ---

describe('TOUCHFILES completeness', () => {
  test('every E2E testName has a TOUCHFILES entry', () => {
    // Read all split E2E test files
    const testDir = path.join(ROOT, 'test');
    const e2eFiles = fs.readdirSync(testDir).filter(f => f.startsWith('skill-e2e-') && f.endsWith('.test.ts'));
    let e2eContent = '';
    for (const f of e2eFiles) {
      e2eContent += fs.readFileSync(path.join(testDir, f), 'utf-8') + '\n';
    }
    // Extract all testName: 'value' entries
    const testNameRegex = /testName:\s*['"`]([^'"`]+)['"`]/g;
    const testNames: string[] = [];
    let match;
    while ((match = testNameRegex.exec(e2eContent)) !== null) {
      let name = match[1];
      // Handle template literals like `qa-${label}`; these expand to
      // qa-b6-static, qa-b7-spa, qa-b8-checkout
      if (name.includes('${')) continue; // skip template literals, check expanded forms below
      testNames.push(name);
    }
    // Add the template-expanded testNames from runPlantedBugEval calls
    const plantedBugRegex = /runPlantedBugEval\([^,]+,\s*[^,]+,\s*['"`]([^'"`]+)['"`]\)/g;
    while ((match = plantedBugRegex.exec(e2eContent)) !== null) {
      testNames.push(`qa-${match[1]}`);
    }
    expect(testNames.length).toBeGreaterThan(0);
    const missing = testNames.filter(name => !(name in E2E_TOUCHFILES));
    if (missing.length > 0) {
      throw new Error(
        `E2E tests missing TOUCHFILES entries: ${missing.join(', ')}\n` +
        `Add these to E2E_TOUCHFILES in test/helpers/touchfiles.ts`,
      );
    }
  });

  test('E2E_TIERS covers exactly the same tests as E2E_TOUCHFILES', () => {
    const touchfileKeys = new Set(Object.keys(E2E_TOUCHFILES));
    const tierKeys = new Set(Object.keys(E2E_TIERS));
    const missingFromTiers = [...touchfileKeys].filter(k => !tierKeys.has(k));
    const extraInTiers = [...tierKeys].filter(k => !touchfileKeys.has(k));
    if (missingFromTiers.length > 0) {
      throw new Error(
        `E2E tests missing TIER entries: ${missingFromTiers.join(', ')}\n` +
        `Add these to E2E_TIERS in test/helpers/touchfiles.ts`,
      );
    }
    if (extraInTiers.length > 0) {
      throw new Error(
        `E2E_TIERS has extra entries not in E2E_TOUCHFILES: ${extraInTiers.join(', ')}\n` +
        `Remove these from E2E_TIERS or add to E2E_TOUCHFILES`,
      );
    }
  });

  test('E2E_TIERS only contains valid tier values', () => {
    const validTiers = ['gate', 'periodic'];
    for (const [name, tier] of Object.entries(E2E_TIERS)) {
      if (!validTiers.includes(tier)) {
        throw new Error(`E2E_TIERS['${name}'] has invalid tier '${tier}'. Valid: ${validTiers.join(', ')}`);
      }
    }
  });

  test('every LLM-judge test has a TOUCHFILES entry', () => {
    const llmContent = fs.readFileSync(
      path.join(ROOT, 'test', 'skill-llm-eval.test.ts'),
      'utf-8',
    );
    // Extract test names from addTest({ name: '...' }) calls
    const nameRegex = /name:\s*['"`]([^'"`]+)['"`]/g;
    const testNames: string[] = [];
    let match;
    while ((match = nameRegex.exec(llmContent)) !== null) {
      testNames.push(match[1]);
    }
    // Deduplicate (some tests call addTest with the same name)
    const unique = [...new Set(testNames)];
    expect(unique.length).toBeGreaterThan(0);
    const missing = unique.filter(name => !(name in LLM_JUDGE_TOUCHFILES));
    if (missing.length > 0) {
      throw new Error(
        `LLM-judge tests missing TOUCHFILES entries: ${missing.join(', ')}\n` +
        `Add these to LLM_JUDGE_TOUCHFILES in test/helpers/touchfiles.ts`,
      );
    }
  });
});