gstack/ship/sections/test-coverage.md
Garry Tan 46c1fae7f1 v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) (#1806)
* feat(test): transcript-section-logger + ship-action fingerprint (T10)

Pure-analysis module over a SkillTestResult/NDJSON transcript:
- extractSectionReads(): which sections/*.md a run opened (post-carve check)
- extractShipActions(): observable action fingerprint (merge/test/bump/
  changelog/commit/push/pr) that works on the MONOLITH too, so a baseline
  captured before the carve can detect a sectioned-ship regression
- baseline read/write + compareShipActions() for baseline-first dogf(T10)

Baseline-first answers the Codex outside-voice critique that a logger in the
same PR as the carve is post-failure telemetry without a pre-carve reference.

11 unit tests, all green. Paid monolith baseline capture runs separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(pipeline): section discovery + generation machinery (T9)

- discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl
- gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext
  as shared helpers (processTemplate and the new processSectionTemplate both call
  them, so a sanitization/rewrite fix can't miss sections) [C1]
- processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice),
  parent-skill TemplateContext (skillName pinned to parent, not 'sections', so
  appliesTo gating + tier behave identically), per-host output routing
- --host all now fails the build on ANY host failure, not just claude, so a stale
  external-host output can't slip the freshness gate [Codex outside-voice #9]

Inert until a skill is carved (no sections/ dirs exist yet). Refactor is
output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE.

5 discovery unit tests + 389 gen-skill-docs tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9)

Two install targets cherry-pick SKILL.md and would leave a carved skill's
sections/ behind, 404ing a runtime 'Read sections/<name>.md':
- link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows
  gets a fresh copy on every ./setup)
- kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under
  ~/.kiro, not ~/.codex/~/.claude

codex/factory/opencode link the whole generated dir, so sections ride free.
Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a
skill is carved. Static-tripwire test + windows-fallback invariant green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9)

Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a
tested CLI instead of bash prose the agent re-derives each run.
- classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION
  vs origin/<base>:VERSION vs package.json.version (pure reader)
- write: validated dual-write to VERSION + package.json (FRESH bump)
- repair: DRIFT_STALE_PKG sync, no re-bump
Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays
bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from
skippable prose into code that can't be skipped or misread.

15 tests (exhaustive state matrix + write/repair fs + real-git classify).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(parity): sectioned-skill parity capability — guards the carve (T9)

Carved skills (skeleton + sections/*.md) need parity checks that see relocated
content, or moving a phrase into a section reads as 'lost':
- readSkillForParity(): union skeleton + all sections/*.md
- checkSkillParity sectioned mode: content checks against the union; minBytes/
  maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes
  asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a
  small skeleton would otherwise make the size floor toothless [Codex #12].

Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the
same commit it lands. Monolith path byte-identical (verified: pre-existing
investigate 1.053 ratio drift fails the same with this change stashed).

7 sectioned-parity tests + existing parity tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(ship): carve into skeleton + on-demand sections (Claude) (T9)

ship/SKILL.md drops 167KB → 68.7KB (a ~59% cut to the always-loaded skill) by moving
8 prose-heavy steps into ship/sections/*.md, read on demand:
tests, test-coverage, plan-completion, review-army, greptile, adversarial,
changelog, pr-body. Step 12's version logic now calls the tested
gstack-version-bump CLI instead of inline bash.

Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton +
generated section files) and INLINES the content on every other host, so external
hosts keep the full monolith — verified factory at 162KB with no sections dir.
{{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE
manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures.
Multi-pass resolve expands inlined sections' own resolvers.

Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes
asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/
golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads.
Free suite green except the pre-existing investigate parity drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): manifest-consistency + context-parity + requiredReads helper (T9)

Free deterministic guards for the carve:
- required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the
  mechanical layer-5 check that the agent Read the sections its situation needs
  (required set comes from the fixture, not the passive manifest)
- section-manifest-consistency: 3-tier orphan classification (generated orphan +
  hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and
  pins the PASSIVE-manifest contract (no applies_when/required_for)
- template-context-parity: generated sections have zero unresolved placeholders
  and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW)
  rendered — proving sections resolve with the parent skillName, not 'sections'

16 tests, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): section-loading E2E + idempotency CLI detection (T9)

- skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan
  mode against a fresh version-changing fixture and asserts the agent Read the
  required sections (review-army + changelog). Runs against the INSTALLED skill
  (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface
  [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip.
- skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12
  now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead
  of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a
  gstack-version-bump-write re-bump regression signal.
- touchfiles: register ship-section-loading (periodic) + extend idempotency deps
  with bin/gstack-version-bump + scripts/resolvers/sections.ts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(ship): union-read redaction wiring test for the carve (T9)

main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the
carve, not the skeleton template. Read skeleton + section templates union so the
redaction-wiring assertions follow the relocated content. 9/9 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 12:09:10 -07:00


Step 7: Test Coverage Audit

Dispatch this step as a subagent using the Agent tool with subagent_type: "general-purpose". The subagent runs the coverage audit in a fresh context window — the parent only sees the conclusion, not intermediate file reads. This is context-rot defense.

Subagent prompt: Pass the following instructions to the subagent, with <base> substituted with the base branch:

You are running a ship-workflow test coverage audit. Run git diff <base>...HEAD as needed. Do not commit or push — report only.

100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.

Test Framework Detection

Before analyzing coverage, detect the project's test framework:

  1. Read CLAUDE.md — look for a ## Testing section with test command and framework name. If found, use that as the authoritative source.
  2. If CLAUDE.md has no testing section, auto-detect:
setopt +o nomatch 2>/dev/null || true  # zsh compat
# Detect project runtime
[ -f Gemfile ] && echo "RUNTIME:ruby"
[ -f package.json ] && echo "RUNTIME:node"
[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python"
[ -f go.mod ] && echo "RUNTIME:go"
[ -f Cargo.toml ] && echo "RUNTIME:rust"
# Check for existing test infrastructure
ls jest.config.* vitest.config.* playwright.config.* cypress.config.* .rspec pytest.ini phpunit.xml 2>/dev/null
ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
  3. If no framework is detected: fall through to the Test Framework Bootstrap step (Step 4), which handles full setup.

0. Before/after test count:

# Count test files before any generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l

Store this number for the PR body.

1. Trace every codepath changed using git diff origin/<base>...HEAD:

Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution:

  1. Read the diff. For each changed file, read the full file (not just the diff hunk) to understand context.
  2. Trace data flow. Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch:
    • Where does input come from? (request params, props, database, API call)
    • What transforms it? (validation, mapping, computation)
    • Where does it go? (database write, API response, rendered output, side effect)
    • What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection)
  3. Diagram the execution. For each changed file, draw an ASCII diagram showing:
    • Every function/method that was added or modified
    • Every conditional branch (if/else, switch, ternary, guard clause, early return)
    • Every error path (try/catch, rescue, error boundary, fallback)
    • Every call to another function (trace into it — does IT have untested branches?)
    • Every edge: what happens with null input? Empty array? Invalid type?

This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test.

2. Map user flows, interactions, and error states:

Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through:

  • User flows: What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test.
  • Interaction edge cases: What happens when the user does something unexpected?
    • Double-click/rapid resubmit
    • Navigate away mid-operation (back button, close tab, click another link)
    • Submit with stale data (page sat open for 30 minutes, session expired)
    • Slow connection (API takes 10 seconds — what does the user see?)
    • Concurrent actions (two tabs, same form)
  • Error states the user can see: For every error the code handles, what does the user actually experience?
    • Is there a clear error message or a silent failure?
    • Can the user recover (retry, go back, fix input) or are they stuck?
    • What happens with no network? With a 500 from the API? With invalid data from the server?
  • Empty/zero/boundary states: What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input?

Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else.

3. Check each branch against existing tests:

Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it:

  • Function processPayment() → look for billing.test.ts, billing.spec.ts, test/billing_test.rb
  • An if/else → look for tests covering BOTH the true AND false path
  • An error handler → look for a test that triggers that specific error condition
  • A call to helperFn() that has its own branches → those branches need tests too
  • A user flow → look for an integration or E2E test that walks through the journey
  • An interaction edge case → look for a test that simulates the unexpected action

Quality scoring rubric:

  • ★★★ Tests behavior with edge cases AND error paths
  • ★★ Tests correct behavior, happy path only
  • ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")

E2E Test Decision Matrix

When checking each branch, also determine whether a unit test or E2E/integration test is the right tool:

RECOMMEND E2E (mark as [→E2E] in the diagram):

  • Common user flow spanning 3+ components/services (e.g., signup → verify email → first login)
  • Integration point where mocking hides real failures (e.g., API → queue → worker → DB)
  • Auth/payment/data-destruction flows — too important to trust unit tests alone

RECOMMEND EVAL (mark as [→EVAL] in the diagram):

  • Critical LLM call that needs a quality eval (e.g., prompt change → test output still meets quality bar)
  • Changes to prompt templates, system instructions, or tool definitions

STICK WITH UNIT TESTS:

  • Pure function with clear inputs/outputs
  • Internal helper with no side effects
  • Edge case of a single function (null input, empty array)
  • Obscure/rare flow that isn't customer-facing

REGRESSION RULE (mandatory)

IRON RULE: When the coverage audit identifies a REGRESSION — code that previously worked but the diff broke — a regression test is written immediately. No AskUserQuestion. No skipping. Regressions are the highest-priority test because they prove something broke.

A regression is when:

  • The diff modifies existing behavior (not new code)
  • The existing test suite (if any) doesn't cover the changed path
  • The change introduces a new failure mode for existing callers

When uncertain whether a change is a regression, err on the side of writing the test.

Format: commit as test: regression test for {what broke}

4. Output ASCII coverage diagram:

Include BOTH code paths and user flows in the same diagram. Mark E2E-worthy and eval-worthy paths:

CODE PATHS                                            USER FLOWS
[+] src/services/billing.ts                           [+] Payment checkout
  ├── processPayment()                                  ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
  │   ├── [★★★ TESTED] happy + declined + timeout      ├── [GAP] [→E2E] Double-click submit
  │   ├── [GAP]         Network timeout                 └── [GAP]        Navigate away mid-payment
  │   └── [GAP]         Invalid currency
  └── refundPayment()                                 [+] Error states
      ├── [★★  TESTED] Full refund — :89                ├── [★★  TESTED] Card declined message
      └── [★   TESTED] Partial (non-throw only) — :101  └── [GAP]        Network timeout UX

LLM integration: [GAP] [→EVAL] Prompt template change — needs eval test

COVERAGE: 5/13 paths tested (38%)  |  Code paths: 3/5 (60%)  |  User flows: 2/8 (25%)
QUALITY: ★★★:2 ★★:2 ★:1  |  GAPS: 8 (2 E2E, 1 eval)

Legend: ★★★ behavior + edge + error | ★★ happy path | ★ smoke check
[→E2E] = needs integration test | [→EVAL] = needs LLM eval

Fast path: All paths covered → "Step 7: All new code paths have test coverage ✓" Continue.
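
The COVERAGE line in the diagram can be derived mechanically once the tested and total counts are known. A minimal sketch, assuming integer counts and round-to-nearest percentages; `pct` and `coverage_line` are hypothetical helper names for illustration, not part of gstack:

```bash
# Hypothetical helpers (not part of gstack): build the COVERAGE summary line.

# pct TESTED TOTAL -> rounded integer percentage such as "38%", or "n/a" when TOTAL is 0.
pct() {
  if [ "$2" -eq 0 ]; then
    echo "n/a"
  else
    echo "$(( ($1 * 100 + $2 / 2) / $2 ))%"
  fi
}

# coverage_line CODE_TESTED CODE_TOTAL FLOW_TESTED FLOW_TOTAL
# The overall figure is the sum of code paths and user flows.
coverage_line() {
  printf 'COVERAGE: %s/%s paths tested (%s)  |  Code paths: %s/%s (%s)  |  User flows: %s/%s (%s)\n' \
    "$(( $1 + $3 ))" "$(( $2 + $4 ))" "$(pct "$(( $1 + $3 ))" "$(( $2 + $4 ))")" \
    "$1" "$2" "$(pct "$1" "$2")" \
    "$3" "$4" "$(pct "$3" "$4")"
}

coverage_line 3 5 2 8
# prints: COVERAGE: 5/13 paths tested (38%)  |  Code paths: 3/5 (60%)  |  User flows: 2/8 (25%)
```

An "n/a" result corresponds to the undetermined-percentage case, which the coverage gate skips rather than treating as 0%.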

5. Generate tests for uncovered paths:

If test framework detected (or bootstrapped in Step 4):

  • Prioritize error handlers and edge cases first (happy paths are more likely already tested)
  • Read 2-3 existing test files to match conventions exactly
  • Generate unit tests. Mock all external dependencies (DB, API, Redis).
  • For paths marked [→E2E]: generate integration/E2E tests using the project's E2E framework (Playwright, Cypress, Capybara, etc.)
  • For paths marked [→EVAL]: generate eval tests using the project's eval framework, or flag for manual eval if none exists
  • Write tests that exercise the specific uncovered path with real assertions
  • Run each test. Passes → commit as test: coverage for {feature}
  • Fails → fix once. Still fails → revert, note gap in diagram.

Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap.
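
The pass / fix-once / revert rule for each generated test can be sketched as a small decision helper. This is an illustration under stated assumptions: `settle_generated_test` is a hypothetical name, and the run and fix commands are whatever the project actually uses.

```bash
# Hypothetical helper (not part of gstack): apply the pass / fix-once / revert rule.
# $1 = shell command that runs the generated test
# $2 = shell command that attempts a single fix
# Prints "commit" if the test passes on the first run or after the one fix,
# otherwise prints "revert".
settle_generated_test() {
  if sh -c "$1" >/dev/null 2>&1; then
    echo "commit"
  elif sh -c "$2" >/dev/null 2>&1 && sh -c "$1" >/dev/null 2>&1; then
    echo "commit"
  else
    echo "revert"
  fi
}
```

On "commit" the caller would commit with the `test: coverage for {feature}` message; on "revert" it would discard the generated file and record the gap in the diagram.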

If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."

Diff is test-only changes: Skip Step 7 entirely: "No new application code paths to audit."
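
One way to detect a test-only diff is to check every changed path against the same filename patterns used for the test-file count. A rough sketch, assuming those patterns are an adequate definition of "test file" for the project; `is_test_only_diff` is a hypothetical helper:

```bash
# Hypothetical helper (not part of gstack).
# Reads changed paths on stdin, one per line. Succeeds only if there is at
# least one path and every path looks like a test file by name.
is_test_only_diff() (
  seen=0
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    case ${path##*/} in
      *.test.*|*.spec.*|*_test.*|*_spec.*) seen=1 ;;
      *) exit 1 ;;   # an application file: not a test-only diff
    esac
  done
  [ "$seen" -eq 1 ]
)

# Possible usage (not run here):
#   git diff --name-only "origin/<base>...HEAD" | is_test_only_diff \
#     && echo "No new application code paths to audit."
```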

6. After-count and coverage summary:

# Count test files after generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l

For PR body: Tests: {before} → {after} (+{delta} new)
Coverage line: Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.
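
The before/after counts can be turned into the PR-body line with simple arithmetic. A minimal sketch; `test_count_line` is a hypothetical helper name, and the counts are assumed to be the two `wc -l` results:

```bash
# Hypothetical helper (not part of gstack): format the test-count line.
# $1 = test-file count before generation, $2 = count after.
# Arithmetic expansion also strips the leading padding some wc builds emit.
test_count_line() {
  printf 'Tests: %d → %d (+%d new)\n' "$(( $1 ))" "$(( $2 ))" "$(( $2 - $1 ))"
}

test_count_line 42 47   # prints: Tests: 42 → 47 (+5 new)
```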

7. Coverage gate:

Before proceeding, check CLAUDE.md for a ## Test Coverage section with Minimum: and Target: fields. If found, use those percentages. Otherwise use defaults: Minimum = 60%, Target = 80%.
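
Reading the thresholds could look like the sketch below. It assumes the section contains plain lines of the form `Minimum: 60%` and `Target: 80%`; the exact layout inside CLAUDE.md is not specified here, and `read_threshold` is a hypothetical helper.

```bash
# Hypothetical helper (not part of gstack).
# read_threshold FIELD DEFAULT [FILE]
# Prints the integer after "FIELD:" inside the "## Test Coverage" section,
# or DEFAULT when the file, the section, or the field is missing.
read_threshold() (
  file=${3:-CLAUDE.md}
  value=$(awk -v field="$1" '
    /^## / { in_section = ($0 ~ /^## Test Coverage[ \t]*$/); next }
    in_section && index($0, field ":") {
      line = $0
      sub(".*" field ":[ \t]*", "", line)
      if (match(line, /^[0-9]+/)) { print substr(line, RSTART, RLENGTH); exit }
    }
  ' "$file" 2>/dev/null)
  echo "${value:-$2}"
)

# Possible usage with the documented defaults (not run here):
#   minimum=$(read_threshold Minimum 60)
#   target=$(read_threshold Target 80)
```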

Using the coverage percentage from the diagram in substep 4 (the COVERAGE: X/Y (Z%) line):

  • >= target: Pass. "Coverage gate: PASS ({X}%)." Continue.

  • >= minimum, < target: Use AskUserQuestion:

    • "AI-assessed coverage is {X}%. {N} code paths are untested. Target is {target}%."
    • RECOMMENDATION: Choose A because untested code paths are where production bugs hide.
    • Options: A) Generate more tests for remaining gaps (recommended) B) Ship anyway — I accept the coverage risk C) These paths don't need tests — mark as intentionally uncovered
    • If A: Loop back to substep 5 (generate tests) targeting the remaining gaps. After the second pass, if coverage is still below target, present AskUserQuestion again with updated numbers. Maximum 2 generation passes total.
    • If B: Continue. Include in PR body: "Coverage gate: {X}% — user accepted risk."
    • If C: Continue. Include in PR body: "Coverage gate: {X}% — {N} paths intentionally uncovered."
  • < minimum: Use AskUserQuestion:

    • "AI-assessed coverage is critically low ({X}%). {N} of {M} code paths have no tests. Minimum threshold is {minimum}%."
    • RECOMMENDATION: Choose A because coverage below {minimum}% leaves too many changed code paths untested to ship safely.
    • Options: A) Generate tests for remaining gaps (recommended) B) Override — ship with low coverage (I understand the risk)
    • If A: Loop back to substep 5. Maximum 2 passes. If still below minimum after 2 passes, present the override choice again.
    • If B: Continue. Include in PR body: "Coverage gate: OVERRIDDEN at {X}%."

Coverage percentage undetermined: If the coverage diagram doesn't produce a clear numeric percentage (ambiguous output, parse error), skip the gate with: "Coverage gate: could not determine percentage — skipping." Do not default to 0% or block.

Test-only diffs: Skip the gate, just as Step 7 itself is skipped when the diff contains only test changes.

100% coverage: "Coverage gate: PASS (100%)." Continue.
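
The gate's branching can be summarized as a three-way comparison plus a skip case. A minimal sketch, assuming an integer percentage; `coverage_gate` and its outcome labels are hypothetical names chosen for illustration:

```bash
# Hypothetical helper (not part of gstack).
# coverage_gate PCT MINIMUM TARGET
# Prints one of:
#   PASS               PCT >= TARGET
#   ASK_BELOW_TARGET   MINIMUM <= PCT < TARGET (ask: generate, ship anyway, or mark intentional)
#   ASK_BELOW_MINIMUM  PCT < MINIMUM (ask: generate or override)
#   SKIP               PCT is empty or not an integer (never treated as 0%)
coverage_gate() {
  case $1 in
    ''|*[!0-9]*) echo "SKIP"; return 0 ;;
  esac
  if [ "$1" -ge "$3" ]; then
    echo "PASS"
  elif [ "$1" -ge "$2" ]; then
    echo "ASK_BELOW_TARGET"
  else
    echo "ASK_BELOW_MINIMUM"
  fi
}

coverage_gate 38 60 80   # prints: ASK_BELOW_MINIMUM
```

Either ASK outcome leads to at most two generation passes before the question is presented again.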

Test Plan Artifact

After producing the coverage diagram, write a test plan artifact so /qa and /qa-only can consume it:

eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" && mkdir -p ~/.gstack/projects/"$SLUG"
USER=$(whoami)
DATETIME=$(date +%Y%m%d-%H%M%S)

Write to ~/.gstack/projects/{slug}/{user}-{branch}-ship-test-plan-{datetime}.md:

# Test Plan
Generated by /ship on {date}
Branch: {branch}
Repo: {owner/repo}

## Affected Pages/Routes
- {URL path} — {what to test and why}

## Key Interactions to Verify
- {interaction description} on {page}

## Edge Cases
- {edge case} on {page}

## Critical Paths
- {end-to-end flow that must work}
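
The artifact path combines the slug, user, branch, and timestamp gathered above. The sketch below shows one way to assemble it. Replacing `/` in the branch name with `-` is an assumption made here so that a branch such as `feat/x` does not create a subdirectory; the source does not say how such branches are handled, and `test_plan_path` is a hypothetical helper.

```bash
# Hypothetical helper (not part of gstack).
# test_plan_path SLUG USER BRANCH DATETIME
# ASSUMPTION: "/" in the branch name is replaced with "-" to keep one flat filename.
test_plan_path() (
  safe_branch=$(printf '%s' "$3" | tr '/' '-')
  printf '%s/.gstack/projects/%s/%s-%s-ship-test-plan-%s.md\n' \
    "$HOME" "$1" "$2" "$safe_branch" "$4"
)

# Possible usage (not run here):
#   test_plan_path "$SLUG" "$USER" "$(git rev-parse --abbrev-ref HEAD)" "$DATETIME"
```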

After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it): {"coverage_pct":N,"gaps":N,"diagram":"<full markdown coverage diagram for PR body>","tests_added":["path",...]}

Parent processing:

  1. Read the subagent's final output. Parse the LAST line as JSON.
  2. Store coverage_pct (for Step 20 metrics), gaps (user summary), tests_added (for the commit).
  3. Embed diagram verbatim in the PR body's ## Test Coverage section (Step 19).
  4. Print a one-line summary: Coverage: {coverage_pct}%, {gaps} gaps. {tests_added.length} tests added.

If the subagent fails, times out, or returns invalid JSON: Fall back to running the audit inline in the parent. Do not block /ship on subagent failure — partial results are better than none.
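
The parent-side handling can be sketched as: take the last non-empty line, try to parse it, and signal the inline fallback on any failure. This sketch assumes `jq` is installed, which the source does not state; `summarize_subagent_output` is a hypothetical helper.

```bash
# Hypothetical helper (not part of gstack). ASSUMPTION: jq is available.
# Reads the subagent's full output on stdin. On success prints the one-line
# summary; on any parse problem prints a fallback notice to stderr and fails.
summarize_subagent_output() (
  # Keep only the last non-empty line of the transcript.
  last_line=$(awk 'NF { last = $0 } END { print last }')
  # select(...) yields nothing unless coverage_pct is a number, in which case
  # jq -e exits non-zero; malformed JSON also makes jq fail.
  if summary=$(printf '%s\n' "$last_line" | jq -e -r '
      select((.coverage_pct | type) == "number")
      | "Coverage: \(.coverage_pct)%, \(.gaps) gaps. \(.tests_added | length) tests added."
    ' 2>/dev/null); then
    printf '%s\n' "$summary"
  else
    echo "FALLBACK: run the audit inline" >&2
    exit 1
  fi
)
```

A non-zero status covers every failure case the same way (missing output, timeout leaving no JSON line, invalid JSON), so the caller needs only one branch for the inline fallback.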