<!-- AUTO-GENERATED from test-coverage.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Step 7: Test Coverage Audit

**Dispatch this step as a subagent** using the Agent tool with `subagent_type: "general-purpose"`. The subagent runs the coverage audit in a fresh context window — the parent only sees the conclusion, not intermediate file reads. This is context-rot defense.

**Subagent prompt:** Pass the following instructions to the subagent, with `<base>` substituted with the base branch:

> You are running a ship-workflow test coverage audit. Run `git diff <base>...HEAD` as needed. Do not commit or push — report only.
>
> 100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.

### Test Framework Detection

Before analyzing coverage, detect the project's test framework:

1. **Read CLAUDE.md** — look for a `## Testing` section with test command and framework name. If found, use that as the authoritative source.
2. **If CLAUDE.md has no testing section, auto-detect:**

```bash
setopt +o nomatch 2>/dev/null || true  # zsh compat
# Detect project runtime
[ -f Gemfile ] && echo "RUNTIME:ruby"
[ -f package.json ] && echo "RUNTIME:node"
[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python"
[ -f go.mod ] && echo "RUNTIME:go"
[ -f Cargo.toml ] && echo "RUNTIME:rust"
# Check for existing test infrastructure
ls jest.config.* vitest.config.* playwright.config.* cypress.config.* .rspec pytest.ini phpunit.xml 2>/dev/null
ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
```

3. **If no framework detected:** fall through to the Test Framework Bootstrap step (Step 4), which handles full setup.

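One way to turn the config files listed above into a framework name is a small lookup. This is a minimal sketch, not gstack's implementation: `framework_for_config` and `detect_framework` are hypothetical helper names, and the mapping covers only the config files named in the detection commands.

```bash
# Hypothetical helper: map a test config filename to a framework name.
# Prints nothing and returns 1 when the file is not recognized.
framework_for_config() {
  case "$1" in
    jest.config.*)       echo "jest" ;;
    vitest.config.*)     echo "vitest" ;;
    playwright.config.*) echo "playwright" ;;
    cypress.config.*)    echo "cypress" ;;
    .rspec)              echo "rspec" ;;
    pytest.ini)          echo "pytest" ;;
    phpunit.xml)         echo "phpunit" ;;
    *)                   return 1 ;;
  esac
}

# Hypothetical helper: report the first recognized config file in the
# current directory, or return 1 when none is present.
detect_framework() {
  for f in jest.config.* vitest.config.* playwright.config.* cypress.config.* \
           .rspec pytest.ini phpunit.xml; do
    [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
    framework_for_config "$f" && return 0
  done
  return 1
}
```
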
**0. Before/after test count:**

```bash
# Count test files before any generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
```

Store this number for the PR body.

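A variant of the count that prunes `node_modules` during the walk, rather than filtering it afterwards, and that is easy to store in a variable. This is a sketch under assumptions: `count_test_files` and `TESTS_BEFORE` are hypothetical names, and unlike the command above it counts regular files only.

```bash
# Hypothetical helper: count test files under a directory (default: "."),
# skipping anything inside a node_modules directory.
count_test_files() {
  find "${1:-.}" -type d -name node_modules -prune -o -type f \
    \( -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' \) \
    -print | wc -l | tr -d ' '   # tr strips the padding some wc builds emit
}

# Usage (hypothetical variable name):
#   TESTS_BEFORE=$(count_test_files .)
```
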
**1. Trace every codepath changed** using `git diff origin/<base>...HEAD`:

Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution:

1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context.
2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch:
   - Where does input come from? (request params, props, database, API call)
   - What transforms it? (validation, mapping, computation)
   - Where does it go? (database write, API response, rendered output, side effect)
   - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection)
3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing:
   - Every function/method that was added or modified
   - Every conditional branch (if/else, switch, ternary, guard clause, early return)
   - Every error path (try/catch, rescue, error boundary, fallback)
   - Every call to another function (trace into it — does IT have untested branches?)
   - Every edge: what happens with null input? Empty array? Invalid type?

This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test.

**2. Map user flows, interactions, and error states:**

Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through:

- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test.
- **Interaction edge cases:** What happens when the user does something unexpected?
  - Double-click/rapid resubmit
  - Navigate away mid-operation (back button, close tab, click another link)
  - Submit with stale data (page sat open for 30 minutes, session expired)
  - Slow connection (API takes 10 seconds — what does the user see?)
  - Concurrent actions (two tabs, same form)
- **Error states the user can see:** For every error the code handles, what does the user actually experience?
  - Is there a clear error message or a silent failure?
  - Can the user recover (retry, go back, fix input) or are they stuck?
  - What happens with no network? With a 500 from the API? With invalid data from the server?
- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input?

Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else.

**3. Check each branch against existing tests:**

Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it:
- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb`
- An if/else → look for tests covering BOTH the true AND false path
- An error handler → look for a test that triggers that specific error condition
- A call to `helperFn()` that has its own branches → those branches need tests too
- A user flow → look for an integration or E2E test that walks through the journey
- An interaction edge case → look for a test that simulates the unexpected action

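For the function-name case, the search can be approximated with `find` and `grep`. This is a rough sketch: `tests_mentioning` is a hypothetical helper, and a file that mentions a symbol is only a candidate. It still has to be read to confirm that it exercises the branch in question.

```bash
# Hypothetical helper: list test files under a directory (default: ".")
# whose contents mention a symbol as a fixed string.
tests_mentioning() {
  symbol=$1
  dir=${2:-.}
  find "$dir" -type d -name node_modules -prune -o -type f \
    \( -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' \) \
    -exec grep -lF -- "$symbol" {} +
}

# Usage:
#   tests_mentioning processPayment .
```
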
Quality scoring rubric:
- ★★★ Tests behavior with edge cases AND error paths
- ★★ Tests correct behavior, happy path only
- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")

### E2E Test Decision Matrix

When checking each branch, also determine whether a unit test or E2E/integration test is the right tool:

**RECOMMEND E2E (mark as [→E2E] in the diagram):**
- Common user flow spanning 3+ components/services (e.g., signup → verify email → first login)
- Integration point where mocking hides real failures (e.g., API → queue → worker → DB)
- Auth/payment/data-destruction flows — too important to trust unit tests alone

**RECOMMEND EVAL (mark as [→EVAL] in the diagram):**
- Critical LLM call that needs a quality eval (e.g., prompt change → test output still meets quality bar)
- Changes to prompt templates, system instructions, or tool definitions

**STICK WITH UNIT TESTS:**
- Pure function with clear inputs/outputs
- Internal helper with no side effects
- Edge case of a single function (null input, empty array)
- Obscure/rare flow that isn't customer-facing

### REGRESSION RULE (mandatory)

**IRON RULE:** When the coverage audit identifies a REGRESSION — code that previously worked but the diff broke — write a regression test immediately. No AskUserQuestion. No skipping. Regressions are the highest-priority test because they prove something broke.

A regression is when:
- The diff modifies existing behavior (not new code)
- The existing test suite (if any) doesn't cover the changed path
- The change introduces a new failure mode for existing callers

When uncertain whether a change is a regression, err on the side of writing the test.

Format: commit as `test: regression test for {what broke}`

**4. Output ASCII coverage diagram:**

Include BOTH code paths and user flows in the same diagram. Mark E2E-worthy and eval-worthy paths:

```
CODE PATHS                                            USER FLOWS
[+] src/services/billing.ts                           [+] Payment checkout
  ├── processPayment()                                  ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
  │   ├── [★★★ TESTED] happy + declined + timeout       ├── [GAP] [→E2E] Double-click submit
  │   ├── [GAP] Network timeout                         └── [GAP] Navigate away mid-payment
  │   └── [GAP] Invalid currency
  └── refundPayment()                                 [+] Error states
      ├── [★★ TESTED] Full refund — :89                 ├── [★★ TESTED] Card declined message
      └── [★ TESTED] Partial (non-throw only) — :101    └── [GAP] Network timeout UX

LLM integration: [GAP] [→EVAL] Prompt template change — needs eval test

COVERAGE: 5/13 paths tested (38%) | Code paths: 3/5 (60%) | User flows: 2/8 (25%)
QUALITY: ★★★:2 ★★:2 ★:1 | GAPS: 8 (2 E2E, 1 eval)
```

Legend: ★★★ behavior + edge + error | ★★ happy path | ★ smoke check
[→E2E] = needs integration test | [→EVAL] = needs LLM eval

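The percentages in the `COVERAGE:` line follow from the counts by integer arithmetic. This is a minimal sketch, assuming rounding to the nearest whole percent (the rounding rule is not stated above); `coverage_pct` is a hypothetical helper.

```bash
# Hypothetical helper: percentage of tested paths, rounded to the nearest
# whole number. Prints nothing and returns 1 when either argument is not a
# plain non-negative integer or the total is zero.
coverage_pct() {
  tested=$1
  total=$2
  for n in "$tested" "$total"; do
    case "$n" in ''|*[!0-9]*) return 1 ;; esac
  done
  [ "$total" -gt 0 ] || return 1
  echo $(( (tested * 100 + total / 2) / total ))
}

# With the counts from the sample diagram:
#   coverage_pct 5 13   # prints 38
#   coverage_pct 3 5    # prints 60
#   coverage_pct 2 8    # prints 25
```
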
**Fast path:** All paths covered → "Step 7: All new code paths have test coverage ✓" Continue.

**5. Generate tests for uncovered paths:**

If test framework detected (or bootstrapped in Step 4):
- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
- Read 2-3 existing test files to match conventions exactly
- Generate unit tests. Mock all external dependencies (DB, API, Redis).
- For paths marked [→E2E]: generate integration/E2E tests using the project's E2E framework (Playwright, Cypress, Capybara, etc.)
- For paths marked [→EVAL]: generate eval tests using the project's eval framework, or flag for manual eval if none exists
- Write tests that exercise the specific uncovered path with real assertions
- Run each test. Passes → commit as `test: coverage for {feature}`
- Fails → fix once. Still fails → revert, note gap in diagram.

Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap.

If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."

**Diff is test-only changes:** Skip Step 7 entirely: "No new application code paths to audit."

**6. After-count and coverage summary:**

```bash
# Count test files after generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
```

For PR body: `Tests: {before} → {after} (+{delta} new)`
Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`

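Formatting the PR-body count line from the two stored numbers can be sketched as follows. `tests_delta_line` is a hypothetical helper, and it assumes the after-count is not lower than the before-count.

```bash
# Hypothetical helper: format the test-count line for the PR body from the
# before-count and the after-count.
tests_delta_line() {
  before=$1
  after=$2
  printf 'Tests: %s → %s (+%s new)\n' "$before" "$after" "$(( after - before ))"
}

# Usage:
#   tests_delta_line 12 15   # prints: Tests: 12 → 15 (+3 new)
```
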
**7. Coverage gate:**

Before proceeding, check CLAUDE.md for a `## Test Coverage` section with `Minimum:` and `Target:` fields. If found, use those percentages. Otherwise use defaults: Minimum = 60%, Target = 80%.

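Reading the two thresholds with a fallback to the defaults might look like this. It is a sketch under a stated assumption: the field lines are taken to look like `Minimum: 60%` and `Target: 80%`, which is a guess at the layout, and `coverage_thresholds` is a hypothetical helper.

```bash
# Hypothetical helper: print "<minimum> <target>" read from the
# "## Test Coverage" section of a file (default: CLAUDE.md), falling back
# to 60 and 80 when the file, the section, or a field is missing.
# Assumes field lines such as "Minimum: 60%" (the exact layout is a guess).
coverage_thresholds() {
  file=${1:-CLAUDE.md}
  min=60
  target=80
  if [ -f "$file" ]; then
    # Keep only the lines from the "## Test Coverage" heading up to the
    # next "## " heading.
    section=$(awk '/^## /{on = ($0 ~ /^## Test Coverage *$/)} on' "$file")
    m=$(printf '%s\n' "$section" | sed -n 's/.*Minimum:[^0-9]*\([0-9][0-9]*\).*/\1/p' | head -n 1)
    t=$(printf '%s\n' "$section" | sed -n 's/.*Target:[^0-9]*\([0-9][0-9]*\).*/\1/p' | head -n 1)
    [ -n "$m" ] && min=$m
    [ -n "$t" ] && target=$t
  fi
  echo "$min $target"
}
```
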
Using the coverage percentage from the diagram in substep 4 (the `COVERAGE: X/Y (Z%)` line):

- **>= target:** Pass. "Coverage gate: PASS ({X}%)." Continue.
- **>= minimum, < target:** Use AskUserQuestion:
  - "AI-assessed coverage is {X}%. {N} code paths are untested. Target is {target}%."
  - RECOMMENDATION: Choose A because untested code paths are where production bugs hide.
  - Options:
    A) Generate more tests for remaining gaps (recommended)
    B) Ship anyway — I accept the coverage risk
    C) These paths don't need tests — mark as intentionally uncovered
  - If A: Loop back to substep 5 (generate tests) targeting the remaining gaps. After second pass, if still below target, present AskUserQuestion again with updated numbers. Maximum 2 generation passes total.
  - If B: Continue. Include in PR body: "Coverage gate: {X}% — user accepted risk."
  - If C: Continue. Include in PR body: "Coverage gate: {X}% — {N} paths intentionally uncovered."

- **< minimum:** Use AskUserQuestion:
  - "AI-assessed coverage is critically low ({X}%). {N} of {M} code paths have no tests. Minimum threshold is {minimum}%."
  - RECOMMENDATION: Choose A because coverage below {minimum}% leaves too many code paths untested.
  - Options:
    A) Generate tests for remaining gaps (recommended)
    B) Override — ship with low coverage (I understand the risk)
  - If A: Loop back to substep 5. Maximum 2 passes. If still below minimum after 2 passes, present the override choice again.
  - If B: Continue. Include in PR body: "Coverage gate: OVERRIDDEN at {X}%."

**Coverage percentage undetermined:** If the coverage diagram doesn't produce a clear numeric percentage (ambiguous output, parse error), **skip the gate** with: "Coverage gate: could not determine percentage — skipping." Do not default to 0% or block.

**Test-only diffs:** Skip the gate (same as the existing fast-path).

**100% coverage:** "Coverage gate: PASS (100%)." Continue.

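The three-way decision, plus the skip rule for an undetermined percentage, can be sketched as a small classifier. `coverage_gate` and its result labels are hypothetical names; the thresholds default to the 60/80 values given above.

```bash
# Hypothetical helper: classify a coverage percentage against the
# thresholds. Prints PASS, ASK_BELOW_TARGET, ASK_BELOW_MINIMUM, or SKIP.
coverage_gate() {
  pct=$1
  min=${2:-60}
  target=${3:-80}
  case "$pct" in
    ''|*[!0-9]*) echo "SKIP"; return 0 ;;   # undetermined: never treat as 0%
  esac
  if [ "$pct" -ge "$target" ]; then
    echo "PASS"
  elif [ "$pct" -ge "$min" ]; then
    echo "ASK_BELOW_TARGET"    # offer options A, B, C
  else
    echo "ASK_BELOW_MINIMUM"   # offer options A, B
  fi
}

# Usage:
#   coverage_gate 38        # prints ASK_BELOW_MINIMUM
#   coverage_gate 70 60 80  # prints ASK_BELOW_TARGET
```
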
### Test Plan Artifact

After producing the coverage diagram, write a test plan artifact so `/qa` and `/qa-only` can consume it:

```bash
# gstack-slug prints shell assignments (including SLUG); quote it when used in a path
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" && mkdir -p ~/.gstack/projects/"$SLUG"
USER=$(whoami)
DATETIME=$(date +%Y%m%d-%H%M%S)
```

Write to `~/.gstack/projects/{slug}/{user}-{branch}-ship-test-plan-{datetime}.md`:

```markdown
# Test Plan
Generated by /ship on {date}
Branch: {branch}
Repo: {owner/repo}

## Affected Pages/Routes
- {URL path} — {what to test and why}

## Key Interactions to Verify
- {interaction description} on {page}

## Edge Cases
- {edge case} on {page}

## Critical Paths
- {end-to-end flow that must work}
```
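
Composing the artifact path from its parts can be sketched as below. `test_plan_path` is a hypothetical helper, and replacing `/` in the branch name is an assumption added here so that a branch such as `feat/x` does not produce a nested directory; the text above does not say how such branch names are handled.

```bash
# Hypothetical helper: build the test plan path from slug, user, branch,
# and datetime. Replacing "/" in the branch name is an assumption.
test_plan_path() {
  slug=$1
  user=$2
  branch=$(printf '%s' "$3" | tr '/' '-')
  datetime=$4
  printf '%s/.gstack/projects/%s/%s-%s-ship-test-plan-%s.md\n' \
    "$HOME" "$slug" "$user" "$branch" "$datetime"
}

# Usage:
#   test_plan_path "$SLUG" "$USER" "$(git branch --show-current)" "$DATETIME"
```
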
>
> After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
> `{"coverage_pct":N,"gaps":N,"diagram":"<full markdown coverage diagram for PR body>","tests_added":["path",...]}`

**Parent processing:**

1. Read the subagent's final output. Parse the LAST line as JSON.
2. Store `coverage_pct` (for Step 20 metrics), `gaps` (user summary), `tests_added` (for the commit).
3. Embed `diagram` verbatim in the PR body's `## Test Coverage` section (Step 19).
4. Print a one-line summary: `Coverage: {coverage_pct}%, {gaps} gaps. {tests_added.length} tests added.`

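Pulling the numeric fields out of the last line, and signalling when the fallback is needed, can be sketched as follows. This is a rough stand-in, not a JSON parser: `last_line_field` and `coverage_summary` are hypothetical helpers, the `sed` pattern only handles plain integer fields, and the `tests_added` count is left out because counting array entries reliably needs a real parser.

```bash
# Hypothetical helper: read stdin, take the final line, and print the value
# of one integer field. Prints nothing when the field is absent.
last_line_field() {
  field=$1
  tail -n 1 | sed -n "s/.*\"$field\":[ ]*\([0-9][0-9]*\).*/\1/p"
}

# Hypothetical helper: print a one-line summary from the subagent output,
# or return 1 so the caller can fall back to the inline audit.
coverage_summary() {
  output=$1
  pct=$(printf '%s\n' "$output" | last_line_field coverage_pct)
  gaps=$(printf '%s\n' "$output" | last_line_field gaps)
  [ -n "$pct" ] && [ -n "$gaps" ] || return 1
  echo "Coverage: ${pct}%, ${gaps} gaps."
}
```
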
**If the subagent fails, times out, or returns invalid JSON:** Fall back to running the audit inline in the parent. Do not block /ship on subagent failure — partial results are better than none.

---