From 8c44ec70c567c88592384a1e7712014ce2a253e0 Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Fri, 20 Mar 2026 07:57:10 -0700
Subject: [PATCH] refactor: ship uses shared test coverage audit

Replace 135 lines of inline Step 3.4 methodology with
{{TEST_COVERAGE_AUDIT_SHIP}}. Functionally identical output plus:

- E2E test decision matrix (marks paths needing E2E vs unit)
- Eval recommendations for LLM prompt changes
- Regression detection iron rule
- Test framework detection via CLAUDE.md first
- Test plan artifact for /qa consumption

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .agents/skills/gstack-ship/SKILL.md | 104 +++++++++++++++++++--
 ship/SKILL.md                       | 104 +++++++++++++++++++--
 ship/SKILL.md.tmpl                  | 134 +---------------------------
 3 files changed, 195 insertions(+), 147 deletions(-)

diff --git a/.agents/skills/gstack-ship/SKILL.md b/.agents/skills/gstack-ship/SKILL.md
index 8b66d09e..d81eb301 100644
--- a/.agents/skills/gstack-ship/SKILL.md
+++ b/.agents/skills/gstack-ship/SKILL.md
@@ -574,6 +574,27 @@ If multiple suites need to run, run them sequentially (each needs a test lane).
 
 100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.
 
+### Test Framework Detection
+
+Before analyzing coverage, detect the project's test framework:
+
+1. **Read CLAUDE.md** — look for a `## Testing` section with test command and framework name. If found, use that as the authoritative source (see the sketch after this list).
+2. **If CLAUDE.md has no testing section, auto-detect:**
+
+```bash
+# Detect project runtime
+[ -f Gemfile ] && echo "RUNTIME:ruby"
+[ -f package.json ] && echo "RUNTIME:node"
+{ [ -f requirements.txt ] || [ -f pyproject.toml ]; } && echo "RUNTIME:python"
+[ -f go.mod ] && echo "RUNTIME:go"
+[ -f Cargo.toml ] && echo "RUNTIME:rust"
+# Check for existing test infrastructure
+ls jest.config.* vitest.config.* playwright.config.* cypress.config.* .rspec pytest.ini phpunit.xml 2>/dev/null
+ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
+```
+
+3. **If no framework detected:** fall through to the Test Framework Bootstrap step (Step 2.5), which handles full setup.
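+
+A minimal sketch of the CLAUDE.md lookup, assuming the `## Testing` section names the runner somewhere in its body (the heading text and the runner list are assumptions; adjust per project):
+
+```bash
+# Grab the body of "## Testing" (everything up to the next ## heading)
+TESTING=$(awk '/^## Testing/{f=1; next} /^## /{f=0} f' CLAUDE.md)
+# Pull the first recognizable test command out of that section
+echo "$TESTING" | grep -oE 'npm test|yarn test|pnpm test|bundle exec rspec|pytest|go test|cargo test' | head -1
+```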
+
 **0. Before/after test count:**
 
 ```bash
@@ -636,9 +657,41 @@ Quality scoring rubric:
 - ★★ Tests correct behavior, happy path only
 - ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")
 
+### E2E Test Decision Matrix
+
+When checking each branch, also determine whether a unit test or E2E/integration test is the right tool:
+
+**RECOMMEND E2E (mark as [→E2E] in the diagram):**
+- Common user flow spanning 3+ components/services (e.g., signup → verify email → first login)
+- Integration point where mocking hides real failures (e.g., API → queue → worker → DB)
+- Auth/payment/data-destruction flows — too important to trust unit tests alone
+
+**RECOMMEND EVAL (mark as [→EVAL] in the diagram):**
+- Critical LLM call that needs a quality eval (e.g., prompt change → test output still meets quality bar)
+- Changes to prompt templates, system instructions, or tool definitions
+
+**STICK WITH UNIT TESTS:**
+- Pure function with clear inputs/outputs
+- Internal helper with no side effects
+- Edge case of a single function (null input, empty array)
+- Obscure/rare flow that isn't customer-facing
+
+### REGRESSION RULE (mandatory)
+
+**IRON RULE:** When the coverage audit identifies a REGRESSION — code that previously worked but the diff broke — a regression test is written immediately. No AskUserQuestion. No skipping. Regressions are the highest-priority test because they prove something broke.
+
+A regression is when:
+- The diff modifies existing behavior (not new code)
+- The existing test suite (if any) doesn't cover the changed path
+- The change introduces a new failure mode for existing callers
+
+When uncertain whether a change is a regression, err on the side of writing the test.
+
+Format: commit as `test: regression test for {what broke}`
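+
+A rough sketch for surfacing regression candidates, assuming the stack's base is `origin/main` (substitute the real parent branch):
+
+```bash
+# Modified (M) files are regression candidates; added (A) files are just new code
+git diff --name-status origin/main...HEAD | awk '$1 == "M" {print $2}' | while read -r f; do
+  base=$(basename "${f%.*}")
+  # Flag modified files that no test/spec file mentions anywhere
+  grep -rl "$base" --include='*test*' --include='*spec*' . 2>/dev/null | grep -v node_modules | grep -q . \
+    || echo "REGRESSION CANDIDATE: $f (no test references it)"
+done
+```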
+
 **4. Output ASCII coverage diagram:**
 
-Include BOTH code paths and user flows in the same diagram:
+Include BOTH code paths and user flows in the same diagram. Mark E2E-worthy and eval-worthy paths:
 
 ```
 CODE PATH COVERAGE
@@ -659,9 +712,9 @@ USER FLOW COVERAGE
 [+] Payment checkout flow
  │
  ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
- ├── [GAP] Double-click submit — NO TEST
- ├── [GAP] Navigate away during payment — NO TEST
- └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
+ ├── [GAP] [→E2E] Double-click submit — needs E2E, not just unit
+ ├── [GAP] Navigate away during payment — unit test sufficient
+ └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
 
 [+] Error states
  │
@@ -669,12 +722,16 @@ USER FLOW COVERAGE
 ├── [★★ TESTED] Card declined message — billing.test.ts:58
 ├── [GAP] Network timeout UX (what does user see?) — NO TEST
 └── [GAP] Empty cart submission — NO TEST
+
+[+] LLM integration
+ │
+ └── [GAP] [→EVAL] Prompt template change — needs eval test
 
 ─────────────────────────────────
-COVERAGE: 5/12 paths tested (42%)
+COVERAGE: 6/13 paths tested (46%)
 Code paths: 3/5 (60%)
- User flows: 2/7 (29%)
+ User flows: 3/8 (38%)
 QUALITY: ★★★: 2 ★★: 2 ★: 1
-GAPS: 7 paths need tests
+GAPS: 7 paths need tests (1 needs E2E, 1 needs eval)
 ─────────────────────────────────
 ```
@@ -686,6 +743,8 @@ If test framework detected (or bootstrapped in Step 2.5):
 - Prioritize error handlers and edge cases first (happy paths are more likely already tested)
 - Read 2-3 existing test files to match conventions exactly
 - Generate unit tests. Mock all external dependencies (DB, API, Redis).
+- For paths marked [→E2E]: generate integration/E2E tests using the project's E2E framework (Playwright, Cypress, Capybara, etc.)
+- For paths marked [→EVAL]: generate eval tests using the project's eval framework, or flag for manual eval if none exists
 - Write tests that exercise the specific uncovered path with real assertions
 - Run each test. Passes → commit as `test: coverage for {feature}`
 - Fails → fix once. Still fails → revert, note gap in diagram.
@@ -706,6 +765,37 @@ find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec
 
 For PR body: `Tests: {before} → {after} (+{delta} new)`
 Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`
+
+### Test Plan Artifact
+
+After producing the coverage diagram, write a test plan artifact so `/qa` and `/qa-only` can consume it:
+
+```bash
+source <(~/.codex/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
+USER=$(whoami)
+DATETIME=$(date +%Y%m%d-%H%M%S)
+```
+
+Write to `~/.gstack/projects/{slug}/{user}-{branch}-ship-test-plan-{datetime}.md`:
+
+```markdown
+# Test Plan
+Generated by /ship on {date}
+Branch: {branch}
+Repo: {owner/repo}
+
+## Affected Pages/Routes
+- {URL path} — {what to test and why}
+
+## Key Interactions to Verify
+- {interaction description} on {page}
+
+## Edge Cases
+- {edge case} on {page}
+
+## Critical Paths
+- {end-to-end flow that must work}
+```
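+
+A minimal sketch of writing the artifact, assuming the `gh` CLI is available and reusing `$SLUG`, `$USER`, and `$DATETIME` from the snippet above; the `tr` call guards against `/` in branch names. Fill the four sections from the audit diagram:
+
+```bash
+BRANCH=$(git branch --show-current | tr '/' '-')
+REPO=$(gh repo view --json nameWithOwner -q .nameWithOwner)
+cat > ~/.gstack/projects/$SLUG/$USER-$BRANCH-ship-test-plan-$DATETIME.md <<EOF
+# Test Plan
+Generated by /ship on $(date +%Y-%m-%d)
+Branch: $BRANCH
+Repo: $REPO
+EOF
+```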
+
 ---
 
 ## Step 3.5: Pre-Landing Review
diff --git a/ship/SKILL.md b/ship/SKILL.md
index be1b628c..cee2db2f 100644
--- a/ship/SKILL.md
+++ b/ship/SKILL.md
@@ -584,6 +584,27 @@ If multiple suites need to run, run them sequentially (each needs a test lane).
 
 100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.
 
+### Test Framework Detection
+
+Before analyzing coverage, detect the project's test framework:
+
+1. **Read CLAUDE.md** — look for a `## Testing` section with test command and framework name. If found, use that as the authoritative source.
+2. **If CLAUDE.md has no testing section, auto-detect:**
+
+```bash
+# Detect project runtime
+[ -f Gemfile ] && echo "RUNTIME:ruby"
+[ -f package.json ] && echo "RUNTIME:node"
+{ [ -f requirements.txt ] || [ -f pyproject.toml ]; } && echo "RUNTIME:python"
+[ -f go.mod ] && echo "RUNTIME:go"
+[ -f Cargo.toml ] && echo "RUNTIME:rust"
+# Check for existing test infrastructure
+ls jest.config.* vitest.config.* playwright.config.* cypress.config.* .rspec pytest.ini phpunit.xml 2>/dev/null
+ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
+```
+
+3. **If no framework detected:** fall through to the Test Framework Bootstrap step (Step 2.5), which handles full setup.
+
 **0. Before/after test count:**
 
 ```bash
@@ -646,9 +667,41 @@ Quality scoring rubric:
 - ★★ Tests correct behavior, happy path only
 - ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")
 
+### E2E Test Decision Matrix
+
+When checking each branch, also determine whether a unit test or E2E/integration test is the right tool:
+
+**RECOMMEND E2E (mark as [→E2E] in the diagram):**
+- Common user flow spanning 3+ components/services (e.g., signup → verify email → first login)
+- Integration point where mocking hides real failures (e.g., API → queue → worker → DB)
+- Auth/payment/data-destruction flows — too important to trust unit tests alone
+
+**RECOMMEND EVAL (mark as [→EVAL] in the diagram):**
+- Critical LLM call that needs a quality eval (e.g., prompt change → test output still meets quality bar)
+- Changes to prompt templates, system instructions, or tool definitions
+
+**STICK WITH UNIT TESTS:**
+- Pure function with clear inputs/outputs
+- Internal helper with no side effects
+- Edge case of a single function (null input, empty array)
+- Obscure/rare flow that isn't customer-facing
+
+### REGRESSION RULE (mandatory)
+
+**IRON RULE:** When the coverage audit identifies a REGRESSION — code that previously worked but the diff broke — a regression test is written immediately. No AskUserQuestion. No skipping. Regressions are the highest-priority test because they prove something broke.
+
+A regression is when:
+- The diff modifies existing behavior (not new code)
+- The existing test suite (if any) doesn't cover the changed path
+- The change introduces a new failure mode for existing callers
+
+When uncertain whether a change is a regression, err on the side of writing the test.
+
+Format: commit as `test: regression test for {what broke}`
+
 **4. Output ASCII coverage diagram:**
 
-Include BOTH code paths and user flows in the same diagram:
+Include BOTH code paths and user flows in the same diagram. Mark E2E-worthy and eval-worthy paths:
 
 ```
 CODE PATH COVERAGE
@@ -669,9 +722,9 @@ USER FLOW COVERAGE
 [+] Payment checkout flow
  │
  ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
- ├── [GAP] Double-click submit — NO TEST
- ├── [GAP] Navigate away during payment — NO TEST
- └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
+ ├── [GAP] [→E2E] Double-click submit — needs E2E, not just unit
+ ├── [GAP] Navigate away during payment — unit test sufficient
+ └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
 
 [+] Error states
  │
@@ -679,12 +732,16 @@ USER FLOW COVERAGE
 ├── [★★ TESTED] Card declined message — billing.test.ts:58
 ├── [GAP] Network timeout UX (what does user see?) — NO TEST
 └── [GAP] Empty cart submission — NO TEST
+
+[+] LLM integration
+ │
+ └── [GAP] [→EVAL] Prompt template change — needs eval test
 
 ─────────────────────────────────
-COVERAGE: 5/12 paths tested (42%)
+COVERAGE: 6/13 paths tested (46%)
 Code paths: 3/5 (60%)
- User flows: 2/7 (29%)
+ User flows: 3/8 (38%)
 QUALITY: ★★★: 2 ★★: 2 ★: 1
-GAPS: 7 paths need tests
+GAPS: 7 paths need tests (1 needs E2E, 1 needs eval)
 ─────────────────────────────────
 ```
@@ -696,6 +753,8 @@ If test framework detected (or bootstrapped in Step 2.5):
 - Prioritize error handlers and edge cases first (happy paths are more likely already tested)
 - Read 2-3 existing test files to match conventions exactly
 - Generate unit tests. Mock all external dependencies (DB, API, Redis).
+- For paths marked [→E2E]: generate integration/E2E tests using the project's E2E framework (Playwright, Cypress, Capybara, etc.)
+- For paths marked [→EVAL]: generate eval tests using the project's eval framework, or flag for manual eval if none exists
 - Write tests that exercise the specific uncovered path with real assertions
 - Run each test. Passes → commit as `test: coverage for {feature}`
 - Fails → fix once. Still fails → revert, note gap in diagram.
@@ -716,6 +775,37 @@ find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec
 
 For PR body: `Tests: {before} → {after} (+{delta} new)`
 Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). 
K tests generated, J committed.` +### Test Plan Artifact + +After producing the coverage diagram, write a test plan artifact so `/qa` and `/qa-only` can consume it: + +```bash +source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG +USER=$(whoami) +DATETIME=$(date +%Y%m%d-%H%M%S) +``` + +Write to `~/.gstack/projects/{slug}/{user}-{branch}-ship-test-plan-{datetime}.md`: + +```markdown +# Test Plan +Generated by /ship on {date} +Branch: {branch} +Repo: {owner/repo} + +## Affected Pages/Routes +- {URL path} — {what to test and why} + +## Key Interactions to Verify +- {interaction description} on {page} + +## Edge Cases +- {edge case} on {page} + +## Critical Paths +- {end-to-end flow that must work} +``` + --- ## Step 3.5: Pre-Landing Review diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index 22dff7d0..be276a1e 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -188,139 +188,7 @@ If multiple suites need to run, run them sequentially (each needs a test lane). ## Step 3.4: Test Coverage Audit -100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned. - -**0. Before/after test count:** - -```bash -# Count test files before any generation -find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l -``` - -Store this number for the PR body. - -**1. Trace every codepath changed** using `git diff origin/...HEAD`: - -Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution: - -1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context. -2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch: - - Where does input come from? (request params, props, database, API call) - - What transforms it? (validation, mapping, computation) - - Where does it go? (database write, API response, rendered output, side effect) - - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection) -3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing: - - Every function/method that was added or modified - - Every conditional branch (if/else, switch, ternary, guard clause, early return) - - Every error path (try/catch, rescue, error boundary, fallback) - - Every call to another function (trace into it — does IT have untested branches?) - - Every edge: what happens with null input? Empty array? Invalid type? - -This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test. - -**2. Map user flows, interactions, and error states:** - -Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through: - -- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test. -- **Interaction edge cases:** What happens when the user does something unexpected? 
- - Double-click/rapid resubmit - - Navigate away mid-operation (back button, close tab, click another link) - - Submit with stale data (page sat open for 30 minutes, session expired) - - Slow connection (API takes 10 seconds — what does the user see?) - - Concurrent actions (two tabs, same form) -- **Error states the user can see:** For every error the code handles, what does the user actually experience? - - Is there a clear error message or a silent failure? - - Can the user recover (retry, go back, fix input) or are they stuck? - - What happens with no network? With a 500 from the API? With invalid data from the server? -- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input? - -Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else. - -**3. Check each branch against existing tests:** - -Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it: -- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb` -- An if/else → look for tests covering BOTH the true AND false path -- An error handler → look for a test that triggers that specific error condition -- A call to `helperFn()` that has its own branches → those branches need tests too -- A user flow → look for an integration or E2E test that walks through the journey -- An interaction edge case → look for a test that simulates the unexpected action - -Quality scoring rubric: -- ★★★ Tests behavior with edge cases AND error paths -- ★★ Tests correct behavior, happy path only -- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw") - -**4. Output ASCII coverage diagram:** - -Include BOTH code paths and user flows in the same diagram: - -``` -CODE PATH COVERAGE -=========================== -[+] src/services/billing.ts - │ - ├── processPayment() - │ ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42 - │ ├── [GAP] Network timeout — NO TEST - │ └── [GAP] Invalid currency — NO TEST - │ - └── refundPayment() - ├── [★★ TESTED] Full refund — billing.test.ts:89 - └── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101 - -USER FLOW COVERAGE -=========================== -[+] Payment checkout flow - │ - ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15 - ├── [GAP] Double-click submit — NO TEST - ├── [GAP] Navigate away during payment — NO TEST - └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40 - -[+] Error states - │ - ├── [★★ TESTED] Card declined message — billing.test.ts:58 - ├── [GAP] Network timeout UX (what does user see?) — NO TEST - └── [GAP] Empty cart submission — NO TEST - -───────────────────────────────── -COVERAGE: 5/12 paths tested (42%) - Code paths: 3/5 (60%) - User flows: 2/7 (29%) -QUALITY: ★★★: 2 ★★: 2 ★: 1 -GAPS: 7 paths need tests -───────────────────────────────── -``` - -**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue. - -**5. Generate tests for uncovered paths:** - -If test framework detected (or bootstrapped in Step 2.5): -- Prioritize error handlers and edge cases first (happy paths are more likely already tested) -- Read 2-3 existing test files to match conventions exactly -- Generate unit tests. Mock all external dependencies (DB, API, Redis). 
-- Write tests that exercise the specific uncovered path with real assertions -- Run each test. Passes → commit as `test: coverage for {feature}` -- Fails → fix once. Still fails → revert, note gap in diagram. - -Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap. - -If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured." - -**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit." - -**6. After-count and coverage summary:** - -```bash -# Count test files after generation -find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l -``` - -For PR body: `Tests: {before} → {after} (+{delta} new)` -Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.` +{{TEST_COVERAGE_AUDIT_SHIP}} ---