feat: test bootstrap, regression tests, coverage audit, retro test health

- Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts - Add Phase 8e.5 regression test generation to /qa and /qa-design-review - Add Step 3.4 test coverage audit with quality scoring to /ship - Add test health tracking to /retro - Add 2 E2E evals (bootstrap + coverage audit) - Add 26 validation tests - Update ARCHITECTURE.md placeholder table - Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests)
2026-05-07 05:56:41 +02:00 · 2026-03-17 10:41:40 -07:00
parent 73b00b4e29
commit 04eb7824bd
16 changed files with 1512 additions and 6 deletions
@@ -11,6 +11,7 @@ allowed-tools:
  - Grep
  - Glob
  - AskUserQuestion
+  - WebSearch
 ---
 <!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
 <!-- Regenerate: bun run gen:skill-docs -->
@@ -121,6 +122,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat
 - Multi-file changesets (auto-split into bisectable commits)
 - TODOS.md completed-item detection (auto-mark)
 - Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically)
+- Test coverage gaps (auto-generate and commit, or flag in PR body)

 ---

@@ -185,6 +187,163 @@ git fetch origin <base> && git merge origin/<base> --no-edit

 ---

+## Step 2.5: Test Framework Bootstrap
+
+## Test Framework Bootstrap
+
+**Detect existing test framework and project runtime:**
+
+```bash
+# Detect project runtime
+[ -f Gemfile ] && echo "RUNTIME:ruby"
+[ -f package.json ] && echo "RUNTIME:node"
+[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python"
+[ -f go.mod ] && echo "RUNTIME:go"
+[ -f Cargo.toml ] && echo "RUNTIME:rust"
+[ -f composer.json ] && echo "RUNTIME:php"
+[ -f mix.exs ] && echo "RUNTIME:elixir"
+# Detect sub-frameworks
+[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails"
+[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs"
+# Check for existing test infrastructure
+ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null
+ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
+# Check opt-out marker
+[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED"
+```
+
+**If test framework detected** (config files or test directories found):
+Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap."
+Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns).
+Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.**
+
+**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.**
+
+**If NO runtime detected** (no config files found): Use AskUserQuestion:
+"I couldn't detect your project's language. What runtime are you using?"
+Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests.
+If user picks H → write `.gstack/no-test-bootstrap` and continue without tests.
+
+**If runtime detected but no test framework — bootstrap:**
+
+### B2. Research best practices
+
+Use WebSearch to find current best practices for the detected runtime:
+- `"[runtime] best test framework 2025 2026"`
+- `"[framework A] vs [framework B] comparison"`
+
+If WebSearch is unavailable, use this built-in knowledge table:
+
+| Runtime | Primary recommendation | Alternative |
+|---------|----------------------|-------------|
+| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers |
+| Node.js | vitest + @testing-library | jest + @testing-library |
+| Next.js | vitest + @testing-library/react + playwright | jest + cypress |
+| Python | pytest + pytest-cov | unittest |
+| Go | stdlib testing + testify | stdlib only |
+| Rust | cargo test (built-in) + mockall | — |
+| PHP | phpunit + mockery | pest |
+| Elixir | ExUnit (built-in) + ex_machina | — |
+
+### B3. Framework selection
+
+Use AskUserQuestion:
+"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options:
+A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e
+B) [Alternative] — [rationale]. Includes: [packages]
+C) Skip — don't set up testing right now
+RECOMMENDATION: Choose A because [reason based on project context]"
+
+If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests.
+
+If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially.
+
+### B4. Install and configure
+
+1. Install the chosen packages (npm/bun/gem/pip/etc.)
+2. Create minimal config file
+3. Create directory structure (test/, spec/, etc.)
+4. Create one example test matching the project's code to verify setup works
+
+If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). Warn user and continue without tests.
+
+### B4.5. First real tests
+
+Generate 3-5 real tests for existing code:
+
+1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10`
+2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions
+3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES.
+4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently.
+5. Generate at least 1 test, cap at 5.
+
+Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures.
+
+### B5. Verify
+
+```bash
+# Run the full test suite to confirm everything works
+{detected test command}
+```
+
+If tests fail → debug once. If still failing → revert all bootstrap changes and warn user.
+
+### B5.5. CI/CD pipeline
+
+```bash
+# Check CI provider
+ls -d .github/ 2>/dev/null && echo "CI:github"
+ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null
+```
+
+If `.github/` exists (or no CI detected — default to GitHub Actions):
+Create `.github/workflows/test.yml` with:
+- `runs-on: ubuntu-latest`
+- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.)
+- The same test command verified in B5
+- Trigger: push + pull_request
+
+If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually."
+
+### B6. Create TESTING.md
+
+First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content.
+
+Write TESTING.md with:
+- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower."
+- Framework name and version
+- How to run tests (the verified command from B5)
+- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests
+- Conventions: file naming, assertion style, setup/teardown patterns
+
+### B7. Update CLAUDE.md
+
+First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate.
+
+Append a `## Testing` section:
+- Run command and test directory
+- Reference to TESTING.md
+- Test expectations:
+  - 100% test coverage is the goal — tests make vibe coding safe
+  - When writing new functions, write a corresponding test
+  - When fixing a bug, write a regression test
+  - When adding error handling, write a test that triggers the error
+  - When adding a conditional (if/else, switch), write tests for BOTH paths
+  - Never commit code that makes existing tests fail
+
+### B8. Commit
+
+```bash
+git status --porcelain
+```
+
+Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created):
+`git commit -m "chore: bootstrap test framework ({framework name})"`
+
+---
+
+---
+
 ## Step 3: Run tests (on merged code)

 **Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls
@@ -269,6 +428,94 @@ If multiple suites need to run, run them sequentially (each needs a test lane).

 ---

+## Step 3.4: Test Coverage Audit
+
+100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.
+
+**0. Before/after test count:**
+
+```bash
+# Count test files before any generation
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
+```
+
+Store this number for the PR body.
+
+**1. Build the code path map** from `git diff origin/<base>...HEAD`:
+
+Extract all new or modified:
+- Functions/methods (def, function, const, class methods)
+- Conditional branches (if/else, switch/case, ternary, guard clauses, early returns)
+- API routes/endpoints (route definitions, controller actions)
+- Components (new files or new exports)
+- Error handlers (try/catch, rescue, error boundaries, fallback paths)
+
+**2. Search for corresponding tests and score quality:**
+
+For each code path, search for a test exercising it:
+- `src/services/billing.ts:processPayment` → `billing.test.ts`, `billing.spec.ts`
+- `app/controllers/payments_controller.rb#create` → `test/controllers/payments_controller_test.rb`
+- New if/else → tests for BOTH paths (not just happy path)
+- New error handler → test triggering the error condition
+
+Quality scoring rubric:
+- ★★★  Tests behavior with edge cases AND error paths
+- ★★   Tests correct behavior, happy path only
+- ★    Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")
+
+**3. Output ASCII coverage diagram:**
+
+```
+NEW CODE PATH COVERAGE MAP
+===========================
+[+] src/services/billing.ts
+    │
+    ├── processPayment()
+    │   ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42
+    │   ├── [GAP]         Network timeout — NO TEST
+    │   └── [GAP]         Invalid currency — NO TEST
+    │
+    └── refundPayment()
+        ├── [★★  TESTED] Full refund — billing.test.ts:89
+        └── [★   TESTED] Partial refund (checks non-throw only) — billing.test.ts:101
+
+─────────────────────────────────
+COVERAGE: 3/5 new paths tested (60%)
+QUALITY:  ★★★: 1  ★★: 1  ★: 1  (avg: ★★)
+GAPS: 2 paths need tests
+─────────────────────────────────
+```
+
+**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue.
+
+**4. Generate tests for uncovered paths:**
+
+If test framework detected (or bootstrapped in Step 2.5):
+- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
+- Read 2-3 existing test files to match conventions exactly
+- Generate unit tests. Mock all external dependencies (DB, API, Redis).
+- Write tests that exercise the specific uncovered path with real assertions
+- Run each test. Passes → commit as `test: coverage for {feature}`
+- Fails → fix once. Still fails → revert, note gap in diagram.
+
+Caps: 30 code paths max, 10 tests generated max, 2-min per-test exploration cap.
+
+If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."
+
+**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit."
+
+**5. After-count and coverage summary:**
+
+```bash
+# Count test files after generation
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
+```
+
+For PR body: `Tests: {before} → {after} (+{delta} new)`
+Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`
+
+---
+
 ## Step 3.5: Pre-Landing Review

 Review the diff for structural issues that tests don't catch.
@@ -497,6 +744,10 @@ gh pr create --base <base> --title "<type>: <summary>" --body "$(cat <<'EOF'
 ## Summary
 <bullet points from CHANGELOG>

+## Test Coverage
+<coverage diagram from Step 3.4, or "All new code paths have test coverage.">
+<If Step 3.4 ran: "Tests: {before} → {after} (+{delta} new)">
+
 ## Pre-Landing Review
 <findings from Step 3.5, or "No issues found.">

@@ -538,4 +789,5 @@ EOF
 - **Split commits for bisectability** — each commit = one logical change.
 - **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done.
 - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
+- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests.
 - **The goal is: user says `/ship`, next thing they see is the review + PR URL.**
@@ -11,6 +11,7 @@ allowed-tools:
  - Grep
  - Glob
  - AskUserQuestion
+  - WebSearch
 ---

 {{PREAMBLE}}
@@ -39,6 +40,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat
 - Multi-file changesets (auto-split into bisectable commits)
 - TODOS.md completed-item detection (auto-mark)
 - Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically)
+- Test coverage gaps (auto-generate and commit, or flag in PR body)

 ---

@@ -75,6 +77,12 @@ git fetch origin <base> && git merge origin/<base> --no-edit

 ---

+## Step 2.5: Test Framework Bootstrap
+
+{{TEST_BOOTSTRAP}}
+
+---
+
 ## Step 3: Run tests (on merged code)

 **Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls
@@ -159,6 +167,94 @@ If multiple suites need to run, run them sequentially (each needs a test lane).

 ---

+## Step 3.4: Test Coverage Audit
+
+100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.
+
+**0. Before/after test count:**
+
+```bash
+# Count test files before any generation
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
+```
+
+Store this number for the PR body.
+
+**1. Build the code path map** from `git diff origin/<base>...HEAD`:
+
+Extract all new or modified:
+- Functions/methods (def, function, const, class methods)
+- Conditional branches (if/else, switch/case, ternary, guard clauses, early returns)
+- API routes/endpoints (route definitions, controller actions)
+- Components (new files or new exports)
+- Error handlers (try/catch, rescue, error boundaries, fallback paths)
+
+**2. Search for corresponding tests and score quality:**
+
+For each code path, search for a test exercising it:
+- `src/services/billing.ts:processPayment` → `billing.test.ts`, `billing.spec.ts`
+- `app/controllers/payments_controller.rb#create` → `test/controllers/payments_controller_test.rb`
+- New if/else → tests for BOTH paths (not just happy path)
+- New error handler → test triggering the error condition
+
+Quality scoring rubric:
+- ★★★  Tests behavior with edge cases AND error paths
+- ★★   Tests correct behavior, happy path only
+- ★    Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")
+
+**3. Output ASCII coverage diagram:**
+
+```
+NEW CODE PATH COVERAGE MAP
+===========================
+[+] src/services/billing.ts
+    │
+    ├── processPayment()
+    │   ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42
+    │   ├── [GAP]         Network timeout — NO TEST
+    │   └── [GAP]         Invalid currency — NO TEST
+    │
+    └── refundPayment()
+        ├── [★★  TESTED] Full refund — billing.test.ts:89
+        └── [★   TESTED] Partial refund (checks non-throw only) — billing.test.ts:101
+
+─────────────────────────────────
+COVERAGE: 3/5 new paths tested (60%)
+QUALITY:  ★★★: 1  ★★: 1  ★: 1  (avg: ★★)
+GAPS: 2 paths need tests
+─────────────────────────────────
+```
+
+**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue.
+
+**4. Generate tests for uncovered paths:**
+
+If test framework detected (or bootstrapped in Step 2.5):
+- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
+- Read 2-3 existing test files to match conventions exactly
+- Generate unit tests. Mock all external dependencies (DB, API, Redis).
+- Write tests that exercise the specific uncovered path with real assertions
+- Run each test. Passes → commit as `test: coverage for {feature}`
+- Fails → fix once. Still fails → revert, note gap in diagram.
+
+Caps: 30 code paths max, 10 tests generated max, 2-min per-test exploration cap.
+
+If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."
+
+**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit."
+
+**5. After-count and coverage summary:**
+
+```bash
+# Count test files after generation
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
+```
+
+For PR body: `Tests: {before} → {after} (+{delta} new)`
+Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`
+
+---
+
 ## Step 3.5: Pre-Landing Review

 Review the diff for structural issues that tests don't catch.
@@ -387,6 +483,10 @@ gh pr create --base <base> --title "<type>: <summary>" --body "$(cat <<'EOF'
 ## Summary
 <bullet points from CHANGELOG>

+## Test Coverage
+<coverage diagram from Step 3.4, or "All new code paths have test coverage.">
+<If Step 3.4 ran: "Tests: {before} → {after} (+{delta} new)">
+
 ## Pre-Landing Review
 <findings from Step 3.5, or "No issues found.">

@@ -428,4 +528,5 @@ EOF
 - **Split commits for bisectability** — each commit = one logical change.
 - **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done.
 - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
+- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests.
 - **The goal is: user says `/ship`, next thing they see is the review + PR URL.**