## Step 4: Test Framework Bootstrap ## Test Framework Bootstrap **Detect existing test framework and project runtime:** ```bash setopt +o nomatch 2>/dev/null || true # zsh compat # Detect project runtime [ -f Gemfile ] && echo "RUNTIME:ruby" [ -f package.json ] && echo "RUNTIME:node" [ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python" [ -f go.mod ] && echo "RUNTIME:go" [ -f Cargo.toml ] && echo "RUNTIME:rust" [ -f composer.json ] && echo "RUNTIME:php" [ -f mix.exs ] && echo "RUNTIME:elixir" # Detect sub-frameworks [ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails" [ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs" # Check for existing test infrastructure ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null # Check opt-out marker [ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED" ``` **If test framework detected** (config files or test directories found): Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap." Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns). Store conventions as prose context for use in Phase 8e.5 or Step 7. **Skip the rest of bootstrap.** **If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.** **If NO runtime detected** (no config files found): Use AskUserQuestion: "I couldn't detect your project's language. What runtime are you using?" Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests. If user picks H → write `.gstack/no-test-bootstrap` and continue without tests. **If runtime detected but no test framework — bootstrap:** ### B2. Research best practices Use WebSearch to find current best practices for the detected runtime: - `"[runtime] best test framework 2025 2026"` - `"[framework A] vs [framework B] comparison"` If WebSearch is unavailable, use this built-in knowledge table: | Runtime | Primary recommendation | Alternative | |---------|----------------------|-------------| | Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers | | Node.js | vitest + @testing-library | jest + @testing-library | | Next.js | vitest + @testing-library/react + playwright | jest + cypress | | Python | pytest + pytest-cov | unittest | | Go | stdlib testing + testify | stdlib only | | Rust | cargo test (built-in) + mockall | — | | PHP | phpunit + mockery | pest | | Elixir | ExUnit (built-in) + ex_machina | — | ### B3. Framework selection Use AskUserQuestion: "I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options: A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e B) [Alternative] — [rationale]. Includes: [packages] C) Skip — don't set up testing right now RECOMMENDATION: Choose A because [reason based on project context]" If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests. If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially. ### B4. Install and configure 1. Install the chosen packages (npm/bun/gem/pip/etc.) 2. Create minimal config file 3. Create directory structure (test/, spec/, etc.) 4. Create one example test matching the project's code to verify setup works If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). Warn user and continue without tests. ### B4.5. First real tests Generate 3-5 real tests for existing code: 1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10` 2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions 3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES. 4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. 5. Generate at least 1 test, cap at 5. Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. ### B5. Verify ```bash # Run the full test suite to confirm everything works {detected test command} ``` If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. ### B5.5. CI/CD pipeline ```bash # Check CI provider ls -d .github/ 2>/dev/null && echo "CI:github" ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null ``` If `.github/` exists (or no CI detected — default to GitHub Actions): Create `.github/workflows/test.yml` with: - `runs-on: ubuntu-latest` - Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) - The same test command verified in B5 - Trigger: push + pull_request If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." ### B6. Create TESTING.md First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. Write TESTING.md with: - Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower." - Framework name and version - How to run tests (the verified command from B5) - Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests - Conventions: file naming, assertion style, setup/teardown patterns ### B7. Update CLAUDE.md First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate. Append a `## Testing` section: - Run command and test directory - Reference to TESTING.md - Test expectations: - 100% test coverage is the goal — tests make vibe coding safe - When writing new functions, write a corresponding test - When fixing a bug, write a regression test - When adding error handling, write a test that triggers the error - When adding a conditional (if/else, switch), write tests for BOTH paths - Never commit code that makes existing tests fail ### B8. Commit ```bash git status --porcelain ``` Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created): `git commit -m "chore: bootstrap test framework ({framework name})"` --- --- ## Step 5: Run tests (on merged code) **Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls `db:test:prepare` internally, which loads the schema into the correct lane database. Running bare test migrations without INSTANCE hits an orphan DB and corrupts structure.sql. Run both test suites in parallel: ```bash bin/test-lane 2>&1 | tee /tmp/ship_tests.txt & npm run test 2>&1 | tee /tmp/ship_vitest.txt & wait ``` After both complete, read the output files and check pass/fail. **If any test fails:** Do NOT immediately stop. Apply the Test Failure Ownership Triage: ## Test Failure Ownership Triage When tests fail, do NOT immediately stop. First, determine ownership: ### Step T1: Classify each failure For each failing test: 1. **Get the files changed on this branch:** ```bash git diff origin/...HEAD --name-only ``` 2. **Classify the failure:** - **In-branch** if: the failing test file itself was modified on this branch, OR the test output references code that was changed on this branch, OR you can trace the failure to a change in the branch diff. - **Likely pre-existing** if: neither the test file nor the code it tests was modified on this branch, AND the failure is unrelated to any branch change you can identify. - **When ambiguous, default to in-branch.** It is safer to stop the developer than to let a broken test ship. Only classify as pre-existing when you are confident. This classification is heuristic — use your judgment reading the diff and the test output. You do not have a programmatic dependency graph. ### Step T2: Handle in-branch failures **STOP.** These are your failures. Show them and do not proceed. The developer must fix their own broken tests before shipping. ### Step T3: Handle pre-existing failures Check `REPO_MODE` from the preamble output. **If REPO_MODE is `solo`:** Use AskUserQuestion: > These test failures appear pre-existing (not caused by your branch changes): > > [list each failure with file:line and brief error description] > > Since this is a solo repo, you're the only one who will fix these. > > RECOMMENDATION: Choose A — fix now while the context is fresh. Completeness: 9/10. > A) Investigate and fix now (human: ~2-4h / CC: ~15min) — Completeness: 10/10 > B) Add as P0 TODO — fix after this branch lands — Completeness: 7/10 > C) Skip — I know about this, ship anyway — Completeness: 3/10 **If REPO_MODE is `collaborative` or `unknown`:** Use AskUserQuestion: > These test failures appear pre-existing (not caused by your branch changes): > > [list each failure with file:line and brief error description] > > This is a collaborative repo — these may be someone else's responsibility. > > RECOMMENDATION: Choose B — assign it to whoever broke it so the right person fixes it. Completeness: 9/10. > A) Investigate and fix now anyway — Completeness: 10/10 > B) Blame + assign GitHub issue to the author — Completeness: 9/10 > C) Add as P0 TODO — Completeness: 7/10 > D) Skip — ship anyway — Completeness: 3/10 ### Step T4: Execute the chosen action **If "Investigate and fix now":** - Switch to /investigate mindset: root cause first, then minimal fix. - Fix the pre-existing failure. - Commit the fix separately from the branch's changes: `git commit -m "fix: pre-existing test failure in "` - Continue with the workflow. **If "Add as P0 TODO":** - If `TODOS.md` exists, add the entry following the format in `review/TODOS-format.md` (or `.claude/skills/review/TODOS-format.md`). - If `TODOS.md` does not exist, create it with the standard header and add the entry. - Entry should include: title, the error output, which branch it was noticed on, and priority P0. - Continue with the workflow — treat the pre-existing failure as non-blocking. **If "Blame + assign GitHub issue" (collaborative only):** - Find who likely broke it. Check BOTH the test file AND the production code it tests: ```bash # Who last touched the failing test? git log --format="%an (%ae)" -1 -- # Who last touched the production code the test covers? (often the actual breaker) git log --format="%an (%ae)" -1 -- ``` If these are different people, prefer the production code author — they likely introduced the regression. - Create an issue assigned to that person (use the platform detected in Step 0): - **If GitHub:** ```bash gh issue create \ --title "Pre-existing test failure: " \ --body "Found failing on branch . Failure is pre-existing.\n\n**Error:**\n```\n\n```\n\n**Last modified by:** \n**Noticed by:** gstack /ship on " \ --assignee "" ``` - **If GitLab:** ```bash glab issue create \ -t "Pre-existing test failure: " \ -d "Found failing on branch . Failure is pre-existing.\n\n**Error:**\n```\n\n```\n\n**Last modified by:** \n**Noticed by:** gstack /ship on " \ -a "" ``` - If neither CLI is available or `--assignee`/`-a` fails (user not in org, etc.), create the issue without assignee and note who should look at it in the body. - Continue with the workflow. **If "Skip":** - Continue with the workflow. - Note in output: "Pre-existing test failure skipped: " **After triage:** If any in-branch failures remain unfixed, **STOP**. Do not proceed. If all failures were pre-existing and handled (fixed, TODOed, assigned, or skipped), continue to Step 6. **If all pass:** Continue silently — just note the counts briefly. --- ## Step 6: Eval Suites (conditional) Evals are mandatory when prompt-related files change. Skip this step entirely if no prompt files are in the diff. **1. Check if the diff touches prompt-related files:** ```bash git diff origin/ --name-only ``` Match against these patterns (from CLAUDE.md): - `app/services/*_prompt_builder.rb` - `app/services/*_generation_service.rb`, `*_writer_service.rb`, `*_designer_service.rb` - `app/services/*_evaluator.rb`, `*_scorer.rb`, `*_classifier_service.rb`, `*_analyzer.rb` - `app/services/concerns/*voice*.rb`, `*writing*.rb`, `*prompt*.rb`, `*token*.rb` - `app/services/chat_tools/*.rb`, `app/services/x_thread_tools/*.rb` - `config/system_prompts/*.txt` - `test/evals/**/*` (eval infrastructure changes affect all suites) **If no matches:** Print "No prompt-related files changed — skipping evals." and continue to Step 9. **2. Identify affected eval suites:** Each eval runner (`test/evals/*_eval_runner.rb`) declares `PROMPT_SOURCE_FILES` listing which source files affect it. Grep these to find which suites match the changed files: ```bash grep -l "changed_file_basename" test/evals/*_eval_runner.rb ``` Map runner → test file: `post_generation_eval_runner.rb` → `post_generation_eval_test.rb`. **Special cases:** - Changes to `test/evals/judges/*.rb`, `test/evals/support/*.rb`, or `test/evals/fixtures/` affect ALL suites that use those judges/support files. Check imports in the eval test files to determine which. - Changes to `config/system_prompts/*.txt` — grep eval runners for the prompt filename to find affected suites. - If unsure which suites are affected, run ALL suites that could plausibly be impacted. Over-testing is better than missing a regression. **3. Run affected suites at `EVAL_JUDGE_TIER=full`:** `/ship` is a pre-merge gate, so always use full tier (Sonnet structural + Opus persona judges). ```bash EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/_eval_test.rb 2>&1 | tee /tmp/ship_evals.txt ``` If multiple suites need to run, run them sequentially (each needs a test lane). If the first suite fails, stop immediately — don't burn API cost on remaining suites. **4. Check results:** - **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed. - **If all pass:** Note pass counts and cost. Continue to Step 9. **5. Save eval output** — include eval results and cost dashboard in the PR body (Step 19). **Tier reference (for context — /ship always uses `full`):** | Tier | When | Speed (cached) | Cost | |------|------|----------------|------| | `fast` (Haiku) | Dev iteration, smoke tests | ~5s (14x faster) | ~$0.07/run | | `standard` (Sonnet) | Default dev, `bin/test-lane --eval` | ~17s (4x faster) | ~$0.37/run | | `full` (Opus persona) | **`/ship` and pre-merge** | ~72s (baseline) | ~$1.27/run | ---