## Step 4: Test Framework Bootstrap

{{TEST_BOOTSTRAP}}

---

## Step 5: Run tests (on merged code)

**Do NOT run `RAILS_ENV=test bin/rails db:migrate`.** `bin/test-lane` already calls
`db:test:prepare` internally, which loads the schema into the correct lane database.
Running bare test migrations without INSTANCE hits an orphan DB and corrupts structure.sql.

Run both test suites in parallel:

```bash
bin/test-lane 2>&1 | tee /tmp/ship_tests.txt &
npm run test 2>&1 | tee /tmp/ship_vitest.txt &
wait
```

After both complete, read the output files and check pass/fail.

**If any test fails:** Do NOT immediately stop. Apply the Test Failure Ownership Triage:

{{TEST_FAILURE_TRIAGE}}

**After triage:** If any in-branch failures remain unfixed, **STOP**. Do not proceed. If all failures were pre-existing and handled (fixed, TODOed, assigned, or skipped), continue to Step 6.

**If all pass:** Continue silently; just note the counts briefly.

---

## Step 6: Eval Suites (conditional)

Evals are mandatory when prompt-related files change. Skip this step entirely if no prompt files are in the diff.

**1. Check if the diff touches prompt-related files:**

```bash
git diff origin/ --name-only
```

Match against these patterns (from CLAUDE.md):
- `app/services/*_prompt_builder.rb`
- `app/services/*_generation_service.rb`, `*_writer_service.rb`, `*_designer_service.rb`
- `app/services/*_evaluator.rb`, `*_scorer.rb`, `*_classifier_service.rb`, `*_analyzer.rb`
- `app/services/concerns/*voice*.rb`, `*writing*.rb`, `*prompt*.rb`, `*token*.rb`
- `app/services/chat_tools/*.rb`, `app/services/x_thread_tools/*.rb`
- `config/system_prompts/*.txt`
- `test/evals/**/*` (eval infrastructure changes affect all suites)

**If no matches:** Print "No prompt-related files changed — skipping evals." and continue to Step 9.

**2. Identify affected eval suites:**

Each eval runner (`test/evals/*_eval_runner.rb`) declares `PROMPT_SOURCE_FILES` listing which source files affect it.
Grep these to find which suites match the changed files:

```bash
grep -l "changed_file_basename" test/evals/*_eval_runner.rb
```

Map runner → test file: `post_generation_eval_runner.rb` → `post_generation_eval_test.rb`.

**Special cases:**
- Changes to `test/evals/judges/*.rb`, `test/evals/support/*.rb`, or `test/evals/fixtures/` affect ALL suites that use those judges/support files. Check imports in the eval test files to determine which.
- Changes to `config/system_prompts/*.txt`: grep eval runners for the prompt filename to find affected suites.
- If unsure which suites are affected, run ALL suites that could plausibly be impacted. Over-testing is better than missing a regression.

**3. Run affected suites at `EVAL_JUDGE_TIER=full`:**

`/ship` is a pre-merge gate, so always use full tier (Sonnet structural + Opus persona judges).

```bash
EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/_eval_test.rb 2>&1 | tee /tmp/ship_evals.txt
```

If multiple suites need to run, run them sequentially (each needs a test lane). If the first suite fails, stop immediately; don't burn API cost on remaining suites.

**4. Check results:**

- **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
- **If all pass:** Note pass counts and cost. Continue to Step 9.

**5. Save eval output:** include eval results and cost dashboard in the PR body (Step 19).

**Tier reference (for context; `/ship` always uses `full`):**

| Tier | When | Speed (cached) | Cost |
|------|------|----------------|------|
| `fast` (Haiku) | Dev iteration, smoke tests | ~5s (14x faster) | ~$0.07/run |
| `standard` (Sonnet) | Default dev, `bin/test-lane --eval` | ~17s (4x faster) | ~$0.37/run |
| `full` (Opus persona) | **`/ship` and pre-merge** | ~72s (baseline) | ~$1.27/run |

---
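As an illustration of the pattern check in Step 6 (part 1), the match against the prompt-related patterns could be sketched as a pair of shell functions. This is a minimal sketch, not part of the repo: `is_prompt_file` and `filter_prompt_files` are hypothetical names, and it assumes the patterns listed without a directory (such as `*_writer_service.rb`) live under `app/services/` or `app/services/concerns/` like the entry they follow. Note that `*` in a `case` pattern also matches `/`, so this is slightly looser than a filename glob.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (assumption: not defined anywhere in the repo).

# Returns 0 if the given path matches one of the Step 6 prompt-related patterns.
is_prompt_file() {
  case "$1" in
    # Concerns are listed first; assumes the bare patterns sit under concerns/.
    app/services/concerns/*voice*.rb)   return 0 ;;
    app/services/concerns/*writing*.rb) return 0 ;;
    app/services/concerns/*prompt*.rb)  return 0 ;;
    app/services/concerns/*token*.rb)   return 0 ;;
    app/services/chat_tools/*.rb)       return 0 ;;
    app/services/x_thread_tools/*.rb)   return 0 ;;
    # Assumes the bare service patterns sit under app/services/.
    app/services/*_prompt_builder.rb)     return 0 ;;
    app/services/*_generation_service.rb) return 0 ;;
    app/services/*_writer_service.rb)     return 0 ;;
    app/services/*_designer_service.rb)   return 0 ;;
    app/services/*_evaluator.rb)          return 0 ;;
    app/services/*_scorer.rb)             return 0 ;;
    app/services/*_classifier_service.rb) return 0 ;;
    app/services/*_analyzer.rb)           return 0 ;;
    config/system_prompts/*.txt)          return 0 ;;
    # Any change to eval infrastructure affects all suites.
    test/evals/*)                         return 0 ;;
  esac
  return 1
}

# Reads one path per line on stdin and prints only the prompt-related ones.
filter_prompt_files() {
  local path
  while IFS= read -r path; do
    if is_prompt_file "$path"; then
      printf '%s\n' "$path"
    fi
  done
}
```

Piping the `--name-only` diff output from Step 6 into `filter_prompt_files` would then print the changed prompt-related files; empty output corresponds to the "no matches" branch.

---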