<!-- AUTO-GENERATED from tests.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Step 4: Test Framework Bootstrap

## Test Framework Bootstrap

**Detect existing test framework and project runtime:**

```bash
setopt +o nomatch 2>/dev/null || true  # zsh compat
# Detect project runtime
[ -f Gemfile ] && echo "RUNTIME:ruby"
[ -f package.json ] && echo "RUNTIME:node"
[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python"
[ -f go.mod ] && echo "RUNTIME:go"
[ -f Cargo.toml ] && echo "RUNTIME:rust"
[ -f composer.json ] && echo "RUNTIME:php"
[ -f mix.exs ] && echo "RUNTIME:elixir"
# Detect sub-frameworks
[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails"
[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs"
# Check for existing test infrastructure
ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null
ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
# Check opt-out marker
[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED"
```

**If test framework detected** (config files or test directories found):
Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap."
Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns).
Store conventions as prose context for use in Phase 8e.5 or Step 7. **Skip the rest of bootstrap.**

**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.**

**If NO runtime detected** (no config files found): Use AskUserQuestion:
"I couldn't detect your project's language. What runtime are you using?"
Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests.
If user picks H → write `.gstack/no-test-bootstrap` and continue without tests.

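The branching in this step can be sketched as a single function. This is a minimal illustration, not part of the generated skill: the function name `bootstrap_decision` and its output labels are hypothetical, and `pyproject.toml` is treated only as a runtime marker here (an assumption), since it does not by itself show that tests exist.

```bash
# Hypothetical helper: classify a project directory into one of the four
# outcomes described above. Name and output labels are illustrative only.
bootstrap_decision() {
  dir="${1:-.}"
  # Existing test config files or test directories: bootstrap is skipped.
  if find "$dir" -mindepth 1 -maxdepth 1 \( \
       -name 'jest.config.*' -o -name 'vitest.config.*' \
       -o -name 'playwright.config.*' -o -name '.rspec' \
       -o -name 'pytest.ini' -o -name 'phpunit.xml' \
       -o -name 'test' -o -name 'tests' -o -name 'spec' \
       -o -name '__tests__' -o -name 'cypress' -o -name 'e2e' \
     \) | grep -q .; then
    echo "framework-detected"
    return 0
  fi
  # Opt-out marker written by an earlier run.
  if [ -f "$dir/.gstack/no-test-bootstrap" ]; then
    echo "declined"
    return 0
  fi
  # Any known runtime manifest means there is something to bootstrap.
  for manifest in Gemfile package.json requirements.txt pyproject.toml \
                  go.mod Cargo.toml composer.json mix.exs; do
    if [ -f "$dir/$manifest" ]; then
      echo "bootstrap"
      return 0
    fi
  done
  # Nothing recognizable: the user has to name the runtime.
  echo "ask-runtime"
}
```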
**If runtime detected but no test framework — bootstrap:**

### B2. Research best practices

Use WebSearch to find current best practices for the detected runtime:
- `"[runtime] best test framework 2025 2026"`
- `"[framework A] vs [framework B] comparison"`

If WebSearch is unavailable, use this built-in knowledge table:

| Runtime | Primary recommendation | Alternative |
|---------|----------------------|-------------|
| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers |
| Node.js | vitest + @testing-library | jest + @testing-library |
| Next.js | vitest + @testing-library/react + playwright | jest + cypress |
| Python | pytest + pytest-cov | unittest |
| Go | stdlib testing + testify | stdlib only |
| Rust | cargo test (built-in) + mockall | — |
| PHP | phpunit + mockery | pest |
| Elixir | ExUnit (built-in) + ex_machina | — |

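For the offline fallback, the primary column of the table above can be expressed as a lookup. This is only a sketch: the function name `recommend_stack` is hypothetical, and the keys are assumed to be the lowercase labels printed by the detection script (`RUNTIME:` and `FRAMEWORK:` values).

```bash
# Hypothetical lookup: print the primary recommendation for a runtime label.
# Values are copied from the fallback table; the function name is illustrative.
recommend_stack() {
  case "$1" in
    ruby|rails) echo "minitest + fixtures + capybara" ;;
    node)       echo "vitest + @testing-library" ;;
    nextjs)     echo "vitest + @testing-library/react + playwright" ;;
    python)     echo "pytest + pytest-cov" ;;
    go)         echo "stdlib testing + testify" ;;
    rust)       echo "cargo test (built-in) + mockall" ;;
    php)        echo "phpunit + mockery" ;;
    elixir)     echo "ExUnit (built-in) + ex_machina" ;;
    *)          echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}
```

Calling `recommend_stack python` prints the Python row's primary recommendation; an unrecognized label returns a non-zero status so the caller can fall back to asking the user.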
### B3. Framework selection

Use AskUserQuestion:
"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options:
A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e
B) [Alternative] — [rationale]. Includes: [packages]
C) Skip — don't set up testing right now
RECOMMENDATION: Choose A because [reason based on project context]"

If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests.

If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially.

### B4. Install and configure

1. Install the chosen packages (npm/bun/gem/pip/etc.)
2. Create minimal config file
3. Create directory structure (test/, spec/, etc.)
4. Create one example test matching the project's code to verify setup works

If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). Warn user and continue without tests.

### B4.5. First real tests

Generate 3-5 real tests for existing code:

1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10`
2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions
3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES.
4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently.
5. Generate at least 1 test, cap at 5.

Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures.

### B5. Verify

```bash
# Run the full test suite to confirm everything works
{detected test command}
```

If tests fail → debug once. If still failing → revert all bootstrap changes and warn user.

### B5.5. CI/CD pipeline

```bash
# Check CI provider
ls -d .github/ 2>/dev/null && echo "CI:github"
ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null
```

If `.github/` exists (or no CI detected — default to GitHub Actions):
Create `.github/workflows/test.yml` with:
- `runs-on: ubuntu-latest`
- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.)
- The same test command verified in B5
- Trigger: push + pull_request

If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually."

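As an illustration of the workflow file described above, here is a minimal sketch for a Node project. Everything specific in it is an assumption rather than something the skill prescribes: the function name `write_test_workflow`, the `@v4` action versions, Node 20, and the `npm ci` install step (which requires a lockfile). The test command is passed in so the workflow runs exactly what was verified in B5.

```bash
# Hypothetical helper: write a minimal GitHub Actions workflow for a Node
# project. Action versions, Node version, and install step are assumptions.
write_test_workflow() {
  dir="${1:-.}"
  test_cmd="${2:-npm test}"
  file="$dir/.github/workflows/test.yml"
  # Never overwrite a workflow that already exists.
  if [ -e "$file" ]; then
    echo "exists: $file" >&2
    return 0
  fi
  mkdir -p "$dir/.github/workflows"
  # Unquoted heredoc so that $test_cmd is expanded into the file.
  cat > "$file" <<EOF
name: test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: $test_cmd
EOF
}
```

For other runtimes the same shape applies with a different setup action and install step.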
### B6. Create TESTING.md

First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content.

Write TESTING.md with:
- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower."
- Framework name and version
- How to run tests (the verified command from B5)
- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests
- Conventions: file naming, assertion style, setup/teardown patterns

### B7. Update CLAUDE.md

First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate.

Append a `## Testing` section:
- Run command and test directory
- Reference to TESTING.md
- Test expectations:
  - 100% test coverage is the goal — tests make vibe coding safe
  - When writing new functions, write a corresponding test
  - When fixing a bug, write a regression test
  - When adding error handling, write a test that triggers the error
  - When adding a conditional (if/else, switch), write tests for BOTH paths
  - Never commit code that makes existing tests fail

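The skip-if-present check above can be sketched as an idempotent append. The function name `append_testing_section`, its parameters, and the wording of the appended bullets are hypothetical; only the "skip when `## Testing` already exists" rule comes from the step itself.

```bash
# Hypothetical helper: append a "## Testing" section only when none exists.
append_testing_section() {
  file="${1:-CLAUDE.md}"
  run_cmd="$2"    # verified test command
  test_dir="$3"   # directory that holds the tests
  # An existing "## Testing" heading means the section is already there.
  if [ -f "$file" ] && grep -q '^## Testing' "$file"; then
    return 0
  fi
  # \` keeps the backticks literal inside the unquoted heredoc.
  cat >> "$file" <<EOF

## Testing

- Run: \`$run_cmd\` (tests live in \`$test_dir\`)
- See TESTING.md for conventions
EOF
}
```

Running it twice against the same file leaves a single `## Testing` section, which is the property the "Don't duplicate" rule asks for.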
### B8. Commit

```bash
git status --porcelain
```

Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created):
`git commit -m "chore: bootstrap test framework ({framework name})"`

---

## Step 5: Run tests (on merged code)

**Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls
`db:test:prepare` internally, which loads the schema into the correct lane database.
Running bare test migrations without INSTANCE hits an orphan DB and corrupts structure.sql.

Run both test suites in parallel:

```bash
bin/test-lane 2>&1 | tee /tmp/ship_tests.txt &
npm run test 2>&1 | tee /tmp/ship_vitest.txt &
wait
```

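A bare `wait` returns zero regardless of how the background jobs exited, so with the command above pass/fail has to be read from the output files. If the exit codes are wanted as well, one possible variant is sketched below; the function name `run_both` and its argument order are hypothetical, and it drops the live `tee` echo in favor of plain redirection.

```bash
# Hypothetical variant: run two commands in parallel, capture each one's
# output in a file, and return non-zero if either command failed.
run_both() {
  cmd_a="$1"; out_a="$2"
  cmd_b="$3"; out_b="$4"
  sh -c "$cmd_a" > "$out_a" 2>&1 &
  pid_a=$!
  sh -c "$cmd_b" > "$out_b" 2>&1 &
  pid_b=$!
  failed=0
  # Waiting on a specific PID yields that job's exit status.
  wait "$pid_a" || failed=1
  wait "$pid_b" || failed=1
  return "$failed"
}
# Usage with the commands from this step:
#   run_both "bin/test-lane" /tmp/ship_tests.txt "npm run test" /tmp/ship_vitest.txt
```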
After both complete, read the output files and check pass/fail.

**If any test fails:** Do NOT immediately stop. Apply the Test Failure Ownership Triage:

## Test Failure Ownership Triage

When tests fail, do NOT immediately stop. First, determine ownership:

### Step T1: Classify each failure

For each failing test:

1. **Get the files changed on this branch:**
   ```bash
   git diff origin/<base>...HEAD --name-only
   ```

2. **Classify the failure:**
   - **In-branch** if: the failing test file itself was modified on this branch, OR the test output references code that was changed on this branch, OR you can trace the failure to a change in the branch diff.
   - **Likely pre-existing** if: neither the test file nor the code it tests was modified on this branch, AND the failure is unrelated to any branch change you can identify.
   - **When ambiguous, default to in-branch.** It is safer to stop the developer than to let a broken test ship. Only classify as pre-existing when you are confident.

This classification is heuristic — use your judgment reading the diff and the test output. You do not have a programmatic dependency graph.

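The mechanical part of this classification (was the test file or the code under test touched on the branch?) can be sketched as below. It is a rough illustration only: the function name `classify_failure` is hypothetical, and it cannot cover the judgment cases, so anything it reports as likely pre-existing still needs the diff and test output read by hand.

```bash
# Hypothetical helper: first argument is a file holding the changed paths
# (one per line); remaining arguments are the failing test file and any
# source files it exercises.
classify_failure() {
  changed_list="$1"
  shift
  for path in "$@"; do
    # Skip empty arguments so an unknown source file cannot match.
    [ -n "$path" ] || continue
    # -x: whole-line match, -F: fixed string, so paths are compared exactly.
    if grep -qxF -- "$path" "$changed_list"; then
      echo "in-branch"
      return 0
    fi
  done
  echo "likely-pre-existing"
}
# Usage:
#   git diff origin/<base>...HEAD --name-only > /tmp/changed.txt
#   classify_failure /tmp/changed.txt <failing-test-file> <source-file-under-test>
```

The exact-match flags matter: without `-x`, a changed `app/user.rb` would also match a failing `app/user.rb.bak`-style path by substring.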
### Step T2: Handle in-branch failures

**STOP.** These are your failures. Show them and do not proceed. The developer must fix their own broken tests before shipping.

### Step T3: Handle pre-existing failures

Check `REPO_MODE` from the preamble output.

**If REPO_MODE is `solo`:**

Use AskUserQuestion:

> These test failures appear pre-existing (not caused by your branch changes):
>
> [list each failure with file:line and brief error description]
>
> Since this is a solo repo, you're the only one who will fix these.
>
> RECOMMENDATION: Choose A — fix now while the context is fresh. Completeness: 10/10.
> A) Investigate and fix now (human: ~2-4h / CC: ~15min) — Completeness: 10/10
> B) Add as P0 TODO — fix after this branch lands — Completeness: 7/10
> C) Skip — I know about this, ship anyway — Completeness: 3/10

**If REPO_MODE is `collaborative` or `unknown`:**

Use AskUserQuestion:

> These test failures appear pre-existing (not caused by your branch changes):
>
> [list each failure with file:line and brief error description]
>
> This is a collaborative repo — these may be someone else's responsibility.
>
> RECOMMENDATION: Choose B — assign it to whoever broke it so the right person fixes it. Completeness: 9/10.
> A) Investigate and fix now anyway — Completeness: 10/10
> B) Blame + assign GitHub issue to the author — Completeness: 9/10
> C) Add as P0 TODO — Completeness: 7/10
> D) Skip — ship anyway — Completeness: 3/10

### Step T4: Execute the chosen action

**If "Investigate and fix now":**
- Switch to /investigate mindset: root cause first, then minimal fix.
- Fix the pre-existing failure.
- Commit the fix separately from the branch's changes: `git commit -m "fix: pre-existing test failure in <test-file>"`
- Continue with the workflow.

**If "Add as P0 TODO":**
- If `TODOS.md` exists, add the entry following the format in `review/TODOS-format.md` (or `.claude/skills/review/TODOS-format.md`).
- If `TODOS.md` does not exist, create it with the standard header and add the entry.
- Entry should include: title, the error output, which branch it was noticed on, and priority P0.
- Continue with the workflow — treat the pre-existing failure as non-blocking.

**If "Blame + assign GitHub issue" (collaborative only):**
- Find who likely broke it. Check BOTH the test file AND the production code it tests:
  ```bash
  # Who last touched the failing test?
  git log --format="%an (%ae)" -1 -- <failing-test-file>
  # Who last touched the production code the test covers? (often the actual breaker)
  git log --format="%an (%ae)" -1 -- <source-file-under-test>
  ```
  If these are different people, prefer the production code author — they likely introduced the regression.
- Create an issue assigned to that person (use the platform detected in Step 0):
  - **If GitHub:**
    ```bash
    # $'...' (ANSI-C quoting) turns \n into real newlines and keeps the backticks literal;
    # inside plain double quotes the backticks would run as command substitution.
    # Escape any single quote in the substituted text as \'.
    gh issue create \
      --title "Pre-existing test failure: <test-name>" \
      --body $'Found failing on branch <current-branch>. Failure is pre-existing.\n\n**Error:**\n```\n<first 10 lines>\n```\n\n**Last modified by:** <author>\n**Noticed by:** gstack /ship on <date>' \
      --assignee "<github-username>"
    ```
  - **If GitLab:**
    ```bash
    # Same $'...' quoting as the GitHub command, for the same reason.
    glab issue create \
      -t "Pre-existing test failure: <test-name>" \
      -d $'Found failing on branch <current-branch>. Failure is pre-existing.\n\n**Error:**\n```\n<first 10 lines>\n```\n\n**Last modified by:** <author>\n**Noticed by:** gstack /ship on <date>' \
      -a "<gitlab-username>"
    ```
- If neither CLI is available or `--assignee`/`-a` fails (user not in org, etc.), create the issue without assignee and note who should look at it in the body.
- Continue with the workflow.

**If "Skip":**
- Continue with the workflow.
- Note in output: "Pre-existing test failure skipped: <test-name>"

**After triage:** If any in-branch failures remain unfixed, **STOP**. Do not proceed. If all failures were pre-existing and handled (fixed, TODOed, assigned, or skipped), continue to Step 6.

**If all pass:** Continue silently — just note the counts briefly.

---

## Step 6: Eval Suites (conditional)

Evals are mandatory when prompt-related files change. Skip this step entirely if no prompt files are in the diff.

**1. Check if the diff touches prompt-related files:**

```bash
git diff origin/<base> --name-only
```

Match against these patterns (from CLAUDE.md):
- `app/services/*_prompt_builder.rb`
- `app/services/*_generation_service.rb`, `*_writer_service.rb`, `*_designer_service.rb`
- `app/services/*_evaluator.rb`, `*_scorer.rb`, `*_classifier_service.rb`, `*_analyzer.rb`
- `app/services/concerns/*voice*.rb`, `*writing*.rb`, `*prompt*.rb`, `*token*.rb`
- `app/services/chat_tools/*.rb`, `app/services/x_thread_tools/*.rb`
- `config/system_prompts/*.txt`
- `test/evals/**/*` (eval infrastructure changes affect all suites)

**If no matches:** Print "No prompt-related files changed — skipping evals." and continue to Step 9.

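The pattern match above can be sketched with a shell `case` statement. Two assumptions are baked in and should be checked against CLAUDE.md: the shortened patterns (`*_writer_service.rb`, `*_scorer.rb`, `*writing*.rb`, and so on) are read as living under the same directory as the first pattern on their line, and because `*` in a `case` pattern also matches `/`, this sketch is slightly looser than a filesystem glob. The function names are hypothetical.

```bash
# Hypothetical matcher for the prompt-related path patterns listed above.
is_prompt_file() {
  case "$1" in
    app/services/concerns/*voice*.rb|app/services/concerns/*writing*.rb) return 0 ;;
    app/services/concerns/*prompt*.rb|app/services/concerns/*token*.rb) return 0 ;;
    app/services/chat_tools/*.rb|app/services/x_thread_tools/*.rb) return 0 ;;
    app/services/*_prompt_builder.rb) return 0 ;;
    app/services/*_generation_service.rb|app/services/*_writer_service.rb) return 0 ;;
    app/services/*_designer_service.rb|app/services/*_evaluator.rb) return 0 ;;
    app/services/*_scorer.rb|app/services/*_classifier_service.rb) return 0 ;;
    app/services/*_analyzer.rb) return 0 ;;
    config/system_prompts/*.txt) return 0 ;;
    test/evals/*) return 0 ;;
  esac
  return 1
}

# Reads paths on stdin and prints only the prompt-related ones.
prompt_files_in_diff() {
  while IFS= read -r path; do
    if is_prompt_file "$path"; then
      echo "$path"
    fi
  done
}
# Usage: git diff origin/<base> --name-only | prompt_files_in_diff
# Empty output corresponds to the "no matches" case above.
```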
**2. Identify affected eval suites:**

Each eval runner (`test/evals/*_eval_runner.rb`) declares `PROMPT_SOURCE_FILES` listing which source files affect it. Grep these to find which suites match the changed files:

```bash
grep -l "changed_file_basename" test/evals/*_eval_runner.rb
```

Map runner → test file: `post_generation_eval_runner.rb` → `post_generation_eval_test.rb`.

**Special cases:**
- Changes to `test/evals/judges/*.rb`, `test/evals/support/*.rb`, or `test/evals/fixtures/` affect ALL suites that use those judges/support files. Check imports in the eval test files to determine which.
- Changes to `config/system_prompts/*.txt` — grep eval runners for the prompt filename to find affected suites.
- If unsure which suites are affected, run ALL suites that could plausibly be impacted. Over-testing is better than missing a regression.

**3. Run affected suites at `EVAL_JUDGE_TIER=full`:**

`/ship` is a pre-merge gate, so always use full tier (Sonnet structural + Opus persona judges).

```bash
EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/<suite>_eval_test.rb 2>&1 | tee /tmp/ship_evals.txt
```

If multiple suites need to run, run them sequentially (each needs a test lane). If the first suite fails, stop immediately — don't burn API cost on remaining suites.

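The sequential, stop-on-first-failure rule can be sketched as a small loop. The function name `run_suites_until_failure` and the wrapper `run_eval` in the usage comment are hypothetical; the runner is passed in as a command or shell function that takes one suite path.

```bash
# Hypothetical helper: run suites one at a time and stop at the first failure,
# so no further (paid) suites run after a failing one.
run_suites_until_failure() {
  runner="$1"   # command or function that takes one suite path
  shift
  for suite in "$@"; do
    echo "RUN: $suite"
    if ! "$runner" "$suite"; then
      echo "STOP: $suite failed; remaining suites not run"
      return 1
    fi
  done
}
# Usage sketch, wrapping the command from this step in a function:
#   run_eval() { EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval "$1"; }
#   run_suites_until_failure run_eval <first-suite-test-file> <second-suite-test-file>
```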
**4. Check results:**

- **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
- **If all pass:** Note pass counts and cost. Continue to Step 9.

**5. Save eval output** — include eval results and cost dashboard in the PR body (Step 19).

**Tier reference (for context — /ship always uses `full`):**
| Tier | When | Speed (cached) | Cost |
|------|------|----------------|------|
| `fast` (Haiku) | Dev iteration, smoke tests | ~5s (14x faster) | ~$0.07/run |
| `standard` (Sonnet) | Default dev, `bin/test-lane --eval` | ~17s (4x faster) | ~$0.37/run |
| `full` (Opus persona) | **`/ship` and pre-merge** | ~72s (baseline) | ~$1.27/run |

---