mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 11:45:20 +02:00
d85233017b
* feat: /codex skill — multi-AI second opinion (review, challenge, consult)
  Three modes: code review with pass/fail gate, adversarial challenge mode, and conversational consult with session continuity. First multi-AI skill in gstack, wrapping OpenAI's Codex CLI.
* feat: integrate /codex into /review, /ship, /plan-eng-review + dashboard
  /review offers Codex second opinion after completing its own review. /ship offers Codex review as optional gate before pushing. /plan-eng-review offers Codex plan critique after scope challenge. Review Readiness Dashboard shows Codex Review as optional row.
* chore: bump version and changelog (v0.8.0)
* test: codex skill validation (12 stub tests) + E2E eval test
  Stub tests (free tier): verify template content — three modes, gate verdict, session continuity, cost tracking, cross-model comparison, binary discovery, error handling, mktemp usage, and integrations into /review, /ship, /plan-eng-review. E2E test (paid tier): runs /codex review on vulnerable fixture repo via session-runner, verifies output contains findings and GATE verdict.
* fix: codex auth error message — use codex login, not OPENAI_API_KEY
  Codex authenticates via ChatGPT OAuth (codex login), not an env var.
* feat: codex uses high reasoning effort by default
  gpt-5.2-codex is the only model available with ChatGPT login. All commands now use model_reasoning_effort="high" for maximum depth — the whole point is a thorough second opinion.
* feat: crank codex reasoning to xhigh (maximum)
* feat: per-mode reasoning (high for review/consult, xhigh for challenge) + web search
  Review and consult use high reasoning — thorough but not slow. Challenge (adversarial) uses xhigh — maximum depth for breaking code. All modes enable web_search_cached so Codex can look up docs/APIs.
* refactor: don't hardcode model — use codex default (always latest)
* feat: JSONL output for codex challenge + consult modes
  Use --json flag to parse codex's JSONL events, extracting reasoning traces ([codex thinking]), tool calls ([codex ran]), and token counts. This gives richer output than the -o flag alone — you can see what codex thought through before its answer.
* fix: only persist codex-review log when code review actually ran
  Don't write a codex-review entry to reviews.jsonl when only the adversarial challenge (option B) was selected — there's no gate verdict to record, and a false entry misleads the Review Readiness Dashboard into thinking a code review happened.
* feat: add codex plan review option to /plan-eng-review
  After scope challenge (Step 0), offer to have Codex independently review the plan with a brutally honest tech reviewer persona.
* test: update e2e test for codex skill
* fix: codex integration bugs — plan content, review persistence, quoting, stderr
  - plan-eng-review: Codex now reads the plan file itself instead of inlining content as a CLI arg (avoids ARG_MAX for large plans)
  - review: add missing echo to persist codex-review results to reviews.jsonl
  - codex: consult mode uses $TMPERR (mktemp) instead of hardcoded stderr path
  - codex + review: quote $SLUG/$BRANCH_SLUG in review log paths
  - codex: scope plan lookup to current project, warn on cross-project fallback
* fix: add .context/ to .gitignore to prevent session ID leaks
  Codex consult mode stores session IDs in .context/codex-session-id. Without this ignore rule, session IDs could leak into commits.
* feat: proactive skill suggestions + opt-out + trigger phrase tests
  - Preamble reads proactive config via gstack-config
  - Root SKILL.md.tmpl has lifecycle map (stage → skill suggestion)
  - Users can opt out ("stop suggesting") / opt in ("be proactive again")
  - Restored trigger phrase validation tests (16 skills × "Use when" check)
  - Added missing "Use when" trigger phrases to /debug and /office-hours
* chore: update changelog for v0.8.0 — add proactive suggestions note

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
674 lines
30 KiB
Cheetah
---
name: ship
version: 1.0.0
description: |
  Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, update CHANGELOG, commit, push, create PR. Use when asked to "ship", "deploy", "push to main", "create a PR", or "merge and push".
  Proactively suggest when the user says code is ready or asks about deploying.
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
  - Grep
  - Glob
  - AskUserQuestion
  - WebSearch
---

{{PREAMBLE}}

{{BASE_BRANCH_DETECT}}

# Ship: Fully Automated Ship Workflow

You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship`, which means DO IT. Run straight through and output the PR URL at the end.

**Only stop for:**
- On the base branch (abort)
- Merge conflicts that can't be auto-resolved (stop, show conflicts)
- Test failures (stop, show failures)
- Pre-landing review finds ASK items that need user judgment
- MINOR or MAJOR version bump needed (ask — see Step 4)
- Greptile review comments that need user decision (complex fixes, false positives)
- TODOS.md missing and user wants to create one (ask — see Step 5.5)
- TODOS.md disorganized and user wants to reorganize (ask — see Step 5.5)

**Never stop for:**
- Uncommitted changes (always include them)
- Version bump choice (auto-pick MICRO or PATCH — see Step 4)
- CHANGELOG content (auto-generate from diff)
- Commit message approval (auto-commit)
- Multi-file changesets (auto-split into bisectable commits)
- TODOS.md completed-item detection (auto-mark)
- Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically)
- Test coverage gaps (auto-generate and commit, or flag in PR body)

---

## Step 1: Pre-flight

1. Check the current branch. If on the base branch or the repo's default branch, **abort**: "You're on the base branch. Ship from a feature branch."

2. Run `git status` (never use `-uall`). Uncommitted changes are always included — no need to ask.

3. Run `git diff <base>...HEAD --stat` and `git log <base>..HEAD --oneline` to understand what's being shipped.

4. Check review readiness:

{{REVIEW_DASHBOARD}}

If the Eng Review is NOT "CLEAR":

1. **Check for a prior override on this branch:**

```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
grep '"skill":"ship-review-override"' ~/.gstack/projects/"$SLUG"/"$BRANCH"-reviews.jsonl 2>/dev/null || echo "NO_OVERRIDE"
```

If an override exists, display the dashboard and note "Review gate previously accepted — continuing." Do NOT ask again.

2. **If no override exists,** use AskUserQuestion:
   - Show that Eng Review is missing or has open issues
   - RECOMMENDATION: Choose C if the change is obviously trivial (< 20 lines, typo fix, config-only); choose B for larger changes
   - Options: A) Ship anyway B) Abort — run /plan-eng-review first C) Change is too small to need eng review
   - If CEO Review is missing, mention it as informational ("CEO Review not run — recommended for product changes") but do NOT block
   - For Design Review: run `eval "$(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)"`. If `SCOPE_FRONTEND=true` and no design review (plan-design-review or design-review-lite) exists in the dashboard, mention: "Design Review not run — this PR changes frontend code. The lite design check will run automatically in Step 3.5, but consider running /design-review for a full visual audit post-implementation." Still never block.

3. **If the user chooses A or C,** persist the decision so future `/ship` runs on this branch skip the gate:

```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
echo '{"skill":"ship-review-override","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","decision":"USER_CHOICE"}' >> ~/.gstack/projects/"$SLUG"/"$BRANCH"-reviews.jsonl
```

Substitute USER_CHOICE with "ship_anyway" or "not_relevant".

---

## Step 2: Merge the base branch (BEFORE tests)

Fetch and merge the base branch into the feature branch so tests run against the merged state:

```bash
git fetch origin <base> && git merge origin/<base> --no-edit
```

**If there are merge conflicts:** Try to auto-resolve if they are simple (VERSION, schema.rb, CHANGELOG ordering). If conflicts are complex or ambiguous, **STOP** and show them.

**If already up to date:** Continue silently.
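The simple/complex split can be sketched as a check over the conflicted paths (a minimal sketch: the safe-list and the `classify_conflicts` helper are illustrative, not part of the skill):

```shell
# Classify conflicted paths (as listed by `git diff --name-only --diff-filter=U`):
# the merge is auto-resolvable only if every conflicted file is known-safe.
classify_conflicts() {
  simple=true
  for f in "$@"; do
    case "$f" in
      VERSION|CHANGELOG.md|db/schema.rb) ;;  # simple: auto-resolve
      *) simple=false ;;                     # anything else: STOP and show
    esac
  done
  echo "$simple"
}
classify_conflicts VERSION CHANGELOG.md        # prints: true
classify_conflicts VERSION app/models/user.rb  # prints: false
```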

---

## Step 2.5: Test Framework Bootstrap

{{TEST_BOOTSTRAP}}

---

## Step 3: Run tests (on merged code)

**Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls `db:test:prepare` internally, which loads the schema into the correct lane database. Running bare test migrations without INSTANCE hits an orphan DB and corrupts structure.sql.

Run both test suites in parallel:

```bash
bin/test-lane 2>&1 | tee /tmp/ship_tests.txt &
npm run test 2>&1 | tee /tmp/ship_vitest.txt &
wait
```

After both complete, read the output files and check pass/fail.

**If any test fails:** Show the failures and **STOP**. Do not proceed.

**If all pass:** Continue silently — just note the counts briefly.
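Note that a backgrounded `tee` pipeline reports `tee`'s exit status, not the test command's, unless `pipefail` is set. One way to check each suite's result explicitly (a sketch: `true` and `false` stand in for the real test commands, and redirection replaces `tee`):

```shell
# Run both suites in parallel, then check each exit status via `wait <pid>`.
# Stand-ins: `true` plays a passing bin/test-lane, `false` a failing vitest.
true  > /tmp/ship_tests.txt  2>&1 & pid_rails=$!
false > /tmp/ship_vitest.txt 2>&1 & pid_vitest=$!
wait "$pid_rails";  s_rails=$?
wait "$pid_vitest"; s_vitest=$?
echo "rails=$s_rails vitest=$s_vitest"  # prints: rails=0 vitest=1
if [ "$s_rails" -ne 0 ] || [ "$s_vitest" -ne 0 ]; then
  echo "STOP: a suite failed"
fi
```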

---

## Step 3.25: Eval Suites (conditional)

Evals are mandatory when prompt-related files change. Skip this step entirely if no prompt files are in the diff.

**1. Check if the diff touches prompt-related files:**

```bash
git diff origin/<base> --name-only
```

Match against these patterns (from CLAUDE.md):
- `app/services/*_prompt_builder.rb`
- `app/services/*_generation_service.rb`, `*_writer_service.rb`, `*_designer_service.rb`
- `app/services/*_evaluator.rb`, `*_scorer.rb`, `*_classifier_service.rb`, `*_analyzer.rb`
- `app/services/concerns/*voice*.rb`, `*writing*.rb`, `*prompt*.rb`, `*token*.rb`
- `app/services/chat_tools/*.rb`, `app/services/x_thread_tools/*.rb`
- `config/system_prompts/*.txt`
- `test/evals/**/*` (eval infrastructure changes affect all suites)

**If no matches:** Print "No prompt-related files changed — skipping evals." and continue to Step 3.5.
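The pattern check can be approximated with a single `grep -E` over the changed-file list (a sketch: patterns are abridged to three of the globs above, and the demo feeds a hardcoded list instead of live `git diff` output):

```shell
# Returns success if any path on stdin matches a prompt-related pattern.
prompt_files_changed() {
  grep -qE \
    -e 'app/services/.*_(prompt_builder|generation_service|evaluator|scorer)\.rb' \
    -e 'config/system_prompts/.*\.txt' \
    -e '^test/evals/'
}

# Real usage: git diff origin/<base> --name-only | prompt_files_changed
if printf '%s\n' app/models/user.rb config/system_prompts/chat.txt | prompt_files_changed; then
  echo "prompt-related files changed: evals required"
else
  echo "No prompt-related files changed — skipping evals."
fi
```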

**2. Identify affected eval suites:**

Each eval runner (`test/evals/*_eval_runner.rb`) declares `PROMPT_SOURCE_FILES` listing which source files affect it. Grep these to find which suites match the changed files:

```bash
grep -l "changed_file_basename" test/evals/*_eval_runner.rb
```

Map runner → test file: `post_generation_eval_runner.rb` → `post_generation_eval_test.rb`.

**Special cases:**
- Changes to `test/evals/judges/*.rb`, `test/evals/support/*.rb`, or `test/evals/fixtures/` affect ALL suites that use those judges/support files. Check imports in the eval test files to determine which.
- Changes to `config/system_prompts/*.txt` — grep eval runners for the prompt filename to find affected suites.
- If unsure which suites are affected, run ALL suites that could plausibly be impacted. Over-testing is better than missing a regression.

**3. Run affected suites at `EVAL_JUDGE_TIER=full`:**

`/ship` is a pre-merge gate, so always use full tier (Sonnet structural + Opus persona judges).

```bash
EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/<suite>_eval_test.rb 2>&1 | tee /tmp/ship_evals.txt
```

If multiple suites need to run, run them sequentially (each needs a test lane). If the first suite fails, stop immediately — don't burn API cost on remaining suites.

**4. Check results:**
- **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
- **If all pass:** Note pass counts and cost. Continue to Step 3.5.

**5. Save eval output** — include eval results and cost dashboard in the PR body (Step 8).

**Tier reference (for context — /ship always uses `full`):**

| Tier | When | Speed (cached) | Cost |
|------|------|----------------|------|
| `fast` (Haiku) | Dev iteration, smoke tests | ~5s (14x faster) | ~$0.07/run |
| `standard` (Sonnet) | Default dev, `bin/test-lane --eval` | ~17s (4x faster) | ~$0.37/run |
| `full` (Opus persona) | **`/ship` and pre-merge** | ~72s (baseline) | ~$1.27/run |

---

## Step 3.4: Test Coverage Audit

100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.

**0. Before/after test count:**

```bash
# Count test files before any generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
```

Store this number for the PR body.

**1. Trace every codepath changed** using `git diff origin/<base>...HEAD`:

Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution:

1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context.
2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch:
   - Where does input come from? (request params, props, database, API call)
   - What transforms it? (validation, mapping, computation)
   - Where does it go? (database write, API response, rendered output, side effect)
   - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection)
3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing:
   - Every function/method that was added or modified
   - Every conditional branch (if/else, switch, ternary, guard clause, early return)
   - Every error path (try/catch, rescue, error boundary, fallback)
   - Every call to another function (trace into it — does IT have untested branches?)
   - Every edge: what happens with null input? Empty array? Invalid type?

This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test.

**2. Map user flows, interactions, and error states:**

Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through:

- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test.
- **Interaction edge cases:** What happens when the user does something unexpected?
  - Double-click/rapid resubmit
  - Navigate away mid-operation (back button, close tab, click another link)
  - Submit with stale data (page sat open for 30 minutes, session expired)
  - Slow connection (API takes 10 seconds — what does the user see?)
  - Concurrent actions (two tabs, same form)
- **Error states the user can see:** For every error the code handles, what does the user actually experience?
  - Is there a clear error message or a silent failure?
  - Can the user recover (retry, go back, fix input) or are they stuck?
  - What happens with no network? With a 500 from the API? With invalid data from the server?
- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single-character input? With maximum-length input?

Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else.
**3. Check each branch against existing tests:**

Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it:
- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb`
- An if/else → look for tests covering BOTH the true AND false path
- An error handler → look for a test that triggers that specific error condition
- A call to `helperFn()` that has its own branches → those branches need tests too
- A user flow → look for an integration or E2E test that walks through the journey
- An interaction edge case → look for a test that simulates the unexpected action

Quality scoring rubric:
- ★★★ Tests behavior with edge cases AND error paths
- ★★ Tests correct behavior, happy path only
- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")

**4. Output ASCII coverage diagram:**

Include BOTH code paths and user flows in the same diagram:

```
CODE PATH COVERAGE
===========================
[+] src/services/billing.ts
 │
 ├── processPayment()
 │    ├── [★★★ TESTED] Happy path + card declined — billing.test.ts:42
 │    ├── [GAP] Network timeout — NO TEST
 │    └── [GAP] Invalid currency — NO TEST
 │
 └── refundPayment()
      ├── [★★ TESTED] Full refund — billing.test.ts:89
      └── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101

USER FLOW COVERAGE
===========================
[+] Payment checkout flow
 │
 ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
 ├── [GAP] Double-click submit — NO TEST
 ├── [GAP] Navigate away during payment — NO TEST
 └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40

[+] Error states
 │
 ├── [★★ TESTED] Card declined message — billing.test.ts:58
 ├── [GAP] Network timeout UX (what does user see?) — NO TEST
 └── [GAP] Empty cart submission — NO TEST

─────────────────────────────────
COVERAGE: 6/12 paths tested (50%)
  Code paths: 3/5 (60%)
  User flows: 3/7 (43%)
QUALITY: ★★★: 2  ★★: 2  ★: 2
GAPS: 6 paths need tests
─────────────────────────────────
```

**Fast path:** All paths covered → output "Step 3.4: All new code paths have test coverage ✓" and continue.

**5. Generate tests for uncovered paths:**

If test framework detected (or bootstrapped in Step 2.5):
- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
- Read 2-3 existing test files to match conventions exactly
- Generate unit tests. Mock all external dependencies (DB, API, Redis).
- Write tests that exercise the specific uncovered path with real assertions
- Run each test. Passes → commit as `test: coverage for {feature}`
- Fails → fix once. Still fails → revert, note gap in diagram.

Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap.

If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."

**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit."

**6. After-count and coverage summary:**

```bash
# Count test files after generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
```

For PR body: `Tests: {before} → {after} (+{delta} new)`
Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`

---

## Step 3.5: Pre-Landing Review

Review the diff for structural issues that tests don't catch.

1. Read `.claude/skills/review/checklist.md`. If the file cannot be read, **STOP** and report the error.

2. Run `git diff origin/<base>` to get the full diff (scoped to feature changes against the freshly-fetched base branch).

3. Apply the review checklist in two passes:
   - **Pass 1 (CRITICAL):** SQL & Data Safety, LLM Output Trust Boundary
   - **Pass 2 (INFORMATIONAL):** All remaining categories

{{DESIGN_REVIEW_LITE}}

Include any design findings alongside the code review findings. They follow the same Fix-First flow below.

4. **Classify each finding as AUTO-FIX or ASK** per the Fix-First Heuristic in checklist.md. Critical findings lean toward ASK; informational lean toward AUTO-FIX.

5. **Auto-fix all AUTO-FIX items.** Apply each fix. Output one line per fix:
   `[AUTO-FIXED] [file:line] Problem → what you did`

6. **If ASK items remain,** present them in ONE AskUserQuestion:
   - List each with number, severity, problem, recommended fix
   - Per-item options: A) Fix B) Skip
   - Overall RECOMMENDATION
   - If 3 or fewer ASK items, you may use individual AskUserQuestion calls instead

7. **After all fixes (auto + user-approved):**
   - If ANY fixes were applied: commit fixed files by name (`git add <fixed-files> && git commit -m "fix: pre-landing review fixes"`), then **STOP** and tell the user to run `/ship` again to re-test.
   - If no fixes applied (all ASK items skipped, or no issues found): continue to Step 4.

8. Output summary: `Pre-Landing Review: N issues — M auto-fixed, K asked (J fixed, L skipped)`

If no issues found: `Pre-Landing Review: No issues found.`

Save the review output — it goes into the PR body in Step 8.

---
## Step 3.75: Address Greptile review comments (if PR exists)

Read `.claude/skills/review/greptile-triage.md` and follow the fetch, filter, classify, and **escalation detection** steps.

**If no PR exists, `gh` fails, the API returns an error, or there are zero Greptile comments:** Skip this step silently. Continue to Step 4.

**If Greptile comments are found:**

Include a Greptile summary in your output: `+ N Greptile comments (X valid, Y fixed, Z FP)`

Before replying to any comment, run the **Escalation Detection** algorithm from greptile-triage.md to determine whether to use Tier 1 (friendly) or Tier 2 (firm) reply templates.

For each classified comment:

**VALID & ACTIONABLE:** Use AskUserQuestion with:
- The comment (file:line or [top-level] + body summary + permalink URL)
- `RECOMMENDATION: Choose A because [one-line reason]`
- Options: A) Fix now, B) Acknowledge and ship anyway, C) It's a false positive
- If user chooses A: apply the fix, commit the fixed files (`git add <fixed-files> && git commit -m "fix: address Greptile review — <brief description>"`), reply using the **Fix reply template** from greptile-triage.md (include inline diff + explanation), and save to both per-project and global greptile-history (type: fix).
- If user chooses C: reply using the **False Positive reply template** from greptile-triage.md (include evidence + suggested re-rank), save to both per-project and global greptile-history (type: fp).

**VALID BUT ALREADY FIXED:** Reply using the **Already Fixed reply template** from greptile-triage.md — no AskUserQuestion needed:
- Include what was done and the fixing commit SHA
- Save to both per-project and global greptile-history (type: already-fixed)

**FALSE POSITIVE:** Use AskUserQuestion:
- Show the comment and why you think it's wrong (file:line or [top-level] + body summary + permalink URL)
- Options:
  - A) Reply to Greptile explaining the false positive (recommended if clearly wrong)
  - B) Fix it anyway (if trivial)
  - C) Ignore silently
- If user chooses A: reply using the **False Positive reply template** from greptile-triage.md (include evidence + suggested re-rank), save to both per-project and global greptile-history (type: fp)

**SUPPRESSED:** Skip silently — these are known false positives from previous triage.

**After all comments are resolved:** If any fixes were applied, the tests from Step 3 are now stale. **Re-run tests** (Step 3) before continuing to Step 4. If no fixes were applied, continue to Step 4.

---

## Step 3.8: Codex second opinion (optional)

Check if the Codex CLI is available:

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```

If Codex is available, use AskUserQuestion:

```
Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?

A) Run Codex code review — independent diff review with pass/fail gate
B) Run Codex adversarial challenge — try to break this code
C) Skip — ship without Codex review
```

If the user chooses A or B:

**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout. Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers to determine the pass/fail gate. Persist the result:

```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)"
BRANCH_SLUG=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
mkdir -p ~/.gstack/projects/"$SLUG"
echo '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}' >> ~/.gstack/projects/"$SLUG"/"$BRANCH_SLUG"-reviews.jsonl
```

Substitute TIMESTAMP, STATUS, and GATE with the actual values before writing.

If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?" If the user says no, stop. If yes, continue to Step 4.

**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill). Present findings. This is informational — it does not block shipping.

If Codex is not available, skip silently. Continue to Step 4.

---

## Step 4: Version bump (auto-decide)

1. Read the current `VERSION` file (4-digit format: `MAJOR.MINOR.PATCH.MICRO`)

2. **Auto-decide the bump level based on the diff:**
   - Count lines changed (`git diff origin/<base>...HEAD --stat | tail -1`)
   - **MICRO** (4th digit): < 50 lines changed, trivial tweaks, typos, config
   - **PATCH** (3rd digit): 50+ lines changed, bug fixes, small-medium features
   - **MINOR** (2nd digit): **ASK the user** — only for major features or significant architectural changes
   - **MAJOR** (1st digit): **ASK the user** — only for milestones or breaking changes

3. Compute the new version:
   - Bumping a digit resets all digits to its right to 0
   - Example: `0.19.1.0` + PATCH → `0.19.2.0`

4. Write the new version to the `VERSION` file.
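The bump-and-reset rule can be sketched in a few lines of awk (illustrative: `bump_version` is not a gstack helper):

```shell
# Bump one level of a MAJOR.MINOR.PATCH.MICRO version and zero the digits
# to its right.
bump_version() {  # usage: bump_version 0.19.1.0 PATCH
  printf '%s\n' "$1" | awk -v level="$2" -F. -v OFS=. '{
    n = (level == "MAJOR" ? 1 : level == "MINOR" ? 2 : level == "PATCH" ? 3 : 4)
    $n++                                  # bump the chosen digit
    for (i = n + 1; i <= 4; i++) $i = 0   # reset everything to its right
    print
  }'
}
bump_version 0.19.1.0 PATCH  # prints: 0.19.2.0
bump_version 0.19.1.3 MINOR  # prints: 0.20.0.0
```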

---

## Step 5: CHANGELOG (auto-generate)

1. Read `CHANGELOG.md` header to know the format.

2. Auto-generate the entry from **ALL commits on the branch** (not just recent ones):
   - Use `git log <base>..HEAD --oneline` to see every commit being shipped
   - Use `git diff <base>...HEAD` to see the full diff against the base branch
   - The CHANGELOG entry must be comprehensive of ALL changes going into the PR
   - If existing CHANGELOG entries on the branch already cover some commits, replace them with one unified entry for the new version
   - Categorize changes into applicable sections:
     - `### Added` — new features
     - `### Changed` — changes to existing functionality
     - `### Fixed` — bug fixes
     - `### Removed` — removed features
   - Write concise, descriptive bullet points
   - Insert after the file header (line 5), dated today
   - Format: `## [X.Y.Z.W] - YYYY-MM-DD`

**Do NOT ask the user to describe changes.** Infer from the diff and commit history.
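For reference, an entry in that shape looks like this (version, date, and bullets are hypothetical):

```markdown
## [0.19.2.0] - 2026-05-02

### Added
- Codex second-opinion gate in the ship workflow

### Fixed
- Quote $SLUG and $BRANCH_SLUG in review log paths
```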

---

## Step 5.5: TODOS.md (auto-update)

Cross-reference the project's TODOS.md against the changes being shipped. Mark completed items automatically; prompt only if the file is missing or disorganized.

Read `.claude/skills/review/TODOS-format.md` for the canonical format reference.

**1. Check if TODOS.md exists** in the repository root.

**If TODOS.md does not exist:** Use AskUserQuestion:
- Message: "GStack recommends maintaining a TODOS.md organized by skill/component, then priority (P0 at top through P4, then Completed at bottom). See TODOS-format.md for the full format. Would you like to create one?"
- Options: A) Create it now, B) Skip for now
- If A: Create `TODOS.md` with a skeleton (# TODOS heading + ## Completed section). Continue to step 3.
- If B: Skip the rest of Step 5.5. Continue to Step 6.

**2. Check structure and organization:**

Read TODOS.md and verify it follows the recommended structure:
- Items grouped under `## <Skill/Component>` headings
- Each item has a `**Priority:**` field with a P0-P4 value
- A `## Completed` section at the bottom
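A quick mechanical pass over these three checks might look like this (a sketch: the demo builds a throwaway file rather than reading a real TODOS.md):

```shell
# Check a TODOS.md for the three structural markers listed above.
todos=$(mktemp)
printf '%s\n' '# TODOS' '## Shipping' '- Fix review gate' \
  '**Priority:** P1' '## Completed' > "$todos"

grep -q  '^## '                      "$todos" && has_groups=yes    || has_groups=no
grep -qE '^\*\*Priority:\*\* P[0-4]' "$todos" && has_priority=yes  || has_priority=no
grep -q  '^## Completed'             "$todos" && has_completed=yes || has_completed=no
rm -f "$todos"

echo "groups=$has_groups priority=$has_priority completed=$has_completed"
# prints: groups=yes priority=yes completed=yes
```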

**If disorganized** (missing priority fields, no component groupings, no Completed section): Use AskUserQuestion:
- Message: "TODOS.md doesn't follow the recommended structure (skill/component groupings, P0-P4 priority, Completed section). Would you like to reorganize it?"
- Options: A) Reorganize now (recommended), B) Leave as-is
- If A: Reorganize in-place following TODOS-format.md. Preserve all content — only restructure, never delete items.
- If B: Continue to step 3 without restructuring.

**3. Detect completed TODOs:**

This step is fully automatic — no user interaction.

Use the diff and commit history already gathered in earlier steps:
- `git diff <base>...HEAD` (full diff against the base branch)
- `git log <base>..HEAD --oneline` (all commits being shipped)

For each TODO item, check if the changes in this PR complete it by:
- Matching commit messages against the TODO title and description
- Checking if files referenced in the TODO appear in the diff
- Checking if the TODO's described work matches the functional changes

**Be conservative:** Only mark a TODO as completed if there is clear evidence in the diff. If uncertain, leave it alone.

**4. Move completed items** to the `## Completed` section at the bottom. Append: `**Completed:** vX.Y.Z (YYYY-MM-DD)`

**5. Output summary:**
- `TODOS.md: N items marked complete (item1, item2, ...). M items remaining.`
- Or: `TODOS.md: No completed items detected. M items remaining.`
- Or: `TODOS.md: Created.` / `TODOS.md: Reorganized.`

**6. Defensive:** If TODOS.md cannot be written (permission error, disk full), warn the user and continue. Never stop the ship workflow for a TODOS failure.

Save this summary — it goes into the PR body in Step 8.

---
|
|
|
|
## Step 6: Commit (bisectable chunks)

**Goal:** Create small, logical commits that work well with `git bisect` and help LLMs understand what changed.

1. Analyze the diff and group changes into logical commits. Each commit should represent **one coherent change** — not one file, but one logical unit.

2. **Commit ordering** (earlier commits first):
   - **Infrastructure:** migrations, config changes, route additions
   - **Models & services:** new models, services, concerns (with their tests)
   - **Controllers & views:** controllers, views, JS/React components (with their tests)
   - **VERSION + CHANGELOG + TODOS.md:** always in the final commit

3. **Rules for splitting:**
   - A model and its test file go in the same commit
   - A service and its test file go in the same commit
   - A controller, its views, and its test go in the same commit
   - Migrations are their own commit (or grouped with the model they support)
   - Config/route changes can group with the feature they enable
   - If the total diff is small (< 50 lines across < 4 files), a single commit is fine

4. **Each commit must be independently valid** — no broken imports, no references to code that doesn't exist yet. Order commits so dependencies come first.

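Rule 4 can be spot-checked by visiting each commit in order and running a fast check at every stop (the idiomatic one-liner on a real branch is `git rebase -x '<your fast check>' <base>`). The sketch below builds a throwaway repo so it is self-contained; `true` is a placeholder for a real check such as a quick build or lint.

```shell
# Throwaway repo with two commits, standing in for a real feature branch.
cd "$(mktemp -d)"
git init -q
git checkout -q -b main
git config user.email ship@example.com
git config user.name ship
echo one > a.txt && git add a.txt && git commit -q -m "feat: add a"
echo two > b.txt && git add b.txt && git commit -q -m "feat: add b"

# Visit every commit oldest-first and run a check at each one.
for c in $(git rev-list --reverse main); do
  git checkout -q "$c"
  true || { echo "commit $c is broken"; exit 1; }
done
git checkout -q main
echo "all commits pass independently"
```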
5. Compose each commit message:
   - First line: `<type>: <summary>` (type = feat/fix/chore/refactor/docs)
   - Body: brief description of what this commit contains
   - Only the **final commit** (VERSION + CHANGELOG) gets the version tag and co-author trailer:

```bash
git commit -m "$(cat <<'EOF'
chore: bump version and changelog (vX.Y.Z.W)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"
```

---

## Step 6.5: Verification Gate

**IRON LAW: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE.**

Before pushing, re-verify if code changed during Steps 4-6:

1. **Test verification:** If ANY code changed after Step 3's test run (fixes from review findings count; CHANGELOG edits don't), re-run the test suite. Paste fresh output. Stale output from Step 3 is NOT acceptable.

2. **Build verification:** If the project has a build step, run it. Paste output.

3. **Rationalization prevention:**
   - "Should work now" → RUN IT.
   - "I'm confident" → Confidence is not evidence.
   - "I already tested earlier" → Code changed since then. Test again.
   - "It's a trivial change" → Trivial changes break production.

**If tests fail here:** STOP. Do not push. Fix the issue and return to Step 3.

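The gate can be made mechanical by capturing a timestamped transcript of the re-run, so the evidence is demonstrably fresh. `run_tests` below is a hypothetical stand-in for the project's real suite (e.g. `bin/rails test && npx vitest run`).

```shell
# Stand-in for the real test suite; replace with the project's commands.
run_tests() { echo "42 runs, 0 failures"; }

stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
log=$(mktemp)

# Re-run and keep a timestamped transcript: this is the "fresh evidence".
{ echo "verification run at $stamp"; run_tests; } 2>&1 | tee "$log"

# Gate: refuse to continue unless the fresh run shows zero failures.
grep -q "0 failures" "$log" && echo "gate passed; evidence in $log"
```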

Claiming work is complete without verification is dishonesty, not efficiency.

---

## Step 7: Push

Push to the remote with upstream tracking:

```bash
git push -u origin <branch-name>
```

---

## Step 8: Create PR

Create a pull request using `gh`:

```bash
gh pr create --base <base> --title "<type>: <summary>" --body "$(cat <<'EOF'
## Summary

<bullet points from CHANGELOG>

## Test Coverage

<coverage diagram from Step 3.4, or "All new code paths have test coverage.">
<If Step 3.4 ran: "Tests: {before} → {after} (+{delta} new)">

## Pre-Landing Review

<findings from Step 3.5 code review, or "No issues found.">

## Design Review

<If design review ran: "Design Review (lite): N findings — M auto-fixed, K skipped. AI Slop: clean/N issues.">
<If no frontend files changed: "No frontend files changed — design review skipped.">

## Eval Results

<If evals ran: suite names, pass/fail counts, cost dashboard summary. If skipped: "No prompt-related files changed — evals skipped.">

## Greptile Review

<If Greptile comments were found: bullet list with [FIXED] / [FALSE POSITIVE] / [ALREADY FIXED] tag + one-line summary per comment>
<If no Greptile comments found: "No Greptile comments.">
<If no PR existed during Step 3.75: omit this section entirely>

## TODOS

<If items marked complete: bullet list of completed items with version>
<If no items completed: "No TODO items completed in this PR.">
<If TODOS.md created or reorganized: note that>
<If TODOS.md doesn't exist and user skipped: omit this section>

## Test plan

- [x] All Rails tests pass (N runs, 0 failures)
- [x] All Vitest tests pass (N tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```

**Output the PR URL** — this should be the final output the user sees.
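
`gh pr create` prints the new PR's URL on stdout, so capturing it makes surfacing it as the final line straightforward. The stub `gh` function below is a hypothetical stand-in so the sketch runs without the real CLI or a repository; drop it in real use.

```shell
# Hypothetical stub standing in for the real gh CLI.
gh() { echo "https://github.com/example/repo/pull/123"; }

# gh pr create writes the created PR's URL to stdout; capture and echo it
# as the very last line of the workflow.
pr_url=$(gh pr create --base main --title "feat: example" --body "example body")
echo "$pr_url"
```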

---

## Important Rules

- **Never skip tests.** If tests fail, stop.
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
- **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
- **Always use the 4-digit version format** from the VERSION file.
- **Date format in CHANGELOG:** `YYYY-MM-DD`
- **Split commits for bisectability** — each commit = one logical change.
- **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done.
- **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
- **Never push without fresh verification evidence.** If code changed after Step 3 tests, re-run before pushing.
- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests.
- **The goal:** the user says `/ship`, and the next thing they see is the review + PR URL.