Files
gstack/ship/SKILL.md.tmpl
T
Garry Tan 9fd03fae9e v1.58.4.0 fix: high-priority community bug wave + PTY plan-mode smoke gate (#2077)
* fix(gbrain): stop forcing GBRAIN_PREPARE on transaction-mode poolers (#1965)

buildGbrainEnv auto-set GBRAIN_PREPARE=true whenever DATABASE_URL targeted
port 6543, and the /sync-gbrain capability check exported it for the rest
of the skill run. Both had the semantics inverted: gbrain auto-disables
prepared statements on transaction-mode poolers because they break every
write there ("prepared statement does not exist"); GBRAIN_PREPARE=true is
gbrain's documented override for SESSION-mode poolers on 6543, not a
requirement for transaction mode. The #1435 search symptom the auto-set
worked around was fixed gbrain-side.

Remove both force-sets. A caller-set GBRAIN_PREPARE (either value) still
passes through untouched, preserving the session-mode-on-6543 escape hatch.
isTransactionModePooler stays exported.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(gbrain): classify probe timeout as its own status; sync proceeds instead of skipping (#1964)

The 5s engine probe misclassified healthy-but-slow engines (cold Supabase
pooler connections measured at 6.9-10.7s) as broken-config, so /sync-gbrain
silently skipped code+memory and told the user their config was malformed.

- New "timeout" status: probe killed at the deadline with no recognized
  stderr pattern. Default deadline is now 15s, overridable via
  GSTACK_GBRAIN_PROBE_TIMEOUT_MS (tests set 300ms against a fake that
  sleeps 2s).
- Sync stages PROCEED on timeout with a stderr warning naming the env knob;
  a genuinely-dead engine surfaces its real error at the first operation
  instead of a false config diagnosis.
- Consistency everywhere "ok" gated behavior: gstack-gbrain-detect --is-ok
  exits 0 on timeout, and gen-skill-docs' detection gate accepts it, so a
  slow engine no longer silently suppresses brain-aware features.
- Status cache: key now includes the effective probe timeout (raising it
  invalidates a cached timeout) and GBRAIN_HOME; config detection honors
  GBRAIN_HOME so relocated-home users stop being misclassified as
  missing-config.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(bins): cygpath-normalize SCRIPT_DIR for bun imports; surface learnings-log errors (#1950)

Under Windows git-bash, pwd yields a POSIX path (/c/Users/...) that Bun on
Windows cannot resolve as an ES module specifier. gstack-learnings-log
interpolates SCRIPT_DIR into a bun -e import, so every invocation died with
"Cannot find module" — and 2>/dev/null swallowed the error, silently
dropping every AI-logged learning for Windows users.

- 3-line cygpath -m guard in gstack-learnings-log and gstack-question-log
  (which gains the same import shape in the next commit). Matches the
  duplicated IS_WINDOWS convention in setup; no shared shell lib exists.
- learnings-log adopts question-log's set +e / TMPERR capture pattern
  wholesale: validation errors now print to stderr. The old
  `if [ $? -ne 0 ]` check was dead code under set -euo pipefail — the
  script exited at the failing assignment before reaching it.
- New test/bin-windows-bun-import-paths.test.ts: static invariant (any
  bash bin interpolating $SCRIPT_DIR into a bun -e import must carry the
  guard) + behavioral end-to-end run invoked via `bash <bin>` — added to
  the windows-free-tests workflow list so the conversion is proven on the
  only platform where the bug exists.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(question-log): dedupe INJECTION_PATTERNS via lib/jsonl-store (#1934)

bin/gstack-question-log carried a local copy of the injection-pattern list,
so pattern fixes to lib/jsonl-store.ts never propagated — including the
/override[:\s]/i false-positive fix arriving via community PR #1940.
Import the shared hasInjection instead (enabled by the previous commit's
cygpath guard). question-log also gets the lib's stricter superset
(human:, disregard, from-now-on, approve-all patterns).

Tests pin the contract in a #1940-order-independent way: an "Override:
ignore all previous instructions" header is rejected, "prose overrides the
deterministic table" is accepted, and a static invariant keeps local
INJECTION_PATTERNS duplicates out of the bin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(security): community-pulse + both dashboards never report fake zeros (#1947)

The security-signaling surface failed open at three layers — every failure
mode read as a reassuring "0 attacks" / "0 installs":

- community-pulse edge function: supabase-js returns {data,error} without
  throwing, and all five queries discarded `error` — a DB outage produced
  real-looking zeros via the SUCCESS path, and the catch (also returning
  zeros with HTTP 200) was unreachable for query failures. Every query now
  destructures and throws; the catch serves the stale cache (marked
  "stale": true) when one exists, else 503 {"error":"pulse_unavailable"}.
  Success responses carry "status":"ok" so clients can distinguish
  authoritative data from legacy backends. NOTE: the edge function deploys
  out-of-band (supabase functions deploy community-pulse).
- gstack-security-dashboard: captures the HTTP status; non-200 / network
  failure / error body / missing section → "unknown — backend error";
  jq missing → "unknown — install jq" (the lossy grep fallback broke on
  nested arrays and under-reported attacks as zero — removed); a 200
  without the new marker shows figures with an "unverified (legacy
  backend)" note. Also fixes a latent display bug: the TOTAL grep matched
  the digit 7 inside "attacks_last_7_days" and misreported every count.
- gstack-community-dashboard: same class — curl || echo "{}" plus
  grep || echo "0" printed "Weekly active installs: 0" on any failure.
  Now "unknown — backend error (HTTP N)".

test/security-dashboard-fallback.test.ts pins the matrix (200+marker,
200-legacy, 503, network failure) x (jq present, jq absent) for both bins:
"unknown" states never render as 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(telemetry): redact error_message spans before they leave the machine (#1947)

error_message was uploaded with only quote/newline escaping — stack traces
and failed-API errors can embed credentials, private paths, and hostnames,
and the sync path strips only _repo_slug/_branch.

New lib/redact-engine.ts export redactFindingSpans(): replaces EVERY
finding's span with <REDACTED-{id}> regardless of tier (applyRedactions is
the interactive PII-only path and exits nonzero on credential findings, so
it can't serve machine egress). Returns null when a span can't be located —
callers drop the whole payload rather than risk a leak.

gstack-telemetry-log pipes error_message through it at LOG time, so the
local JSONL at rest is clean too; surrounding text survives for crash
triage. FAIL CLOSED: bun missing, engine error, or non-JSON-string output
all null the field. Tests pin: embedded ghp_ token → <REDACTED-github.pat>
with context intact; redactor unavailable → null; raw bytes on disk never
contain the token.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(redact): prepush guard fails closed on git failure; /ship owns hook install (#1946)

Two gaps closed:

1. Fail closed. The git() helper returned "" on ANY non-zero exit or
   maxBuffer overflow (status null), addedLinesFor produced an empty
   string, and the push sailed through unscanned — fail-open on exactly
   the oversized-diff case where a large secret-bearing blob is most
   likely. The diff call now uses a strict variant that throws; main
   blocks with a clear message naming the GSTACK_REDACT_PREPUSH=skip
   escape valve. Probe calls (symbolic-ref, rev-parse, merge-base) keep
   the permissive helper — their failures are normal control flow.

2. Install path. The hook was installed by nothing ("opt-in, installed by
   nothing" was the issue's words). ./setup runs in the gstack checkout —
   the wrong repo for a per-project hook — so it gets a one-line hint
   only. /ship owns per-repo install: config redact_prepush_hook=true +
   hook missing → silent install (consent already given); config unset +
   no ~/.gstack/.redact-prepush-prompted marker → one-time machine-wide
   AskUserQuestion offer, answer persisted. ship/SKILL.md regenerated in
   this same commit (check-freshness bisect discipline).

Tests: unscannable diff (bogus SHAs) → exit 1 + valve named; empty-but-
successful diff → exit 0; static asserts pin setup as hint-only and the
ship template as the installer surface.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(redact): six new credential patterns — GitLab, HuggingFace, npm, DigitalOcean, Bearer, GCP SA (#1946)

Coverage gaps from the #1946 security review, including token types for
tooling gstack itself drives (glab):

HIGH (block): gitlab.token (glpat-/glptt-/gldt-), huggingface.token (hf_),
npm.token (npm_), digitalocean.token (dop_v1_), gcp.service_account (the
JSON-escaped "private_key" form that dodges pem.private_key's literal-block
match when minified, confirmed by "private_key_id" proximity).

MEDIUM (warn): auth.bearer — the most FP-prone shape in the set (docs are
full of "Authorization: Bearer <token>"), so it requires header-context
proximity and the same entropy>=3.0 + placeholder validator recipe as
env.kv. "Bearer YOUR_TOKEN_HERE" never fires; calibration over coverage,
per the cries-wolf principle.

All shapes are linear-time; test/redact-pattern-lint.test.ts covers them
automatically. Engine tests add positive + placeholder-negative cases per
pattern.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: coverage-audit additions for the fix wave

Ship Step 7 gap-fill (all passing, 248 tests across the touched suites):
memory + dream stage probe-timeout proceeds, gbrain-detect override paths,
stale-flag passthrough, 200-body-missing-.security fail-closed case,
telemetry redaction edges, and credential-pattern edge cases.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: pre-landing review fixes

Review army findings (1 critical, auto-fixed with regression tests):

- CRITICAL (security specialist, verified live): redactFindingSpans spliced
  only the regex capture span, and pem.private_key / gcp.service_account
  capture just the BEGIN-header — the key body survived "redaction" and
  shipped via telemetry. Marker-only patterns now drop the whole payload
  (null, fail closed). Overlapping spans (Bearer+JWT on the same bytes) are
  coalesced before splicing so stale offsets can't leave partial secret
  bytes behind.
- gitStrict: drop the dead `|| r.status === null` disjunct (null !== 0
  already covers it); add the signal-kill/null-status regression test the
  docstring promised.
- security-dashboard human mode flags stale snapshots ("figures may be out
  of date") instead of presenting frozen counts as current.
- community-dashboard marker check uses jq when available — the grep-only
  variant misclassified whitespaced/reserialized bodies as legacy.
- telemetry fail-closed test now shadows bun with a failing stub
  (deterministic on any host layout); stale "five status cases" describe
  title renamed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: adversarial review fixes (Claude + Codex cross-model passes)

Both adversarial passes ran against the wave; every FIXABLE finding landed
with a regression test:

- probeTimeoutMs clamps to >=1ms: a fractional override floored to 0, and
  execFileSync treats timeout:0 as NO timeout — the probe that exists to
  bound hangs could hang forever (found by both models independently).
- /ship silent hook install now requires the hooks dir to live inside
  .git: with core.hooksPath (husky's COMMITTED .husky/), the chaining
  installer would have renamed the team's committed pre-push and written a
  machine-local wrapper into the working tree (found by both models).
- gstack-config gbrain-refresh accepts the "timeout" status — the last
  consumer still gating on literal "ok" (Codex); gstack-gbrain-detect's
  config-derived fields honor GBRAIN_HOME so the detection JSON can't
  report status ok alongside config_exists false (Codex).
- prepush: a remote sha absent locally (shallow clone / stale fetch) falls
  back to the merge-base/empty-tree range — scans MORE, never blocks a
  legitimate push into training users toward --no-verify.
- dashboards: curl's own 000 no longer doubles to "HTTP 000000"; the
  community dashboard flags stale snapshots like the security one; array
  sections parse via jq (the sed/grep loops truncated at the first ']');
  the no-jq marker grep tolerates whitespace.
- telemetry: multi-line redactor output nulls the field instead of
  corrupting the JSONL record; setup's hint fires only when the config key
  is genuinely unset (an explicit false is a recorded decline); the /ship
  prompt marker honors GSTACK_HOME.

Kept as designed (cross-model tension noted): Bearer stays MEDIUM in the
prepush gate — a HIGH Bearer would block every docs example; the entropy
validator can't eliminate that FP class, and MEDIUM warns visibly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: bump version and changelog (v1.57.11.0)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: P1 TODO — eval harness live progress + incremental persistence

Root-caused during this ship: a killed eval run was indistinguishable from a
healthy one for hours (per-file output buffering across mega test files, no
incremental eval-store writes, no honest liveness signal). Full context and
starting points in the entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: fix operational-learning E2E fixture — copy lib/jsonl-store.ts

Pre-existing breakage, proven on main: gstack-learnings-log has imported
lib/jsonl-store.ts (shared injection patterns) since v1.57.5.0 / #1910, but
the fixture copies only the bin scripts — the bin exits 1 before writing
anything, on main silently (stderr swallowed) and on this branch loudly
(the #1950 error-surfacing made the four-day-old failure visible). A real
install always ships bin/ and lib/ together; the fixture now does too.
Verified: the fixture-shaped invocation writes the learning (exit 0) with
lib present, exits 1 on both main and this branch without it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(ios-qa): isolate E2E tests under --concurrent (3 real races)

The ios-qa E2E file failed intermittently under `bun test --concurrent`
(the eval harness default). Three distinct shared-state races, all fixed:

1. Shared pidfile: a module-level `workDir` reassigned in beforeEach was
   clobbered by parallel tests, so concurrent daemons collided on the same
   pidfile and the loser returned `already_running`. Each test now gets its
   own dir via makeWorkDir().
2. process.env path globals: tests set GSTACK_IOS_AUDIT_PATH /
   _ATTEMPTS_PATH / _ALLOWLIST_PATH on the shared process env; concurrent
   tests stomped each other's audit/attempts destinations. Threaded
   auditPath/attemptsPath/allowlistPath through DaemonOptions (and
   mintForCaller) as explicit args — env is no longer load-bearing.
3. afterEach cleanup race: the per-test cleanup drained a shared dir array,
   so the first test to finish deleted still-running tests' workDirs
   mid-assertion. Moved to afterAll (cleans once, after all settle).

Verified: 5/5 clean full-suite runs at --max-concurrency 15 (was
intermittent); daemon unit suite 91/91; daemon source compiles. The paths
default to the env-derived locations when options are omitted, so the
production CLI path is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(pty): pin spawned claude to EVALS model chain (default claude-sonnet-4-6)

launchClaudePty spawned the interactive `claude` TUI with no --model flag, so
the child inherited the operator's ~/.claude/settings.json model. On a
slow-thinking model that meant 5+ min of extended thinking on empty plan-mode
context, timing out the plan-mode smoke tests regardless of contention. Pin the
model via opts.model ?? EVALS_MODEL ?? 'claude-sonnet-4-6' — byte-identical to
session-runner.ts:144, so PTY and `claude -p` evals always agree.

Pushed before extraArgs (last flag wins, so a per-test --model still overrides).
Placement leaves the spawn region byte-stable for a clean merge with the
in-flight hermetic-env branch. Plumbed model through the three plan-skill
wrappers. Static-grep tripwires guard the pin, its fallback chain, the
before-extraArgs ordering, and all three wrapper forwards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(pty): detect markdown bold-bullet prose AUQs (fixes office-hours smoke)

office-hours auto-mode renders its mode question as `- **Building a startup**`
markdown bullets (office-hours/SKILL.md.tmpl:102) with no letter/number marker.
isProseAUQVisible only matched `A)`-style lettered or `1.`-style numbered
options, so the question went undetected: the model surfaced it at ~2m19s
(well under the 300s budget) but the harness kept scoring the run "working"
off the spinner glyphs and timed out — a false timeout on a question that was
already on screen.

Add Pattern 3: when an interrogative line ('?') is present AND 3+ bold-bullet
markers (`- **`) appear in the 4KB tail, classify as a prose AUQ. Bold is the
discriminator vs incidental prose bullets; the line anchor is dropped (stripAnsi
can collapse option lines) and the existing `❯ 1.` cursor gate still defers to a
live native list. Wires through the existing classifyVisible 'asked' path and the
timeout high-water-mark, so office-hours now classifies 'asked' instead of
'timeout'. Five unit cases: the office-hours render passes; no-'?', <3-bullet,
plain-bullet, and native-cursor cases stay false.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(pty): detect stripAnsi-collapsed prose AUQs + judge spinner-precedence

The plan-eng/plan-design plan-mode + finding-floor smokes timed out even when
the skill HAD rendered a complete prose AskUserQuestion and was waiting: the PTY
strips cursor-positioning escapes, collapsing the option newlines/spaces so
"A) ..." arrives as "A(recommended)" / "-B:" and "Reply with A, B, or C" as
"ReplywithA,B,orC". Every line-anchored detector (Patterns 1-3) returns false on
those bytes, so proseAUQEverObserved never latched and the run timed out on a
question that was already on screen.

Add Pattern 4/5: a two-signal collapsed-form detector — a reply/recommendation
marker (space-insensitive "reply with [A-D]", "Recommendation:", or
"(recommended)") AND 2+ distinct A-D letters each punctuated by ) : or (. The
conjunction is what separates a real AUQ from incidental report prose; verified
true on the verbatim failing-run buffers where Patterns 1-3 return false.

Also fix the Haiku judge spinner bias: of 614 verdicts, 569 were 'working' and
95 of those noted a question was visible — Claude Code keeps the spinner
animating at an idle prose decision, so the judge coin-flipped. Add a precedence
override: when an option list AND a Recommendation/Reply instruction are both
visible, classify WAITING even with spinner glyphs. Kept the strict dual-signal
gate (never option-list-alone) so auto-decide-preserved doesn't flip.

5 unit tests pin the two-signal contract (2 true on real collapsed bytes, 3
false guards). 90 -> 95 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(plan-review): ask-first scope gate for plan-eng + plan-design review

On an empty/cold invocation, plan-eng-review and plan-design-review would dive
straight into repo exploration (plan-eng) or a 7-pass mockup+audit (plan-design)
and only ask the user much later, if at all. plan-ceo-review already asks first
via an unconditional Step-0 gate and behaves well; these two did not.

Add a hard-STOP scope gate as the FIRST operational instruction in each skill
(above the design-doc check / pre-review audit / mockup defaults it explicitly
overrides): the first tool call must be AskUserQuestion confirming the review
target, before any git/Read/Grep/Glob/Bash or mockup generation. Under
--disallowedTools the options render as plain column-0 lettered prose with a
Recommendation + "Reply with A, B, or C" line so the answer is detectable.

This is correct cold-start UX (confirm what to review before grinding a full
review on nothing) and it is the product half of the plan-mode smoke fix; the
harness collapsed-form detector is the deterministic half that catches the ask
however it renders. Templates + regenerated SKILL.md (default variant).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(tiers): reclassify stochastic plan-eng/plan-design ask-first smokes as periodic

plan-eng-review and plan-design-review run a long explore/audit before their
first AskUserQuestion, so whether the plan-mode + finding-floor smokes reach a
terminal outcome within the 300s/600s budget depends on stochastic ask-first
compliance (measured ~50-67%/run even with the hardened gate). Per the
"non-deterministic -> periodic" tiering rule, move the four affected smokes
(plan-eng/plan-design review-plan-mode + finding-floor) to periodic.

The deterministic harness fix (collapsed-form detector + judge precedence) and
the ask-first gate lift these from always-failing to mostly-passing and are the
real product+harness improvements; periodic monitoring tracks the rate weekly
without blocking PRs on an LLM coin-flip. plan-ceo/plan-devex ask-first reliably
and stay gate-tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(evals): gate the deterministic PTY plan-mode smokes in CI

The real-PTY plan-mode smokes never ran in CI — the gate was local-only. Add an
e2e-pty-plan-smoke matrix suite running the two deterministically-reliable ones
(office-hours-auto-mode, plan-mode-no-op) so a regression there blocks PRs. The
stochastic plan-eng/plan-design ask-first smokes stay periodic (touchfiles
E2E_TIERS) and are not CI-gated.

A fresh CI container has no ~/.claude.json, so the spawned interactive `claude`
would wedge on the onboarding + API-key-approval dialog. Add a scoped seed step
(hasCompletedOnboarding + key approval, its own ANTHROPIC_API_KEY env) before the
run — mirrors what the hermetic E2E child env seeds. Per-suite timeout override
(35 min) via matrix.suite.timeout so the PTY suite has headroom for --retry 2
without bumping the other 12 suites. Report runner count 12 -> 13.

Validate via workflow_dispatch before relying on the gate (PTY-in-CI is new).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(evals): install gstack skill registry for the PTY smoke suite

The first dry-run of e2e-pty-plan-smoke failed: the spawned interactive `claude`
printed "Unknown command: /plan-ceo-review". .claude/skills is gitignored, so a
fresh CI checkout has no gstack skill registry and the TUI can't resolve
/office-hours or /plan-ceo-review.

Add a Register step (scoped to the suite, after Seed, before Run) that mirrors
setup's --no-prefix user-scoped registry minimally: $HOME/.claude/skills/gstack
-> repo (resolves the preambles' absolute ~/.claude/skills/gstack/bin/* and
<skill>/sections/* paths) + per-skill SKILL.md/sections symlinks for the two
skills these tests invoke. HOME is /github/home in this container and the runner
adds no HOME/CLAUDE_CONFIG_DIR override (no hermetic mode), so $HOME is the right
anchor — the Seed step already proved claude reads it. No ./setup (binary build
+ Chromium + fonts + /dev/tty prompt); SKILL.md + bin/ + sections/ are committed.

Self-validating: fails the step loudly on a dangling symlink or missing
`name:` frontmatter, so a moved target surfaces here instead of as a silent
35-min "Unknown command" timeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.58.4.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-21 07:15:19 -07:00

538 lines
25 KiB
Cheetah

---
name: ship
preamble-tier: 4
version: 1.0.0
description: |
Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION,
update CHANGELOG, commit, push, create PR. Use when asked to "ship", "deploy",
"push to main", "create a PR", "merge and push", or "get it deployed".
Proactively invoke this skill (do NOT push/PR directly) when the user says code
is ready, asks about deploying, wants to push code up, or asks to create a PR. (gstack)
allowed-tools:
- Bash
- Read
- Write
- Edit
- Grep
- Glob
- Agent
- AskUserQuestion
- WebSearch
sensitive: true
triggers:
- ship it
- create a pr
- push to main
- deploy this
---
{{PREAMBLE}}
{{BASE_BRANCH_DETECT}}
{{GBRAIN_CONTEXT_LOAD}}
# Ship: Fully Automated Ship Workflow
You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship` which means DO IT. Run straight through and output the PR URL at the end.
**Only stop for:**
- On the base branch (abort)
- Merge conflicts that can't be auto-resolved (stop, show conflicts)
- In-branch test failures (pre-existing failures are triaged, not auto-blocking)
- Pre-landing review finds ASK items that need user judgment
- MINOR or MAJOR version bump needed (ask — see Step 12)
- Greptile review comments that need user decision (complex fixes, false positives)
- AI-assessed coverage below minimum threshold (hard gate with user override — see Step 7)
- Plan items NOT DONE with no user override (see Step 8)
- Plan verification failures (see Step 8.1)
- TODOS.md missing and user wants to create one (ask — see Step 14)
- TODOS.md disorganized and user wants to reorganize (ask — see Step 14)
**Never stop for:**
- Uncommitted changes (always include them)
- Version bump choice (auto-pick MICRO or PATCH — see Step 12)
- CHANGELOG content (auto-generate from diff)
- Commit message approval (auto-commit)
- Multi-file changesets (auto-split into bisectable commits)
- TODOS.md completed-item detection (auto-mark)
- Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically)
- Test coverage gaps within target threshold (auto-generate and commit, or flag in PR body)
**Re-run behavior (idempotency):**
Re-running `/ship` means "run the whole checklist again." Every verification step
(tests, coverage audit, plan completion, pre-landing review, adversarial review,
VERSION/CHANGELOG check, TODOS, document-release) runs on every invocation.
Only *actions* are idempotent:
- Step 12: If VERSION already bumped, skip the bump but still read the version
- Step 17: If already pushed, skip the push command
- Step 19: If PR exists, update the body instead of creating a new PR
Never skip a verification step because a prior `/ship` run already performed it.
---
{{SECTION_INDEX:ship}}
---
## Step 1: Pre-flight
1. Check the current branch. If on the base branch or the repo's default branch, **abort**: "You're on the base branch. Ship from a feature branch."
2. Run `git status` (never use `-uall`). Uncommitted changes are always included — no need to ask.
3. Run `git diff <base>...HEAD --stat` and `git log <base>..HEAD --oneline` to understand what's being shipped.
4. Check review readiness:
{{REVIEW_DASHBOARD}}
If the Eng Review is NOT "CLEAR":
Print: "No prior eng review found — ship will run its own pre-landing review in Step 9."
Check diff size: `git diff <base>...HEAD --stat | tail -1`. If the diff is >200 lines, add: "Note: This is a large diff. Consider running `/plan-eng-review` or `/autoplan` for architecture-level review before shipping."
If CEO Review is missing, mention as informational ("CEO Review not run — recommended for product changes") but do NOT block.
For Design Review: run `source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)`. If `SCOPE_FRONTEND=true` and no design review (plan-design-review or design-review-lite) exists in the dashboard, mention: "Design Review not run — this PR changes frontend code. The lite design check will run automatically in Step 9, but consider running /design-review for a full visual audit post-implementation." Still never block.
Continue to Step 2 — do NOT block or ask. Ship runs its own review in Step 9.
---
## Step 2: Distribution Pipeline Check
If the diff introduces a new standalone artifact (CLI binary, library package, tool) — not a web
service with existing deployment — verify that a distribution pipeline exists.
1. Check if the diff adds a new `cmd/` directory, `main.go`, or `bin/` entry point:
```bash
git diff origin/<base> --name-only | grep -E '(cmd/.*/main\.go|bin/|Cargo\.toml|setup\.py|package\.json)' | head -5
```
2. If new artifact detected, check for a release workflow:
```bash
ls .github/workflows/ 2>/dev/null | grep -iE 'release|publish|dist'
grep -qE 'release|publish|deploy' .gitlab-ci.yml 2>/dev/null && echo "GITLAB_CI_RELEASE"
```
3. **If no release pipeline exists and a new artifact was added:** Use AskUserQuestion:
- "This PR adds a new binary/tool but there's no CI/CD pipeline to build and publish it.
Users won't be able to download the artifact after merge."
- A) Add a release workflow now (CI/CD release pipeline — GitHub Actions or GitLab CI depending on platform)
- B) Defer — add to TODOS.md
- C) Not needed — this is internal/web-only, existing deployment covers it
4. **If release pipeline exists:** Continue silently.
5. **If no new artifact detected:** Skip silently.
---
## Step 3: Merge the base branch (BEFORE tests)
Fetch and merge the base branch into the feature branch so tests run against the merged state:
```bash
git fetch origin <base> && git merge origin/<base> --no-edit
```
**If there are merge conflicts:** Try to auto-resolve if they are simple (VERSION, schema.rb, CHANGELOG ordering). If conflicts are complex or ambiguous, **STOP** and show them.
**If already up to date:** Continue silently.
---
{{SECTION:tests}}
{{SECTION:test-coverage}}
{{SECTION:plan-completion}}
{{SECTION:review-army}}
{{SECTION:greptile}}
{{SECTION:adversarial}}
## Step 12: Version bump (auto-decide)
The deterministic version-state logic is the tested **`gstack-version-bump`** CLI
(classify / write / repair). The bump-LEVEL decision and queue-collision handling
stay agent judgment; the slot pick stays `gstack-next-version`.
1. **Classify state** — pure reader, never writes:
```bash
bun run ~/.claude/skills/gstack/bin/gstack-version-bump classify --base <base>
```
Read the JSON `state` and dispatch:
- **FRESH** → do the bump (steps 2-4).
- **ALREADY_BUMPED** → skip the bump, but run the queue-drift check (step 3) with the reported `currentVersion`. If the queue moved (next free version differs), **AskUserQuestion**: rebump to the new version (rewrites CHANGELOG header + PR title) or keep current (CI version-gate will reject until resolved).
- **DRIFT_STALE_PKG** → run `gstack-version-bump repair` (syncs package.json to VERSION). No re-bump; reuse `currentVersion` for CHANGELOG + PR.
- **DRIFT_UNEXPECTED** → **STOP**. package.json disagrees with VERSION while VERSION matches base — a manual edit bypassed /ship. Reconcile manually, then re-run.
2. **Decide the bump level** from the diff (agent judgment):
- **MICRO**: <50 lines, trivial tweaks/config. **PATCH**: 50+ lines, no feature signals.
- **MINOR**: **ASK** if any feature signal (new route/page, migration, new module), OR 500+ lines. **MAJOR**: **ASK** — milestones or breaking changes only.
Save as `BUMP_LEVEL`. The level is the user-intended bump; queue-aware placement may advance the slot without changing the level.
3. **Queue-aware pick** (workspace-aware ship):
```bash
QUEUE_JSON=$(bun run ~/.claude/skills/gstack/bin/gstack-next-version --base <base> --bump "$BUMP_LEVEL" --current-version "$BASE_VERSION" 2>/dev/null || echo '{"offline":true}')
NEW_VERSION=$(echo "$QUEUE_JSON" | jq -r '.version // empty')
```
If `offline`/util fails: fall back to local `BUMP_LEVEL` arithmetic and print `⚠ workspace-aware ship offline — using local bump only`. If `claimed` is non-empty, render the queue table so the user sees landing order. If an active sibling workspace holds a version `>= NEW_VERSION`, **AskUserQuestion**: advance past (unrelated work) or abort and sync with the sibling.
4. **Write the bump** (FRESH, or an approved rebump):
```bash
bun run ~/.claude/skills/gstack/bin/gstack-version-bump write --version "$NEW_VERSION"
```
The CLI validates the 4-digit `MAJOR.MINOR.PATCH.MICRO` pattern and writes **both** VERSION and package.json. On a half-write (VERSION written, package.json failed) it exits 3 — re-run, and classify will report DRIFT_STALE_PKG for `repair` to fix.
5. **Record the release decision** (durable cross-session memory). The bump level is a real decision the next session should not re-derive blind:
```bash
~/.claude/skills/gstack/bin/gstack-decision-log '{"decision":"Ship NEW_VERSION (BUMP_LEVEL)","rationale":"WHY","scope":"repo","source":"skill","confidence":9}' 2>/dev/null || true
```
Substitute `NEW_VERSION`, `BUMP_LEVEL`, and a one-line `WHY` (the signal that set the level: diff scale, a new feature, a breaking change). Best-effort and non-interactive; never blocks the ship. Skip on the ALREADY_BUMPED path (the decision was logged on the run that did the bump).
{{SECTION:changelog}}
## Step 14: TODOS.md (auto-update)
Cross-reference the project's TODOS.md against the changes being shipped. Mark completed items automatically; prompt only if the file is missing or disorganized.
Read `.claude/skills/review/TODOS-format.md` for the canonical format reference.
**1. Check if TODOS.md exists** in the repository root.
**If TODOS.md does not exist:** Use AskUserQuestion:
- Message: "GStack recommends maintaining a TODOS.md organized by skill/component, then priority (P0 at top through P4, then Completed at bottom). See TODOS-format.md for the full format. Would you like to create one?"
- Options: A) Create it now, B) Skip for now
- If A: Create `TODOS.md` with a skeleton (# TODOS heading + ## Completed section). Continue to step 3.
- If B: Skip the rest of Step 14. Continue to Step 15.
**2. Check structure and organization:**
Read TODOS.md and verify it follows the recommended structure:
- Items grouped under `## <Skill/Component>` headings
- Each item has `**Priority:**` field with P0-P4 value
- A `## Completed` section at the bottom
**If disorganized** (missing priority fields, no component groupings, no Completed section): Use AskUserQuestion:
- Message: "TODOS.md doesn't follow the recommended structure (skill/component groupings, P0-P4 priority, Completed section). Would you like to reorganize it?"
- Options: A) Reorganize now (recommended), B) Leave as-is
- If A: Reorganize in-place following TODOS-format.md. Preserve all content — only restructure, never delete items.
- If B: Continue to step 3 without restructuring.
**3. Detect completed TODOs:**
This step is fully automatic — no user interaction.
Use the diff and commit history already gathered in earlier steps:
- `git diff <base>...HEAD` (full diff against the base branch)
- `git log <base>..HEAD --oneline` (all commits being shipped)
For each TODO item, check if the changes in this PR complete it by:
- Matching commit messages against the TODO title and description
- Checking if files referenced in the TODO appear in the diff
- Checking if the TODO's described work matches the functional changes
**Be conservative:** Only mark a TODO as completed if there is clear evidence in the diff. If uncertain, leave it alone.
**4. Move completed items** to the `## Completed` section at the bottom. Append: `**Completed:** vX.Y.Z (YYYY-MM-DD)`
**5. Output summary:**
- `TODOS.md: N items marked complete (item1, item2, ...). M items remaining.`
- Or: `TODOS.md: No completed items detected. M items remaining.`
- Or: `TODOS.md: Created.` / `TODOS.md: Reorganized.`
**6. Defensive:** If TODOS.md cannot be written (permission error, disk full), warn the user and continue. Never stop the ship workflow for a TODOS failure.
Save this summary — it goes into the PR body in Step 19.
---
## Step 15: Commit (bisectable chunks)
### Step 15.0: WIP Commit Squash (continuous checkpoint mode only)
If `CHECKPOINT_MODE` is `"continuous"`, the branch likely contains `WIP:` commits
from auto-checkpointing. These must be squashed INTO the corresponding logical
commits before the bisectable-grouping logic in Step 15.1 runs. Non-WIP commits
on the branch (earlier landed work) must be preserved.
**Detection:**
```bash
WIP_COUNT=$(git log <base>..HEAD --oneline --grep="^WIP:" 2>/dev/null | wc -l | tr -d ' ')
echo "WIP_COMMITS: $WIP_COUNT"
```
If `WIP_COUNT` is 0: skip this sub-step entirely.
If `WIP_COUNT` > 0, collect the WIP context first so it survives the squash:
```bash
# Export [gstack-context] blocks from all WIP commits on this branch.
# This file becomes input to the CHANGELOG entry and may inform PR body context.
mkdir -p "$(git rev-parse --show-toplevel)/.gstack"
git log <base>..HEAD --grep="^WIP:" --format="%H%n%B%n---END---" > \
"$(git rev-parse --show-toplevel)/.gstack/wip-context-before-squash.md" 2>/dev/null || true
```
**Non-destructive squash strategy:**
`git reset --soft <merge-base>` WOULD uncommit everything including non-WIP commits.
DO NOT DO THAT. Instead, use `git rebase` scoped to filter WIP commits only.
Option 1 (preferred, if there are non-WIP commits mixed in):
```bash
# Interactive rebase with automated WIP squashing.
# Mark every WIP commit as 'fixup' (drop its message, fold changes into prior commit).
git rebase -i $(git merge-base HEAD origin/<base>) \
--exec 'true' \
-X ours 2>/dev/null || {
echo "Rebase conflict. Aborting: git rebase --abort"
git rebase --abort
echo "STATUS: BLOCKED — manual WIP squash required"
exit 1
}
```
Option 2 (simpler, if the branch is ALL WIP commits so far — no landed work):
```bash
# Branch contains only WIP commits. Reset-soft is safe here because there's
# nothing non-WIP to preserve. Verify first.
NON_WIP=$(git log <base>..HEAD --oneline --invert-grep --grep="^WIP:" 2>/dev/null | wc -l | tr -d ' ')
if [ "$NON_WIP" -eq 0 ]; then
git reset --soft $(git merge-base HEAD origin/<base>)
echo "WIP-only branch, reset-soft to merge base. Step 15.1 will create clean commits."
fi
```
Decide at runtime which option applies. If unsure, prefer stopping and asking the
user via AskUserQuestion rather than destroying non-WIP commits.
**Anti-footgun rules:**
- NEVER blind `git reset --soft` if there are non-WIP commits. Codex flagged this
as destructive — it would uncommit real landed work and turn the push step into
a non-fast-forward push for anyone who already pushed.
- Only proceed to Step 15.1 after WIP commits are successfully squashed/absorbed
or the branch has been verified to contain only WIP work.
### Step 15.1: Bisectable Commits
**Goal:** Create small, logical commits that work well with `git bisect` and help LLMs understand what changed.
1. Analyze the diff and group changes into logical commits. Each commit should represent **one coherent change** — not one file, but one logical unit.
2. **Commit ordering** (earlier commits first):
- **Infrastructure:** migrations, config changes, route additions
- **Models & services:** new models, services, concerns (with their tests)
- **Controllers & views:** controllers, views, JS/React components (with their tests)
- **VERSION + CHANGELOG + TODOS.md:** always in the final commit
3. **Rules for splitting:**
- A model and its test file go in the same commit
- A service and its test file go in the same commit
- A controller, its views, and its test go in the same commit
- Migrations are their own commit (or grouped with the model they support)
- Config/route changes can group with the feature they enable
- If the total diff is small (< 50 lines across < 4 files), a single commit is fine
4. **Each commit must be independently valid** — no broken imports, no references to code that doesn't exist yet. Order commits so dependencies come first.
5. Compose each commit message:
- First line: `<type>: <summary>` (type = feat/fix/chore/refactor/docs)
- Body: brief description of what this commit contains
- Only the **final commit** (VERSION + CHANGELOG) gets the version tag and co-author trailer:
```bash
git commit -m "$(cat <<'EOF'
chore: bump version and changelog (vX.Y.Z.W)
{{CO_AUTHOR_TRAILER}}
EOF
)"
```
---
## Step 16: Verification Gate
**IRON LAW: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE.**
Before pushing, re-verify if code changed during Steps 4-6:
1. **Test verification:** If ANY code changed after Step 5's test run (fixes from review findings, CHANGELOG edits don't count), re-run the test suite. Paste fresh output. Stale output from Step 5 is NOT acceptable.
2. **Build verification:** If the project has a build step, run it. Paste output.
3. **Rationalization prevention:**
- "Should work now" → RUN IT.
- "I'm confident" → Confidence is not evidence.
- "I already tested earlier" → Code changed since then. Test again.
- "It's a trivial change" → Trivial changes break production.
**If tests fail here:** STOP. Do not push. Fix the issue and return to Step 5.
Claiming work is complete without verification is dishonesty, not efficiency.
---
## Step 17: Push
**Credential pre-push guard (#1946) — run before the push:**
```bash
_REDACT_PREPUSH=$(~/.claude/skills/gstack/bin/gstack-config get redact_prepush_hook 2>/dev/null || echo "false")
_HOOK_PATH=$(git rev-parse --git-path hooks/pre-push 2>/dev/null || echo "")
_HOOK_INSTALLED="no"
[ -n "$_HOOK_PATH" ] && [ -f "$_HOOK_PATH" ] && grep -q "gstack-redact" "$_HOOK_PATH" 2>/dev/null && _HOOK_INSTALLED="yes"
# Custom hooks dirs (core.hooksPath — e.g. husky's COMMITTED .husky/) must
# never get a silent install: the chaining installer would rename the team's
# committed hook and write a machine-local wrapper into the working tree.
_HOOKS_DIR=$(git rev-parse --git-path hooks 2>/dev/null || echo "")
_GIT_DIR=$(git rev-parse --absolute-git-dir 2>/dev/null || echo "")
_HOOKS_IN_GIT_DIR="no"
case "$_HOOKS_DIR" in
"$_GIT_DIR"/*|hooks|.git/hooks) _HOOKS_IN_GIT_DIR="yes" ;;
esac
_PREPUSH_PROMPTED=$([ -f "${GSTACK_HOME:-$HOME/.gstack}/.redact-prepush-prompted" ] && echo "yes" || echo "no")
echo "REDACT_PREPUSH: $_REDACT_PREPUSH"
echo "HOOK_INSTALLED: $_HOOK_INSTALLED"
echo "HOOKS_IN_GIT_DIR: $_HOOKS_IN_GIT_DIR"
echo "PREPUSH_PROMPTED: $_PREPUSH_PROMPTED"
```
Branch on the echoed values:
1. **`REDACT_PREPUSH: true` and `HOOK_INSTALLED: no` and `HOOKS_IN_GIT_DIR: yes`** —
consent already given; install silently (no question) and continue:
```bash
~/.claude/skills/gstack/bin/gstack-redact install-prepush-hook
```
If `HOOKS_IN_GIT_DIR: no` (husky or another committed hooks dir), do NOT
install silently — print one line: "redact pre-push guard not installed:
this repo uses a custom core.hooksPath; run
`gstack-redact install-prepush-hook` manually if you want it chained."
2. **`REDACT_PREPUSH` not true AND `PREPUSH_PROMPTED: no`** — one-time
offer (fires once EVER, machine-wide). AskUserQuestion:
> gstack can install a per-repo git pre-push hook that blocks pushes
> containing credentials (API keys, tokens, private keys). It's a
> guardrail, not enforcement — `GSTACK_REDACT_PREPUSH=skip` bypasses it.
> Install it for repos you ship from?
Options:
- A) Yes — install the credential guard (recommended)
- B) No — never ask again
If A: run `~/.claude/skills/gstack/bin/gstack-config set redact_prepush_hook true`
then `~/.claude/skills/gstack/bin/gstack-redact install-prepush-hook`.
If B: run `~/.claude/skills/gstack/bin/gstack-config set redact_prepush_hook false`.
ALWAYS (after either answer, but NOT if the question itself failed to
render — a failed AskUserQuestion must re-offer next time):
```bash
touch "${GSTACK_HOME:-$HOME/.gstack}/.redact-prepush-prompted"
```
3. **Anything else** (declined earlier, or already installed) — continue
without comment.
**Idempotency check:** Check if the branch is already pushed and up to date.
```bash
git fetch origin <branch-name> 2>/dev/null
LOCAL=$(git rev-parse HEAD)
REMOTE=$(git rev-parse origin/<branch-name> 2>/dev/null || echo "none")
echo "LOCAL: $LOCAL REMOTE: $REMOTE"
[ "$LOCAL" = "$REMOTE" ] && echo "ALREADY_PUSHED" || echo "PUSH_NEEDED"
```
If `ALREADY_PUSHED`, skip the push but continue to Step 18. Otherwise push with upstream tracking:
```bash
git push -u origin <branch-name>
```
**You are NOT done.** The code is pushed but documentation sync and PR creation are mandatory final steps. Continue to Step 18.
---
**PR/MR title invariant (always applies — do not skip even if you don't open the section below):** Any PR or MR you create OR update in the next step MUST have a title that starts with `v$NEW_VERSION` (the version bumped in Step 12), in the format `v<NEW_VERSION> <type>: <summary>`. Never create or edit a PR/MR title without this prefix. Compute the correct title with the single source of truth helper: `~/.claude/skills/gstack/bin/gstack-pr-title-rewrite.sh "$NEW_VERSION" "<current title>"`. The full create/update procedure (idempotency, redaction scan, self-check) is in the section below.
{{SECTION:pr-body}}
## Step 20: Persist ship metrics
Log coverage and plan completion data so `/retro` can track trends:
```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)" && mkdir -p ~/.gstack/projects/$SLUG
```
Append to `~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl`:
```bash
echo '{"skill":"ship","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","coverage_pct":COVERAGE_PCT,"plan_items_total":PLAN_TOTAL,"plan_items_done":PLAN_DONE,"verification_result":"VERIFY_RESULT","version":"VERSION","branch":"BRANCH"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
```
Substitute from earlier steps:
- **COVERAGE_PCT**: coverage percentage from Step 7 diagram (integer, or -1 if undetermined)
- **PLAN_TOTAL**: total plan items extracted in Step 8 (0 if no plan file)
- **PLAN_DONE**: count of DONE + CHANGED items from Step 8 (0 if no plan file)
- **VERIFY_RESULT**: "pass", "fail", or "skipped" from Step 8.1
- **VERSION**: from the VERSION file
- **BRANCH**: current branch name
This step is automatic — never skip it, never ask for confirmation.
---
## Step 21: Plan-tune discoverability nudge (first-successful-ship only)
Plan-tune cathedral T15. After a successful ship, surface /plan-tune once
per machine. Single line, non-blocking, marker-gated so it never re-fires.
```bash
_NUDGE_MARKER="$HOME/.gstack/.plan-tune-nudge-shown"
_QT=$(~/.claude/skills/gstack/bin/gstack-config get question_tuning 2>/dev/null || echo "false")
if [ ! -f "$_NUDGE_MARKER" ] && [ "$_QT" = "false" ]; then
echo ""
echo "gstack can learn from your AskUserQuestion answers. Run /plan-tune to opt in"
echo "— it captures which prompts you find valuable vs noisy and (with hooks installed)"
echo "auto-decides your never-ask preferences."
touch "$_NUDGE_MARKER"
fi
```
If the marker exists, OR question_tuning is already on, the nudge is a
no-op. The marker guarantees at-most-once per machine. To re-enable:
`rm ~/.gstack/.plan-tune-nudge-shown` before next ship.
---
## Section self-check (before you finish)
You ran a carved skill. For your situation, list every section the Section index
named as applying, and confirm you issued a Read for each one. If you executed any
of those steps from memory without reading its section, you skipped the source of
truth — STOP, Read it now, and redo that step. Deterministic version work goes
through `gstack-version-bump`; never hand-roll the VERSION/package.json write.
---
## Important Rules
- **Never skip tests.** If tests fail, stop.
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
- **Never force push.** Use regular `git push` only.
- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), and Codex structured review [P1] findings (large diffs only).
- **Always use the 4-digit version format** from the VERSION file.
- **Date format in CHANGELOG:** `YYYY-MM-DD`
- **Split commits for bisectability** — each commit = one logical change.
- **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done.
- **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
- **Never push without fresh verification evidence.** If code changed after Step 5 tests, re-run before pushing.
- **Step 7 generates coverage tests.** They must pass before committing. Never commit failing tests.
- **The goal is: user says `/ship`, next thing they see is the review + PR URL + auto-synced docs.**