* refactor: extract {{TEST_COVERAGE_AUDIT}} shared resolver
DRY extraction of the test coverage audit methodology into a shared
generator function with three explicit placeholders:
- TEST_COVERAGE_AUDIT_PLAN (plan-eng-review)
- TEST_COVERAGE_AUDIT_SHIP (ship)
- TEST_COVERAGE_AUDIT_REVIEW (review)
Shared across all modes: codepath tracing, ASCII diagram format,
quality scoring rubric, E2E test decision matrix, regression rule,
and test framework detection via CLAUDE.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: plan-eng-review uses shared test coverage audit
Replace the thin 6-line Section 3 test review with the full shared
methodology via {{TEST_COVERAGE_AUDIT_PLAN}}. Plan mode now:
- Traces every codepath with full ASCII diagrams
- Adds missing tests to the plan (not just "check for tests")
- Writes test plan artifact for /qa consumption
- Includes E2E/eval recommendations and regression detection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: ship uses shared test coverage audit
Replace 135 lines of inline Step 3.4 methodology with
{{TEST_COVERAGE_AUDIT_SHIP}}. Functionally identical output plus:
- E2E test decision matrix (marks paths needing E2E vs unit)
- Eval recommendations for LLM prompt changes
- Regression detection iron rule
- Test framework detection via CLAUDE.md first
- Test plan artifact for /qa consumption
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /review Step 4.75 test coverage diagram
Add codepath tracing to the pre-landing review via
{{TEST_COVERAGE_AUDIT_REVIEW}}. Review mode:
- Produces ASCII coverage diagram (same methodology as plan/ship)
- Generates tests for gaps via Fix-First (ASK user)
- Subsumes Pass 2 "Test Gaps" checklist category
- Gaps are INFORMATIONAL findings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: mode differentiation + regression guard for coverage audit
10 new tests verifying the three TEST_COVERAGE_AUDIT placeholders:
- All modes share: codepath tracing, E2E matrix, regression rule
- Plan mode: adds to plan + artifact, no ship-specific content
- Ship mode: auto-generates + before/after count + coverage summary
- Review mode: Fix-First ASK + INFORMATIONAL, no artifact
- Regression guard: ship SKILL.md preserves all key phrases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: extract shared coverage audit fixture + review E2E
- Extract billing.ts fixture into coverage-audit-fixture.ts (DRY)
- Refactor ship-coverage-audit E2E to use shared fixture
- Add review-coverage-audit E2E for Step 4.75
- Update touchfiles: both E2Es depend on shared fixture
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: strengthen E2E assertions for coverage audit tests
The coverage audit E2E tests (ship + review) were only asserting
exitReason === 'success' and readCalls > 0 — they passed even
if the agent produced no coverage diagram. Add assertion that
the output contains either GAP or TESTED markers.
Found during /review.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: plan mode traces the plan, not the git diff
Codex adversarial review caught that plan-eng-review was inheriting
"git diff origin/<base>...HEAD" from the shared resolver, but plan mode
reviews a plan document, not a code diff. Plan mode now says:
"Trace every codepath in the plan" and "Read the plan document."
Ship and review modes keep the git diff instruction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.9.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: test coverage catalog + failure triage (merged branches) (#285)
* feat: add bin/gstack-repo-mode — solo vs collaborative detection with caching
Detects whether a repo is solo-dev (one person does 80%+ of recent commits)
or collaborative. Uses 90-day git shortlog window with 7-day cache in
~/.gstack/projects/{SLUG}/repo-mode.json. Config override via
`gstack-config set repo_mode solo|collaborative` takes precedence over
the heuristic. Minimum 5 commits required to classify (otherwise unknown).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: test failure ownership triage — see something say something
Adds two new preamble sections to all gstack skills:
- Repo Ownership Mode: explains solo vs collaborative behavior
- See Something, Say Something: proactive issue flagging principle
Adds {{TEST_FAILURE_TRIAGE}} template variable (opt-in, used by /ship):
- Classifies test failures as in-branch vs pre-existing
- Solo mode defaults to "investigate and fix now"
- Collaborative mode offers "blame + assign GitHub issue" option
- Also offers P0 TODO and skip options
/ship Step 3 now triages test failures instead of hard-stopping on all
failures. In-branch failures still block shipping. Pre-existing failures
get user-directed triage based on repo mode.
Adds P2 TODO for gstack notes system (deferred lightweight reminder).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate SKILL.md files for Claude and Codex hosts
All 22 Claude skills and 21 Codex skills regenerated with new preamble
sections (Repo Ownership Mode, See Something Say Something) and
{{TEST_FAILURE_TRIAGE}} resolved in ship/SKILL.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate repo mode values to prevent shell injection
Codex adversarial review found that unvalidated config/cache values
could be injected into shell via source <(gstack-repo-mode). Added
validate_mode() that only allows solo|collaborative|unknown — anything
else becomes "unknown". Prevents persistent code execution through
malicious config.yaml or tampered cache JSON.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: shell injection via branch names + feature-branch sampling bias
Codex code review found two issues:
P1: eval $(gstack-slug) in gstack-repo-mode executes branch names as
shell. Branch names like foo$(touch${IFS}pwned) are valid git refs and
would execute arbitrary commands. Fix: compute SLUG directly with sed
instead of eval'ing gstack-slug output.
P2: git shortlog HEAD only sees current branch history. On feature
branches that haven't merged main recently, other contributors disappear
from the sample. Fix: use git shortlog on the default branch
(origin/main) instead of HEAD.
Also improved blame lookup in collaborative triage to check both the
test file and the production code it covers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: broaden codex-host stripping test to accommodate triage section
"Investigate and fix" now appears in TEST_FAILURE_TRIAGE (not just the
Codex review step). Use CODEX_REVIEWS config string as a more specific
marker for detecting the Codex review step in Codex-hosted skills.
* fix: replace template placeholder in TODOS.md with readable text
{{TEST_FAILURE_TRIAGE}} is template syntax but TODOS.md is not processed
by gen-skill-docs — replaced with human-readable reference.
* chore: bump version and changelog (v0.9.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add bin/ directory to project structure in CLAUDE.md
* test: add triage resolver unit tests, plan-eng coverage audit E2E, and triage E2E
- TEST_FAILURE_TRIAGE resolver: 6 unit tests verifying all triage steps (T1-T4),
REPO_MODE branching, and safety default for ambiguous failures
- plan-eng-coverage-audit E2E: tests /plan-eng-review coverage audit codepath
(gap identified during eng review — existed on neither branch)
- ship-triage E2E: planted-bug fixture with in-branch (truncate null) and
pre-existing (divide-by-zero) failures; verifies correct classification
- Touchfile entries for diff-based test selection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate stale Codex SKILL.md for retro
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gstack-repo-mode handles repos without origin remote
Split `git remote get-url origin` into a separate variable with `|| true`
so the script doesn't crash under `set -euo pipefail` in local-only repos.
Falls back to REPO_MODE=unknown gracefully.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: REPO_MODE defaults to unknown when helper emits nothing
Changed preamble from `source <(...) || REPO_MODE=unknown` (which doesn't
catch empty output) to `source <(...) || true` followed by
`REPO_MODE=${REPO_MODE:-unknown}`. Regenerated all SKILL.md files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: triage E2E runs both test files in subprocesses
math.test.js called process.exit(1) which killed the runner before
string.test.js could execute. Changed test runner to use child_process
so each test runs independently and both failure classes are exercised.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gstack-repo-mode handles repos without origin remote
Fall back through origin/main → origin/master → HEAD when
git symbolic-ref refs/remotes/origin/HEAD is not set. Prevents
shortlog crash in repos where origin/HEAD isn't configured.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: triage E2E runs both test files in subprocesses
Add assertions verifying both math.test.js (pre-existing failure) and
string.test.js (in-branch failure) actually executed during triage.
Prevents false passes where only one failure class is exercised.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: REPO_MODE defaults to unknown when helper emits nothing
- Remove head -20 truncation that biased solo classification by
dropping low-volume contributors from the denominator
- Use atomic write (mktemp + mv) for cache to prevent concurrent
preamble reads from seeing partial JSON
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add test coverage catalog to CHANGELOG + update project structure
- CHANGELOG: add 6 entries for coverage audit, review Step 4.75, E2E
recommendations, regression iron rule, failure triage, repo-mode fix
- CLAUDE.md: add missing skill directories (autoplan, benchmark, canary,
codex, land-and-deploy, setup-deploy) to project structure
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.10.1.0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: CHANGELOG rules — branch-scoped versions, never fold into old entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
14 KiB
gstack development
Commands
bun install # install dependencies
bun test # run free tests (browse + snapshot + skill validation)
bun run test:evals # run paid evals: LLM judge + E2E (diff-based, ~$4/run max)
bun run test:evals:all # run ALL paid evals regardless of diff
bun run test:e2e # run E2E tests only (diff-based, ~$3.85/run max)
bun run test:e2e:all # run ALL E2E tests regardless of diff
bun run eval:select # show which tests would run based on current diff
bun run dev <cmd> # run CLI in dev mode, e.g. bun run dev goto https://example.com
bun run build # gen docs + compile binaries
bun run gen:skill-docs # regenerate SKILL.md files from templates
bun run skill:check # health dashboard for all skills
bun run dev:skill # watch mode: auto-regen + validate on change
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runs
test:evals requires ANTHROPIC_API_KEY. Codex E2E tests (test/codex-e2e.test.ts)
use Codex's own auth from ~/.codex/ config — no OPENAI_API_KEY env var needed.
E2E tests stream progress in real-time (tool-by-tool via --output-format stream-json --verbose). Results are persisted to ~/.gstack-dev/evals/ with auto-comparison
against the previous run.
Diff-based test selection: test:evals and test:e2e auto-select tests based
on git diff against the base branch. Each test declares its file dependencies in
test/helpers/touchfiles.ts. Changes to global touchfiles (session-runner, eval-store,
llm-judge, gen-skill-docs) trigger all tests. Use EVALS_ALL=1 or the :all script
variants to force all tests. Run eval:select to preview which tests would run.
Testing
bun test # run before every commit — free, <2s
bun run test:evals # run before shipping — paid, diff-based (~$4/run max)
bun test runs skill validation, gen-skill-docs quality checks, and browse
integration tests. bun run test:evals runs LLM-judge quality evals and E2E
tests via claude -p. Both must pass before creating a PR.
Project structure
gstack/
├── browse/ # Headless browser CLI (Playwright)
│ ├── src/ # CLI + server + commands
│ │ ├── commands.ts # Command registry (single source of truth)
│ │ └── snapshot.ts # SNAPSHOT_FLAGS metadata array
│ ├── test/ # Integration tests + fixtures
│ └── dist/ # Compiled binary
├── scripts/ # Build + DX tooling
│ ├── gen-skill-docs.ts # Template → SKILL.md generator
│ ├── skill-check.ts # Health dashboard
│ └── dev-skill.ts # Watch mode
├── test/ # Skill validation + eval tests
│ ├── helpers/ # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts
│ ├── fixtures/ # Ground truth JSON, planted-bug fixtures, eval baselines
│ ├── skill-validation.test.ts # Tier 1: static validation (free, <1s)
│ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s)
│ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run)
│ └── skill-e2e-*.test.ts # Tier 2: E2E via claude -p (~$3.85/run, split by category)
├── qa-only/ # /qa-only skill (report-only QA, no fixes)
├── plan-design-review/ # /plan-design-review skill (report-only design audit)
├── design-review/ # /design-review skill (design audit + fix loop)
├── ship/ # Ship workflow skill
├── review/ # PR review skill
├── plan-ceo-review/ # /plan-ceo-review skill
├── plan-eng-review/ # /plan-eng-review skill
├── autoplan/ # /autoplan skill (auto-review pipeline: CEO → design → eng)
├── benchmark/ # /benchmark skill (performance regression detection)
├── canary/ # /canary skill (post-deploy monitoring loop)
├── codex/ # /codex skill (multi-AI second opinion via OpenAI Codex CLI)
├── land-and-deploy/ # /land-and-deploy skill (merge → deploy → canary verify)
├── office-hours/ # /office-hours skill (YC Office Hours — startup diagnostic + builder brainstorm)
├── investigate/ # /investigate skill (systematic root-cause debugging)
├── retro/ # Retrospective skill
├── document-release/ # /document-release skill (post-ship doc updates)
├── setup-deploy/ # /setup-deploy skill (one-time deploy config)
├── bin/ # CLI utilities (gstack-repo-mode, gstack-slug, gstack-config, etc.)
├── setup # One-time setup: build binary + symlink skills
├── SKILL.md # Generated from SKILL.md.tmpl (don't edit directly)
├── SKILL.md.tmpl # Template: edit this, run gen:skill-docs
├── ETHOS.md # Builder philosophy (Boil the Lake, Search Before Building)
└── package.json # Build scripts for browse
SKILL.md workflow
SKILL.md files are generated from .tmpl templates. To update docs:
- Edit the
.tmplfile (e.g.SKILL.md.tmplorbrowse/SKILL.md.tmpl) - Run
bun run gen:skill-docs(orbun run buildwhich does it automatically) - Commit both the
.tmpland generated.mdfiles
To add a new browse command: add it to browse/src/commands.ts and rebuild.
To add a snapshot flag: add it to SNAPSHOT_FLAGS in browse/src/snapshot.ts and rebuild.
Merge conflicts on SKILL.md files: NEVER resolve conflicts on generated SKILL.md
files by accepting either side. Instead: (1) resolve conflicts on the .tmpl templates
and scripts/gen-skill-docs.ts (the sources of truth), (2) run bun run gen:skill-docs
to regenerate all SKILL.md files, (3) stage the regenerated files. Accepting one side's
generated output silently drops the other side's template changes.
Platform-agnostic design
Skills must NEVER hardcode framework-specific commands, file patterns, or directory structures. Instead:
- Read CLAUDE.md for project-specific config (test commands, eval commands, etc.)
- If missing, AskUserQuestion — let the user tell you or let gstack search the repo
- Persist the answer to CLAUDE.md so we never have to ask again
This applies to test commands, eval commands, deploy commands, and any other project-specific behavior. The project owns its config; gstack reads it.
Writing SKILL templates
SKILL.md.tmpl files are prompt templates read by Claude, not bash scripts. Each bash code block runs in a separate shell — variables do not persist between blocks.
Rules:
- Use natural language for logic and state. Don't use shell variables to pass state between code blocks. Instead, tell Claude what to remember and reference it in prose (e.g., "the base branch detected in Step 0").
- Don't hardcode branch names. Detect
main/master/etc dynamically viagh pr vieworgh repo view. Use{{BASE_BRANCH_DETECT}}for PR-targeting skills. Use "the base branch" in prose,<base>in code block placeholders. - Keep bash blocks self-contained. Each code block should work independently. If a block needs context from a previous step, restate it in the prose above.
- Express conditionals as English. Instead of nested
if/elif/elsein bash, write numbered decision steps: "1. If X, do Y. 2. Otherwise, do Z."
Browser interaction
When you need to interact with a browser (QA, dogfooding, cookie setup), use the
/browse skill or run the browse binary directly via $B <command>. NEVER use
mcp__claude-in-chrome__* tools — they are slow, unreliable, and not what this
project uses.
Vendored symlink awareness
When developing gstack, .claude/skills/gstack may be a symlink back to this
working directory (gitignored). This means skill changes are live immediately —
great for rapid iteration, risky during big refactors where half-written skills
could break other Claude Code sessions using gstack concurrently.
Check once per session: Run ls -la .claude/skills/gstack to see if it's a
symlink or a real copy. If it's a symlink to your working directory, be aware that:
- Template changes +
bun run gen:skill-docsimmediately affect all gstack invocations - Breaking changes to SKILL.md.tmpl files can break concurrent gstack sessions
- During large refactors, remove the symlink (
rm .claude/skills/gstack) so the global install at~/.claude/skills/gstack/is used instead
For plan reviews: When reviewing plans that modify skill templates or the gen-skill-docs pipeline, consider whether the changes should be tested in isolation before going live (especially if the user is actively using gstack in other windows).
Commit style
Always bisect commits. Every commit should be a single logical change. When you've made multiple changes (e.g., a rename + a rewrite + new tests), split them into separate commits before pushing. Each commit should be independently understandable and revertable.
Examples of good bisection:
- Rename/move separate from behavior changes
- Test infrastructure (touchfiles, helpers) separate from test implementations
- Template changes separate from generated file regeneration
- Mechanical refactors separate from new features
When the user says "bisect commit" or "bisect and push," split staged/unstaged changes into logical commits and push.
CHANGELOG + VERSION style
VERSION and CHANGELOG are branch-scoped. Every feature branch that ships gets its own version bump and CHANGELOG entry. The entry describes what THIS branch adds — not what was already on main.
When to write the CHANGELOG entry:
- At
/shiptime (Step 5), not during development or mid-branch. - The entry covers ALL commits on this branch vs the base branch.
- Never fold new work into an existing CHANGELOG entry from a prior version that already landed on main. If main has v0.10.0.0 and your branch adds features, bump to v0.10.1.0 with a new entry — don't edit the v0.10.0.0 entry.
Key questions before writing:
- What branch am I on? What did THIS branch change?
- Is the base branch version already released? (If yes, bump and create new entry.)
- Does an existing entry on this branch already cover earlier work? (If yes, replace it with one unified entry for the final version.)
CHANGELOG.md is for users, not contributors. Write it like product release notes:
- Lead with what the user can now do that they couldn't before. Sell the feature.
- Use plain language, not implementation details. "You can now..." not "Refactored the..."
- Never mention TODOS.md, internal tracking, eval infrastructure, or contributor-facing details. These are invisible to users and meaningless to them.
- Put contributor/internal changes in a separate "For contributors" section at the bottom.
- Every entry should make someone think "oh nice, I want to try that."
- No jargon: say "every question now tells you which project and branch you're in" not "AskUserQuestion format standardized across skill templates via preamble resolver."
AI effort compression
When estimating or discussing effort, always show both human-team and CC+gstack time:
| Task type | Human team | CC+gstack | Compression |
|---|---|---|---|
| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
| Test writing | 1 day | 15 min | ~50x |
| Feature implementation | 1 week | 30 min | ~30x |
| Bug fix + regression test | 4 hours | 15 min | ~20x |
| Architecture / design | 2 days | 4 hours | ~5x |
| Research / exploration | 1 day | 3 hours | ~3x |
Completeness is cheap. Don't recommend shortcuts when the complete implementation is a "lake" (achievable) not an "ocean" (multi-quarter migration). See the Completeness Principle in the skill preamble for the full philosophy.
Search before building
Before designing any solution that involves concurrency, unfamiliar patterns, infrastructure, or anything where the runtime/framework might have a built-in:
- Search for "{runtime} {thing} built-in"
- Search for "{thing} best practice {current year}"
- Check official runtime/framework docs
Three layers of knowledge: tried-and-true (Layer 1), new-and-popular (Layer 2), first-principles (Layer 3). Prize Layer 3 above all. See ETHOS.md for the full builder philosophy.
Local plans
Contributors can store long-range vision docs and design documents in ~/.gstack-dev/plans/.
These are local-only (not checked in). When reviewing TODOS.md, check plans/ for candidates
that may be ready to promote to TODOs or implement.
E2E eval failure blame protocol
When an E2E eval fails during /ship or any other workflow, never claim "not
related to our changes" without proving it. These systems have invisible couplings —
a preamble text change affects agent behavior, a new helper changes timing, a
regenerated SKILL.md shifts prompt context.
Required before attributing a failure to "pre-existing":
- Run the same eval on main (or base branch) and show it fails there too
- If it passes on main but fails on the branch — it IS your change. Trace the blame.
- If you can't run on main, say "unverified — may or may not be related" and flag it as a risk in the PR body
"Pre-existing" without receipts is a lazy claim. Prove it or don't say it.
Long-running tasks: don't give up
When running evals, E2E tests, or any long-running background task, poll until
completion. Use sleep 180 && echo "ready" + TaskOutput in a loop every 3
minutes. Never switch to blocking mode and give up when the poll times out. Never
say "I'll be notified when it completes" and stop checking — keep the loop going
until the task finishes or the user tells you to stop.
The full E2E suite can take 30-45 minutes. That's 10-15 polling cycles. Do all of them. Report progress at each check (which tests passed, which are running, any failures so far). The user wants to see the run complete, not a promise that you'll check later.
Deploying to the active skill
The active skill lives at ~/.claude/skills/gstack/. After making changes:
- Push your branch
- Fetch and reset in the skill directory:
cd ~/.claude/skills/gstack && git fetch origin && git reset --hard origin/main - Rebuild:
cd ~/.claude/skills/gstack && bun run build
Or copy the binary directly: cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse