
Contributing to gstack

Thanks for wanting to make gstack better. Whether you're fixing a typo in a skill prompt or building an entirely new workflow, this guide will get you up and running fast.

Quick start

gstack skills are Markdown files that Claude Code discovers from a skills/ directory. Normally they live at ~/.claude/skills/gstack/ (your global install). But when you're developing gstack itself, you want Claude Code to use the skills in your working tree — so edits take effect instantly without copying or deploying anything.

That's what dev mode does. It symlinks your repo into the local .claude/skills/ directory so Claude Code reads skills straight from your checkout.

git clone <repo> && cd gstack
bun install                    # install dependencies
bin/dev-setup                  # activate dev mode

Now edit any SKILL.md, invoke it in Claude Code (e.g. /review), and see your changes live. When you're done developing:

bin/dev-teardown               # deactivate — back to your global install

Contributor mode

Contributor mode turns gstack into a self-improving tool. Enable it and Claude Code will periodically reflect on its gstack experience — rating it 0-10 at the end of each major workflow step. When something isn't a 10, it thinks about why and files a report to ~/.gstack/contributor-logs/ with what happened, repro steps, and what would make it better.

~/.claude/skills/gstack/bin/gstack-config set gstack_contributor true

The logs are for you. When something bugs you enough to fix, the report is already written. Fork gstack, symlink your fork into the project where you hit the issue, fix it, and open a PR.
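
For example, to skim the newest report before deciding whether it's worth a fix (this assumes each report is its own file under ~/.gstack/contributor-logs/, which is how the directory is described above):

cat ~/.gstack/contributor-logs/"$(ls -t ~/.gstack/contributor-logs/ | head -1)"    # newest report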

The contributor workflow

  1. Use gstack normally — contributor mode reflects and logs issues automatically
  2. Check your logs: ls ~/.gstack/contributor-logs/
  3. Fork and clone gstack (if you haven't already)
  4. Symlink your fork into the project where you hit the bug:
    # In your core project (the one where gstack annoyed you)
    ln -sfn /path/to/your/gstack-fork .claude/skills/gstack
    cd .claude/skills/gstack && bun install && bun run build
    
  5. Fix the issue — your changes are live immediately in this project
  6. Test by actually using gstack — do the thing that annoyed you, verify it's fixed
  7. Open a PR from your fork

This is the best way to contribute: fix gstack while doing your real work, in the project where you actually felt the pain.

Session awareness

When you have 3+ gstack sessions open simultaneously, every question tells you which project, which branch, and what's happening. No more staring at a question thinking "wait, which window is this?" The format is consistent across all 15 skills.

Working on gstack inside the gstack repo

When you're editing gstack skills and want to test them by actually using gstack in the same repo, bin/dev-setup wires this up. It creates .claude/skills/ symlinks (gitignored) pointing back to your working tree, so Claude Code uses your local edits instead of the global install.

gstack/                          <- your working tree
├── .claude/skills/              <- created by dev-setup (gitignored)
│   ├── gstack -> ../../         <- symlink back to repo root
│   ├── review -> gstack/review
│   ├── ship -> gstack/ship
│   └── ...                      <- one symlink per skill
├── review/
│   └── SKILL.md                 <- edit this, test with /review
├── ship/
│   └── SKILL.md
├── browse/
│   ├── src/                     <- TypeScript source
│   └── dist/                    <- compiled binary (gitignored)
└── ...

Day-to-day workflow

# 1. Enter dev mode
bin/dev-setup

# 2. Edit a skill
vim review/SKILL.md

# 3. Test it in Claude Code — changes are live
#    > /review

# 4. Editing browse source? Rebuild the binary
bun run build

# 5. Done for the day? Tear down
bin/dev-teardown

Testing & evals

Setup

# 1. Copy .env.example and add your API key
cp .env.example .env
# Edit .env → set ANTHROPIC_API_KEY=sk-ant-...

# 2. Install deps (if you haven't already)
bun install

Bun auto-loads .env — no extra config. Conductor workspaces inherit .env from the main worktree automatically (see "Conductor workspaces" below).
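
A quick sanity check that the key made it into .env, without printing the value:

grep -q '^ANTHROPIC_API_KEY=' .env && echo "API key present" || echo "API key missing"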

Test tiers

Tier | Command | Cost | What it tests
---|---|---|---
1 — Static | bun test | Free | Command validation, snapshot flags, SKILL.md correctness, TODOS-format.md refs, observability unit tests
2 — E2E | bun run test:e2e | ~$3.85 | Full skill execution via claude -p subprocess
3 — LLM eval | bun run test:evals | ~$0.15 | Standalone LLM-as-judge scoring of generated SKILL.md docs
2+3 | bun run test:evals | ~$4 | Combined E2E + LLM-as-judge (runs both)

bun test                     # Tier 1 only (runs on every commit, <5s)
bun run test:e2e             # Tier 2: E2E only (needs EVALS=1, can't run inside Claude Code)
bun run test:evals           # Tier 2 + 3 combined (~$4/run)

Tier 1: Static validation (free)

Runs automatically with bun test. No API keys needed.

  • Skill parser tests (test/skill-parser.test.ts) — Extracts every $B command from SKILL.md bash code blocks and validates against the command registry in browse/src/commands.ts. Catches typos, removed commands, and invalid snapshot flags.
  • Skill validation tests (test/skill-validation.test.ts) — Validates that SKILL.md files reference only real commands and flags, and that command descriptions meet quality thresholds.
  • Generator tests (test/gen-skill-docs.test.ts) — Tests the template system: verifies that placeholders resolve correctly, that output includes value hints for flags (e.g. -d <N>, not just -d), and that key commands get enriched descriptions (e.g. the is command lists valid states, press lists key examples).
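
While iterating on one of these areas, you can run a single suite instead of the whole tier:

bun test test/skill-parser.test.ts       # just the $B command extraction/validation tests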

Tier 2: E2E via claude -p (~$3.85/run)

Spawns claude -p as a subprocess with --output-format stream-json --verbose, streams NDJSON for real-time progress, and scans for browse errors. This is the closest thing to "does this skill actually work end-to-end?"

# Must run from a plain terminal — can't nest inside Claude Code or Conductor
EVALS=1 bun test test/skill-e2e-*.test.ts
  • Gated by EVALS=1 env var (prevents accidental expensive runs)
  • Auto-skips if running inside Claude Code (claude -p can't nest)
  • API connectivity pre-check — fails fast on ConnectionRefused before burning budget
  • Real-time progress to stderr: [Ns] turn T tool #C: Name(...)
  • Saves full NDJSON transcripts and failure JSON for debugging
  • Tests live in test/skill-e2e-*.test.ts (split by category), runner logic in test/helpers/session-runner.ts
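
To run a single category instead of the whole tier (the file name below is one of the skill-e2e-*.test.ts splits; substitute whichever file holds your test):

# Still gated by EVALS=1 and still needs a plain terminal
EVALS=1 bun test test/skill-e2e-review.test.ts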

E2E observability

When E2E tests run, they produce machine-readable artifacts in ~/.gstack-dev/:

Artifact | Path | Purpose
---|---|---
Heartbeat | e2e-live.json | Current test status (updated per tool call)
Partial results | evals/_partial-e2e.json | Completed tests (survives kills)
Progress log | e2e-runs/{runId}/progress.log | Append-only text log
NDJSON transcripts | e2e-runs/{runId}/{test}.ndjson | Raw claude -p output per test
Failure JSON | e2e-runs/{runId}/{test}-failure.json | Diagnostic data on failure

Live dashboard: Run bun run eval:watch in a second terminal to see a live dashboard showing completed tests, the currently running test, and cost. Use --tail to also show the last 10 lines of progress.log.
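
For example, in a second terminal while an E2E run is in flight:

bun run eval:watch           # completed tests, currently running test, cost so far
bun run eval:watch --tail    # same, plus the last 10 lines of progress.log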

Eval history tools:

bun run eval:list            # list all eval runs (turns, duration, cost per run)
bun run eval:compare         # compare two runs — shows per-test deltas + Takeaway commentary
bun run eval:summary         # aggregate stats + per-test efficiency averages across runs

Eval comparison commentary: eval:compare generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by generateCommentary() in eval-store.ts.

Artifacts are never cleaned up — they accumulate in ~/.gstack-dev/ for post-mortem debugging and trend analysis.

Tier 3: LLM-as-judge (~$0.15/run)

Uses Claude Sonnet to score generated SKILL.md docs on three dimensions:

  • Clarity — Can an AI agent understand the instructions without ambiguity?
  • Completeness — Are all commands, flags, and usage patterns documented?
  • Actionability — Can the agent execute tasks using only the information in the doc?

Each dimension is scored 1-5. Threshold: every dimension must score ≥ 4. There's also a regression test that compares the generated docs against the hand-maintained baseline from origin/main — the generated docs must score equal to or higher than the baseline.

# Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
  • Uses claude-sonnet-4-6 for scoring stability
  • Tests live in test/skill-llm-eval.test.ts
  • Calls the Anthropic API directly (not claude -p), so it works from anywhere including inside Claude Code
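
To run just the judge suite while iterating on templates, something like the following should work (assuming these tests are gated behind EVALS=1 like the E2E tier, so a plain bun test stays free):

EVALS=1 bun test test/skill-llm-eval.test.ts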

CI

A GitHub Action (.github/workflows/skill-docs.yml) runs bun run gen:skill-docs --dry-run on every push and PR. If the generated SKILL.md files differ from what's committed, CI fails. This catches stale docs before they merge.
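
You can reproduce the freshness check locally before pushing (the first command is what CI runs; checking the Codex output too is cheap):

bun run gen:skill-docs --dry-run                 # fails if committed Claude SKILL.md files are stale
bun run gen:skill-docs --host codex --dry-run    # same check for the Codex output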

Tests run against the browse binary directly — they don't require dev mode.

Editing SKILL.md files

SKILL.md files are generated from .tmpl templates. Don't edit the .md directly — your changes will be overwritten on the next build.

# 1. Edit the template
vim SKILL.md.tmpl              # or browse/SKILL.md.tmpl

# 2. Regenerate for both hosts
bun run gen:skill-docs
bun run gen:skill-docs --host codex

# 3. Check health (reports both Claude and Codex)
bun run skill:check

# Or use watch mode — auto-regenerates on save
bun run dev:skill

For template authoring best practices (natural language over bash-isms, dynamic branch detection, {{BASE_BRANCH_DETECT}} usage), see CLAUDE.md's "Writing SKILL templates" section.

To add a browse command, add it to browse/src/commands.ts. To add a snapshot flag, add it to SNAPSHOT_FLAGS in browse/src/snapshot.ts. Then rebuild.
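
A typical loop for that kind of change looks something like:

# 1. Add the command or flag
vim browse/src/commands.ts               # or SNAPSHOT_FLAGS in browse/src/snapshot.ts
# 2. Rebuild the binary and regenerate docs for both hosts in one go
bun run build
# 3. Static tests validate SKILL.md against the command registry
bun test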

Dual-host development (Claude + Codex)

gstack generates SKILL.md files for two hosts: Claude (.claude/skills/) and Codex (.agents/skills/). Every template change needs to be generated for both.

Generating for both hosts

# Generate Claude output (default)
bun run gen:skill-docs

# Generate Codex output
bun run gen:skill-docs --host codex
# --host agents is an alias for --host codex

# Or use build, which does both + compiles binaries
bun run build

What changes between hosts

Aspect | Claude | Codex
---|---|---
Output directory | {skill}/SKILL.md | .agents/skills/gstack-{skill}/SKILL.md
Frontmatter | Full (name, description, allowed-tools, hooks, version) | Minimal (name + description only)
Paths | ~/.claude/skills/gstack | ~/.codex/skills/gstack
Hook skills | hooks: frontmatter (enforced by Claude) | Inline safety advisory prose (advisory only)
/codex skill | Included (Claude wraps codex exec) | Excluded (self-referential)

Testing Codex output

# Run all static tests (includes Codex validation)
bun test

# Check freshness for both hosts
bun run gen:skill-docs --dry-run
bun run gen:skill-docs --host codex --dry-run

# Health dashboard covers both hosts
bun run skill:check

Dev setup for .agents/

When you run bin/dev-setup, it creates symlinks in both .claude/skills/ and .agents/skills/ (if applicable), so Codex-compatible agents can discover your dev skills too.
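
A quick way to confirm dev mode is wired up for both hosts:

ls -l .claude/skills/ .agents/skills/ 2>/dev/null    # entries should be symlinks into your working tree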

Adding a new skill

When you add a new skill template, both hosts get it automatically:

  1. Create {skill}/SKILL.md.tmpl
  2. Run bun run gen:skill-docs (Claude output) and bun run gen:skill-docs --host codex (Codex output)
  3. The dynamic template discovery picks it up — no static list to update
  4. Commit both {skill}/SKILL.md and .agents/skills/gstack-{skill}/SKILL.md
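
End to end, with a hypothetical skill name (mynewskill below is a placeholder, not an existing gstack skill):

mkdir mynewskill && vim mynewskill/SKILL.md.tmpl       # 1. write the template
bun run gen:skill-docs                                 # 2. Claude output
bun run gen:skill-docs --host codex                    #    Codex output
bun run skill:check                                    # 3. health check covers both hosts
git add mynewskill/ .agents/skills/gstack-mynewskill/  # 4. commit both generated SKILL.md files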

Conductor workspaces

If you're using Conductor to run multiple Claude Code sessions in parallel, conductor.json wires up workspace lifecycle automatically:

Hook | Script | What it does
---|---|---
setup | bin/dev-setup | Copies .env from main worktree, installs deps, symlinks skills
archive | bin/dev-teardown | Removes skill symlinks, cleans up .claude/ directory

When Conductor creates a new workspace, bin/dev-setup runs automatically. It detects the main worktree (via git worktree list), copies your .env so API keys carry over, and sets up dev mode — no manual steps needed.

First-time setup: Put your ANTHROPIC_API_KEY in .env in the main repo (see .env.example). Every Conductor workspace inherits it automatically.

Things to know

  • SKILL.md files are generated. Edit the .tmpl template, not the .md. Run bun run gen:skill-docs to regenerate.
  • TODOS.md is the unified backlog. Organized by skill/component with P0-P4 priorities. /ship auto-detects completed items. All planning/review/retro skills read it for context.
  • Browse source changes need a rebuild. If you touch browse/src/*.ts, run bun run build.
  • Dev mode shadows your global install. Project-local skills take priority over ~/.claude/skills/gstack. bin/dev-teardown restores the global one.
  • Conductor workspaces are independent. Each workspace is its own git worktree. bin/dev-setup runs automatically via conductor.json.
  • .env propagates across worktrees. Set it once in the main repo, all Conductor workspaces get it.
  • .claude/skills/ is gitignored. The symlinks never get committed.

Testing your changes in a real project

This is the recommended way to develop gstack. Symlink your gstack checkout into the project where you actually use it, so your changes are live while you do real work:

# In your core project
ln -sfn /path/to/your/gstack-checkout .claude/skills/gstack
cd .claude/skills/gstack && bun install && bun run build

Now every gstack skill invocation in this project uses your working tree. Edit a template, run bun run gen:skill-docs, and the next /review or /qa call picks it up immediately.

To go back to the stable global install, just remove the symlink:

rm .claude/skills/gstack

Claude Code falls back to ~/.claude/skills/gstack/ automatically.

Alternative: point your global install at a branch

If you don't want per-project symlinks, you can switch the global install:

cd ~/.claude/skills/gstack
git fetch origin
git checkout origin/<branch>
bun install && bun run build

This affects all projects. To revert: git checkout main && git pull && bun run build.

Shipping your changes

When you're happy with your skill edits:

/ship

This runs tests, reviews the diff, triages Greptile comments (with 2-tier escalation), manages TODOS.md, bumps the version, and opens a PR. See ship/SKILL.md for the full workflow.