gstack/CLAUDE.md
Garry Tan 6000af4589 feat: founder discovery engine + /debug skill — v0.7.0 (#185)
* feat: add escalation protocol to preamble — all skills get DONE/BLOCKED/NEEDS_CONTEXT

Every skill now reports completion status (DONE, DONE_WITH_CONCERNS, BLOCKED,
NEEDS_CONTEXT) and has escalation rules: 3 failed attempts → STOP, security
uncertainty → STOP, scope exceeds verification → STOP.

"It is always OK to stop and say 'this is too hard for me.'"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add verification gate to /ship (Step 6.5) — no push without fresh evidence

Before pushing, re-verify tests if code changed during review fixes.
Rationalization prevention: "Should work now" → RUN IT.
"I'm confident" → Confidence is not evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add scope drift detection + verification of claims to /review

Step 1.5: Before reviewing code quality, check if the diff matches stated
intent. Flags scope creep and missing requirements (INFORMATIONAL).

Step 5 addition: Every review claim must cite evidence — "this pattern is
safe" needs a line reference, "tests cover this" needs a test name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: mandatory implementation alternatives + design doc lookup in /plan-ceo-review

Step 0C-bis: Every plan must consider 2-3 approaches (minimal viable vs ideal
architecture) before mode selection. RECOMMENDATION required.

Pre-Review System Audit now checks ~/.gstack/projects/ for /brainstorm design
docs (branch-filtered with fallback).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design doc lookup in /plan-eng-review + fix branch name sanitization

Step 0 now checks ~/.gstack/projects/ for /brainstorm design docs
(branch-filtered with fallback, reads Supersedes: for revision context).

Fix: branch names with '/' (e.g. garrytan/better-process) now get
sanitized via tr '/' '-' in test plan artifact filenames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: new /brainstorm and /debug skills

/brainstorm: Socratic design exploration before planning. Context gathering,
clarifying questions (smart-skip), related design discovery (keyword grep),
premise challenge, forced alternatives, design doc artifact with lineage
tracking (Supersedes: field). Writes to ~/.gstack/projects/$SLUG/.

/debug: Systematic root-cause debugging. Iron Law: no fixes without root
cause investigation. Pattern analysis, hypothesis testing with 3-strike
escalation, structured DEBUG REPORT output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: structural tests for new skills + escalation protocol assertions

Add brainstorm + debug to skillsWithUpdateCheck and skillsWithPreamble arrays.
Add structural tests: brainstorm (Phase 1-6, Design Doc, Supersedes, Smart-skip),
debug (Iron Law, Root Cause, Pattern Analysis, Hypothesis, DEBUG REPORT, 3-strike).
Add escalation protocol tests (DONE_WITH_CONCERNS, BLOCKED, NEEDS_CONTEXT) for
all preamble skills.

Also: 2 new TODOs (design docs → Supabase sync, /plan-design-review skill),
update CLAUDE.md project structure with new skill directories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.6.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: rename /brainstorm → /office-hours across references

Update CHANGELOG, CLAUDE.md, TODOS, design-consultation, plan-ceo-review,
and gen-skill-docs to reference the new office-hours skill name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: YC Office Hours — dual-mode product diagnostic + builder brainstorm

Rewrite /office-hours with two modes:

Startup mode: six forcing questions (Demand Reality, Status Quo, Desperate
Specificity, Narrowest Wedge, Observation & Surprise, Future-Fit) that push
founders toward radical honesty about demand, users, and product decisions.
Includes smart routing by product stage, intrapreneurship adaptation, and
YC apply CTA for strong-signal founders.

Builder mode: generative brainstorming for side projects, hackathons,
learning, and open source. Enthusiastic collaborator tone, design thinking
questions, no business interrogation.

Mode is determined by an explicit question in Phase 1 — no guessing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add 14 assertions for YC Office Hours content coverage

Validates dual-mode structure (Startup/Builder), all six forcing questions,
builder brainstorming content, intrapreneurship adaptation, YC apply CTA,
and operating principles for both modes. 192 tests total, all passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.6.1

- README.md: added /office-hours and /debug to skills table, updated
  skill count from 13 to 15, added both to install instructions
- docs/skills.md: added /office-hours and /debug deep dive sections
- CLAUDE.md: updated office-hours description to reflect dual-mode
- CONTRIBUTING.md: updated skill count from 13 to 15
- CHANGELOG.md: added YC Office Hours and /debug entries to 0.6.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: founder discovery engine in /office-hours (v0.7.0)

Turn /office-hours into a YC founder discovery engine. Every session now
ends with three beats: signal reflection (specific callbacks to what the
user said), "One more thing." transition, and a personal plea from Garry
Tan with three tiers based on founder signal strength. Top tier uses
AskUserQuestion to ask directly and opens ycombinator.com/apply?ref=gstack.

Adds Phase 4.5 (Founder Signal Synthesis), "What I noticed about how you
think" section to both design doc templates, anti-slop GOOD/BAD examples,
and emotional targets per tier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add validation assertions for founder discovery engine

8 new assertions covering: YC apply CTA with ref=gstack tracking,
"What I noticed" design doc section, golden age framing, Garry Tan
personal plea, founder signal synthesis phase, three-tier decision
rubric, anti-slop GOOD/BAD examples, "One more thing" transition beat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.7.0

VERSION: 0.6.4.1 → 0.7.0
CHANGELOG: new entry — Office Hours Gets Personal
README: updated /office-hours and /plan-design-review descriptions
docs/skills.md: updated /office-hours table + deep dive section
TODOS.md: added /yc-prep skill TODO (P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove duplicate Install section, fix stale skills lists, deduplicate CHANGELOG entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:19:04 -05:00

gstack development

Commands

bun install          # install dependencies
bun test             # run free tests (browse + snapshot + skill validation)
bun run test:evals   # run paid evals: LLM judge + E2E (diff-based, ~$4/run max)
bun run test:evals:all  # run ALL paid evals regardless of diff
bun run test:e2e     # run E2E tests only (diff-based, ~$3.85/run max)
bun run test:e2e:all # run ALL E2E tests regardless of diff
bun run eval:select  # show which tests would run based on current diff
bun run dev <cmd>    # run CLI in dev mode, e.g. bun run dev goto https://example.com
bun run build        # gen docs + compile binaries
bun run gen:skill-docs  # regenerate SKILL.md files from templates
bun run skill:check  # health dashboard for all skills
bun run dev:skill    # watch mode: auto-regen + validate on change
bun run eval:list    # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runs

test:evals requires ANTHROPIC_API_KEY. E2E tests stream progress in real time, tool by tool, via --output-format stream-json --verbose. Results are persisted to ~/.gstack-dev/evals/ and automatically compared against the previous run.

Diff-based test selection: test:evals and test:e2e auto-select tests based on git diff against the base branch. Each test declares its file dependencies in test/helpers/touchfiles.ts. Changes to global touchfiles (session-runner, eval-store, llm-judge, gen-skill-docs) trigger all tests. Use EVALS_ALL=1 or the :all script variants to force all tests. Run eval:select to preview which tests would run.
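
A minimal sketch of the selection logic described above, for orientation only. The real shape of test/helpers/touchfiles.ts is not shown in this file, so the TOUCHFILES map, the test names in it, the prefix-matching rule, and the selectTests function are all assumptions. Only the four global touchfiles come from the text.

```typescript
// Hypothetical shape: the real test/helpers/touchfiles.ts may look different.
type Touchfiles = Record<string, string[]>;

// Illustrative entries only. These test names and paths are assumptions.
const TOUCHFILES: Touchfiles = {
  "browse-basic": ["browse/src/"],
  "review-skill": ["review/"],
};

// Global touchfiles named in the text: a change to any of these runs everything.
const GLOBAL_TOUCHFILES: string[] = [
  "test/helpers/session-runner.ts",
  "test/helpers/eval-store.ts",
  "test/helpers/llm-judge.ts",
  "scripts/gen-skill-docs.ts",
];

// Returns the names of the tests to run for a list of changed file paths.
// runAll stands in for EVALS_ALL=1 or the :all script variants.
function selectTests(changed: string[], runAll = false): string[] {
  const all = Object.keys(TOUCHFILES);
  // Assumption: a dependency matches when a changed path starts with it.
  const touches = (prefixes: string[]): boolean =>
    changed.some((file) => prefixes.some((prefix) => file.startsWith(prefix)));

  if (runAll || touches(GLOBAL_TOUCHFILES)) return all;
  return all.filter((name) => touches(TOUCHFILES[name] ?? []));
}

selectTests(["review/SKILL.md.tmpl"]);      // → ["review-skill"]
selectTests(["test/helpers/llm-judge.ts"]); // → both tests (global touchfile)
selectTests(["README.md"]);                 // → []
```

Run eval:select for the real answer. The sketch only shows why touching a helper is more expensive than touching a single skill.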

Project structure

gstack/
├── browse/          # Headless browser CLI (Playwright)
│   ├── src/         # CLI + server + commands
│   │   ├── commands.ts  # Command registry (single source of truth)
│   │   └── snapshot.ts  # SNAPSHOT_FLAGS metadata array
│   ├── test/        # Integration tests + fixtures
│   └── dist/        # Compiled binary
├── scripts/         # Build + DX tooling
│   ├── gen-skill-docs.ts  # Template → SKILL.md generator
│   ├── skill-check.ts     # Health dashboard
│   └── dev-skill.ts       # Watch mode
├── test/            # Skill validation + eval tests
│   ├── helpers/     # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts
│   ├── fixtures/    # Ground truth JSON, planted-bug fixtures, eval baselines
│   ├── skill-validation.test.ts  # Tier 1: static validation (free, <1s)
│   ├── gen-skill-docs.test.ts    # Tier 1: generator quality (free, <1s)
│   ├── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge (~$0.15/run)
│   └── skill-e2e.test.ts         # Tier 2: E2E via claude -p (~$3.85/run)
├── qa-only/         # /qa-only skill (report-only QA, no fixes)
├── plan-design-review/  # /plan-design-review skill (report-only design audit)
├── design-review/    # /design-review skill (design audit + fix loop)
├── ship/            # Ship workflow skill
├── review/          # PR review skill
├── plan-ceo-review/ # /plan-ceo-review skill
├── plan-eng-review/ # /plan-eng-review skill
├── office-hours/    # /office-hours skill (YC Office Hours — startup diagnostic + builder brainstorm)
├── debug/           # /debug skill (systematic root-cause debugging)
├── retro/           # Retrospective skill
├── document-release/ # /document-release skill (post-ship doc updates)
├── setup            # One-time setup: build binary + symlink skills
├── SKILL.md         # Generated from SKILL.md.tmpl (don't edit directly)
├── SKILL.md.tmpl    # Template: edit this, run gen:skill-docs
└── package.json     # Build scripts for browse

SKILL.md workflow

SKILL.md files are generated from .tmpl templates. To update docs:

  1. Edit the .tmpl file (e.g. SKILL.md.tmpl or browse/SKILL.md.tmpl)
  2. Run bun run gen:skill-docs (or bun run build which does it automatically)
  3. Commit both the .tmpl and generated .md files

To add a new browse command: add it to browse/src/commands.ts and rebuild. To add a snapshot flag: add it to SNAPSHOT_FLAGS in browse/src/snapshot.ts and rebuild.

Writing SKILL templates

SKILL.md.tmpl files are prompt templates read by Claude, not bash scripts. Each bash code block runs in a separate shell — variables do not persist between blocks.

Rules:

  • Use natural language for logic and state. Don't use shell variables to pass state between code blocks. Instead, tell Claude what to remember and reference it in prose (e.g., "the base branch detected in Step 0").
  • Don't hardcode branch names. Detect main/master/etc dynamically via gh pr view or gh repo view. Use {{BASE_BRANCH_DETECT}} for PR-targeting skills. Use "the base branch" in prose, <base> in code block placeholders.
  • Keep bash blocks self-contained. Each code block should work independently. If a block needs context from a previous step, restate it in the prose above.
  • Express conditionals as English. Instead of nested if/elif/else in bash, write numbered decision steps: "1. If X, do Y. 2. Otherwise, do Z."
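
Placeholders such as {{BASE_BRANCH_DETECT}} imply a substitution step when SKILL.md is generated. The sketch below shows one way that step could work. It is not the actual scripts/gen-skill-docs.ts: the RESOLVERS table, the renderTemplate function, and the replacement text are all hypothetical.

```typescript
// Hypothetical resolver table. The real generator's placeholders and
// the text they expand to may differ.
const RESOLVERS: Record<string, () => string> = {
  BASE_BRANCH_DETECT: () =>
    "Detect the base branch with gh pr view or gh repo view.",
};

// Replaces every {{NAME}} in the template with its resolver's output.
// Unknown placeholders throw, so a typo in a .tmpl file fails loudly
// instead of leaking "{{...}}" into the generated SKILL.md.
function renderTemplate(tmpl: string): string {
  return tmpl.replace(/\{\{(\w+)\}\}/g, (match: string, name: string) => {
    const resolve = RESOLVERS[name];
    if (!resolve) throw new Error(`Unknown placeholder: ${match}`);
    return resolve();
  });
}

renderTemplate("Step 0: {{BASE_BRANCH_DETECT}}");
// → "Step 0: Detect the base branch with gh pr view or gh repo view."
```

Whatever the real generator does, the rule is the same: edit the .tmpl, regenerate, and commit both files.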

Browser interaction

When you need to interact with a browser (QA, dogfooding, cookie setup), use the /browse skill or run the browse binary directly via $B <command>. NEVER use mcp__claude-in-chrome__* tools — they are slow, unreliable, and not what this project uses.

When developing gstack, .claude/skills/gstack may be a symlink back to this working directory (gitignored). This means skill changes are live immediately — great for rapid iteration, risky during big refactors where half-written skills could break other Claude Code sessions using gstack concurrently.

Check once per session: Run ls -la .claude/skills/gstack to see if it's a symlink or a real copy. If it's a symlink to your working directory, be aware that:

  • Template changes + bun run gen:skill-docs immediately affect all gstack invocations
  • Breaking changes to SKILL.md.tmpl files can break concurrent gstack sessions
  • During large refactors, remove the symlink (rm .claude/skills/gstack) so the global install at ~/.claude/skills/gstack/ is used instead

For plan reviews: When reviewing plans that modify skill templates or the gen-skill-docs pipeline, consider whether the changes should be tested in isolation before going live (especially if the user is actively using gstack in other windows).
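
The once-per-session check can also be scripted. This is a rough sketch using Node's fs API, which Bun also provides. The inspectSkillInstall function and its return shape are illustrative names, not part of gstack.

```typescript
import { lstatSync, readlinkSync } from "node:fs";

type InstallKind =
  | { kind: "missing" }
  | { kind: "copy" }
  | { kind: "symlink"; target: string };

// Reports whether the project-local skill path is a symlink or a real copy.
// The default path is the one from the text.
function inspectSkillInstall(path = ".claude/skills/gstack"): InstallKind {
  try {
    // lstat inspects the link itself rather than following it.
    if (lstatSync(path).isSymbolicLink()) {
      return { kind: "symlink", target: readlinkSync(path) };
    }
    return { kind: "copy" };
  } catch {
    // Simplification: any failure (typically ENOENT) is treated as "missing".
    return { kind: "missing" };
  }
}

const install = inspectSkillInstall();
if (install.kind === "symlink") {
  console.log(`Live symlink to ${install.target}: template changes take effect immediately.`);
}
```

A "symlink" result pointing at your working directory is the case to watch during large refactors.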

Commit style

Always bisect commits. Every commit should be a single logical change. When you've made multiple changes (e.g., a rename + a rewrite + new tests), split them into separate commits before pushing. Each commit should be independently understandable and revertable.

Examples of good bisection:

  • Rename/move separate from behavior changes
  • Test infrastructure (touchfiles, helpers) separate from test implementations
  • Template changes separate from generated file regeneration
  • Mechanical refactors separate from new features

When the user says "bisect commit" or "bisect and push," split staged/unstaged changes into logical commits and push.

CHANGELOG style

CHANGELOG.md is for users, not contributors. Write it like product release notes:

  • Lead with what the user can now do that they couldn't before. Sell the feature.
  • Use plain language, not implementation details. "You can now..." not "Refactored the..."
  • Never mention TODOS.md, internal tracking, eval infrastructure, or contributor-facing details. These are invisible to users and meaningless to them.
  • Put contributor/internal changes in a separate "For contributors" section at the bottom.
  • Every entry should make someone think "oh nice, I want to try that."
  • No jargon: say "every question now tells you which project and branch you're in" not "AskUserQuestion format standardized across skill templates via preamble resolver."

AI effort compression

When estimating or discussing effort, always show both human-team and CC+gstack time:

| Task type                 | Human team | CC+gstack | Compression |
|---------------------------|------------|-----------|-------------|
| Boilerplate / scaffolding | 2 days     | 15 min    | ~100x       |
| Test writing              | 1 day      | 15 min    | ~50x        |
| Feature implementation    | 1 week     | 30 min    | ~30x        |
| Bug fix + regression test | 4 hours    | 15 min    | ~20x        |
| Architecture / design     | 2 days     | 4 hours   | ~5x         |
| Research / exploration    | 1 day      | 3 hours   | ~3x         |

Completeness is cheap. Don't recommend shortcuts when the complete implementation is a "lake" (achievable), not an "ocean" (multi-quarter migration). See the Completeness Principle in the skill preamble for the full philosophy.

Local plans

Contributors can store long-range vision docs and design documents in ~/.gstack-dev/plans/. These are local-only (not checked in). When reviewing TODOS.md, check ~/.gstack-dev/plans/ for candidates that may be ready to promote to TODOs or implement.

E2E eval failure blame protocol

When an E2E eval fails during /ship or any other workflow, never claim "not related to our changes" without proving it. These systems have invisible couplings — a preamble text change affects agent behavior, a new helper changes timing, a regenerated SKILL.md shifts prompt context.

Required before attributing a failure to "pre-existing":

  1. Run the same eval on main (or base branch) and show it fails there too
  2. If it passes on main but fails on the branch — it IS your change. Trace the blame.
  3. If you can't run on main, say "unverified — may or may not be related" and flag it as a risk in the PR body
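
The three rules above reduce to a small decision function. This is a sketch with hypothetical type and function names, and it assumes the eval has already failed on your branch.

```typescript
// Outcome of running the same eval on main (or the base branch).
type BaseResult = "fail" | "pass" | "not-run";

type Verdict =
  | "pre-existing"      // proven: it fails on the base branch too
  | "caused-by-branch"  // passes on base, fails on the branch: trace the blame
  | "unverified";       // could not run on base: flag as a risk in the PR body

// Maps the base-branch result to the only claim you are allowed to make.
function attributeFailure(onBase: BaseResult): Verdict {
  switch (onBase) {
    case "fail":
      return "pre-existing";
    case "pass":
      return "caused-by-branch";
    case "not-run":
      return "unverified";
  }
}

attributeFailure("pass");    // → "caused-by-branch"
attributeFailure("not-run"); // → "unverified"
```

No input yields "pre-existing" without a base-branch run that actually failed.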

"Pre-existing" without receipts is a lazy claim. Prove it or don't say it.

Deploying to the active skill

The active skill lives at ~/.claude/skills/gstack/. After making changes:

  1. Push your branch
  2. Fetch and reset in the skill directory: cd ~/.claude/skills/gstack && git fetch origin && git reset --hard origin/main
  3. Rebuild: cd ~/.claude/skills/gstack && bun run build

Or copy the binary directly: cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse