diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index bccb13ff..79bfda75 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -203,6 +203,8 @@ Templates contain the workflows, tips, and examples that require human judgment. | `{{BASE_BRANCH_DETECT}}` | `gen-skill-docs.ts` | Dynamic base branch detection for PR-targeting skills (ship, review, qa, plan-ceo-review) | | `{{QA_METHODOLOGY}}` | `gen-skill-docs.ts` | Shared QA methodology block for /qa and /qa-only | | `{{DESIGN_METHODOLOGY}}` | `gen-skill-docs.ts` | Shared design audit methodology for /plan-design-review and /qa-design-review | +| `{{REVIEW_DASHBOARD}}` | `gen-skill-docs.ts` | Review Readiness Dashboard for /ship pre-flight | +| `{{TEST_BOOTSTRAP}}` | `gen-skill-docs.ts` | Test framework detection, bootstrap, CI/CD setup for /qa, /ship, /qa-design-review | This is structurally sound — if a command exists in code, it appears in docs. If it doesn't exist, it can't appear. diff --git a/CHANGELOG.md b/CHANGELOG.md index 34f1277b..be473577 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,30 +1,92 @@ # Changelog -## 0.6.0 — 2026-03-16 +## 0.6.1 — 2026-03-17 — Boil the Lake -### Added -- **New `/office-hours` skill — think before you build.** YC-style office hours that run before planning. Asks clarifying questions one at a time, challenges your premises, forces you to consider 2-3 implementation approaches, then writes a design doc. The design doc feeds directly into `/plan-ceo-review` and `/plan-eng-review`. You can now do `office-hours → plan → implement → review → QA → ship → retro` — the full lifecycle. -- **New `/debug` skill — find the root cause, not the symptom.** Systematic debugging with an Iron Law: no fixes without root cause investigation first. Traces data flow, matches against known bug patterns, tests hypotheses one at a time. If 3 fixes fail, it stops and questions the architecture instead of thrashing. -- **Every skill now knows when to stop.** New escalation protocol across all skills: DONE, DONE_WITH_CONCERNS, BLOCKED, NEEDS_CONTEXT. "It is always OK to stop and say 'this is too hard for me.' Bad work is worse than no work." -- **/ship now re-verifies before pushing.** New verification gate (Step 6.5): if code changed during review fixes, tests must pass again before push. No more "should work now" — run it and prove it. -- **/review now catches scope drift.** Before reviewing code quality, Step 1.5 compares the diff against TODOS.md and commit messages. Flags files changed that weren't in the plan, and requirements that weren't addressed in the diff. -- **/review now cites evidence for every claim.** "This pattern is safe" requires a line reference. "Tests cover this" requires a test name. No more "probably handled." -- **/plan-ceo-review now forces you to consider alternatives.** Step 0C-bis requires 2-3 implementation approaches before mode selection — one minimal, one ideal. You pick the approach, then the review runs against it. -- **Design docs flow downstream automatically.** `/office-hours` writes design docs to `~/.gstack/projects/`. `/plan-ceo-review` and `/plan-eng-review` discover and read them during their pre-review audits. Branch-filtered lookup with fallback. -- **Design lineage tracking.** Run office hours on the same feature twice? The second design doc links to the first via a `Supersedes:` field. Trace how your design evolved. +Every gstack skill now follows the **Completeness Principle**: always recommend the +full implementation when AI makes the marginal cost near-zero. 
No more "Choose B +because it's 90% of the value" when option A is 70 lines more code. -### Fixed -- Branch names with `/` (like `garrytan/better-process`) no longer break artifact filenames. Fixed in `/office-hours` and `/plan-eng-review` test plan artifacts. +Read the philosophy: https://garryslist.org/posts/boil-the-ocean + +- **Completeness scoring**: every AskUserQuestion option now shows a completeness + score (1-10), biasing toward the complete solution +- **Dual time estimates**: effort estimates show both human-team and CC+gstack time + (e.g., "human: ~2 weeks / CC: ~1 hour") with a task-type compression reference table +- **Anti-pattern examples**: concrete "don't do this" gallery in the preamble so the + principle isn't abstract +- **First-time onboarding**: new users see a one-time introduction linking to the + essay, with option to open in browser +- **Review completeness gaps**: `/review` now flags shortcut implementations where the + complete version costs <30 min CC time +- **Lake Score**: CEO and Eng review completion summaries show how many recommendations + chose the complete option vs shortcuts +- **CEO + Eng review dual-time**: temporal interrogation, effort estimates, and delight + opportunities all show both human and CC time scales + +## 0.6.0.1 — 2026-03-17 + +- **`/gstack-upgrade` now catches stale vendored copies automatically.** If your global gstack is up to date but the vendored copy in your project is behind, `/gstack-upgrade` detects the mismatch and syncs it. No more manually asking "did we vendor it?" — it just tells you and offers to update. +- **Upgrade sync is safer.** If `./setup` fails while syncing a vendored copy, gstack restores the previous version from backup instead of leaving a broken install. ### For contributors -- New structural tests for `/office-hours` (Phase headers, Design Doc, Supersedes, Smart-skip) and `/debug` (Iron Law, Root Cause, Pattern Analysis, Hypothesis, DEBUG REPORT, 3-strike). -- Escalation protocol assertions added to all preamble skills (DONE_WITH_CONCERNS, BLOCKED, NEEDS_CONTEXT). -- Two new TODOs: design docs → Supabase team store sync (P2), /plan-design-review skill (P2). + +- Standalone usage section in `gstack-upgrade/SKILL.md.tmpl` now references Steps 2 and 4.5 (DRY) instead of duplicating detection/sync bash blocks. Added one new version-comparison bash block. +- Update check fallback in standalone mode now matches the preamble pattern (global path → local path → `|| true`). + +## 0.6.0 — 2026-03-17 + +- **100% test coverage is the key to great vibe coding.** gstack now bootstraps test frameworks from scratch when your project doesn't have one. Detects your runtime, researches the best framework, asks you to pick, installs it, writes 3-5 real tests for your actual code, sets up CI/CD (GitHub Actions), creates TESTING.md, and adds test culture instructions to CLAUDE.md. Every Claude Code session after that writes tests naturally. +- **Every bug fix now gets a regression test.** When `/qa` fixes a bug and verifies it, Phase 8e.5 automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report. Auto-incrementing filenames prevent collisions across sessions. 
+- **Ship with confidence — coverage audit shows what's tested and what's not.** `/ship` Step 3.4 builds a code path map from your diff, searches for corresponding tests, and produces an ASCII coverage diagram with quality stars (★★★ = edge cases + errors, ★★ = happy path, ★ = smoke test). Gaps get tests auto-generated. PR body shows "Tests: 42 → 47 (+5 new)". +- **Your retro tracks test health.** `/retro` now shows total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area. +- **Design reviews generate regression tests too.** `/qa-design-review` Phase 8e.5 skips CSS-only fixes (those are caught by re-running the design audit) but writes tests for JavaScript behavior changes like broken dropdowns or animation failures. + +### For contributors + +- Added `generateTestBootstrap()` resolver to `gen-skill-docs.ts` (~155 lines). Registered as `{{TEST_BOOTSTRAP}}` in the RESOLVERS map. Inserted into qa, ship (Step 2.5), and qa-design-review templates. +- Phase 8e.5 regression test generation added to `qa/SKILL.md.tmpl` (46 lines) and CSS-aware variant to `qa-design-review/SKILL.md.tmpl` (12 lines). Rule 13 amended to allow creating new test files. +- Step 3.4 test coverage audit added to `ship/SKILL.md.tmpl` (88 lines) with quality scoring rubric and ASCII diagram format. +- Test health tracking added to `retro/SKILL.md.tmpl`: 3 new data gathering commands, metrics row, narrative section, JSON schema field. +- `qa-only/SKILL.md.tmpl` gets recommendation note when no test framework detected. +- `qa-report-template.md` gains Regression Tests section with deferred test specs. +- ARCHITECTURE.md placeholder table updated with `{{TEST_BOOTSTRAP}}` and `{{REVIEW_DASHBOARD}}`. +- WebSearch added to allowed-tools for qa, ship, qa-design-review. +- 26 new validation tests, 2 new E2E evals (bootstrap + coverage audit). +- 2 new P3 TODOs: CI/CD for non-GitHub providers, auto-upgrade weak tests. + +## 0.5.4 — 2026-03-17 + +- **Engineering review is always the full review now.** `/plan-eng-review` no longer asks you to choose between "big change" and "small change" modes. Every plan gets the full interactive walkthrough (architecture, code quality, tests, performance). Scope reduction is only suggested when the complexity check actually triggers — not as a standing menu option. +- **Ship stops asking about reviews once you've answered.** When `/ship` asks about missing reviews and you say "ship anyway" or "not relevant," that decision is saved for the branch. No more getting re-asked every time you re-run `/ship` after a pre-landing fix. + +### For contributors + +- Removed SMALL_CHANGE / BIG_CHANGE / SCOPE_REDUCTION menu from `plan-eng-review/SKILL.md.tmpl`. Scope reduction is now proactive (triggered by complexity check) rather than a menu item. +- Added review gate override persistence to `ship/SKILL.md.tmpl` — writes `ship-review-override` entries to `$BRANCH-reviews.jsonl` so subsequent `/ship` runs skip the gate. +- Updated 2 E2E test prompts to match new flow. + +## 0.5.3 — 2026-03-17 + +- **You're always in control — even when dreaming big.** `/plan-ceo-review` now presents every scope expansion as an individual decision you opt into. EXPANSION mode recommends enthusiastically, but you say yes or no to each idea. No more "the agent went wild and added 5 features I didn't ask for." +- **New mode: SELECTIVE EXPANSION.** Hold your current scope as the baseline, but see what else is possible. 
The agent surfaces expansion opportunities one by one with neutral recommendations — you cherry-pick the ones worth doing. Perfect for iterating on existing features where you want rigor but also want to be tempted by adjacent improvements. +- **Your CEO review visions are saved, not lost.** Expansion ideas, cherry-pick decisions, and 10x visions are now persisted to `~/.gstack/projects/{repo}/ceo-plans/` as structured design documents. Stale plans get archived automatically. If a vision is exceptional, you can promote it to `docs/designs/` in your repo for the team. + +- **Smarter ship gates.** `/ship` no longer nags you about CEO and Design reviews when they're not relevant. Eng Review is the only required gate (and you can disable even that with `gstack-config set skip_eng_review true`). CEO Review is recommended for big product changes; Design Review for UI work. The dashboard still shows all three — it just won't block you for the optional ones. + +### For contributors + +- Added SELECTIVE EXPANSION mode to `plan-ceo-review/SKILL.md.tmpl` with cherry-pick ceremony, neutral recommendation posture, and HOLD SCOPE baseline. +- Rewrote EXPANSION mode's Step 0D to include opt-in ceremony — distill vision into discrete proposals, present each as AskUserQuestion. +- Added CEO plan persistence (0D-POST step): structured markdown with YAML frontmatter (`status: ACTIVE/ARCHIVED/PROMOTED`), scope decisions table, archival flow. +- Added `docs/designs` promotion step after Review Log. +- Mode Quick Reference table expanded to 4 columns. +- Review Readiness Dashboard: Eng Review required (overridable via `skip_eng_review` config), CEO/Design optional with agent judgment. +- New tests: CEO review mode validation (4 modes, persistence, promotion), SELECTIVE EXPANSION E2E test. ## 0.5.2 — 2026-03-17 - **Your design consultant now takes creative risks.** `/design-consultation` doesn't just propose a safe, coherent system — it explicitly breaks down SAFE CHOICES (category baseline) vs. RISKS (where your product stands out). You pick which rules to break. Every risk comes with a rationale for why it works and what it costs. -- **See the competition before you choose.** When you opt into research, the agent browses competitor sites with screenshots and accessibility tree analysis — not just web search results. You see what the landscape looks like before making design decisions. +- **See the landscape before you choose.** When you opt into research, the agent browses real sites in your space with screenshots and accessibility tree analysis — not just web search results. You see what's out there before making design decisions. - **Preview pages that look like your product.** The preview page now renders realistic product mockups — dashboards with sidebar nav and data tables, marketing pages with hero sections, settings pages with forms — not just font swatches and color palettes. ## 0.5.1 — 2026-03-17 diff --git a/CLAUDE.md b/CLAUDE.md index 42f632c8..78c6df50 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -123,6 +123,23 @@ CHANGELOG.md is **for users**, not contributors. Write it like product release n - No jargon: say "every question now tells you which project and branch you're in" not "AskUserQuestion format standardized across skill templates via preamble resolver." 
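+
+To make the register concrete, here is the same change written both ways (an illustrative pair built from the example above, not real entries):
+
+```
+Contributor-facing (don't): Standardized AskUserQuestion format across skill
+templates via the preamble resolver.
+
+User-facing (do): Every question now tells you which project and branch
+you're in, so you never have to guess which window is asking.
+```
+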
+## AI effort compression
+
+When estimating or discussing effort, always show both human-team and CC+gstack time:
+
+| Task type | Human team | CC+gstack | Compression |
+|-----------|-----------|-----------|-------------|
+| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
+| Test writing | 1 day | 15 min | ~50x |
+| Feature implementation | 1 week | 30 min | ~30x |
+| Bug fix + regression test | 4 hours | 15 min | ~20x |
+| Architecture / design | 2 days | 4 hours | ~5x |
+| Research / exploration | 1 day | 3 hours | ~3x |
+
+Completeness is cheap. Don't recommend shortcuts when the complete implementation is a "lake" (achievable), not an "ocean" (multi-quarter migration). See the Completeness Principle in the skill preamble for the full philosophy.
+
 ## Local plans

 Contributors can store long-range vision docs and design documents in `~/.gstack-dev/plans/`.

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 4af2e889..2fce6375 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -54,6 +54,10 @@ the issue, fix it, and open a PR.

 This is the best way to contribute: fix gstack while doing your real work, in the project where you actually felt the pain.

+### Session awareness
+
+When you have 3+ gstack sessions open simultaneously, every question tells you which project, which branch, and what's happening. No more staring at a question thinking "wait, which window is this?" The format is consistent across all 13 skills.
+
 ## Working on gstack inside the gstack repo

 When you're editing gstack skills and want to test them by actually using gstack

diff --git a/README.md b/README.md
index fd4b684f..c0a5f0b6 100644
--- a/README.md
+++ b/README.md
@@ -1,136 +1,136 @@
 # gstack

-**gstack turns Claude Code from one generic assistant into a team of specialists you can summon on demand.**
+Hi, I'm [Garry Tan](https://x.com/garrytan). I'm President & CEO of [Y Combinator](https://www.ycombinator.com/), where I've worked with thousands of startups, including Coinbase, Instacart, and Rippling when the founders were just one or two people in a garage — companies now worth tens of billions of dollars. Before YC, I designed the Palantir logo and was one of the first eng managers/PMs/designers there. I cofounded Posterous, a blog platform we sold to Twitter. I built Bookface, YC's internal social network, back in 2013. I've been building products as a designer, PM, and eng manager for a long time.

-Thirteen opinionated workflow skills for [Claude Code](https://docs.anthropic.com/en/docs/claude-code). Plan review, design review, code review, one-command shipping, browser automation, QA testing, engineering retrospectives, and post-ship documentation — all as slash commands.
+And right now I am in the middle of something that feels like a new era entirely.

-### Without gstack
+In the last 60 days I have written **over 600,000 lines of production code** — 35% tests — and I am shipping **10,000 to 20,000 usable lines of code per day** part-time, while doing all my duties as CEO of YC. That is not a typo. My last `/retro` (developer stats from the last 7 days) across 3 projects: **140,751 lines added, 362 commits, ~115k net LOC**. The models are getting dramatically better every week. We are at the dawn of something real — one person shipping at a scale that used to require a team of twenty.
-- The agent takes your request literally — it never asks if you're building the right thing -- It will implement exactly what you said, even when the real product is something bigger -- "Review my PR" gives inconsistent depth every time -- "Ship this" turns into a long back-and-forth about what to do -- The agent can write code but can't see your app — it's half blind -- You still do QA by hand: open browser, click around, check pages, squint at layouts +**2026 — 1,237 contributions and counting:** -### With gstack +![GitHub contributions 2026 — 1,237 contributions, massive acceleration in Jan-Mar](docs/images/github-2026.png) -| Skill | Mode | What it does | -|-------|------|--------------| -| `/plan-ceo-review` | Founder / CEO | Rethink the problem. Find the 10-star product hiding inside the request. | -| `/plan-eng-review` | Eng manager / tech lead | Lock in architecture, data flow, diagrams, edge cases, and tests. | -| `/plan-design-review` | Senior product designer | Designer's eye audit. 80-item checklist, letter grades, AI Slop detection, DESIGN.md inference. Report only — never touches code. | -| `/design-consultation` | Design consultant | Build a complete design system from scratch. Browses competitors to get in the ballpark, proposes safe choices AND creative risks, generates realistic product mockups, and writes DESIGN.md. | -| `/review` | Paranoid staff engineer | Find the bugs that pass CI but blow up in production. Triages Greptile review comments. | -| `/ship` | Release engineer | Sync main, run tests, resolve Greptile reviews, push, open PR. For a ready branch, not for deciding what to build. | -| `/browse` | QA engineer | Give the agent eyes. It logs in, clicks through your app, takes screenshots, catches breakage. Full QA pass in 60 seconds. | -| `/qa` | QA + fix engineer | Test app, find bugs, fix them with atomic commits, re-verify. Before/after health scores and ship-readiness summary. Three tiers: Quick, Standard, Exhaustive. | -| `/qa-only` | QA reporter | Report-only QA testing. Same methodology as /qa but never fixes anything. Use when you want a pure bug report without code changes. | -| `/qa-design-review` | Designer + frontend engineer | Same design audit as /plan-design-review, then fixes what it finds. Atomic `style(design):` commits, before/after screenshots, CSS-safe self-regulation. | -| `/setup-browser-cookies` | Session manager | Import cookies from your real browser (Comet, Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages without logging in manually. | -| `/retro` | Engineering manager | Team-aware retro: your deep-dive + per-person praise and growth opportunities for every contributor. | -| `/document-release` | Technical writer | Update README, ARCHITECTURE, CONTRIBUTING, and project docs to match what you just shipped. | +**2013 — when I built Bookface at YC (772 contributions):** -## Demo: one feature, five modes +![GitHub contributions 2013 — 772 contributions building Bookface at YC](docs/images/github-2013.png) -I start almost every feature in plan mode. I describe what I want to build, then use `/plan-ceo-review` to pressure-test whether I am even building the right thing. Only after the product direction is locked do I shift into engineering, review, ship, and test. +Same person. Different era. The difference is the tooling. + +**gstack is how I do it.** It is my open source software factory. 
It turns Claude Code into a virtual engineering team you actually manage — a CEO who rethinks the product, an eng manager who locks the architecture, a designer who catches AI slop, a paranoid reviewer who finds production bugs, a QA lead who opens a real browser and clicks through your app, and a release engineer who ships the PR. Thirteen specialists, all as slash commands, all Markdown, **all free, MIT license, available right now.** + +I am learning how to get to the edge of what agentic systems can do as of March 2026, and this is my live experiment. I am sharing it because I want the whole world on this journey with me. + +Fork it. Improve it. Make it yours. Don't player hate, appreciate. + +--- + +## See it work ``` -You: [enter plan mode, describe the feature] - I want to add seller photo upload to the listing app. Sellers - should be able to upload a photo of their item and we create - the listing from it. - -You: /plan-ceo-review - -Claude: "Photo upload" is not the feature. The real job is helping sellers - create listings that actually sell. Here's what the 10-star version - looks like: auto-identify the product from the photo, pull specs and - pricing comps from the web, draft the title and description, suggest - the best hero image... - -You: /plan-eng-review - -Claude: [Architecture diagram, state machine for upload → classify → enrich - → draft pipeline, async job boundaries, failure modes, test matrix] - -You: [exit plan mode, implement the plan] - -You: /review - -Claude: Race condition: two tabs can overwrite cover-photo selection. - Trust boundary: pulling web data into draft generation creates - a prompt injection vector. Missing: orphan cleanup for failed uploads. - -You: (fix the issues) - -You: /ship - -Claude: [Syncs main, runs tests, pushes branch, opens PR — 6 tool calls, done] - -You: /qa - -Claude: Analyzing branch diff... 8 files changed, 3 routes affected. - [Tests /listings/new, /listings/:id, /api/listings against localhost:3000] - All 3 routes working. Upload + enrichment flow passes end to end. - -You: /setup-browser-cookies staging.myapp.com - -Claude: Imported 8 cookies for staging.myapp.com from Chrome. - -You: /qa https://staging.myapp.com --quick - -Claude: [Smoke test: homepage + 5 pages, 30 seconds] - Health Score: 91/100. No critical issues. 1 medium: mobile nav overlap. - -You: /plan-design-review https://staging.myapp.com +You: I want to add photo upload for sellers. +You: /plan-ceo-review +Claude: "Photo upload" is not the feature. The real job is helping + sellers create listings that actually sell. What if we + auto-identify the product, pull specs and comps from the + web, and draft the listing automatically? That's 10 stars. + "Upload a photo" is 3 stars. Which are we building? + [8 expansion proposals, you cherry-pick 5, defer 3 to backlog] +You: /plan-design-review Claude: Design Score: B | AI Slop Score: C + "Upload flow looks like a default Bootstrap form." + [80-item audit, infers your design system, exports DESIGN.md] + [flags 3 AI slop patterns: gradient hero, icon grid, uniform radius] - "The site communicates competence but not confidence." - "I notice the spacing between sections is uniform — nothing feels important." - "The first 3 things my eye goes to are: the gradient hero, a 3-column - feature grid with icons in circles, and a cookie-cutter CTA." - "If I had to describe this in one word: template." 
+You: /plan-eng-review
+Claude: ┌─────────┐    ┌──────────┐    ┌──────────┐    ┌─────────┐
+        │ Upload  │───▶│ Classify │───▶│  Enrich  │───▶│  Draft  │
+        │ (sync)  │    │ (async)  │    │ (async)  │    │ (async) │
+        └─────────┘    └──────────┘    └──────────┘    └─────────┘
+        [ASCII diagrams for every data flow, state machine, error path]
+        [14-case test matrix, 6 failure modes mapped, 3 security concerns]
-   Top issues: generic typography (Inter everywhere), AI slop patterns
-   (3-column icon grid, purple gradient), flat heading scale.
+
+You: Approve plan. Exit plan mode.
+        [Claude writes 2,400 lines across 11 files — models, services,
+         controllers, views, migrations, and tests. ~8 minutes.]
-   [Full report with letter grades, 12 findings, inferred design system]
-   Want me to save this as your DESIGN.md?
+
+You: /review
+Claude: [AUTO-FIXED] Orphan S3 cleanup on failed upload
+        [AUTO-FIXED] Missing index on listings.status
+        [ASK] Race condition on hero image selection → You: yes
+        [traces every new enum value through all switch statements]
+        3 issues — 2 auto-fixed, 1 fixed after approval.
-You: /qa-design-review
+
+You: /qa https://staging.myapp.com
+Claude: [opens real browser, logs in, uploads photos, clicks through flows]
+        Upload → classify → enrich → draft: end to end ✓
+        Mobile: ✓ | Slow connection: ✓ | Bad image: ✓
+        [finds bug: preview doesn't clear on second upload — fixes it]
+        Regression test generated.
-Claude: [Runs the same audit, then fixes 8 design issues]
-   style(design): FINDING-001 — replace Inter with system font stack
-   style(design): FINDING-003 — add heading scale contrast
-   style(design): FINDING-007 — break 3-column grid into asymmetric layout
-   Design Score: B → A- | AI Slop Score: C → A
+
+You: /ship
+Claude: Tests: 42 → 51 (+9 new)
+        Coverage: 14/14 code paths (100%)
+        PR: github.com/you/app/pull/42
```
-## Who this is for
+One feature. Six commands. The agent reframed the product, ran an 80-item design audit, drew the architecture, wrote 2,400 lines of code, found a race condition I would have missed, auto-fixed two issues, opened a real browser to QA test, found and fixed a bug I didn't know about, and wrote 9 tests, including a regression test. That is not a copilot. That is a team.
-You already use Claude Code heavily and want consistent, high-rigor workflows instead of one mushy generic mode. You want to tell the model what kind of brain to use right now — founder taste, engineering rigor, paranoid review, or fast execution.
+
+---
-This is not a prompt pack for beginners. It is an operating system for people who ship.
+
+## The team
-## How to fly: 10 sessions at once
+
+| Skill | Your specialist | What they do |
+|-------|----------------|--------------|
-gstack is powerful with one Claude Code session. It is transformative with ten.
+| `/plan-ceo-review` | **CEO / Founder** | Rethink the problem. Find the 10-star product hiding inside the request. Four modes: Expansion, Selective Expansion, Hold Scope, Reduction. |
-[Conductor](https://conductor.build) runs multiple Claude Code sessions in parallel — each in its own isolated workspace. That means you can have one session running `/qa` on staging, another doing `/review` on a PR, a third implementing a feature, and seven more working on other branches. All at the same time.
+| `/plan-eng-review` | **Eng Manager** | Lock in architecture, data flow, diagrams, edge cases, and tests. Forces hidden assumptions into the open. |
+| `/plan-design-review` | **Senior Designer** | 80-item design audit with letter grades. AI Slop detection. Infers your design system. Report only — never touches code. |
-Each workspace gets its own isolated browser instance automatically — separate Chromium process, cookies, tabs, and logs stored in `.gstack/` inside each project root. No port collisions, no shared state, no configuration needed. `/browse` and `/qa` sessions never interfere with each other, even across ten parallel workspaces.
+| `/design-consultation` | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. |
-This is the setup I use. One person, ten parallel agents, each with the right cognitive mode for its task. That is not incremental improvement. That is a different way of building software.
+| `/review` | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. |
+| `/ship` | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
+| `/browse` | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
+| `/qa` | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
+| `/qa-only` | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
+| `/qa-design-review` | **Designer Who Codes** | Same audit as /plan-design-review, then fixes what it finds. Atomic commits, before/after screenshots. |
+| `/setup-browser-cookies` | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
+| `/retro` | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
+| `/document-release` | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. |
+
+**[Deep dives with examples and philosophy for every skill →](docs/skills.md)**
+
+---
+
+## What's new and why it matters
+
+**Design is at the heart.** `/design-consultation` doesn't just pick fonts. It researches what's out there in your space, proposes safe choices AND creative risks, generates realistic mockups of your actual product, and writes `DESIGN.md` — and then `/qa-design-review` and `/plan-eng-review` read what you chose. Design decisions flow through the whole system.
+
+**`/qa` was a massive unlock.** It let me go from 6 to 12 parallel workers. Claude Code saying *"I SEE THE ISSUE"* and then actually fixing it, generating a regression test, and verifying the fix — that changed how I work. The agent has eyes now.
+
+**Smart review routing.** Just like at a well-run startup: the CEO doesn't need to look at infra bug fixes, and design review isn't needed for backend changes. gstack tracks which reviews have run, figures out what's appropriate, and just does the smart thing. The Review Readiness Dashboard tells you where you stand before you ship.
+
+**Test everything.** `/ship` bootstraps test frameworks from scratch if your project doesn't have one. Every `/ship` run produces a coverage audit.
Every `/qa` bug fix generates a regression test. 100% test coverage is the goal — tests make vibe coding safe instead of yolo coding. + +**`/document-release` is the engineer you never had.** It reads every doc file in your project, cross-references the diff, and updates everything that drifted. README, ARCHITECTURE, CONTRIBUTING, CLAUDE.md, TODOS — all kept current automatically. + +--- + +## 10 sessions at once + +gstack is powerful with one session. It is transformative with ten. + +[Conductor](https://conductor.build) runs multiple Claude Code sessions in parallel — each in its own isolated workspace. One session running `/qa` on staging, another doing `/review` on a PR, a third implementing a feature, and seven more on other branches. All at the same time. + +One person, ten parallel agents, each with the right cognitive mode. That is a different way of building software. + +--- + +## Install — takes 30 seconds + +**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+ ### Step 1: Install on your machine -Open Claude Code and paste this. Claude will do the rest. +Open Claude Code and paste this. Claude does the rest. > Install gstack: run `git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup` then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /qa-design-review, /setup-browser-cookies, /retro, /document-release. Then ask the user if they also want to add gstack to the current project so teammates get it. @@ -138,25 +138,14 @@ Open Claude Code and paste this. Claude will do the rest. > Add gstack to this project: run `cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup` then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /qa-design-review, /setup-browser-cookies, /retro, /document-release, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills. -Real files get committed to your repo (not a submodule), so `git clone` just works. The binary and node\_modules are gitignored — teammates just need to run `cd .claude/skills/gstack && ./setup` once to build (or `/browse` handles it automatically on first use). - -### What gets installed - -- Skill files (Markdown prompts) in `~/.claude/skills/gstack/` (or `.claude/skills/gstack/` for project installs) -- Symlinks at `~/.claude/skills/browse`, `~/.claude/skills/qa`, `~/.claude/skills/review`, etc. pointing into the gstack directory -- Browser binary at `browse/dist/browse` (~58MB, gitignored) -- `node_modules/` (gitignored) -- `/retro` saves JSON snapshots to `.context/retros/` in your project for trend tracking - -Everything lives inside `.claude/`. Nothing touches your PATH or runs in the background. +Real files get committed to your repo (not a submodule), so `git clone` just works. Everything lives inside `.claude/`. 
Nothing touches your PATH or runs in the background. --- ``` +----------------------------------------------------------------------------+ | | -| Are you a great software engineer who loves to write 10K LOC/day | -| and land 10 PRs a day like Garry? | +| Are you a great software engineer who wants to ship 10K+ LOC/day? | | | | Come work at YC: ycombinator.com/software | | | @@ -169,694 +158,36 @@ Everything lives inside `.claude/`. Nothing touches your PATH or runs in the bac --- -## How I use these skills +## Come ride the wave -Created by [Garry Tan](https://x.com/garrytan), President & CEO of [Y Combinator](https://www.ycombinator.com/). +This is **free, MIT licensed, open source, available now.** No premium tier. No waitlist. No strings. -I built gstack because I do not want AI coding tools stuck in one mushy mode. +I open sourced how I do development and I am actively upgrading my own software factory here. You can fork it and make it your own. That's the whole point. I want everyone on this journey. -Planning is not review. Review is not shipping. Founder taste is not engineering rigor. If you blur all of that together, you usually get a mediocre blend of all four. +The models are getting better fast. The people who figure out how to work with them now — really work with them, not just dabble — are going to have a massive advantage. This is that window. Let's go. -I want explicit gears. - -These skills let me tell the model what kind of brain I want right now. I can switch cognitive modes on demand — founder, eng manager, paranoid reviewer, release machine. That is the unlock. +**[github.com/garrytan/gstack](https://github.com/garrytan/gstack)** — MIT License --- -## `/plan-ceo-review` - -This is my **founder mode**. - -This is where I want the model to think with taste, ambition, user empathy, and a long time horizon. I do not want it taking the request literally. I want it asking a more important question first: - -**What is this product actually for?** - -I think of this as **Brian Chesky mode**. - -The point is not to implement the obvious ticket. The point is to rethink the problem from the user's point of view and find the version that feels inevitable, delightful, and maybe even a little magical. - -### Example - -Say I am building a Craigslist-style listing app and I say: - -> "Let sellers upload a photo for their item." - -A weak assistant will add a file picker and save an image. - -That is not the real product. - -In `/plan-ceo-review`, I want the model to ask whether "photo upload" is even the feature. Maybe the real feature is helping someone create a listing that actually sells. - -If that is the real job, the whole plan changes. - -Now the model should ask: - -* Can we identify the product from the photo? -* Can we infer the SKU or model number? -* Can we search the web and draft the title and description automatically? -* Can we pull specs, category, and pricing comps? -* Can we suggest which photo will convert best as the hero image? -* Can we detect when the uploaded photo is ugly, dark, cluttered, or low-trust? -* Can we make the experience feel premium instead of like a dead form from 2007? - -That is what `/plan-ceo-review` does for me. - -It does not just ask, "how do I add this feature?" -It asks, **"what is the 10-star product hiding inside this request?"** - -That is a very different kind of power. - ---- - -## `/plan-eng-review` - -This is my **eng manager mode**. - -Once the product direction is right, I want a different kind of intelligence entirely. 
I do not want more sprawling ideation. I do not want more "wouldn't it be cool if." I want the model to become my best technical lead. - -This mode should nail: - -* architecture -* system boundaries -* data flow -* state transitions -* failure modes -* edge cases -* trust boundaries -* test coverage - -And one surprisingly big unlock for me: **diagrams**. - -LLMs get way more complete when you force them to draw the system. Sequence diagrams, state diagrams, component diagrams, data-flow diagrams, even test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder. - -So `/plan-eng-review` is where I want the model to build the technical spine that can carry the product vision. - -### Example - -Take the same listing app example. - -Let's say `/plan-ceo-review` already did its job. We decided the real feature is not just photo upload. It is a smart listing flow that: - -* uploads photos -* identifies the product -* enriches the listing from the web -* drafts a strong title and description -* suggests the best hero image - -Now `/plan-eng-review` takes over. - -Now I want the model to answer questions like: - -* What is the architecture for upload, classification, enrichment, and draft generation? -* Which steps happen synchronously, and which go to background jobs? -* Where are the boundaries between app server, object storage, vision model, search/enrichment APIs, and the listing database? -* What happens if upload succeeds but enrichment fails? -* What happens if product identification is low-confidence? -* How do retries work? -* How do we prevent duplicate jobs? -* What gets persisted when, and what can be safely recomputed? - -And this is where I want diagrams — architecture diagrams, state models, data-flow diagrams, test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder. - -That is `/plan-eng-review`. - -Not "make the idea smaller." -**Make the idea buildable.** - ---- - -## `/plan-design-review` - -This is my **senior designer mode**. - -Most developers cannot tell whether their site looks AI-generated. I could not, until I started paying attention. There is a growing class of sites that are functional but soulless — they work fine but scream "an AI built this and nobody with taste looked at it." Purple gradients, 3-column icon grids, uniform bubbly border-radius on everything, centered text on every section, decorative blobs floating in the background. The ChatGPT aesthetic. - -`/plan-design-review` gives the agent a designer's eye. - -It opens your site and reacts to it the way a Stripe or Linear designer would — immediately, viscerally, with opinions. The first output is a structured gut reaction: what the site communicates at a glance, what the eye is drawn to, and a one-word verdict. That is the most valuable part. Everything after is supporting evidence. - -Then it extracts your actual design system by reading the CSS: how many fonts are you really using? What is your heading scale? How many colors? Is your spacing on a grid or arbitrary? It structures this as an **Inferred Design System** and offers to save it as a `DESIGN.md` — so now you have a design system you never had before, derived from what you actually shipped. - -The audit covers 80 items across 10 categories. Typography, spacing, hierarchy, color, responsive design, interaction states, motion, content quality, performance feel, and — the headline feature — **AI Slop Detection**. 
Ten specific anti-patterns that mark a site as machine-generated. - -Every finding gets a letter grade (A through F) per category. The report opens with two headline scores: **Design Score** and **AI Slop Score**. "Your site got a B on design but a D on AI Slop" tells you exactly where to focus. - -### Example - -``` -You: /plan-design-review https://myapp.com - -Claude: Design Score: C | AI Slop Score: D - - "The site communicates generic SaaS energy." - "I notice the hero section uses a blue-to-purple gradient - with a 3-column feature grid below it — the single most - recognizable AI-generated layout pattern." - "The first 3 things my eye goes to are: the gradient, the - symmetrical icon grid, and a 'Get Started' button that - looks like every other 'Get Started' button." - "If I had to describe this in one word: template." - - Inferred Design System: - - Fonts: Inter (body), Inter (headings) — 1 font family - - Colors: 14 unique non-gray colors (flag: >12) - - Heading scale: 36/24/18/16 — flat, no drama between sizes - - Spacing: mostly 16px and 24px — no rhythm - - Top 5 Design Improvements: - 1. Break the 3-column icon grid (High, AI Slop) - 2. Add heading scale contrast — at least 2:1 ratio (High, Typography) - 3. Replace uniform border-radius with intentional hierarchy (Medium, Spacing) - 4. Kill the gradient hero — use a real image or bold typography (High, AI Slop) - 5. Add a second font for headings — Inter-only reads as generic (Medium, Typography) - - [Full report saved to .gstack/design-reports/] - Want me to save this inferred design system as your DESIGN.md? -``` - -This is report only — it never touches your code. Use `/qa-design-review` when you want it to fix what it finds. - ---- - -## `/design-consultation` - -This is my **design partner mode**. - -`/plan-design-review` audits a site that already exists. `/design-consultation` is for when you have nothing yet — no design system, no font choices, no color palette. You are starting from zero and you want a senior designer to sit down with you and build the whole visual identity together. - -It is a conversation, not a form. The agent asks about your product, your users, and your audience. It thinks about what your product needs to communicate — trust, speed, craft, warmth, whatever fits — and works backward from that to concrete choices. Then it proposes a complete, coherent design system: aesthetic direction, typography (3+ fonts with specific roles), color palette with hex values, spacing scale, layout approach, and motion strategy. Every recommendation comes with a rationale. Every choice reinforces every other choice. - -But coherence is table stakes. Every dev tool dashboard looks the same — clean sans-serif, muted grays, a blue accent. They are all coherent. They are all forgettable. The difference between a product that looks "nice" and one that people actually recognize is the **deliberate creative risks**: the unexpected serif for headings, the bold accent nobody else in your category uses, the tighter spacing that makes your data feel authoritative instead of airy. - -That is what `/design-consultation` is really about. It does not just propose a safe system. It proposes safe choices AND risks — and tells you which is which. "Here are the choices that keep you literate in your category. And here is where I think you should break from convention, and why." You pick which risks to take. The agent checks that the whole system still coheres either way. 
- -If you want, the agent will browse your competitors' actual sites — take screenshots, analyze their fonts and colors and spacing — so you can see the landscape before you make choices. This is not about copying. It is about getting in the ballpark so you know what the conventions are, and then deciding which ones are worth breaking. - -After you agree on the system, it generates an interactive HTML preview page — not just swatches and font samples, but realistic product pages. If you are building a dashboard, you see a dashboard with a sidebar, data tables, and stat cards. If you are building a marketing site, you see a hero section with real copy and a CTA. Everything rendered in your design system, with your product name, in light and dark mode. You see what your product could feel like before a single line of production code is written. - -Then it writes `DESIGN.md` to your repo root — your project's design source of truth — and updates `CLAUDE.md` so every future Claude Code session respects the system. From that point on, `/qa-design-review` can audit against it, and any agent working on your frontend knows the rules. - -``` -You: /design-consultation - -Claude: From the README, this looks like a developer analytics dashboard - for engineering teams. Sound right? Tell me about the feeling - you want — and would you like me to browse some competitors - first so we can see the landscape? - -You: Yes, it's for eng managers. Show me what's out there. - -Claude: [browses Linear, Vercel, Datadog, Posthog — takes screenshots] - - Here's the landscape. They all converge on clean sans-serifs - and muted blue-gray palettes. Linear stands out with dark-first - and tight spacing. The rest feel interchangeable. - - Here's my proposal — and where I'd play it safe vs. take risks: - - AESTHETIC: Industrial/Utilitarian — data-first, every pixel earns its place - TYPOGRAPHY: Geist (body + UI), Instrument Serif (display), JetBrains Mono (data) - COLOR: Restrained — slate neutrals, single teal accent (#0D9488) - SPACING: 4px base, compact density - LAYOUT: Grid-disciplined, 12-column - MOTION: Minimal-functional — only transitions that aid comprehension - - SAFE CHOICES (match category expectations): - Geist for body — your users already know this font from Vercel - Slate neutrals — warm grays feel right for data-heavy UIs - Grid-disciplined layout — dashboards need predictable alignment - - RISKS (where you'd stand out): - Instrument Serif for display headings — nobody in dev tools - uses a serif. It creates instant hierarchy contrast and - signals "we care about craft" in a space full of sans-serifs. - Single teal accent instead of blue — every competitor uses blue. - Teal is close enough to feel trustworthy, different enough - to be recognizable. - 4px compact density — tighter than Linear, says "we respect - your screen real estate." - - The safe choices keep you literate. The risks are where your - product gets its own face. Which risks appeal to you? - -You: Love the serif risk. Let's see it. - -Claude: Preview opened in your browser. - [/tmp/design-consultation-preview-1710000000.html] - - The page shows a realistic dashboard: sidebar nav, header with - avatar, stat cards, data table — all in your design system. - Instrument Serif at hero scale, Geist in body text, JetBrains - Mono in the data table. Teal accent on buttons and active states. - Light and dark mode toggle in the corner. - - How does it feel? Want to adjust anything before I write DESIGN.md? - -You: Ship it. 
- -Claude: Wrote DESIGN.md (typography, color, spacing, layout, motion). - Updated CLAUDE.md with design system guidelines. - All future sessions will respect these choices. -``` - ---- - -## `/qa-design-review` - -This is my **designer who codes mode**. - -`/plan-design-review` tells you what is wrong. `/qa-design-review` fixes it. - -It runs the same 80-item audit, then enters a fix loop: for each design finding, it locates the source file, makes the minimal CSS/styling change, commits with `style(design): FINDING-NNN`, re-navigates to verify, and takes before/after screenshots. One commit per fix, fully bisectable. - -The self-regulation heuristic is tuned for design work — CSS-only changes get a free pass (they are inherently safe and reversible), but changes to component JSX/TSX files count against the risk budget. Hard cap at 30 fixes. If the risk score exceeds 20%, it stops and asks. - -``` -You: /qa-design-review https://myapp.com - -Claude: [Runs full design audit — same output as /plan-design-review] - Design Score: C | AI Slop Score: D - 12 findings (4 high, 5 medium, 3 polish) - - Fixing 9 design issues... - - style(design): FINDING-001 — replace 3-column icon grid with asymmetric layout - style(design): FINDING-002 — add heading scale 48/32/24/18/16 - style(design): FINDING-003 — remove gradient hero, use bold typography - style(design): FINDING-004 — add second font for headings - style(design): FINDING-005 — vary border-radius by element role - style(design): FINDING-006 — left-align body text, reserve center for headings - style(design): FINDING-007 — add hover/focus states to all interactive elements - style(design): FINDING-008 — add prefers-reduced-motion media query - style(design): FINDING-009 — set max content width to 680px for body text - - Final audit: - Design Score: C → B+ | AI Slop Score: D → A - 9 fixes applied (8 verified, 1 best-effort). 3 deferred. - - [Report with before/after screenshots saved to .gstack/design-reports/] -``` - -Nine commits, each touching one concern. The AI Slop score went from D to A because the three most recognizable patterns (gradient hero, 3-column grid, uniform radius) are gone. The design score improved two grades because the typography now has a scale, the spacing has hierarchy, and interactive elements have proper states. - ---- - -## `/review` - -This is my **paranoid staff engineer mode**. - -Passing tests do not mean the branch is safe. - -`/review` exists because there is a whole class of bugs that can survive CI and still punch you in the face in production. This mode is not about dreaming bigger. It is not about making the plan prettier. It is about asking: - -**What can still break?** - -This is a structural audit, not a style nitpick pass. I want the model to look for things like: - -* N+1 queries -* stale reads -* race conditions -* bad trust boundaries -* missing indexes -* escaping bugs -* broken invariants -* bad retry logic -* tests that pass while missing the real failure mode - -### Example - -Suppose the smart listing flow is implemented and the tests are green. - -`/review` should still ask: - -* Did I introduce an N+1 query when rendering listing photos or draft suggestions? -* Am I trusting client-provided file metadata instead of validating the actual file? -* Can two tabs race and overwrite cover-photo selection or item details? -* Do failed uploads leave orphaned files in storage forever? -* Can the "exactly one hero image" rule break under concurrency? 
-* If enrichment APIs partially fail, do I degrade gracefully or save garbage? -* Did I accidentally create a prompt injection or trust-boundary problem by pulling web data into draft generation? - -That is the point of `/review`. - -I do not want flattery here. -I want the model imagining the production incident before it happens. - ---- - -## `/ship` - -This is my **release machine mode**. - -Once I have decided what to build, nailed the technical plan, and run a serious review, I do not want more talking. I want execution. - -`/ship` is for the final mile. It is for a ready branch, not for deciding what to build. - -This is where the model should stop behaving like a brainstorm partner and start behaving like a disciplined release engineer: sync with main, run the right tests, make sure the branch state is sane, update changelog or versioning if the repo expects it, push, and create or update the PR. - -Momentum matters here. - -A lot of branches die when the interesting work is done and only the boring release work is left. Humans procrastinate that part. AI should not. - -### Example - -Suppose the smart listing flow is finished. - -The product thinking is done. -The architecture is done. -The review pass is done. -Now the branch just needs to get landed. - -That is what `/ship` is for. - -It takes care of the repetitive release hygiene so I do not bleed energy on: - -* syncing with main -* rerunning tests -* checking for weird branch state -* updating changelog/version metadata -* pushing the branch -* opening or updating the PR - -At this point I do not want more ideation. -I want the plane landed. - ---- - -## Greptile integration - -[Greptile](https://greptile.com) is a YC company that reviews your PRs automatically. It catches real bugs — race conditions, security issues, things that pass CI and blow up in production. It has genuinely saved my ass more than once. I love these guys. - -**Setup:** Install Greptile on your GitHub repo at [greptile.com](https://greptile.com) — it takes about 30 seconds. Once it's reviewing your PRs, gstack picks up its comments automatically. No additional configuration. - -The problem with any automated reviewer is triage. Greptile is good, but not every comment is a real issue. Some are false positives. Some flag things you already fixed three commits ago. Without a triage layer, the comments pile up and you start ignoring them — which defeats the purpose. - -gstack solves this. `/review` and `/ship` are now Greptile-aware. They read Greptile's comments, classify each one, and take action: - -- **Valid issues** get added to the critical findings and fixed before shipping -- **Already-fixed issues** get an auto-reply acknowledging the catch -- **False positives** get pushed back — you confirm, and a reply goes out explaining why it's wrong - -The result is a two-layer review: Greptile catches things asynchronously on the PR, then `/review` and `/ship` triage those findings as part of the normal workflow. Nothing falls through the cracks. - -It also learns. Every false positive you confirm gets saved to `~/.gstack/greptile-history.md`. Future runs auto-skip known FP patterns for your codebase. And `/retro` tracks Greptile's batting average over time — so you can see whether the signal-to-noise ratio is improving. - -### Example - -``` -You: /ship - -Claude: [syncs main, runs tests, pre-landing review...] - - Greptile found 3 comments on this PR: - - [VALID] app/services/payment_service.rb:47 — Race condition: - concurrent charges can double-debit. 
Recommend DB-level advisory - lock around the charge block. - → https://github.com/you/app/pull/42#discussion_r123 - - [ALREADY FIXED] app/models/user.rb:12 — Missing null check - on email before downcase. - → Already fixed in commit abc1234. Auto-replying. - - [FALSE POSITIVE] lib/auth.rb:88 — "Token comparison should - use constant-time comparison." We already use - ActiveSupport::SecurityUtils.secure_compare here. - → https://github.com/you/app/pull/42#discussion_r125 - - Fixing the race condition in payment_service.rb... - Applied advisory lock. Committed: fix: address Greptile - review — advisory lock on concurrent charges - - The auth token comment is a false positive — we're already - using secure_compare. - - Options for the false positive: - A) Reply to Greptile explaining (recommended) - B) Fix it anyway - C) Ignore - -You: A - -Claude: Replied to Greptile. Re-running tests after the fix... - All tests pass. Continuing to version bump. - - [creates PR with Greptile Review section in body] - PR: https://github.com/you/app/pull/42 -``` - -Three Greptile comments. One real fix. One auto-acknowledged. One false positive pushed back with a reply. Total extra time: about 30 seconds. And the PR body has the full audit trail. - ---- - -## `/browse` - -This is my **QA engineer mode**. - -`/browse` is the skill that closes the loop. Before it, the agent could think and code but was still half blind. It had to guess about UI state, auth flows, redirects, console errors, empty states, and broken layouts. Now it can just go look. - -That is a real step-change. The full cycle becomes: plan, code, run the app, inspect the UI, reproduce the bug, verify the fix, ship. QA stops being "go click around manually" and becomes part of the same agentic loop. It gives the agent eyes. Now it can do the boring, high-context QA work itself: click through the app, catch breakage, verify the fix, and keep going. That feels like having a real operator in the machine. - -It is a compiled binary that talks to a persistent Chromium daemon — built on [Playwright](https://playwright.dev/) by Microsoft. First call starts the browser (~3s). Every call after that: ~100-200ms. The browser stays running between commands, so cookies, tabs, and localStorage carry over. - -After I push a branch, I tell Claude to go check staging. It does the entire QA pass for me: - -``` -You: /browse staging.myapp.com — log in, test the signup flow, and check - every page I changed in this branch - -Claude: [18 tool calls, ~60 seconds] - - > browse goto https://staging.myapp.com/signup - > browse snapshot -i - > browse fill @e2 "test@example.com" - > browse fill @e3 "password123" - > browse click @e5 (Submit) - > browse screenshot /tmp/signup.png - > Read /tmp/signup.png - - Signup works. Redirected to onboarding. Now checking changed pages. - - > browse goto https://staging.myapp.com/dashboard - > browse screenshot /tmp/dashboard.png - > Read /tmp/dashboard.png - > browse console - - Dashboard loads. No console errors. Charts render with sample data. - - > browse goto https://staging.myapp.com/settings/billing - > browse screenshot /tmp/billing.png - > Read /tmp/billing.png - - Billing page loads. Plan selector shows Free/Pro/Enterprise. - Stripe elements render correctly. - - > browse goto https://staging.myapp.com/api/health - > browse text - - API health endpoint returns {"status":"ok","version":"2.4.1"}. - - All 4 pages load correctly. No console errors. No broken layouts. - Signup → onboarding → dashboard flow works end to end. 
-``` - -18 tool calls, about a minute. It signed up a test user, navigated every page I changed, took screenshots, read them, checked the console for errors, and verified the API. That is a full QA pass. I did not open a browser. - -It works for reading documentation, verifying deployments, filling out forms, taking screenshots, comparing pages across environments, and anything else where Claude needs eyes on a live URL. - -**Security note:** `/browse` runs a persistent Chromium session. Cookies, localStorage, and session state carry over between commands. Do not use it against sensitive production environments unless you intend to — it is a real browser with real state. The session auto-shuts down after 30 minutes of idle time. - -For the full command reference, technical internals, and architecture details, see [BROWSER.md](BROWSER.md). - ---- - -## `/qa` - -This is my **QA lead mode**. - -`/browse` gives the agent eyes. `/qa` gives it a testing methodology. - -The most common use case: you're on a feature branch, you just finished coding, and you want to verify everything works. Just say `/qa` — it reads your git diff, identifies which pages and routes your changes affect, spins up the browser, and tests each one. No URL required. No manual test plan. It figures out what to test from the code you changed. - -``` -You: /qa - -Claude: Analyzing branch diff against main... - 12 files changed: 3 controllers, 2 views, 4 services, 3 tests - - Affected routes: /listings/new, /listings/:id, /api/listings - Detected app running on localhost:3000. - - [Tests each affected page — navigates, fills forms, clicks buttons, - screenshots, checks console errors] - - QA Report: 3 routes tested, all working. - - /listings/new: upload + enrichment flow works end to end - - /listings/:id: detail page renders correctly - - /api/listings: returns 200 with expected shape - No console errors. No regressions on adjacent pages. -``` - -Four modes: - -- **Diff-aware** (automatic on feature branches) — reads `git diff main`, identifies affected pages, tests them specifically. The fastest path from "I just wrote code" to "it works." -- **Full** — systematic exploration of the entire app. 5-15 minutes depending on app size. Documents 5-10 well-evidenced issues. -- **Quick** (`--quick`) — 30-second smoke test. Homepage + top 5 nav targets. Loads? Console errors? Broken links? -- **Regression** (`--regression baseline.json`) — run full mode, then diff against a previous baseline. Which issues are fixed? Which are new? What's the score delta? - -``` -You: /qa https://staging.myapp.com - -Claude: [Explores 12 pages, fills 3 forms, tests 2 flows] - - QA Report: staging.myapp.com — Health Score: 72/100 - - Top 3 Issues: - 1. CRITICAL: Checkout form submits with empty required fields - 2. HIGH: Mobile nav menu doesn't close after selecting an item - 3. MEDIUM: Dashboard chart overlaps sidebar below 1024px - - [Full report with screenshots saved to .gstack/qa-reports/] -``` - -Reports and screenshots accumulate in `.gstack/qa-reports/` so you can track quality over time and compare runs. - -**Testing authenticated pages:** Use `/setup-browser-cookies` first to import your real browser sessions, then `/qa` can test pages behind login. - ---- - -## `/setup-browser-cookies` - -This is my **session manager mode**. - -Before `/qa` or `/browse` can test authenticated pages, they need cookies. 
Instead of manually logging in through the headless browser every time, `/setup-browser-cookies` imports your real sessions directly from your daily browser. - -It auto-detects installed Chromium browsers (Comet, Chrome, Arc, Brave, Edge), decrypts cookies via the macOS Keychain, and loads them into the Playwright session. An interactive picker UI lets you choose exactly which domains to import — no cookie values are ever displayed. - -``` -You: /setup-browser-cookies - -Claude: Cookie picker opened — select the domains you want to import - in your browser, then tell me when you're done. - - [You pick github.com, myapp.com in the browser UI] - -You: done - -Claude: Imported 2 domains (47 cookies). Session is ready. -``` - -Or skip the UI entirely: - -``` -You: /setup-browser-cookies github.com - -Claude: Imported 12 cookies for github.com from Comet. -``` - -First import per browser triggers a macOS Keychain prompt — click "Allow" or "Always Allow." - ---- - -## `/retro` - -This is my **engineering manager mode**. - -At the end of the week I want to know what actually happened. Not vibes — data. `/retro` analyzes commit history, work patterns, and shipping velocity and writes a candid retrospective. - -It is team-aware. It identifies who is running the command, gives you the deepest treatment on your own work, then breaks down every contributor with specific praise and growth opportunities — the kind of feedback you would actually give in a 1:1. It computes metrics like commits, LOC, test ratio, PR sizes, and fix ratio. It detects coding sessions from commit timestamps, finds hotspot files, tracks shipping streaks, and identifies the biggest ship of the week. - -``` -You: /retro - -Claude: Week of Mar 1: 47 commits (3 contributors), 3.2k LOC, 38% tests, 12 PRs, peak: 10pm | Streak: 47d - - ## Your Week - 32 commits, +2.4k LOC, 41% tests. Peak hours: 9-11pm. - Biggest ship: cookie import system (browser decryption + picker UI). - What you did well: shipped a complete feature with encryption, UI, and - 18 unit tests in one focused push... - - ## Team Breakdown - - ### Alice - 12 commits focused on app/services/. Every PR under 200 LOC — disciplined. - Opportunity: test ratio at 12% — worth investing before payment gets more complex. - - ### Bob - 3 commits — fixed the N+1 query on dashboard. Small but high-impact. - Opportunity: only 1 active day this week — check if blocked on anything. - - [Top 3 team wins, 3 things to improve, 3 habits for next week] -``` - -It saves a JSON snapshot to `.context/retros/` so the next run can show trends. Run `/retro compare` to see this week vs last week side by side. - ---- - -## `/document-release` - -This is my **technical writer mode**. - -After `/ship` creates the PR but before it merges, `/document-release` reads every documentation file in the project and cross-references it against the diff. It updates file paths, command lists, project structure trees, and anything else that drifted. Risky or subjective changes get surfaced as questions — everything else is handled automatically. - -``` -You: /document-release - -Claude: Analyzing 21 files changed across 3 commits. Found 8 documentation files. - - README.md: updated skill count from 9 to 10, added new skill to table - CLAUDE.md: added new directory to project structure - CONTRIBUTING.md: current — no changes needed - TODOS.md: marked 2 items complete, added 1 new item - - All docs updated and committed. PR body updated with doc diff. 
-``` - -It also polishes CHANGELOG voice (without ever overwriting entries), cleans up completed TODOS, checks cross-doc consistency, and asks about VERSION bumps only when appropriate. - ---- +## Docs + +| Doc | What it covers | +|-----|---------------| +| [Skill Deep Dives](docs/skills.md) | Philosophy, examples, and workflow for every skill (includes Greptile integration) | +| [Architecture](ARCHITECTURE.md) | Design decisions and system internals | +| [Browser Reference](BROWSER.md) | Full command reference for `/browse` | +| [Contributing](CONTRIBUTING.md) | Dev setup, testing, contributor mode, and dev mode | +| [Changelog](CHANGELOG.md) | What's new in every version | ## Troubleshooting -**Skill not showing up in Claude Code?** -Run `cd ~/.claude/skills/gstack && ./setup` (or `cd .claude/skills/gstack && ./setup` for project installs). This rebuilds symlinks so Claude can discover the skills. +**Skill not showing up?** `cd ~/.claude/skills/gstack && ./setup` -**`/browse` fails or binary not found?** -Run `cd ~/.claude/skills/gstack && bun install && bun run build`. This compiles the browser binary. Requires Bun v1.0+. +**`/browse` fails?** `cd ~/.claude/skills/gstack && bun install && bun run build` -**Project copy is stale?** -Run `/gstack-upgrade` — it updates both the global install and any vendored project copy automatically. - -**`bun` not installed?** -Install it: `curl -fsSL https://bun.sh/install | bash` - -## Upgrading - -Run `/gstack-upgrade` in Claude Code. It detects your install type (global or vendored), upgrades, syncs any project copies, and shows what's new. - -Or set `auto_upgrade: true` in `~/.gstack/config.yaml` to upgrade automatically whenever a new version is available. - -## Uninstalling - -Paste this into Claude Code: - -> Uninstall gstack: remove the skill symlinks by running `for s in browse plan-ceo-review plan-eng-review plan-design-review design-consultation review ship retro qa qa-only qa-design-review setup-browser-cookies document-release; do rm -f ~/.claude/skills/$s; done` then run `rm -rf ~/.claude/skills/gstack` and remove the gstack section from CLAUDE.md. If this project also has gstack at .claude/skills/gstack, remove it by running `for s in browse plan-ceo-review plan-eng-review plan-design-review review ship retro qa qa-only qa-design-review setup-browser-cookies document-release; do rm -f .claude/skills/$s; done && rm -rf .claude/skills/gstack` and remove the gstack section from the project CLAUDE.md too. - -## Development - -See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, testing, and dev mode. See [ARCHITECTURE.md](ARCHITECTURE.md) for design decisions and system internals. See [BROWSER.md](BROWSER.md) for the browse command reference. - -### Testing - -```bash -bun test # free static tests (<5s) -EVALS=1 bun run test:evals # full E2E + LLM evals (~$4, ~20min) -bun run eval:watch # live dashboard during E2E runs -``` - -E2E tests stream real-time progress, write machine-readable diagnostics, and persist partial results that survive kills. See CONTRIBUTING.md for the full eval infrastructure. +**Stale install?** Run `/gstack-upgrade` — or set `auto_upgrade: true` in `~/.gstack/config.yaml` ## License -MIT +MIT. Free forever. Go build something. 
diff --git a/SKILL.md b/SKILL.md index 255c3648..e12be621 100644 --- a/SKILL.md +++ b/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. 
An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/TODOS.md b/TODOS.md index 7efbef8b..17bfc411 100644 --- a/TODOS.md +++ b/TODOS.md @@ -263,6 +263,30 @@ **Effort:** S **Priority:** P3 +### CI/CD generation for non-GitHub providers + +**What:** Extend CI/CD bootstrap to generate GitLab CI (`.gitlab-ci.yml`), CircleCI (`.circleci/config.yml`), and Bitrise pipelines. + +**Why:** Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone. + +**Context:** v1 ships with GitHub Actions only. Detection logic already checks for `.gitlab-ci.yml`, `.circleci/`, `bitrise.yml` and skips with an informational note. Each provider needs ~20 lines of template text in `generateTestBootstrap()`. + +**Effort:** M +**Priority:** P3 +**Depends on:** Test bootstrap (shipped) + +### Auto-upgrade weak tests (★) to strong tests (★★★) + +**What:** When Step 3.4 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths. + +**Why:** Many codebases have tests that technically exist but don't catch real bugs — `expect(component).toBeDefined()` isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests." + +**Context:** Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original. + +**Effort:** M +**Priority:** P3 +**Depends on:** Test quality scoring (shipped) + ## Retro ### Deployment health tracking (retro + browse) @@ -362,6 +386,16 @@ **Priority:** P2 **Depends on:** None +### Cross-platform URL open helper + +**What:** `gstack-open-url` helper script — detect platform, use `open` (macOS) or `xdg-open` (Linux). + +**Why:** The first-time Completeness Principle intro uses macOS `open` to launch the essay. If gstack ever supports Linux, this silently fails. 
+ +**Effort:** S (human: ~30 min / CC: ~2 min) +**Priority:** P4 +**Depends on:** Nothing + ### CDP-based DOM mutation detection for ref staleness **What:** Use Chrome DevTools Protocol `DOM.documentUpdated` / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit `snapshot` call. @@ -446,6 +480,20 @@ Shipped as v0.5.0 on main. Includes `/plan-design-review` (report-only design au **Priority:** P2 **Depends on:** Ship Confidence Dashboard (shipped) +## Completeness + +### Completeness metrics dashboard + +**What:** Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time. + +**Why:** Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts). + +**Context:** Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence. + +**Effort:** M (human) / S (CC) +**Priority:** P3 +**Depends on:** Boil the Lake shipped (v0.6.1) + ## Completed ### Phase 1: Foundations (v0.2.0) diff --git a/VERSION b/VERSION index a918a2aa..ee6cdce3 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.6.0 +0.6.1 diff --git a/browse/SKILL.md b/browse/SKILL.md index 90e6e070..49d2ff0c 100644 --- a/browse/SKILL.md +++ b/browse/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). 
Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. 
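+
+For reference, a plausible way to flip it on, assuming `gstack-config set` accepts this key the same way it does `skip_eng_review`:
+
+```bash
+~/.claude/skills/gstack/bin/gstack-config set gstack_contributor true
+```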
diff --git a/debug/SKILL.md b/debug/SKILL.md index 56c75666..4448453a 100644 --- a/debug/SKILL.md +++ b/debug/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. 
An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/design-consultation/SKILL.md b/design-consultation/SKILL.md index 33867c8a..b494104d 100644 --- a/design-consultation/SKILL.md +++ b/design-consultation/SKILL.md @@ -2,7 +2,7 @@ name: design-consultation version: 1.0.0 description: | - Design consultation: understands your product, researches competitors, proposes a + Design consultation: understands your product, researches the landscape, proposes a complete design system (aesthetic, typography, color, layout, spacing, motion), and generates font+color preview pages. Creates DESIGN.md as your project's design source of truth. For existing sites, use /plan-design-review to infer the system instead. @@ -31,22 +31,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. 
+ ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." 
(Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -183,7 +222,7 @@ Ask the user a single question that covers everything you need to know. Pre-fill 3. "Want me to research what top products in your space are doing for design, or should I work from my design knowledge?" 4. **Explicitly say:** "At any point you can just drop into chat and we'll talk through anything — this isn't a rigid form, it's a conversation." -If the README or office-hours output gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research competitors, or should I work from what I know?"* +If the README or office-hours output gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research what's out there in this space, or should I work from what I know?"* --- @@ -191,7 +230,7 @@ If the README or office-hours output gives you enough context, pre-fill and conf If the user wants competitive research: -**Step 1: Identify competitors via WebSearch** +**Step 1: Identify what's out there via WebSearch** Use WebSearch to find 5-10 products in their space. Search for: - "[product category] website design" @@ -200,17 +239,17 @@ Use WebSearch to find 5-10 products in their space. Search for: **Step 2: Visual research via browse (if available)** -If the browse binary is available (`$B` is set), visit the top 3-5 competitor sites and capture visual evidence: +If the browse binary is available (`$B` is set), visit the top 3-5 sites in the space and capture visual evidence: ```bash -$B goto "https://competitor-site.com" -$B screenshot "/tmp/design-research-competitor-name.png" +$B goto "https://example-site.com" +$B screenshot "/tmp/design-research-site-name.png" $B snapshot ``` -For each competitor, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. +For each site, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. -If a competitor site blocks the headless browser or requires login, skip it and note why. +If a site blocks the headless browser or requires login, skip it and note why. If browse is not available, rely on WebSearch results and your built-in design knowledge — this is fine. @@ -219,7 +258,7 @@ If browse is not available, rely on WebSearch results and your built-in design k The goal of research is NOT to copy. It is to get in the ballpark — to understand the visual language users in this category already expect. This gives you the baseline. The interesting design work starts after you have the baseline: deciding where to follow conventions (so the product feels literate) and where to break from them (so the product is memorable). Summarize conversationally: -> "I looked at [competitors]. Here's the landscape: they converge on [patterns]. Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." +> "I looked at what's out there. Here's the landscape: they converge on [patterns]. 
Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." **Graceful degradation:** - Browse available → screenshots + snapshots + WebSearch (richest research) diff --git a/design-consultation/SKILL.md.tmpl b/design-consultation/SKILL.md.tmpl index 1ef4734c..2f903992 100644 --- a/design-consultation/SKILL.md.tmpl +++ b/design-consultation/SKILL.md.tmpl @@ -2,7 +2,7 @@ name: design-consultation version: 1.0.0 description: | - Design consultation: understands your product, researches competitors, proposes a + Design consultation: understands your product, researches the landscape, proposes a complete design system (aesthetic, typography, color, layout, spacing, motion), and generates font+color preview pages. Creates DESIGN.md as your project's design source of truth. For existing sites, use /plan-design-review to infer the system instead. @@ -76,7 +76,7 @@ Ask the user a single question that covers everything you need to know. Pre-fill 3. "Want me to research what top products in your space are doing for design, or should I work from my design knowledge?" 4. **Explicitly say:** "At any point you can just drop into chat and we'll talk through anything — this isn't a rigid form, it's a conversation." -If the README or office-hours output gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research competitors, or should I work from what I know?"* +If the README or office-hours output gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research what's out there in this space, or should I work from what I know?"* --- @@ -84,7 +84,7 @@ If the README or office-hours output gives you enough context, pre-fill and conf If the user wants competitive research: -**Step 1: Identify competitors via WebSearch** +**Step 1: Identify what's out there via WebSearch** Use WebSearch to find 5-10 products in their space. Search for: - "[product category] website design" @@ -93,17 +93,17 @@ Use WebSearch to find 5-10 products in their space. Search for: **Step 2: Visual research via browse (if available)** -If the browse binary is available (`$B` is set), visit the top 3-5 competitor sites and capture visual evidence: +If the browse binary is available (`$B` is set), visit the top 3-5 sites in the space and capture visual evidence: ```bash -$B goto "https://competitor-site.com" -$B screenshot "/tmp/design-research-competitor-name.png" +$B goto "https://example-site.com" +$B screenshot "/tmp/design-research-site-name.png" $B snapshot ``` -For each competitor, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. +For each site, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. -If a competitor site blocks the headless browser or requires login, skip it and note why. +If a site blocks the headless browser or requires login, skip it and note why. If browse is not available, rely on WebSearch results and your built-in design knowledge — this is fine. 
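+
+One way to make that degradation explicit, as a sketch that assumes `$B` holds the browse binary path and reuses the placeholder URL from above:
+
+```bash
+if [ -n "$B" ]; then
+  $B goto "https://example-site.com"
+  $B screenshot "/tmp/design-research-example-site.png"
+  $B snapshot
+else
+  echo "browse unavailable; relying on WebSearch + built-in design knowledge"
+fi
+```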
@@ -112,7 +112,7 @@ If browse is not available, rely on WebSearch results and your built-in design k The goal of research is NOT to copy. It is to get in the ballpark — to understand the visual language users in this category already expect. This gives you the baseline. The interesting design work starts after you have the baseline: deciding where to follow conventions (so the product feels literate) and where to break from them (so the product is memorable). Summarize conversationally: -> "I looked at [competitors]. Here's the landscape: they converge on [patterns]. Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." +> "I looked at what's out there. Here's the landscape: they converge on [patterns]. Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." **Graceful degradation:** - Browse available → screenshots + snapshots + WebSearch (richest research) diff --git a/docs/images/github-2013.png b/docs/images/github-2013.png new file mode 100644 index 00000000..186324ba Binary files /dev/null and b/docs/images/github-2013.png differ diff --git a/docs/images/github-2026.png b/docs/images/github-2026.png new file mode 100644 index 00000000..b908c8a4 Binary files /dev/null and b/docs/images/github-2026.png differ diff --git a/docs/skills.md b/docs/skills.md new file mode 100644 index 00000000..16045860 --- /dev/null +++ b/docs/skills.md @@ -0,0 +1,649 @@ +# Skill Deep Dives + +Detailed guides for every gstack skill — philosophy, workflow, and examples. + +| Skill | Your specialist | What they do | +|-------|----------------|--------------| +| [`/plan-ceo-review`](#plan-ceo-review) | **CEO / Founder** | Rethink the problem. Find the 10-star product hiding inside the request. Four modes: Expansion, Selective Expansion, Hold Scope, Reduction. | +| [`/plan-eng-review`](#plan-eng-review) | **Eng Manager** | Lock in architecture, data flow, diagrams, edge cases, and tests. Forces hidden assumptions into the open. | +| [`/plan-design-review`](#plan-design-review) | **Senior Designer** | 80-item design audit with letter grades. AI Slop detection. Infers your design system. Report only — never touches code. | +| [`/design-consultation`](#design-consultation) | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. | +| [`/review`](#review) | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. | +| [`/ship`](#ship) | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. | +| [`/browse`](#browse) | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. | +| [`/qa`](#qa) | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. | +| [`/qa-only`](#qa) | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. | +| [`/qa-design-review`](#qa-design-review) | **Designer Who Codes** | Same audit as /plan-design-review, then fixes what it finds. 
Atomic commits, before/after screenshots. | +| [`/setup-browser-cookies`](#setup-browser-cookies) | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. | +| [`/retro`](#retro) | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. | +| [`/document-release`](#document-release) | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. | + +--- + +## `/plan-ceo-review` + +This is my **founder mode**. + +This is where I want the model to think with taste, ambition, user empathy, and a long time horizon. I do not want it taking the request literally. I want it asking a more important question first: + +**What is this product actually for?** + +I think of this as **Brian Chesky mode**. + +The point is not to implement the obvious ticket. The point is to rethink the problem from the user's point of view and find the version that feels inevitable, delightful, and maybe even a little magical. + +### Example + +Say I am building a Craigslist-style listing app and I say: + +> "Let sellers upload a photo for their item." + +A weak assistant will add a file picker and save an image. + +That is not the real product. + +In `/plan-ceo-review`, I want the model to ask whether "photo upload" is even the feature. Maybe the real feature is helping someone create a listing that actually sells. + +If that is the real job, the whole plan changes. + +Now the model should ask: + +* Can we identify the product from the photo? +* Can we infer the SKU or model number? +* Can we search the web and draft the title and description automatically? +* Can we pull specs, category, and pricing comps? +* Can we suggest which photo will convert best as the hero image? +* Can we detect when the uploaded photo is ugly, dark, cluttered, or low-trust? +* Can we make the experience feel premium instead of like a dead form from 2007? + +That is what `/plan-ceo-review` does for me. + +It does not just ask, "how do I add this feature?" +It asks, **"what is the 10-star product hiding inside this request?"** + +### Four modes + +- **SCOPE EXPANSION** — dream big. The agent proposes the ambitious version. Every expansion is presented as an individual decision you opt into. Recommends enthusiastically. +- **SELECTIVE EXPANSION** — hold your current scope as the baseline, but see what else is possible. The agent surfaces opportunities one by one with neutral recommendations — you cherry-pick the ones worth doing. +- **HOLD SCOPE** — maximum rigor on the existing plan. No expansions surfaced. +- **SCOPE REDUCTION** — find the minimum viable version. Cut everything else. + +Visions and decisions are persisted to `~/.gstack/projects/` so they survive beyond the conversation. Exceptional visions can be promoted to `docs/designs/` in your repo for the team. + +--- + +## `/plan-eng-review` + +This is my **eng manager mode**. + +Once the product direction is right, I want a different kind of intelligence entirely. I do not want more sprawling ideation. I do not want more "wouldn't it be cool if." I want the model to become my best technical lead. + +This mode should nail: + +* architecture +* system boundaries +* data flow +* state transitions +* failure modes +* edge cases +* trust boundaries +* test coverage + +And one surprisingly big unlock for me: **diagrams**. 
+
+LLMs get way more complete when you force them to draw the system. Sequence diagrams, state diagrams, component diagrams, data-flow diagrams, even test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder.
+
+So `/plan-eng-review` is where I want the model to build the technical spine that can carry the product vision.
+
+### Example
+
+Take the same listing app example.
+
+Let's say `/plan-ceo-review` already did its job. We decided the real feature is not just photo upload. It is a smart listing flow that:
+
+* uploads photos
+* identifies the product
+* enriches the listing from the web
+* drafts a strong title and description
+* suggests the best hero image
+
+Now `/plan-eng-review` takes over.
+
+Now I want the model to answer questions like:
+
+* What is the architecture for upload, classification, enrichment, and draft generation?
+* Which steps happen synchronously, and which go to background jobs?
+* Where are the boundaries between app server, object storage, vision model, search/enrichment APIs, and the listing database?
+* What happens if upload succeeds but enrichment fails?
+* What happens if product identification is low-confidence?
+* How do retries work?
+* How do we prevent duplicate jobs?
+* What gets persisted when, and what can be safely recomputed?
+
+And this is where I want diagrams — architecture diagrams, state models, data-flow diagrams, test matrices. The hand-wavy parts of the plan have nowhere left to hide.
+
+That is `/plan-eng-review`.
+
+Not "make the idea smaller."
+**Make the idea buildable.**
+
+### Review Readiness Dashboard
+
+Every review (CEO, Eng, Design) logs its result. At the end of each review, you see a dashboard:
+
+```
++=====================================================================+
+| REVIEW READINESS DASHBOARD                                          |
++=====================================================================+
+| Review          | Runs | Last Run            | Status    | Required |
+|-----------------|------|---------------------|-----------|----------|
+| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR     | YES      |
+| CEO Review      | 1    | 2026-03-16 14:30    | CLEAR     | no       |
+| Design Review   | 0    | —                   | —         | no       |
++---------------------------------------------------------------------+
+| VERDICT: CLEARED — Eng Review passed                                 |
++=====================================================================+
+```
+
+Eng Review is the only required gate (disable with `gstack-config set skip_eng_review true`). CEO and Design are informational — recommended for product and UI changes respectively.
+
+### Plan-to-QA flow
+
+When `/plan-eng-review` finishes the test review section, it writes a test plan artifact to `~/.gstack/projects/`. When you later run `/qa`, it picks up that test plan automatically — your engineering review feeds directly into QA testing with no manual copy-paste.
+
+---
+
+## `/plan-design-review`
+
+This is my **senior designer mode**.
+
+Most developers cannot tell whether their site looks AI-generated. I could not, until I started paying attention. There is a growing class of sites that are functional but soulless — they work fine but scream "an AI built this and nobody with taste looked at it." Purple gradients, 3-column icon grids, uniform bubbly border-radius on everything, centered text on every section, decorative blobs floating in the background. The ChatGPT aesthetic.
+
+`/plan-design-review` gives the agent a designer's eye.
+ +It opens your site and reacts to it the way a Stripe or Linear designer would — immediately, viscerally, with opinions. The first output is a structured gut reaction: what the site communicates at a glance, what the eye is drawn to, and a one-word verdict. That is the most valuable part. Everything after is supporting evidence. + +Then it extracts your actual design system by reading the CSS: how many fonts are you really using? What is your heading scale? How many colors? Is your spacing on a grid or arbitrary? It structures this as an **Inferred Design System** and offers to save it as a `DESIGN.md` — so now you have a design system you never had before, derived from what you actually shipped. + +The audit covers 80 items across 10 categories. Typography, spacing, hierarchy, color, responsive design, interaction states, motion, content quality, performance feel, and — the headline feature — **AI Slop Detection**. Ten specific anti-patterns that mark a site as machine-generated. + +Every finding gets a letter grade (A through F) per category. The report opens with two headline scores: **Design Score** and **AI Slop Score**. "Your site got a B on design but a D on AI Slop" tells you exactly where to focus. + +### Example + +``` +You: /plan-design-review https://myapp.com + +Claude: Design Score: C | AI Slop Score: D + + "The site communicates generic SaaS energy." + "I notice the hero section uses a blue-to-purple gradient + with a 3-column feature grid below it — the single most + recognizable AI-generated layout pattern." + "The first 3 things my eye goes to are: the gradient, the + symmetrical icon grid, and a 'Get Started' button that + looks like every other 'Get Started' button." + "If I had to describe this in one word: template." + + Inferred Design System: + - Fonts: Inter (body), Inter (headings) — 1 font family + - Colors: 14 unique non-gray colors (flag: >12) + - Heading scale: 36/24/18/16 — flat, no drama between sizes + - Spacing: mostly 16px and 24px — no rhythm + + Top 5 Design Improvements: + 1. Break the 3-column icon grid (High, AI Slop) + 2. Add heading scale contrast — at least 2:1 ratio (High, Typography) + 3. Replace uniform border-radius with intentional hierarchy (Medium, Spacing) + 4. Kill the gradient hero — use a real image or bold typography (High, AI Slop) + 5. Add a second font for headings — Inter-only reads as generic (Medium, Typography) + + [Full report saved to .gstack/design-reports/] + Want me to save this inferred design system as your DESIGN.md? +``` + +This is report only — it never touches your code. Use `/qa-design-review` when you want it to fix what it finds. + +--- + +## `/design-consultation` + +This is my **design partner mode**. + +`/plan-design-review` audits a site that already exists. `/design-consultation` is for when you have nothing yet — no design system, no font choices, no color palette. You are starting from zero and you want a senior designer to sit down with you and build the whole visual identity together. + +It is a conversation, not a form. The agent asks about your product, your users, and your audience. It thinks about what your product needs to communicate — trust, speed, craft, warmth, whatever fits — and works backward from that to concrete choices. Then it proposes a complete, coherent design system: aesthetic direction, typography (3+ fonts with specific roles), color palette with hex values, spacing scale, layout approach, and motion strategy. Every recommendation comes with a rationale. 
Every choice reinforces every other choice. + +But coherence is table stakes. Every dev tool dashboard looks the same — clean sans-serif, muted grays, a blue accent. They are all coherent. They are all forgettable. The difference between a product that looks "nice" and one that people actually recognize is the **deliberate creative risks**: the unexpected serif for headings, the bold accent nobody else in your category uses, the tighter spacing that makes your data feel authoritative instead of airy. + +That is what `/design-consultation` is really about. It does not just propose a safe system. It proposes safe choices AND risks — and tells you which is which. "Here are the choices that keep you literate in your category. And here is where I think you should break from convention, and why." You pick which risks to take. The agent checks that the whole system still coheres either way. + +If you want, the agent will research what's already out there in your space — take screenshots of real sites, analyze their fonts and colors and spacing — so you can see the landscape before you make choices. This is not about copying. It is about getting in the ballpark so you know what the conventions are, and then deciding which ones are worth breaking. + +After you agree on the system, it generates an interactive HTML preview page — not just swatches and font samples, but realistic product pages. If you are building a dashboard, you see a dashboard with a sidebar, data tables, and stat cards. If you are building a marketing site, you see a hero section with real copy and a CTA. Everything rendered in your design system, with your product name, in light and dark mode. You see what your product could feel like before a single line of production code is written. + +Then it writes `DESIGN.md` to your repo root — your project's design source of truth — and updates `CLAUDE.md` so every future Claude Code session respects the system. From that point on, `/qa-design-review` can audit against it, and any agent working on your frontend knows the rules. + +### Example + +``` +You: /design-consultation + +Claude: From the README, this looks like a developer analytics dashboard + for engineering teams. Sound right? Tell me about the feeling + you want — and would you like me to look at what's out there + first so we can see the landscape? + +You: Yes, it's for eng managers. Show me what's out there. + +Claude: [browses Linear, Vercel, Datadog, Posthog — takes screenshots] + + Here's the landscape. They all converge on clean sans-serifs + and muted blue-gray palettes. Linear stands out with dark-first + and tight spacing. The rest feel interchangeable. + + Here's my proposal — and where I'd play it safe vs. take risks: + + AESTHETIC: Industrial/Utilitarian — data-first, every pixel earns its place + TYPOGRAPHY: Geist (body + UI), Instrument Serif (display), JetBrains Mono (data) + COLOR: Restrained — slate neutrals, single teal accent (#0D9488) + SPACING: 4px base, compact density + LAYOUT: Grid-disciplined, 12-column + MOTION: Minimal-functional — only transitions that aid comprehension + + SAFE CHOICES (match category expectations): + Geist for body — your users already know this font from Vercel + Slate neutrals — warm grays feel right for data-heavy UIs + Grid-disciplined layout — dashboards need predictable alignment + + RISKS (where you'd stand out): + Instrument Serif for display headings — nobody in dev tools + uses a serif. 
It creates instant hierarchy contrast and + signals "we care about craft" in a space full of sans-serifs. + Single teal accent instead of blue — everyone else in this space uses blue. + Teal is close enough to feel trustworthy, different enough + to be recognizable. + 4px compact density — tighter than Linear, says "we respect + your screen real estate." + + The safe choices keep you literate. The risks are where your + product gets its own face. Which risks appeal to you? + +You: Love the serif risk. Let's see it. + +Claude: Preview opened in your browser. + [/tmp/design-consultation-preview-1710000000.html] + + The page shows a realistic dashboard: sidebar nav, header with + avatar, stat cards, data table — all in your design system. + Instrument Serif at hero scale, Geist in body text, JetBrains + Mono in the data table. Teal accent on buttons and active states. + Light and dark mode toggle in the corner. + + How does it feel? Want to adjust anything before I write DESIGN.md? + +You: Ship it. + +Claude: Wrote DESIGN.md (typography, color, spacing, layout, motion). + Updated CLAUDE.md with design system guidelines. + All future sessions will respect these choices. +``` + +--- + +## `/qa-design-review` + +This is my **designer who codes mode**. + +`/plan-design-review` tells you what is wrong. `/qa-design-review` fixes it. + +It runs the same 80-item audit, then enters a fix loop: for each design finding, it locates the source file, makes the minimal CSS/styling change, commits with `style(design): FINDING-NNN`, re-navigates to verify, and takes before/after screenshots. One commit per fix, fully bisectable. + +The self-regulation heuristic is tuned for design work — CSS-only changes get a free pass (they are inherently safe and reversible), but changes to component JSX/TSX files count against the risk budget. Hard cap at 30 fixes. If the risk score exceeds 20%, it stops and asks. + +### Example + +``` +You: /qa-design-review https://myapp.com + +Claude: [Runs full design audit — same output as /plan-design-review] + Design Score: C | AI Slop Score: D + 12 findings (4 high, 5 medium, 3 polish) + + Fixing 9 design issues... + + style(design): FINDING-001 — replace 3-column icon grid with asymmetric layout + style(design): FINDING-002 — add heading scale 48/32/24/18/16 + style(design): FINDING-003 — remove gradient hero, use bold typography + style(design): FINDING-004 — add second font for headings + style(design): FINDING-005 — vary border-radius by element role + style(design): FINDING-006 — left-align body text, reserve center for headings + style(design): FINDING-007 — add hover/focus states to all interactive elements + style(design): FINDING-008 — add prefers-reduced-motion media query + style(design): FINDING-009 — set max content width to 680px for body text + + Final audit: + Design Score: C → B+ | AI Slop Score: D → A + 9 fixes applied (8 verified, 1 best-effort). 3 deferred. + + [Report with before/after screenshots saved to .gstack/design-reports/] +``` + +Nine commits, each touching one concern. The AI Slop score went from D to A because the three most recognizable patterns (gradient hero, 3-column grid, uniform radius) are gone. + +--- + +## `/review` + +This is my **paranoid staff engineer mode**. + +Passing tests do not mean the branch is safe. + +`/review` exists because there is a whole class of bugs that can survive CI and still punch you in the face in production. This mode is not about dreaming bigger. It is not about making the plan prettier. 
It is about asking:
+
+**What can still break?**
+
+This is a structural audit, not a style nitpick pass. I want the model to look for things like:
+
+* N+1 queries
+* stale reads
+* race conditions
+* bad trust boundaries
+* missing indexes
+* escaping bugs
+* broken invariants
+* bad retry logic
+* tests that pass while missing the real failure mode
+* forgotten enum handlers — add a new status or type constant, and `/review` traces it through every switch statement and allowlist in your codebase, not just the files you changed
+
+### Fix-First
+
+Findings get acted on, not just listed. Obvious mechanical fixes (dead code, stale comments, N+1 queries) are applied automatically — you see `[AUTO-FIXED] file:line Problem → what was done` for each one. Genuinely ambiguous issues (security, race conditions, design decisions) get surfaced for your call.
+
+### Completeness gaps
+
+`/review` now flags shortcut implementations where the complete version costs less than 30 minutes of CC time. If you chose the 80% solution and the 100% solution is a lake, not an ocean, the review will call it out.
+
+### Example
+
+Suppose the smart listing flow is implemented and the tests are green.
+
+`/review` should still ask:
+
+* Did I introduce an N+1 query when rendering listing photos or draft suggestions?
+* Am I trusting client-provided file metadata instead of validating the actual file?
+* Can two tabs race and overwrite cover-photo selection or item details?
+* Do failed uploads leave orphaned files in storage forever?
+* Can the "exactly one hero image" rule break under concurrency?
+* If enrichment APIs partially fail, do I degrade gracefully or save garbage?
+* Did I accidentally create a prompt injection or trust-boundary problem by pulling web data into draft generation?
+
+That is the point of `/review`.
+
+I do not want flattery here.
+I want the model imagining the production incident before it happens.
+
+---
+
+## `/ship`
+
+This is my **release machine mode**.
+
+Once I have decided what to build, nailed the technical plan, and run a serious review, I do not want more talking. I want execution.
+
+`/ship` is for the final mile. It is for a ready branch, not for deciding what to build.
+
+This is where the model should stop behaving like a brainstorm partner and start behaving like a disciplined release engineer: sync with main, run the right tests, make sure the branch state is sane, update changelog or versioning if the repo expects it, push, and create or update the PR.
+
+### Test bootstrap
+
+If your project doesn't have a test framework, `/ship` sets one up — detects your runtime, researches the best framework, installs it, writes 3-5 real tests for your actual code, sets up CI/CD (GitHub Actions), and creates TESTING.md. 100% test coverage is the goal — tests make vibe coding safe instead of yolo coding.
+
+### Coverage audit
+
+Every `/ship` run builds a code path map from your diff, searches for corresponding tests, and produces an ASCII coverage diagram with quality stars. Gaps get tests auto-generated. Your PR body shows the coverage: `Tests: 42 → 47 (+5 new)`.
+
+### Review gate
+
+`/ship` checks the [Review Readiness Dashboard](#review-readiness-dashboard) before creating the PR. If the Eng Review is missing, it asks — but won't block you. Decisions are saved per-branch so you're never re-asked.
+
+A lot of branches die when the interesting work is done and only the boring release work is left. Humans procrastinate that part. AI should not.
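+
+The boring part is also the most scriptable part. As a rough sketch — assuming a `main` base branch, an npm test script, and GitHub's `gh` CLI, none of which `/ship` actually requires — the final mile looks something like:
+
+```bash
+# Illustrative sketch of /ship's release steps — not the actual implementation
+git fetch origin && git rebase origin/main   # sync the branch with main
+npm test                                     # run the repo's test suite (npm assumed here)
+git push --force-with-lease origin HEAD      # safe push after a rebase
+gh pr create --fill || gh pr view --web      # create the PR, or open it if one exists
+```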
+ +--- + +## `/browse` + +This is my **QA engineer mode**. + +`/browse` is the skill that closes the loop. Before it, the agent could think and code but was still half blind. It had to guess about UI state, auth flows, redirects, console errors, empty states, and broken layouts. Now it can just go look. + +It is a compiled binary that talks to a persistent Chromium daemon — built on [Playwright](https://playwright.dev/) by Microsoft. First call starts the browser (~3s). Every call after that: ~100-200ms. The browser stays running between commands, so cookies, tabs, and localStorage carry over. + +### Example + +``` +You: /browse staging.myapp.com — log in, test the signup flow, and check + every page I changed in this branch + +Claude: [18 tool calls, ~60 seconds] + + > browse goto https://staging.myapp.com/signup + > browse snapshot -i + > browse fill @e2 "test@example.com" + > browse fill @e3 "password123" + > browse click @e5 (Submit) + > browse screenshot /tmp/signup.png + > Read /tmp/signup.png + + Signup works. Redirected to onboarding. Now checking changed pages. + + > browse goto https://staging.myapp.com/dashboard + > browse screenshot /tmp/dashboard.png + > Read /tmp/dashboard.png + > browse console + + Dashboard loads. No console errors. Charts render with sample data. + + All 4 pages load correctly. No console errors. No broken layouts. + Signup → onboarding → dashboard flow works end to end. +``` + +18 tool calls, about a minute. Full QA pass. No browser opened. + +**Security note:** `/browse` runs a persistent Chromium session. Cookies, localStorage, and session state carry over between commands. Do not use it against sensitive production environments unless you intend to — it is a real browser with real state. The session auto-shuts down after 30 minutes of idle time. + +For the full command reference, see [BROWSER.md](../BROWSER.md). + +--- + +## `/qa` + +This is my **QA lead mode**. + +`/browse` gives the agent eyes. `/qa` gives it a testing methodology. + +The most common use case: you're on a feature branch, you just finished coding, and you want to verify everything works. Just say `/qa` — it reads your git diff, identifies which pages and routes your changes affect, spins up the browser, and tests each one. No URL required. No manual test plan. + +Four modes: + +- **Diff-aware** (automatic on feature branches) — reads `git diff main`, identifies affected pages, tests them specifically +- **Full** — systematic exploration of the entire app. 5-15 minutes. Documents 5-10 well-evidenced issues. +- **Quick** (`--quick`) — 30-second smoke test. Homepage + top 5 nav targets. +- **Regression** (`--regression baseline.json`) — run full mode, then diff against a previous baseline. + +### Automatic regression tests + +When `/qa` fixes a bug and verifies it, it automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report. + +### Example + +``` +You: /qa https://staging.myapp.com + +Claude: [Explores 12 pages, fills 3 forms, tests 2 flows] + + QA Report: staging.myapp.com — Health Score: 72/100 + + Top 3 Issues: + 1. CRITICAL: Checkout form submits with empty required fields + 2. HIGH: Mobile nav menu doesn't close after selecting an item + 3. 
MEDIUM: Dashboard chart overlaps sidebar below 1024px + + [Full report with screenshots saved to .gstack/qa-reports/] +``` + +**Testing authenticated pages:** Use `/setup-browser-cookies` first to import your real browser sessions, then `/qa` can test pages behind login. + +--- + +## `/setup-browser-cookies` + +This is my **session manager mode**. + +Before `/qa` or `/browse` can test authenticated pages, they need cookies. Instead of manually logging in through the headless browser every time, `/setup-browser-cookies` imports your real sessions directly from your daily browser. + +It auto-detects installed Chromium browsers (Comet, Chrome, Arc, Brave, Edge), decrypts cookies via the macOS Keychain, and loads them into the Playwright session. An interactive picker UI lets you choose exactly which domains to import — no cookie values are ever displayed. + +``` +You: /setup-browser-cookies + +Claude: Cookie picker opened — select the domains you want to import + in your browser, then tell me when you're done. + + [You pick github.com, myapp.com in the browser UI] + +You: done + +Claude: Imported 2 domains (47 cookies). Session is ready. +``` + +Or skip the UI entirely: + +``` +You: /setup-browser-cookies github.com + +Claude: Imported 12 cookies for github.com from Comet. +``` + +--- + +## `/retro` + +This is my **engineering manager mode**. + +At the end of the week I want to know what actually happened. Not vibes — data. `/retro` analyzes commit history, work patterns, and shipping velocity and writes a candid retrospective. + +It is team-aware. It identifies who is running the command, gives you the deepest treatment on your own work, then breaks down every contributor with specific praise and growth opportunities. It computes metrics like commits, LOC, test ratio, PR sizes, and fix ratio. It detects coding sessions from commit timestamps, finds hotspot files, tracks shipping streaks, and identifies the biggest ship of the week. + +It also tracks test health: total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area. + +### Example + +``` +You: /retro + +Claude: Week of Mar 1: 47 commits (3 contributors), 3.2k LOC, 38% tests, 12 PRs, peak: 10pm | Streak: 47d + + ## Your Week + 32 commits, +2.4k LOC, 41% tests. Peak hours: 9-11pm. + Biggest ship: cookie import system (browser decryption + picker UI). + What you did well: shipped a complete feature with encryption, UI, and + 18 unit tests in one focused push... + + ## Team Breakdown + + ### Alice + 12 commits focused on app/services/. Every PR under 200 LOC — disciplined. + Opportunity: test ratio at 12% — worth investing before payment gets more complex. + + ### Bob + 3 commits — fixed the N+1 query on dashboard. Small but high-impact. + Opportunity: only 1 active day this week — check if blocked on anything. + + [Top 3 team wins, 3 things to improve, 3 habits for next week] +``` + +It saves a JSON snapshot to `.context/retros/` so the next run can show trends. + +--- + +## `/document-release` + +This is my **technical writer mode**. + +After `/ship` creates the PR but before it merges, `/document-release` reads every documentation file in the project and cross-references it against the diff. It updates file paths, command lists, project structure trees, and anything else that drifted. Risky or subjective changes get surfaced as questions — everything else is handled automatically. 
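+
+Mechanically, the cross-reference pass boils down to something like this sketch — assuming markdown docs tracked in git; the real skill does far more than grep:
+
+```bash
+# Illustrative sketch of the doc cross-reference scan — not the actual implementation
+docs=$(git ls-files '*.md')                    # every tracked documentation file
+for f in $(git diff --name-only main...HEAD); do
+  # a doc that still mentions a file this branch deleted or renamed has drifted
+  [ -e "$f" ] || grep -l -- "$f" $docs 2>/dev/null
+done
+```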
+ +``` +You: /document-release + +Claude: Analyzing 21 files changed across 3 commits. Found 8 documentation files. + + README.md: updated skill count from 9 to 10, added new skill to table + CLAUDE.md: added new directory to project structure + CONTRIBUTING.md: current — no changes needed + TODOS.md: marked 2 items complete, added 1 new item + + All docs updated and committed. PR body updated with doc diff. +``` + +It also polishes CHANGELOG voice (without ever overwriting entries), cleans up completed TODOS, checks cross-doc consistency, and asks about VERSION bumps only when appropriate. + +--- + +## Greptile integration + +[Greptile](https://greptile.com) is a YC company that reviews your PRs automatically. It catches real bugs — race conditions, security issues, things that pass CI and blow up in production. It has genuinely saved my ass more than once. I love these guys. + +### Setup + +Install Greptile on your GitHub repo at [greptile.com](https://greptile.com) — it takes about 30 seconds. Once it's reviewing your PRs, gstack picks up its comments automatically. No additional configuration. + +### How it works + +The problem with any automated reviewer is triage. Greptile is good, but not every comment is a real issue. Some are false positives. Some flag things you already fixed three commits ago. Without a triage layer, the comments pile up and you start ignoring them — which defeats the purpose. + +gstack solves this. `/review` and `/ship` are now Greptile-aware. They read Greptile's comments, classify each one, and take action: + +- **Valid issues** get added to the critical findings and fixed before shipping +- **Already-fixed issues** get an auto-reply acknowledging the catch +- **False positives** get pushed back — you confirm, and a reply goes out explaining why it's wrong + +The result is a two-layer review: Greptile catches things asynchronously on the PR, then `/review` and `/ship` triage those findings as part of the normal workflow. Nothing falls through the cracks. + +### Learning from history + +Every false positive you confirm gets saved to `~/.gstack/greptile-history.md`. Future runs auto-skip known FP patterns for your codebase. And `/retro` tracks Greptile's batting average over time — so you can see whether the signal-to-noise ratio is improving. + +### Example + +``` +You: /ship + +Claude: [syncs main, runs tests, pre-landing review...] + + Greptile found 3 comments on this PR: + + [VALID] app/services/payment_service.rb:47 — Race condition: + concurrent charges can double-debit. Recommend DB-level advisory + lock around the charge block. + + [ALREADY FIXED] app/models/user.rb:12 — Missing null check + on email before downcase. + → Already fixed in commit abc1234. Auto-replying. + + [FALSE POSITIVE] lib/auth.rb:88 — "Token comparison should + use constant-time comparison." We already use + ActiveSupport::SecurityUtils.secure_compare here. + + Fixing the race condition in payment_service.rb... + + The auth token comment is a false positive. + Options: A) Reply to Greptile explaining B) Fix anyway C) Ignore + +You: A + +Claude: Replied to Greptile. All tests pass. + PR: github.com/you/app/pull/42 +``` + +Three Greptile comments. One real fix. One auto-acknowledged. One false positive pushed back with a reply. Total extra time: about 30 seconds. 
diff --git a/document-release/SKILL.md b/document-release/SKILL.md index e4448c64..5e7c5c70 100644 --- a/document-release/SKILL.md +++ b/document-release/SKILL.md @@ -29,22 +29,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. 
An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/gstack-upgrade/SKILL.md b/gstack-upgrade/SKILL.md index 9d635651..50b67dfc 100644 --- a/gstack-upgrade/SKILL.md +++ b/gstack-upgrade/SKILL.md @@ -156,6 +156,13 @@ rm -rf "$LOCAL_GSTACK.bak" ``` Tell user: "Also updated vendored copy at `$LOCAL_GSTACK` — commit `.claude/skills/gstack/` when you're ready." +If `./setup` fails, restore from backup and warn the user: +```bash +rm -rf "$LOCAL_GSTACK" +mv "$LOCAL_GSTACK.bak" "$LOCAL_GSTACK" +``` +Tell user: "Sync failed — restored previous version at `$LOCAL_GSTACK`. Run `/gstack-upgrade` manually to retry." + ### Step 5: Write marker + clear cache ```bash @@ -193,9 +200,26 @@ When invoked directly as `/gstack-upgrade` (not from a preamble): 1. Force a fresh update check (bypass cache): ```bash -~/.claude/skills/gstack/bin/gstack-update-check --force +~/.claude/skills/gstack/bin/gstack-update-check --force 2>/dev/null || \ +.claude/skills/gstack/bin/gstack-update-check --force 2>/dev/null || true ``` Use the output to determine if an upgrade is available. 2. If `UPGRADE_AVAILABLE `: follow Steps 2-6 above. -3. If no output (up to date): tell the user "You're already on the latest version (v{version})." + +3. If no output (primary is up to date): check for a stale local vendored copy. + +Run the Step 2 bash block above to detect the primary install type and directory (`INSTALL_TYPE` and `INSTALL_DIR`). Then run the Step 4.5 detection bash block above to check for a local vendored copy (`LOCAL_GSTACK`). + +**If `LOCAL_GSTACK` is empty** (no local vendored copy): tell the user "You're already on the latest version (v{version})." + +**If `LOCAL_GSTACK` is non-empty**, compare versions: +```bash +PRIMARY_VER=$(cat "$INSTALL_DIR/VERSION" 2>/dev/null || echo "unknown") +LOCAL_VER=$(cat "$LOCAL_GSTACK/VERSION" 2>/dev/null || echo "unknown") +echo "PRIMARY=$PRIMARY_VER LOCAL=$LOCAL_VER" +``` + +**If versions differ:** follow the Step 4.5 sync bash block above to update the local copy from the primary. 
Tell user: "Global v{PRIMARY_VER} is up to date. Updated local vendored copy from v{LOCAL_VER} → v{PRIMARY_VER}. Commit `.claude/skills/gstack/` when you're ready." + +**If versions match:** tell the user "You're on the latest version (v{PRIMARY_VER}). Global and local vendored copy are both up to date." diff --git a/gstack-upgrade/SKILL.md.tmpl b/gstack-upgrade/SKILL.md.tmpl index a441b8d6..0acf16c2 100644 --- a/gstack-upgrade/SKILL.md.tmpl +++ b/gstack-upgrade/SKILL.md.tmpl @@ -154,6 +154,13 @@ rm -rf "$LOCAL_GSTACK.bak" ``` Tell user: "Also updated vendored copy at `$LOCAL_GSTACK` — commit `.claude/skills/gstack/` when you're ready." +If `./setup` fails, restore from backup and warn the user: +```bash +rm -rf "$LOCAL_GSTACK" +mv "$LOCAL_GSTACK.bak" "$LOCAL_GSTACK" +``` +Tell user: "Sync failed — restored previous version at `$LOCAL_GSTACK`. Run `/gstack-upgrade` manually to retry." + ### Step 5: Write marker + clear cache ```bash @@ -191,9 +198,26 @@ When invoked directly as `/gstack-upgrade` (not from a preamble): 1. Force a fresh update check (bypass cache): ```bash -~/.claude/skills/gstack/bin/gstack-update-check --force +~/.claude/skills/gstack/bin/gstack-update-check --force 2>/dev/null || \ +.claude/skills/gstack/bin/gstack-update-check --force 2>/dev/null || true ``` Use the output to determine if an upgrade is available. 2. If `UPGRADE_AVAILABLE `: follow Steps 2-6 above. -3. If no output (up to date): tell the user "You're already on the latest version (v{version})." + +3. If no output (primary is up to date): check for a stale local vendored copy. + +Run the Step 2 bash block above to detect the primary install type and directory (`INSTALL_TYPE` and `INSTALL_DIR`). Then run the Step 4.5 detection bash block above to check for a local vendored copy (`LOCAL_GSTACK`). + +**If `LOCAL_GSTACK` is empty** (no local vendored copy): tell the user "You're already on the latest version (v{version})." + +**If `LOCAL_GSTACK` is non-empty**, compare versions: +```bash +PRIMARY_VER=$(cat "$INSTALL_DIR/VERSION" 2>/dev/null || echo "unknown") +LOCAL_VER=$(cat "$LOCAL_GSTACK/VERSION" 2>/dev/null || echo "unknown") +echo "PRIMARY=$PRIMARY_VER LOCAL=$LOCAL_VER" +``` + +**If versions differ:** follow the Step 4.5 sync bash block above to update the local copy from the primary. Tell user: "Global v{PRIMARY_VER} is up to date. Updated local vendored copy from v{LOCAL_VER} → v{PRIMARY_VER}. Commit `.claude/skills/gstack/` when you're ready." + +**If versions match:** tell the user "You're on the latest version (v{PRIMARY_VER}). Global and local vendored copy are both up to date." diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index f2511f02..791650d3 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -31,22 +31,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. 
+If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. 
Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index f4387c44..5a831897 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -3,9 +3,9 @@ name: plan-ceo-review version: 1.0.0 description: | CEO/founder-mode plan review. Rethink the problem, find the 10-star product, - challenge premises, expand scope when it creates a better product. Three modes: - SCOPE EXPANSION (dream big), HOLD SCOPE (maximum rigor), SCOPE REDUCTION - (strip to essentials). + challenge premises, expand scope when it creates a better product. Four modes: + SCOPE EXPANSION (dream big), SELECTIVE EXPANSION (hold scope + cherry-pick + expansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials). allowed-tools: - Read - Grep @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. 
Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -130,10 +169,12 @@ branch name wherever the instructions say "the base branch." ## Philosophy You are not here to rubber-stamp this plan. You are here to make it extraordinary, catch every landmine before it explodes, and ensure that when this ships, it ships at the highest possible standard. But your posture depends on what the user needs: -* SCOPE EXPANSION: You are building a cathedral. Envision the platonic ideal. Push scope UP. Ask "what would make this 10x better for 2x the effort?" The answer to "should we also build X?" is "yes, if it serves the vision." You have permission to dream. +* SCOPE EXPANSION: You are building a cathedral. Envision the platonic ideal. Push scope UP. Ask "what would make this 10x better for 2x the effort?" 
You have permission to dream — and to recommend enthusiastically. But every expansion is the user's decision. Present each scope-expanding idea as an AskUserQuestion. The user opts in or out. +* SELECTIVE EXPANSION: You are a rigorous reviewer who also has taste. Hold the current scope as your baseline — make it bulletproof. But separately, surface every expansion opportunity you see and present each one individually as an AskUserQuestion so the user can cherry-pick. Neutral recommendation posture — present the opportunity, state effort and risk, let the user decide. Accepted expansions become part of the plan's scope for the remaining sections. Rejected ones go to "NOT in scope." * HOLD SCOPE: You are a rigorous reviewer. The plan's scope is accepted. Your job is to make it bulletproof — catch every failure mode, test every edge case, ensure observability, map every error path. Do not silently reduce OR expand. * SCOPE REDUCTION: You are a surgeon. Find the minimum viable version that achieves the core outcome. Cut everything else. Be ruthless. -Critical rule: Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 — after that, execute the chosen mode faithfully. +* COMPLETENESS IS CHEAP: AI coding compresses implementation time 10-100x. When evaluating "approach A (full, ~150 LOC) vs approach B (90%, ~80 LOC)" — always prefer A. The 70-line delta costs seconds with CC. "Ship the shortcut" is legacy thinking from when human engineering time was the bottleneck. Boil the lake. +Critical rule: In ALL modes, the user is 100% in control. Every scope change is an explicit opt-in via AskUserQuestion — never silently add or remove scope. Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If SELECTIVE EXPANSION is selected, surface expansions as individual decisions — do not silently include or exclude them. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 — after that, execute the chosen mode faithfully. Do NOT make any code changes. Do NOT start implementation. Your only job right now is to review the plan with maximum rigor and the appropriate level of ambition. ## Prime Directives @@ -201,7 +242,7 @@ Map: ### Retrospective Check Check the git log for this branch. If there are prior commits suggesting a previous review cycle (review-driven refactors, reverted changes), note what was changed and whether the current plan re-touches those areas. Be MORE aggressive reviewing areas that were previously problematic. Recurring problem areas are architectural smells — surface them as architectural concerns. -### Taste Calibration (EXPANSION mode only) +### Taste Calibration (EXPANSION and SELECTIVE EXPANSION modes) Identify 2-3 files or patterns in the existing codebase that are particularly well-designed. Note them as style references for the review. Also note 1-2 patterns that are frustrating or poorly designed — these are anti-patterns to avoid repeating. Report findings before proceeding to Step 0. @@ -254,10 +295,20 @@ Rules: - Do NOT proceed to mode selection (0F) without user approval of the chosen approach. ### 0D. Mode-Specific Analysis -**For SCOPE EXPANSION** — run all three: +**For SCOPE EXPANSION** — run all three, then the opt-in ceremony: 1. 
10x check: What's the version that's 10x more ambitious and delivers 10x more value for 2x the effort? Describe it concretely. 2. Platonic ideal: If the best engineer in the world had unlimited time and perfect taste, what would this system look like? What would the user feel when using it? Start from experience, not architecture. -3. Delight opportunities: What adjacent 30-minute improvements would make this feature sing? Things where a user would think "oh nice, they thought of that." List at least 3. +3. Delight opportunities: What adjacent 30-minute improvements would make this feature sing? Things where a user would think "oh nice, they thought of that." List at least 5. +4. **Expansion opt-in ceremony:** Describe the vision first (10x check, platonic ideal). Then distill concrete scope proposals from those visions — individual features, components, or improvements. Present each proposal as its own AskUserQuestion. Recommend enthusiastically — explain why it's worth doing. But the user decides. Options: **A)** Add to this plan's scope **B)** Defer to TODOS.md **C)** Skip. Accepted items become plan scope for all remaining review sections. Rejected items go to "NOT in scope." + +**For SELECTIVE EXPANSION** — run the HOLD SCOPE analysis first, then surface expansions: +1. Complexity check: If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts. +2. What is the minimum set of changes that achieves the stated goal? Flag any work that could be deferred without blocking the core objective. +3. Then run the expansion scan (do NOT add these to scope yet — they are candidates): + - 10x check: What's the version that's 10x more ambitious? Describe it concretely. + - Delight opportunities: What adjacent 30-minute improvements would make this feature sing? List at least 5. + - Platform potential: Would any expansion turn this feature into infrastructure other features can build on? +4. **Cherry-pick ceremony:** Present each expansion opportunity as its own individual AskUserQuestion. Neutral recommendation posture — present the opportunity, state effort (S/M/L) and risk, let the user decide without bias. Options: **A)** Add to this plan's scope **B)** Defer to TODOS.md **C)** Skip. If you have more than 8 candidates, present the top 5-6 and note the remainder as lower-priority options the user can request. Accepted items become plan scope for all remaining review sections. Rejected items go to "NOT in scope." **For HOLD SCOPE** — run this: 1. Complexity check: If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts. @@ -267,7 +318,57 @@ Rules: 1. Ruthless cut: What is the absolute minimum that ships value to a user? Everything else is deferred. No exceptions. 2. What can be a follow-up PR? Separate "must ship together" from "nice to ship together." -### 0E. Temporal Interrogation (EXPANSION and HOLD modes) +### 0D-POST. Persist CEO Plan (EXPANSION and SELECTIVE EXPANSION only) + +After the opt-in/cherry-pick ceremony, write the plan to disk so the vision and decisions survive beyond this conversation. Only run this step for EXPANSION and SELECTIVE EXPANSION modes. 
+ +```bash +eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) +mkdir -p ~/.gstack/projects/$SLUG/ceo-plans +``` + +Before writing, check for existing CEO plans in the ceo-plans/ directory. If any are >30 days old or their branch has been merged/deleted, offer to archive them: + +```bash +mkdir -p ~/.gstack/projects/$SLUG/ceo-plans/archive +# For each stale plan: mv ~/.gstack/projects/$SLUG/ceo-plans/{old-plan}.md ~/.gstack/projects/$SLUG/ceo-plans/archive/ +``` + +Write to `~/.gstack/projects/$SLUG/ceo-plans/{date}-{feature-slug}.md` using this format: + +```markdown +--- +status: ACTIVE +--- +# CEO Plan: {Feature Name} +Generated by /plan-ceo-review on {date} +Branch: {branch} | Mode: {EXPANSION / SELECTIVE EXPANSION} +Repo: {owner/repo} + +## Vision + +### 10x Check +{10x vision description} + +### Platonic Ideal +{platonic ideal description — EXPANSION mode only} + +## Scope Decisions + +| # | Proposal | Effort | Decision | Reasoning | +|---|----------|--------|----------|-----------| +| 1 | {proposal} | S/M/L | ACCEPTED / DEFERRED / SKIPPED | {why} | + +## Accepted Scope (added to this plan) +- {bullet list of what's now in scope} + +## Deferred to TODOS.md +- {items with context} +``` + +Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format. + +### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes) Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan? ``` HOUR 1 (foundations): What does the implementer need to know? @@ -275,20 +376,30 @@ Think ahead to implementation: What decisions will need to be made during implem HOUR 4-5 (integration): What will surprise them? HOUR 6+ (polish/tests): What will they wish they'd planned for? ``` +NOTE: These represent human-team implementation hours. With CC + gstack, +6 hours of human implementation compresses to ~30-60 minutes. The decisions +are identical — the implementation speed is 10-20x faster. Always present +both scales when discussing effort. + Surface these as questions for the user NOW, not as "figure it out later." ### 0F. Mode Selection -Present three options: -1. **SCOPE EXPANSION:** The plan is good but could be great. Propose the ambitious version, then review that. Push scope up. Build the cathedral. -2. **HOLD SCOPE:** The plan's scope is right. Review it with maximum rigor — architecture, security, edge cases, observability, deployment. Make it bulletproof. -3. **SCOPE REDUCTION:** The plan is overbuilt or wrong-headed. Propose a minimal version that achieves the core goal, then review that. +In every mode, you are 100% in control. No scope is added without your explicit approval. + +Present four options: +1. **SCOPE EXPANSION:** The plan is good but could be great. Dream big — propose the ambitious version. Every expansion is presented individually for your approval. You opt in to each one. +2. **SELECTIVE EXPANSION:** The plan's scope is the baseline, but you want to see what else is possible. Every expansion opportunity presented individually — you cherry-pick the ones worth doing. Neutral recommendations. +3. **HOLD SCOPE:** The plan's scope is right. Review it with maximum rigor — architecture, security, edge cases, observability, deployment. Make it bulletproof. No expansions surfaced. +4. **SCOPE REDUCTION:** The plan is overbuilt or wrong-headed. Propose a minimal version that achieves the core goal, then review that. 
Context-dependent defaults: * Greenfield feature → default EXPANSION +* Feature enhancement or iteration on existing system → default SELECTIVE EXPANSION * Bug fix or hotfix → default HOLD SCOPE * Refactor → default HOLD SCOPE * Plan touching >15 files → suggest REDUCTION unless user pushes back * User says "go big" / "ambitious" / "cathedral" → EXPANSION, no question +* User says "hold scope but tempt me" / "show me options" / "cherry-pick" → SELECTIVE EXPANSION, no question After mode is selected, confirm which implementation approach (from 0C-bis) applies under the chosen mode. EXPANSION may favor the ideal architecture approach; REDUCTION may favor the minimal viable approach. @@ -313,10 +424,12 @@ Evaluate and diagram: * Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it. * Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long? -**EXPANSION mode additions:** +**EXPANSION and SELECTIVE EXPANSION additions:** * What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"? * What infrastructure would make this feature a platform that other features can build on? +**SELECTIVE EXPANSION:** If any accepted cherry-picks from Step 0D affect the architecture, evaluate their architectural fit here. Flag any that create coupling concerns or don't integrate cleanly — this is a chance to revisit the decision with new information. + Required ASCII diagram: full system architecture showing new components and their relationships to existing ones. **STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds. @@ -475,8 +588,8 @@ Evaluate: * Admin tooling. New operational tasks that need admin UI or rake tasks? * Runbooks. For each new failure mode: what's the operational response? -**EXPANSION mode addition:** -* What observability would make this feature a joy to operate? +**EXPANSION and SELECTIVE EXPANSION addition:** +* What observability would make this feature a joy to operate? (For SELECTIVE EXPANSION, include observability for any accepted cherry-picks.) **STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds. ### Section 9: Deployment & Rollout Review @@ -490,8 +603,8 @@ Evaluate: * Post-deploy verification checklist. First 5 minutes? First hour? * Smoke tests. What automated checks should run immediately post-deploy? -**EXPANSION mode addition:** -* What deploy infrastructure would make shipping this feature routine? +**EXPANSION and SELECTIVE EXPANSION addition:** +* What deploy infrastructure would make shipping this feature routine? (For SELECTIVE EXPANSION, assess whether accepted cherry-picks change the deployment risk profile.) **STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds. ### Section 10: Long-Term Trajectory Review @@ -503,9 +616,10 @@ Evaluate: * Ecosystem fit. 
Aligns with Rails/JS ecosystem direction? * The 1-year question. Read this plan as a new engineer in 12 months — obvious? -**EXPANSION mode additions:** +**EXPANSION and SELECTIVE EXPANSION additions:** * What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory? * Platform potential. Does this create capabilities other features can leverage? +* (SELECTIVE EXPANSION only) Retrospective: Were the right cherry-picks accepted? Did any rejected expansions turn out to be load-bearing for the accepted ones? **STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds. ## CRITICAL RULE — How to ask questions @@ -548,14 +662,17 @@ For each TODO, describe: * **Pros:** What you gain by doing this work. * **Cons:** Cost, complexity, or risks of doing it. * **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start. -* **Effort estimate:** S/M/L/XL +* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L * **Priority:** P1/P2/P3 * **Depends on / blocked by:** Any prerequisites or ordering constraints. Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring. -### Delight Opportunities (EXPANSION mode only) -Identify at least 5 "bonus chunk" opportunities (<30 min each) that would make users think "oh nice, they thought of that." Present each delight opportunity as its own individual AskUserQuestion. Never batch them. For each one, describe what it is, why it would delight users, and effort estimate. Then present options: **A)** Add to TODOS.md as a vision item **B)** Skip **C)** Build it now in this PR. +### Scope Expansion Decisions (EXPANSION and SELECTIVE EXPANSION only) +For EXPANSION and SELECTIVE EXPANSION modes: expansion opportunities and delight items were surfaced and decided in Step 0D (opt-in/cherry-pick ceremony). The decisions are persisted in the CEO plan document. Reference the CEO plan for the full record. Do not re-surface them here — list the accepted expansions for completeness: +* Accepted: {list items added to scope} +* Deferred: {list items sent to TODOS.md} +* Skipped: {list items rejected} ### Diagrams (mandatory, produce all that apply) 1. System architecture @@ -573,7 +690,7 @@ List every ASCII diagram in files this plan touches. Still accurate? +====================================================================+ | MEGA PLAN REVIEW — COMPLETION SUMMARY | +====================================================================+ - | Mode selected | EXPANSION / HOLD / REDUCTION | + | Mode selected | EXPANSION / SELECTIVE / HOLD / REDUCTION | | System Audit | [key findings] | | Step 0 | [mode + key decisions] | | Section 1 (Arch) | ___ issues found | @@ -593,7 +710,9 @@ List every ASCII diagram in files this plan touches. Still accurate? 
| Error/rescue registry| ___ methods, ___ CRITICAL GAPS | | Failure modes | ___ total, ___ CRITICAL GAPS | | TODOS.md updates | ___ items proposed | - | Delight opportunities| ___ identified (EXPANSION only) | + | Scope proposals | ___ proposed, ___ accepted (EXP + SEL) | + | CEO plan | written / skipped (HOLD/REDUCTION) | + | Lake Score | X/Y recommendations chose complete option | | Diagrams produced | ___ (list types) | | Stale diagrams found | ___ | | Unresolved decisions | ___ (listed below) | @@ -618,15 +737,17 @@ Before running this command, substitute the placeholder values from the Completi - **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open" - **unresolved**: number from "Unresolved decisions" in the summary - **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary -- **MODE**: the mode the user selected (SCOPE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION) +- **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION) ## Review Readiness Dashboard -After completing the review, read the review log to display the dashboard. +After completing the review, read the review log and config to display the dashboard. ```bash eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS" +echo "---CONFIG---" +~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false" ``` Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review). Ignore entries with timestamps older than 7 days. Display: @@ -635,20 +756,37 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl +====================================================================+ | REVIEW READINESS DASHBOARD | +====================================================================+ -| Review | Runs | Last Run | Status | -|-----------------|------|---------------------|----------------------| -| CEO Review | 1 | 2026-03-16 14:30 | CLEAR | -| Eng Review | 1 | 2026-03-16 15:00 | CLEAR | -| Design Review | 0 | — | NOT YET RUN | +| Review | Runs | Last Run | Status | Required | +|-----------------|------|---------------------|-----------|----------| +| Eng Review | 1 | 2026-03-16 15:00 | CLEAR | YES | +| CEO Review | 0 | — | — | no | +| Design Review | 0 | — | — | no | +--------------------------------------------------------------------+ -| VERDICT: 2/3 CLEAR — Design Review not yet run | +| VERDICT: CLEARED — Eng Review passed | +====================================================================+ ``` +**Review tiers:** +- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). +- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. +- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. + **Verdict logic:** -- **CLEARED TO SHIP (3/3)**: All three have >= 1 entry within 7 days AND most recent status is "clean" -- **N/3 CLEAR**: Show count and list which are missing, have open issues, or are stale (>7 days) -- Informational only — does NOT block. 

## Review Readiness Dashboard

-After completing the review, read the review log to display the dashboard.
+After completing the review, read the review log and config to display the dashboard.

```bash
eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
+echo "---CONFIG---"
+~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
```

Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review). Ignore entries with timestamps older than 7 days. Display:
@@ -635,20 +756,37 @@ Parse the output.
 +====================================================================+
 | REVIEW READINESS DASHBOARD                                         |
 +====================================================================+
-| Review          | Runs | Last Run            | Status               |
-|-----------------|------|---------------------|----------------------|
-| CEO Review      | 1    | 2026-03-16 14:30    | CLEAR                |
-| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR                |
-| Design Review   | 0    | —                   | NOT YET RUN          |
+| Review          | Runs | Last Run            | Status    | Required |
+|-----------------|------|---------------------|-----------|----------|
+| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR     | YES      |
+| CEO Review      | 0    | —                   | —         | no       |
+| Design Review   | 0    | —                   | —         | no       |
 +--------------------------------------------------------------------+
-| VERDICT: 2/3 CLEAR — Design Review not yet run                     |
+| VERDICT: CLEARED — Eng Review passed                               |
 +====================================================================+
 ```

+**Review tiers:**
+- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
+- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
+- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
+
 **Verdict logic:**
-- **CLEARED TO SHIP (3/3)**: All three have >= 1 entry within 7 days AND most recent status is "clean"
-- **N/3 CLEAR**: Show count and list which are missing, have open issues, or are stale (>7 days)
-- Informational only — does NOT block.
+- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
+- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
+- CEO and Design reviews are shown for context but never block shipping
+- If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED
+
+## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)
+
+At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
+
+"The vision from this review produced {N} accepted scope expansions. Want to promote it to a design doc in the repo?"
+- **A)** Promote to `docs/designs/{FEATURE}.md` (committed to repo, visible to the team)
+- **B)** Keep in `~/.gstack/projects/` only (local, personal reference)
+- **C)** Skip
+
+If promoted, copy the CEO plan content to `docs/designs/{FEATURE}.md` (create the directory if needed) and update the `status` field in the original CEO plan from `ACTIVE` to `PROMOTED`.

## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
@@ -659,30 +797,37 @@ Parse the output.

## Mode Quick Reference

```
- ┌─────────────────────────────────────────────────────────────────┐
- │                         MODE COMPARISON                         │
- ├─────────────┬──────────────┬──────────────┬────────────────────┤
- │             │ EXPANSION    │ HOLD SCOPE   │ REDUCTION          │
- ├─────────────┼──────────────┼──────────────┼────────────────────┤
- │ Scope       │ Push UP      │ Maintain     │ Push DOWN          │
- │ 10x check   │ Mandatory    │ Optional     │ Skip               │
- │ Platonic    │ Yes          │ No           │ No                 │
- │ ideal       │              │              │                    │
- │ Delight     │ 5+ items     │ Note if seen │ Skip               │
- │ opps        │              │              │                    │
- │ Complexity  │ "Is it big   │ "Is it too   │ "Is it the bare    │
- │ question    │ enough?"     │ complex?"    │ minimum?"          │
- │ Taste       │ Yes          │ No           │ No                 │
- │ calibration │              │              │                    │
- │ Temporal    │ Full (hr 1-6)│ Key decisions│ Skip               │
- │ interrogate │              │ only         │                    │
- │ Observ.     │ "Joy to      │ "Can we      │ "Can we see if     │
- │ standard    │ operate"     │ debug it?"   │ it's broken?"      │
- │ Deploy      │ Infra as     │ Safe deploy  │ Simplest possible  │
- │ standard    │ feature scope│ + rollback   │ deploy             │
- │ Error map   │ Full + chaos │ Full         │ Critical paths     │
- │             │ scenarios    │              │ only               │
- │ Phase 2/3   │ Map it       │ Note it      │ Skip               │
- │ planning    │              │              │                    │
- └─────────────┴──────────────┴──────────────┴────────────────────┘
+ ┌────────────────────────────────────────────────────────────────────────────────┐
+ │                                MODE COMPARISON                                 │
+ ├─────────────┬──────────────┬──────────────┬──────────────┬────────────────────┤
+ │             │ EXPANSION    │ SELECTIVE    │ HOLD SCOPE   │ REDUCTION          │
+ ├─────────────┼──────────────┼──────────────┼──────────────┼────────────────────┤
+ │ Scope       │ Push UP      │ Hold + offer │ Maintain     │ Push DOWN          │
+ │             │ (opt-in)     │              │              │                    │
+ │ Recommend   │ Enthusiastic │ Neutral      │ N/A          │ N/A                │
+ │ posture     │              │              │              │                    │
+ │ 10x check   │ Mandatory    │ Surface as   │ Optional     │ Skip               │
+ │             │              │ cherry-pick  │              │                    │
+ │ Platonic    │ Yes          │ No           │ No           │ No                 │
+ │ ideal       │              │              │              │                    │
+ │ Delight     │ Opt-in       │ Cherry-pick  │ Note if seen │ Skip               │
+ │ opps        │ ceremony     │ ceremony     │              │                    │
+ │ Complexity  │ "Is it big   │ "Is it right │ "Is it too   │ "Is it the bare    │
+ │ question    │ enough?"     │ + what else  │ complex?"    │ minimum?"          │
+ │             │              │ is tempting" │              │                    │
+ │ Taste       │ Yes          │ Yes          │ No           │ No                 │
+ │ calibration │              │              │              │                    │
+ │ Temporal    │ Full (hr 1-6)│ Full (hr 1-6)│ Key decisions│ Skip               │
+ │ interrogate │              │              │ only         │                    │
+ │ Observ.     │ "Joy to      │ "Joy to      │ "Can we      │ "Can we see if     │
+ │ standard    │ operate"     │ operate"     │ debug it?"   │ it's broken?"      │
+ │ Deploy      │ Infra as     │ Safe deploy  │ Safe deploy  │ Simplest possible  │
+ │ standard    │ feature scope│ + cherry-pick│ + rollback   │ deploy             │
+ │             │              │ risk check   │              │                    │
+ │ Error map   │ Full + chaos │ Full + chaos │ Full         │ Critical paths     │
+ │             │ scenarios    │ for accepted │              │ only               │
+ │ CEO plan    │ Written      │ Written      │ Skipped      │ Skipped            │
+ │ Phase 2/3   │ Map accepted │ Map accepted │ Note it      │ Skip               │
+ │ planning    │              │ cherry-picks │              │                    │
+ └─────────────┴──────────────┴──────────────┴──────────────┴────────────────────┘
```

diff --git a/plan-ceo-review/SKILL.md.tmpl b/plan-ceo-review/SKILL.md.tmpl
index 7e5e30a1..c33d82e6 100644
--- a/plan-ceo-review/SKILL.md.tmpl
+++ b/plan-ceo-review/SKILL.md.tmpl
@@ -3,9 +3,9 @@ name: plan-ceo-review
version: 1.0.0
description: |
  CEO/founder-mode plan review. Rethink the problem, find the 10-star product,
-  challenge premises, expand scope when it creates a better product. Three modes:
-  SCOPE EXPANSION (dream big), HOLD SCOPE (maximum rigor), SCOPE REDUCTION
-  (strip to essentials).
+  challenge premises, expand scope when it creates a better product. Four modes:
+  SCOPE EXPANSION (dream big), SELECTIVE EXPANSION (hold scope + cherry-pick
+  expansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials).
allowed-tools:
- Read
- Grep
@@ -23,10 +23,12 @@ allowed-tools:

## Philosophy
You are not here to rubber-stamp this plan. You are here to make it extraordinary, catch every landmine before it explodes, and ensure that when this ships, it ships at the highest possible standard. But your posture depends on what the user needs:
-* SCOPE EXPANSION: You are building a cathedral. Envision the platonic ideal. Push scope UP. Ask "what would make this 10x better for 2x the effort?" The answer to "should we also build X?" is "yes, if it serves the vision." You have permission to dream.
+* SCOPE EXPANSION: You are building a cathedral. Envision the platonic ideal. Push scope UP. Ask "what would make this 10x better for 2x the effort?" You have permission to dream — and to recommend enthusiastically. But every expansion is the user's decision. Present each scope-expanding idea as an AskUserQuestion. The user opts in or out.
+* SELECTIVE EXPANSION: You are a rigorous reviewer who also has taste. Hold the current scope as your baseline — make it bulletproof. But separately, surface every expansion opportunity you see and present each one individually as an AskUserQuestion so the user can cherry-pick. Neutral recommendation posture — present the opportunity, state effort and risk, let the user decide. Accepted expansions become part of the plan's scope for the remaining sections. Rejected ones go to "NOT in scope."
* HOLD SCOPE: You are a rigorous reviewer. The plan's scope is accepted. Your job is to make it bulletproof — catch every failure mode, test every edge case, ensure observability, map every error path. Do not silently reduce OR expand.
* SCOPE REDUCTION: You are a surgeon. Find the minimum viable version that achieves the core outcome. Cut everything else. Be ruthless.
-Critical rule: Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 — after that, execute the chosen mode faithfully.
+* COMPLETENESS IS CHEAP: AI coding compresses implementation time 10-100x. When evaluating "approach A (full, ~150 LOC) vs approach B (90%, ~80 LOC)" — always prefer A. The 70-line delta costs seconds with CC. "Ship the shortcut" is legacy thinking from when human engineering time was the bottleneck. Boil the lake.
+Critical rule: In ALL modes, the user is 100% in control. Every scope change is an explicit opt-in via AskUserQuestion — never silently add or remove scope. Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If SELECTIVE EXPANSION is selected, surface expansions as individual decisions — do not silently include or exclude them. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 — after that, execute the chosen mode faithfully.

Do NOT make any code changes. Do NOT start implementation. Your only job right now is to review the plan with maximum rigor and the appropriate level of ambition.

## Prime Directives
@@ -94,7 +96,7 @@ Map:

### Retrospective Check
Check the git log for this branch. If there are prior commits suggesting a previous review cycle (review-driven refactors, reverted changes), note what was changed and whether the current plan re-touches those areas. Be MORE aggressive reviewing areas that were previously problematic. Recurring problem areas are architectural smells — surface them as architectural concerns.

-### Taste Calibration (EXPANSION mode only)
+### Taste Calibration (EXPANSION and SELECTIVE EXPANSION modes)
Identify 2-3 files or patterns in the existing codebase that are particularly well-designed. Note them as style references for the review. Also note 1-2 patterns that are frustrating or poorly designed — these are anti-patterns to avoid repeating.

Report findings before proceeding to Step 0.
@@ -147,10 +149,20 @@ Rules:
- Do NOT proceed to mode selection (0F) without user approval of the chosen approach.

### 0D. Mode-Specific Analysis
-**For SCOPE EXPANSION** — run all three:
+**For SCOPE EXPANSION** — run all three, then the opt-in ceremony:
1. 10x check: What's the version that's 10x more ambitious and delivers 10x more value for 2x the effort? Describe it concretely.
2. Platonic ideal: If the best engineer in the world had unlimited time and perfect taste, what would this system look like? What would the user feel when using it? Start from experience, not architecture.
-3. Delight opportunities: What adjacent 30-minute improvements would make this feature sing? Things where a user would think "oh nice, they thought of that." List at least 3.
+3. Delight opportunities: What adjacent 30-minute improvements would make this feature sing? Things where a user would think "oh nice, they thought of that." List at least 5.
+4. **Expansion opt-in ceremony:** Describe the vision first (10x check, platonic ideal). Then distill concrete scope proposals from those visions — individual features, components, or improvements. Present each proposal as its own AskUserQuestion. Recommend enthusiastically — explain why it's worth doing. But the user decides. Options: **A)** Add to this plan's scope **B)** Defer to TODOS.md **C)** Skip. Accepted items become plan scope for all remaining review sections. Rejected items go to "NOT in scope."
+
+**For SELECTIVE EXPANSION** — run the HOLD SCOPE analysis first, then surface expansions:
+1. Complexity check: If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts.
+2. What is the minimum set of changes that achieves the stated goal? Flag any work that could be deferred without blocking the core objective.
+3. Then run the expansion scan (do NOT add these to scope yet — they are candidates):
+   - 10x check: What's the version that's 10x more ambitious? Describe it concretely.
+   - Delight opportunities: What adjacent 30-minute improvements would make this feature sing? List at least 5.
+   - Platform potential: Would any expansion turn this feature into infrastructure other features can build on?
+4. **Cherry-pick ceremony:** Present each expansion opportunity as its own individual AskUserQuestion. Neutral recommendation posture — present the opportunity, state effort (S/M/L) and risk, let the user decide without bias. Options: **A)** Add to this plan's scope **B)** Defer to TODOS.md **C)** Skip. If you have more than 8 candidates, present the top 5-6 and note the remainder as lower-priority options the user can request. Accepted items become plan scope for all remaining review sections. Rejected items go to "NOT in scope."

**For HOLD SCOPE** — run this:
1. Complexity check: If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts.
@@ -160,7 +172,57 @@ Rules:
1. Ruthless cut: What is the absolute minimum that ships value to a user? Everything else is deferred. No exceptions.
2. What can be a follow-up PR? Separate "must ship together" from "nice to ship together."

-### 0E. Temporal Interrogation (EXPANSION and HOLD modes)
+### 0D-POST. Persist CEO Plan (EXPANSION and SELECTIVE EXPANSION only)
+
+After the opt-in/cherry-pick ceremony, write the plan to disk so the vision and decisions survive beyond this conversation. Only run this step for EXPANSION and SELECTIVE EXPANSION modes.
+
+```bash
+eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+mkdir -p ~/.gstack/projects/$SLUG/ceo-plans
+```
+
+Before writing, check for existing CEO plans in the ceo-plans/ directory. If any are >30 days old or their branch has been merged/deleted, offer to archive them:
+
+```bash
+mkdir -p ~/.gstack/projects/$SLUG/ceo-plans/archive
+# For each stale plan: mv ~/.gstack/projects/$SLUG/ceo-plans/{old-plan}.md ~/.gstack/projects/$SLUG/ceo-plans/archive/
+```
+
+Write to `~/.gstack/projects/$SLUG/ceo-plans/{date}-{feature-slug}.md` using this format:
+
+```markdown
+---
+status: ACTIVE
+---
+# CEO Plan: {Feature Name}
+Generated by /plan-ceo-review on {date}
+Branch: {branch} | Mode: {EXPANSION / SELECTIVE EXPANSION}
+Repo: {owner/repo}
+
+## Vision
+
+### 10x Check
+{10x vision description}
+
+### Platonic Ideal
+{platonic ideal description — EXPANSION mode only}
+
+## Scope Decisions
+
+| # | Proposal | Effort | Decision | Reasoning |
+|---|----------|--------|----------|-----------|
+| 1 | {proposal} | S/M/L | ACCEPTED / DEFERRED / SKIPPED | {why} |
+
+## Accepted Scope (added to this plan)
+- {bullet list of what's now in scope}
+
+## Deferred to TODOS.md
+- {items with context}
+```
+
+Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
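+
+A minimal sketch of the age half of the staleness check above, assuming `find -mtime` granularity is acceptable; detecting merged or deleted branches still needs `git branch --merged` (or the hosting API) and is left to judgment:
+
+```bash
+# List candidate stale plans (modified more than 30 days ago) to offer for archiving
+find ~/.gstack/projects/$SLUG/ceo-plans -maxdepth 1 -name '*.md' -mtime +30 2>/dev/null
+```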
+
+### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
```
HOUR 1 (foundations): What does the implementer need to know?
@@ -168,20 +230,30 @@ Think ahead to implementation: What decisions will need to be made during implem
HOUR 4-5 (integration): What will surprise them?
HOUR 6+ (polish/tests): What will they wish they'd planned for?
```
+NOTE: These represent human-team implementation hours. With CC + gstack,
+6 hours of human implementation compresses to ~30-60 minutes. The decisions
+are identical — the implementation speed is roughly 6-12x faster. Always
+present both scales when discussing effort.
+
Surface these as questions for the user NOW, not as "figure it out later."

### 0F. Mode Selection
-Present three options:
-1. **SCOPE EXPANSION:** The plan is good but could be great. Propose the ambitious version, then review that. Push scope up. Build the cathedral.
-2. **HOLD SCOPE:** The plan's scope is right. Review it with maximum rigor — architecture, security, edge cases, observability, deployment. Make it bulletproof.
-3. **SCOPE REDUCTION:** The plan is overbuilt or wrong-headed. Propose a minimal version that achieves the core goal, then review that.
+In every mode, you are 100% in control. No scope is added without your explicit approval.
+
+Present four options:
+1. **SCOPE EXPANSION:** The plan is good but could be great. Dream big — propose the ambitious version. Every expansion is presented individually for your approval. You opt in to each one.
+2. **SELECTIVE EXPANSION:** The plan's scope is the baseline, but you want to see what else is possible. Every expansion opportunity presented individually — you cherry-pick the ones worth doing. Neutral recommendations.
+3. **HOLD SCOPE:** The plan's scope is right. Review it with maximum rigor — architecture, security, edge cases, observability, deployment. Make it bulletproof. No expansions surfaced.
+4. **SCOPE REDUCTION:** The plan is overbuilt or wrong-headed. Propose a minimal version that achieves the core goal, then review that.

Context-dependent defaults:
* Greenfield feature → default EXPANSION
+* Feature enhancement or iteration on existing system → default SELECTIVE EXPANSION
* Bug fix or hotfix → default HOLD SCOPE
* Refactor → default HOLD SCOPE
* Plan touching >15 files → suggest REDUCTION unless user pushes back
* User says "go big" / "ambitious" / "cathedral" → EXPANSION, no question
+* User says "hold scope but tempt me" / "show me options" / "cherry-pick" → SELECTIVE EXPANSION, no question

After mode is selected, confirm which implementation approach (from 0C-bis) applies under the chosen mode. EXPANSION may favor the ideal architecture approach; REDUCTION may favor the minimal viable approach.
@@ -206,10 +278,12 @@ Evaluate and diagram:
* Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it.
* Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long?

-**EXPANSION mode additions:**
+**EXPANSION and SELECTIVE EXPANSION additions:**
* What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"?
* What infrastructure would make this feature a platform that other features can build on?

+**SELECTIVE EXPANSION:** If any accepted cherry-picks from Step 0D affect the architecture, evaluate their architectural fit here. Flag any that create coupling concerns or don't integrate cleanly — this is a chance to revisit the decision with new information.
+
Required ASCII diagram: full system architecture showing new components and their relationships to existing ones.

**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
@@ -368,8 +442,8 @@ Evaluate:
* Admin tooling. New operational tasks that need admin UI or rake tasks?
* Runbooks. For each new failure mode: what's the operational response?

-**EXPANSION mode addition:**
-* What observability would make this feature a joy to operate?
+**EXPANSION and SELECTIVE EXPANSION addition:**
+* What observability would make this feature a joy to operate? (For SELECTIVE EXPANSION, include observability for any accepted cherry-picks.)

**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.

### Section 9: Deployment & Rollout Review
@@ -383,8 +457,8 @@ Evaluate:
* Post-deploy verification checklist. First 5 minutes? First hour?
* Smoke tests. What automated checks should run immediately post-deploy?

-**EXPANSION mode addition:**
-* What deploy infrastructure would make shipping this feature routine?
+**EXPANSION and SELECTIVE EXPANSION addition:**
+* What deploy infrastructure would make shipping this feature routine? (For SELECTIVE EXPANSION, assess whether accepted cherry-picks change the deployment risk profile.)

**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.

### Section 10: Long-Term Trajectory Review
@@ -396,9 +470,10 @@ Evaluate:
* Ecosystem fit. Aligns with Rails/JS ecosystem direction?
* The 1-year question. Read this plan as a new engineer in 12 months — obvious?

-**EXPANSION mode additions:**
+**EXPANSION and SELECTIVE EXPANSION additions:**
* What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory?
* Platform potential. Does this create capabilities other features can leverage?
+* (SELECTIVE EXPANSION only) Retrospective: Were the right cherry-picks accepted? Did any rejected expansions turn out to be load-bearing for the accepted ones?

**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.

## CRITICAL RULE — How to ask questions
@@ -441,14 +516,17 @@ For each TODO, describe:
* **Pros:** What you gain by doing this work.
* **Cons:** Cost, complexity, or risks of doing it.
* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
-* **Effort estimate:** S/M/L/XL
+* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L
* **Priority:** P1/P2/P3
* **Depends on / blocked by:** Any prerequisites or ordering constraints.

Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring.

-### Delight Opportunities (EXPANSION mode only)
-Identify at least 5 "bonus chunk" opportunities (<30 min each) that would make users think "oh nice, they thought of that." Present each delight opportunity as its own individual AskUserQuestion. Never batch them. For each one, describe what it is, why it would delight users, and effort estimate. Then present options: **A)** Add to TODOS.md as a vision item **B)** Skip **C)** Build it now in this PR.
+### Scope Expansion Decisions (EXPANSION and SELECTIVE EXPANSION only)
+For EXPANSION and SELECTIVE EXPANSION modes: expansion opportunities and delight items were surfaced and decided in Step 0D (opt-in/cherry-pick ceremony). The decisions are persisted in the CEO plan document. Reference the CEO plan for the full record. Do not re-surface them here — list the accepted expansions for completeness:
+* Accepted: {list items added to scope}
+* Deferred: {list items sent to TODOS.md}
+* Skipped: {list items rejected}

### Diagrams (mandatory, produce all that apply)
1. System architecture
@@ -466,7 +544,7 @@ List every ASCII diagram in files this plan touches. Still accurate?
 +====================================================================+
 | MEGA PLAN REVIEW — COMPLETION SUMMARY                              |
 +====================================================================+
-| Mode selected        | EXPANSION / HOLD / REDUCTION                |
+| Mode selected        | EXPANSION / SELECTIVE / HOLD / REDUCTION    |
 | System Audit         | [key findings]                              |
 | Step 0               | [mode + key decisions]                      |
 | Section 1 (Arch)     | ___ issues found                            |
@@ -486,7 +564,9 @@ List every ASCII diagram in files this plan touches. Still accurate?
 | Error/rescue registry| ___ methods, ___ CRITICAL GAPS              |
 | Failure modes        | ___ total, ___ CRITICAL GAPS                |
 | TODOS.md updates     | ___ items proposed                          |
-| Delight opportunities| ___ identified (EXPANSION only)             |
+| Scope proposals      | ___ proposed, ___ accepted (EXP + SEL)      |
+| CEO plan             | written / skipped (HOLD/REDUCTION)          |
+| Lake Score           | X/Y recommendations chose complete option   |
 | Diagrams produced    | ___ (list types)                            |
 | Stale diagrams found | ___                                         |
 | Unresolved decisions | ___ (listed below)                          |
@@ -511,10 +591,21 @@ Before running this command, substitute the placeholder values from the Completi
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" in the summary
- **critical_gaps**: number from "Failure modes: ___ CRITICAL GAPS" in the summary
-- **MODE**: the mode the user selected (SCOPE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)
+- **MODE**: the mode the user selected (SCOPE_EXPANSION / SELECTIVE_EXPANSION / HOLD_SCOPE / SCOPE_REDUCTION)

{{REVIEW_DASHBOARD}}

+## docs/designs Promotion (EXPANSION and SELECTIVE EXPANSION only)
+
+At the end of the review, if the vision produced a compelling feature direction, offer to promote the CEO plan to the project repo. AskUserQuestion:
+
+"The vision from this review produced {N} accepted scope expansions. Want to promote it to a design doc in the repo?"
+- **A)** Promote to `docs/designs/{FEATURE}.md` (committed to repo, visible to the team)
+- **B)** Keep in `~/.gstack/projects/` only (local, personal reference)
+- **C)** Skip
+
+If promoted, copy the CEO plan content to `docs/designs/{FEATURE}.md` (create the directory if needed) and update the `status` field in the original CEO plan from `ACTIVE` to `PROMOTED`.
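+
+A minimal sketch of the promotion step, assuming the paths from Step 0D-POST; the plan filename and feature slug are illustrative, and the `sed -i` in-place flag differs between macOS and GNU sed:
+
+```bash
+eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+PLAN=~/.gstack/projects/$SLUG/ceo-plans/2026-03-17-user-dashboard.md  # example plan file
+mkdir -p docs/designs
+cp "$PLAN" docs/designs/user-dashboard.md
+# Flip the frontmatter status in the original CEO plan
+sed -i '' 's/^status: ACTIVE$/status: PROMOTED/' "$PLAN"  # macOS sed; on Linux: sed -i
+```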
+
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
@@ -524,30 +615,37 @@ Before running this command, substitute the placeholder values from the Completi

## Mode Quick Reference

```
- ┌─────────────────────────────────────────────────────────────────┐
- │                         MODE COMPARISON                         │
- ├─────────────┬──────────────┬──────────────┬────────────────────┤
- │             │ EXPANSION    │ HOLD SCOPE   │ REDUCTION          │
- ├─────────────┼──────────────┼──────────────┼────────────────────┤
- │ Scope       │ Push UP      │ Maintain     │ Push DOWN          │
- │ 10x check   │ Mandatory    │ Optional     │ Skip               │
- │ Platonic    │ Yes          │ No           │ No                 │
- │ ideal       │              │              │                    │
- │ Delight     │ 5+ items     │ Note if seen │ Skip               │
- │ opps        │              │              │                    │
- │ Complexity  │ "Is it big   │ "Is it too   │ "Is it the bare    │
- │ question    │ enough?"     │ complex?"    │ minimum?"          │
- │ Taste       │ Yes          │ No           │ No                 │
- │ calibration │              │              │                    │
- │ Temporal    │ Full (hr 1-6)│ Key decisions│ Skip               │
- │ interrogate │              │ only         │                    │
- │ Observ.     │ "Joy to      │ "Can we      │ "Can we see if     │
- │ standard    │ operate"     │ debug it?"   │ it's broken?"      │
- │ Deploy      │ Infra as     │ Safe deploy  │ Simplest possible  │
- │ standard    │ feature scope│ + rollback   │ deploy             │
- │ Error map   │ Full + chaos │ Full         │ Critical paths     │
- │             │ scenarios    │              │ only               │
- │ Phase 2/3   │ Map it       │ Note it      │ Skip               │
- │ planning    │              │              │                    │
- └─────────────┴──────────────┴──────────────┴────────────────────┘
+ ┌────────────────────────────────────────────────────────────────────────────────┐
+ │                                MODE COMPARISON                                 │
+ ├─────────────┬──────────────┬──────────────┬──────────────┬────────────────────┤
+ │             │ EXPANSION    │ SELECTIVE    │ HOLD SCOPE   │ REDUCTION          │
+ ├─────────────┼──────────────┼──────────────┼──────────────┼────────────────────┤
+ │ Scope       │ Push UP      │ Hold + offer │ Maintain     │ Push DOWN          │
+ │             │ (opt-in)     │              │              │                    │
+ │ Recommend   │ Enthusiastic │ Neutral      │ N/A          │ N/A                │
+ │ posture     │              │              │              │                    │
+ │ 10x check   │ Mandatory    │ Surface as   │ Optional     │ Skip               │
+ │             │              │ cherry-pick  │              │                    │
+ │ Platonic    │ Yes          │ No           │ No           │ No                 │
+ │ ideal       │              │              │              │                    │
+ │ Delight     │ Opt-in       │ Cherry-pick  │ Note if seen │ Skip               │
+ │ opps        │ ceremony     │ ceremony     │              │                    │
+ │ Complexity  │ "Is it big   │ "Is it right │ "Is it too   │ "Is it the bare    │
+ │ question    │ enough?"     │ + what else  │ complex?"    │ minimum?"          │
+ │             │              │ is tempting" │              │                    │
+ │ Taste       │ Yes          │ Yes          │ No           │ No                 │
+ │ calibration │              │              │              │                    │
+ │ Temporal    │ Full (hr 1-6)│ Full (hr 1-6)│ Key decisions│ Skip               │
+ │ interrogate │              │              │ only         │                    │
+ │ Observ.     │ "Joy to      │ "Joy to      │ "Can we      │ "Can we see if     │
+ │ standard    │ operate"     │ operate"     │ debug it?"   │ it's broken?"      │
+ │ Deploy      │ Infra as     │ Safe deploy  │ Safe deploy  │ Simplest possible  │
+ │ standard    │ feature scope│ + cherry-pick│ + rollback   │ deploy             │
+ │             │              │ risk check   │              │                    │
+ │ Error map   │ Full + chaos │ Full + chaos │ Full         │ Critical paths     │
+ │             │ scenarios    │ for accepted │              │ only               │
+ │ CEO plan    │ Written      │ Written      │ Skipped      │ Skipped            │
+ │ Phase 2/3   │ Map accepted │ Map accepted │ Note it      │ Skip               │
+ │ planning    │              │ cherry-picks │              │                    │
+ └─────────────┴──────────────┴──────────────┴──────────────┴────────────────────┘
```

diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md
index 3d7a1c0a..3e4ef1fa 100644
--- a/plan-design-review/SKILL.md
+++ b/plan-design-review/SKILL.md
@@ -29,22 +29,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
echo "BRANCH: $_BRANCH"
+_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
+echo "LAKE_INTRO: $_LAKE_SEEN"
```

If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue.

+If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
+Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
+thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean"
+Then offer to open the essay in their default browser:
+
+```bash
+open https://garryslist.org/posts/boil-the-ocean
+touch ~/.gstack/.completeness-intro-seen
+```
+
+Run `open` only if the user says yes. Run `touch` either way, so the intro is marked as seen even when declined. This only happens once.
+
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**

1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called.
-3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]`
-4. **Options:** Lettered options: `A) ... B) ... C) ...`
+3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it.
+4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`

Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline.

+## Completeness Principle — Boil the Lake
+
+AI-assisted coding makes the marginal cost of completeness near-zero. When you present options:
+
+- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more.
+- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope.
+- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference (ratios assume 8-hour days and 5-day weeks):
+
+| Task type | Human team | CC+gstack | Compression |
+|-----------|-----------|-----------|-------------|
+| Boilerplate / scaffolding | 2 days | 15 min | ~60x |
+| Test writing | 1 day | 15 min | ~30x |
+| Feature implementation | 1 week | 30 min | ~80x |
+| Bug fix + regression test | 4 hours | 15 min | ~15x |
+| Architecture / design | 2 days | 4 hours | ~4x |
+| Research / exploration | 1 day | 3 hours | ~3x |
+
+- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds.
+
+**Anti-patterns — DON'T do this:**
+- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.)
+- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.)
+- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.)
+- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.")
+
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better.
@@ -601,11 +640,13 @@ Substitute values from the report:

## Review Readiness Dashboard

-After completing the review, read the review log to display the dashboard.
+After completing the review, read the review log and config to display the dashboard.

```bash
eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
+echo "---CONFIG---"
+~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
```

Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review). Ignore entries with timestamps older than 7 days. Display:
@@ -614,17 +655,23 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 +====================================================================+
 | REVIEW READINESS DASHBOARD                                         |
 +====================================================================+
-| Review          | Runs | Last Run            | Status               |
-|-----------------|------|---------------------|----------------------|
-| CEO Review      | 1    | 2026-03-16 14:30    | CLEAR                |
-| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR                |
-| Design Review   | 0    | —                   | NOT YET RUN          |
+| Review          | Runs | Last Run            | Status    | Required |
+|-----------------|------|---------------------|-----------|----------|
+| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR     | YES      |
+| CEO Review      | 0    | —                   | —         | no       |
+| Design Review   | 0    | —                   | —         | no       |
 +--------------------------------------------------------------------+
-| VERDICT: 2/3 CLEAR — Design Review not yet run                     |
+| VERDICT: CLEARED — Eng Review passed                               |
 +====================================================================+
 ```

+**Review tiers:**
+- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
+- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
+- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
+
 **Verdict logic:**
-- **CLEARED TO SHIP (3/3)**: All three have >= 1 entry within 7 days AND most recent status is "clean"
-- **N/3 CLEAR**: Show count and list which are missing, have open issues, or are stale (>7 days)
-- Informational only — does NOT block.
+- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
+- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
+- CEO and Design reviews are shown for context but never block shipping
+- If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED

diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md
index 35880a54..6edddea1 100644
--- a/plan-eng-review/SKILL.md
+++ b/plan-eng-review/SKILL.md
@@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
echo "BRANCH: $_BRANCH"
+_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
+echo "LAKE_INTRO: $_LAKE_SEEN"
```

If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue.

+If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
+Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
+thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean"
+Then offer to open the essay in their default browser:
+
+```bash
+open https://garryslist.org/posts/boil-the-ocean
+touch ~/.gstack/.completeness-intro-seen
+```
+
+Run `open` only if the user says yes. Run `touch` either way, so the intro is marked as seen even when declined. This only happens once.
+
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**

1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called.
-3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]`
-4. **Options:** Lettered options: `A) ... B) ... C) ...`
+3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it.
+4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`

Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline.

+## Completeness Principle — Boil the Lake
+
+AI-assisted coding makes the marginal cost of completeness near-zero. When you present options:
+
+- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more.
+- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope.
+- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference (ratios assume 8-hour days and 5-day weeks):
+
+| Task type | Human team | CC+gstack | Compression |
+|-----------|-----------|-----------|-------------|
+| Boilerplate / scaffolding | 2 days | 15 min | ~60x |
+| Test writing | 1 day | 15 min | ~30x |
+| Feature implementation | 1 week | 30 min | ~80x |
+| Bug fix + regression test | 4 hours | 15 min | ~15x |
+| Architecture / design | 2 days | 4 hours | ~4x |
+| Research / exploration | 1 day | 3 hours | ~3x |
+
+- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds.
+
+**Anti-patterns — DON'T do this:**
+- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.)
+- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.)
+- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.)
+- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.")
(Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -145,12 +184,13 @@ Before reviewing anything, answer these questions: 3. **Complexity check:** If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts. 4. **TODOS cross-reference:** Read `TODOS.md` if it exists. Are any deferred items blocking this plan? Can any deferred items be bundled into this PR without expanding scope? Does this plan create new work that should be captured as a TODO? -Then ask if I want one of three options: -1. **SCOPE REDUCTION:** The plan is overbuilt. Propose a minimal version that achieves the core goal, then review that. -2. **BIG CHANGE:** Work through interactively, one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. -3. **SMALL CHANGE:** Compressed review — Step 0 + one combined pass covering all 4 sections. For each section, pick the single most important issue (think hard — this forces you to prioritize). Present as a single numbered list with lettered options + mandatory test diagram + completion summary. One AskUserQuestion round at the end. For each issue in the batch, state your recommendation and explain WHY, with lettered options. +5. **Completeness check:** Is the plan doing the complete version or a shortcut? With AI-assisted coding, the cost of completeness (100% test coverage, full edge case handling, complete error paths) is 10-100x cheaper than with a human team. If the plan proposes a shortcut that saves human-hours but only saves minutes with CC+gstack, recommend the complete version. Boil the lake. -**Critical: If I do not select SCOPE REDUCTION, respect that decision fully.** Your job becomes making the plan I chose succeed, not continuing to lobby for a smaller plan. Raise scope concerns once in Step 0 — after that, commit to my chosen scope and optimize within it. Do not silently reduce scope, skip planned components, or re-argue for less work during later review sections. +If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. + +Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. + +**Critical: Once the user accepts or rejects a scope reduction recommendation, commit fully.** Do not re-argue for smaller scope during later review sections. Do not silently reduce scope or skip planned components. ## Review Sections (after scope is agreed) @@ -232,11 +272,10 @@ Follow the AskUserQuestion format from the Preamble above. Additional rules for * **One issue = one AskUserQuestion call.** Never combine multiple issues into one question. * Describe the problem concretely, with file and line references. * Present 2-3 options, including "do nothing" where that's reasonable. -* For each option, specify in one line: effort, risk, and maintenance burden. +* For each option, specify in one line: effort (human: ~X / CC: ~Y), risk, and maintenance burden. 
-* For each option, specify in one line: effort, risk, and maintenance burden.
+* For each option, specify in one line: effort (human: ~X / CC: ~Y), risk, and maintenance burden. If the complete option is only marginally more effort than the shortcut with CC, recommend the complete option.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference (DRY, explicit > clever, minimal diff, etc.).
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs.
-* **Exception:** SMALL CHANGE mode intentionally batches one issue per section into a single AskUserQuestion at the end — but each issue in that batch still requires its own recommendation + WHY + lettered options.

## Required outputs
@@ -274,7 +313,7 @@ If any failure mode has no test AND no error handling AND would be silent, flag

### Completion summary
At the end of the review, fill in and display this summary so the user can see all findings at a glance:
-- Step 0: Scope Challenge (user chose: ___)
+- Step 0: Scope Challenge — ___ (scope accepted as-is / scope reduced per recommendation)
- Architecture Review: ___ issues found
- Code Quality Review: ___ issues found
- Test Review: diagram produced, ___ gaps identified
@@ -283,6 +322,7 @@ At the end of the review, fill in and display this summary so the user can see a
- What already exists: written
- TODOS.md updates: ___ items proposed to user
- Failure modes: ___ critical gaps flagged
+- Lake Score: X/Y recommendations chose complete option

## Retrospective learning
Check the git log for this branch. If there are prior commits suggesting a previous review cycle (e.g., review-driven refactors, reverted changes), note what was changed and whether the current plan touches the same areas. Be more aggressive reviewing areas that were previously problematic.
@@ -308,15 +348,17 @@ Substitute values from the Completion Summary:
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" count
- **critical_gaps**: number from "Failure modes: ___ critical gaps flagged"
-- **MODE**: SCOPE_REDUCTION / BIG_CHANGE / SMALL_CHANGE
+- **MODE**: FULL_REVIEW / SCOPE_REDUCED

## Review Readiness Dashboard

-After completing the review, read the review log to display the dashboard.
+After completing the review, read the review log and config to display the dashboard.

```bash
eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
+echo "---CONFIG---"
+~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
```

Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review). Ignore entries with timestamps older than 7 days. Display:
@@ -325,20 +367,26 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 +====================================================================+
 | REVIEW READINESS DASHBOARD                                         |
 +====================================================================+
-| Review          | Runs | Last Run            | Status               |
-|-----------------|------|---------------------|----------------------|
-| CEO Review      | 1    | 2026-03-16 14:30    | CLEAR                |
-| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR                |
-| Design Review   | 0    | —                   | NOT YET RUN          |
+| Review          | Runs | Last Run            | Status    | Required |
+|-----------------|------|---------------------|-----------|----------|
+| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR     | YES      |
+| CEO Review      | 0    | —                   | —         | no       |
+| Design Review   | 0    | —                   | —         | no       |
 +--------------------------------------------------------------------+
-| VERDICT: 2/3 CLEAR — Design Review not yet run                     |
+| VERDICT: CLEARED — Eng Review passed                               |
 +====================================================================+
 ```

+**Review tiers:**
+- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
+- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
+- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
+
 **Verdict logic:**
-- **CLEARED TO SHIP (3/3)**: All three have >= 1 entry within 7 days AND most recent status is "clean"
-- **N/3 CLEAR**: Show count and list which are missing, have open issues, or are stale (>7 days)
-- Informational only — does NOT block.
+- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
+- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
+- CEO and Design reviews are shown for context but never block shipping
+- If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED

## Unresolved decisions
If the user does not respond to an AskUserQuestion or interrupts to move on, note which decisions were left unresolved. At the end of the review, list these as "Unresolved decisions that may bite you later" — never silently default to an option.

diff --git a/plan-eng-review/SKILL.md.tmpl b/plan-eng-review/SKILL.md.tmpl
index 3549f012..ef491571 100644
--- a/plan-eng-review/SKILL.md.tmpl
+++ b/plan-eng-review/SKILL.md.tmpl
@@ -55,12 +55,13 @@ Before reviewing anything, answer these questions:
3. **Complexity check:** If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts.
4. **TODOS cross-reference:** Read `TODOS.md` if it exists. Are any deferred items blocking this plan? Can any deferred items be bundled into this PR without expanding scope? Does this plan create new work that should be captured as a TODO?

-Then ask if I want one of three options:
-1. **SCOPE REDUCTION:** The plan is overbuilt. Propose a minimal version that achieves the core goal, then review that.
-2. **BIG CHANGE:** Work through interactively, one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section.
-3. **SMALL CHANGE:** Compressed review — Step 0 + one combined pass covering all 4 sections. For each section, pick the single most important issue (think hard — this forces you to prioritize). Present as a single numbered list with lettered options + mandatory test diagram + completion summary. One AskUserQuestion round at the end. For each issue in the batch, state your recommendation and explain WHY, with lettered options.
+5. **Completeness check:** Is the plan doing the complete version or a shortcut? With AI-assisted coding, the cost of completeness (100% test coverage, full edge case handling, complete error paths) is 10-100x cheaper than with a human team. If the plan proposes a shortcut that saves human-hours but only saves minutes with CC+gstack, recommend the complete version. Boil the lake.

-**Critical: If I do not select SCOPE REDUCTION, respect that decision fully.** Your job becomes making the plan I chose succeed, not continuing to lobby for a smaller plan. Raise scope concerns once in Step 0 — after that, commit to my chosen scope and optimize within it. Do not silently reduce scope, skip planned components, or re-argue for less work during later review sections.
+If the complexity check triggers (more than 8 files or more than 2 new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1.
+
+Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section.
+
+**Critical: Once the user accepts or rejects a scope reduction recommendation, commit fully.** Do not re-argue for smaller scope during later review sections. Do not silently reduce scope or skip planned components.

## Review Sections (after scope is agreed)
@@ -142,11 +143,10 @@ Follow the AskUserQuestion format from the Preamble above. Additional rules for
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where that's reasonable.
-* For each option, specify in one line: effort, risk, and maintenance burden.
+* For each option, specify in one line: effort (human: ~X / CC: ~Y), risk, and maintenance burden. If the complete option is only marginally more effort than the shortcut with CC, recommend the complete option.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference (DRY, explicit > clever, minimal diff, etc.).
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs.
-* **Exception:** SMALL CHANGE mode intentionally batches one issue per section into a single AskUserQuestion at the end — but each issue in that batch still requires its own recommendation + WHY + lettered options.

## Required outputs
@@ -184,7 +184,7 @@ If any failure mode has no test AND no error handling AND would be silent, flag

### Completion summary
At the end of the review, fill in and display this summary so the user can see all findings at a glance:
-- Step 0: Scope Challenge (user chose: ___)
+- Step 0: Scope Challenge — ___ (scope accepted as-is / scope reduced per recommendation)
- Architecture Review: ___ issues found
- Code Quality Review: ___ issues found
- Test Review: diagram produced, ___ gaps identified
@@ -193,6 +193,7 @@ At the end of the review, fill in and display this summary so the user can see a
- What already exists: written
- TODOS.md updates: ___ items proposed to user
- Failure modes: ___ critical gaps flagged
+- Lake Score: X/Y recommendations chose complete option

## Retrospective learning
Check the git log for this branch. If there are prior commits suggesting a previous review cycle (e.g., review-driven refactors, reverted changes), note what was changed and whether the current plan touches the same areas. Be more aggressive reviewing areas that were previously problematic.
@@ -218,7 +219,7 @@ Substitute values from the Completion Summary:
- **STATUS**: "clean" if 0 unresolved decisions AND 0 critical gaps; otherwise "issues_open"
- **unresolved**: number from "Unresolved decisions" count
- **critical_gaps**: number from "Failure modes: ___ critical gaps flagged"
-- **MODE**: SCOPE_REDUCTION / BIG_CHANGE / SMALL_CHANGE
+- **MODE**: FULL_REVIEW / SCOPE_REDUCED

{{REVIEW_DASHBOARD}}

diff --git a/qa-design-review/SKILL.md b/qa-design-review/SKILL.md
index 8ac9abfc..adcc5a08 100644
--- a/qa-design-review/SKILL.md
+++ b/qa-design-review/SKILL.md
@@ -14,6 +14,7 @@ allowed-tools:
- Glob
- Grep
- AskUserQuestion
+ - WebSearch
---

@@ -30,22 +31,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
echo "BRANCH: $_BRANCH"
+_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
+echo "LAKE_INTRO: $_LAKE_SEEN"
```

If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue.

+If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
+Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
+thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean"
+Then offer to open the essay in their default browser:
+
+```bash
+open https://garryslist.org/posts/boil-the-ocean
+touch ~/.gstack/.completeness-intro-seen
+```
+
+Run `open` only if the user says yes. Run `touch` either way, so the intro is marked as seen even when declined. This only happens once.
+
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**

1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called.
Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -161,6 +201,161 @@ If `NEEDS_SETUP`: 2. Run: `cd && ./setup` 3. 
If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
+**Check test framework (bootstrap if needed):**
+
+## Test Framework Bootstrap
+
+**Detect existing test framework and project runtime:**
+
+```bash
+# Detect project runtime
+[ -f Gemfile ] && echo "RUNTIME:ruby"
+[ -f package.json ] && echo "RUNTIME:node"
+{ [ -f requirements.txt ] || [ -f pyproject.toml ]; } && echo "RUNTIME:python"
+[ -f go.mod ] && echo "RUNTIME:go"
+[ -f Cargo.toml ] && echo "RUNTIME:rust"
+[ -f composer.json ] && echo "RUNTIME:php"
+[ -f mix.exs ] && echo "RUNTIME:elixir"
+# Detect sub-frameworks
+[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails"
+[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs"
+# Check for existing test infrastructure
+ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini phpunit.xml 2>/dev/null
+grep -l '\[tool\.pytest' pyproject.toml 2>/dev/null  # pyproject.toml counts only when it actually configures pytest
+ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
+# Check opt-out marker
+[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED"
+```
+
+**If test framework detected** (config files or test directories found):
+Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap."
+Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns).
+Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.**
+
+**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.**
+
+**If NO runtime detected** (no config files found): Use AskUserQuestion:
+"I couldn't detect your project's language. What runtime are you using?"
+Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests.
+If user picks H → write `.gstack/no-test-bootstrap` and continue without tests.
+
+**If runtime detected but no test framework — bootstrap:**
+
+### B2. Research best practices
+
+Use WebSearch to find current best practices for the detected runtime:
+- `"[runtime] best test framework 2025 2026"`
+- `"[framework A] vs [framework B] comparison"`
+
+If WebSearch is unavailable, use this built-in knowledge table:
+
+| Runtime | Primary recommendation | Alternative |
+|---------|----------------------|-------------|
+| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers |
+| Node.js | vitest + @testing-library | jest + @testing-library |
+| Next.js | vitest + @testing-library/react + playwright | jest + cypress |
+| Python | pytest + pytest-cov | unittest |
+| Go | stdlib testing + testify | stdlib only |
+| Rust | cargo test (built-in) + mockall | — |
+| PHP | phpunit + mockery | pest |
+| Elixir | ExUnit (built-in) + ex_machina | — |
+
+### B3. Framework selection
+
+Use AskUserQuestion:
+"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options:
+A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e
+B) [Alternative] — [rationale]. Includes: [packages]
+C) Skip — don't set up testing right now
+RECOMMENDATION: Choose A because [reason based on project context]"
+
+If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests.
+
+If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially.
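+As a concrete sketch of the opt-out marker's lifecycle (paths from the detection step above; illustrative, not an extra required step):
+
+```bash
+# User declines bootstrap (option C in B3, or H in the runtime question): record it
+mkdir -p .gstack && touch .gstack/no-test-bootstrap
+# Every later run: the detection block prints BOOTSTRAP_DECLINED and skips bootstrap
+[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED"
+# User changes their mind: delete the marker and re-run the skill
+rm .gstack/no-test-bootstrap
+```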
+ +### B4. Install and configure + +1. Install the chosen packages (npm/bun/gem/pip/etc.) +2. Create minimal config file +3. Create directory structure (test/, spec/, etc.) +4. Create one example test matching the project's code to verify setup works + +If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +```bash +# Run the full test suite to confirm everything works +{detected test command} +``` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. CI/CD pipeline + +```bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +``` + +If `.github/` exists (or no CI detected — default to GitHub Actions): +Create `.github/workflows/test.yml` with: +- `runs-on: ubuntu-latest` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) +- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate. + +Append a `## Testing` section: +- Run command and test directory +- Reference to TESTING.md +- Test expectations: + - 100% test coverage is the goal — tests make vibe coding safe + - When writing new functions, write a corresponding test + - When fixing a bug, write a regression test + - When adding error handling, write a test that triggers the error + - When adding a conditional (if/else, switch), write tests for BOTH paths + - Never commit code that makes existing tests fail + +### B8. Commit + +```bash +git status --porcelain +``` + +Only commit if there are changes. 
Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created): +`git commit -m "chore: bootstrap test framework ({framework name})"` + +--- + **Create output directories:** ```bash @@ -590,6 +785,18 @@ Take **before/after screenshot pair** for every fix. - **best-effort**: fix applied but couldn't fully verify (e.g., needs specific browser state) - **reverted**: regression detected → `git revert HEAD` → mark finding as "deferred" +### 8e.5. Regression Test (design-review variant) + +Design fixes are typically CSS-only. Only generate regression tests for fixes involving +JavaScript behavior changes — broken dropdowns, animation failures, conditional rendering, +interactive state issues. + +For CSS-only fixes: skip entirely. CSS regressions are caught by re-running /qa-design-review. + +If the fix involved JS behavior: follow the same procedure as /qa Phase 8e.5 (study existing +test patterns, write a regression test encoding the exact bug condition, run it, commit if +passes or defer if fails). Commit format: `test(design): regression test for FINDING-NNN`. + ### 8f. Self-Regulation (STOP AND EVALUATE) Every 5 fixes (or after any revert), compute the design-fix risk level: @@ -664,7 +871,7 @@ If the repo has a `TODOS.md`: 11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty. 12. **One commit per fix.** Never bundle multiple design fixes into one commit. -13. **Never modify tests or CI configuration.** Only fix application source code and styles. +13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files. 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately. 15. **Self-regulate.** Follow the design-fix risk heuristic. When in doubt, stop and ask. 16. **CSS-first.** Prefer CSS/styling changes over structural component changes. CSS-only changes are safer and more reversible. diff --git a/qa-design-review/SKILL.md.tmpl b/qa-design-review/SKILL.md.tmpl index 0053a494..5969fb52 100644 --- a/qa-design-review/SKILL.md.tmpl +++ b/qa-design-review/SKILL.md.tmpl @@ -14,6 +14,7 @@ allowed-tools: - Glob - Grep - AskUserQuestion + - WebSearch --- {{PREAMBLE}} @@ -54,6 +55,10 @@ fi {{BROWSE_SETUP}} +**Check test framework (bootstrap if needed):** + +{{TEST_BOOTSTRAP}} + **Create output directories:** ```bash @@ -153,6 +158,18 @@ Take **before/after screenshot pair** for every fix. - **best-effort**: fix applied but couldn't fully verify (e.g., needs specific browser state) - **reverted**: regression detected → `git revert HEAD` → mark finding as "deferred" +### 8e.5. Regression Test (design-review variant) + +Design fixes are typically CSS-only. Only generate regression tests for fixes involving +JavaScript behavior changes — broken dropdowns, animation failures, conditional rendering, +interactive state issues. + +For CSS-only fixes: skip entirely. CSS regressions are caught by re-running /qa-design-review. + +If the fix involved JS behavior: follow the same procedure as /qa Phase 8e.5 (study existing +test patterns, write a regression test encoding the exact bug condition, run it, commit if +passes or defer if fails). Commit format: `test(design): regression test for FINDING-NNN`. + ### 8f. Self-Regulation (STOP AND EVALUATE) Every 5 fixes (or after any revert), compute the design-fix risk level: @@ -227,7 +244,7 @@ If the repo has a `TODOS.md`: 11. 
**Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty. 12. **One commit per fix.** Never bundle multiple design fixes into one commit. -13. **Never modify tests or CI configuration.** Only fix application source code and styles. +13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files. 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately. 15. **Self-regulate.** Follow the design-fix risk heuristic. When in doubt, stop and ask. 16. **CSS-first.** Prefer CSS/styling changes over structural component changes. CSS-only changes are safer and more reversible. diff --git a/qa-only/SKILL.md b/qa-only/SKILL.md index cdcd1f1c..a5684dd7 100644 --- a/qa-only/SKILL.md +++ b/qa-only/SKILL.md @@ -27,22 +27,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. 
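+A hypothetical recommendation in this shape (scores and times are illustrative):
+
+```
+RECOMMENDATION: Choose A because the complete path costs only minutes more with CC.
+A) Validate every upload type and size limit (human: ~2 days / CC: ~25 min) — Completeness: 9/10
+B) Validate images only, defer the rest (human: ~4 hours / CC: ~10 min) — Completeness: 5/10
+```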
Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -477,3 +516,4 @@ Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md` ## Additional Rules (qa-only specific) 11. **Never fix bugs.** Find and document only. Do not read source code, edit files, or suggest fixes in the report. Your job is to report what's broken, not to fix it. Use `/qa` for the test-fix-verify loop. +12. **No test framework detected?** If the project has no test infrastructure (no test config files, no test directories), include in the report summary: "No test framework detected. Run `/qa` to bootstrap one and enable regression test generation." diff --git a/qa-only/SKILL.md.tmpl b/qa-only/SKILL.md.tmpl index 101cd71c..831e71ed 100644 --- a/qa-only/SKILL.md.tmpl +++ b/qa-only/SKILL.md.tmpl @@ -97,3 +97,4 @@ Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md` ## Additional Rules (qa-only specific) 11. **Never fix bugs.** Find and document only. Do not read source code, edit files, or suggest fixes in the report. Your job is to report what's broken, not to fix it. Use `/qa` for the test-fix-verify loop. +12. **No test framework detected?** If the project has no test infrastructure (no test config files, no test directories), include in the report summary: "No test framework detected. 
Run `/qa` to bootstrap one and enable regression test generation." diff --git a/qa/SKILL.md b/qa/SKILL.md index b0f7266f..2d12fca8 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -16,6 +16,7 @@ allowed-tools: - Glob - Grep - AskUserQuestion + - WebSearch --- @@ -32,22 +33,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. 
ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -182,6 +222,161 @@ If `NEEDS_SETUP`: 2. Run: `cd && ./setup` 3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash` +**Check test framework (bootstrap if needed):** + +## Test Framework Bootstrap + +**Detect existing test framework and project runtime:** + +```bash +# Detect project runtime +[ -f Gemfile ] && echo "RUNTIME:ruby" +[ -f package.json ] && echo "RUNTIME:node" +[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python" +[ -f go.mod ] && echo "RUNTIME:go" +[ -f Cargo.toml ] && echo "RUNTIME:rust" +[ -f composer.json ] && echo "RUNTIME:php" +[ -f mix.exs ] && echo "RUNTIME:elixir" +# Detect sub-frameworks +[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails" +[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs" +# Check for existing test infrastructure +ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null +ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null +# Check opt-out marker +[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED" +``` + +**If test framework detected** (config files or test directories found): +Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap." +Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns). +Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.** + +**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.** + +**If NO runtime detected** (no config files found): Use AskUserQuestion: +"I couldn't detect your project's language. What runtime are you using?" 
+Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests. +If user picks H → write `.gstack/no-test-bootstrap` and continue without tests. + +**If runtime detected but no test framework — bootstrap:** + +### B2. Research best practices + +Use WebSearch to find current best practices for the detected runtime: +- `"[runtime] best test framework 2025 2026"` +- `"[framework A] vs [framework B] comparison"` + +If WebSearch is unavailable, use this built-in knowledge table: + +| Runtime | Primary recommendation | Alternative | +|---------|----------------------|-------------| +| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers | +| Node.js | vitest + @testing-library | jest + @testing-library | +| Next.js | vitest + @testing-library/react + playwright | jest + cypress | +| Python | pytest + pytest-cov | unittest | +| Go | stdlib testing + testify | stdlib only | +| Rust | cargo test (built-in) + mockall | — | +| PHP | phpunit + mockery | pest | +| Elixir | ExUnit (built-in) + ex_machina | — | + +### B3. Framework selection + +Use AskUserQuestion: +"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options: +A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e +B) [Alternative] — [rationale]. Includes: [packages] +C) Skip — don't set up testing right now +RECOMMENDATION: Choose A because [reason based on project context]" + +If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests. + +If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially. + +### B4. Install and configure + +1. Install the chosen packages (npm/bun/gem/pip/etc.) +2. Create minimal config file +3. Create directory structure (test/, spec/, etc.) +4. Create one example test matching the project's code to verify setup works + +If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +```bash +# Run the full test suite to confirm everything works +{detected test command} +``` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. 
CI/CD pipeline + +```bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +``` + +If `.github/` exists (or no CI detected — default to GitHub Actions): +Create `.github/workflows/test.yml` with: +- `runs-on: ubuntu-latest` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) +- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate. + +Append a `## Testing` section: +- Run command and test directory +- Reference to TESTING.md +- Test expectations: + - 100% test coverage is the goal — tests make vibe coding safe + - When writing new functions, write a corresponding test + - When fixing a bug, write a regression test + - When adding error handling, write a test that triggers the error + - When adding a conditional (if/else, switch), write tests for BOTH paths + - Never commit code that makes existing tests fail + +### B8. Commit + +```bash +git status --porcelain +``` + +Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created): +`git commit -m "chore: bootstrap test framework ({framework name})"` + +--- + **Create output directories:** ```bash @@ -566,6 +761,59 @@ $B snapshot -D - **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service) - **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred" +### 8e.5. Regression Test + +Skip if: classification is not "verified", OR the fix is purely visual/CSS with no JS behavior, OR no test framework was detected AND user declined bootstrap. + +**1. Study the project's existing test patterns:** + +Read 2-3 test files closest to the fix (same directory, same code type). Match exactly: +- File naming, imports, assertion style, describe/it nesting, setup/teardown patterns +The regression test must look like it was written by the same developer. + +**2. Trace the bug's codepath, then write a regression test:** + +Before writing the test, trace the data flow through the code you just fixed: +- What input/state triggered the bug? (the exact precondition) +- What codepath did it follow? (which branches, which function calls) +- Where did it break? (the exact line/condition that failed) +- What other inputs could hit the same codepath? 
(edge cases around the fix) + +The test MUST: +- Set up the precondition that triggered the bug (the exact state that made it break) +- Perform the action that exposed the bug +- Assert the correct behavior (NOT "it renders" or "it doesn't throw") +- If you found adjacent edge cases while tracing, test those too (e.g., null input, empty array, boundary value) +- Include full attribution comment: + ``` + // Regression: ISSUE-NNN — {what broke} + // Found by /qa on {YYYY-MM-DD} + // Report: .gstack/qa-reports/qa-report-{domain}-{date}.md + ``` + +Test type decision: +- Console error / JS exception / logic bug → unit or integration test +- Broken form / API failure / data flow bug → integration test with request/response +- Visual bug with JS behavior (broken dropdown, animation) → component test +- Pure CSS → skip (caught by QA reruns) + +Generate unit tests. Mock all external dependencies (DB, API, Redis, file system). + +Use auto-incrementing names to avoid collisions: check existing `{name}.regression-*.test.{ext}` files, take max number + 1. + +**3. Run only the new test file:** + +```bash +{detected test command} {new-test-file} +``` + +**4. Evaluate:** +- Passes → commit: `git commit -m "test(qa): regression test for ISSUE-NNN — {desc}"` +- Fails → fix test once. Still failing → delete test, defer. +- Taking >2 min exploration → skip and defer. + +**5. WTF-likelihood exclusion:** Test commits don't count toward the heuristic. + ### 8f. Self-Regulation (STOP AND EVALUATE) Every 5 fixes (or after any revert), compute the WTF-likelihood: @@ -639,6 +887,6 @@ If the repo has a `TODOS.md`: 11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty. 12. **One commit per fix.** Never bundle multiple fixes into one commit. -13. **Never modify tests or CI configuration.** Only fix application source code. +13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files. 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately. 15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask. diff --git a/qa/SKILL.md.tmpl b/qa/SKILL.md.tmpl index 45dfbea6..bd94debe 100644 --- a/qa/SKILL.md.tmpl +++ b/qa/SKILL.md.tmpl @@ -16,6 +16,7 @@ allowed-tools: - Glob - Grep - AskUserQuestion + - WebSearch --- {{PREAMBLE}} @@ -58,6 +59,10 @@ fi {{BROWSE_SETUP}} +**Check test framework (bootstrap if needed):** + +{{TEST_BOOTSTRAP}} + **Create output directories:** ```bash @@ -169,6 +174,59 @@ $B snapshot -D - **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service) - **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred" +### 8e.5. Regression Test + +Skip if: classification is not "verified", OR the fix is purely visual/CSS with no JS behavior, OR no test framework was detected AND user declined bootstrap. + +**1. Study the project's existing test patterns:** + +Read 2-3 test files closest to the fix (same directory, same code type). Match exactly: +- File naming, imports, assertion style, describe/it nesting, setup/teardown patterns +The regression test must look like it was written by the same developer. + +**2. Trace the bug's codepath, then write a regression test:** + +Before writing the test, trace the data flow through the code you just fixed: +- What input/state triggered the bug? (the exact precondition) +- What codepath did it follow? 
(which branches, which function calls) +- Where did it break? (the exact line/condition that failed) +- What other inputs could hit the same codepath? (edge cases around the fix) + +The test MUST: +- Set up the precondition that triggered the bug (the exact state that made it break) +- Perform the action that exposed the bug +- Assert the correct behavior (NOT "it renders" or "it doesn't throw") +- If you found adjacent edge cases while tracing, test those too (e.g., null input, empty array, boundary value) +- Include full attribution comment: + ``` + // Regression: ISSUE-NNN — {what broke} + // Found by /qa on {YYYY-MM-DD} + // Report: .gstack/qa-reports/qa-report-{domain}-{date}.md + ``` + +Test type decision: +- Console error / JS exception / logic bug → unit or integration test +- Broken form / API failure / data flow bug → integration test with request/response +- Visual bug with JS behavior (broken dropdown, animation) → component test +- Pure CSS → skip (caught by QA reruns) + +Generate unit tests. Mock all external dependencies (DB, API, Redis, file system). + +Use auto-incrementing names to avoid collisions: check existing `{name}.regression-*.test.{ext}` files, take max number + 1. + +**3. Run only the new test file:** + +```bash +{detected test command} {new-test-file} +``` + +**4. Evaluate:** +- Passes → commit: `git commit -m "test(qa): regression test for ISSUE-NNN — {desc}"` +- Fails → fix test once. Still failing → delete test, defer. +- Taking >2 min exploration → skip and defer. + +**5. WTF-likelihood exclusion:** Test commits don't count toward the heuristic. + ### 8f. Self-Regulation (STOP AND EVALUATE) Every 5 fixes (or after any revert), compute the WTF-likelihood: @@ -242,6 +300,6 @@ If the repo has a `TODOS.md`: 11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty. 12. **One commit per fix.** Never bundle multiple fixes into one commit. -13. **Never modify tests or CI configuration.** Only fix application source code. +13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files. 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately. 15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask. 
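+The auto-incrementing name rule from Phase 8e.5 above could be implemented along these lines (a sketch — `TEST_DIR`, `name`, and `ext` stand in for the project's detected conventions, not fixed paths):
+
+```bash
+# Find the highest existing regression suffix for this file, then add one
+last=$(ls "${TEST_DIR}/${name}".regression-*.test."${ext}" 2>/dev/null \
+  | sed -E 's/.*regression-([0-9]+).*/\1/' | sort -n | tail -1)
+echo "${TEST_DIR}/${name}.regression-$(( ${last:-0} + 1 )).test.${ext}"
+```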
diff --git a/qa/templates/qa-report-template.md b/qa/templates/qa-report-template.md index 5466bda4..6aa30943 100644 --- a/qa/templates/qa-report-template.md +++ b/qa/templates/qa-report-template.md @@ -86,6 +86,22 @@ --- +## Regression Tests + +| Issue | Test File | Status | Description | +|-------|-----------|--------|-------------| +| ISSUE-NNN | path/to/test | committed / deferred / skipped | description | + +### Deferred Tests + +#### ISSUE-NNN: {title} +**Precondition:** {setup state that triggers the bug} +**Action:** {what the user does} +**Expected:** {correct behavior} +**Why deferred:** {reason} + +--- + ## Ship Readiness | Metric | Value | diff --git a/retro/SKILL.md b/retro/SKILL.md index da42a039..0839a9af 100644 --- a/retro/SKILL.md +++ b/retro/SKILL.md @@ -27,22 +27,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. 
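+Note that `open` in the first-run intro above is macOS-specific. A portable variant (assuming `xdg-open` is available on Linux):
+
+```bash
+# Open the essay with the platform's opener, then mark the intro as seen
+case "$(uname -s)" in
+  Darwin) open https://garryslist.org/posts/boil-the-ocean ;;
+  *)      xdg-open https://garryslist.org/posts/boil-the-ocean ;;
+esac
+touch ~/.gstack/.completeness-intro-seen
+```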
+## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -189,6 +228,15 @@ cat ~/.gstack/greptile-history.md 2>/dev/null || true # 9. TODOS.md backlog (if available) cat TODOS.md 2>/dev/null || true + +# 10. Test file count +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' 2>/dev/null | grep -v node_modules | wc -l + +# 11. Regression test commits in window +git log origin/ --since="" --oneline --grep="test(qa):" --grep="test(design):" --grep="test: coverage" + +# 12. Test files changed in window +git log origin/ --since="" --format="" --name-only | grep -E '\.(test|spec)\.' | sort -u | wc -l ``` ### Step 2: Compute Metrics @@ -210,6 +258,7 @@ Calculate and present these metrics in a summary table: | Detected sessions | N | | Avg LOC/session-hour | N | | Greptile signal | N% (Y catches, Z FPs) | +| Test Health | N total tests · M added this period · K regression tests | Then show a **per-author leaderboard** immediately below: @@ -433,7 +482,17 @@ Use the Write tool to save the JSON file with this schema: } ``` -**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. Only include the `backlog` field if `TODOS.md` exists. If either has no data, omit the field entirely. +**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. 
Only include the `backlog` field if `TODOS.md` exists. Only include the `test_health` field if test files were found (command 10 returns > 0). If any has no data, omit the field entirely. + +Include test health data in the JSON when test files exist: +```json + "test_health": { + "total_test_files": 47, + "tests_added_this_period": 5, + "regression_test_commits": 3, + "test_files_changed": 8 + } +``` Include backlog data in the JSON when TODOS.md exists: ```json @@ -489,6 +548,13 @@ Narrative covering: - Any XL PRs that should have been split - Greptile signal ratio and trend (if history exists): "Greptile: X% signal (Y valid catches, Z false positives)" +### Test Health +- Total test files: N (from command 10) +- Tests added this period: M (from command 12 — test files changed) +- Regression test commits: list `test(qa):` and `test(design):` and `test: coverage` commits from command 11 +- If prior retro exists and has `test_health`: show delta "Test count: {last} → {now} (+{delta})" +- If test ratio < 20%: flag as growth area — "100% test coverage is the goal. Tests make vibe coding safe." + ### Focus & Highlights (from Step 8) - Focus score with interpretation diff --git a/retro/SKILL.md.tmpl b/retro/SKILL.md.tmpl index 2f39fb5c..bfbc2003 100644 --- a/retro/SKILL.md.tmpl +++ b/retro/SKILL.md.tmpl @@ -99,6 +99,15 @@ cat ~/.gstack/greptile-history.md 2>/dev/null || true # 9. TODOS.md backlog (if available) cat TODOS.md 2>/dev/null || true + +# 10. Test file count +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' 2>/dev/null | grep -v node_modules | wc -l + +# 11. Regression test commits in window +git log origin/ --since="" --oneline --grep="test(qa):" --grep="test(design):" --grep="test: coverage" + +# 12. Test files changed in window +git log origin/ --since="" --format="" --name-only | grep -E '\.(test|spec)\.' | sort -u | wc -l ``` ### Step 2: Compute Metrics @@ -120,6 +129,7 @@ Calculate and present these metrics in a summary table: | Detected sessions | N | | Avg LOC/session-hour | N | | Greptile signal | N% (Y catches, Z FPs) | +| Test Health | N total tests · M added this period · K regression tests | Then show a **per-author leaderboard** immediately below: @@ -343,7 +353,17 @@ Use the Write tool to save the JSON file with this schema: } ``` -**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. Only include the `backlog` field if `TODOS.md` exists. If either has no data, omit the field entirely. +**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. Only include the `backlog` field if `TODOS.md` exists. Only include the `test_health` field if test files were found (command 10 returns > 0). If any has no data, omit the field entirely. 
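+One way to derive those fields from commands 10-12 (a sketch — the 30-day window and variable names are illustrative; use the retro's actual time window, and note that `(\.|_)(test|spec)\.` also catches the `_test.`/`_spec.` styles command 10 counts):
+
+```bash
+# Command 10 equivalent: total test files, excluding node_modules
+TOTAL=$(find . \( -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' \) 2>/dev/null | grep -v node_modules | wc -l)
+# Command 12 equivalent: unique test files touched in the window
+CHANGED=$(git log --since="30 days ago" --format="" --name-only | grep -E '(\.|_)(test|spec)\.' | sort -u | wc -l)
+[ "$TOTAL" -gt 0 ] && echo "total_test_files=$TOTAL test_files_changed=$CHANGED"
+```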
+ +Include test health data in the JSON when test files exist: +```json + "test_health": { + "total_test_files": 47, + "tests_added_this_period": 5, + "regression_test_commits": 3, + "test_files_changed": 8 + } +``` Include backlog data in the JSON when TODOS.md exists: ```json @@ -399,6 +419,13 @@ Narrative covering: - Any XL PRs that should have been split - Greptile signal ratio and trend (if history exists): "Greptile: X% signal (Y valid catches, Z false positives)" +### Test Health +- Total test files: N (from command 10) +- Tests added this period: M (from command 12 — test files changed) +- Regression test commits: list `test(qa):` and `test(design):` and `test: coverage` commits from command 11 +- If prior retro exists and has `test_health`: show delta "Test count: {last} → {now} (+{delta})" +- If test ratio < 20%: flag as growth area — "100% test coverage is the goal. Tests make vibe coding safe." + ### Focus & Highlights (from Step 8) - Focus score with interpretation diff --git a/review/SKILL.md b/review/SKILL.md index 1982f6a5..15f5cbc3 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... 
C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. 
diff --git a/review/checklist.md b/review/checklist.md index e2246050..282c9944 100644 --- a/review/checklist.md +++ b/review/checklist.md @@ -84,6 +84,12 @@ To do this: use Grep to find all references to the sibling values (e.g., grep fo - `.expects(:something).never` missing when a code path should explicitly NOT call an external service - Security enforcement features (blocking, rate limiting, auth) without integration tests verifying the enforcement path works end-to-end +#### Completeness Gaps +- Shortcut implementations where the complete version would cost <30 minutes CC time (e.g., partial enum handling, incomplete error paths, missing edge cases that are straightforward to add) +- Options presented with only human-team effort estimates — should show both human and CC+gstack time +- Test coverage gaps where adding the missing tests is a "lake" not an "ocean" (e.g., missing negative-path tests, missing edge case tests that mirror happy-path structure) +- Features implemented at 80-90% when 100% is achievable with modest additional code + #### Crypto & Entropy - Truncation of data instead of hashing (last N chars instead of SHA-256) — less entropy, easier collisions - `rand()` / `Random.rand` for security-sensitive values — use `SecureRandom` instead @@ -113,6 +119,7 @@ CRITICAL (highest severity): INFORMATIONAL (lower severity): ├─ LLM Output Trust Boundary ├─ Dead Code & Consistency └─ Enum & Value Completeness ├─ LLM Prompt Issues ├─ Test Gaps + ├─ Completeness Gaps ├─ Crypto & Entropy ├─ Time Window Safety ├─ Type Coercion at Boundaries diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index affb18d7..682f1e1c 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -107,22 +107,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" \`\`\` If output shows \`UPGRADE_AVAILABLE \`: read \`~/.claude/skills/gstack/gstack-upgrade/SKILL.md\` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If \`JUST_UPGRADED \`: tell user "Running gstack v{to} (just updated!)" and continue. +If \`LAKE_INTRO\` is \`no\`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +\`\`\`bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +\`\`\` + +Only run \`open\` if the user says yes. Always run \`touch\` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the \`_BRANCH\` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. 
Say what it DOES, not what it's called. -3. **Recommend:** \`RECOMMENDATION: Choose [X] because [one-line reason]\` -4. **Options:** Lettered options: \`A) ... B) ... C) ...\` +3. **Recommend:** \`RECOMMENDATION: Choose [X] because [one-line reason]\` — always prefer the complete option over shortcuts (see Completeness Principle). Include \`Completeness: X/10\` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: \`A) ... B) ... C) ...\` — when an option involves effort, show both scales: \`(human: ~X / CC: ~Y)\` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If \`_CONTRIB\` is \`true\`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -842,11 +881,13 @@ Tie everything to user goals and product objectives. Always suggest specific imp function generateReviewDashboard(): string { return `## Review Readiness Dashboard -After completing the review, read the review log to display the dashboard. +After completing the review, read the review log and config to display the dashboard. 
\`\`\`bash
eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS"
+echo "---CONFIG---"
+~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false"
\`\`\`

Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review). Ignore entries with timestamps older than 7 days. Display:
@@ -855,20 +896,181 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 +====================================================================+
 | REVIEW READINESS DASHBOARD                                        |
 +====================================================================+
-| Review          | Runs | Last Run            | Status               |
-|-----------------|------|---------------------|----------------------|
-| CEO Review      | 1    | 2026-03-16 14:30    | CLEAR                |
-| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR                |
-| Design Review   | 0    | —                   | NOT YET RUN          |
+| Review          | Runs | Last Run            | Status    | Required |
+|-----------------|------|---------------------|-----------|----------|
+| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR     | YES      |
+| CEO Review      | 0    | —                   | —         | no       |
+| Design Review   | 0    | —                   | —         | no       |
 +--------------------------------------------------------------------+
-| VERDICT: 2/3 CLEAR — Design Review not yet run                     |
+| VERDICT: CLEARED — Eng Review passed                               |
 +====================================================================+
 \`\`\`

+**Review tiers:**
+- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
+- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
+- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
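For concreteness, here is a minimal TypeScript sketch of the parsing the dashboard describes: pick the latest entry per skill from the JSONL log, then derive the verdict. The `ReviewEntry` field names are inferred from the JSONL examples elsewhere in this diff; they are assumptions, not a documented schema.

```ts
// Assumed shape of one line in $BRANCH-reviews.jsonl.
interface ReviewEntry { skill: string; timestamp: string; status?: string; }

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Latest entry per skill; ISO-8601 timestamps compare correctly as strings.
function latestBySkill(jsonl: string): Map<string, ReviewEntry> {
  const latest = new Map<string, ReviewEntry>();
  for (const line of jsonl.split("\n").filter(Boolean)) {
    const entry = JSON.parse(line) as ReviewEntry;
    const prev = latest.get(entry.skill);
    if (!prev || entry.timestamp > prev.timestamp) latest.set(entry.skill, entry);
  }
  return latest;
}

// Only the Eng Review gates the verdict; CEO/Design are context-only.
function verdict(latest: Map<string, ReviewEntry>, skipEngReview: boolean): string {
  if (skipEngReview) return "CLEARED — Eng Review skipped (global)";
  const eng = latest.get("plan-eng-review");
  const fresh = eng && Date.now() - Date.parse(eng.timestamp) < WEEK_MS;
  return fresh && eng!.status === "clean"
    ? "CLEARED — Eng Review passed"
    : "NOT CLEARED — Eng Review missing, stale, or has open issues";
}
```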
+
 **Verdict logic:**
-- **CLEARED TO SHIP (3/3)**: All three have >= 1 entry within 7 days AND most recent status is "clean"
-- **N/3 CLEAR**: Show count and list which are missing, have open issues, or are stale (>7 days)
-- Informational only — does NOT block.`;
+- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
+- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
+- CEO and Design reviews are shown for context but never block shipping
+- If \`skip_eng_review\` config is \`true\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED`;
+}
+
+function generateTestBootstrap(): string {
+  return `## Test Framework Bootstrap
+
+**Detect existing test framework and project runtime:**
+
+\`\`\`bash
+# Detect project runtime
+[ -f Gemfile ] && echo "RUNTIME:ruby"
+[ -f package.json ] && echo "RUNTIME:node"
+{ [ -f requirements.txt ] || [ -f pyproject.toml ]; } && echo "RUNTIME:python"
+[ -f go.mod ] && echo "RUNTIME:go"
+[ -f Cargo.toml ] && echo "RUNTIME:rust"
+[ -f composer.json ] && echo "RUNTIME:php"
+[ -f mix.exs ] && echo "RUNTIME:elixir"
+# Detect sub-frameworks
+[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails"
+[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs"
+# Check for existing test infrastructure
+ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini conftest.py phpunit.xml 2>/dev/null
+ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
+# Check opt-out marker
+[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED"
+\`\`\`
+
+**If test framework detected** (config files or test directories found):
+Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap."
+Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns).
+Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.**
+
+**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.**
+
+**If NO runtime detected** (no config files found): Use AskUserQuestion:
+"I couldn't detect your project's language. What runtime are you using?"
+Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests.
+If user picks H → write \`.gstack/no-test-bootstrap\` and continue without tests.
+
+**If runtime detected but no test framework — bootstrap:**
+
+### B2. Research best practices
+
+Use WebSearch to find current best practices for the detected runtime:
+- \`"[runtime] best test framework 2025 2026"\`
+- \`"[framework A] vs [framework B] comparison"\`
+
+If WebSearch is unavailable, use this built-in knowledge table:
+
+| Runtime | Primary recommendation | Alternative |
+|---------|----------------------|-------------|
+| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers |
+| Node.js | vitest + @testing-library | jest + @testing-library |
+| Next.js | vitest + @testing-library/react + playwright | jest + cypress |
+| Python | pytest + pytest-cov | unittest |
+| Go | stdlib testing + testify | stdlib only |
+| Rust | cargo test (built-in) + mockall | — |
+| PHP | phpunit + mockery | pest |
+| Elixir | ExUnit (built-in) + ex_machina | — |
+
+### B3. Framework selection
+
+Use AskUserQuestion:
+"I detected this is a [Runtime/Framework] project with no test framework. 
I researched current best practices. Here are the options: +A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e +B) [Alternative] — [rationale]. Includes: [packages] +C) Skip — don't set up testing right now +RECOMMENDATION: Choose A because [reason based on project context]" + +If user picks C → write \`.gstack/no-test-bootstrap\`. Tell user: "If you change your mind later, delete \`.gstack/no-test-bootstrap\` and re-run." Continue without tests. + +If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially. + +### B4. Install and configure + +1. Install the chosen packages (npm/bun/gem/pip/etc.) +2. Create minimal config file +3. Create directory structure (test/, spec/, etc.) +4. Create one example test matching the project's code to verify setup works + +If package installation fails → debug once. If still failing → revert with \`git checkout -- package.json package-lock.json\` (or equivalent for the runtime). Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** \`git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10\` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never \`expect(x).toBeDefined()\` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +\`\`\`bash +# Run the full test suite to confirm everything works +{detected test command} +\`\`\` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. CI/CD pipeline + +\`\`\`bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +\`\`\` + +If \`.github/\` exists (or no CI detected — default to GitHub Actions): +Create \`.github/workflows/test.yml\` with: +- \`runs-on: ubuntu-latest\` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) +- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a \`## Testing\` section → skip. Don't duplicate. 
+
+Append a \`## Testing\` section:
+- Run command and test directory
+- Reference to TESTING.md
+- Test expectations:
+  - 100% test coverage is the goal — tests make vibe coding safe
+  - When writing new functions, write a corresponding test
+  - When fixing a bug, write a regression test
+  - When adding error handling, write a test that triggers the error
+  - When adding a conditional (if/else, switch), write tests for BOTH paths
+  - Never commit code that makes existing tests fail
+
+### B8. Commit
+
+\`\`\`bash
+git status --porcelain
+\`\`\`
+
+Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created):
+\`git commit -m "chore: bootstrap test framework ({framework name})"\`
+
+---`;
+}
+
 const RESOLVERS: Record<string, () => string> = {
@@ -880,6 +1082,7 @@ const RESOLVERS: Record<string, () => string> = {
   QA_METHODOLOGY: generateQAMethodology,
   DESIGN_METHODOLOGY: generateDesignMethodology,
   REVIEW_DASHBOARD: generateReviewDashboard,
+  TEST_BOOTSTRAP: generateTestBootstrap,
 };

 // ─── Template Processing ────────────────────────────────────
diff --git a/setup-browser-cookies/SKILL.md b/setup-browser-cookies/SKILL.md
index 984df045..b0c9c195 100644
--- a/setup-browser-cookies/SKILL.md
+++ b/setup-browser-cookies/SKILL.md
@@ -25,22 +25,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
 _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
 _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
 echo "BRANCH: $_BRANCH"
+_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
+echo "LAKE_INTRO: $_LAKE_SEEN"
```

If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue.

+If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
+Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
+thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean"
+Then offer to open the essay in their default browser:
+
+```bash
+open https://garryslist.org/posts/boil-the-ocean
+touch ~/.gstack/.completeness-intro-seen
+```
+
+Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once.
+
 ## AskUserQuestion Format

 **ALWAYS follow this structure for every AskUserQuestion call:**

 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called.
-3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]`
-4. **Options:** Lettered options: `A) ... B) ... C) ...`
+3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. 
Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. 
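As a worked example of the dual-estimate rule, here is a small TypeScript sketch that derives the CC+gstack figure from a human-team estimate using the compression table above. The helper and its API are hypothetical; only the ratios are copied from the reference table.

```ts
// Hypothetical helper, not part of gstack. Ratios copied from the
// compression reference table in the preamble above.
const COMPRESSION: Record<string, number> = {
  "boilerplate": 100,
  "test-writing": 50,
  "feature": 30,
  "bugfix-with-regression-test": 20,
  "architecture": 5,
  "research": 3,
};

// humanHours: the human-team estimate; returns the "(human: ~X / CC: ~Y)" string.
function dualEstimate(taskType: string, humanHours: number): string {
  const ratio = COMPRESSION[taskType] ?? 10; // assumed default for unlisted task types
  const ccHours = humanHours / ratio;
  const cc = ccHours >= 1
    ? `~${Math.round(ccHours)} h`
    : `~${Math.max(5, Math.round(ccHours * 60))} min`; // floor tiny estimates at 5 min
  return `(human: ~${humanHours} h / CC: ${cc})`;
}

// Example: a 1-week (40 h) feature compresses at ~30x to roughly 1.3 h of CC time.
console.log(dualEstimate("feature", 40)); // (human: ~40 h / CC: ~1 h)
```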
diff --git a/ship/SKILL.md b/ship/SKILL.md index 2d81389a..afd0404c 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -11,6 +11,7 @@ allowed-tools: - Grep - Glob - AskUserQuestion + - WebSearch --- @@ -27,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. 
ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -146,6 +186,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat - Multi-file changesets (auto-split into bisectable commits) - TODOS.md completed-item detection (auto-mark) - Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically) +- Test coverage gaps (auto-generate and commit, or flag in PR body) --- @@ -161,11 +202,13 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat ## Review Readiness Dashboard -After completing the review, read the review log to display the dashboard. +After completing the review, read the review log and config to display the dashboard. ```bash eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) cat ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_REVIEWS" +echo "---CONFIG---" +~/.claude/skills/gstack/bin/gstack-config get skip_eng_review 2>/dev/null || echo "false" ``` Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review). Ignore entries with timestamps older than 7 days. Display: @@ -174,25 +217,48 @@ Parse the output. 
Find the most recent entry for each skill (plan-ceo-review, pl
 +====================================================================+
 | REVIEW READINESS DASHBOARD                                        |
 +====================================================================+
-| Review          | Runs | Last Run            | Status               |
-|-----------------|------|---------------------|----------------------|
-| CEO Review      | 1    | 2026-03-16 14:30    | CLEAR                |
-| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR                |
-| Design Review   | 0    | —                   | NOT YET RUN          |
+| Review          | Runs | Last Run            | Status    | Required |
+|-----------------|------|---------------------|-----------|----------|
+| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR     | YES      |
+| CEO Review      | 0    | —                   | —         | no       |
+| Design Review   | 0    | —                   | —         | no       |
 +--------------------------------------------------------------------+
-| VERDICT: 2/3 CLEAR — Design Review not yet run                     |
+| VERDICT: CLEARED — Eng Review passed                               |
 +====================================================================+
 ```

-**Verdict logic:**
-- **CLEARED TO SHIP (3/3)**: All three have >= 1 entry within 7 days AND most recent status is "clean"
-- **N/3 CLEAR**: Show count and list which are missing, have open issues, or are stale (>7 days)
-- Informational only — does NOT block.
+**Review tiers:**
+- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
+- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
+- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.

-If the verdict is NOT "CLEARED TO SHIP (3/3)", use AskUserQuestion:
-- Show which reviews are missing or have open issues
-- RECOMMENDATION: Choose B (run missing reviews first) unless the change is trivial
-- Options: A) Ship anyway B) Abort — run missing review(s) first C) Reviews not relevant for this change
+**Verdict logic:**
+- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)
+- **NOT CLEARED**: Eng Review missing, stale (>7 days), or has open issues
+- CEO and Design reviews are shown for context but never block shipping
+- If `skip_eng_review` config is `true`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED
+
+If the Eng Review is NOT "CLEAR":
+
+1. **Check for a prior override on this branch:**
+   ```bash
+   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   grep '"skill":"ship-review-override"' ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_OVERRIDE"
+   ```
+   If an override exists, display the dashboard and note "Review gate previously accepted — continuing." Do NOT ask again.
+
+2. **If no override exists,** use AskUserQuestion:
+   - Show that Eng Review is missing or has open issues
+   - RECOMMENDATION: Choose C if the change is obviously trivial (< 20 lines, typo fix, config-only); Choose B for larger changes
+   - Options: A) Ship anyway B) Abort — run /plan-eng-review first C) Change is too small to need eng review
+   - If CEO/Design reviews are missing, mention them as informational ("CEO Review not run — recommended for product changes") but do NOT block or recommend aborting for them
+
+3. **If the user chooses A or C,** persist the decision so future `/ship` runs on this branch skip the gate:
+   ```bash
+   eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
+   echo '{"skill":"ship-review-override","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","decision":"USER_CHOICE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+   ```
+   Substitute USER_CHOICE with "ship_anyway" or "not_relevant".
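To make the persistence contract above concrete, here is a minimal TypeScript sketch of writing and checking the override record. The field names follow the JSONL line shown in the bash block; the helper functions themselves are illustrative, not gstack code.

```ts
import * as fs from "node:fs";

// Shape of the override record appended above (one JSON object per line).
interface ShipOverride {
  skill: "ship-review-override";
  timestamp: string;                       // ISO-8601 UTC
  decision: "ship_anyway" | "not_relevant";
}

// Append an override to this branch's review log.
function recordOverride(logPath: string, decision: ShipOverride["decision"]): void {
  const entry: ShipOverride = {
    skill: "ship-review-override",
    timestamp: new Date().toISOString(),
    decision,
  };
  fs.appendFileSync(logPath, JSON.stringify(entry) + "\n");
}

// True if a prior run on this branch already accepted the gate.
function hasOverride(logPath: string): boolean {
  if (!fs.existsSync(logPath)) return false;
  return fs.readFileSync(logPath, "utf-8")
    .split("\n")
    .some((line) => line.includes('"skill":"ship-review-override"'));
}
```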

 ---

@@ -210,6 +276,163 @@ git fetch origin && git merge origin/<base> --no-edit

 ---

+## Step 2.5: Test Framework Bootstrap
+
+## Test Framework Bootstrap
+
+**Detect existing test framework and project runtime:**
+
+```bash
+# Detect project runtime
+[ -f Gemfile ] && echo "RUNTIME:ruby"
+[ -f package.json ] && echo "RUNTIME:node"
+{ [ -f requirements.txt ] || [ -f pyproject.toml ]; } && echo "RUNTIME:python"
+[ -f go.mod ] && echo "RUNTIME:go"
+[ -f Cargo.toml ] && echo "RUNTIME:rust"
+[ -f composer.json ] && echo "RUNTIME:php"
+[ -f mix.exs ] && echo "RUNTIME:elixir"
+# Detect sub-frameworks
+[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails"
+[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs"
+# Check for existing test infrastructure
+ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini conftest.py phpunit.xml 2>/dev/null
+ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
+# Check opt-out marker
+[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED"
+```
+
+**If test framework detected** (config files or test directories found):
+Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap."
+Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns).
+Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.**
+
+**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.**
+
+**If NO runtime detected** (no config files found): Use AskUserQuestion:
+"I couldn't detect your project's language. What runtime are you using?"
+Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests.
+If user picks H → write `.gstack/no-test-bootstrap` and continue without tests.
+
+**If runtime detected but no test framework — bootstrap:**
+
+### B2. Research best practices
+
+Use WebSearch to find current best practices for the detected runtime:
+- `"[runtime] best test framework 2025 2026"`
+- `"[framework A] vs [framework B] comparison"`
+
+If WebSearch is unavailable, use this built-in knowledge table:
+
+| Runtime | Primary recommendation | Alternative |
+|---------|----------------------|-------------|
+| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers |
+| Node.js | vitest + @testing-library | jest + @testing-library |
+| Next.js | vitest + @testing-library/react + playwright | jest + cypress |
+| Python | pytest + pytest-cov | unittest |
+| Go | stdlib testing + testify | stdlib only |
+| Rust | cargo test (built-in) + mockall | — |
+| PHP | phpunit + mockery | pest |
+| Elixir | ExUnit (built-in) + ex_machina | — |
+
+### B3. Framework selection
+
+Use AskUserQuestion:
+"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options:
+A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e
+B) [Alternative] — [rationale].
Includes: [packages] +C) Skip — don't set up testing right now +RECOMMENDATION: Choose A because [reason based on project context]" + +If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests. + +If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially. + +### B4. Install and configure + +1. Install the chosen packages (npm/bun/gem/pip/etc.) +2. Create minimal config file +3. Create directory structure (test/, spec/, etc.) +4. Create one example test matching the project's code to verify setup works + +If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +```bash +# Run the full test suite to confirm everything works +{detected test command} +``` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. CI/CD pipeline + +```bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +``` + +If `.github/` exists (or no CI detected — default to GitHub Actions): +Create `.github/workflows/test.yml` with: +- `runs-on: ubuntu-latest` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) +- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate. 
+
+Append a `## Testing` section:
+- Run command and test directory
+- Reference to TESTING.md
+- Test expectations:
+  - 100% test coverage is the goal — tests make vibe coding safe
+  - When writing new functions, write a corresponding test
+  - When fixing a bug, write a regression test
+  - When adding error handling, write a test that triggers the error
+  - When adding a conditional (if/else, switch), write tests for BOTH paths
+  - Never commit code that makes existing tests fail
+
+### B8. Commit
+
+```bash
+git status --porcelain
+```
+
+Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created):
+`git commit -m "chore: bootstrap test framework ({framework name})"`
+
+---
+
+---
+
 ## Step 3: Run tests (on merged code)

 **Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls
@@ -294,6 +517,144 @@ If multiple suites need to run, run them sequentially (each needs a test lane).

 ---

+## Step 3.4: Test Coverage Audit
+
+100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.
+
+**0. Before/after test count:**
+
+```bash
+# Count test files before any generation
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
+```
+
+Store this number for the PR body.
+
+**1. Trace every codepath changed** using `git diff origin/<base>...HEAD`:
+
+Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution:
+
+1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context.
+2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch:
+   - Where does input come from? (request params, props, database, API call)
+   - What transforms it? (validation, mapping, computation)
+   - Where does it go? (database write, API response, rendered output, side effect)
+   - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection)
+3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing:
+   - Every function/method that was added or modified
+   - Every conditional branch (if/else, switch, ternary, guard clause, early return)
+   - Every error path (try/catch, rescue, error boundary, fallback)
+   - Every call to another function (trace into it — does IT have untested branches?)
+   - Every edge: what happens with null input? Empty array? Invalid type?
+
+This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test.
+
+**2. Map user flows, interactions, and error states:**
+
+Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through:
+
+- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test.
+- **Interaction edge cases:** What happens when the user does something unexpected?
+ - Double-click/rapid resubmit + - Navigate away mid-operation (back button, close tab, click another link) + - Submit with stale data (page sat open for 30 minutes, session expired) + - Slow connection (API takes 10 seconds — what does the user see?) + - Concurrent actions (two tabs, same form) +- **Error states the user can see:** For every error the code handles, what does the user actually experience? + - Is there a clear error message or a silent failure? + - Can the user recover (retry, go back, fix input) or are they stuck? + - What happens with no network? With a 500 from the API? With invalid data from the server? +- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input? + +Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else. + +**3. Check each branch against existing tests:** + +Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it: +- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb` +- An if/else → look for tests covering BOTH the true AND false path +- An error handler → look for a test that triggers that specific error condition +- A call to `helperFn()` that has its own branches → those branches need tests too +- A user flow → look for an integration or E2E test that walks through the journey +- An interaction edge case → look for a test that simulates the unexpected action + +Quality scoring rubric: +- ★★★ Tests behavior with edge cases AND error paths +- ★★ Tests correct behavior, happy path only +- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw") + +**4. Output ASCII coverage diagram:** + +Include BOTH code paths and user flows in the same diagram: + +``` +CODE PATH COVERAGE +=========================== +[+] src/services/billing.ts + │ + ├── processPayment() + │ ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42 + │ ├── [GAP] Network timeout — NO TEST + │ └── [GAP] Invalid currency — NO TEST + │ + └── refundPayment() + ├── [★★ TESTED] Full refund — billing.test.ts:89 + └── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101 + +USER FLOW COVERAGE +=========================== +[+] Payment checkout flow + │ + ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15 + ├── [GAP] Double-click submit — NO TEST + ├── [GAP] Navigate away during payment — NO TEST + └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40 + +[+] Error states + │ + ├── [★★ TESTED] Card declined message — billing.test.ts:58 + ├── [GAP] Network timeout UX (what does user see?) — NO TEST + └── [GAP] Empty cart submission — NO TEST + +───────────────────────────────── +COVERAGE: 5/12 paths tested (42%) + Code paths: 3/5 (60%) + User flows: 2/7 (29%) +QUALITY: ★★★: 2 ★★: 2 ★: 1 +GAPS: 7 paths need tests +───────────────────────────────── +``` + +**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue. + +**5. Generate tests for uncovered paths:** + +If test framework detected (or bootstrapped in Step 2.5): +- Prioritize error handlers and edge cases first (happy paths are more likely already tested) +- Read 2-3 existing test files to match conventions exactly +- Generate unit tests. Mock all external dependencies (DB, API, Redis). 
+- Write tests that exercise the specific uncovered path with real assertions +- Run each test. Passes → commit as `test: coverage for {feature}` +- Fails → fix once. Still fails → revert, note gap in diagram. + +Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap. + +If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured." + +**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit." + +**6. After-count and coverage summary:** + +```bash +# Count test files after generation +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l +``` + +For PR body: `Tests: {before} → {after} (+{delta} new)` +Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.` + +--- + ## Step 3.5: Pre-Landing Review Review the diff for structural issues that tests don't catch. @@ -544,6 +905,10 @@ gh pr create --base --title ": " --body "$(cat <<'EOF' ## Summary +## Test Coverage + + + ## Pre-Landing Review @@ -586,4 +951,5 @@ EOF - **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done. - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies. - **Never push without fresh verification evidence.** If code changed after Step 3 tests, re-run before pushing. +- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests. - **The goal is: user says `/ship`, next thing they see is the review + PR URL.** diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index 0b9f1cb5..a911210d 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -11,6 +11,7 @@ allowed-tools: - Grep - Glob - AskUserQuestion + - WebSearch --- {{PREAMBLE}} @@ -39,6 +40,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat - Multi-file changesets (auto-split into bisectable commits) - TODOS.md completed-item detection (auto-mark) - Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically) +- Test coverage gaps (auto-generate and commit, or flag in PR body) --- @@ -54,10 +56,27 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat {{REVIEW_DASHBOARD}} -If the verdict is NOT "CLEARED TO SHIP (3/3)", use AskUserQuestion: -- Show which reviews are missing or have open issues -- RECOMMENDATION: Choose B (run missing reviews first) unless the change is trivial -- Options: A) Ship anyway B) Abort — run missing review(s) first C) Reviews not relevant for this change +If the Eng Review is NOT "CLEAR": + +1. **Check for a prior override on this branch:** + ```bash + eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) + grep '"skill":"ship-review-override"' ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_OVERRIDE" + ``` + If an override exists, display the dashboard and note "Review gate previously accepted — continuing." Do NOT ask again. + +2. 
**If no override exists,** use AskUserQuestion: + - Show that Eng Review is missing or has open issues + - RECOMMENDATION: Choose C if the change is obviously trivial (< 20 lines, typo fix, config-only); Choose B for larger changes + - Options: A) Ship anyway B) Abort — run /plan-eng-review first C) Change is too small to need eng review + - If CEO/Design reviews are missing, mention them as informational ("CEO Review not run — recommended for product changes") but do NOT block or recommend aborting for them + +3. **If the user chooses A or C,** persist the decision so future `/ship` runs on this branch skip the gate: + ```bash + eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) + echo '{"skill":"ship-review-override","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","decision":"USER_CHOICE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl + ``` + Substitute USER_CHOICE with "ship_anyway" or "not_relevant". --- @@ -75,6 +94,12 @@ git fetch origin && git merge origin/ --no-edit --- +## Step 2.5: Test Framework Bootstrap + +{{TEST_BOOTSTRAP}} + +--- + ## Step 3: Run tests (on merged code) **Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls @@ -159,6 +184,144 @@ If multiple suites need to run, run them sequentially (each needs a test lane). --- +## Step 3.4: Test Coverage Audit + +100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned. + +**0. Before/after test count:** + +```bash +# Count test files before any generation +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l +``` + +Store this number for the PR body. + +**1. Trace every codepath changed** using `git diff origin/...HEAD`: + +Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution: + +1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context. +2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch: + - Where does input come from? (request params, props, database, API call) + - What transforms it? (validation, mapping, computation) + - Where does it go? (database write, API response, rendered output, side effect) + - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection) +3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing: + - Every function/method that was added or modified + - Every conditional branch (if/else, switch, ternary, guard clause, early return) + - Every error path (try/catch, rescue, error boundary, fallback) + - Every call to another function (trace into it — does IT have untested branches?) + - Every edge: what happens with null input? Empty array? Invalid type? + +This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test. + +**2. Map user flows, interactions, and error states:** + +Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through: + +- **User flows:** What sequence of actions does a user take that touches this code? 
Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test. +- **Interaction edge cases:** What happens when the user does something unexpected? + - Double-click/rapid resubmit + - Navigate away mid-operation (back button, close tab, click another link) + - Submit with stale data (page sat open for 30 minutes, session expired) + - Slow connection (API takes 10 seconds — what does the user see?) + - Concurrent actions (two tabs, same form) +- **Error states the user can see:** For every error the code handles, what does the user actually experience? + - Is there a clear error message or a silent failure? + - Can the user recover (retry, go back, fix input) or are they stuck? + - What happens with no network? With a 500 from the API? With invalid data from the server? +- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input? + +Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else. + +**3. Check each branch against existing tests:** + +Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it: +- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb` +- An if/else → look for tests covering BOTH the true AND false path +- An error handler → look for a test that triggers that specific error condition +- A call to `helperFn()` that has its own branches → those branches need tests too +- A user flow → look for an integration or E2E test that walks through the journey +- An interaction edge case → look for a test that simulates the unexpected action + +Quality scoring rubric: +- ★★★ Tests behavior with edge cases AND error paths +- ★★ Tests correct behavior, happy path only +- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw") + +**4. Output ASCII coverage diagram:** + +Include BOTH code paths and user flows in the same diagram: + +``` +CODE PATH COVERAGE +=========================== +[+] src/services/billing.ts + │ + ├── processPayment() + │ ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42 + │ ├── [GAP] Network timeout — NO TEST + │ └── [GAP] Invalid currency — NO TEST + │ + └── refundPayment() + ├── [★★ TESTED] Full refund — billing.test.ts:89 + └── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101 + +USER FLOW COVERAGE +=========================== +[+] Payment checkout flow + │ + ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15 + ├── [GAP] Double-click submit — NO TEST + ├── [GAP] Navigate away during payment — NO TEST + └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40 + +[+] Error states + │ + ├── [★★ TESTED] Card declined message — billing.test.ts:58 + ├── [GAP] Network timeout UX (what does user see?) — NO TEST + └── [GAP] Empty cart submission — NO TEST + +───────────────────────────────── +COVERAGE: 5/12 paths tested (42%) + Code paths: 3/5 (60%) + User flows: 2/7 (29%) +QUALITY: ★★★: 2 ★★: 2 ★: 1 +GAPS: 7 paths need tests +───────────────────────────────── +``` + +**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue. + +**5. 
Generate tests for uncovered paths:**
+
+If test framework detected (or bootstrapped in Step 2.5):
+- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
+- Read 2-3 existing test files to match conventions exactly
+- Generate unit tests. Mock all external dependencies (DB, API, Redis).
+- Write tests that exercise the specific uncovered path with real assertions
+- Run each test. Passes → commit as `test: coverage for {feature}`
+- Fails → fix once. Still fails → revert, note gap in diagram.
+
+Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap.
+
+If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."
+
+**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit."
+
+**6. After-count and coverage summary:**
+
+```bash
+# Count test files after generation
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
+```
+
+For PR body: `Tests: {before} → {after} (+{delta} new)`
+Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`
+
+---
+
 ## Step 3.5: Pre-Landing Review

 Review the diff for structural issues that tests don't catch.
@@ -409,6 +572,10 @@ gh pr create --base <base> --title "<type>: <description>" --body "$(cat <<'EOF'
 ## Summary

+## Test Coverage
+
+
+
 ## Pre-Landing Review
@@ -451,4 +618,5 @@ EOF
 - **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done.
 - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
 - **Never push without fresh verification evidence.** If code changed after Step 3 tests, re-run before pushing.
+- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests.
 - **The goal is: user says `/ship`, next thing they see is the review + PR URL.**
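The summary block in step 6 can be computed mechanically from the diagram. Here is a minimal TypeScript sketch; the `PathRecord` shape for one diagram entry is an assumption introduced for illustration, not part of gstack.

```ts
// Assumed record type for one branch or user flow in the coverage diagram.
interface PathRecord {
  kind: "code" | "flow";
  label: string;
  tested: boolean;
  quality?: 1 | 2 | 3;   // star rating when tested
}

// Produces the summary block's numbers: totals, per-kind splits, and gap count.
function summarize(paths: PathRecord[]): string {
  const pct = (n: number, d: number) => (d === 0 ? 100 : Math.round((100 * n) / d));
  const tested = paths.filter((p) => p.tested).length;
  const code = paths.filter((p) => p.kind === "code");
  const flows = paths.filter((p) => p.kind === "flow");
  const codeTested = code.filter((p) => p.tested).length;
  const flowTested = flows.filter((p) => p.tested).length;
  return [
    `COVERAGE: ${tested}/${paths.length} paths tested (${pct(tested, paths.length)}%)`,
    `  Code paths: ${codeTested}/${code.length} (${pct(codeTested, code.length)}%)`,
    `  User flows: ${flowTested}/${flows.length} (${pct(flowTested, flows.length)}%)`,
    `GAPS: ${paths.length - tested} paths need tests`,
  ].join("\n");
}
```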
diff --git a/test/gen-skill-docs.test.ts b/test/gen-skill-docs.test.ts
index 26de63db..c3861e8d 100644
--- a/test/gen-skill-docs.test.ts
+++ b/test/gen-skill-docs.test.ts
@@ -343,9 +343,10 @@ describe('REVIEW_DASHBOARD resolver', () => {
   test('resolver output contains key dashboard elements', () => {
     const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
     expect(content).toContain('VERDICT');
-    expect(content).toContain('CLEARED TO SHIP');
-    expect(content).toContain('NOT YET RUN');
+    expect(content).toContain('VERDICT: CLEARED');
+    expect(content).toContain('Eng Review');
     expect(content).toContain('7 days');
     expect(content).toContain('Design Review');
+    expect(content).toContain('skip_eng_review');
   });
 });
diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts
index 4378c322..021e41da 100644
--- a/test/skill-e2e.test.ts
+++ b/test/skill-e2e.test.ts
@@ -852,6 +852,89 @@ Focus on reviewing the plan content: architecture, error handling, security, and
 }, 420_000);
 });

+// --- Plan CEO Review (SELECTIVE EXPANSION) E2E ---
+
+describeE2E('Plan CEO Review SELECTIVE EXPANSION E2E', () => {
+  let planDir: string;
+
+  beforeAll(() => {
+    planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-ceo-sel-'));
+    const { spawnSync } = require('child_process');
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
+
+    run('git', ['init']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+
+    fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add User Dashboard
+
+## Context
+We're building a new user dashboard that shows recent activity, notifications, and quick actions.
+
+## Changes
+1. New React component \`UserDashboard\` in \`src/components/\`
+2. REST API endpoint \`GET /api/dashboard\` returning user stats
+3. PostgreSQL query for activity aggregation
+4. Redis cache layer for dashboard data (5min TTL)
+
+## Architecture
+- Frontend: React + TailwindCSS
+- Backend: Express.js REST API
+- Database: PostgreSQL with existing user/activity tables
+- Cache: Redis for dashboard aggregates
+
+## Open questions
+- Should we use WebSocket for real-time updates?
+- How do we handle users with 100k+ activity records?
+`);
+
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'add plan']);
+
+    fs.mkdirSync(path.join(planDir, 'plan-ceo-review'), { recursive: true });
+    fs.copyFileSync(
+      path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
+      path.join(planDir, 'plan-ceo-review', 'SKILL.md'),
+    );
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
+  });
+
+  test('/plan-ceo-review SELECTIVE EXPANSION produces structured review output', async () => {
+    const result = await runSkillTest({
+      prompt: `Read plan-ceo-review/SKILL.md for the review workflow.
+
+Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration or system audit steps.
+
+Choose SELECTIVE EXPANSION mode. Skip any AskUserQuestion calls — this is non-interactive.
+For the cherry-pick ceremony, accept all expansion proposals automatically.
+
+Write your complete review directly to ${planDir}/review-output-selective.md
+
+Focus on reviewing the plan content: architecture, error handling, security, and performance.`,
+      workingDirectory: planDir,
+      maxTurns: 15,
+      timeout: 360_000,
+      testName: 'plan-ceo-review-selective',
+      runId,
+    });
+
+    logCost('/plan-ceo-review (SELECTIVE)', result);
+    recordE2E('/plan-ceo-review-selective', 'Plan CEO Review SELECTIVE EXPANSION E2E', result, {
+      passed: ['success', 'error_max_turns'].includes(result.exitReason),
+    });
+    expect(['success', 'error_max_turns']).toContain(result.exitReason);
+
+    const reviewPath = path.join(planDir, 'review-output-selective.md');
+    if (fs.existsSync(reviewPath)) {
+      const review = fs.readFileSync(reviewPath, 'utf-8');
+      expect(review.length).toBeGreaterThan(200);
+    }
+  }, 420_000);
+});
+
 // --- Plan Eng Review E2E ---
 
 describeE2E('Plan Eng Review E2E', () => {
@@ -920,7 +1003,7 @@ Replace session-cookie auth with JWT tokens. Currently using express-session + R
 Read plan.md — that's the plan to review. This is a standalone plan document, not a codebase — skip any codebase exploration steps.
 
-Choose SMALL CHANGE mode. Skip any AskUserQuestion calls — this is non-interactive.
+Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
 
 Write your complete review directly to ${planDir}/review-output.md
 
 Focus on architecture, code quality, tests, and performance sections.`,
@@ -1321,7 +1404,7 @@ export function main() { return Dashboard(); }
 Read plan.md — that's the plan to review. This is a standalone plan with source code in app.ts and dashboard.ts.
 
-Choose SMALL CHANGE mode. Skip any AskUserQuestion calls — this is non-interactive.
+Proceed directly to the full review. Skip any AskUserQuestion calls — this is non-interactive.
 
 IMPORTANT: After your review, you MUST write the test-plan artifact as described in the "Test Plan Artifact" section of SKILL.md.
 
 The remote-slug shim is at ${planDir}/browse/bin/remote-slug.
@@ -1803,7 +1886,7 @@ Write DESIGN.md and CLAUDE.md (or update it) in the working directory.`,
 
 This is a civic tech data platform called CivicPulse. Read the README.md.
 
-DO research competitors before proposing — search for civic tech and government data platform designs. Skip the font preview page. Skip any AskUserQuestion calls — this is non-interactive.
+DO research what's out there before proposing — search for civic tech and government data platform designs. Skip the font preview page. Skip any AskUserQuestion calls — this is non-interactive.
 
 Write DESIGN.md to the working directory.`,
      workingDirectory: designDir,
@@ -2215,6 +2298,269 @@ Review the site at ${serverUrl}. Use --quick mode. Skip any AskUserQuestion call
 }, 420_000);
 });
 
+// --- Test Bootstrap E2E ---
+
+describeE2E('Test Bootstrap E2E', () => {
+  let bootstrapDir: string;
+  let bootstrapServer: ReturnType<typeof Bun.serve>;
+
+  beforeAll(() => {
+    bootstrapDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-bootstrap-'));
+    setupBrowseShims(bootstrapDir);
+
+    // Copy qa skill files
+    copyDirSync(path.join(ROOT, 'qa'), path.join(bootstrapDir, 'qa'));
+
+    // Create a minimal Node.js project with NO test framework
+    fs.writeFileSync(path.join(bootstrapDir, 'package.json'), JSON.stringify({
+      name: 'test-bootstrap-app',
+      version: '1.0.0',
+      type: 'module',
+    }, null, 2));
+
+    // Create a simple app file with a bug
+    fs.writeFileSync(path.join(bootstrapDir, 'app.js'), `
+export function add(a, b) { return a + b; }
+export function subtract(a, b) { return a - b; }
+export function divide(a, b) { return a / b; } // BUG: no zero check
+`);
+
+    // Create a simple HTML page with a bug
+    fs.writeFileSync(path.join(bootstrapDir, 'index.html'), `<!DOCTYPE html>
+<html>
+<head><title>Bootstrap Test</title></head>
+<body>
+  <h1>Test App</h1>
+  <a href="/missing-page.html">Broken Link</a>
+</body>
+</html>
+`);
+
+    // Init git repo
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: bootstrapDir, stdio: 'pipe', timeout: 5000 });
+    run('git', ['init']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'initial commit']);
+
+    // Serve from working directory
+    bootstrapServer = Bun.serve({
+      port: 0,
+      hostname: '127.0.0.1',
+      fetch(req) {
+        const url = new URL(req.url);
+        let filePath = url.pathname === '/' ? '/index.html' : url.pathname;
+        filePath = filePath.replace(/^\//, '');
+        const fullPath = path.join(bootstrapDir, filePath);
+        if (!fs.existsSync(fullPath)) {
+          return new Response('Not Found', { status: 404 });
+        }
+        const content = fs.readFileSync(fullPath, 'utf-8');
+        return new Response(content, {
+          headers: { 'Content-Type': 'text/html' },
+        });
+      },
+    });
+  });
+
+  afterAll(() => {
+    bootstrapServer?.stop();
+    try { fs.rmSync(bootstrapDir, { recursive: true, force: true }); } catch {}
+  });
+
+  test('/qa bootstrap + regression test on zero-test project', async () => {
+    const serverUrl = `http://127.0.0.1:${bootstrapServer!.port}`;
+
+    const result = await runSkillTest({
+      prompt: `You have a browse binary at ${browseBin}. Assign it to B variable like: B="${browseBin}"
+
+Read the file qa/SKILL.md for the QA workflow instructions.
+
+Run a Quick-tier QA test on ${serverUrl}
+The source code for this page is at ${bootstrapDir}/index.html — you can fix bugs there.
+Do NOT use AskUserQuestion — for any AskUserQuestion prompts, choose the RECOMMENDED option automatically.
+Write your report to ${bootstrapDir}/qa-reports/qa-report.md
+
+This project has NO test framework. When the bootstrap asks, pick vitest (option A).
+This is a test+fix loop: find bugs, fix them, write regression tests, commit each fix.`,
+      workingDirectory: bootstrapDir,
+      maxTurns: 50,
+      allowedTools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep'],
+      timeout: 420_000,
+      testName: 'qa-bootstrap',
+      runId,
+    });
+
+    logCost('/qa bootstrap', result);
+    recordE2E('/qa bootstrap + regression test', 'Test Bootstrap E2E', result, {
+      passed: ['success', 'error_max_turns'].includes(result.exitReason),
+    });
+
+    expect(['success', 'error_max_turns']).toContain(result.exitReason);
+
+    // Verify bootstrap created test infrastructure
+    const hasTestConfig = fs.existsSync(path.join(bootstrapDir, 'vitest.config.ts'))
+      || fs.existsSync(path.join(bootstrapDir, 'vitest.config.js'))
+      || fs.existsSync(path.join(bootstrapDir, 'jest.config.js'))
+      || fs.existsSync(path.join(bootstrapDir, 'jest.config.ts'));
+    console.log(`Test config created: ${hasTestConfig}`);
+
+    const hasTestingMd = fs.existsSync(path.join(bootstrapDir, 'TESTING.md'));
+    console.log(`TESTING.md created: ${hasTestingMd}`);
+
+    // Check for bootstrap commit
+    const gitLog = spawnSync('git', ['log', '--oneline', '--grep=bootstrap'], {
+      cwd: bootstrapDir, stdio: 'pipe',
+    });
+    const bootstrapCommits = gitLog.stdout.toString().trim();
+    console.log(`Bootstrap commits: ${bootstrapCommits || 'none'}`);
+
+    // Check for regression test commits
+    const regressionLog = spawnSync('git', ['log', '--oneline', '--grep=test(qa)'], {
+      cwd: bootstrapDir, stdio: 'pipe',
+    });
+    const regressionCommits = regressionLog.stdout.toString().trim();
+    console.log(`Regression test commits: ${regressionCommits || 'none'}`);
+
+    // Verify at least the bootstrap happened (fix commits are bonus)
+    const allCommits = spawnSync('git', ['log', '--oneline'], {
+      cwd: bootstrapDir, stdio: 'pipe',
+    });
+    const totalCommits = allCommits.stdout.toString().trim().split('\n').length;
+    console.log(`Total commits: ${totalCommits}`);
+    expect(totalCommits).toBeGreaterThan(1); // At least initial + bootstrap
+  }, 420_000);
+});
+
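The planted bug in this fixture, `divide` with no zero check, is exactly what the regression-test commits checked above should capture. Here is a sketch of the kind of `test(qa)` regression test the loop is expected to commit, assuming the QA fix makes `divide` throw on a zero divisor (the fixture deliberately leaves the fixed behavior open):

```typescript
// Hypothetical test(qa) regression test for the planted divide bug.
// Assumes the QA fix makes divide() throw on a zero divisor.
import { describe, test, expect } from 'vitest';
import { divide } from '../app.js';

describe('divide — QA regression', () => {
  test('rejects a zero divisor instead of returning Infinity', () => {
    expect(() => divide(4, 0)).toThrow();
  });

  test('still divides normally', () => {
    expect(divide(4, 2)).toBe(2);
  });
});
```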
+// --- Test Coverage Audit E2E ---
+
+describeE2E('Test Coverage Audit E2E', () => {
+  let coverageDir: string;
+
+  beforeAll(() => {
+    coverageDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-coverage-'));
+
+    // Copy ship skill files
+    copyDirSync(path.join(ROOT, 'ship'), path.join(coverageDir, 'ship'));
+    copyDirSync(path.join(ROOT, 'review'), path.join(coverageDir, 'review'));
+
+    // Create a Node.js project WITH test framework but coverage gaps
+    fs.writeFileSync(path.join(coverageDir, 'package.json'), JSON.stringify({
+      name: 'test-coverage-app',
+      version: '1.0.0',
+      type: 'module',
+      scripts: { test: 'echo "no tests yet"' },
+      devDependencies: { vitest: '^1.0.0' },
+    }, null, 2));
+
+    // Create vitest config
+    fs.writeFileSync(path.join(coverageDir, 'vitest.config.ts'),
+      `import { defineConfig } from 'vitest/config';\nexport default defineConfig({ test: {} });\n`);
+
+    fs.writeFileSync(path.join(coverageDir, 'VERSION'), '0.1.0.0\n');
+    fs.writeFileSync(path.join(coverageDir, 'CHANGELOG.md'), '# Changelog\n');
+
+    // Create source file with multiple code paths
+    fs.mkdirSync(path.join(coverageDir, 'src'), { recursive: true });
+    fs.writeFileSync(path.join(coverageDir, 'src', 'billing.ts'), `
+export function processPayment(amount: number, currency: string) {
+  if (amount <= 0) throw new Error('Invalid amount');
+  if (currency !== 'USD' && currency !== 'EUR') throw new Error('Unsupported currency');
+  return { status: 'success', amount, currency };
+}
+
+export function refundPayment(paymentId: string, reason: string) {
+  if (!paymentId) throw new Error('Payment ID required');
+  if (!reason) throw new Error('Reason required');
+  return { status: 'refunded', paymentId, reason };
+}
+`);
+
+    // Create a test directory with ONE test (partial coverage)
+    fs.mkdirSync(path.join(coverageDir, 'test'), { recursive: true });
+    fs.writeFileSync(path.join(coverageDir, 'test', 'billing.test.ts'), `
+import { describe, test, expect } from 'vitest';
+import { processPayment } from '../src/billing';
+
+describe('processPayment', () => {
+  test('processes valid payment', () => {
+    const result = processPayment(100, 'USD');
+    expect(result.status).toBe('success');
+  });
+  // GAP: no test for invalid amount
+  // GAP: no test for unsupported currency
+  // GAP: refundPayment not tested at all
+});
+`);
+
+    // Init git repo with main branch
+    const run = (cmd: string, args: string[]) =>
+      spawnSync(cmd, args, { cwd: coverageDir, stdio: 'pipe', timeout: 5000 });
+    run('git', ['init', '-b', 'main']);
+    run('git', ['config', 'user.email', 'test@test.com']);
+    run('git', ['config', 'user.name', 'Test']);
+    run('git', ['add', '.']);
+    run('git', ['commit', '-m', 'initial commit']);
+
+    // Create feature branch
+    run('git', ['checkout', '-b', 'feature/billing']);
+  });
+
+  afterAll(() => {
+    try { fs.rmSync(coverageDir, { recursive: true, force: true }); } catch {}
+  });
+
+  test('/ship Step 3.4 produces coverage diagram', async () => {
+    const result = await runSkillTest({
+      prompt: `Read the file ship/SKILL.md for the ship workflow instructions.
+
+You are on the feature/billing branch. The base branch is main.
+This is a test project — there is no remote, no PR to create.
+
+ONLY run Step 3.4 (Test Coverage Audit) from the ship workflow.
+Skip all other steps (tests, evals, review, version, changelog, commit, push, PR).
+
+The source code is in ${coverageDir}/src/billing.ts.
+Existing tests are in ${coverageDir}/test/billing.test.ts.
+The test command is: echo "tests pass" (mocked — just pretend tests pass).
+
+Produce the ASCII coverage diagram showing which code paths are tested and which have gaps.
+Do NOT generate new tests — just produce the diagram and coverage summary.
+Output the diagram directly.`,
+      workingDirectory: coverageDir,
+      maxTurns: 15,
+      allowedTools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep'],
+      timeout: 120_000,
+      testName: 'ship-coverage-audit',
+      runId,
+    });
+
+    logCost('/ship coverage audit', result);
+    recordE2E('/ship Step 3.4 coverage audit', 'Test Coverage Audit E2E', result, {
+      passed: result.exitReason === 'success',
+    });
+
+    expect(result.exitReason).toBe('success');
+
+    // Check output contains coverage diagram elements
+    const output = result.output || '';
+    const hasGap = output.includes('GAP') || output.includes('gap') || output.includes('NO TEST');
+    const hasTested = output.includes('TESTED') || output.includes('tested') || output.includes('✓');
+    const hasCoverage = output.includes('COVERAGE') || output.includes('coverage') || output.includes('paths tested');
+
+    console.log(`Output has GAP markers: ${hasGap}`);
+    console.log(`Output has TESTED markers: ${hasTested}`);
+    console.log(`Output has coverage summary: ${hasCoverage}`);
+
+    // At minimum, the agent should have read the source and test files
+    const readCalls = result.toolCalls.filter(tc => tc.tool === 'Read');
+    expect(readCalls.length).toBeGreaterThan(0);
+  }, 180_000);
+});
+
 // Module-level afterAll — finalize eval collector after all tests complete
 afterAll(async () => {
   if (evalCollector) {
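The three GAP comments in this fixture spell out exactly what the Step 3.4 diagram should flag. For reference, here is what gap-closing tests for this fixture would look like; the E2E test above deliberately instructs the agent not to generate them, so these are illustrative only:

```typescript
// Illustrative gap-closing tests for the billing fixture above.
import { describe, test, expect } from 'vitest';
import { processPayment, refundPayment } from '../src/billing';

describe('billing — coverage gaps from Step 3.4', () => {
  test('rejects invalid amount', () => {
    expect(() => processPayment(0, 'USD')).toThrow('Invalid amount');
  });

  test('rejects unsupported currency', () => {
    expect(() => processPayment(100, 'GBP')).toThrow('Unsupported currency');
  });

  test('refundPayment requires a payment ID and a reason', () => {
    expect(() => refundPayment('', 'duplicate charge')).toThrow('Payment ID required');
    expect(() => refundPayment('pay_123', '')).toThrow('Reason required');
    expect(refundPayment('pay_123', 'duplicate charge').status).toBe('refunded');
  });
});
```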
diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts
index a7755f13..f831b8f8 100644
--- a/test/skill-validation.test.ts
+++ b/test/skill-validation.test.ts
@@ -690,6 +690,43 @@ describe('Enum & Value Completeness in review checklist', () => {
   });
 });
 
+// --- Completeness Principle spot-check ---
+
+describe('Completeness Principle in generated SKILL.md files', () => {
+  const skillsWithPreamble = [
+    'SKILL.md', 'browse/SKILL.md', 'qa/SKILL.md',
+    'qa-only/SKILL.md',
+    'setup-browser-cookies/SKILL.md',
+    'ship/SKILL.md', 'review/SKILL.md',
+    'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
+    'retro/SKILL.md',
+    'plan-design-review/SKILL.md',
+    'qa-design-review/SKILL.md',
+    'design-consultation/SKILL.md',
+    'document-release/SKILL.md',
+  ];
+
+  for (const skill of skillsWithPreamble) {
+    test(`${skill} contains Completeness Principle section`, () => {
+      const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
+      expect(content).toContain('Completeness Principle');
+      expect(content).toContain('Boil the Lake');
+    });
+  }
+
+  test('Completeness Principle includes compression table', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    expect(content).toContain('CC+gstack');
+    expect(content).toContain('Compression');
+  });
+
+  test('Completeness Principle includes anti-patterns', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    expect(content).toContain('BAD:');
+    expect(content).toContain('Anti-patterns');
+  });
+});
+
 // --- Part 7: Planted-bug fixture validation (A4) ---
 
 describe('Planted-bug fixture validation', () => {
@@ -739,6 +776,36 @@ describe('Planted-bug fixture validation', () => {
   });
 });
 
+// --- CEO review mode validation ---
+
+describe('CEO review mode validation', () => {
+  const content = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
+
+  test('has all four CEO review modes defined', () => {
+    const modes = ['SCOPE EXPANSION', 'SELECTIVE EXPANSION', 'HOLD SCOPE', 'SCOPE REDUCTION'];
+    for (const mode of modes) {
+      expect(content).toContain(mode);
+    }
+  });
+
+  test('has CEO plan persistence step', () => {
+    expect(content).toContain('ceo-plans');
+    expect(content).toContain('status: ACTIVE');
+  });
+
+  test('has docs/designs promotion section', () => {
+    expect(content).toContain('docs/designs');
+    expect(content).toContain('PROMOTED');
+  });
+
+  test('mode quick reference has four columns', () => {
+    expect(content).toContain('EXPANSION');
+    expect(content).toContain('SELECTIVE');
+    expect(content).toContain('HOLD SCOPE');
+    expect(content).toContain('REDUCTION');
+  });
+});
+
 // --- gstack-slug helper ---
 
 describe('gstack-slug', () => {
@@ -780,3 +847,225 @@
     expect(lines[1]).toMatch(/^BRANCH=.+/);
   });
 });
+
+// --- Test Bootstrap validation ---
+
+describe('Test Bootstrap ({{TEST_BOOTSTRAP}}) integration', () => {
+  test('TEST_BOOTSTRAP resolver produces valid content', () => {
+    const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(qaContent).toContain('Test Framework Bootstrap');
+    expect(qaContent).toContain('RUNTIME:ruby');
+    expect(qaContent).toContain('RUNTIME:node');
+    expect(qaContent).toContain('RUNTIME:python');
+    expect(qaContent).toContain('no-test-bootstrap');
+    expect(qaContent).toContain('BOOTSTRAP_DECLINED');
+  });
+
+  test('TEST_BOOTSTRAP appears in qa/SKILL.md', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Test Framework Bootstrap');
+    expect(content).toContain('TESTING.md');
+    expect(content).toContain('CLAUDE.md');
+  });
+
+  test('TEST_BOOTSTRAP appears in ship/SKILL.md', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Test Framework Bootstrap');
+    expect(content).toContain('Step 2.5');
+  });
+
+  test('TEST_BOOTSTRAP appears in qa-design-review/SKILL.md', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa-design-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Test Framework Bootstrap');
+  });
+
+  test('TEST_BOOTSTRAP does NOT appear in qa-only/SKILL.md', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa-only', 'SKILL.md'), 'utf-8');
+    expect(content).not.toContain('Test Framework Bootstrap');
+    // But should have the recommendation note
+    expect(content).toContain('No test framework detected');
+    expect(content).toContain('Run `/qa` to bootstrap');
+  });
+
+  test('bootstrap includes framework knowledge table', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('vitest');
+    expect(content).toContain('minitest');
+    expect(content).toContain('pytest');
+    expect(content).toContain('cargo test');
+    expect(content).toContain('phpunit');
+    expect(content).toContain('ExUnit');
+  });
+
+  test('bootstrap includes CI/CD pipeline generation', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('.github/workflows/test.yml');
+    expect(content).toContain('GitHub Actions');
+  });
+
+  test('bootstrap includes first real tests step', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('First real tests');
+    expect(content).toContain('git log --since=30.days');
+    expect(content).toContain('Prioritize by risk');
+  });
+
+  test('bootstrap includes vibe coding philosophy', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('vibe coding');
+    expect(content).toContain('100% test coverage');
+  });
+
+  test('WebSearch is in allowed-tools for qa, ship, qa-design-review', () => {
+    const qa = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    const ship = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    const qaDesign = fs.readFileSync(path.join(ROOT, 'qa-design-review', 'SKILL.md'), 'utf-8');
+    expect(qa).toContain('WebSearch');
+    expect(ship).toContain('WebSearch');
+    expect(qaDesign).toContain('WebSearch');
+  });
+});
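The RUNTIME markers and framework table probed above imply a simple detection shape: manifest file maps to runtime, runtime maps to a recommended framework. The sketch below is inferred from the asserted marker strings and is not gstack's actual resolver code:

```typescript
// Illustrative runtime → framework detection, inferred from the markers
// asserted above (RUNTIME:node / RUNTIME:ruby / RUNTIME:python, etc.).
import * as fs from 'fs';

const RUNTIME_MARKERS: Record<string, string> = {
  'package.json': 'node',
  'Gemfile': 'ruby',
  'pyproject.toml': 'python',
  'Cargo.toml': 'rust',
  'composer.json': 'php',
  'mix.exs': 'elixir',
};

const DEFAULT_FRAMEWORK: Record<string, string> = {
  node: 'vitest',
  ruby: 'minitest',
  python: 'pytest',
  rust: 'cargo test',
  php: 'phpunit',
  elixir: 'ExUnit',
};

export function detectRuntime(dir: string): string | null {
  for (const [marker, runtime] of Object.entries(RUNTIME_MARKERS)) {
    if (fs.existsSync(`${dir}/${marker}`)) return runtime; // e.g. "RUNTIME:node"
  }
  return null; // caller asks the user, or records BOOTSTRAP_DECLINED
}

export function suggestFramework(runtime: string): string | undefined {
  return DEFAULT_FRAMEWORK[runtime];
}
```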
+
+// --- Phase 8e.5 regression test validation ---
+
+describe('Phase 8e.5 regression test generation', () => {
+  test('qa/SKILL.md contains Phase 8e.5', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('8e.5. Regression Test');
+    expect(content).toContain('test(qa): regression test');
+    expect(content).toContain('WTF-likelihood exclusion');
+  });
+
+  test('qa/SKILL.md Rule 13 is amended for regression tests', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Only modify tests when generating regression tests in Phase 8e.5');
+    expect(content).not.toContain('Never modify tests or CI configuration');
+  });
+
+  test('qa-design-review has CSS-aware Phase 8e.5 variant', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa-design-review', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('8e.5. Regression Test (design-review variant)');
+    expect(content).toContain('CSS-only');
+    expect(content).toContain('test(design): regression test');
+  });
+
+  test('regression test includes full attribution comment format', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('// Regression: ISSUE-NNN');
+    expect(content).toContain('// Found by /qa on');
+    expect(content).toContain('// Report: .gstack/qa-reports/');
+  });
+
+  test('regression test uses auto-incrementing names', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('auto-incrementing');
+    expect(content).toContain('max number + 1');
+  });
+});
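In a generated test file, the attribution format asserted above reads roughly as follows; the issue number, date, and report filename are placeholders supplied by the QA run that produced the fix:

```typescript
// Regression: ISSUE-003: divide returned Infinity instead of erroring
// Found by /qa on 2026-03-17
// Report: .gstack/qa-reports/qa-report-2.md
import { test, expect } from 'vitest';
import { divide } from '../app.js';

test('ISSUE-003: divide rejects a zero divisor', () => {
  expect(() => divide(1, 0)).toThrow();
});
```

The `-2` suffix in the report path reflects the auto-incrementing naming the tests check for (max number + 1 across sessions).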
+
+// --- Step 3.4 coverage audit validation ---
+
+describe('Step 3.4 test coverage audit', () => {
+  test('ship/SKILL.md contains Step 3.4', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Step 3.4: Test Coverage Audit');
+    expect(content).toContain('CODE PATH COVERAGE');
+  });
+
+  test('Step 3.4 includes quality scoring rubric', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('★★★');
+    expect(content).toContain('★★');
+    expect(content).toContain('edge cases AND error paths');
+    expect(content).toContain('happy path only');
+  });
+
+  test('Step 3.4 includes before/after test count', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Count test files before');
+    expect(content).toContain('Count test files after');
+  });
+
+  test('ship PR body includes Test Coverage section', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('## Test Coverage');
+  });
+
+  test('ship rules include test generation rule', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Step 3.4 generates coverage tests');
+    expect(content).toContain('Never commit failing tests');
+  });
+
+  test('Step 3.4 includes vibe coding philosophy', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('vibe coding becomes yolo coding');
+  });
+
+  test('Step 3.4 traces actual codepaths, not just syntax', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Trace every codepath');
+    expect(content).toContain('Trace data flow');
+    expect(content).toContain('Diagram the execution');
+  });
+
+  test('Step 3.4 maps user flows and interaction edge cases', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Map user flows');
+    expect(content).toContain('Interaction edge cases');
+    expect(content).toContain('Double-click');
+    expect(content).toContain('Navigate away');
+    expect(content).toContain('Error states the user can see');
+    expect(content).toContain('Empty/zero/boundary states');
+  });
+
+  test('Step 3.4 diagram includes USER FLOW COVERAGE section', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('USER FLOW COVERAGE');
+    expect(content).toContain('Code paths:');
+    expect(content).toContain('User flows:');
+  });
+});
+
+// --- Retro test health validation ---
+
+describe('Retro test health tracking', () => {
+  test('retro/SKILL.md has test health data gathering commands', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('# 10. Test file count');
+    expect(content).toContain('# 11. Regression test commits');
+    expect(content).toContain('# 12. Test files changed');
+  });
+
+  test('retro/SKILL.md has Test Health metrics row', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('Test Health');
+    expect(content).toContain('regression tests');
+  });
+
+  test('retro/SKILL.md has Test Health narrative section', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('### Test Health');
+    expect(content).toContain('Total test files');
+    expect(content).toContain('vibe coding safe');
+  });
+
+  test('retro JSON schema includes test_health field', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8');
+    expect(content).toContain('test_health');
+    expect(content).toContain('total_test_files');
+    expect(content).toContain('regression_test_commits');
+  });
+});
+
+// --- QA report template regression tests section ---
+
+describe('QA report template', () => {
+  test('qa-report-template.md has Regression Tests section', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'qa', 'templates', 'qa-report-template.md'), 'utf-8');
+    expect(content).toContain('## Regression Tests');
+    expect(content).toContain('committed / deferred / skipped');
+    expect(content).toContain('### Deferred Tests');
+    expect(content).toContain('**Precondition:**');
+  });
+});
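Pieced together from the field names the retro assertions pin down, the `test_health` fragment of the retro JSON plausibly looks like this; only the three field names are guaranteed by the tests, while the surrounding shape and values are guesses:

```typescript
// Hypothetical shape of the retro JSON's test_health field, inferred
// from the asserted field names; the values shown are illustrative.
interface RetroTestHealth {
  total_test_files: number;        // from the "# 10. Test file count" command
  regression_test_commits: number; // from the "# 11. Regression test commits" command
}

const example: { test_health: RetroTestHealth } = {
  test_health: {
    total_test_files: 42,
    regression_test_commits: 3,
  },
};
```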