mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 03:35:09 +02:00
chore: merge main (v0.7.2), resolve conflicts, bump to v0.7.3
Merge origin/main which brought proactive skill suggestions, journey-stage E2E tests, and retro timezone/midnight fixes. Resolve VERSION (0.7.3), CHANGELOG (our entry above 0.7.2), and gen-skill-docs.ts (keep both telemetry line and proactive config). Regenerate all 21 SKILL.md files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
+23
-1
@@ -1,15 +1,37 @@
|
||||
# Changelog
|
||||
|
||||
## [0.7.1] - 2026-03-18
|
||||
## [0.7.3] - 2026-03-18
|
||||
|
||||
### Added
|
||||
|
||||
- **Safety guardrails you can turn on with one command.** Say "be careful" or "safety mode" and `/careful` will warn you before any destructive command — `rm -rf`, `DROP TABLE`, force-push, `kubectl delete`, and more. You can override every warning. Common build artifact cleanups (`rm -rf node_modules`, `dist`, `.next`) are whitelisted.
|
||||
- **Lock edits to one folder with `/freeze`.** Debugging something and don't want Claude to "fix" unrelated code? `/freeze` blocks all file edits outside a directory you choose. Hard block, not just a warning. Run `/unfreeze` to remove the restriction without ending your session.
|
||||
- **`/guard` activates both at once.** One command for maximum safety when touching prod or live systems — destructive command warnings plus directory-scoped edit restrictions.
|
||||
- **`/debug` now auto-freezes edits to the module being debugged.** After forming a root cause hypothesis, `/debug` locks edits to the narrowest affected directory. No more accidental "fixes" to unrelated code during debugging.
|
||||
- **You can now see which skills you use and how often.** Every skill invocation is logged locally to `~/.gstack/analytics/skill-usage.jsonl`. Run `bun run analytics` to see your top skills, per-repo breakdown, and how often safety hooks actually catch something. Data stays on your machine.
|
||||
- **Weekly retros now include skill usage.** `/retro` shows which skills you used during the retro window alongside your usual commit analysis and metrics.
|
||||
|
||||
## [0.7.2] - 2026-03-18
|
||||
|
||||
### Fixed
|
||||
|
||||
- `/retro` date ranges now align to midnight instead of the current time. Running `/retro` at 9pm no longer silently drops the morning of the start date — you get full calendar days.
|
||||
- `/retro` timestamps now use your local timezone instead of hardcoded Pacific time. Users outside the US-West coast get correct local hours in histograms, session detection, and streak tracking.
|
||||
|
||||
## [0.7.1] - 2026-03-19
|
||||
|
||||
### Added
|
||||
|
||||
- **gstack now suggests skills at natural moments.** You don't need to know slash commands — just talk about what you're doing. Brainstorming an idea? gstack suggests `/office-hours`. Something's broken? It suggests `/debug`. Ready to deploy? It suggests `/ship`. Every workflow skill now has proactive triggers that fire when the moment is right.
|
||||
- **Lifecycle map.** gstack's root skill description now includes a developer workflow guide mapping 12 stages (brainstorm → plan → review → code → debug → test → ship → docs → retro) to the right skill. Claude sees this in every session.
|
||||
- **Opt-out with natural language.** If proactive suggestions feel too aggressive, just say "stop suggesting things" — gstack remembers across sessions. Say "be proactive again" to re-enable.
|
||||
- **11 journey-stage E2E tests.** Each test simulates a real moment in the developer lifecycle with realistic project context (plan.md, error logs, git history, code) and verifies the right skill fires from natural language alone. 11/11 pass.
|
||||
- **Trigger phrase validation.** Static tests verify every workflow skill has "Use when" and "Proactively suggest" phrases — catches regressions for free.
|
||||
|
||||
### Fixed
|
||||
|
||||
- `/debug` and `/office-hours` were completely invisible to natural language — no trigger phrases at all. Now both have full reactive + proactive triggers.
|
||||
|
||||
## [0.7.0] - 2026-03-18 — YC Office Hours
|
||||
|
||||
**`/office-hours` — sit down with a YC partner before you write a line of code.**
|
||||
|
||||
@@ -30,10 +30,11 @@ Fork it. Improve it. Make it yours. Don't player hate, appreciate.
|
||||
## Quick start: your first 10 minutes
|
||||
|
||||
1. Install gstack (30 seconds — see below)
|
||||
2. Run `/plan-ceo-review` on any feature idea
|
||||
3. Run `/review` on any branch with changes
|
||||
4. Run `/qa` on your staging URL
|
||||
5. Stop there. You'll know if this is for you.
|
||||
2. Run `/office-hours` — describe what you're building. It will reframe the problem before you write a line of code.
|
||||
3. Run `/plan-ceo-review` on any feature idea
|
||||
4. Run `/review` on any branch with changes
|
||||
5. Run `/qa` on your staging URL
|
||||
6. Stop there. You'll know if this is for you.
|
||||
|
||||
Expect first useful run in under 5 minutes on any repo with tests already set up.
|
||||
|
||||
@@ -58,79 +59,80 @@ Real files get committed to your repo (not a submodule), so `git clone` just wor
|
||||
## See it work
|
||||
|
||||
```
|
||||
You: I want to add photo upload for sellers.
|
||||
You: /plan-ceo-review
|
||||
Claude: "Photo upload" is not the feature. The real job is helping
|
||||
sellers create listings that actually sell. What if we
|
||||
auto-identify the product, pull specs and comps from the
|
||||
web, and draft the listing automatically? That's 10 stars.
|
||||
"Upload a photo" is 3 stars. Which are we building?
|
||||
[8 expansion proposals, you cherry-pick 5, defer 3 to backlog]
|
||||
You: I want to build a daily briefing app for my calendar.
|
||||
You: /office-hours
|
||||
Claude: [asks about the pain — specific examples, not hypotheticals]
|
||||
|
||||
You: /plan-design-review
|
||||
Claude: Design Score: B | AI Slop Score: C
|
||||
"Upload flow looks like a default Bootstrap form."
|
||||
[80-item audit, infers your design system, exports DESIGN.md]
|
||||
[flags 3 AI slop patterns: gradient hero, icon grid, uniform radius]
|
||||
You: Multiple Google calendars, events with stale info, wrong locations.
|
||||
Prep takes forever and the results aren't good enough...
|
||||
|
||||
Claude: I'm going to push back on the framing. You said "daily briefing
|
||||
app." But what you actually described is a personal chief of
|
||||
staff AI.
|
||||
[extracts 5 capabilities you didn't realize you were describing]
|
||||
[challenges 4 premises — you agree, disagree, or adjust]
|
||||
[generates 3 implementation approaches with effort estimates]
|
||||
RECOMMENDATION: Ship the narrowest wedge tomorrow, learn from
|
||||
real usage. The full vision is a 3-month project — start with
|
||||
the daily briefing that actually works.
|
||||
[writes design doc → feeds into downstream skills automatically]
|
||||
|
||||
You: /plan-ceo-review
|
||||
[reads the design doc, challenges scope, runs 10-section review]
|
||||
|
||||
You: /plan-eng-review
|
||||
Claude: ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐
|
||||
│ Upload │───▶│ Classify │───▶│ Enrich │───▶│ Draft │
|
||||
│ (sync) │ │ (async) │ │ (async) │ │ (async) │
|
||||
└─────────┘ └──────────┘ └──────────┘ └─────────┘
|
||||
[ASCII diagrams for every data flow, state machine, error path]
|
||||
[14-case test matrix, 6 failure modes mapped, 3 security concerns]
|
||||
[ASCII diagrams for data flow, state machines, error paths]
|
||||
[test matrix, failure modes, security concerns]
|
||||
|
||||
You: Approve plan. Exit plan mode.
|
||||
[Claude writes 2,400 lines across 11 files — models, services,
|
||||
controllers, views, migrations, and tests. ~8 minutes.]
|
||||
[writes 2,400 lines across 11 files. ~8 minutes.]
|
||||
|
||||
You: /review
|
||||
Claude: [AUTO-FIXED] Orphan S3 cleanup on failed upload
|
||||
[AUTO-FIXED] Missing index on listings.status
|
||||
[ASK] Race condition on hero image selection → You: yes
|
||||
[traces every new enum value through all switch statements]
|
||||
3 issues — 2 auto-fixed, 1 fixed.
|
||||
[AUTO-FIXED] 2 issues. [ASK] Race condition → you approve fix.
|
||||
|
||||
You: /qa https://staging.myapp.com
|
||||
Claude: [opens real browser, logs in, uploads photos, clicks through flows]
|
||||
Upload → classify → enrich → draft: end to end ✓
|
||||
Mobile: ✓ | Slow connection: ✓ | Bad image: ✓
|
||||
[finds bug: preview doesn't clear on second upload — fixes it]
|
||||
Regression test generated.
|
||||
[opens real browser, clicks through flows, finds and fixes a bug]
|
||||
|
||||
You: /ship
|
||||
Claude: Tests: 42 → 51 (+9 new)
|
||||
Coverage: 14/14 code paths (100%)
|
||||
PR: github.com/you/app/pull/42
|
||||
Tests: 42 → 51 (+9 new). PR: github.com/you/app/pull/42
|
||||
```
|
||||
|
||||
One feature. Seven commands. The agent reframed the product, ran an 80-item design audit, drew the architecture, wrote 2,400 lines of code, found a race condition I would have missed, auto-fixed two issues, opened a real browser to QA test, found and fixed a bug I didn't know about, wrote 9 tests, and generated a regression test. That is not a copilot. That is a team.
|
||||
You said "daily briefing app." The agent said "you're building a chief of staff AI" — because it listened to your pain, not your feature request. Then it challenged your premises, generated three approaches, recommended the narrowest wedge, and wrote a design doc that fed into every downstream skill. Eight commands. That is not a copilot. That is a team.
|
||||
|
||||
## The team
|
||||
## The sprint
|
||||
|
||||
gstack is a process, not a collection of tools. The skills are ordered the way a sprint runs:
|
||||
|
||||
**Think → Plan → Build → Review → Test → Ship → Reflect**
|
||||
|
||||
Each skill feeds into the next. `/office-hours` writes a design doc that `/plan-ceo-review` reads. `/plan-eng-review` writes a test plan that `/qa` picks up. `/review` catches bugs that `/ship` verifies are fixed. Nothing falls through the cracks because every step knows what came before it.
|
||||
|
||||
One sprint, one person, one feature — that takes about 30 minutes with gstack. But here's what changes everything: you can run 10-15 of these sprints in parallel. Different features, different branches, different agents — all at the same time. That is how I ship 10,000+ lines of production code per day while doing my actual job.
|
||||
|
||||
| Skill | Your specialist | What they do |
|
||||
|-------|----------------|--------------|
|
||||
| `/office-hours` | **YC Office Hours** | Start here. Six forcing questions that reframe your product before you write code. Pushes back on your framing, challenges premises, generates implementation alternatives. Design doc feeds into every downstream skill. |
|
||||
| `/plan-ceo-review` | **CEO / Founder** | Rethink the problem. Find the 10-star product hiding inside the request. Four modes: Expansion, Selective Expansion, Hold Scope, Reduction. |
|
||||
| `/plan-eng-review` | **Eng Manager** | Lock in architecture, data flow, diagrams, edge cases, and tests. Forces hidden assumptions into the open. |
|
||||
| `/plan-design-review` | **Senior Designer** | Rates each design dimension 0-10, explains what a 10 looks like, then edits the plan to get there. AI Slop detection. Interactive — one AskUserQuestion per design choice. |
|
||||
| `/design-consultation` | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. |
|
||||
| `/review` | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. |
|
||||
| `/ship` | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
|
||||
| `/browse` | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
|
||||
| `/debug` | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
|
||||
| `/design-review` | **Designer Who Codes** | Same audit as /plan-design-review, then fixes what it finds. Atomic commits, before/after screenshots. |
|
||||
| `/qa` | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
|
||||
| `/qa-only` | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
|
||||
| `/design-review` | **Designer Who Codes** | Same audit as /plan-design-review, then fixes what it finds. Atomic commits, before/after screenshots. |
|
||||
| `/setup-browser-cookies` | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
|
||||
| `/retro` | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
|
||||
| `/office-hours` | **YC Office Hours** | Two modes. Startup: six forcing questions on demand, users, and product. Builder: brainstorming for side projects, hackathons, and learning. Writes a design doc with personal observations about how you think. |
|
||||
| `/debug` | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
|
||||
| `/ship` | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
|
||||
| `/document-release` | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. |
|
||||
| `/retro` | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
|
||||
| `/browse` | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
|
||||
| `/setup-browser-cookies` | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
|
||||
|
||||
**[Deep dives with examples and philosophy for every skill →](docs/skills.md)**
|
||||
|
||||
## What's new and why it matters
|
||||
|
||||
**`/office-hours` reframes your product before you write code.** You say "daily briefing app." It listens to your actual pain, pushes back on the framing, tells you you're really building a personal chief of staff AI, challenges your premises, and generates three implementation approaches with effort estimates. The design doc it writes feeds directly into `/plan-ceo-review` and `/plan-eng-review` — so every downstream skill starts with real clarity instead of a vague feature request.
|
||||
|
||||
**Design is at the heart.** `/design-consultation` doesn't just pick fonts. It researches what's out there in your space, proposes safe choices AND creative risks, generates realistic mockups of your actual product, and writes `DESIGN.md` — and then `/design-review` and `/plan-eng-review` read what you chose. Design decisions flow through the whole system.
|
||||
|
||||
**`/qa` was a massive unlock.** It let me go from 6 to 12 parallel workers. Claude Code saying *"I SEE THE ISSUE"* and then actually fixing it, generating a regression test, and verifying the fix — that changed how I work. The agent has eyes now.
|
||||
@@ -141,13 +143,13 @@ One feature. Seven commands. The agent reframed the product, ran an 80-item desi
|
||||
|
||||
**`/document-release` is the engineer you never had.** It reads every doc file in your project, cross-references the diff, and updates everything that drifted. README, ARCHITECTURE, CONTRIBUTING, CLAUDE.md, TODOS — all kept current automatically.
|
||||
|
||||
## 10 sessions at once
|
||||
## 10-15 parallel sprints
|
||||
|
||||
gstack is powerful with one session. It is transformative with ten.
|
||||
gstack is powerful with one sprint. It is transformative with ten running at once.
|
||||
|
||||
[Conductor](https://conductor.build) runs multiple Claude Code sessions in parallel — each in its own isolated workspace. One session running `/qa` on staging, another doing `/review` on a PR, a third implementing a feature, and seven more on other branches. All at the same time.
|
||||
[Conductor](https://conductor.build) runs multiple Claude Code sessions in parallel — each in its own isolated workspace. One session running `/office-hours` on a new idea, another doing `/review` on a PR, a third implementing a feature, a fourth running `/qa` on staging, and six more on other branches. All at the same time. I regularly run 10-15 parallel sprints — that's the practical max right now.
|
||||
|
||||
One person, ten parallel agents, each with the right cognitive mode. That is a different way of building software.
|
||||
The sprint structure is what makes parallelism work. Without a process, ten agents is ten sources of chaos. With a process — think, plan, build, review, test, ship — each agent knows exactly what to do and when to stop. You manage them the way a CEO manages a team: check in on the decisions that matter, let the rest run.
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -7,6 +7,32 @@ description: |
|
||||
responsive layouts, test forms and uploads, handle dialogs, and assert element states.
|
||||
~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a
|
||||
user flow, or file a bug with evidence.
|
||||
|
||||
gstack also includes development workflow skills. When you notice the user is at
|
||||
these stages, suggest the appropriate skill:
|
||||
- Brainstorming a new idea → suggest /office-hours
|
||||
- Reviewing a plan (strategy) → suggest /plan-ceo-review
|
||||
- Reviewing a plan (architecture) → suggest /plan-eng-review
|
||||
- Reviewing a plan (design) → suggest /plan-design-review
|
||||
- Creating a design system → suggest /design-consultation
|
||||
- Debugging errors → suggest /debug
|
||||
- Testing the app → suggest /qa
|
||||
- Code review before merge → suggest /review
|
||||
- Visual design audit → suggest /design-review
|
||||
- Ready to deploy / create PR → suggest /ship
|
||||
- Post-ship doc updates → suggest /document-release
|
||||
- Weekly retrospective → suggest /retro
|
||||
|
||||
If the user pushes back on skill suggestions ("stop suggesting things",
|
||||
"I don't need suggestions", "too aggressive"):
|
||||
1. Stop suggesting for the rest of this session
|
||||
2. Run: gstack-config set proactive false
|
||||
3. Say: "Got it — I'll stop suggesting skills. Just tell me to be proactive
|
||||
again if you change your mind."
|
||||
|
||||
If the user says "be proactive again" or "turn on suggestions":
|
||||
1. Run: gstack-config set proactive true
|
||||
2. Say: "Proactive suggestions are back on."
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -32,8 +58,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"gstack","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -7,6 +7,32 @@ description: |
|
||||
responsive layouts, test forms and uploads, handle dialogs, and assert element states.
|
||||
~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a
|
||||
user flow, or file a bug with evidence.
|
||||
|
||||
gstack also includes development workflow skills. When you notice the user is at
|
||||
these stages, suggest the appropriate skill:
|
||||
- Brainstorming a new idea → suggest /office-hours
|
||||
- Reviewing a plan (strategy) → suggest /plan-ceo-review
|
||||
- Reviewing a plan (architecture) → suggest /plan-eng-review
|
||||
- Reviewing a plan (design) → suggest /plan-design-review
|
||||
- Creating a design system → suggest /design-consultation
|
||||
- Debugging errors → suggest /debug
|
||||
- Testing the app → suggest /qa
|
||||
- Code review before merge → suggest /review
|
||||
- Visual design audit → suggest /design-review
|
||||
- Ready to deploy / create PR → suggest /ship
|
||||
- Post-ship doc updates → suggest /document-release
|
||||
- Weekly retrospective → suggest /retro
|
||||
|
||||
If the user pushes back on skill suggestions ("stop suggesting things",
|
||||
"I don't need suggestions", "too aggressive"):
|
||||
1. Stop suggesting for the rest of this session
|
||||
2. Run: gstack-config set proactive false
|
||||
3. Say: "Got it — I'll stop suggesting skills. Just tell me to be proactive
|
||||
again if you change your mind."
|
||||
|
||||
If the user says "be proactive again" or "turn on suggestions":
|
||||
1. Run: gstack-config set proactive true
|
||||
2. Say: "Proactive suggestions are back on."
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
@@ -33,8 +33,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"browse","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -4,6 +4,10 @@ version: 1.0.0
|
||||
description: |
|
||||
Systematic debugging with root cause investigation. Four phases: investigate,
|
||||
analyze, hypothesize, implement. Iron Law: no fixes without root cause.
|
||||
Use when asked to "debug this", "fix this bug", "why is this broken",
|
||||
"investigate this error", or "root cause analysis".
|
||||
Proactively suggest when the user reports errors, unexpected behavior, or
|
||||
is troubleshooting why something stopped working.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -44,8 +48,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"debug","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -4,6 +4,10 @@ version: 1.0.0
|
||||
description: |
|
||||
Systematic debugging with root cause investigation. Four phases: investigate,
|
||||
analyze, hypothesize, implement. Iron Law: no fixes without root cause.
|
||||
Use when asked to "debug this", "fix this bug", "why is this broken",
|
||||
"investigate this error", or "root cause analysis".
|
||||
Proactively suggest when the user reports errors, unexpected behavior, or
|
||||
is troubleshooting why something stopped working.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
@@ -7,6 +7,8 @@ description: |
|
||||
generates font+color preview pages. Creates DESIGN.md as your project's design source
|
||||
of truth. For existing sites, use /plan-design-review to infer the system instead.
|
||||
Use when asked to "design system", "brand guidelines", or "create DESIGN.md".
|
||||
Proactively suggest when starting a new project's UI with no existing
|
||||
design system or DESIGN.md.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -36,8 +38,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"design-consultation","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -7,6 +7,8 @@ description: |
|
||||
generates font+color preview pages. Creates DESIGN.md as your project's design source
|
||||
of truth. For existing sites, use /plan-design-review to infer the system instead.
|
||||
Use when asked to "design system", "brand guidelines", or "create DESIGN.md".
|
||||
Proactively suggest when starting a new project's UI with no existing
|
||||
design system or DESIGN.md.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
@@ -7,6 +7,8 @@ description: |
|
||||
in source code, committing each fix atomically and re-verifying with before/after
|
||||
screenshots. For plan-mode design review (before implementation), use /plan-design-review.
|
||||
Use when asked to "audit the design", "visual QA", "check if it looks good", or "design polish".
|
||||
Proactively suggest when the user mentions visual inconsistencies or
|
||||
wants to polish the look of a live site.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -36,8 +38,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"design-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -7,6 +7,8 @@ description: |
|
||||
in source code, committing each fix atomically and re-verifying with before/after
|
||||
screenshots. For plan-mode design review (before implementation), use /plan-design-review.
|
||||
Use when asked to "audit the design", "visual QA", "check if it looks good", or "design polish".
|
||||
Proactively suggest when the user mentions visual inconsistencies or
|
||||
wants to polish the look of a live site.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
+160
-121
@@ -4,21 +4,21 @@ Detailed guides for every gstack skill — philosophy, workflow, and examples.
|
||||
|
||||
| Skill | Your specialist | What they do |
|
||||
|-------|----------------|--------------|
|
||||
| [`/office-hours`](#office-hours) | **YC Office Hours** | Start here. Six forcing questions that reframe your product before you write code. Pushes back on your framing, challenges premises, generates implementation alternatives. Design doc feeds into every downstream skill. |
|
||||
| [`/plan-ceo-review`](#plan-ceo-review) | **CEO / Founder** | Rethink the problem. Find the 10-star product hiding inside the request. Four modes: Expansion, Selective Expansion, Hold Scope, Reduction. |
|
||||
| [`/plan-eng-review`](#plan-eng-review) | **Eng Manager** | Lock in architecture, data flow, diagrams, edge cases, and tests. Forces hidden assumptions into the open. |
|
||||
| [`/plan-design-review`](#plan-design-review) | **Senior Designer** | Interactive plan-mode design review. Rates each dimension 0-10, explains what a 10 looks like, fixes the plan. Works in plan mode. |
|
||||
| [`/design-consultation`](#design-consultation) | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. |
|
||||
| [`/review`](#review) | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. |
|
||||
| [`/ship`](#ship) | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
|
||||
| [`/browse`](#browse) | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
|
||||
| [`/debug`](#debug) | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
|
||||
| [`/design-review`](#design-review) | **Designer Who Codes** | Live-site visual audit + fix loop. 80-item audit, then fixes what it finds. Atomic commits, before/after screenshots. |
|
||||
| [`/qa`](#qa) | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
|
||||
| [`/qa-only`](#qa) | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
|
||||
| [`/design-review`](#design-review) | **Designer Who Codes** | Live-site visual audit + fix loop. 80-item audit, then fixes what it finds. Atomic commits, before/after screenshots. |
|
||||
| [`/setup-browser-cookies`](#setup-browser-cookies) | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
|
||||
| [`/retro`](#retro) | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
|
||||
| [`/office-hours`](#office-hours) | **YC Office Hours** | Two modes. Startup: six forcing questions on demand, users, and product. Builder: brainstorming for side projects, hackathons, and learning. Writes a design doc with personal observations about how you think. |
|
||||
| [`/debug`](#debug) | **Debugger** | Systematic root-cause debugging. Iron Law: no fixes without investigation. Traces data flow, tests hypotheses, stops after 3 failed fixes. |
|
||||
| [`/ship`](#ship) | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
|
||||
| [`/document-release`](#document-release) | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. |
|
||||
| [`/retro`](#retro) | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
|
||||
| [`/browse`](#browse) | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |
|
||||
| [`/setup-browser-cookies`](#setup-browser-cookies) | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
|
||||
|
||||
---
|
||||
|
||||
@@ -26,25 +26,56 @@ Detailed guides for every gstack skill — philosophy, workflow, and examples.
|
||||
|
||||
This is where every project should start.
|
||||
|
||||
Before you plan, before you review, before you write code — sit down and think about what you're building and why. `/office-hours` is a YC-style conversation that forces clarity before action.
|
||||
Before you plan, before you review, before you write code — sit down with a YC-style partner and think about what you're actually building. Not what you think you're building. What you're *actually* building.
|
||||
|
||||
It works in two modes, and it asks you which one upfront:
|
||||
### The reframe
|
||||
|
||||
Here's what happened on a real project. The user said: "I want to build a daily briefing app for my calendar." Reasonable request. Then it asked about the pain — specific examples, not hypotheticals. They described an assistant missing things, calendar items across multiple Google accounts with stale info, prep docs that were AI slop, events with wrong locations that took forever to track down.
|
||||
|
||||
It came back with: *"I'm going to push back on the framing, because I think you've outgrown it. You said 'daily briefing app for multi-Google-Calendar management.' But what you actually described is a personal chief of staff AI."*
|
||||
|
||||
Then it extracted five capabilities the user didn't realize they were describing:
|
||||
|
||||
1. **Watches your calendar** across all accounts and detects stale info, missing locations, permission gaps
|
||||
2. **Generates real prep work** — not logistics summaries, but *the intellectual work* of preparing for a board meeting, a podcast, a fundraiser
|
||||
3. **Manages your CRM** — who are you meeting, what's the relationship, what do they want, what's the history
|
||||
4. **Prioritizes your time** — flags when prep needs to start early, blocks time proactively, ranks events by importance
|
||||
5. **Trades money for leverage** — actively looks for ways to delegate or automate
|
||||
|
||||
That reframe changed the entire project. They were about to build a calendar app. Now they're building something ten times more valuable — because the skill listened to their pain instead of their feature request.
|
||||
|
||||
### Premise challenge
|
||||
|
||||
After the reframe, it presents premises for you to validate. Not "does this sound good?" — actual falsifiable claims about the product:
|
||||
|
||||
1. The calendar is the anchor data source, but the value is in the intelligence layer on top
|
||||
2. The assistant doesn't get replaced — they get superpowered
|
||||
3. The narrowest wedge is a daily briefing that actually works
|
||||
4. CRM integration is a must-have, not a nice-to-have
|
||||
|
||||
You agree, disagree, or adjust. Every premise you accept becomes load-bearing in the design doc.
|
||||
|
||||
### Implementation alternatives
|
||||
|
||||
Then it generates 2-3 concrete implementation approaches with honest effort estimates:
|
||||
|
||||
- **Approach A: Daily Briefing First** — narrowest wedge, ships tomorrow, M effort (human: ~3 weeks / CC: ~2 days)
|
||||
- **Approach B: CRM-First** — build the relationship graph first, L effort (human: ~6 weeks / CC: ~4 days)
|
||||
- **Approach C: Full Vision** — everything at once, XL effort (human: ~3 months / CC: ~1.5 weeks)
|
||||
|
||||
Recommends A because you learn from real usage. CRM data comes naturally in week two.
|
||||
|
||||
### Two modes
|
||||
|
||||
**Startup mode** — for founders and intrapreneurs building a business. You get six forcing questions distilled from how YC partners evaluate products: demand reality, status quo, desperate specificity, narrowest wedge, observation & surprise, and future-fit. These questions are uncomfortable on purpose. If you can't name a specific human who needs your product, that's the most important thing to learn before writing any code.
|
||||
|
||||
**Builder mode** — for hackathons, side projects, open source, learning, and having fun. You get an enthusiastic collaborator who helps you find the coolest version of your idea. What would make someone say "whoa"? What's the fastest path to something you can share? The questions are generative, not interrogative.
|
||||
|
||||
### The design doc
|
||||
|
||||
Both modes end with a design doc written to `~/.gstack/projects/` — and that doc feeds directly into `/plan-ceo-review` and `/plan-eng-review`. The full lifecycle is now: `office-hours → plan → implement → review → QA → ship → retro`.
|
||||
|
||||
**Personal observations.** After the design doc is approved, `/office-hours` reflects on what it noticed about how you think — not generic praise, but specific callbacks to things you said during the session. The observations appear in the design doc too, so you re-encounter them when you re-read later.
|
||||
|
||||
---
|
||||
|
||||
## `/debug`
|
||||
|
||||
When something is broken and you don't know why, `/debug` is your systematic debugger. It follows the Iron Law: **no fixes without root cause investigation first.**
|
||||
|
||||
Instead of guessing and patching, it traces data flow, matches against known bug patterns, and tests hypotheses one at a time. If three fix attempts fail, it stops and questions the architecture instead of thrashing. This prevents the "let me try one more thing" spiral that wastes hours.
|
||||
After the design doc is approved, `/office-hours` reflects on what it noticed about how you think — not generic praise, but specific callbacks to things you said during the session. The observations appear in the design doc too, so you re-encounter them when you re-read later.
|
||||
|
||||
---
|
||||
|
||||
@@ -409,6 +440,54 @@ I want the model imagining the production incident before it happens.
|
||||
|
||||
---
|
||||
|
||||
## `/debug`
|
||||
|
||||
When something is broken and you don't know why, `/debug` is your systematic debugger. It follows the Iron Law: **no fixes without root cause investigation first.**
|
||||
|
||||
Instead of guessing and patching, it traces data flow, matches against known bug patterns, and tests hypotheses one at a time. If three fix attempts fail, it stops and questions the architecture instead of thrashing. This prevents the "let me try one more thing" spiral that wastes hours.
|
||||
|
||||
---
|
||||
|
||||
## `/qa`
|
||||
|
||||
This is my **QA lead mode**.
|
||||
|
||||
`/browse` gives the agent eyes. `/qa` gives it a testing methodology.
|
||||
|
||||
The most common use case: you're on a feature branch, you just finished coding, and you want to verify everything works. Just say `/qa` — it reads your git diff, identifies which pages and routes your changes affect, spins up the browser, and tests each one. No URL required. No manual test plan.
|
||||
|
||||
Four modes:
|
||||
|
||||
- **Diff-aware** (automatic on feature branches) — reads `git diff main`, identifies affected pages, tests them specifically
|
||||
- **Full** — systematic exploration of the entire app. 5-15 minutes. Documents 5-10 well-evidenced issues.
|
||||
- **Quick** (`--quick`) — 30-second smoke test. Homepage + top 5 nav targets.
|
||||
- **Regression** (`--regression baseline.json`) — run full mode, then diff against a previous baseline.
|
||||
|
||||
### Automatic regression tests
|
||||
|
||||
When `/qa` fixes a bug and verifies it, it automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report.
|
||||
|
||||
### Example
|
||||
|
||||
```
|
||||
You: /qa https://staging.myapp.com
|
||||
|
||||
Claude: [Explores 12 pages, fills 3 forms, tests 2 flows]
|
||||
|
||||
QA Report: staging.myapp.com — Health Score: 72/100
|
||||
|
||||
Top 3 Issues:
|
||||
1. CRITICAL: Checkout form submits with empty required fields
|
||||
2. HIGH: Mobile nav menu doesn't close after selecting an item
|
||||
3. MEDIUM: Dashboard chart overlaps sidebar below 1024px
|
||||
|
||||
[Full report with screenshots saved to .gstack/qa-reports/]
|
||||
```
|
||||
|
||||
**Testing authenticated pages:** Use `/setup-browser-cookies` first to import your real browser sessions, then `/qa` can test pages behind login.
|
||||
|
||||
---
|
||||
|
||||
## `/ship`
|
||||
|
||||
This is my **release machine mode**.
|
||||
@@ -435,6 +514,69 @@ A lot of branches die when the interesting work is done and only the boring rele
|
||||
|
||||
---
|
||||
|
||||
## `/document-release`
|
||||
|
||||
This is my **technical writer mode**.
|
||||
|
||||
After `/ship` creates the PR but before it merges, `/document-release` reads every documentation file in the project and cross-references it against the diff. It updates file paths, command lists, project structure trees, and anything else that drifted. Risky or subjective changes get surfaced as questions — everything else is handled automatically.
|
||||
|
||||
```
|
||||
You: /document-release
|
||||
|
||||
Claude: Analyzing 21 files changed across 3 commits. Found 8 documentation files.
|
||||
|
||||
README.md: updated skill count from 9 to 10, added new skill to table
|
||||
CLAUDE.md: added new directory to project structure
|
||||
CONTRIBUTING.md: current — no changes needed
|
||||
TODOS.md: marked 2 items complete, added 1 new item
|
||||
|
||||
All docs updated and committed. PR body updated with doc diff.
|
||||
```
|
||||
|
||||
It also polishes CHANGELOG voice (without ever overwriting entries), cleans up completed TODOS, checks cross-doc consistency, and asks about VERSION bumps only when appropriate.
|
||||
|
||||
---
|
||||
|
||||
## `/retro`
|
||||
|
||||
This is my **engineering manager mode**.
|
||||
|
||||
At the end of the week I want to know what actually happened. Not vibes — data. `/retro` analyzes commit history, work patterns, and shipping velocity and writes a candid retrospective.
|
||||
|
||||
It is team-aware. It identifies who is running the command, gives you the deepest treatment on your own work, then breaks down every contributor with specific praise and growth opportunities. It computes metrics like commits, LOC, test ratio, PR sizes, and fix ratio. It detects coding sessions from commit timestamps, finds hotspot files, tracks shipping streaks, and identifies the biggest ship of the week.
|
||||
|
||||
It also tracks test health: total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area.
|
||||
|
||||
### Example
|
||||
|
||||
```
|
||||
You: /retro
|
||||
|
||||
Claude: Week of Mar 1: 47 commits (3 contributors), 3.2k LOC, 38% tests, 12 PRs, peak: 10pm | Streak: 47d
|
||||
|
||||
## Your Week
|
||||
32 commits, +2.4k LOC, 41% tests. Peak hours: 9-11pm.
|
||||
Biggest ship: cookie import system (browser decryption + picker UI).
|
||||
What you did well: shipped a complete feature with encryption, UI, and
|
||||
18 unit tests in one focused push...
|
||||
|
||||
## Team Breakdown
|
||||
|
||||
### Alice
|
||||
12 commits focused on app/services/. Every PR under 200 LOC — disciplined.
|
||||
Opportunity: test ratio at 12% — worth investing before payment gets more complex.
|
||||
|
||||
### Bob
|
||||
3 commits — fixed the N+1 query on dashboard. Small but high-impact.
|
||||
Opportunity: only 1 active day this week — check if blocked on anything.
|
||||
|
||||
[Top 3 team wins, 3 things to improve, 3 habits for next week]
|
||||
```
|
||||
|
||||
It saves a JSON snapshot to `.context/retros/` so the next run can show trends.
|
||||
|
||||
---
|
||||
|
||||
## `/browse`
|
||||
|
||||
This is my **QA engineer mode**.
|
||||
@@ -480,46 +622,6 @@ For the full command reference, see [BROWSER.md](../BROWSER.md).
|
||||
|
||||
---
|
||||
|
||||
## `/qa`
|
||||
|
||||
This is my **QA lead mode**.
|
||||
|
||||
`/browse` gives the agent eyes. `/qa` gives it a testing methodology.
|
||||
|
||||
The most common use case: you're on a feature branch, you just finished coding, and you want to verify everything works. Just say `/qa` — it reads your git diff, identifies which pages and routes your changes affect, spins up the browser, and tests each one. No URL required. No manual test plan.
|
||||
|
||||
Four modes:
|
||||
|
||||
- **Diff-aware** (automatic on feature branches) — reads `git diff main`, identifies affected pages, tests them specifically
|
||||
- **Full** — systematic exploration of the entire app. 5-15 minutes. Documents 5-10 well-evidenced issues.
|
||||
- **Quick** (`--quick`) — 30-second smoke test. Homepage + top 5 nav targets.
|
||||
- **Regression** (`--regression baseline.json`) — run full mode, then diff against a previous baseline.
|
||||
|
||||
### Automatic regression tests
|
||||
|
||||
When `/qa` fixes a bug and verifies it, it automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report.
|
||||
|
||||
### Example
|
||||
|
||||
```
|
||||
You: /qa https://staging.myapp.com
|
||||
|
||||
Claude: [Explores 12 pages, fills 3 forms, tests 2 flows]
|
||||
|
||||
QA Report: staging.myapp.com — Health Score: 72/100
|
||||
|
||||
Top 3 Issues:
|
||||
1. CRITICAL: Checkout form submits with empty required fields
|
||||
2. HIGH: Mobile nav menu doesn't close after selecting an item
|
||||
3. MEDIUM: Dashboard chart overlaps sidebar below 1024px
|
||||
|
||||
[Full report with screenshots saved to .gstack/qa-reports/]
|
||||
```
|
||||
|
||||
**Testing authenticated pages:** Use `/setup-browser-cookies` first to import your real browser sessions, then `/qa` can test pages behind login.
|
||||
|
||||
---
|
||||
|
||||
## `/setup-browser-cookies`
|
||||
|
||||
This is my **session manager mode**.
|
||||
@@ -551,69 +653,6 @@ Claude: Imported 12 cookies for github.com from Comet.
|
||||
|
||||
---
|
||||
|
||||
## `/retro`
|
||||
|
||||
This is my **engineering manager mode**.
|
||||
|
||||
At the end of the week I want to know what actually happened. Not vibes — data. `/retro` analyzes commit history, work patterns, and shipping velocity and writes a candid retrospective.
|
||||
|
||||
It is team-aware. It identifies who is running the command, gives you the deepest treatment on your own work, then breaks down every contributor with specific praise and growth opportunities. It computes metrics like commits, LOC, test ratio, PR sizes, and fix ratio. It detects coding sessions from commit timestamps, finds hotspot files, tracks shipping streaks, and identifies the biggest ship of the week.
|
||||
|
||||
It also tracks test health: total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area.
|
||||
|
||||
### Example
|
||||
|
||||
```
|
||||
You: /retro
|
||||
|
||||
Claude: Week of Mar 1: 47 commits (3 contributors), 3.2k LOC, 38% tests, 12 PRs, peak: 10pm | Streak: 47d
|
||||
|
||||
## Your Week
|
||||
32 commits, +2.4k LOC, 41% tests. Peak hours: 9-11pm.
|
||||
Biggest ship: cookie import system (browser decryption + picker UI).
|
||||
What you did well: shipped a complete feature with encryption, UI, and
|
||||
18 unit tests in one focused push...
|
||||
|
||||
## Team Breakdown
|
||||
|
||||
### Alice
|
||||
12 commits focused on app/services/. Every PR under 200 LOC — disciplined.
|
||||
Opportunity: test ratio at 12% — worth investing before payment gets more complex.
|
||||
|
||||
### Bob
|
||||
3 commits — fixed the N+1 query on dashboard. Small but high-impact.
|
||||
Opportunity: only 1 active day this week — check if blocked on anything.
|
||||
|
||||
[Top 3 team wins, 3 things to improve, 3 habits for next week]
|
||||
```
|
||||
|
||||
It saves a JSON snapshot to `.context/retros/` so the next run can show trends.
|
||||
|
||||
---
|
||||
|
||||
## `/document-release`
|
||||
|
||||
This is my **technical writer mode**.
|
||||
|
||||
After `/ship` creates the PR but before it merges, `/document-release` reads every documentation file in the project and cross-references it against the diff. It updates file paths, command lists, project structure trees, and anything else that drifted. Risky or subjective changes get surfaced as questions — everything else is handled automatically.
|
||||
|
||||
```
|
||||
You: /document-release
|
||||
|
||||
Claude: Analyzing 21 files changed across 3 commits. Found 8 documentation files.
|
||||
|
||||
README.md: updated skill count from 9 to 10, added new skill to table
|
||||
CLAUDE.md: added new directory to project structure
|
||||
CONTRIBUTING.md: current — no changes needed
|
||||
TODOS.md: marked 2 items complete, added 1 new item
|
||||
|
||||
All docs updated and committed. PR body updated with doc diff.
|
||||
```
|
||||
|
||||
It also polishes CHANGELOG voice (without ever overwriting entries), cleans up completed TODOS, checks cross-doc consistency, and asks about VERSION bumps only when appropriate.
|
||||
|
||||
---
|
||||
|
||||
## Greptile integration
|
||||
|
||||
[Greptile](https://greptile.com) is a YC company that reviews your PRs automatically. It catches real bugs — race conditions, security issues, things that pass CI and blow up in production. It has genuinely saved my ass more than once. I love these guys.
|
||||
|
||||
@@ -6,6 +6,7 @@ description: |
|
||||
diff, updates README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md to match what shipped,
|
||||
polishes CHANGELOG voice, cleans up TODOS, and optionally bumps VERSION. Use when
|
||||
asked to "update the docs", "sync documentation", or "post-ship docs".
|
||||
Proactively suggest after a PR is merged or code is shipped.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -34,8 +35,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"document-release","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -6,6 +6,7 @@ description: |
|
||||
diff, updates README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md to match what shipped,
|
||||
polishes CHANGELOG voice, cleans up TODOS, and optionally bumps VERSION. Use when
|
||||
asked to "update the docs", "sync documentation", or "post-ship docs".
|
||||
Proactively suggest after a PR is merged or code is shipped.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
@@ -6,6 +6,10 @@ description: |
|
||||
demand reality, status quo, desperate specificity, narrowest wedge, observation,
|
||||
and future-fit. Builder mode: design thinking brainstorming for side projects,
|
||||
hackathons, learning, and open source. Saves a design doc.
|
||||
Use when asked to "brainstorm this", "I have an idea", "help me think through
|
||||
this", "office hours", or "is this worth building".
|
||||
Proactively suggest when the user describes a new product idea or is exploring
|
||||
whether something is worth building — before any code is written.
|
||||
Use before /plan-ceo-review or /plan-eng-review.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
@@ -35,8 +39,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -6,6 +6,10 @@ description: |
|
||||
demand reality, status quo, desperate specificity, narrowest wedge, observation,
|
||||
and future-fit. Builder mode: design thinking brainstorming for side projects,
|
||||
hackathons, learning, and open source. Saves a design doc.
|
||||
Use when asked to "brainstorm this", "I have an idea", "help me think through
|
||||
this", "office hours", or "is this worth building".
|
||||
Proactively suggest when the user describes a new product idea or is exploring
|
||||
whether something is worth building — before any code is written.
|
||||
Use before /plan-ceo-review or /plan-eng-review.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
|
||||
+5
-5
@@ -12,11 +12,11 @@
|
||||
"gen:skill-docs": "bun run scripts/gen-skill-docs.ts",
|
||||
"dev": "bun run browse/src/cli.ts",
|
||||
"server": "bun run browse/src/server.ts",
|
||||
"test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts",
|
||||
"test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts",
|
||||
"test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts",
|
||||
"test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts",
|
||||
"test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts",
|
||||
"test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts",
|
||||
"test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts",
|
||||
"test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts",
|
||||
"test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts",
|
||||
"test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts",
|
||||
"skill:check": "bun run scripts/skill-check.ts",
|
||||
"dev:skill": "bun run scripts/dev-skill.ts",
|
||||
"start": "bun run browse/src/server.ts",
|
||||
|
||||
@@ -8,6 +8,8 @@ description: |
|
||||
expansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials).
|
||||
Use when asked to "think bigger", "expand scope", "strategy review", "rethink this",
|
||||
or "is this ambitious enough".
|
||||
Proactively suggest when the user is questioning scope or ambition of a plan,
|
||||
or when the plan feels like it could be thinking bigger.
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Grep
|
||||
@@ -34,8 +36,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -8,6 +8,8 @@ description: |
|
||||
expansions), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials).
|
||||
Use when asked to "think bigger", "expand scope", "strategy review", "rethink this",
|
||||
or "is this ambitious enough".
|
||||
Proactively suggest when the user is questioning scope or ambition of a plan,
|
||||
or when the plan feels like it could be thinking bigger.
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Grep
|
||||
|
||||
@@ -7,6 +7,8 @@ description: |
|
||||
then fixes the plan to get there. Works in plan mode. For live site
|
||||
visual audits, use /design-review. Use when asked to "review the design plan"
|
||||
or "design critique".
|
||||
Proactively suggest when the user has a plan with UI/UX components that
|
||||
should be reviewed before implementation.
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Edit
|
||||
@@ -34,8 +36,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"plan-design-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -7,6 +7,8 @@ description: |
|
||||
then fixes the plan to get there. Works in plan mode. For live site
|
||||
visual audits, use /design-review. Use when asked to "review the design plan"
|
||||
or "design critique".
|
||||
Proactively suggest when the user has a plan with UI/UX components that
|
||||
should be reviewed before implementation.
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Edit
|
||||
|
||||
@@ -6,6 +6,8 @@ description: |
|
||||
data flow, diagrams, edge cases, test coverage, performance. Walks through
|
||||
issues interactively with opinionated recommendations. Use when asked to
|
||||
"review the architecture", "engineering review", or "lock in the plan".
|
||||
Proactively suggest when the user has a plan or design doc and is about to
|
||||
start coding — to catch architecture issues before implementation.
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Write
|
||||
@@ -33,8 +35,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"plan-eng-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -6,6 +6,8 @@ description: |
|
||||
data flow, diagrams, edge cases, test coverage, performance. Walks through
|
||||
issues interactively with opinionated recommendations. Use when asked to
|
||||
"review the architecture", "engineering review", or "lock in the plan".
|
||||
Proactively suggest when the user has a plan or design doc and is about to
|
||||
start coding — to catch architecture issues before implementation.
|
||||
allowed-tools:
|
||||
- Read
|
||||
- Write
|
||||
|
||||
@@ -6,6 +6,7 @@ description: |
|
||||
structured report with health score, screenshots, and repro steps — but never
|
||||
fixes anything. Use when asked to "just report bugs", "qa report only", or
|
||||
"test but don't fix". For the full test-fix-verify loop, use /qa instead.
|
||||
Proactively suggest when the user wants a bug report without any code changes.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -31,8 +32,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"qa-only","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -6,6 +6,7 @@ description: |
|
||||
structured report with health score, screenshots, and repro steps — but never
|
||||
fixes anything. Use when asked to "just report bugs", "qa report only", or
|
||||
"test but don't fix". For the full test-fix-verify loop, use /qa instead.
|
||||
Proactively suggest when the user wants a bug report without any code changes.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
+8
-1
@@ -5,7 +5,9 @@ description: |
|
||||
Systematically QA test a web application and fix bugs found. Runs QA testing,
|
||||
then iteratively fixes bugs in source code, committing each fix atomically and
|
||||
re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs",
|
||||
"test and fix", or "fix what's broken". Three tiers: Quick (critical/high only),
|
||||
"test and fix", or "fix what's broken".
|
||||
Proactively suggest when the user says a feature is ready for testing
|
||||
or asks "does this work?". Three tiers: Quick (critical/high only),
|
||||
Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,
|
||||
fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only.
|
||||
allowed-tools:
|
||||
@@ -37,8 +39,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"qa","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
+3
-1
@@ -5,7 +5,9 @@ description: |
|
||||
Systematically QA test a web application and fix bugs found. Runs QA testing,
|
||||
then iteratively fixes bugs in source code, committing each fix atomically and
|
||||
re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs",
|
||||
"test and fix", or "fix what's broken". Three tiers: Quick (critical/high only),
|
||||
"test and fix", or "fix what's broken".
|
||||
Proactively suggest when the user says a feature is ready for testing
|
||||
or asks "does this work?". Three tiers: Quick (critical/high only),
|
||||
Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,
|
||||
fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only.
|
||||
allowed-tools:
|
||||
|
||||
+18
-11
@@ -6,6 +6,7 @@ description: |
|
||||
and code quality metrics with persistent history and trend tracking.
|
||||
Team-aware: breaks down per-person contributions with praise and growth areas.
|
||||
Use when asked to "weekly retro", "what did we ship", or "engineering retrospective".
|
||||
Proactively suggest at the end of a work week or sprint.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -32,8 +33,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"retro","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
@@ -174,7 +180,9 @@ When the user types `/retro`, run this skill.
|
||||
|
||||
## Instructions
|
||||
|
||||
Parse the argument to determine the time window. Default to 7 days if no argument given. Use `--since="N days ago"`, `--since="N hours ago"`, or `--since="N weeks ago"` (for `w` units) for git log queries. All times should be reported in **Pacific time** (use `TZ=America/Los_Angeles` when converting timestamps).
|
||||
Parse the argument to determine the time window. Default to 7 days if no argument given. All times should be reported in the user's **local timezone** (use the system default — do NOT set `TZ`).
|
||||
|
||||
**Midnight-aligned windows:** For day (`d`) and week (`w`) units, compute an absolute start date at local midnight, not a relative string. For example, if today is 2026-03-18 and the window is 7 days: the start date is 2026-03-11. Use `--since="2026-03-11"` for git log queries — git interprets a bare date as midnight in the local timezone, so this captures full calendar days regardless of what time the retro runs. For week units, multiply by 7 to get days (e.g., `2w` = 14 days back). For hour (`h`) units, use `--since="N hours ago"` since midnight alignment does not apply to sub-day windows.
|
||||
|
||||
**Argument validation:** If the argument doesn't match a number followed by `d`, `h`, or `w`, the word `compare`, or `compare` followed by a number and `d`/`h`/`w`, show this usage and stop:
|
||||
```
|
||||
@@ -211,8 +219,7 @@ git log origin/<default> --since="<window>" --format="%H|%aN|%ae|%ai|%s" --short
|
||||
git log origin/<default> --since="<window>" --format="COMMIT:%H|%aN" --numstat
|
||||
|
||||
# 3. Commit timestamps for session detection and hourly distribution (with author)
|
||||
# Use TZ=America/Los_Angeles for Pacific time conversion
|
||||
TZ=America/Los_Angeles git log origin/<default> --since="<window>" --format="%at|%aN|%ai|%s" | sort -n
|
||||
git log origin/<default> --since="<window>" --format="%at|%aN|%ai|%s" | sort -n
|
||||
|
||||
# 4. Files most frequently changed (hotspot analysis)
|
||||
git log origin/<default> --since="<window>" --format="" --name-only | grep -v '^$' | sort | uniq -c | sort -rn
|
||||
@@ -300,7 +307,7 @@ If the JSONL file doesn't exist or has no entries in the window, skip the Skill
|
||||
|
||||
### Step 3: Commit Time Distribution
|
||||
|
||||
Show hourly histogram in Pacific time using bar chart:
|
||||
Show hourly histogram in local time using bar chart:
|
||||
|
||||
```
|
||||
Hour Commits ████████████████
|
||||
@@ -404,11 +411,11 @@ If the time window is 14 days or more, split into weekly buckets and show trends
|
||||
Count consecutive days with at least 1 commit to origin/<default>, going back from today. Track both team streak and personal streak:
|
||||
|
||||
```bash
|
||||
# Team streak: all unique commit dates (Pacific time) — no hard cutoff
|
||||
TZ=America/Los_Angeles git log origin/<default> --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
# Team streak: all unique commit dates (local time) — no hard cutoff
|
||||
git log origin/<default> --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
|
||||
# Personal streak: only the current user's commits
|
||||
TZ=America/Los_Angeles git log origin/<default> --author="<user_name>" --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
git log origin/<default> --author="<user_name>" --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
```
|
||||
|
||||
Count backward from today — how many consecutive days have at least one commit? This queries the full history so streaks of any length are reported accurately. Display both:
|
||||
@@ -447,7 +454,7 @@ mkdir -p .context/retros
|
||||
Determine the next sequence number for today (substitute the actual date for `$(date +%Y-%m-%d)`):
|
||||
```bash
|
||||
# Count existing retros for today to get next sequence number
|
||||
today=$(TZ=America/Los_Angeles date +%Y-%m-%d)
|
||||
today=$(date +%Y-%m-%d)
|
||||
existing=$(ls .context/retros/${today}-*.json 2>/dev/null | wc -l | tr -d ' ')
|
||||
next=$((existing + 1))
|
||||
# Save as .context/retros/${today}-${next}.json
|
||||
@@ -621,8 +628,8 @@ Small, practical, realistic. Each must be something that takes <5 minutes to ado
|
||||
|
||||
When the user runs `/retro compare` (or `/retro compare 14d`):
|
||||
|
||||
1. Compute metrics for the current window (default 7d) using `--since="7 days ago"`
|
||||
2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` to avoid overlap (e.g., `--since="14 days ago" --until="7 days ago"` for a 7d window)
|
||||
1. Compute metrics for the current window (default 7d) using the midnight-aligned start date (same logic as the main retro — e.g., if today is 2026-03-18 and window is 7d, use `--since="2026-03-11"`)
|
||||
2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` with midnight-aligned dates to avoid overlap (e.g., for a 7d window starting 2026-03-11: prior window is `--since="2026-03-04" --until="2026-03-11"`)
|
||||
3. Show a side-by-side comparison table with deltas and arrows
|
||||
4. Write a brief narrative highlighting the biggest improvements and regressions
|
||||
5. Save only the current-window snapshot to `.context/retros/` (same as a normal retro run); do **not** persist the prior-window metrics.
|
||||
@@ -644,7 +651,7 @@ When the user runs `/retro compare` (or `/retro compare 14d`):
|
||||
|
||||
- ALL narrative output goes directly to the user in the conversation. The ONLY file written is the `.context/retros/` JSON snapshot.
|
||||
- Use `origin/<default>` for all git queries (not local main which may be stale)
|
||||
- Convert all timestamps to Pacific time for display (use `TZ=America/Los_Angeles`)
|
||||
- Display all timestamps in the user's local timezone (do not override `TZ`)
|
||||
- If the window has zero commits, say so and suggest a different window
|
||||
- Round LOC/hour to nearest 50
|
||||
- Treat merge commits as PR boundaries
|
||||
|
||||
+13
-11
@@ -6,6 +6,7 @@ description: |
|
||||
and code quality metrics with persistent history and trend tracking.
|
||||
Team-aware: breaks down per-person contributions with praise and growth areas.
|
||||
Use when asked to "weekly retro", "what did we ship", or "engineering retrospective".
|
||||
Proactively suggest at the end of a work week or sprint.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -43,7 +44,9 @@ When the user types `/retro`, run this skill.
|
||||
|
||||
## Instructions
|
||||
|
||||
Parse the argument to determine the time window. Default to 7 days if no argument given. Use `--since="N days ago"`, `--since="N hours ago"`, or `--since="N weeks ago"` (for `w` units) for git log queries. All times should be reported in **Pacific time** (use `TZ=America/Los_Angeles` when converting timestamps).
|
||||
Parse the argument to determine the time window. Default to 7 days if no argument given. All times should be reported in the user's **local timezone** (use the system default — do NOT set `TZ`).
|
||||
|
||||
**Midnight-aligned windows:** For day (`d`) and week (`w`) units, compute an absolute start date at local midnight, not a relative string. For example, if today is 2026-03-18 and the window is 7 days: the start date is 2026-03-11. Use `--since="2026-03-11"` for git log queries — git interprets a bare date as midnight in the local timezone, so this captures full calendar days regardless of what time the retro runs. For week units, multiply by 7 to get days (e.g., `2w` = 14 days back). For hour (`h`) units, use `--since="N hours ago"` since midnight alignment does not apply to sub-day windows.
|
||||
|
||||
**Argument validation:** If the argument doesn't match a number followed by `d`, `h`, or `w`, the word `compare`, or `compare` followed by a number and `d`/`h`/`w`, show this usage and stop:
|
||||
```
|
||||
@@ -80,8 +83,7 @@ git log origin/<default> --since="<window>" --format="%H|%aN|%ae|%ai|%s" --short
|
||||
git log origin/<default> --since="<window>" --format="COMMIT:%H|%aN" --numstat
|
||||
|
||||
# 3. Commit timestamps for session detection and hourly distribution (with author)
|
||||
# Use TZ=America/Los_Angeles for Pacific time conversion
|
||||
TZ=America/Los_Angeles git log origin/<default> --since="<window>" --format="%at|%aN|%ai|%s" | sort -n
|
||||
git log origin/<default> --since="<window>" --format="%at|%aN|%ai|%s" | sort -n
|
||||
|
||||
# 4. Files most frequently changed (hotspot analysis)
|
||||
git log origin/<default> --since="<window>" --format="" --name-only | grep -v '^$' | sort | uniq -c | sort -rn
|
||||
@@ -169,7 +171,7 @@ If the JSONL file doesn't exist or has no entries in the window, skip the Skill
|
||||
|
||||
### Step 3: Commit Time Distribution
|
||||
|
||||
Show hourly histogram in Pacific time using bar chart:
|
||||
Show hourly histogram in local time using bar chart:
|
||||
|
||||
```
|
||||
Hour Commits ████████████████
|
||||
@@ -273,11 +275,11 @@ If the time window is 14 days or more, split into weekly buckets and show trends
|
||||
Count consecutive days with at least 1 commit to origin/<default>, going back from today. Track both team streak and personal streak:
|
||||
|
||||
```bash
|
||||
# Team streak: all unique commit dates (Pacific time) — no hard cutoff
|
||||
TZ=America/Los_Angeles git log origin/<default> --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
# Team streak: all unique commit dates (local time) — no hard cutoff
|
||||
git log origin/<default> --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
|
||||
# Personal streak: only the current user's commits
|
||||
TZ=America/Los_Angeles git log origin/<default> --author="<user_name>" --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
git log origin/<default> --author="<user_name>" --format="%ad" --date=format:"%Y-%m-%d" | sort -u
|
||||
```
|
||||
|
||||
Count backward from today — how many consecutive days have at least one commit? This queries the full history so streaks of any length are reported accurately. Display both:
|
||||
@@ -316,7 +318,7 @@ mkdir -p .context/retros
|
||||
Determine the next sequence number for today (substitute the actual date for `$(date +%Y-%m-%d)`):
|
||||
```bash
|
||||
# Count existing retros for today to get next sequence number
|
||||
today=$(TZ=America/Los_Angeles date +%Y-%m-%d)
|
||||
today=$(date +%Y-%m-%d)
|
||||
existing=$(ls .context/retros/${today}-*.json 2>/dev/null | wc -l | tr -d ' ')
|
||||
next=$((existing + 1))
|
||||
# Save as .context/retros/${today}-${next}.json
|
||||
@@ -490,8 +492,8 @@ Small, practical, realistic. Each must be something that takes <5 minutes to ado
|
||||
|
||||
When the user runs `/retro compare` (or `/retro compare 14d`):
|
||||
|
||||
1. Compute metrics for the current window (default 7d) using `--since="7 days ago"`
|
||||
2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` to avoid overlap (e.g., `--since="14 days ago" --until="7 days ago"` for a 7d window)
|
||||
1. Compute metrics for the current window (default 7d) using the midnight-aligned start date (same logic as the main retro — e.g., if today is 2026-03-18 and window is 7d, use `--since="2026-03-11"`)
|
||||
2. Compute metrics for the immediately prior same-length window using both `--since` and `--until` with midnight-aligned dates to avoid overlap (e.g., for a 7d window starting 2026-03-11: prior window is `--since="2026-03-04" --until="2026-03-11"`)
|
||||
3. Show a side-by-side comparison table with deltas and arrows
|
||||
4. Write a brief narrative highlighting the biggest improvements and regressions
|
||||
5. Save only the current-window snapshot to `.context/retros/` (same as a normal retro run); do **not** persist the prior-window metrics.
|
||||
@@ -513,7 +515,7 @@ When the user runs `/retro compare` (or `/retro compare 14d`):
|
||||
|
||||
- ALL narrative output goes directly to the user in the conversation. The ONLY file written is the `.context/retros/` JSON snapshot.
|
||||
- Use `origin/<default>` for all git queries (not local main which may be stale)
|
||||
- Convert all timestamps to Pacific time for display (use `TZ=America/Los_Angeles`)
|
||||
- Display all timestamps in the user's local timezone (do not override `TZ`)
|
||||
- If the window has zero commits, say so and suggest a different window
|
||||
- Round LOC/hour to nearest 50
|
||||
- Treat merge commits as PR boundaries
|
||||
|
||||
@@ -5,6 +5,7 @@ description: |
|
||||
Pre-landing PR review. Analyzes diff against the base branch for SQL safety, LLM trust
|
||||
boundary violations, conditional side effects, and other structural issues. Use when
|
||||
asked to "review this PR", "code review", "pre-landing review", or "check my diff".
|
||||
Proactively suggest when the user is about to merge or land code changes.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -33,8 +34,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -5,6 +5,7 @@ description: |
|
||||
Pre-landing PR review. Analyzes diff against the base branch for SQL safety, LLM trust
|
||||
boundary violations, conditional side effects, and other structural issues. Use when
|
||||
asked to "review this PR", "code review", "pre-landing review", or "check my diff".
|
||||
Proactively suggest when the user is about to merge or land code changes.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
@@ -118,8 +118,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"${ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
\`\`\`
|
||||
|
||||
If \`PROACTIVE\` is \`"false"\`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows \`UPGRADE_AVAILABLE <old> <new>\`: read \`~/.claude/skills/gstack/gstack-upgrade/SKILL.md\` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If \`JUST_UPGRADED <from> <to>\`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If \`LAKE_INTRO\` is \`no\`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -30,8 +30,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"setup-browser-cookies","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -3,6 +3,7 @@ name: ship
|
||||
version: 1.0.0
|
||||
description: |
|
||||
Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, update CHANGELOG, commit, push, create PR. Use when asked to "ship", "deploy", "push to main", "create a PR", or "merge and push".
|
||||
Proactively suggest when the user says code is ready or asks about deploying.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
@@ -32,8 +33,13 @@ _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
mkdir -p ~/.gstack/analytics
|
||||
echo '{"skill":"ship","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
|
||||
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true")
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
```
|
||||
|
||||
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
|
||||
them when the user explicitly asks. The user opted out of proactive suggestions.
|
||||
|
||||
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
|
||||
|
||||
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
|
||||
|
||||
@@ -3,6 +3,7 @@ name: ship
|
||||
version: 1.0.0
|
||||
description: |
|
||||
Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION, update CHANGELOG, commit, push, create PR. Use when asked to "ship", "deploy", "push to main", "create a PR", or "merge and push".
|
||||
Proactively suggest when the user says code is ready or asks about deploying.
|
||||
allowed-tools:
|
||||
- Bash
|
||||
- Read
|
||||
|
||||
@@ -90,6 +90,19 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
|
||||
|
||||
// gstack-upgrade
|
||||
'gstack-upgrade-happy-path': ['gstack-upgrade/**'],
|
||||
|
||||
// Skill routing — journey-stage tests (depend on ALL skill descriptions)
|
||||
'journey-ideation': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-plan-eng': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-think-bigger': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-debug': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-qa': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-code-review': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-ship': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-docs': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-retro': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-design-system': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
'journey-visual-qa': ['*/SKILL.md.tmpl', 'SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
|
||||
};
|
||||
|
||||
/**
|
||||
|
||||
@@ -0,0 +1,605 @@
|
||||
import { describe, test, expect, afterAll } from 'bun:test';
|
||||
import { runSkillTest } from './helpers/session-runner';
|
||||
import type { SkillTestResult } from './helpers/session-runner';
|
||||
import { EvalCollector } from './helpers/eval-store';
|
||||
import type { EvalTestEntry } from './helpers/eval-store';
|
||||
import { selectTests, detectBaseBranch, getChangedFiles, E2E_TOUCHFILES, GLOBAL_TOUCHFILES } from './helpers/touchfiles';
|
||||
import { spawnSync } from 'child_process';
|
||||
import * as fs from 'fs';
|
||||
import * as path from 'path';
|
||||
import * as os from 'os';
|
||||
|
||||
const ROOT = path.resolve(import.meta.dir, '..');
|
||||
|
||||
// Skip unless EVALS=1.
|
||||
const evalsEnabled = !!process.env.EVALS;
|
||||
const describeE2E = evalsEnabled ? describe : describe.skip;
|
||||
|
||||
// Eval result collector
|
||||
const evalCollector = evalsEnabled ? new EvalCollector('e2e-routing') : null;
|
||||
|
||||
// Unique run ID for this session
|
||||
const runId = new Date().toISOString().replace(/[:.]/g, '').replace('T', '-').slice(0, 15);
|
||||
|
||||
// --- Diff-based test selection ---
|
||||
// Journey routing tests use E2E_TOUCHFILES (entries prefixed 'journey-' in touchfiles.ts).
|
||||
let selectedTests: string[] | null = null;
|
||||
|
||||
if (evalsEnabled && !process.env.EVALS_ALL) {
|
||||
const baseBranch = process.env.EVALS_BASE
|
||||
|| detectBaseBranch(ROOT)
|
||||
|| 'main';
|
||||
const changedFiles = getChangedFiles(baseBranch, ROOT);
|
||||
|
||||
if (changedFiles.length > 0) {
|
||||
const selection = selectTests(changedFiles, E2E_TOUCHFILES, GLOBAL_TOUCHFILES);
|
||||
selectedTests = selection.selected;
|
||||
process.stderr.write(`\nRouting E2E selection (${selection.reason}): ${selection.selected.length}/${Object.keys(E2E_TOUCHFILES).length} tests\n`);
|
||||
if (selection.skipped.length > 0) {
|
||||
process.stderr.write(` Skipped: ${selection.skipped.join(', ')}\n`);
|
||||
}
|
||||
process.stderr.write('\n');
|
||||
}
|
||||
}
|
||||
|
||||
// --- Helper functions ---
|
||||
|
||||
/** Copy all SKILL.md files into tmpDir/.claude/skills/gstack/ for auto-discovery */
|
||||
function installSkills(tmpDir: string) {
|
||||
const skillDirs = [
|
||||
'', // root gstack SKILL.md
|
||||
'qa', 'qa-only', 'ship', 'review', 'plan-ceo-review', 'plan-eng-review',
|
||||
'plan-design-review', 'design-review', 'design-consultation', 'retro',
|
||||
'document-release', 'debug', 'office-hours', 'browse', 'setup-browser-cookies',
|
||||
'gstack-upgrade', 'humanizer',
|
||||
];
|
||||
|
||||
for (const skill of skillDirs) {
|
||||
const srcPath = path.join(ROOT, skill, 'SKILL.md');
|
||||
if (!fs.existsSync(srcPath)) continue;
|
||||
|
||||
const destDir = skill
|
||||
? path.join(tmpDir, '.claude', 'skills', 'gstack', skill)
|
||||
: path.join(tmpDir, '.claude', 'skills', 'gstack');
|
||||
fs.mkdirSync(destDir, { recursive: true });
|
||||
fs.copyFileSync(srcPath, path.join(destDir, 'SKILL.md'));
|
||||
}
|
||||
}
|
||||
|
||||
/** Init a git repo with config */
|
||||
function initGitRepo(dir: string) {
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
|
||||
run('git', ['init']);
|
||||
run('git', ['config', 'user.email', 'test@test.com']);
|
||||
run('git', ['config', 'user.name', 'Test']);
|
||||
}
|
||||
|
||||
function logCost(label: string, result: { costEstimate: { turnsUsed: number; estimatedTokens: number; estimatedCost: number }; duration: number }) {
|
||||
const { turnsUsed, estimatedTokens, estimatedCost } = result.costEstimate;
|
||||
const durationSec = Math.round(result.duration / 1000);
|
||||
console.log(`${label}: $${estimatedCost.toFixed(2)} (${turnsUsed} turns, ${(estimatedTokens / 1000).toFixed(1)}k tokens, ${durationSec}s)`);
|
||||
}
|
||||
|
||||
function recordRouting(name: string, result: SkillTestResult, expectedSkill: string, actualSkill: string | undefined) {
|
||||
evalCollector?.addTest({
|
||||
name,
|
||||
suite: 'Skill Routing E2E',
|
||||
tier: 'e2e',
|
||||
passed: actualSkill === expectedSkill,
|
||||
duration_ms: result.duration,
|
||||
cost_usd: result.costEstimate.estimatedCost,
|
||||
transcript: result.transcript,
|
||||
output: result.output?.slice(0, 2000),
|
||||
turns_used: result.costEstimate.turnsUsed,
|
||||
exit_reason: result.exitReason,
|
||||
});
|
||||
}
|
||||
|
||||
// --- Tests ---
|
||||
|
||||
describeE2E('Skill Routing E2E — Developer Journey', () => {
|
||||
afterAll(() => {
|
||||
evalCollector?.finalize();
|
||||
});
|
||||
|
||||
test('journey-ideation', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-ideation-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
fs.writeFileSync(path.join(tmpDir, 'README.md'), '# New Project\n');
|
||||
spawnSync('git', ['add', '.'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
spawnSync('git', ['commit', '-m', 'initial'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
const testName = 'journey-ideation';
|
||||
const expectedSkill = 'office-hours';
|
||||
const result = await runSkillTest({
|
||||
prompt: "I've been thinking about building a waitlist management tool for restaurants. The existing solutions are expensive and overcomplicated. I want something simple — a tablet app where hosts can add parties, see wait times, and text customers when their table is ready. Help me think through whether this is worth building and what the key design decisions are.",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-plan-eng', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-plan-eng-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
fs.writeFileSync(path.join(tmpDir, 'plan.md'), `# Waitlist App Architecture
|
||||
|
||||
## Components
|
||||
- REST API (Express.js)
|
||||
- PostgreSQL database
|
||||
- React frontend
|
||||
- SMS integration (Twilio)
|
||||
|
||||
## Data Model
|
||||
- restaurants (id, name, settings)
|
||||
- parties (id, restaurant_id, name, size, phone, status, created_at)
|
||||
- wait_estimates (id, restaurant_id, avg_wait_minutes)
|
||||
|
||||
## API Endpoints
|
||||
- POST /api/parties - add party to waitlist
|
||||
- GET /api/parties - list current waitlist
|
||||
- PATCH /api/parties/:id/status - update party status
|
||||
- GET /api/estimate - get current wait estimate
|
||||
`);
|
||||
spawnSync('git', ['add', '.'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
spawnSync('git', ['commit', '-m', 'initial'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
const testName = 'journey-plan-eng';
|
||||
const expectedSkill = 'plan-eng-review';
|
||||
const result = await runSkillTest({
|
||||
prompt: "I wrote up a plan for the waitlist app in plan.md. Can you take a look at the architecture and make sure I'm not missing any edge cases or failure modes before I start coding?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-think-bigger', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-think-bigger-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
fs.writeFileSync(path.join(tmpDir, 'plan.md'), `# Waitlist App Architecture
|
||||
|
||||
## Components
|
||||
- REST API (Express.js)
|
||||
- PostgreSQL database
|
||||
- React frontend
|
||||
- SMS integration (Twilio)
|
||||
|
||||
## Data Model
|
||||
- restaurants (id, name, settings)
|
||||
- parties (id, restaurant_id, name, size, phone, status, created_at)
|
||||
- wait_estimates (id, restaurant_id, avg_wait_minutes)
|
||||
|
||||
## API Endpoints
|
||||
- POST /api/parties - add party to waitlist
|
||||
- GET /api/parties - list current waitlist
|
||||
- PATCH /api/parties/:id/status - update party status
|
||||
- GET /api/estimate - get current wait estimate
|
||||
`);
|
||||
spawnSync('git', ['add', '.'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
spawnSync('git', ['commit', '-m', 'initial'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
const testName = 'journey-think-bigger';
|
||||
const expectedSkill = 'plan-ceo-review';
|
||||
const result = await runSkillTest({
|
||||
prompt: "Actually, looking at this plan again, I feel like we're thinking too small. We're just doing waitlists but what about the whole restaurant guest experience? Is there a bigger opportunity here we should go after?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 120_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 180_000);
|
||||
|
||||
test('journey-debug', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-debug-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
fs.mkdirSync(path.join(tmpDir, 'src'), { recursive: true });
|
||||
fs.writeFileSync(path.join(tmpDir, 'src/api.ts'), `
|
||||
import express from 'express';
|
||||
const app = express();
|
||||
|
||||
app.get('/api/waitlist', async (req, res) => {
|
||||
const db = req.app.locals.db;
|
||||
const parties = await db.query('SELECT * FROM parties WHERE status = $1', ['waiting']);
|
||||
res.json(parties.rows);
|
||||
});
|
||||
|
||||
export default app;
|
||||
`);
|
||||
fs.writeFileSync(path.join(tmpDir, 'error.log'), `
|
||||
[2026-03-18T10:23:45Z] ERROR: GET /api/waitlist - 500 Internal Server Error
|
||||
TypeError: Cannot read properties of undefined (reading 'query')
|
||||
at /src/api.ts:5:32
|
||||
at Layer.handle [as handle_request] (/node_modules/express/lib/router/layer.js:95:5)
|
||||
[2026-03-18T10:23:46Z] ERROR: GET /api/waitlist - 500 Internal Server Error
|
||||
TypeError: Cannot read properties of undefined (reading 'query')
|
||||
`);
|
||||
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'initial']);
|
||||
run('git', ['checkout', '-b', 'feature/waitlist-api']);
|
||||
|
||||
const testName = 'journey-debug';
|
||||
const expectedSkill = 'debug';
|
||||
const result = await runSkillTest({
|
||||
prompt: "The GET /api/waitlist endpoint was working fine yesterday but now it's returning 500 errors. The tests are passing locally but the endpoint fails when I hit it with curl. Can you figure out what's going on?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-qa', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-qa-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'package.json'), JSON.stringify({ name: 'waitlist-app', scripts: { dev: 'next dev' } }, null, 2));
|
||||
fs.mkdirSync(path.join(tmpDir, 'src'), { recursive: true });
|
||||
fs.writeFileSync(path.join(tmpDir, 'src/index.html'), '<html><body><h1>Waitlist App</h1></body></html>');
|
||||
spawnSync('git', ['add', '.'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
spawnSync('git', ['commit', '-m', 'initial'], { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
const testName = 'journey-qa';
|
||||
const expectedSkill = 'qa';
|
||||
const alternateSkills = ['qa-only', 'browse'];
|
||||
const result = await runSkillTest({
|
||||
prompt: "I think the app is mostly working now. Can you go through the site and test everything — find any bugs and fix them?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
const acceptable = [expectedSkill, ...alternateSkills];
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect(acceptable, `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-code-review', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-code-review-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'app.ts'), '// base\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'initial']);
|
||||
run('git', ['checkout', '-b', 'feature/add-waitlist']);
|
||||
fs.writeFileSync(path.join(tmpDir, 'app.ts'), '// updated with waitlist feature\nimport { WaitlistService } from "./waitlist";\n');
|
||||
fs.writeFileSync(path.join(tmpDir, 'waitlist.ts'), 'export class WaitlistService {\n async addParty(name: string, size: number) {\n // TODO: implement\n }\n}\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'feat: add waitlist service']);
|
||||
|
||||
const testName = 'journey-code-review';
|
||||
const expectedSkill = 'review';
|
||||
const result = await runSkillTest({
|
||||
prompt: "I'm about to merge this into main. Can you look over my changes and flag anything risky before I land it?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-ship', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-ship-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'app.ts'), '// base\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'initial']);
|
||||
run('git', ['checkout', '-b', 'feature/waitlist']);
|
||||
fs.writeFileSync(path.join(tmpDir, 'app.ts'), '// waitlist feature\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'feat: waitlist']);
|
||||
|
||||
const testName = 'journey-ship';
|
||||
const expectedSkill = 'ship';
|
||||
const result = await runSkillTest({
|
||||
prompt: "This looks good. Let's get it deployed — push the code up and create a PR.",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-docs', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-docs-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'README.md'), '# Waitlist App\nA simple waitlist management tool.\n');
|
||||
fs.mkdirSync(path.join(tmpDir, 'src'), { recursive: true });
|
||||
fs.writeFileSync(path.join(tmpDir, 'src/api.ts'), '// API code\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'feat: ship waitlist feature']);
|
||||
|
||||
const testName = 'journey-docs';
|
||||
const expectedSkill = 'document-release';
|
||||
const result = await runSkillTest({
|
||||
prompt: "We just shipped the waitlist feature. Can you go through the README and any other docs and make sure they match what we actually built?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-retro', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-retro-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'api.ts'), 'export function getParties() { return []; }\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'feat: add parties API', '--date', '2026-03-12T09:30:00']);
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'ui.tsx'), 'export function WaitlistView() { return <div>Waitlist</div>; }\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'feat: add waitlist UI', '--date', '2026-03-13T14:00:00']);
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'README.md'), '# Waitlist App\n');
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'docs: add README', '--date', '2026-03-14T16:00:00']);
|
||||
|
||||
const testName = 'journey-retro';
|
||||
const expectedSkill = 'retro';
|
||||
const result = await runSkillTest({
|
||||
prompt: "It's Friday. What did we ship this week? I want to do a quick retrospective on what the team accomplished.",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-design-system', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-design-system-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
fs.writeFileSync(path.join(tmpDir, 'package.json'), JSON.stringify({ name: 'waitlist-app' }, null, 2));
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'initial']);
|
||||
|
||||
const testName = 'journey-design-system';
|
||||
const expectedSkill = 'design-consultation';
|
||||
const result = await runSkillTest({
|
||||
prompt: "Before we build the UI, I want to establish a design system — typography, colors, spacing, the whole thing. Can you put together brand guidelines for this project?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
|
||||
test('journey-visual-qa', async () => {
|
||||
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'routing-visual-qa-'));
|
||||
try {
|
||||
initGitRepo(tmpDir);
|
||||
installSkills(tmpDir);
|
||||
|
||||
const run = (cmd: string, args: string[]) =>
|
||||
spawnSync(cmd, args, { cwd: tmpDir, stdio: 'pipe', timeout: 5000 });
|
||||
|
||||
fs.mkdirSync(path.join(tmpDir, 'src'), { recursive: true });
|
||||
fs.writeFileSync(path.join(tmpDir, 'src/styles.css'), `
|
||||
body { font-family: sans-serif; }
|
||||
.header { font-size: 24px; margin: 20px; }
|
||||
.card { padding: 16px; margin: 8px; border: 1px solid #ccc; }
|
||||
.button { background: #007bff; color: white; padding: 10px 20px; }
|
||||
`);
|
||||
fs.writeFileSync(path.join(tmpDir, 'src/index.html'), `
|
||||
<html>
|
||||
<head><link rel="stylesheet" href="styles.css"></head>
|
||||
<body>
|
||||
<div class="header">Waitlist</div>
|
||||
<div class="card">Party of 4 - Smith</div>
|
||||
<div class="card">Party of 2 - Jones</div>
|
||||
</body>
|
||||
</html>
|
||||
`);
|
||||
run('git', ['add', '.']);
|
||||
run('git', ['commit', '-m', 'initial UI']);
|
||||
|
||||
const testName = 'journey-visual-qa';
|
||||
const expectedSkill = 'design-review';
|
||||
const result = await runSkillTest({
|
||||
prompt: "Something looks off on the site. The spacing between sections is inconsistent and the font sizes don't feel right. Can you audit the visual design and fix anything that doesn't look polished?",
|
||||
workingDirectory: tmpDir,
|
||||
maxTurns: 5,
|
||||
allowedTools: ['Skill', 'Read', 'Bash', 'Glob', 'Grep'],
|
||||
timeout: 60_000,
|
||||
testName,
|
||||
runId,
|
||||
});
|
||||
|
||||
const skillCalls = result.toolCalls.filter(tc => tc.tool === 'Skill');
|
||||
const actualSkill = skillCalls.length > 0 ? skillCalls[0]?.input?.skill : undefined;
|
||||
|
||||
logCost(`journey: ${testName}`, result);
|
||||
recordRouting(testName, result, expectedSkill, actualSkill);
|
||||
|
||||
expect(skillCalls.length, `Expected Skill tool to be called but got 0 calls. Claude may have answered directly without invoking a skill. Tool calls: ${result.toolCalls.map(tc => tc.tool).join(', ')}`).toBeGreaterThan(0);
|
||||
expect([expectedSkill], `Expected skill ${expectedSkill} but got ${actualSkill}`).toContain(actualSkill);
|
||||
} finally {
|
||||
fs.rmSync(tmpDir, { recursive: true, force: true });
|
||||
}
|
||||
}, 90_000);
|
||||
});
|
||||
@@ -1120,3 +1120,40 @@ describe('QA report template', () => {
|
||||
expect(content).toContain('**Precondition:**');
|
||||
});
|
||||
});
|
||||
|
||||
// --- Trigger phrase validation ---
|
||||
|
||||
describe('Skill trigger phrases', () => {
|
||||
// Skills that must have "Use when" trigger phrases in their description.
|
||||
// Excluded: root gstack (browser tool), gstack-upgrade (gstack-specific),
|
||||
// setup-browser-cookies (utility), humanizer (text tool), browse (subskill of gstack)
|
||||
const SKILLS_REQUIRING_TRIGGERS = [
|
||||
'qa', 'qa-only', 'ship', 'review', 'debug', 'office-hours',
|
||||
'plan-ceo-review', 'plan-eng-review', 'plan-design-review',
|
||||
'design-review', 'design-consultation', 'retro', 'document-release',
|
||||
];
|
||||
|
||||
for (const skill of SKILLS_REQUIRING_TRIGGERS) {
|
||||
test(`${skill}/SKILL.md has "Use when" trigger phrases`, () => {
|
||||
const skillPath = path.join(ROOT, skill, 'SKILL.md');
|
||||
if (!fs.existsSync(skillPath)) return;
|
||||
const content = fs.readFileSync(skillPath, 'utf-8');
|
||||
// Extract description from frontmatter
|
||||
const frontmatterEnd = content.indexOf('---', 4);
|
||||
const frontmatter = content.slice(0, frontmatterEnd);
|
||||
expect(frontmatter).toMatch(/Use when/i);
|
||||
});
|
||||
}
|
||||
|
||||
// Skills with proactive triggers should have "Proactively suggest" in description
|
||||
for (const skill of SKILLS_REQUIRING_TRIGGERS) {
|
||||
test(`${skill}/SKILL.md has "Proactively suggest" phrase`, () => {
|
||||
const skillPath = path.join(ROOT, skill, 'SKILL.md');
|
||||
if (!fs.existsSync(skillPath)) return;
|
||||
const content = fs.readFileSync(skillPath, 'utf-8');
|
||||
const frontmatterEnd = content.indexOf('---', 4);
|
||||
const frontmatter = content.slice(0, frontmatterEnd);
|
||||
expect(frontmatter).toMatch(/Proactively suggest/i);
|
||||
});
|
||||
}
|
||||
});
|
||||
|
||||
@@ -115,7 +115,8 @@ describe('selectTests', () => {
|
||||
expect(result.selected).toContain('plan-ceo-review-selective');
|
||||
expect(result.selected).toContain('retro');
|
||||
expect(result.selected).toContain('retro-base-branch');
|
||||
expect(result.selected.length).toBe(4);
|
||||
// Also selects journey routing tests (*/SKILL.md.tmpl matches retro/SKILL.md.tmpl)
|
||||
expect(result.selected.length).toBeGreaterThanOrEqual(4);
|
||||
});
|
||||
|
||||
test('works with LLM_JUDGE_TOUCHFILES', () => {
|
||||
@@ -125,13 +126,15 @@ describe('selectTests', () => {
|
||||
expect(result.selected.length).toBe(2);
|
||||
});
|
||||
|
||||
test('SKILL.md.tmpl root template only selects root-dependent tests', () => {
|
||||
test('SKILL.md.tmpl root template selects root-dependent tests and routing tests', () => {
|
||||
const result = selectTests(['SKILL.md.tmpl'], E2E_TOUCHFILES);
|
||||
// Should select the 7 tests that depend on root SKILL.md
|
||||
expect(result.selected).toContain('skillmd-setup-discovery');
|
||||
expect(result.selected).toContain('contributor-mode');
|
||||
expect(result.selected).toContain('session-awareness');
|
||||
// Should NOT select unrelated tests
|
||||
// Also selects journey routing tests (SKILL.md.tmpl in their touchfiles)
|
||||
expect(result.selected).toContain('journey-ideation');
|
||||
// Should NOT select unrelated non-routing tests
|
||||
expect(result.selected).not.toContain('plan-ceo-review');
|
||||
expect(result.selected).not.toContain('retro');
|
||||
});
|
||||
|
||||
Reference in New Issue
Block a user