merge: integrate origin/main (v0.4.0, v0.4.1) into team-supabase-store

Resolves conflicts in CHANGELOG.md (ordering), CONTRIBUTING.md (eval
tools list merge), VERSION (take main's 0.4.1), qa/SKILL.md.tmpl
(keep full methodology + baseline line), eval-store.test.ts (drop
redundant comment).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan
2026-03-16 07:49:27 -05:00
47 changed files with 3358 additions and 269 deletions
+33 -2
View File
@@ -120,7 +120,7 @@ Refs (`@e1`, `@e2`, `@c1`) are how the agent addresses page elements without wri
2. Server calls Playwright's page.accessibility.snapshot()
3. Parser walks the ARIA tree, assigns sequential refs: @e1, @e2, @e3...
4. For each ref, builds a Playwright Locator: getByRole(role, { name }).nth(index)
5. Stores Map<string, Locator> on the BrowserManager instance
5. Stores Map<string, RefEntry> on the BrowserManager instance (role + name + Locator)
6. Returns the annotated tree as plain text
Later:
@@ -142,6 +142,19 @@ Playwright Locators are external to the DOM. They use the accessibility tree (wh
Refs are cleared on navigation (the `framenavigated` event on the main frame). This is correct — after navigation, all locators are stale. The agent must run `snapshot` again to get fresh refs. This is by design: stale refs should fail loudly, not click the wrong element.
### Ref staleness detection
SPAs can mutate the DOM without triggering `framenavigated` (e.g. React router transitions, tab switches, modal opens). This makes refs stale even though the page URL didn't change. To catch this, `resolveRef()` performs an async `count()` check before using any ref:
```
resolveRef(@e3) → entry = refMap.get("e3")
→ count = await entry.locator.count()
→ if count === 0: throw "Ref @e3 is stale — element no longer exists. Run 'snapshot' to get fresh refs."
→ if count > 0: return { locator }
```
This fails fast (~5ms overhead) instead of letting Playwright's 30-second action timeout expire on a missing element. The `RefEntry` stores `role` and `name` metadata alongside the Locator so the error message can tell the agent what the element was.
### Cursor-interactive refs (@c)
The `-C` flag finds elements that are clickable but not in the ARIA tree — things styled with `cursor: pointer`, elements with `onclick` attributes, or custom `tabindex`. These get `@c1`, `@c2` refs in a separate namespace. This catches custom components that frameworks render as `<div>` but are actually buttons.
@@ -179,7 +192,25 @@ gen-skill-docs.ts (reads source code metadata)
SKILL.md (committed, auto-generated sections)
```
Templates contain the workflows, tips, and examples that require human judgment. The `{{COMMAND_REFERENCE}}` and `{{SNAPSHOT_FLAGS}}` placeholders are filled from `commands.ts` and `snapshot.ts` at build time. This is structurally sound — if a command exists in code, it appears in docs. If it doesn't exist, it can't appear.
Templates contain the workflows, tips, and examples that require human judgment. Placeholders are filled from source code at build time:
| Placeholder | Source | What it generates |
|-------------|--------|-------------------|
| `{{COMMAND_REFERENCE}}` | `commands.ts` | Categorized command table |
| `{{SNAPSHOT_FLAGS}}` | `snapshot.ts` | Flag reference with examples |
| `{{PREAMBLE}}` | `gen-skill-docs.ts` | Startup block: update check, session tracking, contributor mode, AskUserQuestion format |
| `{{BROWSE_SETUP}}` | `gen-skill-docs.ts` | Binary discovery + setup instructions |
This is structurally sound — if a command exists in code, it appears in docs. If it doesn't exist, it can't appear.
### The preamble
Every skill starts with a `{{PREAMBLE}}` block that runs before the skill's own logic. It handles four things in a single bash command:
1. **Update check** — calls `gstack-update-check`, reports if an upgrade is available.
2. **Session tracking** — touches `~/.gstack/sessions/$PPID` and counts active sessions (files modified in the last 2 hours). When 3+ sessions are running, all skills enter "ELI16 mode" — every question re-grounds the user on context because they're juggling windows.
3. **Contributor mode** — reads `gstack_contributor` from config. When true, the agent files casual field reports to `~/.gstack/contributor-logs/` when gstack itself misbehaves.
4. **AskUserQuestion format** — universal format: context, question, `RECOMMENDATION: Choose X because ___`, lettered options. Consistent across all skills.
### Why committed, not generated at runtime?
+2
View File
@@ -87,6 +87,8 @@ The browser's key innovation is ref-based element selection, built on Playwright
No DOM mutation. No injected scripts. Just Playwright's native accessibility API.
**Ref staleness detection:** SPAs can mutate the DOM without navigation (React router, tab switches, modals). When this happens, refs collected from a previous `snapshot` may point to elements that no longer exist. To handle this, `resolveRef()` runs an async `count()` check before using any ref — if the element count is 0, it throws immediately with a message telling the agent to re-run `snapshot`. This fails fast (~5ms) instead of waiting for Playwright's 30-second action timeout.
**Extended snapshot features:**
- `--diff` (`-D`): Stores each snapshot as a baseline. On the next `-D` call, returns a unified diff showing what changed. Use this to verify that an action (click, fill, etc.) actually worked.
- `--annotate` (`-a`): Injects temporary overlay divs at each ref's bounding box, takes a screenshot with ref labels visible, then removes the overlays. Use `-o <path>` to control the output path.
+38
View File
@@ -1,5 +1,43 @@
# Changelog
## 0.4.1 — 2026-03-16
- **gstack now notices when it screws up.** Turn on contributor mode (`gstack-config set gstack_contributor true`) and gstack automatically writes up what went wrong — what you were doing, what broke, repro steps. Next time something annoys you, the bug report is already written. Fork gstack and fix it yourself.
- **Juggling multiple sessions? gstack keeps up.** When you have 3+ gstack windows open, every question now tells you which project, which branch, and what you were working on. No more staring at a question thinking "wait, which window is this?"
- **Every question now comes with a recommendation.** Instead of dumping options on you and making you think, gstack tells you what it would pick and why. Same clear format across every skill.
- **/review now catches forgotten enum handlers.** Add a new status, tier, or type constant? /review traces it through every switch statement, allowlist, and filter in your codebase — not just the files you changed. Catches the "added the value but forgot to handle it" class of bugs before they ship.
### For contributors
- Renamed `{{UPDATE_CHECK}}` to `{{PREAMBLE}}` across all 11 skill templates — one startup block now handles update check, session tracking, contributor mode, and question formatting.
- DRY'd plan-ceo-review and plan-eng-review question formatting to reference the preamble baseline instead of duplicating rules.
- Added CHANGELOG style guide and vendored symlink awareness docs to CLAUDE.md.
## 0.4.0 — 2026-03-16
### Added
- **QA-only skill** (`/qa-only`) — report-only QA mode that finds and documents bugs without making fixes. Hand off a clean bug report to your team without the agent touching your code.
- **QA fix loop** — `/qa` now runs a find-fix-verify cycle: discover bugs, fix them, commit, re-navigate to confirm the fix took. One command to go from broken to shipped.
- **Plan-to-QA artifact flow** — `/plan-eng-review` writes test-plan artifacts that `/qa` picks up automatically. Your engineering review now feeds directly into QA testing with no manual copy-paste.
- **`{{QA_METHODOLOGY}}` DRY placeholder** — shared QA methodology block injected into both `/qa` and `/qa-only` templates. Keeps both skills in sync when you update testing standards.
- **Eval efficiency metrics** — turns, duration, and cost now displayed across all eval surfaces with natural-language **Takeaway** commentary. See at a glance whether your prompt changes made the agent faster or slower.
- **`generateCommentary()` engine** — interprets comparison deltas so you don't have to: flags regressions, notes improvements, and produces an overall efficiency summary.
- **Eval list columns** — `bun run eval:list` now shows Turns and Duration per run. Spot expensive or slow runs instantly.
- **Eval summary per-test efficiency** — `bun run eval:summary` shows average turns/duration/cost per test across runs. Identify which tests are costing you the most over time.
- **`judgePassed()` unit tests** — extracted and tested the pass/fail judgment logic.
- **3 new E2E tests** — qa-only no-fix guardrail, qa fix loop with commit verification, plan-eng-review test-plan artifact.
- **Browser ref staleness detection** — `resolveRef()` now checks element count to detect stale refs after page mutations. SPA navigation no longer causes 30-second timeouts on missing elements.
- 3 new snapshot tests for ref staleness.
### Changed
- QA skill prompt restructured with explicit two-cycle workflow (find → fix → verify).
- `formatComparison()` now shows per-test turns and duration deltas alongside cost.
- `printSummary()` shows turns and duration columns.
- `eval-store.test.ts` fixed pre-existing `_partial` file assertion bug.
### Fixed
- Browser ref staleness — refs collected before page mutation (e.g. SPA navigation) are now detected and re-collected. Eliminates a class of flaky QA failures on dynamic sites.
## 0.3.10 — 2026-03-15
### Added
+30
View File
@@ -43,6 +43,7 @@ gstack/
│ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s)
│ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run)
│ └── skill-e2e.test.ts # Tier 2: E2E via claude -p (~$3.85/run)
├── qa-only/ # /qa-only skill (report-only QA, no fixes)
├── ship/ # Ship workflow skill
├── review/ # PR review skill
├── plan-ceo-review/ # /plan-ceo-review skill
@@ -72,6 +73,35 @@ When you need to interact with a browser (QA, dogfooding, cookie setup), use the
`mcp__claude-in-chrome__*` tools — they are slow, unreliable, and not what this
project uses.
## Vendored symlink awareness
When developing gstack, `.claude/skills/gstack` may be a symlink back to this
working directory (gitignored). This means skill changes are **live immediately**
great for rapid iteration, risky during big refactors where half-written skills
could break other Claude Code sessions using gstack concurrently.
**Check once per session:** Run `ls -la .claude/skills/gstack` to see if it's a
symlink or a real copy. If it's a symlink to your working directory, be aware that:
- Template changes + `bun run gen:skill-docs` immediately affect all gstack invocations
- Breaking changes to SKILL.md.tmpl files can break concurrent gstack sessions
- During large refactors, remove the symlink (`rm .claude/skills/gstack`) so the
global install at `~/.claude/skills/gstack/` is used instead
**For plan reviews:** When reviewing plans that modify skill templates or the
gen-skill-docs pipeline, consider whether the changes should be tested in isolation
before going live (especially if the user is actively using gstack in other windows).
## CHANGELOG style
CHANGELOG.md is **for users**, not contributors. Write it like product release notes:
- Lead with what the user can now **do** that they couldn't before. Sell the feature.
- Use plain language, not implementation details. "You can now..." not "Refactored the..."
- Put contributor/internal changes in a separate "For contributors" section at the bottom.
- Every entry should make someone think "oh nice, I want to try that."
- No jargon: say "every question now tells you which project and branch you're in" not
"AskUserQuestion format standardized across skill templates via preamble resolver."
## Deploying to the active skill
The active skill lives at `~/.claude/skills/gstack/`. After making changes:
+69 -59
View File
@@ -20,9 +20,44 @@ Now edit any `SKILL.md`, invoke it in Claude Code (e.g. `/review`), and see your
bin/dev-teardown # deactivate — back to your global install
```
## How dev mode works
## Contributor mode
`bin/dev-setup` creates a `.claude/skills/` directory inside the repo (gitignored) and fills it with symlinks pointing back to your working tree. Claude Code sees the local `skills/` first, so your edits win over the global install.
Contributor mode is for people who want to fix gstack when it annoys them. Enable it
and Claude Code will automatically log issues to `~/.gstack/contributor-logs/` as you
work — what you were doing, what went wrong, repro steps, raw output.
```bash
~/.claude/skills/gstack/bin/gstack-config set gstack_contributor true
```
The logs are for **you**. When something bugs you enough to fix, the report is
already written. Fork gstack, symlink your fork into the project where you hit
the issue, fix it, and open a PR.
### The contributor workflow
1. **Hit friction while using gstack** — contributor mode logs it automatically
2. **Check your logs:** `ls ~/.gstack/contributor-logs/`
3. **Fork and clone gstack** (if you haven't already)
4. **Symlink your fork into the project where you hit the bug:**
```bash
# In your core project (the one where gstack annoyed you)
ln -sfn /path/to/your/gstack-fork .claude/skills/gstack
cd .claude/skills/gstack && bun install && bun run build
```
5. **Fix the issue** — your changes are live immediately in this project
6. **Test by actually using gstack** — do the thing that annoyed you, verify it's fixed
7. **Open a PR from your fork**
This is the best way to contribute: fix gstack while doing your real work, in the
project where you actually felt the pain.
## Working on gstack inside the gstack repo
When you're editing gstack skills and want to test them by actually using gstack
in the same repo, `bin/dev-setup` wires this up. It creates `.claude/skills/`
symlinks (gitignored) pointing back to your working tree, so Claude Code uses
your local edits instead of the global install.
```
gstack/ <- your working tree
@@ -131,13 +166,15 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
**Eval history tools:**
```bash
bun run eval:list # list all eval runs
bun run eval:compare # compare two runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all runs
bun run eval:list # list all eval runs (turns, duration, cost per run)
bun run eval:compare # compare two runs — shows per-test deltas + Takeaway commentary
bun run eval:summary # aggregate stats + per-test efficiency averages across runs
bun run eval:trend # per-test pass rate over last N runs (flaky detection)
bun run eval:cache stats # check LLM judge cache hit rate
```
**Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.
Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
### Tier 3: LLM-as-judge (~$0.15/run)
@@ -208,69 +245,42 @@ When Conductor creates a new workspace, `bin/dev-setup` runs automatically. It d
- **`.env` propagates across worktrees.** Set it once in the main repo, all Conductor workspaces get it.
- **`.claude/skills/` is gitignored.** The symlinks never get committed.
## Testing a branch in another repo
## Testing your changes in a real project
When you're developing gstack in one workspace and want to test your branch in a
different project (e.g. testing browse changes against your real app), there are
two cases depending on how gstack is installed in that project.
**This is the recommended way to develop gstack.** Symlink your gstack checkout
into the project where you actually use it, so your changes are live while you
do real work:
### Global install only (no `.claude/skills/gstack/` in the project)
```bash
# In your core project
ln -sfn /path/to/your/gstack-checkout .claude/skills/gstack
cd .claude/skills/gstack && bun install && bun run build
```
Point your global install at the branch:
Now every gstack skill invocation in this project uses your working tree. Edit a
template, run `bun run gen:skill-docs`, and the next `/review` or `/qa` call picks
it up immediately.
**To go back to the stable global install**, just remove the symlink:
```bash
rm .claude/skills/gstack
```
Claude Code falls back to `~/.claude/skills/gstack/` automatically.
### Alternative: point your global install at a branch
If you don't want per-project symlinks, you can switch the global install:
```bash
cd ~/.claude/skills/gstack
git fetch origin
git checkout origin/<branch> # e.g. origin/v0.3.2
bun install # in case deps changed
bun run build # rebuild the binary
git checkout origin/<branch>
bun install && bun run build
```
Now open Claude Code in the other project — it picks up skills from
`~/.claude/skills/` automatically. To go back to main when you're done:
```bash
cd ~/.claude/skills/gstack
git checkout main && git pull
bun run build
```
### Vendored project copy (`.claude/skills/gstack/` checked into the project)
Some projects vendor gstack by copying it into the repo (no `.git` inside the
copy). Project-local skills take priority over global, so you need to update
the vendored copy too. This is a three-step process:
1. **Update your global install to the branch** (so you have the source):
```bash
cd ~/.claude/skills/gstack
git fetch origin
git checkout origin/<branch> # e.g. origin/v0.3.2
bun install && bun run build
```
2. **Replace the vendored copy** in the other project:
```bash
cd /path/to/other-project
# Remove old skill symlinks and vendored copy
for s in browse plan-ceo-review plan-eng-review review ship retro qa setup-browser-cookies; do
rm -f .claude/skills/$s
done
rm -rf .claude/skills/gstack
# Copy from global install (strips .git so it stays vendored)
cp -Rf ~/.claude/skills/gstack .claude/skills/gstack
rm -rf .claude/skills/gstack/.git
# Rebuild binary and re-create skill symlinks
cd .claude/skills/gstack && ./setup
```
3. **Test your changes** — open Claude Code in that project and use the skills.
To revert to main later, repeat steps 1-2 with `git checkout main && git pull`
instead of `git checkout origin/<branch>`.
This affects all projects. To revert: `git checkout main && git pull && bun run build`.
## Shipping your changes
+5 -4
View File
@@ -2,7 +2,7 @@
**gstack turns Claude Code from one generic assistant into a team of specialists you can summon on demand.**
Eight opinionated workflow skills for [Claude Code](https://docs.anthropic.com/en/docs/claude-code). Plan review, code review, one-command shipping, browser automation, QA testing, and engineering retrospectives — all as slash commands.
Nine opinionated workflow skills for [Claude Code](https://docs.anthropic.com/en/docs/claude-code). Plan review, code review, one-command shipping, browser automation, QA testing, and engineering retrospectives — all as slash commands.
### Without gstack
@@ -22,7 +22,8 @@ Eight opinionated workflow skills for [Claude Code](https://docs.anthropic.com/e
| `/review` | Paranoid staff engineer | Find the bugs that pass CI but blow up in production. Triages Greptile review comments. |
| `/ship` | Release engineer | Sync main, run tests, resolve Greptile reviews, push, open PR. For a ready branch, not for deciding what to build. |
| `/browse` | QA engineer | Give the agent eyes. It logs in, clicks through your app, takes screenshots, catches breakage. Full QA pass in 60 seconds. |
| `/qa` | QA lead | Systematic QA testing. On a feature branch, auto-analyzes your diff, identifies affected pages, and tests them. Also: full exploration, quick smoke test, regression mode. |
| `/qa` | QA + fix engineer | Test app, find bugs, fix them with atomic commits, re-verify. Before/after health scores and ship-readiness summary. Three tiers: Quick, Standard, Exhaustive. |
| `/qa-only` | QA reporter | Report-only QA testing. Same methodology as /qa but never fixes anything. Use when you want a pure bug report without code changes. |
| `/setup-browser-cookies` | Session manager | Import cookies from your real browser (Comet, Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages without logging in manually. |
| `/retro` | Engineering manager | Team-aware retro: your deep-dive + per-person praise and growth opportunities for every contributor. |
@@ -103,7 +104,7 @@ This is the setup I use. One person, ten parallel agents, each with the right co
Open Claude Code and paste this. Claude will do the rest.
> Install gstack: run `git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup` then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /plan-ceo-review, /plan-eng-review, /review, /ship, /browse, /qa, /setup-browser-cookies, /retro. Then ask the user if they also want to add gstack to the current project so teammates get it.
> Install gstack: run `git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup` then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /plan-ceo-review, /plan-eng-review, /review, /ship, /browse, /qa, /qa-only, /setup-browser-cookies, /retro. Then ask the user if they also want to add gstack to the current project so teammates get it.
### Step 2: Add to your repo so teammates get it (optional)
@@ -613,7 +614,7 @@ Or set `auto_upgrade: true` in `~/.gstack/config.yaml` to upgrade automatically
Paste this into Claude Code:
> Uninstall gstack: remove the skill symlinks by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa setup-browser-cookies; do rm -f ~/.claude/skills/$s; done` then run `rm -rf ~/.claude/skills/gstack` and remove the gstack section from CLAUDE.md. If this project also has gstack at .claude/skills/gstack, remove it by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa setup-browser-cookies; do rm -f .claude/skills/$s; done && rm -rf .claude/skills/gstack` and remove the gstack section from the project CLAUDE.md too.
> Uninstall gstack: remove the skill symlinks by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa qa-only setup-browser-cookies; do rm -f ~/.claude/skills/$s; done` then run `rm -rf ~/.claude/skills/gstack` and remove the gstack section from CLAUDE.md. If this project also has gstack at .claude/skills/gstack, remove it by running `for s in browse plan-ceo-review plan-eng-review review ship retro qa qa-only setup-browser-cookies; do rm -f .claude/skills/$s; done && rm -rf .claude/skills/gstack` and remove the gstack section from the project CLAUDE.md too.
## Development
+49 -1
View File
@@ -16,15 +16,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# gstack browse: QA Testing & Dogfooding
Persistent headless Chromium. First call auto-starts (~3s), then ~100-200ms per command.
+1 -1
View File
@@ -14,7 +14,7 @@ allowed-tools:
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# gstack browse: QA Testing & Dogfooding
+24
View File
@@ -390,6 +390,30 @@
**Priority:** P3
**Depends on:** Eval persistence (shipped in v0.3.6)
### CI/CD QA quality gate
**What:** Run `/qa` as a GitHub Action step, fail PR if health score drops below threshold.
**Why:** Automated quality gate catches regressions before merge. Currently QA is manual — CI integration makes it part of the standard workflow.
**Context:** Requires headless browse binary available in CI. The `/qa` skill already produces `baseline.json` with health scores — CI step would compare against the main branch baseline and fail if score drops. Would need `ANTHROPIC_API_KEY` in CI secrets since `/qa` uses Claude.
**Effort:** M
**Priority:** P2
**Depends on:** None
### CDP-based DOM mutation detection for ref staleness
**What:** Use Chrome DevTools Protocol `DOM.documentUpdated` / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit `snapshot` call.
**Why:** Current ref staleness detection (async count() check) only catches stale refs at action time. CDP mutation detection would proactively warn when refs become stale, preventing the 5-second timeout entirely for SPA re-renders.
**Context:** Parts 1+2 of ref staleness fix (RefEntry metadata + eager validation via count()) are shipped. This is Part 3 — the most ambitious piece. Requires CDP session alongside Playwright, MutationObserver bridge, and careful performance tuning to avoid overhead on every DOM change.
**Effort:** L
**Priority:** P3
**Depends on:** Ref staleness Parts 1+2 (shipped)
## Completed
### Phase 1: Foundations (v0.2.0)
+1 -1
View File
@@ -1 +1 @@
0.3.10
0.4.1
+49 -1
View File
@@ -16,15 +16,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# browse: QA Testing & Dogfooding
Persistent headless Chromium. First call auto-starts (~3s), then ~100ms per command.
+1 -1
View File
@@ -14,7 +14,7 @@ allowed-tools:
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# browse: QA Testing & Dogfooding
+20 -7
View File
@@ -18,6 +18,12 @@
import { chromium, type Browser, type BrowserContext, type Page, type Locator } from 'playwright';
import { addConsoleEntry, addNetworkEntry, addDialogEntry, networkBuffer, type DialogEntry } from './buffers';
export interface RefEntry {
locator: Locator;
role: string;
name: string;
}
export class BrowserManager {
private browser: Browser | null = null;
private context: BrowserContext | null = null;
@@ -31,7 +37,7 @@ export class BrowserManager {
public serverPort: number = 0;
// ─── Ref Map (snapshot → @e1, @e2, @c1, @c2, ...) ────────
private refMap: Map<string, Locator> = new Map();
private refMap: Map<string, RefEntry> = new Map();
// ─── Snapshot Diffing ─────────────────────────────────────
// NOT cleared on navigation — it's a text baseline for diffing
@@ -169,7 +175,7 @@ export class BrowserManager {
}
// ─── Ref Map ──────────────────────────────────────────────
setRefMap(refs: Map<string, Locator>) {
setRefMap(refs: Map<string, RefEntry>) {
this.refMap = refs;
}
@@ -181,16 +187,23 @@ export class BrowserManager {
* Resolve a selector that may be a @ref (e.g., "@e3", "@c1") or a CSS selector.
* Returns { locator } for refs or { selector } for CSS selectors.
*/
resolveRef(selector: string): { locator: Locator } | { selector: string } {
async resolveRef(selector: string): Promise<{ locator: Locator } | { selector: string }> {
if (selector.startsWith('@e') || selector.startsWith('@c')) {
const ref = selector.slice(1); // "e3" or "c1"
const locator = this.refMap.get(ref);
if (!locator) {
const entry = this.refMap.get(ref);
if (!entry) {
throw new Error(
`Ref ${selector} not found. Page may have changed — run 'snapshot' to get fresh refs.`
`Ref ${selector} not found. Run 'snapshot' to get fresh refs.`
);
}
return { locator };
const count = await entry.locator.count();
if (count === 0) {
throw new Error(
`Ref ${selector} (${entry.role} "${entry.name}") is stale — element no longer exists. ` +
`Run 'snapshot' for fresh refs.`
);
}
return { locator: entry.locator };
}
return { selector };
}
+1 -1
View File
@@ -150,7 +150,7 @@ export async function handleMetaCommand(
}
if (targetSelector) {
const resolved = bm.resolveRef(targetSelector);
const resolved = await bm.resolveRef(targetSelector);
const locator = 'locator' in resolved ? resolved.locator : page.locator(resolved.selector);
await locator.screenshot({ path: outputPath, timeout: 5000 });
return `Screenshot saved (element): ${outputPath}`;
+4 -4
View File
@@ -61,7 +61,7 @@ export async function handleReadCommand(
case 'html': {
const selector = args[0];
if (selector) {
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
return await resolved.locator.innerHTML({ timeout: 5000 });
}
@@ -135,7 +135,7 @@ export async function handleReadCommand(
case 'css': {
const [selector, property] = args;
if (!selector || !property) throw new Error('Usage: browse css <selector> <property>');
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
const value = await resolved.locator.evaluate(
(el, prop) => getComputedStyle(el).getPropertyValue(prop),
@@ -157,7 +157,7 @@ export async function handleReadCommand(
case 'attrs': {
const selector = args[0];
if (!selector) throw new Error('Usage: browse attrs <selector>');
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
const attrs = await resolved.locator.evaluate((el) => {
const result: Record<string, string> = {};
@@ -221,7 +221,7 @@ export async function handleReadCommand(
const selector = args[1];
if (!property || !selector) throw new Error('Usage: browse is <property> <selector>\nProperties: visible, hidden, enabled, disabled, checked, editable, focused');
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
let locator;
if ('locator' in resolved) {
locator = resolved.locator;
+6 -6
View File
@@ -18,7 +18,7 @@
*/
import type { Page, Locator } from 'playwright';
import type { BrowserManager } from './browser-manager';
import type { BrowserManager, RefEntry } from './browser-manager';
import * as Diff from 'diff';
// Roles considered "interactive" for the -i flag
@@ -154,7 +154,7 @@ export async function handleSnapshot(
// Parse the ariaSnapshot output
const lines = ariaText.split('\n');
const refMap = new Map<string, Locator>();
const refMap = new Map<string, RefEntry>();
const output: string[] = [];
let refCounter = 1;
@@ -218,7 +218,7 @@ export async function handleSnapshot(
locator = locator.nth(seenIndex);
}
refMap.set(ref, locator);
refMap.set(ref, { locator, role: node.role, name: node.name || '' });
// Format output line
let outputLine = `${indent}@${ref} [${node.role}]`;
@@ -287,7 +287,7 @@ export async function handleSnapshot(
for (const elem of cursorElements) {
const ref = `c${cRefCounter++}`;
const locator = page.locator(elem.selector);
refMap.set(ref, locator);
refMap.set(ref, { locator, role: 'cursor-interactive', name: elem.text });
output.push(`@${ref} [${elem.reason}] "${elem.text}"`);
}
}
@@ -318,9 +318,9 @@ export async function handleSnapshot(
try {
// Inject overlay divs at each ref's bounding box
const boxes: Array<{ ref: string; box: { x: number; y: number; width: number; height: number } }> = [];
for (const [ref, locator] of refMap) {
for (const [ref, entry] of refMap) {
try {
const box = await locator.boundingBox({ timeout: 1000 });
const box = await entry.locator.boundingBox({ timeout: 1000 });
if (box) {
boxes.push({ ref: `@${ref}`, box });
}
+7 -7
View File
@@ -44,7 +44,7 @@ export async function handleWriteCommand(
case 'click': {
const selector = args[0];
if (!selector) throw new Error('Usage: browse click <selector>');
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
await resolved.locator.click({ timeout: 5000 });
} else {
@@ -59,7 +59,7 @@ export async function handleWriteCommand(
const [selector, ...valueParts] = args;
const value = valueParts.join(' ');
if (!selector || !value) throw new Error('Usage: browse fill <selector> <value>');
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
await resolved.locator.fill(value, { timeout: 5000 });
} else {
@@ -72,7 +72,7 @@ export async function handleWriteCommand(
const [selector, ...valueParts] = args;
const value = valueParts.join(' ');
if (!selector || !value) throw new Error('Usage: browse select <selector> <value>');
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
await resolved.locator.selectOption(value, { timeout: 5000 });
} else {
@@ -84,7 +84,7 @@ export async function handleWriteCommand(
case 'hover': {
const selector = args[0];
if (!selector) throw new Error('Usage: browse hover <selector>');
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
await resolved.locator.hover({ timeout: 5000 });
} else {
@@ -110,7 +110,7 @@ export async function handleWriteCommand(
case 'scroll': {
const selector = args[0];
if (selector) {
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
await resolved.locator.scrollIntoViewIfNeeded({ timeout: 5000 });
} else {
@@ -139,7 +139,7 @@ export async function handleWriteCommand(
return 'DOM content loaded';
}
const timeout = args[1] ? parseInt(args[1], 10) : 15000;
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
await resolved.locator.waitFor({ state: 'visible', timeout });
} else {
@@ -204,7 +204,7 @@ export async function handleWriteCommand(
if (!fs.existsSync(fp)) throw new Error(`File not found: ${fp}`);
}
const resolved = bm.resolveRef(selector);
const resolved = await bm.resolveRef(selector);
if ('locator' in resolved) {
await resolved.locator.setInputFiles(filePaths);
} else {
+49
View File
@@ -201,6 +201,55 @@ describe('Ref invalidation', () => {
});
});
// ─── Ref Staleness Detection ────────────────────────────────────
describe('Ref staleness detection', () => {
test('ref metadata stores role and name', async () => {
await handleWriteCommand('goto', [baseUrl + '/snapshot.html'], bm);
await handleMetaCommand('snapshot', ['-i'], bm, shutdown);
// Refs should exist with metadata
expect(bm.getRefCount()).toBeGreaterThan(0);
});
test('stale ref after DOM removal gives descriptive error', async () => {
await handleWriteCommand('goto', [baseUrl + '/snapshot.html'], bm);
const snap = await handleMetaCommand('snapshot', ['-i'], bm, shutdown);
// Find a button ref
const buttonLine = snap.split('\n').find(l => l.includes('[button]') && l.includes('"Submit"'));
expect(buttonLine).toBeDefined();
const refMatch = buttonLine!.match(/@(e\d+)/);
expect(refMatch).toBeDefined();
const ref = `@${refMatch![1]}`;
// Remove the button from DOM (simulates SPA re-render)
await handleReadCommand('js', ['document.querySelector("button[type=submit]").remove()'], bm);
// Try to click — should get descriptive staleness error
try {
await handleWriteCommand('click', [ref], bm);
expect(true).toBe(false); // Should not reach here
} catch (err: any) {
expect(err.message).toContain('stale');
expect(err.message).toContain('button');
expect(err.message).toContain('Submit');
expect(err.message).toContain('snapshot');
}
});
test('valid ref still resolves normally after staleness check', async () => {
await handleWriteCommand('goto', [baseUrl + '/snapshot.html'], bm);
const snap = await handleMetaCommand('snapshot', ['-i'], bm, shutdown);
const linkLine = snap.split('\n').find(l => l.includes('[link]'));
expect(linkLine).toBeDefined();
const refMatch = linkLine!.match(/@(e\d+)/);
const ref = `@${refMatch![1]}`;
// Should work normally — element still exists
const result = await handleWriteCommand('hover', [ref], bm);
expect(result).toContain('Hovered');
});
});
// ─── Snapshot Diffing ──────────────────────────────────────────
describe('Snapshot diff', () => {
+51 -7
View File
@@ -16,15 +16,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# Mega Plan Review Mode
## Philosophy
@@ -365,16 +413,13 @@ Evaluate:
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
## CRITICAL RULE — How to ask questions
Every AskUserQuestion MUST: (1) present 2-3 concrete lettered options, (2) state which option you recommend FIRST, (3) explain in 1-2 sentences WHY that option over the others, mapping to engineering preferences. No batching multiple issues into one question. No yes/no questions. Open-ended questions are allowed ONLY when you have genuine ambiguity about developer intent, architecture direction, 12-month goals, or what the end user wants — and you must explain what specifically is ambiguous.
## For Each Issue You Find
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where reasonable.
* For each option: effort, risk, and maintenance burden in one line.
* **Lead with your recommendation.** State it as a directive: "Do B. Here's why:" — not "Option B might be worth considering." Be opinionated. I'm paying for your judgment, not a menu.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference.
* **AskUserQuestion format:** Start with "We recommend [LETTER]: [one-line reason]" then list all options as `A) ... B) ... C) ...`. Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs.
## Required Outputs
@@ -465,7 +510,6 @@ If any AskUserQuestion goes unanswered, note it here. Never silently default.
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* Recommended option always listed first.
* One sentence max per option.
* After each section, pause and wait for feedback.
* Use **CRITICAL GAP** / **WARNING** / **OK** for scannability.
+3 -7
View File
@@ -14,7 +14,7 @@ allowed-tools:
- AskUserQuestion
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# Mega Plan Review Mode
@@ -356,16 +356,13 @@ Evaluate:
**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
## CRITICAL RULE — How to ask questions
Every AskUserQuestion MUST: (1) present 2-3 concrete lettered options, (2) state which option you recommend FIRST, (3) explain in 1-2 sentences WHY that option over the others, mapping to engineering preferences. No batching multiple issues into one question. No yes/no questions. Open-ended questions are allowed ONLY when you have genuine ambiguity about developer intent, architecture direction, 12-month goals, or what the end user wants — and you must explain what specifically is ambiguous.
## For Each Issue You Find
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 2-3 options, including "do nothing" where reasonable.
* For each option: effort, risk, and maintenance burden in one line.
* **Lead with your recommendation.** State it as a directive: "Do B. Here's why:" — not "Option B might be worth considering." Be opinionated. I'm paying for your judgment, not a menu.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference.
* **AskUserQuestion format:** Start with "We recommend [LETTER]: [one-line reason]" then list all options as `A) ... B) ... C) ...`. Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs.
## Required Outputs
@@ -456,7 +453,6 @@ If any AskUserQuestion goes unanswered, note it here. Never silently default.
## Formatting Rules
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* Recommended option always listed first.
* One sentence max per option.
* After each section, pause and wait for feedback.
* Use **CRITICAL GAP** / **WARNING** / **OK** for scannability.
+92 -12
View File
@@ -7,6 +7,7 @@ description: |
issues interactively with opinionated recommendations.
allowed-tools:
- Read
- Write
- Grep
- Glob
- AskUserQuestion
@@ -15,15 +16,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# Plan Review Mode
Review this plan thoroughly before making any code changes. For every issue or recommendation, explain the concrete tradeoffs, give me an opinionated recommendation, and ask for my input before assuming a direction.
@@ -92,6 +141,41 @@ For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in C
**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved.
### Test Plan Artifact
After producing the test diagram, write a test plan artifact to the project directory so `/qa` and `/qa-only` can consume it as primary test input (replacing the lossy git-diff heuristic):
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
BRANCH=$(git rev-parse --abbrev-ref HEAD)
USER=$(whoami)
DATETIME=$(date +%Y%m%d-%H%M%S)
mkdir -p ~/.gstack/projects/$SLUG
```
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-plan-{datetime}.md`:
```markdown
# Test Plan
Generated by /plan-eng-review on {date}
Branch: {branch}
Repo: {owner/repo}
## Affected Pages/Routes
- {URL path} — {what to test and why}
## Key Interactions to Verify
- {interaction description} on {page}
## Edge Cases
- {edge case} on {page}
## Critical Paths
- {end-to-end flow that must work}
```
This file is consumed by `/qa` and `/qa-only` as primary test input. Include only the information that helps a QA tester know **what to test and where** — not implementation details.
### 4. Performance review
Evaluate:
* N+1 queries and database access patterns.
@@ -102,18 +186,15 @@ Evaluate:
**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved.
## CRITICAL RULE — How to ask questions
Every AskUserQuestion MUST: (1) present 2-3 concrete lettered options, (2) state which option you recommend FIRST, (3) explain in 1-2 sentences WHY that option over the others, mapping to engineering preferences. No batching multiple issues into one question. No yes/no questions. Open-ended questions are allowed ONLY when you have genuine ambiguity about developer intent, architecture direction, 12-month goals, or what the end user wants — and you must explain what specifically is ambiguous. **Exception:** SMALL CHANGE mode intentionally batches one issue per section into a single AskUserQuestion at the end — but each issue in that batch still requires its own recommendation + WHY + lettered options.
## For each issue you find
For every specific issue (bug, smell, design concern, or risk):
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 23 options, including "do nothing" where that's reasonable.
* Present 2-3 options, including "do nothing" where that's reasonable.
* For each option, specify in one line: effort, risk, and maintenance burden.
* **Lead with your recommendation.** State it as a directive: "Do B. Here's why:" — not "Option B might be worth considering." Be opinionated. I'm paying for your judgment, not a menu.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference (DRY, explicit > clever, minimal diff, etc.).
* **AskUserQuestion format:** Start with "We recommend [LETTER]: [one-line reason]" then list all options as `A) ... B) ... C) ...`. Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs.
* **Exception:** SMALL CHANGE mode intentionally batches one issue per section into a single AskUserQuestion at the end — but each issue in that batch still requires its own recommendation + WHY + lettered options.
## Required outputs
@@ -165,10 +246,9 @@ At the end of the review, fill in and display this summary so the user can see a
Check the git log for this branch. If there are prior commits suggesting a previous review cycle (e.g., review-driven refactors, reverted changes), note what was changed and whether the current plan touches the same areas. Be more aggressive reviewing areas that were previously problematic.
## Formatting rules
* NUMBER issues (1, 2, 3...) and give LETTERS for options (A, B, C...).
* When using AskUserQuestion, label each option with issue NUMBER and option LETTER so I don't get confused.
* Recommended option is always listed first.
* Keep each option to one sentence max. I should be able to pick in under 5 seconds.
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option. Pick in under 5 seconds.
* After each review section, pause and ask for feedback before moving on.
## Unresolved decisions
+44 -12
View File
@@ -7,13 +7,14 @@ description: |
issues interactively with opinionated recommendations.
allowed-tools:
- Read
- Write
- Grep
- Glob
- AskUserQuestion
- Bash
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# Plan Review Mode
@@ -83,6 +84,41 @@ For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in C
**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved.
### Test Plan Artifact
After producing the test diagram, write a test plan artifact to the project directory so `/qa` and `/qa-only` can consume it as primary test input (replacing the lossy git-diff heuristic):
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
BRANCH=$(git rev-parse --abbrev-ref HEAD)
USER=$(whoami)
DATETIME=$(date +%Y%m%d-%H%M%S)
mkdir -p ~/.gstack/projects/$SLUG
```
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-plan-{datetime}.md`:
```markdown
# Test Plan
Generated by /plan-eng-review on {date}
Branch: {branch}
Repo: {owner/repo}
## Affected Pages/Routes
- {URL path} — {what to test and why}
## Key Interactions to Verify
- {interaction description} on {page}
## Edge Cases
- {edge case} on {page}
## Critical Paths
- {end-to-end flow that must work}
```
This file is consumed by `/qa` and `/qa-only` as primary test input. Include only the information that helps a QA tester know **what to test and where** — not implementation details.
### 4. Performance review
Evaluate:
* N+1 queries and database access patterns.
@@ -93,18 +129,15 @@ Evaluate:
**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved.
## CRITICAL RULE — How to ask questions
Every AskUserQuestion MUST: (1) present 2-3 concrete lettered options, (2) state which option you recommend FIRST, (3) explain in 1-2 sentences WHY that option over the others, mapping to engineering preferences. No batching multiple issues into one question. No yes/no questions. Open-ended questions are allowed ONLY when you have genuine ambiguity about developer intent, architecture direction, 12-month goals, or what the end user wants — and you must explain what specifically is ambiguous. **Exception:** SMALL CHANGE mode intentionally batches one issue per section into a single AskUserQuestion at the end — but each issue in that batch still requires its own recommendation + WHY + lettered options.
## For each issue you find
For every specific issue (bug, smell, design concern, or risk):
Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews:
* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question.
* Describe the problem concretely, with file and line references.
* Present 23 options, including "do nothing" where that's reasonable.
* Present 2-3 options, including "do nothing" where that's reasonable.
* For each option, specify in one line: effort, risk, and maintenance burden.
* **Lead with your recommendation.** State it as a directive: "Do B. Here's why:" — not "Option B might be worth considering." Be opinionated. I'm paying for your judgment, not a menu.
* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference (DRY, explicit > clever, minimal diff, etc.).
* **AskUserQuestion format:** Start with "We recommend [LETTER]: [one-line reason]" then list all options as `A) ... B) ... C) ...`. Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* Label with issue NUMBER + option LETTER (e.g., "3A", "3B").
* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs.
* **Exception:** SMALL CHANGE mode intentionally batches one issue per section into a single AskUserQuestion at the end — but each issue in that batch still requires its own recommendation + WHY + lettered options.
## Required outputs
@@ -156,10 +189,9 @@ At the end of the review, fill in and display this summary so the user can see a
Check the git log for this branch. If there are prior commits suggesting a previous review cycle (e.g., review-driven refactors, reverted changes), note what was changed and whether the current plan touches the same areas. Be more aggressive reviewing areas that were previously problematic.
## Formatting rules
* NUMBER issues (1, 2, 3...) and give LETTERS for options (A, B, C...).
* When using AskUserQuestion, label each option with issue NUMBER and option LETTER so I don't get confused.
* Recommended option is always listed first.
* Keep each option to one sentence max. I should be able to pick in under 5 seconds.
* NUMBER issues (1, 2, 3...) and LETTERS for options (A, B, C...).
* Label with NUMBER + LETTER (e.g., "3A", "3B").
* One sentence max per option. Pick in under 5 seconds.
* After each review section, pause and ask for feedback before moving on.
## Unresolved decisions
+445
View File
@@ -0,0 +1,445 @@
---
name: qa-only
version: 1.0.0
description: |
Report-only QA testing. Systematically tests a web application and produces a
structured report with health score, screenshots, and repro steps — but never
fixes anything. Use when asked to "just report bugs", "qa report only", or
"test but don't fix". For the full test-fix-verify loop, use /qa instead.
allowed-tools:
- Bash
- Read
- Write
- AskUserQuestion
---
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# /qa-only: Report-Only QA Testing
You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence. **NEVER fix anything.**
## Setup
**Parse the user's request for these parameters:**
| Parameter | Default | Override example |
|-----------|---------|-----------------:|
| Target URL | (auto-detect or required) | `https://myapp.com`, `http://localhost:3000` |
| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
| Scope | Full app (or diff-scoped) | `Focus on the billing page` |
| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
**If no URL is given and you're on a feature branch:** Automatically enter **diff-aware mode** (see Modes below). This is the most common case — the user just shipped code on a branch and wants to verify it works.
**Find the browse binary:**
## SETUP (run this check BEFORE any browse command)
```bash
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
B=""
[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
if [ -x "$B" ]; then
echo "READY: $B"
else
echo "NEEDS_SETUP"
fi
```
If `NEEDS_SETUP`:
1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
2. Run: `cd <SKILL_DIR> && ./setup`
3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
**Create output directories:**
```bash
REPORT_DIR=".gstack/qa-reports"
mkdir -p "$REPORT_DIR/screenshots"
```
---
## Test Plan Context
Before falling back to git diff heuristics, check for richer test plan sources:
1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
```
2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
3. **Use whichever source is richer.** Fall back to git diff analysis only if neither is available.
---
## Modes
### Diff-aware (automatic when on a feature branch with no URL)
This is the **primary mode** for developers verifying their work. When the user says `/qa` without a URL and the repo is on a feature branch, automatically:
1. **Analyze the branch diff** to understand what changed:
```bash
git diff main...HEAD --name-only
git log main..HEAD --oneline
```
2. **Identify affected pages/routes** from the changed files:
- Controller/route files → which URL paths they serve
- View/template/component files → which pages render them
- Model/service files → which pages use those models (check controllers that reference them)
- CSS/style files → which pages include those stylesheets
- API endpoints → test them directly with `$B js "await fetch('/api/...')"`
- Static pages (markdown, HTML) → navigate to them directly
3. **Detect the running app** — check common local dev ports:
```bash
$B goto http://localhost:3000 2>/dev/null && echo "Found app on :3000" || \
$B goto http://localhost:4000 2>/dev/null && echo "Found app on :4000" || \
$B goto http://localhost:8080 2>/dev/null && echo "Found app on :8080"
```
If no local app is found, check for a staging/preview URL in the PR or environment. If nothing works, ask the user for the URL.
4. **Test each affected page/route:**
- Navigate to the page
- Take a screenshot
- Check console for errors
- If the change was interactive (forms, buttons, flows), test the interaction end-to-end
- Use `snapshot -D` before and after actions to verify the change had the expected effect
5. **Cross-reference with commit messages and PR description** to understand *intent* — what should the change do? Verify it actually does that.
6. **Check TODOS.md** (if it exists) for known bugs or issues related to the changed files. If a TODO describes a bug that this branch should fix, add it to your test plan. If you find a new bug during QA that isn't in TODOS.md, note it in the report.
7. **Report findings** scoped to the branch changes:
- "Changes tested: N pages/routes affected by this branch"
- For each: does it work? Screenshot evidence.
- Any regressions on adjacent pages?
**If the user provides a URL with diff-aware mode:** Use that URL as the base but still scope testing to the changed files.
### Full (default when URL is provided)
Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size.
### Quick (`--quick`)
30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? Produce health score. No detailed issue documentation.
### Regression (`--regression <baseline>`)
Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report.
---
## Workflow
### Phase 1: Initialize
1. Find browse binary (see Setup above)
2. Create output directories
3. Copy report template from `qa/templates/qa-report-template.md` to output dir
4. Start timer for duration tracking
### Phase 2: Authenticate (if needed)
**If the user specified auth credentials:**
```bash
$B goto <login-url>
$B snapshot -i # find the login form
$B fill @e3 "user@example.com"
$B fill @e4 "[REDACTED]" # NEVER include real passwords in report
$B click @e5 # submit
$B snapshot -D # verify login succeeded
```
**If the user provided a cookie file:**
```bash
$B cookie-import cookies.json
$B goto <target-url>
```
**If 2FA/OTP is required:** Ask the user for the code and wait.
**If CAPTCHA blocks you:** Tell the user: "Please complete the CAPTCHA in the browser, then tell me to continue."
### Phase 3: Orient
Get a map of the application:
```bash
$B goto <target-url>
$B snapshot -i -a -o "$REPORT_DIR/screenshots/initial.png"
$B links # map navigation structure
$B console --errors # any errors on landing?
```
**Detect framework** (note in report metadata):
- `__next` in HTML or `_next/data` requests → Next.js
- `csrf-token` meta tag → Rails
- `wp-content` in URLs → WordPress
- Client-side routing with no page reloads → SPA
**For SPAs:** The `links` command may return few results because navigation is client-side. Use `snapshot -i` to find nav elements (buttons, menu items) instead.
### Phase 4: Explore
Visit pages systematically. At each page:
```bash
$B goto <page-url>
$B snapshot -i -a -o "$REPORT_DIR/screenshots/page-name.png"
$B console --errors
```
Then follow the **per-page exploration checklist** (see `qa/references/issue-taxonomy.md`):
1. **Visual scan** — Look at the annotated screenshot for layout issues
2. **Interactive elements** — Click buttons, links, controls. Do they work?
3. **Forms** — Fill and submit. Test empty, invalid, edge cases
4. **Navigation** — Check all paths in and out
5. **States** — Empty state, loading, error, overflow
6. **Console** — Any new JS errors after interactions?
7. **Responsiveness** — Check mobile viewport if relevant:
```bash
$B viewport 375x812
$B screenshot "$REPORT_DIR/screenshots/page-mobile.png"
$B viewport 1280x720
```
**Depth judgment:** Spend more time on core features (homepage, dashboard, checkout, search) and less on secondary pages (about, terms, privacy).
**Quick mode:** Only visit homepage + top 5 navigation targets from the Orient phase. Skip the per-page checklist — just check: loads? Console errors? Broken links visible?
### Phase 5: Document
Document each issue **immediately when found** — don't batch them.
**Two evidence tiers:**
**Interactive bugs** (broken flows, dead buttons, form failures):
1. Take a screenshot before the action
2. Perform the action
3. Take a screenshot showing the result
4. Use `snapshot -D` to show what changed
5. Write repro steps referencing screenshots
```bash
$B screenshot "$REPORT_DIR/screenshots/issue-001-step-1.png"
$B click @e5
$B screenshot "$REPORT_DIR/screenshots/issue-001-result.png"
$B snapshot -D
```
**Static bugs** (typos, layout issues, missing images):
1. Take a single annotated screenshot showing the problem
2. Describe what's wrong
```bash
$B snapshot -i -a -o "$REPORT_DIR/screenshots/issue-002.png"
```
**Write each issue to the report immediately** using the template format from `qa/templates/qa-report-template.md`.
### Phase 6: Wrap Up
1. **Compute health score** using the rubric below
2. **Write "Top 3 Things to Fix"** — the 3 highest-severity issues
3. **Write console health summary** — aggregate all console errors seen across pages
4. **Update severity counts** in the summary table
5. **Fill in report metadata** — date, duration, pages visited, screenshot count, framework
6. **Save baseline** — write `baseline.json` with:
```json
{
"date": "YYYY-MM-DD",
"url": "<target>",
"healthScore": N,
"issues": [{ "id": "ISSUE-001", "title": "...", "severity": "...", "category": "..." }],
"categoryScores": { "console": N, "links": N, ... }
}
```
**Regression mode:** After writing the report, load the baseline file. Compare:
- Health score delta
- Issues fixed (in baseline but not current)
- New issues (in current but not baseline)
- Append the regression section to the report
---
## Health Score Rubric
Compute each category score (0-100), then take the weighted average.
### Console (weight: 15%)
- 0 errors → 100
- 1-3 errors → 70
- 4-10 errors → 40
- 10+ errors → 10
### Links (weight: 10%)
- 0 broken → 100
- Each broken link → -15 (minimum 0)
### Per-Category Scoring (Visual, Functional, UX, Content, Performance, Accessibility)
Each category starts at 100. Deduct per finding:
- Critical issue → -25
- High issue → -15
- Medium issue → -8
- Low issue → -3
Minimum 0 per category.
### Weights
| Category | Weight |
|----------|--------|
| Console | 15% |
| Links | 10% |
| Visual | 10% |
| Functional | 20% |
| UX | 15% |
| Performance | 10% |
| Content | 5% |
| Accessibility | 15% |
### Final Score
`score = Σ (category_score × weight)`
---
## Framework-Specific Guidance
### Next.js
- Check console for hydration errors (`Hydration failed`, `Text content did not match`)
- Monitor `_next/data` requests in network — 404s indicate broken data fetching
- Test client-side navigation (click links, don't just `goto`) — catches routing issues
- Check for CLS (Cumulative Layout Shift) on pages with dynamic content
### Rails
- Check for N+1 query warnings in console (if development mode)
- Verify CSRF token presence in forms
- Test Turbo/Stimulus integration — do page transitions work smoothly?
- Check for flash messages appearing and dismissing correctly
### WordPress
- Check for plugin conflicts (JS errors from different plugins)
- Verify admin bar visibility for logged-in users
- Test REST API endpoints (`/wp-json/`)
- Check for mixed content warnings (common with WP)
### General SPA (React, Vue, Angular)
- Use `snapshot -i` for navigation — `links` command misses client-side routes
- Check for stale state (navigate away and back — does data refresh?)
- Test browser back/forward — does the app handle history correctly?
- Check for memory leaks (monitor console after extended use)
---
## Important Rules
1. **Repro is everything.** Every issue needs at least one screenshot. No exceptions.
2. **Verify before documenting.** Retry the issue once to confirm it's reproducible, not a fluke.
3. **Never include credentials.** Write `[REDACTED]` for passwords in repro steps.
4. **Write incrementally.** Append each issue to the report as you find it. Don't batch.
5. **Never read source code.** Test as a user, not a developer.
6. **Check console after every interaction.** JS errors that don't surface visually are still bugs.
7. **Test like a user.** Use realistic data. Walk through complete workflows end-to-end.
8. **Depth over breadth.** 5-10 well-documented issues with evidence > 20 vague descriptions.
9. **Never delete output files.** Screenshots and reports accumulate — that's intentional.
10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses.
---
## Output
Write the report to both local and project-scoped locations:
**Local:** `.gstack/qa-reports/qa-report-{domain}-{YYYY-MM-DD}.md`
**Project-scoped:** Write test outcome artifact for cross-session context:
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
mkdir -p ~/.gstack/projects/$SLUG
```
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`
### Output Structure
```
.gstack/qa-reports/
├── qa-report-{domain}-{YYYY-MM-DD}.md # Structured report
├── screenshots/
│ ├── initial.png # Landing page annotated screenshot
│ ├── issue-001-step-1.png # Per-issue evidence
│ ├── issue-001-result.png
│ └── ...
└── baseline.json # For regression mode
```
Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
---
## Additional Rules (qa-only specific)
11. **Never fix bugs.** Find and document only. Do not read source code, edit files, or suggest fixes in the report. Your job is to report what's broken, not to fix it. Use `/qa` for the test-fix-verify loop.
+99
View File
@@ -0,0 +1,99 @@
---
name: qa-only
version: 1.0.0
description: |
Report-only QA testing. Systematically tests a web application and produces a
structured report with health score, screenshots, and repro steps — but never
fixes anything. Use when asked to "just report bugs", "qa report only", or
"test but don't fix". For the full test-fix-verify loop, use /qa instead.
allowed-tools:
- Bash
- Read
- Write
- AskUserQuestion
---
{{PREAMBLE}}
# /qa-only: Report-Only QA Testing
You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence. **NEVER fix anything.**
## Setup
**Parse the user's request for these parameters:**
| Parameter | Default | Override example |
|-----------|---------|-----------------:|
| Target URL | (auto-detect or required) | `https://myapp.com`, `http://localhost:3000` |
| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
| Scope | Full app (or diff-scoped) | `Focus on the billing page` |
| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
**If no URL is given and you're on a feature branch:** Automatically enter **diff-aware mode** (see Modes below). This is the most common case — the user just shipped code on a branch and wants to verify it works.
**Find the browse binary:**
{{BROWSE_SETUP}}
**Create output directories:**
```bash
REPORT_DIR=".gstack/qa-reports"
mkdir -p "$REPORT_DIR/screenshots"
```
---
## Test Plan Context
Before falling back to git diff heuristics, check for richer test plan sources:
1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
```
2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
3. **Use whichever source is richer.** Fall back to git diff analysis only if neither is available.
---
{{QA_METHODOLOGY}}
---
## Output
Write the report to both local and project-scoped locations:
**Local:** `.gstack/qa-reports/qa-report-{domain}-{YYYY-MM-DD}.md`
**Project-scoped:** Write test outcome artifact for cross-session context:
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
mkdir -p ~/.gstack/projects/$SLUG
```
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`
### Output Structure
```
.gstack/qa-reports/
├── qa-report-{domain}-{YYYY-MM-DD}.md # Structured report
├── screenshots/
│ ├── initial.png # Landing page annotated screenshot
│ ├── issue-001-step-1.png # Per-issue evidence
│ ├── issue-001-result.png
│ └── ...
└── baseline.json # For regression mode
```
Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
---
## Additional Rules (qa-only specific)
11. **Never fix bugs.** Find and document only. Do not read source code, edit files, or suggest fixes in the report. Your job is to report what's broken, not to fix it. Use `/qa` for the test-fix-verify loop.
+238 -11
View File
@@ -1,48 +1,114 @@
---
name: qa
version: 1.0.0
version: 2.0.0
description: |
Systematically QA test a web application. Use when asked to "qa", "QA", "test this site",
"find bugs", "dogfood", or review quality. Four modes: diff-aware (automatic on feature
branches — analyzes git diff, identifies affected pages, tests them), full (systematic
exploration), quick (30-second smoke test), regression (compare against baseline). Produces
structured report with health score, screenshots, and repro steps.
Systematically QA test a web application and fix bugs found. Runs QA testing,
then iteratively fixes bugs in source code, committing each fix atomically and
re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs",
"test and fix", or "fix what's broken". Three tiers: Quick (critical/high only),
Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,
fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only.
allowed-tools:
- Bash
- Read
- Write
- Edit
- Glob
- Grep
- AskUserQuestion
---
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
# /qa: Systematic QA Testing
## AskUserQuestion Format
You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence.
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# /qa: Test → Fix → Verify
You are a QA engineer AND a bug-fix engineer. Test web applications like a real user — click everything, fill every form, check every state. When you find bugs, fix them in source code with atomic commits, then re-verify. Produce a structured report with before/after evidence.
## Setup
**Parse the user's request for these parameters:**
| Parameter | Default | Override example |
|-----------|---------|-----------------|
|-----------|---------|-----------------:|
| Target URL | (auto-detect or required) | `https://myapp.com`, `http://localhost:3000` |
| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
| Tier | Standard | `--quick`, `--exhaustive` |
| Mode | full | `--regression .gstack/qa-reports/baseline.json` |
| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
| Scope | Full app (or diff-scoped) | `Focus on the billing page` |
| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
**Tiers determine which issues get fixed:**
- **Quick:** Fix critical + high severity only
- **Standard:** + medium severity (default)
- **Exhaustive:** + low/cosmetic severity
**If no URL is given and you're on a feature branch:** Automatically enter **diff-aware mode** (see Modes below). This is the most common case — the user just shipped code on a branch and wants to verify it works.
**Require clean working tree before starting:**
```bash
if [ -n "$(git status --porcelain)" ]; then
echo "ERROR: Working tree is dirty. Commit or stash changes before running /qa."
exit 1
fi
```
**Find the browse binary:**
## SETUP (run this check BEFORE any browse command)
@@ -73,6 +139,22 @@ mkdir -p "$REPORT_DIR/screenshots"
---
## Test Plan Context
Before falling back to git diff heuristics, check for richer test plan sources:
1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
```
2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
3. **Use whichever source is richer.** Fall back to git diff analysis only if neither is available.
---
## Phases 1-6: QA Baseline
## Modes
### Diff-aware (automatic when on a feature branch with no URL)
@@ -363,6 +445,8 @@ Minimum 0 per category.
9. **Never delete output files.** Screenshots and reports accumulate — that's intentional.
10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses.
Record baseline health score at end of Phase 6.
---
## Output Structure
@@ -374,8 +458,151 @@ Minimum 0 per category.
│ ├── initial.png # Landing page annotated screenshot
│ ├── issue-001-step-1.png # Per-issue evidence
│ ├── issue-001-result.png
│ ├── issue-001-before.png # Before fix (if fixed)
│ ├── issue-001-after.png # After fix (if fixed)
│ └── ...
└── baseline.json # For regression mode
```
Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
---
## Phase 7: Triage
Sort all discovered issues by severity, then decide which to fix based on the selected tier:
- **Quick:** Fix critical + high only. Mark medium/low as "deferred."
- **Standard:** Fix critical + high + medium. Mark low as "deferred."
- **Exhaustive:** Fix all, including cosmetic/low severity.
Mark issues that cannot be fixed from source code (e.g., third-party widget bugs, infrastructure issues) as "deferred" regardless of tier.
---
## Phase 8: Fix Loop
For each fixable issue, in severity order:
### 8a. Locate source
```bash
# Grep for error messages, component names, route definitions
# Glob for file patterns matching the affected page
```
- Find the source file(s) responsible for the bug
- ONLY modify files directly related to the issue
### 8b. Fix
- Read the source code, understand the context
- Make the **minimal fix** — smallest change that resolves the issue
- Do NOT refactor surrounding code, add features, or "improve" unrelated things
### 8c. Commit
```bash
git add <only-changed-files>
git commit -m "fix(qa): ISSUE-NNN — short description"
```
- One commit per fix. Never bundle multiple fixes.
- Message format: `fix(qa): ISSUE-NNN — short description`
### 8d. Re-test
- Navigate back to the affected page
- Take **before/after screenshot pair**
- Check console for errors
- Use `snapshot -D` to verify the change had the expected effect
```bash
$B goto <affected-url>
$B screenshot "$REPORT_DIR/screenshots/issue-NNN-after.png"
$B console --errors
$B snapshot -D
```
### 8e. Classify
- **verified**: re-test confirms the fix works, no new errors introduced
- **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service)
- **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred"
### 8f. Self-Regulation (STOP AND EVALUATE)
Every 5 fixes (or after any revert), compute the WTF-likelihood:
```
WTF-LIKELIHOOD:
Start at 0%
Each revert: +15%
Each fix touching >3 files: +5%
After fix 15: +1% per additional fix
All remaining Low severity: +10%
Touching unrelated files: +20%
```
**If WTF > 20%:** STOP immediately. Show the user what you've done so far. Ask whether to continue.
**Hard cap: 50 fixes.** After 50 fixes, stop regardless of remaining issues.
---
## Phase 9: Final QA
After all fixes are applied:
1. Re-run QA on all affected pages
2. Compute final health score
3. **If final score is WORSE than baseline:** WARN prominently — something regressed
---
## Phase 10: Report
Write the report to both local and project-scoped locations:
**Local:** `.gstack/qa-reports/qa-report-{domain}-{YYYY-MM-DD}.md`
**Project-scoped:** Write test outcome artifact for cross-session context:
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
mkdir -p ~/.gstack/projects/$SLUG
```
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`
**Per-issue additions** (beyond standard report template):
- Fix Status: verified / best-effort / reverted / deferred
- Commit SHA (if fixed)
- Files Changed (if fixed)
- Before/After screenshots (if fixed)
**Summary section:**
- Total issues found
- Fixes applied (verified: X, best-effort: Y, reverted: Z)
- Deferred issues
- Health score delta: baseline → final
**PR Summary:** Include a one-line summary suitable for PR descriptions:
> "QA found N issues, fixed M, health score X → Y."
---
## Phase 11: TODOS.md Update
If the repo has a `TODOS.md`:
1. **New deferred bugs** → add as TODOs with severity, category, and repro steps
2. **Fixed bugs that were in TODOS.md** → annotate with "Fixed by /qa on {branch}, {date}"
---
## Additional Rules (qa-specific)
11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty.
12. **One commit per fix.** Never bundle multiple fixes into one commit.
13. **Never modify tests or CI configuration.** Only fix application source code.
14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately.
15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask.
+183 -63
View File
@@ -1,39 +1,57 @@
---
name: qa
version: 1.0.0
version: 2.0.0
description: |
Systematically QA test a web application. Use when asked to "qa", "QA", "test this site",
"find bugs", "dogfood", or review quality. Four modes: diff-aware (automatic on feature
branches — analyzes git diff, identifies affected pages, tests them), full (systematic
exploration), quick (30-second smoke test), regression (compare against baseline). Produces
structured report with health score, screenshots, and repro steps.
Systematically QA test a web application and fix bugs found. Runs QA testing,
then iteratively fixes bugs in source code, committing each fix atomically and
re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs",
"test and fix", or "fix what's broken". Three tiers: Quick (critical/high only),
Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,
fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only.
allowed-tools:
- Bash
- Read
- Write
- Edit
- Glob
- Grep
- AskUserQuestion
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# /qa: Systematic QA Testing
# /qa: Test → Fix → Verify
You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence.
You are a QA engineer AND a bug-fix engineer. Test web applications like a real user — click everything, fill every form, check every state. When you find bugs, fix them in source code with atomic commits, then re-verify. Produce a structured report with before/after evidence.
## Setup
**Parse the user's request for these parameters:**
| Parameter | Default | Override example |
|-----------|---------|-----------------|
|-----------|---------|-----------------:|
| Target URL | (auto-detect or required) | `https://myapp.com`, `http://localhost:3000` |
| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
| Tier | Standard | `--quick`, `--exhaustive` |
| Mode | full | `--regression .gstack/qa-reports/baseline.json` |
| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
| Scope | Full app (or diff-scoped) | `Focus on the billing page` |
| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
**Tiers determine which issues get fixed:**
- **Quick:** Fix critical + high severity only
- **Standard:** + medium severity (default)
- **Exhaustive:** + low/cosmetic severity
**If no URL is given and you're on a feature branch:** Automatically enter **diff-aware mode** (see Modes below). This is the most common case — the user just shipped code on a branch and wants to verify it works.
**Require clean working tree before starting:**
```bash
if [ -n "$(git status --porcelain)" ]; then
echo "ERROR: Working tree is dirty. Commit or stash changes before running /qa."
exit 1
fi
```
**Find the browse binary:**
{{BROWSE_SETUP}}
@@ -47,66 +65,23 @@ mkdir -p "$REPORT_DIR/screenshots"
---
## Modes
## Test Plan Context
### Diff-aware (automatic when on a feature branch with no URL)
Before falling back to git diff heuristics, check for richer test plan sources:
This is the **primary mode** for developers verifying their work. When the user says `/qa` without a URL and the repo is on a feature branch, automatically:
1. **Analyze the branch diff** to understand what changed:
1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
```bash
git diff main...HEAD --name-only
git log main..HEAD --oneline
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
```
2. **Identify affected pages/routes** from the changed files:
- Controller/route files → which URL paths they serve
- View/template/component files → which pages render them
- Model/service files → which pages use those models (check controllers that reference them)
- CSS/style files → which pages include those stylesheets
- API endpoints → test them directly with `$B js "await fetch('/api/...')"`
- Static pages (markdown, HTML) → navigate to them directly
3. **Detect the running app** — check common local dev ports:
```bash
$B goto http://localhost:3000 2>/dev/null && echo "Found app on :3000" || \
$B goto http://localhost:4000 2>/dev/null && echo "Found app on :4000" || \
$B goto http://localhost:8080 2>/dev/null && echo "Found app on :8080"
```
If no local app is found, check for a staging/preview URL in the PR or environment. If nothing works, ask the user for the URL.
4. **Test each affected page/route:**
- Navigate to the page
- Take a screenshot
- Check console for errors
- If the change was interactive (forms, buttons, flows), test the interaction end-to-end
- Use `snapshot -D` before and after actions to verify the change had the expected effect
5. **Cross-reference with commit messages and PR description** to understand *intent* — what should the change do? Verify it actually does that.
6. **Check TODOS.md** (if it exists) for known bugs or issues related to the changed files. If a TODO describes a bug that this branch should fix, add it to your test plan. If you find a new bug during QA that isn't in TODOS.md, note it in the report.
7. **Report findings** scoped to the branch changes:
- "Changes tested: N pages/routes affected by this branch"
- For each: does it work? Screenshot evidence.
- Any regressions on adjacent pages?
**If the user provides a URL with diff-aware mode:** Use that URL as the base but still scope testing to the changed files.
### Full (default when URL is provided)
Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size.
### Quick (`--quick`)
30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? Produce health score. No detailed issue documentation.
### Regression (`--regression <baseline>`)
Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report.
2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
3. **Use whichever source is richer.** Fall back to git diff analysis only if neither is available.
---
## Workflow
## Phases 1-6: QA Baseline
### Phase 1: Initialize
{{QA_METHODOLOGY}}
1. Find browse binary (see Setup above)
2. Create output directories
@@ -337,6 +312,8 @@ Minimum 0 per category.
9. **Never delete output files.** Screenshots and reports accumulate — that's intentional.
10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses.
Record baseline health score at end of Phase 6.
---
## Output Structure
@@ -348,8 +325,151 @@ Minimum 0 per category.
│ ├── initial.png # Landing page annotated screenshot
│ ├── issue-001-step-1.png # Per-issue evidence
│ ├── issue-001-result.png
│ ├── issue-001-before.png # Before fix (if fixed)
│ ├── issue-001-after.png # After fix (if fixed)
│ └── ...
└── baseline.json # For regression mode
```
Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
---
## Phase 7: Triage
Sort all discovered issues by severity, then decide which to fix based on the selected tier:
- **Quick:** Fix critical + high only. Mark medium/low as "deferred."
- **Standard:** Fix critical + high + medium. Mark low as "deferred."
- **Exhaustive:** Fix all, including cosmetic/low severity.
Mark issues that cannot be fixed from source code (e.g., third-party widget bugs, infrastructure issues) as "deferred" regardless of tier.
---
## Phase 8: Fix Loop
For each fixable issue, in severity order:
### 8a. Locate source
```bash
# Grep for error messages, component names, route definitions
# Glob for file patterns matching the affected page
```
- Find the source file(s) responsible for the bug
- ONLY modify files directly related to the issue
### 8b. Fix
- Read the source code, understand the context
- Make the **minimal fix** — smallest change that resolves the issue
- Do NOT refactor surrounding code, add features, or "improve" unrelated things
### 8c. Commit
```bash
git add <only-changed-files>
git commit -m "fix(qa): ISSUE-NNN — short description"
```
- One commit per fix. Never bundle multiple fixes.
- Message format: `fix(qa): ISSUE-NNN — short description`
### 8d. Re-test
- Navigate back to the affected page
- Take **before/after screenshot pair**
- Check console for errors
- Use `snapshot -D` to verify the change had the expected effect
```bash
$B goto <affected-url>
$B screenshot "$REPORT_DIR/screenshots/issue-NNN-after.png"
$B console --errors
$B snapshot -D
```
### 8e. Classify
- **verified**: re-test confirms the fix works, no new errors introduced
- **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service)
- **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred"
### 8f. Self-Regulation (STOP AND EVALUATE)
Every 5 fixes (or after any revert), compute the WTF-likelihood:
```
WTF-LIKELIHOOD:
Start at 0%
Each revert: +15%
Each fix touching >3 files: +5%
After fix 15: +1% per additional fix
All remaining Low severity: +10%
Touching unrelated files: +20%
```
**If WTF > 20%:** STOP immediately. Show the user what you've done so far. Ask whether to continue.
**Hard cap: 50 fixes.** After 50 fixes, stop regardless of remaining issues.
---
## Phase 9: Final QA
After all fixes are applied:
1. Re-run QA on all affected pages
2. Compute final health score
3. **If final score is WORSE than baseline:** WARN prominently — something regressed
---
## Phase 10: Report
Write the report to both local and project-scoped locations:
**Local:** `.gstack/qa-reports/qa-report-{domain}-{YYYY-MM-DD}.md`
**Project-scoped:** Write test outcome artifact for cross-session context:
```bash
SLUG=$(git remote get-url origin 2>/dev/null | sed 's|.*[:/]\([^/]*/[^/]*\)\.git$|\1|;s|.*[:/]\([^/]*/[^/]*\)$|\1|' | tr '/' '-')
mkdir -p ~/.gstack/projects/$SLUG
```
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`
**Per-issue additions** (beyond standard report template):
- Fix Status: verified / best-effort / reverted / deferred
- Commit SHA (if fixed)
- Files Changed (if fixed)
- Before/After screenshots (if fixed)
**Summary section:**
- Total issues found
- Fixes applied (verified: X, best-effort: Y, reverted: Z)
- Deferred issues
- Health score delta: baseline → final
**PR Summary:** Include a one-line summary suitable for PR descriptions:
> "QA found N issues, fixed M, health score X → Y."
---
## Phase 11: TODOS.md Update
If the repo has a `TODOS.md`:
1. **New deferred bugs** → add as TODOs with severity, category, and repro steps
2. **Fixed bugs that were in TODOS.md** → annotate with "Fixed by /qa on {branch}, {date}"
---
## Additional Rules (qa-specific)
11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty.
12. **One commit per fix.** Never bundle multiple fixes into one commit.
13. **Never modify tests or CI configuration.** Only fix application source code.
14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately.
15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask.
+27
View File
@@ -72,6 +72,33 @@
---
## Fixes Applied (if applicable)
| Issue | Fix Status | Commit | Files Changed |
|-------|-----------|--------|---------------|
| ISSUE-NNN | verified / best-effort / reverted / deferred | {SHA} | {files} |
### Before/After Evidence
#### ISSUE-NNN: {title}
**Before:** ![Before](screenshots/issue-NNN-before.png)
**After:** ![After](screenshots/issue-NNN-after.png)
---
## Ship Readiness
| Metric | Value |
|--------|-------|
| Health score | {before} → {after} ({delta}) |
| Issues found | N |
| Fixes applied | N (verified: X, best-effort: Y, reverted: Z) |
| Deferred | N |
**PR Summary:** "QA found N issues, fixed M, health score X → Y."
---
## Regression (if applicable)
| Metric | Baseline | Current | Delta |
+49 -1
View File
@@ -15,15 +15,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# /retro — Weekly Engineering Retrospective
Generates a comprehensive engineering retrospective analyzing commit history, work patterns, and code quality metrics. Team-aware: identifies the user running the command, then analyzes every contributor with per-person praise and growth opportunities. Designed for a senior IC/CTO-level builder using Claude Code as a force multiplier.
+1 -1
View File
@@ -13,7 +13,7 @@ allowed-tools:
- AskUserQuestion
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# /retro — Weekly Engineering Retrospective
+53 -3
View File
@@ -16,15 +16,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# Pre-Landing PR Review
You are running the `/review` workflow. Analyze the current branch's diff against main for structural issues that tests don't catch.
@@ -73,9 +121,11 @@ Run `git diff origin/main` to get the full diff. This includes both committed an
Apply the checklist against the diff in two passes:
1. **Pass 1 (CRITICAL):** SQL & Data Safety, LLM Output Trust Boundary
1. **Pass 1 (CRITICAL):** SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Enum & Value Completeness
2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend
**Enum & Value Completeness requires reading code OUTSIDE the diff.** When the diff introduces a new enum value, status, tier, or type constant, use Grep to find all files that reference sibling values, then Read those files to check if the new value is handled. This is the one category where within-diff review is insufficient.
Follow the output format specified in the checklist. Respect the suppressions — do NOT flag items listed in the "DO NOT flag" section.
---
@@ -84,7 +134,7 @@ Follow the output format specified in the checklist. Respect the suppressions
**Always output ALL findings** — both critical and informational. The user must see every issue.
- If CRITICAL issues found: output all findings, then for EACH critical issue use a separate AskUserQuestion with the problem, your recommended fix, and options (A: Fix it now, B: Acknowledge, C: False positive — skip).
- If CRITICAL issues found: output all findings, then for EACH critical issue use a separate AskUserQuestion with the problem, then `RECOMMENDATION: Choose A because [one-line reason]`, then options (A: Fix it now, B: Acknowledge, C: False positive — skip).
After all critical questions are answered, output a summary of what the user chose for each issue. If the user chose A (fix) on any issue, apply the recommended fixes. If only B/C were chosen, no action needed.
- If only non-critical issues found: output findings. No further action needed.
- If no issues found: output `Pre-Landing Review: No issues found.`
+5 -3
View File
@@ -14,7 +14,7 @@ allowed-tools:
- AskUserQuestion
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# Pre-Landing PR Review
@@ -64,9 +64,11 @@ Run `git diff origin/main` to get the full diff. This includes both committed an
Apply the checklist against the diff in two passes:
1. **Pass 1 (CRITICAL):** SQL & Data Safety, LLM Output Trust Boundary
1. **Pass 1 (CRITICAL):** SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Enum & Value Completeness
2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend
**Enum & Value Completeness requires reading code OUTSIDE the diff.** When the diff introduces a new enum value, status, tier, or type constant, use Grep to find all files that reference sibling values, then Read those files to check if the new value is handled. This is the one category where within-diff review is insufficient.
Follow the output format specified in the checklist. Respect the suppressions — do NOT flag items listed in the "DO NOT flag" section.
---
@@ -75,7 +77,7 @@ Follow the output format specified in the checklist. Respect the suppressions
**Always output ALL findings** — both critical and informational. The user must see every issue.
- If CRITICAL issues found: output all findings, then for EACH critical issue use a separate AskUserQuestion with the problem, your recommended fix, and options (A: Fix it now, B: Acknowledge, C: False positive — skip).
- If CRITICAL issues found: output all findings, then for EACH critical issue use a separate AskUserQuestion with the problem, then `RECOMMENDATION: Choose A because [one-line reason]`, then options (A: Fix it now, B: Acknowledge, C: False positive — skip).
After all critical questions are answered, output a summary of what the user chose for each issue. If the user chose A (fix) on any issue, apply the recommended fixes. If only B/C were chosen, no action needed.
- If only non-critical issues found: output findings. No further action needed.
- If no issues found: output `Pre-Landing Review: No issues found.`
+9 -2
View File
@@ -48,6 +48,13 @@ Be terse. For each issue: one line describing the problem, one line with the fix
- LLM-generated values (emails, URLs, names) written to DB or passed to mailers without format validation. Add lightweight guards (`EMAIL_REGEXP`, `URI.parse`, `.strip`) before persisting.
- Structured tool output (arrays, hashes) accepted without type/shape checks before database writes.
#### Enum & Value Completeness
When the diff introduces a new enum value, status string, tier name, or type constant:
- **Trace it through every consumer.** Read (don't just grep — READ) each file that switches on, filters by, or displays that value. If any consumer doesn't handle the new value, flag it. Common miss: adding a value to the frontend dropdown but the backend model/compute method doesn't persist it.
- **Check allowlists/filter arrays.** Search for arrays or `%w[]` lists containing sibling values (e.g., if adding "revise" to tiers, find every `%w[quick lfg mega]` and verify "revise" is included where needed).
- **Check `case`/`if-elsif` chains.** If existing code branches on the enum, does the new value fall through to a wrong default?
To do this: use Grep to find all references to the sibling values (e.g., grep for "lfg" or "mega" to find all tier consumers). Read each match. This step requires reading code OUTSIDE the diff.
### Pass 2 — INFORMATIONAL
#### Conditional Side Effects
@@ -101,8 +108,8 @@ Be terse. For each issue: one line describing the problem, one line with the fix
CRITICAL (blocks /ship): INFORMATIONAL (in PR body):
├─ SQL & Data Safety ├─ Conditional Side Effects
├─ Race Conditions & Concurrency ├─ Magic Numbers & String Coupling
─ LLM Output Trust Boundary ├─ Dead Code & Consistency
├─ LLM Prompt Issues
─ LLM Output Trust Boundary ├─ Dead Code & Consistency
└─ Enum & Value Completeness ├─ LLM Prompt Issues
├─ Test Gaps
├─ Crypto & Entropy
├─ Time Window Safety
+18 -7
View File
@@ -47,6 +47,8 @@ interface RunSummary {
passed: number;
total: number;
cost: number;
duration: number;
turns: number;
}
const runs: RunSummary[] = [];
@@ -55,6 +57,7 @@ for (const file of files) {
const data = JSON.parse(fs.readFileSync(path.join(EVAL_DIR, file), 'utf-8'));
if (filterBranch && data.branch !== filterBranch) continue;
if (filterTier && data.tier !== filterTier) continue;
const totalTurns = (data.tests || []).reduce((s: number, t: any) => s + (t.turns_used || 0), 0);
runs.push({
file,
timestamp: data.timestamp || '',
@@ -64,6 +67,8 @@ for (const file of files) {
passed: data.passed || 0,
total: data.total_tests || 0,
cost: data.total_cost_usd || 0,
duration: data.total_duration_ms || 0,
turns: totalTurns,
});
} catch { continue; }
}
@@ -77,29 +82,35 @@ const displayed = runs.slice(0, limit);
// Print table
console.log('');
console.log(`Eval History (${runs.length} total runs)`);
console.log('═'.repeat(90));
console.log('═'.repeat(105));
console.log(
' ' +
'Date'.padEnd(17) +
'Branch'.padEnd(28) +
'Branch'.padEnd(25) +
'Tier'.padEnd(12) +
'Pass'.padEnd(8) +
'Cost'.padEnd(8) +
'Turns'.padEnd(7) +
'Duration'.padEnd(10) +
'Version'
);
console.log('─'.repeat(90));
console.log('─'.repeat(105));
for (const run of displayed) {
const date = run.timestamp.replace('T', ' ').slice(0, 16);
const branch = run.branch.length > 26 ? run.branch.slice(0, 23) + '...' : run.branch.padEnd(28);
const branch = run.branch.length > 23 ? run.branch.slice(0, 20) + '...' : run.branch.padEnd(25);
const pass = `${run.passed}/${run.total}`.padEnd(8);
const cost = `$${run.cost.toFixed(2)}`.padEnd(8);
console.log(` ${date.padEnd(17)}${branch}${run.tier.padEnd(12)}${pass}${cost}v${run.version}`);
const turns = run.turns > 0 ? `${run.turns}t`.padEnd(7) : ''.padEnd(7);
const dur = run.duration > 0 ? `${Math.round(run.duration / 1000)}s`.padEnd(10) : ''.padEnd(10);
console.log(` ${date.padEnd(17)}${branch}${run.tier.padEnd(12)}${pass}${cost}${turns}${dur}v${run.version}`);
}
console.log('─'.repeat(90));
console.log('─'.repeat(105));
const totalCost = runs.reduce((s, r) => s + r.cost, 0);
console.log(` ${runs.length} runs | Total spend: $${totalCost.toFixed(2)} | Showing: ${displayed.length}`);
const totalDur = runs.reduce((s, r) => s + r.duration, 0);
const totalTurns = runs.reduce((s, r) => s + r.turns, 0);
console.log(` ${runs.length} runs | $${totalCost.toFixed(2)} total | ${totalTurns} turns | ${Math.round(totalDur / 1000)}s | Showing: ${displayed.length}`);
console.log(` Dir: ${EVAL_DIR}`);
console.log('');
+57 -4
View File
@@ -40,6 +40,33 @@ const totalCost = results.reduce((s, r) => s + (r.total_cost_usd || 0), 0);
const avgE2ECost = e2eRuns.length > 0 ? e2eRuns.reduce((s, r) => s + r.total_cost_usd, 0) / e2eRuns.length : 0;
const avgJudgeCost = judgeRuns.length > 0 ? judgeRuns.reduce((s, r) => s + r.total_cost_usd, 0) / judgeRuns.length : 0;
// Duration + turns from E2E runs
const avgE2EDuration = e2eRuns.length > 0
? e2eRuns.reduce((s, r) => s + (r.total_duration_ms || 0), 0) / e2eRuns.length
: 0;
const e2eTurns: number[] = [];
for (const r of e2eRuns) {
const runTurns = r.tests.reduce((s, t) => s + (t.turns_used || 0), 0);
if (runTurns > 0) e2eTurns.push(runTurns);
}
const avgE2ETurns = e2eTurns.length > 0
? e2eTurns.reduce((a, b) => a + b, 0) / e2eTurns.length
: 0;
// Per-test efficiency stats (avg turns + duration across runs)
const testEfficiency = new Map<string, { turns: number[]; durations: number[]; costs: number[] }>();
for (const r of e2eRuns) {
for (const t of r.tests) {
if (!testEfficiency.has(t.name)) {
testEfficiency.set(t.name, { turns: [], durations: [], costs: [] });
}
const stats = testEfficiency.get(t.name)!;
if (t.turns_used !== undefined) stats.turns.push(t.turns_used);
if (t.duration_ms > 0) stats.durations.push(t.duration_ms);
if (t.cost_usd > 0) stats.costs.push(t.cost_usd);
}
}
// Detection rates from outcome evals
const detectionRates: number[] = [];
for (const r of e2eRuns) {
@@ -94,22 +121,48 @@ for (const stats of branchStats.values()) {
// Print summary
console.log('');
console.log('Eval Summary');
console.log('═'.repeat(60));
console.log('═'.repeat(70));
console.log(` Total runs: ${results.length} (${e2eRuns.length} e2e, ${judgeRuns.length} llm-judge)`);
console.log(` Total spend: $${totalCost.toFixed(2)}`);
console.log(` Avg cost/e2e: $${avgE2ECost.toFixed(2)}`);
console.log(` Avg cost/judge: $${avgJudgeCost.toFixed(2)}`);
if (avgE2EDuration > 0) {
console.log(` Avg duration/e2e: ${Math.round(avgE2EDuration / 1000)}s`);
}
if (avgE2ETurns > 0) {
console.log(` Avg turns/e2e: ${Math.round(avgE2ETurns)}`);
}
if (avgDetection !== null) {
console.log(` Avg detection: ${avgDetection.toFixed(1)} bugs`);
}
console.log('─'.repeat(60));
console.log('─'.repeat(70));
// Per-test efficiency averages (only if we have enough data)
if (testEfficiency.size > 0 && e2eRuns.length >= 2) {
console.log(' Per-test efficiency (averages across runs):');
const sorted = [...testEfficiency.entries()]
.filter(([, s]) => s.turns.length >= 2)
.sort((a, b) => {
const avgA = a[1].costs.reduce((s, c) => s + c, 0) / a[1].costs.length;
const avgB = b[1].costs.reduce((s, c) => s + c, 0) / b[1].costs.length;
return avgB - avgA;
});
for (const [name, stats] of sorted) {
const avgT = Math.round(stats.turns.reduce((a, b) => a + b, 0) / stats.turns.length);
const avgD = Math.round(stats.durations.reduce((a, b) => a + b, 0) / stats.durations.length / 1000);
const avgC = (stats.costs.reduce((a, b) => a + b, 0) / stats.costs.length).toFixed(2);
const label = name.length > 30 ? name.slice(0, 27) + '...' : name.padEnd(30);
console.log(` ${label} $${avgC} ${avgT}t ${avgD}s (${stats.turns.length} runs)`);
}
console.log('─'.repeat(70));
}
if (flakyTests.length > 0) {
console.log(` Flaky tests (${flakyTests.length}):`);
for (const name of flakyTests) {
console.log(` - ${name}`);
}
console.log('─'.repeat(60));
console.log('─'.repeat(70));
}
if (branchStats.size > 0) {
@@ -119,7 +172,7 @@ if (branchStats.size > 0) {
const det = stats.detections.length > 0 ? ` avg det: ${stats.avgDetection.toFixed(1)}` : '';
console.log(` ${branch.padEnd(30)} ${stats.runs} runs${det}`);
}
console.log('─'.repeat(60));
console.log('─'.repeat(70));
}
// Date range
+330 -4
View File
@@ -94,15 +94,63 @@ function generateSnapshotFlags(): string {
return lines.join('\n');
}
function generateUpdateCheck(): string {
return `## Update Check (run first)
function generatePreamble(): string {
return `## Preamble (run first)
\`\`\`bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
\`\`\`
If output shows \`UPGRADE_AVAILABLE <old> <new>\`: read \`~/.claude/skills/gstack/gstack-upgrade/SKILL.md\` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If \`JUST_UPGRADED <from> <to>\`: tell user "Running gstack v{to} (just updated!)" and continue.`;
If output shows \`UPGRADE_AVAILABLE <old> <new>\`: read \`~/.claude/skills/gstack/gstack-upgrade/SKILL.md\` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If \`JUST_UPGRADED <from> <to>\`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. \`RECOMMENDATION: Choose [X] because [one-line reason]\`
4. Lettered options: \`A) ... B) ... C) ...\`
If \`_SESSIONS\` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If \`_CONTRIB\` is \`true\`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write \`~/.gstack/contributor-logs/{slug}.md\` with this structure:
\`\`\`
# {Title}
Hey gstack team ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
\`\`\`
Then run: \`mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md\`
Slug: lowercase, hyphens, max 60 chars (e.g. \`browse-snapshot-ref-gap\`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"`;
}
function generateBrowseSetup(): string {
@@ -126,11 +174,288 @@ If \`NEEDS_SETUP\`:
3. If \`bun\` is not installed: \`curl -fsSL https://bun.sh/install | bash\``;
}
function generateQAMethodology(): string {
return `## Modes
### Diff-aware (automatic when on a feature branch with no URL)
This is the **primary mode** for developers verifying their work. When the user says \`/qa\` without a URL and the repo is on a feature branch, automatically:
1. **Analyze the branch diff** to understand what changed:
\`\`\`bash
git diff main...HEAD --name-only
git log main..HEAD --oneline
\`\`\`
2. **Identify affected pages/routes** from the changed files:
- Controller/route files which URL paths they serve
- View/template/component files which pages render them
- Model/service files which pages use those models (check controllers that reference them)
- CSS/style files which pages include those stylesheets
- API endpoints test them directly with \`$B js "await fetch('/api/...')"\`
- Static pages (markdown, HTML) navigate to them directly
3. **Detect the running app** check common local dev ports:
\`\`\`bash
$B goto http://localhost:3000 2>/dev/null && echo "Found app on :3000" || \\
$B goto http://localhost:4000 2>/dev/null && echo "Found app on :4000" || \\
$B goto http://localhost:8080 2>/dev/null && echo "Found app on :8080"
\`\`\`
If no local app is found, check for a staging/preview URL in the PR or environment. If nothing works, ask the user for the URL.
4. **Test each affected page/route:**
- Navigate to the page
- Take a screenshot
- Check console for errors
- If the change was interactive (forms, buttons, flows), test the interaction end-to-end
- Use \`snapshot -D\` before and after actions to verify the change had the expected effect
5. **Cross-reference with commit messages and PR description** to understand *intent* what should the change do? Verify it actually does that.
6. **Check TODOS.md** (if it exists) for known bugs or issues related to the changed files. If a TODO describes a bug that this branch should fix, add it to your test plan. If you find a new bug during QA that isn't in TODOS.md, note it in the report.
7. **Report findings** scoped to the branch changes:
- "Changes tested: N pages/routes affected by this branch"
- For each: does it work? Screenshot evidence.
- Any regressions on adjacent pages?
**If the user provides a URL with diff-aware mode:** Use that URL as the base but still scope testing to the changed files.
### Full (default when URL is provided)
Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size.
### Quick (\`--quick\`)
30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? Produce health score. No detailed issue documentation.
### Regression (\`--regression <baseline>\`)
Run full mode, then load \`baseline.json\` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report.
---
## Workflow
### Phase 1: Initialize
1. Find browse binary (see Setup above)
2. Create output directories
3. Copy report template from \`qa/templates/qa-report-template.md\` to output dir
4. Start timer for duration tracking
### Phase 2: Authenticate (if needed)
**If the user specified auth credentials:**
\`\`\`bash
$B goto <login-url>
$B snapshot -i # find the login form
$B fill @e3 "user@example.com"
$B fill @e4 "[REDACTED]" # NEVER include real passwords in report
$B click @e5 # submit
$B snapshot -D # verify login succeeded
\`\`\`
**If the user provided a cookie file:**
\`\`\`bash
$B cookie-import cookies.json
$B goto <target-url>
\`\`\`
**If 2FA/OTP is required:** Ask the user for the code and wait.
**If CAPTCHA blocks you:** Tell the user: "Please complete the CAPTCHA in the browser, then tell me to continue."
### Phase 3: Orient
Get a map of the application:
\`\`\`bash
$B goto <target-url>
$B snapshot -i -a -o "$REPORT_DIR/screenshots/initial.png"
$B links # map navigation structure
$B console --errors # any errors on landing?
\`\`\`
**Detect framework** (note in report metadata):
- \`__next\` in HTML or \`_next/data\` requests → Next.js
- \`csrf-token\` meta tag → Rails
- \`wp-content\` in URLs → WordPress
- Client-side routing with no page reloads SPA
**For SPAs:** The \`links\` command may return few results because navigation is client-side. Use \`snapshot -i\` to find nav elements (buttons, menu items) instead.
### Phase 4: Explore
Visit pages systematically. At each page:
\`\`\`bash
$B goto <page-url>
$B snapshot -i -a -o "$REPORT_DIR/screenshots/page-name.png"
$B console --errors
\`\`\`
Then follow the **per-page exploration checklist** (see \`qa/references/issue-taxonomy.md\`):
1. **Visual scan** Look at the annotated screenshot for layout issues
2. **Interactive elements** Click buttons, links, controls. Do they work?
3. **Forms** Fill and submit. Test empty, invalid, edge cases
4. **Navigation** Check all paths in and out
5. **States** Empty state, loading, error, overflow
6. **Console** Any new JS errors after interactions?
7. **Responsiveness** Check mobile viewport if relevant:
\`\`\`bash
$B viewport 375x812
$B screenshot "$REPORT_DIR/screenshots/page-mobile.png"
$B viewport 1280x720
\`\`\`
**Depth judgment:** Spend more time on core features (homepage, dashboard, checkout, search) and less on secondary pages (about, terms, privacy).
**Quick mode:** Only visit homepage + top 5 navigation targets from the Orient phase. Skip the per-page checklist just check: loads? Console errors? Broken links visible?
### Phase 5: Document
Document each issue **immediately when found** don't batch them.
**Two evidence tiers:**
**Interactive bugs** (broken flows, dead buttons, form failures):
1. Take a screenshot before the action
2. Perform the action
3. Take a screenshot showing the result
4. Use \`snapshot -D\` to show what changed
5. Write repro steps referencing screenshots
\`\`\`bash
$B screenshot "$REPORT_DIR/screenshots/issue-001-step-1.png"
$B click @e5
$B screenshot "$REPORT_DIR/screenshots/issue-001-result.png"
$B snapshot -D
\`\`\`
**Static bugs** (typos, layout issues, missing images):
1. Take a single annotated screenshot showing the problem
2. Describe what's wrong
\`\`\`bash
$B snapshot -i -a -o "$REPORT_DIR/screenshots/issue-002.png"
\`\`\`
**Write each issue to the report immediately** using the template format from \`qa/templates/qa-report-template.md\`.
### Phase 6: Wrap Up
1. **Compute health score** using the rubric below
2. **Write "Top 3 Things to Fix"** the 3 highest-severity issues
3. **Write console health summary** aggregate all console errors seen across pages
4. **Update severity counts** in the summary table
5. **Fill in report metadata** date, duration, pages visited, screenshot count, framework
6. **Save baseline** write \`baseline.json\` with:
\`\`\`json
{
"date": "YYYY-MM-DD",
"url": "<target>",
"healthScore": N,
"issues": [{ "id": "ISSUE-001", "title": "...", "severity": "...", "category": "..." }],
"categoryScores": { "console": N, "links": N, ... }
}
\`\`\`
**Regression mode:** After writing the report, load the baseline file. Compare:
- Health score delta
- Issues fixed (in baseline but not current)
- New issues (in current but not baseline)
- Append the regression section to the report
---
## Health Score Rubric
Compute each category score (0-100), then take the weighted average.
### Console (weight: 15%)
- 0 errors 100
- 1-3 errors 70
- 4-10 errors 40
- 10+ errors 10
### Links (weight: 10%)
- 0 broken 100
- Each broken link -15 (minimum 0)
### Per-Category Scoring (Visual, Functional, UX, Content, Performance, Accessibility)
Each category starts at 100. Deduct per finding:
- Critical issue -25
- High issue -15
- Medium issue -8
- Low issue -3
Minimum 0 per category.
### Weights
| Category | Weight |
|----------|--------|
| Console | 15% |
| Links | 10% |
| Visual | 10% |
| Functional | 20% |
| UX | 15% |
| Performance | 10% |
| Content | 5% |
| Accessibility | 15% |
### Final Score
\`score = Σ (category_score × weight)\`
---
## Framework-Specific Guidance
### Next.js
- Check console for hydration errors (\`Hydration failed\`, \`Text content did not match\`)
- Monitor \`_next/data\` requests in network — 404s indicate broken data fetching
- Test client-side navigation (click links, don't just \`goto\`) — catches routing issues
- Check for CLS (Cumulative Layout Shift) on pages with dynamic content
### Rails
- Check for N+1 query warnings in console (if development mode)
- Verify CSRF token presence in forms
- Test Turbo/Stimulus integration do page transitions work smoothly?
- Check for flash messages appearing and dismissing correctly
### WordPress
- Check for plugin conflicts (JS errors from different plugins)
- Verify admin bar visibility for logged-in users
- Test REST API endpoints (\`/wp-json/\`)
- Check for mixed content warnings (common with WP)
### General SPA (React, Vue, Angular)
- Use \`snapshot -i\` for navigation — \`links\` command misses client-side routes
- Check for stale state (navigate away and back does data refresh?)
- Test browser back/forward does the app handle history correctly?
- Check for memory leaks (monitor console after extended use)
---
## Important Rules
1. **Repro is everything.** Every issue needs at least one screenshot. No exceptions.
2. **Verify before documenting.** Retry the issue once to confirm it's reproducible, not a fluke.
3. **Never include credentials.** Write \`[REDACTED]\` for passwords in repro steps.
4. **Write incrementally.** Append each issue to the report as you find it. Don't batch.
5. **Never read source code.** Test as a user, not a developer.
6. **Check console after every interaction.** JS errors that don't surface visually are still bugs.
7. **Test like a user.** Use realistic data. Walk through complete workflows end-to-end.
8. **Depth over breadth.** 5-10 well-documented issues with evidence > 20 vague descriptions.
9. **Never delete output files.** Screenshots and reports accumulate that's intentional.
10. **Use \`snapshot -C\` for tricky UIs.** Finds clickable divs that the accessibility tree misses.`;
}
const RESOLVERS: Record<string, () => string> = {
COMMAND_REFERENCE: generateCommandReference,
SNAPSHOT_FLAGS: generateSnapshotFlags,
UPDATE_CHECK: generateUpdateCheck,
PREAMBLE: generatePreamble,
BROWSE_SETUP: generateBrowseSetup,
QA_METHODOLOGY: generateQAMethodology,
};
// ─── Template Processing ────────────────────────────────────
@@ -176,6 +501,7 @@ function findTemplates(): string[] {
path.join(ROOT, 'SKILL.md.tmpl'),
path.join(ROOT, 'browse', 'SKILL.md.tmpl'),
path.join(ROOT, 'qa', 'SKILL.md.tmpl'),
path.join(ROOT, 'qa-only', 'SKILL.md.tmpl'),
path.join(ROOT, 'setup-browser-cookies', 'SKILL.md.tmpl'),
path.join(ROOT, 'ship', 'SKILL.md.tmpl'),
path.join(ROOT, 'review', 'SKILL.md.tmpl'),
+1
View File
@@ -20,6 +20,7 @@ const SKILL_FILES = [
'SKILL.md',
'browse/SKILL.md',
'qa/SKILL.md',
'qa-only/SKILL.md',
'ship/SKILL.md',
'review/SKILL.md',
'retro/SKILL.md',
+49 -1
View File
@@ -13,15 +13,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# Setup Browser Cookies
Import logged-in sessions from your real Chromium browser into the headless browse session.
+1 -1
View File
@@ -11,7 +11,7 @@ allowed-tools:
- AskUserQuestion
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# Setup Browser Cookies
+53 -5
View File
@@ -15,15 +15,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->
## Update Check (run first)
## Preamble (run first)
```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
## AskUserQuestion Format
**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`
If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.
Per-skill instructions may add additional formatting rules on top of this baseline.
## Contributor Mode
If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."
**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.
**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:
```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```
Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`
Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
# Ship: Fully Automated Ship Workflow
You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship` which means DO IT. Run straight through and output the PR URL at the end.
@@ -174,8 +222,8 @@ Review the diff for structural issues that tests don't catch.
6. **If CRITICAL issues found:** For EACH critical issue, use a separate AskUserQuestion with:
- The problem (`file:line` + description)
- Your recommended fix
- Options: A) Fix it now (recommend), B) Acknowledge and ship anyway, C) It's a false positive — skip
- `RECOMMENDATION: Choose A because [one-line reason]`
- Options: A) Fix it now, B) Acknowledge and ship anyway, C) It's a false positive — skip
After resolving all critical issues: if the user chose A (fix) on any issue, apply the recommended fixes, then commit only the fixed files by name (`git add <fixed-files> && git commit -m "fix: apply pre-landing review fixes"`), then **STOP** and tell the user to run `/ship` again to re-test with the fixes applied. If the user chose only B (acknowledge) or C (false positive) on all issues, continue with Step 4.
7. **If only non-critical issues found:** Output them and continue. They will be included in the PR body at Step 8.
@@ -202,8 +250,8 @@ For each classified comment:
**VALID & ACTIONABLE:** Use AskUserQuestion with:
- The comment (file:line or [top-level] + body summary + permalink URL)
- Your recommended fix
- Options: A) Fix now (recommended), B) Acknowledge and ship anyway, C) It's a false positive
- `RECOMMENDATION: Choose A because [one-line reason]`
- Options: A) Fix now, B) Acknowledge and ship anyway, C) It's a false positive
- If user chooses A: apply the fix, commit the fixed files (`git add <fixed-files> && git commit -m "fix: address Greptile review — <brief description>"`), reply using the **Fix reply template** from greptile-triage.md (include inline diff + explanation), and save to both per-project and global greptile-history (type: fix).
- If user chooses C: reply using the **False Positive reply template** from greptile-triage.md (include evidence + suggested re-rank), save to both per-project and global greptile-history (type: fp).
+5 -5
View File
@@ -13,7 +13,7 @@ allowed-tools:
- AskUserQuestion
---
{{UPDATE_CHECK}}
{{PREAMBLE}}
# Ship: Fully Automated Ship Workflow
@@ -165,8 +165,8 @@ Review the diff for structural issues that tests don't catch.
6. **If CRITICAL issues found:** For EACH critical issue, use a separate AskUserQuestion with:
- The problem (`file:line` + description)
- Your recommended fix
- Options: A) Fix it now (recommend), B) Acknowledge and ship anyway, C) It's a false positive — skip
- `RECOMMENDATION: Choose A because [one-line reason]`
- Options: A) Fix it now, B) Acknowledge and ship anyway, C) It's a false positive — skip
After resolving all critical issues: if the user chose A (fix) on any issue, apply the recommended fixes, then commit only the fixed files by name (`git add <fixed-files> && git commit -m "fix: apply pre-landing review fixes"`), then **STOP** and tell the user to run `/ship` again to re-test with the fixes applied. If the user chose only B (acknowledge) or C (false positive) on all issues, continue with Step 4.
7. **If only non-critical issues found:** Output them and continue. They will be included in the PR body at Step 8.
@@ -193,8 +193,8 @@ For each classified comment:
**VALID & ACTIONABLE:** Use AskUserQuestion with:
- The comment (file:line or [top-level] + body summary + permalink URL)
- Your recommended fix
- Options: A) Fix now (recommended), B) Acknowledge and ship anyway, C) It's a false positive
- `RECOMMENDATION: Choose A because [one-line reason]`
- Options: A) Fix now, B) Acknowledge and ship anyway, C) It's a false positive
- If user chooses A: apply the fix, commit the fixed files (`git add <fixed-files> && git commit -m "fix: address Greptile review — <brief description>"`), reply using the **Fix reply template** from greptile-triage.md (include inline diff + explanation), and save to both per-project and global greptile-history (type: fix).
- If user chooses C: reply using the **False Positive reply template** from greptile-triage.md (include evidence + suggested re-rank), save to both per-project and global greptile-history (type: fp).
+30
View File
@@ -0,0 +1,30 @@
# Feature branch version: adds "returned" status but misses consumers
class Order < ApplicationRecord
STATUSES = %w[pending processing shipped delivered returned].freeze
validates :status, inclusion: { in: STATUSES }
def display_status
case status
when 'pending' then 'Awaiting processing'
when 'processing' then 'Being prepared'
when 'shipped' then 'On the way'
when 'delivered' then 'Delivered'
# BUG: 'returned' not handled — falls through to nil
end
end
def can_cancel?
# BUG: should 'returned' be cancellable? Not considered.
%w[pending processing].include?(status)
end
def notify_customer
case status
when 'pending' then OrderMailer.confirmation(self).deliver_later
when 'shipped' then OrderMailer.shipped(self).deliver_later
when 'delivered' then OrderMailer.delivered(self).deliver_later
# BUG: 'returned' has no notification — customer won't know return was received
end
end
end
+27
View File
@@ -0,0 +1,27 @@
# Existing file on main: order model with status handling
class Order < ApplicationRecord
STATUSES = %w[pending processing shipped delivered].freeze
validates :status, inclusion: { in: STATUSES }
def display_status
case status
when 'pending' then 'Awaiting processing'
when 'processing' then 'Being prepared'
when 'shipped' then 'On the way'
when 'delivered' then 'Delivered'
end
end
def can_cancel?
%w[pending processing].include?(status)
end
def notify_customer
case status
when 'pending' then OrderMailer.confirmation(self).deliver_later
when 'shipped' then OrderMailer.shipped(self).deliver_later
when 'delivered' then OrderMailer.delivered(self).deliver_later
end
end
end
+72
View File
@@ -61,6 +61,7 @@ describe('gen-skill-docs', () => {
{ dir: '.', name: 'root gstack' },
{ dir: 'browse', name: 'browse' },
{ dir: 'qa', name: 'qa' },
{ dir: 'qa-only', name: 'qa-only' },
{ dir: 'review', name: 'review' },
{ dir: 'ship', name: 'ship' },
{ dir: 'plan-ceo-review', name: 'plan-ceo-review' },
@@ -124,10 +125,81 @@ describe('gen-skill-docs', () => {
const rootTmpl = fs.readFileSync(path.join(ROOT, 'SKILL.md.tmpl'), 'utf-8');
expect(rootTmpl).toContain('{{COMMAND_REFERENCE}}');
expect(rootTmpl).toContain('{{SNAPSHOT_FLAGS}}');
expect(rootTmpl).toContain('{{PREAMBLE}}');
const browseTmpl = fs.readFileSync(path.join(ROOT, 'browse', 'SKILL.md.tmpl'), 'utf-8');
expect(browseTmpl).toContain('{{COMMAND_REFERENCE}}');
expect(browseTmpl).toContain('{{SNAPSHOT_FLAGS}}');
expect(browseTmpl).toContain('{{PREAMBLE}}');
});
test('generated SKILL.md contains contributor mode check', () => {
const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
expect(content).toContain('Contributor Mode');
expect(content).toContain('gstack_contributor');
expect(content).toContain('contributor-logs');
});
test('generated SKILL.md contains session awareness', () => {
const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
expect(content).toContain('_SESSIONS');
expect(content).toContain('RECOMMENDATION');
expect(content).toContain('ELI16');
});
test('qa and qa-only templates use QA_METHODOLOGY placeholder', () => {
const qaTmpl = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md.tmpl'), 'utf-8');
expect(qaTmpl).toContain('{{QA_METHODOLOGY}}');
const qaOnlyTmpl = fs.readFileSync(path.join(ROOT, 'qa-only', 'SKILL.md.tmpl'), 'utf-8');
expect(qaOnlyTmpl).toContain('{{QA_METHODOLOGY}}');
});
test('QA_METHODOLOGY appears expanded in both qa and qa-only generated files', () => {
const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
const qaOnlyContent = fs.readFileSync(path.join(ROOT, 'qa-only', 'SKILL.md'), 'utf-8');
// Both should contain the health score rubric
expect(qaContent).toContain('Health Score Rubric');
expect(qaOnlyContent).toContain('Health Score Rubric');
// Both should contain framework guidance
expect(qaContent).toContain('Framework-Specific Guidance');
expect(qaOnlyContent).toContain('Framework-Specific Guidance');
// Both should contain the important rules
expect(qaContent).toContain('Important Rules');
expect(qaOnlyContent).toContain('Important Rules');
// Both should contain the 6 phases
expect(qaContent).toContain('Phase 1');
expect(qaOnlyContent).toContain('Phase 1');
expect(qaContent).toContain('Phase 6');
expect(qaOnlyContent).toContain('Phase 6');
});
test('qa-only has no-fix guardrails', () => {
const qaOnlyContent = fs.readFileSync(path.join(ROOT, 'qa-only', 'SKILL.md'), 'utf-8');
expect(qaOnlyContent).toContain('Never fix bugs');
expect(qaOnlyContent).toContain('NEVER fix anything');
// Should not have Edit, Glob, or Grep in allowed-tools
expect(qaOnlyContent).not.toMatch(/allowed-tools:[\s\S]*?Edit/);
expect(qaOnlyContent).not.toMatch(/allowed-tools:[\s\S]*?Glob/);
expect(qaOnlyContent).not.toMatch(/allowed-tools:[\s\S]*?Grep/);
});
test('qa has fix-loop tools and phases', () => {
const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
// Should have Edit, Glob, Grep in allowed-tools
expect(qaContent).toContain('Edit');
expect(qaContent).toContain('Glob');
expect(qaContent).toContain('Grep');
// Should have fix-loop phases
expect(qaContent).toContain('Phase 7');
expect(qaContent).toContain('Phase 8');
expect(qaContent).toContain('Fix Loop');
expect(qaContent).toContain('Triage');
expect(qaContent).toContain('WTF');
});
});
+218 -4
View File
@@ -8,8 +8,10 @@ import {
findPreviousRun,
compareEvalResults,
formatComparison,
generateCommentary,
judgePassed,
} from './eval-store';
import type { EvalResult, EvalTestEntry } from './eval-store';
import type { EvalResult, EvalTestEntry, ComparisonResult } from './eval-store';
let tmpDir: string;
@@ -114,7 +116,6 @@ describe('EvalCollector', () => {
expect(filepath1).toBeTruthy();
expect(filepath2).toBe(''); // second call returns empty
// Exclude _partial files — savePartial writes _partial-e2e.json alongside the final
expect(fs.readdirSync(tmpDir).filter(f => f.endsWith('.json') && !f.startsWith('_partial'))).toHaveLength(1);
});
@@ -198,6 +199,45 @@ describe('EvalCollector', () => {
});
});
// --- judgePassed tests ---
describe('judgePassed', () => {
test('passes when all thresholds met', () => {
expect(judgePassed(
{ detection_rate: 3, false_positives: 1, evidence_quality: 3 },
{ minimum_detection: 2, max_false_positives: 2 },
)).toBe(true);
});
test('fails when detection rate below minimum', () => {
expect(judgePassed(
{ detection_rate: 1, false_positives: 0, evidence_quality: 3 },
{ minimum_detection: 2, max_false_positives: 2 },
)).toBe(false);
});
test('fails when too many false positives', () => {
expect(judgePassed(
{ detection_rate: 3, false_positives: 3, evidence_quality: 3 },
{ minimum_detection: 2, max_false_positives: 2 },
)).toBe(false);
});
test('fails when evidence quality below 2', () => {
expect(judgePassed(
{ detection_rate: 3, false_positives: 0, evidence_quality: 1 },
{ minimum_detection: 2, max_false_positives: 2 },
)).toBe(false);
});
test('passes at exact thresholds', () => {
expect(judgePassed(
{ detection_rate: 2, false_positives: 2, evidence_quality: 2 },
{ minimum_detection: 2, max_false_positives: 2 },
)).toBe(true);
});
});
// --- extractToolSummary tests ---
describe('extractToolSummary', () => {
@@ -371,8 +411,8 @@ describe('formatComparison', () => {
deltas: [
{
name: 'browse basic',
before: { passed: true, cost_usd: 0.07, tool_summary: { Bash: 3 } },
after: { passed: true, cost_usd: 0.06, tool_summary: { Bash: 4 } },
before: { passed: true, cost_usd: 0.07, turns_used: 6, duration_ms: 24000, tool_summary: { Bash: 3 } },
after: { passed: true, cost_usd: 0.06, turns_used: 5, duration_ms: 19000, tool_summary: { Bash: 4 } },
status_change: 'unchanged',
},
{
@@ -398,5 +438,179 @@ describe('formatComparison', () => {
expect(output).toContain('1 unchanged');
expect(output).toContain('↑'); // improved arrow
expect(output).toContain('='); // unchanged arrow
// Turns and duration deltas
expect(output).toContain('6→5t');
expect(output).toContain('24→19s');
});
test('includes commentary section', () => {
const comparison: ComparisonResult = {
before_file: 'a.json', after_file: 'b.json',
before_branch: 'main', after_branch: 'main',
before_timestamp: '2026-03-13T14:30:00Z',
after_timestamp: '2026-03-14T14:30:00Z',
deltas: [
{
name: 'test-a',
before: { passed: true, cost_usd: 0.50, turns_used: 20, duration_ms: 120000 },
after: { passed: true, cost_usd: 0.30, turns_used: 10, duration_ms: 60000 },
status_change: 'unchanged',
},
{
name: 'test-b',
before: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 },
after: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 },
status_change: 'unchanged',
},
{
name: 'test-c',
before: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 },
after: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 },
status_change: 'unchanged',
},
],
total_cost_delta: -0.20,
total_duration_delta: -60000,
improved: 0, regressed: 0, unchanged: 3,
tool_count_before: 30, tool_count_after: 20,
};
const output = formatComparison(comparison);
expect(output).toContain('Takeaway');
expect(output).toContain('fewer turns');
expect(output).toContain('faster');
});
});
// --- generateCommentary tests ---
describe('generateCommentary', () => {
test('flags regressions prominently', () => {
const c: ComparisonResult = {
before_file: 'a.json', after_file: 'b.json',
before_branch: 'main', after_branch: 'main',
before_timestamp: '', after_timestamp: '',
deltas: [{
name: 'critical-test',
before: { passed: true, cost_usd: 0.10 },
after: { passed: false, cost_usd: 0.10 },
status_change: 'regressed',
}],
total_cost_delta: 0, total_duration_delta: 0,
improved: 0, regressed: 1, unchanged: 0,
tool_count_before: 0, tool_count_after: 0,
};
const notes = generateCommentary(c);
expect(notes.some(n => n.includes('REGRESSION'))).toBe(true);
expect(notes.some(n => n.includes('critical-test'))).toBe(true);
});
test('notes improvements', () => {
const c: ComparisonResult = {
before_file: 'a.json', after_file: 'b.json',
before_branch: 'main', after_branch: 'main',
before_timestamp: '', after_timestamp: '',
deltas: [{
name: 'fixed-test',
before: { passed: false, cost_usd: 0.10 },
after: { passed: true, cost_usd: 0.10 },
status_change: 'improved',
}],
total_cost_delta: 0, total_duration_delta: 0,
improved: 1, regressed: 0, unchanged: 0,
tool_count_before: 0, tool_count_after: 0,
};
const notes = generateCommentary(c);
expect(notes.some(n => n.includes('Fixed'))).toBe(true);
expect(notes.some(n => n.includes('fixed-test'))).toBe(true);
});
test('reports efficiency gains for stable tests', () => {
const c: ComparisonResult = {
before_file: 'a.json', after_file: 'b.json',
before_branch: 'main', after_branch: 'main',
before_timestamp: '', after_timestamp: '',
deltas: [{
name: 'fast-test',
before: { passed: true, cost_usd: 0.50, turns_used: 20, duration_ms: 120000 },
after: { passed: true, cost_usd: 0.25, turns_used: 10, duration_ms: 60000 },
status_change: 'unchanged',
}],
total_cost_delta: -0.25, total_duration_delta: -60000,
improved: 0, regressed: 0, unchanged: 1,
tool_count_before: 0, tool_count_after: 0,
};
const notes = generateCommentary(c);
expect(notes.some(n => n.includes('fewer turns'))).toBe(true);
expect(notes.some(n => n.includes('faster'))).toBe(true);
expect(notes.some(n => n.includes('cheaper'))).toBe(true);
});
test('reports detection rate changes', () => {
const c: ComparisonResult = {
before_file: 'a.json', after_file: 'b.json',
before_branch: 'main', after_branch: 'main',
before_timestamp: '', after_timestamp: '',
deltas: [{
name: 'detection-test',
before: { passed: true, cost_usd: 0.50, detection_rate: 3 },
after: { passed: true, cost_usd: 0.50, detection_rate: 5 },
status_change: 'unchanged',
}],
total_cost_delta: 0, total_duration_delta: 0,
improved: 0, regressed: 0, unchanged: 1,
tool_count_before: 0, tool_count_after: 0,
};
const notes = generateCommentary(c);
expect(notes.some(n => n.includes('detecting 2 more bugs'))).toBe(true);
});
test('produces overall summary for 3+ tests with no regressions', () => {
const c: ComparisonResult = {
before_file: 'a.json', after_file: 'b.json',
before_branch: 'main', after_branch: 'main',
before_timestamp: '', after_timestamp: '',
deltas: [
{ name: 'a', before: { passed: true, cost_usd: 0.50, turns_used: 10, duration_ms: 60000 },
after: { passed: true, cost_usd: 0.30, turns_used: 6, duration_ms: 40000 }, status_change: 'unchanged' },
{ name: 'b', before: { passed: true, cost_usd: 0.20, turns_used: 5, duration_ms: 30000 },
after: { passed: true, cost_usd: 0.15, turns_used: 4, duration_ms: 25000 }, status_change: 'unchanged' },
{ name: 'c', before: { passed: true, cost_usd: 0.10, turns_used: 3, duration_ms: 20000 },
after: { passed: true, cost_usd: 0.08, turns_used: 3, duration_ms: 18000 }, status_change: 'unchanged' },
],
total_cost_delta: -0.27, total_duration_delta: -27000,
improved: 0, regressed: 0, unchanged: 3,
tool_count_before: 0, tool_count_after: 0,
};
const notes = generateCommentary(c);
expect(notes.some(n => n.includes('Overall'))).toBe(true);
expect(notes.some(n => n.includes('No regressions'))).toBe(true);
});
test('returns empty for stable run with no significant changes', () => {
const c: ComparisonResult = {
before_file: 'a.json', after_file: 'b.json',
before_branch: 'main', after_branch: 'main',
before_timestamp: '', after_timestamp: '',
deltas: [
{ name: 'a', before: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 },
after: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 21000 }, status_change: 'unchanged' },
{ name: 'b', before: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 },
after: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 }, status_change: 'unchanged' },
{ name: 'c', before: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 },
after: { passed: true, cost_usd: 0.10, turns_used: 5, duration_ms: 20000 }, status_change: 'unchanged' },
],
total_cost_delta: 0, total_duration_delta: 1000,
improved: 0, regressed: 0, unchanged: 3,
tool_count_before: 15, tool_count_after: 15,
};
const notes = generateCommentary(c);
expect(notes.some(n => n.includes('Stable run'))).toBe(true);
});
});
+182 -8
View File
@@ -77,9 +77,9 @@ export interface EvalResult {
export interface TestDelta {
name: string;
before: { passed: boolean; cost_usd: number; turns_used?: number;
before: { passed: boolean; cost_usd: number; turns_used?: number; duration_ms?: number;
detection_rate?: number; tool_summary?: Record<string, number> };
after: { passed: boolean; cost_usd: number; turns_used?: number;
after: { passed: boolean; cost_usd: number; turns_used?: number; duration_ms?: number;
detection_rate?: number; tool_summary?: Record<string, number> };
status_change: 'improved' | 'regressed' | 'unchanged';
}
@@ -101,6 +101,21 @@ export interface ComparisonResult {
tool_count_after: number;
}
// --- Shared helpers ---
/**
* Determine if a planted-bug eval passed based on judge results vs ground truth thresholds.
* Centralizes the pass/fail logic so all planted-bug tests use the same criteria.
*/
export function judgePassed(
judgeResult: { detection_rate: number; false_positives: number; evidence_quality: number },
groundTruth: { minimum_detection: number; max_false_positives: number },
): boolean {
return judgeResult.detection_rate >= groundTruth.minimum_detection
&& judgeResult.false_positives <= groundTruth.max_false_positives
&& judgeResult.evidence_quality >= 2;
}
// --- Comparison functions (exported for eval:compare CLI) ---
/**
@@ -213,6 +228,7 @@ export function compareEvalResults(
passed: beforeTest?.passed ?? false,
cost_usd: beforeTest?.cost_usd ?? 0,
turns_used: beforeTest?.turns_used,
duration_ms: beforeTest?.duration_ms,
detection_rate: beforeTest?.detection_rate,
tool_summary: beforeToolSummary,
},
@@ -220,6 +236,7 @@ export function compareEvalResults(
passed: afterTest.passed,
cost_usd: afterTest.cost_usd,
turns_used: afterTest.turns_used,
duration_ms: afterTest.duration_ms,
detection_rate: afterTest.detection_rate,
tool_summary: afterToolSummary,
},
@@ -241,6 +258,7 @@ export function compareEvalResults(
passed: beforeTest.passed,
cost_usd: beforeTest.cost_usd,
turns_used: beforeTest.turns_used,
duration_ms: beforeTest.duration_ms,
detection_rate: beforeTest.detection_rate,
tool_summary: beforeToolSummary,
},
@@ -282,6 +300,28 @@ export function formatComparison(c: ComparisonResult): string {
const beforeStatus = d.before.passed ? 'PASS' : 'FAIL';
const afterStatus = d.after.passed ? 'PASS' : 'FAIL';
// Turns delta
let turnsDelta = '';
if (d.before.turns_used !== undefined && d.after.turns_used !== undefined) {
const td = d.after.turns_used - d.before.turns_used;
turnsDelta = ` ${d.before.turns_used}${d.after.turns_used}t`;
if (td !== 0) turnsDelta += `(${td > 0 ? '+' : ''}${td})`;
} else if (d.after.turns_used !== undefined) {
turnsDelta = ` ${d.after.turns_used}t`;
}
// Duration delta
let durDelta = '';
if (d.before.duration_ms !== undefined && d.after.duration_ms !== undefined) {
const bs = Math.round(d.before.duration_ms / 1000);
const as = Math.round(d.after.duration_ms / 1000);
const dd = as - bs;
durDelta = ` ${bs}${as}s`;
if (dd !== 0) durDelta += `(${dd > 0 ? '+' : ''}${dd})`;
} else if (d.after.duration_ms !== undefined) {
durDelta = ` ${Math.round(d.after.duration_ms / 1000)}s`;
}
let detail = '';
if (d.before.detection_rate !== undefined || d.after.detection_rate !== undefined) {
detail = ` ${d.before.detection_rate ?? '?'}${d.after.detection_rate ?? '?'} det`;
@@ -291,8 +331,8 @@ export function formatComparison(c: ComparisonResult): string {
detail = ` $${costBefore}$${costAfter}`;
}
const name = d.name.length > 35 ? d.name.slice(0, 32) + '...' : d.name.padEnd(35);
lines.push(` ${name} ${beforeStatus.padEnd(5)}${afterStatus.padEnd(5)} ${arrow}${detail}`);
const name = d.name.length > 30 ? d.name.slice(0, 27) + '...' : d.name.padEnd(30);
lines.push(` ${name} ${beforeStatus.padEnd(5)}${afterStatus.padEnd(5)} ${arrow}${detail}${turnsDelta}${durDelta}`);
}
lines.push('─'.repeat(70));
@@ -345,9 +385,143 @@ export function formatComparison(c: ComparisonResult): string {
}
}
// Commentary — interpret what the deltas mean
const commentary = generateCommentary(c);
if (commentary.length > 0) {
lines.push('');
lines.push(' Takeaway:');
for (const line of commentary) {
lines.push(` ${line}`);
}
}
return lines.join('\n');
}
/**
* Generate human-readable commentary interpreting comparison deltas.
* Pure function analyzes the numbers and explains what they mean.
*/
export function generateCommentary(c: ComparisonResult): string[] {
const notes: string[] = [];
// 1. Regressions are the most important signal — call them out first
const regressions = c.deltas.filter(d => d.status_change === 'regressed');
if (regressions.length > 0) {
for (const d of regressions) {
notes.push(`REGRESSION: "${d.name}" was passing, now fails. Investigate immediately.`);
}
}
// 2. Improvements
const improvements = c.deltas.filter(d => d.status_change === 'improved');
for (const d of improvements) {
notes.push(`Fixed: "${d.name}" now passes.`);
}
// 3. Per-test efficiency changes (only for unchanged-status tests — regressions/improvements are already noted)
const stable = c.deltas.filter(d => d.status_change === 'unchanged' && d.after.passed);
for (const d of stable) {
const insights: string[] = [];
// Turns
if (d.before.turns_used !== undefined && d.after.turns_used !== undefined && d.before.turns_used > 0) {
const turnsDelta = d.after.turns_used - d.before.turns_used;
const turnsPct = Math.round((turnsDelta / d.before.turns_used) * 100);
if (Math.abs(turnsPct) >= 20 && Math.abs(turnsDelta) >= 2) {
if (turnsDelta < 0) {
insights.push(`${Math.abs(turnsDelta)} fewer turns (${Math.abs(turnsPct)}% more efficient)`);
} else {
insights.push(`${turnsDelta} more turns (${turnsPct}% less efficient)`);
}
}
}
// Duration
if (d.before.duration_ms !== undefined && d.after.duration_ms !== undefined && d.before.duration_ms > 0) {
const durDelta = d.after.duration_ms - d.before.duration_ms;
const durPct = Math.round((durDelta / d.before.duration_ms) * 100);
if (Math.abs(durPct) >= 20 && Math.abs(durDelta) >= 5000) {
if (durDelta < 0) {
insights.push(`${Math.round(Math.abs(durDelta) / 1000)}s faster`);
} else {
insights.push(`${Math.round(durDelta / 1000)}s slower`);
}
}
}
// Detection rate
if (d.before.detection_rate !== undefined && d.after.detection_rate !== undefined) {
const detDelta = d.after.detection_rate - d.before.detection_rate;
if (detDelta !== 0) {
if (detDelta > 0) {
insights.push(`detecting ${detDelta} more bug${detDelta > 1 ? 's' : ''}`);
} else {
insights.push(`detecting ${Math.abs(detDelta)} fewer bug${Math.abs(detDelta) > 1 ? 's' : ''} — check prompt quality`);
}
}
}
// Cost
if (d.before.cost_usd > 0) {
const costDelta = d.after.cost_usd - d.before.cost_usd;
const costPct = Math.round((costDelta / d.before.cost_usd) * 100);
if (Math.abs(costPct) >= 30 && Math.abs(costDelta) >= 0.05) {
if (costDelta < 0) {
insights.push(`${Math.abs(costPct)}% cheaper`);
} else {
insights.push(`${costPct}% more expensive`);
}
}
}
if (insights.length > 0) {
notes.push(`"${d.name}": ${insights.join(', ')}.`);
}
}
// 4. Overall summary
if (c.deltas.length >= 3 && regressions.length === 0) {
const overallParts: string[] = [];
// Total cost
const totalBefore = c.deltas.reduce((s, d) => s + d.before.cost_usd, 0);
if (totalBefore > 0) {
const costPct = Math.round((c.total_cost_delta / totalBefore) * 100);
if (Math.abs(costPct) >= 10) {
overallParts.push(`${Math.abs(costPct)}% ${costPct < 0 ? 'cheaper' : 'more expensive'} overall`);
}
}
// Total duration
const totalDurBefore = c.deltas.reduce((s, d) => s + (d.before.duration_ms || 0), 0);
if (totalDurBefore > 0) {
const durPct = Math.round((c.total_duration_delta / totalDurBefore) * 100);
if (Math.abs(durPct) >= 10) {
overallParts.push(`${Math.abs(durPct)}% ${durPct < 0 ? 'faster' : 'slower'}`);
}
}
// Total turns
const turnsBefore = c.deltas.reduce((s, d) => s + (d.before.turns_used || 0), 0);
const turnsAfter = c.deltas.reduce((s, d) => s + (d.after.turns_used || 0), 0);
if (turnsBefore > 0) {
const turnsPct = Math.round(((turnsAfter - turnsBefore) / turnsBefore) * 100);
if (Math.abs(turnsPct) >= 10) {
overallParts.push(`${Math.abs(turnsPct)}% ${turnsPct < 0 ? 'fewer' : 'more'} turns`);
}
}
if (overallParts.length > 0) {
notes.push(`Overall: ${overallParts.join(', ')}. ${regressions.length === 0 ? 'No regressions.' : ''}`);
} else if (regressions.length === 0) {
notes.push('Stable run — no significant efficiency changes, no regressions.');
}
}
return notes;
}
// --- EvalCollector ---
function getGitInfo(): { branch: string; sha: string } {
@@ -500,19 +674,19 @@ export class EvalCollector {
for (const t of this.tests) {
const status = t.passed ? ' PASS ' : ' FAIL ';
const cost = `$${t.cost_usd.toFixed(2)}`;
const dur = t.duration_ms ? `${Math.round(t.duration_ms / 1000)}s` : '';
const turns = t.turns_used !== undefined ? `${t.turns_used}t` : '';
let detail = '';
if (t.detection_rate !== undefined) {
detail = `${t.detection_rate}/${(t.detected_bugs?.length || 0) + (t.missed_bugs?.length || 0)} det`;
} else if (t.turns_used !== undefined) {
detail = `${t.turns_used} turns`;
} else if (t.judge_scores) {
const scores = Object.entries(t.judge_scores).map(([k, v]) => `${k[0]}:${v}`).join(' ');
detail = scores;
}
const name = t.name.length > 38 ? t.name.slice(0, 35) + '...' : t.name.padEnd(38);
lines.push(` ${name} ${status} ${cost.padStart(6)} ${detail}`);
const name = t.name.length > 35 ? t.name.slice(0, 32) + '...' : t.name.padEnd(35);
lines.push(` ${name} ${status} ${cost.padStart(6)} ${turns.padStart(4)} ${dur.padStart(5)} ${detail}`);
}
lines.push('─'.repeat(70));
+539 -12
View File
@@ -2,7 +2,7 @@ import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import type { SkillTestResult } from './helpers/session-runner';
import { outcomeJudge } from './helpers/llm-judge';
import { EvalCollector } from './helpers/eval-store';
import { EvalCollector, judgePassed } from './helpers/eval-store';
import type { EvalTestEntry } from './helpers/eval-store';
import { startTestServer } from '../browse/test/test-server';
import { spawnSync } from 'child_process';
@@ -281,6 +281,124 @@ Report the exact output — either "READY: <path>" or "NEEDS_SETUP".`,
// Clean up
try { fs.rmSync(nonGitDir, { recursive: true, force: true }); } catch {}
}, 60_000);
test('contributor mode files a report on gstack error', async () => {
const contribDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-contrib-'));
const logsDir = path.join(contribDir, 'contributor-logs');
fs.mkdirSync(logsDir, { recursive: true });
// Extract contributor mode instructions from generated SKILL.md
const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
const contribStart = skillMd.indexOf('## Contributor Mode');
const contribEnd = skillMd.indexOf('\n## ', contribStart + 1);
const contribBlock = skillMd.slice(contribStart, contribEnd > 0 ? contribEnd : undefined);
const result = await runSkillTest({
prompt: `You are in contributor mode (_CONTRIB=true).
${contribBlock}
OVERRIDE: Write contributor logs to ${logsDir}/ instead of ~/.gstack/contributor-logs/
Now try this browse command (it will fail there is no binary at this path):
/nonexistent/path/browse goto https://example.com
This is a gstack issue (the browse binary is missing/misconfigured).
File a contributor report about this issue. Then tell me what you filed.`,
workingDirectory: contribDir,
maxTurns: 8,
timeout: 60_000,
testName: 'contributor-mode',
runId,
});
logCost('contributor mode', result);
// Override passed: this test intentionally triggers a browse error (nonexistent binary)
// so browseErrors will be non-empty — that's expected, not a failure
recordE2E('contributor mode report', 'Skill E2E tests', result, {
passed: result.exitReason === 'success',
});
// Verify a contributor log was created with expected format
const logFiles = fs.readdirSync(logsDir).filter(f => f.endsWith('.md'));
expect(logFiles.length).toBeGreaterThan(0);
const logContent = fs.readFileSync(path.join(logsDir, logFiles[0]), 'utf-8');
expect(logContent).toContain('Hey gstack team');
expect(logContent).toContain('What I was trying to do');
expect(logContent).toContain('What happened instead');
// Clean up
try { fs.rmSync(contribDir, { recursive: true, force: true }); } catch {}
}, 90_000);
test('session awareness adds ELI16 context when _SESSIONS >= 3', async () => {
const sessionDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-session-'));
// Set up a git repo so there's project/branch context to reference
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: sessionDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(sessionDir, 'app.rb'), '# my app\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'init']);
run('git', ['checkout', '-b', 'feature/add-payments']);
// Add a remote so the agent can derive a project name
run('git', ['remote', 'add', 'origin', 'https://github.com/acme/billing-app.git']);
// Extract AskUserQuestion format instructions from generated SKILL.md
const skillMd = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
const aqStart = skillMd.indexOf('## AskUserQuestion Format');
const aqEnd = skillMd.indexOf('\n## ', aqStart + 1);
const aqBlock = skillMd.slice(aqStart, aqEnd > 0 ? aqEnd : undefined);
const outputPath = path.join(sessionDir, 'question-output.md');
const result = await runSkillTest({
prompt: `You are running a gstack skill. The session preamble detected _SESSIONS=4 (the user has 4 gstack windows open).
${aqBlock}
You are on branch feature/add-payments in the billing-app project. You were reviewing a plan to add Stripe integration.
You've hit a decision point: the plan doesn't specify whether to use Stripe Checkout (hosted) or Stripe Elements (embedded). You need to ask the user which approach to use.
Since this is non-interactive, DO NOT actually call AskUserQuestion. Instead, write the EXACT text you would display to the user (the full AskUserQuestion content) to the file: ${outputPath}
Remember: _SESSIONS=4, so ELI16 mode is active. The user is juggling multiple windows and may not remember what this conversation is about. Re-ground them.`,
workingDirectory: sessionDir,
maxTurns: 8,
timeout: 60_000,
testName: 'session-awareness',
runId,
});
logCost('session awareness', result);
recordE2E('session awareness ELI16', 'Skill E2E tests', result);
// Verify the output contains ELI16 re-grounding context
if (fs.existsSync(outputPath)) {
const output = fs.readFileSync(outputPath, 'utf-8');
const lower = output.toLowerCase();
// Must mention project name
expect(lower.includes('billing') || lower.includes('acme')).toBe(true);
// Must mention branch
expect(lower.includes('payment') || lower.includes('feature')).toBe(true);
// Must mention what we're working on
expect(lower.includes('stripe') || lower.includes('checkout') || lower.includes('payment')).toBe(true);
// Must have a RECOMMENDATION
expect(output).toContain('RECOMMENDATION');
} else {
// Check agent output as fallback
const output = result.output || '';
expect(output).toContain('RECOMMENDATION');
}
// Clean up
try { fs.rmSync(sessionDir, { recursive: true, force: true }); } catch {}
}, 90_000);
});
// --- B4: QA skill E2E ---
@@ -315,14 +433,16 @@ Run a Quick-depth QA test on ${testServer.url}/basic.html
Do NOT use AskUserQuestion run Quick tier directly.
Write your report to ${qaDir}/qa-reports/qa-report.md`,
workingDirectory: qaDir,
maxTurns: 30,
maxTurns: 35,
timeout: 180_000,
testName: 'qa-quick',
runId,
});
logCost('/qa quick', result);
recordE2E('/qa quick', 'QA skill E2E', result);
recordE2E('/qa quick', 'QA skill E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// browseErrors can include false positives from hallucinated paths
if (result.browseErrors.length > 0) {
console.warn('/qa quick browse errors (non-fatal):', result.browseErrors);
@@ -391,6 +511,78 @@ Write your review findings to ${reviewDir}/review-output.md`,
}, 120_000);
});
// --- Review: Enum completeness E2E ---
describeE2E('Review enum completeness E2E', () => {
let enumDir: string;
beforeAll(() => {
enumDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-enum-'));
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: enumDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Commit baseline on main — order model with 4 statuses
const baseContent = fs.readFileSync(path.join(ROOT, 'test', 'fixtures', 'review-eval-enum.rb'), 'utf-8');
fs.writeFileSync(path.join(enumDir, 'order.rb'), baseContent);
run('git', ['add', 'order.rb']);
run('git', ['commit', '-m', 'initial order model']);
// Feature branch adds "returned" status but misses handlers
run('git', ['checkout', '-b', 'feature/add-returned-status']);
const diffContent = fs.readFileSync(path.join(ROOT, 'test', 'fixtures', 'review-eval-enum-diff.rb'), 'utf-8');
fs.writeFileSync(path.join(enumDir, 'order.rb'), diffContent);
run('git', ['add', 'order.rb']);
run('git', ['commit', '-m', 'add returned status']);
// Copy review skill files
fs.copyFileSync(path.join(ROOT, 'review', 'SKILL.md'), path.join(enumDir, 'review-SKILL.md'));
fs.copyFileSync(path.join(ROOT, 'review', 'checklist.md'), path.join(enumDir, 'review-checklist.md'));
fs.copyFileSync(path.join(ROOT, 'review', 'greptile-triage.md'), path.join(enumDir, 'review-greptile-triage.md'));
});
afterAll(() => {
try { fs.rmSync(enumDir, { recursive: true, force: true }); } catch {}
});
test('/review catches missing enum handlers for new status value', async () => {
const result = await runSkillTest({
prompt: `You are in a git repo on branch feature/add-returned-status with changes against main.
Read review-SKILL.md for the review workflow instructions.
Also read review-checklist.md and apply it pay special attention to the Enum & Value Completeness section.
Run /review on the current diff (git diff main...HEAD).
Write your review findings to ${enumDir}/review-output.md
The diff adds a new "returned" status to the Order model. Your job is to check if all consumers handle it.`,
workingDirectory: enumDir,
maxTurns: 15,
timeout: 90_000,
testName: 'review-enum-completeness',
runId,
});
logCost('/review enum', result);
recordE2E('/review enum completeness', 'Review enum completeness E2E', result);
expect(result.exitReason).toBe('success');
// Verify the review caught the missing enum handlers
const reviewPath = path.join(enumDir, 'review-output.md');
if (fs.existsSync(reviewPath)) {
const review = fs.readFileSync(reviewPath, 'utf-8');
// Should mention the missing "returned" handling in at least one of the methods
const mentionsReturned = review.toLowerCase().includes('returned');
const mentionsEnum = review.toLowerCase().includes('enum') || review.toLowerCase().includes('status');
const mentionsCritical = review.toLowerCase().includes('critical');
expect(mentionsReturned).toBe(true);
expect(mentionsEnum || mentionsCritical).toBe(true);
}
}, 120_000);
});
// --- B6/B7/B8: Planted-bug outcome evals ---
// Outcome evals also need ANTHROPIC_API_KEY for the LLM judge
@@ -451,11 +643,12 @@ Write every bug you found so far. Format each as:
- Severity: high / medium / low
- Evidence: what you observed
PHASE 3 Interactive testing (systematic form + edge case testing):
- For EVERY input field on the page: fill it, clear it, try invalid values
- Specifically test: empty fields, invalid email formats, extra-long text, clearing numeric fields
- Submit the form and immediately run $B console --errors
- Click every link/button and check for broken behavior
PHASE 3 Interactive testing (targeted max 15 commands):
- Test email: type "user@" (no domain) and blur does it validate?
- Test quantity: clear the field entirely check the total display
- Test credit card: type a 25-character string check for overflow
- Submit the form with zip code empty does it require zip?
- Submit a valid form and run $B console --errors
- After finding more bugs, UPDATE ${reportPath} with new findings
PHASE 4 Finalize report:
@@ -467,7 +660,7 @@ CRITICAL RULES:
- Write the report file in PHASE 2 before doing interactive testing
- The report MUST exist at ${reportPath} when you finish`,
workingDirectory: testWorkDir,
maxTurns: 40,
maxTurns: 50,
timeout: 300_000,
testName: `qa-${label}`,
runId,
@@ -522,6 +715,7 @@ CRITICAL RULES:
// Record to eval collector with outcome judge results
recordE2E(`/qa ${label}`, 'Planted-bug outcome evals', result, {
passed: judgePassed(judgeResult, groundTruth),
detection_rate: judgeResult.detection_rate,
false_positives: judgeResult.false_positives,
evidence_quality: judgeResult.evidence_quality,
@@ -629,7 +823,9 @@ Focus on reviewing the plan content: architecture, error handling, security, and
});
logCost('/plan-ceo-review', result);
recordE2E('/plan-ceo-review', 'Plan CEO Review E2E', result);
recordE2E('/plan-ceo-review', 'Plan CEO Review E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Accept error_max_turns — the CEO review is very thorough and may exceed turns
expect(['success', 'error_max_turns']).toContain(result.exitReason);
@@ -722,7 +918,9 @@ Focus on architecture, code quality, tests, and performance sections.`,
});
logCost('/plan-eng-review', result);
recordE2E('/plan-eng-review', 'Plan Eng Review E2E', result);
recordE2E('/plan-eng-review', 'Plan Eng Review E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify the review was written
@@ -805,7 +1003,9 @@ Analyze the git history and produce the narrative report as described in the SKI
});
logCost('/retro', result);
recordE2E('/retro', 'Retro E2E', result);
recordE2E('/retro', 'Retro E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Accept error_max_turns — retro does many git commands to analyze history
expect(['success', 'error_max_turns']).toContain(result.exitReason);
@@ -818,6 +1018,333 @@ Analyze the git history and produce the narrative report as described in the SKI
}, 420_000);
});
// --- QA-Only E2E (report-only, no fixes) ---
describeE2E('QA-Only skill E2E', () => {
let qaOnlyDir: string;
beforeAll(() => {
testServer = testServer || startTestServer();
qaOnlyDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-qa-only-'));
setupBrowseShims(qaOnlyDir);
// Copy qa-only skill files
copyDirSync(path.join(ROOT, 'qa-only'), path.join(qaOnlyDir, 'qa-only'));
// Copy qa templates (qa-only references qa/templates/qa-report-template.md)
fs.mkdirSync(path.join(qaOnlyDir, 'qa', 'templates'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'qa', 'templates', 'qa-report-template.md'),
path.join(qaOnlyDir, 'qa', 'templates', 'qa-report-template.md'),
);
// Init git repo (qa-only checks for feature branch in diff-aware mode)
const { spawnSync } = require('child_process');
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: qaOnlyDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
fs.writeFileSync(path.join(qaOnlyDir, 'index.html'), '<h1>Test</h1>\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
});
afterAll(() => {
try { fs.rmSync(qaOnlyDir, { recursive: true, force: true }); } catch {}
});
test('/qa-only produces report without using Edit tool', async () => {
const result = await runSkillTest({
prompt: `IMPORTANT: The browse binary is already assigned below as B. Do NOT search for it or run the SKILL.md setup block — just use $B directly.
B="${browseBin}"
Read the file qa-only/SKILL.md for the QA-only workflow instructions.
Run a Quick QA test on ${testServer.url}/qa-eval.html
Do NOT use AskUserQuestion run Quick tier directly.
Write your report to ${qaOnlyDir}/qa-reports/qa-only-report.md`,
workingDirectory: qaOnlyDir,
maxTurns: 35,
allowedTools: ['Bash', 'Read', 'Write', 'Glob'], // NO Edit — the critical guardrail
timeout: 180_000,
testName: 'qa-only-no-fix',
runId,
});
logCost('/qa-only', result);
// Verify Edit was not used — the critical guardrail for report-only mode.
// Glob is read-only and may be used for file discovery (e.g. finding SKILL.md).
const editCalls = result.toolCalls.filter(tc => tc.tool === 'Edit');
if (editCalls.length > 0) {
console.warn('qa-only used Edit tool:', editCalls.length, 'times');
}
const exitOk = ['success', 'error_max_turns'].includes(result.exitReason);
recordE2E('/qa-only no-fix', 'QA-Only skill E2E', result, {
passed: exitOk && editCalls.length === 0,
});
expect(editCalls).toHaveLength(0);
// Accept error_max_turns — the agent doing thorough QA is not a failure
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify git working tree is still clean (no source modifications)
const gitStatus = spawnSync('git', ['status', '--porcelain'], {
cwd: qaOnlyDir, stdio: 'pipe',
});
const statusLines = gitStatus.stdout.toString().trim().split('\n').filter(
(l: string) => l.trim() && !l.includes('.prompt-tmp') && !l.includes('.gstack/') && !l.includes('qa-reports/'),
);
expect(statusLines.filter((l: string) => l.startsWith(' M') || l.startsWith('M '))).toHaveLength(0);
}, 240_000);
});
// --- QA Fix Loop E2E ---
describeE2E('QA Fix Loop E2E', () => {
let qaFixDir: string;
let qaFixServer: ReturnType<typeof Bun.serve> | null = null;
beforeAll(() => {
qaFixDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-qa-fix-'));
setupBrowseShims(qaFixDir);
// Copy qa skill files
copyDirSync(path.join(ROOT, 'qa'), path.join(qaFixDir, 'qa'));
// Create a simple HTML page with obvious fixable bugs
fs.writeFileSync(path.join(qaFixDir, 'index.html'), `<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>Test App</title></head>
<body>
<h1>Welcome to Test App</h1>
<nav>
<a href="/about">About</a>
<a href="/nonexistent-broken-page">Help</a> <!-- BUG: broken link -->
</nav>
<form id="contact">
<input type="text" name="name" placeholder="Name">
<input type="email" name="email" placeholder="Email">
<button type="submit" disabled>Send</button> <!-- BUG: permanently disabled -->
</form>
<img src="/missing-logo.png"> <!-- BUG: missing alt text -->
<script>console.error("TypeError: Cannot read property 'map' of undefined");</script> <!-- BUG: console error -->
</body>
</html>
`);
// Init git repo with clean working tree
const { spawnSync } = require('child_process');
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: qaFixDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial commit']);
// Start a local server serving from the working directory so fixes are reflected on refresh
qaFixServer = Bun.serve({
port: 0,
hostname: '127.0.0.1',
fetch(req) {
const url = new URL(req.url);
let filePath = url.pathname === '/' ? '/index.html' : url.pathname;
filePath = filePath.replace(/^\//, '');
const fullPath = path.join(qaFixDir, filePath);
if (!fs.existsSync(fullPath)) {
return new Response('Not Found', { status: 404 });
}
const content = fs.readFileSync(fullPath, 'utf-8');
return new Response(content, {
headers: { 'Content-Type': 'text/html' },
});
},
});
});
afterAll(() => {
qaFixServer?.stop();
try { fs.rmSync(qaFixDir, { recursive: true, force: true }); } catch {}
});
test('/qa fix loop finds bugs and commits fixes', async () => {
const qaFixUrl = `http://127.0.0.1:${qaFixServer!.port}`;
const result = await runSkillTest({
prompt: `You have a browse binary at ${browseBin}. Assign it to B variable like: B="${browseBin}"
Read the file qa/SKILL.md for the QA workflow instructions.
Run a Quick-tier QA test on ${qaFixUrl}
The source code for this page is at ${qaFixDir}/index.html you can fix bugs there.
Do NOT use AskUserQuestion run Quick tier directly.
Write your report to ${qaFixDir}/qa-reports/qa-report.md
This is a test+fix loop: find bugs, fix them in the source code, commit each fix, and re-verify.`,
workingDirectory: qaFixDir,
maxTurns: 40,
allowedTools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep'],
timeout: 300_000,
testName: 'qa-fix-loop',
runId,
});
logCost('/qa fix loop', result);
recordE2E('/qa fix loop', 'QA Fix Loop E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
// Accept error_max_turns — fix loop may use many turns
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify at least one fix commit was made beyond the initial commit
const gitLog = spawnSync('git', ['log', '--oneline'], {
cwd: qaFixDir, stdio: 'pipe',
});
const commits = gitLog.stdout.toString().trim().split('\n');
console.log(`/qa fix loop: ${commits.length} commits total (1 initial + ${commits.length - 1} fixes)`);
expect(commits.length).toBeGreaterThan(1);
// Verify Edit tool was used (agent actually modified source code)
const editCalls = result.toolCalls.filter(tc => tc.tool === 'Edit');
expect(editCalls.length).toBeGreaterThan(0);
}, 360_000);
});
// --- Plan-Eng-Review Test-Plan Artifact E2E ---
describeE2E('Plan-Eng-Review Test-Plan Artifact E2E', () => {
let planDir: string;
let projectDir: string;
beforeAll(() => {
planDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-plan-artifact-'));
const { spawnSync } = require('child_process');
const run = (cmd: string, args: string[]) =>
spawnSync(cmd, args, { cwd: planDir, stdio: 'pipe', timeout: 5000 });
run('git', ['init']);
run('git', ['config', 'user.email', 'test@test.com']);
run('git', ['config', 'user.name', 'Test']);
// Create base commit on main
fs.writeFileSync(path.join(planDir, 'app.ts'), 'export function greet() { return "hello"; }\n');
run('git', ['add', '.']);
run('git', ['commit', '-m', 'initial']);
// Create feature branch with changes
run('git', ['checkout', '-b', 'feature/add-dashboard']);
fs.writeFileSync(path.join(planDir, 'dashboard.ts'), `export function Dashboard() {
const data = fetchStats();
return { users: data.users, revenue: data.revenue };
}
function fetchStats() {
return fetch('/api/stats').then(r => r.json());
}
`);
fs.writeFileSync(path.join(planDir, 'app.ts'), `import { Dashboard } from "./dashboard";
export function greet() { return "hello"; }
export function main() { return Dashboard(); }
`);
run('git', ['add', '.']);
run('git', ['commit', '-m', 'feat: add dashboard']);
// Plan document
fs.writeFileSync(path.join(planDir, 'plan.md'), `# Plan: Add Dashboard
## Changes
1. New \`dashboard.ts\` with Dashboard component and fetchStats API call
2. Updated \`app.ts\` to import and use Dashboard
## Architecture
- Dashboard fetches from \`/api/stats\` endpoint
- Returns user count and revenue metrics
`);
run('git', ['add', 'plan.md']);
run('git', ['commit', '-m', 'add plan']);
// Copy plan-eng-review skill
fs.mkdirSync(path.join(planDir, 'plan-eng-review'), { recursive: true });
fs.copyFileSync(
path.join(ROOT, 'plan-eng-review', 'SKILL.md'),
path.join(planDir, 'plan-eng-review', 'SKILL.md'),
);
// Set up remote-slug shim and browse shims (plan-eng-review uses remote-slug for artifact path)
setupBrowseShims(planDir);
// Create project directory for artifacts
projectDir = path.join(os.homedir(), '.gstack', 'projects', 'test-project');
fs.mkdirSync(projectDir, { recursive: true });
});
afterAll(() => {
try { fs.rmSync(planDir, { recursive: true, force: true }); } catch {}
// Clean up test-plan artifacts (but not the project dir itself)
try {
const files = fs.readdirSync(projectDir);
for (const f of files) {
if (f.includes('test-plan')) {
fs.unlinkSync(path.join(projectDir, f));
}
}
} catch {}
});
test('/plan-eng-review writes test-plan artifact to ~/.gstack/projects/', async () => {
// Count existing test-plan files before
const beforeFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
const result = await runSkillTest({
prompt: `Read plan-eng-review/SKILL.md for the review workflow.
Read plan.md that's the plan to review. This is a standalone plan with source code in app.ts and dashboard.ts.
Choose SMALL CHANGE mode. Skip any AskUserQuestion calls this is non-interactive.
IMPORTANT: After your review, you MUST write the test-plan artifact as described in the "Test Plan Artifact" section of SKILL.md. The remote-slug shim is at ${planDir}/browse/bin/remote-slug.
Write your review to ${planDir}/review-output.md`,
workingDirectory: planDir,
maxTurns: 20,
allowedTools: ['Bash', 'Read', 'Write', 'Glob', 'Grep'],
timeout: 360_000,
testName: 'plan-eng-review-artifact',
runId,
});
logCost('/plan-eng-review artifact', result);
recordE2E('/plan-eng-review test-plan artifact', 'Plan-Eng-Review Test-Plan Artifact E2E', result, {
passed: ['success', 'error_max_turns'].includes(result.exitReason),
});
expect(['success', 'error_max_turns']).toContain(result.exitReason);
// Verify test-plan artifact was written
const afterFiles = fs.readdirSync(projectDir).filter(f => f.includes('test-plan'));
const newFiles = afterFiles.filter(f => !beforeFiles.includes(f));
console.log(`Test-plan artifacts: ${beforeFiles.length} before, ${afterFiles.length} after, ${newFiles.length} new`);
if (newFiles.length > 0) {
const content = fs.readFileSync(path.join(projectDir, newFiles[0]), 'utf-8');
console.log(`Test-plan artifact (${newFiles[0]}): ${content.length} chars`);
expect(content.length).toBeGreaterThan(50);
} else {
console.warn('No test-plan artifact found — agent may not have followed artifact instructions');
}
// Soft assertion: we expect an artifact but agent compliance is not guaranteed
expect(newFiles.length).toBeGreaterThanOrEqual(1);
}, 420_000);
});
// --- Deferred skill E2E tests (destructive or require interactive UI) ---
describeE2E('Deferred skill E2E', () => {
+88 -1
View File
@@ -43,6 +43,20 @@ describe('SKILL.md command validation', () => {
const result = validateSkill(qaSkill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
test('all $B commands in qa-only/SKILL.md are valid browse commands', () => {
const qaOnlySkill = path.join(ROOT, 'qa-only', 'SKILL.md');
if (!fs.existsSync(qaOnlySkill)) return;
const result = validateSkill(qaOnlySkill);
expect(result.invalid).toHaveLength(0);
});
test('all snapshot flags in qa-only/SKILL.md are valid', () => {
const qaOnlySkill = path.join(ROOT, 'qa-only', 'SKILL.md');
if (!fs.existsSync(qaOnlySkill)) return;
const result = validateSkill(qaOnlySkill);
expect(result.snapshotFlagErrors).toHaveLength(0);
});
});
describe('Command registry consistency', () => {
@@ -157,6 +171,7 @@ describe('Generated SKILL.md freshness', () => {
describe('Update check preamble', () => {
const skillsWithUpdateCheck = [
'SKILL.md', 'browse/SKILL.md', 'qa/SKILL.md',
'qa-only/SKILL.md',
'setup-browser-cookies/SKILL.md',
'ship/SKILL.md', 'review/SKILL.md',
'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
@@ -261,7 +276,7 @@ describe('Cross-skill path consistency', () => {
describe('QA skill structure validation', () => {
const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
test('qa/SKILL.md has all 6 phases', () => {
test('qa/SKILL.md has all 11 phases', () => {
const phases = [
'Phase 1', 'Initialize',
'Phase 2', 'Authenticate',
@@ -269,6 +284,11 @@ describe('QA skill structure validation', () => {
'Phase 4', 'Explore',
'Phase 5', 'Document',
'Phase 6', 'Wrap Up',
'Phase 7', 'Triage',
'Phase 8', 'Fix Loop',
'Phase 9', 'Final QA',
'Phase 10', 'Report',
'Phase 11', 'TODOS',
];
for (const phase of phases) {
expect(qaContent).toContain(phase);
@@ -291,6 +311,13 @@ describe('QA skill structure validation', () => {
expect(qaContent).toContain('--regression');
});
test('has all three tiers defined', () => {
const tiers = ['Quick', 'Standard', 'Exhaustive'];
for (const tier of tiers) {
expect(qaContent).toContain(tier);
}
});
test('health score weights sum to 100%', () => {
const weights = extractWeightsFromTable(qaContent);
expect(weights.size).toBeGreaterThan(0);
@@ -384,6 +411,66 @@ describe('TODOS-format.md reference consistency', () => {
});
});
// --- v0.4.1 feature coverage: RECOMMENDATION format, session awareness, enum completeness ---
describe('v0.4.1 preamble features', () => {
const skillsWithPreamble = [
'SKILL.md', 'browse/SKILL.md', 'qa/SKILL.md',
'qa-only/SKILL.md',
'setup-browser-cookies/SKILL.md',
'ship/SKILL.md', 'review/SKILL.md',
'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
'retro/SKILL.md',
];
for (const skill of skillsWithPreamble) {
test(`${skill} contains RECOMMENDATION format`, () => {
const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
expect(content).toContain('RECOMMENDATION: Choose');
expect(content).toContain('AskUserQuestion');
});
test(`${skill} contains session awareness`, () => {
const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
expect(content).toContain('_SESSIONS');
expect(content).toContain('ELI16');
});
}
});
describe('Enum & Value Completeness in review checklist', () => {
const checklist = fs.readFileSync(path.join(ROOT, 'review', 'checklist.md'), 'utf-8');
test('checklist has Enum & Value Completeness section', () => {
expect(checklist).toContain('Enum & Value Completeness');
});
test('Enum & Value Completeness is classified as CRITICAL', () => {
// It should appear under Pass 1 — CRITICAL, not Pass 2
const pass1Start = checklist.indexOf('### Pass 1');
const pass2Start = checklist.indexOf('### Pass 2');
const enumStart = checklist.indexOf('Enum & Value Completeness');
expect(enumStart).toBeGreaterThan(pass1Start);
expect(enumStart).toBeLessThan(pass2Start);
});
test('Enum & Value Completeness mentions tracing through consumers', () => {
expect(checklist).toContain('Trace it through every consumer');
expect(checklist).toContain('case');
expect(checklist).toContain('allowlist');
});
test('Enum & Value Completeness is in the gate classification as CRITICAL', () => {
const gateSection = checklist.slice(checklist.indexOf('## Gate Classification'));
// The ASCII art has CRITICAL on the left and INFORMATIONAL on the right
// Enum & Value Completeness should appear on a line with the CRITICAL tree (├─ or └─)
const enumLine = gateSection.split('\n').find(l => l.includes('Enum & Value Completeness'));
expect(enumLine).toBeDefined();
// It's on the left (CRITICAL) side — starts with ├─ or └─
expect(enumLine!.trimStart().startsWith('├─') || enumLine!.trimStart().startsWith('└─')).toBe(true);
});
});
// --- Part 7: Planted-bug fixture validation (A4) ---
describe('Planted-bug fixture validation', () => {