Files
gstack/qa/SKILL.md.tmpl
Garry Tan b805aa0113 feat: Confusion Protocol, Hermes + GBrain hosts, brain-first resolver (v0.18.0.0) (#1005)
* feat: add Confusion Protocol to preamble resolver

Injects a high-stakes ambiguity gate at preamble tier >= 2 so all
workflow skills get it. Fires when Claude encounters architectural
decisions, data model changes, destructive operations, or contradictory
requirements. Does NOT fire on routine coding.

Addresses Karpathy failure mode #1 (wrong assumptions) with an
inline STOP gate instead of relying on workflow skill invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Hermes and GBrain host configs

Hermes: tool rewrites for terminal/read_file/patch/delegate_task,
paths to ~/.hermes/skills/gstack, AGENTS.md config file.

GBrain: coding skills become brain-aware when GBrain mod is installed.
Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP).
GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain
host, enabling brain-first lookup and save-to-brain behavior.

Both registered in hosts/index.ts with setup script redirect messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: GBrain resolver — brain-first lookup and save-to-brain

New scripts/resolvers/gbrain.ts with two resolver functions:
- GBRAIN_CONTEXT_LOAD: search brain for context before skill starts
- GBRAIN_SAVE_RESULTS: save skill output to brain after completion

Placeholders added to 4 thinking skill templates (office-hours,
investigate, plan-ceo-review, retro). Resolves to empty string on
all hosts except gbrain via suppressedResolvers.

GBRAIN suppression added to all 9 non-gbrain host configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: wire slop:diff into /review as advisory diagnostic

Adds Step 3.5 to the review template: runs bun run slop:diff against
the base branch to catch AI code quality issues (empty catches,
redundant return await, overcomplicated abstractions). Advisory only,
never blocking. Skips silently if slop-scan is not installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Karpathy compatibility note to README

Positions gstack as the workflow enforcement layer for Karpathy-style
CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills.
Maps each Karpathy failure mode to the gstack skill that addresses it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: improve native OpenClaw thinking skills

office-hours: add design doc path visibility message after writing
ceo-review: add HARD GATE reminder at review section transitions
retro: add non-git context support (check memory for meeting notes)

Mirrors template improvements to hand-crafted native skills.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update tests and golden fixtures for new hosts

- Host count: 8 → 10 (hermes, gbrain)
- OpenClaw adapter test: expects undefined (dead code removed)
- Golden ship fixtures: updated with Confusion Protocol + vendoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate all SKILL.md files

Regenerated from templates after Confusion Protocol, GBrain resolver
placeholders, slop:diff in review, HARD GATE reminders, investigation
learnings, design doc visibility, and retro non-git context changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.18.0.0

- CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain,
  slop in review, Karpathy note, skill improvements)
- CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing
- README.md: update agent count 8→10, add Hermes + GBrain to table
- VERSION: bump to 0.18.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: sync package.json version to 0.18.0.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: extract Step 0 from review SKILL.md in E2E test

The review-base-branch E2E test was copying the full 1493-line
review/SKILL.md into the test fixture. The agent spent 8+ turns
reading it in chunks, leaving only 7 turns for actual work, causing
error_max_turns on every attempt.

Now extracts only Step 0 (base branch detection, ~50 lines) which is
all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy
a full SKILL.md file into an E2E test fixture."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: update GBrain and Hermes host configs for v0.10.0 integration

GBrain: add 'triggers' to keepFields so generated skills pass
checkResolvable() validation. Add version compat comment.

Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS.
The resolvers handle GBrain-not-installed gracefully, so Hermes
agents with GBrain as a mod get brain features automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: GBrain resolver DX improvements and preamble health check

Resolver changes:
- gbrain query → gbrain search (fast keyword search, not expensive hybrid)
- Add keyword extraction guidance for agents
- Show explicit gbrain put_page syntax with --title, --tags, heredoc
- Add entity enrichment with false-positive filter
- Name throttle error patterns (exit code 1, stderr keywords)
- Add data-research routing for investigate skill
- Expand skillSaveMap from 4 to 8 entries
- Add brain operation telemetry summary

Preamble changes:
- Add gbrain doctor --fast --json health check for gbrain/hermes hosts
- Parse check failures/warnings count
- Show failing check details when score < 50

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: preserve keepFields in allowlist frontmatter mode

The allowlist mode hard-coded name + description reconstruction but
never iterated keepFields for additional fields. Adding 'triggers'
to keepFields was a no-op because the field was silently stripped.

Now iterates keepFields and preserves any field beyond name/description
from the source template frontmatter, including YAML arrays.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add triggers to all 38 skill templates

Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md
router. Each skill gets 3-6 triggers derived from its "Use when asked
to..." description text. Avoids single generic words that would collide
across skills (e.g., "debug this" not "debug").

These are distinct from voice-triggers (speech-to-text aliases) and
serve GBrain's checkResolvable() validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate all SKILL.md files and update golden fixtures

Regenerated from updated templates (triggers, brain placeholders,
resolver DX improvements, preamble health check). Golden fixtures
updated to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: settings-hook remove exits 1 when nothing to remove

gstack-settings-hook remove was exiting 0 when settings.json didn't
exist, causing gstack-uninstall to report "SessionStart hook" as
removed on clean systems where nothing was installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for GBrain v0.10.0 integration

ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS
to resolver table.

CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration
details (triggers, expanded brain-awareness, DX improvements, Hermes
brain support), updated date.

CLAUDE.md: added gbrain to resolvers/ directory comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: routing E2E stops writing to user's ~/.claude/skills/

installSkills() was copying SKILL.md files to both project-level
(.claude/skills/ in tmpDir) and user-level (~/.claude/skills/).
Writing to the user's real install fails when symlinks point to
different worktrees or dangling targets (ENOENT on copyFileSync).

Now installs to project-level only. The test already sets cwd to
the tmpDir, so project-level discovery works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: scale Gemini E2E back to smoke test

Gemini CLI gets lost in worktrees on complex tasks (review times out
at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack
skill execution. Replace the two failing tests (gemini-discover-skill
and gemini-review-findings) with a single smoke test that verifies
Gemini can start and read the README. 90s timeout, no skill invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:41:38 -07:00

341 lines
11 KiB
Cheetah

---
name: qa
preamble-tier: 4
version: 2.0.0
description: |
Systematically QA test a web application and fix bugs found. Runs QA testing,
then iteratively fixes bugs in source code, committing each fix atomically and
re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs",
"test and fix", or "fix what's broken".
Proactively suggest when the user says a feature is ready for testing
or asks "does this work?". Three tiers: Quick (critical/high only),
Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,
fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only. (gstack)
voice-triggers:
- "quality check"
- "test the app"
- "run QA"
allowed-tools:
- Bash
- Read
- Write
- Edit
- Glob
- Grep
- AskUserQuestion
- WebSearch
triggers:
- qa test this
- find bugs on site
- test the site
---
{{PREAMBLE}}
{{BASE_BRANCH_DETECT}}
{{GBRAIN_CONTEXT_LOAD}}
# /qa: Test → Fix → Verify
You are a QA engineer AND a bug-fix engineer. Test web applications like a real user — click everything, fill every form, check every state. When you find bugs, fix them in source code with atomic commits, then re-verify. Produce a structured report with before/after evidence.
## Setup
**Parse the user's request for these parameters:**
| Parameter | Default | Override example |
|-----------|---------|-----------------:|
| Target URL | (auto-detect or required) | `https://myapp.com`, `http://localhost:3000` |
| Tier | Standard | `--quick`, `--exhaustive` |
| Mode | full | `--regression .gstack/qa-reports/baseline.json` |
| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
| Scope | Full app (or diff-scoped) | `Focus on the billing page` |
| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
**Tiers determine which issues get fixed:**
- **Quick:** Fix critical + high severity only
- **Standard:** + medium severity (default)
- **Exhaustive:** + low/cosmetic severity
**If no URL is given and you're on a feature branch:** Automatically enter **diff-aware mode** (see Modes below). This is the most common case — the user just shipped code on a branch and wants to verify it works.
**CDP mode detection:** Before starting, check if the browse server is connected to the user's real browser:
```bash
$B status 2>/dev/null | grep -q "Mode: cdp" && echo "CDP_MODE=true" || echo "CDP_MODE=false"
```
If `CDP_MODE=true`: skip cookie import prompts (the real browser already has cookies), skip user-agent overrides (real browser has real user-agent), and skip headless detection workarounds. The user's real auth sessions are already available.
**Check for clean working tree:**
```bash
git status --porcelain
```
If the output is non-empty (working tree is dirty), **STOP** and use AskUserQuestion:
"Your working tree has uncommitted changes. /qa needs a clean tree so each bug fix gets its own atomic commit."
- A) Commit my changes — commit all current changes with a descriptive message, then start QA
- B) Stash my changes — stash, run QA, pop the stash after
- C) Abort — I'll clean up manually
RECOMMENDATION: Choose A because uncommitted work should be preserved as a commit before QA adds its own fix commits.
After the user chooses, execute their choice (commit or stash), then continue with setup.
**Find the browse binary:**
{{BROWSE_SETUP}}
**Check test framework (bootstrap if needed):**
{{TEST_BOOTSTRAP}}
**Create output directories:**
```bash
mkdir -p .gstack/qa-reports/screenshots
```
---
{{LEARNINGS_SEARCH}}
## Test Plan Context
Before falling back to git diff heuristics, check for richer test plan sources:
1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
```bash
setopt +o nomatch 2>/dev/null || true # zsh compat
{{SLUG_EVAL}}
ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
```
2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
3. **Use whichever source is richer.** Fall back to git diff analysis only if neither is available.
---
## Phases 1-6: QA Baseline
{{QA_METHODOLOGY}}
Record baseline health score at end of Phase 6.
---
## Output Structure
```
.gstack/qa-reports/
├── qa-report-{domain}-{YYYY-MM-DD}.md # Structured report
├── screenshots/
│ ├── initial.png # Landing page annotated screenshot
│ ├── issue-001-step-1.png # Per-issue evidence
│ ├── issue-001-result.png
│ ├── issue-001-before.png # Before fix (if fixed)
│ ├── issue-001-after.png # After fix (if fixed)
│ └── ...
└── baseline.json # For regression mode
```
Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
---
## Phase 7: Triage
Sort all discovered issues by severity, then decide which to fix based on the selected tier:
- **Quick:** Fix critical + high only. Mark medium/low as "deferred."
- **Standard:** Fix critical + high + medium. Mark low as "deferred."
- **Exhaustive:** Fix all, including cosmetic/low severity.
Mark issues that cannot be fixed from source code (e.g., third-party widget bugs, infrastructure issues) as "deferred" regardless of tier.
---
## Phase 8: Fix Loop
For each fixable issue, in severity order:
### 8a. Locate source
```bash
# Grep for error messages, component names, route definitions
# Glob for file patterns matching the affected page
```
- Find the source file(s) responsible for the bug
- ONLY modify files directly related to the issue
### 8b. Fix
- Read the source code, understand the context
- Make the **minimal fix** — smallest change that resolves the issue
- Do NOT refactor surrounding code, add features, or "improve" unrelated things
### 8c. Commit
```bash
git add <only-changed-files>
git commit -m "fix(qa): ISSUE-NNN — short description"
```
- One commit per fix. Never bundle multiple fixes.
- Message format: `fix(qa): ISSUE-NNN — short description`
### 8d. Re-test
- Navigate back to the affected page
- Take **before/after screenshot pair**
- Check console for errors
- Use `snapshot -D` to verify the change had the expected effect
```bash
$B goto <affected-url>
$B screenshot "$REPORT_DIR/screenshots/issue-NNN-after.png"
$B console --errors
$B snapshot -D
```
### 8e. Classify
- **verified**: re-test confirms the fix works, no new errors introduced
- **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service)
- **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred"
### 8e.5. Regression Test
Skip if: classification is not "verified", OR the fix is purely visual/CSS with no JS behavior, OR no test framework was detected AND user declined bootstrap.
**1. Study the project's existing test patterns:**
Read 2-3 test files closest to the fix (same directory, same code type). Match exactly:
- File naming, imports, assertion style, describe/it nesting, setup/teardown patterns
The regression test must look like it was written by the same developer.
**2. Trace the bug's codepath, then write a regression test:**
Before writing the test, trace the data flow through the code you just fixed:
- What input/state triggered the bug? (the exact precondition)
- What codepath did it follow? (which branches, which function calls)
- Where did it break? (the exact line/condition that failed)
- What other inputs could hit the same codepath? (edge cases around the fix)
The test MUST:
- Set up the precondition that triggered the bug (the exact state that made it break)
- Perform the action that exposed the bug
- Assert the correct behavior (NOT "it renders" or "it doesn't throw")
- If you found adjacent edge cases while tracing, test those too (e.g., null input, empty array, boundary value)
- Include full attribution comment:
```
// Regression: ISSUE-NNN — {what broke}
// Found by /qa on {YYYY-MM-DD}
// Report: .gstack/qa-reports/qa-report-{domain}-{date}.md
```
Test type decision:
- Console error / JS exception / logic bug → unit or integration test
- Broken form / API failure / data flow bug → integration test with request/response
- Visual bug with JS behavior (broken dropdown, animation) → component test
- Pure CSS → skip (caught by QA reruns)
Generate unit tests. Mock all external dependencies (DB, API, Redis, file system).
Use auto-incrementing names to avoid collisions: check existing `{name}.regression-*.test.{ext}` files, take max number + 1.
**3. Run only the new test file:**
```bash
{detected test command} {new-test-file}
```
**4. Evaluate:**
- Passes → commit: `git commit -m "test(qa): regression test for ISSUE-NNN — {desc}"`
- Fails → fix test once. Still failing → delete test, defer.
- Taking >2 min exploration → skip and defer.
**5. WTF-likelihood exclusion:** Test commits don't count toward the heuristic.
### 8f. Self-Regulation (STOP AND EVALUATE)
Every 5 fixes (or after any revert), compute the WTF-likelihood:
```
WTF-LIKELIHOOD:
Start at 0%
Each revert: +15%
Each fix touching >3 files: +5%
After fix 15: +1% per additional fix
All remaining Low severity: +10%
Touching unrelated files: +20%
```
**If WTF > 20%:** STOP immediately. Show the user what you've done so far. Ask whether to continue.
**Hard cap: 50 fixes.** After 50 fixes, stop regardless of remaining issues.
---
## Phase 9: Final QA
After all fixes are applied:
1. Re-run QA on all affected pages
2. Compute final health score
3. **If final score is WORSE than baseline:** WARN prominently — something regressed
---
## Phase 10: Report
Write the report to both local and project-scoped locations:
**Local:** `.gstack/qa-reports/qa-report-{domain}-{YYYY-MM-DD}.md`
**Project-scoped:** Write test outcome artifact for cross-session context:
```bash
{{SLUG_SETUP}}
```
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`
**Per-issue additions** (beyond standard report template):
- Fix Status: verified / best-effort / reverted / deferred
- Commit SHA (if fixed)
- Files Changed (if fixed)
- Before/After screenshots (if fixed)
**Summary section:**
- Total issues found
- Fixes applied (verified: X, best-effort: Y, reverted: Z)
- Deferred issues
- Health score delta: baseline → final
**PR Summary:** Include a one-line summary suitable for PR descriptions:
> "QA found N issues, fixed M, health score X → Y."
---
## Phase 11: TODOS.md Update
If the repo has a `TODOS.md`:
1. **New deferred bugs** → add as TODOs with severity, category, and repro steps
2. **Fixed bugs that were in TODOS.md** → annotate with "Fixed by /qa on {branch}, {date}"
---
{{LEARNINGS_LOG}}
{{GBRAIN_SAVE_RESULTS}}
## Additional Rules (qa-specific)
11. **Clean working tree required.** If dirty, use AskUserQuestion to offer commit/stash/abort before proceeding.
12. **One commit per fix.** Never bundle multiple fixes into one commit.
13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files.
14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately.
15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask.