mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
b805aa0113
* feat: add Confusion Protocol to preamble resolver Injects a high-stakes ambiguity gate at preamble tier >= 2 so all workflow skills get it. Fires when Claude encounters architectural decisions, data model changes, destructive operations, or contradictory requirements. Does NOT fire on routine coding. Addresses Karpathy failure mode #1 (wrong assumptions) with an inline STOP gate instead of relying on workflow skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Hermes and GBrain host configs Hermes: tool rewrites for terminal/read_file/patch/delegate_task, paths to ~/.hermes/skills/gstack, AGENTS.md config file. GBrain: coding skills become brain-aware when GBrain mod is installed. Same tool rewrites as OpenClaw (agents spawn Claude Code via ACP). GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS NOT suppressed on gbrain host, enabling brain-first lookup and save-to-brain behavior. Both registered in hosts/index.ts with setup script redirect messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver — brain-first lookup and save-to-brain New scripts/resolvers/gbrain.ts with two resolver functions: - GBRAIN_CONTEXT_LOAD: search brain for context before skill starts - GBRAIN_SAVE_RESULTS: save skill output to brain after completion Placeholders added to 4 thinking skill templates (office-hours, investigate, plan-ceo-review, retro). Resolves to empty string on all hosts except gbrain via suppressedResolvers. GBRAIN suppression added to all 9 non-gbrain host configs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: wire slop:diff into /review as advisory diagnostic Adds Step 3.5 to the review template: runs bun run slop:diff against the base branch to catch AI code quality issues (empty catches, redundant return await, overcomplicated abstractions). Advisory only, never blocking. Skips silently if slop-scan is not installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Karpathy compatibility note to README Positions gstack as the workflow enforcement layer for Karpathy-style CLAUDE.md rules (17K stars). Links to forrestchang/andrej-karpathy-skills. Maps each Karpathy failure mode to the gstack skill that addresses it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: improve native OpenClaw thinking skills office-hours: add design doc path visibility message after writing ceo-review: add HARD GATE reminder at review section transitions retro: add non-git context support (check memory for meeting notes) Mirrors template improvements to hand-crafted native skills. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: update tests and golden fixtures for new hosts - Host count: 8 → 10 (hermes, gbrain) - OpenClaw adapter test: expects undefined (dead code removed) - Golden ship fixtures: updated with Confusion Protocol + vendoring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files Regenerated from templates after Confusion Protocol, GBrain resolver placeholders, slop:diff in review, HARD GATE reminders, investigation learnings, design doc visibility, and retro non-git context changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.18.0.0 - CHANGELOG: add v0.18.0.0 entry (Confusion Protocol, Hermes, GBrain, slop in review, Karpathy note, skill improvements) - CLAUDE.md: add hermes.ts and gbrain.ts to hosts listing - README.md: update agent count 8→10, add Hermes + GBrain to table - VERSION: bump to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: sync package.json version to 0.18.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: extract Step 0 from review SKILL.md in E2E test The review-base-branch E2E test was copying the full 1493-line review/SKILL.md into the test fixture. The agent spent 8+ turns reading it in chunks, leaving only 7 turns for actual work, causing error_max_turns on every attempt. Now extracts only Step 0 (base branch detection, ~50 lines) which is all the test actually needs. Follows the CLAUDE.md rule: "NEVER copy a full SKILL.md file into an E2E test fixture." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: update GBrain and Hermes host configs for v0.10.0 integration GBrain: add 'triggers' to keepFields so generated skills pass checkResolvable() validation. Add version compat comment. Hermes: un-suppress GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS. The resolvers handle GBrain-not-installed gracefully, so Hermes agents with GBrain as a mod get brain features automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: GBrain resolver DX improvements and preamble health check Resolver changes: - gbrain query → gbrain search (fast keyword search, not expensive hybrid) - Add keyword extraction guidance for agents - Show explicit gbrain put_page syntax with --title, --tags, heredoc - Add entity enrichment with false-positive filter - Name throttle error patterns (exit code 1, stderr keywords) - Add data-research routing for investigate skill - Expand skillSaveMap from 4 to 8 entries - Add brain operation telemetry summary Preamble changes: - Add gbrain doctor --fast --json health check for gbrain/hermes hosts - Parse check failures/warnings count - Show failing check details when score < 50 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: preserve keepFields in allowlist frontmatter mode The allowlist mode hard-coded name + description reconstruction but never iterated keepFields for additional fields. Adding 'triggers' to keepFields was a no-op because the field was silently stripped. Now iterates keepFields and preserves any field beyond name/description from the source template frontmatter, including YAML arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add triggers to all 38 skill templates Multi-word, skill-specific trigger keywords for GBrain's RESOLVER.md router. Each skill gets 3-6 triggers derived from its "Use when asked to..." description text. Avoids single generic words that would collide across skills (e.g., "debug this" not "debug"). These are distinct from voice-triggers (speech-to-text aliases) and serve GBrain's checkResolvable() validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate all SKILL.md files and update golden fixtures Regenerated from updated templates (triggers, brain placeholders, resolver DX improvements, preamble health check). Golden fixtures updated to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: settings-hook remove exits 1 when nothing to remove gstack-settings-hook remove was exiting 0 when settings.json didn't exist, causing gstack-uninstall to report "SessionStart hook" as removed on clean systems where nothing was installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for GBrain v0.10.0 integration ARCHITECTURE.md: added GBRAIN_CONTEXT_LOAD and GBRAIN_SAVE_RESULTS to resolver table. CHANGELOG.md: expanded v0.18.0.0 entry with GBrain v0.10.0 integration details (triggers, expanded brain-awareness, DX improvements, Hermes brain support), updated date. CLAUDE.md: added gbrain to resolvers/ directory comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: routing E2E stops writing to user's ~/.claude/skills/ installSkills() was copying SKILL.md files to both project-level (.claude/skills/ in tmpDir) and user-level (~/.claude/skills/). Writing to the user's real install fails when symlinks point to different worktrees or dangling targets (ENOENT on copyFileSync). Now installs to project-level only. The test already sets cwd to the tmpDir, so project-level discovery works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: scale Gemini E2E back to smoke test Gemini CLI gets lost in worktrees on complex tasks (review times out at 600s, discover-skill hits exit 124). Nobody uses Gemini for gstack skill execution. Replace the two failing tests (gemini-discover-skill and gemini-review-findings) with a single smoke test that verifies Gemini can start and read the README. 90s timeout, no skill invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
341 lines
11 KiB
Cheetah
341 lines
11 KiB
Cheetah
---
|
|
name: qa
|
|
preamble-tier: 4
|
|
version: 2.0.0
|
|
description: |
|
|
Systematically QA test a web application and fix bugs found. Runs QA testing,
|
|
then iteratively fixes bugs in source code, committing each fix atomically and
|
|
re-verifying. Use when asked to "qa", "QA", "test this site", "find bugs",
|
|
"test and fix", or "fix what's broken".
|
|
Proactively suggest when the user says a feature is ready for testing
|
|
or asks "does this work?". Three tiers: Quick (critical/high only),
|
|
Standard (+ medium), Exhaustive (+ cosmetic). Produces before/after health scores,
|
|
fix evidence, and a ship-readiness summary. For report-only mode, use /qa-only. (gstack)
|
|
voice-triggers:
|
|
- "quality check"
|
|
- "test the app"
|
|
- "run QA"
|
|
allowed-tools:
|
|
- Bash
|
|
- Read
|
|
- Write
|
|
- Edit
|
|
- Glob
|
|
- Grep
|
|
- AskUserQuestion
|
|
- WebSearch
|
|
triggers:
|
|
- qa test this
|
|
- find bugs on site
|
|
- test the site
|
|
---
|
|
|
|
{{PREAMBLE}}
|
|
|
|
{{BASE_BRANCH_DETECT}}
|
|
|
|
{{GBRAIN_CONTEXT_LOAD}}
|
|
|
|
# /qa: Test → Fix → Verify
|
|
|
|
You are a QA engineer AND a bug-fix engineer. Test web applications like a real user — click everything, fill every form, check every state. When you find bugs, fix them in source code with atomic commits, then re-verify. Produce a structured report with before/after evidence.
|
|
|
|
## Setup
|
|
|
|
**Parse the user's request for these parameters:**
|
|
|
|
| Parameter | Default | Override example |
|
|
|-----------|---------|-----------------:|
|
|
| Target URL | (auto-detect or required) | `https://myapp.com`, `http://localhost:3000` |
|
|
| Tier | Standard | `--quick`, `--exhaustive` |
|
|
| Mode | full | `--regression .gstack/qa-reports/baseline.json` |
|
|
| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
|
|
| Scope | Full app (or diff-scoped) | `Focus on the billing page` |
|
|
| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
|
|
|
|
**Tiers determine which issues get fixed:**
|
|
- **Quick:** Fix critical + high severity only
|
|
- **Standard:** + medium severity (default)
|
|
- **Exhaustive:** + low/cosmetic severity
|
|
|
|
**If no URL is given and you're on a feature branch:** Automatically enter **diff-aware mode** (see Modes below). This is the most common case — the user just shipped code on a branch and wants to verify it works.
|
|
|
|
**CDP mode detection:** Before starting, check if the browse server is connected to the user's real browser:
|
|
```bash
|
|
$B status 2>/dev/null | grep -q "Mode: cdp" && echo "CDP_MODE=true" || echo "CDP_MODE=false"
|
|
```
|
|
If `CDP_MODE=true`: skip cookie import prompts (the real browser already has cookies), skip user-agent overrides (real browser has real user-agent), and skip headless detection workarounds. The user's real auth sessions are already available.
|
|
|
|
**Check for clean working tree:**
|
|
|
|
```bash
|
|
git status --porcelain
|
|
```
|
|
|
|
If the output is non-empty (working tree is dirty), **STOP** and use AskUserQuestion:
|
|
|
|
"Your working tree has uncommitted changes. /qa needs a clean tree so each bug fix gets its own atomic commit."
|
|
|
|
- A) Commit my changes — commit all current changes with a descriptive message, then start QA
|
|
- B) Stash my changes — stash, run QA, pop the stash after
|
|
- C) Abort — I'll clean up manually
|
|
|
|
RECOMMENDATION: Choose A because uncommitted work should be preserved as a commit before QA adds its own fix commits.
|
|
|
|
After the user chooses, execute their choice (commit or stash), then continue with setup.
|
|
|
|
**Find the browse binary:**
|
|
|
|
{{BROWSE_SETUP}}
|
|
|
|
**Check test framework (bootstrap if needed):**
|
|
|
|
{{TEST_BOOTSTRAP}}
|
|
|
|
**Create output directories:**
|
|
|
|
```bash
|
|
mkdir -p .gstack/qa-reports/screenshots
|
|
```
|
|
|
|
---
|
|
|
|
{{LEARNINGS_SEARCH}}
|
|
|
|
## Test Plan Context
|
|
|
|
Before falling back to git diff heuristics, check for richer test plan sources:
|
|
|
|
1. **Project-scoped test plans:** Check `~/.gstack/projects/` for recent `*-test-plan-*.md` files for this repo
|
|
```bash
|
|
setopt +o nomatch 2>/dev/null || true # zsh compat
|
|
{{SLUG_EVAL}}
|
|
ls -t ~/.gstack/projects/$SLUG/*-test-plan-*.md 2>/dev/null | head -1
|
|
```
|
|
2. **Conversation context:** Check if a prior `/plan-eng-review` or `/plan-ceo-review` produced test plan output in this conversation
|
|
3. **Use whichever source is richer.** Fall back to git diff analysis only if neither is available.
|
|
|
|
---
|
|
|
|
## Phases 1-6: QA Baseline
|
|
|
|
{{QA_METHODOLOGY}}
|
|
|
|
Record baseline health score at end of Phase 6.
|
|
|
|
---
|
|
|
|
## Output Structure
|
|
|
|
```
|
|
.gstack/qa-reports/
|
|
├── qa-report-{domain}-{YYYY-MM-DD}.md # Structured report
|
|
├── screenshots/
|
|
│ ├── initial.png # Landing page annotated screenshot
|
|
│ ├── issue-001-step-1.png # Per-issue evidence
|
|
│ ├── issue-001-result.png
|
|
│ ├── issue-001-before.png # Before fix (if fixed)
|
|
│ ├── issue-001-after.png # After fix (if fixed)
|
|
│ └── ...
|
|
└── baseline.json # For regression mode
|
|
```
|
|
|
|
Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
|
|
|
|
---
|
|
|
|
## Phase 7: Triage
|
|
|
|
Sort all discovered issues by severity, then decide which to fix based on the selected tier:
|
|
|
|
- **Quick:** Fix critical + high only. Mark medium/low as "deferred."
|
|
- **Standard:** Fix critical + high + medium. Mark low as "deferred."
|
|
- **Exhaustive:** Fix all, including cosmetic/low severity.
|
|
|
|
Mark issues that cannot be fixed from source code (e.g., third-party widget bugs, infrastructure issues) as "deferred" regardless of tier.
|
|
|
|
---
|
|
|
|
## Phase 8: Fix Loop
|
|
|
|
For each fixable issue, in severity order:
|
|
|
|
### 8a. Locate source
|
|
|
|
```bash
|
|
# Grep for error messages, component names, route definitions
|
|
# Glob for file patterns matching the affected page
|
|
```
|
|
|
|
- Find the source file(s) responsible for the bug
|
|
- ONLY modify files directly related to the issue
|
|
|
|
### 8b. Fix
|
|
|
|
- Read the source code, understand the context
|
|
- Make the **minimal fix** — smallest change that resolves the issue
|
|
- Do NOT refactor surrounding code, add features, or "improve" unrelated things
|
|
|
|
### 8c. Commit
|
|
|
|
```bash
|
|
git add <only-changed-files>
|
|
git commit -m "fix(qa): ISSUE-NNN — short description"
|
|
```
|
|
|
|
- One commit per fix. Never bundle multiple fixes.
|
|
- Message format: `fix(qa): ISSUE-NNN — short description`
|
|
|
|
### 8d. Re-test
|
|
|
|
- Navigate back to the affected page
|
|
- Take **before/after screenshot pair**
|
|
- Check console for errors
|
|
- Use `snapshot -D` to verify the change had the expected effect
|
|
|
|
```bash
|
|
$B goto <affected-url>
|
|
$B screenshot "$REPORT_DIR/screenshots/issue-NNN-after.png"
|
|
$B console --errors
|
|
$B snapshot -D
|
|
```
|
|
|
|
### 8e. Classify
|
|
|
|
- **verified**: re-test confirms the fix works, no new errors introduced
|
|
- **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service)
|
|
- **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred"
|
|
|
|
### 8e.5. Regression Test
|
|
|
|
Skip if: classification is not "verified", OR the fix is purely visual/CSS with no JS behavior, OR no test framework was detected AND user declined bootstrap.
|
|
|
|
**1. Study the project's existing test patterns:**
|
|
|
|
Read 2-3 test files closest to the fix (same directory, same code type). Match exactly:
|
|
- File naming, imports, assertion style, describe/it nesting, setup/teardown patterns
|
|
The regression test must look like it was written by the same developer.
|
|
|
|
**2. Trace the bug's codepath, then write a regression test:**
|
|
|
|
Before writing the test, trace the data flow through the code you just fixed:
|
|
- What input/state triggered the bug? (the exact precondition)
|
|
- What codepath did it follow? (which branches, which function calls)
|
|
- Where did it break? (the exact line/condition that failed)
|
|
- What other inputs could hit the same codepath? (edge cases around the fix)
|
|
|
|
The test MUST:
|
|
- Set up the precondition that triggered the bug (the exact state that made it break)
|
|
- Perform the action that exposed the bug
|
|
- Assert the correct behavior (NOT "it renders" or "it doesn't throw")
|
|
- If you found adjacent edge cases while tracing, test those too (e.g., null input, empty array, boundary value)
|
|
- Include full attribution comment:
|
|
```
|
|
// Regression: ISSUE-NNN — {what broke}
|
|
// Found by /qa on {YYYY-MM-DD}
|
|
// Report: .gstack/qa-reports/qa-report-{domain}-{date}.md
|
|
```
|
|
|
|
Test type decision:
|
|
- Console error / JS exception / logic bug → unit or integration test
|
|
- Broken form / API failure / data flow bug → integration test with request/response
|
|
- Visual bug with JS behavior (broken dropdown, animation) → component test
|
|
- Pure CSS → skip (caught by QA reruns)
|
|
|
|
Generate unit tests. Mock all external dependencies (DB, API, Redis, file system).
|
|
|
|
Use auto-incrementing names to avoid collisions: check existing `{name}.regression-*.test.{ext}` files, take max number + 1.
|
|
|
|
**3. Run only the new test file:**
|
|
|
|
```bash
|
|
{detected test command} {new-test-file}
|
|
```
|
|
|
|
**4. Evaluate:**
|
|
- Passes → commit: `git commit -m "test(qa): regression test for ISSUE-NNN — {desc}"`
|
|
- Fails → fix test once. Still failing → delete test, defer.
|
|
- Taking >2 min exploration → skip and defer.
|
|
|
|
**5. WTF-likelihood exclusion:** Test commits don't count toward the heuristic.
|
|
|
|
### 8f. Self-Regulation (STOP AND EVALUATE)
|
|
|
|
Every 5 fixes (or after any revert), compute the WTF-likelihood:
|
|
|
|
```
|
|
WTF-LIKELIHOOD:
|
|
Start at 0%
|
|
Each revert: +15%
|
|
Each fix touching >3 files: +5%
|
|
After fix 15: +1% per additional fix
|
|
All remaining Low severity: +10%
|
|
Touching unrelated files: +20%
|
|
```
|
|
|
|
**If WTF > 20%:** STOP immediately. Show the user what you've done so far. Ask whether to continue.
|
|
|
|
**Hard cap: 50 fixes.** After 50 fixes, stop regardless of remaining issues.
|
|
|
|
---
|
|
|
|
## Phase 9: Final QA
|
|
|
|
After all fixes are applied:
|
|
|
|
1. Re-run QA on all affected pages
|
|
2. Compute final health score
|
|
3. **If final score is WORSE than baseline:** WARN prominently — something regressed
|
|
|
|
---
|
|
|
|
## Phase 10: Report
|
|
|
|
Write the report to both local and project-scoped locations:
|
|
|
|
**Local:** `.gstack/qa-reports/qa-report-{domain}-{YYYY-MM-DD}.md`
|
|
|
|
**Project-scoped:** Write test outcome artifact for cross-session context:
|
|
```bash
|
|
{{SLUG_SETUP}}
|
|
```
|
|
Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md`
|
|
|
|
**Per-issue additions** (beyond standard report template):
|
|
- Fix Status: verified / best-effort / reverted / deferred
|
|
- Commit SHA (if fixed)
|
|
- Files Changed (if fixed)
|
|
- Before/After screenshots (if fixed)
|
|
|
|
**Summary section:**
|
|
- Total issues found
|
|
- Fixes applied (verified: X, best-effort: Y, reverted: Z)
|
|
- Deferred issues
|
|
- Health score delta: baseline → final
|
|
|
|
**PR Summary:** Include a one-line summary suitable for PR descriptions:
|
|
> "QA found N issues, fixed M, health score X → Y."
|
|
|
|
---
|
|
|
|
## Phase 11: TODOS.md Update
|
|
|
|
If the repo has a `TODOS.md`:
|
|
|
|
1. **New deferred bugs** → add as TODOs with severity, category, and repro steps
|
|
2. **Fixed bugs that were in TODOS.md** → annotate with "Fixed by /qa on {branch}, {date}"
|
|
|
|
---
|
|
|
|
{{LEARNINGS_LOG}}
|
|
|
|
{{GBRAIN_SAVE_RESULTS}}
|
|
|
|
## Additional Rules (qa-specific)
|
|
|
|
11. **Clean working tree required.** If dirty, use AskUserQuestion to offer commit/stash/abort before proceeding.
|
|
12. **One commit per fix.** Never bundle multiple fixes into one commit.
|
|
13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files.
|
|
14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately.
|
|
15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask.
|