Phase 3: /qa skill — systematic QA testing with health scores

New /qa skill for systematic web app QA testing. Three modes:
- full: 5-10 documented issues with screenshots and repro steps
- quick: 30-second smoke test with health score
- regression: compare against saved baseline

Includes issue taxonomy (7 categories, 4 severity levels), structured
report template, health score rubric (weighted across 8 categories),
and framework detection guidance (Next.js, Rails, WordPress, SPA).

Also adds browse/bin/find-browse (DRY binary discovery using git
rev-parse), adds .gstack/ to .gitignore, and updates the TODO roadmap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Garry Tan
2026-03-12 13:34:06 -07:00
parent d827276a8d
commit 6ff6d5b6d3
6 changed files with 516 additions and 17 deletions
+1
@@ -1,5 +1,6 @@
node_modules/
browse/dist/
.gstack/
/tmp/
*.log
bun.lock
+45 -17
@@ -7,24 +7,46 @@
- [x] Snapshot command with ref-based element selection
- [x] Snapshot tests
## Phase 2: Enhanced Browser
- [ ] Annotated screenshots (--annotate flag, numbered labels on elements mapped to refs)
- [ ] Snapshot diffing (compare before/after accessibility trees, verify actions worked)
- [ ] Dialog handling (dialog accept/dismiss prevents browser lockup)
- [ ] File upload (upload <sel> <files>)
- [ ] Cursor-interactive elements (-C flag, detect divs with cursor:pointer/onclick/tabindex)
- [ ] Element state checks (is visible/enabled/checked <sel>)
## Phase 2: Enhanced Browser (v0.2.0) ✅
- [x] Annotated screenshots (--annotate flag, ref labels overlaid on screenshot)
- [x] Snapshot diffing (--diff flag, unified diff against previous snapshot)
- [x] Dialog handling (auto-accept/dismiss, dialog buffer, prevents browser lockup)
- [x] File upload (upload <sel> <files>)
- [x] Cursor-interactive elements (-C flag, cursor:pointer/onclick/tabindex scan)
- [x] Element state checks (is visible/hidden/enabled/disabled/checked/editable/focused)
- [x] CircularBuffer — O(1) ring buffer for console/network/dialog (was O(n) array+shift)
- [x] Async buffer flush with Bun.write() (was appendFileSync)
- [x] Health check with page.evaluate('1') + 2s timeout
- [x] Playwright error wrapping — actionable messages for AI agents
- [x] Fix useragent — context recreation preserves cookies/storage/URLs
- [x] DRY: getCleanText exported, command sets in chain updated
- [x] 148 integration tests (was ~63)
## Phase 3: QA Testing Agent (dogfood skill)
- [ ] SKILL.md — 6-phase workflow: Initialize → Authenticate → Orient → Explore → Document → Wrap up
- [ ] Issue taxonomy reference (7 categories: visual, functional, UX, content, performance, console, accessibility)
- [ ] Severity classification (critical/high/medium/low)
- [ ] Exploration checklist per page
- [ ] Report template (structured markdown with per-issue evidence)
- [ ] Repro-first philosophy: every issue gets evidence before moving on
- [ ] Two evidence tiers: interactive bugs (video + step-by-step screenshots), static bugs (single annotated screenshot)
- [ ] Video recording (record start/stop for WebM capture via Playwright)
- [ ] Key guidance: 5-10 well-documented issues per session, depth over breadth, write incrementally
## Phase 3: QA Testing Agent (v0.3.0)
- [x] `/qa` SKILL.md — 6-phase workflow: Initialize → Authenticate → Orient → Explore → Document → Wrap up
- [x] Issue taxonomy reference (7 categories: visual, functional, UX, content, performance, console, accessibility)
- [x] Severity classification (critical/high/medium/low)
- [x] Exploration checklist per page
- [x] Report template (structured markdown with per-issue evidence)
- [x] Repro-first philosophy: every issue gets evidence before moving on
- [x] Two evidence tiers: interactive bugs (multi-step screenshots), static bugs (single annotated screenshot)
- [x] Key guidance: 5-10 well-documented issues per session, depth over breadth, write incrementally
- [x] Three modes: full (systematic), quick (30-second smoke test), regression (compare against baseline)
- [x] Framework detection guidance (Next.js, Rails, WordPress, SPA)
- [x] Health score rubric (8 categories, weighted average)
- [x] `wait --networkidle` / `wait --load` / `wait --domcontentloaded`
- [x] `console --errors` (filter to error/warning only)
- [x] `cookie-import <json-file>` (bulk cookie import with auto-fill domain)
- [x] `browse/bin/find-browse` (DRY binary discovery across skills)
- [ ] Video recording (deferred to Phase 5 — recreateContext destroys page state)
## Phase 3.5: Visual PR Annotations + S3 Upload
- [ ] `/setup-gstack-upload` skill (configure S3 bucket for image hosting)
- [ ] `browse/bin/gstack-upload` helper (upload file to S3, return public URL)
- [ ] `/ship` Step 7.5: visual verification with screenshots in PR body
- [ ] `/review` Step 4.5: visual review with annotated screenshots in PR
- [ ] WebM → GIF conversion (ffmpeg) for video evidence in PRs
- [ ] README documentation for visual PR annotations
## Phase 4: Skill + Browser Integration
- [ ] ship + browse: post-deploy verification
@@ -51,6 +73,7 @@
- [ ] Sessions (isolated browser instances with separate cookies/storage/history)
- [ ] State persistence (save/load cookies + localStorage to JSON files)
- [ ] Auth vault (encrypted credential storage, referenced by name, LLM never sees passwords)
- [ ] Video recording (record start/stop — needs sessions for clean context lifecycle)
- [ ] retro + browse: deployment health tracking
- Screenshot production state
- Check perf metrics (page load times)
@@ -67,6 +90,11 @@
- [ ] Streaming (WebSocket live preview for pair browsing)
- [ ] CDP mode (connect to already-running Chrome/Electron apps)
## Future Ideas
- [ ] Trend tracking across QA runs — compare baseline.json over time, detect regressions (P2, S)
- [ ] CI/CD integration — `/qa` as GitHub Action step, fail PR if health score drops (P2, M)
- [ ] Accessibility audit mode — `--a11y` flag for focused accessibility testing (P3, S)
## Ideas & Notes
- Browser is the nervous system — every skill should be able to see, interact with, and verify the web
- Skills are the product; the browser enables them
+11
@@ -0,0 +1,11 @@
#!/bin/bash
# Find the gstack browse binary. Echoes path and exits 0, or exits 1 if not found.
ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
if [ -n "$ROOT" ] && test -x "$ROOT/.claude/skills/gstack/browse/dist/browse"; then
echo "$ROOT/.claude/skills/gstack/browse/dist/browse"
elif test -x "$HOME/.claude/skills/gstack/browse/dist/browse"; then
echo "$HOME/.claude/skills/gstack/browse/dist/browse"
else
echo "ERROR: browse binary not found. Run: cd <skill-dir> && ./setup" >&2
exit 1
fi
+295
@@ -0,0 +1,295 @@
---
name: qa
version: 1.0.0
description: |
Systematically QA test a web application. Use when asked to "qa", "QA", "test this site",
"find bugs", "dogfood", or review quality. Three modes: full (systematic exploration),
quick (30-second smoke test), regression (compare against baseline). Produces structured
report with health score, screenshots, and repro steps.
allowed-tools:
- Bash
- Read
- Write
---
# /qa: Systematic QA Testing
You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence.
## Setup
**Parse the user's request for these parameters:**
| Parameter | Default | Override example |
|-----------|---------|-----------------|
| Target URL | (required) | `https://myapp.com`, `http://localhost:3000` |
| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
| Scope | Full app | `Focus on the billing page` |
| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
**Find the browse binary:**
```bash
B=$(browse/bin/find-browse 2>/dev/null || ~/.claude/skills/gstack/browse/bin/find-browse 2>/dev/null)
if [ -z "$B" ]; then
echo "ERROR: browse binary not found" >&2
exit 1
fi
```
**Create output directories:**
```bash
REPORT_DIR=".gstack/qa-reports"
mkdir -p "$REPORT_DIR/screenshots"
```
---
## Modes
### Full (default)
Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size.
### Quick (`--quick`)
30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? Produce health score. No detailed issue documentation.
### Regression (`--regression <baseline>`)
Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report.
---
## Workflow
### Phase 1: Initialize
1. Find browse binary (see Setup above)
2. Create output directories
3. Copy report template from `qa/templates/qa-report-template.md` to output dir
4. Start timer for duration tracking
### Phase 2: Authenticate (if needed)
**If the user specified auth credentials:**
```bash
$B goto <login-url>
$B snapshot -i # find the login form
$B fill @e3 "user@example.com"
$B fill @e4 "[REDACTED]" # NEVER include real passwords in report
$B click @e5 # submit
$B snapshot -D # verify login succeeded
```
**If the user provided a cookie file:**
```bash
$B cookie-import cookies.json
$B goto <target-url>
```
**If 2FA/OTP is required:** Ask the user for the code and wait.
**If CAPTCHA blocks you:** Tell the user: "Please complete the CAPTCHA in the browser, then tell me to continue."
### Phase 3: Orient
Get a map of the application:
```bash
$B goto <target-url>
$B snapshot -i -a -o "$REPORT_DIR/screenshots/initial.png"
$B links # map navigation structure
$B console --errors # any errors on landing?
```
**Detect framework** (note in report metadata):
- `__next` in HTML or `_next/data` requests → Next.js
- `csrf-token` meta tag → Rails
- `wp-content` in URLs → WordPress
- Client-side routing with no page reloads → SPA
**For SPAs:** The `links` command may return few results because navigation is client-side. Use `snapshot -i` to find nav elements (buttons, menu items) instead.
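The markup-based signals above can be checked mechanically. The following is a minimal sketch, not part of browse: `detect_framework` is a hypothetical helper that reads HTML on stdin and checks only the three string markers listed. SPA detection depends on observing navigation at runtime, so anything else is reported as `unknown`.

```bash
# Hypothetical helper: classify a page's framework from its HTML on stdin.
# Only the string markers from the list above are checked, in that order.
detect_framework() {
  local html
  html=$(cat)
  if   grep -q '__next'     <<<"$html"; then echo "Next.js"
  elif grep -q 'csrf-token' <<<"$html"; then echo "Rails"
  elif grep -q 'wp-content' <<<"$html"; then echo "WordPress"
  else echo "unknown"   # SPA or anything else: needs manual observation
  fi
}
```

One possible use, assuming the raw HTML is fetched with `curl`: `curl -s http://localhost:3000 | detect_framework`.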
### Phase 4: Explore
Visit pages systematically. At each page:
```bash
$B goto <page-url>
$B snapshot -i -a -o "$REPORT_DIR/screenshots/page-name.png"
$B console --errors
```
Then follow the **per-page exploration checklist** (see `qa/references/issue-taxonomy.md`):
1. **Visual scan** — Look at the annotated screenshot for layout issues
2. **Interactive elements** — Click buttons, links, controls. Do they work?
3. **Forms** — Fill and submit. Test empty, invalid, edge cases
4. **Navigation** — Check all paths in and out
5. **States** — Empty state, loading, error, overflow
6. **Console** — Any new JS errors after interactions?
7. **Responsiveness** — Check mobile viewport if relevant:
```bash
$B viewport 375x812
$B screenshot "$REPORT_DIR/screenshots/page-mobile.png"
$B viewport 1280x720
```
**Depth judgment:** Spend more time on core features (homepage, dashboard, checkout, search) and less on secondary pages (about, terms, privacy).
**Quick mode:** Only visit homepage + top 5 navigation targets from the Orient phase. Skip the per-page checklist — just check: loads? Console errors? Broken links visible?
### Phase 5: Document
Document each issue **immediately when found** — don't batch them.
**Two evidence tiers:**
**Interactive bugs** (broken flows, dead buttons, form failures):
1. Take a screenshot before the action
2. Perform the action
3. Take a screenshot showing the result
4. Use `snapshot -D` to show what changed
5. Write repro steps referencing screenshots
```bash
$B screenshot "$REPORT_DIR/screenshots/issue-001-step-1.png"
$B click @e5
$B screenshot "$REPORT_DIR/screenshots/issue-001-result.png"
$B snapshot -D
```
**Static bugs** (typos, layout issues, missing images):
1. Take a single annotated screenshot showing the problem
2. Describe what's wrong
```bash
$B snapshot -i -a -o "$REPORT_DIR/screenshots/issue-002.png"
```
**Write each issue to the report immediately** using the template format from `qa/templates/qa-report-template.md`.
### Phase 6: Wrap Up
1. **Compute health score** using the rubric below
2. **Write "Top 3 Things to Fix"** — the 3 highest-severity issues
3. **Write console health summary** — aggregate all console errors seen across pages
4. **Update severity counts** in the summary table
5. **Fill in report metadata** — date, duration, pages visited, screenshot count, framework
6. **Save baseline** — write `baseline.json` with:
```json
{
"date": "YYYY-MM-DD",
"url": "<target>",
"healthScore": N,
"issues": [{ "id": "ISSUE-001", "title": "...", "severity": "...", "category": "..." }],
"categoryScores": { "console": N, "links": N, ... }
}
```
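The severity counts from step 4 could be tallied from the report itself rather than by hand. A minimal sketch, assuming every documented issue keeps the template's `| **Severity** | <level> |` row with a single level filled in; `count_severity` is a hypothetical helper, not a browse command.

```bash
# Hypothetical helper: count issues of one severity in a report file.
# usage: count_severity <report.md> <critical|high|medium|low>
count_severity() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  grep -ci "^| \*\*Severity\*\* | $2 |" "$1" || true
}
```

The unfilled template row (`critical / high / medium / low`) is not counted, because the level must be followed directly by the closing ` |`.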
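**Regression mode:** After writing the report, load the baseline file passed via `--regression`. If it shares a path with the `baseline.json` written in step 6, read it before it is overwritten. Compare: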
- Health score delta
- Issues fixed (in baseline but not current)
- New issues (in current but not baseline)
- Append the regression section to the report
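The comparison above amounts to a set difference in each direction plus a subtraction. A minimal sketch under stated assumptions: issue titles have already been extracted from the two JSON files into plain text files, one title per line, and issues are matched by exact title (the rubric does not say how issues are matched across runs). `diff_issues` and `score_delta` are hypothetical helpers.

```bash
# usage: diff_issues <baseline-titles.txt> <current-titles.txt>
# Prints "FIXED: ..." for titles only in the baseline,
# then "NEW: ..." for titles only in the current run.
diff_issues() {
  local base cur
  base=$(mktemp)
  cur=$(mktemp)
  sort -u "$1" > "$base"   # comm requires sorted input
  sort -u "$2" > "$cur"
  comm -23 "$base" "$cur" | sed 's/^/FIXED: /'
  comm -13 "$base" "$cur" | sed 's/^/NEW: /'
  rm -f "$base" "$cur"
}

# usage: score_delta <baseline-score> <current-score>
# Prints the signed difference, e.g. +5 or -12.
score_delta() {
  printf '%+d\n' $(( $2 - $1 ))
}
```

Matching by title is fragile if titles are reworded between runs; a stable key such as category plus URL might work better, but that is a design choice the skill leaves open.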
---
## Health Score Rubric
Compute each category score (0-100), then take the weighted average.
### Console (weight: 15%)
- 0 errors → 100
- 1-3 errors → 70
- 4-10 errors → 40
- 11+ errors → 10
### Links (weight: 10%)
- 0 broken → 100
- Each broken link → -15 (minimum 0)
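The Console and Links bands above translate directly into two small functions. This is an illustrative sketch with hypothetical names (`console_score`, `links_score`); it assumes that exactly 10 console errors falls in the 4-10 band.

```bash
# Console: map an error count to a score using the bands above.
# Assumption: exactly 10 errors belongs to the 4-10 band.
console_score() {
  local n=$1
  if   [ "$n" -eq 0 ];  then echo 100
  elif [ "$n" -le 3 ];  then echo 70
  elif [ "$n" -le 10 ]; then echo 40
  else                       echo 10
  fi
}

# Links: start at 100, subtract 15 per broken link, floor at 0.
links_score() {
  local s=$((100 - 15 * $1))
  if [ "$s" -lt 0 ]; then s=0; fi
  echo "$s"
}
```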
### Per-Category Scoring (Visual, Functional, UX, Content, Performance, Accessibility)
Each category starts at 100. Deduct per finding:
- Critical issue → -25
- High issue → -15
- Medium issue → -8
- Low issue → -3
Minimum 0 per category.
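As a worked example of the deductions, one high, two medium, and one low finding in a category give 100 - 15 - 16 - 3 = 66. A minimal sketch of the same arithmetic follows; `category_score` is a hypothetical helper name.

```bash
# usage: category_score <critical> <high> <medium> <low>
# Each argument is the number of findings at that severity in one category.
category_score() {
  local s=$((100 - 25 * $1 - 15 * $2 - 8 * $3 - 3 * $4))
  if [ "$s" -lt 0 ]; then s=0; fi   # minimum 0 per category
  echo "$s"
}
```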
### Weights
| Category | Weight |
|----------|--------|
| Console | 15% |
| Links | 10% |
| Visual | 10% |
| Functional | 20% |
| UX | 15% |
| Performance | 10% |
| Content | 5% |
| Accessibility | 15% |
### Final Score
`score = Σ (category_score × weight)`
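The weighted sum can be computed with integer arithmetic, since the weights are whole percentages that add up to 100. A minimal sketch: `health_score` is a hypothetical helper, the argument order follows the Weights table, and rounding to the nearest integer is an assumption (the rubric does not specify rounding).

```bash
# usage: health_score <console> <links> <visual> <functional> <ux> \
#                     <performance> <content> <accessibility>
health_score() {
  local weights=(15 10 10 20 15 10 5 15)   # percentages, same order as the args
  local scores=("$@")
  local total=0 i
  for i in "${!weights[@]}"; do
    total=$((total + scores[i] * weights[i]))
  done
  echo $(( (total + 50) / 100 ))   # round to nearest integer (assumption)
}
```

For example, `health_score 70 100 92 85 100 100 97 100` sums to 9155 weighted points and prints `92`.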
---
## Framework-Specific Guidance
### Next.js
- Check console for hydration errors (`Hydration failed`, `Text content did not match`)
- Monitor `_next/data` requests in network — 404s indicate broken data fetching
- Test client-side navigation (click links, don't just `goto`) — catches routing issues
- Check for CLS (Cumulative Layout Shift) on pages with dynamic content
### Rails
- Check for N+1 query warnings in console (if development mode)
- Verify CSRF token presence in forms
- Test Turbo/Stimulus integration — do page transitions work smoothly?
- Check for flash messages appearing and dismissing correctly
### WordPress
- Check for plugin conflicts (JS errors from different plugins)
- Verify admin bar visibility for logged-in users
- Test REST API endpoints (`/wp-json/`)
- Check for mixed content warnings (common with WP)
### General SPA (React, Vue, Angular)
- Use `snapshot -i` for navigation — `links` command misses client-side routes
- Check for stale state (navigate away and back — does data refresh?)
- Test browser back/forward — does the app handle history correctly?
- Check for memory leaks (monitor console after extended use)
---
## Important Rules
1. **Repro is everything.** Every issue needs at least one screenshot. No exceptions.
2. **Verify before documenting.** Retry the issue once to confirm it's reproducible, not a fluke.
3. **Never include credentials.** Write `[REDACTED]` for passwords in repro steps.
4. **Write incrementally.** Append each issue to the report as you find it. Don't batch.
5. **Never read source code.** Test as a user, not a developer.
6. **Check console after every interaction.** JS errors that don't surface visually are still bugs.
7. **Test like a user.** Use realistic data. Walk through complete workflows end-to-end.
8. **Depth over breadth.** 5-10 well-documented issues with evidence beat 20 vague descriptions.
9. **Never delete output files.** Screenshots and reports accumulate — that's intentional.
10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses.
---
## Output Structure
```
.gstack/qa-reports/
├── qa-report-{domain}-{YYYY-MM-DD}.md # Structured report
├── screenshots/
│ ├── initial.png # Landing page annotated screenshot
│ ├── issue-001-step-1.png # Per-issue evidence
│ ├── issue-001-result.png
│ └── ...
└── baseline.json # For regression mode
```
Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
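One way to derive that filename from the target URL, as a sketch only: `report_name` is a hypothetical helper, and dropping the port (so `http://localhost:3000` becomes `localhost`) is an assumption, since the naming rule above covers only the domain and date.

```bash
# usage: report_name <target-url> <YYYY-MM-DD>
report_name() {
  local url=$1 date=$2 host
  host="${url#*://}"    # drop the scheme, if any
  host="${host%%/*}"    # drop the path
  host="${host%%:*}"    # drop the port (assumption)
  printf 'qa-report-%s-%s.md\n' "${host//./-}" "$date"
}
```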
+85
@@ -0,0 +1,85 @@
# QA Issue Taxonomy
## Severity Levels
| Severity | Definition | Examples |
|----------|------------|----------|
| **critical** | Blocks a core workflow, causes data loss, or crashes the app | Form submit causes error page, checkout flow broken, data deleted without confirmation |
| **high** | Major feature broken or unusable, no workaround | Search returns wrong results, file upload silently fails, auth redirect loop |
| **medium** | Feature works but with noticeable problems, workaround exists | Slow page load (>5s), form validation missing but submit still works, layout broken on mobile only |
| **low** | Minor cosmetic or polish issue | Typo in footer, 1px alignment issue, hover state inconsistent |
## Categories
### 1. Visual/UI
- Layout breaks (overlapping elements, clipped text, horizontal scrollbar)
- Broken or missing images
- Incorrect z-index (elements appearing behind others)
- Font/color inconsistencies
- Animation glitches (jank, incomplete transitions)
- Alignment issues (off-grid, uneven spacing)
- Dark mode / theme issues
### 2. Functional
- Broken links (404, wrong destination)
- Dead buttons (click does nothing)
- Form validation (missing, wrong, bypassed)
- Incorrect redirects
- State not persisting (data lost on refresh, back button)
- Race conditions (double-submit, stale data)
- Search returning wrong or no results
### 3. UX
- Confusing navigation (no breadcrumbs, dead ends)
- Missing loading indicators (user doesn't know something is happening)
- Slow interactions (>500ms with no feedback)
- Unclear error messages ("Something went wrong" with no detail)
- No confirmation before destructive actions
- Inconsistent interaction patterns across pages
- Dead ends (no way back, no next action)
### 4. Content
- Typos and grammar errors
- Outdated or incorrect text
- Placeholder / lorem ipsum text left in
- Truncated text (cut off without ellipsis or "more")
- Wrong labels on buttons or form fields
- Missing or unhelpful empty states
### 5. Performance
- Slow page loads (>3 seconds)
- Janky scrolling (dropped frames)
- Layout shifts (content jumping after load)
- Excessive network requests (>50 on a single page)
- Large unoptimized images
- Blocking JavaScript (page unresponsive during load)
### 6. Console/Errors
- JavaScript exceptions (uncaught errors)
- Failed network requests (4xx, 5xx)
- Deprecation warnings (upcoming breakage)
- CORS errors
- Mixed content warnings (HTTP resources on HTTPS)
- CSP violations
### 7. Accessibility
- Missing alt text on images
- Unlabeled form inputs
- Keyboard navigation broken (can't tab to elements)
- Focus traps (can't escape a modal or dropdown)
- Missing or incorrect ARIA attributes
- Insufficient color contrast
- Content not reachable by screen reader
## Per-Page Exploration Checklist
For each page visited during a QA session:
1. **Visual scan** — Take annotated screenshot (`snapshot -i -a -o`). Look for layout issues, broken images, alignment.
2. **Interactive elements** — Click every button, link, and control. Does each do what it says?
3. **Forms** — Fill and submit. Test empty submission, invalid data, edge cases (long text, special characters).
4. **Navigation** — Check all paths in/out. Breadcrumbs, back button, deep links, mobile menu.
5. **States** — Check empty state, loading state, error state, full/overflow state.
6. **Console** — Run `console --errors` after interactions. Any new JS errors or failed requests?
7. **Responsiveness** — If relevant, check mobile and tablet viewports.
8. **Auth boundaries** — What happens when logged out? Different user roles?
+79
@@ -0,0 +1,79 @@
# QA Report: {APP_NAME}
| Field | Value |
|-------|-------|
| **Date** | {DATE} |
| **URL** | {URL} |
| **Scope** | {SCOPE or "Full app"} |
| **Mode** | {full / quick / regression} |
| **Duration** | {DURATION} |
| **Pages visited** | {COUNT} |
| **Screenshots** | {COUNT} |
| **Framework** | {DETECTED or "Unknown"} |
## Health Score: {SCORE}/100
| Category | Score |
|----------|-------|
| Console | {0-100} |
| Links | {0-100} |
| Visual | {0-100} |
| Functional | {0-100} |
| UX | {0-100} |
| Performance | {0-100} |
| Content | {0-100} |
| Accessibility | {0-100} |
## Top 3 Things to Fix
1. **{ISSUE-NNN}: {title}** — {one-line description}
2. **{ISSUE-NNN}: {title}** — {one-line description}
3. **{ISSUE-NNN}: {title}** — {one-line description}
## Console Health
| Error | Count | First seen |
|-------|-------|------------|
| {error message} | {N} | {URL} |
## Summary
| Severity | Count |
|----------|-------|
| Critical | 0 |
| High | 0 |
| Medium | 0 |
| Low | 0 |
| **Total** | **0** |
## Issues
### ISSUE-001: {Short title}
| Field | Value |
|-------|-------|
| **Severity** | critical / high / medium / low |
| **Category** | visual / functional / ux / content / performance / console / accessibility |
| **URL** | {page URL} |
**Description:** {What is wrong, expected vs actual.}
**Repro Steps:**
1. Navigate to {URL}
![Step 1](screenshots/issue-001-step-1.png)
2. {Action}
![Step 2](screenshots/issue-001-step-2.png)
3. **Observe:** {what goes wrong}
![Result](screenshots/issue-001-result.png)
---
## Regression (if applicable)
| Metric | Baseline | Current | Delta |
|--------|----------|---------|-------|
| Health score | {N} | {N} | {+/-N} |
| Issues | {N} | {N} | {+/-N} |
**Fixed since baseline:** {list}
**New since baseline:** {list}