Phase 3: /qa skill — systematic QA testing with health scores

New /qa skill for systematic web app QA testing. Three modes:
- full: 5-10 documented issues with screenshots and repro steps
- quick: 30-second smoke test with health score
- regression: compare against saved baseline

Includes issue taxonomy (7 categories, 4 severity levels), structured report template, health score rubric (weighted across 7 categories), framework detection guidance (Next.js, Rails, WordPress, SPA).

Also adds browse/bin/find-browse (DRY binary discovery using git rev-parse), .gstack/ to .gitignore, and updated TODO roadmap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-06 13:45:35 +02:00 · 2026-03-12 13:34:06 -07:00
parent d827276a8d
commit 6ff6d5b6d3
6 changed files with 516 additions and 17 deletions
@@ -1,5 +1,6 @@
 node_modules/
 browse/dist/
+.gstack/
 /tmp/
 *.log
 bun.lock
@@ -7,24 +7,46 @@
  - [x] Snapshot command with ref-based element selection
  - [x] Snapshot tests

-## Phase 2: Enhanced Browser
-  - [ ] Annotated screenshots (--annotate flag, numbered labels on elements mapped to refs)
-  - [ ] Snapshot diffing (compare before/after accessibility trees, verify actions worked)
-  - [ ] Dialog handling (dialog accept/dismiss — prevents browser lockup)
-  - [ ] File upload (upload <sel> <files>)
-  - [ ] Cursor-interactive elements (-C flag, detect divs with cursor:pointer/onclick/tabindex)
-  - [ ] Element state checks (is visible/enabled/checked <sel>)
+## Phase 2: Enhanced Browser (v0.2.0) ✅
+  - [x] Annotated screenshots (--annotate flag, ref labels overlaid on screenshot)
+  - [x] Snapshot diffing (--diff flag, unified diff against previous snapshot)
+  - [x] Dialog handling (auto-accept/dismiss, dialog buffer, prevents browser lockup)
+  - [x] File upload (upload <sel> <files>)
+  - [x] Cursor-interactive elements (-C flag, cursor:pointer/onclick/tabindex scan)
+  - [x] Element state checks (is visible/hidden/enabled/disabled/checked/editable/focused)
+  - [x] CircularBuffer — O(1) ring buffer for console/network/dialog (was O(n) array+shift)
+  - [x] Async buffer flush with Bun.write() (was appendFileSync)
+  - [x] Health check with page.evaluate('1') + 2s timeout
+  - [x] Playwright error wrapping — actionable messages for AI agents
+  - [x] Fix useragent — context recreation preserves cookies/storage/URLs
+  - [x] DRY: getCleanText exported, command sets in chain updated
+  - [x] 148 integration tests (was ~63)

-## Phase 3: QA Testing Agent (dogfood skill)
-  - [ ] SKILL.md — 6-phase workflow: Initialize → Authenticate → Orient → Explore → Document → Wrap up
-  - [ ] Issue taxonomy reference (7 categories: visual, functional, UX, content, performance, console, accessibility)
-  - [ ] Severity classification (critical/high/medium/low)
-  - [ ] Exploration checklist per page
-  - [ ] Report template (structured markdown with per-issue evidence)
-  - [ ] Repro-first philosophy: every issue gets evidence before moving on
-  - [ ] Two evidence tiers: interactive bugs (video + step-by-step screenshots), static bugs (single annotated screenshot)
-  - [ ] Video recording (record start/stop for WebM capture via Playwright)
-  - [ ] Key guidance: 5-10 well-documented issues per session, depth over breadth, write incrementally
+## Phase 3: QA Testing Agent (v0.3.0)
+  - [x] `/qa` SKILL.md — 6-phase workflow: Initialize → Authenticate → Orient → Explore → Document → Wrap up
+  - [x] Issue taxonomy reference (7 categories: visual, functional, UX, content, performance, console, accessibility)
+  - [x] Severity classification (critical/high/medium/low)
+  - [x] Exploration checklist per page
+  - [x] Report template (structured markdown with per-issue evidence)
+  - [x] Repro-first philosophy: every issue gets evidence before moving on
+  - [x] Two evidence tiers: interactive bugs (multi-step screenshots), static bugs (single annotated screenshot)
+  - [x] Key guidance: 5-10 well-documented issues per session, depth over breadth, write incrementally
+  - [x] Three modes: full (systematic), quick (30-second smoke test), regression (compare against baseline)
+  - [x] Framework detection guidance (Next.js, Rails, WordPress, SPA)
+  - [x] Health score rubric (7 categories, weighted average)
+  - [x] `wait --networkidle` / `wait --load` / `wait --domcontentloaded`
+  - [x] `console --errors` (filter to error/warning only)
+  - [x] `cookie-import <json-file>` (bulk cookie import with auto-fill domain)
+  - [x] `browse/bin/find-browse` (DRY binary discovery across skills)
+  - [ ] Video recording (deferred to Phase 5 — recreateContext destroys page state)
+
+## Phase 3.5: Visual PR Annotations + S3 Upload
+  - [ ] `/setup-gstack-upload` skill (configure S3 bucket for image hosting)
+  - [ ] `browse/bin/gstack-upload` helper (upload file to S3, return public URL)
+  - [ ] `/ship` Step 7.5: visual verification with screenshots in PR body
+  - [ ] `/review` Step 4.5: visual review with annotated screenshots in PR
+  - [ ] WebM → GIF conversion (ffmpeg) for video evidence in PRs
+  - [ ] README documentation for visual PR annotations

 ## Phase 4: Skill + Browser Integration
  - [ ] ship + browse: post-deploy verification
@@ -51,6 +73,7 @@
  - [ ] Sessions (isolated browser instances with separate cookies/storage/history)
  - [ ] State persistence (save/load cookies + localStorage to JSON files)
  - [ ] Auth vault (encrypted credential storage, referenced by name, LLM never sees passwords)
+  - [ ] Video recording (record start/stop — needs sessions for clean context lifecycle)
  - [ ] retro + browse: deployment health tracking
    - Screenshot production state
    - Check perf metrics (page load times)
@@ -67,6 +90,11 @@
  - [ ] Streaming (WebSocket live preview for pair browsing)
  - [ ] CDP mode (connect to already-running Chrome/Electron apps)

+## Future Ideas
+  - [ ] Trend tracking across QA runs — compare baseline.json over time, detect regressions (P2, S)
+  - [ ] CI/CD integration — `/qa` as GitHub Action step, fail PR if health score drops (P2, M)
+  - [ ] Accessibility audit mode — `--a11y` flag for focused accessibility testing (P3, S)
+
 ## Ideas & Notes
  - Browser is the nervous system — every skill should be able to see, interact with, and verify the web
  - Skills are the product; the browser enables them
@@ -0,0 +1,11 @@
+#!/bin/bash
+# Find the gstack browse binary. Echoes path and exits 0, or exits 1 if not found.
+ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+if [ -n "$ROOT" ] && test -x "$ROOT/.claude/skills/gstack/browse/dist/browse"; then
+  echo "$ROOT/.claude/skills/gstack/browse/dist/browse"
+elif test -x "$HOME/.claude/skills/gstack/browse/dist/browse"; then
+  echo "$HOME/.claude/skills/gstack/browse/dist/browse"
+else
+  echo "ERROR: browse binary not found. Run: cd <skill-dir> && ./setup" >&2
+  exit 1
+fi
@@ -0,0 +1,295 @@
+---
+name: qa
+version: 1.0.0
+description: |
+  Systematically QA test a web application. Use when asked to "qa", "QA", "test this site",
+  "find bugs", "dogfood", or review quality. Three modes: full (systematic exploration),
+  quick (30-second smoke test), regression (compare against baseline). Produces structured
+  report with health score, screenshots, and repro steps.
+allowed-tools:
+  - Bash
+  - Read
+  - Write
+---
+
+# /qa: Systematic QA Testing
+
+You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence.
+
+## Setup
+
+**Parse the user's request for these parameters:**
+
+| Parameter | Default | Override example |
+|-----------|---------|-----------------|
+| Target URL | (required) | `https://myapp.com`, `http://localhost:3000` |
+| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
+| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
+| Scope | Full app | `Focus on the billing page` |
+| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
+
+**Find the browse binary:**
+
+```bash
+B=$(browse/bin/find-browse 2>/dev/null || ~/.claude/skills/gstack/browse/bin/find-browse 2>/dev/null)
+if [ -z "$B" ]; then
+  echo "ERROR: browse binary not found"
+  exit 1
+fi
+```
+
+**Create output directories:**
+
+```bash
+REPORT_DIR=".gstack/qa-reports"
+mkdir -p "$REPORT_DIR/screenshots"
+```
+
+---
+
+## Modes
+
+### Full (default)
+Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size.
+
+### Quick (`--quick`)
+30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? Produce health score. No detailed issue documentation.
+
+### Regression (`--regression <baseline>`)
+Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report.
+
+---
+
+## Workflow
+
+### Phase 1: Initialize
+
+1. Find browse binary (see Setup above)
+2. Create output directories
+3. Copy report template from `qa/templates/qa-report-template.md` to output dir
+4. Start timer for duration tracking
+
+### Phase 2: Authenticate (if needed)
+
+**If the user specified auth credentials:**
+
+```bash
+$B goto <login-url>
+$B snapshot -i                    # find the login form
+$B fill @e3 "user@example.com"
+$B fill @e4 "[REDACTED]"         # NEVER include real passwords in report
+$B click @e5                      # submit
+$B snapshot -D                    # verify login succeeded
+```
+
+**If the user provided a cookie file:**
+
+```bash
+$B cookie-import cookies.json
+$B goto <target-url>
+```
+
+**If 2FA/OTP is required:** Ask the user for the code and wait.
+
+**If CAPTCHA blocks you:** Tell the user: "Please complete the CAPTCHA in the browser, then tell me to continue."
+
+### Phase 3: Orient
+
+Get a map of the application:
+
+```bash
+$B goto <target-url>
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/initial.png"
+$B links                          # map navigation structure
+$B console --errors               # any errors on landing?
+```
+
+**Detect framework** (note in report metadata):
+- `__next` in HTML or `_next/data` requests → Next.js
+- `csrf-token` meta tag → Rails
+- `wp-content` in URLs → WordPress
+- Client-side routing with no page reloads → SPA
+
+**For SPAs:** The `links` command may return few results because navigation is client-side. Use `snapshot -i` to find nav elements (buttons, menu items) instead.
+
+### Phase 4: Explore
+
+Visit pages systematically. At each page:
+
+```bash
+$B goto <page-url>
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/page-name.png"
+$B console --errors
+```
+
+Then follow the **per-page exploration checklist** (see `qa/references/issue-taxonomy.md`):
+
+1. **Visual scan** — Look at the annotated screenshot for layout issues
+2. **Interactive elements** — Click buttons, links, controls. Do they work?
+3. **Forms** — Fill and submit. Test empty, invalid, edge cases
+4. **Navigation** — Check all paths in and out
+5. **States** — Empty state, loading, error, overflow
+6. **Console** — Any new JS errors after interactions?
+7. **Responsiveness** — Check mobile viewport if relevant:
+   ```bash
+   $B viewport 375x812
+   $B screenshot "$REPORT_DIR/screenshots/page-mobile.png"
+   $B viewport 1280x720
+   ```
+
+**Depth judgment:** Spend more time on core features (homepage, dashboard, checkout, search) and less on secondary pages (about, terms, privacy).
+
+**Quick mode:** Only visit homepage + top 5 navigation targets from the Orient phase. Skip the per-page checklist — just check: loads? Console errors? Broken links visible?
+
+### Phase 5: Document
+
+Document each issue **immediately when found** — don't batch them.
+
+**Two evidence tiers:**
+
+**Interactive bugs** (broken flows, dead buttons, form failures):
+1. Take a screenshot before the action
+2. Perform the action
+3. Take a screenshot showing the result
+4. Use `snapshot -D` to show what changed
+5. Write repro steps referencing screenshots
+
+```bash
+$B screenshot "$REPORT_DIR/screenshots/issue-001-step-1.png"
+$B click @e5
+$B screenshot "$REPORT_DIR/screenshots/issue-001-result.png"
+$B snapshot -D
+```
+
+**Static bugs** (typos, layout issues, missing images):
+1. Take a single annotated screenshot showing the problem
+2. Describe what's wrong
+
+```bash
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/issue-002.png"
+```
+
+**Write each issue to the report immediately** using the template format from `qa/templates/qa-report-template.md`.
+
+### Phase 6: Wrap Up
+
+1. **Compute health score** using the rubric below
+2. **Write "Top 3 Things to Fix"** — the 3 highest-severity issues
+3. **Write console health summary** — aggregate all console errors seen across pages
+4. **Update severity counts** in the summary table
+5. **Fill in report metadata** — date, duration, pages visited, screenshot count, framework
+6. **Save baseline** — write `baseline.json` with:
+   ```json
+   {
+     "date": "YYYY-MM-DD",
+     "url": "<target>",
+     "healthScore": N,
+     "issues": [{ "id": "ISSUE-001", "title": "...", "severity": "...", "category": "..." }],
+     "categoryScores": { "console": N, "links": N, ... }
+   }
+   ```
+
+**Regression mode:** After writing the report, load the baseline file. Compare:
+- Health score delta
+- Issues fixed (in baseline but not current)
+- New issues (in current but not baseline)
+- Append the regression section to the report
+
+---
+
+## Health Score Rubric
+
+Compute each category score (0-100), then take the weighted average.
+
+### Console (weight: 15%)
+- 0 errors → 100
+- 1-3 errors → 70
+- 4-10 errors → 40
+- 11+ errors → 10
+
+### Links (weight: 10%)
+- 0 broken → 100
+- Each broken link → -15 (minimum 0)
+
+### Per-Category Scoring (Visual, Functional, UX, Content, Performance, Accessibility)
+Each category starts at 100. Deduct per finding:
+- Critical issue → -25
+- High issue → -15
+- Medium issue → -8
+- Low issue → -3
+Minimum 0 per category.
+
+### Weights
+| Category | Weight |
+|----------|--------|
+| Console | 15% |
+| Links | 10% |
+| Visual | 10% |
+| Functional | 20% |
+| UX | 15% |
+| Performance | 10% |
+| Content | 5% |
+| Accessibility | 15% |
+
+### Final Score
+`score = Σ (category_score × weight)`
+
+---
+
+## Framework-Specific Guidance
+
+### Next.js
+- Check console for hydration errors (`Hydration failed`, `Text content did not match`)
+- Monitor `_next/data` requests in network — 404s indicate broken data fetching
+- Test client-side navigation (click links, don't just `goto`) — catches routing issues
+- Check for CLS (Cumulative Layout Shift) on pages with dynamic content
+
+### Rails
+- Check for N+1 query warnings in console (if development mode)
+- Verify CSRF token presence in forms
+- Test Turbo/Stimulus integration — do page transitions work smoothly?
+- Check for flash messages appearing and dismissing correctly
+
+### WordPress
+- Check for plugin conflicts (JS errors from different plugins)
+- Verify admin bar visibility for logged-in users
+- Test REST API endpoints (`/wp-json/`)
+- Check for mixed content warnings (common with WP)
+
+### General SPA (React, Vue, Angular)
+- Use `snapshot -i` for navigation — `links` command misses client-side routes
+- Check for stale state (navigate away and back — does data refresh?)
+- Test browser back/forward — does the app handle history correctly?
+- Check for memory leaks (monitor console after extended use)
+
+---
+
+## Important Rules
+
+1. **Repro is everything.** Every issue needs at least one screenshot. No exceptions.
+2. **Verify before documenting.** Retry the issue once to confirm it's reproducible, not a fluke.
+3. **Never include credentials.** Write `[REDACTED]` for passwords in repro steps.
+4. **Write incrementally.** Append each issue to the report as you find it. Don't batch.
+5. **Never read source code.** Test as a user, not a developer.
+6. **Check console after every interaction.** JS errors that don't surface visually are still bugs.
+7. **Test like a user.** Use realistic data. Walk through complete workflows end-to-end.
+8. **Depth over breadth.** 5-10 well-documented issues with evidence > 20 vague descriptions.
+9. **Never delete output files.** Screenshots and reports accumulate — that's intentional.
+10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses.
+
+---
+
+## Output Structure
+
+```
+.gstack/qa-reports/
+├── qa-report-{domain}-{YYYY-MM-DD}.md    # Structured report
+├── screenshots/
+│   ├── initial.png                        # Landing page annotated screenshot
+│   ├── issue-001-step-1.png               # Per-issue evidence
+│   ├── issue-001-result.png
+│   └── ...
+└── baseline.json                          # For regression mode
+```
+
+Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
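The Health Score Rubric in the SKILL.md above can be checked by hand with a few lines of shell. The following is a minimal sketch, not part of the commit: the function names, the argument order, and the round-to-nearest step are assumptions, and it places exactly 10 console errors in the 4-10 bucket. The weights are the eight percentages from the Weights table.

```bash
#!/bin/bash
# Sketch of the Health Score Rubric arithmetic. Function names and the
# argument order are illustrative assumptions, not part of the skill.

# Console bucket: 0 -> 100, 1-3 -> 70, 4-10 -> 40, more -> 10.
# Assumption: exactly 10 errors falls in the 4-10 bucket.
console_score() {
  if [ "$1" -eq 0 ]; then echo 100
  elif [ "$1" -le 3 ]; then echo 70
  elif [ "$1" -le 10 ]; then echo 40
  else echo 10
  fi
}

# Links: start at 100, subtract 15 per broken link, floor at 0.
links_score() {
  local score=$(( 100 - 15 * $1 ))
  if [ "$score" -lt 0 ]; then score=0; fi
  echo "$score"
}

# Weighted average. Arguments, in order: console links visual functional
# ux performance content accessibility (each 0-100).
health_score() {
  local total=0 w
  if [ "$#" -ne 8 ]; then
    echo "health_score: expected 8 category scores" >&2
    return 1
  fi
  for w in 15 10 10 20 15 10 5 15; do
    total=$(( total + $1 * w ))
    shift
  done
  # The weights sum to 100; adding 50 rounds to the nearest integer
  # (rounding is an assumption, the rubric does not specify it).
  echo $(( (total + 50) / 100 ))
}

# Example: 2 console errors, 1 broken link, a few findings elsewhere.
health_score "$(console_score 2)" "$(links_score 1)" 92 85 100 100 97 100
```

With these inputs the weighted sum is 9005, so the example prints 90. Integer arithmetic with percentage weights avoids any dependency on `bc` or `awk`.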
@@ -0,0 +1,85 @@
+# QA Issue Taxonomy
+
+## Severity Levels
+
+| Severity | Definition | Examples |
+|----------|------------|----------|
+| **critical** | Blocks a core workflow, causes data loss, or crashes the app | Form submit causes error page, checkout flow broken, data deleted without confirmation |
+| **high** | Major feature broken or unusable, no workaround | Search returns wrong results, file upload silently fails, auth redirect loop |
+| **medium** | Feature works but with noticeable problems, workaround exists | Slow page load (>5s), form validation missing but submit still works, layout broken on mobile only |
+| **low** | Minor cosmetic or polish issue | Typo in footer, 1px alignment issue, hover state inconsistent |
+
+## Categories
+
+### 1. Visual/UI
+- Layout breaks (overlapping elements, clipped text, horizontal scrollbar)
+- Broken or missing images
+- Incorrect z-index (elements appearing behind others)
+- Font/color inconsistencies
+- Animation glitches (jank, incomplete transitions)
+- Alignment issues (off-grid, uneven spacing)
+- Dark mode / theme issues
+
+### 2. Functional
+- Broken links (404, wrong destination)
+- Dead buttons (click does nothing)
+- Form validation (missing, wrong, bypassed)
+- Incorrect redirects
+- State not persisting (data lost on refresh, back button)
+- Race conditions (double-submit, stale data)
+- Search returning wrong or no results
+
+### 3. UX
+- Confusing navigation (no breadcrumbs, dead ends)
+- Missing loading indicators (user doesn't know something is happening)
+- Slow interactions (>500ms with no feedback)
+- Unclear error messages ("Something went wrong" with no detail)
+- No confirmation before destructive actions
+- Inconsistent interaction patterns across pages
+- Dead ends (no way back, no next action)
+
+### 4. Content
+- Typos and grammar errors
+- Outdated or incorrect text
+- Placeholder / lorem ipsum text left in
+- Truncated text (cut off without ellipsis or "more")
+- Wrong labels on buttons or form fields
+- Missing or unhelpful empty states
+
+### 5. Performance
+- Slow page loads (>3 seconds)
+- Janky scrolling (dropped frames)
+- Layout shifts (content jumping after load)
+- Excessive network requests (>50 on a single page)
+- Large unoptimized images
+- Blocking JavaScript (page unresponsive during load)
+
+### 6. Console/Errors
+- JavaScript exceptions (uncaught errors)
+- Failed network requests (4xx, 5xx)
+- Deprecation warnings (upcoming breakage)
+- CORS errors
+- Mixed content warnings (HTTP resources on HTTPS)
+- CSP violations
+
+### 7. Accessibility
+- Missing alt text on images
+- Unlabeled form inputs
+- Keyboard navigation broken (can't tab to elements)
+- Focus traps (can't escape a modal or dropdown)
+- Missing or incorrect ARIA attributes
+- Insufficient color contrast
+- Content not reachable by screen reader
+
+## Per-Page Exploration Checklist
+
+For each page visited during a QA session:
+
+1. **Visual scan** — Take annotated screenshot (`snapshot -i -a -o`). Look for layout issues, broken images, alignment.
+2. **Interactive elements** — Click every button, link, and control. Does each do what it says?
+3. **Forms** — Fill and submit. Test empty submission, invalid data, edge cases (long text, special characters).
+4. **Navigation** — Check all paths in/out. Breadcrumbs, back button, deep links, mobile menu.
+5. **States** — Check empty state, loading state, error state, full/overflow state.
+6. **Console** — Run `console --errors` after interactions. Any new JS errors or failed requests?
+7. **Responsiveness** — If relevant, check mobile and tablet viewports.
+8. **Auth boundaries** — What happens when logged out? Different user roles?
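The four severity levels in this taxonomy feed the per-category deductions that the `/qa` rubric describes (critical -25, high -15, medium -8, low -3, floor of 0). As a rough sketch of that arithmetic, with a hypothetical function name and argument order:

```bash
#!/bin/bash
# Sketch: score one category from its finding counts per severity.
# Usage: category_score <critical> <high> <medium> <low>
# The function name and argument order are illustrative assumptions.
category_score() {
  local critical=${1:-0} high=${2:-0} medium=${3:-0} low=${4:-0}
  local score=$(( 100 - 25 * critical - 15 * high - 8 * medium - 3 * low ))
  # Each category is floored at 0.
  if [ "$score" -lt 0 ]; then score=0; fi
  echo "$score"
}

category_score 0 1 1 0   # one high and one medium finding
category_score 5 0 0 0   # deductions exceed 100, so the floor applies
```

The two calls print 77 and 0 respectively.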
@@ -0,0 +1,79 @@
+# QA Report: {APP_NAME}
+
+| Field | Value |
+|-------|-------|
+| **Date** | {DATE} |
+| **URL** | {URL} |
+| **Scope** | {SCOPE or "Full app"} |
+| **Mode** | {full / quick / regression} |
+| **Duration** | {DURATION} |
+| **Pages visited** | {COUNT} |
+| **Screenshots** | {COUNT} |
+| **Framework** | {DETECTED or "Unknown"} |
+
+## Health Score: {SCORE}/100
+
+| Category | Score |
+|----------|-------|
+| Console | {0-100} |
+| Links | {0-100} |
+| Visual | {0-100} |
+| Functional | {0-100} |
+| UX | {0-100} |
+| Performance | {0-100} |
+| Accessibility | {0-100} |
+
+## Top 3 Things to Fix
+
+1. **{ISSUE-NNN}: {title}** — {one-line description}
+2. **{ISSUE-NNN}: {title}** — {one-line description}
+3. **{ISSUE-NNN}: {title}** — {one-line description}
+
+## Console Health
+
+| Error | Count | First seen |
+|-------|-------|------------|
+| {error message} | {N} | {URL} |
+
+## Summary
+
+| Severity | Count |
+|----------|-------|
+| Critical | 0 |
+| High | 0 |
+| Medium | 0 |
+| Low | 0 |
+| **Total** | **0** |
+
+## Issues
+
+### ISSUE-001: {Short title}
+
+| Field | Value |
+|-------|-------|
+| **Severity** | critical / high / medium / low |
+| **Category** | visual / functional / ux / content / performance / console / accessibility |
+| **URL** | {page URL} |
+
+**Description:** {What is wrong, expected vs actual.}
+
+**Repro Steps:**
+
+1. Navigate to {URL}
+   ![Step 1](screenshots/issue-001-step-1.png)
+2. {Action}
+   ![Step 2](screenshots/issue-001-step-2.png)
+3. **Observe:** {what goes wrong}
+   ![Result](screenshots/issue-001-result.png)
+
+---
+
+## Regression (if applicable)
+
+| Metric | Baseline | Current | Delta |
+|--------|----------|---------|-------|
+| Health score | {N} | {N} | {+/-N} |
+| Issues | {N} | {N} | {+/-N} |
+
+**Fixed since baseline:** {list}
+**New since baseline:** {list}
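The Regression section of the template boils down to a score delta and two set differences. The sketch below is one possible way to compute them and is not part of the commit: it assumes the issue titles from each run have already been extracted to plain-text files, one title per line, and it matches on titles rather than `ISSUE-NNN` ids. The function names are hypothetical.

```bash
#!/bin/bash
# Sketch of the regression comparison. Inputs are plain-text files with one
# issue title per line; matching on titles is an assumption, as are the
# function names.

# Titles present in the baseline but absent from the current run.
fixed_issues() { grep -Fxv -f "$2" "$1" || true; }

# Titles present in the current run but absent from the baseline.
new_issues() { grep -Fxv -f "$1" "$2" || true; }

# Signed health score delta: score_delta <baseline> <current>
score_delta() { printf '%+d\n' $(( $2 - $1 )); }

# Demo with throwaway data.
tmp=$(mktemp -d)
printf '%s\n' "Dead submit button" "Typo in footer" > "$tmp/baseline.txt"
printf '%s\n' "Typo in footer" "Broken image on pricing" > "$tmp/current.txt"
echo "Fixed: $(fixed_issues "$tmp/baseline.txt" "$tmp/current.txt")"
echo "New:   $(new_issues "$tmp/baseline.txt" "$tmp/current.txt")"
echo "Delta: $(score_delta 72 85)"
rm -rf "$tmp"
```

`grep -Fxv -f` treats each line of the pattern file as a fixed, whole-line string, so titles containing regex metacharacters compare literally. The `|| true` keeps the functions from returning a failure status when the difference is empty.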