From 6ff6d5b6d363797c0e496f5dbd7a8a15f0ec3a96 Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Thu, 12 Mar 2026 13:34:06 -0700
Subject: [PATCH] =?UTF-8?q?Phase=203:=20/qa=20skill=20=E2=80=94=20systemat?=
 =?UTF-8?q?ic=20QA=20testing=20with=20health=20scores?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New /qa skill for systematic web app QA testing. Three modes:
- full: 5-10 documented issues with screenshots and repro steps
- quick: 30-second smoke test with health score
- regression: compare against saved baseline

Includes issue taxonomy (7 categories, 4 severity levels), structured
report template, health score rubric (weighted across 7 categories),
framework detection guidance (Next.js, Rails, WordPress, SPA).

Also adds browse/bin/find-browse (DRY binary discovery using git
rev-parse), .gstack/ to .gitignore, and updated TODO roadmap.

Co-Authored-By: Claude Opus 4.6
---
 .gitignore                         |   1 +
 TODO.md                            |  62 ++++--
 browse/bin/find-browse             |  11 ++
 qa/SKILL.md                        | 295 +++++++++++++++++++++++++++++
 qa/references/issue-taxonomy.md    |  85 +++++++++
 qa/templates/qa-report-template.md |  79 ++++++++
 6 files changed, 516 insertions(+), 17 deletions(-)
 create mode 100755 browse/bin/find-browse
 create mode 100644 qa/SKILL.md
 create mode 100644 qa/references/issue-taxonomy.md
 create mode 100644 qa/templates/qa-report-template.md

diff --git a/.gitignore b/.gitignore
index 2d3e8509..ef0aeaf5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,6 @@
 node_modules/
 browse/dist/
+.gstack/
 /tmp/
 *.log
 bun.lock
diff --git a/TODO.md b/TODO.md
index f08cff23..3cae3e5a 100644
--- a/TODO.md
+++ b/TODO.md
@@ -7,24 +7,46 @@
   - [x] Snapshot command with ref-based element selection
   - [x] Snapshot tests
 
-## Phase 2: Enhanced Browser
-  - [ ] Annotated screenshots (--annotate flag, numbered labels on elements mapped to refs)
-  - [ ] Snapshot diffing (compare before/after accessibility trees, verify actions worked)
-  - [ ] Dialog handling (dialog accept/dismiss — prevents browser lockup)
-  - [ ] File upload (upload )
-  - [ ] Cursor-interactive elements (-C flag, detect divs with cursor:pointer/onclick/tabindex)
-  - [ ] Element state checks (is visible/enabled/checked )
+## Phase 2: Enhanced Browser (v0.2.0) ✅
+  - [x] Annotated screenshots (--annotate flag, ref labels overlaid on screenshot)
+  - [x] Snapshot diffing (--diff flag, unified diff against previous snapshot)
+  - [x] Dialog handling (auto-accept/dismiss, dialog buffer, prevents browser lockup)
+  - [x] File upload (upload )
+  - [x] Cursor-interactive elements (-C flag, cursor:pointer/onclick/tabindex scan)
+  - [x] Element state checks (is visible/hidden/enabled/disabled/checked/editable/focused)
+  - [x] CircularBuffer — O(1) ring buffer for console/network/dialog (was O(n) array+shift)
+  - [x] Async buffer flush with Bun.write() (was appendFileSync)
+  - [x] Health check with page.evaluate('1') + 2s timeout
+  - [x] Playwright error wrapping — actionable messages for AI agents
+  - [x] Fix useragent — context recreation preserves cookies/storage/URLs
+  - [x] DRY: getCleanText exported, command sets in chain updated
+  - [x] 148 integration tests (was ~63)
 
-## Phase 3: QA Testing Agent (dogfood skill)
-  - [ ] SKILL.md — 6-phase workflow: Initialize → Authenticate → Orient → Explore → Document → Wrap up
-  - [ ] Issue taxonomy reference (7 categories: visual, functional, UX, content, performance, console, accessibility)
-  - [ ] Severity classification (critical/high/medium/low)
-  - [ ] Exploration checklist per page
-  - [ ] Report template (structured markdown with per-issue evidence)
-  - [ ] Repro-first philosophy: every issue gets evidence before moving on
-  - [ ] Two evidence tiers: interactive bugs (video + step-by-step screenshots), static bugs (single annotated screenshot)
-  - [ ] Video recording (record start/stop for WebM capture via Playwright)
-  - [ ] Key guidance: 5-10 well-documented issues per session, depth over breadth, write incrementally
+## Phase 3: QA Testing Agent (v0.3.0)
+  - [x] `/qa` SKILL.md — 6-phase workflow: Initialize → Authenticate → Orient → Explore → Document → Wrap up
+  - [x] Issue taxonomy reference (7 categories: visual, functional, UX, content, performance, console, accessibility)
+  - [x] Severity classification (critical/high/medium/low)
+  - [x] Exploration checklist per page
+  - [x] Report template (structured markdown with per-issue evidence)
+  - [x] Repro-first philosophy: every issue gets evidence before moving on
+  - [x] Two evidence tiers: interactive bugs (multi-step screenshots), static bugs (single annotated screenshot)
+  - [x] Key guidance: 5-10 well-documented issues per session, depth over breadth, write incrementally
+  - [x] Three modes: full (systematic), quick (30-second smoke test), regression (compare against baseline)
+  - [x] Framework detection guidance (Next.js, Rails, WordPress, SPA)
+  - [x] Health score rubric (7 categories, weighted average)
+  - [x] `wait --networkidle` / `wait --load` / `wait --domcontentloaded`
+  - [x] `console --errors` (filter to error/warning only)
+  - [x] `cookie-import ` (bulk cookie import with auto-fill domain)
+  - [x] `browse/bin/find-browse` (DRY binary discovery across skills)
+  - [ ] Video recording (deferred to Phase 5 — recreateContext destroys page state)
+
+## Phase 3.5: Visual PR Annotations + S3 Upload
+  - [ ] `/setup-gstack-upload` skill (configure S3 bucket for image hosting)
+  - [ ] `browse/bin/gstack-upload` helper (upload file to S3, return public URL)
+  - [ ] `/ship` Step 7.5: visual verification with screenshots in PR body
+  - [ ] `/review` Step 4.5: visual review with annotated screenshots in PR
+  - [ ] WebM → GIF conversion (ffmpeg) for video evidence in PRs
+  - [ ] README documentation for visual PR annotations
 
 ## Phase 4: Skill + Browser Integration
   - [ ] ship + browse: post-deploy verification
@@ -51,6 +73,7 @@
   - [ ] Sessions (isolated browser instances with separate cookies/storage/history)
   - [ ] State persistence (save/load cookies + localStorage to JSON files)
   - [ ] Auth vault (encrypted credential storage, referenced by name, LLM never sees passwords)
+  - [ ] Video recording (record start/stop — needs sessions for clean context lifecycle)
   - [ ] retro + browse: deployment health tracking
     - Screenshot production state
     - Check perf metrics (page load times)
@@ -67,6 +90,11 @@
   - [ ] Streaming (WebSocket live preview for pair browsing)
   - [ ] CDP mode (connect to already-running Chrome/Electron apps)
 
+## Future Ideas
+  - [ ] Trend tracking across QA runs — compare baseline.json over time, detect regressions (P2, S)
+  - [ ] CI/CD integration — `/qa` as GitHub Action step, fail PR if health score drops (P2, M)
+  - [ ] Accessibility audit mode — `--a11y` flag for focused accessibility testing (P3, S)
+
 ## Ideas & Notes
   - Browser is the nervous system — every skill should be able to see, interact with, and verify the web
   - Skills are the product; the browser enables them
diff --git a/browse/bin/find-browse b/browse/bin/find-browse
new file mode 100755
index 00000000..72882030
--- /dev/null
+++ b/browse/bin/find-browse
@@ -0,0 +1,11 @@
+#!/bin/bash
+# Find the gstack browse binary. Echoes path and exits 0, or exits 1 if not found.
+ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
+if [ -n "$ROOT" ] && test -x "$ROOT/.claude/skills/gstack/browse/dist/browse"; then
+  echo "$ROOT/.claude/skills/gstack/browse/dist/browse"
+elif test -x "$HOME/.claude/skills/gstack/browse/dist/browse"; then
+  echo "$HOME/.claude/skills/gstack/browse/dist/browse"
+else
+  echo "ERROR: browse binary not found. Run: cd && ./setup" >&2
+  exit 1
+fi
diff --git a/qa/SKILL.md b/qa/SKILL.md
new file mode 100644
index 00000000..9da05fa2
--- /dev/null
+++ b/qa/SKILL.md
@@ -0,0 +1,295 @@
+---
+name: qa
+version: 1.0.0
+description: |
+  Systematically QA test a web application. Use when asked to "qa", "QA", "test this site",
+  "find bugs", "dogfood", or review quality. Three modes: full (systematic exploration),
+  quick (30-second smoke test), regression (compare against baseline). Produces structured
+  report with health score, screenshots, and repro steps.
+allowed-tools:
+  - Bash
+  - Read
+  - Write
+---
+
+# /qa: Systematic QA Testing
+
+You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence.
+
+## Setup
+
+**Parse the user's request for these parameters:**
+
+| Parameter | Default | Override example |
+|-----------|---------|-----------------|
+| Target URL | (required) | `https://myapp.com`, `http://localhost:3000` |
+| Mode | full | `--quick`, `--regression .gstack/qa-reports/baseline.json` |
+| Output dir | `.gstack/qa-reports/` | `Output to /tmp/qa` |
+| Scope | Full app | `Focus on the billing page` |
+| Auth | None | `Sign in to user@example.com`, `Import cookies from cookies.json` |
+
+**Find the browse binary:**
+
+```bash
+B=$(browse/bin/find-browse 2>/dev/null || ~/.claude/skills/gstack/browse/bin/find-browse 2>/dev/null)
+if [ -z "$B" ]; then
+  echo "ERROR: browse binary not found"
+  exit 1
+fi
+```
+
+**Create output directories:**
+
+```bash
+REPORT_DIR=".gstack/qa-reports"
+mkdir -p "$REPORT_DIR/screenshots"
+```
+
+---
+
+## Modes
+
+### Full (default)
+Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size.
+
+### Quick (`--quick`)
+30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? Produce health score. No detailed issue documentation.
+
+### Regression (`--regression `)
+Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report.
+
+---
+
+## Workflow
+
+### Phase 1: Initialize
+
+1. Find browse binary (see Setup above)
+2. Create output directories
+3. Copy report template from `qa/templates/qa-report-template.md` to output dir
+4. Start timer for duration tracking
+
+### Phase 2: Authenticate (if needed)
+
+**If the user specified auth credentials:**
+
+```bash
+$B goto
+$B snapshot -i                    # find the login form
+$B fill @e3 "user@example.com"
+$B fill @e4 "[REDACTED]"          # NEVER include real passwords in report
+$B click @e5                      # submit
+$B snapshot -D                    # verify login succeeded
+```
+
+**If the user provided a cookie file:**
+
+```bash
+$B cookie-import cookies.json
+$B goto
+```
+
+**If 2FA/OTP is required:** Ask the user for the code and wait.
+
+**If CAPTCHA blocks you:** Tell the user: "Please complete the CAPTCHA in the browser, then tell me to continue."
+
+### Phase 3: Orient
+
+Get a map of the application:
+
+```bash
+$B goto
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/initial.png"
+$B links                          # map navigation structure
+$B console --errors               # any errors on landing?
+```
+
+**Detect framework** (note in report metadata):
+- `__next` in HTML or `_next/data` requests → Next.js
+- `csrf-token` meta tag → Rails
+- `wp-content` in URLs → WordPress
+- Client-side routing with no page reloads → SPA
+
+**For SPAs:** The `links` command may return few results because navigation is client-side. Use `snapshot -i` to find nav elements (buttons, menu items) instead.
+
+### Phase 4: Explore
+
+Visit pages systematically. At each page:
+
+```bash
+$B goto
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/page-name.png"
+$B console --errors
+```
+
+Then follow the **per-page exploration checklist** (see `qa/references/issue-taxonomy.md`):
+
+1. **Visual scan** — Look at the annotated screenshot for layout issues
+2. **Interactive elements** — Click buttons, links, controls. Do they work?
+3. **Forms** — Fill and submit. Test empty, invalid, edge cases
+4. **Navigation** — Check all paths in and out
+5. **States** — Empty state, loading, error, overflow
+6. **Console** — Any new JS errors after interactions?
+7. **Responsiveness** — Check mobile viewport if relevant:
+   ```bash
+   $B viewport 375x812
+   $B screenshot "$REPORT_DIR/screenshots/page-mobile.png"
+   $B viewport 1280x720
+   ```
+
+**Depth judgment:** Spend more time on core features (homepage, dashboard, checkout, search) and less on secondary pages (about, terms, privacy).
+
+**Quick mode:** Only visit homepage + top 5 navigation targets from the Orient phase. Skip the per-page checklist — just check: loads? Console errors? Broken links visible?
+
+### Phase 5: Document
+
+Document each issue **immediately when found** — don't batch them.
+
+**Two evidence tiers:**
+
+**Interactive bugs** (broken flows, dead buttons, form failures):
+1. Take a screenshot before the action
+2. Perform the action
+3. Take a screenshot showing the result
+4. Use `snapshot -D` to show what changed
+5. Write repro steps referencing screenshots
+
+```bash
+$B screenshot "$REPORT_DIR/screenshots/issue-001-step-1.png"
+$B click @e5
+$B screenshot "$REPORT_DIR/screenshots/issue-001-result.png"
+$B snapshot -D
+```
+
+**Static bugs** (typos, layout issues, missing images):
+1. Take a single annotated screenshot showing the problem
+2. Describe what's wrong
+
+```bash
+$B snapshot -i -a -o "$REPORT_DIR/screenshots/issue-002.png"
+```
+
+**Write each issue to the report immediately** using the template format from `qa/templates/qa-report-template.md`.
+
+### Phase 6: Wrap Up
+
+1. **Compute health score** using the rubric below
+2. **Write "Top 3 Things to Fix"** — the 3 highest-severity issues
+3. **Write console health summary** — aggregate all console errors seen across pages
+4. **Update severity counts** in the summary table
+5. **Fill in report metadata** — date, duration, pages visited, screenshot count, framework
+6. **Save baseline** — write `baseline.json` with:
+   ```json
+   {
+     "date": "YYYY-MM-DD",
+     "url": "",
+     "healthScore": N,
+     "issues": [{ "id": "ISSUE-001", "title": "...", "severity": "...", "category": "..." }],
+     "categoryScores": { "console": N, "links": N, ... }
+   }
+   ```
+
+**Regression mode:** After writing the report, load the baseline file. Compare:
+- Health score delta
+- Issues fixed (in baseline but not current)
+- New issues (in current but not baseline)
+- Append the regression section to the report
+
+---
+
+## Health Score Rubric
+
+Compute each category score (0-100), then take the weighted average.
+
+### Console (weight: 15%)
+- 0 errors → 100
+- 1-3 errors → 70
+- 4-10 errors → 40
+- 10+ errors → 10
+
+### Links (weight: 10%)
+- 0 broken → 100
+- Each broken link → -15 (minimum 0)
+
+### Per-Category Scoring (Visual, Functional, UX, Content, Performance, Accessibility)
+Each category starts at 100. Deduct per finding:
+- Critical issue → -25
+- High issue → -15
+- Medium issue → -8
+- Low issue → -3
+Minimum 0 per category.
+
+### Weights
+| Category | Weight |
+|----------|--------|
+| Console | 15% |
+| Links | 10% |
+| Visual | 10% |
+| Functional | 20% |
+| UX | 15% |
+| Performance | 10% |
+| Content | 5% |
+| Accessibility | 15% |
+
+### Final Score
+`score = Σ (category_score × weight)`
+
+---
+
+## Framework-Specific Guidance
+
+### Next.js
+- Check console for hydration errors (`Hydration failed`, `Text content did not match`)
+- Monitor `_next/data` requests in network — 404s indicate broken data fetching
+- Test client-side navigation (click links, don't just `goto`) — catches routing issues
+- Check for CLS (Cumulative Layout Shift) on pages with dynamic content
+
+### Rails
+- Check for N+1 query warnings in console (if development mode)
+- Verify CSRF token presence in forms
+- Test Turbo/Stimulus integration — do page transitions work smoothly?
+- Check for flash messages appearing and dismissing correctly
+
+### WordPress
+- Check for plugin conflicts (JS errors from different plugins)
+- Verify admin bar visibility for logged-in users
+- Test REST API endpoints (`/wp-json/`)
+- Check for mixed content warnings (common with WP)
+
+### General SPA (React, Vue, Angular)
+- Use `snapshot -i` for navigation — `links` command misses client-side routes
+- Check for stale state (navigate away and back — does data refresh?)
+- Test browser back/forward — does the app handle history correctly?
+- Check for memory leaks (monitor console after extended use)
+
+---
+
+## Important Rules
+
+1. **Repro is everything.** Every issue needs at least one screenshot. No exceptions.
+2. **Verify before documenting.** Retry the issue once to confirm it's reproducible, not a fluke.
+3. **Never include credentials.** Write `[REDACTED]` for passwords in repro steps.
+4. **Write incrementally.** Append each issue to the report as you find it. Don't batch.
+5. **Never read source code.** Test as a user, not a developer.
+6. **Check console after every interaction.** JS errors that don't surface visually are still bugs.
+7. **Test like a user.** Use realistic data. Walk through complete workflows end-to-end.
+8. **Depth over breadth.** 5-10 well-documented issues with evidence > 20 vague descriptions.
+9. **Never delete output files.** Screenshots and reports accumulate — that's intentional.
+10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses.
+
+---
+
+## Output Structure
+
+```
+.gstack/qa-reports/
+├── qa-report-{domain}-{YYYY-MM-DD}.md    # Structured report
+├── screenshots/
+│   ├── initial.png                        # Landing page annotated screenshot
+│   ├── issue-001-step-1.png               # Per-issue evidence
+│   ├── issue-001-result.png
+│   └── ...
+└── baseline.json                          # For regression mode
+```
+
+Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md`
diff --git a/qa/references/issue-taxonomy.md b/qa/references/issue-taxonomy.md
new file mode 100644
index 00000000..05c57416
--- /dev/null
+++ b/qa/references/issue-taxonomy.md
@@ -0,0 +1,85 @@
+# QA Issue Taxonomy
+
+## Severity Levels
+
+| Severity | Definition | Examples |
+|----------|------------|----------|
+| **critical** | Blocks a core workflow, causes data loss, or crashes the app | Form submit causes error page, checkout flow broken, data deleted without confirmation |
+| **high** | Major feature broken or unusable, no workaround | Search returns wrong results, file upload silently fails, auth redirect loop |
+| **medium** | Feature works but with noticeable problems, workaround exists | Slow page load (>5s), form validation missing but submit still works, layout broken on mobile only |
+| **low** | Minor cosmetic or polish issue | Typo in footer, 1px alignment issue, hover state inconsistent |
+
+## Categories
+
+### 1. Visual/UI
+- Layout breaks (overlapping elements, clipped text, horizontal scrollbar)
+- Broken or missing images
+- Incorrect z-index (elements appearing behind others)
+- Font/color inconsistencies
+- Animation glitches (jank, incomplete transitions)
+- Alignment issues (off-grid, uneven spacing)
+- Dark mode / theme issues
+
+### 2. Functional
+- Broken links (404, wrong destination)
+- Dead buttons (click does nothing)
+- Form validation (missing, wrong, bypassed)
+- Incorrect redirects
+- State not persisting (data lost on refresh, back button)
+- Race conditions (double-submit, stale data)
+- Search returning wrong or no results
+
+### 3. UX
+- Confusing navigation (no breadcrumbs, dead ends)
+- Missing loading indicators (user doesn't know something is happening)
+- Slow interactions (>500ms with no feedback)
+- Unclear error messages ("Something went wrong" with no detail)
+- No confirmation before destructive actions
+- Inconsistent interaction patterns across pages
+- Dead ends (no way back, no next action)
+
+### 4. Content
+- Typos and grammar errors
+- Outdated or incorrect text
+- Placeholder / lorem ipsum text left in
+- Truncated text (cut off without ellipsis or "more")
+- Wrong labels on buttons or form fields
+- Missing or unhelpful empty states
+
+### 5. Performance
+- Slow page loads (>3 seconds)
+- Janky scrolling (dropped frames)
+- Layout shifts (content jumping after load)
+- Excessive network requests (>50 on a single page)
+- Large unoptimized images
+- Blocking JavaScript (page unresponsive during load)
+
+### 6. Console/Errors
+- JavaScript exceptions (uncaught errors)
+- Failed network requests (4xx, 5xx)
+- Deprecation warnings (upcoming breakage)
+- CORS errors
+- Mixed content warnings (HTTP resources on HTTPS)
+- CSP violations
+
+### 7. Accessibility
+- Missing alt text on images
+- Unlabeled form inputs
+- Keyboard navigation broken (can't tab to elements)
+- Focus traps (can't escape a modal or dropdown)
+- Missing or incorrect ARIA attributes
+- Insufficient color contrast
+- Content not reachable by screen reader
+
+## Per-Page Exploration Checklist
+
+For each page visited during a QA session:
+
+1. **Visual scan** — Take annotated screenshot (`snapshot -i -a -o`). Look for layout issues, broken images, alignment.
+2. **Interactive elements** — Click every button, link, and control. Does each do what it says?
+3. **Forms** — Fill and submit. Test empty submission, invalid data, edge cases (long text, special characters).
+4. **Navigation** — Check all paths in/out. Breadcrumbs, back button, deep links, mobile menu.
+5. **States** — Check empty state, loading state, error state, full/overflow state.
+6. **Console** — Run `console --errors` after interactions. Any new JS errors or failed requests?
+7. **Responsiveness** — If relevant, check mobile and tablet viewports.
+8. **Auth boundaries** — What happens when logged out? Different user roles?
diff --git a/qa/templates/qa-report-template.md b/qa/templates/qa-report-template.md
new file mode 100644
index 00000000..d118ab84
--- /dev/null
+++ b/qa/templates/qa-report-template.md
@@ -0,0 +1,79 @@
+# QA Report: {APP_NAME}
+
+| Field | Value |
+|-------|-------|
+| **Date** | {DATE} |
+| **URL** | {URL} |
+| **Scope** | {SCOPE or "Full app"} |
+| **Mode** | {full / quick / regression} |
+| **Duration** | {DURATION} |
+| **Pages visited** | {COUNT} |
+| **Screenshots** | {COUNT} |
+| **Framework** | {DETECTED or "Unknown"} |
+
+## Health Score: {SCORE}/100
+
+| Category | Score |
+|----------|-------|
+| Console | {0-100} |
+| Links | {0-100} |
+| Visual | {0-100} |
+| Functional | {0-100} |
+| UX | {0-100} |
+| Performance | {0-100} |
+| Accessibility | {0-100} |
+
+## Top 3 Things to Fix
+
+1. **{ISSUE-NNN}: {title}** — {one-line description}
+2. **{ISSUE-NNN}: {title}** — {one-line description}
+3. **{ISSUE-NNN}: {title}** — {one-line description}
+
+## Console Health
+
+| Error | Count | First seen |
+|-------|-------|------------|
+| {error message} | {N} | {URL} |
+
+## Summary
+
+| Severity | Count |
+|----------|-------|
+| Critical | 0 |
+| High | 0 |
+| Medium | 0 |
+| Low | 0 |
+| **Total** | **0** |
+
+## Issues
+
+### ISSUE-001: {Short title}
+
+| Field | Value |
+|-------|-------|
+| **Severity** | critical / high / medium / low |
+| **Category** | visual / functional / ux / content / performance / console / accessibility |
+| **URL** | {page URL} |
+
+**Description:** {What is wrong, expected vs actual.}
+
+**Repro Steps:**
+
+1. Navigate to {URL}
+   ![Step 1](screenshots/issue-001-step-1.png)
+2. {Action}
+   ![Step 2](screenshots/issue-001-step-2.png)
+3. **Observe:** {what goes wrong}
+   ![Result](screenshots/issue-001-result.png)
+
+---
+
+## Regression (if applicable)
+
+| Metric | Baseline | Current | Delta |
+|--------|----------|---------|-------|
+| Health score | {N} | {N} | {+/-N} |
+| Issues | {N} | {N} | {+/-N} |
+
+**Fixed since baseline:** {list}
+**New since baseline:** {list}
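
The Health Score Rubric in `qa/SKILL.md` and the Regression table in the report template can be illustrated with a small Python sketch. This is not part of the patch; it is one possible reading of the rubric, under stated assumptions. The weights are copied from the Weights table and the JSON field names (`healthScore`, `issues`, `title`, `category`) from the `baseline.json` example. The function names `health_score`, `issue_key`, and `compare_to_baseline` are hypothetical, and matching issues across runs by category plus normalized title is an assumption: the patch does not say how "fixed" and "new" issues are identified, and `ISSUE-NNN` ids are presumably reassigned on every run.

```python
# Weights in percent, copied from the "Weights" table in qa/SKILL.md.
WEIGHTS = {
    "console": 15, "links": 10, "visual": 10, "functional": 20,
    "ux": 15, "performance": 10, "content": 5, "accessibility": 15,
}


def health_score(category_scores):
    """Weighted average of 0-100 category scores: sum(score * weight)."""
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    # Weights are whole percents, so divide the weighted sum by 100.
    return sum(category_scores[name] * pct for name, pct in WEIGHTS.items()) / 100


def issue_key(issue):
    # ASSUMPTION: issues are matched across runs by (category, title),
    # case-insensitively. The patch does not define a matching rule.
    return (issue["category"].strip().lower(), issue["title"].strip().lower())


def compare_to_baseline(baseline, current):
    """Return the values the report's Regression section asks for."""
    base = {issue_key(i): i for i in baseline["issues"]}
    cur = {issue_key(i): i for i in current["issues"]}
    return {
        "score_delta": current["healthScore"] - baseline["healthScore"],
        "issue_count_delta": len(cur) - len(base),
        "fixed": [base[k]["title"] for k in base if k not in cur],  # in baseline only
        "new": [cur[k]["title"] for k in cur if k not in base],     # in current only
    }


if __name__ == "__main__":
    # 1-3 console errors (70), one broken link (85), one critical functional
    # issue (75), one low content issue (97); everything else clean.
    scores = {"console": 70, "links": 85, "visual": 100, "functional": 75,
              "ux": 100, "performance": 100, "content": 97, "accessibility": 100}
    print(health_score(scores))  # 88.85
```

With the scores in the demo, the weighted sum is 1050 + 850 + 1000 + 1500 + 1500 + 1000 + 485 + 1500 = 8885, which gives a health score of 88.85. Note that the weights cover eight categories, including Content, so a report that omits one of them cannot reproduce the score; the sketch raises an error in that case rather than guessing.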