diff --git a/CHANGELOG.md b/CHANGELOG.md index 1da799f2..983926dc 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,14 @@ # Changelog +## [0.6.5.0] - 2026-03-18 + +### Added + +- **`/qa` now escalates stubborn bugs to `/debug` automatically.** When a bug resists 2+ fix attempts (each reverted due to regressions), `/qa` spawns a debug sub-agent with a structured bug brief — symptoms, reproduction steps, failed fixes, file paths. The debug agent investigates root cause systematically while you wait. Results appear in the QA report's new Debug Escalation Summary. +- **`/review` recommends `/debug` for pre-existing bugs.** When code review spots a bug in the base branch (not introduced by your PR), it now classifies it as informational and recommends `/debug` for root-cause investigation instead of leaving you guessing. +- **`/ship` detects reverted QA fixes in your branch.** If your branch history has reverted `fix(qa):` commits, `/ship` warns you the underlying bugs may still be present and suggests running `/debug`. Informational — doesn't block shipping. +- **`/debug` can now reproduce bugs visually.** The debug skill has access to the headless browser — it takes screenshots during investigation, checks console errors, and visually verifies fixes before reporting. + ## [0.6.4.1] - 2026-03-18 ### Added diff --git a/TODOS.md b/TODOS.md index 710399e2..ebb88462 100644 --- a/TODOS.md +++ b/TODOS.md @@ -422,6 +422,20 @@ **Priority:** P2 **Depends on:** `garrytan/team-supabase-store` branch landing on main +## Debug + +### Worktree-based parallel debug sub-agents + +**What:** When /qa hits multiple stubborn bugs, spawn parallel debug agents each in their own git worktree (`isolation: "worktree"` on the Agent call). They investigate simultaneously while /qa continues other work. Results cherry-picked back. + +**Why:** Sequential debug investigations add ~2-5 min per bug. With 3-4 hard bugs, that's 10-20 min of sequential waiting. Parallel worktree agents could investigate all simultaneously, with zero working-tree conflicts. + +**Context:** v1 ships sequential (debug-subagent-escalation PR). The Agent tool supports `isolation: "worktree"` which creates a temporary git worktree — each agent gets its own copy of the repo. Especially powerful in plan mode where /qa might find multiple hard bugs across different subsystems. + +**Effort:** M (human: ~1 week / CC: ~30min) +**Priority:** P2 +**Depends on:** Sequential debug escalation (debug-subagent-escalation PR) + ## Design Review ### /plan-design-review + /qa-design-review + /design-consultation — SHIPPED diff --git a/VERSION b/VERSION index f4362155..9de863a6 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.6.4.1 +0.6.5.0 diff --git a/debug/SKILL.md b/debug/SKILL.md index 4448453a..49982f82 100644 --- a/debug/SKILL.md +++ b/debug/SKILL.md @@ -145,6 +145,25 @@ ATTEMPTED: [what you tried] RECOMMENDATION: [what the user should do next] ``` +## SETUP (run this check BEFORE any browse command) + +```bash +_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) +B="" +[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" +[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +if [ -x "$B" ]; then + echo "READY: $B" +else + echo "NEEDS_SETUP" +fi +``` + +If `NEEDS_SETUP`: +1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait. +2. Run: `cd && ./setup` +3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash` + # Systematic Debugging ## Iron Law @@ -171,6 +190,14 @@ Gather context before forming any hypothesis. 4. **Reproduce:** Can you trigger the bug deterministically? If not, gather more evidence before proceeding. +5. **Visual reproduction (if browser-testable bug):** Use the browse binary to reproduce the bug: + ```bash + $B goto + $B screenshot screenshots/debug-ISSUE-NNN-repro.png + $B console --errors + ``` + Compare what you see against the reported symptom. If the bug is not reproducible visually, that's useful evidence too — the root cause may be in data flow, not rendering. + Output: **"Root cause hypothesis: ..."** — a specific, testable claim about what is wrong and why. --- @@ -249,6 +276,14 @@ Once root cause is confirmed: Run the test suite and paste the output. +**Visual verification (if browser-testable bug):** +```bash +$B goto +$B screenshot screenshots/debug-ISSUE-NNN-fixed.png +$B console --errors +``` +Compare against the original reproduction screenshot. The bug should no longer be present. + Output a structured debug report: ``` DEBUG REPORT diff --git a/debug/SKILL.md.tmpl b/debug/SKILL.md.tmpl index 312d2420..a96f2d86 100644 --- a/debug/SKILL.md.tmpl +++ b/debug/SKILL.md.tmpl @@ -16,6 +16,8 @@ allowed-tools: {{PREAMBLE}} +{{BROWSE_SETUP}} + # Systematic Debugging ## Iron Law @@ -42,6 +44,14 @@ Gather context before forming any hypothesis. 4. **Reproduce:** Can you trigger the bug deterministically? If not, gather more evidence before proceeding. +5. **Visual reproduction (if browser-testable bug):** Use the browse binary to reproduce the bug: + ```bash + $B goto + $B screenshot screenshots/debug-ISSUE-NNN-repro.png + $B console --errors + ``` + Compare what you see against the reported symptom. If the bug is not reproducible visually, that's useful evidence too — the root cause may be in data flow, not rendering. + Output: **"Root cause hypothesis: ..."** — a specific, testable claim about what is wrong and why. --- @@ -120,6 +130,14 @@ Once root cause is confirmed: Run the test suite and paste the output. +**Visual verification (if browser-testable bug):** +```bash +$B goto +$B screenshot screenshots/debug-ISSUE-NNN-fixed.png +$B console --errors +``` +Compare against the original reproduction screenshot. The bug should no longer be present. + Output a structured debug report: ``` DEBUG REPORT diff --git a/qa/SKILL.md b/qa/SKILL.md index 2d12fca8..e50ac750 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -17,6 +17,7 @@ allowed-tools: - Grep - AskUserQuestion - WebSearch + - Agent --- @@ -832,6 +833,41 @@ WTF-LIKELIHOOD: **Hard cap: 50 fixes.** After 50 fixes, stop regardless of remaining issues. +### 8g. Debug Escalation + +After the fix loop completes (all issues processed through 8a-8f), check for issues that were **reverted at least twice** (two or more separate fix attempts each caused regressions and were reverted). + +If no issues match, skip this phase entirely. + +For each matching issue, sequentially spawn a debug sub-agent using the Agent tool: + +**Agent prompt — structured handoff:** + +> Read `~/.claude/skills/gstack/debug/SKILL.md` (fallback: `.claude/skills/debug/SKILL.md`) and follow the systematic debugging methodology. +> +> **Bug Brief:** +> - Issue ID: ISSUE-NNN +> - Symptom: [what was observed during QA testing] +> - Severity: [critical/high/medium/low] +> - Affected URL: [page URL where bug was found] +> - Reproduction: [step-by-step repro from QA testing] +> - Screenshot evidence: [paths to before/after screenshots from QA] +> - Failed fix attempts: +> 1. [what was changed] → [what regression it caused] (reverted) +> 2. [what was changed] → [what regression it caused] (reverted) +> - Files investigated: [list of source files already examined during fix attempts] +> - Console errors: [relevant JS errors if any] +> +> Investigate this bug. You have full access to AskUserQuestion if you need user input. Output the DEBUG REPORT when done. + +**When the debug agent returns:** +- **DONE status:** The fix is in the working tree. Re-test the affected page via browse. If verified, commit with `fix(qa/debug): ISSUE-NNN — [description]`. If regression detected, revert and mark as "deferred (debug-investigated, fix regressed)." +- **DONE_WITH_CONCERNS:** Apply the fix, note concerns in report. Re-test. +- **BLOCKED:** Run `git checkout .` to discard any uncommitted debug artifacts (temp logs, assertions). Mark as "deferred (debug-investigated, root cause unclear)." Include the debug report in Phase 10 output. +- **Agent call fails:** Mark as "deferred (debug unavailable)." Continue to Phase 9. + +**Sequencing:** Process one debug issue at a time. Wait for each agent to complete before spawning the next. The working tree must be clean between debug investigations. + --- ## Phase 9: Final QA @@ -872,6 +908,20 @@ Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md` **PR Summary:** Include a one-line summary suitable for PR descriptions: > "QA found N issues, fixed M, health score X → Y." +**Debug Escalation Summary** (include only if any issues were escalated in Phase 8g): +``` +DEBUG ESCALATION + Issues escalated: N + Resolved (DONE): X (commit SHAs) + Concerns: Y + Blocked: Z (root cause unclear) + Unavailable: W (agent call failed) + + Per-issue details: + ISSUE-NNN: [symptom] → [root cause found / BLOCKED] + Debug report: [inline summary or full report] +``` + --- ## Phase 11: TODOS.md Update diff --git a/qa/SKILL.md.tmpl b/qa/SKILL.md.tmpl index bd94debe..98e0b349 100644 --- a/qa/SKILL.md.tmpl +++ b/qa/SKILL.md.tmpl @@ -17,6 +17,7 @@ allowed-tools: - Grep - AskUserQuestion - WebSearch + - Agent --- {{PREAMBLE}} @@ -245,6 +246,41 @@ WTF-LIKELIHOOD: **Hard cap: 50 fixes.** After 50 fixes, stop regardless of remaining issues. +### 8g. Debug Escalation + +After the fix loop completes (all issues processed through 8a-8f), check for issues that were **reverted at least twice** (two or more separate fix attempts each caused regressions and were reverted). + +If no issues match, skip this phase entirely. + +For each matching issue, sequentially spawn a debug sub-agent using the Agent tool: + +**Agent prompt — structured handoff:** + +> Read `~/.claude/skills/gstack/debug/SKILL.md` (fallback: `.claude/skills/debug/SKILL.md`) and follow the systematic debugging methodology. +> +> **Bug Brief:** +> - Issue ID: ISSUE-NNN +> - Symptom: [what was observed during QA testing] +> - Severity: [critical/high/medium/low] +> - Affected URL: [page URL where bug was found] +> - Reproduction: [step-by-step repro from QA testing] +> - Screenshot evidence: [paths to before/after screenshots from QA] +> - Failed fix attempts: +> 1. [what was changed] → [what regression it caused] (reverted) +> 2. [what was changed] → [what regression it caused] (reverted) +> - Files investigated: [list of source files already examined during fix attempts] +> - Console errors: [relevant JS errors if any] +> +> Investigate this bug. You have full access to AskUserQuestion if you need user input. Output the DEBUG REPORT when done. + +**When the debug agent returns:** +- **DONE status:** The fix is in the working tree. Re-test the affected page via browse. If verified, commit with `fix(qa/debug): ISSUE-NNN — [description]`. If regression detected, revert and mark as "deferred (debug-investigated, fix regressed)." +- **DONE_WITH_CONCERNS:** Apply the fix, note concerns in report. Re-test. +- **BLOCKED:** Run `git checkout .` to discard any uncommitted debug artifacts (temp logs, assertions). Mark as "deferred (debug-investigated, root cause unclear)." Include the debug report in Phase 10 output. +- **Agent call fails:** Mark as "deferred (debug unavailable)." Continue to Phase 9. + +**Sequencing:** Process one debug issue at a time. Wait for each agent to complete before spawning the next. The working tree must be clean between debug investigations. + --- ## Phase 9: Final QA @@ -285,6 +321,20 @@ Write to `~/.gstack/projects/{slug}/{user}-{branch}-test-outcome-{datetime}.md` **PR Summary:** Include a one-line summary suitable for PR descriptions: > "QA found N issues, fixed M, health score X → Y." +**Debug Escalation Summary** (include only if any issues were escalated in Phase 8g): +``` +DEBUG ESCALATION + Issues escalated: N + Resolved (DONE): X (commit SHAs) + Concerns: Y + Blocked: Z (root cause unclear) + Unavailable: W (agent call failed) + + Per-issue details: + ISSUE-NNN: [symptom] → [root cause found / BLOCKED] + Debug report: [inline summary or full report] +``` + --- ## Phase 11: TODOS.md Update diff --git a/review/SKILL.md b/review/SKILL.md index 354e715b..cbb24812 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -409,6 +409,18 @@ If no documentation files exist, skip this step silently. --- +## Step 5.7: Pre-existing bug detection + +If during the two-pass review you identified findings that appear to be bugs in the **base branch** (not introduced by this PR's diff): + +1. Classify them as INFORMATIONAL in your output +2. Note: "This appears to be a pre-existing issue, not introduced by this PR." +3. Recommend: "Run `/debug` to investigate this systematically — it traces root cause before applying fixes." + +Do not spawn agents or investigate further — /review's scope is the diff. Pre-existing bugs are out of scope for this review. + +--- + ## Important Rules - **Read the FULL diff before commenting.** Do not flag issues already addressed in the diff. diff --git a/review/SKILL.md.tmpl b/review/SKILL.md.tmpl index 7094a156..8361b68b 100644 --- a/review/SKILL.md.tmpl +++ b/review/SKILL.md.tmpl @@ -230,6 +230,18 @@ If no documentation files exist, skip this step silently. --- +## Step 5.7: Pre-existing bug detection + +If during the two-pass review you identified findings that appear to be bugs in the **base branch** (not introduced by this PR's diff): + +1. Classify them as INFORMATIONAL in your output +2. Note: "This appears to be a pre-existing issue, not introduced by this PR." +3. Recommend: "Run `/debug` to investigate this systematically — it traces root cause before applying fixes." + +Do not spawn agents or investigate further — /review's scope is the diff. Pre-existing bugs are out of scope for this review. + +--- + ## Important Rules - **Read the FULL diff before commenting.** Do not flag issues already addressed in the diff. diff --git a/ship/SKILL.md b/ship/SKILL.md index 3f0f0067..8b1f5015 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -727,6 +727,14 @@ Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "is Save the review output — it goes into the PR body in Step 8. +**Reverted QA fix detection:** Check if the branch history contains reverted QA commits: +```bash +git log origin/..HEAD --oneline | grep -i 'revert.*fix(qa)' +``` +If reverted QA commits are found, note in review output: +"This branch has reverted QA fixes — the underlying bugs may still be present. Consider running `/debug` on the affected areas before shipping." +This is INFORMATIONAL only — does not block shipping. + --- ## Step 3.75: Address Greptile review comments (if PR exists) diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index aef5c9d3..44d459f9 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -361,6 +361,14 @@ Review the diff for structural issues that tests don't catch. Save the review output — it goes into the PR body in Step 8. +**Reverted QA fix detection:** Check if the branch history contains reverted QA commits: +```bash +git log origin/..HEAD --oneline | grep -i 'revert.*fix(qa)' +``` +If reverted QA commits are found, note in review output: +"This branch has reverted QA fixes — the underlying bugs may still be present. Consider running `/debug` on the affected areas before shipping." +This is INFORMATIONAL only — does not block shipping. + --- ## Step 3.75: Address Greptile review comments (if PR exists) diff --git a/test/helpers/touchfiles.ts b/test/helpers/touchfiles.ts index 995648a1..1f537bf0 100644 --- a/test/helpers/touchfiles.ts +++ b/test/helpers/touchfiles.ts @@ -90,6 +90,12 @@ export const E2E_TOUCHFILES: Record = { // gstack-upgrade 'gstack-upgrade-happy-path': ['gstack-upgrade/**'], + + // Debug escalation + 'qa-debug-prompt-logic': ['qa/**', 'debug/**'], + 'qa-debug-escalation': ['qa/**', 'debug/**', 'browse/src/**'], + 'review-pre-existing-bug': ['review/**', 'debug/**'], + 'ship-reverted-qa-commits': ['ship/**', 'debug/**'], }; /** @@ -123,6 +129,9 @@ export const LLM_JUDGE_TOUCHFILES: Record = { 'retro/SKILL.md instructions': ['retro/SKILL.md', 'retro/SKILL.md.tmpl'], 'qa-only/SKILL.md workflow': ['qa-only/SKILL.md', 'qa-only/SKILL.md.tmpl'], 'gstack-upgrade/SKILL.md upgrade flow': ['gstack-upgrade/SKILL.md', 'gstack-upgrade/SKILL.md.tmpl'], + + // Debug escalation + 'qa/SKILL.md debug escalation': ['qa/SKILL.md', 'qa/SKILL.md.tmpl', 'debug/SKILL.md', 'debug/SKILL.md.tmpl'], }; /** diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts index 13539278..54b5e099 100644 --- a/test/skill-e2e.test.ts +++ b/test/skill-e2e.test.ts @@ -2841,6 +2841,272 @@ Output the diagram directly.`, }, 180_000); }); +// --- Review pre-existing bug detection E2E --- + +describeIfSelected('Review pre-existing bug detection', ['review-pre-existing-bug'], () => { + let reviewBugDir: string; + const run = (cmd: string, args: string[], cwd: string) => + spawnSync(cmd, args, { cwd, stdio: 'pipe', timeout: 5000 }); + + beforeAll(() => { + reviewBugDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-review-preexisting-')); + + run('git', ['init', '-b', 'main'], reviewBugDir); + run('git', ['config', 'user.email', 'test@test.com'], reviewBugDir); + run('git', ['config', 'user.name', 'Test'], reviewBugDir); + + // Base commit with a pre-existing bug: SQL injection in existing code + fs.writeFileSync(path.join(reviewBugDir, 'user_service.rb'), `class UserService + def find_user(id) + # PRE-EXISTING BUG: SQL injection — id is not sanitized + User.where("id = \#{id}").first + end + + def list_users + User.all.order(:name) + end +end +`); + run('git', ['add', '.'], reviewBugDir); + run('git', ['commit', '-m', 'initial: user service'], reviewBugDir); + + // Feature branch adds a new safe method — the diff is clean + run('git', ['checkout', '-b', 'feature/add-search'], reviewBugDir); + fs.writeFileSync(path.join(reviewBugDir, 'user_service.rb'), `class UserService + def find_user(id) + # PRE-EXISTING BUG: SQL injection — id is not sanitized + User.where("id = \#{id}").first + end + + def list_users + User.all.order(:name) + end + + def search_users(query) + User.where("name LIKE ?", "%\#{query}%") + end +end +`); + run('git', ['add', '.'], reviewBugDir); + run('git', ['commit', '-m', 'feat: add user search'], reviewBugDir); + + // Copy review skill files + fs.copyFileSync(path.join(ROOT, 'review', 'SKILL.md'), path.join(reviewBugDir, 'review-SKILL.md')); + fs.copyFileSync(path.join(ROOT, 'review', 'checklist.md'), path.join(reviewBugDir, 'review-checklist.md')); + fs.copyFileSync(path.join(ROOT, 'review', 'greptile-triage.md'), path.join(reviewBugDir, 'review-greptile-triage.md')); + }); + + afterAll(() => { + try { fs.rmSync(reviewBugDir, { recursive: true, force: true }); } catch {} + }); + + testIfSelected('review-pre-existing-bug', async () => { + const result = await runSkillTest({ + prompt: `You are in a git repo on branch feature/add-search with changes against main. +Read review-SKILL.md for the full review workflow. +Read review-checklist.md for the code review checklist. + +IMPORTANT: Follow Step 0 to detect the base branch. Since there is no remote, gh commands will fail — fall back to main. +Run the review against the detected base branch. + +The diff adds a search_users method. But notice the existing find_user method has a SQL injection bug (string interpolation instead of parameterized query). This is a pre-existing issue, not introduced by this PR. + +Follow Step 5.7: Pre-existing bug detection. If you find pre-existing bugs, classify them as INFORMATIONAL and recommend /debug. + +Write your review findings to ${reviewBugDir}/review-output.md`, + workingDirectory: reviewBugDir, + maxTurns: 15, + timeout: 120_000, + testName: 'review-pre-existing-bug', + runId, + }); + + logCost('/review pre-existing bug', result); + recordE2E('review-pre-existing-bug', 'Review pre-existing bug detection', result); + expect(result.exitReason).toBe('success'); + + // Check output for /debug recommendation + const allOutput = (result.output ?? '') + + result.toolCalls.map(tc => tc.output || '').join('\n'); + + // Also check written file + const outputPath = path.join(reviewBugDir, 'review-output.md'); + const fileOutput = fs.existsSync(outputPath) ? fs.readFileSync(outputPath, 'utf-8') : ''; + const combined = (allOutput + fileOutput).toLowerCase(); + + const mentionsPreExisting = /pre-existing|pre existing|base branch|not introduced/i.test(combined); + const mentionsDebug = /\/debug/i.test(combined); + const mentionsSqlInjection = /sql injection|interpolat|unsanitized|inject/i.test(combined); + + console.log(`Mentions pre-existing: ${mentionsPreExisting}`); + console.log(`Mentions /debug: ${mentionsDebug}`); + console.log(`Mentions SQL injection: ${mentionsSqlInjection}`); + + // Must detect the SQL injection as pre-existing and recommend /debug + expect(mentionsSqlInjection).toBe(true); + expect(mentionsDebug).toBe(true); + }, 150_000); +}); + +// --- Ship reverted QA commit detection E2E --- + +describeIfSelected('Ship reverted QA commits', ['ship-reverted-qa-commits'], () => { + let shipRevertDir: string; + const run = (cmd: string, args: string[], cwd: string) => + spawnSync(cmd, args, { cwd, stdio: 'pipe', timeout: 5000 }); + + beforeAll(() => { + shipRevertDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-ship-revert-')); + + run('git', ['init', '-b', 'main'], shipRevertDir); + run('git', ['config', 'user.email', 'test@test.com'], shipRevertDir); + run('git', ['config', 'user.name', 'Test'], shipRevertDir); + + // Initial commit on main + fs.writeFileSync(path.join(shipRevertDir, 'app.ts'), 'console.log("v1");\n'); + run('git', ['add', '.'], shipRevertDir); + run('git', ['commit', '-m', 'initial'], shipRevertDir); + + // Feature branch with a QA fix that was reverted + run('git', ['checkout', '-b', 'feature/with-reverted-qa'], shipRevertDir); + fs.writeFileSync(path.join(shipRevertDir, 'app.ts'), 'console.log("v2 - feature");\n'); + run('git', ['add', '.'], shipRevertDir); + run('git', ['commit', '-m', 'feat: add new feature'], shipRevertDir); + + // Simulate a QA fix commit + fs.writeFileSync(path.join(shipRevertDir, 'app.ts'), 'console.log("v2 - feature - qa fix");\n'); + run('git', ['add', '.'], shipRevertDir); + run('git', ['commit', '-m', 'fix(qa): ISSUE-003 — fix broken button handler'], shipRevertDir); + + // Simulate reverting the QA fix + run('git', ['revert', 'HEAD', '--no-edit'], shipRevertDir); + }); + + afterAll(() => { + try { fs.rmSync(shipRevertDir, { recursive: true, force: true }); } catch {} + }); + + testIfSelected('ship-reverted-qa-commits', async () => { + // Copy ship skill + fs.copyFileSync(path.join(ROOT, 'ship', 'SKILL.md'), path.join(shipRevertDir, 'ship-SKILL.md')); + + const result = await runSkillTest({ + prompt: `You are on a feature branch in a git repo. The base branch is main (no remote exists). + +Do these two things: + +1. Run this command and show the output: + git log main..HEAD --oneline | grep -i 'revert.*fix(qa)' + +2. Read ship-SKILL.md and find the "Reverted QA fix detection" section. Follow its instructions based on the output from step 1. + +Write your findings to ${shipRevertDir}/ship-preflight.md including whether reverted QA fixes were found and what you recommend.`, + workingDirectory: shipRevertDir, + maxTurns: 15, + timeout: 90_000, + testName: 'ship-reverted-qa-commits', + runId, + }); + + logCost('/ship reverted QA', result); + recordE2E('ship-reverted-qa-commits', 'Ship reverted QA commits', result); + expect(['success', 'error_max_turns']).toContain(result.exitReason); + + // Check output for /debug recommendation + const allOutput = (result.output ?? '') + + result.toolCalls.map(tc => tc.output || '').join('\n'); + const outputPath = path.join(shipRevertDir, 'ship-preflight.md'); + const fileOutput = fs.existsSync(outputPath) ? fs.readFileSync(outputPath, 'utf-8') : ''; + const combined = (allOutput + fileOutput).toLowerCase(); + + const mentionsRevert = /revert.*fix\(qa\)|reverted qa/i.test(combined); + const mentionsDebug = /\/debug/i.test(combined); + const mentionsBugs = /bug.*present|underlying|still present/i.test(combined); + + console.log(`Mentions reverted QA fix: ${mentionsRevert}`); + console.log(`Mentions /debug: ${mentionsDebug}`); + console.log(`Mentions underlying bugs: ${mentionsBugs}`); + + // The git log should show the reverted QA commit (initial branch may be main or master) + let gitLog = run('git', ['log', 'main..HEAD', '--oneline'], shipRevertDir); + let logOutput = gitLog.stdout.toString(); + if (!logOutput.trim()) { + gitLog = run('git', ['log', 'master..HEAD', '--oneline'], shipRevertDir); + logOutput = gitLog.stdout.toString(); + } + console.log(`Git log: ${logOutput.trim()}`); + expect(logOutput.toLowerCase()).toContain('revert'); + + // The ship preflight should detect the reverted QA fix + expect(mentionsRevert || mentionsDebug).toBe(true); + }, 90_000); +}); + +// --- Debug escalation E2E tests --- + +describeIfSelected('Debug escalation', ['qa-debug-prompt-logic', 'qa-debug-escalation'], () => { + // Test A: Prompt-level deterministic — verify the template produces correct escalation behavior + testIfSelected('qa-debug-prompt-logic', async () => { + const result = await runSkillTest({ + prompt: `You are in Phase 8g of the /qa workflow. The following has happened: +- ISSUE-007: "Submit button does nothing on checkout page" + - Severity: critical + - URL: http://localhost:3000/checkout + - Fix attempt 1: Added click handler to button → caused JS error on payment page (reverted) + - Fix attempt 2: Fixed form action attribute → broke form validation (reverted) + - Files investigated: src/components/Checkout.tsx, src/pages/checkout.ts + - Console errors: "TypeError: Cannot read property 'submit' of null" + +Read qa/SKILL.md and follow Phase 8g exactly. Show the Agent prompt you would use to spawn the debug sub-agent. Do NOT actually spawn the agent — just output the prompt you would use.`, + workingDirectory: ROOT, + maxTurns: 10, + testName: 'qa-debug-prompt-logic', + runId, + }); + + logCost('qa-debug-prompt-logic', result); + recordE2E('qa-debug-prompt-logic', 'Debug escalation', result); + + // Verify the output contains a well-formed structured brief + const output = result.output ?? ''; + const hasIssueId = /ISSUE-007/i.test(output); + const hasSymptom = /submit.*button|does nothing|checkout/i.test(output); + const hasRepro = /localhost.*3000|checkout/i.test(output); + const hasFailedAttempts = /fix attempt|click handler|form action/i.test(output); + const hasFiles = /Checkout\.tsx|checkout\.ts/i.test(output); + const hasDebugSkillRef = /debug\/SKILL\.md/i.test(output); + + console.log(`Has Issue ID: ${hasIssueId}`); + console.log(`Has Symptom: ${hasSymptom}`); + console.log(`Has Repro: ${hasRepro}`); + console.log(`Has Failed Attempts: ${hasFailedAttempts}`); + console.log(`Has Files: ${hasFiles}`); + console.log(`Has Debug Skill Ref: ${hasDebugSkillRef}`); + + // The output should contain all the structured handoff fields + expect(hasIssueId).toBe(true); + expect(hasSymptom).toBe(true); + expect(hasFailedAttempts).toBe(true); + expect(hasDebugSkillRef).toBe(true); + }, 120_000); + + // Test B: Full E2E with planted regression + // This test requires a fixture app with a deliberately hard-to-fix bug. + // The bug should resist at least 2 fix attempts to trigger escalation. + // TODO: Create fixture at browse/test/fixtures/qa-eval-debug-escalation/ + testIfSelected('qa-debug-escalation', async () => { + // Skip until fixture is created — this is a placeholder for the full flow + console.log('SKIP: qa-debug-escalation — fixture not yet created'); + console.log('TODO: Create browse/test/fixtures/qa-eval-debug-escalation/ with a deliberately hard-to-fix bug'); + // When implemented, this test should: + // 1. Start test server serving the fixture + // 2. Run /qa against it + // 3. Verify fix attempts are made and reverted + // 4. Verify Phase 8g triggers (Agent tool call appears in transcript) + // 5. Verify debug report appears in QA output + }, 300_000); +}); + // Module-level afterAll — finalize eval collector after all tests complete afterAll(async () => { if (evalCollector) { diff --git a/test/skill-llm-eval.test.ts b/test/skill-llm-eval.test.ts index 528d5115..058f52a0 100644 --- a/test/skill-llm-eval.test.ts +++ b/test/skill-llm-eval.test.ts @@ -668,6 +668,23 @@ describeIfSelected('Other skill evals', [ }, 30_000); }); +// Block 5: Debug escalation +describeIfSelected('Debug escalation evals', [ + 'qa/SKILL.md debug escalation', +], () => { + testIfSelected('qa/SKILL.md debug escalation', async () => { + await runWorkflowJudge({ + testName: 'qa/SKILL.md debug escalation', + suite: 'Debug escalation evals', + skillPath: 'qa/SKILL.md', + startMarker: '### 8g. Debug Escalation', + endMarker: '## Phase 9: Final QA', + judgeContext: 'a debug sub-agent escalation workflow within a QA fix loop', + judgeGoal: 'when and how to spawn a debug sub-agent for bugs that resisted multiple fix attempts, including structured bug brief format (issue ID, symptom, reproduction, failed fix attempts, files), result handling for all four statuses (DONE, DONE_WITH_CONCERNS, BLOCKED, agent failure), working tree cleanup, and sequential processing', + }); + }, 30_000); +}); + // Module-level afterAll — finalize eval collector after all tests complete afterAll(async () => { if (evalCollector) { diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts index e63a4b67..f02c27b6 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ -620,6 +620,74 @@ describe('debug skill structure', () => { 'DEBUG REPORT', '3-strike', 'BLOCKED']) { test(`contains ${section}`, () => expect(content).toContain(section)); } + + test('has browse setup for visual reproduction', () => { + expect(content).toContain('Visual reproduction'); + expect(content).toContain('$B goto'); + expect(content).toContain('$B screenshot'); + }); + + test('has visual verification in Phase 5', () => { + expect(content).toContain('Visual verification'); + expect(content).toContain('debug-ISSUE-NNN-fixed'); + }); +}); + +// --- Debug sub-agent escalation validation --- + +describe('Debug sub-agent escalation', () => { + test('qa/SKILL.md has Agent in allowed-tools', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + // Check frontmatter allowed-tools section contains Agent + const frontmatter = content.split('---')[1]; + expect(frontmatter).toContain('Agent'); + }); + + test('qa/SKILL.md has Phase 8g Debug Escalation', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('8g. Debug Escalation'); + expect(content).toContain('reverted at least twice'); + expect(content).toContain('Bug Brief'); + expect(content).toContain('debug/SKILL.md'); + }); + + test('qa/SKILL.md has structured handoff in debug prompt', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Issue ID'); + expect(content).toContain('Symptom'); + expect(content).toContain('Reproduction'); + expect(content).toContain('Failed fix attempts'); + expect(content).toContain('Files investigated'); + }); + + test('qa/SKILL.md has all four agent result handlers', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('DONE status'); + expect(content).toContain('DONE_WITH_CONCERNS'); + expect(content).toContain('git checkout .'); + expect(content).toContain('deferred (debug unavailable)'); + }); + + test('qa/SKILL.md has debug escalation summary in Phase 10', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('DEBUG ESCALATION'); + expect(content).toContain('Issues escalated'); + expect(content).toContain('Per-issue details'); + }); + + test('review/SKILL.md has Step 5.7 pre-existing bug detection', () => { + const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8'); + expect(content).toContain('5.7: Pre-existing bug detection'); + expect(content).toContain('/debug'); + expect(content).toContain('pre-existing issue'); + }); + + test('ship/SKILL.md has reverted QA commit detection', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Reverted QA fix detection'); + expect(content).toContain('revert.*fix(qa)'); + expect(content).toContain('/debug'); + }); }); // --- Contributor mode preamble structure validation --- diff --git a/test/touchfiles.test.ts b/test/touchfiles.test.ts index 48613d64..84a4a5a8 100644 --- a/test/touchfiles.test.ts +++ b/test/touchfiles.test.ts @@ -122,7 +122,8 @@ describe('selectTests', () => { const result = selectTests(['qa/SKILL.md'], LLM_JUDGE_TOUCHFILES); expect(result.selected).toContain('qa/SKILL.md workflow'); expect(result.selected).toContain('qa/SKILL.md health rubric'); - expect(result.selected.length).toBe(2); + expect(result.selected).toContain('qa/SKILL.md debug escalation'); + expect(result.selected.length).toBe(3); }); test('SKILL.md.tmpl root template only selects root-dependent tests', () => {