* feat: add browse access to /debug for visual verification
Debug skill can now use the browse binary to visually reproduce bugs,
take screenshots as evidence, and verify fixes. This makes /debug
effective for web app bugs when spawned as a sub-agent from /qa.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add /debug sub-agent escalation to /qa (Phase 8g)
When QA fix attempts fail twice on the same bug (reverted due to
regressions), /qa now spawns a /debug sub-agent with a structured
bug brief including symptoms, repro steps, failed fix details, and
file paths. Results are reported in Phase 10's debug escalation summary.
Sequential execution: one debug investigation at a time, with the working
tree cleaned between investigations. Degrades gracefully on every failure
mode: a BLOCKED result or an agent failure is deferred in the report.
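The escalation flow described above could be sketched roughly as follows. This is an illustration only, not the skill's actual template: the `BugBrief` field names, the `FIXED` status, and the `investigate` and `cleanWorkingTree` callbacks are all hypothetical. Only the two-attempt trigger, the brief contents, the BLOCKED outcome, and the sequential processing come from the description.

```typescript
// Hypothetical shape of the structured bug brief handed to the debug sub-agent.
interface BugBrief {
  symptoms: string;
  reproSteps: string[];
  failedFixes: string[]; // what each reverted attempt changed
  filePaths: string[];
}

// BLOCKED is named in the commit message; FIXED is an assumed success status.
type DebugResult =
  | { status: "FIXED"; summary: string }
  | { status: "BLOCKED"; reason: string };

const MAX_FIX_ATTEMPTS = 2;

// Escalate once the same bug has had two fix attempts reverted.
function shouldEscalate(revertedAttempts: number): boolean {
  return revertedAttempts >= MAX_FIX_ATTEMPTS;
}

// Runs investigations one at a time. A BLOCKED result or a thrown error
// becomes a "deferred" entry, and the working tree is cleaned after each run.
async function escalateSequentially(
  briefs: BugBrief[],
  investigate: (brief: BugBrief) => Promise<DebugResult>,
  cleanWorkingTree: () => Promise<void>,
): Promise<string[]> {
  const report: string[] = [];
  for (const brief of briefs) {
    try {
      const result = await investigate(brief);
      report.push(
        result.status === "FIXED"
          ? `fixed: ${brief.symptoms}`
          : `deferred (blocked): ${brief.symptoms}`,
      );
    } catch {
      report.push(`deferred (agent failure): ${brief.symptoms}`);
    } finally {
      await cleanWorkingTree();
    }
  }
  return report;
}
```

Awaiting each investigation inside the loop, rather than using `Promise.all`, is what keeps the runs sequential and lets the cleanup happen between them.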
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add /debug recommendation to /review (Step 5.7)
When /review finds what appears to be a pre-existing bug in the base
branch (not introduced by the PR's diff), it now classifies it as
INFORMATIONAL and recommends running /debug for systematic root-cause
investigation. No Agent spawning — /review's scope stays on the diff.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add reverted QA commit detection to /ship
During pre-landing review, /ship now checks for reverted fix(qa):
commits in the branch history and recommends /debug for systematic
investigation. Informational only — does not block shipping.
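One way such a check could work is sketched below, assuming the reverts were created with `git revert`, whose default subject line is `Revert "<original subject>"`. The function name and sample subjects are hypothetical, and the actual skill may detect reverts differently.

```typescript
// Matches git's default revert subject when the reverted commit was a fix(qa): commit.
const REVERTED_QA_FIX = /^Revert "(fix\(qa\):.*)"$/;

// Returns the original subjects of every reverted fix(qa): commit.
function findRevertedQaFixes(commitSubjects: string[]): string[] {
  const revertedFixes: string[] = [];
  for (const subject of commitSubjects) {
    const match = REVERTED_QA_FIX.exec(subject);
    if (match) revertedFixes.push(match[1]);
  }
  return revertedFixes;
}

// Sample subjects, as might be produced by `git log --format=%s <base>..HEAD`.
const sampleSubjects = [
  'Revert "fix(qa): handle empty cart total"',
  "fix(qa): handle empty cart total",
  "feat: add checkout page",
];

const revertedQaFixes = findRevertedQaFixes(sampleSubjects);
if (revertedQaFixes.length > 0) {
  // Informational only: report the finding, do not fail the ship.
  console.log(
    `Found ${revertedQaFixes.length} reverted fix(qa): commit(s); consider running /debug.`,
  );
}
```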
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add debug escalation tests (validation + LLM judge + E2E)
Skill validation: 11 new assertions covering Phase 8g trigger, structured
handoff fields, agent result handlers, debug escalation summary, Step 5.7
recommendation, ship reverted QA detection, and debug browse setup.
LLM judge: evaluates Phase 8g template quality — structured brief format,
result handling, working tree cleanup, sequential processing.
E2E: prompt-level deterministic test (verifies escalation prompt has all
required fields) + full flow stub (fixture TODO for planted regression).
Touchfile entries for diff-based test selection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: add worktree parallel debug agents to TODOS.md (P2)
When /qa hits multiple stubborn bugs, parallel debug agents in
isolated git worktrees could investigate simultaneously. Deferred
from the sequential debug escalation PR as a follow-up.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.6.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add E2E evals for /review pre-existing bug + /ship reverted QA detection
Two new E2E tests:
- review-pre-existing-bug: plants SQL injection in base branch, verifies
Step 5.7 classifies as INFORMATIONAL and recommends /debug
- ship-reverted-qa-commits: creates branch with reverted fix(qa): commits,
verifies /ship detects them and recommends /debug
Also fixes qa-debug-prompt-logic to use correct workingDirectory, and
ensures test repo init uses -b main for portability.
All 4 debug-related evals pass: $0.34 total, 94s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: rename qa-design-review → design-review
The "qa-" prefix was confusing — this is the live-site design audit with
fix loop, not a QA-only report. Rename directory and update all references
across docs, tests, scripts, and skill templates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: interactive /plan-design-review + CEO invokes designer
Rewrite /plan-design-review from report-only grading to an interactive
plan-fixer that rates each design dimension 0-10, explains what a 10
looks like, and edits the plan to get there. Parallel structure with
/plan-ceo-review and /plan-eng-review — one issue = one AskUserQuestion.
CEO review now detects UI scope and invokes the designer perspective
when the plan has frontend/UX work, so you get design review
automatically when it matters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: validation + touchfile entries for 100% coverage
Add design-consultation to command/snapshot flag validation. Add 4
skills to contributor mode validation (plan-design-review,
design-review, design-consultation, document-release). Add 2 templates
to hardcoded branch check. Register touchfile entries for 10 new
LLM-judge tests and 1 new E2E test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: LLM-judge for 10 skills + gstack-upgrade E2E
Add LLM-judge quality evals for all uncovered skills using a DRY
runWorkflowJudge helper with section marker guards. Add real E2E
test for gstack-upgrade using mock git remote (replaces test.todo).
Add plan-edit assertion to plan-design-review E2E.
14/15 skills now at full coverage. setup-browser-cookies remains
deferred (needs real browser).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add bisect commit style to CLAUDE.md
All commits should be single logical changes, split before pushing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.6.4.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: diff-based test selection for E2E and LLM-judge evals
Each test declares file dependencies in a TOUCHFILES map. The test runner
checks git diff against the base branch and only runs tests whose
dependencies were modified. Global touchfiles (session-runner, eval-store,
gen-skill-docs) trigger all tests.
New scripts: test:e2e:all, test:evals:all, eval:select
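A minimal sketch of the selection logic, under stated assumptions: the test names, file paths, and the directory-prefix matching rule below are hypothetical, and the real TOUCHFILES map may use a different pattern syntax. The list of changed files is assumed to come from something like `git diff --name-only <base>...HEAD`.

```typescript
// Hypothetical map: test name -> files (or directory prefixes ending in "/") it depends on.
const TOUCHFILES: Record<string, string[]> = {
  "qa-debug-escalation": ["qa/", "debug/"],
  "review-pre-existing-bug": ["review/"],
};

// Hypothetical global touchfiles: a change to any of these selects every test.
const GLOBAL_TOUCHFILES = ["test/helpers/session-runner.ts"];

// A pattern ending in "/" matches any file under that directory; otherwise exact match.
function matchesPattern(changedFile: string, pattern: string): boolean {
  return pattern.endsWith("/")
    ? changedFile.startsWith(pattern)
    : changedFile === pattern;
}

// Returns the tests whose declared dependencies intersect the changed files.
function selectTests(changedFiles: string[]): string[] {
  const allTests = Object.keys(TOUCHFILES);
  const hitsGlobal = changedFiles.some((file) =>
    GLOBAL_TOUCHFILES.some((pattern) => matchesPattern(file, pattern)),
  );
  if (hitsGlobal) return allTests;
  return allTests.filter((name) =>
    TOUCHFILES[name].some((pattern) =>
      changedFiles.some((file) => matchesPattern(file, pattern)),
    ),
  );
}
```

With this shape, a docs-only change selects no evals, while a change to shared test infrastructure selects all of them.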
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: bump version and changelog (v0.6.1.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: plan-design-review-audit eval — bump turns to 30, add efficiency hints
The test was flaky at 20 turns because the agent reads a 300-line SKILL.md,
navigates, extracts design data, and writes a report. Added hints to skip
the preamble, batch commands, and write the report early, while still
testing the real SKILL.md. Now completes in ~13 turns consistently.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>