gstack/TODO.md at garrytan/ref-screenshot-fix

Garry Tan 76803d789a feat: 3-tier eval suite with planted-bug outcome testing (EVALS=1)

Adds comprehensive eval infrastructure:
- Tier 1 (free): 13 new static tests — cross-skill path consistency, QA
  structure validation, greptile format, planted-bug fixture validation
- Tier 2 (Agent SDK E2E): /qa quick, /review with pre-built git repo,
  3 planted-bug outcome evals (static, SPA, checkout — each with 5 bugs)
- Tier 3 (LLM judge): QA workflow quality, health rubric clarity,
  cross-skill consistency, baseline score pinning

New fixtures: 3 HTML pages with 15 total planted bugs, ground truth JSON,
review-eval-vuln.rb, eval-baselines.json. Shared llm-judge.ts helper (DRY).
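
As a rough illustration of planted-bug outcome testing, the sketch below scores an agent's QA report against a ground-truth list. The `PlantedBug` shape, the keyword-matching rule, and the `scoreReport` name are all assumptions made for this example; the actual ground truth JSON schema and matching logic are not shown in this commit message.

```typescript
// Hypothetical ground-truth entry; the real JSON schema may differ.
interface PlantedBug {
  id: string;
  keywords: string[]; // any one keyword appearing in the report counts as a detection
}

interface Score {
  detected: string[];
  missed: string[];
  rate: number; // fraction of planted bugs found, 0..1
}

// Assumed scoring rule: case-insensitive substring match on keywords.
function scoreReport(report: string, bugs: PlantedBug[]): Score {
  const text = report.toLowerCase();
  const detected: string[] = [];
  const missed: string[] = [];
  for (const bug of bugs) {
    const hit = bug.keywords.some((k) => text.includes(k.toLowerCase()));
    (hit ? detected : missed).push(bug.id);
  }
  const rate = bugs.length === 0 ? 0 : detected.length / bugs.length;
  return { detected, missed, rate };
}
```

A baseline file such as eval-baselines.json could then pin a minimum `rate` per fixture, though how the pinning is actually done is not stated here.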

Unified EVALS=1 flag replaces SKILL_E2E + ANTHROPIC_API_KEY checks.
`bun run test:evals` runs everything that costs money (~$4/run).
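
A minimal sketch of how a single flag can gate the paid tiers, assuming the check is a plain comparison against the string "1". The `evalsEnabled` and `tiersToRun` helpers and the tier labels are hypothetical; only the `EVALS=1` flag itself comes from the commit message.

```typescript
// Environment passed in explicitly so the helper is easy to test.
type Env = Record<string, string | undefined>;

// Assumption: evals are on only when EVALS is exactly "1".
function evalsEnabled(env: Env): boolean {
  return env.EVALS === "1";
}

// Tier 1 is free and always runs; Tiers 2 and 3 cost money and need the flag.
function tiersToRun(env: Env): string[] {
  const tiers = ["tier1-static"];
  if (evalsEnabled(env)) {
    tiers.push("tier2-e2e", "tier3-llm-judge");
  }
  return tiers;
}
```

With this shape, a test file would call `evalsEnabled(process.env)` once and skip its paid suites when the result is false, instead of checking several variables separately.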

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TODO — gstack roadmap

Phase 1: Foundations (v0.2.0)

Phase 2: Enhanced Browser (v0.2.0) ✅

Phase 3: QA Testing Agent (v0.3.0)

Phase 3.5: Browser Cookie Import (v0.3.x)

Phase 3.6: Visual PR Annotations + S3 Upload

Phase 4: Skill + Browser Integration

Phase 5: State & Sessions

Phase 6: Advanced Browser

Future Ideas

Ideas & Notes
