Commit Graph

31 Commits

Author SHA1 Message Date
Garry Tan 17c1c06cd9 feat: diff-based test selection for E2E and LLM-judge evals (v0.6.1.0) (#139)
* feat: diff-based test selection for E2E and LLM-judge evals

Each test declares file dependencies in a TOUCHFILES map. The test runner
checks git diff against the base branch and only runs tests whose
dependencies were modified. Global touchfiles (session-runner, eval-store,
gen-skill-docs) trigger all tests.

New scripts: test:e2e:all, test:evals:all, eval:select

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.6.1.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: plan-design-review-audit eval — bump turns to 30, add efficiency hints

The test was flaky at 20 turns because the agent reads a 300-line SKILL.md,
navigates, extracts design data, and writes a report. Added hints to skip
preamble/batch commands/write early while still testing the real SKILL.md.
Now completes in ~13 turns consistently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 18:45:41 -05:00
Garry Tan 9d47619e4c feat: Completeness Principle — Boil the Lake (v0.6.1) (#140)
* feat: Completeness Principle — Boil the Lake (WIP, pre-merge)

Add Completeness Principle to all skill preambles, dual-time estimates,
compression table, anti-pattern gallery, Lake Score, and completeness
gaps review category. VERSION/CHANGELOG will be rebased after merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update stale version reference in TODOS.md (v0.5.3 → v0.6.1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update CHANGELOG date + README for v0.6.1 features

- Add date to CHANGELOG 0.6.1 entry
- Add Completeness Principle to README intro
- Add SELECTIVE EXPANSION mode to CEO review section
- Add test bootstrap mention to /ship section
- Fix uninstall command missing design-consultation in project uninstall
- Add "recommends shortcuts" and "no tests" to Without gstack list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: split README into lean intro + docs/ directory (gh CLI pattern)

README: 875 → 243 lines. Keeps intro, skill table, demo, install, and
troubleshooting. All per-skill deep dives, Greptile integration guide,
and contributor mode docs moved to docs/ directory.

- docs/skills.md — full philosophy and examples for all 13 skills
- docs/greptile.md — Greptile setup and triage workflow
- docs/contributor-mode.md — how to enable and use contributor mode
- README now links to docs/ via Documentation table
- Updated skill table entries with latest features (fix-first, regression
  tests, test health, completeness gaps)
- Updated demo transcript with AUTO-FIXED, coverage audit, regression test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove "competitor" language, rewrite README in Garry's voice

Replace "browses competitors" with "knows the landscape" / "what's out
there" throughout all user-facing copy. Trim README from 243 to 167
lines — tighter, more opinionated, less listicle energy. Remove
Completeness Principle from README top (it lives in CLAUDE.md and the
skill preambles where Claude actually reads it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite README in Garry's raw voice — AGI era, L8 factory, real stories

The README now sounds like Garry, not a product page. Leads with the
live experiment, the 16k LOC/day reality, the real-life coding stories
(Austin, hospital bedside). Highlights the newest unlocks (design at
the heart, /qa parallelism, smart review routing, test bootstrap).
Closes with an open invitation — free MIT, fork it, let's all ride
the wave together.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Garry's bonafides to README intro — Palantir, Posterous, YC, 600k LOC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add real /retro numbers — 140k lines, 362 commits across 3 projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add "in the last 60 days" timeframe to 600k LOC claim

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add GitHub contribution graphs — 2026 vs 2013 side by side

Same person, different era. 2013: 772 contributions building Bookface.
2026: 1,237 contributions and accelerating. The difference is the tooling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clarify /retro stats are from last 7 days

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add designer/PM/eng manager roles to intro

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove Josh/L8 reference from README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: move demo up, make it dramatically more impressive

Show the actual architecture diagram, auto-fixed issues, 100% coverage,
regression test generation. Punch line: "That is not a copilot. That is
a team."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove "My journey" section — intro already covers it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: prefix all skill commands with You: in demo transcript

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: collapse You/Claude lines in demo — no gap between command and response

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clarify plan mode flow in demo — approve, exit, Claude implements

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: move /ship to end of demo — review → QA → ship is the real flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add /plan-design-review to demo, tighten CEO response

Shorter CEO reply, compressed eng diagram, added design audit with
AI Slop score. Seven commands now: plan → eng → build → design →
review → QA → ship.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: move design review before implementation — it's part of planning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: reorder demo — design before eng, after CEO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove URL from /plan-design-review in demo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add [...] annotations showing what actually happens at each step

Each step now shows what the agent does under the hood: 8 expansion
proposals cherry-picked, 80-item design audit, ASCII diagrams for
every flow, 2400 lines written in 8 minutes, real browser QA, bug
found and fixed. Makes the demo feel real, not abstract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rename Contributor Mode to How to Contribute in docs table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Coinbase, Instacart, Rippling to YC bonafides

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add "one or two people in a garage" to founder story

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add skill table to top of skills.md with anchor links

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: consolidate — roll contributor-mode into CONTRIBUTING, greptile into skills

- docs/contributor-mode.md → merged into CONTRIBUTING.md (session awareness section)
- docs/greptile.md → merged into docs/skills.md (Greptile integration section)
- Reordered docs table: Skills > Architecture > Browser > Contributing > Changelog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 16:34:08 -05:00
Garry Tan 1f3b691411 feat: /gstack-upgrade detects and syncs stale vendored copies (v0.5.4.1) (#137)
When the global gstack is already up to date, standalone /gstack-upgrade
now checks if the local vendored copy in the current project is at a
different version and syncs it automatically. Also adds rollback on
setup failure and update-check fallback matching the preamble pattern.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 13:06:14 -05:00
Garry Tan a2d756f945 feat: Test Bootstrap + Regression Tests + Coverage Audit (v0.6.0) (#136)
* feat: test bootstrap, regression tests, coverage audit, retro test health

- Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts
- Add Phase 8e.5 regression test generation to /qa and /qa-design-review
- Add Step 3.4 test coverage audit with quality scoring to /ship
- Add test health tracking to /retro
- Add 2 E2E evals (bootstrap + coverage audit)
- Add 26 validation tests
- Update ARCHITECTURE.md placeholder table
- Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests)

* chore: bump version and changelog (v0.6.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: make coverage audit trace actual codepaths, not just syntax patterns

Step 3.4 now instructs Claude to read full files, trace data flow through
every branch, diagram the execution, and check each branch against tests.
Phase 8e.5 regression tests now trace the bug's codepath before writing
the test, catching adjacent edge cases.

* feat: coverage audit now maps user flows, interactions, and error states

Step 3.4 now covers the full picture: code branches AND user-facing behavior.
Maps user flows (complete journey through the feature), interaction edge cases
(double-click, back button, stale state, slow connection), error states
(what does the user actually see?), and boundary states (zero results,
10k results, max-length input). Coverage diagram splits into Code Path
Coverage and User Flow Coverage sections with separate percentages.

* fix: raise test gen cap to 20, add validation tests for user flow coverage

- Raise Step 3.4 test generation cap from 10 to 20 (code + user flow combined)
- Add 3 validation tests: codepath tracing, user flow mapping, diagram sections

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 13:05:18 -05:00
Garry Tan b65a464d37 feat: always-full eng review + ship review gate persistence (v0.5.4) (#135)
Remove SMALL/BIG CHANGE menu from /plan-eng-review — every plan gets the
full interactive review. Scope reduction is now proactive (only when
complexity check triggers) rather than a menu item.

Add review gate override persistence to /ship — when the user says "ship
anyway" or "not relevant", that decision is saved to the branch's
reviews.jsonl so subsequent /ship runs don't re-ask.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 12:41:44 -05:00
Garry Tan 5e9f0e78f2 feat: SELECTIVE EXPANSION + smarter ship gates (v0.5.3) (#134)
* feat: SELECTIVE EXPANSION mode + user control for CEO review

Add 4th mode to /plan-ceo-review: SELECTIVE EXPANSION holds current scope
as baseline but surfaces expansion opportunities one by one for cherry-picking.
All modes now present every scope-expanding idea as individual AskUserQuestion
calls — user opts in or out of each one. EXPANSION recommends enthusiastically,
SELECTIVE recommends neutrally. CEO plan persistence writes decisions to disk.

* feat: review dashboard — eng required, CEO/design optional

Only Eng Review gates shipping. CEO Review recommended for big product
changes, Design Review for UI work — both informational only. Adds
skip_eng_review global config to disable the gate entirely.

* chore: bump version and changelog (v0.5.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 12:22:10 -05:00
Garry Tan c99757b522 feat: /design-consultation — risk-taking, visual research, ambitious preview (v0.5.2) (#131)
* feat: /design-consultation — risk-taking thesis, visual research, ambitious preview

Add SAFE/RISK breakdown to design proposals so users see which choices
match category conventions vs. which are deliberate creative departures.

Wire browse binary for visual competitive research — agent browses
competitor sites, takes screenshots, and analyzes fonts/colors/spacing
with graceful degradation to WebSearch-only or built-in knowledge.

Upgrade preview page instructions to render realistic product mockups
(dashboards, marketing pages, settings forms) instead of just swatches.

Rewrite README section with the thesis: "coherence is table stakes —
the real question is where you take risks."

* chore: bump version and changelog (v0.5.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restore SKILL.md files to match main

Prior commit included SKILL.md files regenerated from stale templates.
Restore to match origin/main content.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:49:22 -05:00
Garry Tan 73b00b4e29 feat: Review Readiness Dashboard + gstack-slug helper (v0.5.1) (#130)
* feat: add bin/gstack-slug helper + migrate all inline SLUG computation

Extract the opaque SLUG sed pipeline into a shared 5-line shell script.
Replace 8 inline copies across templates with eval $(gstack-slug).
Sanitizes branch names (/ → -) to prevent subdirectory creation.

* feat: review readiness dashboard — track CEO/Eng/Design reviews per branch

Each review skill logs its result to JSONL. A shared {{REVIEW_DASHBOARD}}
placeholder displays run counts, timestamps, and a CLEARED TO SHIP verdict.
/ship pre-flight reads the dashboard and prompts when reviews are missing.

* chore: bump version and changelog (v0.5.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:33:46 -05:00
Garry Tan 5f41cd9ad7 feat: show screenshots to user during QA and browse sessions (v0.5.0.1) (#129)
Add rule 11 to QA and Design methodologies in gen-skill-docs.ts
instructing Claude to Read screenshot PNGs after taking them.
This makes screenshots visible as clickable elements in Conductor
and other Claude Code UIs. Also added to browse and gstack SKILL
templates.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:30:19 -05:00
Garry Tan 4a77cc2c34 feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102)
* feat: add {{DESIGN_METHODOLOGY}} resolver and register design review skills

Add generateDesignMethodology() to gen-skill-docs.ts with 10-category, 80-item
design audit checklist. Register plan-design-review and qa-design-review templates
in findTemplates(). Add both skills to skill-check.ts SKILL_FILES. Add command
and snapshot flag validation tests for both skills in skill-validation.test.ts.

* feat: add /plan-design-review and /qa-design-review skills

/plan-design-review: report-only designer audit with letter grades, AI slop
scoring, structured first impression, design system extraction, DESIGN.md
inference and export offer. Never modifies code.

/qa-design-review: same audit, then iterative fix loop with style(design):
commits, CSS-safe WTF heuristic, before/after screenshots, final re-audit.

* chore: bump version and changelog (v0.5.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update README, ARCHITECTURE for design review skills (v0.5.0)

- Update skill count to 11, add /plan-design-review and /qa-design-review
  to skill table, install/uninstall commands, and demo walkthrough
- Add narrative sections: "senior designer mode" and "designer who codes mode"
  with compelling examples showing AI Slop detection and design system inference
- Add {{DESIGN_METHODOLOGY}} to ARCHITECTURE.md placeholder table
- Extend demo to show full plan→eng→review→ship→qa→design-review pipeline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: regenerate design review SKILL.md files after merge from main

Picks up BASE_BRANCH_DETECT resolver and updated contributor mode from main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /design-consultation skill — design consultant that creates DESIGN.md

6-phase consultant flow: product context → competitive research (WebSearch) →
complete coherent proposal → drill-downs on demand → font+color preview page →
write DESIGN.md + update CLAUDE.md. Opinionated recommendations grounded in
product context, not menu-driven forms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add E2E tests for design skill family (7 tests + LLM quality judge)

Tests 1-4: /design-consultation (core flow, research integration, existing
DESIGN.md handling, font+color preview generation).
Tests 5-6: /plan-design-review (audit report, DESIGN.md export).
Test 7: /qa-design-review (audit + fix loop).
LLM judge validates font blacklist compliance, coherence, and AI slop avoidance.
Also adds plan-design-review + qa-design-review to ALL_SKILLS test array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: mark /design-consultation as shipped in TODOS.md

Renamed from /setup-design-md to reflect the consultant approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 21:55:07 -05:00
Garry Tan a30f7079da feat: Fix-First Review — auto-fix obvious issues, ask about hard ones (v0.4.5) (#116)
* feat: Fix-First Review — auto-fix obvious issues, ask about hard ones

Replace the CRITICAL-only AskUserQuestion flow with Fix-First:
- Every finding gets action (not just critical ones)
- AUTO-FIX items (dead code, N+1, stale comments) applied directly
- ASK items (security, race conditions, design decisions) batched
  into at most one AskUserQuestion
- Fix-First Heuristic in checklist.md (single source of truth)
- Gate Classification → Severity Classification rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: polish CHANGELOG v0.4.5 voice — lead with user benefit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:52:50 -05:00
Garry Tan 318ffdbdf0 fix: js statement wrapping + click auto-routes option to selectOption (v0.4.5) (#117)
* fix: js statement wrapping + click auto-routes option to selectOption

Bug 1: js command wrapped all code as expressions — const, semicolons,
and multi-line code broke with SyntaxError. Added needsBlockWrapper()
and wrapForEvaluate() helpers (shared with eval) to detect statements
and use block wrapper {…} instead of expression wrapper (…).

Bug 2: clicking <option> refs hung forever because Playwright can't
.click() native select UI. Click handler now checks ARIA role + DOM
tagName and auto-routes to selectOption() via parent <select>.

Bug 3: click timeouts on <option> elements gave no guidance. Now
throws helpful error: "Use browse select instead of click."

* chore: bump version and changelog (v0.4.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 21:50:43 -05:00
Garry Tan c86faa7968 fix: update check cache — 60min UP_TO_DATE TTL + --force flag (v0.4.4) (#110)
* fix: split update check cache TTL + add --force flag

UP_TO_DATE cache now expires after 60 min (was 720 min / 12 hours).
UPGRADE_AVAILABLE keeps 720 min TTL to keep nagging.

--force flag deletes cache before checking, used by /gstack-upgrade
standalone invocation to always get a fresh result from GitHub.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /gstack-upgrade standalone uses --force for fresh check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 14:14:15 -05:00
Garry Tan a68244ab57 feat: /document-release skill — post-ship doc updates (v0.4.3) (#109)
* docs: update project documentation for v0.4.2

- README: skill count 9→10, added /document-release to skills table,
  install/uninstall sections, and dedicated section with example
- CHANGELOG: added /document-release bullet to v0.4.2 entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /document-release skill with smart VERSION handling

New skill runs after /ship but before PR merge. Reads every doc file,
cross-references the diff, auto-updates factual changes, asks about
risky edits. CHANGELOG clobber protection: never uses Write tool on
CHANGELOG.md, only Edit with exact old_string matches.

Smart VERSION logic: instead of silently skipping already-bumped
versions, compares CHANGELOG entry scope against full diff and asks
if significant uncovered changes exist.

Also fixes gstack-upgrade/SKILL.md missing from skill-check.ts
SKILL_FILES array (existing inconsistency with gen-skill-docs.ts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /review Step 5.6 — documentation staleness check

Review skill now cross-references code changes against doc files.
If a doc describes a feature that changed but the doc wasn't updated,
flags it as INFORMATIONAL with a pointer to /document-release.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: /document-release E2E with CHANGELOG clobber guard

E2E test creates a repo with existing CHANGELOG entries, runs
/document-release, and asserts original entries survive. Critical
guardrail against the incident where an agent replaced CHANGELOG
entries during conflict resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump to v0.4.3 — /document-release skill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files after merge

* chore: regenerate SKILL.md files after merge

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 12:30:22 -05:00
Garry Tan 276d0cc6cb feat: always-on ELI16 + branch detection (v0.4.3) (#108)
* feat: always-on ELI16 + branch detection in preamble

- Add _BRANCH detection to preamble bash block (git branch --show-current)
- Merge ELI16 rules into default AskUserQuestion format (always-on)
- Remove _SESSIONS >= 3 conditional — better questions always
- Add simplification rules: plain English, no jargon, no raw function names
- Update tests for branch detection and simplification regression guard

* chore: bump version and changelog (v0.4.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 12:27:47 -05:00
Garry Tan 78e519e3b7 feat: await support in browse js/eval + contributor mode v2 (#104)
* feat: support await in $B js and eval commands

Auto-wrap await expressions in async IIFE context so
$B js "await fetch(...)" works without SyntaxError.

- hasAwait() strips comments before detection
- js: expression wrapping (async()=>(expr))()
- eval: smart wrapping — single-line=expression, multi-line=block
- 6 new unit tests covering async, false-positive, and return semantics

* feat: redesign contributor mode — periodic reflection with 0-10 rating

Replace passive "report when things break" with active reflection:
- Rate gstack experience 0-10 at workflow step boundaries
- Historical calibration example (await bug) anchors the reporting bar
- "What would make this a 10" field focuses on actionable improvements
- Removed category lists in favor of judgment-based assessment

* test: add deterministic contributor mode preamble validation

40 new skill-validation tests (4 checks × 10 skills) verify:
- 0-10 rating scale present
- Calibration example present
- "What would make this a 10" field present
- Periodic reflection (not per-command)

Update existing E2E contributor eval for new report format.

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve contributor mode + qa-quick E2E reliability

Contributor mode:
- Add "do not truncate" directive to template — agent was stopping
  after "My rating" without completing Steps/Raw output/What would
  make this a 10 sections
- Restore assertions for Steps to reproduce and Date footer

QA quick:
- Make test server URL prominent: top of prompt, explicit "already
  running" and "do NOT discover ports" instructions
- Bump session timeout 180s→240s and test timeout 240s→300s
- Set B= at top of prompt (was buried in prose)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use flexible assertions for contributor mode E2E

Agent writes thorough reports with creative section names
("Repro Steps" vs "Steps to reproduce"). Match intent not formatting:
- /repro|steps to reproduce/ for reproduction steps
- /date.*2026/ for date footer presence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add E2E eval failure blame protocol

"Not related to our changes" is an extraordinary claim that requires
extraordinary proof. When evals fail during /ship:

1. Run the same eval on main — prove it fails there too
2. If it passes on main, it IS your change — trace the blame
3. If you can't verify, say "unverified" not "pre-existing"

Added to CLAUDE.md and as a comment in skill-e2e.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md and BROWSER.md for v0.4.2

CONTRIBUTING.md: update contributor mode description — now describes
periodic 0-10 reflection loop instead of passive friction detection.

BROWSER.md: add js/eval async documentation — await expressions are
auto-wrapped in async context, single-line eval returns values directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: restore v0.4.2 changelog entries lost during cherry-pick conflict

The base branch detection entries from main were dropped when resolving
the CHANGELOG conflict — should have merged both sets, not replaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 11:28:58 -05:00
Garry Tan 1e06b6a5c6 fix: dynamic base branch detection across all SKILL templates (v0.3.10) (#81)
* feat: add {{BASE_BRANCH_DETECT}} resolver to gen-skill-docs

DRY placeholder for dynamic base branch detection across PR-targeting
skills. Detects via gh pr view (existing PR base) → gh repo view
(repo default) → fallback to main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: ship skill detects base branch instead of hardcoding main

Replaces ~14 hardcoded 'main' references with dynamic detection via
{{BASE_BRANCH_DETECT}}. Fixes stacked branches and Conductor workspaces
targeting non-main branches. Adds --base <base> to gh pr create.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review, qa, plan-ceo-review detect base branch dynamically

Same pattern as ship: replaces hardcoded 'main' with {{BASE_BRANCH_DETECT}}.
Also cleans up qa bash-isms (REPORT_DIR variable, port chaining).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: retro detects default branch instead of hardcoding origin/main

Retro queries commit history (not PR targets), so uses simpler detection:
gh repo view defaultBranchRef. Replaces ~11 origin/main refs with
origin/<default>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add explicit cross-step references in gstack-upgrade template

Bash blocks are self-contained, but cross-block variable references
(INSTALL_DIR from Step 2) were implicit. Adds prose making them explicit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs+test: SKILL authoring guidance + regression tests

Adds "Writing SKILL templates" section to CLAUDE.md explaining that
templates are prompts, not scripts. Adds validation test catching
hardcoded 'main' in git commands, and resolver content test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update ARCHITECTURE + CONTRIBUTING for new placeholders

Add {{BASE_BRANCH_DETECT}} to ARCHITECTURE.md placeholder list.
Cross-reference CLAUDE.md template authoring guidance from CONTRIBUTING.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.10)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing blank line between resolver functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add 3 E2E smoke tests for base branch detection

- /review: verifies Step 0 detection + git diff against detected base
- /ship: truncated dry-run (Steps 0-1 only, no push/PR), asserts no
  destructive actions
- /retro: verifies default branch detection for git log queries

Covers the {{BASE_BRANCH_DETECT}} resolver path (review), the ship
template's dual abort check, and retro's inline detection pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 10:59:13 -05:00
Garry Tan 3e3843c4a9 feat: contributor mode, session awareness, recommendation format (#90)
* feat: contributor mode, session awareness, universal RECOMMENDATION format

- Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates
- Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions)
- ELI16 mode when 3+ concurrent sessions detected (re-ground user on context)
- Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/
- Universal AskUserQuestion format: context → question → RECOMMENDATION → options
- Update plan-ceo-review and plan-eng-review to reference preamble baseline
- Add vendored symlink awareness section to CLAUDE.md
- Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing
- Add tests for contributor mode and session awareness in generated output
- Add E2E eval for contributor mode report filing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Enum & Value Completeness to /review critical checklist

New CRITICAL review category that traces new enum values, status strings,
and type constants through every consumer outside the diff. Catches the
class of bugs where a new value is added but not handled in all switch/case
chains, allowlists, or frontend-backend contracts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add CHANGELOG style guide — user-facing, sell the feature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite v0.4.1 changelog to be user-facing and sell the features

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add evals for RECOMMENDATION format, session awareness, and enum completeness

Free tests (Tier 1): RECOMMENDATION format + session awareness in all
preamble SKILL.md files, enum completeness checklist structure and CRITICAL
classification.

E2E eval: /review catches missed enum handlers when a new status value
is added but not handled in case/switch and notify methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add E2E eval for session awareness ELI16 mode

Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments
branch, verifies the output re-grounds the user with project, branch,
context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: contributor mode eval marked FAIL due to expected browse error

The test intentionally runs a nonexistent binary to trigger contributor
mode. The session runner's browse error detection catches "no such file
or directory...browse" and sets browseErrors, causing recordE2E to mark
passed=false. Override passed to check only exitReason since the browse
error is the expected scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 01:45:50 -05:00
Garry Tan f3ee0ee28a feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)
* feat: browser ref staleness detection via async count() validation

resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: qa-only skill, qa fix loop, plan-to-QA artifact flow

Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: eval efficiency metrics — turns, duration, commentary across all surfaces

Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0

- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add user-facing benefit descriptions to v0.4.0 changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 23:55:39 -05:00
Garry Tan bb46ca6b21 feat: smart update check with auto-upgrade, snooze backoff, config CLI (v0.3.9) (#62)
* feat: add bin/gstack-config CLI for reading/writing ~/.gstack/config.yaml

Simple get/set/list interface for persistent gstack configuration.
Used by update-check and upgrade skill for auto_upgrade and update_check settings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: smart update check with 12h cache, snooze backoff, config disable

- Reduce cache TTL from 24h to 12h for faster update detection
- Add exponential snooze backoff: 24h → 48h → 1 week (resets on new version)
- Add update_check: false config option to disable checks entirely
- Clear snooze file on just-upgraded
- 14 new tests covering snooze levels, expiry, corruption, and config paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: upgrade skill with auto-upgrade, 4-option prompt, vendored sync

- Auto-upgrade mode via config or GSTACK_AUTO_UPGRADE=1 env var
- 4-option AskUserQuestion: upgrade once, always, not now, never
- Step 4.5: sync local vendored copy after upgrading primary install
- Snooze write with escalating backoff on "Not now"
- Update preamble text in gen-skill-docs for new upgrade flow
- Regenerate all SKILL.md files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: simplify upgrade instructions, move auto-upgrade to completed

README now points to /gstack-upgrade instead of long paste commands.
Auto-upgrade TODO moved to Completed section (v0.3.8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.9)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:28:02 -07:00
Garry Tan 41141007c1 feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61)
* fix: log non-ENOENT errors in ensureStateDir() instead of silently swallowing

Replace bare catch {} with ENOENT-only silence. Non-ENOENT errors (EACCES,
ENOSPC) are now logged to .gstack/browse-server.log. Includes test for
permission-denied scenario with chmod 444.

* feat: merge TODO.md + TODOS.md into unified backlog with shared format reference

Merge TODO.md (roadmap) and TODOS.md (near-term) into one file organized by
skill/component with P0-P4 priority ordering and Completed section. Add shared
review/TODOS-format.md for canonical format. Add static validation tests.

* feat: add 2-tier Greptile reply system with escalation detection

Add reply templates (Tier 1 friendly, Tier 2 firm), explicit escalation
detection algorithm, and severity re-ranking guidance to greptile-triage.md.

* feat: cross-skill TODOS awareness + Greptile template refs in all skills

/ship Step 5.5: auto-detect completed TODOs, offer reorganization.
/review Step 5.5: cross-reference PR against open TODOs.
/plan-ceo-review, /plan-eng-review: TODOS context in planning.
/retro: Backlog Health metric. /qa: bug TODO context in diff-aware mode.
All Greptile-aware skills now reference reply templates and escalation detection.

* chore: bump version and changelog (v0.3.8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md for v0.3.8 changes

Clarify test tier cost table (Tier 3 standalone vs combined), add TODOS.md
to "Things to know", mention Greptile triage in ship workflow description.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 20:15:11 -07:00
Garry Tan 2aa745cb0e feat: screenshot element/region clipping (v0.3.7) (#56)
* feat: screenshot element/region clipping (--clip, --viewport, CSS/@ref)

Add element crop (CSS selector or @ref), region clip (--clip x,y,w,h),
and viewport-only (--viewport) modes to the screenshot command. Uses
Playwright's native locator.screenshot() and page.screenshot({ clip }).
Full page remains the default. Includes 10 new tests covering all modes
and error paths.

* chore: bump version and changelog (v0.3.7)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add screenshot modes to BROWSER.md command reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:47:42 -07:00
Garry Tan 4ace0c2f6f chore: bump version and changelog (v0.3.6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 12:44:41 -05:00
Garry Tan 1717ed2891 fix: browse binary discovery broken for agents (v0.3.5) (#44)
* fix: replace find-browse with direct path in SKILL.md setup blocks

Agents were skipping the find-browse binary and guessing bin/browse
(wrong path). Now the setup block explicitly checks browse/dist/browse
with workspace-local priority, global fallback.

Also adds || true to update check to prevent misleading exit code 1.

Adds {{UPDATE_CHECK}} and {{BROWSE_SETUP}} template placeholders to
gen-skill-docs.ts so all skills share a single source of truth.

* refactor: convert qa/ and setup-browser-cookies/ to .tmpl templates

Replaces hardcoded update check and find-browse blocks with
{{UPDATE_CHECK}} and {{BROWSE_SETUP}} placeholders. Both skills
are now generated from templates via gen-skill-docs.

* test: add e2e and LLM eval tests for SKILL.md setup block

- 3 Agent SDK e2e tests: happy path, NEEDS_SETUP, non-git-repo
- LLM eval: setup block clarity + actionability >= 4
- New error pattern: 'no such file or directory.*browse'

These tests catch the exact failure mode where agents can't discover
the browse binary via SKILL.md instructions.

* chore: bump version and changelog (v0.3.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 00:24:06 -07:00
Garry Tan 6b69c46a27 feat: daily update check + /gstack-upgrade skill (v0.3.4) (#42)
* feat: add daily update check script + /gstack-upgrade skill

bin/gstack-update-check: pure bash, checks VERSION against remote once/day,
outputs UPGRADE_AVAILABLE or JUST_UPGRADED. Uses ~/.gstack/ for state.

gstack-upgrade/SKILL.md: new skill with inline upgrade flow for all preambles.
Detects global-git, local-git, vendored installs. Shows What's New from CHANGELOG.

browse/test/gstack-update-check.test.ts: 10 test cases covering all branch paths.

* refactor: remove version check from find-browse, simplify to binary locator

Delete checkVersion(), readCache(), writeCache(), fetchRemoteSHA(),
resolveSkillDir(), CacheEntry interface, REPO_URL/CACHE_PATH/CACHE_TTL
constants, and META output from find-browse.ts.

Version checking is now handled by bin/gstack-update-check (previous commit).

* feat: add update check preamble to all 9 skills

Every skill now runs bin/gstack-update-check on invocation. If an upgrade
is available, reads gstack-upgrade/SKILL.md inline upgrade flow.

Also adds AskUserQuestion to 5 skills that lacked it (gstack root, browse,
qa, retro, setup-browser-cookies) and Bash to plan-eng-review.

Simplifies qa and setup-browser-cookies setup blocks (removes META parsing).

* chore: bump version and changelog (v0.3.4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unused import + add corrupt cache test

Address pre-landing review findings:
- Remove unused mkdirSync import from gstack-update-check.test.ts
- Add Path I test: corrupt cache file falls through to remote fetch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 22:17:25 -07:00
Garry Tan a468374272 fix: enrich SKILL.md docs to pass LLM evals, upgrade judge to Sonnet 4.6 (#43)
* fix: enrich command descriptions and snapshot flags for LLM eval quality

14 command descriptions enriched with specific arg formats, valid values,
error behavior, and return types. Fixed header usage from <name> <value>
to <name>:<value>. Added cookie usage syntax. Snapshot flags now show
long names, ref numbering, and output format examples.

* refactor: auto-generate server.ts help text from COMMAND_DESCRIPTIONS

Replace hand-maintained help block with generateHelpText() that reads
from COMMAND_DESCRIPTIONS and SNAPSHOT_FLAGS. Eliminates help text
drift from source of truth.

* test: add usage consistency and pipe guard tests

Usage consistency test cross-checks Usage: patterns in implementation
against COMMAND_DESCRIPTIONS using structural skeleton comparison.
Pipe guard test ensures descriptions don't contain | which would break
markdown table rendering.

* chore: upgrade eval judge to Sonnet 4.6, update changelog

Switch LLM-as-judge evals from Haiku to Sonnet 4.6 for more stable,
nuanced scoring. Add changelog entry for all eval improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 22:14:14 -07:00
Garry Tan 5205070299 feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41)
* refactor: extract command registry to commands.ts, add SNAPSHOT_FLAGS metadata

- NEW: browse/src/commands.ts — command sets + COMMAND_DESCRIPTIONS + load-time validation (zero side effects)
- server.ts imports from commands.ts instead of declaring sets inline
- snapshot.ts: SNAPSHOT_FLAGS array drives parseSnapshotArgs (metadata-driven, no duplication)
- All 186 existing tests pass

* feat: SKILL.md template system with auto-generated command references

- SKILL.md.tmpl + browse/SKILL.md.tmpl with {{COMMAND_REFERENCE}} and {{SNAPSHOT_FLAGS}} placeholders
- scripts/gen-skill-docs.ts generates SKILL.md from templates (supports --dry-run)
- Build pipeline runs gen:skill-docs before binary compilation
- Generated files have AUTO-GENERATED header, committed to git

* test: Tier 1 static validation — 34 tests for SKILL.md command correctness

- test/helpers/skill-parser.ts: extracts $B commands from code blocks, validates against registry
- test/skill-parser.test.ts: 13 parser/validator unit tests
- test/skill-validation.test.ts: 13 tests validating all SKILL.md files + registry consistency
- test/gen-skill-docs.test.ts: 8 generator tests (categories, sorting, freshness)

* feat: DX tools (skill:check, dev:skill) + Tier 2 E2E test scaffolding

- scripts/skill-check.ts: health summary for all SKILL.md files (commands, templates, freshness)
- scripts/dev-skill.ts: watch mode for template development
- test/helpers/session-runner.ts: Agent SDK wrapper for E2E skill tests
- test/skill-e2e.test.ts: 2 E2E tests + 3 stubs (auto-skip inside Claude Code sessions)
- E2E tests must run from plain terminal: SKILL_E2E=1 bun test test/skill-e2e.test.ts

* ci: SKILL.md freshness check on push/PR + TODO updates

- .github/workflows/skill-docs.yml: fails if generated SKILL.md files are stale
- TODO.md: add E2E cost tracking and model pinning to future ideas

* fix: restore rich descriptions lost in auto-generation

- Snapshot flags: add back value hints (-d <N>, -s <sel>, -o <path>)
- Snapshot flags: restore parenthetical context (@e refs, @c refs, etc.)
- Commands: is → includes valid states enum
- Commands: console → notes --errors filter behavior
- Commands: press → lists common keys (Enter, Tab, Escape)
- Commands: cookie-import-browser → describes picker UI
- Commands: dialog-accept → specifies alert/confirm/prompt
- Tips: restore → arrow (was downgraded to ->)

* test: quality evals for generated SKILL.md descriptions

Catches the exact regressions we shipped and caught in review:
- Snapshot flags must include value hints (-d <N>, -s <sel>, -o <path>)
- is command must list all valid states (visible/hidden/enabled/...)
- press command must list example keys (Enter, Tab, Escape)
- console command must describe --errors behavior
- Snapshot -i must mention @e refs, -C must mention @c refs
- All descriptions must be >= 8 chars (no empty stubs)
- Tips section must use → not ->

* feat: LLM-as-judge evals for SKILL.md documentation quality

4 eval tests using Anthropic API (claude-haiku, ~$0.01-0.03/run):
- Command reference table: clarity/completeness/actionability >= 4/5
- Snapshot flags section: same thresholds
- browse/SKILL.md overall quality
- Regression: generated version must score >= hand-maintained baseline

Requires ANTHROPIC_API_KEY. Auto-skips without it.
Run: bun run test:eval (or ANTHROPIC_API_KEY=sk-... bun test test/skill-llm-eval.test.ts)

* chore: bump version to 0.3.3, update changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add ARCHITECTURE.md, update CLAUDE.md and CONTRIBUTING.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: conductor.json lifecycle hooks + .env propagation across worktrees

bin/dev-setup now copies .env from main worktree so API keys carry
over to Conductor workspaces automatically. conductor.json wires up
setup and archive hooks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: complete CHANGELOG for v0.3.3 (architecture, conductor, .env)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 21:08:12 -07:00
Garry Tan 07b4e15b34 feat: v0.3.2 — project-local state, diff-aware QA, Greptile integration (#36)
* fix: cookie import picker returns JSON instead of HTML

jsonResponse() was defined at module scope but referenced `url` which
only existed as a parameter of handleCookiePickerRoute(). Every API call
crashed, the catch block also crashed, and Bun returned a default HTML
page that the frontend couldn't parse as JSON.

Thread port via corsOrigin() helper and options objects. Add route-level
tests to prevent this class of bug from shipping again.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add help command to browse server

Agents that don't have SKILL.md loaded (or misread flags) had no way to
self-discover the CLI. The help command returns a formatted reference of
all commands and snapshot flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: version-aware find-browse with META signal protocol

Agents in other workspaces found stale browse binaries that were missing
newer flags. find-browse now compares the local binary's git SHA against
origin/main via git ls-remote (4hr cache), and emits META:UPDATE_AVAILABLE
when behind. SKILL.md setup checks parse META signals and prompt the user
to update.

- New compiled binary: browse/dist/find-browse (TypeScript, testable)
- Bash shim at browse/bin/find-browse delegates to compiled binary
- .version file written at build time with git commit SHA
- Build script compiles both browse and find-browse binaries
- Graceful degradation: offline, missing .version, corrupt cache all skip check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: clean up .bun-build temp files after compile

bun build --compile leaves ~58MB temp files in the working directory.
Add rm -f .*.bun-build to the build script to clean up after each build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make help command reachable by removing it from META_COMMANDS

help was in META_COMMANDS, so it dispatched to handleMetaCommand() which
threw "Unknown meta command: help". Removing it from the set lets the
dedicated else-if handler in handleCommand() execute correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add shared Greptile comment triage reference doc

Shared reference for fetching, filtering, and classifying Greptile
review comments on GitHub PRs. Used by both /review and /ship skills.
Includes parallel API fetching, suppressions check, classification
logic, reply APIs, and history file writes.

* feat: make /review and /ship Greptile-aware

/review: Step 2.5 fetches and classifies Greptile comments, Step 5
resolves them with AskUserQuestion for valid issues and false positives.

/ship: Step 3.75 triages Greptile comments between pre-landing review
and version bump. Adds Greptile Review section to PR body in Step 8.
Re-runs tests if any Greptile fixes are applied.

* feat: add Greptile batting average to /retro

Reads ~/.gstack/greptile-history.md, computes signal ratio
(valid catches vs false positives), includes in metrics table,
JSON snapshot, and Code Quality Signals narrative.

* docs: add Greptile integration section to README

Personal endorsement, two-layer review narrative, full UX walkthrough
transcript, skills table updates. Add Greptile training feedback loop
to TODO.md future ideas.

* feat: add local dev mode for testing skills from within the repo

bin/dev-setup creates .claude/skills/gstack symlink to the working tree
so Claude Code discovers skills locally. bin/dev-teardown cleans up.
DEVELOPING_GSTACK.md documents the workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: narrow gitignore to .claude/skills/ instead of all .claude/

Avoids ignoring legitimate Claude Code config like settings.json or CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: rename DEVELOPING_GSTACK.md to CONTRIBUTING.md

Rewritten as a contributor-friendly guide instead of a dry plan doc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: explain why dev-setup is needed in CONTRIBUTING.md quick start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add browser interaction guidance to CLAUDE.md

Prevents Claude from using mcp__claude-in-chrome__* tools instead of /browse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add shared config module for project-local browse state

Centralizes path resolution (git root detection, state dir, log paths) into
config.ts. Both cli.ts and server.ts import from it, eliminating duplicated
PORT_OFFSET/BROWSE_PORT/STATE_FILE logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: rewrite port selection to use random ports

Replace CONDUCTOR_PORT magic offset and 9400-9409 scan with random port
10000-60000. Atomic state file writes, log paths from config module,
binaryVersion field for auto-restart on update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: move browse state from /tmp to project-local .gstack/

CLI now uses config module for state paths, passes BROWSE_STATE_FILE to
spawned server. Adds version mismatch auto-restart, legacy /tmp cleanup
with PID verification, and removes stale global install fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update crash log path reference to .gstack/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add config tests and update CLI lifecycle test

14 new tests for config resolution, ensureStateDir, readVersionHash,
resolveServerScript, and version mismatch detection. Remove obsolete
CONDUCTOR_PORT/BROWSE_PORT filtering from commands.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update BROWSER.md and TODO.md for project-local state

Replace /tmp paths with .gstack/, remove CONDUCTOR_PORT docs, document
random port selection and per-project isolation. Add server bundling TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update README, CHANGELOG, and CONTRIBUTING for v0.3.2

- README: replace Conductor-aware language with project-local isolation,
  add Greptile setup note
- CHANGELOG: comprehensive v0.3.2 entry with all state management changes
- CONTRIBUTING: add instructions for testing branches in other repos

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add diff-aware mode to /qa — auto-tests affected pages from branch diff

When on a feature branch, /qa now reads git diff main, identifies affected
pages/routes from changed files, and tests them automatically. No URL required.
The most natural flow: write code, /ship, /qa.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: update CHANGELOG for complete v0.3.2 coverage

Add missing entries: diff-aware QA mode, Greptile integration,
local dev mode, crash log path fix, README/SKILL.md updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 18:10:56 -07:00
Garry Tan f7b95329c1 feat: Phase 3.5 — cookie import, QA testing, team retro (v0.3.1) (#29)
* Phase 2: Enhanced browser — dialog handling, upload, state checks, snapshots

- CircularBuffer O(1) ring buffer for console/network/dialog (was O(n) array+shift)
- Async buffer flush with Bun.write() (was appendFileSync)
- Dialog auto-accept/dismiss with buffer + prompt text support
- File upload command (upload <sel> <file...>)
- Element state checks (is visible/hidden/enabled/disabled/checked/editable/focused)
- Annotated screenshots with ref labels overlaid (-a flag)
- Snapshot diffing against previous snapshot (-D flag)
- Cursor-interactive element scan for non-ARIA clickables (-C flag)
- Snapshot scoping depth limit (-d N flag)
- Health check with page.evaluate + 2s timeout
- Playwright error wrapping — actionable messages for AI agents
- Fix useragent — context recreation preserves cookies/storage/URLs
- wait --networkidle / --load / --domcontentloaded flags
- console --errors filter (error + warning only)
- cookie-import <json-file> with auto-fill domain from page URL
- 166 integration tests (was ~63)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Phase 2: Rewrite SKILL.md as QA playbook + command reference

Reorient SKILL.md files from raw command reference to QA-first playbook
with 10 workflow patterns (test user flows, verify deployments, dogfood
features, responsive layouts, file upload, forms, dialogs, compare pages).
Compact command reference tables at the bottom.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Phase 3: /qa skill — systematic QA testing with health scores

New /qa skill for systematic web app QA testing. Three modes:
- full: 5-10 documented issues with screenshots and repro steps
- quick: 30-second smoke test with health score
- regression: compare against saved baseline

Includes issue taxonomy (7 categories, 4 severity levels), structured
report template, health score rubric (weighted across 7 categories),
framework detection guidance (Next.js, Rails, WordPress, SPA).

Also adds browse/bin/find-browse (DRY binary discovery using git
rev-parse), .gstack/ to .gitignore, and updated TODO roadmap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Bump to v0.3.0 — Phase 2 + Phase 3 changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: cookie-import-browser — Chromium cookie decryption module + tests

Pure logic module for reading and decrypting cookies from macOS Chromium
browsers (Comet, Chrome, Arc, Brave, Edge). Supports v10 AES-128-CBC
encryption with macOS Keychain access, PBKDF2 key derivation, and
per-browser key caching. 18 unit tests with encrypted cookie fixtures.

* feat: cookie picker web UI + route handler

Two-panel dark-theme picker served from the browse server. Left panel
shows source browser domains with search and import buttons. Right panel
shows imported domains with trash buttons. No cookie values exposed.
6 API endpoints, importedDomains Set tracking, inline clearCookies.

* feat: wire cookie-import-browser into browse server

Add cookie-picker route dispatch (no auth, localhost-only), add
cookie-import-browser to WRITE_COMMANDS and CHAIN_WRITE, add serverPort
property to BrowserManager, add write command with two modes (picker UI
vs --domain direct import), update CLI help text.

* chore: /setup-browser-cookies skill + docs (Phase 3.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* security: redact sensitive values from command output (PR #21)

type no longer echoes text (reports character count), cookie redacts
value with ****, header redacts Authorization/Cookie/X-API-Key/X-Auth-Token,
storage set drops value, forms redacts password fields. Prevents secrets
from persisting in LLM transcripts. 7 new tests.

Credit: fredluz (PR #21)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* security: path traversal prevention for screenshot/pdf/eval (PR #26)

Add validateOutputPath() for screenshot/pdf/responsive (restricts to
/tmp and cwd) and validateReadPath() for eval (blocks .. sequences and
absolute paths outside safe dirs). 7 new tests.

Credit: Jah-yee (PR #26)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: auto-install Playwright Chromium in setup (PR #22)

Setup now verifies Playwright can launch Chromium, and auto-installs
it via `bunx playwright install chromium` if missing. Exits non-zero
if build or Chromium launch fails.

Credit: AkbarDevop (PR #22)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* security: fix path validation bypass, CORS restriction, cookie-import path check

- startsWith('/tmp') matched '/tmpevil' — now requires trailing slash
- CORS Access-Control-Allow-Origin changed from * to http://127.0.0.1:<port>
- cookie-import now validates file paths (was missing validateReadPath)
- 3 new tests for prefix collision and cookie-import path traversal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address review informational issues + add regression tests

- Add cookie-import to CHAIN_WRITE set for chain command routing
- Add path validation to snapshot -a -o output path
- Fix package.json version to match 0.3.1
- Use crypto.randomUUID() for temp DB paths (unpredictable filenames)
- Add regression tests for chain cookie-import and snapshot path validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add /qa, /setup-browser-cookies to README + update BROWSER.md

- Add /qa and /setup-browser-cookies to skills table, install/update/uninstall blurbs
- Add dedicated README sections for both new skills with usage examples
- Update demo workflow to show cookie import → QA → browse flow
- Update BROWSER.md: cookie import commands, new source files, test count (203)
- Update skill count from 6 to 8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: team-aware /retro v2.0 — per-person praise and growth opportunities

- Identify current user via git config, orient narrative as "you" vs teammates
- Add per-author metrics: commits, LOC, focus areas, commit type mix, sessions
- New "Your Week" section with personal deep-dive for whoever runs the command
- New "Team Breakdown" with per-person praise and growth opportunities
- Track AI-assisted commits via Co-Authored-By trailers
- Personal + team shipping streaks
- Tone: praise like a 1:1, growth like investment advice, never compare negatively

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add Conductor parallel sessions section to README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 00:31:41 -07:00
Garry Tan 044c6d568e v0.0.2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 07:37:08 -07:00
Garry Tan 3d901066cd Initial release — gstack v0.0.1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 01:32:16 -07:00