gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-06-18 07:40:09 +02:00

Author	SHA1	Message	Date
Garry Tan	b071df3579	fix: resolve merge conflicts with origin/main (v0.6.0 + v0.6.0.1 + v0.6.1) Merge main's test bootstrap, boil-the-lake completeness principle, selective expansion, ship gate overrides, and gstack-upgrade vendor sync. Conflicts resolved: - CHANGELOG: keep main's 0.6.1/0.6.0.1/0.6.0/0.5.4/0.5.3 entries - VERSION: take main's 0.6.1 - design-consultation: office-hours naming + main's "what's out there" phrasing - ship: keep both verification rules (fresh evidence + coverage tests) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 14:37:22 -07:00
Garry Tan	9d47619e4c	feat: Completeness Principle — Boil the Lake (v0.6.1) (#140 ) * feat: Completeness Principle — Boil the Lake (WIP, pre-merge) Add Completeness Principle to all skill preambles, dual-time estimates, compression table, anti-pattern gallery, Lake Score, and completeness gaps review category. VERSION/CHANGELOG will be rebased after merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: update stale version reference in TODOS.md (v0.5.3 → v0.6.1) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update CHANGELOG date + README for v0.6.1 features - Add date to CHANGELOG 0.6.1 entry - Add Completeness Principle to README intro - Add SELECTIVE EXPANSION mode to CEO review section - Add test bootstrap mention to /ship section - Fix uninstall command missing design-consultation in project uninstall - Add "recommends shortcuts" and "no tests" to Without gstack list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: split README into lean intro + docs/ directory (gh CLI pattern) README: 875 → 243 lines. Keeps intro, skill table, demo, install, and troubleshooting. All per-skill deep dives, Greptile integration guide, and contributor mode docs moved to docs/ directory. - docs/skills.md — full philosophy and examples for all 13 skills - docs/greptile.md — Greptile setup and triage workflow - docs/contributor-mode.md — how to enable and use contributor mode - README now links to docs/ via Documentation table - Updated skill table entries with latest features (fix-first, regression tests, test health, completeness gaps) - Updated demo transcript with AUTO-FIXED, coverage audit, regression test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: remove "competitor" language, rewrite README in Garry's voice Replace "browses competitors" with "knows the landscape" / "what's out there" throughout all user-facing copy. Trim README from 243 to 167 lines — tighter, more opinionated, less listicle energy. Remove Completeness Principle from README top (it lives in CLAUDE.md and the skill preambles where Claude actually reads it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: rewrite README in Garry's raw voice — AGI era, L8 factory, real stories The README now sounds like Garry, not a product page. Leads with the live experiment, the 16k LOC/day reality, the real-life coding stories (Austin, hospital bedside). Highlights the newest unlocks (design at the heart, /qa parallelism, smart review routing, test bootstrap). Closes with an open invitation — free MIT, fork it, let's all ride the wave together. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Garry's bonafides to README intro — Palantir, Posterous, YC, 600k LOC Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add real /retro numbers — 140k lines, 362 commits across 3 projects Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add "in the last 60 days" timeframe to 600k LOC claim Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add GitHub contribution graphs — 2026 vs 2013 side by side Same person, different era. 2013: 772 contributions building Bookface. 2026: 1,237 contributions and accelerating. The difference is the tooling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: clarify /retro stats are from last 7 days Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add designer/PM/eng manager roles to intro Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: remove Josh/L8 reference from README Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: move demo up, make it dramatically more impressive Show the actual architecture diagram, auto-fixed issues, 100% coverage, regression test generation. Punch line: "That is not a copilot. That is a team." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: remove "My journey" section — intro already covers it Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: prefix all skill commands with You: in demo transcript Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: collapse You/Claude lines in demo — no gap between command and response Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: clarify plan mode flow in demo — approve, exit, Claude implements Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: move /ship to end of demo — review → QA → ship is the real flow Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add /plan-design-review to demo, tighten CEO response Shorter CEO reply, compressed eng diagram, added design audit with AI Slop score. Seven commands now: plan → eng → build → design → review → QA → ship. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: move design review before implementation — it's part of planning Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: reorder demo — design before eng, after CEO Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: remove URL from /plan-design-review in demo Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add [...] annotations showing what actually happens at each step Each step now shows what the agent does under the hood: 8 expansion proposals cherry-picked, 80-item design audit, ASCII diagrams for every flow, 2400 lines written in 8 minutes, real browser QA, bug found and fixed. Makes the demo feel real, not abstract. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: rename Contributor Mode to How to Contribute in docs table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Coinbase, Instacart, Rippling to YC bonafides Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add "one or two people in a garage" to founder story Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add skill table to top of skills.md with anchor links Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: consolidate — roll contributor-mode into CONTRIBUTING, greptile into skills - docs/contributor-mode.md → merged into CONTRIBUTING.md (session awareness section) - docs/greptile.md → merged into docs/skills.md (Greptile integration section) - Reordered docs table: Skills > Architecture > Browser > Contributing > Changelog Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 16:34:08 -05:00
Garry Tan	1f3b691411	feat: /gstack-upgrade detects and syncs stale vendored copies (v0.5.4.1) (#137 ) When the global gstack is already up to date, standalone /gstack-upgrade now checks if the local vendored copy in the current project is at a different version and syncs it automatically. Also adds rollback on setup failure and update-check fallback matching the preamble pattern. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 13:06:14 -05:00
Garry Tan	a2d756f945	feat: Test Bootstrap + Regression Tests + Coverage Audit (v0.6.0) (#136 ) * feat: test bootstrap, regression tests, coverage audit, retro test health - Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts - Add Phase 8e.5 regression test generation to /qa and /qa-design-review - Add Step 3.4 test coverage audit with quality scoring to /ship - Add test health tracking to /retro - Add 2 E2E evals (bootstrap + coverage audit) - Add 26 validation tests - Update ARCHITECTURE.md placeholder table - Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests) * chore: bump version and changelog (v0.6.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: make coverage audit trace actual codepaths, not just syntax patterns Step 3.4 now instructs Claude to read full files, trace data flow through every branch, diagram the execution, and check each branch against tests. Phase 8e.5 regression tests now trace the bug's codepath before writing the test, catching adjacent edge cases. * feat: coverage audit now maps user flows, interactions, and error states Step 3.4 now covers the full picture: code branches AND user-facing behavior. Maps user flows (complete journey through the feature), interaction edge cases (double-click, back button, stale state, slow connection), error states (what does the user actually see?), and boundary states (zero results, 10k results, max-length input). Coverage diagram splits into Code Path Coverage and User Flow Coverage sections with separate percentages. * fix: raise test gen cap to 20, add validation tests for user flow coverage - Raise Step 3.4 test generation cap from 10 to 20 (code + user flow combined) - Add 3 validation tests: codepath tracing, user flow mapping, diagram sections --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 13:05:18 -05:00
Garry Tan	b65a464d37	feat: always-full eng review + ship review gate persistence (v0.5.4) (#135 ) Remove SMALL/BIG CHANGE menu from /plan-eng-review — every plan gets the full interactive review. Scope reduction is now proactive (only when complexity check triggers) rather than a menu item. Add review gate override persistence to /ship — when the user says "ship anyway" or "not relevant", that decision is saved to the branch's reviews.jsonl so subsequent /ship runs don't re-ask. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 12:41:44 -05:00
Garry Tan	5e9f0e78f2	feat: SELECTIVE EXPANSION + smarter ship gates (v0.5.3) (#134 ) * feat: SELECTIVE EXPANSION mode + user control for CEO review Add 4th mode to /plan-ceo-review: SELECTIVE EXPANSION holds current scope as baseline but surfaces expansion opportunities one by one for cherry-picking. All modes now present every scope-expanding idea as individual AskUserQuestion calls — user opts in or out of each one. EXPANSION recommends enthusiastically, SELECTIVE recommends neutrally. CEO plan persistence writes decisions to disk. * feat: review dashboard — eng required, CEO/design optional Only Eng Review gates shipping. CEO Review recommended for big product changes, Design Review for UI work — both informational only. Adds skip_eng_review global config to disable the gate entirely. * chore: bump version and changelog (v0.5.3) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 12:22:10 -05:00
Garry Tan	46b6ded584	fix: resolve merge conflicts with origin/main (v0.5.1 + v0.5.2) Merge main's gstack-slug helper, review dashboard, screenshot visibility, and design-consultation enhancements into the office-hours branch. Conflicts resolved: - CHANGELOG: keep both 0.6.0 (ours) and 0.5.1/0.5.2 (main) entries - VERSION: keep 0.6.0 (ours) - design-consultation: use gstack-slug + office-hours naming - plan-eng-review: use gstack-slug - office-hours: adopt gstack-slug for SLUG/BRANCH computation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 09:20:38 -07:00
Garry Tan	c99757b522	feat: /design-consultation — risk-taking, visual research, ambitious preview (v0.5.2) (#131 ) * feat: /design-consultation — risk-taking thesis, visual research, ambitious preview Add SAFE/RISK breakdown to design proposals so users see which choices match category conventions vs. which are deliberate creative departures. Wire browse binary for visual competitive research — agent browses competitor sites, takes screenshots, and analyzes fonts/colors/spacing with graceful degradation to WebSearch-only or built-in knowledge. Upgrade preview page instructions to render realistic product mockups (dashboards, marketing pages, settings forms) instead of just swatches. Rewrite README section with the thesis: "coherence is table stakes — the real question is where you take risks." * chore: bump version and changelog (v0.5.2) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: restore SKILL.md files to match main Prior commit included SKILL.md files regenerated from stale templates. Restore to match origin/main content. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:49:22 -05:00
Garry Tan	73b00b4e29	feat: Review Readiness Dashboard + gstack-slug helper (v0.5.1) (#130 ) * feat: add bin/gstack-slug helper + migrate all inline SLUG computation Extract the opaque SLUG sed pipeline into a shared 5-line shell script. Replace 8 inline copies across templates with eval $(gstack-slug). Sanitizes branch names (/ → -) to prevent subdirectory creation. * feat: review readiness dashboard — track CEO/Eng/Design reviews per branch Each review skill logs its result to JSONL. A shared {{REVIEW_DASHBOARD}} placeholder displays run counts, timestamps, and a CLEARED TO SHIP verdict. /ship pre-flight reads the dashboard and prompts when reviews are missing. * chore: bump version and changelog (v0.5.1) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:33:46 -05:00
Garry Tan	5f41cd9ad7	feat: show screenshots to user during QA and browse sessions (v0.5.0.1) (#129 ) Add rule 11 to QA and Design methodologies in gen-skill-docs.ts instructing Claude to Read screenshot PNGs after taking them. This makes screenshots visible as clickable elements in Conductor and other Claude Code UIs. Also added to browse and gstack SKILL templates. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:30:19 -05:00
Garry Tan	c8c2cbba33	docs: add /design-consultation skill to README (#127 ) The skill was fully implemented but completely absent from the README. Add it to the skill table, write a detailed section with usage example, and include it in install/uninstall instructions. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 08:10:01 -05:00
Garry Tan	111a33206f	test: add 14 assertions for YC Office Hours content coverage Validates dual-mode structure (Startup/Builder), all six forcing questions, builder brainstorming content, intrapreneurship adaptation, YC apply CTA, and operating principles for both modes. 192 tests total, all passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 05:50:29 -07:00
Garry Tan	1bf5c00a9b	feat: YC Office Hours — dual-mode product diagnostic + builder brainstorm Rewrite /office-hours with two modes: Startup mode: six forcing questions (Demand Reality, Status Quo, Desperate Specificity, Narrowest Wedge, Observation & Surprise, Future-Fit) that push founders toward radical honesty about demand, users, and product decisions. Includes smart routing by product stage, intrapreneurship adaptation, and YC apply CTA for strong-signal founders. Builder mode: generative brainstorming for side projects, hackathons, learning, and open source. Enthusiastic collaborator tone, design thinking questions, no business interrogation. Mode is determined by an explicit question in Phase 1 — no guessing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 05:50:11 -07:00
Garry Tan	efa519b881	chore: rename /brainstorm → /office-hours across references Update CHANGELOG, CLAUDE.md, TODOS, design-consultation, plan-ceo-review, and gen-skill-docs to reference the new office-hours skill name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 05:50:04 -07:00
Garry Tan	95e4430e63	fix: resolve merge conflicts with origin/main (v0.5.0 + v0.4.5) Merge brings in design review skills (plan-design-review, qa-design-review, design-consultation), fix-first review mode (v0.4.5), and browse JS/click fixes. Conflicts resolved in VERSION (keep 0.6.0), CHANGELOG (keep all entries), TODOS (update design review to shipped), and test arrays (combine all skill registrations). Regenerated all SKILL.md files so design skills get the improved preamble (ELI16, escalation protocol, branch detection). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 19:59:27 -07:00
Garry Tan	4a77cc2c34	feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102 ) * feat: add {{DESIGN_METHODOLOGY}} resolver and register design review skills Add generateDesignMethodology() to gen-skill-docs.ts with 10-category, 80-item design audit checklist. Register plan-design-review and qa-design-review templates in findTemplates(). Add both skills to skill-check.ts SKILL_FILES. Add command and snapshot flag validation tests for both skills in skill-validation.test.ts. * feat: add /plan-design-review and /qa-design-review skills /plan-design-review: report-only designer audit with letter grades, AI slop scoring, structured first impression, design system extraction, DESIGN.md inference and export offer. Never modifies code. /qa-design-review: same audit, then iterative fix loop with style(design): commits, CSS-safe WTF heuristic, before/after screenshots, final re-audit. * chore: bump version and changelog (v0.5.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update README, ARCHITECTURE for design review skills (v0.5.0) - Update skill count to 11, add /plan-design-review and /qa-design-review to skill table, install/uninstall commands, and demo walkthrough - Add narrative sections: "senior designer mode" and "designer who codes mode" with compelling examples showing AI Slop detection and design system inference - Add {{DESIGN_METHODOLOGY}} to ARCHITECTURE.md placeholder table - Extend demo to show full plan→eng→review→ship→qa→design-review pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: regenerate design review SKILL.md files after merge from main Picks up BASE_BRANCH_DETECT resolver and updated contributor mode from main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add /design-consultation skill — design consultant that creates DESIGN.md 6-phase consultant flow: product context → competitive research (WebSearch) → complete coherent proposal → drill-downs on demand → font+color preview page → write DESIGN.md + update CLAUDE.md. Opinionated recommendations grounded in product context, not menu-driven forms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add E2E tests for design skill family (7 tests + LLM quality judge) Tests 1-4: /design-consultation (core flow, research integration, existing DESIGN.md handling, font+color preview generation). Tests 5-6: /plan-design-review (audit report, DESIGN.md export). Test 7: /qa-design-review (audit + fix loop). LLM judge validates font blacklist compliance, coherence, and AI slop avoidance. Also adds plan-design-review + qa-design-review to ALL_SKILLS test array. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: mark /design-consultation as shipped in TODOS.md Renamed from /setup-design-md to reflect the consultant approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 21:55:07 -05:00
Garry Tan	a30f7079da	feat: Fix-First Review — auto-fix obvious issues, ask about hard ones (v0.4.5) (#116 ) * feat: Fix-First Review — auto-fix obvious issues, ask about hard ones Replace the CRITICAL-only AskUserQuestion flow with Fix-First: - Every finding gets action (not just critical ones) - AUTO-FIX items (dead code, N+1, stale comments) applied directly - ASK items (security, race conditions, design decisions) batched into at most one AskUserQuestion - Fix-First Heuristic in checklist.md (single source of truth) - Gate Classification → Severity Classification rename Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.4.5) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: polish CHANGELOG v0.4.5 voice — lead with user benefit Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 21:52:50 -05:00
Garry Tan	318ffdbdf0	fix: js statement wrapping + click auto-routes option to selectOption (v0.4.5) (#117 ) * fix: js statement wrapping + click auto-routes option to selectOption Bug 1: js command wrapped all code as expressions — const, semicolons, and multi-line code broke with SyntaxError. Added needsBlockWrapper() and wrapForEvaluate() helpers (shared with eval) to detect statements and use block wrapper {…} instead of expression wrapper (…). Bug 2: clicking <option> refs hung forever because Playwright can't .click() native select UI. Click handler now checks ARIA role + DOM tagName and auto-routes to selectOption() via parent <select>. Bug 3: click timeouts on <option> elements gave no guidance. Now throws helpful error: "Use browse select instead of click." * chore: bump version and changelog (v0.4.5) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 21:50:43 -05:00
Garry Tan	9a55efe7c7	fix: resolve merge conflicts with origin/main (v0.4.3 + v0.4.4) Merge origin/main which added: - v0.4.3: /document-release skill, simplified AskUserQuestion (always ELI16), runtime branch detection via _BRANCH preamble variable - v0.4.4: 60-min update check TTL, --force flag for gstack-update-check Resolved conflicts: - VERSION: keep 0.6.0 (our version, above 0.4.4) - CHANGELOG: all entries preserved (0.6.0 > 0.4.4 > 0.4.3 > 0.4.2) - TODOS.md: keep both sections (Brainstorm/Design + Document-Release) - test/skill-validation.test.ts: include all three new skills in arrays (brainstorm, debug, document-release) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 17:44:33 -07:00
Garry Tan	c86faa7968	fix: update check cache — 60min UP_TO_DATE TTL + --force flag (v0.4.4) (#110 ) * fix: split update check cache TTL + add --force flag UP_TO_DATE cache now expires after 60 min (was 720 min / 12 hours). UPGRADE_AVAILABLE keeps 720 min TTL to keep nagging. --force flag deletes cache before checking, used by /gstack-upgrade standalone invocation to always get a fresh result from GitHub. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /gstack-upgrade standalone uses --force for fresh check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.4.4) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 14:14:15 -05:00
Garry Tan	a68244ab57	feat: /document-release skill — post-ship doc updates (v0.4.3) (#109 ) * docs: update project documentation for v0.4.2 - README: skill count 9→10, added /document-release to skills table, install/uninstall sections, and dedicated section with example - CHANGELOG: added /document-release bullet to v0.4.2 entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add /document-release skill with smart VERSION handling New skill runs after /ship but before PR merge. Reads every doc file, cross-references the diff, auto-updates factual changes, asks about risky edits. CHANGELOG clobber protection: never uses Write tool on CHANGELOG.md, only Edit with exact old_string matches. Smart VERSION logic: instead of silently skipping already-bumped versions, compares CHANGELOG entry scope against full diff and asks if significant uncovered changes exist. Also fixes gstack-upgrade/SKILL.md missing from skill-check.ts SKILL_FILES array (existing inconsistency with gen-skill-docs.ts). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: /review Step 5.6 — documentation staleness check Review skill now cross-references code changes against doc files. If a doc describes a feature that changed but the doc wasn't updated, flags it as INFORMATIONAL with a pointer to /document-release. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: /document-release E2E with CHANGELOG clobber guard E2E test creates a repo with existing CHANGELOG entries, runs /document-release, and asserts original entries survive. Critical guardrail against the incident where an agent replaced CHANGELOG entries during conflict resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump to v0.4.3 — /document-release skill Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files after merge * chore: regenerate SKILL.md files after merge --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 12:30:22 -05:00
Garry Tan	276d0cc6cb	feat: always-on ELI16 + branch detection (v0.4.3) (#108 ) * feat: always-on ELI16 + branch detection in preamble - Add _BRANCH detection to preamble bash block (git branch --show-current) - Merge ELI16 rules into default AskUserQuestion format (always-on) - Remove _SESSIONS >= 3 conditional — better questions always - Add simplification rules: plain English, no jargon, no raw function names - Update tests for branch detection and simplification regression guard * chore: bump version and changelog (v0.4.3) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 12:27:47 -05:00
Garry Tan	f5b981fcc0	fix: resolve merge conflicts with origin/main (v0.4.2 base branch detection) Merge origin/main which added: - BASE_BRANCH_DETECT placeholder + dynamic branch detection in all skills - Updated contributor mode (reflection-based, 0-10 rating) - Async await wrapping in browse js/eval commands - Hardcoded-main regression test Resolved conflicts: - VERSION: keep 0.6.0 (our version, above 0.4.2) - CHANGELOG: both entries preserved (0.6.0 above 0.4.2) - gen-skill-docs.ts: keep main's updated contributor mode, add our escalation protocol - review/SKILL.md.tmpl: fix hardcoded 'origin/main' in Step 1.5 to use origin/<base> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 11:35:47 -05:00
Garry Tan	78e519e3b7	feat: await support in browse js/eval + contributor mode v2 (#104 ) * feat: support await in $B js and eval commands Auto-wrap await expressions in async IIFE context so $B js "await fetch(...)" works without SyntaxError. - hasAwait() strips comments before detection - js: expression wrapping (async()=>(expr))() - eval: smart wrapping — single-line=expression, multi-line=block - 6 new unit tests covering async, false-positive, and return semantics * feat: redesign contributor mode — periodic reflection with 0-10 rating Replace passive "report when things break" with active reflection: - Rate gstack experience 0-10 at workflow step boundaries - Historical calibration example (await bug) anchors the reporting bar - "What would make this a 10" field focuses on actionable improvements - Removed category lists in favor of judgment-based assessment * test: add deterministic contributor mode preamble validation 40 new skill-validation tests (4 checks × 10 skills) verify: - 0-10 rating scale present - Calibration example present - "What would make this a 10" field present - Periodic reflection (not per-command) Update existing E2E contributor eval for new report format. * chore: bump version and changelog (v0.4.2) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: improve contributor mode + qa-quick E2E reliability Contributor mode: - Add "do not truncate" directive to template — agent was stopping after "My rating" without completing Steps/Raw output/What would make this a 10 sections - Restore assertions for Steps to reproduce and Date footer QA quick: - Make test server URL prominent: top of prompt, explicit "already running" and "do NOT discover ports" instructions - Bump session timeout 180s→240s and test timeout 240s→300s - Set B= at top of prompt (was buried in prose) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use flexible assertions for contributor mode E2E Agent writes thorough reports with creative section names ("Repro Steps" vs "Steps to reproduce"). Match intent not formatting: - /repro\|steps to reproduce/ for reproduction steps - /date.2026/ for date footer presence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> docs: add E2E eval failure blame protocol "Not related to our changes" is an extraordinary claim that requires extraordinary proof. When evals fail during /ship: 1. Run the same eval on main — prove it fails there too 2. If it passes on main, it IS your change — trace the blame 3. If you can't verify, say "unverified" not "pre-existing" Added to CLAUDE.md and as a comment in skill-e2e.test.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update CONTRIBUTING.md and BROWSER.md for v0.4.2 CONTRIBUTING.md: update contributor mode description — now describes periodic 0-10 reflection loop instead of passive friction detection. BROWSER.md: add js/eval async documentation — await expressions are auto-wrapped in async context, single-line eval returns values directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: restore v0.4.2 changelog entries lost during cherry-pick conflict The base branch detection entries from main were dropped when resolving the CHANGELOG conflict — should have merged both sets, not replaced. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 11:28:58 -05:00
Garry Tan	69fcf2d3ac	chore: bump version and changelog (v0.6.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 11:12:46 -05:00
Garry Tan	1e06b6a5c6	fix: dynamic base branch detection across all SKILL templates (v0.3.10) (#81 ) * feat: add {{BASE_BRANCH_DETECT}} resolver to gen-skill-docs DRY placeholder for dynamic base branch detection across PR-targeting skills. Detects via gh pr view (existing PR base) → gh repo view (repo default) → fallback to main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: ship skill detects base branch instead of hardcoding main Replaces ~14 hardcoded 'main' references with dynamic detection via {{BASE_BRANCH_DETECT}}. Fixes stacked branches and Conductor workspaces targeting non-main branches. Adds --base <base> to gh pr create. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: review, qa, plan-ceo-review detect base branch dynamically Same pattern as ship: replaces hardcoded 'main' with {{BASE_BRANCH_DETECT}}. Also cleans up qa bash-isms (REPORT_DIR variable, port chaining). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: retro detects default branch instead of hardcoding origin/main Retro queries commit history (not PR targets), so uses simpler detection: gh repo view defaultBranchRef. Replaces ~11 origin/main refs with origin/<default>. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add explicit cross-step references in gstack-upgrade template Bash blocks are self-contained, but cross-block variable references (INSTALL_DIR from Step 2) were implicit. Adds prose making them explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs+test: SKILL authoring guidance + regression tests Adds "Writing SKILL templates" section to CLAUDE.md explaining that templates are prompts, not scripts. Adds validation test catching hardcoded 'main' in git commands, and resolver content test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update ARCHITECTURE + CONTRIBUTING for new placeholders Add {{BASE_BRANCH_DETECT}} to ARCHITECTURE.md placeholder list. Cross-reference CLAUDE.md template authoring guidance from CONTRIBUTING.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.3.10) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add missing blank line between resolver functions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add 3 E2E smoke tests for base branch detection - /review: verifies Step 0 detection + git diff against detected base - /ship: truncated dry-run (Steps 0-1 only, no push/PR), asserts no destructive actions - /retro: verifies default branch detection for git log queries Covers the {{BASE_BRANCH_DETECT}} resolver path (review), the ship template's dual abort check, and retro's inline detection pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.4.2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:59:13 -05:00
Garry Tan	ba3c2dee2e	test: structural tests for new skills + escalation protocol assertions Add brainstorm + debug to skillsWithUpdateCheck and skillsWithPreamble arrays. Add structural tests: brainstorm (Phase 1-6, Design Doc, Supersedes, Smart-skip), debug (Iron Law, Root Cause, Pattern Analysis, Hypothesis, DEBUG REPORT, 3-strike). Add escalation protocol tests (DONE_WITH_CONCERNS, BLOCKED, NEEDS_CONTEXT) for all preamble skills. Also: 2 new TODOs (design docs → Supabase sync, /plan-design-review skill), update CLAUDE.md project structure with new skill directories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:10:54 -05:00
Garry Tan	4c0a3fe13b	feat: new /brainstorm and /debug skills /brainstorm: Socratic design exploration before planning. Context gathering, clarifying questions (smart-skip), related design discovery (keyword grep), premise challenge, forced alternatives, design doc artifact with lineage tracking (Supersedes: field). Writes to ~/.gstack/projects/$SLUG/. /debug: Systematic root-cause debugging. Iron Law: no fixes without root cause investigation. Pattern analysis, hypothesis testing with 3-strike escalation, structured DEBUG REPORT output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:10:41 -05:00
Garry Tan	50d0bc22f6	feat: design doc lookup in /plan-eng-review + fix branch name sanitization Step 0 now checks ~/.gstack/projects/ for /brainstorm design docs (branch-filtered with fallback, reads Supersedes: for revision context). Fix: branch names with '/' (e.g. garrytan/better-process) now get sanitized via tr '/' '-' in test plan artifact filenames. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:10:26 -05:00
Garry Tan	e716363916	feat: mandatory implementation alternatives + design doc lookup in /plan-ceo-review Step 0C-bis: Every plan must consider 2-3 approaches (minimal viable vs ideal architecture) before mode selection. RECOMMENDATION required. Pre-Review System Audit now checks ~/.gstack/projects/ for /brainstorm design docs (branch-filtered with fallback). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:10:12 -05:00
Garry Tan	b4c33261c7	feat: add scope drift detection + verification of claims to /review Step 1.5: Before reviewing code quality, check if the diff matches stated intent. Flags scope creep and missing requirements (INFORMATIONAL). Step 5 addition: Every review claim must cite evidence — "this pattern is safe" needs a line reference, "tests cover this" needs a test name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:10:00 -05:00
Garry Tan	e3d36b645f	feat: add verification gate to /ship (Step 6.5) — no push without fresh evidence Before pushing, re-verify tests if code changed during review fixes. Rationalization prevention: "Should work now" → RUN IT. "I'm confident" → Confidence is not evidence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:09:48 -05:00
Garry Tan	6cf424fb49	feat: add escalation protocol to preamble — all skills get DONE/BLOCKED/NEEDS_CONTEXT Every skill now reports completion status (DONE, DONE_WITH_CONCERNS, BLOCKED, NEEDS_CONTEXT) and has escalation rules: 3 failed attempts → STOP, security uncertainty → STOP, scope exceeds verification → STOP. "It is always OK to stop and say 'this is too hard for me.'" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 10:09:32 -05:00
Garry Tan	3e3843c4a9	feat: contributor mode, session awareness, recommendation format (#90 ) * feat: contributor mode, session awareness, universal RECOMMENDATION format - Rename {{UPDATE_CHECK}} → {{PREAMBLE}} across all 10 skill templates - Add session tracking (touch ~/.gstack/sessions/$PPID, count active sessions) - ELI16 mode when 3+ concurrent sessions detected (re-ground user on context) - Contributor mode: auto-file field reports to ~/.gstack/contributor-logs/ - Universal AskUserQuestion format: context → question → RECOMMENDATION → options - Update plan-ceo-review and plan-eng-review to reference preamble baseline - Add vendored symlink awareness section to CLAUDE.md - Rewrite CONTRIBUTING.md with contributor workflow and cross-project testing - Add tests for contributor mode and session awareness in generated output - Add E2E eval for contributor mode report filing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: add Enum & Value Completeness to /review critical checklist New CRITICAL review category that traces new enum values, status strings, and type constants through every consumer outside the diff. Catches the class of bugs where a new value is added but not handled in all switch/case chains, allowlists, or frontend-backend contracts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump v0.4.1, user-facing changelog, update qa-only template and architecture docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add CHANGELOG style guide — user-facing, sell the feature Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: rewrite v0.4.1 changelog to be user-facing and sell the features Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add evals for RECOMMENDATION format, session awareness, and enum completeness Free tests (Tier 1): RECOMMENDATION format + session awareness in all preamble SKILL.md files, enum completeness checklist structure and CRITICAL classification. E2E eval: /review catches missed enum handlers when a new status value is added but not handled in case/switch and notify methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add E2E eval for session awareness ELI16 mode Stubs _SESSIONS=4, gives agent a decision point on feature/add-payments branch, verifies the output re-grounds the user with project, branch, context, and RECOMMENDATION — the ELI16 mode behavior for 3+ sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: contributor mode eval marked FAIL due to expected browse error The test intentionally runs a nonexistent binary to trigger contributor mode. The session runner's browse error detection catches "no such file or directory...browse" and sets browseErrors, causing recordE2E to mark passed=false. Override passed to check only exitReason since the browse error is the expected scenario. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 01:45:50 -05:00
Garry Tan	f3ee0ee28a	feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83 ) * feat: browser ref staleness detection via async count() validation resolveRef() now checks element count to detect stale refs after page mutations (e.g. SPA navigation). RefEntry stores role+name metadata for better diagnostics. 3 new snapshot tests for staleness detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: qa-only skill, qa fix loop, plan-to-QA artifact flow Add /qa-only (report-only, Edit tool blocked), restructure /qa with find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for shared methodology. /plan-eng-review now writes test-plan artifacts to ~/.gstack/projects/<slug>/ for QA consumption. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: eval efficiency metrics — turns, duration, commentary across all surfaces Add generateCommentary() for natural-language delta interpretation, per-test turns/duration in comparison and summary output, judgePassed unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.4.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0 - ARCHITECTURE: add ref staleness detection section, update RefEntry type - BROWSER: add ref staleness paragraph to snapshot system docs - CONTRIBUTING: update eval tool descriptions with commentary feature - README: fix missing qa-only in project-local uninstall command Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add user-facing benefit descriptions to v0.4.0 changelog Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 23:55:39 -05:00
Garry Tan	bb46ca6b21	feat: smart update check with auto-upgrade, snooze backoff, config CLI (v0.3.9) (#62 ) * feat: add bin/gstack-config CLI for reading/writing ~/.gstack/config.yaml Simple get/set/list interface for persistent gstack configuration. Used by update-check and upgrade skill for auto_upgrade and update_check settings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: smart update check with 12h cache, snooze backoff, config disable - Reduce cache TTL from 24h to 12h for faster update detection - Add exponential snooze backoff: 24h → 48h → 1 week (resets on new version) - Add update_check: false config option to disable checks entirely - Clear snooze file on just-upgraded - 14 new tests covering snooze levels, expiry, corruption, and config paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: upgrade skill with auto-upgrade, 4-option prompt, vendored sync - Auto-upgrade mode via config or GSTACK_AUTO_UPGRADE=1 env var - 4-option AskUserQuestion: upgrade once, always, not now, never - Step 4.5: sync local vendored copy after upgrading primary install - Snooze write with escalating backoff on "Not now" - Update preamble text in gen-skill-docs for new upgrade flow - Regenerate all SKILL.md files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: simplify upgrade instructions, move auto-upgrade to completed README now points to /gstack-upgrade instead of long paste commands. Auto-upgrade TODO moved to Completed section (v0.3.8). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.3.9) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 23:28:02 -07:00
Garry Tan	41141007c1	feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61 ) * fix: log non-ENOENT errors in ensureStateDir() instead of silently swallowing Replace bare catch {} with ENOENT-only silence. Non-ENOENT errors (EACCES, ENOSPC) are now logged to .gstack/browse-server.log. Includes test for permission-denied scenario with chmod 444. * feat: merge TODO.md + TODOS.md into unified backlog with shared format reference Merge TODO.md (roadmap) and TODOS.md (near-term) into one file organized by skill/component with P0-P4 priority ordering and Completed section. Add shared review/TODOS-format.md for canonical format. Add static validation tests. * feat: add 2-tier Greptile reply system with escalation detection Add reply templates (Tier 1 friendly, Tier 2 firm), explicit escalation detection algorithm, and severity re-ranking guidance to greptile-triage.md. * feat: cross-skill TODOS awareness + Greptile template refs in all skills /ship Step 5.5: auto-detect completed TODOs, offer reorganization. /review Step 5.5: cross-reference PR against open TODOs. /plan-ceo-review, /plan-eng-review: TODOS context in planning. /retro: Backlog Health metric. /qa: bug TODO context in diff-aware mode. All Greptile-aware skills now reference reply templates and escalation detection. * chore: bump version and changelog (v0.3.8) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update CONTRIBUTING.md for v0.3.8 changes Clarify test tier cost table (Tier 3 standalone vs combined), add TODOS.md to "Things to know", mention Greptile triage in ship workflow description. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 20:15:11 -07:00
Garry Tan	2aa745cb0e	feat: screenshot element/region clipping (v0.3.7) (#56 ) * feat: screenshot element/region clipping (--clip, --viewport, CSS/@ref) Add element crop (CSS selector or @ref), region clip (--clip x,y,w,h), and viewport-only (--viewport) modes to the screenshot command. Uses Playwright's native locator.screenshot() and page.screenshot({ clip }). Full page remains the default. Includes 10 new tests covering all modes and error paths. * chore: bump version and changelog (v0.3.7) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: add screenshot modes to BROWSER.md command reference Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:47:42 -07:00
Garry Tan	0ac7ef4e81	fix: harden planted-bug eval prompt for reliable form testing Phase 3 was too vague ("click every nav link") causing the agent to wander instead of systematically testing form fields. Now explicitly directs: fill every input, clear it, try invalid values, submit and check console. Added Phase 4 finalize step to ensure report is updated with all findings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 13:28:18 -05:00
Garry Tan	7d26666164	Merge pull request #55 from garrytan/v0.3.6-qa-upgrades feat: E2E observability + eval infrastructure + all skills templated	2026-03-14 11:24:24 -07:00
Garry Tan	baf8acd55c	fix: update check ignores stale UP_TO_DATE cache after version change The UP_TO_DATE cache path exited immediately without checking if the cached version still matched the local VERSION. After upgrading (e.g. 0.3.3 → 0.3.4), the cache still said "UP_TO_DATE 0.3.3" and the script never re-checked against remote — so updates were invisible until the 24h cache expired. Now both UP_TO_DATE and UPGRADE_AVAILABLE verify cached version vs local before trusting the cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 13:23:25 -05:00
Garry Tan	4e31acbd47	fix: auto-clear stale heartbeat when process is dead Add PID to heartbeat file. eval-watch checks process.kill(pid, 0) and auto-deletes the heartbeat when the PID is no longer alive — no manual cleanup needed after crashed/killed E2E runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:55:40 -05:00
Garry Tan	43fbe165a4	docs: update README, CONTRIBUTING, ARCHITECTURE for v0.3.6 Update test tier costs and commands (Agent SDK → claude -p, SKILL_E2E → EVALS), add E2E observability section to CONTRIBUTING and ARCHITECTURE, add testing quick-start to README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:47:00 -05:00
Garry Tan	4ace0c2f6f	chore: bump version and changelog (v0.3.6) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:44:41 -05:00
Garry Tan	9f5aa32e67	fix: fail fast on API connectivity — pre-check before E2E suite Spawn a quick claude -p ping before running 13 tests. If the Anthropic API is unreachable (ConnectionRefused), throw immediately instead of burning through the entire suite with silent false passes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:37:44 -05:00
Garry Tan	5aae3ce117	fix: never clean up observability artifacts — partial file persists after finalize Removing the _partial-e2e.json deletion from finalize(). These are small files on a local disk and their persistence is the whole point of observability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:37:38 -05:00
Garry Tan	336dbaa50d	fix: detect is_error from claude -p result line (ConnectionRefused was PASS) claude -p can return subtype="success" with is_error=true when the API is unreachable. Previously we only checked subtype, so API failures silently passed. Now check is_error first and report as 'error_api'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 12:35:43 -05:00
Garry Tan	029a7c2a37	feat: eval-watch dashboard + observability unit tests (15 tests, 11 codepaths) eval-watch: live terminal dashboard reads heartbeat + partial file every 1s, shows completed/running tests, stale detection (>10min), --tail flag for progress.log tail. Pure renderDashboard() function for testability. observability.test.ts: unit tests for sanitizeTestName, heartbeat schema, progress.log format, NDJSON file naming, savePartial() with _partial flag, finalize() cleanup, diagnostic fields, watcher rendering, stale detection, and non-fatal I/O guarantees. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 11:04:40 -05:00
Garry Tan	510a8d8dda	feat: wire runId + testName + diagnostics through all E2E tests Generate per-session runId, pass testName + runId to every runSkillTest() call, wire exit_reason/timeout_at_turn/last_tool_call through recordE2E(). Add eval:watch script entry to package.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 11:04:28 -05:00
Garry Tan	f9cfabeda8	feat: add E2E observability — heartbeat, progress.log, NDJSON persistence, savePartial() session-runner: atomic heartbeat file (e2e-live.json), per-run log directory (~/.gstack-dev/e2e-runs/{runId}/), progress.log + per-test NDJSON persistence, failure transcripts to persistent run dir instead of tmpdir. eval-store: 3 new diagnostic fields (exit_reason, timeout_at_turn, last_tool_call), savePartial() writes _partial-e2e.json after each addTest() for crash resilience, finalize() cleans up partial file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 11:04:16 -05:00

1 2

84 Commits