mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-02 03:35:09 +02:00
6e1625c0d7
* test(harness): plumb extraArgs and auto_decided outcome through PTY runner runPlanSkillObservation now accepts extraArgs that pass through to launchClaudePty (which already supported them at the lower level), and exposes a new 'auto_decided' outcome detected via isAutoDecidedVisible when the AUTO_DECIDE preamble template fires (Auto-decided ... (your preference)). Both pieces are needed for the v1.21+ AskUserQuestion-blocked regression tests in the next commit. Detection order is deliberate: 'asked' (rendered numbered list) wins over 'auto_decided' (text only, no list), which wins over 'plan_ready' so the auto-decide evidence isn't masked by a downstream plan-mode confirmation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): add AskUserQuestion-blocked regression cases for 6 plan-mode skills Conductor launches Claude Code with --disallowedTools AskUserQuestion --permission-mode default --permission-prompt-tool stdio (verified by inspecting the live conductor claude process via ps -p ... -o args=). Native AskUserQuestion is removed from the model's tool registry; without fallback guidance the plan-mode skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review, autoplan, office-hours) silently proceed and never surface decisions to the user. Adds 6 gate-tier real-PTY regression cases: - 4 inline test cases inside the existing plan-X-review-plan-mode.test files, each exercising the same skill with extraArgs ['--disallowedTools', 'AskUserQuestion'] and asserting outcome === 'asked'. plan-design-review keeps the ['asked', 'plan_ready'] envelope (legitimate short-circuit on no-UI-scope) but explicitly fails on 'auto_decided'. - 2 standalone test files for autoplan + office-hours (which had no prior plan-mode test). autoplan asserts the FIRST non-auto-decided gate fires (Phase 1 premise confirmation) — autoplan auto-decides intermediate questions BY DESIGN. Touchfile entries: - autoplan-auto-mode + office-hours-auto-mode added to E2E_TOUCHFILES + E2E_TIERS (gate) - existing plan-X-review-plan-mode entries gain question-tuning.ts and generate-ask-user-format.ts touchfile deps so AUTO_DECIDE-related resolver changes correctly invalidate the regression tests - touchfiles.test.ts count updated 18 -> 19 to cover the autoplan touchfile dependency on plan-ceo-review/** Filenames retain `auto-mode` for branch-history continuity. Auto-mode (the AUTO_DECIDE preamble path when QUESTION_TUNING=true) is a related but distinct silencing mechanism; both share the same fix surface in the preamble. These tests are expected to FAIL on this branch until the fix lands. The failure is the receipt for the regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(preamble): teach the model to prefer mcp__*__AskUserQuestion when registered When a host launches Claude Code with --disallowedTools AskUserQuestion (Conductor does this by default — verified via ps on the live conductor claude process), the native AskUserQuestion tool is removed from the model's tool registry. Skill templates that say "call AskUserQuestion" silently fail in that environment: the model can't ask, the user never sees the question, the skill auto-proceeds without input. The fix is preamble guidance, not a skill-template change: generate-ask-user-format.ts: new "Tool resolution" section at the top of the AskUserQuestion Format block. Tells the model that "AskUserQuestion" can resolve to two tools at runtime — the host MCP variant (e.g. mcp__conductor__AskUserQuestion, registered when the host injects it) and the native tool — and to PREFER any mcp__*__AskUserQuestion variant. Same questions/options shape; same decision-brief format. If neither variant is callable, fall back to writing a "## Decisions to confirm" section into the plan file plus ExitPlanMode (the native plan-mode confirmation surfaces it). Never silently auto-decide. generate-completion-status.ts: the plan-mode-info block (preamble position 1) now explicitly notes that AskUserQuestion satisfies plan mode's end-of-turn requirement for "any variant" and points at the Tool resolution section for the fallback path. This puts the resolution rule in front of every tier-≥2 skill via the preamble, so plan-mode review skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review, autoplan, office-hours) all gain the fix without per-template surgery. Includes regenerated SKILL.md files for all 41 skills + the 3 host-ship golden fixtures used by test/host-config.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(periodic): AUTO_DECIDE opt-in preserved under Conductor flags Periodic-tier eval that exercises the legitimate /plan-tune AUTO_DECIDE path under the same flags Conductor uses (--disallowedTools AskUserQuestion). Confirms the new Tool resolution preamble doesn't trip opt-in users: when the user has set a never-ask preference for a question, the model should auto-pick (outcome 'auto_decided' or 'plan_ready') rather than surface the prompt. Setup runs in an isolated GSTACK_HOME tmpdir — never touches the user's real ~/.gstack state. Writes question_tuning=true + a never-ask preference for plan-ceo-review-mode (source: 'plan-tune', which bypasses the inline-user origin gate). Spawns claude with --disallowedTools AskUserQuestion in plan mode, runs /plan-ceo-review, asserts outcome is NOT 'asked' (i.e., the model honored the preference). Periodic tier because AUTO_DECIDE behavior depends on the model adhering to the QUESTION_TUNING preamble injection — non-deterministic, weekly cron is the right cadence rather than CI gating. Touchfiles cover the AUTO_DECIDE-bearing resolvers + the question-tuning binaries the test setup invokes. touchfiles.test.ts count updates 19 -> 20 because auto-decide-preserved also depends on plan-ceo-review/**. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v1.21.0.0: AskUserQuestion resolves to host MCP variant when native is disallowed MINOR scale per scale-aware bumps in CLAUDE.md: substantial coordinated multi-file change (preamble fix + new test infrastructure + 6 gate-tier regression cases + 1 periodic eval) and a user-visible regression fix that affects every plan-mode review skill running under Conductor's default flag set. User originally targeted v1.21.2.0; landing as v1.21.0.0 since this is the first 1.21.x release on main and there's no prior 1.21.0.0/1.21.1.0 to skip past. Adjust at /ship time if a different number is preferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): fix detection order + whitespace-tolerant pattern matching Two bugs surfaced when validating the v1.21 fix end-to-end: 1. PlanSkillObservation outcome detection ran 'asked' (any numbered options list) BEFORE 'plan_ready'. Plan-mode's "Ready to execute?" confirmation IS a numbered options list (1=auto, 2=manual, ...), so any skill that successfully reached the native confirmation got misclassified as 'asked'. Reorder: 'auto_decided' (most specific, requires AUTO_DECIDE annotation) > 'plan_ready' (next, requires the "ready to execute" stem) > 'asked' (any remaining numbered list). 2. isPlanReadyVisible and isAutoDecidedVisible regexes only matched spaced forms ("ready to execute", "(your preference)"). stripAnsi removes cursor-positioning escapes (`\x1b[40C`) entirely instead of replacing them with spaces, so the same text can render as "readytoexecute" or "(yourpreference)". Both detectors now test the spaced form first, fall through to a whitespace-collapsed comparison. Inline unit smoke confirms both forms match. Updates to the 5 strict 'asked' regression test cases (plan-ceo, plan-eng, plan-devex, autoplan, office-hours): with the detection order corrected, the model's plan-file fallback flow legitimately lands at 'plan_ready' instead of 'asked'. Pass envelope expanded to ['asked', 'plan_ready'] (matching plan-design-review's existing pattern). Failure signals tightened to include 'auto_decided' (catches AUTO_DECIDE without opt-in) plus the standard silent_write/exited/timeout. plan-design was already on this contract from v1.21's first commit, no change needed. The expanded envelope is correct: under --disallowedTools AskUserQuestion the Tool resolution preamble routes the question through plan-mode's native "Ready to execute?" surface — the user still sees the decision, just via the plan-file flow rather than a numbered prompt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(harness): require ## Decisions section under --disallowedTools plan_ready Adversarial review (during /ship Step 11) found that the previous gate-test envelope ['asked', 'plan_ready'] for the AskUserQuestion-blocked regression cases accepted the bug they exist to catch: a model that silently skips Step 0 entirely (writes a plan with no questions, no `## Decisions to confirm` section, just ExitPlanModes) reaches plan_ready and passes. The fix tightens the contract in two layers: 1. Harness: PlanSkillObservation gains a `planFile?: string` field populated when outcome is plan_ready. extractPlanFilePath() walks the visible TTY buffer for "Plan saved to:", "Plan file:", or ".claude/plans/<name>.md" patterns and resolves tilde to absolute. planFileHasDecisionsSection() reads the resolved file and returns true if it contains a `## Decisions` heading (any form: "to confirm", "needed", etc.). 2. Tests: 5 of 6 regression cases now require, when outcome is plan_ready, that obs.planFile is set AND planFileHasDecisionsSection returns true. Otherwise the test fails with a "Step 0 was silently skipped" diagnosis. plan-design-review remains the sole exception — it legitimately short-circuits to plan_ready on no-UI-scope branches and we have no deterministic way to distinguish that from a silent skip. This closes the loophole the adversarial review identified. The fix preamble flow already tells the model to write `## Decisions to confirm` when neither AUQ variant is callable — now the test verifies the model actually did it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(harness): anchor extractPlanFilePath path captures on /Users|~|/home|/var|/tmp Adversarial-tightened gate sweep surfaced a real bug in the path extraction: stripAnsi collapses whitespace via cursor-positioning escape removal, so "yet at /Users/..." in the visible buffer becomes "yetat/Users/..." with no space between. The previous fallback pattern `(~?\/?\S*\.claude\/plans\/[\w-]+\.md)` greedily matched non-whitespace characters BEFORE the path, producing `yetat/Users/garrytan/.claude/...` which then fails fs.readFileSync. Fix: every regex now requires the path to START at a known path-anchor: `~/`, `/Users/`, `/home/`, `/var/`, `/tmp/`, or `./`. Earlier non-whitespace runs can't be glommed in. Verified against the failing fixture (`yetat/Users/...`) plus the four canonical render forms ("Plan saved to:", "Plan file:", `·`-decorated ctrl-g hint, and the bare fallback). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.8 KiB
5.8 KiB
gstack — AI Engineering Workflow
gstack is a collection of SKILL.md files that give AI agents structured roles for software development. Each skill is a specialist: CEO reviewer, eng manager, designer, QA lead, release engineer, debugger, and more.
Available skills
Skills live in .agents/skills/ (or ~/.claude/skills/gstack/ on Claude Code).
Invoke them by name (e.g., /office-hours).
Plan-mode reviews
| Skill | What it does |
|---|---|
/office-hours |
Start here. Reframes your product idea before you write code. |
/plan-ceo-review |
CEO-level review: find the 10-star product in the request. |
/plan-eng-review |
Lock architecture, data flow, edge cases, and tests. |
/plan-design-review |
Rate each design dimension 0-10, explain what a 10 looks like. |
/plan-devex-review |
DX-mode review: TTHW, magical moments, friction points, persona traces. |
/plan-tune |
Self-tune AskUserQuestion sensitivity per question. |
/autoplan |
One command runs CEO → design → eng → DX review. |
/design-consultation |
Build a complete design system from scratch. |
Implementation + review
| Skill | What it does |
|---|---|
/review |
Pre-landing PR review. Finds bugs that pass CI but break in prod. |
/codex |
Second opinion via OpenAI Codex. Review, challenge, or consult modes. |
/investigate |
Systematic root-cause debugging. No fixes without investigation. |
/design-review |
Live-site visual audit + fix loop with atomic commits. |
/design-shotgun |
Generate multiple AI design variants, comparison board, iterate. |
/design-html |
Generate production-quality Pretext-native HTML/CSS. |
/devex-review |
Live developer experience audit (TTHW measured against the real flow). |
/qa |
Open a real browser, find bugs, fix them, re-verify. |
/qa-only |
Same methodology as /qa but report only — no code changes. |
/scrape |
Pull data from a web page. First call prototypes; codified call runs in ~200ms. |
/skillify |
Codify the most recent successful /scrape flow into a permanent browser-skill. |
Release + deploy
| Skill | What it does |
|---|---|
/ship |
Run tests, review, push, open PR. Workspace-aware version queue. |
/land-and-deploy |
Merge the PR, wait for CI and deploy, verify production health. |
/canary |
Post-deploy monitoring loop using the browse daemon. |
/landing-report |
Read-only dashboard for the workspace-aware ship queue. |
/document-release |
Update all docs to match what you just shipped. |
/setup-deploy |
One-time deploy config detection (Fly.io, Render, Vercel, etc.). |
/gstack-upgrade |
Update gstack to the latest version. |
Operational + memory
| Skill | What it does |
|---|---|
/context-save |
Save working context (git state, decisions, remaining work). |
/context-restore |
Resume from a saved context, even across Conductor workspaces. |
/learn |
Manage what gstack learned across sessions. |
/retro |
Weekly retro with per-person breakdowns and shipping streaks. |
/health |
Code quality dashboard (type checker, linter, tests, dead code). |
/benchmark |
Performance regression detection (page load, Core Web Vitals). |
/benchmark-models |
Cross-model benchmark for skills (Claude, GPT, Gemini side-by-side). |
/cso |
OWASP Top 10 + STRIDE security audit. |
/setup-gbrain |
Set up gbrain for cross-machine session memory sync. |
Browser + agent integration
| Skill | What it does |
|---|---|
/browse |
Headless browser — real Chromium, real clicks, ~100ms/command. |
/open-gstack-browser |
Launch the visible GStack Browser with sidebar + stealth. |
/setup-browser-cookies |
Import cookies from your real browser for authenticated testing. |
/pair-agent |
Pair a remote AI agent (OpenClaw, Codex, etc.) with your browser. |
Safety + scoping
| Skill | What it does |
|---|---|
/careful |
Warn before destructive commands (rm -rf, DROP TABLE, force-push). |
/freeze |
Lock edits to one directory. Hard block, not just a warning. |
/guard |
Activate both careful + freeze at once. |
/unfreeze |
Remove directory edit restrictions. |
/make-pdf |
Turn any markdown file into a publication-quality PDF. |
Build commands
bun install # install dependencies
bun test # run free tests (no API spend)
bun run test:windows # curated Windows-safe subset (runs on windows-latest)
bun run build # generate docs + compile binaries
bun run gen:skill-docs # regenerate SKILL.md files from templates
bun run skill:check # health dashboard for all skills
Platform support
- macOS + Linux: full test suite supported.
- Windows: curated Windows-safe subset runs on
windows-latestvia thewindows-free-testsCI job. Setup script (./setup) requires Git Bash or MSYS today; native PowerShell support is a future expansion. Thebin/gstack-pathshelper resolves state roots throughCLAUDE_PLUGIN_DATA/GSTACK_HOMEso plugin installs work on every platform.
Key conventions
- SKILL.md files are generated from
.tmpltemplates. Edit the template, not the output. - Run
bun run gen:skill-docs --host codexto regenerate Codex-specific output. - The browse binary provides headless browser access. Use
$B <command>in skills. - Safety skills (careful, freeze, guard) use inline advisory prose — always confirm before destructive operations.
- State paths resolve via
bin/gstack-paths(sourced viaeval "$(...)"). HonorsGSTACK_HOME,CLAUDE_PLUGIN_DATA,CLAUDE_PLANS_DIR. - The
claudeCLI binary resolves viabrowse/src/claude-bin.ts(Bun.which()+GSTACK_CLAUDE_BINoverride). SetGSTACK_CLAUDE_BIN=wslplusGSTACK_CLAUDE_BIN_ARGS='["claude"]'to run Claude through WSL on Windows.