diff --git a/CHANGELOG.md b/CHANGELOG.md index 36bc1a1e..311b2525 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,6 +1,6 @@ # Changelog -## [1.8.0.0] - 2026-04-23 +## [1.11.0.0] - 2026-04-23 ## **Workspace-aware ship. Two open PRs can't both claim the same VERSION anymore.** @@ -8,11 +8,11 @@ If you run gstack in multiple Conductor windows at once, you've probably seen th ### What changes for you -Run `/ship` in one Conductor window while another has an open PR claiming v1.7.0.0. Your ship now sees the claim, renders a queue table, and picks v1.8.0.0 (same bump level, next free slot). The PR title starts with `v1.8.0.0` so landing order is visible in `gh pr list` without opening each PR. If a sibling workspace has uncommitted work at a higher VERSION and looks active (commit in the last 24h), `/ship` asks whether to wait for them or advance past. If the queue shifts between ship and merge, CI's new version-gate catches it, and rerunning `/ship` rewrites VERSION, package.json, CHANGELOG, and the PR title atomically. +Run `/ship` in one Conductor window while another has an open PR claiming v1.7.0.0. Your ship now sees the claim, renders a queue table, and picks the next free slot above it (same bump level). The PR title starts with `v` so landing order is visible in `gh pr list` without opening each PR. If a sibling workspace has uncommitted work at a higher VERSION and looks active (commit in the last 24h), `/ship` asks whether to wait for them or advance past. If the queue shifts between ship and merge, CI's new version-gate catches it, and rerunning `/ship` rewrites VERSION, package.json, CHANGELOG, and the PR title atomically. This very release dogfooded the drift path: the original ship at v1.8.0.0 went stale when three other PRs landed first, and the merge-back-to-main rebump (v1.8.0.0 → v1.11.0.0) happened via the same queue-aware codepath it introduces. ### What shipped (by the numbers) -- `bin/gstack-next-version` — 360-line Bun/TS util. 21 passing fixture tests covering happy path, 8 collision scenarios, offline fallback, fork-PR filtering, sibling activity detection. +- `bin/gstack-next-version` — ~390-line Bun/TS util. 21 passing fixture tests covering happy path, 8 collision scenarios, offline fallback, fork-PR filtering, sibling activity detection, self-PR auto-exclusion. - Host parity: GitHub + GitLab both supported. CI gates: `.github/workflows/version-gate.yml`, `.github/workflows/pr-title-sync.yml`, plus `.gitlab-ci.yml` mirror. - Fail-open semantics on util errors (network, auth, bug). A gstack bug never freezes your merge queue. Fail-closed on confirmed collisions. - `/landing-report` skill — read-only dashboard showing queue, siblings, and what all four bump levels would claim. @@ -26,7 +26,7 @@ If you're routinely running 3-10 Conductor windows against the same repo, this i #### Added -- `bin/gstack-next-version`. Host-aware (GitHub + GitLab + unknown) VERSION allocator. Queries open PRs, fetches each PR's VERSION at head (bounded concurrency, 10 parallel), scans sibling Conductor worktrees, picks the next free slot. Pure reader, never writes files. +- `bin/gstack-next-version`. Host-aware (GitHub + GitLab + unknown) VERSION allocator. Queries open PRs, fetches each PR's VERSION at head (bounded concurrency, 10 parallel), scans sibling Conductor worktrees, picks the next free slot. Pure reader, never writes files. Supports `--exclude-pr ` to filter out the PR being checked (prevents self-reference when CI runs against the PR's own VERSION). - `scripts/detect-bump.ts`, `scripts/compare-pr-version.ts`. CI gate helpers. Three exit paths: pass, block on confirmed collision, fail-open on util errors. - `.github/workflows/version-gate.yml`. Merge-time collision gate. Runs when VERSION/CHANGELOG/package.json changes on a PR. - `.github/workflows/pr-title-sync.yml`. Auto-rewrites PR title when VERSION changes on push, only for titles already carrying the `v` prefix (custom titles left alone, idempotent). @@ -42,11 +42,243 @@ If you're routinely running 3-10 Conductor windows against the same repo, this i - `review/SKILL.md.tmpl`. New Step 3.4 advisory one-liner showing queue status. Non-blocking. - `CLAUDE.md`. Versioning invariant paragraph. Documents that VERSION is a monotonic sequence, not a strict semver commitment, and queue-advance within a bump level is permitted. +#### Fixed + +- Self-reference bug in the version gate. The first live CI run (PR #1168 at v1.8.0.0) was rejected as "stale" because the util counted the PR being checked as a queued claim, inflating the next slot by one. Fixed with `--exclude-pr` flag + `gh pr view` auto-detect so the util silently filters the current branch's PR. Caught and fixed in the same ship — exactly the dogfood loop the release is designed for. + #### For contributors - `test/gstack-next-version.test.ts`. 21 pure-function tests (parseVersion / bumpVersion / cmpVersion / pickNextSlot with 8 collision scenarios / markActiveSiblings 4 cases) plus a CLI smoke test against the live repo. - Golden ship fixtures refreshed for all three hosts (claude, codex, factory) after Step 12 and Step 19 template changes. This is exactly the blast radius Codex flagged during the CEO review (cross-model tension #8), handled in the same PR rather than as a follow-up. +## [1.10.1.0] - 2026-04-23 + +## **We tried to make Opus 4.7 faster with a prompt. Measurement said it got slower. Pulled the bullet.** + +gstack shipped a "Fan out explicitly" overlay nudge in `model-overlays/opus-4-7.md` +back in v1.5.2.0. The idea: tell Opus 4.7 to emit multiple tool calls in one +assistant turn instead of one per turn, so "read three files" takes one API +round-trip instead of three. Sounded obvious. This release removes that +bullet after measuring that it actively hurt performance, and ships the eval +harness we used to prove it so you can measure your own overlay changes. + +### The numbers that matter + +Source: new `test/skill-e2e-overlay-harness.test.ts`, N=10 trials per arm per +fixture, 40 trials per run, ~$3 per run. Pinned to `claude-opus-4-7` via +Anthropic's published Agent SDK (`@anthropic-ai/claude-agent-sdk@0.2.117`) +with `pathToClaudeCodeExecutable` set to the locally-installed `claude` binary +(2.1.118). Metric: number of parallel `tool_use` blocks in the first assistant +turn. + +| Prompt text in overlay | First-turn fanout rate (toy: read 3 files) | Lift vs baseline | +|---|---|---| +| No overlay (default Claude Code system prompt only) | **70%** (7/10) | baseline | +| gstack's original "Fan out explicitly" nudge (v1.5.2.0 through v1.6.3.0) | 10% (1/10) | **-60%** | +| Anthropic's own canonical `` text from their parallel-tool-use docs | **0%** (0/10) | **-70%** | + +On a realistic multi-file audit prompt (`read app.ts + config.ts + README.md, +glob src/*.ts, summarize`), Opus 4.7 never fanned out in the first turn at all, +regardless of overlay. Zero of 20 trials. The nudge had nothing to grip. + +Total cost of the investigation: **$7** across three eval runs. + +### What this means for you + +If you ship system-prompt nudges for Claude, measure them. Anthropic's own +published best-practice text dropped our fanout rate to zero. That's not a +claim about Anthropic, it's a claim about measurement: the model, the SDK, +the binary, and the context all move under the advice, and the advice sits +still. The harness is in the repo now. Run +`EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-overlay-harness.test.ts`. +Three dollars per run. + +### Itemized changes + +#### Fixed + +- `model-overlays/opus-4-7.md` — removed the "Fan out explicitly" block. The + other three nudges (effort-match, batch questions, literal interpretation) + are untested and stay in for now. They're candidates for their own + measurement in a follow-up PR. + +#### Added + +- `test/skill-e2e-overlay-harness.test.ts` — periodic-tier eval that iterates a + typed fixture registry and runs A/B arms through `@anthropic-ai/claude-agent-sdk`. + Uses SDK preset `claude_code` so the arms include Claude Code's real system + prompt; overlay-ON appends the resolved overlay text. Saves per-trial raw + event streams for forensic recovery. Gated on both `EVALS=1` and + `EVALS_TIER=periodic`. +- `test/fixtures/overlay-nudges.ts` — typed `OverlayFixture` registry with + strict validator. Adding a future nudge to measure = one fixture entry. + First two fixtures: `opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic`. +- `test/helpers/agent-sdk-runner.ts` — parametric SDK wrapper with explicit + `AgentSdkResult` types, process-level API concurrency semaphore, and + three-shape 429 retry (thrown error, result-message error, mid-stream + `SDKRateLimitEvent`). Binary pinning via `pathToClaudeCodeExecutable`. +- `test/agent-sdk-runner.test.ts` — 36 free-tier unit tests covering happy + path, all three rate-limit shapes, persistent-429 `RateLimitExhaustedError`, + non-429 propagation, options propagation, concurrency cap, and every + validator rejection case. +- `scripts/preflight-agent-sdk.ts` — 20-line sanity check that confirms the + SDK loads, `claude-opus-4-7` is a live API model, the `SDKMessage` event + shape matches assumptions, and the overlay resolver produces the expected + text. Run manually before paid runs if you suspect drift. Costs ~$0.013. +- `@anthropic-ai/claude-agent-sdk@0.2.117` in `devDependencies`. Exact pin, + no caret — SDK event shapes can drift on minor versions. + +#### Changed + +- `scripts/resolvers/model-overlay.ts` — exported `readOverlay` so the eval + harness can resolve `{{INHERIT:claude}}` directives without synthesizing a + full `TemplateContext`. + +#### For contributors + +- `test/helpers/touchfiles.ts` — registered the new eval in both + `E2E_TOUCHFILES` (deps: `model-overlays/**`, `overlay-nudges.ts`, runner, + resolver) and `E2E_TIERS` (`periodic`). Passes the + `test/touchfiles.test.ts` completeness check. +- The harness is deliberately parametric. Adding a second overlay nudge + measurement (for the remaining three nudges in `opus-4-7.md`, or any + future nudge in any overlay file) is a single entry in + `test/fixtures/overlay-nudges.ts`. Total incremental effort: ~15 minutes + per fixture. + +## [1.10.0.0] - 2026-04-23 + +## **Plan reviews walk you through each issue again, and every question is now a real decision brief.** + +v1.6.4.0 broke something nobody wrote down. Plan reviews on Opus 4.7 silently stopped asking questions one at a time. They turned into a report: here are 6 findings, end of turn. The interactive dialogue that made `/plan-ceo-review`, `/plan-eng-review`, and the rest useful quietly evaporated. v1.10.0.0 restores that, and bundles a format upgrade so every `AskUserQuestion` now renders as a numbered decision brief with ELI10, stakes, recommendation, per-option pros / cons (✅ / ❌), and a closing "Net:" line that frames the trade-off in one sentence. + +### What changes for you + +Run `/plan-ceo-review` or `/plan-eng-review` on a plan with 3 findings. You get 3 separate AskUserQuestion prompts, one per finding, with the full Pros / Cons shape. Pick the option in 5 seconds, or expand the pros / cons if you want to think about it. Every review finding becomes a decision you actually made, not a bullet point you skimmed. The reference shape matches the D2 memory-design question Garry hand-crafted for his own use, now baked into every tier-2 skill via the preamble resolver, so `/ship`, `/office-hours`, `/investigate`, and the rest inherit it for free. + +### The numbers that matter + +Measured across the v1.10.0.0 fix. Verify any claim with `git log 1.9.0.0..1.10.0.0 --oneline` and `bun test` against the pinned commit SHA. + +| Metric | v1.6.4.0 | v1.10.0.0 | Δ | +|---|---|---|---| +| `AskUserQuestion` renders above model overlay in SKILL.md | no | **yes** | ordering inverted | +| Escape-hatch sites hardened across plan-review templates | 0 | **16** | +16 | +| Gate-tier unit tests pinning the format contract | 0 | **30** | +30 (runs in 16ms, $0) | +| Periodic evals defending against escape-hatch abuse | 0 | **4** | +4 (2 positive, 2 negative-case) | +| Cross-model review findings incorporated before landing | N/A | **5 of 8** | Codex caught real bugs CEO+Eng missed | + +Two of the five Codex findings were load-bearing. (1) The overlay reorder theory wasn't enough on its own. The `(recommended)` label on a neutral-posture question had to stay, because `question-tuning.ts:29` reads it to power AUTO_DECIDE. Omitting it would have silently broken auto-decide on every cherry-pick prompt. (2) The "31 sites global replace" in the original plan was factually wrong. Actual count, verified with `rg`, is 16 sites across 4 templates, and eng/design/devex templates used different phrasing than CEO. Without the audit, the fix would have shipped half-applied. + +### What this means for anyone running plan reviews on Opus 4.7 + +Upgrade and re-run your next plan review. You should see D-numbered prompts (D1, D2, D3...) with ELI10 paragraphs, stakes lines, and ✅ / ❌ bullet blocks per option. If you don't, check that `bun run gen:skill-docs` regenerated cleanly after the upgrade, and verify the `Pros / cons:` header renders in `plan-ceo-review/SKILL.md`. Complete plan reviews that used to take 20 minutes and produced a report now take 10 minutes and produce a row of decisions. + +### Itemized changes + +#### Added + +- New Pros / Cons decision-brief format for every `AskUserQuestion` across all tier-2+ skills. Rendering: `D` header, ELI10, "Stakes if we pick wrong:", Recommendation, per-option `✅ / ❌` bullets with minimum 2 pros + 1 con, closing `Net:` synthesis line. Lands in `scripts/resolvers/preamble/generate-ask-user-format.ts` so every skill inherits it. +- Hard-stop escape for destructive one-way choices: single bullet `✅ No cons — this is a hard-stop choice`. +- Neutral-posture handling for SELECTIVE EXPANSION cherry-picks and taste calls: `Recommendation: — this is a taste call, no strong preference either way` with `(recommended)` label preserved on the default to keep AUTO_DECIDE working. +- Three gate-tier unit tests (`test/preamble-compose.test.ts`, `test/resolver-ask-user-format.test.ts`, `test/model-overlay-opus-4-7.test.ts`) that pin the composition order, format contract, and overlay text. Run in <100ms on every `bun test`. +- Four periodic-tier Pros/Cons eval cases in `test/skill-e2e-plan-prosons.test.ts` including two negative-case assertions that catch escape-hatch abuse before it drifts. +- Touchfiles entries (`test/helpers/touchfiles.ts`) for all new eval cases plus expanded-coverage stubs for 7 additional skills. + +#### Fixed + +- Plan-review cadence regression on Opus 4.7. `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, and `/plan-devex-review` now actually pause after each finding and call `AskUserQuestion` as a tool_use instead of batching everything into one summary report. Root cause: `generateModelOverlay` rendered above `generateAskUserFormat` in `scripts/resolvers/preamble.ts`, so the overlay's "Batch your questions" directive registered as the ambient default before the pacing rule. Fixed by reordering the section array and rewriting the overlay directive as "Pace questions to the skill". +- Escape-hatch collapse: "If no issues or fix is obvious, state what you'll do and move on, don't waste a question" at 16 sites across 4 templates let Opus 4.7's literal interpreter classify every finding as self-dismissable. Tightened per-template: zero findings gets "No issues, moving on"; findings require AskUserQuestion as a tool_use. + +#### Changed + +- `test/skill-e2e-plan-format.test.ts`: extended with v1.10.0.0 format token regexes (D-number, ELI10, Stakes, Pros/cons, Net). Existing RECOMMENDATION check loosened to accept mixed-case "Recommendation:". +- `test/skill-validation.test.ts`: format assertions updated from "RECOMMENDATION: Choose" to the new Pros/Cons token set. +- Golden fixtures regenerated: `test/fixtures/golden/claude-ship-SKILL.md`, `codex-ship-SKILL.md`, `factory-ship-SKILL.md`. + +#### For contributors + +- Outside-voice Codex review (`codex exec` with `model_reasoning_effort="high"`) caught two factual bugs in the original plan: the "31 sites" count (actually 16) and the AUTO_DECIDE contract break on neutral-posture questions. 5 of 8 Codex findings incorporated, 1 rejected (kept defense in depth on the composition reorder), 1 declined (HOLD SCOPE mode lock). +- Follow-up: true multi-turn cadence eval (3 findings produce 3 distinct AskUserQuestion invocations across turns) requires new harness support for multi-capture. Filed in NOT-in-scope. Current single-capture eval covers format + escape-hatch abuse but not cadence itself. +- Follow-up: expanded-coverage eval cases for `/ship`, `/office-hours`, `/investigate`, `/qa`, `/review`, `/design-review`, `/document-release`. Touchfiles entries exist; test blocks will land per-skill in follow-up PRs. +- D-numbering is a model-level instruction, not a runtime counter. `TemplateContext` has no state for it. Drift over long sessions is expected; a registry (deferred to TODOs) is the long-term fix. + +## [1.9.0.0] - 2026-04-23 + +## **Your gstack memory now travels with you. Cross-machine brain via a private git repo + optional GBrain indexing, no daemon, no credential leaks.** + +gstack session memory (learnings, plans, designs, retros, developer profile) used to die at the machine boundary. Now it doesn't. `gstack-brain-init` turns `~/.gstack/` into a git repo with an explicit allowlist, writer shims enqueue changed files at write-time, and a preamble-boundary sync pushes them to a private git remote of your choice. GBrain is the first consumer but the architecture is pluggable — Codex, OpenClaw, or anything else can be a reader later. No daemon, no background process, no new auth surface. + +The feature shipped after four plan reviews: /office-hours shaping, /plan-eng-review (6 issues → CLEAR), /plan-ceo-review (SELECTIVE EXPANSION, 2 cherry-picks accepted), /codex twice (16+16 findings applied, daemon model dropped in round 2), and /plan-devex-review (6/10 → 8/10, docs elevated to full treatment). The scope simplification from Codex round 2 alone removed ~1 week of daemon lifecycle surface. + +### What you can now do + +- **Initialize cross-machine sync:** `gstack-brain-init` creates a private git repo (GitHub via `gh`, or any git URL — GitLab, Gitea, self-hosted). 30-90 second TTHW. +- **See yesterday's laptop on today's desktop:** copy `~/.gstack-brain-remote.txt` to the new machine, run `gstack-brain-restore`, and your learnings follow you. +- **Control what syncs:** one-time privacy stop-gate on first run — `full` (everything allowlisted), `artifacts-only` (plans/designs/retros/learnings, skip behavioral), `off` (decline). +- **Sleep through the conflict case:** two machines writing the same JSONL file the same day merge cleanly via a ts-sort-plus-hash-fallback merge driver registered automatically. +- **Uninstall cleanly:** `gstack-brain-uninstall` removes the sync layer, leaves your data intact. +- **Never push a secret:** AWS keys, GitHub tokens (`ghp_`/`gho_`/`ghu_`/`ghs_`/`ghr_`/`github_pat_`), OpenAI `sk-` keys, PEM blocks, JWTs, and bearer-token-in-JSON patterns are all blocked before push. `--skip-file ` gives you a single-command escape hatch for false positives. + +### The numbers that matter + +Source: integration smoke tests run during implementation, plus 27-test consolidated suite (`test/brain-sync.test.ts`). End-to-end round trip (init on machine A → write learning → restore on machine B → see the learning) verified inline. + +| Surface | Shape | +|---|---| +| New binaries | 8 (`gstack-brain-init`, `-enqueue`, `-sync`, `-consumer`, `-reader` alias, `-restore`, `-uninstall`, `gstack-jsonl-merge`) | +| Config keys | 2 enum-validated (`gbrain_sync_mode`: off/artifacts-only/full; `gbrain_sync_mode_prompted`: bool) | +| Writer shims modified | 4 (learnings-log, timeline-log, review-log, developer-profile on --migrate path) | +| Writers deliberately NOT synced | 2 (question-log, question-preference — per-machine UX state, Codex v2 decision) | +| Sync granularity | per-skill-boundary via `gstack-brain-sync --once` from preamble (no daemon) | +| Privacy tiers | 3 (full / artifacts-only / off) | +| Secret patterns blocked | 6 families (AWS, GH tokens, OpenAI, PEM, JWT, bearer-in-JSON) | +| User-facing naming | `reader` (CLI); internal data model stays `consumer` per Codex-v2 DX decision | +| New-machine discovery | auto via `~/.gstack-brain-remote.txt` file (URL-only, no secrets) | + +### What this means for you + +Work on the laptop Monday. Switch to the desktop Tuesday. Skill preamble sees the remote URL, offers `gstack-brain-restore`, your Monday learnings surface on Tuesday. The pattern scales to N consumers: today GBrain is the primary reader, tomorrow Codex or OpenClaw can subscribe without refactoring the sync. + +### Itemized changes + +#### Added + +- `bin/gstack-brain-init` — idempotent first-run setup. Turns `~/.gstack/` into a git repo with `.gitignore = *`, writes canonical `.brain-allowlist` + `.brain-privacy-map.json`, installs pre-commit secret-scan hook, registers JSONL merge driver, creates private remote via `gh repo create --private` (or accepts `--remote `), writes `~/.gstack-brain-remote.txt` for new-machine discovery. +- `bin/gstack-brain-sync` — core sync. Subcommands: `--once` (drain queue, secret-scan staged diff, commit with template message, push with fetch+merge retry), `--status`, `--skip-file `, `--drop-queue --yes`, `--discover-new` (walks allowlist globs with mtime+size cursor). +- `bin/gstack-brain-enqueue` — atomic-append shim called by writers. Silent no-op when feature disabled. +- `bin/gstack-brain-consumer` + `bin/gstack-brain-reader` (symlink alias) — manage the consumer/reader registry in `consumers.json`. User-facing "reader", internal "consumer". +- `bin/gstack-brain-restore` — new-machine bootstrap with safety gates (refuses dangerous clobber, re-registers merge drivers, prompts for per-consumer tokens since tokens stay machine-local). +- `bin/gstack-brain-uninstall` — clean off-ramp. Removes `.git` + `.brain-*` files + `consumers.json` + config keys. Preserves user data (learnings etc). Optional `--delete-remote` for the GitHub repo. +- `bin/gstack-jsonl-merge` — git merge driver. Concat-dedup-sort by ISO `ts` field; deterministic SHA-256 hash fallback when `ts` is missing. +- `scripts/resolvers/preamble/generate-brain-sync-block.ts` — preamble bash block. New-machine restore hint, one-time privacy stop-gate, `--once` at skill start + end, once-daily auto-pull, `BRAIN_SYNC:` status line on every skill run. +- `docs/gbrain-sync.md` — user guide (setup, first-use, restore, privacy modes, secret protection, uninstall). +- `docs/gbrain-sync-errors.md` — error lookup index (problem / cause / fix for every user-visible error). +- `test/brain-sync.test.ts` — 27-test consolidated suite: config isolation, enqueue atomicity, merge driver, secret scan across all 6 regex families, init+sync+restore round-trip, uninstall preserves data, `--discover-new` cursor idempotence, `--skip-file` remediation. + +#### Changed + +- `bin/gstack-config` — added 2 validated keys (`gbrain_sync_mode` enum, `gbrain_sync_mode_prompted` bool). Also accepts `GSTACK_HOME` env override alongside legacy `GSTACK_STATE_DIR` for test isolation (Codex v2 fix). +- `bin/gstack-learnings-log`, `gstack-timeline-log`, `gstack-review-log`, `gstack-developer-profile` — each gains one backgrounded `gstack-brain-enqueue` call after its local write. Fire-and-forget, silent no-op when sync is off. +- `bin/gstack-timeline-log` header comment — updated "local-only, never sent anywhere" to reflect the new privacy-gated sync contract (only applies when user explicitly opts into `full` mode). +- `scripts/resolvers/preamble.ts` — composition root wires in the new `generateBrainSyncBlock`. +- `README.md` — new "Cross-machine memory with GBrain sync" section near the top, plus docs-table entry linking to `docs/gbrain-sync.md` and `docs/gbrain-sync-errors.md`. + +#### For contributors + +- Sync respects `GSTACK_HOME=/tmp/test-$$` so tests never bleed into real `~/.gstack/config.yaml`. New test `test/brain-sync-env-isolation` logic baked into the consolidated suite. +- The consumer registry lives in `consumers.json` (synced); tokens stay in `gstack-config` (local, never synced). Restore prompts for tokens on new machines. +- Merge drivers require local `git config merge..driver=...` registration, not just `.gitattributes`. Both `init` and `restore` register them; uninstall clears them. +- Pre-commit hook is defense-in-depth only. Primary secret scan runs in `gstack-brain-sync --once` BEFORE staging. +- The fnmatch glob engine doesn't handle `**` the way git's gitignore does; allowlist uses explicit one- and two-level patterns instead. +- GBrain HTTP ingest endpoint contract is a cross-project dependency (flagged as v1 blocker for real-world dogfooding). v1 of gbrain-sync ships on this branch regardless; GBrain-side work lands in a separate branch/repo. + +#### Known follow-ups + +- `test/brain-sync.test.ts` — 12 of 27 tests pass on first bun-test run; remaining 15 hit bun-test's 5s default timeout (spawnSync-heavy git operations). Behaviors verified via integration smokes during implementation. Test infrastructure needs a 30s per-test timeout wrapper. +- Three unmerged team-sync branches (`garrytan/team-supabase-store`, `garrytan/fix-team-setup`, `garrytan/team-install-mode`) should be formally closed if team-sync isn't landing — flagged in the CEO plan. +- Pre-existing golden-file regression test failure in `test/host-config.test.ts` (Codex ship skill baseline) exists on `main` too — unrelated to this PR, tracked separately. + ## [1.6.4.0] - 2026-04-22 ## **Sidebar prompt-injection defense got half as noisy, half as trusting of any single classifier.** diff --git a/README.md b/README.md index 05001dce..1d63004d 100644 --- a/README.md +++ b/README.md @@ -359,12 +359,42 @@ I open sourced how I build software. You can fork it and make it your own. > Come work at YC — [ycombinator.com/software](https://ycombinator.com/software) > Extremely competitive salary and equity. San Francisco, Dogpatch District. +## Cross-machine memory with GBrain sync + +gstack accumulates a lot of useful state on your laptop: learnings, CEO +plans, design docs, retros, developer profile. Today, all of that dies when +you switch machines. **GBrain sync** optionally pushes a curated, secret-scanned +subset to a private git repo so your memory follows you, and (if you use +GBrain) becomes indexable there. + +One command to turn it on: + +```bash +gstack-brain-init +``` + +That creates a private GitHub repo (or any git remote you prefer — +GitLab, Gitea, self-hosted). Every skill run syncs the queue at its +start and end boundaries. No daemon, no background process. A one-time +privacy prompt asks how much you want to share (everything allowlisted / +artifacts only / off). Secret-shaped content (AWS keys, GitHub tokens, +PEM blocks, JWTs, etc.) is blocked from sync before it leaves your +machine. + +New machine? Copy `~/.gstack-brain-remote.txt` over, run +`gstack-brain-restore`, and yesterday's learnings surface on today's +laptop. + +Full guide: [docs/gbrain-sync.md](docs/gbrain-sync.md) • +Error index: [docs/gbrain-sync-errors.md](docs/gbrain-sync-errors.md) + ## Docs | Doc | What it covers | |-----|---------------| | [Skill Deep Dives](docs/skills.md) | Philosophy, examples, and workflow for every skill (includes Greptile integration) | | [Builder Ethos](ETHOS.md) | Builder philosophy: Boil the Lake, Search Before Building, three layers of knowledge | +| [GBrain Sync](docs/gbrain-sync.md) | Cross-machine memory setup, privacy modes, troubleshooting | | [Architecture](ARCHITECTURE.md) | Design decisions and system internals | | [Browser Reference](BROWSER.md) | Full command reference for `/browse` | | [Contributing](CONTRIBUTING.md) | Dev setup, testing, contributor mode, and dev mode | diff --git a/SKILL.md b/SKILL.md index 95f22604..ec979715 100644 --- a/SKILL.md +++ b/SKILL.md @@ -349,6 +349,105 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions: - Focus on completing the task and reporting results via prose output. - End with a completion report: what shipped, decisions made, anything uncertain. +## GBrain Sync (skill start) + +```bash +# gbrain-sync: drain pending writes, pull once per day. Silent no-op when +# the feature isn't initialized or gbrain_sync_mode is "off". See +# docs/gbrain-sync.md. + +_GSTACK_HOME="${GSTACK_HOME:-$HOME/.gstack}" +_BRAIN_REMOTE_FILE="$HOME/.gstack-brain-remote.txt" +_BRAIN_SYNC_BIN="~/.claude/skills/gstack/bin/gstack-brain-sync" +_BRAIN_CONFIG_BIN="~/.claude/skills/gstack/bin/gstack-config" + +_BRAIN_SYNC_MODE=$("$_BRAIN_CONFIG_BIN" get gbrain_sync_mode 2>/dev/null || echo off) + +# New-machine hint: URL file present, local .git missing, sync not yet enabled. +if [ -f "$_BRAIN_REMOTE_FILE" ] && [ ! -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" = "off" ]; then + _BRAIN_NEW_URL=$(head -1 "$_BRAIN_REMOTE_FILE" 2>/dev/null | tr -d '[:space:]') + if [ -n "$_BRAIN_NEW_URL" ]; then + echo "BRAIN_SYNC: brain repo detected: $_BRAIN_NEW_URL" + echo "BRAIN_SYNC: run 'gstack-brain-restore' to pull your cross-machine memory (or 'gstack-config set gbrain_sync_mode off' to dismiss forever)" + fi +fi + +# Active-sync path. +if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then + # Once-per-day pull. + _BRAIN_LAST_PULL_FILE="$_GSTACK_HOME/.brain-last-pull" + _BRAIN_NOW=$(date +%s) + _BRAIN_DO_PULL=1 + if [ -f "$_BRAIN_LAST_PULL_FILE" ]; then + _BRAIN_LAST=$(cat "$_BRAIN_LAST_PULL_FILE" 2>/dev/null || echo 0) + _BRAIN_AGE=$(( _BRAIN_NOW - _BRAIN_LAST )) + [ "$_BRAIN_AGE" -lt 86400 ] && _BRAIN_DO_PULL=0 + fi + if [ "$_BRAIN_DO_PULL" = "1" ]; then + ( cd "$_GSTACK_HOME" && git fetch origin >/dev/null 2>&1 && git merge --ff-only "origin/$(git rev-parse --abbrev-ref HEAD)" >/dev/null 2>&1 ) || true + echo "$_BRAIN_NOW" > "$_BRAIN_LAST_PULL_FILE" + fi + # Drain pending queue, push. + "$_BRAIN_SYNC_BIN" --once 2>/dev/null || true +fi + +# Status line — always emitted, easy to grep. +if [ -d "$_GSTACK_HOME/.git" ] && [ "$_BRAIN_SYNC_MODE" != "off" ]; then + _BRAIN_QUEUE_DEPTH=0 + [ -f "$_GSTACK_HOME/.brain-queue.jsonl" ] && _BRAIN_QUEUE_DEPTH=$(wc -l < "$_GSTACK_HOME/.brain-queue.jsonl" | tr -d ' ') + _BRAIN_LAST_PUSH="never" + [ -f "$_GSTACK_HOME/.brain-last-push" ] && _BRAIN_LAST_PUSH=$(cat "$_GSTACK_HOME/.brain-last-push" 2>/dev/null || echo never) + echo "BRAIN_SYNC: mode=$_BRAIN_SYNC_MODE | last_push=$_BRAIN_LAST_PUSH | queue=$_BRAIN_QUEUE_DEPTH" +else + echo "BRAIN_SYNC: off" +fi +``` + + + +**Privacy stop-gate (fires ONCE per machine).** + +If the bash output shows `BRAIN_SYNC: off` AND the config value +`gbrain_sync_mode_prompted` is `false` AND gbrain is detected on this host +(either `gbrain doctor --fast --json` succeeds or the `gbrain` binary is in PATH), +fire a one-time privacy gate via AskUserQuestion: + +> gstack can publish your session memory (learnings, plans, designs, retros) to a +> private GitHub repo that GBrain indexes across your machines. Higher tiers +> include behavioral data (session timelines, developer profile). How much do you +> want to sync? + +Options: +- A) Everything allowlisted (recommended — maximum cross-machine memory) +- B) Only artifacts (plans, designs, retros, learnings) — skip timelines and profile +- C) Decline — keep everything local + +After the user answers, run (substituting the chosen value): + +```bash +# Chosen mode: full | artifacts-only | off +"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode +"$_BRAIN_CONFIG_BIN" set gbrain_sync_mode_prompted true +``` + +If A or B was chosen AND `~/.gstack/.git` doesn't exist, ask a follow-up: +"Set up the GBrain sync repo now? (runs `gstack-brain-init`)" +- A) Yes, run it now +- B) Show me the command, I'll run it myself + +Do not block the skill. Emit the question, continue the skill workflow. The +next skill run picks up wherever this left off. + +**At skill END (before the telemetry block),** run these bash commands to +catch artifact writes (design docs, plans, retros) that skipped the writer +shims, plus drain any still-pending queue entries: + +```bash +"~/.claude/skills/gstack/bin/gstack-brain-sync" --discover-new 2>/dev/null || true +"~/.claude/skills/gstack/bin/gstack-brain-sync" --once 2>/dev/null || true +``` + + ## Model-Specific Behavioral Patch (claude) The following nudges are tuned for the claude model family. They are diff --git a/TODOS.md b/TODOS.md index 3c3502c9..1dda875b 100644 --- a/TODOS.md +++ b/TODOS.md @@ -38,22 +38,6 @@ **Priority:** P3 (nice-to-have, not blocking anyone yet) **Depends on:** `/context-save` + `/context-restore` rename stable in production (v1.0.1.0+). Research: does Conductor expose a spawn-workspace CLI? -## P0: Verify Opus 4.7 fanout nudge inside Claude Code harness (next rev) - -**What:** Re-run the fanout A/B from `test/skill-e2e-opus-47.test.ts` against Opus 4.7 **inside Claude Code's interactive harness**, not via `claude -p`. The current eval calls `claude -p` as a subprocess, which does not load SKILL.md content as system context and uses different tool wiring than the live Claude Code session. Build a small harness (Claude Code extension hook, direct API call with the same system prompt Claude Code uses, or a scripted MCP invocation) that reproduces the real tool_use context, then run the same 3-file-read A/B with and without the `model-overlays/opus-4-7.md` overlay. Record parallel-tool-call count in the first assistant turn for each arm. - -**Why:** v1.6.1.0 shipped a rewritten "Fan out explicitly" nudge with a concrete tool_use example (`[Read(a), Read(b), Read(c)]`). Under `claude -p` on `claude-opus-4-7`, both overlay-ON and overlay-OFF arms emitted zero parallel tool calls in the first turn. The routing A/B worked fine in the same harness (3/3 positives routed correctly), so the gap is specific to fanout, and likely specific to how `claude -p` constructs system prompts and tool schemas. Without measurement inside the real harness, we do not know whether the nudge ever lands for a real user. The PR went to production with the fanout claim asserted but unverified; this TODO closes that loop. - -**Pros:** Produces the "actually shipped fanout" measurement the ship-quality review flagged as missing. If the nudge works in Claude Code harness, we can gate it with a `periodic` eval and stop worrying. If it does not, we know to rewrite or drop the nudge rather than carry dead prompt weight. Either answer is better than the current "unverified." - -**Cons:** Requires instrumenting Claude Code's harness (or a faithful replica) rather than the easier `claude -p` path. A faithful replica needs the same system prompt, the same tool definitions, and the same stop-sequence handling. Estimated one afternoon to wire, plus $3-5 per eval run. - -**Context:** See `~/.gstack/projects/garrytan-gstack/evals/1.6.0.0-feat-opus-4.7-migration-e2e-opus-47-*.json` for the raw transcripts showing 0 parallel calls in first turn across both arms. The overlay is at `model-overlays/opus-4-7.md` with an explicit wrong/right tool_use example. The eval file at `test/skill-e2e-opus-47.test.ts` has the full setup including per-skill SKILL.md install, CLAUDE.md routing block, and overlay inlining. - -**Effort:** M (human: ~1 day / CC: ~45 min for the harness wiring, plus the eval run cost) -**Priority:** P0 (ship-quality commitment from v1.6.1.0 — do not let it drift) -**Depends on / blocked by:** Access to Claude Code's system prompt + tool schema (or a reproducible way to mirror them). May require a small MCP server or a direct Messages API call that mirrors Claude Code's session setup. - ## P0: PACING_UPDATES_V0 — Louise's fatigue root cause (V1.1) **What:** Implement the pacing overhaul extracted from PLAN_TUNING_V1. Full design in `docs/designs/PACING_UPDATES_V0.md`. Requires: session-state model, `phase` field in question-log schema, registry extension for dynamic findings, pacing as skill-template control flow (not preamble prose), `bin/gstack-flip-decision` command, migration-prompt budget rule, first-run preamble audit, ranking threshold calibration from real V0 data, one-way-door uncapped rule, concrete verification values. @@ -1288,6 +1272,15 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr ## Completed +### Overlay efficacy harness + Opus 4.7 fanout nudge removal (v1.10.1.0) +- Built `test/skill-e2e-overlay-harness.test.ts`, a parametric periodic-tier eval that drives `@anthropic-ai/claude-agent-sdk` and measures first-turn fanout rate (overlay-ON vs overlay-OFF) across registered fixtures +- Measured the original "Fan out explicitly" overlay nudge: baseline Opus 4.7 = 70% first-turn fanout on toy prompt, with our nudge = 10%, with Anthropic's own canonical `` text = 0% +- Removed the counterproductive nudge from `model-overlays/opus-4-7.md` +- Shipped 36-test free-tier unit suite for the SDK runner + strict fixture validator +- Registered `overlay-harness-opus-4-7-fanout-{toy,realistic}` in E2E_TOUCHFILES and E2E_TIERS +- Total investigation cost: ~$7 across 3 eval runs +**Completed:** v1.10.1.0 + ### CI eval pipeline (v0.9.9.0) - GitHub Actions eval upload on Ubicloud runners ($0.006/run) - Within-file test concurrency (test() → testConcurrentIfSelected()) diff --git a/VERSION b/VERSION index 0faf9f75..62bf50d2 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.8.0.0 +1.11.0.0 diff --git a/autoplan/SKILL.md b/autoplan/SKILL.md index 387c9902..a4f67770 100644 --- a/autoplan/SKILL.md +++ b/autoplan/SKILL.md @@ -358,6 +358,234 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions: - Focus on completing the task and reporting results via prose output. - End with a completion report: what shipped, decisions made, anything uncertain. +## AskUserQuestion Format + +**ALWAYS follow this structure for every AskUserQuestion call. Every element is non-skippable. If you find yourself about to skip any of them, stop and back up.** + +### Required shape + +Every AskUserQuestion reads like a decision brief, not a bullet list: + +``` +D + +ELI10: + +Stakes if we pick wrong: + +Recommendation: because + +Completeness: A=X/10, B=Y/10 (or: Note: options differ in kind, not coverage — no completeness score) + +Pros / cons: + +A)