Commit Graph

2 Commits

Author SHA1 Message Date
Garry Tan 66f3a180d3 v1.43.2.0 fix wave: post-Daegu paper-cut — 18 fixes, 28 bisect commits (#1642)
* fix(gbrain-sync): --full produces an empty code index on first run of a new repo

`gbrain reindex-code` only RE-EMBEDS pages that already exist; it never walks
the filesystem. On a freshly-registered source (0 pages), a --full run that
called reindex-code alone found nothing ("No code pages to reindex"), finished
in ~1s, and left the code index permanently empty while still reporting OK.

Fix: --full now runs `sync --strategy code` FIRST to create pages via the file
walk, then runs `reindex-code` to honor the documented "full walk + reindex"
contract for both fresh and populated sources.

Contributed by @jetsetterfl via #1584.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gbrain-local-status): classifier falsely reports broken-db inside repos with their own DATABASE_URL

The freshClassify probe ran `gbrain sources list --json` with the inherited
process env. When the probe ran from inside a repo with its own .env (an app
DATABASE_URL on a different port), Bun autoloaded the project's .env, gbrain
connected to the wrong database, and the classifier reported broken-db on
otherwise-healthy brains.

Fix: route the probe env through `buildGbrainEnv` from lib/gbrain-exec, the
same helper the sync orchestrator uses. DATABASE_URL is seeded from
~/.gbrain/config.json so the result is cwd-independent. The 60s cache can no
longer propagate a poisoned negative to clean directories.

Contributed by @jetsetterfl via #1583.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(retro): stale-base + bad-today-anchor pre-flight guard (#1624)

/retro silently produced confidently-wrong output when "today" drifted (model
session-context error) or when origin/<default> was materially behind the
actual remote — git log --since returned zero or near-zero commits and the
narrative was fabricated from nothing.

Adds Step 0.5 with four ordered pre-check branches before any window analysis:

  A. No 'origin' remote → skip with "base freshness not verified" note
  B. Detached HEAD → skip with "base freshness not verified" note
  C. `git fetch origin <default>` fails (offline) → warn, proceed against
     last-known origin/<default>
  D. Fetch succeeded → compare today vs latest origin/<default> commit; if
     gap > window-days, BLOCK with explicit citation of latest-commit date.

Skip paths still proceed to Step 1, but the disclosure is carried into the
retro narrative ("offline run, window not freshness-verified") so the output
is never silently confidently-wrong.

Atomic .tmpl + gen:skill-docs regen commit (T-Codex-3 pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(retro): regression for #1624 stale-base pre-flight guard

13 static-invariant tests pinning the four ordered pre-check branches in
retro/SKILL.md.tmpl:Step 0.5:

  A. no-remote skip            — must check origin presence + set verdict
  B. detached-HEAD skip        — must gate behind prior verdict (ordering)
  C. fetch-fail warn           — must match `if !` or `||` shape, gate by verdict
  D. stale-base BLOCK          — must read latest-commit ISO date, cite remediation

Plus a disclosure-survives-to-narrative invariant: skip-path verdicts must be
named in prose so the retro output carries the cited reason rather than
silently misreporting.

Failing build if Step 0.5 is removed, branches re-ordered (no-remote no longer
wins), or the BLOCK message stops citing today/latest-commit/remediation
path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gbrain-sync): configurable timeouts + resume from gbrain checkpoint (#1611)

The memory and code stages hardcoded a 35-min spawn timeout. On brains with
~2000+ staged files, /sync-gbrain --full reliably SIGTERM'd the child at
exactly 35 minutes with exit 143. gbrain left ~/.gbrain/import-checkpoint.json
pointing at the staging dir, but gstack-memory-ingest's SIGTERM handler
unconditionally cleaned the dir up — so the next run found a checkpoint
pointing at nothing and restaged from scratch, repeating the SIGTERM forever.

Three changes:

1. Configurable timeouts via env (bounds 60_000ms - 86_400_000ms, default
   2_100_000ms = 35min unchanged):
     GSTACK_SYNC_MEMORY_TIMEOUT_MS
     GSTACK_SYNC_CODE_TIMEOUT_MS
   Out-of-range or non-numeric values warn and fall back to the default.

2. SIGTERM in gstack-memory-ingest no longer always cleans up the staging
   dir. If gbrain has written ~/.gbrain/import-checkpoint.json pointing at
   the active staging dir, the dir is PRESERVED for next-run resume.
   Otherwise (no checkpoint pointing here, crash before gbrain ever
   touched it) it's cleaned up as before.

3. Next /sync-gbrain run detects gbrain's checkpoint via decideResume() in
   gstack-gbrain-sync.ts:
     - no checkpoint               → fresh ingest pass
     - checkpoint + staging ok     → set GSTACK_INGEST_RESUME_DIR; child
                                      reuses staging dir and skips
                                      writeStaged; gbrain import resumes
                                      from processedIndex+1
     - checkpoint + staging gone   → warn "previous checkpoint stale
                                      (staging dir gone), restaging from
                                      scratch" and proceed

Reuses gbrain's own checkpoint as the source of truth (D1 — no double-store
state). Detect-then-fallback semantics per C1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(gbrain-sync): regression for #1611 timeouts + resume

19 tests across three surfaces:

  - resolveStageTimeoutMs (10 tests): undefined/empty → default; non-numeric,
    zero, negative, below-floor, above-ceiling → warn + default; at-floor,
    at-ceiling, valid mid-range → accepted as-is.

  - decideResume (6 tests): no checkpoint, corrupt JSON, checkpoint + staging
    ok, checkpoint + staging missing, checkpoint with no dir, checkpoint with
    empty dir.

  - SIGTERM staging preservation (3 static invariants): memory-ingest signal
    handler must check stagingDirIsCheckpointed BEFORE cleanup; preserve
    branch must come before cleanup branch (ordering); orchestrator must
    pass GSTACK_INGEST_RESUME_DIR to the grandchild on resume.

Also threads process.env.HOME through readGbrainCheckpoint and
stagingDirIsCheckpointed so tests can redirect home. os.homedir() caches
at process start and ignores later mutation, so the env override is the
only reliable test injection point.

Failing build if the timeout bounds are removed, the resume detection
short-circuits incorrectly, or the SIGTERM handler regresses to
unconditional cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): pre-emit verification gate kills Django-shape FP class (#1539)

External user filed 4/8 false positives on a /review run against a Django +
DRF + PostgreSQL repo (Sprint 2.5). Every FP class was the same shape:
"resolvable in <5 minutes by viewing the actual code or running a simple
grep" — fields that don't exist on the model, dict.get()-might-be-None on a
form that returns {}-initialized cleaned_data, standard ORM save behavior
called out as data loss.

Extends the Confidence Calibration resolver (consumed by review, cso,
plan-eng-review, ship) with a Pre-emit verification gate:

  Every finding MUST quote the specific code line that motivates it
  (file:line + verbatim text). If the reviewer cannot produce the quote,
  the finding is unverified — its confidence is forced to 4-5 so the
  existing "Suppress from main report" rule fires automatically. The
  finding still goes to the appendix for calibration audit, but the user
  does not see it in the critical-pass output.

Reuses the existing suppression mechanism — no new code path. The FP
classes the gate kills are enumerated in the resolver text so reviewers
see the named patterns.

Framework-meta nudge included for Django Meta, Rails associations,
SQLAlchemy relationships, TypeORM decorators, Sequelize init, Prisma
generated client — the reviewer must quote the meta-construct that
generates the symbol, not just grep for the literal name. Deeper
framework-aware ORM verification (model introspection, migration-history-
aware checks) is deliberately deferred to a future wave per T-Codex-2.

Atomic .tmpl-equivalent (resolver) edit + gen:skill-docs regen commit
per T-Codex-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(review): regression for #1539 pre-emit verification gate

12 tests pinning the gate behavior:

  - Resolver emits the gate header + #1539 reference
  - Gate requires quoting file:line + verbatim text
  - Unverified findings forced to confidence 4-5 (auto-suppress via
    existing <7-rule, no new mechanism)
  - Framework-meta nudge names Django, Rails, SQLAlchemy, TypeORM,
    Sequelize, Prisma
  - Deferred design doc reference present (1539-framework-aware-review.md)
  - Four named FP classes from #1539 enumerated:
      * field doesn't exist on model
      * dict.get() might be None
      * save() might lose fields
      * update_fields might miss X
  - All four downstream SKILL.md consumers (review, cso, plan-eng-review,
    ship) carry the gate text after gen:skill-docs
  - Existing confidence 9-10 'Show normally' + 3-4 'Suppress' rows
    unchanged (regression on existing behavior)

Failing build if the gate is removed, the suppression mechanism is
re-invented separately, the framework-meta nudge drops a framework, or
gen:skill-docs stops propagating the gate to consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(config): expose explain_level default

* fix(benchmark): parse positional prompt after flags

* fix(artifacts): reject malformed remote paths

* fix(learnings): preserve current entries in cross-project search

* fix(setup): register root gstack slash alias

* fix(memory): probe gitleaks without shell builtin

* fix(gbrain-lib): pin LC_ALL=C in varname validator (macOS locale guard)

In many macOS shells the default locale (e.g. en_US.UTF-8) makes bash
glob brackets like `[A-Z]` match lowercase letters too, so the existing
`case "$name" in [A-Z_][A-Z0-9_]*)` branch lets names like `lower-case`
through validation. The function then trips `printf -v "$varname"` and
`export "$varname"` with `not a valid identifier` errors that surface
mid-prompt, which is exactly what the validator was supposed to prevent.

Pinning `LC_ALL=C` inside the function gives ASCII-only bracket semantics
on both macOS and Linux, matching the documented `[A-Z_][A-Z0-9_]*`
contract. Declared `local` so it doesn't leak to the calling shell —
`gstack-gbrain-lib.sh` is documented as a sourced helper, so a bare
assignment would mutate the caller's locale for the rest of the process
(silently affecting downstream `sort`, `tr`, locale-aware globs in the
same shell, etc.).

The existing regression test
`test/gbrain-lib-verify.test.ts:'rejects invalid var names'`
already covers the macOS repro shape (passes `lower-case` and expects
the validator to reject + emit `invalid var name`). On Linux CI the
test silently passed because `LC_ALL=C` is the typical default; on
macOS dev boxes it fails.

Verified:
- `bun test test/gbrain-lib-verify.test.ts`: 22 pass, 0 fail (on macOS).
- `_gstack_gbrain_validate_varname lower-case; echo $?` → 2.
- `_gstack_gbrain_validate_varname FOO_BAR; echo $?` → 0.
- Caller's LC_ALL preserved across calls (confirmed via sourced bash).

* fix(land-and-deploy): detect merged PR after gh failure

After `gh pr merge` exits non-zero, the PR may already be MERGED server-side
(concurrent merge landed, or local cleanup phase failed AFTER the merge
succeeded). Calling `gh pr merge` a second time then errors with a confusing
"already merged" — and worse, the deploy workflow never runs because we
stopped on the first failure.

Adds a Post-failure PR-state check (§4a-postfail) that runs after ANY
non-zero exit from `gh pr merge`:

  - state == MERGED  → record MERGE_PATH=direct, OFFER (don't force)
                       stale-worktree cleanup on the base branch with
                       uncommitted-work guard, proceed to §4a CI watch
  - state == OPEN    → check autoMergeRequest; if non-null treat as
                       merge-queue wait; if null surface both errors and STOP
  - state == CLOSED  → STOP

Hard invariant: never retry `gh pr merge` after a non-zero exit. Server
state is authoritative.

Re-authored from PR #1620 into land-and-deploy/SKILL.md.tmpl (the source of
truth) instead of the generated SKILL.md, so the next gen:skill-docs run
preserves the change. Original diff by @davidfoy via #1620.

Related: cli/cli#3442, cli/cli#13380.

Contributed by @davidfoy via #1620.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: detect PgBouncer transaction-mode pooler and set GBRAIN_PREPARE=true (#1435)

When gbrain connects through a PgBouncer transaction-mode pooler (port
6543), it auto-disables prepared statements. This breaks `gbrain search`
silently — the /sync-gbrain capability check fails and the GBrain Search
Guidance block never gets written to CLAUDE.md.

Three-layer fix:

1. **lib/gbrain-exec.ts** — `buildGbrainEnv()` now detects port 6543 in
   the effective DATABASE_URL and sets `GBRAIN_PREPARE=true` in the env
   passed to every gbrain spawn. This is the single chokepoint — all
   gstack gbrain invocations inherit the fix. Caller can opt out with
   `GBRAIN_PREPARE=false`.

2. **sync-gbrain/SKILL.md{,.tmpl}** — capability check now exports
   `GBRAIN_PREPARE=true` explicitly and retries search up to 3x with 1s
   delay for async index propagation under connection pooling.

3. **bin/gstack-gbrain-detect** — surfaces `gbrain_pooler_mode` field
   ("transaction" | "session" | null) in the preamble probe JSON so
   /setup-gbrain and /sync-gbrain can advise users about pooler state.

Closes #1435

Built with [ClosedLoop.AI](https://closedloop.ai) | [GitHub](https://github.com/closedloop-ai/claude-plugins)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(supabase-provision): rewrite transaction/6543 -> session/5432 for new projects

- Single-object pooler API responses default to transaction-mode at 6543,
  but the shared pooler tenant on new projects only listens on session/5432
- Add a `pool_mode == transaction && db_port == 6543` rewrite + stderr note
- Escape hatch via `GSTACK_SUPABASE_TRUST_API_PORT=1` for forward-compat
- 5 new tests covering rewrite, no-op shapes, env opt-out, array path

Fixes #1301.

* fix(browse): GSTACK_CHROMIUM_NO_SANDBOX opt-out for Ubuntu/AppArmor (#1562)

Ubuntu/AppArmor configurations often block unprivileged Chromium sandboxing
for headless agent sessions even for normal users — /qa hangs without
--no-sandbox. The kernel policy denies the unprivileged user namespaces
Chromium needs.

Adds GSTACK_CHROMIUM_NO_SANDBOX=1 as an explicit user override that forces
the sandbox off without changing the default for everyone else. Re-authored
from PR #1562 onto v1.42.2.0's shouldEnableChromiumSandbox() helper —
purely additive, preserves the headed-launch sandbox-on-by-default behavior
that v1.42.2.0 shipped to kill the --no-sandbox yellow infobar.

Three new regression tests cover:
  - linux + override=1 → false (the named use case)
  - darwin + override=1 → false (env wins on any platform)
  - override=0 → does NOT trigger (must be exactly "1")

Original diff by @techcenter68 via #1562.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): mirror isCustomChromium() guard in headless launch()

When BROWSE_EXTENSIONS_DIR is set alongside GSTACK_CHROMIUM_PATH pointing
at a baked-extension build (GBrowser / GStack Browser), the headless launch()
path was unconditionally adding --disable-extensions-except / --load-extension.
This causes the same ServiceWorkerState::SetWorkerId DCHECK crash that
launchHeaded() already guards against via isCustomChromium().

Mirror the existing guard: skip --load-extension flags when isCustomChromium()
returns true; always push the off-screen window geometry args.

* fix(browse): daemonize macOS/Linux server via setsid()

`Bun.spawn().unref()` only releases the child from Bun's event loop —
it does NOT call setsid(). The spawned bun server inherits the spawning
shell's process session. When the CLI runs inside a session-managed shell
that exits shortly after the CLI returns (Claude Code's per-command Bash
sandbox, Conductor, OpenClaw, CI step runners), the session leader's exit
sends SIGHUP to every PID in the session — killing the bun server and
its Chromium grandchildren within seconds of a successful `connect`.

Setting `BROWSE_PARENT_PID=0` (already done by the `connect` command and
pair-agent) disables the parent-process watchdog but does NOT save the
server here: SIGHUP from session teardown still reaps it.

Replace the macOS/Linux `Bun.spawn().unref()` with Node's
`child_process.spawn({ detached: true })`, which calls setsid() and
gives the server its own session leader role (PPID=1, STAT=Ss). This
mirrors the Windows path's rationale (PR #191 by @fqueiro) — same root
cause, different OS surface.

Verified on macOS in Conductor: pre-fix the server dies ~10–15s after
connect across separate Bash invocations; post-fix the same PID stays
alive (PPID=1, SESS=0, STAT=Ss) and responds to `status`/`goto`/
`snapshot` across many separate shell calls.

The `proc?.stderr` startup-error branch is removed since both platforms
now spawn with `stdio: 'ignore'`; both fall through to the on-disk
`browse-startup-error.log` written by `server.ts`'s start().catch.

* fix(design): bump image-gen timeout to 240s + pin gpt-image-2

The design binary calls /v1/responses (gpt-4o + image_generation tool,
quality:high, 1536x1024) but aborted the request after a hardcoded 120s.
That class of request consistently takes ~140-160s end-to-end, so every
generate/variants/evolve/iterate call aborted before the image returned.

In /design-shotgun this cascades: Step 3c launches N parallel agents,
each calling `$D generate`, each aborts at 120s and retries, all fail,
the comparison board never opens — the skill appears to hang indefinitely.

Reproduced the exact API call with a longer budget: HTTP 200, valid
image, 143.5s. A real /design-shotgun run after the patch generated 3
variants in parallel at 150.0s / 161.0s / 152.1s, all exit 0 — note the
161s case, which a naive 150s bump would still have failed.

- Bump AbortController timeout 120_000 -> 240_000 in generate.ts,
  variants.ts, evolve.ts, iterate.ts (both call sites)
- Pin the image_generation tool to model "gpt-image-2"

design/test/variants-retry-after.test.ts: 5 pass, 0 fail. The
feedback-roundtrip.test.ts failures are a pre-existing browse-module
breakage (session.clearLoadedHtml undefined), unrelated to this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: fill coverage gaps for PRs #1606, #1612, #1620

Three cherry-picked PRs in this wave landed without unit-test coverage for
the specific invariant they protect:

  #1606 (@andrey-esipov) — LC_ALL=C pin in _gstack_gbrain_validate_varname
    8 tests by sourcing bin/gstack-gbrain-lib.sh and calling the validator
    directly. Asserts uppercase/digit/underscore accepted, lowercase
    REJECTED (the macOS-locale regression case), mixed-case rejected,
    LC_ALL=C scoping is local (doesn't leak to caller).

  #1612 (@bharat2913) — setsid daemonize via Node child_process.spawn
    4 static-invariant tests on browse/src/cli.ts. The actual setsid
    syscall is hard to assert without a real spawn, so we pin the source
    shape: nodeSpawn imported from child_process; non-Windows branch uses
    nodeSpawn(...) with detached:true and .unref(); comment documents
    setsid/SIGHUP root cause; Bun.spawn() is NOT used on macOS/Linux.

  #1620 (@davidfoy, re-authored into .tmpl per A3) — §4a-postfail
    12 static invariants on land-and-deploy/SKILL.md.tmpl + generated
    SKILL.md. Pins all three state branches (MERGED/OPEN/CLOSED), the
    authoritative state query, the merge-SHA capture, non-destructive
    worktree cleanup with uncommitted-work guard, autoMergeRequest probe
    on OPEN, hard "never retry gh pr merge" rule, and atomic regen
    propagation.

Failing build if any of the three invariants regresses.

Note: gbrain-lib-validate-varname.test.ts also surfaces a pre-existing
glob-pattern overpermissiveness (hyphens + dots accepted) — not in
#1606's scope; documented inline as a separate cleanup target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(learnings): align injection-prevention tests with PR #1619 tagged-line shape

PR #1619 (preserve current entries in cross-project search) refactored
gstack-learnings-search to tag rows inline (`current\t<json>` vs
`cross\t<json>`) instead of filtering inside the bun block via
process.env.GSTACK_SEARCH_SLUG. The bun block no longer reads SLUG or
CROSS env vars — it parses the per-line tag and sets a per-entry
_crossProject flag.

The pre-existing test/learnings-injection.test.ts still asserted on the
old SLUG + CROSS env var shape. Updates:

  - Remove the SLUG env var assertion (no longer set on bash command line)
  - Remove the bun-block CROSS env var assertion (block reads the tag now,
    not the env)
  - Add a new positive assertion that the bun block parses the tag
    (sourceTag | tabIndex | crossProject)
  - Keep the shell-interpolation safety assertion unchanged — that's
    independent of the SLUG refactor

The CROSS env var is still SET on the bash command line (it controls
whether the cross-project find runs at all), but the bun child no longer
reads it. The existing "env vars set on bash command line" test continues
to pin that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship-SKILL.md golden baselines

ship/SKILL.md consumes the Confidence Calibration resolver via the
preamble pipeline. This wave's #1539 pre-emit verification gate extends
the resolver text, which propagated to ship/SKILL.md via gen:skill-docs.
The golden fixtures in test/fixtures/golden/ matched the pre-#1539 shape
and failed the host-config regression check.

Refreshes claude-ship-SKILL.md, codex-ship-SKILL.md, and factory-ship-SKILL.md
to match the current generated output. Matches the Daegu wave's bisect
commit 23 ("test(fixtures): regenerate ship-SKILL.md golden baselines").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(gbrain-detect): include gbrain_pooler_mode in schema regression (PR #1591)

PR #1591 (PgBouncer transaction-mode detection, @mikeangstadt) added
gbrain_pooler_mode to the gstack-gbrain-detect JSON output but did not
update the schema regression check in
test/gstack-gbrain-detect-mcp-mode.test.ts. Adding the key in alphabetical
order matching the rest of the schema array. Downstream sync-gbrain ignores
unknown keys, so this is forward-compat.

Without this, the test fails with a diff:
  + "gbrain_pooler_mode"
because keys is the actual set returned and the expected array was
pre-#1591.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): v1.43.0.0 — post-Daegu paper-cut wave

Bumps VERSION 1.42.2.0 → 1.43.0.0 (MINOR per scale-aware bump rules: new
env-var surface GSTACK_SYNC_*_TIMEOUT_MS + GSTACK_CHROMIUM_NO_SANDBOX,
behavior expansion in browse/src/browser-manager.ts headless launch,
three skill-template prompt changes affecting /retro, /review,
/sync-gbrain).

CHANGELOG entry leads with what stopped happening: /retro stops
fabricating retros against stale bases, /sync-gbrain stops SIGTERM-looping
35-min restarts on big brains, /review stops shipping framework FPs the
reviewer never grep'd.

18 fixes total — 15 community PRs + 3 self-filed silent-failure issues
(#1624, #1611, #1539) — in one bundled PR with 26 bisect commits and 7
new regression test files. Every wave-touched test file passes in
isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump v1.43.0.0 → v1.43.2.0 for queue collision

CI check-version-stale flagged v1.43.0.0 already claimed by PR #1574
(garrytan/colombo-v3). PR #1639 (garrytan/muscat-v3) claims v1.43.1.0.
Next available MINOR slot is v1.43.2.0.

Bump VERSION + package.json + CHANGELOG entry header. No behavior
changes — purely re-versioning to clear the queue collision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com>
Co-authored-by: Andrey Esipov <andrey.esipov@outlook.com>
Co-authored-by: David Foy <davidfoy@users.noreply.github.com>
Co-authored-by: mikeangstadt <mike.angstadt@closedloop.ai>
Co-authored-by: 0xDevNinja <manmit0x@gmail.com>
Co-authored-by: techcenter68 <techcenter68@users.noreply.github.com>
Co-authored-by: shohu <shohu33@gmail.com>
Co-authored-by: Bharat <bharat@theysaid.io>
Co-authored-by: Matteo Hertel <info@matteohertel.com>
2026-05-21 21:21:07 -07:00
Garry Tan 2014557e7f v1.12.0.0 feat: /setup-gbrain — coding-agent onboarding for gbrain (#1183)
* feat(setup-gbrain): add gstack-gbrain-repo-policy bin helper

Per-remote trust-tier store for the forthcoming /setup-gbrain skill.
Tiers are the D3 triad (read-write / read-only / deny), keyed by a
normalized remote URL so ssh-shorthand and https variants collapse to
the same entry. The file carries _schema_version: 2 (D2-eng); legacy
`allow` values from pre-D3 experiments auto-migrate to `read-write`
on first read, idempotent, with a one-shot log line.

Pure bash + jq to match the existing gstack-brain-* family. Atomic
writes via tmpfile + rename. Policy file mode 0600. Corrupt files
quarantine to .corrupt-<ts> and start fresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(setup-gbrain): unit tests for gstack-gbrain-repo-policy

24 tests covering normalize (ssh/https/shorthand/uppercase collapse to
one key), set/get round-trip, all three D3 tiers accepted, invalid
tiers rejected, file mode 0600, _schema_version field written on fresh
files, legacy allow migration (including idempotence and preservation
of non-allow entries), corrupt-JSON quarantine + fresh-file recovery,
list output sorting, and get-without-arg auto-detect against a git
repo with no origin.

All tests green against a per-test tmpdir GSTACK_HOME so nothing
leaks into the real ~/.gstack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(setup-gbrain): add gstack-gbrain-detect state reporter

Pure-introspection JSON emitter for the /setup-gbrain skill's
start-up branching. Reports: gbrain presence + version on PATH,
~/.gbrain/config.json existence + engine, `gbrain doctor --json`
health (wrapped in timeout 5s to match the /health D6 pattern),
gstack-brain-sync mode via gstack-config, and ~/.gstack/.git
presence for the memory-sync feature.

Never modifies state. Always emits valid JSON even when every check
is false. Handles malformed ~/.gbrain/config.json without crashing
— gbrain_engine is null in that case, not an error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(setup-gbrain): add gstack-gbrain-install with D5 detect-first + D19 PATH-shadow guard

Clones gbrain at a pinned commit (v0.18.2) and registers it via
`bun link`. Before any clone:

  D5 detect-first — probes ~/git/gbrain, ~/gbrain, and the install
  target for a valid pre-existing clone (package.json with name
  "gbrain" and bin.gbrain set). If one is found, `bun link` runs
  there instead of cloning a second copy. Prevents the day-one
  duplicate-install footgun on the skill author's own machine.

After install:

  D19 PATH-shadow guard — reads the install-dir's package.json
  version, compares to `gbrain --version` on PATH. On mismatch:
  exits 3, prints every gbrain binary on PATH via `type -a`, and
  gives a remediation menu. Setup skills refuse broken environments
  instead of warning and continuing.

Prereq checks (bun, git, https://github.com reachability) fail fast
with install hints. --dry-run and --validate-only flags let the
skill probe the plan without touching state; tests use them to
cover D5 and D19 without exercising real bun link.

Pin is a load-bearing version: setup-gbrain v1 verified against
gbrain v0.18.2. Updating requires re-running Pre-Impl Gate 1 to
verify gbrain's CLI + config shapes haven't drifted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(setup-gbrain): unit tests for gstack-gbrain-detect + install

15 tests covering: detect emits valid JSON when nothing configured,
reports gstack_brain_git on GSTACK_HOME/.git presence, reads
~/.gbrain/config.json engine, tolerates malformed config, detects
a mocked gbrain binary on PATH with version parsing.

For install: D5 detect-first uses ~/git/gbrain fixtures under a
sandboxed HOME, verifies fall-through to fresh clone when no valid
clone exists, rejects invalid package.json shapes. D19 PATH-shadow
validation uses a fake gbrain on a minimal SAFE_PATH to simulate
version mismatch, same-version-pass, v-prefix tolerance, missing
binary on PATH, and missing version field in package.json.

--validate-only mode in the install bin makes the D19 check unit-
testable without running real bun link (which touches ~/.bun/bin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(setup-gbrain): add gstack-gbrain-lib.sh with read_secret_to_env (D3-eng)

Shared secret-read helper for PAT (D11) and pooler URL paste (D16).
One implementation of the hardest-to-get-right pattern: stty -echo +
SIGINT/TERM/EXIT trap that restores terminal mode, read into a named
env var, optional redacted preview.

Validates the target var name against [A-Z_][A-Z0-9_]* to prevent
bash name-injection via `read -r "$varname"`. When stdin is not a TTY
(CI, piped tests) the stty branches skip cleanly — piped input doesn't
echo anyway. Exports the var after read so subprocesses inherit it;
callers own the `unset` at handoff time.

Sourced, not executed — no +x bit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(setup-gbrain): add gstack-gbrain-supabase-verify structural URL check

Zero-network validator for Supabase Session Pooler URLs before handing
them to `gbrain init`. Canonical shape verified per gbrain init.ts:266:
  postgresql://postgres.<ref>:<password>@aws-0-<region>.pooler.supabase.com:6543/postgres

Rejects direct-connection URLs (db.*.supabase.co:5432) with a distinct
exit code 3 and clear IPv6-failure remediation — that's the most common
paste mistake users make, so it earns its own UX path rather than a
generic "bad URL" error.

Never echoes the URL (contains a password) in error messages; tests
verify a distinct seed password never appears in stderr on any reject
path. Accepts URL from argv[1] or stdin ("-" or no arg).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(setup-gbrain): unit tests for supabase-verify + lib.sh secret helper

22 tests. verify: accepts canonical pooler URL (argv + stdin modes),
rejects direct-connection URL with exit 3, rejects wrong scheme, wrong
port, empty password, missing userinfo, plain 'postgres' user (catches
direct-URL paste errors), wrong host, empty URL. Case-insensitive host
match. Explicit negative: error messages never echo the URL password.

lib.sh read_secret_to_env: reads piped stdin into the named env var,
exports to subprocesses, redacted-preview emits masked form on stderr
with the seed password absent, rejects invalid var names (lowercase,
leading digit, hyphens), rejects missing/unknown flags, secret value
never appears on stdout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(setup-gbrain): add gstack-gbrain-supabase-provision Management API wrapper

Four subcommands: list-orgs, create, wait, pooler-url. Built against
the verified Supabase Management API shape (Pre-Impl Gate 1):

  - POST /v1/projects with {name, db_pass, organization_slug, region}
    — not the original plan's /v1/organizations/{ref}/projects
  - No `plan` field; subscription tier is org-level per the OpenAPI
    description ("Subscription Plan is now set on organization level
    and is ignored in this request")
  - GET /v1/projects/{ref}/config/database/pooler for pooler config
    — not /config/database

Secrets discipline: SUPABASE_ACCESS_TOKEN (PAT) and DB_PASS read from
env only, never from argv (D8 grep test enforces this). `set +x` at
the top as a defensive default so debug tracing never leaks secrets.
Management API hostname hardcoded to SUPABASE_API_BASE env override —
no user-controlled URL portion (SSRF guard).

HTTP error paths: 401/403 → exit 3 (auth), 402 → 4 (quota), 409 → 5
(conflict), 429 + 5xx → exponential-backoff retry up to 3 attempts,
then exit 8. Wait subcommand polls every 5s until ACTIVE_HEALTHY
with a configurable timeout; terminal states (INIT_FAILED, REMOVED,
etc.) exit 7 immediately with a clear message. Timeout emits the
--resume-provision hint so the skill can recover.

Pooler-url constructs the URL locally from db_user/host/port/name +
DB_PASS rather than trusting the API response's connection_string
field, which is templated with [PASSWORD] rather than the real value.
Handles both object and array response shapes, preferring session
pool_mode when Supabase returns multiple pooler configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(setup-gbrain): unit tests for gstack-gbrain-supabase-provision via mock API

22 tests covering D21 HTTP error suite (401/403/402/409/429/5xx) and
happy paths for all four subcommands. Every test spins up a Bun.serve
mock server bound to SUPABASE_API_BASE so nothing hits the real API.

Uses Bun.spawn (async) rather than spawnSync because spawnSync blocks
the Bun event loop, which prevents Bun.serve mocks from responding —
calls would hit curl's own timeout instead of round-tripping.

Verifies: POST body contains organization_slug (not organization_id)
and no `plan` field, bearer-token auth header, retry-on-429 with
eventual success, exit-8 on persistent 5xx after max retries, wait
succeeds on ACTIVE_HEALTHY, exits 7 on INIT_FAILED, exits 6 with
--resume-provision hint on timeout, pooler-url builds URL locally
from db_user/host/port/name + DB_PASS (not response connection_string
template), handles array pooler responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(setup-gbrain): add SKILL.md.tmpl — user-facing skill prompt

Stitches together every slice built so far (repo-policy, detect,
install, lib.sh secret helper, supabase-verify, supabase-provision)
into a single interactive flow. Paths: Supabase existing-URL, Supabase
auto-provision (D7), Supabase manual, PGLite local, switch (PGLite ↔
Supabase via gbrain migrate wrapped in timeout 180s per D9).

Secrets discipline per D8/D10/D11: PAT + DB_PASS + pooler URL all
read via read_secret_to_env from lib.sh and handed to gbrain via
GBRAIN_DATABASE_URL env, never argv. PAT carries the full D11 scope
disclosure before collection and an explicit revocation reminder after
success. D12 SIGINT recovery prints the in-flight ref + resume command.

D18 MCP registration is scoped honestly to Claude Code — skips with
a manual-register hint when `claude` is not on PATH. D6 per-remote
trust-triad question (read-write/read-only/deny/skip-for-now) gates
repo import; the triad values compose with the D2-eng schema-version
policy file so future migrations stay deterministic.

Skill runs concurrent-run-locked via mkdir ~/.gstack/.setup-gbrain.lock.d
(atomic, same pattern as gstack-brain-sync). Telemetry (D4) payload
carries enumerated categorical values only — never URL, PAT, or any
postgresql:// substring.

--repo, --switch, --resume-provision, --cleanup-orphans shortcut modes
documented inline; the skill parses its own invocation args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(health): integrate gbrain as D6 composite dimension

Adds a GBrain row to the /health dashboard rubric with weight 10%.
Three sub-signals rolled into one 0-10 score: doctor status (0.5),
sync queue depth (0.3), last-push age (0.2). Redistributes when
gbrain_sync_mode is off so the dimension stays fair.

Weights rebalance: typecheck 25→22, lint 20→18, test 30→28,
deadcode 15→13, shell 10→9, gbrain +10 — sums to 100.

gbrain doctor --json wrapped in timeout 5s so a hung gbrain never
stalls the /health dashboard. Dimension is omitted (not red) when
gbrain is not installed — running /health on a non-gbrain machine
shouldn't penalize that choice.

History-JSONL adds a `gbrain` field. Pre-D6 entries read as null for
trend comparison; new tracking starts from first post-D6 run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(test): add secret-sink-harness for negative-space leak testing (D21 #5)

Runs a subprocess with a seeded secret, captures every channel the
subprocess could leak through, and asserts the seed never appears.
Built per the D1-eng tightened contract: per-run tmp $HOME, four seed
match rules (exact + URL-decoded + first-12-char prefix + base64),
fd-level stdout/stderr capture via Bun.spawn, post-mortem walk of
every file written under $HOME, separate buckets for telemetry JSONL.

Reusable: any future skill that handles secrets can import
runWithSecretSink and run positive/negative controls against its own
bins. The harness itself is ~180 lines of TS with no external deps
beyond Bun + node:fs.

Out of scope for v1 (documented as follow-ups): subprocess env dump
(portable /proc reading), the user's real shell history (bins don't
modify it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: secret-sink harness positive controls + real-bin negative controls

11 tests. Positive controls deliberately leak a seed in every covered
channel (stdout, stderr, a file under $HOME, the telemetry JSONL path,
base64-encoded, first-12-char prefix) and assert the harness catches
each one. Without these, a harness that silently under-reports would
look identical to a harness that works.

Negative controls run real setup-gbrain bins with distinctive seeds:
  - supabase-verify rejects a mysql:// URL and a direct-connection URL,
    password never appears in any captured channel
  - lib.sh read_secret_to_env reads piped stdin, emits only the length,
    seed value stays invisible
  - supabase-provision on an auth-failure path fails fast without
    leaking the PAT to any channel

Covers D21 #5 leak harness + uses it to validate D3-eng, D10, D11
discipline end-to-end on the already-shipped bins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(setup-gbrain): add list-orphans + delete-project subcommands (D20)

Powers /setup-gbrain --cleanup-orphans. list-orphans filters the
authenticated user's Supabase projects by name prefix (default
"gbrain") and excludes the project the local ~/.gbrain/config.json
currently points at, so only unclaimed gbrain-shaped projects come
back. Active-ref detection parses the pooler URL's user portion
(postgres.<ref>:<pw>@...).

delete-project is a thin DELETE /v1/projects/{ref} wrapper with no
confirmation of its own — the skill's UI layer owns the per-project
confirm AskUserQuestion loop. Keeps responsibilities clean: the bin
manages HTTP; the skill manages user intent.

Both subcommands reuse the existing api_call retry+backoff and the
same PAT discipline (env only, never argv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(setup-gbrain): list-orphans active-ref filtering + delete-project 404

6 new tests bringing the supabase-provision suite to 28:

list-orphans:
  - Filters to gbrain-prefixed projects, excludes the active-ref derived
    from ~/.gbrain/config.json's pooler URL
  - Treats all gbrain-prefixed projects as orphans when no config exists
    (first run on a new machine)
  - Respects custom --name-prefix for users who named their brain
    something else

delete-project:
  - Happy path sends DELETE /v1/projects/<ref> and returns {deleted_ref}
  - 404 surfaces cleanly (exit 2, "404" in stderr)
  - Missing <ref> positional rejected with exit 2

Uses per-test tmpdir HOME with a stubbed ~/.gbrain/config.json so
active-ref extraction runs against deterministic fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate setup-gbrain SKILL.md after main merge

* chore: bump version and changelog (v1.12.0.0)

Ships /setup-gbrain and its supporting infrastructure end-to-end:
per-remote trust policy, installer with PATH-shadow guard, shared
secret-read helper, structural URL verifier, Supabase Management
API wrapper, /health GBrain dimension, secret-sink test harness.

100 new tests across 5 suites, all green. Three pre-existing test
failures noted as P0 in TODOS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add USING_GBRAIN_WITH_GSTACK.md + update README for /setup-gbrain

README changes:
- Rewrote the "Cross-machine memory with GBrain sync" section into
  "GBrain — persistent knowledge for your coding agent." Covers the
  three /setup-gbrain paths (Supabase existing URL, auto-provision,
  PGLite local), MCP registration, per-remote trust triad, and the
  (still-separate) memory sync feature.
- Added /setup-gbrain row to the skills table pointing at the full guide.
- Added /setup-gbrain to both skill-list install snippets.
- Added USING_GBRAIN_WITH_GSTACK.md to the Docs table.

New doc (USING_GBRAIN_WITH_GSTACK.md):
- All three setup paths with trust-surface caveats
- MCP registration details (and honest Claude-Code-v1 scoping)
- Per-remote trust triad semantics + how to change a policy
- Switching engines (PGLite ↔ Supabase) via --switch
- GStack memory sync + its relationship to the gbrain knowledge base
- /setup-gbrain --cleanup-orphans for orphan Supabase projects
- Full command + flag reference, every bin helper, every env var
- Security model: what's enforced in code, what's enforced by the leak
  harness, and the honest limits of v1
- Troubleshooting: PATH shadowing, direct-connection URL reject,
  auto-provision timeout, stale lock, policy file hand-edits,
  migrate hang
- Why-this-design section explaining the non-obvious choices

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-sync): secret scanner now catches Bearer-prefixed auth tokens in JSON

The bearer-token-json regex value charset was [A-Za-z0-9_./+=-]{16,},
which does NOT permit spaces. Real HTTP auth headers embed the scheme
name with a literal space — "Bearer <token>" — so the value portion
actually starts with "Bearer " and the existing regex couldn't match.
Result: any JSON blob containing "authorization":"Bearer ..." would
slip past the scanner and sync to the user's private brain repo with
the bearer token inline.

Added optional (Bearer |Basic |Token )? prefix in front of the value
charset. Now matches the common auth-scheme forms without broadening
the matcher to tolerate arbitrary whitespace (which would false-positive
on lots of benign JSON).

Verified against 5 positive cases (bearer-in-json, clean bearer, apikey
no-prefix, token with Bearer, password no-prefix) + 3 negative cases
(too-short tokens, non-secret field names like username, random JSON).

This closes the P0 security regression first noticed during v1.12.0.0
/ship. brain-sync.test.ts now passes all 7 secret-scan fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: mock-gh integration tests for gstack-brain-init auto-create path

8 tests covering the gh-repo-create happy path that had zero coverage
before. Existing brain-sync.test.ts always passes --remote <bare-url>
to bypass gh entirely, so the interactive default ("press Enter, we'll
run gh repo create for you") was shipping on trust.

Test strategy: write a bash stub for gh that records every call into
a file, then run gstack-brain-init with that stub on PATH. Assertions
verify: gh auth status is checked, gh repo create fires with the
computed gstack-brain-<user> default name + --private + --source
flags, fall-through to gh repo view when create reports already-exists,
user-provided URL bypasses gh entirely, gh-not-on-path and gh-not-authed
branches both prompt for URL, --remote flag short-circuits all gh
calls, conflicting-remote re-runs exit 1 with a clear message.

No real GitHub, no live auth. Gate tier — runs on every commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): privacy-gate AskUserQuestion fires from preamble (periodic tier)

Two periodic-tier E2E tests exercising the preamble's privacy gate
end-to-end via the Agent SDK + canUseTool. Previously uncovered:

- Positive: stages a fake gbrain on PATH + gbrain_sync_mode_prompted=false
  in config, runs a real skill, intercepts tool-use. Asserts the
  preamble fires a 3-option AskUserQuestion matching the canonical
  prose ("publish session memory" / "artifact" / "decline") and does
  NOT fire a second time in the same run (idempotency within session).

- Negative: same staging but prompted=true. Asserts the gate stays
  silent even with gbrain detected on the host.

Registered in test/helpers/touchfiles.ts as `brain-privacy-gate`
(periodic) with dependency tracking on generate-brain-sync-block.ts,
the three gstack-brain-* bins, gstack-config, and the Agent SDK runner.
Diff-based selection re-runs the E2E when any of those change.

Cost: ~$0.30-$0.50 per run. Only fires under EVALS=1 EVALS_TIER=periodic;
gate tier stays free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update TODOS for bearer-json fix + new brain-sync test coverage

Moves the bearer-json secret-scan regression from the P0 "pre-existing
failures" block into the Completed section with full context on the
fix, the mock-gh tests, the E2E privacy-gate tests, and the touchfile
registration. Remaining P0s are the GSTACK_HOME config-isolation bug
and the stale Opus 4.7 overlay pacing assertion, both unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): E2E privacy gate — ambient env + skill-file prompt

Two fixes to get the E2E actually running end-to-end (first attempt
failed at the SDK auth step, second at the assertion step):

1. Don't pass an explicit `env:` object to runAgentSdkTest. The SDK's
   auth pipeline misses ANTHROPIC_API_KEY when env is supplied as an
   object (verified against the plan-mode-no-op test, which passes no
   env and auths cleanly). Mutate process.env before the call instead,
   and restore the originals in finally so other tests don't inherit
   the ambient mutation.

2. The "Run /learn with no arguments" user prompt was too narrow — the
   model reduced it to a direct action and skipped the preamble
   privacy-gate directives entirely, so zero AskUserQuestions fired.
   Mirror the plan-mode-no-op pattern: point the model at the skill
   file on disk and ask it to follow every preamble directive. Bumped
   maxTurns from 6 to 10 to give the preamble room to execute.

Verified both tests pass under `EVALS=1 EVALS_TIER=periodic bun test
test/skill-e2e-brain-privacy-gate.test.ts` against a real ANTHROPIC_API_KEY.
Cost per run: ~$0.30-$0.50 per test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(CLAUDE.md): source ANTHROPIC/OPENAI keys from ~/.zshrc for paid evals

Conductor workspaces don't inherit the interactive shell env, so
both API keys are absent from the default process env even though
they're set in ~/.zshrc. Documents the source-from-zshrc pattern
(grep + eval, never echo the value) plus the Agent SDK gotcha: do
NOT pass env as an object to runAgentSdkTest — mutate process.env
ambiently and restore in finally.

Discovered this during the brain-privacy-gate E2E. First run failed
at SDK auth with 401; second failed because explicit env handoff
bypassed the SDK's own auth routing. Fix pattern now codified so
the next paid-eval session in a Conductor workspace doesn't hit the
same two dead ends.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:38:21 -07:00