gstack/lib/gbrain-guards.ts
Garry Tan 3bef43bc5a v1.55.0.0 fix wave: gbrain data-loss guards + browser crash-loop + 6 more (#1808)
* fix(jsonl-merge): make equal-ts resolution converge across machines

The JSONL append merge driver sorted timestamped entries by (0, ts) with no
further tiebreaker. Equal-ts entries then fell back to stable-sort insertion
order (base, ours, theirs), but git assigns the local side to "ours", so two
machines resolving the same conflict emitted equal-ts lines in opposite order.
The merged files diverged and never converged. gstack-telemetry-log uses
second-granularity timestamps, so same-ts collisions are routine.

Add the line content as the final sort tiebreaker so the order is total and
side-independent. Add a regression test that runs the driver with the two
sides swapped and asserts identical output.
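
A minimal sketch of the total-order sort described above, not the driver's actual code. The `ts` field name, the rank given to untimestamped lines, the de-duplication step, and every identifier here are assumptions for illustration.

```typescript
// Hypothetical sketch: identifiers and the entry shape are assumptions.
type Keyed = { rank: number; ts: string; line: string };

// Timestamped entries get rank 0; anything else sorts after them with rank 1.
function keyOf(line: string): Keyed {
  try {
    const parsed = JSON.parse(line);
    if (parsed && typeof parsed.ts === "string") return { rank: 0, ts: parsed.ts, line };
  } catch {
    // Not JSON: fall through to the untimestamped rank.
  }
  return { rank: 1, ts: "", line };
}

// Compare by (rank, ts, line). The line content is the final tiebreaker, so
// the order is total and does not depend on which side was "ours".
function compareKeys(a: Keyed, b: Keyed): number {
  if (a.rank !== b.rank) return a.rank - b.rank;
  if (a.ts !== b.ts) return a.ts < b.ts ? -1 : 1;
  if (a.line !== b.line) return a.line < b.line ? -1 : 1;
  return 0;
}

function mergeJsonlSketch(base: string[], ours: string[], theirs: string[]): string[] {
  // De-duplicating identical lines is an assumption of this sketch.
  const unique = Array.from(new Set([...base, ...ours, ...theirs]));
  return unique.map(keyOf).sort(compareKeys).map((k) => k.line);
}
```

Swapping the `ours` and `theirs` arguments yields the same output, which is the property the regression test asserts.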

* fix(gen-skill-docs): quote frontmatter descriptions with interior colons (#1778)

Generated SKILL.md frontmatter emitted the catalog-trimmed description: as a
plain YAML scalar. A description with an interior ": " (e.g. "Ship workflow:
detect...") parses as a nested mapping under strict YAML loaders, so Codex/OpenAI
skill loading rejected those skills.

applyCatalogTrim now routes the value through toYamlInlineScalar, which quotes
(via JSON.stringify) only when a plain scalar would be invalid — interior ": ",
inline " #", leading indicator char, or surrounding whitespace. Strings that are
already valid plain scalars pass through unchanged to keep regen diffs small.
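
The quoting rule can be sketched roughly as follows. This is an illustration only: the function name is hypothetical, it covers just the four conditions named above, and the leading-indicator set is deliberately conservative (it quotes some strings a YAML parser would accept unquoted).

```typescript
// Hypothetical sketch of the "quote only when needed" rule; not the real helper.
const LEADING_INDICATORS = "-?:,[]{}#&*!|>'\"%@`";

function toYamlInlineScalarSketch(value: string): string {
  const needsQuotes =
    value !== value.trim() ||           // surrounding whitespace
    value.includes(": ") ||             // interior mapping indicator
    value.includes(" #") ||             // inline comment marker
    (value.length > 0 && LEADING_INDICATORS.includes(value.charAt(0)));
  // JSON.stringify emits a double-quoted string, which YAML also accepts.
  return needsQuotes ? JSON.stringify(value) : value;
}
```

Strings that are already safe are returned unchanged, which is what keeps regenerated diffs small.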

The frontmatter test now parses every generated block (Claude + Codex hosts) with
Bun.YAML.parse instead of string-checking that name:/description: substrings exist,
so the regression can't reappear. Runs under `bun test` (already in CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(skills): regenerate SKILL.md after frontmatter quoting fix (#1778)

9 catalog-trimmed descriptions whose values contain an interior colon or inline-
comment marker are now quoted. Generated output only; rerun of bun run gen:skill-docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(gbrain-sources): centralize sources-list shape handling in parseSourcesList (#1576)

#1576's crash in sourceLocalPath was already fixed in v1.42.0.0 (dual-shape
handling). But the readers disagreed: sourceLocalPath accepted both the wrapped
{sources:[...]} object (v0.20+) and a bare array, while probeSource and
sourcePageCount accepted only the wrapped shape. Extract one parseSourcesList()
normalizer and route all three through it, so the shape assumption lives in a
single place. This is also the base the #1734 remote_url audit builds on.

parseSourcesList returns [] for null/garbage rather than throwing; callers treat
'no rows' as absent. New test/gbrain-sources-parse.test.ts pins both shapes plus
the garbage paths and confirms config.remote_url survives for the audit.
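
A rough sketch of a dual-shape normalizer in the spirit of parseSourcesList. The row type is inferred from the fields the guards read (`id`, `local_path`, `config.remote_url`); the function name and the row filtering are assumptions, not the real implementation.

```typescript
// Hypothetical sketch; the real row type lives in gbrain-sources.
interface SourceRowSketch {
  id: string;
  local_path?: string;
  config?: { remote_url?: string };
}

function parseSourcesListSketch(raw: unknown): SourceRowSketch[] {
  let list: unknown[] = [];
  if (Array.isArray(raw)) {
    list = raw; // bare-array shape
  } else if (typeof raw === "object" && raw !== null) {
    const inner = (raw as { sources?: unknown }).sources;
    if (Array.isArray(inner)) list = inner; // wrapped {sources:[...]} shape
  }
  // Null or garbage input leaves `list` empty, so the result is [] rather than a throw.
  return list.filter(
    (r): r is SourceRowSketch =>
      typeof r === "object" && r !== null && typeof (r as { id?: unknown }).id === "string",
  );
}
```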

#1576 is closeable as already-fixed in v1.42.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(gbrain): spawn gbrain + brain-sync through a shell on Windows (#1731)

On Windows, bun/npm install gbrain as a gbrain.cmd/.ps1 shim and gstack-brain-sync
is a bash shebang script. spawnSync/spawn/execFileSync resolve neither without a
shell, so the child spawn failed ENOENT — on the sync orchestrator this surfaced
as 'brain-sync exited undefined' (#1731).

Add NEEDS_SHELL_ON_WINDOWS (process.platform === 'win32') in gbrain-exec and pass
it as shell: to every gbrain/brain-sync child spawn: spawnGbrain, spawnGbrainAsync,
execGbrainText (gbrain-exec), the two sources-list/remove/add spawns (gbrain-sources),
the version + probe spawns (gbrain-local-status), and the two brain-sync spawns in
the orchestrator. POSIX keeps the cheaper no-shell path.

macOS/Linux CI can't exercise the Windows path, so test/gbrain-spawn-windows-shell.ts
is a static-grep tripwire: it fails CI if a gbrain/brain-sync spawn is added without
the shell flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(catalog-trim): expect YAML-quoted descriptions with interior colons (#1778)

The quoting fix wraps colon-bearing catalog descriptions in double quotes;
two catalog-trim assertions still pinned the old unquoted form. Tolerate the
optional quotes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(gbrain-sync): defensive guards against destructive gbrain ops (#1734)

The orchestrator shelled out to gbrain's destructive subcommands as if they were
safe. gbrain can rm-rf a user's working tree during an autopilot race (its own
bug, upstream gbrain #1526); gstack now defends itself. New lib/gbrain-guards.ts
gates the two destructive reach points and adds an autopilot check; each guard
runs immediately before the op:

- Autopilot refuse (multi-signal, affirmative-only): refuse a destructive op when
  a live 'gbrain autopilot' process (primary) or a known autopilot lock file
  (secondary; checked under both GBRAIN_HOME and ~/.gbrain since gbrain #1226
  ignores GBRAIN_HOME) is present. No signal → proceed; inability to introspect
  never bricks a normal sync.
- sources remove: routed through safeSourcesRemove → decideSourceRemove. Fail
  CLOSED — refuse to remove a user-managed source (remote_url set, local_path
  outside gbrain's clones) when gbrain has no --keep-storage to protect the files
  (it doesn't in 0.41.x). Also fail closed when the source list can't be read.
  Path containment uses realpath so a symlink can't smuggle a delete out of clones.
- sync --strategy code: decideCodeSync refuses URL-managed sources (remote_url
  set) unless --allow-reclone is passed, since the walk can auto-reclone (rm-rf).

Capability detection memoizes per process keyed to gbrain's identity (no stale
persistent cache); --keep-storage can't be probed (generic help) so it defaults
unsupported → fail closed. Every guard surfaces a visible reason; autopilot/reclone
refusals fail the code stage (verdict ERR) rather than silently skipping protection.

test/gbrain-guards.test.ts covers all branches hermetically (injected rows + probe
overrides): autopilot signals, fail-closed remove, keep-storage path, reclone gate,
realpath/symlink containment. Supersedes #1736 (which guarded a nonexistent path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(sync-gbrain): warn against running during autopilot; prefer --path sources (#1734)

Adds a Safety note to the /sync-gbrain guidance (template + regenerated SKILL.md +
this repo's CLAUDE.md): don't run while autopilot is active, and prefer
`gbrain sources add --path` over URL-managed sources, which can auto-reclone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(memory-ingest): configurable import timeout + resume-on-timeout messaging (#1611)

The gbrain import (the long pole on big brains) had a hardcoded 30-min timeout,
so large memory corpora got SIGTERM'd mid-import on /sync-gbrain --full. Make it
configurable via GSTACK_INGEST_TIMEOUT_MS (default 30 min, validated 1min–24h).

gstack can't drive gbrain's internal resume, but the existing SIGTERM forwarder
already preserves gbrain's import-checkpoint.json, so the next run resumes. On a
timeout we now say so explicitly ('checkpoint preserved — re-run /sync-gbrain to
resume, raise GSTACK_INGEST_TIMEOUT_MS for big brains') instead of surfacing a
bare 'exited null'. True gstack-driven ingest-resume is deferred to gbrain
(.context/gbrain-asks.md).

Also guards the module's main() behind import.meta.main so resolveImportTimeoutMs
is unit-testable; the orchestrator runs it as a subprocess where main still fires.
New test/memory-ingest-timeout.test.ts pins default/override/invalid resolution.
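
A minimal sketch of how the timeout resolution might look. The fallback-to-default behavior on invalid input is an assumption (the commit only says the value is validated), and the function name is hypothetical.

```typescript
// Hypothetical sketch; constants follow the stated default and 1min-24h range.
const DEFAULT_TIMEOUT_MS = 30 * 60 * 1000;
const MIN_TIMEOUT_MS = 60 * 1000;
const MAX_TIMEOUT_MS = 24 * 60 * 60 * 1000;

function resolveImportTimeoutMsSketch(env: Record<string, string | undefined>): number {
  const raw = env.GSTACK_INGEST_TIMEOUT_MS;
  if (raw === undefined || raw.trim() === "") return DEFAULT_TIMEOUT_MS;
  const n = Number(raw);
  // Assumption: out-of-range or non-integer values fall back to the default.
  if (!Number.isInteger(n) || n < MIN_TIMEOUT_MS || n > MAX_TIMEOUT_MS) return DEFAULT_TIMEOUT_MS;
  return n;
}
```

Taking the environment as a parameter keeps the function pure, which is the same reason the commit guards main() behind import.meta.main.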

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(browse): stop the headed daemon crash-loop + silent headless downgrade (#1781)

A headed session against a beacon-heavy page (analytics/extension load) could tip
the single-threaded daemon into a self-inflicted crash-loop: a brief HTTP stall
was read as a crash, the restart didn't clear the dead Chromium's SingletonLock,
the relaunch failed, and the session silently came back headless. Four fixes:

1. Busy-vs-dead (sendCommand): on a connection error, if the process is alive give
   /health a bounded probe (3x/250ms) and just retry the command — never kill+restart
   a live-but-busy server. A 30s timeout now reports 'busy, not restarting' when the
   process is alive instead of exiting into a kill cycle.
2. Profile-lock cleanup on (re)start: startServer reaps the orphaned Chromium holding
   the SingletonLock and clears Singleton{Lock,Socket,Cookie} before relaunch, so the
   auto-restart path gets the same clean profile the manual connect preamble did.
3. Headed persistence: the restart env reapplies BROWSE_HEADED from this invocation OR
   the persisted server state (mode==='headed'), so a restart from a plain command
   never downgrades a headed window to invisible headless. Extracted to buildRestartEnv.
4. Force-clean disconnect reaps the Chromium child tree (via the SingletonLock PID) so
   the next connect starts clean instead of fighting an orphan.
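
The headed-persistence decision in item 3 can be sketched as a pure function. The value `"1"` for BROWSE_HEADED and the shape of the persisted state are assumptions; only the OR of the two signals comes from the description above.

```typescript
// Hypothetical sketch of the restart-env decision; not the real buildRestartEnv.
interface PersistedStateSketch {
  mode?: string;
}

function buildRestartEnvSketch(
  env: Record<string, string | undefined>,
  state: PersistedStateSketch | null,
): Record<string, string | undefined> {
  // Headed if this invocation asked for it OR the persisted server state says so.
  const headed = env.BROWSE_HEADED === "1" || state?.mode === "headed";
  const next: Record<string, string | undefined> = { ...env };
  if (headed) next.BROWSE_HEADED = "1";
  return next;
}
```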

Plus macOS window surfacing: connect + focus raise 'Google Chrome for Testing' to the
active Space (best-effort osascript) with a Mission Control hint, because a window on
another Space is the first thing users read as 'I can't see the browser'.

Shared lock helpers (chromiumProfileDir / cleanChromiumProfileLocks / killOrphanChromium)
dedupe the connect, disconnect, and restart paths. browse/test/restart-env.test.ts pins
the headed-persistence decision; the full crash-loop repro is an E2E (periodic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gbrain-install): remove the v0.18.2 pin, install latest + version floor + doctor self-test (#1744)

The installer pinned gbrain at v0.18.2 while gbrain shipped v0.41.x — ~23 versions
behind. Remove the hard pin: a fresh clone now stays on the latest default-branch
HEAD. --pinned-commit <sha> still pins for reproducibility.

Unpinning removes the version gate the pin provided, so add two install-time gates
that fail closed (exit 3, matching the existing PATH-shadow/version-mismatch posture):
- MIN_GBRAIN_VERSION floor (0.20.0, the sources-list/federated surface gstack needs):
  refuse an install below it.
- gbrain doctor --fast self-test when a brain config already exists (re-install /
  detected clone): refuse to leave a broken gbrain in place. Pre-init installs skip
  it; the full /sync-gbrain --dry-run self-test runs from /setup-gbrain after init.
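
A version-floor check of this kind might look like the sketch below. The parsing of the version string and the function names are assumptions; the installer's actual gate may be implemented differently (for example in shell).

```typescript
// Hypothetical sketch of a fail-closed minimum-version check.
function parseVersionSketch(text: string): number[] | null {
  const m = /(\d+)\.(\d+)\.(\d+)/.exec(text);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

function meetsFloorSketch(installed: string, floor = "0.20.0"): boolean {
  const a = parseVersionSketch(installed);
  const b = parseVersionSketch(floor);
  if (!a || !b) return false; // unparseable version: fail closed
  for (let i = 0; i < 3; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true; // equal to the floor is acceptable
}
```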

Docs updated (USING_GBRAIN_WITH_GSTACK.md no longer says 'edit PINNED_COMMIT').
Detect-install tests bump the success-path fixtures above the floor and add a
below-floor exit-3 test. The gbrain-side asks (root #1526 fix, --keep-storage,
remove-lease, capability command, ingest-resume, integration CI) are written to
.context/gbrain-asks.md for filing against garrytan/gbrain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(#1778): update claude-ship golden + catalog-mode assertions for quoted descriptions

ship's catalog description ('Ship workflow: detect...') has an interior colon, so
the #1778 fix now YAML-quotes it. Refresh the claude-ship golden baseline to the
quoted output and make the catalog-mode-full trim/restore assertions quote-tolerant.
codex/factory ship goldens are unaffected (they use block-scalar descriptions).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(gen-skill-docs): use function replacer so a $ in a description can't corrupt frontmatter (#1778)

String.prototype.replace treats $&/$1/$` in the replacement as patterns. A future
skill description containing $ (e.g. referencing $B/$D) would silently corrupt the
generated frontmatter. Use a function replacer. Behavior-preserving for all current
descriptions (regen produces no diff).
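
A small illustration of the hazard and the fix. The template text and function names are made up for the example; the behavior of String.prototype.replace is standard.

```typescript
// With a string replacement, "$&" is expanded to the matched substring.
function fillNaive(template: string, description: string): string {
  return template.replace("PLACEHOLDER", description);
}

// With a function replacer, the return value is inserted literally.
function fillSafe(template: string, description: string): string {
  return template.replace("PLACEHOLDER", () => description);
}

const template = "description: PLACEHOLDER";
const risky = "uses $& here";
console.log(fillNaive(template, risky)); // description: uses PLACEHOLDER here
console.log(fillSafe(template, risky));  // description: uses $& here
```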

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.55.0.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(gbrain): document configurable memory-ingest timeout for v1.55.0.0

USING_GBRAIN_WITH_GSTACK.md: note GSTACK_INGEST_TIMEOUT_MS (default 30 min,
1 min-24h range) on the /sync-gbrain memory stage, plus checkpoint-resume on
timeout. Fills the reference gap left by the configurable-import-timeout fix
(#1611) shipped in v1.55.0.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 14:57:07 -07:00

267 lines
11 KiB
TypeScript

/**
 * gbrain-guards — defense-in-depth against gbrain's destructive code paths (#1734).
 *
 * gbrain (the separate CLI gstack shells out to) can rm-rf a user's working tree
 * during an autopilot race (its own bug, upstream gbrain #1526). gstack can't fix
 * that, but it MUST stop treating gbrain's destructive subcommands as safe. These
 * guards gate the two ways the orchestrator can reach destruction:
 *
 *   1. `sources remove --confirm-destructive`    → decideSourceRemove()
 *   2. `sync --strategy code` (can auto-reclone) → decideCodeSync()
 *
 * plus an autopilot-active check (detectAutopilot) that refuses to run destructive
 * ops concurrently with the daemon.
 *
 * Design notes grounded in the real gbrain 0.41.x surface:
 *   - There is NO `--keep-storage` flag and NO structured capability command, and
 *     subcommand `--help` is generic — so capability detection is best-effort and
 *     defaults to "unsupported". When we can't protect a user-managed source's
 *     files, we FAIL CLOSED (refuse the remove) rather than delete unprotected.
 *   - The autopilot lock filename isn't documented and (gbrain #1226) ignores
 *     GBRAIN_HOME, so the live `gbrain autopilot` process is the PRIMARY signal;
 *     known lock paths under both the configured home and ~/.gbrain are secondary.
 *   - We refuse only on an AFFIRMATIVE autopilot signal — inability to introspect
 *     never blocks a normal sync (that would brick the tool).
 *   - Path containment uses realpath so a symlink inside ~/.gbrain/clones can't
 *     smuggle a delete out to a user repo.
 *
 * Pure decision functions; the orchestrator logs the reasons (observability).
 */
import { spawnSync } from "child_process";
import { existsSync, realpathSync } from "fs";
import { homedir } from "os";
import { join, resolve, sep } from "path";
import { execGbrainJson, execGbrainText, NEEDS_SHELL_ON_WINDOWS } from "./gbrain-exec";
import { parseSourcesList, type GbrainSourceRow } from "./gbrain-sources";

export function gbrainHome(env: NodeJS.ProcessEnv = process.env): string {
  return env.GBRAIN_HOME || join(homedir(), ".gbrain");
}

/**
 * Directories gbrain owns and may delete safely. A source whose local_path
 * resolves inside one of these is gbrain-managed; outside = user-managed and
 * must be protected. Both the configured home and the default ~/.gbrain are
 * checked because gbrain #1226 shows home-resolution is inconsistent.
 */
function clonesDirs(env: NodeJS.ProcessEnv = process.env): string[] {
  return [...new Set([join(gbrainHome(env), "clones"), join(homedir(), ".gbrain", "clones")])];
}

/** True if `p` resolves (symlinks + `..` collapsed) to a location inside `dir`. */
export function isInside(p: string, dir: string): boolean {
  let rp: string;
  let rd: string;
  try { rp = realpathSync(p); } catch { rp = resolve(p); }
  try { rd = realpathSync(dir); } catch { rd = resolve(dir); }
  const base = rd.endsWith(sep) ? rd : rd + sep;
  return rp === rd || rp.startsWith(base);
}
// ── Autopilot detection (E1: multi-signal, affirmative-only) ────────────────

export interface AutopilotStatus {
  active: boolean;
  /** Which signal fired (lock path or "process"), or null when inactive. */
  signal: string | null;
}

export interface AutopilotProbe {
  /** Override the lock-path list (tests). */
  lockPaths?: string[];
  /** Override the live-process check (tests). */
  processRunning?: () => boolean;
}

/**
 * Detect a running gbrain autopilot. Refuse the caller's destructive op only on
 * an affirmative signal; absence of a confirmable mechanism returns inactive so
 * normal syncs are never bricked.
 */
export function detectAutopilot(
  env: NodeJS.ProcessEnv = process.env,
  probe: AutopilotProbe = {},
): AutopilotStatus {
  // Secondary signal: known lock files. gbrain #1226 — the lock ignores
  // GBRAIN_HOME, so check both the configured home and the default ~/.gbrain.
  const lockPaths = probe.lockPaths ?? [
    join(gbrainHome(env), "autopilot.lock"),
    join(homedir(), ".gbrain", "autopilot.lock"),
    join(gbrainHome(env), "autopilot.pid"),
    join(homedir(), ".gbrain", "autopilot.pid"),
  ];
  for (const lp of lockPaths) {
    if (existsSync(lp)) return { active: true, signal: `lock:${lp}` };
  }

  // Primary signal: a live `gbrain autopilot` process.
  const running = (probe.processRunning ?? defaultProcessRunning)();
  if (running) return { active: true, signal: "process:gbrain autopilot" };

  return { active: false, signal: null };
}

function defaultProcessRunning(): boolean {
  // No reliable pgrep on Windows; rely on the lock-file signal there.
  if (process.platform === "win32") return false;
  const r = spawnSync("pgrep", ["-f", "gbrain autopilot"], { encoding: "utf-8", timeout: 3_000 });
  return r.status === 0 && (r.stdout || "").trim().length > 0;
}
// ── Capability detection (E4 + Codex: per-process memo, no persistent cache) ─
//
// No structured capability command exists and subcommand --help is generic, so
// --keep-storage support can't be probed reliably; default unsupported. Memoize
// per process (keyed to the resolved gbrain identity) rather than persisting a
// cross-run cache — Codex flagged stale persistent caches, and the probe is cheap.

let _keepStorageMemo: { key: string; value: boolean } | undefined;

function gbrainIdentity(env: NodeJS.ProcessEnv): string {
  const r = spawnSync("gbrain", ["--version"], {
    encoding: "utf-8",
    timeout: 3_000,
    shell: NEEDS_SHELL_ON_WINDOWS,
    env,
  });
  return (r.stdout || "").trim() || "unknown";
}

export function gbrainSupportsKeepStorage(env: NodeJS.ProcessEnv = process.env): boolean {
  const key = gbrainIdentity(env);
  if (_keepStorageMemo && _keepStorageMemo.key === key) return _keepStorageMemo.value;
  let value = false;
  for (const args of [["sources", "remove", "--help"], ["--help"]]) {
    try {
      if (/--keep-storage/.test(execGbrainText(args, { baseEnv: env, timeout: 5_000 }))) {
        value = true;
        break;
      }
    } catch {
      // generic/empty help or non-zero exit → treat as unsupported
    }
  }
  _keepStorageMemo = { key, value };
  return value;
}

/** Test-only: reset the per-process capability memo. */
export function _resetCapabilityMemo(): void {
  _keepStorageMemo = undefined;
}
// ── Destructive-op decisions ────────────────────────────────────────────────

/**
 * Fetch + normalize the source list. Throws on read/parse failure so callers can
 * distinguish "couldn't read" (fail closed) from "empty list" (source absent).
 * Injectable for hermetic tests.
 */
export function fetchSources(env: NodeJS.ProcessEnv = process.env): GbrainSourceRow[] {
  const raw = execGbrainJson(["sources", "list", "--json"], { baseEnv: env });
  if (raw === null) throw new Error("gbrain sources list returned no JSON");
  return parseSourcesList(raw);
}

export interface RemoveDecision {
  allow: boolean;
  /** Extra args to append to `sources remove` (e.g. --keep-storage). */
  extraArgs: string[];
  reason: string;
}
export interface DecideRemoveOpts {
  /** Override capability detection (tests / cached caps). */
  keepStorage?: boolean;
  /** Override the source-list fetch (tests). Throwing simulates a read failure. */
  fetchRows?: (env: NodeJS.ProcessEnv) => GbrainSourceRow[];
}

/**
 * Decide whether `sources remove <id>` is safe, and with what flags.
 *
 * Fail-closed cases (allow=false):
 *   - sources list unreadable/unparseable (can't prove the row is safe).
 *   - the row is user-managed (remote_url set AND local_path outside gbrain's
 *     clones) and gbrain has no --keep-storage to protect the files.
 *
 * Allowed: absent row (no-op), gbrain-managed (inside clones), or path-managed
 * without a remote_url (gbrain's remove won't touch an outside-clones path that
 * it didn't clone). --keep-storage is appended whenever supported, as extra armor.
 */
export function decideSourceRemove(
  sourceId: string,
  env: NodeJS.ProcessEnv = process.env,
  opts: DecideRemoveOpts = {},
): RemoveDecision {
  const keepStorage = opts.keepStorage ?? gbrainSupportsKeepStorage(env);
  const extra = keepStorage ? ["--keep-storage"] : [];

  let rows: GbrainSourceRow[];
  try {
    rows = (opts.fetchRows ?? fetchSources)(env);
  } catch {
    return { allow: false, extraArgs: [], reason: "could not read sources list; refusing remove (fail closed)" };
  }

  const row = rows.find((r) => r.id === sourceId);
  if (!row) return { allow: true, extraArgs: extra, reason: "source absent (no-op)" };

  const remoteUrl = row.config?.remote_url;
  const userManaged =
    !!remoteUrl && !!row.local_path && !clonesDirs(env).some((d) => isInside(row.local_path!, d));

  if (userManaged) {
    if (keepStorage) {
      return { allow: true, extraArgs: ["--keep-storage"], reason: "user-managed; --keep-storage protects files" };
    }
    return {
      allow: false,
      extraArgs: [],
      reason:
        `refusing remove of user-managed source "${sourceId}" (remote_url set, local_path ` +
        `${row.local_path} outside gbrain clones) — this gbrain has no --keep-storage to ` +
        `protect the working tree. Upgrade gbrain or remove the source manually.`,
    };
  }

  return { allow: true, extraArgs: extra, reason: "gbrain-managed or path-managed without remote_url" };
}
export interface SyncDecision {
  allow: boolean;
  reason: string;
}

/**
 * Decide whether `sync --strategy code --source <id>` is safe to run.
 *
 * A source with a remote_url can trigger gbrain's auto-reclone, the ungated
 * rm-rf path behind the data loss (gbrain #1526). Require an explicit
 * --allow-reclone opt-in for URL-managed sources. Read failure here is NOT
 * itself destructive, so it fails open (proceed) — the autopilot guard, checked
 * first, is the primary protection against the race that caused the loss.
 */
export function decideCodeSync(
  sourceId: string,
  env: NodeJS.ProcessEnv = process.env,
  allowReclone = false,
  fetchRows: (env: NodeJS.ProcessEnv) => GbrainSourceRow[] = fetchSources,
): SyncDecision {
  let rows: GbrainSourceRow[];
  try {
    rows = fetchRows(env);
  } catch {
    return { allow: true, reason: "sources unreadable; proceeding (sync read is non-destructive)" };
  }

  const row = rows.find((r) => r.id === sourceId);
  if (row?.config?.remote_url && !allowReclone) {
    return {
      allow: false,
      reason:
        `source "${sourceId}" is URL-managed (remote_url set); sync may auto-reclone and ` +
        `delete the working tree. Re-run /sync-gbrain with --allow-reclone to proceed.`,
    };
  }

  return { allow: true, reason: "no remote_url, or reclone explicitly allowed" };
}