v1.33.0.0 feat: /sync-gbrain memory-stage batch-import refactor (D1-D8) + F6/F9 + signal cleanup (#1432)

* refactor: batch-import architecture (D1-D8) + F6 atomic state + F9 full-file hash

bin/gstack-memory-ingest.ts: rewrite memory ingest around the `gbrain import <dir>`
batch path. Replaces the per-file gbrainPutPage loop (~470s of subprocess startup
per cold run) with prepare-then-batch:

  walkAllSources
    -> preparePages: mtime-skip + optional gitleaks (--scan-secrets) + parse
    -> writeStaged: mkdir -p per slug segment, hierarchical (D1)
    -> snapshot ~/.gbrain/sync-failures.jsonl byte offset
    -> runGbrainImport (async spawn) -> parseImportJson
    -> readNewFailures: read appended bytes, map back to source paths (D7)
    -> state.sessions[path] = {...} for files NOT in failed set
    -> saveStateAtomic (F6) + cleanupStagingDir
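
The D7 snapshot/tail-read pair in the middle of the pipeline can be sketched like this (helper names and the `source_path` field are assumptions for illustration, not the actual implementation):

```typescript
import { statSync, openSync, readSync, closeSync } from "node:fs";

// D7 sketch: record the failures log's byte length before the import,
// then read only the bytes gbrain appended during this run and map each
// JSONL record back to its source path.
function snapshotOffset(path: string): number {
  try {
    return statSync(path).size;
  } catch {
    return 0; // log may not exist yet on a fresh install
  }
}

function readNewFailures(path: string, offset: number): string[] {
  const size = statSync(path).size;
  if (size <= offset) return []; // nothing appended this run
  const buf = Buffer.alloc(size - offset);
  const fd = openSync(path, "r");
  readSync(fd, buf, 0, buf.length, offset);
  closeSync(fd);
  return buf
    .toString("utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line).source_path as string);
}
```

Files named in the tail are excluded from state recording, so they retry on the next run.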

Architecture decisions:
  D1 hierarchical staging dir
  D2 cut over, deleted gbrainPutPage entirely
  D3 source-file gitleaks made opt-in via --scan-secrets (gstack-brain-sync
     owns the cross-machine boundary; the per-file scan was a redundant ~470s tax)
  D4 OK/ERR verdict (no DEGRADED tri-state)
  D5 unified state schema (no separate skip-list)
  D6 trust gbrain content_hash idempotency (no skip_reason bookkeeping)
  D7 byte-offset snapshot of sync-failures.jsonl + per-source mapping
  F6 saveState uses tmp+rename atomic write
  F9 fileSha256 removes 1MB cap; full-file hash (no more silent tail-edit
     misses on long partial transcripts)
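
The F9 change amounts to a streaming full-file hash; a minimal sketch (not the actual fileSha256 implementation):

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// F9: hash the entire file. The old implementation capped reads at 1MB, so
// an edit past the first 1MB of a long transcript produced an unchanged
// hash and was silently skipped. Streaming keeps memory flat at any size.
function fileSha256(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}
```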

Signal handling: installSignalForwarder propagates SIGTERM/SIGINT to the
gbrain child process AND synchronously cleans the staging dir before
process.exit. Pre-fix, orchestrator timeouts left gbrain processes
orphaned holding the PGLite write lock (observed: 15-hour-CPU-time
orphan still alive a day later).
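
A minimal sketch of that forwarder, assuming the child handle and staging dir are in scope (names and exit codes are illustrative):

```typescript
import type { ChildProcess } from "node:child_process";
import { rmSync } from "node:fs";

// On SIGTERM/SIGINT: forward the signal to the gbrain child (so it exits
// and releases the PGLite write lock instead of running orphaned), remove
// the staging dir synchronously, then exit.
function installSignalForwarder(
  child: Pick<ChildProcess, "kill">,
  stagingDir: string,
): void {
  for (const sig of ["SIGTERM", "SIGINT"] as const) {
    process.on(sig, () => {
      child.kill(sig); // don't orphan the child
      rmSync(stagingDir, { recursive: true, force: true }); // sync cleanup
      process.exit(sig === "SIGINT" ? 130 : 143);
    });
  }
}
```

The cleanup must be synchronous: once the handler calls `process.exit`, pending async work is abandoned.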

parseImportJson returns null on unparseable output (treated as ERR by the
caller) instead of silently reporting zero counts.
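
A minimal sketch of that contract (the field names are assumptions, not gbrain's actual JSON shape):

```typescript
// Parse the import summary from gbrain's stdout. Anything unparseable,
// or not a JSON object, yields null; the caller treats null as ERR
// rather than reporting a bogus all-zero import.
interface ImportSummary {
  imported?: number;
  skipped?: number;
  failed?: number;
}

function parseImportJson(stdout: string): ImportSummary | null {
  try {
    const parsed = JSON.parse(stdout.trim());
    if (typeof parsed !== "object" || parsed === null) return null;
    return parsed as ImportSummary;
  } catch {
    return null; // unparseable output -> ERR, never a silent zero
  }
}
```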

gbrainAvailable() probes for the `import` subcommand instead of `put`.

Plan + review chain at /Users/garrytan/.claude/plans/purrfect-tumbling-quiche.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: orchestrator OK/ERR verdict parser for batch memory ingest

gstack-gbrain-sync.ts: memory-stage parser now picks [memory-ingest] ERR
lines preferentially over the latest [memory-ingest] line, strips the
prefix and any leading 'ERR: ' for cleaner summary output, and surfaces
'(killed by signal / timeout)' when the child exits with status=null.

Matches D6's OK/ERR contract: per-file failures (FILE_TOO_LARGE etc.)
show in the summary count but only system-level failures (gbrain crash,
process kill, missing CLI) mark the stage ERR.
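
A rough sketch of that selection logic (function name and exact prefix handling are illustrative):

```typescript
// Prefer the last "[memory-ingest] ERR" line; otherwise fall back to the
// last "[memory-ingest]" line. Strip the prefix and any leading "ERR: "
// for a clean summary. statusNull marks a child that exited with
// status === null, i.e. killed by signal or timeout.
function memoryStageSummary(lines: string[], statusNull: boolean): string {
  const tagged = lines.filter((l) => l.includes("[memory-ingest]"));
  const errs = tagged.filter((l) => l.includes("[memory-ingest] ERR"));
  const pick = errs.at(-1) ?? tagged.at(-1) ?? "";
  const cleaned = pick
    .replace(/^.*\[memory-ingest\]\s*/, "")
    .replace(/^ERR:\s*/, "");
  return statusNull
    ? `${cleaned} (killed by signal / timeout)`.trim()
    : cleaned;
}
```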

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: batch-ingest writer regressions + refresh golden ship fixtures

test/gstack-memory-ingest.test.ts: 5 new tests for the batch-import
architecture:
  1. D1 hierarchical staging slug round-trip — asserts staged file lives
     in transcripts/claude-code/<dir>/*.md, not flat at staging root
  2. Frontmatter injection — asserts title/type/tags written into the
     staged page's YAML block
  3. D7 sync-failures.jsonl exclusion — files listed as failed by
     gbrain do NOT get state-recorded; one of two test sessions lands,
     the other stays un-ingested for retry next run
  4. Missing-`import`-subcommand error path — when gbrain only advertises
     legacy `put`, memory-ingest exits 1 with [memory-ingest] ERR
  5. --scan-secrets opt-in path — verifies a dirty-source file is
     skipped via the secret-scan match when the flag is on, while a
     clean session in the same run still gets staged
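
The D1 round-trip asserted by test 1 reduces to a path mapping along these lines (helper name is illustrative):

```typescript
import { join } from "node:path";

// D1: a slug like "transcripts/claude-code/my-project/session-1" stages
// hierarchically at <stagingRoot>/transcripts/claude-code/my-project/
// session-1.md, one directory per slug segment, never flat at the root.
function stagedPathForSlug(stagingRoot: string, slug: string): string {
  const segments = slug.split("/").filter(Boolean);
  const file = `${segments.pop()}.md`;
  return join(stagingRoot, ...segments, file);
}
```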

Replaces the prior put-per-file shim with an import-batch shim. The
shim fails loudly (exit 99) if the new code ever regresses to per-file
`gbrain put` calls.

test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md: refresh
golden baselines to match the current generated SKILL.md content after
the v1.31.0.0 AskUserQuestion fallback-clause deletion. The goldens were
stale from that release; the test was failing on origin/main before this
PR. Caught by the /ship test pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.33.0.0 docs: design doc, P2 perf TODOs, gbrain guidance block, changelog

docs/designs/SYNC_GBRAIN_BATCH_INGEST.md: full design doc with the 8
decisions (D1-D8), source-verified gbrain behaviors (content_hash
idempotency, frontmatter parity, path-authoritative slug, per-file
failure surface), measured performance vs plan target, F9 hash
migration one-time cliff note, and follow-up TODOs.

CLAUDE.md: append `## GBrain Search Guidance` block from /sync-gbrain
indicating this worktree's pin and how the agent should prefer gbrain
search over Grep for semantic queries.

TODOS.md: P2 `gbrain import` perf-on-large-staging-dirs investigation (a
5,131-file staging dir takes >10min inside gbrain while 501 files take 10s;
likely N+1 SQL or auto-link reconciliation). P3 cache-no-changes-since-last-import
at the prepare-batch level for true no-op fast paths.

VERSION + package.json: bump to 1.33.0.0 (queue-aware via
bin/gstack-next-version — skipped v1.32.0.0 which is claimed by
sibling worktree garrytan/wellington / PR #1431).

CHANGELOG.md: v1.33.0.0 entry per the release-summary format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: setup-gbrain/memory.md reflects opt-in per-file gitleaks

Per-file gitleaks scanning during memory ingest is now opt-in via
--scan-secrets (or GSTACK_MEMORY_INGEST_SCAN_SECRETS=1). Update the
user-facing reference doc so it stops claiming "every page passes
through gitleaks." Also corrects the /gbrain-sync → /sync-gbrain
command typo and the post-incident recovery section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit d21ba06b5a (parent 74895062fb), authored by Garry Tan,
2026-05-11 18:47:33 -07:00, committed by GitHub.
12 changed files with 1523 additions and 223 deletions.

TODOS.md (+61):
# TODOS
## /sync-gbrain memory stage perf follow-up
### P2: Investigate `gbrain import` perf on large staging dirs
**What:** Cold-run time on a 5131-file staging dir is >10 min in `gbrain import`
alone (after gstack's prepare phase, which is now <10s after dropping per-file
gitleaks). On 501 files it took 10s. The scaling is worse than linear and the
bottleneck is inside gbrain, not the gstack orchestrator.
**Why:** With memory-ingest's prepare phase now fast, the remaining cold-run cost
is entirely on the gbrain side. Users with large corpora (5K+ files) currently pay
~15-30 min on first ingest. Likely culprits in `~/git/gbrain/src/core/import-file.ts`:
- N+1 SQL queries: `engine.getPage(slug)` for each file's content_hash check
(line 242 + 478) — should be batched into a single query
- Per-page auto-link reconciliation that fires even for unchanged content
- FTS / vector index updates without batching transactions
**Pros:** Lives in gbrain (cleaner separation). Fix in gbrain benefits other
gbrain callers too (`gbrain sync`, MCP `put_page` workflows). Likely 10-50x
speedup from batched queries alone.
**Cons:** Cross-repo change, requires gbrain test coverage for the new batched
path. Not on the gstack critical path; gstack's architecture is already correct.
**Context:** Verified on real corpus 2026-05-10. gstack-side prepare with
`--scan-secrets` off runs in <10s. The full gbrain import on the same staged
dir consumes 100% CPU for >10 min. Both observations from
`bin/gstack-memory-ingest.ts:ingestPass` reaching the `runGbrainImport` call
quickly, then the child process taking the bulk of the wall time.
**Depends on:** None — gstack's batch-ingest architecture (D1-D8 in
`docs/designs/SYNC_GBRAIN_BATCH_INGEST.md`) is already shipped and correct.
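If the N+1 hypothesis holds, the fix could look roughly like this (table and column names are illustrative, not gbrain's actual schema):
```typescript
// Hypothetical batched lookup: instead of engine.getPage(slug) once per
// staged file, fetch every staged slug's stored content_hash in a single
// query, then diff hashes in memory. One round-trip replaces 5,000+.
function batchedHashQuery(slugs: string[]): { sql: string; params: string[] } {
  const placeholders = slugs.map(() => "?").join(", ");
  return {
    sql: `SELECT slug, content_hash FROM pages WHERE slug IN (${placeholders})`,
    params: slugs,
  };
}
```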
---
### P3: Cache "no changes since last import" at the prepare-batch level
**What:** Even with the prepare phase fast (<10s for 5135 files), walking and
mtime-stat'ing every file on a true no-op run adds a few seconds and creates
spurious staging dirs. Cache the most-recent-source-mtime per-source in the
state file; if no source dir has a newer mtime, skip the walk + stage + import
entirely.
**Why:** Most `/sync-gbrain` invocations have nothing new to ingest. The
fastest path is "do nothing, fast." `gbrain doctor` should still report state,
but the actual ingest pipeline can short-circuit when last_full_walk is recent
and no source-tree mtime has moved.
**Pros:** Trivial implementation (~20 lines in `ingestPass`). Makes the
incremental fast-path actually live up to "<30s" in the original plan.
**Cons:** Adds a cache invalidation surface. If a user edits a file but its
parent dir's mtime doesn't update (rare on macOS APFS), changes get missed.
Mitigation: only short-circuit when last_full_walk is recent (e.g. <1 min ago).
**Context:** Filed during 2026-05-10 perf testing after `--scan-secrets` was
made opt-in. Lower priority than the gbrain-side perf issue above.
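A minimal sketch of the short-circuit under these assumptions (names and the cached-state shape are illustrative):
```typescript
// P3 sketch: cache the last full walk time and the newest source mtime
// seen. Skip the walk + stage + import only when the cache is fresh
// (bounding the APFS dir-mtime staleness risk) and no source dir's mtime
// has moved past the cached maximum. All timestamps in epoch ms.
interface WalkCache {
  lastFullWalkMs: number;
  maxSourceMtimeMs: number;
}

function canSkipWalk(
  cache: WalkCache,
  currentSourceDirMtimesMs: number[],
  nowMs: number,
  maxCacheAgeMs = 60_000, // e.g. <1 min, per the mitigation above
): boolean {
  if (nowMs - cache.lastFullWalkMs > maxCacheAgeMs) return false;
  return currentSourceDirMtimesMs.every((m) => m <= cache.maxSourceMtimeMs);
}
```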
---
## Browser-skills follow-on (Phases 2-4)
### P1: Browser-skills Phase 2 — `/scrape` and `/skillify` skill templates