v1.33.0.0 feat: /sync-gbrain memory-stage batch-import refactor (D1-D8) + F6/F9 + signal cleanup (#1432)

* refactor: batch-import architecture (D1-D8) + F6 atomic state + F9 full-file hash

bin/gstack-memory-ingest.ts: rewrite memory ingest around the `gbrain import <dir>`
batch path. Replaces the per-file gbrainPutPage loop (~470s of subprocess startup
per cold run) with prepare-then-batch:

  walkAllSources
    -> preparePages: mtime-skip + optional gitleaks (--scan-secrets) + parse
    -> writeStaged: mkdir -p per slug segment, hierarchical (D1)
    -> snapshot ~/.gbrain/sync-failures.jsonl byte offset
    -> runGbrainImport (async spawn) -> parseImportJson
    -> readNewFailures: read appended bytes, map back to source paths (D7)
    -> state.sessions[path] = {...} for files NOT in failed set
    -> saveStateAtomic (F6) + cleanupStagingDir
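
The D7 snapshot/tail-read pair in the middle of the pipeline can be sketched like this (helper names and the `source_path` field are assumptions for illustration, not the actual implementation):

```typescript
import { statSync, openSync, readSync, closeSync } from "node:fs";

// D7 sketch: record the failures log's byte length before the import,
// then read only the bytes gbrain appended during this run and map each
// JSONL record back to its source path.
function snapshotOffset(path: string): number {
  try {
    return statSync(path).size;
  } catch {
    return 0; // log may not exist yet on a fresh install
  }
}

function readNewFailures(path: string, offset: number): string[] {
  const size = statSync(path).size;
  if (size <= offset) return []; // nothing appended this run
  const buf = Buffer.alloc(size - offset);
  const fd = openSync(path, "r");
  readSync(fd, buf, 0, buf.length, offset);
  closeSync(fd);
  return buf
    .toString("utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line).source_path as string);
}
```

Files named in the tail are excluded from state recording, so they retry on the next run.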

Architecture decisions:
  D1 hierarchical staging dir
  D2 cut over, deleted gbrainPutPage entirely
  D3 source-file gitleaks made opt-in via --scan-secrets (gstack-brain-sync
     owns the cross-machine boundary; the per-file scan was a redundant ~470s tax)
  D4 OK/ERR verdict (no DEGRADED tri-state)
  D5 unified state schema (no separate skip-list)
  D6 trust gbrain content_hash idempotency (no skip_reason bookkeeping)
  D7 byte-offset snapshot of sync-failures.jsonl + per-source mapping
  F6 saveState uses tmp+rename atomic write
  F9 fileSha256 removes 1MB cap; full-file hash (no more silent tail-edit
     misses on long partial transcripts)
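
The F9 change amounts to a streaming full-file hash; a minimal sketch (not the actual fileSha256 implementation):

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// F9: hash the entire file. The old implementation capped reads at 1MB, so
// an edit past the first 1MB of a long transcript produced an unchanged
// hash and was silently skipped. Streaming keeps memory flat at any size.
function fileSha256(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}
```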

Signal handling: installSignalForwarder propagates SIGTERM/SIGINT to the
gbrain child process AND synchronously cleans the staging dir before
process.exit. Pre-fix, orchestrator timeouts left gbrain processes
orphaned holding the PGLite write lock (observed: 15-hour-CPU-time
orphan still alive a day later).
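
A minimal sketch of that forwarder, assuming the child handle and staging dir are in scope (names and exit codes are illustrative):

```typescript
import type { ChildProcess } from "node:child_process";
import { rmSync } from "node:fs";

// On SIGTERM/SIGINT: forward the signal to the gbrain child (so it exits
// and releases the PGLite write lock instead of running orphaned), remove
// the staging dir synchronously, then exit.
function installSignalForwarder(
  child: Pick<ChildProcess, "kill">,
  stagingDir: string,
): void {
  for (const sig of ["SIGTERM", "SIGINT"] as const) {
    process.on(sig, () => {
      child.kill(sig); // don't orphan the child
      rmSync(stagingDir, { recursive: true, force: true }); // sync cleanup
      process.exit(sig === "SIGINT" ? 130 : 143);
    });
  }
}
```

The cleanup must be synchronous: once the handler calls `process.exit`, pending async work is abandoned.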

parseImportJson returns null on unparseable output (treated as ERR by the
caller) instead of silently reporting zero counts.
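
A minimal sketch of that contract (the field names are assumptions, not gbrain's actual JSON shape):

```typescript
// Parse the import summary from gbrain's stdout. Anything unparseable,
// or not a JSON object, yields null; the caller treats null as ERR
// rather than reporting a bogus all-zero import.
interface ImportSummary {
  imported?: number;
  skipped?: number;
  failed?: number;
}

function parseImportJson(stdout: string): ImportSummary | null {
  try {
    const parsed = JSON.parse(stdout.trim());
    if (typeof parsed !== "object" || parsed === null) return null;
    return parsed as ImportSummary;
  } catch {
    return null; // unparseable output -> ERR, never a silent zero
  }
}
```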

gbrainAvailable() probes for the `import` subcommand instead of `put`.

Plan + review chain at /Users/garrytan/.claude/plans/purrfect-tumbling-quiche.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: orchestrator OK/ERR verdict parser for batch memory ingest

gstack-gbrain-sync.ts: memory-stage parser now picks [memory-ingest] ERR
lines preferentially over the latest [memory-ingest] line, strips the
prefix and any leading 'ERR: ' for cleaner summary output, and surfaces
'(killed by signal / timeout)' when the child exits with status=null.

Matches D6's OK/ERR contract: per-file failures (FILE_TOO_LARGE etc.)
show in the summary count but only system-level failures (gbrain crash,
process kill, missing CLI) mark the stage ERR.
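
A rough sketch of that selection logic (function name and exact prefix handling are illustrative):

```typescript
// Prefer the last "[memory-ingest] ERR" line; otherwise fall back to the
// last "[memory-ingest]" line. Strip the prefix and any leading "ERR: "
// for a clean summary. statusNull marks a child that exited with
// status === null, i.e. killed by signal or timeout.
function memoryStageSummary(lines: string[], statusNull: boolean): string {
  const tagged = lines.filter((l) => l.includes("[memory-ingest]"));
  const errs = tagged.filter((l) => l.includes("[memory-ingest] ERR"));
  const pick = errs.at(-1) ?? tagged.at(-1) ?? "";
  const cleaned = pick
    .replace(/^.*\[memory-ingest\]\s*/, "")
    .replace(/^ERR:\s*/, "");
  return statusNull
    ? `${cleaned} (killed by signal / timeout)`.trim()
    : cleaned;
}
```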

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: batch-ingest writer regressions + refresh golden ship fixtures

test/gstack-memory-ingest.test.ts: 5 new tests for the batch-import
architecture:
  1. D1 hierarchical staging slug round-trip — asserts staged file lives
     in transcripts/claude-code/<dir>/*.md, not flat at staging root
  2. Frontmatter injection — asserts title/type/tags written into the
     staged page's YAML block
  3. D7 sync-failures.jsonl exclusion — files listed as failed by
     gbrain do NOT get state-recorded; one of two test sessions lands,
     the other stays un-ingested for retry next run
  4. Missing-`import`-subcommand error path — when gbrain only advertises
     legacy `put`, memory-ingest exits 1 with [memory-ingest] ERR
  5. --scan-secrets opt-in path — verifies a dirty-source file is
     skipped via the secret-scan match when the flag is on, while a
     clean session in the same run still gets staged
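
The D1 round-trip asserted by test 1 reduces to a path mapping along these lines (helper name is illustrative):

```typescript
import { join } from "node:path";

// D1: a slug like "transcripts/claude-code/my-project/session-1" stages
// hierarchically at <stagingRoot>/transcripts/claude-code/my-project/
// session-1.md, one directory per slug segment, never flat at the root.
function stagedPathForSlug(stagingRoot: string, slug: string): string {
  const segments = slug.split("/").filter(Boolean);
  const file = `${segments.pop()}.md`;
  return join(stagingRoot, ...segments, file);
}
```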

Replaces the prior put-per-file shim with an import-batch shim. The
shim fails loudly (exit 99) if the new code ever regresses to per-file
`gbrain put` calls.

test/fixtures/golden/{claude,codex,factory}-ship-SKILL.md: refresh
golden baselines to match the current generated SKILL.md content after
the v1.31.0.0 AskUserQuestion fallback-clause deletion. The goldens were
stale from that release; the test was failing on origin/main before this
PR. Caught by the /ship test pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.33.0.0 docs: design doc, P2 perf TODOs, gbrain guidance block, changelog

docs/designs/SYNC_GBRAIN_BATCH_INGEST.md: full design doc with the 8
decisions (D1-D8), source-verified gbrain behaviors (content_hash
idempotency, frontmatter parity, path-authoritative slug, per-file
failure surface), measured performance vs plan target, F9 hash
migration one-time cliff note, and follow-up TODOs.

CLAUDE.md: append `## GBrain Search Guidance` block from /sync-gbrain
indicating this worktree's pin and how the agent should prefer gbrain
search over Grep for semantic queries.

TODOS.md: P2 `gbrain import` perf-on-large-staging-dirs investigation (a
5,131-file staging dir takes >10min inside gbrain while 501 files take 10s;
likely N+1 SQL or auto-link reconciliation). P3 cache-no-changes-since-last-import
at the prepare-batch level for true no-op fast paths.

VERSION + package.json: bump to 1.33.0.0 (queue-aware via
bin/gstack-next-version — skipped v1.32.0.0 which is claimed by
sibling worktree garrytan/wellington / PR #1431).

CHANGELOG.md: v1.33.0.0 entry per the release-summary format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: setup-gbrain/memory.md reflects opt-in per-file gitleaks

Per-file gitleaks scanning during memory ingest is now opt-in via
--scan-secrets (or GSTACK_MEMORY_INGEST_SCAN_SECRETS=1). Update the
user-facing reference doc so it stops claiming "every page passes
through gitleaks." Also corrects the /gbrain-sync → /sync-gbrain
command typo and the post-incident recovery section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit d21ba06b5a (parent 74895062fb), authored by Garry Tan,
2026-05-11 18:47:33 -07:00, committed by GitHub.
12 changed files with 1523 additions and 223 deletions.

TODOS.md (+61):
# TODOS
## /sync-gbrain memory stage perf follow-up
### P2: Investigate `gbrain import` perf on large staging dirs
**What:** Cold-run time on a 5131-file staging dir is >10 min in `gbrain import`
alone (after gstack's prepare phase, which is now <10s after dropping per-file
gitleaks). On 501 files it took 10s. The scaling is worse than linear and the
bottleneck is inside gbrain, not the gstack orchestrator.
**Why:** With memory-ingest's prepare phase now fast, the remaining cold-run cost
is entirely on the gbrain side. Users with large corpora (5K+ files) currently pay
~15-30 min on first ingest. Likely culprits in `~/git/gbrain/src/core/import-file.ts`:
- N+1 SQL queries: `engine.getPage(slug)` for each file's content_hash check
(line 242 + 478) — should be batched into a single query
- Per-page auto-link reconciliation that fires even for unchanged content
- FTS / vector index updates without batching transactions
**Pros:** Lives in gbrain (cleaner separation). Fix in gbrain benefits other
gbrain callers too (`gbrain sync`, MCP `put_page` workflows). Likely 10-50x
speedup from batched queries alone.
**Cons:** Cross-repo change, requires gbrain test coverage for the new batched
path. Not on the gstack critical path; gstack's architecture is already correct.
**Context:** Verified on real corpus 2026-05-10. gstack-side prepare with
`--scan-secrets` off runs in <10s. The full gbrain import on the same staged
dir consumes 100% CPU for >10 min. Both observations from
`bin/gstack-memory-ingest.ts:ingestPass` reaching the `runGbrainImport` call
quickly, then the child process taking the bulk of the wall time.
**Depends on:** None — gstack's batch-ingest architecture (D1-D8 in
`docs/designs/SYNC_GBRAIN_BATCH_INGEST.md`) is already shipped and correct.
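If the N+1 hypothesis holds, the fix could look roughly like this (table and column names are illustrative, not gbrain's actual schema):
```typescript
// Hypothetical batched lookup: instead of engine.getPage(slug) once per
// staged file, fetch every staged slug's stored content_hash in a single
// query, then diff hashes in memory. One round-trip replaces 5,000+.
function batchedHashQuery(slugs: string[]): { sql: string; params: string[] } {
  const placeholders = slugs.map(() => "?").join(", ");
  return {
    sql: `SELECT slug, content_hash FROM pages WHERE slug IN (${placeholders})`,
    params: slugs,
  };
}
```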
---
### P3: Cache "no changes since last import" at the prepare-batch level
**What:** Even with the prepare phase fast (<10s for 5135 files), walking and
mtime-stat'ing every file on a true no-op run adds a few seconds and creates
spurious staging dirs. Cache the most-recent-source-mtime per-source in the
state file; if no source dir has a newer mtime, skip the walk + stage + import
entirely.
**Why:** Most `/sync-gbrain` invocations have nothing new to ingest. The
fastest path is "do nothing, fast." `gbrain doctor` should still report state,
but the actual ingest pipeline can short-circuit when last_full_walk is recent
and no source-tree mtime has moved.
**Pros:** Trivial implementation (~20 lines in `ingestPass`). Makes the
incremental fast-path actually live up to "<30s" in the original plan.
**Cons:** Adds a cache invalidation surface. If a user edits a file but its
parent dir's mtime doesn't update (rare on macOS APFS), changes get missed.
Mitigation: only short-circuit when last_full_walk is recent (e.g. <1 min ago).
**Context:** Filed during 2026-05-10 perf testing after `--scan-secrets` was
made opt-in. Lower priority than the gbrain-side perf issue above.
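A minimal sketch of the short-circuit under these assumptions (names and the cached-state shape are illustrative):
```typescript
// P3 sketch: cache the last full walk time and the newest source mtime
// seen. Skip the walk + stage + import only when the cache is fresh
// (bounding the APFS dir-mtime staleness risk) and no source dir's mtime
// has moved past the cached maximum. All timestamps in epoch ms.
interface WalkCache {
  lastFullWalkMs: number;
  maxSourceMtimeMs: number;
}

function canSkipWalk(
  cache: WalkCache,
  currentSourceDirMtimesMs: number[],
  nowMs: number,
  maxCacheAgeMs = 60_000, // e.g. <1 min, per the mitigation above
): boolean {
  if (nowMs - cache.lastFullWalkMs > maxCacheAgeMs) return false;
  return currentSourceDirMtimesMs.every((m) => m <= cache.maxSourceMtimeMs);
}
```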
---
## Browser-skills follow-on (Phases 2-4)
### P1: Browser-skills Phase 2 — `/scrape` and `/skillify` skill templates