Files
Garry Tan cc42f14a58 docs: gstack compact design doc (tabled pending Anthropic API) (#1027)
Preserves the full architecture, 15 locked eng-review decisions, B-series
benchmark spec, codex review findings, and research that confirmed Claude
Code's PostToolUse cannot replace non-MCP tool output today. Tracks
anthropics/claude-code#36843 for the unblocking API.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-16 15:04:26 -07:00

58 KiB
Raw Permalink Blame History

GCOMPACTION.md — Design & Architecture (TABLED)

Target path on approval: docs/designs/GCOMPACTION.md

This is the preserved design artifact for gstack compact. Everything above the first --- divider below gets extracted verbatim to docs/designs/GCOMPACTION.md on plan approval. Everything after that divider is archived research (office hours + competitive deep-dive + eng-review notes + codex review + research findings) that informed the design.


Status: TABLED (2026-04-17) — pending Anthropic updatedBuiltinToolOutput API

Why tabled. The v1 architecture assumed a Claude Code PostToolUse hook could REPLACE the tool output that enters the model's context for built-in tools (Bash, Read, Grep, Glob, WebFetch). Research on 2026-04-17 confirmed this is not possible today.

Evidence:

  1. Official docs (https://code.claude.com/docs/en/hooks): The only output-replace field documented for PostToolUse is hookSpecificOutput.updatedMCPToolOutput, and the docs explicitly state: "For MCP tools only: replaces the tool's output with the provided value." No equivalent field exists for built-in tools.
  2. Anthropic issue #36843 (OPEN): Anthropic themselves acknowledge the gap. "PostToolUse hooks can replace MCP tool output via updatedMCPToolOutput, but there is no equivalent for built-in tools (WebFetch, WebSearch, Bash, Read, etc.)... They can only add warnings via decision: block (which injects a reason string) or additionalContext. The original malicious content still reaches the model."
  3. RTK mechanism (source-reviewed at src/hooks/init.rs:906-912 and hooks/claude/rtk-rewrite.sh:83-100): RTK is NOT a PostToolUse compactor. It's a PreToolUse Bash matcher that rewrites tool_input.command (e.g., git statusrtk git status). The wrapped command produces compact stdout itself. RTK README confirms: "the hook only runs on Bash tool calls. Claude Code built-in tools like Read, Grep, and Glob do not pass through the Bash hook, so they are not auto-rewritten." RTK is Bash-only by architectural constraint, not by choice.
  4. tokenjuice mechanism (source-reviewed at src/core/claude-code.ts:160, 491, 540-549): tokenjuice DOES register PostToolUse with matcher: "Bash" but has no real output-replace API available — it hijacks decision: "block" + reason to inject compacted text. Whether this actually reduces model-context tokens or just overlays UI output is disputed. tokenjuice is also Bash-only.
  5. Read/Grep/Glob execute in-process inside Claude Code and bypass hooks entirely. Wedge (ii) "native-tool coverage" was architecturally impossible from day one regardless of replacement API.

Consequence. Both wedges are dead in their original form:

  • Wedge (i) "Conditional LLM verifier" — still technically possible, but only for Bash output, via PreToolUse command wrapping (RTK's mechanism). The verifier stops being a differentiator once we're also Bash-only.
  • Wedge (ii) "Native-tool coverage" — impossible today. Read/Grep/Glob don't fire hooks. Even if they did, no output-replace field exists.

Decision. Shelve gstack compact entirely. Track Anthropic issue #36843 for the arrival of updatedBuiltinToolOutput (or equivalent). When that API ships, this design doc + the 15 locked decisions below + the research archive at the bottom become the unblocking artifacts for a fresh implementation sprint.

If un-tabling: Start from the "Decisions locked during plan-eng-review" block below — most remain valid. Then re-verify the hooks reference against the newly-shipped API, update the Architecture data-flow diagram to use whatever real output-replacement field exists, and re-run /codex review against the revised plan before coding.

What we're NOT doing:

  • Not shipping a Bash-only PreToolUse wrapper. That's RTK's product; they're at 28K stars and 3 years of rule scars. No wedge.
  • Not shipping the decision: block + reason hack. Undocumented behavior, Anthropic could break it, and the model may still see the raw output alongside the compacted overlay — context savings are disputed.
  • Not shipping B-series benchmark in isolation. Without a working compactor, there's nothing to benchmark.

Cost of tabling: ~0. No code was written. The design doc + research + decisions remain as a ready-to-unblock artifact.


Decisions locked during plan-eng-review (2026-04-17)

Preserved for the un-tabling sprint if/when Anthropic ships the built-in-tool output-replace API.

Summary of every decision made during the engineering review. Full rationale is preserved throughout the sections below; this block is the single source of truth if anything else drifts.

Scope (Section 0):

  1. Claude-first v1. Ship compact + rules + verifier on Claude Code only. Codex + OpenClaw land at v1.1 after the wedge is proven on the primary host. Cuts ~2 days of host integration and derisks launch. The original "wedge (ii) native-tool coverage" claim applies to Claude Code at v1; we make no cross-host claim until v1.1.
  2. 13-rule launch library. v1 ships tests (jest/vitest/pytest/cargo-test/go-test/rspec) + git (diff/log/status) + install (npm/pnpm/pip/cargo). Build/lint/log families defer to v1.1, driven by gstack compact discover telemetry from real users.
  3. Verifier default ON at v1.0. failureCompaction trigger (exit≠0 AND >50% reduction) is enabled out of the box. The verifier IS the wedge — defaulting it off hides the differentiating feature. Trigger bounds already keep expected fire rate ≤10% of tool calls.

Architecture (Section 1): 4. Exact line-match sanitization for Haiku output. Split raw output by \n, put lines in a set, only append lines from Haiku that appear verbatim in that set. Tightest adversarial contract; prompt-injection attempts cannot slip in novel text. 5. Layered failureCompaction signal. Prefer exitCode from the envelope; if the host omits it, fall back to /FAIL|Error|Traceback|panic/ regex on the output. Log which signal fired in meta.failureSignal ("exit" | "pattern" | "none"). Pre-implementation task #1 still verifies Claude Code's envelope empirically, but the system no longer breaks if it doesn't. 6. Deep-merge rule resolution. User/project rules inherit built-in fields they don't override. Escape hatch: "extends": null in a rule file triggers full replacement semantics. Matches the mental model of eslint/tsconfig/.gitignore — override a piece without losing the rest.

Code quality (Section 2): 7. Per-rule regex timeout, no RE2 dep. Run each rule's regex via a 50ms AbortSignal budget; on timeout, skip the rule and record meta.regexTimedOut: [ruleId]. Avoids a WASM dependency and keeps rule-author syntax unconstrained. 8. Pre-compiled rule bundle. gstack compact install and gstack compact reload produce ~/.gstack/compact/rules.bundle.json (deep-merged, regex-compiled metadata cached). Hook reads that single file instead of parsing N source files. 9. Auto-reload on mtime drift. Hook stats rule source files on startup; if any source file is newer than the bundle, rebuild in-line before applying. Adds ~0.5ms/invocation but eliminates the "I edited a rule and nothing changed" footgun. 10. Expanded v1 redaction set. Tee files redact: AWS keys, GitHub tokens (ghp_/gho_/ghs_/ghu_), GitLab tokens (glpat-), Slack webhooks, generic JWT (three base64 segments), generic bearer tokens, SSH private-key headers (-----BEGIN * PRIVATE KEY-----). Credit cards / SSNs / per-key env-pairs deferred to a full DLP layer in v2.

Testing (Section 3): 11. P-series gate subset. v1 gate-tier P-tests: P1 (binary garbage), P3 (empty output), P6 (RTK-killer critical stack frame), P8 (secrets to tee), P15 (hook timeout), P18 (prompt injection), P26 (malformed user rule JSON), P28 (regex DoS), P30 (Haiku hallucination). Remaining 21 P-cases grow R-series as real bugs hit. 12. Fixture version-stamping. Every golden fixture has a toolVersion: frontmatter. CI warns when fixture toolVersion ≠ currently installed. No more calendar-based rotation. 13. B-series real-world benchmark testbench (hard v1 gate). New component compact/benchmark/ scans ~/.claude/projects/**/*.jsonl, ranks the noisiest tool calls, clusters them into named scenarios, replays the compactor against them, and reports reduction-by-rule-family. v1 cannot ship until B-series on the author's own 30-day corpus shows ≥15% reduction AND zero critical-line loss on planted bugs. Local-only; never uploads. Community-shared corpus is v2.

Performance (Section 4): 14. Revised latency budgets. Bun cold-start on macOS ARM is 15-25ms; the original 10ms p50 target was unrealistic. New budgets: <30ms p50 / <80ms p99 on macOS ARM, <20ms p50 / <60ms p99 on Linux (verifier off). Verifier-fires budget stays <600ms p50 / <2s p99. Daemon mode is a v2 option gated on B-series showing cold-start hurts session savings. 15. Line-oriented streaming pipeline. Readline over stdin → filter → group → dedupe → ring-buffered tail truncation → stdout. Any single line >1MB hits P9 (truncate to 1KB with [... truncated ...] marker). Caps memory at 64MB regardless of total output size.

Every row above is a MUST in the implementation. Drift requires a new eng-review.


Summary

gstack compact was designed as a PostToolUse hook that reduces tool-output noise before it reaches an AI coding agent's context window. Deterministic JSON rules would shrink noisy test runners, build logs, git diffs, and package installs. A conditional Claude Haiku verifier would act as a safety net when over-compaction risk was high.

Current status: TABLED. See "Status" section above. The architecture depends on a Claude Code API (updatedBuiltinToolOutput or equivalent for built-in tools) that does not exist as of 2026-04-17. Anthropic issue #36843 tracks the gap.

Intended goal (preserved for the un-tabling sprint): 1530% tool-output token reduction per long session, with zero increase in task-failure rate.

Original wedge (vs RTK, the 28K-star incumbent) — both invalidated by research:

  1. Conditional LLM verifier. Still technically viable via PreToolUse command wrapping, but only for Bash. Stops being a differentiator once we're Bash-only. Reconsider if the built-in-tool API arrives.
  2. Native-tool coverage. Architecturally impossible today. Read/Grep/Glob execute in-process inside Claude Code and do not fire hooks. Even for tools that do fire PostToolUse, no output-replacement field exists for non-MCP tools.

Original positioning (now moot): "RTK is fast. gstack compact is fast AND safe, and it covers every tool in your toolbox, not just Bash."

Non-goals

  • Summarizing user messages or prior agent turns (Claude's own Compaction API owns that).
  • Compressing agent response output (caveman's layer).
  • Caching tool calls to avoid re-execution (token-optimizer-mcp's layer).
  • Acting as a general-purpose log analyzer.
  • Replacing the agent's own judgement about when to re-run a command with GSTACK_RAW=1.

Why this is worth building

Problem is measured, not hypothetical.

  • Chroma research (2025) tested 18 frontier models. Every model degrades as context grows. Rot starts well before the window limit — a 200K model rots at 50K.
  • Coding agents are the worst case: accumulative context + high distractor density + long task horizon. Tool output is explicitly named as a primary noise source.
  • The market has voted: Anthropic shipped Opus 4.6 Compaction API; OpenAI shipped a compaction guide; Google ADK shipped context compression; LangChain shipped autonomous compression; sst/opencode has built-in compaction. The hybrid deterministic + LLM pattern is industry consensus.

Existing field (what gstack compact joins and differentiates from):

Project Stars License Layer Threat Note
RTK (rtk-ai/rtk) 28K Apache-2.0 Tool output Primary benchmark Pure Rust, Bash-only, zero LLM
caveman 34.8K MIT Output tokens Different axis Terse system prompt; pairs WITH us
claude-token-efficient 4.3K MIT Response verbosity Different axis Single CLAUDE.md
token-optimizer-mcp 49 MIT MCP caching Different axis Prevents calls rather than compresses output
tokenjuice ~12 MIT Tool output Too new 2 days old; inspired our JSON envelope
6-Layer Token Savings Stack Public gist Recipe Zero Documentation; validates stacked compaction thesis

RTK is the only direct competitor. Everything else compresses a different token source.

License compatibility: Every referenced project is permissive-licensed (MIT or Apache-2.0) and compatible with gstack's MIT license. No AGPL, GPL, or other copyleft dependencies. See the "License & attribution" section below for the clean-room policy.

Architecture

Data flow

┌─────────────────────────────────────────────────────────────────┐
│  Host (Claude Code / Codex / OpenClaw)                          │
│  ─────────────────────────────────────────                      │
│  1. Agent requests tool call: Bash|Read|Grep|Glob|MCP           │
│  2. Host executes tool                                          │
│  3. Host invokes PostToolUse hook with: {tool, input, output}   │
└────────────────────┬────────────────────────────────────────────┘
                     │ stdin (JSON envelope)
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│  gstack-compact hook binary                                     │
│  ───────────────────────────                                    │
│  a. Parse envelope                                              │
│  b. Match rule by (tool, command, pattern)                      │
│  c. Apply rule primitives: filter / group / truncate / dedupe   │
│  d. Record reduction metadata                                   │
│  e. Evaluate verifier triggers                                  │
│  f. If trigger met: call Haiku, append preserved lines          │
│  g. On failure exit code: tee raw to ~/.gstack/compact/tee/...  │
│  h. Emit JSON envelope to stdout                                │
└────────────────────┬────────────────────────────────────────────┘
                     │ stdout (JSON envelope)
                     ▼
              Host substitutes compacted output into agent context

Rule resolution

Three-tier hierarchy (highest precedence wins), same pattern as tokenjuice and gstack's existing host-config-export model:

  1. Built-in rules: compact/rules/ shipped with gstack
  2. User rules: ~/.config/gstack/compact-rules/
  3. Project rules: .gstack/compact-rules/

Rules match tool calls by rule ID. A project rule with ID tests/jest overrides the built-in tests/jest entirely. No merging — replace semantics, to keep reasoning simple.

JSON envelope contract (adopted from tokenjuice)

Input:

{
  "tool": "Bash",
  "command": "bun test test/billing.test.ts",
  "argv": ["bun", "test", "test/billing.test.ts"],
  "combinedText": "...",
  "exitCode": 1,
  "cwd": "/Users/garry/proj",
  "host": "claude-code"
}

Output:

{
  "reduced": "compacted output with [gstack-compact: N → M lines, rule: X] header",
  "meta": {
    "rule": "tests/jest",
    "linesBefore": 247,
    "linesAfter": 18,
    "bytesBefore": 18234,
    "bytesAfter": 892,
    "verifierFired": false,
    "teeFile": null,
    "durationMs": 8
  }
}

Rule schema

Compact, minimal. Total rules-payload must stay <5KB on disk (lesson from claude-token-efficient: rule files themselves consume tokens on every session).

{
  "id": "tests/jest",
  "family": "test-results",
  "description": "Jest/Vitest output — preserve failures and summary counts",
  "match": {
    "tools": ["Bash"],
    "commands": ["jest", "vitest", "bun test"],
    "patterns": ["jest", "vitest", "PASS", "FAIL"]
  },
  "primitives": {
    "filter": {
      "strip": ["\\x1b\\[[0-9;]*m", "^\\s*at .+node_modules"],
      "keep": ["FAIL", "PASS", "Error:", "Expected:", "Received:", "✓", "✗", "Tests:"]
    },
    "group": {
      "by": "error-kind",
      "header": "Errors grouped by type:"
    },
    "truncate": {
      "headLines": 5,
      "tailLines": 15,
      "onFailure": { "headLines": 20, "tailLines": 30 }
    },
    "dedupe": {
      "pattern": "^\\s*$",
      "format": "[... {count} blank lines ...]"
    }
  },
  "tee": {
    "onExit": "nonzero",
    "maxBytes": 1048576
  },
  "counters": [
    { "name": "failed", "pattern": "^FAIL\\s", "flags": "m" },
    { "name": "passed", "pattern": "^PASS\\s", "flags": "m" }
  ]
}

The four primitives — filter, group, truncate, dedupe — are lifted directly from RTK's technique taxonomy (the only thing every serious compactor needs to handle). Any rule can combine any subset of the four; omitted primitives are no-ops.

Verifier layer (tiered, opt-in)

The verifier is a cheap Haiku call that fires only under specific triggers. Never on every tool call.

Trigger matrix (user-configurable):

Trigger Default Condition
failureCompaction ON exit code ≠ 0 AND reduction >50% (diagnosis at risk)
aggressiveReduction off reduction >80% AND original >200 lines
largeNoMatch off no rule matched AND output >500 lines
userOptIn on (env-gated) GSTACK_COMPACT_VERIFY=1 forces verifier for that call

Default config ships with failureCompaction only — the highest-leverage case (agent is debugging; rule may have filtered the critical stack frame).

Haiku's job (bounded):

Here is raw output (truncated to first 2000 lines) and a compacted version.
Return any important lines from the raw that are missing from the compacted,
or `NONE` if nothing critical is missing.

The verifier never rewrites the compacted output. It only appends missing lines under a header:

[gstack-compact: 247 → 18 lines, rule: tests/jest]
[gstack-verify: 2 additional lines preserved by Haiku]
  TypeError: Cannot read property 'foo' of undefined
    at parseConfig (src/config.ts:42:18)

Why Haiku, not Sonnet: ~1/12th the cost, ~500ms vs ~2s, and the task is simple substring classification, not reasoning.

Verifier config (compact/rules/_verifier.json):

{
  "verifier": {
    "enabled": true,
    "model": "claude-haiku-4-5-20251001",
    "maxInputLines": 2000,
    "triggers": {
      "aggressiveReduction": { "enabled": false, "thresholdPct": 80, "minLines": 200 },
      "failureCompaction":   { "enabled": true,  "minReductionPct": 50 },
      "largeNoMatch":        { "enabled": false, "minLines": 500 },
      "userOptIn":           { "enabled": true, "envVar": "GSTACK_COMPACT_VERIFY" }
    },
    "fallback": "passthrough"
  }
}

Failure modes (verifier is strictly additive — never breaks the baseline):

  • No ANTHROPIC_API_KEY → skip verifier, use pure rule output.
  • Haiku call times out (>5s) → skip verifier, use pure rule output.
  • Haiku returns malformed JSON → skip, use pure rule output.
  • Haiku returns prompt-injection attempt → sanitize: only append lines that are substring-matches of the original raw output.
  • Haiku returns hallucinated lines (not present in raw) → drop them.

Tee mode (adopted from RTK)

On any command with exit code ≠ 0, the full unfiltered output is written to ~/.gstack/compact/tee/{timestamp}_{cmd-slug}.log. The compacted output includes a tee-file pointer:

[gstack-compact: 247 → 18 lines, rule: tests/jest, tee: ~/.gstack/compact/tee/20260416-143022_bun-test.log]

The agent can read the tee file directly if it needs the full stack trace. This replaces the earlier onFailure.preserveFull mechanic with a cleaner design: compacted output always stays small; raw output is always one cat away.

Tee safety:

  • File mode 0600 — not world-readable.
  • Built-in secret-regex set redacts AWS keys, bearer tokens, and common credential patterns before write.
  • Failed writes (read-only filesystem, permission denied) degrade gracefully: still emit compacted output, record meta.teeFailed: true.
  • Tee files auto-expire after 7 days (cleanup on hook startup).

Host integration matrix

Host Hook type Supported matchers Config path
Claude Code PostToolUse Bash, Read, Grep, Glob, Edit, Write, WebFetch, WebSearch, mcp__* ~/.claude/settings.json
Codex (v1.1) PostToolUse equivalent Bash (primary); tool subset TBD — empirical verification is a v1.1 prereq ~/.codex/hooks.json
OpenClaw (v1.1) Native hook API Bash + MCP OpenClaw config

v1 is Claude-first. Wedge (ii) — native-tool coverage — is confirmed on Claude Code via the hooks reference. Codex and OpenClaw integration ships at v1.1 only after the wedge is proven on the primary host via B-series benchmark data. CHANGELOG for v1 makes the Claude-only scope explicit.

Config surface

User config (~/.config/gstack/compact.toml):

[compact]
enabled = true
level = "normal"                            # minimal | normal | aggressive (caveman pattern)
exclude_commands = ["curl", "playwright"]   # RTK pattern

[compact.bundle]
auto_reload_on_mtime_drift = true           # hook rebuilds bundle if source rule files are newer
bundle_path = "~/.gstack/compact/rules.bundle.json"

[compact.regex]
per_rule_timeout_ms = 50                    # AbortSignal budget per regex; timeout → skip rule

[compact.verifier]
enabled = true
trigger_failure_compaction = true
trigger_aggressive_reduction = false
trigger_large_no_match = false
failure_signal_fallback = true              # use /FAIL|Error|Traceback|panic/ when exitCode missing
sanitization = "exact-line-match"           # only append lines present verbatim in raw output

[compact.tee]
on_exit = "nonzero"
max_bytes = 1048576
redact_patterns = ["aws", "github", "gitlab", "slack", "jwt", "bearer", "ssh-private-key"]
cleanup_days = 7

[compact.benchmark]
local_only = true                           # hard-coded; config is documentary, cannot be changed
transcript_root = "~/.claude/projects"
output_dir = "~/.gstack/compact/benchmark"
scenario_cap = 20                           # top-N clusters by aggregate output volume

Intensity levels (caveman pattern):

  • minimal: only filter + dedupe; no truncation. Safest.
  • normal: filter + dedupe + truncate. Default.
  • aggressive: adds group; more savings, more edge-case risk.

CLI surface

Command Purpose Source
gstack compact install <host> Register PostToolUse hook in host config; builds rules.bundle.json new
gstack compact uninstall <host> Idempotent removal new
gstack compact reload Rebuild rules.bundle.json after editing user/project rules new
gstack compact doctor Detect drift / broken hook config, offer to repair tokenjuice
gstack compact gain Show token/dollar savings over time (per-rule breakdown) RTK
gstack compact discover Find commands with no matching rule, ranked by noise volume RTK
gstack compact verify <rule-id> Dry-run verifier on a fixture new
gstack compact list-rules Show effective rule set after deep-merge (built-in + user + project) new
gstack compact test <rule-id> <fixture> Apply a rule to a fixture and show the diff new
gstack compact benchmark Run B-series testbench against local transcript corpus (see Benchmark section) new

Escape hatch: GSTACK_RAW=1 env var bypasses the hook entirely for the duration of a command (same pattern as tokenjuice's --raw flag). Hook also auto-reloads the bundle if any source rule file's mtime is newer than the bundle file.

File layout

compact/
├── SKILL.md.tmpl              # template; regen via `bun run gen:skill-docs`
├── src/
│   ├── hook.ts                # entry point; reads stdin, writes stdout; mtime-checks bundle
│   ├── engine.ts              # rule matching + reduction metadata
│   ├── apply.ts               # primitive application (line-oriented streaming pipeline)
│   ├── merge.ts               # deep-merge of built-in/user/project rules; honors `extends: null`
│   ├── bundle.ts              # compile source rules → rules.bundle.json (install/reload)
│   ├── primitives/
│   │   ├── filter.ts
│   │   ├── group.ts
│   │   ├── truncate.ts        # ring-buffered tail; safe for arbitrary input size
│   │   └── dedupe.ts
│   ├── regex-sandbox.ts       # AbortSignal-bounded regex execution (50ms budget per rule)
│   ├── verifier.ts            # Haiku integration (triggers + failure-signal fallback + sanitization)
│   ├── sanitize.ts            # exact-line-match filter for verifier output
│   ├── tee.ts                 # raw-output archival with secret redaction + 7-day cleanup
│   ├── redact.ts              # secret-pattern set (AWS/GitHub/GitLab/Slack/JWT/bearer/SSH)
│   ├── envelope.ts            # JSON I/O contract parsing + validation
│   ├── doctor.ts              # hook drift detection + repair
│   ├── analytics.ts           # gain + discover queries against local metadata
│   └── cli.ts                 # argv dispatch; one thin dispatch per subcommand
├── benchmark/                 # B-series testbench (hard v1 gate)
│   └── src/
│       ├── scanner.ts         # walk ~/.claude/projects/**/*.jsonl; pair tool_use × tool_result
│       ├── sizer.ts           # tokens per call (ceil(len/4) heuristic); rank heavy tail
│       ├── cluster.ts         # group high-leverage calls by (tool, command pattern)
│       ├── scenarios.ts       # emit B1-Bn real-world scenario fixtures
│       ├── replay.ts          # run compactor against scenarios; measure reduction
│       ├── pathology.ts       # layer planted-bug P-cases on top of real scenarios
│       └── report.ts          # dashboard: per-scenario before/after + overall reduction
├── rules/                     # v1 built-in JSON rule library (13 rules)
│   ├── tests/
│   │   ├── jest.json
│   │   ├── vitest.json
│   │   ├── pytest.json
│   │   ├── cargo-test.json
│   │   ├── go-test.json
│   │   └── rspec.json
│   ├── install/
│   │   ├── npm.json
│   │   ├── pnpm.json
│   │   ├── pip.json
│   │   └── cargo.json
│   ├── git/
│   │   ├── diff.json
│   │   ├── log.json
│   │   └── status.json
│   ├── _verifier.json         # verifier config (not a rule per se)
│   └── _HOLD/                 # v1.1 rule families (not shipped at v1; kept for reference)
│       ├── build/
│       ├── lint/
│       └── log/
└── test/
    ├── unit/
    ├── golden/
    ├── fuzz/                  # P-series — v1 gate subset only (P1/P3/P6/P8/P15/P18/P26/P28/P30)
    ├── cross-host/            # v1: claude-code.test.ts only; codex/openclaw stub files
    ├── adversarial/           # R-series — grows with shipped bugs
    ├── benchmark/             # B-series scenario fixtures + expected reduction ranges
    ├── fixtures/              # version-stamped golden inputs (toolVersion: frontmatter)
    └── evals/

Testing Strategy

The test plan is comprehensive by design. Shipping into a space where the 28K-star incumbent has three years of regex battle-scars, with our wedges (Haiku verifier + native-tool coverage) introducing new failure surfaces, means we get ONE shot at "the compactor made my agent dumb" going viral. Zero appetite for that.

Test tiers

Tier Cost Frequency Blocks merge
Unit free, <1s every PR yes
Golden file (with toolVersion: frontmatter) free, <1s every PR yes
Rule schema validation free, <1s every PR yes
Fuzz (P-series gate subset: P1/P3/P6/P8/P15/P18/P26/P28/P30) free, <10s every PR yes
Cross-host E2E — Claude Code only at v1 free, ~1min every PR (gate tier) yes
E2E with verifier (mocked Haiku) free, ~15s every PR yes
E2E with verifier (real Haiku) paid, ~$0.10/run PR touching verifier files yes
B-series benchmark (real-world scenarios) free, ~2min pre-release gate yes (hard gate for v1)
Token-savings eval (E1-E4 synthetic) paid, ~$4/run periodic weekly no (informational)
Adversarial regression (R-series) free, <5s every PR yes
Tool-version drift warning free, <1s every PR warning only

Test file layout:

compact/test/
├── unit/
│   ├── engine.test.ts         # rule matching + primitive application
│   ├── primitives.test.ts     # filter / group / truncate / dedupe
│   ├── envelope.test.ts       # JSON input/output contract
│   ├── triggers.test.ts       # verifier trigger evaluation
│   └── verifier.test.ts       # Haiku call (mocked)
├── golden/
│   ├── tests/                 # one fixture per test runner
│   │   ├── jest-success.input.txt
│   │   ├── jest-success.expected.txt
│   │   ├── jest-fail.input.txt
│   │   ├── jest-fail.expected.txt
│   │   └── ... (vitest, pytest, cargo-test, go-test, rspec)
│   ├── install/
│   ├── git/
│   ├── build/
│   ├── lint/
│   └── log/
├── fuzz/
│   └── pathological.test.ts   # P-series
├── cross-host/
│   ├── claude-code.test.ts
│   ├── codex.test.ts
│   └── openclaw.test.ts
├── adversarial/
│   └── regression.test.ts     # R-series; past bugs that must never recur
├── fixtures/
│   └── {tool}/                # shared raw output fixtures
└── evals/
    └── token-savings.eval.ts  # periodic-tier; measures real reduction

G-series: good cases (must produce expected reduction)

ID Scenario Expected reduction
G1 jest 47 passing tests, clean run 150+ lines → ≤10 lines
G2 jest 47 tests with 2 failures 200+ lines → keep both failures + summary
G3 vitest run with --reporter=verbose 300+ lines → ≤15 lines
G4 pytest collection then run preserve failure tracebacks
G5 cargo test with one panic panic location preserved verbatim
G6 go test -v with 200 subtests passing collapse to PASS: 200 subtests
G7 git diff on a file with 2 hunks in 500 lines of context keep hunks, drop context
G8 git log -50 preserve SHA + subject + author, drop body
G9 git status with 30 modified files group by directory
G10 pnpm install fresh final count + warnings; drop resolved packages
G11 pip install -r requirements.txt drop download progress; keep final install list + errors
G12 cargo build success drop compilation progress; keep final target
G13 docker build success drop layer pulls; keep final image digest
G14 tsc --noEmit clean compact to tsc: 0 errors
G15 tsc --noEmit with 3 errors keep all 3 errors with location
G16 eslint . clean compact to eslint: 0 problems
G17 eslint . with violations group by rule; preserve location + fix suggestion
G18 docker logs -f with 1000 repeating lines dedupe with count: [last message repeated 973 times]
G19 kubectl get pods -A group by namespace
G20 ls -la deep tree directory grouping (RTK pattern)
G21 find . -type f 10K files group by extension with counts
G22 grep -r "foo" . with 500 hits cap at 50; suffix [... 450 more matches; use --ripgrep for full]
G23 curl -v https://api.example.com strip verbose headers; keep response body
G24 aws ec2 describe-instances 50 instances columnar summary

P-series: pathological cases (must NOT break the agent)

These turn "nice feature" into "catastrophic regression" if we get any of them wrong.

ID Scenario Required behavior
P1 Binary garbage in output (non-UTF8 bytes) Pass through unchanged; don't crash
P2 ANSI escape explosion (10K+ codes) Strip cleanly, don't choke regex
P3 Empty output ("") Pass through empty; do NOT inject header
P4 Stdout+stderr interleaved Rule matches across both streams
P5 Truncated output (SIGPIPE mid-stream) Don't mis-compact partial output
P6 Failed test, critical stack frame at line 4 of 200 Must NOT filter the frame (the RTK-killer case)
P7 Exit 0 but ERROR: in output Rule must not trust exit code alone
P8 Output contains AWS key / bearer token / password Tee file must NOT be world-readable; redact in compacted output
P9 Single-line minified JS error (40KB one line) Truncate to first 1KB; append [... truncated ...]
P10 Unicode (emoji, RTL, combining chars, CJK) Byte-safe truncation; don't split codepoints
P11 Two rules match same command Deterministic priority: longest match.commands prefix wins; tie → rule ID alphabetical
P12 Rule's compacted output matches another rule's pattern No recursive application; hook runs once per tool call
P13 Command contains embedded newlines in quoted arg Rule doesn't misparse args
P14 Concurrent tool calls (parallel Bash invocations) No shared mutable state in hook; each call is isolated
P15 Hook execution >5s Pass through raw; emit meta.timedOut: true
P16 Haiku API offline/rate-limited Skip verifier silently; use pure rule output
P17 Haiku returns malformed JSON Skip verifier; do NOT feed raw response to agent
P18 Haiku response contains prompt-injection ("Ignore all prior instructions...") Sanitize: only append lines that are substring matches of the original raw output
P19 1M-line output Stream-process, cap memory at 64MB; truncate with clear marker
P20 Rapid-fire: 50 tool calls / sec Hook latency stays <15ms p99
P21 Command with shell redirects (cmd >file 2>&1) Match on the underlying command name, not the redirect wrapper
P22 Deeply nested quotes/escapes in command string Robust arg parser; no shell injection possible
P23 NULL bytes in output Strip safely; don't truncate
P24 Command that exits then writes more to stderr after Hook receives final combined output; handles gracefully
P25 Read-only filesystem / no tee write permission Degrade gracefully; still emit compacted output; record meta.teeFailed: true
P26 User's rule JSON is malformed Skip that rule; emit warning to stderr; don't break hook
P27 Rule references a non-existent primitive field Ignore unknown field; apply rest of rule
P28 Rule regex has catastrophic backtracking RE2-compatible engine (no backtracking) OR per-rule timeout
P29 Exit code 137 (OOM kill) Rule treats same as generic failure; preserves full output
P30 Haiku returns lines NOT present in raw output (hallucination) Drop hallucinated lines; keep only substring matches

CH-series: cross-host E2E

Run each scenario on each supported host. Same input, same expected output. If a host does not support a matcher, the test is marked skip-on-{host} with a comment linking the upstream limitation.

ID Scenario Hosts
CH1 Install hook via gstack compact install <host> Claude Code, Codex, OpenClaw
CH2 Uninstall hook is idempotent All
CH3 Re-install doesn't duplicate entries All
CH4 Hook co-exists with user's other PostToolUse hooks All
CH5 Hook fires on Bash tool All
CH6 Hook fires on Read tool Claude Code (confirmed); Codex/OpenClaw verify-then-require
CH7 Hook fires on Grep tool Same as CH6
CH8 Hook fires on Glob tool Same as CH6
CH9 Hook fires on MCP tool (mcp__* matcher) Claude Code; verify on others
CH10 Config precedence: project > user > built-in All
CH11 GSTACK_RAW=1 env var bypasses hook All
CH12 Rule ID override works (project rule replaces built-in) All
CH13 gstack compact doctor detects drift on each host All
CH14 Hook error does not crash the agent session All

Implementation note: cross-host tests reuse the fixture corpus from the golden/ tree; the harness wraps each fixture in a host-specific hook invocation envelope and asserts the output is byte-identical across hosts (modulo the host field).

V-series: verifier tests (paid)

ID Scenario Expected
V1 Rule reduces 200-line test output to 5 lines, exit=1 Verifier fires (failure + >50% reduction), appends any missing critical lines
V2 Rule reduces 10-line output to 9 lines, exit=1 Verifier does NOT fire (reduction too small)
V3 Rule reduces 200-line output to 5 lines, exit=0 Verifier does NOT fire (success path, default config)
V4 aggressiveReduction trigger enabled, 300 lines → 20 lines, exit=0 Verifier fires
V5 GSTACK_COMPACT_VERIFY=1 env var set Verifier fires once for that call
V6 ANTHROPIC_API_KEY missing Verifier silently skipped; raw rule output returned
V7 Verifier mocked to return "NONE" Output identical to pure-rule path
V8 Verifier mocked to return prompt injection Injection discarded; only substring-matched lines appended
V9 Verifier mocked to time out >5s Skipped; meta.verifierTimedOut: true
V10 Verifier mocked to return 500 error Skipped; rule output returned

R-series: adversarial regression

Every bug caught after v1 ship gets a permanent R-series test. Starts empty; grows with scars. Template:

R{N}: {commit-sha} — {1-line summary}
Scenario: {reproducer}
Fix: {PR link}

Performance budgets (enforced in CI; revised for realistic Bun cold-start)

Metric Target Hard limit
Hook overhead macOS ARM (verifier disabled) <30ms p50 <80ms p99
Hook overhead Linux (verifier disabled) <20ms p50 <60ms p99
Hook overhead (verifier fires) <600ms p50 <2s p99
Bundle deserialize (rules.bundle.json) <2ms <10ms
mtime drift check (stat of source files) <0.5ms <3ms
Single-regex execution budget (per rule) <5ms <50ms (hard abort)
Memory per hook invocation (line-streamed) <16MB typical <64MB max
Total rule-payload size on disk (source files) <5KB <15KB
Compiled bundle size on disk <25KB <80KB

Daemon mode is a v2 optimization. If B-series benchmark on the author's corpus shows cold-start meaningfully hurts session-total savings (e.g., total hook overhead >5% of saved tokens' wall time), promote to v1.1.

B-series real-world benchmark testbench (hard v1 gate)

Why it exists. Every competing compactor ships with hand-picked fixture numbers. B-series proves the compactor works on the user's actual coding sessions before they enable the hook. It's both the ship-gate and the marketing artifact.

Architecture (components in compact/benchmark/src/):

┌──────────────────────────────────────────────────────────────┐
│  1. SCAN     scanner.ts walks ~/.claude/projects/**/*.jsonl  │
│              → pairs tool_use × tool_result blocks           │
│              → emits {tool, command, outputBytes, lineCount, │
│                estimatedTokens, sessionId, timestamp}        │
├──────────────────────────────────────────────────────────────┤
│  2. RANK     sizer.ts sorts corpus by estimatedTokens desc   │
│              → cluster.ts groups by (tool, command-pattern)  │
│              → identifies heavy-tail: which 10% of calls     │
│                produced 80% of the tokens?                   │
├──────────────────────────────────────────────────────────────┤
│  3. SCENARIO scenarios.ts emits fixture files:               │
│              B1_bun_test_heavy.jsonl                         │
│              B2_git_diff_huge.jsonl                          │
│              B3_tsc_errors_production.jsonl                  │
│              B4_pnpm_install_fresh.jsonl ... (one per        │
│              high-leverage cluster, up to ~20 scenarios)     │
├──────────────────────────────────────────────────────────────┤
│  4. REPLAY   replay.ts runs compactor against each scenario, │
│              measures token reduction + diff of dropped lines│
│              → per-rule reduction numbers                    │
│              → per-scenario before/after token counts        │
├──────────────────────────────────────────────────────────────┤
│  5. PATHOLOGY pathology.ts injects planted critical lines    │
│              (line 4 of 200 in a failing test fixture) into  │
│              real B-scenarios. Confirms verifier restores    │
│              them. Real data + real threats = real proof.    │
├──────────────────────────────────────────────────────────────┤
│  6. REPORT   report.ts emits HTML + JSON dashboard to        │
│              ~/.gstack/compact/benchmark/latest/              │
│              "On YOUR 30 days of Claude Code data, gstack    │
│              compact would save X tokens in Y scenarios."    │
└──────────────────────────────────────────────────────────────┘

v1 ship gate (hard):

  • ≥15% total-token reduction across the aggregated scenario corpus on the author's own 30-day transcript set.
  • Zero critical-line loss on planted-bug scenarios (every planted stack frame must survive either the rule or the verifier).
  • No scenario regresses to <5% reduction under the new rules (catch over-compaction edge cases).

Privacy (non-negotiable):

  • Reads ~/.claude/projects/**/*.jsonl locally only. Never uploads. Never shares. Never logs scenarios to telemetry.
  • Output files live under ~/.gstack/compact/benchmark/ with mode 0600.
  • The command prints a confirmation banner: "Scanning local transcripts at ~/.claude/projects/ (local-only; nothing leaves this machine)."
  • Any future community corpus is a separate v2 workstream built from hand-contributed, secret-scanned fixtures on OSS projects.

Ports from analyze_transcripts (TypeScript reimplementation; not a subprocess call):

  • JSONL parsing + tool_use/tool_result pairing pattern (from event_extractor.rb).
  • Token estimate ceil(len/4) (same char-ratio heuristic; sufficient for ranking).
  • Event-type taxonomy (bash_command, file_read, test_run, error_encountered) for scenario clustering.
  • Stress-fixture generation pattern for pathology layering.

What we do NOT port: behavioral scoring, pgvector embeddings, decision-exchange graphs, velocity metrics, the Rails/ActiveRecord layer. Out of scope; not what we're measuring.

Synthetic token-savings evals (E-series, periodic/informational only)

Retained from the original plan but now informational-only because B-series is the real gate.

  • E1: simulated 30-min coding session on a medium TypeScript project. Measure total tokens with/without gstack compact enabled. Target: ≥15% reduction.
  • E2: same session at level=aggressive. Target: ≥25% reduction, zero test-failure increase.
  • E3: same session with verifier on failureCompaction only. Verifier fire rate ≤10% of tool calls.
  • E4: adversarial — inject a planted bug in a test output and confirm the verifier restores the critical stack frame.

Test corpus sourcing

For each rule family, capture 3+ real outputs:

  1. Run the tool against a real project (gstack itself for TS; popular OSS for Rust/Go/Python).
  2. Capture stdout+stderr+exit code into a fixture file with toolVersion: frontmatter (e.g., jest@29.7.0).
  3. Hand-author the expected compacted output once.
  4. Golden file test: rule application must produce byte-identical output.
  5. CI drift warning: if installed tool version differs from fixture's toolVersion:, CI warns (not fails). Drift-warning dashboard is checked pre-release.

Draw from:

  • tokenjuice's fixture directory patterns (tests/fixtures/)
  • RTK's per-command examples (their README lists real before/after metrics; verify independently)
  • gstack's own test output (eat our own dog food)
  • Real failure archives from ~/.gstack/compact/tee/ (once volunteers contribute)
  • B-series real-world scenarios are the primary corpus for reduction measurements.

Pattern adoption table

Concrete patterns borrowed from the competitive landscape:

From Adopt as Why
RTK 4 reduction primitives (filter/group/truncate/dedupe) as JSON rule verbs Table stakes for a serious compactor
RTK gstack compact tee for failure-mode raw save Better than the original onFailure.preserveFull design
RTK gstack compact gain + gstack compact discover Trust + continuous improvement
RTK exclude_commands per-user blocklist Must-have config
tokenjuice JSON envelope contract for hook I/O Clean machine adapter
tokenjuice gstack compact doctor Hooks drift; self-repair matters
caveman Intensity levels (minimal/normal/aggressive) User-tunable safety/savings knob
claude-token-efficient Rules-file size budget (<5KB total) Don't bloat context

Rollout plan

ALL PHASES TABLED pending Anthropic updatedBuiltinToolOutput API. See Status section at the top of this doc. The rollout below is the intended sequence if/when the API ships and this design un-tables.

Un-tabling checklist (do in order when the API arrives)

  1. Confirm the new API's shape. Read the updated Claude Code hooks reference. Capture a real envelope containing the new output-replacement field for Bash, Read, Grep, Glob. Record in docs/designs/GCOMPACTION_envelope.md.
  2. Re-validate the wedge. Does the new API cover Read/Grep/Glob (do they fire PostToolUse now), or just Bash/WebFetch? If Bash-only, wedge (ii) stays dead and the product needs a new pitch before implementation.
  3. Re-run /plan-eng-review against the revised plan with the new API. Most of the 15 locked decisions should carry forward; adjust the Architecture data-flow and any envelope-dependent decisions.
  4. Re-run /codex review against the revised plan. The prior BLOCK verdict's concerns about hook substitution disappear once the API exists; remaining criticals (B-series privacy, regex DoS, JSON-envelope streaming) still apply.
  5. Execute the original rollout below.

Original rollout (preserved for un-tabling)

Each tier blocks on the prior passing all gate-tier tests. Claude-first — Codex and OpenClaw land at v1.1 after the wedge is proven on the primary host.

  1. v0.0 (1 day): rule engine + 4 primitives + line-oriented streaming pipeline + deep-merge + bundle compiler + envelope contract + golden tests for tests/* family only. No host integration yet. Measure savings on offline fixtures.
  2. v0.1 (1 day): Claude Code hook integration + gstack compact install + mtime-based auto-reload. Ship as opt-in; off by default. Ask 10 gstack power users to try it; collect feedback.
  3. v0.5 (1 day): B-series benchmark testbench (compact/benchmark/). Ship gstack compact benchmark so users can measure on their own data. Collect anonymous-from-the-start (nothing uploaded) reduction numbers from dogfooders.
  4. v1.0 (1 day): verifier layer with failureCompaction trigger on by default + exact-line-match sanitization + layered exitCode/pattern fallback + expanded tee redaction set. Hard ship gate: B-series on the author's 30-day local corpus shows ≥15% total reduction AND zero critical-line loss on planted bugs. Publish CHANGELOG entry leading with wedge framing (Claude Code only at v1).
  5. v1.1 (+1 day): Codex + OpenClaw hook integration. Cross-host E2E suite green. Build/lint/log rule families land with gstack compact discover-derived priorities.
  6. v1.2+: expand rule families, community rule contribution workflow, community-corpus benchmark (hand-authored public fixtures, separate from local B-series).

Risk analysis

Risk Severity Mitigation
RTK adds an LLM verifier in response Low Creator is vocal about zero-dependency Rust. Ship first, build the pattern library.
Platform compaction subsumes us (Anthropic Compaction API in Claude Code) Medium We operate at a different layer (per-tool output vs whole-context). Position as complementary.
Rules drop something critical → "compactor made my agent dumb" High B-series real-world benchmark as hard ship gate; tee mode always available; verifier default-on for failures; exact-line-match sanitization.
Haiku cost creep (triggers fire more than expected) Medium E3 eval + B-series fire-rate metric; cost visible in gstack compact gain; per-session rate cap in v1.1 if rate >10%.
Rule maintenance debt (jest/vitest output formats change) Medium toolVersion: fixture frontmatter + CI drift warning; community rule PRs; discover flags bypassing commands.
Rules file bloats context Low CI-enforced <5KB source + <25KB compiled bundle budget; per-rule size warning at schema-validation.
Regex DoS blocks the agent Medium 50ms AbortSignal budget per rule; timeout logged to meta.regexTimedOut; stale rules quarantined on repeated failure.
Bundle staleness silently breaks user edits Low mtime-check on every hook invocation auto-rebuilds; gstack compact reload is a backup not a requirement.
Benchmark leaks user's private data High Local-only by construction: no network call, mode-0600 output, explicit banner at runtime. Privacy review before v1 ship.

Open questions

  1. Does Codex's PostToolUse hook support matchers for Read/Grep/Glob? (Deferred to v1.1 — Claude-first at v1.)
  2. Does OpenClaw's hook API support PostToolUse specifically? (Deferred to v1.1.)
  3. Should the verifier model be pinned, or version-tracked like gstack's other AI calls? (Inclined to pin claude-haiku-4-5-20251001 and bump explicitly in CHANGELOG.)
  4. Built-in secret-redaction regex set for tee files (resolved: expanded set — AWS/GitHub/GitLab/Slack/JWT/bearer/SSH-private-key. See decision #10.)
  5. Should gstack compact discover propose auto-generated rules via Haiku? (Deferred to v2; skill-creep risk.)
  6. New: Does Claude Code's PostToolUse envelope include exitCode? (Still needs empirical verification per pre-implementation task #1; system now has a layered fallback regardless.)
  7. New: What's the right scenario-count cap for B-series? Cluster.ts can produce 5-50 scenarios depending on heavy-tail shape. Plan: cap at top 20 clusters by aggregate output volume.

Pre-implementation assignment (must complete before coding)

  1. Verify Claude Code's PostToolUse envelope contents empirically. Ship a no-op hook; confirm exitCode, command, argv, combinedText are all present. This is the pivot for wedge (ii) native-tool coverage AND for the failureCompaction trigger. Output: docs/designs/GCOMPACTION_envelope.md with real captured envelopes for Bash + Read + Grep + Glob.
  2. Read RTK's rule definitions (ARCHITECTURE.md, src/rules/) and write a 1-paragraph summary of which of the 4 primitives they handle best. Inform our v1 rule set. This is the Search Before Building layer.
  3. Port analyze_transcripts JSONL parser to TypeScript. compact/benchmark/src/scanner.ts. Write a quick-look output that lists the top-50 noisiest tool calls on the author's ~/.claude/projects/. Confirms the testbench premise before we build the replay loop. This is the B-series foundation.
  4. Write the CHANGELOG entry FIRST. Target sentence: "Every tool in your agent's toolbox on Claude Code now produces less noise — test runners, git diffs, package installs — with an intelligent Haiku safety net that restores critical stack frames when our rules over-compact, and a local benchmark that proves the savings on your actual 30 days of coding sessions. Codex + OpenClaw land in v1.1." If we cannot write that sentence honestly, the wedge isn't there yet.
  5. Ship a rule-only v0 (no Haiku verifier, no benchmark). Measure real token savings with current gstack evals + early B-series prototype. If <10% on local corpus, the whole premise is weaker than claimed — iterate the rules before adding the verifier on top.

License & attribution

gstack ships under MIT. To keep the license clean for downstream users, this project follows a strict clean-room policy for everything borrowed from the competitive landscape:

  • Every project referenced above is permissive-licensed (MIT or Apache-2.0). No AGPL, GPL, SSPL, or other copyleft exposure.
    • RTK (rtk-ai/rtk): Apache-2.0 — MIT-compatible; Apache patent grant is a bonus for us.
    • tokenjuice, caveman, claude-token-efficient, token-optimizer-mcp, sst/opencode: MIT.
  • Patterns, not code. We read these projects to understand what they solved and why. We implement independently in TypeScript inside compact/src/. We do not copy source files, translate source files line-for-line, or lift test fixtures verbatim.
  • Attribution. Where a pattern is directly borrowed (the 4 primitives from RTK, the JSON envelope from tokenjuice, intensity levels from caveman, rules-file size budget from claude-token-efficient), we credit the source inline in comments and in the "Pattern adoption table" above. The project's README and NOTICE file (if we add one) list the inspirations.
  • Fixture sourcing. Golden-file fixtures come from running real tools against real projects — they are our own captures, not imported from RTK or tokenjuice. This keeps the test corpus free of license-tangled content.
  • Forbidden sources. Before adding any new reference project, run gh api repos/OWNER/REPO --jq '.license' and verify the license key is one of: mit, apache-2.0, bsd-2-clause, bsd-3-clause, isc, cc0-1.0, unlicense. If the project has no license field, treat it as "all rights reserved" and do not draw from it. Reject agpl-3.0, gpl-*, sspl-*, and any custom or source-available license.

CI enforcement: a scripts/check-references.ts script parses docs/designs/GCOMPACTION.md for GitHub URLs and re-runs the license check, failing if any referenced project's license moves off the allowlist.

References

End of GCOMPACTION.md canonical section. On plan approval, everything above is copied verbatim to docs/designs/GCOMPACTION.md as a tabled design artifact. No code is written; no hook is installed; no CHANGELOG entry is added. The doc exists so a future sprint can unblock quickly when Anthropic ships the built-in-tool output-replace API.