From cc42f14a589e173d64d93ece20b73155a6b0df2d Mon Sep 17 00:00:00 2001
From: Garry Tan
Date: Thu, 16 Apr 2026 15:04:26 -0700
Subject: [PATCH] docs: gstack compact design doc (tabled pending Anthropic API) (#1027)

Preserves the full architecture, 15 locked eng-review decisions, B-series
benchmark spec, codex review findings, and research that confirmed Claude
Code's PostToolUse cannot replace non-MCP tool output today. Tracks
anthropics/claude-code#36843 for the unblocking API.

Co-authored-by: Claude Opus 4.7
---
 docs/designs/GCOMPACTION.md | 831 ++++++++++++++++++++++++++++++++++++
 1 file changed, 831 insertions(+)
 create mode 100644 docs/designs/GCOMPACTION.md

diff --git a/docs/designs/GCOMPACTION.md b/docs/designs/GCOMPACTION.md
new file mode 100644
index 00000000..3937eccf
--- /dev/null
+++ b/docs/designs/GCOMPACTION.md
@@ -0,0 +1,831 @@
# GCOMPACTION.md — Design & Architecture (TABLED)

**Target path on approval:** `docs/designs/GCOMPACTION.md`

This is the preserved design artifact for `gstack compact`. Everything above the first `---` divider below gets extracted verbatim to `docs/designs/GCOMPACTION.md` on plan approval. Everything after that divider is archived research (office hours + competitive deep-dive + eng-review notes + codex review + research findings) that informed the design.

---

## Status: TABLED (2026-04-17) — pending Anthropic `updatedBuiltinToolOutput` API

**Why tabled.** The v1 architecture assumed a Claude Code `PostToolUse` hook could REPLACE the tool output that enters the model's context for built-in tools (Bash, Read, Grep, Glob, WebFetch). Research on 2026-04-17 confirmed this is not possible today.

**Evidence:**

1. 
**Official docs** (https://code.claude.com/docs/en/hooks): The only output-replace field documented for `PostToolUse` is `hookSpecificOutput.updatedMCPToolOutput`, and the docs explicitly state: *"For MCP tools only: replaces the tool's output with the provided value."* No equivalent field exists for built-in tools. +2. **Anthropic issue [#36843](https://github.com/anthropics/claude-code/issues/36843)** (OPEN): Anthropic themselves acknowledge the gap. *"PostToolUse hooks can replace MCP tool output via `updatedMCPToolOutput`, but there is no equivalent for built-in tools (WebFetch, WebSearch, Bash, Read, etc.)... They can only add warnings via `decision: block` (which injects a reason string) or `additionalContext`. The original malicious content still reaches the model."* +3. **RTK mechanism** (source-reviewed at `src/hooks/init.rs:906-912` and `hooks/claude/rtk-rewrite.sh:83-100`): RTK is NOT a PostToolUse compactor. It's a **PreToolUse** Bash matcher that rewrites `tool_input.command` (e.g., `git status` → `rtk git status`). The wrapped command produces compact stdout itself. RTK README confirms: *"the hook only runs on Bash tool calls. Claude Code built-in tools like Read, Grep, and Glob do not pass through the Bash hook, so they are not auto-rewritten."* RTK is Bash-only by architectural constraint, not by choice. +4. **tokenjuice mechanism** (source-reviewed at `src/core/claude-code.ts:160, 491, 540-549`): tokenjuice DOES register `PostToolUse` with `matcher: "Bash"` but has no real output-replace API available — it hijacks `decision: "block"` + `reason` to inject compacted text. Whether this actually reduces model-context tokens or just overlays UI output is disputed. tokenjuice is also Bash-only. +5. **Read/Grep/Glob execute in-process inside Claude Code** and bypass hooks entirely. Wedge (ii) "native-tool coverage" was architecturally impossible from day one regardless of replacement API. 
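
To make evidence item 3 concrete: a minimal TypeScript sketch of the `PreToolUse` command-rewrite pattern, assuming a simple prefix table. The table contents and function name are illustrative, not RTK's actual source — only the `git status` → `rtk git status` example comes from the evidence above.

```typescript
// PreToolUse-style rewrite: swap a known command prefix for a wrapper binary
// that emits compact stdout itself. The model then only ever sees the
// wrapper's output — no output-replacement API needed, but Bash-only.
const REWRITES: Record<string, string> = {
  "git status": "rtk git status", // example quoted in evidence item 3
  "git log": "rtk git log",       // illustrative assumption
};

function rewriteBashCommand(command: string): string {
  for (const [prefix, replacement] of Object.entries(REWRITES)) {
    if (command === prefix || command.startsWith(prefix + " ")) {
      return replacement + command.slice(prefix.length);
    }
  }
  return command; // unmatched commands pass through untouched
}
```

The limitation is visible in the sketch itself: the hook only ever sees `tool_input.command` for Bash calls, so Read/Grep/Glob are unreachable by construction.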
+ +**Consequence.** Both wedges are dead in their original form: +- Wedge (i) "Conditional LLM verifier" — still technically possible, but only for Bash output, via PreToolUse command wrapping (RTK's mechanism). The verifier stops being a differentiator once we're also Bash-only. +- Wedge (ii) "Native-tool coverage" — impossible today. Read/Grep/Glob don't fire hooks. Even if they did, no output-replace field exists. + +**Decision.** Shelve `gstack compact` entirely. Track Anthropic issue #36843 for the arrival of `updatedBuiltinToolOutput` (or equivalent). When that API ships, this design doc + the 15 locked decisions below + the research archive at the bottom become the unblocking artifacts for a fresh implementation sprint. + +**If un-tabling:** Start from the "Decisions locked during plan-eng-review" block below — most remain valid. Then re-verify the hooks reference against the newly-shipped API, update the Architecture data-flow diagram to use whatever real output-replacement field exists, and re-run `/codex review` against the revised plan before coding. + +**What we're NOT doing:** +- Not shipping a Bash-only PreToolUse wrapper. That's RTK's product; they're at 28K stars and 3 years of rule scars. No wedge. +- Not shipping the `decision: block` + `reason` hack. Undocumented behavior, Anthropic could break it, and the model may still see the raw output alongside the compacted overlay — context savings are disputed. +- Not shipping B-series benchmark in isolation. Without a working compactor, there's nothing to benchmark. + +**Cost of tabling:** ~0. No code was written. The design doc + research + decisions remain as a ready-to-unblock artifact. + +--- + +## Decisions locked during plan-eng-review (2026-04-17) + +Preserved for the un-tabling sprint if/when Anthropic ships the built-in-tool output-replace API. + +Summary of every decision made during the engineering review. 
Full rationale is preserved throughout the sections below; this block is the single source of truth if anything else drifts. + +**Scope (Section 0):** +1. **Claude-first v1.** Ship compact + rules + verifier on Claude Code only. Codex + OpenClaw land at v1.1 after the wedge is proven on the primary host. Cuts ~2 days of host integration and derisks launch. The original "wedge (ii) native-tool coverage" claim applies to Claude Code at v1; we make no cross-host claim until v1.1. +2. **13-rule launch library.** v1 ships tests (jest/vitest/pytest/cargo-test/go-test/rspec) + git (diff/log/status) + install (npm/pnpm/pip/cargo). Build/lint/log families defer to v1.1, driven by `gstack compact discover` telemetry from real users. +3. **Verifier default ON at v1.0.** `failureCompaction` trigger (exit≠0 AND >50% reduction) is enabled out of the box. The verifier IS the wedge — defaulting it off hides the differentiating feature. Trigger bounds already keep expected fire rate ≤10% of tool calls. + +**Architecture (Section 1):** +4. **Exact line-match sanitization for Haiku output.** Split raw output by `\n`, put lines in a set, only append lines from Haiku that appear verbatim in that set. Tightest adversarial contract; prompt-injection attempts cannot slip in novel text. +5. **Layered failureCompaction signal.** Prefer `exitCode` from the envelope; if the host omits it, fall back to `/FAIL|Error|Traceback|panic/` regex on the output. Log which signal fired in `meta.failureSignal` ("exit" | "pattern" | "none"). Pre-implementation task #1 still verifies Claude Code's envelope empirically, but the system no longer breaks if it doesn't. +6. **Deep-merge rule resolution.** User/project rules inherit built-in fields they don't override. Escape hatch: `"extends": null` in a rule file triggers full replacement semantics. Matches the mental model of eslint/tsconfig/.gitignore — override a piece without losing the rest. + +**Code quality (Section 2):** +7. 
**Per-rule regex timeout, no RE2 dep.** Run each rule's regex via a 50ms AbortSignal budget; on timeout, skip the rule and record `meta.regexTimedOut: [ruleId]`. Avoids a WASM dependency and keeps rule-author syntax unconstrained. +8. **Pre-compiled rule bundle.** `gstack compact install` and `gstack compact reload` produce `~/.gstack/compact/rules.bundle.json` (deep-merged, regex-compiled metadata cached). Hook reads that single file instead of parsing N source files. +9. **Auto-reload on mtime drift.** Hook stats rule source files on startup; if any source file is newer than the bundle, rebuild in-line before applying. Adds ~0.5ms/invocation but eliminates the "I edited a rule and nothing changed" footgun. +10. **Expanded v1 redaction set.** Tee files redact: AWS keys, GitHub tokens (`ghp_/gho_/ghs_/ghu_`), GitLab tokens (`glpat-`), Slack webhooks, generic JWT (three base64 segments), generic bearer tokens, SSH private-key headers (`-----BEGIN * PRIVATE KEY-----`). Credit cards / SSNs / per-key env-pairs deferred to a full DLP layer in v2. + +**Testing (Section 3):** +11. **P-series gate subset.** v1 gate-tier P-tests: P1 (binary garbage), P3 (empty output), P6 (RTK-killer critical stack frame), P8 (secrets to tee), P15 (hook timeout), P18 (prompt injection), P26 (malformed user rule JSON), P28 (regex DoS), P30 (Haiku hallucination). Remaining 21 P-cases grow R-series as real bugs hit. +12. **Fixture version-stamping.** Every golden fixture has a `toolVersion:` frontmatter. CI warns when fixture toolVersion ≠ currently installed. No more calendar-based rotation. +13. **B-series real-world benchmark testbench (hard v1 gate).** New component `compact/benchmark/` scans `~/.claude/projects/**/*.jsonl`, ranks the noisiest tool calls, clusters them into named scenarios, replays the compactor against them, and reports reduction-by-rule-family. 
v1 cannot ship until B-series on the author's own 30-day corpus shows ≥15% reduction AND zero critical-line loss on planted bugs. Local-only; never uploads. Community-shared corpus is v2. + +**Performance (Section 4):** +14. **Revised latency budgets.** Bun cold-start on macOS ARM is 15-25ms; the original 10ms p50 target was unrealistic. New budgets: <30ms p50 / <80ms p99 on macOS ARM, <20ms p50 / <60ms p99 on Linux (verifier off). Verifier-fires budget stays <600ms p50 / <2s p99. Daemon mode is a v2 option gated on B-series showing cold-start hurts session savings. +15. **Line-oriented streaming pipeline.** Readline over stdin → filter → group → dedupe → ring-buffered tail truncation → stdout. Any single line >1MB hits P9 (truncate to 1KB with `[... truncated ...]` marker). Caps memory at 64MB regardless of total output size. + +Every row above is a `MUST` in the implementation. Drift requires a new eng-review. + +--- + +## Summary + +`gstack compact` was designed as a `PostToolUse` hook that reduces tool-output noise before it reaches an AI coding agent's context window. Deterministic JSON rules would shrink noisy test runners, build logs, git diffs, and package installs. A conditional Claude Haiku verifier would act as a safety net when over-compaction risk was high. + +**Current status: TABLED.** See "Status" section above. The architecture depends on a Claude Code API (`updatedBuiltinToolOutput` or equivalent for built-in tools) that does not exist as of 2026-04-17. Anthropic issue #36843 tracks the gap. + +**Intended goal (preserved for the un-tabling sprint):** 15–30% tool-output token reduction per long session, with zero increase in task-failure rate. + +**Original wedge (vs RTK, the 28K-star incumbent) — both invalidated by research:** +1. ~~**Conditional LLM verifier.**~~ Still technically viable via PreToolUse command wrapping, but only for Bash. Stops being a differentiator once we're Bash-only. Reconsider if the built-in-tool API arrives. +2. 
~~**Native-tool coverage.**~~ Architecturally impossible today. Read/Grep/Glob execute in-process inside Claude Code and do not fire hooks. Even for tools that do fire `PostToolUse`, no output-replacement field exists for non-MCP tools. + +**Original positioning (now moot):** *"RTK is fast. gstack compact is fast AND safe, and it covers every tool in your toolbox, not just Bash."* + +## Non-goals + +- Summarizing user messages or prior agent turns (Claude's own Compaction API owns that). +- Compressing agent response output (caveman's layer). +- Caching tool calls to avoid re-execution (token-optimizer-mcp's layer). +- Acting as a general-purpose log analyzer. +- Replacing the agent's own judgement about when to re-run a command with `GSTACK_RAW=1`. + +## Why this is worth building + +**Problem is measured, not hypothetical.** + +- [Chroma research (2025)](https://research.trychroma.com/context-rot) tested 18 frontier models. Every model degrades as context grows. Rot starts well before the window limit — a 200K model rots at 50K. +- Coding agents are the worst case: accumulative context + high distractor density + long task horizon. Tool output is explicitly named as a primary noise source. +- The market has voted: Anthropic shipped Opus 4.6 Compaction API; OpenAI shipped a compaction guide; Google ADK shipped context compression; LangChain shipped autonomous compression; sst/opencode has built-in compaction. The hybrid deterministic + LLM pattern is industry consensus. 
+ +**Existing field (what gstack compact joins and differentiates from):** + +| Project | Stars | License | Layer | Threat | Note | +|---------|-------|---------|-------|--------|------| +| **RTK (rtk-ai/rtk)** | **28K** | Apache-2.0 | Tool output | Primary benchmark | Pure Rust, Bash-only, zero LLM | +| caveman | 34.8K | MIT | Output tokens | Different axis | Terse system prompt; pairs WITH us | +| claude-token-efficient | 4.3K | MIT | Response verbosity | Different axis | Single CLAUDE.md | +| token-optimizer-mcp | 49 | MIT | MCP caching | Different axis | Prevents calls rather than compresses output | +| tokenjuice | ~12 | MIT | Tool output | Too new | 2 days old; inspired our JSON envelope | +| 6-Layer Token Savings Stack | — | Public gist | Recipe | Zero | Documentation; validates stacked compaction thesis | + +RTK is the only direct competitor. Everything else compresses a different token source. + +**License compatibility:** Every referenced project is permissive-licensed (MIT or Apache-2.0) and compatible with gstack's MIT license. No AGPL, GPL, or other copyleft dependencies. See the "License & attribution" section below for the clean-room policy. + +## Architecture + +### Data flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Host (Claude Code / Codex / OpenClaw) │ +│ ───────────────────────────────────────── │ +│ 1. Agent requests tool call: Bash|Read|Grep|Glob|MCP │ +│ 2. Host executes tool │ +│ 3. Host invokes PostToolUse hook with: {tool, input, output} │ +└────────────────────┬────────────────────────────────────────────┘ + │ stdin (JSON envelope) + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ gstack-compact hook binary │ +│ ─────────────────────────── │ +│ a. Parse envelope │ +│ b. Match rule by (tool, command, pattern) │ +│ c. Apply rule primitives: filter / group / truncate / dedupe │ +│ d. Record reduction metadata │ +│ e. Evaluate verifier triggers │ +│ f. 
If trigger met: call Haiku, append preserved lines │
│ g. On failure exit code: tee raw to ~/.gstack/compact/tee/... │
│ h. Emit JSON envelope to stdout │
└────────────────────┬────────────────────────────────────────────┘
 │ stdout (JSON envelope)
 ▼
 Host substitutes compacted output into agent context
```

### Rule resolution

Three-tier hierarchy (highest precedence wins), same pattern as tokenjuice and gstack's existing host-config-export model:

1. Built-in rules: `compact/rules/` shipped with gstack
2. User rules: `~/.config/gstack/compact-rules/`
3. Project rules: `.gstack/compact-rules/`

Rules match tool calls by rule ID. Resolution is a deep-merge (decision 6): a project rule with ID `tests/jest` inherits any built-in `tests/jest` fields it does not override, and `"extends": null` in a rule file opts that rule into full-replacement semantics.

### JSON envelope contract (adopted from tokenjuice)

Input:
```json
{
  "tool": "Bash",
  "command": "bun test test/billing.test.ts",
  "argv": ["bun", "test", "test/billing.test.ts"],
  "combinedText": "...",
  "exitCode": 1,
  "cwd": "/Users/garry/proj",
  "host": "claude-code"
}
```

Output:
```json
{
  "reduced": "compacted output with [gstack-compact: N → M lines, rule: X] header",
  "meta": {
    "rule": "tests/jest",
    "linesBefore": 247,
    "linesAfter": 18,
    "bytesBefore": 18234,
    "bytesAfter": 892,
    "verifierFired": false,
    "teeFile": null,
    "durationMs": 8
  }
}
```

### Rule schema

Compact, minimal. Total rules-payload must stay <5KB on disk (lesson from claude-token-efficient: rule files themselves consume tokens on every session). 
+ +```json +{ + "id": "tests/jest", + "family": "test-results", + "description": "Jest/Vitest output — preserve failures and summary counts", + "match": { + "tools": ["Bash"], + "commands": ["jest", "vitest", "bun test"], + "patterns": ["jest", "vitest", "PASS", "FAIL"] + }, + "primitives": { + "filter": { + "strip": ["\\x1b\\[[0-9;]*m", "^\\s*at .+node_modules"], + "keep": ["FAIL", "PASS", "Error:", "Expected:", "Received:", "✓", "✗", "Tests:"] + }, + "group": { + "by": "error-kind", + "header": "Errors grouped by type:" + }, + "truncate": { + "headLines": 5, + "tailLines": 15, + "onFailure": { "headLines": 20, "tailLines": 30 } + }, + "dedupe": { + "pattern": "^\\s*$", + "format": "[... {count} blank lines ...]" + } + }, + "tee": { + "onExit": "nonzero", + "maxBytes": 1048576 + }, + "counters": [ + { "name": "failed", "pattern": "^FAIL\\s", "flags": "m" }, + { "name": "passed", "pattern": "^PASS\\s", "flags": "m" } + ] +} +``` + +The four primitives — `filter`, `group`, `truncate`, `dedupe` — are lifted directly from RTK's technique taxonomy (the only thing every serious compactor needs to handle). Any rule can combine any subset of the four; omitted primitives are no-ops. + +### Verifier layer (tiered, opt-in) + +The verifier is a cheap Haiku call that fires only under specific triggers. Never on every tool call. + +**Trigger matrix (user-configurable):** + +| Trigger | Default | Condition | +|---------|---------|-----------| +| `failureCompaction` | **ON** | exit code ≠ 0 AND reduction >50% (diagnosis at risk) | +| `aggressiveReduction` | off | reduction >80% AND original >200 lines | +| `largeNoMatch` | off | no rule matched AND output >500 lines | +| `userOptIn` | on (env-gated) | `GSTACK_COMPACT_VERIFY=1` forces verifier for that call | + +Default config ships with `failureCompaction` only — the highest-leverage case (agent is debugging; rule may have filtered the critical stack frame). 
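
The default trigger reduces to a few comparisons. A sketch using the envelope/meta field names from this doc; the function name is illustrative, not the shipped implementation:

```typescript
// failureCompaction: fire the verifier only when the command failed AND the
// rule compacted aggressively enough that the diagnosis may have been lost.
interface TriggerInput {
  exitCode: number | null; // null → decision 5's pattern fallback would apply; omitted in this sketch
  linesBefore: number;
  linesAfter: number;
}

function shouldFireFailureCompaction(t: TriggerInput, minReductionPct = 50): boolean {
  if (t.exitCode === null || t.exitCode === 0) return false; // success path: trust the rule
  const reductionPct = 100 * (1 - t.linesAfter / t.linesBefore);
  return reductionPct > minReductionPct;
}
```

With the documented defaults, a failing `bun test` compacted from 247 to 18 lines (≈93% reduction) fires the verifier; a failing run reduced by only 40% does not.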
**Haiku's job (bounded):**

```
Here is raw output (truncated to first 2000 lines) and a compacted version.
Return any important lines from the raw that are missing from the compacted,
or `NONE` if nothing critical is missing.
```

The verifier never rewrites the compacted output. It only appends missing lines under a header:

```
[gstack-compact: 247 → 18 lines, rule: tests/jest]
[gstack-verify: 2 additional lines preserved by Haiku]
  TypeError: Cannot read property 'foo' of undefined
    at parseConfig (src/config.ts:42:18)
```

**Why Haiku, not Sonnet:** ~1/12th the cost, ~500ms vs ~2s, and the task is simple line-level classification, not reasoning.

**Verifier config (`compact/rules/_verifier.json`):**

```json
{
  "verifier": {
    "enabled": true,
    "model": "claude-haiku-4-5-20251001",
    "maxInputLines": 2000,
    "triggers": {
      "aggressiveReduction": { "enabled": false, "thresholdPct": 80, "minLines": 200 },
      "failureCompaction": { "enabled": true, "minReductionPct": 50 },
      "largeNoMatch": { "enabled": false, "minLines": 500 },
      "userOptIn": { "enabled": true, "envVar": "GSTACK_COMPACT_VERIFY" }
    },
    "fallback": "passthrough"
  }
}
```

**Failure modes (verifier is strictly additive — never breaks the baseline):**

- No `ANTHROPIC_API_KEY` → skip verifier, use pure rule output.
- Haiku call times out (>5s) → skip verifier, use pure rule output.
- Haiku returns malformed JSON → skip, use pure rule output.
- Haiku returns prompt-injection attempt → sanitize: only append lines that appear verbatim as full lines in the original raw output (decision 4's exact-line-match contract).
- Haiku returns hallucinated lines (not present in raw) → drop them.

### Tee mode (adopted from RTK)

On any command with exit code ≠ 0, the full unfiltered output is written to `~/.gstack/compact/tee/{timestamp}_{cmd-slug}.log`. 
The compacted output includes a tee-file pointer:

```
[gstack-compact: 247 → 18 lines, rule: tests/jest, tee: ~/.gstack/compact/tee/20260416-143022_bun-test.log]
```

The agent can read the tee file directly if it needs the full stack trace. This replaces the earlier `onFailure.preserveFull` mechanic with a cleaner design: compacted output always stays small; raw output is always one `cat` away.

**Tee safety:**

- File mode `0600` — not world-readable.
- Built-in secret-regex set redacts AWS keys, bearer tokens, and common credential patterns before write.
- Failed writes (read-only filesystem, permission denied) degrade gracefully: still emit compacted output, record `meta.teeFailed: true`.
- Tee files auto-expire after 7 days (cleanup on hook startup).

### Host integration matrix

| Host | Hook type | Supported matchers | Config path |
|------|-----------|-------------------|-------------|
| Claude Code | `PostToolUse` | Bash, Read, Grep, Glob, Edit, Write, WebFetch, WebSearch, mcp__* | `~/.claude/settings.json` |
| Codex (v1.1) | `PostToolUse` equivalent | Bash (primary); tool subset TBD — empirical verification is a v1.1 prereq | `~/.codex/hooks.json` |
| OpenClaw (v1.1) | Native hook API | Bash + MCP | OpenClaw config |

**v1 is Claude-first.** Wedge (ii) — native-tool coverage — was originally read as confirmed on Claude Code via [the hooks reference](https://code.claude.com/docs/en/hooks); the tabling research above invalidated that reading (Read/Grep/Glob execute in-process and bypass hooks, and no output-replace field exists for built-in tools). Codex and OpenClaw integration ships at v1.1 only after the wedge is proven on the primary host via B-series benchmark data. CHANGELOG for v1 makes the Claude-only scope explicit. 
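
For orientation, a hedged sketch of what hook registration in `~/.claude/settings.json` could look like. The `hooks`/`matcher` nesting follows the Claude Code settings format cited in the matrix; the matcher string, the `gstack compact hook` subcommand, and the 5-second timeout (mirroring P15) are assumptions, not a verified install payload:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash|Read|Grep|Glob|WebFetch|WebSearch",
        "hooks": [
          { "type": "command", "command": "gstack compact hook", "timeout": 5 }
        ]
      }
    ]
  }
}
```

`gstack compact install` would merge this block non-destructively, and `doctor` would detect drift from it.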
+ +### Config surface + +User config (`~/.config/gstack/compact.toml`): + +```toml +[compact] +enabled = true +level = "normal" # minimal | normal | aggressive (caveman pattern) +exclude_commands = ["curl", "playwright"] # RTK pattern + +[compact.bundle] +auto_reload_on_mtime_drift = true # hook rebuilds bundle if source rule files are newer +bundle_path = "~/.gstack/compact/rules.bundle.json" + +[compact.regex] +per_rule_timeout_ms = 50 # AbortSignal budget per regex; timeout → skip rule + +[compact.verifier] +enabled = true +trigger_failure_compaction = true +trigger_aggressive_reduction = false +trigger_large_no_match = false +failure_signal_fallback = true # use /FAIL|Error|Traceback|panic/ when exitCode missing +sanitization = "exact-line-match" # only append lines present verbatim in raw output + +[compact.tee] +on_exit = "nonzero" +max_bytes = 1048576 +redact_patterns = ["aws", "github", "gitlab", "slack", "jwt", "bearer", "ssh-private-key"] +cleanup_days = 7 + +[compact.benchmark] +local_only = true # hard-coded; config is documentary, cannot be changed +transcript_root = "~/.claude/projects" +output_dir = "~/.gstack/compact/benchmark" +scenario_cap = 20 # top-N clusters by aggregate output volume +``` + +**Intensity levels (caveman pattern):** + +- **minimal:** only `filter` + `dedupe`; no truncation. Safest. +- **normal:** `filter` + `dedupe` + `truncate`. Default. +- **aggressive:** adds `group`; more savings, more edge-case risk. 
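
The level → primitive mapping above, as a sketch (type and function names are illustrative):

```typescript
// A primitive runs only if the level allows it AND the matched rule defines it.
type Primitive = "filter" | "group" | "truncate" | "dedupe";
type Level = "minimal" | "normal" | "aggressive";

const LEVEL_PRIMITIVES: Record<Level, Primitive[]> = {
  minimal: ["filter", "dedupe"],                         // no truncation; safest
  normal: ["filter", "dedupe", "truncate"],              // default
  aggressive: ["filter", "dedupe", "truncate", "group"], // max savings, more edge-case risk
};

function enabledPrimitives(level: Level, rule: Partial<Record<Primitive, unknown>>): Primitive[] {
  return LEVEL_PRIMITIVES[level].filter((p) => rule[p] !== undefined);
}
```

This keeps level semantics in one table rather than scattering `if (level !== "minimal")` checks through the primitive implementations.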

### CLI surface

| Command | Purpose | Source |
|---------|---------|--------|
| `gstack compact install <host>` | Register PostToolUse hook in host config; builds `rules.bundle.json` | new |
| `gstack compact uninstall <host>` | Idempotent removal | new |
| `gstack compact reload` | Rebuild `rules.bundle.json` after editing user/project rules | new |
| `gstack compact doctor` | Detect drift / broken hook config, offer to repair | tokenjuice |
| `gstack compact gain` | Show token/dollar savings over time (per-rule breakdown) | RTK |
| `gstack compact discover` | Find commands with no matching rule, ranked by noise volume | RTK |
| `gstack compact verify <fixture>` | Dry-run verifier on a fixture | new |
| `gstack compact list-rules` | Show effective rule set after deep-merge (built-in + user + project) | new |
| `gstack compact test <rule> <fixture>` | Apply a rule to a fixture and show the diff | new |
| `gstack compact benchmark` | Run B-series testbench against local transcript corpus (see Benchmark section) | new |

Escape hatch: `GSTACK_RAW=1` env var bypasses the hook entirely for the duration of a command (same pattern as tokenjuice's `--raw` flag). The hook also auto-reloads the bundle if any source rule file's mtime is newer than the bundle file. 
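
A minimal sketch of the mtime-drift auto-reload check (decision 9), assuming Node/Bun `fs` APIs; the function names are illustrative:

```typescript
import { existsSync, statSync } from "node:fs";

// Pure core of the staleness check: any source newer than the bundle → rebuild.
function isStale(bundleMtimeMs: number | null, sourceMtimesMs: number[]): boolean {
  if (bundleMtimeMs === null) return true; // bundle never built
  return sourceMtimesMs.some((m) => m > bundleMtimeMs);
}

// fs wrapper the hook would run on startup, before applying any rules.
function bundleIsStale(bundlePath: string, sourcePaths: string[]): boolean {
  const bundle = existsSync(bundlePath) ? statSync(bundlePath).mtimeMs : null;
  const sources = sourcePaths.filter((p) => existsSync(p)).map((p) => statSync(p).mtimeMs);
  return isStale(bundle, sources);
}
```

Splitting the pure comparison from the `fs` wrapper keeps the ~0.5ms/invocation budget testable without touching the filesystem.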
+ +## File layout + +``` +compact/ +├── SKILL.md.tmpl # template; regen via `bun run gen:skill-docs` +├── src/ +│ ├── hook.ts # entry point; reads stdin, writes stdout; mtime-checks bundle +│ ├── engine.ts # rule matching + reduction metadata +│ ├── apply.ts # primitive application (line-oriented streaming pipeline) +│ ├── merge.ts # deep-merge of built-in/user/project rules; honors `extends: null` +│ ├── bundle.ts # compile source rules → rules.bundle.json (install/reload) +│ ├── primitives/ +│ │ ├── filter.ts +│ │ ├── group.ts +│ │ ├── truncate.ts # ring-buffered tail; safe for arbitrary input size +│ │ └── dedupe.ts +│ ├── regex-sandbox.ts # AbortSignal-bounded regex execution (50ms budget per rule) +│ ├── verifier.ts # Haiku integration (triggers + failure-signal fallback + sanitization) +│ ├── sanitize.ts # exact-line-match filter for verifier output +│ ├── tee.ts # raw-output archival with secret redaction + 7-day cleanup +│ ├── redact.ts # secret-pattern set (AWS/GitHub/GitLab/Slack/JWT/bearer/SSH) +│ ├── envelope.ts # JSON I/O contract parsing + validation +│ ├── doctor.ts # hook drift detection + repair +│ ├── analytics.ts # gain + discover queries against local metadata +│ └── cli.ts # argv dispatch; one thin dispatch per subcommand +├── benchmark/ # B-series testbench (hard v1 gate) +│ └── src/ +│ ├── scanner.ts # walk ~/.claude/projects/**/*.jsonl; pair tool_use × tool_result +│ ├── sizer.ts # tokens per call (ceil(len/4) heuristic); rank heavy tail +│ ├── cluster.ts # group high-leverage calls by (tool, command pattern) +│ ├── scenarios.ts # emit B1-Bn real-world scenario fixtures +│ ├── replay.ts # run compactor against scenarios; measure reduction +│ ├── pathology.ts # layer planted-bug P-cases on top of real scenarios +│ └── report.ts # dashboard: per-scenario before/after + overall reduction +├── rules/ # v1 built-in JSON rule library (13 rules) +│ ├── tests/ +│ │ ├── jest.json +│ │ ├── vitest.json +│ │ ├── pytest.json +│ │ ├── cargo-test.json +│ │ 
├── go-test.json +│ │ └── rspec.json +│ ├── install/ +│ │ ├── npm.json +│ │ ├── pnpm.json +│ │ ├── pip.json +│ │ └── cargo.json +│ ├── git/ +│ │ ├── diff.json +│ │ ├── log.json +│ │ └── status.json +│ ├── _verifier.json # verifier config (not a rule per se) +│ └── _HOLD/ # v1.1 rule families (not shipped at v1; kept for reference) +│ ├── build/ +│ ├── lint/ +│ └── log/ +└── test/ + ├── unit/ + ├── golden/ + ├── fuzz/ # P-series — v1 gate subset only (P1/P3/P6/P8/P15/P18/P26/P28/P30) + ├── cross-host/ # v1: claude-code.test.ts only; codex/openclaw stub files + ├── adversarial/ # R-series — grows with shipped bugs + ├── benchmark/ # B-series scenario fixtures + expected reduction ranges + ├── fixtures/ # version-stamped golden inputs (toolVersion: frontmatter) + └── evals/ +``` + +## Testing Strategy + +The test plan is comprehensive by design. Shipping into a space where the 28K-star incumbent has three years of regex battle-scars, with our wedges (Haiku verifier + native-tool coverage) introducing new failure surfaces, means we get ONE shot at "the compactor made my agent dumb" going viral. Zero appetite for that. 
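
Given the P18/P30 gate cases, the exact-line-match sanitization contract (decision 4) is worth pinning down in code. A minimal sketch, assuming line-granular matching; the function name is illustrative:

```typescript
// The verifier may only surface lines that already exist verbatim in the raw
// output. Injected instructions or hallucinated frames are, by definition,
// not verbatim members of the raw line set — they are silently dropped.
function sanitizeVerifierLines(raw: string, haikuLines: string[]): string[] {
  const rawLines = new Set(raw.split("\n"));
  return haikuLines.filter((line) => rawLines.has(line));
}
```

P18 and P30 then become two assertions against this one function: injected text is dropped, verbatim stack frames pass.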
+ +### Test tiers + +| Tier | Cost | Frequency | Blocks merge | +|------|------|-----------|--------------| +| Unit | free, <1s | every PR | yes | +| Golden file (with `toolVersion:` frontmatter) | free, <1s | every PR | yes | +| Rule schema validation | free, <1s | every PR | yes | +| Fuzz (P-series gate subset: P1/P3/P6/P8/P15/P18/P26/P28/P30) | free, <10s | every PR | yes | +| Cross-host E2E — Claude Code only at v1 | free, ~1min | every PR (gate tier) | yes | +| E2E with verifier (mocked Haiku) | free, ~15s | every PR | yes | +| E2E with verifier (real Haiku) | paid, ~$0.10/run | PR touching verifier files | yes | +| **B-series benchmark (real-world scenarios)** | **free, ~2min** | **pre-release gate** | **yes (hard gate for v1)** | +| Token-savings eval (E1-E4 synthetic) | paid, ~$4/run | periodic weekly | no (informational) | +| Adversarial regression (R-series) | free, <5s | every PR | yes | +| Tool-version drift warning | free, <1s | every PR | warning only | + +Test file layout: + +``` +compact/test/ +├── unit/ +│ ├── engine.test.ts # rule matching + primitive application +│ ├── primitives.test.ts # filter / group / truncate / dedupe +│ ├── envelope.test.ts # JSON input/output contract +│ ├── triggers.test.ts # verifier trigger evaluation +│ └── verifier.test.ts # Haiku call (mocked) +├── golden/ +│ ├── tests/ # one fixture per test runner +│ │ ├── jest-success.input.txt +│ │ ├── jest-success.expected.txt +│ │ ├── jest-fail.input.txt +│ │ ├── jest-fail.expected.txt +│ │ └── ... 
(vitest, pytest, cargo-test, go-test, rspec) +│ ├── install/ +│ ├── git/ +│ ├── build/ +│ ├── lint/ +│ └── log/ +├── fuzz/ +│ └── pathological.test.ts # P-series +├── cross-host/ +│ ├── claude-code.test.ts +│ ├── codex.test.ts +│ └── openclaw.test.ts +├── adversarial/ +│ └── regression.test.ts # R-series; past bugs that must never recur +├── fixtures/ +│ └── {tool}/ # shared raw output fixtures +└── evals/ + └── token-savings.eval.ts # periodic-tier; measures real reduction +``` + +### G-series: good cases (must produce expected reduction) + +| ID | Scenario | Expected reduction | +|----|----------|-------------------| +| G1 | `jest` 47 passing tests, clean run | 150+ lines → ≤10 lines | +| G2 | `jest` 47 tests with 2 failures | 200+ lines → keep both failures + summary | +| G3 | `vitest` run with `--reporter=verbose` | 300+ lines → ≤15 lines | +| G4 | `pytest` collection then run | preserve failure tracebacks | +| G5 | `cargo test` with one panic | panic location preserved verbatim | +| G6 | `go test -v` with 200 subtests passing | collapse to `PASS: 200 subtests` | +| G7 | `git diff` on a file with 2 hunks in 500 lines of context | keep hunks, drop context | +| G8 | `git log -50` | preserve SHA + subject + author, drop body | +| G9 | `git status` with 30 modified files | group by directory | +| G10 | `pnpm install` fresh | final count + warnings; drop resolved packages | +| G11 | `pip install -r requirements.txt` | drop download progress; keep final install list + errors | +| G12 | `cargo build` success | drop compilation progress; keep final target | +| G13 | `docker build` success | drop layer pulls; keep final image digest | +| G14 | `tsc --noEmit` clean | compact to `tsc: 0 errors` | +| G15 | `tsc --noEmit` with 3 errors | keep all 3 errors with location | +| G16 | `eslint .` clean | compact to `eslint: 0 problems` | +| G17 | `eslint .` with violations | group by rule; preserve location + fix suggestion | +| G18 | `docker logs -f` with 1000 repeating lines | 
dedupe with count: `[last message repeated 973 times]` | +| G19 | `kubectl get pods -A` | group by namespace | +| G20 | `ls -la` deep tree | directory grouping (RTK pattern) | +| G21 | `find . -type f` 10K files | group by extension with counts | +| G22 | `grep -r "foo" .` with 500 hits | cap at 50; suffix `[... 450 more matches; use --ripgrep for full]` | +| G23 | `curl -v https://api.example.com` | strip verbose headers; keep response body | +| G24 | `aws ec2 describe-instances` 50 instances | columnar summary | + +### P-series: pathological cases (must NOT break the agent) + +These turn "nice feature" into "catastrophic regression" if we get any of them wrong. + +| ID | Scenario | Required behavior | +|----|----------|-------------------| +| P1 | Binary garbage in output (non-UTF8 bytes) | Pass through unchanged; don't crash | +| P2 | ANSI escape explosion (10K+ codes) | Strip cleanly, don't choke regex | +| P3 | Empty output (`""`) | Pass through empty; do NOT inject header | +| P4 | Stdout+stderr interleaved | Rule matches across both streams | +| P5 | Truncated output (SIGPIPE mid-stream) | Don't mis-compact partial output | +| P6 | **Failed test, critical stack frame at line 4 of 200** | Must NOT filter the frame (the RTK-killer case) | +| P7 | Exit 0 but `ERROR:` in output | Rule must not trust exit code alone | +| P8 | Output contains AWS key / bearer token / password | Tee file must NOT be world-readable; redact in compacted output | +| P9 | Single-line minified JS error (40KB one line) | Truncate to first 1KB; append `[... 
truncated ...]` |
+| P10 | Unicode (emoji, RTL, combining chars, CJK) | Byte-safe truncation; don't split codepoints |
+| P11 | Two rules match same command | Deterministic priority: longest `match.commands` prefix wins; tie → rule ID alphabetical |
+| P12 | Rule's compacted output matches another rule's pattern | No recursive application; hook runs once per tool call |
+| P13 | Command contains embedded newlines in quoted arg | Rule doesn't misparse args |
+| P14 | Concurrent tool calls (parallel Bash invocations) | No shared mutable state in hook; each call is isolated |
+| P15 | Hook execution >5s | Pass through raw; emit `meta.timedOut: true` |
+| P16 | Haiku API offline/rate-limited | Skip verifier silently; use pure rule output |
+| P17 | Haiku returns malformed JSON | Skip verifier; do NOT feed raw response to agent |
+| P18 | Haiku response contains prompt-injection (`"Ignore all prior instructions..."`) | Sanitize: only append lines that are substring matches of the original raw output |
+| P19 | 1M-line output | Stream-process, cap memory at 64MB; truncate with clear marker |
+| P20 | Rapid-fire: 50 tool calls / sec | Hook latency stays within the p99 budgets (<60ms Linux / <80ms macOS; see performance budgets) |
+| P21 | Command with shell redirects (`cmd >file 2>&1`) | Match on the underlying command name, not the redirect wrapper |
+| P22 | Deeply nested quotes/escapes in command string | Robust arg parser; no shell injection possible |
+| P23 | NULL bytes in output | Strip NUL bytes safely; don't truncate at the first NUL |
+| P24 | Command that exits then writes more to stderr after | Hook receives final combined output; handles gracefully |
+| P25 | Read-only filesystem / no tee write permission | Degrade gracefully; still emit compacted output; record `meta.teeFailed: true` |
+| P26 | User's rule JSON is malformed | Skip that rule; emit warning to stderr; don't break hook |
+| P27 | Rule references a non-existent primitive field | Ignore unknown field; apply rest of rule |
+| P28 | Rule regex has catastrophic backtracking | RE2-compatible 
engine (no backtracking) OR per-rule timeout |
+| P29 | Exit code 137 (OOM kill) | Rule treats same as generic failure; preserves full output |
+| P30 | Haiku returns lines NOT present in raw output (hallucination) | Drop hallucinated lines; keep only substring matches |
+
+### CH-series: cross-host E2E
+
+Run each scenario on each supported host. Same input, same expected output. If a host does not support a matcher, the test is marked `skip-on-{host}` with a comment linking the upstream limitation.
+
+| ID | Scenario | Hosts |
+|----|----------|-------|
+| CH1 | Install hook via `gstack compact install ` | Claude Code, Codex, OpenClaw |
+| CH2 | Uninstall hook is idempotent | All |
+| CH3 | Re-install doesn't duplicate entries | All |
+| CH4 | Hook co-exists with user's other PostToolUse hooks | All |
+| CH5 | Hook fires on Bash tool | All |
+| CH6 | Hook fires on Read tool | Claude Code (v1 assumption; 2026-04-17 research found Read bypasses hooks entirely, see Status); Codex/OpenClaw verify-then-require |
+| CH7 | Hook fires on Grep tool | Same as CH6 |
+| CH8 | Hook fires on Glob tool | Same as CH6 |
+| CH9 | Hook fires on MCP tool (`mcp__*` matcher) | Claude Code; verify on others |
+| CH10 | Config precedence: project > user > built-in | All |
+| CH11 | `GSTACK_RAW=1` env var bypasses hook | All |
+| CH12 | Rule ID override works (project rule replaces built-in) | All |
+| CH13 | `gstack compact doctor` detects drift on each host | All |
+| CH14 | Hook error does not crash the agent session | All |
+
+Implementation note: cross-host tests reuse the fixture corpus from the `golden/` tree; the harness wraps each fixture in a host-specific hook invocation envelope and asserts the output is byte-identical across hosts (modulo the `host` field). 
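+
+The harness contract above (same fixture in, byte-identical compacted output out, modulo `host`) can be sketched in TypeScript. This is a minimal illustration, not the real `compact/src/` implementation: `HookEnvelope`, `compact` (reduced here to just the dedupe primitive from G18), and `crossHostIdentical` are hypothetical names.

```typescript
// Minimal sketch of the cross-host harness idea. Assumptions: the envelope
// shape and all function names are illustrative; the real compactor applies
// the full rule engine, not just dedupe.
interface HookEnvelope {
  host: "claude-code" | "codex" | "openclaw";
  tool: string;         // e.g. "Bash"
  command: string;      // command the agent ran
  combinedText: string; // raw stdout+stderr from the fixture
}

// Stand-in compactor: collapses consecutive duplicate lines (the G18 dedupe
// primitive) and is deliberately host-independent.
function compact(env: HookEnvelope): string {
  const out: string[] = [];
  let prev: string | null = null;
  let repeats = 0;
  for (const line of env.combinedText.split("\n")) {
    if (line === prev) { repeats++; continue; }
    if (repeats > 0) out.push(`[last message repeated ${repeats} times]`);
    out.push(line);
    prev = line;
    repeats = 0;
  }
  if (repeats > 0) out.push(`[last message repeated ${repeats} times]`);
  return out.join("\n");
}

// The cross-host assertion: wrap one fixture in each host's envelope and
// require identical compacted output once the host field is set aside.
function crossHostIdentical(fixture: Omit<HookEnvelope, "host">): boolean {
  const hosts: HookEnvelope["host"][] = ["claude-code", "codex", "openclaw"];
  const results = hosts.map((host) => compact({ host, ...fixture }));
  return results.every((r) => r === results[0]);
}
```

A real harness would diff the full JSON output envelope rather than a single string, and would record `skip-on-{host}` instead of failing where a host lacks the matcher.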
+ +### V-series: verifier tests (paid) + +| ID | Scenario | Expected | +|----|----------|----------| +| V1 | Rule reduces 200-line test output to 5 lines, exit=1 | Verifier fires (failure + >50% reduction), appends any missing critical lines | +| V2 | Rule reduces 10-line output to 9 lines, exit=1 | Verifier does NOT fire (reduction too small) | +| V3 | Rule reduces 200-line output to 5 lines, exit=0 | Verifier does NOT fire (success path, default config) | +| V4 | `aggressiveReduction` trigger enabled, 300 lines → 20 lines, exit=0 | Verifier fires | +| V5 | `GSTACK_COMPACT_VERIFY=1` env var set | Verifier fires once for that call | +| V6 | `ANTHROPIC_API_KEY` missing | Verifier silently skipped; raw rule output returned | +| V7 | Verifier mocked to return "NONE" | Output identical to pure-rule path | +| V8 | Verifier mocked to return prompt injection | Injection discarded; only substring-matched lines appended | +| V9 | Verifier mocked to time out >5s | Skipped; `meta.verifierTimedOut: true` | +| V10 | Verifier mocked to return 500 error | Skipped; rule output returned | + +### R-series: adversarial regression + +Every bug caught after v1 ship gets a permanent R-series test. Starts empty; grows with scars. 
Template:
+
+```
+R{N}: {commit-sha} — {1-line summary}
+Scenario: {reproducer}
+Fix: {PR link}
+```
+
+### Performance budgets (enforced in CI; revised for realistic Bun cold-start)
+
+| Metric | Target | Hard limit |
+|--------|--------|-----------|
+| Hook overhead macOS ARM (verifier disabled) | <30ms p50 | <80ms p99 |
+| Hook overhead Linux (verifier disabled) | <20ms p50 | <60ms p99 |
+| Hook overhead (verifier fires) | <600ms p50 | <2s p99 |
+| Bundle deserialize (rules.bundle.json) | <2ms | <10ms |
+| mtime drift check (stat of source files) | <0.5ms | <3ms |
+| Single-regex execution budget (per rule) | <5ms | <50ms (hard abort) |
+| Memory per hook invocation (line-streamed) | <16MB typical | <64MB max |
+| Total rule-payload size on disk (source files) | <5KB | <15KB |
+| Compiled bundle size on disk | <25KB | <80KB |
+
+Daemon mode is a v2 optimization. If the B-series benchmark on the author's corpus shows that cold-start meaningfully hurts session-total savings (e.g., total hook overhead exceeding 5% of the wall time the saved tokens would have cost), promote it to v1.1.
+
+### B-series real-world benchmark testbench (hard v1 gate)
+
+**Why it exists.** Every competing compactor ships with hand-picked fixture numbers. B-series proves the compactor works on the user's *actual* coding sessions before they enable the hook. It's both the ship-gate and the marketing artifact.
+
+**Architecture** (components in `compact/benchmark/src/`):
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│ 1. SCAN scanner.ts walks ~/.claude/projects/**/*.jsonl │
+│ → pairs tool_use × tool_result blocks │
+│ → emits {tool, command, outputBytes, lineCount, │
+│ estimatedTokens, sessionId, timestamp} │
+├──────────────────────────────────────────────────────────────┤
+│ 2. RANK sizer.ts sorts corpus by estimatedTokens desc │
+│ → cluster.ts groups by (tool, command-pattern) │
+│ → identifies heavy-tail: which 10% of calls │
+│ produced 80% of the tokens? 
│ +├──────────────────────────────────────────────────────────────┤ +│ 3. SCENARIO scenarios.ts emits fixture files: │ +│ B1_bun_test_heavy.jsonl │ +│ B2_git_diff_huge.jsonl │ +│ B3_tsc_errors_production.jsonl │ +│ B4_pnpm_install_fresh.jsonl ... (one per │ +│ high-leverage cluster, up to ~20 scenarios) │ +├──────────────────────────────────────────────────────────────┤ +│ 4. REPLAY replay.ts runs compactor against each scenario, │ +│ measures token reduction + diff of dropped lines│ +│ → per-rule reduction numbers │ +│ → per-scenario before/after token counts │ +├──────────────────────────────────────────────────────────────┤ +│ 5. PATHOLOGY pathology.ts injects planted critical lines │ +│ (line 4 of 200 in a failing test fixture) into │ +│ real B-scenarios. Confirms verifier restores │ +│ them. Real data + real threats = real proof. │ +├──────────────────────────────────────────────────────────────┤ +│ 6. REPORT report.ts emits HTML + JSON dashboard to │ +│ ~/.gstack/compact/benchmark/latest/ │ +│ "On YOUR 30 days of Claude Code data, gstack │ +│ compact would save X tokens in Y scenarios." │ +└──────────────────────────────────────────────────────────────┘ +``` + +**v1 ship gate (hard):** +- ≥15% total-token reduction across the aggregated scenario corpus on the author's own 30-day transcript set. +- Zero critical-line loss on planted-bug scenarios (every planted stack frame must survive either the rule or the verifier). +- No scenario regresses to <5% reduction under the new rules (catch over-compaction edge cases). + +**Privacy (non-negotiable):** +- Reads `~/.claude/projects/**/*.jsonl` locally only. Never uploads. Never shares. Never logs scenarios to telemetry. +- Output files live under `~/.gstack/compact/benchmark/` with mode `0600`. 
+- The command prints a confirmation banner: *"Scanning local transcripts at ~/.claude/projects/ (local-only; nothing leaves this machine)."* +- Any future community corpus is a separate v2 workstream built from hand-contributed, secret-scanned fixtures on OSS projects. + +**Ports from analyze_transcripts (TypeScript reimplementation; not a subprocess call):** +- JSONL parsing + tool_use/tool_result pairing pattern (from `event_extractor.rb`). +- Token estimate `ceil(len/4)` (same char-ratio heuristic; sufficient for ranking). +- Event-type taxonomy (`bash_command`, `file_read`, `test_run`, `error_encountered`) for scenario clustering. +- Stress-fixture generation pattern for pathology layering. + +**What we do NOT port:** behavioral scoring, pgvector embeddings, decision-exchange graphs, velocity metrics, the Rails/ActiveRecord layer. Out of scope; not what we're measuring. + +### Synthetic token-savings evals (E-series, periodic/informational only) + +Retained from the original plan but now informational-only because B-series is the real gate. + +- **E1:** simulated 30-min coding session on a medium TypeScript project. Measure total tokens with/without gstack compact enabled. Target: ≥15% reduction. +- **E2:** same session at `level=aggressive`. Target: ≥25% reduction, zero test-failure increase. +- **E3:** same session with verifier on `failureCompaction` only. Verifier fire rate ≤10% of tool calls. +- **E4:** adversarial — inject a planted bug in a test output and confirm the verifier restores the critical stack frame. + +### Test corpus sourcing + +For each rule family, capture 3+ real outputs: + +1. Run the tool against a real project (gstack itself for TS; popular OSS for Rust/Go/Python). +2. Capture stdout+stderr+exit code into a fixture file with `toolVersion:` frontmatter (e.g., `jest@29.7.0`). +3. Hand-author the expected compacted output once. +4. Golden file test: rule application must produce byte-identical output. +5. 
CI drift warning: if installed tool version differs from fixture's `toolVersion:`, CI warns (not fails). Drift-warning dashboard is checked pre-release. + +Draw from: +- tokenjuice's fixture directory patterns (`tests/fixtures/`) +- RTK's per-command examples (their README lists real before/after metrics; verify independently) +- gstack's own test output (eat our own dog food) +- Real failure archives from `~/.gstack/compact/tee/` (once volunteers contribute) +- **B-series real-world scenarios are the primary corpus for reduction measurements.** + +## Pattern adoption table + +Concrete patterns borrowed from the competitive landscape: + +| From | Adopt as | Why | +|------|----------|-----| +| RTK | 4 reduction primitives (filter/group/truncate/dedupe) as JSON rule verbs | Table stakes for a serious compactor | +| RTK | `gstack compact tee` for failure-mode raw save | Better than the original `onFailure.preserveFull` design | +| RTK | `gstack compact gain` + `gstack compact discover` | Trust + continuous improvement | +| RTK | `exclude_commands` per-user blocklist | Must-have config | +| tokenjuice | JSON envelope contract for hook I/O | Clean machine adapter | +| tokenjuice | `gstack compact doctor` | Hooks drift; self-repair matters | +| caveman | Intensity levels (minimal/normal/aggressive) | User-tunable safety/savings knob | +| claude-token-efficient | Rules-file size budget (<5KB total) | Don't bloat context | + +## Rollout plan + +**ALL PHASES TABLED pending Anthropic `updatedBuiltinToolOutput` API.** See Status section at the top of this doc. The rollout below is the intended sequence if/when the API ships and this design un-tables. + +### Un-tabling checklist (do in order when the API arrives) + +1. **Confirm the new API's shape.** Read the updated Claude Code hooks reference. Capture a real envelope containing the new output-replacement field for Bash, Read, Grep, Glob. Record in `docs/designs/GCOMPACTION_envelope.md`. +2. 
**Re-validate the wedge.** Does the new API cover Read/Grep/Glob (do they fire `PostToolUse` now), or just Bash/WebFetch? If Bash-only, wedge (ii) stays dead and the product needs a new pitch before implementation. +3. **Re-run `/plan-eng-review`** against the revised plan with the new API. Most of the 15 locked decisions should carry forward; adjust the Architecture data-flow and any envelope-dependent decisions. +4. **Re-run `/codex review`** against the revised plan. The prior BLOCK verdict's concerns about hook substitution disappear once the API exists; remaining criticals (B-series privacy, regex DoS, JSON-envelope streaming) still apply. +5. **Execute the original rollout below.** + +### Original rollout (preserved for un-tabling) + +Each tier blocks on the prior passing all gate-tier tests. Claude-first — Codex and OpenClaw land at v1.1 after the wedge is proven on the primary host. + +1. **v0.0 (1 day):** rule engine + 4 primitives + line-oriented streaming pipeline + deep-merge + bundle compiler + envelope contract + golden tests for `tests/*` family only. No host integration yet. Measure savings on offline fixtures. +2. **v0.1 (1 day):** Claude Code hook integration + `gstack compact install` + mtime-based auto-reload. Ship as opt-in; off by default. Ask 10 gstack power users to try it; collect feedback. +3. **v0.5 (1 day):** B-series benchmark testbench (`compact/benchmark/`). Ship `gstack compact benchmark` so users can measure on their own data. Collect anonymous-from-the-start (nothing uploaded) reduction numbers from dogfooders. +4. **v1.0 (1 day):** verifier layer with `failureCompaction` trigger on by default + exact-line-match sanitization + layered exitCode/pattern fallback + expanded tee redaction set. **Hard ship gate:** B-series on the author's 30-day local corpus shows ≥15% total reduction AND zero critical-line loss on planted bugs. Publish CHANGELOG entry leading with wedge framing (Claude Code only at v1). +5. 
**v1.1 (+1 day):** Codex + OpenClaw hook integration. Cross-host E2E suite green. Build/lint/log rule families land with `gstack compact discover`-derived priorities. +6. **v1.2+:** expand rule families, community rule contribution workflow, community-corpus benchmark (hand-authored public fixtures, separate from local B-series). + +## Risk analysis + +| Risk | Severity | Mitigation | +|------|----------|------------| +| RTK adds an LLM verifier in response | Low | Creator is vocal about zero-dependency Rust. Ship first, build the pattern library. | +| Platform compaction subsumes us (Anthropic Compaction API in Claude Code) | Medium | We operate at a different layer (per-tool output vs whole-context). Position as complementary. | +| Rules drop something critical → "compactor made my agent dumb" | High | B-series real-world benchmark as hard ship gate; tee mode always available; verifier default-on for failures; exact-line-match sanitization. | +| Haiku cost creep (triggers fire more than expected) | Medium | E3 eval + B-series fire-rate metric; cost visible in `gstack compact gain`; per-session rate cap in v1.1 if rate >10%. | +| Rule maintenance debt (jest/vitest output formats change) | Medium | `toolVersion:` fixture frontmatter + CI drift warning; community rule PRs; `discover` flags bypassing commands. | +| Rules file bloats context | Low | CI-enforced <5KB source + <25KB compiled bundle budget; per-rule size warning at schema-validation. | +| Regex DoS blocks the agent | Medium | 50ms AbortSignal budget per rule; timeout logged to `meta.regexTimedOut`; stale rules quarantined on repeated failure. | +| Bundle staleness silently breaks user edits | Low | mtime-check on every hook invocation auto-rebuilds; `gstack compact reload` is a backup not a requirement. | +| Benchmark leaks user's private data | High | Local-only by construction: no network call, mode-0600 output, explicit banner at runtime. Privacy review before v1 ship. | + +## Open questions + +1. 
~~Does Codex's PostToolUse hook support matchers for Read/Grep/Glob?~~ (Deferred to v1.1 — Claude-first at v1.) +2. ~~Does OpenClaw's hook API support PostToolUse specifically?~~ (Deferred to v1.1.) +3. Should the verifier model be pinned, or version-tracked like gstack's other AI calls? (Inclined to pin `claude-haiku-4-5-20251001` and bump explicitly in CHANGELOG.) +4. ~~Built-in secret-redaction regex set for tee files~~ **(resolved: expanded set — AWS/GitHub/GitLab/Slack/JWT/bearer/SSH-private-key. See decision #10.)** +5. Should `gstack compact discover` propose auto-generated rules via Haiku? (Deferred to v2; skill-creep risk.) +6. **New:** Does Claude Code's PostToolUse envelope include `exitCode`? (Still needs empirical verification per pre-implementation task #1; system now has a layered fallback regardless.) +7. **New:** What's the right scenario-count cap for B-series? Cluster.ts can produce 5-50 scenarios depending on heavy-tail shape. Plan: cap at top 20 clusters by aggregate output volume. + +## Pre-implementation assignment (must complete before coding) + +1. **Verify Claude Code's PostToolUse envelope contents empirically.** Ship a no-op hook; confirm `exitCode`, `command`, `argv`, `combinedText` are all present. This is the pivot for wedge (ii) native-tool coverage AND for the failureCompaction trigger. Output: `docs/designs/GCOMPACTION_envelope.md` with real captured envelopes for Bash + Read + Grep + Glob. +2. **Read RTK's rule definitions** (`ARCHITECTURE.md`, `src/rules/`) and write a 1-paragraph summary of which of the 4 primitives they handle best. Inform our v1 rule set. This is the Search Before Building layer. +3. **Port analyze_transcripts JSONL parser to TypeScript.** `compact/benchmark/src/scanner.ts`. Write a quick-look output that lists the top-50 noisiest tool calls on the author's `~/.claude/projects/`. Confirms the testbench premise before we build the replay loop. This is the B-series foundation. +4. 
**Write the CHANGELOG entry FIRST.** Target sentence: *"Every tool in your agent's toolbox on Claude Code now produces less noise — test runners, git diffs, package installs — with an intelligent Haiku safety net that restores critical stack frames when our rules over-compact, and a local benchmark that proves the savings on your actual 30 days of coding sessions. Codex + OpenClaw land in v1.1."* If we cannot write that sentence honestly, the wedge isn't there yet. +5. **Ship a rule-only v0** (no Haiku verifier, no benchmark). Measure real token savings with current gstack evals + early B-series prototype. If <10% on local corpus, the whole premise is weaker than claimed — iterate the rules before adding the verifier on top. + +## License & attribution + +gstack ships under MIT. To keep the license clean for downstream users, this project follows a strict clean-room policy for everything borrowed from the competitive landscape: + +- **Every project referenced above is permissive-licensed** (MIT or Apache-2.0). No AGPL, GPL, SSPL, or other copyleft exposure. + - RTK (rtk-ai/rtk): **Apache-2.0** — MIT-compatible; Apache patent grant is a bonus for us. + - tokenjuice, caveman, claude-token-efficient, token-optimizer-mcp, sst/opencode: **MIT**. +- **Patterns, not code.** We read these projects to understand what they solved and why. We implement independently in TypeScript inside `compact/src/`. We do not copy source files, translate source files line-for-line, or lift test fixtures verbatim. +- **Attribution.** Where a pattern is directly borrowed (the 4 primitives from RTK, the JSON envelope from tokenjuice, intensity levels from caveman, rules-file size budget from claude-token-efficient), we credit the source inline in comments and in the "Pattern adoption table" above. The project's `README` and `NOTICE` file (if we add one) list the inspirations. 
+- **Fixture sourcing.** Golden-file fixtures come from running real tools against real projects — they are our own captures, not imported from RTK or tokenjuice. This keeps the test corpus free of license-tangled content.
+- **Forbidden sources.** Before adding any new reference project, run `gh api repos/OWNER/REPO --jq '.license.key'` and verify the license key is one of: `mit`, `apache-2.0`, `bsd-2-clause`, `bsd-3-clause`, `isc`, `cc0-1.0`, `unlicense`. If the project has no license field, treat it as "all rights reserved" and do not draw from it. Reject `agpl-3.0`, `gpl-*`, `sspl-*`, and any custom or source-available license.
+
+CI enforcement: a `scripts/check-references.ts` script parses `docs/designs/GCOMPACTION.md` for GitHub URLs and re-runs the license check, failing if any referenced project's license moves off the allowlist.
+
+## References
+
+- [RTK (Rust Token Killer) — rtk-ai/rtk](https://github.com/rtk-ai/rtk)
+- [RTK issue #538 — native-tool gap](https://github.com/rtk-ai/rtk/issues/538)
+- [tokenjuice — vincentkoc/tokenjuice](https://github.com/vincentkoc/tokenjuice)
+- [caveman — juliusbrussee/caveman](https://github.com/juliusbrussee/caveman)
+- [claude-token-efficient — drona23](https://github.com/drona23/claude-token-efficient)
+- [token-optimizer-mcp — ooples](https://github.com/ooples/token-optimizer-mcp)
+- [6-Layer Token Savings Stack — doobidoo gist](https://gist.github.com/doobidoo/e5500be6b59e47cadc39e0b7c5cd9871)
+- [Claude Code hooks reference](https://code.claude.com/docs/en/hooks)
+- [Chroma context rot research](https://research.trychroma.com/context-rot)
+- [Morph: Why LLMs Degrade as Context Grows](https://www.morphllm.com/context-rot)
+- [Anthropic Opus 4.6 Compaction API — InfoQ](https://www.infoq.com/news/2026/03/opus-4-6-context-compaction/)
+- [OpenAI compaction docs](https://developers.openai.com/api/docs/guides/compaction)
+- [Google ADK context compression](https://google.github.io/adk-docs/context/compaction/)
+- 
[LangChain autonomous context compression](https://blog.langchain.com/autonomous-context-compression/) +- [sst/opencode context management](https://deepwiki.com/sst/opencode/2.4-context-management-and-compaction) +- [DEV: Deterministic vs. LLM Evaluators — 2026 trade-off study](https://dev.to/anshd_12/deterministic-vs-llm-evaluators-a-2026-technical-trade-off-study-11h) +- [MadPlay: RTK 80% token reduction experiment](https://madplay.github.io/en/post/rtk-reduce-ai-coding-agent-token-usage) +- [Esteban Estrada: RTK 70% Claude Code reduction](https://codestz.dev/experiments/rtk-rust-token-killer) + +**End of GCOMPACTION.md canonical section.** On plan approval, everything above is copied verbatim to `docs/designs/GCOMPACTION.md` as a **tabled design artifact**. No code is written; no hook is installed; no CHANGELOG entry is added. The doc exists so a future sprint can unblock quickly when Anthropic ships the built-in-tool output-replace API.