fix: resolve merge conflicts — bump to v0.15.1.0, keep both CHANGELOG entries

Main shipped v0.15.0.0 (Session Intelligence) and v0.14.6.0 (Recursive
Self-Improvement). Our GStack Browser entry bumps to v0.15.1.0 on top.
SKILL.md conflict: kept our renamed open-gstack-browser skill name.
This commit is contained in:
Garry Tan
2026-04-01 01:19:50 -07:00
64 changed files with 5934 additions and 1086 deletions
@@ -1,6 +1,7 @@
# Design: GStack Self-Learning Infrastructure
Generated by /office-hours + /plan-ceo-review + /plan-eng-review on 2026-03-28
Updated: 2026-04-01 (post-Session Intelligence, reviewed by Codex)
Branch: garrytan/ce-features
Repo: gstack
Status: ACTIVE
@@ -27,10 +28,10 @@ architectural decision, every past bug pattern, and every time it was wrong.
## North Star
/autoship (Release 5). A full engineering team in one command. Describe a feature,
approve the plan, everything else is automatic. /autoship can't work without
learnings (R1), review quality (R2), session persistence (R3), and adaptive ceremony
(R4). Releases 1-4 are the infrastructure that makes /autoship actually work.
## Audience
@@ -48,13 +49,31 @@ a week and notice when it asks the same question twice.
---
## State Systems
gstack has four distinct persistence layers. They share storage patterns
(JSONL in `~/.gstack/projects/$SLUG/`) but serve different purposes:
| System | File | What it stores | Written by | Read by |
|--------|------|---------------|------------|---------|
| **Learnings** | `learnings.jsonl` | Institutional knowledge (pitfalls, patterns, preferences) | All skills | All skills (preamble) |
| **Timeline** | `timeline.jsonl` | Event history (skill start/complete, branch, outcome) | Preamble (automatic) | /retro, preamble context recovery |
| **Checkpoints** | `checkpoints/*.md` | Working state snapshots (decisions, remaining work, files) | /checkpoint, /ship, /investigate | Preamble context recovery, /checkpoint resume |
| **Health** | `health-history.jsonl` | Code quality scores over time (per-tool, composite) | /health | /retro, /ship (gate), /health (trends) |
These are not overlapping. Learnings = what you know. Timeline = what happened.
Checkpoints = where you are. Health = how good the code is. Each answers a
different question.
---
## Release Roadmap
### Release 1: "GStack Learns" (v0.13-0.14) — SHIPPED
**Headline:** Every session makes the next one smarter.
What shipped:
- Learnings persistence at `~/.gstack/projects/{slug}/learnings.jsonl`
- `/learn` skill for manual review, search, prune, export
- Confidence calibration on all review findings (1-10 scores with display rules)
@@ -63,7 +82,7 @@ What ships:
- "Learning applied" callouts when reviews match past learnings
- Integration into /review, /ship, /plan-*, /office-hours, /investigate, /retro
Schema:
```json
{
"ts": "2026-03-28T12:00:00Z",
@@ -83,27 +102,25 @@ Types: `pattern` | `pitfall` | `preference` | `architecture` | `tool`
Sources: `observed` | `user-stated` | `inferred` | `cross-model`
Architecture: append-only JSONL. Duplicates resolved at read time ("latest winner"
per key+type). No write-time mutation, no race conditions.
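A minimal sketch of the read-time "latest winner" resolution (the `key` and `type` fields come from the schema above; the loader itself is illustrative, not the shipped implementation):

```python
import json

def load_learnings(path):
    """Read-time dedup over an append-only JSONL file:
    the last entry wins per (key, type); nothing is ever rewritten."""
    latest = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            latest[(entry["key"], entry["type"])] = entry
    return list(latest.values())
```

Because writers only append, two skills logging concurrently can interleave lines without corrupting each other; conflicts are resolved purely at read time.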
### Release 2: "Review Army" (v0.14.3-0.14.4) — SHIPPED
**Headline:** 10 specialist reviewers on every PR.
What shipped:
- 7 parallel specialist subagents: always-on (testing, maintainability) +
conditional (security, performance, data-migration, API contract, design) +
red team (large diffs / critical findings)
- JSON-structured findings with confidence scores + fingerprint dedup across agents
- PR quality score (0-10) logged per review + /retro trending
- Learning-informed specialist prompts, past pitfalls injected per domain
- Multi-specialist consensus highlighting, confirmed findings get boosted
- Enhanced Delivery Integrity via PLAN_COMPLETION_AUDIT
- Checklist refactored: CRITICAL categories stay in main pass, specialist
categories extracted to focused checklists in review/specialists/
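The fingerprint dedup and consensus boosting could work roughly like this sketch (field names are illustrative stand-ins, not the shipped finding format):

```python
import hashlib

def fingerprint(finding):
    """Stable fingerprint so the same issue reported by two specialists
    collapses to one finding. Field names here are hypothetical."""
    basis = "|".join([
        finding["file"],
        finding["category"],
        finding["message"].lower().strip(),
    ])
    return hashlib.sha256(basis.encode()).hexdigest()[:16]

def dedup(findings):
    """Collapse duplicate findings across agents; a finding confirmed
    by more than one specialist gets its confidence boosted."""
    seen = {}
    for f in findings:
        fp = fingerprint(f)
        if fp in seen:
            seen[fp]["confidence"] = min(10, seen[fp]["confidence"] + 1)
        else:
            seen[fp] = dict(f, fingerprint=fp)
    return list(seen.values())
```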
### Release 2.5: "Review Army Expansions" — NOT YET SHIPPED
**Headline:** Ship after R2 proves stable. Check in on how the core loop is performing.
@@ -111,53 +128,203 @@ Pre-check: review R2 quality metrics (PR quality scores, specialist hit rates,
false positive rates, E2E test stability). If core loop has issues, fix those first.
What ships:
- E1: Adaptive specialist gating, auto-skip specialists with 0-finding track record.
Store per-project hit rates via gstack-learnings-log. User can force with --security etc.
- E3: Test stub generation, each specialist outputs TEST_STUB alongside findings.
Framework detected from project (Jest/Vitest/RSpec/pytest/Go test).
Flows into Fix-First: AUTO-FIX applies fix + creates test file.
- E5: Cross-review finding dedup, read gstack-review-read for prior review entries.
Suppress findings matching a prior user-skipped finding.
- E7: Specialist performance tracking, log per-specialist metrics via gstack-review-log.
Timeline integration: specialist runs appear in timeline.jsonl for /retro trending.
### Release 3: "Session Intelligence" (v0.15.0) — SHIPPED
**Headline:** Your AI sessions remember what happened.
What shipped:
- Session timeline: every skill auto-logs start/complete events to
`~/.gstack/projects/$SLUG/timeline.jsonl`. Local-only, never sent anywhere,
always on regardless of telemetry setting.
- Context recovery: after compaction or session start, preamble lists recent CEO
plans, checkpoints, and reviews. Agent reads the most recent to recover context.
- Cross-session injection: preamble prints LAST_SESSION and LATEST_CHECKPOINT for
the current branch. You see where you left off before typing anything.
- Predictive skill suggestion: if your last 3 sessions follow a pattern
(review, ship, review), gstack suggests what you probably want next.
- "Welcome back" synthesized context message on session start.
- `/checkpoint` skill: save/resume/list working state snapshots. Cross-branch
listing for Conductor workspace handoff between agents.
- `/health` skill: code quality scorekeeper wrapping project tools (tsc, biome,
knip, shellcheck, tests). Composite 0-10 score, trend tracking, improvement
suggestions when scores drop.
- Timeline binaries: `bin/gstack-timeline-log` and `bin/gstack-timeline-read`.
- Routing rules: /checkpoint and /health added to preamble skill routing.
Design doc: `docs/designs/SESSION_INTELLIGENCE.md`
### Release 4: "Adaptive Ceremony" — NOT YET SHIPPED
**Headline:** GStack respects your time without compromising your safety.
Ceremony and trust are separate concerns. Ceremony = the set of review/test/QA
steps a PR goes through. Trust = a policy engine that determines which ceremony
level applies. They interact but don't merge.
What ships:
- Scope assessment (TINY/SMALL/MEDIUM/LARGE) in /review, /ship, /autoplan
- Ceremony skipping based on diff size and scope category
- File-based todo lifecycle (/triage for interactive approval, /resolve for batch
resolution via parallel agents)
**Ceremony levels:**
- FULL: all specialists, adversarial, Codex structured review, coverage audit, plan
completion. For large diffs, new features, migrations, auth changes.
- STANDARD: adversarial + Codex, coverage audit, plan completion. For medium diffs,
typical feature work.
- FAST: adversarial only. For small, well-tested changes on trusted projects.
**Trust policy engine:**
- Scope-aware trust. Trust is earned per change class, not globally. Clean history on
docs-only PRs does not buy trust on migration PRs.
- Change class detection: docs, tests, config, frontend, backend, migrations, auth,
infra. Each class has its own trust threshold.
- Trust signals: consecutive clean reviews (per class), /health score stability,
regression frequency, test coverage trends.
- Trust never fast-tracks: migrations, auth/permission changes, new API endpoints,
infrastructure changes. These always get FULL ceremony regardless of trust level.
- Gradual degradation, not binary reset. A single regression doesn't reset all trust.
It degrades trust for that change class by one level.
**Scope assessment:**
- TINY/SMALL/MEDIUM/LARGE classification in /review, /ship, /autoplan based on
diff size, files touched, and change class.
- Ceremony level = f(scope, trust, change class).
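A hedged sketch of that function, assuming a numeric per-class trust level (class names match the never-fast-track list above; thresholds are illustrative):

```python
# Change classes that always get FULL ceremony, regardless of earned trust.
NEVER_FAST_TRACK = {"migrations", "auth", "infra", "new-endpoint"}

def ceremony_level(scope, trust, change_class):
    """Ceremony level = f(scope, trust, change class).
    `trust` is a hypothetical per-class count of consecutive clean reviews."""
    if change_class in NEVER_FAST_TRACK:
        return "FULL"
    if scope == "LARGE":
        return "FULL"
    if scope == "MEDIUM" or trust < 2:
        return "STANDARD"
    return "FAST"  # TINY/SMALL diff on a change class with earned trust
```

Note the ordering: the never-fast-track check comes first, so no amount of earned trust can route a migration or auth change below FULL.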
**TODO lifecycle:**
- /triage for interactive approval of incoming TODOs
- /resolve for batch resolution via parallel agents
### Release 5: "/autoship — One Command, Full Feature" — NOT YET SHIPPED
**Headline:** Describe a feature. Approve the plan. Everything else is automatic.
/autoship is a resumable state machine, not a linear pipeline. Review and QA can
send work back to build/fix. Compaction can interrupt any phase. The system must
recover gracefully.
```
┌──────────┐
│ START │
└────┬─────┘
┌────▼─────┐
│ /office- │
│ hours │
└────┬─────┘
┌────▼─────┐
│/autoplan │ ◄── single approval gate
└────┬─────┘
┌──────────▼──────────┐
│ BUILD │ ◄── /checkpoint auto-save
└──────────┬──────────┘
┌──────────▼──────────┐
│ /health │ ◄── quality gate
│ (score >= 7.0) │
└──────────┬──────────┘
│ fail → back to BUILD
┌──────────▼──────────┐
│ /review │
└──────────┬──────────┘
│ ASK items → back to BUILD
┌──────────▼──────────┐
│ /qa │
└──────────┬──────────┘
│ bugs found → back to BUILD
┌──────────▼──────────┐
│ /ship │
└──────────┬──────────┘
┌──────────▼──────────┐
│ /checkpoint archive │ ◄── preserve, don't destroy
└─────────────────────┘
```
What ships:
- /autoship autonomous pipeline with the state machine above.
Each phase writes to timeline.jsonl. Checkpoints auto-save before each phase.
Compaction recovery: context recovery reads checkpoint + timeline, resumes at
the last completed phase.
- Checkpoint archival on completion (not deletion). Recovery state is preserved
for debugging failed autoship runs.
- /ideate brainstorming skill (parallel divergent agents + adversarial filtering)
- Research agents in /plan-eng-review (codebase analyst, history analyst,
best practices researcher, learnings researcher)
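One way the resume step could read timeline.jsonl to find where to pick up (the event shape here is illustrative; the shipped writers are `bin/gstack-timeline-log` and `bin/gstack-timeline-read`):

```python
import json

# Pipeline phases in order; QA/review send-backs re-enter at "build".
PHASES = ["office-hours", "autoplan", "build", "health", "review", "qa", "ship"]

def resume_phase(timeline_path):
    """Return the first phase without a completion event, i.e. where a
    compaction-interrupted /autoship run should resume. Hypothetical shape."""
    done = set()
    try:
        with open(timeline_path) as f:
            for line in f:
                e = json.loads(line)
                if e.get("event") == "complete" and e.get("phase") in PHASES:
                    done.add(e["phase"])
    except FileNotFoundError:
        pass  # no timeline yet: start from the beginning
    for phase in PHASES:
        if phase not in done:
            return phase
    return None  # pipeline finished; archive the checkpoint
```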
Depends on: R1 (learnings for research agents), R2 (review army for quality),
R3 (session intelligence for persistence), R4 (adaptive ceremony for speed).
### Release 6: "Execution Studio" — NOT YET SHIPPED
**Headline:** Parallel execution infrastructure.
What ships:
- Swarm orchestration: multi-worktree parallel builds. Builds on Conductor
workspace handoff from /checkpoint (R3). An orchestrator skill dispatches
independent workstreams to parallel agents, each with its own worktree.
- Codex build delegation: auto-detect when to delegate implementation to Codex
CLI based on task type (boilerplate, test generation, mechanical refactors).
- PR feedback resolution: parallel comment resolver across review platforms.
- /onboard: auto-generated contributor guide from codebase analysis.
- /triage-prs: batch PR triage for maintainers.
### Release 7: "Design & Media" — NOT YET SHIPPED
**Headline:** Visual design integration.
What ships:
- Figma design sync (pixel-matching iteration loop)
- Feature video recording (auto-generated PR demos)
- Cross-platform portability (Copilot, Kiro, Windsurf output)
---
## Risk Register
### Proxy signals as permission to skip scrutiny
(Identified by Codex review, 2026-04-01)
/health scores, clean review history, and timeline patterns are useful signals.
They are not proof of safety. If those signals feed ceremony reduction AND /autoship,
the failure mode is rare, silent, high-severity mistakes. Mitigations:
- Certain change classes never fast-track (migrations, auth, infra, new endpoints).
- Trust degrades gradually, not binary reset.
- /autoship always runs FULL ceremony on its first run per project. Trust is earned.
### Stale context recovery
(Identified by Codex review, 2026-04-01)
Context recovery can inject wrong-branch state, obsolete plans, or invalid
checkpoints. Mitigations:
- Checkpoints include branch name in YAML frontmatter. Context recovery filters
by current branch.
- Timeline grep filters by branch before showing LAST_SESSION.
- Stale artifact detection: if checkpoint is >7 days old, note it as potentially
stale rather than presenting as current.
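These mitigations amount to a small selection rule, sketched here with checkpoint dicts standing in for parsed YAML frontmatter (the helper and its fields are illustrative):

```python
import time

def select_checkpoint(checkpoints, current_branch, now=None, max_age_days=7):
    """Pick the newest checkpoint for the current branch; flag anything
    older than max_age_days as potentially stale rather than current."""
    now = now or time.time()
    mine = [c for c in checkpoints if c["branch"] == current_branch]
    if not mine:
        return None, False  # wrong-branch state is filtered out entirely
    newest = max(mine, key=lambda c: c["ts"])
    stale = (now - newest["ts"]) > max_age_days * 86400
    return newest, stale
```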
### Validation metrics needed
(Identified by Codex review, 2026-04-01)
Before shipping R4 (Adaptive Ceremony), measure:
- Predictive suggestion accuracy (did the user run the suggested skill?)
- Trust policy false-skip rate (did fast-tracked PRs have post-merge issues?)
- Context recovery accuracy (did recovered context match actual state?)
- /health score correlation with actual code quality (do high scores predict
fewer production bugs?)
These metrics should be collected during R3 usage and reviewed before R4 ships.
---
## Acknowledged Inspiration
The self-learning roadmap was inspired by ideas from the [Compound Engineering](https://github.com/nicobailon/compound-engineering) project by Nico Bailon. Their exploration of learnings persistence, parallel review agents, and autonomous pipelines catalyzed the design of GStack's approach. We adapted every concept to fit GStack's template system, voice, and architecture rather than porting directly.
@@ -0,0 +1,135 @@
# Session Intelligence Layer
## The Problem
Claude Code's context window is ephemeral. Every session starts fresh. When
auto-compaction fires at ~167K tokens, it preserves a generic summary but
destroys file reads, reasoning chains, and intermediate decisions.
gstack already produces valuable artifacts that survive on disk: CEO plans,
eng reviews, design reviews, QA reports, learnings. These files contain
decisions, constraints, and context that shaped the current work. But Claude
doesn't know they exist. After compaction, the plans and reviews that
informed every decision silently vanish from context.
The ecosystem is working on this. claude-mem (9K+ stars) captures tool usage
and injects context into future sessions. Claude HUD shows real-time agent
status. Anthropic's own `claude-progress.txt` pattern uses a progress file
that agents read at the start of each session.
Nobody is solving the specific problem of making **skill-produced artifacts**
survive compaction. Because nobody else has gstack's artifact architecture.
## The Insight
gstack already writes structured artifacts to `~/.gstack/projects/$SLUG/`:
- CEO plans: `ceo-plans/`
- Design reviews: `design-reviews/`
- Eng reviews: `eng-reviews/`
- Learnings: `learnings.jsonl`
- Skill usage: `../analytics/skill-usage.jsonl`
The missing piece is not storage. It's awareness. The preamble needs to tell
the agent: "These files exist. They contain decisions you've already made.
After compaction, re-read them."
## The Architecture
```
┌─────────────────────────────────────┐
│ Claude Context Window │
│ (ephemeral, ~167K token limit) │
│ │
│ Compaction fires ──► summary only │
└──────────────┬──────────────────────┘
reads on start / after compaction
┌──────────────▼──────────────────────┐
│ ~/.gstack/projects/$SLUG/ │
│ (persistent, survives everything) │
│ │
│ ceo-plans/ ← /plan-ceo-review
│ eng-reviews/ ← /plan-eng-review
│ design-reviews/ ← /plan-design-review
│ checkpoints/ ← /checkpoint (new)
│ timeline.jsonl ← every skill (new)
│ learnings.jsonl ← /learn
└─────────────────────────────────────┘
rolled up weekly
┌──────────────▼──────────────────────┐
│ /retro │
│ Timeline: 3 /review, 2 /ship, ... │
│ Health trends: compile 8/10 (↑2) │
│ Learnings applied: 4 this week │
└─────────────────────────────────────┘
```
## The Features
### Layer 1: Context Recovery (preamble, all skills)
~10 lines of prose in the preamble. After compaction or context degradation,
the agent checks `~/.gstack/projects/$SLUG/` for recent plans, reviews, and
checkpoints. Lists the directory, reads the most recent file.
Cost: near-zero. Benefit: every skill's plans/reviews survive compaction.
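The recovery step is small enough to sketch (directory names follow the layout described above; the helper itself is illustrative):

```python
from pathlib import Path

def most_recent_artifact(project_dir):
    """Scan the project's artifact directories and return the newest
    file by mtime, i.e. the one the agent should re-read first."""
    root = Path(project_dir)
    candidates = []
    for sub in ("ceo-plans", "eng-reviews", "design-reviews", "checkpoints"):
        d = root / sub
        if d.is_dir():
            candidates.extend(p for p in d.iterdir() if p.is_file())
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```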
### Layer 2: Session Timeline (preamble, all skills)
Every skill appends a one-line JSONL entry to `timeline.jsonl`: timestamp,
skill name, branch, key outcome. `/retro` renders it.
Makes the project's AI-assisted work history visible. "This week: 3 /review,
2 /ship, 1 /investigate across branches feature-auth and fix-billing."
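The append is deliberately trivial, something like (field names illustrative; the shipped writer is `bin/gstack-timeline-log`):

```python
import json
import datetime

def log_timeline(path, skill, branch, outcome):
    """Append a one-line JSONL event: timestamp, skill, branch, outcome.
    Append-only, so no locking is needed for a single writer."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "skill": skill,
        "branch": branch,
        "outcome": outcome,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```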
### Layer 3: Cross-Session Injection (preamble, all skills)
When a new session starts on a branch with recent artifacts, the preamble
prints a one-liner: "Last session: implemented JWT auth, 3/5 tasks done.
Plan: ~/.gstack/projects/$SLUG/checkpoints/latest.md"
The agent knows where you left off before reading any files.
### Layer 4: /checkpoint (opt-in skill)
Manual snapshot of working state: what's being done, files being edited,
decisions made, what's remaining. Useful before stepping away, before
complex operations, for workspace handoffs, or when coming back after days.
### Layer 5: /health (opt-in skill)
Code quality dashboard: type-check, lint, test suite, dead code scan.
Composite 0-10 score. Tracks over time. `/retro` shows trends. `/ship`
gates on configurable threshold.
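The composite could be as simple as a weighted mean over normalized per-tool scores (weights and tool names illustrative):

```python
def composite_health(scores, weights=None):
    """Weighted 0-10 composite over per-tool scores (each already 0-10).
    Unweighted by default; weights let a project emphasize, say, tests."""
    if not scores:
        return 0.0
    weights = weights or {k: 1.0 for k in scores}
    total = sum(weights[k] for k in scores)
    return round(sum(scores[k] * weights[k] for k in scores) / total, 1)
```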
## The Compounding Effect
Each feature is independently useful. Together, they create something
that compounds:
Session 1: /plan-ceo-review produces a plan. Saved to disk.
Session 2: Agent reads the plan after preamble. Doesn't re-ask decisions.
Session 3: /checkpoint saves progress. Timeline shows 2 /review, 1 /ship.
Session 4: Compaction fires mid-refactor. Agent re-reads the checkpoint.
Recovers key decisions, types, remaining work. Continues.
Session 5: /retro rolls up the week. Health trend: 6/10 → 8/10.
Timeline shows 12 skill invocations across 3 branches.
The project's AI history is no longer ephemeral. It persists, compounds,
and makes every future session smarter. That's the session intelligence
layer.
## What This Is Not
- Not a replacement for Claude's built-in compaction (that handles session
state; we handle gstack artifacts)
- Not a full memory system like claude-mem (that handles cross-session
memory via SQLite; we handle structured skill artifacts)
- Not a database or service (just markdown files on disk)
## Research Sources
- [Anthropic: Effective harnesses for long-running agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents)
- [Anthropic: Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [claude-mem](https://github.com/thedotmack/claude-mem)
- [Claude HUD](https://github.com/jarrodwatts/claude-hud)
- [CodeScene: Agentic AI coding best practices](https://codescene.com/blog/agentic-ai-coding-best-practice-patterns-for-speed-with-quality)
- [Post-compaction recovery via git-persisted state (Beads)](https://dev.to/jeremy_longshore/building-post-compaction-recovery-for-ai-agent-workflows-with-beads-207l)