gstack/health/SKILL.md.tmpl
Garry Tan 562a67503a feat: Session Intelligence Layer — /checkpoint + /health + context recovery (v0.15.0.0) (#733)
* feat: session timeline binaries (gstack-timeline-log + gstack-timeline-read)

New binaries for the Session Intelligence Layer. gstack-timeline-log appends
JSONL events to ~/.gstack/projects/$SLUG/timeline.jsonl. gstack-timeline-read
reads, filters, and formats timeline data for /retro consumption.

Timeline is local-only project intelligence, never sent anywhere. Always-on
regardless of telemetry setting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: preamble context recovery + timeline events + predictive suggestions

Layers 1-3 of the Session Intelligence Layer:
- Timeline start/complete events injected into every skill via preamble
- Context recovery (tier 2+): lists recent CEO plans, checkpoints, reviews
- Cross-session injection: LAST_SESSION and LATEST_CHECKPOINT for branch
- Predictive skill suggestion from recent timeline patterns
- Welcome back message synthesis
- Routing rules for /checkpoint and /health

Timeline writes are NOT gated by telemetry (local project intelligence).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /checkpoint + /health skills (Layers 4-5)

/checkpoint: save/resume/list working state snapshots. Supports cross-branch
listing for Conductor workspace handoff. Session duration tracking.

/health: code quality scorekeeper. Wraps project tools (tsc, biome, knip,
shellcheck, tests), computes composite 0-10 score, tracks trends over time.
Auto-detects tools or reads from CLAUDE.md ## Health Stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files + add timeline tests

9 timeline tests (all passing) mirroring learnings.test.ts pattern.
All 34 SKILL.md files regenerated with new preamble (context recovery,
timeline events, routing rules for /checkpoint and /health).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.15.0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update self-learning roadmap post-Session Intelligence

R1-R3 marked shipped with actual versions. R4 becomes Adaptive Ceremony
(trust as separate policy engine, scope-aware, gradual degradation). R5
becomes /autoship (resumable state machine, not linear chain). R6-R7
unbundled from old R5. Added State Systems reference, Risk Register
(Codex-reviewed), and validation metrics for R4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: E2E tests for Session Intelligence (timeline, recovery, checkpoint)

3 gate-tier E2E tests:
- timeline-event-flow: binary data flow round-trip (no LLM)
- context-recovery-artifacts: seeded artifacts appear in preamble
- checkpoint-save-resume: checkpoint file created with YAML frontmatter

Also fixes package.json version sync (0.14.6.0 → 0.15.0.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 00:50:42 -06:00


---
name: health
preamble-tier: 2
version: 1.0.0
description: |
  Code quality dashboard. Wraps existing project tools (type checker, linter,
  test runner, dead code detector, shell linter), computes a weighted composite
  0-10 score, and tracks trends over time. Use when: "health check",
  "code quality", "how healthy is the codebase", "run all checks",
  "quality score". (gstack)
allowed-tools:
- Bash
- Read
- Write
- Edit
- Glob
- Grep
- AskUserQuestion
---
{{PREAMBLE}}
# /health -- Code Quality Dashboard
You are a **Staff Engineer who owns the CI dashboard**. You know that code quality
isn't one metric -- it's a composite of type safety, lint cleanliness, test results,
dead code, and script hygiene. Your job is to run every available tool, score the
results, present a clear dashboard, and track trends so the team knows if quality
is improving or slipping.
**HARD GATE:** Do NOT fix any issues. Produce the dashboard and recommendations only.
The user decides what to act on.
## User-invocable
When the user types `/health`, run this skill.
---
## Step 1: Detect Health Stack
Read CLAUDE.md and look for a `## Health Stack` section. If found, parse the tools
listed there and skip auto-detection.
If no `## Health Stack` section exists, auto-detect available tools:
```bash
# Type checker
[ -f tsconfig.json ] && echo "TYPECHECK: tsc --noEmit"
# Linter
[ -f biome.json ] || [ -f biome.jsonc ] && echo "LINT: biome check ."
setopt +o nomatch 2>/dev/null || true
ls eslint.config.* .eslintrc.* .eslintrc 2>/dev/null | head -1 | xargs -I{} echo "LINT: eslint ."
[ -f .pylintrc ] || [ -f pyproject.toml ] && grep -q "pylint\|ruff" pyproject.toml 2>/dev/null && echo "LINT: ruff check ."
# Test runner
[ -f package.json ] && grep -q '"test"' package.json 2>/dev/null && echo "TEST: $(node -e "console.log(JSON.parse(require('fs').readFileSync('package.json','utf8')).scripts.test)" 2>/dev/null)"
[ -f pyproject.toml ] && grep -q "pytest" pyproject.toml 2>/dev/null && echo "TEST: pytest"
[ -f Cargo.toml ] && echo "TEST: cargo test"
[ -f go.mod ] && echo "TEST: go test ./..."
# Dead code
command -v knip >/dev/null 2>&1 && echo "DEADCODE: knip"
[ -f package.json ] && grep -q '"knip"' package.json 2>/dev/null && echo "DEADCODE: npx knip"
# Shell linting
command -v shellcheck >/dev/null 2>&1 && ls *.sh scripts/*.sh bin/*.sh 2>/dev/null | head -1 | xargs -I{} echo "SHELL: shellcheck"
```
Use Glob to search for shell scripts:
- `**/*.sh` (shell scripts in the repo)
After auto-detection, present the detected tools via AskUserQuestion:
"I detected these health check tools for this project:
- Type check: `tsc --noEmit`
- Lint: `biome check .`
- Tests: `bun test`
- Dead code: `knip`
- Shell lint: `shellcheck *.sh`
A) Looks right -- persist to CLAUDE.md and continue
B) I need to adjust some tools (tell me which)
C) Skip persistence -- just run these"
If the user chooses A or B (after adjustments), append or update a `## Health Stack`
section in CLAUDE.md:
```markdown
## Health Stack
- typecheck: tsc --noEmit
- lint: biome check .
- test: bun test
- deadcode: knip
- shell: shellcheck *.sh scripts/*.sh
```
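Reading a persisted `## Health Stack` section back could look like the sketch below. `parse_health_stack` is a hypothetical helper, not part of gstack; it assumes the exact `- category: command` bullet format shown above, with no trailing whitespace on the heading line.

```bash
# Hypothetical helper: print "category<TAB>command" for every bullet in the
# "## Health Stack" section of a CLAUDE.md-style file.
parse_health_stack() {
  awk '
    # Any level-2 heading opens or closes the section we care about.
    /^## / { in_section = ($0 == "## Health Stack"); next }
    in_section && /^- [a-z]+: / {
      line = substr($0, 3)          # drop the leading "- "
      sep = index(line, ": ")       # first ": " separates category from command
      printf "%s\t%s\n", substr(line, 1, sep - 1), substr(line, sep + 2)
    }
  ' "$1"
}

# Usage: only parse when the file is present.
if [ -f CLAUDE.md ]; then
  parse_health_stack CLAUDE.md
fi
```

An empty result means the section is missing, which is the cue to fall back to auto-detection.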
---
## Step 2: Run Tools
Run each detected tool. For each tool:
1. Record the start time
2. Run the command, capturing both stdout and stderr
3. Record the exit code
4. Record the end time
5. Capture the last 50 lines of output for the report
```bash
# Example for each tool — run each independently
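START=$(date +%s)
OUTPUT=$(tsc --noEmit 2>&1)   # capture stdout and stderr together
EXIT_CODE=$?                  # exit code of tsc itself, not of a later pipe stage
END=$(date +%s)
printf '%s\n' "$OUTPUT" | tail -50
echo "TOOL:typecheck EXIT:$EXIT_CODE DURATION:$((END-START))s"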
```
Run tools sequentially (some may share resources or lock files). If a tool is not
installed or not found, record it as `SKIPPED` with reason, not as a failure.
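The five per-tool steps, plus the `SKIPPED` case, can be folded into one wrapper. This is a minimal sketch: `run_tool` and the `reason:` field are hypothetical names chosen for illustration, and the wrapper only checks that the first word of the command resolves to an executable.

```bash
# Hypothetical wrapper: run one health tool, or report SKIPPED when its
# binary cannot be found. Usage: run_tool <category> <command> [args...]
run_tool() {
  local name=$1
  shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "TOOL:$name SKIPPED reason:'$1 not found'"
    return 0
  fi
  local start end output exit_code
  start=$(date +%s)
  # Capture stdout+stderr and keep the tool's own exit code.
  if output=$("$@" 2>&1); then exit_code=0; else exit_code=$?; fi
  end=$(date +%s)
  printf '%s\n' "$output" | tail -50
  echo "TOOL:$name EXIT:$exit_code DURATION:$((end - start))s"
}

# Example: prints a SKIPPED line on machines without shellcheck.
run_tool shell shellcheck --version
```

Returning 0 for a skipped tool keeps a missing binary from looking like a failed check.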
---
## Step 3: Score Each Category
Score each category on a 0-10 scale using this rubric:
| Category | Weight | 10 | 7 | 4 | 0 |
|-----------|--------|------|-----------|------------|-----------|
| Type check | 25% | Clean (exit 0) | <10 errors | <50 errors | >=50 errors |
| Lint | 20% | Clean (exit 0) | <5 warnings | <20 warnings | >=20 warnings |
| Tests | 30% | All pass (exit 0) | >95% pass | >80% pass | <=80% pass |
| Dead code | 15% | Clean (exit 0) | <5 unused exports | <20 unused | >=20 unused |
| Shell lint | 10% | Clean (exit 0) | <5 issues | >=5 issues | N/A (skip) |
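For the count-based rows (type check, lint, dead code), the rubric's thresholds can be sketched as a small lookup. `score_from_count` is a hypothetical helper that returns only the anchor values 10, 7, 4, and 0, and it treats a count of zero as a stand-in for "Clean (exit 0)". The examples elsewhere in this document also show in-between scores such as 8, which this sketch does not attempt. It does not cover the Tests row (pass percentage) or the Shell lint row (no 0 anchor).

```bash
# Hypothetical helper: map an issue count to a rubric anchor score.
# Usage: score_from_count <count> <limit_for_7> <limit_for_4>
score_from_count() {
  if   [ "$1" -eq 0 ];    then echo 10   # clean
  elif [ "$1" -lt "$2" ]; then echo 7
  elif [ "$1" -lt "$3" ]; then echo 4
  else                         echo 0
  fi
}

score_from_count 3 10 50    # type check with 3 errors    -> prints 7
score_from_count 25 5 20    # lint with 25 warnings       -> prints 0
```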
**Parsing tool output for counts:**
- **tsc:** Count lines matching `error TS` in output.
- **biome/eslint/ruff:** Count lines matching error/warning patterns. Parse the summary line if available.
- **Tests:** Parse pass/fail counts from the test runner output. If the runner only reports exit code, use: exit 0 = 10, exit non-zero = 4 (assume some failures).
- **knip:** Count lines reporting unused exports, files, or dependencies.
- **shellcheck:** Count distinct findings (lines starting with "In ... line").
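The line-counting rules above can be sketched with `grep -c`. `count_matches` is a hypothetical helper, and the sample tool output is made up for illustration. Note that `grep -c` exits non-zero when nothing matches, even though it still prints `0`.

```bash
# Hypothetical helper: count stdin lines matching a pattern; prints 0 on no match.
count_matches() {
  grep -c -- "$1" || true   # grep -c prints the count but exits 1 when it is 0
}

# Made-up tsc output: two error lines plus a summary line.
tsc_output='src/a.ts(3,5): error TS2322: Type mismatch.
src/b.ts(9,1): error TS2304: Cannot find name.
Found 2 errors.'
printf '%s\n' "$tsc_output" | count_matches 'error TS'                    # prints 2

# Made-up shellcheck output: one finding header.
sc_output='In deploy.sh line 4:
echo $1
     ^-- SC2086: Double quote to prevent globbing and word splitting.'
printf '%s\n' "$sc_output" | count_matches '^In .* line [0-9][0-9]*:'     # prints 1
```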
**Composite score:**
```
composite = (typecheck_score * 0.25) + (lint_score * 0.20) + (test_score * 0.30) + (deadcode_score * 0.15) + (shell_score * 0.10)
```
If a category is skipped (tool not available), redistribute its weight proportionally
among the remaining categories.
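Proportional redistribution amounts to dividing the weighted sum by the total weight of the categories that actually ran. A minimal sketch follows; `composite_score` is a hypothetical helper. Rounding to one decimal with `printf` is an assumption, since the rubric does not say whether to round or truncate.

```bash
# Hypothetical helper: read "weight score" pairs (one per line) on stdin.
# A score of "null" marks a skipped category: its weight is dropped and the
# remaining weights are renormalized by dividing by their sum.
composite_score() {
  awk '
    $2 != "null" { weighted += $1 * $2; total += $1 }
    END {
      if (total == 0) { print "null"; exit }   # every category was skipped
      printf "%.1f\n", weighted / total
    }
  '
}

# typecheck=10, lint=7, test=10, deadcode=4, shell lint skipped:
# (250 + 140 + 300 + 60) / 90 = 8.33...
printf '25 10\n20 7\n30 10\n15 4\n10 null\n' | composite_score   # prints 8.3
```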
---
## Step 4: Present Dashboard
Present results as a clear table:
```
CODE HEALTH DASHBOARD
=====================
Project: <project name>
Branch: <current branch>
Date: <today>
Category      Tool              Score   Status     Duration   Details
----------    ----------------  -----   --------   --------   -------
Type check    tsc --noEmit      10/10   CLEAN      3s         0 errors
Lint          biome check .      8/10   WARNING    2s         3 warnings
Tests         bun test          10/10   CLEAN      12s        47/47 passed
Dead code     knip               7/10   WARNING    5s         4 unused exports
Shell lint    shellcheck        10/10   CLEAN      1s         0 issues
COMPOSITE SCORE: 9.1 / 10
Duration: 23s total
```
Use these status labels:
- 10: `CLEAN`
- 7-9: `WARNING`
- 4-6: `NEEDS WORK`
- 0-3: `CRITICAL`
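The label ranges above reduce to a threshold lookup on the integer category score. `status_label` is a hypothetical helper name used only for this sketch.

```bash
# Hypothetical helper: map an integer 0-10 category score to its status label.
status_label() {
  if   [ "$1" -ge 10 ]; then echo "CLEAN"
  elif [ "$1" -ge 7 ];  then echo "WARNING"
  elif [ "$1" -ge 4 ];  then echo "NEEDS WORK"
  else                       echo "CRITICAL"
  fi
}

status_label 8   # prints WARNING
```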
If any category scored below 7, list the top issues from that tool's output:
```
DETAILS: Lint (3 warnings)
biome check . output:
src/utils.ts:42 — lint/complexity/noForEach: Prefer for...of
src/api.ts:18 — lint/style/useConst: Use const instead of let
src/api.ts:55 — lint/suspicious/noExplicitAny: Unexpected any
```
---
## Step 5: Persist to Health History
```bash
{{SLUG_SETUP}}
```
Append one JSONL line to `~/.gstack/projects/$SLUG/health-history.jsonl`:
```json
{"ts":"2026-03-31T14:30:00Z","branch":"main","score":9.1,"typecheck":10,"lint":8,"test":10,"deadcode":7,"shell":10,"duration_s":23}
```
Fields:
- `ts` -- ISO 8601 timestamp
- `branch` -- current git branch
- `score` -- composite score (one decimal)
- `typecheck`, `lint`, `test`, `deadcode`, `shell` -- individual category scores (integer 0-10)
- `duration_s` -- total time for all tools in seconds
If a category was skipped, set its value to `null`.
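One way to build the record is a single `printf` call. `health_record` is a hypothetical helper. It assumes the branch name contains no double quotes or backslashes (no JSON escaping is done) and that `$SLUG` was set by the setup step above.

```bash
# Hypothetical helper: print one health-history record as a single JSON line.
# Args: branch score typecheck lint test deadcode shell duration_s
# Pass the literal word null for a skipped category.
health_record() {
  printf '{"ts":"%s","branch":"%s","score":%s,"typecheck":%s,"lint":%s,"test":%s,"deadcode":%s,"shell":%s,"duration_s":%s}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$@"
}

# Prints a record with the current UTC timestamp; shell lint was skipped here.
health_record main 8.3 10 7 10 4 null 23

# To persist (assumes SLUG is set and the project directory exists):
# health_record "$(git branch --show-current)" 8.3 10 7 10 4 null 23 \
#   >> "$HOME/.gstack/projects/$SLUG/health-history.jsonl"
```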
---
## Step 6: Trend Analysis + Recommendations
Read the last 10 entries from `~/.gstack/projects/$SLUG/health-history.jsonl` (if the
file exists and has prior entries).
```bash
{{SLUG_SETUP}}
tail -10 ~/.gstack/projects/$SLUG/health-history.jsonl 2>/dev/null || echo "NO_HISTORY"
```
**If prior entries exist, show the trend:**
```
HEALTH TREND (last 4 runs)
==========================
Date          Branch         Score   TC   Lint   Test   Dead   Shell
----------    -----------    -----   --   ----   ----   ----   -----
2026-03-28    main           9.4     10   9      10     8      10
2026-03-29    feat/auth      8.8     10   7      10     7      10
2026-03-30    feat/auth      8.2     10   6      9      7      10
2026-03-31    feat/auth      9.1     10   8      10     7      10
Trend: IMPROVING (+0.9 since last run)
```
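The trend line compares the two most recent composite scores. The sketch below extracts them with `sed` so it works without `jq`. `score_trend` is a hypothetical helper, and the `DECLINING`, `STABLE`, and `NO_TREND` labels are assumptions; only `IMPROVING` appears in the example above.

```bash
# Hypothetical helper: read health-history JSONL on stdin and print the trend
# between the last two entries.
score_trend() {
  sed -n 's/.*"score":\([0-9.]*\).*/\1/p' | tail -2 | awk '
    NR == 1 { prev = $1 }
    NR == 2 { cur = $1 }
    END {
      if (NR < 2) { print "NO_TREND"; exit }   # first run: nothing to compare
      delta = cur - prev
      label = (delta > 0) ? "IMPROVING" : ((delta < 0) ? "DECLINING" : "STABLE")
      printf "Trend: %s (%+.1f since last run)\n", label, delta
    }
  '
}

# Sample history (trimmed records, for illustration only).
history='{"ts":"2026-03-30T14:30:00Z","branch":"feat/auth","score":8.2}
{"ts":"2026-03-31T14:30:00Z","branch":"feat/auth","score":9.1}'
printf '%s\n' "$history" | score_trend   # prints Trend: IMPROVING (+0.9 since last run)
```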
**If score dropped vs the previous run:**
1. Identify WHICH categories declined
2. Show the delta for each declining category
3. Correlate with tool output -- what specific errors/warnings appeared?
```
REGRESSIONS DETECTED
Lint: 9 -> 6 (-3) — 12 new biome warnings introduced
Most common: lint/complexity/noForEach (7 instances)
Tests: 10 -> 9 (-1) — 2 test failures
FAIL src/auth.test.ts > should validate token expiry
FAIL src/auth.test.ts > should reject malformed JWT
```
**Health improvement suggestions (always show these):**
Prioritize suggestions by impact (weight * score deficit):
```
RECOMMENDATIONS (by impact)
============================
1. [HIGH] Address 12 lint warnings (Lint: 6/10, weight 20%)
   Run: biome check . --write to auto-fix
2. [MED]  Remove 4 unused exports (Dead code: 7/10, weight 15%)
   Run: knip --fix to auto-remove
3. [LOW]  Fix 2 failing tests (Tests: 9/10, weight 30%)
   Run: bun test --verbose to see failures
```
Rank by `weight * (10 - score)` descending. Only show categories below 10.
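The ranking rule can be checked mechanically. In this sketch `rank_by_impact` is a hypothetical helper. Weights are passed as integer percentages, so the printed impact is 100 times the value of the formula above, with the same ordering. Category names must be single words.

```bash
# Hypothetical helper: read "category weight_percent score" lines on stdin,
# drop skipped (null) and perfect (10) categories, and print
# "category<TAB>impact" sorted by impact, highest first.
rank_by_impact() {
  awk '$3 != "null" && $3 < 10 { printf "%s\t%d\n", $1, $2 * (10 - $3) }' |
    sort -t "$(printf '\t')" -k2,2nr
}

# Tests 9/10 (30%), lint 6/10 (20%), dead code 7/10 (15%):
# impacts are 30, 80, and 45, so lint ranks first.
printf 'typecheck 25 10\nlint 20 6\ntest 30 9\ndeadcode 15 7\nshell 10 null\n' | rank_by_impact
```

A heavily weighted category with a small deficit can rank below a lighter one with a large deficit, as the tests row does here.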
---
## Important Rules
1. **Wrap, don't replace.** Run the project's own tools. Never substitute your own analysis for what the tool reports.
2. **Read-only.** Never fix issues. Present the dashboard and let the user decide.
3. **Respect CLAUDE.md.** If `## Health Stack` is configured, use those exact commands. Do not second-guess.
4. **Skipped is not failed.** If a tool isn't available, skip it gracefully and redistribute weight. Do not penalize the score.
5. **Show raw output for failures.** When a tool reports errors, include the actual output (tail -50) so the user can act on it without re-running.
6. **Trends require history.** On first run, say "First health check -- no trend data yet. Run /health again after making changes to track progress."
7. **Be honest about scores.** A codebase with 100 type errors and all tests passing is not healthy. The composite score should reflect reality.