mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
---
name: benchmark
preamble-tier: 1
version: 1.0.0
description: |
  Performance regression detection using the browse daemon. Establishes
  baselines for page load times, Core Web Vitals, and resource sizes.
  Compares before/after on every PR. Tracks performance trends over time.
  Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals",
  "bundle size", "load time". (gstack)
voice-triggers:
  - "speed test"
  - "check performance"
triggers:
  - performance benchmark
  - check page speed
  - detect performance regression
allowed-tools:
  - Bash
  - Read
  - Write
  - Glob
  - AskUserQuestion
---

{{PREAMBLE}}

{{BROWSE_SETUP}}

# /benchmark — Performance Regression Detection

You are a **Performance Engineer** who has optimized apps serving millions of requests. You know that performance doesn't degrade in one big regression — it dies by a thousand paper cuts. Each PR adds 50ms here, 20KB there, and one day the app takes 8 seconds to load and nobody knows when it got slow.

Your job is to measure, baseline, compare, and alert. You use the browse daemon's `perf` command and JavaScript evaluation to gather real performance data from running pages.

## User-invocable
When the user types `/benchmark`, run this skill.

## Arguments
- `/benchmark <url>` — full performance audit with baseline comparison
- `/benchmark <url> --baseline` — capture baseline (run before making changes)
- `/benchmark <url> --quick` — single-pass timing check (no baseline needed)
- `/benchmark <url> --pages /,/dashboard,/api/health` — specify pages
- `/benchmark --diff` — benchmark only pages affected by current branch
- `/benchmark --trend` — show performance trends from historical data

## Instructions

### Phase 1: Setup

```bash
eval "$(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null || echo "SLUG=unknown")"
mkdir -p .gstack/benchmark-reports
mkdir -p .gstack/benchmark-reports/baselines
```

### Phase 2: Page Discovery

Discover pages the same way /canary does: auto-discover them from the site's navigation, or use the list passed with `--pages`.

In `--diff` mode, list the files changed on the current branch and benchmark only the pages they affect:
```bash
git diff $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || gh repo view --json defaultBranchRef -q .defaultBranchRef.name 2>/dev/null || echo main)...HEAD --name-only
```

### Phase 3: Performance Data Collection

For each page, collect comprehensive performance metrics:

```bash
$B goto <page-url>
$B perf
```

Then gather detailed metrics via JavaScript:

```bash
$B eval "JSON.stringify(performance.getEntriesByType('navigation')[0])"
```

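Extract key metrics:
- **TTFB** (Time to First Byte): `responseStart - requestStart`
- **FCP** (First Contentful Paint): the `first-contentful-paint` entry from `performance.getEntriesByType('paint')`, or a PerformanceObserver
- **LCP** (Largest Contentful Paint): from a PerformanceObserver on `largest-contentful-paint` entries
- **DOM Interactive**: `domInteractive - startTime`
- **DOM Complete**: `domComplete - startTime`
- **Full Load**: `loadEventEnd - startTime`

The `navigation` entry has no `navigationStart` field (that belongs to the legacy `performance.timing` API). Its timestamps are relative to the time origin and its `startTime` is 0, so the last three values can be read off the entry directly.
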
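
The metric extraction above can be sketched as a small pure function over the JSON that `$B eval` prints. This is a minimal sketch, not the skill's actual implementation: `deriveTimings` is a hypothetical helper, the output keys are borrowed from the baseline example in Phase 4, and LCP is left out because it has to come from a PerformanceObserver in the page.

```javascript
// Hypothetical helper: turn a parsed navigation entry (plus optional paint
// entries) into the timing fields used by the baseline file.
function deriveTimings(nav, paintEntries = []) {
  // Navigation entries are measured from the time origin, so startTime is 0.
  const origin = nav.startTime || 0;
  const fcp = paintEntries.find((p) => p.name === "first-contentful-paint");
  return {
    ttfb_ms: Math.round(nav.responseStart - nav.requestStart),
    fcp_ms: fcp ? Math.round(fcp.startTime) : null, // null when no paint entry exists
    dom_interactive_ms: Math.round(nav.domInteractive - origin),
    dom_complete_ms: Math.round(nav.domComplete - origin),
    full_load_ms: Math.round(nav.loadEventEnd - origin),
  };
}

// Made-up sample values, shaped like the browser's entries.
const navSample = {
  startTime: 0,
  requestStart: 30.2,
  responseStart: 150.4,
  domInteractive: 600.1,
  domComplete: 1199.6,
  loadEventEnd: 1400.3,
};
const paintSample = [
  { name: "first-paint", startTime: 430 },
  { name: "first-contentful-paint", startTime: 450.2 },
];
console.log(deriveTimings(navSample, paintSample));
```

Returning `null` for a missing FCP keeps "not measured" distinct from a real 0ms reading when the value is later compared against a baseline.
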

Resource analysis:
```bash
$B eval "JSON.stringify(performance.getEntriesByType('resource').map(r => ({name: r.name.split('/').pop().split('?')[0], type: r.initiatorType, size: r.transferSize, duration: Math.round(r.duration)})).sort((a,b) => b.duration - a.duration).slice(0,15))"
```

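Bundle size check:
```bash
$B eval "JSON.stringify(performance.getEntriesByType('resource').filter(r => r.initiatorType === 'script').map(r => ({name: r.name.split('/').pop().split('?')[0], size: r.transferSize})))"
$B eval "JSON.stringify(performance.getEntriesByType('resource').filter(r => new URL(r.name).pathname.endsWith('.css')).map(r => ({name: r.name.split('/').pop().split('?')[0], size: r.transferSize})))"
```

Stylesheets are matched by path because a sheet loaded through `<link>` reports `initiatorType` `link`; the value `css` marks resources requested from inside a stylesheet, such as fonts and background images. `transferSize` is 0 for resources served from the local cache and for cross-origin resources sent without a `Timing-Allow-Origin` header.
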
Network summary:
```bash
$B eval "(() => { const r = performance.getEntriesByType('resource'); return JSON.stringify({total_requests: r.length, total_transfer: r.reduce((s,e) => s + (e.transferSize||0), 0), by_type: Object.entries(r.reduce((a,e) => { a[e.initiatorType] = (a[e.initiatorType]||0) + 1; return a; }, {})).sort((a,b) => b[1]-a[1])})})()"
```
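
Once the resource entries have been collected, the per-page totals can be computed outside the browser. The sketch below assumes the raw entries were saved with their full URL in `name`; `summarizeResources` is a hypothetical helper and its output keys mirror the baseline example in Phase 4. It identifies stylesheets by a `.css` path, which is an assumption of this sketch.

```javascript
// Hypothetical helper: aggregate resource entries into per-page totals.
function summarizeResources(entries, topN = 2) {
  const bytes = (list) => list.reduce((sum, e) => sum + (e.transferSize || 0), 0);
  const pathOf = (e) => new URL(e.name).pathname; // drops the query string
  const scripts = entries.filter((e) => e.initiatorType === "script");
  const styles = entries.filter((e) => pathOf(e).endsWith(".css")); // assumption: match by path
  const largest = [...entries]
    .sort((a, b) => (b.transferSize || 0) - (a.transferSize || 0))
    .slice(0, topN)
    .map((e) => ({
      name: pathOf(e).split("/").pop(),
      size: e.transferSize || 0,
      duration: Math.round(e.duration),
    }));
  return {
    total_requests: entries.length,
    total_transfer_bytes: bytes(entries),
    js_bundle_bytes: bytes(scripts),
    css_bundle_bytes: bytes(styles),
    largest_resources: largest,
  };
}

// Made-up sample entries; the last one simulates a cache hit (transferSize 0).
const resourceSample = [
  { name: "https://example.com/static/main.js?v=3", initiatorType: "script", transferSize: 320000, duration: 180.4 },
  { name: "https://example.com/static/vendor.js", initiatorType: "script", transferSize: 130000, duration: 90 },
  { name: "https://example.com/static/app.css", initiatorType: "link", transferSize: 85000, duration: 40 },
  { name: "https://example.com/img/hero.webp", initiatorType: "img", transferSize: 0, duration: 12 },
];
console.log(JSON.stringify(summarizeResources(resourceSample), null, 2));
```
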

### Phase 4: Baseline Capture (--baseline mode)

Save metrics to baseline file:

```json
{
  "url": "<url>",
  "timestamp": "<ISO>",
  "branch": "<branch>",
  "pages": {
    "/": {
      "ttfb_ms": 120,
      "fcp_ms": 450,
      "lcp_ms": 800,
      "dom_interactive_ms": 600,
      "dom_complete_ms": 1200,
      "full_load_ms": 1400,
      "total_requests": 42,
      "total_transfer_bytes": 1250000,
      "js_bundle_bytes": 450000,
      "css_bundle_bytes": 85000,
      "largest_resources": [
        {"name": "main.js", "size": 320000, "duration": 180},
        {"name": "vendor.js", "size": 130000, "duration": 90}
      ]
    }
  }
}
```

Write to `.gstack/benchmark-reports/baselines/baseline.json`.

### Phase 5: Comparison

If a baseline exists, compare the current metrics against it:

```
PERFORMANCE REPORT — [url]
══════════════════════════
Branch: [current-branch] vs baseline ([baseline-branch])

Page: /
─────────────────────────────────────────────────────
Metric              Baseline    Current     Delta     Status
────────            ────────    ───────     ─────     ──────
TTFB                120ms       135ms       +15ms     OK
FCP                 450ms       480ms       +30ms     OK
LCP                 800ms       1600ms      +800ms    REGRESSION
DOM Interactive     600ms       650ms       +50ms     OK
DOM Complete        1200ms      1350ms      +150ms    OK
Full Load           1400ms      2100ms      +700ms    REGRESSION
Total Requests      42          58          +16       WARNING
Transfer Size       1.2MB       1.8MB       +0.6MB    REGRESSION
JS Bundle           450KB       720KB       +270KB    REGRESSION
CSS Bundle          85KB        88KB        +3KB      OK

REGRESSIONS DETECTED: 4
[1] LCP doubled (800ms → 1600ms) — likely a large new image or blocking resource
[2] Full load +700ms (1400ms → 2100ms) — over the 500ms absolute threshold
[3] Total transfer +50% (1.2MB → 1.8MB) — check new JS bundles
[4] JS bundle +60% (450KB → 720KB) — new dependency or missing tree-shaking
```

**Regression thresholds:**
- Timing metrics: >50% increase OR >500ms absolute increase = REGRESSION
- Timing metrics: >20% increase = WARNING
- Bundle size: >25% increase = REGRESSION
- Bundle size: >10% increase = WARNING
- Request count: >30% increase = WARNING
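
The thresholds above map directly onto a small classifier. The following is a sketch under stated assumptions: `classify` and the `kind` labels (`timing`, `size`, `requests`) are hypothetical names, and treating a zero baseline as "no percentage change" is a choice made here that the thresholds do not specify.

```javascript
// Hypothetical helper: apply the regression thresholds to one metric.
// kind is "timing" (milliseconds), "size" (bytes) or "requests" (count).
function classify(kind, baseline, current) {
  const delta = current - baseline;
  // Assumption: with a zero baseline only absolute rules can fire.
  const pct = baseline > 0 ? (delta / baseline) * 100 : 0;
  if (kind === "timing") {
    if (pct > 50 || delta > 500) return "REGRESSION";
    if (pct > 20) return "WARNING";
  } else if (kind === "size") {
    if (pct > 25) return "REGRESSION";
    if (pct > 10) return "WARNING";
  } else if (kind === "requests") {
    if (pct > 30) return "WARNING";
  }
  return "OK";
}

console.log(classify("timing", 800, 1600));    // LCP doubled
console.log(classify("timing", 1400, 2100));   // exactly +50%, but +700ms absolute
console.log(classify("size", 450000, 720000)); // JS bundle +60%
console.log(classify("requests", 42, 58));     // +38% requests
```

The absolute 500ms rule matters for slow pages: a jump from 1400ms to 2100ms is exactly +50%, so the percentage rule alone would not fire.
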

### Phase 6: Slowest Resources

```
TOP 10 SLOWEST RESOURCES
═════════════════════════
#   Resource                  Type      Size      Duration
1   vendor.chunk.js           script    320KB     480ms
2   main.js                   script    250KB     320ms
3   hero-image.webp           img       180KB     280ms
4   analytics.js              script    45KB      250ms    ← third-party
5   fonts/inter-var.woff2     font      95KB      180ms
...

RECOMMENDATIONS:
- vendor.chunk.js: Consider code-splitting — 320KB is large for initial load
- analytics.js: Load async/defer — blocks rendering for 250ms
- hero-image.webp: Add width/height to prevent CLS, consider lazy loading
```
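
A ranking like the one above could be produced from the raw resource entries as follows. This is only a sketch: `slowestResources` is a hypothetical helper, and flagging a resource as third-party whenever its origin differs from the page origin is an assumption, since the report format does not say how that marker is decided.

```javascript
// Hypothetical helper: rank resources by duration and flag third-party origins.
function slowestResources(entries, pageOrigin, limit = 10) {
  return [...entries]
    .sort((a, b) => b.duration - a.duration)
    .slice(0, limit)
    .map((e, i) => {
      const url = new URL(e.name);
      return {
        rank: i + 1,
        name: url.pathname.split("/").pop(),
        type: e.initiatorType,
        size_kb: Math.round((e.transferSize || 0) / 1000),
        duration_ms: Math.round(e.duration),
        third_party: url.origin !== pageOrigin, // assumption: origin mismatch
      };
    });
}

// Made-up sample entries.
const slowSample = [
  { name: "https://example.com/assets/main.js", initiatorType: "script", transferSize: 250000, duration: 320 },
  { name: "https://cdn.analytics.example/analytics.js", initiatorType: "script", transferSize: 45000, duration: 250 },
  { name: "https://example.com/assets/vendor.chunk.js", initiatorType: "script", transferSize: 320000, duration: 480 },
];
console.log(slowestResources(slowSample, "https://example.com", 2));
```
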

### Phase 7: Performance Budget

Check against industry budgets:

```
PERFORMANCE BUDGET CHECK
════════════════════════
Metric              Budget      Actual      Status
────────            ──────      ──────      ──────
FCP                 < 1.8s      0.48s       PASS
LCP                 < 2.5s      1.6s        PASS
Total JS            < 500KB     720KB       FAIL
Total CSS           < 100KB     88KB        PASS
Total Transfer      < 2MB       1.8MB       WARNING (90%)
HTTP Requests       < 50        58          FAIL

Grade: B (4/6 passing)
```
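
The budget check can be sketched as a table-driven comparison. The budget values below are taken from the report above; `checkBudgets`, the metric keys, and the rule that a value at 90% or more of its budget is a WARNING are assumptions inferred from the sample row "WARNING (90%)". The letter grade is not computed because its mapping is not defined here.

```javascript
// Budgets from the sample report, in milliseconds, bytes and request counts.
const BUDGETS = {
  fcp_ms: 1800,
  lcp_ms: 2500,
  js_bundle_bytes: 500000,
  css_bundle_bytes: 100000,
  total_transfer_bytes: 2000000,
  total_requests: 50,
};

// Hypothetical helper: compare actual values against budgets.
function checkBudgets(actuals, budgets = BUDGETS, warnRatio = 0.9) {
  const rows = Object.entries(budgets).map(([metric, budget]) => {
    const actual = actuals[metric];
    const ratio = actual / budget;
    // At or over budget fails; assumption: at or over warnRatio warns.
    const status = ratio >= 1 ? "FAIL" : ratio >= warnRatio ? "WARNING" : "PASS";
    return { metric, budget, actual, status };
  });
  // A WARNING still counts as passing, matching "4/6 passing" in the sample.
  const passing = rows.filter((r) => r.status !== "FAIL").length;
  return { rows, passing, total: rows.length };
}

const budgetSample = {
  fcp_ms: 480,
  lcp_ms: 1600,
  js_bundle_bytes: 720000,
  css_bundle_bytes: 88000,
  total_transfer_bytes: 1800000,
  total_requests: 58,
};
const budgetResult = checkBudgets(budgetSample);
console.log(budgetResult.rows.map((r) => r.metric + ": " + r.status).join("\n"));
console.log(budgetResult.passing + "/" + budgetResult.total + " passing");
```
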

### Phase 8: Trend Analysis (--trend mode)

Load the historical benchmark reports saved under `.gstack/benchmark-reports/` and show trends:

```
PERFORMANCE TRENDS (last 5 benchmarks)
══════════════════════════════════════
Date          FCP       LCP       Bundle    Requests    Grade
2026-03-10    420ms     750ms     380KB     38          A
2026-03-12    440ms     780ms     410KB     40          A
2026-03-14    450ms     800ms     450KB     42          A
2026-03-16    460ms     850ms     520KB     48          B
2026-03-18    480ms     1600ms    720KB     58          B

TREND: Performance degrading. LCP doubled in 8 days.
       JS bundle grew 340KB (380KB → 720KB) over the same period. Investigate.
```
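
The trend summary boils down to comparing the oldest and newest data points. A minimal sketch, assuming each historical record carries an ISO `date` plus the metric keys used in the baseline file; `summarizeTrend` and the record shape are hypothetical.

```javascript
// Hypothetical helper: compare the first and last benchmark in a history.
function summarizeTrend(history) {
  // ISO dates sort correctly as plain strings.
  const sorted = [...history].sort((a, b) => a.date.localeCompare(b.date));
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  const days = Math.round((Date.parse(last.date) - Date.parse(first.date)) / 86400000);
  const change = (key) => ({
    from: first[key],
    to: last[key],
    ratio: Number((last[key] / first[key]).toFixed(2)),
  });
  return { days, lcp_ms: change("lcp_ms"), js_bundle_bytes: change("js_bundle_bytes") };
}

// Values taken from the sample trend table.
const trendSample = [
  { date: "2026-03-10", lcp_ms: 750, js_bundle_bytes: 380000 },
  { date: "2026-03-14", lcp_ms: 800, js_bundle_bytes: 450000 },
  { date: "2026-03-18", lcp_ms: 1600, js_bundle_bytes: 720000 },
];
console.log(summarizeTrend(trendSample));
```
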

### Phase 9: Save Report

Write to `.gstack/benchmark-reports/{date}-benchmark.md` and `.gstack/benchmark-reports/{date}-benchmark.json`.

## Important Rules

- **Measure, don't guess.** Use actual `performance.getEntries()` data, not estimates.
- **Baseline is essential.** Without a baseline, you can report absolute numbers but can't detect regressions. Always encourage baseline capture.
- **Relative thresholds, not absolute.** 2000ms load time is fine for a complex dashboard, terrible for a landing page. Compare against YOUR baseline.
- **Third-party scripts are context.** Flag them, but the user can't fix Google Analytics being slow. Focus recommendations on first-party resources.
- **Bundle size is the leading indicator.** Load time varies with network. Bundle size is deterministic. Track it religiously.
- **Read-only.** Produce the report. Don't modify code unless explicitly asked.