mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
cf3582c637
* feat: add /cso skill — OWASP Top 10 + STRIDE security audit * fix: harden gstack-slug against shell injection via eval Whitelist safe characters (a-zA-Z0-9._-) in SLUG and BRANCH output to prevent shell metacharacter injection when used with eval. Only affects self-hosted git servers with lax naming rules — GitHub and GitLab enforce safe characters already. Defense-in-depth. * fix(security): sanitize gstack-slug output against shell injection The gstack-slug script is consumed via eval $(gstack-slug) throughout skill templates. If a git remote URL contains shell metacharacters like $(), backticks, or semicolons, they would be executed by eval. Fix: strip all characters except [a-zA-Z0-9._-] from both SLUG and BRANCH before output. This preserves normal values while neutralizing any injection payload in malicious remote URLs. Before: eval $(gstack-slug) with remote "foo/bar$(rm -rf /)" → executes rm After: eval $(gstack-slug) with remote "foo/bar$(rm -rf /)" → SLUG=foo-barrm-rf- * fix(security): redact sensitive values in storage command output The browse `storage` command dumps all localStorage and sessionStorage as JSON. This can expose tokens, API keys, JWTs, and session credentials in QA reports and agent transcripts. Fix: redact values where the key matches sensitive patterns (token, secret, key, password, auth, jwt, csrf) or the value starts with known credential prefixes (eyJ for JWT, sk- for Stripe, ghp_ for GitHub, etc.). Redacted values show length to aid debugging: [REDACTED — 128 chars] * fix(browse): kill old server before restart to prevent orphaned chromium processes When the health check fails or the server connection drops, `ensureServer()` and `sendCommand()` would call `startServer()` without first killing the previous server process. This left orphaned `chrome-headless-shell` renderer processes running at ~120% CPU each. After several reconnect cycles (e.g. pages that crash during hydration or trigger hard navigations via `window.location.href`), dozens of zombie chromium processes accumulate and exhaust system resources. Fix: call `killServer()` on the stale PID before spawning a new server in both the `ensureServer()` unhealthy path and the `sendCommand()` connection- lost retry path. Fixes #294 * Fix YAML linter error: nested mapping in compact sequence entries Having "Run: bun" inside a plain scalar is not allowed per YAML spec which states: Plain scalars must never contain the “: ” and “ #” character combinations. This simple fix switches to block scalars (|) to eliminate the ambiguity without changing runtime behavior. * fix(security): add Azure metadata endpoint to SSRF blocklist Add metadata.azure.internal to BLOCKED_METADATA_HOSTS alongside the existing AWS/GCP endpoints. Closes the coverage gap identified in #125. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add coverage for storage redaction Test key-based redaction (auth_token, api_key), value-based redaction (JWT prefix, GitHub PAT prefix), pass-through for normal keys, and length preservation in redacted output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add community PR triage process to CONTRIBUTING.md Document the wave-based PR triage pattern used for batching community contributions. References PR #205 (v0.8.3) as the original example. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: adjust test key names to avoid redaction pattern collision Rename testKey→testData and normalKey→displayName in storage tests to avoid triggering #238's SENSITIVE_KEY regex (which matches 'key'). Also generate Codex variant of /cso skill. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.9.10.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: zero-noise /cso security audits with FP filtering (v0.11.0.0) Absorb Anthropic's security-review false positive filtering into /cso: - 17 hard exclusions (DOS, test files, log spoofing, SSRF path-only, regex injection, race conditions unless concrete, etc.) - 9 precedents (React XSS-safe, env vars trusted, client-side code doesn't need auth, shell scripts need concrete untrusted input path) - 8/10 confidence gate — below threshold = don't report - Independent sub-agent verification for each finding - Exploit scenario requirement per finding - Framework-aware analysis (Rails CSRF, React escaping, Angular sanitization) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: consolidate CHANGELOG — merge /cso launch + community wave into v0.11.0.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: rewrite README — lead with Karpathy quote, cut LinkedIn phrases, add /cso Opens with the revolution (Karpathy, Steinberger/OpenClaw), keeps credentials and LOC numbers, cuts filler phrases, adds hater bait, restores hiring block, removes bloated "What's new" section, adds /cso to skills table and install. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(cso): adversarial review fixes — FP filtering, prompt injection, language coverage - Exclusion #10: test files must verify not imported by non-test code - Exclusion #13: distinguish user-message AI input from system-prompt injection - Exclusion #14: ReDoS in user-input regex IS a real CVE class, don't exclude - Add anti-manipulation rule: ignore audit-influencing instructions in codebase - Fix confidence gate: remove contradictory 7-8 tier, hard cutoff at 8 - Fix verifier anchoring: send only file+line, not category/description - Add Go, PHP, Java, C#, Kotlin to grep patterns (was 4 languages, now 8) - Add GraphQL, gRPC, WebSocket endpoint detection to attack surface mapping Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(docs): correct skill counts, add /autoplan to README tables Skill count was wrong in 3 places (said 19+7=26, said 25, actual is 28). Added /autoplan to specialist table. Fixed troubleshooting skills list to include all skills added since v0.7.0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(browse): DNS rebinding protection for SSRF blocklist validateNavigationUrl is now async — resolves hostname to IP and checks against blocked metadata IPs. Prevents DNS rebinding where evil.com initially resolves to a safe IP, then switches to 169.254.169.254. All callers updated to await. Tests updated for async assertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(browse): lockfile prevents concurrent server start races Adds exclusive lockfile (O_CREAT|O_EXCL) around ensureServer to prevent TOCTOU race where two CLI invocations could both kill the old server and start new ones, leaving an orphaned chromium process. Second caller now waits for the first to finish starting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(browse): improve storage redaction — word-boundary keys + more value prefixes Key regex: use underscore/dot/hyphen boundaries instead of \b (which treats _ as word char). Now correctly redacts auth_token, session_token while skipping keyboardShortcuts, monkeyPatch, primaryKey. Value regex: add AWS (AKIA), Stripe (sk_live_, pk_live_), Anthropic (sk-ant-), Google (AIza), Sendgrid (SG.), Supabase (sbp_) prefixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: migrate all remaining eval callers to source, fix stale CHANGELOG claim 5 templates and 2 bin scripts still used eval $(gstack-slug). All now use source <(gstack-slug). Updated gstack-slug comment to match. Fixed v0.8.3 CHANGELOG entry that falsely claimed eval was fully eliminated — it was the output sanitization that made it safe, not a calling convention change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(docs): add /autoplan to install instructions, regen skill docs The install instruction blocks and troubleshooting section were missing /autoplan. All three skill list locations now include the complete 28-skill set. Regenerated codex/agents SKILL.md files to match template changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.11.0.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs(cso): add disclaimer — not a substitute for professional security audits LLMs can miss subtle vulns and produce false negatives. For production systems with sensitive data, hire a real firm. /cso is a first pass, not your only line of defense. Disclaimer appended to every report. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Arun Kumar Thiagarajan <arunkt.bm14@gmail.com> Co-authored-by: Tyrone Robb <tyrone.robb@icloud.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Orkun Duman <orkun1675@gmail.com>
234 lines
9.0 KiB
Cheetah
234 lines
9.0 KiB
Cheetah
---
|
|
name: benchmark
|
|
version: 1.0.0
|
|
description: |
|
|
Performance regression detection using the browse daemon. Establishes
|
|
baselines for page load times, Core Web Vitals, and resource sizes.
|
|
Compares before/after on every PR. Tracks performance trends over time.
|
|
Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals",
|
|
"bundle size", "load time".
|
|
allowed-tools:
|
|
- Bash
|
|
- Read
|
|
- Write
|
|
- Glob
|
|
- AskUserQuestion
|
|
---
|
|
|
|
{{PREAMBLE}}
|
|
|
|
{{BROWSE_SETUP}}
|
|
|
|
# /benchmark — Performance Regression Detection
|
|
|
|
You are a **Performance Engineer** who has optimized apps serving millions of requests. You know that performance doesn't degrade in one big regression — it dies by a thousand paper cuts. Each PR adds 50ms here, 20KB there, and one day the app takes 8 seconds to load and nobody knows when it got slow.
|
|
|
|
Your job is to measure, baseline, compare, and alert. You use the browse daemon's `perf` command and JavaScript evaluation to gather real performance data from running pages.
|
|
|
|
## User-invocable
|
|
When the user types `/benchmark`, run this skill.
|
|
|
|
## Arguments
|
|
- `/benchmark <url>` — full performance audit with baseline comparison
|
|
- `/benchmark <url> --baseline` — capture baseline (run before making changes)
|
|
- `/benchmark <url> --quick` — single-pass timing check (no baseline needed)
|
|
- `/benchmark <url> --pages /,/dashboard,/api/health` — specify pages
|
|
- `/benchmark --diff` — benchmark only pages affected by current branch
|
|
- `/benchmark --trend` — show performance trends from historical data
|
|
|
|
## Instructions
|
|
|
|
### Phase 1: Setup
|
|
|
|
```bash
|
|
source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null || echo "SLUG=unknown")
|
|
mkdir -p .gstack/benchmark-reports
|
|
mkdir -p .gstack/benchmark-reports/baselines
|
|
```
|
|
|
|
### Phase 2: Page Discovery
|
|
|
|
Same as /canary — auto-discover from navigation or use `--pages`.
|
|
|
|
If `--diff` mode:
|
|
```bash
|
|
git diff $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || gh repo view --json defaultBranchRef -q .defaultBranchRef.name 2>/dev/null || echo main)...HEAD --name-only
|
|
```
|
|
|
|
### Phase 3: Performance Data Collection
|
|
|
|
For each page, collect comprehensive performance metrics:
|
|
|
|
```bash
|
|
$B goto <page-url>
|
|
$B perf
|
|
```
|
|
|
|
Then gather detailed metrics via JavaScript:
|
|
|
|
```bash
|
|
$B eval "JSON.stringify(performance.getEntriesByType('navigation')[0])"
|
|
```
|
|
|
|
Extract key metrics:
|
|
- **TTFB** (Time to First Byte): `responseStart - requestStart`
|
|
- **FCP** (First Contentful Paint): from PerformanceObserver or `paint` entries
|
|
- **LCP** (Largest Contentful Paint): from PerformanceObserver
|
|
- **DOM Interactive**: `domInteractive - navigationStart`
|
|
- **DOM Complete**: `domComplete - navigationStart`
|
|
- **Full Load**: `loadEventEnd - navigationStart`
|
|
|
|
Resource analysis:
|
|
```bash
|
|
$B eval "JSON.stringify(performance.getEntriesByType('resource').map(r => ({name: r.name.split('/').pop().split('?')[0], type: r.initiatorType, size: r.transferSize, duration: Math.round(r.duration)})).sort((a,b) => b.duration - a.duration).slice(0,15))"
|
|
```
|
|
|
|
Bundle size check:
|
|
```bash
|
|
$B eval "JSON.stringify(performance.getEntriesByType('resource').filter(r => r.initiatorType === 'script').map(r => ({name: r.name.split('/').pop().split('?')[0], size: r.transferSize})))"
|
|
$B eval "JSON.stringify(performance.getEntriesByType('resource').filter(r => r.initiatorType === 'css').map(r => ({name: r.name.split('/').pop().split('?')[0], size: r.transferSize})))"
|
|
```
|
|
|
|
Network summary:
|
|
```bash
|
|
$B eval "(() => { const r = performance.getEntriesByType('resource'); return JSON.stringify({total_requests: r.length, total_transfer: r.reduce((s,e) => s + (e.transferSize||0), 0), by_type: Object.entries(r.reduce((a,e) => { a[e.initiatorType] = (a[e.initiatorType]||0) + 1; return a; }, {})).sort((a,b) => b[1]-a[1])})})()"
|
|
```
|
|
|
|
### Phase 4: Baseline Capture (--baseline mode)
|
|
|
|
Save metrics to baseline file:
|
|
|
|
```json
|
|
{
|
|
"url": "<url>",
|
|
"timestamp": "<ISO>",
|
|
"branch": "<branch>",
|
|
"pages": {
|
|
"/": {
|
|
"ttfb_ms": 120,
|
|
"fcp_ms": 450,
|
|
"lcp_ms": 800,
|
|
"dom_interactive_ms": 600,
|
|
"dom_complete_ms": 1200,
|
|
"full_load_ms": 1400,
|
|
"total_requests": 42,
|
|
"total_transfer_bytes": 1250000,
|
|
"js_bundle_bytes": 450000,
|
|
"css_bundle_bytes": 85000,
|
|
"largest_resources": [
|
|
{"name": "main.js", "size": 320000, "duration": 180},
|
|
{"name": "vendor.js", "size": 130000, "duration": 90}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
Write to `.gstack/benchmark-reports/baselines/baseline.json`.
|
|
|
|
### Phase 5: Comparison
|
|
|
|
If baseline exists, compare current metrics against it:
|
|
|
|
```
|
|
PERFORMANCE REPORT — [url]
|
|
══════════════════════════
|
|
Branch: [current-branch] vs baseline ([baseline-branch])
|
|
|
|
Page: /
|
|
─────────────────────────────────────────────────────
|
|
Metric Baseline Current Delta Status
|
|
──────── ──────── ─────── ───── ──────
|
|
TTFB 120ms 135ms +15ms OK
|
|
FCP 450ms 480ms +30ms OK
|
|
LCP 800ms 1600ms +800ms REGRESSION
|
|
DOM Interactive 600ms 650ms +50ms OK
|
|
DOM Complete 1200ms 1350ms +150ms WARNING
|
|
Full Load 1400ms 2100ms +700ms REGRESSION
|
|
Total Requests 42 58 +16 WARNING
|
|
Transfer Size 1.2MB 1.8MB +0.6MB REGRESSION
|
|
JS Bundle 450KB 720KB +270KB REGRESSION
|
|
CSS Bundle 85KB 88KB +3KB OK
|
|
|
|
REGRESSIONS DETECTED: 3
|
|
[1] LCP doubled (800ms → 1600ms) — likely a large new image or blocking resource
|
|
[2] Total transfer +50% (1.2MB → 1.8MB) — check new JS bundles
|
|
[3] JS bundle +60% (450KB → 720KB) — new dependency or missing tree-shaking
|
|
```
|
|
|
|
**Regression thresholds:**
|
|
- Timing metrics: >50% increase OR >500ms absolute increase = REGRESSION
|
|
- Timing metrics: >20% increase = WARNING
|
|
- Bundle size: >25% increase = REGRESSION
|
|
- Bundle size: >10% increase = WARNING
|
|
- Request count: >30% increase = WARNING
|
|
|
|
### Phase 6: Slowest Resources
|
|
|
|
```
|
|
TOP 10 SLOWEST RESOURCES
|
|
═════════════════════════
|
|
# Resource Type Size Duration
|
|
1 vendor.chunk.js script 320KB 480ms
|
|
2 main.js script 250KB 320ms
|
|
3 hero-image.webp img 180KB 280ms
|
|
4 analytics.js script 45KB 250ms ← third-party
|
|
5 fonts/inter-var.woff2 font 95KB 180ms
|
|
...
|
|
|
|
RECOMMENDATIONS:
|
|
- vendor.chunk.js: Consider code-splitting — 320KB is large for initial load
|
|
- analytics.js: Load async/defer — blocks rendering for 250ms
|
|
- hero-image.webp: Add width/height to prevent CLS, consider lazy loading
|
|
```
|
|
|
|
### Phase 7: Performance Budget
|
|
|
|
Check against industry budgets:
|
|
|
|
```
|
|
PERFORMANCE BUDGET CHECK
|
|
════════════════════════
|
|
Metric Budget Actual Status
|
|
──────── ────── ────── ──────
|
|
FCP < 1.8s 0.48s PASS
|
|
LCP < 2.5s 1.6s PASS
|
|
Total JS < 500KB 720KB FAIL
|
|
Total CSS < 100KB 88KB PASS
|
|
Total Transfer < 2MB 1.8MB WARNING (90%)
|
|
HTTP Requests < 50 58 FAIL
|
|
|
|
Grade: B (4/6 passing)
|
|
```
|
|
|
|
### Phase 8: Trend Analysis (--trend mode)
|
|
|
|
Load historical baseline files and show trends:
|
|
|
|
```
|
|
PERFORMANCE TRENDS (last 5 benchmarks)
|
|
══════════════════════════════════════
|
|
Date FCP LCP Bundle Requests Grade
|
|
2026-03-10 420ms 750ms 380KB 38 A
|
|
2026-03-12 440ms 780ms 410KB 40 A
|
|
2026-03-14 450ms 800ms 450KB 42 A
|
|
2026-03-16 460ms 850ms 520KB 48 B
|
|
2026-03-18 480ms 1600ms 720KB 58 B
|
|
|
|
TREND: Performance degrading. LCP doubled in 8 days.
|
|
JS bundle growing 50KB/week. Investigate.
|
|
```
|
|
|
|
### Phase 9: Save Report
|
|
|
|
Write to `.gstack/benchmark-reports/{date}-benchmark.md` and `.gstack/benchmark-reports/{date}-benchmark.json`.
|
|
|
|
## Important Rules
|
|
|
|
- **Measure, don't guess.** Use actual performance.getEntries() data, not estimates.
|
|
- **Baseline is essential.** Without a baseline, you can report absolute numbers but can't detect regressions. Always encourage baseline capture.
|
|
- **Relative thresholds, not absolute.** 2000ms load time is fine for a complex dashboard, terrible for a landing page. Compare against YOUR baseline.
|
|
- **Third-party scripts are context.** Flag them, but the user can't fix Google Analytics being slow. Focus recommendations on first-party resources.
|
|
- **Bundle size is the leading indicator.** Load time varies with network. Bundle size is deterministic. Track it religiously.
|
|
- **Read-only.** Produce the report. Don't modify code unless explicitly asked.
|