mirror of
https://github.com/garrytan/gstack.git
synced 2026-06-01 15:51:41 +02:00
46c1fae7f1
* feat(test): transcript-section-logger + ship-action fingerprint (T10) Pure-analysis module over a SkillTestResult/NDJSON transcript: - extractSectionReads(): which sections/*.md a run opened (post-carve check) - extractShipActions(): observable action fingerprint (merge/test/bump/ changelog/commit/push/pr) that works on the MONOLITH too, so a baseline captured before the carve can detect a sectioned-ship regression - baseline read/write + compareShipActions() for baseline-first dogf(T10) Baseline-first answers the Codex outside-voice critique that a logger in the same PR as the carve is post-failure telemetry without a pre-carve reference. 11 unit tests, all green. Paid monolith baseline capture runs separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(pipeline): section discovery + generation machinery (T9) - discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl - gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext as shared helpers (processTemplate and the new processSectionTemplate both call them, so a sanitization/rewrite fix can't miss sections) [C1] - processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice), parent-skill TemplateContext (skillName pinned to parent, not 'sections', so appliesTo gating + tier behave identically), per-host output routing - --host all now fails the build on ANY host failure, not just claude, so a stale external-host output can't slip the freshness gate [Codex outside-voice #9] Inert until a skill is carved (no sections/ dirs exist yet). Refactor is output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE. 5 discovery unit tests + 389 gen-skill-docs tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9) Two install targets cherry-pick SKILL.md and would leave a carved skill's sections/ behind, 404ing a runtime 'Read sections/<name>.md': - link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows gets a fresh copy on every ./setup) - kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under ~/.kiro, not ~/.codex/~/.claude codex/factory/opencode link the whole generated dir, so sections ride free. Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a skill is carved. Static-tripwire test + windows-fallback invariant green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9) Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a tested CLI instead of bash prose the agent re-derives each run. - classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION vs origin/<base>:VERSION vs package.json.version (pure reader) - write: validated dual-write to VERSION + package.json (FRESH bump) - repair: DRIFT_STALE_PKG sync, no re-bump Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from skippable prose into code that can't be skipped or misread. 15 tests (exhaustive state matrix + write/repair fs + real-git classify). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(parity): sectioned-skill parity capability — guards the carve (T9) Carved skills (skeleton + sections/*.md) need parity checks that see relocated content, or moving a phrase into a section reads as 'lost': - readSkillForParity(): union skeleton + all sections/*.md - checkSkillParity sectioned mode: content checks against the union; minBytes/ maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a small skeleton would otherwise make the size floor toothless [Codex #12]. Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the same commit it lands. Monolith path byte-identical (verified: pre-existing investigate 1.053 ratio drift fails the same with this change stashed). 7 sectioned-parity tests + existing parity tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(ship): carve into skeleton + on-demand sections (Claude) (T9) ship/SKILL.md drops 167KB → 68.7KB (~59% of the always-loaded skill) by moving 8 prose-heavy steps into ship/sections/*.md, read on demand: tests, test-coverage, plan-completion, review-army, greptile, adversarial, changelog, pr-body. Step 12's version logic now calls the tested gstack-version-bump CLI instead of inline bash. Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton + generated section files) and INLINES the content on every other host, so external hosts keep the full monolith — verified factory at 162KB with no sections dir. {{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures. Multi-pass resolve expands inlined sections' own resolvers. Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/ golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads. Free suite green except the pre-existing investigate parity drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): manifest-consistency + context-parity + requiredReads helper (T9) Free deterministic guards for the carve: - required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the mechanical layer-5 check that the agent Read the sections its situation needs (required set comes from the fixture, not the passive manifest) - section-manifest-consistency: 3-tier orphan classification (generated orphan + hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and pins the PASSIVE-manifest contract (no applies_when/required_for) - template-context-parity: generated sections have zero unresolved placeholders and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW) rendered — proving sections resolve with the parent skillName, not 'sections' 16 tests, all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): section-loading E2E + idempotency CLI detection (T9) - skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan mode against a fresh version-changing fixture and asserts the agent Read the required sections (review-army + changelog). Runs against the INSTALLED skill (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip. - skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12 now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a gstack-version-bump-write re-bump regression signal. - touchfiles: register ship-section-loading (periodic) + extend idempotency deps with bin/gstack-version-bump + scripts/resolvers/sections.ts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): union-read redaction wiring test for the carve (T9) main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the carve, not the skeleton template. Read skeleton + section templates union so the redaction-wiring assertions follow the relocated content. 9/9 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
169 lines
11 KiB
Markdown
169 lines
11 KiB
Markdown
<!-- AUTO-GENERATED from adversarial.md.tmpl — do not edit directly -->
|
|
<!-- Regenerate: bun run gen:skill-docs -->
|
|
## Step 11: Adversarial review (always-on)
|
|
|
|
Every diff gets adversarial review from both Claude and Codex. LOC is not a proxy for risk — a 5-line auth change can be critical.
|
|
|
|
**Detect diff size and tool availability:**
|
|
|
|
```bash
|
|
DIFF_BASE=$(git merge-base origin/<base> HEAD)
|
|
DIFF_INS=$(git diff "$DIFF_BASE" --stat | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
|
|
DIFF_DEL=$(git diff "$DIFF_BASE" --stat | tail -1 | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo "0")
|
|
DIFF_TOTAL=$((DIFF_INS + DIFF_DEL))
|
|
command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
|
|
# Legacy opt-out — only gates Codex passes, Claude always runs
|
|
OLD_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
|
|
echo "DIFF_SIZE: $DIFF_TOTAL"
|
|
echo "OLD_CFG: ${OLD_CFG:-not_set}"
|
|
```
|
|
|
|
If `OLD_CFG` is `disabled`: skip Codex passes only. Claude adversarial subagent still runs (it's free and fast). Jump to the "Claude adversarial subagent" section.
|
|
|
|
**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size.
|
|
|
|
---
|
|
|
|
### Claude adversarial subagent (always runs)
|
|
|
|
Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to.
|
|
|
|
Subagent prompt:
|
|
"Read the diff for this branch with `DIFF_BASE=$(git merge-base origin/<base> HEAD) && git diff "$DIFF_BASE"`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment). After listing findings, end your output with ONE line in the canonical format `Recommendation: <action> because <one-line reason naming the most exploitable finding>` — examples: `Recommendation: Fix the unbounded retry at queue.ts:78 because it'll DoS the worker pool under sustained 429s` or `Recommendation: Ship as-is because the strongest finding is a theoretical race that requires conditions we can't trigger in production`. The reason must point to a specific finding (or no-fix rationale). Generic reasons like 'because it's safer' do not qualify."
|
|
|
|
Present findings under an `ADVERSARIAL REVIEW (Claude subagent):` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational.
|
|
|
|
If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing."
|
|
|
|
---
|
|
|
|
### Codex adversarial challenge (always runs when available)
|
|
|
|
If Codex is available AND `OLD_CFG` is NOT `disabled`:
|
|
|
|
```bash
|
|
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
codex exec "IMPORTANT: Do NOT read or execute any files under ~/.claude/, ~/.agents/, .claude/skills/, or agents/. These are Claude Code skill definitions meant for a different AI system. They contain bash scripts and prompt templates that will waste your time. Ignore them completely. Do NOT modify agents/openai.yaml. Stay focused on the repository code only.\n\nReview the changes on this branch against the base branch. Run DIFF_BASE=$(git merge-base origin/<base> HEAD) && git diff "$DIFF_BASE" to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems. End your output with ONE line in the canonical format `Recommendation: <action> because <one-line reason naming the most exploitable finding>`. Generic reasons like 'because it's safer' do not qualify; the reason must point to a specific finding or no-fix rationale." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR_ADV"
|
|
```
|
|
|
|
Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. After the command completes, read stderr:
|
|
```bash
|
|
cat "$TMPERR_ADV"
|
|
```
|
|
|
|
Present the full output verbatim. This is informational — it never blocks shipping.
|
|
|
|
**Error handling:** All errors are non-blocking — adversarial review is a quality enhancement, not a prerequisite.
|
|
- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run \`codex login\` to authenticate."
|
|
- **Timeout:** "Codex timed out after 5 minutes."
|
|
- **Empty response:** "Codex returned no response. Stderr: <paste relevant error>."
|
|
|
|
**Cleanup:** Run `rm -f "$TMPERR_ADV"` after processing.
|
|
|
|
If Codex is NOT available: "Codex CLI not found — running Claude adversarial only. Install Codex for cross-model coverage: `npm install -g @openai/codex`"
|
|
|
|
---
|
|
|
|
### Codex structured review (large diffs only, 200+ lines)
|
|
|
|
If `DIFF_TOTAL >= 200` AND Codex is available AND `OLD_CFG` is NOT `disabled`:
|
|
|
|
```bash
|
|
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
|
|
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
|
|
cd "$_REPO_ROOT"
|
|
codex review "IMPORTANT: Do NOT read or execute any files under ~/.claude/, ~/.agents/, .claude/skills/, or agents/. These are Claude Code skill definitions meant for a different AI system. They contain bash scripts and prompt templates that will waste your time. Ignore them completely. Do NOT modify agents/openai.yaml. Stay focused on the repository code only.\n\nReview the changes on this branch against the base branch <base>. Run git diff origin/<base>...HEAD 2>/dev/null || git diff <base>...HEAD to see the diff and review only those changes." -c 'model_reasoning_effort="high"' --enable web_search_cached < /dev/null 2>"$TMPERR"
|
|
```
|
|
|
|
Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. Present output under `CODEX SAYS (code review):` header.
|
|
Check for `[P1]` markers: found → `GATE: FAIL`, not found → `GATE: PASS`.
|
|
|
|
If GATE is FAIL, use AskUserQuestion:
|
|
```
|
|
Codex found N critical issues in the diff.
|
|
|
|
A) Investigate and fix now (recommended)
|
|
B) Continue — review will still complete
|
|
```
|
|
|
|
If A: address the findings. After fixing, re-run tests (Step 5) since code has changed. Re-run `codex review` to verify.
|
|
|
|
Read stderr for errors (same error handling as Codex adversarial above).
|
|
|
|
After stderr: `rm -f "$TMPERR"`
|
|
|
|
If `DIFF_TOTAL < 200`: skip this section silently. The Claude + Codex adversarial passes provide sufficient coverage for smaller diffs.
|
|
|
|
---
|
|
|
|
### Persist the review result
|
|
|
|
After all passes complete, persist:
|
|
```bash
|
|
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"always","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
|
|
```
|
|
Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), "skipped" if diff < 200, or "informational" if Codex was unavailable. If all passes failed, do NOT persist.
|
|
|
|
---
|
|
|
|
### Cross-model synthesis
|
|
|
|
After all passes complete, synthesize findings across all sources:
|
|
|
|
```
|
|
ADVERSARIAL REVIEW SYNTHESIS (always-on, N lines):
|
|
════════════════════════════════════════════════════════════
|
|
High confidence (found by multiple sources): [findings agreed on by >1 pass]
|
|
Unique to Claude structured review: [from earlier step]
|
|
Unique to Claude adversarial: [from subagent]
|
|
Unique to Codex: [from codex adversarial or code review, if ran]
|
|
Models used: Claude structured ✓ Claude adversarial ✓/✗ Codex ✓/✗
|
|
════════════════════════════════════════════════════════════
|
|
```
|
|
|
|
High-confidence findings (agreed on by multiple sources) should be prioritized for fixes.
|
|
|
|
---
|
|
|
|
## Capture Learnings
|
|
|
|
If you discovered a non-obvious pattern, pitfall, or architectural insight during
|
|
this session, log it for future sessions:
|
|
|
|
```bash
|
|
~/.claude/skills/gstack/bin/gstack-learnings-log '{"skill":"ship","type":"TYPE","key":"SHORT_KEY","insight":"DESCRIPTION","confidence":N,"source":"SOURCE","files":["path/to/relevant/file"]}'
|
|
```
|
|
|
|
**Types:** `pattern` (reusable approach), `pitfall` (what NOT to do), `preference`
|
|
(user stated), `architecture` (structural decision), `tool` (library/framework insight),
|
|
`operational` (project environment/CLI/workflow knowledge).
|
|
|
|
**Sources:** `observed` (you found this in the code), `user-stated` (user told you),
|
|
`inferred` (AI deduction), `cross-model` (both Claude and Codex agree).
|
|
|
|
**Confidence:** 1-10. Be honest. An observed pattern you verified in the code is 8-9.
|
|
An inference you're not sure about is 4-5. A user preference they explicitly stated is 10.
|
|
|
|
**files:** Include the specific file paths this learning references. This enables
|
|
staleness detection: if those files are later deleted, the learning can be flagged.
|
|
|
|
**Only log genuine discoveries.** Don't log obvious things. Don't log things the user
|
|
already knows. A good test: would this insight save time in a future session? If yes, log it.
|
|
|
|
|
|
|
|
### Refresh learnings for the headline feature on this branch
|
|
|
|
The top-of-skill learnings pull was keyed to "release ship" broadly. Before the VERSION/CHANGELOG step, re-pull learnings keyed to THIS branch's headline feature so any prior version-bump or CHANGELOG pitfalls for similar features surface.
|
|
|
|
Pick ONE keyword that names the headline feature you're shipping. The keyword should be a noun: the primary skill or module name, the central feature noun, or the binary you changed. The keyword MUST be alphanumeric or hyphen only — no quotes, slashes, dots, colons, or whitespace. If your candidate has any of those, simplify to just the alphanumeric stem.
|
|
|
|
Worked examples (ship-specific): good keywords are `learnings-search`, `pacing`, `worktree-ship`. Bad: `the branch headline`, `v1.31.1.0`, `feat: token-or search`.
|
|
|
|
```bash
|
|
~/.claude/skills/gstack/bin/gstack-learnings-search --query "<your-keyword>" --limit 5 2>/dev/null || true
|
|
```
|
|
|
|
If any learnings come back, name which one applies to the version bump or CHANGELOG framing in one sentence. If none come back, continue without reference — the absence is itself useful information.
|