Merge remote-tracking branch 'origin/main' into garrytan/usage-telemetry

# Conflicts:
# SKILL.md
# TODOS.md
# browse/SKILL.md
# design-consultation/SKILL.md
# design-review/SKILL.md
# document-release/SKILL.md
# plan-ceo-review/SKILL.md
# plan-design-review/SKILL.md
# plan-eng-review/SKILL.md
# qa-only/SKILL.md
# qa/SKILL.md
# retro/SKILL.md
# retro/SKILL.md.tmpl
# review/SKILL.md
# scripts/gen-skill-docs.ts
# setup-browser-cookies/SKILL.md
# ship/SKILL.md
2026-06-22 17:49:57 +02:00 · 2026-03-19 00:50:11 -07:00
parent 7d6fc52ae3 cb203777f8
commit 790bd18977
81 changed files with 8178 additions and 609 deletions
@@ -5,6 +5,7 @@ description: |
  Pre-landing PR review. Analyzes diff against the base branch for SQL safety, LLM trust
  boundary violations, conditional side effects, and other structural issues. Use when
  asked to "review this PR", "code review", "pre-landing review", or "check my diff".
+  Proactively suggest when the user is about to merge or land code changes.
 allowed-tools:
  - Bash
  - Read
@@ -40,7 +41,8 @@ _SESSION_ID="$$-$(date +%s)"
 echo "TELEMETRY: ${_TEL:-off}"
 echo "TEL_PROMPTED: $_TEL_PROMPTED"
 mkdir -p ~/.gstack/analytics
-for _PF in ~/.gstack/analytics/.pending-* 2>/dev/null; do [ -f "$_PF" ] && ~/.claude/skills/gstack/bin/gstack-telemetry-log --event-type skill_run --skill _pending_finalize --outcome unknown --session-id "$_SESSION_ID" 2>/dev/null || true; break; done
+echo '{"skill":"review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}'  >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
+for _PF in ~/.gstack/analytics/.pending-*; do [ -f "$_PF" ] && ~/.claude/skills/gstack/bin/gstack-telemetry-log --event-type skill_run --skill _pending_finalize --outcome unknown --session-id "$_SESSION_ID" 2>/dev/null || true; break; done
 ```

 If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
@@ -155,13 +157,37 @@ Hey gstack team — ran into this while using /{skill-name}:

 Slug: lowercase, hyphens, max 60 chars (e.g. `browse-js-no-await`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"

+## Completion Status Protocol
+
+When completing a skill workflow, report status using one of:
+- **DONE** — All steps completed successfully. Evidence provided for each claim.
+- **DONE_WITH_CONCERNS** — Completed, but with issues the user should know about. List each concern.
+- **BLOCKED** — Cannot proceed. State what is blocking and what was tried.
+- **NEEDS_CONTEXT** — Missing information required to continue. State exactly what you need.
+
+### Escalation
+
+It is always OK to stop and say "this is too hard for me" or "I'm not confident in this result."
+
+Bad work is worse than no work. You will not be penalized for escalating.
+- If you have attempted a task 3 times without success, STOP and escalate.
+- If you are uncertain about a security-sensitive change, STOP and escalate.
+- If the scope of work exceeds what you can verify, STOP and escalate.
+
+Escalation format:
+```
+STATUS: BLOCKED | NEEDS_CONTEXT
+REASON: [1-2 sentences]
+ATTEMPTED: [what you tried]
+RECOMMENDATION: [what the user should do next]
+```
+
 ## Telemetry (run last)

-After the skill workflow completes (success, error, or abort), write the .pending marker
-with the actual skill name, then log the telemetry event. Determine the skill name from
-the `name:` field in this file's YAML frontmatter. Determine the outcome from the
-workflow result (success if completed normally, error if it failed, abort if the user
-interrupted). Run this bash:
+After the skill workflow completes (success, error, or abort), log the telemetry event.
+Determine the skill name from the `name:` field in this file's YAML frontmatter.
+Determine the outcome from the workflow result (success if completed normally, error
+if it failed, abort if the user interrupted). Run this bash:

 ```bash
 _TEL_END=$(date +%s)
@@ -210,6 +236,40 @@ You are running the `/review` workflow. Analyze the current branch's diff agains

 ---

+## Step 1.5: Scope Drift Detection
+
+Before reviewing code quality, check: **did they build what was requested — nothing more, nothing less?**
+
+1. Read `TODOS.md` (if it exists). Read PR description (`gh pr view --json body --jq .body 2>/dev/null || true`).
+   Read commit messages (`git log origin/<base>..HEAD --oneline`).
+   **If no PR exists:** rely on commit messages and TODOS.md for stated intent — this is the common case since /review runs before /ship creates the PR.
+2. Identify the **stated intent** — what was this branch supposed to accomplish?
+3. Run `git diff origin/<base> --stat` and compare the files changed against the stated intent.
+4. Evaluate with skepticism:
+
+   **SCOPE CREEP detection:**
+   - Files changed that are unrelated to the stated intent
+   - New features or refactors not mentioned in the plan
+   - "While I was in there..." changes that expand blast radius
+
+   **MISSING REQUIREMENTS detection:**
+   - Requirements from TODOS.md/PR description not addressed in the diff
+   - Test coverage gaps for stated requirements
+   - Partial implementations (started but not finished)
+
+5. Output (before the main review begins):
+   ```
+   Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING]
+   Intent: <1-line summary of what was requested>
+   Delivered: <1-line summary of what the diff actually does>
+   [If drift: list each out-of-scope change]
+   [If missing: list each unaddressed requirement]
+   ```
+
+6. This is **INFORMATIONAL** — does not block the review. Proceed to Step 2.
+
+---
+
 ## Step 2: Read the checklist

 Read `.claude/skills/review/checklist.md`.
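Reviewer sketch for the Step 1.5 hunk above: one way the "files changed vs. stated intent" comparison could be mechanized. This is a minimal illustration, not part of the commit. The function name `scope_check` and the idea of expressing intent as path prefixes are assumptions, and it uses `git diff --name-only` where the skill text uses `--stat`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: flag changed files that fall outside the path
# prefixes the stated intent is expected to touch.
# Usage: scope_check "<newline-separated changed files>" <allowed-prefix>...
# Example (replace BASE with the base branch name):
#   scope_check "$(git diff origin/BASE --name-only)" review/ scripts/
scope_check() {
  local changed="$1"; shift
  local drift=() f p ok
  while IFS= read -r f; do
    [ -n "$f" ] || continue
    ok=0
    for p in "$@"; do
      # "$p" is quoted so it matches literally; the trailing * makes it a prefix test.
      case "$f" in "$p"*) ok=1; break ;; esac
    done
    [ "$ok" -eq 1 ] || drift+=("$f")
  done <<< "$changed"

  if [ "${#drift[@]}" -eq 0 ]; then
    echo "Scope Check: CLEAN"
  else
    echo "Scope Check: DRIFT DETECTED"
    printf '  out of scope: %s\n' "${drift[@]}"
  fi
}
```

A prefix match only covers the SCOPE CREEP half of the check. Spotting MISSING REQUIREMENTS still needs someone to read TODOS.md and the commit messages, which is why the step stays informational.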
@@ -260,7 +320,7 @@ Follow the output format specified in the checklist. Respect the suppressions
 Check if the diff touches frontend files using `gstack-diff-scope`:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 ```

 **If `SCOPE_FRONTEND=false`:** Skip design review silently. No output.
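Reviewer sketch for the `eval $(...)` to `source <(...)` change above: a small stand-in showing how process substitution loads `KEY=value` output into the current shell. The function `fake_diff_scope` and the `SCOPE_BACKEND` variable are hypothetical; the real output format of `gstack-diff-scope` is not shown in this diff. Process substitution is a bash feature, so this assumes the snippet runs under bash rather than a plain POSIX `sh`.

```bash
#!/usr/bin/env bash
# Stand-in for gstack-diff-scope (hypothetical output, for illustration only).
fake_diff_scope() {
  echo "SCOPE_FRONTEND=true"
  echo "SCOPE_BACKEND=false"
}

# <(...) exposes the command's stdout as a file path, and `source` reads it
# line by line, so each KEY=value pair becomes a variable in this shell.
source <(fake_diff_scope)

if [ "$SCOPE_FRONTEND" = "true" ]; then
  echo "frontend touched: run design review"
else
  echo "no frontend changes: skip design review"
fi
```

Unlike an unquoted `eval $(...)`, this form does not word-split and re-join the output before it is interpreted, so newlines between assignments are preserved.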
@@ -283,12 +343,10 @@ eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 6. **Log the result** for the Review Readiness Dashboard:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
-mkdir -p ~/.gstack/projects/$SLUG
-echo '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M,"commit":"COMMIT"}'
 ```

-Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count.
+Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count, COMMIT = output of `git rev-parse --short HEAD`.

 Include any design findings alongside the findings from Step 4. They follow the same Fix-First flow in Step 5 — AUTO-FIX for mechanical CSS fixes, ASK for everything else.
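Reviewer sketch for the logging hunk above: one possible way to fill the TIMESTAMP, STATUS, N, M, and COMMIT placeholders before handing the entry to `gstack-review-log`. The helper name `build_review_entry` is an assumption made for illustration. Only the JSON field names and the substitution rules come from the skill text.

```bash
#!/usr/bin/env bash
# Hypothetical helper that renders the design-review-lite log entry.
# Args: total findings, auto-fixed count, short commit hash.
build_review_entry() {
  local findings="$1" auto_fixed="$2" commit="$3"
  # STATUS is "clean" only when there are zero findings.
  local status="issues_found"
  if [ "$findings" -eq 0 ]; then
    status="clean"
  fi
  printf '{"skill":"design-review-lite","timestamp":"%s","status":"%s","findings":%d,"auto_fixed":%d,"commit":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" "$findings" "$auto_fixed" "$commit"
}

# Possible usage, assuming the log script accepts the entry as its first argument
# as the hunk above shows:
#   ~/.claude/skills/gstack/bin/gstack-review-log "$(build_review_entry 2 1 "$(git rev-parse --short HEAD)")"
```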

@@ -342,6 +400,16 @@ Apply fixes for items where the user chose "Fix." Output what was fixed.

 If no ASK items exist (everything was AUTO-FIX), skip the question entirely.

+### Verification of claims
+
+Before producing the final review output:
+- If you claim "this pattern is safe" → cite the specific line proving safety
+- If you claim "this is handled elsewhere" → read and cite the handling code
+- If you claim "tests cover this" → name the test file and method
+- Never say "likely handled" or "probably tested" — verify or flag as unknown
+
+**Rationalization prevention:** "This looks fine" is not a finding. Either cite evidence it IS fine, or flag it as unverified.
+
 ### Greptile comment resolution

 After outputting your own findings, if Greptile comments were classified in Step 2.5:
@@ -396,6 +464,55 @@ If no documentation files exist, skip this step silently.

 ---

+## Step 5.7: Codex second opinion (optional)
+
+After completing the review, check if the Codex CLI is available:
+
+```bash
+which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
+```
+
+If Codex is available, use AskUserQuestion:
+
+```
+Review complete. Want an independent second opinion from Codex (OpenAI)?
+
+A) Run Codex code review — independent diff review with pass/fail gate
+B) Run Codex adversarial challenge — try to find ways this code will fail in production
+C) Both — review first, then adversarial challenge
+D) Skip — no Codex review needed
+```
+
+If the user chooses A, B, or C:
+
+**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
+Present the full output verbatim under a `CODEX SAYS (code review):` header.
+Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
+After presenting, compare Codex's findings with your own review findings from Steps 4-5
+and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
+and what only Claude found.
+
+**For adversarial challenge (B or C):** Run:
+```bash
+codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
+```
+Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
+
+**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
+```bash
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
+```
+
+Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
+
+**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
+there is no gate verdict to record, and a false entry would make the Review Readiness
+Dashboard believe a code review happened when it didn't.
+
+If Codex is not available, skip this step silently.
+
+---
+
 ## Important Rules

 - **Read the FULL diff before commenting.** Do not flag issues already addressed in the diff.
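Reviewer sketch for Step 5.7 above: the `[P1]` gate check expressed as a shell function. The name `codex_gate` is hypothetical, and the sketch assumes the Codex output has already been captured into a string. It exists mainly to point out that `[P1]` has to be matched as a fixed string.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the gate verdict from captured Codex review output.
codex_gate() {
  # -F makes grep treat "[P1]" literally. Without it, the brackets form a
  # character class and any line containing "P" or "1" would match.
  if printf '%s\n' "$1" | grep -qF '[P1]'; then
    echo "GATE: FAIL"
  else
    echo "GATE: PASS"
  fi
}

# Possible usage (replace BASE with the base branch name):
#   codex_gate "$(codex review --base BASE)"
```

The verdict maps onto the log fields described above: `GATE: PASS` corresponds to status "clean" and gate "pass", `GATE: FAIL` to "issues_found" and "fail".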
@@ -5,6 +5,7 @@ description: |
  Pre-landing PR review. Analyzes diff against the base branch for SQL safety, LLM trust
  boundary violations, conditional side effects, and other structural issues. Use when
  asked to "review this PR", "code review", "pre-landing review", or "check my diff".
+  Proactively suggest when the user is about to merge or land code changes.
 allowed-tools:
  - Bash
  - Read
@@ -33,6 +34,40 @@ You are running the `/review` workflow. Analyze the current branch's diff agains

 ---

+## Step 1.5: Scope Drift Detection
+
+Before reviewing code quality, check: **did they build what was requested — nothing more, nothing less?**
+
+1. Read `TODOS.md` (if it exists). Read PR description (`gh pr view --json body --jq .body 2>/dev/null || true`).
+   Read commit messages (`git log origin/<base>..HEAD --oneline`).
+   **If no PR exists:** rely on commit messages and TODOS.md for stated intent — this is the common case since /review runs before /ship creates the PR.
+2. Identify the **stated intent** — what was this branch supposed to accomplish?
+3. Run `git diff origin/<base> --stat` and compare the files changed against the stated intent.
+4. Evaluate with skepticism:
+
+   **SCOPE CREEP detection:**
+   - Files changed that are unrelated to the stated intent
+   - New features or refactors not mentioned in the plan
+   - "While I was in there..." changes that expand blast radius
+
+   **MISSING REQUIREMENTS detection:**
+   - Requirements from TODOS.md/PR description not addressed in the diff
+   - Test coverage gaps for stated requirements
+   - Partial implementations (started but not finished)
+
+5. Output (before the main review begins):
+   ```
+   Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING]
+   Intent: <1-line summary of what was requested>
+   Delivered: <1-line summary of what the diff actually does>
+   [If drift: list each out-of-scope change]
+   [If missing: list each unaddressed requirement]
+   ```
+
+6. This is **INFORMATIONAL** — does not block the review. Proceed to Step 2.
+
+---
+
 ## Step 2: Read the checklist

 Read `.claude/skills/review/checklist.md`.
@@ -132,6 +167,16 @@ Apply fixes for items where the user chose "Fix." Output what was fixed.

 If no ASK items exist (everything was AUTO-FIX), skip the question entirely.

+### Verification of claims
+
+Before producing the final review output:
+- If you claim "this pattern is safe" → cite the specific line proving safety
+- If you claim "this is handled elsewhere" → read and cite the handling code
+- If you claim "tests cover this" → name the test file and method
+- Never say "likely handled" or "probably tested" — verify or flag as unknown
+
+**Rationalization prevention:** "This looks fine" is not a finding. Either cite evidence it IS fine, or flag it as unverified.
+
 ### Greptile comment resolution

 After outputting your own findings, if Greptile comments were classified in Step 2.5:
@@ -186,6 +231,55 @@ If no documentation files exist, skip this step silently.

 ---

+## Step 5.7: Codex second opinion (optional)
+
+After completing the review, check if the Codex CLI is available:
+
+```bash
+which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
+```
+
+If Codex is available, use AskUserQuestion:
+
+```
+Review complete. Want an independent second opinion from Codex (OpenAI)?
+
+A) Run Codex code review — independent diff review with pass/fail gate
+B) Run Codex adversarial challenge — try to find ways this code will fail in production
+C) Both — review first, then adversarial challenge
+D) Skip — no Codex review needed
+```
+
+If the user chooses A, B, or C:
+
+**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
+Present the full output verbatim under a `CODEX SAYS (code review):` header.
+Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
+After presenting, compare Codex's findings with your own review findings from Steps 4-5
+and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
+and what only Claude found.
+
+**For adversarial challenge (B or C):** Run:
+```bash
+codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
+```
+Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
+
+**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
+```bash
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
+```
+
+Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
+
+**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
+there is no gate verdict to record, and a false entry would make the Review Readiness
+Dashboard believe a code review happened when it didn't.
+
+If Codex is not available, skip this step silently.
+
+---
+
 ## Important Rules

 - **Read the FULL diff before commenting.** Do not flag issues already addressed in the diff.
@@ -35,16 +35,16 @@ Be terse. For each issue: one line describing the problem, one line with the fix
 ### Pass 1 — CRITICAL

 #### SQL & Data Safety
-- String interpolation in SQL (even if values are `.to_i`/`.to_f` — use `sanitize_sql_array` or Arel)
+- String interpolation in SQL (even if values are `.to_i`/`.to_f` — use parameterized queries (Rails: sanitize_sql_array/Arel; Node: prepared statements; Python: parameterized queries))
 - TOCTOU races: check-then-set patterns that should be atomic `WHERE` + `update_all`
-- `update_column`/`update_columns` bypassing validations on fields that have or should have constraints
-- N+1 queries: `.includes()` missing for associations used in loops/views (especially avatar, attachments)
+- Bypassing model validations for direct DB writes (Rails: update_column; Django: QuerySet.update(); Prisma: raw queries)
+- N+1 queries: Missing eager loading (Rails: .includes(); SQLAlchemy: joinedload(); Prisma: include) for associations used in loops/views

 #### Race Conditions & Concurrency
-- Read-check-write without uniqueness constraint or `rescue RecordNotUnique; retry` (e.g., `where(hash:).first` then `save!` without handling concurrent insert)
-- `find_or_create_by` on columns without unique DB index — concurrent calls can create duplicates
+- Read-check-write without a uniqueness constraint or duplicate-key error handling with retry (e.g., `where(hash:).first` then `save!` without handling concurrent insert)
+- find-or-create without unique DB index — concurrent calls can create duplicates
 - Status transitions that don't use atomic `WHERE old_status = ? UPDATE SET new_status` — concurrent updates can skip or double-apply transitions
-- `html_safe` on user-controlled data (XSS) — check any `.html_safe`, `raw()`, or string interpolation into `html_safe` output
+- Unsafe HTML rendering (Rails: .html_safe/raw(); React: dangerouslySetInnerHTML; Vue: v-html; Django: |safe/mark_safe) on user-controlled data (XSS)

 #### LLM Output Trust Boundary
 - LLM-generated values (emails, URLs, names) written to DB or passed to mailers without format validation. Add lightweight guards (`EMAIL_REGEXP`, `URI.parse`, `.strip`) before persisting.
@@ -141,7 +141,7 @@ the agent auto-fixes a finding or asks the user.
 ```
 AUTO-FIX (agent fixes without asking):     ASK (needs human judgment):
 ├─ Dead code / unused variables            ├─ Security (auth, XSS, injection)
-├─ N+1 queries (missing .includes())      ├─ Race conditions
+├─ N+1 queries (missing eager loading)     ├─ Race conditions
 ├─ Stale comments contradicting code       ├─ Design decisions
 ├─ Magic numbers → named constants         ├─ Large fixes (>20 lines)
 ├─ Missing LLM output validation           ├─ Enum completeness
@@ -9,7 +9,7 @@ This checklist applies to **source code in the diff** — not rendered output. R
 **Trigger:** Only run this checklist if the diff touches frontend files. Use `gstack-diff-scope` to detect:

 ```bash
-eval $(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
+source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
 ```

 If `SCOPE_FRONTEND=false`, skip the entire design review silently.