mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-01 19:25:10 +02:00
aeea57f96a3f73e13b732f036dca5ed7ed7f7bdf
5 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e3d7f49c74 |
feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)
* refactor: export readOverlay from model-overlay resolver
Needed by the overlay-efficacy eval harness to resolve INHERIT directives
without going through generateModelOverlay's full TemplateContext.
* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep
Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
harness to drive the model through a closer-to-real Claude Code harness
than `claude -p`.
* feat(preflight): sanity check for agent-sdk + overlay resolver
Verifies: SDK loads, claude-opus-4-7 is a live API model, SDKMessage event
shape matches assumptions, readOverlay resolves INHERIT directives and
includes expected content. Run with
`bun run scripts/preflight-agent-sdk.ts`. PREFLIGHT OK on first run,
$0.013 API spend.
* feat(eval): parametric overlay-efficacy harness (runner + fixtures)
`test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
with explicit `AgentSdkResult` types, process-level API concurrency
semaphore, and 3-shape 429 retry (thrown error, result-message error,
mid-stream SDKRateLimitEvent). Pins the local claude binary via
`pathToClaudeCodeExecutable`.
`test/fixtures/overlay-nudges.ts` holds the typed registry. Two fixtures
for the first measurement: `opus-4-7-fanout-toy` (3-file read) and
`opus-4-7-fanout-realistic` (mixed-tool audit). Strict validator rejects
duplicate ids, non-integer trials, unsafe overlay paths, non-safe id chars,
and missing overlay files at module load.
Adding a future overlay nudge eval = one fixture entry.
* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)
Stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
happy-path shape, all 3 rate-limit shapes + retry, workspace reset on
retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation,
process-level concurrency cap, options propagation, artifact path
uniqueness, cost/turn mapping, and every validator rejection case.
* test(eval): paid periodic overlay-efficacy harness
`test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES, runs
two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with bounded
concurrency. Arms use SDK preset `claude_code` so both include the real
Claude Code system prompt; overlay-ON appends the resolved overlay text.
Saves per-trial raw event streams to
`~/.gstack/projects/<slug>/transcripts/` for forensic recovery.
Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).
* test: register overlay harness in touchfiles (both maps)
Entries for `overlay-harness-opus-4-7-fanout-toy` and
`opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes
`test/touchfiles.test.ts` completeness check.
* fix(opus-4.7): remove "Fan out explicitly" overlay nudge
Measured counterproductive under the new SDK harness. Baseline Opus 4.7
emits first-turn parallel tool_use blocks 70% of the time on a 3-file read
prompt. With the custom nudge: 10%. With Anthropic's own canonical
`<use_parallel_tool_calls>` block from their parallel-tool-use docs: 0%.
Both overlays suppress fanout; neither improves it.
On realistic multi-tool prompts (audit a project: read files + glob +
summarize), Opus 4.7 never fans out in first turn regardless of overlay.
Zero of 20 trials. Not a prompt problem.
Keeping the other three nudges (effort-match, batch questions, literal
interpretation) pending their own measurement. Harness is ready for
follow-up fixtures — add one entry to `test/fixtures/overlay-nudges.ts` to
measure any overlay bullet.
Cost of investigation: ~$7 total across 3 eval runs.
* chore: bump version and changelog (v1.6.5.0)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction
Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
(e.g. literal-interpretation 'fix the failing tests' needs write access).
Per-fixture maxTurns lets harder prompts run longer without changing the
default. `direction` is cosmetic metadata for test output labeling.
Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
- bashToolCallCount — count of Bash tool_use across the session
- turnsToCompletion — SDK-reported num_turns at result
- uniqueFilesEdited — Edit/Write/MultiEdit file_path set size
test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
and fixture.maxTurns through runArm.
* test(eval): 3 more overlay fixtures to measure remaining Claude nudges
Measures three overlay bullets that haven't been tested yet:
- claude-dedicated-tools-vs-bash — claude.md says 'prefer Read/Edit/Write/
Glob/Grep over cat/sed/find/grep'. Fixture prompts 'list every TypeScript
file under src/ and tell me what each exports' and counts Bash tool_use
across the session. Overlay-ON should drop it by >=20%.
- opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads
don't need deep reasoning.' Fixture uses a trivial one-file prompt
(config.json lookup) and measures turns_used. Overlay-ON should be <=80%
of baseline turns.
- opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing
tests, not just the obvious one.' Fixture seeds three failing test files
with deliberately distinct failure modes and counts unique files edited.
Overlay-ON should touch >=20% more files.
Adding a fourth fixture for any remaining overlay nudge is a single entry.
The harness is now proven on: fanout (deleted after measurement),
dedicated tools, effort-match, and literal-interpretation.
* fix(eval): handle SDK max-turns throw gracefully
Some @anthropic-ai/claude-agent-sdk versions throw from the query generator
when maxTurns is reached, instead of emitting a result message with
subtype='error_max_turns'. The runner treated that as a non-retryable
error and killed the whole periodic run on the first fixture that exceeded
its turn cap.
Added isMaxTurnsError() detector and a catch branch that synthesizes an
AgentSdkResult from events captured before the throw, with
exitReason='error_max_turns' and costUsd=0 (unknown from the thrown path).
The metric function still runs against whatever assistant turns were
collected, so the trial produces a usable number.
Hoisted events/assistantTurns/toolCalls/assistantTextParts and the timing
counters out of the inner try so the catch branch can read them. No
behavior change on the success path or on rate-limit retry paths.
* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash
The prompt 'list every TypeScript file under src/ and tell me what each
exports' needs 1 turn for Glob + ~5 for Reads + 1 for summary. Default
maxTurns=5 was not enough; prior run threw from the SDK on this fixture
and tanked the whole periodic eval. Bumping to 15 gives headroom.
The runner now also handles max-turns gracefully even if a future fixture
underestimates, so this is belt and suspenders.
* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures
Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
Tests whether the overlays behave differently on a weaker Claude model
where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
Opus so these 5 add ~$4.50 to a full run.
Measurement result from the first paired run (100 trials total, ~$14.55):
- **Sonnet + effort-match shows real overlay benefit.** With the overlay
on, Sonnet takes 2.5 turns on a trivial `What's the version in
config.json?` prompt. Without, it takes exactly 3.0 turns in all 10
trials. ~17% reduction, below the 20% pass threshold but the signal is
clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF
[3,3,3,3,3,3,3,3,3,3].
- All other Sonnet dimensions flat (fanout, dedicated-tools, literal
interpretation). Same as Opus on those axes.
- Opus effort-match remains flat (2.60 vs 2.50, +4% slower with overlay).
Implication: model-stratified. The overlay stack helps Sonnet on some axes
where it does nothing on Opus. Wholesale removal would hurt Sonnet.
Per-nudge per-model measurement is the right move going forward.
* chore: bump version to 1.10.1.0
Updates VERSION, package.json, CHANGELOG header, and TODOS completion
marker from 1.6.5.0 to 1.10.1.0.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
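The 20% threshold predicates named in this commit can be illustrated with a minimal sketch. The log gives only the predicate names and the thresholds (overlay-ON at most 80% of baseline, or at least 20% above it), so the signatures below, which compare per-arm means over arrays of per-trial metric values, are an assumption and not the code in `test/fixtures/overlay-nudges.ts`.

```typescript
// Hypothetical signatures: the commit log names these predicates but does
// not show how they receive their inputs. Here each arm is an array of
// per-trial metric values and the comparison is on the arithmetic mean.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Passes when overlay-ON is at most 80% of the overlay-OFF baseline.
function lowerIsBetter20Pct(on: number[], off: number[]): boolean {
  return mean(on) <= 0.8 * mean(off);
}

// Passes when overlay-ON is at least 20% above the overlay-OFF baseline.
function higherIsBetter20Pct(on: number[], off: number[]): boolean {
  return mean(on) >= 1.2 * mean(off);
}

// Turn counts quoted in the Sonnet effort-match result above.
const sonnetOverlayOn = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3];
const sonnetOverlayOff = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3];
const sonnetPasses = lowerIsBetter20Pct(sonnetOverlayOn, sonnetOverlayOff);
```

With the quoted distributions the means are 2.5 and 3.0, so the overlay arm sits above the 2.4 cutoff and the predicate reports a miss, which matches the "below the 20% pass threshold" note in the commit.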
||
|
|
97584f9a59 |
feat(security): ML prompt injection defense for sidebar (v1.4.0.0) (#1089)
* chore(deps): add @huggingface/transformers for prompt injection classifier
Dependency needed for the ML prompt injection defense layer coming in the
follow-up commits. @huggingface/transformers will host the TestSavantAI
BERT-small classifier that scans tool outputs for indirect prompt
injection.
Note: this dep only runs in non-compiled bun contexts (sidebar-agent.ts).
The compiled browse binary cannot load it because transformers.js v4
requires onnxruntime-node (native module, fails to dlopen from bun
compile's temp extract dir). See docs/designs/ML_PROMPT_INJECTION_KILLER.md
for the full architectural decision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): add security.ts foundation for prompt injection defense
Establishes the module structure for the L5 canary and L6 verdict
aggregation layers. Pure-string operations only — safe to import from the
compiled browse binary.
Includes:
  * THRESHOLDS constants (BLOCK 0.85 / WARN 0.60 / LOG_ONLY 0.40),
    calibrated against BrowseSafe-Bench smoke + developer content benign
    corpus.
  * combineVerdict() implementing the ensemble rule: BLOCK only when the ML
    content classifier AND the transcript classifier both score >= WARN.
    Single-layer high confidence degrades to WARN to prevent any one
    classifier's false-positives from killing sessions (Stack Overflow
    instruction-writing-style FPs at 0.99 on TestSavantAI alone).
  * generateCanary / injectCanary / checkCanaryInStructure — session-scoped
    secret token, recursively scans tool arguments, URLs, file writes, and
    nested objects per the plan's all-channel coverage decision.
  * logAttempt with 10MB rotation (keeps 5 generations). Salted SHA-256
    hash, per-device salt at ~/.gstack/security/device-salt (0600).
  * Cross-process session state at ~/.gstack/security/session-state.json
    (atomic temp+rename). Required because server.ts (compiled) and
    sidebar-agent.ts (non-compiled) are separate processes.
  * getStatus() for shield icon rendering via /health.
ML classifier code will live in a separate module (security-classifier.ts)
loaded only by sidebar-agent.ts — compiled browse binary cannot load the
native ONNX runtime.
Plan: ~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-prompt-injection-guard.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): wire canary injection into sidebar spawnClaude
Every sidebar message now gets a fresh CANARY-XXXXXXXXXXXX token embedded
in the system prompt with an instruction for Claude to never output it on
any channel. The token flows through the queue entry so sidebar-agent.ts
can check every outbound operation for leaks.
If Claude echoes the canary into any outbound channel (text stream, tool
arguments, URLs, file write paths), the sidebar-agent terminates the
session and the user sees the approved canary leak banner.
This operation is pure string manipulation — safe in the compiled browse
binary. The actual output-stream check (which also has to be safe in
compiled contexts) lives in sidebar-agent.ts (next commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(security): make sidebar-agent destructure check regex-tolerant
The test asserted the exact string
`const { prompt, args, stateFile, cwd, tabId } = queueEntry` which breaks
whenever security or other extensions add fields (canary, pageUrl, etc.).
Switch to a regex that requires the core fields in order but tolerates
additional fields in between. Preserves the test's intent (args come from
the queue entry, not rebuilt) while allowing the destructure to grow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): canary leak check across all outbound channels
The sidebar-agent now scans every Claude stream event for the session's
canary token before relaying any data to the sidepanel.
Channels covered (per CEO review cross-model tension #2):
  * Assistant text blocks
  * Assistant text_delta streaming
  * tool_use arguments (recursively, via checkCanaryInStructure — catches
    URLs, commands, file paths nested at any depth)
  * tool_use content_block_start
  * tool_input_delta partial JSON
  * Final result payload
If the canary leaks on any channel, onCanaryLeaked() fires once per
session:
  1. logAttempt() writes the event to ~/.gstack/security/attempts.jsonl
     with the canary's salted hash (never the payload content).
  2. sends a `security_event` to the sidepanel so it can render the
     approved canary-leak banner (variant A mockup — ceo-plan 2026-04-19).
  3. sends an `agent_error` for backward-compat with existing error
     surfaces.
  4. SIGTERM's the claude subprocess (SIGKILL after 2s if still alive).
The leaked content itself is never relayed to the sidepanel — the event is
dropped at the boundary. Canary detection is pure-string substring match,
so this all runs safely in the sidebar-agent (non-compiled bun) context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): add security-classifier.ts with TestSavantAI + Haiku
This module holds the ML classifier code that the compiled browse binary
cannot link (onnxruntime-node native dylib doesn't load from Bun compile's
temp extract dir — see CEO plan §"Pre-Impl Gate 1 Outcome"). It's imported
ONLY by sidebar-agent.ts, which runs as a non-compiled bun script.
Two layers:
L4 testsavant_content — TestSavantAI BERT-small ONNX classifier. First
call triggers a one-time 112MB model download to
~/.gstack/models/testsavant-small/ (files staged into the onnx/ layout
transformers.js v4 expects). Classifies page snapshots and tool outputs
for indirect prompt injection + jailbreak attempts. On benign-corpus
dry-run: Wikipedia/HN/Reddit/tech-blog all score SAFE 0.98+, attack text
scores INJECTION 0.99+, Stack Overflow instruction-writing now scores SAFE
0.98 on the shorter form (was 0.99 INJECTION on the longer form —
instruction-density threshold). Ensemble combiner downgrades single-layer
high to WARN to cover this case.
L4b transcript_classifier — Claude Haiku reasoning-blind pre-tool-call
scan. Sees only {user_message, last 3 tool_calls}, never Claude's
chain-of-thought or tool results (those are how self-persuasion attacks
leak). 2000ms hard timeout. Fail-open on any subprocess failure so sidebar
stays functional. Gated by shouldRunTranscriptCheck() — only runs when
another layer already fired at >= LOG_ONLY, saving ~70% of Haiku spend.
Both layers degrade gracefully: load/spawn failures set status to
'degraded' and return confidence=0. Shield icon reflects this via
getClassifierStatus() which security.ts's getStatus() composes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): wire TestSavantAI + ensemble into sidebar-agent pre-spawn scan
The sidebar-agent now runs an ML security check on the user message BEFORE
spawning claude. If the content classifier and (gated) transcript
classifier ensemble returns BLOCK, the session is refused with a
security_event + agent_error — the sidepanel renders the approved banner.
Two pieces:
  1. On agent startup, loadTestsavant() warms the classifier in the
     background. First run triggers a 112MB model download from
     HuggingFace (~30s on average broadband). Non-blocking — sidebar stays
     functional during cold-start, shield just reports 'off' until warmed.
  2. preSpawnSecurityCheck() runs the ensemble against the user message:
     - L4 (testsavant_content) always runs
     - L4b (transcript_classifier via Haiku) runs only if L4 flagged at
       >= LOG_ONLY — plan §E1 gating optimization, saves ~70% of Haiku
       spend
combineVerdict() applies the BLOCK-requires-both-layers rule, which
downgrades any single-layer high confidence to WARN. Stack Overflow-style
instruction-heavy writing false-positives on TestSavantAI alone are caught
by this degrade — Haiku corrects them when called.
Fail-open everywhere: any subprocess/load/inference error returns
confidence=0 so the sidebar keeps working on architectural controls alone.
Shield icon reflects degraded state via getClassifierStatus().
BLOCK path emits both:
  - security_event {verdict, reason, layer, confidence, domain} (for the
    approved canary-leak banner UX mockup — variant A)
  - agent_error "Session blocked — prompt injection detected..."
    (backward-compat with existing error surface)
Regression test suite still passes (12/12 sidebar-security tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(security): add security.ts unit tests (25 tests, 62 assertions)
Covers the pure-string operations that must behave deterministically in
both compiled and source-mode bun contexts:
  * THRESHOLDS ordering invariant (BLOCK > WARN > LOG_ONLY > 0)
  * combineVerdict ensemble rule — THE critical path:
    - Empty signals → safe
    - Canary leak always blocks (regardless of ML signals)
    - Both ML layers >= WARN → BLOCK (ensemble_agreement)
    - Single layer >= BLOCK → WARN (single_layer_high) — the Stack
      Overflow FP mitigation that prevents one classifier killing
      sessions alone
    - Max-across-duplicates when multiple signals reference the same layer
  * Canary generation + injection + recursive checking:
    - Unique CANARY-XXXXXXXXXXXX tokens (>= 48 bits entropy)
    - Recursive structure scan for tool_use inputs, nested URLs, commands
    - Null / primitive handling doesn't throw
  * Payload hashing (salted sha256) — deterministic per-device, differs
    across payloads, 64-char hex shape
  * logAttempt writes to ~/.gstack/security/attempts.jsonl
  * writeSessionState + readSessionState round-trip (cross-process)
  * getStatus returns valid SecurityStatus shape
  * extractDomain returns hostname only, empty string on bad input
All 25 tests pass in 18ms — no ML, no network, no subprocess spawning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): expose security status on /health for shield icon
The /health endpoint now returns a `security` field with the classifier
status, suitable for driving the sidepanel shield icon:
  {
    status: 'protected' | 'degraded' | 'inactive',
    layers: { testsavant, transcript, canary },
    lastUpdated: ISO8601
  }
Backend plumbing:
  * server.ts imports getStatus from security.ts (pure-string, safe in
    compiled binary) and includes it in the /health response.
  * sidebar-agent.ts writes ~/.gstack/security/session-state.json when the
    classifier warmup completes (success OR failure). This is the
    cross-process handoff — server.ts reads the state file via getStatus()
    to surface the result to the sidepanel.
The sidepanel rendering (SVG shield icon + color states + tooltip) is a
follow-up commit in the extension/ code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(security): document the sidebar security stack in CLAUDE.md
Adds a security section to the Browser interaction block. Covers:
  * Layered defense table showing which modules live where
    (content-security.ts in both contexts vs security-classifier.ts only
    in sidebar-agent) and why the split exists (onnxruntime-node
    incompatibility with compiled Bun)
  * Threshold constants (0.85 / 0.60 / 0.40) and the ensemble rule that
    prevents single-classifier false-positives (the Stack Overflow FP
    story)
  * Env knobs — GSTACK_SECURITY_OFF kill switch, cache paths, salt file,
    attack log rotation, session state file
This is the "before you modify the security stack, read this" doc. It
lives next to the existing Sidebar architecture note that points at
SIDEBAR_MESSAGE_FLOW.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(todos): mark ML classifier v1 in-progress + file v2 follow-ups
Reframes the P0 item to reflect v1 scope (branch 2 architecture,
TestSavantAI pivot, what shipped) and splits v2 work into discrete TODOs:
  * Shield icon + canary leak banner UI (P0, blocks v1 user-facing
    completion)
  * Attack telemetry via gstack-telemetry-log (P1)
  * Full BrowseSafe-Bench at gate tier (P2)
  * Cross-user aggregate attack dashboard (P2)
  * DeBERTa-v3 as third signal in ensemble (P2)
  * Read/Glob/Grep ingress coverage (P2, flagged by Codex review)
  * Adversarial + integration + smoke-bench test suites (P1)
  * Bun-native 5ms inference (P3 research)
Each TODO carries What / Why / Context / Effort / Priority / Depends-on so
it's actionable by someone picking it up cold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(telemetry): add attack_attempt event type to gstack-telemetry-log
Extends the existing telemetry pipe with 5 new flags needed for prompt
injection attack reporting:
  --url-domain    hostname only (never path, never query)
  --payload-hash  salted sha256 hex (opaque — no payload content ever)
  --confidence    0-1 (awk-validated + clamped; malformed → null)
  --layer         testsavant_content | transcript_classifier | aria_regex | canary
  --verdict       block | warn | log_only
Backward compatibility:
  * Existing skill_run events still work — all new fields default to null
  * Event schema is a superset of the old one; downstream edge function
    can filter by event_type
No new auth, no new SDK, no new Supabase migration. The same tier gating
(community → upload, anonymous → local only, off → no-op) and the same
sync daemon carry the attack events. This is the "E6 RESOLVED" path from
the CEO plan — riding the existing pipe instead of spinning up parallel
infra.
Verified end-to-end:
  * attack_attempt event with all fields emits correctly to
    skill-usage.jsonl
  * skill_run event with no security flags still works (backward compat)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(security): wire logAttempt to gstack-telemetry-log (fire-and-forget)
Every local attempts.jsonl write now also triggers a subprocess call to
gstack-telemetry-log with the attack_attempt event type. The binary
handles tier gating internally (community → Supabase upload, anonymous →
local JSONL only, off → no-op), so security.ts doesn't need to re-check.
Binary resolution follows the skill preamble pattern — never relies on
PATH, which breaks in compiled-binary contexts:
  1. ~/.claude/skills/gstack/bin/gstack-telemetry-log (global install)
  2. .claude/skills/gstack/bin/gstack-telemetry-log (symlinked dev)
  3. bin/gstack-telemetry-log (in-repo dev)
Fire-and-forget:
  * spawn with stdio: 'ignore', detached: true, unref()
  * .on('error') swallows failures
  * Missing binary is non-fatal — local attempts.jsonl still gives audit
    trail
Never throws. Never blocks. Existing 37 security tests pass unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(ui): add security banner markup + styles (approved variant A)
HTML + CSS for the canary leak / ML block banner. Structure matches the
approved mockup from /plan-design-review 2026-04-19 (variant A — centered
alert-heavy):
  * Red alert-circle SVG icon (no stock shield, intentional — matches the
    "serious but not scary" tone the review chose)
  * "Session terminated" Satoshi Bold 18px red headline
  * "— prompt injection detected from {domain}" DM Sans zinc subtitle
  * Expandable "What happened" chevron button (aria-expanded/aria-controls)
  * Layer list rendered in JetBrains Mono with amber tabular-nums scores
  * Close X in top-right, 28px hit area, focus-visible amber outline
Enter animation: slide-down 8px + fade, 250ms,
cubic-bezier(0.16,1,0.3,1) — matches DESIGN.md motion spec. Respects
`role="alert"` + `aria-live="assertive"` so screen readers announce on
appearance. Escape-to-dismiss hook is in the JS follow-up commit.
Design tokens all via CSS variables (--error, --amber-400, --amber-500,
--zinc-*, --font-display, --font-mono, --radius-*) — already established
in the stylesheet. No new color constants introduced.
JS wiring lands in the next commit so this diff stays focused on
presentation layer only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(ui): wire security banner to security_event + interactivity
Adds showSecurityBanner() and hideSecurityBanner() plus the addChatEntry
routing for entry.type === 'security_event'. When the sidebar-agent emits
a security_event (canary leak or ML BLOCK), the banner renders with:
  * Title ("Session terminated")
  * Subtitle with {domain} if present, otherwise generic
  * Expandable layer list — each row: SECURITY_LAYER_LABELS[layer] +
    confidence.toFixed(2) in mono. Readable + auditable — user can see
    which layer fired at what score
Interactivity, wired once on DOMContentLoaded:
  * Close X → hideSecurityBanner()
  * Expand/collapse "What happened" → toggles details + aria-expanded +
    chevron rotation (200ms css transition already in place)
  * Escape key dismisses while banner is visible (a11y)
No shield icon yet — that's a separate commit that will consume the
`security` field now returned by /health.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(ui): add security shield icon in sidepanel header (3 states)
Small "SEC" badge in the top-right of the sidepanel that reflects the
security module's current state. Three states drive color:
  protected  green — all layers ok (TestSavantAI + transcript + canary)
  degraded   amber — one+ ML layer offline but canary + arch controls active
  inactive   red — security module crashed, arch controls only
Consumes /health.security (surfaced in commit |
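The ensemble rule that these commits describe for combineVerdict() can be sketched as follows. The thresholds and the `ensemble_agreement` / `single_layer_high` reasons come from the commit text. The signal shape, the treatment of a canary signal as "any positive confidence", the handling of scores between LOG_ONLY and BLOCK on a single layer, and the remaining reason strings are assumptions made for illustration; the real security.ts is not shown in this log.

```typescript
// Thresholds as stated in the commit message.
const THRESHOLDS = { BLOCK: 0.85, WARN: 0.6, LOG_ONLY: 0.4 } as const;

// Hypothetical signal and result shapes (not confirmed by the log).
type Layer = "testsavant_content" | "transcript_classifier" | "canary";
type Verdict = "block" | "warn" | "log_only" | "safe";
interface Signal { layer: Layer; confidence: number }
interface VerdictResult { verdict: Verdict; reason: string }

function combineVerdict(signals: Signal[]): VerdictResult {
  // Max-across-duplicates: keep the highest score seen for each layer.
  const maxByLayer = new Map<Layer, number>();
  for (const s of signals) {
    maxByLayer.set(s.layer, Math.max(maxByLayer.get(s.layer) ?? 0, s.confidence));
  }
  // A canary leak always blocks, regardless of ML signals.
  // Assumption: a leak is reported as a canary signal with confidence > 0.
  if ((maxByLayer.get("canary") ?? 0) > 0) {
    return { verdict: "block", reason: "canary_leak" };
  }
  const content = maxByLayer.get("testsavant_content") ?? 0;
  const transcript = maxByLayer.get("transcript_classifier") ?? 0;
  // BLOCK only when both ML layers agree at >= WARN.
  if (content >= THRESHOLDS.WARN && transcript >= THRESHOLDS.WARN) {
    return { verdict: "block", reason: "ensemble_agreement" };
  }
  const top = Math.max(content, transcript);
  // One layer alone, however confident, degrades to WARN.
  if (top >= THRESHOLDS.BLOCK) return { verdict: "warn", reason: "single_layer_high" };
  // Assumed behavior for the lower bands; reason names are hypothetical.
  if (top >= THRESHOLDS.WARN) return { verdict: "warn", reason: "single_layer_warn" };
  if (top >= THRESHOLDS.LOG_ONLY) return { verdict: "log_only", reason: "low_confidence" };
  return { verdict: "safe", reason: "no_signal" };
}
```

Under these assumptions, a lone TestSavantAI score of 0.99 (the Stack Overflow false-positive case) yields `warn`, and the session is only refused once the transcript classifier also reaches WARN.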
||
|
|
d0782c4c4d |
feat(v1.4.0.0): /make-pdf — markdown to publication-quality PDFs (#1086)
* feat(browse): full $B pdf flag contract + tab-scoped load-html/js/pdf
Grow $B pdf from a 2-line wrapper (hard-coded A4) into a real PDF engine
frontend so make-pdf can shell out to it without duplicating Playwright:
- pdf: --format, --width/--height, --margins, --margin-*, --header-template,
--footer-template, --page-numbers, --tagged, --outline, --print-background,
--prefer-css-page-size, --toc. Mutex rules enforced. --from-file <json>
dodges Windows argv limits (8191 char CreateProcess cap).
- load-html: add --from-file <json> mode for large inline HTML. Size + magic
byte checks still apply to the inline content, not the payload file path.
- newtab: add --json returning {"tabId":N,"url":...} for programmatic use.
- cli: extract --tab-id flag and route as body.tabId to the HTTP layer so
parallel callers can target specific tabs without racing on the active
tab (makes make-pdf's per-render tab isolation possible).
- --toc: non-fatal 3s wait for window.__pagedjsAfterFired. Paged.js ships
later; v1 renders TOC statically via the markdown renderer.
Codex round 2 flagged these P0 issues during plan review. All resolved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(resolvers): add MAKE_PDF_SETUP + makePdfDir host paths
Skill templates can now embed {{MAKE_PDF_SETUP}} to resolve $P to the
make-pdf binary via the same discovery order as $B / $D: env override
(MAKE_PDF_BIN), local skill root, global install, or PATH.
Mirrors the pattern established by generateBrowseSetup() and
generateDesignSetup() in scripts/resolvers/design.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
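The discovery order named in this commit (env override, local skill root, global install, PATH) could look roughly like the sketch below. The real resolver emits setup text for skill templates and its paths are not given in the log, so the function name, the candidate paths, and the injected `exists` check are all hypothetical.

```typescript
// Hypothetical helper illustrating the lookup order only. `exists` is
// injected so the sketch has no filesystem dependency.
function resolveMakePdfBin(
  env: Record<string, string | undefined>,
  projectRoot: string,
  homeDir: string,
  exists: (path: string) => boolean,
): string {
  // Assumed install layout; the actual paths are not stated in the commit.
  const relative = ".claude/skills/gstack/make-pdf/dist/pdf";
  const candidates: Array<string | undefined> = [
    env.MAKE_PDF_BIN,             // 1. explicit env override
    `${projectRoot}/${relative}`, // 2. local skill root
    `${homeDir}/${relative}`,     // 3. global install
  ];
  for (const candidate of candidates) {
    if (candidate && exists(candidate)) return candidate;
  }
  // 4. Fall back to a bare command name and let PATH resolve it.
  return "make-pdf";
}
```

Whether an env override that points at a missing file should be skipped or treated as an error is a design choice the commit does not settle; this sketch skips it.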
* feat(make-pdf): new /make-pdf skill + orchestrator binary
Turn markdown into publication-quality PDFs. $P generate input.md out.pdf
produces a PDF with 1in margins, intelligent page breaks, page numbers,
running header, CONFIDENTIAL footer, and curly quotes/em dashes — all on
Helvetica so copy-paste extraction works ("S ai li ng" bug avoided).
Architecture (per Codex round 2):
markdown → render.ts (marked + sanitize + smartypants) → orchestrator
→ $B newtab --json → $B load-html --tab-id → $B js (poll Paged.js)
→ $B pdf --tab-id → $B closetab
browseClient.ts shells out to the compiled browse CLI rather than
duplicating Playwright. --tab-id isolation per render means parallel
$P generate calls don't race on the active tab. try/finally tab cleanup
survives Paged.js timeouts, browser crashes, and output-path failures.
Features in v1:
--cover left-aligned cover page (eyebrow + title + hairline rule)
--toc clickable static TOC (Paged.js page numbers deferred)
--watermark <text> diagonal DRAFT/CONFIDENTIAL layer
--no-chapter-breaks opt out of H1-starts-new-page
--page-numbers "N of M" footer (default on)
--tagged --outline accessible PDF + bookmark outline (default on)
--allow-network opt in to external image loading (default off for privacy)
--quiet --verbose stderr control
Design decisions locked from the /plan-design-review pass:
- Helvetica everywhere (Chromium emits single-word Tj operators for
system fonts; bundled webfonts emit per-glyph and break extraction).
- Left-aligned body, flush-left paragraphs, no text-indent, 12pt gap.
- Cover shares 1in margins with body pages; no flexbox-center, no
inset padding.
- The reference HTMLs at .context/designs/*.html are the implementation
source of truth for print-css.ts.
Tests (56 unit + 1 E2E combined-features gate):
- smartypants: code/URL-safe, verified against 10 fixtures
- sanitizer: strips <script>/<iframe>/on*/javascript: URLs
- render: HTML assembly, CJK fallback, cover/TOC/chapter wrap
- print-css: all @page rules, margin variants, watermark
- pdftotext: normalize()+copyPasteGate() cross-OS tolerance
- browseClient: binary resolution + typed error propagation
- combined-features gate (P0): 2-chapter fixture with smartypants +
hyphens + ligatures + bold/italic + inline code + lists + blockquote
passes through PDF → pdftotext → expected.txt diff
Deferred to Phase 4 (future PR): Paged.js vendored for accurate TOC page
numbers, highlight.js for syntax highlighting, drop caps, pull quotes,
two-column, CMYK, watermark visual-diff acceptance.
Plan: .context/ceo-plans/2026-04-19-perfect-pdf-generator.md
References: .context/designs/make-pdf-*.html
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
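The per-render tab isolation and try/finally cleanup described in the architecture notes above can be sketched like this. The subcommand names and the `{"tabId":N,"url":...}` shape come from the commit text; the argument order, the synchronous `Browse` callback standing in for the shell-out, and the way `closetab` receives the tab id are assumptions.

```typescript
// Hypothetical stand-in for browseClient.ts: runs one browse subcommand and
// returns its stdout. A synchronous callback keeps the sketch small.
type Browse = (args: string[]) => string;

function renderInIsolatedTab(browse: Browse, htmlPath: string, outPath: string): void {
  // `newtab --json` is documented to return {"tabId":N,"url":...}.
  const { tabId } = JSON.parse(browse(["newtab", "--json"])) as { tabId: number };
  const tabFlag = ["--tab-id", String(tabId)];
  try {
    // Every call targets the dedicated tab, so parallel renders do not
    // race on the active tab.
    browse(["load-html", htmlPath, ...tabFlag]);
    browse(["pdf", outPath, ...tabFlag]);
  } finally {
    // Cleanup runs even if loading or PDF generation throws.
    browse(["closetab", ...tabFlag]);
  }
}
```

The `$B js` polling step for Paged.js is left out here; in the described flow it would sit between `load-html` and `pdf`, inside the same `try` block.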
* chore(build): wire make-pdf into build/test/setup/bin + add marked dep
- package.json: compile make-pdf/dist/pdf as part of bun run build; add
"make-pdf" to bin entry; include make-pdf/test/ in the free test pass;
add marked@18.0.2 as a dep (markdown parser, ~40KB).
- setup: add make-pdf/dist/pdf to the Apple Silicon codesign loop.
- .gitignore: add make-pdf/dist/ (matches browse/dist/ and design/dist/).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(make-pdf): matrix copy-paste gate on Ubuntu + macOS
Runs the combined-features P0 gate on pull requests that touch make-pdf/
or browse's PDF surface. Installs poppler (macOS) / poppler-utils (Ubuntu)
per OS. Windows deferred to tolerant mode (Xpdf / Poppler-Windows
extraction variance not yet calibrated against the normalized comparator —
Codex round 2 #18).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(skills): regenerate SKILL.md for make-pdf addition + browse pdf flags
bun run gen:skill-docs picks up:
- the new /make-pdf skill (make-pdf/SKILL.md)
- updated browse command descriptions for 'pdf', 'load-html', 'newtab'
reflecting the new flag contract and --from-file mode
Source of truth stays the .tmpl files + COMMAND_DESCRIPTIONS;
these are regenerated artifacts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(tests): repair stale test expectations + emit _EXPLAIN_LEVEL / _QUESTION_TUNING from preamble
Three pre-existing test failures on main were blocking /ship:
- test/skill-validation.test.ts "Step 3.4 test coverage audit" expected the
literal strings "CODE PATH COVERAGE" and "USER FLOW COVERAGE" which were
removed when the Step 7 coverage diagram was compressed. Updated assertions
to check the stable `Code paths:` / `User flows:` labels that still ship.
- test/skill-validation.test.ts "ship step numbering" allowed-substeps list
didn't include 15.0 (WIP squash) and 15.1 (bisectable commits) which were
added for continuous checkpoint mode. Extended the allowlist.
- test/writing-style-resolver.test.ts and test/plan-tune.test.ts expected
`_EXPLAIN_LEVEL` and `_QUESTION_TUNING` bash variables in the preamble but
generate-preamble-bash.ts had been refactored and those lines were dropped.
Without them, downstream skills can't read `explain_level` or
`question_tuning` config at runtime — terse mode and /plan-tune features
were silently broken.
Added the two bash echo blocks back to generatePreambleBash and refreshed
the golden-file fixtures to match. All three preamble-related golden
baselines (claude/codex/factory) are synchronized with the new output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v1.4.0.0)
New /make-pdf skill + $P binary.
Turn any markdown file into a publication-quality PDF. Default output is
a 1in-margin Helvetica letter with page numbers in the footer. `--cover`
adds a left-aligned cover page, `--toc` generates a clickable table of
contents, `--watermark DRAFT` overlays a diagonal watermark. Copy-paste
extraction from the PDF produces clean words, not "S a i l i n g"
spaced out letter by letter. CI gate (macOS + Ubuntu) runs a combined-
features fixture through pdftotext on every PR.
make-pdf shells out to browse rather than duplicating Playwright.
$B pdf grew into a real PDF engine with full flag contract (--format,
--margins, --header-template, --footer-template, --page-numbers,
--tagged, --outline, --toc, --tab-id, --from-file). $B load-html and
$B js gained --tab-id. $B newtab --json returns structured output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
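The copy-paste guarantee in the changelog entry above (clean words, not "S a i l i n g") can be checked mechanically on extracted text. The following is a minimal sketch, not the repository's actual CI gate: `findLetterSpacedRuns` is a hypothetical helper, and the threshold of four or more single letters is an assumption chosen to avoid flagging short phrases such as "plan A or B".

```typescript
// Hypothetical helper: returns every run of 4+ single letters separated by
// single spaces, which is what letter-by-letter PDF text extraction looks like.
function findLetterSpacedRuns(text: string): string[] {
  // \b keeps the match from starting or ending in the middle of a real word.
  const pattern = /\b(?:[A-Za-z] ){3,}[A-Za-z]\b/g;
  return text.match(pattern) ?? [];
}

// Usage: a gate could fail the build when any run is found in the extracted text.
const extracted = "S a i l i n g on the bay";
const runs = findLetterSpacedRuns(extracted);
if (runs.length > 0) {
  console.log(`letter-spaced extraction detected: ${JSON.stringify(runs)}`);
}
```

In a real gate the input would come from running pdftotext on the generated PDF; that wiring is left out here.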
* docs(changelog): rewrite v1.4.0.0 headline — positive voice, no VC framing
The original headline led with "a PDF you wouldn't be embarrassed to send
to a VC": double-negative voice and audience-too-narrow. /make-pdf works
for essays, letters, memos, reports, proposals, and briefs. Framing the
whole release around founders-to-investors misses the wider audience.
New headline: "Turn any markdown file into a PDF that looks finished."
New tagline: "This one reads like a real essay or a real letter."
Positive voice. Broader aperture. Same energy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8ca950f6f1 |
feat: content security — 4-layer prompt injection defense for pair-agent (#815)
* feat: token registry for multi-agent browser access
Per-agent scoped tokens with read/write/admin/meta command categories, domain glob restrictions, rate limiting, expiry, and revocation. Setup key exchange for the /pair-agent ceremony (5-min one-time key → 24h session token). Idempotent exchange handles tunnel drops. 39 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: integrate token registry + scoped auth into browse server
Server changes for multi-agent browser access:
- /connect endpoint: setup key exchange for /pair-agent ceremony
- /token endpoint: root-only minting of scoped sub-tokens
- /token/:clientId DELETE: revoke agent tokens
- /agents endpoint: list connected agents (root-only)
- /health: strips root token when tunnel is active (P0 security fix)
- /command: scope/rate/domain checks via token registry before dispatch
- Idle timer skips shutdown when tunnel is active
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: ngrok tunnel integration + @ngrok/ngrok dependency
BROWSE_TUNNEL=1 env var starts an ngrok tunnel after Bun.serve(). Reads NGROK_AUTHTOKEN from env or ~/.gstack/ngrok.env. Reads NGROK_DOMAIN for dedicated domain (stable URL). Updates state file with tunnel URL. Feasibility spike confirmed: SDK works in compiled Bun binary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: tab isolation for multi-agent browser access
Add per-tab ownership tracking to BrowserManager. Scoped agents must create their own tab via newtab before writing. Unowned tabs (pre-existing, user-opened) are root-only for writes. Read access always allowed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: tab enforcement + POST /pair endpoint + activity attribution
Server-side tab ownership check blocks scoped agents from writing to unowned tabs. Special-case newtab records ownership for scoped tokens.
POST /pair endpoint creates setup keys for the pairing ceremony. Activity events now include clientId for attribution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: pair-agent CLI command + instruction block generator
One command to pair a remote agent: $B pair-agent. Creates a setup key via POST /pair, prints a copy-pasteable instruction block with curl commands. Smart tunnel fallback (tunnel URL > auto-start > localhost). Flags: --for HOST, --local HOST, --admin, --client NAME.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: tab isolation + instruction block generator tests
14 tests covering tab ownership lifecycle (access checks, unowned tabs, transferTab) and instruction block generator (scopes, URLs, admin flag, troubleshooting section). Fix server-auth test that used fragile sliceBetween boundaries broken by new endpoints.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.15.9.0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: CSO security fixes — token leak, domain bypass, input validation
1. Remove root token from /health endpoint entirely (CSO #1 CRITICAL). Origin header is spoofable. Extension reads from ~/.gstack/.auth.json.
2. Add domain check for newtab URL (CSO #5). Previously only goto was checked, allowing domain-restricted agents to bypass via newtab.
3. Validate scope values, rateLimit, expiresSeconds in createToken() (CSO #4). Rejects invalid scopes and negative values.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /pair-agent skill — syntactic sugar for browser sharing
Users remember /pair-agent, not $B pair-agent. The skill walks through agent selection (OpenClaw, Hermes, Codex, Cursor, generic), local vs remote setup, tunnel configuration, and includes platform-specific notes for each agent type. Wraps the CLI command with context.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remote browser access reference for paired agents
Full API reference, snapshot→@ref pattern, scopes, tab isolation, error codes, ngrok setup, and same-machine shortcuts. The instruction block points here for deeper reading.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: improved instruction block with snapshot→@ref pattern
The paste-into-agent instruction block now teaches the snapshot→@ref workflow (the most powerful browsing pattern), shows the server URL prominently, and uses clearer formatting. Tests updated to match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: smart ngrok detection + auto-tunnel in pair-agent
The pair-agent command now checks ngrok's native config (not just ~/.gstack/ngrok.env) and auto-starts the tunnel when ngrok is available. The skill template walks users through ngrok install and auth if not set up, instead of just printing a dead localhost URL.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: on-demand tunnel start via POST /tunnel/start
pair-agent now auto-starts the ngrok tunnel without restarting the server. New POST /tunnel/start endpoint reads authtoken from env, ~/.gstack/ngrok.env, or ngrok's native config. CLI detects ngrok availability and calls the endpoint automatically. Zero manual steps when ngrok is installed and authed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: pair-agent skill must output the instruction block verbatim
Added CRITICAL instruction: the agent MUST output the full instruction block so the user can copy it. Previously the agent could summarize over it, leaving the user with nothing to paste.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: scoped tokens rejected on /command — auth gate ordering bug
The blanket validateAuth() gate (root-only) sat above the /command endpoint, rejecting all scoped tokens with 401 before they reached getTokenInfo(). Moved /command above the gate so both root and scoped tokens are accepted. This was the bug Wintermute hit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: pair-agent auto-launches headed mode before pairing
When pair-agent detects headless mode, it auto-switches to headed (visible Chromium window) so the user can watch what the remote agent does. Use --headless to skip this. Fixed compiled binary path resolution (process.execPath, not process.argv[1] which is virtual /$bunfs/ in Bun compiled binaries).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: comprehensive tests for auth ordering, tunnel, ngrok, headed mode
16 new tests covering:
- /command sits above blanket auth gate (Wintermute bug)
- /command uses getTokenInfo not validateAuth
- /tunnel/start requires root, checks native ngrok config, returns already_active
- /pair creates setup keys not session tokens
- Tab ownership checked before command dispatch
- Activity events include clientId
- Instruction block teaches snapshot→@ref pattern
- pair-agent auto-headed mode, process.execPath, --headless skip
- isNgrokAvailable checks all 3 sources (gstack env, env var, native config)
- handlePairAgent calls /tunnel/start not server restart
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: chain scope bypass + /health info leak when tunneled
1. Chain command now pre-validates ALL subcommand scopes before executing any. A read+meta token can no longer escalate to admin via chain (eval, js, cookies were dispatched without scope checks). tokenInfo flows through handleMetaCommand into the chain handler. Rejects entire chain if any subcommand fails.
2. /health strips sensitive fields (currentUrl, agent.currentMessage, session) when tunnel is active. Only operational metadata (status, mode, uptime, tabs) exposed to the internet. Previously anyone reaching the ngrok URL could surveil browsing activity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: tout /pair-agent as headline feature in CHANGELOG + README
Lead with what it does for the user: type /pair-agent, paste into your other agent, done. First time AI agents from different companies can coordinate through a shared browser with real security boundaries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: expand /pair-agent, /design-shotgun, /design-html in README
Each skill gets a real narrative paragraph explaining the workflow, not just a table cell. design-shotgun: visual exploration with taste memory. design-html: production HTML with Pretext computed layout. pair-agent: cross-vendor AI agent coordination through shared browser.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: split handleCommand into handleCommandInternal + HTTP wrapper
Chain subcommands now route through handleCommandInternal for full security enforcement (scope, domain, tab ownership, rate limiting, content wrapping). Adds recursion guard for nested chains, rate-limit exemption for chain subcommands, and activity event suppression (1 event per chain, not per sub).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add content-security.ts with datamarking, envelope, and filter hooks
Four-layer prompt injection defense for pair-agent browser sharing:
- Datamarking: session-scoped watermark for text exfiltration detection
- Content envelope: trust boundary wrapping with ZWSP marker escaping
- Content filter hooks: extensible filter pipeline with warn/block modes
- Built-in URL blocklist: requestbin, pipedream, webhook.site, etc.
BROWSE_CONTENT_FILTER env var controls mode: off|warn|block (default: warn)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: centralize content wrapping in handleCommandInternal response path
Single wrapping location replaces fragmented per-handler wrapping:
- Scoped tokens: content filters + datamarking + enhanced envelope
- Root tokens: existing basic wrapping (backward compat)
- Chain subcommands exempt from top-level wrapping (wrapped individually)
- Adds 'attrs' to PAGE_CONTENT_COMMANDS (ARIA value exposure defense)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: hidden element stripping for scoped token text extraction
Detects CSS-hidden elements (opacity, font-size, off-screen, same-color, clip-path) and ARIA label injection patterns. Marks elements with data-gstack-hidden, extracts text from a clean clone (no DOM mutation), then removes markers. Only active for scoped tokens on text command.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: snapshot split output format for scoped tokens
Scoped tokens get a split snapshot: trusted @refs section (for click/fill) separated from untrusted web content in an envelope. Ref names truncated to 50 chars in trusted section. Root tokens unchanged (backward compat). Resume command also uses split format for scoped tokens.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add SECURITY section to pair-agent instruction block
Instructs remote agents to treat content inside untrusted envelopes as potentially malicious. Lists common injection phrases to watch for. Directs agents to only use @refs from the trusted INTERACTIVE ELEMENTS section, not from page content.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add 4 prompt injection test fixtures
- injection-visible.html: visible injection in product review text
- injection-hidden.html: 7 CSS hiding techniques + ARIA injection + false positive
- injection-social.html: social engineering in legitimate-looking content
- injection-combined.html: all attack types + envelope escape attempt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: comprehensive content security tests (47 tests)
Covers all 4 defense layers:
- Datamarking: marker format, session consistency, text-only application
- Content envelope: wrapping, ZWSP marker escaping, filter warnings
- Content filter hooks: URL blocklist, custom filters, warn/block modes
- Instruction block: SECURITY section content, ordering, generation
- Centralized wrapping: source-level verification of integration
- Chain security: recursion guard, rate-limit exemption, activity suppression
- Hidden element stripping: 7 CSS techniques, ARIA injection, false positives
- Snapshot split format: scoped vs root output, resume integration
Also fixes: visibility:hidden detection, case-insensitive ARIA pattern matching.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: pair-agent skill compliance + fix all 16 pre-existing test failures
Root cause: pair-agent was added without completing the gen-skill-docs compliance checklist. All 16 failures traced back to this.
Fixes:
- Sync package.json version to VERSION (0.15.9.0)
- Add "(gstack)" to pair-agent description for discoverability
- Add pair-agent to Codex path exception (legitimately documents ~/.codex/)
- Add CLI_COMMANDS (status, pair-agent, tunnel) to skill parser allowlist
- Regenerate SKILL.md for all hosts (claude, codex, factory, kiro, etc.)
- Update golden file baselines for ship skill
- Fix relink tests: pass GSTACK_INSTALL_DIR to auto-relink calls so they use the fast mock install instead of scanning real ~/.claude/skills/gstack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.15.12.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: E2E exit reason precedence + worktree prune race condition
Two fixes for E2E test reliability:
1. session-runner.ts: error_max_turns was misclassified as error_api because is_error flag was checked before subtype. Now known subtypes like error_max_turns are preserved even when is_error is set. The is_error override only applies when subtype=success (API failure).
2. worktree.ts: pruneStale() now skips worktrees < 1 hour old to avoid deleting worktrees from concurrent test runs still in progress. Previously any second test execution would kill the first's worktrees.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: restore token in /health for localhost extension auth
The CSO security fix stripped the token from /health to prevent leaking when tunneled. But the extension needs it to authenticate on localhost. Now returns token only when not tunneled (safe: localhost-only path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: verify /health token is localhost-only, never served through tunnel
Updated tests to match the restored token behavior:
- Test 1: token assignment exists AND is inside the !tunnelActive guard
- Test 1b: tunnel branch (else block) does not contain AUTH_TOKEN
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add security rationale for token in /health on localhost
Explains why this is an accepted risk (no escalation over file-based token access), CORS protection, and tunnel guard. Prevents future CSO scans from stripping it without providing an alternative auth path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: verify tunnel is alive before returning URL to pair-agent
Root cause: when ngrok dies externally (pkill, crash, timeout), the server still reports tunnelActive=true with a dead URL. pair-agent prints an instruction block pointing at a dead tunnel. The remote agent gets "endpoint offline" and the user has to manually restart everything.
Three-layer fix:
- Server /pair endpoint: probes tunnel URL before returning it. If dead, resets tunnelActive/tunnelUrl and returns null (triggers CLI restart).
- Server /tunnel/start: probes cached tunnel before returning already_active. If dead, falls through to restart ngrok automatically.
- CLI pair-agent: double-checks tunnel URL from server before printing instruction block. Falls through to auto-start on failure.
4 regression tests verify all three probe points + CLI verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add POST /batch endpoint for multi-command batching
Remote agents controlling GStack Browser through a tunnel pay 2-5s of latency per HTTP round-trip. A typical "navigate and read" takes 4 sequential commands = 10-20 seconds. The /batch endpoint collapses N commands into a single HTTP round-trip, cutting a 20-tab crawl from ~60s to ~5s. Sequential execution through the full security pipeline (scope, domain, tab ownership, content wrapping). Rate limiting counts the batch as 1 request. Activity events emitted at batch level, not per-command. Max 50 commands per batch. Nested batches rejected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add source-level security tests for /batch endpoint
8 tests verifying: auth gate placement, scoped token support, max command limit, nested batch rejection, rate limiting bypass, batch-level activity events, command field validation, and tabId passthrough.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: correct CHANGELOG date from 2026-04-06 to 2026-04-05
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: consolidate Hermes into generic HTTP option in pair-agent
Hermes doesn't have a host-specific config — it uses the same generic curl instructions as any other agent. Removing the dedicated option simplifies the menu and eliminates a misleading distinction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump VERSION to 0.15.14.0, add CHANGELOG entry for batch endpoint
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate pair-agent/SKILL.md after main merge
Vendoring deprecation section from main's template wasn't reflected in the generated file. Fixes check-freshness CI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: checkTabAccess uses options object, add own-only tab policy
Refactors checkTabAccess(tabId, clientId, isWrite) to use an options object { isWrite?, ownOnly? }. Adds tabPolicy === 'own-only' support in the server command dispatch — scoped tokens with this policy are restricted to their own tabs for all commands, not just writes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add --domain flag to pair-agent CLI for domain restrictions
Allows passing --domain to pair-agent to restrict the remote agent's navigation to specific domains (comma-separated).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: remove batch commands CHANGELOG entry and VERSION bump
The batch endpoint work belongs on the browser-batch-multitab branch (port-louis), not this branch. Reverting VERSION to 0.15.14.0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: adopt main's headed-mode /health token serving
Our merge kept the old !tunnelActive guard which conflicted with main's security-audit-r2 tests that require no currentUrl/currentMessage in /health. Adopts main's approach: serve token conditionally based on headed mode or chrome-extension origin. Updates server-auth tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: improve snapshot flags docs completeness for LLM judge
Adds $B placeholder explanation, explicit syntax line, and detailed flag behavior (-d depth values, -s CSS selector syntax, -D unified diff format and baseline persistence, -a screenshot vs text output relationship). Fixes snapshot flags reference LLM eval scoring completeness < 4.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
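The chain scope fix in the commit above ("pre-validates ALL subcommand scopes before executing any") follows a validate-then-execute pattern. The sketch below illustrates that pattern only; it is not the code from the browse server. `runChain`, `COMMAND_SCOPE`, and the `text`/`click` entries are hypothetical, and only the admin classification of `eval`, `js`, and `cookies` comes from the commit message.

```typescript
type Scope = "read" | "write" | "admin" | "meta";

// Hypothetical command-to-scope table. Only the three admin entries are
// named in the commit message; the rest are assumptions for illustration.
const COMMAND_SCOPE: Record<string, Scope> = {
  text: "read",
  click: "write",
  eval: "admin",
  js: "admin",
  cookies: "admin",
};

type ChainResult =
  | { ok: true; outputs: string[] }
  | { ok: false; rejected: string };

function runChain(
  tokenScopes: ReadonlySet<Scope>,
  subcommands: string[],
  execute: (cmd: string) => string,
): ChainResult {
  // Pass 1: check every subcommand before running any of them, so a
  // rejected chain has no side effects from its earlier subcommands.
  for (const cmd of subcommands) {
    const required = COMMAND_SCOPE[cmd];
    if (required === undefined || !tokenScopes.has(required)) {
      return { ok: false, rejected: cmd };
    }
  }
  // Pass 2: every check passed, so the chain runs in order.
  return { ok: true, outputs: subcommands.map((cmd) => execute(cmd)) };
}
```

With a read+meta token, a chain of `text` followed by `eval` is rejected as a whole, and `text` never runs.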
||
|
|
cd66fc2f89 |
fix: 6 critical fixes + community PR guardrails (v0.13.2.0) (#602)
* fix(security): commit bun.lock to pin dependency versions
Remove bun.lock from .gitignore and commit the lockfile. Every bun install now uses exact pinned versions instead of resolving floating ^ ranges from npm fresh. Closes the supply-chain vector from #566.
Co-Authored-By: boinger <boinger@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gstack-slug falls back to dirname/unknown when git context is absent
Add || true to git commands and fallback defaults so gstack-slug works outside git repos. Prevents unbound variable crash that kills every review skill when no git context exists.
Co-Authored-By: collinstraka-clov <collinstraka-clov@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: setup auto-selects default after 10s timeout to prevent CI hangs
Add -t 10 to the read command in the skill-prefix prompt. In CI, Docker, and Conductor workspaces where a TTY exists but nobody is watching, the prompt now auto-selects short names after 10 seconds instead of blocking forever.
Co-Authored-By: stedfn <stedfn@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: browse CLI Windows lockfile — use string flag instead of numeric constants
Bun compiled binaries on Windows don't handle numeric fs.constants correctly. The string flag 'wx' is semantically identical to O_CREAT | O_EXCL | O_WRONLY per Node docs and works on all platforms.
Fixes #599
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add ~/.gstack/projects/ to plan file search path
/office-hours writes design docs to ~/.gstack/projects/$SLUG/ but /ship and /review only searched ~/.claude/plans, ~/.codex/plans, and .gstack/plans. Add the project-scoped directory as the first search location so plan validation finds design docs created by the standard workflow.
Fixes #591
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: autoplan dual-voice — sequential foreground execution instead of broken parallel
Background subagents don't inherit tool permissions in Claude Code, so the Claude subagent in dual-voice mode was silently failing on every invocation. Every autoplan run was degrading to single-reviewer mode without warning. Change all three phases (CEO, Design, Eng) from "simultaneously" to sequential foreground execution: Claude subagent first (Agent tool, foreground), then Codex (Bash). Both complete before the consensus table.
Fixes #497
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate SKILL.md files from updated templates
Regenerated from autoplan/SKILL.md.tmpl (dual-voice fix) and scripts/resolvers/review.ts (plan search path fix).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add community PR guardrails — protect ETHOS.md and voice
Add explicit CLAUDE.md rule requiring AskUserQuestion before accepting any community PR that touches ETHOS.md, removes promotional material, or changes Garry's voice. No exceptions, no auto-merging.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.13.2.0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gen-skill-docs detects symlink loop, skips codex write that overwrites Claude SKILL.md
When .agents/skills/gstack is symlinked to the repo root (vendored dev mode), gen-skill-docs --host codex was writing the Codex-transformed SKILL.md through the symlink, overwriting the Claude version. This caused SKILL.md and agents/openai.yaml to silently revert to Codex paths after every build. Now detects when the codex output path resolves to the same real file as the Claude output and skips the write. Content is still generated for token budget tracking. The openai.yaml write is also skipped for the same symlink case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: resolve all 7 test failures — version sync, zsh glob guard, symlink-aware codex tests
1. package.json version synced with VERSION file (0.13.3.0)
2. design-shotgun/SKILL.md.tmpl: added setopt +o nomatch guard to bash block with variant-*.png glob
3. Codex generation tests: skip skills where .agents/skills/{name} is a symlink back to repo root (vendored dev mode). These can't have proper codex content since gen-skill-docs skips the write to avoid overwriting the Claude SKILL.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: boinger <boinger@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: collinstraka-clov <collinstraka-clov@users.noreply.github.com>
Co-authored-by: stedfn <stedfn@users.noreply.github.com>
|
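The Windows lockfile fix in the commit above replaces numeric open flags with the string flag 'wx', which creates the file and fails if it already exists. A minimal sketch of an exclusive lockfile built on that flag follows. `tryAcquireLock` and the `browse.lock` file name are hypothetical and are not taken from the browse CLI source.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: returns true if this call created the lockfile,
// false if the lockfile already existed.
function tryAcquireLock(lockPath: string): boolean {
  try {
    // 'wx' opens for writing, creates the file, and fails with EEXIST
    // if the path already exists.
    const fd = fs.openSync(lockPath, "wx");
    fs.writeSync(fd, String(process.pid));
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err; // any other failure (permissions, missing directory) is a real error
  }
}

// Usage: the first caller wins, the second sees the existing file.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-demo-"));
const lockPath = path.join(lockDir, "browse.lock");
console.log(tryAcquireLock(lockPath));
console.log(tryAcquireLock(lockPath));
```

Because the existence check and the creation happen in one open call, two processes racing for the same path cannot both succeed.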