Commit Graph

5 Commits

Author SHA1 Message Date
Garry Tan 756525100c chore: regenerate SKILL.md files for gbrain-sync preamble block
Re-runs bun run gen:skill-docs after adding generateBrainSyncBlock
to scripts/resolvers/preamble.ts in a2aa8a07. CI check-freshness
caught the drift. All 36 SKILL.md files regenerated with the new
skill-start bash block + privacy-gate prose + skill-end sync
instructions baked in.
2026-04-22 13:58:41 -07:00
Garry Tan 656df0e37e feat(v1.5.2.0): Opus 4.7 migration — model overlay, voice, routing (#1117)
* feat(v1.5.2.0): Opus 4.7 migration — model overlay, voice, routing

Adapts GStack skill text for Claude Opus 4.7's behavioral changes per
Anthropic's migration guide and community findings.

Key changes:

model-overlays/claude.md:
  - Fan out explicitly (4.7 spawns fewer subagents by default)
  - Effort-match the step (avoid overthinking simple tasks at max)
  - Batch questions in one AskUserQuestion turn
  - Literal interpretation awareness (deliver full scope)

hosts/claude.ts:
  - coAuthorTrailer updated to Claude Opus 4.7

SKILL.md.tmpl:
  - Expanded routing triggers with colloquial variants ("wtf",
    "this doesn't work", "send it", "where was I") — 4.7 won't
    generalize from sparse trigger patterns like 4.6 did
  - Added missing routes: /context-save, /context-restore, /cso, /make-pdf
  - Changed routing fallback from strict "do NOT answer directly" to
    "when in doubt, invoke the skill" — false positives are cheaper
    than false negatives on 4.7's literal interpreter

generate-voice-directive.ts:
  - Added concrete good/bad voice example — 4.7 needs shown examples,
    not just described tone. "auth.ts:47 returns undefined..." vs
    "I've identified a potential issue..."

Regenerated all 38 SKILL.md files. All tests pass.

* refactor(opus-4.7): split overlay, align routing, fix trailer fallback

Follow-up to wintermute's initial Opus 4.7 migration commit (addresses
ship-quality review findings before v1.6.1.0 release).

Overlay split (model-overlays/):
  - Move 4 Opus-4.7-specific nudges (Fan out, Effort-match, Batch your
    questions, Literal interpretation) from claude.md into new
    opus-4-7.md with {{INHERIT:claude}}
  - claude.md now holds only model-agnostic nudges (Todo discipline,
    Think before heavy, Dedicated tools over Bash)
  - Prevents Opus-4.7-specific guidance leaking onto Sonnet/Haiku
  - Uses existing {{INHERIT:claude}} mechanism at
    scripts/resolvers/model-overlay.ts:28-43

scripts/models.ts:
  - Add opus-4-7 to ALL_MODEL_NAMES
  - resolveModel: claude-opus-4-7-* variants route to opus-4-7,
    all other claude-* variants continue to route to claude

scripts/resolvers/utility.ts:
  - Update coAuthor trailer fallback: Opus 4.6 -> Opus 4.7
    (fallback was missed in the initial migration commit)

scripts/resolvers/preamble/generate-routing-injection.ts:
  - Align policy with new SKILL.md.tmpl: soft "when in doubt, invoke"
    instead of hard "ALWAYS invoke... Do NOT answer directly"
  - Replace stale /checkpoint reference with /context-save +
    /context-restore (skills were renamed in v1.0.1.0)
  - Expand route coverage to match full skill inventory:
    /plan-devex-review, /qa-only, /devex-review, /land-and-deploy,
    /setup-deploy, /canary, /open-gstack-browser,
    /setup-browser-cookies, /benchmark, /learn, /plan-tune, /health

scripts/resolvers/preamble/generate-voice-directive.ts:
  - Voice example closing: "Want me to ship it?" -> "Want me to fix it?"
  - Preserves directness while routing through review gates

SKILL.md.tmpl:
  - Add routing triggers for skills that were missing from the list:
    /plan-devex-review, /qa-only, /devex-review, /land-and-deploy,
    /setup-deploy, /canary, /open-gstack-browser,
    /setup-browser-cookies, /benchmark, /learn, /plan-tune, /health
  - Within Opus 4.7 overlay, added scope boundary to
    "Literal interpretation" nudge ("fix tests that this branch
    introduced or is responsible for")
  - Added pacing exception to "Batch your questions" nudge so skills
    that require one-question-at-a-time pacing still win

Follow-up commit will regenerate SKILL.md files + update goldens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(opus-4.7): regenerate SKILL.md files + update golden fixtures

Mechanical consequence of the preceding source changes (overlay split,
routing alignment, voice example, routing expansion). No behavior change
beyond what that commit introduced.

- 36 SKILL.md files regenerated via bun run gen:skill-docs
- 3 golden fixtures updated (claude, codex, factory ship skill)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(routing): assert slash-prefixed skills + new policy + current names

Align gen-skill-docs.test.ts routing assertions with the remediated
routing-injection output:

- Expect '/office-hours' slash-prefixed form (matches SKILL.md.tmpl style)
- Add test asserting /context-save + /context-restore references
  (guards against stale '/checkpoint' name regression)
- Add test asserting "When in doubt, invoke the skill" soft policy
  (guards against "Do NOT answer directly" hard policy regression)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(binary-guard): replace xargs-per-file loops with fs.statSync + mode filter

The "no compiled binaries in git" describe block had two flaky tests:

- "git tracks no files larger than 2MB" timed out at 5s regularly because
  it spawned one `sh -c` per tracked file via `xargs -I{}` (~571 shells
  on every run, ~11s locally).
- "git tracks no Mach-O or ELF binaries" ran `file --mime-type` over every
  tracked file (~3-10s, flaky near the timeout).

Both were pre-existing — not caused by any recent change — but showed up
as red in every local `bun test` run and masked legit failures in the
same suite.

Rewrites:

- 2MB test: `fs.statSync(f).size` in a filter. Millisecond-fast.
- Mach-O test: pre-filter to mode 100755 files via `git ls-files -s`,
  then batch-invoke `file --mime-type` once across all executables.
  With zero executables tracked, the `file` invocation is skipped.

Test suite: 320 pass, 0 fail, 907ms (was ~12.7s with 2 fails).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(team-mode): give setup -q / setup --local tests a 3-minute budget

./setup runs a full install, Bun binary build, and skill regeneration.
On a cold cache it takes 60-90s, comfortably above bun test's 5s default.
Both "setup -q produces no stdout" and "setup --local prints deprecation
warning" have been flaky-to-failing for a while with [5001.78ms] timeouts.

The test logic was fine, the budget wasn't. Bumped both to 180s via the
third-arg timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(opus-4.7): E2E eval for fanout rate + routing precision

Closes the measurement gap flagged by the ship-quality review: "zero
tests exercise Opus 4.7 behavior; every skill-e2e hardcodes 4.6."

Two cases, both pinned to claude-opus-4-7:

1. Fanout rate (A/B)
   - Arm A: regen SKILL.md with --model opus-4-7 (overlay ON, includes
     "Fan out explicitly" nudge).
   - Arm B: regen SKILL.md with --model claude (overlay OFF, only
     model-agnostic nudges).
   - Prompt: "Read alpha.txt, beta.txt, gamma.txt. These are independent."
   - Measure: parallel tool calls in first assistant turn.
   - Assert: arm A >= arm B.

2. Routing precision (6-case mini-benchmark)
   - 3 positive prompts that should route (wtf bug, send it, does it work)
   - 3 negative prompts that match keywords but should NOT route
     (syntax question, algorithm question, slack message)
   - Assert: TP rate >= 66%, FP rate <= 33%.

Cost estimate: ~$3-5 per full run. Classified as periodic tier per
CLAUDE.md convention (Opus model, non-deterministic). Runs only with
EVALS=1 env var, touchfile-gated so unrelated diffs don't trigger it.

Test plan artifact at
~/.gstack/projects/garrytan-gstack/garrytan-feat-opus-4.7-migration-eng-review-test-plan-20260421-230611.md
tracks the full specification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(opus-4.7): rewrite fanout nudge to show parallel tool_use pattern

The original fanout nudge told 4.7 to "spawn subagents in the same turn"
and "run independent checks concurrently" in prose. An E2E eval on
claude-opus-4-7 reading 3 independent files showed zero effect: both
overlay-ON and overlay-OFF arms emitted serial Reads across 3-4 turns.

Rewrite follows the same "show not tell" principle the PR introduced for
voice examples. The nudge now includes a concrete wrong/right contrast
showing the exact tool_use structure:

  Wrong (3 turns):
    Turn 1: Read(foo.ts), then wait
    Turn 2: Read(bar.ts), then wait
    Turn 3: Read(baz.ts)

  Right (1 turn, 3 parallel tool_use blocks in one assistant message):
    Turn 1: [Read(foo.ts), Read(bar.ts), Read(baz.ts)]

Applies to Read, Bash, Grep, Glob, WebFetch, Agent, and any tool where
sub-calls don't depend on each other's output.

Effect on test/skill-e2e-opus-47.test.ts fanout eval: unchanged (both
arms still 0 parallel in first turn via `claude -p`). May land better in
Claude Code's interactive harness, where the system prompt + tool
handlers differ. Tracked as P0 TODO for follow-up verification in the
correct harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(opus-4.7): tighten ambiguous /qa routing prompt

"does this feature work on mobile? can you check the deploy?" was too
vague — a reasonable agent asks "which feature?" via AskUserQuestion
instead of routing to /qa. That's not a routing miss, it's an under-
specified prompt.

Replaced with "I just pushed the login flow changes. Test the deployed
site and find any bugs." — concrete subject + clear QA verb.

Result: pos-does-it-work went from MISS to OK, routing TP rate 2/3 -> 3/3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(opus-4.7): rewrite scratch-root helper + add afterAll cleanup

First run of the Opus 4.7 eval exposed two test-setup gaps that made
results misleading:

- Only the root gstack SKILL.md was installed. Claude Code does
  auto-discovery per-directory under .claude/skills/{name}/SKILL.md, so
  without individual skill dirs the Skill tool had nothing to route to.
  Positive routing cases all failed.
- `claude -p` does not load SKILL.md content as system context the way
  the Claude Code harness does. The overlay nudges in SKILL.md were
  invisible to the model, so the fanout A/B could not actually differ.

New `mkEvalRoot(suffix, includeOverlay)` helper, modelled on the pattern
in skill-routing-e2e.test.ts:

- Installs per-skill SKILL.md under .claude/skills/ for ~14 key skills
  so the Skill tool has discoverable targets.
- Writes an explicit routing block into project CLAUDE.md.
- When includeOverlay is true, inlines the content of
  model-overlays/opus-4-7.md into CLAUDE.md too. This is what makes the
  fanout A/B observable in `claude -p`: arm ON gets the overlay in
  context, arm OFF does not.

Plus an afterAll that re-runs gen-skill-docs at the default model so
the working tree is not left with opus-4-7-generated SKILL.md files
after the eval finishes (would break golden-file tests in the next
`bun test` run otherwise).

With this setup in place: routing went from 3/3 FAIL to 3/3 PASS
(correct skill or clarification in every positive case, zero false
positives on negatives). Fanout A/B is now a fair comparison; still
shows 0 parallel in both arms under `claude -p` (tracked as a P0 TODO
for re-measurement inside Claude Code's harness, where fanout may land
differently).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): verify Opus 4.7 fanout nudge in Claude Code harness (P0)

v1.6.1.0 shipped a rewritten "Fan out explicitly" nudge with a concrete
tool_use example. Under `claude -p` on claude-opus-4-7, the A/B eval
showed zero parallel tool calls in the first turn for both arms
(overlay ON and OFF). Routing verified 3/3 in the same harness, so the
gap is specific to fanout and likely to `claude -p`'s system prompt +
tool wiring.

This TODO closes the measurement loop the ship-quality review flagged:
re-run the fanout A/B inside Claude Code's real harness (or a faithful
replica) before landing another Opus migration claim.

P0 because it is a ship-quality commitment from the v1.6.1.0 release
notes, not a nice-to-have.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): v1.6.1.0 — Opus 4.7 migration, reviewed

Bump VERSION + package.json from 1.6.0.0 to 1.6.1.0. New CHANGELOG
entry describing the ship-quality remediation of PR #1117:

- Overlay split (model-agnostic claude.md + opus-4-7.md with INHERIT)
- Routing-injection aligned with SKILL.md.tmpl ("when in doubt" policy,
  current skill names, full skill inventory)
- utility.ts trailer fallback updated
- Voice example closes through review gate instead of ship-bypass
- Literal-interpretation nudge bounded to branch scope
- Batch-questions nudge has explicit pacing exception
- First Opus 4.7 eval: routing verified 3/3, fanout A/B unverified
  under `claude -p` (tracked as P0 TODO for next rev)
- Pre-existing test failures fixed: fs.statSync binary guard, 180s
  setup timeout, golden-file updates

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(opus-4.7): key touchfile entries by testName, not describe text

TOUCHFILES completeness scan in test/touchfiles.test.ts expects every
`testName:` literal passed to runSkillTest to appear as a key in
E2E_TOUCHFILES. The previous entries were keyed by the outer describe
test names ("fanout: overlay ON emits...") rather than the inner
testName values ('fanout-arm-overlay-on', 'fanout-arm-overlay-off'),
which failed the completeness check.

Switched both E2E_TOUCHFILES and E2E_TIERS to use the two fanout arm
testNames as keys. The routing sub-tests use a template literal
(`routing-${c.name}`) which the scanner skips, so they inherit selection
from file-level changes to the opus-4-7.md / routing-injection.ts paths
already covered by the fanout entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: gstack <ship@gstack.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 01:06:22 -07:00
Garry Tan 54d4cde773 security: tunnel dual-listener + SSRF + envelope + path wave (v1.6.0.0) (#1137)
* refactor(security): loosen /connect rate limit from 3/min to 300/min

Setup keys are 24 random bytes (unbruteforceable), so a tight rate limit
does not meaningfully prevent key guessing. It exists only to cap
bandwidth, CPU, and log-flood damage from someone who discovered the
ngrok URL. A legitimate pair-agent session hits /connect once; 300/min
is 60x that pattern and never hit accidentally.

3/min caused pairing to fail on any retry flow (network blip, second
paired client) with no upside. Per-IP tracking was considered and
rejected — adds a bounded Map + LRU for defense already adequate at the
global layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): add tunnel-denial-log module for attack visibility

Append-only log of tunnel-surface auth denials to
~/.gstack/security/attempts.jsonl. Gives operators visibility into who
is probing tunneled daemons so the next security wave can be driven by
real attack data instead of speculation.

Design notes:
- Async via fs.promises.appendFile. Never appendFileSync — blocking the
  event loop on every denial during a flood is what an attacker wants
  (prior learning: sync-audit-log-io, 10/10 confidence).
- In-process rate cap at 60 writes/minute globally. Excess denials are
  counted in memory but not written to disk — prevents disk DoS.
- Writes to the same ~/.gstack/security/attempts.jsonl used by the
  prompt-injection attempt log. File rotation is handled by the existing
  security pipeline (10MB, 5 generations).

No consumers in this commit; wired up in the dual-listener refactor that
follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): dual-listener tunnel architecture

The /health endpoint leaked AUTH_TOKEN to any caller that hit the ngrok
URL (spoofing chrome-extension:// origin, or catching headed mode).
Surfaced by @garagon in PR #1026; the original fix was header-inference
on the single port. Codex's outside-voice review during /plan-ceo-review
called that approach brittle (ngrok header behavior could change, local
proxies would false-positive), and pushed for the structural fix.

This is that fix. Stop making /health a root-token bootstrap endpoint on
any surface the tunnel can reach. The server now binds two HTTP
listeners when a tunnel is active. The local listener (extension, CLI,
sidebar) stays on 127.0.0.1 and is never exposed to ngrok. ngrok
forwards only to the tunnel listener, which serves only /connect
(unauth, rate-limited) and /command with a locked allowlist of
browser-driving commands. Security property comes from physical port
separation, not from header inference — a tunnel caller cannot reach
/health or /cookie-picker or /inspector because they live on a
different TCP socket.

What this commit adds to browse/src/server.ts:
  * Surface type ('local' | 'tunnel') and TUNNEL_PATHS +
    TUNNEL_COMMANDS allowlists near the top of the file.
  * makeFetchHandler(surface) factory replacing the single fetch arrow;
    closure-captures the surface so the filter that runs before route
    dispatch knows which socket accepted the request.
  * Tunnel filter at dispatch entry: 404s anything not on TUNNEL_PATHS,
    403s root-token bearers with a clear pairing hint, 401s non-/connect
    requests that lack a scoped token. Every denial is logged via
    logTunnelDenial (from tunnel-denial-log).
  * GET /connect alive probe (unauth on both surfaces) so /pair and
    /tunnel/start can detect dead ngrok tunnels without reaching
    /health — /health is no longer tunnel-reachable.
  * Lazy tunnel listener lifecycle. /tunnel/start binds a dedicated
    Bun.serve on an ephemeral port, points ngrok.forward at THAT port
    (not the local port), hard-fails on bind error (no local fallback),
    tears down cleanly on ngrok failure. BROWSE_TUNNEL=1 startup uses
    the same pattern.
  * closeTunnel() helper — single teardown path for both the ngrok
    listener and the tunnel Bun.serve listener.
  * resolveNgrokAuthtoken() helper — shared authtoken lookup across
    /tunnel/start and BROWSE_TUNNEL=1 startup (was duplicated).
  * TUNNEL_COMMANDS check in /command dispatch: on the tunnel surface,
    commands outside the allowlist return 403 with a list of allowed
    commands as a hint.
  * Probe paths in /pair and /tunnel/start migrated from /health to
    GET /connect — the only unauth path reachable on the tunnel surface
    under the new architecture.

Test updates in browse/test/server-auth.test.ts:
  * /pair liveness-verify test: assert via closeTunnel() helper instead
    of the inline `tunnelActive = false; tunnelUrl = null` lines that
    the helper subsumes.
  * /tunnel/start cached-tunnel test: same closeTunnel() adaptation.

Credit
  Derived from PR #1026 by @garagon — thanks for flagging the critical
  bug that drove the architectural rewrite. The per-request
  isTunneledRequest approach from #1026 is superseded by physical port
  separation here; the underlying report remains the root cause for the
  entire v1.6.0.0 wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(security): add source-level guards for dual-listener architecture

23 source-level assertions that keep future contributors from silently
widening the tunnel surface during a routine refactor. Covers:

  * Surface type + tunnelServer state variable shape
  * TUNNEL_PATHS is a closed set of /connect, /command, /sidebar-chat
    (and NOT /health, /welcome, /cookie-picker, /inspector/*, /pair,
    /token, /refs, /activity/stream, /tunnel/{start,stop})
  * TUNNEL_COMMANDS includes browser-driving ops only (and NOT
    launch-browser, tunnel-start, token-mint, cookie-import, etc.)
  * makeFetchHandler(surface) factory exists and is wired to both
    listeners with the correct surface parameter
  * Tunnel filter runs BEFORE any route dispatch, with 404/403/401
    responses and logged denials for each reason
  * GET /connect returns {alive: true} unauth
  * /command dispatch enforces TUNNEL_COMMANDS on tunnel surface
  * closeTunnel() helper tears down ngrok + Bun.serve listener
  * /tunnel/start binds on ephemeral port, points ngrok at TUNNEL_PORT
    (not local port), hard-fails on bind error (no fallback), probes
    cached tunnel via GET /connect (not /health), tears down on
    ngrok.forward failure
  * BROWSE_TUNNEL=1 startup uses the dual-listener pattern
  * logTunnelDenial wired for all three denial reasons
  * /connect rate limit is 300/min, not 3/min

All 23 tests pass. Behavioral integration tests (spawn subprocess, real
network) live in the E2E suite that lands later in this wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* security: gate download + scrape through validateNavigationUrl (SSRF)

The `goto` command was correctly wired through validateNavigationUrl,
but `download` and `scrape` called page.request.fetch(url, ...) directly.
A caller with the default write scope could hit the /command endpoint
and ask the daemon to fetch http://169.254.169.254/latest/meta-data/
(AWS IMDSv1) or the GCP/Azure/internal equivalents. The response body
comes back as base64 or lands on disk where GET /file serves it.

Fix: call validateNavigationUrl(url) immediately before each
page.request.fetch() call site in download and in the scrape loop.
Same blocklist that already protects `goto`: file://, javascript:,
data:, chrome://, cloud metadata (IPv4 all encodings, IPv6 ULA,
metadata.*.internal).

Tests: extend browse/test/url-validation.test.ts with a source-level
guard that walks every `await page.request.fetch(` call site and
asserts a validateNavigationUrl call precedes it within the same
branch. Regression trips before code review if a future refactor
drops the gate.

* security: route splitForScoped through envelope sentinel escape

The scoped-token snapshot path in snapshot.ts built its untrusted
block by pushing the raw accessibility-tree lines between the literal
`═══ BEGIN UNTRUSTED WEB CONTENT ═══` / `═══ END UNTRUSTED WEB CONTENT ═══`
sentinels. The full-page wrap path in content-security.ts already
applied a zero-width-space escape on those exact strings to prevent
sentinel injection, but the scoped path skipped it.

Net effect: a page whose rendered text contains the literal sentinel
can close the envelope early from inside untrusted content and forge
a fake "trusted" block for the LLM. That includes fabricating
interactive `@eN` references the agent will act on.

Fix:
  * Extract the zero-width-space escape into a named, exported helper
    `escapeEnvelopeSentinels(content)` in content-security.ts.
  * Have `wrapUntrustedPageContent` call it (behavior unchanged on
    that path — same bytes out).
  * Import the helper in snapshot.ts and map it over `untrustedLines`
    in the `splitForScoped` branch before pushing the BEGIN sentinel.

Tests: add a describe block in content-security.test.ts that covers
  * `escapeEnvelopeSentinels` defuses BEGIN and END markers;
  * `escapeEnvelopeSentinels` leaves normal text untouched;
  * `wrapUntrustedPageContent` still emits exactly one real envelope
    pair when hostile content contains forged sentinels;
  * snapshot.ts imports the helper;
  * the scoped-snapshot branch calls `escapeEnvelopeSentinels` before
    pushing the BEGIN sentinel (source-level regression — if a future
    refactor reorders this, the test trips).

* security: extend hidden-element detection to all DOM-reading channels

The Confusion Protocol envelope wrap (`wrapUntrustedPageContent`)
covers every scoped PAGE_CONTENT_COMMAND, but the hidden-element
ARIA-injection detection layer only ran for `text`. Other DOM-reading
channels (html, links, forms, accessibility, attrs, data, media,
ux-audit) returned their output through the envelope with no hidden-
content filter, so a page serving a display:none div that instructs
the agent to disregard prior system messages, or an aria-label that
claims to put the LLM in admin mode, leaked the injection payload on
any non-text channel. The envelope alone does not mitigate this, and
the page itself never rendered the hostile content to the human
operator.

Fix:
  * New export `DOM_CONTENT_COMMANDS` in commands.ts — the subset of
    PAGE_CONTENT_COMMANDS that derives its output from the live DOM.
    Console and dialog stay out; they read separate runtime state.
  * server.ts runs `markHiddenElements` + `cleanupHiddenMarkers` for
    every scoped command in this set. `text` keeps its existing
    `getCleanTextWithStripping` path (hidden elements physically
    stripped before the read). All other channels keep their output
    format but emit flagged elements as CONTENT WARNINGS on the
    envelope, so the LLM sees what it would otherwise have consumed
    silently.
  * Hidden-element descriptions merge into `combinedWarnings`
    alongside content-filter warnings before the wrap call.

Tests: new describe block in content-security.test.ts covering
  * `DOM_CONTENT_COMMANDS` export shape and channel membership;
  * dispatch gates on `DOM_CONTENT_COMMANDS.has(command)`, not the
    literal `text` string;
  * hiddenContentWarnings plumbs into `combinedWarnings` and reaches
    wrapUntrustedPageContent;
  * DOM_CONTENT_COMMANDS is a strict subset of PAGE_CONTENT_COMMANDS.

Existing datamarking, envelope wrap, centralized-wrapping, and chain
security suites stay green (52 pass, 0 fail).

* security: validate --from-file payload paths for parity with direct paths

The direct `load-html <file>` path runs every caller-supplied file path
through validateReadPath() so reads stay confined to SAFE_DIRECTORIES
(cwd, TEMP_DIR). The `load-html --from-file <payload.json>` shortcut
and its sibling `pdf --from-file <payload.json>` skipped that check and
went straight to fs.readFileSync(). An MCP caller that picks the
payload path (or any caller whose payload argument is reachable from
attacker-influenced text) could use --from-file as a read-anywhere
escape hatch for the safe-dirs policy.

Fix: call validateReadPath(path.resolve(payloadPath)) before readFileSync
at both sites. Error surface mirrors the direct-path branch so ops and
agent errors stay consistent.

Test coverage in browse/test/from-file-path-validation.test.ts:
  - source-level: validateReadPath precedes readFileSync in the load-html
    --from-file branch (write-commands.ts) and the pdf --from-file parser
    (meta-commands.ts)
  - error-message parity: both sites reference SAFE_DIRECTORIES

Related security audit pattern: R3 F002 (validateNavigationUrl gap on
download/scrape) and R3 F008 (markHiddenElements gap on 10 DOM commands)
were the same shape — a defense that existed on the primary code path
but not its shortcut sibling. This PR closes the same class of gap on
the --from-file shortcuts.

* fix(design): escape url.origin when injecting into served HTML

serve.ts injected url.origin into a single-quoted JS string in
the response body. A local request with a crafted Host header
(e.g. Host: "evil'-alert(1)-'x") would break out of the string
and execute JS in the 127.0.0.1:<port> origin opened by the
design board. Low severity — bound to localhost, requires a
local attacker — but no reason not to escape.

Fix: JSON.stringify(url.origin) produces a properly quoted,
escaped JS string literal in one call.

Also includes Prettier reformatting (single→double quotes,
trailing commas, line wrapping) applied by the repo's
PostToolUse formatter hook. Security change is the one line
in the HTML injection; everything else is whitespace/style.

* fix(scripts): drop shell:true from slop-diff npx invocations

spawnSync('npx', [...], { shell: true }) invokes /bin/sh -c
with the args concatenated, subjecting them to shell parsing
(word splitting, glob expansion, metacharacter interpretation).
No user input reaches these calls today, so not exploitable —
but the posture is wrong: npx + shell args should be direct.

Fix: scope shell:true to process.platform === 'win32' where
npx is actually a .cmd requiring the shell. POSIX runs the
npx binary directly with array-form args.

Also includes Prettier reformatting (single→double quotes,
trailing commas, line wrapping) applied by the repo's
PostToolUse formatter hook. Security-relevant change is just
the two shell:true -> shell: process.platform === 'win32'
lines; everything else is whitespace/style.

* security(E3): gate GSTACK_SLUG on /welcome path traversal

The /welcome handler interpolates GSTACK_SLUG directly into the filesystem
path used to locate the project-local welcome page. Without validation, a
slug like "../../etc/passwd" would resolve to
~/.gstack/projects/../../etc/passwd/designs/welcome-page-20260331/finalized.html
— classic path traversal.

Not exploitable today: GSTACK_SLUG is set by the gstack CLI at daemon launch,
and an attacker would already need local env-var access to poison it. But
the gate is one regex (^[a-z0-9_-]+$), and a defense-in-depth pass costs us
nothing when the cost of being wrong is arbitrary file read via /welcome.

Fall back to the safe 'unknown' literal when the slug fails validation —
same fallback the code already uses when GSTACK_SLUG is unset. No behavior
change for legitimate slugs (they all match the regex).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* security(N1): replace ?token= SSE auth with HttpOnly session cookie

Activity stream and inspector events SSE endpoints accepted the root
AUTH_TOKEN via `?token=` query param (EventSource can't send Authorization
headers). URLs leak to browser history, referer headers, server logs,
crash reports, and refactoring accidents. Codex flagged this during the
/plan-ceo-review outside voice pass.

New auth model: the extension calls POST /sse-session with a Bearer token
and receives a view-only session cookie (HttpOnly, SameSite=Strict, 30-min
TTL). EventSource is opened with `withCredentials: true` so the browser
sends the cookie back on the SSE connection. The ?token= query param is
GONE — no more URL-borne secrets.

Scope isolation (prior learning cookie-picker-auth-isolation, 10/10
confidence): the SSE session cookie grants access to /activity/stream and
/inspector/events ONLY. The token is never valid against /command, /token,
or any mutating endpoint. A leaked cookie can watch activity; it cannot
execute browser commands.

Components
  * browse/src/sse-session-cookie.ts — registry: mint/validate/extract/
    build-cookie. 256-bit tokens, 30-min TTL, lazy expiry pruning,
    no imports from token-registry (scope isolation enforced by module
    boundary).
  * browse/src/server.ts — POST /sse-session mint endpoint (requires
    Bearer). /activity/stream and /inspector/events now accept Bearer
    OR the session cookie, and reject ?token= query param.
  * extension/sidepanel.js — ensureSseSessionCookie() bootstrap call,
    EventSource opened with withCredentials:true on both SSE endpoints.
    Tested via the source guards; behavioral test is the E2E pairing
    flow that lands later in the wave.
  * browse/test/sse-session-cookie.test.ts — 20 unit tests covering
    mint entropy, TTL enforcement, cookie flag invariants, cookie
    parsing from multi-cookie headers, and scope-isolation contract
    guard (module must not import token-registry).
  * browse/test/server-auth.test.ts — existing /activity/stream auth
    test updated to assert the new cookie-based gate and the absence
    of the ?token= query param.

Cookie flag choices:
  * HttpOnly: token not readable from page JS (mitigates XSS
    exfiltration).
  * SameSite=Strict: cookie not sent on cross-site requests (mitigates
    CSRF). Fine for SSE because the extension connects to 127.0.0.1
    directly.
  * Path=/: cookie scoped to the whole origin.
  * Max-Age=1800: 30 minutes, matches TTL. Extension re-mints on
    reconnect when daemon restarts.
  * Secure NOT set: daemon binds to 127.0.0.1 over plain HTTP. Adding
    Secure would block the browser from ever sending the cookie back.
    Add Secure when gstack ships over HTTPS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* security(N2): document Windows v20 ABE elevation path on CDP port

The existing comment around the cookie-import-browser --remote-debugging-port
launch claimed "threat model: no worse than baseline." That's wrong on
Windows with App-Bound Encryption v20. A same-user local process that
opens the cookie SQLite DB directly CANNOT decrypt v20 values (DPAPI
context is bound to the browser process). The CDP port lets them bypass
that: connect to the debug port, call Network.getAllCookies inside Chrome,
walk away with decrypted v20 cookies.

The correct fix is to switch from TCP --remote-debugging-port to
--remote-debugging-pipe so the CDP transport is a stdio pipe, not a
socket. That requires restructuring the CDP WebSocket client in this
module and Playwright doesn't expose the pipe transport out of the box.
Non-trivial, deferred from the v1.6.0.0 wave.

This commit updates the comment to correctly describe the threat and
points at the tracking issue. No code change to the launch itself.
Follow-up: #1136.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(E2): document dual-listener tunnel architecture in ARCHITECTURE.md

Adds an explicit per-endpoint disposition table to the Security model
section, covering the v1.6.0.0 dual-listener refactor. Every HTTP
endpoint now has a documented local-vs-tunnel answer. Future audits
(and future contributors wondering "is it safe to add X to the tunnel
surface?") can read this instead of reverse-engineering server.ts.

Also documents:
  * Why physical port separation beats per-request header inference
    (ngrok behavior drift, local proxies can forge headers, etc.)
  * Tunnel surface denial logging → ~/.gstack/security/attempts.jsonl
  * SSE session cookie model (gstack_sse, 30-min TTL, stream-scope only,
    module-boundary-enforced scope isolation)
  * N2 non-goal for Windows v20 ABE via CDP port (tracking #1136)

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(E1): end-to-end pair-agent flow against a spawned daemon

Spawns the browse daemon as a subprocess with BROWSE_HEADLESS_SKIP=1 so
the HTTP layer runs without a real browser.  Exercises:

  * GET /health — token delivery for chrome-extension origin, withheld
    otherwise (the F1 + PR #1026 invariant)
  * GET /connect — alive probe returns {alive:true} unauth
  * POST /pair — root Bearer required (403 without), returns setup_key
  * POST /connect — setup_key exchange mints a distinct scoped token
  * POST /command — 401 without auth
  * POST /sse-session — Bearer required, Set-Cookie has HttpOnly +
    SameSite=Strict (the N1 invariant)
  * GET /activity/stream — 401 without auth
  * GET /activity/stream?token= — 401 (the old ?token= query param is
    REJECTED, which is the whole point of N1)
  * GET /welcome — serves HTML, does not leak /etc/passwd content under
    the default 'unknown' slug (E3 regex gate)

12 behavioral tests, ~220ms end-to-end, no network dependencies, no
ngrok, no real browser.  This is the receipt for the wave's central
'pair-agent still works + the security boundary holds' claim.

Tunnel-port binding (/tunnel/start) is deliberately NOT exercised here
— it requires an ngrok authtoken and live network.  The dual-listener
route allowlist is covered by source-level guards in
dual-listener.test.ts; behavioral tunnel testing belongs in a separate
paid-evals harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release(v1.6.0.0): bump VERSION + CHANGELOG for security wave

Architectural bump, not patch: dual-listener HTTP refactor changes the
daemon's tunnel-exposure model.  See CHANGELOG for the full release
summary (~950 words) covering the five root causes this wave closes:

  1. /health token leak over ngrok (F1 + E3 + test infra)
  2. /cookie-picker + /inspector exposed over the tunnel (F1)
  3. ?token=<ROOT> in SSE URLs leaking to logs/referer/history (N1)
  4. /welcome GSTACK_SLUG path traversal (E3)
  5. Windows v20 ABE elevation via CDP port (N2 — documented non-goal,
     tracked as #1136)

Plus the base PRs: SSRF gate (#1029), envelope sentinel escape (#1031),
DOM-channel hidden-element coverage (#1032), --from-file path validation
(#1103), and 2 commits from #1073 (@theqazi).

VERSION + package.json bumped to 1.6.0.0.  CHANGELOG entry covers
credits (@garagon, @Hybirdss, @HMAKT99, @theqazi), review lineage (CEO
→ Codex outside voice → Eng), and the non-goal tracking issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: pre-landing review findings (4 auto-fixes)

Addresses 4 findings from the Claude adversarial subagent on the
v1.6.0.0 security wave diff.  No user-visible behavior change; all
are defense-in-depth hardening of newly-introduced code.

1. GET /connect rate-limited (was POST-only) [HIGH conf 8/10]
   Attacker discovering the ngrok URL could probe unlimited GETs for
   daemon enumeration.  Now shares the global /connect counter.

2. ngrok listener leak on tunnel startup failure [MEDIUM conf 8/10]
   If ngrok.forward() resolved but tunnelListener.url() or the
   state-file write threw, the Bun listener was torn down but the
   ngrok session was leaked.  Fixed in BOTH /tunnel/start and
   BROWSE_TUNNEL=1 startup paths.

3. GSTACK_SKILL_ROOT path-traversal gate [MEDIUM conf 8/10]
   Symmetric with E3's GSTACK_SLUG regex gate — reject values
   containing '..' before interpolating into the welcome-page path.

4. SSE session registry pruning [LOW conf 7/10]
   pruneExpired() only checked 10 entries per mint call.  Now runs
   on every validate too, checks 20 entries, with a hard 10k cap as
   backstop.  Prevents registry growth under sustained extension
   reconnect pressure.

Tests remain green (56/56 in sse-session-cookie + dual-listener +
pair-agent-e2e suites).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v1.6.0.0

Reflect the dual-listener tunnel architecture, SSE session cookies,
SSRF guards, and Windows v20 ABE non-goal across the three docs
users actually read for remote-agent and browser auth context:

- docs/REMOTE_BROWSER_ACCESS.md: rewrote Architecture diagram for
  dual listeners, fixed /connect rate limit (3/min → 300/min),
  removed stale "/health requires no auth" (now 404 on tunnel),
  added SSE cookie auth, expanded Security Model with tunnel
  allowlist, SSRF guards, /welcome path traversal defense, and
  the Windows v20 ABE tracking note.
- BROWSER.md: added dual-listener paragraph to Authentication and
  linked to ARCHITECTURE.md endpoint table. Replaced the stale
  ?token= SSE auth note with the HttpOnly gstack_sse cookie flow.
- CLAUDE.md: added Transport-layer security section above the
  sidebar prompt-injection stack so contributors editing server.ts,
  sse-session-cookie.ts, or tunnel-denial-log.ts see the load-bearing
  module boundaries before touching them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(make-pdf): write --from-file payload to /tmp, not os.tmpdir()

make-pdf's browseClient wrote its --from-file payload to os.tmpdir(),
which is /var/folders/... on macOS. v1.6.0.0's PR #1103 cherry-pick
tightened browse load-html --from-file to validate against the
safe-dirs allowlist ([TEMP_DIR, cwd] where TEMP_DIR is '/tmp' on
macOS/Linux, os.tmpdir() on Windows). This closed a CLI/API parity
gap but broke make-pdf on macOS because /var/folders/... is outside
the allowlist.

Fix: mirror browse's TEMP_DIR convention — use '/tmp' on non-Windows,
os.tmpdir() on Windows. The make-pdf-gate CI failure on macOS-latest
(run 72440797490) is caused by exactly this: the payload file was
rejected by validateReadPath.

Verified locally: the combined-gate e2e test now passes after
rebuilding make-pdf/dist/pdf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sidebar): killAgent resets per-tab state; align tests with current agent event format

Two pre-existing bugs surfaced while running the full e2e suite on the
sec-wave branch.  Both pre-date v1.6.0.0 (same failures on main at
e23ff280) but blocked the ship verification, so fixing now.

### Bug 1: killAgent leaked stale per-tab state

`killAgent()` reset the legacy globals (agentProcess, agentStatus,
etc.) but never touched the per-tab `tabAgents` Map.  Meanwhile
`/sidebar-command` routes on `tabState.status` from that Map, not the
legacy globals.  Consequence: after a kill (including the implicit
kill in `/sidebar-session/new`), the next /sidebar-command on the
same tab saw `tabState.status === 'processing'` and fell into the
queue branch, silently NOT spawning an agent.  Integration tests that
called resetState between cases all failed with empty queues.

Fix: when targetTabId is supplied, reset that one tab's state; when
called without a tab (session-new, full kill), reset ALL tab states.
Matches the semantic boundary already used for the cancel-file write.

### Bug 2: sidebar-integration tests drifted from current event format

`agent events appear in /sidebar-chat` posted the raw Claude streaming
format (`{type: 'assistant', message: {content: [...]}}`) but
`processAgentEvent` in server.ts only handles the simplified types
that sidebar-agent.ts pre-processes into (text, text_delta, tool_use,
result, agent_error, security_event).  The architecture moved
pre-processing into sidebar-agent.ts at some point and this test
never got updated.  Fixed by sending the pre-processed `{type:
'text', text: '...'}` format — which is actually what the server sees
in production.

Also removed the `entry.prompt` URL-containment check in the
queue-write test.  The URL is carried on entry.pageUrl (metadata) by
design: the system prompt tells Claude to run `browse url` to fetch
the actual page rather than trust any URL in the prompt body.  That's
the URL-based prompt-injection defense.  The prompt SHOULD NOT
contain the URL, so the test assertion was wrong for the current
security posture.

### Verification

- `bun test browse/test/sidebar-integration.test.ts` → 13/13 pass
  (was 6/13 on both main and branch before this commit)
- Full `bun run test` → exit 0, zero fail markers
- No behavior change for production sidebar flows: killAgent was
  already supposed to return the agent to idle; it just wasn't fully
  doing so.  Per-tab reset now matches the documented semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: gus <gustavoraularagon@gmail.com>
Co-authored-by: Mohammed Qazi <10266060+theqazi@users.noreply.github.com>
2026-04-21 21:58:27 -07:00
Garry Tan e23ff280a1 fix(v1.4.1.0): /make-pdf — page numbers, entity escape, Linux fonts (#1098)
* fix(make-pdf): single-source page numbers via CSS, honor --no-page-numbers end-to-end

Two page-number sources were stacking in every PDF: Chromium's native footer
and our @page @bottom-center CSS. The CLI flag --page-numbers/--no-page-numbers
also never reached the CSS layer, because RenderOptions didn't carry it.
Passing --footer-template likewise dropped the "custom footer replaces stock
footer" semantic.

- orchestrator.ts: browseClient.pdf() gets pageNumbers:false unconditionally.
  CSS is the single source of truth. Chromium native numbering always off.
- render.ts: RenderOptions gains pageNumbers + footerTemplate. render() computes
  showPageNumbers = pageNumbers !== false && !footerTemplate and passes to
  printCss(), preserving the prior footerTemplate-suppresses-stock semantic.
- print-css.ts: PrintCssOptions.pageNumbers wraps @bottom-center in a conditional
  matching the existing showConfidential pattern.
- types.ts: PreviewOptions.pageNumbers so preview path compiles and matches CLI.
- render.test.ts: 7 regression tests covering printCss({pageNumbers}) in
  isolation AND the full render() data flow incl. footerTemplate path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(make-pdf): decode HTML entities in titles and TOC to prevent double-escape

A markdown title like "# Herbert & Garry" rendered as "Herbert &amp;amp; Garry"
in <title>, cover block, and TOC entries. marked emits "&amp;" (correct HTML),
but extractFirstHeading and extractHeadings only stripTags — leaving the entity
intact. That string then flows through escapeHtml, producing the double-encode.

- render.ts: new decodeTextEntities helper, distinct from decodeTypographicEntities
  (which runs on in-pipeline HTML and intentionally preserves &amp;). Covers
  named entities (lt/gt/quot/apos/39/x27/amp) AND numeric (decimal + hex) so
  inputs like "&#169;" or "&#x2014;" don't create the same partial-fix bug.
  Amp-last ordering prevents double-decode on "&amp;lt;" et al.
- Apply in both extractFirstHeading and extractHeadings. extractHeadings feeds
  buildTocBlock → escapeHtml, so the TOC site had the same bug.
- render.test.ts: 8 tests covering the contract — parameterized across &, <, >,
  ©, — chars; single-escape in <title>/cover; TOC double-escape check; numeric
  entity decode; smartypants-interacts-with-quotes contract (no raw equality).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(make-pdf): Liberation Sans font fallback for Linux rendering

On Linux (Docker, CI, servers), neither Helvetica nor Arial exist. Our CSS
stacks were falling through to DejaVu Sans — wider letterforms that look like
Verdana, not the intended Helvetica/Faber look. Liberation Sans is the standard
metric-compatible Arial clone (SIL OFL 1.1, apt package fonts-liberation).

- print-css.ts: all four font stacks (body + @top-center + @bottom-center +
  @bottom-right CONFIDENTIAL) gain "Liberation Sans" between Helvetica and
  Arial. File-header docblock updated to reflect the new stack.
- .github/docker/Dockerfile.ci: explicit apt-get install fonts-liberation +
  fontconfig with retry, fc-cache -f, and a verify step that fails the build
  loud if the font disappears. Playwright's install-deps happens to pull this
  in today but the dep is implicit and could silently regress.
- SKILL.md.tmpl: one-sentence note pointing Linux users at fonts-liberation.
- SKILL.md: regenerated via bun run gen:skill-docs --host all (only make-pdf's
  generated file changed — verified clean diff scope).
- render.test.ts: 2 assertions — Liberation Sans in body stack AND in at least
  one @page margin-box rule (proves all four intended stacks got touched, not
  just one).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.4.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: anonymize test fixtures, drop VC-partner framing

- CHANGELOG + render.test.ts fixtures use "Faber & Faber" instead of a
  personal name. Same regression coverage (ampersand in <title>, cover,
  TOC, body), neutral subject.
- make-pdf/SKILL.md.tmpl description drops the "send to a VC partner, a
  book agent, a judge, or Rick Rubin's team" line. "Not a draft artifact
  — a finished artifact" stands on its own without the audience posturing.
- SKILL.md regenerated.

No functional changes. All 58 make-pdf tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 22:32:58 +08:00
Garry Tan d0782c4c4d feat(v1.4.0.0): /make-pdf — markdown to publication-quality PDFs (#1086)
* feat(browse): full $B pdf flag contract + tab-scoped load-html/js/pdf

Grow $B pdf from a 2-line wrapper (hard-coded A4) into a real PDF engine
frontend so make-pdf can shell out to it without duplicating Playwright:

- pdf: --format, --width/--height, --margins, --margin-*, --header-template,
  --footer-template, --page-numbers, --tagged, --outline, --print-background,
  --prefer-css-page-size, --toc. Mutex rules enforced. --from-file <json>
  dodges Windows argv limits (8191 char CreateProcess cap).
- load-html: add --from-file <json> mode for large inline HTML. Size + magic
  byte checks still apply to the inline content, not the payload file path.
- newtab: add --json returning {"tabId":N,"url":...} for programmatic use.
- cli: extract --tab-id flag and route as body.tabId to the HTTP layer so
  parallel callers can target specific tabs without racing on the active
  tab (makes make-pdf's per-render tab isolation possible).
- --toc: non-fatal 3s wait for window.__pagedjsAfterFired. Paged.js ships
  later; v1 renders TOC statically via the markdown renderer.

Codex round 2 flagged these P0 issues during plan review. All resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resolvers): add MAKE_PDF_SETUP + makePdfDir host paths

Skill templates can now embed {{MAKE_PDF_SETUP}} to resolve $P to the
make-pdf binary via the same discovery order as $B / $D: env override
(MAKE_PDF_BIN), local skill root, global install, or PATH.

Mirrors the pattern established by generateBrowseSetup() and
generateDesignSetup() in scripts/resolvers/design.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(make-pdf): new /make-pdf skill + orchestrator binary

Turn markdown into publication-quality PDFs. $P generate input.md out.pdf
produces a PDF with 1in margins, intelligent page breaks, page numbers,
running header, CONFIDENTIAL footer, and curly quotes/em dashes — all on
Helvetica so copy-paste extraction works ("S ai li ng" bug avoided).

Architecture (per Codex round 2):
  markdown → render.ts (marked + sanitize + smartypants) → orchestrator
    → $B newtab --json → $B load-html --tab-id → $B js (poll Paged.js)
    → $B pdf --tab-id → $B closetab

browseClient.ts shells out to the compiled browse CLI rather than
duplicating Playwright. --tab-id isolation per render means parallel
$P generate calls don't race on the active tab. try/finally tab cleanup
survives Paged.js timeouts, browser crashes, and output-path failures.

Features in v1:
  --cover              left-aligned cover page (eyebrow + title + hairline rule)
  --toc                clickable static TOC (Paged.js page numbers deferred)
  --watermark <text>   diagonal DRAFT/CONFIDENTIAL layer
  --no-chapter-breaks  opt out of H1-starts-new-page
  --page-numbers       "N of M" footer (default on)
  --tagged --outline   accessible PDF + bookmark outline (default on)
  --allow-network      opt in to external image loading (default off for privacy)
  --quiet --verbose    stderr control

Design decisions locked from the /plan-design-review pass:
  - Helvetica everywhere (Chromium emits single-word Tj operators for
    system fonts; bundled webfonts emit per-glyph and break extraction).
  - Left-aligned body, flush-left paragraphs, no text-indent, 12pt gap.
  - Cover shares 1in margins with body pages; no flexbox-center, no
    inset padding.
  - The reference HTMLs at .context/designs/*.html are the implementation
    source of truth for print-css.ts.

Tests (56 unit + 1 E2E combined-features gate):
  - smartypants: code/URL-safe, verified against 10 fixtures
  - sanitizer: strips <script>/<iframe>/on*/javascript: URLs
  - render: HTML assembly, CJK fallback, cover/TOC/chapter wrap
  - print-css: all @page rules, margin variants, watermark
  - pdftotext: normalize()+copyPasteGate() cross-OS tolerance
  - browseClient: binary resolution + typed error propagation
  - combined-features gate (P0): 2-chapter fixture with smartypants +
    hyphens + ligatures + bold/italic + inline code + lists + blockquote
    passes through PDF → pdftotext → expected.txt diff

Deferred to Phase 4 (future PR): Paged.js vendored for accurate TOC page
numbers, highlight.js for syntax highlighting, drop caps, pull quotes,
two-column, CMYK, watermark visual-diff acceptance.

Plan: .context/ceo-plans/2026-04-19-perfect-pdf-generator.md
References: .context/designs/make-pdf-*.html

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(build): wire make-pdf into build/test/setup/bin + add marked dep

- package.json: compile make-pdf/dist/pdf as part of bun run build; add
  "make-pdf" to bin entry; include make-pdf/test/ in the free test pass;
  add marked@18.0.2 as a dep (markdown parser, ~40KB).
- setup: add make-pdf/dist/pdf to the Apple Silicon codesign loop.
- .gitignore: add make-pdf/dist/ (matches browse/dist/ and design/dist/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(make-pdf): matrix copy-paste gate on Ubuntu + macOS

Runs the combined-features P0 gate on pull requests that touch make-pdf/
or browse's PDF surface. Installs poppler (macOS) / poppler-utils (Ubuntu)
per OS. Windows deferred to tolerant mode (Xpdf / Poppler-Windows
extraction variance not yet calibrated against the normalized comparator —
Codex round 2 #18).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(skills): regenerate SKILL.md for make-pdf addition + browse pdf flags

bun run gen:skill-docs picks up:
  - the new /make-pdf skill (make-pdf/SKILL.md)
  - updated browse command descriptions for 'pdf', 'load-html', 'newtab'
    reflecting the new flag contract and --from-file mode

Source of truth stays the .tmpl files + COMMAND_DESCRIPTIONS;
these are regenerated artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): repair stale test expectations + emit _EXPLAIN_LEVEL / _QUESTION_TUNING from preamble

Three pre-existing test failures on main were blocking /ship:

- test/skill-validation.test.ts "Step 3.4 test coverage audit" expected the
  literal strings "CODE PATH COVERAGE" and "USER FLOW COVERAGE" which were
  removed when the Step 7 coverage diagram was compressed. Updated assertions
  to check the stable `Code paths:` / `User flows:` labels that still ship.

- test/skill-validation.test.ts "ship step numbering" allowed-substeps list
  didn't include 15.0 (WIP squash) and 15.1 (bisectable commits) which were
  added for continuous checkpoint mode. Extended the allowlist.

- test/writing-style-resolver.test.ts and test/plan-tune.test.ts expected
  `_EXPLAIN_LEVEL` and `_QUESTION_TUNING` bash variables in the preamble but
  generate-preamble-bash.ts had been refactored and those lines were dropped.
  Without them, downstream skills can't read `explain_level` or
  `question_tuning` config at runtime — terse mode and /plan-tune features
  were silently broken.

Added the two bash echo blocks back to generatePreambleBash and refreshed
the golden-file fixtures to match. All three preamble-related golden
baselines (claude/codex/factory) are synchronized with the new output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.4.0.0)

New /make-pdf skill + $P binary.

Turn any markdown file into a publication-quality PDF. Default output is
a 1in-margin Helvetica letter with page numbers in the footer. `--cover`
adds a left-aligned cover page, `--toc` generates a clickable table of
contents, `--watermark DRAFT` overlays a diagonal watermark. Copy-paste
extraction from the PDF produces clean words, not "S a i l i n g"
spaced out letter by letter. CI gate (macOS + Ubuntu) runs a combined-
features fixture through pdftotext on every PR.

make-pdf shells out to browse rather than duplicating Playwright.
$B pdf grew into a real PDF engine with full flag contract (--format,
--margins, --header-template, --footer-template, --page-numbers,
--tagged, --outline, --toc, --tab-id, --from-file). $B load-html and
$B js gained --tab-id. $B newtab --json returns structured output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): rewrite v1.4.0.0 headline — positive voice, no VC framing

The original headline led with "a PDF you wouldn't be embarrassed to send
to a VC": double-negative voice and audience-too-narrow. /make-pdf works
for essays, letters, memos, reports, proposals, and briefs. Framing the
whole release around founders-to-investors misses the wider audience.

New headline: "Turn any markdown file into a PDF that looks finished."
New tagline: "This one reads like a real essay or a real letter."

Positive voice. Broader aperture. Same energy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:20:30 +08:00