feat(v1.4.0.0): /make-pdf — markdown to publication-quality PDFs (#1086)

* feat(browse): full $B pdf flag contract + tab-scoped load-html/js/pdf

  Grow $B pdf from a 2-line wrapper (hard-coded A4) into a real PDF engine
  frontend so make-pdf can shell out to it without duplicating Playwright:

  - pdf: --format, --width/--height, --margins, --margin-*, --header-template,
    --footer-template, --page-numbers, --tagged, --outline, --print-background,
    --prefer-css-page-size, --toc. Mutex rules enforced. --from-file <json>
    dodges Windows argv limits (the 8191 char cap is cmd.exe's; CreateProcess
    itself allows 32,767).
  - load-html: add --from-file <json> mode for large inline HTML. Size + magic
    byte checks still apply to the inline content, not the payload file path.
  - newtab: add --json returning {"tabId":N,"url":...} for programmatic use.
  - cli: extract --tab-id flag and route as body.tabId to the HTTP layer so
    parallel callers can target specific tabs without racing on the active tab
    (makes make-pdf's per-render tab isolation possible).
  - --toc: non-fatal 3s wait for window.__pagedjsAfterFired. Paged.js ships
    later; v1 renders TOC statically via the markdown renderer.

  Codex round 2 flagged these P0 issues during plan review. All resolved.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resolvers): add MAKE_PDF_SETUP + makePdfDir host paths

  Skill templates can now embed {{MAKE_PDF_SETUP}} to resolve $P to the
  make-pdf binary via the same discovery order as $B / $D: env override
  (MAKE_PDF_BIN), local skill root, global install, or PATH. Mirrors the
  pattern established by generateBrowseSetup() and generateDesignSetup() in
  scripts/resolvers/design.ts.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(make-pdf): new /make-pdf skill + orchestrator binary

  Turn markdown into publication-quality PDFs. $P generate input.md out.pdf
  produces a PDF with 1in margins, intelligent page breaks, page numbers,
  running header, CONFIDENTIAL footer, and curly quotes/em dashes, all on
  Helvetica so copy-paste extraction works ("S ai li ng" bug avoided).

  Architecture (per Codex round 2):

    markdown → render.ts (marked + sanitize + smartypants) → orchestrator
      → $B newtab --json → $B load-html --tab-id → $B js (poll Paged.js)
      → $B pdf --tab-id → $B closetab

  browseClient.ts shells out to the compiled browse CLI rather than
  duplicating Playwright. --tab-id isolation per render means parallel
  $P generate calls don't race on the active tab. try/finally tab cleanup
  survives Paged.js timeouts, browser crashes, and output-path failures.

  Features in v1:

    --cover              left-aligned cover page (eyebrow + title + hairline rule)
    --toc                clickable static TOC (Paged.js page numbers deferred)
    --watermark <text>   diagonal DRAFT/CONFIDENTIAL layer
    --no-chapter-breaks  opt out of H1-starts-new-page
    --page-numbers       "N of M" footer (default on)
    --tagged --outline   accessible PDF + bookmark outline (default on)
    --allow-network      opt in to external image loading (default off for privacy)
    --quiet --verbose    stderr control

  Design decisions locked from the /plan-design-review pass:

  - Helvetica everywhere (Chromium emits single-word Tj operators for system
    fonts; bundled webfonts emit per-glyph and break extraction).
  - Left-aligned body, flush-left paragraphs, no text-indent, 12pt gap.
  - Cover shares 1in margins with body pages; no flexbox-center, no inset
    padding.
  - The reference HTMLs at .context/designs/*.html are the implementation
    source of truth for print-css.ts.

  Tests (56 unit + 1 E2E combined-features gate):

  - smartypants: code/URL-safe, verified against 10 fixtures
  - sanitizer: strips <script>/<iframe>/on*/javascript: URLs
  - render: HTML assembly, CJK fallback, cover/TOC/chapter wrap
  - print-css: all @page rules, margin variants, watermark
  - pdftotext: normalize()+copyPasteGate() cross-OS tolerance
  - browseClient: binary resolution + typed error propagation
  - combined-features gate (P0): 2-chapter fixture with smartypants +
    hyphens + ligatures + bold/italic + inline code + lists + blockquote
    passes through PDF → pdftotext → expected.txt diff

  Deferred to Phase 4 (future PR): Paged.js vendored for accurate TOC page
  numbers, highlight.js for syntax highlighting, drop caps, pull quotes,
  two-column, CMYK, watermark visual-diff acceptance.

  Plan: .context/ceo-plans/2026-04-19-perfect-pdf-generator.md
  References: .context/designs/make-pdf-*.html

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(build): wire make-pdf into build/test/setup/bin + add marked dep

  - package.json: compile make-pdf/dist/pdf as part of bun run build; add
    "make-pdf" to bin entry; include make-pdf/test/ in the free test pass;
    add marked@18.0.2 as a dep (markdown parser, ~40KB).
  - setup: add make-pdf/dist/pdf to the Apple Silicon codesign loop.
  - .gitignore: add make-pdf/dist/ (matches browse/dist/ and design/dist/).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(make-pdf): matrix copy-paste gate on Ubuntu + macOS

  Runs the combined-features P0 gate on pull requests that touch make-pdf/
  or browse's PDF surface. Installs poppler (macOS) / poppler-utils (Ubuntu)
  per OS. Windows deferred to tolerant mode (Xpdf / Poppler-Windows
  extraction variance not yet calibrated against the normalized comparator;
  Codex round 2 #18).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(skills): regenerate SKILL.md for make-pdf addition + browse pdf flags

  bun run gen:skill-docs picks up:

  - the new /make-pdf skill (make-pdf/SKILL.md)
  - updated browse command descriptions for 'pdf', 'load-html', 'newtab'
    reflecting the new flag contract and --from-file mode

  Source of truth stays the .tmpl files + COMMAND_DESCRIPTIONS; these are
  regenerated artifacts.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): repair stale test expectations + emit _EXPLAIN_LEVEL / _QUESTION_TUNING from preamble

  Three pre-existing test failures on main were blocking /ship:

  - test/skill-validation.test.ts "Step 3.4 test coverage audit" expected the
    literal strings "CODE PATH COVERAGE" and "USER FLOW COVERAGE" which were
    removed when the Step 7 coverage diagram was compressed. Updated
    assertions to check the stable `Code paths:` / `User flows:` labels that
    still ship.
  - test/skill-validation.test.ts "ship step numbering" allowed-substeps list
    didn't include 15.0 (WIP squash) and 15.1 (bisectable commits) which were
    added for continuous checkpoint mode. Extended the allowlist.
  - test/writing-style-resolver.test.ts and test/plan-tune.test.ts expected
    `_EXPLAIN_LEVEL` and `_QUESTION_TUNING` bash variables in the preamble but
    generate-preamble-bash.ts had been refactored and those lines were
    dropped. Without them, downstream skills can't read `explain_level` or
    `question_tuning` config at runtime; terse mode and /plan-tune features
    were silently broken. Added the two bash echo blocks back to
    generatePreambleBash and refreshed the golden-file fixtures to match.

  All three preamble-related golden baselines (claude/codex/factory) are
  synchronized with the new output.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.4.0.0)

  New /make-pdf skill + $P binary. Turn any markdown file into a
  publication-quality PDF.
  Default output is a 1in-margin Helvetica letter with page numbers in the
  footer. `--cover` adds a left-aligned cover page, `--toc` generates a
  clickable table of contents, `--watermark DRAFT` overlays a diagonal
  watermark. Copy-paste extraction from the PDF produces clean words, not
  "S a i l i n g" spaced out letter by letter. CI gate (macOS + Ubuntu) runs
  a combined-features fixture through pdftotext on every PR.

  make-pdf shells out to browse rather than duplicating Playwright. $B pdf
  grew into a real PDF engine with full flag contract (--format, --margins,
  --header-template, --footer-template, --page-numbers, --tagged, --outline,
  --toc, --tab-id, --from-file). $B load-html and $B js gained --tab-id.
  $B newtab --json returns structured output.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): rewrite v1.4.0.0 headline - positive voice, no VC framing

  The original headline led with "a PDF you wouldn't be embarrassed to send
  to a VC": double-negative voice and audience-too-narrow. /make-pdf works
  for essays, letters, memos, reports, proposals, and briefs. Framing the
  whole release around founders-to-investors misses the wider audience.

  New headline: "Turn any markdown file into a PDF that looks finished."
  New tagline: "This one reads like a real essay or a real letter."

  Positive voice. Broader aperture. Same energy.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
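
The per-render tab lifecycle described in the architecture note (newtab, load-html, pdf, closetab, with try/finally cleanup) can be sketched roughly as below. This is an illustration only, not the code in browseClient.ts: the `TabClient` interface and its method names are hypothetical stand-ins for the corresponding `$B` invocations, and the calls are synchronous purely to keep the sketch short.

```typescript
// Hypothetical interface: each method stands in for one `$B` invocation.
// None of these names are taken from browseClient.ts.
interface TabClient {
  newTab(): number;                            // $B newtab --json
  loadHtml(tabId: number, html: string): void; // $B load-html --tab-id
  pdf(tabId: number, outPath: string): void;   // $B pdf --tab-id
  closeTab(tabId: number): void;               // $B closetab
}

// Render one document in its own tab. The finally block runs whether the
// body succeeds or throws, so a failed render does not leak its tab.
function renderInOwnTab(client: TabClient, html: string, outPath: string): void {
  const tabId = client.newTab();
  try {
    client.loadHtml(tabId, html);
    client.pdf(tabId, outPath);
  } finally {
    client.closeTab(tabId);
  }
}
```

Because every call carries its own tab id, two renders running side by side would each clean up only the tab they opened; the original error still propagates to the caller after cleanup.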
2026-05-07 22:16:52 +02:00 · 2026-04-20 13:20:30 +08:00
parent 22a4451e0e
commit d0782c4c4d
74 changed files with 4456 additions and 37 deletions
@@ -0,0 +1,340 @@
+/**
+ * Markdown → HTML renderer. Pure function, no I/O, no Playwright.
+ *
+ * Pipeline:
+ *   1. marked parses markdown → HTML
+ *   2. Sanitize: strip <script>, <iframe>, <object>, <embed>, <link>,
+ *      <meta>, <base>, <form>, and all on* event handlers + javascript:
+ *      URLs. (Codex round 2 #9: untrusted markdown can embed raw HTML.)
+ *   3. Smartypants transform (code/URL-safe).
+ *   4. Assemble full HTML document with print CSS inlined and
+ *      semantic structure (cover, TOC placeholder, body).
+ */
+
+import { marked } from "marked";
+import { smartypants } from "./smartypants";
+import { printCss, type PrintCssOptions } from "./print-css";
+
+export interface RenderOptions {
+  markdown: string;
+
+  // Document-level metadata (used for cover, PDF metadata, running header).
+  title?: string;
+  author?: string;
+  date?: string;                  // ISO or human string
+  subtitle?: string;
+
+  // Features
+  cover?: boolean;
+  toc?: boolean;
+  watermark?: string;
+  noChapterBreaks?: boolean;
+  confidential?: boolean;         // default: true
+
+  // Page layout
+  pageSize?: "letter" | "a4" | "legal" | "tabloid";
+  margins?: string;
+}
+
+export interface RenderResult {
+  html: string;                   // full HTML document, ready for $B load-html
+  printCss: string;               // for debugging / preview
+  bodyHtml: string;               // just the rendered body (tests, snapshots)
+  meta: {
+    title: string;
+    author: string;
+    date: string;
+    wordCount: number;
+  };
+}
+
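+/**
+ * Renderer with no I/O and no side effects. Output is deterministic for a
+ * given input when opts.date is set; otherwise the date defaults to today.
+ */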
+export function render(opts: RenderOptions): RenderResult {
+  // 1. Markdown → HTML
+  const rawHtml = marked.parse(opts.markdown, { async: false }) as string;
+
+  // 2. Sanitize
+  const cleanHtml = sanitizeUntrustedHtml(rawHtml);
+
+  // 3a. Decode common entities so smartypants can match raw " and '.
+  //     marked HTML-encodes quotes in text ("hello" → &quot;hello&quot;);
+  //     without decoding, smartypants' regex never fires. No re-encoding is
+  //     needed afterwards: raw quotes are valid in HTML text nodes, and
+  //     curled quotes are already in their final Unicode form.
+  const decoded = decodeTypographicEntities(cleanHtml);
+
+  // 3b. Smartypants (code-safe)
+  const typographicHtml = smartypants(decoded);
+
+  // 4. Derive metadata (title from first H1 if not provided)
+  const derivedTitle = opts.title ?? extractFirstHeading(typographicHtml) ?? "Document";
+  const derivedAuthor = opts.author ?? "";
+  const derivedDate = opts.date ?? formatToday();
+
+  // 5. Build CSS
+  const cssOptions: PrintCssOptions = {
+    cover: opts.cover,
+    toc: opts.toc,
+    noChapterBreaks: opts.noChapterBreaks,
+    watermark: opts.watermark,
+    confidential: opts.confidential !== false,
+    runningHeader: derivedTitle,
+    pageSize: opts.pageSize,
+    margins: opts.margins,
+  };
+  const css = printCss(cssOptions);
+
+  // 6. Assemble document
+  const coverBlock = opts.cover
+    ? buildCoverBlock({
+        title: derivedTitle,
+        subtitle: opts.subtitle,
+        author: derivedAuthor,
+        date: derivedDate,
+      })
+    : "";
+
+  const tocBlock = opts.toc
+    ? buildTocBlock(typographicHtml)
+    : "";
+
+  // Wrap body in .chapter sections at H1 boundaries if chapter breaks are on.
+  const chapterHtml = opts.noChapterBreaks
+    ? `<section class="chapter">${typographicHtml}</section>`
+    : wrapChaptersByH1(typographicHtml);
+
+  const watermarkBlock = opts.watermark
+    ? `<div class="watermark">${escapeHtml(opts.watermark)}</div>`
+    : "";
+
+  const fullHtml = [
+    `<!doctype html>`,
+    `<html lang="en">`,
+    `<head>`,
+    `<meta charset="utf-8">`,
+    `<title>${escapeHtml(derivedTitle)}</title>`,
+    derivedAuthor ? `<meta name="author" content="${escapeHtml(derivedAuthor)}">` : ``,
+    `<style>`,
+    css,
+    `</style>`,
+    `</head>`,
+    `<body>`,
+    watermarkBlock,
+    coverBlock,
+    tocBlock,
+    chapterHtml,
+    `</body>`,
+    `</html>`,
+  ].filter(Boolean).join("\n");
+
+  return {
+    html: fullHtml,
+    printCss: css,
+    bodyHtml: typographicHtml,
+    meta: {
+      title: derivedTitle,
+      author: derivedAuthor,
+      date: derivedDate,
+      wordCount: countWords(stripTags(typographicHtml)),
+    },
+  };
+}
+
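+/**
+ * Decode the HTML entities that marked emits for text-node quotes/apostrophes.
+ * Only the four that matter for smartypants; leaves &amp; alone because it
+ * can be legitimately doubled (&amp;amp;) and we don't want to double-decode.
+ *
+ * Only text between tags is decoded. Decoding inside a tag would turn an
+ * encoded quote in an attribute value into a live delimiter, which could
+ * reintroduce attributes (e.g. on* handlers) after sanitization has run.
+ */
+function decodeTypographicEntities(html: string): string {
+  // split() with a capture group keeps the tags at odd indices.
+  return html
+    .split(/(<(?:"[^"]*"|'[^']*'|[^>"'])*>)/)
+    .map((part, i) =>
+      i % 2 === 1
+        ? part
+        : part
+            .replace(/&quot;/g, "\"")
+            .replace(/&#39;/g, "'")
+            .replace(/&apos;/g, "'")
+            .replace(/&#x27;/g, "'"),
+    )
+    .join("");
+}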
+
+// ─── Sanitizer ────────────────────────────────────────────────────────
+
+/**
+ * Strip dangerous HTML from markdown-produced output.
+ *
+ * We can't use DOMPurify (server-side; adds a jsdom dep). A conservative
+ * regex sanitizer is fine for this use case because:
+ *   1. marked's own output is well-formed (raw HTML embedded in the
+ *      markdown is passed through verbatim and may not be)
+ *   2. we only need to strip a fixed blacklist of elements + attrs
+ *   3. the output goes through Chromium's parser again, which normalizes
+ *
+ * What's stripped:
+ *   - <script>, <iframe>, <object>, <embed>, <link>, <meta>, <base>, <form>,
+ *     <applet>, <frame>, <frameset> (and their content).
+ *   - on* event handler attributes (onclick, ONCLICK, etc.).
+ *   - javascript: URLs in href/src/action/formaction/xlink:href
+ *     (replaced with "#").
+ *   - srcdoc attributes.
+ *   - url(javascript:...) anywhere in the markup (replaced with url(#)).
+ *   - <script> elements nested inside <svg> (the <svg> itself is kept).
+ */
+export function sanitizeUntrustedHtml(html: string): string {
+  let s = html;
+
+  // Elements to remove entirely (including content).
+  const DANGER_TAGS = [
+    "script", "iframe", "object", "embed", "link", "meta", "base", "form",
+    "applet", "frame", "frameset",
+  ];
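+  // Repeat until nothing changes: removing one tag can splice a new one
+  // together (e.g. "<scr<iframe></iframe>ipt>" collapses to "<script>").
+  let before: string;
+  do {
+    before = s;
+    for (const tag of DANGER_TAGS) {
+      const re = new RegExp(`<${tag}\\b[\\s\\S]*?</${tag}>`, "gi");
+      s = s.replace(re, "");
+      // Self-closing / unclosed variants
+      const selfRe = new RegExp(`<${tag}\\b[^>]*/?>`, "gi");
+      s = s.replace(selfRe, "");
+    }
+  } while (s !== before);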
+
+  // SVG <script>
+  s = s.replace(/<svg([^>]*)>([\s\S]*?)<\/svg>/gi, (_, attrs, body) => {
+    return `<svg${attrs}>${body.replace(/<script\b[\s\S]*?<\/script>/gi, "")}</svg>`;
+  });
+
+  // Event handler attributes (on* in any case).
+  s = s.replace(/\s+on[a-zA-Z]+\s*=\s*"[^"]*"/gi, "");
+  s = s.replace(/\s+on[a-zA-Z]+\s*=\s*'[^']*'/gi, "");
+  s = s.replace(/\s+on[a-zA-Z]+\s*=\s*[^\s>]+/gi, "");
+
+  // javascript: URLs in href/src/action/formaction
+  s = s.replace(
+    /(\s(?:href|src|action|formaction|xlink:href)\s*=\s*)(?:"javascript:[^"]*"|'javascript:[^']*'|javascript:[^\s>]+)/gi,
+    '$1"#"',
+  );
+
+  // srcdoc attribute (iframe escape hatch — already stripped via iframe above,
+  // but defense-in-depth).
+  s = s.replace(/\s+srcdoc\s*=\s*"[^"]*"/gi, "");
+  s = s.replace(/\s+srcdoc\s*=\s*'[^']*'/gi, "");
+
+  // style="url(javascript:..)" — strip javascript: inside style attrs.
+  s = s.replace(/url\(\s*javascript:[^)]*\)/gi, "url(#)");
+
+  return s;
+}
+
+// ─── Cover / TOC / Chapter helpers ────────────────────────────────────
+
+function buildCoverBlock(opts: {
+  title: string;
+  subtitle?: string;
+  author?: string;
+  date: string;
+}): string {
+  const title = escapeHtml(opts.title);
+  const subtitle = opts.subtitle ? escapeHtml(opts.subtitle) : "";
+  const author = opts.author ? escapeHtml(opts.author) : "";
+  const date = escapeHtml(opts.date);
+  return [
+    `<section class="cover">`,
+    `  <h1 class="cover-title">${title}</h1>`,
+    subtitle ? `  <p class="cover-subtitle">${subtitle}</p>` : ``,
+    `  <hr class="rule">`,
+    `  <div class="cover-meta">`,
+    author ? `    <div><strong>${author}</strong></div>` : ``,
+    `    <div>${date}</div>`,
+    `  </div>`,
+    `</section>`,
+  ].filter(Boolean).join("\n");
+}
+
+/**
+ * Scan HTML for H1/H2/H3 headings and emit a TOC placeholder.
+ * The .toc-page spans are emitted empty. Paged.js is deferred, so v1 ships
+ * a static TOC without page numbers; the polyfill would fill them in later.
+ */
+function buildTocBlock(html: string): string {
+  const headings = extractHeadings(html);
+  if (headings.length === 0) return "";
+
+  const items = headings.map((h, i) => {
+    const level = h.level >= 2 ? "level-2" : "level-1";
+    const id = `toc-${i}`;
+    return [
+      `  <li class="${level}">`,
+      `    <span class="toc-title"><a href="#${id}">${escapeHtml(h.text)}</a></span>`,
+      `    <span class="toc-dots"></span>`,
+      `    <span class="toc-page" data-toc-target="${id}"></span>`,
+      `  </li>`,
+    ].join("\n");
+  }).join("\n");
+
+  return [
+    `<section class="toc">`,
+    `  <h2>Contents</h2>`,
+    `  <ol>`,
+    items,
+    `  </ol>`,
+    `</section>`,
+  ].join("\n");
+}
+
+function extractHeadings(html: string): Array<{ level: number; text: string }> {
+  const re = /<(h[1-3])[^>]*>([\s\S]*?)<\/\1>/gi;
+  const headings: Array<{ level: number; text: string }> = [];
+  let match;
+  while ((match = re.exec(html)) !== null) {
+    const level = parseInt(match[1].slice(1), 10);
+    const text = stripTags(match[2]).trim();
+    if (text) headings.push({ level, text });
+  }
+  return headings;
+}
+
+/**
+ * Wrap H1-rooted sections in <section class="chapter">. When chapter breaks
+ * are on (default), CSS `.chapter { break-before: page }` fires between them.
+ */
+function wrapChaptersByH1(html: string): string {
+  // Split on H1 openings. Everything before the first H1 is a preamble.
+  const h1Re = /<h1\b[^>]*>/gi;
+  const matches: number[] = [];
+  let m;
+  while ((m = h1Re.exec(html)) !== null) {
+    matches.push(m.index);
+  }
+  if (matches.length === 0) {
+    return `<section class="chapter">${html}</section>`;
+  }
+  const chunks: string[] = [];
+  const preamble = html.slice(0, matches[0]);
+  if (preamble.trim().length > 0) {
+    chunks.push(`<section class="chapter">${preamble}</section>`);
+  }
+  for (let i = 0; i < matches.length; i++) {
+    const start = matches[i];
+    const end = i + 1 < matches.length ? matches[i + 1] : html.length;
+    chunks.push(`<section class="chapter">${html.slice(start, end)}</section>`);
+  }
+  return chunks.join("\n");
+}
+
+function extractFirstHeading(html: string): string | null {
+  const m = html.match(/<h1\b[^>]*>([\s\S]*?)<\/h1>/i);
+  return m ? stripTags(m[1]).trim() : null;
+}
+
+function stripTags(html: string): string {
+  return html.replace(/<[^>]+>/g, "");
+}
+
+function escapeHtml(s: string): string {
+  return s
+    .replace(/&/g, "&amp;")
+    .replace(/</g, "&lt;")
+    .replace(/>/g, "&gt;")
+    .replace(/"/g, "&quot;")
+    .replace(/'/g, "&#39;");
+}
+
+function countWords(text: string): number {
+  return text.split(/\s+/).filter(w => w.length > 0).length;
+}
+
+function formatToday(): string {
+  const now = new Date();
+  return now.toLocaleDateString("en-US", { year: "numeric", month: "long", day: "numeric" });
+}
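
Review note on the TOC anchors: buildTocBlock() links each entry to `#toc-N`, but nothing in this file assigns those ids to the headings, and marked is not expected to emit heading ids on its own in recent versions. Unless another layer adds them, the links would have no target. The sketch below shows one way the ids could be stamped on, assuming it runs on the same HTML that buildTocBlock() scans. `addHeadingIds` is a hypothetical helper and is not part of this commit; it reuses the heading pattern and the empty-heading skip rule from extractHeadings() so the numbering stays aligned.

```typescript
// Hypothetical helper (not in the commit): adds id="toc-N" to each non-empty
// h1-h3, in document order, mirroring how extractHeadings() numbers entries.
function addHeadingIds(html: string): string {
  let index = 0;
  return html.replace(
    /<(h[1-3])([^>]*)>([\s\S]*?)<\/\1>/gi,
    (whole: string, tag: string, attrs: string, inner: string) => {
      const text = inner.replace(/<[^>]+>/g, "").trim();
      // extractHeadings() skips headings with no text, so no index is spent.
      if (!text) return whole;
      const id = `toc-${index++}`;
      // Drop any existing id so the element does not end up with two.
      const rest = attrs.replace(/\s+id\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/i, "");
      return `<${tag}${rest} id="${id}">${inner}</${tag}>`;
    },
  );
}
```

In render(), a call like this would sit before both buildTocBlock() and the chapter wrapping, so the TOC and the body see identical headings. Replacing an author-supplied id is a simplification; keeping it and pointing the TOC entry at it would be an alternative.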