Files
gstack/browse/src/url-validation.ts
T
Garry Tan c15b805cd8 feat(browse): Puppeteer parity — load-html, screenshot --selector, viewport --scale, file:// (v1.1.0.0) (#1062)
* feat(browse): TabSession loadedHtml + command aliases + DX polish primitives

Adds the foundation layer for Puppeteer-parity features:

- TabSession.loadedHtml + setTabContent/getLoadedHtml/clearLoadedHtml —
  enables load-html content to survive context recreation (viewport --scale)
  via in-memory replay. ASCII lifecycle diagram in the source explains the
  clear-before-navigation contract.

- COMMAND_ALIASES + canonicalizeCommand() helper — single source of truth
  for name aliases (setcontent / set-content / setContent → load-html),
  consumed by server dispatch and chain prevalidation.

- buildUnknownCommandError() pure function — rich error messages with
  Levenshtein-based "Did you mean" suggestions (distance ≤ 2, input
  length ≥ 4 to skip 2-letter noise) and NEW_IN_VERSION upgrade hints.

- load-html registered in WRITE_COMMANDS + SCOPE_WRITE so scoped write
  tokens can use it.

- screenshot and viewport descriptions updated for upcoming flags.

- New browse/test/dx-polish.test.ts (15 tests): alias canonicalization,
  Levenshtein threshold + alphabetical tiebreak, short-input guard,
  NEW_IN_VERSION upgrade hint, alias + scope integration invariants.

No consumers yet — pure additive foundation. Safe to bisect on its own.

* feat(browse): accept file:// in goto with smart cwd/home-relative parsing

Extends validateNavigationUrl to accept file:// URLs scoped to safe dirs
(cwd + TEMP_DIR) via the existing validateReadPath policy. The workhorse is a
new normalizeFileUrl() helper that handles non-standard relative forms BEFORE
the WHATWG URL parser sees them:

    file:///abs/path.html       → unchanged
    file://./docs/page.html     → file://<cwd>/docs/page.html
    file://~/Documents/page.html → file://<HOME>/Documents/page.html
    file://docs/page.html       → file://<cwd>/docs/page.html
    file://localhost/abs/path   → unchanged
    file://host.example.com/... → rejected (UNC/network)
    file:// and file:///        → rejected (would list a directory)

Host heuristic rejects segments with '.', ':', '\\', '%', IPv6 brackets, or
Windows drive-letter patterns — so file://docs.v1/page.html, file://127.0.0.1/x,
file://[::1]/x, and file://C:/Users/x are explicit errors.

Uses fileURLToPath() + pathToFileURL() from node:url (never string-concat) so
URL escapes like %20 decode correctly and Node rejects encoded-slash traversal
(%2F..%2F) outright.

Signature change: validateNavigationUrl now returns Promise<string> (the
normalized URL) instead of Promise<void>. Existing callers that ignore the
return value still compile — they just don't benefit from smart-parsing until
updated in follow-up commits. Callers will be migrated in the next few commits
(goto, diff, newTab, restoreState).

Rewrites the url-validation test file: updates existing tests for the new
return type, adds 20+ new tests covering every normalizeFileUrl shape variant,
URL-encoding edge cases, and path-traversal rejection.

References: codex consult v3 P1 findings on URL parser semantics and fileURLToPath.

* feat(browse): BrowserManager deviceScaleFactor + setContent replay + file:// plumbing

Three tightly-coupled changes to BrowserManager, all in service of the
Puppeteer-parity workflow:

1. deviceScaleFactor + currentViewport tracking. New private fields (default
   scale=1, viewport=1280x720) + setDeviceScaleFactor(scale, w, h) method.
   deviceScaleFactor is a context-level Playwright option — changing it
   requires recreateContext(). The method validates (finite number, 1-3 cap,
   headed-mode rejected), stores new values, calls recreateContext(), and
   rolls back the fields on failure so a bad call doesn't leave inconsistent
   state. Context options at all three sites (launch, recreate happy path,
   recreate fallback) now honor the stored values instead of hardcoding
   1280x720.

2. BrowserState.loadedHtml + loadedHtmlWaitUntil. saveState captures per-tab
   loadedHtml from the session; restoreState replays it via newSession.
   setTabContent() — NOT bare page.setContent() — so TabSession.loadedHtml
   is rehydrated and survives *subsequent* scale changes. In-memory only,
   never persisted to disk (HTML may contain secrets or customer data).

3. newTab + restoreState now consume validateNavigationUrl's normalized
   return value. file://./x, file://~/x, and bare-segment forms now take
   effect at every navigation site, not just the top-level goto command.

Together these enable: load-html → viewport --scale 2 → viewport --scale 1.5
→ screenshot, with content surviving both context recreations. Codex v2 P0
flagged that bare page.setContent in restoreState would lose content on the
second scale change — this commit implements the rehydration path.

References: codex v2 P0 (TabSession rehydration), codex v3 P1 (4-caller
return value), plan Feature 3 + Feature 4.

* feat(browse): load-html, screenshot --selector, viewport --scale, alias dispatch

Wires the new handlers and dispatch logic that the previous commits made
possible:

write-commands.ts
- New 'load-html' case: validateReadPath for safe-dir scoping, stat-based
  actionable errors (not found, directory, oversize), extension allowlist
  (.html/.htm/.xhtml/.svg), magic-byte sniff with UTF-8 BOM strip accepting
  any <[a-zA-Z!?] markup opener (not just <!doctype — bare fragments like
  <div>...</div> work for setContent), 50MB cap via GSTACK_BROWSE_MAX_HTML_BYTES
  override, frame-context rejection. Calls session.setTabContent() so replay
  metadata is rehydrated.
- viewport command extended: optional [<WxH>], optional [--scale <n>],
  scale-only variant reads current size via page.viewportSize(). Invalid
  scale (NaN, Infinity, empty, out of 1-3) throws with named value. Headed
  mode rejected explicitly.
- clearLoadedHtml() called BEFORE goto/back/forward/reload navigation
  (not after) so a timed-out goto post-commit doesn't leave stale metadata
  that could resurrect on a later context recreation. Codex v2 P1 catch.
- goto uses validateNavigationUrl's normalized return value.

meta-commands.ts
- screenshot --selector <css> flag: explicit element-screenshot form.
  Rejects alongside positional selector (both = error), preserves --clip
  conflict at line 161, composes with --base64 at lines 168-174.
- chain canonicalizes each step with canonicalizeCommand — step shape is
  now { rawName, name, args } so prevalidation, dispatch, WRITE_COMMANDS.has,
  watch blocking, and result labels all use canonical names while audit
  labels show 'rawName→name' when aliased. Codex v3 P2 catch — prior shape
  only canonicalized at prevalidation and diverged everywhere else.
- diff command consumes validateNavigationUrl return value for both URLs.

server.ts
- Command canonicalization inserted immediately after parse, before scope /
  watch / tab-ownership / content-wrapping checks. rawCommand preserved for
  future audit (not wired into audit log in this commit — follow-up).
- Unknown-command handler replaced with buildUnknownCommandError() from
  commands.ts — produces 'Unknown command: X. Did you mean Y?' with optional
  upgrade hint for NEW_IN_VERSION entries.

security-audit-r2.test.ts
- Updated chain-loop marker from 'for (const cmd of commands)' to
  'for (const c of commands)' to match the new chain step shape. Same
  isWatching + BLOCKED invariants still asserted.

* chore: bump version and changelog (v1.1.0.0)

- VERSION: 1.0.0.0 → 1.1.0.0 (MINOR bump — new user-facing commands)
- package.json: matching version bump
- CHANGELOG.md: new 1.1.0.0 entry describing load-html, screenshot --selector,
  viewport --scale, file:// support, setContent replay, and DX polish in user
  voice with a dedicated Security section for file:// safe-dirs policy
- browse/SKILL.md.tmpl: adds pattern #12 "Render local HTML", pattern #13
  "Retina screenshots", and a full Puppeteer → browse cheatsheet with side-by-
  side API mapping and a worked tweet-renderer migration example
- browse/SKILL.md + SKILL.md: regenerated from templates via `bun run gen:skill-docs`
  to reflect the new command descriptions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: pre-landing review fixes (9 findings from specialist + adversarial review)

Adversarial review (Claude subagent + Codex) surfaced 9 bugs across
CRITICAL/HIGH severity. All fixed:

1. tab-session.ts:setTabContent — state mutation moved AFTER the setContent
   await. Prior order left phantom HTML in replay metadata if setContent
   threw (timeout, browser crash), which a later viewport --scale would
   silently replay. Now loadedHtml is only recorded on successful load.

2. browser-manager.ts:setDeviceScaleFactor — rollback now forces a second
   recreateContext after restoring the old fields. The fallback path in
   the original recreateContext builds a blank context using whatever
   this.deviceScaleFactor/currentViewport hold at that moment (which were
   the NEW values we were trying to apply). Rolling back the fields without
   a second recreate left the live context at new-scale while state tracked
   old-scale. Now: restore fields, force re-recreate with old values, only
   if that ALSO fails do we return a combined error.

3. commands.ts:buildUnknownCommandError — Levenshtein tiebreak simplified
   to 'd <= 2 && d < bestDist' (strict less). Candidates are pre-sorted
   alphabetically, so first equal-distance wins by default. The prior
   '(d === bestDist && best !== undefined && cand < best)' clause was dead
   code.

4. tab-session.ts:onMainFrameNavigated — now clears loadedHtml, not just
   refs + frame. Without this, a user who load-html'd then clicked a link
   (or had a form submit / JS redirect / OAuth flow) would retain the stale
   replay metadata. The next viewport --scale would silently revert the
   tab to the ORIGINAL loaded HTML, losing whatever the post-navigation
   content was. Silent data corruption. Browser-emitted navigations trigger
   this path via wirePageEvents.

5. browser-manager.ts:saveState + restoreState — tab ownership now flows
   through BrowserState.owner. Without this, a scoped agent's viewport
   --scale would strand them: tab IDs change during recreate, ownership
   map held stale IDs, owner lookup failed. New IDs had no owner, so
   writes without tabId were denied (DoS). Worse, if the agent sent a
   stale tabId the server's swallowed-tab-switch-error path would let the
   command hit whatever tab was currently active (cross-tab authz bypass).
   Now: clear ownership before restore, re-add per-tab with new IDs.

6. meta-commands.ts:state load — disk-loaded state.pages is now explicit
   allowlist (url, isActive, storage:null) instead of object spread.
   Spreading accepted loadedHtml, loadedHtmlWaitUntil, and owner from a
   user-writable state file, letting a tampered state.json smuggle HTML
   past load-html's safe-dirs / extension / magic-byte / 50MB-cap
   validators, or forge tab ownership. Now stripped at the boundary.

7. url-validation.ts:normalizeFileUrl — preserves query string + fragment
   across normalization. file://./app.html?route=home#login previously
   resolved to a filesystem path that URL-encoded '?' as %3F and '#' as
   %23, or (for absolute forms) pathToFileURL dropped them entirely. SPAs
   and fixture URLs with query params 404'd or loaded the wrong route.
   Now: split on ?/# before path resolution, reattach after.

8. url-validation.ts:validateNavigationUrl — reattaches parsed.search +
   parsed.hash to the normalized file:// URL. Same fix at the main
   validator for absolute paths that go through fileURLToPath round-trip.

9. server.ts:writeAuditEntry — audit entries now include aliasOf when the
   user typed an alias ('setcontent' → cmd: 'load-html', aliasOf:
   'setcontent'). Previously the isAliased variable was computed but
   dropped, losing the raw input from the forensic trail. Completes the
   plan's codex v3 P2 requirement.

Also added bm.getCurrentViewport() and switched 'viewport --scale'-
without-size to read from it (more reliable than page.viewportSize() on
headed/transition contexts).

Tests pass: exit 0, no failures. Build clean.

* test: integration coverage for load-html, screenshot --selector, viewport --scale, replay, aliases

Adds 28 Playwright-integration tests that close the coverage gap flagged
by the ship-workflow coverage audit (50% → expected ~80%+).

**load-html (12 tests):**
- happy path loads HTML file, page text matches
- bare HTML fragments (<div>...</div>) accepted, not just full documents
- missing file arg throws usage
- non-.html extension rejected by allowlist
- /etc/passwd.html rejected by safe-dirs policy
- ENOENT path rejected with actionable "not found" error
- directory target rejected
- binary file (PNG magic bytes) disguised as .html rejected by magic-byte check
- UTF-8 BOM stripped before magic-byte check — BOM-prefixed HTML accepted
- --wait-until networkidle exercises non-default branch
- invalid --wait-until value rejected
- unknown flag rejected

**screenshot --selector (5 tests):**
- --selector flag captures element, validates Screenshot saved (element)
- conflicts with positional selector (both = error)
- conflicts with --clip (mutually exclusive)
- composes with --base64 (returns data:image/png;base64,...)
- missing value throws usage

**viewport --scale (5 tests):**
- WxH --scale 2 produces PNG with 2x element dimensions (parses IHDR bytes 16-23)
- --scale without WxH keeps current size + applies scale
- non-finite value (abc) throws "not a finite number"
- out-of-range (4, 0.5) throws "between 1 and 3"
- missing value throws

**setContent replay across context recreation (3 tests):**
- load-html → viewport --scale 2: content survives (hits setTabContent replay path)
- double cycle 2x → 1.5x: content still survives (proves TabSession rehydration)
- goto after load-html clears replay: subsequent viewport --scale does NOT
  resurrect the stale HTML (validates the onMainFrameNavigated fix)

**Command aliases (2 tests):**
- setcontent routes to load-html via chain canonicalization
- set-content (hyphenated) also routes — both end-to-end through chain dispatch

Fixture paths use /tmp (SAFE_DIRECTORIES entry) instead of $TMPDIR which is
/var/folders/... on macOS and outside the safe-dirs boundary. Chain result
labels use rawName→name format when an alias is resolved (matches the
meta-commands.ts chain refactor).

Full suite: exit 0, 223/223 pass.

* docs: update BROWSER.md + CHANGELOG for v1.1.0.0

BROWSER.md:
- Command reference table updated: goto now lists file:// support,
  load-html added to Navigate row, viewport flagged with --scale
  option, screenshot row shows --selector + --base64 flags
- Screenshot modes table adds the fifth mode (element crop via
  --selector flag) and notes the tag-selector-not-caught-positionally
  gotcha
- New "Retina screenshots — viewport --scale" subsection explains
  deviceScaleFactor mechanics, context recreation side effects, and
  headed-mode rejection
- New "Loading local HTML — goto file:// vs load-html" subsection
  explains the two paths, their tradeoffs (URL state, relative asset
  resolution), the safe-dirs policy, extension allowlist + magic-byte
  sniff, 50MB cap, setContent replay across recreateContext, and the
  alias routing (setcontent → load-html before scope check)

CHANGELOG.md (v1.1.0.0 security section expanded, no existing content
removed):
- State files cannot smuggle HTML or forge tab ownership (allowlist
  on disk-loaded page fields)
- Audit log records aliasOf when a canonical command was reached via
  an alias (setcontent → load-html)
- load-html content clears on real navigations (clicks, form submits,
  JS redirects) — not just explicit goto. Also notes SPA query/fragment
  preservation for goto file://

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:25:33 +08:00

300 lines
12 KiB
TypeScript

/**
* URL validation for navigation commands — blocks dangerous schemes and cloud metadata endpoints.
* Localhost and private IPs are allowed (primary use case: QA testing local dev servers).
*/
import { fileURLToPath, pathToFileURL } from 'node:url';
import * as path from 'node:path';
import * as os from 'node:os';
import { validateReadPath } from './path-security';
export const BLOCKED_METADATA_HOSTS = new Set([
'169.254.169.254', // AWS/GCP/Azure instance metadata
'fe80::1', // IPv6 link-local — common metadata endpoint alias
'::ffff:169.254.169.254', // IPv4-mapped IPv6 form of the metadata IP
'::ffff:a9fe:a9fe', // Hex-encoded IPv4-mapped form (URL constructor normalizes to this)
'::a9fe:a9fe', // Deprecated IPv4-compatible hex form
'metadata.google.internal', // GCP metadata
'metadata.azure.internal', // Azure IMDS
]);
/**
* IPv6 prefixes to block (CIDR-style). Any address starting with these
* hex prefixes is rejected. Covers the full ULA range (fc00::/7 = fc00:: and fd00::).
*/
const BLOCKED_IPV6_PREFIXES = ['fc', 'fd'];
/**
* Check if an IPv6 address falls within a blocked prefix range.
* Handles the full ULA range (fc00::/7), not just the exact literal fd00::.
* Only matches actual IPv6 addresses (must contain ':'), not hostnames
* like fd.example.com or fcustomer.com.
*/
function isBlockedIpv6(addr: string): boolean {
const normalized = addr.toLowerCase().replace(/^\[|\]$/g, '');
// Must contain a colon to be an IPv6 address — avoids false positives on
// hostnames like fd.example.com or fcustomer.com
if (!normalized.includes(':')) return false;
return BLOCKED_IPV6_PREFIXES.some(prefix => normalized.startsWith(prefix));
}
/**
* Normalize hostname for blocklist comparison:
* - Strip trailing dot (DNS fully-qualified notation)
* - Strip IPv6 brackets (URL.hostname includes [] for IPv6)
* - Resolve hex (0xA9FEA9FE) and decimal (2852039166) IP representations
*/
function normalizeHostname(hostname: string): string {
// Strip IPv6 brackets
let h = hostname.startsWith('[') && hostname.endsWith(']')
? hostname.slice(1, -1)
: hostname;
// Strip trailing dot
if (h.endsWith('.')) h = h.slice(0, -1);
return h;
}
/**
* Check if a hostname resolves to the link-local metadata IP 169.254.169.254.
* Catches hex (0xA9FEA9FE), decimal (2852039166), and octal (0251.0376.0251.0376) forms.
*/
function isMetadataIp(hostname: string): boolean {
// Try to parse as a numeric IP via URL constructor — it normalizes all forms
try {
const probe = new URL(`http://${hostname}`);
const normalized = probe.hostname;
if (BLOCKED_METADATA_HOSTS.has(normalized) || isBlockedIpv6(normalized)) return true;
// Also check after stripping trailing dot
if (normalized.endsWith('.') && BLOCKED_METADATA_HOSTS.has(normalized.slice(0, -1))) return true;
} catch {
// Not a valid hostname — can't be a metadata IP
}
return false;
}
/**
* Resolve a hostname to its IP addresses and check if any resolve to blocked metadata IPs.
* Mitigates DNS rebinding: even if the hostname looks safe, the resolved IP might not be.
*
* Checks both A (IPv4) and AAAA (IPv6) records — an attacker can use AAAA-only DNS to
* bypass IPv4-only checks. Each record family is tried independently; failure of one
* (e.g. no AAAA records exist) is not treated as a rebinding risk.
*/
async function resolvesToBlockedIp(hostname: string): Promise<boolean> {
try {
const dns = await import('node:dns');
const { resolve4, resolve6 } = dns.promises;
// Check IPv4 A records
const v4Check = resolve4(hostname).then(
(addresses) => addresses.some(addr => BLOCKED_METADATA_HOSTS.has(addr)),
() => false, // ENODATA / ENOTFOUND — no A records, not a risk
);
// Check IPv6 AAAA records — the gap that issue #668 identified
const v6Check = resolve6(hostname).then(
(addresses) => addresses.some(addr => {
const normalized = addr.toLowerCase();
return BLOCKED_METADATA_HOSTS.has(normalized) || isBlockedIpv6(normalized) ||
// fe80::/10 is link-local — always block (covers all fe80:: addresses)
normalized.startsWith('fe80:');
}),
() => false, // ENODATA / ENOTFOUND — no AAAA records, not a risk
);
const [v4Blocked, v6Blocked] = await Promise.all([v4Check, v6Check]);
return v4Blocked || v6Blocked;
} catch {
// Unexpected error — fail open (don't block navigation on DNS infrastructure failure)
return false;
}
}
/**
* Normalize non-standard file:// URLs into absolute form before the WHATWG URL parser
* sees them. Handles cwd-relative, home-relative, and bare-segment shapes that the
* standard parser would otherwise mis-interpret as hostnames.
*
* file:///abs/path.html → unchanged
* file://./<rel> → file://<cwd>/<rel>
* file://~/<rel> → file://<HOME>/<rel>
* file://<single-segment>/... → file://<cwd>/<single-segment>/... (cwd-relative)
* file://localhost/<abs> → unchanged
* file://<host-like>/... → unchanged (caller rejects via host heuristic)
*
* Rejects empty (file://) and root-only (file:///) URLs — these would silently
* trigger Chromium's directory listing, which is a different product surface.
*/
export function normalizeFileUrl(url: string): string {
if (!url.toLowerCase().startsWith('file:')) return url;
// Split off query + fragment BEFORE touching the path — SPAs + fixture URLs rely
// on these. path.resolve would URL-encode `?` and `#` as `%3F`/`%23` (and
// pathToFileURL drops them entirely), silently routing preview URLs to the
// wrong fixture. Extract, normalize the path, reattach at the end.
//
// Parse order: `?` before `#` per RFC 3986 — '?' in a fragment is literal.
// Find the FIRST `?` or `#`, whichever comes first, and take everything
// after (including the delimiter) as the trailing segment.
const qIdx = url.indexOf('?');
const hIdx = url.indexOf('#');
let delimIdx = -1;
if (qIdx >= 0 && hIdx >= 0) delimIdx = Math.min(qIdx, hIdx);
else if (qIdx >= 0) delimIdx = qIdx;
else if (hIdx >= 0) delimIdx = hIdx;
const pathPart = delimIdx >= 0 ? url.slice(0, delimIdx) : url;
const trailing = delimIdx >= 0 ? url.slice(delimIdx) : '';
const rest = pathPart.slice('file:'.length);
// file:/// or longer → standard absolute; pass through unchanged (caller validates path).
if (rest.startsWith('///')) {
// Reject bare root-only (file:/// with nothing after)
if (rest === '///' || rest === '////') {
throw new Error('Invalid file URL: file:/// has no path. Use file:///<absolute-path>.');
}
return pathPart + trailing;
}
// Everything else: must start with // (we accept file://... only)
if (!rest.startsWith('//')) {
throw new Error(`Invalid file URL: ${url}. Use file:///<absolute-path> or file://./<rel> or file://~/<rel>.`);
}
const afterDoubleSlash = rest.slice(2);
// Reject empty (file://) and trailing-slash-only (file://./ listing cwd).
if (afterDoubleSlash === '') {
throw new Error('Invalid file URL: file:// is empty. Use file:///<absolute-path>.');
}
if (afterDoubleSlash === '.' || afterDoubleSlash === './') {
throw new Error('Invalid file URL: file://./ would list the current directory. Use file://./<filename> to render a specific file.');
}
if (afterDoubleSlash === '~' || afterDoubleSlash === '~/') {
throw new Error('Invalid file URL: file://~/ would list the home directory. Use file://~/<filename> to render a specific file.');
}
// Home-relative: file://~/<rel>
if (afterDoubleSlash.startsWith('~/')) {
const rel = afterDoubleSlash.slice(2);
const absPath = path.join(os.homedir(), rel);
return pathToFileURL(absPath).href + trailing;
}
// cwd-relative with explicit ./ : file://./<rel>
if (afterDoubleSlash.startsWith('./')) {
const rel = afterDoubleSlash.slice(2);
const absPath = path.resolve(process.cwd(), rel);
return pathToFileURL(absPath).href + trailing;
}
// localhost host explicitly allowed: file://localhost/<abs> (pass through to standard parser).
if (afterDoubleSlash.toLowerCase().startsWith('localhost/')) {
return pathPart + trailing;
}
// Ambiguous: file://<segment>/<rest> — treat as cwd-relative ONLY if <segment> is a
// simple path name (no dots, no colons, no backslashes, no percent-encoding, no
// IPv6 brackets, no Windows drive letter pattern).
const firstSlash = afterDoubleSlash.indexOf('/');
const segment = firstSlash === -1 ? afterDoubleSlash : afterDoubleSlash.slice(0, firstSlash);
// Reject host-like segments: dotted names (docs.v1), IPs (127.0.0.1), IPv6 ([::1]),
// drive letters (C:), percent-encoded, or backslash paths.
const looksLikeHost = /[.:\\%]/.test(segment) || segment.startsWith('[');
if (looksLikeHost) {
throw new Error(
`Unsupported file URL host: ${segment}. Use file:///<absolute-path> for local files (network/UNC paths are not supported).`
);
}
// Simple-segment cwd-relative: file://docs/page.html → cwd/docs/page.html
const absPath = path.resolve(process.cwd(), afterDoubleSlash);
return pathToFileURL(absPath).href + trailing;
}
/**
* Validate a navigation URL and return a normalized version suitable for page.goto().
*
* Callers MUST use the return value — normalization of non-standard file:// forms
* only takes effect at the navigation site, not at the original URL.
*
* Callers (keep this list current, grep before removing):
* - write-commands.ts:goto
* - meta-commands.ts:diff (both URL args)
* - browser-manager.ts:newTab
* - browser-manager.ts:restoreState
*/
export async function validateNavigationUrl(url: string): Promise<string> {
// Normalize non-standard file:// shapes before the URL parser sees them.
let normalized = url;
if (url.toLowerCase().startsWith('file:')) {
normalized = normalizeFileUrl(url);
}
let parsed: URL;
try {
parsed = new URL(normalized);
} catch {
throw new Error(`Invalid URL: ${url}`);
}
// file:// path: validate against safe-dirs and allow; otherwise defer to http(s) logic.
if (parsed.protocol === 'file:') {
// Reject non-empty non-localhost hosts (UNC / network paths).
if (parsed.host !== '' && parsed.host.toLowerCase() !== 'localhost') {
throw new Error(
`Unsupported file URL host: ${parsed.host}. Use file:///<absolute-path> for local files.`
);
}
// Convert URL → filesystem path with proper decoding (handles %20, %2F, etc.)
// fileURLToPath strips query + hash; we reattach them after validation so SPA
// fixture URLs like file:///tmp/app.html?route=home#login survive intact.
let fsPath: string;
try {
fsPath = fileURLToPath(parsed);
} catch (e: any) {
throw new Error(`Invalid file URL: ${url} (${e.message})`);
}
// Reject path traversal after decoding — e.g. file:///tmp/safe%2F..%2Fetc/passwd
// Note: fileURLToPath doesn't collapse .., so a literal '..' in the decoded path
// is suspicious. path.resolve will normalize it; check the result against safe dirs.
validateReadPath(fsPath);
// Return the canonical file:// URL derived from the filesystem path + original
// query + hash. This guarantees page.goto() gets a well-formed URL regardless
// of input shape while preserving SPA route/query params.
return pathToFileURL(fsPath).href + parsed.search + parsed.hash;
}
if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
throw new Error(
`Blocked: scheme "${parsed.protocol}" is not allowed. Only http:, https:, and file: URLs are permitted.`
);
}
const hostname = normalizeHostname(parsed.hostname.toLowerCase());
if (BLOCKED_METADATA_HOSTS.has(hostname) || isMetadataIp(hostname) || isBlockedIpv6(hostname)) {
throw new Error(
`Blocked: ${parsed.hostname} is a cloud metadata endpoint. Access is denied for security.`
);
}
// DNS rebinding protection: resolve hostname and check if it points to metadata IPs.
// Skip for loopback/private IPs — they can't be DNS-rebinded and the async DNS
// resolution adds latency that breaks concurrent E2E tests under load.
const isLoopback = hostname === 'localhost' || hostname === '127.0.0.1' || hostname === '::1';
const isPrivateNet = /^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)/.test(hostname);
if (!isLoopback && !isPrivateNet && await resolvesToBlockedIp(hostname)) {
throw new Error(
`Blocked: ${parsed.hostname} resolves to a cloud metadata IP. Possible DNS rebinding attack.`
);
}
return url;
}