mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-05 05:05:08 +02:00
b73f364411
* refactor: extract path-security.ts shared module

  validateOutputPath, validateReadPath, and SAFE_DIRECTORIES were duplicated
  across write-commands.ts, meta-commands.ts, and read-commands.ts. Extract to
  a single shared module with re-exports for backward compatibility. Also adds
  validateTempPath() for the upcoming GET /file endpoint (TEMP_DIR only, not
  cwd, to prevent remote agents from reading project files).

* feat: default paired agents to full access, split SCOPE_CONTROL

  The trust boundary for paired agents is the pairing ceremony itself, not the
  scope. An agent with write scope can already click anything and navigate
  anywhere. Gating js/cookies behind --admin was security theater.

  Changes:
  - Default pair scopes: read+write+admin+meta (was read+write)
  - New SCOPE_CONTROL for browser-wide destructive ops (stop, restart,
    disconnect, state, handoff, resume, connect)
  - --admin flag now grants control scope (backward compat)
  - New --restrict flag for limited access (e.g., --restrict read)
  - Updated hint text: "re-pair with --control" instead of "--admin"

* feat: add media and data commands for page content extraction

  media command: discovers all img/video/audio/background-image elements on
  the page. Returns JSON with URLs, dimensions, srcset, loading state,
  HLS/DASH detection. Supports --images/--videos/--audio filters and optional
  CSS selector scoping.

  data command: extracts structured data embedded in pages (JSON-LD, Open
  Graph, Twitter Cards, meta tags). One command returns product prices,
  article metadata, social share info without DOM scraping.

  Both are READ scope with untrusted content wrapping. Shared media-extract.ts
  helper for reuse by the upcoming scrape command.

* feat: add download, scrape, and archive commands

  download: fetch any URL or @ref element to disk using browser session
  cookies via page.request.fetch(). Supports blob: URLs via in-page base64
  conversion. --base64 flag returns inline data URI (cap 10MB). Detects
  HLS/DASH and rejects with yt-dlp hint.

  scrape: bulk media download composing media discovery + download loop.
  Sequential with 100ms delay, URL deduplication, configurable --limit.
  Writes manifest.json with per-file metadata for machine consumption.

  archive: saves complete page as MHTML via CDP Page.captureSnapshot. No
  silent fallback -- errors clearly if CDP unavailable.

  All three are WRITE scope (write to disk, blocked in watch mode).

* feat: add GET /file endpoint for remote agent file retrieval

  Remote paired agents can now retrieve downloaded files over HTTP. TEMP_DIR
  only (not cwd) to prevent project file exfiltration.

  - Bearer token auth (root or scoped with read scope)
  - Path validation via validateTempPath() (symlink-aware)
  - 200MB size cap
  - Extension-based MIME detection
  - Zero-copy streaming via Bun.file()

* feat: add scroll --times N for automated repeated scrolling

  Extends the scroll command with --times N flag for infinite feed scraping.
  Scrolls N times with configurable --wait delay (default 1000ms) between
  each scroll for content loading.

  Usage:
    scroll --times 10
    scroll --times 5 --wait 2000
    scroll --times 3 .feed-container

  Composable with scrape: scroll to load content, then scrape images.

* feat: add network response body capture (--capture/--export/--bodies)

  The killer feature for social media scraping. Extends the existing network
  command to intercept API response bodies:

    network --capture [--filter graphql]   # start capturing
    network --capture stop                 # stop
    network --export /tmp/api.jsonl       # export as JSONL
    network --bodies                      # show summary

  Uses page.on('response') listener with URL pattern filtering.
  SizeCappedBuffer (50MB total, 5MB per-entry cap) evicts oldest entries when
  full. Binary responses stored as base64, text as-is.

  This lets agents tap Instagram's GraphQL API, TikTok's hydration data, and
  any SPA's internal API responses instead of fragile DOM scraping.

* feat: add screenshot --base64 for inline image return

  Returns data:image/png;base64,... instead of writing to disk. Cap at 10MB.
  Works with all screenshot modes (element, clip, viewport). Eliminates the
  two-step screenshot+file-serve dance for remote agents.

* test: add data platform tests and media fixture

  Tests for SizeCappedBuffer (eviction, export, summary), validateTempPath
  (TEMP_DIR only, rejects cwd), command registration (all new commands in
  correct scope sets), and MIME mapping source checks.

  Rich HTML fixture with: standard images, lazy-loaded images, srcset, video
  with sources + HLS, audio, CSS background-images, JSON-LD, Open Graph,
  Twitter Cards, and meta tags.

* docs: regenerate SKILL.md with Extraction category

  Add Extraction category to browse command table ordering. Regenerate
  SKILL.md files to include media, data, download, scrape, archive commands
  in the generated documentation.

* chore: bump version and changelog (v0.16.0.0)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
178 lines
5.2 KiB
TypeScript
/**
 * Media extraction helper — shared between `media` (read) and `scrape` (write) commands.
 *
 * Runs page.evaluate() to discover all media elements on the page:
 * - <img> with src, srcset, currentSrc, alt, dimensions, loading, data-src
 * - <video> with currentSrc, poster, duration, <source> children, HLS/DASH detection
 * - <audio> with src, duration, type
 * - CSS background-image (capped at 500 elements)
 */

import type { Page, Frame } from 'playwright';

export interface ImageInfo {
  index: number;
  src: string;
  srcset: string;
  currentSrc: string;
  alt: string;
  width: number;
  height: number;
  naturalWidth: number;
  naturalHeight: number;
  loading: string;
  dataSrc: string;
  visible: boolean;
}

export interface VideoSource {
  src: string;
  type: string;
}

export interface VideoInfo {
  index: number;
  src: string;
  currentSrc: string;
  poster: string;
  width: number;
  height: number;
  duration: number;
  type: string;
  sources: VideoSource[];
  isHLS: boolean;
  isDASH: boolean;
}

export interface AudioInfo {
  index: number;
  src: string;
  currentSrc: string;
  duration: number;
  type: string;
}

export interface BackgroundImageInfo {
  index: number;
  url: string;
  selector: string;
  element: string;
}

export interface MediaResult {
  images: ImageInfo[];
  videos: VideoInfo[];
  audio: AudioInfo[];
  backgroundImages: BackgroundImageInfo[];
  total: number;
}

/** Extract all media elements from the page or a scoped subtree. */
export async function extractMedia(
  target: Page | Frame,
  options?: { selector?: string; filter?: 'images' | 'videos' | 'audio' },
): Promise<MediaResult> {
  const result = await target.evaluate(({ scopeSelector, filter }) => {
    const root = scopeSelector
      ? document.querySelector(scopeSelector) || document
      : document;

    const images: any[] = [];
    const videos: any[] = [];
    const audio: any[] = [];
    const backgroundImages: any[] = [];

    // Images
    if (!filter || filter === 'images') {
      const imgs = root.querySelectorAll('img');
      imgs.forEach((img, i) => {
        const rect = img.getBoundingClientRect();
        images.push({
          index: i,
          src: img.src || '',
          srcset: img.srcset || '',
          currentSrc: img.currentSrc || '',
          alt: img.alt || '',
          width: img.width,
          height: img.height,
          naturalWidth: img.naturalWidth,
          naturalHeight: img.naturalHeight,
          loading: img.loading || '',
          dataSrc: img.getAttribute('data-src') || img.getAttribute('data-lazy-src') || img.getAttribute('data-original') || '',
          visible: rect.width > 0 && rect.height > 0 && rect.bottom > 0 && rect.right > 0,
        });
      });
    }

    // Videos
    if (!filter || filter === 'videos') {
      const vids = root.querySelectorAll('video');
      vids.forEach((vid, i) => {
        const sources = Array.from(vid.querySelectorAll('source')).map(s => ({
          src: s.src || '',
          type: s.type || '',
        }));
        const isHLS = sources.some(s => s.type.includes('mpegURL') || s.src.includes('.m3u8'));
        const isDASH = sources.some(s => s.type.includes('dash') || s.src.includes('.mpd'));
        videos.push({
          index: i,
          src: vid.src || '',
          currentSrc: vid.currentSrc || '',
          poster: vid.poster || '',
          width: vid.videoWidth || vid.width,
          height: vid.videoHeight || vid.height,
          duration: isFinite(vid.duration) ? vid.duration : 0,
          type: sources[0]?.type || '',
          sources,
          isHLS,
          isDASH,
        });
      });
    }

    // Audio
    if (!filter || filter === 'audio') {
      const auds = root.querySelectorAll('audio');
      auds.forEach((aud, i) => {
        const source = aud.querySelector('source');
        audio.push({
          index: i,
          src: aud.src || source?.src || '',
          currentSrc: aud.currentSrc || '',
          duration: isFinite(aud.duration) ? aud.duration : 0,
          type: source?.type || '',
        });
      });
    }

    // Background images (capped at 500 elements for performance)
    if (!filter || filter === 'images') {
      const allElements = root.querySelectorAll('*');
      let bgCount = 0;
      for (let i = 0; i < allElements.length && bgCount < 500; i++) {
        const el = allElements[i];
        const bg = getComputedStyle(el).backgroundImage;
        if (bg && bg !== 'none') {
          const urlMatch = bg.match(/url\(["']?([^"')]+)["']?\)/);
          if (urlMatch && urlMatch[1] && !urlMatch[1].startsWith('data:')) {
            backgroundImages.push({
              index: bgCount,
              url: urlMatch[1],
              selector: el.tagName.toLowerCase() + (el.id ? `#${el.id}` : '') + (el.className && typeof el.className === 'string' ? '.' + el.className.trim().split(/\s+/).join('.') : ''),
              element: el.tagName.toLowerCase(),
            });
            bgCount++;
          }
        }
      }
    }

    return { images, videos, audio, backgroundImages };
  }, { scopeSelector: options?.selector || null, filter: options?.filter || null });

  return {
    ...result,
    total: result.images.length + result.videos.length + result.audio.length + result.backgroundImages.length,
  };
}
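The HLS/DASH detection and background-image URL parsing inside the evaluate callback are plain string logic, so they can be exercised standalone. A minimal mirror of that in-page logic (hypothetical helper names; these functions are not exported by the module):

```typescript
interface Source {
  src: string;
  type: string;
}

// Mirrors the in-page check: HLS if any <source> declares an mpegURL MIME
// type or points at an .m3u8 playlist.
function detectHLS(sources: Source[]): boolean {
  return sources.some(s => s.type.includes('mpegURL') || s.src.includes('.m3u8'));
}

// DASH if any <source> declares a dash MIME type or points at an .mpd manifest.
function detectDASH(sources: Source[]): boolean {
  return sources.some(s => s.type.includes('dash') || s.src.includes('.mpd'));
}

// Extracts the first url(...) from a computed background-image value,
// skipping data: URIs, as the extractor does.
function backgroundUrl(bg: string): string | null {
  const m = bg.match(/url\(["']?([^"')]+)["']?\)/);
  return m && m[1] && !m[1].startsWith('data:') ? m[1] : null;
}

console.log(detectHLS([{ src: 'https://cdn.example/stream.m3u8', type: '' }])); // true
console.log(backgroundUrl('url("https://cdn.example/hero.jpg")')); // https://cdn.example/hero.jpg
console.log(backgroundUrl('url(data:image/png;base64,AAAA)')); // null
```

Note these checks are substring heuristics: a query parameter containing `.m3u8` would also trigger HLS detection, which is acceptable for a discovery helper that only gates a yt-dlp hint.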