diff --git a/CLAUDE.md b/CLAUDE.md index d422ac7..cd3783a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -29,7 +29,7 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r - `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path). - `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new issuer/binding here, not inline. -- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: ` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. 
`iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output. +- `metadata.py` — `scan_head(path, size=1MB)` is the shared input for every C2PA/AIGC/IPTC byte scan: first `size` bytes plus, for ISOBMFF, the late provenance-box payloads from `isobmff.scan_c2pa_region` (catches a manifest after a large `mdat`); behavior-neutral (`f.read(size)`) for non-ISOBMFF. Use it instead of `open().read(1MB)` for any new marker scan. `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: ` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. `iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). 
`remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output. - `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. **C2PA platform attribution is device-token-first, issuer-scan fallback** (`_device_platform` scans manifest bytes for `_DEVICE_C2PA_PLATFORM` tokens, then `_attribute_platform`/`_ISSUER_PLATFORM`). **Why, verified on real signed files 2026-05-26:** the old issuer-only byte-scan matched ANY issuer substring anywhere, so multi-entity manifests mis-attributed -- Leica→"Truepic" (a signing authority in the trust chain), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Pixel→"Google (Gemini)" ("Google LLC" cert org), Truepic→"Google". A distinctive device token wins instead. **Token distinctiveness is load-bearing:** bare `b"Truepic"` mis-fires (it appears in unrelated trust chains -- it mis-attributed the OpenAI `chatgpt-1.png` fixture), so the token is the specific `b"Truepic_Lens"` from the Lens SDK claim generator; likewise `b"Pixel Camera"` (cert CN) not bare `b"Pixel"`. 
`_DEVICE_C2PA_PLATFORM` lists ONLY tokens **verified against a real C2PA file**: Leica (`lc_c2pa`/`Leica Camera`), Nikon (`NIKON`), Pixel (`Pixel Camera` -- from a real Pixel 10 Pro file attached to c2pa-rs issue #1609/#1554), Sony (`sony.sig`/`sony.cert` -- Sony's own C2PA assertion namespace, verified on a real Sony PXW-Z300 file; NOT bare "Sony" which is a common EXIF Make), Truepic (`Truepic_Lens`). Canon/Samsung/Bria have **no public direct-download C2PA sample** (checked exhaustively: GitHub issue/PR attachments, contentcredentials gallery, HF datasets -- all upload-to-verify or token-gated; Canon's only public file was a self-signed hobbyist CR3, not factory), so they stay unmapped until a real file is captured (same fixture discipline as Grok/Doubao). The Sony sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the `sony.*` namespace but are not separately verified. Camera C2PA marks capture authenticity, not AI (Pixel carries `computationalCapture`, not `trainedAlgorithmicMedia`), so these never set `is_ai` -- that stays driven by digital-source-type. `c2pa.cbor_text_after` (now public) is best-effort for the `generator` detail string only and can be None when the manifest keys it `claim_generator_info` (Pixel). **Issuer→generator mapping is `is_ai`-gated** (`_attribute_platform(issuers, is_ai=c2pa_is_ai)`): a specific AI-generator platform is named only when the digital-source-type is `trainedAlgorithmicMedia`; on a non-AI source an issuer substring is treated as incidental (an "Adobe XMP" toolkit string in an *unmapped* Canon/Sony capture would otherwise mislabel it "Adobe Firefly"), so it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type, so it is unaffected (verified: chatgpt-1.png→OpenAI, firefly-1.png→Adobe Firefly still attribute). `_attribute_platform` defaults `is_ai=True` so the mapping stays unit-testable in isolation. 
Add device tokens to `_DEVICE_C2PA_PLATFORM`, generator/issuer platforms to `_ISSUER_PLATFORM`, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read. **Integrity-clash detection** (`_integrity_clashes`, surfaced as `ProvenanceReport.integrity_clashes`, printed in red by `identify` and serialized to `--json`): contradictions between independent generator stamps are a laundering/spoofing tell. Two rules: (1) two or more distinct AI-origin vendors named by independent signals (e.g. C2PA OpenAI + EXIF `Make="Ideogram AI"`), and (2) a camera-capture C2PA device (`_DEVICE_C2PA_PLATFORM`) coexisting with any AI-generation marker. Vendor normalization is `_vendor_of` over `_AI_VENDOR_TOKENS` (so a C2PA "Google (Gemini)" issuer and a SynthID-Google proxy agree, while different vendors clash). **High-precision by design:** only hard generator stamps feed it (C2PA-issuer when source is AI, SynthID, EXIF/XMP generator, IPTC `AISystemUsed`, xAI, AIGC); the fuzzy visible sparkle and the open invisible watermark are **excluded** (the latter can be a by-product of our own SDXL removal pass). The c2pa vendor is classified from the issuer attribution / generator, NOT the resolved `platform` (a camera label like "Google Pixel" would mis-normalize to "Google"). 
All real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce **zero** clashes (false-positive guard in `test_identify.py::TestRealSamplesHaveNoClash`). - `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`. - `doubao_engine.py` — visible Doubao "豆包AI生成" remover/detector (cv2/numpy, no GPU). `DoubaoEngine.locate` anchors a bottom-right box by **geometry** (mark scales with image WIDTH, fractions in module constants; no bundled template), `extract_mask` pulls the light low-saturation glyphs with a **polarity-aware white top-hat** (brighter-than-blurred-local-bg, so white-paper documents are left untouched instead of smeared), `detect` thresholds glyph coverage (`DETECT_MIN_COVERAGE` 0.16 separates real marks ≥0.20 from corner noise, which stays ≤0.06 on large images but can spike to ~0.15 on tiny ones), `remove_watermark` inpaints (cv2 Telea/NS) and **bails when coverage > `MAX_INPAINT_COVERAGE` 0.50** (dense-text background → would smear). Wired into `visible --mark` via `cli._run_doubao_if_selected`. **Logo is near-white (~253), not the gray some third-party tools assume.** Best on photo/illustration backgrounds; high-contrast edges leave faint residue (cv2-inpaint limit). Clean per-pixel reverse-alpha (Gemini-style) is the future upgrade but needs a captured/distilled alpha map — see below. @@ -54,7 +54,7 @@ Who embeds what, and whether it is locally detectable (so we know which gaps are - **No detectable signal on download (correctly reported `unknown`):** **Recraft** (PNG export is a re-encoded design export — strips everything), **Krea hosting FLUX 2** (no imwatermark despite FLUX — the host omits the encoder, same as Stability's hosted SDXL), and Midjourney (embeds nothing). 
Lesson: the imwatermark detector only fires on *pristine* output from a pipeline that runs the encoder (diffusers default, official BFL), not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva). - **Invisible but NOT locally detectable (proprietary, API/oracle only — same wall as SynthID):** Amazon Titan Image Generator + Nova Canvas (Bedrock `DetectGeneratedContent` API), Kakao (new SynthID image adopter, May 2026), NVIDIA Cosmos (SynthID video). No local detector possible; treat like SynthID. - **C2PA 2.4 "Durable Content Credentials" (April 2026; verified against the spec) raise the bar for metadata stripping.** 2.4 defines soft bindings (an invisible watermark or a content fingerprint) plus a server-side manifest repository and a new `c2pa.repository-receipt` assertion. Per the spec: "if a C2PA manifest is removed from an asset, but a copy of that manifest remains in a provenance store elsewhere, the manifest and asset may be matched using available soft bindings." So our local `metadata --remove` deletes the *embedded* manifest, but a fingerprint/watermark soft binding can still re-link the image to its manifest in a repository server-side. Stripping the file is becoming necessary-but-not-sufficient against durable provenance. (Our parsers target the stable embedded-manifest format documented in C2PA 2.1 §11; that format is unchanged in 2.4 -- the new pieces are repository/soft-binding infra, not the on-file box layout, so no parser change is implied.) Spec: https://spec.c2pa.org/specifications/specifications/2.4/specs/C2PA_Specification.html We now READ the soft-binding `alg` (`C2PA_SOFT_BINDINGS` / `soft_binding_vendors_in`) to name the forensic-watermark vendor, and locally DECODE the one open scheme, Adobe TrustMark (`trustmark_detector`); the rest (Digimarc/Imatag/Steg.AI/...) stay name-only (proprietary decoders). 
-- **Built 2026-05-26 (this batch):** soft-binding `alg` vendor detection; IPTC Photo Metadata 2025.1 AI-disclosure fields (`AISystemUsed` etc.); **video C2PA metadata** detect + strip for MP4/MOV/M4V (free — `isobmff.py` is format-agnostic, MP4 is ISOBMFF); Adobe TrustMark open decoder. NOT done (out of cheap reach, per the feasibility review): visible video-logo removal (needs a video frame pipeline) and audio (SynthID/ElevenLabs/Resemble/Suno all oracle-only or unmarked). The soft-binding **box detection window**: non-PNG/video detection scans the first 1 MB, so a C2PA box placed after a large `mdat` in a streaming MP4 can be missed — front-placed manifests (the common case) are caught. +- **Built 2026-05-26 (this batch):** soft-binding `alg` vendor detection; IPTC Photo Metadata 2025.1 AI-disclosure fields (`AISystemUsed` etc.); **video C2PA metadata** detect + strip for MP4/MOV/M4V (free — `isobmff.py` is format-agnostic, MP4 is ISOBMFF); Adobe TrustMark open decoder. NOT done (out of cheap reach, per the feasibility review): visible video-logo removal (needs a video frame pipeline) and audio (SynthID/ElevenLabs/Resemble/Suno all oracle-only or unmarked). **Box detection window — now handled (v0.6.8):** detection no longer relies on a fixed first-MB read. `metadata.scan_head(path, size)` reads the first `size` bytes and, for ISOBMFF, appends the payloads of late provenance boxes found by `isobmff.scan_c2pa_region` (a file-seeking top-level box walker that skips past `mdat` by size without reading it), so a C2PA/AIGC/IPTC manifest placed AFTER a large `mdat` in a streaming/non-faststart MP4 is now caught. Every C2PA/marker byte scan (`has_ai_metadata`, `aigc_label`, `iptc_ai_system`, `synthid_source`, `exif_generator` XMP, `get_ai_metadata` soft-binding, and `identify`) goes through `scan_head`; it is behavior-neutral for non-ISOBMFF inputs (exactly `f.read(size)`). 
The remaining gap is EXIF/XMP stored as items *inside the `meta` box* (still needs meta-box surgery / exiftool). - **Regulatory driver (context, not a code change):** AI-content labeling mandates are expanding, which pushes more generators toward exactly the C2PA + watermark signals we read. The full per-jurisdiction table lives in README "## Legal" -- keep it there, not duplicated here. Newly added + primary-source verified 2026-05-26: **EU AI Act Article 50** machine-readable marking applicable **2026-08-02** (verified against the article text); **South Korea AI Framework Act Art. 31(3)** in force since **22 January 2026** (verified via Kim & Chang + FPF/Korea Times; Enforcement Decree accepts an invisible-watermark label); **California AB 853** (amends the CA AI Transparency Act) latent-disclosure duty operative **2026-08-02**, requiring a disclosure "permanent or extraordinarily difficult to remove" (verified against the leginfo bill text -- this is the exact disclosure our tool strips); **India IT Amendment Rules 2026** in force **2026-02-20** (verified via Chambers), which prominently-label + permanent-provenance-id all synthetic media AND **expressly prohibit removing/suppressing the label or metadata** -- the first major all-content removal ban outside China. **Removal liability (README "## Legal" disclaimer):** the tool is lawful general-purpose software; liability sits with the remover and is intent-gated -- downstream acts (fraud/deception/IP), plus US DMCA 17 USC 1202 (removing copyright-management info to conceal infringement), plus the removal-as-such bans in China + India. When extending the README table, verify each date/article against the statute/bill text before committing, not against search summaries. ## Known limitations diff --git a/README.md b/README.md index 2575574..f489481 100644 --- a/README.md +++ b/README.md @@ -327,7 +327,7 @@ Tracked but not yet implemented: - **AVIF / HEIF EXIF/XMP inside the `meta` box**. 
Removal already strips top-level C2PA `uuid` / JUMBF `jumb` boxes and any AI-labelled top-level XMP `uuid` box, and non-ISOBMFF audio/video (WebM, MP3, WAV, FLAC, OGG) is stripped losslessly via ffmpeg. Still open: EXIF/XMP stored as *items inside the `meta` box* (typical for AVIF/HEIF stills) — needs `meta`-box surgery (iinf/iloc + mdat splice) or `exiftool` (a non-bundled binary dependency). - **Multi-signal contradiction reporting ("Integrity Clash")** — *shipped (v0.6.7)*. `identify` now surfaces contradictions between independent provenance signals (two different AI vendors named by separate stamps, or camera-capture C2PA credentials next to AI-generation markers) as `integrity_clashes` (shown in red in the table view and in `--json`), rather than collapsing to a single verdict. Inspired by [arXiv:2603.02378](https://arxiv.org/abs/2603.02378). - **More C2PA device signers**. Leica, Nikon, Google Pixel, Sony, and Truepic are mapped (each verified against a real signed file). Canon and Samsung Galaxy (AI-edit) are deferred until a real signed sample surfaces — no public direct-download C2PA file exists for them today (upload-to-verify / news-agency-licensed only). -- **C2PA detection window for streaming MP4**. Non-PNG detection scans the first 1 MB; a manifest placed after a large `mdat` in a streaming MP4 can be missed (front-placed manifests, the common case, are caught). +- **C2PA detection window for streaming MP4** — *shipped (v0.6.8)*. Detection no longer relies on a fixed first-MB read: for ISOBMFF containers it walks the top-level boxes (seeking past `mdat` by size) to find a C2PA / AIGC / IPTC manifest placed after the media data, so a streaming / non-faststart MP4 is caught. The remaining gap is EXIF/XMP stored as items *inside the `meta` box* (needs meta-box surgery or `exiftool`). 
- **Resemble PerTh audio detection** — evaluated, not feasible with the public API: `get_watermark()` returns a raw bit array with no presence/confidence flag, so watermarked vs. clean audio can't be reliably separated without Resemble's fixed payload or a confidence service. Same wall as the SynthID pixel detector. - **Video pipeline (`noai-video`)**: per-frame inpainting and tracking for Sora 2 dynamic logo, Veo 3.1 badge, Kling, Runway. Separate package, not folded into this repo. diff --git a/pyproject.toml b/pyproject.toml index 83dae0a..b121918 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "remove-ai-watermarks" -version = "0.6.7" +version = "0.6.8" description = "Remove visible and invisible AI watermarks from images (Gemini / Nano Banana, ChatGPT, Stable Diffusion)" readme = "README.md" requires-python = ">=3.10" diff --git a/src/remove_ai_watermarks/__init__.py b/src/remove_ai_watermarks/__init__.py index 16eef5c..189296f 100644 --- a/src/remove_ai_watermarks/__init__.py +++ b/src/remove_ai_watermarks/__init__.py @@ -1,3 +1,3 @@ """Remove-AI-Watermarks: Unified tool for removing visible and invisible AI watermarks.""" -__version__ = "0.6.7" +__version__ = "0.6.8" diff --git a/src/remove_ai_watermarks/identify.py b/src/remove_ai_watermarks/identify.py index 560b9d2..2ec6291 100644 --- a/src/remove_ai_watermarks/identify.py +++ b/src/remove_ai_watermarks/identify.py @@ -32,6 +32,7 @@ from remove_ai_watermarks.metadata import ( exif_generator, get_ai_metadata, iptc_ai_system, + scan_head, xai_signature, ) from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in @@ -332,8 +333,9 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b # First MB covers C2PA (PNG caBX, JPEG APP11, AVIF/HEIF/JXL uuid box) and # IPTC markers for the non-PNG path where extract_c2pa_info returns {}. 
- with open(image_path, "rb") as f: - head = f.read(_SCAN_BYTES) + # scan_head also seeks out late ISOBMFF provenance boxes (manifest after a + # large mdat in a streaming MP4) that a fixed first-MB read would miss. + head = scan_head(image_path, _SCAN_BYTES) signals: list[Signal] = [] watermarks: list[str] = [] diff --git a/src/remove_ai_watermarks/metadata.py b/src/remove_ai_watermarks/metadata.py index 3456809..546a8bd 100644 --- a/src/remove_ai_watermarks/metadata.py +++ b/src/remove_ai_watermarks/metadata.py @@ -132,6 +132,28 @@ def _is_ai_key(key: str) -> bool: return any(kw in key_lower for kw in AI_KEYWORDS) +def scan_head(image_path: Path, size: int = 1024 * 1024) -> bytes: + """First ``size`` bytes of the file, plus -- for ISOBMFF containers -- the + payloads of any provenance (``uuid`` / ``jumb``) boxes found beyond that + window by seeking past large boxes like ``mdat``. + + This is the shared input for every C2PA / AIGC / IPTC byte scan. The + ISOBMFF extension catches a manifest placed AFTER the media data in a + streaming / non-faststart MP4, which a fixed first-MB read would miss. For + non-ISOBMFF inputs it is exactly ``f.read(size)`` -- behavior-neutral. + """ + with open(image_path, "rb") as f: + head = f.read(size) + # Lazy import: isobmff imports this module's constants at top level. + from remove_ai_watermarks.noai import isobmff + + if isobmff.is_isobmff(head): + region = isobmff.scan_c2pa_region(image_path) + if region: + head += region + return head + + def has_ai_metadata(image_path: Path) -> bool: """Check if an image contains AI-generation metadata. @@ -167,9 +189,8 @@ def has_ai_metadata(image_path: Path) -> bool: pass # Binary scan covers C2PA (PNG caBX, JPEG APP11, AVIF/HEIF/JXL uuid boxes) - # and IPTC AI markers in XMP. Read only the first 512KB to bound memory. - with open(image_path, "rb") as f: - data = f.read(512 * 1024) + # and IPTC AI markers in XMP. First 512KB (plus late ISOBMFF provenance boxes). 
+ data = scan_head(image_path, 512 * 1024) if b"c2pa" in data.lower() or b"C2PA" in data: return True if C2PA_UUID in data: @@ -196,8 +217,7 @@ def aigc_label(image_path: Path) -> dict[str, str] | None: import json import re - with open(image_path, "rb") as f: - data = f.read(1024 * 1024) + data = scan_head(image_path) match = re.search(rb"(.*?)", data, re.DOTALL) if not match: return None @@ -219,8 +239,7 @@ def iptc_ai_system(image_path: Path) -> str | None: extractable, otherwise the literal ``"fields present"``. Container-agnostic raw-byte scan; handles both XMP element and attribute serializations. """ - with open(image_path, "rb") as f: - data = f.read(1024 * 1024) + data = scan_head(image_path) if not any(marker in data for marker in IPTC_AI_FIELD_MARKERS): return None match = re.search(rb"AISystemUsed[=:\s]*[\"'>]\s*([^<\"']{1,120})", data) @@ -259,8 +278,7 @@ def synthid_source(image_path: Path) -> str | None: # Non-PNG containers (JPEG APP11, WebP, AVIF/HEIF/JXL uuid box) keep the # C2PA manifest where the PNG parser can't reach it. Binary-scan for the # same signal: a C2PA manifest from a SynthID-using issuer on AI content. - with open(image_path, "rb") as f: - data = f.read(1024 * 1024) + data = scan_head(image_path) has_c2pa = b"c2pa" in data.lower() or C2PA_UUID in data # Matches both "trainedAlgorithmicMedia" and "compositeWithTrainedAlgorithmicMedia". ai_source = b"trainedAlgorithmicMedia" in data or b"TrainedAlgorithmicMedia" in data @@ -311,8 +329,7 @@ def exif_generator(image_path: Path) -> str | None: # XMP CreatorTool: text, container-agnostic (covers HEIF/JXL via raw scan). 
try: - with open(image_path, "rb") as f: - head = f.read(1024 * 1024) + head = scan_head(image_path) for match in re.finditer(rb"CreatorTool[>\"'=\s]{1,4}([^<\"']{1,80})", head): candidates.append(match.group(1).decode("latin1", "replace")) except Exception as exc: @@ -467,8 +484,7 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]: if "synthid_watermark" not in result and (vendor := synthid_source(image_path)): result.setdefault("synthid_watermark", synthid_verdict(vendor)) if "soft_binding" not in result: - with open(image_path, "rb") as f: - head = f.read(1024 * 1024) + head = scan_head(image_path) if vendors := soft_binding_vendors_in(head): result["soft_binding"] = ", ".join(vendors) @@ -507,10 +523,18 @@ def _strip_with_ffmpeg(source_path: Path, output_path: Path) -> Path: ) output_path.parent.mkdir(parents=True, exist_ok=True) cmd = [ - ffmpeg, "-y", "-loglevel", "error", - "-i", str(source_path), - "-map_metadata", "-1", "-map_chapters", "-1", - "-c", "copy", + ffmpeg, + "-y", + "-loglevel", + "error", + "-i", + str(source_path), + "-map_metadata", + "-1", + "-map_chapters", + "-1", + "-c", + "copy", str(output_path), ] result = subprocess.run(cmd, capture_output=True, text=True, check=False) # noqa: S603 diff --git a/src/remove_ai_watermarks/noai/isobmff.py b/src/remove_ai_watermarks/noai/isobmff.py index 0c27858..105847d 100644 --- a/src/remove_ai_watermarks/noai/isobmff.py +++ b/src/remove_ai_watermarks/noai/isobmff.py @@ -22,6 +22,7 @@ from typing import TYPE_CHECKING if TYPE_CHECKING: from collections.abc import Iterator + from pathlib import Path from remove_ai_watermarks.metadata import ( AIGC_MARKERS, @@ -78,6 +79,58 @@ def is_isobmff(data: bytes) -> bool: return len(data) >= 8 and data[4:8] == b"ftyp" +def scan_c2pa_region(path: str | Path, *, max_total: int = 4 * 1024 * 1024) -> bytes: + """Concatenated payloads of top-level ``uuid`` / ``jumb`` boxes in an ISOBMFF + file, found by seeking past other boxes (``mdat`` etc.) by size. 
+ + C2PA manifests and XMP packets (incl. AI labels) live in top-level ``uuid`` + boxes; JPEG-XL uses ``jumb``. In a streaming / non-faststart MP4 the manifest + sits AFTER a multi-megabyte ``mdat``, so a fixed first-MB read misses it. This + walks box headers (8-16 bytes each) and seeks past payloads it does not need, + so it never loads ``mdat`` into memory and works on multi-GB files. Returns + the relevant box payloads (capped at ``max_total``), or ``b""`` for a + non-ISOBMFF file or on any read error. + """ + collected = bytearray() + try: + with open(path, "rb") as f: + sniff = f.read(8) + if len(sniff) < 8 or sniff[4:8] != b"ftyp": + return b"" + f.seek(0, 2) + file_size = f.tell() + pos = 0 + while pos + 8 <= file_size and len(collected) < max_total: + f.seek(pos) + header = f.read(8) + if len(header) < 8: + break + size32 = struct.unpack(">I", header[:4])[0] + box_type = header[4:8] + payload_off = pos + 8 + if size32 == 1: + ext = f.read(8) + if len(ext) < 8: + break + size = struct.unpack(">Q", ext)[0] + payload_off = pos + 16 + elif size32 == 0: + size = file_size - pos + else: + size = size32 + if size < (payload_off - pos) or pos + size > file_size: + break + if box_type in C2PA_BOX_TYPES: + f.seek(payload_off) + to_read = min(pos + size - payload_off, max_total - len(collected)) + if to_read > 0: + collected += f.read(to_read) + pos += size + except OSError: + return b"" + return bytes(collected) + + def strip_c2pa_boxes(data: bytes) -> tuple[bytes, int]: """Return ``(cleaned_bytes, stripped_count)`` with AI-provenance boxes removed. 
diff --git a/tests/test_metadata.py b/tests/test_metadata.py index f2cf570..842e603 100644 --- a/tests/test_metadata.py +++ b/tests/test_metadata.py @@ -631,6 +631,11 @@ _MP4_FTYP = b"\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42isom" _MP4_MDAT = b"\x00\x00\x00\x10mdat" + b"videodat" +def _box(box_type: bytes, payload: bytes) -> bytes: + """Build a 32-bit-size ISOBMFF box: [size:4][type:4][payload].""" + return (8 + len(payload)).to_bytes(4, "big") + box_type + payload + + class TestVideoC2pa: """C2PA in MP4 (ISOBMFF) -- detect + strip, reusing the image box walker.""" @@ -654,6 +659,59 @@ class TestVideoC2pa: assert has_ai_metadata(out) is False +class TestLateProvenanceBox: + """A C2PA / provenance box placed AFTER a large mdat (streaming / non-faststart + MP4) must still be detected -- the fixed first-MB scan would miss it.""" + + def _mp4_late_c2pa(self, tmp_path: Path, gap: int = 1_500_000) -> Path: + from remove_ai_watermarks.metadata import C2PA_UUID + + big_mdat = _box(b"mdat", b"\x00" * gap) # > 1 MB pushes the manifest past the scan window + manifest = C2PA_UUID + b"OpenAI jumbf c2pa ... trainedAlgorithmicMedia ..." + p = tmp_path / "stream.mp4" + p.write_bytes(_MP4_FTYP + big_mdat + _box(b"uuid", manifest)) + return p + + def test_scan_c2pa_region_finds_late_box(self, tmp_path: Path): + from remove_ai_watermarks.metadata import C2PA_UUID + from remove_ai_watermarks.noai.isobmff import scan_c2pa_region + + region = scan_c2pa_region(self._mp4_late_c2pa(tmp_path)) + assert C2PA_UUID in region + assert b"trainedAlgorithmicMedia" in region + + def test_fixed_window_would_have_missed_it(self, tmp_path: Path): + # Documents the regression the box walk fixes: the manifest is beyond 1 MB. 
+ from remove_ai_watermarks.metadata import C2PA_UUID + + p = self._mp4_late_c2pa(tmp_path) + assert C2PA_UUID not in p.read_bytes()[: 1024 * 1024] + + def test_scan_head_includes_late_box(self, tmp_path: Path): + from remove_ai_watermarks.metadata import C2PA_UUID, scan_head + + assert C2PA_UUID in scan_head(self._mp4_late_c2pa(tmp_path)) + + def test_has_ai_metadata_detects_late_manifest(self, tmp_path: Path): + assert has_ai_metadata(self._mp4_late_c2pa(tmp_path)) is True + + def test_scan_c2pa_region_non_isobmff_is_empty(self, tmp_path: Path): + from remove_ai_watermarks.noai.isobmff import scan_c2pa_region + + p = tmp_path / "not.bin" + p.write_bytes(b"\x89PNG\r\n\x1a\n not an isobmff file") + assert scan_c2pa_region(p) == b"" + + def test_front_placed_manifest_still_detected(self, tmp_path: Path): + # Regression: a faststart MP4 (manifest before mdat) is unaffected. + from remove_ai_watermarks.metadata import C2PA_UUID + + manifest = C2PA_UUID + b"OpenAI ... trainedAlgorithmicMedia ..." + p = tmp_path / "front.mp4" + p.write_bytes(_MP4_FTYP + _box(b"uuid", manifest) + _box(b"mdat", b"\x00" * 100)) + assert has_ai_metadata(p) is True + + class TestIsobmffMetadataRemoval: """Container-level AI-provenance stripping across ISOBMFF image/video/audio.""" @@ -718,9 +776,17 @@ class TestFfmpegMetadataStrip: def _wav_with_tag(self, path: Path, tag: str = "Suno AI") -> None: subprocess.run( # noqa: S603 [ - shutil.which("ffmpeg"), "-y", "-loglevel", "error", - "-f", "lavfi", "-i", "sine=frequency=440:duration=0.1", - "-metadata", f"title={tag}", str(path), + shutil.which("ffmpeg"), + "-y", + "-loglevel", + "error", + "-f", + "lavfi", + "-i", + "sine=frequency=440:duration=0.1", + "-metadata", + f"title={tag}", + str(path), ], check=True, ) diff --git a/uv.lock b/uv.lock index fe8bd8d..0b13897 100644 --- a/uv.lock +++ b/uv.lock @@ -2865,7 +2865,7 @@ wheels = [ [[package]] name = "remove-ai-watermarks" -version = "0.6.7" +version = "0.6.8" source = { editable = "." 
} dependencies = [ { name = "click" },
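The box-size handling that `scan_c2pa_region` relies on (a 32-bit size, `size == 1` meaning a 64-bit largesize follows, `size == 0` meaning "runs to end of file") can be illustrated on in-memory bytes. This is a minimal sketch, not the package's implementation: `make_box`, `walk_top_level`, `provenance_payloads`, and `PROVENANCE_TYPES` are hypothetical names introduced here, `PROVENANCE_TYPES` is an assumed stand-in for `C2PA_BOX_TYPES`, and the real function seeks within a file rather than slicing a buffer. It also exercises the 64-bit largesize branch, which the 32-bit `_box` test helper does not build.

```python
import struct

# Assumed stand-in for the package's C2PA_BOX_TYPES (hypothetical here).
PROVENANCE_TYPES = {b"uuid", b"jumb"}


def make_box(box_type: bytes, payload: bytes, *, large: bool = False) -> bytes:
    """Build one ISOBMFF box; large=True forces the 64-bit largesize form."""
    if large:
        # size32 == 1 signals that the real size follows as a 64-bit integer,
        # so the header grows from 8 to 16 bytes.
        return struct.pack(">I", 1) + box_type + struct.pack(">Q", 16 + len(payload)) + payload
    return struct.pack(">I", 8 + len(payload)) + box_type + payload


def walk_top_level(data: bytes):
    """Yield (box_type, payload_offset, payload_length) for each top-level box."""
    pos = 0
    while pos + 8 <= len(data):
        size32, box_type = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if size32 == 1:
            if pos + 16 > len(data):
                break  # truncated largesize field
            size = struct.unpack(">Q", data[pos + 8:pos + 16])[0]
            header = 16
        elif size32 == 0:
            size = len(data) - pos  # box extends to the end of the data
        else:
            size = size32
        # A size smaller than its own header, or one that overruns the data,
        # means the structure is corrupt: stop rather than loop or over-read.
        if size < header or pos + size > len(data):
            break
        yield box_type, pos + header, size - header
        pos += size


def provenance_payloads(data: bytes) -> bytes:
    """Concatenate payloads of provenance boxes, skipping everything else."""
    return b"".join(
        data[off:off + length]
        for box_type, off, length in walk_top_level(data)
        if box_type in PROVENANCE_TYPES
    )


# Synthetic layout: ftyp, then a largesize mdat, then a late uuid box.
ftyp = make_box(b"ftyp", b"mp42\x00\x00\x00\x00mp42isom")
mdat = make_box(b"mdat", b"\x00" * 64, large=True)
late = make_box(b"uuid", b"late-manifest")
sample = ftyp + mdat + late
```

Calling `provenance_payloads(sample)` returns only the `uuid` payload, regardless of how large the preceding `mdat` is. The guard on `size < header` is what keeps the loop from stalling on a malformed zero-progress box, which mirrors the `size < (payload_off - pos)` check in the diff.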