feat(metadata): detect C2PA/AIGC/IPTC manifests after a large mdat in MP4 (v0.6.8)

Provenance detection no longer relies on a fixed first-MB read. In a streaming / non-faststart MP4 the C2PA manifest sits AFTER a multi-megabyte mdat, beyond the 1 MB scan window, so it was missed.

- isobmff.scan_c2pa_region(path): a file-seeking top-level box walker that returns the payloads of uuid/jumb (provenance) boxes, seeking past mdat by size without reading it -- works on multi-GB files. Returns b"" for non-ISOBMFF or on read error. Mirrors the box-size encoding of the existing in-memory _iter_top_level_boxes (largesize / size==0).
- metadata.scan_head(path, size): the shared input for every C2PA/AIGC/IPTC byte scan -- the first `size` bytes plus, for ISOBMFF, the late provenance-box payloads. Behavior-neutral (f.read(size)) for non-ISOBMFF inputs.
- Routed all six metadata scan sites (has_ai_metadata, aigc_label, iptc_ai_system, synthid_source, exif_generator XMP, get_ai_metadata soft-binding) and identify's head read through scan_head.

6 new tests: late box found by scan_c2pa_region / scan_head, the fixed window provably misses it, non-ISOBMFF -> b"", front-placed (faststart) regression.

The remaining gap stays documented: EXIF/XMP stored as items inside the meta box (AVIF/HEIF stills) still needs meta-box surgery or exiftool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-20 06:37:20 +02:00 · 2026-05-27 13:42:29 -07:00
parent 18160fe269
commit 31f0a82906
9 changed files with 173 additions and 28 deletions
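
The box-size encoding mentioned in the commit message (32-bit size, `largesize` when the size field is 1, "to end of file" when it is 0) can be illustrated with a small standalone sketch. It is not the code from this commit: `box`, `large_box`, `list_top_level_boxes` and `MARKER` are hypothetical names made up for the example, and it only lists top-level boxes rather than collecting provenance payloads. It builds a synthetic non-faststart file, walks the headers with seeks only, and checks whether a fixed 1 MB read would have seen the trailing `uuid` payload.

```python
import os
import struct
import tempfile


def box(box_type: bytes, payload: bytes) -> bytes:
    """32-bit-size box: [size:4][type:4][payload]."""
    return struct.pack(">I", 8 + len(payload)) + box_type + payload


def large_box(box_type: bytes, payload: bytes) -> bytes:
    """64-bit box: size field is 1, the real size follows the type (largesize)."""
    return struct.pack(">I", 1) + box_type + struct.pack(">Q", 16 + len(payload)) + payload


def list_top_level_boxes(path):
    """Return (type, offset, size) for each top-level box, reading headers only."""
    boxes = []
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        pos = 0
        while pos + 8 <= file_size:
            f.seek(pos)
            header = f.read(8)
            if len(header) < 8:
                break
            size32, box_type = struct.unpack(">I4s", header)
            header_len = 8
            if size32 == 1:  # largesize: 64-bit size in the next 8 bytes
                ext = f.read(8)
                if len(ext) < 8:
                    break
                size = struct.unpack(">Q", ext)[0]
                header_len = 16
            elif size32 == 0:  # box extends to the end of the file
                size = file_size - pos
            else:
                size = size32
            if size < header_len or pos + size > file_size:
                break  # malformed or truncated box: stop walking
            boxes.append((box_type, pos, size))
            pos += size  # skip the payload without reading it
    return boxes


# Hypothetical stand-in for a manifest payload; not a real C2PA structure.
MARKER = b"hypothetical-manifest-marker"
ftyp = box(b"ftyp", b"mp42" + b"\x00" * 4 + b"mp42isom")
mdat = large_box(b"mdat", b"\x00" * 1_500_000)  # pushes the uuid box past 1 MB
late_uuid = box(b"uuid", MARKER)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stream.mp4")
    with open(path, "wb") as f:
        f.write(ftyp + mdat + late_uuid)
    boxes = list_top_level_boxes(path)
    with open(path, "rb") as f:
        window = f.read(1024 * 1024)

box_types = [t for t, _, _ in boxes]
marker_in_window = MARKER in window
print(box_types, marker_in_window)
```

The walk costs one seek and an 8 to 16 byte read per top-level box, which is why the size of `mdat` does not matter to it, while the fixed window never reaches the trailing box.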
@@ -327,7 +327,7 @@ Tracked but not yet implemented:
 - **AVIF / HEIF EXIF/XMP inside the `meta` box**. Removal already strips top-level C2PA `uuid` / JUMBF `jumb` boxes and any AI-labelled top-level XMP `uuid` box, and non-ISOBMFF audio/video (WebM, MP3, WAV, FLAC, OGG) is stripped losslessly via ffmpeg. Still open: EXIF/XMP stored as *items inside the `meta` box* (typical for AVIF/HEIF stills) — needs `meta`-box surgery (iinf/iloc + mdat splice) or `exiftool` (a non-bundled binary dependency).
 - **Multi-signal contradiction reporting ("Integrity Clash")** — *shipped (v0.6.7)*. `identify` now surfaces contradictions between independent provenance signals (two different AI vendors named by separate stamps, or camera-capture C2PA credentials next to AI-generation markers) as `integrity_clashes` (shown in red in the table view and in `--json`), rather than collapsing to a single verdict. Inspired by [arXiv:2603.02378](https://arxiv.org/abs/2603.02378).
 - **More C2PA device signers**. Leica, Nikon, Google Pixel, Sony, and Truepic are mapped (each verified against a real signed file). Canon and Samsung Galaxy (AI-edit) are deferred until a real signed sample surfaces — no public direct-download C2PA file exists for them today (upload-to-verify / news-agency-licensed only).
- **C2PA detection window for streaming MP4**. Non-PNG detection scans the first 1 MB; a manifest placed after a large `mdat` in a streaming MP4 can be missed (front-placed manifests, the common case, are caught).
+- **C2PA detection window for streaming MP4** — *shipped (v0.6.8)*. Detection no longer relies on a fixed first-MB read: for ISOBMFF containers it walks the top-level boxes (seeking past `mdat` by size) to find a C2PA / AIGC / IPTC manifest placed after the media data, so a streaming / non-faststart MP4 is caught. The remaining gap is EXIF/XMP stored as items *inside the `meta` box* (needs meta-box surgery or `exiftool`).
 - **Resemble PerTh audio detection** — evaluated, not feasible with the public API: `get_watermark()` returns a raw bit array with no presence/confidence flag, so watermarked vs. clean audio can't be reliably separated without Resemble's fixed payload or a confidence service. Same wall as the SynthID pixel detector.
 - **Video pipeline (`noai-video`)**: per-frame inpainting and tracking for Sora 2 dynamic logo, Veo 3.1 badge, Kling, Runway. Separate package, not folded into this repo.

@@ -1,6 +1,6 @@
 [project]
 name = "remove-ai-watermarks"
-version = "0.6.7"
+version = "0.6.8"
 description = "Remove visible and invisible AI watermarks from images (Gemini / Nano Banana, ChatGPT, Stable Diffusion)"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -1,3 +1,3 @@
 """Remove-AI-Watermarks: Unified tool for removing visible and invisible AI watermarks."""

-__version__ = "0.6.7"
+__version__ = "0.6.8"
@@ -32,6 +32,7 @@ from remove_ai_watermarks.metadata import (
    exif_generator,
    get_ai_metadata,
    iptc_ai_system,
+    scan_head,
    xai_signature,
 )
 from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in
@@ -332,8 +333,9 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b

    # First MB covers C2PA (PNG caBX, JPEG APP11, AVIF/HEIF/JXL uuid box) and
    # IPTC markers for the non-PNG path where extract_c2pa_info returns {}.
-    with open(image_path, "rb") as f:
-        head = f.read(_SCAN_BYTES)
+    # scan_head also seeks out late ISOBMFF provenance boxes (manifest after a
+    # large mdat in a streaming MP4) that a fixed first-MB read would miss.
+    head = scan_head(image_path, _SCAN_BYTES)

    signals: list[Signal] = []
    watermarks: list[str] = []
@@ -132,6 +132,28 @@ def _is_ai_key(key: str) -> bool:
    return any(kw in key_lower for kw in AI_KEYWORDS)


+def scan_head(image_path: Path, size: int = 1024 * 1024) -> bytes:
+    """First ``size`` bytes of the file, plus -- for ISOBMFF containers -- the
+    payloads of top-level provenance (``uuid`` / ``jumb``) boxes located by
+    seeking past large boxes like ``mdat``, including those beyond that window.
+
+    This is the shared input for every C2PA / AIGC / IPTC byte scan. The
+    ISOBMFF extension catches a manifest placed AFTER the media data in a
+    streaming / non-faststart MP4, which a fixed first-MB read would miss. For
+    non-ISOBMFF inputs it is exactly ``f.read(size)`` -- behavior-neutral.
+    """
+    with open(image_path, "rb") as f:
+        head = f.read(size)
+    # Lazy import: isobmff imports this module's constants at top level.
+    from remove_ai_watermarks.noai import isobmff
+
+    if isobmff.is_isobmff(head):
+        region = isobmff.scan_c2pa_region(image_path)
+        if region:
+            head += region
+    return head
+
+
 def has_ai_metadata(image_path: Path) -> bool:
    """Check if an image contains AI-generation metadata.

@@ -167,9 +189,8 @@ def has_ai_metadata(image_path: Path) -> bool:
        pass

    # Binary scan covers C2PA (PNG caBX, JPEG APP11, AVIF/HEIF/JXL uuid boxes)
-    # and IPTC AI markers in XMP. Read only the first 512KB to bound memory.
-    with open(image_path, "rb") as f:
-        data = f.read(512 * 1024)
+    # and IPTC AI markers in XMP. First 512KB (plus late ISOBMFF provenance boxes).
+    data = scan_head(image_path, 512 * 1024)
    if b"c2pa" in data.lower() or b"C2PA" in data:
        return True
    if C2PA_UUID in data:
@@ -196,8 +217,7 @@ def aigc_label(image_path: Path) -> dict[str, str] | None:
    import json
    import re

-    with open(image_path, "rb") as f:
-        data = f.read(1024 * 1024)
+    data = scan_head(image_path)
    match = re.search(rb"<TC260:AIGC>(.*?)</TC260:AIGC>", data, re.DOTALL)
    if not match:
        return None
@@ -219,8 +239,7 @@ def iptc_ai_system(image_path: Path) -> str | None:
    extractable, otherwise the literal ``"fields present"``. Container-agnostic
    raw-byte scan; handles both XMP element and attribute serializations.
    """
-    with open(image_path, "rb") as f:
-        data = f.read(1024 * 1024)
+    data = scan_head(image_path)
    if not any(marker in data for marker in IPTC_AI_FIELD_MARKERS):
        return None
    match = re.search(rb"AISystemUsed[=:\s]*[\"'>]\s*([^<\"']{1,120})", data)
@@ -259,8 +278,7 @@ def synthid_source(image_path: Path) -> str | None:
    # Non-PNG containers (JPEG APP11, WebP, AVIF/HEIF/JXL uuid box) keep the
    # C2PA manifest where the PNG parser can't reach it. Binary-scan for the
    # same signal: a C2PA manifest from a SynthID-using issuer on AI content.
-    with open(image_path, "rb") as f:
-        data = f.read(1024 * 1024)
+    data = scan_head(image_path)
    has_c2pa = b"c2pa" in data.lower() or C2PA_UUID in data
    # Matches both "trainedAlgorithmicMedia" and "compositeWithTrainedAlgorithmicMedia".
    ai_source = b"trainedAlgorithmicMedia" in data or b"TrainedAlgorithmicMedia" in data
@@ -311,8 +329,7 @@ def exif_generator(image_path: Path) -> str | None:

    # XMP CreatorTool: text, container-agnostic (covers HEIF/JXL via raw scan).
    try:
-        with open(image_path, "rb") as f:
-            head = f.read(1024 * 1024)
+        head = scan_head(image_path)
        for match in re.finditer(rb"CreatorTool[>\"'=\s]{1,4}([^<\"']{1,80})", head):
            candidates.append(match.group(1).decode("latin1", "replace"))
    except Exception as exc:
@@ -467,8 +484,7 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
    if "synthid_watermark" not in result and (vendor := synthid_source(image_path)):
        result.setdefault("synthid_watermark", synthid_verdict(vendor))
    if "soft_binding" not in result:
-        with open(image_path, "rb") as f:
-            head = f.read(1024 * 1024)
+        head = scan_head(image_path)
        if vendors := soft_binding_vendors_in(head):
            result["soft_binding"] = ", ".join(vendors)

@@ -507,10 +523,18 @@ def _strip_with_ffmpeg(source_path: Path, output_path: Path) -> Path:
        )
    output_path.parent.mkdir(parents=True, exist_ok=True)
    cmd = [
-        ffmpeg, "-y", "-loglevel", "error",
-        "-i", str(source_path),
-        "-map_metadata", "-1", "-map_chapters", "-1",
-        "-c", "copy",
+        ffmpeg,
+        "-y",
+        "-loglevel",
+        "error",
+        "-i",
+        str(source_path),
+        "-map_metadata",
+        "-1",
+        "-map_chapters",
+        "-1",
+        "-c",
+        "copy",
        str(output_path),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)  # noqa: S603
@@ -22,6 +22,7 @@ from typing import TYPE_CHECKING

 if TYPE_CHECKING:
    from collections.abc import Iterator
+    from pathlib import Path

 from remove_ai_watermarks.metadata import (
    AIGC_MARKERS,
@@ -78,6 +79,58 @@ def is_isobmff(data: bytes) -> bool:
    return len(data) >= 8 and data[4:8] == b"ftyp"


+def scan_c2pa_region(path: str | Path, *, max_total: int = 4 * 1024 * 1024) -> bytes:
+    """Concatenated payloads of top-level ``uuid`` / ``jumb`` boxes in an ISOBMFF
+    file, found by seeking past other boxes (``mdat`` etc.) by size.
+
+    C2PA manifests and XMP packets (incl. AI labels) live in top-level ``uuid``
+    boxes; JPEG-XL uses ``jumb``. In a streaming / non-faststart MP4 the manifest
+    sits AFTER a multi-megabyte ``mdat``, so a fixed first-MB read misses it. This
+    walks box headers (8-16 bytes each) and seeks past payloads it does not need,
+    so it never loads ``mdat`` into memory and works on multi-GB files. Returns
+    the relevant box payloads (capped at ``max_total``), or ``b""`` for a
+    non-ISOBMFF file or on any read error.
+    """
+    collected = bytearray()
+    try:
+        with open(path, "rb") as f:
+            sniff = f.read(8)
+            if len(sniff) < 8 or sniff[4:8] != b"ftyp":
+                return b""
+            f.seek(0, 2)
+            file_size = f.tell()
+            pos = 0
+            while pos + 8 <= file_size and len(collected) < max_total:
+                f.seek(pos)
+                header = f.read(8)
+                if len(header) < 8:
+                    break
+                size32 = struct.unpack(">I", header[:4])[0]
+                box_type = header[4:8]
+                payload_off = pos + 8
+                if size32 == 1:
+                    ext = f.read(8)
+                    if len(ext) < 8:
+                        break
+                    size = struct.unpack(">Q", ext)[0]
+                    payload_off = pos + 16
+                elif size32 == 0:
+                    size = file_size - pos
+                else:
+                    size = size32
+                if size < (payload_off - pos) or pos + size > file_size:
+                    break
+                if box_type in C2PA_BOX_TYPES:
+                    f.seek(payload_off)
+                    to_read = min(pos + size - payload_off, max_total - len(collected))
+                    if to_read > 0:
+                        collected += f.read(to_read)
+                pos += size
+    except OSError:
+        return b""
+    return bytes(collected)
+
+
 def strip_c2pa_boxes(data: bytes) -> tuple[bytes, int]:
    """Return ``(cleaned_bytes, stripped_count)`` with AI-provenance boxes removed.

@@ -631,6 +631,11 @@ _MP4_FTYP = b"\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42isom"
 _MP4_MDAT = b"\x00\x00\x00\x10mdat" + b"videodat"


+def _box(box_type: bytes, payload: bytes) -> bytes:
+    """Build a 32-bit-size ISOBMFF box: [size:4][type:4][payload]."""
+    return (8 + len(payload)).to_bytes(4, "big") + box_type + payload
+
+
 class TestVideoC2pa:
    """C2PA in MP4 (ISOBMFF) -- detect + strip, reusing the image box walker."""

@@ -654,6 +659,59 @@ class TestVideoC2pa:
        assert has_ai_metadata(out) is False


+class TestLateProvenanceBox:
+    """A C2PA / provenance box placed AFTER a large mdat (streaming / non-faststart
+    MP4) must still be detected -- the fixed first-MB scan would miss it."""
+
+    def _mp4_late_c2pa(self, tmp_path: Path, gap: int = 1_500_000) -> Path:
+        from remove_ai_watermarks.metadata import C2PA_UUID
+
+        big_mdat = _box(b"mdat", b"\x00" * gap)  # > 1 MB pushes the manifest past the scan window
+        manifest = C2PA_UUID + b"OpenAI jumbf c2pa ... trainedAlgorithmicMedia ..."
+        p = tmp_path / "stream.mp4"
+        p.write_bytes(_MP4_FTYP + big_mdat + _box(b"uuid", manifest))
+        return p
+
+    def test_scan_c2pa_region_finds_late_box(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import C2PA_UUID
+        from remove_ai_watermarks.noai.isobmff import scan_c2pa_region
+
+        region = scan_c2pa_region(self._mp4_late_c2pa(tmp_path))
+        assert C2PA_UUID in region
+        assert b"trainedAlgorithmicMedia" in region
+
+    def test_fixed_window_would_have_missed_it(self, tmp_path: Path):
+        # Documents the regression the box walk fixes: the manifest is beyond 1 MB.
+        from remove_ai_watermarks.metadata import C2PA_UUID
+
+        p = self._mp4_late_c2pa(tmp_path)
+        assert C2PA_UUID not in p.read_bytes()[: 1024 * 1024]
+
+    def test_scan_head_includes_late_box(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import C2PA_UUID, scan_head
+
+        assert C2PA_UUID in scan_head(self._mp4_late_c2pa(tmp_path))
+
+    def test_has_ai_metadata_detects_late_manifest(self, tmp_path: Path):
+        assert has_ai_metadata(self._mp4_late_c2pa(tmp_path)) is True
+
+    def test_scan_c2pa_region_non_isobmff_is_empty(self, tmp_path: Path):
+        from remove_ai_watermarks.noai.isobmff import scan_c2pa_region
+
+        p = tmp_path / "not.bin"
+        p.write_bytes(b"\x89PNG\r\n\x1a\n not an isobmff file")
+        assert scan_c2pa_region(p) == b""
+
+    def test_front_placed_manifest_still_detected(self, tmp_path: Path):
+        # Regression: a faststart MP4 (manifest before mdat) is unaffected.
+        from remove_ai_watermarks.metadata import C2PA_UUID
+
+        manifest = C2PA_UUID + b"OpenAI ... trainedAlgorithmicMedia ..."
+        p = tmp_path / "front.mp4"
+        p.write_bytes(_MP4_FTYP + _box(b"uuid", manifest) + _box(b"mdat", b"\x00" * 100))
+        assert has_ai_metadata(p) is True
+
+
 class TestIsobmffMetadataRemoval:
    """Container-level AI-provenance stripping across ISOBMFF image/video/audio."""

@@ -718,9 +776,17 @@ class TestFfmpegMetadataStrip:
    def _wav_with_tag(self, path: Path, tag: str = "Suno AI") -> None:
        subprocess.run(  # noqa: S603
            [
-                shutil.which("ffmpeg"), "-y", "-loglevel", "error",
-                "-f", "lavfi", "-i", "sine=frequency=440:duration=0.1",
-                "-metadata", f"title={tag}", str(path),
+                shutil.which("ffmpeg"),
+                "-y",
+                "-loglevel",
+                "error",
+                "-f",
+                "lavfi",
+                "-i",
+                "sine=frequency=440:duration=0.1",
+                "-metadata",
+                f"title={tag}",
+                str(path),
            ],
            check=True,
        )
@@ -2865,7 +2865,7 @@ wheels = [

 [[package]]
 name = "remove-ai-watermarks"
-version = "0.6.7"
+version = "0.6.8"
 source = { editable = "." }
 dependencies = [
    { name = "click" },