feat(metadata): detect C2PA/AIGC/IPTC manifests after a large mdat in MP4 (v0.6.8)

Provenance detection no longer relies on a fixed first-MB read. In a streaming /
non-faststart MP4 the C2PA manifest sits AFTER a multi-megabyte mdat, beyond the
1 MB scan window, so it was missed.

- isobmff.scan_c2pa_region(path): a file-seeking top-level box walker that
  returns the payloads of uuid/jumb (provenance) boxes, seeking past mdat by
  size without reading it -- works on multi-GB files. Returns b"" for
  non-ISOBMFF or on read error. Mirrors the box-size encoding of the existing
  in-memory _iter_top_level_boxes (largesize / size==0).
- metadata.scan_head(path, size): the shared input for every C2PA/AIGC/IPTC
  byte scan -- first `size` bytes plus, for ISOBMFF, the late provenance-box
  payloads. Behavior-neutral (f.read(size)) for non-ISOBMFF inputs.
- Routed all six metadata scan sites (has_ai_metadata, aigc_label,
  iptc_ai_system, synthid_source, exif_generator XMP, get_ai_metadata
  soft-binding) and identify's head read through scan_head.
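
The box-size rules the walker mirrors (plain 32-bit size, size==1 with a
64-bit largesize, size==0 meaning "to end of file") can be sketched as below.
This is an illustrative, self-contained sketch, not the code in this commit;
`read_box_header` is a hypothetical helper name.

```python
import io
import struct


def read_box_header(f, pos, file_size):
    """Decode one ISOBMFF box header at ``pos`` (hypothetical helper).

    Returns (box_type, payload_offset, box_size), or None if the header is
    truncated or the declared size is inconsistent with the file.
    """
    f.seek(pos)
    header = f.read(8)
    if len(header) < 8:
        return None
    size32 = struct.unpack(">I", header[:4])[0]
    box_type = header[4:8]
    payload_off = pos + 8
    if size32 == 1:
        # A 64-bit "largesize" follows the type field.
        ext = f.read(8)
        if len(ext) < 8:
            return None
        size = struct.unpack(">Q", ext)[0]
        payload_off = pos + 16
    elif size32 == 0:
        # The box extends to the end of the file.
        size = file_size - pos
    else:
        size = size32
    # Reject sizes smaller than the header or running past the file.
    if size < payload_off - pos or pos + size > file_size:
        return None
    return box_type, payload_off, size


# Three encodings of a box carrying the 4-byte payload b"data".
plain = struct.pack(">I", 12) + b"uuid" + b"data"
large = struct.pack(">I", 1) + b"uuid" + struct.pack(">Q", 20) + b"data"
to_eof = struct.pack(">I", 0) + b"uuid" + b"data"

results = [read_box_header(io.BytesIO(b), 0, len(b)) for b in (plain, large, to_eof)]
```

Only the header is read in each case; a caller can then seek to `pos + size`
to skip a payload it does not need, which is what keeps a large mdat out of
memory.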

6 new tests: late box found by scan_c2pa_region / scan_head, the fixed window
provably misses it, non-ISOBMFF -> b"", front-placed (faststart) regression.
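
The scenario these tests cover can be reproduced without the package. The
sketch below assumes 32-bit box sizes only and uses a placeholder 16-byte
identifier (`FAKE_UUID` is not the real C2PA box UUID); `walk_uuid_payloads`
is a hypothetical stand-in for the walker, not the function added here.

```python
import io
import struct

WINDOW = 1024 * 1024
FAKE_UUID = b"\x11" * 16  # placeholder, NOT the real C2PA box UUID


def box(box_type, payload):
    """Build a 32-bit-size box: [size:4][type:4][payload]."""
    return struct.pack(">I", 8 + len(payload)) + box_type + payload


def walk_uuid_payloads(f, file_size):
    """Collect top-level ``uuid`` payloads, seeking past every other box."""
    out = bytearray()
    pos = 0
    while pos + 8 <= file_size:
        f.seek(pos)
        size, box_type = struct.unpack(">I4s", f.read(8))
        if size < 8 or pos + size > file_size:
            break  # this sketch handles plain 32-bit sizes only
        if box_type == b"uuid":
            out += f.read(size - 8)
        pos += size  # skip the payload (e.g. mdat) without reading it
    return bytes(out)


ftyp = box(b"ftyp", b"mp42\x00\x00\x00\x00mp42isom")
data = ftyp + box(b"mdat", b"\x00" * 1_500_000) + box(b"uuid", FAKE_UUID + b"manifest")

in_fixed_window = FAKE_UUID in data[:WINDOW]
found_by_walk = FAKE_UUID in walk_uuid_payloads(io.BytesIO(data), len(data))
```

With a 1.5 MB mdat ahead of it, the marker falls outside the first-MB slice
but is still reached by the header walk.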

The remaining gap stays documented: EXIF/XMP stored as items inside the meta
box (AVIF/HEIF stills) still needs meta-box surgery or exiftool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test-user
2026-05-27 13:42:29 -07:00
parent 18160fe269
commit 31f0a82906
9 changed files with 173 additions and 28 deletions
+1 -1
@@ -327,7 +327,7 @@ Tracked but not yet implemented:
- **AVIF / HEIF EXIF/XMP inside the `meta` box**. Removal already strips top-level C2PA `uuid` / JUMBF `jumb` boxes and any AI-labelled top-level XMP `uuid` box, and non-ISOBMFF audio/video (WebM, MP3, WAV, FLAC, OGG) is stripped losslessly via ffmpeg. Still open: EXIF/XMP stored as *items inside the `meta` box* (typical for AVIF/HEIF stills) — needs `meta`-box surgery (iinf/iloc + mdat splice) or `exiftool` (a non-bundled binary dependency).
- **Multi-signal contradiction reporting ("Integrity Clash")** — *shipped (v0.6.7)*. `identify` now surfaces contradictions between independent provenance signals (two different AI vendors named by separate stamps, or camera-capture C2PA credentials next to AI-generation markers) as `integrity_clashes` (shown in red in the table view and in `--json`), rather than collapsing to a single verdict. Inspired by [arXiv:2603.02378](https://arxiv.org/abs/2603.02378).
- **More C2PA device signers**. Leica, Nikon, Google Pixel, Sony, and Truepic are mapped (each verified against a real signed file). Canon and Samsung Galaxy (AI-edit) are deferred until a real signed sample surfaces — no public direct-download C2PA file exists for them today (upload-to-verify / news-agency-licensed only).
- **C2PA detection window for streaming MP4**. Non-PNG detection scans the first 1 MB; a manifest placed after a large `mdat` in a streaming MP4 can be missed (front-placed manifests, the common case, are caught).
- **C2PA detection window for streaming MP4**: *shipped (v0.6.8)*. Detection no longer relies on a fixed first-MB read: for ISOBMFF containers it walks the top-level boxes (seeking past `mdat` by size) to find a C2PA / AIGC / IPTC manifest placed after the media data, so a manifest in a streaming / non-faststart MP4 is now caught. The remaining gap is EXIF/XMP stored as items *inside the `meta` box* (needs `meta`-box surgery or `exiftool`).
- **Resemble PerTh audio detection** — evaluated, not feasible with the public API: `get_watermark()` returns a raw bit array with no presence/confidence flag, so watermarked vs. clean audio can't be reliably separated without Resemble's fixed payload or a confidence service. Same wall as the SynthID pixel detector.
- **Video pipeline (`noai-video`)**: per-frame inpainting and tracking for Sora 2 dynamic logo, Veo 3.1 badge, Kling, Runway. Separate package, not folded into this repo.
+1 -1
@@ -1,6 +1,6 @@
[project]
name = "remove-ai-watermarks"
version = "0.6.7"
version = "0.6.8"
description = "Remove visible and invisible AI watermarks from images (Gemini / Nano Banana, ChatGPT, Stable Diffusion)"
readme = "README.md"
requires-python = ">=3.10"
+1 -1
@@ -1,3 +1,3 @@
"""Remove-AI-Watermarks: Unified tool for removing visible and invisible AI watermarks."""
__version__ = "0.6.7"
__version__ = "0.6.8"
+4 -2
@@ -32,6 +32,7 @@ from remove_ai_watermarks.metadata import (
exif_generator,
get_ai_metadata,
iptc_ai_system,
scan_head,
xai_signature,
)
from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in
@@ -332,8 +333,9 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
# First MB covers C2PA (PNG caBX, JPEG APP11, AVIF/HEIF/JXL uuid box) and
# IPTC markers for the non-PNG path where extract_c2pa_info returns {}.
with open(image_path, "rb") as f:
head = f.read(_SCAN_BYTES)
# scan_head also seeks out late ISOBMFF provenance boxes (manifest after a
# large mdat in a streaming MP4) that a fixed first-MB read would miss.
head = scan_head(image_path, _SCAN_BYTES)
signals: list[Signal] = []
watermarks: list[str] = []
+41 -17
@@ -132,6 +132,28 @@ def _is_ai_key(key: str) -> bool:
return any(kw in key_lower for kw in AI_KEYWORDS)
def scan_head(image_path: Path, size: int = 1024 * 1024) -> bytes:
"""First ``size`` bytes of the file, plus -- for ISOBMFF containers -- the
payloads of any provenance (``uuid`` / ``jumb``) boxes found beyond that
window by seeking past large boxes like ``mdat``.
This is the shared input for every C2PA / AIGC / IPTC byte scan. The
ISOBMFF extension catches a manifest placed AFTER the media data in a
streaming / non-faststart MP4, which a fixed first-MB read would miss. For
non-ISOBMFF inputs it is exactly ``f.read(size)`` -- behavior-neutral.
"""
with open(image_path, "rb") as f:
head = f.read(size)
# Lazy import: isobmff imports this module's constants at top level.
from remove_ai_watermarks.noai import isobmff
if isobmff.is_isobmff(head):
region = isobmff.scan_c2pa_region(image_path)
if region:
head += region
return head
def has_ai_metadata(image_path: Path) -> bool:
"""Check if an image contains AI-generation metadata.
@@ -167,9 +189,8 @@ def has_ai_metadata(image_path: Path) -> bool:
pass
# Binary scan covers C2PA (PNG caBX, JPEG APP11, AVIF/HEIF/JXL uuid boxes)
# and IPTC AI markers in XMP. Read only the first 512KB to bound memory.
with open(image_path, "rb") as f:
data = f.read(512 * 1024)
# and IPTC AI markers in XMP. First 512KB (plus late ISOBMFF provenance boxes).
data = scan_head(image_path, 512 * 1024)
if b"c2pa" in data.lower() or b"C2PA" in data:
return True
if C2PA_UUID in data:
@@ -196,8 +217,7 @@ def aigc_label(image_path: Path) -> dict[str, str] | None:
import json
import re
with open(image_path, "rb") as f:
data = f.read(1024 * 1024)
data = scan_head(image_path)
match = re.search(rb"<TC260:AIGC>(.*?)</TC260:AIGC>", data, re.DOTALL)
if not match:
return None
@@ -219,8 +239,7 @@ def iptc_ai_system(image_path: Path) -> str | None:
extractable, otherwise the literal ``"fields present"``. Container-agnostic
raw-byte scan; handles both XMP element and attribute serializations.
"""
with open(image_path, "rb") as f:
data = f.read(1024 * 1024)
data = scan_head(image_path)
if not any(marker in data for marker in IPTC_AI_FIELD_MARKERS):
return None
match = re.search(rb"AISystemUsed[=:\s]*[\"'>]\s*([^<\"']{1,120})", data)
@@ -259,8 +278,7 @@ def synthid_source(image_path: Path) -> str | None:
# Non-PNG containers (JPEG APP11, WebP, AVIF/HEIF/JXL uuid box) keep the
# C2PA manifest where the PNG parser can't reach it. Binary-scan for the
# same signal: a C2PA manifest from a SynthID-using issuer on AI content.
with open(image_path, "rb") as f:
data = f.read(1024 * 1024)
data = scan_head(image_path)
has_c2pa = b"c2pa" in data.lower() or C2PA_UUID in data
# Matches both "trainedAlgorithmicMedia" and "compositeWithTrainedAlgorithmicMedia".
ai_source = b"trainedAlgorithmicMedia" in data or b"TrainedAlgorithmicMedia" in data
@@ -311,8 +329,7 @@ def exif_generator(image_path: Path) -> str | None:
# XMP CreatorTool: text, container-agnostic (covers HEIF/JXL via raw scan).
try:
with open(image_path, "rb") as f:
head = f.read(1024 * 1024)
head = scan_head(image_path)
for match in re.finditer(rb"CreatorTool[>\"'=\s]{1,4}([^<\"']{1,80})", head):
candidates.append(match.group(1).decode("latin1", "replace"))
except Exception as exc:
@@ -467,8 +484,7 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
if "synthid_watermark" not in result and (vendor := synthid_source(image_path)):
result.setdefault("synthid_watermark", synthid_verdict(vendor))
if "soft_binding" not in result:
with open(image_path, "rb") as f:
head = f.read(1024 * 1024)
head = scan_head(image_path)
if vendors := soft_binding_vendors_in(head):
result["soft_binding"] = ", ".join(vendors)
@@ -507,10 +523,18 @@ def _strip_with_ffmpeg(source_path: Path, output_path: Path) -> Path:
)
output_path.parent.mkdir(parents=True, exist_ok=True)
cmd = [
ffmpeg, "-y", "-loglevel", "error",
"-i", str(source_path),
"-map_metadata", "-1", "-map_chapters", "-1",
"-c", "copy",
ffmpeg,
"-y",
"-loglevel",
"error",
"-i",
str(source_path),
"-map_metadata",
"-1",
"-map_chapters",
"-1",
"-c",
"copy",
str(output_path),
]
result = subprocess.run(cmd, capture_output=True, text=True, check=False) # noqa: S603
+53
@@ -22,6 +22,7 @@ from typing import TYPE_CHECKING
if TYPE_CHECKING:
from collections.abc import Iterator
from pathlib import Path
from remove_ai_watermarks.metadata import (
AIGC_MARKERS,
@@ -78,6 +79,58 @@ def is_isobmff(data: bytes) -> bool:
return len(data) >= 8 and data[4:8] == b"ftyp"
def scan_c2pa_region(path: str | Path, *, max_total: int = 4 * 1024 * 1024) -> bytes:
"""Concatenated payloads of top-level ``uuid`` / ``jumb`` boxes in an ISOBMFF
file, found by seeking past other boxes (``mdat`` etc.) by size.
C2PA manifests and XMP packets (incl. AI labels) live in top-level ``uuid``
boxes; JPEG-XL uses ``jumb``. In a streaming / non-faststart MP4 the manifest
sits AFTER a multi-megabyte ``mdat``, so a fixed first-MB read misses it. This
walks box headers (8-16 bytes each) and seeks past payloads it does not need,
so it never loads ``mdat`` into memory and works on multi-GB files. Returns
the relevant box payloads (capped at ``max_total``), or ``b""`` for a
non-ISOBMFF file or on any read error.
"""
collected = bytearray()
try:
with open(path, "rb") as f:
sniff = f.read(8)
if len(sniff) < 8 or sniff[4:8] != b"ftyp":
return b""
f.seek(0, 2)
file_size = f.tell()
pos = 0
while pos + 8 <= file_size and len(collected) < max_total:
f.seek(pos)
header = f.read(8)
if len(header) < 8:
break
size32 = struct.unpack(">I", header[:4])[0]
box_type = header[4:8]
payload_off = pos + 8
if size32 == 1:
ext = f.read(8)
if len(ext) < 8:
break
size = struct.unpack(">Q", ext)[0]
payload_off = pos + 16
elif size32 == 0:
size = file_size - pos
else:
size = size32
if size < (payload_off - pos) or pos + size > file_size:
break
if box_type in C2PA_BOX_TYPES:
f.seek(payload_off)
to_read = min(pos + size - payload_off, max_total - len(collected))
if to_read > 0:
collected += f.read(to_read)
pos += size
except OSError:
return b""
return bytes(collected)
def strip_c2pa_boxes(data: bytes) -> tuple[bytes, int]:
"""Return ``(cleaned_bytes, stripped_count)`` with AI-provenance boxes removed.
+69 -3
@@ -631,6 +631,11 @@ _MP4_FTYP = b"\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42isom"
_MP4_MDAT = b"\x00\x00\x00\x10mdat" + b"videodat"
def _box(box_type: bytes, payload: bytes) -> bytes:
"""Build a 32-bit-size ISOBMFF box: [size:4][type:4][payload]."""
return (8 + len(payload)).to_bytes(4, "big") + box_type + payload
class TestVideoC2pa:
"""C2PA in MP4 (ISOBMFF) -- detect + strip, reusing the image box walker."""
@@ -654,6 +659,59 @@ class TestVideoC2pa:
assert has_ai_metadata(out) is False
class TestLateProvenanceBox:
"""A C2PA / provenance box placed AFTER a large mdat (streaming / non-faststart
MP4) must still be detected -- the fixed first-MB scan would miss it."""
def _mp4_late_c2pa(self, tmp_path: Path, gap: int = 1_500_000) -> Path:
from remove_ai_watermarks.metadata import C2PA_UUID
big_mdat = _box(b"mdat", b"\x00" * gap) # > 1 MB pushes the manifest past the scan window
manifest = C2PA_UUID + b"OpenAI jumbf c2pa ... trainedAlgorithmicMedia ..."
p = tmp_path / "stream.mp4"
p.write_bytes(_MP4_FTYP + big_mdat + _box(b"uuid", manifest))
return p
def test_scan_c2pa_region_finds_late_box(self, tmp_path: Path):
from remove_ai_watermarks.metadata import C2PA_UUID
from remove_ai_watermarks.noai.isobmff import scan_c2pa_region
region = scan_c2pa_region(self._mp4_late_c2pa(tmp_path))
assert C2PA_UUID in region
assert b"trainedAlgorithmicMedia" in region
def test_fixed_window_would_have_missed_it(self, tmp_path: Path):
# Documents the regression the box walk fixes: the manifest is beyond 1 MB.
from remove_ai_watermarks.metadata import C2PA_UUID
p = self._mp4_late_c2pa(tmp_path)
assert C2PA_UUID not in p.read_bytes()[: 1024 * 1024]
def test_scan_head_includes_late_box(self, tmp_path: Path):
from remove_ai_watermarks.metadata import C2PA_UUID, scan_head
assert C2PA_UUID in scan_head(self._mp4_late_c2pa(tmp_path))
def test_has_ai_metadata_detects_late_manifest(self, tmp_path: Path):
assert has_ai_metadata(self._mp4_late_c2pa(tmp_path)) is True
def test_scan_c2pa_region_non_isobmff_is_empty(self, tmp_path: Path):
from remove_ai_watermarks.noai.isobmff import scan_c2pa_region
p = tmp_path / "not.bin"
p.write_bytes(b"\x89PNG\r\n\x1a\n not an isobmff file")
assert scan_c2pa_region(p) == b""
def test_front_placed_manifest_still_detected(self, tmp_path: Path):
# Regression: a faststart MP4 (manifest before mdat) is unaffected.
from remove_ai_watermarks.metadata import C2PA_UUID
manifest = C2PA_UUID + b"OpenAI ... trainedAlgorithmicMedia ..."
p = tmp_path / "front.mp4"
p.write_bytes(_MP4_FTYP + _box(b"uuid", manifest) + _box(b"mdat", b"\x00" * 100))
assert has_ai_metadata(p) is True
class TestIsobmffMetadataRemoval:
"""Container-level AI-provenance stripping across ISOBMFF image/video/audio."""
@@ -718,9 +776,17 @@ class TestFfmpegMetadataStrip:
def _wav_with_tag(self, path: Path, tag: str = "Suno AI") -> None:
subprocess.run( # noqa: S603
[
shutil.which("ffmpeg"), "-y", "-loglevel", "error",
"-f", "lavfi", "-i", "sine=frequency=440:duration=0.1",
"-metadata", f"title={tag}", str(path),
shutil.which("ffmpeg"),
"-y",
"-loglevel",
"error",
"-f",
"lavfi",
"-i",
"sine=frequency=440:duration=0.1",
"-metadata",
f"title={tag}",
str(path),
],
check=True,
)
Generated
+1 -1
@@ -2865,7 +2865,7 @@ wheels = [
[[package]]
name = "remove-ai-watermarks"
version = "0.6.7"
version = "0.6.8"
source = { editable = "." }
dependencies = [
{ name = "click" },